We are currently considering CouchDB as a replacement for PostgreSQL in a usage monitoring application, which polls thousands of services every 5 minutes and records how much data each uploads and downloads.
We have a number of important metrics to calculate from this data. The most common requirement is determining how much data an individual service has used today; the second most common is how much an individual service has used in a month; and the final important one is the total usage of all services in a month. There are other reporting needs, but they are secondary to these.
The scale of the data is such that we add approximately 600,000 rows per day. To handle this relatively quickly, we use Postgres’ partitioning capabilities to partition the usage table by day.
In this manner, the primary use case is handled very quickly: because the data is partitioned by day, Postgres only has to read one partition and can quickly discard the rest, and with appropriate indices on the service id we can narrow down an entire day’s usage for a service in short order. However, it struggles when an entire month’s worth of data is required, as this involves reading 30 or so tables. It is also incredibly lethargic when querying many services simultaneously: once the number of services becomes large enough, Postgres ignores the service id index and falls back to a table scan. That kind of query has to be run outside of business hours, such is its capacity to slow things down.
In an effort to improve the second and third metrics, we are investigating alternative storage architectures and came across CouchDB, which looks promising.
The first major decision was document format. If, for each poll, we need to store service id, the timestamp, data in and data out, there are a number of ways we could structure our documents.
Ideally, we would have liked one document per service per month, as a coarse granularity seemed more manageable. The show stopper was that the document would be updated every 5 minutes, and because of CouchDB’s MVCC architecture, each update creates a new document revision. Updates also carry an unacceptable amount of overhead, as each document must be fetched to obtain its current _rev value before it can be written back. We don’t want to have to do a read every time we want to do a write when we have this many writes.
The other extreme was one document per service per poll. This would have resulted in 15M new documents each month, which was infeasible.
With some help from users at Stack Overflow and @couchdb, as well as some judicious testing, we settled on the middle ground: one document per poll, covering every service, with the per-service readings held in an array. This solved the problem of updating an existing document and is coarse enough that only 288 documents (12 polls an hour × 24 hours) are added per day.
This is the final document format we went for:
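In outline, it is shaped roughly like this (field names here are illustrative, and the sample values are made up):

```json
{
  "_id": "2010-06-01T10:05:00Z",
  "timestamp": "2010-06-01T10:05:00Z",
  "data": [
    { "service_id": "svc-0001", "data_in": 104857, "data_out": 20971 },
    { "service_id": "svc-0002", "data_in": 52428, "data_out": 10485 }
  ]
}
```

Note that the _id is one we assign ourselves (here, the poll timestamp) rather than a CouchDB-generated UUID.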
Where data is an array of per-service usage readings for that poll. For this format (and indeed for every format we tried), insert time is much slower than for Postgres, MySQL MyISAM, or MySQL InnoDB, even when applying the standard tricks of using bulk insert and using our own ids. Inserting a single document takes about 2 seconds, whereas for the others it is measured in milliseconds.
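For reference, the two insert tricks look something like the following sketch (function and field names are illustrative, not our production code):

```javascript
// Build a single _bulk_docs payload instead of issuing one insert per
// document, and supply our own _id so CouchDB need not generate UUIDs.
function buildBulkPayload(polls) {
  return {
    docs: polls.map(function (poll) {
      return {
        // Our own _id (the poll timestamp) keeps ids roughly sequential.
        _id: "poll-" + poll.timestamp,
        timestamp: poll.timestamp,
        data: poll.data // array of per-service usage readings
      };
    })
  };
}

// The whole batch then goes out in one request:
//   POST /usage/_bulk_docs  with body JSON.stringify(buildBulkPayload(polls))
```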
On the view side of things we have good news and bad. Once built, the views are lightning quick, to the point where calculating the usage for a single service for an entire month completes before I have time to blink. Of course, because CouchDB is quite limited in how it allows you to filter data (you can’t parameterise queries beyond the key a view emits, which from an SQL background I find quite limiting), it isn’t possible to query multiple services in the same request. However, querying them concurrently and aggregating the results is still incredibly fast.
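To illustrate the key-based filtering, a map/reduce pair in this style might look as follows. This is a sketch assuming the document format above: the functions are named here for readability, though in CouchDB they live as anonymous function strings in a design document, and the [service, year, month, day] key shape lets group_level roll results up by day or by month.

```javascript
// Map: one emit per service per poll, keyed for day/month rollups.
function map(doc) {
  var d = new Date(doc.timestamp);
  for (var i = 0; i < doc.data.length; i++) {
    var row = doc.data[i];
    // Querying with group_level=4 sums one day; group_level=3 sums a month.
    emit(
      [row.service_id, d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()],
      [row.data_in, row.data_out]
    );
  }
}

// Reduce: both raw values and rereduced partial sums share the
// [in, out] shape, so one summing loop handles both cases.
function reduce(keys, values, rereduce) {
  var sum = [0, 0];
  for (var i = 0; i < values.length; i++) {
    sum[0] += values[i][0];
    sum[1] += values[i][1];
  }
  return sum;
}
```

The only filtering available at query time is a key (or key range) against this emitted key, which is why per-service month queries are fast but arbitrary multi-service filters require separate concurrent requests.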
Building the views is what may turn out to be our main problem, though. We have 7 years’ worth of data sitting in Postgres at the moment, and a dummy build of a single view over 3 years of it took more than 10 days to complete! A rewrite of the view dropped this down to 8 days – at this volume, removing a single output variable saves gigabytes of space and days of time. You have some reasonable size of data when removing the quotes from around integers drops your monthly data by 10GB!

This kind of build time may prove to be an insurmountable risk to the business. CouchDB is supposed to be crash proof, but you can never be too sure, and if we are ever in the position of losing the view index, it will take weeks (literally) to rebuild it before we can make a single query. And this is for only one view. It hardly lends itself to agile development when you have to wait weeks before you can roll out a production view.

Disk size also becomes a consideration: the view appears to be 3 times larger than the raw data, and with 3 years of data taking up approaching half a terabyte, we will have to think about that as well. Disk space is reasonably cheap, but it’s not free, especially on a SAN, and as far as I can tell, you can’t (yet) shard CouchDB across multiple disks.
Some more investigation is required.
Lessons to be learned:
- Get your views right before you load up all your data.
- Bulk test the database with as much data as you can to find potential future problems.
- Use your own IDs – they really do make a significant difference.
- Always bulk load data, even if you only have a couple of documents.
- Use Hovercraft for native interaction with the data, rather than going through HTTP.
- Research whether view indexes can be partially used while they are being rebuilt.
- Investigate the PHP Erlang extension to see whether we can take advantage of better concurrency there.