CouchDB – Stoat – Where?

Giving Back: My 2011 Manifesto

Jamie Talbot — Sat, 08 Jan 2011 23:14:53 +0000

New Year’s resolutions are made to be broken, but as this is a week after the fact, perhaps I have a better chance.

The open source world has enriched my life significantly, both professionally, in providing the databases, languages, operating systems and IDEs that are my livelihood, and personally, in providing communications tools, entertainment and even this blogging platform. Over the course of my career, I’ve tried to give back to these communities, both in time and code submissions, and I’ve contributed in some small way to a number of projects. The challenging nature of my recent full-time employment has unfortunately meant that more recently I’ve only been able to do this in limited ways.

With my impending hiatus and upcoming travels, I hope to have more time to dedicate to the various online communities I follow. This might be through forums such as Stack Overflow, where I’ve received answers to numerous difficult questions, or simply helping people in chatrooms. It might be answering questions on Quora. It will certainly involve code contributions, though to which projects I’m not sure. No deed is entirely selfless, and regular development will ensure I stay sharp and widen my employment opportunities, but primarily this is about passion for development, which I’m certain I couldn’t live without for any length of time. I’m very excited about CouchDB, which would also develop my capabilities in Erlang. I value Zend Framework very highly, which would deepen my expertise in PHP. Perhaps it will be both, or others instead. But as the new year begins, I will make this my manifesto – that I will give back to those people and projects that have made my life, and many others’ lives, better.

My job title from January 22nd – Open Source Contributor.

Using Multiple Start and End Keys for CouchDB Views

Jamie Talbot — Wed, 24 Mar 2010 00:55:57 +0000

CouchDB view collation is great and has only one drawback that has caused me real pain – the inability to handle queries that need to be parameterised by more than one dimension. These are surprisingly common, including problems such as “find me posts in Category A in March”.

This can be handled with a function that emits keys like:

[javascript]
["Category A", "2010", "03", "Post 1"]
["Category B", "2010", "03", "Post 2"]
["Category A", "2010", "03", "Post 3"]
[/javascript]

and then use:

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
[/javascript]
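As a rough sketch, the range query above can be generated from any key prefix. The helper name `prefixRangeQuery` is made up for illustration; the only CouchDB behaviour it relies on is that an empty object collates after strings, numbers and arrays, which is what makes `{}` usable as an upper bound.

```javascript
// Hypothetical helper (not part of CouchDB): builds the startkey/endkey
// query string that selects every view row whose key begins with `prefix`.
function prefixRangeQuery(prefix) {
  const startkey = JSON.stringify(prefix);
  // {} sorts after strings, numbers and arrays in view collation,
  // so appending it gives an upper bound for the whole prefix.
  const endkey = JSON.stringify(prefix.concat([{}]));
  return "startkey=" + encodeURIComponent(startkey) +
         "&endkey=" + encodeURIComponent(endkey);
}

const marchCategoryA = prefixRangeQuery(["Category A", "2010", "03"]);
console.log(decodeURIComponent(marchCategoryA));
```

The values are URL-encoded because the JSON contains quotes, brackets and spaces; decoding the result gives back the same query shown above.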

However, finding its reciprocal, “All March posts regardless of category”, is problematic. You can’t do:

[javascript]
startkey=[*,"2010", "03"]&endkey=[*,"2010", "03",{}]
[/javascript]

where * (or _, or nil, or pass) would represent “all”.

To handle this, there are currently only two options: design a new view with the key components ordered differently, such that it emits:

[javascript]
["2010", "03", "Post 1"]
["2010", "03", "Post 2"]
["2010", "03", "Post 3"]
[/javascript]

or, make multiple connections to the database like

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category C","2010","03"]&endkey=["Category C","2010", "03",{}]
[/javascript]

where you have a query for each category.

Neither approach is particularly satisfactory. On a recent particular problem set, a single view would be many hundreds of gigabytes of data, and while space is cheap, it’s not that cheap. Additional views were not an option. That same data set contained around 2000 different categories (or their equivalent) and 2000 connections for a particular query seemed excessive.

Since 0.9, Couch has had a way of passing multiple keys to a query in the post body of a view request. Unfortunately, this only supported precise keys, not start-end key ranges. There has been a ticket in the issue tracker to add this additional support since October, but it’s classed as a minor priority and nothing had been done on it. So I decided to have a crack.

On the face of it, it seems like a fairly simple change, only affecting the HTTP View Erlang module. On the other hand, I’ve probably written about 100 lines of Erlang in my life and never looked at the CouchDB code before, so it’s entirely possible I’ve done something wrong. Regardless, the following is a simple solution that appears to work correctly.

The output_map_view and output_reduce_view functions already had the ability to handle start and end keys, but they were being artificially restricted to treat each supplied key as both start and end. I used Erlang’s pattern matching to make this a little richer:

[erlang]
%% Accept either a {startkey, endkey} range object or a single exact key.
case Key of
    {[{<<"startkey">>, StartKey}, {<<"endkey">>, EndKey}]} ->
        nil;
    _ ->
        StartKey = Key,
        EndKey = Key
end
[/erlang]

and then passed those new variables in the appropriate place. This seemed to work well. I presume that the Keys parameter is processed just like multiple connections, and then the results aggregated, because the results are exactly the same as a call with the same parameters in the query string.

One final change was that group_level=X is mysteriously disallowed for Multikey queries. I took a punt and removed this restriction and it all seemed to work fine. I can only guess that this restriction didn’t make sense when you had to pass precise keys.

I then query using the following as POST data:

[javascript]
{
"keys": [
{
"startkey": ["Category A","2010","03"],
"endkey": ["Category A","2010","03",{}]
},
{
"startkey": ["Category B","2010","03"],
"endkey": ["Category B","2010","03",{}]
}
]
}
[/javascript]
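With around 2000 categories, a body like this would normally be generated rather than typed. The following is a minimal sketch with a made-up helper name, `multiRangeBody`. It assumes a CouchDB build carrying the patch described here, since stock CouchDB expects exact keys in `keys`. It writes `startkey` before `endkey` in each object because the Erlang pattern shown earlier appears to match the two fields in that order.

```javascript
// Hypothetical helper: one {startkey, endkey} range per category for a
// given year and month. Only a patched CouchDB understands this shape.
function multiRangeBody(categories, year, month) {
  return {
    keys: categories.map(function (category) {
      return {
        // Property order is kept by JSON.stringify: startkey, then endkey.
        startkey: [category, year, month],
        endkey: [category, year, month, {}]
      };
    })
  };
}

const postBody = multiRangeBody(["Category A", "Category B"], "2010", "03");
console.log(JSON.stringify(postBody, null, 2));
```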

With this solution, I’m able to query 2000 services simultaneously, group them at any level I like, and get back the results at the lightning speed I’ve become accustomed to.

One small caveat: If I want to get back keys across non-contiguous blocks like this:

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category A","2010","06"]&endkey=["Category A","2010", "06",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category B","2010","06"]&endkey=["Category B","2010", "06",{}]
[/javascript]

To get all posts in Categories A and B in March and June, I can. However, if I have a reduce function and group at level 1, I still end up with 4 rows: 2 for Category A and 2 for Category B. I think this is because the queries are being run independently, without reference to each other. To do a full aggregation across time periods (for example to get the total number of posts by category in March and June), I’d still need to do a client aggregation on the resulting data-set. This may or may not be a big problem for you; it’s certainly something I can live with.
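The client aggregation mentioned above can be quite small. This is a sketch only: the function name `mergeByKey` and the post counts are invented for illustration, and it assumes each row's value is a plain number such as a count.

```javascript
// Hypothetical client-side step: merge rows that share the same grouped
// key by summing their values. Assumes numeric reduce values.
function mergeByKey(rows) {
  const totals = new Map();
  for (const row of rows) {
    const id = JSON.stringify(row.key); // arrays need a string form to be used as Map keys
    totals.set(id, (totals.get(id) || 0) + row.value);
  }
  return Array.from(totals, function (entry) {
    return { key: JSON.parse(entry[0]), value: entry[1] };
  });
}

// Four rows as returned at group level 1: one per category per range.
// The counts are made up.
const groupedRows = [
  { key: ["Category A"], value: 2 },
  { key: ["Category A"], value: 5 },
  { key: ["Category B"], value: 1 },
  { key: ["Category B"], value: 4 }
];
const mergedRows = mergeByKey(groupedRows);
console.log(mergedRows);
```

If the reduce returned something other than a number, the addition inside the loop would need to be replaced with whatever combines two partial results.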

The CouchDB issue lives here, and the patch to 0.10.1 lives here.

Handling JSON Objects in CouchDB Native Erlang Views

Jamie Talbot — Thu, 18 Mar 2010 05:41:02 +0000

I’ve been working with CouchDB a fair bit in recent weeks and am really enjoying it so far. Once I got my head around how to structure views and take advantage of view collation, I found it to be far more expressive than I first thought.

I still have a couple of gripes, the largest one of which is that you can’t use a wildcard parameter at the beginning of your view keys, so if you need to get “items by user by category” and “items by category by user”, you need two views. I’m sure there are good architectural reasons for this, but for me it’s the one place where collation lets me down. For at least one of the solutions I’m working on, multiple views are a major problem, as even one takes up 120GB (and counting).

But, to the main point. Native Erlang views are now possible and, if you can create them, potentially significantly faster than Javascript ones. There are a couple of gotchas though, not least for me the handling of JSON objects.

We start with a document like this:

[javascript]
{
"_id": "36kem",
"_rev": "1-c895d5a55945a9898880bf870a3b3025",
"type": "usage",
"timestamp": [
"2010",
"02",
"28",
"23",
"10"
],
"data": [
{
"t": "E000005861",
"i": "232920",
"o": "2365730"
},
{
"t": "E000006504",
"i": "15784",
"o": "17786"
},
{
"t": "E000006505",
"i": "16661",
"o": "17786"
}
]
}
[/javascript]

In reality there are thousands of entries in the data array, but this will do. Our aim is to emit one key-value pair for each item in the “data” field of each document of type “usage”. In Javascript this is pretty trivial. Erlang, however, proves more of a challenge.
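For comparison, a Javascript map function for this document might look something like the sketch below. It is not taken from a real design document; the key and value layout is one plausible choice. CouchDB provides `emit()` inside its view server, so it is stubbed here only to make the example run on its own.

```javascript
// Stub standing in for the emit() that CouchDB's view server provides.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Sketch of the map function: one row per entry in "data".
function map(doc) {
  if (doc.type === "usage") {
    doc.data.forEach(function (row) {
      // Key: tracking id followed by the timestamp parts. Value: [in, out].
      emit([row.t].concat(doc.timestamp), [row.i, row.o]);
    });
  }
}

// A shortened copy of the example document above.
map({
  _id: "36kem",
  type: "usage",
  timestamp: ["2010", "02", "28", "23", "10"],
  data: [
    { t: "E000005861", i: "232920", o: "2365730" },
    { t: "E000006504", i: "15784", o: "17786" }
  ]
});
console.log(emitted);
```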

Based on pointers from the CouchDB Wiki, I started with:

[erlang]
fun ({Doc}) ->
case proplists:get_value(<<"type">>, Doc) of
<<"usage">> ->
Emit(proplists:get_value(<<"_id">>, Doc), null);
_ ->
ok
end
end.
[/erlang]

and was very happy to see that work. Two things to note here: don’t forget the {} around the Doc in the function definition or you’ll get strange errors; and to get the value of a field in a document, you can use the standard proplists:get_value(<<"fieldname">>, Doc) construct. So far so good.

The main issue for me came with manipulating the “data” field. I didn’t actually want to emit null, but instead the “i” and “o” parts of the data field. First off, I tried:

[erlang]
lists:foreach(fun(Item) -> Emit(null, [proplists:get_value(<<"i">>, Item), proplists:get_value(<<"o">>, Item)]) end, proplists:get_value(<<"data">>, Doc))
[/erlang]

But met with some (very long) errors. (Gripe number two – they could really do with humanising the Erlang crash dump.)

It took me quite a few attempts, including stripping it right back to confirm that I had an array to iterate and that each object does in fact contain an “i” and an “o” field, before I found the problem, which is this:

Even though Documents are defined within {} braces, and JSON objects within that definition are also defined within {} braces, you cannot access them the same way in an Erlang view.

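proplists:get_value(<<"field">>, Doc) is fine for the document as a whole, because the {Doc} pattern in the function head has already unwrapped the outer tuple and left a plain property list. Each nested JSON object, however, still arrives wrapped in its own {} tuple, so you can’t pass it to proplists:get_value directly. Bad assumption on my part. Luckily, the answer I got to another Stack Overflow question recently pointed the way.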

To access the data we need to pattern match the components using the Erlang representation of a JSON object, like so:

[erlang]
{[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row
[/erlang]

Ugly, hey? Useful though, as it extracts the TrackingID, In and Out values all in one go, kind of like a list() statement on steroids.

With that in place, and a little more tidying up of the code, we arrive at:

[erlang]
fun({Doc}) ->
case proplists:get_value(<<"type">>, Doc) of
<<"usage">> ->
[Year, Month, Day, Hour, Minute | _] = proplists:get_value(<<"timestamp">>, Doc),
lists:foreach(fun(Row) ->
{[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row,
Emit([TrackingID, Year, Month, Day, Hour, Minute],[In, Out])
end, proplists:get_value(<<"data">>, Doc));
_ ->
ok
end
end.
[/erlang]

That little beauty lets me query the usage of a service at any granularity over data from the last 7 years in a faster time than the browser can render it. Across an HTTP connection to a data source 1000km away. On development hardware.

CouchDB For A Real-Time Monitoring System

Jamie Talbot — Tue, 16 Feb 2010 10:08:05 +0000

We are currently considering CouchDB as a replacement for PostgreSQL in a usage monitoring application, which polls thousands of services every 5 minutes and stores data about the amount of data they upload and download.

We have a number of important metrics that we need to calculate based on this data, with the most common requirement being to determine how much data an individual service has transferred today. The second most common is determining how much data an individual service has transferred in a month, and the final important requirement is to determine the sum of usage of all services in a month. There are other reporting needs, but they are secondary to these.

The scale of the data is such that we add approximately 600,000 rows per day and in order to handle this relatively quickly, we use Postgres’ partitioning capabilities to partition the usage table based on the day.
In this manner, the primary use case is handled very quickly. Because the data is partitioned by day, Postgres only has to read one table and can quickly discard the rest. With appropriate indices on the service id, we can narrow down an entire day’s usage quickly. However, it struggles when an entire month’s worth of data is required as this involves reading 30 or so tables. It is also incredibly lethargic for querying multiple services simultaneously, as when the number of services becomes great enough, Postgres ignores the service id index and reverts to a table scan. This kind of query has to be run outside of business hours, such is its capacity to slow things down.

In an effort to improve the second and third metrics, we are investigating alternative storage architectures and came across CouchDB, which looks promising.

The first major decision was document format. If, for each poll, we need to store service id, the timestamp, data in and data out, there are a number of ways we could structure our documents.

Ideally, we would have liked to use one document per service per month, as a coarse granularity would seem more manageable. However, the show stopper was that the document would be updated every 5 minutes, and because of CouchDB’s MVCC architecture, a new document revision would be created each time. This also caused an unacceptable amount of overhead as each document had to be retrieved to get its _rev value before it could be updated. We don’t want to have to do a read every time we want to do a write when we have this many writes.

The other extreme was one document per service per poll. This would have resulted in 15M new documents each month, which was infeasible.

With some help from users at Stack Overflow and @couchdb, as well as some judicious testing, we decided on a format that stored one document for each poll. This solved the problem of having to update an existing document and was coarse enough that only 288 documents would be added per day.

This is the final document format we went for:

[javascript]
{
_id: "usg-123456",
ts: ["2009","12","01","12","00"],
data: [{
s: "XXXXXX",
i: 1413,
o: 345323
},{
s: "YYYYYY",
i: 2203,
o: 118213
}]
}
[/javascript]

Where data is an array of polled service data usage. For this (and indeed for all formats we tried), insert time is much slower than for Postgres, MySQL MyISAM and MySQL InnoDB, even when doing the standard tricks of using bulk insert and using our own ids. Inserting a single document takes about 2 seconds, whereas for the others it is measured in milliseconds.

On the view side of things we have good and bad news. Once built, the views are lightning quick, to the point where calculating the usage for a single connection for an entire month is completed before I have time to blink. Of course, because CouchDB is quite limited in how it allows you to filter data (you can’t parameterise queries beyond the key you are returning, which from an SQL background I find quite limiting), it isn’t possible to query multiple services in the same request. However, querying them concurrently and aggregating the results is still incredibly fast.
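The view itself is not shown here, so the following is only a guess at its general shape for the document format above: a map that emits one row per service per poll, and a reduce that sums the [in, out] pairs. `emit()` is stubbed so the sketch runs outside CouchDB, and the second document's figures are made up.

```javascript
// Stub standing in for the emit() that CouchDB's view server provides.
const usageRows = [];
function emit(key, value) {
  usageRows.push({ key: key, value: value });
}

// Hypothetical map: key is [service, year, month, day, hour, minute].
function usageMap(doc) {
  if (doc.ts && doc.data) {
    doc.data.forEach(function (entry) {
      emit([entry.s].concat(doc.ts), [entry.i, entry.o]);
    });
  }
}

// Hypothetical reduce: sums [in, out] pairs. Partial results have the same
// shape as map values, so the same code serves the rereduce pass.
function usageReduce(keys, values, rereduce) {
  return values.reduce(function (total, v) {
    return [total[0] + v[0], total[1] + v[1]];
  }, [0, 0]);
}

const polls = [
  { _id: "usg-123456", ts: ["2009", "12", "01", "12", "00"],
    data: [{ s: "XXXXXX", i: 1413, o: 345323 }, { s: "YYYYYY", i: 2203, o: 118213 }] },
  { _id: "usg-123457", ts: ["2009", "12", "01", "12", "05"],
    data: [{ s: "XXXXXX", i: 100, o: 200 }] }
];
polls.forEach(usageMap);

// Emulates a range query over ["XXXXXX","2009","12"] .. ["XXXXXX","2009","12",{}].
const decemberValues = usageRows
  .filter(function (row) {
    return row.key[0] === "XXXXXX" && row.key[1] === "2009" && row.key[2] === "12";
  })
  .map(function (row) { return row.value; });
const monthlyTotal = usageReduce(null, decemberValues, false);
console.log(monthlyTotal);
```

Putting the service first in the key keeps "one service for one month" a single contiguous range, which suits the two most common queries, at the cost of needing one range per service for the all-services total.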

Building the views is what may turn out to be our main problem though. We have 7 years’ worth of data sitting in Postgres at the moment. A dummy build of a single view for 3 years of data took more than 10 days to complete! A rewrite of the view dropped this down to 8 days – at this volume, removing a single output variable manages to save GB and days of time. You have some reasonable size of data when removing the quotes from around integers drops your monthly data by 10GB! However, this kind of build time may prove to be an insurmountable risk to the business. CouchDB is supposed to be crash proof, but you can never be too sure. If we are ever in the position of losing the view index, it will take weeks (literally) to rebuild the index so that we can make a single query. And this is for only one view. It hardly lends itself to agile development when you have to wait weeks before you can roll out a production view. Disk size also becomes a consideration, with the view appearing to be 3 times larger than the raw data, and with 3 years of data taking up approaching half a terabyte, we will have to think about that as well. Disk space is reasonably cheap, but it’s not free, especially on a SAN, and as far as I can tell, you can’t (yet) shard CouchDB across multiple disks.

Some more investigation is required.

Lessons to be learned:

Get your views right before you load up all your data.
Bulk test the database with as much data as you can to find potential future problems.
Use your own IDs – they really do make a significant difference.
Always bulk load data, even if you only have a couple of documents.
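To illustrate the last two points together, here is a minimal sketch of a request body for CouchDB's `_bulk_docs` endpoint, which takes a `docs` array. The helper name `bulkDocsBody` and the sequential numbering are assumptions; the "usg-" prefix simply follows the document format shown earlier.

```javascript
// Hypothetical helper: wraps polls in a _bulk_docs body and assigns our
// own _id to each document instead of letting the server generate one.
function bulkDocsBody(polls, firstId) {
  return {
    docs: polls.map(function (poll, index) {
      return { _id: "usg-" + (firstId + index), ts: poll.ts, data: poll.data };
    })
  };
}

const bulkBody = bulkDocsBody([
  { ts: ["2009", "12", "01", "12", "00"], data: [{ s: "XXXXXX", i: 1413, o: 345323 }] },
  { ts: ["2009", "12", "01", "12", "05"], data: [{ s: "XXXXXX", i: 1520, o: 350001 }] }
], 123456);

// This JSON would be sent as the body of POST /<database>/_bulk_docs.
console.log(JSON.stringify(bulkBody));
```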

Further Research:

Hovercraft, for native interaction with the data, rather than through HTTP.
Research on indexes to see if they can partially be used while being rebuilt.
Investigating the PHP Erlang extension to see if we can take advantage of better concurrency there.