• 24 . 03 . 10
  • Describing how to patch CouchDB 0.10.1 to query views with multiple start and end key ranges in one request.

  • Tags

    , , ,

  • StumbleUpon

Using Multiple Start and End Keys for CouchDB Views

CouchDB view collation is great and only has one real drawback that has caused me any real pain – the inability to handle queries that need to be parameterised by more than one dimension. These are suprisingly common, including problems such as “find me posts in Category A in March”.

This can be handled with a function that emits keys like:

["Category A", "2010", "03", "Post 1"]
["Category B", "2010", "03", "Post 2"]
["Category A", "2010", "03", "Post 3"]

and then use:

startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]

However find its reciprocal “All March posts regardless of category” is problematic. You can’t do:

startkey=[*,"2010", "03"]&endkey=[*,"2010", "03",{}]

where * (or _, or nil, or pass) would represent “all”.

To handle this, there are currently only 2 options; design a new view with the the key components ordered differently, such that they emit:

["2010", "03", "Post 1"]
["2010", "03", "Post 2"]
["2010", "03", "Post 3"]

or, make multiple connections to the database like

startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category C","2010","03"]&endkey=["Category C","2010", "03",{}]

where you have a query for each category.

Neither approach is particularly satisfactory. On a recent particular problem set, a single view would be many hundreds of gigabytes of data, and while space is cheap, it’s not that cheap. Additional views were not an option. That same data set contained around 2000 different categories (or their equivalent) and 2000 connections for a particular query seemed excessive.

Since 0.9, Couch has had a way of passing multiple keys to a query in the post body of a view request. Unfortunately, this only supported precise keys, not start-end key ranges. There has been a ticket in the issue tracker to add this additional support since October, but it’s classed as a minor priority and nothing had been done on it. So I decided to have a crack.

On the face of it, it seems like a fairly simple change, only affecting the HTTP View Erlang module. On the other hand, I’ve probably written about 100 lines of Erlang in my life and never looked at the CouchDB code before, so it’s entirely possible I’ve done something wrong. Regardless, the following is a simple solution that appears to work correctly.

The output_map_view and output_reduce_view functions already had the ability to handle start and end keys, but they were being artificially restricted to treat the supplied keys and both start and end. I used Erlang’s pattern matching to make this a little richer:

case Key of
	{[{<<"startkey">>,StartKey},{<<"endkey">>,EndKey}]} ->
	_ ->
		StartKey = Key,
		EndKey = Key

and then passing those new variables in the appropriate place. This seemed to work well. I presume that the Keys parameter is processed just like multiple connections, and then the results aggregated, because the results are exactly the same as a call with the same parameters in the query string.

One final change was that group_level=X is mysteriously disallowed for Multikey queries. I took a punt and removed this restriction and it all seemed to work fine. I can only guess that this restriction didn’t make sense when you had to pass precise keys.

I then query using the following as POST data:

    "keys": [
            "startkey": ["Category A","2010","03"],
            "endkey": ["Category A","2010","03",{}]
            "startkey": ["Category B","2010","03"],
            "endkey": ["Category B","2010","03",{}]

With this solution, I’m able to query 2000 services simultaneously, group them at any level I like, and get back the results at the lightning speed I’ve become accustomed to.

One small caveat: If I want to get back keys across non-contiguous blocks like this:

startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category A","2010","06"]&endkey=["Category A","2010", "06",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category B","2010","06"]&endkey=["Category B","2010", "06",{}]

To get all posts in Category A and B in March and June, I can. However, if I have a reduce function and group at level 1, I still end up with 4 rows, 2 for Category A, 2 for Category B. I think this is because the queries are being run independently, without reference to the other. To do a full aggregation across time periods (for example to get the total number of posts by category in March and June), I’d still need to do a client aggregation on the resulting data-set. This may or may not be a big problem for you; it’s certainly something I can live with.

The CouchDB issue lives here, and the patch to 0.10.1 lives here.