- 24 . 03 . 10
Describing how to patch CouchDB 0.10.1 to query views with multiple start and end key ranges in one request.
- StumbleUpon
- TwitterTweet
- Facebook
Using Multiple Start and End Keys for CouchDB Views
CouchDB view collation is great and only has one real drawback that has caused me any real pain – the inability to handle queries that need to be parameterised by more than one dimension. These are suprisingly common, including problems such as “find me posts in Category A in March”.
This can be handled with a function that emits keys like:
[javascript]
["Category A", "2010", "03", "Post 1"]
["Category B", "2010", "03", "Post 2"]
["Category A", "2010", "03", "Post 3"]
[/javascript]
and then use:
[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
[/javascript]
However find its reciprocal “All March posts regardless of category” is problematic. You can’t do:
[javascript]
startkey=[*,"2010", "03"]&endkey=[*,"2010", "03",{}]
[/javascript]
where * (or _, or nil, or pass) would represent “all”.
To handle this, there are currently only 2 options; design a new view with the the key components ordered differently, such that they emit:
[javascript]
["2010", "03", "Post 1"]
["2010", "03", "Post 2"]
["2010", "03", "Post 3"]
[/javascript]
or, make multiple connections to the database like
[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category C","2010","03"]&endkey=["Category C","2010", "03",{}]
[/javascript]
where you have a query for each category.
Neither approach is particularly satisfactory. On a recent particular problem set, a single view would be many hundreds of gigabytes of data, and while space is cheap, it’s not that cheap. Additional views were not an option. That same data set contained around 2000 different categories (or their equivalent) and 2000 connections for a particular query seemed excessive.
Since 0.9, Couch has had a way of passing multiple keys to a query in the post body of a view request. Unfortunately, this only supported precise keys, not start-end key ranges. There has been a ticket in the issue tracker to add this additional support since October, but it’s classed as a minor priority and nothing had been done on it. So I decided to have a crack.
On the face of it, it seems like a fairly simple change, only affecting the HTTP View Erlang module. On the other hand, I’ve probably written about 100 lines of Erlang in my life and never looked at the CouchDB code before, so it’s entirely possible I’ve done something wrong. Regardless, the following is a simple solution that appears to work correctly.
The output_map_view and output_reduce_view functions already had the ability to handle start and end keys, but they were being artificially restricted to treat the supplied keys and both start and end. I used Erlang’s pattern matching to make this a little richer:
[erlang]
case Key of
{[{<<"startkey">>,StartKey},{<<"endkey">>,EndKey}]} ->
nil;
_ ->
StartKey = Key,
EndKey = Key
end
[/erlang]
and then passing those new variables in the appropriate place. This seemed to work well. I presume that the Keys parameter is processed just like multiple connections, and then the results aggregated, because the results are exactly the same as a call with the same parameters in the query string.
One final change was that group_level=X is mysteriously disallowed for Multikey queries. I took a punt and removed this restriction and it all seemed to work fine. I can only guess that this restriction didn’t make sense when you had to pass precise keys.
I then query using the following as POST data:
[javascript]
{
"keys": [
{
"startkey": ["Category A","2010","03"],
"endkey": ["Category A","2010","03",{}]
},
{
"startkey": ["Category B","2010","03"],
"endkey": ["Category B","2010","03",{}]
}
]
}
[/javascript]
With this solution, I’m able to query 2000 services simultaneously, group them at any level I like, and get back the results at the lightning speed I’ve become accustomed to.
One small caveat: If I want to get back keys across non-contiguous blocks like this:
[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category A","2010","06"]&endkey=["Category A","2010", "06",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category B","2010","06"]&endkey=["Category B","2010", "06",{}]
[/javascript]
To get all posts in Category A and B in March and June, I can. However, if I have a reduce function and group at level 1, I still end up with 4 rows, 2 for Category A, 2 for Category B. I think this is because the queries are being run independently, without reference to the other. To do a full aggregation across time periods (for example to get the total number of posts by category in March and June), I’d still need to do a client aggregation on the resulting data-set. This may or may not be a big problem for you; it’s certainly something I can live with.
The CouchDB issue lives here, and the patch to 0.10.1 lives here.
Followrama: A 1ª rede de Followers 2.0 do Brasil! A única que busca Followers automaticamente! http://tinyurl.com/yf3ud8f #Followrama 14
This comment was originally posted on Twitter
Stoat – Where?: Using Multiple Start and End Keys for CouchDB Views http://bit.ly/dhf1AX
This comment was originally posted on Twitter
how do you make those code segments look so cool? Is this a wordpress theme? I would love to know how to format my code segments on my blog like this!
hi,
i am new to couchdb but shouldn't
startkey=[“Category A”,”2010″,”03″]&endkey=[“Category Z”,”2010″, “03”,{}]
work for your example?
Hey forgemo,
Unfortunately that doesn't work. Couch evaluates the key matching from left to right, stopping when the first field matches. So your example would return everything in “Category C”, because “Category C” is between “Category A” and “Category Z”. It wouldn't even check the Year/Month/Day options because it will already have been found to match. Does that make sense?
Feck, I wish this feature were available.
Yeah, I know what you mean. Unfortunately it has been pushed back a few
times because developers can’t come to a consensus. The patch doesn’t apply
to current trunk, but the principle is sound. There is an alternative patch
up there from another contributor too, which works just as well. I used
this in a fairly complex prototype system and found no issues, beyond having
to aggregate results at the application layer.
Best,
Can you query your first example with this patch?
startkey=[*,”2010″, “03”]&endkey=[*,”2010″, “03”,{}]
You second example takes a different route. Or did I miss something?
Hi,
You’re correct – the first way still isn’t possible, which is problematic. However, in this instance, I had a finite, reasonable sized list of ‘top-level’ collation items (about 2000, growing slowly), so was enable to enumerate them explicitly in the POST body. Not ideally scalable of course, but it suited my needs at the time, and was still blindingly fast despite the larger POST body (was all on a local network in our case, so no worries there either).
I’m not sure if the asterisk method is even possible, as I haven’t grokked the exact internal format of the view index. That’s something the developers would have to answer, but my guess is that is would require some fundamental changes.
Cheers,
Jamie.
[…] also like to provide a lot of thanks to Jamie Talbot for this blog, which finally made me realize that I couldn’t do what I wanted. Nothing else I came […]
Hi, Thanks for your wonderful post. I am facing the exact same problem as you!
Could you highlight the steps to add the patch to couchdb? I’m not an Erlang developer and I have no idea how CouchDB patching works.