Adventures in Engrish
Archive for March, 2010
Using Multiple Start and End Keys for CouchDB Views
Mar 24th
CouchDB view collation is great and only has one real drawback that has caused me any real pain – the inability to handle queries that need to be parameterised by more than one dimension. These are suprisingly common, including problems such as “find me posts in Category A in March”.
This can be handled with a function that emits keys like:
["Category A", "2010", "03", "Post 1"] ["Category B", "2010", "03", "Post 2"] ["Category A", "2010", "03", "Post 3"]
and then use:
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
However find its reciprocal “All March posts regardless of category” is problematic. You can’t do:
startkey=[*,"2010", "03"]&endkey=[*,"2010", "03",{}]
where * (or _, or nil, or pass) would represent “all”.
To handle this, there are currently only 2 options; design a new view with the the key components ordered differently, such that they emit:
["2010", "03", "Post 1"] ["2010", "03", "Post 2"] ["2010", "03", "Post 3"]
or, make multiple connections to the database like
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category C","2010","03"]&endkey=["Category C","2010", "03",{}]
where you have a query for each category.
Neither approach is particularly satisfactory. On a recent particular problem set, a single view would be many hundreds of gigabytes of data, and while space is cheap, it’s not that cheap. Additional views were not an option. That same data set contained around 2000 different categories (or their equivalent) and 2000 connections for a particular query seemed excessive.
Since 0.9, Couch has had a way of passing multiple keys to a query in the post body of a view request. Unfortunately, this only supported precise keys, not start-end key ranges. There has been a ticket in the issue tracker to add this additional support since October, but it’s classed as a minor priority and nothing had been done on it. So I decided to have a crack.
On the face of it, it seems like a fairly simple change, only affecting the HTTP View Erlang module. On the other hand, I’ve probably written about 100 lines of Erlang in my life and never looked at the CouchDB code before, so it’s entirely possible I’ve done something wrong. Regardless, the following is a simple solution that appears to work correctly.
The output_map_view and output_reduce_view functions already had the ability to handle start and end keys, but they were being artificially restricted to treat the supplied keys and both start and end. I used Erlang’s pattern matching to make this a little richer:
case Key of
{[{<<"startkey">>,StartKey},{<<"endkey">>,EndKey}]} ->
nil;
_ ->
StartKey = Key,
EndKey = Key
end
and then passing those new variables in the appropriate place. This seemed to work well. I presume that the Keys parameter is processed just like multiple connections, and then the results aggregated, because the results are exactly the same as a call with the same parameters in the query string.
One final change was that group_level=X is mysteriously disallowed for Multikey queries. I took a punt and removed this restriction and it all seemed to work fine. I can only guess that this restriction didn’t make sense when you had to pass precise keys.
I then query using the following as POST data:
{
"keys": [
{
"startkey": ["Category A","2010","03"],
"endkey": ["Category A","2010","03",{}]
},
{
"startkey": ["Category B","2010","03"],
"endkey": ["Category B","2010","03",{}]
}
]
}
With this solution, I’m able to query 2000 services simultaneously, group them at any level I like, and get back the results at the lightning speed I’ve become accustomed to.
One small caveat: If I want to get back keys across non-contiguous blocks like this:
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category A","2010","06"]&endkey=["Category A","2010", "06",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category B","2010","06"]&endkey=["Category B","2010", "06",{}]
To get all posts in Category A and B in March and June, I can. However, if I have a reduce function and group at level 1, I still end up with 4 rows, 2 for Category A, 2 for Category B. I think this is because the queries are being run independently, without reference to the other. To do a full aggregation across time periods (for example to get the total number of posts by category in March and June), I’d still need to do a client aggregation on the resulting data-set. This may or may not be a big problem for you; it’s certainly something I can live with.
The CouchDB issue lives here, and the patch to 0.10.1 lives here.
Handling JSON Objects in CouchDB Native Erlang Views
Mar 18th
I’ve been working with CouchDB a fair bit in recent weeks and am really enjoying it so far. Once I got my head around how to structure views and take advantage of view collation, I found it to be far more expressive than I first thought.
I still have a couple of gripes, the largest one of which is that you can’t use a wildcard parameter at the beginning of your view keys, so if you need to get “items by user by category” and “items by category by user”, you need two views. I’m sure there are good architectural reasons for this, but for me it’s the one place where collation lets me down. For at least one of the solutions I’m working on, multiple views are a major problem, as even one takes up 120GB (and counting).
But, to the main point. Native Erlang views are now possible, and if you can create them, potentially significantly faster than Javascript ones. There are a couple of gotchas though, not least for me the handling of JSON objects.
We start with a document like this:
{
"_id": "36kem",
"_rev": "1-c895d5a55945a9898880bf870a3b3025",
"type": "usage",
"timestamp": [
"2010",
"02",
"28",
"23",
"10"
],
"data": [
{
"t": "E000005861",
"i": "232920",
"o": "2365730"
},
{
"t": "E000006504",
"i": "15784",
"o": "17786"
},
{
"t": "E000006505",
"i": "16661",
"o": "17786"
}
]
}
In reality there are thousands of entries in the data array, but this will do. Our aim is to emit one key-value pair for each item in the “data” field of each document of type “usage”. In Javascript this is pretty trivial. Erlang however, proves more of a challenge.
Based on pointers from the CouchDB Wiki, I started with:
fun ({Doc}) ->
case proplists:get_value(<<"type">>, Doc) of
<<"usage">> ->
Emit(proplists:get_value(<<"_id">>, Doc), null);
_ ->
ok
end
end.
and was very happy to see that work. Two things to note here: Don’t forget the {} around the Doc in the function definition or you’ll get strange errors, and; to get the value of a field in a document, you can use the standard proplists:get_value(<<"fieldname">>, Doc) construct. So far so good.
The main issue for me came with manipulating the “data” field. I didn’t actually want to emit null, but instead the “i” and “o” parts of the data field. First off, I tried:
lists:foreach(fun(Item) -> Emit(null, [proplists:get_value(<<"i">>, Item), proplists:get_value(<<"o">>, Item)]) end, proplists:get_value(<<"data">>, Doc)
But met with some (very long) errors. (Gripe number two – they could really do with humanising the Erlang crash dump.)
It took me quite a few attempts, including stripping it right back to confirm that I had an array to iterate and that each object does in fact contain an “i” and an “o” field, before I found the problem, which is this:
Even though Documents are defined within {} braces, and JSON objects within that definition are also defined within {} braces, you cannot access them the same way in an Erlang view.
proplists:get_value(<<"field">>, Doc) is fine for the document as a whole, but you can’t access JSON objects the same way. Bad assumption on my part. Luckily, the answer I got to another Stack Overflow question recently pointed the way.
To access the data we need to pattern match the components using the Erlang representation of a JSON object, like so:
{[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row
Ugly, hey?
Useful though, as it extracts the TrackingID, In and Out values all in one go, kind of like a list() statement on steroids.
With that in place, and a little more tidying up of the code, we arrive at:
fun({Doc}) ->
case proplists:get_value(<<"type">>, Doc) of
<<"usage">> ->
[Year, Month, Day, Hour, Minute | _] = proplists:get_value(<<"timestamp">>, Doc),
lists:foreach(fun(Row) ->
{[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row,
Emit([TrackingID, Year, Month, Day, Hour, Minute],[In, Out])
end, proplists:get_value(<<"data">>, Doc));
_ ->
ok
end
end.
That little beauty lets me query the usage of a service at any granularity over data from the last 7 years in a faster time than the browser can render it. Across an HTTP connection to a data source 1000km away. On development hardware.
CSS Technique: Morning Sunset
Mar 10th
After coming across an excellent article on generating full page images, I had the idea that it would be cool to blend between multiple images as a user scrolled down the page. The somewhat artistic conceit that a user could scroll a similar scene from morning to sunset, ironically came to me at sunset on Saturday and was finished by morning!
The effect is pretty simple to set up, though does require a specific bit of markup. We start with 2 images and set a z-index on them, such that the first one is in front of the second. With the full page image CSS rules, the first image fills the screen and the second is completely obscured.
<img id="morning" class="bg" style="z-index: 2;" src="morning.jpg" /> <img id="sunset" class="bg" style="z-index: 1;" src="sunset.jpg" /
For the CSS, In addition to the standard full page image rules, I add an additional higher z-index, plus a little bit of transparency This allows the content to sit on top of both of these images, and for the effect to be more pronounced.
div#content {
/* This is the only important rule */
/* We need our content to show up on top of the background */
position: relative;
z-index: 10;
/* Added some opacity to demonstrate the effect better */
opacity: 0.8;
filter: alpha(opacity=80);
}
So far, so what? We’ve arranged for a user to download an image he can’t see – not so good. The magic comes with a blending function tied to the scrollbar. The idea is that the top layer becomes more and more transparent as the user scrolls through the content.
In working with the Zend Framework, I’ve been getting to grips with Dojo and its supporting classes, so I was happy to see that Dijit had tools for getting the dimensions of the viewport. With this information, I was able to calculate the scroll ratio, which gave me a number ranging from 0 at the top of the page to 1 at the bottom.
Dojo also has a great style() function, which allows you to set opacity and have it “just work”, across all browsers, regardless of their non-standard filter() shennanigans. At the outset, the top layer image has an opacity of 1. Subtracting the scroll ratio from this allows it to be fully opaque at the top of the page, and fully transparent at the bottom.
dojo.subscribe("/window/scrolled", function(e){
// Calculate the scroll percentage, and adjust the opacity of the top layer, appropriately.
var vp = dijit.getViewport();
dojo.style("morning", {
"opacity": 1 - (vp.t / (document.documentElement.scrollHeight - vp.h))
});
});
There is some rate-limiting code going on to prevent the event firing continuously and slowing down the page. That came from a helpful Dojo Cookie by Peter Higgins over at Dojo Campus. I haven’t played fully with the rate limiting yet, but 50ms seemed to give a reasonable balance of subtle movement without overloading the page.
Degradation is variable – without Javascript, the user just sees the top image and there’s some overhead of the second image that is never seen. Without CSS, there’s more of a problem, as the images are inlined in the page, which pushes all the content down. If there is a full page image solution using background images only, I’d love to hear about it.
There’s a working, self-describing example of the effect so you can try it out for yourself. It works best in Chrome, with its superior Javascript handling, but works in all modern browsers to a reasonable degree.
The next step is to generalise the code, so that I can pass it an array of image URLs and have it automatically build the markup necessary to generate the effect. With more work on the blending function, I’ll be able to have it blend between multiple layers and potentially follow different rates – I might investigate Dojo curves for that.
I’m not aware of this technique being described anywhere else, but if there are other approaches to doing this, I’d be interested to see them.