Erlang – Stoat – Where?

An Erlang Matrix Module

Jamie Talbot — Tue, 18 Jan 2011 22:27:19 +0000

As part of a recent investigation into counting Hamiltonian Paths in undirected graphs, I began researching the work of Eric Bax, whose paper on Inclusion Exclusion promised to yield a quick solution. Although I ended up going in a different direction, his reliance on adjacency matrices that would be processed 2ⁿ times required a matrix module that was as efficient as possible. Writing the solution in Erlang, I found a surprising lack of matrix code on the web, with the most developed I could find being here.

Erlang perhaps isn’t the best language for dealing with matrices, given its ‘one-time’ mathematical approach to assignment. However, the existing methods all seemed to rely on tuple_to_list() and list_to_tuple() conversions and excessive list copying. Below I present a simple module which is at least more efficient than the above; it relies on a list of lists, using lists:nth() to retrieve elements. Perhaps in the future this will be superseded by new array syntax, but for now it suits my needs.

Matrices are generally stable, by which I mean unchanging in dimensions. Of course matrices can be different sizes, but the majority of matrix operations that return a matrix result return one the same size as at least one of the inputs. Given that we know the matrix dimensions, in theory this makes tuples a suitable choice for implementation. In practice the immutability of tuples makes them unwieldy: unlike a list, a tuple cannot be grown one element at a time by consing, and changing a single element copies the whole tuple, so building them involves a lot of overhead. And with Erlang’s single-assignment binding there will be lots of rebuilding.

The main approach is to accept that we will be rebuilding the matrix in full each time and generalise to a standard matrix building function that takes matrix size and cell content generator function parameters:

[erlang]
new(Columns, Rows, ContentGenerator) ->
    [
        [ContentGenerator(Column, Row, Columns, Rows)
            || Column <- lists:seq(1, Columns)
        ]
        || Row <- lists:seq(1, Rows)
    ].
[/erlang]
which reads roughly “for each row, for each column on that row, call the content generation function for that cell”.

Each content generator function has the following spec:
[erlang]
fun((pos_integer(), pos_integer(), pos_integer(), pos_integer()) -> any())
[/erlang]

where the four parameters are current column, current row, number of columns, number of rows. A simple identity function is as follows:
[erlang]
fun(Column, Row, _, _) ->
    case Column of Row -> 1; _ -> 0 end
end
[/erlang]
which will generate the identity matrix:
[erlang]
[[1,0,0],
 [0,1,0],
 [0,0,1]]
[/erlang]
A sequential matrix might be generated with the following function:
[erlang]
fun(Column, Row, Columns, _) ->
    Columns * (Row - 1) + Column
end
[/erlang]
giving:
[erlang]
[[1,2,3],
 [4,5,6],
 [7,8,9]]
[/erlang]
while a matrix composed of pseudo-random numbers 1 to MaxValue could be:
[erlang]
fun(_, _, _, _) ->
    random:uniform(MaxValue)
end
[/erlang]
where MaxValue is bound in a closure, giving (for example):
[erlang]
[[3,9,6],
 [1,1,4],
 [6,2,1]]
[/erlang]
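To try the generators above in a shell, something like the following self-contained sketch should do. The module name matrix_sketch and the helpers identity/1, sequential/2 and random/3 are made up for this illustration and are not necessarily the names used in the published module. It also assumes rand:uniform/1 in place of random:uniform/1, since the random module is deprecated in newer OTP releases.

```erlang
-module(matrix_sketch).
-export([new/3, identity/1, sequential/2, random/3]).

%% Build a matrix as a list of rows, asking the generator for the
%% content of each cell in turn.
new(Columns, Rows, ContentGenerator) ->
    [[ContentGenerator(Column, Row, Columns, Rows)
      || Column <- lists:seq(1, Columns)]
     || Row <- lists:seq(1, Rows)].

%% Hypothetical helper: Size x Size identity matrix.
identity(Size) ->
    new(Size, Size,
        fun(Column, Row, _, _) when Column =:= Row -> 1;
           (_, _, _, _) -> 0
        end).

%% Hypothetical helper: cells numbered left to right, top to bottom.
sequential(Columns, Rows) ->
    new(Columns, Rows,
        fun(Column, Row, Cols, _) -> Cols * (Row - 1) + Column end).

%% Hypothetical helper: pseudo-random integers from 1 to MaxValue.
random(Columns, Rows, MaxValue) ->
    new(Columns, Rows,
        fun(_, _, _, _) -> rand:uniform(MaxValue) end).
```

For example, matrix_sketch:sequential(3, 2) gives [[1,2,3],[4,5,6]], which also exercises the non-square case.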

With this general mechanism, matrix operations such as addition or multiplication become a lot simpler. All that is required is to define a generator function that sets each cell correctly according to whatever operation you are performing. Matrix addition is defined thus:
[erlang]
% Adds two matrices together.
-spec add(num_matrix(), num_matrix()) -> num_matrix().
add(A, B) ->
    new(length(lists:nth(1, A)), length(A),
        fun(Column, Row, _, _) ->
            element_at(Column, Row, A) + element_at(Column, Row, B)
        end
    ).
[/erlang]
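The add/2 function above calls an element_at/3 helper that isn't shown in this post. A plausible version, assuming it is built on lists:nth() as described earlier, might look like the sketch below. The module name, the dimensions/1 helper and the size check are additions for illustration only; the size check is the kind of error handling the published code does not yet have.

```erlang
-module(matrix_add_sketch).
-export([new/3, element_at/3, dimensions/1, add/2]).

new(Columns, Rows, ContentGenerator) ->
    [[ContentGenerator(Column, Row, Columns, Rows)
      || Column <- lists:seq(1, Columns)]
     || Row <- lists:seq(1, Rows)].

%% Assumed implementation: pick the row, then the column within it.
element_at(Column, Row, Matrix) ->
    lists:nth(Column, lists:nth(Row, Matrix)).

%% Hypothetical helper: returns {Columns, Rows} for a non-empty,
%% rectangular matrix.
dimensions([FirstRow | _] = Matrix) ->
    {length(FirstRow), length(Matrix)}.

%% Same idea as add/2 above, but refuses matrices of different sizes.
%% Repeating Columns and Rows in the pattern forces both to be equal.
add(A, B) ->
    case {dimensions(A), dimensions(B)} of
        {{Columns, Rows}, {Columns, Rows}} ->
            new(Columns, Rows,
                fun(Column, Row, _, _) ->
                    element_at(Column, Row, A) + element_at(Column, Row, B)
                end);
        _ ->
            erlang:error(badarg)
    end.
```

Note that every element_at/3 call walks the lists from the start, so this stays simple rather than fast.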

Performance isn’t great for modification of a single cell, as it rebuilds the entire matrix. A future optimisation could include recognising that an entire row hasn’t changed and simply rebinding that. However, for my current use cases, I haven’t needed efficient single-cell manipulation. There are specs for the functions, but very little in the way of error checking. There are no checks for the addition of matrices of two different sizes, for example.
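The row-reuse optimisation mentioned above could be sketched roughly as follows. This is not part of the published module; the module name and set/4 are hypothetical, and it assumes 1-based Column and Row indices that fall inside the matrix.

```erlang
-module(matrix_set_sketch).
-export([set/4]).

%% Replace the cell at (Column, Row) with Value. The untouched row
%% lists are shared with the original matrix rather than copied; only
%% the target row and the outer list up to that row are rebuilt.
set(Column, Row, Value, Matrix) ->
    {Before, [Target | After]} = lists:split(Row - 1, Matrix),
    {Left, [_Old | Right]} = lists:split(Column - 1, Target),
    Before ++ [Left ++ [Value | Right] | After].
```

An out-of-range index makes lists:split/2 or the pattern match fail, which is in keeping with the module's current lack of error checking.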

Future extensions will include the ‘Hungarian’ approach to the assignment problem, retrieval functions for entire columns or rows, some unit testing code, and confirmed support for NxM matrices. (Non-square matrices currently work in theory, but not all functions have been tested to work with them.) Check out the code on GitHub. It’s freely available to use and fork, and if you do, be sure to send a pull request!

Using Multiple Start and End Keys for CouchDB Views

Jamie Talbot — Wed, 24 Mar 2010 00:55:57 +0000

CouchDB view collation is great, and has only one drawback that has caused me any real pain: the inability to handle queries that need to be parameterised by more than one dimension. These are surprisingly common, and include problems such as “find me posts in Category A in March”.

This can be handled with a function that emits keys like:

[javascript]
["Category A", "2010", "03", "Post 1"]
["Category B", "2010", "03", "Post 2"]
["Category A", "2010", "03", "Post 3"]
[/javascript]

and then querying with:

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
[/javascript]
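The reason this range works is that CouchDB collation sorts objects after every other type, so a trailing {} acts as an upper bound for anything that shares the first three key components. In effect the query is a prefix match. The sketch below models that in plain Erlang with lists:prefix/2; it is only a model of the selection, not CouchDB code, and the module and function names are made up.

```erlang
-module(prefix_range_sketch).
-export([in_range/2, select/2]).

%% True when Key starts with Prefix. For keys made of strings this is
%% what a startkey of Prefix and an endkey of Prefix followed by {}
%% selects.
in_range(Prefix, Key) ->
    lists:prefix(Prefix, Key).

%% Keep only the keys that fall inside the modelled range.
select(Prefix, Keys) ->
    [Key || Key <- Keys, in_range(Prefix, Key)].
```

Seen this way, the limitation is clear: a prefix can only pin down the leading components of a key, never a component in the middle.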

However, finding its reciprocal, “All March posts regardless of category”, is problematic. You can’t do:

[javascript]
startkey=[*,"2010", "03"]&endkey=[*,"2010", "03",{}]
[/javascript]

where * (or _, or nil, or pass) would represent “all”.

To handle this, there are currently only two options: design a new view with the key components ordered differently, such that it emits:

[javascript]
["2010", "03", "Post 1"]
["2010", "03", "Post 2"]
["2010", "03", "Post 3"]
[/javascript]

or make multiple connections to the database, like:

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category C","2010","03"]&endkey=["Category C","2010", "03",{}]
[/javascript]

where you have a query for each category.

Neither approach is particularly satisfactory. On one recent problem set, a single view would be many hundreds of gigabytes of data, and while space is cheap, it’s not that cheap, so additional views were not an option. That same data set contained around 2000 different categories (or their equivalent), and 2000 connections for a single query seemed excessive.

Since 0.9, Couch has had a way of passing multiple keys to a query in the post body of a view request. Unfortunately, this only supported precise keys, not start-end key ranges. There has been a ticket in the issue tracker to add this additional support since October, but it’s classed as a minor priority and nothing had been done on it. So I decided to have a crack.

On the face of it, it seems like a fairly simple change, only affecting the HTTP View Erlang module. On the other hand, I’ve probably written about 100 lines of Erlang in my life and never looked at the CouchDB code before, so it’s entirely possible I’ve done something wrong. Regardless, the following is a simple solution that appears to work correctly.

The output_map_view and output_reduce_view functions already had the ability to handle start and end keys, but they were being artificially restricted to treat each supplied key as both start and end. I used Erlang’s pattern matching to make this a little richer:

[erlang]
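case Key of
    % Range object: StartKey and EndKey are bound by the pattern itself.
    {[{<<"startkey">>,StartKey},{<<"endkey">>,EndKey}]} ->
        nil;
    % Plain key: treat it as a range that starts and ends on itself.
    _ ->
        StartKey = Key,
        EndKey = Key
end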
[/erlang]
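The same decision can be written as a small function with two clauses that returns the pair, which avoids binding variables inside the case branches. This is a standalone sketch of the idea rather than the actual patch; the module and function names are invented for the example.

```erlang
-module(key_range_sketch).
-export([key_range/1]).

%% A key that is an object with exactly the members "startkey" and
%% "endkey", in that order, is treated as a range. Anything else is
%% treated as an exact key: a range that starts and ends on itself.
key_range({[{<<"startkey">>, StartKey}, {<<"endkey">>, EndKey}]}) ->
    {StartKey, EndKey};
key_range(Key) ->
    {Key, Key}.
```

As with the case expression, the match is order-sensitive: an object that lists "endkey" before "startkey" falls through to the exact-key clause.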

and then passed those new variables in at the appropriate places. This seemed to work well. I presume that the Keys parameter is processed just like multiple connections, and then the results aggregated, because the results are exactly the same as a call with the same parameters in the query string.

One final change was that group_level=X is mysteriously disallowed for Multikey queries. I took a punt and removed this restriction and it all seemed to work fine. I can only guess that this restriction didn’t make sense when you had to pass precise keys.

I then query using the following as POST data:

[javascript]
{
    "keys": [
        {
            "startkey": ["Category A","2010","03"],
            "endkey": ["Category A","2010","03",{}]
        },
        {
            "startkey": ["Category B","2010","03"],
            "endkey": ["Category B","2010","03",{}]
        }
    ]
}
[/javascript]
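With a couple of thousand categories you would not write that body by hand. A rough sketch of generating it as an Erlang term follows. It assumes the tuple-wrapped representation of JSON objects that CouchDB uses internally, in which {[]} stands for the empty object, and it leaves the final encoding to text to whichever JSON library you use. The module and function names are hypothetical.

```erlang
-module(multi_range_sketch).
-export([month_ranges/3]).

%% Build the term for the request body: one startkey/endkey object per
%% category, all covering the same Year and Month.
%% {[]} is the term form of the empty JSON object {}.
month_ranges(Categories, Year, Month) ->
    {[{<<"keys">>,
       [{[{<<"startkey">>, [Category, Year, Month]},
          {<<"endkey">>, [Category, Year, Month, {[]}]}]}
        || Category <- Categories]}]}.
```

Called with the binaries for "Category A" and "Category B", "2010" and "03", this produces the term equivalent of the POST data shown above.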

With this solution, I’m able to query 2000 services simultaneously, group them at any level I like, and get back the results at the lightning speed I’ve become accustomed to.

One small caveat: If I want to get back keys across non-contiguous blocks like this:

[javascript]
startkey=["Category A","2010","03"]&endkey=["Category A","2010", "03",{}]
startkey=["Category A","2010","06"]&endkey=["Category A","2010", "06",{}]
startkey=["Category B","2010","03"]&endkey=["Category B","2010", "03",{}]
startkey=["Category B","2010","06"]&endkey=["Category B","2010", "06",{}]
[/javascript]

to get all posts in Categories A and B in March and June, I can. However, if I have a reduce function and group at level 1, I still end up with 4 rows: 2 for Category A and 2 for Category B. I think this is because the queries are being run independently, without reference to each other. To do a full aggregation across time periods (for example to get the total number of posts by category in March and June), I’d still need to do a client-side aggregation on the resulting data set. This may or may not be a big problem for you; it’s certainly something I can live with.
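That client-side step is small. As a sketch, assuming the rows have already been decoded into {Key, Value} pairs and that the reduce is a simple count or sum (so that adding the partial values is valid), it could look like this. The module name and merge/1 are invented for the example.

```erlang
-module(merge_rows_sketch).
-export([merge/1]).

%% Sum the values of rows that share a key. Rows are {Key, Value}
%% pairs, for example taken from the "rows" of a grouped reduce
%% result. The result is a list of {Key, Total} sorted by key.
merge(Rows) ->
    Totals = lists:foldl(
               fun({Key, Value}, Acc) ->
                       orddict:update_counter(Key, Value, Acc)
               end, orddict:new(), Rows),
    orddict:to_list(Totals).
```

For other kinds of reduce, such as an average, the partial results cannot simply be added, so the merge function would need to match whatever the reduce computes.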

The CouchDB issue lives here, and the patch to 0.10.1 lives here.

Handling JSON Objects in CouchDB Native Erlang Views

Jamie Talbot — Thu, 18 Mar 2010 05:41:02 +0000

I’ve been working with CouchDB a fair bit in recent weeks and am really enjoying it so far. Once I got my head around how to structure views and take advantage of view collation, I found it to be far more expressive than I first thought.

I still have a couple of gripes, the largest one of which is that you can’t use a wildcard parameter at the beginning of your view keys, so if you need to get “items by user by category” and “items by category by user”, you need two views. I’m sure there are good architectural reasons for this, but for me it’s the one place where collation lets me down. For at least one of the solutions I’m working on, multiple views are a major problem, as even one takes up 120GB (and counting).

But, to the main point. Native Erlang views are now possible and, if you can create them, are potentially significantly faster than JavaScript ones. There are a couple of gotchas though, not least for me the handling of JSON objects.

We start with a document like this:

[javascript]
{
    "_id": "36kem",
    "_rev": "1-c895d5a55945a9898880bf870a3b3025",
    "type": "usage",
    "timestamp": [
        "2010",
        "02",
        "28",
        "23",
        "10"
    ],
    "data": [
        {
            "t": "E000005861",
            "i": "232920",
            "o": "2365730"
        },
        {
            "t": "E000006504",
            "i": "15784",
            "o": "17786"
        },
        {
            "t": "E000006505",
            "i": "16661",
            "o": "17786"
        }
    ]
}
[/javascript]

In reality there are thousands of entries in the data array, but this will do. Our aim is to emit one key-value pair for each item in the “data” field of each document of type “usage”. In JavaScript this is pretty trivial. Erlang, however, proves more of a challenge.

Based on pointers from the CouchDB Wiki, I started with:

[erlang]
fun ({Doc}) ->
    case proplists:get_value(<<"type">>, Doc) of
        <<"usage">> ->
            Emit(proplists:get_value(<<"_id">>, Doc), null);
        _ ->
            ok
    end
end.
[/erlang]

and was very happy to see that work. Two things to note here. First, don’t forget the {} around the Doc in the function definition, or you’ll get strange errors. Second, to get the value of a field in a document, you can use the standard proplists:get_value(<<"fieldname">>, Doc) construct. So far so good.

The main issue for me came with manipulating the “data” field. I didn’t actually want to emit null, but instead the “i” and “o” parts of the data field. First off, I tried:

[erlang]
lists:foreach(fun(Item) ->
    Emit(null, [proplists:get_value(<<"i">>, Item), proplists:get_value(<<"o">>, Item)])
end, proplists:get_value(<<"data">>, Doc))
[/erlang]

But met with some (very long) errors. (Gripe number two – they could really do with humanising the Erlang crash dump.)

It took me quite a few attempts, including stripping it right back to confirm that I had an array to iterate and that each object does in fact contain an “i” and an “o” field, before I found the problem, which is this:

Even though Documents are defined within {} braces, and JSON objects within that definition are also defined within {} braces, you cannot access them the same way in an Erlang view.

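proplists:get_value(<<"field">>, Doc) is fine for the document as a whole, because the {Doc} pattern in the function head has already stripped the outer tuple and left a plain property list. A nested JSON object, however, still arrives wrapped as {[...]}, a one-element tuple containing the property list, so passing it straight to proplists:get_value() fails. Bad assumption on my part. Luckily, the answer I got to another Stack Overflow question recently pointed the way.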

To access the data we need to pattern match the components using the Erlang representation of a JSON object, like so:

[erlang]
{[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row
[/erlang]

Ugly, hey? Useful though, as it extracts the TrackingID, In and Out values all in one go, kind of like a list() statement on steroids.
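One thing to be aware of is that this pattern only matches when the object has exactly those three members in exactly that order. If that can't be guaranteed, an alternative is to unwrap the tuple and go back to proplists. The sketch below shows the idea; the module and function names are made up, and it is not what the final view in this post uses.

```erlang
-module(ejson_sketch).
-export([get_field/2, usage_fields/1]).

%% Look up Field in a JSON object represented as a {Proplist} tuple.
%% Returns undefined when the field is missing.
get_field(Field, {Props}) when is_list(Props) ->
    proplists:get_value(Field, Props).

%% Extract the three members regardless of their order and of any
%% extra members in the object.
usage_fields(Row) ->
    {get_field(<<"t">>, Row),
     get_field(<<"i">>, Row),
     get_field(<<"o">>, Row)}.
```

The trade-off is three list lookups per item instead of a single match, which may matter with thousands of entries per document.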

With that in place, and a little more tidying up of the code, we arrive at:

[erlang]
fun({Doc}) ->
    case proplists:get_value(<<"type">>, Doc) of
        <<"usage">> ->
            [Year, Month, Day, Hour, Minute | _] = proplists:get_value(<<"timestamp">>, Doc),
            lists:foreach(fun(Row) ->
                {[{<<"t">>, TrackingID},{<<"i">>, In},{<<"o">>, Out}]} = Row,
                Emit([TrackingID, Year, Month, Day, Hour, Minute],[In, Out])
            end, proplists:get_value(<<"data">>, Doc));
        _ ->
            ok
    end
end.
[/erlang]

That little beauty lets me query the usage of a service at any granularity over data from the last 7 years in a faster time than the browser can render it. Across an HTTP connection to a data source 1000km away. On development hardware.