Wavelet Trees

Media_httpalexbowes3a_qjefe

In my last post I introduced a data structure called RRR, which is used to quickly answer rank queries on binary sequences, and provide implicit compression.

Today I will talk about an elegant way of answering rank queries on sequences over larger alphabets. The structure is called the Wavelet Tree, which organises a string into a hierarchy of bit vectors. A rank query has time complexity is O(log_2 A), where A is the size of the alphabet. It was introduced by Grossi, Gupta and Vitter in their 2003 paper High-order entropy-compressed text indexes [4] (see the Further Reading section for more papers). It has since been featured in many papers [1, 2, 3, 5, 6].

If you store the bit vectors in RRR sequences, it may take less space than the original sequence. Alternatively, you could store the bit vectors in the rank indexes proposed by Sadakane and Okonohara [7]. It has a different approach to compression. I will talk about it another time ;) – fortunately, I will be studying under Sadakane-san at a later date.

In a different future post, I will show how Suffix Arrays can be used to find arbitrary patterns of length P, by issuing 2P rank queries. If using a Wavelet Tree, this means a pattern search has O(P log_2 A) time complexity, that is, the size of size of the ‘haystack’ doesn’t matter, it instead depends on the size of the ‘needle’ and size of the alphabet.

Construction

A Wavelet Tree converts a string into a balanced binary-tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This creates ambiguity, but at each level this alphabet is filtered and re-encoded, so the ambiguity lessens, until there is no ambiguity at all.

The tree is defined recursively as follows:

  1. Take the alphabet of the string, and encode the first half as 0, the 2nd half as 1: { a, b, c, d } would become { 0, 0, 1, 1 }
  2. Group each 0-encoded symbol, { a, b }, as a sub-tree;
  3. Group each 1-encoded symbol, { c, d }, as a sub-tree;
  4. Reapply this to each subtree recursively until there is only one or two symbols left (when a 0 or 1 can only mean one thing).

For the string “Peter Piper picked a peck of pickled peppers” (spaces and a string terminator have been represented as _ and $ respectively, due to convention in the literature) the Wavelet Tree would look like this:

Media_httpalexbowes3a_odjnn

note: the strings aren’t actually stored, but are shown here for convenience

It has the alphabet { $, P, _, a, c, d, e, f, i, k, l, o, p, r, s, t }, which would be mapped to { 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }. So, for example, $ would map to 0, and r would map to 1.

The left subtree is created by taking just the 0-encoded symbols { $, P, _, a, c, d, e, f } and then re-encoding them by dividing this new alphabet: { 0, 0, 0, 0, 1, 1, 1, 1 }. Note that on the first level an e would be encoded as a 0, but now it is encoded as a 1 (it becomes a 0 again at a leaf node).

We can store the bit vectors in RRR structures for fast binary rank queries (which are needed, as described below), and compression :) In fact, since it is a balanced tree, we can concatenate each of the levels and store it as one single bit vector.

Querying

Recall from my last post that a rank query is the count of 1-bits up to a specified position. Rank queries over larger alphabets are analogous – instead of a 1, it may be any other symbol:

Media_httpalexbowes3a_wfllf

After the tree is constructed, a rank query can be done with log A (A = alphabet size) binary rank queries on the bit vectors – O(1) if you store them in RRR or another binary rank index. The encoding at each internal node may be ambiguous, but of course it isn’t useless – we use the ambiguous encoding to guide us to the appropriate sub-tree, and keep doing so until we have our answer.

For example, if we wanted to know rank(5, e), we use the following procedure which is illustrated below. We know that e is encoded as 0 at this level, so we take the binary rank query of 0 at position 5:

Media_httpalexbowes3a_ahscp

Which is 4, which we then use to indicate where to rank in the 0-child: the 4th bit (or the bit at position 3, due to 0-basing). We know to query the 0-child, since that is what e was encoded as at the parent level. We then repeat this recursively:

Media_httpalexbowes3a_bdgct

At a leaf node we have our answer. I would love to explain why this works, but it is fun and rewarding to think about it yourself ;)

There are also ways to provide fast select queries, but once again I will leave that up to you to research. The curious among you might also be interested in the Huffman-Shaped Wavelet Tree described by Mäkinen and Navarro [5].

Using Your New Powers for Good

Feel free to implement this yourself, but if you want to get your hands dirty right away, all-around-clever-guy Francisco Claude has made an implementation available in his Compressed Data Structure Library (libcds). If you create something neat with it be sure to report back ;)

And if you read this far, consider following me on Twitter: @alexbowe.

Further Reading

I didn’t want to saturate this blog with proofs and so-on, as it was meant to be a light introduction. It is also a pain typesetting math on this blog :/ If you want to learn more about this awesome structure, check out the following papers:

[1] F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 176–187. Springer, 2008.

[2] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849–866, 2009.

[3] P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Al- gorithms, 3(2):20, 2007.

[4] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th annual ACM-SIAM symposium on Dis- crete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.

[5] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005.

[6] V. Mäkinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 4726, pages 214–226. Springer, 2007.

[7] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. Arxiv Computing Research Repository, abs/cs/0610001, 2006.

YaRRR Me Hearties - a post about a succinct data structure (not pirates, sorry)

Media_httpalexbowes3a_igeiq

This blog post will give an overview of a static bitsequence data structure known as RRR, which answers arbitrary length rank queries in O(1) time, and provides implicit compression.

As my blog is informal, I give an introduction to this structure from a birds eye view. If you want, read my thesis for a version with better markup, and follow the citations for proofs by people smarter than myself :)

My intended future posts will cover the other aspects of my thesis, including generalising RRR (for sequences over small alphabets), Wavelet Trees (which answer rank queries over bigger alphabets), and Suffix Arrays (a text index which – when combined with the above structures – can answer queries in O(P log A) time, when P is the length of the search pattern, and A is the alphabet size).

Update: I have now posted about Wavelet Trees! Check it out here.

Example Problem

Cracking the Oyster, the first column of Programming Pearls, opens with a programmer asking for advice when sorting around ten million unique seven-digit integers – phone numbers.

After some discussion, the author concludes that a bitmap should be used. If we wanted to store ten million integers, we could use an array of 32-bit integers, consuming 38 MB, or we could represent our numbers as positions on a number line.

All of these phone numbers will be within the range [0000000, 9999999]. To represent the presence of these numbers, we only need a bitmap 107 bits long, about 1 MB, which would represent our number line. Then, for a bitmap M, if we want to store phone number p, we set the bit M[p] to 1. Sorting would involve setting the numbers that are present to 1, then iterating over the bitmap, printing the positions of the 1-bits – O(N) time.

In the following sections, I will detail operations that can be done on bitmaps, named rank and select, and explain how to answer rank queries in O(1) time, and implicitly compress the bitmap. Using rank and select, a compressed bitmap can be a very powerful way to store sets. This isn’t limited to just sets of numbers, all sorts of things, such as sets of graph nodes for example; A friend of mine is using succinct bitmaps to represent De Bruijn graphs, which are used in genome assembly.

Extension: Rank

Allow me to extend the problem. I want to query our simple phone number database to see how many phone numbers are allocated within the range $[0005000, 0080000]$. I could iterate over that range and update a counter whenever I encounter a 1-bit. Actually, this operation is what is known as a rank operation.

Media_httpalexbowes3a_trqyt

The operation rank(i) is defined as the number of set bits (1s) in the range [0, i] (or [0, i) in some papers). In the bitstring above, the answer to rank(5) is 3… This is a generalisation of the popcount operation which counts all set bits, which I have discussed before. rank(i) can be implemented by left-shifting L - i bits (where L is the length of the datatype you are using, int, long, etc) to remove the unwanted bits, then calling popcount on the resulting value. This could be done iteratively over an array if you want, but I will discuss a much faster way below.

Then, the above question can be answered as: rank(0080000) - rank(0005000 - 1). This will give us just the number of 1s between 0005000 and 0080000.

This isn’t the only place we would use a popcounts; it happens that popcounts are common enough that we want to optimise them. Check out this blog post at valuedlessons.com for a discussion and empirical comparison of several fast approaches.

RRR

As it happens, we can build a data structure for static bitmaps that answers rank queries in O(1) time, and provides implicit compression. It is what is known as a succinct data structure, which means that even though it is compressed, we don’t need to decompress the whole thing t operate on it efficiently. Sadakane (a big name in succinct data structures) gives a nice analogy in his introduction of the field, likening it to forcing dehydrated noodles apart with your chopsticks (decompression) as you are rehydrating them, but before the whole thing is fully cooked and separated. This allows you to keep some of the noodles compressed while you eat the decompressed fragment.

Since it is static it isn’t well suited for a bitmap which you want to update (although work has been done toward this), it is still really cool :)

The structure I’m referring to is named RRR. It sounds like a radio station, but it is named after its creators: Raman, Raman, Rao, from their 2002 paper Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. Its a data structure I had to become intimately involved with for my honours thesis, where I extended it for sequences of larger (but still small) alphabets. If you want to answer rank queries on large alphabets, a wavelet tree might be what you are after, but that will be covered in a different blog post (or you could read my thesis!).

In my last post (Generating Binary Permutations in Popcount Order) I discussed how to compress a bitstring by replacing blocks of a certain blocksize with their corresponding pop number, and (variable length) offset into a lookup table. I briefly mentioned building an index over it to improve lookup as well.

RRR: Construction

To construct a RRR sequence we divide our bitmap into blocks, as I mentioned in my previous blog post. These are grouped in superblocks, too, which allows us to construct an index to enable O(1) rank queries. In the following image, I have fragmented the bitmap using a blocksize of b = 5, and grouped them with a superblock factor of f = 3 – so each superblock is three blocks.

Media_httpalexbowes3a_qeoif

First we replace the blocks with a pair of values, a class value C and offset value O, which are used together as a lookup key into a table of precomputed (small – for each possible block only) ranks – this is demonstrated in the figure below. This is the same as the previous blog post, although in that I called the “class” P. This is because the class of a block is defined as the popcount – the number of set bits – in the block: class(B) = popcount(B) for block B.

Media_httpalexbowes3a_gbopg

The table is shared among all your RRR sequences, and is in fact a table of tables, where C points to the first element for the ranks of a given popcount:

Media_httpalexbowes3a_cjgfp

For this table (let’s call it G), for a given class C, the sub-table at G[C] has b Choose C entries, which correspond to all possible permutations that have a popcount of C. This means that while our C values will always be log(b + 1) (the number of bits to represent values 0, 1, 2… b – these are all possible popcount values for the blocksize), but our O values will vary in size, requiring log(b Choose C) bits (oh yeah, and of course I’m using log_2 here :)). During a query, we can use our C values to work out how many bits will follow for the O values.

Using this approach alone we get the compression, but not O(1) ranks. C is fixed width, the compression comes from O being varied width.

In order to get the O(1) ranks we use a method discussed by Munro in Tables, 1996. This is where the superblocks come in to play:

Media_httpalexbowes3a_jjdii

For each superblock boundary we store the global rank up to that position. We also store a prefix sum of the bits, which gives us the address to the first block in the next superblock (since it is variable length!). This allows us to not require iterating over the whole RRR sequence, but instead going straight to the required superblock. We will only need to iterate over the blocks within a superblock, so it is now bound by whatever your superblock factor is.

RRR: Querying

To calculate rank(i):

  1. Calculate which block our index is in as i_b = i/b. (i_b is the global index of the block)
  2. Calculate which superblock our block resides in as i_s = i_b/f. (i_s is the index of the superblock)
  3. Set result to the sum of previous ranks at is boundary (which is pre- calculated).
  4. Using each blocks class-offset pair (c,o) after the boundary at is, add the rank for that entire block to result. 5. Repeat previous step until we reach i_b. We then add rank(j,c) (from i_b, not the global rank) to our result,where j = i mod b, and is the position we are querying local to i_b. Our final answer is result.

Select

Media_httpalexbowes3a_heucn

Select is the inverse operation to rank; it answers the question “at which position is the ith set bit?”. To tie this in with the phone numbers example, maybe we want to find out the fiftieth phone number in the set (excluding unassigned numbers). This is a way we can index just the present elements of a bitmap. It turns out select can be answered in O(1) time as well. I won’t cover select here, as my future posts (and thesis) will mainly use rank. You can read about it in the RRR paper.

Go Forth and…

Feel free to implement this (somewhat complicated) data structure yourself, or you can use a pre-rolled one by my friend Francisco Claude in his LIBCDS – Compressed Data Structure Library.

If you read this far, consider adding me to twitter :) or you may enjoy reading my post on Wavelet Trees.

Generating Binary Permutations in Popcount Order

Media_httpalexbowes3a_mpcoj

I’ve been keeping an eye on the search terms that land people at my site, and although I get the occasional “alex bowe: fact or fiction” and “alex bowe bad ass phd student” queries (the frequency strangely increased when I mentioned this on Twitter) I also get some queries that relate to the actual content.

One query I received recently was “generating integers in popcount order”, I guess because I mentioned popcounts (the number of 1-bits in a binary string) in a previous post, but the post wasn’t able to answer that visitors question.

What would this be used for? Among other applications, I have used it for generating a table of numbers ordered by popcount, which I used in a compression algorithm: by breaking a bitstring into fixed-length chunks (of B bits) and replacing them with a (P, O) pair, where P is the block’s popcount which can be used to point to the table where each entry has the popcount P, and O is the offset in that subtable. Then P can be stored with log2(B + 1) bits – we need to represent all possible P values from 0 to B – and O can be stored with log2(binomial(B, P)) bits.

Media_httpalexbowes3a_adipi

Note that the bit-length of P varies; binomial represents the binomial coefficient, which can be seen in Pascal’s triangle expanded to row 5:

Media_httpalexbowes3a_szwrn

So binomial(5, x) for x = 0, 1, … , 5 yields the sequence 1, 5, 10, 10, 5, 1 – some things take more bits than others. Once you know the P value, you will know how many bits the O value is, so you can read it that way. This means access is O(N) (since each values position relies on the previous P values), but you can build an index on top of that to allow O(1) lookup. But all this is a story for another time ;) [1].

Here is the code:

def next_perm(v):
    """
    Generates next permutation with a given amount of set bits,
    given the previous lexicographical value.
    Taken from http://graphics.stanford.edu/~seander/bithacks.html
    """
    t = (v | ( v - 1)) + 1
    w = t | ((((t & -t) / (v&-v)) >> 1) - 1)
    return w

This will take a number with a certain popcount, and generate the next number with the same popcount. For example, if you feed it 7, which is 111 in binary, you will get 1011 back – or 11 – the next number with the same popcount (lexicographically speaking).

To find the first number of a given popcount, you can use this:

def element_0(c):
    """Generates first permutation with a given amount
       of set bits, which is used to generate the rest."""
    return (1 << c) - 1

I should clear up what lexicographically means in the context of numbers. Well, it’s actually the same as any other symbols (such as an alphabet), 0 is the symbol that comes before 1 (if it helps, you can picture 0 as a and 1 as b):

00111 aabbb
01011 ababb
01101 abbab
01110 abbba
10011 baabb
10101 babab
10110 babba
11001 bbaab
11010 bbaba
11100 bbbaa

Psuedocode

Looking at the above pattern, here is some loose pseudocode that may help us understand how the above bithacks work:

  1. Set i to the position of the rightmost bit
  2. Stop if there are no set bits, or if we have looked at all the bits (i >= length of bitstring)
  3. If the i+1th bit (one place to the left) is 0: move the ith bit left
  4. Otherwise, if the i+1th bit (one place to the left) is 1: set i to i + 1 and repeat from 2.
  5. Shift the bits on the range [0, i] right so that the rightmost bit is in position 0

Explanation

Understanding element_0() is pretty easy. 1 << c is the same as moving a 1 to position c, then -1 sets all the bits from 0 to c – 1, giving c set bits:

c = 4
1 << c = 10000
10000 - 1 = 01111

next_perm() is a bit more complicated. The v | (v -1) in t = (v | (v - 1)) + 1 right-propagates the rightmost bit. Allow me to show an example: 01110 | 01101 = 01111

In the case of 0, this isn’t quite correct: 00000 | 11111 = 11111 But it’s okay because we proceed to add 1 to this value (which returns it to zero). This increment, combined with the right-propagation, will do step 2, 3, and part of step 5 above. For example, 10100 -> 10111 -> 11000.

We are on our way to generating integers by popcount in lexicographical order.

Now let’s break down the next line, w = t | ((((t & -t) / (v&-v)) >> 1) - 1). Bitwise equations of the form (x & -x) isolate the rightmost bit: 01110 & 10010 (two's complement) = 00010

If you take the two’s complement, you invert a numbers bits and then add 1. If you think about it, this means there are 0s where there were 1s, and 1s where there were 0s, and adding 1 bumps the rightmost 1 left and sets the subsequent right bits to 0. This means that the only position that will remain set in both numbers is the the rightmost 1-bit.

So let R(x) denote the isolated rightmost bit of x, then for x = 01110 we calculate t = 10000, R(x) = 00010 and R(t) = 10000.

Following the calculation of w, we need to divide them: R(t) / R(x) = 10000 / 00010 = 01000

Shift to the right by 1: 00100

Subtract 1: 00011

Then we bitwise-or them to stick them together: 10000 | 00011 = 10011. This corresponds to our table above :)

So w = t | ((((t & -t) / (v&-v)) >> 1) - 1) corresponds to the rest of step 5 (the moving part, t was the zeroing part of moving the sub-range) in our pseudocode. Well, kind of anyway, there are a few steps happening in parallel, but the pseudocode was only loosely explaining what was happening :)

Testing it out

def gen_blocks(p, b):
    """
    Generates all blocks of a given popcount and blocksize
    """
    v = initial = element_0(p)
    block_mask = element_0(b)

    while (v >= initial):
        yield v
        v = next_perm(v) & block_mask



>>> for x in gen_blocks(3, 5): print bin(x, 5)
... 
00111
01011
01101
01110
10011
10101
10110
11001
11010
11100

Note: bin is just a function I found online for printing binay numbers and isn’t important to this post, but you can find it here if you need one.

Then of course you can loop through all values of P from 0 to B to build the complete table.

Questions? Comments? Flames? I wanna hear em :)

[1] – Check out Tables by Munro, 1996, and Succinct indexable dictionaries with applications to encoding k-ary trees and multisets by Raman et al, 2002 for a first step into this stuff. Also check out my honours thesis, 2010 for a recent look at succinct data structures. I will write a blog post about this stuff sooner or later :P

Some Lazy Fun

Media_httpianumcesedu_imcvu

Image taken from http://ian.umces.edu/

Update: fellow algorithms researcher Francisco Claude just posted a great article about using lazy evaluation to solve Tic Tac Toe games in Common Lisp. Niki (my brother) also wrote a post using generators with asynchronous prefetching to hide IO latency. Worth a read I say!

I’ve recently been obsessing over this programming idea called streams (also known as infinite lists or generators), which I learned about from the Structure and Interpretation of Computer Programs book. It is kind of like an iterator that creates its own data as you go along, and it can lead to performance increases and wonderfully readable code when you utilise them with higher order functions such as map and reduce (which many things can be rewritten in). It also allows you to express infinitely large data structures.

When regular lists are processed with higher order functions, you need to compute the entire list at each stage; if you have 100 elements, and you map a function to them, then filter them, then partially reduce them, you may be doing up to 300 operations, but what if you only want to take the first 5 elements of the result? That would be a waste, hence streams are sometimes a better choice.

Although SICP details how to do it in Scheme, in this blog post I will show some languages that have it built in - Haskell and Python - and how to implement streams yourself if you ever find yourself in a language without it1.

Haskell

Haskell is a lazy language. It didn’t earn this reputation from not doing the dishes when you ask it to (although that is another reason it is lazy). What it means in the context of formal languages is that evaluation is postponed until absolutely necessary (Here is a cute (illustrated) blog post describing this lazy evaluation stuff). Take this code for example:

Prelude> let x = [1..10]

At this stage you might be tempted to say that x is the list of numbers from 1 to 10. Actually it only represents a promise that when you need that list, x is your guy. The above code that creates a list from 1 to 10 still hasn’t been executed until I finally ask it to be (by referring to x):

Prelude> x
[1,2,3,4,5,6,7,8,9,10]

It is kind of like telling your mum you’ll do the dishes, but waiting until she shouts your name out again before you put down your DS. Actually, it is sliiiiightly different - if I instead wrote:

Prelude> let x = [1..10]
Prelude> let y = x ++ [11..20]

I have referred to x again when I declared y, but x still hasn’t evaluated. Only after I shout y’s name will y shout x’s name and give me back my whole list:

Prelude> y
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

Here you ask your robot to wash half the dishes, but he is too busy playing DS too (stupid robot). Finally when your mum shouts, you shout at the robot, and he does his set of dishes, and you do yours. But what is the benefit here? It isn’t that I can get more DS time in…

Take for example a list of positive integers from 1. Yes, all of them. In other languages it might be hard to express this, but in Haskell it is as simple as [1..]. This means we have a list of infinite integers, but we will only calculate as far as we need:

Prelude> let z = [1..]
Prelude> head z
1
Prelude> take 5 z
[1,2,3,4,5]

The syntax here is amazingly terse, and it may make your code more efficient. But even if we don’t have the syntax for it in another language, we can provide it ourselves very easily.

Python

Python has a similar concept called generators, which are made using the yield keyword in place of return, more than one time (or in a loop) in a function:

def integers_from(N):
     while(1):
         yield N
         N += 1

>>> z = integers_from(1)
>>> z.next()
1
>>> z.next()
2

Note: Python generators are stateful and are hence slightly different to an infinite list in Haskell. For example z.next() returns different values in two places, and thus is time sensitive - we cannot get z to ‘rewind’ like we could in Haskell, where z is stateless. Statelessness can lead to easier to understand code, among other benefits.

Rolling Our Own

Let’s reinvent this wheel in Python (but in a stateless manner), so if we ever find ourselves craving infinite lists we can easily roll our own in pretty much any language with Lambdas.

I have chosen Python to implement this, even though it already supports infinite lists through generators, simply because its syntax is more accessible. Indeed, the below can already be done with Python’s built-in-functions (although with state). It is probably not a great idea to do it this way in Python, as it doesn’t have tail call optimisation (unless you use this hack using decorators and exceptions).

First we’ll look at adding lazy evaluation, however the syntax requires it to be explicit:

>>> x = lambda: 5
>>> y = lambda: 2 + x()

Here, x is not 5, and y is not 7, they are both functions that will evaluate to that when we finally run them; the expression inside the lambda won’t be evaluated until we do so explicitly:

>>> x()
5
>>> y()
7

And that’s pretty much all the heavy lifting. To make an infinite list, we basically make a linked list where we generate each node as we need it:

def integers_from(N): return (N, lambda: integers_from(N+1))

def head((H, _)): return H

def tail((_, T)): return T()

And there is our infinite list. To access it use head() and tail() (recursively if necessary):

>>> z = integers_from(1)
>>> head(z)
1
>>> head(tail(z))
2

Helper Functions

First we should make a way for us to look at our streams:

def to_array(stream):
    return reduce(lambda a, x: a + [x], [], stream)

Which is a reduce operation that puts each head element into an array (which is carried along as a parameter to reduce()). Here is reduce() (map() can be found in this gist):

null_stream = (None, None)
def reduce(f, result, stream):
    if stream is null_stream: return result
    return reduce(f, f(result, head(stream)), tail(stream))

We needed some way to tell if we had reached the end of a stream - not all streams are infinitely long. Meet our next function, which will help us terminate a stream:

def take(N, stream):
    if N <= 0 or stream is null_stream: return null_stream
    return (head(stream), lambda: take(N-1, tail(stream)))

This will take the first N elements from the specified stream. So now we can inspect the first N elements:

>>> to_array(take(10, integers_from(1)))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

For our upcoming example, we also need a filter() method, which will filter out elements that meet a provided predicate:

def filter(pred, stream):
    if pred(head(stream)):
        return (head(stream), lambda: filter(pred, tail(stream)))
    return filter(pred, tail(stream))

Now onto our example :)

Textbook Example

Here is the standard example to demonstrate the terseness of streams:

def sieve(stream):
    h = head(stream)
    return (h, lambda: sieve(
        filter(lambda x: x%h != 0, tail(stream))))

Here is a function which recursively filters anything which is divisible by any number we have previously seen in our stream. Math aficionados will notice that this is the Sieve of Eratosthenes algorithm.

>>> primes = sieve(integers_from(2))
>>> to_array(take(10, primes))
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Recursively defined data, and only as much of it as we want - pretty neat.

And Now For Something Almost Completely Different

When I first saw this, I wondered what application there might be to have a stream of functions. Here I have defined a stream which recursively applies a function to itself:

def rec_stream(f):
    return (f, lambda: rec_stream(lambda x: f(f(x))))

When might this be useful? It might yield speed improvements if you commonly want to recursively apply a function a certain amount of times, but have costly branching (so the condition check at each level is slow). It could also be used as a abstraction for recursive iteration 2, which gives you back the function 'already recursed' so to speak (although lazily).

One such recursive process I can think of is Newton’s method for approximating roots, defined recursively as:

Media_httpmathurlcom3_ghfkj

When f(x) = 0.

The more iterations you do the more accurate the solution becomes. One use of Newton’s method is to use it until you have reached a certain error tolerance. Another way, which I learned about recently when reading about the fast inverse square root algorithm, which uses just one step of Newton’s method as a cheap way to improve it’s (already pretty good) initial guess. There is a really great article here which explains it very well.

After reading that, I wondered about a stream that would consist of functions of increasing accuracy of Newton’s method.

def newton(f, fdash):
    return lambda x: x - f(x)/float(fdash(x))

The newton() function accepts f(x) and f'(x), and returns a function that accepts a first guess.

def newton_solver(iters, f, fdash):
    def solve(v):
        n = newton(lambda x: f(x) - v, fdash)
        stream = rec_stream(n)
        return to_array(take(iters, stream))[-1]
    return solve

This one is a little more complicated. In order to have it solve for a value other than zero, I needed to either define it in f(x), since f(x) must equal zero, but I didn’t want the user to have to iterate over the stream each time they wanted to compute the square root of a different number, say. To allow it to return a function that solved for square roots in the general case, I had to make the internal function solve(), which would bind for the value the caller specifies, hence solving f(x) = v for x. Hopefully this becomes clearer with an example:

>>> sqrt = newton_solver(1, lambda x: x**2, lambda x: 2*x) # 1 iter
>>> sqrt(64)(4) # Sqrt of 64 with initial guess of 4
10.0
>>> sqrt = newton_solver(3, lambda x: x**2, lambda x: 2*x) # 3 iters
>>> sqrt(64)(4)
8.000000371689179

Now we can pass around this square root function and it will always do 3 iterations of Newton's method.

This may not be practical unless compilers can optimise the resulting function (or if there is a way to do the reduction myself easily), but it was fun to do :) As always comments and suggestions are appreciated. If anyone who reads this is good with compilers, advice would be great :D

What you can do now is read SICP for more cool things like streams and functional programming, or check out Sean Anderson’s bit hacks page for more cool hacks like the fast inverse square root. Or refactor your code to use map, reduce and streams :)


  1. The reason I have chosen Python for this exercise is for reasons of accessibility. Here is a post about implementing streams in Ruby, and here is one for Erlang :) but of course it’s all pretty much the same deal. ↩

  2. If a compiler could optimise this, simplifying the reapplied function, but keeping the generality, that’d be really cool :) I don’t think many compilers would/could do that for lambdas though. Any information would be great. ↩

Design Pattern Flash Cards

Flashcards

Last year I studied a subject which required me to memorise design patterns. I tried online flash card web sites, but I was irritated that I didn't own the data I put up (they had no export option). So I wrote a something in Python to generate flash cards for me using LaTeX and the Cheetah templating library. The repository is hosted here, although it could do with a refactor.

If you don't want to generate your own, you can download the pre-generated design pattern intent flash cards here which contains the 23 original design patterns from the Gang Of Four.

To generate your own flash cards, create an input text file with this structure:

Front text (such as pattern name):
Definition line 1.
Definition line 2.

For example:

Abstract Factory:
Provides an interface for creating families of related or
dependent objects without specifying their concrete classes.

Currently the front text is single-line only. The regex could be updated of course (if you do, feel free to send a pull request!).

To compile this:

./cardgen.py -i inputfile -o outputfile
pdflatex outputfile

Then just print it out on a double side printer (or glue the two sheets together). I carried these around with me all the time during the lead-up to the exam, and I was scay-fast when it came to recalling which design pattern did what. Just flick through them (shuffle first) in forward or reverse order when you are on the train next :)