YaRRR Me Hearties - a post about a succinct data structure (not pirates, sorry)

Media_httpalexbowes3a_igeiq

This blog post will give an overview of a static bitsequence data structure known as RRR, which answers arbitrary length rank queries in O(1) time, and provides implicit compression.

As my blog is informal, I give an introduction to this structure from a birds eye view. If you want, read my thesis for a version with better markup, and follow the citations for proofs by people smarter than myself :)

My intended future posts will cover the other aspects of my thesis, including generalising RRR (for sequences over small alphabets), Wavelet Trees (which answer rank queries over bigger alphabets), and Suffix Arrays (a text index which – when combined with the above structures – can answer queries in O(P log A) time, when P is the length of the search pattern, and A is the alphabet size).

Update: I have now posted about Wavelet Trees! Check it out here.

Example Problem

Cracking the Oyster, the first column of Programming Pearls, opens with a programmer asking for advice when sorting around ten million unique seven-digit integers – phone numbers.

After some discussion, the author concludes that a bitmap should be used. If we wanted to store ten million integers, we could use an array of 32-bit integers, consuming 38 MB, or we could represent our numbers as positions on a number line.

All of these phone numbers will be within the range [0000000, 9999999]. To represent the presence of these numbers, we only need a bitmap 107 bits long, about 1 MB, which would represent our number line. Then, for a bitmap M, if we want to store phone number p, we set the bit M[p] to 1. Sorting would involve setting the numbers that are present to 1, then iterating over the bitmap, printing the positions of the 1-bits – O(N) time.

In the following sections, I will detail operations that can be done on bitmaps, named rank and select, and explain how to answer rank queries in O(1) time, and implicitly compress the bitmap. Using rank and select, a compressed bitmap can be a very powerful way to store sets. This isn’t limited to just sets of numbers, all sorts of things, such as sets of graph nodes for example; A friend of mine is using succinct bitmaps to represent De Bruijn graphs, which are used in genome assembly.

Extension: Rank

Allow me to extend the problem. I want to query our simple phone number database to see how many phone numbers are allocated within the range $[0005000, 0080000]$. I could iterate over that range and update a counter whenever I encounter a 1-bit. Actually, this operation is what is known as a rank operation.

Media_httpalexbowes3a_trqyt

The operation rank(i) is defined as the number of set bits (1s) in the range [0, i] (or [0, i) in some papers). In the bitstring above, the answer to rank(5) is 3… This is a generalisation of the popcount operation which counts all set bits, which I have discussed before. rank(i) can be implemented by left-shifting L - i bits (where L is the length of the datatype you are using, int, long, etc) to remove the unwanted bits, then calling popcount on the resulting value. This could be done iteratively over an array if you want, but I will discuss a much faster way below.

Then, the above question can be answered as: rank(0080000) - rank(0005000 - 1). This will give us just the number of 1s between 0005000 and 0080000.

This isn’t the only place we would use a popcounts; it happens that popcounts are common enough that we want to optimise them. Check out this blog post at valuedlessons.com for a discussion and empirical comparison of several fast approaches.

RRR

As it happens, we can build a data structure for static bitmaps that answers rank queries in O(1) time, and provides implicit compression. It is what is known as a succinct data structure, which means that even though it is compressed, we don’t need to decompress the whole thing t operate on it efficiently. Sadakane (a big name in succinct data structures) gives a nice analogy in his introduction of the field, likening it to forcing dehydrated noodles apart with your chopsticks (decompression) as you are rehydrating them, but before the whole thing is fully cooked and separated. This allows you to keep some of the noodles compressed while you eat the decompressed fragment.

Since it is static it isn’t well suited for a bitmap which you want to update (although work has been done toward this), it is still really cool :)

The structure I’m referring to is named RRR. It sounds like a radio station, but it is named after its creators: Raman, Raman, Rao, from their 2002 paper Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. Its a data structure I had to become intimately involved with for my honours thesis, where I extended it for sequences of larger (but still small) alphabets. If you want to answer rank queries on large alphabets, a wavelet tree might be what you are after, but that will be covered in a different blog post (or you could read my thesis!).

In my last post (Generating Binary Permutations in Popcount Order) I discussed how to compress a bitstring by replacing blocks of a certain blocksize with their corresponding pop number, and (variable length) offset into a lookup table. I briefly mentioned building an index over it to improve lookup as well.

RRR: Construction

To construct a RRR sequence we divide our bitmap into blocks, as I mentioned in my previous blog post. These are grouped in superblocks, too, which allows us to construct an index to enable O(1) rank queries. In the following image, I have fragmented the bitmap using a blocksize of b = 5, and grouped them with a superblock factor of f = 3 – so each superblock is three blocks.

Media_httpalexbowes3a_qeoif

First we replace the blocks with a pair of values, a class value C and offset value O, which are used together as a lookup key into a table of precomputed (small – for each possible block only) ranks – this is demonstrated in the figure below. This is the same as the previous blog post, although in that I called the “class” P. This is because the class of a block is defined as the popcount – the number of set bits – in the block: class(B) = popcount(B) for block B.

Media_httpalexbowes3a_gbopg

The table is shared among all your RRR sequences, and is in fact a table of tables, where C points to the first element for the ranks of a given popcount:

Media_httpalexbowes3a_cjgfp

For this table (let’s call it G), for a given class C, the sub-table at G[C] has b Choose C entries, which correspond to all possible permutations that have a popcount of C. This means that while our C values will always be log(b + 1) (the number of bits to represent values 0, 1, 2… b – these are all possible popcount values for the blocksize), but our O values will vary in size, requiring log(b Choose C) bits (oh yeah, and of course I’m using log_2 here :)). During a query, we can use our C values to work out how many bits will follow for the O values.

Using this approach alone we get the compression, but not O(1) ranks. C is fixed width, the compression comes from O being varied width.

In order to get the O(1) ranks we use a method discussed by Munro in Tables, 1996. This is where the superblocks come in to play:

Media_httpalexbowes3a_jjdii

For each superblock boundary we store the global rank up to that position. We also store a prefix sum of the bits, which gives us the address to the first block in the next superblock (since it is variable length!). This allows us to not require iterating over the whole RRR sequence, but instead going straight to the required superblock. We will only need to iterate over the blocks within a superblock, so it is now bound by whatever your superblock factor is.

RRR: Querying

To calculate rank(i):

  1. Calculate which block our index is in as i_b = i/b. (i_b is the global index of the block)
  2. Calculate which superblock our block resides in as i_s = i_b/f. (i_s is the index of the superblock)
  3. Set result to the sum of previous ranks at is boundary (which is pre- calculated).
  4. Using each blocks class-offset pair (c,o) after the boundary at is, add the rank for that entire block to result. 5. Repeat previous step until we reach i_b. We then add rank(j,c) (from i_b, not the global rank) to our result,where j = i mod b, and is the position we are querying local to i_b. Our final answer is result.

Select

Media_httpalexbowes3a_heucn

Select is the inverse operation to rank; it answers the question “at which position is the ith set bit?”. To tie this in with the phone numbers example, maybe we want to find out the fiftieth phone number in the set (excluding unassigned numbers). This is a way we can index just the present elements of a bitmap. It turns out select can be answered in O(1) time as well. I won’t cover select here, as my future posts (and thesis) will mainly use rank. You can read about it in the RRR paper.

Go Forth and…

Feel free to implement this (somewhat complicated) data structure yourself, or you can use a pre-rolled one by my friend Francisco Claude in his LIBCDS – Compressed Data Structure Library.

If you read this far, consider adding me to twitter :) or you may enjoy reading my post on Wavelet Trees.

Generating Binary Permutations in Popcount Order

Media_httpalexbowes3a_mpcoj

I’ve been keeping an eye on the search terms that land people at my site, and although I get the occasional “alex bowe: fact or fiction” and “alex bowe bad ass phd student” queries (the frequency strangely increased when I mentioned this on Twitter) I also get some queries that relate to the actual content.

One query I received recently was “generating integers in popcount order”, I guess because I mentioned popcounts (the number of 1-bits in a binary string) in a previous post, but the post wasn’t able to answer that visitors question.

What would this be used for? Among other applications, I have used it for generating a table of numbers ordered by popcount, which I used in a compression algorithm: by breaking a bitstring into fixed-length chunks (of B bits) and replacing them with a (P, O) pair, where P is the block’s popcount which can be used to point to the table where each entry has the popcount P, and O is the offset in that subtable. Then P can be stored with log2(B + 1) bits – we need to represent all possible P values from 0 to B – and O can be stored with log2(binomial(B, P)) bits.

Media_httpalexbowes3a_adipi

Note that the bit-length of P varies; binomial represents the binomial coefficient, which can be seen in Pascal’s triangle expanded to row 5:

Media_httpalexbowes3a_szwrn

So binomial(5, x) for x = 0, 1, … , 5 yields the sequence 1, 5, 10, 10, 5, 1 – some things take more bits than others. Once you know the P value, you will know how many bits the O value is, so you can read it that way. This means access is O(N) (since each values position relies on the previous P values), but you can build an index on top of that to allow O(1) lookup. But all this is a story for another time ;) [1].

Here is the code:

def next_perm(v):
    """
    Generates next permutation with a given amount of set bits,
    given the previous lexicographical value.
    Taken from http://graphics.stanford.edu/~seander/bithacks.html
    """
    t = (v | ( v - 1)) + 1
    w = t | ((((t & -t) / (v&-v)) >> 1) - 1)
    return w

This will take a number with a certain popcount, and generate the next number with the same popcount. For example, if you feed it 7, which is 111 in binary, you will get 1011 back – or 11 – the next number with the same popcount (lexicographically speaking).

To find the first number of a given popcount, you can use this:

def element_0(c):
    """Generates first permutation with a given amount
       of set bits, which is used to generate the rest."""
    return (1 << c) - 1

I should clear up what lexicographically means in the context of numbers. Well, it’s actually the same as any other symbols (such as an alphabet), 0 is the symbol that comes before 1 (if it helps, you can picture 0 as a and 1 as b):

00111 aabbb
01011 ababb
01101 abbab
01110 abbba
10011 baabb
10101 babab
10110 babba
11001 bbaab
11010 bbaba
11100 bbbaa

Psuedocode

Looking at the above pattern, here is some loose pseudocode that may help us understand how the above bithacks work:

  1. Set i to the position of the rightmost bit
  2. Stop if there are no set bits, or if we have looked at all the bits (i >= length of bitstring)
  3. If the i+1th bit (one place to the left) is 0: move the ith bit left
  4. Otherwise, if the i+1th bit (one place to the left) is 1: set i to i + 1 and repeat from 2.
  5. Shift the bits on the range [0, i] right so that the rightmost bit is in position 0

Explanation

Understanding element_0() is pretty easy. 1 << c is the same as moving a 1 to position c, then -1 sets all the bits from 0 to c – 1, giving c set bits:

c = 4
1 << c = 10000
10000 - 1 = 01111

next_perm() is a bit more complicated. The v | (v -1) in t = (v | (v - 1)) + 1 right-propagates the rightmost bit. Allow me to show an example: 01110 | 01101 = 01111

In the case of 0, this isn’t quite correct: 00000 | 11111 = 11111 But it’s okay because we proceed to add 1 to this value (which returns it to zero). This increment, combined with the right-propagation, will do step 2, 3, and part of step 5 above. For example, 10100 -> 10111 -> 11000.

We are on our way to generating integers by popcount in lexicographical order.

Now let’s break down the next line, w = t | ((((t & -t) / (v&-v)) >> 1) - 1). Bitwise equations of the form (x & -x) isolate the rightmost bit: 01110 & 10010 (two's complement) = 00010

If you take the two’s complement, you invert a numbers bits and then add 1. If you think about it, this means there are 0s where there were 1s, and 1s where there were 0s, and adding 1 bumps the rightmost 1 left and sets the subsequent right bits to 0. This means that the only position that will remain set in both numbers is the the rightmost 1-bit.

So let R(x) denote the isolated rightmost bit of x, then for x = 01110 we calculate t = 10000, R(x) = 00010 and R(t) = 10000.

Following the calculation of w, we need to divide them: R(t) / R(x) = 10000 / 00010 = 01000

Shift to the right by 1: 00100

Subtract 1: 00011

Then we bitwise-or them to stick them together: 10000 | 00011 = 10011. This corresponds to our table above :)

So w = t | ((((t & -t) / (v&-v)) >> 1) - 1) corresponds to the rest of step 5 (the moving part, t was the zeroing part of moving the sub-range) in our pseudocode. Well, kind of anyway, there are a few steps happening in parallel, but the pseudocode was only loosely explaining what was happening :)

Testing it out

def gen_blocks(p, b):
    """
    Generates all blocks of a given popcount and blocksize
    """
    v = initial = element_0(p)
    block_mask = element_0(b)

    while (v >= initial):
        yield v
        v = next_perm(v) & block_mask



>>> for x in gen_blocks(3, 5): print bin(x, 5)
... 
00111
01011
01101
01110
10011
10101
10110
11001
11010
11100

Note: bin is just a function I found online for printing binay numbers and isn’t important to this post, but you can find it here if you need one.

Then of course you can loop through all values of P from 0 to B to build the complete table.

Questions? Comments? Flames? I wanna hear em :)

[1] – Check out Tables by Munro, 1996, and Succinct indexable dictionaries with applications to encoding k-ary trees and multisets by Raman et al, 2002 for a first step into this stuff. Also check out my honours thesis, 2010 for a recent look at succinct data structures. I will write a blog post about this stuff sooner or later :P