Wavelet Trees
In my last post I introduced a data structure called RRR, which is used to quickly answer rank queries on binary sequences, and provide implicit compression.
Today I will talk about an elegant way of answering rank queries on sequences over larger alphabets. The structure is called the Wavelet Tree, which organises a string into a hierarchy of bit vectors. A rank query has time complexity is O(log_2 A), where A is the size of the alphabet. It was introduced by Grossi, Gupta and Vitter in their 2003 paper High-order entropy-compressed text indexes [4] (see the Further Reading section for more papers). It has since been featured in many papers [1, 2, 3, 5, 6].
If you store the bit vectors in RRR sequences, it may take less space than the original sequence. Alternatively, you could store the bit vectors in the rank indexes proposed by Sadakane and Okonohara [7]. It has a different approach to compression. I will talk about it another time ;) – fortunately, I will be studying under Sadakane-san at a later date.
In a different future post, I will show how Suffix Arrays can be used to find arbitrary patterns of length P, by issuing 2P rank queries. If using a Wavelet Tree, this means a pattern search has O(P log_2 A) time complexity, that is, the size of size of the ‘haystack’ doesn’t matter, it instead depends on the size of the ‘needle’ and size of the alphabet.
Construction
A Wavelet Tree converts a string into a balanced binary-tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This creates ambiguity, but at each level this alphabet is filtered and re-encoded, so the ambiguity lessens, until there is no ambiguity at all.
The tree is defined recursively as follows:
- Take the alphabet of the string, and encode the first half as 0, the 2nd half as 1:
{ a, b, c, d }would become{ 0, 0, 1, 1 } - Group each 0-encoded symbol,
{ a, b }, as a sub-tree; - Group each 1-encoded symbol,
{ c, d }, as a sub-tree; - Reapply this to each subtree recursively until there is only one or two symbols left (when a
0or1can only mean one thing).
For the string “Peter Piper picked a peck of pickled peppers” (spaces and a string terminator have been represented as _ and $ respectively, due to convention in the literature) the Wavelet Tree would look like this:
note: the strings aren’t actually stored, but are shown here for convenience
It has the alphabet { $, P, _, a, c, d, e, f, i, k, l, o, p, r, s, t }, which would be mapped to { 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }. So, for example, $ would map to 0, and r would map to 1.
The left subtree is created by taking just the 0-encoded symbols { $, P, _, a, c, d, e, f } and then re-encoding them by dividing this new alphabet: { 0, 0, 0, 0, 1, 1, 1, 1 }. Note that on the first level an e would be encoded as a 0, but now it is encoded as a 1 (it becomes a 0 again at a leaf node).
We can store the bit vectors in RRR structures for fast binary rank queries (which are needed, as described below), and compression :) In fact, since it is a balanced tree, we can concatenate each of the levels and store it as one single bit vector.
Querying
Recall from my last post that a rank query is the count of 1-bits up to a specified position. Rank queries over larger alphabets are analogous – instead of a 1, it may be any other symbol:
After the tree is constructed, a rank query can be done with log A (A = alphabet size) binary rank queries on the bit vectors – O(1) if you store them in RRR or another binary rank index. The encoding at each internal node may be ambiguous, but of course it isn’t useless – we use the ambiguous encoding to guide us to the appropriate sub-tree, and keep doing so until we have our answer.
For example, if we wanted to know rank(5, e), we use the following procedure which is illustrated below. We know that e is encoded as 0 at this level, so we take the binary rank query of 0 at position 5:
Which is 4, which we then use to indicate where to rank in the 0-child: the 4th bit (or the bit at position 3, due to 0-basing). We know to query the 0-child, since that is what e was encoded as at the parent level. We then repeat this recursively:
At a leaf node we have our answer. I would love to explain why this works, but it is fun and rewarding to think about it yourself ;)
There are also ways to provide fast select queries, but once again I will leave that up to you to research. The curious among you might also be interested in the Huffman-Shaped Wavelet Tree described by Mäkinen and Navarro [5].
Using Your New Powers for Good
Feel free to implement this yourself, but if you want to get your hands dirty right away, all-around-clever-guy Francisco Claude has made an implementation available in his Compressed Data Structure Library (libcds). If you create something neat with it be sure to report back ;)
And if you read this far, consider following me on Twitter: @alexbowe.
Further Reading
I didn’t want to saturate this blog with proofs and so-on, as it was meant to be a light introduction. It is also a pain typesetting math on this blog :/ If you want to learn more about this awesome structure, check out the following papers:
[1] F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 176–187. Springer, 2008.
[2] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849–866, 2009.
[3] P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Al- gorithms, 3(2):20, 2007.
[4] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th annual ACM-SIAM symposium on Dis- crete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[5] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005.
[6] V. Mäkinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 4726, pages 214–226. Springer, 2007.
[7] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. Arxiv Computing Research Repository, abs/cs/0610001, 2006.
