Failing at Google Interviews

I’ve participated in about four sets of interviews (of about 3 interviews each) for various Google positions. I’m still not a Googler though, which I guess indicates that I’m not the best person to give this advice. However, I think it’s about time I put in my 0.002372757892 bitcoins. I recently did exactly this to help my brother prepare for his interviews and the guy kicked ass. If he gets the job I’m going to take as much credit for it as I can ;)

During my interviews I didn’t sign a NDA, but I do respect the effort that interviewers put into preparing their questions so I’m not going to discuss them. That doesn’t matter though, because you probably won’t get the same questions anyway :) and the algorithm stuff is far from the whole story.

This post is mainly about the rituals I perform during preparation for the interviews, and the lessons I have learned from them. I am of the strong opinion that everyone should apply for a job at Google.

Why Should I?!

Not everyone wants to work for Google, but there are valuable side effects to a Google interview. Even if you don’t think you want a job there, or think that you are under-qualified, it is a great idea to just try for one. The absolute worst thing that could happen is that you have fun and learn something.

A couple of the things I learned are algorithms for (weighted) random sampling, queueing, vector calculus, and some cool applications of bloom filters.

The people you will talk to are smart, and it’s a fun experience to be able to solve problems with smart and passionate people. One of my interviews was just a discussion about the good and bad parts (in our opinions) of a bunch of programming languages (Scheme, Python, C, C++, Java, Erlang). We discussed SICP and the current state of education, and he recommended some research papers for me to read. All the intriguing questions and back-and-forth made me feel like I was being taught by a modern Socrates (perhaps Google should consider offering a Computer Science degree taught entirely with interviews :P).

Sadly, a subsequent interview stumped me because I didn’t understand the requirements. Even the stumping interviews have given me a great chance to realise some gaps in my knowledge and refine my approach. I knew that it was important to get the requirements right, but this really drove it home.

I hope I’ve got you curious about what you could learn from a Google interview. If you are worried about the possible rejection, treat it as a win in a game of Rejection Therapy. You can re-apply as many times as you like, so you could also think of it as TDD for your skills, and you like TDD, right?

How To Prepare – Technical

When you are accepted for a phone interview, Google sends you an email giving you tips on how to prepare. Interestingly, this has been a different list each time. I’ll discuss the one I liked the most. They only give advice on the technical side. I will also discuss what I think are some other important aspects to be mindful of.

First of all, you are going to want to practice. Even if you have been coding every day for years, you might not be used to the short question style. Project Euler is the bomb for this. You will learn some maths too, which will come in handy, and it builds confidence. Do at least one of these every day until your interview.

You will also want some reading material. Google recommended this post by Steve Yegge, which does a good job of calming you. They also recommended another post by Steve Yegge where he covers some styles of questions that are likely to be asked. Yegge recommends a particular book very highly – The Algorithm Design Manual.

More than any other book it helped me understand just how astonishingly commonplace (and important) graph problems are – they should be part of every working programmer’s toolkit. The book also covers basic data structures and sorting algorithms, which is a nice bonus. But the gold mine is the second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of useful problems and various ways to solve them, without too much detail. Almost every 1-pager has a simple picture, making it easy to remember. This is a great way to learn how to identify hundreds of problem types.

I haven’t read the whole thing, but what I have read of it is eye and mind opening. This wasn’t recommended to me directly by Google recruiting staff, but one of my interviewers emailed me a bunch of links after, including a link to the page for this book. There was a recent review of this book featured on Hacker News. It is very good. The author, Steve Skiena, also offers his lecture videos and slides – kick back and watch them with a beer after work/uni.

Media_httpcmbelllabsc_rtdya

If the size of The Algorithm Design Manual is daunting and you want a short book to conquer quickly (for morale reasons), give Programming Pearls a read. Answer as many questions in it as you can.

The phone interviews usually are accompanied by a Google doc for you to program into. I usually nominate Python as my preferred language, but usually they make me use C or C++ (they often say I can use Java too). I was rusty with my C++ syntax, but they didn’t seem to mind. I just explained things like using templates, even though I can never remember the syntax for the tricks.

Speaking of tricks, you get style points for using features of the language that are less well known. I had an interviewer say he was impressed because I used Pythons pattern matching (simple example: (a, b) = (b, a)), list comprehensions, map/reduce, generators, lambdas, and I guess decorators could help make you look cool, too. Only use them if they are useful though!

How To Prepare – Non-Technical

There will also be a few non-technical questions. When I did my first one, a friend recommended that I have answers ready for cookie-cutter questions like “Where do you see yourself in ten years?” and “Why do you want to work for Google?”. Don’t bother with that! Do you really think one of the biggest companies in the world will waste their time asking questions like that? Everyone candidate would say the same answer, something about leading a team and how Google would let you contribute to society, or whatever (great, but everyone wants that).

They will ask you about your previous work and education, though, and pretty much always ask about a technical challenge you overcame. I like to talk about a fun iterative A* search I did at my first job (and why we needed it to be iterative). You can probably think of something, don’t stress, but better to think of it before the interview.

And have a question ready for when they let you have your turn. Don’t search for “good questions to ask in technical interviews”, because if it isn’t your question, you might be uninterested if the interviewer talks about it for a long time. Think of something that you could have a discussion about, something you are opinionated about. Think of something you hated at a previous job (but don’t come across as bitter), how you would improve that, and then ask them if they do that. For me, I was interested in the code review process at Google, and what sort of project they would start a beginner be assigned to.

I know someone who asked questions from The Joel Test. The interviewer might recognise these questions and either congratulate you on reading blogs about your field, or quietly yawn to themselves. It’s up if you want to take that risk. I definitely think it’s better to ask about something that has the potential to annoy you on a personal level if they don’t give you the answer you want ;) it’s subtle but people can detect your healthy arrogance and passion.

If you have a tech blog, refer to it. I’ve had interviewers discuss my posts with me (which they found from my resume). Blogs aren’t hard to write, and even a few posts on an otherwise barren blog will make you look more thoughtful.

Finally, the absolute best way to prepare for a Google interview is to do more Google interviews, so if you fail, good for you! ;)

Just Before the Interview

Here are a few things that help me handle the pressure before an interview.

One time I was walking to an interview in the city (not a Google interview) and I was really nervous, even though I didn’t care either way if I got the job. I thought about how the nerves wouldn’t be an issue after the interview, because I’d have already done the scary thing by then. I couldn’t time travel, but I instead wondered if there is a way to use up the nerves on something else. There was a girl walking next to me, so I turned to her and said she was dressed very nicely. She said a timid “thank you” and picked up pace to get away from me. I laughed at my failure, but suddenly I didn’t feel so scared about the interview. I think this is a great example of why Rejection Therapy is worthwhile.

So yeah, talk to a stranger. If you are waiting at home for a phone call though, another thing I do is jack jumps, dancing, or jogging on the spot just to make myself forget the other reason my heart is pounding so fast.

During the Interview

If you are doing a phone interview, answer it standing up (you can sit down after) and pace around a little bit. Smile as you talk, as well. You should also take down their name on paper ready to use a few times casually. These are tricks from the infamous How to Win Friends and Influence People. Maybe these alone wont make you likeable, but I think it causes you to think about the other person and stop being so self conscious, which helps you to relax. You’ll be one charming motherfucking pig.

Take some time to think before answering, and especially to seek clarification on the questions. Ask what the data representation is. I’ve found that they tend to say “whatever you want”. In a graph question, I said “Okay, then it’s an adjacency matrix”, which made the question over and done with in ten seconds. The interviewer seemed to like that, so don’t be afraid to be a (humble) smart ass.

You might recognise the adjacency matrix as potentially being a very poor choice, depending on the nature of the graph. I did discuss when this might not be a good option. In fact, for every question, I start off by describing a naive approach, and then refine it. This helps to verify the question requirements, and gives you an easy starting point.

One last thing! Google schedules the interview to be from 45 minutes to an hour. I have had awkward moments at the end of interviews where the interviewer mentions that our time is nearly up, and then asks another question, or asks if I have any questions. It made me feel like he was in a rush, so I didn’t feel like expanding on things much. Now, I recommend taking as much time as they will give you. Keep talking until they hang up on you if you have to :) Although it might help to say “I don’t mind if we go over, as long as I’m not keeping you from something” when the interviewer mentions the time.

Reflect

Steve Yegge says there are lots of smart Googlers who didn’t get in until their third attempt (I still haven’t gotten in after my fourth, and I don’t think I’m stupid). As I mentioned, I’m writing this post because I found the process of doing a Google interview at all to be very rewarding.

It is important to reflect afterwards in order to reap the full benefits of interviewing at Google. If you did well, why? But more importantly, if you feel you did poorly, why? Google won’t give feedback, which can be a bit depressing at times. After each interview write notes about what you felt went well and what didn’t – this way you can look back if you don’t get the job, and decide what you need to work on. This post is the culmination of my reflections and the notes – if you decide to write a blog post, I’d enjoy reading it and will link it here.

If you want more blog posts to read about how to get better at Computer Science, I recently found this post by Matt Might to be a good target to aim for. Check out Ten Things Every Computer Science Major Should Learn as well, and my previous post Advice to CS Undergrads (the links at the end in particular).

Have you really read this far? Consider adding me to Twitter and telling me what you thought :)

FM-Indexes and Backwards Search

Media_httpalexbowes3a_mxdzx

Last time (way back in June! I have got to start blogging consistently again) I discussed a gorgeous data structure called the Wavelet Tree. When a Wavelet Tree is stored using RRR sequences, it can answer rank and select operations in O(log A) time, where A is the size of the alphabet. If the size of the alphabet is 2, we could just use RRR by itself, which answers rank and select in O(1) time for binary strings. RRR also compresses the binary strings, and hence compresses a Wavelet Tree which is stored using RRR.

So far so good, but I suspect rank and select queries seem to be of limited use right now (although once you are familiar with the available structures, applications show up often). One of the neatest uses of rank that I’ve seen is in substring search, which is certainly a wide reaching problem (for a very recent application to genome assembly, see Jared Simpson’s paper from 2010 called Efficient construction of an assembly string graph using the FM-index)

Also, my apologies but I am using 1-basing instead of 0-basing, because that is how I did my diagrams a year ago :) (and bear with my lack of nicely typeset math, I am migrating to GitHub pages where I will be allowed to use Mathjax soon)

Suffix Arrays

There is a variety of Suffix Array construction algorithms, including some O(N) ones (Puglisi et al. 2007). However, I will explain it from the most common (and intuitive) angle.

In its simplest form, a suffix array can be constructed for a string S[1..N] like so:

  1. Construct an array of pointers to all suffixes S[1..N], S[2..N], …, S[N..N].
  2. Sort these pointers by the lexicographical (i.e. alphabetical) ordering of their associated suffixes.

Media_httpalexbowes3a_giuel

For example, the sorting of the string ‘mississippi’ with terminating character $ would look like this:

Media_httpalexbowes3a_qbwxw

Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a was developed by Burrows and Wheeler to reversibly permute a string in such a way that characters from repeated substrings would be clustered together. It was useful for compression schemes such as run-length encoding.

It is not the point of this blog to explain how it works, but it is closely linked to Suffix Arrays: BWT[i] = S[SA[i] – 1, BWT[1] = $ (it wraps around) for the original string S, Suffix Array SA, and Burrows-Wheeler Transform string BWT. In other words, the ith symbol of the BWT is the symbol just before the ith suffix. See the image below:

In particular, BWT[1] = S[SA[1] – 1] = S[12 – 1] = S[11] = ‘i’ (or the 11th symbol from the original string ‘mississippi’)

Ferragina and Manzini (Ferragina et al. 2000) recommended that a BWT be paired with a Suffix Array, creating the so-called FM-Index, which enables backward search. The BWT also lets us reconstruct the original string S (not covered in this blog), allowing us to discard the original document – indexes with this property are known as self indexes.

Media_httpalexbowes3a_rejrx

Backward Search

This is where rank comes in. If it is hard to follow (it is certainly not easy to explain) then hang in there until the example, which should clear things up.

Since any pattern P in S (the original string) is a prefix of a suffix (our Suffix Array stores suffixes), and because the suffixes are lexicographically ordered, all occurrences of a search pattern P lie in a contiguous portion of the Suffix Array. One way to hone in on our search term is to use successive binary searches. Storing the BWT lets us use a cooler way, though…

Backward search instead utilises the BWT in a series of paired rank queries (which can be answered with a Wavelet Tree, for example), improving the query performance considerably. Backward search issues p pairs of rank queries, where p denotes the length of the pattern P. The paired rank queries are:

Media_httpmathurlcom4_aychb

Where s denotes the start of the range and e is the end of the range. Initially s = 1 and e = N. If at any stage e < s, then P doesn’t exist in S.

As for C… C is a lookup table containing the count of all symbols in our alphabet which sort lexicographically before P[i]. What does this mean? Well, C would look like this:

Media_httpalexbowes3a_dauos

Which means that there aren’t any characters in S that sort before ‘$’, one that sorts before ‘i’ (the ‘$’), five that sort before m (the ‘$’ and the ‘i’s) and so on. In the example I store it in a less compact way as the column F (which contains the first symbol for each suffix – essentially the same information, since each suffix is sorted), so it might be easier to follow (wishful thinking).

Why is this called backwards search? Well, our index variable i actually starts at |P| (the last character of our search pattern), and decreases to 1. This maintains the invariant that SA[s..e] contains all the suffixes of which P[i..|P|] is a prefix, and hence all locations of P[i..|P|] in S.

Example

Let’s practice this magic spell… Let our search pattern P be ‘iss’, and our string S be ‘mississippi’. Starting with i = 3, c = P[i] = ’s'. The working for each rank query is shown below each figure. I’m representing the current symbol as c to avoid confusion between ‘s’ and s and s′.

Media_httpalexbowes3a_cvian

Here (above) we are at the first stage of backwards search for ‘iss’ on ‘mississippi’ string – before any rank queries have been made. Note: we do not store the document anymore – the gray text – and we don’t store F, but instead store C – see section on Backward Search.

Starting from s=1 and e=12 (as above) and c = P[i] = ’s' where i = 3, we make our first two rank queries:

Media_httpmathurlcom3_jqexj

Media_httpalexbowes3a_ycebo

After the above, we are now at the second stage of backwards search for ‘iss’ on ‘mississippi’ string. All the occurrences of ’s' lie in SA[9..12].

From s = 9 and e = 11, and c = P[i] = ’s' where i = 2, our next two rank queries are:

Media_httpmathurlcom3_sdxgn

Media_httpalexbowes3a_arabe

We are now at the third stage of backwards search for ‘iss’ on ‘mississippi’ string. All the occurrences of ‘ss’ lie in SA[11..12].

From s = 11 and e = 12, and c = P[i] = ‘i’ where i = 1, our final two rank queries are:

Media_httpmathurlcom3_jocuc

Media_httpalexbowes3a_pjtdo

This is the fourth and final stage of our backwards search for ‘iss’ in the string ‘mississippi’. All the occurrences of ‘iss’ lie in SA[4..5].

It impresses me every time…

Play Time

No doubt you want to get your hands dirty. I have played around with libdivsufsort before, although I think you may have to implement backward search yourself (it’d be a good exercise), since it doesn’t appear to come with fast rank query providers. For rank structures for your BWT you might want to check out libcds. In fact there are heaps out there, but I haven’t used any others. Hopefully someone will comment below with a good recommendation of this.

Also, please comment here if you develop something cool with it :) and as always, if you have journeyed this far, consider following me on Twitter: @alexbowe.

Bibliography

Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, pages 390–398.

S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):1–31, 2007. Jared Simpson’s paper from 2010 called *Efficient construction of an assembly string graph using the FM-index.

Simpson, J. T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373.

Wavelet Trees

Media_httpalexbowes3a_qjefe

In my last post I introduced a data structure called RRR, which is used to quickly answer rank queries on binary sequences, and provide implicit compression.

Today I will talk about an elegant way of answering rank queries on sequences over larger alphabets. The structure is called the Wavelet Tree, which organises a string into a hierarchy of bit vectors. A rank query has time complexity is O(log_2 A), where A is the size of the alphabet. It was introduced by Grossi, Gupta and Vitter in their 2003 paper High-order entropy-compressed text indexes [4] (see the Further Reading section for more papers). It has since been featured in many papers [1, 2, 3, 5, 6].

If you store the bit vectors in RRR sequences, it may take less space than the original sequence. Alternatively, you could store the bit vectors in the rank indexes proposed by Sadakane and Okonohara [7]. It has a different approach to compression. I will talk about it another time ;) – fortunately, I will be studying under Sadakane-san at a later date.

In a different future post, I will show how Suffix Arrays can be used to find arbitrary patterns of length P, by issuing 2P rank queries. If using a Wavelet Tree, this means a pattern search has O(P log_2 A) time complexity, that is, the size of size of the ‘haystack’ doesn’t matter, it instead depends on the size of the ‘needle’ and size of the alphabet.

Construction

A Wavelet Tree converts a string into a balanced binary-tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This creates ambiguity, but at each level this alphabet is filtered and re-encoded, so the ambiguity lessens, until there is no ambiguity at all.

The tree is defined recursively as follows:

  1. Take the alphabet of the string, and encode the first half as 0, the 2nd half as 1: { a, b, c, d } would become { 0, 0, 1, 1 }
  2. Group each 0-encoded symbol, { a, b }, as a sub-tree;
  3. Group each 1-encoded symbol, { c, d }, as a sub-tree;
  4. Reapply this to each subtree recursively until there is only one or two symbols left (when a 0 or 1 can only mean one thing).

For the string “Peter Piper picked a peck of pickled peppers” (spaces and a string terminator have been represented as _ and $ respectively, due to convention in the literature) the Wavelet Tree would look like this:

Media_httpalexbowes3a_odjnn

note: the strings aren’t actually stored, but are shown here for convenience

It has the alphabet { $, P, _, a, c, d, e, f, i, k, l, o, p, r, s, t }, which would be mapped to { 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }. So, for example, $ would map to 0, and r would map to 1.

The left subtree is created by taking just the 0-encoded symbols { $, P, _, a, c, d, e, f } and then re-encoding them by dividing this new alphabet: { 0, 0, 0, 0, 1, 1, 1, 1 }. Note that on the first level an e would be encoded as a 0, but now it is encoded as a 1 (it becomes a 0 again at a leaf node).

We can store the bit vectors in RRR structures for fast binary rank queries (which are needed, as described below), and compression :) In fact, since it is a balanced tree, we can concatenate each of the levels and store it as one single bit vector.

Querying

Recall from my last post that a rank query is the count of 1-bits up to a specified position. Rank queries over larger alphabets are analogous – instead of a 1, it may be any other symbol:

Media_httpalexbowes3a_wfllf

After the tree is constructed, a rank query can be done with log A (A = alphabet size) binary rank queries on the bit vectors – O(1) if you store them in RRR or another binary rank index. The encoding at each internal node may be ambiguous, but of course it isn’t useless – we use the ambiguous encoding to guide us to the appropriate sub-tree, and keep doing so until we have our answer.

For example, if we wanted to know rank(5, e), we use the following procedure which is illustrated below. We know that e is encoded as 0 at this level, so we take the binary rank query of 0 at position 5:

Media_httpalexbowes3a_ahscp

Which is 4, which we then use to indicate where to rank in the 0-child: the 4th bit (or the bit at position 3, due to 0-basing). We know to query the 0-child, since that is what e was encoded as at the parent level. We then repeat this recursively:

Media_httpalexbowes3a_bdgct

At a leaf node we have our answer. I would love to explain why this works, but it is fun and rewarding to think about it yourself ;)

There are also ways to provide fast select queries, but once again I will leave that up to you to research. The curious among you might also be interested in the Huffman-Shaped Wavelet Tree described by Mäkinen and Navarro [5].

Using Your New Powers for Good

Feel free to implement this yourself, but if you want to get your hands dirty right away, all-around-clever-guy Francisco Claude has made an implementation available in his Compressed Data Structure Library (libcds). If you create something neat with it be sure to report back ;)

And if you read this far, consider following me on Twitter: @alexbowe.

Further Reading

I didn’t want to saturate this blog with proofs and so-on, as it was meant to be a light introduction. It is also a pain typesetting math on this blog :/ If you want to learn more about this awesome structure, check out the following papers:

[1] F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 176–187. Springer, 2008.

[2] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849–866, 2009.

[3] P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Al- gorithms, 3(2):20, 2007.

[4] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th annual ACM-SIAM symposium on Dis- crete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.

[5] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005.

[6] V. Mäkinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 4726, pages 214–226. Springer, 2007.

[7] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. Arxiv Computing Research Repository, abs/cs/0610001, 2006.

YaRRR Me Hearties - a post about a succinct data structure (not pirates, sorry)

Media_httpalexbowes3a_igeiq

This blog post will give an overview of a static bitsequence data structure known as RRR, which answers arbitrary length rank queries in O(1) time, and provides implicit compression.

As my blog is informal, I give an introduction to this structure from a birds eye view. If you want, read my thesis for a version with better markup, and follow the citations for proofs by people smarter than myself :)

My intended future posts will cover the other aspects of my thesis, including generalising RRR (for sequences over small alphabets), Wavelet Trees (which answer rank queries over bigger alphabets), and Suffix Arrays (a text index which – when combined with the above structures – can answer queries in O(P log A) time, when P is the length of the search pattern, and A is the alphabet size).

Update: I have now posted about Wavelet Trees! Check it out here.

Example Problem

Cracking the Oyster, the first column of Programming Pearls, opens with a programmer asking for advice when sorting around ten million unique seven-digit integers – phone numbers.

After some discussion, the author concludes that a bitmap should be used. If we wanted to store ten million integers, we could use an array of 32-bit integers, consuming 38 MB, or we could represent our numbers as positions on a number line.

All of these phone numbers will be within the range [0000000, 9999999]. To represent the presence of these numbers, we only need a bitmap 107 bits long, about 1 MB, which would represent our number line. Then, for a bitmap M, if we want to store phone number p, we set the bit M[p] to 1. Sorting would involve setting the numbers that are present to 1, then iterating over the bitmap, printing the positions of the 1-bits – O(N) time.

In the following sections, I will detail operations that can be done on bitmaps, named rank and select, and explain how to answer rank queries in O(1) time, and implicitly compress the bitmap. Using rank and select, a compressed bitmap can be a very powerful way to store sets. This isn’t limited to just sets of numbers, all sorts of things, such as sets of graph nodes for example; A friend of mine is using succinct bitmaps to represent De Bruijn graphs, which are used in genome assembly.

Extension: Rank

Allow me to extend the problem. I want to query our simple phone number database to see how many phone numbers are allocated within the range $[0005000, 0080000]$. I could iterate over that range and update a counter whenever I encounter a 1-bit. Actually, this operation is what is known as a rank operation.

Media_httpalexbowes3a_trqyt

The operation rank(i) is defined as the number of set bits (1s) in the range [0, i] (or [0, i) in some papers). In the bitstring above, the answer to rank(5) is 3… This is a generalisation of the popcount operation which counts all set bits, which I have discussed before. rank(i) can be implemented by left-shifting L - i bits (where L is the length of the datatype you are using, int, long, etc) to remove the unwanted bits, then calling popcount on the resulting value. This could be done iteratively over an array if you want, but I will discuss a much faster way below.

Then, the above question can be answered as: rank(0080000) - rank(0005000 - 1). This will give us just the number of 1s between 0005000 and 0080000.

This isn’t the only place we would use a popcounts; it happens that popcounts are common enough that we want to optimise them. Check out this blog post at valuedlessons.com for a discussion and empirical comparison of several fast approaches.

RRR

As it happens, we can build a data structure for static bitmaps that answers rank queries in O(1) time, and provides implicit compression. It is what is known as a succinct data structure, which means that even though it is compressed, we don’t need to decompress the whole thing t operate on it efficiently. Sadakane (a big name in succinct data structures) gives a nice analogy in his introduction of the field, likening it to forcing dehydrated noodles apart with your chopsticks (decompression) as you are rehydrating them, but before the whole thing is fully cooked and separated. This allows you to keep some of the noodles compressed while you eat the decompressed fragment.

Since it is static it isn’t well suited for a bitmap which you want to update (although work has been done toward this), it is still really cool :)

The structure I’m referring to is named RRR. It sounds like a radio station, but it is named after its creators: Raman, Raman, Rao, from their 2002 paper Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. Its a data structure I had to become intimately involved with for my honours thesis, where I extended it for sequences of larger (but still small) alphabets. If you want to answer rank queries on large alphabets, a wavelet tree might be what you are after, but that will be covered in a different blog post (or you could read my thesis!).

In my last post (Generating Binary Permutations in Popcount Order) I discussed how to compress a bitstring by replacing blocks of a certain blocksize with their corresponding pop number, and (variable length) offset into a lookup table. I briefly mentioned building an index over it to improve lookup as well.

RRR: Construction

To construct a RRR sequence we divide our bitmap into blocks, as I mentioned in my previous blog post. These are grouped in superblocks, too, which allows us to construct an index to enable O(1) rank queries. In the following image, I have fragmented the bitmap using a blocksize of b = 5, and grouped them with a superblock factor of f = 3 – so each superblock is three blocks.

Media_httpalexbowes3a_qeoif

First we replace the blocks with a pair of values, a class value C and offset value O, which are used together as a lookup key into a table of precomputed (small – for each possible block only) ranks – this is demonstrated in the figure below. This is the same as the previous blog post, although in that I called the “class” P. This is because the class of a block is defined as the popcount – the number of set bits – in the block: class(B) = popcount(B) for block B.

Media_httpalexbowes3a_gbopg

The table is shared among all your RRR sequences, and is in fact a table of tables, where C points to the first element for the ranks of a given popcount:

Media_httpalexbowes3a_cjgfp

For this table (let’s call it G), for a given class C, the sub-table at G[C] has b Choose C entries, which correspond to all possible permutations that have a popcount of C. This means that while our C values will always be log(b + 1) (the number of bits to represent values 0, 1, 2… b – these are all possible popcount values for the blocksize), but our O values will vary in size, requiring log(b Choose C) bits (oh yeah, and of course I’m using log_2 here :)). During a query, we can use our C values to work out how many bits will follow for the O values.

Using this approach alone we get the compression, but not O(1) ranks. C is fixed width, the compression comes from O being varied width.

In order to get the O(1) ranks we use a method discussed by Munro in Tables, 1996. This is where the superblocks come in to play:

Media_httpalexbowes3a_jjdii

For each superblock boundary we store the global rank up to that position. We also store a prefix sum of the bits, which gives us the address to the first block in the next superblock (since it is variable length!). This allows us to not require iterating over the whole RRR sequence, but instead going straight to the required superblock. We will only need to iterate over the blocks within a superblock, so it is now bound by whatever your superblock factor is.

RRR: Querying

To calculate rank(i):

  1. Calculate which block our index is in as i_b = i/b. (i_b is the global index of the block)
  2. Calculate which superblock our block resides in as i_s = i_b/f. (i_s is the index of the superblock)
  3. Set result to the sum of previous ranks at is boundary (which is pre- calculated).
  4. Using each blocks class-offset pair (c,o) after the boundary at is, add the rank for that entire block to result. 5. Repeat previous step until we reach i_b. We then add rank(j,c) (from i_b, not the global rank) to our result,where j = i mod b, and is the position we are querying local to i_b. Our final answer is result.

Select

Media_httpalexbowes3a_heucn

Select is the inverse operation to rank; it answers the question “at which position is the ith set bit?”. To tie this in with the phone numbers example, maybe we want to find out the fiftieth phone number in the set (excluding unassigned numbers). This is a way we can index just the present elements of a bitmap. It turns out select can be answered in O(1) time as well. I won’t cover select here, as my future posts (and thesis) will mainly use rank. You can read about it in the RRR paper.

Go Forth and…

Feel free to implement this (somewhat complicated) data structure yourself, or you can use a pre-rolled one by my friend Francisco Claude in his LIBCDS – Compressed Data Structure Library.

If you read this far, consider adding me to twitter :) or you may enjoy reading my post on Wavelet Trees.