Generating Binary Permutations in Popcount Order

Media_httpalexbowes3a_mpcoj

I’ve been keeping an eye on the search terms that land people at my site, and although I get the occasional “alex bowe: fact or fiction” and “alex bowe bad ass phd student” queries (the frequency strangely increased when I mentioned this on Twitter) I also get some queries that relate to the actual content.

One query I received recently was “generating integers in popcount order”, I guess because I mentioned popcounts (the number of 1-bits in a binary string) in a previous post, but the post wasn’t able to answer that visitors question.

What would this be used for? Among other applications, I have used it for generating a table of numbers ordered by popcount, which I used in a compression algorithm: by breaking a bitstring into fixed-length chunks (of B bits) and replacing them with a (P, O) pair, where P is the block’s popcount which can be used to point to the table where each entry has the popcount P, and O is the offset in that subtable. Then P can be stored with log2(B + 1) bits – we need to represent all possible P values from 0 to B – and O can be stored with log2(binomial(B, P)) bits.

Media_httpalexbowes3a_adipi

Note that the bit-length of P varies; binomial represents the binomial coefficient, which can be seen in Pascal’s triangle expanded to row 5:

Media_httpalexbowes3a_szwrn

So binomial(5, x) for x = 0, 1, … , 5 yields the sequence 1, 5, 10, 10, 5, 1 – some things take more bits than others. Once you know the P value, you will know how many bits the O value is, so you can read it that way. This means access is O(N) (since each values position relies on the previous P values), but you can build an index on top of that to allow O(1) lookup. But all this is a story for another time ;) [1].

Here is the code:

def next_perm(v):
    """
    Generates next permutation with a given amount of set bits,
    given the previous lexicographical value.
    Taken from http://graphics.stanford.edu/~seander/bithacks.html
    """
    t = (v | ( v - 1)) + 1
    w = t | ((((t & -t) / (v&-v)) >> 1) - 1)
    return w

This will take a number with a certain popcount, and generate the next number with the same popcount. For example, if you feed it 7, which is 111 in binary, you will get 1011 back – or 11 – the next number with the same popcount (lexicographically speaking).

To find the first number of a given popcount, you can use this:

def element_0(c):
    """Generates first permutation with a given amount
       of set bits, which is used to generate the rest."""
    return (1 << c) - 1

I should clear up what lexicographically means in the context of numbers. Well, it’s actually the same as any other symbols (such as an alphabet), 0 is the symbol that comes before 1 (if it helps, you can picture 0 as a and 1 as b):

00111 aabbb
01011 ababb
01101 abbab
01110 abbba
10011 baabb
10101 babab
10110 babba
11001 bbaab
11010 bbaba
11100 bbbaa

Psuedocode

Looking at the above pattern, here is some loose pseudocode that may help us understand how the above bithacks work:

  1. Set i to the position of the rightmost bit
  2. Stop if there are no set bits, or if we have looked at all the bits (i >= length of bitstring)
  3. If the i+1th bit (one place to the left) is 0: move the ith bit left
  4. Otherwise, if the i+1th bit (one place to the left) is 1: set i to i + 1 and repeat from 2.
  5. Shift the bits on the range [0, i] right so that the rightmost bit is in position 0

Explanation

Understanding element_0() is pretty easy. 1 << c is the same as moving a 1 to position c, then -1 sets all the bits from 0 to c – 1, giving c set bits:

c = 4
1 << c = 10000
10000 - 1 = 01111

next_perm() is a bit more complicated. The v | (v -1) in t = (v | (v - 1)) + 1 right-propagates the rightmost bit. Allow me to show an example: 01110 | 01101 = 01111

In the case of 0, this isn’t quite correct: 00000 | 11111 = 11111 But it’s okay because we proceed to add 1 to this value (which returns it to zero). This increment, combined with the right-propagation, will do step 2, 3, and part of step 5 above. For example, 10100 -> 10111 -> 11000.

We are on our way to generating integers by popcount in lexicographical order.

Now let’s break down the next line, w = t | ((((t & -t) / (v&-v)) >> 1) - 1). Bitwise equations of the form (x & -x) isolate the rightmost bit: 01110 & 10010 (two's complement) = 00010

If you take the two’s complement, you invert a numbers bits and then add 1. If you think about it, this means there are 0s where there were 1s, and 1s where there were 0s, and adding 1 bumps the rightmost 1 left and sets the subsequent right bits to 0. This means that the only position that will remain set in both numbers is the the rightmost 1-bit.

So let R(x) denote the isolated rightmost bit of x, then for x = 01110 we calculate t = 10000, R(x) = 00010 and R(t) = 10000.

Following the calculation of w, we need to divide them: R(t) / R(x) = 10000 / 00010 = 01000

Shift to the right by 1: 00100

Subtract 1: 00011

Then we bitwise-or them to stick them together: 10000 | 00011 = 10011. This corresponds to our table above :)

So w = t | ((((t & -t) / (v&-v)) >> 1) - 1) corresponds to the rest of step 5 (the moving part, t was the zeroing part of moving the sub-range) in our pseudocode. Well, kind of anyway, there are a few steps happening in parallel, but the pseudocode was only loosely explaining what was happening :)

Testing it out

def gen_blocks(p, b):
    """
    Generates all blocks of a given popcount and blocksize
    """
    v = initial = element_0(p)
    block_mask = element_0(b)

    while (v >= initial):
        yield v
        v = next_perm(v) & block_mask



>>> for x in gen_blocks(3, 5): print bin(x, 5)
... 
00111
01011
01101
01110
10011
10101
10110
11001
11010
11100

Note: bin is just a function I found online for printing binay numbers and isn’t important to this post, but you can find it here if you need one.

Then of course you can loop through all values of P from 0 to B to build the complete table.

Questions? Comments? Flames? I wanna hear em :)

[1] – Check out Tables by Munro, 1996, and Succinct indexable dictionaries with applications to encoding k-ary trees and multisets by Raman et al, 2002 for a first step into this stuff. Also check out my honours thesis, 2010 for a recent look at succinct data structures. I will write a blog post about this stuff sooner or later :P

Some Lazy Fun

Media_httpianumcesedu_imcvu

Image taken from http://ian.umces.edu/

Update: fellow algorithms researcher Francisco Claude just posted a great article about using lazy evaluation to solve Tic Tac Toe games in Common Lisp. Niki (my brother) also wrote a post using generators with asynchronous prefetching to hide IO latency. Worth a read I say!

I’ve recently been obsessing over this programming idea called streams (also known as infinite lists or generators), which I learned about from the Structure and Interpretation of Computer Programs book. It is kind of like an iterator that creates its own data as you go along, and it can lead to performance increases and wonderfully readable code when you utilise them with higher order functions such as map and reduce (which many things can be rewritten in). It also allows you to express infinitely large data structures.

When regular lists are processed with higher order functions, you need to compute the entire list at each stage; if you have 100 elements, and you map a function to them, then filter them, then partially reduce them, you may be doing up to 300 operations, but what if you only want to take the first 5 elements of the result? That would be a waste, hence streams are sometimes a better choice.

Although SICP details how to do it in Scheme, in this blog post I will show some languages that have it built in - Haskell and Python - and how to implement streams yourself if you ever find yourself in a language without it1.

Haskell

Haskell is a lazy language. It didn’t earn this reputation from not doing the dishes when you ask it to (although that is another reason it is lazy). What it means in the context of formal languages is that evaluation is postponed until absolutely necessary (Here is a cute (illustrated) blog post describing this lazy evaluation stuff). Take this code for example:

Prelude> let x = [1..10]

At this stage you might be tempted to say that x is the list of numbers from 1 to 10. Actually it only represents a promise that when you need that list, x is your guy. The above code that creates a list from 1 to 10 still hasn’t been executed until I finally ask it to be (by referring to x):

Prelude> x
[1,2,3,4,5,6,7,8,9,10]

It is kind of like telling your mum you’ll do the dishes, but waiting until she shouts your name out again before you put down your DS. Actually, it is sliiiiightly different - if I instead wrote:

Prelude> let x = [1..10]
Prelude> let y = x ++ [11..20]

I have referred to x again when I declared y, but x still hasn’t evaluated. Only after I shout y’s name will y shout x’s name and give me back my whole list:

Prelude> y
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

Here you ask your robot to wash half the dishes, but he is too busy playing DS too (stupid robot). Finally when your mum shouts, you shout at the robot, and he does his set of dishes, and you do yours. But what is the benefit here? It isn’t that I can get more DS time in…

Take for example a list of positive integers from 1. Yes, all of them. In other languages it might be hard to express this, but in Haskell it is as simple as [1..]. This means we have a list of infinite integers, but we will only calculate as far as we need:

Prelude> let z = [1..]
Prelude> head z
1
Prelude> take 5 z
[1,2,3,4,5]

The syntax here is amazingly terse, and it may make your code more efficient. But even if we don’t have the syntax for it in another language, we can provide it ourselves very easily.

Python

Python has a similar concept called generators, which are made using the yield keyword in place of return, more than one time (or in a loop) in a function:

def integers_from(N):
     while(1):
         yield N
         N += 1

>>> z = integers_from(1)
>>> z.next()
1
>>> z.next()
2

Note: Python generators are stateful and are hence slightly different to an infinite list in Haskell. For example z.next() returns different values in two places, and thus is time sensitive - we cannot get z to ‘rewind’ like we could in Haskell, where z is stateless. Statelessness can lead to easier to understand code, among other benefits.

Rolling Our Own

Let’s reinvent this wheel in Python (but in a stateless manner), so if we ever find ourselves craving infinite lists we can easily roll our own in pretty much any language with Lambdas.

I have chosen Python to implement this, even though it already supports infinite lists through generators, simply because its syntax is more accessible. Indeed, the below can already be done with Python’s built-in-functions (although with state). It is probably not a great idea to do it this way in Python, as it doesn’t have tail call optimisation (unless you use this hack using decorators and exceptions).

First we’ll look at adding lazy evaluation, however the syntax requires it to be explicit:

>>> x = lambda: 5
>>> y = lambda: 2 + x()

Here, x is not 5, and y is not 7, they are both functions that will evaluate to that when we finally run them; the expression inside the lambda won’t be evaluated until we do so explicitly:

>>> x()
5
>>> y()
7

And that’s pretty much all the heavy lifting. To make an infinite list, we basically make a linked list where we generate each node as we need it:

def integers_from(N): return (N, lambda: integers_from(N+1))

def head((H, _)): return H

def tail((_, T)): return T()

And there is our infinite list. To access it use head() and tail() (recursively if necessary):

>>> z = integers_from(1)
>>> head(z)
1
>>> head(tail(z))
2

Helper Functions

First we should make a way for us to look at our streams:

def to_array(stream):
    return reduce(lambda a, x: a + [x], [], stream)

Which is a reduce operation that puts each head element into an array (which is carried along as a parameter to reduce()). Here is reduce() (map() can be found in this gist):

null_stream = (None, None)
def reduce(f, result, stream):
    if stream is null_stream: return result
    return reduce(f, f(result, head(stream)), tail(stream))

We needed some way to tell if we had reached the end of a stream - not all streams are infinitely long. Meet our next function, which will help us terminate a stream:

def take(N, stream):
    if N <= 0 or stream is null_stream: return null_stream
    return (head(stream), lambda: take(N-1, tail(stream)))

This will take the first N elements from the specified stream. So now we can inspect the first N elements:

>>> to_array(take(10, integers_from(1)))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

For our upcoming example, we also need a filter() method, which will filter out elements that meet a provided predicate:

def filter(pred, stream):
    if pred(head(stream)):
        return (head(stream), lambda: filter(pred, tail(stream)))
    return filter(pred, tail(stream))

Now onto our example :)

Textbook Example

Here is the standard example to demonstrate the terseness of streams:

def sieve(stream):
    h = head(stream)
    return (h, lambda: sieve(
        filter(lambda x: x%h != 0, tail(stream))))

Here is a function which recursively filters anything which is divisible by any number we have previously seen in our stream. Math aficionados will notice that this is the Sieve of Eratosthenes algorithm.

>>> primes = sieve(integers_from(2))
>>> to_array(take(10, primes))
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Recursively defined data, and only as much of it as we want - pretty neat.

And Now For Something Almost Completely Different

When I first saw this, I wondered what application there might be to have a stream of functions. Here I have defined a stream which recursively applies a function to itself:

def rec_stream(f):
    return (f, lambda: rec_stream(lambda x: f(f(x))))

When might this be useful? It might yield speed improvements if you commonly want to recursively apply a function a certain amount of times, but have costly branching (so the condition check at each level is slow). It could also be used as a abstraction for recursive iteration 2, which gives you back the function 'already recursed' so to speak (although lazily).

One such recursive process I can think of is Newton’s method for approximating roots, defined recursively as:

Media_httpmathurlcom3_ghfkj

When f(x) = 0.

The more iterations you do the more accurate the solution becomes. One use of Newton’s method is to use it until you have reached a certain error tolerance. Another way, which I learned about recently when reading about the fast inverse square root algorithm, which uses just one step of Newton’s method as a cheap way to improve it’s (already pretty good) initial guess. There is a really great article here which explains it very well.

After reading that, I wondered about a stream that would consist of functions of increasing accuracy of Newton’s method.

def newton(f, fdash):
    return lambda x: x - f(x)/float(fdash(x))

The newton() function accepts f(x) and f'(x), and returns a function that accepts a first guess.

def newton_solver(iters, f, fdash):
    def solve(v):
        n = newton(lambda x: f(x) - v, fdash)
        stream = rec_stream(n)
        return to_array(take(iters, stream))[-1]
    return solve

This one is a little more complicated. In order to have it solve for a value other than zero, I needed to either define it in f(x), since f(x) must equal zero, but I didn’t want the user to have to iterate over the stream each time they wanted to compute the square root of a different number, say. To allow it to return a function that solved for square roots in the general case, I had to make the internal function solve(), which would bind for the value the caller specifies, hence solving f(x) = v for x. Hopefully this becomes clearer with an example:

>>> sqrt = newton_solver(1, lambda x: x**2, lambda x: 2*x) # 1 iter
>>> sqrt(64)(4) # Sqrt of 64 with initial guess of 4
10.0
>>> sqrt = newton_solver(3, lambda x: x**2, lambda x: 2*x) # 3 iters
>>> sqrt(64)(4)
8.000000371689179

Now we can pass around this square root function and it will always do 3 iterations of Newton's method.

This may not be practical unless compilers can optimise the resulting function (or if there is a way to do the reduction myself easily), but it was fun to do :) As always comments and suggestions are appreciated. If anyone who reads this is good with compilers, advice would be great :D

What you can do now is read SICP for more cool things like streams and functional programming, or check out Sean Anderson’s bit hacks page for more cool hacks like the fast inverse square root. Or refactor your code to use map, reduce and streams :)


  1. The reason I have chosen Python for this exercise is for reasons of accessibility. Here is a post about implementing streams in Ruby, and here is one for Erlang :) but of course it’s all pretty much the same deal. ↩

  2. If a compiler could optimise this, simplifying the reapplied function, but keeping the generality, that’d be really cool :) I don’t think many compilers would/could do that for lambdas though. Any information would be great. ↩

Au Naturale

Media_httpdatawhicdnc_apuda

This blog post is an introduction on how to make a key phrase extractor in Python, using the Natural Language Toolkit (NLTK).

But how will a search engine know what it is about? How will this document be indexed correctly? A human can read it and tell that it is about programming, but no search engine company has the money to pay thousands of people to classify the entire Internet for them. Instead they must reasonably predict what a human may decide to be the key points of a document. And they must automate this.

Remember how proper sentences need to be structured with a subject and a predicate? A subject could be a noun, or a adjective followed by a noun, or a pronoun… A predicate may be or include a verb… We can take a similar approach by defining our key phrases in terms of what types of words (or parts-of-speech) they are, and the pattern in which they occur.

But how do we know what words are nouns or verbs in an automated fashion?

Throughout this post I will use an excerpt from Zen and the Art of Motorcycle Maintenance as an example:

The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital computer or the gears of a cycle transmission as he does at the top of a mountain or in the petals of a flower. To think otherwise is to demean the Buddha…which is to demean oneself.

Before proceeding, make a (mental) note of the key phrases here. What is the document about?

Tokenizing

In a program, text is represented as a string of characters. How can we go about moving one level of abstraction up, to the level of words, or tokens? To tokenize a sentence you may be tempted to use Python’s .split() method, but this means you will need to code additional rules to remove hyphens, newlines and punctuation when appropriate.

Thankfully the Natural Language Toolkit (NLTK) for Python provides a regular expression tokenizer. There is an example of it (including how it fares against Pythons regular expression tokenization method) in Chapter 3 of the NLTK book. It also allows you to have comments:

#Word Tokenization Regex adapted from NLTK book
sentence_re = r'''(?x)      # set flag to allow comments in regexps
# abbreviations, e.g. U.S.A. (with optional last period)
([A-Z])(\.[A-Z])+\.?
# words with optional internal hyphens
| \w+(-\w+)*
# currency and percentages, e.g. $12.40, 82%
| \$?\d+(\.\d+)?%?
# ellipsis
| \.\.\.
# these are separate tokens
| [][.,;"'?():-_`]
'''

Once we have constructed our regex for defining what sort of format our words should be in, we call it like so:

import nltk
#doc is a string containing our document
toks = nltk.regexp_tokenize(doc, sentence_re)

>>> toks
['The', 'Buddha', ',', 'the', 'Godhead', ',', 'resides', ...

Tagging

The next step is tagging. This uses statistical data to apply a Part-of-speech tag to each token, e.g. ADJ, NN (Noun), and so on. Since it is statistical, we need to either train our model or use a pre-trained model. NLTK comes with a pretty good one for general use, but if you are looking at a certain kind of document you may want to train your own tagger, since it may greatly affect the accuracy (think about very vocabulary-dense fields such as biology).

Note that to train your own tagger you will need a pre-tagged corpus (NLTK comes with some) or use a bootstrapped method (which can take a long time). Check out Streamhacker and Chapter 5 of the NLTK book for a good discussion on training your own (and how to test it empirically).

For the sake of this introduction, we will use the default one. The result is a list of token-tag pairs:

>>> postoks = nltk.tag.pos_tag(toks)
>>> postoks
[('The', 'DT'), ('Buddha', 'NNP'), (',', ','), ('the', 'DT'), ...

Chunking

Now we can use the part-of-speech tags to lift out noun phrases (NP) based on patterns of tags.

Media_httpnltkgooglec_amtzp

Note: All diagrams have been stolen from the NLTK book (which is available under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US License).

This is called chunking. We can define the form of our chunks using a regular expression, and build a chunker from that:

# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
    NBAR:
        # Nouns and Adjectives, terminated with Nouns
        {<NN.*|JJ>*<NN.*>}

    NP:
        {<NBAR>}
        # Above, connected with in/of/etc...
        {<NBAR><IN><NBAR>}
"""

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)

It is also possible to describe a Context Free Grammar (CFG) to do this, and help deal with ambiguity – information can be found in Chapter 8 of the NLTK book. Chunk regexes can be much more complicated if needed, and support chinking, which allows you to specify patterns in terms what you don’t want – see Chapter 7 of the NLTK book.

The output of chunking is a tree, where the noun phrase nodes are located just one level before the leaves, which are the words that constitute the noun phrase:

Media_httpnltkgooglec_hwqkc

To access the leaves, we can use this code:

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.node=='NP'):
        yield subtree.leaves()

Walking the tree and Normalisation

We can now walk the tree to get the terms, applying normalisation if we want to:

def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(word) for word, tag in leaf
            if acceptable_word(word) ]
        yield term

Normalisation may consist of lower-casing words, removing stop-words which appear in many documents (i.e. if, the, a…), stemming (i.e. cars –> car), and lemmatizing (i.e. drove, drives, rode –> drive). We normalise so that at later stages we can compare similar key phrases to be the same; 'the man drove the truck' should be comparable to 'The man drives the truck'. This will allow us to better rank our key phrases :)

Functions for normalising and checking for stop-words are described below:

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    from nltk.corpus import stopwords
    stopwords = stopwords.words('english')

    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted

And the result is:

>>> terms = get_terms(tree)
>>> for term in terms:
...    for word in term:
...        print word,
...    print
buddha
godhead
circuit
digit comput
gear
cycl transmiss
mountain
petal
flower
buddha
demean oneself

Are these similar to the key phrases you chose? There are lots of areas above that can be tweaked. Let me know what you come up with :) (the code can be found in this gist).

In future posts I will talk about how to rank key phrases. I will also discuss how to scale this to process many documents at once using MapReduce.

In the mean time check out the demos on Streamhacker, solve the problems in the NLTK book, or read the NLTK Cookbook :)

How to Win Friends and Generate People

Media_httpimages20x20_eeusq

I'm doing a project for a subject at RMIT which needs to manage thousands of patient records for a hospital. We haven't been given any sample data though, so I wanted to write a generator (so we can test it with small or large data sets whenever needed).

I started with the name generator (in Python), which selected a random male/female/last name from a file [1]. I then realised an address generator would behave similarly (street, city, country lists), so I decided to make a Generator base class. You can get the source code for these here.

I wanted the Generator base class to create methods dynamically; It would be instantiated with a bunch of methodname-filename pairs and have the methods for randomly selecting an entry from each file made dynamically. Exactly how to do this wasn't easy to find, as it wasn't well documented... the solution ended up being:

Niki (my brother) helped me get 70% of the way there... (after lots of scope and decorator issues) thanks man :)

[1] The name lists were taken from this census data, which actually provides percentages for each name too, if you wanted to make the name distribution more realistic...