# Succinct de Bruijn Graphs

## Table of Contents

This post will give a brief explanation of a Succinct implementation for storing de Bruijn graphs, which is recent (and continuing) work I have been doing with Sadakane.

Using our new structure, we have squeezed a graph for a human genome (which took around 300 GB of memory if using previous representations) down into 2.5 GB. In addition, the construction method allows much of the work to be done on disk. Your computer might not have 300 GB of RAM, but you might have 2.5 GB of RAM and a hard disk.

I have given a talk about this a few times, so I’ve been itching to write it up as a blog post (if only to shamelessly plug my blog at conferences). However, since it is a lowly blog post, I won’t give attention to the gory details, nor provide any experimental results. But this post is merely to *communicate the approach*. Feel free to check out our conference paper, and stay tuned for the journal paper.

In this blog post, I will first give an introduction to de Bruijn graphs and how they are used in DNA assembly. Then I will briefly explain some previous implementations before reaching the main topic of this post: our new succinct representation. The explanation first explains some preliminaries, such as how we were inspired by the Burrows Wheeler Transform, and what rank and select are, which are required to understand the construction method, and traversal interface respectively.

For those following along at home, I have implemented a demo version in Python. It doesn’t use efficient implementations of rank and select, nor provide any compression – it is merely meant to demonstrate the key ideas with (hopefully) readable high level code. An optimised version will be made available at some point.

Okay, here we go…

## De Bruijn Graphs

De Bruijn graphs are a beautifully simple, yet useful combinatoric object which I challenge you not to lose sleep over.

Since their discovery around 1946, they have been used in a variety of applications, such as encryption, psychology, chess and even card tricks. Quite recently they have become a popular data structure for DNA assembly of short read data.

They are defined as a directed graph, where each node $u$ represents a fixed-length string (say, of length $k$, and an edge exists from $u$ to $v$ iff they overlap by $k–1$ symbols. That is, $u[2..k] = v[1..k–1]$.

Let’s make this concrete with a diagram:

Doesn’t this just *feel good* to look at? Now tilt your head and look at it this way: it is essentially a Finite State Machine with additional bounded memory of the last $k$ visited states, if you added a few more states to allow for incomplete input at the start. For example, if X represents blank input, from a starting state XXX we might have the transition chain: XXX -> XX1 -> X10 -> 101 (which is already a node in the graph). Put this way, it is easy to see that the $k$ previous edges define the current node. This perspective will make things easier to understand later, I promise.

Still with me? Good (: Then let’s also consider that if *one node* is defined by a $k$-length string, then a *pair of nodes* (i.e. an edge) can be identified by a $k+1$ length string, since they overlap and differ by 1 symbol. This will also be important later.

And of course, this can be extended to larger alphabet sizes than binary (say, 4…).

## DNA Assembly

First suggested in 2001 by Pevzner et al.[1], we can use de Bruijn graphs to represent a network of overlapping *short read data*.

The long and short of it (heh heh) is that a DNA molecule is currently too difficult to sequence (that is, read it into a computer) in its entirety. Special methods must be used so we can sequence parts of the molecule, and hand off the putting-back-together process (assembly) to an algorithm.

One current popular sequencing method is *shotgun sequencing*, which clones the genome a bunch of times, then randomly breaks each clone into short segments. If we can sequence the short segments, then the fact that we randomly cut up the clones should lead us to have overlapping reads. Of course it is a bit more complicated than this in reality, but this is the essence of it.

We then move a sliding window over each read, outputing overlapping $k$-mers (the $k$-length strings), which we use to create our de Bruijn graph.

About now mathematicians among you might raise your hand to tell me “that may not technically yield a de Bruijn graph”. That’s correct – in Bioinformatics the term “de Bruijn graph” is overloaded to mean a subgraph. Even though genomes are long strings, most genomes won’t have *every single k-mer* present, and there is usually repeated regions. This means our data will be sparse.

Consider the following contrived example. Take the sequence `TACGACGTCGACT`

. If we set (k=3), our k-mers will be `TAC`

, `ACG`

, `CGA`

, and so on. We would end up with this de Bruijn graph:

After we construct the graph from the reads, assembly becomes finding the “best” contiguous regions. The jury is still out on what the best method is (or what “best” even means); the point of this post isn’t about assembly, but our implementation of the data structure and how it provides all required navigation options to implement any traversal method. I recommend reading this primer from Nature if you want to get deeper into this.

Even though this is only a de Bruijn *subgraph*, these things still grow pretty big. It is worthwhile considering how to handle this scalability issue, if only to reduce hardware requirements of sequencing (thus proliferating personal genomics), and potentially improve traversal speed (due to better memory locality). Increased efficiency might also enable richer multiple-genomic analysis.

## Previous Representations

One of the first approaches to this was to scale “horizontally”. Simpson et al.[2] introduced ABySS in 2009. The graph for reads from a human genome (HapMap: NA18507), which used a distributed hash table, reached 336 GB.

In 2011, Conway and Bromage[3] instead approached this problem from a “vertical” scaling perspective (that is, scaling to make better use of a single system’s resources), by using a sparse bitvector (by Okanohara and Sadakane[4]) to represent the $(k + 1)$-mers (the edges), and used rank and select (to be described shortly) to traverse it. As a result, their representation took 32 GB for the same data set.

Minia, by Cikhi and Rizk (2012)[5], proposed yet another approach by using a bloom filter (with additional structure to avoid false positive edges that would affect the assembly). They traverse by generating possible edges and testing for it in the bloom filter. Using this approach, the graph was reduced to 5.7 GB.

## Our Succinct Representation

As stated, we were able to represent the same graph in 2.5 GB (after some further compression techniques, which I will save for a future post).

The key insight is that the edges define overlapping node labels. This is similar to that of Conway and Bromage, although they have some redundancy, since some *nodes* are represented more times than necessary.

We further exploit the mutual information of edges by taking inspiration from the Burrows Wheeler Transform[6].

### Inspiration from the Burrows Wheeler Transform

The Burrows Wheeler Transform[6] is a reversible string permutation that can be searched directly and has the admirable quality of having long strings of repeated characters (great for compression). The easiest way to calculate the BWT of a string is to sort each symbol by their prefixes in colex order (that is, alphabetic order of the reverse of the string, not reverse alphabetic!) More information can be found on Wikipedia and this Dr. Dobbs article.

The XBW is a generalisation of the BWT that applies to rooted, labeled trees[7]. The idea is that instead of taking all suffixes, we sort all paths from the root to each node, and support tree navigation (since it isn’t a linearly shaped string) with auxiliary bit vectors indicating which edges are leaves, and which are the last edges (of their siblings) of internal nodes.

I won’t go into detail, but in the next section you should be able to see glimpses of these two ideas.

### Construction

The simplest construction method[8] is to take every <node, edge> pair and sort them based on the reverse of the node label (colex order), removing duplicates[9]. We also padding to ensure every node has an incoming and an outgoing edge. This maintains the fact that a node is defined by its previous k edges. An example can be seen below:

You may have spotted that we have flagged some edges with a minus symbol. This is to disambiguate identically labelled incoming edges – edges that exit separate nodes, but have the same symbol, and thus enter the same node. In the example below, the nodes `ACG`

and `CGA`

both have two incoming edges.

Notice that each outgoing edge is stored contiguously? We include a bit vector to represent whether an edge is the *last* edge exiting a node. This means that each node will have a sequence of zero-or-more 0-bits, followed by a single 1-bit.

Since a 1 in the $L$ vector identifies a unique node, we can use this vector (and select, explained shortly) to index nodes, whereas standard array indexing points to edges.

Finally, instead of storing the node labels we just need to store the final column of the node labels. Since the node labels are sorted, it is equivalent to store an array of first positions:

In total we have a bitvector $L$, an array of flagged edge labels $W$, and a position array $F$ of size $\sigma$ (the alphabet size). Respectively, these take $m$ bits, $m \log{2*\sigma} = 3 m$ bits (for DNA), and $\sigma \log{m} = o(m)$ bits[10], given $m$ edges – a bit over 4 bits per edge. Using appropriate structures (not detailed) we can compress this further, to around 3 bits per edge.

### Rank and Select

Rank and select are the bread and butter of succinct data structures, because so many operations can be implemented using them alone. $rank_c(i)$ returns the number of occurences of symbol $c$ on the closed range $[0, i]$, whereas $select_c(i)$ returns the position of the $i^{th}$ occurence of symbol $c$.

They are *kind of* like inverse functions, although $rank()$ is not injective, so cannot have a true inverse. For this reason, if you want to find the *position* of the left-nearest $c$, you would have to use $select()$ in orchestra with $rank()$ (a common pattern).

Speaking of patterns of use, it may also help to keep this in mind: rank is for counting, and select is for searching, or can be thought of as an indirect addressing technique (such as addressing a node using the $L$ array). Two rank queries can count over a range, whereas two select queries can find a range. A rank and a select query can find a range where either a start point or end point are fixed.

Rank, select and standard array access can all be done in $\mathcal{O}(1)$ time when $\sigma = polylog(N)$[11], if represent a bitvector using the structure described by Raman, Raman and Rao in 2007[12] (which I explained in an earlier blog post), and for larger alphabets use the index described by Ferragina, Manzini, Makinen, and Navarro in 2006[13]. In our implementation, we use modified versions to get it down to 3 bits per edge.

### Interface Overview

While it might not be obvious, these three arrays provides support for a full suite of navigation operations. An overview is given in these tables, which link to the implementation details that follow.

First, to navigate each of the edges, we define two internal functions (using rank and select calls):

Operation | Description | Complexity |
---|---|---|

\(forward(i)\) | Return index of the last edge of the node pointed to by edge \(i\). |
$\mathcal{O}(1)$ |

\(backward(i)\) | Return index of the first edge that points to the node that the edge at \(i\) exits. |
\(\mathcal{O}(1)\) |

Using these two functions, we can implement the less confusing public interface below, which operate on node indexes:

Operation | Description | Complexity |
---|---|---|

\(outdegree(v)\) | Return number of outgoing edges from node \(v\). | \(\mathcal{O}(1)\) |

\(outgoing(v,c)\) | From node \(v\), follow the edge labeled by symbol \(c\). | \(\mathcal{O}(1)\) |

\(label(v)\) | Return (string) label of node \(v\). | \(\mathcal{O}(k)\) |

\(indegree(v)\) | Return number of incoming edges to node \(v\). | \(\mathcal{O}(1)\) |

\(incoming(v,c)\) | Return predecessor node starting with symbol \(c\), that has an edge to node \(v\). | \(\mathcal{O}(k \log \sigma) \) |

The details of the above functions are given in the following sections.

### Forward

In order to support the public interface, we create for ourselves a simpler way to work with edges: the complementing forward and backward functions.

Recall that all node labels are defined by predecessor edges, then we have represented each edge in two different places: the (F) array (which is equivalent to the last column of the “Node” array), and the edge array (W). It follows that, since it is sorted, the node labels maintain the same *relative order* as the edge labels. This can be seen in the following figure:

Note that the number of `C`

s in the last column of Node is different from the number of `C`

s in (W), because the first `C`

in (W) points to two edges. For this reason, we ignore the first edge from node `GAC`

(although it doesn’t affect the relative order). In fact, we ignore any edge that doesnt have L[i] == 1.

Then, following an edge is simply finding the corresponding relatively positioned node! All it takes is some creative counting, using rank and select, as pictured below:

First we access W[i] to find the edge label, then calculate (rank_C) up to row to gives us the relative ordering of our *edges* with this label. Let’s call this relative index (r). In our example we are following the 2nd C-labeled edge.

To find the 2nd occurence of C in F, first we need to know where the first occurence is. We can use F to find that. Then we can select to the 2nd one, using the last array. Because the last array is binary only, this requires us to count how many 1s there are before the run of Cs (using rank), then adding 2 to land us at the 2nd C.

The W access, rank over W, rank and select over L, accessing F, and the addition are all done in O(1) time, so forward also takes O(1) time.

### Backward

Backward is very similar to forward, but calculated in a different order: we find the relative index of the node label first, and use that to find the corresponding edge (which may not be the only edge to point to this node, but we define it to point to the first one, that is one that isn’t flagged with a minus).

We can find our relative index of the node label by issuing two rank queries instead, and using select on W to find the first incoming edge.

For similar reasons to Forward this is O(1).

### Outdegree

This is an easy one. This function accepts a **node** (not edge!) index `v`

, and returns the number of outgoing edges from that node. Why did I say this was easy? Well, remember that all our outgoing edges from the same node are contiguous in our data structure. See the diagrams below, paying attention to the node `ACG`

.

By definition (since nodes are unique and sorted), this will always be the case. We also defined our so-called “last” vector $L$ to have a 0 for every edge from a given node, *except the last* edge. All we need to do is count how many 0s there are, then add 1. Put another way, we need to measure the distance between 1s (since the previous 1 will belong to the previous node). Since we know the node index, we can use select for that!

In the above example, we are querying the outdegree of node 6 (the 7th node due to zero-basing). First we select to find the position of the 7th 1, which gives us the last edge of that node. Then we simply subtract the position of the previous node (node 5, the 6th node): (select(7) – select(6) = 8 – 6 = 2). Boom.

Select queries can be answered in O(1) time, so outdegree is also O(1).

### Outgoing

Outgoing(v,c) returns the target node after traversing edge c from node v, which might not exist. The hard part is finding the correct edge index to follow; after that we can conveniently use the forward() function we defined earlier.

To ease the explanation, consider a simple bit vector `00110100`

. To count how many 1s there are up to and including the 7th position, we would use rank. The answer is 3, but at this stage we still don’t know the position of the 3rd 1 (we can’t see the bitvector). In general, it may or may not be in the 7th position. We could scan backwards, or we could just use select(3) to find the position (since this returns the first position i that has rank(i) = 3).

So essentially we can count how many of those edges there are before the node we are interested in, then use select to find the position. If the position is inside this nodes range (of contiguous edges), then we follow it.

We complicated things a bit by separating our edges into flagged and non-flagged, so we may have to issue two of these queries (for the minus flags). The flagging is useful later in Indegree.

In the next example, our nonflagged edge doesn’t fall in our nodes range, so we make a second query. It is possible that this one will return a positive result, but in the example it doesn’t. By that stage though, we can respond with by returning –1 to signal that the edge doesn’t exist.

forward() is defined to move us to the last edge of the resulting node, so the value in last will be 1. Hence, we can use rank to convert our edge index into a node index before returning it.

This is a constant number of calls to O(1) functions, so outgoing is also O(1).

### Label

At some point (e.g. during traversal) we are probably going to want to print the node labels out. Let’s work out how to do that (:

Remember, we aren’t storing the node labels explicitly. The F array will come in handy: We can use the position of our node (found using select) as a reverse lookup into F. This can be done in constant time with a sparse bit vector, or in logarithmic time using binary search, or we can use a linear scan if our alphabet is small. In any case, lets assume it is O(1) time, although a linear scan might be faster in practice (fewer function calls may yield a lower constant coefficient).

In the above example, we select to node 6 (the 7th 1 in last), which gives us the last edge index (any edge will do, but the last one is the easiest to find). This happens to be row 8, so from F we know that all the last symbols between on the open range [7,10) are G.

Then we just use bwd() on the current edge to find an edge that pointed to this node, then rinse and repeat k times.

### Indegree

In a similar manner to outdegree, all we need to do is count the edges that point to the current node label. Take for example the graph:

We can easily find the first incoming edge by using backward(). To count the remaining edges our minus flags come in handy; In the W array, the next G (non-flagged) belongs to a different node, because we defined W to have minus flags if the source node has the same $k–1$ suffix (or, if same-labeled edges also share the same target node).

From here it is simple enough to do a linear scan (or even use select) until the next non-flagged edge; the maximum distance it could be is $\sigma^2$. For larger alphabets, a more efficient method is to use rank instead:

First we find the position of the next non-flagged $G$, which gives us the end point of our range (we already have the start point from the first rank). Then we use rank to calculate how many $G-$ there are to the end position, and subtract the initial rank value from this, giving us how many $G-$ occur within the range.

This is once again a constant number of $O(1)$ function calls, which means $indegree()$ is also $O(1)$.

### Incoming

Incoming, which returns the predecessor node that begins with the provided symbol, is probably the most difficult operation to implement. However, it does use approaches similar to the previous functions.

Consider this: from indegree(), we already know how to count each of the predecessor nodes. We can access these nodes if we instead use select to iterate over the predecessor nodes, rather than using rank to simply count. Then, to disambiguate them by first character, we can use label().

A linear scan over the predecessors in this fashion would work, but for large alphabets we can use binary search (with a select call before each array access) to support this in $O(k \log \sigma)$ time; $\log \sigma$ for the binary search, where each access is a $O(1)$ select, followed by $O(k)$ to compute the label.

This is demonstrated in the example below:

## Conclusion

In conclusion, by using memory more efficiently, hopefully the cost of genome seqeuencing can be reduced, both proliferating the technology, but also giving way to more advanced population analysis. To that end, we have described a novel approach to representing de Bruijn graphs efficiently, while supporting a full suite of navigation operations quickly. Much of the (BWT-inspired) construction can be done efficiently on disk, but we intend to improve this soon to compete with Minia.

The total space is a theoretical $m(2 + \log{\sigma} + o(1))$ bits in general, or $4m + o(m)$ bits for DNA, given $m$ edges. Using specially modified indexes we can lower this to around 3 bits per edge.

I apologize that an efficient implementation isn’t available, nor have I provided experimental results. But if you found this post interesting

you can get your hands dirty with the Python implementation I have provided. The results and efficient implementation are on their way (:

And as always, follow me on Twitter, and feel free to send me questions, criticisms, and especially pull requests!

- Pevzner, P. A., Tang, H., and Waterman, M. S. (2001). An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753. ↩
- Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. M., and Birol, I. (2009). ABySS: A parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123. ↩
- Conway, T. C. and Bromage, A. J. (2011). Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–486. ↩
- Okanohara, D. and Sadakane, K. (2006). Practical Entropy-Compressed Rank/Select Dictionary. ↩
- R. Chikhi, G. Rizk. (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. WABI. ↩
- M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithms. Technical Report 124, Digital SRC Research Report, 1994. ↩
- P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4:1–4:33, 2009. ↩
- The paper also describes an online construction method (where we can update by appending), and an iterative method that builds the k-dimensional de Bruijn

graph from an existing (k–1)-dimensional de Bruijn graph. ↩ - We currently use an external merge sort, but intend to optimise as this is where Minia beats us time-wise. ↩
- This is “little o” notation, which may be unfamiliar to some people. Intuitively it means “grows much slower than”, and is stricter than big O.

A formal definition can be found on Wikipedia. ↩ - http://stackoverflow.com/questions/1801135/what-is-the-meaning-of-o-polylogn-in-particular-how-is-polylogn-defined ↩
- R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4), November 2007. ↩
- P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms, 3(2):No. 20, 2006. ↩

## alexbowe.com Newsletter

Join the newsletter to receive the latest updates in your inbox.