Adventures in bit arrays/SIMD (Go)

Introduction

I've been fortunate enough to have a career where I've been free to explore new technologies and techniques to support a business need.  One of the problems I was once tasked with solving was graph traversal on a rather large scale.  It was not uncommon for a graph in our system to have hundreds of millions of edges, or for a large portion of that graph to be traversed.

However, traversal in our use case was not path finding so much as node visitation in a particular order, i.e., processing an acyclic dependency graph.  As long as the ordering was correct, we didn't necessarily need to know how traversal arrived at any individual node.

Performance in this use case was also paramount, so this particular traversal process was going to be done in-memory.  So, to summarize, I needed a way to maintain hundreds of millions of edges and to run a process over the dependency graph in the correct order, i.e., if B relied on A, A needed to be processed first.  To throw a little wrinkle into the fabric, these edges were often modified in bulk.  It was not uncommon to delete or add millions of edges at once, a process that also needed to be performant.

The Data Structure

There are many possible data structures that immediately come to mind here, with the restriction that this process was written in Go, so we were constrained to the standard library or whatever we could build ourselves (here is where Go could really use generics).

Given we had to support

  1. Performant bulk insertions/deletions
  2. Traversals that commonly cover a large percentage of the graph vertices
  3. In-memory state management

it seemed as if none of the standard options would suffice.  Arrays weren't great for bulk updates; maps weren't great for bulk updates either, and only so-so with data locality/traversal; and neither was great with regard to memory consumption.  I needed a data structure that could contain all of these edges in a performant and compact manner.

Bit arrays provide a good compromise to the above problems.  Using bit arrays, I could number off my vertices and have each vertex store a bit array that described its dependency edges.  For those of you who aren't familiar with bit arrays, Wikipedia has a pretty good explanation.

Simply put, bit arrays are a compact representation of a set of numbers.  Imagine I need to store three numbers, the list [1, 3, 7], as single-byte ints.  This list (excluding Go's slice header) consumes 3 bytes of memory.  A bit array is also a slice, but every number is represented by a single bit based on its position.  Using this example, I would have a single byte where bits 1, 3, and 7 are set... or 10001010 in binary (bit 0 on the right), which can be represented as a single-byte int (138).

Then adding or removing a number from the set is as simple as finding out where the number should go and performing some bitwise operation, OR for insert, AND NOT for deletion.
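To make that concrete, here's a minimal dense bit array sketch in Go.  This is just the core idea, not the go-datastructures implementation: position p lives in word p/64 at bit p%64, OR sets the bit, AND NOT (Go's &^ operator) clears it, and a mask test reads it.

package main

import "fmt"

// bitArray is a minimal dense bit array over 64-bit words.
type bitArray []uint64

// set turns bit p on with a bitwise OR.
func (b bitArray) set(p uint) { b[p/64] |= 1 << (p % 64) }

// clear turns bit p off with a bitwise AND NOT.
func (b bitArray) clear(p uint) { b[p/64] &^= 1 << (p % 64) }

// get reports whether bit p is set.
func (b bitArray) get(p uint) bool { return b[p/64]&(1<<(p%64)) != 0 }

func main() {
	b := make(bitArray, 1) // one word covers positions 0-63
	for _, p := range []uint{1, 3, 7} {
		b.set(p)
	}
	fmt.Printf("%08b\n", b[0]) // 10001010, i.e., 138
	b.clear(3)
	fmt.Println(b.get(3)) // false
}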

This actually provided a really useful way to store all of those edges, and importantly, a way to take two sets of vertex ids (represented as 32-bit ints) and quickly perform operations on them to find dependency relationships.  I won't get into too much detail about how that works here (you can find a patent regarding this process here), but needless to say I had a need to perform bitwise operations over the whole array.

Now, while bit arrays worked really well up to medium-sized graphs, they are actually a bit wasteful if the data being stored is sparse.  For instance, if I need to store the ints [1, 3, 2323523], then a bit array is not a good choice, as I'll need an array long enough to represent position 2323523.  There are many ways to address this issue, such as compression or Judy arrays.

However, I was able to take advantage of one peculiar feature of our graph: the vertex ids tended to be well-clustered.  That is, if a vertex had edges [1, 2, 3], then the next edge was far more likely to be 4 than 3,712.  This is partly due to the nature of our data, and partly because we engineered vertices to be this way.  Thus, I implemented a sparse bit array, whose implementation can be found in go-datastructures.  The sparse bit array is actually a combination of a list of ints holding blocks of set bits and a list of ints holding the word offsets of those blocks.  In some ways, you can think of the solution as similar to fractional cascading.  A lookup is a binary search in the sorted offset list to see whether we have a block covering the bit in question; if we do, we perform a bitwise operation on that block.  The time complexity is then dominated by the binary search, O(log n), but over a highly compact data structure (remember, our data had excellent clustering).
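Here's a stripped-down sketch of that idea (the real go-datastructures version differs in detail): a sorted list of word offsets, a parallel list of 64-bit blocks, and a binary search to find the block for a given bit.

package main

import (
	"fmt"
	"sort"
)

// sparseBitArray is a sketch: indices holds the sorted word offsets
// that contain at least one set bit, and blocks[i] holds the 64-bit
// word for offset indices[i].
type sparseBitArray struct {
	indices []uint64
	blocks  []uint64
}

// set finds (or inserts) the block for bit p and ORs the bit in.
func (s *sparseBitArray) set(p uint64) {
	idx, mask := p/64, uint64(1)<<(p%64)
	i := sort.Search(len(s.indices), func(i int) bool { return s.indices[i] >= idx })
	if i < len(s.indices) && s.indices[i] == idx {
		s.blocks[i] |= mask
		return
	}
	// New block: insert into both lists. This shift (a memmove) is the
	// cost sequential inserts mostly avoid, since they append at the end.
	s.indices = append(s.indices, 0)
	copy(s.indices[i+1:], s.indices[i:])
	s.indices[i] = idx
	s.blocks = append(s.blocks, 0)
	copy(s.blocks[i+1:], s.blocks[i:])
	s.blocks[i] = mask
}

// get binary-searches for the block and tests the bit.
func (s *sparseBitArray) get(p uint64) bool {
	idx := p / 64
	i := sort.Search(len(s.indices), func(i int) bool { return s.indices[i] >= idx })
	return i < len(s.indices) && s.indices[i] == idx && s.blocks[i]&(1<<(p%64)) != 0
}

func main() {
	var s sparseBitArray
	for _, p := range []uint64{1, 3, 2323523} {
		s.set(p)
	}
	fmt.Println(s.get(2323523), s.get(4)) // true false
	fmt.Println(len(s.blocks))            // 2 words, versus ~36,306 dense
}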

Results

I went and created a few benchmarks in the go-datastructures repo comparing the dense and sparse bit arrays.  For good measure, I also threw in a map.

SetBit:

Pretty easy case here: benchmark the amount of time it takes to add a value to the set.  I do this for up to 1600 values, with the sparse bit array having a bit of an advantage here because I add the values sequentially, resulting in fewer memmoves in the cascading index.
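The benchmarks themselves follow the standard testing.B shape; roughly like this (reusing the sparseBitArray sketch from above, not the real benchmark code):

// In a _test.go file, alongside the sparseBitArray sketch.
package main

import "testing"

// Sequential insertion is the favorable case: each new block appends
// at the end of the cascading index, so there's little to memmove.
func BenchmarkSparseSetBit(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var s sparseBitArray
		for j := uint64(0); j < 1600; j++ {
			s.set(j)
		}
	}
}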


No surprise here really: the dense bit array is the fastest, with Go's map and the sparse bit array about even.

GetBit:

Similar to SetBit, but simply benchmarks the amount of time to retrieve a value after it has been set.  Also done over 1600 items.


Again, no surprise here.  The dense bit array is very fast, as the value itself basically points to its own location: a simple division finds the word, followed by a bitwise operation on it.  The map is in second place, going almost directly to the desired bucket after a hashing operation.  I'm not sure exactly what addressing technique the standard library's map uses; maybe that'll be another blog post.  The sparse bit array is in last place, suffering from its O(log n) lookup.

Equality checking:

This is where things finally get interesting.  In this case, we add items to the bit arrays (and the map) and then measure how long it takes to check them for equality.  In all cases the bit arrays (and map) are equal, so no early-out optimization can kick in.


Now we start to see why the sparse bit array is so interesting.  In well-clustered scenarios (and this one is well-clustered), it can nearly keep pace with the dense bit array without the space overhead.  Either bit array is more space and time efficient than Go's map.
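On the dense sketch from earlier, equality is just a word-by-word scan, which is why equal inputs are the worst case: every word must be compared before the answer is known.

// equals reports whether two dense bit arrays hold the same set.
// A differing word would allow an early return, but the benchmark
// deliberately rules that out by comparing equal inputs.
func (b bitArray) equals(o bitArray) bool {
	if len(b) != len(o) {
		return false
	}
	for i := range b {
		if b[i] != o[i] {
			return false
		}
	}
	return true
}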

SIMD Surprise

Not happy leaving performance on the table, and prompted by a question from an old coworker named Tyler Treat, I was curious how adding SIMD optimizations might help the bitwise operations across the whole bit array.  Technically speaking, I should be able to operate on two 64-bit values at once.  So in the case of a bitwise AND NOT, I should be able to process 128 bits with a single instruction, as opposed to the 64 bits per instruction without SIMD.  Not wanting to maintain my own assembly (development time does have a cost), I turned to gensimd.

Gensimd generates assembly from Go code that you write against the library's defined types, calling into its optimized functions.  The gensimd library has some pretty strict limitations on the code from which the assembly is generated (for instance, no range keyword), and not all instructions are supported.  The only SSE2 bitwise operation I found exposed was ANDNPD; no OR, XOR, or AND.  So the benchmarks I created focus solely on AND NOT, one of the more useful operations for bit arrays.
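For reference, the pure-Go loop the generated assembly competes against is about as simple as it gets: one 64-bit AND NOT per iteration, where the SSE2 version handles a 128-bit lane per ANDNPD.

// andNot computes dst = a AND NOT b, one 64-bit word at a time.
// The slice lengths are assumed equal; &^ is Go's AND NOT operator.
func andNot(dst, a, b []uint64) {
	for i := range dst {
		dst[i] = a[i] &^ b[i]
	}
}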


Surprisingly, using the generated assembly turned out to be slower than not using it.  A snippet of the CPU profile:


Relatively speaking, a great deal more time in the generated code is spent moving data than actually performing the ANDNPD instruction.  This could probably be optimized, but only in the most performance-sensitive applications would I want the burden of maintaining hand-tuned assembly.

Knowing that most of the penalty is in the data movement, I wanted to see how many instructions I'd have to perform per loop iteration for the performance to equalize.  At 3 AND NOT operations in the loop, SIMD and non-SIMD perform nearly identically; beyond that, SIMD starts to pull ahead.
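As a sketch of what "more work per element" means here (a hypothetical signature, not the benchmark code): once each loaded value participates in several operations, the loads and stores are amortized and the wider SIMD lanes start to win.

// andNot3 chains three AND NOTs per element; at roughly this density
// of work per load/store, the SIMD version breaks even and then wins.
func andNot3(dst, a, b, c, d []uint64) {
	for i := range dst {
		dst[i] = a[i] &^ b[i] &^ c[i] &^ d[i]
	}
}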

The moral of the story is that generated SIMD assembly via gensimd is worth it if:
  1. Your application is extremely performance sensitive and you don't mind the extra maintenance burden.
  2. Your algorithm does enough work per load/store to justify the data-movement overhead.  Try to get as much of the algorithm as possible into the generated assembly, which can be difficult due to the library's limitations.

Conclusion

As a first blog post, I simply wanted to introduce the lowly and well-understood bit array.  A dense bit array can be very useful under certain conditions and provides extreme performance.  A sparse bit array can be useful if your data is sparse, especially if it is also well-clustered.  And under the right circumstances, both can be more storage efficient than a traditional map.
