Having fun with Go slices

The fun

Go offers precious few data structures that can be generically typed: channels, maps, and slices (including arrays).  This often leaves developers reaching for slices, particularly sorted slices, for all sorts of use cases.  I thought it might be fun to create some utilities for slice manipulation, starting with a faster stable sort.  I hope that in Go 2 we get generics so data structures like binary trees can be made type-safe.

For those who aren't aware, a stable sort is one in which elements with equal keys retain the same relative order they had in the input.  Several sorting algorithms can be implemented stably, including merge sort, insertion sort, and bubble sort.  Efficient implementations of the ever-popular quicksort are not stable.
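
For instance, with the standard library's sort.Stable and a small sort.Interface implementation (a toy example, not code from this post's repository):

    package main

    import (
        "fmt"
        "sort"
    )

    // person and byAge give us a sort.Interface over a slice of people.
    type person struct {
        name string
        age  int
    }

    type byAge []person

    func (p byAge) Len() int           { return len(p) }
    func (p byAge) Less(i, j int) bool { return p[i].age < p[j].age }
    func (p byAge) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

    func main() {
        people := byAge{{"Alice", 30}, {"Bob", 25}, {"Carol", 30}}
        sort.Stable(people)
        // Bob sorts first; Alice and Carol share an age, and a stable sort
        // keeps Alice ahead of Carol because she appeared first in the input.
        fmt.Println(people) // [{Bob 25} {Alice 30} {Carol 30}]
    }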

I've had several use cases for a stable sort in the past, so I thought it might be fun to take the stable sort from the standard library and attempt to make it concurrent to improve performance.  The standard library's sort follows a divide-and-conquer structure, so making it concurrent is fairly straightforward.


The standard library's stable sort breaks the input slice into small subslices of length 20.  The subslices are then sorted using an insertion sort, which is itself stable.  The sorted subslices are then iteratively merged into ever-larger runs using an algorithm called symmerge, a space-efficient algorithm for merging two sorted arrays invented by Kim and Kutzner.  It uses symmetric comparisons and rotations to achieve an optimal number of comparisons, O(m log(n/m)), where m is the length of the shorter array and n is the length of the longer one.  The explanation is considerably more involved than the algorithm itself and I won't rehash it here, but I highly encourage you to read the paper.
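
The overall structure looks roughly like the sketch below.  This is a simplified paraphrase of the approach rather than the library's actual source, and insertionSort and symMerge stand in for its unexported helpers:

    // Sketch of the block-based stable sort described above (assumes "sort"
    // is imported). A simplified paraphrase, not the library's actual source;
    // insertionSort and symMerge stand in for its unexported helpers.
    func stableSketch(data sort.Interface, n int) {
        const blockSize = 20

        // Phase 1: insertion-sort each block of ~20 elements.
        a, b := 0, blockSize
        for b <= n {
            insertionSort(data, a, b)
            a, b = b, b+blockSize
        }
        insertionSort(data, a, n) // final, possibly shorter, block

        // Phase 2: merge adjacent sorted blocks with symmerge, doubling the
        // block size each pass until the whole slice is one sorted run.
        for size := blockSize; size < n; size *= 2 {
            a, b = 0, 2*size
            for b <= n {
                symMerge(data, a, a+size, b)
                a, b = b, b+2*size
            }
            if m := a + size; m < n {
                symMerge(data, a, m, n) // trailing partial merge
            }
        }
    }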

Combining these techniques, Go's standard library is able to sort slices quickly, efficiently, and stably.  My goal was to improve performance simply by making the existing algorithm concurrent: the initial insertion sorts can safely run across the slice concurrently, and the merges within each block-size iteration can also run concurrently.  To ensure consumers can drop this in as a replacement, the stable sort function must also accept a sort.Interface (stuttering is ugly, Go!).
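
A minimal sketch of that concurrency pattern follows, reusing the assumed helpers from the previous sketch; the implementation described later hands work to long-lived workers over queues rather than spawning a goroutine per block:

    // Sketch of the concurrent variant: within each pass every block covers a
    // disjoint index range, so blocks can be handled by separate goroutines
    // and joined with a sync.WaitGroup before the next pass begins (assumes
    // "sort" and "sync" are imported).
    func concurrentStableSketch(data sort.Interface, n int) {
        const blockSize = 20
        var wg sync.WaitGroup

        // Concurrent insertion sorts over the initial blocks.
        for a := 0; a < n; a += blockSize {
            b := a + blockSize
            if b > n {
                b = n
            }
            wg.Add(1)
            go func(a, b int) {
                defer wg.Done()
                insertionSort(data, a, b)
            }(a, b)
        }
        wg.Wait()

        // Merges within a single pass are independent, so they run
        // concurrently too; each pass must finish before the next begins.
        for size := blockSize; size < n; size *= 2 {
            for a := 0; a < n; a += 2 * size {
                m, b := a+size, a+2*size
                if b > n {
                    b = n
                }
                if m >= b {
                    continue // trailing block shorter than size: already sorted
                }
                wg.Add(1)
                go func(a, m, b int) {
                    defer wg.Done()
                    symMerge(data, a, m, b)
                }(a, m, b)
            }
            wg.Wait()
        }
    }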

To ensure I wasn't paying a dynamic-dispatch penalty, I used -gcflags="-m" to confirm that the comparison function behind the interface was being inlined.

The results

To begin, I tested the algorithm with varying input sizes and various chunk sizes for the initial insertion sort.  All tests were run on a MacBook Pro with a quad-core 2.2 GHz Core i7 (Haswell).  All code used in these tests can be found at https://github.com/dzyp/slice.

The vertical axis is logarithmic, and all variants display roughly linear scaling.  The Go authors have done their research: a block size of 20-30 is optimal, with larger sizes paying too great a penalty in the insertion sort (insertion sort performs O(n^2) comparisons on average).  The logarithmic scale makes the final times appear much closer than they are; the standard library sorts 10,000,000 structs in 1859 ms, while the concurrent sort accomplishes the same in 898 ms.

Next, I wanted to see how the number of available CPUs affected performance.  I ran this test by modifying GOMAXPROCS, which in recent versions of Go defaults to runtime.NumCPU().
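
For reference, a benchmark along these lines can pin the scheduler's CPU count with runtime.GOMAXPROCS; parallelStable and randomInts below are placeholder names, not functions from the repository:

    // Hypothetical benchmark for the GOMAXPROCS experiment (assumes imports
    // "runtime", "sort", and "testing"). parallelStable and randomInts are
    // placeholders for the code under test.
    func benchmarkWithProcs(b *testing.B, procs int) {
        prev := runtime.GOMAXPROCS(procs) // returns the previous setting
        defer runtime.GOMAXPROCS(prev)

        data := randomInts(10000000)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            in := append([]int(nil), data...) // fresh copy so every iteration sorts unsorted input
            parallelStable(sort.IntSlice(in))
        }
    }

    func BenchmarkSort1Proc(b *testing.B)  { benchmarkWithProcs(b, 1) }
    func BenchmarkSort4Procs(b *testing.B) { benchmarkWithProcs(b, 4) }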

Scaling was good going from one to four CPUs, but little additional performance was gained going from four to eight.  Whether this is because the CPU has only four physical cores (the rest being virtual, courtesy of hyperthreading) or because the sort is running up against Amdahl's Law (which caps speedup at 1 / ((1 - p) + p/n) for a parallel fraction p on n processors) is a subject for further investigation.

Further work

To reduce the number of allocations, I use pools so that the messages sent down queues to the workers can be reused.  However, if the pool's CPU-local cache is cold, getting an item can take a global lock, which can hurt performance when many consumers are sorting simultaneously.  I have a stash that implements the same functionality with a ring buffer instead, but it had only a marginal impact on performance and was slightly more complex, so the initial version uses the simpler approach.
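
A rough sketch of that reuse pattern with sync.Pool, using a hypothetical mergeRequest message type:

    // Sketch of reusing worker messages with sync.Pool so a request does not
    // allocate on every send (assumes "sync" is imported). mergeRequest is a
    // hypothetical message type; the real types live in the repository linked
    // above.
    type mergeRequest struct {
        low, mid, high int
        done           chan struct{}
    }

    var requestPool = sync.Pool{
        New: func() interface{} {
            return &mergeRequest{done: make(chan struct{}, 1)}
        },
    }

    func sendMerge(work chan<- *mergeRequest, low, mid, high int) {
        req := requestPool.Get().(*mergeRequest)
        req.low, req.mid, req.high = low, mid, high
        work <- req
        <-req.done           // wait for the worker to signal completion
        requestPool.Put(req) // hand the message back for reuse
    }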

Most remaining performance gains will likely come from an improved sorting algorithm, although the options are somewhat limited by being restricted to the sort.Interface methods.  Another option is pipelining the insertion sorts directly into the merging process to avoid waiting for all of the insertion sorts to finish.

Up next

Simply making the current stable sorting algorithm concurrent yields fairly dramatic improvements in sort time.  Beyond improving the sort itself, I'd like to explore quicker insertions (using the sort) and faster deletions without the use of tombstones.  My sincerest hope, however, is that these will be made redundant by an improved type system or generic binary trees in the standard library :).

Comments

  1. The graphs are barely intelligible. I suggest reading up on effective information visualization (e.g. Tufte's books on the matter)
