Columnarization in Rust
So like everyone without a job, I’ve started to learn Rust. And like everyone who has started to learn Rust, I now feel it is very important to tell you about my experience with it.
The project I’ll talk about is for columnarization, a technique from the database community for laying out structured records in a format that is more convenient for serialization than the records themselves. This is important if you want to send data around in binary format at high throughputs, which is indeed something I often enjoy.
The rough idea is that, starting from a vector of some type, or Vec<T>
in Rust, we want to repeatedly reduce the complexity of the type T
, at the cost of increasing the number of vectors we have on hand. There are roughly three types of rules we’ll follow:
-
Vec<uint>
A vector of base types is our base case. Do nothing. -
Vec<(S, T)>
A vector of pairs is transformed to a pair of vectors:(Vec<S>, Vec<T>)
. -
Vec<Vec<T>>
A vector of vectors goes to a pair(Vec<uint>, Vec<T>)
, corresponding to the original vector lengths, and their concatenated payloads.
These three rules, and generalizations thereof, allow us to transform vectors of fairly complex types into a sequence of vectors of simple types. These vectors of simple types are then very easy to serialize and deserialize, simply casting typed vectors to and from byte vectors.
Implementation
I’ve gone and tried this out in Rust. Naturally, I won’t share the first few evolutions with you, because you might conclude that neither Rust nor Frank is much of anything you’d want to be associated with. However, the current state of the project really appeals to me, and shows off some things I don’t think I could easily do in most other languages I am familiar with.
For the curious, the repository can be cloned from https://github.com/frankmcsherry/columnar.
The trait (“interface”) I’ve implemented is ColumnarVec<T>
, whose role in life is to push
and pop
records, much like a Vec<T>
in Rust. That is to say, it accepts records and adds them to its columnar stash, and it can return those records when asked. This isn’t yet much more interesting than Vec<T>
, so the ColumnarVec<T>
also adds the ability to encode
its contents into byte arrays, and decode
from byte arrays populating its contents.
The intended use of a ColumnarVec
is to repeatedly call push
with your favorite records, decide that you’d like to encode
them to binary, ship the resulting arrays to a dear friend, who is then able to decode
into her empty ColumnarVec
and read the contents out using pop
.
uints, and basic types
For basic types, we have a very basic implementation of ColumnarVec<T>
: we just use a Vec<T>
. When records are pushed or popped, we simply use the underlying methods on Vec
.
To encode
we swap in an empty vector at *self
(using mem::swap
, which takes two mutable references), cast the swapped out vector to a Vec<u8>
, and push it on to the Vec<Vec<u8>>
stack.
To decode
we pop a byte vector from the stack, casting to a Vec<T>
, and then assign it to *self
. This overwrites anything currently in the ColumnarVec
, which … perhaps is not expected behavior.
Pairs, tuples, and structs
Pairs are a bit more interesting than base types, in that we want to destructure the pair into its component elements so that they can each be pushed into their corresponding vectors. More generally, we will only assume that the two types in the pair have implementations of ColumnarVec
supporting them.
In Rust, as well as other sophisticated languages, we can indicate that any pair of types implementing ColumnarVec<T1>
and ColumnarVec<T2>
do themselves implement ColumnarVec<(T1, T2)>
. We simply name the types, state the constraints, and then provide the implementation:
One appealing part of Rust, and several other similar languages, is that one can specify an implementation in a fairly light-weight manner. Here, I’ve not even defined a new type to support the methods, I’ve just indicate that they exist for a family of pairs of two types.
Vectors and collections
Vectors, and variable sized fields like Option<T>
, are where things start to get a bit sticky with columnarization. This is not only where we need to do some non-trivial re-shaping of the data, but also where we need to start dealing with dynamically allocated memory ourselves: people calling pop
are going to want to see some vectors in their results.
One of the very appealing parts of Rust is its notion of ownership of data, and this is indicated obliquely in the signature of the push
method. Whereas the ColumnarVec
itself is passed in as a reference (the &mut self
oddness), the record of type T
is unadorned, meaning it is the actual record and we own it now. Owning the record is great, because it means no one else owns it, and if we want to rip it in to small pieces in the name of columnarization, we are free to do so.
A second appealing part of owning the record is that it means we also own all of the dynamically allocated memory the record owns, for example any Vec
fields we might find. Obviously they hold valuable data we want to columnarize, but once that is done and the data are moved out, we can simply snag the Vec
and hold on to it for future use. Future use such as needing a Vec
when someone calls pop
. By stashing the vectors we can avoid much interaction with the allocator when repeatedly encoding and decoding, and that means performance.
Let’s look at the push
and pop
methods for the implementation of ColumnarVec<Vec<T>>
. The first performs the possibly unsurprising tasks of pushing the input vector’s length into a ColumnarVec<uint>
and pushing its contents into a ColumnarVec<T>
. Once done with the vector’s contents, it stashes the empty-but-allocated vector, which is the part that I thought was really clever. The pop
method runs in the opposite direction, fetching a stashed vector (or allocating if one doesn’t exist), reading the intended length, and then filling the vectors contents before returning the vector.
This approach to vector columnarization steers us clear of the traditional hazards of memory allocation inside what is meant to be a tight loop. In applications where one is repeatedly encoding, exchanging, and decoding data, in steady state the program will not need to allocate any new memory, which is an excellent position to be in.
Performance
Measuring the performance of columnarization in Rust turned out to be harder than I expected. Mostly, coming from a managed JIT background, one could reasonably believe that only effect of an optimizing compiler was to randomize line numbers. Rust and LLVM are a fair bit smarter and without the right test harness they wil just optimize away your program.
The harness is in example.rs, and tests out columnar encoding and decoding of a few different datatypes. Be warned that these numbers are likely optimistic, as fewer optimizations will apply in the wild.
-
Serializing
uint
data goes very fast, because it is just copying data and casting a pointer, if that.Encoding/decoding at 11.24GB/s
-
Serializing
(uint, (uint, uint))
data goes less fast, but still quite fast:Encoding/decoding at 5.29GB/s
-
Serializing
Vec<Vec<uint>>
data goes slower, due to conditional logic in the loop, but is still fast.Encoding/decoding at 2.26GB/s
These throughputs numbers will obviously vary as the shape of the data change. Currently, working with Option<T>
types is slowest, due to the large ratio of conditional logic to actual bytes moved.
These are great numbers by my experience, and for about 150 lines of code all written in the base language (rather than code-gened nonsense), I am delighted.