Rabin fingerprints

rabin fingerprinting

I have been a bit quiet, time to bring some news! Quite a bit has happened since the previous post. The most important change is the new rabin fingerprinting code which can now be used by vacput, vacget and vacfs to split blocks not at fixed byte offsets, but at content boundaries. This code for rabin fingerprinting has actually been in the hg repository for some time, but I recently changed a few of the parameters making the block boundaries much better and hadn’t yet explained it. So that’s what is post is about.

the algorithm

The rabin fingerprinting algorithm calculates a rolling checksum over data (a file to store in venti). The window of the data to look at is configurable, but typically a few dozen bytes long. The rabin module will read through a file, and let the window “slide” over the data, recalculating the fingerprint each time when advancing a byte. When the fingerprint assumes a special value, the rabin module considers the corresponding window position to be a boundary. Al data preceding this window position is taken to be a “block” of the file.

Calculating a checksum relative to a previous one is not very expensive cpu-wise. The operations involved are:

  1. Multiplying the previous checksum by a prime (the prime is a parameter of the algorithm).
  2. Adding the value of the byte that has just come into the sliding window.
  3. Subtracting the value of the byte that has just slid out of the window times the nth power of the prime (where n is the width of the sliding window; the 256 possible values are precalculated during module initialisation).
  4. Taking the modulo of the value by the average desired block size (for now, the average desired block size is required to be a power of two).
  5. Check whether the new checksum has the special value that makes it a boundary.

This means the algorithm has three parameters: prime, width and modulo. The properties that make this algorithm useful for our purpose is that: 1. they are cheap to calculate; 2. they find the same boundaries, no matter where in the file they occur. Thus, when a byte has been prepended to or inserted into a file, this has no influence on how later blocks are formed, as opposed to the case where all block boundaries are at fixed file offsets.

This implementation also allows you to specify a minimum and maximum block size . If a block boundary occurs before the minimum block size is encountered, it is ignored. If no block boundary occurs before the maximum block size, the window is treated as a block boundary anyway and a new block emitted.

At first I had set the width of the sliding window to three, but after reading through lots of data (my homedir, ~7GB in size) with different values for the parameters, it was clear that a window width of three is too small: too few unique boundary patterns existed, and the ones that did exist were found too often. It has now been set to 31. The tests showed that which prime number to choose did not matter (for the values I tested). The modulo (average block size) is reasonably accurate, but the minimum and maximum block sizes, and the fact that end of files result in short blocks, skew the actual average block size a bit. The special value at which to find boundaries had at first been set to zero: not good, patterns of all zero bytes will be a block boundary this way, causing lots of boundaries on “null” data. Better is to use the modulo-1, which is the value currently being used.

not there yet

Okay, so we can split a file in blocks in another way. Unfortunately, this cannot be integrated into vacput/vacget/vacfs without changing the data structures of venti archives. A hash tree stored in a venti archive needs all its blocks to be of a fixed length, typically 8KB. Vacget/vacfs use this assumption when finding the right block for fulfilling a read for a given offset in a file; they use it to determine which score in a pointer block to follow to get to the data block.

Now consider what happens when a (random) file offset in a vac+rabin fingerprinting archive should be read. When starting at the top pointer block of a hash tree, there is no way of knowing below which score the requested offset resides. Previously it could be calculated using the depth of the tree and the width of the tree branches and leaf size.

To solve this, I have changed the format of the pointer blocks. Each score is now accompanied by the size of the subtree the score references. The size referenced by the top score can be found, as usual, in the Entry data structure (which also contains the top score).

When walking from the top down to some file offset, the sizes of the subtrees allow vacget/vacfs to determine which ranges of the file reside below each score in the pointer block.

more (data structure) changes

The pointer block is not the only thing that has changed. To make sure the two types of data can never be misinterpreted, the new pointer blocks are stored with different “type” numbers in venti: the usual values added to 16; so 19 for depth 0, 20 for depth 1, and so on.

There is a new root block version, version 3. The maximum data block and pointer block size can be left intact (though I an currently writing them with “0” filled in).

The “flags” field in the Entry structure has a new bit set (1<<7) to indicate the entry is has variable-size blocks. In theory venti archives can be made with the two types of entries (variable sized blocks or not) mixed.

more ideas

A few more ideas have crossed my mind:

  1. Currently, the new pointer blocks always hold a fixed number of scores (either to more pointer blocks or to data). When a new block is inserted in an existing hash tree (e.g. when writing a new version of an already existing archive) is split in the new archive, it now changes all pointer blocks to right of it in the hash tree. Since we now have variable block sizes anyway, we could as well grow the current pointer block by another score (unless of course it would grow too large). This would mean fewer pointer blocks need to be written when storing a new version of a file. I am not sure how much it really helps though, and whether the added code complexity is warranted.

  2. Rather simple, but currently the size of a subtree is always stored in 8 bytes. This grows the size of one pointer from 20 bytes (the score size) to 28 bytes. The lowest-level pointer blocks, which point to blocks of at most 32KB, will be the most numerous. Thus, perhaps a variable length encoding of the size could be used. But it probably isn’t worth the trouble.

  3. Should the rabin parameters be configurable, and parametrized into the root block of an archive? Probably not, well-choosen defaults should help. At least changing the prime doesn’t make a difference, the sliding window width should be okay too. Only the average block size may be worth parametrizing, perhaps as “data block size” in the root block. This information is only useful for writing venti archives, so in this case, for writing an archive relative to a previous version.

conclusions

It seems the current rabin fingerprinting code works and can be used to write and read venti archives. I still want to test how fast fingerprinting is (both with and without JIT), and whether not using power of two’s as modulo has a noticeable influence on performance.

I am probably forgetting to explain something, please bug me about it on irc or send me an e-mail.