Plan 9 and Inferno at the Google Summer of Code

Manual pages (2007/08/31)

manual pages

yesterday i wrote the missing manual pages for the rabin and vac libraries, and also added some documentation to the venti library. it can all be found in the mercurial repository of course.

Ventisrv fileformat (2007/08/20)

Ventisrv file format

A few minutes ago I added a document to the ventivac hg repository (doc/ventisrv-fileformat.ms, a troff -ms file) about the file format used by ventisrv. The format is not very complex, and the description should be enough to recover data in case of problems such as disk failures or bugs (note that ventisrv only ever appends data to the index/data file, never overwrites it). I also made the file available as ventisrv-fileformat.ps (original postscript) and ventisrv-fileformat.pdf (converted to pdf).

Now getting back to finishing the last bits!

Rabin fingerprints (2007/08/06)

rabin fingerprinting

I have been a bit quiet, time to bring some news! Quite a bit has happened since the previous post. The most important change is the new rabin fingerprinting code, which can now be used by vacput, vacget and vacfs to split files into blocks not at fixed byte offsets, but at content boundaries. The rabin fingerprinting code has actually been in the hg repository for some time, but I recently changed a few of the parameters, making the block boundaries much better, and hadn’t yet explained it. So that’s what this post is about.

the algorithm

The rabin fingerprinting algorithm calculates a rolling checksum over data (a file to store in venti). The window of the data to look at is configurable, but typically a few dozen bytes long. The rabin module reads through a file and lets the window “slide” over the data, recalculating the fingerprint each time it advances a byte. When the fingerprint assumes a special value, the rabin module considers the corresponding window position to be a boundary. All data preceding this window position is taken to be a “block” of the file.

Calculating a checksum relative to a previous one is not very expensive cpu-wise. The operations involved are:

  1. Multiplying the previous checksum by a prime (the prime is a parameter of the algorithm).
  2. Adding the value of the byte that has just come into the sliding window.
  3. Subtracting the value of the byte that has just slid out of the window times the nth power of the prime (where n is the width of the sliding window; the 256 possible values are precalculated during module initialisation).
  4. Taking the modulo of the value by the average desired block size (for now, the average desired block size is required to be a power of two).
  5. Checking whether the new checksum has the special value that makes it a boundary.

This means the algorithm has three parameters: prime, width and modulo. The properties that make this algorithm useful for our purpose are that: 1. the checksums are cheap to calculate; 2. the boundaries depend only on the window contents, so the same boundaries are found no matter where in the file they occur. Thus, when a byte has been prepended to or inserted into a file, this has no influence on how later blocks are formed, as opposed to the case where all block boundaries are at fixed file offsets.

This implementation also allows you to specify a minimum and maximum block size. If a block boundary occurs before the minimum block size has been reached, it is ignored. If no block boundary occurs before the maximum block size is reached, the window position is treated as a block boundary anyway and a new block emitted.
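
To make the arithmetic concrete, here is a minimal sketch of the splitting loop in Python. The real code is Limbo; the parameter values, the restart of the fingerprint at each boundary, and all names here are my assumptions for illustration, not the rabin module’s actual interface:

   # Sketch of the rolling-fingerprint splitter described above.
   PRIME = 31             # the tests described below suggest the exact prime barely matters
   WIDTH = 31             # sliding window width
   MOD = 8 * 1024         # average desired block size, a power of two
   BOUNDARY = MOD - 1     # special fingerprint value marking a boundary
   MINSIZE = 512          # assumed minimum block size
   MAXSIZE = 32 * 1024    # assumed maximum block size
   OUTFACTOR = pow(PRIME, WIDTH, MOD)   # precalculated factor for the byte sliding out

   def split(data):
       """Split a bytes object into content-defined blocks."""
       blocks = []
       start = 0          # offset where the current block begins
       fp = 0             # rolling fingerprint of the last WIDTH bytes
       for i, b in enumerate(data):
           fp = (fp * PRIME + b) % MOD                        # byte slides in
           if i - start >= WIDTH:
               fp = (fp - data[i - WIDTH] * OUTFACTOR) % MOD  # byte slides out
           size = i - start + 1
           if (fp == BOUNDARY and size >= MINSIZE) or size >= MAXSIZE:
               blocks.append(data[start:i + 1])
               start, fp = i + 1, 0   # assumption: fingerprint restarts per block
       if start < len(data):
           blocks.append(data[start:])   # end of file yields a short block
       return blocks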

At first I had set the width of the sliding window to three, but after reading through lots of data (my homedir, ~7GB in size) with different values for the parameters, it was clear that a window width of three is too small: too few unique boundary patterns existed, and the ones that did exist were found too often. It has now been set to 31. The tests showed that the choice of prime number did not matter (for the values I tested). The modulo (average block size) is reasonably accurate, but the minimum and maximum block sizes, and the fact that ends of files result in short blocks, skew the actual average block size a bit. The special value at which to find boundaries had at first been set to zero: not good, since runs of zero bytes then produce a boundary at every window position, causing lots of boundaries on “null” data. Better is to use the modulo minus one, which is the value currently being used.

not there yet

Okay, so we can split a file into blocks in another way. Unfortunately, this cannot be integrated into vacput/vacget/vacfs without changing the data structures of venti archives. A hash tree stored in a venti archive needs all its data blocks to be of a fixed length, typically 8KB. Vacget/vacfs use this assumption when finding the right block to fulfil a read at a given offset in a file; they use it to determine which score in a pointer block to follow to get to the data block.

Now consider what happens when a (random) file offset in a vac+rabin fingerprinting archive should be read. When starting at the top pointer block of a hash tree, there is no way of knowing below which score the requested offset resides. Previously it could be calculated from the depth of the tree, the branching factor, and the leaf (data block) size.
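
For contrast, a minimal sketch of that old fixed-size calculation (dsize and fanout are my illustrative names for the data block size and the number of scores per pointer block):

   # With fixed-size data blocks, finding the right score is pure arithmetic:
   # a pointer block at depth d covers dsize * fanout**(d-1) bytes per score.
   def walk_fixed(offset, depth, dsize, fanout):
       """Yield the child index to follow at each pointer-block level."""
       while depth > 0:
           span = dsize * fanout ** (depth - 1)   # bytes below each score
           yield offset // span
           offset %= span
           depth -= 1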

To solve this, I have changed the format of the pointer blocks. Each score is now accompanied by the size of the subtree the score references. The size referenced by the top score can be found, as usual, in the Entry data structure (which also contains the top score).

When walking from the top down to some file offset, the sizes of the subtrees allow vacget/vacfs to determine which ranges of the file reside below each score in the pointer block.
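
A sketch of the new walk, assuming a hypothetical fetch(score, depth) that returns raw data at depth 0 and a decoded list of (score, size) pairs for pointer blocks:

   # Walking down to a file offset using the per-score subtree sizes.
   def read_at(fetch, score, depth, offset, want):
       """Return at most `want` bytes at `offset`, from a single data block."""
       if depth == 0:
           return fetch(score, 0)[offset:offset + want]
       for sub, size in fetch(score, depth):
           if offset < size:        # the requested offset lies below this score
               return read_at(fetch, sub, depth - 1, offset, want)
           offset -= size           # skip this whole subtree
       return b''                   # offset lies past the end of the file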

more (data structure) changes

The pointer block is not the only thing that has changed. To make sure the two types of data can never be misinterpreted, the new pointer blocks are stored with different “type” numbers in venti: the usual type values plus 16; so 19 for depth 0, 20 for depth 1, and so on.

There is a new root block version, version 3. The maximum data block and pointer block sizes can be left intact (though I am currently writing them with “0” filled in).

The “flags” field in the Entry structure has a new bit (1<<7) to indicate that the entry has variable-size blocks. In theory venti archives can be made with the two types of entries (variable-size blocks or not) mixed.
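
In code form, the constants implied above (values taken directly from this post):

   # New pointer-block type = usual type value + 16 (19 for depth 0, ...).
   TYPE_OFFSET = 16
   # New root block version for these archives.
   ROOT_VERSION = 3
   # Entry "flags" bit indicating variable-size blocks.
   ENTRY_VARBLOCKS = 1 << 7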

more ideas

A few more ideas have crossed my mind:

  1. Currently, the new pointer blocks always hold a fixed number of scores (pointing either to more pointer blocks or to data). When a new block is inserted into an existing hash tree (e.g. when writing a new version of an already existing archive), all pointer blocks to the right of it in the hash tree change. Since we now have variable block sizes anyway, we could just as well grow the current pointer block by another score (unless of course it would grow too large). This would mean fewer pointer blocks need to be written when storing a new version of a file. I am not sure how much it really helps though, and whether the added code complexity is warranted.

  2. Rather simple, but currently the size of a subtree is always stored in 8 bytes. This grows the size of one pointer from 20 bytes (the score size) to 28 bytes. The lowest-level pointer blocks, which point to blocks of at most 32KB, will be the most numerous. Thus, perhaps a variable-length encoding of the size could be used (a possible encoding is sketched after this list). But it probably isn’t worth the trouble.

  3. Should the rabin parameters be configurable, and stored in the root block of an archive? Probably not; well-chosen defaults should help. At least changing the prime doesn’t make a difference, and the sliding window width should be okay too. Only the average block size may be worth parametrizing, perhaps as “data block size” in the root block. This information is only useful for writing venti archives, so in this case, for writing an archive relative to a previous version.
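
As a sketch of idea 2, a base-128 variable-length encoding would store the common small sizes in two or three bytes instead of eight (purely illustrative; nothing like this is in the repository):

   # Encode n with 7 bits per byte, high bit set on all but the last byte.
   def put_size(n):
       out = bytearray()
       while True:
           b = n & 0x7f
           n >>= 7
           if n == 0:
               out.append(b)
               return bytes(out)
           out.append(b | 0x80)

   # Sizes up to 16KB-1 fit in two bytes and up to 2MB-1 in three,
   # instead of a fixed eight bytes per pointer.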

conclusions

It seems the current rabin fingerprinting code works and can be used to write and read venti archives. I still want to test how fast fingerprinting is (both with and without JIT), and whether not using powers of two as the modulo has a noticeable influence on performance.

I am probably forgetting to explain something, please bug me about it on irc or send me an e-mail.

Ventisrv compression (2007/06/20)

Compression

Work done since my previous post: Compression in ventisrv.

As you might know by now, ventisrv stores blocks from 1 byte to 56KB in size on disk. These blocks are immutable: once written, they cannot be moved, removed or modified. The typical block size is 8KB for normal data blocks, and usually smaller for directory entries and pointer blocks (but that’s outside the scope of this post). Venti from Plan 9 and the newer venti from Plan 9 From User Space try to compress a block before writing it to disk, and store it compressed only if compression makes it smaller. This saves disk space for many blocks.
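
The per-block decision those ventis make is simple. A minimal sketch, with Python’s zlib (deflate) standing in purely for illustration (Plan 9’s venti actually uses whack, as explained below):

   import zlib

   # Store a block compressed only when that actually saves space.
   def encode_block(data):
       compressed = zlib.compress(data)
       if len(compressed) < len(data):
           return True, compressed   # flag: stored compressed
       return False, data            # compression didn't help, store raw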

Ventisrv at first did not have the ability to compress blocks, but always stored them ‘raw’. With the changes of this week, the option -c enables compression. It isn’t on by default because the compression speed limits write throughput: compression is expensive (more so in Limbo than in C). Also, if you enable this, make sure you have the just-in-time compiler enabled as well.

Some details

Ventisrv handles compression a bit differently from the other ventis. First, ventisrv uses the deflate/inflate compression algorithms (the ones used by gzip), whereas the other ventis use whack, an algorithm that doesn’t have much documentation but seems to be favourable at least when compressing smaller blocks. Whack isn’t available for inferno though, and deflate/inflate is: easy choice.

The second difference is that the other ventis write a header and the compressed payload for each block, whereas ventisrv gathers multiple venti blocks, writes their headers first, and then the compressed payload (representing data for multiple blocks). The idea is that compressing more data in one go makes the compression history buffer larger, allowing for better compression. A quick test on the data in the venti I have been using for backups for a few months shows this really does increase the compression ratio: gzip on the entire data file resulted in a compressed file of ~68% of the original, compressing up to 64KB of blocks at a time resulted in ~70%, and compressing each block separately in ~79%. The downside to compressing multiple blocks in one go is that all of them (at least up to the one wanted) have to be decompressed to read a single block, and all the compressed data has to be read from disk to do that. Thus, a maximum of ~64KB of raw/uncompressed data per compressed unit has been chosen. This ensures we don’t have to read or (de)compress too much data for each block.
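
A hedged sketch of the batching scheme in Python; the per-block header fields here are invented for illustration, the real on-disk layout is documented in doc/ventisrv-fileformat.ms:

   import zlib

   BATCHMAX = 64 * 1024   # cap on raw data per compressed unit

   # Write the headers for a batch of (score, data) pairs first, then one
   # deflate stream covering the concatenated payload of all the blocks.
   def write_batch(out, blocks):
       assert sum(len(d) for _, d in blocks) <= BATCHMAX
       for score, data in blocks:
           out.write(score)                         # 20-byte sha1 score
           out.write(len(data).to_bytes(2, 'big'))  # uncompressed length
       out.write(zlib.compress(b''.join(d for _, d in blocks)))

   # Reading one block back means decompressing the payload at least up to
   # that block, which is why the batch is capped at BATCHMAX.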

The file format has been “extended” with a new header for compressed blocks. This leaves data files written by previous versions of ventisrv valid. Note that this code is pretty fresh, may have bugs and hasn’t been tested a lot, so use with care (and report bugs back to me please).

There is always room for improvement, but these are mostly details and the benefit is not obvious.

Other news

Another small thing has changed in ventisrv: support for read-only connections. Specifying -r addr makes ventisrv listen on addr and disallow write and sync messages from clients connected via that address. With -w addr, it listens for writable connections.

I’ve also started on some data fingerprinting code, but that is not working yet, so nothing to show. That’s it for now!

Vac tools in inferno (2007/06/15)

news

good news! the vacget, vacput and vacfs code of the ventivac project is now in the inferno-os subversion repository. many thanks to charles forsyth. this makes it very easy to test that code, so please try it out. the patch to /appl/lib/venti.b (mentioned here) that was needed before has been included in inferno-os svn as well, so no more patching is required.

ventisrv and vcache are not in the inferno-os svn, so keep checking out the ventivac hg repository (it will also contain updates to the vac tools).

ventisrv testing

for testing, i’ve set up a ventisrv on gsoc.cat-v.org. note that it is for testing, so don’t store your very important data in it, yet. for example, to write a file to it, try this:

   ; vacput -a net!gsoc.cat-v.org!venti welcome.txt
   vac:a39bc0605cc80c7dee50f02c2ef6c502e740c376

now list the contents of the venti archive vac:a39bc0605cc80c7dee50f02c2ef6c502e740c376:

   ; vacget -t -a net!gsoc.cat-v.org!venti a39bc0605cc80c7dee50f02c2ef6c502e740c376
   ./welcome.txt

and retrieve the contents of the archive:

   ; rm welcome.txt
   ; vacget -v -a net!gsoc.cat-v.org!venti a39bc0605cc80c7dee50f02c2ef6c502e740c376
   ./welcome.txt
   ; ls -l welcome.txt
   --rw-rw---- U 0 mjl mjl 10 Jun 15 12:59 welcome.txt

and now using vacfs:

   ; mkdir /n/vac
   ; mount {vacfs -a net!gsoc.cat-v.org!venti} /n/vac
   ; ls -l /n/vac/a39bc0605cc80c7dee50f02c2ef6c502e740c376
   --rw-rw---- M 4 mjl mjl 10 Jun 15 12:57 /n/vac/a39bc0605cc80c7dee50f02c2ef6c502e740c376/welcome.txt

that’s how to use it, enjoy!

Ventisrv bugfix (2007/06/12)

a short note again. i just fixed a bug in ventisrv that caused it to misread the data file and spew error messages. that has been fixed now in the mercurial repository. there are also a few vacfs changes in there.

so hg pull and try again please!

Ventisrv and more (2007/06/12)

some news on the ventivac front. this post is mostly about how to use ventisrv.b, the (still work in progress) venti server.

most of the code has been cleaned up a bit. not vacfs.b yet though, that one is next. the manual pages have been cleared up a bit. the biggest difference since my previous post has been that ventisrv is now closer to being a useful venti server. it should not be considered stable yet, and the file format might change as well, but i think it’s worth a try now!

so, where to start?

the ventisrv manual page (snapshot version) should offer a decent introduction. if anything is unclear after reading it, or it raises questions, please let me know so i can clarify the manual page.

in short, if you want to run a venti server with a data file of 8 gigabytes and an expected mean blocksize of 8 kilobytes, executing the following commands starts a fresh ventisrv:

   echo -n > data
   echo -n > index # these first two only the first time ;)
   ventisrv -v 8g 8k

data is used to store the data blocks, index keeps the index. ventisrv calculates how much memory it needs based on the arguments passed at startup. specify the verbose option to get it printed.

now vacput your current working directory to the venti:

   vacput .

and use the resulting score to read the files in the vac archive again:

   vacget -t yourscore

it should be as easy as that. if something fails along the way, please let me know so i can update this post with better descriptions, or fix the bug. please read the README in the ventivac hg repository, it has the instructions to patch inferno’s appl/lib/venti.b to make ventisrv work.

a few quick tests with randtest from plan9ports (src/cmd/venti/randtest.c) showed write performance (of new blocks) of 15MB/s and a sequential read performance of over 18MB/s. but the machine was quite fast and i’m not sure it represents an average machine ventisrv might run on. oh, if you want to do a few performance tests yourself, do not specify -D, it kills performance.

what’s next?

well, there is no authentication for connections. that does not really bother me, the scores are a nice and simple capability system. the only real worry i have is someone filling up a public venti server. thus, an address to listen on for read-only connections may do the trick.

compression is missing as well. the easiest would be per-block compression. a “smarter” thing might be to compress larger blocks. i don’t have a well-developed intuition for data compression methods, but i have a hunch that compressing larger blocks (say 128kb) gives the compressor more history to work with and better compression ratios. writing would still be append-only (no problems there), but reading would need special care: for a given file offset, accounting needs to be done to find the location of the 128kb block in the compressed file. if this is a viable option, it may even be done as a sys->file2chan program.

other open issues are listed in the bugs section of the manual page referenced above. solutions (and more problems) are welcome. thanks for listening!

Mercurial (2007/06/06)

just a short note. i put the source code so far in mercurial. it can be found at:

http://gsoc.cat-v.org/hg/ventivac/

so, a hg clone http://gsoc.cat-v.org/hg/ventivac/ my-ventivac should suffice to get a copy. see the README on how to compile and install.

enjoy!

Starting (2007/06/06)

Hi all. Here is a first progress report. To start with, here is some code in action:

http://ueber.net/vac/bc32a6a6f09395131bec9f34d75e7cfda4b1ffc7/

This is an http interface to a venti server. The files are served by vacfs.b, which is a vac file server a lot like vacfs from Plan 9. The difference is that this one can serve arbitrary vac root scores: the root directory appears empty, and walking to a score tries to open that score as a vac root score and serves the contents of that vac. The current (work in progress, with at least one severe bug, be warned!) vacfs.b can be found in the vac archive referenced above. As a side note, the http interface is an scgi program that reads directories and prints out html.

There are some more interesting bits in the archive. But note that these programs are not final, they may be removed/replaced/redesigned later on. And when I put this in the mercurial repository, I will also make a proper directory hierarchy and mkfile. Here is only a summary of the somewhat working programs in the archive. They also have manual pages which have near-complete usage information. The limbo source files contain lists of work to do, and some of the manual pages have lists as well.

Vacfs, as described above, can connect to a venti server and serve any score present at that server. A specific score can also be specified at startup, which will be opened and its contents served in the root directory.

Vacget, lists or retrieves the contents of a venti archive. -t just lists them, -x retrieves them.
Vacput, writes a file tree to venti. The resulting vac root score is printed to standard output. The tree is always written entirely, i.e. it cannot write only changes relative to a previous archive.

Vcache, a venti cache and/or proxy. It can serve as a memory cache, keeping venti blocks in memory. Requests for blocks not present are sent to the authoritative venti server. A proxy can also be specified: before reading data from the venti server, the proxy (a venti server, typically one nearby) is queried for the data. Data received from the authoritative venti server is written to the proxy server. The proxy can also be used as a write-through server: all writes (which normally go only to the authoritative server) go to the proxy simultaneously, and the operation succeeds only when both servers respond okay.
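
In rough Python pseudocode, the read path just described (all objects and methods are hypothetical stand-ins for the Limbo code):

   # Memory cache first, then the nearby proxy, then the authoritative
   # server, populating both caches on the way back.
   def read_block(score, cache, proxy, server):
       data = cache.get(score)
       if data is not None:
           return data
       data = proxy.read(score)        # nearby venti server, may miss
       if data is None:
           data = server.read(score)   # authoritative venti server
           proxy.write(score, data)    # populate the proxy for next time
       cache[score] = data             # populate the memory cache
       return data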

Vread and Vwrite, read the block for a single score or write a single data block from/to a venti. Mostly for debugging.
Vparse, parses vac root scores, metablocks, entries, etc.
Ventry, parses and prints a given entry in a file with many entries.

If you want to play around with some of this code, you also need these changes to inferno’s appl/lib/venti.b:

http://ueber.net/vac/78d72a3daf13c934f4b161e0268feffefc232f7d/venti.b.diff

That’s it for now. Next on the todo list is setting up the hg repository, and working on the venti server.