
Results

To test the algorithm, tar files were created of the Linux kernel sources for two versions of the kernel, 1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in the 2.0.0 release, 19 files had been removed and 25 files had been added.

A ``diff'' of the two tar files using the standard GNU diff utility produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files with a varying block size.



block size   matches   tag hits   false alarms      data   written      read
       300     64247    3817434            948   5312200   5629158   1632284
       500     46989     620013             64   1091900   1283906    979384
       700     33255     571970             22   1307800   1444346    699564
       900     25686     525058             24   1469500   1575438    544124
      1100     20848     496844             21   1654500   1740838    445204


In each case, the CPU time taken was less than the time it takes to run ``diff'' on the two files.

The columns in the table are as follows:

block size
The size in bytes of the checksummed blocks.
matches
The number of times a block of B was found in A.
tag hits
The number of times the 16-bit hash of the rolling checksum matched a hash of one of the checksums from B (the rolling checksum and tag computation are sketched after this list).
false alarms
The number of times the 32-bit rolling checksum matched but the strong checksum didn't.
data
The amount of file data transferred verbatim, in bytes.
written
The total number of bytes written by α, including protocol overheads. This is almost all file data.
read
The total number of bytes read by α, including protocol overheads. This is almost all checksum information.
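
To make the tag hits and false alarms columns concrete, here is a minimal C sketch of the rolling checksum and the 16-bit tag derived from it. The s1 and s2 components correspond to the a and b of the paper's rolling checksum with M = 2^16; the (s1 + s2) mod 2^16 tag function, and all names, are illustrative assumptions rather than details fixed by this section.

    #include <stdint.h>

    /* Rolling checksum state: s1 is the sum of the bytes in the
     * current window, s2 is the sum of the successive values of s1;
     * both are kept modulo 2^16.                                    */
    typedef struct { uint16_t s1, s2; } rollsum;

    /* The 16-bit tag used to index the hash table of block
     * checksums; one simple choice is (s1 + s2) mod 2^16.           */
    static uint16_t tag(rollsum r) { return (uint16_t)(r.s1 + r.s2); }

    /* Initialise the checksum over a block of len bytes.            */
    static rollsum roll_init(const uint8_t *p, int len) {
        rollsum r = { 0, 0 };
        for (int i = 0; i < len; i++) {
            r.s1 += p[i];
            r.s2 += r.s1;
        }
        return r;
    }

    /* Slide the window one byte: drop `out`, add `in`.  This O(1)
     * update is what makes it cheap to evaluate the checksum, and
     * hence to attempt a match, at every byte offset.               */
    static void roll(rollsum *r, uint8_t out, uint8_t in, int len) {
        r->s1 += (uint16_t)(in - out);
        r->s2 += (uint16_t)(r->s1 - (uint16_t)(len * out));
    }

A tag hit occurs when the tag of the current window indexes an occupied slot in the hash table of block checksums; a false alarm occurs when the full 32-bit checksum then agrees but the strong checksum does not.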

The results demonstrate that for block sizes above 300 bytes, only a small fraction (around 5%) of the file was transferred. The amount transferred was also considerably less than the size of the diff file that would have been transferred if the diff/patch method of updating a remote file had been used.

The checksums themselves took up a considerable amount of space, although much less than the size of the data transferred in each case. Each pair of checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
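
As a rough consistency check, taking the file size as approximately 24 MB and the block size as 700 bytes:

$$\frac{24{,}000{,}000}{700} \approx 34{,}300 \text{ blocks}, \qquad 34{,}300 \times 20 \approx 686{,}000 \text{ bytes}$$

which agrees well with the 699,564 bytes read at that block size, the difference being protocol overhead.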

The number of false alarms was a tiny fraction of the number of true matches: about 1/70 at a block size of 300 bytes and between roughly 1/700 and 1/1500 at the larger block sizes. This indicates that the 32-bit rolling checksum is quite good at screening out false matches.
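
A crude model makes the screening effect plausible: if the 16 bits of the rolling checksum not covered by the tag were uniformly distributed on unrelated data, each tag hit would survive the 32-bit comparison with probability about $2^{-16}$, predicting

$$620{,}013 \times 2^{-16} \approx 9$$

false alarms at a block size of 500, against 64 observed. The simple rolling checksum is not perfectly uniform, so the observed counts run several times above this estimate, but they remain negligible beside the 46,989 true matches.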

The number of tag hits indicates that the second level of the checksum search algorithm was invoked about once every 50 characters. This is quite high because the total number of blocks in the file is a large fraction of the size of the tag hash table, which has one slot for each possible 16-bit tag (65,536 entries). For smaller files we would expect the tag hit rate to be much closer to the number of matches. For extremely large files, we should probably increase the size of the hash table.
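
To show where the tag hit, false alarm and match counts come from, here is a minimal sketch of the three-level search, assuming a 65,536-entry table that maps each tag to the first entry in a tag-sorted list of block checksums. The layout, the names, and the md4() helper are illustrative assumptions, not details fixed by the paper.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Any MD4 implementation will do; one is assumed available.    */
    extern void md4(const uint8_t *buf, size_t len, uint8_t digest[16]);

    struct sum_buf {
        uint32_t sum;          /* 32-bit rolling checksum of a block */
        uint8_t  strong[16];   /* 128-bit MD4 checksum of the block  */
        uint16_t tag;          /* 16-bit tag derived from sum        */
    };

    #define TABLE_SIZE 65536   /* one slot per possible 16-bit tag   */

    /* tag_table[t] holds the index of the first entry in sums[]
     * (sorted by tag) whose tag is t, or -1 if no block has tag t.
     * With roughly 34,000 blocks and 65,536 slots, about half the
     * slots are occupied, which is why the second level fires so
     * often on a 24 MB file.                                        */
    static int32_t tag_table[TABLE_SIZE];

    /* Return the index of the matching block, or -1.  sum32 is the
     * rolling checksum of the current window; the strong checksum
     * is computed lazily, only after the cheap 32-bit test passes.  */
    int find_block(uint32_t sum32, const uint8_t *window, size_t len,
                   const struct sum_buf *sums, int nblocks)
    {
        /* (s1 + s2) mod 2^16, matching the earlier sketch.          */
        uint16_t t = (uint16_t)((sum32 & 0xFFFF) + (sum32 >> 16));
        int32_t i = tag_table[t];
        if (i < 0)
            return -1;                      /* not even a tag hit     */
        for (; i < nblocks && sums[i].tag == t; i++) {
            if (sums[i].sum != sum32)
                continue;                   /* screened by 32-bit sum */
            uint8_t digest[16];
            md4(window, len, digest);
            if (memcmp(digest, sums[i].strong, 16) == 0)
                return i;                   /* a true match           */
            /* otherwise: a "false alarm" in the tables above        */
        }
        return -1;
    }

Reaching the body of the i >= 0 branch corresponds to the tag hits column, and a 32-bit match that fails the memcmp() is a false alarm; computing the strong checksum only after the cheap tests pass helps keep the CPU cost low.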

The next table shows similar results for a much smaller set of files. In this case the files were not packed into a tar file first. Rather, rsync was invoked with an option to recursively descend the directory tree. The files used were from two source releases of another software package called Samba. The total source code size is 1.7 MB and the diff between the two releases is 4155 lines long totalling 120 kB.



block size   matches   tag hits   false alarms      data   written      read
       300      3727       3899              0    129775    153999     83948
       500      2158       2325              0    171574    189330     50908
       700      1517       1649              0    195024    210144     36828
       900      1156       1281              0    222847    236471     29048
      1100       921       1049              0    250073    262725     23988


