Next: Availability Up: The rsync algorithm Previous: Pipelining

Results

To test the algorithm, tar files were created of the Linux kernel sources for two versions of the kernel. The two kernel versions were 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in the 2.0.0 release, 19 files had been removed and 25 files had been added.

A ``diff'' of the two tar files using the standard GNU diff utility produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files with a varying block size.²

block matches tag false data written read

size hits alarms

300
64247 3817434 948 5312200 5629158 1632284

500 46989 620013 64 1091900 1283906 979384

700 33255 571970 22 1307800 1444346 699564

900 25686 525058 24 1469500 1575438 544124

1100 20848 496844 21 1654500 1740838 445204

In each case, the CPU time taken was less than the time it takes to run ``diff'' on the two files.³

The columns in the table are as follows:

block size: The size in bytes of the checksummed blocks.
matches: The number of times a block of B was found in A.
tag hits: The number of times the 16 bit hash of the rolling checksum matched a hash of one of the checksums from B.
false alarms: The number of times the 32 bit rolling checksum matched but the strong checksum didn't.
data: The amount of file data transferred verbatim, in bytes.
written: The total number of bytes written by $\alpha$ including protocol overheads. This is almost all file data.
read: The total number of bytes read by $\alpha$ including protocol overheads. This is almost all checksum information.

The results demonstrate that for block sizes above 300 bytes, only a small fraction (around 5%) of the file was transferred. The amount transferred was also considerably less than the size of the diff file that would have been transferred if the diff/patch method of updating a remote file was used.

The checksums themselves took up a considerable amount of space, although much less than the size of the data transferred in each case. Each pair of checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes for the 128-bit MD4 checksum.

The number of false alarms was less than 1/1000 of the number of true matches, indicating that the 32 bit rolling checksum is quite good at screening out false matches.

The number of tag hits indicates that the second level of the checksum search algorithm was invoked about once every 50 characters. This is quite high because the total number of blocks in the file is a large fraction of the size of the tag hash table. For smaller files we would expect the tag hit rate to be much closer to the number of matches. For extremely large files, we should probably increase the size of the hash table.

The next table shows similar results for a much smaller set of files. In this case the files were not packed into a tar file first. Rather, rsync was invoked with an option to recursively descend the directory tree. The files used were from two source releases of another software package called Samba. The total source code size is 1.7 MB and the diff between the two releases is 4155 lines long totalling 120 kB.

block matches tag false data written read

size hits alarms

300
3727 3899 0 129775 153999 83948

500 2158 2325 0 171574 189330 50908

700 1517 1649 0 195024 210144 36828

900 1156 1281 0 222847 236471 29048

1100 921 1049 0 250073 262725 23988

Next: Availability Up: The rsync algorithm Previous: Pipelining

Andrew Tridgell
1998-11-09

block	matches	tag	false	data	written	read
size		hits	alarms
300	64247	3817434	948	5312200	5629158	1632284
500	46989	620013	64	1091900	1283906	979384
700	33255	571970	22	1307800	1444346	699564
900	25686	525058	24	1469500	1575438	544124
1100	20848	496844	21	1654500	1740838	445204