To test the algorithm, tar files were created of the Linux kernel sources for two versions of the kernel. The two kernel versions were 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and are separated by 5 released patch levels.
Out of the 2441 files in the 1.99.10 release, 291 files had changed in the 2.0.0 release, 19 files had been removed and 25 files had been added.
A ``diff'' of the two tar files using the standard GNU diff utility produced over 32 thousand lines of output totalling 2.1 MB.
The following table shows the results for rsync between the two files with a varying block size.2
block | matches | tag | false | data | written | read |
size | hits | alarms | ||||
300 |
64247 | 3817434 | 948 | 5312200 | 5629158 | 1632284 |
500 | 46989 | 620013 | 64 | 1091900 | 1283906 | 979384 |
700 | 33255 | 571970 | 22 | 1307800 | 1444346 | 699564 |
900 | 25686 | 525058 | 24 | 1469500 | 1575438 | 544124 |
1100 | 20848 | 496844 | 21 | 1654500 | 1740838 | 445204 |
In each case, the CPU time taken was less than the time it takes to run ``diff'' on the two files.3
The columns in the table are as follows:
The results demonstrate that for block sizes above 300 bytes, only a small fraction (around 5%) of the file was transferred. The amount transferred was also considerably less than the size of the diff file that would have been transferred if the diff/patch method of updating a remote file was used.
The checksums themselves took up a considerable amount of space, although much less than the size of the data transferred in each case. Each pair of checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
The number of false alarms was less than 1/1000 of the number of true matches, indicating that the 32 bit rolling checksum is quite good at screening out false matches.
The number of tag hits indicates that the second level of the checksum search algorithm was invoked about once every 50 characters. This is quite high because the total number of blocks in the file is a large fraction of the size of the tag hash table. For smaller files we would expect the tag hit rate to be much closer to the number of matches. For extremely large files, we should probably increase the size of the hash table.
The next table shows similar results for a much smaller set of files. In this case the files were not packed into a tar file first. Rather, rsync was invoked with an option to recursively descend the directory tree. The files used were from two source releases of another software package called Samba. The total source code size is 1.7 MB and the diff between the two releases is 4155 lines long totalling 120 kB.
block | matches | tag | false | data | written | read |
size | hits | alarms | ||||
300 |
3727 | 3899 | 0 | 129775 | 153999 | 83948 |
500 | 2158 | 2325 | 0 | 171574 | 189330 | 50908 |
700 | 1517 | 1649 | 0 | 195024 | 210144 | 36828 |
900 | 1156 | 1281 | 0 | 222847 | 236471 | 29048 |
1100 | 921 | 1049 | 0 | 250073 | 262725 | 23988 |