Next: Rolling checksum
Up: The rsync algorithm
Previous: The problem
Suppose we have two general purpose computers
and .
Computer
has access to a file A and
has access to
file B, where A and B are ``similar''. There is a slow
communications link between
and .
The rsync algorithm consists of the following steps:
- 1.
-
splits the file B into a series of non-overlapping
fixed-sized blocks of size S bytes1.
The last block may be shorter than S bytes.
- 2.
- For each of these blocks
calculates two checksums:
a weak ``rolling'' 32-bit checksum (described below) and a strong
128-bit MD4 checksum.
- 3.
-
sends these checksums to .
- 4.
-
searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same
weak and strong checksum as one of the blocks of B. This can be
done in a single pass very quickly using a special property of the
rolling checksum described below.
- 5.
-
sends
a sequence of instructions for
constructing a copy of A. Each instruction is either a reference
to a block of B, or literal data. Literal data is sent only for
those sections of A which did not match any of the blocks of B.
The end result is that
gets a copy of A, but only the pieces
of A that are not found in B (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also only requires one round trip, which minimises the impact of the
link latency.
The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.
Next: Rolling checksum
Up: The rsync algorithm
Previous: The problem
Andrew Tridgell
1998-11-09