About integration of rsync and Debian Martin Pool mbp@samba.org $Date: 2002/05/14 03:10:09 $ Introduction

It seems like there is an rsync thread on

By way of background, I have been the maintainer of rsync for about a year, and I wrote the

I have tried to respond to all the issues raised on the list recently, but possibly have missed some. After putting many hours into rsync I suppose I am quite fond of it and may be a little biased, but I think I also know its strengths and problems more than most.

If there are issues ommitted by this document, or if you think the answers are incomplete, unbalanced, or incorrect, then please mail Background The rsync algorithm

rsync is really two independent but related ideas: the

The rsync algorithm is described in detail in the ANU Technical Report included with the distribution and on the web site.

Briefly, the rsync algorithm provides an efficient means to transfer a file

The algorithm could be embodied in other forms and uses than the rsync program. Indeed, the algorithm has been reimplemented in programs other than rsync itself: in other languages (Rexx, CAML, Java), in a C library (

rsync deltas are much more efficient than diffs, for several reasons: most importantly, useful diff formats include context lines, which are wasted information. rsync gets the same benefit of making sure that the change is applied to a compatible version of the file, but in a much more efficient way. diffs suffer from being human-readable and therefore verbose, and cannot handle binary files. Both can be transparently compressed. diffs are an invaluable tool for software development, but not ideal for distribution of deltas.

The rsync mirroring program is similar in functionality to other tools such as

In general, rsync is substantially faster than other tools, because it uses delta compression, and because the protocol is designed to be efficient both in traffic and round trips. rsync can use

rsync has the important property that it is generally idempotent: repeated runs of rsync, even if interrupted, make the contents of the destination machine converge towards those of the source machine.

Unlike wget and ftpmirror, rsync must be installed on both the client and the server. It uses its own protocol, rather than HTTP or FTP. This is probably the biggest practical drawback at the moment: one cannot use rsync from arbitrary web sites, but only from systems that have specifically chosen to support it. On the other hand, many important sites for free software do now support rsync, and the developers see them as an important constituency to support.

rsync can tunnel through a proxy server that supports rsync can run as a daemon, similar to an /etc/rsyncd.conf file. The archetypal use is to offer public, anonymous, read-only access to a software archive but as with

rsync can also run across a tunnel program such as

rsync may also be used locally, which is a degenerate case with client and server connected across a unix-domain or localhost socket.

rsync runs on many varieties of Unix under both gcc and native compilers, and also under Cygwin on Microsoft platforms. There is somebody working on a VMS port.

For example, you can imagine the server calculating and caching the checksums, or the client calculating the delta once and then sending it over UDP multicast to several destinations, or uuencoded in email. Neither of these would be straightforward to implement in the rsync codebase, but all can be done with rdiff. program, which is a prototype of transparent integration of delta compression into HTTP.

rproxy implements delta compression of arbitrary HTTP content, whether dynamic or static. For example, rproxy gives very good compression on repeated visits to portal sites such as Slashdot, since it can transmit only the sections of the page modified since last time it was viewed. Regular HTTP caches, by contrast, must either hit on the whole page, or reload the whole page, and therefore do poorly on template-based dynamic content.

rproxy adds compatible extensions to HTTP headers, and can pass through HTTP caches. It requires upstream and downstream support, which at the moment means two installations of rproxy, but in the future could conceivably be integrated into Mozilla, Squid, Apache and similar programs.

rproxy is packaged in Debian and is moderately useful for people with slow links.

rproxy is not HTML-specific, but HTML is by far the most common case of dynamic HTTP content that can be delta-compressed. However, HTML documents tend to be fairly small, and as connectivity improves they're becoming of decreasing interest as a compression problem.

(My personal interest in this project declined significantly when I went from a 56k6 modem to ADSL at home. I realize many people in the world still have slow connections.)

There are some Internet Drafts from Jeffrey Mogul and others that add similar delta compression based on the server storing all possible deltas. Last time I looked, there did not seem much interest in wide adoption of these proposals.

Documenting a protocol extension to the standard expected by an RFC can be a lot of work. A beginning on this work has been made for rproxy, but more experiments with the implementation are needed before proposal as a standard.

rproxy is not being actively developed at the moment. Obviously I cannot answer why every programmer in the world does not work on this. Personally I think that developing rsync itself, and then librsync/rdiff, is likely to be more useful; I suspect other people interested in working in this area might have similar thoughts.

I don't think there are any problems in the code or project that would actively prevent anybody from working on it. I'd be happy to hand over maintenance to somebody else. Ben Elliston has expressed interest in looking after it in the last couple of months, and possibly it will be more active in the future. Introduction to Debian

Debian is a free operating system developed by the cooperation of people all over the world. The most important relation of rsync to Debian is copying of software packages from developers to distribution servers to end users.

Many Debian users get their software by internet download from a public server, rather than pre-installed on a machine or on CD-ROM.

Debian software archives include both software source, binary packages for various architectures, and index metadata, primarily the

Because of Debian's community development process, new packages are released very often. On a typical day in Debian's

Debian is quite different from other software distributions in shipping so many packages so frequently. The BSDs (as I understand it) base most of their development out of a CVS or CVSup tree, which inherently distributes coarse-grained deltas. Proprietary systems make binary releases, but much less frequently. Possibly the development branches of other distributions, such as Redhat "Rawhide" and Mandrake "Cooker" are similar.

apt-proxy

Because rsync is less efficient than HTTP for transferring compressed files, The original apt-proxy was written by Paul 'Rusty' Russel. The current maintainer is Chris Halls.

Open Issues

Although delta compression is not possible, rsync is still a very useful tool for mirroring compressed files. The efficient network protocol mean that it will generally be slightly faster and use less traffic than HTTP or FTP.

There is a patch called I believe that this will merge into the upstream The scheme for containing changes is not specific to rsync, and might also be useful with This scheme relies on the output file being determined only by the input file. As far as I know compression schemes like gzip, lzw, and bzip2 are deterministic in this way. (GnuPG, for example, is not, since it uses a random session key.)

Incidentally this would allow more efficient deltas between Debian ISO images.

Alternatively, you could distribute an uncompressed package tree that would rsync efficiently, but since the gzip patch should merge soon this seems unnecessary.

There is also the interesting possibility of using It could be possible to make an equivalent patch to A patch to gzip which implements this behaviour is available from rsync CVS. If it is, then I think we should fix the problems, rather than invent a new system from scratch. I think the scalability problems are accidents of the current codebase, rather than anything inherent in the design.

Some projects (Apache 2.0, Mozilla, ...) choose to do this; some concentrate on piecemeal improvement (Linux).

As Joel Spolsky recently observed, throwing the code out to start from scratch always seems like a nice idea, but rarely works out. Essentially you are opting for the devil you don't know, but he definitely exists.

Having dealt with some fairly crufty old code I sometimes feel like doing this in rsync, but I can also see the arguments against it. Starting from a new codebase would bring in plenty of new bugs (including security bugs), orphan existing users, and halt progress until it caught up to the same level. It would require carefully documenting all the existing internal protocols and behaviours, which is a nice idea but not clearly a good use of the developers' time.

If you throw out the code, you have the questions of whether or not to keep the current network protocol, and whether or not to keep the current command-line option.

Not being able to talk to old remote versions would be a real pain, and would certainly slow adoption. (Think of how unpopular SSH2 was when it wanted to be incompatible with SSH1, but use the same TCP port.) On the other hand, keeping the same network protocol limits your ability to redesign things.

There is some cruft in the command line syntax, but throwing it out would also break everybody's scripts, mental maps, and documentation.

Possibly rsync 3.0 will be largely rewritten. Debian's goals for rsync have to be balanced against the requirements of other users (including the mirror sites that distribute Debian) and the limited development resources.

As of April 2002, the 2.5.5 release is current and 2.5.6 is pending. These versions seem to work quite well and we encourage people to upgrade from stable 2.4.6 and previous versions. It is very important to the developers to establish a stable base release before starting out on new development.

A number of enhancements are planned for the 2.6 series. Some of them are discussed below, and more details can be found in the If you want rsync to progress faster, the best thing you can do is find reproducible failure cases and report them well, or to help us write a regression test suite.

Nobody is very actively working on rproxy or librsync as far as I know. Personally at the moment I feel supporting rsync itself is more useful. Goswin Brederlow has a nice proposal that allows rsync compression using a regular HTTP server. It depends only upon rsyncable archives, and development of new client and support programs.

The idea, as I understand it, is that the checksums for each package should be pre-calculated and stored on the HTTP server along with the files they describe. (The checksum files should be on the order of a thousand times smaller than the archive file.)

The client program will download the checksum file over HTTP if it exists. The client can then search for common blocks in the local file, and therefore determine what new regions it must fetch from the server. The client then sends an HTTP request using the optional This scheme has the great advantage that the server is entirely passive, and only needs to support standard HTTP. The checksum files can be generated once for each package either during upload, or by the administrator of a particular server.

(This is not the same as rproxy, which uses a different encoding and requires upstream support, but can transparently compress any request without requiring signature files on the server.)

This scheme could be fairly easily built on top of rdiff or librsync.

I think this sounds like a very promising scheme. I think rsync might still be better for people who want to copy large trees, but for users apt-getting single packages this would be a simple way to get delta-compression in, pending only --rsyncable files.

I'm not sure if all HTTP servers currently handle Because it requires a special client, and special checksum files stored on the server it has a wider impact than just using rsync. It ought to be done in a non-Debian-specific way. Rather than adding knowledge about the deltas into apt-get itself, we ought to make a separate tool which downloads a delta over HTTP. rsync only compares files with the exact same name

Even for uncompressed files, rsync will not currently try to use Paul Russell has contributed a patch which adds a heuristic to detect this. It will probably be merged in 2.6.

rsync traverses the entire directory tree before it begins copying files. On machines with little virtual memory and a lot of files to copy this can be a problem.

Some problems in this area have been fixed in the 2.5 series. Other solutions involving internal restructuring of the We are not likely to change the general approach of traversing the tree up-front in the near future, because it is tightly tied to the network protocol. It might be attempted in the 3.0 timeframe, but it is not entirely clear that it is really a problem. rsync hangs?

There were a number of deadlock bugs in the past. We believe they are all fixed in 2.5.5, but would welcome good bug reports to the contrary. Some of them were in fact Linux and Solaris kernel bugs, but they've been fixed by the relevant parties quite some time ago.

Reduced network usage is not justified by increased CPU usage

The balance point will vary for each administrator. It is hard to answer this universally. The balance is very different in countries other than the US.

I suspect CPU cycles are falling in price faster than network bandwidth, and for many people unused CPU cycles are wasted but network packets cost money.

rsync with --whole-file (to turn off the delta algorithm and be more comparable to ftp) floods a 100Mbps link. If you have something faster than that to your upstream Debian mirror, you're a lucky person.

It is not inherently impossible to make rsync use sendfile() and mmap(); it's just that most people don't need them. (Apache did not use them by default either last time I looked, but that doesn't mean it's not fast enough for most people.)

rsync is not actively developed

This was a pretty fair accusation a year ago, but I don't think it is true anymore. We have made five releases this year, and the Mail is answered promptly. The web site has been revised and cleaned. We're working on a regression test suite, code cleanup and documentation, and other quality issues.

I'm the most active developer at the moment, but there are a number of other people with experience in the code who regularly commit changes or send patches.

The stability expectations of our users require a somewhat disciplined approach to accepting patches, but I don't think that's a bad thing. The goal is to iterate through The Debian rsync package maintainer, Phil Hands, has been somewhat inactive, but apparently he's back now. Colin Walters has been helping out.

If you have patches that were dropped, please send them through again. Servers should cache deltas

This also smells like premature optimization. It might be useful, but it certainly seems less important than some other tasks. Possible patent on rsync?

It has been suggested that some uses of rsync may conflict with a US patent. I am not sure of the current situation so I won't comment. I don't know of any suggestion that the rsync program as it currently exists infringes any patent.

I do know that rsync has been in use for many years by large and small organization with no complaints of patent infringement.

I would point out that linked lists, mark-and-copy garbage collection, and the Tab key are all patented too. Somebody who always carefully checked first for software patents would never write anything at all.

Debian should store more metadata outside of the package file

I can see some reasons why this might be a good idea, but it doesn't seem particularly relevant to rsync either way.

Debian should distribute diffs or xdeltas between different versions of their packages

A problem with this is that Debian releases packages very often, and so the number of delta files is large. This will be a problem for mirror sites who cannot carry all of the delta files for history, although of course the client could fall back to directly fetching the new package if the relevant deltas are not present.

Fetching a series of deltas which update the same area is less efficient than calculating the deltas directly.

Although this idea does not seem very practical for people using rsync should be used by

Even if there are few gains from compressing Revision history

$Log: rsync-and-debian.sgml,v $ Revision 1.10 2002/05/14 03:10:09 mbp Add note about just transferring Package files. Revision 1.9 2002/05/14 02:35:43 mbp Update notes on apt-proxy based on mail from Chris Halls. Revision 1.8 2002/04/12 02:04:15 mbp More information about Goswin's idea, as I understand it. Revision 1.7 2002/04/12 01:39:52 mbp Lots more details about rproxy. Thanks to Brian May for prompting me to do this. Revision 1.6 2002/04/12 00:43:06 mbp Fix revision history SGML stuff. Revision 1.5 2002/04/12 00:24:50 mbp Fix mailing list name. Add revision history.