Accessing NCBI FTP via rsync

A little-known way of accessing data on the NCBI FTP servers is by using rsync. This method was first mentioned in this mailing list post in 2004.

Using rsync instead of FTP has a few key advantages:

  • Fully incremental downloads
  • Resumable downloads
  • Faster than FTP
  • Only a single connection, so no problems with FTP active/passive ports (important, for example, behind dual-stack lite; see this excellent post in German)

Accessing the servers via rsync is straightforward. To download the GenBank GSS files into the genbank directory, for example, you could use this command:

rsync --partial --progress -av 'ftp.ncbi.nlm.nih.gov::genbank/gbgss*.gz' genbank/

The --progress flag displays detailed progress and download speed information, whereas the --partial flag keeps partially transferred files so that interrupted downloads can be resumed (by default, rsync deletes a partially transferred file when the transfer is interrupted). Note that -P is shorthand for both flags combined.
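As a minimal sketch, the same download can be written with -P in place of --partial --progress. The function name fetch_gss is a hypothetical helper introduced only for this illustration; the server, module, and file pattern are the ones from the command above.

```shell
# Sketch only: fetch_gss is a hypothetical helper name, not part of rsync.
fetch_gss() {
    # Make sure the target directory exists before syncing into it.
    mkdir -p genbank
    # -P is shorthand for --partial --progress.
    # The source is quoted so the local shell leaves the wildcard alone
    # and the rsync server expands it instead.
    rsync -avP 'ftp.ncbi.nlm.nih.gov::genbank/gbgss*.gz' genbank/
}
```

Calling fetch_gss again after an interruption should pick up where the previous run stopped, since the partial files are kept.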

Once the download has finished (or partially finished), the genbank directory can be synced with the NCBI directory by simply repeating the command above. Files that have already been downloaded completely are skipped automatically.

If you want to sync files that are regularly updated, keep in mind that by default rsync decides whether a file has changed by comparing only its size and modification time. If a file has been modified but both stay the same, rsync will not update it. To force a comparison of the full file contents, add the --checksum option to the command.
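A re-sync with content checking could look like the sketch below. The function name verify_gss is hypothetical and exists only for this example; everything else reuses the command from above with --checksum added.

```shell
# Sketch only: verify_gss is a hypothetical helper name, not part of rsync.
verify_gss() {
    # --checksum compares file checksums instead of using rsync's default
    # quick check. Every file has to be read in full on both sides, so
    # expect this to be noticeably slower than a normal run.
    rsync -avP --checksum 'ftp.ncbi.nlm.nih.gov::genbank/gbgss*.gz' genbank/
}
```

Because of the extra I/O, it is probably best to use this only for occasional verification and keep the plain command for routine syncs.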