Accessing NCBI FTP via rsync

A little-known way of accessing data on the NCBI FTP servers is by using rsync. This method was first mentioned in this mailing list post in 2004.

Using rsync instead of ftp has a few key advantages

  • Fully incremental downloads
  • Resumable downloads
  • Faster than FTP
  • Only a single connection, no problems with FTP active/passive ports (e.g. important for dual-stack lite, e.g. see this excellent post in german )

Read more

A GeneOntology OBO v1.4 parser in Python

The GeneOntology Consortium provides bulk data download for the GO terms in the OBO v1.2 format.

If you Google GO OBO parser, there is something missing. You can easily find parsers in Perl, parsers in Java, but not even BioPython has a parser in Python. The format itself, however seems like it’s tailor-made for Python’s generator concept. Only a few SLOCs are needed to get it work without storing everything in RAM.

I used this parser in a prototype project that allows to search GO interactively (it’s fast). I’m not sure when/if I’ll publish that, but here is the parser code.

Read more

A simple tool for FASTA statistics

The issue

It is surprisingly difficult to compute simple statistics of FASTA files using existing software. I recently needed to compute the nucleotide count and relative GC frequency of a single sequence in FASTA format, but unless you install dependency-heavy native software like FASTX or you develop it by yourself using BioPython or similar, there doesn’t seem to be a simple, dependency-free solution for this simple set of problem.

Read more