Python

Parsing NCBI GeneInfo in Python

Problem:

You need to parse files in the NCBI GeneInfo format, like those that can be downloaded from the NCBI FTP GENE_INFO directory, in Python. You want to avoid any dependencies.

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python

A simple GFF3 parser in Python

Problem:

You need to parse a GFF3 file containing information about sequence features. You prefer to use a minimal, depedency-free solution instead of importing the GFF3 data into a database right away. However, you need to have a standard-compatible parser

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python

A GeneOntology OBO v1.4 parser in Python

The GeneOntology Consortium provides bulk data download for the GO terms in the OBO v1.2 format.

If you Google GO OBO parser, there is something missing. You can easily find parsers in Perl, parsers in Java, but not even BioPython has a parser in Python. The format itself, however seems like it’s tailor-made for Python’s generator concept. Only a few SLOCs are needed to get it work without storing everything in RAM.

I used this parser in a prototype project that allows to search GO interactively (it’s fast). I’m not sure when/if I’ll publish that, but here is the parser code.

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python

A simple tool for FASTA statistics

The issue

It is surprisingly difficult to compute simple statistics of FASTA files using existing software. I recently needed to compute the nucleotide count and relative GC frequency of a single sequence in FASTA format, but unless you install dependency-heavy native software like FASTX or you develop it by yourself using BioPython or similar, there doesn’t seem to be a simple, dependency-free solution for this simple set of problem.

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python