A simple GFF3 parser in Python

Problem:

You need to parse a GFF3 file containing information about sequence features. You prefer to use a minimal, depedency-free solution instead of importing the GFF3 data into a database right away. However, you need to have a standard-compatible parser

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python

Mapping STRING aliases to UniProt IDs

In a recent project, I needed to compare STRING records to other PPI databases. However, this is not always as easy as it sounds, because STRING uses KEGG protein identifiers. Fortunately, at the STRING download page, a list of alias mappings is freely downloadable.

There’s still one major problem left, though: I couldn’t find any documentation about the format. It seems to be somewhat easy once you’ve figured out the basics, but I created a reusable Python function that filters a given organism and outputs a STRING ID, UniProt ID CSV:

Continue reading →

Posted by Uli Köhler in Allgemein

Filtering STRING PPI dumps by taxonomy

Recently I needed to filter a STRING protein-view database dump (e.g. protein.links.full.v9.05.txt.gz) by taxonomy ID. The original dataset was way too large (it had more than 670 million records).

In order to filter with constant memory (After all, the full STRING dump is 47GB large), I created this script that allows to filter for binary PPIs both matching the given organism (NCBI taxonomy ID), but also allows to filter for binary PPIs with at least one interacting protein of the given organism. Usually this doesn’t really make a difference for STRING.

Continue reading →

Posted by Uli Köhler in Allgemein

A GeneOntology OBO v1.4 parser in Python

The GeneOntology Consortium provides bulk data download for the GO terms in the OBO v1.2 format.

If you Google GO OBO parser, there is something missing. You can easily find parsers in Perl, parsers in Java, but not even BioPython has a parser in Python. The format itself, however seems like it’s tailor-made for Python’s generator concept. Only a few SLOCs are needed to get it work without storing everything in RAM.

I used this parser in a prototype project that allows to search GO interactively (it’s fast). I’m not sure when/if I’ll publish that, but here is the parser code.

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python

Bash: TCP/Internet connection handling

There’re many problems nowadays, which could easier be solved through the internet. In this post I descripe how to address these problems by bash alone.

The standard way to connect to a server in the internet, is to embed the connection stream

exec 3<>/dev/tcp/$server/$ircPort || echo "Some text or doing, if connecting failed"

Continue reading →

Posted by Yann Spöri in Allgemein

gtf2gff.py: A replacement for gtf2gff.pl

Recently we had to work with the gtf2gff.pl tool to convert CONTRAST and TwinScan GTF output to the GFF format which can be read by many annotation tools.

Working with that script was really hard, it did not report errors at all, plus it is not programmatically reusable at all. There are different versions of the perl script on the internet, but what we needed was a standardized, short, readable version that does proper command line parsing using a standard tool like argparse and a conversion function that is usable from other scripts.

Continue reading →

Posted by Uli Köhler in Allgemein

A simple tool for FASTA statistics

The issue

It is surprisingly difficult to compute simple statistics of FASTA files using existing software. I recently needed to compute the nucleotide count and relative GC frequency of a single sequence in FASTA format, but unless you install dependency-heavy native software like FASTX or you develop it by yourself using BioPython or similar, there doesn’t seem to be a simple, dependency-free solution for this simple set of problem.

Continue reading →

Posted by Uli Köhler in Bioinformatics, Python