A GeneOntology OBO v1.4 parser in Python

The GeneOntology Consortium provides bulk data downloads of the GO terms in the OBO v1.2 format.

If you Google “GO OBO parser”, something is missing: you can easily find parsers in Perl and in Java, but not even BioPython has one in Python. The format itself, however, seems tailor-made for Python’s generator concept. Only a few SLOCs are needed to get it to work without storing everything in RAM.

I used this parser in a prototype project that allows searching GO interactively (it’s fast). I’m not sure when/if I’ll publish that, but here is the parser code.

Just copy both functions into your codebase. They have no dependencies besides collections.defaultdict from the Python standard library.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
A constant-space parser for the GeneOntology OBO v1.2 & v1.4 format

https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html

Version 1.1: Python3 ready & --verbose CLI option
"""
from __future__ import with_statement
from collections import defaultdict

__author__    = "Uli Köhler"
__copyright__ = "Copyright 2013 Uli Köhler"
__license__   = "Apache v2.0"
__version__   = "1.1"

def processGOTerm(goTerm):
    """
    In an object representing a GO term, replace single-element lists with
    their only member.
    Returns the modified object as a dictionary.
    """
    ret = dict(goTerm) #Input is a defaultdict, which might exhibit unexpected behaviour
    for key, value in ret.items():
        if len(value) == 1:
            ret[key] = value[0]
    return ret

def parseGOOBO(filename):
    """
    Parses a Gene Ontology dump in OBO v1.2 or v1.4 format.
    Yields each [Term] stanza as a dictionary (see processGOTerm).
    Keyword arguments:
        filename: The filename to read
    """
    with open(filename, "r") as infile:
        currentGOTerm = None
        for line in infile:
            line = line.strip()
            if not line: continue #Skip empty
            if line == "[Term]":
                if currentGOTerm: yield processGOTerm(currentGOTerm)
                currentGOTerm = defaultdict(list)
            elif line == "[Typedef]":
                #Yield the pending term, then skip the [Typedef] section
                if currentGOTerm: yield processGOTerm(currentGOTerm)
                currentGOTerm = None
            else: #Not [Term]
                #Only process if we're inside a [Term] environment
                if currentGOTerm is None: continue
                key, sep, val = line.partition(":")
                currentGOTerm[key].append(val.strip())
        #Add last term
        if currentGOTerm is not None:
            yield processGOTerm(currentGOTerm)

if __name__ == "__main__":
    #Print out the number of GO objects in the given GO OBO file
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='The input file in GO OBO v1.2 format.')
    parser.add_argument('-v', '--verbose', action="store_true",
                        help='Print all GO items instead of just printing their count')
    args = parser.parse_args()
    #Iterate over GO terms
    termCounter = 0
    if args.verbose:
        for goTerm in parseGOOBO(args.infile):
            print(goTerm)
    else:
        for goTerm in parseGOOBO(args.infile):
            termCounter += 1
        print("Found %d GO terms" % termCounter)
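
One thing to keep in mind when consuming the yielded dictionaries: processGOTerm collapses single-element lists, so a tag such as is_a may arrive as a plain string (one parent), as a list (several parents), or not at all. The following is a minimal sketch of how that could be handled. The helper names as_list and parent_ids are hypothetical and not part of the parser above, and the sample dictionaries are abbreviated, hand-written stand-ins for parser output. Splitting on "!" to drop the trailing comment is a simplifying assumption that ignores escaping rules.

```python
def as_list(value):
    """Return value as a list, whether or not it was collapsed to a single item.

    Hypothetical helper, not part of the parser above.
    """
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def parent_ids(term):
    """Extract the parent GO IDs from a term's is_a values.

    Assumes each value looks like "GO:0048308 ! organelle inheritance" and
    simply cuts at the first "!" (a simplification of the OBO comment rules).
    """
    return [v.split("!", 1)[0].strip() for v in as_list(term.get("is_a"))]


# Abbreviated sample dictionaries shaped like the parser's output
terms = [
    {"id": "GO:0000001", "name": "mitochondrion inheritance",
     "is_a": ["GO:0048308 ! organelle inheritance",
              "GO:0048311 ! mitochondrion distribution"]},
    {"id": "GO:0000002", "name": "mitochondrial genome maintenance",
     "is_a": "GO:0007005 ! mitochondrion organization"},
    {"id": "GO:0008150", "name": "biological_process"},
]

# Map each GO ID to the list of its parent IDs
parents = {term["id"]: parent_ids(term) for term in terms}
for go_id, ids in parents.items():
    print(go_id, ids)
```

With real data, the terms list would be replaced by the generator returned from parseGOOBO, so the index can be built in a single pass over the file.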