A GeneOntology OBO v1.4 parser in Python
The GeneOntology Consortium provides bulk data download for the GO terms in the OBO v1.2 format.
If you search Google for a GO OBO parser, something is missing: you can easily find parsers in Perl and in Java, but not even BioPython has one in Python. The format itself, however, seems tailor-made for Python’s generator concept. Only a few source lines of code are needed to get it to work without storing everything in RAM.
I used this parser in a prototype project that allows searching GO interactively (it’s fast). I’m not sure when or if I’ll publish that, but here is the parser code.
Just copy both functions to your codebase. The code has no dependencies besides collections.defaultdict from the Python standard library.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
A constant-space parser for the GeneOntology OBO v1.2 & v1.4 format
https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html

Version 1.1: Python3 ready & --verbose CLI option
"""
from __future__ import with_statement
from collections import defaultdict

__author__ = "Uli Köhler"
__copyright__ = "Copyright 2013 Uli Köhler"
__license__ = "Apache v2.0"
__version__ = "1.1"

def processGOTerm(goTerm):
    """
    In an object representing a GO term, replace single-element lists with
    their only member.
    Returns the modified object as a dictionary.
    """
    # Copy into a plain dict: the input is a defaultdict, which would
    # silently create missing keys on lookup
    ret = dict(goTerm)
    for key, value in ret.items():
        if len(value) == 1:
            ret[key] = value[0]
    return ret

def parseGOOBO(filename):
    """
    Parses a Gene Ontology dump in OBO v1.2 or v1.4 format.
    Yields each [Term] stanza as a dictionary (see processGOTerm).
    Keyword arguments:
        filename: The filename to read
    """
    with open(filename, "r") as infile:
        currentGOTerm = None
        for line in infile:
            line = line.strip()
            if not line: continue  # Skip empty lines
            if line == "[Term]":
                # A new term starts: emit the previous one, if any
                if currentGOTerm: yield processGOTerm(currentGOTerm)
                currentGOTerm = defaultdict(list)
            elif line == "[Typedef]":
                # Emit the term in progress so it is not lost,
                # then skip the [Typedef] section
                if currentGOTerm: yield processGOTerm(currentGOTerm)
                currentGOTerm = None
            else:  # Not a stanza header
                # Only process if we're inside a [Term] environment
                if currentGOTerm is None: continue
                key, sep, val = line.partition(":")
                currentGOTerm[key].append(val.strip())
        # Add last term
        if currentGOTerm is not None:
            yield processGOTerm(currentGOTerm)

if __name__ == "__main__":
    # Print the number of GO terms in the given GO OBO file,
    # or every term when --verbose is given
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='The input file in GO OBO v1.2 or v1.4 format.')
    parser.add_argument('-v', '--verbose', action="store_true",
                        help='Print all GO items instead of just printing their count')
    args = parser.parse_args()
    # Iterate over GO terms
    termCounter = 0
    if args.verbose:
        for goTerm in parseGOOBO(args.infile):
            print(goTerm)
    else:
        for goTerm in parseGOOBO(args.infile):
            termCounter += 1
        print("Found %d GO terms" % termCounter)
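
One thing to watch out for when consuming the parser's output: because processGOTerm unwraps single-element lists, a field such as is_a can be either a plain string or a list of strings. The following is a minimal sketch of how one might deal with that. The sample terms are hand-written in the shape the parser yields (abridged, for illustration only), and the helpers as_list, parent_ids and build_index are hypothetical names that are not part of the parser itself.

```python
# Hand-written sample in the shape parseGOOBO() yields (abridged, illustrative).
# "is_a" is a list for the first term but a plain string for the second,
# because processGOTerm() unwraps single-element lists.
sample_terms = [
    {
        "id": "GO:0000001",
        "name": "mitochondrion inheritance",
        "namespace": "biological_process",
        "is_a": [
            "GO:0048308 ! organelle inheritance",
            "GO:0048311 ! mitochondrion distribution",
        ],
    },
    {
        "id": "GO:0000002",
        "name": "mitochondrial genome maintenance",
        "namespace": "biological_process",
        "is_a": "GO:0007005 ! mitochondrion organization",
    },
]


def as_list(value):
    """Hypothetical helper: wrap a bare string so callers can always iterate."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def parent_ids(term):
    """Return the GO ids from the term's is_a lines, dropping the '! name' comment."""
    return [entry.split("!")[0].strip() for entry in as_list(term.get("is_a"))]


def build_index(terms):
    """Map each GO id to its term dict. Accepts any iterable, including a generator."""
    return {term["id"]: term for term in terms}


index = build_index(sample_terms)
for go_id, term in sorted(index.items()):
    print(go_id, term["name"], parent_ids(term))
```

In a real script you would pass parseGOOBO(filename) to build_index instead of the sample list. Note that building such an index gives up the constant-space property, since every term is then kept in memory; that is a reasonable trade-off when you need lookups by id.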