Reading the UniProt text format in Python

Just like many other databases in computational biology, the downloads for the popular UniProt database are available in a custom text format which is documented on ExPASy.

While writing a generic parser is certainly not difficult, and one can also use BioPython, I believe it is often easier to use a tested parser that depends only on the standard library.

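This code provides an easy-to-use, generator-based streaming parser that turns each record into a dictionary-like object mapping two-letter line codes (such as ID or SQ) to their text. Newlines inside fields are always kept. The parser expects a file-like object opened in binary mode and assumes the content is UTF-8 encoded.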
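Because newlines inside fields are kept, a parsed entry usually needs a little post-processing before use. The following is a minimal sketch of that step, not part of the parser itself. The `entry` dictionary is a hand-written stand-in for one yielded record with made-up values, and the helper names `accessions` and `sequence` are hypothetical.

```python
# Hand-written stand-in for one parsed entry: two-letter line codes map to
# the concatenated line contents, with newlines kept.
# All values below are made up for illustration, not real UniProt data.
entry = {
    "ID": "EXAMPLE_HUMAN           Reviewed;          20 AA.\n",
    "AC": "P00000; Q00000;\n",
    "SQ": "SEQUENCE   20 AA;  2000 MW;  0000000000000000 CRC64;\n"
          "MKTAYIAKQR QISFVKSHFS\n",
}


def accessions(entry):
    """Return the accession numbers from the AC field as a list."""
    # AC lines hold semicolon-terminated accessions, possibly over several lines
    return [acc.strip() for acc in entry.get("AC", "").split(";") if acc.strip()]


def sequence(entry):
    """Return the amino acid sequence from the SQ field without whitespace."""
    lines = entry.get("SQ", "").splitlines()
    # The first SQ line is a summary header; the remaining lines are residues
    return "".join("".join(line.split()) for line in lines[1:])


print(accessions(entry))
print(sequence(entry))
```

Both helpers use `.get()` rather than indexing, so they do not add empty keys to the entry when a field is absent.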

#!/usr/bin/env python3
"""A simple reader for the UniProt text format."""
import gzip
from collections import defaultdict

__author__ = "Uli Köhler"
__copyright__ = "Copyright 2014, Uli Koehler"
__license__ = "Apache License v2.0"
__version__ = "1.0"

def readUniprot(fin):
    """Given a file-like object, generates uniprot objects"""
    lastKey = None  # The last encountered key
    currentEntry = defaultdict(str)
    for line in fin:
        key = line[:2].decode("ascii")
        #Handle new entry
        if key == "//":
            yield currentEntry
            currentEntry = defaultdict(str)
            #The terminator line carries no data, so do not store it
            continue
        #SQ field does not have a line header except for the first line
        if key == "  ":
            key = lastKey
        lastKey = key
        #Value SHOULD be ASCII, else we assume UTF8
        value = line[5:].decode("utf-8")
        currentEntry[key] += value
    #If there is a non-empty entry left, yield it
    if currentEntry:
        yield currentEntry

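if __name__ == "__main__":
    #Example of how to use readUniprot()
    import argparse
    parser = argparse.ArgumentParser()
    #Path to a gzip-compressed UniProt text file (argument name is illustrative)
    parser.add_argument("file", help="gzip-compressed UniProt text file")
    args = parser.parse_args()
    #gzip.open() in "rb" mode yields bytes lines, as readUniprot() expects
    with gzip.open(args.file, "rb") as infile:
        #readUniprot() yields all documents
        for uniprot in readUniprot(infile):
            #Illustrative loop body: print each parsed entry
            print(uniprot)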