Reading the UniProt text format in Python
Just like many other databases in computational biology, the downloads for the popular UniProt database are available in a custom text format which is documented on ExPASy.
While writing a generic parser is certainly not difficult, and BioPython is an option, I believe it is often easier to use a tested parser that depends only on the standard library.
This code provides an easy-to-use, generator-based streaming parser that turns each record into a dictionary-like object keyed by the two-letter line code. Newlines inside fields are always kept. The parser expects a file-like object opened in binary mode and decodes field values as UTF-8.
#!/usr/bin/env python3
"""
A simple reader for the UniProt text format, as downloadable
on http://www.uniprot.org/downloads
"""
import gzip
from collections import defaultdict
__author__ = "Uli Köhler"
__copyright__ = "Copyright 2014, Uli Koehler"
__license__ = "Apache License v2.0"
__version__ = "1.0"
def readUniprot(fin):
    """Given a binary file-like object, generates one dictionary per UniProt entry"""
    lastKey = None  # The last encountered key
    currentEntry = defaultdict(str)
    for line in fin:
        key = line[:2].decode("ascii")
        # Handle the end-of-entry marker
        if key == "//":
            yield currentEntry
            currentEntry = defaultdict(str)
            lastKey = None
            # Do not store the terminator line as a field of the next entry
            continue
        # SQ field does not have a line header except for the first line
        if key == "  ":
            key = lastKey
        lastKey = key
        # Value SHOULD be ASCII, else we assume UTF8
        value = line[5:].decode("utf-8")
        currentEntry[key] += value
    # If there is a non-empty entry left, yield it
    if currentEntry:
        yield currentEntry
if __name__ == "__main__":
    # Example of how to use readUniprot()
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    args = parser.parse_args()
    with gzip.open(args.file, "rb") as infile:
        # readUniprot() yields all documents
        for uniprot in readUniprot(infile):
            print(uniprot)
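Because every field is stored as one string with its newlines kept, you will usually want some light post-processing on the yielded dictionaries. The following is a minimal sketch, not part of the parser itself: the helper names `accessions()` and `sequence()` and the sample entry are made up for illustration. It assumes that the sequence lines are appended to the `SQ` value after the summary line, as the comment in the parser describes.

```python
def accessions(entry):
    """Return the accession numbers from the concatenated AC lines."""
    # AC values look like "P00001; Q00002;" and may span several lines
    return [acc.strip() for acc in entry.get("AC", "").split(";") if acc.strip()]


def sequence(entry):
    """Return the amino acid sequence with all whitespace removed."""
    lines = entry.get("SQ", "").splitlines()
    # The first SQ line is a summary (length, weight, checksum), so skip it
    return "".join("".join(line.split()) for line in lines[1:])


# Hypothetical entry, shaped like the dictionaries yielded by readUniprot()
example = {
    "ID": "EXAMPLE_HUMAN           Reviewed;          20 AA.\n",
    "AC": "P00001; Q00002;\nQ00003;\n",
    "SQ": "SEQUENCE   20 AA;  2000 MW;  0000000000000000 CRC64;\n"
          "MKTAYIAKQR QISFVKSHFS\n",
}

print(accessions(example))
print(sequence(example))
```

Using `entry.get()` instead of indexing avoids adding empty keys to the `defaultdict` when a field is missing.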