Like many other databases in computational biology, the popular UniProt database offers its downloads in a custom text format, which is documented on ExPASy.
While writing a generic parser is certainly not difficult, and BioPython can also be used, I believe it is often easier to use a tested parser that depends only on the standard library.
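This code provides an easy-to-use, generator-based streaming parser that turns each record into a dictionary-like object (a defaultdict keyed by the two-letter line code, such as ID, AC or SQ). When a line code occurs on several lines, the values are concatenated and the newlines inside the field are always kept. The parser expects a file opened in binary mode and assumes it is UTF-8 encoded.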
#!/usr/bin/env python3
"""
A simple reader for the UniProt text format,
as downloadable on http://www.uniprot.org/downloads
"""
import gzip
from collections import defaultdict

__author__ = "Uli Köhler"
__copyright__ = "Copyright 2014, Uli Koehler"
__license__ = "Apache License v2.0"
__version__ = "1.0"

def readUniprot(fin):
    """Given a binary file-like object, generates uniprot objects"""
    lastKey = None  # The last encountered key
    currentEntry = defaultdict(str)
    for line in fin:
        key = line[:2].decode("ascii")
        # Handle end of entry: yield it and start a new one
        if key == "//":
            yield currentEntry
            currentEntry = defaultdict(str)
            # Skip the terminator line itself, otherwise it would be
            # stored as an empty "//" field in the next entry
            continue
        # SQ field does not have a line header except for the first line
        if key == "  ":
            key = lastKey
        lastKey = key
        # Value SHOULD be ASCII, else we assume UTF8
        value = line[5:].decode("utf-8")
        currentEntry[key] += value
    # If there is a non-empty entry left, yield it
    if currentEntry:
        yield currentEntry

if __name__ == "__main__":
    # Example of how to use readUniprot()
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    args = parser.parse_args()
    with gzip.open(args.file, "rb") as infile:
        # readUniprot() yields all documents
        for uniprot in readUniprot(infile):
            print(uniprot)
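Since every entry is just a mapping from line code to the concatenated field text, pulling out individual values is mostly string handling. The following is a minimal sketch and not part of the original script: `summarizeEntry` is a hypothetical helper name, and `sampleEntry` is a hand-written, made-up record that only imitates the shape of what the parser yields. It assumes that the first token of the ID field is the entry name, that AC holds semicolon-separated accession numbers, and that the first line of SQ is a header followed by the residue blocks.

```python
def summarizeEntry(entry):
    """Return (entry name, list of accessions, sequence) for one parsed record.

    Hypothetical helper: 'entry' is any mapping from two-letter line code
    to the concatenated field text, with newlines kept.
    """
    # ID field: assume the first whitespace-separated token is the entry name
    idTokens = entry.get("ID", "").split()
    name = idTokens[0] if idTokens else None
    # AC field: semicolon-separated accessions, possibly spread over several lines
    accessions = [ac.strip() for ac in entry.get("AC", "").split(";") if ac.strip()]
    # SQ field: the first line is a header, the remaining lines are residue blocks
    _header, _, residues = entry.get("SQ", "").partition("\n")
    # Remove the spaces and newlines that separate the residue blocks
    sequence = "".join(residues.split())
    return name, accessions, sequence


# Made-up record for illustration only (not a real UniProt entry)
sampleEntry = {
    "ID": "EXAMPLE_HUMAN           Reviewed;          20 AA.\n",
    "AC": "P00000; Q00000;\n",
    "SQ": "SEQUENCE   20 AA;\nMAFSAEDVLK EYDRRRRMEA\n",
}

print(summarizeEntry(sampleEntry))
# prints ('EXAMPLE_HUMAN', ['P00000', 'Q00000'], 'MAFSAEDVLKEYDRRRRMEA')
```

Using `entry.get()` rather than indexing avoids silently creating empty fields in the defaultdict when a line code is missing from a record.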