Mapping STRING aliases to UniProt IDs

In a recent project, I needed to compare STRING records with those of other PPI databases. This is not as easy as it sounds, because STRING uses KEGG protein identifiers. Fortunately, a list of alias mappings is freely available on the STRING download page.

One major problem remains, though: I couldn’t find any documentation about the format. It turns out to be fairly simple once you’ve figured out the basics, so I wrote a reusable Python function that filters the aliases for a given organism and writes a CSV of STRING ID, UniProt ID pairs:

# A STRING alias mapping filterer & converter
# Released under Apache License v2.0
# Copyright (c) 2013 Uli Köhler
# Version 1.0
import gzip

def filterSTRINGAliases(infilename, outfilename, taxonomyFilter):
    """
    Filter & convert STRING aliases to CSV.
    Keyword arguments:
        infilename: The filename of the gzipped STRING aliases download
        outfilename: The CSV file to write the converted & filtered mapping to
        taxonomyFilter: A string containing the NCBI Taxonomy identifier to filter for
    """
    recordCtr = 0
    # Open the gzip file in text mode ("rt") so each line is a str, not bytes (Python 3)
    with gzip.open(infilename, "rt") as infile, open(outfilename, "w") as outfile:
        for line in infile:
            # Some crude statistics
            recordCtr += 1
            if recordCtr % 1000000 == 0:
                print("Processed %d records..." % recordCtr)
            # This line ensures we're mapping STRING ID --> UniProt ID
            if "_UniProt_AC" not in line: continue
            parts = line.split()
            # The first field is the NCBI Taxonomy ID
            if taxonomyFilter != parts[0]: continue
            # Extract and write the alias as "STRING ID,UniProt ID", one pair per line
            print(",".join([parts[1], parts[2]]), file=outfile)
    print("Processed %d records" % recordCtr)

if __name__ == "__main__":
    #Example usage: Filter for Saccharomyces cerevisiae (4932), write to string-aliases.csv
    filterSTRINGAliases("/tmp/protein.aliases.v9.05.txt.gz", "string-aliases.csv", "4932")
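
Since the format is undocumented, it may help to spell out what the function above assumes about each line: whitespace-separated fields, with the NCBI Taxonomy ID first, the STRING ID second, the alias third, and a source field containing "_UniProt_AC" for UniProt accessions. The following is a minimal sketch of that per-line logic. The helper name parse_alias_line and the sample lines are illustrative assumptions, not copied from the real download, so check them against your own copy of the file.

```python
def parse_alias_line(line, taxonomy_filter):
    """Return (string_id, uniprot_id) for a matching line, else None.

    Assumed field order (inferred, not documented):
    taxonomy ID, STRING ID, alias, source.
    """
    # Only lines whose source mentions a UniProt accession are of interest
    if "_UniProt_AC" not in line:
        return None
    parts = line.split()
    # Skip malformed lines and lines belonging to other organisms
    if len(parts) < 3 or parts[0] != taxonomy_filter:
        return None
    return parts[1], parts[2]

# Illustrative sample line, not taken from the real download
sample = "4932\tYAL001C\tP34111\tEnsembl_UniProt_AC"
print(parse_alias_line(sample, "4932"))
```

Lines for other organisms, or lines whose source is not a UniProt accession, yield None and would simply be skipped.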