Mapping STRING aliases to UniProt IDs
In a recent project, I needed to compare STRING records to other PPI databases. This is not as easy as it sounds, because STRING uses KEGG protein identifiers. Fortunately, a list of alias mappings is freely available on the STRING download page.
There's still one major problem left, though: I couldn't find any documentation about the format. It turns out to be fairly simple once you've figured out the basics, so I wrote a reusable Python function that filters the mappings for a given organism and writes the matching (STRING ID, UniProt ID) pairs to a CSV file:
# A STRING alias mapping filterer & converter
# Released under Apache License v2.0
# Copyright (c) 2013 Uli Köhler
# Version 1.0
import gzip

def filterSTRINGAliases(infilename, outfilename, taxonomyFilter):
    """
    Filter & convert STRING aliases to CSV.

    Keyword arguments:
        infilename: The filename of the gzipped STRING aliases download
        outfilename: The CSV file to write the converted & filtered mapping to
        taxonomyFilter: A string containing the NCBI Taxonomy identifier to filter for
    """
    recordCtr = 0
    # Open the gzipped input in text mode ("rt") so each line is a str, not bytes
    with gzip.open(infilename, "rt") as infile, open(outfilename, "w") as outfile:
        for line in infile:
            # Some crude statistics
            recordCtr += 1
            if recordCtr % 1000000 == 0:
                print("Processed %d records..." % recordCtr)
            # This line ensures we're mapping STRING ID --> UniProt ID
            if "_UniProt_AC" not in line: continue
            # Columns: taxonomy ID, STRING ID, alias, source
            parts = line.split()
            if taxonomyFilter != parts[0]: continue
            # Extract and write the alias, one "STRING ID,UniProt ID" pair per line
            print(",".join([parts[1], parts[2]]), file=outfile)
    print("Processed %d records" % recordCtr)

if __name__ == "__main__":
    # Example usage: Filter for Saccharomyces cerevisiae (4932), write to string-aliases.csv
    filterSTRINGAliases("/tmp/protein.aliases.v9.05.txt.gz", "string-aliases.csv", "4932")
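To actually compare STRING records against another PPI database, the CSV has to be read back into some kind of lookup table. Here is a minimal sketch of how that could look. It is not part of the script above: the function name load_string_to_uniprot is made up for illustration, and it assumes the two-column layout (STRING ID, UniProt ID) written by filterSTRINGAliases. Each STRING ID maps to a set, since a protein may be listed with more than one UniProt accession.

```python
import csv
from collections import defaultdict

def load_string_to_uniprot(csv_filename):
    """
    Read a two-column CSV (STRING ID, UniProt ID) into a dict that maps
    each STRING ID to the set of UniProt IDs listed for it.

    Hypothetical helper: assumes the file layout produced by filterSTRINGAliases.
    """
    mapping = defaultdict(set)
    with open(csv_filename, newline="") as infile:
        for row in csv.reader(infile):
            # Skip blank or malformed rows instead of failing on them
            if len(row) != 2:
                continue
            string_id, uniprot_id = row
            mapping[string_id].add(uniprot_id)
    return dict(mapping)

if __name__ == "__main__":
    # Example usage with the file written by the script above
    mapping = load_string_to_uniprot("string-aliases.csv")
    print("Loaded %d STRING IDs" % len(mapping))
```

If you need the opposite direction (UniProt ID to STRING ID), swapping the two variables in the loop is enough.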