In a recent project, I needed to compare STRING records to other PPI databases. However, this is not as easy as it sounds, because STRING uses KEGG protein identifiers. Fortunately, a list of alias mappings is freely downloadable from the STRING download page.
There’s still one major problem left, though: I couldn’t find any documentation about the file format. It turns out to be fairly simple once you’ve figured out the basics, so I wrote a reusable Python function that filters for a given organism and writes a CSV of STRING ID, UniProt ID pairs:
# A STRING alias mapping filterer & converter
# Released under Apache License v2.0
# Copyright (c) 2013 Uli Köhler
# Version 1.0
import gzip

def filterSTRINGAliases(infilename, outfilename, taxonomyFilter):
    """
    Filter & convert STRING aliases to CSV.

    Keyword arguments:
        infilename: The filename of the gzipped STRING aliases download
        outfilename: The CSV file to write the converted & filtered mapping to
        taxonomyFilter: A string containing the NCBI Taxonomy identifier to filter for
    """
    recordCtr = 0
    # Open the gzipped input in text mode ("rt") so every line is a str, not bytes
    with gzip.open(infilename, "rt") as infile, open(outfilename, "w") as outfile:
        for line in infile:
            # Some crude statistics
            recordCtr += 1
            if recordCtr % 1000000 == 0:
                print("Processed %d records..." % recordCtr)
            # This line ensures we're mapping STRING ID --> UniProt ID
            if "_UniProt_AC" not in line:
                continue
            parts = line.split()
            if taxonomyFilter != parts[0]:
                continue
            # Extract and write the alias as one "STRING ID,UniProt ID" row
            print(",".join([parts[1], parts[2]]), file=outfile)
    print("Processed %d records" % recordCtr)

if __name__ == "__main__":
    # Example usage: Filter for Saccharomyces cerevisiae (4932), write to string-aliases.csv
    filterSTRINGAliases("/tmp/protein.aliases.v9.05.txt.gz", "string-aliases.csv", "4932")
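Since the format is undocumented, it may help to see what the function assumes about each line. The following is a minimal sketch, not part of the script above: the sample lines are illustrative and were not copied from the actual download, and the helper names `parse_alias_line` and `load_mapping` are hypothetical. The column order (taxonomy ID, STRING protein ID, alias, alias source) is inferred from how the function indexes `parts`. The second helper shows one possible way to load the resulting CSV for comparison with other PPI databases.

```python
import csv
import io

# Illustrative alias lines (assumed layout, whitespace-separated):
# taxonomy ID, STRING protein ID, alias, alias source
SAMPLE_ALIASES = (
    "4932\tYAL001C\tP34111\tEnsembl_UniProt_AC\n"
    "4932\tYAL001C\tTFC3\tEnsembl_gene_name\n"
    "9606\tENSP00000000233\tP84085\tEnsembl_UniProt_AC\n"
)

def parse_alias_line(line, taxonomy_filter):
    """Return (string_id, uniprot_id), or None if the line is filtered out."""
    # Same checks as the filter function: UniProt accession source, then organism
    if "_UniProt_AC" not in line:
        return None
    parts = line.split()
    if len(parts) < 3 or parts[0] != taxonomy_filter:
        return None
    return parts[1], parts[2]

def load_mapping(csvfile):
    """Read 'STRING ID,UniProt ID' rows into a dict: STRING ID -> set of UniProt IDs."""
    mapping = {}
    for row in csv.reader(csvfile):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        mapping.setdefault(row[0], set()).add(row[1])
    return mapping

# Filter the sample for yeast (4932) and build the CSV text in memory
pairs = [p for p in (parse_alias_line(l, "4932")
                     for l in SAMPLE_ALIASES.splitlines()) if p]
csv_text = "".join("%s,%s\n" % pair for pair in pairs)

# Load the CSV back into a lookup table
mapping = load_mapping(io.StringIO(csv_text))
print(mapping)
```

A set is used for the values because one STRING ID may appear with more than one UniProt accession in the alias file.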