Recently I needed to filter a STRING protein-view database dump (e.g. protein.links.full.v9.05.txt.gz) by taxonomy ID. The original dataset was far too large to work with directly: it contains more than 670 million records.
To filter in constant memory (the full STRING dump is 47 GB, after all), I wrote this script. It can keep only the binary PPIs where both interacting proteins belong to the given organism (NCBI taxonomy ID), or alternatively all binary PPIs where at least one interacting protein belongs to it. For STRING, the two modes usually don't make a difference.
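The two filter modes come down to a simple check on the taxonomy prefixes of the two protein identifiers. Here is a minimal sketch, assuming identifiers of the form `<taxonomy ID>.<protein name>` in the first two whitespace-separated columns. The function name `passes_filter` and the sample lines are made up for illustration and are not taken from a real dump.

```python
def passes_filter(line, organism, match_both=True):
    """Return True if a STRING link line passes the organism filter.

    Assumes the first two columns look like "<taxid>.<protein>".
    """
    parts = line.split()
    # The taxonomy ID is everything before the first dot
    organism_a = parts[0].partition(".")[0]
    organism_b = parts[1].partition(".")[0]
    if match_both:
        return organism_a == organism and organism_b == organism
    return organism_a == organism or organism_b == organism


# Hypothetical sample records (illustrative only)
same = "4932.YAL001C 4932.YBR123C 900"
mixed = "4932.YAL001C 9606.ENSP00000000233 900"

print(passes_filter(same, "4932"))                     # True
print(passes_filter(mixed, "4932"))                    # False
print(passes_filter(mixed, "4932", match_both=False))  # True
```

Note that the organism is compared as a string, so no integer parsing is needed for each of the hundreds of millions of lines.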
The input and the output file may each be plaintext or gzipped (transparent compression / decompression is selected by the .gz file extension).
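Transparent compression can be done by picking the open function based on the file extension. This is a rough sketch of the idea, not necessarily the exact code of the script; the helper name `open_maybe_gzip` is hypothetical. One detail matters on Python 3: `gzip.open` defaults to binary mode, so text mode has to be requested explicitly with `"rt"` or `"wt"`.

```python
import gzip


def open_maybe_gzip(filename, mode="rt"):
    """Open a file in text mode, via gzip if the name ends in .gz."""
    # gzip.open and the builtin open both accept "rt" / "wt"
    open_func = gzip.open if filename.endswith(".gz") else open
    return open_func(filename, mode)
```

Because both functions return a file object that can be iterated line by line, the rest of the code does not need to know whether the file is compressed.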
Example:
#Filter for both interacting proteins in Saccharomyces cerevisiae
./filter-protein-links.py protein.links.full.v9.05.txt.gz output.txt.gz --filter-organism=4932 --match-both
Source code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filter-protein-links.py

Filters STRING protein link dumps by taxonomy
"""
from __future__ import with_statement
import gzip

__author__ = "Uli Koehler"
__license__ = "Apache v2.0"
__copyright__ = "Copyright 2013, Uli Koehler"

def filterSTRINGFile(infilename, outfilename, organismFilter, mustMatchBothOrganisms=True):
    # Use gzip.open for .gz files and the builtin open otherwise
    inOpenFunc = gzip.open if infilename.endswith(".gz") else open
    outOpenFunc = gzip.open if outfilename.endswith(".gz") else open
    allRecordsCounter = 0
    passCounter = 0
    # Text mode is required on Python 3: gzip.open defaults to binary mode
    with inOpenFunc(infilename, "rt") as infile, outOpenFunc(outfilename, "wt") as outfile:
        for line in infile:
            if line.startswith("protein1"):
                # Copy the header line to the output, but don't count it as a record
                outfile.write(line)
                continue
            allRecordsCounter += 1
            if allRecordsCounter % 1000000 == 0:
                print("Processed {} input records, {} records passed test...".format(
                    allRecordsCounter, passCounter))
            parts = line.split()
            # Check organism: the taxonomy ID is the part before the first dot
            organismA = parts[0].partition(".")[0]
            organismB = parts[1].partition(".")[0]
            # Match-both mode: skip unless both proteins belong to the organism
            if mustMatchBothOrganisms and (organismA != organismFilter or organismB != organismFilter):
                continue
            # Match-any mode: skip only if neither protein belongs to the organism
            if (not mustMatchBothOrganisms) and (organismA != organismFilter and organismB != organismFilter):
                continue
            # All tests passed, write line
            passCounter += 1
            outfile.write(line)
    # Print final statistics
    print("Processed {} input records, {} records passed test".format(allRecordsCounter, passCounter))

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter-organism", type=int, required=True, dest="filterOrganism",
                        help="NCBI taxonomy ID of the organism to filter for")
    parser.add_argument("--match-both", action="store_true", dest="matchBoth",
                        help="Require both interacting proteins to match the organism (default: at least one)")
    parser.add_argument("infile", help="The input file (.gz supported)")
    parser.add_argument("outfile", help="The output file (.gz supported)")
    args = parser.parse_args()
    # Taxonomy IDs are compared as strings against the identifier prefixes
    filterSTRINGFile(args.infile, args.outfile, str(args.filterOrganism), args.matchBoth)