Filtering STRING PPI dumps by taxonomy
Recently I needed to filter a STRING protein-view database dump (e.g. protein.links.full.v9.05.txt.gz) by taxonomy ID. The original dataset was far too large to work with directly (it had more than 670 million records).
To filter with constant memory (after all, the full STRING dump is 47 GB in size), I wrote this script. It can keep only those binary PPIs where both interacting proteins belong to the given organism (NCBI taxonomy ID), or alternatively those binary PPIs where at least one interacting protein belongs to that organism. For STRING, this usually doesn't really make a difference.
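The two matching modes can be illustrated in isolation. The following is only a minimal sketch and not part of the script: the helper names taxonomy_id and pair_matches are hypothetical, and the identifiers are illustrative examples of the <taxonomy ID>.<protein ID> form found in the first two columns of the dump.

```python
def taxonomy_id(string_id):
    """Return the part before the first dot, i.e. the NCBI taxonomy ID."""
    return string_id.partition(".")[0]

def pair_matches(protein_a, protein_b, organism, match_both=True):
    """Check whether an interaction passes the organism filter."""
    hits = [taxonomy_id(p) == organism for p in (protein_a, protein_b)]
    # match_both: every protein must match; otherwise one match is enough
    return all(hits) if match_both else any(hits)

# Both proteins from organism 4932
print(pair_matches("4932.YAL001C", "4932.YAL002W", "4932"))
# Mixed pair, both required
print(pair_matches("4932.YAL001C", "9606.ENSP00000000233", "4932"))
# Mixed pair, at least one required
print(pair_matches("4932.YAL001C", "9606.ENSP00000000233", "4932", match_both=False))
```

Note that the organism is compared as a string, which is why the script converts the numeric command line argument with str() before filtering.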
The input and output files may be plain text or gzipped (transparent compression / decompression is enabled for filenames ending in .gz).
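Transparent compression boils down to choosing the open function from the filename. Here is a minimal sketch, assuming Python 3; the helper name open_maybe_gzip is hypothetical and the header line written here is just a shortened stand-in for real data. Text mode is requested explicitly because gzip.open defaults to binary mode.

```python
import gzip
import os
import tempfile

def open_maybe_gzip(filename, mode="rt"):
    """Open a plain-text or gzipped file in text mode, chosen by extension."""
    opener = gzip.open if filename.endswith(".gz") else open
    return opener(filename, mode)

# Round trip through a temporary gzipped file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt.gz")
    with open_maybe_gzip(path, "wt") as outfile:
        outfile.write("protein1 protein2 combined_score\n")
    with open_maybe_gzip(path) as infile:
        header = infile.readline()

print(header.strip())
```

Because both file objects are iterated and written line by line, memory usage stays constant regardless of the file size.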
Example:
#Filter for both interacting proteins in Saccharomyces cerevisiae
./filter-protein-links.py protein.links.full.v9.05.txt.gz output.txt.gz --filter-organism=4932 --match-both
Source code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filter-protein-links.py
Filters STRING protein link dumps by taxonomy
"""
from __future__ import with_statement
import gzip
__author__ = "Uli Koehler"
__license__ = "Apache v2.0"
__copyright__ = "Copyright 2013, Uli Koehler"
def filterSTRINGFile(infilename, outfilename, organismFilter, mustMatchBothOrganisms=True):
    # Use gzip transparently for filenames ending in .gz
    inOpenFunc = gzip.open if infilename.endswith(".gz") else open
    outOpenFunc = gzip.open if outfilename.endswith(".gz") else open
    allRecordsCounter = 0
    passCounter = 0
    # Text mode ("rt"/"wt") is required: gzip.open defaults to binary mode in Python 3
    with inOpenFunc(infilename, "rt") as infile, outOpenFunc(outfilename, "wt") as outfile:
        for line in infile:
            if line.startswith("protein1"):
                outfile.write(line)
                continue  # Copy the header line, but don't count or filter it
            allRecordsCounter += 1
            if allRecordsCounter % 1000000 == 0:
                print("Processed {} input records, {} records passed test...".format(
                    allRecordsCounter, passCounter))
            parts = line.split()
            # Check organism: the taxonomy ID is the part of the protein ID before the first dot
            organismA = parts[0].partition(".")[0]
            organismB = parts[1].partition(".")[0]
            if mustMatchBothOrganisms and (organismA != organismFilter or organismB != organismFilter):
                continue
            if (not mustMatchBothOrganisms) and (organismA != organismFilter and organismB != organismFilter):
                continue
            # All tests passed, write line
            passCounter += 1
            outfile.write(line)
    # Print final statistics
    print("Processed {} input records, {} records passed test".format(allRecordsCounter, passCounter))
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter-organism", type=int, required=True, dest="filterOrganism", help="NCBI taxonomy ID of the organism to filter for")
    parser.add_argument("--match-both", action="store_true", dest="matchBoth", help="Require both interacting proteins to match the organism (default: at least one must match)")
    parser.add_argument("infile", help="The input file (.gz supported)")
    parser.add_argument("outfile", help="The output file (.gz supported)")
    args = parser.parse_args()
    filterSTRINGFile(args.infile, args.outfile, str(args.filterOrganism), args.matchBoth)