Filtering STRING PPI dumps by taxonomy

Recently I needed to filter a STRING protein-view database dump (e.g. protein.links.full.v9.05.txt.gz) by taxonomy ID. The original dataset was far too large to work with directly: it has more than 670 million records.

To filter with constant memory (after all, the full STRING dump is 47 GB), I wrote this script. It processes the dump line by line and supports two modes: keep only binary PPIs where both interacting proteins belong to the given organism (NCBI taxonomy ID), or keep binary PPIs where at least one interacting protein belongs to that organism. For STRING, this usually doesn’t make a difference.
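The two filter modes boil down to a small predicate on the two protein identifiers. The following is a minimal sketch, not part of the script itself: the function name record_passes and the example identifiers are hypothetical, and it assumes identifiers of the form "<taxonomy ID>.<protein ID>", which is what the script relies on.

```python
def record_passes(protein_a, protein_b, organism, match_both):
    """Decide whether a record with the two given protein IDs is kept.

    Hypothetical helper for illustration; assumes IDs look like
    "<taxonomy ID>.<protein ID>".
    """
    # Everything before the first "." is treated as the taxonomy ID
    organism_a = protein_a.partition(".")[0]
    organism_b = protein_b.partition(".")[0]
    if match_both:
        # Both interacting proteins must belong to the organism
        return organism_a == organism and organism_b == organism
    # At least one interacting protein must belong to the organism
    return organism_a == organism or organism_b == organism


# Illustrative identifiers (assumed, not taken from a real dump)
print(record_passes("4932.YAL001C", "4932.YBR123C", "4932", True))
print(record_passes("4932.YAL001C", "9606.ENSP00000000233", "4932", True))
print(record_passes("4932.YAL001C", "9606.ENSP00000000233", "4932", False))
```

The taxonomy ID is compared as a string, so no integer parsing is needed for each of the several hundred million records.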

The input and the output file may each be plaintext or gzipped (transparent compression / decompression is enabled, based on the .gz file extension).
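One way to get this transparent behaviour is to pick the opener by file extension and always open in text mode. This is only a sketch of the idea; the helper name open_maybe_gzip is hypothetical and does not appear in the script.

```python
import gzip


def open_maybe_gzip(filename, mode="rt"):
    """Open a plaintext or gzipped file in text mode (hypothetical helper).

    Files ending in ".gz" go through gzip.open, everything else through
    the built-in open. Both accept the text modes "rt" and "wt".
    """
    opener = gzip.open if filename.endswith(".gz") else open
    return opener(filename, mode)
```

Using text mode for both cases means that each line is a str regardless of whether the file is compressed, so the same string comparisons work on either kind of input.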

Example:

#Filter for both interacting proteins in Saccharomyces cerevisiae
./filter-protein-links.py protein.links.full.v9.05.txt.gz output.txt.gz --filter-organism=4932 --match-both

Source code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filter-protein-links.py

Filters STRING protein link dumps by taxonomy
"""
import gzip

__author__ = "Uli Koehler"
__license__ = "Apache v2.0"
__copyright__ = "Copyright 2013, Uli Koehler"

def filterSTRINGFile(infilename, outfilename, organismFilter, mustMatchBothOrganisms=True):
    inOpenFunc = gzip.open if infilename.endswith(".gz") else open
    outOpenFunc = gzip.open if outfilename.endswith(".gz") else open
    allRecordsCounter = 0
    passCounter = 0
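    # Open both files in text mode so that lines are str objects on Python 3,
    # for gzipped as well as plaintext files
    with inOpenFunc(infilename, "rt") as infile, outOpenFunc(outfilename, "wt") as outfile: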
        for line in infile:
            if line.startswith("protein1"):
                #Copy the header line to the output without counting it as a record
                outfile.write(line)
                continue
            allRecordsCounter += 1
            if allRecordsCounter % 1000000 == 0:
                print ("Processed {} input records, {} records passed test...".format(
                        allRecordsCounter, passCounter))
            parts = line.split()
            #Check organism
            organismA = parts[0].partition(".")[0]
            organismB = parts[1].partition(".")[0]
            if mustMatchBothOrganisms and (organismA != organismFilter or organismB != organismFilter):
                continue
            if (not mustMatchBothOrganisms) and (organismA != organismFilter and organismB != organismFilter):
                continue
            #All tests passed, write line
            passCounter += 1
            outfile.write(line)
    #Print final statistics
    print ("Processed {} input records, {} records passed test".format(allRecordsCounter, passCounter))

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter-organism", type=int, required=True, dest="filterOrganism", help="The NCBI taxonomy ID of the organism to filter for")
    parser.add_argument("--match-both", action="store_true", dest="matchBoth", help="Require both interacting proteins to belong to the organism (default: at least one)")
    parser.add_argument("infile", help="The input file (.gz supported)")
    parser.add_argument("outfile", help="The output file (.gz supported)")
    args = parser.parse_args()
    filterSTRINGFile(args.infile, args.outfile, str(args.filterOrganism), args.matchBoth)