Ein einfaches Werkzeug für FASTA-Statistiken

Das Problem

Es ist überraschend schwierig, einfache Statistiken von FASTA-Dateien mit vorhandener Software zu berechnen. Ich musste kürzlich die Nukleotidanzahl und die relative GC-Häufigkeit einer einzelnen Sequenz im FASTA-Format berechnen, aber es scheint keine einfache, abhängigkeitsfreie Lösung für diese Problemklasse zu geben, es sei denn, man installiert abhängigkeitsreiche native Software wie FASTX oder entwickelt sie selbst mit BioPython oder ähnlichem.

Die Lösung: `fasta-stats.py`

Ich habe ein einfaches Python-Skript entwickelt, das nur die Standard-Python-Bibliotheken verwendet und in der Lage ist, diese Statistiken für eine oder mehrere FASTA-Dateien (plain oder Gzip-komprimiert) mit jeweils einer oder mehreren Sequenzen zu generieren. Die absolute und relative Häufigkeit von Zeichen (wobei ein Zeichen normalerweise ein Aminosäure-Code oder Nukleotid ist) wird berechnet und für jede Sequenz der Eingabedateien ausgegeben. Wenn die Dateien eine Nukleotidsequenz enthalten (d.h. sie enthält nur ATGC-Zeichen), werden auch die A+T- und G+C-Anzahlen automatisch ausgegeben.

Standardmäßig werden die Zeichen case-insensitiv gezählt.

Verwendungsbeispiele:

fasta_stats_usage.sh

#Einfache Statistiken für mehrere Dateien, erkennt .gz-Eingabe automatisch
fasta-stats.py sequence1.fa sequence.2.fa.gz
#Zeichen case-sensitiv zählen
fasta-stats.py --case-sensitive sequence1.fa sequence.2.fa.gz
#Nur ATGC-Zeichen zählen, alle anderen Zeichen ignorieren
fasta-stats.py --only ATGC sequence1.fa sequence.2.fa.gz

#Einfache Statistiken für mehrere Dateien, erkennt .gz-Eingabe automatisch
fasta-stats.py sequence1.fa sequence.2.fa.gz
#Zeichen case-sensitiv zählen
fasta-stats.py --case-sensitive sequence1.fa sequence.2.fa.gz
#Nur ATGC-Zeichen zählen, alle anderen Zeichen ignorieren
fasta-stats.py --only ATGC sequence1.fa sequence.2.fa.gz

Skript-Quellcode:

fasta_stats.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
fasta-stats.py: Utility script to count number of nucleotides/Aminoacids in a FASTA files.
Changelog:
    1.1: Python3-ready
"""
from __future__ import with_statement
import sys
import argparse
import gzip
#Counter is used for per-character statistics
from collections import Counter
__author__    = "Uli Koehler & Anton Smirnov"
__copyright__ = "Copyright 2013 Uli Koehler & Anton Smirnov"
__license__   = "Apache v2.0"
__version__   = "1.1"

def printSequenceStats(fileName, sequenceName, charOccurrenceMap, totalCharCount):
    """
    Print details about a sequence to stdout.
    Called from inside parseFile().

    Keyword arguments:
        sequenceName: The name of the sequence to print
        charOccurrenceMap: A dict-like object that contains a char -->
                            number of occurrences mapping
        totalCharCount: The overall character count of the sequence
    """
    print ("Sequence '{}' from FASTA file '{}' contains {} sequence characters:".format(
        sequenceName, fileName, totalCharCount))
    for char in sorted(charOccurrenceMap.keys()):
        charCount = charOccurrenceMap[char]
        relativeFrequency = charCount * 100.0 / totalCharCount
        print ("\t{} : {} = {}%".format(char, charCount, relativeFrequency))
    #For nucleotide sequences (ATGC only), also print A+T vs G+C count
    if sorted(charOccurrenceMap.keys()) == ["A","C","G","T"]:
        #Print A+T count
        atCount = charOccurrenceMap["A"] + charOccurrenceMap["T"]
        atRelFrequency = atCount * 100.0 / totalCharCount
        print ("\tA+T : {} = {}%".format(atCount, atRelFrequency))
        #Print G+C count
        gcCount = charOccurrenceMap["G"] + charOccurrenceMap["C"]
        atRelFrequency = gcCount * 100.0 / totalCharCount
        print ("\tG+C : {} = {}%%".format(gcCount, atRelFrequency))

def parseFile(filename, caseSensitive=False, charWhitelist=None):
    """
    Parse a FASTA fil and call printRe
    """
    #Set to the header line, with ">" removed from the beginning
    sequenceName = None
    #Key: character, value: Number of occurrences
    charOccurrenceMap = Counter()
    #The number of characters in the current sequence, not including \n
    charCount = 0
    #Keep track of consecutive comments, because they are appended
    previousLineWasComment = False
    #Open and iterate the file, auto- detect gzip
    openFunc = gzip.open if filename.endswith(".gz") else open
    with openFunc(filename, "r") as infile:
        for line in infile:
            line = line.strip()
            #Be super-compatible with the original specification
            if line.startswith(">") or line.startswith(";"):
                #Process previous sequence, if any
                if sequenceName is not None:
                    printSequenceStats(filename, sequenceName, charOccurrenceMap, charCount)
                    charOccurrenceMap = Counter()
                    charCount = 0
                #Take the entire comment line as (new) sequence ID (with ">" stripped)
                #Concatenate consecutive sequence lines
                if previousLineWasComment: #Append -- add one space between to normalize whitespace count
                    sequenceName += " " + line[1:].strip()
                else:
                    sequenceName = line[1:].strip()
                previousLineWasComment = True
            else: #Line belongs to the sequence
                previousLineWasComment = False
                #Line has been stripped before, so we can count directly
                #Increment per-character stats (character occurrences)
                for char in line:
                    #Skip any character not in the whitelist, if whitelist (--only) is enabled
                    if charWhitelist is not None and not char in charWhitelist:
                        continue
                    #We can only count after whitelite filter
                    charCount += 1
                    #In case-insensitive mode (default) count uppercased chars only
                    char = char if caseSensitive else char.upper()
                    charOccurrenceMap[char] += 1
        #The last line has been read, print the last record, if any
        if sequenceName is not None:
            printSequenceStats(filename, sequenceName, charOccurrenceMap, charCount)

if __name__ == "__main__":
    #Allow single or multiple files to be specified
    parser = argparse.ArgumentParser(description='Compute simple statistics for FASTA files.')
    parser.add_argument('infiles', nargs='+', help='FASTA files (.fa, .fa.gz) to generate statistics for')
    parser.add_argument('--case-sensitive', action='store_true', help='Count characters in a case-sensitive way. Disabled per default.')
    parser.add_argument('-o','--only', help='If this option is supplied (e.g. set to \'ATGC\'), characters not in the set will be ignored for all statistics')
    args = parser.parse_args()
    #Process all FASTA files
    for infile in args.infiles:
        parseFile(infile, caseSensitive=args.case_sensitive, charWhitelist=args.only)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
fasta-stats.py: Utility script to count number of nucleotides/Aminoacids in a FASTA files.
Changelog:
    1.1: Python3-ready
"""
from __future__ import with_statement
import sys
import argparse
import gzip
#Counter is used for per-character statistics
from collections import Counter
__author__    = "Uli Koehler & Anton Smirnov"
__copyright__ = "Copyright 2013 Uli Koehler & Anton Smirnov"
__license__   = "Apache v2.0"
__version__   = "1.1"

def printSequenceStats(fileName, sequenceName, charOccurrenceMap, totalCharCount):
    """
    Print details about a sequence to stdout.
    Called from inside parseFile().

    Keyword arguments:
        sequenceName: The name of the sequence to print
        charOccurrenceMap: A dict-like object that contains a char -->
                            number of occurrences mapping
        totalCharCount: The overall character count of the sequence
    """
    print ("Sequence '{}' from FASTA file '{}' contains {} sequence characters:".format(
        sequenceName, fileName, totalCharCount))
    for char in sorted(charOccurrenceMap.keys()):
        charCount = charOccurrenceMap[char]
        relativeFrequency = charCount * 100.0 / totalCharCount
        print ("\t{} : {} = {}%".format(char, charCount, relativeFrequency))
    #For nucleotide sequences (ATGC only), also print A+T vs G+C count
    if sorted(charOccurrenceMap.keys()) == ["A","C","G","T"]:
        #Print A+T count
        atCount = charOccurrenceMap["A"] + charOccurrenceMap["T"]
        atRelFrequency = atCount * 100.0 / totalCharCount
        print ("\tA+T : {} = {}%".format(atCount, atRelFrequency))
        #Print G+C count
        gcCount = charOccurrenceMap["G"] + charOccurrenceMap["C"]
        atRelFrequency = gcCount * 100.0 / totalCharCount
        print ("\tG+C : {} = {}%%".format(gcCount, atRelFrequency))

def parseFile(filename, caseSensitive=False, charWhitelist=None):
    """
    Parse a FASTA fil and call printRe
    """
    #Set to the header line, with ">" removed from the beginning
    sequenceName = None
    #Key: character, value: Number of occurrences
    charOccurrenceMap = Counter()
    #The number of characters in the current sequence, not including \n
    charCount = 0
    #Keep track of consecutive comments, because they are appended
    previousLineWasComment = False
    #Open and iterate the file, auto- detect gzip
    openFunc = gzip.open if filename.endswith(".gz") else open
    with openFunc(filename, "r") as infile:
        for line in infile:
            line = line.strip()
            #Be super-compatible with the original specification
            if line.startswith(">") or line.startswith(";"):
                #Process previous sequence, if any
                if sequenceName is not None:
                    printSequenceStats(filename, sequenceName, charOccurrenceMap, charCount)
                    charOccurrenceMap = Counter()
                    charCount = 0
                #Take the entire comment line as (new) sequence ID (with ">" stripped)
                #Concatenate consecutive sequence lines
                if previousLineWasComment: #Append -- add one space between to normalize whitespace count
                    sequenceName += " " + line[1:].strip()
                else:
                    sequenceName = line[1:].strip()
                previousLineWasComment = True
            else: #Line belongs to the sequence
                previousLineWasComment = False
                #Line has been stripped before, so we can count directly
                #Increment per-character stats (character occurrences)
                for char in line:
                    #Skip any character not in the whitelist, if whitelist (--only) is enabled
                    if charWhitelist is not None and not char in charWhitelist:
                        continue
                    #We can only count after whitelite filter
                    charCount += 1
                    #In case-insensitive mode (default) count uppercased chars only
                    char = char if caseSensitive else char.upper()
                    charOccurrenceMap[char] += 1
        #The last line has been read, print the last record, if any
        if sequenceName is not None:
            printSequenceStats(filename, sequenceName, charOccurrenceMap, charCount)

if __name__ == "__main__":
    #Allow single or multiple files to be specified
    parser = argparse.ArgumentParser(description='Compute simple statistics for FASTA files.')
    parser.add_argument('infiles', nargs='+', help='FASTA files (.fa, .fa.gz) to generate statistics for')
    parser.add_argument('--case-sensitive', action='store_true', help='Count characters in a case-sensitive way. Disabled per default.')
    parser.add_argument('-o','--only', help='If this option is supplied (e.g. set to \'ATGC\'), characters not in the set will be ignored for all statistics')
    args = parser.parse_args()
    #Process all FASTA files
    for infile in args.infiles:
        parseFile(infile, caseSensitive=args.case_sensitive, charWhitelist=args.only)

Check out similar posts by category: Bioinformatics, Python

If this post helped you, please consider buying me a coffee or donating via PayPal to support research & publishing of new posts on TechOverflow

Buy me a coffee

Das Problem

Die Lösung: fasta-stats.py

Die Lösung: `fasta-stats.py`