Problem:
You want to read substitution matrices in the matblas format, e.g. the BLOSUM62 matrix from NCBI, into a NumPy ndarray.
You need a list of countries, ordered by continent, under a liberal license.
You need to calculate the center star approximation for a given set of sequences. Instead of calculating the sequence distances and center string by hand, you want the computer to do the hard work.
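The first step of the center star method, choosing the center string, can be sketched in a few lines of C++. This is only an illustration under stated assumptions: it uses unit-cost edit (Levenshtein) distance as the sequence distance, the names `editDistance` and `findCenterIndex` are made up for this example, and it does not build the final multiple alignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Unit-cost edit (Levenshtein) distance, computed with a two-row DP table.
// Assumption: the post may use a different scoring scheme.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: returns the index of the sequence whose summed
// distance to all other sequences is minimal (the "center string").
// Ties are resolved in favor of the lowest index.
std::size_t findCenterIndex(const std::vector<std::string>& seqs) {
    if (seqs.empty()) throw std::invalid_argument("no sequences given");
    std::size_t best = 0;
    std::size_t bestSum = std::numeric_limits<std::size_t>::max();
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        std::size_t sum = 0;
        for (std::size_t j = 0; j < seqs.size(); ++j) {
            if (i != j) sum += editDistance(seqs[i], seqs[j]);
        }
        if (sum < bestSum) {
            bestSum = sum;
            best = i;
        }
    }
    return best;
}
```

For `{"ACGT", "ACGA", "TTTT"}` the distance sums are 4, 5 and 7, so the first sequence would be picked as the center.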
You need to parse files in the NCBI GeneInfo format, like those that can be downloaded from the NCBI FTP GENE_INFO directory, in Python. You want to avoid any dependencies.
You need to parse a GFF3 file containing information about sequence features. You prefer a minimal, dependency-free solution over importing the GFF3 data into a database right away. However, you need a standards-compliant parser.
The GeneOntology Consortium provides bulk data download for the GO terms in the OBO v1.2 format.
If you Google GO OBO parser, there is something missing. You can easily find parsers in Perl and in Java, but not even BioPython has a parser in Python. The format itself, however, seems tailor-made for Python’s generator concept. Only a few SLOCs are needed to get it working without storing everything in RAM.
I used this parser in a prototype project that allows searching GO interactively (it’s fast). I’m not sure when/if I’ll publish that, but here is the parser code.
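The streaming idea (emit one term at a time instead of keeping the whole file in RAM) is not specific to Python. The following is not the Python parser from the post but a rough C++ sketch of the same approach using a callback. The names `OboTerm` and `parseOboTerms` are invented for this example, and it makes simplifying assumptions: only `[Term]` stanzas are reported, tag and value are split at the first `": "`, and trailing `!` comments and escape sequences are left untouched in the value.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one stanza: tag -> all values of that tag.
using OboTerm = std::map<std::string, std::vector<std::string>>;

// Reads OBO text line by line and calls onTerm once per [Term] stanza.
// Only the stanza currently being read is held in memory.
void parseOboTerms(std::istream& in,
                   const std::function<void(const OboTerm&)>& onTerm) {
    OboTerm current;
    bool inTerm = false;
    std::string line;
    // Hands the collected stanza to the callback and resets the buffer
    auto flush = [&]() {
        if (inTerm && !current.empty()) onTerm(current);
        current.clear();
    };
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back(); // CRLF files
        if (line.empty() || line[0] == '!') continue;  // blank or comment line
        if (line[0] == '[') {                          // new stanza header
            flush();
            inTerm = (line == "[Term]");
            continue;
        }
        if (!inTerm) continue;                         // file header or other stanza types
        const std::size_t sep = line.find(": ");
        if (sep == std::string::npos) continue;        // not a tag-value line
        current[line.substr(0, sep)].push_back(line.substr(sep + 2));
    }
    flush(); // the last stanza has no following header
}
```

Any `std::istream` works as input, so a `std::ifstream` on the downloaded file or a `std::istringstream` for experiments can be passed in directly.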
You’ve got a gzipped file that you want to decompress using C++. You don’t want to use pipes to gzip in an external process. You don’t want to use zlib and manual buffering either.
You want to read alignment matrices like BLOSUM62 in the QUASAR format. The solution needs to be integrable into C++ code easily.
It is surprisingly difficult to compute simple statistics of FASTA files using existing software. I recently needed to compute the nucleotide count and relative GC frequency of a single sequence in FASTA format, but unless you install dependency-heavy native software like FASTX or write it yourself using BioPython or similar, there doesn’t seem to be a simple, dependency-free solution for this simple set of problems.
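To give an idea of how little code the task itself needs, here is a minimal C++ sketch that counts nucleotides and computes the relative GC frequency using only the standard library. It is an illustration, not the tool from the post: `NucleotideStats` and `computeFastaStats` are names invented here, lines starting with `>` are treated as headers, every alphabetic character (including `N`) is counted, and lowercase letters are folded to uppercase.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical result type: per-character counts plus the total.
struct NucleotideStats {
    std::map<char, std::size_t> counts; // uppercase letter -> occurrences
    std::size_t total = 0;              // number of counted letters

    // Fraction of G and C among all counted letters (0 if nothing was counted)
    double gcFraction() const {
        if (total == 0) return 0.0;
        auto get = [this](char c) {
            auto it = counts.find(c);
            return it == counts.end() ? std::size_t{0} : it->second;
        };
        return static_cast<double>(get('G') + get('C')) / total;
    }
};

// Reads FASTA text from a stream, skipping header lines.
// Assumption: all records in the stream are pooled into one set of counts.
NucleotideStats computeFastaStats(std::istream& in) {
    NucleotideStats stats;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '>') continue; // header or blank line
        for (char raw : line) {
            const unsigned char u = static_cast<unsigned char>(raw);
            if (!std::isalpha(u)) continue;           // skips '\r', digits, gaps
            ++stats.counts[static_cast<char>(std::toupper(u))];
            ++stats.total;
        }
    }
    return stats;
}
```

Pass a `std::ifstream` for a real file; a `std::istringstream` is handy for trying it on a short sequence.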
You want to use mmap() from the sys/mman.h POSIX header to map a file for reading (not writing). You can’t find any simple, bare example on the internet.
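A bare-bones version of such a read-only mapping might look like the sketch below. Note that `mmap()` and `munmap()` are declared in `sys/mman.h`, while `fstat()` (used here to get the mapping length) comes from `sys/stat.h`. The function name `readFileViaMmap` is made up for this example, and copying the mapping into a `std::string` is just a convenient way to show the data is readable.

```cpp
#include <fcntl.h>     // open()
#include <sys/mman.h>  // mmap(), munmap()
#include <sys/stat.h>  // fstat()
#include <unistd.h>    // close()

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: maps a file read-only and returns a copy of its content.
std::string readFileViaMmap(const std::string& path) {
    const int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1) throw std::runtime_error("open() failed for " + path);

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        throw std::runtime_error("fstat() failed for " + path);
    }
    if (st.st_size == 0) { // mmap() rejects a length of zero
        close(fd);
        return std::string();
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    // PROT_READ: pages may only be read; MAP_PRIVATE: no changes are shared
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) throw std::runtime_error("mmap() failed for " + path);

    std::string content(static_cast<const char*>(addr), size);
    munmap(addr, size);
    return content;
}
```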
You want to use stat() from the sys/stat.h POSIX header in order to get the size of a file.
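As a minimal sketch, the size is available in the `st_size` field of the `struct stat` that `stat()` fills in. The wrapper name `fileSize` and the choice to throw on failure are assumptions made for this example.

```cpp
#include <sys/stat.h>  // stat(), struct stat

#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the size in bytes reported by stat().
// Throws if the path cannot be stat'ed (e.g. it does not exist).
std::int64_t fileSize(const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) {
        throw std::runtime_error("stat() failed for " + path);
    }
    return static_cast<std::int64_t>(st.st_size);
}
```

Since `stat()` follows symbolic links, the value refers to the link target; `lstat()` would report on the link itself.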
You want to use the mkdir() function from the sys/stat.h POSIX header, but you don’t know what the mode_t argument should look like.
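In short, `mode_t` is a bitmask of permission flags, usually built by OR-ing the `S_I*` constants from `sys/stat.h` or written as an octal literal such as `0755`. The sketch below shows one way to do it; the helper name `makeDirectory` and the chosen permissions are assumptions for this example.

```cpp
#include <sys/stat.h>   // mkdir(), stat(), S_I* constants
#include <sys/types.h>  // mode_t

#include <cerrno>
#include <string>

// Hypothetical helper: returns true if the directory was created
// or a directory already exists at that path.
bool makeDirectory(const std::string& path) {
    // rwx for the owner, r-x for group and others (0755 in octal).
    // The process umask is applied on top, so the result may be stricter.
    const mode_t mode = S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;
    if (mkdir(path.c_str(), mode) == 0) return true;
    if (errno != EEXIST) return false;

    // Something exists at that path already; accept it only if it is a directory
    struct stat st;
    return stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}
```

Note that `mkdir()` only creates the last path component; missing parent directories make it fail with `ENOENT`.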
The following C++ program uses boost::iostreams to memory-map a file, read its content into a std::string and print it to cout.
It provides a minimal example of how to use the boost::iostreams portable mmap functionality.
//Compile like this: g++ -o mmap mmap.cpp -lboost_iostreams
#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>
#include <string>
using namespace std;
using namespace boost::iostreams;

int main(int argc, char** argv) {
    //The file to map is expected as the first argument
    if (argc < 2) {
        cerr << "Usage: " << argv[0] << " <file>" << endl;
        return 1;
    }
    //Initialize the memory-mapped file
    mapped_file_source file(argv[1]);
    //Read the entire file into a string
    string fileContent(file.data(), file.size());
    //Print the string
    cout << fileContent;
    //Cleanup
    file.close();
}
Also see A simple mmap() readonly example
This article describes a method of reading TAR archives (including .tar.gz and .tar.bz2) in C++ using Boost IOStreams.
You could use libtar for this, but the original version hasn’t been updated since 2003 and gives you neither flexibility nor insight into the internal structure of a TAR archive.
Using the libcurl easy API, you want to download a file using HTTP GET. No extended features such as authentication are needed. The download result should be stored in a std::string.
You want to compile and install libc++ (sometimes also named libcxx), but CMake complains with this error message:
CMake Error at cmake/Modules/MacroEnsureOutOfSourceBuild.cmake:7 (message):
libcxx requires an out of source build. Please create a separate
build directory and run 'cmake /path/to/libcxx [options]' there.
Call Stack (most recent call first):
CMakeLists.txt:24 (MACRO_ENSURE_OUT_OF_SOURCE_BUILD)
CMake Error at cmake/Modules/MacroEnsureOutOfSourceBuild.cmake:8 (message):
In-source builds are not allowed.
CMake would overwrite the makefiles distributed with Compiler-RT.
Please create a directory and run cmake from there, passing the path
to this source directory as the last argument.
This process created the file `CMakeCache.txt' and the directory `CMakeFiles'.
Please delete them.
Call Stack (most recent call first):
CMakeLists.txt:24 (MACRO_ENSURE_OUT_OF_SOURCE_BUILD)
You want to use git-svn to clone an SVN repository, but instead of cloning the entire history (which can be quite slow), you only want the latest revision.
You want to find out the latest revision number of a remote Subversion repository without cloning it (e.g. because cloning takes a looong time with Subversion).