Almost anyone productively using Haskell has already stumbled upon Michael Snoyman’s Conduit library.
In this post I will show how to leverage the power of Conduit's interleaved IO in order to parse authors' names from the Open Library data dumps.
Using traditional languages like Java or C, it would be significantly more difficult and error-prone to interleave the following steps of the processing pipeline:
- Downloading the file ol_dump_authors_latest.txt.gz using HTTP
- Decompressing using a
- Splitting into lines and discarding everything but the JSON from said lines
- Appropriately decoding the JSON and extracting the name field, or ignoring the line if parsing is not possible
- Writing all said names into a file (authors.txt in this example), one per line.
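The steps above can be sketched as a single conduit pipeline. This is a minimal sketch under stated assumptions, not a definitive implementation: the download URL is assumed, the JSON is assumed to be the last tab-separated field of each line, and the helper names jsonField and authorName are made up for illustration. It relies on the conduit, conduit-extra, http-conduit, and aeson packages.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import           Data.Aeson            (decodeStrict, (.:))
import           Data.Aeson.Types      (parseMaybe)
import qualified Data.ByteString.Char8 as B
import           Data.Conduit.Zlib     (ungzip)
import           Data.Text             (Text)
import qualified Data.Text.Encoding    as TE
import           Network.HTTP.Simple   (getResponseBody, httpSource, parseRequest)

-- | Keep only the JSON part of a dump line.
--   Assumption: the JSON is the last tab-separated field.
jsonField :: B.ByteString -> B.ByteString
jsonField = snd . B.breakEnd (== '\t')

-- | Decode the JSON object of one line and return its "name" field.
--   Yields Nothing if the line cannot be parsed or has no textual name.
authorName :: B.ByteString -> Maybe Text
authorName line = do
  obj <- decodeStrict (jsonField line)
  parseMaybe (.: "name") obj

main :: IO ()
main = do
  -- Assumed URL; adjust it to wherever the dump is actually hosted.
  req <- parseRequest "https://openlibrary.org/data/ol_dump_authors_latest.txt.gz"
  runConduitRes $
       httpSource req getResponseBody  -- download over HTTP, chunk by chunk
    .| ungzip                          -- streaming decompression
    .| linesUnboundedAsciiC            -- re-chunk the byte stream into lines
    .| concatMapC authorName           -- keep names, silently drop bad lines
    .| mapC TE.encodeUtf8              -- Text back to bytes for writing
    .| unlinesAsciiC                   -- one name per line
    .| sinkFile "authors.txt"
```

Each stage only asks for more input once the next stage downstream is ready for it, so the full dump should never have to be held in memory at once.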
Because the uncompressed dump is a multi-GiB file, a streaming parser is required. Using pipes or FIFOs to download and parse from stdin would certainly be possible in this simple example, however
Although there are libraries for other languages facilitating native interleaved IO, it proves to be quite difficult to integrate third-party functionality like gunzip, especially when trying to avoid performance bottlenecks. When focusing on performance, you’re often stuck manually manipulating memory buffers, which yields complex, error-prone code.
Conduits provide a ready-to-use solution for these issues. Using libraries like conduit-extra, it is also easy to integrate streaming gunzip while keeping memory usage near-constant and resource usage deterministic.
While maintaining high performance, they free the programmer from thinking about the specifics of splicing together the individual steps of the processing pipeline.
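To illustrate what not having to think about the splicing means in practice, here is a small, purely in-memory sketch. The input chunks are cut at arbitrary positions, much as they would arrive from a network socket or a decompressor, and the line-splitting stage reassembles complete lines without any manual buffer handling. The function name chunksToLines and the sample chunks are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import qualified Data.ByteString.Char8 as B

-- | Feed arbitrarily sized chunks through a conduit and collect whole lines.
--   The upstream chunk boundaries are invisible to downstream stages.
chunksToLines :: [B.ByteString] -> [B.ByteString]
chunksToLines chunks =
  runConduitPure $
       yieldMany chunks        -- emit the chunks one at a time
    .| linesUnboundedAsciiC    -- buffer just enough to complete each line
    .| sinkList                -- collect the result (fine for a tiny demo)

main :: IO ()
main =
  -- Chunk boundaries deliberately fall in the middle of lines.
  mapM_ B.putStrLn (chunksToLines ["first li", "ne\nsec", "ond line\n", "third"])
```

In a real pipeline sinkList would be replaced by a streaming sink such as sinkFile, since collecting everything into a list defeats the constant-memory property.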