Almost anyone productively using Haskell has already stumbled upon Michael Snoyman’s Conduit library.
In this post I will show how to use Conduit’s interleaved IO to extract authors’ names from the Open Library data dumps.
Using traditional languages like Java or C, it would be significantly more difficult and error-prone to interleave the following processing pipeline steps:
- Downloading the file `ol_dump_authors_latest.txt.gz` using HTTP
- Decompressing it using a `gunzip`-like function
- Splitting the result into lines and discarding everything but the JSON from each line
- Decoding the JSON and extracting the `name` field, or ignoring the line if parsing is not possible
- Writing all of these names into a file (`authors.txt` in this example), one per line
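
To make the shape of such a pipeline concrete, here is a minimal sketch of how the five steps could be chained with Conduit. It is an illustration under stated assumptions, not necessarily the code this post arrives at: it assumes the `conduit`, `conduit-extra`, `http-conduit`, `aeson`, `bytestring` and `text` packages, that the dump is served from the URL in `dumpUrl`, and that the JSON record on each line starts at the first `{` character. The names `dumpUrl` and `extractName` are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Conduit
import           Data.Aeson            (Value, decodeStrict)
import           Data.Aeson.Types      (parseMaybe, withObject, (.:))
import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BC
import           Data.Conduit.Zlib     (ungzip)
import           Data.Text             (Text)
import qualified Data.Text.Encoding    as TE
import           Network.HTTP.Simple   (getResponseBody, httpSource, parseRequest)

-- Assumed download location of the dump (not taken from the text above).
dumpUrl :: String
dumpUrl = "https://openlibrary.org/data/ol_dump_authors_latest.txt.gz"

-- Keep only the JSON part of a line (assumed to start at the first '{'),
-- decode it and read the "name" field. Any failure yields Nothing.
extractName :: ByteString -> Maybe Text
extractName line = do
  value <- decodeStrict (BC.dropWhile (/= '{') line) :: Maybe Value
  parseMaybe (withObject "author" (.: "name")) value

main :: IO ()
main = do
  req <- parseRequest dumpUrl
  runConduitRes $
       httpSource req getResponseBody  -- stream the HTTP response body
    .| ungzip                          -- decompress on the fly
    .| linesUnboundedAsciiC            -- split into lines
    .| concatMapC extractName          -- drop lines without a usable name
    .| mapC TE.encodeUtf8              -- Text back to bytes
    .| unlinesAsciiC                   -- one name per line
    .| sinkFile "authors.txt"
```

Because every stage only pulls data when the next stage asks for it, the download, decompression, parsing and writing are interleaved, and the whole dump never needs to be held in memory at once.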