Extracting Open Library author names using Haskell Conduits

Almost anyone productively using Haskell has already stumbled upon Michael Snoyman’s Conduit library.

In this post I will show how to leverage Conduit's interleaved IO to parse author names from the Open Library data dumps.

Using traditional languages like Java or C, it would be significantly more difficult and error-prone to interleave the processing pipeline actions of

  • Downloading the file ol_dump_authors_latest.txt.gz using HTTP
  • Decompressing using a gunzip-like function
  • Splitting into lines and discarding everything but the JSON from said lines
  • Decoding the JSON and extracting the name field, or ignoring the line if it cannot be parsed
  • Writing all of those names into a file (authors.txt in this example), one per line.
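
The steps above could be sketched roughly as follows. This is a minimal sketch, not necessarily the code from the full post: it assumes the conduit, conduit-extra, http-conduit and aeson packages, that the dump is served at the URL shown, that it is a single gzip stream, and that the JSON object on each line starts at the first '{'. The names dumpUrl, jsonPart, nameFromLine and extractAuthorNames are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import           Data.Aeson            (FromJSON (..), decodeStrict, withObject, (.:))
import qualified Data.ByteString       as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Conduit.List     as CL
import           Data.Conduit.Zlib     (ungzip)
import           Data.Text             (Text)
import           Data.Text.Encoding    (encodeUtf8)
import           Network.HTTP.Simple   (getResponseBody, httpSource, parseRequest)

-- Assumed download location of the dump; adjust if it differs.
dumpUrl :: String
dumpUrl = "https://openlibrary.org/data/ol_dump_authors_latest.txt.gz"

-- We only care about the "name" field of each author record.
newtype Author = Author { authorName :: Text }

instance FromJSON Author where
  parseJSON = withObject "Author" $ \o -> Author <$> o .: "name"

-- Drop the leading columns of a dump line, keeping only the JSON part.
-- Assumption: none of the leading columns contains a '{'.
jsonPart :: B.ByteString -> B.ByteString
jsonPart = BC.dropWhile (/= '{')

-- Nothing if the line has no decodable JSON object with a "name" field.
nameFromLine :: B.ByteString -> Maybe Text
nameFromLine = fmap authorName . decodeStrict . jsonPart

-- Download, decompress, split, decode and write, all streamed in
-- constant memory.
extractAuthorNames :: IO ()
extractAuthorNames = do
  request <- parseRequest dumpUrl
  runConduitRes $
       httpSource request getResponseBody       -- HTTP download
    .| ungzip                                   -- gunzip-like decompression
    .| linesUnboundedAsciiC                     -- split into lines
    .| CL.mapMaybe nameFromLine                 -- skip unparseable lines
    .| mapC (\name -> encodeUtf8 name <> "\n")  -- one name per line
    .| sinkFile "authors.txt"
```

Calling extractAuthorNames from main (or from GHCi) runs the whole pipeline. Because every stage is a conduit, no stage has to finish before the next one starts, which is what makes the interleaving cheap to express.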


Haskell: Invert filter predicate

Problem: In Haskell, you want to filter a list with an inverted predicate, keeping exactly the elements the predicate rejects.

For example, your code is (GHCi):

Prelude> import Data.List
Prelude Data.List> filter ( isPrefixOf "a" ) ["a","ab","cd","abcd","xyz"]
["a","ab","abcd"]

The list you need is ["cd","xyz"], i.e. the elements that do not match the predicate. In some cases the easiest solution is to use the complementary operator, for example <= instead of >, but not every function has a literal complement; isPrefixOf, used in the example above, is one that does not.
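
One common way to get the complement is to compose the predicate with not from the Prelude. The sketch below is not necessarily the solution from the full post; the helper name rejectBy and the binding withoutPrefixA are made up for illustration.

```haskell
import Data.List (isPrefixOf)

-- Keep only the elements for which the predicate does NOT hold.
-- (rejectBy is a hypothetical helper name, not a library function.)
rejectBy :: (a -> Bool) -> [a] -> [a]
rejectBy p = filter (not . p)

-- The example list, without the strings that start with "a".
withoutPrefixA :: [String]
withoutPrefixA = rejectBy (isPrefixOf "a") ["a", "ab", "cd", "abcd", "xyz"]
-- withoutPrefixA == ["cd","xyz"]
```

In GHCi the same thing can be written inline as filter (not . isPrefixOf "a") ["a","ab","cd","abcd","xyz"], without defining a helper.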
