Extracting Open Library author names using Haskell Conduits

Almost anyone productively using Haskell has already stumbled upon Michael Snoyman’s Conduit library.

In this post I will show how to leverage Conduit's interleaved IO to parse author names from the Open Library data dumps.

In traditional languages like Java or C, it would be significantly more difficult and error-prone to interleave the following pipeline steps:

  • Downloading the file ol_dump_authors_latest.txt.gz using HTTP
  • Decompressing using a gunzip-like function
  • Splitting into lines and discarding everything but the JSON from said lines
  • Decoding the JSON and extracting the name field, or ignoring the line if parsing is not possible
  • Writing all names into a file (authors.txt in this example), one per line.
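
The third step can be tried out in isolation. The following is only a minimal sketch: it assumes that each dump line consists of tab-separated columns followed by the JSON record, the sample line is made up for illustration, and jsonPart is a hypothetical helper name that does not appear in the program below.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Hypothetical helper: drop everything before the first '{' of a line.
-- A line that contains no '{' at all becomes the empty string.
jsonPart :: B.ByteString -> B.ByteString
jsonPart = B.dropWhile (/= '{')

-- | Made-up sample line, assuming tab-separated columns followed by JSON.
sampleLine :: B.ByteString
sampleLine = B.pack
    "/type/author\t/authors/OL1A\t1\t2014-01-01T00:00:00\t{\"name\": \"Jane Doe\"}"
```

Loaded into GHCi, jsonPart sampleLine evaluates to the JSON part only, {"name": "Jane Doe"}, which is what the JSON decoder expects as input.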

Because the uncompressed dump is a multi-GiB file, a streaming parser is required. Using pipes or FIFOs to download and parse from stdin would certainly be possible in this simple example, however

Although there are libraries for other languages that facilitate native interleaved IO, it proves to be quite difficult to integrate third-party functionality like gunzip, especially when trying to avoid performance bottlenecks. When focusing on performance, you’re often stuck with manually manipulating memory buffers, which yields complex, error-prone code.

Conduits provide a ready-to-use solution for these issues. With libraries like conduit-extra, streaming gunzip can be integrated easily while keeping memory usage near-constant and resource usage deterministic.
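
To get a feeling for the streaming gunzip without downloading anything, a round trip through gzip and ungzip from Data.Conduit.Zlib (conduit-extra) can serve as a small experiment. This is a sketch under assumptions: it uses the newer .| and runConduit API instead of the older operators used in the program below, roundTrip is a hypothetical name, and CL.consume collects the output in memory only because this is a demonstration.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Conduit (runConduit, yield, (.|))
import qualified Data.Conduit.List as CL
import Data.Conduit.Zlib (gzip, ungzip)

-- | Hypothetical helper: compress a payload and decompress it again.
-- Both stages work chunk by chunk; only the final CL.consume
-- gathers the whole result, which a file sink would not do.
roundTrip :: B.ByteString -> IO B.ByteString
roundTrip input = do
    chunks <- runConduit $ yield input .| gzip .| ungzip .| CL.consume
    return (B.concat chunks)
```

The result of roundTrip should equal its input. In the actual program the compressed bytes come from the HTTP response rather than from gzip.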

Conduits maintain high performance while freeing the programmer from thinking about the specifics of splicing together the individual steps of the processing pipeline.
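
One example of such a specific is that network chunks rarely end on line boundaries. The sketch below feeds made-up chunks into CB.lines from Data.Conduit.Binary (conduit-extra); sampleChunks and reassembled are hypothetical names, and the newer .| and runConduitPure API is used instead of the older operators in the program below.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Conduit (runConduitPure, (.|))
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL

-- | Made-up chunks as they might arrive from a network source.
-- The chunk boundaries deliberately fall in the middle of lines.
sampleChunks :: [B.ByteString]
sampleChunks = map B.pack ["first li", "ne\nsecond line\nthi", "rd line\n"]

-- | Complete lines, reassembled regardless of chunk boundaries.
reassembled :: [B.ByteString]
reassembled =
    runConduitPure $ CL.sourceList sampleChunks .| CB.lines .| CL.consume
```

Here reassembled evaluates to the three complete lines "first line", "second line" and "third line". The buffering of partial lines happens inside CB.lines, so no manual buffer handling is needed in the pipeline itself.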

{-# LANGUAGE OverloadedStrings #-}
{-|

A program to stream-parse the open library authors dump and extract a list of author names.

(C) 2014 Uli Koehler

Released under Apache License v2.0

-}

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Conduit.Zlib (ungzip)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as LB
import Network.HTTP.Conduit (parseUrl, withManager, http, responseBody)
import Data.Conduit
import qualified Data.Conduit.Combinators as CC
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import Control.Applicative ((<$>), (<*>))
import Data.Aeson
import Control.Monad (mzero)

data Author = Author {authorKey :: Text, -- ^ Open library author key, e.g. /authors/OL5900296A
                      authorName :: Text -- ^ Name of the author person
                      } deriving (Show, Eq)

instance FromJSON Author where
    -- Text has its own FromJSON instance, so no String conversion is needed
    parseJSON (Object v) = Author <$>
                            v .: "key" <*>
                            v .: "name"
    parseJSON _ = mzero

main :: IO ()
main = do
    -- Define our conduit processing chain:
    --  - Skip anything before the beginning of the JSON in each line
    --  - Skip lines that can't be parsed
    let parseConduit = CL.mapMaybe (decode . LB.fromStrict . B.dropWhile ('{' /=))
    let showAuthor = CL.map (TE.encodeUtf8 . authorName)
    let processingConduit = ungzip =$= CB.lines =$= parseConduit =$= showAuthor =$= CC.unlinesAscii
    -- Initialize HTTP request
    request <- parseUrl "http://openlibrary.org/data/ol_dump_authors_latest.txt.gz"
    withManager $ \manager -> do
        response <- http request manager
        -- Use interleaved IO to fetch and process incrementally
        responseBody response $$+- processingConduit =$ CB.sinkFile "authors.txt"
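
The parsing stage of the program above can also be checked on a few in-memory lines, without any network access. What follows is a minimal sketch, not a replacement for the program: AuthorName is a hypothetical reduced record, the sample lines are made up, and the newer .| and runConduitPure API together with decodeStrict is swapped in for the older operators and the decode . LB.fromStrict combination.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON (..), decodeStrict, withObject, (.:))
import qualified Data.ByteString.Char8 as B
import Data.Conduit (runConduitPure, (.|))
import qualified Data.Conduit.List as CL
import Data.Text (Text)

-- | Hypothetical reduced record that holds only the author name.
newtype AuthorName = AuthorName Text deriving (Show, Eq)

instance FromJSON AuthorName where
    parseJSON = withObject "AuthorName" $ \v -> AuthorName <$> v .: "name"

-- | Made-up input: a valid record, a line without JSON,
-- and a record that lacks the name field.
sampleLines :: [B.ByteString]
sampleLines =
    [ "/type/author\t/authors/OL1A\t{\"key\": \"/authors/OL1A\", \"name\": \"Jane Doe\"}"
    , "this line carries no JSON"
    , "/type/author\t/authors/OL2A\t{\"key\": \"/authors/OL2A\"}"
    ]

-- | Only the lines that decode successfully survive this stage.
parsedNames :: [AuthorName]
parsedNames = runConduitPure $
    CL.sourceList sampleLines
        .| CL.mapMaybe (decodeStrict . B.dropWhile (/= '{'))
        .| CL.consume
```

Only the first sample line survives, so parsedNames contains the single entry for Jane Doe. The other two lines are dropped silently, which matches the "ignore the line if parsing is not possible" behaviour described at the beginning of this post.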