Almost anyone productively using Haskell has already stumbled upon Michael Snoyman’s Conduit library.
In this post I will show how to use Conduit's interleaved IO to parse author names from the Open Library data dumps.
Using traditional languages like Java or C, it would be significantly more difficult and error-prone to interleave the following steps of the processing pipeline:
- Downloading the file ol_dump_authors_latest.txt.gz using HTTP
- Decompressing it using a gunzip-like function
- Splitting the result into lines and discarding everything but the JSON from each line
- Decoding the JSON and extracting the name field, or ignoring the line if parsing is not possible
- Writing all names into a file (authors.txt in this example), one per line
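To make the line-splitting step more concrete, here is a minimal sketch that runs entirely in memory. The sample chunks are made up for illustration and only mimic the layout described above (some prefix, then a JSON object); they are not real records from the dump. The sketch assumes a conduit release that provides the .| operator and the Conduit module, and it cuts the input at arbitrary positions to show that lines are reassembled across chunk boundaries.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit (runConduitPure, sinkList, yieldMany, (.|))
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL

-- Hypothetical input: two dump-like lines, deliberately cut into chunks
-- at arbitrary positions, the way a network stream might deliver them.
chunks :: [B.ByteString]
chunks =
    [ "/type/author\t/authors/OL1A\t{\"name\": \"Ja"
    , "ne Doe\"}\n/type/author\t/auth"
    , "ors/OL2A\t{\"name\": \"John Smith\"}\n"
    ]

-- | Reassemble lines across chunk boundaries, then keep only the part
-- of each line that starts at the first '{'.
jsonPayloads :: [B.ByteString]
jsonPayloads =
    runConduitPure $
        yieldMany chunks
            .| CB.lines
            .| CL.map (B.dropWhile (/= '{'))
            .| sinkList

main :: IO ()
main = mapM_ B.putStrLn jsonPayloads
```

Each element of jsonPayloads is one complete JSON object, even though no input chunk contained a complete line on its own.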
Because the uncompressed dump is a multi-GiB file, a streaming parser is required. Using pipes or FIFOs to download and parse from stdin would certainly be possible in this simple example.
Although there are libraries for other languages facilitating native interleaved IO, it proves to be quite difficult to integrate third-party functionality like gunzip, especially if trying to avoid performance bottlenecks. When focusing on performance, you're often stuck with manually manipulating memory buffers, yielding complex, error-prone code.
Conduits provide a ready-to-use solution for these issues. Using libraries like conduit-extra, it is also easy to integrate streaming gunzip while still running in near-constant memory with deterministic resource usage.
While maintaining high performance, they allow the programmer not to think about the specifics of splicing together the individual steps of the processing pipeline.
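As a small illustration of how a streaming decompressor slots into a pipeline, the following sketch compresses a generated stream and immediately decompresses it again, counting the bytes that arrive at the far end. The payload line and the repetition count are arbitrary values chosen for the example, and the sketch assumes a conduit and conduit-extra release that provides the .| operator and runConduit.

```haskell
import Conduit (runConduit, (.|))
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Combinators as CC
import Data.Conduit.Zlib (gzip, ungzip)

-- Hypothetical payload: one short line, repeated many times by the source.
line :: B.ByteString
line = B.pack "one more line of repetitive text\n"

repetitions :: Int
repetitions = 100000

-- | Generate the stream chunk by chunk, gzip it, gunzip it again and
-- count the decompressed bytes. Each stage only sees one chunk at a time.
roundTripLength :: IO Int
roundTripLength =
    runConduit $
        CC.replicate repetitions line
            .| gzip
            .| ungzip
            .| CC.lengthE

main :: IO ()
main = do
    n <- roundTripLength
    putStrLn $ "bytes in:  " ++ show (repetitions * B.length line)
    putStrLn $ "bytes out: " ++ show n
```

The two numbers should match. Swapping the generated source for an HTTP response body or a file source leaves the middle of the pipeline unchanged, which is the point of composing the steps as conduits.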
    {-# LANGUAGE OverloadedStrings #-}
    {-|
        A program to stream-parse the open library authors dump
        and extract a list of author names.

        (C) 2014 Uli Koehler
        Released under Apache License v2.0
    -}
    import Data.Text (Text)
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import Data.Conduit.Zlib (ungzip)
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy.Char8 as LB
    import Network.HTTP.Conduit (parseUrl, withManager, http, responseBody)
    import Data.Conduit
    import qualified Data.Conduit.Combinators as CC
    import qualified Data.Conduit.List as CL
    import qualified Data.Conduit.Binary as CB
    import Control.Applicative
    import Data.Char
    import Data.Aeson
    import Data.Maybe
    import Control.Monad

    data Author = Author
        { authorKey  :: Text -- ^ Open library author key, e.g. /authors/OL5900296A
        , authorName :: Text -- ^ Name of the author person
        } deriving (Show, Eq)

    instance FromJSON Author where
        parseJSON (Object v) = Author <$> fmap T.pack (v .: "key")
                                      <*> fmap T.pack (v .: "name")
        parseJSON _          = mzero

    main :: IO ()
    main = do
        -- Define our conduit processing chain:
        --  - Skip lines that can't be parsed
        --  - Also skip anything until the JSON beginning in each line
        let parseConduit = CL.mapMaybe (decode . LB.fromStrict . B.dropWhile ('{' /=))
        let showAuthor = CL.map (TE.encodeUtf8 . authorName)
        let processingConduit = ungzip =$= CB.lines =$= parseConduit =$= showAuthor =$= CC.unlinesAscii
        -- Initialize HTTP request
        request <- parseUrl "http://openlibrary.org/data/ol_dump_authors_latest.txt.gz"
        withManager $ \manager -> do
            response <- http request manager
            -- Use interleaved IO to fetch and process incrementally
            responseBody response $$+- processingConduit =$ CB.sinkFile "authors.txt"
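The listing above uses the operator names and HTTP functions that were current in 2014. As far as I know, more recent releases of conduit and http-conduit favour the .| operator, runConduitRes and the Network.HTTP.Simple module, so with those the same pipeline might look roughly like the sketch below. Treat it as an assumption about your installed versions rather than a drop-in replacement: the AuthorName type and the namesFromLines helper are names I made up for this example, and only the name field is decoded.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit (ConduitT, runConduitRes, sinkFile, (.|))
import Data.Aeson (FromJSON (..), decode, withObject, (.:))
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as LB
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Combinators as CC
import qualified Data.Conduit.List as CL
import Data.Conduit.Zlib (ungzip)
import Data.Text (Text)
import qualified Data.Text.Encoding as TE
import Network.HTTP.Simple (getResponseBody, httpSource, parseRequest)

-- | Hypothetical helper type: only the field this example needs.
newtype AuthorName = AuthorName Text

instance FromJSON AuthorName where
    parseJSON = withObject "AuthorName" $ \o -> AuthorName <$> o .: "name"

-- | Uncompressed dump bytes in, newline-terminated UTF-8 names out.
-- Lines without decodable JSON or without a "name" field are skipped.
namesFromLines :: Monad m => ConduitT B.ByteString B.ByteString m ()
namesFromLines =
    CB.lines
        .| CL.mapMaybe (decode . LB.fromStrict . B.dropWhile (/= '{'))
        .| CL.map (\(AuthorName n) -> TE.encodeUtf8 n)
        .| CC.unlinesAscii

main :: IO ()
main = do
    request <- parseRequest "http://openlibrary.org/data/ol_dump_authors_latest.txt.gz"
    -- Download, decompress, parse and write incrementally.
    runConduitRes $
        httpSource request getResponseBody
            .| ungzip
            .| namesFromLines
            .| sinkFile "authors.txt"
```

Keeping the parsing stages in a separate conduit without any IO of its own means they can be run against in-memory sample data, independently of the network and of gzip.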