
Lazy IO considered harmful; way to go, Left-fold enumerator!

posted Oct 11, 2008, 7:56 PM by Wei Hu   [ updated Aug 26, 2009, 5:10 PM ]

Oleg is an admirable PL expert. At DEFUN 2008 he gave a talk on Iteratee IO as a better alternative to lazy IO. Lazy IO is a sin against pure FP code and should be frowned upon: it often causes subtle errors or performance overhead.

To motivate iteratee IO, let's consider folding a function over a collection. The function being folded is called an iteratee. The fold drives the traversal, so the iteratee never has to check whether the traversal has finished; this is a nice separation of concerns. But fold is not perfect: it exposes the iteratee's internal state and has no interface for early termination. In the following code we only want the accumulator part of the iteratee's return value, but its internal state, the counter, is returned as well. We also cannot terminate early when we hit a 0 in the collection.
-- Data.IntMap's fold; the function being folded is the iteratee
fold :: (a -> b -> b) -> b -> IntMap a -> b

-- the product of the collection, skipping the first n numbers
prodbut :: Int -> IntMap Int -> Int
prodbut n coll = snd (fold iteratee (n, 1) coll)
  where iteratee a (n, s) = if n <= 0 then (n, a * s) else (n - 1, s)

We formalize the above idea and design a better interface. Oleg first defines the stream type and the iteratee type that consumes chunks of the stream (we read a chunk at a time for efficiency). He then defines primitive parser-combinator iteratees, which, being monads, can be combined to read HTTP request headers.
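To make this concrete, here is a minimal sketch of the two types in the spirit of the talk; the names and details are mine and differ from Oleg's actual definitions:

data Stream
  = EOF             -- no more data will arrive
  | Chunk String    -- the next chunk of the input

data Iteratee a
  = Done a Stream               -- finished, plus the unconsumed rest of the stream
  | Cont (Stream -> Iteratee a) -- needs another chunk to continue

-- a primitive iteratee: yield the next character, if there is one
headI :: Iteratee (Maybe Char)
headI = Cont step
  where
    step (Chunk (c:cs)) = Done (Just c) (Chunk cs)
    step (Chunk [])     = headI          -- empty chunk: keep waiting
    step EOF            = Done Nothing EOF

Because such iteratees form a monad (instance omitted here), small ones like headI can be glued together with do-notation, which is how the HTTP-header parser is built.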

The counterpart of fold is the enumerator. Each enumerator takes an iteratee and returns an iteratee, so it's also referred to as an iteratee transformer.
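In the sketch above, an enumerator would look like the following; enumString is a hypothetical pure enumerator over an in-memory string, whereas Oleg's enumerators run in IO:

type Enumerator a = Iteratee a -> Iteratee a

-- feed an in-memory string to an iteratee in fixed-size chunks
enumString :: String -> Enumerator a
enumString xs (Cont k) | not (null xs) = enumString rest (k (Chunk chunk))
  where (chunk, rest) = splitAt 8 xs
enumString _ it = it   -- input exhausted, or the iteratee finished early

-- e.g. enumString "iteratee" headI  ==>  Done (Just 'i') (Chunk "teratee")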

It gets interesting when we start tackling IO. We abandon Handle completely and talk to the OS directly through the FFI and the System.Posix interface. The function enum_fd can be thought of as the driver: it reads chunks from the file descriptor, sets up the stream, and applies the iteratee to the stream.
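Here is a sketch of such a driver over the pure Iteratee above. The names are hypothetical, it assumes the pre-2.8 unix package where fdRead returns a String and a byte count, and Oleg's real enum_fd instead threads IO through the iteratee itself:

import qualified Control.Exception as E
import System.Posix.IO (fdRead)
import System.Posix.Types (Fd, ByteCount)

bufferSize :: ByteCount
bufferSize = 4096

enumFd :: Fd -> Iteratee a -> IO (Iteratee a)
enumFd _  it@(Done _ _) = return it        -- the iteratee finished: stop reading
enumFd fd (Cont k) = do
  -- some versions of fdRead throw on end-of-file, so treat any IO
  -- exception as "no more input" and send EOF to the iteratee
  r <- E.try (fdRead fd bufferSize) :: IO (Either E.IOException (String, ByteCount))
  case r of
    Left _        -> return (k EOF)
    Right (_, 0)  -> return (k EOF)
    Right (s, _)  -> enumFd fd (k (Chunk s))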

We've only talked about iteratees over Chars and Strings; see IterateeM for the general library. That module also defines enum_file, so you don't have to set up the file descriptor yourself. To use it in real life, make sure to change buffer_size in enum_fd. The iteratees there can perform side effects, which means we can interleave IO freely and safely.
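For comparison, a hypothetical enum_file on top of the enumFd sketch above might look like this (openFd's signature here is the one from unix < 2.8, where a Maybe FileMode is passed separately; IterateeM's real enum_file is more general):

import Control.Exception (bracket)
import System.Posix.IO (openFd, closeFd, defaultFileFlags, OpenMode(ReadOnly))

enumFile :: FilePath -> Iteratee a -> IO (Iteratee a)
enumFile path it =
  bracket (openFd path ReadOnly Nothing defaultFileFlags)  -- open the file ...
          closeFd                                          -- ... and always close it
          (\fd -> enumFd fd it)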

Hyena is a web server along the same lines (slides).


UPDATE: A safe lazy IO library has been announced. Four pieces of word-counting code, side by side:

-- lazy IO, uses constant memory
> import System.IO 
> main = print . length . words =<< getContents 

-- strict IO, reads all of stdin into memory
> import qualified System.IO.Strict as SIO 
> import System.IO.Strict (SIO) 
> main = SIO.run (SIO.print . length . words =<< SIO.getContents)

-- safe lazy IO, uses constant memory. LI is a functor, not a monad.
> import qualified System.IO.Lazy.Input as LI
> main = print =<< (LI.run $ fmap (length . words) LI.getContents)

-- or equivalently
> import qualified System.IO.Lazy.Input as LI
> import qualified System.IO.Strict as SIO
> import Control.Applicative
> main = LI.run' $ SIO.print . (length . words) <$> LI.getContents

Lazy IO is bad because it allows pure functions to be interleaved with IO actions arbitrarily. The safe-lazy-io library allows lazy IO in a controlled way: pure functions must be fmap-ed onto lazy inputs, which are then turned into an IO action using the run function.

The LI module remains lazy by 'copy-n-paste'-ing GHC's code, with one exception: errors during reading are not silently discarded. Every file opened lazily is encoded together with a finalizer that executes after the file processing has been strictly forced (via SIO), which makes the closing of files predictable. The lib also addresses a third issue of vanilla lazy IO: processing multiple inputs is problematic: keeping too many files open will exhaust the allowed resources, and closing files prematurely is also bad. Multiple inputs can be combined, or interleaved.