Software Tools in Haskell: count

count lines or chars on stdin

Posted on 2016-02-11 by nbloomf
Tags: software-tools-in-haskell, literate-haskell

This page is part of a series on Software Tools in Haskell.

This post is literate Haskell; you can load the source into GHCi and play along.


As usual, we start with some imports.

-- sth-count: count lines or chars on stdin
module Main where

import System.Exit (exitSuccess)
import System.Environment (getArgs)
import Control.Arrow ((>>>))
import Data.List (unfoldr)
import Data.Foldable (foldl')

Software Tools includes two programs – charcount and linecount – which count chars and lines on stdin, respectively. Because these functions are so similar, and we anticipate that counting lines will be more useful in practice than counting characters, we’ll combine these into one program and trigger character counting with an optional command line argument. count is a filter, taking either char or line text and producing a line.

Counting Chars

We want count --char to count the characters on stdin until EOF is reached. Almost right off the bat we have a problem: with Unicode it is not obvious what a “character” is. For simplicity’s sake I will split this problem in two: count --char, which ignores any issues with unicode normalization like combining diacritics, and glyphcount, which takes these issues into account. So for instance O̰ (capital O, with combining tilde below) counts as two characters but one glyph. Character encodings were much simpler and less useful when Software Tools was written. 🙂

Let’s write a generic list-length counter:

count :: (Num t) => [a] -> t
count = foldl' inc 0
  where inc n _ = n+1

We use foldl' to force strictness and avoid a memory leak. The foldl function is lazy by default, meaning that it would generate a stack of unevaluated additions to be carried out only once EOF is reached.

Counting Lines

By default, count counts lines on stdin until EOF is reached. Now while the definition of “character” is reasonably clear (unicode notwithstanding), we have to think more carefully about what a “line” is.

By custom on unix systems the “newline” character, denoted \n, signals the end of a line. To a first approximation we can march down the sequence of characters on stdin and count the \ns. But is every line necessarily terminated by a newline? What if there are no newlines? It seems wrong to say that a sequence of characters containing no newlines consists of zero lines. More generally, if a sequence of characters does not terminate in a newline, then the characters between the last \n and EOF should comprise a line. For example, the sequence of characters

foo\nbar\nbaz

should have three lines, not two. On the other hand, if there are no characters on stdin, then there are no lines. These edge cases are nicely handled by defining a line to be a maximal subsequence of a nonempty sequence of characters which does not include a newline. More concretely,

This logic is captured by the getLines function, which splits a string into lines at \n (taking care of any \ns at the end as necessary).

getLines :: String -> [String]
getLines = unfoldr firstLine
  where
    firstLine :: String -> Maybe (String, String)
    firstLine xs = case break (== '\n') xs of
      ("","")   -> Nothing
      (as,"")   -> Just (as,"")
      (as,b:bs) -> Just (as,bs)

The Program

Our main program is a little more complicated than Kernighan and Plauger’s because we’ve chosen to deal with command line arguments.

main :: IO ()
main = do
  args <- getArgs

  case args of
    ["--char"] -> do
      charFilter (show . count)
      putNewLine
    otherwise -> do
      charFilter (getLines >>> count >>> show)
      putNewLine

  exitSuccess

putNewLine is a semantic synonym for putStrLn "".

-- print a line break
putNewLine :: IO ()
putNewLine = putStrLn ""

-- apply a map to stdin
charFilter :: (String -> String) -> IO ()
charFilter f = do
  xs <- getContents
  putStr $ f xs

Note also that we’ve demonstrated the >>> operator from Control.Arrow in contrast with composition. This is a standard library operator which (used here) is simply reversed function composition; it allows us to read chains of functions as if data flows from left to right, following the arrows.