Software Tools in Haskell: glyphcount

count glyphs on stdin

Posted on 2016-02-14 by nbloomf
Tags: software-tools-in-haskell, literate-haskell

This page is part of a series on Software Tools in Haskell.

This post is literate Haskell; you can load the source into GHCi and play along.

As usual, we start with some imports.

-- sth-glyphcount: count glyphs on stdin
module Main where

import System.Exit (exitSuccess)
import Data.Char (isMark)
import Data.List (unfoldr)
import Data.Foldable (foldl')

When we wrote count, we saw that there is a serious ambiguity in the meaning of “character” in Unicode. On one hand Unicode defines a list of character code points, but on the other hand sequences of code points do not necessarily correspond to symbols on the screen in the way ASCII characters do. In fact there is a Unicode Technical Report which addresses this ambiguity; unfortunately the conclusion there is that

“The correspondence between glyphs and characters is generally not one-to-one, and cannot be predicted from the text alone.” (Unicode Technical Report #17)

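Given this difficulty, we will make a simplifying assumption: a “glyph” is a non-combining character followed by zero or more combining characters. Fortunately there is a standard library function, isMark (from Data.Char), which tests whether a character belongs to one of the Unicode mark categories, that is, whether it is a combining character. The getGlyphs function splits a string into a list of glyphs. A run of combining characters at the very start of the input, with no base character before it, is treated as a glyph of its own.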

-- break a string into a list of "glyphs"
getGlyphs :: String -> [String]
getGlyphs = unfoldr firstGlyph
  where
    firstGlyph :: String -> Maybe (String, String)
    firstGlyph "" = Nothing
    firstGlyph (x:xs) = if isMark x
      then Just $ break (not . isMark) (x:xs)
      else
        let (as,bs) = break (not . isMark) xs
        in Just (x:as, bs)
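
To see what this splitting does in practice, here is a small self-contained sketch that can be loaded into GHCi on its own. The names splitGlyphs, decomposed, precomposed, and sizes are hypothetical and not part of the tool; splitGlyphs is meant to behave like getGlyphs, written with span isMark in place of break (not . isMark).

```haskell
import Data.Char (isMark)
import Data.List (unfoldr)

-- Hypothetical stand-in for getGlyphs: same behaviour, written with span.
splitGlyphs :: String -> [String]
splitGlyphs = unfoldr step
  where
    step :: String -> Maybe (String, String)
    step "" = Nothing
    step (x:xs)
      -- Marks with no base character before them form a glyph of their own.
      | isMark x  = Just (span isMark (x:xs))
      -- Otherwise take the base character plus any marks that follow it.
      | otherwise = let (marks, rest) = span isMark xs
                    in Just (x : marks, rest)

-- 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a': three code points.
decomposed :: String
decomposed = ['e', '\x0301', 'a']

-- U+00E9, the precomposed form of e-acute: one code point, and not a mark.
precomposed :: String
precomposed = ['\xe9']

-- (number of code points, number of glyphs) for a string
sizes :: String -> (Int, Int)
sizes s = (length s, length (splitGlyphs s))
```

In GHCi, sizes decomposed evaluates to (3,2) and sizes precomposed to (1,1): the two spellings of e-acute differ in code points but agree in glyphs, which is the distinction this tool is after.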

-- generic length
count :: (Num t) => [a] -> t
count = foldl' inc 0
  where inc n _ = n+1
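
As a quick illustration of why count is "generic", the sketch below uses a hypothetical copy named countItems at two different result types and compares it with genericLength from Data.List, which has the same type. The other names are also made up for the example.

```haskell
import Data.Foldable (foldl')
import Data.List (genericLength)

-- Hypothetical copy of count under a different name.
countItems :: Num t => [a] -> t
countItems = foldl' (\n _ -> n + 1) 0

-- The result type is chosen by the caller.
asInt :: Int
asInt = countItems "glyph"

asInteger :: Integer
asInteger = countItems (replicate 100000 'x')

-- genericLength gives the same answer on the same input.
agreesWithLibrary :: Bool
agreesWithLibrary = asInteger == genericLength (replicate 100000 'x')
```

The strict left fold keeps the running total evaluated as it goes, which matters when the input is a long stream read from stdin.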

-- print a line break
putNewLine :: IO ()
putNewLine = putStrLn ""

-- apply a map to stdin
charFilter :: (String -> String) -> IO ()
charFilter f = do
  xs <- getContents
  putStr $ f xs
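
For comparison, the Prelude already provides interact, which reads all of stdin, applies a String -> String function, and writes the result to stdout. A minimal sketch of the same helper built on it might look like this; charFilter', codePointCount, and runCodePointCount are hypothetical names used only for illustration.

```haskell
-- Hypothetical alternative to charFilter, built on Prelude's interact.
charFilter' :: (String -> String) -> IO ()
charFilter' = interact

-- A pure transformation to plug in: the number of code points, as text.
codePointCount :: String -> String
codePointCount = show . length

-- Reads stdin lazily and prints the code point count (no trailing newline).
runCodePointCount :: IO ()
runCodePointCount = charFilter' codePointCount
```

Keeping the transformation pure, as codePointCount is here, means it can be tried out in GHCi without touching stdin at all.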

Now the main function is much like that of count.

main :: IO ()
main = do
  charFilter (show . count . getGlyphs)