Software Tools in Haskell: glyphcount
count glyphs on stdin
This page is part of a series on Software Tools in Haskell.
This post is literate Haskell; you can load the source into GHCi and play along.
As usual, we start with some imports.
-- sth-glyphcount: count glyphs on stdin
module Main where
import System.Exit (exitSuccess)
import Data.Char (isMark)
import Data.List (unfoldr)
import Data.Foldable (foldl')
When we wrote count, we saw that there is a serious ambiguity in the meaning of “character” in Unicode. On one hand, Unicode defines a list of character code points; on the other hand, sequences of code points do not necessarily correspond to symbols on the screen in the way ASCII characters do. In fact there is a Unicode Technical Report which addresses this ambiguity; unfortunately the conclusion there is that
“The correspondence between glyphs and characters is generally not one-to-one, and cannot be predicted from the text alone.” (Unicode Technical Report #17)
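To make the ambiguity concrete, here is a small sketch (not part of the tool itself). The names precomposed and decomposed are made up for this illustration. Both strings would normally be drawn as the same symbol, é, but they contain different numbers of code points, and length counts code points.

```haskell
-- The same visible symbol, encoded two ways.
-- (Both names are hypothetical, chosen only for this example.)

-- U+00E9, LATIN SMALL LETTER E WITH ACUTE: a single code point.
precomposed :: String
precomposed = "\x00E9"

-- 'e' followed by U+0301, COMBINING ACUTE ACCENT: two code points.
decomposed :: String
decomposed = "e\x0301"

main :: IO ()
main = do
  print (length precomposed)  -- 1
  print (length decomposed)   -- 2
```

A tool that simply counts characters reports 1 for the first string and 2 for the second, even though a reader sees one symbol in both cases.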
Given this difficulty, we will make a simplifying assumption: a “glyph” is a non-combining character followed by zero or more combining characters. Fortunately there is a standard library function, isMark, which detects combining characters (Unicode marks, such as diacritics). The getGlyphs function splits a string into a list of glyphs. If the input begins with combining characters that have no base character before them, that leading run is treated as a glyph of its own.
-- break a string into a list of "glyphs"
getGlyphs :: String -> [String]
getGlyphs = unfoldr firstGlyph
  where
    firstGlyph :: String -> Maybe (String, String)
    firstGlyph "" = Nothing
    firstGlyph (x:xs) = if isMark x
      then Just $ break (not . isMark) (x:xs)
      else do
        let (as,bs) = break (not . isMark) xs
        Just (x:as, bs)
-- generic length
count :: (Num t) => [a] -> t
count = foldl' inc 0
  where inc n _ = n+1
-- print a line break
putNewLine :: IO ()
putNewLine = putStrLn ""
-- apply a map to stdin
charFilter :: (String -> String) -> IO ()
charFilter f = do
  xs <- getContents
  putStr $ f xs
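To see the simplifying assumption in action, here is a self-contained sketch that repeats the getGlyphs and count definitions from above and applies them to a sample string. The name sample is made up for this example; the input is the word "café" with the accent written as a separate combining character.

```haskell
import Data.Char (isMark)
import Data.List (unfoldr, foldl')

-- Same definition as above, repeated so this block stands alone.
getGlyphs :: String -> [String]
getGlyphs = unfoldr firstGlyph
  where
    firstGlyph :: String -> Maybe (String, String)
    firstGlyph "" = Nothing
    firstGlyph (x:xs) = if isMark x
      then Just $ break (not . isMark) (x:xs)
      else do
        let (as,bs) = break (not . isMark) xs
        Just (x:as, bs)

-- Same generic length as above.
count :: (Num t) => [a] -> t
count = foldl' inc 0
  where inc n _ = n+1

-- Hypothetical sample input: 'c', 'a', 'f', 'e', then U+0301 (combining acute).
sample :: String
sample = "cafe\x0301"

main :: IO ()
main = do
  print (length sample)                  -- 5 code points
  print (count (getGlyphs sample) :: Int) -- 4 glyphs
  print (map length (getGlyphs sample))  -- [1,1,1,2]
```

The last glyph has two code points: the base letter and the mark that follows it. Note that this heuristic only looks at isMark, so it will not group everything a full Unicode text segmentation algorithm would.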
Now the main function is much like that of count.