Software Tools in Haskell: translit
transliterate or remove chars on stdin
This page is part of a series on Software Tools in Haskell.
This post is literate Haskell; you can load the source into GHCi and play along.
As usual, we start with some imports.
-- sth-translit: transliterate characters on stdin
module Main where
import System.Environment (getArgs, getProgName)
import System.Exit (exitSuccess, exitFailure)
import System.IO (hPutStrLn, stderr)
import Data.List (unfoldr)
See also the Backslash module.
The purpose of translit
is to replace characters by other characters; it applies a mapping with signature Char -> Char
to each character on stdin. (While simple, this is surprisingly useful.) The most succinct way to specify such a mapping is with two lists of characters, one denoting the domain of the character mapping and the other the codomain. For instance, calling
translit "abc" "xyzw"
replaces all a
s by x
, b
s by y
, and c
s by z
. (The w
has no effect.) It is useful if we can also specify ranges of characters with hyphens. For instance the string a-g
should be shorthand for abcdefg
. If the second list is nonempty, but shorter than the first, we pretend its final character is repeated indefinitely, so that
translit "abc" "x"
replaces all copies of a
, b
, or c
by x
.
If the second list is empty or not given, we can reasonably interpret this to mean we should remove the characters in the first list from the input. This is all well and good.
I have one quibble with Kernighan and Plauger’s design, though; they introduce another special input case. If the second list contains 0 or 1 characters, then the first list can be prepended by a “not” symbol like so:
translit -"abc" "x"
This means to replace every character except a
, b
, and c
by x
. I respectfully claim that this is the Wrong Thing. The arguments to translit
are shorthand for a mapping, given in the form of two lists. We can tell by the significance of order; translit ab xy
is not the same as translit ba xy
. It is only by a quirk of combinatorics that this dependence on order goes away if the second list has 0 or 1 entry. But the special “not” option implicitly treats the arguments of translit
as sets. To see this, think about what the complement of a list of characters is. In principle we could say that -"a-z"
, as a list, means all characters (in order) except for the lower case roman letters. But is this useful? For instance, what if the user tries to run
translit -"abc" "xy"
What does this mean? Is the y
just ignored? (User input should not be silently ignored.) Does the first unicode code point get mapped to x
, and all others to y
? Is this useful? Is it useful enough to warrant complicating translit
with the extra code needed to handle this special case of a special case? I don’t think it is.
But the ability to replace or remove characters from a set complement is useful. And so I will split translit
into two: translit
will handle the list-wise mapping of characters, and a separate program, charreplace
, will handle set-wise mappings. My opinion (and it’s just an opinion!) is that these two uses are different enough, semantically, to deserve separate tools; this avoids burdening the user with too many special cases and cluttering the interface with delicate options.
We already have most of the machinery needed for translit
, except for the code needed to interpret command line arguments. We introduce an internal representation of character sequences: a character sequence is a list of either single characters or ranges of characters.
data CharSeq
= Single Char
| Range Char Char
deriving (Show)
readCharSeq :: String -> Maybe String
readCharSeq = fmap charSeqsToList . readCharSeqs . bsUnEsc
where
charSeqsToList :: [CharSeq] -> String
charSeqsToList = concatMap charSeqToList
where
charSeqToList (Single x) = [x]
charSeqToList (Range x y) = enumFromTo x y
readCharSeqs :: String -> Maybe [CharSeq]
readCharSeqs = unfoldrMaybe firstCharSeq
where
firstCharSeq :: String -> Maybe (Maybe (CharSeq, String))
firstCharSeq "" = Just Nothing
firstCharSeq [x] = Just (Just (Single x, ""))
firstCharSeq ('-':_) = Nothing
firstCharSeq [x,y] = Just (Just (Single x, [y]))
firstCharSeq (x:y:z:xs) = case y of
'-' -> Just (Just (Range x z, xs))
otherwise -> Just (Just (Single x, y:z:xs))
Now the main program just has to interpret its arguments and call some library functions.
main :: IO ()
main = do
args <- getArgs
(from,to) <- case map (readCharSeq . bsUnEsc) args of
[Just as] -> return (as, "")
[Just as, Just bs] -> return (as, bs)
otherwise -> argError
let
remove = filter (not . (`elem` from))
translit = map (applyListMap $ zip from (padLast to))
case to of
"" -> charFilter remove
otherwise -> charFilter translit
exitSuccess
argError :: IO a
argError = reportErrorMsgs
[ "usage:"
, " translit [FROM] [TO] -- replace chars in FROM by those in TO"
, " translit [REMOVE] -- remove chars in REMOVE"
] >> exitFailure
padLast :: [a] -> [a]
padLast [] = []
padLast [x] = repeat x
padLast (x:xs) = x : padLast xs
Note that the arguments of translit
are run through bsUnEsc
, so that we can easily work with otherwise untypeable characters. (We could, for example, use this to replace charfullwidth
.) With translit
, many of the small tools we’ve written so far can suddenly be combined to do neat things. As a simple example, put the following text in a file called unicode-test.txt
.
\uxyz0 \uxyz1 \uxyz2 \uxyz3
\uxyz4 \uxyz5 \uxyz6 \uxyz7
\uxyz8 \uxyz9 \uxyza \uxyzb
\uxyzc \uxyzd \uxyze \uxyzf
Now the pipeline
cat unicode-test.txt | translit "xyz" "001" | unescape
replaces the x
, y
, and z
with 0
, 0
, and 1
and interprets the \uXXXX
as escape codes. This lets us see what several unicode code points look like at one time. With a larger “template” file we could see more characters at a time.
Old Stuff
-- apply a map to stdin
charFilter :: (String -> String) -> IO ()
charFilter f = do
xs <- getContents
putStr $ f xs
-- apply a list of input-output pairs
applyListMap :: (Eq a) => [(a,a)] -> a -> a
applyListMap zs x = case lookup x zs of
Nothing -> x
Just y -> y
-- write list of messages to stderr
reportErrorMsgs :: [String] -> IO ()
reportErrorMsgs errs = do
name <- getProgName
sequence_ $ map (hPutStrLn stderr) $ ((name ++ " error"):errs)
unfoldrMaybe :: (b -> Maybe (Maybe (a,b))) -> b -> Maybe [a]
unfoldrMaybe f x = case f x of
Nothing -> Nothing
Just Nothing -> Just []
Just (Just (a,b)) -> do
as <- unfoldrMaybe f b
Just (a:as)