Software Tools in Haskell: compress

compress text on stdin (run length encoding)

Posted on 2016-02-23 by nbloomf

Tags: software-tools-in-haskell, literate-haskell

This page is part of a series on Software Tools in Haskell.

This post is literate Haskell; you can load the source into GHCi and play along.

As usual, we start with some imports.

-- sth-compress: compress stdin (run length encoding)
--   character-oriented
module Main where

import System.Exit (exitSuccess)
import Data.List (unfoldr)
import Data.Foldable (foldl')

In a list of characters, a run is a sublist of characters which are all the same. For example, the list

bookkeeper

has three runs, each of two characters. If a list contains many long runs, it can be losslessly compressed using a technique called run length encoding. With such a scheme, instead of storing a literal run like aaaaaaa we store the repeated character and the number of times it repeats. Kernighan and Plauger do this by breaking a sequence of characters into two kinds of sublists: runs, sublists of repeated characters (longer than some threshold length), and chunks, sublists containing no runs. The run length encoding scheme in Software Tools then transforms a stream of characters into blocks of the form

(symbol denoting repeat)
(character to be repeated)
(repeat count)

and

(chunk count)
(list of that many characters)

Where (symbol denoting repeat) is a special character, which we will call the sigil. Let’s tweak this scheme just slightly. As K&P point out, no compression scheme can perform well on all input (in fact every compression scheme must make some inputs bigger). But we are wise to consider “how bad it gets when it gets bad”. How bad does this scheme get? The worst possible input for a run length encoding scheme is one with no runs at all, since there are no opportunities for compression. But notice what happens in this case; the entire input is a “chunk”, and so must be encoded with its count. The amount of space required to store an arbitrarily large integer, regardless of the scheme required, is proportional to its number of digits. (We can make that proportion smaller by choosing a larger base, but only to a point.) That means an input stream of length \(n\) with no runs will require about \(\log(n)\) characters just to store the chunk count, for a “compressed” file size of about \(n + \log(n)\). Not great!

What if, instead of keeping track of both chunk sizes and repeat counts, we only keep track of repeat counts and reuse the sigil to also denote the end of the repeat count. This way, an encoded stream looks like a stream of

(sigil)
(character to be repeated)
(repeat count)
(sigil)

and

(characters not including the sigil)

Of course now we also need to provide a way to encode literal instances of the sigil. What is the simplest way to do this? All characters other than the sigil are interpreted literally, unless we want to introduce another escape character. We can’t use a single sigil, since that means “start a new encoded run”. And we cannot use two sigils, because that means “start a new encoded run of sigils”. But three copies of the sigil character in a row does not mean anything, if we remember not to use the sigil character to encode numbers. So we interpret the string

(sigil)(sigil)(sigil)

to mean a literally encoded sigil. (Note that it is then more space efficient to encode two or more literal sigils as a run. (Two literal sigils is 6 characters in this scheme, but only 4 as a run.))

What is the worst case now? Well, an input stream with no runs and no sigils – a chunk – will be encoded as is with no overhead. An input stream with no runs and including sigils will require two extra characters for each sigil. In the worst case, a stream of \(n\) characters will require about \(2n\) characters for the sigils, for a “compressed” size of about \(3n\). Really not great!

On the face of it the second scheme is much worse, in the worst case, than the first. But which is worse on realistic data? If we plan to use compress on textual data we can choose the sigil to be a rarely used character. ASCII includes several control characters, like \BEL, which do not appear in text. Note that if our input does not have any literal sigils, then the second scheme can never compress its input to a larger size as long as we only compress runs of at least 5 characters (as the only way this happens is by encoding literal sigils). On the other hand, the first scheme adds overhead proportional to \(\log(n)\) for every chunk of length \(n\) – so unless our input includes long runs, or lots of short runs, the size may not decrease much and can easily increase.

The main program is basic; compress is a character-oriented filter.

main :: IO ()
main = do
  charFilter (rlEncode '\BEL' 5)
  exitSuccess

The actual run length encoding is a little complicated. We define an internal representation for run length encoded data.

data RLE a
  = Chunk   [a]
  | Repeat  Int a
  | Literal Int
  deriving Show

Doing this is not strictly necessary, but introducing a type for run length encoded data makes it easier to decompose algorithms (algebraic data types are a big win here). Now rlEncode works in two phases: first it reads a stream of characters into the internal representation of RLEs, and then it serializes that representation as a string.

rlEncode :: Char -> Int -> String -> String
rlEncode sig k = showRLE sig . runLengthEncode sig k
  where
    showRLE :: Char -> [RLE Char] -> String
    showRLE sig = concatMap write
      where
        write :: RLE Char -> String
        write (Repeat k x) = concat
          [[sig], [x], showBase86Nat k, [sig]]
        write (Chunk xs) = xs
        write (Literal 1) = [sig,sig,sig]
        write (Literal k) = concat $
          [[sig], [sig], showBase86Nat k, [sig]]

    runLengthEncode :: (Eq a) => a -> Int -> [a] -> [RLE a]
    runLengthEncode sig t = unfoldr (getFirst t) . getRuns
      where
        getFirst _ [] = Nothing
        getFirst t ((x,k):xs)
          | t <= k    = Just (Repeat k x, xs)
          | x == sig  = Just (Literal k, xs)
          | otherwise = let (ys,zs) = split ((x,k):xs)
                        in Just (Chunk $ fromRuns ys, zs)
              where
                split = span (\(z,h) -> t > h && z /= sig)

We use two helper functions to work with run-encoded lists.

getRuns :: (Eq a) => [a] -> [(a, Int)]
getRuns = unfoldr firstRun
  where
    firstRun :: (Eq a) => [a] -> Maybe ((a, Int), [a])
    firstRun []     = Nothing
    firstRun (x:xs) = let (as,bs) = span (== x) xs in
      Just ((x, 1 + count as), bs)


fromRuns :: [(a, Int)] -> [a]
fromRuns = concatMap (\(x,k) -> replicate k x)

Repeat counts are encoded in base 86 for space efficiency. (Counts up to 85 need only one character, counts up to 7395 need at most two, and up to 636055 need at most three.)

showBase86Nat :: Int -> String
showBase86Nat k
  | k < 0     = ""
  | otherwise = case sequence $ map intToChar $ digitsToBase 86 k of
      Nothing -> error "showBase86Nat"
      Just x  -> x
      
  where
    intToChar :: Int -> Maybe Char
    intToChar x = lookup x $ map swap
      [ ('0',0),  ('1',1),  ('2',2),  ('3',3),  ('4',4)
      , ('5',5),  ('6',6),  ('7',7),  ('8',8),  ('9',9)
      , ('a',10), ('b',11), ('c',12), ('d',13), ('e',14)
      , ('f',15), ('g',16), ('h',17), ('i',18), ('j',19)
      , ('k',20), ('l',21), ('m',22), ('n',23), ('o',24)
      , ('p',25), ('q',26), ('r',27), ('s',28), ('t',29)
      , ('u',30), ('v',31), ('w',32), ('x',33), ('y',34)
      , ('z',35), ('A',36), ('B',37), ('C',38), ('D',39)
      , ('E',40), ('F',41), ('G',42), ('H',43), ('I',44)
      , ('J',45), ('K',46), ('L',47), ('M',48), ('N',49)
      , ('O',50), ('P',51), ('Q',52), ('R',53), ('S',54)
      , ('T',55), ('U',56), ('V',57), ('W',58), ('X',59)
      , ('Y',60), ('Z',61), ('?',62), ('!',63), ('#',64)
      , ('&',65), ('@',66), ('$',67), ('=',68), ('+',69)
      , ('-',70), ('~',71), ('<',72), ('>',73), ('[',74)
      , (']',75), ('(',76), (')',77), ('{',78), ('}',79)
      , ('|',80), ('/',81), ('*',82), ('^',83), (':',84)
      , (';',85)
      ]

    swap (x,y) = (y,x)

It will be difficult to test compress until we’ve also written its companion, expand.

Old stuff

-- apply a map to stdin
charFilter :: (String -> String) -> IO ()
charFilter f = do
  xs <- getContents
  putStr $ f xs


-- generic length
count :: (Num t) => [a] -> t
count = foldl' inc 0
  where inc n _ = n+1


-- digits base b
digitsToBase :: (Integral n) => n -> n -> [n]
digitsToBase b k
  | b <= 1 || k <= 0 = []
  | otherwise = reverse $ foo k
      where
        foo t
          | t < b = [t]
          | otherwise = (t`rem`b) : (foo (t`quot`b))