Software Tools in Haskell: archive

bundle text files

Posted on 2016-03-08 by nbloomf
Tags: software-tools-in-haskell, literate-haskell

This page is part of a series on Software Tools in Haskell.

This post is literate Haskell; you can load the source into GHCi and play along.

As usual, we start with some imports.

-- archive: bundle text files
module Main where

import System.Exit (exitSuccess, exitFailure)
import System.Environment (getArgs, getProgName)
import System.Directory (doesFileExist)
import System.IO (hPutStrLn, stderr)
import Data.List (unfoldr, find)
import Control.Monad (foldM)

archive is an extremely limited form of the standard tool tar. From a bird’s eye view, it bundles one or more text files into a single archive, which can be thought of as a mapping from file names to file contents. It does not maintain a hierarchy of its contents and only works with text files. In fact the archive “format” is little more than the concatenation of file names and contents.

An archive consists of a sequence of entries of the form

#name of file
>line 1 of file
>line 2 of file

Lines starting with any character other than # or > are ignored. (We shouldn’t be editing the archive file by hand, and we’ll be careful that archive writes correctly formatted archives.) If two items in an archive have the same name, we ignore all but the first. This shouldn’t happen if we only manipulate archives with archive, but our archive format is such that the concatenation of two archives may be a valid archive. The program is called in one of the following ways:

archive NAME --list
archive NAME --add FILE ...
archive NAME --get FILE ...
archive NAME --remove FILE ...
archive NAME --replace FILE ...

list prints the contents of the files in an archive. add puts new files (given by name as arguments) into an existing archive, and get retrieves them again, printing to stdout. remove deletes an entry in an archive, and replace is equivalent to a remove followed by an add, swapping out the contents of a file in the archive. Destructive operations (remove and replace) send the altered archive to stdout rather than destroying the original. Trying to operate on a nonexistent archive creates it on the fly.

First, we’ll define an abstract “archive” type to keep track of the information we want: an archive is a list of named items.

type Name = String

data Item a = I
  { nameOf     :: Name
  , contentsOf :: a
  } deriving Show

data Archive a
  = A [Item a]
  deriving Show

emptyArchive :: Archive a
emptyArchive = A []

The use of the synonym Name for String is an example of a common idiom in languages with type inference. Names are Strings, but by introducing a lightweight type to indicate that a particular string is not just a String, but a Name, will make our type signatures more meaningful with no additional overhead. Due in part to purity, type signatures in Haskell, and I imagine in languages of similar descent like ML and F#, are a powerful kind of documentation. Just knowing the signature of a function tells us quite a bit about what it may – or may not! – do. But now I’m rambling.

Our archive type needs to support the five basic operations: adding, retrieving, removing, viewing, and listing items. This is straightforward.

getNames :: Archive a -> [Name]
getNames (A xs) = map nameOf xs

getItem :: Archive a -> Name -> Maybe a
getItem (A xs) str = do
  let hasName x = nameOf x == str
  item <- find hasName xs
  return (contentsOf item)

getItems :: Archive a -> [Name] -> Maybe [a]
getItems arch = mapM (getItem arch)

putItem :: Archive a -> (Name, a) -> Maybe (Archive a)
putItem (A []) (str, x) =
  Just $ A [I { nameOf = str, contentsOf = x }]
putItem (A (item:xs)) (str, x) = do
  if nameOf item == str
    then Nothing
    else do
      A zs <- putItem (A xs) (str, x)
      return $ A (item:zs)

putItems :: Archive a -> [(Name, a)] -> Maybe (Archive a)
putItems = foldM putItem

replaceItem :: Archive a -> (Name, a) -> Archive a
replaceItem (A []) (str, x) =
  A [I { nameOf = str, contentsOf = x }]
replaceItem (A (item:xs)) (str, x) =
  if nameOf item == str
      A (item { contentsOf = x } : xs)
      let A zs = replaceItem (A xs) (str, x)
      in A (item:zs)

replaceItems :: Archive a -> [(Name, a)] -> Archive a
replaceItems = foldl replaceItem

deleteItem :: Archive a -> Name -> Archive a
deleteItem (A []) _ = A []
deleteItem (A (item:xs)) str =
  if nameOf item == str
    then A xs
      let A zs = deleteItem (A xs) str
      in A (item:zs)

deleteItems :: Archive a -> [Name] -> Archive a
deleteItems = foldl deleteItem

Note that implementing archives as a (pure) type means that in principle we do have to read in an entire archive before we can operate on it, laziness notwithstanding. It’s possible this will cause performance issues on large archives, but this is fine for now.

The final functions we need for our abstract Archive type allow us to read from, and write to, lists of strings.

readArchiveBy ::
  ([String] -> a) -> [String] -> Archive a
readArchiveBy rd lns = A $ map try $ readArchive lns
    try (name, strs) = I { nameOf = name, contentsOf = rd strs }

readArchive :: [String] -> [(Name, [String])]
readArchive = unfoldr rdFst
    rdFst :: [String] -> Maybe ((Name, [String]), [String])
    rdFst []             = Nothing
    rdFst (('#':ln):lns) = do
        isContent as = case as of
          '>':_     -> True
          otherwise -> False
        (xs,rest) = span isContent lns
      return ((ln, map tail xs), rest)
    rdFst (_:lns)        = rdFst lns

writeArchiveBy :: (a -> [String]) -> Archive a -> [String]
writeArchiveBy wr (A xs) = concatMap writeItem xs
    writeItem item
      = ('#' : nameOf item) : map ('>':) (wr $ contentsOf item)

readStringArchive :: [String] -> Archive [String]
readStringArchive = readArchiveBy id

writeStringArchive :: Archive [String] -> [String]
writeStringArchive = writeArchiveBy id

The Archive type and basic operations are kept in a separate module; users of the module can only create and manipulate Archives via the provided interface. In this way we have the ability to change the implementation later if needed; for instance, we could store items in an archive using a tree rather than a list. (Although, since in typical usage archive has to read an entire archive anyway, I doubt using an asymptotically better structure will have much effect on the overall efficiency of the entire program. In such cases I think the simpler implementation is preferable.)

The main program is reasonably straightforward. We read the command line arguments to determine the name of the archive being manipulated, which of the five commands is being invoked, and the name(s) of the file(s) used.

data Action = List | Add | Get | Remove | Replace

main :: IO ()
main = do
  args <- getArgs

  -- process arguments
  (file, act, names) <- case args of
    [x,"--list"]       -> return (x, List, [])
    (x:"--add":xs)     -> return (x, Add, xs)
    (x:"--get":xs)     -> return (x, Get, xs)
    (x:"--remove":xs)  -> return (x, Remove, xs)
    (x:"--replace":xs) -> return (x, Replace, xs)
    otherwise          -> argErr >> exitFailure

  -- read the archive
  arch <- do
    fileExists <- doesFileExist file
    case fileExists of
      True -> do
        x <- fmap getLines $ readFile file
        return $ readStringArchive x
      False -> return emptyArchive

  -- how we process the items
    getItem str = case str of
      "-" -> do
        lns <- fmap getLines $ getContents
        return ("-", lns)
      otherwise -> do
        lns <- fmap getLines $ readFile str
        return (str, lns)

  -- do the thing
  case act of
    List -> sequence_ $ map putStrLn $ getNames arch

    Add -> do
      items <- mapM getItem names
      case putItems arch items of
        Just x  -> putFileLns file $ writeStringArchive x
        Nothing -> do
            [ "name exists in archive." ]
            >> exitFailure

    Get -> do
      case getItems arch names of
        Just xs -> mapM_ (sequence_ . map putStrLn) xs
        Nothing -> do
            [ "name does not exist in archive." ]
            >> exitFailure

    Remove -> do
        $ map putStrLn
        $ writeStringArchive
        $ deleteItems arch names

    Replace -> do
      items <- mapM getItem names
        $ map putStrLn
        $ writeStringArchive
        $ replaceItems arch items


argErr :: IO ()
argErr = reportErrorMsgs
  [ "usage:"
  , "  archive ARCH --list"
  , "  archive ARCH --add FILE ..."
  , "  archive ARCH --get FILE ..."
  , "  archive ARCH --remove FILE ..."
  , "  archive ARCH --replace FILE ..."

This program could be improved in several ways. As is, if the user tries to add an existing name to an archive, we bail. This seems like the Right Thing. But if the user tries to add a list of several names to an archive, and only one of them already exists, the entire operation fails. The same is true of the get operation. Would it be better to add/get what we can? Maybe, maybe not. The “fail early” strategy means that archive operations are atomic; each operation either succeeds entirely or fails entirely. This property can help us avoid inconsistent state, if, say, archive is used as a component in another program. Kernighan and Plauger also suggest keeping track of the date and time that a particular file is archived in the item header; this could be done by adding a DateTime field to the Item type and passing the date and time as an extra argument from the main function.

One possible drawback of this design is that we add some overhead to the files being archived – one extra character per line. For some files this might be a significant burden.

Old Stuff

-- split on \n
getLines :: String -> [String]
getLines = unfoldr firstLine
    firstLine :: String -> Maybe (String, String)
    firstLine xs = case break (== '\n') xs of
      ("","")   -> Nothing
      (as,"")   -> Just (as,"")
      (as,b:bs) -> Just (as,bs)

-- write list of messages to stderr
reportErrorMsgs :: [String] -> IO ()
reportErrorMsgs errs = do
  name <- getProgName
  sequence_ $ map (hPutStrLn stderr) $ ((name ++ " error"):errs)

putFileLns :: String -> [String] -> IO ()
putFileLns name lns = writeFile name (unlines lns)