# Wordlist

Posted on 2017-06-25 by nbloomf

This post is literate Haskell; you can load the source into GHCi and play along.

{-# LANGUAGE BangPatterns         #-}
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE FlexibleInstances    #-}
module Main where

import Prelude hiding (Word)
import Control.Monad (foldM)
import Control.Monad.Writer
import System.FilePath
import System.Exit
import System.Environment
import Data.List
import Data.List.Split
import Text.ParserCombinators.Parsec

For a project I’m fiddling with (to be written up at some point) I found myself wishing I had a machine-readable dictionary of English words, including parts of speech, pronunciations, and division into syllables. As I’m sure we all do from time to time. :) It doesn’t need to be fancy – just a tab delimited file with lines like

word   part_of_speech   IPA_pronunciation   syllables

for as many English words as I can get. Fortunately for me, there are several large databases of this sort available under open licenses. I was able to find these:

1. The Moby Project (public domain) by Grady Ward – a huge trove of data and the only reason I thought this project would be feasible. Can be downloaded from Project Gutenberg.
2. The English Wiktionary (CC-BY-SA/GFDL) – also a huge trove of pronunciations and parts of speech, but less structured than the Moby Project. Wiktionary makes database dumps available for download; I used the one from June 20, 2017, but by the time you are reading this a more recent version will be up.
3. The Google Books Ngrams Viewer data (CC-BY-3.0) – specifically, the unigrams. This is more useful for the frequency data (to be used later) but also includes some part of speech data. Warning: these files are enormous.
4. The WordNet project (idiosyncratic license, but seems to be permissive). Focused more on semantic relations among words, still a great resource.
5. The GCIDE, a.k.a. GNU Collaborative International Dictionary of English (GPL3, not surprisingly) – based on the public domain Webster’s Unabridged 1913 and supplemented with WordNet.

There are probably other sources out there too, but this is a good start.

To give away the punchline, the final database is available on GitHub. This post is an explanation of the scripts I used to generate that file.

Clearly editing by hand is a non-starter. Instead we’ll build some small tools to parse our open dictionary data and merge it together the best we can manage. I will focus my attention on the Moby project and Wiktionary for now, since the other resources seem to be focused on kinds of data I’m less interested in for this project.

I’ll begin with the Moby Project data, since it explicitly contains everything I want – part of speech, pronunciation, and hyphenation (which a spot check indicates is a decent proxy for syllables). The first problem is that the Moby Project data isn’t encoded as UTF-8 text. This is reasonable, since Unicode barely existed when the Moby Project was put in the public domain. Wikipedia indicates that it is encoded as “macintosh roman” text, which we can fix with

cat foo.txt | iconv -f macintosh -t utf-8 > foo-utf8.txt

replacing foo with whatever file needs to be converted. (I know this is a Useless Use of Cat and I don’t care; having data flow in one direction makes more sense to me.) The line endings are also CRLF-style, which can be fixed with sed or (as I did) a text editor. Fixing up the encoding and line endings of the Moby files, I ended up with the following:

1. mwords.txt: 354984 lines.
2. mhyph.txt: 187175 lines, with hyphenation points (other than spaces or hyphens that are part of the word) indicated by a “•” symbol.
3. mpos.txt: 233356 lines, with part(s) of speech given by a particular encoding scheme.
4. mpron.txt: 177267 lines, with IPA pronunciation using an ASCII encoding.

In other words, jackpot. The data I want is there. It is, however, encoded, so we’ll have to do some tinkering to get it in the format I want. For example, a typical line in mhyph.txt looks like this:

so•da foun•tain

while a typical line in mpos.txt looks like this:

double-time\ti

and a typical line in mpron.txt looks like this:

flibbertigibbet 'fl/I/b/@/rt/i/,/dZ//I/b/I/t

Unifying all this will be a little more complicated and error-prone than my awk skills can handle, so I’ll use Haskell instead. This may turn out to be overkill, but we’ll see.

My strategy will be something like this.

• First, define types for the data we eventually want to store in a database – namely, Words.
• Write functions for rendering and parsing Words in the format I want to use in the final database.
• Write functions to parse Words from the textual formats in our open sources – the Moby files, Wiktionary, and Google Ngrams.
• Finally, actually parse the data, merge it, and write it to a single flatfile database.

Let’s get to it!

# Basic Types

About this database. Each record needs to represent a “word”. But what is a “word”, exactly? I’m not a linguist and am too lazy to look this up, but to my eye a word has the following things.

1. A written representation (how you write it),
2. A spoken representation (how you say it),
3. A grammatical function (how you use it), and
4. A meaning.

If two “words” are such that any of these items differ, then the words are different. For instance “close” (verb) and “close” (adjective) are spelled the same, but have different functions and pronunciations, so are different words. Likewise “run” (move by flapping your legs) and “run” (execute a computer program) are different words, despite being spelled and spoken the same way and both being verbs.

Of course real language is not so clear cut. For instance, the written and spoken representations of a word don’t have to be unique (think US vs. UK spellings, or accents). But for my needs this is good enough. (I have to remember, this is for a specific project!)

I will represent words using the following record type. Note in particular that part_of_speech, pronunciation, and hyphenation are all Maybes, while spelling is not, and a word can have at most one pronunciation. These choices privilege written words and “General American” pronunciations, but I’m ok with that for now.

data Word = Word
  { spelling       :: String
  , part_of_speech :: Maybe PartOfSpeech
  , pronunciation  :: Maybe Pronunciation
  , hyphenation    :: Maybe Hyphenation
  } deriving Show

And a simple constructor:

word :: String -> Word
word x = Word
  { spelling       = x
  , part_of_speech = Nothing
  , pronunciation  = Nothing
  , hyphenation    = Nothing
  }

Eventually we will want to store lists of words in text files. I’ll wrap functions to write and parse Words behind the class DictFormat. Since parsing can fail, the parser will return a Writer [String] t rather than simply t; the [String] type is meant for logging error messages.

class DictFormat t where
  writeDictFormat :: t -> String
  readDictFormat  :: String -> Writer [String] t

For example:

instance DictFormat String where
  writeDictFormat = id
  readDictFormat cs = return cs

instance (DictFormat t) => DictFormat (Maybe t) where
  writeDictFormat Nothing  = ""
  writeDictFormat (Just x) = writeDictFormat x

  readDictFormat "" = return Nothing
  readDictFormat cs = do
    x <- readDictFormat cs
    return (Just x)
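As a standalone illustration of this Writer pattern (a toy Color type, separate from the post’s module, so it can be run on its own):

```haskell
import Control.Monad.Writer

data Color = Red | Blue | Other
  deriving (Eq, Show)

-- A parser that never fails outright: unknown input falls back to
-- Other and leaves a message in the log.
readColor :: String -> Writer [String] Color
readColor "red"  = return Red
readColor "blue" = return Blue
readColor cs     = do
  tell ["Unrecognized color: " ++ cs]
  return Other

main :: IO ()
main = do
  print $ runWriter (readColor "red")
  print $ runWriter (readColor "mauve")
```

runWriter splits the result into the parsed value and the accumulated log, so a whole batch of parses can proceed while still reporting every bad record.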

And assuming DictFormat instances exist for PartOfSpeech, Pronunciation, and Hyphenation, we can define:

instance DictFormat Word where
  writeDictFormat w = intercalate "\t"
    [ writeDictFormat $ spelling w
    , writeDictFormat $ part_of_speech w
    , writeDictFormat $ pronunciation w
    , writeDictFormat $ hyphenation w
    ]

  readDictFormat cs = case wordsBy (== '\t') cs of
    [a,b,c,d] -> do
      sp  <- readDictFormat a
      pos <- readDictFormat b
      pro <- readDictFormat c
      hyp <- readDictFormat d
      return $ Word
        { spelling       = sp
        , part_of_speech = pos
        , pronunciation  = pro
        , hyphenation    = hyp
        }
    _ -> do
      tell ["Unrecognized format: " ++ cs]
      return $ word "error"

Now for the summand types. PartOfSpeech is (for now) determined by the possible values in the Moby data.

data PartOfSpeech
  = Noun
  | NounSingular
  | NounPlural
  | NounPhrase
  | Verb
  | VerbParticiple
  | VerbTransitive
  | VerbIntransitive
  | Adjective
  | Adverb
  | Conjunction
  | Preposition
  | Interjection
  | Pronoun
  | ArticleDefinite
  | ArticleIndefinite
  | Nominative
  | Unknown
  deriving (Eq, Ord, Show)

For now we’ll use an ad-hoc DictFormat encoding of this type.

instance DictFormat PartOfSpeech where
  writeDictFormat x = case x of
    Noun              -> "n"
    NounSingular      -> "n-s"
    NounPlural        -> "n-pl"
    NounPhrase        -> "n-ph"
    Verb              -> "v"
    VerbParticiple    -> "v-p"
    VerbTransitive    -> "v-t"
    VerbIntransitive  -> "v-i"
    Adjective         -> "adj"
    Adverb            -> "adv"
    Conjunction       -> "conj"
    Preposition       -> "pre"
    Interjection      -> "int"
    Pronoun           -> "pro"
    ArticleDefinite   -> "a-def"
    ArticleIndefinite -> "a-ind"
    Nominative        -> "nom"
    Unknown           -> ""

  readDictFormat cs = case cs of
    "n"     -> return Noun
    "n-s"   -> return NounSingular
    "n-pl"  -> return NounPlural
    "n-ph"  -> return NounPhrase
    "v"     -> return Verb
    "v-p"   -> return VerbParticiple
    "v-t"   -> return VerbTransitive
    "v-i"   -> return VerbIntransitive
    "adj"   -> return Adjective
    "adv"   -> return Adverb
    "conj"  -> return Conjunction
    "pre"   -> return Preposition
    "int"   -> return Interjection
    "pro"   -> return Pronoun
    "a-def" -> return ArticleDefinite
    "a-ind" -> return ArticleIndefinite
    "nom"   -> return Nominative
    _ -> do
      tell ["Unrecognized part of speech: " ++ cs]
      return Unknown

Pronunciations will eventually be strings of IPA characters, but the Moby data uses an ASCII encoding of IPA. We’ll just store these as strings, without doing any kind of validation.

data Pronunciation = Pron
  { unPron :: String
  } deriving (Eq, Show)

instance DictFormat Pronunciation where
  writeDictFormat = unPron
  readDictFormat cs = return $ Pron { unPron = cs }

Hyphenation is a little trickier. In the Moby database, syllables of a given word are delimited by any of three different characters. Compound words have their parts separated by spaces, hyphenated words are divided by hyphens (of course), and the possible hyphenation points of a simple word are separated by bullets. These form a kind of hierarchy which we can model by saying that a “word” is a list of lists of lists of syllables, with the outer list delimited by spaces and the middle lists delimited by hyphens.

data Hyphenation = Hyp
  { unHyp :: [[[String]]]
  } deriving (Eq, Show)

This is a decent format for storing hyphenations, so I’ll steal it for our eventual final database.

instance DictFormat Hyphenation where
  writeDictFormat =
    intercalate " "
      . map (intercalate "-")
      . map (map (intercalate "•"))
      . unHyp

  readDictFormat =
    return
      . Hyp
      . map (map (wordsBy (== '•')))
      . map (wordsBy (== '-'))
      . wordsBy (== ' ')
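A standalone round trip of the three-level split, runnable on its own (splitOn' is a hand-rolled, base-only stand-in for Data.List.Split’s wordsBy; the two agree on inputs with no empty fields):

```haskell
-- Split a string on a delimiter character.
splitOn' :: Char -> String -> [String]
splitOn' c s = case break (== c) s of
  (w, [])     -> [w]
  (w, _:rest) -> w : splitOn' c rest

-- Spaces delimit compound parts, hyphens delimit hyphenated parts,
-- bullets delimit syllables: three nested levels of lists.
readHyp :: String -> [[[String]]]
readHyp =
  map (map (splitOn' '•') . splitOn' '-') . splitOn' ' '

main :: IO ()
main = print (readHyp "so•da foun•tain")
-- [[["so","da"]],[["foun","tain"]]]
```

So “so•da foun•tain” becomes two compound parts, each a single unhyphenated word of two syllables.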

# Parsing Moby Data

Next we’ll deal with parsing the Moby files. For reasons we’ll see in a moment, instead of parsing the files as lists of Words, we’ll simply parse them as key-value pairs, keyed by the spelling of the word. For instance, the hyphenation file will become a [(String, Hyphenation)]. First we need a helper function to strip extra spaces from a string.

stripSpaces :: String -> String
stripSpaces =
  reverse
    . dropWhile (== ' ')
    . reverse
    . dropWhile (== ' ')

Parsing the list of words alone is simple enough.

readWords :: FilePath -> IO [(String,())]
readWords path = do
  raw <- fmap (map stripSpaces . lines) $ readFile path
  return $ map (\x -> (x, ())) raw

Parsing in the Moby hyphenations list is also not too bad; we just need to filter out the bullet characters to recover the original word.

readMobyHyph :: FilePath -> IO [(String, Hyphenation)]
readMobyHyph path = do
  let
    parseRecord :: String -> (String, Hyphenation)
    parseRecord str =
      ( stripSpaces $ filter (/= '•') str
      , let (x,_) = runWriter $ readDictFormat str in x
      )

  raw <- fmap lines $ readFile path
  return $ map parseRecord raw

Parsing the part of speech file is a little trickier; each word can have several parts of speech, and the part of speech codes may be “invalid”. (I found a couple of POS codes that made no sense!)

Part of speech information is stored in the Moby files using two different formats: one in the part of speech file, and another in the pronunciation file. We’ll define two functions for parsing both kinds.

mobyPOSCode :: Char -> Either String PartOfSpeech
mobyPOSCode c = case c of
  'N' -> Right NounSingular
  'p' -> Right NounPlural
  'h' -> Right NounPhrase
  'V' -> Right VerbParticiple
  't' -> Right VerbTransitive
  'i' -> Right VerbIntransitive
  'A' -> Right Adjective
  'v' -> Right Adverb
  'C' -> Right Conjunction
  'P' -> Right Preposition
  '!' -> Right Interjection
  'r' -> Right Pronoun
  'D' -> Right ArticleDefinite
  'I' -> Right ArticleIndefinite
  'o' -> Right Nominative
  _   -> Left $ "Unknown part of speech: " ++ [c]

Parsing the actual file:

readMobyPOS :: FilePath -> IO [(String, PartOfSpeech)]
readMobyPOS path = do
  let
    makeWord :: String -> Char -> Writer [String] [(String, PartOfSpeech)]
    makeWord !word code = case mobyPOSCode code of
      Left msg -> do
        tell [msg]
        return []
      Right pos -> do
        return [( stripSpaces word, pos )]

    parseRecord :: String -> Writer [String] [(String, PartOfSpeech)]
    parseRecord !str = do
      let (pos,drow) = break (== '\\') $ reverse str

      case drow of
        '\\':x ->
          fmap concat $ sequence $ map (makeWord $ reverse x) pos
        _ -> do
          tell ["Unexpected format: " ++ reverse drow]
          return []

  raw <- fmap lines $ readFile path
  let (dict,err) = runWriter $ fmap concat $ sequence $ map parseRecord raw
  case err of
    [] -> return ()
    _  -> do
      putStrLn $ "parse errors in " ++ path
      sequence_ $ map putStrLn err
      exitFailure
  return dict

Parsing pronunciations is also tricky. Some words have more than one pronunciation, depending on their part of speech. First a helper to parse part of speech codes.

mobyPronPOSCode :: String -> Either String PartOfSpeech
mobyPronPOSCode s = case s of
  "n"   -> Right Noun
  "v"   -> Right Verb
  "aj"  -> Right Adjective
  "av"  -> Right Adverb
  "inj" -> Right Interjection
  "prp" -> Right Preposition
  _     -> Left $ "Unknown part of speech: " ++ s
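The reverse-then-break trick in parseRecord, which splits a record at its *last* backslash, can be sketched on its own (splitRecord is a hypothetical helper, not part of the post’s code):

```haskell
-- Split an mpos-style record "word\codes" at the last backslash by
-- reversing first, so backslashes inside the word itself are kept.
splitRecord :: String -> Maybe (String, String)
splitRecord str = case break (== '\\') (reverse str) of
  (pos, '\\':drow) -> Just (reverse drow, reverse pos)
  _                -> Nothing

main :: IO ()
main = print (splitRecord "double-time\\ti")
-- Just ("double-time","ti")
```

Here “double-time” carries the codes t (transitive verb) and i (intransitive verb).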

The pronunciations themselves are not stored using IPA, but rather a kind of ASCII encoding. The following helper function is my best attempt at converting one to the other. :) I had to correct some errors in the data by hand.
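To show the shape of the conversion before the full table, here is a cut-down standalone sketch. It handles only the handful of codes needed for one example word, with illustrative (not authoritative) code-to-IPA choices; the real function below has the complete mapping. The key idea is longest-match-first pattern order and dropping the / delimiters:

```haskell
-- A toy fragment of the ASCII-to-IPA rewriting.
toIPA :: String -> String
toIPA str = case str of
  []           -> []
  ('/':cs)     -> toIPA cs            -- phoneme delimiters are dropped
  ('d':'Z':cs) -> "dʒ" ++ toIPA cs    -- two-letter codes must come first
  ('\'':cs)    -> '\''  : toIPA cs
  (',':cs)     -> ','   : toIPA cs
  ('@':cs)     -> "ə"  ++ toIPA cs
  ('I':cs)     -> "ɪ"  ++ toIPA cs
  ('i':cs)     -> "iː" ++ toIPA cs
  (c:cs)       -> c : toIPA cs        -- pass anything else through
```

For instance, toIPA "'fl/I/b/@/rt/i/,/dZ//I/b/I/t" yields "'flɪbərtiː,dʒɪbɪt".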

mobyToIPA :: String -> Writer [String] String
mobyToIPA str = case str of
  []                       -> return []
  ('(':'/':'@':'/':')':cs) -> fmap ('ə':)    $ mobyToIPA cs
  ('e':'/':'T':'/':cs)     -> fmap ("ɛð" ++) $ mobyToIPA cs
  ('/' :cs)        -> mobyToIPA cs
  ('\'':cs)        -> fmap ('\'':)   $ mobyToIPA cs
  (',' :cs)        -> fmap (',':)    $ mobyToIPA cs
  ('_' :cs)        -> fmap ('#':)    $ mobyToIPA cs
  ('.' :cs)        -> fmap ('.':)    $ mobyToIPA cs
  (' ' :cs)        -> fmap ('.':)    $ mobyToIPA cs
  ('[':'@':']':cs) -> fmap ('ɝ':)    $ mobyToIPA cs
  ('(':'@':')':cs) -> fmap ("ɛə" ++) $ mobyToIPA cs
  ('t':'S':cs)     -> fmap ("tʃ" ++) $ mobyToIPA cs
  ('@':'r':cs)     -> fmap ("ɜr" ++) $ mobyToIPA cs
  ('a':'I':cs)     -> fmap ("aɪ" ++) $ mobyToIPA cs
  ('A':'r':cs)     -> fmap ("ɑr" ++) $ mobyToIPA cs
  ('A':'U':cs)     -> fmap ("aʊ" ++) $ mobyToIPA cs
  ('d':'Z':cs)     -> fmap ("dʒ" ++) $ mobyToIPA cs
  ('e':'I':cs)     -> fmap ("eɪ" ++) $ mobyToIPA cs
  ('h':'w':cs)     -> fmap ("hw" ++) $ mobyToIPA cs
  ('O':'i':cs)     -> fmap ("ɔɪ" ++) $ mobyToIPA cs
  ('o':'U':cs)     -> fmap ("oʊ" ++) $ mobyToIPA cs
  ('o':'u':cs)     -> fmap ("aʊ" ++) $ mobyToIPA cs
  ('e':'R':cs)     -> fmap ("ɛʁ" ++) $ mobyToIPA cs
  ('0':'0':cs)     -> fmap ("u" ++)  $ mobyToIPA cs
  ('æ':cs)         -> fmap ("æ" ++)  $ mobyToIPA cs
  ('&':cs)         -> fmap ("æ" ++)  $ mobyToIPA cs
  ('a':cs)         -> fmap ("æ" ++)  $ mobyToIPA cs
  ('-':cs)         -> fmap ("ə" ++)  $ mobyToIPA cs
  ('@':cs)         -> fmap ("ʌ" ++)  $ mobyToIPA cs
  ('A':cs)         -> fmap ("ɑː" ++) $ mobyToIPA cs
  ('b':cs)         -> fmap ("b" ++)  $ mobyToIPA cs
  ('c':cs)         -> fmap ("k" ++)  $ mobyToIPA cs
  ('d':cs)         -> fmap ("d" ++)  $ mobyToIPA cs
  ('D':cs)         -> fmap ("ð" ++)  $ mobyToIPA cs
  ('E':cs)         -> fmap ("ɛ" ++)  $ mobyToIPA cs
  ('e':cs)         -> fmap ("ɛ" ++)  $ mobyToIPA cs
  ('f':cs)         -> fmap ("f" ++)  $ mobyToIPA cs
  ('g':cs)         -> fmap ("ɡ" ++)  $ mobyToIPA cs
  ('h':cs)         -> fmap ("h" ++)  $ mobyToIPA cs
  ('i':cs)         -> fmap ("iː" ++) $ mobyToIPA cs
  ('I':cs)         -> fmap ("ɪ" ++)  $ mobyToIPA cs
  ('j':cs)         -> fmap ("j" ++)  $ mobyToIPA cs
  ('k':cs)         -> fmap ("k" ++)  $ mobyToIPA cs
  ('l':cs)         -> fmap ("l" ++)  $ mobyToIPA cs
  ('m':cs)         -> fmap ("m" ++)  $ mobyToIPA cs
  ('n':cs)         -> fmap ("n" ++)  $ mobyToIPA cs
  ('N':cs)         -> fmap ("ŋ" ++)  $ mobyToIPA cs
  ('O':cs)         -> fmap ("ɔː" ++) $ mobyToIPA cs
  ('o':cs)         -> fmap ("ɑ" ++)  $ mobyToIPA cs
  ('p':cs)         -> fmap ("p" ++)  $ mobyToIPA cs
  ('r':cs)         -> fmap ("r" ++)  $ mobyToIPA cs
  ('s':cs)         -> fmap ("s" ++)  $ mobyToIPA cs
  ('S':cs)         -> fmap ("ʃ" ++)  $ mobyToIPA cs
  ('t':cs)         -> fmap ("t" ++)  $ mobyToIPA cs
  ('T':cs)         -> fmap ("θ" ++)  $ mobyToIPA cs
  ('u':cs)         -> fmap ("uː" ++) $ mobyToIPA cs
  ('U':cs)         -> fmap ("ʊ" ++)  $ mobyToIPA cs
  ('v':cs)         -> fmap ("v" ++)  $ mobyToIPA cs
  ('V':cs)         -> fmap ("ʋ" ++)  $ mobyToIPA cs
  ('w':cs)         -> fmap ("w" ++)  $ mobyToIPA cs
  ('z':cs)         -> fmap ("z" ++)  $ mobyToIPA cs
  ('Z':cs)         -> fmap ("ʒ" ++)  $ mobyToIPA cs
  ('R':cs)         -> fmap ("ɹ" ++)  $ mobyToIPA cs
  ('y':cs)         -> fmap ("ɜː" ++) $ mobyToIPA cs
  ('x':cs)         -> fmap ("x" ++)  $ mobyToIPA cs
  ('W':cs)         -> fmap ("w" ++)  $ mobyToIPA cs
  ('Y':cs)         -> fmap ("uː" ++) $ mobyToIPA cs
  ('3':cs)         -> fmap ("ɝ" ++)  $ mobyToIPA cs
  _ -> do
    tell ["Unidentified phoneme: " ++ str]
    return str

Finally, the function that reads the pronunciation file.

readMobyPron :: FilePath -> IO [((String, Maybe PartOfSpeech), Pronunciation)]
readMobyPron path = do
  let
    fixSpaces :: Char -> Char
    fixSpaces '_' = ' '
    fixSpaces c   = c

    makeWord !word code !pron = if code == ""
      then return $ return $
        ( ( stripSpaces $ map fixSpaces word, Nothing )
        , Pron pron
        )
      else case mobyPronPOSCode code of
        Left err -> do
          tell [err]
          return []
        Right pos -> return $ return $
          ( ( stripSpaces $ map fixSpaces word, Just pos )
          , Pron pron
          )

    parseRecord str = do
      let
        (stem,pron) = break (== ' ') str
        (word,pos)  = break (== '/') stem
      ipa <- mobyToIPA (tail pron)
      makeWord word (delete '/' pos) ipa

  raw <- fmap lines $ readFile path
  let (dict,err) = runWriter $ fmap concat $ sequence $ map parseRecord raw
  case err of
    [] -> return ()
    _  -> do
      putStrLn $ "parse errors in " ++ path
      sequence_ $ map putStrLn err
      exitFailure
  return dict

# Merging

Now that we can parse the Moby data to Haskell types, the big remaining question is how to merge all this data together. After trying a lot of dead ends I’ve settled on having a single, generic merge function that takes a list of Words and a list of “key-value” pairs, and attempts to update the Words with the given data, create new Words if a given “key” does not exist, or report an error message if it can’t figure out what to do. This approach has the benefit that it does not depend on the source of our data, so eventually we can use it to merge in information from other sources.

A caveat about the following function: we must have that if order x (keyOf y) == EQ, then match x y == True. mergeIO will error out if it ever detects that this constraint does not hold.

mergeIO
  :: FilePath                  -- Error log
  -> (Word -> key)             -- Extract the key of a word
  -> (key -> String)           -- Render a key
  -> (key -> key -> Ordering)  -- Compare two keys
  -> (key -> Word -> Bool)     -- Check whether a key matches a Word
  -> (Word -> (key,val) -> Writer [String] Word)
                               -- Update a Word with a given key-value pair
  -> (key -> val -> Word)      -- Create a new word from a key-value pair
  -> [(key,val)]               -- Update data
  -> [Word]                    -- Dictionary
  -> IO [Word]                 -- Updated dictionary
mergeIO errF keyOf print order match fiddle create keyval dict =
  merge
    (sortBy (\x y -> order (fst x) (fst y)  ) keyval)
    (sortBy (\x y -> order (keyOf x) (keyOf y)) dict)
  where
    merge [] bs = return bs
    merge as [] = return $ map (uncurry create) as
    merge us@((k,v):as) (b:bs) =
      case span (\(t,_) -> match t b) us of
        ([],vs) -> case order k (keyOf b) of
          LT -> do
            let w = create k v
            ws <- merge as (b:bs)
            return (w:ws)
          GT -> do
            ws <- merge ((k,v):as) bs
            return (b:ws)
          EQ -> error $ intercalate " "
            [ "shouldn't happen!"
            , print k
            , print (keyOf b)
            ]
        (ks,vs) -> do
          let (xs,ys) = span (\c -> match k c) bs
          let (zs,errs) = runWriter $ sequence $ map (\m -> foldM fiddle m ks) (b:xs)
          appendFile errF $ unlines errs
          ws <- merge vs ys
          return (zs ++ ws)
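The core idea, stripped of the Writer logging and the run handling, is an ordinary merge of two key-sorted lists. Here is a simplified standalone stand-in (mergeWith is my name, not the post’s; it assumes one value per key):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sort both lists by key, then walk them in tandem: matched keys
-- update an existing entry, unmatched keys create a new one.
mergeWith
  :: (Ord k)
  => (v -> a -> a)   -- update an existing entry
  -> (k -> v -> a)   -- create a new entry
  -> (a -> k)        -- extract an entry's key
  -> [(k,v)] -> [a] -> [a]
mergeWith upd new keyOf kvs xs =
  go (sortBy (comparing fst) kvs) (sortBy (comparing keyOf) xs)
  where
    go [] bs = bs
    go as [] = map (uncurry new) as
    go ((k,v):as) (b:bs) = case compare k (keyOf b) of
      LT -> new k v : go as (b:bs)
      GT -> b : go ((k,v):as) bs
      EQ -> upd v b : go as bs

main :: IO ()
main = print $
  mergeWith (\v (k,_) -> (k,v)) (,) fst [("b",2),("d",4)] [("a",0),("b",0)]
-- [("a",0),("b",2),("d",4)]
```

Because both lists are sorted by the same ordering, each entry is visited once, so the merge is linear after sorting.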

By feeding mergeIO appropriate parameters, we recover specific “merge” functions for different kinds of data. For instance, mergeStem simply adds new words without any extra data.

mergeStem
  :: FilePath -> [(String, ())] -> [Word] -> IO [Word]
mergeStem errF = mergeIO errF spelling id compare match fiddle create
  where
    match :: String -> Word -> Bool
    match stem w = stem == spelling w

    fiddle w _ = return w

    create :: String -> () -> Word
    create stem _ = word stem

Next a helper to add a part of speech to a Word:

addPOS :: Word -> (String, PartOfSpeech) -> Writer [String] Word
addPOS w (stem,pos) = if stem /= spelling w
  then return w
  else case part_of_speech w of
    Nothing ->
      return $ w { part_of_speech = Just pos }
    Just q -> do
      if q == pos
        then return ()
        else tell $ return $ intercalate "\t"
          [ "Part of Speech"
          , "stem: \"" ++ stem ++ "\""
          , "new: \"" ++ writeDictFormat pos ++ "\""
          , "old: \"" ++ writeDictFormat q ++ "\""
          ]
      return w

And with it, mergePOS can add a list of parts of speech into a dictionary.

mergePOS
  :: FilePath -> [(String, PartOfSpeech)] -> [Word] -> IO [Word]
mergePOS errF = mergeIO errF spelling id compare match addPOS create
  where
    match :: String -> Word -> Bool
    match stem w = stem == spelling w

    create :: String -> PartOfSpeech -> Word
    create stem pos = Word
      { spelling       = stem
      , part_of_speech = Just pos
      , pronunciation  = Nothing
      , hyphenation    = Nothing
      }

And a helper to add a hyphenation to a Word:

addHyph :: Word -> (String, Hyphenation) -> Writer [String] Word
addHyph w (stem,syl) = if stem /= spelling w
  then return w
  else case hyphenation w of
    Nothing ->
      return $ w { hyphenation = Just syl }
    Just q -> do
      if q == syl
        then return ()
        else tell $ return $ intercalate "\t"
          [ "Hyphenation"
          , "stem: \"" ++ stem ++ "\""
          , "new: \"" ++ writeDictFormat syl ++ "\""
          , "old: \"" ++ writeDictFormat q ++ "\""
          ]
      return w

And then mergeHyph can add a list of hyphenations into a dictionary.

mergeHyph
  :: FilePath -> [(String, Hyphenation)] -> [Word] -> IO [Word]
mergeHyph errF = mergeIO errF spelling id compare match addHyph create
  where
    match :: String -> Word -> Bool
    match stem w = stem == spelling w

    create :: String -> Hyphenation -> Word
    create stem syl = Word
      { spelling       = stem
      , part_of_speech = Nothing
      , pronunciation  = Nothing
      , hyphenation    = Just syl
      }

Finally, pronunciations.

mergePron
  :: FilePath
  -> [((String, Maybe PartOfSpeech), Pronunciation)]
  -> [Word]
  -> IO [Word]
mergePron errF = mergeIO errF getKey print compare match fiddle create
  where
    getKey w = (spelling w, part_of_speech w)

    print (w, Nothing) = w
    print (w, Just q)  = w ++ " (" ++ writeDictFormat q ++ ")"

    match (stem, Nothing)  w = stem == spelling w
    match (stem, Just pos) w = and
      [ stem == spelling w
      , or
        [ Just pos == part_of_speech w
        , Nothing  == part_of_speech w
        ]
      ]

    fiddle w ((stem,pos),pron) = case pronunciation w of
      Nothing -> return $ w { pronunciation = Just pron }
      Just u -> do
        if u == pron
          then return w
          else do
            tell $ return $ intercalate "\t"
              [ "Pronunciation"
              , "stem: \"" ++ stem ++ "\""
              , "pos: \"" ++ show pos ++ "\""
              , "new: \"" ++ writeDictFormat pron ++ "\""
              , "old: \"" ++ writeDictFormat u ++ "\""
              ]
            return w

    create (stem, pos) pron = Word
      { spelling       = stem
      , part_of_speech = pos
      , pronunciation  = Just pron
      , hyphenation    = Nothing
      }

Putting it together, parseMoby merges all of the Moby data, recording errors to the file error-moby.txt.

parseMoby :: IO [Word]
parseMoby = do
  let errFile = "/home/nathan/code/wordlist/error-moby.txt"
  writeFile errFile ""

  putStrLn "parsing database"
  hyph <- readMobyHyph "/home/nathan/code/wordlist/moby/mhyph.txt"
  putStrLn "..mhyph"
  pos  <- readMobyPOS  "/home/nathan/code/wordlist/moby/mpos.txt"
  putStrLn "..mpos"
  pron <- readMobyPron "/home/nathan/code/wordlist/moby/mpron.txt"
  putStrLn "..mpron"
  wrds <- readWords    "/home/nathan/code/wordlist/moby/mwords.txt"
  putStrLn "..mwords"

  putStrLn "merging"
  return []
    >>= mergePOS  errFile pos
    >>= mergeHyph errFile hyph
    >>= mergePron errFile pron
    >>= mergeStem errFile wrds

To test what we have so far, mainMoby writes the consolidated Moby dictionary to out-moby.txt.

mainMoby :: IO ()
mainMoby = do
  let outFile = "/home/nathan/code/wordlist/out-moby.txt"
  dict <- parseMoby
  writeFile outFile $ unlines $ map writeDictFormat dict
  putStrLn $ "Words: " ++ show (length dict)
  exitSuccess

After fixing some small errors in the data by hand (reported by our parsers), running mainMoby produces two files: out-moby.txt, whose lines look like this:

epiphanize	v-t	ɪ'pɪfʌ,naɪz	e•piph•a•nize
epiphanized	v-t		e•piph•a•nized
epiphanizing	v-t		e•piph•a•niz•ing
epiphanous
epiphany	n-s	ɛ'pɪfʌniː	e•piph•a•ny
epipharyngeal			

And error-moby.txt, whose lines look like this:

Hyphenation	stem: "zedoary"	new: "zed•o•a•ry"	old: "zed•o•ar•y"
Hyphenation	stem: "zenith"	new: "ze•nith"	old: "zen•ith"
Hyphenation	stem: "zincate"	new: "zin•cate"	old: "zinc•ate"
Pronunciation	stem: "Elie"	pos: "Nothing"	new: "eɪ'liː"	old: "'ɛliː"
Pronunciation	stem: "ay"	pos: "Just Adverb"	new: "aɪ"	old: "'eɪ"

Woo! out-moby.txt is a tab-delimited list of words with part of speech, pronunciation, and hyphenation points. It has a ton of missing entries, which we can try to fill in later. But at 526462 lines it’s a decent start.

# Wiktionary

Next we’ll use data from Wiktionary to fill as many gaps as we can. Wiktionary is a massive, collaboratively edited database with the goal of collecting every word in every language, and for the popular words in popular languages, it gets pretty close.

I downloaded the June 20, 2017 database dump of the English Wiktionary, which (uncompressed) is a 5.1 GB XML file. But don’t let the file format fool you. :) Although the dump is a big XML file with a node for each word, the data within each entry is not really structured beyond being marked up with wiki tags and mostly standardized section headings.

I made a shell pipeline using mostly sed that attempts to cleanly extract as much as possible. It throws out all definitions, since we don’t care about those, and without definitions things like synonyms and antonyms don’t make sense. We also throw out any information about languages other than English. What’s left is a bunch of words, sometimes with pronunciations in one or more dialects, and sometimes with one or more parts of speech attached. It turns out that a large number of entries either are or can “easily” be put into the form

word (pronunciation)? (noun)? (verb)? (adjective)? (adverb)?

I use sed to gradually get the entries in this form if possible, and awk to redirect such entries to a handful of files. Surprisingly (to me) this recovered over 99% of the entries in Wiktionary. This script is kind of a mess, so I’ll just refer the interested reader to tools/wiki-extract.txt in the repo.

Anyway, running that script results in the following files:

1. words-only.txt: These are entries with a stem only, no pronunciation or part of speech.
2. pos-only.txt: These are entries with at least one part of speech but no pronunciation.
3. pronunciations-plus.txt: These are entries with a pronunciation and zero or more parts of speech.
4. diff-pron.txt: These are entries the script recognizes as having different pronunciations depending on the part of speech. I’ll leave them alone for now.
5. ipa-letters.txt: These are initialisms; I’ll leave them alone for now.
6. wiki-words.txt: These are the entries the script didn’t recognize – the leftovers.

A lot of the machinery for dealing with the Moby data can be reused; we just need to parse the cleaned up wiki data.

wikiPOSCode :: String -> Either String PartOfSpeech
wikiPOSCode c = case c of
  "noun"      -> Right Noun
  "verb"      -> Right Verb
  "adjective" -> Right Adjective
  "adverb"    -> Right Adverb
  _           -> Left $ "Unknown part of speech: " ++ c

Parsing the actual file:

readWikiPOS :: FilePath -> IO [(String, PartOfSpeech)]
readWikiPOS path = do
  let
    makeWord :: String -> String -> Writer [String] [(String, PartOfSpeech)]
    makeWord !word code = case wikiPOSCode code of
      Left msg -> do
        tell [msg]
        return []
      Right pos -> do
        return [( stripSpaces word, pos )]

    parseRecord :: String -> Writer [String] [(String, PartOfSpeech)]
    parseRecord !str = do
      let (stem,pos) = break (== '\t') str
      fmap concat $ sequence $ map (makeWord $ stripSpaces stem) $ words pos

  raw <- fmap lines $ readFile path
  let (dict,err) = runWriter $ fmap concat $ sequence $ map parseRecord raw
  case err of
    [] -> return ()
    _  -> do
      putStrLn $ "parse errors in " ++ path
      sequence_ $ map putStrLn err
      exitFailure
  return dict

Pronunciations:

readWikiPron :: FilePath -> IO [((String, Maybe PartOfSpeech), Pronunciation)]
readWikiPron path = do
  let
    makeWord !word pron =
      return [( (stripSpaces word, Nothing), Pron pron )]

    makeWordPOS !word pron code = case wikiPOSCode code of
      Left msg -> do
        tell [msg]
        return []
      Right pos -> do
        return [( (stripSpaces word, Just pos), Pron pron )]

    parseRecord !str = do
      let (stem,rest) = break (== '\t') str
      let (pron,pos)  = break (== '\t') $ tail rest
      case words (filter (/= '\t') pos) of
        [] -> makeWord stem pron
        ps -> fmap concat $ sequence
          $ map (makeWordPOS (stripSpaces stem) (filter (/= '/') pron)) ps

  raw <- fmap lines $ readFile path
  let (dict,err) = runWriter $ fmap concat $ sequence $ map parseRecord raw
  case err of
    [] -> return ()
    _  -> do
      putStrLn $ "parse errors in " ++ path
      sequence_ $ map putStrLn err
      exitFailure
  return dict

And now parseWiki attempts to merge the wiki data with a given dictionary.

parseWiki :: [Word] -> IO [Word]
parseWiki dict = do
  let errFile = "/home/nathan/code/wordlist/error-wiki.txt"
  writeFile errFile ""

  putStrLn "parsing database"
  wrds <- readWords    "/home/nathan/code/wordlist/wiki/words-only.txt"
  putStrLn "..words-only"
  pos  <- readWikiPOS  "/home/nathan/code/wordlist/wiki/pos-only.txt"
  putStrLn "..pos-only"
  pron <- readWikiPron "/home/nathan/code/wordlist/wiki/pronunciation-plus.txt"
  putStrLn "..pronunciation-plus"

  putStrLn "merging"
  return dict
    >>= mergePOS  errFile pos
    >>= mergePron errFile pron
    >>= mergeStem errFile wrds

Putting it all together: main parses and merges the Moby and Wiki data, writing the result to dict.txt and any errors to error-moby.txt and error-wiki.txt.

main :: IO ()
main = do
  let outFile = "/home/nathan/code/wordlist/dict.txt"
  dict <- parseMoby >>= parseWiki
  writeFile outFile $ unlines $ map writeDictFormat dict
  putStrLn $ "Words: " ++ show (length dict)
  exitSuccess

Running main yields a dictionary of 1052625 words. I think this will do. Stay tuned for a write up of the project this is all prologue for. :)