And that dream was to create the largest, completely useless, corpus of information about the use of words in the English language… EVER!
To put it another way, what I am interested in is creating a link graph, or a web of connections, between words in the English language. But not only what words can be used with other words, but rather what words ARE used with other words.
Now, I can’t answer questions like “dead” cats are more common than dogs, nor why dogs are more commonly considered “faithful”, but I can be struck by some of these fun facts as I try to dig deeper into the mysteries of our language…
At the very least, maybe I can make a desk calendar out of it.
Step 1: Figuring Out The Lingo
For a computer program, figuring out the English language is tricky. This starts with even simple things like the question of “How do I break this paragraph into sentences?” A naive implementation might say that a sentence is anything that ends in a period, an exclamation point, or a question mark . But English isn’t that simple: we use periods for abbreviations, and question marks and exclamation points can be used in some type of technical texts to mean other things.
But that’s only the beginning of the challenge. The real fun comes when you want to figure out what the structure of a sentence is. For example, in the sentence “It was a dark and stormy night.” we know that “it” is a third-person neutral pronoun, “was” is the past-tense form of the verb “to be”, “a” is an article, “dark” and “stormy” are adjectives, “and” is a conjunction, and finally, “night” is a noun. We know this (I hope!) because we can read and parse English without even thinking about it. But for a computer algorithm to do the same is very complex.
This is where NLTK, the Natural Language Toolkit comes in. This is a library (available for Python and a number of other languages) that has many features. Two of the features that I needed for this project were the ability to split texts into sentences, while doing intelligent things with punctuation, and guessing the parts of speech of words in a sentence. I’m saying “guess” because NLTK doesn’t speak English either. It uses complex probability systems and pre-tagged texts to “learn” what a sentence is and what different types of words are in roughly the same way that a spam filter “learns” what email you want and what email you don’t want. And since you know how well your spam filter works, you should be able to guess how well NLTK works. But with these tools, and a little of my own engineering, I was able to build a system that took a novel or other work, divide it into sentences, and then tag those sentences with parts of speech.
From there, I went off on my own and added heuristics to figure out even more information about a sentence. It wasn’t enough to know that “dark” was an adjective and that “night” was a noun, I wanted to know that “dark” referred to “night”. I did this by adding many of my own rules, for example by saying that adjectives and adverbs can be chained (separated by commas or conjunctions), but that while adverbs can be either before or after a verb (“The boy ran quickly.” is roughly the same as “The boy quickly ran.”), adjectives are always before their nouns.
From there, a quick tag job of “Alice in Wonderland” was easy! But a small book doesn’t represent the whole of the English language and in order to come to real conclusions I needed to think bigger.
Step 2: Growing the Data
Once I had a program that could generate a map of relationships between words in “Alice in Wonderland”, I needed to run that on other books. And, to make the data as accurate as possible, I needed to run it on a LOT of other books. And where can you get the most free books online? (All out of copyright, of course!) Project Gutenberg!
Project Gutenberg is an online repository of thousands of out-of-copyright books (and other things) which anyone can access online. Why pay for a copy of Sherlock Holmes when you can just download one? For my purposes, the fact that most of the books are 50+ years old was just fine.
But as I was running through this block of text, I came across a terrible realization: NLTK was slow and I wasn’t smart enough to optimize it. It was taking around 8 minutes for every megabyte of text I could parse. To get even a portion of the way through Project Gutenberg (let alone other types of more recent texts such as wikipedia), it would take more than 40 days. If I could optimize the NLTK tagger, it might go faster, but since I could not what was I left to do?
Go massively parallel.
This problem is what you might call “embarrassingly parallel.” The parsing of one sentence doesn’t really affect any other sentence and you can do them in any order. But what you DO need to do is, at the end, consolidate all of these associations into a weighted graph or tree. Fortunately, that’s mostly an arithmetic problem. This type of problem maps very well into cluster programming: you can split up all of the text processing into one phase, generating a list of words and their contexts, and then have a second phase where all of the usages of every word can be tabulated together. (You also need a zeroth step where you split the data into chunkable sentences.)
Through the use of Hadoop, an open source clustering system, plus StarCluster, an open source cluster management solution, I was able to launch (at one point) 85 servers to simultaneously cycle through this repository of knowledge. By making this process parallel, it could reduce the time it takes to do the work by almost 1/85th. (It scales almost linearly.)
Of course, Amazon resources cost money and this project is still not cost-effective. The normal cost of an 85 nodes of “medium” strength on Amazon is around $30 per hour. Because I was using spot instances, I could get the cost down quite a bit lower — but this type and volume of data just takes a while. Because the time scales linearly, it would be just as expensive to run it on half as many nodes for twice the amount of time, so I felt good with launching this at scale.
But there were challenges…
In the end, I was never able to run the full Project Gutenberg corpus. With more time and more money, I could have done it, but I exceeded what I was willing to spend on the class. What stopped me?
I call them “Sentences of Doom”.
Somewhere in the corpus of Project Gutenberg that I was running – and I had reduced it to around 6% of the text – there was a sentence or sentences that was causing my program to explode in a particularly nasty way without warning and without leaving anything in the log. This was absolutely reproducible, but even 6% of the text would have taken around 3 days on a single machine to track down and I didn’t have the time to do it. I still want to know what that sentence is and defeat the bug, but that will have to wait for another day.
But even without all of the data parsed, I was still able to run this on an exceptionally large data set and make conclusions (and graphs) about the data.
And what I have now…
At present, I have a data file that contains weighted associations for half of the Gutenberg corpus, around 600,000 words. Most of these words are even in English, though some crap data did work its way in places where English texts included non-English phrases or a text was incorrectly identified as English even if it was in some other language. My parser isn’t that smart to figure out that it is being fed words that aren’t English.
Over the next couple of days, I hope to put this up and searchable. The datafile is 185MB. Please email me if you would like access to it (but please respect my copyright).