You might have noticed that we have a number of authors at Relevance. Some of us studied literature and linguistics. We are a wordy bunch. We're also kind of obsessed with time, so it shouldn't be a surprise that we would be interested in how language evolves over time. Lucky for us, then, that Google has digitized every book they can get their scanners on and made that data available to the world.
English is Interesting
English is a fascinating language. I'm not just saying that because it's my native tongue. (I actually think C is my first language.) It's a mongrel language. Start with a Germanic grammar. Then mix in vocabularies as each wave of conquerors brings its own language and forms a pidgin with the conquered. The ruler changes and is changed, every time. You can read the ebb and flow of empires in every sentence.
We decided to build Word Magic as a fun way to explore the language that a billion of us speak.
For the Casual Reader: Insight From Data
Word Magic shows us several facets of each word or phrase. In this post, I want to talk about word usage over time. We can see words rise and fall like empires, and with them we see what is on the collective mind of the culture. For instance, the word atomic barely registered until the 18th century, when the first modern chemists theorized about particles that wouldn't be seen for nearly 300 years. But it didn't really hit the mainstream until the 1950s, when suddenly there was atomic everything: atomic bombs, atomic shelters, atomic cocktails, you name it.
Words are interesting, but phrases are really fascinating. For that we need "n-grams".
"N-gram" is a generic term for a string of n words clipped from a sequence. 1-grams are just words. 2-grams are two word snippets. 3-grams are three word snippets, and so on. (Linguistic researchers absolutely love n-grams. They're handy for plenty of algorithmic analyses.)
For example, take Shakespeare's line "think but this, and all is mended." It gives us seven 1-grams:
- think
- but
- this
- and
- all
- is
- mended
We can get six 2-grams from it:
- think but
- but this
- this and
- and all
- all is
- is mended
You can see how it's like sliding a two-word window along the length of the text.
We'd get five 3-grams:
- think but this
- but this and
- this and all
- and all is
- all is mended
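The sliding-window extraction described above is easy to sketch in a few lines of Ruby (this is an illustration, not the production code):

```ruby
# Slide an n-word window along a text and collect the n-grams.
def ngrams(text, n)
  # Lowercase, strip punctuation, and split into words.
  words = text.downcase.gsub(/[^a-z\s]/, "").split
  words.each_cons(n).map { |gram| gram.join(" ") }
end

line = "think but this, and all is mended."
ngrams(line, 1)  # the seven 1-grams
ngrams(line, 2)  # the six 2-grams, "think but" through "is mended"
```

`Enumerable#each_cons(n)` does exactly the window-sliding described above, yielding each run of n consecutive words.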
In Word Magic we can get information about phrases up to five words long (5-grams), letting us see how phrases rise and fall.
For example, "business value" doesn't appear in printed English until the 1850s. After that, it potters along until the 1990s, when it really takes off. Once Google updates their n-gram dataset for the 2010s, I'm sure we'll see "business value" get completely saturated, then disappear. (In exactly the same way that "scientific management" peaked in the 1910s and has been used only as a straw man or bogeyman ever since.)
For the Technologist: A Peek Into the Plumbing
The user interface reveals some of the under-the-hood details, but not everything. Let's take a look at how the app works and how the data gets there.
We built the API with Rails, in a load-balanced, autoscaling farm of application servers. API calls route to a controller that queries HBase for the n-gram in question.
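The lookup itself is simple: the phrase is the row key, and the columns hold per-decade counts. A minimal sketch of that query path, with a stand-in for the real HBase client (all class and table names here are illustrative assumptions, not our production code):

```ruby
# Look up a phrase's usage-by-decade row. The row key is the phrase
# itself; each column qualifier is a decade, each cell a count.
def usage_by_decade(hbase, phrase)
  row = hbase.get_row("ngrams", phrase.downcase.strip)
  return nil if row.nil?
  row.transform_keys(&:to_i)  # {"1850" => 2} becomes {1850 => 2}
end

# Stand-in for a real HBase client, for the purposes of this sketch.
class FakeHBase
  def initialize(rows)
    @rows = rows
  end

  def get_row(_table, key)
    @rows[key]
  end
end
```

Because the table is keyed by phrase, each API call is a single-row get rather than a scan, which is what keeps response times flat no matter how big the dataset grows.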
We load an HBase table with every 1-, 2-, 3-, 4-, and 5-gram from Shakespeare's complete works. (Thanks to the Gutenberg Project for the text.) For this job, we've got a small pipeline of programs that construct the n-grams: starting with a map/reduce job built in Ruby. Then we build the HBase table and populate it with another couple of small Ruby scripts.
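The counting step works like a classic word count, just over n-grams instead of single words. In Hadoop-streaming style, the mapper emits a tab-separated pair per n-gram and the reducer sums counts per key. A rough Ruby sketch (illustrative, not our actual scripts):

```ruby
# Mapper: emit "<n-gram>\t1" for every 1- through max_n-gram in a line.
def map_line(line, max_n = 5)
  words = line.downcase.gsub(/[^a-z'\s]/, "").split
  (1..max_n).flat_map do |n|
    words.each_cons(n).map { |gram| "#{gram.join(' ')}\t1" }
  end
end

# Reducer: total the counts for each n-gram key.
def reduce_pairs(pairs)
  pairs.each_with_object(Hash.new(0)) do |pair, totals|
    key, count = pair.split("\t")
    totals[key] += count.to_i
  end
end
```

Running `map_line("To be, or not to be")` emits "to be" twice, and the reducer folds those into a single count of 2.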
The whole pipeline is shown below.
Though Shakespeare contributed immeasurably to English, his total lifetime output is only about five megabytes of text. Publishers today put out hundreds of times that every day. They've got him beat in quantity, if not in quality.
The Google n-grams dataset has 1- through 5-grams from every book that Google has scanned. Their collection reaches back to the 1500s. It's a lot of data. Despite the difference in size, we process the n-grams data in a pretty similar way: Map/Reduce (this time in Java) into HBase, then accessed by an API. We'll follow this pipeline from the source data through to the UI.
The Map/Reduce jobs run on a Hadoop cluster in Amazon Web Services EC2. We pull all the files directly from Google since incoming data transfer to AWS is free.
What are we mapping and reducing? Half a trillion words and phrases from Google. That's a lot of data. Ordinary queries against this whole data set would take many minutes to complete. Nobody is going to watch a TV show while they wait for a web page to load! We can't afford extra time while a user is waiting for a page, but we can afford to spend extra time up front, when we load the data. Like a TV cooking show, we've baked the data ahead of time, to match the structure the query needs.
Since the UI wants to display relative usage frequency by decade, we use the Mapper to output the decade and number of occurrences of an n-gram in that decade. The input n-grams will have publication years spread all across the decade, so the Mapper will definitely create duplicates. That's no problem--it's exactly what the Reducer tasks are for.
The Reducer totals up occurrences by n-gram in a decade and writes the result directly into HBase.
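The real job is written in Java, but the decade rollup logic fits in a few lines of Ruby. This sketch assumes the dataset's tab-separated layout of phrase, year, and match count per line; the key format and helper names are our own illustration:

```ruby
# Mapper: turn one dataset record into a (phrase-and-decade, count) pair.
def map_record(line)
  phrase, year, count, = line.split("\t")
  decade = (year.to_i / 10) * 10
  ["#{phrase}|#{decade}", count.to_i]
end

# Reducer: total the counts for each phrase-and-decade key.
def reduce_counts(pairs)
  pairs.each_with_object(Hash.new(0)) do |(key, count), totals|
    totals[key] += count
  end
end
```

Records for 1951 and 1958 both map to the 1950 key, so the reducer collapses them into one per-decade total, which is exactly the shape the UI's chart wants.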
We have ideas for more visualizations and more analysis. For example, Shakespeare used contractions to make lines scan better. Good for actors, bad for programs.
We'd love to hear your ideas, too! Send your suggestions to Michael Nygard (firstname.lastname@example.org).
This is an example of what we do. To find out more, contact us at email@example.com.