Introducing Word Magic

You might have noticed that we have a number of authors at Relevance. Some of us studied literature and linguistics. We are a wordy bunch. We're also kind of obsessed with time, so it shouldn't be a surprise that we would be interested in how language evolves over time. Lucky for us, then, that Google has digitized every book they can get their scanners on and made that data available to the world.

English is Interesting

English is a fascinating language. I'm not just saying that because it's my native tongue. (I actually think C is my first language.) It's a mongrel language. Start with a Germanic grammar. Then mix in vocabularies as each wave of conquerors brings a new language and forms a pidgin with the conquered. The ruler changes and is changed, every time. You can read the ebb and flow of empires in every sentence.

We decided to build Word Magic as a fun way to explore the language that a billion of us speak.

Word Magic screenshot

For the Casual Reader: Insight From Data

Word Magic shows us several facets of each word or phrase. In this post, I want to talk about word usage over time. We can see words rise and fall like empires, and with them we see what is on the collective mind of the culture. For instance, the word atomic barely registered until the 18th century, when the first modern chemists theorized about particles that wouldn't be seen for nearly 300 years. But it didn't really hit the mainstream until the 1950s, when we got atomic everything: atomic bombs, atomic shelters, atomic cocktails, you name it.

Words are interesting, but phrases are really fascinating. For that we need "n-grams".

"N-gram" is a generic term for a string of n words clipped from a sequence. 1-grams are just words. 2-grams are two word snippets. 3-grams are three word snippets, and so on. (Linguistic researchers absolutely love n-grams. They're handy for plenty of algorithmic analyses.)

For example, take Shakespeare's line "think but this, and all is mended." It gives us seven 1-grams:

  • think
  • but
  • this
  • and
  • all
  • is
  • mended

We can get six 2-grams from it:

  • think but
  • but this
  • this and
  • and all
  • all is
  • is mended

You can see how it's like sliding a two-word window along the length of the text.

We'd get five 3-grams:

  • think but this
  • but this and
  • this and all
  • and all is
  • all is mended
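
That sliding window translates almost directly into code. Here's a minimal Ruby sketch (an illustration, not the actual Word Magic code) that produces the n-grams for any n:

    # Slide an n-word window along the text and collect the resulting n-grams.
    def ngrams(text, n)
      words = text.downcase.gsub(/[^a-z\s']/, '').split
      words.each_cons(n).map { |gram| gram.join(' ') }
    end

    ngrams("think but this, and all is mended", 2)
    # => ["think but", "but this", "this and", "and all", "all is", "is mended"]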

In Word Magic we can get information about phrases up to five words long (5-grams), letting us see how whole phrases rise and fall.

For example, "business value" doesn't appear in printed English literature until the 1850s. After that, it potters along until the 1990s, when it really takes off. Once Google updates their n-gram dataset for the 2010s, I'm sure we'll see "business value" get completely saturated, then disappear. (In exactly the same way that "scientific management" peaked in the 1910s and has only been used as either a straw man or a bogeyman since then.)

For the Technologist: A Peek Into the Plumbing

The user interface reveals some of the under-the-hood details, but not everything. Let's take a look at how the app works and how the data gets there.

We'll start with the Shakespeare module (which has zero hits for "under the hood" in the Bard's complete works). When you search for a word or phrase, the results page doesn't contain the actual data about that phrase. Instead, the page uses JavaScript to call an API on the application server. You can hit the API yourself; it just returns a small bit of JSON with the number of occurrences.
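
Here's roughly what that round trip looks like from Ruby. The host, path, and field name below are placeholders, not the real routes:

    require 'json'
    require 'net/http'

    # Hypothetical endpoint and response shape; the real route and field
    # names may differ.
    uri = URI('http://wordmagic.example.com/api/shakespeare?phrase=under+the+hood')
    response = JSON.parse(Net::HTTP.get(uri))
    puts response['occurrences']   # => 0 (no hits for "under the hood")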

When the API call returns, the JavaScript function renders the woodcut image of old Will and some text with results. (This actually comes from a CoffeeScript template that we compile down to JavaScript and render entirely on the client.) We delay this display just in case the API is malfunctioning or unavailable. If the API doesn't respond, we don't show the module. This makes the user interface degrade nicely under partial failure.

We built the API with Rails, in a load-balanced, autoscaling farm of application servers. API calls route to a controller that queries HBase for the n-gram in question.
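
In outline, that controller is little more than a keyed lookup. Here's a sketch of what such an action could look like, assuming a hypothetical NgramStore wrapper around whatever HBase client the app uses (the wrapper and its method name are ours, not the real code):

    # Sketch of the controller action. NgramStore is a hypothetical wrapper
    # around the HBase client, not the app's real class.
    class NgramsController < ApplicationController
      # GET /api/shakespeare?phrase=...
      def show
        phrase = params[:phrase].to_s.downcase.strip
        count  = NgramStore.count_for(phrase)   # row lookup keyed by the n-gram
        render json: { phrase: phrase, occurrences: count.to_i }
      end
    end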

We load an HBase table with every 1-, 2-, 3-, 4-, and 5-gram from Shakespeare's complete works. (Thanks to Project Gutenberg for the text.) For this job, we've got a small pipeline of programs that construct the n-grams, starting with a map/reduce job built in Ruby. Then we build the HBase table and populate it with a couple more small Ruby scripts.
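
We won't reproduce the real scripts here, but a streaming-style mapper in Ruby gives the flavor: read the play text on standard input, emit every 1- through 5-gram with a count of one, and let a reducer sum them up. (A sketch of the shape, not the production job.)

    #!/usr/bin/env ruby
    # Mapper sketch: emit "<n-gram>\t1" for every 1- through 5-gram on each
    # line of input; a reducer then sums the counts per n-gram.
    STDIN.each_line do |line|
      words = line.downcase.gsub(/[^a-z\s']/, '').split
      (1..5).each do |n|
        words.each_cons(n) { |gram| puts "#{gram.join(' ')}\t1" }
      end
    end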

The whole pipeline is shown below.

Shakespeare n-gram data flow

Though Shakespeare contributed immeasurably to English, his total lifetime output is only about five megabytes of text. Publishers today put out hundreds of times that every day. They've got him beat in quantity, if not in quality.

The Google n-grams dataset has 1- through 5-grams from every book that Google has scanned. Their collection reaches back to the 1500s. It's a lot of data. Despite the difference in size, we process the n-gram data in much the same way: Map/Reduce (this time in Java) into HBase, then access through an API. We'll follow this pipeline from the source data through to the UI.

Google n-grams data flow

The Map/Reduce jobs run on a Hadoop cluster on Amazon EC2. We pull all the files directly from Google, since incoming data transfer to AWS is free.

What are we mapping and reducing? Half a trillion words and phrases from Google. That's a lot of data. Ordinary queries against this whole data set would take many minutes to complete. Nobody is going to watch a TV show while they wait for a web page to load! We can't afford extra time while a user is waiting for a page, but we can afford to spend extra time up front, when we load the data. Like a TV cooking show, we've baked the data ahead of time, to match the structure the query needs.

Since the UI wants to display relative usage frequency by decade, we use the Mapper to re-key each input row by n-gram and decade, emitting the occurrence count under that key. The input rows have publication years spread all across each decade, so the Mapper will emit many records with the same key. That's no problem--it's exactly what the Reducer tasks are for.
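
Sketched in Ruby for brevity (the real job is written in Java, and the exact column layout of Google's files may differ), the Mapper's logic amounts to this:

    #!/usr/bin/env ruby
    # Mapper logic sketch. Assume each input row looks roughly like
    # "<ngram>\t<year>\t<match_count>\t..."; re-key it by n-gram and decade.
    STDIN.each_line do |line|
      ngram, year, match_count = line.chomp.split("\t")
      decade = (year.to_i / 10) * 10
      puts "#{ngram}|#{decade}\t#{match_count}"
    end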

The Reducer totals up occurrences by n-gram in a decade and writes the result directly into HBase.
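
Again in Ruby rather than the actual Java, the Reducer's logic is just a sum per key. (The real Reducer writes each total straight into HBase; this sketch only prints them.)

    #!/usr/bin/env ruby
    # Reducer logic sketch: sum the mapper's counts for each
    # "<ngram>|<decade>" key, then print the totals. The real job stores
    # each total in HBase instead of printing it.
    totals = Hash.new(0)
    STDIN.each_line do |line|
      key, count = line.chomp.split("\t")
      totals[key] += count.to_i
    end
    totals.each { |key, total| puts "#{key}\t#{total}" }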

From there, the rest of the story is very similar to Shakespeare. A bit of JavaScript calls an API for the n-gram's frequencies across decades, then renders the results using D3 for visualization. This particular API happens to be written in Clojure. (Gasp! Ruby and Clojure in the same project!)

What's Next?

We have ideas for more visualizations and more analysis. For example, Shakespeare used contractions to make his lines scan better. Good for actors, bad for programs trying to match phrases.

We'd love to hear your ideas, too! Send your suggestions to Michael Nygard (mtnygard@thinkrelevance.com).

This is an example of what we do. To find out more, contact us at sales@thinkrelevance.com.