An n-gram model is a technique for counting sequences of characters or words, which supports rich pattern discovery in text. These patterns are useful in spell-checking, text generation, and DNA sequencing, and more generally in natural language processing, statistical machine learning, artificial intelligence, and computational biology.
I'll focus on natural language; specifically just English. A little more formally, a natural language n-gram model is a mathematical representation of sequences of characters or words that assumes human language is both sequential and distributed. In other words, it tries to capture patterns of sequences (characters or words next to each other) while being sensitive to contextual relations (characters or words near each other).
Another, often unstated, assumption is that human language alphabets are finite and vocabularies are theoretically infinite... even though an upper bound is applied in practice. This assumption is important because it leads to a problem in natural language modeling in general. (Note, in this context assume a "model" is a static set of counts that is built on a development server and then deployed to a production server. Once deployed we can't change the model.) If we apply an upper bound to a vocabulary that is just the maximum number of distinct words/tokens in the data set, then we need a strategy for dealing with future words/tokens we have not seen before. That is, if we lock down our vocabulary to only the words we've seen before, we will have no strategy later for dealing with words we have not seen. I highlight the issue here because it's worth noting, though I won't go into it in any depth in this article.
Moving forward, assume that observed language unfolds as a series of acoustic-phonetic signals over some period of time. For digital text we carry over this assumption and treat text as a directional sequence of terminating characters plotted over some continuous period of time. (The direction most often assumed is [left->right, top->bottom]. Generally, for Arabic-script languages: [right->left, top->bottom]; for traditional vertical East Asian layouts: [top->bottom, right->left].)
n-gram refers to a number n of sequential items. A word 1-gram (uni-gram) for the sentence “this is true” would be [["this"],["is"],["true"]]; a 2-gram (bi-gram) would be [["this","is"], ["is","true"]].
Let’s assume for the rest of this article that we are only dealing with sequences of words at the sentence level, and limit n to 5 and under. That is, n represents the inclusive upper limit of the sequence [1,2,…,n] and gram refers to the grammar token. We can programmatically collect n-grams with a naive assumption of splitting characters by some delimiter. (Though, better practice is to use a good natural language tokenizer.)
I've not been able to find the original reference of the term gram, but common usage means a valid unit allowed by the grammar. Note that the term grammar used here is from a computational linguistics view: any allowable finite state grammar and/or any unit perceived by native speakers as allowable/understandable/intelligible (e.g., not what is dictated by social construct). The latter points are really only critical in terms of splitting the sequences (natural language tokenization; "is 'New York' one token or two?", "is 'run, ran' one token or two?") and in the data processing phase when it comes time to make decisions about what tokens to keep and how to normalize them. While I skip over these data engineering details, it is worth mentioning that what seem like minor details (how to split text, what counts as a "gram") are often not only dictated by the domain of discourse (are you parsing poetry, source code, proper names, movie reviews, ...) but take up a considerable amount of time and can impact the downstream modeling; these details are far from minor, but out of the scope of this article.
One last assumption that is not critical for this article but worth mentioning: the gram units will be treated as String type. Modern text processing distinguishes between various encodings of the byte values of digital characters. This can lead to non-trivial issues (especially if you have a pipeline that uses more than 1 programming language); I'll ignore these issues and we'll assume a naive use of Julia's String type.
model means we want to be able to represent collected n-grams mathematically in such a way that we can produce and/or use a computable program, formula, or algorithm.
Throughout this article I'll assume that "string" and "word" have been defined and use them intuitively (but it should be noted that these terms in practice are quite difficult to define and depending on how they are defined can have non-trivial impact on downstream modeling and tasks that rely on the model).
Construct two functions: ngram collects gram sequences of length n; and ngram_range collects all sub-sequences up to length n. Both functions assume the text has already been split into a vector/array of words.
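A minimal sketch of these two functions (the signatures and details here are my assumptions; the original post's code may differ):

```julia
# ngram: collect all length-n sub-sequences of a pre-split vector of words.
function ngram(words, n::Int)
    [words[i:i+n-1] for i in 1:length(words)-n+1]
end

# ngram_range: collect all sub-sequences of length 1 up to n.
function ngram_range(words, n::Int)
    reduce(vcat, [ngram(words, k) for k in 1:min(n, length(words))])
end

words = split("this is true")  # naive whitespace split
ngram(words, 2)        # [["this", "is"], ["is", "true"]]
ngram_range(words, 2)  # the three 1-grams followed by the two 2-grams
```

Note that ngram over a vector of m words yields m - n + 1 sub-sequences, so ngram_range grows quickly as n increases.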
The task of counting all the n-grams in relation to the entire data set still needs to be done. But first, before we count we need to know what it is we are counting and why.
First step to formalizing an n-gram language model
With some basic code to collect n-grams units we now need to count units in a mathematically significant way. We want a formal representation that captures the intuition we started with; namely
"... capture patterns of sequences (characters or words next to each other) while being sensitive to contextual relations (characters or words near each other)"
To begin, say that we want a measure for the likelihood that the first word w1 (the subscript number is the position at which it occurs in the sequence) is followed by the second word w2, which is followed by the third word w3 ... up to word wn.
A more formal way of saying this is that we want a "language model": a probability distribution over sequences of words. Because each probability covers an entire sequence jointly, it is a joint probability distribution. That is what the formula P(w₁,w₂,w₃,…,wₙ) says.
How do we construct such a distribution? If we want a probability for the sequence of words "pluck us from the real"
we can break the problem down into the individual conditional probabilities for sequence pairs, for example the probability of word-2 following word-1, P(w2∣w1), and multiply them together (the ∣ can be read as "given").
Imagine our goal is to predict w5 given all the previous 4 words. All previous words is the history h; e.g., the probability that "real" is next to the sequence "pluck us from the"; which will also imply sensitivity to nearby contexts (e.g., "us from the", "from the", and "the"). We want an answer to the question: what is the probability of w5 ("real") given the condition of history h ("pluck us from the"); we want a conditional probability.
To achieve this goal, notice that the joint probability can be equivalently defined as the product of all the conditional probabilities.
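In standard notation, this identity (the chain rule, discussed next) reads:

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```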
Breaking down the probability of the sequence into individual conditional probabilities is an application of the chain rule of probability. Notice, though, what the right hand side of the equation implies: if we want to know P(w1,w2,w3,…,wn) we need to know all the other conditional probabilities.
We could also show it this way (if this makes more sense for you):
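Written out term by term (the standard chain-rule expansion):

```latex
P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
```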
This gets big quickly. That is, in order to calculate whether a word has, for example, a 75% chance of following a sequence of words, we have to calculate all the previous combinations! Maybe that's tolerable for a very small data set; it is certainly not possible or practical for large ones. But for now let's start counting words with what we have so far.
Counting all the things
# use the DataStructures package to help us count
using Pkg
Pkg.add("DataStructures") # this will take a minute
using DataStructures
We'll leverage Julia's first-class function handling and create a function that accepts the ngram() or ngram_range() function as one of its parameters. Briefly, what alm() does is build temp_lm, a DefaultDict whose keys are the words we want to predict and whose values are accumulators counting the number of times we see a particular history preceding each word.
# alm == a language model function to build up a big count of n-grams
# alm("my string data", 3, ngram)
# alm("my string data", 3, ngram_range)
# functions are first class in Julia; we can pass one as a parameter
# I'm declaring types on the arguments for clarity here, but it's not necessary in Julia
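The description above can be sketched as follows (this is my reconstruction, consistent with the comments; the original post's implementation may differ in detail):

```julia
using DataStructures  # DefaultDict, Accumulator, counter, inc!

# (re-stating a minimal ngram here so this sketch is self-contained)
ngram(words, n::Int) = [words[i:i+n-1] for i in 1:length(words)-n+1]

# alm == a language model function to build up a big count of n-grams
function alm(data::String, n::Int, gram_fn::Function)
    words = split(data)  # naive whitespace split; a real tokenizer is better
    # key: the word to predict; value: counts of each history seen before it
    temp_lm = DefaultDict{String, Accumulator{String, Int}}(() -> counter(String))
    for gram in gram_fn(words, n)
        history = join(gram[1:end-1], " ")  # all but the last word
        word = String(gram[end])            # the word to predict
        inc!(temp_lm[word], history)        # count this (word, history) sighting
    end
    temp_lm
end
```

Note that a 1-gram (from ngram_range) yields an empty history "", which is why empty-string histories show up later in the distributions.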
Define normalize as a function that creates a distribution over each (word, history) pair: sum all the counts and then divide each count by the sum. (The [x for x in y] syntax is a list comprehension: it loops over y, returning each item as x. Looping over a Dict this way yields (key, value) pairs; here, (history, count).)
# create a distribution for the accumulator histories built up in alm()
function normalize(accum)
    s = sum(values(accum)) # sum all counts
    # tuple of history string with each count divided by the sum
    [(history, float(count) / s) for (history, count) in accum]
end
("Now the leaves", 0.5)
("to the graves", 0.5)
Here is what the stored key-value looks like.
"are"=>[("Now the leaves", 0.5),("to the graves", 0.5)]
The word "are" has two histories of 3-word phrases; each of those histories only happens once (i.e., 1 counted occurrence). The normalize() function built a distribution based on these histories given that each only occurred once: 2 histories each with 1 count == 1/2 == 0.5.
In other words, a word prediction task built on 2 stanzas of Auden's poem and based on a 4-gram model gives: "to the graves" yields a 0.5 probability for the next word being "are"; and "Now the leaves" does the same. Remember probabilities must sum to 1.0 ... so for our entire language model we've exhausted the possibilities for the word "are"; it either has "Now the leaves" or "to the graves" for its history. You can verify this by looking at the 2 stanzas of Auden's poem quoted previously.
If we wanted a richer history for predicting the word "are" we need more input data.
What if we use a bigger model? Staying with the stanzas of Auden's poem, we can still achieve this by increasing the number of words in our history and using all the n-grams up to that number (e.g., 1, 2, 3, 4, 5). We get a model that is both larger (in terms of memory) and more robust (better able to predict).
("Nurses to the graves", 0.111111)
("Now the leaves", 0.111111)
("the leaves", 0.111111)
("to the graves", 0.111111)
("the graves", 0.111111)
This has more predictive power; we get a better distribution for predicting "are" in a more varied number of cases now.
And what about the example we used for showing the chain rule of probability? We can apply our little language modeller on the phrase "pluck us from the real".
result3 = alm("pluck us from the real", 5, ngram_range)
("from the", 0.2)
("pluck us from the", 0.2)
("us from the", 0.2)
If I'm writing poetry and I want to know what word has a good probability of following the phrase "pluck us from the..." I can look it up in my probability table and see that "real" has a 20% chance of being next.
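That lookup can be sketched as a small helper over the normalized model, here represented as a plain Dict of word => [(history, probability), ...] pairs (the structure shown earlier for "are"); the function name and shape are my assumptions:

```julia
# Sketch: find candidate next words for a given history, assuming the model
# is a Dict mapping word => [(history, probability), ...].
function next_word_candidates(model, history::AbstractString)
    [(word, p) for (word, pairs) in model for (h, p) in pairs if h == history]
end

model = Dict("are"  => [("Now the leaves", 0.5), ("to the graves", 0.5)],
             "real" => [("pluck us from the", 0.2)])
next_word_candidates(model, "pluck us from the")  # [("real", 0.2)]
```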
There are a lot of practical issues with such an approach, but it has worked well for a long time. If you have a small data set with a narrow focus (e.g., a small set of highly specific words) then this simple technique may work.
Having a greater variety of measurements can also make a difference. Searching for the word "the" (one of the most common English words) I see lots of choices, each with low probability.
An interesting result is ("", 0.222222). This shows us that "the" is more likely to occur at the beginning of a phrase or sentence. Still, the wide distribution of "the" means it has a lot of histories, which drives down the probability of any specific occurrence and makes it harder to predict.
This gives us a clue to the usefulness of n-gram models today: small data sets (to keep performance tolerable) that have a highly specific domain (lots of jargon or specific contexts for certain words). N-grams also work at the character level where one does not have much context for capturing distributional relations of next and near. For example, in classifying emails or short names there is not much context and so breaking up an email, for example "info@coolcompany", into character-level n-grams might prove effective (This article has an example of a character-level n-gram approach to generating Shakespearean text).
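As a quick illustration of the character-level idea (my own sketch, not taken from the linked article):

```julia
# Character-level n-grams: the same sliding-window idea applied to characters.
chars = collect("info@coolcompany")  # Vector{Char}
char_trigrams = [String(chars[i:i+2]) for i in 1:length(chars)-2]
# first few: "inf", "nfo", "fo@", "o@c", ...
```

Even a short string yields a useful set of overlapping features this way, which is why character n-grams work when word-level context is scarce.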
N-grams are still a foundational part of many modern data tasks; instead of taking center stage though, they often play a supporting role to more advanced and precise techniques. More advanced techniques often entail more complex code and less transparency or interpretability; if a problem seems like a good fit for an n-gram model (at least for the short term) it may lead to success... with advantages of simplicity and ease of maintainability. There are still, however, many improvements that can be made to what we've done here.
Overall, the driving concept behind the ngram approach is to look at patterns for sequences that occur both next to each other and near each other. This concept of distributed patterns in textual relations underlies even the most sophisticated and state of the art approaches today.
Concluding the first steps
You may have noticed a few things; I'll point them out.
1. What happens if we search for a word that is not in Auden's poem? Failure.
2. The Dict used for storage seems like it would take up a lot of memory on large texts, specifically because we have to store the full history for each word.
Point 1 is out of the scope of this post. The next post will address point 2 and show how to dramatically reduce the history needed to compute these probabilities (hint: it's called the Markov property, also referred to as the Markov assumption).
I started by defining the terms n-gram and model, as well as briefly noting some driving assumptions behind n-gram language models. I provided basic examples of n-grams in Julia. I then motivated a (hopefully) intuitive first step into formalizing a language model by introducing joint and conditional probabilities and the equivalence between the two via the chain rule of probability.