Subword Tokenization with Byte-Pair Encoding

Added on November 11, 2022 by Jon Krohn.

When working with written natural language data, as we do with many natural language processing (NLP) models, a step we typically carry out while preprocessing the data is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.

Word Tokenization

The standard way to tokenize natural language is historically word-level tokenization. This is a conceptually straightforward tokenization: We can, for example, simply use the white space between words to identify where one word ends and the next begins, thereby converting a natural language string like “the cat sat” into three tokens: “the” and “cat” and “sat”.
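To make this concrete, here is a minimal sketch of whitespace-based word tokenization in Python (real word tokenizers also handle punctuation, casing, and other details):

```python
# Minimal word-level tokenization: split on whitespace.
text = "the cat sat"
word_tokens = text.split()
print(word_tokens)  # ['the', 'cat', 'sat']
```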

This word-level tokenization is used by techniques like Word2Vec and GloVe, two popular NLP techniques for quantitatively representing the relative meaning of words. A big drawback of word-level tokenization is that if a word didn’t show up enough times in our training data, the model has no way to handle that word when it encounters it in production. In situations like this, the new token is considered unknown and is ignored by the model, even though the word might have been important to the production application.


Character Tokenization

To avoid the big unknown-token issue that word-level tokenization has, we can use character-level tokenization instead. With character-level tokenization, a natural language string like “the cat sat” is converted into eleven tokens: “t”, “h”, “e”, “ ” (a space), “c”, “a”, “t”, “ ” (a space), “s”, “a”, “t”. That way, when we encounter a word outside of our model’s vocabulary in production, we don’t need to ignore it; instead, the model can leverage its aggregate representation of the characters that make up the new word.
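As a minimal Python sketch, character-level tokenization can be as simple as splitting a string into its individual characters:

```python
# Minimal character-level tokenization: every character, including spaces, is a token.
text = "the cat sat"
char_tokens = list(text)
print(char_tokens)       # ['t', 'h', 'e', ' ', 'c', 'a', 't', ' ', 's', 'a', 't']
print(len(char_tokens))  # 11
```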

A technique called ELMo — which stands for “Embeddings from Language Models” — is a prominent example of an NLP technique that uses character-level tokenization. Unfortunately, character-level tokenization has its own drawbacks. For one, it requires a large number of tokens to represent a sequence of text. In addition, unlike a word, a character doesn’t on its own convey any meaning, which can result in suboptimal model performance.


Subword Tokenization with Byte-Pair Encoding

So, both word-level and character-level tokenization have critical flaws. Thankfully, NLP researchers have devised a solution: subword tokenization. Subwords sit between words and characters: They aren’t as coarse as words, but they aren’t as granular as characters. Subword tokenization blends the computational efficiency of word-level tokens with the capacity of character-level tokenization to handle out-of-vocabulary words — it’s the best of both worlds!

There are many algorithms out there for tokenizing strings of natural language into subwords, many of which rely upon a concept called byte-pair encoding. The general idea is that we specify how many subwords we’d like to have in our vocabulary and rely on byte-pair encoding to determine what those particular subwords should be, given the natural language we provide to it (a minimal code sketch follows the list below):

  1. First, the algorithm performs word-level tokenization.

  2. Second, it splits each individual word-level token into character-level tokens.

  3. Third, it counts how frequently each pair of adjacent tokens occurs across all the words in our natural language data.

  4. Finally, it merges the most frequent adjacent pair into a single new subword token, recounts the pair frequencies, and repeats this merging until the vocabulary reaches the number of subwords you specified.
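Here is a minimal Python sketch of that merge-learning loop. The function name learn_bpe and the tiny corpus are made up for illustration; production implementations (for example, the Hugging Face tokenizers library) add special tokens, byte-level fallbacks, and many optimizations.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` byte-pair-encoding merges from a whitespace-tokenized corpus."""
    # Steps 1 and 2: word-level tokenization, then split each word into characters.
    word_counts = Counter(corpus.split())
    vocab = {tuple(word): count for word, count in word_counts.items()}

    merges = []
    for _ in range(num_merges):
        # Step 3: count how often each pair of adjacent tokens occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Step 4: merge the most frequent pair everywhere it occurs, then repeat.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# On a toy corpus, the first merges capture frequent character pairs like ('l', 'o') and ('lo', 'w').
print(learn_bpe("low low low lower lowest newest newest", num_merges=4))
```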

Once computed, the beauty of subwords is that — unlike characters — subwords do have meaning, and so they can be recombined to represent out-of-vocabulary words efficiently. For example, let’s say that after we ran byte-pair encoding over our natural language data, it learned the subword tokens “re”, “lat”, and “ed”. These three subwords can be combined to form the word “related”. Now, in a contrived example, let’s say that the word “unrelated” wasn’t in our training data. When our NLP application comes across “unrelated” in production, it should nevertheless be able to represent the word’s meaning efficiently, because byte-pair encoding learned not only “re”, “lat”, and “ed” but also (let’s assume) the subword “un”. The subword “un”, with its negation of meaning, allows our NLP application to represent that “unrelated” means the opposite of “related”, even though it never encountered “unrelated” during training. Very cool, and very powerful!
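To illustrate that recombination, here is a hypothetical sketch: a greedy longest-match segmenter applied to a made-up subword vocabulary. (Real BPE tokenizers apply their learned merges in order rather than matching greedily, but the effect on this example is the same.)

```python
def segment(word, subword_vocab):
    """Greedily split a word into the longest subwords found in the vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible subword starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Hypothetical subwords learned by byte-pair encoding, as in the example above.
vocab = {"un", "re", "lat", "ed"}
print(segment("unrelated", vocab))  # ['un', 're', 'lat', 'ed']
```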

The upshot is that byte-pair encoding and its close subword relatives are indeed so powerful that they serve as a crucial component behind many of the leading NLP models of today, such as BERT, GPT-3, and XLNet. So, if you didn’t understand the broad strokes of tokenization, particularly this influential byte-pair encoding approach, prior to today’s episode, hopefully you do now!

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
