When working with written natural language data, as we do with many natural language processing models, a step we typically carry out during preprocessing is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.
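As a minimal sketch of the idea, consider the simplest possible scheme: splitting a string on whitespace. The example text below is invented for illustration; real tokenizers (word-level, subword, or character-level) are considerably more sophisticated.

```python
# A toy example of tokenization: whitespace splitting.
text = "Tokenization converts a long string into smaller units."

# The simplest scheme: split the string wherever there is whitespace.
tokens = text.split()

print(tokens)
# ['Tokenization', 'converts', 'a', 'long', 'string', 'into', 'smaller', 'units.']
```

Notice that the trailing period stays attached to "units.", which already hints at why practical tokenizers need rules beyond whitespace splitting, such as handling punctuation or breaking rare words into subword pieces.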