What Is Subword Tokenization?
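Subword tokenization splits text into units smaller than whole words but usually larger than single characters, rather than treating each whole word as a single token. The units are learned from corpus statistics, so they often resemble roots, prefixes, and suffixes but are not guaranteed to match linguistic morphemes. This lets a model represent rare or unseen words as sequences of known subword units, which keeps the vocabulary at a fixed, manageable size and largely avoids out-of-vocabulary problems. Common algorithms include byte-pair encoding (BPE), WordPiece, and the unigram language model; SentencePiece is a widely used toolkit that implements BPE and unigram tokenization directly on raw text.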
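To make the idea concrete, here is a minimal sketch of byte-pair encoding in Python. It is an illustration under simplifying assumptions, not how any particular library implements it: the function names (merge_pair, learn_bpe, segment) and the toy word counts are hypothetical, words start as single characters, and details that production tokenizers handle (end-of-word markers, byte-level fallback, pre-tokenization) are left out.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (made-up counts, for illustration only).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=4)

# "lowest" never appears in the corpus, yet it splits into known pieces.
print(segment("lowest", merges))  # ['low', 'est']
```

The word "lowest" is absent from the toy corpus, but the merges learned from "low", "newest", and "widest" are enough to cover it with two known units. A word with no matching merges, such as "newer" here, simply falls back to individual characters, which is why out-of-vocabulary failures are rare with this approach.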