What Is Subword Tokenization?

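Subword tokenization splits words into smaller units rather than treating each whole word as a single token. The units often resemble roots, prefixes, and suffixes, but they are learned from corpus statistics rather than from linguistic rules. This lets models represent rare or unseen words as combinations of known subword units, which keeps the vocabulary at a fixed, manageable size and largely avoids out-of-vocabulary problems. Common methods include byte-pair encoding (BPE), WordPiece, and the unigram language model; SentencePiece is a widely used toolkit that implements BPE and unigram.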
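
To make the idea concrete, here is a minimal Python sketch of how a word might be split into known pieces. It uses greedy longest-match segmentation, similar in spirit to how WordPiece applies a vocabulary at inference time, but it is not the implementation of any particular library. The function name `segment`, the `[UNK]` fallback string, and the hand-picked `TOY_VOCAB` are all assumptions made for illustration; real vocabularies are learned from a corpus and usually mark word-internal pieces (for example with a `##` prefix), which this sketch omits.

```python
def segment(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches here: treat the word as unknown.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen by hand for this example.
TOY_VOCAB = {"token", "iz", "ation", "un", "believ", "able", "s"}

print(segment("tokenization", TOY_VOCAB))  # ['token', 'iz', 'ation']
print(segment("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
print(segment("xyz", TOY_VOCAB))           # ['[UNK]']
```

Neither "tokenization" nor "unbelievable" appears in the toy vocabulary as a whole word, yet both are fully covered by pieces that do. Only a word with no matching pieces at all falls back to the unknown token, and byte-level or character-level fallbacks in real tokenizers make that case rare.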
