Core Concepts

What Is Tokenization?

Tokenization is the process of splitting raw text into tokens — the small pieces a language model actually reads. A tokenizer maps characters and words to numeric IDs, balancing vocabulary size against sequence length. How text is tokenized affects a model’s efficiency and how it handles rare words, spelling, and other languages.

What Is Tokenization?

Related topics

Further reading