Core Concepts
What Is Tokenization?
Tokenization is the process of splitting raw text into tokens — the small pieces a language model actually reads. A tokenizer maps characters and words to numeric IDs, balancing vocabulary size against sequence length. How text is tokenized affects a model’s efficiency and how it handles rare words, spelling, and other languages.
Further reading
Read more about Tokenization — articles and blogs from around the web: