What is a tokenizer?

A tokenizer converts raw text into the sequence of tokens an LLM consumes, and back again. Its vocabulary and splitting rules determine how many tokens a given string costs.

A tokenizer is the component that sits between human-readable text and the model's numeric internals. It maps a string to a list of integer token IDs (encoding) and maps generated IDs back to text (decoding). Most modern LLMs use sub-word tokenizers trained with algorithms such as Byte-Pair Encoding (BPE) or its byte-level variants, which build a fixed-size vocabulary by repeatedly merging the most frequent pair of adjacent symbols (initially single characters or bytes) in a training corpus. This strikes a balance: frequent words become single tokens, while rare or novel strings decompose into smaller known pieces, so the model can represent arbitrary text without an unbounded vocabulary.

Each model family ships its own tokenizer and vocabulary, so the same sentence can produce different token counts on different models. That is why you should count tokens with the model's own tokenizer rather than estimate. Tokenizer quirks explain real behavior: a leading space is often part of the token, numbers and code can fragment unpredictably, and some languages tokenize far less efficiently than English, which inflates cost and consumes the context window faster.

For builders, the tokenizer is the tool for measuring prompt size, sizing RAG chunks to a token budget, and predicting how close a request is to the context-window limit before sending it.
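To make the BPE merge idea concrete, here is a minimal toy sketch in Python. It is an illustration under simplifying assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, treats each word in isolation, keeps tokens as strings instead of integer IDs, and breaks frequency ties alphabetically. The function names (`train_bpe`, `merge_pair`, `encode`, `decode`) and the tiny corpus are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts out as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Split `word` into sub-word tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def decode(tokens):
    """Concatenate tokens back into the original string."""
    return "".join(tokens)


# Hypothetical toy corpus: each word repeated according to its frequency.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)

print(encode("lowest", merges))  # ['low', 'est']
print(encode("xyz", merges))     # ['x', 'y', 'z']
print(decode(encode("lowest", merges)))  # lowest
```

With this corpus, the unseen word "lowest" costs two tokens because both of its pieces were learned from frequent words, while "xyz" falls back to one token per character. The same effect, at much larger scale, is why token counts vary with the vocabulary a tokenizer was trained on.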