What is Tokenization?
Large Language Models cannot read text directly. Tokenization is the preprocessing step that splits sentences into small byte-pair fragments (tokens) and assigns an integer code to each unique fragment.
Tokens vs. Words
A rule of thumb is that 100 English words represent roughly 130 to 140 tokens. For non-English languages, characters often consume more tokens because the default vocabulary lacks dense representation for them.