ℹ️ Understanding Tokenization Through Analogy
The Emoji Analogy: Just like we can replace words with emojis (strawberry → 🍓), GPT replaces text chunks with token IDs. The middle box shows words as emojis to help visualize how language models compress text into symbols.
Real Tokenization: GPT doesn't work with whole words, though! It breaks text into subword pieces using Byte Pair Encoding (BPE). Notice how the real tokens (box 3) are often smaller than words; this helps the model handle rare words and work across languages.
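The merging idea behind BPE can be sketched in a few lines. This is a toy illustration, not GPT's actual tokenizer (real implementations learn a fixed merge table over bytes from a large corpus); here we simply merge the most frequent adjacent pair of symbols a few times:

```python
# Toy sketch of Byte Pair Encoding (BPE) merging -- NOT GPT's real
# tokenizer. Real BPE learns its merge rules from corpus statistics;
# this just repeats "merge the most common adjacent pair" on one word.
from collections import Counter

def bpe_merge(text, num_merges):
    """Start from characters and greedily merge frequent adjacent pairs."""
    tokens = list(text)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one subword token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merge("strawberries", 3))  # e.g. ['stra', 'w', 'b', 'e', 'r', 'r', 'i', 'e', 's']
```

Each merge shrinks the token list, which is exactly why common words end up as a single token while rare words stay split into several smaller pieces.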
- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words
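The rules of thumb above can be turned into a quick estimator. This is a heuristic sketch (the function name and the averaging of the two rules are our own choices, not part of any tokenizer API); for exact counts you would run the model's real tokenizer:

```python
# Rough token-count estimate from the English rules of thumb:
# ~4 characters per token, and ~3/4 of a word per token.
# A heuristic only -- real token counts come from the actual tokenizer.
def estimate_tokens(text):
    by_chars = len(text) / 4          # 1 token ~= 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~= 3/4 of a word
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Hello world, this is a test"))  # -> 7
```

Estimates like this are handy for budgeting prompt sizes before you have the real tokenizer in hand.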