"LLM Fundamentals 2026 — Tokens, Context Windows, and Hallucination Done Right"
The three things you should know before opening ChatGPT, Claude, or Gemini
Key summary
- Audience: General users who've used an AI chatbot but find words like "tokens," "context," and "hallucination" hazy — and developers just starting with LLM APIs.
- What you'll get: 1) What a token actually is and why non-English text costs more, 2) what a context window means and the April 2026 limits across major models, 3) why hallucination won't go away (summarized from OpenAI's official paper), 4) practical mitigations.
- Prerequisite: None. If you've used ChatGPT, Claude, or Gemini even once, you're ready.
1. Tokens — The unit LLMs see instead of letters
A Large Language Model (LLM) doesn't read characters the way you do. It reads tokens: chunks that are usually bigger than a letter and often smaller than a word. "hello world" is usually 2–3 tokens. The Korean phrase "안녕하세요" (5 syllables) can split into 5–10 tokens depending on the tokenizer.
1.1 Why tokens matter
Three reasons:
| Reason | Impact |
|---|---|
| Pricing | APIs bill per input + output token |
| Length limit | Each model has a max token count (= context window) |
| Latency | More tokens → slower processing and generation |
Even on paid plans for ChatGPT, Claude, or Gemini, there are hidden token ceilings. The "unlimited" feel comes from UIs that silently truncate or compress.
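The pricing row in the table above comes down to simple arithmetic. A minimal sketch, assuming hypothetical per-million-token prices (the figures below are placeholders, not any vendor's actual rates):

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the cost of one API call, with input and output billed separately."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = estimate_cost_usd(
    input_tokens=20_000,
    output_tokens=1_500,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"${cost:.4f}")  # $0.0825
```

Output tokens are often priced higher than input tokens, so a short prompt that triggers a long answer can cost more than a long prompt with a short answer.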
1.2 Tokenizers differ by vendor
| Vendor | Tokenizer | Notes |
|---|---|---|
| OpenAI (GPT) | tiktoken (cl100k_base / o200k_base) | Best for English |
| Anthropic (Claude) | Custom BPE | Strong on English + code |
| Google (Gemini) | SentencePiece | More balanced multilingual |
All three use BPE-family tokenizers, but their training corpora differ, so the same text can yield different token counts depending on which API you hit.
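To see why the training corpus changes token counts, here is a toy character-level BPE. It is a teaching sketch, not any vendor's tokenizer: real tokenizers work on far larger corpora and vocabularies, and the function names below are made up for this example.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules: repeatedly merge the most frequent adjacent pair."""
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in words:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = [apply_merge(word, best) for word in words]
    return merges


def tokenize(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split text into characters, then replay the learned merges in order."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


merges_a = train_bpe("hello hello hello", num_merges=4)
merges_b = train_bpe("world world world", num_merges=4)

print(len(tokenize("hello", merges_a)))  # 1
print(len(tokenize("hello", merges_b)))  # 5
```

The tokenizer trained on text containing "hello" compresses it into one token; the one that never saw it falls back to single characters. The same effect, at scale, is why one sentence produces different counts on different APIs.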
1.3 Why non-English text costs more
English averages ~4 characters per token. Korean, Japanese, and Chinese characters often consume 1–3 tokens each. Net effect: the same idea expressed in Korean costs 2–3× the tokens of an English version (2025 multilingual tokenization study).
Practical takeaway: if you write large volumes via API, English is cheaper. Use your native language when accuracy matters more than cost — but know the multiplier.
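One intuition for the multiplier, assuming a byte-level tokenizer that starts from UTF-8 bytes before merging (tiktoken works this way; other vendors' internals are not covered here): Hangul syllables take three bytes each, so Korean text starts from more raw units than English text with the same number of characters. The sketch below counts bytes only, not real tokens.

```python
def utf8_profile(text: str) -> dict[str, float]:
    """Report character count, UTF-8 byte count, and bytes per character."""
    chars = len(text)
    raw_bytes = len(text.encode("utf-8"))
    return {
        "chars": chars,
        "bytes": raw_bytes,
        "bytes_per_char": raw_bytes / chars if chars else 0.0,
    }


english = utf8_profile("hello")
korean = utf8_profile("안녕하세요")

print(english)  # {'chars': 5, 'bytes': 5, 'bytes_per_char': 1.0}
print(korean)   # {'chars': 5, 'bytes': 15, 'bytes_per_char': 3.0}
```

Byte count is only a starting point. How far a tokenizer compresses those bytes depends on how much Korean its training corpus contained.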
2. Context window — The model's working memory
A context window is the maximum amount of text the model can reference for a single response. System prompt, prior conversation, attached files, the current question, and the response being generated all share that budget.
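The shared budget can be written down as a simple sum. A minimal sketch, with hypothetical token counts and a made-up `ContextBudget` helper (in practice the counts would come from a real tokenizer or a vendor's counting endpoint):

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int               # model's context window, in tokens
    system_prompt: int = 0
    history: int = 0          # prior conversation
    attachments: int = 0      # attached files
    question: int = 0         # the current user message
    reserved_output: int = 0  # room kept free for the response

    def used(self) -> int:
        """Everything that competes for the same window."""
        return (
            self.system_prompt
            + self.history
            + self.attachments
            + self.question
            + self.reserved_output
        )

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


budget = ContextBudget(
    window=200_000,
    system_prompt=1_200,
    history=150_000,
    attachments=45_000,
    question=300,
    reserved_output=8_000,
)
print(budget.used(), budget.remaining(), budget.fits())  # 204500 -4500 False
```

Note that the response counts too: a request whose input alone fits can still fail if it leaves no room for the output.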
Common confusion: "Context window = how much knowledge the model has." Wrong. What the model learned during training, stored in its parameters, is its long-term memory; the context window is its short-term working memory.
2.1 Major model limits as of April 2026
| Model | Context window | Source |
|---|---|---|
| Claude Opus 4.7 | 1,000,000 tokens | Anthropic docs |
| Claude Sonnet 4.6 | 1,000,000 tokens | Same |
| Claude Sonnet 4.5 | 200,000 tokens | Same |
| Claude Haiku 4.5 | 200,000 tokens | Same |
| GPT-5.5 | 1,000,000 tokens | OpenAI |
| GPT-5.4 / 5.4 Pro | ~1,050,000 tokens | OpenAI |
| Gemini 2.5 Pro | 1,000,000 tokens | Google AI docs |
2.2 What "1M tokens" actually fits
- About 750,000 English words (roughly 7–8 average novels)
- Roughly 2,500–3,000 pages of standard double-spaced prose (at about 250–300 words per page)
- An entire mid-size codebase
You can pour an entire document or repo into one prompt. But filling the window isn't always smart.
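The word and novel figures above follow from rule-of-thumb arithmetic. A rough sketch, assuming about 0.75 English words per token and a 100,000-word "average novel" (both are assumptions, and both vary by text and tokenizer):

```python
WORDS_PER_TOKEN = 0.75     # rule of thumb for English; an assumption
WORDS_PER_NOVEL = 100_000  # assumed length of an "average novel"


def window_capacity(tokens: int) -> dict[str, float]:
    """Translate a token budget into approximate English words and novels."""
    words = tokens * WORDS_PER_TOKEN
    return {"words": words, "novels": words / WORDS_PER_NOVEL}


print(window_capacity(1_000_000))  # {'words': 750000.0, 'novels': 7.5}
```

For non-English text the words-per-token ratio is lower, so the same window holds noticeably less.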
2.3 Context rot — accuracy drops as the window fills
Anthropic's docs spell it out: "as token count grows, accuracy and recall degrade — a phenomenon known as context rot." Models start missing the point, forgetting earlier instructions, or fixating on irrelevant chunks.
The rule isn't "fill the window"; it's "curate what's in context."
- Spawn a fresh session for unrelated tasks
- Send only the relevant excerpts of long documents
- Compress earlier conversation ("summarize what we covered") and start a new session
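One simple way to curate is to cap how much history gets resent. A minimal sketch that keeps only the most recent messages within a token budget; the helper names are hypothetical, and the token estimate uses the rough 4-characters-per-token figure for English rather than a real tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Very rough English estimate (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the newest messages that fit the budget, preserving their order."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > budget_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    kept.reverse()
    return kept


history = ["a" * 400, "b" * 200, "c" * 100, "d" * 40]  # ~100, 50, 25, 10 tokens
print(len(trim_history(history, budget_tokens=90)))  # 3
```

Dropping old turns loses information, which is why summarizing the earlier conversation before starting a new session is often the better trade.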
3. Hallucination — Why LLMs lie with confidence
Hallucination is when an LLM produces false information with full confidence. In September 2025, OpenAI published a formal paper analyzing why this won't disappear (Kalai et al., arXiv:2509.04664).
3.1 Root cause: training rewards guessing
LLMs learn by predicting the next token in massive text. There are no "true/false" labels in pretraining, so models learn patterns of plausible language — not a calibrated sense of "I know" vs. "I don't."
Worse, the evaluation incentives are misaligned. From the OpenAI paper:
Most accuracy-based benchmarks penalize "I don't know" more than they penalize a wrong guess. So models are trained to guess confidently when uncertain.
In plain terms: "I don't know" earns 0 points, but a lucky guess earns 1. Models learn to behave like the student who fills in every multiple-choice blank.
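The incentive can be made concrete with expected scores. In the sketch below, the 0-or-1 grading matches the description above, while the wrong-answer penalty is an illustrative assumption added to show how the incentive flips; it is not a scheme quoted from the paper.

```python
ABSTAIN_SCORE = 0.0  # "I don't know" earns nothing under either scheme


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected points for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Guessing is rational whenever it beats the abstain score."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


# Binary grading: even a 20% shot beats abstaining, so guessing always wins.
print(should_guess(0.2))                     # True
# Assumed penalty of 1 point per wrong answer: break-even moves to 50%.
print(should_guess(0.2, wrong_penalty=1.0))  # False
print(should_guess(0.6, wrong_penalty=1.0))  # True
```

With no penalty, any nonzero chance of being right makes guessing the better strategy, which is exactly the behavior the benchmarks end up rewarding.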
3.2 Patterns where hallucination spikes
Combining the OpenAI paper with Lakera's 2026 analysis:
| Pattern | Examples |
|---|---|
| Low-frequency facts | "What was X's PhD dissertation title?", "Quote page N of book Y" |
| Recent events | Anything after the model's knowledge cutoff |
| Forced format | "Exactly 5 bullets," "100+ words each" — when the data doesn't support it |
| Late in long context | Details buried deep in a long context, far from the current question, are recalled less reliably |
3.3 Practical mitigations for general users
- Always verify numbers, names, and dates. This is where LLMs miss most often.
- Tell the model to admit uncertainty. Add "If you're unsure, say so" to your prompt. Simple but effective.
- Request verifiable form. "Include source URLs" — but click them, since fake URLs happen.
- Suspect overly polished answers. Real experts hedge ("depends on the case"). Pristine confidence on a niche topic is a yellow flag.
- Cross-check important decisions across two models. When Claude and GPT disagree, treat both as suspect.
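The first and third checks can be partly automated. A rough sketch that pulls numbers and URLs out of an answer to build a verification checklist; the regular expressions are simplistic assumptions, names are not detected, and a URL appearing here says nothing about whether it actually exists:

```python
import re

# Integers, thousands-separated numbers, decimals, and percentages.
NUMBER = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?%?")
URL = re.compile(r"https?://[^\s)\]>\"']+")


def claims_to_verify(answer: str) -> dict[str, list[str]]:
    """List the numbers and URLs in an answer so a human can check each one."""
    urls = [url.rstrip(".,;") for url in URL.findall(answer)]
    # Remove URLs first so digits inside them are not reported as numbers.
    text = URL.sub(" ", answer)
    return {"numbers": NUMBER.findall(text), "urls": urls}


answer = (
    "The paper was published in 2025 and reports a 12.5% error rate. "
    "Source: https://example.com/report2024."
)
print(claims_to_verify(answer))
# {'numbers': ['2025', '12.5%'], 'urls': ['https://example.com/report2024']}
```

The output is a to-do list, not a verdict: every item still has to be checked against a source you trust.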
4. One-page summary
| Concept | One-line definition | Practical implication |
|---|---|---|
| Token | The sub-word unit LLMs actually process | Non-English costs 2–3× more |
| Context window | Total tokens (input + output) referenceable per response | 1M-token era — but more isn't always better |
| Hallucination | Confident falsehoods baked in by training incentives | Verify numbers, names, and dates |
Developer notes
If you work directly with the APIs:
- Pre-count tokens. Anthropic's token counting API, OpenAI's `tiktoken`, and Google's `count_tokens()` exist for this. Measure before you send.
- New Claude models error on overflow instead of silently truncating (Sonnet 3.7 onward). Pre-counting becomes mandatory.
- Extended thinking tokens are billed once as output but auto-stripped from the next turn's input by the API. You don't need to strip them yourself.
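- Compaction and caching. Claude offers server-side compaction (beta), OpenAI has prompt caching, and Gemini has context caching. Compaction shrinks the history itself, while caching only makes a repeated prefix cheaper to resend. Long-running conversations need at least one of these.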
- Hallucination mitigation patterns: Add "say 'unsure' when uncertain" to the system prompt, enforce JSON Schema responses, and ground with retrieved external data (RAG).
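The last pattern can be sketched without any SDK. The example below pairs an "admit uncertainty" system prompt with a strict check on the reply's shape. The prompt wording, field names, and helper functions are hypothetical, and the check is hand-rolled rather than a JSON Schema validator or a vendor's structured-output feature.

```python
import json

SYSTEM_PROMPT = (
    "Answer in JSON with keys 'answer' and 'confidence'. "
    "Set 'confidence' to 'sure' or 'unsure'. If you are unsure, say so."
)

REQUIRED_KEYS = {"answer", "confidence"}
ALLOWED_CONFIDENCE = {"sure", "unsure"}


def parse_reply(raw: str) -> dict:
    """Parse a model reply and reject anything that breaks the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    confidence = data["confidence"]
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence value: {confidence!r}")
    return data


def needs_review(reply: dict) -> bool:
    """Route self-reported uncertainty to a human or a retrieval step."""
    return reply["confidence"] == "unsure"


reply = parse_reply('{"answer": "Paris", "confidence": "sure"}')
print(needs_review(reply))  # False
```

A model can still report "sure" and be wrong, so this narrows the problem rather than solving it. Grounding the answer in retrieved documents remains the stronger safeguard.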
References
- Anthropic — Context windows
- OpenAI — Why Language Models Hallucinate (Kalai et al., 2025-09)
- Google AI — Long context
- OpenAI Tokenizer
- Multilingual tokenization efficiency (PMC, 2025)
This is part 1 of 11 in the AI Basics series. Next: Free vs. Paid — Who should pay for ChatGPT, Claude, or Gemini Plus.