"LLM Fundamentals 2026 — Tokens, Context Windows, and Hallucination Done Right"

The three things you should know before opening ChatGPT, Claude, or Gemini


Key summary

  • Audience: General users who've used an AI chatbot but find words like "tokens," "context," and "hallucination" hazy — and developers just starting with LLM APIs.
  • What you'll get: 1) What a token actually is and why non-English text costs more, 2) what a context window means and the April 2026 limits across major models, 3) why hallucination won't go away (summarized from OpenAI's official paper), 4) practical mitigations.
  • Prerequisite: None. If you've used ChatGPT, Claude, or Gemini even once, you're ready.

1. Tokens — The unit LLMs see instead of letters

A Large Language Model (LLM) doesn't read characters the way you do. It reads tokens: chunks smaller than a word and bigger than a letter. "hello world" is usually 2–3 tokens. The Korean phrase "안녕하세요" ("hello", 5 syllables) can split into 5–10 tokens depending on the tokenizer.

1.1 Why tokens matter

Three reasons:

| Reason | Impact |
|---|---|
| Pricing | APIs bill per input + output token |
| Length limit | Each model has a max token count (= context window) |
| Latency | More tokens → slower processing and generation |

Even on paid plans for ChatGPT, Claude, or Gemini, there are hidden token ceilings. The "unlimited" feel comes from UIs that silently truncate or compress.

1.2 Tokenizers differ by vendor

| Vendor | Tokenizer | Notes |
|---|---|---|
| OpenAI (GPT) | tiktoken (cl100k_base / o200k_base) | Best for English |
| Anthropic (Claude) | Custom BPE | Strong on English + code |
| Google (Gemini) | SentencePiece | More balanced multilingual |

All three use subword tokenizers from the BPE family or close relatives of it, but their vocabularies and training corpora differ, so the same text can yield different token counts depending on which API you call.
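To see why the same text can produce different counts, here is a toy sketch of BPE-style merging. The two merge tables (`MERGES_A`, `MERGES_B`) are invented for illustration and do not correspond to any vendor's real vocabulary; real tokenizers learn tens of thousands of merges from their training corpora.

```python
def bpe_tokenize(text, merges):
    """Apply an ordered list of (left, right) merge rules to a string."""
    tokens = list(text)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Fuse an adjacent pair that matches the current rule.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge tables: A has seen a lot of English, B very little.
MERGES_A = [("l", "l"), ("h", "e"), ("he", "ll"), ("hell", "o")]
MERGES_B = [("l", "l"), ("l", "o")]

print(bpe_tokenize("hello", MERGES_A))  # ['hello']
print(bpe_tokenize("hello", MERGES_B))  # ['h', 'e', 'll', 'o']
```

Same input, one token under table A and four under table B. The gap between vendors is smaller in practice, but the mechanism is the same.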

1.3 Why non-English text costs more

English averages ~4 characters per token. Korean, Japanese, and Chinese characters often consume 1–3 tokens each. Net effect: the same idea expressed in Korean costs 2–3× the tokens of an English version (2025 multilingual tokenization study).
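One contributing factor can be sketched with nothing but byte counts. This assumes a byte-level BPE tokenizer (the approach tiktoken's encodings take): the tokenizer starts from UTF-8 bytes, and Hangul, kana, and common CJK characters each take 3 bytes where an ASCII letter takes 1. Unless the vocabulary has learned merges for those byte sequences, a single character can cost more than one token. The snippet only counts bytes; it does not reproduce any real tokenizer.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "English": "hello",
    "Korean": "안녕하세요",
    "Japanese": "こんにちは",
    "Chinese": "你好",
}

for language, text in samples.items():
    chars, nbytes = utf8_profile(text)
    print(f"{language:9s} chars={chars} bytes={nbytes}")
```

"hello" is 5 characters and 5 bytes; "안녕하세요" is 5 characters and 15 bytes. The raw material the tokenizer has to compress is three times larger before any merges apply.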

Practical takeaway: if you write large volumes via API, English is cheaper. Use your native language when accuracy matters more than cost — but know the multiplier.
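A rough way to "know the multiplier" before you commit to a language, using only the rules of thumb above (~4 characters per token in English, 2–3× for the same content in Korean). The price `PRICE_PER_MILLION` is a hypothetical placeholder, not any vendor's rate, and the estimate is no substitute for a real token count.

```python
import math


def estimate_tokens(english_chars: int, multiplier: float = 1.0,
                    chars_per_token: float = 4.0) -> int:
    """Estimate tokens for content whose English version has `english_chars` characters."""
    english_tokens = english_chars / chars_per_token
    return math.ceil(english_tokens * multiplier)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into a cost at a given per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


PRICE_PER_MILLION = 3.00  # hypothetical price, for illustration only

english = estimate_tokens(4_000)                        # 1000 tokens
korean_low = estimate_tokens(4_000, multiplier=2.0)     # 2000 tokens
korean_high = estimate_tokens(4_000, multiplier=3.0)    # 3000 tokens

for label, tokens in [("English", english), ("Korean (2x)", korean_low), ("Korean (3x)", korean_high)]:
    print(f"{label:12s} ~{tokens} tokens, ~${estimate_cost_usd(tokens, PRICE_PER_MILLION):.4f}")
```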


2. Context window — The model's working memory

A context window is the maximum amount of text the model can reference for a single response. System prompt, prior conversation, attached files, the current question, and the response being generated all share that budget.
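The shared budget is simple subtraction. A minimal sketch, with made-up token counts for each part of the request (the part names and numbers are illustrative assumptions):

```python
def remaining_output_budget(window: int, parts: dict[str, int]) -> int:
    """Tokens left for the response once every input part is counted.

    A negative result means the request already overflows the window.
    """
    return window - sum(parts.values())


# Illustrative numbers only.
parts = {
    "system_prompt": 1_500,
    "conversation_history": 42_000,
    "attached_files": 120_000,
    "current_question": 500,
}

left = remaining_output_budget(200_000, parts)
print(left)  # 36000
```

With a 200,000-token window, these inputs leave 36,000 tokens for the answer. Attach one more large file and the response has nowhere to go.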

Common confusion: "Context window = how much knowledge the model has." Wrong. What the model learned from its training data, stored in its parameters, is its long-term memory; the context window is its short-term working memory.

2.1 Major model limits as of April 2026

| Model | Context window | Source |
|---|---|---|
| Claude Opus 4.7 | 1,000,000 tokens | Anthropic docs |
| Claude Sonnet 4.6 | 1,000,000 tokens | Anthropic docs |
| Claude Sonnet 4.5 | 200,000 tokens | Anthropic docs |
| Claude Haiku 4.5 | 200,000 tokens | Anthropic docs |
| GPT-5.5 | 1,000,000 tokens | OpenAI |
| GPT-5.4 / 5.4 Pro | ~1,050,000 tokens | OpenAI |
| Gemini 2.5 Pro | 1,000,000 tokens | Google AI docs |

2.2 What "1M tokens" actually fits

  • About 750,000 English words (roughly 7–8 average novels)
  • ~2,500–3,000 pages of standard double-spaced prose (at roughly 250–300 words per page)
  • An entire mid-size codebase
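The word and novel figures come from back-of-the-envelope arithmetic. A sketch, assuming 0.75 English words per token (the ratio implied by "1M tokens ≈ 750,000 words") and 100,000 words per "average" novel, both of which are rough conventions rather than measured values:

```python
WORDS_PER_TOKEN = 0.75     # assumption: English rule of thumb
WORDS_PER_NOVEL = 100_000  # assumption: length of an "average" novel


def capacity(tokens: int) -> dict[str, float]:
    """Translate a token budget into approximate English words and novels."""
    words = tokens * WORDS_PER_TOKEN
    return {"words": round(words), "novels": words / WORDS_PER_NOVEL}


print(capacity(1_000_000))  # {'words': 750000, 'novels': 7.5}
print(capacity(200_000))    # {'words': 150000, 'novels': 1.5}
```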

You can pour an entire document or repo into one prompt. But filling the window isn't always smart.

2.3 Context rot — accuracy drops as the window fills

Anthropic's docs spell it out: "as token count grows, accuracy and recall degrade — a phenomenon known as context rot." Models start missing the point, forgetting earlier instructions, or fixating on irrelevant chunks.

The rule isn't "fill the window"; it's "curate what's in context":

  • Spawn a fresh session for unrelated tasks
  • Send only the relevant excerpts of long documents
  • Compress earlier conversation ("summarize what we covered") and start a new session
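If you manage conversation history yourself, curation can be as simple as keeping only the most recent messages that fit a token budget. A minimal sketch: `rough_tokens` is a crude stand-in (about 4 characters per token) for a real tokenizer, and both function names are hypothetical.

```python
def rough_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = rough_tokens(message)
        if used + cost > budget:
            break  # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a" * 400, "b" * 200, "c" * 100]  # ~100, ~50, ~25 tokens
print(len(trim_history(history, budget=80)))  # 2
```

Dropping old turns is the bluntest option. Summarizing them first, as the last bullet suggests, keeps more of the signal for the same budget.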


3. Hallucination — Why LLMs lie with confidence

Hallucination is when an LLM produces false information with full confidence. In September 2025, OpenAI published a formal paper analyzing why this won't disappear (Kalai et al., arXiv:2509.04664).

3.1 Root cause: training rewards guessing

LLMs learn by predicting the next token in massive text. There are no "true/false" labels in pretraining, so models learn patterns of plausible language — not a calibrated sense of "I know" vs. "I don't."

Worse, the evaluation incentives are misaligned. From the OpenAI paper:

Most accuracy-based benchmarks give "I don't know" the same zero credit as a wrong answer, so guessing never scores worse than abstaining. Models are therefore optimized to guess confidently when uncertain.

In plain terms: "I don't know" earns 0 points and a wrong guess also earns 0, but a lucky guess earns 1. Models learn to behave like the student who fills in every multiple-choice blank.
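The incentive is easy to check with expected-value arithmetic. In the sketch below, a correct answer earns 1 point, abstaining earns 0, and a wrong answer costs `wrong_penalty` points. The second scheme, where a wrong answer costs t/(1−t) for a confidence threshold t, mirrors the kind of penalized scoring the paper discusses as a remedy; treat its exact form here as an assumption of this sketch.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def penalty_for_threshold(t: float) -> float:
    """Penalty that makes guessing break even exactly at confidence t."""
    return t / (1.0 - t)


ABSTAIN = 0.0  # "I don't know" always scores zero

# Binary grading: even a 25%-confident guess beats abstaining.
print(expected_score(0.25) > ABSTAIN)            # True

# Penalized grading with t = 0.75: a wrong answer costs 3 points.
penalty = penalty_for_threshold(0.75)            # 3.0
print(expected_score(0.25, penalty))             # -2.0, worse than abstaining
print(expected_score(0.75, penalty))             # 0.0, break-even at the threshold
```

Under binary grading, guessing is the rational strategy at any confidence level. Once wrong answers carry a cost, abstaining wins whenever confidence falls below the threshold.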

3.2 Patterns where hallucination spikes

Combining the OpenAI paper with Lakera's 2026 analysis:

| Pattern | Examples |
|---|---|
| Low-frequency facts | "What was X's PhD dissertation title?", "Quote page N of book Y" |
| Recent events | Anything after the model's knowledge cutoff |
| Forced format | "Exactly 5 bullets," "100+ words each," when the data doesn't support it |
| Late in long context | Trust degrades for info far from the prompt |

3.3 Practical mitigations for general users

  1. Always verify numbers, names, and dates. This is where LLMs miss most often.
  2. Tell the model to admit uncertainty. Add "If you're unsure, say so" to your prompt. Simple but effective.
  3. Request verifiable form. "Include source URLs" — but click them, since fake URLs happen.
  4. Suspect overly polished answers. Real experts hedge ("depends on the case"). Pristine confidence on a niche topic is a yellow flag.
  5. Cross-check important decisions across two models. When Claude and GPT disagree, treat both as suspect.
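For items 1 and 3, a small helper can pull out the checkable parts of an answer so that nothing slips past you. This is a hypothetical sketch using plain regular expressions: it lists numbers and URLs to verify by hand, does not detect names, and makes no judgment about whether anything is true.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
NUMBER_PATTERN = re.compile(r"\d[\d,.]*\d|\d")


def claims_to_verify(answer: str) -> dict[str, list[str]]:
    """Collect URLs and numbers from a model answer for manual checking."""
    # Strip trailing punctuation that the greedy \S+ match drags along.
    urls = [url.rstrip(").,;") for url in URL_PATTERN.findall(answer)]
    # Remove URLs first so digits inside them are not reported as numbers.
    text_without_urls = URL_PATTERN.sub(" ", answer)
    numbers = NUMBER_PATTERN.findall(text_without_urls)
    return {"urls": urls, "numbers": numbers}


answer = "Founded in 1998, revenue grew 12.5% (see https://example.com/report)."
print(claims_to_verify(answer))
# {'urls': ['https://example.com/report'], 'numbers': ['1998', '12.5']}
```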

4. One-page summary

| Concept | One-line definition | Practical implication |
|---|---|---|
| Token | The sub-word unit LLMs actually process | Non-English costs 2–3× more |
| Context window | Total tokens (input + output) referenceable per response | 1M-token era, but more isn't always better |
| Hallucination | Confident falsehoods baked in by training incentives | Verify numbers, names, and dates |

Developer notes

If you work directly with the APIs:

  1. Pre-count tokens. Anthropic's token counting API, OpenAI's tiktoken, and Google's count_tokens() exist for this. Measure before you send.
  2. New Claude models error on overflow instead of silently truncating (Sonnet 3.7 onward). Pre-counting becomes mandatory.
  3. Extended thinking tokens are billed once as output but auto-stripped from the next turn's input by the API. You don't need to strip them yourself.
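  4. Compaction and caching. Claude offers server-side compaction (beta), which shrinks the conversation itself. OpenAI's prompt caching and Gemini's context caching work differently: they reduce the cost and latency of a repeated prefix rather than shrinking the conversation. Long-running conversations usually need at least one of these.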
  5. Hallucination mitigation patterns: Add "say 'unsure' when uncertain" to the system prompt, enforce JSON Schema responses, and ground with retrieved external data (RAG).
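For item 5, enforcing structure only helps if you also reject replies that break it. A minimal sketch using the standard library: the reply shape (`answer`, `confidence`, `sources`) is a hypothetical schema chosen for illustration, not a format any vendor prescribes, and a real project would more likely use a JSON Schema validator or the vendor's structured-output feature.

```python
import json

ALLOWED_CONFIDENCE = ("sure", "unsure")  # hypothetical labels requested in the system prompt


def parse_reply(raw: str) -> dict:
    """Parse and validate a model reply; raise ValueError on any violation."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(reply.get("answer"), str):
        raise ValueError("'answer' must be a string")
    if reply.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError("'confidence' must be 'sure' or 'unsure'")
    sources = reply.get("sources", [])
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        raise ValueError("'sources' must be a list of strings")
    return reply


reply = parse_reply('{"answer": "I could not find this.", "confidence": "unsure", "sources": []}')
print(reply["confidence"])  # unsure
```

An "unsure" reply passes validation on purpose: the goal is to make abstaining a first-class, machine-readable outcome that your code can route to retrieval or a human.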


This is part 1 of 11 in the AI Basics series. Next: Free vs. Paid — Who should pay for ChatGPT, Claude, or Gemini Plus.
