"LLM Fundamentals 2026 — Tokens, Context Windows, and Hallucination Done Right"

4월 25, 2026

The three things you should know before opening ChatGPT, Claude, or Gemini

핵심 요약

Audience: General users who've used an AI chatbot but find words like "tokens," "context," and "hallucination" hazy — and developers just starting with LLM APIs.
What you'll get: 1) What a token actually is and why non-English text costs more, 2) what a context window means and the April 2026 limits across major models, 3) why hallucination won't go away (summarized from OpenAI's official paper), 4) practical mitigations.
Prerequisite: None. If you've used ChatGPT, Claude, or Gemini even once, you're ready.

1. Tokens — The unit LLMs see instead of letters

A Large Language Model (LLM) doesn't read characters the way you do. It reads tokens — chunks smaller than a word and bigger than a letter. "hello world" is usually 2–3 tokens. The Korean phrase "안녕하세요" (5 syllables) can split into 5–10 tokens depending on the tokenizer.

1.1 Why tokens matter

Three reasons:

Reason	Impact
Pricing	APIs bill per input + output token
Length limit	Each model has a max token count (= context window)
Latency	More tokens → slower processing and generation

Even on paid plans for ChatGPT, Claude, or Gemini, there are hidden token ceilings. The "unlimited" feel comes from UIs that silently truncate or compress.

1.2 Tokenizers differ by vendor

Vendor	Tokenizer	Notes
OpenAI (GPT)	tiktoken (cl100k_base / o200k_base)	Best for English
Anthropic (Claude)	Custom BPE	Strong on English + code
Google (Gemini)	SentencePiece	More balanced multilingual

All three use BPE-family tokenizers, but their training corpora differ, so the same text can yield different token counts depending on which API you hit.

1.3 Why non-English text costs more

English averages ~4 characters per token. Korean, Japanese, and Chinese characters often consume 1–3 tokens each. Net effect: the same idea expressed in Korean costs 2–3× the tokens of an English version (2025 multilingual tokenization study).

Practical takeaway: if you write large volumes via API, English is cheaper. Use your native language when accuracy matters more than cost — but know the multiplier.

2. Context window — The model's working memory

A context window is the maximum amount of text the model can reference for a single response. System prompt, prior conversation, attached files, the current question, and the response being generated all share that budget.

Common confusion: "Context window = how much knowledge the model has." Wrong. Training data (parameters) is the model's long-term memory; the context window is its short-term working memory.

2.1 Major model limits as of April 2026

Model	Context window	Source
Claude Opus 4.7	1,000,000 tokens	Anthropic docs
Claude Sonnet 4.6	1,000,000 tokens	Same
Claude Sonnet 4.5	200,000 tokens	Same
Claude Haiku 4.5	200,000 tokens	Same
GPT-5.5	1,000,000 tokens	OpenAI
GPT-5.4 / 5.4 Pro	~1,050,000 tokens	OpenAI
Gemini 2.5 Pro	1,000,000 tokens	Google AI docs

2.2 What "1M tokens" actually fits

About 750,000 English words (roughly 7–8 average novels)
~700–1,000 pages of standard double-spaced prose
An entire mid-size codebase

You can pour an entire document or repo into one prompt. But filling the window isn't always smart.

2.3 Context rot — accuracy drops as the window fills

Anthropic's docs spell it out: "as token count grows, accuracy and recall degrade — a phenomenon known as context rot." Models start missing the point, forgetting earlier instructions, or fixating on irrelevant chunks.

The rule isn't "fill the window" — it's "curate what's in context." - Spawn a fresh session for unrelated tasks - Send only the relevant excerpts of long documents - Compress earlier conversation ("summarize what we covered") and start a new session

3. Hallucination — Why LLMs lie with confidence

Hallucination is when an LLM produces false information with full confidence. In September 2025, OpenAI published a formal paper analyzing why this won't disappear (Kalai et al., arXiv:2509.04664).

3.1 Root cause: training rewards guessing

LLMs learn by predicting the next token in massive text. There are no "true/false" labels in pretraining, so models learn patterns of plausible language — not a calibrated sense of "I know" vs. "I don't."

Worse, the evaluation incentives are misaligned. From the OpenAI paper:

Most accuracy-based benchmarks penalize "I don't know" more than they penalize a wrong guess. So models are trained to guess confidently when uncertain.

In plain terms: "I don't know" earns 0 points, but a lucky guess earns 1. Models learn to behave like the student who fills in every multiple-choice blank.

3.2 Patterns where hallucination spikes

Combining the OpenAI paper with Lakera's 2026 analysis:

Pattern	Examples
Low-frequency facts	"What was X's PhD dissertation title?", "Quote page N of book Y"
Recent events	Anything after the model's knowledge cutoff
Forced format	"Exactly 5 bullets," "100+ words each" — when the data doesn't support it
Late in long context	Trust degrades for info far from the prompt

3.3 Practical mitigations for general users

Always verify numbers, names, and dates. This is where LLMs miss most often.
Tell the model to admit uncertainty. Add "If you're unsure, say so" to your prompt. Simple but effective.
Request verifiable form. "Include source URLs" — but click them, since fake URLs happen.
Suspect overly polished answers. Real experts hedge ("depends on the case"). Pristine confidence on a niche topic is a yellow flag.
Cross-check important decisions across two models. When Claude and GPT disagree, treat both as suspect.

4. One-page summary

Concept	One-line definition	Practical implication
Token	The sub-word unit LLMs actually process	Non-English costs 2–3× more
Context window	Total tokens (input + output) referenceable per response	1M-token era — but more isn't always better
Hallucination	Confident falsehoods baked in by training incentives	Verify numbers, names, and dates

Developer notes

If you work directly with the APIs:

Pre-count tokens. Anthropic's token counting API, OpenAI's tiktoken, and Google's count_tokens() exist for this. Measure before you send.
New Claude models error on overflow instead of silently truncating (Sonnet 3.7 onward). Pre-counting becomes mandatory.
Extended thinking tokens are billed once as output but auto-stripped from the next turn's input by the API. You don't need to strip them yourself.
Compaction. Claude offers server-side compaction (beta), OpenAI has prompt caching, Gemini has context caching. Long-running conversations need one of these.
Hallucination mitigation patterns: Add "say 'unsure' when uncertain" to the system prompt, enforce JSON Schema responses, and ground with retrieved external data (RAG).

References

Anthropic — Context windows
OpenAI — Why Language Models Hallucinate (Kalai et al., 2025-09)
Google AI — Long context
OpenAI Tokenizer
Multilingual tokenization efficiency (PMC, 2025)

This is part 1 of 11 in the AI Basics series. Next: Free vs. Paid — Who should pay for ChatGPT, Claude, or Gemini Plus.

이 블로그 검색

MaJu Tech Notes