Local AI Infrastructure Notes (8/15) — Local LLM Acceleration and Quantization Deep Dive

April 29, 2026

Same model, same hardware — one setup squeezes 200 tok/s, another tops out at 30. The difference is the algorithm.

In early 2026, three techniques redrew the limits of LLM inference speed: DFlash (Z Lab, Feb 2026; block-diffusion speculative decoding), DDTree (Ringel & Romano, Apr 2026; tree-attention verification), and TurboQuant (Google Research, ICLR 2026; KV cache quantization down to about 3 bits with little or no quality loss). They don't interfere with each other, so applied together the speedups compound.

This article walks through all three from first-party sources (arXiv, GitHub, official announcements), starting from prefill/decode/KV cache fundamentals and ending with real-world deployment status. It is aimed at developers, but it also answers questions like "Why is the M4 Pro nearly 2× faster at inference than the base M4?"

Key Takeaways

Prefill = compute-bound, Decode = memory-bandwidth-bound. Different games, different optimizations.
DFlash: block diffusion drafts K tokens at once → 6× speedup, 2.5× over EAGLE-3.
DDTree: builds a draft tree from diffusion distributions → MATH 7.5×, HumanEval 8.2×.
TurboQuant: random rotation + QJL → 3.5-bit lossless KV cache. 8× on H100.
All three are orthogonal. Theoretical compound 8~10×.

1. Fundamentals — Prefill and Decode Are Different Games

LLM inference has two phases. Many articles treat them as one. They aren't.

Prefill — Processing the Input

The full prompt is processed in parallel. K and V values for every input token are computed at once, building the KV cache. Massive matrix multiplications max out GPU tensor cores (or Apple M5 Neural Accelerators) — compute-bound.

H100 with a 10K-token prompt: prefill takes 200~400ms. M5 with a 14B dense model: under 10s. 30B MoE: under 3s.

The metric for this phase is TTFT (Time To First Token).

Decode — Generating the Output

Tokens are produced one at a time, autoregressively. Every step has to read the full KV cache and the full model weights from memory. Computation is small, but memory access becomes the bottleneck — memory-bandwidth-bound.

The metric here is TPOT (Time Per Output Token), or its inverse, tok/s. Typically 30~150 tok/s.

The Role of the KV Cache

Without it, generating the token at position t means recomputing K and V for all t tokens seen so far, so the K/V work over an n-token sequence grows as O(n²). With the cache, only the new token's K and V are computed and appended, so that work is O(n). The KV cache is therefore mandatory for any serious inference.
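To make the O(n²) versus O(n) gap concrete, here is a minimal sketch that only counts per-token K/V projection computations. The function name and the 1,000-token length are illustrative assumptions. The count covers K/V projection work only; attention over the cached keys still reads the whole cache at every step.

```python
def kv_projections(n_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projection computations over a full decode."""
    total = 0
    for t in range(1, n_tokens + 1):
        # Without a cache, step t re-projects all t tokens seen so far;
        # with a cache, only the newest token is projected and appended.
        total += 1 if use_cache else t
    return total

n = 1000
without_cache = kv_projections(n, use_cache=False)  # n(n+1)/2 = 500500
with_cache = kv_projections(n, use_cache=True)      # n = 1000
print(without_cache, with_cache)
```

At 1,000 tokens the uncached path does about 500× more projection work, and the ratio keeps growing with sequence length.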

The problem is size. A 70B model with 200K context = 40~80 GB KV cache — sometimes larger than the model weights themselves.
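The size figure can be sanity-checked with simple arithmetic. The sketch below assumes a model shape that is common for 70B-class models with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and FP16 storage. These shape values are assumptions for illustration, not numbers from the sources.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2.0):
    """Bytes for keys plus values across all layers (the factor 2 is K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
size_fp16 = kv_cache_bytes(80, 8, 128, 200_000)
print(f"{size_fp16 / 1e9:.1f} GB at FP16")
```

Under these assumptions the result is about 65.5 GB, which falls inside the 40~80 GB range quoted above. Models with more KV heads or a wider head dimension would land higher.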

Memory Bandwidth Decides — Apple Silicon Comparison

Chip	Bandwidth
M3 base	100 GB/s
M4 base	120 GB/s
M5 base	153 GB/s
M3 Pro	150 GB/s
M4 Pro	273 GB/s (2.3× over M4 base)
M3 Max	400 GB/s
M4 Max	546 GB/s
M5 Max	~600 GB/s

The M4 Pro is nearly 2× faster at inference than the base M4 because its memory bandwidth is 2.3× higher. Decode is memory-bound, so this is the expected outcome.

If you can read the same KV cache faster, you produce more tokens per second. That's the entire game.
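A rough way to see why bandwidth dominates: if every generated token requires one full read of the weights and the KV cache, bandwidth divided by bytes read gives a ceiling on tok/s. The sketch below uses the table's bandwidth figures; the 8 GB weight size is a hypothetical value chosen only for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weight_gb, kv_cache_gb=0.0):
    """Upper bound on tok/s if each token needs one full read of weights + KV cache."""
    return bandwidth_gb_s / (weight_gb + kv_cache_gb)

weights_gb = 8.0  # hypothetical quantized model size, not a measured value
ceilings = {
    chip: decode_ceiling_tok_s(bw, weights_gb)
    for chip, bw in [("M4 base", 120), ("M4 Pro", 273)]
}
for chip, tok_s in ceilings.items():
    print(f"{chip}: at most {tok_s:.1f} tok/s")
```

The ratio between the two ceilings equals the bandwidth ratio (about 2.3×). Measured gains come in somewhat lower, presumably because not every cost scales with bandwidth.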

2. DFlash — Block-Diffusion Speculative Decoding

Speculative Decoding First

Standard decode produces one token per forward pass of the model, then repeats for the next. Speculative decoding splits the work:

1. A draft model (smaller) guesses K tokens
2. A target model (larger) verifies all K in one pass
3. Accept the matching prefix; restart from the first mismatch

When every guess is accepted, one forward pass of the big model yields K tokens, so the theoretical ceiling is K×.
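The draft-and-verify loop can be sketched with toy greedy "models". Everything here is hypothetical: `draft_model` and `target_model` are stand-in functions, and the target is called once per position for readability, whereas a real engine scores all K positions in a single batched pass. On a mismatch the sketch keeps the target's own token, which is the usual way to restart.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function

def speculative_step(ctx: List[Token], draft: NextToken, target: NextToken, k: int) -> List[Token]:
    """One draft-and-verify round with greedy matching; returns newly accepted tokens."""
    # 1. The draft model guesses k tokens.
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft(ctx + guesses))
    # 2. The target checks each guessed position in order.
    accepted: List[Token] = []
    for g in guesses:
        expected = target(ctx + accepted)
        if g != expected:
            accepted.append(expected)  # 3. first mismatch: keep the target's token
            break
        accepted.append(g)
    return accepted

def target_model(ctx: List[Token]) -> Token:
    return ctx[-1] + 1  # toy target: counts upward

def draft_model(ctx: List[Token]) -> Token:
    # Toy draft: agrees with the target except when len(ctx) is a multiple of 4.
    return ctx[-1] + (2 if len(ctx) % 4 == 0 else 1)

ctx = [0]
rounds = 0
while len(ctx) < 13:
    ctx += speculative_step(ctx, draft_model, target_model, k=4)
    rounds += 1
print(ctx, rounds)
```

The output sequence is identical to what the target alone would produce, which is what "lossless" means for greedy speculative decoding. In this toy run 12 tokens arrive in 3 verification rounds instead of 12.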

EAGLE-3's Bottleneck

EAGLE-3 uses an autoregressive draft model, so drafting K tokens takes K sequential draft passes. The draft step itself becomes a bottleneck.

DFlash's Solution

Replace the autoregressive draft with a block diffusion drafter that generates K tokens in one parallel forward pass.
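A back-of-the-envelope cost model shows why collapsing K draft passes into one matters. All numbers below are hypothetical, and the model assumes both drafters reach the same acceptance length at the same per-pass cost, which real systems will not match exactly.

```python
def round_speedup(accepted_per_round, draft_passes, draft_cost, target_cost=1.0):
    """Speedup over plain decode, which pays target_cost for every token."""
    round_cost = draft_passes * draft_cost + target_cost
    return accepted_per_round * target_cost / round_cost

k = 8
accepted = 5.0    # hypothetical average accepted tokens per round
draft_cost = 0.1  # hypothetical cost of one draft pass relative to one target pass

autoregressive = round_speedup(accepted, draft_passes=k, draft_cost=draft_cost)
block_parallel = round_speedup(accepted, draft_passes=1, draft_cost=draft_cost)
print(f"autoregressive draft: {autoregressive:.2f}x, block draft: {block_parallel:.2f}x")
```

With these made-up inputs, the sequential drafter spends 0.8 of a target pass on drafting each round, while the one-pass drafter spends 0.1, so the same acceptance length translates into a noticeably higher speedup.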

Technical specifics:

- Block diffusion drafter: K=4~8 tokens denoised in one pass
- Feature extraction: pulls context features from deep layers of the target model
- KV injection: target features injected into every layer of the draft (EAGLE-3 only injects into a single layer)

Verified Performance

Qwen3-8B (greedy): 6× lossless speedup
vs EAGLE-3: 2.5× faster
Qwen3.6-27B-DFlash on RTX 3090: 207 tok/s @ 256K context
Consistent across GSM8K, MATH-500, AIME24/25, HumanEval, chat
Speedup holds under temperature sampling and reasoning modes

Availability

Paper: arXiv:2602.06036 (Z Lab — Jian Chen, Yesheng Liang, Zhijian Liu, Feb 2026)
Code: github.com/z-lab/dflash
Models: Hugging Face z-lab/dflash collection (Qwen3-4B/8B, Qwen3-Coder-30B-A3B, LLaMA-3.1-8B variants)
Integrations: SGLang (production), Transformers (experimental), vLLM (in progress), llama.cpp (community discussion)

3. DDTree — Extending Diffusion Drafts Into a Tree

The Information DFlash Discards

DFlash produces a single token sequence quickly. But the draft model knows distributions over multiple possibilities at each position — DFlash throws that away.

DDTree's Idea

Use the draft model's per-position probability distributions to construct a tree, and have the target model verify the whole tree in one forward pass via tree attention.

The 4-Step Process

Draft model generates per-position distributions for the next block
Construct a tree within a fixed node budget (high-probability branches first)
Target model verifies the entire tree via tree attention in a single forward pass
Walk down matching branches; restart from the first mismatch
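Steps 2 and 3 can be sketched as a best-first expansion plus an ancestor mask. This is an illustrative reading of "high-probability branches first", not the paper's exact procedure: scoring a path by the product of its per-position probabilities is an assumption, and the function names and toy distributions are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Path = Tuple[int, ...]

def build_draft_tree(dists: List[Dict[int, float]], budget: int) -> List[Path]:
    """Best-first tree: repeatedly add the candidate path with the highest
    product of per-position probabilities until `budget` nodes are chosen."""
    heap = [(-p, (tok,)) for tok, p in dists[0].items()]
    heapq.heapify(heap)
    nodes: List[Path] = []
    while heap and len(nodes) < budget:
        neg_p, path = heapq.heappop(heap)
        nodes.append(path)
        depth = len(path)
        if depth < len(dists):
            # Children are pushed only after the parent is chosen,
            # so every selected node has its parent in the tree.
            for tok, p in dists[depth].items():
                heapq.heappush(heap, (neg_p * p, path + (tok,)))
    return nodes

def tree_attention_mask(nodes: List[Path]) -> List[List[bool]]:
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[a[:len(b)] == b for b in nodes] for a in nodes]

# Toy per-position distributions over token ids (made-up values).
dists = [{11: 0.7, 12: 0.3}, {21: 0.6, 22: 0.4}, {31: 0.9, 32: 0.1}]
nodes = build_draft_tree(dists, budget=5)
mask = tree_attention_mask(nodes)
print(nodes)
```

The mask is what lets the target score every branch in one forward pass: each node attends only to its own ancestors, so sibling branches do not see each other.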

Verified Performance

Benchmark	Speedup
MATH-500	7.5×
HumanEval (T=0.0)	8.2×
GSM8K	6.6×
AIME'24	7.3×
AIME'25	7.2×
MBPP	6.4×
LiveCode	6.8×
SWE Lite	4.3×
MT	4.2×
Alpaca	3.3×

→ Effects are largest on deterministic tasks like math and code (branches converge fast).

Tradeoff

Larger node budgets increase acceptance length, but verification cost dominates past a point — speedup plateaus. Per-task calibration required.

Availability

Paper: arXiv:2604.12989
Code: github.com/liranringel/ddtree
Authors: Liran Ringel, Yaniv Romano
vLLM integration: vllm-project/vllm#40809

4. TurboQuant — KV Cache Down to 3 Bits

Why KV Cache Quantization Matters

70B model + 200K context = 40~80 GB KV cache. Compressing model weights alone isn't enough — the KV cache is what eats memory bandwidth and capacity. The cache itself has to be compressed.

Limits of Existing Quantization

INT8/FP8: nearly lossless, but capped at 2× compression
INT4/FP4: 4× compression at 1~3% accuracy cost
GPTQ-style vector quantization: requires calibration data

TurboQuant's Approach

Apply a random rotation to the input vectors so that every coordinate follows a similar (Beta) distribution, then apply the same standard scalar quantizer to each coordinate.
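The effect of rotating before quantizing can be seen on a vector with one outlier channel. This sketch swaps in two simplifications and should not be read as the paper's implementation: a randomized Walsh-Hadamard transform (random sign flips followed by a Hadamard transform) stands in for a dense random rotation, and a plain uniform quantizer stands in for the MSE-optimal one.

```python
import math
import random

def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    scale = 1 / math.sqrt(len(x))
    return [v * scale for v in x]

def quantize(x, bits):
    """Uniform scalar quantizer over [-max|x|, +max|x|] (stand-in, not MSE-optimal)."""
    levels = 2 ** bits - 1
    peak = max(abs(v) for v in x) or 1.0
    step = 2 * peak / levels
    return [round((v + peak) / step) * step - peak for v in x]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

random.seed(0)
n = 256
x = [random.gauss(0, 0.1) for _ in range(n)]
x[7] = 10.0  # one outlier channel dominates the quantizer's range

signs = [random.choice((-1.0, 1.0)) for _ in range(n)]
rotated = hadamard([s * v for s, v in zip(signs, x)])
# The normalized Hadamard matrix is its own inverse; then undo the sign flips.
restored = [s * v for s, v in zip(signs, hadamard(quantize(rotated, 3)))]

err_plain = mse(x, quantize(x, 3))
err_rotated = mse(x, restored)
print(f"3-bit MSE without rotation: {err_plain:.4f}, with rotation: {err_rotated:.4f}")
```

Without rotation, the outlier stretches the quantization grid so far that the small coordinates all land on coarse levels. After rotation the outlier's energy is spread across all coordinates, so a single shared quantizer fits every coordinate far better.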

Two-Stage Algorithm

Rotation + scalar quantization: input → random rotation → MSE-optimal scalar quantizer per coordinate
QJL error correction: 1 extra bit is spent on a Quantized Johnson-Lindenstrauss (QJL) transform of the residual → corrects the residual error and removes bias
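The second stage relies on a sign-bit Johnson-Lindenstrauss estimator: keep only the signs of random Gaussian projections of a vector plus its norm, and still recover inner products without bias. The sketch below illustrates that estimator in isolation. The dimensions, the very large number of projections (chosen only so the demo estimate is tight) and the random stand-in vectors are assumptions, and how TurboQuant wires this into its residual is not shown.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def qjl_encode(r, proj):
    """Keep one sign bit per projection row, plus the norm of r."""
    bits = [1.0 if dot(row, r) >= 0 else -1.0 for row in proj]
    return bits, math.sqrt(dot(r, r))

def qjl_inner_product(q, bits, r_norm, proj):
    """Estimate <q, r> from the sign bits; unbiased for Gaussian projections
    because E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / |r|."""
    acc = sum(dot(row, q) * b for row, b in zip(proj, bits))
    return math.sqrt(math.pi / 2) * r_norm * acc / len(proj)

random.seed(1)
d, m = 32, 4000  # m is deliberately large for the demo
proj = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
r = [random.gauss(0, 1) for _ in range(d)]  # stands in for a quantization residual
q = [random.gauss(0, 1) for _ in range(d)]  # stands in for a query vector

bits, r_norm = qjl_encode(r, proj)
estimate = qjl_inner_product(q, bits, r_norm, proj)
exact = dot(q, r)
print(f"exact {exact:.3f}, estimated from sign bits {estimate:.3f}")
```

The query side stays in full precision; only the stored vector is reduced to sign bits. That asymmetry is what lets attention scores stay unbiased even though the cached side is heavily compressed.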

Verified Performance (Paper)

3.5 bits/channel: absolute quality neutrality (no degradation)
2.5 bits/channel: marginal quality drop only
3-bit KV cache: training-free, data-oblivious, lossless
4-bit on H100: 8× over 32-bit unquantized keys
Within 2.7× of the information-theoretic lower bound

TurboQuant+ — Apple Silicon Variant

Base TurboQuant + PolarQuant + Walsh-Hadamard rotation. Optimized for Apple Silicon.

3.8~6.4× compression
M5 Max 128K context: prefill on par with q8_0, decode at 0.9× of q8_0 speed (effectively equivalent)

No Training Required

This is the real differentiator.

"TurboQuant can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy."

Applied at inference time. No retraining.

Availability

Paper: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni — Google Research)
Conference: ICLR 2026
Code: github.com/0xSero/turboquant (Triton kernels + vLLM), github.com/tonbistudio/turboquant-pytorch (PyTorch from-scratch)
Announcement: research.google/blog/turboquant (2026-03-25)

5. The Three Are Orthogonal — Stack Them

They Attack Different Things

Technique	Target	Layer
DFlash / DDTree	Tokens per decode pass (1 → K)	Inference algorithm
TurboQuant	KV cache memory (FP16 → 3-bit)	Memory representation
MLX UMA	CPU↔GPU copy elimination	Hardware acceleration

They don't interfere. They can stack.

Theoretical Compounding (Approximation)

Baseline: Qwen3.6-27B on M4 Max 64GB, FP16, normal decode.

Layer	tok/s (estimate)
Baseline (llama.cpp)	~30
+ MLX	~50 (1.6×)
+ TurboQuant 3-bit	~70 (1.4× more, KV memory pressure relieved)
+ DFlash speculative	~150 (2.1× more)
+ DDTree (code/math only)	~250 (1.7× more)

→ Theoretical compound 8~10×. Real measurements depend on task, model, hardware. The table is theoretical, not measured.
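The compounding is plain multiplication of the per-layer factors. The sketch below recomputes them from the table's estimated tok/s values; because those values are rounded estimates, the per-step factors differ slightly from the labels in the table.

```python
stack = [
    ("Baseline (llama.cpp)", 30),
    ("+ MLX", 50),
    ("+ TurboQuant 3-bit", 70),
    ("+ DFlash speculative", 150),
    ("+ DDTree (code/math only)", 250),
]

compound = 1.0
for (_, prev), (name, cur) in zip(stack, stack[1:]):
    step = cur / prev       # gain contributed by this layer alone
    compound *= step        # gains multiply because the layers are independent
    print(f"{name}: {step:.2f}x step, {compound:.2f}x cumulative")
```

The cumulative factor ends at 250 / 30, about 8.3×, which is where the lower end of the 8~10× range comes from.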

By Use Case

Chat / summarization (short streaming): DFlash alone is enough
Coding / math (branchy outputs): DDTree
Long context (RAG, 100K+): TurboQuant required
Agents (repeated system prompts): oMLX SSD cache + TurboQuant

6. Tool Integration Status (2026-04-29)

Tool	DFlash	DDTree	TurboQuant
SGLang	✅ Production	In progress	Experimental
vLLM	In progress	In progress	✅ (0xSero fork)
Transformers	Experimental	—	—
llama.cpp	Community	—	Partial (PolarQuant)
MLX (Apple)	Not yet	Not yet	TurboQuant+ supported
oMLX	Not yet	Not yet	Integration considered
Ollama 0.19	Not yet	—	NVFP4 (different family)

What to Do Today

NVIDIA GPU users (RTX 3090/4090, H100): Try DFlash + TurboQuant immediately (vLLM, SGLang)
Apple Silicon (M3/M4/M5): Try TurboQuant+ via tonbistudio/turboquant-pytorch. Wait for MLX ports of DFlash.
Mac + simple use: Stay with oMLX defaults. Monitor integration progress.

7. Caveats

"8× Faster" Is an Illusion

The 8~10× compounding above is theoretical. In practice:

- DFlash speedup depends on model, hardware, and temperature; typical range 1.5~6×
- TurboQuant's complete losslessness is at 3.5 bits. 3 bits is near-lossless; some tasks may show minor degradation
- Tool integration is incomplete, so only staged adoption is possible today

Apple Silicon Reality Check

For oMLX/MLX users, the only stack-ready acceleration today is TurboQuant+. MLX ports of DFlash and DDTree are in progress with no firm dates.

Domain Sensitivity

3-bit quantization preserves accuracy theoretically, but medical/financial domains warrant additional verification. General chat and coding are essentially unaffected.

Bottom Line

The three techniques in one line each:

Technique	One-line	Speedup
DFlash	Block diffusion drafts K tokens at once	6×
DDTree	Diffusion distribution → tree + tree attention	7.5× (math)
TurboQuant	Random rotation + QJL → 3-bit KV cache	8× (H100)

The single takeaway: "Prefill is compute, decode is memory. The three techniques attack the memory side orthogonally."

What you can do today depends on hardware. NVIDIA GPU users can stack DFlash + TurboQuant via vLLM/SGLang now. Apple Silicon users will get the same once MLX ports land; the estimate here is 6~8 weeks, though no firm dates have been announced. Until then, run oMLX defaults and test the stack in a vLLM/SGLang environment.

First-Party Sources

DFlash paper: arxiv.org/abs/2602.06036, Z Lab
DDTree paper: arxiv.org/abs/2604.12989, GitHub liranringel/ddtree
TurboQuant paper: arxiv.org/abs/2504.19874, Google Research blog
Apple ML Research M5: machinelearning.apple.com/research/exploring-llms-mlx-m5
Prefill/Decode foundations: morphllm.com/llm-inference, morphllm.com/kv-cache-explained
