Local AI Infrastructure Notes (8/15) — Local LLM Acceleration and Quantization Deep Dive
Same model, same hardware — one setup squeezes 200 tok/s, another tops out at 30. The difference is the algorithm.
In early 2026, three techniques redrew the limits of LLM inference speed: DFlash (Z Lab, Feb 2026; block-diffusion speculative decoding), DDTree (Ringel & Romano, Apr 2026; tree-attention verification), and TurboQuant (Google Research, ICLR 2026; training-free KV cache quantization down to about 3 bits). They don't interfere with each other, so the speedups compound when applied together.
This article walks through all three from first-party sources (arXiv, GitHub, official announcements), starting from prefill/decode/KV cache fundamentals through real-world deployment status. Developer-targeted, but it answers questions like "Why is the M4 Pro nearly 2× faster at inference than the base M4?"
Key Summary
- Prefill = compute-bound, Decode = memory-bandwidth-bound. Different games, different optimizations.
- DFlash: block diffusion drafts K tokens at once → 6× speedup, 2.5× over EAGLE-3.
- DDTree: builds a draft tree from diffusion distributions → MATH 7.5×, HumanEval 8.2×.
- TurboQuant: random rotation + QJL → 3.5-bit lossless KV cache. 8× on H100.
- All three are orthogonal. Theoretical compound 8~10×.
1. Fundamentals — Prefill and Decode Are Different Games
LLM inference has two phases. Many articles treat them as one. They aren't.
Prefill — Processing the Input
The full prompt is processed in parallel. K and V values for every input token are computed at once, building the KV cache. Massive matrix multiplications max out GPU tensor cores (or Apple M5 Neural Accelerators) — compute-bound.
On an H100, prefill for a 10K-token prompt takes 200~400 ms. On an M5, time to first token is under 10 s for a 14B dense model and under 3 s for a 30B MoE.
The metric for this phase is TTFT (Time To First Token).
Decode — Generating the Output
Tokens are produced one at a time, autoregressively. Every step has to read the full KV cache and the full model weights from memory. Computation is small, but memory access becomes the bottleneck — memory-bandwidth-bound.
The metric here is TPOT (Time Per Output Token), or its inverse, tok/s. Typically 30~150 tok/s.
The Role of the KV Cache
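Without it, every decode step recomputes K and V for the entire prefix, so generating n tokens costs O(n²) projections in total. With the cache, each step computes K and V only for the new token and appends them, which brings that cost down to O(n); the new token still attends over all n cached entries. So the KV cache is mandatory for any serious inference.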
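The caching argument can be checked on a toy single-head attention layer. This is a minimal sketch, assuming random weights and a hypothetical head dimension of 16; it shows that caching changes only the amount of work, not the output.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 16                                   # hypothetical head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))     # stand-in for token embeddings

def decode_without_cache(xs):
    """Recompute K and V for the whole prefix at every step."""
    outs = []
    for t in range(1, len(xs) + 1):
        K = xs[:t] @ Wk                  # t projections, redone every step
        V = xs[:t] @ Wv
        outs.append(attend(xs[t - 1] @ Wq, K, V))
    return np.array(outs)

def decode_with_cache(xs):
    """Project only the new token and append it to the cache."""
    K_cache, V_cache, outs = [], [], []
    for x in xs:
        K_cache.append(x @ Wk)           # one projection per step
        V_cache.append(x @ Wv)
        outs.append(attend(x @ Wq, np.array(K_cache), np.array(V_cache)))
    return np.array(outs)

same = np.allclose(decode_without_cache(tokens), decode_with_cache(tokens))
print("outputs identical:", same)
```

Both paths return the same vectors; the cached path simply skips the repeated projections.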
The problem is size. A 70B model with 200K context = 40~80 GB KV cache — sometimes larger than the model weights themselves.
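The size estimate follows from simple arithmetic. The sketch below assumes a Llama-style 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and FP16 values; other 70B models may differ, so treat the numbers as illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value=16):
    """Bytes needed for K and V across all layers at a given context length."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim   # 2 = K and V
    return values_per_token * n_tokens * bits_per_value / 8

# Assumed model shape; not taken from the article.
fp16 = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=200_000)
print(f"FP16 KV cache at 200K context: {fp16 / 1e9:.1f} GB")
```

Under these assumptions the result is about 65.5 GB, inside the 40~80 GB range quoted above. Passing `bits_per_value=3.5` to the same function gives about 14.3 GB.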
Memory Bandwidth Decides — Apple Silicon Comparison
| Chip | Bandwidth |
|---|---|
| M3 base | 100 GB/s |
| M4 base | 120 GB/s |
| M5 base | 153 GB/s |
| M3 Pro | 150 GB/s |
| M4 Pro | 273 GB/s (2.3× over M4 base) |
| M3 Max | 400 GB/s |
| M4 Max | 546 GB/s |
| M5 Max | ~600 GB/s |
The M4 Pro is nearly 2× faster at inference than the base M4 because its memory bandwidth is 2.3× higher. Decode is memory-bound, so this is the expected outcome.
If you can read the same KV cache faster, you produce more tokens per second. That's the entire game.
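A rough ceiling for decode speed follows directly from this: if every generated token has to stream the weights and the KV cache from memory once, tok/s cannot exceed bandwidth divided by bytes read. The sketch below uses the bandwidth figures from the table; the 8 GB weight size is a hypothetical example, and real throughput sits below this bound.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb, kv_cache_gb=0.0):
    """Upper bound on tok/s when each token streams weights + KV cache once."""
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)

weights_gb = 8.0   # hypothetical quantized model size, not from the article
for chip, bw in [("M4 base", 120), ("M4 Pro", 273), ("M4 Max", 546)]:
    print(f"{chip}: <= {decode_ceiling_tok_s(bw, weights_gb):.0f} tok/s")
```

The ratio between two chips depends only on their bandwidths, which is why the bound for the M4 Pro is about 2.3× the bound for the M4 base regardless of model size.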
2. DFlash — Block-Diffusion Speculative Decoding
Speculative Decoding First
Standard decode produces one token per forward pass of the target model. Speculative decoding splits the work:
1. A draft model (smaller) guesses K tokens
2. A target model (larger) verifies all K in one pass
3. Accept the matching prefix; restart from the first mismatch
One forward pass of the big model produces K tokens — theoretical ceiling K×.
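The draft, verify, accept loop can be sketched for the greedy case. `draft_next` and `target_next` are hypothetical callables that map a token list to the next token, and the toy models are stand-ins, not real LLMs. As in common implementations, the target also contributes one token of its own per round, so a round emits between 1 and K+1 tokens.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the accepted tokens."""
    # 1. Draft model guesses k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Target scores every draft position. A real engine does this in ONE
    #    batched forward pass; the loop here only emulates that result.
    target = [target_next(list(prefix) + draft[:i]) for i in range(k + 1)]

    # 3. Keep the matching prefix, then take the target's own token at the
    #    first mismatch (or its extra token if everything matched).
    accepted = []
    for i in range(k):
        if draft[i] != target[i]:
            break
        accepted.append(draft[i])
    accepted.append(target[len(accepted)])
    return accepted

# Toy stand-ins: the target counts upward mod 10; the draft errs after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

seq = [1]
while len(seq) < 13:
    seq += speculative_step(seq, draft_next, target_next, k=4)
print(seq)
```

The output is identical to what the target would generate on its own; only the number of target rounds changes.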
EAGLE-3's Bottleneck
EAGLE-3 uses an autoregressive draft model, so drafting K tokens takes K sequential draft passes. The draft step itself becomes a bottleneck.
DFlash's Solution
Replace the autoregressive draft with a block diffusion drafter that generates K tokens in one parallel forward pass.
Technical specifics:
- Block diffusion drafter: K=4~8 tokens denoised in one pass
- Feature extraction: pulls context features from deep layers of the target model
- KV injection: target features injected into every layer of the draft (EAGLE-3 only injects into a single layer)
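Why a one-pass drafter helps can be seen in a simple cost model. This is a back-of-the-envelope sketch, not a formula from the DFlash paper: all numbers are hypothetical, and it assumes both drafters reach the same accepted length, which need not hold in practice.

```python
def speculative_speedup(accept_len, draft_cost, verify_cost=1.0):
    """Tokens per unit time relative to plain decode (1 token per target pass).

    accept_len  : average tokens emitted per draft+verify round
    draft_cost  : total drafting time per round, in units of one target pass
    verify_cost : target verification time per round, same units
    """
    return accept_len / (draft_cost + verify_cost)

K = 8
per_pass = 0.05      # hypothetical cost of one draft pass vs one target pass
accept_len = 5.0     # hypothetical average accepted length

autoregressive = speculative_speedup(accept_len, draft_cost=K * per_pass)  # K passes
block_parallel = speculative_speedup(accept_len, draft_cost=1 * per_pass)  # 1 pass
print(f"autoregressive draft: {autoregressive:.2f}x, block draft: {block_parallel:.2f}x")
```

With drafting cost held to a single pass, raising K no longer adds sequential draft latency, so the round time stays close to one target pass.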
Verified Performance
- Qwen3-8B (greedy): 6× lossless speedup
- vs EAGLE-3: 2.5× faster
- Qwen3.6-27B-DFlash on RTX 3090: 207 tok/s @ 256K context
- Consistent across GSM8K, MATH-500, AIME24/25, HumanEval, chat
- Speedup holds under temperature sampling and reasoning modes
Availability
- Paper: arXiv:2602.06036 (Z Lab — Jian Chen, Yesheng Liang, Zhijian Liu, Feb 2026)
- Code: github.com/z-lab/dflash
- Models: Hugging Face z-lab/dflash collection (Qwen3-4B/8B, Qwen3-Coder-30B-A3B, LLaMA-3.1-8B variants)
- Integrations: SGLang (production), Transformers (experimental), vLLM (in progress), llama.cpp (community discussion)
3. DDTree — Extending Diffusion Drafts Into a Tree
The Information DFlash Discards
DFlash produces a single token sequence quickly. But the draft model knows distributions over multiple possibilities at each position — DFlash throws that away.
DDTree's Idea
Use the draft model's per-position probability distributions to construct a tree, and have the target model verify the whole tree in one forward pass via tree attention.
The 4-Step Process
- Draft model generates per-position distributions for the next block
- Construct a tree within a fixed node budget (high-probability branches first)
- Target model verifies the entire tree via tree attention in a single forward pass
- Walk down matching branches; restart from the first mismatch
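Steps 2 and 3 can be illustrated with a small sketch. It is one plausible reading of "high-probability branches first" (best-first expansion by path probability) and is not taken from the DDTree code; the function names and the toy distributions are hypothetical.

```python
import heapq
import numpy as np

def build_draft_tree(dists, budget):
    """Best-first tree from per-position draft distributions.

    dists  : list of 1-D arrays, dists[d][tok] = draft probability at depth d
    budget : maximum number of nodes (the root/prefix is not counted)
    Returns parallel lists (parent, token, depth); parent -1 means the root.
    """
    parent, token, depth = [], [], []
    heap, counter = [], 0   # entries: (-path_prob, tiebreak, parent, depth, token)
    for tok, p in enumerate(dists[0]):
        heap.append((-p, counter, -1, 0, tok))
        counter += 1
    heapq.heapify(heap)
    while heap and len(token) < budget:
        neg_p, _, par, d, tok = heapq.heappop(heap)
        node = len(token)
        parent.append(par); token.append(tok); depth.append(d)
        if d + 1 < len(dists):
            for nxt, p in enumerate(dists[d + 1]):
                heapq.heappush(heap, (neg_p * p, counter, node, d + 1, nxt))
                counter += 1
    return parent, token, depth

def tree_attention_mask(parent):
    """mask[i, j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parent)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parent[j]
    return mask

dists = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
parent, token, depth = build_draft_tree(dists, budget=4)
mask = tree_attention_mask(parent)
```

With this mask each node sees only its own branch, which is what lets the target score all branches in a single forward pass. A production version would expand only the top few tokens per position rather than the full vocabulary.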
Verified Performance
| Benchmark | Speedup |
|---|---|
| MATH-500 | 7.5× |
| HumanEval (T=0.0) | 8.2× |
| GSM8K | 6.6× |
| AIME'24 | 7.3× |
| AIME'25 | 7.2× |
| MBPP | 6.4× |
| LiveCode | 6.8× |
| SWE Lite | 4.3× |
| MT | 4.2× |
| Alpaca | 3.3× |
→ Effects are largest on deterministic tasks like math and code (branches converge fast).
Tradeoff
Larger node budgets increase acceptance length, but verification cost dominates past a point — speedup plateaus. Per-task calibration required.
Availability
- Paper: arXiv:2604.12989
- Code: github.com/liranringel/ddtree
- Authors: Liran Ringel, Yaniv Romano
- vLLM integration: vllm-project/vllm#40809
4. TurboQuant — KV Cache Down to 3 Bits
Why KV Cache Quantization Matters
70B model + 200K context = 40~80 GB KV cache. Compressing model weights alone isn't enough — the KV cache is what eats memory bandwidth and capacity. The cache itself has to be compressed.
Limits of Existing Quantization
- INT8/FP8: nearly lossless, but capped at 2× compression
- INT4/FP4: 4× compression at 1~3% accuracy cost
- GPTQ-style vector quantization: requires calibration data
TurboQuant's Approach
Apply a random rotation to input vectors so that every coordinate follows nearly the same (beta) distribution, then apply one standard scalar quantizer to all coordinates.
Two-Stage Algorithm
- Rotation + scalar quantization: input → random rotation → MSE-optimal scalar quantizer per coordinate
- QJL error correction: 1 extra bit applies the Quantized Johnson-Lindenstrauss transform → corrects residual + removes bias
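The two stages can be sketched in NumPy. This is an illustration under stated assumptions, not the paper's implementation: a uniform quantizer stands in for the MSE-optimal scalar quantizer, the clipping bound of 4/sqrt(d) is a hypothetical choice, and `encode` returns the dequantized key for brevity where a real kernel would store bit codes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))       # sign fix for a uniform distribution

def uniform_quantize(x, bits, lo, hi):
    """Uniform scalar quantizer on [lo, hi]; a swap-in for an MSE-optimal one."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    codes = np.clip(np.round((x - lo) / step), 0, levels)
    return codes * step + lo

def encode(k, R, bits, S):
    """Stage 1: rotate + quantize. Stage 2: 1-bit sign sketch of the residual."""
    norm = np.linalg.norm(k)
    y = R @ (k / norm)                   # rotated unit vector, similar coordinates
    bound = 4.0 / np.sqrt(len(k))        # assumed clipping range
    k_hat = norm * (R.T @ uniform_quantize(y, bits, -bound, bound))
    residual = k - k_hat
    return k_hat, np.sign(S @ residual), np.linalg.norm(residual)

def inner_product(q, k_hat, signs, res_norm, S):
    """Estimate <q, k>: quantized part plus the sign-sketch residual correction."""
    m = S.shape[0]
    correction = np.sqrt(np.pi / 2) / m * res_norm * ((S @ q) @ signs)
    return q @ k_hat + correction

R = random_rotation(d, rng)
S = rng.standard_normal((d, d))          # m = d rows, i.e. 1 extra bit per channel
k = rng.standard_normal(d)
k[3] = 25.0                              # key with one outlier channel
q = rng.standard_normal(d)
k_hat, signs, res_norm = encode(k, R, bits=3, S=S)
estimate = inner_product(q, k_hat, signs, res_norm, S)
```

The correction term is an unbiased estimate of the inner product between the query and the quantization residual for Gaussian `S`, which is how the extra bit removes bias from the attention logits. A single estimate with m = d rows is still noisy.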
Verified Performance (Paper)
- 3.5 bits/channel: absolute quality neutrality (no degradation)
- 2.5 bits/channel: marginal quality drop only
- 3-bit KV cache: training-free, data-oblivious, lossless
- 4-bit on H100: up to 8× faster than 32-bit unquantized keys
- Distortion within about 2.7× of the information-theoretic lower bound
TurboQuant+ — Apple Silicon Variant
Base TurboQuant + PolarQuant + Walsh-Hadamard rotation. Optimized for Apple Silicon.
- 3.8~6.4× compression
- M5 Max 128K context: prefill on par with q8_0, decode at 0.9× (effectively equivalent)
No Training Required
This is the real differentiator.
"TurboQuant can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy."
Applied at inference time. No retraining.
Availability
- Paper: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni — Google Research)
- Conference: ICLR 2026
- Code: github.com/0xSero/turboquant (Triton kernels + vLLM), github.com/tonbistudio/turboquant-pytorch (PyTorch from-scratch)
- Announcement: research.google/blog/turboquant (2026-03-25)
5. The Three Are Orthogonal — Stack Them
They Attack Different Things
| Technique | Target | Layer |
|---|---|---|
| DFlash / DDTree | Tokens per decode pass (1 → K) | Inference algorithm |
| TurboQuant | KV cache memory (FP16 → 3-bit) | Memory representation |
| MLX UMA | CPU↔GPU copy elimination | Hardware acceleration |
They don't interfere. They can stack.
Theoretical Compounding (Approximation)
Baseline: Qwen3.6-27B on M4 Max 64GB, FP16, normal decode.
| Layer | tok/s (estimate) |
|---|---|
| Baseline (llama.cpp) | ~30 |
| + MLX | ~50 (1.6×) |
| + TurboQuant 3-bit | ~70 (1.4× more, KV memory pressure relieved) |
| + DFlash speculative | ~150 (2.1× more) |
| + DDTree (code/math only) | ~250 (1.7× more) |
→ Theoretical compound 8~10×. Real measurements depend on task, model, hardware. The table is theoretical, not measured.
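The compounding is plain multiplication of the per-layer factors from the table. The snippet below only reproduces that arithmetic; it is not a measurement.

```python
from math import prod

# Per-layer factors copied from the table above (estimates, not measurements).
layers = [("MLX", 1.6), ("TurboQuant 3-bit", 1.4), ("DFlash", 2.1), ("DDTree", 1.7)]
baseline = 30.0   # tok/s, the table's baseline estimate

compound = prod(factor for _, factor in layers)
print(f"compound: {compound:.1f}x -> ~{baseline * compound:.0f} tok/s")
```

The rounded factors multiply to roughly 8×, the low end of the 8~10× range. The table's own tok/s column gives 250 / 30 ≈ 8.3×, the small gap coming from rounding of the individual factors.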
By Use Case
- Chat / summarization (short streaming): DFlash alone is enough
- Coding / math (branchy outputs): DDTree
- Long context (RAG, 100K+): TurboQuant required
- Agents (repeated system prompts): oMLX SSD cache + TurboQuant
6. Tool Integration Status (2026-04-29)
| Tool | DFlash | DDTree | TurboQuant |
|---|---|---|---|
| SGLang | ✅ Production | In progress | Experimental |
| vLLM | In progress | In progress | ✅ (0xSero fork) |
| Transformers | Experimental | — | — |
| llama.cpp | Community | — | Partial (PolarQuant) |
| MLX (Apple) | Not yet | Not yet | TurboQuant+ supported |
| oMLX | Not yet | Not yet | Integration considered |
| Ollama 0.19 | Not yet | — | NVFP4 (different family) |
What to Do Today
- NVIDIA GPU users (RTX 3090/4090, H100): Try DFlash + TurboQuant immediately (vLLM, SGLang)
- Apple Silicon (M3/M4/M5): Try TurboQuant+ via tonbistudio/turboquant-pytorch. Wait for MLX ports of DFlash.
- Mac + simple use: Stay with oMLX defaults. Monitor integration progress.
7. Caveats
"8× Faster" Is an Illusion
The 8~10× compounding above is theoretical. In practice:
- DFlash speedup depends on model, hardware, temperature; typical range 1.5~6×
- TurboQuant's complete losslessness is at 3.5 bits. 3 bits is near-lossless; some tasks may show minor degradation
- Tool integration is incomplete, so only staged adoption is possible today
Apple Silicon Reality Check
For oMLX/MLX users, the only stack-ready acceleration today is TurboQuant+. MLX ports of DFlash and DDTree are in progress with no firm dates.
Domain Sensitivity
3-bit quantization preserves accuracy theoretically, but medical/financial domains warrant additional verification. General chat and coding are essentially unaffected.
Bottom Line
The three techniques in one line each:
| Technique | One-line | Speedup |
|---|---|---|
| DFlash | Block diffusion drafts K tokens at once | 6× |
| DDTree | Diffusion distribution → tree + tree attention | 7.5× (math) |
| TurboQuant | Random rotation + QJL → 3-bit KV cache | 8× (H100) |
The single takeaway: "Prefill is compute, decode is memory. The three techniques attack the memory side orthogonally."
What you can do today depends on hardware. NVIDIA GPU users can stack DFlash + TurboQuant via vLLM/SGLang now. Apple Silicon users will get the same once MLX ports land; 6~8 weeks is an estimate, and no dates are firm. Until then, run oMLX defaults and test the stack in a vLLM/SGLang environment.
First-Party Sources
- DFlash paper: arxiv.org/abs/2602.06036, Z Lab
- DDTree paper: arxiv.org/abs/2604.12989, GitHub liranringel/ddtree
- TurboQuant paper: arxiv.org/abs/2504.19874, Google Research blog
- Apple ML Research M5: machinelearning.apple.com/research/exploring-llms-mlx-m5
- Prefill/Decode foundations: morphllm.com/llm-inference, morphllm.com/kv-cache-explained
Series overview: Series index