Local AI Infrastructure Notes (15/15) — Migrating the Blog Image Pipeline to ComfyUI
Why FLUX Local Execution Failed and How Tailscale-Based GPU Sharing Works
Key Summary
- What this post covers: the three axes to evaluate when moving a blog image generation pipeline from API to a local GPU (memory · disk · quality), and the design pattern for incorporating another machine on the same LAN as "local."
- Measured failure points of running FLUX.1 Schnell directly on Mac (memory, disk, quality, speed) with empirical numbers.
- How to reuse a GPU managed by a separate agent as shared infrastructure via ComfyUI HTTP API + Tailscale virtual IP.
- Trade-offs (availability · queue latency · fallback policy) that come with cutting per-image cost from $0.08 (API) to $0 while adding no image-generation memory load to the Mac, whose resident LLM already occupies ~25 GB.
What You Can Take From This Post
- Checklist for running local image generation models (FLUX family) on Apple Silicon — memory contention with co-resident models, disk headroom relative to model weight size, quantization quality loss, inference throughput.
- Using ComfyUI as an HTTP backend — REST call endpoints, queue model, explicit parameter control (width/height).
- Pinning a stable address with Tailscale virtual IP — a fixed agent-to-agent call path independent of physical IP or NAT changes.
- Treating fallback as an operational policy rather than a code path — the hidden cost of maintaining two routes simultaneously.
Background: The API Cost Curve
Measured cost for Nano Banana 2 (Gemini 3.1 Flash Image): $0.08/image. At an average of 30 images per blog post, that is ~$2.40 per post; at 20 posts per month, ~$48/month (≈ ₩67,000). The absolute amount is small, but recurring call costs accumulate linearly until a stop decision is made. If equivalent quality can be produced locally, the justification for keeping the API disappears.
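The arithmetic above can be reproduced with a short sketch. The price and volumes are the figures quoted in this post; the function name `api_cost` is illustrative, and the yearly figure is simply the monthly figure multiplied by 12.

```python
from decimal import Decimal


def api_cost(price_per_image: str, images_per_post: int, posts: int) -> Decimal:
    """Return the API cost in USD for `posts` posts.

    Decimal avoids binary floating-point noise in currency arithmetic.
    """
    return Decimal(price_per_image) * images_per_post * posts


per_post = api_cost("0.08", 30, 1)    # one post
per_month = api_cost("0.08", 30, 20)  # 20 posts per month
per_year = per_month * 12             # linear accumulation over a year

print(f"per post: ${per_post}, per month: ${per_month}, per year: ${per_year}")
```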
The core question is: which machine counts as "local"?
FLUX.1 Schnell on Mac — Decomposing the Failure
Test configuration: Mac mini M4 32 GB, Gemma-4 26B resident (~25 GB occupied), FLUX.1 Schnell loaded via mflux Q4 quantization.
| Item | Measured | Result |
|---|---|---|
| Memory | Gemma 25 GB + FLUX 7 GB → OOM | Crash |
| Disk | FLUX model 33 GB vs external SSD free space 22 GB | Download failure |
| Quality | Q4 quantization output visibly degraded vs Nano Banana 2 | Below threshold |
| Speed | ~130 s/image | Slower than API |
Mechanistic Interpretation
- Memory: FLUX.1 Schnell requires ~7 GB of resident memory even at Q4 quantization. Sharing unified memory with a 26B-class LLM makes OOM a predictable outcome.
- Disk: Downloading and caching the original pre-quantization weights requires 30 GB+. If free space on the target drive is smaller than the model size, the process fails at initialization.
- Quality: Under identical prompts, Q4 lightweight models show measurable gaps in detail and color stability compared to commercial cloud models. Thumbnails depend on first impressions, so the quality floor is relatively high.
- Speed: At 130 s/image, the average of 30 images per post takes about 65 minutes of generation time. The model technically runs but cannot be inserted into a batch pipeline.
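The memory and disk failures above are both predictable before anything is loaded. A minimal preflight sketch, using the figures measured in this post (32 GB unified memory, ~25 GB held by the resident LLM, ~7 GB for FLUX at Q4, 33 GB of weights against 22 GB free). The function names and the 4 GB OS reserve are hypothetical choices, not measured values.

```python
import shutil


def fits_in_memory(total_gb: float, resident_gb: float, model_gb: float,
                   os_reserve_gb: float = 4.0) -> bool:
    """True if the model fits beside already-resident workloads.

    os_reserve_gb is an assumed headroom for the OS and other processes.
    """
    return resident_gb + model_gb + os_reserve_gb <= total_gb


def fits_on_disk(free_gb: float, download_gb: float) -> bool:
    """True if the target drive can hold the full weight download."""
    return download_gb <= free_gb


def free_disk_gb(path: str) -> float:
    """Free space on the drive containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


# Figures from the failed run described above.
memory_ok = fits_in_memory(total_gb=32, resident_gb=25, model_gb=7)
disk_ok = fits_on_disk(free_gb=22, download_gb=33)
print(f"memory ok: {memory_ok}, disk ok: {disk_ok}")
```

In practice `free_disk_gb` would be pointed at the cache directory of the target drive, so the check runs against the same volume the download will use.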
One-off Pilot Feasible; Continuous Operation Is Not
Suspending other agents until ~9.5 GB of memory is free and redirecting the cache path to an external SSD lets a single pilot run complete. This path is incompatible with continuous operation, however, because it requires interrupting other work on the main machine every time. Mac standalone execution is therefore excluded from the design.
Alternative Architecture: Localizing Another Machine's GPU
A separate Windows PC with an RTX 3080 is already running ComfyUI (SDXL) for the video production pipeline. Reusing that infrastructure for blog image generation requires no new resource investment. Three factors make the reuse viable:
- Protocol: ComfyUI exposes an HTTP API. Any caller can access it via REST regardless of language or OS.
- Address stability: Any machine running Tailscale gets a `100.x.x.x` virtual IP. The address persists through router reboots, LAN reconfiguration, and NAT traversal.
- Existing call pattern: `_try_comfyui` is already implemented in the video pipeline's `image_generator.py`. The request payload and response parsing are validated and can be ported directly into the blog publish skill.
No new infrastructure to build — only one additional caller in an existing structure.
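Because the caller depends on the Tailscale address rather than a LAN address, it can be worth validating the configured host before use. A minimal sketch: Tailscale assigns IPv4 addresses from the 100.64.0.0/10 carrier-grade NAT range, and 8188 is the port used in this post. The helper names and the sample address 100.101.102.103 are hypothetical.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(host: str) -> bool:
    """True if `host` is an IPv4 literal inside the Tailscale range."""
    try:
        return ipaddress.ip_address(host) in TAILSCALE_V4
    except ValueError:  # not an IP literal (for example, a hostname)
        return False


def comfyui_base_url(host: str, port: int = 8188) -> str:
    """Build the base URL, refusing addresses outside the tailnet range."""
    if not is_tailscale_ip(host):
        raise ValueError(f"{host!r} is not a Tailscale IPv4 address")
    return f"http://{host}:{port}"


print(comfyui_base_url("100.101.102.103"))
```

Rejecting a LAN address such as 192.168.x.x at configuration time catches the case where a physical IP was pasted in by mistake and would silently break after the next router change.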
Post-Migration Pipeline
Post writing complete
→ SDXL image request to ComfyUI(<TAILSCALE_IP>:8188) with explicit width/height/prompt
→ Insert returned image into post body
→ Publish via Blogger API
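The request step can be sketched as follows. ComfyUI accepts a workflow graph as JSON on `POST /prompt`, and the node types used here (`CheckpointLoaderSimple`, `EmptyLatentImage`, `CLIPTextEncode`, `KSampler`, `VAEDecode`, `SaveImage`) are ComfyUI built-ins. This is not the `_try_comfyui` implementation from the video pipeline: the checkpoint filename, sampler settings, resolution, and function names are assumptions for illustration.

```python
import json
import urllib.request


def build_sdxl_workflow(prompt: str, width: int, height: int, seed: int = 0,
                        checkpoint: str = "sd_xl_base_1.0.safetensors") -> dict:
    """Return a minimal text-to-image graph in ComfyUI's API format.

    Each key is a node id; ["4", 1] means "output 1 of node 4".
    The checkpoint name is assumed and must exist on the GPU host.
    """
    return {
        "4": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": checkpoint}},
        "5": {"class_type": "EmptyLatentImage",  # explicit width/height control
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "6": {"class_type": "CLIPTextEncode",    # positive prompt
              "inputs": {"text": prompt, "clip": ["4", 1]}},
        "7": {"class_type": "CLIPTextEncode",    # empty negative prompt
              "inputs": {"text": "", "clip": ["4", 1]}},
        "3": {"class_type": "KSampler",
              "inputs": {"model": ["4", 0], "positive": ["6", 0],
                         "negative": ["7", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": 25, "cfg": 7.0,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "8": {"class_type": "VAEDecode",
              "inputs": {"samples": ["3", 0], "vae": ["4", 2]}},
        "9": {"class_type": "SaveImage",
              "inputs": {"images": ["8", 0], "filename_prefix": "blog"}},
    }


def submit(base_url: str, workflow: dict, timeout: float = 10.0) -> str:
    """POST the graph to ComfyUI's queue and return the prompt_id."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["prompt_id"]


workflow = build_sdxl_workflow("flat illustration of a GPU server", 1216, 832)
print(workflow["5"]["inputs"])
```

After `submit` returns, the finished image is typically located by polling `GET /history/{prompt_id}` and downloaded through `GET /view`. Building the graph in a pure function keeps the width/height contract testable without a running GPU host.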
Comparison:
| Item | Nano Banana 2 | ComfyUI Remote |
|---|---|---|
| Cost per image | $0.08 | $0 |
| Mac mini memory footprint | 0 GB | 0 GB |
| Average generation speed | ~5 s/image | ~8 s/image (observed) |
| Aspect ratio control | Prompt only | Explicit width/height |
| Behavior when GPU host is offline | Publishing continues | Publishing held |
Limitations and Forward Direction
Availability: GPU Host Offline
Design decision: "hold publish." No automatic fallback to the API. Two reasons:
- Maintaining two routes doubles maintenance scope. On failure, the first step becomes working out which of the two paths produced the fault.
- A fallback implicitly undermines the "$0 cost" target. If exception calls accumulate, measured cost is no longer zero.
→ Publish delay is acceptable. Dual-path routing is not.
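The hold policy can stay a small gate in front of publishing rather than a second generation path. A minimal sketch, assuming ComfyUI's `GET /system_stats` endpoint as the reachability probe; `probe_comfyui`, `publish_decision`, and the returned strings are hypothetical names.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe_comfyui(base_url: str, timeout: float = 3.0) -> bool:
    """True if the ComfyUI host answers GET /system_stats in time."""
    try:
        with urllib.request.urlopen(f"{base_url}/system_stats",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):  # refused, unreachable, timed out
        return False


def publish_decision(base_url: str,
                     probe: Callable[[str], bool] = probe_comfyui) -> str:
    """Return "generate" when the GPU host is up, otherwise "hold".

    There is deliberately no third branch that calls a paid API.
    """
    return "generate" if probe(base_url) else "hold"


# Offline host simulated with a stub probe; no network access is needed.
print(publish_decision("http://gpu-host.invalid:8188", probe=lambda _url: False))
```

The probe is injected as a parameter so the policy can be exercised without the GPU host, and so the only two outcomes remain visible in one place.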
Quality Validation: SDXL vs Nano Banana 2
SDXL produces stable output in both photorealistic and illustrative styles — validated by existing video pipeline production. A direct comparison pilot for blog thumbnail use will be run separately. If output falls below threshold, prompt templates, model selection, or post-processing will be adjusted.
Concurrency: Two Agents Sharing One GPU
ComfyUI is queue-based; concurrent requests are processed sequentially. If video pipeline jobs are already queued, blog image requests wait behind them. Contention becomes latency, not a collision.
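Queue contention can be observed from the caller side. ComfyUI's `GET /queue` returns the running and pending jobs as `queue_running` and `queue_pending` lists, and the sketch below turns such a payload into a rough wait estimate. The ~8 s per job default is the observed per-image figure from this post and will not hold for video pipeline jobs of a different size; `estimate_wait_seconds` is a hypothetical helper, and the sample entries are abbreviated.

```python
def estimate_wait_seconds(queue: dict, seconds_per_job: float = 8.0) -> float:
    """Rough wait before a newly submitted job starts.

    `queue` is the JSON body of ComfyUI's GET /queue. Every job ahead,
    running or pending, is assumed to take `seconds_per_job`.
    """
    jobs_ahead = (len(queue.get("queue_running", []))
                  + len(queue.get("queue_pending", [])))
    return jobs_ahead * seconds_per_job


# Sample payload shaped like a /queue response: one running, two pending.
sample = {"queue_running": [["job-a"]], "queue_pending": [["job-b"], ["job-c"]]}
print(estimate_wait_seconds(sample))
```

An estimate like this is enough to log how long blog requests actually sit behind video jobs before deciding whether any queue policy is needed.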
Extracted Principles / Applicable Patterns
- Redefining "local": Local-first design does not mean single-machine execution. A GPU run by another agent within the LAN + Tailscale boundary can be incorporated as "local."
- Fallback as policy, not function: Resolving it as a code branch doubles maintenance cost. An operational rule — "if unavailable, hold" — is more explicit and does not undermine cost targets.
- Linear accumulation of recurring costs: Even at ~$48/month, the amount accumulates linearly for as long as the decision is deferred. The lower the unit cost, the harder the stop decision becomes — this must be factored into design choices.
Open Questions
- Can SDXL-based thumbnails match or exceed Nano Banana 2 in perceived quality in a blog context — which layer (prompt template, model selection, post-processing) drives the gap?
- To what extent should fairness policies (priority, max wait, cap) be applied to a ComfyUI queue serving multiple callers? The current two-caller structure is sufficient, but if the number of callers grows, queue policy becomes the bottleneck.
Series overview: Series index