Local AI Infrastructure Notes (15/15) — Migrating the Blog Image Pipeline to ComfyUI

April 14, 2026

Why FLUX Local Execution Failed and How Tailscale-Based GPU Sharing Works

Key Summary

What this post covers: the three axes to evaluate when moving a blog image generation pipeline from API to a local GPU (memory · disk · quality), and the design pattern for incorporating another machine on the same LAN as "local."
The measured failure points of running FLUX.1 Schnell directly on the Mac (memory, disk, quality, speed).
How to reuse a GPU managed by a separate agent as shared infrastructure via ComfyUI HTTP API + Tailscale virtual IP.
Trade-offs (availability · queue latency · fallback policy) that come with moving from $0.08/image API cost → $0/image, and Mac memory footprint 25 GB → 0 GB.

What You Can Take From This Post

Checklist for running local image generation models (FLUX family) on Apple Silicon — memory contention with co-resident models, disk headroom relative to model weight size, quantization quality loss, inference throughput.
Using ComfyUI as an HTTP backend — REST call endpoints, queue model, explicit parameter control (width/height).
Pinning a stable address with Tailscale virtual IP — a fixed agent-to-agent call path independent of physical IP or NAT changes.
Treating fallback as an operational policy rather than a code path — the hidden cost of maintaining two routes simultaneously.

Background: The API Cost Curve

Measured cost for Nano Banana 2 (Gemini 3.1 Flash Image): $0.08/image. At an average of 30 images per blog post, that is ~$2.40 per post; at 20 posts per month, ~$48/month (≈ ₩67,000). The absolute amount is small, but recurring call costs accumulate linearly until a stop decision is made. If equivalent quality can be produced locally, the justification for keeping the API disappears.
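The arithmetic above can be written out as a small sketch. The figures come from this post; the function and constant names are illustrative only.

```python
# Figures taken from the post; names are illustrative, not part of any pipeline.
COST_PER_IMAGE_USD = 0.08
IMAGES_PER_POST = 30
POSTS_PER_MONTH = 20


def monthly_cost(cost_per_image: float, images_per_post: int, posts_per_month: int) -> float:
    """Recurring API spend for one month of publishing."""
    return cost_per_image * images_per_post * posts_per_month


def cumulative_cost(months: int) -> float:
    """Total spend after `months` months; it grows linearly with no ceiling."""
    return months * monthly_cost(COST_PER_IMAGE_USD, IMAGES_PER_POST, POSTS_PER_MONTH)


if __name__ == "__main__":
    for months in (1, 6, 12):
        print(f"{months:>2} months: ${cumulative_cost(months):.2f}")
```

Nothing in the formula tapers off over time, which is the sense in which the cost runs until someone decides to stop it.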

The core question is: which machine counts as "local"?

FLUX.1 Schnell on Mac — Decomposing the Failure

Test configuration: Mac mini M4 32 GB, Gemma-4 26B resident (~25 GB occupied), FLUX.1 Schnell loaded via mflux Q4 quantization.

Item	Measured	Result
Memory	Gemma 25 GB + FLUX 7 GB → OOM	Crash
Disk	FLUX model 33 GB vs external SSD free space 22 GB	Download failure
Quality	Q4 quantization output visibly degraded vs Nano Banana 2	Below threshold
Speed	~130 s/image	Slower than API

Mechanistic Interpretation

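Memory: FLUX.1 Schnell requires ~7 GB of resident memory even at Q4 quantization. Added to the ~25 GB already held by the 26B-class LLM, that consumes the machine's entire 32 GB of unified memory and leaves nothing for the OS, so OOM is a predictable outcome.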
Disk: Downloading and caching the original pre-quantization weights requires 30 GB+. If free space on the target drive is smaller than the model size, the process fails at initialization.
Quality: Under identical prompts, Q4 lightweight models show measurable gaps in detail and color stability compared to commercial cloud models. Thumbnails depend on first impressions, so the quality floor is relatively high.
Speed: At 130 s/image, a 30-image post would need about 65 minutes of generation time. That is in the "technically runs but cannot be inserted into a batch pipeline" range.
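The memory and disk checks above could be run as a preflight before any download starts. The following is a minimal sketch, not the procedure used in the test; the function names and the 4 GB OS reserve are assumptions made for illustration.

```python
import shutil

GB = 1024 ** 3


def fits_in_memory(total_gb: float, resident_gb: float, model_gb: float,
                   os_reserve_gb: float = 4.0) -> bool:
    """True if the model fits alongside what is already resident.

    os_reserve_gb is an assumed allowance for the OS and other processes,
    not a measured value.
    """
    return resident_gb + model_gb + os_reserve_gb <= total_gb


def has_disk_headroom(free_gb: float, download_gb: float) -> bool:
    """True if the target drive can hold the full pre-quantization download."""
    return free_gb >= download_gb


def free_disk_gb(path: str) -> float:
    """Free space on the drive that contains `path`, in GB."""
    return shutil.disk_usage(path).free / GB


if __name__ == "__main__":
    # Numbers from the test configuration in this post.
    print("memory ok:", fits_in_memory(total_gb=32, resident_gb=25, model_gb=7))
    print("disk ok:  ", has_disk_headroom(free_gb=22, download_gb=33))
```

With the numbers from the table, both checks fail before any weights are fetched, which is cheaper than discovering the same thing through a crash.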

One-off Pilot Feasible; Continuous Operation Is Not

Suspending other agents to free up ~9.5 GB of memory and redirecting the cache path to an external SSD is enough for a single pilot run to complete. However, this path is incompatible with continuous operation, because it requires interrupting other work on the main machine every time. Mac standalone execution is therefore excluded from the design.

Alternative Architecture: Localizing Another Machine's GPU

A separate Windows PC with an RTX 3080 is already running ComfyUI (SDXL) for the video production pipeline. Reusing that infrastructure for blog image generation requires zero new resource investment. Three factors make the reuse viable:

Protocol: ComfyUI exposes an HTTP API. Any caller can access it via REST regardless of language or OS.
Address stability: Any machine running Tailscale gets a 100.x.x.x virtual IP (allocated from the 100.64.0.0/10 range). The address stays the same through router reboots, LAN reconfiguration, and NAT changes.
Existing call pattern: _try_comfyui is already implemented in the video pipeline's image_generator.py. The request payload and response parsing are validated and can be ported directly into the blog publish skill.

No new infrastructure to build — only one additional caller in an existing structure.
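To make the "one additional caller" concrete, here is a minimal sketch of submitting an SDXL job to ComfyUI over HTTP with explicit width and height. It is not the `_try_comfyui` implementation, which this post does not show. The node graph follows ComfyUI's stock text-to-image layout; the checkpoint filename, sampler settings, and function names are assumptions.

```python
import json
import urllib.request

# Placeholder from the post; substitute the GPU host's Tailscale address.
COMFYUI_URL = "http://<TAILSCALE_IP>:8188"


def build_sdxl_workflow(prompt: str, width: int, height: int,
                        ckpt_name: str = "sd_xl_base_1.0.safetensors",  # assumed filename
                        seed: int = 0) -> dict:
    """Build an API-format graph: each key is a node id, links are [node_id, output_index]."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": ckpt_name}},
        "2": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "3": {"class_type": "CLIPTextEncode", "inputs": {"text": prompt, "clip": ["1", 1]}},
        "4": {"class_type": "CLIPTextEncode", "inputs": {"text": "", "clip": ["1", 1]}},
        "5": {"class_type": "KSampler",
              "inputs": {"seed": seed, "steps": 25, "cfg": 7.0,  # assumed settings
                         "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0,
                         "model": ["1", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["2", 0]}},
        "6": {"class_type": "VAEDecode", "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"filename_prefix": "blog", "images": ["6", 0]}},
    }


def queue_prompt(workflow: dict, base_url: str = COMFYUI_URL, timeout: float = 10.0) -> str:
    """POST the graph to /prompt and return the prompt_id ComfyUI assigns to the job."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(f"{base_url}/prompt", data=body,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["prompt_id"]
```

A call such as `queue_prompt(build_sdxl_workflow("mountain lake at dawn", 1024, 576))` would enqueue one job. Because the size is set on the latent node rather than described in the prompt, the aspect ratio is deterministic.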

Post-Migration Pipeline

Post writing complete
  → SDXL image request to ComfyUI(<TAILSCALE_IP>:8188) with explicit width/height/prompt
  → Insert returned image into post body
  → Publish via Blogger API
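The "returned image" step in the flow above is not a single response: ComfyUI accepts a job, and the caller later reads the result from its history and fetches the file. A minimal sketch of that step follows, assuming `prompt_id` is the id ComfyUI returned when the job was submitted to `/prompt`. The function names, polling interval, and timeout are illustrative.

```python
import json
import time
import urllib.parse
import urllib.request


def extract_images(history: dict, prompt_id: str) -> list:
    """Collect image descriptors from a /history response; empty if the job is not done."""
    entry = history.get(prompt_id)
    if not entry:
        return []
    images = []
    for node_output in entry.get("outputs", {}).values():
        images.extend(node_output.get("images", []))
    return images


def view_url(base_url: str, image: dict) -> str:
    """Build the /view URL that serves one generated file."""
    query = urllib.parse.urlencode({
        "filename": image["filename"],
        "subfolder": image.get("subfolder", ""),
        "type": image.get("type", "output"),
    })
    return f"{base_url}/view?{query}"


def wait_for_image(base_url: str, prompt_id: str,
                   poll_s: float = 1.0, timeout_s: float = 120.0) -> bytes:
    """Poll /history until the job has output, then download the first image."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/history/{prompt_id}", timeout=10) as resp:
            images = extract_images(json.load(resp), prompt_id)
        if images:
            with urllib.request.urlopen(view_url(base_url, images[0]), timeout=30) as resp:
                return resp.read()
        time.sleep(poll_s)
    raise TimeoutError(f"no image for prompt {prompt_id} within {timeout_s} s")
```

The bytes returned by `wait_for_image` are what would be inserted into the post body before publishing. ComfyUI also offers a WebSocket channel for progress events; polling is used here only because it keeps the sketch short.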

Comparison:

Item	Nano Banana 2	ComfyUI Remote
Cost per image	$0.08	$0
Mac mini memory footprint	0 GB	0 GB
Average generation speed	~5 s/image	~8 s/image (observed)
Aspect ratio control	Prompt only	Explicit width/height
GPU host offline	Continues	Publish held

Limitations and Forward Direction

Availability: GPU Host Offline

Design decision: "hold publish." No automatic fallback to the API. Two reasons:

Maintaining two routes doubles maintenance scope. When something fails, the first task becomes working out which path produced the failure.
A fallback implicitly undermines the "$0 cost" target. If exception calls accumulate, measured cost is no longer zero.

→ Publish delay is acceptable. Dual-path routing is not.
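Expressed in code, the policy is notable for what it leaves out. The sketch below probes the GPU host and returns either "publish" or "hold", with no API branch at all. The function names are hypothetical; `/system_stats` is used as the liveness probe on the assumption that any successful response from ComfyUI means the host is up.

```python
import urllib.error
import urllib.request
from typing import Callable


def comfyui_reachable(base_url: str, timeout: float = 3.0) -> bool:
    """Probe the GPU host; any network error is treated as offline."""
    try:
        with urllib.request.urlopen(f"{base_url}/system_stats", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def decide_publish(probe: Callable[[], bool]) -> str:
    """Return 'publish' or 'hold'. There is deliberately no fallback to the paid API."""
    return "publish" if probe() else "hold"
```

A caller might run `decide_publish(lambda: comfyui_reachable("http://<TAILSCALE_IP>:8188"))` and, on "hold", leave the draft unpublished until the next scheduled attempt. Taking the probe as a parameter keeps the policy testable without a network.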

Quality Validation: SDXL vs Nano Banana 2

SDXL produces stable output in both photorealistic and illustrative styles — validated by existing video pipeline production. A direct comparison pilot for blog thumbnail use will be run separately. If output falls below threshold, prompt templates, model selection, or post-processing will be adjusted.

Concurrency: Two Agents Sharing One GPU

ComfyUI is queue-based; concurrent requests are processed sequentially. If the video pipeline holds priority, blog image requests queue in the background. Contention becomes latency, not a collision.
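How much latency the contention adds can be estimated from the queue itself. The sketch below reads ComfyUI's `/queue` endpoint, which reports running and pending jobs, and multiplies the depth by a per-job time. The 8 s default is the blog-image figure observed in this post; video pipeline jobs may take longer, so treat the result as a rough lower bound. Function names are illustrative.

```python
import json
import urllib.request


def fetch_queue(base_url: str, timeout: float = 5.0) -> dict:
    """GET /queue; the response lists jobs under 'queue_running' and 'queue_pending'."""
    with urllib.request.urlopen(f"{base_url}/queue", timeout=timeout) as resp:
        return json.load(resp)


def queue_depth(queue_state: dict) -> int:
    """Number of jobs ahead of a newly submitted request."""
    running = queue_state.get("queue_running", [])
    pending = queue_state.get("queue_pending", [])
    return len(running) + len(pending)


def estimated_wait_s(queue_state: dict, seconds_per_job: float = 8.0) -> float:
    """Rough wait before a new job starts, assuming every queued job takes seconds_per_job."""
    return queue_depth(queue_state) * seconds_per_job
```

With one job running and two pending, the estimate is 24 seconds of added delay before a blog image request starts.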

Extracted Principles / Applicable Patterns

Redefining "local": Local-first design does not mean single-machine execution. A GPU run by another agent within the LAN + Tailscale boundary can be incorporated as "local."
Fallback as policy, not function: Resolving it as a code branch doubles maintenance cost. An operational rule — "if unavailable, hold" — is more explicit and does not undermine cost targets.
Linear accumulation of recurring costs: Even at ~$48/month, the amount accumulates linearly for as long as the decision is deferred. The lower the unit cost, the harder the stop decision becomes — this must be factored into design choices.

Open Questions

Can SDXL-based thumbnails match or exceed Nano Banana 2 in perceived quality in a blog context — which layer (prompt template, model selection, post-processing) drives the gap?
To what extent should fairness policies (priority, max wait, cap) be applied to a ComfyUI queue serving multiple callers? The current two-caller structure is sufficient, but if the number of callers grows, queue policy becomes the bottleneck.

Series overview: Series index


MaJu Tech Notes