Building Mneme · Part 4

Images at Scale - Picaso, Fooocus → ComfyUI, and Surviving the Chaos

~10 min read · by Dave Wheeler (aka Weller Davis)

I still remember the first image generation on my old RTX 2060: I queued it up, grabbed coffee, came back 8 minutes later to... a blurry mess with six fingers and random text saying "SPORTZ." The RTX 4080 upgrade was pure self-indulgence-until it became essential. What took 8 minutes now took 45 seconds. Suddenly, iteration became possible.

But speed alone didn't solve the fundamental problem: I needed control. Fooocus was great for quick exploration, but it kept injecting garbled text into images, had limited LoRA orchestration, and lacked the deterministic workflows I needed for production. ComfyUI solved those problems-at the cost of two days debugging Windows firewall rules, learning graph-based workflows, and wrangling VRAM budgets. This is the story of that migration, the creation of Picaso (the image prompt persona), and how vision LLMs made automated quality validation possible.

TL;DR - Migrated from Fooocus to ComfyUI for production image generation. Built Picaso, a specialized prompt persona using LoRA-trained abstraction + pondering technique for balanced training. Added vision LLM validation (llava:34b) from Comic Creator. Result: clean, reproducible images at scale with automated quality gates. The proof: e-book covers went from garbled text disasters to professional quality.

The Hardware Upgrade: RTX 2060 → 4080

A couple of years ago, I started experimenting with local image generation on an RTX 2060. It was painfully slow-8 minutes per image, and the results were mediocre at best. Generation felt like gambling: queue it up, walk away, come back to see if you won or lost.

When I committed to Mneme as a production platform, I upgraded to an RTX 4080 with 16GB VRAM. The difference was transformative: generations that took 8 minutes on the 2060 finished in about 45 seconds.

That upgrade didn't just make things faster-it made experimentation viable. With the RTX 2060, every generation felt high-stakes. With the 4080, I could try five variations, pick the best, and move on. Speed unlocked creativity.

Fooocus: Fast Starts, Growing Pains

I started with Fooocus, which is underrated for shipping fast. It's a wrapper around Stable Diffusion with sensible defaults, a clean UI, and minimal configuration. For early comics, tutorials, and e-book covers, it worked-mostly.

What Fooocus Did Well

• Sensible defaults: good-looking images with almost no configuration
• Clean UI: queue an idea, get a result, iterate
• Fast to ship: the early comics, tutorials, and e-book covers all came out of Fooocus

Where Fooocus Failed Me

• Text injection: garbled, misspelled words plastered across otherwise good compositions, no matter how aggressively I fought it in negative prompts
• Limited LoRA orchestration: no fine-grained control over which LoRAs load, in what order, at what weights
• No deterministic workflows: reproducing a good result on demand was essentially luck

The breaking point came when I was generating e-book covers. I'd spent 3 hours generating 20 covers for different books, and 15 of them were unusable. The problem? Fooocus insisted on adding text to images-misspelled, garbled, random text that ruined otherwise good compositions.

The moment I knew I needed ComfyUI wasn't technical-it was emotional. I'd carefully crafted prompts, tuned negative prompts to say "no text, no watermarks, no letters," and Fooocus still generated covers with "WSTMACAM CEE" and other nonsense plastered across them. I needed control, not just speed.

The Proof: Before and After

Here's visual evidence of the problem-and the solution. These are actual e-book covers generated by Mneme, showing the dramatic improvement from the Fooocus era to the Picaso + ComfyUI era:

Early Cover (Pre-Picaso, Fooocus Era)
Side Hustles e-book cover with garbled text
Notice the garbled text at top ("WSTMACAM CEE") and bottom (completely unreadable subtitle). This was typical of Fooocus's text injection problem-no matter how hard you fought it in the prompts.
Later Cover (Post-Picaso, ComfyUI)
Leadership Skills Training e-book cover, clean and professional
Clean typography, professional network visualization, zero garbled text. This is what Picaso + ComfyUI enabled: precise control over composition, style, and-crucially-the absence of unwanted text.

That visual difference represents hundreds of hours of work: building Picaso, migrating to ComfyUI, tuning workflows, and implementing validation. But it was worth it-every e-book cover since has been publication-ready on first or second generation.

ComfyUI: Power, Control, and Complexity

ComfyUI is a node-based interface for Stable Diffusion that gives you full control over every stage of generation: model loading, LoRA injection, sampler configuration, upscaling, post-processing. It's powerful-and intimidating.

What ComfyUI Gave Me

• Full control over every stage of generation: model loading, LoRA injection, sampler configuration, upscaling, post-processing
• Deterministic workflows: the same versioned graph plus the same seed reproduces the same image
• Dynamic LoRA loading with explicit weights and priority ordering
• Programmatic access: queue generations over HTTP and track node-by-node progress over WebSocket

The Price: Two Days of Ops Hell

Getting ComfyUI production-ready was not trivial:

Ops Reality Check
• Host: Windows PC, RTX 4080 16GB (repurposed from LLM inference)
• Issues: Firewall rules, local network routing, trust_env configuration
• Tools: curl/PowerShell to probe connectivity, structured retries, WebSocket metrics
• Lesson: Treat the image box like a service, not a laptop
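
To make "treat it like a service" concrete, here's a minimal sketch of the kind of probe-and-retry logic I mean, in Python with requests. The LAN address is made up, and the /system_stats endpoint assumes a stock ComfyUI install:

# Minimal connectivity probe for the image box. Illustrative values throughout:
# the LAN address is invented, and the port/endpoint assume a default ComfyUI setup.
import time
import requests

COMFYUI_URL = "http://192.168.1.50:8188"  # hypothetical address of the Windows image box

def probe_comfyui(url: str = COMFYUI_URL, attempts: int = 5, backoff: float = 2.0) -> bool:
    session = requests.Session()
    session.trust_env = False  # don't let system proxy settings hijack local routing
    for attempt in range(1, attempts + 1):
        try:
            resp = session.get(f"{url}/system_stats", timeout=5)
            resp.raise_for_status()
            print(f"attempt {attempt}: reachable, stats = {resp.json()}")
            return True
        except requests.RequestException as exc:
            print(f"attempt {attempt}: unreachable ({exc}); retrying in {backoff}s")
            time.sleep(backoff)
    return False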

Picaso: The Image Prompt Persona

Once ComfyUI was stable, I needed better prompts. Generic prompts produce generic images. I needed a dedicated persona that understood compositional structure, style controls, negative cues, and how to translate content goals into precise image generation instructions.

Why a Dedicated Persona?

Initially, I used one of the general-purpose personas (the E-book Writer or Tutorial Creator) to generate image prompts. The results were mediocre-vague descriptions, missing style anchors, no attention to composition. I realized: personas tuned to specific tasks perform better.

Generalizing training across multiple activities gave worse results. When I trained a persona on both e-book content and image prompts, it became okay at both but great at neither. Specialization wins-so I created Picaso.

The Pondering Technique: Balanced Training Feedback

One challenge with persona training is overfitting. If you only train on successful outputs, the persona learns to mimic those exact patterns and loses flexibility. If you only train on failures, it learns what not to do but doesn't develop a strong positive signal.

I developed a technique I call pondering: personas do "practice work" on sections of e-book projects and receive feedback before committing to final output. This generates more balanced training data: a mix of "close but needs adjustment" and "nailed it" examples. Any single round of feedback is subtle, but the gains compound across training cycles.

For Picaso specifically, pondering meant: generate image prompts for hypothetical e-book covers, evaluate them against composition rules (single focal subject, white background, no text artifacts), collect feedback, refine, repeat. Over multiple training cycles, Picaso learned what makes a good image prompt-not just what makes a grammatically correct sentence.
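
To make the cycle concrete, here's a minimal sketch of one pondering pass. The prompt-drafting callable and the toy keyword scorer are illustrative stand-ins, not Mneme's internals:

# One pondering cycle: Picaso drafts practice prompts, each draft is scored against
# simple composition rules, and the result is stored as balanced training feedback.
# score_against_rubric is a toy keyword check; the real evaluation is LLM feedback.

COMPOSITION_RULES = ["single focal subject", "clean background", "no text artifacts"]

def score_against_rubric(prompt: str, rules: list[str]) -> tuple[float, list[str]]:
    missing = [rule for rule in rules if rule not in prompt.lower()]
    return 1 - len(missing) / len(rules), missing

def pondering_cycle(draft_prompt, topics: list[str]) -> list[dict]:
    feedback = []
    for topic in topics:
        draft = draft_prompt(topic)  # practice work, not committed output
        score, missing = score_against_rubric(draft, COMPOSITION_RULES)
        feedback.append({
            "topic": topic,
            "prompt": draft,
            "score": score,
            "missing": missing,
            "label": "nailed it" if not missing else "close but needs adjustment",
        })
    return feedback  # mixed positive and corrective examples for the next training batch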

Personas as Abstraction: Specialization + Full Intelligence

Here's the key insight: personas are an abstraction layer. I don't use the LoRA-trained persona model directly for image generation. Instead, I use the LoRA-trained persona to generate tailored prompts for the larger, more capable LLMs (Gemma 27B, or even cloud models when justified).

This gives me the best of both worlds:

• Specialization: the LoRA-trained persona encodes composition rules, style anchors, and the structure of a good image prompt
• Full intelligence: the larger model (Gemma 27B, or a cloud model when justified) brings the general reasoning the small persona lacks

Picaso Workflow
User Request: "E-book cover for Leadership Skills Training"
  → Picaso (LoRA-trained): Generate structured image prompt
    → Output: "Professional business portrait, network visualization overlay,
               clean typography, 3:4 aspect, editorial style, no text artifacts"
  → ComfyUI: Execute prompt with selected LoRAs + workflow
  → Vision LLM (llava:34b): Validate output
  → Result: Clean cover image

Before Picaso, I was using other personas and getting vague prompts like "a nice cover about leadership." After Picaso, prompts became precise: compositional anchors, style controls, aspect ratios, negative cues. The difference in output quality was immediate.
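
The hand-off itself is thin. Here's a sketch of the chain, assuming the persona is served through Ollama's /api/generate endpoint; the "picaso" model tag is hypothetical, and the request shape mirrors the payload example in the next section:

# The persona is a prompt-writing layer: call the LoRA-trained Picaso model via
# Ollama, then drop its structured prompt into an image-generation request.
# The model tag "picaso" and the request shape are illustrative assumptions.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def picaso_prompt(goal: str) -> str:
    resp = requests.post(OLLAMA_URL, json={
        "model": "picaso",  # hypothetical tag for the LoRA-trained persona
        "prompt": f"Write a structured image-generation prompt for: {goal}",
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"].strip()

def build_cover_request(goal: str) -> dict:
    return {
        "workflow_id": "ebook_cover_v2",
        "inputs": {
            "prompt": picaso_prompt(goal),
            "negative": "text artifacts, watermark, busy background, extra limbs",
        },
    }

# build_cover_request("E-book cover for Leadership Skills Training")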

Dynamic LoRA Loading

Mneme selects LoRAs at runtime based on the creator module and scene intent (e.g., "clean instructional line-art" vs. "comic panel, chibi style"). LoRAs are treated as first-class parameters with weights and priority ordering.

// Example request payload to ComfyUI
{
  "workflow_id": "ebook_cover_v2",
  "inputs": {
    "prompt": "Professional network visualization, single business figure, modern editorial, 3:4, clean background",
    "negative": "text artifacts, watermark, busy background, photorealism, extra limbs",
    "seed": 12345678,
    "loras": [
      {"path": "lora/editorial_clean.safetensors", "weight": 0.8},
      {"path": "lora/network_viz.safetensors", "weight": 0.5}
    ],
    "cfg": 5.0,
    "steps": 30,
    "hires_upscale": 1.5
  }
}

I keep a LoRA registry in MongoDB with tags, last-updated timestamp, checksum, and known-good base model pairings. This makes rollbacks trivial-if a LoRA update causes regressions, revert to the previous version with one database update.
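
Here's a sketch of what a registry entry and a rollback look like with pymongo. The database and collection names and the example values are assumptions; the fields mirror the list above:

# LoRA registry entry plus a one-update rollback. Database/collection names and
# example values are assumptions; the fields (tags, checksum, base-model pairings)
# mirror the registry described above.
from datetime import datetime, timezone
from pymongo import MongoClient

registry = MongoClient("mongodb://localhost:27017")["mneme"]["lora_registry"]

registry.update_one(
    {"name": "editorial_clean"},
    {"$set": {
        "path": "lora/editorial_clean.safetensors",
        "tags": ["ebook_cover", "editorial"],
        "checksum": "sha256:...",             # recorded when the file is registered
        "base_models": ["sdxl_base_1.0"],     # known-good pairings
        "active_version": "v3",
        "updated_at": datetime.now(timezone.utc),
    }},
    upsert=True,
)

# A LoRA update caused regressions? Point the entry back at the previous version.
registry.update_one({"name": "editorial_clean"}, {"$set": {"active_version": "v2"}})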

Quality Gates: Vision LLM Validation

Generation without validation is gambling. Early on, I'd generate 20 images, manually review them, and find that 12 were unusable. That didn't scale. I needed automated quality gates.

Vision LLM: The Comic Creator Breakthrough

I first integrated vision validation when building Comic Creator. Comics need panel-to-panel consistency: same character, correct environment, proper framing. I couldn't manually review hundreds of panels, so I introduced llava:34b, a vision-capable LLM running locally via Ollama.

The vision LLM can:

• Check panel-to-panel consistency: same character, correct environment, proper framing
• Catch text artifacts, watermarks, and anatomical errors before a human ever reviews the image
• Score an image against a validation rubric and return pass/fail with a reason
• Suggest concrete adjustments ("crop tighter", "increase contrast", "remove background clutter")

The Panel Consistency Challenge

One hurdle with Comic Creator was achieving consistent seeding for panel-to-panel transitions. I wanted panels to show the same character in different poses/environments, but maintain visual continuity. The vision LLM was great at identifying when panels didn't match-but it was too strict at first.

If I used identical seeds, every panel was nearly identical-no variety. If I used random seeds, characters changed appearance panel-to-panel. I had to allow some variation while maintaining core characteristics (face structure, clothing, overall style).

The solution: define a consistency rubric with tolerance thresholds. Instead of "panels must be 95% identical," I specified: "character face structure: 80% match, clothing: 70% match, environment: allow full variation." It took time to tune these thresholds, but once calibrated, the vision LLM reliably caught actual problems (character suddenly has different hair color) while allowing natural scene variation.
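
Here's a minimal sketch of the tolerance-threshold idea. The per-attribute match scores are assumed to come from the vision LLM's comparison; the numbers simply restate the thresholds above:

# Per-attribute tolerance thresholds instead of one global similarity bar.
# The scores are assumed to come from the vision LLM's panel comparison.
PANEL_CONSISTENCY_RUBRIC = {
    "face_structure": 0.80,  # must stay recognizably the same character
    "clothing": 0.70,        # some variation allowed (pose, lighting, angle)
    "environment": 0.00,     # full variation allowed between panels
}

def panels_consistent(match_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Overall pass/fail plus the attributes that fell below tolerance."""
    failures = [
        attr for attr, threshold in PANEL_CONSISTENCY_RUBRIC.items()
        if match_scores.get(attr, 0.0) < threshold
    ]
    return (not failures, failures)

# Character's hair color changed between panels: face score drops, check fails.
ok, problems = panels_consistent({"face_structure": 0.55, "clothing": 0.90, "environment": 0.20})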

Three-Attempt Validation Loop

Mneme wraps generation in a retry loop with vision validation:

  1. Generate image (Attempt n)
  2. Send to vision LLM (llava:34b) with validation rubric
  3. LLM returns: pass/fail + reason + suggested adjustments ("crop tighter", "increase contrast", "remove background clutter")
  4. If fail and attempts remain: Adjust parameters (negative prompt, LoRA weights, composition hints) and retry
  5. Choose best pass (or least-bad fail with reason logged for later review)

E-book Cover Validation Rubric (Example)
1. Single focal subject (person or symbolic element)
2. Clean background (no busy patterns)
3. Typography area clear (top/bottom thirds uncluttered)
4. No text artifacts (no embedded letters/words)
5. Aspect ratio = 3:4 (e-book standard)
6. Style = editorial/professional
7. No extra limbs, no anatomical errors
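
Here's a sketch of that loop, validating through llava:34b on Ollama. The ComfyUI call and the parameter-adjustment step are passed in as stand-ins, and the JSON parsing assumes the model follows the "answer in JSON" instruction:

# Three-attempt loop: generate, ask llava:34b to judge against the rubric, adjust and
# retry on failure. generate_image and adjust_params are caller-supplied stand-ins for
# the ComfyUI wrapper; json.loads assumes the model obeys the "answer in JSON" request.
import base64
import json
import requests

RUBRIC = ("Single focal subject; clean background; typography area clear; "
          "no embedded text; 3:4 aspect; editorial style; no anatomical errors.")

def validate_with_vision(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llava:34b",
        "prompt": ("Check this cover against the rubric and answer in JSON as "
                   '{"pass": true/false, "reason": "...", "adjustments": ["..."]}. '
                   f"Rubric: {RUBRIC}"),
        "images": [image_b64],
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

def generate_with_validation(generate_image, adjust_params, params: dict, max_attempts: int = 3):
    best = None
    for attempt in range(1, max_attempts + 1):
        image_path = generate_image(params)
        verdict = validate_with_vision(image_path)
        if verdict["pass"]:
            return image_path, verdict
        best = (image_path, verdict)  # keep least-bad fail, reason logged for review
        params = adjust_params(params, verdict["adjustments"])
    return best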

Quality Metrics I Track

• First-pass rate through the vision gate (most covers now clear it on the first or second attempt)
• Attempts per accepted image (capped at three by the validation loop)
• Unusable-image rate, the number that made the Fooocus era so painful (15 of 20 covers rejected at its worst)

Workflow Management and Reproducibility

ComfyUI workflows are versioned assets. I keep JSON graphs under `/workflows//.json` with a manifest file that specifies the base checkpoint, compatible LoRAs, and default sampler settings for each version.

When Mneme queues an image generation, it:

  1. Selects the appropriate workflow version (e.g., `ebook_cover_v2.json`)
  2. Injects dynamic parameters (prompt, seed, LoRAs)
  3. POSTs to ComfyUI's `/prompt` endpoint with the full graph
  4. Listens on WebSocket for node completion events
  5. Stores intermediate artifacts in temp directory
  6. Moves final result to project-scoped path with deterministic fingerprint

images/
  ebooks/<project_id>/
    cover/
      prompt.json         # Full generation params
      attempt_01.png
      attempt_02.png
      selected.png       # Validated winner
      validation_report.json
      seed.txt
      loras.json
      workflow.lock.json  # Exact workflow version used
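
Steps 3 and 4 are the interesting ones in code. Here's a sketch assuming ComfyUI's standard HTTP/WebSocket API and the websocket-client package; the host address is illustrative:

# Submit a parameterised graph to ComfyUI's /prompt endpoint, then follow the WebSocket
# for node-completion events. Assumes a stock ComfyUI API; the host address is made up.
import json
import uuid
import requests
import websocket  # pip install websocket-client

COMFY_HOST = "192.168.1.50:8188"  # hypothetical address of the image box

def queue_and_wait(graph: dict) -> str:
    client_id = str(uuid.uuid4())
    resp = requests.post(f"http://{COMFY_HOST}/prompt",
                         json={"prompt": graph, "client_id": client_id}, timeout=30)
    resp.raise_for_status()
    prompt_id = resp.json()["prompt_id"]

    ws = websocket.create_connection(f"ws://{COMFY_HOST}/ws?clientId={client_id}")
    try:
        while True:
            raw = ws.recv()
            if not isinstance(raw, str):
                continue  # binary preview frames, not status messages
            msg = json.loads(raw)
            if msg.get("type") == "executing":
                data = msg["data"]
                print(f"node completed: {data.get('node')}")  # feeds the WebSocket metrics
                if data.get("node") is None and data.get("prompt_id") == prompt_id:
                    break  # graph finished executing
    finally:
        ws.close()
    return prompt_id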

Performance and VRAM Budgeting

The RTX 4080 has 16GB VRAM-enough for most workflows, but not infinite. Here's what I learned:

Tip - If a workflow barely fits VRAM, split into stages: low-res draft → validate → selective upscale. You'll ship faster with higher pass rates because you're not wasting GPU cycles on images that fail validation.
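
Here's a sketch of that staging, reusing the validation idea from earlier; the resolutions and the init_image parameter name are illustrative:

# Draft small, validate cheaply, upscale only the winners. run_workflow and validate
# are stand-ins for the ComfyUI call and the vision check; parameter names are illustrative.
def staged_generation(run_workflow, validate, params: dict):
    draft = run_workflow({**params, "width": 768, "height": 1024, "hires_upscale": 1.0})
    verdict = validate(draft)
    if not verdict["pass"]:
        return None  # fail cheap: no GPU time spent upscaling a reject
    # only validated drafts get the expensive high-res pass
    return run_workflow({**params, "init_image": draft, "hires_upscale": 1.5})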

Results: Higher Quality, Faster Iteration

The numbers tell the story: a single attempt dropped from 8 minutes on the 2060 to about 45 seconds on the 4080, nearly every e-book cover since the migration has been publication-ready on the first or second generation, and the vision gate catches the rest before a human ever reviews them.

What I'd Keep / What I'd Change

Lessons: Specialization, Control, and Validation

Three lessons carry forward from this migration. Specialization wins: a persona trained to do one thing well beats a generalist stretched across tasks, which is why Picaso exists. Control beats convenience: ComfyUI's complexity paid for itself the first time a cover came out clean and reproducible. And generation without validation is gambling: the vision LLM quality gate is what makes image generation at scale sustainable.


© Dave Wheeler · wellerdavis.com · Built one error message at a time.