Building Mneme · Part 5

Music Creator - The Sound of Learning

~10 min read · by Dave Wheeler (aka Weller Davis)

I wanted Mneme to generate music with vocals: 30- to 60-second tracks for tutorials and audiobooks. After five weeks of wrestling with YuE's CUDA dependencies, wrangling checkpoints, and fighting Windows compatibility battles, I finally produced a 14-second clip. It worked. The architecture proved viable. And then I sat there, listening to muddy vocals and looking at 18GB of checkpoints, and realized: this isn't the answer.

Was Music Creator a failure? No. I got it working. But it was certainly Mneme's least capable module-and paradoxically, one of the most valuable learning experiences. The real win wasn't music at all: it was Sound Creator, the short-form sound effects module that emerged from Music Creator's infrastructure and became an instant family favorite.

TL;DR - Music Creator taught me when engineering success isn't product success. YuE delivered a 14-second proof-of-concept after brutal CUDA pinning and checkpoint validation. Stable Audio Open improved quality and extended to 47 seconds, but still wasn't publishable. Then I pivoted to Sound Creator (1-5 second sound effects), reused 95% of the infrastructure, and shipped in 3 days. My daughters loved it immediately. Sometimes the best outcome is what you discover, not what you planned.

Chapter 1: YuE - The 14-Second Breakthrough

YuE (乐) promised semantic token generation followed by audio synthesis: a two-stage architecture that could turn lyrics into music with vocals on a single RTX 4080. The model was sophisticated, the theory sound, the implementation... complex.

The CUDA Hell

Getting YuE running was five weeks of dependency archaeology:

Known-good configuration (hard-won)
• NVIDIA driver: latest release compatible with CUDA 12.4
• CUDA toolkit: cu124
• PyTorch/Torchvision/Torchaudio: 2.6.0+cu124 / 0.21.0+cu124 / 2.6.0+cu124
• Attention: sdpa (not FlashAttention2)
• Precision: fp16 across both stages
• Checkpoints: 18GB total, validated by manifest
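
If you'd rather see that pinning as code than as a bullet list, here's the kind of preflight check I mean - a minimal sketch, assuming a manifest.json of SHA-256 hashes sitting next to the checkpoints; the paths and manifest format are illustrative, not YuE's actual layout.

Environment preflight (sketch)

# preflight.py - verify the pinned stack and checkpoint manifest before a run.
# Checkpoint paths and the manifest format are placeholders, not YuE's real layout.
import hashlib
import json
from pathlib import Path

import torch

EXPECTED_TORCH = "2.6.0+cu124"

def check_environment() -> None:
    assert torch.__version__ == EXPECTED_TORCH, f"torch {torch.__version__} != {EXPECTED_TORCH}"
    assert torch.cuda.is_available(), "CUDA not available - check driver / toolkit pairing"
    print(f"OK: torch {torch.__version__}, CUDA {torch.version.cuda}, "
          f"device {torch.cuda.get_device_name(0)}")

def check_checkpoints(manifest_path: Path) -> None:
    # manifest.json maps relative checkpoint paths to expected SHA-256 hashes
    manifest = json.loads(manifest_path.read_text())
    for rel_path, expected_sha in manifest.items():
        sha = hashlib.sha256()
        with (manifest_path.parent / rel_path).open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha.update(chunk)
        assert sha.hexdigest() == expected_sha, f"checksum mismatch: {rel_path}"
    print(f"OK: {len(manifest)} checkpoints match the manifest")

if __name__ == "__main__":
    check_environment()
    check_checkpoints(Path("models/yue/manifest.json"))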

The Moment It Worked

After pinning the stack, validating all checkpoints, and building a resumable ComfyUI workflow with artifact caching between stages, I clicked "Generate." Twenty minutes later, the progress bar completed. I played the audio.

14 seconds. Vocals. Music. It worked.

For about an hour, I felt like a genius. Then I listened again. The vocals were muddy. The instrumentation was vague. And I knew: this wasn't something I'd publish.

What YuE proved
• End-to-end viability on RTX 4080 (16GB)-but at operational cost
• Two-stage architecture with cached intermediates-resilient but complex (sketched after this list)
• Reproducibility via seeds and version pinning-when nothing broke
• 14-second maximum duration before VRAM exhaustion-not publishable
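
To make the "cached intermediates" point concrete, here's roughly the shape of the resumable flow - a sketch only, with run_stage1 and run_stage2 standing in for the real YuE stages driven through ComfyUI, not any actual YuE API.

Two-stage caching (sketch)

# Resumable two-stage flow: cache stage-1 output so a stage-2 failure doesn't
# cost another full stage-1 pass. run_stage1/run_stage2 are placeholders.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("artifacts/music_creator")

def run_stage1(lyrics: str, genre: str, seed: int) -> list:
    raise NotImplementedError("wire this to the YuE stage-1 (semantic token) workflow")

def run_stage2(tokens: list, seed: int, out_path: Path) -> None:
    raise NotImplementedError("wire this to the YuE stage-2 (audio synthesis) workflow")

def cache_key(lyrics: str, genre: str, seed: int) -> str:
    return hashlib.sha256(f"{lyrics}|{genre}|{seed}".encode()).hexdigest()[:16]

def generate(lyrics: str, genre: str, seed: int) -> Path:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = cache_key(lyrics, genre, seed)
    tokens_path = CACHE_DIR / f"{key}.stage1.json"
    audio_path = CACHE_DIR / f"{key}.stage2.wav"

    # Stage 1 (slow): lyrics + genre -> semantic tokens. Cached on disk.
    if tokens_path.exists():
        tokens = json.loads(tokens_path.read_text())
    else:
        tokens = run_stage1(lyrics, genre, seed)
        tokens_path.write_text(json.dumps(tokens))

    # Stage 2: tokens -> audio. A failed run restarted here, not from scratch.
    if not audio_path.exists():
        run_stage2(tokens, seed, audio_path)
    return audio_path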

Chapter 2: Stable Audio Open - Better, But Not Enough

I couldn't ship muddy 14-second clips. The YuE experiment had taught me where the ceiling was. Now I needed a better solution.

I evaluated alternatives: Suno and Udio had great quality but were API-only with usage costs. MusicGen was simpler than YuE but still multi-stage with quality trade-offs. Then I found Stable Audio Open: single-stage, 47 seconds per clip, designed for ComfyUI, Apache 2.0 license.

Migration in Two Days

The infrastructure I'd built for YuE-prompt generation, project organization, WebSocket progress, artifact storage-worked perfectly for Stable Audio Open. I swapped the ComfyUI workflow, replaced 18GB of YuE checkpoints with a 1.5GB Stable Audio model, and ran a test.

47 seconds. Clear vocals. Actual instrumentation. More than three times longer than YuE, and noticeably better quality.
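
The submit-and-watch loop underneath all of this is small, which is why swapping workflows was a two-day job. Here's a minimal sketch against ComfyUI's HTTP and WebSocket endpoints, patterned on its API example; the exported workflow JSON and the host are whatever your own setup uses.

ComfyUI submit + progress (sketch)

# Submit a workflow to ComfyUI and watch progress over its WebSocket.
# workflow.json is the graph you export in API format (here, Stable Audio Open).
import json
import uuid

import requests
import websocket  # pip install websocket-client

HOST = "127.0.0.1:8188"

def run_workflow(workflow_path: str) -> str:
    client_id = str(uuid.uuid4())
    with open(workflow_path) as f:
        workflow = json.load(f)

    # Connect the WebSocket first so no progress events are missed.
    ws = websocket.WebSocket()
    ws.connect(f"ws://{HOST}/ws?clientId={client_id}")

    resp = requests.post(f"http://{HOST}/prompt",
                         json={"prompt": workflow, "client_id": client_id})
    prompt_id = resp.json()["prompt_id"]

    while True:
        frame = ws.recv()
        if not isinstance(frame, str):
            continue  # binary preview frames - ignore
        msg = json.loads(frame)
        if msg["type"] == "progress":
            print(f"step {msg['data']['value']}/{msg['data']['max']}")
        elif msg["type"] == "executing":
            data = msg["data"]
            # node == None for our prompt_id means the whole graph finished
            if data.get("node") is None and data.get("prompt_id") == prompt_id:
                break

    ws.close()
    return prompt_id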

YuE → Stable Audio Improvements
  • Duration: 14s → 47s (3.4× increase)
  • Quality: muddy → clear vocals
  • Checkpoints: 18GB → 1.5GB
  • CUDA pinning: required → standard PyTorch
  • Reliability: two-stage failures → single-stage stability

What Stable Audio Taught Me
  • Operational simplicity matters as much as capability
  • Quality improvements compound user value
  • Infrastructure reuse accelerates iteration
  • But 47 seconds still wasn't publishable for tutorials

Stable Audio Open was better, but it still wasn't the answer. Users wanted 30-60 second tracks that sounded professional. Segment stitching might extend duration, but quality remained the bottleneck. Music Creator worked-but it was still the least capable module in Mneme.
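
For the record, the stitching itself is the easy part - a minimal linear-crossfade sketch with NumPy, assuming mono clips at the same sample rate, is all it takes. Quality, not splicing, was the bottleneck.

Segment stitching (sketch)

# Overlap two clips and linearly crossfade them. Assumes mono float arrays
# at the same sample rate; stereo would need per-channel fades.
import numpy as np

def crossfade(a: np.ndarray, b: np.ndarray, sample_rate: int, overlap_s: float = 1.0) -> np.ndarray:
    n = int(overlap_s * sample_rate)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = np.linspace(0.0, 1.0, n)
    blended = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], blended, b[n:]])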

Chapter 3: Sound Creator - The Unexpected Win

While evaluating what to do about music, I noticed something in the Stable Audio documentation: it excelled at short-form audio. Not 30- to 60-second songs, but 1- to 5-second sound effects.

I thought about tutorials, e-books, and audiobooks. They didn't just need background music-they needed sound effects. Notification chimes. Transition swooshes. Section dividers. Ambient textures. The kind of audio that's expensive to license but trivial to describe.

I had 95% of the infrastructure already built from Music Creator. What if I pivoted?

Sound Creator: Built in 3 Days

Instead of five weeks (YuE) or two days (Stable Audio migration), Sound Creator took three days because the hard work was already done:

Sound Creator Capabilities
Input: "Cheerful notification chime, bell-like, 2 seconds"
Output: Clean 2-second audio clip, perfect for UI transitions

Input: "Deep cinematic whoosh, rising tension, 3 seconds"
Output: Professional sound effect for tutorial section dividers

Input: "Gentle rain ambiance, peaceful, 5 seconds"
Output: Background texture for audiobook chapters
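
Requests this small batch trivially. Here's a sketch of the shape - SfxRequest and generate_sfx are illustrative names of mine; in the real module the prompt, duration, and seed get patched into the exported Stable Audio Open workflow and submitted through the same pipeline sketched earlier.

Batching sound effects (sketch)

# Batch a handful of sound-effect requests through the same pipeline.
from dataclasses import dataclass

@dataclass
class SfxRequest:
    prompt: str
    seconds: float
    seed: int = 0

REQUESTS = [
    SfxRequest("Cheerful notification chime, bell-like", 2.0),
    SfxRequest("Deep cinematic whoosh, rising tension", 3.0),
    SfxRequest("Gentle rain ambiance, peaceful", 5.0),
]

def generate_sfx(req: SfxRequest) -> None:
    # In the real module this patches req.prompt / req.seconds / req.seed into
    # the exported workflow JSON and submits it via the run_workflow() helper.
    print(f"[sfx] {req.seconds:>4.1f}s  seed={req.seed}  {req.prompt!r}")

if __name__ == "__main__":
    for req in REQUESTS:
        generate_sfx(req)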

The Moment I Knew It Was Different

I showed Sound Creator to my daughters. With Music Creator and Stable Audio, they'd politely listened to the 14- and 47-second clips and said "That's cool, Dad." With Sound Creator, they immediately started generating sounds: spaceship engines, cartoon boings, magical sparkles, explosion effects. They weren't being polite-they were playing.

One of them generated a perfect "level up" chime and asked if she could use it in a video project. That's when I knew: Sound Creator wasn't a consolation prize for Music Creator's limitations. It was better.

The Full Journey: What Worked, What Didn't, What Mattered

YuE (乐)
  • Duration: 14 seconds max
  • Quality: Muddy vocals, vague instrumentation
  • Development: 5 weeks of CUDA hell
  • Checkpoints: 18GB to maintain
  • User reaction: "Technically impressive"
  • Verdict: Proof of concept, not shippable
Stable Audio Open
  • Duration: 47 seconds
  • Quality: Clear vocals, real instrumentation
  • Development: 2 days (infrastructure reuse)
  • Checkpoints: 1.5GB, standard PyTorch
  • User reaction: "Much better"
  • Verdict: Better, but still not publishable
Sound Creator (The Real Win)
  • Duration: 1-5 seconds, exactly what sound effects need
  • Quality: Clean, usable clips
  • Development: 3 days (95% infrastructure reuse)
  • Checkpoints: Same 1.5GB Stable Audio model
  • User reaction: My daughters started playing with it
  • Verdict: Shipped

Lessons: When Success Looks Different Than Planned

Music Creator taught me lessons I couldn't have learned any other way:

1. Engineering Success ≠ Product Success

YuE worked. I proved the two-stage architecture was viable on consumer hardware. But proving viability isn't the same as delivering value. The 14-second breakthrough was an important milestone-and a clear signal to pivot.

2. Infrastructure Compounds

Five weeks on YuE felt like a loss when I realized it wasn't shippable. But that infrastructure-ComfyUI integration, prompt generation, project organization, artifact storage-enabled the Stable Audio migration in 2 days and Sound Creator in 3 days. The investment wasn't wasted; it was foundational.

3. Users Define "Better"

I chased 30-60 second music clips because that's what I thought tutorials needed. My daughters showed me that 2-second sound effects were more useful, more fun, and more immediately valuable. Product direction isn't always obvious from technical capabilities.

4. Constraints Reveal Opportunities

Music Creator's limitations-duration ceiling, quality issues, operational complexity-forced me to ask: "What could this infrastructure do well?" The answer was sound effects, not songs. The constraint became the insight.

5. Know When to Pivot

I could have spent another five weeks trying to extend YuE to 30 seconds, or optimizing Stable Audio for longer clips. Instead, I asked: "What problem can I solve today with what I've built?" Sound Creator was the answer.

The Real Lesson - Music Creator wasn't a failure. It was a structured way to learn what not to build-and discover what I should build instead. I wouldn't trade that learning for anything.

Technical Notes: What Carried Forward

For teams building similar systems, here's what transferred from YuE → Stable Audio → Sound Creator:

Code Reuse by Numbers

YuE implementation:        5 weeks, ~3,500 lines (including ComfyUI nodes)
Stable Audio migration:    2 days,  ~400 lines changed
Sound Creator:             3 days,  ~600 lines new (UI + batch features)

Infrastructure carried forward: ~95%
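
One way to picture the split behind those numbers - the class and method names here are mine for illustration, not Mneme's actual layout. Everything on the base class survived all three iterations; each module is little more than a workflow file plus its UI.

What carried forward (sketch)

# Illustrative only: the shared base class is the ~95% that carried forward;
# the subclasses are the thin per-module layer that changed.
from pathlib import Path

class AudioModule:
    """Shared infrastructure: survived YuE, Stable Audio, and Sound Creator."""
    workflow_path: Path  # the only per-module ComfyUI graph

    def build_prompt(self, request: dict) -> dict: ...     # prompt generation
    def submit(self, workflow: dict) -> str: ...           # HTTP submit + WebSocket progress
    def store_artifact(self, prompt_id: str) -> Path: ...  # project organization / storage

class MusicCreator(AudioModule):
    workflow_path = Path("workflows/stable_audio_music.json")  # replaced the YuE two-stage graph

class SoundCreator(AudioModule):
    workflow_path = Path("workflows/stable_audio_sfx.json")    # ~600 new lines: UI + batching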

What's Next

Sound Creator is shipping. Music Creator remains on hold-not abandoned, but waiting for the right technology to emerge. When a model arrives that can deliver 30-60 seconds of publishable-quality music on consumer hardware with reasonable operational overhead, the infrastructure is ready.

In the meantime, Sound Creator delivers immediate value: notification chimes, transition swooshes, section dividers, and ambient textures for tutorials, e-books, and audiobooks.

Sometimes the best outcome isn't what you planned-it's what you learned along the way.


Key Takeaways for AI Teams

From the technical journey:
  • Pin the full stack (driver, CUDA, PyTorch, attention backend) and validate checkpoints by manifest; reproducibility is what makes iteration possible
  • Cache intermediates between stages so failures are resumable
  • Operational simplicity (1.5GB, standard PyTorch) matters as much as raw capability
  • Infrastructure built around one model transfers to the next: ~95% carried forward here

From the product journey:
  • Engineering success isn't product success; a working proof of concept can still be the wrong product
  • Users define "better": watch what they actually play with
  • Constraints reveal opportunities; ask what the infrastructure does well, not what you wanted it to do
  • Know when to pivot: solve the problem you can solve today with what you've built


© Dave Wheeler · wellerdavis.com · Built one error message at a time.