I wanted Mneme to write code I'd trust to run. Not "technically correct but requires 30 minutes of cleanup" code. Not "works in the demo, breaks in production" code. Real, shippable code. That meant rethinking everything about how AI generates software, starting with the uncomfortable truth that most AI code fails validation the moment you try to run it.
The breakthrough came from an unlikely place: watching my daughters play a snake game. Not a tutorial example or a proof of concept, but an actual working game that Code Creator built from scratch, complete with collision detection, score persistence, and progressive speed increases. They didn't care about the architecture. They just wanted to beat each other's high scores.
That's when I knew Code Creator was working.
The Goal: Shippable, Not Just Syntactic
Early versions of Code Creator could generate code that looked correct. Proper syntax, reasonable structure, even decent naming conventions. But when you actually ran it? Undefined variables. Missing imports. Functions that referenced other functions that didn't exist yet. And my personal favorite: three different implementations of the same logic scattered across the codebase because the model forgot it already wrote that function.
I needed Code Creator to produce code you could run immediately after generation. No manual fixing, no "just change this one thing," no debugging sessions. The bar was simple: If it doesn't work when you click Run, it's not done.
The Pattern: Why Plan → Execute → Validate → Fix?
By the time I started building Code Creator, I'd already implemented the same pattern across e-books, tutorials, images, and music. Every creator followed the same flow:
1. Plan: Break the big goal into manageable chunks
2. Execute: Generate one chunk at a time
3. Validate: Check quality before moving on
4. Fix: Auto-correct issues or escalate to human
The trick was breaking the work into manageable chunks while maintaining continuity, consistency, and quality.
For code, this meant:
- Plan: Turn "build a snake game" into atomic DevPlan tasks
- Execute: Implement one task at a time (skeleton → game logic → UI → polish)
- Validate: Syntax check, undefined symbols, import resolution, functional duplication
- Fix: Auto-generate targeted fix tasks, bounded retries
The pattern made sense. The challenge was execution.
The Snake Game: A Concrete Example
Let me show you what Code Creator produces by walking through an actual project. I asked it to "build a snake game with score tracking and increasing difficulty."
The DevPlan
Code Creator broke this into atomic tasks:
DevPlan: Snake Game
├─ T001 Create project structure (HTML, CSS, JS files) [completed]
├─ T002 Implement canvas setup and game state [completed]
├─ T003 Add snake movement and collision detection [completed]
├─ T004 Implement pill generation and score tracking [completed]
├─ T005 Add speed progression and localStorage high scores [completed]
└─ T006 Polish UI (game over screen, pause/resume) [completed]
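Each entry in that tree maps to a small record the orchestrator can track. Here's a plausible shape for one of those records, sketched in JavaScript; the field names are illustrative, not Code Creator's actual schema:
// Illustrative shape of one atomic DevPlan task (not the real internal schema)
const task = {
  id: 'T003',
  title: 'Add snake movement and collision detection',
  files: ['src/game.js'],   // files this task is allowed to touch
  dependsOn: ['T002'],      // tasks that must complete first
  status: 'pending',        // pending | in_progress | completed | failed
  attempts: 0,              // bounded retries: capped at 3 before human escalation
};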
The Generated Code
Here's a snippet from the actual script.js that Code Creator produced (cleaned for readability):
// Game state
let snake = [];
let pill = {};
let direction = 'right';
let score = 0;
let highScore = localStorage.getItem('snakeHighScore') || 0;
let gameSpeed = INITIAL_SPEED;
// Update game state
function update() {
direction = nextDirection;
const head = {x: snake[0].x, y: snake[0].y};
// Calculate new head position
switch (direction) {
case 'up': head.y -= 1; break;
case 'down': head.y += 1; break;
case 'left': head.x -= 1; break;
case 'right': head.x += 1; break;
}
// Check collision with walls
if (head.x < 0 || head.x >= TILE_COUNT ||
head.y < 0 || head.y >= TILE_COUNT) {
gameOver();
return;
}
// Check collision with self
for (let segment of snake) {
if (segment.x === head.x && segment.y === head.y) {
gameOver();
return;
}
}
// Add new head, check for pill
snake.unshift(head);
if (head.x === pill.x && head.y === pill.y) {
score += 10;
generatePill();
// Increase speed
if (gameSpeed > MIN_SPEED) {
gameSpeed -= 2;
clearInterval(gameInterval);
gameInterval = setInterval(gameLoop, gameSpeed);
}
} else {
snake.pop(); // Remove tail
}
}
This isn't cherry-picked. This is the code Code Creator wrote. Collision detection works. Score tracking persists to localStorage. Speed increases with each pill. The game is actually playable.
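One thing the snippet doesn't show is the write side of that persistence: the high score is read at startup but saved on game over. A minimal sketch of what that handler looks like (my illustration, not the code Code Creator generated):
// Sketch of the game-over path: stop the loop and persist the high score
// (illustrative; the generated gameOver() may differ in detail)
function gameOver() {
  clearInterval(gameInterval);
  if (score > highScore) {
    highScore = score;
    localStorage.setItem('snakeHighScore', String(highScore));
  }
  // ...show the game-over screen and offer a restart
}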
The Validation Moment
When I opened the game in a browser and played it, everything worked. Then I showed it to my daughters. They immediately started competing for high scores, laughing when the snake got too fast to control. One of them asked, "Can you make it so we can see each other's scores?"
That question was validation. Not "Does it compile?" but "Can I use this?"
The Compact Context DSL Breakthrough
Early versions of Code Creator had a problem: I was burning 50,000 tokens per task sending entire files to the code model. For a simple function addition, the model would receive hundreds of lines of irrelevant context. The context size was painful, and the output quality was worse: models would get distracted by unrelated code and suggest unnecessary refactoring.
Then I realized: the model doesn't need to see every line, only the interfaces, the task context, and the exact section it's editing.
What Changed
Before:
- Send entire files (500-2000 lines)
- 50,000 tokens per task
- Model gets distracted by irrelevant code
- Suggests unnecessary refactoring
- Slow, expensive, lower quality

After:
- Send only: interfaces, task context, edit section
- ~20,000 tokens per task (60% reduction)
- Model focuses on relevant context
- Targeted changes only
- Faster, cheaper, better quality
Example: Compact Context Format
[TREE]
/src
game.js (Main game logic)
render.js (Canvas drawing)
utils.js (Helper functions)
[/TREE]
[TASK:T003]
Implement snake movement and collision detection
[/TASK]
[IFACE:game.js]
- snake: Array<{x, y}>
- direction: string
- update(): void
- checkCollision(x, y): boolean
[/IFACE]
[EDIT:src/game.js:115-145]
// Only the function being modified, not the entire file
function update() {
// ... existing implementation
}
[/EDIT]
[INSTRUCTION]
Add collision detection for walls and self-collision.
Return true if collision detected, false otherwise.
[/INSTRUCTION]
This focused context cut token usage by 60% and improved output quality. Less is more, but only if you choose the right "less."
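Assembling that context is mostly string work. Here's a minimal sketch of a builder that emits the format above, assuming hypothetical inputs for the project tree, per-file interfaces, and the edit section:
// Minimal compact-context builder (input shape and names are hypothetical)
function buildCompactContext({ tree, task, interfaces, edit, instruction }) {
  return [
    '[TREE]', tree, '[/TREE]',
    `[TASK:${task.id}]`, task.title, '[/TASK]',
    ...Object.entries(interfaces).flatMap(([file, iface]) =>
      [`[IFACE:${file}]`, iface, '[/IFACE]']),
    `[EDIT:${edit.path}:${edit.start}-${edit.end}]`, edit.snippet, '[/EDIT]',
    '[INSTRUCTION]', instruction, '[/INSTRUCTION]',
  ].join('\n');
}
The markup itself matters less than the discipline: interfaces and the edit section go in, full implementations stay out.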
The Functional Duplication Bug
One morning I was reviewing generated code and noticed something odd: three different functions for validating user input, each with slightly different names but identical logic.
function validateInput(data) { /* ... */ }
function checkInputValidity(data) { /* ... */ }
function verifyUserInput(data) { /* ... */ }
All three did the same thing. The model had forgotten it already implemented this functionality and kept generating new versions with different names. My validation pipeline was checking syntax, imports, and undefined symbols, but it wasn't catching functional duplication.
The Fix: Semantic Analysis
I added a new validation step that analyzes function behavior, not just signatures:
- Extract function purpose from docstrings and implementation
- Compare against existing functions in the codebase
- Flag duplicates and suggest refactoring
- Auto-generate consolidation tasks when duplication detected
This caught cases where the model would implement the same logic multiple times under different names. The validation failure would trigger a fix task: "Consolidate duplicate validation functions into a single reusable utility."
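To give a flavor of what this can look like, here's one cheap heuristic for the exact failure above (identical logic under different names): normalize away identifiers and compare the token streams. This is a sketch of the idea, not Code Creator's actual semantic analysis, which also weighs docstrings and behavior:
// Rough duplicate-detection heuristic: strip comments, collapse non-keyword
// identifiers, and compare the resulting token streams. Renamed copies of the
// same logic normalize to the same stream. A sketch only.
const KEYWORDS = new Set([
  'function', 'return', 'if', 'else', 'for', 'while', 'switch', 'case',
  'break', 'const', 'let', 'var', 'new', 'typeof',
]);

function normalizedTokens(source) {
  const tokens = source
    .replace(/\/\/.*$/gm, '')            // drop line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')    // drop block comments
    .match(/[A-Za-z_$][\w$]*|\d+|[^\s]/g) || [];
  return tokens.map(t =>
    /^[A-Za-z_$]/.test(t) && !KEYWORDS.has(t) ? 'ID' : t);
}

function sameLogicDifferentName(fnSourceA, fnSourceB) {
  return normalizedTokens(fnSourceA).join(' ') === normalizedTokens(fnSourceB).join(' ');
}
A hit here doesn't auto-delete anything; it just feeds the fix step, producing a consolidation task like the one described above.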
The Incremental Loop: Plan → Execute → Validate → Fix
Here's how Code Creator actually works, step by step:
1. Plan: DevPlan Generation
- Analyze request: "Build a snake game with score tracking"
- Survey codebase: Check existing files, interfaces, patterns
- Generate atomic tasks: Break into 5-10 focused, testable steps
- User approval gate: Show plan before touching any files
2. Execute: Task Implementation
- Select next task: Mark as in_progress, one at a time
- Build compact context: Only relevant interfaces and edit sections
- Generate changes: Structured file operations (write/edit/delete)
- Apply atomically: Filesystem changes via MCP tools
3. Validate: Multi-Layer Quality Gates
- Syntax: Fast parse per language (JavaScript, Python, etc.)
- Undefined symbols: Catch missing variables/functions/classes
- Import resolution: Verify all imports exist and are accessible
- Functional duplication: Semantic analysis for redundant logic
- Optional tests: Run unit/integration tests when enabled
4. Fix: Targeted Repair Tasks
- Validation failure: Generate specific fix task from error details
- Bounded retries: Max 3 attempts per task to prevent loops
- Human escalation: Flag for review if fixes don't resolve issue
- Checkpoint: Save state after each successful task
Request: "Build snake game" ↓ DevPlan (6 atomic tasks) → User approves ↓ For each task: 1. Build compact context (interfaces + edit section) 2. Generate code changes 3. Apply to filesystem 4. Validate (syntax, imports, duplication, tests) 5. If pass: mark completed, move to next task 6. If fail: generate fix task, retry (max 3 attempts) ↓ All tasks completed → Project ready
Two-Tier LLM Design: Fast Analysis, Focused Generation
Code Creator uses two LLMs with distinct roles:
Tier 1 (fast analysis): personas Casey (Coder) and Priya (Architect)
- Parse user intent
- Survey codebase structure
- Generate DevPlan (atomic tasks)
- Build compact context for each task
- Fast, cheap, local inference
Tier 2 (focused generation): specialized code model
- Receive focused context + task
- Generate structured file operations
- Low temperature for determinism
- Return only what changes
- Optimized for code quality
This separation keeps context size lower (fast local LLM for planning) and quality high (specialized model for code generation with focused context).
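To make the split concrete, here's roughly how the two tiers could be wired together, reusing the buildCompactContext sketch from earlier. The client objects, method names, and prompts are placeholders for illustration, not Mneme's real interfaces:
// Two-tier sketch: a fast local planner and a code-specialized generator
// (placeholder clients and prompts; Mneme's actual wiring differs)
async function planAndImplement(request, planner, coder) {
  // Tier 1: local LLM parses intent and produces the DevPlan
  const planText = await planner.chat({
    system: 'You are Priya, a software architect. Break the request into atomic DevPlan tasks.',
    user: request,
  });

  // Tier 2: code model sees only the compact context for one task at a time
  const results = [];
  for (const task of parseTasks(planText)) {
    const operations = await coder.chat({
      system: 'Return only structured file operations (write/edit/delete) for the task.',
      user: buildCompactContext(task),
      temperature: 0.1,   // low temperature for deterministic, targeted edits
    });
    results.push(operations);
  }
  return results;
}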
The Honest Truth: I Haven't Used It for Real Work Yet
Here's the part where I'm supposed to tell you about all the production apps I've shipped with Code Creator. But I can't, because I haven't used it for real work yet.
Not because it doesn't work. The snake game proves it works. But because I'm a perfectionist, and Code Creator isn't quite there yet for the kind of complex, production-grade software I build professionally.
What's it good for right now?
- Small, self-contained apps: Games, utilities, proof-of-concepts
- Prototyping: Get a working version fast, refine manually
- Learning tools: My daughters building their first web projects
- Boilerplate generation: Project skeletons, CRUD operations, API scaffolding
What's it not quite ready for?
- Large refactors across multiple modules
- Complex architectural decisions requiring human judgment
- Production systems where bugs have real consequences
- Code I'd stake my professional reputation on (yet)
But it's getting there. Every week, the validation catches more issues. Every update to the Compact Context DSL improves focus. Every persona training iteration makes the plans smarter. The gap between "works for snake games" and "ships production software" is narrowing.
Lessons: What Makes Code Shippable
1. Token Efficiency Improves Quality
Cutting context from 50,000 to 20,000 tokens wasn't just an efficiency win; it made the output better. Focused context means focused changes. The model stops suggesting unnecessary refactoring and just solves the task at hand.
2. Validation Must Be Semantic, Not Just Syntactic
Catching functional duplication required understanding what code does, not just whether it parses. Syntax checking is table stakes. Real quality gates need semantic analysis.
3. Atomic Tasks Compound
Breaking "build a snake game" into 6 focused tasks meant each one could be validated independently. When task 3 failed, tasks 1-2 were still good. No monolithic rewrites-just targeted fixes.
4. User Enjoyment Is the Real Test
My daughters playing the snake game validated Code Creator more than any unit test could. If people want to use what it builds, it's working. If they don't, it's not, regardless of test coverage.
5. Honesty About Limitations Builds Trust
Code Creator works for small projects. It's not ready for production systems. Saying that out loud doesn't diminish what it can do; it clarifies where the value is today and where it's headed tomorrow.
What's Next
Code Creator is evolving. Current priorities:
- Test-first workflows: Generate tests before implementation, validate with mutation testing
- Multi-file refactoring: Track dependencies across modules, suggest architectural improvements
- LoRA persona training: Specialize Casey and Priya on feedback from real projects
- Browser E2E agent: Automated UI testing for web apps (navigate, click, validate)
- Production readiness checklist: Security scan, performance profiling, deployment readiness gates
The goal remains the same: code you'd trust to run. Not "technically correct," but actually shippable.
Key Takeaways
For AI teams building code generators:
- Less context, better output: cutting 50k tokens to 20k improved quality and reduced latency
- Semantic validation matters: Check for functional duplication, not just syntax
- Plan → Execute → Validate → Fix: The pattern works across all content types
- Atomic tasks compound: Small, testable steps enable targeted fixes
- User enjoyment is the metric: If people want to use it, it works
For teams evaluating local vs. cloud:
- Local models have capability ceilings compared to frontier cloud models
- But rapid improvements in open-source LLMs are narrowing that gap
- The trade-off (local control + economics vs. raw capability) makes sense for privacy-sensitive or cost-sensitive workloads
- Code Creator proves local-first code generation is viable for certain use cases today