I wanted Mneme to write code I'd trust to run. Not "technically correct but requires 30 minutes of cleanup" code. Not "works in the demo, breaks in production" code. Real, shippable code. That meant rethinking everything about how AI generates software, starting with the uncomfortable truth that most AI code fails validation the moment you try to run it.
The breakthrough came from an unlikely place: watching my daughters play a snake game. Not a tutorial example or a proof of concept, but an actual working game that Code Creator built from scratch, complete with collision detection, score persistence, and progressive speed increases. They didn't care about the architecture. They just wanted to beat each other's high scores.
That's when I knew Code Creator was working.
The Goal: Shippable, Not Just Syntactic
Early versions of Code Creator could generate code that looked correct. Proper syntax, reasonable structure, even decent naming conventions. But when you actually ran it? Undefined variables. Missing imports. Functions that referenced other functions that didn't exist yet. And my personal favorite: three different implementations of the same logic scattered across the codebase because the model forgot it already wrote that function.
I needed Code Creator to produce code you could run immediately after generation. No manual fixing, no "just change this one thing," no debugging sessions. The bar was simple: If it doesn't work when you click Run, it's not done.
The Pattern: Why Plan → Execute → Validate → Fix?
By the time I started building Code Creator, I'd already implemented the same pattern across e-books, tutorials, images, and music. Every creator followed the same flow:
1. Plan: Break the big goal into manageable chunks
2. Execute: Generate one chunk at a time
3. Validate: Check quality before moving on
4. Fix: Auto-correct issues or escalate to human
The trick was breaking the work into manageable chunks while maintaining continuity, consistency, and quality.
For code, this meant:
- Plan: Turn "build a snake game" into atomic DevPlan tasks
- Execute: Implement one task at a time (skeleton → game logic → UI → polish)
- Validate: Syntax check, undefined symbols, import resolution, functional duplication
- Fix: Auto-generate targeted fix tasks, bounded retries
The pattern made sense. The challenge was execution.
The Snake Game: A Concrete Example
Let me show you what Code Creator produces by walking through an actual project. I asked it to "build a snake game with score tracking and increasing difficulty."
The DevPlan
Code Creator broke this into atomic tasks:
DevPlan: Snake Game
├─ T001 Create project structure (HTML, CSS, JS files) [completed]
├─ T002 Implement canvas setup and game state [completed]
├─ T003 Add snake movement and collision detection [completed]
├─ T004 Implement pill generation and score tracking [completed]
├─ T005 Add speed progression and localStorage high scores [completed]
└─ T006 Polish UI (game over screen, pause/resume) [completed]
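Each entry in that tree maps to a small record the orchestrator can track. Here's a plausible shape for one of those records, sketched in JavaScript; the field names are illustrative, not Code Creator's actual schema:
// Illustrative shape of one atomic DevPlan task (not the real internal schema)
const task = {
  id: 'T003',
  title: 'Add snake movement and collision detection',
  files: ['src/game.js'],   // files this task is allowed to touch
  dependsOn: ['T002'],      // tasks that must complete first
  status: 'pending',        // pending | in_progress | completed | failed
  attempts: 0,              // bounded retries: capped at 3 before human escalation
};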
The Generated Code
Here's a snippet from the actual script.js that Code Creator produced (cleaned for readability):
// Game state
let snake = [];
let pill = {};
let direction = 'right';
let score = 0;
let highScore = localStorage.getItem('snakeHighScore') || 0;
let gameSpeed = INITIAL_SPEED;
// Update game state
function update() {
direction = nextDirection;
const head = {x: snake[0].x, y: snake[0].y};
// Calculate new head position
switch (direction) {
case 'up': head.y -= 1; break;
case 'down': head.y += 1; break;
case 'left': head.x -= 1; break;
case 'right': head.x += 1; break;
}
// Check collision with walls
if (head.x < 0 || head.x >= TILE_COUNT ||
head.y < 0 || head.y >= TILE_COUNT) {
gameOver();
return;
}
// Check collision with self
for (let segment of snake) {
if (segment.x === head.x && segment.y === head.y) {
gameOver();
return;
}
}
// Add new head, check for pill
snake.unshift(head);
if (head.x === pill.x && head.y === pill.y) {
score += 10;
generatePill();
// Increase speed
if (gameSpeed > MIN_SPEED) {
gameSpeed -= 2;
clearInterval(gameInterval);
gameInterval = setInterval(gameLoop, gameSpeed);
}
} else {
snake.pop(); // Remove tail
}
}
This isn't cherry-picked. This is the code Code Creator wrote. Collision detection works. Score tracking persists to localStorage. Speed increases with each pill. The game is actually playable.
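One thing the snippet doesn't show is the write side of that persistence: the high score is read at startup but saved on game over. A minimal sketch of what that handler looks like (my illustration, not the code Code Creator generated):
// Sketch of the game-over path: stop the loop and persist the high score
// (illustrative; the generated gameOver() may differ in detail)
function gameOver() {
  clearInterval(gameInterval);
  if (score > highScore) {
    highScore = score;
    localStorage.setItem('snakeHighScore', String(highScore));
  }
  // ...show the game-over screen and offer a restart
}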
The Validation Moment
When I opened the game in a browser and played it, everything worked. Then I showed it to my daughters. They immediately started competing for high scores, laughing when the snake got too fast to control. One of them asked, "Can you make it so we can see each other's scores?"
That question was validation. Not "Does it compile?" but "Can I use this?"
The Compact Context DSL Breakthrough
Early versions of Code Creator had a problem: I was burning 50,000 tokens per task sending entire files to the code model. For a simple function addition, the model would receive hundreds of lines of irrelevant context. The context size was painful, and the output quality was worse: models would get distracted by unrelated code and suggest unnecessary refactoring.
Then I realized: the model doesn't need to see every line, only the interfaces, the task context, and the exact section it's editing.
What Changed
Before:
- Send entire files (500-2000 lines)
- 50,000 tokens per task
- Model gets distracted by irrelevant code
- Suggests unnecessary refactoring
- Slow, expensive, lower quality

After:
- Send only: interfaces, task context, edit section
- ~20,000 tokens per task (60% reduction)
- Model focuses on relevant context
- Targeted changes only
- Faster, cheaper, better quality
Example: Compact Context Format
[TREE]
/src
game.js (Main game logic)
render.js (Canvas drawing)
utils.js (Helper functions)
[/TREE]
[TASK:T003]
Implement snake movement and collision detection
[/TASK]
[IFACE:game.js]
- snake: Array<{x, y}>
- direction: string
- update(): void
- checkCollision(x, y): boolean
[/IFACE]
[EDIT:src/game.js:115-145]
// Only the function being modified, not the entire file
function update() {
// ... existing implementation
}
[/EDIT]
[INSTRUCTION]
Add collision detection for walls and self-collision.
Return true if collision detected, false otherwise.
[/INSTRUCTION]
This focused context cut token usage by 60% and improved output quality. Less is more, but only if you choose the right "less."
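Assembling that context is mostly string work. Here's a minimal sketch of a builder that emits the format above, assuming hypothetical inputs for the project tree, per-file interfaces, and the edit section:
// Minimal compact-context builder (input shape and names are hypothetical)
function buildCompactContext({ tree, task, interfaces, edit, instruction }) {
  return [
    '[TREE]', tree, '[/TREE]',
    `[TASK:${task.id}]`, task.title, '[/TASK]',
    ...Object.entries(interfaces).flatMap(([file, iface]) =>
      [`[IFACE:${file}]`, iface, '[/IFACE]']),
    `[EDIT:${edit.path}:${edit.start}-${edit.end}]`, edit.snippet, '[/EDIT]',
    '[INSTRUCTION]', instruction, '[/INSTRUCTION]',
  ].join('\n');
}
The markup itself matters less than the discipline: interfaces and the edit section go in, full implementations stay out.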
The Functional Duplication Bug
One morning I was reviewing generated code and noticed something odd: three different functions for validating user input, each with slightly different names but identical logic.
function validateInput(data) { /* ... */ }
function checkInputValidity(data) { /* ... */ }
function verifyUserInput(data) { /* ... */ }
All three did the same thing. The model had forgotten it already implemented this functionality and kept generating new versions with different names. My validation pipeline was checking syntax, imports, and undefined symbols, but it wasn't catching functional duplication.
The Fix: Semantic Analysis
I added a new validation step that analyzes function behavior, not just signatures:
- Extract function purpose from docstrings and implementation
- Compare against existing functions in the codebase
- Flag duplicates and suggest refactoring
- Auto-generate consolidation tasks when duplication detected
This caught cases where the model would implement the same logic multiple times under different names. The validation failure would trigger a fix task: "Consolidate duplicate validation functions into a single reusable utility."
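To give a flavor of what this can look like, here's one cheap heuristic for the exact failure above (identical logic under different names): normalize away identifiers and compare the token streams. This is a sketch of the idea, not Code Creator's actual semantic analysis, which also weighs docstrings and behavior:
// Rough duplicate-detection heuristic: strip comments, collapse non-keyword
// identifiers, and compare the resulting token streams. Renamed copies of the
// same logic normalize to the same stream. A sketch only.
const KEYWORDS = new Set([
  'function', 'return', 'if', 'else', 'for', 'while', 'switch', 'case',
  'break', 'const', 'let', 'var', 'new', 'typeof',
]);

function normalizedTokens(source) {
  const tokens = source
    .replace(/\/\/.*$/gm, '')            // drop line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')    // drop block comments
    .match(/[A-Za-z_$][\w$]*|\d+|[^\s]/g) || [];
  return tokens.map(t =>
    /^[A-Za-z_$]/.test(t) && !KEYWORDS.has(t) ? 'ID' : t);
}

function sameLogicDifferentName(fnSourceA, fnSourceB) {
  return normalizedTokens(fnSourceA).join(' ') === normalizedTokens(fnSourceB).join(' ');
}
A hit here doesn't auto-delete anything; it just feeds the fix step, producing a consolidation task like the one described above.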
The Incremental Loop: Plan → Execute → Validate → Fix
Here's how Code Creator actually works, step by step:
1. Plan: DevPlan Generation
- Analyze request: "Build a snake game with score tracking"
- Survey codebase: Check existing files, interfaces, patterns
- Generate atomic tasks: Break into 5-10 focused, testable steps
- User approval gate: Show plan before touching any files
2. Execute: Task Implementation
- Select next task: Mark as in_progress, one at a time
- Build compact context: Only relevant interfaces and edit sections
- Generate changes: Structured file operations (write/edit/delete)
- Apply atomically: Filesystem changes via MCP tools
3. Validate: Multi-Layer Quality Gates
- Syntax: Fast parse per language (JavaScript, Python, etc.)
- Undefined symbols: Catch missing variables/functions/classes
- Import resolution: Verify all imports exist and are accessible
- Functional duplication: Semantic analysis for redundant logic
- Optional tests: Run unit/integration tests when enabled
4. Fix: Targeted Repair Tasks
- Validation failure: Generate specific fix task from error details
- Bounded retries: Max 3 attempts per task to prevent loops
- Human escalation: Flag for review if fixes don't resolve issue
- Checkpoint: Save state after each successful task
Request: "Build snake game" ↓ DevPlan (6 atomic tasks) → User approves ↓ For each task: 1. Build compact context (interfaces + edit section) 2. Generate code changes 3. Apply to filesystem 4. Validate (syntax, imports, duplication, tests) 5. If pass: mark completed, move to next task 6. If fail: generate fix task, retry (max 3 attempts) ↓ All tasks completed → Project ready
Two-Tier LLM Design: Fast Analysis, Focused Generation
Code Creator uses two LLMs with distinct roles:
Tier 1 (fast analysis): personas Casey (Coder) and Priya (Architect)
- Parse user intent
- Survey codebase structure
- Generate DevPlan (atomic tasks)
- Build compact context for each task
- Fast, cheap, local inference
Tier 2 (focused generation): specialized code model
- Receive focused context + task
- Generate structured file operations
- Low temperature for determinism
- Return only what changes
- Optimized for code quality
This separation keeps context size lower (fast local LLM for planning) and quality high (specialized model for code generation with focused context).
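To make the split concrete, here's roughly how the two tiers could be wired together, reusing the buildCompactContext sketch from earlier. The client objects, method names, and prompts are placeholders for illustration, not Mneme's real interfaces:
// Two-tier sketch: a fast local planner and a code-specialized generator
// (placeholder clients and prompts; Mneme's actual wiring differs)
async function planAndImplement(request, planner, coder) {
  // Tier 1: local LLM parses intent and produces the DevPlan
  const planText = await planner.chat({
    system: 'You are Priya, a software architect. Break the request into atomic DevPlan tasks.',
    user: request,
  });

  // Tier 2: code model sees only the compact context for one task at a time
  const results = [];
  for (const task of parseTasks(planText)) {
    const operations = await coder.chat({
      system: 'Return only structured file operations (write/edit/delete) for the task.',
      user: buildCompactContext(task),
      temperature: 0.1,   // low temperature for deterministic, targeted edits
    });
    results.push(operations);
  }
  return results;
}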
The Honest Truth: I Haven't Used It for Real Work Yet
Here's the part where I'm supposed to tell you about all the production apps I've shipped with Code Creator. But I can't, because I haven't used it for real work yet.
Not because it doesn't work. The snake game proves it works. But because I'm a perfectionist, and Code Creator isn't quite there yet for the kind of complex, production-grade software I build professionally.
What's it good for right now?
- Small, self-contained apps: Games, utilities, proof-of-concepts
- Prototyping: Get a working version fast, refine manually
- Learning tools: My daughters building their first web projects
- Boilerplate generation: Project skeletons, CRUD operations, API scaffolding
What's it not quite ready for?
- Large refactors across multiple modules
- Complex architectural decisions requiring human judgment
- Production systems where bugs have real consequences
- Code I'd stake my professional reputation on (yet)
But it's getting there. Every week, the validation catches more issues. Every update to the Compact Context DSL improves focus. Every persona training iteration makes the plans smarter. The gap between "works for snake games" and "ships production software" is narrowing.
Lessons: What Makes Code Shippable
1. Token Efficiency Improves Quality
Cutting context from 50,000 to 20,000 tokens wasn't just an efficiency win; it made the output better. Focused context means focused changes. The model stops suggesting unnecessary refactoring and just solves the task at hand.
2. Validation Must Be Semantic, Not Just Syntactic
Catching functional duplication required understanding what code does, not just whether it parses. Syntax checking is table stakes. Real quality gates need semantic analysis.
3. Atomic Tasks Compound
Breaking "build a snake game" into 6 focused tasks meant each one could be validated independently. When task 3 failed, tasks 1-2 were still good. No monolithic rewrites-just targeted fixes.
4. User Enjoyment Is the Real Test
My daughters playing the snake game validated Code Creator more than any unit test could. If people want to use what it builds, it's working. If they don't, it's not, regardless of test coverage.
5. Honesty About Limitations Builds Trust
Code Creator works for small projects. It's not ready for production systems. Saying that out loud doesn't diminish what it can do; it clarifies where the value is today and where it's headed tomorrow.
What's Next
Code Creator is evolving. Current priorities:
- Test-first workflows: Generate tests before implementation, validate with mutation testing
- Multi-file refactoring: Track dependencies across modules, suggest architectural improvements
- LoRA persona training: Specialize Casey and Priya on feedback from real projects
- Browser E2E agent: Automated UI testing for web apps (navigate, click, validate)
- Production readiness checklist: Security scan, performance profiling, deployment readiness gates
The goal remains the same: code you'd trust to run. Not "technically correct," but actually shippable.
Key Takeaways
For AI teams building code generators:
- Less context, better output: cutting 50k tokens to 20k improved quality and reduced latency
- Semantic validation matters: Check for functional duplication, not just syntax
- Plan → Execute → Validate → Fix: The pattern works across all content types
- Atomic tasks compound: Small, testable steps enable targeted fixes
- User enjoyment is the metric: If people want to use it, it works
For teams evaluating local vs. cloud:
- Local models have capability ceilings compared to frontier cloud models
- But rapid improvements in open-source LLMs are narrowing that gap
- The trade-off (local control + economics vs. raw capability) makes sense for privacy-sensitive or cost-sensitive workloads
- Code Creator proves local-first code generation is viable for certain use cases today