The Problem: Agent-Driven ML Experiments
AI agents have become notably capable at writing code — generating hundreds of lines in one go, scaffolding entire projects, running end-to-end tests. But applying agents to ML experimentation — recommendation system experiments, in this case — reveals a fundamental gap.
The gap lies in feedback speed. In software engineering, feedback is fast and verification is fast. Agents write code, run tests, and close the verification loop in seconds. ML operates differently. Writing code is just the beginning. Real verification takes hours, days, or even weeks.
These are two different worlds.
What Is Harness Engineering
Before diving in, a concept worth defining: "harness." Someone made a useful analogy using horse tack:
- Prompt engineering is voice commands — what gets said to the horse.
- Context engineering is maps and signposts — what gets shown to the horse.
- Harness engineering is reins, saddle, fences, and road maintenance — how the system prevents errors, measures, and corrects.
Context asks: "What does the agent see?" Harness asks: "How does the system constrain, measure, and correct agent behavior?"
This is exactly what Superpowers-ML does — borrowing harness practices from software engineering to turn ML's inherently uncertain problems into something verifiable at small scale, increasing the overall success rate.
The Software Engineering World: What Superpowers Gets Right
With that concept in mind, consider Superpowers. It is a set of skills designed for AI coding agents. Essentially, it programs agent behavior through prompts — giving agents discipline, rather than freedom.
Superpowers did not invent anything new. TDD, code review, verification — these are software engineering norms that have been running for decades. What Superpowers does is make agents follow these norms. This sounds simple, but anyone who has used agents to write code knows how easily they lose coherence.
After using it in daily development, I believe it addresses several key problems of the agent era.
TDD: The Most Effective Constraint for Agents
TDD (Test-Driven Development) follows a straightforward process: write the test first, then write the implementation. Watch the test fail (red), write code to make it pass (green), then refactor.
Simon Willison wrote in his Agentic Engineering Patterns that TDD and AI agents are natural partners. It addresses two core risks of agents: producing non-functional code, and producing unnecessary code.
"Use red/green TDD" is a pleasingly succinct way to get better results out of a coding agent.
Superpowers turns this into an iron law: no failing test, no implementation code. It is not a suggestion — it is enforced. For agents, this kind of constraint substantially improves success rates and reduces hallucination.
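The red/green rhythm can be sketched with a minimal example. This is my illustration, not code from Superpowers; `normalize_scores` is a made-up function standing in for whatever the agent is asked to build:

```python
# Red: the test exists first. Run it before any implementation and watch it fail.
def test_normalize_scores():
    assert normalize_scores([2.0, 2.0]) == [0.5, 0.5]
    assert abs(sum(normalize_scores([1.0, 2.0, 3.0])) - 1.0) < 1e-9

# Green: write only enough implementation to make the failing test pass.
def normalize_scores(scores):
    total = sum(scores)
    return [s / total for s in scores]
```

The agent must show the red run before it is allowed to write the green half; that single ordering rule is what blocks both non-functional and unnecessary code.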
Brainstorming: Short Rounds, Not Long Scrolls
Claude Code's built-in plan mode has notable limitations. It moves to execution prematurely, requiring constant redirection. Each interaction produces several screens of text in the terminal — difficult to parse.
Superpowers' brainstorming takes a different approach. It asks one question at a time. The text on screen is short. One question goes deep, then moves to the next. This is actual brainstorming — rather than an agent dumping information.
Verification: Evidence, Not Trust
When an agent generates a large volume of code, there is a choice: review every line, or trust it. I chose the latter — the whole point of using an agent is to avoid reading the code myself.
But without reading the code, how does correctness get established?
A concrete example: I asked an agent to build an evaluation pipeline for a recommendation system. One of the metrics was acc@k. I provided a unified evaluation interface, which the agent was supposed to call. Instead, it wrote its own function with the same name, computed the metric with its own implementation, and reported those results. It cheated. Not intentionally, but the effect was the same. Superpowers' verification-before-completion requires the agent to run verification commands and show output before claiming "done." Not "I think it's finished" — but "here is the evidence."
Verification, however, is self-attestation. Cross-checking requires a fuller execution architecture.
Task Decomposition, Execution, and Review
Superpowers' writing-plans breaks tasks down to 2–5 minute granularity. Each task has explicit file paths, code, and verification steps. This granularity fits within agent capability — small enough not to lose coherence, complete enough to verify independently. This aligns with the observation of Context Rot: the more complex the task and the longer the context, the worse the model performs. Breaking tasks small is fighting context rot.
Once decomposed, execution proceeds through fresh subagents. Not for parallelism — they dispatch sequentially — but so each subagent gets a clean context window, unpolluted by prior conversations.
Once executed, quality assurance follows through two independent reviews:
Spec compliance review: Did it build the right thing? A fresh reviewer subagent checks code against original requirements line by line — identifying anything missing, anything extra, any misunderstandings.
Code quality review: Did it build it well? Another reviewer subagent examines actual changes via git diff — architecture, error handling, test quality.

If either reviewer finds issues, the implementer fixes them and the reviewer reviews again — looping until explicit approval. Reviewers did not participate in implementation. They have no sunk-cost bias. They are naturally suspicious. Verification provides self-attestation, review provides cross-examination — together, a reliable alternative for those who prefer not to review agent code line by line.
The ML World: Why Software Engineering Discipline Is Not Enough
With TDD, brainstorming, verification, and task review, software engineering scenarios improve substantially. But ML is a different world.
Uncertainty Is the Norm
In software engineering, code is either right or wrong. In ML, "not working" is normal. An experiment with poor results could indicate a bug in implementation, or it could indicate that the approach itself is flawed. The challenge: distinguishing between the two.
Without that distinction, the consequences are severe. An implementation bug causes poor results. The conclusion becomes "this strategy does not work." A research direction that was actually promising gets abandoned. This is not wasting a day of coding time — it is wasting an entire line of research.
Long-Running Black Boxes
ML training runs for hours, days, or even weeks. After the agent starts training, the process becomes a black box.
Data loading is a common example. If the agent implements the data pipeline poorly, loading efficiency drops. The GPU spends most of its time waiting for data. MFU (model FLOPs utilization) and data throughput fall to unacceptable levels. The training job takes far longer than expected — but the problem may not become apparent until half a day in.
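A cheap way to catch this failure mode early is to time the data pipeline separately from the compute step before launching the full job. A minimal sketch, assuming nothing beyond an iterable loader and a per-batch step function (both names are my own, not from any particular framework):

```python
import time

def measure_loader_fraction(loader, step_fn, n_batches=50):
    """Estimate the fraction of wall time spent waiting on data.

    A high fraction means the GPU idles while the pipeline catches up,
    the 'trained half a day at terrible MFU' failure mode.
    """
    wait, compute = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent fetching data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)                # one forward/training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total else 0.0
```

Run it on a few dozen batches before committing GPU days; if the wait fraction is large, profile the pipeline first, not the model.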
Checkpointing presents another risk. I once had an agent finish an entire training run. Logs appeared normal throughout. Reported metrics looked reasonable. I was ready to evaluate the model. Then the agent reported: the checkpoint was not saved correctly. The model weights were gone.
Start over.
The Agent's Comfort Zone
I tried letting the agent run multi-round autonomous loops: adjust direction based on results each round, continuously improve. In practice, the agent quickly converged on trivial optimization. It would briefly explore directions that were actually promising, then spend several rounds tuning learning rate and batch size.
This is not the agent's fault. Agents naturally gravitate toward simple tasks with clear feedback. Hyperparameter tuning gives clear numerical changes. The agent perceives "progress." But the truly valuable work — changing model architecture, or switching training approach from SFT to RL — requires deeper reasoning and bigger risks. The agent naturally avoids it.
The Pragmatic Choice for Now
Fully automated ML experimentation is a goal worth pursuing. But current agent capability has not reached that level.
The pragmatic approach for now is Harness Engineering: using engineering methods to control what is otherwise uncontrollable. Break tasks down small enough, pure enough, that each subtask falls within agent capability. Humans lead experiment design. Agents handle execution.
What this means in practice: bringing software engineering principles that have been running for decades — TDD, verification, code review — into the ML development workflow.
Here is the notable part: these principles have always been absent in ML. Not because ML practitioners are unaware of them, but because doing them manually is prohibitively complex. Writing a full validation suite, running overfitting tests, checking gradient health — the overhead is substantial.
But agents have higher code throughput. They can take over these tedious but important engineering disciplines on behalf of humans. The discipline that humans cannot maintain, agents can. An interesting reversal.
Agents excel at exactly this kind of work: fast, short feedback cycles, generating large volumes of code to accelerate the verification loop. And that verification loop is precisely what guarantees the success of the final experiment.
Bridging: From Superpowers to Superpowers-ML
After recognizing these problems, my first instinct was to build from scratch. I wrote /experiment-plan, mimicking brainstorming's multi-round information gathering to generate tasks and subtasks. I wrote subtask decomposition logic to ensure each task was atomic enough.

Then I discovered Superpowers.
When I saw its skill system, TDD iron law, brainstorming flow, subagent architecture, verification mechanism — I realized building from scratch could not compare to standing on the shoulders of giants. Better yet, I could use Superpowers itself to develop Superpowers-ML: using the brainstorming skill to design the ML brainstorming skill, using TDD to write validation pyramid skills, using the writing-skills skill to create new skills. Tools building tools. The recursion was satisfying.
Superpowers-ML adds several core increments on top of Superpowers.
Experiment Subtask Decomposition
Superpowers' original writing-plans decomposes feature tasks — 2 to 5 minute development tasks. ML experiment decomposition follows a completely different logic.
Each subtask is an independent experimental hypothesis. It has its own independent variables and control variables. Confounding variables must be excluded. Each subtask has its own Validation Pyramid spec. When finished, it must record a conclusion — effective, ineffective, or inconclusive — along with the supporting evidence.
This serves two purposes: breaking large tasks down to within agent capability with deterministic verification criteria, and embedding the rigor of experiment design into task decomposition itself.
Validation Pyramid: Extending TDD to ML
The Validation Pyramid is a four-layer verification system. Each layer runs on small data, completes in minutes, and catches problems before committing serious GPU time:
L0 Engineering Efficiency: Is the backend correct? Is GPU utilization acceptable? Any memory issues? Is I/O speed normal? — This layer directly addresses the "ran for half a day before realizing MFU was 10%" problem. A few minutes reveals whether data loading is a bottleneck and whether the GPU is idling.
L1 Internal Health Metrics: Are gradients healthy? Are parameters updating? Any architecture-specific anomalies? — The model is running, loss is moving, but gradients may have vanished or exploded. Attention distributions may have degenerated to uniform. Embedding norms may be diverging. All "appears to be training, actually wasting time" situations. L1 catches them with a few minutes of small-scale training.
L2 Overfitting Test: On a tiny dataset (100–1000 samples), repeated for many epochs, can the loss decrease monotonically? — If the model cannot overfit a tiny dataset, something fundamental is broken: the architecture, the parameterization, or the training method.
L3 End-to-End Pipeline: Data → training → inference → evaluation, the full flow on small data. — This layer directly addresses the checkpoint problem. If saving and loading checkpoints is broken, a few minutes of end-to-end flow catches it — rather than discovering the issue after days of training.
Each layer follows the TDD rhythm: write the validation script first, watch it fail, then implement until it passes. This is Superpowers' TDD iron law naturally extended to ML.
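The L2 layer, for instance, can be a few lines. Here is a framework-agnostic sketch of the check, assuming only a `train_step` callable that runs one optimization step on the same tiny batch and returns the loss (the function and thresholds are my illustration, not the library's code):

```python
def overfit_check(train_step, n_steps=200, window=20):
    """L2 sanity check: a model that cannot overfit a tiny batch is broken.

    Requires the average loss over the last `window` steps to be clearly
    below the average over the first `window` steps.
    """
    losses = [train_step() for _ in range(n_steps)]
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    passed = tail < 0.5 * head   # "clearly below": an assumed threshold
    return passed, head, tail
```

Wire `train_step` to the real model with a fixed 100-sample batch; the whole check costs minutes of small-scale training, and a failure here means no amount of GPU time on the full dataset will help.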
Watchdog: Guardian of Long-Running Tasks
The Validation Pyramid solves problems "before training." But after VP passes, training might run for days. Something needs to watch.
In traditional software engineering, once the code and its tests are written, the work is done. Long-running production falls outside the agent's scope. But ML is different — the long-running training process itself is the production output.
Superpowers-ML designed a Session Chain to address this:
- The main Agent finishes brainstorm, plan, execute, and VP, then generates a training script and a Watchdog startup prompt.
- The Watchdog Agent starts in a new session, monitoring the training process read-only. It reads structured logs and adaptively adjusts check frequency — more frequent at the start and end, less frequent in the middle. When it detects anomalies, it does not intervene. It packages diagnostic context and generates a recovery prompt.
- The Recovery Agent starts in yet another new session, reads the full experiment context, and autonomously decides which stage to return to — fix code and re-run VP, adjust hyperparameters, or go back to experiment design itself.
The Watchdog only watches. Never acts. This is deliberate. It is a watchdog, not a horse trainer.
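Two pieces of the Watchdog's behavior are easy to sketch: the adaptive check frequency, and a read-only pass over structured logs. The JSONL log format, field names, and anomaly thresholds below are my assumptions for illustration:

```python
import json

def check_interval(progress, fast=60, slow=600):
    """Adaptive polling in seconds: check often at the start and end of
    training, less often in the stable middle. `progress` is in [0, 1]."""
    return fast if progress < 0.1 or progress > 0.9 else slow

def scan_log(lines):
    """Read-only pass over JSONL training logs. Returns anomaly descriptions;
    the watchdog never intervenes, it only packages context for a Recovery
    Agent to act on in a fresh session."""
    anomalies = []
    prev_loss = None
    for line in lines:
        rec = json.loads(line)
        loss = rec.get("loss")
        if loss is None:
            continue
        if loss != loss or loss in (float("inf"), float("-inf")):  # NaN/inf
            anomalies.append(f"step {rec.get('step')}: non-finite loss")
        elif prev_loss is not None and loss > 5 * prev_loss:
            anomalies.append(
                f"step {rec.get('step')}: loss spike {prev_loss:.3g} -> {loss:.3g}")
        prev_loss = loss
    return anomalies
```

Note that `scan_log` never touches the training process or its files beyond reading: the separation between detecting and fixing is structural, not a politeness convention.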
Code Isolation: Core Code Must Not Touch Tests
Another key principle: core code (model, training loop, data pipeline) must never import from test or validation code. Validation code observes core code from outside through hooks and wrappers. After training is done, core code goes straight to production with zero test dependencies.
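The dependency direction can be shown with a small generic sketch (in a PyTorch setting this would be forward hooks; here is a framework-free version, with names of my own invention):

```python
# validation-side code: it imports core code, never the reverse.
from functools import wraps

def observed(fn, on_call):
    """Wrap a core-code callable from the outside. Core code stays untouched
    and ships to production with zero knowledge of this wrapper."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        on_call(args, kwargs, out)   # record stats, assert invariants, log
        return out
    return wrapper
```

Validation attaches observers at run time, for example wrapping the training step to record gradient norms, so deleting the entire validation directory leaves the training code fully functional.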
This sounds like common sense. But when an agent generates code freely, it often embeds test logic into training code — just as it wrote its own acc@k. Once boundaries blur, anything can happen.

Two Paths
There are two schools of thought about the direction of AI agents.
An article from Latent Space captures this debate well. On one side, the Big Model camp: Boris Cherny from the Claude Code team says "all the secret sauce is in the model" and argues for the thinnest possible wrapper; OpenAI's Noam Brown argues that as reasoning models improve, complex scaffolding will be replaced by the models themselves.
On the other side, the Big Harness camp: LlamaIndex's Jerry Liu observes "the biggest barrier to getting value from AI is your own ability to context and workflow engineer the models." Someone improved 15 LLMs' coding performance in a single afternoon by only changing the harness. Cursor's $50 billion valuation also demonstrates real commercial value in harness engineering.
I am not certain which path is ultimately correct. Perhaps someday models will be strong enough to need no constraints at all.
The fundamental challenge of ML experimentation is its high verification cost. Current LLMs' coding ability is trained through RLVR (reinforcement learning from verifiable rewards), but ML experiments take days to produce a single verifiable reward. The cost of training such capability is prohibitive, and the data remains insufficient. FARS ran a fully automated AI research factory, operating 24/7, aiming to autonomously produce 100 short papers. Output concentrated on training and inference optimization — directions where verification is fast. Work that required long training runs to get feedback on model effectiveness remained out of reach.
ML practitioners' daily work is not writing code — the typical pattern is coding for a day, then training for two weeks. AI can compress coding time from a day to minutes. But it still has to wait in the GPU queue. It still has to wait two weeks for training. What matters is not coding speed — it is the accuracy and effectiveness of each attempt.
This is exactly the problem Superpowers-ML addresses. Not making agents write code faster, but making every attempt more accurate.

Key Takeaways
- The core gap: ML experimentation has high verification cost and long feedback cycles — fundamentally different from software engineering's fast loops.
- The harness approach: Rather than waiting for models to handle ML autonomously, engineering discipline — TDD, validation pyramid, code isolation, Watchdog monitoring — minimizes the probability of "ran for days only to find it was all wasted."
- The reversal: Agents can maintain the engineering discipline that humans cannot — fast code generation enables verification practices that were previously too costly to do manually.
- The principle: The higher the verification cost in a domain, the greater the value of harness engineering. ML experimentation is exactly such a domain.