The Mental Model of Vibe Coding: Managing Agents Like Managing Teams
2026-02-20
This is not a hands-on tutorial about specific tools or techniques. It is about the mental model behind Vibe Coding.
Vibe Coding is fundamentally about coding through Agents. Behind every Agent is an LLM, and LLMs are humanity's "ghosts" — a framing from Karpathy's 2025 year-end reflection: "we're not evolving animals. We're summoning ghosts." Language is a projection of the human world. LLMs are its spectral echo.
Tools and techniques evolve at breakneck speed. This technology is unprecedented — no one has prior experience with it. But human nature is constant. Grasping the "human nature" of Agents and managing them accordingly turns Vibe Coding from bewildering to navigable.

Defining Vibe Coding

On February 2, 2025, Andrej Karpathy posted a tweet and casually coined a term:
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.
He described building projects with Cursor and SuperWhisper, barely touching the keyboard — just talking to the AI. "Accept All," never reviewing diffs. Errors got pasted back in and usually resolved themselves. The codebase kept growing beyond his own comprehension. Some bugs were unfixable, so he worked around them or fiddled until they vanished. "I'm building a project or web app, but I'm not really coding — I'm just watching, giving verbal instructions, running the program, copy-pasting, and it mostly just works."
That tweet received 4.5 million views. "Vibe coding" became one of the defining buzzwords of 2025. A year later, most people's understanding of it remains imprecise — many conflate it with AI-assisted programming.
AI-assisted programming is broader. It includes using tools like Copilot or Cursor to help write code. AI autocompletes, generates functions, and refactors. But the developer still reviews every diff, still examines every implementation detail, and still has a mental model of whether the code is correct. AI accelerates the writing. The human remains the coder.
Vibe coding is fundamentally different — the developer does not look at the code.
The workflow becomes: type an idea, observe the output, check whether it behaves as expected. If it does, move forward. If not, rephrase and retry. How the code actually implements things is irrelevant — and may be incomprehensible. Product managers, for instance, can do this kind of work effectively.
This creates substantial discomfort for most engineers. The first decade of a software career trains one specific skill — precise control over the process. Every line of code needs a reason. Every design decision needs justification. Abandoning that oversight feels deeply wrong.
The discomfort is real — but it is not new. The last time this feeling appeared was during the transition from IC to team lead.
Vibe coding produces the same psychological shift as team management. Once this analogy takes hold, many things fall into place.

The Role Shift: From Coder to Manager

During a typical vibe coding session, three Agent sessions run simultaneously — one refactoring a module, one writing tests, one adjusting UI. The operator does not need to monitor every step of each session. The requirement is rapid context-switching: check the output, make a judgment, give direction, then move to the next session.
This mirrors the daily workflow of a frontline team leader. Three or four workstreams running in parallel — Employee A drafting a technical proposal, Employee B doing integration testing, Employee C investigating a production issue.
The hardest part of the IC-to-TL transition is not the change in work content — it is the loss of process control. Previously, all code passed through the developer's hands. Now the keyboard belongs to someone else, and they implement features in ways that feel "less than perfect" — but the results work correctly. The instinct to grab back the keyboard must be suppressed.
When a team member's output does not meet expectations, the response is not to take over their implementation but to communicate and give feedback until the result improves. A TL might be able to get it right in one shot, but that approach does not scale: five similar tasks may be waiting for feedback, and one person cannot write the code for all of them.
Multi-session Agent coding follows the same pattern. Precise descriptions help the Agent understand the task. Multiple rounds of feedback sculpt the result — like molding clay into the desired shape.
The operator's value is no longer in execution. It is in clarity of intent and quality of review. The critical skills shift from efficient coding to rapid context-switching, fast feedback, and strong taste in technology and results.
Same bottleneck, same solution: asynchronous collaboration. Align on goals and acceptance criteria, let the executor run, check results at checkpoints, intervene only at blockers. The best vibe coders and the best TLs operate the same way — they never stand over anyone's shoulder watching every line of code being written.

Validating Results, Not Code

Traditional programming carries an implicit assumption: code should pass on the first attempt. Think it through beforehand, write carefully, run the tests, submit if they pass. The core of this workflow is confidence in the process — the developer knows whether the code is correct because they wrote it.
Vibe coding abandons this assumption entirely. Without reading the code, process confidence is impossible. Agents hallucinate — and they do so with notable confidence. An Agent may fail to execute properly, then declare the entire approach fundamentally flawed.
Reverting to reading the code means reverting to AI-assisted programming. The object of validation has changed. It is no longer whether the code is correct, but whether the result is correct.
A representative scenario: an Agent declared "this feature cannot be implemented, recommend switching architecture." Rather than debugging its code, a new session was opened with the same requirement described fresh. The new session delivered the feature successfully.
The first session's code is irrelevant. The only question that matters is whether the result can be produced. If yes, move forward. If not, try a different approach.
This represents a fundamental philosophical shift: the goal is not getting it right the first time, but results that continuously converge toward correctness. Efficiency comes not from single-pass accuracy, but from feedback and iteration speed.

The Bottleneck: Context Switching and Feedback Speed

Managing multiple single-Agent sessions resembles frontline team management. Whether using a multi-session management tool like Vibe Kanban or simply switching between terminal tabs, the approach is the same — relinquish code-level control, focus on task decomposition, expectation management, and multi-round acceptance feedback.
Consider the scene: three sessions running simultaneously. One human. All sessions waiting for feedback.
The largest bottleneck is context-switching speed and parallel processing capacity.
Agents are fast — a session might complete in a few minutes. Serial thinking — finishing session A before looking at session B, then session C — turns the human into the throughput bottleneck. Three Agents running in parallel, serialized by one person.
The same dynamic applies to TLs. Three or four workstreams running concurrently — slow context-switching means everyone waits.
The speed of result evaluation directly determines system throughput. A strong TL reviews a proposal and gives a conclusion in 3 minutes. A weak TL takes a day. In vibe coding, this gap gets amplified because Agent execution speed is so high — the human who cannot keep up becomes the constraint.
After experimenting with various open-source tools, the simplest approach proved most effective: a tmux status bar that provides notifications whenever an Agent session completes.
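A minimal tmux-side setup for this, sketched with standard tmux options (the silence threshold and the styles are a matter of taste, and silence-based highlighting only approximates "session completed"):

```shell
# ~/.tmux.conf — flag windows whose agent pane has rung a bell or gone quiet
setw -g monitor-bell on           # highlight the window on a terminal bell
setw -g monitor-silence 30        # highlight after 30s without output
set  -g visual-silence on         # also print a status-line message
setw -g window-status-bell-style 'bg=red,fg=white'
setw -g window-status-activity-style 'bg=yellow,fg=black'
```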
Input speed is another constraint. Voice input provides a notable improvement. After trying Superwhisper, Typeless proved to offer the best balance of user experience, accuracy, and latency. Anything longer than 10 characters is now spoken rather than typed.

Task Description and Decomposition

The prompt given to an Agent is the requirements document given to a direct report. Vague descriptions produce vague output. Precise descriptions produce precise output.
Poor vibe coding results often stem not from Agent limitations, but from inadequate requirement descriptions. This mirrors team dynamics — when a report delivers the wrong thing, it is frequently because the TL did not think clearly about what they wanted or failed to communicate it precisely.
Claude Code's Plan mode is officially recommended, but it does not go far enough — it tends to ask a handful of questions and move on.
Superpowers takes the approach further: no code first. It starts with brainstorming — requirement clarification and interactive design refinement — then moves to detailed planning, breaking the work into small, verifiable steps. One planning session took nearly an hour.
A vague goal — "refactor this module" — leads to floundering or incorrect implementations. Concrete subtasks — "extract these three functions into a class, preserve the interface, add unit tests" — enable precise execution.
Task decomposition ability is essentially engineering design ability. The operator no longer executes directly, but needs to understand how the work should be done in order to decompose it correctly.
Many projects do not follow standard software engineering workflows. One effective approach: hand the Superpowers repository to Claude and let it learn, self-improve, and adapt to the current project's conventions.

Context Rot

When an Agent session drifts off track, a familiar pattern emerges: every fix introduces new problems, every "almost done" spawns new bugs. The session sinks deeper into a wrong direction.
Continue course-correcting or start over — this is the sunk cost trap. The investment in tokens and wait time creates pressure to keep trying. But in most cases, decisively closing the session and starting fresh with a better description proves faster. The Agent's context has been contaminated with errors. A long context full of mistakes degrades LLM reasoning — effectively wasting time.
The recommended approach: have the Agent document its learnings in a file, then clear or kill the session and start fresh. While Claude Code compresses context automatically, by this point the operator is working with a substantially degraded Agent.
"Context Rot" — a term coined by Chroma Research in their 2025 study. Their finding: even on simple tasks, model performance degrades as input length increases. The degradation is non-uniform and unpredictable.
  • Anthropic's official Best Practices state: "If you've corrected Claude on the same issue more than twice in the same session, the context is now littered with failed attempts." The recommendation is to /clear and start fresh.
  • Sourcegraph engineer Geoffrey Huntley found that despite Claude Sonnet's advertised 200K token limit, context window quality begins degrading around 147K–152K tokens — roughly 75% of the stated capacity.
  • Anthropic acknowledged when releasing Opus 4.6: "context rot" — model performance degrading once a conversation exceeds a certain token count — remains a known limitation.
Practical guidelines from experience: keep context below 75% capacity. Atomize tasks with frequent small commits. Use progress.md and git to save state at high frequency. After clearing a session, restore context from progress.md and git history. Enter the next feature at peak performance with low token cost.
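These guidelines can be wrapped in a small helper. The sketch below (file name, log format, and commit-message convention are all illustrative) timestamps a learning into progress.md and commits the working tree before the session is cleared:

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def log_progress(progress_file: Path, note: str) -> str:
    """Append a timestamped learning to the progress file; return the line written."""
    line = f"{datetime.now(timezone.utc):%Y-%m-%dT%H:%MZ}  {note}"
    with progress_file.open("a") as f:
        f.write(line + "\n")
    return line

def checkpoint(note: str, progress_file: Path = Path("progress.md")) -> None:
    """Save state before killing a session: note the learning, commit everything."""
    log_progress(progress_file, note)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-q", "-m", f"checkpoint: {note}"], check=True)
```

The fresh session then restores context from `progress.md` and `git log` instead of inheriting a contaminated context window.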
Every session clear dismisses a "contaminated" ghost and summons a fresh one. Karpathy's framing holds — ghosts have no memory accumulation, no growth curve. They only have the state of this particular summoning. Accepting this enables decisive session kills without the feeling of waste.
Team management follows the same principle. When someone has been stuck on a direction too long, the TL needs to make the decision they cannot make for themselves: stop. This is not a rejection of effort — the direction is wrong, and continuing only burns resources. Helping the team member clear their "context" and return at full capacity is often the better approach.
Good managers cut losses decisively. Good vibe coders do the same.

Agentic Engineering

On February 4, 2026 — vibe coding's one-year anniversary — Karpathy posted a retrospective and upgraded the concept. He characterized vibe coding as the early, experimental phase. Professional-grade development had evolved to the next stage, which he named Agentic Engineering.
"'Agentic' because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do."
Orchestrating agents — this was destined to become a pivotal term.
In the single-Agent mode discussed earlier, the core role is feedback-giver — opening sessions, checking results, giving direction. Essentially a frontline TL doing high-frequency reviews. But when task complexity increases — dozens of Agents in parallel, dependencies between tasks, extended execution cycles — feedback alone is insufficient. The operator needs to plan execution order, determine output dependencies, decide when to merge and when to roll back. The role shifts from feedback-giver to orchestrator — precisely what Karpathy meant.
From vibe coding to agentic engineering, the name changes but the essence remains: the operator is no longer the one writing code, but the one who ensures Agents write the correct code.

From Feedback-Giver to Orchestrator

The word "orchestrate" comes from the orchestra, whose conductor plays no instrument yet coordinates every musician's tempo, dynamics, and entries. The term was later adopted for software and container scheduling, and has now been applied to multi-Agent AI systems.
The earlier discussion focused on directly managing a few Agent sessions — the TL personally reviewing each Agent's results and personally providing feedback. The bottleneck in this model is still the human.
When tasks reach sufficient complexity (for instance, recommendation-algorithm research, which looks straightforward at first but can be more complex than building a lightweight application because of the number of variables involved), this model breaks down. A single training and evaluation cycle takes substantial time. If Agents cannot run autonomously and depend on timely human feedback, the bottleneck remains the human. Even with an extremely complex application, running 20 simultaneous sessions while tracking all their progress is impractical. Context-switching and feedback speed have an upper limit.
At this point, the need shifts from "faster context-switching and feedback" to organizational architecture design.
The analogy is direct: the operator is no longer a frontline TL but a director — managing TLs rather than ICs. Rather than directly managing each Sub-Agent, the operator manages frontline leaders (called "team-leader" in Claude Code's Team mode), who in turn manage individual executors. This creates a two-tier organizational architecture.
In single-Agent mode, the concern is "how to give feedback faster." In multi-Agent orchestration, the concern shifts to whether sub-teams can operate autonomously.
An effective sub-team achieves two things: self-decomposing tasks and self-validating results. It does not require oversight of every step — the operator intervenes only at key checkpoints to review direction, examine conclusions, and make trade-offs. This requires an agent in a "frontline TL" role — one that receives high-level requirements, breaks them into concrete tasks, distributes them to executors, collects results, and determines whether they meet the acceptance bar.

Upgrading the "Organizational Architecture"

Whether with people or Agents, the essence of organizational architecture upgrade is: replacing "human judgment" with "mechanism-based judgment."
Rather than personally validating every Executor's output, the operator builds a Sub-team with two capabilities: autonomous daily planning and decision-making, and built-in error correction.
A single Agent executing complex tasks or running for extended periods gradually accumulates errors — the "context rot" discussed earlier. Beyond that, it develops a myopia problem.
Anthropic ran an experiment having AI agents autonomously attempt complex, long-term tasks (such as building a complex web application similar to claude.ai). The finding: these agents were notably myopic — they tended to complete only the easy parts, prematurely declare task completion, and leave substantial unfinished or undocumented work. This mirrors how humans sometimes cherry-pick easy tasks.
Their solution introduced a multi-agent collaboration structure: a dedicated initializer/planner agent establishes clear project structure at the outset (JSON feature lists, progress logs like claude-progress.txt, initial git commits). Subsequent coding executor agents focus on incrementally completing small portions of functionality, writing tests, making clean commits, and leaving clear state for the next round — enabling long-term coordination across multiple sessions.
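The artifacts in that structure might look like this (illustrative shapes, not Anthropic's actual formats): a machine-checkable feature list the planner writes up front, with a plain-text progress log alongside it recording what each session finished and what it left for the next one.

```json
[
  {"id": "auth-login",  "status": "done",        "tests": ["test_login.py"]},
  {"id": "chat-stream", "status": "in_progress", "tests": []}
]
```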
In independent experiments, decomposing each sub-task's multi-round iteration into at least 3 roles proved effective: Planner for task planning, Executor for task execution, Reviewer for reviewing process and conclusions.
Planner (sub-Leader) counters the myopia and difficulty-avoidance in Agent behavior — preventing Executors from operating with unclear goals and gradually drifting during execution. Planners need long-term memory: the global view, which paths have been tried, which conclusions have been validated, and current priorities. They provide continuity and judgment.
Executor is spawned fresh each time code needs to be executed, avoiding the degradation and goal-drift that come with long contexts.
Intuitively, Executors should maintain context — having done task 1, task 2 should go more smoothly with that "experience." The opposite is true. What accumulates is not experience — it is bias.
An Executor that hits a snag in task 1 carries that "memory" as contamination into task 2. It begins rationalizing failures — "the problem is not my code, the approach itself is fundamentally flawed." It transforms from an executor into a biased opponent. As context grows longer, reasoning ability itself degrades — context rot compounds the problem.
This pattern is familiar in human teams. When someone has been stuck on a direction too long, they become that direction's most steadfast opponent — not because the approach is unworkable, but because they have psychologically bonded with the failures. A fresh person, given the same task, often delivers quickly.
Reviewer reviews code and conclusions when dealing with complex, multi-variable problems. Without a Reviewer, Executor conclusions propagate up the chain unchecked — the Orchestrator may conclude the goal has been achieved and begin asking whether to terminate.
This architecture addresses two core problems. Autonomous decision-making: Planner decomposes tasks, Reviewer validates results — the daily loop runs without human intervention. Error correction: when an Executor's output has issues, the Reviewer catches them. Executors themselves do not accumulate bias because each one starts fresh.
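The daily loop of such a sub-team can be sketched as ordinary control flow, with the three roles stubbed out as plain functions. This is a hypothetical shape: in practice `plan_fn`, `execute_fn`, and `review_fn` would each be Agent invocations, and a fresh Executor would be spawned per call.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    tasks: list                                # concrete, verifiable subtasks
    notes: list = field(default_factory=list)  # the Planner's long-term memory

def run_subteam(goal, plan_fn, execute_fn, review_fn, max_rounds=3):
    """One autonomous loop: the Planner holds state, each Executor starts clean."""
    plan = plan_fn(goal)
    results = []
    for task in plan.tasks:
        for attempt in range(1, max_rounds + 1):
            output = execute_fn(task)              # fresh Executor: no carried bias
            if review_fn(task, output) == "accept":
                results.append(output)             # only reviewed results propagate up
                break
            plan.notes.append(f"{task}: attempt {attempt} rejected")
    return results, plan.notes
```

The human reviews `results` and `plan.notes` at checkpoints rather than supervising every attempt.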

The Newly Promoted TL's Cardinal Sin

The most common real-world problem with this architecture: the Sub-TL cannot resist jumping in to do the work directly.
In Claude Code's multi-Agent Team mode, a main agent handles orchestration — receiving requirements, decomposing tasks, dispatching to sub-agents, and collecting results. In practice, it frequently jumps in to write code itself.
This mirrors the classic TL anti-pattern — the strongest IC gets promoted, and their first instinct when encountering a problem is "I'll do it myself, it's faster." Because they are too good at writing code.
Once the orchestrator starts executing, it becomes blocked. The terminal sits idle, and the entire interaction rhythm collapses. The channel between the operator and the Sub-Agent team — which should remain clear — is now clogged.
Anyone who has managed a team recognizes this TL type — technically the strongest, promoted from IC. During the two hours they spend writing code, three people wait for their review and two wait for their decision on a proposal. Individual output goes up, but team throughput goes down.
The same dysfunction. A TL's core responsibility is not producing code — it is making the entire team run efficiently. The core responsibility is coordination, planning, and review. The moment it starts "getting its hands dirty," it loses its greatest value.
Even the current best tools (Opus 4.6) have not fully resolved this. Even with clear prompting that the agent serves as an orchestrator, it occasionally writes code directly. The eventual solution: a hook. Whenever the orchestrator attempts to write .py or .sh files, the hook interrupts with a reminder: "You are an Orchestrator. Your core task is not execution — delegate execution to your Team."
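A sketch of such a hook, assuming Claude Code's PreToolUse interface (the hook command receives the tool-call event as JSON on stdin, and exiting with code 2 blocks the call while surfacing stderr to the model). The event field names and the wiring are assumptions:

```python
import json
import sys

CODE_SUFFIXES = (".py", ".sh")
REMINDER = ("You are an Orchestrator. Your core task is not execution. "
            "Delegate execution to your Team.")

def is_code_write(event: dict) -> bool:
    """True when a Write/Edit tool call targets a code file."""
    path = event.get("tool_input", {}).get("file_path", "")
    return path.endswith(CODE_SUFFIXES)

def main(stdin=sys.stdin) -> int:
    # Invoked by the hook runner with the event on stdin; never called on import.
    event = json.load(stdin)
    if is_code_write(event):
        print(REMINDER, file=sys.stderr)
        return 2   # block the tool call; REMINDER is fed back to the agent
    return 0       # allow everything else
```

In this sketch it would be registered as a PreToolUse hook on Write/Edit tools in `.claude/settings.json`, pointing at the script above.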

Using the Most Expensive Agents — and Placing Them Correctly

Given the same hiring budget, the choice between many cheaper hires and a few expensive ones often favors the latter. A strong hire does not merely "make fewer mistakes" — they proactively identify unconsidered problems and propose novel solutions. The communication and course-correction overhead of several mediocre hires frequently exceeds the salary savings.
Agents follow the same logic. A Claude Code Max subscription runs $200–300/month — substantially more than alternatives. But the math is straightforward: a weak model hallucinates once, costing 30 minutes of investigation, correction, and re-running. The expensive model gets it right the first time. The operator's time is the most expensive resource in the system.
There is also a hidden cost — the cascading effect of wrong conclusions. An inaccurate experimental conclusion that informs an architectural decision, followed by two weeks of team development in the wrong direction — that cost far exceeds the subscription savings.
But expense alone is insufficient. The key is placing the right Agent in the right position. Critical roles always use the strongest model. Planners and Reviewers use the top-tier model. Executors can use lower-cost options (such as Sonnet — though still not cheap). Judgment roles should not be cost-optimized. Execution roles can be.
A post on X demonstrates an even more specialized split — Claude designs the plan, Codex executes relentlessly:
  • Claude: responsible for planning, architecture design, and requirement decomposition. Fast thinking, strong creativity, well-structured proposals, strong contextual understanding — suited for high-level design work.
  • Codex: responsible for steadfast execution, code writing, bug fixing, and implementation details. Reports indicate that given a clear plan, Codex demonstrates remarkable resilience — high first-pass success rates, minimal hallucinations — suited for the heavy execution work.
The logic of talent density and the logic of Agent selection follow the same playbook.

Good People Are Not Enough — The Environment Matters

Selecting the right people is only half the equation.
A common scenario: a widely recognized strong engineer joins a new company and performs below expectations. Their abilities have not degraded — the new company's infrastructure is poor. Compilation and testing take 40 minutes, the dev environment crashes regularly, documentation is outdated to the point of being counterproductive, and coding standards exist only on paper. Even exceptional people can deliver only roughly 30% of their capability in such an environment.
Agents follow the same pattern. Anthropic published a comprehensive paper on Agent evaluation systems in early 2026, which contained a notable finding: infrastructure configuration's impact on Agent performance can sometimes exceed the impact of switching models — even surpassing the gap between different top-tier models, with experiments showing differences of up to 6 percentage points.
The same Agent, running on a clean environment with a stable toolchain, performs well. In a setup with state leakage, resource contention, and no environment isolation, scores can drop by half. Fixing a few infrastructure bugs — such as inverted scoring logic or missing environment isolation — boosted the same model's scores by 20–30 points. This improvement substantially exceeds what model iterations themselves deliver.
OpenAI's Harness Engineering, published in February 2026, pushed this philosophy to the extreme. A team of three engineers used Codex to build a million-line-of-code product from scratch — with zero handwritten code. The core philosophy: "Humans steer. Agents execute." When an Agent encountered a problem, the fix was almost never "try again." Instead, the engineers asked: what capability is missing from the environment, and how can it be made both readable and executable for the Agent?
All knowledge was pushed into the codebase itself — not one giant instruction file, but a structured documentation directory. AGENTS.md served as a table of contents pointing to deeper design docs, architecture specs, and execution plans. An architectural decision aligned on in Slack but not documented in the repo effectively does not exist for the Agent, just as it does not exist for a hire who joined three months after the discussion.
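One plausible shape for such a documentation directory (the names are illustrative, not OpenAI's actual layout):

```
repo/
├── AGENTS.md             # table of contents: where to look, not everything to know
└── docs/
    ├── architecture.md   # system boundaries and module ownership
    ├── decisions/        # one short file per architectural decision
    └── plans/            # execution plans for in-flight work
```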
In other words, when an Agent appears incapable, the working environment may be the actual constraint.
This mirrors human team dynamics exactly. A TL's responsibility extends beyond selecting the right people and assigning the right tasks — it includes building good infrastructure. Are dev tools effective? Are organizational collaboration workflows smooth? Is there information loss in communication? These "invisible" factors determine whether a person outputs 30% or 80% of their capability.
Applied to vibe coding, the Agent's working environment consists of: CLAUDE.md clarity, architecture documentation quality, tool scaffolding, project structure comprehensibility, task description precision, and toolchain stability. These are not Agent problems — they are operator problems. Poor team infrastructure is not the employees' problem — it is the manager's problem.
A strong TL never concludes "this person isn't good enough" without first asking "have I created the right conditions for them." The same applies to vibe coding — when an Agent performs poorly, examine the environment first, question the model second.

Letting Go of Process Control

This brings us back to the initial discomfort.
Vibe coding requires relinquishing control over the process and instead achieving goals through mechanism design, result review, and rapid trial-and-error. For engineers who spent a decade building precise process control as a professional skill, this feels fundamentally wrong.
A useful reframing: the first programmers wrote assembly. Then assembly was used to create C, and assembly largely disappeared. Natural language is now replacing high-level programming languages. Giving up code-level process control can be understood as switching to a different "programming language."
Giving up process control does not mean giving up quality standards. It means shifting quality assurance from process review to mechanism design. Before execution, lay the tracks — Planner plans, Reviewer reviews, CLAUDE.md serves as the onboarding document for each new Agent. After execution, validate results — not how the code was written, only whether the output meets requirements. In between, embrace rapid trial-and-error — efficiency comes from iteration speed, not single-pass accuracy.
Traditional development resembles surgery — every cut must be precise. Vibe coding resembles sculpture — continuous shaping, correcting, and polishing until it becomes the intended form. The operator does not need to know where every cut goes in advance, but must know the final shape.
The skills that vibe coding truly requires have little to do with writing code:
  • Taste — knowing what a good result looks like, even without knowing how to implement it
  • Judgment — knowing when to trust the output, when to question it, when to kill the session
  • Systems thinking — designing mechanisms that enable Agent teams to self-operate
  • Expressiveness — the clarity of requirement descriptions directly determines output quality
Vibe Coding has nothing to do with "how to write code," but overlaps substantially with "how to manage a team."
And the team being managed happens to be composed of Agents.