The Context Engineering Ladder: Is plan.md a Bottleneck for Agentic Scaling?

AI agency isn't about smarter models—it's about the dynamism of the context engineering. From static prompts to RAG to agentic orchestration, each rung on the ladder is an analogue to a human cognitive process. And plan.md might be where the climb stalls.



Context engineering is the discipline of structuring the information, tools, and reasoning processes available to an AI model at inference time. The context engineering ladder maps the progression from static prompts (reflex) to RAG (memory) to tool use (procedural knowledge) to chain of thought (working memory) to agentic orchestration (environmental awareness) to heartbeats (interoception) to autonomy progression (metacognition). Each rung replaces a static scaffold with a dynamic process, increasing the agent's ability to navigate novel problems without human intervention.

If you've been building with AI for the last two years, you've lived through a quiet revolution in how we structure the information we give models—and it happened so fast that we barely have language for it yet.

In 2023, the state of the art was a well-crafted system prompt. You'd spend an afternoon wordsmithing instructions, paste in some examples, and hit send. By mid-2024, we were plugging vector databases into the context window so models could reference our own data. A few months later, chains and tool use let models call APIs and execute code. Then AI IDEs like Cursor showed up and gave models access to entire codebases, terminals, and file trees.

Each of these steps felt like a breakthrough. And each one was, in fact, the same breakthrough expressed at a higher level of dynamism: we kept replacing static scaffolds with dynamic context engineering.

Prompts gave way to just-in-time retrieval. Retrieval gave way to step-by-step reasoning and tool use. Fixed workflows gave way to agentic cycles where the model chooses its own next action. At every stage, the thing that changed wasn't the model—it was how much of the world the model was allowed to see, interact with, and reason about in real time.

This progression matters because it tracks directly to agency. I define effective agency as the extent to which a system can navigate novel problems without needing to phone home, because its criteria for problem solving are well suited to both the problem and your cognitive style. A static prompt can only be the briefest of snapshots of how you prefer problems to be solved, and a model contextualized only by that is on a very short leash indeed. A model that can read your codebase, run commands, check its own output, and decide what to do next has substantially more agency and, most critically, can be trusted to solve more challenging problems autonomously. And the innovations driving this increase in agency do not stem from treating teams of LLMs like humans in an organization. They come from building in processes that are analogues of human cognitive processes: working memory, procedural knowledge, self-monitoring, strategic reasoning.

That's the context engineering ladder: the level of agency an AI system can exhibit is tied to how well the architect has scaffolded the LLM with dynamic context processes that replicate the portions of human cognition LLMs aren't natively good at. And understanding where you are on the agentic context ladder, and where the current ceiling is, matters for anyone building agents that need to do more than follow a checklist.

Let's look at the agentic context ladder step by step (pun intended).

The Ladder

Rung 1: Static Prompting

Cognitive analogue: Reflex. Basic stimulus-response.

This is where it started for most of us. You compose a prompt—maybe with a system message, a few examples, and a question—and the model responds based on its training data plus whatever you wrote.

The model has no world state beyond those sentences. No memory of previous conversations (unless you manually paste them in). No awareness of what's happening in your project, your codebase, or your business. If the answer isn't derivable from the prompt text or the pre-training distribution, the model either fails or confabulates confidently.

The agency here is minimal. The model is a function: input in, output out. The human does all the reasoning about what to ask and how to frame it. The context window is a sealed envelope—whatever you put in is all the model gets.

The common failure mode with statically prompted LLMs is brittle specificity. If you forget to mention a constraint, the model ignores it. If you phrase the question slightly differently, you get a different answer. There's no resilience because there's no dynamism: the model can't go looking for what it needs.

Rung 2: RAG (Retrieval-Augmented Generation)

Cognitive analogue: Semantic memory. Long-term fact retrieval.

RAG was the first major step toward dynamic context. Instead of stuffing everything into the prompt manually, a retrieval system queries a vector database at inference time and injects relevant documents into the context window before the model generates its response.

This solved a real problem: models could now reference your specific data—internal docs, product specs, customer records—without fine-tuning. The context window was no longer limited to what a human remembered to include; it was populated by what a search system determined was relevant.

But the model itself is still passive in this arrangement. It doesn't choose, in a goal-oriented sense, what to retrieve or when. It doesn't know what it doesn't know. The retrieval system makes relevance decisions based on embedding similarity, which is a blunt instrument: it finds things that sound related, not things that are logically necessary for the current reasoning step. Humans, by contrast, use backward chaining: they work backward from a goal to a more proximal step, which can surface a tool or inference needed for the next stage of the problem. RAG only retrieves resources that are already adjacent to whatever the model was in the midst of doing.
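To make the "sounds related" problem concrete, here is a minimal sketch, assuming a toy bag-of-words vector as a stand-in for a real embedding model. The documents, the query, and the function names (`embed`, `cosine`, `retrieve`) are hypothetical illustrations, not any particular library's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query. The model has no say in this."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


# Hypothetical knowledge base.
DOCS = [
    "how to reset your password in the billing portal",
    "billing portal outage postmortem: root cause was an expired certificate",
]

query = "why can't customers log in to the billing portal"
top = retrieve(query, DOCS)
print(top)
```

In this toy setup the password-reset article wins on word overlap, even though the outage postmortem may be the document that actually explains the failure. The pipeline ranks by surface similarity, and nothing in it lets the model ask for a second search.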

The agency in 2024-style RAG systems is slight. Our LLM gained access to a library, but it's the librarian (the retrieval pipeline) who decides which books to pull off the shelf. If the retrieval misses a critical document, the model has no mechanism to say "I think I'm missing something, so let me search again with different terms."

A common failure mode is retrieval mismatch: the system surfaces documents that are topically adjacent but not actually useful for the question at hand, and the model dutifully synthesizes them into a plausible-sounding but wrong answer.

Rung 3: Chains and Tool Use (LangChain, n8n, etc.)

Cognitive analogue: Procedural memory. Knowing how to use a hammer.

This rung introduced the model to the outside world. Frameworks like LangChain, and workflow platforms like n8n, let developers wire models into sequences that include external tool calls: search engines, databases, APIs, code execution, calculators.

"Do A, then take that result and do B, then call this API with the output." The model interacts with real systems and feeds the results back into its own context. It can look up a stock price, run a SQL query, execute a Python function, and weave the results into its response.

This feels like genuine agency. The model is doing things in the world.

But the structure is still deterministic. The human architect pre-defines every step in the chain. The model follows a track laid down at design time. If the API returns an unexpected format, or the database schema changed, or the third step in the chain needs to be skipped—the rigid sequence breaks. The model has tools, but it doesn't get to decide which tool to use or when.
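A rough sketch of what "a track laid down at design time" looks like in code, assuming three hypothetical step functions (`fetch_price`, `parse_price`, `format_report`) and a stubbed API rather than any real framework or service:

```python
from typing import Any, Callable


def fetch_price(symbol: str) -> dict:
    # Hypothetical API stub; a real chain would call an external service here.
    return {"symbol": symbol, "price": "101.50"}


def parse_price(payload: dict) -> float:
    # Assumes the exact payload shape the architect saw at design time.
    return float(payload["price"])


def format_report(price: float) -> str:
    return f"Current price: ${price:.2f}"


def run_chain(chain: list[Callable[[Any], Any]], value: Any) -> Any:
    """Run every step in order. There is no branching, skipping, or retrying."""
    for step in chain:
        value = step(value)
    return value


report = run_chain([fetch_price, parse_price, format_report], "ACME")
print(report)


def fetch_price_v2(symbol: str) -> dict:
    # Same hypothetical API after a schema change: the price moved.
    return {"symbol": symbol, "quote": {"last": "101.50"}}


try:
    run_chain([fetch_price_v2, parse_price, format_report], "ACME")
    broke = False
except KeyError:
    # The chain cannot route around the new format; it simply fails.
    broke = True
print("chain broke:", broke)
```

The second run fails at the seam between the first and second steps, and a human has to edit `parse_price` before the chain works again.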

An n8n workflow that calls GPT-4, then a database, then a formatter, then an email sender is powerful for its specific use case. But it's an assembly line, not an agent. The moment the use case shifts—even slightly—the workflow needs a human to rebuild it.

The common failure mode is brittleness at the seams: each connection between steps is a potential point of failure, and the model has no ability to route around a broken link. The chain either completes or it doesn't.

Rung 4: Chain of Thought

Cognitive analogue: Working memory. Internal monologue.

"Think step by step." With this deceptively simple instruction, the context window transforms from a static input into a scratchpad. The model uses its own previous reasoning tokens to influence the next ones. It can hold a logical thread across multiple steps, spot inconsistencies in its own reasoning, and self-correct mid-stream.

Historically, chain-of-thought research predated some of the tooling frameworks, but as a context engineering pattern it represents a higher level of dynamism. Chains and tool use gave models hands; chain of thought gave them the ability to deliberate about what to do with those hands. The model isn't just executing a sequence—it's reasoning through a problem by externalizing its thought process into the context.

The agency increase is meaningful. The model now has a form of working memory—it can hold intermediate conclusions and build on them. It can notice when a line of reasoning leads to a contradiction. It can break complex problems into subproblems and tackle them sequentially. Where chains follow a track laid down by the developer, chain of thought lets the model lay its own track in real time.

The shift: the model's own outputs become inputs to its next step. The context window is no longer just a mailbox; it's a whiteboard. The model can think harder, reconsider, and self-correct—something a rigid chain cannot do.
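The feedback loop can be sketched as follows. This is a simplification under stated assumptions: `stub_model` is a scripted stand-in for an LLM call, and real chain-of-thought happens token by token inside a single generation rather than in an explicit outer loop. The loop only makes the "outputs become inputs" step visible.

```python
# Scripted thoughts standing in for what a model might generate (hypothetical).
SCRIPT = [
    "Step 1: 17 * 6 = 102",
    "Step 2: 102 + 8 = 110",
    "Answer: 110",
]


def stub_model(context: str) -> str:
    """Stand-in for an LLM: picks the next thought based on what is already written."""
    steps_so_far = sum(1 for line in context.splitlines() if line.startswith("Step "))
    return SCRIPT[steps_so_far]


def think(question: str, max_steps: int = 5) -> str:
    """Grow the context one thought at a time, feeding each thought back in."""
    context = f"Question: {question}\nThink step by step."
    for _ in range(max_steps):
        thought = stub_model(context)
        context += "\n" + thought  # the model's output becomes part of its next input
        if thought.startswith("Answer:"):
            break
    return context


scratchpad = think("What is 17 * 6 + 8?")
print(scratchpad)
```

Each call sees everything written so far, which is what lets a later step build on, or contradict, an earlier one.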

With CoT-forward AIs, though, a common failure mode is reasoning drift over long chains: the model starts strong but gradually loses coherence as the chain gets longer, especially when early assumptions need to be revised based on later reasoning.

Rung 5: Agentic Orchestration and AI IDEs

Cognitive analogue: Task execution. Environmental awareness.

This is Cursor, Windsurf, Claude Code, and the current generation of AI development tools. The model doesn't just suggest code—it reads the file structure, runs terminal commands, checks for lint errors, reviews build output, and iterates. The context window now includes the state of the environment: the codebase, the terminal output, the dependency tree, open files, recent changes.

The agency leap here is significant. The model can choose what to look at. It can decide to read a file, search the codebase, run a test, check an error, and then revise its approach based on what it finds. It's not following a pre-defined chain—it's navigating a problem space.
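One way to picture the difference from a fixed chain is an observe-decide-act loop. The sketch below is illustrative only: `choose_action` is a hand-written stand-in for the model's decision, the tool names are hypothetical, and the "codebase" is a single in-memory string. It does not reflect how Cursor, Windsurf, or Claude Code are implemented.

```python
def run_tests(env: dict) -> str:
    # Pretend test runner: fails while the bug is present.
    return "FAIL: add(2, 3) returned -1" if "a - b" in env["app.py"] else "PASS"


def read_file(env: dict) -> str:
    return env["app.py"]


def patch_file(env: dict) -> str:
    env["app.py"] = env["app.py"].replace("a - b", "a + b")
    return "patched"


TOOLS = {"run_tests": run_tests, "read_file": read_file, "patch_file": patch_file}


def choose_action(history: list[tuple[str, str]]) -> str:
    """Stand-in for the model: pick the next tool from the latest observation."""
    if not history:
        return "run_tests"
    tool, observation = history[-1]
    if tool == "run_tests":
        return "read_file" if observation.startswith("FAIL") else "done"
    if tool == "read_file":
        return "patch_file"
    return "run_tests"  # after a patch, verify it


def run_agent(env: dict, max_steps: int = 10) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        action = choose_action(history)
        if action == "done":
            break
        history.append((action, TOOLS[action](env)))  # observation feeds the next choice
    return history


env = {"app.py": "def add(a, b):\n    return a - b\n"}
history = run_agent(env)
print([tool for tool, _ in history])
```

The order of tool calls is not written down anywhere in advance. It falls out of what each observation says, which is the property the fixed chain lacks.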

And this is where the industry invented the plan.md.

The idea makes intuitive sense: give the agent a markdown file with a structured plan—numbered steps, acceptance criteria, a clear sequence—and let it work through the project methodically. It feels like the natural evolution. You're giving the AI a roadmap.

But there's a tension. We arrived at this rung precisely because dynamic context beats static context—and then we handed the agent a static document and told it to follow the steps in order.

For small, well-defined tasks and prototypes, this works fine. For larger projects with interlocking constraints, shifting requirements, and emergent complexity, plan.md can become the ceiling, especially as the project leaves the prototype stage and product managers layer in constraints.

Where plan.md Starts to Struggle

A main reason we don't have a Cambrian explosion of production applications (yet) is that LLMs shine in underdetermined problem contexts and struggle in overdetermined ones, where constraints must be traded off against each other, something that doesn't come natively to LLM-based AIs. If you view productionalization as the process of layering constraints into a system until it trades off functional and non-functional requirements in the desired fashion, it's not surprising that there aren't many successful vibecoded production apps. Reprompting the LLM against your plan.md is very likely to produce a frustrating game of requirements whack-a-mole, as the model silently drops a previous requirement to satisfy the one you're asking for now.

A plan written at time T=0 is a snapshot of a reality that begins changing the moment the agent starts working. The codebase evolves. Dependencies update. Assumptions get invalidated by what the agent discovers in step 3. A requirement that seemed clear in the plan turns out to be underspecified when you actually try to implement it.

Three specific patterns emerge:

The sunk-cost trap. Once a 10-step plan is loaded into the context, the model gravitates toward completing the sequence as written. If Step 3 produces an unexpected result that should change the approach for Steps 4-7, the model often plows ahead anyway—the plan is right there in the context, exerting gravitational pull on every subsequent token. This leads to cascading errors where each step faithfully builds on a foundation that was invalidated two steps ago.

Context bloat. As projects grow in complexity, the plan becomes a substantial wall of text competing for attention in the context window. The model is spending tokens on historical intent (what the human planned at T=0) rather than current state (what the codebase actually looks like right now). On the ladder of dynamism, this is a step backward—you're filling the context with a static artifact in a system whose power comes from dynamic context.

Constraint drift. This one is subtle but critical. Native LLMs are not strong at tracking multiple shifting constraints over time—especially when constraints interact or conflict. A plan might say "use the blue button" but the reason was an accessibility compliance requirement. Four steps later, a new constraint emerges that conflicts with the blue button. The model remembers the instruction but has lost the rationale, so it has no basis for making the tradeoff. It either follows the plan rigidly or makes an arbitrary choice.
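As a hedged illustration of why the rationale matters, the sketch below stores each constraint together with its reason. The `Constraint` class, the `reconcile` rule, and the competing "green button" request are all hypothetical; a real system would need a much richer policy than this three-way branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    instruction: str  # what the plan says to do
    rationale: str    # why it says so
    hard: bool        # True if the reason is non-negotiable (e.g. compliance)


def reconcile(existing: Constraint, incoming: Constraint) -> str:
    """Decide what to do when a new constraint conflicts with an existing one."""
    if existing.hard and incoming.hard:
        # Two non-negotiable reasons collide: surface the tradeoff to a human.
        return f"escalate: '{existing.rationale}' vs '{incoming.rationale}'"
    if existing.hard:
        return f"keep '{existing.instruction}' because {existing.rationale}"
    return f"adopt '{incoming.instruction}' because {incoming.rationale}"


blue_button = Constraint("use the blue button", "accessibility compliance", hard=True)
rebrand = Constraint("use the green button", "marketing refresh", hard=False)

decision = reconcile(blue_button, rebrand)
print(decision)
```

With only the instruction string, neither branch is justifiable. With the rationale attached, the choice stops being arbitrary.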

None of this means plan.md is useless. For well-scoped tasks with low ambiguity, it's a perfectly good scaffold. But as an approach to scaling agent autonomy, it hits a wall—because scaling requires the agent to adapt its strategy based on what it discovers, not just execute a predetermined sequence.

The question is: what comes after the plan?

Rung 6: Heartbeats and Continuous Operation

Cognitive analogue: Interoception. The autonomic nervous system.

Most AI interactions follow a call-and-response pattern. User prompts, AI responds, conversation ends. If the user doesn't initiate, the AI has no presence. It exists only in the moments between your keystrokes.

OpenClaw exploded onto the scene not only because it gave nerds a reason to buy Apple hardware, but because it exhibited more agency than productionalized systems from hyperscalers. It did so primarily by introducing a different model of context engineering: the heartbeat. A heartbeat is a proactive reasoning loop that runs at a regular cadence. Every cycle, the agent wakes up, surveys its environment, and runs through a structured self-assessment:

  1. State check: What has changed since my last pulse? New files? Build errors? Changed dependencies? Incoming messages?
  2. Constraint check: Do my current objectives still make sense given the new state? Have any assumptions been invalidated?
  3. Action check: Is this something I can resolve within my current autonomy level, or do I need to surface it to the human?
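The three checks above can be sketched as one pulse of a loop. This is a minimal sketch under assumptions: the state is a plain dictionary, assumptions are `(key, expected_value)` pairs, and the function names and sample data are hypothetical. It is not OpenClaw's implementation. In practice something like a scheduler or a sleep loop would call `heartbeat` at a regular cadence.

```python
def state_check(previous: dict, current: dict) -> dict:
    """1. What has changed since the last pulse?"""
    return {key: value for key, value in current.items() if previous.get(key) != value}


def constraint_check(changes: dict, assumptions: dict) -> list[str]:
    """2. Which assumptions do those changes invalidate?"""
    return [
        name
        for name, (key, expected) in assumptions.items()
        if key in changes and changes[key] != expected
    ]


def action_check(violations: list[str], fixable: set[str]) -> dict:
    """3. Resolve within the current autonomy level, or surface to the human?"""
    return {name: ("fix" if name in fixable else "escalate") for name in violations}


def heartbeat(previous: dict, current: dict, assumptions: dict, fixable: set[str]) -> dict:
    changes = state_check(previous, current)
    if not changes:
        return {}  # nothing moved, so nothing to do this pulse
    return action_check(constraint_check(changes, assumptions), fixable)


# Hypothetical snapshots taken one pulse apart.
previous = {"build": "passing", "dep_version": "1.4.0"}
current = {"build": "failing", "dep_version": "2.0.0"}
assumptions = {
    "build is green": ("build", "passing"),
    "dependency pinned to 1.4.0": ("dep_version", "1.4.0"),
}
fixable = {"dependency pinned to 1.4.0"}

decisions = heartbeat(previous, current, assumptions, fixable)
print(decisions)
```

The early return when nothing changed is one simple guard against acting on every pulse. A threshold on how large a change must be before it counts would slot into `state_check`.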

This is a fundamentally different operating model from either the call-and-response pattern or the plan.md checklist. The agent isn't waiting for instructions. It's continuously monitoring the delta between its goals and the state of the world.

Consider a concrete scenario: a deployment pipeline starts failing at 2 AM because of a dependency update. A plan-based agent sits idle until the human wakes up, reads the error, and prompts the agent to investigate. A heartbeat-based agent has already noticed the failure, checked the error logs, identified the breaking change, and either fixed it autonomously or prepared a summary with a proposed fix for the human to review.

The shift is from a history of chat to a real-time telemetry feed. The context window isn't a record of what was said; it's a continuously refreshed picture of what is.

The common failure mode at this rung is over-action: an agent that interprets every environmental change as requiring intervention, creating noise instead of signal. Effective heartbeat design requires calibration—not just what the agent monitors, but what threshold of change warrants action.

Rung 7: Autonomy Progression

Cognitive analogue: Executive function. Metacognition. Strategic reasoning.

This is the top of the ladder as it exists today. Using Levels of Autonomy Progression Prompting, the agent doesn't just execute tasks—it manages objectives. It evaluates its own competence relative to the current situation and adjusts its behavior accordingly.

The key question the agent asks at every decision point: "Is this a tradeoff I can resolve within my current autonomy level, or do I need to escalate?"

This plays out across a spectrum:

At Level 1, the agent is an observer: "Monitor this system and tell me if something breaks." It watches but doesn't act.

At Level 2, it's an analyst: "Summarize what happened and suggest options." It interprets but defers decisions to the human.

At Level 3, it's an executor: "Follow the standard approach, but stop and check in if you hit something unexpected." It acts within known patterns.

At Level 4, it's an orchestrator: "Manage this workstream. Only escalate unresolvable tradeoffs." It makes routine decisions independently and reserves human attention for genuine judgment calls.

At Level 5, it's a strategic partner: "Own this objective. Adapt the strategy as conditions change." It operates with broad latitude and a policy framework rather than a task list.
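One possible reading of these five levels as a decision policy is sketched below. The `Autonomy` enum, the `decide` function, and its two boolean inputs (`routine`, `resolvable`) are assumptions made for illustration. The article describes the levels in prose and does not prescribe this encoding.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    OBSERVER = 1
    ANALYST = 2
    EXECUTOR = 3
    ORCHESTRATOR = 4
    STRATEGIC_PARTNER = 5


def decide(level: Autonomy, routine: bool, resolvable: bool) -> str:
    """Map an autonomy level and a situation to a behavior.

    routine:    the situation matches a known, standard pattern
    resolvable: any tradeoff involved can be settled without a human
    """
    if level == Autonomy.OBSERVER:
        return "report"            # watches but doesn't act
    if level == Autonomy.ANALYST:
        return "suggest options"   # interprets, defers the decision
    if level == Autonomy.EXECUTOR:
        return "act" if routine else "check in"
    if level == Autonomy.ORCHESTRATOR:
        return "act" if resolvable else "escalate"
    return "act and adapt strategy"  # strategic partner: broad latitude


for level in Autonomy:
    print(level.name, "->", decide(level, routine=False, resolvable=True))
```

The same unexpected-but-resolvable situation produces five different behaviors depending on the level, so the level acts as a policy over decisions and not as a list of steps.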

The crucial distinction from plan.md is that the agent's behavior isn't defined by a fixed sequence of steps—it's defined by a policy that governs how the agent should reason about its own decisions. The prompt stops being a command and starts being a constitution. You're no longer telling the AI what to do. You're defining how it should decide what to do.

The common failure mode is miscalibrated autonomy: an agent operating at Level 4 when the situation calls for Level 2, making consequential decisions that should have been escalated. Getting the autonomy calibration right—and building in the self-awareness for the agent to recognize when it's out of its depth—is the core design challenge at this rung.

The Cognitive Architecture Argument

Here's the point that deserves more attention than it gets: the innovations on this ladder aren't about domain specialization. The difference between a good coding agent and a good research agent isn't that one was trained on code and the other on papers. It's that they use different cognitive architectures—different patterns of context engineering—applied to capable base models.

| AI Innovation | Human Cognitive Analogue | Function |
| --- | --- | --- |
| RAG | Semantic memory | Accessing facts on demand |
| Tool Use | Procedural memory | Knowing how to use instruments |
| Chain of Thought | Working memory | Holding a logic chain while solving |
| Heartbeats | Interoception | Monitoring internal state vs. external reality |
| Autonomy Progression | Metacognition | Thinking about thinking. Knowing when you know enough. |

Each rung is a process, not a dataset. The innovations that move us up the ladder are structural, not informational. They're about giving models the cognitive room to reason—not just more facts to reason about.

This has a practical implication: if you're trying to make an agent better at a complex task, the highest-leverage move probably isn't more training data or a more expensive model. It's upgrading the context architecture. Moving from a static plan to a heartbeat loop. Moving from a fixed workflow to autonomy progression. Giving the model the structural equivalent of working memory and self-monitoring, rather than a longer reading list.

The Constraint Problem (and What Comes Next)

There's a deeper issue lurking under all of this. As you move up the ladder—more tools, more autonomy, more environmental state flowing through the context—the model has to track an increasingly complex web of constraints, tradeoffs, and dependencies.

Native LLMs are not strong at this. They're attention-biased: recent tokens get more weight than distant ones. They lose track of constraints established 4,000 tokens ago. When two constraints conflict, they tend to hallucinate a middle ground that satisfies neither rather than explicitly surfacing the tradeoff.

A linear context window—even one enhanced with CoT and tool use—is fundamentally a one-dimensional pipe. It processes information sequentially. But real-world constraint tracking is multidimensional: a decision in the database layer affects a constraint in the API design, which creates a tradeoff in the frontend, which interacts with a business requirement from the product spec.

At some point, more thinking and more tools within a linear context aren't enough. What's needed is a dynamic context graph: a living, non-linear structure that tracks constraints, state, and tradeoffs as connected nodes rather than sequential text. A structure where the agent can traverse relationships between constraints, not just scroll through a history of what was said.
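As a very rough picture of "traversing relationships between constraints", the sketch below stores the database-to-product-spec example as a small directed graph and walks it breadth-first. The node names, the `EDGES` dictionary, and `affected_by` are assumptions for illustration only, and this is not necessarily the structure the next post describes.

```python
from collections import deque

# Hypothetical constraint graph: an edge means "a change here affects that".
EDGES = {
    "database schema": ["API design"],
    "API design": ["frontend"],
    "frontend": ["product spec requirement"],
    "product spec requirement": [],
}


def affected_by(node: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk: everything downstream of a changed node, in discovery order."""
    seen: list[str] = []
    queue = deque(edges.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(edges.get(current, []))
    return seen


impact = affected_by("database schema", EDGES)
print(impact)
```

A linear context would need the model to rediscover this chain by rereading history. With the relationships stored explicitly, finding the impact of a change is a lookup.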

That's the topic of the next post.

The Bottom Line

The context engineering ladder isn't just a taxonomy—it's a design guide. Each rung represents a real increase in the dynamism of the context, and with it, a real increase in the agency the model can exercise.

Static prompts got us started. RAG gave models access to our data. Tool use gave them hands. Chain of thought gave them working memory. AI IDEs gave them environmental awareness. Heartbeats gave them a pulse. Autonomy progression gave them judgment.

At each stage, the pattern is the same: replace a static scaffold with a dynamic process, and the model's capability jumps. The plan.md served its purpose—it was a natural step in the progression. But as the problems we hand to agents grow in complexity and duration, the static plan starts to show its limits. An n8n workflow or a numbered checklist might be outclassed by an agent with more intricate context flow—one that monitors, adapts, and reasons about its own strategy in real time.

The ladder keeps going. The question is what we build next.


FAQ

Q: What is the context engineering ladder?

The context engineering ladder is a framework for understanding how AI agent capability scales with the dynamism of its context — from static prompts to RAG to tool use to chain of thought to agentic orchestration to heartbeats to autonomy progression. Each rung is analogous to a human cognitive process.

Q: Why is plan.md a bottleneck for AI agents?

Plan.md becomes a bottleneck because it's a static document in a system whose power comes from dynamic context. It creates the sunk-cost trap (agents follow the plan even when invalidated), context bloat (historical intent competes with current state), and constraint drift (the model remembers instructions but loses the rationale).

Q: What are heartbeats in AI agent design?

Heartbeats are proactive reasoning loops that run at a regular cadence — the agent wakes up, surveys its environment, and assesses what has changed, whether objectives still make sense, and whether to act or escalate. This shifts from call-and-response to continuous monitoring.

Q: What is autonomy progression in AI agents?

Autonomy progression is the highest rung, where the agent manages objectives rather than tasks. It operates from observer (Level 1) to strategic partner (Level 5), evaluating its own competence at every decision point. The prompt becomes a constitution rather than a command.

