Red-Green Refactoring for AI Agents: How Cutline Stabilizes Your Vibecoding
Traditional TDD asks developers to write tests before code. Cutline's Red-Green Refactoring mode flips the script — the constraint graph writes the tests for you, turning every feature into a gauntlet of security, performance, and stability checks that the AI must pass.

Your AI agent just built a feature. It works. But does it rate-limit? Does it encrypt PII? Does it meet your p95 latency target? Traditional TDD would have caught these — if someone had written the tests. Cutline writes them for you.
Red-green refactoring for AI agents is a modified TDD approach where test specifications are auto-generated from a constraint graph rather than written by humans. The constraint graph knows your non-functional requirements — latency targets, security policies, compliance obligations — and generates test cases the AI agent must satisfy before a feature is considered complete. The tests are adversarial by design: they encode requirements the AI model would otherwise miss, because they come from product analysis, not from the model itself.
Test-driven development has a well-known problem: nobody does it.
Not really. Not the way Kent Beck described it. Developers skip the red phase because writing tests for code that doesn't exist yet is tedious, requires disciplined imagination, and feels like it slows you down. So they write code first, then retrofit tests — or more commonly, write no tests at all and rely on manual QA.
AI coding agents have made this worse, not better. When you're generating entire features in minutes, the gap between "code that works" and "code that's tested" widens with every prompt. The AI builds fast. Nobody writes the tests. The codebase grows. The regressions accumulate.
Cutline's Red-Green Refactoring mode solves this by removing the human from the test-writing loop entirely. The constraint graph already knows your non-functional requirements — your latency targets, your security policies, your compliance obligations. It uses those requirements to automatically generate test specifications that your AI agent must satisfy before a feature is considered complete.
No developer writes these tests. The graph writes them. The AI agent is the one who has to pass them.
Why TDD Breaks Down with AI Agents
Traditional TDD follows a strict loop:
- Red: Write a failing test that describes the desired behavior.
- Green: Write the minimum code to make the test pass.
- Refactor: Improve the code without changing behavior.
This works when a human is writing both the tests and the code, because the human holds the full context of what the system should do and what could go wrong. The test encodes the developer's understanding of the problem.
With AI agents, the loop falls apart at step 1. Who writes the failing test?
- The developer? Then you've lost the speed advantage of AI coding. You're back to writing test specifications by hand for every feature, and now you're writing them for an AI that generates code differently than you would.
- The AI agent? Then the same model that generates insecure code is generating the tests for that insecure code. It won't test for vulnerabilities it doesn't know to look for. The tests will pass — and the code will still be broken.
- Nobody? This is what usually happens. Features ship without tests. Regressions compound. The codebase becomes untouchable.
The core issue: the tests need to come from a source of truth that is independent of the code generator. The tests should encode the requirements of the system — not the assumptions of the model.
Constraint-Driven Test Generation
Cutline's constraint graph is that independent source of truth.
When you run a deep dive, Cutline analyzes your product and generates a constraint graph — a typed, connected network of entities (services, APIs, data stores) and constraints (security policies, performance targets, compliance requirements, cost ceilings). Each constraint is a concrete, enforceable requirement with a threshold:
- p95_latency_ms ≤ 500
- auth_required = true for all API endpoints
- pii_encryption = AES-256-GCM for user data stores
- cogs_per_request ≤ $0.03 for AI-backed endpoints
- rate_limit = 10/min for LLM proxy routes
These aren't suggestions. They're typed constraint nodes with operators, values, and units. And they're connected to the specific entities they govern — not floating in a wiki page nobody reads.
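To make "typed constraint nodes with operators, values, and units" concrete, here is a minimal sketch of what such a node could look like and how a measurement might be checked against it. The `ConstraintNode` shape, its field names, and `satisfiesConstraint` are assumptions made for illustration; the article does not show Cutline's actual schema.

```typescript
// Hypothetical shape for a typed constraint node. Field names are illustrative.
type ConstraintOperator = "<=" | ">=" | "=";

interface ConstraintNode {
  id: string;
  entityId: string; // the entity this constraint governs
  metric: string; // e.g. "p95_latency_ms"
  operator: ConstraintOperator;
  value: number | string | boolean;
  unit?: string;
}

// Returns true when an observed measurement satisfies the constraint.
function satisfiesConstraint(
  c: ConstraintNode,
  observed: number | string | boolean
): boolean {
  if (c.operator === "=") return observed === c.value;
  // Ordering operators only make sense for numeric values.
  if (typeof observed !== "number" || typeof c.value !== "number") return false;
  return c.operator === "<=" ? observed <= c.value : observed >= c.value;
}

const latencyConstraint: ConstraintNode = {
  id: "c-latency",
  entityId: "api:search",
  metric: "p95_latency_ms",
  operator: "<=",
  value: 500,
  unit: "ms",
};

const piiConstraint: ConstraintNode = {
  id: "c-pii",
  entityId: "store:users",
  metric: "pii_encryption",
  operator: "=",
  value: "AES-256-GCM",
};

console.log(satisfiesConstraint(latencyConstraint, 420)); // true
console.log(satisfiesConstraint(latencyConstraint, 650)); // false
console.log(satisfiesConstraint(piiConstraint, "AES-128-CBC")); // false
```

Because each node carries its own operator and threshold, the same check works for a latency ceiling and an encryption policy without special-casing either.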
When the Red-Green Refactoring mode activates, these constraints become test cases. Automatically.
How the Modified Red-Green Flow Works
Cutline's RGR mode adapts the classic red-green-refactor loop for AI agents working within a constraint graph. Here's the flow:
Phase 0: Complexity Assessment
Before any code is written, Cutline evaluates the constraint landscape for the entity being modified. It counts the total constraints, the number of critical constraints, and the number of distinct categories (security, performance, economics, compliance).
- Low complexity (≤4 constraints, ≤1 critical, ≤2 categories): single-pass execution. The AI gets all constraints at once and implements the feature in one shot.
- High complexity: phased execution. The feature is built incrementally, with each phase adding a layer of non-functional requirements.
This prevents context overload. An AI agent given 30 constraints at once tends to ignore many of them. An agent given 5 constraints at a time, with tests verifying each batch, is far more likely to satisfy all of them.
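The low-complexity rule above can be sketched as a small classifier. The thresholds (at most 4 constraints, 1 critical, 2 categories) come from the text; the `ScopedConstraint` type and the `assessComplexity` name are hypothetical, not Cutline's API.

```typescript
type ConstraintCategory = "security" | "performance" | "economics" | "compliance";

// Hypothetical minimal view of a constraint attached to the entity being modified.
interface ScopedConstraint {
  id: string;
  category: ConstraintCategory;
  critical: boolean;
}

type ExecutionMode = "single-pass" | "phased";

// Low complexity means: <= 4 constraints, <= 1 critical, <= 2 distinct categories.
function assessComplexity(constraints: ScopedConstraint[]): ExecutionMode {
  const total = constraints.length;
  const critical = constraints.filter((c) => c.critical).length;
  const categories = new Set(constraints.map((c) => c.category)).size;
  const isLow = total <= 4 && critical <= 1 && categories <= 2;
  return isLow ? "single-pass" : "phased";
}

const searchEndpointConstraints: ScopedConstraint[] = [
  { id: "auth_required", category: "security", critical: true },
  { id: "input_validation", category: "security", critical: false },
  { id: "p95_latency_ms", category: "performance", critical: false },
];

console.log(assessComplexity(searchEndpointConstraints)); // single-pass
```

Exceeding any one of the three limits is enough to switch to phased execution in this sketch.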
Phase 1: RED — Test Specification
This is where Cutline diverges most sharply from traditional TDD. Instead of asking a human to write failing tests, Cutline's resolveTestSpec function generates them from the constraint graph.
It pulls from two sources:
- Concrete test cases stored in the graph. These are GraphTestCase objects linked to specific entities (unit tests, integration tests, performance benchmarks), each with explicit assertions derived from constraint thresholds.
- Derived test cases auto-generated from threshold constraints. If a constraint says p95 latency must be under 500ms, Cutline creates a performance test case with that exact threshold, categorized under performance, linked to the constraint ID, tagged as an NFR phase test.
The result is a test specification — a concrete list of tests the AI agent must satisfy — before it writes a single line of feature code.
These tests are red. They fail. That's the point.
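Here is a rough sketch of how the two sources could be merged into one specification. Everything in it is an assumption for illustration: `SketchTestCase` stands in for the GraphTestCase shape (which the article does not show), and `sketchResolveTestSpec` is not Cutline's resolveTestSpec, just a toy with the same idea.

```typescript
// Hypothetical shapes. The real GraphTestCase type is not shown in this article.
interface ThresholdConstraint {
  id: string;
  entityId: string;
  metric: string;
  operator: "<=" | ">=";
  value: number;
  category: "security" | "performance" | "economics" | "compliance";
}

interface SketchTestCase {
  title: string;
  entityId: string;
  category: string;
  phase: "functional" | "nfr";
  constraintId?: string; // set only for derived cases
  assertion: string;
}

// Turn one threshold constraint into a derived, NFR-phase test case.
function deriveTestCase(c: ThresholdConstraint): SketchTestCase {
  return {
    title: `${c.metric} ${c.operator} ${c.value}`,
    entityId: c.entityId,
    category: c.category,
    phase: "nfr",
    constraintId: c.id,
    assertion: `expect(measured.${c.metric}) ${c.operator} ${c.value}`,
  };
}

// Merge stored cases with derived ones for a single entity.
function sketchResolveTestSpec(
  entityId: string,
  stored: SketchTestCase[],
  constraints: ThresholdConstraint[]
): SketchTestCase[] {
  const concrete = stored.filter((t) => t.entityId === entityId);
  const derived = constraints
    .filter((c) => c.entityId === entityId)
    .map(deriveTestCase);
  return concrete.concat(derived);
}

const storedCases: SketchTestCase[] = [
  {
    title: "returns matching users",
    entityId: "api:search",
    category: "functional",
    phase: "functional",
    assertion: "expect(res.status).toBe(200)",
  },
];

const thresholdConstraints: ThresholdConstraint[] = [
  {
    id: "c-latency",
    entityId: "api:search",
    metric: "p95_latency_ms",
    operator: "<=",
    value: 500,
    category: "performance",
  },
];

const searchSpec = sketchResolveTestSpec("api:search", storedCases, thresholdConstraints);
console.log(searchSpec.map((t) => `${t.phase}: ${t.title}`));
```

The derived case keeps a link back to its constraint ID, so a failing test can be traced to the requirement that produced it.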
Phase 2: GREEN — Functional Implementation
Now the AI agent writes the minimum code to make the functional tests pass. This is standard green-phase work: implement the feature, handle the happy path, satisfy the functional requirements. The NFR-tagged tests stay red until the refactor phases address them.
The key difference: the agent isn't just making its own tests pass. It's making the constraint graph's tests pass. Tests it didn't write. Tests that encode requirements the agent wouldn't have considered on its own — because they come from a pre-mortem analysis, not from a prompt.
Phases 3–5: REFACTOR — Security, Performance, Economics
Here's where the constraint graph earns its keep. The refactor phase is split into three targeted passes:
Security refactor. The agent receives only the security and compliance constraints — auth requirements, encryption standards, input validation rules, audit logging obligations. It refactors the functional code to satisfy these constraints. Tests are re-run.
Performance refactor. The agent receives performance and infrastructure constraints — latency targets, throughput requirements, caching policies, database indexing rules. It optimizes accordingly. Tests are re-run.
Economics refactor. The agent receives cost and pricing constraints — COGS ceilings, API call budgets, resource utilization targets. It adjusts for cost efficiency. Tests are re-run.
Each refactor phase narrows the constraint set to a specific domain. The agent focuses on one class of NFR at a time, and the tests verify compliance at each step. Nothing is lost between phases.
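The narrowing step can be pictured as a simple partition of the constraint set. This sketch assumes the category-to-phase mapping described above (security and compliance, then performance and infrastructure, then cost and pricing); the type and function names are hypothetical.

```typescript
type NfrCategory =
  | "security"
  | "compliance"
  | "performance"
  | "infrastructure"
  | "cost"
  | "pricing";

type RefactorPhase = "security" | "performance" | "economics";

// Order in which the refactor passes run.
const PHASE_ORDER: RefactorPhase[] = ["security", "performance", "economics"];

// Which refactor pass handles which constraint category.
const PHASE_OF_CATEGORY: Record<NfrCategory, RefactorPhase> = {
  security: "security",
  compliance: "security",
  performance: "performance",
  infrastructure: "performance",
  cost: "economics",
  pricing: "economics",
};

interface PhasedConstraint {
  id: string;
  category: NfrCategory;
}

interface RefactorPass {
  phase: RefactorPhase;
  constraints: PhasedConstraint[];
}

// Group constraints into ordered passes. Passes with nothing to do are skipped.
function planRefactorPasses(all: PhasedConstraint[]): RefactorPass[] {
  return PHASE_ORDER.map((phase) => ({
    phase,
    constraints: all.filter((c) => PHASE_OF_CATEGORY[c.category] === phase),
  })).filter((pass) => pass.constraints.length > 0);
}

const refactorPlan = planRefactorPasses([
  { id: "audit_logging", category: "compliance" },
  { id: "p95_latency_ms", category: "performance" },
  { id: "auth_required", category: "security" },
]);

console.log(refactorPlan.map((p) => `${p.phase}: ${p.constraints.length}`));
```

Skipping empty passes is a choice made for this sketch; a real pipeline might still run the existing tests in every phase.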
What This Means in Practice
Consider a concrete example. You're building a user search endpoint. Without Cutline, the AI generates:
app.get('/api/users/search', async (req, res) => {
  const { query } = req.query;
  const users = await db.users.find({ name: { $regex: query } });
  res.json(users);
});
It works. It's also missing auth, missing input validation, vulnerable to ReDoS via the unescaped regex, has no rate limiting, no pagination, leaks full user objects (including email and hashed passwords), and has no latency guardrails.
With Cutline's RGR mode, the constraint graph generates test specifications before this code is written:
- auth_required: Request without valid token returns 401
- input_validation: Query parameter validated (max length 200, sanitized)
- rate_limit ≤ 60/min: 61st request within window returns 429
- pii_filter: Response excludes passwordHash and email unless requester has admin role
- p95_latency_ms ≤ 200: Search completes within 200ms at 1000-record dataset
- pagination_required: Response includes cursor and limit, max 50 results per page
The AI agent must satisfy all of these. Not because a developer remembered to check. Because the constraint graph requires it.
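To show what satisfying a few of these specs might involve, here is a sketch of framework-independent helpers a hardened handler could call. The helper names and the `UserRecord` shape are hypothetical. One deliberate assumption: this sketch never returns passwordHash, even to admins, which is stricter than the pii_filter line above.

```typescript
interface UserRecord {
  id: string;
  name: string;
  email: string;
  passwordHash: string;
}

interface PublicUser {
  id: string;
  name: string;
  email?: string;
}

const MAX_QUERY_LENGTH = 200; // from the input_validation spec
const MAX_PAGE_SIZE = 50; // from the pagination_required spec

// Escape regex metacharacters so user input is matched literally (avoids ReDoS).
function escapeRegex(input: string): string {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns a sanitized query, or null when the request should be rejected.
function validateQuery(raw: unknown): string | null {
  if (typeof raw !== "string") return null;
  const trimmed = raw.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_QUERY_LENGTH) return null;
  return escapeRegex(trimmed);
}

// Clamp a requested page size to the allowed maximum.
function clampLimit(requested: number | undefined): number {
  if (requested === undefined || !isFinite(requested) || requested < 1) {
    return MAX_PAGE_SIZE;
  }
  return Math.min(Math.floor(requested), MAX_PAGE_SIZE);
}

// Strip sensitive fields. Only admins see email; nobody sees passwordHash.
function toPublicUser(user: UserRecord, requesterIsAdmin: boolean): PublicUser {
  const base: PublicUser = { id: user.id, name: user.name };
  return requesterIsAdmin ? { ...base, email: user.email } : base;
}

console.log(validateQuery("a.*b")); // a\.\*b
console.log(clampLimit(500)); // 50
```

Auth, rate limiting, and the latency target are left out here because they depend on the surrounding framework and infrastructure.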
The Retrofit Trap: Why Bolting Security On Later Breaks Everything
There's a popular alternative to constraint-driven testing: build first, secure later. Ship the feature, then run a security scan, fix what it finds, and move on.
This is how most vibecoded projects handle security. And it's a direct path to whack-a-mole.
Here's what happens. You build 15 features over two weeks of vibecoding. The codebase is working. Users can sign up, search products, manage their account, process payments. Then you run a security vibe check — or a technical co-founder joins, or an investor asks about your security posture — and you discover 40 issues.
You hand the AI agent a list: "Fix all of these." The agent starts refactoring. It adds auth middleware globally — and breaks three webhooks that need to be public. It adds input validation to the user endpoint — and the frontend form stops submitting because the validation schema is stricter than what the UI sends. It adds rate limiting — and the background job that syncs data from a third-party API starts getting throttled by its own rate limiter.
Every fix introduces a new break. You're not improving the codebase. You're destabilizing it. This is the whack-a-mole problem, and it has a structural cause: security was never part of the architecture. It's being forced onto a codebase that wasn't designed for it.
The all-in-one security retrofit fails because:
- It's too much context at once. Forty issues across fifteen features requires understanding every interaction in the system. The AI agent doesn't have that context. It fixes each issue in isolation, unaware that the fix conflicts with another feature.
- It's adversarial to the existing code. The codebase was built without auth boundaries, without input contracts, without rate-limit awareness. Adding them after the fact means changing assumptions that dozens of components rely on.
- It creates merge-conflict hell. Every security fix touches the same middleware, the same route files, the same request pipeline. Fixes conflict with each other. You're resolving merge conflicts in code you don't fully understand.
Cutline's RGR mode avoids this entirely by incorporating NFRs during feature development, not after. There's no retrofit phase. Every feature is born with its security, performance, and stability constraints already satisfied.
Generic Security Tools Don't Know Your Product
Even when teams do run security checks early, they typically reach for generic tools: ESLint security plugins, Snyk, SonarQube, OWASP ZAP. These tools are valuable — they catch known vulnerability patterns across any codebase.
But they share a fundamental limitation: they don't know your product.
A generic scanner can tell you that a wildcard CORS policy is permissive. It can't tell you that your specific product serves an embeddable widget that requires permissive CORS on exactly two endpoints, while every other endpoint should be locked to your domain. It flags both the intentional exception and the actual vulnerability identically.
A generic scanner can tell you that an endpoint lacks rate limiting. It can't tell you that your AI-backed endpoints cost $0.03 per call and need a 10/min limit to keep COGS under control, while your static asset endpoints can safely handle 1,000/min. It doesn't know your unit economics.
A generic scanner can tell you that user data isn't encrypted. It can't tell you that your specific compliance requirements (SOC 2 Type II) require encryption at rest and audit logging and data retention policies and right-to-deletion support — and that all four must be implemented together because they interact.
This is the gap between generic security and product-aware security:
What a generic scanner says: "This endpoint has no auth." What the constraint graph says: "This endpoint is part of the billing service, which is governed by auth-required and PCI compliance constraints."
What a generic scanner says: "This query might be slow." What the constraint graph says: "This query serves the product search page, which has a p95 latency target of 200ms tied to the instant-results product requirement."
What a generic scanner says: "User data is unencrypted." What the constraint graph says: "The users data store contains PII classified as sensitive, requiring AES-256-GCM encryption, 90-day retention, and audit-log-on-access per the SOC 2 constraint chain."
The constraint graph doesn't just flag problems — it provides the specific, threshold-level requirements that apply to this entity in this product. The AI agent doesn't get "add security." It gets "this endpoint requires auth via Firebase ID token verification, rate limiting at 10 requests per minute per authenticated user, input validation rejecting queries over 200 characters, and PII filtering that excludes email and passwordHash from non-admin responses." That's actionable. That's testable. That's what gets implemented correctly on the first pass.
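As one illustration of a threshold-level requirement, here is a sketch of a per-user rate limiter whose limit depends on the class of route. The 10/min and 1,000/min figures are the ones quoted earlier in this section; the fixed-window algorithm, the `RouteClass` names, and the `FixedWindowLimiter` class are assumptions chosen for brevity, not a description of how Cutline or any particular library enforces limits.

```typescript
type RouteClass = "llm_proxy" | "static_asset";

// Limits per authenticated user per minute. Class names are illustrative.
const LIMIT_PER_MINUTE: Record<RouteClass, number> = {
  llm_proxy: 10,
  static_asset: 1000,
};

const WINDOW_MS = 60 * 1000;

interface WindowState {
  windowStart: number;
  count: number;
}

class FixedWindowLimiter {
  private windows = new Map<string, WindowState>();

  // Returns the HTTP status to send: 200 to proceed, 429 to reject.
  check(userId: string, route: RouteClass, nowMs: number): 200 | 429 {
    const key = `${route}:${userId}`;
    const state = this.windows.get(key);
    // Start a fresh window on first use or once the old window has expired.
    if (state === undefined || nowMs - state.windowStart >= WINDOW_MS) {
      this.windows.set(key, { windowStart: nowMs, count: 1 });
      return 200;
    }
    if (state.count >= LIMIT_PER_MINUTE[route]) return 429;
    state.count += 1;
    return 200;
  }
}

const demoLimiter = new FixedWindowLimiter();
const demoStatuses: number[] = [];
for (let i = 0; i < 11; i++) {
  // Eleven requests from one user, one second apart, inside a single window.
  demoStatuses.push(demoLimiter.check("user-1", "llm_proxy", i * 1000));
}
console.log(demoStatuses[9], demoStatuses[10]); // 200 429
```

The point is that the limit is data attached to the route class, so the same code enforces very different budgets for an AI-backed endpoint and a static asset.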
Why Phased NFR Incorporation Works with LLMs
There's a deeper reason the phased approach works, and it has to do with how LLMs process instructions.
Large language models have finite attention. The context window might be 128K tokens, but the model's effective attention (the instructions it actually follows faithfully) is much smaller. In practice, LLMs follow instructions most reliably when they're focused, specific, and non-contradictory. When you give a model 30 requirements simultaneously, it doesn't weigh them equally. It tends to satisfy the ones closest to its training patterns and quietly drop the rest.
This is why "build this feature and make it secure and fast and cost-efficient" produces worse results than three separate prompts: "build this feature," "now make it secure," "now make it fast."
Cutline's phased RGR exploits this directly. Instead of dumping every constraint into a single prompt, it sequences them:
Phase 2 (functional): The agent focuses entirely on making the feature work. No security distractions. No performance targets. Just the functional spec and its tests. This is what LLMs are already good at: building things that work.
Phase 3 (security): The functional code is complete and tested. Now the agent receives only security constraints. It can focus entirely on auth, validation, encryption, and audit logging. It doesn't need to worry about breaking the feature, because the functional tests from Phase 2 are still running. If a security change breaks functionality, the test suite catches it immediately.
Phase 4 (performance): Security is locked in. Now the agent optimizes. It can add caching, indexing, and query optimization without worrying about accidentally removing auth checks or validation, because the security tests from Phase 3 are still running.
Phase 5 (economics): Everything works, it's secure, it's fast. Now the agent right-sizes resources: swap GPT-4 for GPT-3.5 where quality thresholds allow, add response caching to reduce API calls, implement batch processing for bulk operations. The functional, security, and performance tests all keep running.
Each phase has a narrow mandate and a growing safety net. The constraint set is small enough for the LLM to handle faithfully. The test suite is comprehensive enough to catch regressions. And because each phase's tests persist into subsequent phases, nothing is lost.
This is the opposite of the retrofit approach. Instead of one massive "fix everything" pass that destabilizes the codebase, you get four focused passes that each leave the system in a verifiably better state. The LLM never has to hold more than one class of NFR in its attention at a time — and the tests ensure that satisfying the current class doesn't violate the previous ones.
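The "growing safety net" can be sketched as a loop that applies one phase's change, then re-runs every check accumulated so far. In this toy model the agent's edit is simulated by a function over a small `FeatureState` record; all names here are hypothetical, and the regression in the performance step is planted on purpose.

```typescript
type PipelinePhase = "functional" | "security" | "performance" | "economics";

// Toy stand-in for the codebase: a few flags the checks can inspect.
interface FeatureState {
  works: boolean;
  hasAuth: boolean;
  cached: boolean;
}

type StateCheck = (s: FeatureState) => boolean;

interface PhaseStep {
  phase: PipelinePhase;
  apply: (s: FeatureState) => FeatureState; // stands in for the agent's edit
  checks: StateCheck[]; // tests introduced by this phase
}

interface PhaseResult {
  phase: PipelinePhase;
  checksRun: number;
  passed: boolean;
}

function runPhases(initial: FeatureState, steps: PhaseStep[]): PhaseResult[] {
  const results: PhaseResult[] = [];
  let suite: StateCheck[] = [];
  let state = initial;
  for (const step of steps) {
    state = step.apply(state);
    suite = suite.concat(step.checks); // earlier checks stay in the suite
    const passed = suite.every((check) => check(state));
    results.push({ phase: step.phase, checksRun: suite.length, passed });
    if (!passed) break; // stop at the first regression
  }
  return results;
}

const phaseResults = runPhases({ works: false, hasAuth: false, cached: false }, [
  { phase: "functional", apply: (s) => ({ ...s, works: true }), checks: [(s) => s.works] },
  { phase: "security", apply: (s) => ({ ...s, hasAuth: true }), checks: [(s) => s.hasAuth] },
  {
    phase: "performance",
    // Simulated bad optimization: adds caching but drops the auth check.
    apply: (s) => ({ ...s, cached: true, hasAuth: false }),
    checks: [(s) => s.cached],
  },
]);

console.log(phaseResults.map((r) => `${r.phase}: ${r.checksRun} checks, passed=${r.passed}`));
```

The performance step satisfies its own check but still fails, because the security check from the previous phase is part of the suite it has to pass.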
Why This Works Better Than AI-Generated Tests
"Why not just ask the AI to write its own tests?"
Because the AI has the same blind spots in its tests as in its code. If the model doesn't think to add rate limiting, it won't think to test for rate limiting either. AI-generated tests validate what the AI intended, not what the system requires.
Cutline's tests come from the constraint graph — a source of truth derived from pre-mortem analysis, product deep dives, and explicitly defined NFRs. The test specifications are independent of the code generator. They're adversarial by design.
This is the fundamental insight: the value of TDD is not in writing tests. It's in having tests that encode requirements the implementer would otherwise miss. When the implementer is an AI agent, those requirements must come from outside the model.
Every Feature Gets Safer, Faster, Cheaper
The compounding effect is what makes this powerful. Every feature that goes through the RGR pipeline doesn't just get built — it gets hardened against every NFR in the constraint graph.
Add a new endpoint → it inherits the auth, rate-limiting, and input validation constraints of its parent service. Add a new data store → it inherits the encryption and audit-logging constraints of the data classification. Add a new AI-backed feature → it inherits the COGS ceiling and fallback requirements of the product's economics model.
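Inheritance like this can be modeled as a walk up the entity's parent chain, collecting constraints along the way. The sketch below assumes each entity has at most one parent and refers to constraints by ID; the `GraphEntity` shape and `effectiveConstraints` are hypothetical names, not Cutline's data model.

```typescript
// Hypothetical entity record: an optional parent plus constraints attached directly.
interface GraphEntity {
  id: string;
  parentId?: string;
  constraintIds: string[];
}

// Collect an entity's own constraints plus everything inherited from its ancestors.
function effectiveConstraints(
  entityId: string,
  entities: Map<string, GraphEntity>
): string[] {
  const collected: string[] = [];
  const visited = new Set<string>(); // guards against accidental cycles
  let current = entities.get(entityId);
  while (current !== undefined && !visited.has(current.id)) {
    visited.add(current.id);
    for (const id of current.constraintIds) {
      if (collected.indexOf(id) === -1) collected.push(id);
    }
    current =
      current.parentId === undefined ? undefined : entities.get(current.parentId);
  }
  return collected;
}

const entityGraph = new Map<string, GraphEntity>([
  ["svc:billing", { id: "svc:billing", constraintIds: ["auth_required", "rate_limit"] }],
  [
    "api:billing/invoices",
    {
      id: "api:billing/invoices",
      parentId: "svc:billing",
      constraintIds: ["input_validation"],
    },
  ],
]);

console.log(effectiveConstraints("api:billing/invoices", entityGraph));
```

A new endpoint added under the billing service picks up the service's constraints without anyone listing them again, which is what lets the test surface expand automatically.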
The constraint graph grows as the product grows. The test surface expands automatically. Security, stability, and scalability improve with every feature — not because someone remembered, but because the architecture requires it.
The End of "We'll Add Tests Later"
Every engineering team has a backlog item that says "improve test coverage." It's been there for months. It'll be there for months more.
Cutline's RGR mode eliminates this category of debt. Tests aren't a chore that follows development — they're a structural prerequisite that precedes it. The constraint graph generates them. The AI agent satisfies them. The developer reviews the result.
The human's job shifts from "write tests" to "define requirements." From implementation to governance. That's the right abstraction for AI-assisted development: humans decide what matters, machines ensure it happens.
FAQ
Q: What is red-green refactoring for AI agents?
Red-green refactoring for AI agents is a modified TDD approach where test specifications are auto-generated from a constraint graph rather than written by humans. The constraint graph generates test cases the AI must satisfy — encoding requirements the model would otherwise miss.
Q: Why doesn't traditional TDD work with AI coding agents?
Traditional TDD breaks down because no one writes the failing tests. If the developer writes them, you lose the speed advantage. If the AI writes them, the same model that generates insecure code generates tests that won't catch those insecurities. The tests must come from a source independent of the code generator.
Q: What is constraint-driven testing?
Constraint-driven testing generates test specs automatically from typed constraints in a product's constraint graph. If a constraint says p95 latency must be under 500ms, the system creates a performance test with that threshold — derived from product requirements, not the AI model's assumptions.
Q: Why is phased NFR incorporation better than all-at-once security retrofits?
Phased incorporation works because LLMs handle focused, specific instructions better than 30 requirements simultaneously. The approach sequences functional implementation, then security, then performance, then economics — each phase adding tests that persist into subsequent phases. A single "fix everything" pass destabilizes the codebase.
Cutline's Red-Green Refactoring mode generates test specifications from your constraint graph and enforces them on every AI-generated feature. Security, performance, and cost compliance: verified automatically.