INSIGHT 22: Feature Work Fails at Planning and Constraints

The most important correction to my earlier "agents need better codebases" framing is that not all coding tasks fail in the same place. Bug-fix tasks often fail at localization and patch correctness. Feature work fails more often at planning, constraint satisfaction, and step fidelity. The agent may write a patch, apply it cleanly, and still miss the actual feature shape.

This note combines four papers because none of them alone is enough:

  • RACE-bench measures feature-addition reasoning at file/task/step granularity.
  • Constraint Decay measures what happens as backend code generation is forced to satisfy production-like structural constraints.
  • CODETASTE measures large real refactorings, comparing detailed instructions against vague "open track" focus areas.
  • FeatureBench measures repository-scale feature restoration with executable environments and tests.

Together they support a precise claim: feature-friendly codebases must expose extension points, constraints, examples, and acceptance tests. Otherwise the agent is forced to invent a plan and then satisfy hidden architecture rules while editing.

Plot-ready data lives in presentations/write-code-ai-agents-love/research/data/feature_constraints_planning.csv.

Source map

| Ref | Source | Local text | Why it matters |
|---|---|---|---|
| R68 | RACE-bench | paper-text/race-bench-feature-addition-2603.26337.txt | Measures feature-addition planning quality and reasoning recall. |
| R71 | Constraint Decay | paper-text/constraint-decay-2605.06445.txt | Controlled backend experiment showing structural constraints degrade agent success. |
| R72 | CODETASTE | paper-text/codetaste-2603.04177.txt | Real large refactoring benchmark with tests plus static rules. |
| R44 | FeatureBench | paper-text/featurebench-2602.10975.txt | Feature implementation remains hard even for strong agents. |

RACE-bench: patches apply before reasoning is correct

RACE-bench is valuable because it does not stop at "resolved or not." It asks whether the agent identified the correct files, tasks, and implementation steps. That matters because feature work is not just locating a bug line. A feature requires a small plan: add this interface, call that service, update this persistence path, add this test, preserve that old behavior.
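To make the three granularities concrete, here is a minimal sketch of set-overlap recall and over-prediction for one task. It is an assumption on my part, not RACE-bench's scoring definition: the file paths, step names, and both helper functions are hypothetical, and the paper may match plan items differently.

```python
def recall(predicted: set[str], reference: set[str]) -> float:
    """Fraction of reference items that the agent's plan also named."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)


def over_prediction(predicted: set[str], reference: set[str]) -> float:
    """Fraction of predicted items that are not in the reference plan."""
    if not predicted:
        return 0.0
    return len(predicted - reference) / len(predicted)


# Hypothetical reference plan and agent plan for one feature task.
reference = {
    "files": {"api/routes.py", "services/billing.py", "tests/test_billing.py"},
    "steps": {"add interface", "call service", "update persistence", "add test"},
}
agent = {
    "files": {"api/routes.py", "services/billing.py", "utils/helpers.py"},
    "steps": {"add interface", "add test", "rename module"},
}

scores = {level: recall(agent[level], reference[level]) for level in reference}
extra = {level: over_prediction(agent[level], reference[level]) for level in reference}
print(scores, extra)
```

In this toy case the agent finds two of three files but only two of four steps, which is the same shape as the paper's pattern: a plan can look right at file level and still be half missing at step level.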

The striking pattern is the gap between patch application and resolution. Some systems produce patches that apply at very high rates: AutoCodeRover's patches apply 96.21% of the time but resolve only 28.79% of tasks. A patch that applies does not mean the feature works. The agent can manipulate the repository successfully while still misunderstanding the change.

RACE-bench data copied from the paper

| Agent | Patch apply | Resolved | Gap: apply - resolved |
|---|---|---|---|
| AutoCodeRover | 96.21% | 28.79% | 67.42 pp |
| TraeAgent | 78.98% | 52.65% | 26.33 pp |
| mini-SWE-Agent | 95.83% | 70.08% | 25.75 pp |

| mini-SWE-Agent reasoning level | Recall |
|---|---|
| Files | 0.890 |
| Tasks | 0.751 |
| Steps | 0.445 |

| Failure comparison | Value |
|---|---|
| Applied-but-failing patches: lower reasoning recall vs success | 35.7% lower |
| Applied-but-failing patches: higher over-prediction vs success | 94.1% higher |

Source trace: R68, paper-text/race-bench-feature-addition-2603.26337.txt.
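The gap column and the recall drop-off can be recomputed from the copied numbers. This is only a sanity-check sketch; the variable names are mine, and "retained" is my own label for the ratio between adjacent recall levels, not a metric from the paper.

```python
# Percentages copied from the RACE-bench table above: (patch apply, resolved).
results = {
    "AutoCodeRover": (96.21, 28.79),
    "TraeAgent": (78.98, 52.65),
    "mini-SWE-Agent": (95.83, 70.08),
}
gaps = {
    agent: round(applied - resolved, 2)
    for agent, (applied, resolved) in results.items()
}

# mini-SWE-Agent reasoning recall at each granularity.
recall_by_level = {"files": 0.890, "tasks": 0.751, "steps": 0.445}
levels = list(recall_by_level)

# Share of recall that survives each move to a finer granularity.
retained = {
    f"{coarse}->{fine}": round(recall_by_level[fine] / recall_by_level[coarse], 3)
    for coarse, fine in zip(levels, levels[1:])
}
print(gaps)
print(retained)
```

The steepest loss is between tasks and steps: only about 59% of task-level recall survives to step level, compared with about 84% from files to tasks.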

Chart sketch: RACE-bench apply vs resolved

xychart-beta
    title "RACE-bench: patch application is not resolution"
    x-axis ["AutoCodeRover", "TraeAgent", "mini-SWE-Agent"]
    y-axis "Percent" 0 --> 100
    bar "Patch apply" [96.21, 78.98, 95.83]
    bar "Resolved" [28.79, 52.65, 70.08]

The inference for the talk: "it made a patch" is a dangerously low bar. Codebases should help agents preserve the plan as it gets more specific. That means visible examples, acceptance tests, and structured task specs, not just a natural language feature request.
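One possible shape for such a structured task spec is sketched below. Everything here is hypothetical (the `TaskSpec` class, its field names, and the example paths); none of the four papers prescribes this format. The point is only that files, steps, and tests are listed explicitly instead of being left for the agent to reconstruct.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """A feature request broken into the levels an agent must keep track of."""

    title: str
    files: list[str]
    steps: list[str]
    acceptance_tests: list[str]
    must_keep_passing: list[str] = field(default_factory=list)

    def checklist(self) -> str:
        """Render the spec as a plain-text checklist for a prompt or a PR."""
        lines = [f"Feature: {self.title}", "Files:"]
        lines += [f"  [ ] {path}" for path in self.files]
        lines.append("Steps:")
        lines += [f"  [ ] {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines.append("Acceptance tests (must go from failing to passing):")
        lines += [f"  [ ] {test}" for test in self.acceptance_tests]
        lines.append("Existing tests (must keep passing):")
        lines += [f"  [ ] {test}" for test in self.must_keep_passing]
        return "\n".join(lines)


# Hypothetical feature used only for illustration.
spec = TaskSpec(
    title="Add article favoriting",
    files=["api/articles.py", "services/favorites.py", "tests/test_favorites.py"],
    steps=[
        "add the favorite endpoint",
        "call the favorites service",
        "persist the favorite count",
        "add the acceptance test",
    ],
    acceptance_tests=["tests/test_favorites.py::test_favorite_increments_count"],
    must_keep_passing=["tests/test_articles.py"],
)
print(spec.checklist())
```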

Constraint Decay: production structure is the hard part

Constraint Decay fixes one API contract based on the RealWorld Conduit API, then varies non-functional constraints: framework, architecture, database backend, and ORM integration. These are exactly the constraints real software imposes. We rarely want "any working backend." We want the backend to use our framework, architecture, database, data-access rules, auth conventions, and test setup.

The paper's main result is brutal: capable configurations lose about 30 percentage points of assertion pass rate from L0 to L3. That is not a small style preference. It is the measurable cost of making code production-shaped.

Constraint Decay data copied from the paper

| Measurement | Value | Interpretation |
|---|---|---|
| Greenfield generation tasks | 80 | Controlled combinations across frameworks/constraints. |
| Feature implementation tasks | 20 | Existing-codebase sanity check. |
| API operations in contract | 19 | Non-trivial CRUD-style backend surface. |
| Assertions in test suite | 291 | Behavioral checks for API behavior. |
| Capable-config L0 -> L3 A% drop | 30 pp | Structural constraints materially reduce success. |
| Relative loss from baseline | 40% | Constraint cost is large relative to baseline performance. |
| Full-set vs subset Pearson correlation | 0.98 | Cost-reduced subset tracked full benchmark well. |
| Full-set vs subset Spearman correlation | 0.95 | Rank ordering also tracked well. |
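The two headline figures can be cross-checked against each other. This is a rough sketch that assumes "relative loss" means the drop divided by the L0 pass rate, and both inputs are rounded in my notes, so the implied baseline is an approximation I derived and not a number copied from the paper.

```python
drop_pp = 30.0        # capable-config L0 -> L3 drop, in percentage points
relative_loss = 0.40  # share of the L0 baseline that the drop represents (assumed)

# If relative_loss = drop_pp / baseline, the baseline follows directly.
implied_l0 = drop_pp / relative_loss
implied_l3 = implied_l0 - drop_pp

print(f"implied L0 pass rate: about {implied_l0:.0f}%")
print(f"implied L3 pass rate: about {implied_l3:.0f}%")
```

Under that assumption capable configurations start at roughly 75% and end near 45%, which is why a 30 pp drop reads as a 40% relative loss.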

Marginal constraint effects copied from the paper

| Constraint | Average A% effect |
|---|---|
| Clean architecture | -9.1 pp |
| PostgreSQL | -19.3 pp |
| SQLite | -14.3 pp |
| SQLAlchemy | -1.5 pp |
| Sequelize | -0.6 pp |

The database result matters. It says the agent problem is not only syntax or routing. Data-layer interaction is a core failure surface: query composition, ORM runtime behavior, dialect mismatches, state propagation, and auth state all become places where plausible code fails.
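A small illustration of what a dialect mismatch looks like in practice. This example is mine, not taken from the paper, and the table and column names are hypothetical. It uses Python's standard `sqlite3` module to show that SQLite's flexible typing accepts a value that a strictly typed backend such as PostgreSQL would reject for an integer column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, favorites_count INTEGER)"
)

# SQLite stores this non-numeric string as TEXT without raising an error.
conn.execute("INSERT INTO articles (favorites_count) VALUES (?)", ("many",))
# A numeric-looking string is silently coerced to an integer.
conn.execute("INSERT INTO articles (favorites_count) VALUES (?)", ("3",))

stored_types = [
    row[0]
    for row in conn.execute(
        "SELECT typeof(favorites_count) FROM articles ORDER BY id"
    )
]
print(stored_types)  # ['text', 'integer']
conn.close()
```

Code that passes against one backend can therefore fail against another without any change to the query text, which is one way plausible data-layer code ends up failing assertions.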

Framework sensitivity copied from the paper

| Framework | Average assertion pass rate |
|---|---|
| Express | 51.4% |
| Koa | 50.7% |
| Flask | 49.3% |
| aiohttp | 38.4% |
| Fastify | 31.7% |
| Django | 25.4% |
| FastAPI | 24.2% |
| Hono | 18.5% |

Source trace: R71, paper-text/constraint-decay-2605.06445.txt.

Chart sketch: constraints as performance decay

flowchart LR
    L0["L0: framework + API contract"] --> L1["L1: clean architecture"]
    L1 --> L2["L2: database backend"]
    L2 --> L3["L3: ORM integration"]
    L3 --> D["Average capable-config A% drop: 30 pp"]

The codebase-design inference is subtle. The answer is not "avoid constraints." Constraints are what make software maintainable. The answer is to make constraints explicit and executable. If the architecture rule is hidden in prose, the agent has to infer it. If the rule is encoded as imports, types, generated clients, lints, tests, and examples, the agent has something to repair against.
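As one way to turn an architecture rule into executable feedback, here is a minimal import-boundary check built on Python's standard `ast` module. The layer names and the `FORBIDDEN_IMPORTS` rule table are hypothetical, and this is a sketch of the idea, not a replacement for a real lint tool.

```python
import ast

# Hypothetical layering rule: each key may not import from the listed packages.
FORBIDDEN_IMPORTS = {
    "domain": {"infrastructure", "web"},
    "application": {"web"},
}


def boundary_violations(layer: str, source: str) -> list[str]:
    """Return the forbidden top-level packages imported by `source` in `layer`."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, set())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports stay inside the layer.
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in forbidden:
                found.append(top_level)
    return found


bad_source = "from infrastructure.db import Session\nimport web.handlers\n"
good_source = "from domain.models import Article\n"

print(boundary_violations("domain", bad_source))
print(boundary_violations("domain", good_source))
```

Run in CI or as a pre-commit check, a rule like this gives the agent a concrete failure to repair against instead of a sentence in an architecture document.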

CODETASTE: agents execute specified refactors better than they discover them

CODETASTE is not a feature benchmark; it is a refactoring benchmark. It is still relevant because large feature work often contains refactoring-like moves: move a boundary, replace an old API, standardize a package, update an integration pattern, or migrate many call sites.

The key distinction is between the Instructed Track and Open Track. In the instructed track, the agent receives a detailed description of the intended refactor. In the open track, it receives a vague focus area and must infer the human architectural choice. The performance gap is the point.

CODETASTE benchmark scale copied from the paper

| Benchmark property | Value |
|---|---|
| Instances | 100 |
| Repositories | 87 |
| Programming languages | 6 |
| Average files edited by human refactor | 91.52 |
| Average LOC changed | 2,605.39 |
| Maximum LOC changed | 18,821 |
| Maximum files changed | 290 |
| Average tests per instance | 1,638.53 |
| Average additive static rules | 29.66 |
| Average reductive static rules | 63.41 |

CODETASTE result data copied from the paper

| Model / mode | PASS | Alignment A | Instruction-following rate |
|---|---|---|---|
| GPT-5.2 instructed | 76.0% | 69.6% | 89.3% |
| GPT-5.2 open direct | 87.0% | 7.7% | about 9-10% components |
| GPT-5.2 open plan | 87.0% | 14.1% | higher than direct |
| GPT-5.2 open multiplan oracle | 81.0% | 19.4% | highest open-track alignment |
| GPT-5.1 Codex Mini instructed | 47.0% | 34.6% | 72.2% |
| Claude Sonnet 4.5 instructed | 43.0% | 32.4% | 69.2% |
| Qwen3 instructed | 30.0% | 11.8% | lower than frontier systems |

Source trace: R72, paper-text/codetaste-2603.04177.txt.

Chart sketch: specified vs inferred refactoring intent

xychart-beta
    title "CODETASTE GPT-5.2 alignment"
    x-axis ["Instructed", "Open direct", "Open plan", "Open multiplan"]
    y-axis "Alignment %" 0 --> 80
    bar "Alignment A" [69.6, 7.7, 14.1, 19.4]

The article should use this as a clean argument for explicit task specs and visible extension points. Agents can execute a detailed structural transformation far better than they can infer a human's intended architectural move from a broad complaint.
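The size of that difference is easier to see as ratios. The sketch below only rearranges the GPT-5.2 alignment figures copied above; the variable names are mine.

```python
# GPT-5.2 Alignment A (%) by mode, copied from the CODETASTE results.
alignment = {
    "instructed": 69.6,
    "open direct": 7.7,
    "open plan": 14.1,
    "open multiplan oracle": 19.4,
}

baseline = alignment["open direct"]
# How many times higher each mode's alignment is than open direct.
vs_direct = {mode: round(value / baseline, 2) for mode, value in alignment.items()}

# Gap that remains even with the best open-track setting.
remaining_gap_pp = round(
    alignment["instructed"] - alignment["open multiplan oracle"], 1
)
print(vs_direct)
print(remaining_gap_pp)
```

Instructed alignment is about 9 times the open direct figure, planning gets to about 1.8 times, and the best open-track setting still sits roughly 50 pp below the instructed track.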

FeatureBench: feature restoration is still hard

FeatureBench is useful because it makes "feature work" concrete. It constructs executable environments and validates both fail-to-pass and pass-to-pass behavior. The numbers are low even for strong agents. That keeps the article from implying that feature work is solved once the agent has more context.

FeatureBench data copied from the paper

| Measurement | Value |
|---|---|
| Tasks | 200 |
| Executable environments | 3,825 |
| Repositories | 24 |
| Claude Opus 4.5 resolved | 11.0% |
| GPT-5.1-Codex resolved | 12.5% |
| Approximate FeatureBench task LOC | 790.2 |
| Approximate SWE-Dev comparison LOC | 190 |

Source trace: R44, paper-text/featurebench-2602.10975.txt.

The most important thing to take from FeatureBench is not the low resolved rate but the validation shape: feature tests must fail before the feature is restored and pass after the patch, while existing behavior must also keep passing. That is the agent-friendly test pattern for feature work: fail-to-pass plus pass-to-pass.
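The fail-to-pass plus pass-to-pass check can be sketched in a few lines. This is my own minimal version, assuming test results are available as name-to-boolean maps; the function name, the test names, and the result format are hypothetical and not FeatureBench's harness.

```python
def is_resolved(
    before: dict[str, bool],
    after: dict[str, bool],
    fail_to_pass: set[str],
    pass_to_pass: set[str],
) -> bool:
    """Check the fail-to-pass plus pass-to-pass pattern for one patch.

    `before` and `after` map test names to True (passed) or False (failed).
    A test missing from a map is treated as failed.
    """
    # Feature tests must genuinely fail without the feature...
    feature_was_missing = all(not before.get(t, False) for t in fail_to_pass)
    # ...and pass once the patch is applied.
    feature_now_works = all(after.get(t, False) for t in fail_to_pass)
    # Existing behavior must pass both before and after the patch.
    nothing_regressed = all(
        before.get(t, False) and after.get(t, False) for t in pass_to_pass
    )
    return feature_was_missing and feature_now_works and nothing_regressed


# Hypothetical test names for one feature task.
f2p = {"test_favorite_increments_count"}
p2p = {"test_list_articles", "test_login"}
before = {"test_favorite_increments_count": False, "test_list_articles": True, "test_login": True}
clean_patch = {"test_favorite_increments_count": True, "test_list_articles": True, "test_login": True}
regressing_patch = {"test_favorite_increments_count": True, "test_list_articles": False, "test_login": True}

print(is_resolved(before, clean_patch, f2p, p2p))       # True
print(is_resolved(before, regressing_patch, f2p, p2p))  # False
```

The second case is the one that matters for agents: the feature test passes, so the patch looks done, but an existing test broke along the way.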

Synthesis: where feature tasks collapse

flowchart TD
    A[Feature request] --> B[Find extension point]
    B --> C[Recover constraints]
    C --> D[Plan files/tasks/steps]
    D --> E[Implement]
    E --> F[Run fail-to-pass tests]
    F --> G[Run pass-to-pass tests]
    G --> H[Check structural constraints]
    B -. common failure .-> X1[Wrong surface]
    C -. common failure .-> X2[Hidden architecture/data rule]
    D -. common failure .-> X3[Step recall collapse]
    E -. common failure .-> X4[Plausible but incomplete patch]
    H -. common failure .-> X5[Passes behavior but violates structure]

This graph is the blog post argument in one figure. Feature work is not one action. It is a chain of recoveries, and a codebase can help at each link.

What code patterns follow from this

| Failure mode | Repo pattern that helps | Why |
|---|---|---|
| Wrong extension point | Canonical examples and small public interfaces | The agent sees where new behavior belongs. |
| Hidden architecture rule | Custom lint/static rules and import boundaries | The rule becomes executable feedback. |
| Missing data-layer convention | Typed repository/service layer and integration tests | Query/ORM mistakes are caught locally. |
| Step recall collapse | Task specs with file/task/checklist structure | The plan survives context and implementation. |
| Inferred refactor is wrong | Explicit migration/refactor spec and static rules | The desired transformation is named. |
| Feature breaks old behavior | Pass-to-pass tests | The agent sees preservation requirements. |
| Raw API guessing | Generated SDKs and typed clients | API contracts become local symbols. |

What I should not claim

I should not claim that "simple architecture" is the only answer. Constraint Decay shows that production constraints are costly for agents, yet production code still needs them. The answer is not less architecture. The answer is more visible, executable architecture.

I should not claim that planning alone solves open-ended feature work. CODETASTE planning nearly doubles GPT-5.2 open-track alignment from 7.7% to 14.1%, but that is still far below instructed alignment. Planning helps only if the plan is grounded in the right structure and objective.

I should not combine RACE-bench, Constraint Decay, CODETASTE, and FeatureBench into one model leaderboard. They measure different task definitions and harnesses. The shared conclusion is about failure shape, not model ranking.

Blog visual candidates

  1. RACE-bench apply-vs-resolved grouped bars.
  2. RACE-bench reasoning waterfall: files -> tasks -> steps.
  3. Constraint Decay marginal effects by constraint.
  4. CODETASTE instructed vs open-track alignment chart.
  5. Feature-work failure chain graph.

References

  • R44: FeatureBench, paper-text/featurebench-2602.10975.txt
  • R68: RACE-bench, paper-text/race-bench-feature-addition-2603.26337.txt
  • R71: Constraint Decay, paper-text/constraint-decay-2605.06445.txt
  • R72: CODETASTE, paper-text/codetaste-2603.04177.txt