Brain index

INSIGHT 24: Context Files Are Config With Debt

AGENTS.md, CLAUDE.md, Copilot instructions, rules files, skills, hooks, and memory files are not ordinary documentation. They are runtime configuration for an autonomous worker. That makes them powerful and dangerous in the same way build config is powerful and dangerous: if the config is short, accurate, scoped, and executable, it improves behavior; if it is stale, duplicated, verbose, or contradictory, it becomes friction.

The evidence is mixed because context files are mixed. Some studies show efficiency gains. Some show cost increases and worse resolution. Empirical studies show widespread adoption but also length, readability issues, and missing non-functional requirements. Instruction-following benchmarks show agents often satisfy individual checks while failing the full conjunction of constraints.

So the conclusion is not "write AGENTS.md" or "avoid AGENTS.md." The conclusion is:

Treat agent context files as maintained operational config with a token budget, owners, scope, tests, and debt.

Plot-ready data lives in presentations/write-code-ai-agents-love/research/data/context_instruction_cost.csv.

Source map

RefSourceLocal text/articleRole in this insight
R17AGENTS.md impactpaper-text/agents-md-impact-2601.20404.txtShows root AGENTS.md can reduce runtime/output tokens.
R18Evaluating AGENTS.mdpaper-text/evaluating-agents-md-2602.11988.txtShows context files can increase cost/steps and sometimes reduce success.
R19Claude Code config-file studypaper-text/claude-code-configs-2511.09268.txtShows what real CLAUDE.md files contain.
R21SWE-Skills-Benchpaper-text/swe-skills-bench-2603.15401.txtShows skills are high-variance and often add tokens without improving pass rate.
R73OctoBenchpaper-text/octobench-2601.10343.txtMeasures scaffold-aware instruction following across heterogeneous sources.
R74Agent READMEspaper-text/agent-readmes-context-files-2025.txtLarge empirical study of 2,303 context files from 1,925 repos.
D34-D53Practitioner/HN sourcesarticles/*Shows the community has moved from prompt tricks to operational context management.

AGENTS.md can reduce motion

The positive case is real. The AGENTS.md impact study ran paired Codex executions with and without root AGENTS.md files. It reports statistically significant reductions in wall-clock time and output tokens. That supports the idea that a good root file can orient the agent and reduce unnecessary exploration.

AGENTS.md impact data copied from the paper

MetricWithout AGENTS.mdWith AGENTS.mdChange
Mean runtime162.94s129.91s-20.27%
Median runtime98.57s70.34s-28.64%
Mean output tokens5,744.814,591.46-20.08%
Median output tokens2,9252,440-16.58%
Mean input/cached tokens353,010.01318,651.51-9.73%
Mean cached input tokens328,877.31296,078.73-9.97%

Source trace: R17, paper-text/agents-md-impact-2601.20404.txt.

Important caveat: this study did not deeply evaluate semantic correctness. It performed a manual sanity check over sampled outputs. I should use it as evidence for efficiency, not as proof that AGENTS.md improves correctness.

R17 methodology (for pairing with R18)

Real OSS only: 10 GitHub repos, 124 small merged PRs replayed at pre-merge commits with paired with/without existing developer root AGENTS.md (≤100 LoC, ≤5 files). Single agent (Codex gpt-5.2-codex). Metrics: wall-clock and tokens (Table 1, arXiv ~p. 4). Not synthetic; not LLM-generated context files. Full reconciliation with R18: see INSIGHT_02 section "Reconciling R17 vs R18".

Context files can also increase cost and reduce success

Evaluating AGENTS.md gives the counterweight. It distinguishes developer-provided context files from LLM-generated ones and measures resolution, steps, cost, and reasoning behavior. The results are exactly what I would expect from configuration: good human config can help; generated or redundant config can make the system noisier.

Evaluating AGENTS.md data copied from the paper

Context conditionSuccess / resolution effectSteps / cost effectInterpretation
LLM-generated context on SWE-bench Lite-0.5 pp resolution+2.45 steps, +20% costGenerated context can add motion without useful signal.
LLM-generated context on AGENT BENCH-2.0 pp resolution+3.92 steps, +23% costSame failure in a newer benchmark.
Developer-provided contextabout +4% performance avg+3.34 steps, cost up to +19%Human context helps more but still has carrying cost.
Docs removed, LLM context+2.7% average performancenot the main pointContext can substitute for missing docs.
uv mentioned1.6 uses per instance<0.01 when not mentionedAgents obey concrete tool guidance.
Repo-specific tools mentioned2.5 uses per instance<0.05 when not mentionedNaming tools explicitly changes behavior.

Source trace: R18, paper-text/evaluating-agents-md-2602.11988.txt, §4.2 Table 2 / Figure 3 (arXiv ~p. 6); docs-stripped ablation Figure 5 (~p. 7).

R18 methodology (real OSS benchmarks, not synthetic)

  • SWE-bench Lite: popular repos, no native dev context file; compare NONE vs LLM-generated file per agent's init recommendation.
  • AGENT BENCH: 138 tasks from 12 niche repos that already have developer context files; compare NONE vs LLM vs HUMAN file; issues + tests built with LLM pipeline (§3, ~pp. 3–5).
  • Success = all instance tests pass; cost = inference USD from step counts (§4.1, ~p. 6).
  • Mechanism: files increase exploration/testing; agents follow instructions (uv, pytest, repo tools spike when mentioned, §4.3, ~pp. 6–7). Extra policy makes tasks harder, not "ignored."

R17 (efficiency on PR replay with human file) and R18 (resolution + cost on issue benchmarks) are not a direct contradiction—see INSIGHT_02 "Reconciling R17 vs R18."

The useful inference is not that context files are bad. The useful inference is that context files change the agent's policy. If the file contains extra requirements, vague overview, stale commands, or old preferences, the agent may spend more steps trying to satisfy them.

Agent READMEs: the ecosystem is already doing this, but unevenly

Agent READMEs is valuable because it is descriptive rather than prescriptive. It looks at 2,303 context files across 1,925 repositories and asks what developers actually write.

Agent READMEs data copied from the paper

File typeCountMedian wordsMedian FRE / readability
CLAUDE.md922485.016.6, very difficult
AGENTS.md694335.539.6, difficult
copilot-instructions.md687535.026.6, very difficult
Total2,303mixedmixed
Repositories1,925n/an/a

Agent README category prevalence copied from the paper

CategoryShare of ACFs
Testing75.0%
Implementation Details69.9%
Architecture67.7%
Development Process63.3%
Build and Run62.3%
System Overview59.0%
Maintenance43.7%
Configuration and Environment38.0%
Documentation26.8%
AI Integration24.4%
Debugging24.4%
DevOps18.1%
Security14.5%
Performance14.5%
UI/UX8.7%
Project Management5.4%

Source trace: R74, paper-text/agent-readmes-context-files-2025.txt.

The content distribution is important. Developers mostly encode what agents need to execute: testing, implementation details, architecture, process, build/run. Security, performance, and UI/UX are much rarer. That suggests current context files are optimized for "get the code changed" more than "preserve the qualities that are easy to regress."

Chart sketch: context files over-index on execution guidance

xychart-beta title "Agent README category prevalence" x-axis ["Testing", "Impl", "Arch", "Build", "Security", "Perf", "UI/UX"] y-axis "Share %" 0 --> 80 bar "ACFs" [75.0, 69.9, 67.7, 62.3, 14.5, 14.5, 8.7]

This is a strong blog point: if agents rely on explicit context and the context under-represents security/performance/UI, then repo owners should not be surprised when agents preserve the happy path but miss non-functional constraints.

Claude config studies: what practitioners put in memory

The Claude Code config-file paper analyzed 328 CLAUDE.md files. It gives a more Claude-specific view of the same ecosystem.

Claude Code config data copied from the paper

CategoryFilesShare
Architecture23872.6%
Development guidelines14744.8%
Project overview12839.0%
Testing guidelines11635.4%
Commands10933.2%
Dependenciesabout 10130.8%
Project guidelinesabout 8425.6%
Integration guidelines5918.0%
Usage guidelines5918.0%
Configuration5717.4%

Source trace: R19, paper-text/claude-code-configs-2511.09268.txt.

The important detail is that code examples are not dominant. Mermaid diagrams are rare. Links are rare. The files are mostly structured markdown instructions. That makes them easy to write but also easy to let rot. A command listed in markdown can become stale the moment package.json, Makefile, CI, or runtime versions change.

Skills: high variance, narrow fit

SWE-Skills-Bench is the best warning against turning every preference into a skill. It evaluates 49 public SWE skills across roughly 565 task instances. Most skills do not improve pass rate. Some help. Some hurt.

SWE-Skills-Bench data copied from the paper

MeasurementValue
Public skills initially observed84,192
Days of public skill creation136
SWE skills evaluated49
Approximate task instances565
Aggregate pass-rate change+1.2 pp
Aggregate token consumption change+10.5%
Skills with zero pass-rate improvement39/49
Best reported skill gain+30.0 pp
Best reported skill token change-34.8%
Degrading skills3
Worst reported degradation-10.0 pp
Token overhead range for perfect-pass skills-77.6% to +450.8%

Source trace: R21, paper-text/swe-skills-bench-2603.15401.txt.

My inference: skills are not "free expertise." They are context injections. A skill should be small, scoped, current, and invoked only when the task matches. Stale concrete examples can anchor an agent on the wrong API version or wrong protocol.

OctoBench: per-check compliance is not full instruction following

OctoBench evaluates scaffold-aware instruction following in repository-grounded coding. It matters because real agents receive instructions from many places: system prompt, user query, repository policy files, skills, memory, tool schemas, and scaffold reminders.

The headline result is the gap between checklist success rate (CSR) and instance success rate (ISR). Models can satisfy many individual checks and still fail to satisfy all constraints for an instance.

OctoBench dataset data copied from the paper

MeasurementValue
Environments34
Tasks217
Scaffold types3
Objective checklist items7,098
System prompt present217 instances / 100.0%
User query present217 instances / 100.0%
System reminder present158 instances / 72.8%
Agents.md/Claude.md present117 instances / 53.9%
Skill.md present48 instances / 22.1%
Memory present32 instances / 14.7%
Tool schema present197 instances / 90.8%

OctoBench result data copied from the paper

ModelAvg CSRAvg ISR
Claude Opus 4.585.64%28.11%
MiniMax-M2.183.86%18.15%
Gemini-3-Pro80.94%14.68%
Claude Sonnet 4.581.10%14.65%
ChatGLM-4.680.38%12.73%
Kimi-K2-thinking80.10%12.95%
Doubao-Seed-1.879.75%9.66%
MiniMax-M280.34%9.81%

Source trace: R73, paper-text/octobench-2601.10343.txt.

Chart sketch: CSR vs ISR gap

xychart-beta title "OctoBench: individual checks vs full instance success" x-axis ["Opus 4.5", "MiniMax M2.1", "Gemini 3", "Sonnet 4.5", "MiniMax M2"] y-axis "Percent" 0 --> 100 bar "CSR" [85.64, 83.86, 80.94, 81.10, 80.34] bar "ISR" [28.11, 18.15, 14.68, 14.65, 9.81]

The blog implication: adding many rules is not the same as achieving rule-governed behavior. Every additional instruction becomes part of a conjunction. Agents may satisfy most individual rules but still miss the one that matters, especially under conflicting or long-lived constraints.

Practitioner signal and how to interpret it

The practitioner material and HN history are useful but not controlled evidence. Anthropic, GitHub, Builder.io, Aider, Sourcegraph, Windsurf, and community posts converge on patterns: short root instructions, path-scoped rules, explicit commands, setup scripts, LSP/search, hooks, skills, and MCP/tool integration.

I should treat these as implementation guidance, not as proof. They help translate the papers into engineering practice:

  • root context should be small;
  • path-specific context should exist only when conventions differ;
  • commands should be copy-pasteable and current;
  • generated/build/vendor folders should be excluded;
  • skills should be on-demand;
  • hooks/lints should enforce what prose cannot;
  • stale context should be detected, not trusted.

Proposed architecture for context files

flowchart TD Root["Root AGENTS.md: invariant commands, constraints, pointers"] --> Area["Path-scoped notes: local conventions"] Root --> Scripts["Executable scripts: setup/test/lint/build"] Root --> Rules["Lint/static rules: architecture constraints"] Root --> Docs["Long docs/Brain insights: source-heavy context"] Area --> Task["Task-specific retrieval"] Scripts --> Feedback["Fast feedback"] Rules --> Feedback Docs --> Task

The important distinction is always-loaded vs on-demand:

Knowledge typeBest homeWhy
Must-follow repo invariantsRoot AGENTS.mdLoaded early; keep short.
Local conventionsDirectory-level instructionsAvoid global context pollution.
Long reasoning / researchBrain insight notesRead only when relevant.
Setup/build/test truthScripts/package configExecutable and testable.
Architecture policyLint/static checksRepairable diagnostics.
Domain workflowsSkillsOn-demand, narrow fit.
External systemsMCP/tools/generated SDKsFresh, structured access.

What I should claim

Context files are agent configuration. Configuration needs:

  • owners;
  • review;
  • small scope;
  • clear precedence;
  • tests or lint where possible;
  • deletion of obsolete rules;
  • links to executable truth;
  • measured effect on time, tokens, and failures.

What I should not claim

I should not claim that AGENTS.md always improves correctness. The positive AGENTS.md paper is mostly efficiency evidence. The task-success evidence is mixed.

I should not claim that generated context files are useless. Evaluating AGENTS.md suggests they can substitute for missing docs. The problem is using generated broad context as always-loaded truth.

I should not claim skills are bad. SWE-Skills-Bench shows high variance: most do nothing, a few help a lot, and a few hurt. The better claim is that skills need strong selection and maintenance.

Blog visual candidates

  1. AGENTS.md with/without runtime and output tokens.
  2. Evaluating AGENTS.md cost/success tradeoff.
  3. Agent README category prevalence.
  4. SWE-Skills-Bench distribution: zero-help skills, helpful skills, harmful skills.
  5. OctoBench CSR vs ISR gap.
  6. Context architecture graph: root index -> scoped docs -> executable checks.

References

  • R17: AGENTS.md impact, paper-text/agents-md-impact-2601.20404.txt
  • R18: Evaluating AGENTS.md, paper-text/evaluating-agents-md-2602.11988.txt
  • R19: Claude Code configs, paper-text/claude-code-configs-2511.09268.txt
  • R21: SWE-Skills-Bench, paper-text/swe-skills-bench-2603.15401.txt
  • R73: OctoBench, paper-text/octobench-2601.10343.txt
  • R74: Agent READMEs, paper-text/agent-readmes-context-files-2025.txt
  • D34-D53: practitioner/HN sources in research/articles/