INSIGHT 01: Agents Need Maps, Not Dumps
AI coding agents perform better when the repository exposes its structure as a compact, authoritative map than when raw files are dumped into the context window. Dumping more context adds noise and skews retrieval toward recall at the expense of precision. Telling the agent how components, tests, and dependencies relate -- through explicit graphs, repo maps, or structured summaries -- yields measurable gains in accuracy, speed, and retrieval quality.
Source map
| Ref | Source | Local text | Role |
|---|---|---|---|
| R10 | ContextBench (2026-02) | paper-text/contextbench-2602.05892.txt | Measures retrieval quality during issue resolution; shows recall-over-precision bias and explored-vs-utilized gap. |
| R03 | RepoBench (2023-06) | paper-text/repobench-2306.03091.txt | Splits repo-level completion into retrieval and generation; makes cross-file retrieval an explicit bottleneck. |
| R12 | RepoGraph (ICLR 2025) | paper-text/repograph-2410.14684.txt | Adds a line-level dependency graph as a plug-in module; 32.8% average relative improvement on SWE-bench. |
| R13 | Repository Intelligence Graph (2026-01) | paper-text/repository-intelligence-graph-2601.10112.txt | Deterministic build/test-centered map; +12.2% accuracy, -53.9% time across 3 agents and 8 repos. |
| R14 | AI-assisted Coding with Cody (2024-08) | paper-text/cody-context-retrieval-2408.05344.txt | Practitioner architecture: search + code intelligence > raw long-context for code recommendations. |
| R16 | CodePlan (2023-09) | paper-text/codeplan-2309.12499.txt | Repository-level tasks need planning over dependency structure, not direct generation. |
| D09 | Aider Repo Map docs | articles/aider-repomap.md | Practitioner signal: repo maps help agents work in larger codebases. |
| D10 | Sourcegraph: How Cody understands your codebase | articles/sourcegraph-how-cody-understands-codebase.html | Official-doc evidence: code search + code intelligence, not prompt stuffing. |
ContextBench: agents favor recall over precision
ContextBench (R10) introduces 1,136 issue-resolution tasks from 66 repositories across 8 programming languages. Each task is annotated with human-verified gold contexts (522,115 lines across 4,548 files). The benchmark tracks what code agents inspect during resolution, not just whether they produce a passing patch.
Key finding: sophisticated agent scaffolding yields only marginal gains in context retrieval over a minimal baseline -- echoing "The Bitter Lesson" of AI research. All agents and LLMs consistently favor recall over precision. This means they retrieve broad, noisy context.
ContextBench agent retrieval data (GPT-5 backbone)
| Agent | File Recall | File Precision | File F1 | Block Recall | Block Precision | Block F1 | Pass@1 |
|---|---|---|---|---|---|---|---|
| mini-SWE-Agent | 0.682 | 0.709 | 0.634 | 0.369 | 0.645 | 0.375 | 0.472 |
| Agentless | 0.609 | 0.352 | 0.390 | 0.344 | 0.328 | 0.242 | 0.452 |
| SWE-agent | 0.726 | 0.537 | 0.544 | 0.312 | 0.625 | 0.285 | 0.490 |
| OpenHands | 0.733 | 0.400 | 0.463 | 0.283 | 0.505 | 0.190 | 0.490 |
| Prometheus | 0.717 | 0.336 | 0.403 | 0.258 | 0.646 | 0.285 | 0.512 |
Source trace: R10, paper-text/contextbench-2602.05892.txt, Table 2.
Critical observation: "Significant gaps exist between retrieved and utilized context. Agents often inspect gold-relevant code but fail to retain or use it in final patch generation, highlighting consolidation as a key bottleneck." The problem is not just finding the right context -- it is organizing it so the agent can act on it.
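The explored-vs-utilized gap can be made concrete with plain set arithmetic. The sketch below is not ContextBench's scoring code; it assumes file-level sets, and the file paths, `score_retrieval`, and `unused_gold` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalScore:
    recall: float
    precision: float
    f1: float


def score_retrieval(retrieved: set[str], gold: set[str]) -> RetrievalScore:
    """Set-overlap recall, precision, and F1 for a single task."""
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return RetrievalScore(recall, precision, f1)


def unused_gold(inspected: set[str], used_in_patch: set[str], gold: set[str]) -> set[str]:
    """Gold items the agent looked at but did not carry into the final patch."""
    return (inspected & gold) - used_in_patch


# Hypothetical trajectory: every path here is made up for illustration.
gold = {"core/auth.py", "core/session.py"}
inspected = {
    "core/auth.py", "core/session.py", "core/db.py",
    "api/views.py", "utils/log.py", "README.md",
}
used = {"core/auth.py", "core/db.py"}

explored = score_retrieval(inspected, gold)  # broad: full recall, low precision
utilized = score_retrieval(used, gold)       # what actually reached the patch
dropped = unused_gold(inspected, used, gold)
```

In this toy trajectory the agent finds both gold files while exploring but carries only one into the patch, so `dropped` holds the file that was seen and then lost. That is the consolidation failure the quote describes.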
Benchmark scope
| Measurement | Value |
|---|---|
| Total tasks | 1,136 |
| Lite subset | 500 |
| Repositories | 66 |
| Programming languages | 8 |
| Gold context lines | 522,115 |
| Gold context files | 4,548 |
| Gold context blocks (classes/functions) | 23,116 |
RepoBench: cross-file retrieval as explicit bottleneck
RepoBench (R03) decomposes repository-level code completion into three sub-tasks:
- RepoBench-R (Retrieval): Find the most relevant code snippet from other files.
- RepoBench-C (Completion): Predict the next line given pre-selected context.
- RepoBench-P (Pipeline): End-to-end retrieval + completion.
The benchmark uses tree-sitter to parse import statements and identify cross-file dependencies. The "Cross-File-First" (XF-F) setting -- masking the first appearance of a cross-file reference -- is the hardest, because no prior in-file usage exists. This makes it a pure retrieval problem.
The implication for codebase design: when the agent has no in-file usage to learn from, it depends entirely on how discoverable and navigable cross-file dependencies are. Explicit imports, typed interfaces, and package boundaries directly affect retrieval quality.
Source trace: R03, paper-text/repobench-2306.03091.txt.
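To illustrate why explicit imports make cross-file dependencies discoverable, here is a minimal sketch that recovers a file-to-file dependency map from import statements alone. It uses Python's standard `ast` module rather than tree-sitter, which is a deliberate substitution to keep the example self-contained. The toy repository, `imported_modules`, and `cross_file_deps` are hypothetical.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return module names referenced by absolute import statements."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this sketch.
            modules.add(node.module)
    return modules


def cross_file_deps(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each file to the other in-repo files it imports.

    `files` maps a path such as "pkg/util.py" to its source text.
    Imports that do not resolve to a file in `files` (stdlib,
    third-party) are ignored.
    """
    module_to_path = {
        path[:-3].replace("/", "."): path
        for path in files
        if path.endswith(".py")
    }
    deps: dict[str, set[str]] = {}
    for path, source in files.items():
        targets = {
            module_to_path[name]
            for name in imported_modules(source)
            if name in module_to_path
        }
        deps[path] = targets - {path}
    return deps


# Hypothetical three-file repository.
repo = {
    "pkg/util.py": "import os\n\ndef slugify(s):\n    return s.lower()\n",
    "pkg/models.py": "from pkg.util import slugify\n",
    "app.py": "import json\nimport pkg.models\n",
}
deps = cross_file_deps(repo)
```

The sketch only resolves imports that name a module path directly; forms such as `from pkg import util` or dynamic imports would need extra handling. This is the same reason implicit or indirect wiring is harder for a retriever to follow.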
RepoGraph: structured graphs boost SWE-bench performance
RepoGraph (R12, ICLR 2025) constructs a line-level graph where:
- Nodes = lines of code (definitions or references)
- Edges = dependency relationships (invoke, contain)
Ego-graph retrieval from this structure is integrated into both agent and procedural frameworks as a search_repograph(param) action.
RepoGraph results on SWE-bench
| Measurement | Value |
|---|---|
| Average relative improvement (4 systems, agent and procedural approaches) | 32.8% |
The paper demonstrates that RepoGraph helps both agent-based and procedural frameworks. It operates at line, file, and repository level simultaneously. The node filtering step is important: it excludes built-in/stdlib calls and third-party library calls, focusing only on project-internal dependencies.
Source trace: R12, paper-text/repograph-2410.14684.txt.
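Ego-graph retrieval can be sketched as a bounded breadth-first search around one node. This is not RepoGraph's implementation. The node ids below name definitions rather than individual lines, edge direction is ignored during expansion, and the graph contents and the `ego_graph` helper are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical project-internal graph. Each edge is (relation, target),
# with relation drawn from the two kinds named above: "contain", "invoke".
EDGES = {
    "class:Cart": [("contain", "def:Cart.total"), ("contain", "def:Cart.add")],
    "def:Cart.total": [("invoke", "def:apply_discount")],
    "def:Cart.add": [],
    "def:apply_discount": [("invoke", "def:round_money")],
    "def:round_money": [],
    "def:checkout": [("invoke", "def:Cart.total")],
}


def undirected_neighbors(edges):
    """Build an adjacency map that ignores edge direction and relation."""
    neighbors = {node: set() for node in edges}
    for src, outs in edges.items():
        for _relation, dst in outs:
            neighbors[src].add(dst)
            neighbors.setdefault(dst, set()).add(src)
    return neighbors


def ego_graph(edges, center, hops=1):
    """Return the nodes within `hops` edges of `center`."""
    neighbors = undirected_neighbors(edges)
    if center not in neighbors:
        return set()
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == hops:
            continue  # do not expand past the requested radius
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


one_hop = ego_graph(EDGES, "def:Cart.total", hops=1)
two_hop = ego_graph(EDGES, "def:Cart.total", hops=2)
```

A small radius returns the definition together with its container, callers, and callees, which is a far narrower context than the files those nodes live in. Because the toy graph holds only project-internal nodes, it mirrors the filtering step described above.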
Repository Intelligence Graph (RIG): deterministic build/test map
RIG (R13) is the strongest direct evidence for the "map not dump" claim. It constructs a deterministic, evidence-backed architectural graph from build and test artifacts.
RIG evaluation results across 3 agents and 8 repositories
| Metric | Value |
|---|---|
| Mean accuracy improvement (with RIG vs without) | +12.2% |
| Mean completion time reduction | -53.9% |
| Mean absolute time reduction per repository | -124.4 seconds |
| Mean efficiency improvement (seconds per score point) | -57.8% |
| Multilingual repo accuracy improvement | +17.7% |
| Multilingual repo efficiency improvement | -69.5% |
| Single-language repo accuracy improvement | +6.6% |
| Single-language repo efficiency improvement | -46.1% |
| Average RIG JSON size | 20,692 bytes (~5,173 tokens) |
| Largest RIG in corpus | 60,076 bytes (~15,000 tokens) |
Source trace: R13, paper-text/repository-intelligence-graph-2601.10112.txt.
The key design choice: RIG is build-and-test-centered, not code-centered. Nodes are components, aggregators, runners, tests, external packages. Edges record dependency, coverage, and orchestration relationships. This answers agent questions like "which components depend on X?" without scanning source code.
RIG benefits are larger on structurally complex repositories and on harder questions. This means the value of a map increases as the codebase becomes more complex -- exactly when dumping raw context becomes most harmful.
CodePlan: planning over dependencies, not direct generation
CodePlan (R16) formalizes repository-level coding as a planning problem. It uses incremental dependency analysis and change-may-impact analysis to propagate edits across file boundaries.
Key result: CodePlan gets 5/6 repositories to pass validity checks (build without errors, correct edits), while baselines without planning (but with the same context) get 0/6.
The implication: when a change escapes a function's signature boundary, it must propagate to callers. Without a dependency map, the agent cannot reason about this propagation. With explicit dependency structure, it can plan a multi-step chain of edits.
Source trace: R16, paper-text/codeplan-2309.12499.txt.
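The propagation idea can be sketched as a worklist over a caller graph. This is a deliberately simplified stand-in, not CodePlan's algorithm: it assumes the set of edits that change a signature is known up front, whereas a real planner would discover that after each edit. The call graph and `plan_edits` are hypothetical.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions that call it.
CALLERS = {
    "parse_config": ["load_app", "validate_cli"],
    "load_app": ["main"],
    "validate_cli": ["main"],
    "main": [],
}


def plan_edits(callers, seed, signature_changes):
    """Order edits breadth-first from `seed`.

    An edit propagates to a function's callers only when that function is
    in `signature_changes`, i.e. the change escapes its signature boundary.
    """
    plan = []
    scheduled = {seed}
    queue = deque([seed])
    while queue:
        func = queue.popleft()
        plan.append(func)
        if func not in signature_changes:
            continue  # change stays inside the body; callers are unaffected
        for caller in callers.get(func, []):
            if caller not in scheduled:
                scheduled.add(caller)
                queue.append(caller)
    return plan


plan = plan_edits(
    CALLERS,
    seed="parse_config",
    signature_changes={"parse_config", "load_app"},
)
local_only = plan_edits(CALLERS, seed="parse_config", signature_changes=set())
```

With a signature change at the seed, the plan reaches every caller that must be updated. When the change stays inside the function body, the plan contains only the seed. Without the caller graph, neither case can be distinguished ahead of time.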
Inference
What the evidence supports:
- Retrieval is noisy. Agents over-retrieve and under-utilize. More sophisticated scaffolding does not fix this (ContextBench). The codebase must help the agent find and use the right context.
- Structured graphs outperform flat retrieval. RepoGraph (+32.8% relative) and RIG (+12.2% accuracy, -53.9% time) both demonstrate that explicit structural representations improve agent performance over baseline approaches that scan files ad hoc.
- Build/test structure is particularly high-value. RIG shows that build-and-test topology -- not just code-level dependencies -- is what agents most struggle to recover on their own.
- Cross-file dependencies are the retrieval bottleneck. RepoBench's XF-F setting isolates this; when the agent has never seen a module used in-file, retrieval from structure is all it has.
- Planning needs structure. CodePlan demonstrates that multi-file changes require dependency analysis; direct generation without structural awareness fails.
Inference (author conclusion, not directly from papers):
- A short, authoritative repo map (modules, boundaries, canonical examples, test coverage topology) should be cheaper and more effective than dumping full files into context.
- The map should be deterministic (generated from build/test artifacts or maintained as config), not generated on-the-fly by the agent.
- Maps should answer: What are the modules? Where do tests live? What builds what? What are the extension points? What is generated vs hand-written?
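One way to read the determinism requirement: the same repository state should always produce byte-identical map text. The sketch below is an assumption-laden illustration, not a recommended format. The field names and paths are hypothetical, and the 4 bytes-per-token ratio is a rough heuristic that matches the ratio implied by the RIG size figures quoted earlier (20,692 bytes, ~5,173 tokens).

```python
import json


def render_map(repo_map: dict) -> str:
    """Serialize the map so that equal content always yields identical text."""
    # sort_keys removes dependence on dict insertion order. List order is
    # preserved as given, so a generator should emit lists pre-sorted.
    return json.dumps(repo_map, sort_keys=True, indent=2, ensure_ascii=False)


def approx_tokens(text: str, bytes_per_token: int = 4) -> int:
    """Rough context-size estimate; the default ratio is an assumption."""
    return len(text.encode("utf-8")) // bytes_per_token


# Hypothetical map with one field per question in the list above.
repo_map = {
    "modules": {"core": "src/core", "api": "src/api"},
    "tests": {"core": "tests/core", "api": "tests/api"},
    "build": {"api": ["core"]},
    "extension_points": ["src/api/plugins"],
    "generated": ["src/api/schema_pb2.py"],
}

rendered = render_map(repo_map)
size_estimate = approx_tokens(rendered)
```

Stable output keeps diffs of the map meaningful in review and lets the map be regenerated in CI and compared against the committed copy.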
Non-claims
- The evidence does not prove that any specific map format (JSON, Markdown, YAML) is superior. RIG uses JSON; Aider's repo map uses a tag-based text format. The content matters more than serialization.
- RepoGraph operates at line-level granularity; RIG operates at component/build-target level. These are different scopes and are not directly comparable.
- ContextBench measures retrieval quality but does not directly manipulate map presence. It is observational evidence about what goes wrong, not experimental evidence about specific fixes.
- None of these papers test the specific act of "adding a CLAUDE.md/AGENTS.md architecture summary." That claim must be sourced from the agent instructions research (INSIGHT_02).
Blog/presentation visual candidates
- ContextBench radar chart (from Figure 1 in paper): shows recall >> precision for all agents.
- RIG before/after comparison: time and accuracy with vs without the map.
- RepoGraph integration diagram: how graph retrieval plugs into both agent and procedural workflows.
- CodePlan propagation example: showing how one seed edit propagates to 5+ derived edits across files via dependency analysis.
- "Floor plan vs warehouse" metaphor slide: the talk hook.
References
- R10: ContextBench, paper-text/contextbench-2602.05892.txt
- R03: RepoBench, paper-text/repobench-2306.03091.txt
- R12: RepoGraph, paper-text/repograph-2410.14684.txt
- R13: Repository Intelligence Graph, paper-text/repository-intelligence-graph-2601.10112.txt
- R14: Cody context retrieval, paper-text/cody-context-retrieval-2408.05344.txt
- R16: CodePlan, paper-text/codeplan-2309.12499.txt
- D09: Aider repo map docs, articles/aider-repomap.md
- D10: Sourcegraph Cody, articles/sourcegraph-how-cody-understands-codebase.html