INSIGHT 09: Long Context Still Needs Structure
Large context windows do not eliminate the need for repository retrieval, maps, and scoped documentation. A million-token window lets you fit more code, but it does not guarantee the model will use that code effectively. The evidence shows: (1) positional effects degrade utilization of middle-context information, (2) long-context coding benchmarks remain hard even at frontier scale, (3) coding agents that use tools and file systems as an explicit long-context interface outperform raw attention baselines, and (4) agents already over-retrieve and under-utilize context at current scales.
The implication for codebase design: progressive disclosure matters more, not less, as context windows grow. A bigger room needs better maps.
Source map
| Ref | Source | Local text | Role in this insight |
|---|---|---|---|
| R32 | Lost in the Middle | paper-text/lost-in-the-middle-2307.03172.txt | Models underuse information depending on where it appears in long context (U-shaped curve). |
| R37 | LongCodeBench | paper-text/longcodebench-2505.07897.txt | Evaluation at million-token scale shows severe performance drops for real coding tasks. |
| R36 | YABLoCo | paper-text/yabloco-2505.04406.txt | Long-context code generation over very large C/C++ repositories (200K-2M LoC). |
| R40 | Coding Agents are Effective Long-Context Processors | paper-text/coding-agents-long-context-processors-2603.20432.txt | Agents using tools/filesystem beat raw long-context baselines by 17.3% on average. |
| R10 | ContextBench | paper-text/contextbench-2602.05892.txt | Agents retrieve too much context and still fail to use relevant material in final patches. |
Lost in the Middle: the positional attention problem
This paper establishes that language models do not uniformly attend to all positions in their context. Performance follows a U-shaped curve: highest when relevant information is at the beginning or end, and significantly degraded when it is in the middle.
Lost in the Middle data
| Measurement | Value | Context |
|---|---|---|
| Performance pattern | U-shaped curve | Best at beginning and end, worst in middle |
| GPT-3.5-Turbo middle-context performance | Lower than closed-book (56.1%) | Model performs worse WITH documents than without |
| Effect of more documents | Marginal improvement | 50 vs. 20 documents: ~1.5% gain (GPT-3.5-Turbo), ~1% (Claude-1.3) |
| Extended-context models vs. standard | Often identical performance | Longer window does not mean better utilization |
| Key-value retrieval | Some models struggle | Even for simple token matching from middle positions |
Source trace: R32, paper-text/lost-in-the-middle-2307.03172.txt.
Key findings relevant to codebase design:
- Position matters. Information in the middle of context is systematically underused. For a codebase stuffed into a long context window, files in the "middle" of the loaded context may effectively be invisible to the model.
- More context does not always help. Adding documents beyond saturation provides marginal improvement while increasing the volume the model must reason over. This is the "trade-off" the paper identifies: more information helps retrieval recall but hurts reasoning accuracy.
- Extended context is not better context. Models with longer windows do not necessarily use those windows more effectively than shorter-context models.
Inference for repositories: if you load an entire codebase into a million-token window, the model's attention to specific files depends heavily on their position relative to the query and the beginning/end of the context. Structured retrieval (loading only relevant files, placing them near the query) should therefore outperform raw dumping, and a larger window does not remove that advantage.
Note: this paper is from 2023. Newer models may partially mitigate the U-curve. But the central point likely still holds: attention is not uniform over long sequences, so codebase structure that aids selective retrieval should keep an edge over unstructured context stuffing.
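One way to act on the placement idea is to order retrieved files so that the most relevant ones land at the start and end of the prompt, leaving the weakest candidates in the middle. The sketch below assumes the files are already ranked by some relevance score, most relevant first. `order_for_edges` and the file names are hypothetical, not taken from the paper, and whether this ordering helps a given model is an empirical question.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def order_for_edges(ranked: Sequence[T]) -> List[T]:
    """Reorder items ranked most-relevant-first so the strongest sit at the edges.

    Even-indexed items fill the front in rank order; odd-indexed items fill
    the back in reverse order, so relevance decreases toward the middle.
    """
    front: List[T] = []
    back: List[T] = []
    for index, item in enumerate(ranked):
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


# Hypothetical file names, ranked by an assumed relevance score.
ranked_files = [
    "auth/session.py",
    "auth/tokens.py",
    "db/models.py",
    "utils/log.py",
    "README.md",
]
ordered = order_for_edges(ranked_files)
```

Here the top-ranked file opens the context, the second-ranked file closes it, and the lowest-ranked file ends up in the middle position.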
LongCodeBench: million-token coding remains hard
LongCodeBench evaluates coding LLMs on comprehension and repair tasks ranging from tens of thousands to one million tokens, using real GitHub issues.
LongCodeBench data
| Measurement | Value | Context |
|---|---|---|
| Benchmark instances | 1,043 | From 108 real repositories |
| Context range | Tens of thousands to 1M tokens | Stratified by complexity |
| Claude 3.5 Sonnet performance drop | 29% to 3% | As context length increases to maximum |
| Qwen2.5 performance drop | 70.2% to 40% | As context length increases |
| Tasks | Comprehension (LongCodeQA) + Repair (LongSWE-Bench) | Both reading and fixing |
| Maximum context tested | 1M tokens | Full benchmark scale |
Source trace: R37, paper-text/longcodebench-2505.07897.txt.
The performance drops are dramatic. Claude 3.5 Sonnet goes from 29% to 3% as context scales -- roughly a 90% relative degradation. Qwen2.5 falls from 70.2% to 40%, a relative drop of about 43%. This is not a minor effect. Even with access to the full codebase in context, the model's ability to find and fix bugs degrades catastrophically at scale.
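The relative degradation figure is simple arithmetic on the two endpoints in the table. A minimal sketch, using only the values reported above (`relative_drop` is an illustrative helper name):

```python
def relative_drop(before: float, after: float) -> float:
    """Return the relative decline from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Endpoints taken from the LongCodeBench data table.
claude_drop = relative_drop(29.0, 3.0)   # about 89.7
qwen_drop = relative_drop(70.2, 40.0)    # about 43.0
```

Reporting both the absolute scores and the relative drop matters here: a fall of 26 points from a 29% baseline removes most of the model's capability on the task, while a 30-point fall from 70.2% removes less than half.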
Inference: having a million-token window and filling it with a codebase does not make the agent effective. The agent needs structure to navigate that context -- file boundaries, module maps, dependency indicators -- or its performance collapses.
YABLoCo: very large real repositories
YABLoCo benchmarks code generation in C/C++ repositories ranging from 200K to 2,000K lines of code. It explicitly includes context of dependencies at different levels and call graphs.
YABLoCo data
| Measurement | Value | Context |
|---|---|---|
| Test functions | 215 | Selected from 4 large repositories |
| Repository sizes | 200K to 2,000K | Lines of code |
| Languages | C and C++ | Not covered by prior benchmarks |
| Context types included | Metadata, dependency contexts, docstrings, call graphs | Multiple levels |
| Key challenge | Inter-function dependencies in large repositories | Cross-file context |
Source trace: R36, paper-text/yabloco-2505.04406.txt.
YABLoCo is relevant because it explicitly provides multiple levels of context (metadata, direct dependencies, transitive dependencies, call graphs) and measures whether models can use them. The benchmark design itself embodies the progressive disclosure principle: not all context is equally useful, and the benchmark stratifies by how much dependency information is provided.
The design of this benchmark suggests that its authors regard raw long context as insufficient on its own -- structured context at different granularities is needed for real repository work.
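The distinction between direct and transitive dependency context can be illustrated with a depth-limited walk over a call graph. This is a sketch under assumptions: the call graph is a plain dictionary from caller to callees, and the function names are invented for illustration. It is not YABLoCo's tooling.

```python
from collections import deque
from typing import Dict, List, Set


def dependencies_within(
    call_graph: Dict[str, List[str]], target: str, max_depth: int
) -> Set[str]:
    """Collect functions reachable from `target` in at most `max_depth` calls.

    Uses breadth-first search, so each function is first seen at its
    shortest call distance from the target.
    """
    seen: Set[str] = set()
    queue = deque([(target, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for callee in call_graph.get(name, []):
            if callee not in seen and callee != target:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return seen


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "parse_request": ["read_headers", "decode_body"],
    "read_headers": ["trim"],
    "decode_body": ["trim", "inflate"],
    "inflate": ["alloc_buffer"],
}
direct = dependencies_within(call_graph, "parse_request", 1)
transitive = dependencies_within(call_graph, "parse_request", 3)
```

With `max_depth=1` only the two direct callees are returned; raising the depth pulls in progressively more of the graph, which is the knob a layered context builder would tune.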
Coding Agents as Long-Context Processors: tools beat attention
This paper makes the strongest argument for structured repository access over raw context windows. It frames long-context processing as a file system navigation problem and shows that coding agents with native tools outperform published state-of-the-art across multiple benchmarks.
Coding Agents Long-Context data
| Benchmark | Corpus size | Improvement over best published | Context |
|---|---|---|---|
| BrowseComp-Plus | 750M tokens | +11% | Relative |
| Oolong-Syn | 536K tokens | +10% | Relative |
| Oolong-Real | 385K tokens | +11% | Relative |
| LongBench | 188K tokens | -1% | Competitive, slight regression |
| Natural Questions | 3T tokens | +56% | Massive corpus |
| Average improvement | - | 17.3% | Across all benchmarks |
Source trace: R40, paper-text/coding-agents-long-context-processors-2603.20432.txt.
The paper attributes effectiveness to two key capabilities:
- Native tool proficiency: agents leverage executable code and terminal commands (grep, head, Python scripts) rather than passive semantic queries. This gives precise, executable interactions.
- File system familiarity: agents trained on code repositories transfer hierarchical navigation skills to long-context text tasks. Directory structures provide natural chunking.
A surprising negative result: "equipping coding agents with standard retrieval tools does not consistently improve performance." The agents develop their own task-specific strategies -- iterative query refinement for multi-hop retrieval, programmatic aggregation for analytical tasks.
Inference for codebase design: the repository file system IS the long-context interface. A well-organized directory structure with clear naming, modular files, and predictable locations gives agents a navigable structure that outperforms dumping everything into a flat context window.
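To make the "file system as interface" idea concrete, here is a minimal sketch of the kind of grep-then-head navigation the paper describes, written with the Python standard library only. The helper names `grep_tree` and `head` are illustrative assumptions and are not the agents' actual tooling.

```python
import re
from pathlib import Path
from typing import Iterator, List, Tuple


def grep_tree(
    root: Path, pattern: str, suffix: str = ".py"
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line_number, line) for lines matching `pattern` under `root`."""
    regex = re.compile(pattern)
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, number, line.strip()


def head(path: Path, count: int = 20) -> List[str]:
    """Return the first `count` lines of a file, similar to the `head` command."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for _, line in zip(range(count), handle)]
```

A caller would first run something like `grep_tree(Path("src"), r"def handle_")` to locate candidate files, then call `head` on only those paths. The model's context then holds a handful of matching lines and file headers rather than the whole tree.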
ContextBench: over-retrieval and under-utilization
ContextBench shows the problem from the agent's side: even with sophisticated retrieval, agents pull too much context and fail to use what they find.
ContextBench utilization gap
| Finding | Detail |
|---|---|
| Explored vs. utilized gap | Agents inspect gold-relevant code but fail to retain or use it in patches |
| Recall vs. precision | All LLMs favor recall (breadth) over precision (relevance) |
| Cost of aggressive retrieval | Mainly increases token consumption without proportional success gains |
| Balanced retrieval benefit | Higher Pass@1 at lower cost than aggressive retrieval |
Source trace: R10, paper-text/contextbench-2602.05892.txt.
This connects to the long-context problem directly: even without window limits, the agent's challenge is not ACCESSING information but USING it. A larger window means the agent can access more, but if it already fails to utilize what it accesses in smaller windows, scaling the window alone does not help.
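The recall-versus-precision tension can be made concrete with a file-level calculation. The definitions below are the standard set-based ones and are an illustrative assumption; ContextBench's own metrics may be defined differently, and the file names are invented.

```python
from typing import Iterable, Tuple


def retrieval_precision_recall(
    retrieved: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) of retrieved files against the gold-relevant set."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical example: two files actually matter for the patch.
gold = {"auth/session.py", "auth/tokens.py"}
aggressive = gold | {
    "db/models.py", "utils/log.py", "api/routes.py",
    "api/errors.py", "cli/main.py", "docs/conf.py",
}
balanced = gold | {"db/models.py"}

aggressive_scores = retrieval_precision_recall(aggressive, gold)
balanced_scores = retrieval_precision_recall(balanced, gold)
```

Both strategies reach full recall in this toy case, but the aggressive one does so at a quarter of the precision, which means three out of four retrieved files are noise the model still has to read past.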
Explicit inference
- Attention is not uniform. Lost in the Middle shows that positional effects degrade utilization. This looks like a persistent property of current transformer-based models rather than a bug that will be easily fixed. Codebase design should not assume the model reads everything equally.
- Performance degrades catastrophically at scale. LongCodeBench shows roughly 90% relative degradation for Claude 3.5 Sonnet at 1M tokens. A large window may help, but it is far from sufficient for effective repository work.
- Tools outperform raw context. Coding agents with file system access beat raw long-context baselines by 17.3% on average. The repository's directory structure and file organization are more effective than attention over flat text.
- The utilization gap is the real bottleneck. ContextBench shows agents over-retrieve and under-utilize. The problem is not "can the agent see the code?" but "does the agent use what it sees?" Structure helps by reducing noise and highlighting relevant material.
- Progressive disclosure is the design pattern. The converging evidence points to repositories exposing information in layers: a map at the top, module summaries at mid-level, implementation details at the leaf level. This matches how the most effective agents already navigate code.
What this does not prove
- This does not prove that large context windows are useless. They are clearly better than small windows for many tasks. The claim is that they are insufficient without structure, not that they are unhelpful.
- Lost in the Middle is from 2023 and uses older models. Newer architectures (particularly those with improved positional encoding) may reduce the U-curve. However, some positional effect likely persists.
- The coding-agents-as-long-context paper uses off-the-shelf frontier coding agents. The results may not transfer to smaller or less capable models that lack strong tool-use training.
- LongCodeBench performance drops are model-specific. Newer models (post-2025) likely perform better, but the relative degradation with scale is likely still present.
- This does not prove that retrieval is always superior to long context. For some tasks (holistic code understanding, architectural analysis), having the full codebase in context may genuinely help. The claim is about the marginal utility of structure, not the absolute superiority of one approach.
Codebase design for progressive disclosure
| Level | Content | Agent use case | Artifact |
|---|---|---|---|
| L0: Root map | Module names, boundaries, entry points | Orient to codebase | CLAUDE.md, generated tree |
| L1: Module summaries | Purpose, public API, dependencies | Decide which module to enter | README in each package, barrel files |
| L2: File documentation | Function signatures, type exports, contracts | Understand specific file | JSDoc/TSDoc, type annotations |
| L3: Implementation | Full function bodies, logic | Make specific changes | Source files |
| L4: History | Commit messages, PR descriptions, changelogs | Understand intent of past changes | Git log, CHANGELOG |
The agent should be able to work at L0-L1 for most decisions and descend to L2-L3 only for the files it needs to modify. If the codebase forces the agent to L3 for every decision (because there are no type annotations, no module summaries, no clear boundaries), it will either overflow its effective attention or waste tokens on irrelevant code.
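An L0 root map can be generated rather than hand-written. The sketch below assumes each top-level package keeps a `README.md` whose first non-empty line states its purpose; that convention, the `build_root_map` name, and the output format are all assumptions for illustration.

```python
from pathlib import Path
from typing import List


def build_root_map(root: Path, summary_name: str = "README.md") -> str:
    """Build an L0 map: one line per top-level directory with a one-line summary.

    The summary is the first non-empty line of the directory's README,
    with leading Markdown heading markers removed.
    """
    lines: List[str] = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir() or entry.name.startswith("."):
            continue  # skip files and hidden directories such as .git
        summary = "(no summary)"
        readme = entry / summary_name
        if readme.is_file():
            content = readme.read_text(encoding="utf-8", errors="replace")
            for raw in content.splitlines():
                text = raw.strip().lstrip("#").strip()
                if text:
                    summary = text
                    break
        lines.append(f"{entry.name}/: {summary}")
    return "\n".join(lines)
```

The resulting text is small enough to sit at the top of an agent's context and tells it which L1 summary to open next. Directories that show up as `(no summary)` are the places where the agent would otherwise be forced straight down to L3.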
Blog visual candidates
- LongCodeBench performance decay curve: performance vs. context length for multiple models.
- Lost in the Middle U-curve: accuracy by document position.
- Coding agents vs. raw context: bar chart of improvements across benchmarks.
- Progressive disclosure pyramid: L0 (map) through L4 (history), with token budget at each level.
- Two-panel comparison: flat context (everything loaded, model confused) vs. layered context (agent navigates structure, finds relevant files).
References
- R10: ContextBench, paper-text/contextbench-2602.05892.txt
- R32: Lost in the Middle, paper-text/lost-in-the-middle-2307.03172.txt
- R36: YABLoCo, paper-text/yabloco-2505.04406.txt
- R37: LongCodeBench, paper-text/longcodebench-2505.07897.txt
- R40: Coding Agents are Effective Long-Context Processors, paper-text/coding-agents-long-context-processors-2603.20432.txt