insight/simplicity-beats-agent-theater · working · tags: ai-agents, research

INSIGHT 07: Simplicity Beats Agent Theater

Complex agent orchestration is not a replacement for clear repository structure and deterministic validation. The evidence consistently shows that simple, well-structured workflows (localize, repair, validate) compete with or outperform multi-agent systems that use tree search, backtracking, and elaborate tool chains. The bottleneck is almost never "the agent needs more sophistication." The bottleneck is "the repository does not give the agent clear signals."

This matters for codebase design because it implies the highest-leverage investment is not in agent tooling but in repository clarity: searchable structure, deterministic tests, precise instructions, and reproducible setup. Only after those are clean does orchestration complexity pay off.

Source map

Ref	Source	Local text	Role in this insight
R23	Agentless	`paper-text/agentless-2407.01489.txt`	Direct evidence: simple localize-repair-validate beats complex agents on SWE-bench Lite.
R10	ContextBench	`paper-text/contextbench-2602.05892.txt`	Shows sophisticated scaffolding yields only marginal gains in context retrieval.
R25	SWE-Search	`paper-text/swe-search-iclr-2025.txt`	MCTS-based search improves performance, but still depends on repository navigation and feedback.
R24	AutoCodeRover	`paper-text/autocoderover-2404.05427.txt`	AST-based code search improves localization, but the improvement comes from structure visibility, not orchestration complexity.
R18	Evaluating AGENTS.md	`paper-text/evaluating-agents-md-2602.11988.txt`	Context files can reduce success rates and increase cost when they add noise -- more context is not always better.

Agentless: the strongest simplicity evidence

Agentless is the cleanest test of the simplicity hypothesis. It uses a three-phase process with no autonomous decision-making, no tool use during execution, and no iterative feedback loops:

Localization: hierarchical narrowing from files to classes/functions to edit locations.
Repair: generate multiple candidate patches in diff format.
Patch validation: use reproduction tests and regression tests to select the final patch.

Agentless data

Measurement	Value	Context
SWE-bench Lite performance	32.00%	96 correct fixes out of 300
Cost per issue	$0.70	Average
Ranking	Highest among all open-source agents	At time of publication
OpenAI adoption	Used as go-to approach	For showcasing GPT-4o and o1 coding performance
Agent turns required	0	No iterative agent loop
Tool complexity	None	No file editing tools, no shell, no search APIs

Source trace: R23, paper-text/agentless-2407.01489.txt.

The paper explicitly identifies three limitations of agent-based approaches that Agentless avoids:

Complex tool usage/design: agents require careful API design and format specification; incorrect tool use wastes queries and reduces performance.
Lack of control in decision planning: agents can take 30-40 turns with large action spaces, making incorrect decisions that compound.
Limited ability to self-reflect: agents struggle to filter incorrect or misleading information from environment feedback.

Key methodological insight: Agentless does not ask "can we make the agent smarter?" It asks "can we make the problem simpler?" The localization phase uses the repository's own structure (files, classes, functions) as the search hierarchy. This only works well when that structure is clear. In a codebase with scattered responsibilities, unclear module boundaries, or tangled dependencies, hierarchical localization would fail -- not because Agentless is too simple, but because the repository is too opaque.

ContextBench: scaffolding does not solve retrieval

ContextBench's key finding for this insight is stated directly in the abstract: "sophisticated agent scaffolding yields only marginal gains in context retrieval."

ContextBench scaffolding comparison

Agent type	Retrieval behavior	Implication
Simple baseline (mini-SWE-agent)	Comparable context retrieval to complex agents	Complex scaffolding does not reliably improve what gets retrieved
Complex agents (OpenHands, Prometheus)	More actions, more tokens, similar recall	Extra orchestration mainly increases cost
All evaluated LLMs	Favor recall over precision	Broad retrieval introduces noise regardless of scaffold
Balanced retrieval agents	Higher Pass@1 at lower cost	Restraint outperforms thoroughness

Source trace: R10, paper-text/contextbench-2602.05892.txt.

The paper's finding #4 is particularly relevant: "Models that balance retrieval frequency and context granularity achieve higher Pass@1 at lower cost, while aggressive retrieval mainly increases token consumption." This directly argues against the intuition that more exploration equals better results.

Inference: if the repository makes the right context easy to find (clear module boundaries, predictable file naming, explicit dependencies), a simple agent with few retrieval steps outperforms a complex agent doing exhaustive search over an opaque codebase.

SWE-Search: orchestration helps, but depends on repository signals

SWE-Search adds MCTS (Monte Carlo Tree Search) to agent exploration, achieving 23% relative improvement over standard open-source agents across five models.

SWE-Search data

Measurement	Value	Context
Relative improvement over standard agents	23%	Across 5 models on SWE-bench Lite
Search mechanism	MCTS with value function	Balances exploration and exploitation
Value estimation	LLM-based, both numerical and qualitative	Self-feedback loops
Final decision	Multi-agent debate (Discriminator Agent)	Collaborative decision-making
Key dependency	Repository navigation and feedback	Agent must observe signals to improve

Source trace: R25, paper-text/swe-search-iclr-2025.txt.

SWE-Search is the strongest counter-evidence to pure simplicity. It shows that search and backtracking do help. However, the improvement depends on the agent receiving meaningful feedback at each state -- it needs to observe whether its actions are moving toward a solution. This feedback comes from the repository: test results, linter output, build errors, and file content.

The design explicitly requires: "a dynamic code environment with a flexible state-space and a git-like commit tree structure" that "facilitates efficient backtracking to previous states." This is a repository-level affordance, not an agent-level one.

Inference: search-based agents amplify the signal quality of the repository. In a repository with fast, deterministic tests and clear feedback, MCTS can exploit that signal. In a repository with slow, flaky tests and ambiguous errors, MCTS explores noise.

AutoCodeRover: structure visibility drives improvement

AutoCodeRover uses code search APIs that operate on the AST (abstract syntax tree) rather than treating the project as a collection of files. It achieves 19% on SWE-bench Lite at $0.43 per issue.

AutoCodeRover data

Measurement	Value	Context
SWE-bench Lite performance	19%	57 correct fixes, pass@1
Average time per issue	~4 minutes	vs. developer average of 2.68 days
Average cost per issue	$0.43	USD
Code search APIs	AST-based	search_method_in_file, search_class, etc.
Fault localization	Spectrum-based (SBFL)	Uses test suite coverage data

Source trace: R24, paper-text/autocoderover-2404.05427.txt.

The paper makes an explicit software-engineering argument: "We work on program representations (abstract syntax tree) as opposed to viewing a software project as a mere collection of files." The improvement comes not from multi-agent debate or tree search, but from giving the agent structural access to the code. The AST-based search APIs are deterministic and require no LLM inference -- they are properties of the codebase, not of the agent.

This supports the simplicity thesis from a different angle: the "complexity" that actually helps is structural visibility of the repository, not behavioral complexity of the agent.

Evaluating AGENTS.md: more context can hurt

This paper evaluates whether repository-level context files (AGENTS.md) actually help agents solve tasks. The counterintuitive finding: context files tend to reduce task success rates while increasing cost.

Evaluating AGENTS.md data

Measurement	Value	Context
Benchmark tasks (AgentBench)	138	From 12 niche repositories with developer-written context files
Effect on task success	Tends to reduce	vs. no context file
Effect on cost	Increases by over 20%	More inference tokens
Behavioral change	Broader exploration (more testing, more file traversal)	Agents respect instructions even when unhelpful
Root cause	Unnecessary requirements make tasks harder	Context files add constraints agents try to satisfy

Source trace: R18, paper-text/evaluating-agents-md-2602.11988.txt.

This is the clearest evidence against "agent theater": adding more instructions, more scaffolding, more context does not inherently help. The paper's conclusion is that "human-written context files should describe only minimal requirements." Brevity and precision outperform thoroughness.

Explicit inference

Simplicity is competitive. Agentless at 32% outperformed all open-source agents at time of publication with zero agent turns and $0.70/issue. This establishes a strong baseline that any complex system must beat.
Scaffolding mainly adds cost, not capability. ContextBench shows marginal retrieval gains from complex scaffolding. Evaluating AGENTS.md shows context files can reduce success. The common pattern: complexity adds tokens and exploration without proportional improvement.
The actual bottleneck is repository clarity. Agentless succeeds because it uses the repo's own structure for localization. AutoCodeRover succeeds because it uses AST-based search. SWE-Search succeeds because it exploits test feedback. All three depend on the repository providing clear signals.
Orchestration has a role, but it is secondary. SWE-Search's 23% improvement is real. But it is an improvement over agents that already have access to good repository signals. In a codebase without fast tests or clear structure, the improvement would be smaller or absent.
Less is often more for instructions. Evaluating AGENTS.md directly shows that adding context can hurt. The lesson is that context should be minimal, precise, and action-oriented -- not exhaustive documentation of everything about the project.

What this does not prove

This does not prove that complex agents are never useful. Multi-agent systems with search and backtracking do improve performance when the repository provides good feedback signals.
This does not prove that Agentless is the best possible approach. At the time of writing, more sophisticated systems have surpassed Agentless on updated benchmarks. But those systems also operate on well-structured repositories with good test suites.
This does not prove that simplicity always wins for all task types. Feature addition, large-scale refactoring, and multi-file changes may genuinely require more exploratory approaches.
ContextBench and Evaluating AGENTS.md measure issue resolution, not open-ended feature work. The simplicity advantage may be smaller for creative or ambiguous tasks.
Agentless was evaluated on Python repositories. The transfer to other ecosystems (especially those with weaker test infrastructure) is plausible but not directly demonstrated.

Codebase design implications

Agent theater pattern	Simple alternative	Why it works better
Multi-agent debate over architecture	Clear module boundaries in code	Agents can localize without debating
Complex retrieval pipelines	Predictable file naming and structure	Simple search finds what it needs
Elaborate context injection	Minimal AGENTS.md with exact commands	Less noise, fewer unnecessary constraints
Test-generation agents	Pre-existing targeted tests	Deterministic feedback without inference cost
Orchestration frameworks	Single deterministic validation command	`pnpm lint && pnpm test` is cheaper than a framework
RAG over documentation	Types and interfaces at boundaries	Types ARE the documentation, no retrieval needed

Blog visual candidates

Agentless vs. agent-based systems: performance vs. cost scatter plot (Agentless achieves highest performance at lowest cost).
ContextBench radar plot: simple vs. complex agents have similar retrieval shape.
Effort allocation diagram: "Fix the repo, not the agent" -- time spent on repository clarity vs. agent orchestration, with diminishing returns on the orchestration axis.
Two-panel comparison: opaque repo with complex agent (many wasted turns) vs. clear repo with simple agent (few precise turns).

References

R10: ContextBench, paper-text/contextbench-2602.05892.txt
R18: Evaluating AGENTS.md, paper-text/evaluating-agents-md-2602.11988.txt
R23: Agentless, paper-text/agentless-2407.01489.txt
R24: AutoCodeRover, paper-text/autocoderover-2404.05427.txt
R25: SWE-Search, paper-text/swe-search-iclr-2025.txt