INSIGHT 25: Agentic PRs Have a Different Shape

Agent-written code is no longer a handful of anecdotes from people trying Claude Code on side projects. AIDev documents the phenomenon at ecosystem scale. "How AI Coding Agents Modify Code" compares merged agentic PRs with merged human PRs. "Do AI Agents Really Improve Code Readability?" mines agent commits made with readability intent.

This matters for the article because an agent-friendly codebase is not just a repo that helps an agent write code. It is also a repo whose tooling and maintainers can review, constrain, and absorb agent-shaped changes. Agent patches have measurable patterns. Some are useful. Some are biased. The repo should be designed around those observed biases.

Plot-ready data lives in presentations/write-code-ai-agents-love/research/data/agentic_pr_change_shape.csv.

Source map

| Ref | Source | Local text | Role in this insight |
|---|---|---|---|
| R69 | AIDev | paper-text/aidev-agentic-prs-2602.09185.txt | Establishes scale: hundreds of thousands of Agentic PRs. |
| R70 | How AI Coding Agents Modify Code | paper-text/how-ai-coding-agents-modify-code-2601.17581.txt | Compares merged agentic PRs with merged human PRs. |
| R76 | Do AI Agents Really Improve Code Readability? | paper-text/readability-agents-2603.13723.txt | Measures how readability-intended agent commits affect static metrics. |
| R09 | SWE-CI | paper-text/swe-ci-2603.03823.txt | Adds maintainability/regression evidence across CI history. |
| R46 | Needle in the Repo | paper-text/needle-in-the-repo-2603.27745.txt | Shows functional tests can miss structural maintainability failures. |

AIDev: agentic PRs are now observable at scale

AIDev is a dataset paper. Its importance is not a single performance number but the fact that agent-authored PRs are now numerous enough to study empirically. That changes the tone of the blog post: we do not need to argue only from personal experience. We can say that agentic development has a measurable footprint in public repositories.

AIDev data copied from the paper

| AIDev table | Value |
|---|---|
| All Agentic pull requests | 932,791 |
| All repositories | 116,211 |
| All users/developers | 72,189 |
| Curated Agentic PRs | 33,596 |
| Curated repositories | 2,807 |
| Curated users | 1,796 |
| PR comments | 39,122 |
| PR reviews | 28,875 |
| PR review comments | 19,450 |
| PR commits | 88,576 |
| PR commit details | 711,923 |
| Related issues | 4,923 |
| Issues | 4,614 |
| PR timeline events | 325,500 |
| PR task-type records | 33,596 |

Source trace: R69, paper-text/aidev-agentic-prs-2602.09185.txt.

Graph sketch: AIDev data layers

    flowchart TD
        PRs["932,791 Agentic PRs"] --> Repos["116,211 repositories"]
        PRs --> Users["72,189 users"]
        PRs --> Curated["33,596 curated PRs"]
        Curated --> Commits["88,576 PR commits"]
        Commits --> Details["711,923 commit details"]
        Curated --> Reviews["28,875 reviews"]
        Curated --> Comments["39,122 comments"]
        Curated --> Issues["4,923 related issues"]

The inference I want to make carefully: AIDev does not prove agentic PRs are good or bad. It proves there is enough agentic PR activity to study patterns. That supports using ecosystem data rather than relying only on "vibes."

How AI Coding Agents Modify Code: patch shape differs

This paper compares merged Agentic PRs with merged Human PRs. It reports statistical and practical differences across commit counts, files touched, additions, deletions, and line changes. The most useful number is Cliff's delta, because it describes how far apart the two distributions are (how often a value from one group exceeds a value from the other), not just the difference in means.

I need to be cautious about direction. The text summary says human PRs modify codebases more broadly and remove more code, while the table reports effect sizes for agentic vs human metrics. Before using a visual in the article, re-open the exact table and label the direction from the paper, not from memory. For this research note, the safe claim is that the distributions differ with small to large practical effects.
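
To make the direction problem concrete, here is a minimal sketch of Cliff's delta in Python. The function names, the toy commit counts, and the magnitude cut-offs (0.147, 0.33, 0.474, a commonly used convention) are my assumptions, not taken from the paper. The cut-offs do agree with the practical-size labels in the effect-size table in this note, but the paper's own thresholds should be confirmed before quoting them.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def practical_size(delta):
    """Conventional magnitude labels (assumed; check against the paper)."""
    magnitude = abs(delta)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"


# Toy commit counts per PR, invented for illustration only.
group_a = [1, 2, 2, 3, 5]
group_b = [1, 1, 2, 2, 3]

d_ab = cliffs_delta(group_a, group_b)  # positive: group_a tends to be larger
d_ba = cliffs_delta(group_b, group_a)  # same magnitude, opposite sign
print(d_ab, d_ba, practical_size(d_ab))
```

Swapping the argument order flips the sign but not the magnitude, which is why a chart of these deltas has to state which group is the reference before any "agents do more of X" claim is made.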

Dataset comparison copied from the paper

| Dataset slice | Agentic PRs | Human PRs |
|---|---|---|
| Merged PRs | 24,014 | 5,081 |
| Commits | 440,295 | 23,242 |

Effect-size data copied from the paper

| Change metric | Cliff's delta | Practical size in paper |
|---|---|---|
| Commits | 0.5429 | Large |
| Files touched | 0.4487 | Medium |
| Additions | 0.2836 | Small |
| Deletions | 0.4462 | Medium |
| Line changes | 0.3158 | Small |

Similarity data copied from the paper

| Similarity metric | Agentic PRs | Human PRs |
|---|---|---|
| TF-IDF | 0.1245 | 0.1007 |
| BM25 | 2.8455 | 0.1739 |
| CodeBERT | 0.9356 | 0.9285 |
| GraphCodeBERT | 0.8254 | 0.7815 |

Source trace: R70, paper-text/how-ai-coding-agents-modify-code-2601.17581.txt.

Chart sketch: practical effect sizes

    xychart-beta
        title "Agentic vs human PR change-shape effect sizes"
        x-axis ["Commits", "Files", "Additions", "Deletions", "Line changes"]
        y-axis "Cliff's delta" 0 --> 0.6
        bar "Delta" [0.5429, 0.4487, 0.2836, 0.4462, 0.3158]

The blog inference: code review systems should not treat an agent PR as "just a normal PR with an AI author." The patch shape, similarity profile, and failure modes may differ. Review gates should watch the patterns that matter:

  • more or fewer commits than expected for task size;
  • broad file touches;
  • few deletions or suspicious compatibility shims;
  • generated files edited by hand;
  • tests changed without implementation;
  • implementation changed without tests;
  • public API changed without generated clients/docs;
  • architecture boundaries crossed.
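
Several of the gates above can be sketched as a small check over a PR's changed-file list. This is only an illustration: the directory prefixes, the top-level-directory threshold, and the function names are hypothetical repo conventions, not something taken from the papers.

```python
from pathlib import PurePosixPath

# Hypothetical repo conventions; adjust to the real layout.
GENERATED_PREFIXES = ("generated/", "clients/gen/")
MAX_TOP_LEVEL_DIRS = 3


def is_test(path):
    """Treat files under a tests/ directory or named test_* as tests."""
    name = PurePosixPath(path).name
    return path.startswith("tests/") or "/tests/" in path or name.startswith("test_")


def is_generated(path):
    return path.startswith(GENERATED_PREFIXES)


def patch_shape_warnings(changed_files, additions, deletions):
    """Return human-readable warnings for suspicious patch shapes."""
    warnings = []
    tests = [p for p in changed_files if is_test(p)]
    generated = [p for p in changed_files if is_generated(p)]
    impl = [p for p in changed_files if p not in tests and p not in generated]
    # First path component; a root-level file counts as its own entry.
    top_dirs = {PurePosixPath(p).parts[0] for p in changed_files}

    if tests and not impl:
        warnings.append("tests changed without implementation")
    if impl and not tests:
        warnings.append("implementation changed without tests")
    if generated:
        warnings.append("generated files changed; confirm the generator produced them")
    if len(top_dirs) > MAX_TOP_LEVEL_DIRS:
        warnings.append(f"broad file touch: {len(top_dirs)} top-level directories")
    if additions > 0 and deletions == 0:
        warnings.append("additions with no deletions; check for leftover shims")
    return warnings


example = patch_shape_warnings(
    ["src/billing/invoice.py", "generated/api_client.py"],
    additions=120,
    deletions=0,
)
for warning in example:
    print(warning)
```

The checks are deliberately advisory. They surface a shape for a reviewer to look at; they do not decide whether the change is acceptable.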

Readability paper: even "readability" intent can degrade metrics

The readability paper is a useful corrective because it looks at commits whose messages explicitly claim to improve readability, understandability, or clarity. If even those commits often increase size or reduce maintainability metrics, then "the agent cleaned it up" is not sufficient evidence of improvement.

The paper mines AIDev for readability-related commits, classifies which smells they address, and computes static metric deltas before and after.

Readability-paper data copied from the paper

| Measurement | Value |
|---|---|
| Readability keyword commits | 4,115 |
| Share of all agent commits | about 0.3% |
| Python readability-related commits | 577 |
| Valid pre/post metric-analysis commits | 403 |
| Manual classification sample | 231 commits |
| Label-level agreement | 94.7% |
| Micro-averaged Cohen's kappa | 0.80 |

What readability commits addressed

| Readability smell addressed | Agent share |
|---|---|
| Complex, long, or inadequate logic | 42.4% |
| Incomplete or inadequate code documentation | 24.2% |
| Logic category from human baseline mentioned in text | 18.2% |

The full table has more categories, but these are the numbers most relevant to the blog point: agents emphasize logic complexity and documentation more than surface-level naming/formatting.

Metric deltas copied from the paper

| Metric | Mean delta (after - before) | Median delta | Effect-size note |
|---|---|---|---|
| Lines of Code | +27.61 | +6.00 | Large effect |
| Cyclomatic Complexity | +3.13 | 0.00 | Large effect |
| Multi-line Comments | +2.10 | 0.00 | Positive shift |
| Source Lines of Code | +18.47 | +2.60 | Large effect |
| Logical Lines of Code | +12.18 | not copied here | Positive shift |
| Halstead Volume | +43.60 | 0.00 | Some commits add cognitive load |
| Halstead Effort | +321.49 | 0.00 | Some commits add cognitive load |
| Halstead Difficulty | +0.31 | 0.00 | Small mean shift |
| Comment lines | +1.67 | not copied here | Positive shift |
| Single-line Comments | +0.73 | not copied here | Positive shift |
| Maintainability Index | -3.25 | not copied here | Medium negative effect |

Directional outcome shares copied from the paper

| Outcome | Share |
|---|---|
| Maintainability Index deteriorated | 56.1% |
| Lines of Code increased | 71.5% |
| Cyclomatic Complexity improved | 9.7% |
| Cyclomatic Complexity increased | 42.7% |

Source trace: R76, paper-text/readability-agents-2603.13723.txt.

Chart sketch: readability-intended does not mean structurally simpler

    xychart-beta
        title "Readability-related agent commits"
        x-axis ["MI deteriorated", "LOC increased", "CC increased", "CC improved"]
        y-axis "Share %" 0 --> 80
        bar "Commits" [56.1, 71.5, 42.7, 9.7]

This is not a claim that the agent changes were all bad. Static metrics are imperfect. Sometimes a readable change legitimately adds lines or comments. But the result is strong enough to justify review gates: readability claims should be checked against complexity, size, duplication, and actual human clarity, not accepted as self-evident.
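
A rough sketch of such a check, using only Python's standard `ast` module. The complexity count here is a crude approximation I wrote for illustration (1 plus the number of branch points); it is not the tooling the paper used, and the function names and sample snippets are hypothetical.

```python
import ast

# Node types counted as decision points in this rough approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def rough_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            score += len(node.values) - 1
    return score


def source_lines(source):
    """Count non-blank lines that are not pure comments."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))


def readability_deltas(before, after):
    """Metric deltas (after - before) for a commit that claims readability."""
    return {
        "sloc_delta": source_lines(after) - source_lines(before),
        "complexity_delta": rough_complexity(after) - rough_complexity(before),
    }


# Invented before/after pair: a conditional expression expanded into if/else.
BEFORE = "def pick(x):\n    return x if x > 0 else 0\n"
AFTER = (
    "def pick(x):\n"
    "    # Clamp negatives to zero.\n"
    "    if x > 0:\n"
    "        return x\n"
    "    else:\n"
    "        return 0\n"
)

deltas = readability_deltas(BEFORE, AFTER)
print(deltas)
```

In this toy pair the code gets longer while the branch count stays the same. That is the kind of change a reviewer should judge on actual clarity, since the metrics alone cannot say whether the extra lines helped.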

Link to maintainability benchmarks

SWE-CI and Needle in the Repo broaden the same concern. SWE-CI finds that many models can look good on style metrics while underperforming on Maintainability Index. Needle in the Repo finds that changes which pass functional tests can still fail structural oracles.

Supporting maintainability data from earlier insights

| Source | Data point | Why it matters here |
|---|---|---|
| SWE-CI | 15/20 models beat human oracle code on Pylint | Style-clean is not enough. |
| SWE-CI | 20/20 models underperform on Maintainability Index | Maintainability needs deeper checks. |
| Needle in the Repo | 64/483 outcomes pass functional tests but fail structural oracles | Behavior tests miss architecture failures. |
| Needle in the Repo | Dependency Control 4.3% solve rate | Some architectural requirements are especially hard. |
| Needle in the Repo | Responsibility Decomposition 15.2% solve rate | Agents struggle with structural design, not only syntax. |

Source traces: R09 and R46.

Inference for codebase design

Agent-friendly codebases should assume agent PRs are productive but biased. They should make the desired patch shape visible and block the unacceptable patch shapes.

| Observed risk | Codebase affordance | Example |
|---|---|---|
| Broad or odd file touch pattern | Ownership and package boundaries | CODEOWNERS, module tags, dependency rules |
| Superficial compatibility shim | Static refactoring rules and API migration tests | OpenGrep/Semgrep/custom lint for old imports |
| "Readability" adds volume | Complexity and maintainability gates | MI/CC thresholds, review checklist |
| Generated files hand-edited | Generated-code policy | "Do not edit generated clients; run generator" lint/CI |
| API call guessing | Generated SDKs and typed clients | OpenAPI -> TypeScript client |
| Structure bypassed but tests pass | Structural oracles | import-boundary lint, architecture tests |
| No-op should be accepted | Reproduction and abstention path | issue-specific verify command |
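
As one illustration of a structural oracle, an import-boundary lint can be sketched in a few lines of standard-library Python. The layering rule, the package names (`domain`, `web`, `db`), and the function names are all hypothetical; a real repo would more likely use an existing lint tool with its own configuration.

```python
import ast

# Hypothetical layering rule: code in `domain` must not import `web` or `db`.
FORBIDDEN = {"domain": ("web", "db")}


def imported_modules(source):
    """Yield the dotted module name of every absolute import in `source`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            yield node.module


def boundary_violations(package, source):
    """Return the imports in `source` that `package` is not allowed to use."""
    banned = FORBIDDEN.get(package, ())
    return sorted(
        module for module in imported_modules(source)
        if module.split(".")[0] in banned
    )


# Invented file contents for a module that lives in the `domain` package.
SAMPLE = (
    "import json\n"
    "from db.session import connect\n"
    "import web.handlers\n"
)

violations = boundary_violations("domain", SAMPLE)
print(violations)
```

A check like this fails regardless of whether the functional tests pass, which is the point: it catches a boundary crossing that behavior tests would not notice.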

How this changes the article

The article should not sound like "make the repo nice so agents can move faster." It should say:

  1. Agents are a new contributor class.
  2. They have observable patch-shape and quality tendencies.
  3. The repo needs affordances for generation and guardrails for review.
  4. The same artifacts help both: types, tests, generated clients, lint rules, package boundaries, setup commands, and explicit task specs.

This gives the talk a more serious engineering tone. It is not "AI agents love clean code." It is "AI agents need operational constraints because their changes have different statistical failure surfaces."

What I should not claim

I should not claim all agent PRs are worse. AIDev is a dataset; it does not score quality by itself. The agent-vs-human PR paper compares merged PRs, which have already passed some project review process. The readability paper is focused on readability-keyword commits and Python metrics.

I should not overclaim static readability metrics. They are signals, not truth. A larger patch can be more readable. A lower Maintainability Index can sometimes be an artifact. The safe claim is that readability intent does not reliably imply structural quality improvement.
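
To see why a lower Maintainability Index can be partly an artifact of size, here is a sketch using the classic unnormalized MI formula, 171 - 5.2 ln(V) - 0.23 CC - 16.2 ln(SLOC). This is an assumption on my part: the paper's tooling may use a normalized or comment-aware variant. The baseline values are invented; only the two deltas are the mean shifts from the metric table in this note.

```python
import math


def maintainability_index(halstead_volume, cyclomatic_complexity, sloc):
    """Classic unnormalized MI (assumed variant; the paper's may differ)."""
    return (171
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(sloc))


# Invented baseline for a small module.
before = maintainability_index(halstead_volume=400.0,
                               cyclomatic_complexity=5,
                               sloc=40.0)

# Same complexity, with the mean Halstead Volume (+43.60) and
# Source Lines of Code (+18.47) shifts applied.
after = maintainability_index(halstead_volume=400.0 + 43.60,
                              cyclomatic_complexity=5,
                              sloc=40.0 + 18.47)

print(round(before, 2), round(after, 2))
```

With cyclomatic complexity held constant, the index still drops, because both logarithmic terms grow with size. A patch that adds explanatory structure can therefore score worse on MI without being harder to read.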

Blog visual candidates

  1. AIDev scale graph: PRs -> repos -> users -> curated subset.
  2. Agentic-vs-human Cliff's delta bar chart.
  3. Readability commit outcomes bar chart.
  4. Review-gate matrix: observed agent PR risk -> codebase guardrail.
  5. Patch-shape dashboard mock: files touched, generated files, tests changed, boundary violations.

References

  • R09: SWE-CI, paper-text/swe-ci-2603.03823.txt
  • R46: Needle in the Repo, paper-text/needle-in-the-repo-2603.27745.txt
  • R69: AIDev, paper-text/aidev-agentic-prs-2602.09185.txt
  • R70: How AI Coding Agents Modify Code, paper-text/how-ai-coding-agents-modify-code-2601.17581.txt
  • R76: Do AI Agents Really Improve Code Readability?, paper-text/readability-agents-2603.13723.txt