Write Code That AI Agents Love

The loop

Before we talk about code AI agents love, we need to talk about the loop they run.

Most coding agents do a messy version of five steps:

The codebase can help at each step of the loop:

Prompt: AGENTS.md, setup commands, repo instructions
Orient: architecture docs, bounded context, repo map, tool / LSP access
Retrieve: examples as specs, subagents, naming, domain vocabulary
Edit: generated SDKs, types, dependency surface, code quality, side effects, multi-file ripple
Verify: tests, typecheck / build, custom rules / polint, CI feedback

Prompt

The user asks for something. The agent also gets whatever repo instructions, tool descriptions, and context files are loaded at the start.

Prompting matters, but this article is not really about prompt tips. A good prompt can start the work. It should not have to explain the whole architecture.

Relevant sections: AGENTS.md/Claude.md, Layered context, Setup commands.

Orient

The agent asks: where am I, what kind of repo is this, what rules matter, and where should I start?

This is the "new maintainer dropped into the codebase" step. If the repo has no map, no clear boundaries, and no obvious product language, the agent starts navigating by vibes.

Relevant sections: RepoMap / Architecture map, Monorepo, Bounded Context / Layout, Domain vocabulary.

Retrieve

The agent pulls in the files and examples it thinks are relevant.

This is where bad naming and hidden dependency edges hurt. More context is not automatically better. The agent needs the right slice.

Relevant sections: Subagents, Naming, Types, Domain vocabulary, Code Quality.

Edit

The agent makes the change.

This is where the codebase either gives it a narrow path or lets it improvise. Types, generated SDKs, examples, and side-effect boundaries decide whether the edit lands inside the system or just looks plausible locally.

Relevant sections: Code Quality, Examples as specs, Types, Generated SDKs, Side effects & dynamic surfaces, Multi-file ripple, The Tools.

Verify

The agent checks whether the change worked, reads the failure, and loops back.

In reality this is not a clean state machine. The agent can jump from verify back to retrieve, from edit back to orient, or from prompt into a clarification. But the loop is still useful because every codebase problem shows up somewhere in it.

The end goal is autonomy: agents research, plan, implement, and review; humans decide what should ship. We do not get there by writing longer prompts. We get there by making the repository easier to orient in, retrieve from, edit, and verify.

Relevant sections: Tests, Examples as specs, Setup commands, The Tools.

The love

Agent-specific stuff

Agents.md/Claude.md

These are the files that your coding agent automatically loads into context when it hits the right filepaths.

I have a love-hate relationship with my CLAUDE.md/AGENTS.md. First off, I hate that Claude Code refuses to officially support AGENTS.md. Secondly, the agent does not follow the instructions properly. That is probably because it is very hard to describe clear instructions to an agent in words. These files work really well for things like:

Run make test-e2e to run all integrations tests, this properly seeds the database.
Read the architecture/good-to-know-patterns.md when reviewing.

And so on: clear prompts about which commands to run or what to do when. But there is stuff I would like to work better that frankly does not:

Follow the hexagonal architecture guidelines.
We use DDD in our codebase, make sure to define all business behavior in the domain.

When writing behavioral prompts, concrete patterns and anti-patterns are often forgotten. Research also shows that runtime is cut by roughly 30% and output tokens by about 16% when an AGENTS.md is present. This makes sense to me, as these files can remove some of the initial codebase exploration by giving a bit of guidance up front. But a counter-study found that they can also reduce the success rate by ~20%. Real code is hard to measure in a controlled study, and both studies used either benchmarks the models may have trained on (OSS) or synthetic codebases. So, my personal tips are:

These files will not fix everything, don't try to.
Code is better docs than these files, write good code to learn from instead.
Keep them clear and short, and focus on commands and bootstrapping exploration.
Avoid general advice that leaves a lot to inference, like "write good code".

AGENTS.md should point to commands, files, and rules.

Strong

# agent startup
Start here: thoughts/architecture/README.md
DDD rule: billing owns invoices and credits
Never edit: src/generated/**
API calls: use @acme/sdk
Before done: run the touched package tests

Weak

write clean code
use DDD
follow our architecture
be careful
do not invent billingAuthUserManager2

Broad architecture advice only helps when it points to a concrete rule, command, or file.

Layered context

More context is absolutely not always better. Better context is better. Layered or "cold loaded" context, such as skills, documentation, CLI stuff, etc., can absolutely be valuable! Especially as a codebase grows, it becomes a very bad idea to shove everything into the root AGENTS.md. Letting the model choose when to read which docs, READMEs, etc. has improved performance quite a bit for me. This is the layout I run in the monorepo root:

For main architectural / cross spanning knowledge.

thoughts/architecture/
-- 0-README-INDEX.md   (index to all other architecture files)
-- 1-BOUNDED-CONTEXT.md
-- 2-TESTNG-STRATEGY.md
-- 3-.... etc.

And then for very specific things:

backend/features/agents/evals/
-- AGENT.md   (Almost ONLY the index of what to read here)
-- PROMPT-ENGINEERING-IN-INTERNAL-AGENTS.md
-- RUNNING-EVALS.md

I try to keep things pretty thin here, and IMO for the local things, the code should be so obvious that it does not need docs on how to use or edit it. I added "prompt engineering" as an example here because, in my experience, my friend Claude needs some guidance there. Another trick I use in my SpecDrivenDevelopment pipeline: in the research phase, I force the model to enumerate the 3-5 most important architecture documents that it must read during planning and implementation to get a better sense of the codebase. In my own experience, this increased the agent's autonomy a bit.

Setup commands

You onboard your agent 100 times a day. Make it VERY easy. My personal favorites are:

Creating a new worktree spins up the dockerized service fully and seeds the database + starts web
make check runs ALL checks, which makes it easy to just say "run check and iterate until all green"
make setup-mac/linux/server/etc. that completely installs everything needed to start coding in your env.
devcontainers have actually been useful here :)

Research describes these as "machine-checkable contracts" that agents can run in a fresh environment. I agree.

Subagents

Subagents are agents your main agent can spin up to perform work with an isolated context window. In most coding agents, it is like hiring a team of juniors, telling them to work on the same thing, and not letting them communicate with each other. Great! Trying to force a "human way of working" onto subagents, like a designer, a backend engineer, etc., is wrong in my opinion.

The root context window has more power than the parallelized agents, and your underlying LLM is already an "expert" in these things. Where subagents shine is "context compression": you need to perform a context-heavy task and compress the results back to the main agent. That means searching the codebase, locating relevant files and patterns, or searching the web for things with a high noise-to-signal ratio. Your main context window does not need to be filled with the steps of how it got to the result; it just needs the insight. The main window should still do the code editing IMO. There are other ways to handle retrieval too, and a well-structured codebase probably has a higher impact.

Regardless, this is roughly how you create a subagent:

Claude Code: .claude/agents/research-codebase.md (or ~/.claude/agents/ for all projects). Also /agents in the CLI.
Cursor: .cursor/agents/research-codebase.md (or ~/.cursor/agents/ for all projects).
Codex: ask explicitly in the prompt (“spawn an explorer for backend/…, return a short summary”) or define roles in .codex/config.toml under [agents].

Each file is markdown + YAML frontmatter + instructions—for example:

---
name: research-codebase
description: Read-only exploration of a subtree; use when mapping architecture or finding entry points.
model: inherit
readonly: true
---

Explore only the paths you were given. Return layout, key modules, and 3–5 files to read next. No edits.

Claude delegates from description; Cursor via @research-codebase or natural language; Codex only spawns subagents when you ask (parent keeps architecture, child stays read-heavy).

RepoMap / Architecture map

The research here is pretty clear: if your agents understand the "graph" of your codebase, performance improves. Concretely, this means: who calls whom, what depends on what, what builds what, which tests lead where, whether this test branches where I want it to, and so on. The research is less clear on what to build and how to inject it into the agentic loop, and there are a few approaches.

Call graph slices in the prompt. Parse the repo, pull a small neighborhood around the suspect symbol. Dumping a huge subgraph hurts; query it.
Build/test map at session start. What builds what, what tests cover what. Extract from CMake, CTest, package files. Ground truth so the agent stops wandering build scripts.
Skinny symbol map on every edit (Aider style). Important defs and signatures across the repo. Shape before opening every file.
Search wide, rank narrow (Cody style). Keyword, graph, git, docs in; then cut to what fits the context window.

But you are probably not building a code agent harness; you are probably building a good ol' code-thingy. So, to make this actionable: make sure your coding agent has access to the LSP/IDE plugin. This gives some of this power to the agent, helping it navigate your codebase and sometimes powering retrieval.

BUT, if you are building an agentic harness (who isn't??), the call graph slice seems to have the highest impact. Making it queryable, extracting neighborhoods that typically impact each other. I learned on my last startup that call graphs are hard to build, so I'll (probably) just keep to LSP's for now.
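To make the "call graph slice" idea concrete, here is a minimal sketch of extracting a small neighborhood around one symbol. It assumes the graph has already been parsed into a plain map of caller to callees; the `CallGraph` shape, the `neighborhood` function, and every symbol name are hypothetical and not taken from any specific harness.

```typescript
// Hypothetical call graph: each symbol maps to the symbols it calls.
type CallGraph = Map<string, string[]>;

// Collect every symbol within `maxDepth` call edges of `start`,
// following edges in both directions (callees and callers), breadth-first.
function neighborhood(graph: CallGraph, start: string, maxDepth: number): string[] {
  // Build reverse edges so callers are reachable too.
  const callers = new Map<string, string[]>();
  graph.forEach((callees, caller) => {
    for (const callee of callees) {
      const known = callers.get(callee) ?? [];
      known.push(caller);
      callers.set(callee, known);
    }
  });

  const depthOf = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  for (let i = 0; i < queue.length; i++) {
    const current = queue[i];
    const depth = depthOf.get(current) ?? 0;
    if (depth >= maxDepth) continue; // do not expand past the requested radius
    const next = [...(graph.get(current) ?? []), ...(callers.get(current) ?? [])];
    for (const symbol of next) {
      if (!depthOf.has(symbol)) {
        depthOf.set(symbol, depth + 1);
        queue.push(symbol);
      }
    }
  }
  return queue.sort();
}

const callGraph: CallGraph = new Map([
  ["handleCheckout", ["applyCredit", "sendReceipt"]],
  ["applyCredit", ["loadInvoice", "saveInvoice"]],
  ["sendReceipt", ["renderEmail"]],
  ["nightlyReport", ["loadInvoice"]],
]);

// One hop around the suspect symbol: its callees plus its direct caller.
const suspectSlice = neighborhood(callGraph, "applyCredit", 1);
// suspectSlice: ["applyCredit", "handleCheckout", "loadInvoice", "saveInvoice"]
```

The depth limit is the whole point: the agent gets the few symbols that can plausibly be affected, not the entire graph.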

Structure

Monorepo

MONOREPO, you can stop reading now. That's it.

Jokes aside, I loved monorepos before AI took my job. It just makes everything soo much easier to maintain and ship. Sure, if you are Google (which has a monorepo) this is probably hard. But for the average startup, go! And by mono I really mean MONO. Here is a quick-list of what to put in your repo:

Frontend
Backend
Website
Blog
Docs
Infrastructure as code
All services (MICROSERVICE != A LOT OF REPOS).
All research and plan documents that you create in your SpecDD workflow
All agent-driven review documents
Architecture docs
Compliance documentation
Grafana settings
The best place to eat ice cream in town
All .env secrets (encrypted + committed using SOPS)

Sure, MCPs make stuff searchable and retrievable in other systems. Have fun. Don't use a no-code tool to build your blog; teach your marketing people how to prompt with Claude Code instead of ChatJippety.

Additionally, this means that you have a completely co-versioned company. Keeping things in sync becomes much easier when everything is on the same SHA.

Bounded Context / Layout

There is actually not a lot of research showing that "good architecture = good code generation". And there is also the debate over what good architecture even is... but I do think there are some wins here, and they are not about the agent. As developers working with AI-generated code, we need a mental model of the work we are doing. We used to build that mental model by crying over our keyboards for hours on end. Now we cry in tokens instead of tears, and the mental model of the codebase becomes harder to form. We get cognitive debt.

I think that good bounded contexts within the codebase reduce the cognitive debt and make the code easier to grasp, which helps developers make better decisions = better code gen in the long run. It may not improve the token shotgun today, but it improves your ability to aim it.

I write a lot of Go this time around, and I really like the "Three Dots Labs" architecture (https://threedots.tech), which is a bit of DDD, event-driven, hexagonal-ish, with strong testing guidelines. Here, a bounded context is just a "service" that may be deployed on its own but can also run in a big monolith alongside other services. It may not import another bounded context directly, and it has clear responsibilities, interfaces, APIs, and dependencies. This works well for me and my team, but the goal is to keep your cognitive debt low... so do what's best for you IMO.

A bounded context is the boundary where one domain model, vocabulary, and rules apply.

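Diagram: the Billing bounded context holds its own domain model, vocabulary, business rules, and owned data, and exposes a public surface (commands / events / API). Another context, such as Auth, has its own model, vocabulary, and rules, and reaches Billing through the public surface only.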

The boundary can be a module, package, service, or team ownership line.
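The "may not import another bounded context directly" rule is easy to turn into a check an agent can run. Below is a minimal sketch under loud assumptions: the `contexts/<name>/...` layout, the convention that each context's public surface lives in `contexts/<name>/api`, and the `checkImport` function are all hypothetical and not how Three Dots Labs or my repo necessarily does it.

```typescript
// Hypothetical layout: contexts/<context>/..., public surface in contexts/<context>/api.
const CONTEXT_PATH = /^contexts\/([^/]+)\/(.*)$/;

interface ImportEdge {
  fromFile: string; // repo-relative file that contains the import
  toModule: string; // repo-relative module path being imported
}

// Returns a human-readable violation, or null when the import is allowed.
function checkImport(edge: ImportEdge): string | null {
  const from = CONTEXT_PATH.exec(edge.fromFile);
  const to = CONTEXT_PATH.exec(edge.toModule);
  // Code outside any context (scripts, shared libs) is not covered by this rule.
  if (from === null || to === null) return null;

  const [, fromContext] = from;
  const [, toContext, toRest] = to;
  if (fromContext === toContext) return null; // inside one context, anything goes

  const viaPublicSurface = toRest === "api" || toRest.indexOf("api/") === 0;
  return viaPublicSurface
    ? null
    : `${edge.fromFile} reaches into ${toContext} internals (${edge.toModule}); import contexts/${toContext}/api instead`;
}

// Auth importing Billing's domain model directly is flagged.
const violation = checkImport({
  fromFile: "contexts/auth/app/login.ts",
  toModule: "contexts/billing/domain/invoice",
});
```

The message names the file, the boundary it crossed, and the allowed alternative, which is the kind of failure text an agent can act on without a human explaining the architecture again.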

The Code

"Code Quality": "how easy is this code to change?"

Code quality has been debated in software systems for quite a while. A lot of it becomes religious. One camp suggests that code quality should magically improve tooling, delivery velocity, and the whole codebase. That can give you a false sense of security. The other camp says code quality barely matters and that you should only focus on moving fast. That makes it very easy to create a big ball of mud at lightning speed: unchangeable, unmaintainable, and hard to understand.

I think it is interesting that we have companies here in Malmo doing good research on this, like CodeScene. Looking at their recent research, they used CodeHealth metrics to refactor Python code and found a correlation between code health and the ability to successfully patch that code. The weaker models had a bigger divergence between healthy and unhealthy code.

The numbers are useful, but only if we keep them narrow. In Code for Machines, Not Just Humans, healthy files gave medium open-weight models an 8-15 percentage point lower break rate. But with stronger systems, the gap nearly disappeared: Sonnet 4.5 was 86.77% on healthy files vs 84.03% on unhealthy files, and Claude in an agentic scaffold was 96.19% vs 94.81%. That is exactly the point. Code health seems to matter most when the model has less spare intelligence or less tooling.

But they also showed that if you max out and pick the best model for the task, CodeHealth barely matters in that setup. That is my own experience as well. Code health is important, sure. But better models are an effective way to pay your way out of some problems. There are not many times in history where we have been able to do that, but AI coding is one of them. I can survive on maybe three max subscriptions for different models, and that is relatively cheap if it means I can use the best possible model most of the time. That is the correct type of token maxing: spend more cash where there is a clearly positive ROI.

So I do not think the simple version makes sense, at least not in the codebases I work on. "Good code quality makes agents work" is too broad. Strong models can compensate for a lot. But there are still parts of code quality that matter a lot here. If you are building a product where the code matters (not simple websites and CRUD apps), you need to be able to maintain it in the long run. This is not only for agents. It is for humans as well.

I think there is a new kind of debt rising here: cognitive debt (popular term on X). You work with the codebase a lot, but you do not actually read it. You do not build the mental model yourself. You let the agent move things around, and suddenly you do not have the capability to understand the codebase anymore. That is where code quality becomes important right now.

Better code is easier to evaluate. It is easier for you to tell if the agent did a good job or a bad job. That is true for quality, security, and the actual product direction. The more important question is not only whether the agent can change the code. It is whether you, as a human, can understand the codebase fast enough to judge the change.

This is where I like Needle in the Repo as a warning. It found 64/483 cases where functional tests passed but the structural or maintainability oracle failed. That is 13.3%. The exact percentage may not transfer to your repo, but the failure mode absolutely does: passing tests can still leave the change in the wrong place.

I talked to my grandmother the other day, and she told me about this "farmer's eye". A farmer can walk into a barn with a thousand cows and see which cow is doing badly. They might not be able to explain exactly why. They might not be able to put it into a guideline. But they can see it.

That is the kind of judgment we still need in codebases. You need to be able to look at a change and quickly feel whether it belongs, whether the shape is right, and whether it will be hard to maintain later. When you have the "farmer's eye" for your codebase, that's great. Optimize for the code quality that lets you keep it.

That is where I think code quality matters most right now:

Code quality is how easy it is for the next human to understand whether the agent made a good change.

Code Quality = Compounding Velocity

Hard to review

if (status === "paid" && role !== "guest" && invoice.total > 0) {
  // apply credit, send email, audit, update DB
  await applyThing(user, invoice, credit, true)
}

The agent can make this pass. The reviewer still has to reverse-engineer the rule.

Easier to change

if (!billingPolicy.canApplyCredit(user, invoice)) {
  return "not_allowed"
}

await billing.applyCredit({
  invoiceId, creditId
})

The domain rule has a name. The side effect has one obvious place to live.

The useful question: can I tell where the change belongs?
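Here is a self-contained sketch of the "easier to change" shape, so the two snippets above can be compared as running code. The `billingPolicy.canApplyCredit` and `billing.applyCredit` names come from the snippet; the `AppUser` and `Invoice` shapes, the `paymentState` field, and the in-memory `appliedCredits` list are my assumptions, and everything is synchronous to keep it short (the real thing would be async and hit a database).

```typescript
// Hypothetical shapes; the snippet above does not define them.
interface AppUser {
  role: "guest" | "member" | "admin";
}

interface Invoice {
  id: string;
  paymentState: "draft" | "paid";
  total: number;
}

// The domain rule has one name and one home.
const billingPolicy = {
  canApplyCredit(user: AppUser, invoice: Invoice): boolean {
    return invoice.paymentState === "paid" && user.role !== "guest" && invoice.total > 0;
  },
};

// The side effect has one obvious place to live. Here it only records the call.
const appliedCredits: Array<{ invoiceId: string; creditId: string }> = [];
const billing = {
  applyCredit(input: { invoiceId: string; creditId: string }): void {
    appliedCredits.push(input);
  },
};

type ApplyCreditResult = "applied" | "not_allowed";

function applyCreditToInvoice(user: AppUser, invoice: Invoice, creditId: string): ApplyCreditResult {
  if (!billingPolicy.canApplyCredit(user, invoice)) {
    return "not_allowed";
  }
  billing.applyCredit({ invoiceId: invoice.id, creditId });
  return "applied";
}

const paidInvoice: Invoice = { id: "inv_1", paymentState: "paid", total: 100 };
const memberResult = applyCreditToInvoice({ role: "member" }, paidInvoice, "credit_1");
const guestResult = applyCreditToInvoice({ role: "guest" }, paidInvoice, "credit_2");
```

If the rule changes (say, admins may credit draft invoices), the reviewer knows the diff belongs in `canApplyCredit` and nowhere else.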

Better models flatten the CodeHealth gap.

Refactor tests passed (%), unhealthy vs. healthy code:

Group	Model	Unhealthy code	Healthy code	Gap
Frontier / agentic	Claude Code v2.0.13 (claude-sonnet-4-5-20250929)	94.8	96.2	+1.4 pp
Frontier / agentic	Sonnet (claude-sonnet-4-5-20250929)	84.0	86.8	+2.7 pp
Medium direct LLMs	Qwen (Qwen3-Coder-30B-A3B-Instruct)	72.2	80.7	+8.6 pp
Medium direct LLMs	GPT-OSS (gpt-oss-20b)	53.0	64.1	+11.2 pp
Medium direct LLMs	GLM (GLM-4-32B-0414)	50.0	60.1	+10.2 pp
Medium direct LLMs	Gemma (gemma-3-27b-it)	40.6	55.7	+15.1 pp
Medium direct LLMs	Granite (Granite-4.0-H-Small)	37.2	46.5	+9.3 pp

Borg et al., 2026 · Python competitive-programming files

Tests

When I talk about tests with agents and agentic development, this is a bit of a divider. Some people say it is too dangerous to let AI agents write the tests and then implement the code, because then you have basically implemented the same flaw twice and locked it in with a good-looking test.

I agree with some of that. A bad AI-generated test can absolutely bless the wrong behavior. If the agent misunderstands the task, it can write a test for the misunderstanding and then make the implementation pass. That is not quality. That is confidence theater.

The research is mixed in the same way. In Rethinking the Value of Agent-Generated Tests, GPT-5.2 wrote new tests in only 0.6% of tasks while resolving 71.8%. Claude Opus 4.5 wrote tests in about 83.0% of tasks and resolved 74.4%. More agent-written tests did not automatically mean dramatically better outcomes.

But I also think the other side is true. Tests are one of the best ways to give the agent a feedback loop. They make the behavior readable for the agent, and for you as the human. A good test lets you read the codebase and see the intent. You can look at the test and understand what the system is supposed to achieve.

The stronger point is about visible executable behavior. In FeatureBench, exposing ground-truth unit tests improved resolved rate by +50.0 percentage points for Gemini-3-Pro-Preview and +43.3 points for GPT-5.1-Codex on the Lite set. I would not frame that as "leak tests." I would frame it as: runnable examples are extremely powerful steering.

The taxonomy I use comes straight from Three Dots Labs. I stole it with pride. The diagrams below walk it from the inside out, one boundary at a time, and each one spells out what that test proves and what it is good for.

I especially like component tests. If you have a service around some kind of context, like user context, scheduled context, tenant context, or whatever it is, you should be able to run that in a test suite without mocking every direct dependency (such as the database related to that service). You test it in a pretty realistic environment, and you test the actual behavior. Then you do not need to go all the way through Playwright for everything. Playwright also tests the browser, fonts, rendering, timing, and a lot of other things that can be brittle. I want a deterministic backend or service-level test where the agent can get a clear failure, fix the thing, and run it again.

Those tests can be written almost like user stories. They describe what the system should achieve, they become easy to inspect, and they communicate back to the agent what it actually changed. You can read them quickly and see whether the test takes the right path through the system.

I care less about test coverage as a percentage and more about branch and behavior coverage. For example, an API endpoint should have tests for the HTTP codes it can return, tenancy tests and access tests. It should prove the most important paths from the outside in, in a way that is easy for a human to inspect. Optimize your test code and infra for prod-like accuracy and readability, to be able to quickly verify intended behavior.

That is the part I care about:

Tests are not there to make the percentage go up. They are there to make behavior visible.
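As a small illustration of behavior coverage over percentage coverage, here is a sketch of outside-in cases for one endpoint: the HTTP codes it can return, a tenancy case, and an access case. The handler, the `Caller` shape, and the in-memory `storedInvoices` list are hypothetical stand-ins; a real component test would go through the service's public API with its real owned database.

```typescript
// Hypothetical caller and data shapes for one "get invoice" endpoint.
interface Caller {
  tenantId: string;
  canReadInvoices: boolean;
}

interface StoredInvoice {
  id: string;
  tenantId: string;
}

// In-memory stand-in for the owned database (an assumption for this sketch).
const storedInvoices: StoredInvoice[] = [{ id: "inv_1", tenantId: "tenant_a" }];

// Returns the HTTP status the endpoint would answer with.
function getInvoiceStatus(caller: Caller | null, invoiceId: string): number {
  if (caller === null) return 401; // not authenticated
  if (!caller.canReadInvoices) return 403; // authenticated, but not allowed
  const invoice = storedInvoices.find((candidate) => candidate.id === invoiceId);
  // Another tenant's invoice must look exactly like a missing one.
  if (invoice === undefined || invoice.tenantId !== caller.tenantId) return 404;
  return 200;
}

// Each case reads like a one-line user story: who asks, for what, and what they get.
interface BehaviorCase {
  story: string;
  caller: Caller | null;
  invoiceId: string;
  expected: number;
}

const invoiceOwner: Caller = { tenantId: "tenant_a", canReadInvoices: true };

const behaviorCases: BehaviorCase[] = [
  { story: "owner reads their invoice", caller: invoiceOwner, invoiceId: "inv_1", expected: 200 },
  { story: "anonymous caller is rejected", caller: null, invoiceId: "inv_1", expected: 401 },
  { story: "user without invoice access is forbidden", caller: { ...invoiceOwner, canReadInvoices: false }, invoiceId: "inv_1", expected: 403 },
  { story: "other tenant cannot see the invoice", caller: { ...invoiceOwner, tenantId: "tenant_b" }, invoiceId: "inv_1", expected: 404 },
  { story: "unknown invoice is not found", caller: invoiceOwner, invoiceId: "inv_missing", expected: 404 },
];

// Returns the stories whose actual status differs from the expected one.
function failingStories(): string[] {
  return behaviorCases
    .filter((behavior) => getInvoiceStatus(behavior.caller, behavior.invoiceId) !== behavior.expected)
    .map((behavior) => behavior.story);
}
```

A reviewer can scan the `story` column in seconds and see which paths are proven, and a failing story name tells the agent what behavior it broke.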

Start by drawing one bounded context: ports, app, domain, adapters, and the data it owns. Everything else stays outside. This is the box every test below is scoped against.

Diagram: Start with one bounded context. A request enters the context through Ports (HTTP / gRPC), then App (command / query), Domain, and Adapters, down to the owned DB (real); the external service sits outside. Shows: one service boundary. Inside: owned model and owned data. Outside: external systems stay outside.

Unit tests isolate the rule: domain and application logic only, no network and no database. Fast, stable, and the cheapest feedback you can hand an agent for corner cases.

Diagram: Unit tests isolate the rule. Same context as above (Ports, App, Domain, Adapters, owned DB, external service). Proves: domain and application logic. Real: no network, no database. Mock: adapters and external effects.

Adapter (integration) tests check one real dependency at a time, against a real database or broker. They prove the SQL, transactions, and queries actually work, not just that they compile.

Diagram: Adapter tests check the real dependency. The adapter code runs against a real DB. Proves: one adapter at a time. Real: database / broker. Mock: the rest of the service.

Component tests run the public API through the whole service, with real owned infrastructure and only external services mocked. This is the agent sweet spot: easy to read, easy to keep stable, and they test user behavior instead of implementation details.

Diagram: Component tests are the agent sweet spot. The request runs through Ports, App, Domain, and Adapters against the real owned DB, with an external mock in place of the external service. Proves: public API through the whole service. Real: owned infrastructure. Mock: external services.

End-to-end tests deploy the system together and follow a client path across services. They prove the contract holds, but they are slow and flaky, so keep them for the critical journeys only.

Diagram: E2E tests prove the system contract. A client calls service A (Ports, App, Domain, Adapters, real DB A), which talks to service B (Ports, App, Domain, Adapters, real service deps). Proves: client path across services. Real: multiple services together. Mock: as little as possible.

The boundary decides the test. Component and E2E read like acceptance criteria for the business case; unit and adapter stay diagnostic. Put each test where it actually proves something.

Diagram: Put each test at the boundary it proves. Acceptance tests (component tests and E2E tests) are business-case related and check the complete feature. Below them sit unit tests in Domain, unit tests in Application, unit tests in Ports, and adapter tests of the database repository.

Examples as specs

Examples in your codebase are also where code quality comes back. There is a kind of compounding quality with AI agents: you get more of what you already have.

Regardless of whether code quality in the traditional sense directly affects agent performance, you should understand that compounding effect. If your repo has good examples, the agent is more likely to copy good examples. If your repo has messy examples, the agent is more likely to create more mess.

We are not in a place where all the boring blocks have been removed from codebases. The blueprints are not magically there. The scaffolding around the domain logic or the business logic still needs to be written in a lot of codebases. Having strong opinions about how to do that well is good.

This is also where subagents tie in nicely. One of my favorite subagents is a pattern finder. Its job is to do an exhaustive search through the codebase and find the patterns I should mimic for the thing I am building right now.

This has decent research backing if we keep the claim modest. DocPrompting improved CodeT5 pass@1 by 2.85 percentage points on execution-based Python CoNaLa, a 52% relative gain, by retrieving relevant API docs. That is not repo-scale feature work, but it supports the basic idea: nearby, relevant usage information helps models call the right thing.

But it has to be the right example. The nearest random file is not a spec. It might just be old code. The useful example is the canonical one: the one that shows the current way we want this kind of thing to be built.

The pattern I trust is example plus contract plus check:

Context	What it gives the agent
Canonical example	The local shape to copy
Contract or type	Why that shape is valid
Test or validation command	Proof that the shape still works

So examples become specs when they are clear enough to copy:

where the code goes;
which imports are allowed;
what the test should look like;
what naming the domain uses;
which generated files are touched;
which command proves the behavior.

That is the point:

You get more of what you already have, so make the thing worth copying obvious.

Naming

Names are not style polish.

For agents, names are retrieval handles and semantic hints. They connect the request to the codebase. If a function is called reconcileInvoicePayment, the model gets a cheap summary. If it is called run, the model has to spend intelligence recovering meaning the code should have exposed.

The research is unusually clear for something developers usually treat as taste.

CodeT5 is the foundation signal: its identifier-aware pretraining is based on the observation that developer-assigned identifiers preserve code semantics. This does not prove that one naming convention improves a modern agent, but it does show that model builders have treated identifiers as special tokens because they carry meaning.

How Does Naming Affect LLMs on Code Analysis Tasks? is more direct. The authors perturb variable names, method/function definition names, and invocation names. On code search, GraphCodeBERT's Java MRR drops from 70.36% with original names to 17.03% when all names are anonymized/shuffled. Python drops from 68.17% to 23.73%. The paper also finds definition names hurt more when damaged than local variable or invocation names. That makes sense: a function name is often the cheapest summary of the behavior.

When Names Disappear gives the modern caveat. On ClassEval class-level summarization, GPT-4o drops from 87.3 to 58.7 after name obfuscation; DeepSeek V3 drops from 87.7 to 76.7. But competitive-programming code is more robust because algorithmic structure carries more of the meaning. Product code is different. Billing, auth, permissions, attendance, entitlement, and compliance code all carry intent through domain names.

The practical rule:

Prefer boring, truthful names over clever names.
Avoid misleading names more aggressively than vague names.
Name exported functions like behavior summaries: reconcileInvoicePayment, not process.
Avoid semantic sinkholes: utils, helpers, manager, handler, data.
Treat renames of public symbols as retrieval/API changes, not cosmetics.

That gives me a smaller conclusion:

Bad names make the agent spend intelligence recovering meaning the code should have exposed.
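The "semantic sinkholes" rule is mechanical enough to check. This is a rough sketch of a name scan, assuming you already have a list of exported identifiers from somewhere; the `splitIdentifier` and `findSinkholeNames` helpers are hypothetical, and the word list is just the one from the rule above.

```typescript
// Words listed above as semantic sinkholes.
const SINKHOLE_WORDS = ["utils", "helpers", "manager", "handler", "data"];

// Split camelCase, PascalCase, snake_case, and kebab-case into lowercase words.
function splitIdentifier(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // fooBar -> foo Bar
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // HTTPHandler -> HTTP Handler
    .split(/[^A-Za-z]+/)
    .filter((word) => word.length > 0)
    .map((word) => word.toLowerCase());
}

// Return the identifiers that contain at least one sinkhole word as a whole word.
function findSinkholeNames(identifiers: string[]): string[] {
  return identifiers.filter((identifier) =>
    splitIdentifier(identifier).some((word) => SINKHOLE_WORDS.indexOf(word) !== -1),
  );
}

const flaggedNames = findSinkholeNames([
  "reconcileInvoicePayment",
  "billingAuthUserManager2",
  "invoice_utils",
  "HTTPHandler",
  "metadataFor",
]);
// flaggedNames: ["billingAuthUserManager2", "invoice_utils", "HTTPHandler"]
```

Matching whole words keeps `metadataFor` out of the report even though it contains "data". A scan like this cannot catch misleading names, which are the worse problem, so treat it as a cheap first filter and nothing more.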

Types

Types are compressed context.

A good type tells the agent what values are valid, what fields exist, what methods are callable, what crosses a boundary, and what the compiler can reject before runtime. That matters because agents are very good at inventing plausible APIs: a field that almost exists, a payload that looks right, an enum value that was really just a string in its head.

The strongest evidence is Type-Constrained Code Generation. In its TypeScript experiments, about 94% of compilation errors are type-check failures, not syntax errors. Type-constrained decoding reduces compilation errors by 74.8% on HumanEval synthesis and 56.0% on MBPP synthesis. Functional correctness also moves: average pass@1 relative gain is 3.5% for synthesis, 5.0% for translation, and 37.0% for repair.

Do not overclaim this. It is TypeScript benchmark generation with constrained decoding, not proof that adding types to your app makes agents solve product work. The useful lesson is narrower:

Type information kills a large class of invalid code before tests run.

CatCoder gives the repository-level version. It combines retrieved code with type context extracted from static analyzers for Java and Rust. On Java, it improves over RepoCoder by up to 14.44% compile@k and 17.35% pass@k; removing type context drops Java pass performance by up to 11.57%. The caveat is important: code retrieval still matters. Types do not replace examples. Types show the valid surface; examples show the local pattern.

ToolGen shows the same failure mode through autocomplete/static-analysis tools. It improves dependency coverage by 31.4-39.1% and static validity rate by 44.9-57.7% across three LLMs. The bigger effect is validity, not magic correctness. That is the point.

Types prove shape, not intent. A perfectly typed bug is still a bug. You can model the wrong billing rule beautifully and still charge the customer twice. Tests and product specs still matter.

The type surfaces that matter most are boundaries:

API request/response types;
domain command types;
event payloads;
repository interfaces;
generated SDK clients;
schema-derived models;
result/error types;
discriminated unions for state.

Structure matters more than language choice. Python with good type hints, Pydantic models, dataclasses, and narrow interfaces can expose plenty of structure. TypeScript with any, dynamic indexing, stringly payloads, and broad JSON blobs can be a fog machine.

Practical rule:

Put precise types at boundaries before internals.
Use discriminated unions for states that must not combine.
Keep unknown at trust boundaries and narrow it immediately.
Treat any as a migration scar, not a design choice.
Prefer simple named types over clever generic mazes.
Derive clients and models from schemas where possible.
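To make the first three rules concrete, here is a minimal sketch, assuming a hypothetical invoice state. The `InvoiceState` variants and both function names are illustrations, not taken from any real codebase. It shows a discriminated union whose variants cannot combine, and an `unknown` input that is narrowed once at the trust boundary.

```typescript
// Hypothetical invoice state: each variant carries only the fields valid for it.
type InvoiceState =
  | { status: "draft" }
  | { status: "open"; dueDate: string }
  | { status: "paid"; paidAt: string }
  | { status: "void"; reason: string };

// Exhaustive switch: adding a new variant makes this function fail to compile
// until the new case is handled.
function describeInvoice(state: InvoiceState): string {
  switch (state.status) {
    case "draft":
      return "draft";
    case "open":
      return `open, due ${state.dueDate}`;
    case "paid":
      return `paid at ${state.paidAt}`;
    case "void":
      return `void: ${state.reason}`;
    default: {
      const unreachable: never = state;
      return unreachable;
    }
  }
}

// Trust boundary: accept unknown, narrow it once, and hand typed data inward.
function parseInvoiceState(input: unknown): InvoiceState {
  if (typeof input !== "object" || input === null) {
    throw new Error("invoice state must be an object");
  }
  const record = input as Record<string, unknown>;
  switch (record.status) {
    case "draft":
      return { status: "draft" };
    case "open":
      if (typeof record.dueDate === "string") {
        return { status: "open", dueDate: record.dueDate };
      }
      break;
    case "paid":
      if (typeof record.paidAt === "string") {
        return { status: "paid", paidAt: record.paidAt };
      }
      break;
    case "void":
      if (typeof record.reason === "string") {
        return { status: "void", reason: record.reason };
      }
      break;
  }
  throw new Error("unrecognized invoice state");
}
```

A paid invoice without `paidAt` cannot be constructed past the parser, so an agent editing internals never has to guess which field combinations are legal.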

Types are not documentation you hope the agent reads. They are documentation the compiler enforces.

A good type removes a wrong patch before the agent gets attached to it.

Generated SDKs

Generating SDKs was a good idea before AI started to take over the world. Now I think it is even more obviously a good idea.

Backend-to-frontend API communication, or service-to-service communication, has always been a problem. You can implement the client and the server separately and then try to thread the contract between them in English, but that is hard and error-prone. It is hard to clearly define an API contract in prose.

In code, it is much easier. Define the contract as types. Generate the DTOs and client functions. Use those in the frontend or in the service that communicates over the API.

That makes the feedback loop for agents tremendously fast. If I change something in an API from optional to required, that change should be reflected in the SDK I generate. If I work with TypeScript in the frontend, the compiler can now tell me where the old call sites are wrong.

The research here is adjacent, not direct, but the failure modes line up:

Signal	Why it matters for SDKs
ToolGen improved dependency coverage by 31.4-39.1% and static validity by 44.9-57.7%	Visible symbols reduce invented dependencies
Type-constrained code generation found about 94% of TypeScript compilation errors were type-check failures	Typed API shapes catch wrong payloads and fields
DocPrompting improved CodeT5 pass@1 by 2.85 points, a 52% relative gain	API usage context helps models call unfamiliar surfaces

None of these papers proves generated SDKs by itself. I would state it as an inference: if API docs, valid symbols, and type facts help models avoid wrong calls, then generated clients are the practical way to put those signals directly in the repo.

That is exactly what I want for agents. I want the agent to understand where it is using a DTO incorrectly. I want it to see the type error, understand what changed, and either fix the usage or surface the question to me while it is working on the feature.

This was a good idea before. Now it is just a very, very good idea.

Of course, we have things like tRPC, which I think are great for this. But tRPC also pushes you toward TypeScript on the backend. I prefer my backend in Go-flavored ice cream. So for me, generated SDKs are the clean version of the same idea: keep the backend language I want, but still give the frontend a typed, generated contract.

The tooling will evolve as well, but the principle is already clear:

If the API contract matters, make it code the agent can import.

Generated SDKs make API contracts local.

API contract -> generated client -> call sites -> type feedback

Stringly typed

await fetch("/api/invoices/" + id + "/credits", {
 method: "POST",
 body: JSON.stringify({ creditId }),
})

Contract shaped

await billingClient.applyInvoiceCredit({
 invoiceId,
 creditId,
})
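For illustration, this is roughly what could sit behind a call like `billingClient.applyInvoiceCredit`. It is a hand-written sketch of hypothetical generator output, not the output of any specific tool: `ApplyInvoiceCreditRequest`, `HttpRequest`, `Transport`, `buildApplyInvoiceCreditRequest`, and `createBillingClient` are all assumed names.

```typescript
// Hypothetical output of an SDK generator; every name here is illustrative.
interface ApplyInvoiceCreditRequest {
  invoiceId: string;
  creditId: string;
}

interface HttpRequest {
  method: "POST";
  path: string;
  body: string;
}

// The generated code owns the URL and the payload shape, so call sites cannot drift.
function buildApplyInvoiceCreditRequest(
  req: ApplyInvoiceCreditRequest,
): HttpRequest {
  return {
    method: "POST",
    path: `/api/invoices/${encodeURIComponent(req.invoiceId)}/credits`,
    body: JSON.stringify({ creditId: req.creditId }),
  };
}

// The transport is injected so the sketch does not depend on any HTTP library.
type Transport = (request: HttpRequest) => Promise<unknown>;

function createBillingClient(send: Transport) {
  return {
    applyInvoiceCredit: (req: ApplyInvoiceCreditRequest) =>
      send(buildApplyInvoiceCreditRequest(req)),
  };
}
```

If the contract later adds a required field to `ApplyInvoiceCreditRequest`, every call site that omits it stops compiling. The agent gets that feedback from the typechecker instead of from a failed request at runtime.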

Side effects & dynamic surfaces

Real systems have side effects. They send emails, enqueue jobs, mutate databases, register handlers, read environment variables, and talk to APIs. The problem is invisible side effects. If behavior is wired through import-time registration, string registries, reflection, glob-loaded plugins, monkeypatching, or environment-driven branches, the architecture moves out of the code graph and into runtime folklore.

Humans can sometimes learn that by being around the system long enough. Agents mostly learn through grep, ASTs, typecheckers, language servers, tests, and generated manifests. If those tools cannot see the edge, the agent has to infer it.

So side effects need static handles.

// Hard for agents: importing this file changes the application.
import '@/billing/register-invoice-events';

// Somewhere else:
register('invoice.paid', async (payload) => {
  const handler = handlers[payload.type];
  return handler(payload);
});

The behavior exists, but the surface is poor. The event name is a string. The payload shape is implied. The registry is mutable. The import is only there for its side effect. The agent may edit the handler and miss the registration path entirely.

Prefer an exported, typed manifest:

export const invoicePaidEvent = defineEvent({
  name: 'invoice.paid',
  payload: InvoicePaidPayloadSchema,
});

export const billingEventHandlers = {
  [invoicePaidEvent.name]: handleInvoicePaid,
} satisfies EventHandlerRegistry;

Now the dynamic behavior still exists, but it has a static address. The event can be searched. The payload has a schema. The registry has a type. The handler is imported by name.
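The manifest above uses `defineEvent`, `InvoicePaidPayloadSchema`, and `EventHandlerRegistry` without defining them. Here is one self-contained way they could look. This is a sketch under assumptions: the helper shapes are mine, and a hand-written `parse` function stands in for whatever schema library the repo uses.

```typescript
// Hypothetical helpers: one possible shape for defineEvent and the registry type.
interface EventDefinition<Name extends string, Payload> {
  name: Name;
  parse: (input: unknown) => Payload; // stands in for a schema library
}

function defineEvent<Name extends string, Payload>(
  def: EventDefinition<Name, Payload>,
): EventDefinition<Name, Payload> {
  return def;
}

interface InvoicePaidPayload {
  invoiceId: string;
  amountCents: number;
}

const invoicePaidEvent = defineEvent({
  name: "invoice.paid",
  parse: (input: unknown): InvoicePaidPayload => {
    const record = input as Record<string, unknown> | null;
    if (
      typeof record !== "object" ||
      record === null ||
      typeof record.invoiceId !== "string" ||
      typeof record.amountCents !== "number"
    ) {
      throw new Error("invalid invoice.paid payload");
    }
    return { invoiceId: record.invoiceId, amountCents: record.amountCents };
  },
});

// The registry type lists every event name and the handler signature it requires.
type EventHandlerRegistry = {
  "invoice.paid": (payload: InvoicePaidPayload) => string;
};

function handleInvoicePaid(payload: InvoicePaidPayload): string {
  return `ledger entry for ${payload.invoiceId}: ${payload.amountCents}`;
}

const billingEventHandlers = {
  [invoicePaidEvent.name]: handleInvoicePaid,
} satisfies EventHandlerRegistry;

// Dispatch validates the payload first, then calls the handler found by name.
function dispatchInvoicePaid(raw: unknown): string {
  const payload = invoicePaidEvent.parse(raw);
  return billingEventHandlers[invoicePaidEvent.name](payload);
}
```

Removing the handler from `billingEventHandlers`, or changing its payload type, becomes a compile error at the registry instead of a silent runtime miss.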

The research does not prove that "dynamic imports cause X% worse agent performance." That would be too neat. But the direction is clear. AutoCodeRover uses AST-based search instead of flat files. GraphCodeAgent loses performance when graph traversal is removed: GPT-4o DevEval Pass@1 drops from 58.14 to 51.83. RepoGraph also helps, with the important caveat that more graph is not always better: in one SWE-bench-Lite setup, 1-hop flat context reached 29.67% resolve while larger 2-hop flat context fell to 26.00%.

Make the important edges visible enough that tools can select the right slice. Make it "AST-able".

A side effect the agent cannot see is a dependency it cannot protect.

Multi-file ripple

Most real product work is not one edit. It is a first edit plus everything that edit forces elsewhere.

Change a function signature and the callers move. Add a route and the client, auth rule, tests, telemetry, and generated SDK move. Add a database field and the migration, repository, serializers, fixtures, and UI state move. The local patch can look perfectly reasonable and still be wrong because the hard part is the propagation.

This is the cleanest model:

seed edit:    the first place the change is obviously requested
derived edit: a second edit forced by dependency, contract, behavior, or constraint
oracle:       the check that proves both the new behavior and old behavior survived

The research backs this shape, but not in a magical way. CodePlan treats repository-level coding as a planning problem: start from seed edits, then use dependency and impact analysis to find the edits that follow. In its evaluation, CodePlan got 5/6 repositories to pass validity checks, while baselines without planning got 0/6. Small sample, old tooling, and not a universal proof. Still, the mental model is right: do not make the agent guess the ripple.
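The seed-to-derived step can be sketched as a walk over reverse dependencies. This is not CodePlan's algorithm, only a plain breadth-first traversal to show the idea. The `ImportGraph` shape, the `findDerivedEdits` name, and most file paths in `exampleGraph` are assumptions made for the example.

```typescript
// Maps each file to the files it imports. Most paths are illustrative.
type ImportGraph = Record<string, string[]>;

const exampleGraph: ImportGraph = {
  "billing/invoices/applyCredits.ts": ["billing/credits/createCredit.ts"],
  "billing/invoices/total.ts": ["billing/invoices/applyCredits.ts"],
  "billing/payments/collect.ts": ["billing/invoices/total.ts"],
  "auth/session.ts": [],
};

// Returns every file that directly or transitively imports one of the seeds.
function findDerivedEdits(graph: ImportGraph, seeds: string[]): string[] {
  // Invert the graph: for each file, record which files import it.
  const importers = new Map<string, string[]>();
  for (const [file, imports] of Object.entries(graph)) {
    for (const imported of imports) {
      const list = importers.get(imported) ?? [];
      list.push(file);
      importers.set(imported, list);
    }
  }

  // Breadth-first walk from the seeds through their importers.
  const seen = new Set(seeds);
  const queue = [...seeds];
  const derived: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const importer of importers.get(current) ?? []) {
      if (!seen.has(importer)) {
        seen.add(importer);
        derived.push(importer);
        queue.push(importer);
      }
    }
  }
  return derived.sort();
}
```

With `billing/credits/createCredit.ts` as the seed, the walk reaches the credit application, the total calculation, and payment collection, and leaves `auth/session.ts` alone. A real impact analysis would also follow contracts, tests, and generated code, which an import graph alone does not capture.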

RACE-bench shows the same failure mode from the other side. AutoCodeRover applied patches in 96.21% of feature tasks but resolved only 28.79%. mini-SWE-Agent applied 95.83% and resolved 70.08%. A patch can apply cleanly and still miss the actual feature.

Constraint Decay makes it more production-shaped. As framework, architecture, database, and ORM constraints pile up, capable configurations lose about 30 percentage points of assertion pass rate. Constraints are the product. Hidden constraints are the problem.

Single-file changes are the wrong target. Make the ripple legible.

Bad task:

Add invoice credits.

Better task:

Goal:
- Customers can apply invoice credits before payment collection.

Seed surface:
- `billing/credits/createCredit.ts`
- `billing/invoices/applyCredits.ts`

Expected ripple:
- API route accepts `creditId`
- invoice total calculation includes applied credits
- payment collection uses adjusted total
- ledger entry is created for each applied credit
- generated SDK exposes the new request field
- existing invoice-payment behavior still passes without credits

Validation:
- unit tests for credit application
- integration test for invoice payment with and without credits
- typecheck catches generated SDK/request-shape drift
- migration test proves old invoices still load

That is the plan graph in human form. This also builds your mental model of the change that is about to happen, making it easier for you to review it later on.

The skeptical read is obvious: none of this proves your production repo gets better because you wrote a nicer task. CodePlan is small. RACE-bench and FeatureBench are benchmark harnesses. Constraint Decay is partly greenfield backend generation. Stronger models will reduce some misses.

But stronger models do not remove ripple. They only make the local patch better. The underlying job is still to find every dependent obligation and satisfy it without breaking old behavior. If the repo exposes those obligations as concrete artifacts, the agent has evidence. If the repo hides them in convention and memory, the agent has vibes.

Still, aim for an architecture with "reasonable ripple", so that one change does not force ten coupled changes elsewhere.

Domain vocabulary

Domain vocabulary matters when the agent has to connect the same product concept across files. Inside a bounded context, one concept should have one searchable name. Names are how the agent joins evidence across the codebase. When the same concept has five names, you have created five retrieval problems.

student in the frontend
pupil in the API
learner in the database
member in tests
participant in docs

A human may know this is historical sediment. The agent does not. It searches for "student" and misses the test named "adds member to course." It edits LearnerProfile and misses StudentEnrollment. The cost is lost locality.

Do not force one global noun across the whole company. That becomes its own stupidity. Account can mean a billing account, a login account, or a customer account. Fine. But inside one bounded context, accidental synonym drift is debt.

The research is narrower than production-agent editing, but it is enough for this claim: identifiers carry meaning. In "How Does Naming Affect LLMs on Code Analysis Tasks?", GraphCodeBERT's code-search MRR drops from 70.36% to 17.03% on Java and from 68.17% to 23.73% on Python when names are perturbed. In "When Names Disappear", GPT-4o drops from 87.3 to 58.7 on ClassEval summarization after obfuscation; DeepSeek V3 drops from 87.7 to 76.7.

Product code leans hard on domain nouns. Billing, permissions, attendance, entitlements, and compliance do not explain themselves through algorithmic structure. The noun is often the map.

So this is agent-hostile:

// api/enrollment.ts
export async function enrollPupil(pupilId: string, courseId: string) {}

// db/schema.ts
export const learnerEnrollments = table('learner_enrollments', {});

// tests/course-membership.test.ts
it('adds a member to a course', async () => {});

// events.ts
export const participantJoinedCourse = 'participant.joined_course';

Prefer this:

// api/enrollment.ts
export async function enrollStudent(studentId: string, courseId: string) {}

// db/schema.ts
export const studentEnrollments = table('student_enrollments', {});

// tests/student-enrollment.test.ts
it('enrolls a student in a course', async () => {});

// events.ts
export const studentEnrolledInCourse = 'student.enrolled_in_course';

Now search, tests, types, and events all point at student. The agent does not have to decide whether participant means student or a different role.

External boundaries are the exception. Stripe can say customer while your product says BillingAccount. That is fine. Put the translation in the adapter. Do not smear both words through the whole codebase and hope the agent guesses which one matters.
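A minimal sketch of that adapter, assuming a hypothetical `ExternalCustomer` shape. It is a stand-in for whatever the provider returns, not the real Stripe SDK type, and the `BillingAccount` fields are illustrative too.

```typescript
// Assumed subset of an external customer record; not the real Stripe SDK type.
interface ExternalCustomer {
  id: string;
  email: string | null;
}

// The product's own noun inside the billing context.
interface BillingAccount {
  billingAccountId: string;
  contactEmail: string | null;
  externalCustomerId: string;
}

// The only place where "customer" becomes "BillingAccount".
function toBillingAccount(
  billingAccountId: string,
  customer: ExternalCustomer,
): BillingAccount {
  return {
    billingAccountId,
    contactEmail: customer.email,
    externalCustomerId: customer.id,
  };
}
```

Everything on the inner side of `toBillingAccount` can then use the product noun, and a search for "customer" lands only in the adapter.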

A vocabulary note can help, but only if the code agrees with it. The stronger version is executable: schema names, event names, API operation names, generated SDK methods, and tests all use the same noun. A glossary that the code contradicts is stale prose.
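One way to make a vocabulary note executable is a small check that flags banned synonyms in identifiers. The sketch below is deliberately naive: the `vocabulary` map, the `VocabularyFinding` shape, and `findSynonymDrift` are hypothetical, and the word splitting only handles camelCase, snake_case, dotted names, and a trailing plural "s".

```typescript
// Hypothetical vocabulary for one bounded context: canonical noun -> banned synonyms.
const vocabulary: Record<string, string[]> = {
  student: ["pupil", "learner", "member", "participant"],
};

interface VocabularyFinding {
  identifier: string;
  found: string;
  preferred: string;
}

function findSynonymDrift(
  identifiers: string[],
  vocab: Record<string, string[]>,
): VocabularyFinding[] {
  const findings: VocabularyFinding[] = [];
  for (const identifier of identifiers) {
    // Split camelCase, snake_case, and dotted names into lowercase words.
    const words = identifier
      .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((word) => word.length > 0);
    for (const [preferred, synonyms] of Object.entries(vocab)) {
      for (const word of words) {
        // Also try the word without a trailing "s" to catch simple plurals.
        const singular = word.endsWith("s") ? word.slice(0, -1) : word;
        if (synonyms.includes(word) || synonyms.includes(singular)) {
          findings.push({ identifier, found: word, preferred });
        }
      }
    }
  }
  return findings;
}
```

Run over the identifiers from the agent-hostile example, this flags `enrollPupil`, `learnerEnrollments`, and `participant.joined_course`, and stays quiet on `enrollStudent`. A check like this only makes sense scoped to one bounded context, where a word like "member" has no legitimate second meaning.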

Names will not save a tangled system. They are not types, tests, or architecture. But they are semantic handles. If your domain language is inconsistent, you are removing handles and adding false ones.

Agents grep. Be grep-friendly!

The Tools

The useful tools are the ones the agent can run after an edit.

In this section, tools means verification: tests, typecheck, lint, build, and repo-specific policy checks. The agent makes a plausible change, runs a check, reads the failure, fixes it, and loops. The better the diagnostic, the less the agent has to guess.

Regular tools catch regular failures. Tests catch behavior. Typecheck catches API misuse. Build catches integration drift. Normal linters catch general code hygiene. But many important repo rules live above that level: which layer may import which layer, where API calls are allowed, which selectors are stable enough for E2E tests, which paths must go through a typed boundary, which patterns should never come back after a refactor.

This is where I think polint can be useful.

With polint, the useful unit is the failure the agent sees: a scoped repo rule, a file, a line, and a repair message. Instead of asking the agent to "please respect the architecture", give it something closer to:

backend/orders/ports/http.go:42:17 local/no-route-db-access
Routes must not import the ORM directly. Move persistence behind the application command.

That is much easier for an agent to fix than a paragraph in AGENTS.md.

The enforcement I would start with:

Enforcement	Why it helps agents
Routes cannot import ORM/database packages	Keeps the edit inside the application boundary instead of patching persistence from the edge.
Feature code cannot call raw HTTP when a generated SDK exists	Prevents hallucinated endpoints and stringly typed service calls.
Generated SDK files cannot be edited by hand	Forces the agent back to the OpenAPI/schema source.
E2E tests cannot use sleep-based waits	Turns flaky verification into event-based verification.
Cross-context imports are denied except through approved boundaries	Stops a local fix from creating a hidden architecture dependency.
Config code cannot silently fall back to defaults in production paths	Makes setup failures loud instead of mysterious.

This is the right level for polint: mechanical, repo-specific checks that normal linters do not know. "Write clean code" is a bad rule. "Do not import database/sql from */ports/http.go" is a good one. The best rules are boring enough that a reviewer should not have to explain them twice.
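To show how small such a rule can be, here is a sketch of the "no database/sql in */ports/http.go" check as a plain function. It does not use polint's API, which I am not reproducing here. `Diagnostic`, `checkRouteFile`, and `formatDiagnostic` are hypothetical names, and the check is a naive text search, so it would also match the import string inside a comment.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  column: number;
  rule: string;
  message: string;
}

const RULE = "local/no-route-db-access";
const FORBIDDEN_IMPORT = '"database/sql"';
const MESSAGE =
  "Routes must not import the ORM directly. " +
  "Move persistence behind the application command.";

// Flags the forbidden import in files whose path ends in ports/http.go.
function checkRouteFile(file: string, source: string): Diagnostic[] {
  if (!/(^|\/)ports\/http\.go$/.test(file)) {
    return [];
  }
  const diagnostics: Diagnostic[] = [];
  source.split("\n").forEach((text, index) => {
    const at = text.indexOf(FORBIDDEN_IMPORT);
    if (at !== -1) {
      // Lines and columns are 1-based, as in most compiler output.
      diagnostics.push({
        file,
        line: index + 1,
        column: at + 1,
        rule: RULE,
        message: MESSAGE,
      });
    }
  });
  return diagnostics;
}

// Renders the file:line:column rule format, followed by the repair message.
function formatDiagnostic(d: Diagnostic): string {
  return `${d.file}:${d.line}:${d.column} ${d.rule}\n${d.message}`;
}
```

The part that matters for the agent is the output: a location, a rule id, and a repair instruction it can act on without reading prose documentation.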

The research points the same way. CODETASTE uses repository tests plus custom static checks because refactoring correctness is not only "does the test suite pass?" Needle in the Repo found 64 of 483 passing-test cases where the structural or maintainability oracle still failed. I would not overfit that exact number to every repo, but the failure mode is real: tests can be green while the code lands in the wrong shape.

In my experience, adding these checks to the agent's verification loop has been really powerful. Every time the agent fails and I can programmatically "lint" the failure so it never happens again, I do it. This is part of what compounding quality means to me.

Custom rules turn repo conventions into diagnostics.

polint diagnostic

backend/orders/ports/http.go:42:17 local/no-route-db-access
Routes must not import the ORM directly.
Move persistence behind the application command.

The diagnostic tells the agent what rule it broke and how to repair the shape.

what I would lint

no raw fetch when an SDK exists
no cross-context import from billing to auth internals
no route-to-database access
no generated file edits

polint: repo-owned lint rules for agent workflows

Conclusion

Thanks for getting to the end of this VERY long post :) Hope you learned something! Ping me if you have thoughts or questions, point your agent to this blog and evaluate your codebase, and start with SDK gen + custom lint rules! Best of luck!

Start with high impact and manageable effort.

[Figure: impact/effort chart placing AGENTS.md, subagents, architecture docs, generated SDKs, tests, custom rules, code quality, and bounded context. Axis labels: lower impact/effort, high effort, high impact.]
