INSIGHT 14: Types and Static Surfaces Reduce Hallucinated APIs
Agents need to know which members, functions, classes, and external APIs are actually available. Typed and static-analysis-visible surfaces reduce invalid-code errors -- specifically undefined-variable errors, no-member errors, compilation failures, and API hallucination. The evidence converges from four independent approaches: autocompletion integration, type context injection, type-constrained decoding, and library-aware retrieval.
Source map
| Ref | Source | Local text | Role in this insight |
|---|---|---|---|
| R51 | ToolGen | paper-text/toolgen-autocomplete-repo-codegen-2401.06391.txt | Autocompletion-visible identifiers reduce undefined-variable and no-member errors. |
| R63 | CatCoder | paper-text/catcoder-2406.03283.txt | Type context from static analyzers improves Java/Rust repository generation. |
| R43 | Type-Constrained Code Generation | paper-text/type-constrained-codegen-2504.09246.txt | Type system constraints during decoding reduce compilation errors by >50%. |
| R64 | A3-CodGen | paper-text/a3-codgen-2312.05772.txt | Local/global/third-party-library-aware retrieval improves code reuse. |
ToolGen: autocompletion tools eliminate dependency errors
ToolGen (NTU Singapore, 2024) integrates program-analysis-based autocompletion tools into the code generation process. The motivation: "in real-world code repositories, more than 70% of functions are not standalone" and LLMs "cannot be aware of repository-level dependencies during the code generation process."
Source trace: R51, paper-text/toolgen-autocomplete-repo-codegen-2401.06391.txt, lines 20-22.
ToolGen results copied from the paper
| Metric | CodeGPT | CodeT5 | CodeLlama |
|---|---|---|---|
| Dependency Coverage improvement | +31.4% | +39.1% | +36.7% |
| Static Validity Rate improvement | +57.7% | +44.9% | +49.2% |
| Pass@1 on CoderEval | same | +40.0% | +25.0% |
| Latency per function (seconds) | 0.63-2.34 | 0.63-2.34 | 0.63-2.34 |
Source trace: R51, lines 78-88.
How ToolGen works
-
Offline phase: Analyzes source files, builds ASTs, identifies positions where autocompletion tools should be triggered (positions involving repository-level identifiers). Inserts a
<COMP>marker token and fine-tunes the LLM on augmented functions. -
Online phase: During generation, whenever the model predicts
<COMP>, ToolGen invokes the autocompletion tool (Jedi for Python) which provides valid completion suggestions from the current repository context. A constrained greedy search selects the best suggestion.
The illustrative example
When encountering self. in the file layers/layer.py from repo zomux/deepy:
- CodeLlama predicts
_updates-> no-member error - Jedi provides 68 accessible attributes, including the correct
_registered_updatesat position 46
The model hallucinates a plausible-sounding attribute that does not exist. The autocompletion tool knows what actually exists because it analyzes the type's defined members.
New metrics introduced
ToolGen introduces two dependency-focused metrics:
- Dependency Coverage: fraction of repository-level dependencies correctly resolved
- Static Validity Rate: fraction of generated functions that pass static validity checks (no undefined-variable, no-member, import errors)
These metrics directly measure what typed/static surfaces provide: correct symbol resolution.
CatCoder: type context for repository-level generation
CatCoder (Zhejiang University, 2025) enhances repository-level code generation by integrating both retrieved code and type context extracted from static analyzers.
CatCoder results copied from the paper
| Metric | Improvement over RepoCoder (Java) | Improvement over RepoCoder (Rust) |
|---|---|---|
| compile@k | up to 14.44% | up to 14.44% |
| pass@k | up to 17.35% | up to 17.35% |
Source trace: R63, paper-text/catcoder-2406.03283.txt, lines 19-20 and 117-119.
What type context provides
For the motivating example (Apache Commons Math, triu method):
- The LLM knows the algorithm (extract upper triangular part)
- But it does not know that
RealMatrixrequiresgetEntry(row, column)for access - It does not know that
Array2DRowRealMatrixis the concrete constructor - Type context from static analyzer provides: field definitions, method signatures, class
hierarchy for
RealMatrixand its implementations
Source trace: R63, lines 149-159.
CatCoder's three-stage pipeline
- Retrieve relevant code from the repository (similarity-based)
- Extract type context using static analyzer: dependent types, field/method signatures
- Integrate both into a single prompt for the LLM
The key insight: "the types associated with a function can also serve as valuable references for LLMs. Specifically, these related types provide a rich set of fields and methods from which the LLM can select when generating code."
Ablation evidence
The paper notes that removing type context reduces performance. The combination of retrieved code + type context outperforms either alone. This confirms that type information is complementary to code similarity retrieval -- not redundant.
Type-Constrained Code Generation: formal guarantees during decoding
Type-Constrained Code Generation (ETH Zurich + UC Berkeley, 2025) takes the strongest approach: enforcing type system rules during the decoding process itself.
Type-Constrained results copied from the paper
| Effect | Value |
|---|---|
| Compilation error reduction | >50% (more than half) |
| Functional correctness improvement (synthesis/translation) | +3.5% to +5.5% relative |
| Functionally correct repair improvement | +37% relative |
| Model sizes tested | 2B to 34B parameters |
| Language | TypeScript |
Source trace: R43, paper-text/type-constrained-codegen-2504.09246.txt, lines 88-91.
Error distribution finding
The paper reports a critical finding about where compilation errors come from:
| Error type | Share of compilation errors (TypeScript) |
|---|---|
| Type errors (failing type checks) | 94% |
| Syntactic errors | 6% |
Source trace: R43, lines 57-63.
This means: 94% of compilation errors in LLM-generated TypeScript are type errors, not syntax errors. Constraining only syntax (prior work) barely helps. Constraining types eliminates the dominant error class.
How type constraining works
The approach builds a prefix automaton that:
- Maintains a type environment for completed expressions
- Assesses whether partial expressions can be completed to match a required type
- Uses type inhabitation search to determine valid completions
- Rejects tokens that would make well-typed completion impossible
The result: generated code is guaranteed to be well-typed upon termination (a formal soundness guarantee).
A3-CodGen: library awareness prevents compatibility errors
A3-CodGen (Australian National University + Zhejiang University, 2024) provides the library-awareness angle. It identifies three categories of repository information:
Three information types in A3-CodGen
| Type | Definition | Failure mode without it |
|---|---|---|
| Local | Current file: defined functions, variables, FQNs | Logic errors, missed local APIs |
| Global | Other files in same repo: importable functions | Code duplication, missing reuse |
| Third-party | Pre-installed libraries accessible via import | ModuleNotFoundError, compatibility issues |
Source trace: R64, paper-text/a3-codgen-2312.05772.txt, lines 69-85.
Key finding
"According to a previous study, only around 30% of methods operate independently in open-source projects." The remaining 70% depend on other code in the repository.
Source trace: R64, lines 77-78.
A3-CodGen results
The framework generates code with higher reuse rates compared to human developers. Specifically:
- Outperforms GitHub Copilot on global reuse and third-party library reuse
- Performs similarly on local reuse
- Identifies optimal configurations for local and global information ("too much global context can hurt" -- information overload risk)
Inference for codebase design
The evidence supports concrete codebase practices:
| Practice | Mechanism | Evidence |
|---|---|---|
| Explicit exports and imports | Makes available symbols discoverable by static tools | R51 (ToolGen) |
| Typed public APIs | Type context enables CatCoder-style static analysis | R63 (CatCoder) |
| Declared class fields/dependencies | Prevents no-member errors; autocompletion works | R51 (ToolGen) |
| No dynamic mutation of types | Static analyzers cannot track runtime-added members | R43, R63 |
| Reliable typecheck commands | Enables type-constrained generation and validation | R43 |
| Approved library lists in docs | Prevents third-party hallucination | R64 (A3-CodGen) |
| Generated SDK clients from OpenAPI | Provides typed, discoverable API surfaces | Practitioner signal |
| Schema-first design | Types exist before implementation; agents can reference them | R63, inference |
The fundamental mechanism
All four papers converge on the same underlying mechanism:
- LLMs hallucinate API surfaces (members, functions, imports) that do not exist
- The hallucination rate is proportional to the invisibility of valid alternatives
- Making valid alternatives visible (via types, autocompletion, or constrained decoding) dramatically reduces invalid code
- The dominant error class (94% for TypeScript) is type errors, not syntax errors
Therefore: typed, statically analyzable codebases give agents access to the valid symbol space, reducing hallucination of the invalid symbol space.
What I should not claim
-
I should not claim types solve logical correctness. All four papers are about reducing structural/compilation errors, not about generating logically correct implementations. ToolGen's Pass@1 improvements are modest compared to its validity improvements.
-
I should not claim dynamic languages are hopeless. ToolGen works on Python using Jedi (which infers types at runtime-analysis time). A3-CodGen works on Python. But the type-constrained approach (R43) is explicitly limited to statically typed languages.
-
CatCoder's focus on "statically typed programming languages" is deliberate: "dynamically typed languages determine variable types at runtime, making type information less accessible and more ambiguous." For Python codebases, type hints (PEP 484) partially bridge this gap.
-
The 94% figure for type errors is specific to TypeScript. Other languages may have different error distributions. But the directional claim (types dominate over syntax) is likely general.
-
A3-CodGen finds that "too much global context can hurt." Type information is useful in doses. Dumping all types from a large codebase into context would trigger information overload. The agent needs relevant type slices.
Blog visual candidates
- Error distribution pie chart: 94% type errors vs 6% syntax errors in LLM-generated TypeScript. Single most striking number.
- ToolGen before/after:
self._updates(hallucinated) vsself._registered_updates(from autocompletion). Concrete, relatable. - CatCoder three-stage diagram: retrieve code + extract types + generate. Shows how type context complements retrieval.
- Improvement table across all four systems: convergent evidence from independent groups.
- Codebase affordance mapping: typed API -> fewer hallucinated members -> higher compile rate.
References
- R43: Type-Constrained Code Generation,
paper-text/type-constrained-codegen-2504.09246.txt - R51: ToolGen,
paper-text/toolgen-autocomplete-repo-codegen-2401.06391.txt - R63: CatCoder,
paper-text/catcoder-2406.03283.txt - R64: A3-CodGen,
paper-text/a3-codgen-2312.05772.txt