TL;DR
- This study compares four LLM-based automated patching architectures on the same benchmark of 19 real-world Java vulnerabilities (AIxCC).
- Headline result: a general-purpose code agent (Claude Code) patched 16/19, outperforming the more patch-specific workflows in this setup.
- The authors argue architecture + iteration depth can matter as much as (or more than) raw model capability.
What this is about
“A Systematic Study of LLM-Based Architectures for Automated Patching” evaluates different ways of wiring LLMs for patch generation: fixed workflows, single-agent approaches, multi-agent approaches, and general-purpose code agents. The goal is to compare architectures apples-to-apples on the same vulnerability set.
Key points
- Four architectures: fixed workflow, single-agent, multi-agent, and a general-purpose code agent.
- Benchmark: 19 real-world Java vulnerabilities from the AIxCC competition.
- Reported outcome: Claude Code patched 16/19, while the best patch-specific design in the paper (multi-agent on GPT-5) reached 13/19.
- Design insight: multi-agent overhead doesn’t always pay off compared to a well-tuned single-agent system.
Why it matters
A lot of “autopatch” progress gets attributed to model upgrades, but in practice the orchestration pattern (how you plan, iterate, verify, and stop) can dominate results. If general-purpose code agents outperform bespoke patching workflows on real vulnerabilities, it suggests teams should invest in robust agent tooling and evaluation loops—not just prompt templates.
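The plan-iterate-verify-stop pattern above can be sketched as a minimal single-agent loop. This is a hypothetical illustration, not the paper's implementation: `propose_patch`, `verify`, and the iteration budget are stand-in names for whatever model call, verification harness, and stopping rule a real system would use.

```python
# Minimal sketch of a single-agent patch loop: propose a candidate, verify it,
# feed the failure back, and stop on success or when the budget runs out.
# All names here (propose_patch, verify) are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class PatchResult:
    patch: Optional[str]  # the accepted diff, or None if the budget was exhausted
    iterations: int       # how many attempts were consumed


def patch_loop(
    propose_patch: Callable[[str, str], str],    # (vuln_report, feedback) -> candidate diff
    verify: Callable[[str], Tuple[bool, str]],   # candidate -> (passed, feedback)
    vuln_report: str,
    max_iters: int = 5,                          # iteration budget: a key architecture knob
) -> PatchResult:
    feedback = ""
    for i in range(1, max_iters + 1):
        candidate = propose_patch(vuln_report, feedback)
        passed, feedback = verify(candidate)     # e.g. build + tests + exploit replay
        if passed:
            return PatchResult(patch=candidate, iterations=i)
    return PatchResult(patch=None, iterations=max_iters)
```

The point of writing it this way is that the architecture-level choices the study varies (how feedback is routed, how deep the loop runs, what counts as "verified") are explicit parameters rather than properties of the underlying model.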
Practical takeaways
- If you’re building automated patching: measure architecture choices explicitly (single vs multi-agent, iteration budgets, verification steps).
- Don’t assume multi-agent is better; evaluate whether coordination costs reduce net progress.
- Prefer benchmarks with real vulnerabilities and realistic constraints over synthetic-only scoring.
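Measuring architecture choices explicitly can be as simple as running each candidate configuration over the same vulnerability set and tallying outcomes. A hypothetical sketch (the runner interface and names are assumptions, not the paper's harness):

```python
# Hypothetical comparison harness: run each architecture on the same
# vulnerability set and report patched counts, so differences in orchestration
# (iteration budget, verification strictness, agent count) are measured
# rather than assumed.
from typing import Callable, Dict, List


def compare_architectures(
    runners: Dict[str, Callable[[str], bool]],  # name -> attempts one vuln, True if patched
    vulns: List[str],
) -> Dict[str, int]:
    return {name: sum(1 for v in vulns if run(v)) for name, run in runners.items()}
```

Keeping the vulnerability set fixed across runners is what makes the comparison apples-to-apples; changing the benchmark and the architecture at the same time makes the result uninterpretable.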
Caveats / what to watch
- Results depend on the exact models, prompts, and verification criteria used; reproduce on your own vulnerability set.
- “Patched” should be interpreted in the paper’s definition—check whether it means tests pass, exploit is mitigated, etc.