TL;DR
- RIVA proposes a two-agent setup for infrastructure verification that stays reliable even when observability tools return wrong or empty outputs.
- The key idea is cross-validation: require multiple independent diagnostic paths before concluding “drift” (or “no drift”).
- On the AIOpsLab benchmark, RIVA improves accuracy versus a baseline ReAct-style agent, especially under simulated tool failures.
What this is about
Infrastructure-as-code (IaC) makes provisioning repeatable, but real systems drift: manual tweaks, upgrades, or mistakes push reality away from the spec. LLM agents can help analyze telemetry and verify config state—but only if they can handle a common production failure mode: tools that look successful but return misleading results (timeouts, empty strings, stale data).
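To make "drift" concrete, here is a toy illustration (the keys and values are made up, not from the paper): the declared IaC spec and the observed live state disagree after a manual hotfix.

```python
# Toy illustration of configuration drift: a manual image bump pushed the
# live state away from the declared spec. All values here are illustrative.
declared = {"replicas": 3, "image": "nginx:1.25", "cpu_limit": "500m"}
observed = {"replicas": 3, "image": "nginx:1.27", "cpu_limit": "500m"}  # hotfix bumped the image

# Keys where the declared spec and observed reality disagree
drift = {k: (declared[k], observed[k])
         for k in declared if declared.get(k) != observed.get(k)}
print(drift)  # {'image': ('nginx:1.25', 'nginx:1.27')}
```

The hard part in production is not this comparison; it is that the "observed" side comes from tools that can silently time out, return empty strings, or serve stale data.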
Key points
- Two specialized agents: a Verifier Agent decides what should be checked; a Tool Generation Agent runs diverse tool calls and logs results in a shared history.
- Cross-validation threshold: the verifier concludes only after K distinct diagnostic paths have checked a property (K=2 in the evaluation), reducing over-reliance on any one flaky tool.
- Tool call history matters: contradictions across tool outputs are surfaced instead of silently accepted.
- Results: on AIOpsLab, RIVA improves task accuracy compared to a baseline agent, with a larger lift when tools are intentionally made unreliable.
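The verifier-side logic described above can be sketched as follows. This is a minimal illustration of the K-path cross-validation idea, not RIVA's actual implementation; the class names, field names, and tool names are all hypothetical.

```python
# Hypothetical sketch of K-path cross-validation over a shared tool-call
# history (names are illustrative, not the paper's API).
from dataclasses import dataclass

K = 2  # cross-validation threshold used in the paper's evaluation

@dataclass
class ToolResult:
    tool: str       # which diagnostic path produced this observation
    property: str   # the config property being checked, e.g. "replica_count"
    observed: str   # value reported by the tool (may be wrong or empty)

def verdict(history: list[ToolResult], prop: str) -> str:
    """Withhold a conclusion until K distinct tools have reported on the
    property, and surface contradictions instead of silently picking one."""
    reports = [r for r in history if r.property == prop and r.observed]
    distinct_tools = {r.tool for r in reports}
    if len(distinct_tools) < K:
        return "inconclusive"    # e.g. one tool timed out or returned ""
    values = {r.observed for r in reports}
    if len(values) > 1:
        return "contradiction"   # flaky or stale tool detected; escalate
    return "confirmed"

history = [
    ToolResult("kubectl", "replica_count", "3"),
    ToolResult("metrics_api", "replica_count", "3"),
]
print(verdict(history, "replica_count"))  # confirmed
```

Note how an empty observation never counts toward the threshold: a tool that "succeeds" with an empty string cannot push the verifier to a verdict on its own.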
Why it matters
If you deploy agents that call external systems (APIs, CLIs, dashboards), tool unreliability isn’t an edge case—it’s normal. RIVA’s design is a reminder that “agent safety” isn’t only about model behavior; it’s also about system reliability under partial failure.
Practical takeaways
- Design agent workflows to verify via multiple independent signals (two tools, two endpoints, logs + metrics, etc.).
- Log tool calls and results in a structured way so contradictions are detectable.
- Separate “tool execution” from “final judgment” to reduce brittle, single-pass reasoning failures.
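The second takeaway, structured logging of tool calls, can be sketched with a few lines (field names and tool names here are illustrative, not from the paper):

```python
# Minimal sketch of structured tool-call logging so disagreements between
# diagnostic paths are detectable after the fact. Field names are illustrative.
import time

def log_tool_call(log: list, tool: str, target: str, output: str, ok: bool) -> None:
    log.append({
        "ts": time.time(),
        "tool": tool,
        "target": target,  # which property/resource was queried
        "output": output,
        "ok": ok,          # transport-level success (may still be misleading)
    })

def observed_values(log: list, target: str) -> set:
    """Distinct non-empty outputs recorded for a target; more than one
    entry means the tools disagree and the result should not be trusted."""
    return {e["output"] for e in log if e["target"] == target and e["output"]}

log = []
log_tool_call(log, "terraform_show", "instance_type", "t3.large", True)
log_tool_call(log, "cloud_cli", "instance_type", "t3.medium", True)
print(observed_values(log, "instance_type"))  # two values: tools disagree
```

The point is that contradictions only become detectable if every call, including the "successful but empty" ones, lands in one queryable record rather than being consumed and discarded mid-reasoning.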
Caveats / what to watch
- Evaluation is benchmark-based; generalization to other IaC stacks and real org tooling remains to be validated.
- Even cross-validation can fail if all tools share the same hidden dependency (e.g., the same source of stale truth).