LLM - Digest AI

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Posted on March 5, 2026 by DigestAI

TL;DR RAPO is an agentic RL training framework that retrieves off-policy, step-level traces to expand exploration. The motivation: purely on-policy exploration limits what an agent can ever learn; if it never tries a reasoning path, it can’t improve on it. Reported results show modest but consistent gains across multiple datasets and more efficient training….

RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Posted on March 5, 2026 by DigestAI

TL;DR RIVA proposes a two-agent setup for infrastructure verification that stays reliable even when observability tools return wrong or empty outputs. The key idea is cross-validation: require multiple independent diagnostic paths before concluding “drift” (or “no drift”). On the AIOpsLab benchmark, RIVA improves accuracy versus a baseline ReAct-style agent, especially under simulated tool failures. What…
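The cross-validation idea in the TL;DR can be illustrated with a small sketch. This is not RIVA's implementation: the function name `cross_validate`, the verdict strings, and the `min_agree` threshold are all assumptions made for illustration. Each diagnostic path is modeled as a callable that may return a verdict, return nothing, or fail outright, and a conclusion is accepted only when enough independent paths agree.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical verdict type: "drift", "no_drift", or None when a tool
# returned nothing usable. These labels are illustrative assumptions.
Verdict = Optional[str]


def cross_validate(checks: Sequence[Callable[[], Verdict]], min_agree: int = 2) -> str:
    """Conclude only when at least `min_agree` independent checks agree.

    Checks that raise or return an unusable value are ignored rather than
    trusted, which mimics tolerating wrong or empty tool outputs.
    """
    votes: Counter = Counter()
    for check in checks:
        try:
            verdict = check()
        except Exception:
            verdict = None  # treat a crashing tool like an empty result
        if verdict in ("drift", "no_drift"):
            votes[verdict] += 1

    if not votes:
        return "inconclusive"

    winner, count = votes.most_common(1)[0]
    dissent = sum(votes.values()) - count
    # Require enough agreement and a strict majority over dissenting paths.
    if count >= min_agree and count > dissent:
        return winner
    return "inconclusive"


def broken_tool() -> Verdict:
    raise RuntimeError("simulated tool failure")


if __name__ == "__main__":
    paths = [lambda: "drift", broken_tool, lambda: None, lambda: "drift"]
    print(cross_validate(paths))
```

With a single usable path, or with paths that disagree evenly, the sketch returns `inconclusive` instead of guessing, which is the behavior the excerpt describes at a high level.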

MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN

Posted on March 4, 2026 by DigestAI

TL;DR MA-CoNav proposes a hierarchical “master + specialized sub-agents” approach for long-horizon vision-language navigation (VLN). The framework splits perception, planning, execution, and memory across agents, with local + global reflection loops to reduce drift. Reported experiments on a real robot indoor dataset show improvements over baseline VLN methods. What this is about: Vision-Language Navigation (VLN)…

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Posted on March 4, 2026 by DigestAI

TL;DR CARE is an agentic framework for multi-modal medical reasoning that emphasizes clinical accountability via explicit, evidence-grounded steps. It decomposes medical VQA into specialist modules (entity proposal → referring segmentation → evidence-grounded VQA) coordinated by an agent-like planner/reviewer. The paper reports sizeable accuracy improvements versus same-size and larger baselines in its evaluated setting. What this…

Claude Sonnet 4.6

Posted on March 4, 2026 by DigestAI

TL;DR Anthropic announced Claude Sonnet 4.6 as the new default on Free/Pro tiers. The release emphasizes better coding behavior (instruction following, fewer hallucinations, less overengineering) and improved “computer use”. It also highlights long-context + tooling improvements aimed at real workflows (large repos, search-augmented work). What this is about: This is Anthropic’s release post for Claude…

Cekura – Testing and Monitoring for Voice and Chat AI Agents

Posted on March 4, 2026 by DigestAI

TL;DR Cekura is a YC-backed tool focused on testing, monitoring, and regression prevention for voice and chat AI agents. The core idea: simulate full conversations (not just single turns) and judge whether the overall interaction succeeds. It targets a real pain point for agent builders: “it worked yesterday” failures after prompt/model/tooling changes. What this is…
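To make "simulate full conversations and judge the overall interaction" concrete, here is a minimal sketch. It does not use or represent Cekura's API: `simulate_conversation`, `conversation_succeeded`, and `toy_agent` are hypothetical names, and the keyword-based success check is a stand-in assumption for whatever judging a real tool performs.

```python
from typing import Callable, List, Sequence, Tuple

# A transcript is a list of (speaker, text) pairs.
Transcript = List[Tuple[str, str]]


def simulate_conversation(
    agent: Callable[[Transcript], str], user_turns: Sequence[str]
) -> Transcript:
    """Play scripted user turns against an agent and record every turn."""
    transcript: Transcript = []
    for user_text in user_turns:
        transcript.append(("user", user_text))
        transcript.append(("agent", agent(transcript)))
    return transcript


def conversation_succeeded(transcript: Transcript, required_phrases: Sequence[str]) -> bool:
    """Judge the whole conversation, not a single reply.

    Assumption: success means every required phrase appears somewhere in
    the agent's side of the transcript.
    """
    agent_text = " ".join(
        text.lower() for speaker, text in transcript if speaker == "agent"
    )
    return all(phrase.lower() in agent_text for phrase in required_phrases)


def toy_agent(transcript: Transcript) -> str:
    """Hypothetical booking agent used only to exercise the harness."""
    last_user_text = transcript[-1][1].lower()
    if "book" in last_user_text:
        return "What date would you like?"
    if any(ch.isdigit() for ch in last_user_text):
        return "Your booking is confirmed."
    return "Sorry, could you repeat that?"


if __name__ == "__main__":
    run = simulate_conversation(toy_agent, ["I want to book a table", "March 9"])
    print(conversation_succeeded(run, ["date", "confirmed"]))
```

Keeping the scripted turns and required phrases fixed lets the same check run again after a prompt, model, or tooling change, which is one simple way to catch "it worked yesterday" regressions.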

A Systematic Study of LLM-Based Architectures for Automated Patching

Posted on March 4, 2026 by DigestAI

TL;DR This study compares four LLM-based automated patching architectures on the same benchmark of 19 real-world Java vulnerabilities (AIxCC). The reported headline result: general-purpose code agents (specifically Claude Code) patched 16/19, outperforming more patch-specific workflows in this setup. The authors argue architecture + iteration depth can matter as much as (or more than) raw model…

Agentic Code Reasoning

Posted on March 4, 2026 by DigestAI

TL;DR This paper proposes “semi-formal reasoning”: a structured way for an agent to state premises, trace execution paths, and produce explicit conclusions for code reasoning tasks. On multiple static code-analysis style tasks, the structured format improves accuracy versus more free-form reasoning. The authors report strong results on patch-equivalence checking (including a reported 93% accuracy on…

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Posted on March 2, 2026 by DigestAI

TL;DR DARE-bench is a new benchmark for data-science workflows that tries to measure not just final answers, but whether a model follows a correct modeling and instruction process across realistic tasks. What this is about: The paper introduces DARE-bench, a dataset/benchmark derived from Kaggle-style problems intended to evaluate “process fidelity” in applied ML work (e.g.,…

Claws Are Now a New Layer on Top of LLM Agents (Karpathy on OpenClaw)

Posted on March 2, 2026 by DigestAI

TL;DR Andrej Karpathy highlighted “Claws” as a distinct layer above LLM agents: systems that make agents messaging-native and operationally useful. He pointed to OpenClaw as an example of that category. What this is about: In a short post, Karpathy frames a “Claw” as an architectural layer that sits on top of an agent runtime and connects…