TL;DR
RF-Agent treats reward-function design as a search problem: an LLM
proposes reward code, runs RL training to score it, and uses Monte Carlo
Tree Search (MCTS) to decide what to try next. The key idea is to reuse
the entire history of attempts and feedback—rather than
sampling one-off reward candidates.
What this is about
Designing good reward functions for robot control is notoriously
finicky: small shaping mistakes can derail learning, and hand-tuning is
slow. Recent work tries to use LLMs to write reward code automatically,
but many approaches are effectively “generate a batch, keep the best,”
which wastes feedback from failed attempts.
RF-Agent reframes reward design as sequential decision-making. Each
node in a search tree represents a candidate reward function plus its
evaluation feedback. MCTS balances exploring new ideas with exploiting
promising branches, while preserving rich context for the LLM to reason
about what worked and why.
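To make the search concrete, here is a minimal sketch of a tree node and the standard UCT selection rule such a system could use. The class and field names are my own assumptions for illustration, not the paper's implementation:

```python
import math
from dataclasses import dataclass, field

# Hypothetical node in the reward-search tree; field names are illustrative.
@dataclass
class RewardNode:
    reward_code: str                      # LLM-proposed reward function source
    score: float = 0.0                    # accumulated normalized training score
    visits: int = 0
    feedback: str = ""                    # textual evaluation feedback for the LLM
    children: list["RewardNode"] = field(default_factory=list)

def uct_select(parent: RewardNode, c: float = 1.4) -> RewardNode:
    """Pick the child balancing exploitation (mean score) and exploration."""
    def uct(child: RewardNode) -> float:
        if child.visits == 0:
            return float("inf")           # always try unvisited candidates first
        exploit = child.score / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```

Each node keeps the full context (code, score, feedback) so the LLM prompt at expansion time can cite exactly what was tried on that branch and how it fared.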
Key points
- Uses MCTS to explore reward-function variants and reuse feedback across iterations.
- Stores per-node context (design rationale, reward code, scores, and textual feedback), enabling more grounded next steps.
- Introduces multiple search "actions": structural/parameter mutations, crossover from strong candidates, path-based reasoning, and deliberately divergent exploration.
- Reports stronger performance than prior LLM-based reward-design baselines across multiple simulated control tasks.
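The action set could be modeled as an explicit enum so a search driver can mix iteration styles rather than always mutating. The names mirror the bullets above, but the prompt skeletons are invented for illustration:

```python
from enum import Enum, auto

class SearchAction(Enum):
    MUTATE = auto()       # structural/parameter tweaks to one candidate
    CROSSOVER = auto()    # combine components from strong candidates
    PATH_REASON = auto()  # reason over the root-to-node path of attempts
    DIVERGE = auto()      # deliberately propose something unlike prior attempts

# Illustrative prompt skeletons keyed by action; real templates would also
# carry the node's scores and textual feedback.
PROMPTS = {
    SearchAction.MUTATE: "Adjust the weights or terms of this reward:\n{code}",
    SearchAction.CROSSOVER: "Merge the best parts of these rewards:\n{code}",
    SearchAction.PATH_REASON: "Given this history of attempts, propose the next reward:\n{code}",
    SearchAction.DIVERGE: "Propose a reward unlike any of these attempts:\n{code}",
}

def build_prompt(action: SearchAction, code: str) -> str:
    return PROMPTS[action].format(code=code)
```

Keeping the actions explicit also makes it easy to log which style produced each node, which is useful when diagnosing why a search run stalled.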
Why it matters
This is a concrete example of a pattern that shows up across
“agentic” workflows: when feedback is expensive, you want a
structured memory of attempts plus a search strategy—not just
repeated sampling. MCTS is one way to make iteration deliberate and
data-efficient.
Practical takeaways
- If you’re building an agent that iterates with feedback, consider keeping a tree/graph of attempts (not a flat log) so the agent can revisit and branch.
- Use explicit action types (mutate/crossover/reason/diverge) to avoid getting stuck in one style of iteration.
- Prefer evaluation signals that are comparable across runs (normalized scores) so search can make consistent decisions.
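On the last point, one simple way to make scores comparable is min-max normalization against the scores already seen for the task. This is my own sketch of the idea, not the paper's formula:

```python
def normalize(score: float, history: list[float]) -> float:
    """Min-max normalize a raw score against previously seen scores for the
    same task, so candidates from different runs are comparable in [0, 1]."""
    if not history:
        return 0.5                     # no baseline yet: return a neutral value
    lo, hi = min(history), max(history)
    if hi == lo:
        return 0.5                     # degenerate range: avoid divide-by-zero
    clipped = min(max(score, lo), hi)  # clamp outliers into the observed range
    return (clipped - lo) / (hi - lo)
```

With scores on a common scale, the tree search can compare a node trained on one task seed against a sibling from another run without the raw reward magnitudes dominating the decision.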
Caveats / what to watch
- Reward search can be compute-heavy because each candidate may require a full RL training/evaluation run.
- Reported gains may depend on simulator/task setup; check transferability if your domain differs.