Draft

TL;DR

HGPO is a training tweak for long-horizon LLM agents: it fixes a subtle bias in stepwise group-based RL by ensuring steps are compared under consistent historical context. It does this by building a hierarchy of groups that share increasing amounts of history, then combining their advantage estimates.

What this is about

Group-based RL…