Preprint · March 2026
cs.LG · Reinforcement Learning

When the World Model Lies: Measuring and Characterising Reward Exploitation in DreamerV3 under Sparse Feedback

Arkat Khassanov

Independent Researcher, Astana, Kazakhstan

TL;DR DreamerV3 hallucinates rewards 50× higher than reality under sparse feedback. KL divergence collapse predicts this failure 50k steps in advance (r = −0.91). We characterise the mechanism and propose three mitigations.

Abstract

Model-based reinforcement learning agents that plan entirely in imagination can achieve high imagined returns while completely failing the actual task — a failure mode we term the exploitation gap.

We provide the first systematic characterisation of this gap in DreamerV3 on AntMaze, where the world model receives near-zero reward from real experience. Instrumenting the training loop with four new metrics, we show that the imagined-to-real reward ratio reaches approximately 50× at 500k environment steps while evaluation return stays below 0.05.

We establish that KL divergence collapse is a leading indicator of exploitation onset with a ~50k step lag (r = −0.91, p < 0.001). Comparing to the hierarchical baseline THICK, context-kernel gating reduces but does not eliminate the gap. A dense-reward ablation confirms that rich reward signal suppresses exploitation entirely. We propose three KL-aware mitigation strategies and release all experimental infrastructure for reproducibility.

Key Findings

01
50× Reward Ratio
Imagined-to-real reward reaches 50× at 500k steps while eval return stays below 0.05.
02
KL Leading Indicator
KL collapse precedes gap onset by ~50k steps (r = −0.91). Enables intervention 25 seconds early.
03
Hierarchy Reduces, Not Eliminates
THICK reduces the gap ~3.7× but cannot eliminate it. The mechanism is architecture-independent.
04
Dense Reward Suppresses
On AntMaze-MediumDense, gap stays below 0.05 — sparse reward is the root cause.
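The leading-indicator claim (finding 02) rests on a lagged correlation between KL and the later gap. A minimal sketch of how such a correlation could be computed, using illustrative toy series rather than the paper's actual training logs:

```python
# Sketch: KL as a leading indicator of the exploitation gap.
# The series and the checkpoint lag below are illustrative only.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lagged_corr(kl, gap, lag):
    """Correlate KL at checkpoint t with the gap at checkpoint t + lag."""
    return pearson(kl[:-lag], gap[lag:])

# Toy series sampled every 10k steps: KL collapses first, the gap grows later.
kl  = [1.00, 0.80, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.16, 0.15]
gap = [0.00, 0.00, 0.02, 0.10, 0.25, 0.45, 0.70, 0.95, 1.20, 1.47]

r = lagged_corr(kl, gap, lag=5)  # 5 checkpoints ~ 50k steps
print(f"lagged correlation: {r:.2f}")  # strongly negative
```

A strongly negative value at the ~50k-step lag is what would motivate treating KL collapse as an early-warning signal.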

Exploitation Gap Trajectory

AntMaze-Medium-Diverse — mean over 5 seeds (std in parentheses)

Steps   𝔼[r̂] Imagined   𝔼[r] Replay    Gap 𝒢          KL
0       0.000           0.000          0.000          1.00
100k    0.10 (0.02)     0.002 (0.001)  0.098 (0.02)   0.65 (0.08)
200k    0.40 (0.06)     0.008 (0.002)  0.392 (0.06)   0.35 (0.06)
300k    0.80 (0.09)     0.012 (0.003)  0.788 (0.09)   0.22 (0.04)
400k    1.20 (0.11)     0.018 (0.004)  1.182 (0.11)   0.18 (0.03)
500k    1.50 (0.13)     0.030 (0.005)  1.470 (0.12)   0.15 (0.02)
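The gap and ratio columns follow directly from the imagined and replay means. A quick sketch recomputing both from the table values (means only; the helper itself is illustrative, not the paper's instrumentation):

```python
# Recompute the exploitation gap G = E[r_hat] - E[r] and the
# imagined-to-real reward ratio from the table means above.
rows = {  # steps: (imagined E[r_hat], replay E[r])
    100_000: (0.10, 0.002),
    200_000: (0.40, 0.008),
    300_000: (0.80, 0.012),
    400_000: (1.20, 0.018),
    500_000: (1.50, 0.030),
}

for steps, (imagined, replay) in rows.items():
    gap = imagined - replay    # gap column of the table
    ratio = imagined / replay  # imagined-to-real reward ratio
    print(f"{steps:>7}: gap={gap:.3f}  ratio={ratio:.0f}x")
# At 500k steps the ratio reaches the reported 50x.
```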

Mitigation Strategies

S1
KL-Scheduled β Annealing
Targets: overconfident model
When KL drops below τ = 0.5 nats, increase β_dyn to re-tighten the posterior-prior constraint. Triggers proactively — 50k steps before exploitation becomes visible.
β_t = β₀ + α · max(0, τ − KL_t)
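The S1 schedule above can be sketched in a few lines. The values of β₀, α, and the threshold τ = 0.5 nats below are illustrative (τ matches the text; DreamerV3's actual dynamics-loss scale lives in its own config):

```python
# Sketch of S1: KL-scheduled beta annealing, beta_t = beta0 + alpha * max(0, tau - KL_t).
def beta_schedule(kl_t: float, beta0: float = 1.0,
                  alpha: float = 2.0, tau: float = 0.5) -> float:
    """Tighten the posterior-prior constraint once KL falls below tau nats."""
    return beta0 + alpha * max(0.0, tau - kl_t)

print(beta_schedule(0.80))  # healthy KL: base beta, no adjustment
print(beta_schedule(0.15))  # collapsed KL: beta raised above base
```

Because the trigger is the KL value itself, the schedule fires as soon as collapse begins, well before the gap shows up in the reward statistics.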
S2
Ensemble Reward Uncertainty Penalty
Targets: flat reward landscape
Train K=5 independent reward heads. Penalise imagined returns in high-disagreement regions, repelling the policy from OOD latent states. No additional environment steps.
r̃(s) = r(s) − λ · Std_k[r_k(s)]
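A minimal sketch of the S2 penalty, with the K reward heads stubbed as plain callables and λ set to an illustrative value:

```python
# Sketch of S2: r~(s) = mean_k r_k(s) - lam * std_k r_k(s).
# Reward heads are stand-in callables, not DreamerV3 modules.
import statistics

def penalised_reward(reward_heads, state, lam: float = 1.0) -> float:
    """Discount imagined reward where the ensemble disagrees
    (disagreement flags likely out-of-distribution latent states)."""
    preds = [head(state) for head in reward_heads]
    return statistics.mean(preds) - lam * statistics.stdev(preds)

agree    = [lambda s: 1.0 for _ in range(5)]
disagree = [lambda s, v=v: v for v in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(penalised_reward(agree, None))     # 1.0: heads agree, no penalty
print(penalised_reward(disagree, None))  # well below 1.0: disagreement penalised
```

Both ensembles have the same mean prediction; only the disagreeing one is pushed down, which is exactly the repelling effect the strategy describes.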
S3
Gap-Triggered Replay Prioritisation
Targets: corrective gradient scarcity
When 𝒢_t > 0.1, upsample high-return transitions via prioritised experience replay. Amplifies corrective gradient from rare positive events. Reactive complement to S1.
p_i ∝ |r_i − r̂(s_i)| + ε
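A toy sketch of the S3 sampler: uniform replay until the gap crosses the trigger threshold, then weights proportional to the reward-prediction error per the formula above. The threshold 0.1 matches the text; everything else is illustrative:

```python
# Sketch of S3: gap-triggered prioritisation, p_i ~ |r_i - r_hat(s_i)| + eps.
import random

def sample_index(real_r, pred_r, gap, eps=1e-2, threshold=0.1):
    """Sample a replay index, weighting by reward-prediction error
    only once the exploitation gap exceeds the trigger threshold."""
    if gap <= threshold:                      # no exploitation yet:
        return random.randrange(len(real_r))  # plain uniform replay
    weights = [abs(r - rh) + eps for r, rh in zip(real_r, pred_r)]
    return random.choices(range(len(real_r)), weights=weights, k=1)[0]
```

With a large gap, transitions the reward model gets most wrong (e.g. the rare positive-reward events it predicts as zero) dominate the sample, concentrating corrective gradient where it is scarcest.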

Combined, S1+S2+S3 is predicted to reduce the exploitation gap ~14-fold at 500k steps (model-predicted; empirical validation ongoing).

BibTeX

Citation
@article{khassanov2026worldmodellies,
  title   = {When the World Model Lies: Measuring and Characterising
             Reward Exploitation in DreamerV3 under Sparse Feedback},
  author  = {Khassanov, Arkat},
  year    = {2026},
  note    = {Preprint, Zenodo},
  url     = {https://zenodo.org/records/18879122}
}