Preprint · March 2026
cs.LG · Reinforcement Learning

When the World Model Lies: Measuring and Characterising Reward Exploitation in DreamerV3 under Sparse Feedback

Arkat Khassanov

Independent Researcher, Astana, Kazakhstan

TL;DR DreamerV3 hallucinates rewards 50× higher than reality under sparse feedback. KL divergence collapse predicts this failure 50k steps in advance (r = −0.91). We characterise the mechanism and propose three mitigations.

Abstract

Model-based reinforcement learning agents that plan entirely in imagination can achieve high imagined returns while completely failing the actual task — a failure mode we term the exploitation gap.

We provide the first systematic characterisation of this gap in DreamerV3 on AntMaze, where the world model receives near-zero reward from real experience. Instrumenting the training loop with four new metrics, we show that the imagined-to-real reward ratio reaches approximately 50× at 500k environment steps while evaluation return stays below 0.05.

We establish that KL divergence collapse is a leading indicator of exploitation onset with a ~50k step lag (r = −0.91, p < 0.001). Comparing to the hierarchical baseline THICK, context-kernel gating reduces but does not eliminate the gap. A dense-reward ablation confirms that rich reward signal suppresses exploitation entirely. We propose three KL-aware mitigation strategies and release all experimental infrastructure for reproducibility.

Key Findings

01
50× Reward Ratio
Imagined-to-real reward reaches 50× at 500k steps while eval return stays below 0.05.
02
KL Leading Indicator
KL collapse precedes gap onset by ~50k steps (r = −0.91). Enables intervention 25 seconds early.
03
Hierarchy Reduces, Not Eliminates
THICK reduces the gap ~3.7× but cannot eliminate it. The mechanism is architecture-independent.
04
Dense Reward Suppresses
On AntMaze-MediumDense, gap stays below 0.05 — sparse reward is the root cause.
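The leading-indicator claim (finding 02) rests on a lagged correlation between KL and the later gap. A minimal sketch of how such a correlation could be computed, using illustrative toy series rather than the paper's actual training logs:

```python
# Sketch: KL as a leading indicator of the exploitation gap.
# The series and the checkpoint lag below are illustrative only.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lagged_corr(kl, gap, lag):
    """Correlate KL at checkpoint t with the gap at checkpoint t + lag."""
    return pearson(kl[:-lag], gap[lag:])

# Toy series sampled every 10k steps: KL collapses first, the gap grows later.
kl  = [1.00, 0.80, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.16, 0.15]
gap = [0.00, 0.00, 0.02, 0.10, 0.25, 0.45, 0.70, 0.95, 1.20, 1.47]

r = lagged_corr(kl, gap, lag=5)  # 5 checkpoints ~ 50k steps
print(f"lagged correlation: {r:.2f}")  # strongly negative
```

A strongly negative value at the ~50k-step lag is what would motivate treating KL collapse as an early-warning signal.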

Exploitation Gap Trajectory

AntMaze-Medium-Diverse — mean over 5 seeds (std in parentheses)

Steps   𝔼[r̂] Imagined   𝔼[r] Replay    Gap 𝒢          KL
0       0.000           0.000          0.000          1.00
100k    0.10 (0.02)     0.002 (0.001)  0.098 (0.02)   0.65 (0.08)
200k    0.40 (0.06)     0.008 (0.002)  0.392 (0.06)   0.35 (0.06)
300k    0.80 (0.09)     0.012 (0.003)  0.788 (0.09)   0.22 (0.04)
400k    1.20 (0.11)     0.018 (0.004)  1.182 (0.11)   0.18 (0.03)
500k    1.50 (0.13)     0.030 (0.005)  1.470 (0.12)   0.15 (0.02)
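The gap and ratio columns follow directly from the imagined and replay means. A quick sketch recomputing both from the table values (means only; the helper itself is illustrative, not the paper's instrumentation):

```python
# Recompute the exploitation gap G = E[r_hat] - E[r] and the
# imagined-to-real reward ratio from the table means above.
rows = {  # steps: (imagined E[r_hat], replay E[r])
    100_000: (0.10, 0.002),
    200_000: (0.40, 0.008),
    300_000: (0.80, 0.012),
    400_000: (1.20, 0.018),
    500_000: (1.50, 0.030),
}

for steps, (imagined, replay) in rows.items():
    gap = imagined - replay    # gap column of the table
    ratio = imagined / replay  # imagined-to-real reward ratio
    print(f"{steps:>7}: gap={gap:.3f}  ratio={ratio:.0f}x")
# At 500k steps the ratio reaches the reported 50x.
```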

Mitigation Strategies

S1
KL-Scheduled β Annealing
Targets: overconfident model
When KL drops below τ = 0.5 nats, increase β_dyn to re-tighten the posterior-prior constraint. Triggers proactively — 50k steps before exploitation becomes visible.
β_t = β₀ + α · max(0, τ − KL_t)
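The S1 schedule above can be sketched in a few lines. The values of β₀, α, and the threshold τ = 0.5 nats below are illustrative (τ matches the text; DreamerV3's actual dynamics-loss scale lives in its own config):

```python
# Sketch of S1: KL-scheduled beta annealing, beta_t = beta0 + alpha * max(0, tau - KL_t).
def beta_schedule(kl_t: float, beta0: float = 1.0,
                  alpha: float = 2.0, tau: float = 0.5) -> float:
    """Tighten the posterior-prior constraint once KL falls below tau nats."""
    return beta0 + alpha * max(0.0, tau - kl_t)

print(beta_schedule(0.80))  # healthy KL: base beta, no adjustment
print(beta_schedule(0.15))  # collapsed KL: beta raised above base
```

Because the trigger is the KL value itself, the schedule fires as soon as collapse begins, well before the gap shows up in the reward statistics.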
S2
Ensemble Reward Uncertainty Penalty
Targets: flat reward landscape
Train K=5 independent reward heads. Penalise imagined returns in high-disagreement regions, repelling the policy from OOD latent states. No additional environment steps.
r̃(s) = r(s) − λ · Std_k[r_k(s)]
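A minimal sketch of the S2 penalty, with the K reward heads stubbed as plain callables and λ set to an illustrative value:

```python
# Sketch of S2: r~(s) = mean_k r_k(s) - lam * std_k r_k(s).
# Reward heads are stand-in callables, not DreamerV3 modules.
import statistics

def penalised_reward(reward_heads, state, lam: float = 1.0) -> float:
    """Discount imagined reward where the ensemble disagrees
    (disagreement flags likely out-of-distribution latent states)."""
    preds = [head(state) for head in reward_heads]
    return statistics.mean(preds) - lam * statistics.stdev(preds)

agree    = [lambda s: 1.0 for _ in range(5)]
disagree = [lambda s, v=v: v for v in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(penalised_reward(agree, None))     # 1.0: heads agree, no penalty
print(penalised_reward(disagree, None))  # well below 1.0: disagreement penalised
```

Both ensembles have the same mean prediction; only the disagreeing one is pushed down, which is exactly the repelling effect the strategy describes.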
S3
Gap-Triggered Replay Prioritisation
Targets: corrective gradient scarcity
When 𝒢_t > 0.1, upsample high-return transitions via prioritised experience replay. Amplifies corrective gradient from rare positive events. Reactive complement to S1.
p_i ∝ |r_i − r̂(s_i)| + ε
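A toy sketch of the S3 sampler: uniform replay until the gap crosses the trigger threshold, then weights proportional to the reward-prediction error per the formula above. The threshold 0.1 matches the text; everything else is illustrative:

```python
# Sketch of S3: gap-triggered prioritisation, p_i ~ |r_i - r_hat(s_i)| + eps.
import random

def sample_index(real_r, pred_r, gap, eps=1e-2, threshold=0.1):
    """Sample a replay index, weighting by reward-prediction error
    only once the exploitation gap exceeds the trigger threshold."""
    if gap <= threshold:                      # no exploitation yet:
        return random.randrange(len(real_r))  # plain uniform replay
    weights = [abs(r - rh) + eps for r, rh in zip(real_r, pred_r)]
    return random.choices(range(len(real_r)), weights=weights, k=1)[0]
```

With a large gap, transitions the reward model gets most wrong (e.g. the rare positive-reward events it predicts as zero) dominate the sample, concentrating corrective gradient where it is scarcest.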

Combined, S1+S2+S3 is predicted to reduce the exploitation gap ~14-fold at 500k steps (model-predicted; empirical validation ongoing).

BibTeX

Citation
@article{khassanov2026worldmodellies,
  title   = {When the World Model Lies: Measuring and Characterising
             Reward Exploitation in DreamerV3 under Sparse Feedback},
  author  = {Khassanov, Arkat},
  year    = {2026},
  note    = {Preprint, Zenodo},
  url     = {https://zenodo.org/records/18879122}
}