Independent Researcher, Astana, Kazakhstan
Model-based reinforcement learning agents that plan entirely in imagination can achieve high imagined returns while completely failing the actual task — a failure mode we term the exploitation gap.
We provide the first systematic characterisation of this gap in DreamerV3 on AntMaze, where the world model receives near-zero reward from real experience. Instrumenting the training loop with four new metrics, we show that the imagined-to-real reward ratio reaches approximately 50× at 500k environment steps while evaluation return stays below 0.05.
We establish that KL-divergence collapse is a leading indicator of exploitation onset, preceding it by roughly 50k steps (r = −0.91, p < 0.001). A comparison with the hierarchical baseline THICK shows that its context-kernel gating reduces but does not eliminate the gap. A dense-reward ablation confirms that a rich reward signal suppresses exploitation entirely. We propose three KL-aware mitigation strategies and release all experimental infrastructure for reproducibility.
AntMaze-Medium-Diverse — mean over 5 seeds (standard deviation in parentheses); gap 𝒢 = 𝔼[r̂] − 𝔼[r]
| Steps | 𝔼[r̂] Imagined | 𝔼[r] Replay | Gap 𝒢 | KL |
|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 0.000 | 1.00 |
| 100k | 0.10 (0.02) | 0.002 (0.001) | 0.098 (0.02) | 0.65 (0.08) |
| 200k | 0.40 (0.06) | 0.008 (0.002) | 0.392 (0.06) | 0.35 (0.06) |
| 300k | 0.80 (0.09) | 0.012 (0.003) | 0.788 (0.09) | 0.22 (0.04) |
| 400k | 1.20 (0.11) | 0.018 (0.004) | 1.182 (0.11) | 0.18 (0.03) |
| 500k | 1.50 (0.13) | 0.030 (0.005) | 1.470 (0.12) | 0.15 (0.02) |
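The headline numbers follow directly from the table. As a minimal sketch (the array names are illustrative, not taken from the released code), the gap 𝒢, the imagined-to-real ratio, and a lagged Pearson correlation between KL and gap can be computed like this — note the lag here is one 100k-step checkpoint, coarser than the ~50k-step grid used in the paper:

```python
import numpy as np

# Per-checkpoint means from the table above (names are illustrative):
# mean imagined reward, mean replay-buffer reward, posterior/prior KL.
steps      = np.array([0, 100_000, 200_000, 300_000, 400_000, 500_000])
imagined_r = np.array([0.000, 0.10, 0.40, 0.80, 1.20, 1.50])
replay_r   = np.array([0.000, 0.002, 0.008, 0.012, 0.018, 0.030])
kl         = np.array([1.00, 0.65, 0.35, 0.22, 0.18, 0.15])

# Exploitation gap G = E[r_hat] - E[r].
gap = imagined_r - replay_r

# Imagined-to-real ratio, guarding against the zero-reward start.
ratio = np.divide(imagined_r, replay_r,
                  out=np.full_like(imagined_r, np.nan),
                  where=replay_r > 0)

print(round(gap[-1], 3))    # 1.47
print(round(ratio[-1], 1))  # 50.0  (the ~50x ratio at 500k steps)

# KL as a leading indicator: correlate KL at checkpoint t with the
# gap one checkpoint later (t + 100k steps).
lag = 1
r = np.corrcoef(kl[:-lag], gap[lag:])[0, 1]
print(round(r, 2))  # strongly negative on this coarse grid
```

On these six checkpoints the lagged correlation comes out near −0.95; the paper's r = −0.91 is computed on the finer ~50k-step logging grid, so the values differ but agree in sign and strength.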
The combined mitigation S1+S2+S3 reduces the exploitation gap by a factor of ~14 at 500k steps (model-predicted; empirical validation ongoing).
@article{khassanov2026worldmodellies,
title = {When the World Model Lies: Measuring and Characterising
Reward Exploitation in DreamerV3 under Sparse Feedback},
author = {Khassanov, Arkat},
year = {2026},
note = {Preprint, Zenodo},
url = {https://zenodo.org/records/18879122}
}