Model-Based Reinforcement Learning and Exploration Strategies

Model-Based Reinforcement Learning (MBRL) introduces a paradigm in which the agent first learns an internal model of the environment from experience, and then uses this model to derive value functions, optimal policies, or action-value functions through planning. Model learning itself is formulated as a supervised learning problem, and the estimation of model uncertainty provides an additional perspective for reasoning about decision-making. Two sources of approximation error naturally arise in this setting: inaccuracies in the learned transition model $P_\eta$ and in the learned reward model $R_\eta$, together denoted as $M_\eta = \langle P_\eta, R_\eta \rangle$.

Several model representations can be considered, including table lookup, linear expectation models, Gaussian expectation models, Gaussian processes (particularly suitable for representing distributions), and deep belief networks. A simple example is the table lookup model, where the estimated transition probability and reward function are computed from empirical counts: $$ P^a_{ss'} = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(s_t = s, a_t = a, s_{t+1} = s') $$ $$ R^a_s = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(s_t = s, a_t = a)\, r_t $$ Here, $N(s,a)$ is the number of times the state–action pair $(s,a)$ has been observed, and $\mathbf{1}(\cdot)$ is the indicator function. Constructing such a model requires the collection of real experience from the environment, but once learned, it can be used to generate additional simulated experience for planning purposes.
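As an illustration, the counts above can be maintained incrementally. The following Python sketch (class and method names are ours, not from the text) stores observed transitions, recovers $P^a_{ss'}$ and $R^a_s$ from empirical frequencies, and resamples stored outcomes to generate simulated experience:

```python
from collections import defaultdict
import random

class TableLookupModel:
    """Empirical-count model: stores every observed (s, a) -> (s', r) transition."""
    def __init__(self):
        self.transitions = defaultdict(list)   # (s, a) -> list of observed (s', r)

    def update(self, s, a, r, s_next):
        self.transitions[(s, a)].append((s_next, r))

    def transition_prob(self, s, a, s_next):
        # Empirical estimate P^a_{ss'} = count(s, a, s') / N(s, a)
        outcomes = self.transitions[(s, a)]
        if not outcomes:
            return 0.0
        return sum(1 for (nxt, _) in outcomes if nxt == s_next) / len(outcomes)

    def expected_reward(self, s, a):
        # Empirical estimate R^a_s = average reward observed for (s, a)
        outcomes = self.transitions[(s, a)]
        return sum(r for (_, r) in outcomes) / len(outcomes) if outcomes else 0.0

    def sample(self, s, a):
        # Simulated experience: resample a stored outcome (s', r) for (s, a)
        return random.choice(self.transitions[(s, a)])
```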

It is important to recognize that the agent's learned model is always an approximation of the real environment, and an inaccurate model can lead to sub-optimal policies. When model uncertainty becomes excessive, one may revert to model-free methods or explicitly incorporate the uncertainty into the reasoning process. A hybrid, integrated approach can also be adopted, in which two sources of experience coexist: (1) real experience from interaction with the environment, used to update the model, and (2) simulated experience generated by the model, used for planning. In this setting, simulation is used to explore and draw inferences, while real experience validates the acquired knowledge and guides action.

Backward Search: The Dyna Family

Backward search methods, such as Dyna-Q, interleave learning from real experience with simulated updates drawn from a learned model. The Dyna-Q algorithm alternates between two phases: learning from the most recent (single) real transition and replaying updates from past experience stored in the model. This mechanism allows the agent to recover automatically when the environment changes.

The Dyna-Q+ variant introduces a weighting mechanism to encourage exploration of state–action pairs that have not been tried recently. Specifically, the weight assigned to a state $s$ that has not been visited for $\tau$ time steps is: $$ w_s(\tau) = \mathbb{E}_a [ R^a_s ] + k \sqrt{\tau} $$ where the constant $k$ controls the strength of the exploration bonus.

**Dyna-Q**

Input: $S$, $A$, $\gamma$
Initialize $Q$ and $M$

While true:

  1. $s \leftarrow$ current state
  2. $a \leftarrow \epsilon$-greedy$(Q(s,\cdot))$
  3. Take action $a$, observe $r$ and $s'$
  4. Model update: $M(s,a) \leftarrow \langle s', r \rangle$
  5. TD(0) update of $Q(s,a)$
  6. Repeat $n$ times (planning with simulated experience):
        a. Choose a previously visited $s \in S$ from history $H$
        b. Choose $a \in A$ previously taken in $s$, from $H(s)$
        c. $\langle r, s' \rangle \leftarrow M(s,a)$
        d. TD(0) update of $Q(s,a)$

The TD(0) update follows: $$ Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big] $$
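A compact Python sketch of this loop is given below; the environment interface, hyperparameter values, and helper names are illustrative assumptions, not part of the original algorithm statement:

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_planning=10, episodes=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q (sketch): direct TD(0) updates from real experience,
    plus n_planning simulated updates per real step, replayed from the model.
    Assumes a Gym-style env where reset() returns a hashable state and
    step(a) returns (next_state, reward, done, ...)."""
    Q = defaultdict(float)      # Q[(s, a)]
    model = {}                  # model[(s, a)] = (r, s')  -- deterministic table model

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    def td_update(s, a, r, s_next):
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done, *_ = env.step(a)       # real experience
            model[(s, a)] = (r, s_next)             # model update (step 4)
            td_update(s, a, r, s_next)              # direct RL update (step 5)
            for _ in range(n_planning):             # planning loop (step 6)
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                td_update(ps, pa, pr, ps_next)
            s = s_next
    return Q
```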

Dyna2 extends this approach by maintaining two separate $Q$-functions: a persistent component $Q_P$, updated only with real experience, and a transient component $Q_T$, updated only with simulated experience and reset periodically. Action selection is based on $\epsilon$-greedy applied to $Q_P + Q_T$.
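The action-selection rule of Dyna2 can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import random

def dyna2_action(s, actions, Q_P, Q_T, eps=0.1):
    """Dyna2-style selection: epsilon-greedy over the sum of the persistent
    value function (updated from real experience) and the transient one
    (updated from simulated experience and periodically reset)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q_P[(s, a)] + Q_T[(s, a)])
```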

Forward search selects an action by explicitly looking ahead from the current state, constructing a search tree with the current state as the root. This resembles the breadth-first expansion used in dynamic programming, but with key differences: the process is sample-based, it performs planning rather than learning, and it may operate on an imperfect (learned) model.

**Naive Monte Carlo Search**
Input: $M$, $S$, $A$, $s$, $\pi$, $k$
For all $a \in A$:
    Simulate $k$ episodes starting from $s$, taking $a$ as the first action and then following $\pi$
    $Q(s,a) = \frac{1}{k} \sum_{i=1}^k G_i$, where $G_i$ is the return of the $i$-th simulated episode
Return $a = \arg\max_{a \in A} Q(s,a)$

Naive MC search evaluates the root state only under the fixed simulation policy $\pi$: it does not improve that policy over time and can be inefficient when used repeatedly.
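A minimal sketch of this procedure, assuming a generative model with the same `sample(s, a)` interface as the earlier table-lookup sketch and a fixed rollout policy `pi`, might look like this:

```python
def naive_mc_search(model, actions, s, pi, k, gamma=0.95, horizon=50):
    """Evaluate each action by k rollouts under the fixed policy pi, then act greedily."""
    def rollout(s, a):
        G, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r = model.sample(s, a)        # simulated transition (s', r)
            G += discount * r
            discount *= gamma
            a = pi(s)                        # follow the fixed simulation policy
        return G

    Q = {a: sum(rollout(s, a) for _ in range(k)) / k for a in actions}
    return max(Q, key=Q.get)
```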

Monte Carlo Tree Search (MCTS) improves upon this by structuring each search step into four phases: selection, expansion, playout, and backpropagation. The policy evaluation step estimates:

$$ Q(s,a) = \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{t=1}^T \mathbf{1}(s_t^k = s, a_t^k = a)\, G_t^k $$

Policy improvement is performed via $\epsilon$-greedy selection based on $Q$, while the actual action taken is the greedy one. MCTS offers advantages such as best-first search, ease of parallelization, and the ability to focus search where it is most promising. However, exhaustive search may be computationally impractical.
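The following sketch shows the four phases on a generative model exposing the same `sample(s, a) -> (s', r)` interface assumed above; it uses $\epsilon$-greedy selection inside the tree, as in the text, and all names and hyperparameters are illustrative:

```python
import random
from collections import defaultdict

def mcts(model, actions, root, rollout_policy, n_iter=1000, gamma=0.95, eps=0.1, depth=20):
    """Monte Carlo Tree Search sketch: selection, expansion, playout, backpropagation."""
    N = defaultdict(int)        # visit counts N(s, a)
    Q = defaultdict(float)      # running-mean action values Q(s, a)
    tree = {root}               # states expanded into the search tree

    def tree_policy(s):
        # Epsilon-greedy policy improvement inside the tree, as described above.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def playout(s):
        # Playout phase: finite-horizon rollout with the fixed default policy.
        G, discount = 0.0, 1.0
        for _ in range(depth):
            s, r = model.sample(s, rollout_policy(s))
            G += discount * r
            discount *= gamma
        return G

    for _ in range(n_iter):
        s, path = root, []
        # Selection: descend through already-expanded states with the tree policy.
        while s in tree and len(path) < depth:
            a = tree_policy(s)
            s_next, r = model.sample(s, a)
            path.append((s, a, r))
            s = s_next
        # Expansion: add the first state encountered outside the tree.
        tree.add(s)
        # Playout from the new leaf, then backpropagation of returns along the path.
        G = playout(s)
        for (ps, pa, pr) in reversed(path):
            G = pr + gamma * G
            N[(ps, pa)] += 1
            Q[(ps, pa)] += (G - Q[(ps, pa)]) / N[(ps, pa)]   # incremental mean

    # The action actually executed is the greedy one at the root.
    return max(actions, key=lambda a: Q[(root, a)])
```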


Exploration–Exploitation Trade-off

Forward search benefits from effective exploration strategies. In basic settings, exploration is often implemented by injecting noise into a greedy policy. However, two additional paradigms can be employed: optimism in the face of uncertainty, and the use of information states that enrich the state–action representation with historical knowledge. Lookahead methods can leverage this augmented representation both for exploration and for decision-making.

Multi-Armed Bandits and Regret Theory

The multi-armed bandit problem provides a simplified framework for analyzing exploration strategies. In this setting, the agent starts in a deterministic initial state, chooses among multiple actions (arms), and each action leads to a distinct terminal state with a fixed reward distribution. The objective is to learn the expected reward $R_a$ for each action without any state transitions.

A central concept in evaluating exploration quality is regret, defined as the loss in expected return compared with an ideal agent that always selects the optimal action. Lower regret indicates more efficient learning, and minimizing total regret is equivalent to maximizing the expected cumulative reward. However, since the optimal action values are unknown in practice, regret can only be estimated empirically. It can be shown that both the purely greedy algorithm and $\epsilon$-greedy with a constant $\epsilon$ incur regret that grows linearly over time.
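In the standard formulation, writing $Q(a) = \mathbb{E}[R_a]$ and $V^* = \max_a Q(a)$, the total regret after $t$ steps is $$ L_t = \mathbb{E}\left[ \sum_{\tau=1}^{t} \big( V^* - Q(a_\tau) \big) \right], $$ so linear regret means $L_t$ grows proportionally to $t$, while sublinear regret means the average per-step loss vanishes asymptotically.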

Sublinear regret can be achieved using more sophisticated strategies. Hoeffding's inequality provides a probabilistic bound on the deviation of the sample mean $\bar{x}_t = \frac{1}{t}\sum_{i=1}^{t} X_i$ of i.i.d. random variables $X_1, \dots, X_t$ bounded in $[0,1]$ from their expected value $\mathbb{E}[X]$:

Hoeffding’s inequality
Let $X_1, \dots, X_t$ be i.i.d. in $[0,1]$. Then, for any $u > 0$: $$P(\mathbb{E}[X] - \bar{x}_t \ge u) = P(\mathbb{E}[X] \ge \bar{x}_t + u) \le e^{-2 t u^2}$$
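As a quick sanity check (an illustrative simulation, not part of the original notes), one can compare the empirical tail probability with the bound $e^{-2tu^2}$ for Bernoulli samples:

```python
import math, random

def hoeffding_check(t=100, u=0.1, trials=100_000, p=0.5):
    """Empirically estimate P(E[X] - x_bar_t >= u) for Bernoulli(p) samples
    and compare it with the Hoeffding bound exp(-2 * t * u^2)."""
    exceed = 0
    for _ in range(trials):
        x_bar = sum(random.random() < p for _ in range(t)) / t
        exceed += (p - x_bar) >= u
    print(f"empirical tail: {exceed / trials:.4f}   Hoeffding bound: {math.exp(-2 * t * u**2):.4f}")

hoeffding_check()   # the empirical probability should stay below the bound
```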

Optimism and the Upper Confidence Bound Algorithm

The optimism-in-the-face-of-uncertainty principle suggests choosing actions that maximize the sum of their estimated expected reward and an exploration bonus reflecting uncertainty. The upper confidence bound (UCB) $U_t(a)$ for an action $a$ shrinks as its visit count grows, $U_t(a) \propto \frac{1}{\sqrt{N_t(a)}}$, where $N_t(a)$ is the number of times action $a$ has been taken; less frequently chosen actions therefore receive larger uncertainty bonuses. The selected action maximizes: $$ a_t = \arg\max_a \big[ Q_t(a) + U_t(a) \big] $$ From Hoeffding's inequality, one can bound the probability that the true action value $Q^*(a)$ exceeds the estimated value $Q_t(a)$ by more than $U_t(a)$: $$ P\big( Q^*(a) \ge Q_t(a) + U_t(a) \big) \le e^{-2 N_t(a) U_t(a)^2} $$

If the desired uncertainty level is $p$, this leads to: $$ e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\ln p}{2 N_t(a)}} $$

To ensure that the uncertainty decays over time, $p$ can be set as a decreasing function of $t$, for example $p = t^{-4}$; substituting $-\ln p = 4 \ln t$ into the expression above yields:

$$ U_t(a) = \sqrt{\frac{2 \ln t}{N_t(a)}} $$

The UCB algorithm achieves sublinear (in fact logarithmic) regret and asymptotically concentrates its choices on the optimal action.
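A minimal UCB1 sketch for a stochastic bandit (the bandit environment and parameter names are illustrative assumptions) is:

```python
import math, random

def ucb1(reward_fns, steps=10_000):
    """UCB1: pick the arm maximizing Q_t(a) + sqrt(2 ln t / N_t(a)).
    reward_fns: list of zero-argument callables, each returning a reward in [0, 1]."""
    n_arms = len(reward_fns)
    N = [0] * n_arms            # N_t(a): pull counts
    Q = [0.0] * n_arms          # Q_t(a): sample-mean reward estimates

    for t in range(1, steps + 1):
        if t <= n_arms:
            a = t - 1                                   # pull each arm once first
        else:
            a = max(range(n_arms),
                    key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / N[i]))
        r = reward_fns[a]()
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                       # incremental mean update
    return Q, N

# Example: three Bernoulli arms with success probabilities 0.2, 0.5, 0.7.
Q, N = ucb1([lambda p=p: float(random.random() < p) for p in (0.2, 0.5, 0.7)])
print(Q, N)    # most pulls should concentrate on the 0.7 arm
```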