Proximal Policy Optimization and Advanced Actor–Critic Variants

The concept of a trust region aims to limit excessive variation in the policy between updates, thus improving stability and avoiding performance collapse. Enforcing the KL divergence as a strict constraint achieves this goal, but it is computationally demanding, with the notable exception of the Natural Policy Gradient method. As an alternative, Proximal Policy Optimization (PPO) proposes simpler proxies for enforcing trust regions, resulting in two main variants: PPO-Penalty and PPO-Clipping. PPO is not only theoretically significant: it is also OpenAI’s go-to RL algorithm, used to incorporate human feedback into ChatGPT’s training.

PPO-Penalty

The PPO-Penalty variant introduces a KL divergence penalty term directly into the loss function: $$ L(\theta) = \mathbb{E}_t \left[ \frac{\pi_{\theta_k}(a_t | s_t)}{\pi_{\theta_{k-1}}(a_t | s_t)} A_t \;-\; \beta_n \, \mathrm{KL}\left(\pi_{\theta_k}(\cdot|s_t), \pi_{\theta_{k-1}}(\cdot|s_t)\right) \right] $$

Here, $\beta_n$ is a penalty coefficient dynamically adjusted according to the measured KL divergence $d$ against a user-defined threshold: $$ \beta_{n+1} = \begin{cases} \frac{\beta_n}{2}, & \text{if } d < \frac{2}{3} \, d_{\text{target}} \\ 2 \beta_n, & \text{if } d > \frac{3}{2} \, d_{\text{target}} \\ \beta_n, & \text{otherwise} \end{cases} $$

This adaptive update seeks to maintain the KL divergence close to a target value.
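As a concrete illustration, here is a minimal Python sketch of the adaptive rule; the function and argument names are ours, not part of any particular library.

```python
def update_beta(beta: float, d: float, d_target: float) -> float:
    """Adaptive KL-penalty coefficient for PPO-Penalty.

    Shrinks beta when the measured KL divergence d is well below the target,
    grows it when d overshoots, and leaves it unchanged otherwise.
    """
    if d < (2.0 / 3.0) * d_target:
        return beta / 2.0      # policy barely moved: relax the penalty
    if d > (3.0 / 2.0) * d_target:
        return 2.0 * beta      # policy moved too much: tighten the penalty
    return beta                # within the target band: keep the coefficient
```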

PPO-Clipping

In PPO-Clipping, the KL divergence is not explicitly computed. Instead, the objective function is clipped to avoid large policy updates. The loss function is: $$ L(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \; g(\epsilon, A_t) \right) \right] $$ where $$ r_t(\theta) = \frac{\pi_{\theta_k}(a_t | s_t)}{\pi_{\theta_{k-1}}(a_t | s_t)} $$ and the clipping function $g(\epsilon, A)$ is defined as: $$ g(\epsilon, A) = \begin{cases} (1 + \epsilon) A, & A > 0 \\ (1 - \epsilon) A, & A \le 0 \end{cases} $$ This approach avoids computing the KL divergence directly, while still constraining policy updates.
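To make the clipped objective concrete, a minimal PyTorch sketch follows; the tensor names (`logp_new`, `logp_old`, `advantages`) are assumptions, and the equivalent form $\mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\,A_t$ is used in place of $g(\epsilon, A_t)$ inside the $\min$, which yields the same value.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # min( r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t ), averaged over the batch
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()                                    # negate for gradient descent
```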

Deep Deterministic Policy Gradient (DDPG)

Up to this point, the discussion on policy gradient methods has assumed stochastic policies, where the policy $\pi_\theta(a|s)$ defines a probability distribution over actions. However, in many control problems — particularly those with continuous action spaces — it is advantageous to adopt a deterministic policy.

A deterministic policy is a function: $$ \pi: S \times \Theta \to A $$

which directly maps a state $s \in S$ to a single action $a \in A$ without sampling from a distribution. The associated action-value function $Q$ is defined as: $$ Q: S \times A \times \Omega \to \mathbb{R} $$ where $\Omega$ denotes the critic’s parameter space.

Deterministic Policy Gradient Theorem

The deterministic setting has its own variant of the policy gradient theorem.

Theorem (Deterministic Policy Gradient)
Let

$$L(\theta) = \mathbb{E}_{\pi_\theta} \big[ R(s, \pi_\theta(s)) \,\big|\, s \sim d(S) \big]$$ with $\pi, R, P \in C^1$ (i.e., continuously differentiable). Then the gradient of $L$ with respect to the policy parameters is given by: $$\nabla_\theta L \;\propto\; \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q(s,a) \,\big|\, a = \pi_\theta(s), \; s \sim d(S) \right]$$

In the stochastic case, the policy gradient theorem involves $\nabla_\theta \ln \pi_\theta(a|s)$ multiplied by $Q_\pi(s,a)$. Here, because the policy is deterministic, the derivative passes through $\pi_\theta(s)$ itself. The term $\nabla_\theta \pi_\theta(s)$ tells us how a small change in parameters changes the selected action, and $\nabla_a Q(s,a)$ tells us how sensitive the expected return is to changes in that action. The expectation is taken over the state distribution induced by the current policy, $d(S)$.
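In a differentiable-programming framework this chain rule is applied automatically: backpropagating through $Q(s, \pi_\theta(s))$ yields exactly $\nabla_\theta \pi_\theta(s)\,\nabla_a Q(s,a)$. A minimal PyTorch sketch, in which the network architectures and dimensions are placeholders rather than prescribed choices:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 2

# Placeholder networks: a deterministic actor and an action-value critic.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))

states = torch.randn(16, state_dim)                  # batch of states s ~ d(S)
actions = actor(states)                              # a = pi_theta(s)
q_values = critic(torch.cat([states, actions], dim=-1))

# Ascending the deterministic policy gradient = descending -E[Q(s, pi_theta(s))].
actor_loss = -q_values.mean()
actor_loss.backward()                                # actor parameters' .grad now holds
                                                     # -E[ grad_theta pi(s) * grad_a Q(s, a) ]
# In practice only an optimizer over the actor's parameters would take a step here;
# the critic gradients accumulated by this backward pass would be discarded.
```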

Deterministic Compatible Function Approximation

In practice, we do not have access to the true $Q$ function; it must be approximated. In the actor–critic setting, $Q$ is represented by a critic network parameterized by $\omega$.

Theorem (Deterministic Compatible Function Approximation)

Let $Q: S \times A \times \Omega \to \mathbb{R}$ with $Q \in C^1$. The Deterministic Policy Gradient Theorem still holds if:

  1. Compatibility condition:$$ \nabla_a Q_\omega(s,a) \,\big|_{a = \pi_\theta(s)} = \nabla_\theta \pi_\theta(s)^\top \omega $$This condition ensures that the critic’s gradient with respect to the action matches the actor’s gradient scaled by $\omega$.

  2. Unbiasedness condition:$$\mathbb{E}_\pi \left[ \left( \nabla_a Q_\omega(s,a) - \nabla_a Q(s,a) \right)^2 \,\big|_{a = \pi_\theta(s)} \right] \xrightarrow{t \to \infty} 0$$This means that the critic’s gradient must converge to the true gradient of the value function.

From these conditions, one derives the parametric form: $$ Q_\omega(s,a) = a^\top \, \nabla_\theta \pi_\theta(s)^\top \omega $$

Update Rule

Given the actor–critic structure, both sets of parameters $\theta$ (actor) and $\omega$ (critic) are updated iteratively.

General case:

  1. Temporal Difference (TD) error:$$ \delta_t = r_t + \gamma Q_{\omega_t}(s_{t+1}, \pi_\theta(s_{t+1})) - Q_{\omega_t}(s_t, a_t) $$This measures the one-step inconsistency between the critic’s prediction and the target.

  2. Actor update:$$ \theta_{t+1} = \theta_t + \lambda_\theta \, \nabla_\theta \pi_\theta(s) \, \nabla_\theta \pi_\theta(s)^\top \omega_t $$The actor is updated in the direction suggested by the critic’s gradient.

  3. Critic update:$$ \omega_{t+1} = \omega_t + \lambda_\omega \, \delta_t \, \nabla_\theta \pi_\theta(s) \, a $$The critic learns by reducing the TD error, adjusting its weights to better approximate $Q$.

Natural policy modification:

In the natural gradient version, the actor update simplifies to: $$ \theta_{t+1} = \theta_t + \lambda_\theta \, \omega_t $$ This directly uses the critic’s parameters as the search direction, under the assumption that the Fisher information matrix is the identity or has been absorbed into $\omega_t$.
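To ground the three update rules, here is a minimal NumPy sketch of a single step in the general case. It assumes a scalar action and a linear-in-features policy $\pi_\theta(s) = \theta^\top \phi(s)$, so that $\nabla_\theta \pi_\theta(s) = \phi(s)$ and the compatible critic is $Q_\omega(s,a) = a\,\phi(s)^\top \omega$; the feature map, learning rates, and transition are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s: np.ndarray) -> np.ndarray:
    """Hypothetical state features; with pi_theta(s) = theta @ phi(s),
    grad_theta pi_theta(s) is exactly phi(s)."""
    return np.concatenate([s, s ** 2])

k = 6                                        # feature dimension (3-dim state -> 6 features)
theta = rng.normal(size=k)                   # actor parameters
omega = rng.normal(size=k)                   # compatible critic parameters
lam_theta, lam_omega, gamma = 1e-3, 1e-2, 0.99

def pi(s, theta):   return float(theta @ phi(s))          # deterministic policy
def Q(s, a, omega): return float(a * (phi(s) @ omega))    # Q_omega(s, a) = a * phi(s)^T omega

# One made-up transition (s_t, a_t, r_t, s_{t+1}), with exploration noise on the action.
s_t, s_t1 = rng.normal(size=3), rng.normal(size=3)
a_t = pi(s_t, theta) + 0.1 * rng.normal()
r_t = rng.normal()

# 1. TD error
delta = r_t + gamma * Q(s_t1, pi(s_t1, theta), omega) - Q(s_t, a_t, omega)

# 2. Actor update: theta += lam_theta * grad_theta pi(s) * (grad_theta pi(s)^T omega)
theta = theta + lam_theta * phi(s_t) * (phi(s_t) @ omega)

# 3. Critic update: omega += lam_omega * delta * grad_theta pi(s) * a
omega = omega + lam_omega * delta * phi(s_t) * a_t

# Natural-gradient variant of step 2: theta = theta + lam_theta * omega
```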

These theorems form the mathematical backbone of DDPG. The Deterministic Policy Gradient Theorem provides the rule for updating the actor without requiring stochastic exploration in the policy space. The Compatible Function Approximation theorem ensures that the critic’s gradient is consistent with the actor’s parameterization, allowing stable joint learning. Finally, the update rules formalize how the actor and critic interact in practice, with the TD error linking them through experience replay and bootstrapping. This is crucial in continuous control problems, where explicit policy integration over actions is impractical.

Twin Delayed Deep Deterministic Policy Gradient (TD3)

TD3 addresses limitations of DDPG, which is sensitive to hyperparameter tuning and prone to overestimating $Q$. It introduces three key improvements:

  1. Target Policy Smoothing
    TD3 adds clipped Gaussian noise to the target policy’s action to prevent exploitation of $Q$-function errors, especially sharp peaks:$$ a = \mathrm{clip} \big( \pi(s) + \mathrm{clip}(\epsilon, -c, c), \; a_\ell, \; a_h \big) $$with $a_\ell < a_h \in \mathbb{R}$, $c \in \mathbb{R}^+$, $\epsilon \sim \mathcal{N}(0,1)$.

  2. Clipped Double-Q Learning
    TD3 maintains two critics $Q_{\omega_1}$ and $Q_{\omega_2}$, using the smaller estimate to compute the Bellman target (a sketch of this target follows the list):$$ L(\omega_i) = \mathbb{E} \left[ \big( r + \gamma \min_{j=1,2} Q(s', a', \omega_j^-) - Q(s, a, \omega_i) \big)^2 \right] $$Policy improvement:$$ \pi^* = \arg\max_\pi \mathbb{E}_\pi \big[ Q(\cdot, \cdot, \omega_1) \big] $$

  3. Delayed Policy Updates
    The policy $\pi$ and target network $\omega^-$ are updated less frequently than the critics, typically in a 1:2 ratio, allowing the critics to stabilize before suggesting policy updates.
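A minimal PyTorch sketch of the Bellman target combining points 1 and 2, as referenced above; the function signature, the callables standing in for the target networks, and the default noise scale, clip bound, and action range are illustrative assumptions.

```python
import torch

def td3_target(reward, next_state, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Clipped double-Q target with target-policy smoothing."""
    with torch.no_grad():
        a_next = actor_target(next_state)
        noise = torch.clamp(sigma * torch.randn_like(a_next), -c, c)   # clip(eps, -c, c)
        a_next = torch.clamp(a_next + noise, a_low, a_high)            # clip(pi(s') + noise, a_low, a_high)
        q_min = torch.min(critic1_target(next_state, a_next),
                          critic2_target(next_state, a_next))          # min over the two critics
        return reward + gamma * q_min

# Both critics regress onto this single target, and (point 3) the actor and the
# target networks are only updated after every couple of critic updates.
```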

Soft Actor–Critic (SAC)

The Soft Actor–Critic (SAC) algorithm is a variant of the actor–critic framework designed to balance exploitation (maximizing reward) and exploration (visiting less-known states). It achieves this by introducing entropy regularization into the RL objective.

Entropy regularization encourages the policy to maintain randomness in action selection, preventing premature convergence to suboptimal deterministic policies.

Definition (Entropy)
For a probability density function $P$, the entropy is:

$$H(P) := - \mathbb{E}_{x \sim P} [\ln P(x)]$$ This measures the average uncertainty of the distribution: higher entropy corresponds to more randomness in the policy.
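As a quick illustration (not part of the definition above), a univariate Gaussian has entropy $$ H\big(\mathcal{N}(\mu, \sigma^2)\big) = \tfrac{1}{2} \ln\big(2 \pi e \, \sigma^2\big), $$ so a policy that keeps a wider action distribution (larger $\sigma$) receives a larger entropy bonus.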

Objective Reformulation

SAC modifies the reinforcement learning objective to include an entropy term:

$$ \pi^* = \arg\max_\pi \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t,a_t) + \nu \, H(\pi(\cdot | s_t)) \right) \right] $$

Here:

  • $\nu \in \mathbb{R}^+$ is a temperature parameter controlling the trade-off between maximizing return and maximizing entropy.
  • A higher $\nu$ increases exploration; a lower $\nu$ focuses more on exploitation.

Entropy-Regularized Bellman Equation

The addition of entropy changes the Bellman equation for the $Q$-function: $$ Q_\pi(s,a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi} \left[ R(s,a) + \gamma \left( Q_\pi(s',a') - \nu \ln \pi(a'|s') \right) \right] $$

The standard Bellman backup is augmented with the entropy term $-\nu \ln \pi(a'|s')$, which rewards policies that keep high entropy (i.e., randomness) in their action distribution.

Practical Implementation Details

In practice, SAC adopts several techniques for stability and performance:

  1. Experience Replay – The expectation in the Bellman equation is approximated using a replay buffer, breaking correlations between consecutive samples.
  2. Clipped Double-Q Trick – As in TD3, two $Q$-functions are trained and the smaller value is used to mitigate overestimation bias.
  3. Entropy Coefficient Annealing – The value of $\nu$ can be gradually decreased during training, starting with high exploration and moving towards exploitation.
  4. Critic Learning via MSBE – The critic networks are trained by minimizing the Mean Squared Bellman Error (a sketch of the corresponding target follows this list).
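Below is the sketch referenced in point 4: the soft Bellman target used inside the MSBE, combining the replay-buffer sample (1), the clipped double-Q trick (2), and the entropy term from the entropy-regularized Bellman equation. The `policy` callable returning a sampled action together with its log-probability, and the target-critic callables, are assumptions for illustration.

```python
import torch

def sac_critic_target(reward, next_state, policy, critic1_target, critic2_target,
                      gamma=0.99, nu=0.2):
    """Soft Bellman target: r + gamma * ( min_j Q_j(s', a') - nu * ln pi(a'|s') )."""
    with torch.no_grad():
        a_next, logp_next = policy(next_state)            # a' ~ pi(.|s') and its log-probability
        q_min = torch.min(critic1_target(next_state, a_next),
                          critic2_target(next_state, a_next))
        return reward + gamma * (q_min - nu * logp_next)

# Each critic i is then trained on the Mean Squared Bellman Error against this target:
#     loss_i = ((critic_i(state, action) - target) ** 2).mean()
```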

Policy Update and the Reparameterization Trick

The policy update in SAC aims to maximize:

$$ \pi^* = \arg\max_\pi \mathbb{E}_{a \sim \pi} \left[ \min_{j=1,2} Q_{\omega_j}(s,a) - \nu \ln \pi_\theta(a|s) \right] $$

The presence of $a \sim \pi_\theta$ makes the gradient computation challenging because the sampling process depends on $\theta$ itself.

The idea is to rewrite the sampling process as a deterministic transformation of an independent noise variable $\xi$: $$ a_\theta(s, \xi) := \tanh\left( \mu_\theta(s) + \sigma_\theta(s) \odot \xi \right) $$ where:

  • $\xi \sim \mathcal{N}(0,1)$ is independent noise.
  • $\mu_\theta(s)$ and $\sigma_\theta(s)$ parameterize the Gaussian policy $\mathcal{N}(\mu_\theta, \sigma_\theta)$ before the $\tanh$ squashing.

This transformation allows gradients to flow through the sampling process, enabling efficient backpropagation.

Using the reparameterization trick, the policy loss becomes: $$ \mathbb{E}_{\xi \sim \mathcal{N}} \left[ \min_{j=1,2} Q_{\omega_j}(s, a_\theta(s,\xi)) - \nu \ln \pi_\theta(a_\theta(s,\xi) | s) \right] $$

Here:

  • The first term encourages the actor to choose actions with high estimated value.
  • The second term encourages maintaining high entropy.
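Putting the pieces together, here is a minimal PyTorch sketch of the reparameterized policy loss; the network names are placeholders, and the log-density correction for the $\tanh$ squashing is a standard implementation detail not spelled out in the derivation above.

```python
import torch

def sac_policy_loss(state, mu_net, log_std_net, critic1, critic2, nu=0.2):
    """Reparameterized policy loss: a = tanh(mu + sigma * xi), maximize
    min_j Q_j(s, a) - nu * ln pi(a|s) by minimizing its negation."""
    mu, log_std = mu_net(state), log_std_net(state)
    std = log_std.exp()
    xi = torch.randn_like(mu)                           # independent noise, xi ~ N(0, I)
    pre_tanh = mu + std * xi
    action = torch.tanh(pre_tanh)                       # a_theta(s, xi)

    # ln pi(a|s): Gaussian log-density plus the change-of-variables term for tanh.
    normal = torch.distributions.Normal(mu, std)
    logp = normal.log_prob(pre_tanh) - torch.log(1.0 - action.pow(2) + 1e-6)
    logp = logp.sum(dim=-1, keepdim=True)

    q_min = torch.min(critic1(state, action), critic2(state, action))
    return -(q_min - nu * logp).mean()                  # negate: maximize by minimizing
```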

SAC can be seen as an extension of the deterministic and stochastic actor–critic approaches with a principled way to control exploration. By combining entropy regularization with the stability tricks from TD3 (clipped double-Q), it achieves both high sample efficiency and robustness in continuous control tasks. The temperature parameter $\nu$ plays a central role in tuning the exploration–exploitation trade-off dynamically during training.