The standard Deep Q-Network (DQN) and its variants focus on estimating the expected future discounted return, the Q-value $Q(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a\right]$. This expectation summarizes the potential outcomes of taking action $a$ in state $s$ into a single scalar value. However, this compression discards potentially valuable information about the variability and shape of the return distribution.
Consider an agent choosing between two paths. Path A reliably yields a moderate reward. Path B offers a chance at a very high reward but also carries a significant risk of a large penalty. Both paths might have the same expected return, making standard DQN indifferent between them. Yet, the underlying risk profiles are drastically different. Distributional Reinforcement Learning addresses this by directly modeling the probability distribution of the random return Z(s,a), rather than just its expectation E[Z(s,a)].
The core idea extends the Bellman equation to distributions. Let Z(s,a) be the random variable representing the return obtained by starting in state s, taking action a, and following the current policy thereafter. The standard Bellman optimality equation relates the expected values:
$$Q^*(s,a) = \mathbb{E}\left[R(s,a) + \gamma \max_{a'} Q^*(s',a')\right]$$
The distributional version relates the distributions themselves:
$$Z(s,a) \stackrel{D}{=} R(s,a) + \gamma Z(s', a'^*)$$
Here, $\stackrel{D}{=}$ signifies equality in distribution. The random return $Z(s,a)$ has the same distribution as the sum of the immediate (potentially stochastic) reward $R(s,a)$ and the discounted random return $Z(s', a'^*)$ associated with taking the optimal action $a'^*$ in the next state $s'$. The optimal next action $a'^*$ is typically chosen by maximizing the expected value of the next state's return distribution: $a'^* = \arg\max_{a'} \mathbb{E}[Z(s', a')]$. This equation provides a recursive definition of the return distribution, forming the basis for learning algorithms.
Representing and learning a potentially continuous probability distribution is challenging. Practical algorithms use approximations:
Proposed by Bellemare et al. (2017), the C51 algorithm approximates the return distribution $Z(s,a)$ using a discrete distribution supported on a fixed set of $N$ "atoms". These atoms $z_1, z_2, \ldots, z_N$ are typically chosen to be equally spaced points within a plausible range of returns $[V_{\text{MIN}}, V_{\text{MAX}}]$, i.e., $z_i = V_{\text{MIN}} + (i-1)\,\Delta z$ with $\Delta z = (V_{\text{MAX}} - V_{\text{MIN}})/(N-1)$; the name C51 refers to the common choice of $N = 51$.
The deep neural network, instead of outputting a single Q-value per action, outputs a probability distribution over these $N$ atoms for each action. For a given state $s$, the network outputs $N \times |\mathcal{A}|$ values, usually passed through a softmax function for each action $a$ to produce probabilities $p_i(s,a)$:
$$p_i(s,a) \approx P(Z(s,a) = z_i) \quad \text{such that} \quad \sum_{i=1}^{N} p_i(s,a) = 1$$
The expected Q-value can easily be recovered if needed: $Q(s,a) = \sum_{i=1}^{N} z_i \, p_i(s,a)$.
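As a small illustration, the NumPy sketch below recovers expected Q-values and a greedy action from per-action atom probabilities; the array shapes, the 4-action example, and the chosen $[V_{\text{MIN}}, V_{\text{MAX}}]$ values are arbitrary assumptions, not taken from any particular implementation:

```python
import numpy as np

# Fixed support of N atoms, e.g. 51 equally spaced values in [V_MIN, V_MAX].
V_MIN, V_MAX, N = -10.0, 10.0, 51
z = np.linspace(V_MIN, V_MAX, N)                  # atom locations z_1, ..., z_N

# Hypothetical network output for one state: softmax probabilities per action.
logits = np.random.randn(4, N)                    # 4 actions, N atoms
atom_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Q(s, a) = sum_i z_i * p_i(s, a); the greedy action maximizes this expectation.
q_values = atom_probs @ z                         # shape (4,)
greedy_action = int(np.argmax(q_values))
```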
Learning Update: The learning process applies a distributional Bellman update. For a transition $(s, a, r, s')$, the target network's distribution for the greedy next action $a'^*$ is shifted and scaled, giving target atoms $r + \gamma z_j$ with probabilities $p_j(s', a'^*)$. Because these shifted atoms generally no longer lie on the fixed support $\{z_i\}$, the target distribution is projected back onto that support by splitting each atom's probability mass between its two nearest neighbors. The network is then trained by minimizing the cross-entropy (equivalently, the KL divergence) between this projected target distribution and the predicted distribution $p_i(s,a)$.
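The projection step can be sketched in NumPy as follows; the function name, batching convention, and the assumption that `next_probs` already corresponds to the greedy next actions are illustrative choices, not a reference implementation:

```python
import numpy as np

def c51_target_distribution(rewards, dones, next_probs, z, gamma=0.99):
    """Project the distributional Bellman target onto the fixed support z.

    rewards:    (batch,)   immediate rewards r
    dones:      (batch,)   1.0 if s' is terminal, else 0.0
    next_probs: (batch, N) target-network probabilities p_j(s', a'*)
    z:          (N,)       fixed atom locations, equally spaced in [V_MIN, V_MAX]
    Returns m:  (batch, N) projected target probabilities.
    """
    batch, N = next_probs.shape
    v_min, v_max = z[0], z[-1]
    delta_z = (v_max - v_min) / (N - 1)

    # Shift and scale the atoms: Tz_j = r + gamma * z_j, clipped to the support.
    Tz = rewards[:, None] + gamma * (1.0 - dones)[:, None] * z[None, :]
    Tz = np.clip(Tz, v_min, v_max)

    # Fractional index of each shifted atom on the fixed support.
    b = (Tz - v_min) / delta_z
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    # Distribute each shifted atom's probability onto its two neighbouring atoms.
    m = np.zeros((batch, N))
    for i in range(batch):
        for j in range(N):
            if lower[i, j] == upper[i, j]:
                # Shifted atom lands exactly on a support atom.
                m[i, lower[i, j]] += next_probs[i, j]
            else:
                m[i, lower[i, j]] += next_probs[i, j] * (upper[i, j] - b[i, j])
                m[i, upper[i, j]] += next_probs[i, j] * (b[i, j] - lower[i, j])
    return m
```

The training loss is then the cross-entropy between this projected target and the predicted probabilities for the action actually taken.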
Figure: Example probability distributions over return atoms for two different actions. Although they might have the same mean (expected Q-value), their shapes reveal different risk characteristics. Action A has higher potential returns but also higher potential losses compared to the more concentrated distribution of Action B.
Proposed by Dabney et al. (2017), QR-DQN takes a different approach by modeling the quantile function (the inverse CDF) of the return distribution. Instead of fixing the return values (atoms) and learning probabilities, QR-DQN fixes cumulative probabilities τi and learns the corresponding return values (quantiles) θi(s,a).
The network outputs $N$ quantile values $\theta_1(s,a), \ldots, \theta_N(s,a)$ for each action $a$. These correspond to a fixed set of $N$ target quantiles, often chosen uniformly, e.g., $\hat{\tau}_i = \frac{i - 0.5}{N}$ for $i = 1, \ldots, N$. Here $\theta_i(s,a)$ represents the predicted return value $z$ such that $P(Z(s,a) \le z) \approx \hat{\tau}_i$.
Learning Update: QR-DQN uses a quantile regression loss. The target quantiles for a transition $(s, a, r, s')$ are $r + \gamma\, \theta_j(s', a'^*)$, where $\theta_j(s', a'^*)$ are the quantile values predicted by the target network for the optimal next action $a'^* = \arg\max_{a'} \frac{1}{N} \sum_{k=1}^{N} \theta_k(s', a')$. The loss function minimizes the discrepancy between the predicted quantiles $\theta_i(s,a)$ and the target quantiles, using a formulation (such as the quantile Huber loss) that correctly handles the asymmetric nature of quantile estimation.
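A minimal NumPy sketch of the quantile Huber loss for a single transition is shown below; the function name and the threshold $\kappa = 1.0$ are illustrative defaults, and the predicted and target quantile arrays are assumed to have been computed elsewhere:

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Asymmetric quantile Huber loss between predicted and target quantiles.

    pred_quantiles:   (N,) predicted quantiles theta_i(s, a)
    target_quantiles: (N,) targets r + gamma * theta_j(s', a'*)
    """
    N = pred_quantiles.shape[0]
    tau_hat = (np.arange(N) + 0.5) / N                 # fixed midpoints tau_hat_i

    # Pairwise TD errors u_ij = target_j - pred_i, shape (N, N).
    u = target_quantiles[None, :] - pred_quantiles[:, None]

    # Huber loss L_kappa(u): quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))

    # Quantile weighting |tau_hat_i - 1{u_ij < 0}| penalizes over- and
    # under-estimation asymmetrically, which makes this a quantile loss.
    weight = np.abs(tau_hat[:, None] - (u < 0.0).astype(float))

    # Sum over predicted quantiles i, average over target samples j.
    return np.sum(np.mean(weight * huber / kappa, axis=1))
```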
Further advancements like Implicit Quantile Networks (IQN) learn a function that can generate quantile values for any input probability $\tau \in [0, 1]$, offering a more continuous representation of the distribution.
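As a rough illustration, the PyTorch sketch below shows an IQN-style head that embeds sampled values of $\tau$ with the commonly used cosine basis and combines them with state features; the class name, layer sizes, and the surrounding feature network are assumptions for illustration, not the reference architecture:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitQuantileHead(nn.Module):
    """Hypothetical IQN-style head: maps state features and sampled taus to quantile values."""

    def __init__(self, feature_dim, num_actions, embedding_dim=64):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.tau_fc = nn.Linear(embedding_dim, feature_dim)   # embeds tau into feature space
        self.out_fc = nn.Linear(feature_dim, num_actions)     # per-action quantile values

    def forward(self, features, taus):
        # features: (batch, feature_dim); taus: (batch, num_taus) sampled from U[0, 1].
        i = torch.arange(self.embedding_dim, dtype=torch.float32, device=taus.device)
        # Cosine basis: cos(pi * i * tau) for i = 0, ..., embedding_dim - 1.
        cos = torch.cos(math.pi * i[None, None, :] * taus[:, :, None])
        tau_embedding = F.relu(self.tau_fc(cos))              # (batch, num_taus, feature_dim)
        # Combine state and tau information via an element-wise product.
        combined = features[:, None, :] * tau_embedding
        return self.out_fc(combined)                          # (batch, num_taus, num_actions)
```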
Learning the full distribution of returns offers several benefits: the richer learning signal tends to stabilize training and improve empirical performance, the preserved information about variance and tail outcomes (as in the two-path example above) enables risk-sensitive action selection, and the expected Q-values needed for standard control can still be recovered from the learned distribution.
Implementing distributional RL requires modifying the network's output head to predict distributional parameters (probabilities for atoms or quantile values) and adapting the loss function (KL divergence or quantile regression loss) and the Bellman update mechanism accordingly. While this adds complexity, the empirical gains and the ability to handle risk make it a significant development in deep reinforcement learning.
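For instance, a C51-style output head can be sketched in PyTorch as follows; the class name, `feature_dim`, and the default of 51 atoms are illustrative assumptions, and the body network producing the feature vector is assumed to exist elsewhere:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalHead(nn.Module):
    """Hypothetical C51-style output head: one distribution over N atoms per action."""

    def __init__(self, feature_dim, num_actions, num_atoms=51):
        super().__init__()
        self.num_actions = num_actions
        self.num_atoms = num_atoms
        self.fc = nn.Linear(feature_dim, num_actions * num_atoms)

    def forward(self, features):
        # features: (batch, feature_dim) produced by the body network (assumed given).
        logits = self.fc(features).view(-1, self.num_actions, self.num_atoms)
        # Softmax over the atom dimension (not the action dimension) yields p_i(s, a).
        return F.softmax(logits, dim=-1)
```

Action selection then weights the fixed atom locations by these probabilities to recover expected Q-values, exactly as in the C51 formulation above.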