The core challenge in applying standard GANs to discrete data like text lies in the sampling process. When the generator selects a specific word (a discrete token) from its output distribution, this selection is non-differentiable. Consequently, the gradient signal from the discriminator cannot flow back through the sampling step to update the generator's parameters using standard backpropagation. This breaks the typical GAN training mechanism.

Reinforcement Learning (RL) provides a powerful framework to circumvent this issue. Instead of relying on direct gradient flow through the generated output, we can reframe the generator's task as an RL problem:

- Agent: The generator ($G$) acts as the agent.
- State: The state ($s_t$) represents the sequence generated so far, $(y_1, ..., y_{t-1})$.
- Action: The action ($a_t$) is the selection of the next token ($y_t$) based on the current state.
- Policy: The generator's parameterized function $G_\theta(y_t | s_t)$ defines the agent's policy $\pi_\theta(a_t | s_t)$, the probability distribution over possible next tokens.
- Reward: The discriminator ($D$) provides a reward signal indicating the quality or realism of the sequences produced by the generator's policy.

The goal of the generator (agent) is to learn a policy $\pi_\theta$ that maximizes the expected reward obtained from the discriminator. Crucially, RL algorithms like policy gradients allow updating the agent's policy parameters $\theta$ based on received rewards, even when the actions (token sampling) are discrete.

## Sequence Generative Adversarial Network (SeqGAN)

SeqGAN was one of the first successful applications of this RL perspective to GAN-based text generation. It directly employs the policy gradient theorem from RL to train the generator.

The discriminator ($D$) is trained conventionally, learning to distinguish between real text sequences from the training data and sequences generated by $G$. Its output, $D(Y)$, represents the probability that a complete sequence $Y = (y_1, ..., y_T)$ is real.

For the generator ($G$), this discriminator output $D(Y)$ serves as the reward signal. However, a reward is only available after generating a complete sequence. This poses a problem for training the generator, as it needs feedback during the sequence generation process to decide which intermediate actions (token choices) were good.

SeqGAN addresses this by using Monte Carlo (MC) search with rollouts. When the generator has produced a partial sequence $Y_{1:t} = (y_1, ..., y_t)$, it needs to estimate the expected future reward for taking the next action $y_{t+1}$. To do this, the current generator policy is used to "roll out" or complete the sequence multiple times starting from $Y_{1:t+1}$. Let's say $N$ rollouts are performed, resulting in $N$ complete sequences $\{Y^1_{1:T}, ..., Y^N_{1:T}\}$. The discriminator evaluates each of these complete sequences, and the average discriminator score provides an estimate of the action-value $Q(s_t, a_t = y_{t+1})$:

$$Q_{D}^{G_\theta}(s = Y_{1:t}, a = y_{t+1}) \approx \frac{1}{N} \sum_{n=1}^{N} D(Y^n_{1:T})$$

where $Y^n_{1:T}$ is the $n$-th completed sequence starting with $Y_{1:t+1}$, generated using the current policy $G_\theta$. This $Q$-value represents the expected reward if we choose token $y_{t+1}$ at step $t$ and follow the current policy $G_\theta$ thereafter.
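The sketch below shows this rollout estimate in plain Python. The `sample_next` and `disc_prob` callables are stand-ins for whatever generator policy and discriminator you actually use (they are not part of any particular library), and the toy usage at the bottom only illustrates the mechanics.

```python
import random

def estimate_q(sample_next, disc_prob, prefix, next_token, seq_len, n_rollouts=16):
    """MC estimate of Q(s = prefix, a = next_token): average D over completed rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix) + [next_token]      # Y_{1:t+1}
        while len(seq) < seq_len:              # complete the sequence with the current policy
            seq.append(sample_next(seq))
        total += disc_prob(seq)                # D(Y^n_{1:T}) for this rollout
    return total / n_rollouts

# Toy usage with stand-in components: a uniform policy over a 5-token vocabulary
# and a "discriminator" that simply rewards sequences containing many 0 tokens.
vocab = list(range(5))
q = estimate_q(
    sample_next=lambda seq: random.choice(vocab),
    disc_prob=lambda seq: seq.count(0) / len(seq),
    prefix=[2, 4, 1],
    next_token=0,
    seq_len=10,
    n_rollouts=32,
)
print(f"Estimated Q for taking action 0 after [2, 4, 1]: {q:.3f}")
```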
With this estimated action-value, the generator's parameters $\theta$ can be updated using the policy gradient:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{Y_{1:T} \sim G_\theta} \left[ \sum_{t=1}^{T} Q_{D}^{G_\theta}(Y_{1:t-1}, y_t) \, \nabla_\theta \log G_\theta(y_t | Y_{1:t-1}) \right]$$

This update increases the probability of taking actions that lead to higher expected rewards (i.e., sequences the discriminator finds more realistic). The key insight is that the gradient is taken of the log-probability of the chosen action, $\nabla_\theta \log G_\theta(y_t | Y_{1:t-1})$, and weighted by the reward signal $Q$; this bypasses the non-differentiable sampling step.

```dot
digraph SeqGAN {
    rankdir=LR;
    node [shape=box, style=rounded, fontname="Helvetica", fontsize=10];
    edge [fontname="Helvetica", fontsize=9];

    G [label="Generator (Policy θ)", peripheries=2, color="#4263eb", fontcolor="#4263eb"];
    S [label="Generate Partial\nSequence (s_t)", shape=ellipse, color="#1c7ed6"];
    Sample [label="Sample Next\nToken (y_t)", shape=diamond, color="#1c7ed6"];
    Rollout [label="Monte Carlo Rollout\n(Complete Sequence)", shape=cds, color="#1098ad"];
    D [label="Discriminator\n(Evaluator)", peripheries=2, color="#f03e3e", fontcolor="#f03e3e"];
    Reward [label="Estimate Action-Value\n(Q-value / Reward)", shape=ellipse, color="#f76707"];
    Update [label="Policy Gradient\nUpdate θ", shape=ellipse, color="#37b24d"];

    G -> S [label="Current State"];
    S -> Sample [label="Policy π_θ(y_t | s_t)"];
    Sample -> Rollout [label="Action y_t"];
    Rollout -> D [label="Completed Sequences"];
    D -> Reward [label="Scores"];
    Reward -> Update [label="∇θ log π_θ * Q"];
    Update -> G [label="Update Generator"];
    Sample -> S [label="Append y_t -> s_{t+1}", style=dashed, constraint=false];  // Loop back for next token
}
```

*SeqGAN Training Loop: The generator acts as a policy, sampling actions (tokens). Monte Carlo rollouts estimate the future reward (Q-value) for an action from a given state (partial sequence), using the discriminator's evaluation. This estimated reward guides the generator's update via the policy gradient method.*

While effective, SeqGAN can suffer from high variance in the gradient estimates due to the Monte Carlo rollouts, potentially leading to unstable training. The quality of the learned generator also depends heavily on the quality and stability of the discriminator's reward signal.
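Before moving on, here is a minimal sketch (in PyTorch, with illustrative variable names) of how this policy-gradient update is typically turned into a surrogate loss: the rollout-estimated Q-values are treated as constants and multiplied by the log-probabilities of the sampled tokens, so gradients flow through the log-probabilities rather than the sampling step.

```python
import torch

def seqgan_generator_loss(log_probs, q_values):
    """REINFORCE-style surrogate loss for the generator.

    log_probs : [T] log G_theta(y_t | Y_{1:t-1}) for the sampled tokens (carries gradients w.r.t. theta).
    q_values  : [T] Monte Carlo Q estimates, treated as constants.
    Minimizing this loss ascends the policy-gradient objective above.
    """
    return -(q_values.detach() * log_probs).sum()

# Toy usage: per-step probabilities of a 4-token sampled sequence and
# Q values as they might come from rollouts like the one sketched earlier.
probs = torch.tensor([0.30, 0.52, 0.18, 0.41], requires_grad=True)
q_vals = torch.tensor([0.61, 0.58, 0.70, 0.66])

loss = seqgan_generator_loss(torch.log(probs), q_vals)
loss.backward()    # gradient reaches the policy parameters via the log-probabilities
print(probs.grad)  # non-zero: the non-differentiable sampling step was bypassed
```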
## Rank Generative Adversarial Network (RankGAN)

RankGAN offers an alternative perspective, aiming to provide a potentially more stable learning signal by focusing on relative comparisons rather than absolute classification scores. It reframes the adversarial game: instead of a discriminator trying to assign an absolute probability of realness, it employs a ranker or comparator.

The core idea is that it might be easier and more informative to determine whether one sequence is better than another (e.g., more realistic, or higher quality according to some metric) than to assign a precise numerical score to each sequence independently.

In RankGAN, the "discriminator" is replaced by a model trained to rank sequences. This ranker could be trained in several ways:

- Learned Ranker: Train a model (often using architectures like Siamese networks or pairwise comparison networks) on pairs of sequences (e.g., one real, one generated). The objective is for the ranker to assign a higher score to the real sequence than to the generated one (see the sketch at the end of this section).
- Metric-Based Ranker: Use an existing evaluation metric for sequences (such as the BLEU score in machine translation, though this is less common for general text generation) as the ranking function. This is less typical of the original RankGAN concept but represents a related idea.

The generator ($G$) is then trained adversarially against this ranker. Its objective is to produce sequences to which the ranker assigns a high rank, ideally ranking them higher than (or at least comparable to) real sequences from the dataset. The generator's loss function typically encourages it to "win" the ranking comparison against real samples.

By focusing on relative ordering, RankGAN can potentially avoid some of the instability associated with the absolute reward signal in SeqGAN. The learning signal may be smoother, especially early in training, when the generator produces poor samples and a conventional discriminator might saturate or provide noisy gradients.

## Discussion

Both SeqGAN and RankGAN represent significant steps in adapting GANs for discrete sequence generation tasks like text. They leverage principles from RL to overcome the non-differentiability problem:

- SeqGAN uses policy gradients with MC rollouts to estimate action-values based on the discriminator's absolute score.
- RankGAN uses a relative ranking mechanism, potentially providing a more stable signal by comparing generated samples against real ones.

These approaches come with their own considerations. The MC rollouts in SeqGAN introduce computational overhead and potential variance. RankGAN's effectiveness depends on designing and training a good ranker. Both methods often require careful tuning and can be more complex to implement than standard GANs for continuous data. Nevertheless, they paved the way for more advanced techniques in adversarial text generation and demonstrated the versatility of combining GANs with RL concepts. Implementing these models typically involves integrating components from deep learning frameworks (for the generator and discriminator/ranker) and potentially RL libraries or custom policy gradient implementations.
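As one concrete illustration of the ranker component, the sketch below trains a small pairwise ranker with a margin ranking loss, matching the "learned ranker" idea described earlier. The `SimpleRanker` network, its sizes, and the random toy batch are all hypothetical stand-ins (the original RankGAN formulation uses a more elaborate reference-based ranking score), so treat this as a minimal sketch rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

# Hypothetical ranker: embeds token ids, mean-pools, and outputs a scalar score.
# All names and sizes here are illustrative, not taken from the RankGAN paper.
class SimpleRanker(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, token_ids):                   # token_ids: [batch, seq_len]
        pooled = self.embed(token_ids).mean(dim=1)  # [batch, embed_dim]
        return self.score(pooled).squeeze(-1)       # [batch] scalar scores

ranker = SimpleRanker()
margin_loss = nn.MarginRankingLoss(margin=1.0)

# Toy batch: pretend these are real and generated token-id sequences.
real_seqs = torch.randint(0, 1000, (8, 20))
fake_seqs = torch.randint(0, 1000, (8, 20))

# Ranker update: push real scores above fake scores by at least the margin.
target = torch.ones(8)  # +1 means "the first argument should rank higher"
ranker_loss = margin_loss(ranker(real_seqs), ranker(fake_seqs), target)
ranker_loss.backward()

# Generator side (conceptually): the ranker's score on generated sequences would
# then play the same reward role that D(Y) plays in SeqGAN's policy-gradient update.
```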