Stabilized GAN training

The original GAN was known to be unstable during training:

  • when the fake and real sample distributions do not overlap in the early phase of training, since overlapping support is an intrinsic requirement of the Kullback-Leibler divergence (a short numerical sketch follows this list):
\[KL(P \lVert Q) = \sum_{x=1}^{N}P(x) \log{\frac{P(x)}{Q(x)}}\]
  • when one of the adversarial networks overpowers the other, neither of the two learns informative features from the training samples;

  • the intrinsic design of the objective function contains unwanted properties that hinder convergence.
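
To make the first point concrete, here is a small numerical sketch (not from the original post; it assumes NumPy and SciPy are available) showing that the KL divergence is only well behaved while the two distributions share support:

```python
# KL(P || Q) is only well behaved when Q covers the support of P.
# With disjoint supports the divergence becomes infinite, so it carries
# no usable signal about *how far apart* the two distributions really are.
import numpy as np
from scipy.stats import entropy   # entropy(p, q) == sum(p * log(p / q))

p          = np.array([0.25, 0.25, 0.25, 0.25, 0.0,  0.0,  0.0,  0.0 ])  # "real" mass
q_overlap  = np.array([0.10, 0.20, 0.30, 0.20, 0.10, 0.10, 0.0,  0.0 ])  # overlaps P
q_disjoint = np.array([0.0,  0.0,  0.0,  0.0,  0.25, 0.25, 0.25, 0.25])  # no overlap

print(entropy(p, q_overlap))    # finite value
print(entropy(p, q_disjoint))   # inf: P(x) > 0 where Q(x) = 0
```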

Vanishing gradient

From the previous equation, suppose the generated samples follow distributions with various means (e.g. $\mu = 0$, $\mu = 5$, $\mu = 30$), while the real samples follow a normal distribution $\mathcal{N}(0, 1)$. For the generated distribution with $\mu = 30$, the KL divergence saturates, which causes the discriminator to assign a score of 0 to generated samples $G(z)$, $z \in Q$. Thus the gradient of the generator with respect to the GAN loss function vanishes:

\[-\nabla_{\theta_{g}}\log(1-D(G(z))) \rightarrow 0\]

Mode collapse

During training, the generator keeps searching for the one type of output that is most plausible to the discriminator, and the discriminator keeps rejecting everything else. As a result, the generator produces monotonous outputs even when fed varied inputs.

1. Wasserstein GAN

The Wasserstein GAN (WGAN) is best known as the first approach to effectively balance GAN training, combining weight clipping with the Wasserstein metric. The latter describes the minimum cost of transporting one distribution $q$ onto another $p$, mathematically defined as the greatest lower bound (infimum) over transport plans. It can be formulated as: \(W(p,q) = \inf_{\gamma \in \Pi(\mathbb{P}_{r}, \mathbb{P}_{g})} \mathbb{E}_{(x,y) \sim \gamma} [\lVert x-y \rVert]\), where $\Pi(\mathbb{P}_{r}, \mathbb{P}_{g})$ contains all the possible transport plans $\gamma$.
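
For intuition, a small sketch (hypothetical, using `scipy.stats.wasserstein_distance`, which solves the one-dimensional case exactly) shows that, unlike the KL example earlier, the transport cost stays finite and grows smoothly with the gap between two sample sets even when they no longer overlap:

```python
# The 1-D Wasserstein distance between two empirical sample sets stays finite
# and grows smoothly with the gap between them -- no saturation, no infinities.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)   # samples from N(0, 1)

for mu in [0.0, 5.0, 30.0]:
    fake = rng.normal(loc=mu, scale=1.0, size=10_000)
    # roughly equals mu: the cost of shifting all of the mass by mu
    print(mu, wasserstein_distance(real, fake))
```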

The above formula is intractable, but it can be simplified to the following form using the Kantorovich-Rubinstein duality:

\[W(\mathbb{P}_{r}, \mathbb{P}_{g}) = \sup_{\lVert f\rVert_{L} \leq 1} \; \mathbb{E}_{x\sim \mathbb{P}_{r}}[f(x)] - \mathbb{E}_{x\sim \mathbb{P}_{g}}[f(x)]\]

where $f(\cdot)$ is a 1-Lipschitz function, meaning $\lvert f(x_{1}) - f(x_{2})\rvert \leq \lvert x_{1} - x_{2}\rvert$. Such a function $f(\cdot)$ is parameterized by the critic (a discriminator without the last non-linear layer); in other words, a well-trained critic is the function $f(\cdot)$. To satisfy the 1-Lipschitz constraint, weight clipping explicitly clips the critic's weights to a constant range $[-c, c]$.

Different from the gradient of the original discriminator,

\[\nabla_{\theta_{d}}\frac{1}{N}\sum_{i=1}^{N}[\log(D(x^{(i)})) + \log(1-D(G(z^{(i)})))]\]

it becomes:

\[\nabla_{\omega}\frac{1}{N}\sum_{i=1}^{N}[f_{\omega}(x^{(i)}) - f_{\omega}(G(z^{(i)}))]\]
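
A rough PyTorch-style sketch of one critic update under this objective follows; the names (`critic`, `generator`, `real`, `critic_opt`, `latent_dim`) are placeholders rather than code from the post, and $c = 0.01$ is the paper's default clipping range:

```python
import torch

def critic_step(critic, generator, real, critic_opt, latent_dim, c=0.01):
    """One WGAN critic update: raise E[f(x)] - E[f(G(z))], then clip weights."""
    z = torch.randn(real.size(0), latent_dim, device=real.device)
    fake = generator(z).detach()                  # no gradient into G on this step
    # the critic outputs a raw score (no sigmoid); negate to ascend the estimate
    loss = -(critic(real).mean() - critic(fake).mean())
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    # enforce the Lipschitz constraint by clipping every weight into [-c, c]
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
    return -loss.item()                           # current Wasserstein estimate
```

In the paper this step is repeated several times (with RMSProp) before each generator update.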

Drawbacks

Despite its early success in allowing the generator to be updated at the discriminator's optimum and in keeping the training dynamics stable, the choice of the weight-clipping hyper-parameter is still tricky: especially without batch normalization, the gradient can still vanish or explode as the magnitude of $c$ is decreased or increased. Secondly, the explicit constraint on the weights also restricts the capacity of the model to learn complicated data or functions.

2. WGAN-GP

A differentiable function is 1-Lipschitz continuous if and only if its gradient norm is at most 1 everywhere.

Building on the WGAN work, WGAN-GP does not explicitly enforce 1-Lipschitz continuity via weight clipping; instead, it penalizes the critic's gradient norm on points interpolated between the real and generated data distributions, pushing the norm towards 1. As is proven in Appendix A of the paper, the optimal critic satisfies \(\mathbb{P}_{(x,y)\sim \pi}\left[\nabla f^{*}(x_{t}) = \frac{y-x_{t}}{\lVert y-x_{t}\rVert}\right] = 1\), where $\pi(x=y)=0$ and $x_{t} = t x_{g} + (1-t) x_{r}$, with $0\leq t\leq 1$. The WGAN loss improves to:

\[W_{GP}(\mathbb{P}_{r}, \mathbb{P}_{g}) = \mathbb{E}_{x\sim \mathbb{P}_{r}}[f(x)] - \mathbb{E}_{x\sim \mathbb{P}_{g}}[f(x)] + \lambda\, \mathbb{E}_{x_{t}\sim \mathbb{P}_{x_{t}}}[(\lVert \nabla_{x_{t}} f(x_{t})\rVert_{2} - 1)^{2}]\]
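
A minimal PyTorch-style sketch of just the penalty term is given below (placeholder names again; `critic` plays the role of $f$, and $\lambda = 10$ follows the paper's default):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Penalize (||grad_{x_t} critic(x_t)||_2 - 1)^2 on random interpolates x_t."""
    fake = fake.detach()                                       # critic step: no grad into G
    t = torch.rand(real.size(0), *([1] * (real.dim() - 1)),
                   device=real.device)                         # one t per sample
    x_t = (t * real + (1.0 - t) * fake).requires_grad_(True)   # interpolated points
    scores = critic(x_t)
    grads, = torch.autograd.grad(
        outputs=scores.sum(),     # sum() yields per-sample gradients w.r.t. x_t
        inputs=x_t,
        create_graph=True,        # keep the graph so the penalty itself is trainable
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```

The penalty is added to the critic loss and, as in the paper, optimized with Adam and without any weight clipping.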

Differently from WGAN, WGAN-GP does not use batch normalization in the critic, to avoid creating correlations among the samples within a mini-batch.

Pros

The WGAN-GP variant offers better convergence and more consistent performance after convergence compared to DCGAN, even though it comes with higher computational complexity.

The overlooked roles of the discriminator

The discriminator is commonly regarded as merely one player in the competing game, while it actually acts as a measure of the divergence between the real and generated distributions.

Hints