Introduction to GAN
I want to introduce some of the GAN models I have studied since I started working on digital signal processing. I will skip the technical details in this introduction. My goal is to provide minimal background information.
Revolution in deep learning
As we saw in the post on VAE, a generative model can be useful in machine learning. It can not only classify data but also generate new data that we do not have. For example, we might be able to build a generator of Chopin-style piano music by learning from Chopin’s existing piano pieces. How can we train it? The generated data should be new but similar to the existing data. VAE was limited because its generated data (e.g. digit images) were not sharp enough.
I recommend that readers become familiar with generative models before going further. The two sections, Independent Component Analysis and Covariant Learning, and Variational Autoencoder, would be helpful.
How can we improve the quality of the generated data? Ian Goodfellow’s brilliant paper on generative adversarial networks (GAN) was published in 2014, only 3 years ago. [1][2] Goodfellow compared the GAN to a competition between a currency counterfeiter and the police. It is expressed as the following equation.
$$ \min_G \max_D V(D, G) = \int dx~ p_{data}(x) \log D(x)+ \int dz~ p_z(z) \log(1-D(G(z))), $$
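where $x$ is the data and $z$ is the latent variable. $D(G(z))$ is the discriminator's estimate of how likely the generated sample $G(z)$ is to be real, so it measures how well $G(z)$ imitates the real data. $D$ is trained to maximize $V$, and $G$ is trained to minimize it.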
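To make the value function concrete, here is a minimal sketch that estimates $V(D, G)$ by replacing the two integrals with sample means. The data distribution, the toy generator `G`, and the two discriminators `D_blind` and `D_sharp` are all hypothetical choices made for illustration, not anything from the original paper.

```python
import numpy as np

def value_function(D, G, x_real, z):
    """Monte Carlo estimate of V(D, G): sample means replace the two integrals."""
    eps = 1e-12  # guards against log(0)
    real_term = np.mean(np.log(D(x_real) + eps))
    fake_term = np.mean(np.log(1.0 - D(G(z)) + eps))
    return real_term + fake_term

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, scale=0.5, size=10_000)  # hypothetical p_data
z = rng.normal(size=10_000)                           # latent samples from p_z

G = lambda z: 0.5 * z - 2.0                           # toy generator, far from the data
D_blind = lambda x: np.full_like(x, 0.5)              # cannot tell real from fake
D_sharp = lambda x: 1.0 / (1.0 + np.exp(-4.0 * x))    # sigmoid with threshold at x = 0

v_blind = value_function(D_blind, G, x_real, z)
v_sharp = value_function(D_sharp, G, x_real, z)
print(f"V with blind D: {v_blind:.4f}")
print(f"V with sharp D: {v_sharp:.4f}")
```

A discriminator that always answers 0.5 gives $V = -2\log 2$, while one that separates the two distributions pushes $V$ up toward 0. This is the direction in which $D$ is trained, and $G$ has to move its samples toward the data to pull $V$ back down.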
Instability and DCGAN
Goodfellow proved theoretically in his original paper that the GAN converges to a unique optimum, but in practice the saddle point problem is very unstable. If you have tried any GAN program on your own data, you probably know that it is not easy to stabilize the network. How to make the GAN stable is still an ongoing problem. DCGAN, which uses a convolutional neural network (CNN), was very successful in stabilizing the GAN.
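The instability of a saddle point problem can be seen even without neural networks. The sketch below is only a toy under my own assumptions: it replaces the GAN objective with the simplest saddle, $V(x, y) = xy$, where $x$ plays the minimizing role of $G$ and $y$ the maximizing role of $D$, and updates both at the same time with gradient steps.

```python
import numpy as np

def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent on x and ascent on y for V(x, y) = x * y."""
    norms = [np.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        norms.append(np.hypot(x, y))
    return x, y, norms

x_end, y_end, norms = simultaneous_gda(1.0, 1.0)
print(f"distance from the saddle point: start {norms[0]:.3f}, end {norms[-1]:.3f}")
```

Each step multiplies the distance from the saddle point $(0, 0)$ by $\sqrt{1 + \text{lr}^2}$, so the two players spiral outward instead of settling down. A real GAN is far more complicated, but this is the flavor of the difficulty.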
Why was the CNN effective in stabilizing the GAN? Imagine you want to copy someone else’s painting. If you start by drawing the small details, you will eventually ruin the overall balance unless you are very talented at painting. You need to sketch the whole landscape first and add the details later. In a GAN, when $D(G(z))$ is almost 0, in other words when the discriminator rejects the generated samples with near certainty, the generator gets little useful feedback on how good the whole generation is. Downsampling, whether by pooling or by the strided convolutions that DCGAN uses in place of pooling, is similar to drawing a rough sketch by removing the details. DCGAN found a good convolutional architecture (and activations) for images.
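Why a $D(G(z))$ close to 0 is troublesome can be checked numerically. The sketch below assumes that the discriminator output is a sigmoid of a logit $a$, and compares the gradient of the generator term $\log(1 - D)$ from $V$ with the gradient of the alternative generator loss $-\log D$ that Goodfellow's paper suggests for practice. The function names are mine.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the generator term in V; equals -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the alternative generator loss; equals sigmoid(a) - 1."""
    return sigmoid(a) - 1.0

for a in [-8.0, -2.0, 0.0, 2.0]:
    print(f"logit {a:+.1f}  D = {sigmoid(a):.4f}  "
          f"saturating grad = {grad_saturating(a):+.4f}  "
          f"non-saturating grad = {grad_non_saturating(a):+.4f}")
```

When $D(G(z))$ is near 0, the gradient of $\log(1 - D)$ with respect to the logit is near 0 as well, so the generator barely learns exactly when it is doing worst, whereas $-\log D$ still gives a gradient of magnitude close to 1.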

Autoencoder, EBGAN, BEGAN
In the post on the variational method and VAE, we learned that an autoencoder generates approximate data. If we use an autoencoder as the discriminator, we can keep $D(G(z))$ from becoming zero. EBGAN was able to stabilize the network with an autoencoder discriminator.

The loss from the autoencoder, called the reconstruction loss, is not the same as the real loss, but the real loss can be understood as the Wasserstein distance between the reconstruction losses of real and generated samples. Equation (2) of the BEGAN paper comes from this idea.
$$ \mathcal{L}_D = \mathcal{L}(x ; \theta_D) - \mathcal{L}(G(z_D ; \theta_G); \theta_D) \quad \text{for } \theta_D $$
This simple form of BEGAN reduces the training time. BEGAN also improved the performance and diversity by considering a balance parameter, $\gamma \in [0,1]$.
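The losses can be sketched in a few lines. This is a minimal sketch under my own assumptions, not the authors' implementation: the "autoencoder" is a hypothetical stand-in that just shrinks its input, the reconstruction loss is taken as the mean absolute error, and the control variable `k` with its learning rate `lambda_k` follows the balance rule of the BEGAN paper as I understand it. Setting `k = 1` gives back the simple form of $\mathcal{L}_D$ written above.

```python
import numpy as np

def reconstruction_loss(v, autoencoder):
    """L(v): mean absolute error between v and its reconstruction."""
    return np.mean(np.abs(v - autoencoder(v)))

def began_losses(x_real, x_fake, autoencoder, k, gamma=0.5, lambda_k=0.001):
    """Return the discriminator loss, the generator loss, and the updated k."""
    loss_real = reconstruction_loss(x_real, autoencoder)
    loss_fake = reconstruction_loss(x_fake, autoencoder)
    loss_d = loss_real - k * loss_fake  # k = 1 recovers L(x) - L(G(z))
    loss_g = loss_fake                  # the generator wants good reconstructions
    # gamma sets the target ratio loss_fake / loss_real; k is kept in [0, 1]
    k_next = float(np.clip(k + lambda_k * (gamma * loss_real - loss_fake), 0.0, 1.0))
    return loss_d, loss_g, k_next

toy_autoencoder = lambda v: 0.9 * v     # hypothetical stand-in for a trained network
x_real = np.ones((4, 8))                # stand-in for a batch of real samples
x_fake = 2.0 * np.ones((4, 8))          # stand-in for a batch of G(z)

loss_d, loss_g, k_next = began_losses(x_real, x_fake, toy_autoencoder, k=1.0)
print(loss_d, loss_g, k_next)
```

A smaller $\gamma$ makes the discriminator concentrate on reconstructing real data, which in the paper's terms trades diversity for image quality.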
It is also notable that the Wasserstein distance provides a less sharp but more stable (that is, more reliably convergent) loss for a GAN. It takes a very simple form in a GAN and reduces the training time, too.
PGGAN
BEGAN is extraordinary. However, its generation is still restricted. For example, it is hard to generate more radical variations, such as faces rotated in various directions. The BEGAN paper considered a diversity parameter, but that does not resolve the fundamental restriction of the autoencoder. In that sense, PGGAN has great fidelity and potential as a generator of variations. It is more precise than DCGAN, and it converges very well. This GAN also imitates a human’s rough sketching so as not to lose stability.

By training on lower-resolution data first, it obtains a guideline, a frame that it never loses. The paper used only well-known data sets, but PGGAN works amazingly well on various other data. The PGGAN paper also introduces a fast algorithm using RGB projection. When I looked at the source on GitHub, the RGB technique was used only for the MNIST data. For the celeb data, it was not used: the code just upscales the network with the same weights, applies pixel normalization, applies a convolution, and trains.
I will close this post with a comment on normalization. Normalization is essential when preprocessing data. Without it, the machine cannot tell how important each value is. For example, when we say 10, we assume it is decimal 10. Without a common base, it is hard to know whether binary 10 is bigger than decimal 1000. DCGAN used batch normalization. Instead of batch normalization, PGGAN used pixelwise feature vector normalization, a variant of the local response normalization introduced in 2012.
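The idea of starting from a rough sketch can be illustrated on the data side. The sketch below is only an assumption-laden toy, not the PGGAN code: it builds a coarse-to-fine set of training resolutions by 2x2 average pooling and uses nearest-neighbour repetition to go back up. The function names and the lowest resolution of 4 are my choices.

```python
import numpy as np

def downsample2x(img):
    """Halve the resolution by averaging each 2x2 block (average pooling)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(img):
    """Double the resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def resolution_pyramid(img, lowest=4):
    """Return versions of img from the lowest resolution up to the original."""
    levels = [img]
    while levels[-1].shape[0] > lowest:
        levels.append(downsample2x(levels[-1]))
    return levels[::-1]  # coarse to fine

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for one grayscale training image
pyramid = resolution_pyramid(image)
shapes = [level.shape for level in pyramid]
print(shapes)
```

Training would walk through `pyramid` from the 4x4 level upward, so the coarse layout learned at low resolution stays in place as finer layers are added.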
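The normalization used by PGGAN is short enough to write down. This is a minimal NumPy sketch assuming a single feature map laid out as (channels, height, width); the layout and the function name are my assumptions. The feature vector at each pixel is divided by its root mean square over the channels.

```python
import numpy as np

def pixel_norm(features, eps=1e-8):
    """Scale the feature vector at each pixel to unit RMS over the channel axis.

    features: array of shape (channels, height, width).
    """
    mean_sq = np.mean(features ** 2, axis=0, keepdims=True)
    return features / np.sqrt(mean_sq + eps)

rng = np.random.default_rng(0)
features = rng.normal(scale=5.0, size=(16, 8, 8))  # hypothetical activations
normalized = pixel_norm(features)
rms = np.sqrt(np.mean(normalized ** 2, axis=0))
print(rms.min(), rms.max())
```

Unlike batch normalization, nothing here depends on the other samples in the batch and there are no trainable parameters. The operation only keeps the magnitude at every pixel from growing out of control.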
Mode collapse [3]
Mode collapse can occur with multimodal data. Consider a grayscale image of the digit 0 that is 64x64 pixels. The array shape is then (64, 64). If you plot the 32nd row of the array, it shows two peaks. In other words, the row has two modes. Trouble comes when the generator learns only some of the modes.
A GAN tries to make the probability density converge by alternating the learning of $G$ and $D$ so that they satisfy the minimax game. In the case of the digit 0, when the generator learns one peak, the discriminator tends to learn the other peak. Then, at the next turn, the generator tries to learn the peak that the discriminator learned. The two keep switching modes, and the training does not converge.
As far as I have seen, there are three main approaches to removing this trouble. The first is to let the generator see multiple updates of the discriminator. In the 0 example above, if the generator knows that the discriminator will learn both peaks turn by turn, the generator will try to learn both peaks simultaneously. This is the main idea of the unrolled GAN. [4] More technically, it uses a surrogate objective function that looks ahead over the discriminator's iterations while keeping the process differentiable. The second is to use multiple GANs to learn all the peaks. The third is the PGGAN discussed above. PGGAN does not learn only some of the modes: it learns the modes roughly in the early steps and gradually learns all of them in detail, so it can avoid mode collapse.
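The two peaks are easy to check with a synthetic image. The sketch below draws a hypothetical 0 as a soft ring (the center, radius, and stroke width are arbitrary choices of mine) and looks for local maxima along the row with index 32.

```python
import numpy as np

size = 64
yy, xx = np.mgrid[0:size, 0:size]
radius = np.hypot(xx - 32, yy - 32)
# A soft ring of radius 20 and stroke width about 3 pixels, as a stand-in for "0"
zero_image = np.exp(-((radius - 20.0) ** 2) / (2 * 3.0 ** 2))

row = zero_image[32]  # the middle row crosses the ring twice
peaks = [i for i in range(1, size - 1)
         if row[i] > row[i - 1] and row[i] > row[i + 1]]
print(zero_image.shape, peaks)  # (64, 64) [12, 52]
```

The row is bright where it crosses the left and the right stroke of the 0 and dark in between, which is the bimodal profile described above.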
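The effect of unrolling can be seen in a toy game. The sketch below is my own simplification, not the implementation from the unrolled GAN paper: the objective is the bilinear saddle $V(x, y) = xy$, with $x$ in the role of the generator (minimizing) and $y$ in the role of the discriminator (maximizing). With `unroll_steps = K`, the generator descends the surrogate $f_K(x) = V(x, y_K(x))$, where $y_K(x) = y + K \cdot \text{lr} \cdot x$ is the discriminator after $K$ look-ahead ascent steps. The discriminator itself still takes one ordinary step per turn.

```python
import numpy as np

def play(x, y, lr=0.1, unroll_steps=0, steps=200):
    """Alternate-free toy training on V(x, y) = x * y; returns the final distance
    from the equilibrium (0, 0). unroll_steps = 0 is plain gradient play."""
    for _ in range(steps):
        # f_K(x) = x * (y + K * lr * x), so d f_K / dx = y + 2 * K * lr * x
        grad_x = y + 2.0 * unroll_steps * lr * x
        grad_y = x  # dV/dy, one ordinary ascent step for the discriminator
        x, y = x - lr * grad_x, y + lr * grad_y
    return np.hypot(x, y)

plain = play(1.0, 1.0, unroll_steps=0)
unrolled = play(1.0, 1.0, unroll_steps=5)
print(f"plain: {plain:.4f}, unrolled: {unrolled:.6f}")
```

Without unrolling, the two players drift away from the equilibrium. With look-ahead, the extra term $2 K \cdot \text{lr} \cdot x$ in the generator's gradient acts as damping, and the pair settles toward $(0, 0)$. The real method differentiates through the discriminator's optimizer steps with automatic differentiation. The closed form here exists only because the toy game is bilinear.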
1. In string theory, a couple of major breakthroughs each triggered an enormous amount of follow-up research. String physicists call them the first and second revolutions. I also want to call the GAN research boom that followed Goodfellow’s publication a revolution.
2. There are many reviews and video lectures about GAN online. I recommend studying them a bit before reading my post.
3. Updated on 2018-02-24.
4. The original research paper on the unrolled GAN can be found here, and an implementation by one of the authors is here.
