Why do we use WGAN?
A case example of why we moved from GAN to WGAN
The original version of GAN was released in 2014. It was quite revolutionary and sparked a new line of research that is still very active. What is so revolutionary about GANs? The authors of the paper proposed to train two neural networks (a discriminator and a generator) pitted against each other in order to overcome the problem of having to hand-craft a loss function for the generator.
Now it seems rather obvious; however, at the time it was a wild idea. And even if it looked mathematically feasible, there was no indication that it could work in practice. Nevertheless, Ian Goodfellow and his colleagues demonstrated that such an architecture can actually work, even with the resources available at the time. See examples from the same paper below (note: images in yellow are the closest examples from the training dataset).
In a nutshell, generative adversarial networks consist of two networks: a generator, which generates data points, and a discriminator, which evaluates whether its input comes from the training dataset or was produced by the generator. Simple enough; however, how the task of the discriminator is formulated turns out to be quite important. That will be the main theme of this post.
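For reference, the original paper formulates this as a minimax game over a value function (I am quoting the standard objective here; the code below uses the equivalent binary cross-entropy form):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator $D$ tries to push $D(x)$ towards 1 for real samples and $D(G(z))$ towards 0 for generated ones, while the generator $G$ tries to do the opposite.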
In the original version of GAN, the comparison of real and generated examples was formulated as binary classification: is the input real or is it generated? Let's look at a simple case, where the true distribution (in other words, the real examples) consists of 0's. A perfect discriminator will return 1 when its input is 0 and return 0 otherwise. Despite the fact that such a discriminator does a great job at telling real from generated, the generator does not receive a meaningful direction for improving. For example, generator outputs of 0.1 and 1 will be scored the same, even though 0.1 is much closer to the real distribution. Let's see this in action.
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
np.set_printoptions(suppress=False, precision=5)
def d_loss(d_score_z, d_score_r):
    batch_size = d_score_z.shape[0]
    d_loss_z = tf.keras.losses.binary_crossentropy(tf.zeros(shape=[batch_size, 1], dtype=tf.float32), d_score_z)
    d_loss_r = tf.keras.losses.binary_crossentropy(tf.ones(shape=[batch_size, 1], dtype=tf.float32), d_score_r)
    return tf.reduce_mean(d_loss_z + d_loss_r)
def define_discriminator(act="sigmoid"):
    d_optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)
    # Discriminator
    discriminator = tf.keras.models.Sequential()
    discriminator.add(tf.keras.layers.Dense(32, activation="relu"))
    discriminator.add(tf.keras.layers.Dense(32, activation="relu"))
    discriminator.add(tf.keras.layers.Dense(1, activation=act))
    return discriminator, d_optimizer
def train_discriminator(discriminator, d_optimizer, d_loss, batch_size=64):
    real_data = tf.zeros([batch_size, 1])
    # Train discriminator till perfection
    for i in range(1000):
        with tf.GradientTape() as tape:
            z = tf.random.normal(shape=[batch_size, 1])
            d_score_z, d_score_r = discriminator(z), discriminator(real_data)
            total_loss = d_loss(d_score_z, d_score_r)
        if i % 100 == 0:
            print("Total loss:", total_loss.numpy(), "Scores for real:", d_score_r.numpy().mean(), "Scores for generated:", d_score_z.numpy().mean())
        grads = tape.gradient(total_loss, discriminator.trainable_variables)
        d_optimizer.apply_gradients(zip(grads, discriminator.trainable_variables))
Disclaimer: we are creating an artificial situation just to demonstrate the underlying issue with using binary classification as a loss for the discriminator. In reality, we would not train discriminators in this fashion.
discriminator, d_optimizer = define_discriminator()
train_discriminator(discriminator, d_optimizer, d_loss)
So we have got a nearly perfect discriminator that outputs 1 for the real distribution and 0 for anything else. Let's train the generator using such a discriminator.
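Before we do that, here is a quick sanity check (a small sketch I am adding for illustration, it is not required for training): score a few 1D points with the discriminator we just trained.

# Probe the trained sigmoid discriminator at a real point (0.0) and two generated points
probe = tf.constant([[0.0], [0.1], [1.0]])
print(discriminator(probe).numpy())

If training went as expected, 0.0 gets a score close to 1, while 0.1 and 1.0 both get scores close to 0. In other words, the discriminator gives essentially the same score to 0.1 and 1.0 even though 0.1 is much closer to the real data.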
Here we will try to train the generator using this perfect discriminator. Again, you would not do that in practice.
def g_loss(x, discriminator=discriminator):
    return tf.reduce_mean(1.0 - discriminator(x))
def train_generator(g_loss, batch_size=64):
    g_optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)
    generator = tf.keras.layers.Dense(1, activation='tanh')
    for i in range(1000):
        with tf.GradientTape() as tape:
            z = tf.random.normal(shape=[batch_size, 1])
            gen = generator(z)
            loss = g_loss(gen)
        if i % 100 == 0:
            print("Loss:", loss.numpy(), "Generator output: ", tf.reduce_mean(tf.abs(gen)).numpy())
        grads = tape.gradient(loss, generator.trainable_variables)
        g_optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    # Plot the true distribution (all zeros) against the final generated samples
    t = plt.scatter(np.zeros(z.shape), z.numpy(), label="True")
    g = plt.scatter(gen.numpy(), z.numpy(), alpha=0.5)
    plt.xlim(-1, 1)
    plt.legend((t, g), ("True distribution", "Generated distribution"))
train_generator(g_loss)
It does not converge: the generator does not learn to produce 0's, and the loss does not decrease. The reason is that the perfect discriminator's sigmoid output is saturated, essentially flat around the generated samples, so the gradient flowing back to the generator is close to zero. We could find a way to train it so that it converges; however, ideally we would want something that is able to converge in this situation as well.
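To make the saturation concrete, here is a minimal gradient check (again an illustrative sketch added on top of the code above, not part of the training loop): we ask the trained sigmoid discriminator for the gradient of its score with respect to a generated point that is already quite close to the real data.

# Gradient of the discriminator score with respect to a single generated point
x = tf.constant([[0.1]])
with tf.GradientTape() as tape:
    tape.watch(x)
    score = discriminator(x)
print("Score:", score.numpy(), "Gradient w.r.t. input:", tape.gradient(score, x).numpy())

With a (nearly) perfect sigmoid discriminator, both the score and the gradient should be close to zero, so the generator has no signal telling it to move 0.1 towards 0.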
This issue was described in the famous Wasserstein GAN paper, in which the authors proposed to replace binary classification with something that measures the distance between the generated outputs and the real examples.
Let's try it out with our simplified example.
A very naive approach would be to measure the Euclidean distance between the real and generated distributions. Let's see if that helps.
def euclidean_distance(x):
    # The real examples are all zeros, so the distance to the real data
    # is simply the norm of the generated batch
    return tf.reduce_mean(tf.norm(x, 'euclidean'))
train_generator(euclidean_distance)
It works in this example: we can see that the loss is decreasing and the generator output approaches 0. However, this only works in this specific scenario. We can use the Euclidean distance when we measure the distance between 1D points, but what about images? Hence, for a general solution we need something that works universally. And that is what the Wasserstein GAN paper is about.
For the sake of this post, we will skip the details of how and why WGAN works (there are a few brilliant posts about that, please check out the links in the "Further reading" section below), but to put it simply, we can use the distance between the outputs of the discriminator as a proxy for the distance between distributions. (For this to work, we have to enforce a Lipschitz constraint, which we will skip in this post but will talk about how to enforce in the next one.)
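For completeness, the result the paper leans on is the Kantorovich-Rubinstein duality (standard notation, nothing we will need in the code), which expresses the Wasserstein-1 distance between the real distribution $P_r$ and the generated distribution $P_g$ as

$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$

where the supremum runs over all 1-Lipschitz functions $f$. The discriminator (often called the critic in this setting) plays the role of $f$, which is exactly why the Lipschitz constraint shows up.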
Let's try it out.
We measure the distance between the outputs of the discriminator by simply subtracting one from the other.
def d_loss_wgan(d_score_z, d_score_r):
    return tf.reduce_mean(d_score_z - d_score_r)
discriminator_wgan, d_optimizer = define_discriminator(act=None)
train_discriminator(discriminator_wgan, d_optimizer, d_loss_wgan)
Looks like it is working: the loss is decreasing (although it is quite unusual to see a negative loss).
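Out of curiosity, we can repeat the earlier probe on the WGAN-style critic (again just an illustrative check I am adding, assuming the training above has run):

# Probe the critic at a real point (0.0) and two generated points
probe = tf.constant([[0.0], [0.1], [1.0]])
print(discriminator_wgan(probe).numpy())

Unlike the saturated sigmoid discriminator, the critic's scores are not squashed into 0's and 1's: a point like 0.1 should typically receive a score between the scores for 0.0 and 1.0, which is exactly the kind of direction the generator can follow (keep in mind that without the Lipschitz constraint the absolute values can grow arbitrarily large).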
So now we have a discriminator which gives high scores to real examples. The objective for the generator is therefore to get scores that are as high as possible, hence we need just -discriminator(generated).
def g_loss_wgan(x):
    return tf.reduce_mean(-discriminator_wgan(x))
train_generator(g_loss_wgan)
The code above is actually in line with WGAN (except, as I mentioned before, we are missing a method to enforce the Lipschitz constraint). The three main changes are:
- we removed the sigmoid from the last layer of the discriminator, because we are not doing classification anymore.
- we changed the discriminator loss to be the distance between the discriminator's outputs for real and generated examples (subtraction will do just fine for measuring the distance between 1D points).
- we changed the generator loss to be simply the opposite of the discriminator loss for generated input.
We can see that this algorithm converges and the generated distribution approaches 0. This is all nice and good; however, as you can see, the discriminator loss is actually a negative number, hence practitioners more often use some sort of non-saturating loss. For example, the hinge loss, which is defined as follows:
def d_loss_hinge(d_score_z, d_score_r):
    d_loss_real = tf.nn.relu(1.0 - d_score_r)  # Note: this changed
    d_loss_fake = tf.nn.relu(1.0 + d_score_z)  # Note: this changed
    return tf.reduce_mean(d_loss_real) + tf.reduce_mean(d_loss_fake)
discriminator_hinge, d_optimizer = define_discriminator(act=None)
train_discriminator(discriminator_hinge, d_optimizer, d_loss_hinge)
The same non-saturating principle can be applied to the generator:
def g_loss_hinge(x):
    return tf.reduce_mean(tf.nn.relu(1.0 - discriminator_hinge(x)))
train_generator(g_loss_hinge)
That looks good, and we are nearly done, but as I mentioned, we skipped the "Lipschitz constraint" piece of the Wasserstein GAN paper. We will focus on it in the next post.
I hope you now intuitively understand why we moved from the original GAN to the idea of having a discriminator that measures distance rather than classifying the input. Although the code changes are quite simple, they improved the stability of training significantly. And this advancement of GANs played an important role in allowing us to generate pictures like the ones we see today.