$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\I}{\mathcal{I}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} \newcommand{\inv}1 \newcommand{\map}{\text{MAP}} \newcommand{\L}{\mathcal{L}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

InfoGAN: unsupervised conditional GAN in TensorFlow and PyTorch

Generative Adversarial Networks (GANs) are one of the most exciting generative models of recent years. The idea behind them is to learn the generative distribution of data through a two-player minimax game, i.e. the objective is to find a Nash equilibrium. For more about the intuition and implementation of GANs, please see my previous posts about GAN and CGAN.

Note: the TensorFlow and PyTorch code can be found here: https://github.com/wiseodd/generative-models.

One natural extension of GAN is to learn a conditional generative distribution. The conditional could be anything, e.g. a class label or even another image.

However, we need to provide those conditionals manually, somewhat similar to supervised learning. InfoGAN therefore attempts to learn this conditional automatically, instead of telling the GAN what it is.

InfoGAN intuition

Recall that in CGAN the generator network has an additional parameter \( c \), i.e. \( G(z, c) \), where \( c \) is a conditional variable. During training, \( G \) will learn the conditional distribution of data \( P(X \vert z, c) \). Although in principle CGAN and InfoGAN learn the same distribution, \( P(X \vert z, c) \), what differs is how they treat \( c \).

In CGAN, \( c \) is assumed to be semantically known, e.g. labels, so during training we have to supply it. In InfoGAN we assume \( c \) to be unknown, so what we do instead is to put a prior on \( c \) and infer it from the data, i.e. we want to find the posterior \( P(c \vert X) \).

As \( c \) in InfoGAN is inferred automatically, InfoGAN could assign it to anything related to the distribution of the data, depending on the choice of the prior. For example, although we cannot specify exactly what \( c \) should encode, we could hope that InfoGAN captures label information in it by assigning a categorical prior. As another example, if we assign a Gaussian prior to \( c \), InfoGAN might assign a continuous property to \( c \), e.g. rotation angle.

So how does InfoGAN do that? This is where information theory comes into play.

In information theory, mutual information measures how much knowing one variable tells us about another. So, by maximizing mutual information, we find the variable that contributes the most to our knowledge of another. In our case, we want to maximize what we know about our conditional variable \( c \) when we observe \( X \).

The InfoGAN mutual information loss is based on the following variational lower bound:

\[I(c, G(z, c)) \geq \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)} \left[ \log Q(c \vert x) \right] + H(c)\]

where \( H(c) \) is the entropy of the prior \( P(c) \), \( G(z, c) \) is the generator net, and \( Q(c \vert X) \) is a neural net that takes an image as input and produces the conditional \( c \). \( Q(c \vert X) \) is a variational distribution that approximates the posterior \( P(c \vert X) \), which we do not know and which, as in most Bayesian inference problems, is hard to compute. The right-hand side is therefore a variational lower bound on the mutual information.
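To see why the right-hand side lower-bounds the mutual information, we can expand the definition and insert \( Q \); the following is a sketch of the argument from the InfoGAN paper (Lemma 5.1):

\[\begin{aligned} I(c, G(z, c)) &= H(c) - H(c \vert G(z, c)) \\ &= \mathbb{E}_{x \sim G(z, c)} \left[ \mathbb{E}_{c' \sim P(c \vert x)} \left[ \log P(c' \vert x) \right] \right] + H(c) \\ &= \mathbb{E}_{x \sim G(z, c)} \left[ D_{\mathrm{KL}}\left( P(\cdot \vert x) \,\Vert\, Q(\cdot \vert x) \right) + \mathbb{E}_{c' \sim P(c \vert x)} \left[ \log Q(c' \vert x) \right] \right] + H(c) \\ &\geq \mathbb{E}_{x \sim G(z, c)} \left[ \mathbb{E}_{c' \sim P(c \vert x)} \left[ \log Q(c' \vert x) \right] \right] + H(c), \end{aligned}\]

since the KL divergence is non-negative. The remaining expectation over the unknown posterior \( P(c \vert x) \) is what Lemma 5.1 of the paper lets us replace by sampling \( c \) directly from the prior, which gives the expression above.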

This mutual information term fits into the overall GAN loss as a regularizer:

\[\min_{G} \max_{D} \, V(D, G) - \lambda I(c, G(z, c))\]

where \( V(D, G) \) is the original GAN loss and \( \lambda \) controls the strength of the regularization.

InfoGAN training

During training, we provide a prior \( P(c) \), which could be any distribution. In fact, we could add as many priors as we want, and InfoGAN might assign different properties to each of them. The authors of InfoGAN call this a “disentangled representation”, as it breaks down the properties of the data into several conditional parameters.

The training process for the discriminator net \( D(X) \) and the generator net \( G(z, c) \) is quite similar to CGAN's, which you can read more about here. The differences, however, are:

  • instead of \( D(X, c) \), we use the discriminator as in vanilla GAN, \( D(X) \), i.e. an unconditional discriminator,
  • instead of feeding observed data, e.g. labels, as \( c \) into \( G(z, c) \), we sample \( c \) from the prior \( P(c) \).

In addition to \( D(X) \) and \( G(z, c) \), we also train \( Q(c \vert X) \) so that we can compute the mutual information. We sample \( c \sim P(c) \), use it to sample \( X \sim G(z, c) \), and finally pass \( X \) to \( Q(c \vert X) \). The result, along with the prior \( P(c) \), is used to compute the mutual information, which is then backpropagated to both \( G \) and \( Q \) to update both networks so that the mutual information is maximized.

InfoGAN implementation in TensorFlow

The implementations of vanilla and conditional GAN can be found here: GAN, CGAN. We will focus on the additional implementation for InfoGAN in this section.

We will implement InfoGAN for the MNIST data, with \( c \) categorically distributed, i.e. a one-hot vector with ten elements.

As seen in the loss function of InfoGAN, we need one additional network, \( Q(c \vert X) \):

# Weights for Q(c|X): a two-layer net mapping a 784-dim MNIST image
# to a 10-way softmax over the categorical code c
Q_W1 = tf.Variable(xavier_init([784, 128]))
Q_b1 = tf.Variable(tf.zeros(shape=[128]))

Q_W2 = tf.Variable(xavier_init([128, 10]))
Q_b2 = tf.Variable(tf.zeros(shape=[10]))

theta_Q = [Q_W1, Q_W2, Q_b1, Q_b2]

def Q(x):
    Q_h1 = tf.nn.relu(tf.matmul(x, Q_W1) + Q_b1)
    Q_prob = tf.nn.softmax(tf.matmul(Q_h1, Q_W2) + Q_b2)
    return Q_prob

that is, we model \( Q(c \vert X) \) as a two-layer net with a softmax on top. We choose a softmax because \( c \) is categorically distributed, and a softmax output can parametrize a categorical distribution. If we chose \( c \) to be Gaussian, we would instead design the network so that its outputs are a mean and a variance.
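As an aside, and not part of this implementation, a hypothetical Gaussian head for \( Q \) could look like the sketch below. It assumes the same xavier_init, Q_W1, and Q_b1 as above and a single continuous code:

Q_W2_mu = tf.Variable(xavier_init([128, 1]))
Q_b2_mu = tf.Variable(tf.zeros(shape=[1]))
Q_W2_var = tf.Variable(xavier_init([128, 1]))
Q_b2_var = tf.Variable(tf.zeros(shape=[1]))

def Q_gaussian(x):
    h1 = tf.nn.relu(tf.matmul(x, Q_W1) + Q_b1)
    mu = tf.matmul(h1, Q_W2_mu) + Q_b2_mu                      # mean of Q(c|x)
    var = tf.nn.softplus(tf.matmul(h1, Q_W2_var) + Q_b2_var)   # variance, kept positive
    return mu, var

The loss for such a head would be the Gaussian negative log-likelihood of the sampled \( c \) under this mean and variance, instead of the categorical cross-entropy we use below.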

Next, we specify our prior:

def sample_c(m):
    # m one-hot samples from a uniform categorical distribution over 10 categories
    return np.random.multinomial(1, 10*[0.1], size=m)

which is a categorical distribution with equal probability for each of the ten categories.
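As a side note, if we also wanted a continuous code, e.g. for the rotation-angle example mentioned earlier, a hypothetical prior sampler could simply draw from a uniform distribution (as the InfoGAN paper does for its continuous codes). This sketch is not used in the rest of this post:

def sample_c_continuous(m):
    # m samples of a single continuous code, uniform on [-1, 1]
    return np.random.uniform(-1., 1., size=[m, 1])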

As training \( D \) and \( G \) is no different from vanilla GAN and CGAN, we will omit it in this section. To train \( Q \), as seen in the regularization term above, we first sample \( c \) from \( P(c) \), and use it to sample \( X \) from \( G(z, c) \):

G_sample = generator(Z, c)
Q_c_given_x = Q(G_sample)

At runtime, we will populate \( c \) with values from sample_c().

Having all the ingredients in hand, we can compute the mutual information term. The cross-entropy between the sampled codes and our variational distribution \( Q(c \vert X) \) is a Monte Carlo estimate of \( -\mathbb{E}[\log Q(c \vert X)] \), i.e. the negation of the first term in the bound. The entropy of the prior, \( H(c) \), is constant since the prior is a fixed distribution, so it can be left out. Minimizing this cross-entropy therefore maximizes the mutual information lower bound.

# Cross-entropy between the sampled one-hot code c and Q(c|x)
cond_ent = tf.reduce_mean(-tf.reduce_sum(tf.log(Q_c_given_x + 1e-8) * c, 1))
Q_loss = cond_ent
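In other words, for a minibatch of size \( m \) with one-hot codes \( c^{(i)} \) and generated images \( x^{(i)} \), Q_loss computes

\[-\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{10} c^{(i)}_k \log Q(c \vert x^{(i)})_k \;\approx\; -\,\mathbb{E}_{c \sim P(c),\, x \sim G(z, c)} \left[ \log Q(c \vert x) \right],\]

so minimizing Q_loss maximizes the mutual information lower bound, up to the constant \( H(c) \).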

Then, we optimize both \( G \) and \( Q \) based on this loss; note that var_list contains the parameters of both networks, since the mutual information term depends on both:

Q_solver = tf.train.AdamOptimizer().minimize(Q_loss, var_list=theta_G + theta_Q)

We run the training loop as follows:

for it in range(1000000):
    """ Sample X_real, z, and c from priors """
    X_mb, _ = mnist.train.next_batch(mb_size)
    Z_noise = sample_Z(mb_size, Z_dim)
    c_noise = sample_c(mb_size)

    """ Optimize D """
    _, D_loss_curr = sess.run([D_solver, D_loss],
                              feed_dict={X: X_mb, Z: Z_noise, c: c_noise})

    """ Optimize G """
    _, G_loss_curr = sess.run([G_solver, G_loss],
                              feed_dict={Z: Z_noise, c: c_noise})

    """ Optimize Q """
    sess.run([Q_solver], feed_dict={Z: Z_noise, c: c_noise})

After training, we can inspect what property our prior \( c \) encodes. In this experiment, \( c \) nicely encodes the label, i.e. if we pass c = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], we might get this:

InfoGAN samples

Note that, naturally, there is no guarantee on the ordering of \( c \), i.e. the index of the active element of \( c \) need not match the digit label.
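Concretely, samples like the one above can be generated by fixing the code and feeding it together with fresh noise. A minimal sketch, assuming the G_sample, Z, c, sample_Z, and Z_dim defined above and a batch of 16 samples:

n_samples = 16
c_fixed = np.zeros([n_samples, 10])
c_fixed[:, 2] = 1.  # activate the third category, as in the example above
samples = sess.run(G_sample, feed_dict={Z: sample_Z(n_samples, Z_dim), c: c_fixed})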

We could try different values for \( c \):

InfoGAN samples

InfoGAN samples

We can see that our implementation of InfoGAN captures the conditional variable, which in this case is the label, in an unsupervised manner.

Conclusion

In this post we learned the intuition behind InfoGAN: a conditional GAN trained in an unsupervised manner.

We saw that InfoGAN learns to map the prior \( P(c) \), together with the noise prior \( P(z) \), into the data distribution \( P(X \vert z, c) \) by adding the maximization of the mutual information between \( c \) and \( X \) to the GAN training objective. The rationale is that when the mutual information between those two is maximized, they explain each other well, e.g. \( c \) could explain why samples \( X \sim P(X \vert z, c) \) are all images of the same digit.

We also implemented InfoGAN in TensorFlow, which, as we saw, is a simple modification of the original GAN and CGAN.

The full code, for both the TensorFlow and PyTorch implementations, is available at: https://github.com/wiseodd/generative-models.

References

  1. Chen, Xi, et al. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” Advances in Neural Information Processing Systems. 2016.