
The loss function in WGAN GP seems a little bit confusing #8

Open

Unispac opened this issue Jun 4, 2019 · 12 comments

Unispac commented Jun 4, 2019

I noticed that your loss in WGAN-GP is different from the one in the original paper. From the theory of WGAN, the loss should use D directly and does not need a log to wrap it, but I found that you have used a log. I tested the code: your version works very well, but when I replaced the loss with the version given in the paper, it behaved badly.
I don't know the reason. I think using D without the log is more correct theoretically. If we use a log, won't it just be the same as the original GAN?

@JasonYao81000 (Owner)

Could you show how you replaced tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits()) without logits?
The loss of the original WGAN should not include a sigmoid or softmax function; the discriminator output must be linear, so the loss can simply be the mean over a batch of outputs.
You can try replacing our loss with something like self.d_loss = -tf.reduce_mean(D_real) + tf.reduce_mean(D_fake) + tf.reduce_mean(D_wrong_img) + tf.reduce_mean(D_wrong_label), and self.g_loss can be replaced similarly.
There may not be an exact answer to your question; have fun trying your own experiment.
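The suggested replacement can be checked with a small NumPy sketch. The arrays below are toy critic scores standing in for linear, unbounded critic outputs (the names D_real, D_fake, D_wrong_img, and D_wrong_label follow the comment above; the values are made up for illustration):

```python
import numpy as np

# Toy linear critic scores for a batch of 4 samples (no sigmoid applied).
D_real = np.array([1.0, 2.0, 1.5, 0.5])        # real (image, label) pairs
D_fake = np.array([-1.0, 0.0, -0.5, -0.5])     # generated images
D_wrong_img = np.array([-1.0, -1.0, -1.0, -1.0])   # real label, mismatched image
D_wrong_label = np.array([0.0, -2.0, -1.0, -1.0])  # real image, mismatched label

# Conditional Wasserstein critic loss: push real scores up, all "wrong" scores down.
d_loss = (-np.mean(D_real) + np.mean(D_fake)
          + np.mean(D_wrong_img) + np.mean(D_wrong_label))

# Generator loss: raise the critic's score on fake samples.
g_loss = -np.mean(D_fake)

print(d_loss, g_loss)  # prints: -3.75 0.5
```

Note that there is no sigmoid or log anywhere: the critic scores are used directly, which is the point of the suggestion.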


Unispac commented Jun 5, 2019

Thank you for your reply!

"""
beta = tf.random_uniform(imageRotated.get_shape(), minval=0., maxval=1.)
differences = G - imageRotated
interpolates = imageRotated + beta*differences
D_inter = self.discriminator(interpolates, isTraining=True, reuse=True)
gradients = tf.gradients(D_inter, [interpolates])[0]
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))  # gradient penalty

dLossReal = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dReal, labels=tf.ones_like(dReal)))
dLossFake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dFake, labels=tf.zeros_like(dFake)))
self.dLoss = dLossReal + dLossFake + self.theta*GP  # loss of discriminator (with GP)

self.gLoss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dFake, labels=tf.ones_like(dFake)))  # loss of generator
"""

Above is the loss function you use in WGAN-GP, right?

In the original GAN, we use a sigmoid to constrain the output of D, and we define the loss with log(D).
So the loss of D looks like this: -E(log(D_real)) - E(log(1-D_fake))

But in WGAN, there should not be a log wrapping D, because D is used to fit a function f that estimates the Wasserstein distance. So the loss should use D directly instead of log(D).
And we use a clipping strategy to constrain the parameters of D, so that the Lipschitz condition is met.
The loss of D is then: -E(D_real) + E(D_fake).

And in WGAN-GP, we don't use the clipping strategy; we use a gradient penalty (GP) to meet the Lipschitz condition.
Then the loss of D is: -E(D_real) + E(D_fake) + lambda*GP

So I think the correct loss when implementing WGAN-GP should be:
"""
dLossReal = -tf.reduce_mean(dReal)
dLossFake = tf.reduce_mean(dFake)
self.dLoss = dLossFake + dLossReal + self.theta*GP #loss of discriminator. (with GP)
self.gLoss = -tf.reduce_mean(dFake) #loss of generator.
"""
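This proposed loss can be exercised end to end on a toy example. The sketch below (all names and values are illustrative) uses a linear critic D(x) = x @ w, so its gradient with respect to the input is w everywhere and the gradient penalty can be computed in closed form instead of via autodiff:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, dim = 4, 8
theta = 10.0  # the lambda from the WGAN-GP paper

# Toy linear critic: D(x) = x @ w. Its gradient w.r.t. x is w for every input,
# so the per-sample gradient norm is ||w|| and the penalty is (||w|| - 1)^2.
w = rng.normal(size=dim)
real = rng.normal(size=(batch, dim))
fake = rng.normal(size=(batch, dim))

# Interpolate between real and fake samples: one beta per sample, broadcast
# across features (this mirrors the [batchSize, 1, 1, 1] shape discussed below).
beta = rng.uniform(size=(batch, 1))
interpolates = real + beta * (fake - real)

dReal = real @ w
dFake = fake @ w
grad_norms = np.full(batch, np.linalg.norm(w))  # analytic gradient norm at each interpolate
GP = np.mean((grad_norms - 1.0) ** 2)

dLoss = -np.mean(dReal) + np.mean(dFake) + theta * GP  # critic loss with penalty
gLoss = -np.mean(dFake)                                # generator loss
print(dLoss, gLoss)
```

With a real network the gradient would of course come from tf.gradients (or an equivalent autodiff call) rather than a closed form; the toy critic just makes the penalty term easy to verify by hand.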

But when I replace your loss with the version above, it fails!
I have spent a day looking for the reason, but I just can't find it...

Firstly, I think that if we use the loss function you provided, we are just training an original GAN rather than WGAN-GP.
I don't get why you wrap the output of D in a log. We are training D to estimate the Wasserstein distance.
But when there is a log, will it still be the Wasserstein distance? I am a little confused.

Also, I found that you are using lambda = 0.25, but usually people use lambda = 10.
So I think your GP may not actually enforce the Lipschitz condition;
it may just act as a regularization term.
After realizing this, I modified lambda to 10, but that also failed; the result is bad.

Also, I have compared your code with other implementations, and I find some differences in how the GP is calculated.

In your code:
beta = tf.random_uniform(imageRotated.get_shape(), minval=0., maxval=1.)
the shape of beta is 64x64x64x3.

But other implementations usually use:
beta = tf.random_uniform(shape=[self.batchSize,1,1,1], minval=0., maxval=1.)
where the shape is 64x1x1x1.

Also, in your code:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))  # gradient penalty

But the most common version is:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1,2,3]))  # gradient penalty

These points also make me think that your GP doesn't do much.
It seems that your code is just a GAN with a few changes.
I am not sure; I am still trying to make it right,
but the loss without the log fails again and again.
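The shape issue can be verified with a quick NumPy sketch. A toy batch of gradients stands in for the tf.gradients output (smaller spatial dimensions than 64x64 for speed, but the same batch-first layout):

```python
import numpy as np

batch, H, W, C = 64, 8, 8, 3  # batch of "images": gradients have the same shape
gradients = np.random.default_rng(2).normal(size=(batch, H, W, C))

# Summing over axes [1, 2, 3] gives one gradient norm per sample,
# which is what the WGAN-GP penalty needs.
slopes_per_sample = np.sqrt(np.sum(np.square(gradients), axis=(1, 2, 3)))
print(slopes_per_sample.shape)  # (64,)

# Summing over axis [1] alone leaves the W and C axes unreduced,
# so the result is not a per-sample gradient norm at all.
slopes_axis1 = np.sqrt(np.sum(np.square(gradients), axis=1))
print(slopes_axis1.shape)  # (64, 8, 3)
```

The penalty tf.reduce_mean((slopes - 1)**2) still runs on the (64, 8, 3) tensor because reduce_mean averages everything, which is why the bug is silent, but it penalizes per-row norms instead of the per-sample gradient norm.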


JasonYao81000 commented Jun 5, 2019

Here is the WGAN-GP we referenced.
Ours is WGAN-GP with a condition label, while the original WGAN-GP is a pure GAN.
You are right; you could say that we just implemented a conditional GAN with a gradient penalty.


Unispac commented Jun 5, 2019

Thank you! I will go to check it again.


huangjicun commented Jun 5, 2019

dLossReal = -tf.reduce_mean(dReal)
dLossFake = tf.reduce_mean(dFake)
self.dLoss = dLossFake + dLossReal + self.theta*GP #loss of discriminator. (with GP)
self.gLoss = -tf.reduce_mean(dFake) #loss of generator.

Because we use WGAN-GP with a condition label, our loss has to contain the condition part.
Therefore, our loss is calculated using tf.nn.sigmoid_cross_entropy_with_logits.


Unispac commented Jun 5, 2019


Do you mean that because you have to include the condition loss, you have to use a log?
But maybe that breaks the principles of WGAN?
I presume that because the lambda you use for GP is merely 0.25, it behaves just like a normal GAN, and that is why you get a good result.

I am testing hw3-1, where there is no label, so no condition term is needed.
The sigmoid_cross_entropy loss works well, but when I replace it with the WGAN version, I can't get a good result. I still don't know why.


Unispac commented Jun 5, 2019

And another important question: is the calculation of GP right? I really think it is wrong.
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))  # gradient penalty
The gradient of a scalar with respect to a 64x64x3 image has shape 64x64x3, and for a batch it is 64x64x64x3, so summing over axis 1 alone seems wrong.
I find that many other implementations use:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1,2,3]))  # gradient penalty


Unispac commented Jun 6, 2019

Update:
I have found the bug.
Since there are norm layers in G, its optimizer must declare the dependency on their update ops.
To make WGAN-GP work, the calculation of GP must also be modified to the version I offered above, and lambda should be 10.

@JasonYao81000 (Owner)

Cool, thanks for your experiments.
Besides WGAN-GP, you should try spectral normalization for the discriminator; it works like magic.


Unispac commented Jun 6, 2019


Really? In the discriminator? Maybe you mean the generator? Can we add norm layers in the discriminator for WGAN?

@JasonYao81000 (Owner)

You can take a look at the conclusions of this paper by Google.
They said: "Our fair and thorough empirical evaluation suggests that when the computational budget is limited one should consider non-saturating GAN loss and spectral normalization as default choices when applying GANs to a new dataset."
Spectral normalization is also implemented in PyTorch now.
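Spectral normalization divides each weight matrix by its largest singular value, typically estimated with a few steps of power iteration. A minimal NumPy sketch of the idea (function name, step count, and sizes are illustrative, not the PyTorch implementation):

```python
import numpy as np

def spectral_norm(W, n_iters=50, rng=None):
    """Estimate the largest singular value of W by power iteration."""
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v  # sigma ~ u^T W v

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 8))
sigma = spectral_norm(W, rng=rng)

# Dividing the weight by sigma bounds the layer's Lipschitz constant near 1,
# serving the same goal as weight clipping or the gradient penalty.
W_sn = W / sigma
```

In practice (e.g. torch.nn.utils.spectral_norm) only one power-iteration step is run per training step, with u persisted across steps, so the overhead is negligible.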


Unispac commented Jun 6, 2019


Oh, I get it. Thanks!!!
