Structural Adversarial Objectives For Self-Supervised Representation Learning
Within the GAN framework, we propose new training objectives designed to enhance the discriminator's capability for self-supervised feature learning.
Abstract
Reviews and Discussion
This paper presents a self-supervised representation learning method for GANs that involves additional structural modeling responsibilities and a smoothness regularizer imposed on the network. The method encourages the discriminator to structure features at two scales by aligning distribution characteristics (mean and variance) and grouping local clusters. The proposed method is free from hand-crafted data augmentation schemes and is shown to produce effective discriminators that compete with networks trained by contrastive learning approaches in terms of representation learning.
Strengths
- Studying representation learning from a generative perspective is interesting and promising.
- The overall organization and writing of the paper are good, making the work easy to understand.
- The effectiveness of the method was experimentally verified on small datasets.
Weaknesses
- The motivation behind the proposed method is not sufficiently clear to me. Although the authors provide an ablation study, the principles behind the different losses are not well explained. I expect the authors to provide a more convincing motivation to help readers understand the necessity of the proposed method beyond the experimental results.
- The paper lacks a discussion of and comparison with the relevant work ContraD [1], which splits the discriminator into feature learning and real/fake discrimination, similar to the motivation of this work.
- The generation performance of the proposed method is unsatisfactory, according to the FID results in Table 4. While there is an improvement compared to the outdated BigGAN, it is not an appropriate baseline for current comparison. Since the authors have compared their proposed method to StyleGAN2-ADA, to substantiate their claim of improved image generation quality, it would be beneficial for them to compare it to StyleGAN2-ADA on the same architecture.
[1] Jeong, Jongheon, and Jinwoo Shin. "Training gans with stronger augmentations via contrastive discriminator." arXiv preprint arXiv:2103.09742 (2021).
Questions
- Why did the authors choose JSD as the loss function? Could a distance metric such as the Wasserstein-2 distance, which is commonly used in FID and is also based on the assumption of Gaussian distributions, be used instead?
- Given that the loss function involves the computation of covariance and Jacobian matrices, which can be computationally expensive, could the authors provide a comparison of training time and overheads with the baselines?
- Can the authors conduct parameter analysis experiments to provide guidance on the selection of hyperparameters?
Thank you for your feedback!
Q1: Motivation of proposed methods.
A1: Our proposed loss functions seek to synchronize the distributions of real and generated images, as in Equation 2. Additionally, we aim to automate the learning of semantic representations. This is achieved by structuring non-collapsed representations through covariance optimization (Eqn. 4) and by creating local clusters through feature grouping (Eqn. 6). Figure 2 illustrates the effectiveness of the proposed objectives: by optimizing the discriminator with these objectives, the learned embedding aligns semantically similar data and separates dissimilar data.
Q2: Comparison to related work, ContraD
A2: We appreciate the reference to the ContraD study and plan to include it in our related work section. ContraD enhances representation learning in GAN frameworks by incorporating view-consistency contrastive learning methods. Its goal differs markedly from our approach, which demonstrates that adversarial training objectives can inherently encapsulate feature learning capabilities. Our method presents a concise learning paradigm that does not rely on any form of view-consistency objectives, whether explicit or implicit.
Q3: Generation quality and evaluating our method on StyleGAN2-ADA
A3: In our work, the primary focus is on representation learning rather than image generation. This is why we used a ResNet-18 as the discriminator in all experiments, ensuring consistent comparison with other representation learning frameworks. Although the choice of generator is secondary, it is crucial that it does not become a limiting factor in our pipeline. We opted for BigGAN's generator with increased feature channels for this reason. We did experiment with StyleGAN's generator but found it less effective for our purposes.
Q4: Why did the authors choose JSD as the loss function? Could a distance metric such as the Wasserstein-2 distance, which is commonly used in FID and is also based on the assumption of Gaussian distributions, be used instead?
A4: The JSD was chosen for its symmetry, its bounded values, and its ability to optimize mean and variance simultaneously. Although the Wasserstein-2 distance also optimizes both mean and variance, its unbounded nature can lead to greater instability; JSD is therefore the more stable and suitable choice for our objectives.
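For reference (our own addition, not drawn from the paper): the FID-style comparison the question refers to uses the closed-form Wasserstein-2 distance between Gaussians, whose dependence on the mean difference is unbounded, whereas the JSD is always bounded by log 2.

```latex
% Closed-form 2-Wasserstein distance between two Gaussians, as used in FID:
W_2^2\!\left(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\right)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\,\bigl(\Sigma_1^{1/2}\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\right)
% versus the bounded range of the Jensen-Shannon divergence:
0 \;\le\; \mathrm{JSD}(P \,\Vert\, Q) \;\le\; \log 2
```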
Q5: Breakdown of the running time.
A5: We benchmarked the running time of each stage of the discriminator's (D's) training round, with times reported as averages over 100 runs. The measurements were taken with a batch size of 256 on a single A40 GPU. The time taken for each stage is as follows:
- Forward pass of the discriminator: 0.02 s
- Forward pass of the generator: 0.07 s
- Computing the model's approximated Jacobian: 0.27 s
- Computing the adversarial loss: 0.007 s

Running time for a single forward-backward pass:
- A complete D round with our objectives: 0.78 s
- A complete D round with vanilla GAN objectives: 0.29 s
Our proposed objectives require about 2.7x the time of the GAN baseline during training, a reasonable budget given the significant improvement in feature learning ability. Note that at inference time there is no additional runtime overhead compared to the GAN baseline.
Q6: Hyperparameter sweep
A6: We ran ablation experiments sweeping over the hyperparameter choices on CIFAR-10; the results are as follows:
| | | | Linear-SVM |
|---|---|---|---|
| 3 | 5 | 2 | 89.1 |
| 3 | 5 | 4 | 89.8 |
| 2 | 5 | 4 | 88.7 |
| 1 | 5 | 4 | 88.6 |
| 1 | 3 | 4 | 88.7 |
Table C: Ablation experiments on hyperparameters. Our method is insensitive to hyperparameter changes within a reasonable range.
This paper proposes a self-supervised framework with adversarial objectives and a regularization approach. The proposed framework does not rely on hand-crafted data augmentation schemes, which are prevalent across contrastive learning methods. The proposed method achieved competitive performance with recent contrastive learning methods on CIFAR-10, CIFAR-100 and ImageNet-10.
Strengths
- Interesting topic. Getting rid of hand-crafted data augmentation schemes is undoubtedly beneficial for contrastive representation learning.
- Nice ablations. The paper includes comprehensive ablations on data augmentation dependence, other generative feature learners, and system variants.
Weaknesses
My main concern is about the main experiments on representation learning performance (Table 1).
- It is not clear why the authors only include toy datasets (CIFAR-10, CIFAR-100 and ImageNet-10) in this table, while they include experiments on larger datasets (e.g., ImageNet-100) in other tables. Given that the representation learning benchmarks of the baseline methods are all conducted on ImageNet-1k, I don't believe Table 1 is a fair comparison.
- It is also not clear why the authors use SVM and K-Means for evaluating the learned representations in Table 1 and do not include linear probing, which is commonly used in the representation learning literature.
Others:
- The reconstruction-based self-supervised methods (e.g., MAE), which have been shown to outperform contrastive learning methods on ImageNet-1k, also do not rely on hand-crafted data augmentations. Hence, to demonstrate the contribution of this work, it is necessary to show that the proposed method can provide performance gains over them on large-scale datasets.
- I think the authors missed a very relevant related work (not my paper) which should be discussed and compared with: Li et al. MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis. CVPR 2023.
Questions
See weakness.
Thank you for your feedback!
Q1: Large-scaled experiment on ImageNet-1K.
A1: Please see the large-scale experiments section in our general reply.
Q2: It is also not clear why the authors use SVM and K-Means for evaluating the learned representations in Table 1 and do not include linear probing, which is commonly used in the representation learning literature.
A2: We used a Linear-SVM for evaluation, with the default hyperparameters provided by scikit-learn. Compared to linear probing, which uses mini-batch gradient descent, this approach minimizes randomness by fitting on the entire dataset at once. Since a Linear-SVM is also a linear classifier, we expect both classifiers to reach similar end states, a claim supported by the nearly identical results for Linear-SVM and linear probing in Table 2. We employed K-Means clustering to evaluate the features under more challenging conditions than linear classifiers, offering an additional perspective on the quality of the learned features.
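To make this protocol concrete, here is a minimal sketch (our own illustration with hypothetical names, not the authors' code; it assumes frozen features and integer labels are already extracted as NumPy arrays) of scoring features with a default-hyperparameter linear SVM and with K-Means:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

def evaluate_features(train_feats, train_labels, test_feats, test_labels, n_classes=10):
    # Linear-SVM probe: fit on the full training set in one pass (no mini-batch SGD noise).
    svm = LinearSVC()  # scikit-learn defaults, per the response above
    svm.fit(train_feats, train_labels)
    svm_acc = accuracy_score(test_labels, svm.predict(test_feats))

    # K-Means probe: cluster the test features and map each cluster to its majority label
    # (a simple stand-in for the Hungarian matching typically used in clustering evaluation).
    km = KMeans(n_clusters=n_classes, n_init=10).fit(test_feats)
    cluster_to_label = {}
    for c in range(n_classes):
        members = test_labels[km.labels_ == c]
        cluster_to_label[c] = np.bincount(members).argmax() if len(members) else 0
    km_acc = accuracy_score(test_labels, [cluster_to_label[c] for c in km.labels_])
    return svm_acc, km_acc
```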
Q3: The reconstruction-based self-supervised methods (e.g., MAE), which have been shown to outperform contrastive learning methods on ImageNet-1k, also do not rely on hand-crafted data augmentations. Hence, to demonstrate the contribution of this work, it is necessary to show that the proposed method can provide performance gain over them on large-scale datasets.
A3: Masked autoencoders (MAEs) implicitly employ view-consistency objectives through the matching between corrupted and clean images. They are also sensitive to hyperparameters such as patch size and mask ratio. In linear probing evaluations, MAEs underperform compared to contrastive learning methods: as reported in MAGE [1], MAE-ViT-B achieves 68% linear probing accuracy, in contrast to the 76.7% achieved by MoCo v3.
Q4: I think the authors missed a very relevant related work (not my paper) which should be discussed and compared with: Li et al. MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis. CVPR 2023.
A4: We thank the reviewer for suggesting MAGE and plan to include it in our related work section. MAGE enhances representation learning in masked autoencoders by adding a contrastive learning method. This differs significantly from our approach, which aims to demonstrate a straightforward learning paradigm free from any form of view-consistency objectives, whether explicit or implicit.
[1] Li et al. MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis. CVPR 2023.
The authors propose an approach within the framework of generative adversarial networks (GANs) to enhance self-supervised representation learning. They introduce objectives for the discriminator that include additional structural modeling responsibilities. These objectives guide the discriminator to extract informative representations while still allowing the generator to generate samples effectively. The proposed objectives have two targets: aligning distribution characteristics at coarse scales and grouping features into local clusters at finer scales. Experimental results on several datasets demonstrate the effectiveness of the proposed method.
Strengths
- This paper successfully combines two objectives into GANs to learn a good representation.
- The paper has good figures, which make it easy to follow.
- The authors compare with strong baselines and support the effectiveness of the proposed method.
Weaknesses
My concerns include the following:
- The cluster property of the discriminator is well known. Since DCGAN already showed it, I think it is not new for this paper to present it.
- The presented method is not very interesting, even though the authors give a comprehensive analysis.
- The datasets used are small. I would like to see large datasets used to support the proposed method.
- The frameworks are also out of fashion. I think a well-known architecture (e.g., StyleGAN) would be more convincing.
- There are not many visualization results.
Questions
My main question is about the proposed method. The paper is not novel and offers little contribution to this community.
Thank you for your feedback!
Q1: The cluster property of the discriminator is well known; since DCGAN already showed it, it is not new for this paper to present it.
A1: While the discriminator's feature learning capability in traditional GAN frameworks, like DCGAN, is acknowledged, our work focuses on enhancing this aspect significantly. As evidenced in Table 4, conventional GANs exhibit limited feature learning ability. Our innovations in training objectives and regularization techniques have remarkably improved feature learning performance, as shown by the increase in K-Means clustering scores from 29.69 to 80.11, and Linear classifier scores from 69.31 to 89.76.
Q2: The presented method is not very interesting, even though the authors give a comprehensive analysis. The paper is not novel and offers little contribution to this community.
A2: We respectfully disagree with the assessment of our paper's novelty and contribution. Our approach, diverging from view-consistency objectives, explores a more generalized yet challenging direction in feature learning. Our method demonstrates significant improvements over GAN baselines in feature learning, showcasing a substantial advancement in the field.
Q3: The datasets used are small; large datasets should be used to support the proposed method.
A3: Please see the large-scale experiments section in our overall reply.
Q4: The frameworks are also out of fashion; a well-known architecture (e.g., StyleGAN) would be more convincing.
A4: Our primary goal is to advance representation learning, which is why we used a ResNet-18 as the discriminator across all experiments for consistent comparison with other representation learning frameworks. Although the choice of generator is secondary, it is crucial that it does not become a limiting factor in our pipeline. We opted for BigGAN's generator with increased feature channels for this reason. We did experiment with StyleGAN's generator but found it less effective for our purposes.
Q5: There are not many visualization results.
A5: We have included various visualizations in our paper, such as the learned embedding (Fig 1(b)), the dynamic updates of the learned embedding in a synthetic scenario (Fig 2), and generated images in the Appendix.
The paper introduces novel regularization for training GANs to improve the representation learning capability of the discriminator. The representation is competitive with popular contrastive techniques, demonstrated by a variety of experiments.
Strengths
- This paper proposes a reasonable extension to GAN training, clustering rather than real/fake prediction, with novelty in the application to GANs.
- The spectral norm of the Jacobian seems novel.
- The paper is generally well-written
- The use of GANs for representation learning is compelling
Weaknesses
The aims of the paper are not consistently clear throughout:
- From the intro: "...also improves the quality of samples produced by the generator."
- From 3.1 "their motivation is to improve image generation quality rather than learn semantic features — entirely different from our aim." Somewhat weakening this contribution of the paper.
The biggest issue is the lack of the major comparison dataset in vision representation learning: full ImageNet. I was quite surprised to see this data missing, for a few reasons:
- It's commonly used in existing literature.
- BigGAN (of which the proposed works architecture is inspired by) is trained on ImageNet.
- The compared methods are significantly hampered by such small data.
I'm going to focus on the Masked Autoencoder (MAE) paper, as I'm quite familiar with that work. The reduced training dataset size of ImageNet-10, as well as the smaller patch size, is a fairly large deviation. Furthermore, there's no mention of what representation space is used from the MAE: all image patches? The CLS token? While these are fine details, they are crucial for a fair comparison. I'm not as familiar with the other compared methods, but given the issues with MAE, I am concerned about those other methods as well.
It's not clear to me whether the proposed method achieves good representation learning only on small datasets, or more broadly. As noted in the StyleGAN2-ADA paper, CIFAR-10 is a data-limited benchmark.
Minor:
- The use of z, z^g is a little confusing, as z usually refers to the generator's input and z^g even more so.
Questions
The fine-grained clustering is a bit confusing; can you explain how the memory bank works in greater detail? Is z^b the discriminator's representation of the real images encoded into the latent space? The nomenclature is not clear. A plain-English explanation of what the loss function accomplishes would be illuminating as well.
Thank you for your feedback!
Q1: Our description on generation quality and comparison with ICGAN
A1: Our work primarily focuses on representation learning, not on enhancing generation quality. Improved image quality emerges as a secondary benefit of our proposed objectives. These goals are not necessarily in conflict, though their relative importance may vary by application. Unlike ICGAN, which uses an external feature learner to improve its generation quality, our framework is designed to learn representations from adversarial training. We refer to ICGAN in our paper to highlight the fundamental differences between their approach and ours.
Q2: Large-scaled experiment on ImageNet-1K.
A2: Please see the large-scale experiments section in our overall reply.
Q3: Implementation details of compared methods.
A3: We provide implementation details for the methods we compared, including MAE, in Appendix A.2. For MAE, we used the average of all patch tokens from the encoder's output as the representation for the linear classifier.
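As an illustration of this averaging step (our own sketch under assumed tensor shapes, not the authors' code), with a ViT-style encoder output of shape (B, 1 + N, D) the representation could be formed as:

```python
import torch

def patch_token_average(tokens: torch.Tensor, has_cls: bool = True) -> torch.Tensor:
    """Average the patch tokens of a (B, 1 + N, D) encoder output into a (B, D) feature."""
    patches = tokens[:, 1:, :] if has_cls else tokens  # drop the CLS token if present
    return patches.mean(dim=1)                          # feature fed to the linear classifier
```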
Q4: Clarification on the notations z and z^g
A4: We use z to denote the encoder's final representation, and z^g as the input to the generator. This is in contrast to the traditional GAN framework, which has only a single latent space serving as the generator's input. Our model includes an additional latent space, namely the encoder's output.
Q5: Explanation of the Memory Bank
A5: The memory bank holds the encoder's representations of real images. It is implemented as a first-in-first-out queue, updated continuously by replacing the oldest features with new ones from the current mini-batch.
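A minimal sketch of such a first-in-first-out feature queue (names and shapes are our own illustration, not the paper's implementation):

```python
import torch

class FeatureQueue:
    """Fixed-size FIFO bank of encoder features; the oldest entries are overwritten first."""

    def __init__(self, size: int, dim: int):
        self.bank = torch.zeros(size, dim)  # stored representations of real images
        self.ptr = 0
        self.size = size

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor):
        n = feats.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.size  # wrap around the buffer (FIFO)
        self.bank[idx] = feats.detach()
        self.ptr = (self.ptr + n) % self.size
```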
Q6: Explanation of the proposed adversarial objectives
A6: Our objectives aim to align the distribution of real and generated images (Eqn. 2, for image generation) and to facilitate automatic learning of semantic representations. The latter is achieved by imposing structure on non-collapsed representations through covariance optimization (Eqn. 4) and by forming local clusters through feature grouping (Eqn. 6). The effectiveness of the proposed objectives is illustrated in Figure 2, where we show that by optimizing the discriminator with these objectives, the learned embedding aligns semantically similar data and separates dissimilar data.
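As a rough illustration only (our own sketch in the spirit of this description; it is not the paper's Eqns. 2, 4 or 6), a covariance term that discourages collapsed features and a grouping term that pulls each feature toward its nearest memory-bank neighbors could look like:

```python
import torch
import torch.nn.functional as F

def covariance_penalty(z: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal covariance so feature dimensions do not collapse together."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum() / z.shape[1]

def grouping_loss(z: torch.Tensor, bank: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Pull each feature toward its k nearest memory-bank entries, forming local clusters."""
    z_n, bank_n = F.normalize(z, dim=1), F.normalize(bank, dim=1)
    sims = z_n @ bank_n.T                # cosine similarities to bank features
    topk = sims.topk(k, dim=1).values    # the local cluster of nearest neighbours
    return -topk.mean()                  # maximizing similarity tightens the local cluster
```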
We thank the reviewers for their feedback, and in particular for recognizing our work on representation learning with generative models and its emphasis on learning without reliance on augmentation regimes. However, we believe there is a gap between our focus and the reviewers' perceptions. To address this, before answering specific questions in the individual responses below, we elucidate our contributions here.
Our goal is to enhance the representation learning ability of GANs, evidenced by improvements from 28.96 to 80.11 in K-Means clustering and from 76.50 to 89.76 in linear classification. This improvement enables us to conduct feature learning solely with adversarial objectives, without relying on view-consistency objectives, as demonstrated in Table 2. We would like to further clarify the following points:
- Our primary focus is learning representations without view-consistency objectives, a challenge toward building more general-purpose self-supervised learning systems. This is a significant departure from the prevalent self-supervised approaches that largely depend on such objectives and necessitate extensive tuning of data augmentation for optimal performance. We summarize the comparison in the following table:
| | Contrastive Learning | Mask AE | GAN | Ours |
|---|---|---|---|---|
| View-consistency objectives | Yes (match between augmented images) | Yes (match corrupted to clean images) | No | No |
| K-Means clustering on CIFAR-10 | 75.0 | 37.0 | 28.96 | 80.11 |
| Linear classification on CIFAR-10 | 93.1 | 82.3 | 76.50 | 89.76 |
Table A: Comparison between our work and related methods on feature learning. Our method demonstrates the ability to learn robust feature representations on selected datasets without relying on any view-consistency objectives.
- Our proposed new adversarial training objectives integrate feature learning with adversarial training, allowing representations to be learned without view-consistency objectives. Simultaneously, we also learn to generate. This suggests that the split in the research community into specialized methods for generation (e.g., GANs) and specialized methods for representation learning (e.g., contrastive self-supervision) may be unnecessary; our work highlights the possibility of unification.
- We introduce a new regularizer based on the approximated spectral norm of the Jacobian. This approach uniquely balances the discriminator's capacity and smoothness. We summarize the comparison between our proposed approach and other regularizers in the following table:
| | Spectral Norm | Gradient Penalty | Exact Jacobian | Approximated Jacobian (Ours) |
|---|---|---|---|---|
| Regularization | Layer-wise | Model-wise | Model-wise | Model-wise |
| Model outputs high-dimensional vectors | Yes | No | Yes | Yes |
| Preserves model capacity | No | Yes | Yes | Yes |
| Efficient computation | Yes | Yes | No (128×128 forward passes) | Yes (3 forward passes) |
Table B: Comparison between our and other regularization techniques. Our proposed regularizer effectively maintains the smoothness of the model without compromising its capacity. It also accommodates the model's final output as a high-dimensional vector.
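As a sketch of the general idea (our own illustration under stated assumptions; the paper's exact estimator may differ), a finite-difference estimate of the Jacobian's action on a random unit direction costs only a couple of extra forward passes and yields a differentiable smoothness penalty; averaging over random directions gives a cheap lower-bound-style surrogate for the spectral norm:

```python
import torch

def jacobian_direction_penalty(model, x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    """Estimate ||J_f(x) v|| for a random unit direction v via finite differences."""
    v = torch.randn_like(x)
    v = v / (v.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    jv = (model(x + eps * v) - model(x)) / eps  # J v, approximated with two forward passes
    return jv.flatten(1).norm(dim=1).mean()     # penalize large directional derivatives
```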
Large-scale experiments. Our study focuses on providing empirical evidence that effective representations can be learned solely through adversarial objectives, a principle we have demonstrated on smaller-scale datasets. Advancing this approach to surpass current state-of-the-art methods, which have benefited from multiple iterations of refinement, on larger datasets would require considerable extra effort; that endeavor extends beyond the present scope of our paper.
Due to time and computing resource constraints, we could not fully run tests on ImageNet-1k during the rebuttal period. As a reference, BigGAN takes 15 days to train on ImageNet-1k with 8xV100 GPUs. However, our results on ImageNet-100, presented in Table 5, demonstrate our method's superiority over the BigBiGAN distillation method, even with a smaller feature extractor. For reference, we report a linear SVM score of 0.75 for SimCLR run on ImageNet-100 at the same 128x128 resolution. We acknowledge a gap between our method and the top contrastive learning methods on both CIFAR-100 and ImageNet-100. However, our approach showcases feature learning without any view-consistency objectives, which could eliminate the need for hyperparameter tuning of the data augmentation. Our approach demonstrates a potential new direction for future work in representation learning.
This paper proposes a variation of the GAN framework in which the generative task is used as a pretext task for representation learning. The focus is then on the discriminator rather than the generator, as in a typical GAN framework, and the main motivation is to learn good representations without relying on hand-crafted data augmentation schemes, which are often required in contrastive representation learning methods.
While reviewers agreed on the relevance of the topic and motivation, they were consistently underwhelmed by the evaluation, both in terms of the data where claims were observed to hold and the choice of model architecture as well. We'd recommend expanding the evaluation prior to publication to better support the claims made in the paper.
Why not a higher score
While the paper is quite interesting, there are issues with the evaluation that should be addressed before it is ready for publication, as outlined in the latter portion of the meta-review.
Why not a lower score
N/A
Reject