PaperHub
Rating: 6.0/10 (Poster, 4 reviewers; scores 5, 5, 7, 7; min 5, max 7, std 1.0)
Confidence: 3.5 | Correctness: 3.0 | Contribution: 2.8 | Presentation: 3.0
NeurIPS 2024

Learning from Pattern Completion: Self-supervised Controllable Generation

OpenReview | PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Inspired by the neural mechanisms that may contribute to brain’s associative power, specifically the cortical modularization and hippocampal pattern completion, here we propose a self-supervised controllable generation (SCG) framework.

Abstract

Keywords
Self-supervised Controllable Generation, Modularization, Pattern Completion, Associative Generation

Reviews and Discussion

Review (Rating: 5)

Conditional generation in the era of diffusion models has been positively impacted by ControlNets, which allow fine-tuning of diffusion models with an additional image input, enabling fine-grained conditioning. Popular ControlNets rely on pose, edge, or segmentation-map conditioning, which requires additional, usually supervised, networks to derive the condition of interest. In this paper, the authors propose a fully unsupervised alternative to ControlNets and position their contribution as being proficient by mimicking the brain's modularity. They propose training an equivariant autoencoder (SCG) model that allows for the disentanglement ("modularisation") of the input image information, thereby giving natural access to an altered version of an image that can serve as a condition for generation in a ControlNet framework. The authors experimentally show the performance of the approach by 1) showing images conditioned on various representation modules, and 2) comparing it against a Canny edge ControlNet, highlighting how the approach is more robust to noise and produces more realistic images that resemble the input condition more closely.

Strengths

  • SCG is a fully unsupervised method that, in contrast to most ControlNets, does not rely on supervised tools.
  • SCG allows for the generation of various conditioning inputs by letting one select from a set of representation modules. This is fundamentally appealing: ideally, it allows the user to choose which part of the input information to condition the generation on, instead of being tied to a set of (supervised) information-extraction tools as in a classic ControlNet setup.

Weaknesses

I believe this paper is not in a publishable state for the following reasons:

  • Presentation/Clarity/Rigor: the paper should be polished as it contains a lot of typos, confusing sentences/sections (see below), and missing information (see questions). In general, sections 1 and 3 are hard to follow and cause some confusion, and figures - including captions - should be refined.
  • Impact: While the general goal of this paper (unsupervised + modular image conditioning) is appealing, in practice SCG becomes a trade-off between supervision and control over the information that will condition the image generation. Standard ControlNets require supervised tools but allow for explicit control over the information that is embedded into the conditioning (e.g., body pose, or structural information through Canny edges). SCG, on the other hand, does not require supervision but results in a set of image conditions that mostly disentangle the input image according to spatial frequencies, as pointed out by the authors. As a consequence, most images capture similar semantic information and the generated images are very similar to one another. To summarize, I believe the current use cases of SCG are cases where one wants to generate novel images that differ only slightly from the input image condition, similar to using a ControlNet on Canny edges (which is indeed the only ControlNet that the authors compare against). I think this should be made clearer in the manuscript from the get-go (it is only discussed in the limitations), and ideally the paper should address this concern to increase the impact of SCG.
  • Experimental validation: I believe the experimental validation does not allow the reader to have an in-depth understanding of the performance of SCG. Mainly, the following questions are unanswered: how much variability (in quantitative terms) can one expect from conditioning on various modules? How do images generated from SCG compare with images generated by conditioning on the lower principal components (see [1] for a more detailed example)?

[1] Balestriero, Randall, and Yann LeCun. "Learning by reconstruction produces uninformative features for perception." arXiv preprint arXiv:2402.11337 (2024).

Minor:

  • section 1: overall confusing, I believe the important message here is "brain performs functional modularity, we aim at mimicking this for associative generation purposes; As a starting point, we believe equivariance helps in achieving functional modularity in neural networks". The current state of the introduction is convoluted and it takes some thought to actually extract the general message.
  • line 33-36: should be rephrased in my opinion, the sentence is confusing/hard to follow.
  • line 43: are we talking about the brain or neural networks here? this sentence is confusing.
  • line 45: problem in the sentence formulation
  • line 53: I believe "devide" should be changed to "divide" throughout the paper
  • line 109: loose terms → "is a kind of change"
  • line 120: confusing formulation "train the autoencoder in homologous images"
  • section 3: what is the fundamental difference between z, the latent representation, and f, the feature maps? We seem to be switching back and forth between both notations.
  • section 3: why does the latent representation z become a function (i.e., z(I))?
  • Eq 6: sum over δ; which set of values does δ belong to?
  • section 3: symmetry loss, the intuition behind this additional loss term is unclear to me.
  • line 138: typo
  • line 146: confusing, either ControlNets should be described in more detail in section 2 or the phrasing "based on ControlNets" should be clarified in section 3 to make it evident that the "pattern completion" is entirely mediated by the existing ControlNets framework and is not a contribution.
  • Eq 9: the expectation symbol should be replaced by \mathbb{E}
  • Figure 1: This figure is supposed to reflect the approach proposed by the authors, however, it is not straightforward which elements in this figure correspond to the SCG building blocks detailed in section 3. Instead of clarifying SCG, figure 1 brought some confusion in my case.
  • Figure 3: formulation problem in the caption
  • Figure 4: typo in the caption and the x-axis label

Questions

  • can authors confirm that the conditioning used is f, the latent representation?
  • can you clarify what the symmetry loss terms aim to do?
  • can you explain what translation and rotation augmentations were used for the training of the autoencoder? what is the sensitivity of the representations to these parameters?
  • can you explain the training procedure of the autoencoder? number of epochs, optimizer, ...
  • can you confirm that the encoder and decoder "architectures" are single-layer convolutional filters?
  • how is the subject evaluation performed? how many annotators? through which annotation platform? on how many images? which questions are asked of annotators?

Limitations

The authors discuss the fact that the current equivariant autoencoder disentangles mostly based on spatial frequency rather than following a human-interpretable partitioning of the information (high-level/semantic vs. low-level information). Beyond that point, which limits the impact of this work, I believe there are additional limitations: for example, the lack of control over the redundancy between the original image and the image condition is a limitation tied to the use of fully unsupervised methods, despite the clear advantages of choosing them.

Author Response

Thank you very much for your thorough review and constructive feedback.

Q1: Presentation/Clarity/Rigor

R1: We sincerely apologize for the confusion caused by the introduction and methods sections. Our aim was to bridge concepts from neuroscience with controllable generation in AI, hoping to stimulate cross-disciplinary inspiration and collaboration. However, in attempting to reach audiences from both fields, we presented the methods in a coarse-grained manner, leading to misunderstandings among readers, including reviewers. To address this, we will: 1) Rephrase: clarify terms, sentences, equations, and figure captions in sections 1 and 3 to ensure their accuracy and accessibility. 2) Enhance understanding: provide a more detailed diagram of the proposed modular autoencoder in Section 3 (see PDF Fig. R4), including a clearer representation of the prediction matrix M, the equivariant loss, and the symmetry loss. Refer to GR2 for more.

Q2: Impact

R2:

  1. The proposed method's most significant contribution lies in introducing a novel self-supervised conditional generation training framework. This framework enables the acquisition of broad capabilities through a self-supervised approach, offering the potential to fully leverage the scaling law and achieve a large-scale foundation model for controllable generation. Furthermore, by fine-tuning the foundation model on specific downstream tasks or utilizing adapters, powerful specialized models can be easily derived.
  2. Since there has been extensive research on conditional generation, we focus on designing a modular autoencoder based on the proposed equivariant constraint, which is crucial for subsequent self-supervised controllable generation. The visualization of the autoencoder component and the demonstration of its zero-shot conditional generation capability on various tasks prove the effectiveness of our method.
  3. On specific tasks, our method still falls short of the generative capabilities of dedicated supervised methods. A major reason for this is the use of labeled data, such as sketch-image pairs, in supervised methods. If we were to fine-tune our model with labeled data, its performance on specific tasks would significantly improve.

Our primary goal at this stage is to demonstrate that the proposed self-supervised framework can give rise to broad conditional generation capabilities. Further supervised fine-tuning will be explored in future work. Refer to GR1 and GR4 for more.

Q3: Variability and Comparison with principal components.

R3:

  1. In PDF Fig. R5, we demonstrate the ability to precisely control how the generated image expresses specific components by adjusting the intensity of those components (specifically, by multiplying them by different coefficients). It is evident that our method can easily manipulate saturation and contrast by changing the coefficients of HC0 and HC1 (a minimal sketch of this manipulation follows this list).
  2. Regarding the principal components, in PDF Fig. R4, we showcase the principal components obtained through PCA. A fundamental difference between components of PCA and modules of our method is that the different components in PCA are independent. As a result, PCA lacks the ability to decouple highly correlated internal representations into relatively independent and complementary representation modules, which is the core of zero-shot conditional generation capability.
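As an illustration of the coefficient-based manipulation described in point 1 above, here is a minimal, hypothetical sketch. The `encode_modules` and `generate` interfaces are placeholder stand-ins for the modular autoencoder and the ControlNet-style conditional sampler, and the coefficient values are arbitrary examples, not the settings used in the paper.

```python
# Hypothetical sketch: scale module features before using them as the generation condition.
# `encode_modules` and `generate` are placeholders, not the released code.
import torch

def generate_with_scaled_components(image, encode_modules, generate, coeffs):
    """Scale each module's feature map before conditioning the generator on the set."""
    with torch.no_grad():
        modules = encode_modules(image)            # e.g. [HC0, HC1, HC2, ...] feature maps
    condition = [c * z for c, z in zip(coeffs, modules)]
    return generate(condition)                     # the generator "completes" the image from the scaled clues

# Example: boost HC0 (saturation) and damp HC1 (contrast), leave the other modules unchanged.
# result = generate_with_scaled_components(img, encode_modules, generate, [1.5, 0.5, 1.0, 1.0, 1.0, 1.0])
```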

Q4: Minor issues

R4:

1-2) Lines 33-36 rephrase: We revise it to "The brain's remarkable ability to generate associations emerges spontaneously and likely stems from two key mechanisms. One is the brain's modularity in terms of structure and function, enabling it to decouple and separate external stimuli into distinct features at various scales. This modularity is essential for a range of subsequent cognitive functions, including associative generation."

3-7) We will fix them.

8) We will unify z and f to z.

9) We will revise z(I) to z.

10) Range of δ: δ contains 3 parameters: two translation parameters t_x, t_y, and one rotation parameter θ (see PDF Fig. R4). The range of t_x and t_y is [-0.5s, 0.5s), where s is the convolution stride. The range of θ is [0, 2π). Each parameter or dimension is sampled at n (i.e., n = 24) equally spaced values.

11) Refer to GR2.

12-14) We will revise them.

15) Refer to R1 and GR2.

16, 17) We will fix them.

Q5: Questions

R5:

1) Yes, we will revise it to "z".

2) The symmetry loss aims to further enhance the internal correlation (or symmetry) within modules, building upon the equivariant loss. Without the symmetry loss, the model learns unrelated features within a module, as shown in PDF Fig. R3; refer to GR2.

3) We used random translation and rotation augmentation within 0.3 times the image length and 360 degrees. When removing translation augmentation or rotation augmentation (PDF Fig. R2), modular features are still learned. The difference is that without translation augmentation the learned features lack orientation selectivity and resemble a difference of Gaussians, while without rotation augmentation the features within each module share the same orientation.

4) We train it for 40,000 steps with the AdamW optimizer and a cosine annealing schedule starting from a learning rate of 0.005 (a minimal configuration sketch follows this list).

5) Yes, it is a single layer, or can equivalently be considered a single-layer network.

6) We use the platform https://www.wjx.cn/ to collect subjective evaluations, gathering 40 questionnaires. Each questionnaire contained 12 pairs of comparison images and one original image. Participants were asked which of the two images had better fidelity and aesthetics based on the original image (see Appendix A.4.4 in the main text).
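To make answers 3) and 4) concrete, here is a minimal, hypothetical training-setup sketch. `ModularAutoencoder` and `loader` are placeholders; only the augmentation ranges, the 40,000 steps, the AdamW optimizer, the cosine annealing schedule, and the 0.005 initial learning rate come from the answers above.

```python
# Hypothetical training-setup sketch based on answers 3) and 4); interfaces are placeholders.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=360, translate=(0.3, 0.3)),  # rotation up to 360°, shift up to 0.3 x image size
    transforms.ToTensor(),
])

model = ModularAutoencoder()                     # placeholder modular autoencoder
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40_000)

for step, batch in zip(range(40_000), loader):   # `loader` yields augmented images (placeholder)
    loss = model.loss(batch)                     # reconstruction + equivariance + symmetry terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```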

Q6: Limitation

R6: Refer to R2.

Comment

Thank you to the authors for their answers.

My concerns remain regarding the fact that the method allows for modulating elements like the contrast and brightness of the images but does not offer much variability over the semantics (images are nearly identical to the condition image), which limits the appeal of the approach. Also, the authors do not answer my question regarding the comparison with conditioning the generation on images resulting from principal component filtering.

The additional results provided by the authors are informative and show that for some more unusual styles, like graffiti, the conditioning offered by the modular autoencoder outperforms supervised methods, while this is not the case for more usual styles, where the images conditioned with a segmentation mask look as good but with more variability in the image instead of reconstructing the condition image.

Given additional results and after reading the other reviews, I see better the novelty of the proposed method and how it might inspire future work. For this reason, I am increasing my score.

Comment

Thank you very much for your effort and valuable suggestions, and we also appreciate your raising of the score. We are also grateful for your insightful summary of the key message, from which we learned a lot.

  1. We agree with your concern regarding the limitation of our method in terms of semantic diversity. We will discuss this in more detail in our future work and limitations section.
  2. Regarding your suggestion for a comparison with principal component analysis (PCA), we believe it is an excellent point. We have included the features generated using PCA in PDF Fig. R3, and we can see a clear difference compared to our approach, namely the lack of modular features in PCA. We recognize that the best approach would be training a PCA-based conditional generative model in a similar manner, but due to time constraints, we were unable to train a new model. Theoretically, we argue that PCA, due to its lack of specialized modular features, struggles to flexibly manipulate different aspects of features, leading to a generative model that is almost a reconstruction of the original image and thus lacks the ability to achieve widespread zero-shot generalization.
  3. We admit that in specific task domains, such as conditional generation under semantic segmentation, our self-supervised approach still has a gap compared to dedicated supervised models. We will include a more detailed discussion of this in our future work and limitations section.

Thanks for your efforts on our manuscript and valuable suggestions again.

Review (Rating: 5)

The authors train a modular autoencoder with an auxiliary custom equivariance objective, which enables them to obtain independent sets of representations of an image. The authors then use some of these representations to condition a ControlNet.

The authors find that the autoencoder's submodules learn to encode functional specializations such as edge detection and other low-level image features.

Strengths

  • This paper contributes to an exciting and challenging field: controllable image generation. The work presented in the paper is original.
  • The authors make their code available with the submission. This makes the paper easier to review, and will greatly improve the paper’s usefulness for other research groups working on controllable generation.

Weaknesses

  • The paper is titled “Learning from Pattern Completion”. What are we learning from pattern completion?
  • Section 3 should be written more clearly.
  • While the paper promises controllable image generation, it appears that the proposed method is more of an image reconstruction method: the experiments from the paper evaluate the proposed method by giving it features generated using the proposed modular autoencoder as input. My sense is that if I wanted to draw “a monkey climbing on the moon”, then I could give ControlNet a hand-drawn sketch of this, while the proposed SCG method would not be able to generate an image based on my sketch. Is this correct?

Questions

  • The authors chose to evaluate ControlNet with the Canny image detector. Would a different representation to condition on have worked better? It seems obvious that the Canny image detector is a lossy representation of the original image.
  • The authors could use a tilde (~) before \cite commands to put a small space before the square brackets, which would make the text more visually pleasing, for example~\cite{schmidhuber_1994}.
  • I believe that instead of “equivariant constraint”, the correct term is “equivariance constraint”.
  • It would be great if the README.md of the code could state what kind of GPU and how much RAM is required to run their code.
  • It would be great if the authors could include more detail in the caption of Figure 2, especially for part C of the figure.
  • On line 56, it appears that a word is missing between “sparsely” and “in”
  • On line 66, the authors use the term “closed and complete independent manifold”. The terms “manifold”, “closed” and “complete” have specific meanings in the mathematical area of topology, and if I understand correctly the authors are not using the words in the sense of the definitions from topology. It may be useful to use different terms here to avoid confusion.
  • On line 93, the authors mention “simple cells”. It would be helpful for some readers to change this to “simple cells in the visual cortex”.
  • On line 130 it says "where M^{(I)}(δ) is a learnable prediction matrix determined by δ". I still do not understand what M^{(I)}(δ) is. Is it a neural network that takes δ as input? Or is it computed for every δ? More generally: what is the set or distribution of all δ? Equation (6) involves a sum over δ, and I don't know what the set of all δ (or its distribution) is.
  • I would be grateful if the authors could describe equation (8) more. Is C able to construct z(I) based on z^(i)(I) and i alone?
  • On line 172, “We” should be lower-cased.

Limitations

Yes

Author Response

Thank you very much for your efforts and valuable feedback on our paper.

Q1: What are we learning from pattern completion?

R1: Pattern completion focuses on the relationships between different module features at a global scale. By learning these relationships, SCG can utilize information from a subset of modules as clues to complete or generate information in other modules, thereby achieving various conditional generation tasks. Refer to GR3.2 for more.

Q2:Section 3 should be written more clearly.

R2: We sincerely apologize for the lack of clarity in some of our expressions. We will further clarify the captions of Figure 2, the prediction matrix M, the distribution of δ, equation (6), and equation (8). Since we only presented a very coarse-grained overview of our framework in Figure 1, it may have been difficult to grasp the finer details of the method, especially regarding the modular autoencoder part. To address this, we have added a more detailed structural diagram of the modular autoencoder in PDF Fig. R4. It provides a more intuitive visualization of the modular autoencoder with the proposed equivariance constraint, including the prediction matrix M, the equivariant loss, and the symmetry loss. We will add this figure in Section 3. See also GR2.1.

Q3: It appears that the proposed method is more of an image reconstruction method... The proposed SCG method would not be able to generate an image based on my sketch. Is this correct?

R3: The proposed SCG can generalize to both reconstruction-oriented tasks and generation-oriented tasks, such as sketch (main text Figure 5), LineArt (PDF Fig. R2), and ancient graffiti (main text Figure 7), in a zero-shot manner.

  1. In the PDF Fig. R2, we use SCG to generate an image of “a monkey climbing on the moon” using a sketch as control.
  2. Different modules have different characteristics. For instance, HC0 extracts color information, while HC1 extracts brightness information. This makes them more suitable for reconstruction-oriented tasks like image super-resolution (PDF Fig. R2), dehazing (PDF Fig. R2), and colorization (main text Figure 6 and Appendix Fig. S8).
  3. For HC2 and HC3, the extracted information is more abstract, focusing on edges. Thus, they are more suitable for generation-oriented tasks such as conditional generation under sketch (main text Fig. 5), line art (PDF Fig. R2), and ancient graffiti (main text Fig. 7).

It is undeniable that our fully self-supervised controllable generation approach was not specifically trained for tasks like sketch and line drawing, and therefore its performance and stability fall short of dedicated supervised methods. However, SCG's performance on specific tasks can be significantly improved by supervised fine-tuning for those tasks, such as adding sketch-image pairs for training. We will incorporate this into the discussion of future work. Refer to GR4.

Q4: Would a different representation to condition on have worked better than Canny?

R4:

  1. We have introduced controllable generation using depth maps, normal direction maps, and semantic segmentation maps as conditions for comparison (see PDF Fig. R1). These conditions are more abstract than Canny edges, offering a larger generation space. However, they require a supervised pre-trained network to extract the conditional information. Due to their abstract nature, the generated images exhibit greater diversity and aesthetic appeal.
  2. However, they are highly sensitive to the distribution of the conditional image. For instance, in tasks involving generating ancient graffiti and ink paintings (PDF Fig. R1), most condition extractors fail to extract the correct conditional information, leading to uncontrollable generation results. Conversely, when accurate conditional information is available, such as high-quality depth information in ink painting, the generation results surpass those of SCG and the Canny operator. For further details, please refer to GR4.1.

Q5: What kind of GPU and how much RAM are required to run the code?

R5: We use an A100 GPU, and it requires about 11 GB of RAM per image.

Q6: More detail in the caption of Figure 2, especially for part C of the figure.

R6: Revised caption of Figure 2: "Feature visualization of the modular autoencoder. Each panel shows all features learned by an individual model with multiple modules (one module per row). We trained the modular autoencoder with a translation-rotation equivariance constraint on a) MNIST and b) ImageNet, respectively. c) On ImageNet, we also train an autoencoder with an additional translation equivariance constraint besides the translation-rotation equivariance constraint. d) We visualize the images reconstructed from the features of each module in c."

Q7: More details on M^{(I)}(δ), the distribution of all δ, Eq. (6), and Eq. (8).

R7:

  1. M^{(i)} is a codebook of learnable prediction matrices that can be indexed by δ. M^{(i)} is a 3D codebook, with the three dimensions corresponding to the three parameters of δ: two translation parameters t_x, t_y, and one rotation parameter θ (see PDF Fig. R4).
  2. The range of t_x and t_y is [-0.5s, 0.5s), where s is the convolution stride. The range of θ is [0, 2π). Each parameter or dimension is sampled at n (i.e., n = 24) equally spaced values, which is also the range of the sum in Eq. (6). The prediction matrix M^{(i)}(δ) is obtained from the codebook through linear interpolation (a small indexing sketch follows this list).
  3. Ideally, C in Eq. (8) reconstructs the complete representation from the partial module z^{(i)} and its index i. In practice, the training process of C aims to achieve this goal, for example, using a diffusion-model-based conditional generation process. Refer to GR2 for more.
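For illustration, one way such a δ-indexed codebook lookup could be implemented is sketched below. The grid size n = 24 and the parameter ranges come from points 1) and 2) above; the feature dimension d, the storage layout, and the interpolation details are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch of a δ-indexed codebook of prediction matrices (assumed layout).
import math
import torch

n, d = 24, 16                                                # grid points per δ-parameter; module feature dim (assumed)
codebook = torch.nn.Parameter(torch.randn(n, n, n, d, d))    # one d x d prediction matrix per (t_x, t_y, θ) grid point

def prediction_matrix(tx, ty, theta, stride):
    """Interpolate M(δ) from the codebook, one δ-dimension at a time (trilinear)."""
    coords = [
        (tx / stride + 0.5) * n,          # t_x in [-0.5s, 0.5s) mapped to [0, n)
        (ty / stride + 0.5) * n,          # t_y in [-0.5s, 0.5s) mapped to [0, n)
        theta / (2 * math.pi) * n,        # θ in [0, 2π) mapped to [0, n)
    ]
    m = codebook
    for c in coords:                      # interpolate along the current first axis, then move on
        lo = int(math.floor(c)) % n
        hi = (lo + 1) % n                 # wrap around: offsets are periodic modulo the stride, θ modulo 2π
        w = c - math.floor(c)
        m = (1 - w) * m[lo] + w * m[hi]   # shape shrinks (n,n,n,d,d) -> (n,n,d,d) -> (n,d,d) -> (d,d)
    return m                              # the d x d prediction matrix M(δ)
```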

Q8: Some typos and imprecise terms.

R8: Thank you for your careful review and pointing out these issues. We will fix them.

Comment

Thank you for these improvements. I increased my score from 4 to 5.

Comment

Thank you very much for your thorough and responsible review, and we also appreciate your increasing score. Your suggestions are very helpful in improving this work.

Review (Rating: 7)

This paper proposes a self-supervised controllable generation (SCG) framework with two training stages. The first stage exploits an equivariance constraint to learn a modular autoencoder, aiming to extract different visual patterns from input images; each extracted (learned) visual pattern can be regarded as a kind of image condition. The second stage performs standard ControlNet training with these self-supervised extracted image conditions. After self-supervised training, SCG surprisingly shows some zero-shot capabilities and excellent generalization on various tasks such as associative generation of paintings, sketches, and ancient graffiti. The authors compare SCG with ControlNet in terms of SSIM/PSNR/CLIP-score and achieve competitive results. However, the authors lack further exploration of the self-supervised trained ControlNet. Overall, I like the idea and method very much, but I still have some doubts about the validity of the experimental results and the fairness of the comparison.

Strengths

  1. I like the idea that exploring self-supervised training on controllable generation models. Compared with existing methods, the self-supervised training framework proposed in this paper does show better generalization ability to a certain extent.
  2. Using the equivariant constraint to learn a modular autoencoder to extract different visual patterns is a simple, elegant and effective approach.
  3. Although self-supervised training has not seen conditions such as painting, sketches, and ancient graffiti, it still shows good zero-shot generation capabilities, which verifies the rationality of training modular autoencoder with equivariant constraint.
  4. Compared to the previous excellent work ControlNet, SCG has a higher or similar winning rate in fidelity and a significantly higher winning rate in aesthetics.

Weaknesses

  1. The MS-COCO dataset is used for both the self-supervised training phase and the evaluation phase, which may lead to a serious unfair comparison because the original ControlNet is not trained on MS-COCO.
  2. Although this paper explores the self-supervised training of ControlNet and its zero-shot generalization ability, no further experiments are conducted, such as whether using it as pre-trained weights for further fine-tuning on labeled tasks such as depth, segmentation, etc. will bring further improvements of controllability? The zero-shot capability alone cannot fully demonstrate the effectiveness and necessity of the proposed self-supervised training.
  3. Based on the above weakness, it is necessary to conduct ablation experiments on the number of modules (from modular vae) and the types of Equivariant Constraints.
  4. In addition to the painting, sketches, and ancient graffiti mentioned in the paper, does SCG also have certain zero-shot capabilities under some other conditions such as Edge/LineArt/Segmentation?

Questions

Please refer to the weaknesses.

Limitations

NA

Author Response

We are grateful for your appreciation of our idea and are truly encouraged by it.

Q1: The evaluation may lead to a seriously unfair comparison because the original ControlNet is not trained on MS-COCO.

R1: The ControlNet we used as a baseline is trained from scratch with the same settings as the proposed SCG. We apologize for the inconvenience and confusion caused to readers, including reviewers, by placing the introduction of some of our training settings, including the comparison baseline ControlNet, in Appendix A.4.2 of the main text. We will move this information to the experimental section of the main text in the future.

Q2: Could further experiments be conducted, such as whether using it as pre-trained weights for further fine-tuning on labeled tasks such as depth, segmentation, etc. would bring further improvements in controllability? The zero-shot capability alone cannot fully demonstrate the effectiveness and necessity of the proposed self-supervised training.

R2:This is an incredibly insightful suggestion, opening up new avenues for our research. Previously, we focused on fully unsupervised methods for modular, independent, and complementary feature disentanglement, exploring the emergent capabilities. Your suggestion has broadened our perspective. While fully self-supervised training can exhibit impressive emergent capabilities and demonstrate strong generalization across data distributions and task variations, it still falls short compared to supervised methods in specific tasks. Leveraging our self-supervised model as a pre-trained model and then fine-tuning it on labeled data for specific downstream tasks, such as sketch-conditioned generation, super-resolution, or dehazing, can significantly improve performance on those tasks. We believe this approach will enable both competitive capabilities on specific tasks and the ability to generalize to out-of-distribution data and tasks. Moreover, this fully unsupervised pre-training process has the potential to replicate the scaling law observed in large language models and other domains, establishing a foundation model for conditional generation. We will incorporate this into our future work section. Refer to GR1.1 for more.

Q3: Ablation experiments on the number of modules (from the modular VAE) and the types of equivariant constraints.

R3: We apologize for our focus on demonstrating the zero-shot generalization capability of the conditional generation model and neglecting this aspect. We have included an ablation study in PDF Fig. R3.

  1. As visualized in Fig. R3, when the number of modules or the number of convolutional kernels within each module varies, the modular autoencoder network can still reliably produce relatively independent and complementary functional specializations.
  2. When translational motion is removed, modular features still emerge, but they lose their orientation selectivity, exhibiting a center-surround receptive field similar to a difference of Gaussians. When rotational motion is removed, features within each module exhibit the same orientation selectivity.
  3. When the symmetry constraint is removed, modular features are still generally produced, but some modules may contain multiple unrelated (or asymmetric) sub-features. As shown in Appendix Fig. S1 a and d of the main text, when the equivariant constraint is removed, the model cannot generate relatively independent and complementary functional modules.

We will add these ablation experiments to the appendix. Refer to GR2.2 for more.

Q4: Does SCG also have certain zero-shot capabilities under some other conditions such as Edge/LineArt/Segmentation?

R4: Yes.

  1. We quickly tested SCG on other zero-shot tasks, including generation conditioned on LineArt (see PDF Fig. R2), as well as super-resolution (see PDF Fig. R2) and dehazing (see PDF Fig. R2) (refer to GR4.2 for more).
  2. Furthermore, we can also manipulate the saturation and contrast of the generated image by changing the coefficients of HC0 and HC1; see PDF Fig. R5 (refer to GR4.3 for more).

This demonstrates the advantage of self-supervision over supervised learning, allowing the emergence of broader capabilities rather than just specific ones. It also indicates the effectiveness of the features learned by our modular autoencoder.

Comment

Thanks for clarifying that the motivation is to propose and validate a novel self-supervised pipeline rather than a model that achieves broad generalization. The authors' rebuttal resolves most of my questions, so I tend to maintain my existing score.

Comment

Thank you very much for your time and insightful feedback on our manuscript. We appreciate your thoughtful comments and suggestions, which will help us significantly improve the quality of our work.

Review (Rating: 7)

Presents a self-supervised approach for learning multiple distinct representations from images through a loss that leads to specialized modules, each learning different aspects of the data without manual design. Demonstrates that those modules may be useful for the conditional generation of images through a controllable diffusion model.

Strengths

  • Originality: the main contribution of the work is the novel loss function and the insight (perhaps inspired by neuroscience) that modular networks are favourable for learning in self-supervised settings. The idea that a module may develop functional specialization from imposing equivariance to some property of the stimuli is also a nice extension of previous supervised approaches. I think the other half of the work, where the representation of a single module is used for controllable generation, is a natural extension of previous work. Previous works on both manually designed specialized blocks and on group equivariance are properly cited.
  • Quality: the presented experimental results are fine (but as always with image generation, visual judgement is difficult), so the quantitative evaluation (Table 1) and subjective evaluation (Figure 4 and some panels in Figure 7) are appreciated.
  • Clarity: the submission is clearly written and well organized, using neuroscience inspiration to improve the intuition on how to make progress on self-supervised learning of rich representations. The “pattern completion” jargon, on the other hand, does not contribute to the reader’s understanding; it would have been better to adhere to “conditional, controllable generation,” which actually describes what was done.
  • Significance: the suggested equivariance and symmetry loss components are easily usable for developing a functional specialization in other projects and may impact seemingly unrelated problems.

Weaknesses

  • Quality: the evaluation is limited only to ControlNet, which is probably a decent baseline, but there are other approaches for image generation that can be compared with (e.g., it’s fine to show a manual-designed solution outperforming the current self-supervised model or compare with other self-supervised approaches).
  • Significance: it would be great to see if the suggested loss components can be utilized in a setting other than controllable generation where their contribution can be better quantified, e.g., in lowering MSE in a reconstruction or prediction task.

Questions

  1. What is the respective contribution of the symmetry loss vs the equivariance loss? What does performance look like with only one of them? What is the problem solved by introducing each of them?
  2. What are the equivariance properties of HC1 and HC3 (i.e., their functional specializations), and what makes them useful for controllable generation?

Limitations

There is only a limited discussion of the approach's limitations and the limits of functional specialization (summarized as “only low-level feature specialization”); it would have been nice to see a discussion of what useful features are NOT learned by this approach and why.

Author Response

Thank you very much for your affirmation of our originality!

Suggestion: Better to use "controllable generation" rather than "pattern completion".

Response: Thank you for your suggestion. "Controllable generation" is indeed a more accurate and understandable term in the field of AI, as it clearly reflects the concept of generating content with specific controls. We will clarify this by explaining that "pattern completion" corresponds to "controllable generation" when we introduce the concept in our paper. From that point forward, we will consistently use "controllable generation" throughout the paper.

Q1: Quality: the evaluation is limited.

R1: We added comparisons to other conditions besides Canny, including depth maps, normal directions, and semantic segmentation in PDF Fig. R1, with each condition extracted by a pre-trained model. These condition extraction networks are highly sensitive to data distributions. For instance, all models fail to extract condition information for the ancient graffiti, resulting in low-fidelity generated images. When suitable features can be extracted, more abstract control conditions can achieve better generation results, such as ControlNet based on Canny and depth. Here are two possible reasons for this: 1) our SCG was only trained on COCO, not a larger dataset; 2) SCG training did not use supervised data, such as sketch-image pairs. Addressing these two issues could significantly improve SCG's ability on specific tasks. Our goal in this work is more focused on demonstrating that SCG can spontaneously develop conditional generation capabilities on a wide range of tasks. Refer to GR4.1 for additional information.

Q2: Significance: suggestion to test on a reconstruction task other than generation.

R2: Thanks for the good suggestion. We quickly conducted simple tests on image super-resolution and dehazing tasks (see PDF Fig. R2) and found that the proposed SCG exhibits both super-resolution and dehazing capabilities. As SCG is fully self-supervised, its performance is hard to compare with supervised, dedicated networks. However, it has the potential to be used as a self-supervised pre-trained model and then fine-tuned on specific tasks, thereby enhancing the network's ability on specific tasks while maintaining high generalization capabilities. Refer to GR4.2 and GR1.1 for additional information.

Q3: More analysis on the symmetry loss vs. the equivariance loss.

R3: The equivariance loss is central to the modular autoencoder architecture, playing a crucial role in promoting independence between modules while simultaneously strengthening feature correlations within each module. The symmetry loss further enhances pairwise correlations (or symmetry) between features within a module, effectively suppressing the emergence of multiple unrelated sub-features within the same module. Removing the equivariance loss prevents the network from learning brain-like Gabor-shaped features and hinders the emergence of complementary functionally specialized modules (see Appendix Fig. S1 a and d in the main text). Removing the symmetry loss still allows the autoencoder to develop complementary functional specializations overall, but the correlations (or symmetry) within individual modules are reduced, which can lead to multiple unrelated features appearing within some modules (see PDF Fig. R3). Refer to GR2 for more.
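For readers who want a concrete anchor, one plausible way to write the equivariance constraint described above is sketched below. This is an illustrative reconstruction from the description in this thread (feature maps of a transformed image should match a learned linear prediction from the original features), not necessarily the paper's exact Eq. (6); T_δ denotes applying the transform δ to the input image, and the symmetry term is not sketched here because its exact form is not given in this thread.

```latex
% Illustrative reconstruction of the equivariance constraint (not the paper's exact equation):
\mathcal{L}_{\mathrm{equ}}
  \;=\; \sum_{i} \sum_{\delta}
  \bigl\| \, z^{(i)}(T_{\delta} I) \;-\; M^{(i)}(\delta)\, z^{(i)}(I) \, \bigr\|_{2}^{2}
```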

Q4: What are the equivariance properties of HC1 and HC3, and what makes them useful for controllable generation?

R4:

  1. It is noteworthy that all modules within the same network are under the same type of equivariant constraint (i.e., translation and rotation). The distinct functions that emerge from these different modules are entirely self-emergent and not a result of imposing distinct constraints on each module.
  2. The modular autoencoder, under the equivariance constraint, disentangles and decomposes the input into modular and complementary features. For example, HC0 represents color features, HC1 represents brightness features, HC2 and HC3 represent edge features, and HC4 and HC5 represent higher-frequency edge features. When we retain the brightness features (HC1) and drop the other features, the generation model uses the brightness features as clues to complete the other missing information, such as color; conditional generation based on HC1 can therefore be used for tasks like colorizing ink paintings or other re-coloring tasks. Similarly, when we retain the edge features (e.g., HC3) and discard the others, the generation module uses the edge information to generate color, brightness, and other information, so it often performs better in more abstract conditional generation tasks, such as sketches and ancient graffiti. HC0, which represents color information, is insensitive to fog due to fog's white color, so we can utilize HC0 for dehazing (a minimal module-masking sketch follows this list).
  3. Different conditional generation tasks can be viewed as scenarios where certain aspects of an image are missing or corrupted. The goal is to use the remaining information as clues to complete or generate the missing information. The modularity of the autoencoder, resulting in relatively independent (at a local scale) and complementary features, allows for zero-shot generalization across a wide range of tasks, despite not being specifically trained for any particular task. Refer to GR3 for more.
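As an illustration of the "keep some modules as clues, drop the rest" idea from points 2) and 3) above, here is a minimal, hypothetical sketch. The module-to-task mapping follows the response; `encode_modules` is a placeholder interface, and dropping a module is implemented here as zeroing it out, which is an assumption rather than the authors' exact procedure.

```python
# Hypothetical sketch of zero-shot task selection via module masking (interfaces are placeholders).
import torch

TASK_CLUES = {
    "recolor": ["HC1"],   # brightness only -> generator completes color (e.g. colorizing ink paintings)
    "sketch":  ["HC3"],   # edges only -> generator completes color, brightness, texture
    "dehaze":  ["HC0"],   # color only -> clue that is insensitive to (white) fog
}

def condition_for_task(image, encode_modules, task):
    """Zero out every module except the ones kept as clues for the given task."""
    with torch.no_grad():
        modules = encode_modules(image)      # assumed: dict of module name -> feature map
    keep = set(TASK_CLUES[task])
    return {name: (z if name in keep else torch.zeros_like(z)) for name, z in modules.items()}

# Example: cond = condition_for_task(img, encode_modules, "sketch"); out = generate(cond)
```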

Q5: Limitations: what useful features are NOT learned by this approach and why?

R5: While more features, such as depth (parallax), motion (e.g., optical flow), and semantically oriented contours and instance segmentation, remain unlearned, they hold significant potential to further enhance controllability. Our current version of the autoencoder is limited to learning static, local features, which prevents it from disentangling dynamic, volumetric, and semantically related modular features. We will discuss this in more detail in the Limitations section.

Comment

I thank the authors for their clarifications and am satisfied with the improvements. I have adjusted my score accordingly.

Comment

Thank you for your efforts and insightful feedback. We have benefited greatly from your comments and are truly encouraged by your positive opinion.

Author Response (General)

We thank the reviewers for their efforts and constructive comments on our manuscript, which are very valuable and enlightening to us. We have revised our manuscript according to the reviewers' concerns. We would like to make an overall response before the specific responses to each reviewer.

GR1: Significance of SCG:

Our original motivation is to propose and validate a novel self-supervised pipeline rather than a model to achieve broad generalizations.

  1. We propose SCG and experimentally demonstrate that it can spontaneously develop (or zero-shot generalize to) various abilities, including super-resolution, dehazing, saturation and contrast manipulation, as well as conditional generation based on diverse styles such as oil paintings, ink paintings, ancient graffiti, sketches, and LineArt. Furthermore, SCG possesses two significant potentials: 1) leveraging its self-supervision, SCG can further scale up its data and models to benefit from the scaling law, enhancing its basic capabilities; 2) subsequently, SCG can be fine-tuned for specific tasks, leading to improved performance on particular tasks. These potentials suggest that SCG could become a foundation model for controllable generation.
  2. This framework comprises two components: a modular autoencoder and a conditional generator. Given the extensive research on conditional generation, we leverage the existing, mature ControlNet for this aspect. Our core contribution lies in designing a modular autoencoder based on proposed equivariance constraint, successfully enabling the network to spontaneously develop relatively independent and highly complementary modular features. These features are crucial for subsequent conditional generation.

GR2: Clarification of the Modular Autoencoder:

  1. We agree with all reviewers' suggestions and will add a more detailed framework diagram of the modular autoencoder in Section 3, including a more intuitive depiction of the prediction matrix M^{(i)}(δ), the equivariant loss L_equ, and the symmetry loss L_sym. See PDF Fig. R4 for details. The equivariance loss L_equ is the core of the equivariant constraint, primarily serving to increase independence between modules and correlation (or symmetry) within modules. The symmetry loss L_sym further enhances the correlation (or symmetry) of features within modules and suppresses the emergence of multiple unrelated features within the same module.
  2. When the symmetry loss L_sym is removed (see PDF Fig. R3), the autoencoder still works overall, but some modules exhibit unrelated sub-features within them. When we change the number of modules, the autoencoder still reliably forms feature differentiation (see PDF Fig. R3). When the translation transform of the data is removed, the learned features lose their orientation selectivity. When the rotation transform is removed, each module can only learn features with the same orientation (see PDF Fig. R3). We will explain the purpose of each loss more clearly in Section 3 and add the above ablation experiments to the appendix.

GR3: Pattern Completion:

  1. Pattern completion is a classic neuroscience concept. SCG aims to mimic this process through "mask and predict" at the component level instead of the widely used spatial level. We agree with the reviewers that our introduction of this concept is not clear enough. We plan to provide more details about the concept and its relationship with SCG.
  2. The pattern completion process learns the global relationships between different modules, enabling it to complete or generate missing information from clues provided by other modules. For example, HC1 is the brightness module, so conditional generation based on HC1 can be used for tasks like colorizing ink paintings or other re-coloring tasks. Since our modular autoencoder learns a set of complementary and relatively complete modules, most conditional generation or reconstruction tasks can be considered scenarios where the information in one or more modules is missing or damaged. This is also the source of SCG's zero-shot generalization capabilities.

GR4: More Comparisons and More Tasks:

  1. We added comparisons to other conditions besides Canny, including depth maps, normal directions, and semantic segmentation in PDF Fig. R1. These methods require supervised pre-trained feature extractors, which are sensitive to the data distribution. All condition extractors fail to extract reasonable conditional information on ancient graffiti, resulting in uncontrolled generated images, and their performance is inferior to our SCG. In the case of ink painting conditional generation, the depth feature extractor and the Canny operator extracted relatively good features, yielding better generation results than our SCG. However, the other feature extractors failed to extract reasonable features, and the corresponding generation results were not as good as SCG's.
  2. We also tested more tasks, as shown in PDF Fig. R2, and found that SCG, in addition to spontaneously developing conditional generation capabilities for oil paintings, ink paintings, ancient graffiti, sketches, etc., also zero-shot generalizes to super-resolution, dehazing, and controllable generation under LineArt and sketch conditions.
  3. As shown in PDF Fig. R5, we also discovered that by changing the coefficient of HC0, we can easily manipulate the saturation of the generated image, and by changing the coefficient of HC1, we can manipulate the contrast of the generated image.
  4. On specific tasks, SCG’s generative ability still falls short of supervised, specialized models. There might be two reasons: 1) SCG is trained only on COCO, while specialized models are often trained on larger and high-quality datasets. 2) No labeled or supervised data (e.g., scribble and image pairs) has been incorporated for training.

In summary, despite being trained solely on COCO through self-supervised learning, SCG has demonstrated a wide range of capabilities. These extended experiments will be included in the appendix.

Final Decision

This study proposed a self-supervised controllable generation (SCG) framework inspired by cortical modularization and hippocampal pattern completion. Importantly, the model exhibits several brain-like properties, as well as good performance and generalization abilities. It seems that all reviewers provided positive ratings after sufficient interaction with the authors. I am inclined to accept this paper.