PaperHub
Score: 6.0/10, Rejected (4 reviewers; lowest 3, highest 8, standard deviation 2.1)
Individual ratings: 8, 5, 8, 3
Confidence: 4.3; Correctness: 2.5; Contribution: 2.5; Presentation: 3.3
ICLR 2025

Training the Untrainable: Introducing Inductive Bias via Representational Alignment

Submitted: 2024-09-20 · Updated: 2025-02-05
TL;DR

We design a method to make untrainable networks trainable using representational alignment.

Abstract

Keywords
Representational alignment, neural network optimization

Reviews and Discussion

Review
Rating: 8

The authors propose an approach for aligning representations across two different deep learning architectures. One architecture serves as the "teacher" (providing guidance) to the "student" (target network). For example, one architecture could be a convolutional network whereas the other could be a transformer.

Specifically, the authors propose training the target network with an additional loss term penalizing the dissimilarity between the representations of the target network and those of a fixed guide network. The authors measure the layer-wise dissimilarity using the complement of linear CKA similarity.
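To make the mechanics concrete, here is a minimal sketch of such a loss as described above; the function names and PyTorch details are illustrative, not taken from the authors' code:

```python
import torch

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) activations for the same n inputs.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    num = (X.T @ Y).norm(p="fro") ** 2
    den = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return num / den

def guided_loss(task_loss, target_acts, guide_acts):
    # Add the layer-wise CKA dissimilarity (1 - CKA) to the task loss;
    # the frozen guide's activations carry no gradient.
    dissim = sum(1 - linear_cka(t, g.detach())
                 for t, g in zip(target_acts, guide_acts))
    return task_loss + dissim
```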

The authors study four tasks (across vision, language, and arithmetic) to illustrate how their approach can improve performance by guiding an ill-suited architecture for the task with a more suitable architecture for the task using their alignment approach.

Strengths

  • The proposed idea is intuitively presented and I believe could serve as a useful tool for studying network initialization schemes as well as scientifically probing the role of architectural inductive biases.
  • I appreciate the authors investigating how to bridge our understanding of the role of architecture by providing a principled mechanism for transferring (some) of the architectural biases of one network architecture to another.
  • Empirically, the authors selected a diverse set of tasks with reasonable baselines and choices of target versus guide networks. I appreciate how thoroughly the authors disclose training details, hyperparameters, and the intent of the experiments. I also appreciate the error analysis conducted across architectures and the error bounds for experimental results.
  • Authors draw a distinction between typical distillation approaches and the proposed guidance approach that I find clear and well-motivated.

Weaknesses

  • The field has mostly converged on the transformer architecture for most tasks. While this doesn't make the idea presented less interesting scientifically, I'd encourage the authors to keep this in mind when motivating the approach, as practitioners today are not frequently choosing among many candidate architectures. I find initialization and the scientific value of understanding the role of architectural bias still valuable and would encourage the authors to emphasize these angles instead.
  • The gains for ImageNet are still quite small overall, suggesting we perhaps can't simply transfer over inductive biases, or that much more tuning is needed. Do the authors have any comments on this point? At a minimum, I'd recommend the authors acknowledge and discuss this in the results section.
  • Inconsistent gains across trainable versus untrainable networks, without clear intuition as to why this might be happening. I appreciate the authors' attempt at clarifying this in the paragraph starting on line 373, but I still feel this section is missing crucial intuition to help readers make sense of the empirical findings across the tables. For example, why would an untrained ResNet-18 provide better guidance than a trained ResNet-18 to a Deep FCN, as shown in Table 2? This is puzzling to me, and I am curious whether the authors have additional experiments or intuition to better explain this in the context of the rest of the results.
  • The authors make several claims about the revolutionary potential of the method that are not well supported with evidence. While I'm also excited about the proposed method and potential future work that builds on it, I'd suggest the authors tone down what currently read as speculations in this draft. For example, lines 85-86 claim the proposed approach "expands the space of viable networks," a claim I believe is not well supported by this work. Similar claims about the future promise of guidance are made on lines 108-112 and are not well supported by the experiments in the work.
  • Section 4 could be improved. You emphasize a distinction between untrainable architectures and untrainable tasks, defining untrainable architectures as target networks that are difficult to train irrespective of task. In the experiments that follow, however, the focus is very much on whether a given target architecture is ill-suited for a specific task.

Questions

  • It's not crucial to the reception of the paper, but I'm curious if the authors explored the evolution of the CKA similarity across layers comparing early versus later layers throughout guidance akin to https://proceedings.neurips.cc/paper_files/paper/2014/file/375c71349b295fbe2dcdca9206f20a06-Paper.pdf
  • I'm also curious whether the authors explored any other similarity metrics besides linear CKA.
  • Perhaps I missed it, but where are the experiments with guidance between architectures with a different number of layers described on lines 230-235?
  • Did the authors consider weighing the two loss terms proposed in equation 1?
  • Not necessarily a weakness, but I'm curious whether the authors explored transferring equivariant architectures for tasks with known symmetries.
Comment

Q1: I'm curious if the authors explored the evolution of the CKA similarity across layers, comparing early versus later layers throughout guidance, akin to https://proceedings.neurips.cc/paper_files/paper/2014/file/375c71349b295fbe2dcdca9206f20a06-Paper.pdf

This is a great question and something we have included in appendix subsection F.1 of the paper. We find that CKA optimizes more quickly in early layers than in later layers.

Q2: I'm also curious whether the authors explored any other similarity metrics besides linear CKA.

Yes, we have! We include RSA [1] and ridge regression results in appendix section L. We find that RSA has similar performance to CKA when using randomly initialized guide networks, but improves dramatically when using trained guide networks. More excitingly, ridge regression leads to further improvements. This finding is intuitive; RSA and ridge regression have more degrees of freedom than CKA, due to fewer invariances, which means that fitting can improve with less strict similarity functions. See [2] for an overview of other similarity metrics. There are many, and we can incorporate any of them as long as they are differentiable!

Metric | ImageNet Validation Accuracy
Trained ResNet-18; CKA | 7.50
Trained ResNet-18; RSA | 11.02
Trained ResNet-18; Ridge | 9.46
Untrained ResNet-18; CKA | 13.50
Untrained ResNet-18; RSA | 11.74
Untrained ResNet-18; Ridge | 15.69
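For reference, one plausible reading of the ridge metric above as a differentiable dissimilarity; the regularization strength and scoring choice are our assumptions, not details from the paper:

```python
import torch

def ridge_dissimilarity(X, Y, lam=1.0):
    # Fit a ridge map from target activations X (n, d1) to guide
    # activations Y (n, d2); score by unexplained variance (1 - R^2).
    d = X.shape[1]
    eye = torch.eye(d, dtype=X.dtype, device=X.device)
    W = torch.linalg.solve(X.T @ X + lam * eye, X.T @ Y)
    resid = ((Y - X @ W) ** 2).sum()
    total = ((Y - Y.mean(dim=0)) ** 2).sum()
    return resid / total
```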

Q3: Perhaps I missed it, but where are the experiments with guidance between architectures with a different number of layers described on lines 230-235?

Most architectures we tested had a different number of layers. For instance, the example in Figure 1 has a different number of layers between the two networks. The only setting that had the same number of layers between the guide network and target network was the Deep ConvNet guided by ResNet-50. The Deep ConvNet was the equivalent of a ResNet-50 with residual connections removed.

Q4: Did the authors consider weighing the two loss terms proposed in equation 1?

We did, but eventually concluded that an equal weighting of both loss terms was sufficient. There were some design choices to be made. For example, we could have introduced a tunable parameter on the representational dissimilarity term. When making this parameter learnable, we found that some networks drive it to 0 to eliminate the contribution of the similarity term. When setting the parameter manually, we found that a constant factor of 1 worked well for initial experiments with image classification and sequence modeling. Larger values seemed to make the task loss saturate or stall, while smaller values reduced the improvement from representational alignment. In practice, this could be exposed as a tunable hyperparameter.
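A sketch of the weighted variant described here; `lam` is the hypothetical tunable weight (the authors fix it to 1):

```python
def weighted_guided_loss(task_loss, dissim, lam=1.0):
    # Equation-1-style objective with an explicit weight on the
    # representational dissimilarity term. A fixed lam = 1.0 is what
    # the authors report using; a learnable lam tends to be driven
    # to 0, silencing the guidance signal.
    return task_loss + lam * dissim
```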

Q5: Not necessarily a weakness, but I'm curious whether the authors explored transferring equivariant architectures for tasks with known symmetries.

We haven't, but this is one of many architectures on our list for future work. It would be amazing to use equivariant architectures to directly check for the inductive biases. Thank you for this suggestion.

[1] Kriegeskorte et al. Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience (2008).

[2] Klabunde et al. Similarity of neural network models: A survey of functional and representational measures. arXiv, 2023.

Comment

Thank you for the additional comparisons to other similarity metrics and for the clarifying responses. My concerns regarding the ubiquity of transformers today potentially limiting the impact of this investigation, and the marginal gains on ImageNet, remain. While I do agree with reviewer rthw that the work would benefit from more concrete insights based on the proposed approach, I believe the proposed approach is overall of scientific interest for studying architectural inductive biases. Based on this view, and taking into account the revised draft and the authors' comments, I maintain my recommendation to accept the paper, leaving my score at an 8.

Comment

We thank the reviewer for their constructive and thoughtful review. We are glad the reviewer found the paper intuitively presented, found guidance to be a useful tool for understanding neural network initialization and inductive biases, and appreciated the different baselines and settings. We address specific points and questions here.

W1: field has mostly converged on the transformer architecture for most tasks

Thank you! This is a great point. We agree that the presentation of our work should further emphasize the potential for understanding inductive biases, and we should highlight this more in the paper. We had some of these points in the conclusion but didn't emphasize them elsewhere. In general, we would also like to mention that we haven't given up on making these architectures viable! There are many ways to improve our optimization, and we plan to dedicate future work to them. RNNs are a potentially exciting avenue to explore given their avoidance of quadratic complexity. With additional scale and evaluation, we hope to find that these architectures are competitive, and we think our results point to a potential breakthrough. However, we realize that this is not necessarily in scope for the paper and not entirely justified by the current results. We will modify the conclusion and do a more thorough editing pass through the paper to emphasize the utility of our work for understanding inductive biases and finding new initializations.

W2: The gains for ImageNet are still quite small overall

This is a fair point, and we realize that the gains we report over FCNs are nowhere close to yielding useful image classifiers. The goal of our paper was to overcome a difficult property of training these architectures. And, as mentioned, this has great implications for understanding inductive biases in neural networks as well as finding new initialization schemes.

There are many potential takeaways our results point to. First, it could be the case that it is not possible to pass along the entire inductive bias, or that CKA is not the correct function for doing so. Indeed, we have found that a metric with more degrees of freedom, like ridge regression, is useful for improving the results of guidance. We may find that only certain mathematical aspects of our guide network representations are useful for supervising representations of the target network. This could lead to better-designed similarity functions that are simpler and pass only relevant features from the guide network to the target network.

Of course, another conclusion is that guidance will likely need additional tuning. Better representational similarity functions are one thing, but we can likely also better design our fully connected network to balance depth and width. It is likely that architectural design will still matter. Varying the hidden dimension and using better optimization techniques, like a warmup scheduler, in tandem with guidance may lead to better results. In this paper, we intentionally chose networks with extreme failures.

Having overcome problems with training these architectures, we argue that many of the tricks that made other networks improve can be applied here. We look forward to spending time making these networks useful. In the meantime, we will be sure to rewrite the paper to focus more on guidance as a tool to understand inductive bias.

W3: Inconsistent gains across trainable versus untrainable without clear intuition as to why this might be happening.

We hope to provide some intuitions here. We believe guidance makes a distinction between architectural and knowledge-based priors. Architectural priors could refer to locality or translational equivariance, while training priors could refer to internal regularization or sparsity.

The conclusion from guiding a Deep FCN with a ResNet-18 seems to be that architectural priors are more useful for overcoming overfitting and obtaining a better image classifier. One immediate explanation is that the randomly initialized guide network has a representation space that is easier to align with. Training a network may overwhelm the architectural priors or introduce further changes into the representation space that are difficult to replicate, reducing the transfer of architectural information from the ResNet to the Deep FCN. This is emphasized by the use of CKA, which is a weak notion of similarity.

This also explains why the trained guide network still leads to improvements: the architectural prior is still present. But the space of the trained guide network is noisier and contains richer features than what is present in the randomly initialized architecture.

This is an exciting finding for making distinctions between trained and untrained networks as well as architectural priors and trained priors.

Comment

Thank you for taking the time to review our work. Since the discussion period is coming to a close, we just wanted to reach out to make sure our response and additional experiments adequately addressed your concerns. Please let us know if you have any additional concerns; we would be happy to clarify!

Thank you, Authors.

Review
Rating: 5

This work presents a simple method for training neural networks that are typically considered untrainable. The approach, similar to the teacher-student methodology, involves training a "target" network with the assistance of a "guide" network, which can be either trained or untrained and may have a completely different architecture from the target. The objective is to align the target network's internal representations as closely as possible with those of the guide in terms of CKA similarity.

Strengths

  • The paper studies an interesting problem, trying to make untrainable networks trainable
  • The authors analyzed multiple scenarios with different tasks and architectures
  • The authors provide supplementary materials with demo notebooks

Weaknesses

Writing and Presentation: The paper could benefit from improved clarity and readability. Some suggestions to consider:

  • The paper includes expressions that may introduce ambiguity or complexity (e.g. lines 30, 42), potentially making it harder for readers to fully grasp the authors' points. Rephrasing these expressions with clearer language would better convey the challenges discussed and enhance accessibility.

  • The tables are somewhat difficult to interpret, and it’s unclear if certain entries serve as baselines. For example, Table 2 could be restructured to distinguish between guide and target, rather than using the "experiment" column, and highlight when training is performed without teacher guidance. Making tables and feature descriptions more self-contained and better explained would enhance clarity.

  • The Method section could be more clearly organized and written. It is challenging to follow, and restructuring or rephrasing certain parts would likely improve reader comprehension.

  • In the Related Work section, the "representation similarity" literature is missing (e.g. [2], [4-8]). Adding this section could support several claims, such as the one on line 145.

Novelty: The paper’s novelty is limited, as it overlooks relevant related work that addresses similar challenges. For example:

  • A similar work exists, such as [1], where authors leveraged relative representations [2] in a teacher-student framework, training the student network to mirror the teacher network’s latent representations.

  • The authors’ claim in the conclusion regarding the absence of known methods for improved network initialization is not entirely accurate. For instance, [3] proposed a method for initializing smaller models by selecting subsets of weights from a larger, pretrained model, thereby transferring knowledge from the larger model to smaller architectures.

  • Additionally, the experiments lack a comparison with the classic teacher-student setting, which would provide a useful benchmark.

Contribution: While the results show some improvement, they also highlight ongoing challenges in making these networks fully trainable. For example:

  • In Table 2, the results show an accuracy improvement from 7.5 to 13.10. However, no standard deviation is provided, which limits the interpretation of these results. Moreover, an accuracy of 13.10 is not competitive on this dataset for the image classification task.

  • The choice of the networks raises questions, as they do not represent state-of-the-art (SOTA) architectures. It would be interesting to explore whether using a pre-trained network to guide another competitive network could yield further improvements.


[1] Ramos, Patrick, Raphael Alampay, and Patricia Abu. "Knowledge Distillation with Relative Representations for Image Representation Learning." International Conference on Computer Recognition Systems. Cham: Springer Nature Switzerland, 2023.
[2] Moschella, Luca, et al. "Relative representations enable zero-shot latent space communication." ICLR 2022.
[3] Xu, Zhiqiu, et al. "Initializing models with larger ones." ICLR 2023.
[4] Huh, Minyoung, et al. "The platonic representation hypothesis." ICML 2024.
[5] Nguyen, Thao, Maithra Raghu, and Simon Kornblith. "Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth." arXiv preprint arXiv:2010.15327, 2020.
[6] Tang, Shengkun, Yaqing Wang, et al. "You need multiple exiting: Dynamic early exiting for accelerating unified vision language model." CVPR 2023.
[7] Lahner, Zorah, and Michael Moeller. "On the direct alignment of latent spaces." UniReps 2023.
[8] Maiorca, Valentino, Luca Moschella, et al. "Latent space translation via semantic alignment." NeurIPS 2023.

Questions

  • Are the trained guide networks pretrained models that are then fine-tuned during the guidance process?
  • Are the untrained guide networks trained in parallel to the target network, or are they frozen during training? If they are not frozen, what is the advantage of training both networks simultaneously rather than focusing solely on training the guide network?
  • In Figure 1, when mapping different layers of the guide network to the same layer of the target, it may be important to consider the similarity between these different guide layers.
  • How might the results differ if an alternative metric is used to calculate the similarity?
  • What impact would guiding only the final layers of the target network have on the results?
  • In Table 3, the RNN achieves 100% accuracy and is used as a guide network. What does this imply? Given that the work aims to train untrainable networks using well-performing networks, why use a network that appears to be overfitting as a guide for training the target network?
Comment

Q3: How might the results differ if an alternative metric is used to calculate the similarity?

This is a great question! We include an experiment where we guide with RSA and ridge regression. We see larger improvements in performance when optimizing with RSA and ridge regression, likely due to the added degrees of freedom [3] associated with these similarity metrics. We refer to Appendix Section L.

Metric | ImageNet Validation Accuracy
Trained ResNet-18; CKA | 7.50
Trained ResNet-18; RSA | 11.02
Trained ResNet-18; Ridge | 9.46
Untrained ResNet-18; CKA | 13.50
Untrained ResNet-18; RSA | 11.74
Untrained ResNet-18; Ridge | 15.69

Q4: What impact would guiding only the final layers of the target network have on the results?

We include an experiment where we provide guidance only on the final layer of the target network, along with other ablations, in Appendix Section M. In general, the impact depends on the experiment. For example, we found that guiding using only the final layer hurt Deep FCN performance, but it could be useful for RNN performance on copy-paste. This establishes that guidance has more utility than just fixing the credit assignment problem of gradients in very deep networks, i.e., there is more to guidance than making it easier to propagate gradients back through the network.

Q5: In Table 3, the RNN achieves 100% accuracy and is used as a guide network. What does this imply? Given that the work aims to train untrainable networks using well-performing networks, why use a network that appears to be overfitting as a guide for training the target network?

Apologies, we are confused by the comment. The 100% accuracy is evaluated on a held-out test set. The work we cite also achieves 100% accuracy on the same parity task. Could the reviewer clarify how this result indicates overfitting?

[1] Kriegeskorte et al. Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience (2008).

[2] Hinton et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.

[3] Klabunde et al. Similarity of neural network models: A survey of functional and representational measures. arXiv, 2023.

Comment

Thank you for taking the time to review our work. Since the discussion period is coming to a close, we just wanted to reach out to make sure our response adequately addressed your concerns. Please let us know if you have any additional concerns; we would be happy to clarify! If the answer sufficiently addresses your concerns, we would appreciate it if you could adjust your score.

Thank you, Authors.

Comment

I sincerely appreciate the authors' response and the additional clarifications provided. While I understand that the aim of this work is to address training challenges in networks with known limitations, I still find it challenging to grasp the broader practical implications of the work. Specifically, the overall performance of these networks remains a concern. I value the clarifications shared, and I have adjusted my score to reflect this.

Comment

Thank you for taking the time to read our response! We really appreciate that you've engaged with us. To address the reviewer's concerns about performance: first, note that our goal was not performance; it was to escape the bad regime that many architectures are stuck in, which makes them untrainable and leaves them completely uncompetitive with state-of-the-art networks like Transformers. That being said, we do demonstrate cases where our performance is objectively very good.

  1. Our results show that plain vanilla RNNs, with guidance, are competitive with GPT-2 small when they have similar sizes. It is widely believed that the RNN architecture must be changed. GPT-2 was an inflection point in LLMs and Transformers, where many people became convinced that they generalize in a novel way: "Language Models are Unsupervised Multitask Learners", Radford et al. 2019. In many ways this launched the current wave of research and resulted in considerable investment and tens of thousands of citations. Showing that RNNs could always have kept up with this, had we known how to train them, demonstrates the real-world performance the reviewer is looking for. In addition, it shows that RNNs can scale, something that many current ML results say isn't feasible. Our result is not an upper bound; it is a lower bound that demonstrates real-world scaling for RNNs. This seminal paper from 2019 could have been written about RNNs as they existed in the 1960s and 1970s, if only we had known how to train them properly! Guidance is one such way, and now that we have initial networks and the desired final network, future work can reverse engineer how to do this even without guidance. That is a significant and unexpected real-world performance result.
  2. An added result is that RNNs can improve Transformers. A near-10% improvement to the performance of a Transformer is major, given that it requires no change to the architecture, the optimizer, or the number of parameters. In addition, to the best of our knowledge, no one has observed before that an RNN can teach a Transformer anything. Likely, any task on which an RNN performs well can be used to enhance a Transformer. This again represents a real state-of-the-art performance improvement.

We hope that by showing that RNN performance is competitive with GPT-2 at similar scales and that RNNs can meaningfully improve Transformers, we have assuaged the reviewer's concerns about performance. While not our goal, our results do achieve significant and improved performance.

Comment

We thank the reviewer for their constructive review. We are glad the reviewer found that the paper tries to solve an interesting problem and found the use of multiple scenarios to be positive.

W1: A similar work exists, such as [1], where authors leveraged relative representations [2] in a teacher-student framework,

Thank you for bringing this to our attention. We will incorporate it into our related work. However, we want to draw important distinctions between this work and ours. The referenced work uses a distillation process that aligns relative representations between the student and teacher networks.

First, it's important to note that the setting is different, as is true of any distillation paper: we apply guidance across much larger, deeper target networks. Moreover, we would like to emphasize that our setting uses different architectures for the guide and target, which is not covered in the cited paper.

However, there are further reasons why guidance may have broader applicability than the cited paper. While the cited paper refers to relative representations, we note that this idea is highly similar to an approach called Representational Similarity Analysis (RSA) [1] in neuroscience. RSA compares sets of representations as follows. Given two sets of representations, $R_1 \in \mathbb{R}^{n \times d_1}$ and $R_2 \in \mathbb{R}^{n \times d_2}$, we first find the pairwise distance between every pair of inputs for both sets of representations, i.e., we compute the representational dissimilarity matrix (RDM). This leads to two matrices, $\mathrm{RDM}_1 \in \mathbb{R}^{n \times n}$ and $\mathrm{RDM}_2 \in \mathbb{R}^{n \times n}$. This is very similar to the idea of relative representations presented in your cited work, where the only difference is that relative representations use anchor points.
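A minimal sketch of this RDM comparison, assuming Euclidean distances and Pearson correlation of the RDM upper triangles (choices the text does not pin down):

```python
import torch

def rsa_similarity(R1, R2):
    # R1: (n, d1), R2: (n, d2) representations of the same n inputs.
    rdm1 = torch.cdist(R1, R1)  # (n, n) pairwise-distance matrices
    rdm2 = torch.cdist(R2, R2)
    # Correlate the upper triangles of the two RDMs (differentiable).
    iu = torch.triu_indices(R1.shape[0], R1.shape[0], offset=1)
    v1, v2 = rdm1[iu[0], iu[1]], rdm2[iu[0], iu[1]]
    v1, v2 = v1 - v1.mean(), v2 - v2.mean()
    return (v1 @ v2) / (v1.norm() * v2.norm())
```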

First, as requested by other reviewers, we can confirm that guidance improves results with RSA (see Appendix Section L). Given that we apply RSA at several layers, we provide a stronger signal to the target network, which almost guarantees that guidance will lead to a stronger result. Finally, this demonstrates the power of guidance over the cited work. Guidance can be applied with any similarity metric as long as it is differentiable. This allows us to answer many questions that weren't answerable before. For example, since we apply guidance over multiple activations, we can ask about similarities across layers. We can also modify the similarity function as we see fit to control what information is sent from the guide to the target. Guidance is more general and more powerful.

W3: However, no standard deviation is provided, which limits the interpretation of these results. Moreover, an accuracy of 13.10 is not competitive on this dataset for the image classification task.

We are confused by this comment because we have provided error bars on all reported numbers in all tables and on all loss curves. Perhaps we are missing what the reviewer means in this case.

With respect to the comment on low accuracy scores, the goal of the paper was to overcome training difficulties in networks that have known failures. After overcoming these failures, there are many modifications we can make to achieve competitive performance: for the FCN, for example, we can reduce the depth of the network, add a learning rate scheduler, use a larger guide network, etc. Making the networks competitive and achieving a strong result here is a paper in its own right. This paper introduces a technique, guidance, which has strong implications for alternative architectures and for understanding inductive biases in neural networks.

Q1: Are the trained guide networks fine-tuned during the guidance process? Are untrained guide networks trained in parallel to the target network, or are they frozen during training?

The guide network is frozen, regardless of whether it is trained or randomly initialized. We do not update the parameters of the guide network.

Q2: In Figure 1, when mapping different layers of the guide network to the same layer of the target, it may be important to consider the similarity between these different guide layers.

Our apologies; we realize that Figure 1 may be visually confusing and could be read as mapping several different layers of the guide network to the same layer of the target network. In actuality, the networks are unrolled (see the x50 and x12, which indicate 50 FCN blocks and 12 ResNet blocks in the image). The figure is indicating a 1-1 mapping.

Review
Rating: 8

This paper proposes a novel distillation technique called guiding, which distils the inductive priors of the 'guide' network into the target network. The paper shows that by using a network known to be strong at the task as the guide, a target network known to be weak at that task can be significantly improved in terms of performance. The paper then proposes that the guide network can be disconnected from the process at a very early stage, so that it simply acts as an initialisation technique for the network.

Strengths

  • Paper is very well written and even enjoyable to read.
  • Technique is novel and proposes a new direction in which the field of distillation can go.
  • Technique could revitalise 'dead' neural architectures which failed to take off due to weak performance.
  • Experimentation on multiple modalities.

Weaknesses

  • I would have liked to see results on smaller datasets. I feel that this could have resulted in a higher number of ablations to determine different configurations of the network e.g. should we be connecting multiple layers to a single layer in the target? Should guide network have the same number of layers as the target? Do all layers in the target need connections to the guide?
  • I would have liked to see more in depth discussion of the ViT to CNN/MLP experiments which are hinted at in figure 3.

Questions

I am interested in discussion on both of the weaknesses that I have raised. Overall I am positive on this paper, but I feel that if the above weaknesses were addressed, it would improve the paper.

Comment

We thank the reviewer for their constructive review. We are glad the reviewer found the paper well written and enjoyable to read, the technique novel, the potential to revitalize dead neural architectures, and appreciated the experimentation on multiple modalities. We address the concrete points below.

W1: Smaller datasets. Higher number of ablations

We agree! We include experiments with a smaller dataset, CIFAR-10, using the Deep FCN target network and ResNet-18 guide network. We also include ablation experiments with the copy-paste task where we guide an RNN with a transformer. Our goal is to quickly test (1) how the number of layers used in guidance affects results and (2) whether guiding one layer with multiple guide layers is generally useful. In the case of an RNN guided by a transformer, the RNN representations at each layer received multiple guide layers as supervision, including the layer normalization representations and linear layers. We show results in Appendix Section M.

The answer to (1) for deep networks is that guiding earlier layers leads to better results. With CIFAR-10, we find that guiding the first 5 layers leads to a 20% improvement in accuracy, whereas guiding only later layers decreases guided network performance. Still, guiding at least one layer leads to a dramatic improvement in performance. With RNNs, we found that guiding later layers is more important, but overall, guiding all layers was useful.

The answer to (2) is more complex. When guiding a Deep FCN with a randomly initialized ResNet-18 guide, the general finding is that our simplest layer mapping is best: applying guidance with multiple guide network layers as supervision to a single target network layer hurts performance. In the case of the transformer-guided RNN, we found that all layers of supervision were useful, and removing the multiple layers of supervision hurt copy-paste performance. The largest drop came from removing the layer normalization representations when guiding an RNN layer.

W2: see more in depth discussion of the ViT to CNN/MLP experiments which are hinted at in figure 3.

Our apologies; we should have included more discussion. Similar to guidance with a ResNet-18, we applied guidance with a ViT-B-16, using a similar mapping of guide network layers to target network layers. In this case, that included internal layers of the transformer architecture, i.e., multi-head attention layers, layer-norm layers, and linear layers. We achieved an accuracy of 11.33% with a randomly initialized ViT-B guide network, which establishes that guide network size doesn't necessarily correspond with outcome.

In general, we only use the ViT to guide the Deep FCN in creating Figure 3, with the goal of analyzing error consistency between the guided networks. For error consistency, we measured the overlap in model predictions when guided by different networks. We find that this reconstructed the patterns found

Comment

I appreciate the authors' response and have raised my score accordingly.

Comment

Thank you for taking the time to review our work. Since the discussion period is coming to a close, we just wanted to reach out to make sure our response and additional experiments adequately addressed your concerns. Please let us know if you have any additional concerns; we would be happy to clarify! If the answer sufficiently addresses your concerns, we would appreciate it if you could adjust your score.

Thank you, Authors.

Review
Rating: 3

This paper proposes guidance, a method where a well-performing guide network directs the layer-wise representations of a target network, transferring inductive biases without modifying the architecture. The technique aims to improve training for architectures traditionally prone to issues like overfitting or underfitting, such as fully connected networks and plain CNNs. Initial results suggest that guidance can improve performance in various settings, but further validation across tasks and more rigorous testing would clarify its robustness and broader applicability.

Strengths

  • The observations are interesting and to some degree novel.
  • A wide range of tasks from several domains are tested empirically.
  • The paper is well written and methods are explained comprehensively.

Weaknesses

  1. The main observation of the paper is that the performance of suboptimal architectures can be improved if their intermediate representations are matched to those of a more optimal guide model. This is an interesting observation but somewhat a natural outcome of what we already know.

    Functionally, representation matching to a guide network allows deeper layers of the target network to receive useful gradients for learning that they normally do not receive from the task loss. We know that in deep networks w/o skip connections, auxiliary loss functions applied on intermediate layers help with learning [1]. These losses were proposed to mediate the very same issue examined here, that deep CNNs suffer from vanishing gradients. Relatedly, models with skip connections like ResNets can also be viewed as ensembles of shallow networks, as skip connections allow the gradient to pass through the deep layers [2], suggesting that one thing skip connections do is allow many parallel pathways inside the model for letting gradients flow throughout. I don't see how what is proposed here is fundamentally different from what was shown in the prior literature. One way to make this claim more grounded is to show that representation matching outdoes auxiliary losses applied on earlier layers. In general, distillation is a critical missing baseline in all the results.

    [1] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. 2015.

    [2] Veit, Andreas, Michael J. Wilber, and Serge Belongie. "Residual networks behave like ensembles of relatively shallow networks." Advances in neural information processing systems 29 (2016).

  2. A second primary observation in the paper is that even when the representation is matched to that of an untrained guide network, it leads to substantial improvements in the performance of the target network. While this is an interesting observation, especially in cases where this scenario surpasses matching the trained network, the portrayal of the results is inaccurate. The main issue lies in considering the untrained model as incapable of performing the task better than chance. In Table 2, the accuracy of the untrained models is reported as being consistently around chance, which is expected if none of the parameters are trained. However, in the context of this paper, which uses network representations as guides, the informative measure would be the accuracy of a linear classifier trained on the penultimate-layer features. This value is typically much higher than chance, and I expect it to be so here as well, which attests to the usefulness of the features at their initial state. Apart from this, more should be done to at least attempt a plausible explanation of this phenomenon. For example, what are the geometrical differences that result from matching to the untrained and trained networks? Presumably, since the trained model is better able to distinguish between ImageNet classes, matching its representations, if successful, should have been more useful for the target network. But somehow that's not true.

  3. The observed higher utility of the untrained guide network becomes more interesting especially when considering the fact that the matching loss term is included throughout training. One possible explanation could be that the untrained network geometry is simpler for the target network to replicate, while the fully trained model could have a much more clustered representation that is too difficult for the target to learn from. This could be tested by using guide networks at various stages of training and tracking performance when matching to each, then examining any relationship between geometrical properties of the latent space and performance of the target network.

  4. Figure 4 results are interesting in showing that even limited training on the matching loss is still helpful. Would the target network reach its full accuracy in Table 1 if training were done in two phases, where the first phase fully trains the network on the untrained guide representations (weight initialization) and phase 2 does the task training? I think this experiment is important for supporting the claim about finding good initial weights in these models using the representation-matching idea.

  5. In parts, the writing is very verbose. 6.5 pages of the paper are dedicated to the intro, background, and a methods section that contains only modest novelty. Sections 1 and 2 should be written much more concisely to make room for additional experiments. Some of the details of the experiments could also be moved to the appendix.

  6. There are a few papers that are conceptually very similar to the ideas proposed in this paper but were not cited:

https://arxiv.org/abs/1412.6550

https://arxiv.org/abs/1808.01405

  7. The last paragraph of the introduction talks about the limitations; it would fit better at the end of the paper (e.g. in the discussion)

  8. Line 116: "To that end, we also did not optimize networks to convergence." This sounds like a serious limitation that could be avoided. If the networks are not trained optimally, some of the conclusions may turn out to be incorrect, e.g. the observed differences between matching the trained and untrained models.

  9. Line 257: "We describe the task settings": remove

Overall, this paper showcases very interesting observations but falls short of providing concrete insights. The paper in its current form reads as premature: the methods are not novel, and the observations are interesting, but no real insight is offered. This study has much potential that is not yet met, and so I don't think it is ready for publication.

Questions

  1. Figure 3 shows the pattern of error consistency; is this any different from what would be obtained from distillation?

  2. Related to fig. 3, are the intermediate representations of these two guide models also similar? A similar plot could be made for the intermediate representational similarities, which would be very informative.

Comment

W7: To that end, we also did not optimize networks to convergence

This limitation aims to address the concern that guaranteeing convergence in neural networks is generally difficult, and that the results in this paper are based on very simple optimization strategies that can be improved upon with better optimizers or regularizers. It is also clear that the loss has not saturated in some of the loss curves we plotted (see the copy-paste result with an untrained transformer guide in Figure 2). We believe this caveat is fair but not concerning for the interpretation of the results comparing untrained and trained guide networks.

First, our paper aims to analyze inductive biases. The performance difference between untrained and trained guides is interesting, but the most important result is that architectural priors are sufficient for guiding the target network. When the untrained guide network achieves significant improvements, the implication is significant: architectural priors are meaningful. Additionally, the training time we provide is substantial and matches other papers. The ResNet paper [6] trains networks for 100 epochs, and other sequence modeling works use 100 epochs as well; these papers make claims from this standard training time. There are certain settings where the results are guaranteed to hold, such as copy-paste with RNNs, where the difference in performance is evident. Overall, we believe this caveat is justified but will not change the story and interpretation of the paper.

Q1: error consistency, is this any different from what would be obtained from distillation?

We reran this baseline with a basic distillation approach [3] and show results in Appendix Section I.1. We find that error consistency shows similar trends but has a much smaller effect size. Guidance results in error consistency values between guided networks that are comparable with the error consistency between guides. Distillation cuts this value in half, resulting in an error consistency of 0.26 between distilled networks.
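For concreteness, a simplified notion of error overlap is sketched below; the paper may use a different formal definition of error consistency (e.g., a chance-corrected one), so treat this as an assumption:

```python
import torch

def error_overlap(preds_a, preds_b, labels):
    # Of the inputs that either model misclassifies, return the fraction
    # that both misclassify: a simple overlap-of-errors measure.
    err_a, err_b = preds_a != labels, preds_b != labels
    both = (err_a & err_b).float().sum()
    either = (err_a | err_b).float().sum()
    return both / either.clamp(min=1)
```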

Q2: Related to fig. 3, are the intermediate representations of these two guide models also similar?

Great question! We measure the similarity of the internal representations of the two networks using CKA. We make a line plot where the x-axis is the layer index and the y-axis is the CKA similarity, computed on a set of 1000 ImageNet images, between a layer of ResNet-18 and a layer of ViT-B. We find that the CKA between ResNet-18 and ViT-B is lower at later layers, while the initial layers are quite similar. We include this in Appendix Section G with a longer discussion.
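A sketch of how such a layer-wise curve could be computed, reusing `linear_cka` as defined in the earlier sketch; `collect_activations` is a hypothetical hook-based helper, not an API from the paper:

```python
import torch

@torch.no_grad()
def layerwise_cka(model_a, model_b, probe_batch):
    # One point per mapped layer pair: x-axis is the layer index,
    # y-axis the CKA on a fixed probe batch of images.
    acts_a = model_a.collect_activations(probe_batch)  # hypothetical
    acts_b = model_b.collect_activations(probe_batch)  # hypothetical
    return [linear_cka(a.flatten(1), b.flatten(1)).item()
            for a, b in zip(acts_a, acts_b)]
```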

[1] Li and Papyan. Residual Alignment: Uncovering Mechanisms of Residual Networks. NeurIPS 2023.

[2] He, Liu, and Tao. Why Residuals Work? Residuals Generalize. arXiv, 2019.

[3] Hinton et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.

[4] Hu et al. Low Rank Simplicity Bias in Neural Networks. TMLR, 2021.

[5] Fan et al. Intrinsic dimension estimation of data via principal component analysis. arXiv, 2010.

[6] He et al. Deep Residual Learning for Image Recognition. CVPR 2016.

Comment

Thank you for taking the time to review our work. Since the discussion period is coming to a close, we just wanted to reach out to make sure our response and additional experiments adequately addressed your concerns. Please let us know if you have any additional concerns; we would be happy to clarify! If the answer sufficiently addresses your concerns, we would appreciate it if you could adjust your score.

Thank you, Authors.

Comment

W4: For example, what are the geometrical differences as a result of matching to the untrained and trained networks?

We agree that we could have provided more intuitive explanations for the improvements seen with randomly initialized guide networks. The explanation the reviewer provides is a reasonable interpretation: for certain networks, the representation space of the untrained guide could be easier for the target network to match, particularly when the architectures are strikingly different. Another way of presenting this finding is that guidance draws a distinction between learned priors and architectural priors in preventing undesirable training behaviors in target networks. For example, it seems that architectural aspects of CNNs are useful for preventing overfitting in the Deep FCN. Similarly, memory incorporation improves in an RNN when guiding with a randomly initialized transformer, as seen in improved copy-paste performance. These architectural priors don't disappear when the guide network is trained, but they become more difficult to replicate because the target network has to replicate both learned and architectural priors, assuming the learned prior isn't as useful for overcoming the gap in performance. It's likely that some aspect of the ResNet is useful for preventing overfitting; this could be sparsity, the distribution of the eigenvalues, etc. We believe this is an exciting finding that shows whether certain training properties are prevented by inductive biases in the architecture or by inductive biases gained through optimization and knowledge! We plan to dedicate further work to understanding these distinctions.

We can indeed verify that replicating a randomly initialized guide network's activation space is easier, based on Figures 6 and 7 in Appendix Section F, for all results where the randomly initialized guide performs better than the trained guide: the CKA dissimilarity optimizes more quickly or starts out smaller. However, this still isn't necessarily the whole story. The Deep ConvNet matches an untrained guide network more quickly, but the trained guide network leads to better results. To further verify this, we can consider results with another similarity metric, ridge regression, as reported in Figure 16. Ridge regression provides a linear mapping from one representation space to another, and we find that ridge regression results with an untrained guide network lead to a much lower dissimilarity loss. Of course, questions remain about a deeper explanation of the geometric differences between networks guided by a trained or untrained guide. We analyze both the effective rank [4] and intrinsic dimensionality (ID) via PCA [5], and show results in Appendix J. In general, it seems that these geometric differences don't fully explain guidance: guidance with an untrained network yields similar effective rank and ID to a network trained with no guidance.

W5: Multiple guides through training

We apologize, but the current description is a bit unclear to us regarding how we would guide with different networks through training. Do you mean applying guidance with trained and randomly initialized guides at different stages?

W6: Would the target network reach its full accuracy in Table 1 if training were done in two phases, where the first phase fully trains the network on the untrained guide representations (weight initialization) and phase 2 does the task training?

This is a great point! We added this baseline after the paper submission. The main issue is that a network initialization should depend on neither the task nor the data. To satisfy this, we modify guidance to better support initialization as follows. We first maximize the CKA between the layers of the target FCN and a randomly initialized ResNet-18 guide for 300 training steps. Critically, this is done with noise, so we don't use real images; the noise consists of samples from an independent Gaussian with mean 0 and standard deviation 1. Afterwards, we take the target FCN and train it on ImageNet without any guidance. We show results in Appendix Section K and find that overfitting is still prevented. This better supports our point on initialization. We plan to update Figure 4 to reflect this finding.
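A sketch of this two-phase, data-free initialization under stated assumptions (the optimizer, learning rate, input shape, and the `collect_activations` hook helper are all illustrative, and `linear_cka` is as defined in the earlier sketch):

```python
import torch

def guidance_init(target, guide, steps=300, batch=64, shape=(3, 224, 224)):
    # Phase 1: align the target's layers to a frozen, randomly
    # initialized guide using pure N(0, 1) noise inputs, no real images.
    opt = torch.optim.Adam(target.parameters(), lr=1e-4)
    for _ in range(steps):
        x = torch.randn(batch, *shape)
        with torch.no_grad():
            g_acts = guide.collect_activations(x)
        t_acts = target.collect_activations(x)
        loss = sum(1 - linear_cka(t.flatten(1), g.flatten(1))
                   for t, g in zip(t_acts, g_acts))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return target  # phase 2: ordinary task training without guidance
```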

Comment

We thank the reviewer for their constructive review. We are glad the reviewer found the observations in the paper interesting and novel and generally found the paper well written. We address the key points here by breaking down some of the weaknesses.

W1: interesting observation but somewhat a natural outcome of what we already know… Suggesting that what skip connections in one way do is to allow many parallel pathways inside the model for letting the gradients flow throughout. I don't see how what is proposed here is fundamentally different from what was shown in the prior literature.

We agree that some of the findings in our paper align well with findings in prior work; we believe that strengthens our work, since we now have additional experimentation verifying prior findings. However, we disagree that all our findings are explained by targeting earlier layers in the network or improving gradient flow. This could be a reasonable explanation of the Deep FCN and Deep ConvNet results, but it would not apply to the Shallow FCN or to the transformer guided on the parity task, which are not deep enough to have gradient flow issues. RNN results also cannot necessarily be attributed to gradient flow: copy-paste and language modeling have limited performance on RNNs due to problems with incorporating memory, and fixing gradient flow may prevent vanishing/exploding gradients but won't make an RNN better at incorporating memory. Guidance is more general than the approaches the reviewer references.

More importantly, guidance is useful for testing a wide variety of hypotheses about how we overcome barriers to trainability. As we reference in the paper, a unified theory of skip connections is still unclear [1, 2] despite the work you reference. There could be other components of the convolutional architecture that prevent overfitting in a Deep FCN. And we find the distinction between the success of a trained guide and a randomly initialized guide striking for the Deep ConvNet result.

W2: distillation is a critical missing baseline in all the results.

We include a distillation baseline using the technique in [3]. We include the accuracies here and a loss curve in Appendix Section I, Figure 8. The distillation baseline with a trained network is worse than guidance, reaching an accuracy of 3.45%, and the baseline with a randomly initialized ResNet-18 hurts performance, with an accuracy of 1.41%. Guidance significantly improves over distillation and works with randomly initialized guide networks.
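The distillation baseline in [3] optimizes a loss of the following form; the temperature and mixing weight below are conventional defaults, not necessarily the values used here:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hinton et al. [3]: cross-entropy on hard labels plus KL divergence
    # between temperature-softened student and teacher distributions.
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kl
```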

W3: in the context of this paper which uses network representations as guide, the informative measure would be the accuracy of a linear classifier trained on the penultimate layer features.

This is a fair point. It's very likely that there are linearly decodable features closer to the penultimate layers. We trained a linear decoder on the randomly initialized ResNets to test their object recognition performance. The linear decoder is trained on 4000 ImageNet images and tested on 1000 ImageNet images. In general, this linear decoder has above-chance performance on all layers of the model, although the accuracy is generally below 1%. We show these results in Appendix Section N. A question remains about which features present in the randomly initialized ResNet are improving results, and whether the features useful for linear decodability are accessible to CKA for improving the target network. These are interesting questions because they point to the potential for universal priors that are architecturally agnostic. We hope this highlights the potential for guidance to answer these questions and to open up many lines of future work.

Comment

Thank you for staying engaged as we approach the finish line; we're happy that we could address some of your concerns. Hopefully we can do more below!

I genuinely think the findings are potentially interesting but also sincerely believe that the paper needs a major revision and another round of review to be ready for publication, especially considering the volume of content added during the revision.

We want to note that these were experiments requested by the reviewer; experiments which agree with the paper. Would the reviewer be happy if for their own paper, a reviewer requested experiments and then used the fact that new experiments exist as the reason for rejection?

This makes it impossible to accept the paper under any conditions. Either we refuse to run experiments, in which case the reviewer rightly votes to reject because we didn't answer their questions, or we run experiments, in which case the reviewer votes to reject because we did run the experiments and answer their questions. This is particularly concerning in light of the fact that almost all of the experiments are either nulls (for example, distillation does not work, just as we claim) or strengthen the paper (such as the new representational similarity metrics, RSA and ridge regression).

Re W1: I agree with the authors that vanishing gradients alone cannot explain all of the presented results, in particular the shallow FCN experiment. However, I don't think the rebuttal has so far convinced me that it is not the case elsewhere. As I suggested in my initial review, one way to test and rule out this possibility in cases where it could be the cause is to add auxiliary losses similar to those used in the Szegedy et al. 2015 paper. The degree to which models trained with these additional losses improve could be an indicator of the relative importance of vanishing gradients as a cause.

We want to note that vanishing gradients don't explain many other results in the paper, such as the transfer between RNNs and Transformers, or the early-disconnect results for FCNs, where only a few updates are needed before the FCN is in a good state, the guide can be disconnected, and the FCN continues to train correctly. We will add distillation as a baseline everywhere; it does not work.

ICLR doesn’t allow reporting additional experiments in this last week. We are happy to add this experiment to the final version of the paper. But note that it cannot substantially change the story.

As mentioned in my original review, I think distillation should be considered as a baseline in all the experiments to help the reader better judge the expected boost between guidance and distillation.

We will do this. But note that distillation does not work for our problems; it is a near-null baseline.

The new figure 11 and table 5 that present the current distillation results are not referred to in the main text and currently all of the distillation experiments are only mentioned within the appendix. To be clear again, distillation should be considered as a baseline in all the experiments and be presented along with the results in the main text/figures/tables.

We will do so.

From the new appendix section I and figure 11, I can't tell whether the distillation baseline is trained properly. Including more details would be helpful there, for example, the learning rate and the number of epochs for which the model is trained. The x-axis of figure 11-left shows that the model is trained for 6000 steps, but the x-axis of the right plot shows up to about 100 epochs. The caption of Figure 11 does not help clarify things either. Why the discrepancy? Is this model trained for 6000 iterations only? If so, that does not sound like enough to me. Overall, I'm not confident that the current distillation baseline is done properly.

We followed the same procedure for distillation as for training our base networks and guidance experiments: 100 epochs of training with a learning rate of 1e-4 and a batch size of 256. For any image classification experiment, this follows He et al. 2016.

The x-axis of the left plot shows training steps, i.e., the number of optimization steps taken over the 100 epochs. We average the training loss every 80 training steps, which changes the x-axis limits; this averaging only makes the plot cleaner and does not change any result. The x-axis of the right plot shows training epochs, with the validation loss reported at each epoch. The total number of iterations the model is trained for is therefore 6250 * 80, or about 500,000, matching the procedure used for all other image classification experiments.
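
As a quick sanity check of these counts (assuming the standard ImageNet-1k training set of roughly 1.28M images, an assumption on our part):

```python
# Back-of-the-envelope check of the step counts quoted above, assuming
# the standard ImageNet-1k training set (~1,281,167 images).
images_per_epoch = 1_281_167
batch_size = 256
epochs = 100

steps_per_epoch = images_per_epoch // batch_size  # ~5,004 steps per epoch
total_steps = steps_per_epoch * epochs            # ~500,400 (~500k total)
plotted_points = total_steps // 80                # ~6,255 averaged points
print(total_steps, plotted_points)                # matches the ~6,250 * 80 figure
```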

We follow the same procedure for tuning the learning rate of the model as we did with our base and guided networks. This entire procedure has been standardized to ensure a fair comparison.

Comment

Re W3: The reported accuracies don't sound quite right to me considering what is already reported elsewhere. For example, [1] reported that fitting a linear decoder on randomly initialized RN18 features could reach ~12% accuracy on ImageNet. The issue is potentially in the way the additional experiment is carried out here, which only uses 4000 images for the fitting. In any case, the point to be made from these numbers is that the unit responses in the untrained model are far from useless for classifying objects. The submission repeatedly implies otherwise, and I don't agree with that.

From [1]: Section 3.8, page 10, “A linear classifier on the original input images (150K features) achieves a 3.4% test top-1 accuracy”

NB: To achieve the 12% accuracy the reviewer cites, [1] must sample 31,568 random networks, use their novel method to combine the features of those 31k randomly sampled networks, and then train the MLP. As [1] reports, performance for a single randomly initialized network is 3.4%.

If the reviewer agrees with neither us nor the citation they provide, we urge them to run this experiment. Both the literature, including [1], and our experiments agree with the point in our submission: the object recognition performance of randomly initialized models is very low.

Re W5: Sorry for having been vague about this. What I meant was to consider guide networks in between the untrained and fully trained networks, e.g., 20%, 40%, 60%, and 80% trained networks as guides.

Thank you for the clarification and the suggested experiment. We were not surprised by the results and will include them in the final manuscript. Since ICLR doesn't allow reporting new results in the last week, we cannot paste them here.

Additional comment: I noticed that many (all?) of the new appendix results/sections are not referred to in the main text. They should all be referred to in the main text.

Apologies, we will do so for the final version. We wanted to get something with all of the results into the hands of the reviewers as quickly as we could.

We hope this addresses the reviewer’s outstanding concerns. Thank you!

And we hope that addressing the reviewer’s concerns, both previously and now, has an impact on how the reviewer votes with respect to the paper.

Comment

To clarify, my role as a reviewer is to evaluate the submission as is, not to judge its potential or a hypothetical future version. The comments and suggestions I provided were intended to enhance the quality of your work. I've provided this feedback in good faith, having spent far beyond what I typically spend on any individual submission and what I consider a reasonable ask of any reviewer in terms of time commitment. Despite this, the authors chose to make the matter personal in their last response, painting a false picture that my vote to reject this paper is based on the new experiments added.

Here are the main reasons for maintaining my vote to reject:

  • Some of the experiments carried out during the rebuttal were done in a manner that does not boost confidence, instead giving me the impression that they were (understandably, given the limited time span of the rebuttal) performed in a rush and possibly without being completely vetted: e.g., the distillation baseline experiment that was only performed for one case and not others, and the plots and captions that were not completely coherent and were not properly incorporated into the text. Comments such as "We will add distillation as a baseline everywhere; it does not work" do not boost any confidence in trusting the correctness of the claims.

  • The changes are not appropriately incorporated into the submission, despite the authors having had the opportunity to do so. I've commented on the specifics in my previous responses.

  • Some of the comments were ignored altogether, e.g., 1) reworking the text to cut back on the verbose introduction, background, and routine methods to make space for experiments and/or more baselines, and 2) adding a baseline with auxiliary losses.

To be completely clear, I am not comfortable increasing my score to anything above 3. I think this paper should be rejected so it can be reworked without being rushed; the number of missing experiments and details warrants a complete reworking of the paper and resubmission for fresh review.

Comment

We thank the reviewers for their constructive reviews. We are glad that reviewers found the problem and observations in the paper interesting (rthw, vZFS), saw the potential for guidance to be useful for studying architecture (seP7, vZFS), and appreciated the multiple scenarios and experiments through which guidance was applied (Kh2H, seP7, vZFS). We address common feedback here, as well as in the individual responses.

  • Novelty and Positioning (rthw, Kh2H, seP7): We list a number of novel observations that have not been reported in the literature before. We will make these clear in our manuscript as well.
  1. For the first time, we transfer the architectural prior of one network architecture to another, independent of any data, rather than transferring one network's knowledge to another. An untrained (randomly initialized) CNN that has never seen an image and never learns because it is never updated, in a setting where no images are shown to either network, only randomly (uniformly) sampled matrices of the correct shape, can give its architectural prior to a network of a different architecture. That other network, which was otherwise poorly regularized and doomed to overfit, now does not.
  2. For the first time, this shows that fully connected networks can be systematically initialized in a novel way, guided by another network architecture, to make them avoid overfitting, with no changes to the architecture. We provide this initialization method, although we leave explaining it to future work.
  3. This casts doubt on one of the most common ideas of what architecture does in ML. It is not the fact that CNNs or Transformers add an architectural prior by way of convolutions, hierarchy, and/or attention that regularizes them and allows them to work on tasks like object recognition. This story is pervasive in the current ML literature, appearing in some of the most eminent publications, like LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning." Nature, 521(7553), 436-444. The consequence is the assumption that certain architectures are vastly superior to others; for example, that Transformers are, with minor exceptions, superior to RNNs. We demonstrate that this relationship isn't so clear-cut: RNNs can help Transformers and vice versa.
  4. We can change how a significant subset of ML is done. Usually we hope to find better initialization methods, regularization methods, or optimizers without any idea whether one exists, what it might look like, what a feasible trajectory looks like, or what the final trained network is. We provide networks with initializations that work, regularizations that work, and training trajectories and final variants that perform well, even though we do not yet understand why. We turn a stab in the dark into a systematic reverse-engineering effort.
  5. Guidance is a new form of probing. We can now, for the first time, systematically ask: what is the relationship between two architectures? Is what relates them the statistics of the eigenvalues of their activations? Sparsity? Receptive fields? Any such question can be trivially turned into a new similarity metric, plugged into guidance, and systematically tested. We can now systematically ask how the prior a CNN imposes compares to that of a Transformer. For example, we can repeat our Deep FCN results with a Transformer rather than a CNN, then ask how the two resulting networks differ from one another, both in their activation patterns and their behavior. There are countless publications claiming to reproduce Transformer results with CNNs and vice versa; we can help establish the relationship between the two. Some reviewers asked questions related to this point. For example, do FCNs guided by trained versus untrained guides differ in the effective rank of their representations? This is an interesting question, and it would provide both an easy explanation and an easy method to regularize FCNs. In Appendix J we show that, unfortunately, things are not so simple: effective rank and intrinsic dimensionality do not explain the results and are not effective.
  • Relationship to Distillation (rthw, Kh2H): Note that guidance is not distillation. In distillation you ideally want to completely reproduce the distilled network inside another network. That would defeat the purpose of the experiments we report, where an untrained guide, which is never updated at any point, helps another network learn. What can an untrained network with near-zero performance teach another network? With guidance, this question is not just meaningful but essential. Distillation ideally produces an exact copy of a network inside another, often smaller, network, generally of the same architecture. Guidance ideally copies over a limited aspect of one network to another, aligning the general statistics of a network's architectural prior and/or the knowledge it has. It does not directly ask that the representations be the same; we can control what the relationship between the networks should be through the choice of similarity metric (a minimal sketch of the CKA-based guidance loss appears after this list). Relatedly, two reviewers asked for distillation comparisons. We include the results of this experiment in Appendix I: distillation does not help when using randomly initialized networks and underperforms guidance when using a trained network.
  • Untrained guide network intuition (rthw, seP7): A few reviewers asked for intuition on why untrained guide networks can possibly improve results. How can a network that knows nothing and never learns anything improve another network? This underscores why guidance is not distillation; by the logic of distillation, this is nonsense. From the point of view of guidance, we are transferring the architectural prior of one network to another. For example, a CNN is organized hierarchically and has receptive fields; an FCN does not. Even an untrained CNN has receptive fields, observable in its activations even in response to randomized inputs, and these can in principle be transferred to the FCN. Guidance lets you pick the similarity metric to probe such intuitions systematically. For example, we could measure receptive fields and how hierarchical the activity of a network is, and provide that as the guidance function. If the CNN still regularizes the FCN, this demonstrates that hierarchy and receptive fields are sufficient to explain why FCNs fail; if it does not, this demonstrates that hierarchy is not the key component, and some other attribute of CNNs is. Our goal here was to show that such experiments are possible and to introduce the tooling to run them. In future work, we and others can systematically explore these kinds of explanations.
  • Initialization (rthw, Kh2H): We update the manuscript to improve the initialization results for FCNs. Now, an untrained CNN, which is never updated, guides an FCN on randomized images (uniform noise; we previously used real images). The two are then disconnected after 150 update steps. The resulting FCN is regularized and does not overfit when trained on ImageNet (see Appendix K). Nothing about image statistics or the knowledge of a CNN is regularizing the FCN, because neither exists; it is the mere architectural prior of the CNN that does so. Future work should be able to reverse engineer this method.
  • Other representation metrics (seP7, Kh2H): A few reviewers asked for additional metrics with guidance. We show two additional metrics, Representational Similarity Analysis (RSA) and ridge regression, in Appendix Section L. This also speaks to the geometry of the problem, as each metric makes different assumptions about the degrees of freedom allowed when matching representations. Each works, but to differing degrees; systematically understanding those differences is future work.
  • Paper Changes: We summarize the paper changes here; all changes are marked in blue.
    • Section 2: We have shortened the related work and added citations as requested by reviewers.
    • Section 4: We have added examples of untrainable architectures and untrainable tasks.
    • Section 6: We have added sentences to the conclusion emphasizing the success of guidance in studying inductive biases.
    • Appendix: We have added Appendix Sections F.1, G.1, I, J, K, L, M, and N to address prior comments.
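
To make the guidance procedure referenced throughout this discussion concrete, here is a minimal sketch of the CKA-based alignment loss. It assumes activations have already been flattened to (batch, features) matrices; the layer pairing, loss weighting, and the 150-step disconnect schedule are simplified relative to the full implementation.

```python
# Minimal sketch of the linear-CKA guidance loss used to align a target
# network's activations to a frozen guide's. X and Y are (batch, features)
# activation matrices; layer pairing and loss weighting are simplified.
import torch

def linear_cka(X, Y):
    # Center each feature column, then compute linear CKA
    # (Kornblith et al. 2019): ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    numerator = (Y.T @ X).norm() ** 2
    return numerator / ((X.T @ X).norm() * (Y.T @ Y).norm() + 1e-8)

def guidance_loss(target_acts, guide_acts):
    # Sum of CKA dissimilarities (1 - CKA) over paired layers.
    # guide_acts come from a frozen guide network and carry no gradient.
    return sum(1.0 - linear_cka(t, g.detach())
               for t, g in zip(target_acts, guide_acts))

# Training step (sketch): total = task_loss + lam * guidance_loss(...),
# where lam is a tuned weight. For the initialization experiments, the
# guidance term is simply dropped ("disconnected") after ~150 steps.
```

Swapping `linear_cka` for RSA or a ridge-regression fit yields the alternative metrics of Appendix Section L; the rest of the framework is unchanged.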
AC Meta-Review

(a) summary

This paper investigates how to use a guide network to train a target network that traditionally overfits (FCN) or underfits (CNN). It performs representational alignment between the guide and target DNNs by introducing additional loss terms. The initial results suggest that guidance can improve performance in various settings.

(b) strengths

  • The observation that the performance of a student model can be improved by matching its intermediate representations to a teacher model's is interesting and novel to some degree.
  • It investigates the role of architectural inductive bias, contributing to a better understanding of neural networks.
  • It conducts extensive experiments and analyses over a wide range of tasks.

(c) weaknesses

  • The proposed method has limited novelty due to its similarity to the teacher-student setting. A few papers are conceptually very similar to the ideas proposed in this paper but were not cited: https://arxiv.org/abs/1412.6550 and https://arxiv.org/abs/1808.01405.
  • It is not easy to verify the correctness of the claims due to some missing baselines (distillation and auxiliary losses tied to intermediate layers).
  • It lacks concrete insights into architectural inductive biases based on the proposed approach.
  • The presentation is not polished.

(d) decision

Although the paper has potential for better understanding the inductive biases of DNNs, it is not ready for publication in its current form due to its limited novelty, missing experiments, and unpolished presentation. Please keep the reviewers' comments in mind when preparing a future version of the manuscript.

Additional Comments from Reviewer Discussion

This paper received diverging reviews ranging from 3 to 8. The reviewers (rthw, vZFS) agree that the problem and observations in the paper are interesting, and that the proposed method has potential for studying the inductive biases of DNN architectures (seP7, vZFS). Common concerns (rthw, Kh2H, seP7) about limited novelty, missing experiments, and unpolished presentation remain after the rebuttal. Another round of revision and review is suggested to make the paper stronger.

Final Decision

Reject