PaperHub

Overall score: 7.0 / 10 (Poster · 3 reviewers · ratings 7, 7, 7 · min 7, max 7, std 0.0)
Confidence: 4.0 · Correctness: 3.3 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

TrAct: Making First-layer Pre-Activations Trainable

OpenReview · PDF
Submitted: 2024-04-23 · Updated: 2024-11-06
TL;DR

Making the training dynamics of the first layer of vision models similar to those of the embedding layer of language models.

Abstract

Keywords
computer vision, convolution, second-order, optimization

Reviews & Discussion

Review (Rating: 7)

This paper presents TrAct, a novel training strategy that modifies the optimization behavior of the first layer. It achieves faster convergence and better classification performance across different models. The effectiveness of TrAct is demonstrated across a range of 50 experimental setups on various benchmarks, underscoring the method's capability.

Strengths

  1. The technical elaboration of the proposed method is clear.
  2. I really like the motivation for the proposed method, which is straightforward.
  3. The evaluations conducted on the provided benchmarks provide evidence of the effectiveness of the proposed method. However, there are some concerns regarding the experimental results, which will be further discussed in the weaknesses section.

Weaknesses

  • What I am curious about is whether TrAct is also generally applicable to other visual tasks, such as detection, segmentation, pose estimation, etc. The authors could simply run some experiments on Faster R-CNN to verify the applicability of TrAct to dense visual prediction. If applicable, this would further enhance the impact of the proposed method.
  • Why is there a significant difference in the performance of TrAct between SGD and Adam? For example, in Figure 2, TrAct seems to converge faster with SGD, even surpassing the 800-epoch baseline within 100 epochs of training. With Adam, however, the performance improvement seems rather minor. Is there any theoretical explanation for this?

Questions

See above.

Limitations

The paper addresses this in the checklist.

Author Response

We would like to thank you for your time reviewing our paper and for helping us improve it.

Strengths:

  1. The technical elaboration of the proposed method is clear.
  2. I really like the motivation for the proposed method, which is straightforward.
  3. The evaluations conducted on the provided benchmarks provide evidence of the effectiveness of the proposed method. However, there are some concerns regarding the experimental results, which will be further discussed in the weaknesses section.

Thank you for appreciating the clarity of our technical elaboration, our straightforward motivation, and the effectiveness of our method.

We would appreciate it if you could let us know whether we have resolved and successfully answered all of your questions and concerns. If new questions or concerns come up, please let us know.

Weaknesses:

What I am curious about is whether TrAct is also generally applicable to other visual tasks, such as detection, segmentation, pose estimation, etc. The authors could simply run some experiments on Faster R-CNN to verify the applicability of TrAct to dense visual prediction. If applicable, this would further enhance the impact of the proposed method.

Thank you for this concrete suggestion, which really helps extend our paper. Accordingly, we trained Faster R-CNN models. However, we need to mention that Faster R-CNN uses a pretrained vision encoder in which the first layers are frozen by default. To enable a fair comparison, we unfroze these layers when training the object detection head. We trained the models on PASCAL VOC 2007 and have the following preliminary results with 2 seeds (measured in test mAP):

Seed    vanilla    TrAct
1       0.655      0.667
2       0.664      0.674

We can observe that both seeds with TrAct performed better than the best seed of the vanilla method, and the average improvement is 1.1%. We offer to extend these studies for the camera-ready.
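
For concreteness, the unfreezing setup can be sketched as follows; this assumes torchvision's Faster R-CNN implementation and is an illustration rather than our exact training code:

```python
import torchvision

# Sketch (assuming torchvision's Faster R-CNN): by default only the top
# backbone stages are trainable; trainable_backbone_layers=5 unfreezes all
# of them, including the first convolution, so that TrAct can take effect.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT",
    trainable_backbone_layers=5,
)
```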

We would like to point out that, while TrAct is especially designed for speeding up pretraining or training from scratch, i.e., when the first layer is actually learned, we were excited to find that it also helps when fine-tuning pretrained models. The limitation is that TrAct is only applicable when the first layer is actually trained.

Why is there a significant difference in the performance of TrAct between SGD and Adam? For example, in Figure 2, TrAct seems to converge faster with SGD, even surpassing the 800-epoch baseline within 100 epochs of training. With Adam, however, the performance improvement seems rather minor. Is there any theoretical explanation for this?

This is indeed an interesting question, and there is a theoretical explanation for it: the method is motivated by SGD training, as the update that TrAct is intended to perform is returned in the form of a gradient, and to execute this update exactly, the optimizer should (in theory) be SGD. That it still works very well even with Adam is great and illustrates the strong robustness and versatility of our method. Your comment inspired us to extend Figure 2 with an experiment in which we train everything except the first layer with Adam, and train the first layer with TrAct and the SGD optimizer. Here, we observed small improvements over optimizing the first layer with Adam, which are shown in the Author Rebuttal PDF. Finally, we want to mention that, while using SGD for the first layer in combination with TrAct can improve performance, it is often simpler to use the same optimizer for everything, which eases adoption. Nevertheless, for advanced users, this is a way to squeeze out a bit more performance. We will add a respective discussion to the camera-ready.
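
For readers who would like to try this variant, a minimal PyTorch sketch of such a split-optimizer setup could look as follows; the toy model, data, and hyperparameters are placeholders, and the TrAct gradient modification itself is assumed to be applied to the first layer separately:

```python
import torch
import torch.nn as nn

# Toy stand-in for a vision model; in practice, TrAct modifies the gradient
# of the first layer, which is then updated with SGD.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),  # first layer: updated with SGD (+ TrAct)
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 30 * 30, 10),      # remaining layers: updated with Adam
)
first_params = list(model[0].parameters())
first_ids = {id(p) for p in first_params}
rest_params = [p for p in model.parameters() if id(p) not in first_ids]

opt_first = torch.optim.SGD(first_params, lr=0.1, momentum=0.9)
opt_rest = torch.optim.Adam(rest_params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 32, 32)   # dummy batch of 32x32 RGB images
y = torch.randint(0, 10, (8,))  # dummy labels
opt_first.zero_grad()
opt_rest.zero_grad()
criterion(model(x), y).backward()
opt_first.step()  # SGD step on the first layer
opt_rest.step()   # Adam step on everything else
```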

We would appreciate it if you could let us know whether we have resolved and successfully answered all of your questions and concerns. If new questions or concerns come up, please let us know.

Comment

Dear Reviewer yoFY,

We wish to thank you very much for helping us improve the paper. Hopefully, you have had a chance to take a look at our rebuttal. Since the discussion period ends today, we would greatly appreciate it if you could respond to our rebuttal soon. This will give us an opportunity to address any further questions and comments that you may have before the end of the discussion period.

Comment

Thanks to the authors for their rebuttal. I appreciate the time taken to answer the various questions presented here. Most of my concerns have been addressed. I specifically appreciate the analysis of transferring to object detection tasks.

I will keep my original rating as accept.

Review (Rating: 7)

The paper introduces TrAct, a training strategy that modifies the optimization behaviour of the first layer. The proposed first-layer optimization enables slightly better results when training the model for a smaller number of epochs. TrAct is demonstrated across a wide range of image classification setups and is shown to be consistent.

Strengths

  1. TrAct enables faster convergence or can achieve slightly better performance for the same number of epochs.

  2. The paper demonstrates the applicability of TrAct in various scenarios covering a wide range of settings.

Weaknesses

  1. A small overhead (due to additional parameters) in computational time when training with TrAct.

Questions

  1. What happens when only the last layer of a pre-trained model is fine-tuned? This is a setup used in evaluating many self-supervised methods. Wouldn't the first-layer optimization using TrAct create a bias for a new data distribution?

Limitations

Not explicitly stated.

Author Response

We would like to thank you for your time reviewing our paper and for helping us improve it.

Strengths:

  1. TrAct enables faster convergence or can achieve slightly better performance for the same number of epochs.
  2. The paper demonstrates the applicability of TrAct in various scenarios covering a wide range of settings.

Thank you for appreciating especially the faster convergence of our method, as well as its applicability across a wide range of settings.

Weaknesses:

  1. A small overhead (due to additional parameters) in computational time when training with TrAct.

We would like to clarify that the small overhead is only due to the backward pass of the first layer being slightly more expensive, whereas the forward pass is not modified. Indeed, we do not introduce any additional parameters.
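
To make this concrete, such a modification can be implemented as a custom autograd function whose forward pass is a plain linear map and which only changes how the weight gradient is computed; the per-sample rescaling below is an illustrative stand-in for, not an exact statement of, our closed-form solution:

```python
import torch

class FirstLayerFn(torch.autograd.Function):
    """Forward pass identical to a linear layer; only the weight gradient
    is computed differently, so no parameters are added."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()  # unmodified forward pass

    @staticmethod
    def backward(ctx, grad_z):
        x, weight = ctx.saved_tensors
        grad_x = grad_z @ weight  # standard gradient w.r.t. the input
        # Modified gradient w.r.t. the weight: each sample's contribution is
        # rescaled by its squared input norm (illustrative stand-in for the
        # closed form; this rescaling is the slightly more expensive step).
        scale = 1.0 / (x.pow(2).sum(dim=1, keepdim=True) + 1e-6)
        grad_w = (grad_z * scale).t() @ x
        return grad_x, grad_w

# Usage: same output as x @ w.t(); only w.grad is computed differently.
x = torch.randn(4, 8)
w = torch.randn(16, 8, requires_grad=True)
FirstLayerFn.apply(x, w).sum().backward()
```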

Questions:

What happens when only the last layer of a pre-trained model is fine-tuned? This is a setup used in evaluating many self-supervised methods. Wouldn't the first-layer optimization using TrAct create a bias for a new data distribution?

In additional fine-tuning experiments not reported in the paper, we compared a model pre-trained with TrAct against a model pre-trained without TrAct, fine-tuning both models without TrAct, and observed that the model originally pre-trained with TrAct performed better. If you would like us to also perform this experiment for fine-tuning only the last layer, we offer to include these results in the camera-ready; however, we strongly expect that the model pre-trained with TrAct would still perform better in this case, because the performance of the entire model is mostly determined by pretraining, and the first layer being optimized more efficiently leads to a more efficient optimization of the rest of the network.

We would appreciate it if you could let us know whether we have resolved and successfully answered all of your questions and concerns. If new questions or concerns come up, please let us know.

Comment

Thank you for the rebuttal.

If possible, please add the additional experiment to the final version of the paper.

Comment

Dear Reviewer juWi,

We wish to thank you very much for helping us improve the paper. Hopefully, you have had a chance to take a look at our rebuttal. Since the discussion period ends today, we would greatly appreciate it if you could respond to our rebuttal soon. This will give us an opportunity to address any further questions and comments that you may have before the end of the discussion period.

Review (Rating: 7)

When training vision models, the update to the weights of the first layer is proportional to the input pixel values. This can make the model learn high-contrast images faster and can harm learning efficiency. To reduce this dependency, the paper proposes to optimize the first-layer embedding (before activation) directly. This perspective leads to a new optimization problem for updating the weights and results in a lightweight modification applicable to a variety of vision models. The method is demonstrated on image classification problems.

Strengths

The authors identified a fundamental problem in training vision models. Compared to modifying the input or the model architecture, targeting the update of the first-layer weights is a more direct approach whose impact is easy to understand. They formulated an optimization problem and clearly laid out the derivation of the solution for the weight update. The proposed method is easy to implement.

Weaknesses

There should be more elaboration on the intuition behind the first-layer embedding optimization. I can understand the purpose, and the conceptual procedure can be viewed as optimization applied to the first-layer activations, but how this formulation reduces the dependency on the inputs without hurting training is not obvious. In the introduction, the authors argued that they bridged the gap between the "embedding" layer in language models and the first layers in vision models. However, the connection of the proposed work to the "embedding"-layer update in language models is not clear to me.

Questions

I believe the proposed method is general enough for vision tasks other than image classification. It would be interesting to see the performance on segmentation, detection, and even image generation. Specifically, I would like to see whether the modification of the embedding hurts the performance on tasks requiring more pixel-by-pixel understanding.

Limitations

The authors state that limitations are discussed, as all assumptions are pointed out in the work. However, the assumptions listed are fairly broad and not specific to the proposed work. I'd like to see more discussion of the limitations imposed when modifying the first-layer embedding.

Author Response

We would like to thank you for your time reviewing our paper and for helping us improve it.

The authors identified a fundamental problem in training vision models. Compared to modifying the input or the model architecture, targeting the update of the first-layer weights is a more direct approach whose impact is easy to understand. They formulated an optimization problem and clearly laid out the derivation of the solution for the weight update. The proposed method is easy to implement.

Thank you very much for appreciating the clarity of our method's derivation and its simplicity to implement. We hope that these properties, combined with the training speedups, lead to wide adoption in the community.

There should be more elaboration on the intuition behind the first-layer embedding optimization. I can understand the purpose, and the conceptual procedure can be viewed as optimization applied to the first-layer activations, but how this formulation reduces the dependency on the inputs without hurting training is not obvious. In the introduction, the authors argued that they bridged the gap between the "embedding" layer in language models and the first layers in vision models. However, the connection of the proposed work to the "embedding"-layer update in language models is not clear to me.

In language models, the embedding layer is a lookup table, with an embedding / activation for each token. Thus, given an input token (an integer), one embedding is selected (a row of the embedding matrix) and fed forward into the next layer. The training dynamics of the embedding layer correspond to updating the embeddings directly wrt. the gradient. The update in a language model, for a token identifier $i$, is $W_i \leftarrow W_i - \eta \cdot \nabla_{z} \mathcal{L}(z)$, where $z = W_i$ is the activation of the first layer and at the same time the $i$-th row of the embedding (weight) matrix $W$. Equivalently, we can write $z \leftarrow z - \eta \cdot \nabla_{z} \mathcal{L}(z)$.

In contrast, in vision models, the update is $W \leftarrow W - \eta \cdot \nabla_{z} \mathcal{L}(z) \cdot x^\top$, and our goal is to achieve an update that is close to $z^* \leftarrow z - \eta \cdot \nabla_{z} \mathcal{L}(z)$, which we achieve via our closed-form solution.
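
For intuition, the single-sample case makes this explicit (a simplified sketch; the closed-form solution in the paper covers the general setting): since $z = Wx$, requiring $(W + \Delta W)\,x = z - \eta \cdot \nabla_{z} \mathcal{L}(z)$ means $\Delta W\, x = -\eta \cdot \nabla_{z} \mathcal{L}(z)$, and the minimum-Frobenius-norm solution is

$$\Delta W = -\eta \cdot \nabla_{z} \mathcal{L}(z) \cdot \frac{x^\top}{\lVert x \rVert^2},$$

i.e., the standard update $\nabla_{z} \mathcal{L}(z) \cdot x^\top$ rescaled by $1 / \lVert x \rVert^2$, which removes the dependence on the input magnitude.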

We will include the extended discussion into the camera-ready.

I believe the proposed method is general enough for vision tasks other than image classification. It would be interesting to see the performance on segmentation, detection, and even image generation. Specifically, I would like to see whether the modification of the embedding hurts the performance on tasks requiring more pixel-by-pixel understanding.

Thank you for this suggestion, which we combined with Reviewer yoFY's suggestion of training a Faster R-CNN object detection model. Please see our discussion and preliminary results table in our response to Reviewer yoFY.

The authors state that limitations are discussed, as all assumptions are pointed out in the work. However, the assumptions listed are fairly broad and not specific to the proposed work. I'd like to see more discussion of the limitations imposed when modifying the first-layer embedding.

There are no assumptions specific to this work, as our method is generally applicable as long as the first layer is a fully-connected or convolutional layer. A potential limitation is that the approach is not applicable in settings where the first layer is frozen, e.g., at a random initialization; however, the practice of freezing the first layer in some models stems from the very first-layer training-dynamics problem identified in our work, and actually training the first layer with our technique could improve performance.

Please let us know if you had a different specific assumption in mind that you would like us to point out in the paper or discuss.

We would appreciate it if you could let us know whether we have resolved and successfully answered all of your questions and concerns. If new questions or concerns come up, please let us know.

Comment

Dear Reviewer v9Bj,

We wish to thank you very much for helping us improve the paper. Hopefully, you have had a chance to take a look at our rebuttal. Since the discussion period ends today, we would greatly appreciate it if you could respond to our rebuttal soon. This will give us an opportunity to address any further questions and comments that you may have before the end of the discussion period.

Comment

I love your explanation about the language models vs. the vision models. I believe rephrasing this and adding it to the method section will make the transition to the proposed method -- "To resolve this dependency on the inputs and make training more efficient, we propose to conceptually optimize in the space of first layer embeddings z" -- easier to understand. Perhaps link the discussion in the appendix here?

What kind of object detection task was tackled in the additional figure (in the response to yoFY)?

What are your thoughts on the performance for segmentation? Do you think the modification of the embedding can hurt the performance on tasks requiring more pixel-by-pixel understanding?

Comment

Dear Reviewer v9Bj,

Thank you very much for responding to our rebuttal, and for requesting clarifications.

I love your explanation about the language models vs. the vision models. I believe rephrasing this and adding it to the method section will make the transition to the proposed method -- "To resolve this dependency on the inputs and make training more efficient, we propose to conceptually optimize in the space of first layer embeddings z" -- easier to understand. Perhaps link the discussion in the appendix here?

Thank you so much for appreciating our explanation of language models vs. vision models. As we get an additional page for the camera-ready, we will extend this discussion in the method section and provide an additional, linked in-depth discussion in the appendix. Further, if you think it would be helpful, we would be happy to add concrete visual examples to the appendix (we imagine illustrating inputs, weights/embeddings, and activations, and how they are updated).

What kind of object detection task was tackled in the additional figure (in the response to yoFY)?

We understand that our response to you could have been clearer, and apologize for this.

In our response to Reviewer yoFY, we had actually provided two experiments: (i) preliminary object detection results with Faster R-CNN, shown in a table at the bottom of the rebuttal to Reviewer yoFY, and (ii) an experiment further investigating the performance differences between training with Adam and SGD, for which we provide an additional figure in the PDF for a ResNet-18 on CIFAR-10.

The experiment that we referred to in our rebuttal to you is (i), i.e., the object detection with Faster R-CNN: we performed the PASCAL VOC 2007 object detection task, achieving an average improvement of 1.1% test mAP using TrAct:

Seed    vanilla    TrAct
1       0.655      0.667
2       0.664      0.674

What are your thoughts on the performance for segmentation? Do you think the modification of the embedding can hurt the performance on tasks requiring more pixel-by-pixel understanding?

We expect comparable performance improvements for segmentation as well, of course assuming that the entire model is being trained (and thus the first layer is actually also being trained). We do not think that TrAct can have an actual negative effect in such settings that require more pixel-by-pixel understanding; the Faster R-CNN results also align with this. If you have a concrete additional experiment for pixel-by-pixel understanding in mind that you would like us to try, we would be happy to include it in the camera-ready.

Author Response

Additional figure (PDF) based on Reviewer yoFY's remark.

Final Decision

This paper was reviewed by three reviewers. All of them are positive about the paper, while they also pointed out a couple of points that could be added, addressed, or clarified in the final version. The recommendation is acceptance.