PaperHub
Overall score: 5.7/10 · Decision: Rejected · 3 reviewers
Ratings: 6, 5, 6 (min 5, max 6, std 0.5) · Confidence: 2.3
ICLR 2024

Towards Understanding the Effect of Pretraining Label Granularity

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11
TL;DR

We empirically and theoretically study how pretraining label granularity influences a deep neural network's downstream generalization performance.

Abstract

Keywords
Learning theory, transfer learning, generalization

Reviews and Discussion

Review (Rating: 6)

This paper studies the influence of pretraining label granularity on transfer learning performance for image classification. The authors show that pretraining on leaf/fine-grained labels achieves better transfer results than pretraining on root/coarse labels, and they support this claim both theoretically and experimentally.

Strengths

  1. The authors have provided both theoretical and experimental proofs, reinforcing the credibility of their arguments.
  2. The drawn conclusion offers guidance for transfer learning, making the paper an engaging read.

Weaknesses

  1. Does the scale of the dataset influence the final performance? As the number of classes increases, the dataset scale typically expands. The authors may consider maintaining a consistent dataset scale—for instance, by having diverse classes with few samples each or limited classes with ample samples—to further substantiate their claims.
  2. In Definition 4.2 regarding 'hard samples', this paper characterizes them based on the introduction of random noise. However, merely adding random noise doesn't necessarily make a sample challenging to classify. Learning with noise is different from learning with hard samples. Prior research typically defines hard samples as those with significant classification loss, e.g., boot-strapping or hard negative mining.
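For concreteness, the loss-based notion of hard samples that this comment refers to (as used in bootstrapping or hard negative mining) can be sketched as follows; this is a generic illustration, not code from the paper:

```python
import torch
import torch.nn.functional as F

def mine_hard_examples(logits, labels, k):
    # Loss-based notion of "hard": the k samples with the largest
    # per-sample cross-entropy loss under the current model.
    per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
    return torch.topk(per_sample_loss, k).indices
```

This contrasts with Definition 4.2, where hardness is instead induced by removing common features and adding noise.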

Questions

  1. About Figure 4: why does the validation error increase for CLIP clustering when the number of classes increases?
  2. It is suggested to use \citep rather than \cite in the latex
Comment

Thank you for your insightful comments and suggestions! Our response to your questions and comments is as follows.

Weaknesses.

W1. "Does the scale of the dataset influence the final performance?", "having diverse classes with few samples each or limited classes with ample samples—to further substantiate their claims"

A: To clarify, when we perform, for instance, the ImageNet21k->ImageNet1k transfer experiments with the various pretraining label hierarchies, only the pretraining label space changes; the set of input samples in ImageNet21k remains the same. In other words, as the pretraining label granularity increases, the number of samples per class actually decreases. Therefore, we have effectively done what you suggest: we have experimentally shown that, when pretraining with (reasonably) high label granularity, even though the number of training samples per class is much smaller than when pretraining with low granularity, the pretrained model still generalizes better on downstream tasks.
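To make this setup concrete, here is a minimal sketch of relabeling a fixed image set at a coarser granularity, so that only the label space changes while the samples stay the same; the `fine_to_coarse` mapping and the dataset wrapper are illustrative placeholders, not the authors' actual pipeline:

```python
from torch.utils.data import Dataset

class CoarsenedDataset(Dataset):
    """Wraps a dataset that yields (image, fine_label) and remaps only the labels."""

    def __init__(self, base_dataset, fine_to_coarse):
        self.base = base_dataset      # fixed set of input samples
        self.map = fine_to_coarse     # dict: fine label id -> coarse label id

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, fine_label = self.base[idx]
        return image, self.map[fine_label]
```

With the input set fixed, coarser labels mean fewer classes with more samples each, while finer labels mean more classes with fewer samples each, which is exactly the trade-off discussed above.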

W2. "In Definition 4.2 regarding 'hard samples', this paper characterizes them based on the introduction of random noise"

A: To clarify, our definition of hard samples relies on two factors. First, we assume that common features are not present in these samples. As our theory shows, common features are very easy for neural networks to learn, and neural networks indeed exhibit a tendency to learn these common features over the rarer fine-grained features. Therefore, removing the common features from an input sample makes the sample challenging for the neural network. Second, the addition of extra noise, on top of removing the common features, makes the example even more "confusing" to the neural network, which makes our theoretical result more pronounced. The interpretation of the noisy patches, as detailed in Section 3.2, is to simulate "distracting irrelevant patterns" in real-world inputs.
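For illustration, a minimal numpy sketch of this notion under a patch-style data model; the patch count, dimensions, and noise scales below are arbitrary placeholder choices, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_patches = 32, 4                    # placeholder sizes
v_common = rng.standard_normal(dim)         # a "common" feature direction
v_fine = rng.standard_normal(dim)           # a "fine-grained" feature direction
v_common /= np.linalg.norm(v_common)
v_fine /= np.linalg.norm(v_fine)

def easy_sample(noise=0.1):
    # Easy: one patch dominated by the common feature, one by the
    # fine-grained feature, remaining patches are low-magnitude noise.
    patches = noise * rng.standard_normal((num_patches, dim))
    patches[0] += v_common
    patches[1] += v_fine
    return patches

def hard_sample(noise=0.5):
    # Hard (in the sense described above): the common-feature patch is
    # removed and the noise level is raised, so only the fine-grained
    # feature remains informative.
    patches = noise * rng.standard_normal((num_patches, dim))
    patches[1] += v_fine
    return patches
```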

Questions.

Q1. "About Figure 4: why does the validation error increase for CLIP clustering when the number of classes increases?"

A: Our observations suggest two primary reasons for this phenomenon. First, the labeling policy of CLIP+kMeans does not align with the manual labeling policy of iNaturalist. As the classes become more and more fine-grained, this misalignment becomes more pronounced. Therefore, the features learned by the neural network under the two different labeling policies deviate further and further from each other, causing the validation error to increase with the number of classes. Second, CLIP+kMeans is an inherently noisy labeling policy, and as the number of classes increases, the amount of label noise also increases, causing the quality of the features learned by the neural network to deteriorate, which leads to worse generalization.
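As a concrete illustration of this labeling policy, a minimal sketch of CLIP+kMeans pseudo-labeling, assuming CLIP image embeddings have already been extracted (the file name and cluster counts are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Precomputed CLIP image embeddings, shape (num_images, embed_dim);
# the embedding extraction step itself is omitted here.
clip_embeddings = np.load("clip_embeddings.npy")

# The number of clusters plays the role of label granularity; the cluster
# IDs serve as (noisy) pretraining labels.
pseudo_labels = {}
for k in (10, 100, 1000):
    pseudo_labels[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(clip_embeddings)
```

The finer the clustering, the more it splits and merges groups in ways that disagree with the manual taxonomy, which is one way to see why the label noise grows with granularity.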

Q2. "It is suggested to use \citep rather than \cite in the latex"

A: Thank you for pointing this out! We have updated \cite to \citep in our paper now.

Review (Rating: 5)

This paper delves into the impact of pre-training label granularity on the generalization capabilities of deep neural networks (DNNs) in image classification tasks. It explores the 'fine-to-coarse' transfer learning scenario, where pre-training labels are more detailed than those of the target task. The study finds that pre-training with the most detailed labels from ImageNet21k leads to improved transfer performance on ImageNet1k, a practice commonly adopted within the community. The paper offers a theoretical perspective, suggesting that fine-grained pre-training enables DNNs to learn not just common features but also those that are rare or specific, thereby enhancing accuracy on more challenging test samples that lack strong common features. Extensive experiments with iNaturalist 2021's label hierarchies indicate that effective transfer requires a meaningful label hierarchy and alignment between pre-training and target label functions.

Post-rebuttal

Based on the current limited empirical evidence, and given that all the experiments I suggested are only promised as future work, I'd like to lower my score.

However, I'd like to emphasize to the AC that my evaluation is based on the empirical evidence only.

Strengths

I believe the studied direction is important for understanding the transferability of learned representations, which corresponds to the goals of ICLR. The methodology employed in the study is theoretically driven, indicating a rigorous mathematical approach to understanding the effect of label granularity on DNNs.

The experimental setup is well-detailed, using widely recognized datasets such as ImageNet and iNaturalist. The results section seems to provide theoretical backing with definitions and theorems regarding SGD behavior.

Weaknesses

Clarification: my assessment is mainly focused on the empirical evidence, not the theoretical conclusions.

The empirical results are not surprising to me, as much more fine-grained labels are expected to yield stronger transfer performance. I believe there are two points that could be improved:

  • Testing on more datasets. The current results are verified on a single cross-dataset pair and may not hold for other dataset pairs. Some datasets studied in low-shot learning could be used in this scenario.

  • Studying how to obtain the hierarchy/fine-grained labels for unlabeled datasets. It is hard, costly, and usually "impossible" to obtain such a hierarchy for a large-scale dataset; therefore, it is important to have discussion and analysis here. The paper currently provides a simple study in Section 5.2. However, I expect more analysis, such as how to decide the class level for an unlabeled dataset. A possibly related paper here is Large-Scale Few-Shot Learning: Knowledge Transfer With Class Hierarchy.

Questions

You should use \citep rather than \cite in most citations.

Please address all the mentioned points above.

As I review this paper mainly based on the empirical evidence, I am willing to raise the rating if the concerns around the empirical evidence are addressed.

Comment

Thank you for your valuable time and effort in providing feedback on our work. We hope that our response below will address your concerns. Due to character limits, we will split our response to your questions into two parts.

Weaknesses.

W1a. "Testing on more datasets", "... hard, costly, and usually "impossible" to obtain such a hierarchy for a large-scale dataset"

A: As is discussed in the related work section (Section 2.2), there are already a number of empirical works that pretrain large models with highly fine-grained labels (albeit with lower quality than human labels) for improving model generalization. For example, [1,2] use noisy hashtags from Instagram as pretraining labels, [3,4] apply clustering on the data first and then treat the cluster IDs as pretraining labels, and [5] uses the queries from image search results, etc.

Our work, however, does not aim to empirically demonstrate this already-known observation yet again; rather, it focuses on mathematically explaining the phenomenon. Our experimental work is comparable to, if not more extensive than, that of the majority of papers in the area of deep learning theory. For reference, please see the following representative papers [6-8].

W1b. "Some datasets studied in low-shot learning could be used in this scenario"

A: Thank you for this suggestion. We will take these datasets into consideration in our future work.

W2. "Studying how to obtain the hierarchy/fine-grained labels for unlabeled datasets"

A: Thank you for this suggestion. We are indeed very interested in exploring methods of generating hierarchy/fine-grained labels for datasets that do not have human labels. For example, one approach is to utilize large language models such as [9] to decompose the coarse-grained labels into fine-grained categories. Subsequently, visual question answering (VQA) models such as [10, 11] can be employed to automatically classify the input samples in a fine-grained manner. We are also interested in further understanding the limitations of training DNNs on large-scale fine-grained but noisy data. Lastly, we hope our results may give motivation for others in the community to develop innovative techniques for obtaining fine-grained labels for unlabeled datasets.
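A purely illustrative sketch of such a pipeline; the `llm` and `vqa` callables below are hypothetical placeholders rather than the APIs of any specific model:

```python
def propose_fine_labels(llm, coarse_label, k):
    # Step 1: ask an LLM to decompose a coarse class into fine-grained subcategories.
    prompt = f"List {k} visually distinct subcategories of '{coarse_label}'."
    return llm(prompt)                  # placeholder: returns a list of k strings

def assign_fine_label(vqa, image, fine_labels):
    # Step 2: ask a VQA model which subcategory best describes the image.
    question = "Which of the following best describes the main object: " + ", ".join(fine_labels) + "?"
    return vqa(image, question)         # placeholder: returns one of fine_labels
```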

"You should use \citep not \cite in most place of citations."

A: Thank you for pointing this out! We have updated \cite to \citep in our paper now.

Comment

References

  1. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.

  2. Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollar, and Laurens Van Der Maaten. Revisiting weakly supervised pre-training of visual perception models. In CVPR, 2022.

  3. Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. Clusterfit: Improving generalization of visual representations. In CVPR, 2020.

  4. Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. Cluster & tune: Boost cold start performance in text classification. arXiv preprint arXiv:2203.10581, 2022.

  5. Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Ultra fine-grained image semantic embedding. In WSDM, 2020.

  6. Ruoqi Shen, Sebastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In ICML, 2022.

  7. Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. In NeurIPS, 2021.

  8. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proc. NeurIPS, pages 8571–8580, 2018.

  9. Aram Bahrini, Mohammadsadra Khamoshifar, Hossein Abbasimehr, Robert J. Riggs, Maryam Esmaeili, Rastin Mastali Majdabadkohne, and Morteza Pasehvar. Chatgpt: Applications, opportunities, and threats. In 2023 Systems and Information Engineering Design Symposium (SIEDS), pp. 274–279, 2023. doi: 10.1109/SIEDS58326.2023.10137850.

  10. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

  11. Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023. URL https://openreview.net/forum?id=mWVoBz4W0u.

Review (Rating: 6)

The paper considers the setup when models are first pretrained on fine-grained classes and then finetuned (transferred) to a dataset with more coarse-grained labels. They provide both theoretical and experimental contributions.

Theoretically, they prove that 1) coarse-grained pretraining only allows a neural network to learn the “common” or “easy-to-learn” features well, and 2) fine-grained pretraining helps the network learn the “rarer” or “fine-grained” features, thus improving its accuracy on hard downstream test samples.

Empirically, they show that pre-training on ImageNet-21k leaves (and then transferring to ImageNet-1k) is more beneficial than pretraining on other coarser granularity levels. They also experiment with iNaturalist, noting the importance of meaningful label hierarchies and good source-target label alignment.

Strengths

  • I believe the paper makes valid theoretical contributions which are partially supported experimentally.
  • The paper is easy to read and follow, and the main takeaway messages are easy to understand.

Weaknesses

I am not totally sure that the idealized setup considered here makes much sense in practice. For example, Jain et al. (2023) claim that fine-grained labels are often hard and expensive to obtain, and that going in the coarse --> fine-grained direction is equally valuable. Moreover, when pretraining on large-scale datasets (e.g., Mahajan et al. 2018), I believe it is often not clear what the label hierarchy is (or whether it even exists).

The other concern that I have is related to the transition from theoretical contributions to empirical experiments. I am not sure if the experiments on ImageNet and iNaturalist are sufficient to support the presented theory (could you please elaborate a bit on that if that is the case). One suggestion (that should be doable and easy to implement) would be to generate synthetic data and confirm that Theorems 4.1 and 4.2 hold on it, exploring and explaining the impact of the different parameters needed by your theory.

Jain et al., Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification, NeurIPS 2023

Questions

Q1: Just to confirm: When fine-tuning, you do fine-tune the whole network, i.e., you do not keep the feature extractor fixed. It is a bit surprising to me that regardless of the pretraining granularity (e.g. on Fig. 1 and Table 1), the fine-tuned model does not catch up with the baseline training, assuming that sufficient time for finetuning is given.

Q2: Is the granularity solely determined by the number of classes or number of classes AND class level in hierarchy? Do we assume that all classes in a given (pre)training dataset are at the same hierarchy level? What if we mix classes from different levels in the class hierarchy during the pretraining?

Q3: Could you please provide some intuition why you need the different patches and how do they relate to real-world image inputs? If I understand correctly, it is the same intuition as in Fig. 2 and the different patches represent different parts of the image (which may contain different common/rare features).

Q4: If I understand correctly, the theorems require that the neural networks are trained only on "easy" samples. Why is that the case? If it is indeed needed, how can you distinguish between easy and hard samples during training?

Q5: In the paper you perform experiments with ViT and ResNets. Based on your theory, in what way is the model important and what is its impact on the training? I.e., what properties are desirable for it?

Q6: On iNaturalist (Fig. 4) why do you only report validation errors but omit the final accuracies? Are the accuracies consistent with the shown figure?

Minor: In your examples (Sec 4) you consider mainly binary task (i.e., 2 coarse level classes). Can the theory and the theorems be extended to the multi-class setup?

Comment

Thank you for your valuable time and constructive feedback. We have carefully considered your feedback and thank you for helping us improve our work. Due to character limits, we will split our response to your questions and comments into multiple parts.

Weaknesses.

W1. "not totally sure that the idealized setup considered here makes much sense in practice", "fine-grained labels are often hard and expensive to obtain", "coarse to fine-grained direction is equally valuable"

A: The fine-to-coarse transfer learning setting is already widely used in practice (and getting increasingly popular), especially for training large models. As we discussed in the related work section (Section 2.2), there are more and more works that pretrain large models with highly fine-grained labels and the downstream classification task is coarser in label granularity. For example, [1,2] use noisy hashtags from Instagram as pretraining labels, [3,4] apply clustering on the data first and then treat the cluster IDs as pretraining labels, [5] uses the queries from image search results, etc.

Despite this growing trend, to the best of our knowledge, there is no theoretical work that justifies the benefits of (pre-)training deep neural networks (DNNs) with such fine-grained labels. Our work aims to fill this gap by providing a theoretical understanding of this phenomenon that already exists in practice.

Additionally, we are indeed very interested in studying how to generate fine-grained labels for datasets that do not have human labels. For example, we could first use large language models such as [6] to decompose each coarse-grained label into fine-grained labels, and then use visual question answering (VQA) models such as [7,8] to automatically classify the input samples in a fine-grained manner. We are also interested in further understanding the limitations of training DNNs on large-scale fine-grained but noisy data. Lastly, we hope our results may give motivation for others in the community to develop innovative techniques for obtaining fine-grained labels for unlabeled datasets.

W2a. "transition from theoretical contributions to empirical experiments", "not sure if the experiments on ImageNet and iNaturalist are sufficient to support the presented theory"

A: The purpose of the theoretical study is to explain the empirical observations we made, in particular, the observations we summarized in the introduction of this paper: "Under certain basic conditions on the pretraining and target label functions, DNNs pretrained at reasonably high label granularities tend to generalize better in downstream classification tasks than those pretrained at low label granularities." Therefore, our large-scale experiments are not intended to validate the theory, but it is the other way around. In particular, we chose ImageNet and iNaturalist because they are widely used benchmarks in the literature. Our experiments confirm that the observed phenomenon extends to these popular benchmarks. Our theoretical results provide a mathematically rigorous explanation for the empirical observations.

W2b. "generate synthetic data and confirm that Theorems 4.1 and 4.2 hold on it, exploring and explaining the impact of the different parameters needed by your theory."

A: As stated above, the primary purpose of this work is to use our study on the 2-layer neural network to provide rigorous support for our claim that pretraining at reasonably high label granularity is beneficial for generalization. The theoretical results we presented are asymptotic in nature and they serve to qualitatively confirm the empirical observation we made. It is important to note that our approach aligns with many current deep-learning-theoretic works. For instance, [9-11] referenced below are example theory papers that are similar in structure to ours, but contain significantly fewer experimental investigations.

(continued in Part 2)

Comment

Questions (continued).

Q5. "... perform experiments with ViT and ResNets", "in what way is the model important and what is its impact on the training?", "... what properties are desirable?"

A: To address your question, we want to revisit the central focus of this paper. Our primary goal is to analyze the influence of pretraining label granularity on generalization in the "fine-to-coarse" setting. Therefore, as a first step, it is essential to experiment with the SOTA architectures, because this can have a significant impact on practical transfer learning scenarios. However, we acknowledge that exploring the influence of architecture is an interesting direction in our future work.

Q6. "why do you only report validation errors but omit the final accuracies?"

A: Error rate and accuracy are complementary metrics that add up to 100%, so reporting error rate is equivalent to reporting accuracy. When referring to "final accuracies", we presume you mean the accuracies on the test set. We should note that the dataset we worked with does not contain a test set (with ground truth labels). Thus we can only rely on the validation set to assess the model's generalization performance. Also note that this is in fact a common practice in the community. For example, people only report the validation error/accuracy on the popular ImageNet1k dataset to reflect the generalization performance of their models.

"Minor: you consider mainly binary task (i.e., 2 coarse level classes). Can the theory and the theorems be extended to the multi-class setup?"

A: Yes, our theory can be extended to the multi-class setup easily. We chose to present the binary-class case to maintain clarity and avoid further complicating the notations.

References

  1. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.

  2. Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollar, and Laurens Van Der Maaten. Revisiting weakly supervised pre-training of visual perception models. In CVPR, 2022.

  3. Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. Clusterfit: Improving generalization of visual representations. In CVPR, 2020.

  4. Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. Cluster & tune: Boost cold start performance in text classification. arXiv preprint arXiv:2203.10581, 2022.

  5. Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Ultra fine-grained image semantic embedding. In WSDM, 2020.

  6. Aram Bahrini, Mohammadsadra Khamoshifar, Hossein Abbasimehr, Robert J. Riggs, Maryam Esmaeili, Rastin Mastali Majdabadkohne, and Morteza Pasehvar. Chatgpt: Applications, opportunities, and threats. In 2023 Systems and Information Engineering Design Symposium (SIEDS), pp. 274–279, 2023. doi: 10.1109/SIEDS58326.2023.10137850.

  7. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

  8. Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023. URL https://openreview.net/forum?id=mWVoBz4W0u.

  9. Ruoqi Shen, Sebastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In ICML, 2022.

  10. Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. In NeurIPS, 2021.

  11. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proc. NeurIPS, pages 8571–8580, 2018.

  12. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

Comment

Thank you for answering my questions.
I believe my comments are addressed, thus I am increasing my score.

Comment

Questions.

Q1a. "... do not keep the feature extractor fixed?"

A: We do not keep the feature extractor fixed during finetuning.

Q1b."surprising... fine-tuned model does not catch up with the baseline training... sufficient time for finetuning is given".

A: Given that our work represents an initial step towards understanding the effect of pretraining label granularity for DNNs, we want to stick to common practices as much as possible. Early stopping the pretraining and finetuning phases is common practice in the field.

More specifically, in the iNaturalist experiments, we allowed the finetuning phase to run for 90 epochs, matching the same length of the pretraining phase (90 epochs is the standard training length for ImageNet-like datasets), therefore the finetuning time is substantial. In the ImageNet21k->ImageNet1k experiments, we adopted the procedure established in the original ViT paper [12].

Furthermore, it is natural for a finetuned model to behave differently from a baseline model. We know that in deep learning, due to the extreme non-convexity of the optimization landscape, the initialization of weights in a deep neural network can significantly impact the training and generalization performance of the model. In our situation, pretraining at different label granularities leads to different stochastic gradient descent (SGD) trajectories of the neuron weights, so these trajectories will very likely head towards different local minima. Finetuning does not necessarily push the neuron weights to the local minima that the baseline models end up in.

Q2a. "solely determined by the number of classes ...?", "... all classes in a given (pre)training dataset are at the same hierarchy level?"

A: Our definition of "label granularity" is the number of classes. In our experiments, all classes in a given (pre)training dataset are indeed at the same hierarchy level.

Q2b."mix classes from different levels in the class hierarchy?"

A: Thank you for the suggestion, we will take this factor into consideration in our future work. For this work, we are taking the first step in understanding the effect of pretraining label granularity, so we chose to stick with the most basic setting of all classes being in the same hierarchy level.

Q3. "why you need the different patches and how do they relate to real-world image inputs?"

A: We will explain this from two perspectives.

Let us consider a concrete example first. Suppose we have a superclass of cars and fine-grained classes corresponding to different car brands. If we see, for instance, a partial shot of a car with a BMW logo, we know that this car belongs to the BMW brand; that is, the BMW logo is a fine-grained feature that distinguishes this particular brand from others. In contrast, common features of cars include, for instance, the wheels, which are shared among all car brands. Therefore, the common and fine-grained features occupy different spatial locations within the image, which is why we let the common and fine-grained features dominate different patches of the input vector. Additionally, besides the useful features, real-world images almost always contain distracting patterns (present at spatial locations other than those occupied by the common and fine-grained features) that are irrelevant to classification. These patterns are modeled as noise in our framework.

From the perspective of obtaining analytically tractable results, separating common and fine-grained features into distinct patches makes the mathematical analysis more tractable to some extent, as it weakens the amount of coupling between the learning of different features during training.

Q4. "... trained only on 'easy' samples. Why is that the case?" "If it is indeed needed, how can you distinguish between easy and hard samples during training?"

A: As discussed in our paper, "hard" samples are generally rare in natural image datasets. Our theoretical result is intended to present the "feature-learning bias" of a neural network in an exaggerated fashion. Therefore, examining the case of "no hard training examples at all" serves as a natural starting point for theoretical analysis. In our theory, the main distinction between easy and hard samples is whether or not they contain patches that are dominated by common features: easy samples contain these patches, while hard samples do not. These theoretical considerations do not imply that we advocate for training real DNNs only on easy samples in practice.

(continued in Part 3)

AC Meta-Review

The authors explore the phenomenon that pretraining large models with highly fine-grained labels can enhance the performance of downstream tasks. In line with their theoretical assumptions and conclusions, they have conducted preliminary experiments and provided corresponding empirical analyses to support their findings.

Why Not a Higher Score

The reviewers expressed major concerns about the empirical analysis in the paper, which is highlighted as one of its two-fold contributions. They noted discrepancies between the authors' empirical analysis and the scenarios in related works. While many studies pre-train large models with fine-grained labels, such as noisy hashtags, and apply them to coarser downstream classification tasks, the authors' analysis involves additional constraints like a meaningful label hierarchy and alignment between pre-training and target label functions. These conditions are not always consistent with the setups in existing works. Therefore, it could be important to extend the empirical analysis across more experiments and datasets to validate the authors' conclusions. While the paper also makes theoretical contributions, this does not diminish the need for some simple toy/synthetic experiment to explicitly validate the theorems presented. However, it was noted that the authors did not directly address these concerns in their response. The empirical analysis, being closely related to the authors' theoretical framework, holds significant weight in establishing the credibility and validity of their theoretical contributions. Therefore, the weaknesses in the empirical analysis could potentially diminish the overall persuasiveness and soundness of the theoretical part as well.

Why Not a Lower Score

N/A

Final Decision

Reject