A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints
We propose a new Bayesian model selection criterion, the "downstream free energy", which measures a model's adaptability to new tasks. This criterion does not require downstream data and shows promising results in predicting fine-tuning performance.
Abstract
Reviews and Discussion
This paper studies the problem of neural network model selection under the pretrain-then-adapt paradigm. Based on the pretraining data, multiple neural network checkpoints can be obtained, roughly corresponding to different local minima of the network parameters. To select a checkpoint that adapts well to downstream tasks, pretraining and downstream free energies are introduced as Bayesian model selection criteria. To deal with cases where downstream data are not available during model selection, relations between the two energies are explored so that an approximation of the downstream free energy can be designed based on pretraining data, which the authors further refer to as the pretraining WBIC. Several numerical experiments are conducted.
Questions For Authors
See Claims And Evidence & Relation To Broader Scientific Literature.
Claims And Evidence
- This paper is not self-contained. As one of the most important equations, only Watanabe's book is referenced for Eq.(4) but no derivation is given. If Eq.(4) is directly taken out of the book, maybe a detailed reference to sections and pages would help. Similarly, the complexity term \lambda^1(\omega^{*1}) also lacks its definition.
- The intuitive argument "Intuitively, lower downstream free energy indicates a higher concentration of parameters in parameter space for which the model is more adaptable and capable of generalizing well on downstream tasks", along with the definition of downstream free energy in Eq.(1) using the parameter ball B_\gamma(\omega^*), is not very compelling. It seems that the scale of parameters would strongly affect the energy, which does not necessarily affect the model performance. For non-neural-network examples, this energy definition may not be ideal for a local-scale mixture model. Similarly, I also wonder how the use of batch/layer normalization in the neural network affects the effectiveness of this downstream free energy.
Methods And Evaluation Criteria
See Claims And Evidence.
Theoretical Claims
See Claims And Evidence.
Proposition 5.1 seems correct.
Experimental Designs Or Analyses
Seems fine.
Supplementary Material
I went over the theoretical components of the supplementary material. In line 235 column 2, it says "under mild assumptions", and in line 56, it says "Let \omega^* and \gamma satisfy assumption ??". The assumption is missing from both the main paper and the supplementary material.
Relation To Broader Scientific Literature
In Section 5.2, the authors estimate the pretraining free energy using Eq.(14), which is called the pretraining WBIC. How does this differ from the commonly used WBIC criterion in Bayesian model selection? Is this free energy idea a new interpretation of WBIC, or is the proposed method actually different from WBIC? If it's the latter, perhaps their distinctions could be discussed in more detail.
Essential References Not Discussed
Here are some key relevant references not cited/discussed:
- WAIC: Watanabe, Sumio. "Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory." The Journal of Machine Learning Research 11 (2010): 3571-3594.
- WBIC: Watanabe, Sumio. "A widely applicable Bayesian information criterion." The Journal of Machine Learning Research 14.1 (2013): 867-897.
- More on WAIC: Gelman, Andrew, Jessica Hwang, and Aki Vehtari. "Understanding predictive information criteria for Bayesian models." Statistics and Computing 24 (2014): 997-1016.
- WAIC for latent variable models (potentially related to neural networks): Merkle, Edgar C., Daniel Furr, and Sophia Rabe-Hesketh. "Bayesian comparison of latent variable models: Conditional versus marginal likelihoods." Psychometrika 84.3 (2019): 802-829.
Other Strengths And Weaknesses
The idea is interesting and worth further exploration, but the paper is not well written or self-contained at this point.
Other Comments Or Suggestions
N/A
We thank the reviewer for taking the time to read our work. We are concerned that the major criticisms do not fully justify a “reject” recommendation. Below, we show that the issues raised can be readily resolved with minor clarifications or references, rather than indicating any fundamental flaw.
Claims and Evidence: This paper is not self-contained... for Eq.(4) no derivation is given... also, the complexity term \lambda^1(\omega^{*1}) lacks its definition.
As we state in the paper, it is possible to derive Eq (4) using techniques set out in Watanabe’s book, the details of which can be found in [Lau, 2023]. We will add a precise reference to the relevant sections/pages in [Lau, 2023]. Because this expansion is well-established in singular learning theory, many works reference it rather than re-deriving it fully. We hope you will agree that adding explicit sections/pages to our reference of [Lau, 2023] is a minor fix that does not warrant rejection.
In regards to the definition of the complexity term: we believe the reviewer is referring to \lambda^1(\omega^{*1}) as it appears in Eq (4), since an explicit standalone definition does not appear in the paper. Recall that this quantity, which represents the complexity measure of the checkpoint \omega^{*1}, is defined implicitly as the coefficient of the \log n term in Eq (4). We feel this is sufficient since we never have to reckon with this term on its own, only as it appears in the asymptotic expansion in (4).
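For the reviewer's convenience, the expansion we have in mind is of the standard singular-learning-theory form (written here only schematically; the exact statement, notation, and lower-order terms in Eq (4) of the paper and in [Lau, 2023] may differ):

F_n(B_\gamma(\omega^{*1})) = n L_n(\omega^{*1}) + \lambda^1(\omega^{*1}) \log n + O_p(\log \log n),

where L_n denotes the relevant empirical loss and n the corresponding sample size. In this reading, \lambda^1(\omega^{*1}) is read off as the coefficient of \log n.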
Claims and Evidence: The intuitive argument and... the definition of downstream free energy in Eq.(1)... is not very compelling. It seems that the scale of parameters would strongly affect the energy... I also wonder how the use of batch/layer normalization affects the effectiveness of this downstream free energy.
Regarding scale, here is what we think the reviewer is expressing; please correct us if we have misinterpreted. The reviewer is hinting that in neural network architectures with some form of scale invariance, such as ReLU networks, multiplying the entire parameter set by some constant might not change the function's outputs (and thus wouldn't degrade or improve downstream accuracy), while the free energy quantity we define could shift in a non-trivial way. We think this is a valid point. However, our experiments and our intention revolve around realistic neural networks that are deployed in practice, which rarely exhibit strict parameter scaling invariance. We will add a brief discussion in the final version to clarify this point. Thank you for raising this.
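To make the invariance in question concrete, here is a schematic two-layer example (added purely for illustration; it is not taken from the paper): for a bias-free ReLU network

f_\omega(x) = W_2 \, \sigma(W_1 x), \qquad \sigma(z) = \max(z, 0),

the layer-wise rescaling (W_1, W_2) \mapsto (c W_1, c^{-1} W_2) with c > 0 leaves f_\omega unchanged, since \sigma(c z) = c \, \sigma(z), yet it moves the parameter point \omega and can therefore move mass into or out of the ball B_\gamma(\omega^*). Whether this matters for the architectures deployed in practice is exactly the question the reviewer raises, and it is what the added discussion will address.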
Regarding batch/layer norm, we note that for our experiments we trained models with batch norm (e.g., ResNet) and without it (e.g., VGG) and observed the same effect. We do not expect this factor to affect the effectiveness of our approach.
Supplementary: ...In line 235 column 2, it says "under mild assumptions"...The assumption is missing from both the main paper and the supplementary material.
In the main text at line 235, column 2, we wrote a parenthetical "under mild assumptions, below" to refer to the assumptions in Proposition 5.1. We will ensure the final version states these assumptions explicitly; the broken reference in the supplementary material will also be corrected.
Relation to Broader...: How does this differ from the commonly used WBIC criterion in Bayesian model selection? Is this free energy idea a new interpretation of WBIC, or is the proposed method actually different from WBIC?
Our localized WBIC can be viewed as the classical WBIC with a Gaussian prior centred on a pretraining checkpoint. While we have taken care to rigorously define our localized WBIC, we welcome the suggestion to make the distinction with classical WBIC more explicit and will add a dedicated paragraph discussing how our “pretraining WBIC” compares with the classical WBIC. We see this as a straightforward clarification that in no way invalidates our approach.
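Schematically (with notation that may differ from Eq (14) in the paper), classical WBIC is the expectation of the scaled empirical loss under the posterior tempered to inverse temperature \beta^* = 1/\log n,

\mathrm{WBIC} = \mathbb{E}_{w \sim p_{\beta^*}}[\, n L_n(w) \,], \qquad p_{\beta}(w) \propto \varphi(w) \exp(-\beta \, n L_n(w)),

and our pretraining WBIC replaces the prior \varphi with a Gaussian localized at the checkpoint, \varphi_{loc}(w) \propto \exp(-\lVert w - \omega^* \rVert^2 / 2\sigma^2) (with \sigma a scale hyperparameter; the notation here is ours), so that the tempered posterior only explores a neighbourhood of \omega^*.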
Essential References Not Discussed
We cited Lau et al. (2023) for our local WBIC approach, but we are happy to cite the original WBIC paper as you suggest. Thank you. However, we do not consider the references on classic WAIC that you mention here to be relevant to our work. Can you please articulate in which sense the WAIC references are essential, or give some more detail as to how you see WAIC directly intersecting with our methodology?
Other Strengths and Weaknesses: The idea is interesting... but the paper is not well written or self-contained.
We hope that our planned clarifications around Equation (4) and any added references will address your concern about self-containment.
In regards to being "not well-written", could you please indicate which specific sections or aspects of the writing remain confusing or unclear, so we can address them directly? We note that Reviewers ek2K and UWMc explicitly praised our writing, but we will certainly incorporate any further suggestions to improve readability.
This paper introduces a Bayesian model selection criterion called the downstream free energy, which quantifies the adaptability of pretraining checkpoints for downstream tasks. By measuring the concentration of favorable parameters for the task, this criterion helps predict fine-tuning performance without requiring access to downstream data or prior task knowledge. Empirical evidence validates that the criterion reliably correlates with improved fine-tuning performance.
Questions For Authors
I have no other questions beyond those already mentioned.
Claims And Evidence
The claims made in the submission are supported by evidence.
Methods And Evaluation Criteria
Yes, the proposed methods make sense for the problem at hand.
Theoretical Claims
I have generally checked the proofs, but some details have not been thoroughly verified.
Experimental Designs Or Analyses
I have generally reviewed the experimental design, which seems reasonable.
Supplementary Material
I have roughly checked the proofs in the supplementary material.
Relation To Broader Scientific Literature
The paper contributes to the study of Bayesian model selection.
Essential References Not Discussed
There are several works [1-3] focusing on assessing the reusability or transferability of pre-trained models. However, this paper does not discuss these works.
[1] Tran et al. Transferability and Hardness of Supervised Classification Tasks. ICCV 2019.
[2] Nguyen et al. LEEP: A New Measure to Evaluate Transferability of Learned Representations. ICML 2020.
[3] You et al. LogME: Practical Assessment of Pre-trained Models for Transfer Learning. ICML 2021.
Other Strengths And Weaknesses
- Selecting pre-trained models for downstream tasks is a field with many existing works [1-5], but this paper neither discusses its differences from these works nor compares against these methods in the experiments.
- The experiments only involve two datasets, CIFAR-100 and mini-ImageNet, which are relatively few in number and relatively small in size.
[1] Tran et al. Transferability and Hardness of Supervised Classification Tasks. ICCV 2019.
[2] Nguyen et al. LEEP: A New Measure to Evaluate Transferability of Learned Representations. ICML 2020.
[3] You et al. LogME: Practical Assessment of Pre-trained Models for Transfer Learning. ICML 2021.
[4] Guo et al. Identifying Useful Learnwares for Heterogeneous Label Spaces. ICML 2023.
[5] Zhang et al. Model Spider: Learning to Rank Pre-Trained Models Efficiently. NeurIPS 2023.
Other Comments Or Suggestions
I have no other suggestions.
Thank you for your time in reviewing our paper. We note that the reviewer raised two concerns—(1) the absence of certain references ([1–5]) and (2) the limited dataset scope (CIFAR-100 and mini-ImageNet)—and offered no further questions or objections. You’ll find below our best efforts to address these concerns. Please consider raising your score if you are satisfied with them.
Essential References Not Discussed: There are several works [1-3] focusing on assessing the reusability or transferability of pre-trained models. However, this paper does not discuss these works.
Indeed, there are several studies (including [1-3] mentioned here) which examine how to quantify the transferability of pre-trained models. Below, in "Other Strengths and Weaknesses," you also reference [4] and [5], but do not categorize them as "essential." Can you please specify why you feel these particular works [1-3] are essential to the scope of our current paper? In particular, can you please clarify how these works directly inform or extend our results? Provided this clarification, we are happy to include these or any other references we may have accidentally missed.
Other Strengths and Weaknesses: Selecting pre-trained models for downstream tasks is a field with many existing works [1-5], but this paper does not discuss the differences from these works, nor does it compare them with these methods in the experiments.
(Related to the above) Our paper includes comparisons with established measures such as geometric complexity and neural collapse, which are equally heuristic or empirical in nature and comparable to [1–5]. Since our focus is on a Bayesian model selection approach, not on exhaustively benchmarking all transferability metrics, we believe our chosen references are sufficient to position this work in the broader literature. Can you please specify how [1–5] directly inform or critique the Bayesian framework we adopt in our paper? If they do, we are happy to include these or any other references we may have accidentally missed.
Other Strengths and Weaknesses: The experiments only involve two datasets, CIFAR-100 and mini-ImageNet, which are relatively few in number and have a small dataset size for each.
In regards to our experiments, we used CIFAR-100 and mini-ImageNet because they are well-established benchmarks that allow rapid, reproducible testing of our approach. We view exploring larger datasets as an orthogonal direction that would not alter our main theoretical contributions. We appreciate your feedback and remain open to expanding our experiments to additional datasets in future work.
This paper proposes a new metric, pretraining free energy, which can be used to find a pretraining model checkpoint which is most adaptable for downstream finetuning tasks. The paper is largely theoretical, justifying this metric, although there are two experiments (one in appendix) showing that WBIC, which is used to approximate the pretraining free energy, correlates with downstream finetuning performance.
Questions For Authors
- How computationally costly is it to compute WBIC - is this something that drastically increased the overhead of the experiments in the paper?
- Do you think this method is both feasible, and will continue to scale, for much larger models - for example, 70B or 300B parameter LLMs?
- As a follow-up to the above: is this still useful if it doesn't, given that this is arguably the most significant field for pretraining and finetuning?
- How confident are you that this finding would apply in fields beyond image classification, as presented in the CIFAR and ImageNet-mini experiments?
Claims And Evidence
The paper makes a number of claims:
- That downstream free energy is a good proxy for the downstream performance of a model after finetuning. This seems to be generally justified via theory, and seems to hold based on the arguments presented in the paper.
- That pretraining free energy is a more measurable alternative to downstream free energy while upholding similar performance-prediction characteristics. Again, I think this holds, although I admit (as discussed below) that I found this discussion relatively confusing - possibly a function of my background not aligning with that of the paper.
- That the WBIC can be used to approximate pretraining free energy without requiring an expensive (and quite possibly intractable) integration calculation. This is demonstrated empirically for two experiments and seems to hold, though as an empirically minded researcher I would have liked to see this demonstrated in a couple of additional domains to ensure the finding holds generally (possibly in a task that was not image classification). That said, the key contribution of this paper is theoretical, and so I do not believe that this limits the correctness or validity of the work.
Methods And Evaluation Criteria
As stated above, this paper is principally theoretical (despite, admittedly, dealing with a very empirical topic). As such, while I believe consideration of additional benchmarks would be good - possibly for larger models, such as LLMs, given this is where the pretrain-then-finetune regime has proven most fruitful - I think the paper should be viewed through a more theory-focused lens. From this perspective, I believe the benchmarks used are valid and, while the empirical effectiveness of the work could be boosted, the experiments run in this paper provide enough support for it to stand.
I appreciated the narrative, in which the method was built up - I felt this had a very logical flow, and took practicality into consideration (which can be rare for theoretical papers). As such, the transition from downstream free energy -> pretraining free energy -> WBIC was very logical and I think makes a lot of sense for the problem at hand.
Theoretical Claims
I attempted to follow the theoretical claims made throughout the paper, but admit that this work is beyond my background and thus I did not always follow 100%. One thing I was unsure about was why the terms in the pretraining free energy are stochastic, whereas they weren't for downstream free energy, and think a qualitative sentence explaining this would provide some needed clarity.
I would suggest that other reviewers' assessments be weighted more heavily in this regard.
Experimental Designs Or Analyses
The experiments seem reasonably designed, and there is a description of the pretraining process and fine-tuning details in the appendix (including hyperparameters). I think some more variety would be good, rather than just focusing on image classification, to truly verify whether the correlation between WBIC and downstream performance is legitimate or coincidental (though, attached to the theory, I think it should hold). I particularly emphasise this as it also seems (by eye) that there is a close link between strong pretraining performance and strong downstream performance; ruling out an empirical link there, as is discussed in Observations 1 and 2, would help the experimental results, I think.
Supplementary Material
I spent some time considering the additional ImageNet results and examining the experimental design. I did not consider the proofs or examples in the supplementary material in much detail.
Relation To Broader Scientific Literature
The paper seems to contextualise itself well against prior literature, although I am not an expert in this field. There are comparisons against certain metrics which have been proposed in prior literature for similar problems.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Overall, I found most of the paper clear as a reader (particularly one who works in a different area to this work). I thought the structure of the narrative was good in building up a more complete picture of the method being presented.
That said, I found the paragraph starting on line 196 (left hand side), about how the checkpoints considered are not actually checkpoints, a bit confusing. I also found proposition 5.1 hard to follow.
Besides that, I felt this was a good paper.
Other Comments Or Suggestions
On line 267, in the right hand column, there is a missing full stop.
Thank you very much for your careful attention to our paper and thoughtful review. We are glad you think our paper is "clear" and that the "structure of the narrative was good". We will do our best to answer your concerns regarding potential weaknesses below.
Experimental Designs or Analyses: I think some more variety would be good, rather than just focusing on image classification
We fully agree that a more diverse suite of experiments beyond image classification would show that the correlation between WBIC and downstream performance is not coincidental. Thank you for the suggestion, and we look forward to addressing it in follow-up work.
Experimental Designs or Analyses: it also seems (by eye) that there is a close link between strong pretraining performance and strong downstream performance; ruling out an empirical link there, as is discussed in Observations 1 and 2, would help the experimental results I think.
The reviewer suspects a confound: maybe WBIC (and strong downstream performance) both correlate with strong pretraining performance, rather than with each other. We actually have some counterexamples to this, for instance the third row of Figure 2. Note that towards the end of training, all momentum values share the same pretraining loss yet the downstream performance is quite different; the pretraining WBIC can pick this up.
Other Strengths and Weaknesses: Overall, I found most of the paper clear [...] the structure of the narrative was good in building up a more complete picture of the method being presented. That said, I found the paragraph starting on line 196 (left hand side), about how the checkpoints considered are not actually checkpoints, a bit confusing. I also found proposition 5.1 hard to follow.
In regards to the question about "checkpoints considered are not checkpoints": We apologize for the confusion. We will clarify that “pretraining checkpoints” in our theoretical discussion refers to local minima of the test loss, which may differ from the actual checkpoints saved during training. We will further clarify that in order for the theory to match the empirical analysis, we stipulate the actual checkpoints saved during training are local minima of the training loss.
Regarding Prop 5.1 being hard to follow, do you mean that the statement of the proposition itself is hard to follow, the proof, or the discussion of how prop 5.1 is used to justify Eq (10), or something else?
Questions For Authors:
- How computationally costly is it to compute WBIC - is this something that drastically increased the overhead of the experiments in the paper?
- Do you think this method is both feasible, and will continue to scale, for much larger models - for example, 70B or 300B parameter LLMs?
- As a follow-up to the above: is this still useful if it doesn't, given that this is arguably the most significant field for pretraining and finetuning?
- How confident are you that this finding would apply in fields beyond image classification, as presented in the CIFAR and ImageNet-mini experiments?
- The original WBIC is very costly to compute. The local WBIC computed in this paper is much cheaper because the localizing prior forces the exploration to stay close to the checkpoint parameter \omega^*. We compute the local WBIC through SGLD sampling, which is computationally efficient for deep learning models (see the sketch after this list).
- Yes, there is no fundamental obstacle preventing the approach from scaling to much larger architectures—provided sufficient computational resources. In principle, this includes LLMs on the order of tens or hundreds of billions of parameters.
- See above.
- Our current theoretical results rely on conditions (e.g., mild distribution shifts) that are plausibly satisfied in the image classification tasks we considered. We are cautiously optimistic this approach could generalize to other tasks as well—potentially including text domains—but confirming that the same assumptions hold there would likely require additional theoretical and empirical investigation and is the focus of future work.
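For concreteness with respect to the first question, here is a minimal sketch of the kind of SGLD-based estimate we have in mind. It is illustrative only and not our actual implementation: the names (model, loss_fn, loader) and all hyperparameters (gamma, lr, n_steps) are placeholders, and the exact tempering and prior scaling we use may differ.

```python
import copy
import math

import torch


def local_wbic_sgld(model, loss_fn, loader, n, gamma=1e-3, lr=1e-6, n_steps=1000):
    """Rough SGLD estimate of a localized WBIC around the given checkpoint.

    model   : pretrained checkpoint (torch.nn.Module); its weights define the centre w*.
    loss_fn : average negative log-likelihood on a batch (e.g. cross-entropy).
    loader  : iterable of (x, y) pretraining batches.
    n       : pretraining sample size, which sets the inverse temperature 1/log n.
    gamma   : precision of the localizing Gaussian prior centred at w*.
    """
    beta = 1.0 / math.log(n)                        # WBIC inverse temperature
    centre = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)                  # the chain explores around w*
    nll_trace = []

    data_iter = iter(loader)
    for _ in range(n_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:                       # restart the loader when exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)

        sampler.zero_grad()
        nll = loss_fn(sampler(x), y)                # mini-batch estimate of L_n(w)
        (beta * n * nll).backward()                 # tempered log-likelihood potential

        with torch.no_grad():
            for p, c in zip(sampler.parameters(), centre):
                # drift = grad of tempered NLL plus grad of the localizing Gaussian prior
                drift = p.grad + gamma * (p - c)
                p.add_(-0.5 * lr * drift)                     # SGLD drift step
                p.add_(math.sqrt(lr) * torch.randn_like(p))   # SGLD injected noise

        nll_trace.append(nll.item())

    # WBIC estimate: average of n * L_n(w) over the chain, discarding the first half as burn-in
    burn_in = n_steps // 2
    return n * sum(nll_trace[burn_in:]) / len(nll_trace[burn_in:])
```

The overhead is essentially n_steps extra forward/backward passes per checkpoint; because the localizing prior keeps the chain near \omega^*, a short chain suffices, which is what makes this much cheaper than estimating the original (global) WBIC.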
Thank you again for your insights, which will help make our paper better!
Dear Authors,
Thank you for your rebuttal. I've responded to a couple of points below.
Re: more diverse experiments, I think this might be good to include in the paper as a limitation/proposed future work to be upfront about this restriction of the analysis.
Re: Counterexample, I agree with this point though it is worth noting that a more magnified scale may give a better idea of whether each run has converged to the same loss or whether the scale of loss is just smaller, if that makes sense?
Re: Prop 5.1, I think my finding this difficult is more likely due to the fact that I am an empirical researcher in a different area, and thus this is likely my ignorance showing - reviewer ek2K has verified the correctness of this proposition and I am, therefore, content.
Re: questions, thank you for answering these. I would love to see a sentence (possibly in the future work) suggesting that this should be able to scale to larger models.
While this paper is beyond my area of expertise, and I do not plan on increasing my score since I don't believe this work has the very high impact expected of a paper rated 5, I believe this is a good paper worthy of acceptance to ICML.
We appreciate your valuable feedback -- we’ll incorporate these suggestions into our revision.
This paper introduces a Bayesian model selection criterion, called the downstream free energy, to improve fine-tuning performance. There are both theoretical and empirical results provided.
Questions For Authors
See strengths and weaknesses.
Claims And Evidence
Yes. Section 5 is about theoretical results, and empirical results are in Section 6.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes. Proposition 5.1 is correct.
Experimental Designs Or Analyses
Yes. Section 6 is about empirical results. And there are also some empirical details in appendix.
Supplementary Material
The appendix is about the proof of proposition 5.1, and some experimental details.
Relation To Broader Scientific Literature
This paper is mainly related to the study of model generalization performance, which mainly focuses on controlling the upper bound of the test error using information about the training error and the function class.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
Strengths
- This paper is well-written, providing a clear statement of the results.
- There is both theoretical and empirical evidence to support the idea.
Weaknesses
- There is a lack of discussion about the relationship between model generalization performance and the free energy. Please provide more explanation of this point.
- The models used in this paper are relatively small. Are the same results guaranteed on larger models?
Other Comments Or Suggestions
See strengths and weaknesses.
Thank you for your time in reviewing our paper. We are glad you think our paper is "well-written" and provides a "clear statement of results" supported by both "theoretical and empirical evidence". Below, we address the potential weaknesses you mentioned, and we hope these clarifications will encourage you to consider raising your score.
Weaknesses
- There is a lack of discussion about the relationship between model generalization performance and the free energy. Please provide more explanation of this point.
- The models used in this paper are relatively small. Are the same results guaranteed on larger models?
In regards to Weakness 1: Thank you for emphasizing the importance of the relationship between model generalization performance and energy. We agree that explaining this connection is crucial. In fact, Section 5.1 of our paper already provides a detailed discussion of how the free energy (and its complexity term) interacts with test loss to shape model generalization, through a series of Observations tied to our main proposition. If these details were inadvertently overlooked, we kindly invite you to revisit that section and let us know if anything remains unclear or incomplete.
In regards to Weakness 2: The theory we develop here does not make any assumptions about model size or complexity, so yes, the same results should hold for larger models as well. However, as we state in Section 7, 'Conclusion and Future Work', the bottleneck is computation, which can be challenging for very large models. This is an intriguing area of future work, and we also suggest some alternative approaches to address this limitation there.
Thanks for the authors' reply. It has addressed part of my questions. I will keep the positive score.
This paper proposes a novel Bayesian model selection criterion based on the concept of downstream free energy, which quantifies the adaptability of a model checkpoint to downstream tasks by measuring the concentration of favorable parameters in its vicinity. A key strength of the method is that it can be computed without access to downstream data or prior knowledge about the downstream tasks. The proposed criterion is theoretically motivated, practically useful, and shown to be strongly correlated with downstream performance.
Overall, the reviewers appreciated the novelty and quality of the submission, and I share their positive assessment. The idea of downstream free energy is both original and insightful, with sound theoretical backing and practical implications. While some concerns were raised regarding the omission of related work, I believe these issues are minor and can be addressed in the final version. I urge the authors to ensure a thorough revision addressing all reviewer comments, such as incorporating the missing references.