PaperHub
Overall: 6.8/10 | Poster | 4 reviewers
Ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 2.8 | Novelty: 2.5 | Quality: 2.8 | Clarity: 2.5 | Significance: 3.0
NeurIPS 2025

On the Stability and Generalization of Meta-Learning: the Impact of Inner-Levels

OpenReview | PDF
Submitted: 2025-05-05 | Updated: 2025-10-29

Abstract

Keywords: Stability and Generalization, Meta-Learning

Reviews and Discussion

Review (Rating: 4)

This paper provides new algorithmic-stability-based generalisation bounds for meta-learning, focussing on the impact of inner-level optimization. The authors consider two broad classes of inner-optimisation frameworks: the gradient descent framework (GDF), which includes the classical MAML algorithm, and the proximal descent framework (PDF), which includes the iMAML and Meta-Minibatch Pros algorithms. Leveraging the on-average algorithmic stability framework, the authors derive new meta-generalisation bounds that reveal the influence of the number $Q$ of inner-level iterations. For the GDF framework, the authors report a trade-off relationship with respect to the choice of $Q$. The authors finally also propose a meta-objective with improved generalization performance.

Strengths and Weaknesses

The paper belongs to a recent series of works that have explored the generalization analysis of meta-learning algorithms. The general classification of the existing frameworks into GDF and PDF is interesting, and it enables the authors to simultaneously investigate multiple existing algorithms.

However, the paper is not easy to read and lacks clarity in many sections. There is an overload of notation, especially in Section 3, which makes it difficult to follow. I believe a thorough re-writing of Section 3 is required. For instance, in Section 3.2, the authors define $\hat{F}_i(w,S_i)$ and relate it to $\hat{\mathcal{L}}_{\lambda}(w,S_i)$, which is not used anywhere else. Later on, they introduce $\hat{\mathcal{K}}(w,S_i)$, whose relationship with $\hat{F}_i(w,S_i)$ requires clarification. In Algorithm 1, line 7, what is the loss function $\mathcal{L}$? Shouldn't this be $\mathcal{L}_i$?

Significance and Originality: The investigation of the impact of inner-level optimization on the generalization of meta-learning algorithms is not new. For instance, reference [1] has already studied this in the GDF framework and has reported a similar trade-off relationship with respect to the number $Q$ of inner-level iterations. The authors in this work have gone beyond the GDF and have analysed PDF under both convex and non-convex assumptions on the loss function.

Weaknesses: Please find the weaknesses under the Questions Tab.

Reference: [1] Zhou, P., Zou, Y., Yuan, X.T., Feng, J., Xiong, C. and Hoi, S., 2021, December. Task similarity aware meta learning: Theory-inspired improvement on MAML. In Uncertainty in artificial intelligence (pp. 23-33). PMLR.

Questions

I think the following points need clarification:

  1. The authors use Fig. 1 to motivate the problem setting. However, their claim regarding MAML -- that the generalization error first decreases with $Q$ and then increases -- is not convincing from Fig. 1(a). This would require more plots corresponding to increasing values of $Q>20$.

  2. In Section 4, the authors present Theorem 2 and Theorem 3. The assumptions in both theorems are the same. It's hard to see the difference in the settings (strongly convex/non-convex) considered. Better statements of the theorems are required. Along this line, even the statement of Theorem 1 is incomplete, with no reference to $\epsilon$.

  3. Following Theorem 2 and Remark 2, it seems that the proposed bound in Theorem 2 is looser than the one in reference [9] in the paper. The authors claim that setting $\alpha \leq 1/(QL)$ simplifies the first term in the bound of [9] to scale as $O(1/n)$, mitigating the dependence on $Q$. However, under the same $\alpha$, the first term in Theorem 2 still has a linear dependence on $Q$. Note that for sufficiently large $Q$, $\alpha \leq 1/(QL)$ yields a negligibly small learning rate, implying that the inner update will be close to the initialisation $w^t_{\mathcal{T}_i,0}$ for each task. Isn't it then reasonable to assume that no dependence on $Q$ in this regime is better?

  4. Additionally, the term $F(w^0)-\min_w F(w)$ accounts for the difference between the meta-objective $F(\cdot)$ evaluated at the initialisation $w^0$ and that evaluated at the best initialisation $w^* = \arg\min_w F(w)$. It is not clear how increasing $Q$ can decrease this difference, as it depends on the proximity of $w^0$ to $w^*$. Of course, this would be true if the term were $F(w^T)-\min_w F(w)$, the difference between the meta-learned initialisation and the best initialisation possible. I believe reference [9] has the latter term in their bound. The above issue also arises in Theorems 3, 4, and 5. This concern could also be attributed to the lack of clear definitions in the paper. In this sense, I am also unsure of the relevance of Section 4.3.

  5. More discussion is required on how the proposed bounds in Theorems 4 and 5 for the PDF framework compare to other existing bounds.

Limitations

The work is purely theoretical in nature, and does not pose any negative societal impact.

Final Justification

Following the rebuttal, the authors have clarified my main concerns. The notational discrepancies were corrected, and additional experimental results are included. Given the strong theoretical contributions of the paper, I am happy to increase my score to 4: Borderline acceptance.

Formatting Issues

Already listed

Author Response

We thank you for all your valuable comments. We reply to the weaknesses and questions individually below.

Weakness 1: Although we acknowledge certain typographical errors pointed out by the reviewer, we believe we have clarified the meaning of all symbols used in our paper. Regarding the terms $\hat{F}_i(w, S_i)$ and $\hat{\mathcal{L}}_{\lambda}(w, S_i)$ and their relationship, we have explained them in lines 144-145. In particular, $\hat{F}_i(w, S_i)$ represents the empirical loss, i.e., the training loss, which has been explained in lines 132-133. We use $\hat{\mathcal{L}}_{\lambda}(w, S_i)$ to reflect the effect of $\lambda$; if we did not do so, the definition of $F$ would be confusing. As for the term $\widehat{\mathcal{K}}(w, S_i)$, we have addressed it in lines 146-149: "For a practical PDF-based algorithm, it is difficult to obtain an exact solution of $\widehat{\mathcal{L}}_{\lambda}(w, S_i)$, so we typically resort to an inexact solution, $\widehat{\mathcal{K}}(w, S_i) = \frac{1}{n} \sum_{z \in S_i} \ell(w_{\mathcal{T}_i}, z) + \frac{\lambda}{2} \| w_{\mathcal{T}_i} - w \|^2$."

Weakness 2 and Question 3: We would like to emphasize that our work significantly extends the contributions of [1]. While [1] investigates the trade-off in MAML, their analysis is conducted within a relatively narrow scope and lacks a general or extensible framework. In contrast, we introduce two flexible and broadly applicable frameworks, GDF and PDF, which are designed to accommodate a wide range of meta-learning algorithms. To illustrate the generality and practicality of these frameworks, we derive generalization bounds for six different algorithms. Moreover, our GDF framework provides a more robust and interpretable characterization of the trade-off. Specifically, when comparing our Theorem 2 with the results of [1], consider the case where $T=m$. Under this, our Theorem 2 yields a bound $\mathcal{O}(\frac{Q}{n}+\frac{\sqrt{F(w^0)-\min_{\mathcal{W}}F}}{m})$, whereas the bound in [1] is $\mathcal{O}(\frac{1}{n}+F(w^T)-\min_{\mathcal{W}}F)$. It is evident that our second term $\frac{\sqrt{F(w^0)-\min_{\mathcal{W}}F}}{m}$ is significantly lower than the corresponding $F(w^T)-\min_{\mathcal{W}}F$ in [1]. Although our first term is looser than their $\frac{1}{n}$, we argue that this is an intuitive outcome. In particular, the statement "For sufficiently large $Q$, $\alpha\leq \frac{1}{QL}$ yields a negligibly small learning rate, implying the inner update will be close to the initialisation $w_{\tau_i,0}^t$" reinforces the validity of our bound. Intuitively, if the inner-loop update is too close to the initialization, the model fails to learn task-specific information effectively, leading to larger generalization error, similar to the behavior observed when $Q=1$. However, the results in [1] cannot explain the trade-off relationship when $Q$ is either sufficiently small or sufficiently large under $\alpha\leq \frac{1}{QL}$, which means their result is not robust.

Question 1: The core idea of MAML involves applying a few gradient descent steps on a small number of training samples to adapt the meta-model to test tasks. Therefore, we use Fig. 1 to illustrate an interesting phenomenon, i.e., that a larger $Q$ is not suitable for MAML. To address the reviewer's concerns, we also provide a further experiment for MAML with $Q>20$. The experiment is also conducted on the Omniglot dataset. We employ a $3\times 3$ CNN and use the Cross-Entropy Loss as the loss function. For each training task, we set the number of support samples and query samples to $n^{\rm tr}=1$ and $n^{\rm ts}=5$, respectively. For each test task, we use $n^{\rm tr}=1, n^{\rm ts}=15$. In addition, each task is formulated as a $5$-way classification problem.

| $Q$ | 2 | 5 | 10 | 20 | 30 | 40 | 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Error | 0.6429 | 0.2868 | 0.2300 | 0.2417 | 0.2423 | 0.2541 | 0.2622 |

Question 2: In theoretical analysis, two identical assumptions do not necessarily imply that the difficulty and process of derivation under the two settings are the same, especially in stability analysis. In fact, the conditions on some parameters are stricter in the non-convex setting. For instance, under the convex setting, the step size satisfies $\eta\leq\frac{1}{L_Q}$ with $L_Q = \alpha\rho Q(1+\alpha L)^{Q-1} + (1+\alpha L)^{Q}L$. In contrast, under the non-convex setting, the step size satisfies $\eta_t \leq \frac{c}{t}$ and $c\leq\min\{\frac{1}{L_Q},\frac{1}{4(2L_Q\ln(T))^2}\}$ with $L_Q= \frac{3\rho(1+\alpha L)^{2(Q-1)}}{L} + (1+\alpha L)^{Q}L$.

Question 4: Referring to Eq. (1) and Eq. (2) in the paper, we have $F(w)=\frac{1}{m}\sum_{i=1}^{m} F_i(w)=\frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\mathcal{D}_i} \mathbb{E}_{z\sim\mathcal{P}_i}[\ell(w_{\mathcal{T}_{i}}^Q(w, \mathcal{D}_i),z)]$. We can then see that $F_i(w)$ benefits from $Q$ because $w_{\tau_i}^Q(w, \mathcal{D}_i)=w-\alpha\sum_{q=0}^{Q-1}\nabla \widehat{\mathcal{L}}_i( w_{\tau_i}^q,\mathcal{D}_i)$. Therefore, both $F(w^T)-\min_{\mathcal{W}}F$ and $F(w^0)-\min_{\mathcal{W}}F$ can be improved by increasing $Q$. The main difference is that $F(w^T)-\min_{\mathcal{W}}F$ can also be improved by the number of outer-loop iterations $T$.

Question 5: Since our work primarily focuses on generalization error, while other studies emphasize transfer error, it is difficult to make a fair comparison regarding the tightness of the bounds. Generalization error measures the discrepancy between training tasks and test tasks, making it a key metric for evaluating the practical capability of meta-learning algorithms.

[1] Zhou, P., Zou, Y., Yuan, X.T., Feng, J., Xiong, C. and Hoi, S., 2021, December. Task similarity aware meta learning: Theory-inspired improvement on MAML. In Uncertainty in artificial intelligence (pp. 23-33). PMLR.

Comment

I thank the authors for clarifying my concerns and for providing extended experimental results. I would like to increase my score and lean towards borderline acceptance.

Comment

We sincerely appreciate the reviewers’ recognition of our work. If there are any additional questions you would like to discuss, please feel free to let us know.

Review (Rating: 5)

This paper mainly focuses on exploring meta-learning algorithms. Whereas previous works usually pay more attention to training strategies and data, the impact of inner-loop optimization on generalization has been ignored. Thus, in this paper, the authors explore both gradient descent and proximal descent frameworks to obtain insights regarding the meta-learning paradigm.

Strengths and Weaknesses

Pros:

  • Different from previous work, this work dives into the impact of inner-loop optimizations on the generalization in meta-learning algorithms.
  • This work achieves some interesting insights from the perspective of theory.
  • The paper starts from an interesting phenomenon of the generalization error and then develops many valuable insights regarding meta-learning, which is impressive.
  • The results reported in this paper are good.

Cons: From my perspective, I do not think this work has obvious drawbacks. I like this work.

Questions

  • Could you provide some explanation regarding Eq. (2)? I do not really understand the formulation.
  • What is the $\mathcal{K}$ in Algorithm 1? Although it is explained later, it would be better to define it beforehand.

Limitations

N/A

Final Justification

I will maintain my score regarding this paper since this paper is good and my concerns have been well addressed.

Formatting Issues

N/A

Author Response

We thank you for all your valuable comments. We reply to the questions individually below.

Question 1: In the inner-level process of GDF, the learner $w$ needs to find task-specific parameters $w_{\tau_i}$ after undergoing $Q$ gradient updates. This process can be formulated as $w_{\tau_i}=w-\alpha\sum_{q=0}^{Q-1}\nabla \widehat{\mathcal{L}}_i(w_{\tau_i}^q,\mathcal{D}_i)$. To emphasize how its inner level differs from PDF, we give the optimization objective of the inner process, i.e., Eq. (2): $\min_{w_{\mathcal{T}_{i}}}\left\langle\sum\nolimits_{q=0}^{Q-1}\nabla \widehat{\mathcal{L}}_{i}(w_{\mathcal{T}_{i}}^{q},\mathcal{D}_i),w_{\mathcal{T}_{i}}-w\right\rangle+\frac{1}{2\alpha}\|w_{\mathcal{T}_{i}}-w\|_{2}^{2}$. By taking the derivative with respect to $w_{\tau_i}$ and setting the gradient equal to $0$, the solution of Eq. (2) can be written as $w_{\tau_i}=w-\alpha\sum_{q=0}^{Q-1}\nabla \widehat{\mathcal{L}}_i (w_{\tau_i}^q,\mathcal{D}_i)$.
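As a quick numerical sanity check of this closed form (a hypothetical toy in which a random vector `g` stands in for the summed inner-loop gradients):

```python
import numpy as np

# Check numerically that the minimizer of the linearized inner objective
#   <g, w_T - w> + ||w_T - w||_2^2 / (2 * alpha)
# is w_T* = w - alpha * g, i.e. the GDF inner update described above.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
g = rng.normal(size=3)          # stands in for sum_q grad L_i(w^q)
alpha = 0.1

def objective(w_T):
    d = w_T - w
    return g @ d + d @ d / (2 * alpha)

w_star = w - alpha * g          # closed form from setting the gradient to 0
assert objective(w_star) <= objective(w)              # better than no update
for _ in range(200):            # no random perturbation should do better
    assert objective(w_star) <= objective(w_star + 0.01 * rng.normal(size=3))
```

The quadratic term $\frac{1}{2\alpha}\|w_{\mathcal{T}_i}-w\|_2^2$ is what keeps the minimizer a single gradient-style step of size $\alpha$ away from $w$.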

Question 2: As illustrated in Section 3.2, PDF aims to obtain the exact solution of its empirical meta-loss $\widehat{L}_{\lambda}(w, S_i)=\min_{w_{\mathcal{T}_i}}\{\frac{1}{n}\sum_{z\in S_i} \ell(w_{\mathcal{T}_i} , z)+\frac\lambda2\|w_{\mathcal{T}_i}-w\|^2 \}$, which is difficult in practical algorithms. Therefore, we often solve an approximation, i.e., $\widehat{\mathcal{K}}(w, S_i)=\frac{1}{n}\sum_{z\in S_i} \ell(w_{\mathcal{T}_i}, z)+\frac\lambda2\|w_{\mathcal{T}_i}-w\|^2$, in Algorithm 1.
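For intuition, a hypothetical sketch (not the paper's Algorithm 1) of obtaining such an inexact proximal solution by gradient descent on $\widehat{\mathcal{K}}$, assuming a generic `grad_loss` callback:

```python
import numpy as np

def prox_inner(w, grad_loss, data, lam=1.0, steps=200, lr=0.05):
    """Inexactly minimize  mean_loss(w_T) + (lam / 2) * ||w_T - w||^2
    by gradient descent, returning an approximate proximal point."""
    w_T = np.asarray(w, dtype=float).copy()
    for _ in range(steps):
        w_T = w_T - lr * (grad_loss(w_T, data) + lam * (w_T - w))
    return w_T

# Toy usage: for l(w, z) = ||w - z||^2 / 2 the mean-loss gradient is
# w_T - mean(data), so the exact proximal point is
# (mean(data) + lam * w) / (1 + lam); the iterates converge to it.
grad = lambda w_T, d: w_T - np.mean(d)
approx = prox_inner(np.zeros(1), grad, [2.0], lam=1.0)
```

The regularizer $\frac{\lambda}{2}\|w_{\mathcal{T}_i}-w\|^2$ anchors the inexact solution near the meta-parameters $w$, which is the defining feature of the PDF inner level.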

Review (Rating: 4)

This paper considers two settings, the gradient descent framework (GDF) and the proximal descent framework (PDF), for a number of common meta-learning strategies. It discusses two common metrics, the generalisation error and the optimisation error, which constitute the test error. Given convergence of the optimisation error, the paper produces theory about how the generalisation error changes for GDF and PDF. It subsequently provides a new meta-objective function, informed by the theory, and a small number of experiments demonstrating that their method has better optimisation characteristics (i.e., better generalisation behaviour than more standard meta-objectives).

Strengths and Weaknesses

To preface, I am an empirical researcher, and so while I did my best to follow the theory, I admit that at times certain bounds and literature were unfamiliar to me. For this reason, I submit my review with slightly lower confidence.

Strengths:

  • Throughout the paper, the work is constantly compared and contrasted with preexisting literature. As such, I felt it was very well framed with regards to the surrounding field.
  • For a theoretically-focused paper, I felt there was still a reasonable empirical comparison and demonstration that their theoretical bounds transferred to practical tests.
  • The paper had a natural narrative, following from introducing the problem setting -> introducing types of errors -> discussing how these errors affect generalisation -> providing theoretical bounds -> showing these bounds echo what happens in practice.
  • From what I could follow, the paper is rigorous in its theorems and provides corresponding proofs in the supplementary material. For the sake of final publications, I think it would be good to provide these proofs in the appendix (so that they can all be found in the same PDF).

Weaknesses:

  • There were a number of typos that I spotted (discussed below).
  • I felt like the paper could have introduced a number of concepts in a background section before jumping in. For instance, what is $Q$? Initially I assumed this was the number of inner-level gradient updates (e.g., for MAML), but the way it is framed in the paper has me guessing whether this is the case.
  • I think this may be more common in theoretical papers, but it felt like a lot of important concepts were quoted by reference without actually saying what they were. It would be good for some of these to be directly discussed in the main text.
  • Occasionally, I found it difficult to see where leaps in theory came from. This is partly a result of my naivete regarding theoretical meta-learning, as well as the fact that the proofs were held in a separate PDF.
  • There is no access to the code, and while the questionnaire states that significance in experiments was provided, this does not seem to be the case. It is not completely obvious to me that the new proposed objective (which basically just weights larger losses more) actually improves performance, given that most lines are practically on top of each other and there are no error bars.
  • I can't quite follow the logic from the theory to the new loss. To me, the new loss seems to come out of nowhere, and it also doesn't seem to have the GDF vs. PDF framing or an explanation of why it is useful.

Questions

  • This work focuses on a subset of meta-learning. Does this theory generalise beyond these MAML-oriented meta-learning approaches?
  • Can you explain, for my sake, what $Q$ is actually referring to?
  • How does the tightness and derivation of your bound compare to the others derived in theory that are referenced here?
  • Where does the intuition of the new meta-loss come from?

Below I provide some examples of typos:

  • Line 17: "well-generalisation" is confusing
  • Line 26: "Q inner-level gradient updating"
  • Line 45: "increases with the number of inner-levels Q grows"
  • Line 111: "wells" -> "well"
  • Line 145: "forlicity"
  • Lines 329/333: "We" shouldn't be capitalised; neither should "Objective"

Limitations

N/A

Final Justification

I feel the theory is good, and that the empirical side of the paper leaves something to be desired. I also find that there are a number of clarity issues. However, I think the contribution of the paper outweighs these issues, and thus I maintain a score of 4: borderline acceptance.

Formatting Issues

N/A

Author Response

Thank you for your valuable suggestions; we reply to them as follows.

Weakness 1: We appreciate the reviewer for pointing out the typographical errors, which we will promptly correct.

Weakness 2 and Question 2: To further clarify the meaning of $Q$, we can use MAML as an example. MAML utilizes available training data from multiple tasks to derive a meta-initialization that performs well after a few updates during testing, tailored to a new task. Unlike traditional supervised learning, where the goal is to find a model that generalizes well to a new task without any adaptation, MAML focuses on finding an initial model that can efficiently learn a new task with only a small amount of labeled data by performing $Q$ gradient updates. Therefore, in this paper, our framework undergoes $Q$ gradient updates at the inner level to adapt to a small amount of data. We will add explanations of the relevant concepts based on your advice.

Weakness 3: We are sorry for the confusion and will explain these in detail. Our work primarily focuses on the generalization error in meta-learning, which quantifies the performance gap between training and testing phases. The upper bound of this generalization error characterizes the worst-case performance at test time, and thus optimizing this upper bound is critical for ensuring the real-world applicability of meta-learning algorithms. Moreover, stability is a powerful analytical tool in generalization theory. It helps us understand which parameters affect generalization and how they influence it. Crucially, the definition of stability plays a central role in this analysis, as it directly determines the tightness and relevance of the resulting bounds. Therefore, we introduce two novel stability definitions for our frameworks.

Weakness 4: We further elaborate on our results and the proof approach. Our work focuses on the generalization error in meta-learning, and stability proves to be a powerful tool for deriving the generalization error of algorithms. The first step involves introducing two novel stability definitions for our frameworks, namely GDF and PDF. Based on these definitions, we derive Theorem 1, which states: $\mathbb{E}_{\mathcal{A}, \mathcal{S}}\left[F(w_{\mathcal{S}})-\widehat{F}(w_{\mathcal{S}},\mathcal{S}_{i})\right] \leq \frac{1}{m} \sum_{i=1}^m \frac1{n^{\rm ts}} \sum_{j=1}^{n^{\rm ts}} \mathbb{E}_{\mathcal{A}, \mathcal{S}, \widetilde{S}_{i,k}^{\rm tr}, \tilde{z}_{i,j}^{\rm ts}}[\ell(w_{\mathcal{T}_i}(w_{\mathcal{S}}, \widetilde{S}_{i,k}^{\rm tr}), \tilde{z}_{i,j}^{\rm ts})- \ell(w_{\mathcal{T}_i}(w_{\widetilde{\mathcal{S}}}, \widetilde{S}_{i,k}^{\rm tr}), \tilde{z}_{i,j}^{\rm ts})] \leq \epsilon.$ This provides a bound for the generalization error $\epsilon_{\rm gen}$. It is important to note that $w_{\mathcal{S}}$ and $w_{\widetilde{\mathcal{S}}}$ are two different outputs of a randomized algorithm $\mathcal{A}$, corresponding to two neighboring datasets $\mathcal{S}$ and $\widetilde{\mathcal{S}}$ that differ by a single data point. Therefore, in our proof, the key idea is to bound the difference $\ell(w_{\mathcal{T}_i}(w_{\mathcal{S}}, \widetilde{S}_{i,k}^{\rm tr}), \tilde{z}_{i,j}^{\rm ts})- \ell(w_{\mathcal{T}_i}(w_{\widetilde{\mathcal{S}}}, \widetilde{S}_{i,k}^{\rm tr}), \tilde{z}_{i,j}^{\rm ts})$. Under our assumption that the loss function is $G$-Lipschitz over $\mathbb{R}^d$, i.e., $|\ell(w,z) - \ell(u,z)|\leq G\|w-u\|$, we aim to bound the difference between $w_{\mathcal{T}_i}(w_{\mathcal{S}}, \widetilde{S}_{i,k}^{\rm tr})$ and $w_{\mathcal{T}_i}(w_{\widetilde{\mathcal{S}}}, \widetilde{S}_{i,k}^{\rm tr})$. This difference inherently depends on the update rule used by different algorithms.
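As a toy illustration of this neighboring-dataset argument (my own hypothetical example with an empirical-mean learner and an absolute-error loss, which is $G$-Lipschitz with $G=1$, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
S = rng.normal(size=n)
S_tilde = S.copy()
S_tilde[0] += 1.0                     # neighboring dataset: one point replaced

w_S, w_St = S.mean(), S_tilde.mean()  # learner output on each dataset

# l(w, z) = |w - z| is 1-Lipschitz in w, so on every test point the loss
# gap is at most |w_S - w_St|, which for the mean learner is exactly 1/n.
z_test = rng.normal(size=1000)
gap = np.abs(np.abs(w_S - z_test) - np.abs(w_St - z_test)).max()
assert gap <= abs(w_S - w_St) + 1e-12
assert abs(abs(w_S - w_St) - 1.0 / n) < 1e-12
```

The Lipschitz property turns a bound on the parameter difference into a bound on the loss difference, which is the step the proof sketch above relies on.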

Weakness 5: We have submitted our code in the supplementary materials. You can check them by downloading the supplementary materials.

Weakness 6 and Question 4: Our results indicate that the objective value $F(w^0)$ plays a crucial role in reducing the generalization bound. For instance, in the non-convex setting, the generalization bound of GDF is $\mathcal{O}((\frac{1+\frac{1}{c\gamma}}{m}\Phi_Q)^{\frac{1}{c\gamma}}(F(w^0)T)^{\frac{c\gamma}{1+c\gamma}})$ and the generalization bound of PDF is $\mathcal{O}((\frac{1+\frac{1}{c\gamma}}{m}\Phi_Q)^{\frac{1}{c\gamma}}(F(w^0) T)^{\frac{c\gamma}{1+c\gamma}})$. We can observe that the size of $F(w^0)$ is important when $T$ is sufficiently large.

To reduce the term $F(w^0)$, we begin with the definition $F(w)=\frac{1}{m}\sum_{i=1}^m F_i(w)$. Let us consider a simple relaxation technique: given a constant $\beta\in[0,1)$ and $\hat{m}\leq \frac{m}{2}$, with the losses $\{F_i(w^0)\}$ sorted in ascending order, we obtain $\sum_{i=1}^{\hat{m}}F_i(w^0)^{\psi(i)} \leq \sum_{i=\hat{m}+1}^{m}F_i(w^0)^{\psi(i)}$, where $\psi(i)$ is a mapping function associating the index with the original loss $F_i(w^0)$. Based on this, we can derive $F(w^0) = \frac{1}{m}\sum_{i=1}^m F_i(w^0)^{\psi(i)} \geq \frac{1+\beta}{m}\sum_{i=1}^{\hat{m}}F_i(w^0)^{\psi(i)} + \frac{1-\beta}{m} \sum_{i=\hat{m}+1}^{m} F_i(w^0)^{\psi(i)}$. Motivated by this result, we propose the following optimization objective:

$$F_{\rm new}(w)=\frac{1+\beta}{\tilde{m}}\sum\nolimits_{i=1}^{\hat{m}}F_i(w)^{\psi(i)} + \frac{1-\beta}{\tilde{m}}\sum\nolimits_{i=\hat{m}+1}^{m}F_i(w)^{\psi(i)},$$

where $\tilde{m}=(1+\beta)\hat{m}+(1-\beta)(m-\hat{m})$ ensures the lower-bound derivation.

Question 1: Our work mainly focuses on optimization-oriented approaches, which constitute a major branch of meta-learning. Therefore, we summarize the common characteristics of most optimization-oriented algorithms and classify them into GDF and PDF according to their different inner training methods. To validate the scalability of our theoretical frameworks, we analyzed six popular algorithms, five of which are MAML-oriented (all except MetaProx). In fact, our frameworks can also be extended to other meta-learning approaches, e.g., MetaOptNet [1] and Hyper-representation [2].

Question 3: Since our work primarily focuses on generalization error, while other studies emphasize transfer error, it is difficult to make a fair comparison regarding the tightness of the bounds. Generalization error measures the discrepancy between training tasks and test tasks, making it a key metric for evaluating the practical capability of meta-learning algorithms. Although some previous works [3][4] have analyzed the generalization error of MAML using stability, their analyses are limited to the strongly convex setting. In contrast, our study extends the analysis to the more challenging and practically relevant convex and non-convex settings. The most closely related result to our Theorem 2 is Theorem 1 in [5]. Under the convex setting, if we let $T=m$, then our Theorem 2 yields a bound $\mathcal{O}(\frac{Q}{n}+\frac{\sqrt{F(w^0)-\min_{\mathcal{W}}F}}{m})$. In contrast, the bound in [5] is $\mathcal{O}(\frac{1}{n}+F(w^T)-\min_{\mathcal{W}}F)$. We can observe that our second term $\frac{\sqrt{F(w^0)-\min_{\mathcal{W}}F}}{m}$ is significantly lower than the corresponding $F(w^T)-\min_{\mathcal{W}}F$ in [5]. Although our first term $\frac{Q}{n}$ is looser than their $\frac{1}{n}$, we retain the trade-off relationship with respect to $Q$.

[1] Lee, Kwonjoon, et al. "Meta-learning with differentiable convex optimization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[2] Franceschi, Luca, et al. "Bilevel programming for hyperparameter optimization and meta-learning." International Conference on Machine Learning. PMLR, 2018.
[3] Fallah, Alireza, Aryan Mokhtari, and Asuman Ozdaglar. "Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks." Advances in Neural Information Processing Systems 34 (2021): 5469-5480.
[4] Wang, Lianzhe, et al. "Improving generalization of meta-learning with inverted regularization at inner-level." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[5] Zhou, Pan, et al. "Task similarity aware meta learning: Theory-inspired improvement on MAML." Uncertainty in Artificial Intelligence. PMLR, 2021.
Comment

Dear authors,

Thank you for your response.

Re: $Q$, that's exactly what I thought. As a meta-learning researcher, I can follow that quite clearly, but I think the way the notation is introduced in the paper may cause clarity issues for those not already familiar with the field.

Re: statistical significance, please correct me if I am wrong, but the answer in the questionnaire is wrong, no? There are no results reported with significance bars.

Your response for Q6/W4 unfortunately hasn't been formatted correctly.

My feeling is to maintain my score, which recommends borderline acceptance, and I think this is broadly in line with other reviewers. My sense is that the theoretical side of the paper is solid, and that the empirical side leaves something to be desired. However, I choose to maintain my low confidence as I do not feel my knowledge of the theory is sufficient to contextualise this work in the surrounding literature.

Comment

Thanks for the follow up discussion.

Re: We appreciate and understand the reviewer's concerns regarding the clarity of $Q$, and we will further clarify and improve it in future revisions.

Re: Regarding the error bars, we acknowledge the limitations of our experiments and fully understand the reviewer's concerns. We apologize for not being able to address this issue within the one-week rebuttal period, as training meta-learning models becomes extremely time-consuming, especially when $Q$ is large. If our paper is accepted, we will make sure to add the error bars in the final version.

Re (Q6/W4): Our results indicate that the objective value $F(w^0)$ plays a crucial role in reducing the generalization bound. For instance, in the non-convex setting, the generalization bound of GDF is $\mathcal{O}\big((\frac{1+\frac{1}{c\gamma}}{m}(1+\frac{Q}{n^{\rm tr}}))^{\frac{1}{c\gamma}}(F(w^0)T)^{\frac{c\gamma}{1+c\gamma}}\big)$ and the generalization bound of PDF is $\mathcal{O}\big((\frac{1+\frac{1}{c\gamma}}{m}(1+\frac{1}{C^Q}))^{\frac{1}{c\gamma}}(F(w^0)T)^{\frac{c\gamma}{1+c\gamma}}\big)$. We can observe that the size of $F(w^0)$ is important when $T$ is sufficiently large. To reduce the term $F(w^0)$, we begin with the definition $F(w)=\frac{1}{m}\sum_{i=1}^m F_i(w)$. Let us consider a simple relaxation technique: given a constant $\beta\in[0,1)$ and $\hat{m}\leq \frac{m}{2}$, with the losses $\{F_i(w^0)\}$ sorted in ascending order, we obtain $\sum_{i=1}^{\hat{m}}F_i(w^0)^{\psi(i)}\leq \sum_{i=\hat{m}+1}^{m}F_i(w^0)^{\psi(i)}$, where $\psi(i)$ is a mapping function associating the index with the original loss $F_i(w^0)$. Based on this, we can derive $F(w^0) = \frac{1}{m}\sum_{i=1}^m F_i(w^0)^{\psi(i)}\geq \frac{1+\beta}{m}\sum_{i=1}^{\hat{m}}F_i(w^0)^{\psi(i)} + \frac{1-\beta}{m}\sum_{i=\hat{m}+1}^{m}F_i(w^0)^{\psi(i)}$. Motivated by this result, we propose the following optimization objective:

$$F_{\rm new}(w)=\frac{1+\beta}{\tilde{m}}\sum\nolimits_{i=1}^{\hat{m}}F_i(w)^{\psi(i)} + \frac{1-\beta}{\tilde{m}}\sum\nolimits_{i=\hat{m}+1}^{m}F_i(w)^{\psi(i)},$$

where $\tilde{m}=(1+\beta)\hat{m}+(1-\beta)(m-\hat{m})$ ensures the lower-bound derivation.
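A hypothetical sketch (my illustration, not the authors' implementation) of computing this reweighted objective from a vector of per-task losses:

```python
import numpy as np

def f_new(task_losses, beta=0.5, m_hat=None):
    """Relaxed meta-objective: sort per-task losses in ascending order
    (realizing the mapping psi), weight the m_hat smallest by
    (1 + beta) / m_tilde and the remaining ones by (1 - beta) / m_tilde."""
    losses = np.sort(np.asarray(task_losses, dtype=float))  # ascending order
    m = len(losses)
    if m_hat is None:
        m_hat = m // 2                       # respects m_hat <= m / 2
    m_tilde = (1 + beta) * m_hat + (1 - beta) * (m - m_hat)
    return ((1 + beta) * losses[:m_hat].sum()
            + (1 - beta) * losses[m_hat:].sum()) / m_tilde

# With beta = 0 this reduces to the plain average F(w); with beta > 0 it
# realizes the lower-bound relaxation derived above.
val = f_new([1.0, 2.0, 3.0, 4.0], beta=0.5)
```

Here `f_new`, `task_losses`, and `m_hat` are illustrative names of my own; the sorting step is what instantiates $\psi(i)$ in code.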

Comment

Thank you.

Re: error bars, I just want to flag that my concern is two-fold. Obviously, not including any sort of error bars for statistical analysis is not great - but admittedly, this is a pretty classic and lazy reviewer criticism. I do think it’s important here though, especially when this work focuses on ‘stability’ which is near impossible to measure within a single seed when it comes to the complex task of meta-learning.

It's also that the questionnaire explicitly asks if statistical significance is shown, and your paper says it is (when it's not).

Comment

Thank you again for further explaining your concerns about error bars; we fully understand and sincerely appreciate it. We believe that our theoretical analysis is solid, and we are grateful for your recognition of this aspect. At the same time, we truly value your suggestion regarding the use of error bars, and we will incorporate them in the final version of the paper. Inspired by your feedback, we will also place greater emphasis on improving experimental details in our future work, rather than focusing solely on theoretical rigor.

Review
4

This paper investigates the impact of inner-level updates on the stability and generalization of meta-learning algorithms. The study classifies prominent meta-learning algorithms into two frameworks, the Gradient Descent Framework (GDF) and the Proximal Descent Framework (PDF), based on their inner-level processes. It introduces novel definitions of algorithmic stability for both frameworks and derives generalization bounds, revealing a trade-off between inner-levels and generalization in GDF, while PDF exhibits a beneficial relationship. The meta-objective function's critical role in minimizing generalization error is highlighted, leading to the proposal of a simplified meta-objective function to enhance performance. Experiments on datasets like Omniglot validate that GDF's generalization error first decreases and then increases with inner-levels, whereas PDF's error consistently decreases, supporting the theoretical findings and the effectiveness of the new objective.

Strengths and Weaknesses

(+)

  1. This paper considers an important topic in meta-learning, namely generalization performance. The authors use algorithmic stability to deal with the problem, which yields algorithm-dependent bounds.
  2. This paper considers a group of meta-learning methods (gradient descent framework, and proximal descent framework) instead of a single method, which is general. This point is new to me.
  3. The authors provide rigorous theoretical analysis on generalization. While I did not check all the details, I believe that the arguments here are mostly correct.
  4. The authors further provide empirical evidence to validate their theoretical findings.

(-)

  1. The technical novelty seems limited, and the comparison between the upper bounds (instead of the upper and lower bound) stops me from giving a higher score. We know that due to technical limits, we may have to compare between the upper bounds. But this still harms the scope of this paper.
  2. The authors may need to add experiments across different datasets to validate the generalizability of their findings.

Overall, I still lean towards recommending acceptance, given these limitations.

Questions

See above.

Limitations

See above.

Justification for Final Rating

After reading the authors' responses, I keep my score of 4 unchanged. This score is mostly due to the fact that the authors consider a group of meta-learning methods (instead of a single method), and this is general. Besides, the theoretical part seems correct to me.

I did not give a higher score, since the theoretical contributions still seem limited to me. But this is just a minor problem, and I still vote for an acceptance.

Formatting Issues

No

Author Response

We thank you for your valuable comments and reply to them as follows.

Question 1: In generalization analysis, upper bounds are particularly valuable because they characterize the worst-case performance of an algorithm, thereby enhancing its reliability in real-world applications. Especially in stability-based analyses, upper bounds provide deeper insights into the intrinsic properties of learning algorithms. An algorithm $\mathcal{A}$ is said to be $\epsilon_{\text{gen}}$-stable if the change in its output loss, when applied to two neighboring datasets that differ by a single sample, is bounded by $\epsilon_{\text{gen}}$. By deriving an upper bound on this stability, we are able to systematically trace how differences between neighboring datasets propagate through both the inner and outer learning levels. This enables our bound to precisely reflect the influence of the inner-level optimization. Moreover, our work makes a significant technical contribution: to the best of our knowledge, this is the first study to establish generalization bounds for meta-learning under two scalable algorithmic frameworks. In addition, we introduce two novel definitions of stability, which serve as the theoretical foundation for our analysis. We also recognize that lower bounds are valuable for subsequent derivations, and will consider exploring this direction in future work.
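To make the $\epsilon_{\text{gen}}$-stability definition concrete, here is a toy numerical illustration. It uses closed-form ridge regression as a stand-in algorithm (not one of the meta-learning algorithms analyzed in the paper) and measures the loss deviation on a fresh point after swapping a single training sample; all names and constants below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sq_loss(w, x, y):
    return float((x @ w - y) ** 2)

# Neighboring datasets S and S' differ in exactly one sample.
n, d = 50, 3
X = rng.normal(size=(n, d)); y = rng.normal(size=n)
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d); y2[0] = rng.normal()

w, w2 = ridge_fit(X, y), ridge_fit(X2, y2)
x_test, y_test = rng.normal(size=d), rng.normal()

# eps_gen bounds this deviation uniformly over all neighboring pairs;
# a single pair only gives one realization of it.
eps = abs(sq_loss(w, x_test, y_test) - sq_loss(w2, x_test, y_test))
print(f"loss deviation on one neighboring pair: {eps:.4f}")
```

A stability upper bound must hold for every such neighboring pair, which is what lets the analysis trace how a single-sample perturbation propagates through the inner and outer levels.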

Question 2: Thank you for your suggestions. We additionally provide two sets of experimental results, in the convex setting and the non-convex setting, to further verify our theoretical results.

Few-shot regression on a new synthetic dataset: Unlike the experiments in our paper, where each task is a random sinusoid function, we now create each task randomly as either a random sinusoid function or a random linear function. The amplitude of the sinusoids varies within $[0.1,5.0]$ and the phase within $[0,\pi]$. The slope and intercept of the lines vary in $[-3.0, 3.0]$. The inputs are sampled uniformly in $[-5.0,5.0]$. We again use an MLP network with the mean squared error loss. The generalization error is assessed as the difference between training and test errors. For each training task, we set the number of support samples and query samples to $n^{\text{tr}}=5$ and $n^{\text{ts}}=5$, respectively, while for each test task we use $n^{\text{tr}}=5$ and $n^{\text{ts}}=15$. We report our results as follows:

| Sinusoid & lines | MAML | MetaSGD | FOMAML | MetaProx | iMAML | FOMuML |
| --- | --- | --- | --- | --- | --- | --- |
| Q=2 | 0.7916 | 0.8682 | 0.6884 | 0.3330 | 0.3803 | 0.9063 |
| Q=5 | 0.6393 | 0.6731 | 0.3196 | 0.1507 | 0.3569 | 0.6033 |
| Q=10 | 0.4863 | 0.4296 | 0.2034 | 0.1040 | 0.3422 | 0.4674 |
| Q=15 | 0.5356 | 0.5179 | 0.2209 | 0.0233 | 0.1222 | 0.2212 |

| Sinusoid & lines | MAML (Ours) | MetaSGD (Ours) | FOMAML (Ours) | MetaProx (Ours) | iMAML (Ours) | FOMuML (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| Q=2 | 0.7624 | 0.8484 | 0.6720 | 0.3190 | 0.3562 | 0.7381 |
| Q=5 | 0.5964 | 0.6497 | 0.3071 | 0.1263 | 0.3247 | 0.5041 |
| Q=10 | 0.4681 | 0.4118 | 0.1732 | 0.0594 | 0.2524 | 0.3338 |
| Q=15 | 0.5144 | 0.4921 | 0.2018 | 0.0162 | 0.0864 | 0.1583 |
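The task-generation procedure described above can be sketched in NumPy; the function names below are hypothetical and only the sampling ranges come from the description (sinusoid amplitude in $[0.1,5.0]$, phase in $[0,\pi]$, line slope and intercept in $[-3.0,3.0]$, inputs uniform in $[-5.0,5.0]$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Sample one regression task: a random sinusoid or a random line."""
    if rng.random() < 0.5:
        amp = rng.uniform(0.1, 5.0)      # sinusoid amplitude
        phase = rng.uniform(0.0, np.pi)  # sinusoid phase
        return lambda x: amp * np.sin(x + phase)
    slope = rng.uniform(-3.0, 3.0)
    intercept = rng.uniform(-3.0, 3.0)
    return lambda x: slope * x + intercept

def sample_support_query(task, n_tr=5, n_ts=5):
    # Inputs sampled uniformly in [-5, 5]; targets from the task function.
    x_tr = rng.uniform(-5.0, 5.0, size=n_tr)
    x_ts = rng.uniform(-5.0, 5.0, size=n_ts)
    return (x_tr, task(x_tr)), (x_ts, task(x_ts))

support, query = sample_support_query(sample_task())
assert support[0].shape == (5,) and query[0].shape == (5,)
```

The split into a support set ($n^{\text{tr}}$ samples, used by the inner level) and a query set ($n^{\text{ts}}$ samples, used by the outer level) mirrors the setting of the regression experiment.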

Few-shot classification on CIFARFS: We conduct an experiment on the real-world CIFARFS dataset, which is also a few-shot classification task. It splits the 100 classes of CIFAR-100 into 64, 16, and 20 classes for training, validation, and test, respectively. Each class contains 600 images of size 32×32×3. We employ a CNN with 3×3 convolutions and, for each training task, set the number of support samples and query samples to $n^{\text{tr}}=1$ and $n^{\text{ts}}=5$, respectively. For each test task, we use $n^{\text{tr}}=1$ and $n^{\text{ts}}=15$. We report our results as follows:

| CIFARFS | MAML | MetaSGD | FOMAML | MetaProx | iMAML | FOMuML |
| --- | --- | --- | --- | --- | --- | --- |
| Q=2 | 0.5989 | 0.5944 | 0.5865 | 0.6536 | 0.8020 | 0.5577 |
| Q=5 | 0.4022 | 0.4423 | 0.4277 | 0.4636 | 0.7188 | 0.3894 |
| Q=10 | 0.3677 | 0.2673 | 0.4709 | 0.3721 | 0.4105 | 0.2588 |
| Q=15 | 0.3951 | 0.4798 | 0.4510 | 0.1360 | 0.2794 | 0.2478 |

| CIFARFS | MAML (Ours) | MetaSGD (Ours) | FOMAML (Ours) | MetaProx (Ours) | iMAML (Ours) | FOMuML (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| Q=2 | 0.5819 | 0.5855 | 0.5434 | 0.6196 | 0.5796 | 0.5430 |
| Q=5 | 0.3884 | 0.4035 | 0.4137 | 0.3211 | 0.4458 | 0.3675 |
| Q=10 | 0.3411 | 0.2556 | 0.4352 | 0.2748 | 0.3628 | 0.2473 |
| Q=15 | 0.3861 | 0.4340 | 0.4493 | 0.0834 | 0.1342 | 0.2310 |

As shown in the tables above, a trade-off relationship exists among the GDF algorithms, i.e., MAML, MetaSGD, and FOMAML, while a beneficial relationship is observed among the PDF algorithms, including MetaProx, iMAML, and FOMuML. Furthermore, the experiments with our new objective demonstrate that it reduces the generalization error across all six algorithms.

Comment

I thank the authors for their efforts in providing the replies. I keep my score unchanged and lean towards an acceptance.

Comment

We sincerely appreciate the reviewers’ recognition of our work. If there are any additional questions you would like to discuss, please feel free to let us know.

Final Decision

This paper theoretically evaluates how the inner update in meta-learning affects generalization performance. This work centers on two representative approaches to the inner update, the gradient descent framework (GDF) and the proximal descent framework (PDF). From the derived theory, the authors show that the trade-off present in the GDF does not arise in PDF. As a consequence, they propose a new reweighting-style meta-objective and demonstrate its empirical effectiveness.

Multiple reviewers positively noted the theoretical significance of analyzing the impact of inner-loop optimization on generalization within two unified frameworks (GDF and PDF) and of providing stability-based generalization bounds. The analysis is sound, and by focusing on GDF and PDF the theory applies to a broad range of existing meta-learning algorithms, which is valuable. The consistency between the empirical results and the theory is also appreciated. Consequently, the proposed loss function is recognized as a contribution to the community.

Reviewers pointed out that the statistical significance of the experiments was not fully validated; in response, the authors committed to multi-seed runs and adding error bars. Several reviewers also found the logical development not always clear; the authors addressed this reasonably in the rebuttal and promised to update the manuscript. Although some questioned the novelty of studying the effect of inner-loop optimization on generalization, the authors argued that their analysis of GDF and PDF, which identifies the trade-offs over a wide range of existing methods, is valuable. A remaining concern is that the theoretical analysis provides only upper bounds, but this is a very common approach in generalization-error analysis and is not considered a critical flaw.

Taken together, the paper offers a novel theoretical analysis and a valuable examination of how inner updates influence generalization from the perspectives of GDF and PDF. It introduces a new loss function grounded in solid theory, and its performance is supported by experiments. Since the authors have responded reasonably to the main concerns, I recommend acceptance.