PaperHub
6.8 / 10
Spotlight · 4 reviewers
Min 3 · Max 5 · Std 0.8
Scores: 3, 5, 4, 5
Confidence: 3.8
Novelty: 3.0
Quality: 2.8
Clarity: 2.3
Significance: 3.0
NeurIPS 2025

Which Algorithms Have Tight Generalization Bounds?

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We present conditions that preclude the existence of tight generalization bounds, as well as a stability condition that guarantees their existence.

Abstract

Keywords
learning theory

Reviews and Discussion

Review
Rating: 3

This paper studies tight generalization bounds through the definition of 'estimability', gives some examples, and presents conditions that lead to estimability and conditions that preclude it.

Strengths and Weaknesses

(1): I think the writing of this paper is bad. The introduction is nearly 7 pages long, while the main text is hastily concluded in less than 3 pages, to the point that the experiments had to be placed in the appendix. For an article dominated by theory, I think the authors have used too much space to informally present the theoretical results. Additionally, please define symbols before using them, such as $U(\mathbb D)$.

(2): I cannot agree with the authors on the significance of the theorems presented.

Theorems 3.1 and 3.2: In these theorems, the reason such a lower bound holds is mainly the choice of the distribution collection $\mathbb D$, and it has nothing to do with the properties of the algorithm itself. Whether the algorithm is SGD, memorization, or something else, this lower bound necessarily exists due to the choice of $\mathbb D$. The title of the article is "Which Algorithms Have Tight Generalization Bounds?", which clearly does not match the meaning expressed by these theorems.

Moreover, the theorems require a restriction on the number of samples $m$, which further limits their significance. It is clear that when the number of samples is small, the generalization performance of the algorithm on complex distributions will not be good, making the conclusion of the theorems somewhat obvious. When the number of samples is large, it is evident that this lower bound will not hold. The authors also do not further discuss the potential impact of $m$ on various algorithms.

In summary, the establishment of these two theorems primarily relies on factors external to the algorithms themselves, which does not align with the previous examples and the main theme of this paper.

Theorem 4.3: The assumption of this theorem is too strict. The authors assume the algorithm has good stability for any distribution; under this assumption, Theorem 4.3 can be understood as saying that algorithms which generalize well to any distribution have a tight generalization bound. This conclusion obviously has no guiding significance in the real world; it is almost just a simple extension of the classic conclusion. We need to know which algorithms have good generalization, rather than simply adding it as an assumption.

Questions

See weakness.

I would like the authors to further explain the significance and meaning of these theorems, and how they can guide us in determining which algorithms have tight generalization bounds.

Limitations

There is no 'conclusion and limitations' section in the paper, but there are no ethical issues in this article.

Final Justification

I could not agree with the authors on the importance of Thm 3.1; I ultimately believe that the theorem pays too little attention to the properties of the algorithm itself, which decreases its importance. So I think 3 is fair for the authors.

However, the theory itself is correct, and different people may see different meaning and importance in it.

So I won't oppose or recommend this paper.

Formatting Issues

It is obvious that the authors have used \vspace in many places.

Author Response

General Remark

Thank you for your comments and for engaging with our paper!

Estimability can sometimes be confusing, and from your comments it appears that there might have been some misunderstandings about the definitions and setting. Concepts such as estimability, generalization, stability, and learning are all closely related and can be easily confused, but correctly distinguishing between them is crucial for understanding this paper. We'd be grateful if you could carefully read our response, and follow up with questions about anything that may be unclear! Hopefully, this will help us understand each other better.


Weaknesses

Please define the symbols before using them, such as $U(\mathbb D)$.

The notation $U(\mathbb D)$ is defined on l. 278.

Whether the algorithm is SGD, memorization, or something else, this lower bound necessarily exists due to the choice of the distribution collection $\mathbb D$.

This comment touches a key issue for understanding our contributions. These lower bounds do not hold for all algorithms: the assumption that the algorithm is $\mathcal F$-interpolating is crucial. In particular, these lower bounds do not hold for memorization algorithms. As discussed in Example 1.6, a memorization algorithm is estimable for any positive $\varepsilon$, whereas the lower bounds in Thms 3.1 and 3.2 show that other algorithms are not estimable for any $\varepsilon$ significantly below $1/4$.

The title of the article is "Which Algorithms Have Tight Generalization Bounds?", which clearly does not match the meaning expressed by this theorem.

The lower bounds in Thms 3.1 and 3.2 apply to certain algorithms but not to others. They hold for algorithms that are suitably $\mathcal F$-interpolating (e.g., algorithms that are unstable), showing that those algorithms don't have tight generalization bounds. In contrast, Thm 4.3 shows that sufficiently stable algorithms do have tight generalization bounds. So, the title accurately reflects the central question of the paper.

The theorem requires the restriction of the number of samples $m$, which further limits the significance of the theorem. It is clear that when the number of samples is small, the generalization performance of the algorithm on complex distributions will not be good, making the conclusion of the theorem somewhat obvious.

The comment talks about "generalization performance", but the theorem is about estimability, not performance. There is a crucial distinction (in Section 1.2) between learning (i.e., low population loss / test set loss) vs. estimability, and this comment appears to be conflating the two. When the sample size $m$ is small, then indeed the performance (test set accuracy) will be poor for most distributions. But that does not imply that the algorithm is not estimable. For instance, if $m$ is a small constant and the distribution is labeled by functions from a class with VC dimension $d$, then as $d \to \infty$, the performance of the algorithm becomes very poor (it will have high loss on the test set). However, if the algorithm is stable (e.g., memorization) then it can nonetheless have excellent estimability and tight generalization bounds, even though $m$ is a small constant and $d \to \infty$. In conclusion, the lower bound on estimability does not simply follow from $m$ being small, and our result is not obvious.

When the number of samples is large, it is evident that this lower bound will not hold.

Indeed, Thm 3.1 holds when the number $m$ of samples is small compared to the VC dimension of the set $F$ of possible labeling functions for the population distribution. But this is not a limitation of the theorem; it is simply a basic fact of learning theory that when the number of samples $m$ is large compared to the VC dimension, then every algorithm has tight generalization bounds (with respect to a fixed and known marginal distribution, as in Thm 3.1).

One might desire a theorem that shows a lower bound also for $m$ that is large compared to the VC dimension $d$, but such a theorem would be false: any true theorem must have $m$ small compared to $d$. In our case we show a result for $m = O(\sqrt{d})$. It is an interesting open question whether this can be pushed further, e.g., can one show a similar result for $m = O(d^{0.9})$?

In summary, the establishment of these two theorems primarily relies on factors external to the algorithms themselves, which does not align with the previous examples and the main theme of this paper.

As explained earlier, the lower bounds do not apply to all algorithms. The idea that the bounds depend only on external factors, not on the algorithm, is simply untrue. Together with our upper bound, these results are a clear step towards understanding which algorithms have tight generalization bounds: unstable (e.g., $\mathcal F$-interpolating) algorithms do not, whereas stable ones do (regardless of performance).

The assumption of this theorem is too strict. The authors assume the algorithm has good stability for any distribution; under this assumption, Theorem 4.3 can be understood as saying that algorithms which generalize well to any distribution have a tight generalization bound. This conclusion obviously has no guiding significance [...] We need to know which algorithms have good generalization, rather than simply adding it as an assumption.

It is important to appreciate that stability and generalization are different things. When referring to “algorithms with good generalization”, does this comment mean algorithms where the empirical error is close to the population error? If so please notice that the memorization algorithm with respect to the uniform distribution does not have good generalization in that sense: the empirical error is always 0, while the population error can be arbitrary (e.g., close to 1). However, the memorization algorithm has excellent loss stability with respect to the uniform distribution, and therefore Thm 4.3 implies that it has tight generalization bounds with respect to the uniform distribution. So memorization is an example of an algorithm that is stable but does not have good generalization in that sense.
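To make this concrete, here is a minimal numerical sketch of that memorization example (a Python illustration with our own toy parameters and an added uniform prior over the labelings; it is not the paper's formal construction). The empirical loss is always 0, the population loss is close to 1/2, and yet a simple estimator that looks only at the sample size tracks the population loss to within 0.01:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: uniform marginal over a domain of d points and, as an
# added assumption, a uniform prior over labelings y in {-1,+1}^d.
d, m, trials = 100_000, 200, 500

# Memorization predicts the stored label on seen points and -1 elsewhere,
# so its empirical 0/1 loss is always 0, while its population loss equals
# (# unseen points labeled +1) / d.  An estimator using only |S|:
estimate = (d - m) / (2 * d)  # unseen mass times 1/2

gaps = []
for _ in range(trials):
    labels = rng.choice([-1, 1], size=d)
    seen = rng.choice(d, size=m, replace=False)  # distinct samples, for simplicity
    pop_loss = (np.sum(labels == 1) - np.sum(labels[seen] == 1)) / d
    gaps.append(abs(estimate - pop_loss))

print(max(gaps))  # typically well below 0.01: estimable, yet loss ~ 0.5
```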

The distinction goes in the other direction too: consider an algorithm $A$ that always outputs hypotheses from some set $\mathcal G$ such that $\mathsf{VC}(\mathcal G)$ is small compared to the sample size $m$ (but the distributions $\mathbb D$ can be arbitrary). Then $A$ is not stable (it can swing wildly between different hypotheses in $\mathcal G$), but it does have good generalization, and it does have tight generalization bounds.

If instead “good generalization” in this comment means algorithms with low population loss, then observe that again, that is very different from stability. For instance, Example 1.7 implies that if we fix an algorithm chosen at random from the set of all mappings from training sets to hypotheses, then with high probability that algorithm will not have good performance on the test set (it will have high population loss) but it will have good loss stability and therefore it will have tight generalization bounds.

In conclusion, stability and generalization are distinct notions, and neither of them implies the other. Nonetheless, what Thm 4.3 shows is that stability implies the existence of tight generalization bounds, which we believe is a very worthwhile observation.

(It’s interesting to note that our experiments in Appendix L show that empirically SGD has fairly good hypothesis stability for certain natural distributions. This suggests that it might be possible to prove tight generalization bounds for SGD for certain natural distributions. That would be a meaningful and nontrivial result.)


Questions

I would like the authors to further explain the significance and meaning of these theorems, and how they can guide us in determining which algorithms have tight generalization bounds.

Yes, thank you for asking this. Understanding how our results can be used to determine which algorithms have tight generalization bounds is a very important part of appreciating the value of this paper. Let's walk through this step by step, discussing both our lower bounds and our upper bound.

Lower bounds. These bounds show that if an algorithm satisfies a condition called $\mathcal F$-interpolating for some suitable class $\mathcal F$, then it does not have tight generalization bounds. Let's look at Thm 3.2, because it is a bit more concrete. Fix an algorithm $A$ and a sample size $m$, and fix some collection of orthogonal functions $\mathcal F$ of size roughly $2^m$ (e.g., $\mathcal F$ can be a subset of all linear functions $\mathbb F_2^d \to \mathbb F_2$ for $d > m$). Now comes the pivotal step: show that $A$ is $\mathcal F$-interpolating, which means that whenever the training set is consistent with some function from $\mathcal F$, then $A$ outputs a function from $\mathcal F$. If you can do that, you are done: Thm 3.2 states that no tight generalization bound exists for $A$. Voilà!

Note: as we discuss in Footnote 6, Abbe and Sandon (2020) show that SGD can in particular be interpolating for orthogonal functions, so this recipe can be applied to real-world algorithms too.
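To make the orthogonality ingredient concrete, here is a small numerical sketch (our own Python illustration; the class $\mathcal F$ of $\mathbb F_2$-linear parities is the one suggested above). Any two distinct nonzero parities agree on exactly half of $\mathbb F_2^d$, because their XOR is a nonzero linear function, so as $\pm 1$-valued functions they are exactly orthogonal:

```python
import numpy as np
from itertools import product

# All of F_2^d for a small d (2^d points), as rows of a 0/1 matrix.
d = 12
X = np.array(list(product([0, 1], repeat=d)), dtype=np.int64)

# A handful of distinct nonzero weight vectors w, each defining the
# parity x -> <w, x> mod 2.
rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(10, d))
W = np.unique(W[W.any(axis=1)], axis=0)  # keep distinct, nonzero rows

# View each parity as a +/-1-valued function and correlate all pairs.
F = 1 - 2 * ((X @ W.T) % 2)
corr = F.T @ F / len(X)
print(np.round(corr, 3))  # the identity matrix: pairwise orthogonal
```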

Upper bound. For example, consider a variant of $k$-nearest neighbors: for input $x$, the algorithm $A$ selects one of the $k$ nearest neighbors of $x$ in the training set uniformly at random, and outputs its label. If fewer than $k/100$ points change in the training set, then the population loss of $A$ changes by at most 0.01, for any distribution! So $A$ is $(0.01, 0, m, k/100)$-loss stable. By Thm 4.3, $A$ has a tight generalization bound for all distributions. Ta-da!
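And here is a minimal numerical sketch of that stability claim (a Python illustration; the distribution, sample sizes, and helper names are our own choices, not the paper's). To keep the neighbor sets fixed we only corrupt labels, in which case each query's error probability moves by at most $r/k$:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(X_train, y_train, k, X_test, y_test):
    """Expected 0/1 loss of the randomized k-NN rule: for each query, the
    error probability is the fraction of its k nearest neighbors whose
    label disagrees with the query's label."""
    err = 0.0
    for x, y in zip(X_test, y_test):
        nn = np.argsort(np.abs(X_train - x))[:k]  # 1-D inputs for simplicity
        err += np.mean(y_train[nn] != y)
    return err / len(X_test)

# Hypothetical data: x ~ U[0,1], label 1[x > 0.5], flipped with prob 0.2.
def sample(n):
    x = rng.random(n)
    y = (x > 0.5).astype(int) ^ (rng.random(n) < 0.2)
    return x, y

m, k = 2000, 1000
X, y = sample(m)
X_test, y_test = sample(4000)
before = loss(X, y, k, X_test, y_test)

# Corrupt r < k/100 training labels (inputs unchanged, so every query's
# k-NN set is unchanged and its error probability moves by at most r/k).
r = k // 100 - 1
y2 = y.copy()
y2[rng.choice(m, size=r, replace=False)] ^= 1
after = loss(X, y2, k, X_test, y_test)

print(abs(after - before))  # at most r/k = 0.009 < 0.01
```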


Limitations

Thanks for pointing this out. We will use the 10th page in the camera-ready version to add sections on limitations and on directions for future work.

Comment

Firstly, I apologize for the unclear weaknesses I raised, which made the authors' response not what I wanted. I have reorganized my language and restated my problem, and I hope the authors can join the discussion again.

About Theorem 4.1. In Theorem 4.1, the authors say: 'Then there exists a subset $F \subset H$ and a collection $\mathbb D \subset \Delta(\cdots)$ of F-realizable distributions such that for any F-interpolating learning rule A and …'.

In the proof of this theorem, the authors select $X_d$ as an H-shattered set, then select an $F$ such that $F \perp_{\epsilon, X_d}$, then define the distributions in $\mathbb D$ as $U(D_f)$ (defined on $X_d$) where $f \sim U(F)$ (by Lemma H.1 and line 646).

Based on this definition, the authors use Lemma H.1 to prove Theorem 4.1. By lines 712-713 in Lemma H.1 (which is also the core), we know that the precluded estimability comes from $F \perp_{\epsilon, X_d}$, so that any F-interpolating learning rule A is not estimable.

If my understanding of the proof is correct, the theorem comes from the poor properties between $X_d$ and $F$, not from the algorithm itself (other than being F-interpolating, of course). In other words, when algorithms are considered in this setting (even if they are excellent, such as SGD or memorization), they are limited by the special choices of $X_d$ and $F$, as well as by the number of samples $m$ ($\le \sqrt{d}/10$), and cannot exert their original power; this is not the fault of the algorithm but of $X_d$ and $F$.

So I think using this theorem as the main conclusion is obviously inconsistent with the original intention of the article. What I want to see is this: analyze the ability of the algorithm itself on most common distributions, rather than constructing distributions with special properties to study.

For Theorem 4.3, I agree that Definitions 4.1 and 4.2 are stability (as the authors said before and as the paper says), but we also know that a generalization bound can easily be derived from a stability bound, so in my opinion this assumption is almost the same as assuming a generalization bound. Or do the authors believe that Definitions 4.1 and 4.2 cannot lead to a generalization bound as in previous studies?

If the authors can answer my question from a technical perspective, I will increase my score.

Comment

Thank you for your thoughtful follow-up and for taking the time to clarify your concerns. We truly appreciate your willingness to engage in detailed discussion on the technical points. We welcome the opportunity to respond carefully and continue the dialogue.

Questions

Before addressing your comments in full, we want to ensure that we have a common understanding about some basic terms, so that we are all on the same page. Would you kindly be willing to answer the following questions:

Q1. Can you define in your own words what it means for an algorithm to be estimable? How is that different from an algorithm that generalizes well, and from one that learns?

Q2. Can you define in your own words what it means for an algorithm to be $\mathcal F$-interpolating? How is that different from an algorithm that has 0 error on the training set?

We hope you don’t mind answering, to establish a common language for a constructive discussion.

Precluding Estimability

If we understand correctly, it appears that you have two main concerns about Thm 3.1.

Concern I

What I want to see is this: analyze the ability of the algorithm itself on most common distributions, rather than constructing distributions with special properties to study.

This overlooks a key aspect of impossibility results. Let’s recall that in theoretical computer science, there are two very common types of results — possibility results and impossibility results. Each type corresponds to a specific quantifier — for all vs. exists, as follows:

  1. Possibility Results. Show that a task is possible to do. For example, showing that it is possible to sort a list using $O(n \log n)$ pairwise comparisons. To prove such a result, one must present an algorithm using $O(n \log n)$ comparisons such that for all instances, i.e., for every list, the algorithm correctly sorts the list.

  2. Impossibility Results. Show that a task is not possible. For example, showing that it is not possible to sort using asymptotically fewer than $n \log n$ comparisons. To prove such a result, one has to show that for every algorithm using asymptotically fewer than $n \log n$ comparisons, there exists a hard instance, i.e., one specific list, which the algorithm does not sort correctly. Importantly, we don't need to show that the algorithm fails "on most of common" instances (as in your comment).

In summary, to prove impossibility results, it suffices to construct one hard instance. In Theorem 3.1, we show an impossibility result for generalization bounds. We need to show that for every generalization bound, there exists one hard instance (one distribution) on which the bound is not tight. And we do that. (In fact, we even do more — we show that there exist many distributions on which the bound is not tight, so the algorithm is also not estimable on average.)

If you understand the general paradigm of proving impossibility via hard instances, why not welcome our result as well?

Concern II

the theorem comes from the poor properties between $X_d$ and $F$, not from the algorithm itself (other than being F-interpolating, of course).

It appears from your comments that you think that being $\mathcal F$-interpolating is not a very meaningful property. It's almost as if you are taking this property for granted.

The set $\mathcal F$ for which an algorithm is $\mathcal F$-interpolating captures an important bias of the algorithm. Every good learning algorithm has biases, and the specific type of biases is one of the most defining features that distinguish one algorithm from another. The proofs of Thms 3.1 and 3.2 strongly use the specific set $\mathcal F$ representing the bias of the algorithm to show that no tight generalization bound exists for that specific algorithm.

Before discussing this further, we’d like to see your answer to Q2 above.

when algorithms are considered in this setting (even if they are excellent, such as SGD or memorization

It is unfortunate to see you mention memorization in the context of Thms 3.1 and 3.2. As already explained in the rebuttal, these theorems do not apply to memorization. Also, memorization is not an excellent algorithm — it can easily have population loss 1.

Sufficient Condition

we also know that the generalization bound can be easily derived from the stability bound

Is that true? Which prior work are you referring to?

Perhaps you mean the result of Bousquet & Elisseeff (2002)? If so, note that their result only holds for a very strong notion of stability, which implies that the population error is close to the empirical error. In particular, their result doesn’t even imply a tight generalization bound for the memorization algorithm!

In contrast, Thm 4.3 shows a tight generalization bound also for algorithms where the population error is far from the empirical error. This makes our result much more applicable (including also for memorization).

Comment
  1. Estimability is given in your paper, that is $P_{\mathcal D \sim U(\mathbb D),\, S \sim \mathcal D^N}\big[|\mathfrak E(S) - L_{\mathcal D}(A(S))| \le \epsilon\big] \ge 1 - \delta$.

Now let us consider the definition of estimability. In your definition, $\mathfrak E(S)$ and $A(S)$ depend only on $S$ and do not consider $\mathcal D$, but you require $L_{\mathcal D}(A(S))$ to be close to $\mathfrak E(S)$ for most $\mathcal D \sim U(\mathbb D)$ and $S \sim \mathcal D^N$, which implies that the distributions $\mathcal D \sim U(\mathbb D)$ should be similar to each other for estimability to make sense. If the distributions are quite different (as in the proof of Thm 3.1), then no algorithm $A$ can achieve small estimation error; this is not a fault of the algorithm but of the choice of $U(\mathbb D)$ and of the definition of estimability.

For example, I define four distributions $D_{i,j}$, where $i,j \in \{-1,1\}$, by $P_{D_{i,j}}((x_1,i)) = P_{D_{i,j}}((x_2,j)) = 0.5$, and each distribution in $U(\mathbb D)$ has probability 0.25. Then for any algorithm, taking $|S| = 1$, the values $\{L_{D_{i,j}}(A(S))\}_{i,j}$ are randomly distributed in $\{0.5, 1\}$ (or $\{0, 0.5\}$) whatever $S$ is, so no algorithm is estimable on these four distributions.

(It is worth mentioning that the core of my example and of the one constructed by the authors is similar: reduce the number of possible training sets, which makes it difficult for the algorithm to determine which distribution the training set comes from, while also ensuring diversity among the distributions.)
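For concreteness, the calculation behind this example can be spelled out via the conditional-variance view of estimability (Fact C.2; the derivation is added here for illustration). Conditioned on a single sample, say $S = ((x_1, i))$, the other coordinate $j$ remains uniform, so with $h = A(S)$ and $a = \mathbf 1[h(x_1) \neq i]$,

$$L_{D_{i,j}}(A(S)) = \tfrac{1}{2}a + \tfrac{1}{2}\mathbf 1[h(x_2) \neq j] \in \big\{\tfrac{a}{2},\ \tfrac{a}{2} + \tfrac{1}{2}\big\}, \text{ each value with probability } \tfrac{1}{2},$$

hence $\mathrm{Var}\big(L_D(A(S)) \mid S\big) = \tfrac{1}{4}\cdot\big(\tfrac{1}{2}\big)^2 = \tfrac{1}{16}$, and no estimator can be within $\epsilon < 1/4$ of the population loss with probability greater than $1/2$.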

But, what can this example tell us? What can we learn from it? The badness of algorithms such as SGD or memorization?

I don't think so. This tells us that such a $U(\mathbb D)$ is bad and that $|S|$ is small.

Similar to the authors' theorem: after reading the proof of the theorem, I have the same perception. Such $F$ and $X_d$ are bad, but not the algorithm.

Is memorization bad? The authors said yes, on such $F$ and $X_d$. Is SGD bad? Maybe we can also find some $F$ and $X_d$ for SGD.

But by Theorem 4.3, if memorization or SGD satisfies Definitions 4.1 and 4.2 for some $U(\mathbb D)$, then such an algorithm is also good, on that $U(\mathbb D)$. Even if we make a small change to $F$ or $X_d$ in the proof of Thm 3.1, memorization may be good on that $F$ and $X_d$.

So, I think in Theorem 3.1 the authors leave the main determinant of estimability to the distribution itself, and very little to the features of the algorithm itself (which is what we should focus on). Can this example determine the performance of memorization algorithms on general distributions? Can we provide corresponding non-estimability examples for every given algorithm?

I think a reasonable approach should be to first specify the distribution and then consider the performance of the algorithm. Specifically, consider the question: is memorization good or bad on more commonly used distributions?

The authors' previous answer was not what I wanted to see; what I question is the importance of the authors' theory rather than the mathematics itself (the proof is correct, and I agree with proving impossibility via hard instances, but we also have to discuss whether the examples really fit the article; otherwise, why don't we use the example above to argue that all algorithms are bad). To facilitate communication, the authors may answer the following questions (so as not to use more of your time, these are the last questions):

Is estimability more focused on the construction of special distributions, or on the analysis of the properties of the algorithm itself? For any given algorithm, can we construct a special distribution to negate its estimability? If so, what guiding significance do estimability and Thm 3.1 have for us?

Is the theme of the article to analyze the properties of the algorithm itself? Are the theorems and proofs more focused on analyzing the properties of the algorithm itself, rather than on constructing special distributions? How can the constructed special distributions really guide the selection of algorithms for daily use?

Comment

2. The generalization bound for algorithm $A$ on distribution $D$ is $P_{S \sim D^N}\big[|L_S(A(S)) - L_D(A(S))| \le \epsilon\big] \ge 1-\delta$ or $P_{S \sim D^N}\big[L_D(A(S)) \le \epsilon\big] \ge 1-\delta$, depending on the situation.

I think a generalization bound can lead to estimability: I just need to take $\mathfrak E(S) = L_S(A(S))$ or $0$ (depending on the situation). But after checking, Definition 4.1 cannot lead to a generalization bound, so now my question is only about Thm 3.1.

3. Regarding my mention of SGD and memorization: because I do not know what kind of algorithm is $F$-interpolating, I gave two algorithms that fit the training set as closely as possible. These two were just examples to illustrate my question.

4. Finally, I would like to point out that writing these comments took me a lot of time; I respect the authors' work, have tried to understand it, and have reviewed their proofs. I would like to improve my score, but the authors must clarify the guiding significance and importance that their approach to studying estimability can bring us.

Comment

Thank you for the significant effort and dedication! We feel that we might be getting a bit closer to clarifying the confusion, especially thanks to the specific example you gave in your response. Below we are respectfully answering the specific questions you asked us, and also addressing that specific example.

Is estimability more focused on the construction of special distributions, or on the analysis of the properties of the algorithm itself?

Our framework of estimability is focused on understanding the behavior of pairs $(A, \mathbb D)$ where $A$ is an algorithm and $\mathbb D$ is a collection of distributions. This behavior depends crucially on the properties of the algorithm $A$ itself, and not just on the distributions $\mathbb D$.

To see how estimability depends on the algorithm $A$, let's revisit your own example, where you write "For example, I define four distributions $D_{i,j}$ …". In that example, the domain is of size 2, $\{x_1, x_2\}$. Let's consider the exact same example, but generalized to a larger domain $\{x_1, x_2, \ldots, x_d\}$ of size $d$. The collection of distributions is $\mathbb D = \{D_y : y \in \{\pm 1\}^d\}$, where each distribution is given by $D_y = U(\{(x_i, y_i) : i \in [d]\})$. Now consider the constant algorithm $A$ that always outputs the function $h(x) \equiv 0$.

Suppose we take a domain size $d = 1000000$, and we take the sample size $m$ to be small compared to $d$, for instance $m = 200$. By Hoeffding's Inequality (Thm K.3), we have that for any distribution $D$,

$$\Pr\big[\,|L_S(A) - L_D(A)| \leq 0.1\,\big] \geq 0.9.$$

This means that $A$ has a fairly tight generalization bound.
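For concreteness, here is the arithmetic behind this claim (the standard two-sided Hoeffding bound for the mean of $m$ i.i.d. $[0,1]$-valued losses, instantiated with $m = 200$ and deviation $0.1$):

$$\Pr\big[\,|L_S(A) - L_D(A)| > 0.1\,\big] \leq 2\exp\!\big(-2m \cdot 0.1^2\big) = 2\exp(-4) \approx 0.037 \leq 0.1.$$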

Consider the following easy yes/no questions:

Q.I. Do you agree that $A$ has a fairly tight generalization bound for $\mathbb D$ as described above?

Q.II. Do you understand that according to Thm 3.1, there exist other learning algorithms that do not have tight generalization bounds for the exact same $d$, $m$, and $\mathbb D$ as above?

Q.III If you answered yes to the previous questions, do you understand now that estimability depends significantly on the algorithm, and not just on the distributions?

For any given algorithm, can we construct a special distribution to negate its estimability?

No, some algorithms are always estimable (as long as $m$ is not too small, e.g., $m \geq 200$). Examples include the constant algorithm $A$ discussed above and the memorization algorithm (we are happy to provide more examples if desired).

what guiding significance do estimability and Thm 3.1 have for us?

The guiding significance comes from analyzing the property of $\mathcal F$-interpolation for a specific algorithm, which captures the specific bias of the algorithm. We are happy to explain this further, but first we want to make sure that we are on the same page regarding Q.I, Q.II, and Q.III.


2. The generalization bound for algorithm $A$ on distribution $D$ is $P_{S \sim D^N}\big[|L_S(A(S)) - L_D(A(S))| \leq \epsilon\big] \geq 1-\delta$ or $P_{S \sim D^N}\big[L_D(A(S)) \leq \epsilon\big] \geq 1-\delta$, depending on the situation.

Your answer to this question is incorrect, and suggests that perhaps you are still somewhat confused about how generalization bounds are defined. Your equation $P_{S \sim D^N}\big[|L_S(A(S)) - L_D(A(S))| \leq \epsilon\big] \geq 1-\delta$ means that the empirical error is close to the true error. That is sufficient for a generalization bound to be tight, but it is not a necessary condition. As we already noted in our previous answer, tight generalization bounds can also exist when the empirical error is far from the population error (and memorization is a simple example of that). For example, see Equation 3.4 on page 35 of Foundations of Machine Learning (the book can easily be found and viewed online). There are two key differences between that bound and yours:

  1. A generalization bound provides an upper bound on the signed difference $L_D(A(S)) - L_S(A(S))$ rather than on the absolute value. Bounding the absolute difference would also yield a lower bound on the worst-case gap.

  2. In your definition, $\epsilon$ represents a fixed, uniform bound on the deviation between true and empirical error across all samples. However, generalization bounds often involve complexity terms that depend on the sample $S$. For instance, in Equation 3.4, the Rademacher complexity term $\hat{R}_S(G)$ is itself a function of $S$, meaning the deviation between empirical and true error can vary substantially depending on $S$.
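For reference, the bound being cited there has roughly the following shape (we quote the form from memory, so the exact constants may differ slightly from Equation 3.4): with probability at least $1 - \delta$ over $S$, for all $h$ in the class,

$$L_{\mathcal D}(h) \;\leq\; L_S(h) + 2\,\hat{R}_S(G) + 3\sqrt{\frac{\log(2/\delta)}{2m}},$$

a one-sided bound whose slack term $\hat{R}_S(G)$ is itself a function of the sample $S$.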

Respectfully, please recognize that without fully understanding the definitions in our paper, it will not be possible to understand the paper. This is true for generalization bounds (Eq. 1 in the paper), $\mathcal F$-interpolating, and other basic concepts. We greatly appreciate your dedication to understanding our paper, and we are grateful that you are actively engaging in a dialogue with us towards mutual understanding.

Comment

I couldn't ultimately agree on the importance of Thm 3.1, but the authors resolved my questions about Thm 4.3. So I will increase my score to 3. I remain neutral and will not oppose or recommend accepting this paper.

Comment

Thank you for your time spent discussing with us.

The following is our final remark regarding your main concern with Theorem 3.1, namely, the suggestion that the result is merely a property of the distribution class:

"If my understanding of the proof is correct, the Theorem comes from the poor properties between X_d and F, but not the algorithm itself (of course, it must be F-interpolating). In other words, when algorithms are considered in this setting (even if they are excellent, such as SGD or memorization)..."

This interpretation is incorrect for a key reason: memorization is not $F$-interpolating. In Theorem 3.1, being $F$-interpolating is not equivalent to merely interpolating the data (i.e., achieving zero empirical error); rather, it encodes a specific bias of the algorithm.

This distinction is crucial: the memorization algorithm has a tight generalization bound, whereas $F$-interpolating algorithms cannot, by construction. Thus, Theorem 3.1 cannot be attributed solely to the choice of distributions; it fundamentally involves the bias encoded by the class $F$ for which the algorithm is $F$-interpolating.

We hope this distinction is now clear :)

Review
Rating: 5

This work addresses the question of estimability of learning algorithms, i.e. for which learning algorithms there may exist an estimator that is close to the population risk with high probability. It presents conditions that preclude the existence of generalization bounds for certain interpolating algorithms, and sufficient conditions for estimability, in terms of the algorithm's stability.

Strengths and Weaknesses

** Strengths **

This work addresses an important challenge in understanding generalization and deriving generalization bounds, an area where progress has been slow in recent years – particularly for neural networks. The paper is well-written with the examples in Section 1.4 covering various scenarios, highlighting the nuanced differences between learnability and estimability. The theorems presented are novel, to the best of my knowledge, and have a fairly general scope.

** Weaknesses **

Since the authors repeatedly emphasize the case of deep learning, there should be a discussion dedicated to the hypothesis class parameterized by NNs (eg. their VC dimension) and of gradient descent (and, perhaps, other related optimization algorithms); as for the former, I would like to see Appendix A.3 in the main text. There should also be a discussion on how the theorems apply to deep learning, even if in a weak sense, as near line 517 discussing other works. Related to deep learning, I think the paragraph starting near line 162 means to discuss stability of gradient descent (and variants), since that is the algorithm, while the neural network parameterizes the hypothesis class.

One concern I have is that Theorems 3.1 and 3.2 are not sufficiently general, unlike Theorem 4.3. I'm not even sure what the applications of these theorems are.

It is strange that the authors couldn't think of any limitations of their work; I would urge the authors to add a paragraph on the same. Given the slow progress in this research area in recent years, I think adding some future directions, which I am sure are plenty, would be helpful for the community.

Questions

Can you give some intuition around Remark 1.3? It seems to make a seemingly strong claim without any justification.

Suggestions:

The paper is reasonably self-contained, but the ordering of sections needs to be improved. Particularly, the notation in Section 2 needs to come before Section 1.1, else it becomes very hard to comprehend there onwards. Section 1.5 on related works also needs to be moved up, maybe right after Section 1.1. I am also not sure why the informal and formal versions of the theorems are both included in the main text – with Section 2 moved up, it should be sufficient to just keep the formal forms of the theorems (perhaps with informal descriptions alongside).

The abstract should include what is meant by stability of algorithms, because without that, it is hard to understand what the results are about – something like loss-instability would be good. The abstract also mentions that the paper is concluded with a discussion relating the existence of tight bounds with variance of algorithm's loss. That's not the case for this submission – the relevant discussion is at the end of Section 1.3. Kindly amend the abstract to accurately indicate the location of this discussion.

Fact C.2, as in the main text, is not at all clear, particularly what the variance is with respect to. Reading the supplementary material, my understanding is that the sampling process is $\mathcal D \sim \mathbb D \rightarrow S \sim \mathcal D^m$, and the expectation is w.r.t. the marginal over $S$ and the variance w.r.t. $\mathcal D \mid S$; correct me if I'm wrong. Please clarify that.

I think there are some inaccuracies in Example 1.4. Specifically, $T \sim U(\mathcal X)^{m^2}$ implies sampling with replacement, as does Claim K.2, which uses a similar notation (also the reference to the Birthday Paradox). However, the proof of Claim K.2 seems to assume a set of $k$ ($= m^2$ in Example 1.4) distinct points, i.e., sampling without replacement. This can easily be fixed by considering sampling $T$ without replacement, and fixing the notation; I don't see the example breaking down with this change. In Example 1.4, there is also a mention of $L_{\mathcal D}(A(S)) = \mathcal D_{\mathcal X}(\{x_1, \ldots, x_m\})$, which is the case under $0/1$ loss, as is also assumed in some other places, but is not mentioned anywhere before Definition 2.3.

In Example 1.6, I don't think "$A$ always has $0$ empirical loss", because the definition of the outputted hypothesis seems to imply that for inputs with unique labels, we have the hypothesis memorizing the label, but for those with non-unique labels, i.e., $(x,-1), (x,+1) \in S$, the hypothesis predicts $-1$. For such inputs, we will have non-zero empirical risk. I don't see any distributional assumptions precluding such training sets.

Typos:

In Fact 1.2, my understanding is that if $(A, \mathbb D)$ is $(\epsilon, \ldots)$-estimable, then the generalization bound is $2\epsilon$-tight, not $\epsilon$-tight.

In Example 1.6, the first condition can omit the "$\exists y:$" part, since that is understood.

In Example 1.7, the description of the algorithm is given twice consecutively: 1. "consider a mapping $A$ chosen uniformly from the set $\mathcal A$", and 2. "$A(S)$ is a function that was chosen uniformly from $\mathcal F$".

Limitations

yes

Final Justification

authors answered all my concerns

Formatting Issues

None

Author Response

General

We are grateful for your positive feedback and the constructive suggestions provided. We have found your comments very valuable in refining our paper.


Weaknesses

Since the authors repeatedly emphasize the case of deep learning, there should be a discussion dedicated to the hypothesis class parameterized by NNs (eg. their VC dimension) and of gradient descent (and, perhaps, other related optimization algorithms);

For lower bounds on the VC dimension of some deep neural network architectures one could for example consult [1]. Such bounds together with our Theorem 3.1 imply the existence of some optimizers for which any generalization bound must be loose across many distributions (see our discussion in lines 117-119).

Similarly, Theorem 3.2 has direct implications for NNs trained with SGD, provided that we know that this algorithm interpolates some class $\mathcal F$ which is $\varepsilon$-almost orthogonal. As mentioned in our footnote 6, there are settings for which this is provably the case.

As one can see, the applicability to deep learning with SGD hinges on knowing which classes $\mathcal F$ SGD interpolates. Understanding the behavior of SGD is an active field of research and is beyond the scope of the current paper.

[...] as for the former, I would like to see Appendix A.3 in the main text.

Thank you for your suggestion. We will try to include more details about these works in the main text of the camera-ready version, if space permits.

There should also be a discussion on how the theorems apply to deep learning, even if in a weak sense, as near line 517 discussing other works.

We refer to this in line 143. If one derives a new generalization bound for neural networks, the very least that needs to be verified for it to be non-vacuous is to check either of the following items.

  1. Exclude in advance all families of distributions with nearly-orthogonal labeling functions, and use this fact in the derivation of the generalization bound.
  2. Mathematically show that the algorithm is not biased towards any set of nearly-orthogonal functions.

Related to deep learning, I think the paragraph starting near line 162 means to discuss stability of gradient descent (and variants) [...]

Thanks for pointing this out. We mean neural networks trained with standard optimizers (e.g., SGD, Adam, …).

One concern I have is that Theorems 3.1 and 3.2 are not sufficiently general, unlike Theorem 4.3. I'm not even sure what the applications of these theorems are.

Theorems 3.1 and 3.2 can be instantiated to rule out the existence of tight generalization bounds in many scenarios encountered in practical machine learning, as we explain in our paper on line 143. See also our discussion above on applications to deep learning.

Limitations

We appreciate this suggestion and will add a discussion of limitations and open problems in the camera-ready version, for which we will have one extra page. Among the questions we plan to include: For VC classes, can the negative results of Theorems 3.1 and 3.2 be extended beyond $d^{0.5}$? A characterization of algorithms that admit distribution-free tight generalization bounds (such as the memorization algorithm). What combined properties of an algorithm and a distribution family admit tight generalization bounds?


Questions

Can you give some intuition around Remark 1.3? It seems to make a seemingly strong claim without any justification.

This is a simple consequence of the fact that binary classification is a special case of both regression and multi-class classification. Note that since our Theorems 3.1 and 3.2 are negative results, proving them for the special case (binary classification) in particular implies the same negative results for the more general cases.

Ordering of sections needs to be improved. Particularly, the notation in Section 2 needs to come before Section 1.1, else it becomes very hard to comprehend there onwards. Section 1.5 on related works also needs to be moved up, maybe right after Section 1.1.

Thank you for the suggestion. We understand your concern, but we feel that placing the full technical preliminaries section early on would disrupt the narrative flow. Introducing concepts gradually, alongside motivating examples, before formalizing definitions allows us to better engage the reader and build intuition. We hope this structural choice is not a significant issue, as it does not affect the core content of the paper. That said, we’d be glad to revisit the paper’s organization during the discussion period to find a structure that best serves the reader. We welcome your thoughts on this!

I am also not sure why the informal and formal versions of the theorems are both included in the main text – with Section 2 moved up, it should be sufficient to just keep the formal forms of the theorems (perhaps with informal descriptions alongside).

The abstract should include what is meant by stability of algorithms [...]

We will change "stable" to "loss-stable" in line 7.

The abstract also mentions that the paper is concluded with a discussion relating the existence of tight bounds with variance of algorithm's loss. That's not the case for this submission [...]

Thank you for catching this; we will correct it.

Fact C.2, as in the main text, is not at all clear, particularly what the variance is with respect to. [...]

You are correct about the sampling process. We will make this more clear in the revision.

I think there are some inaccuracies in Example 1.4. [...]

Note that $\mathcal X$ is the interval $[0,1]$, hence $T$ contains all distinct points almost surely. You are technically correct in mentioning collisions, but we think it is better to suppress the "almost surely" in this non-technical part of the paper.

Example 1.4 and the assumed $0/1$ loss

Thanks for catching this; we will add the missing assumption of $0/1$ loss.

Example 1.6 and unique labels

Thank you for catching this. We had in mind the setting where the labels are deterministic given the input $x$; we will add this.

Typos:

In Fact 1.2, my understanding is that if $(A, \mathbb D)$ is $(\epsilon, \ldots)$-estimable, then the generalization bound is $2\epsilon$-tight, not $\epsilon$-tight.

This is incorrect; perhaps you had in mind the case where on the right side of (3) there can also be a $+\varepsilon$, which we do not allow (see (1)).


References

[1] Harvey, N., Liaw, C., & Mehrabian, A. (2017). Nearly-tight VC-dimension bounds for piecewise linear neural networks. Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1064-1068.

Comment

Dear reviewer,

Now is time for the discussion with the authors. Can you please reply to their rebuttal, indicating if your issues were clarified and or answered? If that is not the case, please indicate why the answer was not satisfying.

Many thanks! Your AC

Comment

Thank you very much for taking the time to respond!


This is incorrect; perhaps you had in mind the case where on the right side of (3) there can also be a $+\epsilon$, which we do not allow (see (1)).

Correct me if I'm wrong, but if $(A, \mathbb D)$ is $(\epsilon, \ldots)$-estimable, then

$$\mathbb P_{\mathcal D \sim U(\mathbb D),\, S \sim \mathcal D^m}\big[\,\big|\mathfrak E(S) - L_{\mathcal D}(A(S))\big| \leq \epsilon\,\big] \geq 1 - \delta$$
$$\mathbb P_{\mathcal D \sim U(\mathbb D),\, S \sim \mathcal D^m}\big[\,\mathfrak E(S) - \epsilon \leq L_{\mathcal D}(A(S)) \leq \mathfrak E(S) + \epsilon\,\big] \geq 1 - \delta$$
$$\mathbb P_{\mathcal D \sim U(\mathbb D),\, S \sim \mathcal D^m}\big[\,b(S) - 2\epsilon \leq L_{\mathcal D}(A(S)) \leq b(S)\,\big] \geq 1 - \delta$$

with $b(S) = \mathfrak E(S) + \epsilon$.


I don't have any other pending concerns, except about the paper's structure and formatting, which I believe can limit readability. Trusting that the authors will improve on it in the camera-ready version, I recommend this work to be accepted.

I do want to emphasize that I am not confident in assessing the paper's significance for the research community.

Comment

[...] I recommend this work to be accepted.

Wonderful, we are thrilled to hear this!

And thank you for your very valuable comments! We are particularly indebted to your careful reading, which identified some details and nuanced points that will help us sharpen our presentation (e.g., $\varepsilon$ vs $2\varepsilon$ tightness, conditions for memorization having 0 empirical loss, the mention of Fact C.2 and stability in the abstract, organizational issues like the appearance of informal and formal versions of the theorems, etc.)

Correct me if I'm wrong, but if $(A, \mathbb D)$ is $(\varepsilon, \ldots)$-estimable, then [...]

Yes, that is exactly correct! And apologies for the confusion. Thank you for catching this typo in Fact 1.2!

Review
Rating: 4

The paper is concerned with deriving tight generalization bounds for learning algorithms that are non-uniform w.r.t. the distribution; that is, the authors study for which pairs of algorithms and sets of distributions the algorithm has a tight generalization bound (on the specified distributions). The core property studied to achieve this goal is the estimability of an algorithm $\mathcal A$, first defined by (Gastpar et al., 2024), which roughly means that there exists an $\varepsilon$-accurate, $\delta$-confident estimator for the population loss achieved by $\mathcal A$, using $m$ samples from a distribution belonging to the specified set of distributions. A simple immediate consequence of this definition is that all estimable algorithms that satisfy some generalization bound will satisfy it in an $\varepsilon$-tight manner with $\delta$ confidence.

The main contribution that sets this paper apart from existing work is that it establishes connections between estimability and standard concepts in statistical learning theory, such as the variance of an estimator, loss-stability, and VC dimension. They also provide several simple examples that delve into what estimability is and how it is very different from learnability. The paper is a very interesting read and will be of interest to the community. I do have some remarks regarding what is, in my opinion, an unusual organization.

Strengths and Weaknesses

Strengths: The paper presents very interesting results, which significantly extend the previous results on estimability, the proofs in the appendices appear to be correct from a quick read through.

Weaknesses: In my subjective opinion, the presentation of the paper would benefit from a better organization of the text. Here are some of my observations that I hope might be helpful:

At the moment the majority of the paper is an introduction giving informal results (esp. Sec 1.3 and 1.4), which takes 7 pages. On one hand, this makes the results easier to understand, but on the other hand, it leaves the reader constantly searching in the appendices and Sec 2 and 3 for concrete details. Most of the actual theoretical results are presented on 1.5 pages at the end (Sec 2 and 3). The actual Theorems 3.1 and 3.2 are not too difficult to understand in their full statements; it might be worth trying to shorten the informal Sec 1 significantly and instead present more of the important details in Sec 2 and 3. For example, I find the presentation of "Fact C.2" very brief, and I think it would deserve more explanation in the main text.

I would also very strongly suggest presenting "Facts" instead as "Theorems", "Lemmas", "Propositions", "Corollaries", or "Remarks"; I know some of these are simple consequences of definitions, but still. Another remark I have is that there is actually quite a bit of important detail in the appendix that is not mentioned in the paper at all, for example the comparison with Gastpar et al. (2024) or the numerical experiments. There is also no concluding discussion, which might be due to lack of space, but would be very useful to have in this kind of paper, to give some hints about interesting future directions etc.

Overall, I really like this paper and found it a fascinating read, but the organization feels rushed. I trust the authors to know the best way to improve it (e.g. the decision about the length of informal/formal presentation of the main results)

Questions

  • Can you compare your approach to proving non-uniform generalization bounds directly? I assume this would not yield tightness, but it could possibly lead to new, smaller bounds (that might or might not be tight).
  • In Line 85 you state: "And if the number of samples is too small to allow learning, some interesting algorithms might nonetheless perform well on a certain subset of distributions and also be very estimable". Is this a technically correct statement? I thought estimability is equal to having tight bounds; shouldn't the sentence then state that "… some interesting algorithms might nonetheless perform well on a certain subset of distributions because they are very estimable on that distribution subset."
  • In Line 124, I would suggest removing the tilde as a sign of approximation and specifying the number $0.16$.
  • In Line 144 you say "necessarily very weak for many distributions". Could you please specify in what sense weak? I assume it means that no bound can be better than $1/4$-tight, correct?
  • In Line 158 you state: "Seeing as there are many definitions of stability in the literature, Theorem 4.3 makes a nontrivial conceptual contribution by identifying the "correct" notion of stability for understanding estimability". Without pointing to a later argument made in Sec 4 (which indirectly mentions the memorization Example 1.6), this reads as a very strong statement. I think it would be good to back this statement here a bit more, or perhaps leave it for later.
  • In Line 216: “Hence, by Markov’s inequality, 99% of learning rules ...“ where is the 99% coming from? Is this in the theorem?
  • l321: "using a similar technique": do you mean that if you use a similar technique as in Theorem 3.1 you will get 1/6 instead of 0.16? If you want to keep this statement, I would suggest specifying what the technique is based on (or perhaps just applying it and stating Theorem 3.2 with the better result).
  • Definition 4.1 -> What is the notation $S_1 \circ S_2$? I cannot find it introduced anywhere; is this the union of the sets?

Limitations

Yes

Final Justification

I decided to keep my score. I really like the paper and found it an original and interesting read. I have minor remarks regarding the clarity of organization of the paper and the fact that the topic is non-standard might make the contribution less clear-cut. Overall, I am slightly leaning towards accepting the paper.

Formatting Issues

I don’t like the authors use “Fact” instead of “Theorem”, “Proposition”, or “Lemma”, etc.

Author Response

Thank you for your thoughtful suggestions, which closely engage with important considerations we had while structuring our presentation.


Presentation of the paper

The current version of the paper is informed by our previous experiences presenting this work. Unfortunately, the topic of estimability appears to meet with an unusually large amount of confusion, misunderstandings, and misconceptions. Readers often conflate learnability, generalization, and estimability, and this leads them to variously insist that all our results are obviously true and are already known, or are obviously false and contradict known results, etc. For instance, some readers claim that their favorite generalization bound is universally tight for all algorithms and all distributions (which is provably false), while others claim that classic no-free-lunch theorems have already shown long ago that generalization bounds cannot be tight for all algorithms and distributions (also untrue).

As a consequence, we have chosen to provide a very careful and methodical introduction that emphasizes the basic concepts and questions. We also included Section 1.2 (“learnability and estimability are not the same”), and lots of examples in Section 1.4. We find that this reduces (but does not eliminate) the confusion about the basic questions and value of our work, and increases the number of people who understand that estimability is an interesting and important topic of study that is not subsumed by basic existing results (like the VC theorem, etc.). This is a precondition for readers to be willing to listen to and engage with our specific results.


Renaming “Facts” to “Theorems”, “Lemmas”, “Propositions”, “Corollaries”, or “Remarks”

Thank you for this suggestion, we can rename those to Propositions/Theorems.


Comparison with Gastpar (2024) et al., the numerical experiments, concluding discussion.

We appreciate this suggestion and will add parts of the comparison to Gastpar et al. and a short discussion of limitations and open problems in the camera-ready version, for which we will have one extra page. Among the questions we plan to include:

  • For VC classes, can the negative results of Theorems 3.1 and 3.2 be extended beyond $d^{0.5}$?
  • A characterization of algorithms that admit distribution-free tight generalization bounds (such as the memorization algorithm).
  • What combined properties of an algorithm and a distribution family admit tight generalization bounds?

Questions

Can you compare your approach to proving non-uniform generalization bounds directly?

We don’t quite see a connection between our work and non-uniform generalization bounds. Could you please elaborate more, and then maybe things will click for us?


“I thought estimability is equal to having tight bounds”

This is correct. By the algorithm $A$ "performing well", we mean that it learns well (across some subset $\mathbb D$), not that some associated generalization bound is tight (across $\mathbb D$), which you perhaps had in mind? Could you please clarify this for us?


"… some interesting algorithms might nonetheless perform well on a certain subset of distributions because they are very estimable on that distribution subset."

This is incorrect. Algorithms can be very estimable but can perform poorly across many or all scenarios. See the memorization example, which is globally estimable yet performs badly on all distributions except ones that are close to the zero function; see also Example 1.7: the algorithms we draw randomly are perfectly estimable yet perform poorly across all scenarios.


In Line 124, I would suggest removing the tilde as a sign of approximation and specifying the number.

We can change this to the more conservative 0.160.16 for the revised version.


In Line 144 you say "necessarily very weak for many distributions". Could you please specify in what sense weak? I assume it means that no bound can be better than $\varepsilon$-tight, correct?

Yes, this is correct. Moreover, this holds for large $\varepsilon$, namely at least $1/4 - o(1)$.


Line 158: “‘correct’ notion of stability for understanding estimability”.

Note that we do point to Sec. 4 in the sentence right after (l. 161). We can change this and point to Sec. 4 already at the end of the sentence in l. 158 to make things clearer.


Line 216: “Hence, by Markov’s inequality, 99% of learning rules [...]“

Define $I_{A,f,S} = \mathbf{1}\!\left[\left|L_{\mathcal{D}_f}(A(S)) - \frac{1}{2}\right| \geq \varepsilon\right]$ and $P_A = \mathbb{E}_{f,S}[I_{A,f,S}]$, where $\mathbf{1}(\cdot)$ is the indicator function. Then, by assumption, $\mathbb{E}_A[P_A] \leq 2e^{-2d\varepsilon^2}$, and by Markov's inequality, $\Pr_A\!\left(P_A \geq 200e^{-2d\varepsilon^2}\right) \leq \frac{1}{100}$.
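For completeness, the Markov step in full:

```latex
\Pr_A\!\left(P_A \ge 200\,e^{-2d\varepsilon^2}\right)
  \;\le\; \frac{\mathbb{E}_A[P_A]}{200\,e^{-2d\varepsilon^2}}
  \;\le\; \frac{2\,e^{-2d\varepsilon^2}}{200\,e^{-2d\varepsilon^2}}
  \;=\; \frac{1}{100}.
```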


Line 321: “using a similar technique”. “You mean that if you use a similar technique as in Theorem 3.1 you will get 1/6 instead of 0.16?”

This is correct. In eq. 12 and below (see the appendix of the full paper), we consider for simplicity the event $\{Z \in \{2,3\}\}$ (where $Z$ is the number of consistent hypotheses), already yielding $0.16$. However, we could have invoked Lemma H.1 with $\mathbb{P}(\{Z = k\})$ for all $k \ge 2$, which would give the only slightly larger $1/6$. We will add a note at the end of Appendix F for clarification.


Definition 4.1 -> What is the notation $S_1 \circ S_2$? I cannot find it introduced anywhere; is this the union of the sets?

This means concatenation; see the definition in l. 337. Note that a set union would not account for duplicates amongst $S_1, S_2$.
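A toy illustration of the difference (hypothetical data, only to show why duplicates matter):

```python
S1 = [(0.2, 1), (0.5, 0)]
S2 = [(0.2, 1), (0.8, 1)]    # shares the example (0.2, 1) with S1

concat = S1 + S2             # S1 ∘ S2: concatenation keeps the duplicate
union = set(S1) | set(S2)    # set union silently collapses it

print(len(concat), len(union))  # 4 3
```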


Comment

I would like to thank the authors for a clarifying reply to my review and I trust that the authors know best how to implement any changes in regards to the clarity of presentation if accepted.

Reading also through the other reviewers comments, some of the points are clearer. I decided to keep my score.

Comment

Dear reviewer,

Now is the time for the discussion with the authors. Can you please reply to their rebuttal, indicating whether your issues were clarified and/or answered? If that is not the case, please indicate why the answer was not satisfying.

Many thanks! Your AC

Comment

Dear reviewer,

Please respond to the author rebuttal as soon as possible - but by Aug 8, anywhere on earth, at the latest. Please do not only submit the "Mandatory Acknowledgement"; please also respond with text to the author rebuttal. For example, explain why the author rebuttal convinced you, or why not.

Many thanks, Your AC.

Review
5

This paper investigates the conditions under which a learning algorithm $A$ admits tight generalization bounds. It introduces the concept of "estimability": a learning algorithm $A$ is estimable if there exists an estimator $E$ such that, with high probability, $|E(S) - L_{\mathcal{D}}(A(S))| \leq \varepsilon$.

The paper provides several negative results showing when tight bounds cannot exist. Theorems 3.1 and 3.2 show that algorithms with "inductive biases" toward subsets of VC classes (when the number of samples is smaller than $\sqrt{d}/10$) or toward nearly-orthogonal functions do not admit tight generalization guarantees.

On the positive side, the authors show that stable algorithms are estimable (Theorem 4.3), providing a sufficient condition for when tight bounds can exist. They also provide a simple characterization linking estimability directly to the conditional variance of the algorithm's loss, stating that an algorithm is estimable if and only if $\mathbb{E}[\operatorname{Var}(L_{\mathcal{D}}(A(S)) \mid S)] \leq \varepsilon$.
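To make the definition concrete, here is a minimal sketch of one possible estimator $E$ (a simple holdout estimate; this particular choice is our illustration, and the paper asks whether any such $E$ can be $\varepsilon$-accurate, not whether this one is):

```python
import numpy as np

def holdout_estimator(S, A, loss, frac=0.8, seed=0):
    """E(S): run the learning rule A on part of S and report the average
    loss on the held-out remainder as an estimate of L_D(A(S))."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(S))
    cut = int(frac * len(S))
    train = [S[i] for i in idx[:cut]]
    val = [S[i] for i in idx[cut:]]
    h = A(train)  # hypothesis output by the algorithm
    return float(np.mean([loss(h, z) for z in val]))
```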

Strengths and Weaknesses

This paper is actually about empirical generalization bounds, i.e., ones that can be evaluated knowing only the training data rather than relying on unknown constants/functionals of the data distribution. This makes the results more practically relevant. As a result, it seems overparameterization will rely on distributional properties. One question is that the celebrated result of Bartlett et al. already shows that one needs control over the covariance matrix of the data distribution in order to reason about benign overfitting; I think it is indeed connected to this paper. Indeed, the assumption of loss stability is a distributional assumption. Another question is how the sufficient condition compares to other parts of the paper: is it true that verifying $\mathbb{E}[\operatorname{Var}(L_{\mathcal{D}}(A(S)) \mid S)] \leq \varepsilon$ requires distributional knowledge?

One section that is missing is open problems. I think posing questions like the required number of samples to estimate the generalization gap, whether the gap between the two negative results can be closed, or whether distribution-free sufficient conditions exist would strengthen the paper.

Another line of related work is testable learning. In this line of work, the goal is to bridge the gap between learning on a particular distribution and the worst-case perspective in PAC learning. It would be nice to include a discussion of this line of work. Gollakota, Aravind, Adam R. Klivans, and Pravesh K. Kothari. "A moment-matching approach to testable learning and a new characterization of Rademacher complexity." Proceedings of the 55th Annual ACM Symposium on Theory of Computing. 2023.

In general, I think this paper is interesting and has some important findings.

Questions

see above

Limitations

yes

Final Justification

My view on the paper is still the same. I think they frame an interesting question, and the results are interesting.

Formatting Issues

n/a

Author Response

We appreciate your positive outlook on our paper and your insightful comments!


“This paper is actually about empirical generalization bounds… As a result, it seems overparameterization will rely on distributional properties.”

We agree with this interpretation. Our focus is indeed on empirical generalization bounds, i.e., bounds that can be computed based solely on the training data, which makes the notion of estimability directly applicable in practice. Our results show that tight empirical generalization bounds necessarily require assumptions about the distribution, just as benign overfitting does, as your comment below notes.


“The celebrated result of Bartlett et al. already shows that one needs to control the covariance matrix... I think it is indeed connected to this paper.”

We agree that there is a conceptual connection. The result of Bartlett et al. establishes that benign overfitting in linear regression with Gaussian covariates requires strong control over the data covariance. While their setting is restricted to linear regression over infinite-dimensional Gaussian data, they do speculate on extensions to neural networks in Section 4 of their paper. It remains unclear whether such covariance control is necessary for understanding generalization in deep learning (or other overparametrized models), but their results, like ours, emphasize the importance of distributional assumptions for achieving tight generalization bounds. We will mention this connection in our paper.


“Indeed, the assumption of loss stability is a distributional assumption.”

This is only partially accurate. Loss stability, as used in our paper, is a property of a particular (algorithm, set of distributions) pair. It is not purely a distributional assumption. For example, fixing a restricted class of distributions, some algorithms may exhibit stability while others do not. Notably, our paper includes an example, the memorization algorithm, that is estimable and exhibits stability uniformly across all distributions.


“How does the sufficient condition compare to other parts of the paper? Is it true that verifying $\mathbb{E}[\operatorname{Var}(L_{\mathcal{D}}(A(S)) \mid S)] \leq \varepsilon$ requires distributional knowledge?”

Yes, the variance-based characterization involves an expectation over a prespecified collection of distributions, one that is assumed to reflect prior knowledge about the problem setting (e.g., Gaussianity, bounded support, etc.). The goal is not to be distribution-free, but to identify, for a given algorithm, what distributional conditions admit tight empirical bounds.
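As a toy illustration of such a verification when the collection of distributions is known (our construction, not the paper's): labels are Bernoulli($p$) with the bias $p$ drawn uniformly from a small finite family, the learner predicts the majority label, and $\mathbb{E}_S[\operatorname{Var}(L_{\mathcal{D}}(A(S)) \mid S)]$ is computed exactly by enumerating all samples and posteriors.

```python
import numpy as np
from itertools import product

biases = [0.1, 0.3, 0.7, 0.9]   # the known family of label distributions
m = 5                           # sample size

def A(S):
    # Majority-vote learner: predict the more common label in S.
    return int(2 * sum(S) >= len(S))

def true_loss(p, h):
    # L_D(h): probability that a fresh Bernoulli(p) label differs from h.
    return p if h == 0 else 1 - p

total = 0.0
for S in product([0, 1], repeat=m):              # enumerate all samples
    k = sum(S)
    lik = np.array([p**k * (1 - p)**(m - k) for p in biases])
    pS = lik.mean()                              # P(S) under a uniform prior
    post = lik / lik.sum()                       # posterior over the biases
    losses = np.array([true_loss(p, A(S)) for p in biases])
    var_S = (post * losses**2).sum() - (post * losses).sum()**2
    total += pS * var_S                          # accumulate E_S[Var(... | S)]

print(f"E[Var(L_D(A(S)) | S)] = {total:.4f}")
```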


“One section that is missing is open problems.”

We appreciate this suggestion and will add a short discussion of open problems. Among the questions we plan to include:

  • For VC classes, can the negative results of Theorems 3.1 and 3.2 be extended beyond $d^{0.5}$?
  • A characterization of algorithms that admit distribution-free tight generalization bounds (such as the memorization algorithm).
  • What combined properties of an algorithm and a distribution family admit tight generalization bounds?

“Another line of related work is testable learning... It would be nice to include a discussion.”

The paper you suggested may relate nicely to our work, and possibly to some future directions. Our results discuss under what distributional assumptions tight generalization bounds exist, and their work might help test/validate whether such assumptions hold. It is interesting to see if the two perspectives can be combined! We will mention this in our paper.

Final Decision

(a) The authors study the question of which algorithms admit tight generalization bounds.

(b) Strengths: the topic is timely, given the open question of why deep learning generalizes well.

(c) Weaknesses: the presentation is rather unconventional, but this is chosen in order to introduce the topic well to the reader.

(d) Accept. The paper is interesting to the community.

(e) With one reviewer there has been a substantial debate about the importance and relevance of some of the theorems. Many clarifications have been made by the authors, and in the end, it seems that a lot of this hinges on subjectivity.