6.0

/10

Rejected4 位审稿人

最低6最高6标准差0.0

3.8

置信度

正确性3.0

贡献度2.3

表达2.5

ICLR 2025

Class-wise Generalization Error: an Information-Theoretic analysis

Firas Laakom,Yuheng Bu,Moncef Gabbouj,Jürgen Schmidhuber

OpenReview PDF

提交: 2024-09-27更新: 2025-02-05

TL;DR

This paper introduces and explores the concept of "class-generalization error" from an information-theoretic perspective.

摘要

关键词

information-theoretic boundsgeneralization errorclass-bias

评审与讨论

审稿意见

评分: 6置信度: 42024-10-30

The paper studies the class-generalization error bound as opposed to the more traditional case of expected generalization over the whole data distribution. The authors provide a new class-specific generalization definition and give multiple information-theoretic bounds based on the KL divergence (more classic), and the super-sample technique (maybe somewhat newer). They analyze the tightness of their bounds in some experiments on CIFAR10 and CIFAR100 showing the failure of the general generalization bounds and the success of their class-specific approach. Finally, the authors provide some examples of the applications in which the provided bounds can be useful and add new insights.

优点

The paper is well-written, and easy to follow. The authors have motivated their work nicely both at the beginning of the paper and the end by providing specific examples of some scenarios in which the class-wise generalization bound can be insightful like the sub-task problem as a specific case of distribution shift. The idea of caring about each class independently is new and in some cases tightens the pre-existing bounds.

缺点

The technical contribution of the paper is limited. While the definition of class-wise generalization bound is new, the results are heavily based on the previous works [Xu and Raginksy 2017, Zhou et al 2022, Harutyunyan 2021, Steinke and Zakynthinou 2020, Clerico et al 2022, Wang and Mao 2023], and follow the same flow of proof, without any particular novelty. It would be useful if the authors mention the technical challenges their definition might propose to the problem and how it differs from the challenges one needs to overcome while taking the general generalization bounds. The insight of some bounds being tighter than others again is well-studied in previous works. However, I agree with the authors that the tightness of class-wise generalization as opposed to general generalization bound in Corollary 1 is new.

问题

In the example of "truck", the authors show that their bound captures the behavior of the noisy case. Is it possible to explain the increase in the error w.r.t samples using the proposed bound?
In Remark 3, the authors mention that their bound is discrepancy-independent. However, the reason for this is not their bound, but the change of the definition from $L_E(W, S)$ to $L_{E_Q}(W, S)$ . I would suggest the authors rephrase this part and perhaps break it down into two factors, one due to the change in the definition (still dependent on the discrepancy), and one the bound (discrepancy-independent) to avoid being misleading.
In [1], the authors provide a general framework for information-theoretic bounds which saves a lot of proof as long as the authors can show some regularity of the distributions. Since the authors' technique of proof follows the same techniques in the general generalization bounds, have authors considered using this framework?

[1] Chained Generalisation Bounds, Clerico et al. COLT 2022

2024-11-21

Weaknesses: Thank you for raising this point. From a technical perspective, while the proof of Theorem 1 can be seen as an extension of prior works by applying the classic Donsker–Varadhan (DV) variational representation over a conditional distribution, however, we note that in the super-sample setting, the existence of the indicator functions, as mentioned in L243-L248 introduces a noticeable challenge.

To address this issue and derive tight bounds, we first extend prior bounding techniques (Wang & Mao, 2023; Harutyunyan et al., 2021) in Lemma 2. The novel proof requires a combination of DV representation along with Hoeffding’s Lemma to obtain tight bounds using the indicator functions. More importantly, we derive a tighter bound in Theorem 4, eliminating the need for a max operation between two indicator functions. We propose to use the CMI between the label-dependent projection of the loss pair $∆_y L$ (which we propose) and the random selection process to upper bound the class-generalization error.

Q the "truck" example: Yes, it is possible. Note that our bounds show two key dependencies: $1/n^Y$ and the CMI term (e.g., $I\_{Z[2n]} (∆_yL_i ; U_i))$ . While increasing the number of samples decreases $1/n^Y$ , it does not necessarily decrease the CMI term and can even increase it.
For the "truck" class, the generalization error increases with the number of samples $n^y$ because the growth rate of the CMI term exceeds the rate at which $\frac{1}{n^y}$ decreases. This suggests that the model increasingly memorizes patterns in the additional samples, leading to higher CMI and, consequently, an increase in both the bound and the generalization error. Our bounds effectively capture and explain this class-specific generalization behavior, demonstrating that the CMI term can increase under certain conditions. Now the question ‘why CMI is increasing’ needs a more refined analysis of the algorithm dynamic and our theory suggests that adding class-specific noise could mitigate this issue.

Q Remark 3: Thanks for pointing this out. To our understanding, the proposed bound is indeed discrepancy-independent, in the sense that it doesn’t depend on an intractable measure. Note that the difference between Eq. 12 and Eq 14 is simply the difference between the empirical losses on E_Q and E_P. Hence, the bound derived in Theorem 5 can be converted to bounds for Eq.12 by simply adding (E_Q-E_P) on both sides. Importantly, (EQ−EP) can be directly computed from the training data, eliminating the need for intractable discrepancy measures between the true target and domain distributions, as required in prior bounds. This highlights the key advantage of our bound compared to prior bounds. To make this clear and avoid confusion, as suggested by the reviewer, we have rephrased Remark 3 in the updated manuscript to highlight these points.

Q [1] Chained Generalisation Bounds: Thank you for raising this interesting point. Indeed, the chaining technique can tighten mutual information bounds by shifting the regularity assumption from the loss function to its gradient. However, bounds derived using standard MI/CMI, as used in this paper, offer greater interpretability compared to their chained counterparts.
Furthermore, in this study, we prioritized the empirical validation of our bounds to understand and capture the generalization behavior of neural networks. Standard MI/CMI settings are well-suited for this purpose as they result in computable bounds that align closely with empirical results.
We consider our work to be a foundational step toward a theoretical understanding of class-generalization. Extending these results using the chaining technique constitutes a promising avenue for future research, particularly in deriving tighter and potentially more nuanced class-generalization bounds. We have now added this discussion in the future work section at the end of the paper

评论- Rebuttal deadline reminder

2024-11-25

We thank the reviewer for their time and efforts. We hope that we have provided clarifications to address your reviews to raise your initial score. As the rebuttal phase is ending and we did not receive any feedback yet, kindly let us know if you have any further concerns.

评论- Reply to official comments by the authors

2024-11-25

I thank the authors for their responses to my questions and appreciate their clarification on Remark 3. I think it is an interesting work to study the class-wise generalization bounds.

审稿意见

评分: 6置信度: 42024-11-01

In this paper, the generalization performance of learning algorithms are studied on a classwise level—i.e., bounds are provided between the population loss for samples on a class and the training loss for samples from that class. These bounds are obtained using information-theoretic tools, i.e. the KL divergence and conditional mutual information. Empirical evaluations are performed to verify that classwise discrepancies exist in real applications, and that the proposed bounds correlate with them. Several extensions and connections to related problems (subtask, sensitive attributes) are presented.

优点

The paper begins by motivating the problem at hand in a very nice way, illustrating that classwise error discrepancies occur and do not behave in ways that are immediately obvious. The presented results are evaluated thoroughly, and interesting extensions are discussed. While the core of the derivations rely on standard techniques, there are some technical intricacies to be dealt with, and the present application is new.

缺点

Some more discussion on the topic of imbalanced data sets and how this affects classwise generalization error would be of interest. For instance, explicitly stating that the classes in CIFAR10 are balanced could be useful for the reader.

The discussion in Appendix C.2 does not seem to be accurate. Unless I’m mistaken, the following appears to be perfectly valid:

$E_{\hat Z_{[2n]}, U, W}[ \frac{1}{n} \sum_i \ell(\hat Z_i^{-U_i}, W) - \ell(\hat Z_i^{U_i}, W)] = \frac{1}{n} \sum_i \big( E_{\hat Z_{i}, U, W}[ \ell(\hat Z_i^{-U_i}, W)] - E_{\hat Z_{i}, U, W}[ \ell(\hat Z_i^{U_i}, W)] \big)$ $= \big( E_{\hat Z_{1}, U, W}[ \ell(\hat Z_1^{-U_1}, W)] - E_{\hat Z_{1}, U, W}[ \ell(\hat Z_1^{U_1}, W)] \big) = \big( E_{\bar Z, \bar W}[ \ell(\bar Z, \bar W)] - E_{Z, W}[ \ell(Z, W)] \big)$

which is the generalization gap. First step was linearity of expectation, second used the symmetry of the algorithm + iid (despite the altered process, the sample pairs are iid), and the last one used that $WW$ and $hat Z_1^{-U_1}$ are independent. The fact that the generalization gap is linear in the test/train losses mean that the $y$ -specificity of the pairs is marginalized out.

Now, if one were to consider bounds where the left-hand-side was given in terms of a function of train/test loss (e.g., the binary KL divergence as in the Maurer-Langford-Seeger bound), this would no longer hold.

Minor:

Should $n_y <n$ on line 136 be $\leq$ ?

“ $n$ super-samples” on line 193: in the original paper from Steinke and Zakynthinou, the term “supersample” was used to describe the full set of $2n$ samples.

Corollary 2 and Theorem 5 (e.g.): Would be nice to adjust the size of brackets.

Line 1070: “requieres”

问题

— Could you address the comment above regarding Appendix C.2? I acknowledge that I may have missed something.

— How does the approach with class-specific gradient noise in Appendix D.7 compare with an approach that adds gradient noise for every class?

— There are some results in the PAC-Bayesian literature on generalization bounds for the confusion matrix (e.g. Morvant et al, “PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification”). How does your work relate to this literature?

— Lemma 2: Does $V$ depend only on $W$ here or on $W$ and $z_{[2n]}$ ?

2024-11-21

Weakness discussion on the topic of imbalanced/balanced datasets: Thank you for your interesting point. In our study, we focused on balanced datasets such as CIFAR-10/CIFAR100, where each class is represented equally, to ensure that the observed class-generalization disparities are not confounded by class imbalance. This allows us to attribute disparities in class-generalization error to intrinsic properties of the learning algorithm and data distribution rather than to skewed class proportions. We have updated the caption of Figure 1 in the revision to emphasize this point.

Weakness discussion in Appendix C.2: In Appendix C.2, we discussed this alternative definition of class-generalization error by tweaking the supersample setting to ensure that each pair of supersample $Z_i^{\pm}$ has the same label y. However, the key issue with such a formulation lies in creating some dependency between the training set $\hat{\mathbf{Z}}_{[n]}^\mathbf{U}$ and the test set $\hat{\mathbf{Z}}\_{[n]}^{-\mathbf{U}}$ Hence, the last equality in the review is not valid in this case, as the test sample $Z_1^{-U1}$ is not independent of W, given that $Z_1^{-U1}$ and the training sample $Z_1^{U1}$ always share the same label. As shown in Eq. 55 in the paper, this setting accounts for class-generalization only in the first term, not the second.

Q $n_y \leq n$ ?: Thank you for pointing this out. $n_y=n$ corresponds to all training/test data coming from one class. So there is no notion of class-generalization disparity in this case, which is the main scope of this work. However, it is true that our theory holds even for $n_y=n$ . Hence, we have changed it to $\leq$ to include the strict equality case.

Q supersample: It is true that in the original paper, they refer to $Z\_{[2n]}$ as a supersample. Here, we adopt a slightly different notation with the supersample being the pair $Z_i^{\pm}$ and hence $Z\_{[2n]}$ is the set of n supersamples. We have clarified this point in the revision.

Q brackets: Thank you for pointing this out. We have revisited the manuscript and fixed the sizes of all brackets to improve readability and make it aesthetically pleasing.

Q Line 1070: Thank you for catching this. We have fixed the typos in the revised manuscript.

Q class-specific gradient noise in Appendix D.7: Thank you for raising this interesting question. Our theory predicts that any approach capable of reducing the MI/CMI between the training samples of class y and the model weights or outputs should improve the class-generalization performance. From this perspective, adding noise specifically to the gradients of samples from class y or introducing noise to all samples (including those from class y) is expected to enhance the generalization for this class.

In this work, we focus on class-specific gradient noise to perturb the gradients for a single class at a time. This allows us to isolate the impact of MI/CMI related to class y, as analyzed in Theorems 2-4, rather than addressing the MI/CMI associated with the standard generalization error, such as in Corollary 2 of Harutyunyan et al. (2021).

Q PAC-Bayesian literature: Thank you for bringing up the PAC-Bayesian literature on generalization bounds for the confusion matrix for our attention. While the work by Morvant et al. focuses on bounding the entries of the confusion matrix to understand multi-class classification with PAC-based bounds, our work takes a different perspective. Specifically, we introduce the concept of class-generalization error, which quantifies the generalization of a specific class. More importantly, our theoretical results are not restricted to the Gibbs algorithm but apply to any learning algorithm. We have added this discussion to the related work section in the revised manuscript.

Lemma 2: Indeed, the random variable V depends only on the random variable $W$ and the constant z_[2n]. To make this clear, we have rephrased the statement of lemma2 in the revision.

2024-11-21

Thank you for your thorough response. I agree that the discussion in Appendix C. 2 is indeed correct, and that I was mistaken in my initial review.

审稿意见

评分: 6置信度: 42024-11-02

In order to address the issue that the existing generalization boundaries cannot capture the varying generalization performance across different classes, the authors introduce the concept of "class-generalization error," and derive bounds based on KL-divergence and super-sample technique, validating their bounds in different networks.

Their theoretical tools extend to: (i) Deriving class-dependent generalization error bounds affecting standard generalization and tightening existing bounds; (ii) Providing tight bounds for subtask problems with test data subsets; (iii) Deriving bounds for learning with sensitive attributes.

优点

Originality: The authors introduce a new concept - "class-generalization error", to address the issue that the existing generalization boundaries cannot capture the varying generalization performance across different classes. Quality: The authors have provided detailed proofs for each conclusion and conducted comprehensive numerical experiments. Significance: The authors not only extend their theoretical findings to derive bounds for diverse scenarios but also shed light on label-dependent generalization bounds.

缺点

The notation in the formulas is not concise enough. For instance, in Eq 23., the parameters of $\ell$ are in the order of w, x, y, whereas in Eq 54, they have become z, w. In addition, some formulas are not aesthetically pleasing, such as [] in Eq11.

Some questions about the proof are discussed in Question part.

问题

1.Motivation: Even though the authors reveal label-dependency of generalization bound, in addition to mentioned in the Sec 4.1, can you discuss how different labels affect the standard generalization error? Or can you explain the scenarios where class generalization must be considered? 2.Essentiality: The derivation of Theorem 1 and Lemma 2 starts from KL divergence (or CMI) and does not exceed this scope. Do these conclusions reveal any more fundamental facts? 3.Proof Details: (i) Regarding Equation 30, is there an inversion of the positive and negative signs? (ii) In response to the issue mentioned in Remark 2, the authors introduce a novel loss random variable. However, the bound presented in Equation 47 is also applicable to the expression preceding Equation 31. Consequently, in Theorem 3, the term $\max\left(\mathbb{1}_{\left\{y_{i}^{-}=y\right\}},\mathbb{1}_{\left\{y_{i}^{+}=y\right\}}\right)$ can be omitted, similar to the approach taken in Theorem 4. Do these facts have any impact on your findings?

2024-11-21

Weakness: Thank you for catching this. In the revised manuscript, we have made the notation consistent. We also fixed the sizes of all brackets to improve readability and make it aesthetically pleasing.

Q1 Motivation: While standard generalization error measures the overall generalization performance of a model, it can hide significant disparities in generalization behavior across different classes, as shown in Figures 1 and 2. The main motivation of our work is to address this phenomenon by using information-theoretic tools to study class-generalization. Beyond providing the first theoretical step toward understanding this puzzling phenomenon, we highlight several key motivations:

Practical Importance: In certain applications, standard generalization may be less relevant when an important class exists (e.g., rare diseases). In such cases, understanding generalization for this specific class is more critical than focusing on overall performance. Our results provide valuable insights into this scenario.
Technical Contribution: Our bounding technique, which incorporates conditioning, produces tighter standard generalization bounds in the MI setting, as demonstrated in Corollary 1.
Broader Applicability: We also show that our results can be easily converted to study: i) fairness generalization ii) subtask generalization iii) generalization in terms of recall or specificity which can be crucial in various applications.

Q2 Essentiality: The core conclusion of our results (Theorem 1-4) demonstrates that models generalize differently across classes due to varying degrees of memorization. Specifically, the CMI between class data and model parameters acts as a proxy for class-generalization error: high CMI indicates memorization and poor generalization, while low CMI suggests better generalization.
This finding provides an explanation for the observed generalization disparity, showing that models can memorize certain classes more than others. Our theory further suggests that CMI and KL divergence terms can serve as proxies for class-specific generalization, aiding in predicting which classes will generalize better. As illustrated in the scatter plots in Figure 2, the CMI terms show a strong correlation with actual class-generalization error.
Furthermore, in general generalization bounds are crucial because they provide a theoretical framework to understand generalization. They offer both theoretical guarantees and practical insights essential for developing new algorithms with stronger performance (e.g., our noise addition approach in Appendix D.7).

Q3 3.Proof Details:

i) Thanks for pointing this out. There is no inversion of the positive and negative signs in Equation 30. This is because we have $\bar{U_i} = - U_i$ at the start of Equation 30, which accounts for the apparent sign inversion.
ii) The main aim in Theorem 3 (and Theorem 2) is to derive bounds with CMI terms depending explicitly on the model’s outputs $f_W(X_i^{\pm})$ (and model weights W), in direct analogy to the bounds in Harutyunyan et al. (2021) (and Steinke & Zakynthinou (2020)). To derive tight bounds with explicit depends on $f_W(X_i^{\pm})$ or ( $W$ ), it is not possible to avoid the max term $\max (\mathbb{1}\_{y_i^-=y}, \mathbb{1}\_{y_i^+=y})$ . In Theorem 5, we wanted to derive a tight bound without this max term and this required introducing a novel random variable $\delta_y L$ which encapsulates $\mathbb{1}\_{y_i^-=y}$ and $\mathbb{1}\_{y_i^+=y}$ .

Applying Equation 47 to Theorem 3 in the expression preceding Equation 31, as suggested by the reviewer, would lead to the following bound:

| \overline{\mathrm{gen}_y}| \leq \mathbb{E}_{\mathbf{Z}_{[2n]}} \Big[ \frac{1}{n^y_{\mathbf{Z}_{[2n]}}} \sum_{i=1}^n \sqrt{2 I_{\mathbf{Z}_{[2n]}} (f_{W}(X^\pm_i); U_i )} \Big]

Note that this bound cannot capture any class-specific dependency and therefore does not provide insights into class-specific generalization behavior. More significantly, the proposed bound is much looser compared to the bound derived in Theorem 3. Our bound, which incorporates the $\max (\mathbb{1}\_{y_i^-=y}, \mathbb{1}\_{y_i^+=y} )$ , is always tighter as $\max (\mathbb{1}\_{y_i^-=y}, \mathbb{1}\_{y_i^+=y} ) \leq 1$ . Therefore, the inclusion of the max ⁡operator in Theorems 2 and 3 is essential for achieving tight, output-dependent (and weights-dependent) bounds that uncover meaningful class-specific dependencies and insights.

评论- Rebuttal deadline reminder

2024-11-25

2024-11-26

Considering the author's comprehensive response to the Questions, I believe that the primary conclusion—that the CMI between class data and model parameters serves as a proxy for class-generalization error—is an interesting, albeit not fundamental, finding. I have decided to raise the score from 5 to 6.

审稿意见

评分: 6置信度: 32024-11-04

This paper proposes characterizing classification generalization errors of each class separately. The motivation is that neural networks do not generalize equally for all classes. The paper demonstrates this phenomenon empirically on CIFAR10 and CIFAR100, and also notes that class-wise generalization depends on factors beyond the class itself. Main contributions are several variants of per-class information-theoretic generalization bounds. The paper notes that the mutual information between the model and the class data can be used to characterize (or upper bound) the class generalization error.

优点

This paper is theoretical in nature. The idea of pursuing generalization errors for each class is claimed to be new. If so (to be verified with other reviewers), then this research direction is original, The proposed information theoretical bounds appear to generalize the bounds studied in Bu et al., 2020, Xu & Raginsky, 2017, Zhou et al., 2020 and other works, which mostly study the standard generalization bounds (i.e., averaged over all classes, and are not class-wise).

Empirically it is verified that the results in Theorem 3 and Theorem 4 provide valid class-wise generalization bounds in both standard CIFAR 10 and a noisy variant of CIFAR 10. This demonstrates that the contributes bounds are widely applicable to several settings.

缺点

(Recommended actions are clearly listed in the Questions section.)

Presentation:

Writing can be much improved. In particular, many theoretical results are not accompanied with a justification, or a practical implication. Writing appears to be tailored to specialists of this subject area. Sections and writing flow in the paper can also be improved. For instance, Theorem 1 is introduced only to be superseded by Sec 2.2 which, at the beginning of the section, states a limitation of Theorem 1. Def 2 (a definition of class-wise generalization errors) is introduced only to be superseded by Def 3 (see reason at L210). It is hard for me to form a coherent story. Same for Theorem 2 (class-CMI) which is introduced only to be commented on before introducing Theorem 3 that the bound in Theorem 2 is hard to evaluate (see L275).

Novelty:

The contributed results, while generalizing bounds in existing results, are also heavily based on them. This raises a concern about the novelty. For instance, Def 1 looks like an adaptation of Eq 3 in Bu et al., 2020.

问题

Major questions/comments:

Definition 1 looks like an adaptation of Eq 3 in Bu et al., 2020. In particular, the difference of two terms given by the population loss and the empirical loss remains. The difference is that Def 1 in the present work considers conditioning on a specific class y. Beyond this generalization to per-class analysis, could the authors please highlight contributions compared to Bu et al., 2020 (and also Xu & Raginsky, 2017, Zhou et al., 2020)? I am asking as a non-specialist reader.
Theorem 1 in the present work is similar to Theorem 1 of Xu & Raginsky, 2017, except that Xu & Raginsky, 2017 considers no conditioning on a specific class y. Could the authors comment on the difference?
The idea of “supersample” used in Def 3 seems to be inspired by Steinke & Zakynthinou 2020 (see Sec 1.2 therein). Could the authors comment on the new parts in the present work wrt to the original supersample idea?
The notation $P(\mathbf{W}|\mathbf{S})$ is used in the paper to denote the distribution of learned weights $\mathbf{W}$ given a training sample $S$ . In Def 1, the defined class-wise generalization error involves $P_{\mathbf{W}, \mathbf{S}_y}$ . This presumably means $P(\mathbf{W} | \mathbf{S}_y) P(\mathbf{S}_y)$ . What does $P(\mathbf{W} | \mathbf{S}_y)$ mean? How can a classifier learn from only data from one class $y$ ?
L210. How does Def 2 depend on $P(y)$ ? Likewise how does Def 3 not depend on $P(y)$ ?
L272: “if model parameters W memorize the random selection U, the CMI and the class-generalization error will be large”. I thought that U is a vector of Rademacher random variables introduced solely for analysis. Do you actually empirically sample U and use it to train W?

Minor:

L1454 in the appendix. Log is missing when expanding the KL divergence.
Minor suggestion. I think CIFAR10 can be considered a solved problem (much like MNIST). I am aware that the paper has experimental results on CIFAR100 in the appendix. Test accuracy on CIFAR 100 of state-of-the-art models can be almost 100% these days. On the empirical side, it would be good to consider one more experimental setting where the model lacks capacity to tackle the problem. CIFAR 10 is too easy for a model class like ResNet 50. For instance, considering a ResNet 8 on ImageNet would be a good setting where the model struggles to learn well due to the lack of capacity. It would be good to check whether the proposed bounds hold in this setting as well. (Note that this is not a request for more experiments for the purposes of evaluating the paper. It is a suggestion to help strengthen this work. The authors need not try to conduct this experiment during the rebuttal. It’s not a major request.)

评论- Official Comment by Authors (1/2)

2024-11-21

Weakness : Thank you for pointing out the challenges in following the logical flow of our paper. We acknowledge that the current structure might appear tailored to specialists. The presentation order reflects the progression of how these techniques were developed. To address this issue, we have created a new figure (Figure 12 in the revised manuscript) that provides a visual summary of the logical progression and relationships between the main definitions and theorems in Section 2, as well as their connections to the corollaries in Section 4.1.

Q1:

Similar to (Bu et al., 2020.), we aim to study a generalization quantity, in particular, class-wise generalization. In most theoretic frameworks, generalization errors are defined as the difference between a ‘population’ risk and an ‘empirical’ risk, explaining the similarities between the two settings. However, unlike Bu et al. (2020), Xu and Raginsky (2017), or Zhou et al. (2020), which focus on standard generalization error, our work introduces a distinct concept: class-generalization error. This concept addresses disparities in generalization performance across different classes, an aspect not explored in prior works.
Furthermore, the main contribution of (Bu et al., 2020) is showing that sample-wise formulations of generalization errors lead to tighter bounds compared to the dataset-based formulation. Building on this insight, we provide a sample-wise Definition in Lemma 1 for class-wise generalization. Note that this change in definitions within the class-wise setting (of "population" and "empirical" risks) introduces a different quantity, necessitating the development of new proof techniques.
From a technical perspective, while the proof of Theorem 1 (the MI setting) can be seen as an extension of prior works (Bu et al., 2020) by applying the Donsker–Varadhan (DV) variational representation over a conditional distribution, however, in the super-sample setting, the existence of the indicator functions introduces a noticeable technical challenge and requires a new bounding technique, as demonstrated in Lemma 2 and Theorem 4.

Q2: Xu & Raginsky (2017) study the standard generalization error, whereas our work focuses on class-wise generalization. While Theorem 1 in our paper may appear to resemble Theorem 1 in Xu & Raginsky (2017), the quantities being bounded are different and not directly comparable. Furthermore, Corollary 1 in our paper directly compares to Theorem 1 of Xu & Raginsky, 2017, as both bound the standard generalization error. Notably, Theorem 1 of Bu et al. (2020) provides a tighter bound with individual sample mutual information compared to Xu and Raginsky's. As stated under Corollary 1, our bound is even tighter than Bu et al. (2020), and thus also tighter than Xu and Raginsky (2017). This highlights a key contribution of our work: class-wise generalization analysis enables tighter MI-based bounds compared to previous methods.

Q3: There seems to be confusion about the contribution of our work here. First, the supersample technique, introduced by Steinke and Zakynthinou (2020), provides a general framework for studying generalization error and has been shown in recent works (Zhou et al., 2022; Wang & Mao, 2023) to yield tighter, practically computable bounds on generalization errors. To derive practical bounds for class-generalization error, in section 2.2, we extend our class-wise analysis to the supersample setting (CMI framework).
To this end, we introduce Definitions 2 and 3 and study them within this framework, which as explained in L222-L242 is different from generalization errors typically used in prior works and hence requires novel bounding techniques: To derive tight bounds, we first extend prior bounding techniques (Wang & Mao, 2023; Harutyunyan et al., 2021) in Lemma 2. The novel proof requires a combination of DV representation along with Hoeffding’s Lemma to obtain tight bounds using the indicator functions. More importantly, we derive a tighter bound in Theorem 4, eliminating the need for a max operation between two indicator functions. We propose to use the CMI between the label-dependent projection of the loss pair $∆y L$ (which we propose) and the random selection process to upper bound the class-generalization error.

评论- Official Comment by Authors (2/2)

2024-11-21

Q4: There seems to be some confusion here. In our framework, the weights W are always learned using the entire training set S, not just the data from a single class y. As mentioned in Definition 1, $P_{W,Sy}$ is induced by the learning algorithm P_{W|S} and data generating distribution of entire dataset P_S by marginalizing $P_{W,S}$ over non $y$ samples. Similarly, P(W|S_y) is obtained in the same way and represents the distribution of the learned weights conditioned on the data from class y, which allows us to measure the class-specific generalization error. In practice, this does not imply that the model is trained solely on class y data, but rather it’s a conditional distribution used to isolate the generalization behavior for that class in the bound.

Q5: Kindly refer to L211-L215 and L222-L232 for detailed explanations of Definitions 2 and 3. In summary, to define a proper sample-based generalization error, Definition 2 normalizes using the expected number of samples n^y=nP (y), making it dependent on P(y), the true distribution of y. In contrast, Definition 3 relies on the empirical count of samples for class y within the supersample $n^y_{z_{[2n]}}$ , which removes the explicit dependence on P(y). This approach avoids the need to estimate P(y), making it more practical for empirical analysis when working with real-world data where the true class distribution is often unknown.

Q6: U is a vector of Rademacher random variables that is not directly used in training. Instead, U determines which samples are used for training W and which for testing. If U can be inferred from the weights W (indicating high CMI), it suggests that the model is memorizing the selection of the training data, resulting in a large class-generalization error. Our experiments follow the same setup as in the CMI setting (Steinke & Zakynthinou 2020), and sample U to select training and testing samples. To clarify any misunderstandings regarding the information-theoretic bounding techniques, we kindly refer the reviewer to Steinke and Zakynthinou (2020) for a detailed explanation of how results are interpreted in the CMI framework.

Minor 1: Thank you for catching that. We have fixed this in the revised manuscript.

Minor 2: Thank you for your suggestion. The primary goal of this paper is to investigate the generalization puzzle related to the noticeable disparity in generalization behavior across different classes, even when the overall standard generalization error is negligible. To highlight this, we focus on "solved" problems like CIFAR-10 and CIFAR-100 with pretrained ResNet-50 models, where the standard generalization error is near zero. However, as demonstrated in Figures 1 and 2, significant class-generalization disparity remains, with some classes showing up to 20% error rates despite the overall low generalization error. This highlights that even in cases where the standard error is minimal, the disparity in generalization behavior across classes can still be substantial. It underscores the importance of addressing this class generalization puzzle in deep learning, providing strong motivation for our work to shed light on this often-overlooked issue in standard generalization studies. We will incorporate the reviewer's suggested experiments in the final version to further strengthen the paper.

评论- Rebuttal deadline reminder

2024-11-25

2024-11-26

I thank the authors for the response. I raised the score from 5 to 6.

AC 元评审

2024-12-19

This submission is a borderline case. It is a pure learning theory paper. It proposed to analyze the class-wise generalization error from an information-theoretic point of view. After the rebuttal, all four reviewers gave a rating of 6 that is "marginally above the acceptance threshold". During the internal discussions, none of four positive reviewers showed up, even after I cued them, suggesting that they were not truly supportive. In particular, I think the review comments and the review ratings do not match, for example,

The technical contribution of the paper is limited ... the results are heavily based on the previous works ... follow the same flow of proof, without any particular novelty ... The insight of some bounds being tighter than others again is well-studied in previous works.

Based on my own experience as an author and AC in these years, this comment looks more like in a review with a rating of 3 rather than 6. Note that two reviewers questioned the novelty and two reviewers questioned the clarity or presentation. I quickly checked it and found it not easy to follow or to appreciate the merits (except the motivation which was quite nice and easy to follow). Given the current version of this work, even if we accept it, it will not make reasonable impact that it should have due to its poor presentation. Therefore, I don't think we should accept it for publication before the authors significantly improve the current version.

审稿人讨论附加意见

During the internal discussions, none of four positive reviewers (whose ratings were all 6) showed up, even after I cued them, suggesting that they were not truly supportive.

最终决定Reject

2025-01-22

Reject