PaperHub
Overall rating: 5.5 / 10 (Poster · 5 reviewers · min 2, max 4, std dev 0.6)
Individual ratings: 3, 4, 3, 2, 3
ICML 2025

Tight and Fast Bounds for Multi-Label Learning

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-08-16

Abstract

Keywords
Multi-Label Learning · Generalization Bound

Reviews and Discussion

Official Review
Rating: 3

This paper provides general theoretical guarantees for the generalization of multi-label learning. By developing and leveraging novel vector-contraction inequalities for smooth base losses, the authors derive tight generalization bounds for multi-label learning that have no dependency on the number of labels. In addition, the authors develop novel local vector-contraction inequalities for smooth base losses, which yield bounds with faster convergence rates. Tight generalization bounds with no dependency on the number of labels are also derived for Macro-Averaged AUC by considering both the Lipschitz continuity and the smoothness of base loss functions.

Questions for the Authors

  • Are there any limitations when the results are applied to real-world multi-label learning scenarios?

  • Since many related papers are mentioned and compared with the derived results, a clearer and more explicit comparison, such as a summary table of results and limitations, would make the contribution clearer.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Incorrect statements are not observed in the key claims (e.g., tighter bounds for smooth base losses, faster bounds, tighter bounds for macro-averaged AUC).

Experimental Design and Analysis

N/A

Supplementary Material

Roughly went through

Relation to Existing Literature

This paper introduces vector-contraction inequalities for smooth base loss functions, rather than Lipschitz ones. It shows tight generalization bounds for multi-label learning that have no dependency on the number of labels.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

  • This paper extends previous work and introduces novel vector-contraction inequalities for smooth base losses.
  • This paper achieves state-of-the-art generalization bounds and also reveals the relationship between Macro-Averaged AUC and class imbalance.
  • Compared with prior methods, this paper removes the dependency on the label count and achieves faster bounds.

Weaknesses

  • This paper is highly technical and not reader-friendly; it would be great to show some intuitive explanations or some demos to help the reader understand the key ideas.

Other Comments or Suggestions

N/A

Author Response

Thank you for your constructive comments and active interest in helping us improve the quality of the paper.

The following are our responses to the Questions:

1. Response to Weakness.

We will add concrete examples of multi-label methods after the definition of the function class to improve readability and practical interpretation. For example, the DNN-based multi-label method named CLIF [1], which proposes to learn label semantics and representations with specific discriminative properties for each class label in a collaborative way, can be expressed in our function class as:

$$\phi_j(\boldsymbol{x}) = \sigma_{\mathrm{ReLU}}\left\{ W_5 \cdot \left[ \sigma_{\mathrm{ReLU}}(W_4 \boldsymbol{x}) \odot \sigma_{\mathrm{sig}}(W_3 \psi(Y)_j) \right] \right\}.$$

The label embeddings $\psi(Y)$ can be written as $\sigma_{\mathrm{ReLU}}(\tilde{A}\, \sigma_{\mathrm{ReLU}}(\tilde{A} Y W_1) W_2)$, where $\tilde{A}$ denotes the normalized adjacency matrix with self-connections, $Y$ is the node feature matrix of the label graph, $\sigma_{\mathrm{ReLU}}$ is the ReLU activation, $\sigma_{\mathrm{sig}}$ is the sigmoid activation, $\odot$ is the Hadamard product, and $W_i$, $i \in [5]$, are the parameter matrices.
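For illustration, a minimal sketch of how such a CLIF-style classifier instantiates $\phi_j$ is given below; all dimensions, the identity placeholders for $\tilde{A}$ and $Y$, and the random parameters are hypothetical, and this is not the authors' implementation:

```python
import torch

# Hypothetical sizes: input dim d, number of labels c, hidden dim h.
d, c, h = 300, 20, 128
W1, W2, W3 = torch.randn(c, h), torch.randn(h, h), torch.randn(h, h)
W4, W5 = torch.randn(h, d), torch.randn(h)
A_tilde = torch.eye(c)   # normalized adjacency with self-connections (placeholder)
Y = torch.eye(c)         # node feature matrix of the label graph (placeholder)

def label_embeddings():
    # psi(Y) = ReLU(A~ ReLU(A~ Y W1) W2): label embeddings from the label graph
    return torch.relu(A_tilde @ torch.relu(A_tilde @ Y @ W1) @ W2)        # shape (c, h)

def phi(x):
    psi = label_embeddings()
    feat = torch.relu(W4 @ x)                 # shared representation ReLU(W4 x), shape (h,)
    gates = torch.sigmoid(psi @ W3.T)         # per-label gates sigmoid(W3 psi(Y)_j), shape (c, h)
    # phi_j(x) = ReLU( W5 . [ ReLU(W4 x) ⊙ sigmoid(W3 psi(Y)_j) ] )
    return torch.relu((gates * feat) @ W5)    # one score per label, shape (c,)

scores = phi(torch.randn(d))                  # c label-wise scores
```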

In addition, a class of multi-label methods based on the strategy of label-specific representation, which facilitates the discrimination of each class label by tailoring its own representation, can be formalized in our function class. For example, the wrapped label-specific representation method [2], which presents a kernelized Lasso-based framework with a constraint of pairwise label correlations for each class label, can be expressed in our function class, where $f_j$ is the kernelized linear model and the constraint $\alpha(\boldsymbol{w})$ is $\|\boldsymbol{w}_j\|_1 \leq \Lambda$ for any $j \in [c]$; each label also has the property of sharing, which is reflected by the additionally introduced constraint

$$\sum_{i=1}^{c} (1 - s_{ji})\, \boldsymbol{w}_j^\top \boldsymbol{w}_i \leq \tau,$$

where $s_{ji}$ is the cosine similarity between labels $y_j$ and $y_i$.

Besides, the function class here is applicable to the typical Binary Relevance methods for multi-label learning, where different methods correspond to different nonlinear mappings $\phi_j$.
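A minimal Binary Relevance sketch (using scikit-learn utilities and a synthetic dataset purely for illustration; not tied to any specific method in the paper) is:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Binary Relevance: one independent classifier per label, i.e., one phi_j for each j in [c].
X, Y = make_multilabel_classification(n_samples=200, n_features=30, n_classes=5, random_state=0)
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(br.predict(X[:3]))   # a binary prediction for each of the c labels
```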

[1] Collaborative Learning of Label Semantics and Deep Label-Specific Features for Multi-Label Classification, TPAMI 2022.

[2] Multi-Label Classification with Label-Specific Feature Generation: A Wrapped Approach, TPAMI 2022.

In addition, we will also provide additional explanations on the key ideas in the theoretical proof to help readers better understand the ideas. Please refer to our Response 3 to Reviewer XSTz.

2. Response to Q1.

Our theoretical results require that the base loss is bounded, and therefore cannot cover the case where the base loss is the cross-entropy loss. In the future, we will further study bounds with faster convergence rates for unbounded Lipschitz base losses. In addition, for methods involving specific label correlations, when using our theoretical results to analyze them, it is necessary to introduce additional assumptions induced by these specific label correlations into the generalization analysis, so as to better reveal the impact of these setting-related factors on the bound. However, how to explicitly introduce label correlations into generalization analysis is still a crucial open problem, and we will further explore related work in the future. We will incorporate these discussions into the paper in an appropriate manner.

3. Response to Q2.

As the reviewer suggested, a better presentation of related work and a comparison among them through a summary table can improve the readability of the paper and make the contribution of our results clearer. We will add a summary table to the revised version, organized by loss function, additional assumptions, order of the bounds, and reference.

Reviewer Comment

Thanks for the clarification! Most of my concerns are well addressed.

Author Comment

Thank you again for your support.

Official Review
Rating: 4

The paper focuses on the theoretical analysis of multi-label learning, particularly in the context of smooth base loss functions. The authors introduce novel vector-contraction inequalities and derive tighter generalization bounds for multi-label learning with smooth base loss functions. These bounds exhibit improved dependency on the number of labels, reducing it to logarithmic terms, and demonstrate faster convergence rates with respect to the number of examples. The paper also discusses the application of these bounds to various multi-label learning methods, highlighting how these results provide general theoretical guarantees for the generalization performance of multi-label learning, especially for methods with smooth base loss functions.

Questions for the Authors

  1. Although the universality of the generalisation bounds is proven theoretically, might there be limitations for some special types of multi-label problems, such as high-dimensional sparse data or highly imbalanced label scenarios?
  2. The article provides a comparison with existing methods and demonstrates clear improvements. However, inter-label correlation may affect the theory in specific cases. For example, is the traditional Lipschitz approach likely to be more advantageous in the case of strong label correlation? How could more experiments be designed to verify the validity of these theories in practical applications?

Claims and Evidence

Yes. Specifically, the authors derive theoretical support for tight generalisation bounds and faster convergence rates for smooth base loss functions in multi-label learning by means of mathematical proofs and local Rademacher complexity analyses. The authors also give a proof of broad applicability.

Methods and Evaluation Criteria

Yes. The theoretical analysis presented in the article, especially the tight generalisation bounds based on common smooth base loss functions, directly addresses the generalisation problem in multi-label learning. In addition, the generalisation bounds proposed in the article not only consider the effect of the number of labels on the model, but also improve on the traditional Rademacher complexity analysis by localising the Rademacher complexity, providing faster convergence for large-scale datasets.

Theoretical Claims

Yes. 1. A new vector-contraction inequality: this proof builds on existing Rademacher complexity theory and an extension of local Rademacher complexity, and it rationally simplifies the otherwise complex multi-label learning problem through the smooth base loss function. 2. Application of local Rademacher complexity: this proof describes the sample complexity in multi-label learning more finely by introducing local complexity.

Experimental Design and Analysis

No, the paper is purely a theoretical derivation without any numerical experimental analysis.

Supplementary Material

Yes. The main elements of the supplementary material include:

  1. Detailed proofs, including: proofs of the vector-contraction inequalities, applications of local Rademacher complexity, and proofs of the macro-averaged AUC generalisation bounds.
  2. Related lemmas and auxiliary theorems, e.g., the Khintchine-Kahane inequality and Dudley's integral inequality.

Relation to Existing Literature

Earlier generalisation analyses of multi-label learning, e.g., Wu and Zhu (2020), derived generalisation bounds based on Lipschitz-continuous loss functions, but the dependence of those bounds on the number of labels is linear. This paper successfully reduces the effect of the number of labels on the generalisation bounds; in particular, the bounds no longer depend on the number of labels except through a logarithmic term, demonstrating a significant theoretical advance. Bartlett et al. (2005) proposed local Rademacher complexity and showed that it can provide faster convergence than traditional Rademacher complexity. This paper improves the convergence speed and the generalisation bounds in multi-label learning by introducing local Rademacher complexity, providing faster convergence than the traditional method, and the effectiveness of this approach is verified theoretically. Wu et al. (2023) proposed a generalisation analysis for macro-averaged AUC and explored the effect of label imbalance on the results, without delving into how to reduce the dependence of the generalisation bounds on the number of labels. By combining Lipschitz continuity and smoothness of base loss functions, this paper gives tight generalisation bounds for macro-averaged AUC with no dependence on the number of labels, which provides a new theoretical guarantee for dealing with the label imbalance problem.

Missing Essential References

No. I keep up with the literature in this area.

Other Strengths and Weaknesses

  1. A new vector-contraction inequality is introduced in the article to derive tight generalisation bounds for multi-label learning. The proof is very rigorous, but in practice, if the label distribution of a dataset does not exactly match the assumptions, this may affect the applicability of the method.
  2. Although the introduction of local Rademacher complexity provides an improvement in convergence speed, estimating the local complexity may incur high computational costs, especially when the number of labels and the sample size are very large. The correctness and practical feasibility of the proof rely on being able to efficiently estimate the local complexity, which may be challenging on large-scale datasets.

Other Comments or Suggestions

  1. Is the subscript on the left side of the inequality in line 313 incorrect?
  2. The inequalities in lines 1026 to 1030 are the same. Please check again.
Author Response

Thank you for your constructive comments and active interest in helping us improve the quality of the paper.

The following are our responses to the Questions:

1. Response to Weakness 1.

As the reviewer pointed out, when there is some type of label correlation between the labels of the dataset, the label distribution may satisfy some potential constraints, which may correspond to some additional assumptions such as sparsity assumptions or norm regularization constraints. Therefore, when dealing with these specific problems, we need to introduce some additional assumptions to adjust our analysis and explicitly introduce these potential label correlations into the generalization analysis. This is still an open problem and we will further explore related work in the future.

2. Response to Weakness 2.

When dealing with large-scale datasets in practice, one often considers introducing specific strategies in the label space to handle an extremely large number of labels. In generalization analysis, introducing effective and general strategies of this kind is not only an important open problem in theory but also extremely challenging in practice. Therefore, it is indeed necessary to further introduce effective assumptions in order to explicitly analyze the role of the key factors in generalization that can effectively deal with challenging problems involving a large number of labels and large-scale datasets.

3. Response to Suggestions.

In line 313, the subscript $S$ in the inequality should be changed to $R$.

In lines 1026 to 1030, this is a repeated typo and we will remove the duplicate.

4. Response to Q1.

As we replied above, an effective analysis for more specific problems requires the further explicit introduction of valid assumptions to reveal the impact of these setting-dependent factors on the bound. For example, for high-dimensional sparse data, one may need to introduce sparsity assumptions into the analysis, thereby inducing bounds that depend weakly on the sparsity rate and the number of key labels.

5. Response to Q2.

Different types of label correlations have an important impact on generalization analysis, and how to explicitly introduce them into the analysis is a crucial open problem. Traditional Lipschitz methods do have an advantage in that their conditions are easier to satisfy. For example, for deep models, the smoothness of base losses also involves the boundedness of the second-order derivative of the function, which is often difficult to guarantee, while Lipschitz continuity is often established. Experimental verification can be considered from two aspects. On the one hand, verify whether the functions selected by the algorithm have a small error and whether the generalization performance of small-error functions is better. On the other hand, verify whether the smoothness of the model function can be guaranteed by some regularization, explore which regularization-induced inductive biases are more effective for generalization in practice, and thereby promote further theoretical research.

We will incorporate these discussions into the paper in an appropriate manner.

Official Review
Rating: 3

This paper investigates generalization bounds for multi-label loss functions. Specifically, for smooth base loss functions, the authors improve the generalization bounds by removing the dependency on the number of labels $c$. By exploiting local Rademacher complexity, the authors further improve the bound from $\tilde{O}(1/\sqrt{n})$ to $\tilde{O}(1/n)$. In addition, they also derive tight bounds for Macro-Averaged AUC.

Questions for the Authors

  1. Definition 3.3. The fat-shattering dimension seems to depend on the witnesses $s_1, \ldots, s_p$. What is the meaning of these witnesses?
  2. Theorem 5.5. The population risk $R(f)$ is bounded by $2\hat{R}_D(f)$ instead of $\hat{R}_D(f)$. Does this mean that the bound is somewhat loose? As $n \to \infty$, the bound becomes $R(f) \leq 2R(f)$, which is not tight.

Claims and Evidence

The theoretical claims are supported by proofs and comparisons with previous works.

方法与评估标准

NA.

Theoretical Claims

I only read the proof sketches, which seem to make sense.

Experimental Design and Analysis

NA.

Supplementary Material

No, I did not read the supplementary material.

Relation to Existing Literature

NA.

Missing Essential References

NA.

Other Strengths and Weaknesses

Strengths.

  1. This paper is overall well written, with sufficient discussion of the differences from previous works.
  2. The derived bounds are tighter, removing the reliance on the number of classes, while introducing no additional strong assumptions.

Weakness.

  1. The theoretical results do not cover unbounded base loss functions, e.g., the cross-entropy loss.
  2. Lack of experimental validation of the theoretical results, e.g., does the generalization gap really not depend on $c$? (Perhaps it is not possible to calculate the population risk?)

Other Comments or Suggestions

NA.

Author Response

Thank you for your constructive comments and active interest in helping us improve the quality of the paper.

The following are our responses to the Questions:

1. Response to Weakness 1.

As the reviewer commented, theoretical results for unbounded base losses need to be explored further in the future by developing new theoretical techniques to cover this situation. Under such a new analysis, although the smoothness of the cross-entropy loss may be difficult to guarantee for models with large capacity, its Lipschitz continuity is often established, i.e., tight bounds with no dependency on $c$ can be obtained for the cross-entropy loss. However, improving the convergence rate of its bound with respect to $n$ is still an urgent open problem. In the future, we will further study bounds with faster convergence rates for unbounded Lipschitz base losses.

2. Response to Weakness 2.

Since the $\widetilde{O}(1/\sqrt{n})$ and $\widetilde{O}(1/n)$ bounds in our theoretical results are independent of the number of labels up to logarithmic terms, the dependency of the bounds on $c$ is logarithmic. We use $\widetilde{O}$ to omit the logarithmic terms since the logarithmic dependency is very weak. There is no contradiction between the theoretical results and empirical intuition: an increase in $c$ affects the difficulty of learning, but the empirical success of multi-label methods suggests that increasing $c$ has limited impact on this difficulty, which means the ideal bound should not depend strongly on $c$.
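As a simple numerical illustration (a toy computation, not part of the paper) of how weak a logarithmic dependency is compared with square-root or linear dependencies on $c$:

```python
import numpy as np

# Growth of log(c), sqrt(c), and c as the number of labels increases.
for c in (10, 100, 1_000, 10_000):
    print(f"c={c:>6}  log(c)={np.log(c):6.2f}  sqrt(c)={np.sqrt(c):8.1f}  linear={c:>6}")
```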

3. Response to Question 1.

If there exist real numbers $s_1, \ldots, s_p$ such that for each $\delta_1, \ldots, \delta_p \in \{-1, +1\}$ there exists $f \in \mathcal{F}$ with

$$\delta_i\left(f(\boldsymbol{x}_i) - s_i\right) \geq \epsilon, \quad \forall i = 1, \ldots, p,$$

then we say that $s_1, \ldots, s_p$ witness the shattering. Intuitively, the witnesses are reference levels: around each $s_i$, the class contains functions whose values at $\boldsymbol{x}_i$ can be pushed at least $\epsilon$ above or below $s_i$, for every possible sign pattern.
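To illustrate, the following toy brute-force check (a hypothetical finite class of linear functions on the real line, with made-up points, witnesses, and margin) verifies that a pair of witnesses $\epsilon$-shatters two points:

```python
import itertools
import numpy as np

# Finite toy class: f(x) = a*x + b over a grid of (a, b); all values are hypothetical.
funcs = [lambda x, a=a, b=b: a * x + b
         for a in np.linspace(-2, 2, 41) for b in np.linspace(-1, 1, 41)]
xs = [0.2, 0.8]          # points x_1, x_2
ss = [0.3, 0.7]          # candidate witnesses s_1, s_2
eps = 0.05

def witnesses_shatter(xs, ss, eps):
    # Every sign pattern (delta_1, ..., delta_p) must be realized with margin eps.
    for deltas in itertools.product([-1, 1], repeat=len(xs)):
        if not any(all(d * (f(x) - s) >= eps for x, s, d in zip(xs, ss, deltas))
                   for f in funcs):
            return False
    return True

print(witnesses_shatter(xs, ss, eps))   # True: s_1, s_2 witness the eps-shattering
```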

4. Response to Question 2.

The multiplicative factor of 2 comes from the use of Lemma A.8, which can also be understood through its proof process. The result can often be shown through some derivation as follows:

$$R(f) \leq \widehat{R}_D(f) + \widetilde{O}\left(1/n + \sqrt{R^*/n}\right),$$

where $R^* = \inf_{f \in \mathcal{F}} R(f)$. This means that in the separable case ($R^* = 0$), the bound can be improved to a $\widetilde{O}(1/n)$ rate. Such multiplicative factors often appear in bounds based on local Rademacher complexity, as also shown in the literature [1-3], etc.

[1] Local Rademacher Complexity-based Learning Guarantees for Multi-Task Learning, JMLR 2018.

[2] Towards Sharper Generalization Bounds for Structured Prediction, NeurIPS 2021.

[3] Generalization Analysis for Ranking Using Integral Operator, AAAI 2017.
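To sketch where such a factor typically arises (this is only an illustration of the standard mechanism, not the exact constants in Lemma A.8): local-complexity arguments usually give an intermediate bound of the form

$$R(f) \leq \widehat{R}_D(f) + \sqrt{R(f)\, r_n} + r_n,$$

where $r_n$ denotes the fixed point of the sub-root function. Applying $\sqrt{ab} \leq a/2 + b/2$ with $a = R(f)$ and $b = r_n$ and rearranging gives $R(f) \leq 2\, \widehat{R}_D(f) + 3\, r_n$, which is where the multiplicative constant in front of the empirical risk comes from.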

Reviewer Comment

Thanks for the reply. I have no further questions and decide to keep my rating.

Author Comment

Thank you again for your support.

Official Review
Rating: 2

By incorporating a smoothness assumption, the authors provide generalization guarantees achieving a tighter bound, independent of $c$ (the number of labels) up to log factors, a faster $1/n$ bound, and a similarly tighter bound for Macro-Averaged AUC.

Questions for the Authors

I mentioned my question in "Claims And Evidence" above.

Claims and Evidence

Mostly seems sound, but I have a question. Zhang & Zhang (2024) state in the introduction, Remark 3.8, and Remark 3.18 that their bound is also independent of $c$. But in this paper, you only mention a $\sqrt{c}$-factor bound for Zhang & Zhang in the related work section. Am I missing something? It would be good to state the improvement.

Methods and Evaluation Criteria

No experiments.

Theoretical Claims

I did not go over the proofs.

Experimental Design and Analysis

No experiments.

Supplementary Material

Only skimmed through the proofs.

Relation to Existing Literature

I think this is well discussed in the introduction and related works.

Missing Essential References

I recommend considering discussing the paper Busa-Fekete, Róbert, et al., "Regret bounds for multilabel classification in sparse label regimes," Advances in Neural Information Processing Systems 35 (2022), since it discusses obtaining fast rates (or ultra-fast rates) in the multi-label setting, which is an important contribution of that paper.

Other Strengths and Weaknesses

I find the paper well grounded.

Other Comments or Suggestions

I think some part of Definition 3.2 is cut off around line 166.

Author Response

Thank you for your constructive comments and active interest in helping us improve the quality of the paper.

The following are our responses to the Questions:

1. Response to the Question in Claims And Evidence.

Here, the bounds with a square-root dependency on $c$ in the literature [1] mainly refer to the results for $\ell_2$ Lipschitz losses in [1], i.e., Lemma 3.6 and Theorem 3.7; our improvement is mainly relative to Theorem 3.7. In the proof of Lemma 3.6, the $n$ in equation (9) should be changed to $nc$; this is a typo. In fact, the results in Lemma 3.6 and Theorem 3.7 require the introduction of an additional $\sqrt{c}$ factor, because they ignore the $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of $\mathcal{P}(\mathcal{F})$. Therefore, a $\sqrt{c}$ factor is missing in Lemma 3.6 and Theorem 3.7, and [1] improved the dependency of the bounds on $c$ from linear to square-root in the decoupling case for $\ell_2$-norm Lipschitz losses. Although for $\ell_2$ Lipschitz losses the bounds in [1] are only improved by a factor of $\sqrt{c}$, they are still the tightest results in multi-label learning with $\ell_2$ Lipschitz losses. In addition, for the Hamming loss, its Lipschitz constant can induce tight bounds with no dependency on $c$. We found that the square-root dependency of the bound in [1] on $c$ is inevitable for $\ell_2$ Lipschitz losses, since it essentially comes from the $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of the projection function class. We also found that the smoothness of the base loss function can eliminate this $\sqrt{c}$ factor, so that tight bounds with no dependency on $c$, up to logarithmic terms, can be derived.

[1] Yi-Fan Zhang, Min-Ling Zhang. "Generalization Analysis for Multi-Label Learning", ICML 2024.

2. Response to the Essential Reference.

The literature [2] derived tight bounds with a logarithmic dependency on $c$ for the Hamming loss with KNN, under a smoothness assumption on the regression function together with multi-label margin and sparsity assumptions, and also derived tight bounds with a logarithmic dependency on $c$ for Precision@$\kappa$ under a margin condition and a smoothness assumption. The margin condition ensures that the obtained bounds have a faster convergence rate. In our work, the local loss function space is the key to obtaining bounds with a faster convergence rate. The smoothness condition with respect to the $\ell_\infty$ norm in [2] is a variant of Hölder continuity. We also find that the $\ell_\infty$ norm has a positive effect on obtaining tight bounds with a weaker dependency on $c$, i.e., tight bounds with a logarithmic dependency on $c$ can be derived for $\ell_\infty$ Lipschitz losses. However, how to improve the convergence rate of the bounds for Lipschitz losses is still an open problem, which we will further explore in future work. We will incorporate these discussions into the paper in an appropriate manner.

[2] Regret Bounds for Multilabel Classification in Sparse Label Regimes. NeurIPS 2022.

3. Response to the Suggestion.

To avoid the possibility of truncation affecting understanding, we will restate Definition 3.2 in terms of the $\ell_p$-norm covering number, instead of listing the $\ell_2$-norm and $\ell_\infty$-norm covering numbers separately in the definition.

Reviewer Comment

Thank you for the rebuttal. After reading the rebuttal, I feel that comparison to [1] needs to be discussed in detail and verified carefully to convey the main important point of the paper, which should have been stated in the paper. I lower my score.

Author Comment

Thanks for your efforts in making our work clearer for readers. We compare with [1] in more detail and convey our key points as follows, and we hope our further response will address your concerns.

Regarding the reviewer's comment that "[1] states in the introduction, Remark 3.8, and Remark 3.18 that their bound is also independent of $c$": Remark 3.8 refers to the bound for $\ell_2$ Lipschitz losses, and Remark 3.18 refers to the bound for $\ell_\infty$ Lipschitz losses. There is no problem with the bound for $\ell_\infty$ Lipschitz losses being independent of $c$; in fact, we have also pointed out in the related work that "[1] derived a $\widetilde{O}(1/\sqrt{n})$ bound for $\ell_\infty$ Lipschitz loss...". When discussing the bounds for Macro-AUC, we show in Remark 6.4 that our bound is tighter than the Macro-AUC bound in [1], since the bound in [1] uses a looser vector-contraction inequality, while we develop a tight vector-contraction inequality for the case where the base loss is smooth, which improves the bound by a factor of $\sqrt{c}$. These discussions are clearly explained in our paper.

Below we explain more clearly the part that may cause confusion, i.e., the relevant results in Remark 3.8 of [1], which is "the bound with a square-root dependency on $c$ in [1]" described in the related work for $\ell_2$ Lipschitz losses in our paper. Please note that when we mention the bound with a square-root dependency on $c$, we emphasize the qualifier "for $\ell_2$ Lipschitz loss". In Remark 3.8 of [1], it is stated that the bound for $\ell_2$ Lipschitz losses is independent of $c$. The conflict between these two descriptions is that Step 3 of the proof of Lemma 3.6 in [1] ignores the $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of $\mathcal{P}(\mathcal{F})$. Hence, the third inequality below Eqn. (10) in Step 3 of [1] should be modified as:

$$\inf_{\alpha>0} \left( 4\alpha + 48 \sqrt{c}\, \mu\, \sqrt{c}\, \widetilde{\Re}_{nc}(\mathcal{P}(\mathcal{F}))\, \log^{\frac{1}{2}}(nc) \cdot T \right), \quad \text{where } T = \int_{\alpha}^{M} \epsilon^{-1}\, d\epsilon,$$

which introduces an additional $\sqrt{c}$ factor and causes the bounds in Lemma 3.6 and Theorem 3.7 to depend on $c$ through a square root. Hence, [1] improved the dependency of the bounds on $c$ from linear to square-root in the decoupling case for $\ell_2$ Lipschitz losses.

In fact, we have previously confirmed and reached agreement with the authors of [1] on this issue. This issue does not affect the conclusions of [1] in general, since for the Hamming loss, the inverse $\sqrt{c}$ factor in its Lipschitz constant can induce tight bounds with no dependency on $c$.

Our improvement mainly stems from the observation that for $\ell_2$ Lipschitz losses, the square-root dependency of the bound in [1] on $c$ is inevitable, since it essentially comes from the $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of the projection function class $\mathcal{P}(\mathcal{F})$, i.e., $\frac{\epsilon}{\mu \sqrt{c}}$. After careful analysis, we found that the smoothness of the base loss can eliminate this $\sqrt{c}$ factor, i.e., the radius becomes $\frac{\epsilon}{\sqrt{12 \gamma M}}$. In addition, the method based on Sudakov's minoration used in [1] to upper bound the $\ell_2$-norm covering number of the projection function class is no longer applicable here. In our paper, using the smoothness of the base losses, we first derive the relationship between the empirical $\ell_2$-norm covering number of the loss space and the empirical $\ell_\infty$-norm covering number of the projection function class. Then, we show that the empirical $\ell_\infty$-norm covering number $\mathcal{N}_{\infty}(\epsilon, \mathcal{P}(\mathcal{F}), [c] \times D)$ can be bounded by the worst-case Rademacher complexity of the projection function class, using the fat-shattering dimension as a bridge.

The above key points and proof ideas induce a bound independent of $c$. Hence, relative to the bound with a square-root dependency on $c$ for $\ell_2$ Lipschitz losses in [1], we exploit the smoothness of the base loss and improve the bound by a factor of $\sqrt{c}$. In addition, the smoothness of the base loss combined with the local loss function space allows the development of novel local vector-contraction inequalities, which induce bounds that not only have a faster convergence rate but also have a weaker dependency on $c$.

We will incorporate the above discussion into the paper in a suitable way to better convey its main points, and we will not ignore the contribution of [1] to the multi-label community, especially the bounds with no dependency on $c$ for $\ell_\infty$ Lipschitz losses, which we objectively point out in the paper. We will describe the relevant results objectively and without any negative impact.

We hope that our response will help further improve your opinion of our contributions. We are eager to hear back from you if you have any feedback or further questions, and we would love to know your updated reviews.

Official Review
Rating: 3

This paper focuses on the problem of multi-label classification, where each instance can be associated with multiple labels simultaneously. The authors derive several generalization bounds for this setting, assuming smooth loss functions. Their analysis relies on standard techniques for characterizing the complexity of function classes, such as local Rademacher complexity.

The core novelty of the paper lies in the development of specific vector-contraction inequalities tailored to a particular class of multi-label classifiers. These inequalities are then used to establish generalization bounds. This approach leads to the derivation of faster convergence rates with respect to the sample size. Notably, the authors obtain rates of $\sqrt{c}$ and $c^{3/2}$ for several well-known multi-label losses, including the Hamming loss, the subset loss, and macro-averaged AUC (Area Under the Curve).

Questions for the Authors

See above.

Claims and Evidence

In summary, while the paper presents valuable generalization bounds for multi-label classification, there are several points that need clarification and improvement. Providing concrete examples of the model class, addressing the applicability to cross-entropy loss, clearly explaining the novelty of the vector contraction inequalities, and correcting the error in Theorem 5.5 would significantly strengthen the paper and enhance its accessibility and impact.

Methods and Evaluation Criteria

No empirical evidence is given.

Theoretical Claims

Specific Points and Concerns:

  1. Model Class Examples:
  • It would be beneficial if the authors could provide concrete examples of state-of-the-art (SOTA) multi-label classification methods that fall within the specific class of classifiers considered in this paper. This would help readers understand the scope and applicability of the theoretical results. Clarifying which existing models align with their defined model class is crucial for practical interpretation.
  2. Smoothness and Cross-Entropy Loss:
  • In practice, a common approach for multi-label classification involves using cross-entropy loss for each label independently as a surrogate loss function. It is essential to clarify whether the authors' definition of "smoothness" encompasses this widely used cross-entropy loss. If it does, this should be explicitly stated. If not, the limitations of the analysis concerning this practical loss function should be discussed.
  3. Novelty of Vector Contraction Inequalities:
  • The paper's primary contribution seems to be the new vector contraction inequalities. However, the explanation of their novelty is lacking. It would greatly improve the paper if the authors highlighted the key insights or "core idea" behind their improved inequalities. What specific techniques or arguments allowed them to derive better bounds compared to existing vector contraction inequalities? Clearly articulating this contribution is essential for the paper's impact.
  4. Error in Theorem 5.5:
  • Theorem 5.5 appears to contain an error. The empirical error term seems to have a multiplicative factor of 2, which is likely incorrect. This should be carefully checked and corrected. Such errors can significantly undermine the credibility of the theoretical results.

Experimental Design and Analysis

No experiments are provided.

Supplementary Material

Not checked.

Relation to Existing Literature

The paper does not include a reference that actually gives a much better dependence on the number of labels under a milder assumption on the function class than the one considered in the submission. Please see: Busa-Fekete et al.: Regret Bounds for Multilabel Classification in Sparse Label Regimes. NeurIPS 2022.

Missing Essential References

see comments above

Other Strengths and Weaknesses

The notation is quite hard to follow, and there are many inconsistencies in it. For example, $\ell_b$ has one argument in some cases and two arguments in others, and $R_{nc}$ is not defined, or I have not found its definition.

Other Comments or Suggestions

None.

Author Response

Thank you for your constructive comments and active interest in helping us improve the quality of the paper.

1. Response to C1

We add concrete examples of multi-label learning (MLL) methods after the function class definition to improve readability and practical interpretation; please refer to Response 1 to Reviewer Wn57, due to the character limit.

2. Response to C2

Our theoretical results do not cover the cross-entropy loss, mainly because it is not bounded. Next, we further discuss the smoothness of cross-entropy. When the model is a linear classifier, smoothness of the cross-entropy loss can be achieved via the boundedness of the input, but from the perspective of model capacity such a result is not general. The function class here involves general functions (i.e., nonlinear mappings $\phi_j$), so for nonlinear models, smoothness of the cross-entropy loss involves not only boundedness of the gradient of the model function but also boundedness of its second-order derivative. For deep networks, changes in the parameters may cause drastic changes in the second-order derivative, so its norm may be unbounded. Hence, smoothness of the cross-entropy loss is often difficult to guarantee. However, the Lipschitz continuity of the cross-entropy loss is often established, and the boundedness of the gradient of the model function is ensured by various strategies in practice, e.g., input normalization, weight initialization, gradient clipping, and regularization. This implies the need to develop new theories and analytical methods for unbounded base losses. Under such a new analysis, tight bounds with no dependency on $c$ for the cross-entropy loss can be obtained using its Lipschitz continuity, but improving the convergence rate of its bound with respect to $n$ is still an urgent open problem. In the future, we will further study bounds with a faster convergence rate for unbounded Lipschitz base losses.
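To make the role of the second-order derivative concrete, consider a label-wise loss $\ell_b(\phi_j(\boldsymbol{x}; \theta))$ composed with a model parameterized by $\theta$ (the parameterization $\theta$ is introduced here only for illustration). By the chain rule,

$$\nabla_\theta^2\, \ell_b\!\left(\phi_j(\boldsymbol{x};\theta)\right) = \ell_b''\!\left(\phi_j\right) \nabla_\theta \phi_j \,\nabla_\theta \phi_j^\top + \ell_b'\!\left(\phi_j\right) \nabla_\theta^2 \phi_j,$$

so even when $\ell_b''$ is bounded, smoothness in the parameters also requires control of $\nabla_\theta^2 \phi_j$, which is exactly the term that is hard to bound for deep models.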

3. Response to C3

Since the output of MLL is a vector-valued function, we need to convert the Rademacher complexity of the vector-valued class into the complexity of a tractable scalar-valued class. For $\ell_2$ Lipschitz losses, the analysis of MLL can be traced back to a basic bound with a linear dependency on $c$, which comes from the typical inequality:

$$\mathbb{E}\left[\sup_{\boldsymbol{f} \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \epsilon_{ij} f_j(\boldsymbol{x}_i)\right] \leq c \max_{j} \mathbb{E}\left[\sup_{f_j} \frac{1}{n} \sum_{i=1}^{n} \epsilon_{ij} f_j(\boldsymbol{x}_i)\right].$$

The dependency of the bounds on $c$ can be improved to square-root. Such improvements essentially come from preserving the coupling among different components reflected by a constraint $\|\boldsymbol{w}\| \leq \Lambda$. As a comparison, when $\|\boldsymbol{w}_j\|_2 \leq \Lambda$ for any $j \in [c]$, if we consider the group norm $\|\cdot\|_{2,2}$, we have $\|\boldsymbol{w}\|_{2,2} \leq \sqrt{c}\,\Lambda$, which means that these improved bounds still suffer from a linear dependency on $c$. [1] improved the dependency of the bounds on $c$ from linear to square-root in the decoupling case for $\ell_2$ Lipschitz losses. We found that the square-root dependency of the bound in [1] on $c$ is inevitable for $\ell_2$ Lipschitz losses, since it essentially comes from a $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of the projection function class. We also found that the smoothness of the base loss can eliminate this $\sqrt{c}$ factor, so that tight bounds with no dependency on $c$, up to logarithmic terms, can be derived. In addition, following the above core ideas, we combine the smoothness of the base loss with the local loss space to develop novel local vector-contraction inequalities, thereby obtaining sharper bounds with a weak dependency on $c$ and a faster convergence rate with respect to $n$. We have also explained the proof processes, ideas, and specific theoretical techniques in the proof sketches, which mainly include the conversions between complexities of different classes and the lemmas required to achieve these conversions.

[1] Generalization Analysis for Multi-Label Learning, ICML 2024.
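As an aside, the basic inequality above can be checked numerically on a toy finite class; the following sketch (hypothetical sizes, a random finite class of linear predictors, and Monte Carlo estimates) only illustrates the inequality itself, not our proof technique:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, d, K, T = 50, 10, 20, 200, 500        # examples, labels, features, class size, MC trials
X = rng.normal(size=(n, d))
W = rng.normal(size=(K, c, d))              # K candidate predictors with f_j(x) = w_j . x
F = np.einsum('kcd,nd->knc', W, X)          # F[k, i, j] = f_j^{(k)}(x_i)

lhs_vals, per_label = [], []
for _ in range(T):
    eps = rng.choice([-1.0, 1.0], size=(n, c))                 # Rademacher variables eps_ij
    lhs_vals.append(np.einsum('knc,nc->k', F, eps).max() / n)  # sup_f (1/n) sum_{i,j} eps_ij f_j(x_i)
    per_label.append(np.einsum('knc,nc->kc', F, eps).max(axis=0) / n)

lhs = np.mean(lhs_vals)                         # E[sup_f (1/n) sum_{i,j} eps_ij f_j(x_i)]
rhs = c * np.max(np.mean(per_label, axis=0))    # c * max_j E[sup_{f_j} (1/n) sum_i eps_ij f_j(x_i)]
print(f"LHS = {lhs:.3f} <= RHS = {rhs:.3f}")
```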

4. Response to C4

We explain the multiplicative factor of 2, please refer to Response 4 to Reviewer ZadH.

5. Response to Literature

We give a detailed discussion of the paper pointed out, please refer to Response 2 to Reviewer dgGN.

We will incorporate these discussions into the paper in an appropriate manner.

6. Response to Weakness

We will carefully check and revise the notation to ensure consistency, e.g., the definition of the base losses $\ell_b$. $\widetilde{\Re}_{nc}(\mathcal{P}(\mathcal{F}))$ is the worst-case Rademacher complexity of the projection function class: we define the worst-case Rademacher complexity in Definition 3.1, and $\widetilde{\Re}_{nc}$ is the analog of that definition for the class $\mathcal{P}(\mathcal{F})$.

Final Decision

This is a theoretical paper focused on multi-label classification. The authors derive generalization bounds for a specific class of classifiers, which depend only logarithmically on the number of labels.

The Reviewers are generally positive about the submission. However, they raised several concerns, including the limited model class and the specific class of loss functions to which the results apply, as well as the lack of references to related work (e.g., Busa-Fekete et al., 2022) and the unclear connection to results presented in (Zhang and Zhang, 2024). If accepted, the paper should be thoroughly revised to address these issues, incorporating the Reviewers' feedback and the clarifications provided by the Authors during the rebuttal phase.