PaperHub
5.5/10 · Poster · 4 reviewers
Reviewer ratings: 3, 4, 2, 3 (min 2, max 4, std 0.7)
ICML 2025

Understanding Nonlinear Implicit Bias via Region Counts in Input Space

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-08-13

Abstract

Keywords
implicit bias, region counts, non-linear neural network, generalization gap

Reviews and Discussion

Review (Rating: 3)

The paper proposes a new region count metric for characterizing neural networks' implicit biases. This metric counts the number of connected regions in the input space with the same predicted label in a low-dimensional subspace. One of the advantages of this output-based metric over parameter-based metrics, such as sharpness and margins, is parametrization-independence. The authors empirically demonstrate that this metric strongly correlates with the generalization gap over a variety of convolutional architectures, datasets, and hyperparameters, making it suitable for the generalization analysis. Then, the authors experimentally show that region count decreases with an increased learning rate and a smaller batch size. The authors suggest a theoretical explanation for the learning rate behavior based on the edge of stability property of gradient-based algorithms.
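For concreteness, a minimal sketch of how a one-dimensional region count along the segment between two inputs could be computed; the names `region_count_1d`, `model`, `x_a`, `x_b`, and `num_steps` are illustrative placeholders, not the authors' implementation.

```python
# Sketch: count connected constant-prediction segments on the line from
# x_a to x_b by sampling the segment and counting predicted-label changes.
import torch

@torch.no_grad()
def region_count_1d(model, x_a, x_b, num_steps=1000):
    ts = torch.linspace(0.0, 1.0, num_steps).view(-1, *([1] * x_a.dim()))
    points = (1 - ts) * x_a.unsqueeze(0) + ts * x_b.unsqueeze(0)
    preds = model(points).argmax(dim=1)
    # Every change of predicted label along the segment starts a new region.
    return (preds[1:] != preds[:-1]).sum().item() + 1
```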

update after rebuttal

I will keep my current evaluation. The authors' feedback was useful, but it did not completely clarify my concerns expressed in my Rebuttal Comment.

Questions for Authors

  1. Could you clarify the types of hyperparameters for which your metric is well-suited (e.g., learning rate) and ill-suited (e.g., augmentations)?
  2. For which types of analysis or applications is the parametrization-independence of your metric important?
  3. Do you think that your metric is well-suited for both big and small generalization gaps? What are the limitations of your metric?
  4. Can you give a practical application of your metric (e.g., for the design of regularizers)?

Claims and Evidence

  1. While the authors claim that their low-dimensional subspace region count metric is more computationally efficient than the whole space region count metric, I still feel that the region count metric is hard to analyze. For instance, while the sharpness metric directly motivates possible regularization methods, such as sharpness-aware minimization, I do not easily see how to use the region count metric for computationally efficient regularization. Moreover, while the authors manage to analyze the 1D-version region counts theoretically, I do not see how to extend this analysis to multi-dimensional region counts.
  2. Parametrization independence is indeed one of the main properties of the region count metric. However, I think the authors should also discuss situations where parametrization independence is indeed important. As I see it, the main advantage of parametrization-independent metrics is the ability to predict generalization performance conditional only on the final model and training data. At the same time, parametrization-dependent metrics might also potentially predict generalization performance, but they require more information, e.g., the final model, training samples, and optimizer hyper-parameters. In this sense, one could argue that parametrization-independent metrics are better suited for the generalization analysis. However, for me, it is not clear which metric is better suited for the design of regularizers since the effect of regularizers implicitly depends on all training hyper-parameters.
  3. The high correlation with the generalization gap is strong evidence of the region count metric's usefulness. However, I see two potential problems with this finding.
    • First, from what I see, the dependence between the region counts and the generalization gap is non-linear. Specifically, the slope is increasing for the smaller region counts. Thus, the "outlier" models with a big generalization gap could cause a high correlation coefficient. If this is the case, the region counts metric could only distinguish between "very good" and "very bad" models but could not distinguish between "very good" and "good" models. Therefore, it could be interesting to conduct some experiments across "good" models, for example, in a situation where the learning rate and batch size are fixed, and weight decay is varied over a relatively tight range of values.
    • Second, the results for augmentations show that the relation between the region counts and the generalization gap significantly depends on the training algorithm. This fact limits the applicability of the region counts metric for the generalization analysis since it suggests that the region count metric is ill-suited for analyzing how the introduction of augmentations affects generalization performance.

Methods and Evaluation Criteria

The proposed evaluation is comprehensive. However, the authors should clarify that they only test convolutional architectures on vision datasets.

Theoretical Claims

Theoretical claims seem correct.

Experimental Design and Analysis

The experimental design seems correct. I have minor comments. Given that the relation between the region counts metric and the generalization gap could be non-linear, it could be worth also reporting rank-correlation. Additionally, I would be interested in the regression analysis where the generalization gap is regressed on all discussed implicit bias metrics: sharpness, margins, normalized margins, 1D region counts, and 2D region counts. It would be interesting to compare the explanatory power of these metrics in terms of R^2.

Supplementary Material

I have briefly examined the experiment details and ablation studies sections, read how region counts were calculated, and checked the proof of Theorem 6.3.

Relation to Broader Literature

The paper is directly related to the deep network generalization literature. Specifically, the authors propose a new non-linearity-aware metric that could be useful for analyzing the generalization gap.

Essential References Not Discussed

I think all essential references are mentioned. However, I would like to see more comparisons with the literature on large learning rates. Specifically, I would like the authors to discuss the differences between the mechanisms considered in their paper and those in the papers by Li et al. (2019) and Lewkowycz et al. (2020).

References

Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in neural information processing systems, 32.

Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., & Gur-Ari, G. (2020). The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218.

Other Strengths and Weaknesses

I think the paper is well-written, and the proofs are presented clearly. The proposed metric seems original, but it was inspired by the previous studies of activation patterns.

Other Comments or Suggestions

I do not have any.

Author Response

We thank the reviewer for the time spent on reviewing our work and for the very detailed comments. Please find the details below.

Q1: How to use the region count metric for regularization?

A1: Due to the word limit, we respectfully refer the reviewer to Reviewer VXuC's response A2.

Q2: How to extend the theorem to multi-dimensional region counts?

A2: The choice of one-dimensional region count is primarily for technical simplicity. The core idea of the proof is that the region count can be upper bounded by the number of activation pattern changes. This holds regardless of the hyperplane’s dimensionality, probably with a different dependency in the exponent. This observation suggests a natural direction for extending the analysis to the multi-dimensional case, which we leave for future work.

Q3: For which types of analysis is the parametrization-independence of your metric important?

A3: Our primary goal is to develop a metric that characterizes the implicit bias of nonlinear neural networks. Since we focus on understanding the solutions the network converges to rather than how it is trained, a parametrization-independent metric is essential for properly capturing this implicit bias.

Q4: Conduct experiments to find whether region count can distinguish between "very good" and "good" models.

A4: We fixed the learning rate at 0.1 and batch size at 256, and varied only the weight decay to compute the correlation.

| Parameter | Value |
| --- | --- |
| Weight Decay | 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7, 1e-7 |

The correlation plot is shown in Figure 3 of https://anonymous.4open.science/r/icml-rebuttal-B813/icml%202025%20rebuttal.pdf. The generalization gaps range from 17 to 20, and the region counts range from 2.5 to 3, making them relatively close. We observe a correlation of 0.84, indicating that even under strong generalization settings (small learning rate, large batch size), our region count metric can effectively distinguish between "very good" and "good" models.

Q5: Clarify the hyperparameters for which your metric is well-suited/ill-suited.

A5: As shown in Figures 6 and 7, our metric maintains a strong correlation under mixup augmentation. This aligns with mixup’s implicit effect of reducing region count by enforcing smooth label transitions, which may partly explain its performance benefits. For random crops and flips, high correlation is observed when evaluated separately. However, pre-crop and post-crop data are not directly comparable, as cropping fundamentally changes the data distribution. Such cases are better treated as distinct distributions rather than as hyperparameter variations.

Q6: Reporting rank-correlation.

A6: We add experiments reporting the rank correlation [3]. All networks are trained on CIFAR-10 using the hyperparameters in Table 1. The results are as follows:

| Network | Correlation |
| --- | --- |
| ResNet18 | 0.95 |
| ResNet34 | 0.94 |
| VGG19 | 0.74 |
| MobileNet | 0.98 |
| SENet18 | 0.96 |
| ShuffleNetV2 | 0.99 |
| EfficientNetB0 | 0.97 |
| RegNetX-200MF | 0.98 |
| SimpleDLA | 0.86 |

These results show consistently high rank correlation, further validating the effectiveness of our metric.
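For reference, a small sketch of how such a rank correlation can be computed from paired per-model statistics; the arrays below are illustrative placeholders, not the reported data.

```python
# Kendall rank correlation [3] between region counts and generalization gaps,
# one entry per trained model / hyperparameter configuration.
from scipy.stats import kendalltau

region_counts = [2.5, 3.1, 4.0, 6.2, 8.7]   # placeholder values
gen_gaps = [14.1, 15.0, 17.3, 20.2, 24.5]   # placeholder values
tau, p_value = kendalltau(region_counts, gen_gaps)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3g}")
```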

Q7: Do regression analysis where the generalization gap is regressed on all metrics.

A7: We train ResNet18 on CIFAR-10 and conduct regression analysis using five metrics: sharpness, margin, normalized margin, 1D region count, and 2D region count. We compute the R^2 values. The results are as follows:

| Measure | R^2 |
| --- | --- |
| All | 0.96 |
| Margin | 0.41 |
| Normalized Margin | 0.03 |
| Sharpness | 0.61 |
| 1D Region Count | 0.94 |
| 2D Region Count | 0.92 |

Both our proposed 1D and 2D region count metrics achieve high R^2 values, nearly matching the overall R^2, demonstrating strong predictive power for the generalization gap compared to existing measures.
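As a sketch of this regression analysis, assuming a matrix of per-model measure values; the names `measures` and `gaps` are illustrative.

```python
# Regress the generalization gap on one or more implicit-bias measures and
# report the coefficient of determination R^2.
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(measures: np.ndarray, gaps: np.ndarray) -> float:
    """measures: (n_models, n_measures); gaps: (n_models,)."""
    return LinearRegression().fit(measures, gaps).score(measures, gaps)

# e.g., a single-measure fit on the 1D region count column:
# r2_region_1d = r_squared(region_count_values.reshape(-1, 1), gaps)
```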

Q8: Discuss the papers [1][2].

A8: [1] attributes the generalization difference to the learning order of examples: large learning rates delay fitting hard-to-fit but generalizable patterns until after annealing, while small learning rates prioritize them early. [2] proposes a phase-based view of training dynamics, where large learning rates in the “catapult” phase reduce curvature and lead to flatter minima. In contrast, our paper offers a new perspective: large learning rates improve generalization by reducing the number of region counts. We will include a discussion of [1][2] in the next version.

We thank the reviewer once again for the valuable suggestions.

References

[1] Li, Yuanzhi, Colin Wei, and Tengyu Ma. "Towards explaining the regularization effect of initial large learning rate in training neural networks." Advances in neural information processing systems 32 (2019).

[2] Lewkowycz, Aitor, et al. "The large learning rate phase of deep learning: the catapult mechanism." arXiv preprint arXiv:2003.02218 (2020).

[3] Kendall, Maurice G. "A new measure of rank correlation." Biometrika 30.1-2 (1938): 81-93.

Reviewer Comment

Thanks for the response!

The authors answered almost all of my questions. However, currently, I am inclined to keep my score.

  1. While I understand that the question of the suitability of the metric for different hyperparameters is challenging to answer, I still do not understand the limits of the metric's applicability. Indeed, Figures 6 and 7 demonstrate a high correlation, but the points for different augmentations lie on different lines, which suggests that the metric does not capture some additional generalization mechanisms. The current response suggests that the metric is not applicable for different data distributions; however, all augmentations could be reformulated as changes in data distributions, which suggests that the metric is only directly applicable for the analysis of optimizer hyperparameters.
  2. While I understand the intuitive connection between the mixup and the region counts, the previous point suggests that formally justifying such a connection would be difficult since the mixup inherently changes the distribution.
  3. While the authors claim that the theoretical extension of their results to multi-dimensional region counts is possible, I am still not convinced by their argument. Indeed, one only needs to bound the number of activation changes; however, analyzing activation changes is more challenging in a multi-dimensional environment. Moreover, the current theoretical analysis exploits sharpness to derive the bound, which does not allow the authors to theoretically justify the advantages of region counts over sharpness.
  4. (minor) I think comparing the explanatory power of margin, normalized margin, and sharpness together against 1D region counts would be interesting. If the explanatory power of 1D region counts is higher, it would indicate that the region counts capture a completely new generalization mechanism compared to margins and sharpness.
Author Comment

Thank you for your timely and detailed feedback!

Regarding the first and second points, if we merge the points from the two curves of mixup and calculate the correlation, we also obtain a value of 0.97. Therefore, our method is applicable to mixup. For random crop and random flip, we compute the correlation on all merged data and achieve a value of 0.91. The corresponding correlation plots can be found in https://anonymous.4open.science/r/icml-rebuttal-B813/follow%20up.pdf.

We acknowledge that in some specific cases—e.g., comparing results with and without random crop—region counts may both be around 3.0, while the generalization gap differs significantly (ranging from 13 to 22). However, we would like to emphasize that in the case of random crop and random flip, the change in data distribution is more substantial (unlike mixup, which results in smoother changes). We believe that achieving invariance to different data distributions is a more challenging goal because analyses of the generalization gap are largely dependent on the data distribution, as demonstrated in prior work on uniform convergence [1][2] and benign overfitting [3][4][5].

For the third point, we cannot provide a complete proof of the theoretical analysis in higher dimensions at this stage, and we will consider it as part of future directions. Table 2 in our paper demonstrates that region counts in higher dimensions maintain a relatively high correlation, which experimentally supports this idea. We acknowledge that our theoretical analysis relies on sharpness-related assumptions. While sharpness can imply a low region count, the reverse does not necessarily hold. Our empirical results show that region count has a much stronger correlation with the generalization gap than sharpness. This suggests that sharpness may reduce generalization error by inducing a low region count, and that region count may provide a more general and intrinsic explanation for generalization performance.

Regarding the last point, we have added additional experiments that investigate the regression from margin, normalized margin, and sharpness to region count. The results show poor regression performance (with R^2 around 0.65), indicating that these metrics cannot fully represent region count. This suggests that region count captures a distinct generalization mechanism compared to margins and sharpness.

References

[1] Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in neural information processing systems 30 (2017).

[2] Zhou, Lijia, Danica J. Sutherland, and Nati Srebro. "On uniform convergence and low-norm interpolation learning." Advances in Neural Information Processing Systems 33 (2020): 6867-6877.

[3] Bartlett, Peter L., et al. "Benign overfitting in linear regression." Proceedings of the National Academy of Sciences 117.48 (2020): 30063-30070.

[4] Zou, Difan, et al. "Benign overfitting of constant-stepsize sgd for linear regression." Conference on Learning Theory. PMLR, 2021.

[5] Tsigler, Alexander, and Peter L. Bartlett. "Benign overfitting in ridge regression." Journal of Machine Learning Research 24.123 (2023): 1-76.

Review (Rating: 4)

This paper introduces the notion of connected region count in the input space and shows that it is strongly correlated with the generalization gap. Moreover, it is noted that larger learning rates and smaller batch sizes can lead to smaller region counts. Theoretically, it is proved that for a two-layer ReLU network under the edge-of-stability assumption, the average region count is bounded by O(1/learning rate).

update after rebuttal

Thanks for your response! I will keep my score.

Questions for Authors

If we calculate the expected 1D region count between two random points in the convex hull of training data, do we get similar results? In other words, do we need to choose two training examples as the endpoints?

Claims and Evidence

The claims are well-supported in general; see more discussion below on the empirical results.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I didn't check the proofs.

Experimental Design and Analysis

The empirical results look convincing overall: CIFAR-10, CIFAR-100 and ImageNet are tried, and multiple network architectures are tested. Data augmentation techniques such as mixup are also analyzed. One concern is that the discussion is focused on vision datasets and convolutional networks; it would be nice if more datasets and architectures can be analyzed.

Supplementary Material

No.

Relation to Broader Literature

This paper introduces the notion of connected region count and shows that it is strongly correlated with the generalization gap. As far as I know no prior work has shown the relationship between region count and generalization, so I believe this is a nice contribution.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the comments and constructive suggestions. In the following, we address the main concern raised. Please find the details below.

Q1: It would be nice if more datasets and architectures can be analyzed.

A1: We thank the reviewer for the suggestion. To explore the applicability of our approach to transformer-based models, we conduct an experiment using Vision Transformers (ViT) [1] on CIFAR-10. Details of the hyperparameters used are summarized below:

| Hyperparameter | Value |
| --- | --- |
| Learning rate | 1e-4, 5e-5, 1e-5 |
| Batch size | 256, 512, 1024 |
| Weight decay | 1e-5, 1e-6, 1e-7 |

The correlation figure is shown in Figure 2 in the anonymous github website https://anonymous.4open.science/r/icml-rebuttal-B813/icml%202025%20rebuttal.pdf. We achieve a correlation of 0.84, validating the applicability of our measure in this network architecture. We will include these experimental results in the next version of the paper.

Q2: If we calculate the expected 1D region count between two random points in the convex hull of training data, do we get similar results? In other words, do we need to choose two training examples as the endpoints?

A2: We thank the reviewer for the suggestions. In Appendix C, we have experimented with alternative ways of generating the hyperplanes used to compute the region count, e.g., selecting a training data point and extending it in a random direction by a fixed length. Even with this method, the correlation remains high.

We further conduct experiments by calculating the expected 1D region count between two random points in the convex hull of the training data, using the hyperparameters from Table 1 to train the neural networks and compute the correlation. The results are as follows:

| Network | Correlation |
| --- | --- |
| ResNet18 | 0.93 |
| ResNet34 | 0.95 |
| EfficientNetB0 | 0.87 |
| SimpleDLA | 0.82 |

The results confirm that the correlation remains strong, demonstrating that our method does not require selecting two training examples as endpoints.
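One simple way to draw such endpoints is as random convex combinations of training examples; a minimal sketch with illustrative names, where the Dirichlet weighting is our assumption rather than the exact sampling scheme used.

```python
# Sample a random point in the convex hull of the training data as a
# Dirichlet-weighted convex combination of a few training examples.
import numpy as np

def sample_in_convex_hull(train_x, n_support=10, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(train_x), size=n_support, replace=False)
    weights = rng.dirichlet(np.ones(n_support))          # non-negative, sums to 1
    return np.tensordot(weights, train_x[idx], axes=1)   # convex combination

# Two such samples can then serve as endpoints for the 1D region count.
```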

Finally, we thank the reviewer once again for the effort in providing us with valuable suggestions. We will continue to provide clarifications if the reviewer has any further questions.

References

[1] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Review (Rating: 2)

The authors propose using low-dimension region counts as a proxy for the generalization performance. They empirically test the correlation between region count and generalization performance, and provide a bound on the region count for two layer ReLU neural networks.

Questions for Authors

  1. What is the correlation between generalization performance and other potential proxies? For example: norm, margin, and number of GD iterations (this is not an exhaustive list).

Claims and Evidence

The claims are not adequately supported by the evidence. It is unclear whether region count is a good proxy for generalization performance, because it is unclear what a good correlation metric is. What is the correlation metric for other possible proxies, like norm and margin, or even simpler ones like number of iterations of GD? There is no other connection provided between region count and generalization performance other than through this correlation.

The theoretical bound provided gives region count bounds in terms of O(N), which is much larger compared to the empirical region counts in the 2-20 range. It's not clear how meaningful the theoretical bound is, and there is no theoretical generalization bound provided in terms of region count.

Methods and Evaluation Criteria

The criteria for evaluating how good region count is would make more sense if it were compared against other possible proxies; as it stands, it is not possible to evaluate whether region count is an appropriate proxy.

Theoretical Claims

I did a cursory check for correctness and did not see any issues.

Experimental Design and Analysis

See previous sections.

Supplementary Material

N/A

Relation to Broader Literature

This paper builds on a large prior literature investigating generalization.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The authors propose an interesting proxy for generalization performance which has potential. However, they do not provide adequate evidence for its effectiveness.

Other Comments or Suggestions

N/A

Author Response

We greatly appreciate the reviewer's comments and valuable suggestions. We added extensive experiments and summarize them below. We will include more detailed results in the revision.

Q1: It is unclear whether region count is a good proxy for generalization performance, because it is unclear what a good correlation metric is. What is the correlation metric for other possible proxies, like norm and margin, or even simpler ones like number of iterations of GD? There is no other connection provided between region count and generalization performance other than through this correlation.

A1: We appreciate the reviewer's feedback on the missing details. In Section 3 (Figure 2), we have already shown that the correlations between the generalization gap and the Frobenius norm, margin, or margin/Frobenius norm are all weak.

We follow the reviewer's advice and conduct additional experiments on other generalization measures, including the spectral norm [2], PB-I and PB-O (sharpness metrics from PAC-Bayesian bounds that use the origin and the initialization as reference tensors), and PB-M-I and PB-M-O [3][4][5] (derived from PAC-Bayesian magnitude-aware perturbation bounds). We conduct experiments on CIFAR-10 using ResNet-18 and the hyperparameters in Table 1. The results are summarized below:

| Proxy | Correlation |
| --- | --- |
| Frobenius Norm | 0.64 |
| Margin | 0.02 |
| Margin/Frobenius Norm | -0.18 |
| Spectral Norm | 0.77 |
| PB-I | -0.35 |
| PB-O | -0.31 |
| PB-M-I | 0.79 |
| PB-M-O | 0.78 |
| Region Count | 0.98 |

The results are consistent with the results in [1] and demonstrate that our proposed region count measure exhibits a significantly higher correlation with the generalization gap compared to other measures. We will include these findings in the next version of the paper.
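For context, the simplest of these proxies can be computed in a few lines. Below is a hedged sketch of a Frobenius-norm and an output-margin proxy, assuming a PyTorch `model` and a batch of `logits`/`labels`; these are illustrative definitions, not the exact formulas of the cited works.

```python
# Illustrative generalization proxies: total Frobenius norm of the weights
# and the average output margin (correct-class logit minus best other logit).
import torch

def frobenius_norm(model):
    return torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))

def average_margin(logits, labels):
    correct = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, labels.unsqueeze(1), float("-inf"))
    return (correct - others.max(dim=1).values).mean()
```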

Q2: The theoretical bound provided gives region count bounds in terms of O(N), which is much larger compared to the empirical region counts in the 2-20 range. It's not clear how meaningful the theoretical bound is, and there is no theoretical generalization bound provided in terms of region count.

A2: Thank you for raising this question. The O(N) dependency is actually tight: for N points on a line with alternating labels, any classifier that fits all of them must change its prediction between consecutive points, producing on the order of N regions along that line. The upper bound in the theorem is a worst-case analysis, and it can possibly be tightened with additional assumptions on the data distribution.

Also, we believe generalization results in terms of region count are possible, since region counts might constrain the size of the function class, which connects to traditional ways of constructing generalization bounds based on function class complexity. We believe the strong empirical connection between region count and generalization shown in this paper is already a novel contribution to the generalization community, and we leave a rigorous proof to future work.

Finally, we thank the reviewer once again for the effort in providing us with valuable and helpful suggestions. We will continue to provide clarifications if the reviewer has any further questions.

References

[1] Jiang, Yiding, et al. "Fantastic generalization measures and where to find them." arXiv preprint arXiv:1912.02178 (2019).

[2] Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in neural information processing systems 30 (2017).

[3] Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).

[4] Neyshabur, Behnam, et al. "Exploring generalization in deep learning." Advances in neural information processing systems 30 (2017).

[5] Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in neural information processing systems 30 (2017).

Reviewer Comment

Thank you for the response, the inclusion of Figure 2 really helps with exhibiting the high correlation between region count and the generalization gap. I would appreciate some further clarifications on the figure though.

Could you provide some more details on how the experiments in Figure 2 were run, especially compared to those in Figure 4 (specifically the ResNet18 portion of it)? I would assume that you could train the models once (per set of hyperparameters) and analyze several different counts/norms/proxies for a single trained model. This would result in the generalization gaps being the same across all figures, with the only difference being the x-axis. As a result, I'm a bit confused why the ranges of the generalization gaps differ so much between Figure 2 (14-24) and Figure 4 (15-36)? Also, you mentioned you used the hyperparameters in Table 1, which would result in 27 possible configurations of hyperparameters, but I only see 18 data points in Figure 2. What am I missing here? Thanks!

Author Comment

We sincerely thank the reviewer for the question! We carefully reviewed the code related to Section 3 and confirmed that it was indeed based on some early exploratory work, using the hyperparameters from the initial stage of the project. The specific hyperparameters used are as follows:

| Hyperparameter | Value |
| --- | --- |
| Learning rate | 0.1, 0.01 |
| Batch size | 256, 512, 1024 |
| Weight decay | 1e-5, 1e-6, 1e-7 |

To ensure a fair comparison of the correlation, we re-plotted the correlation results for the ResNet18 network in Figure 4 using these 18 hyperparameter settings (instead of the later 27 in Table 1). The updated figure can be found at the anonymous link https://anonymous.4open.science/r/icml-rebuttal-B813/correlation.pdf. The correlation results under these hyperparameters are as follows:

| Proxy | Correlation |
| --- | --- |
| Frobenius Norm | 0.64 |
| Margin | 0.02 |
| Margin/Frobenius Norm | -0.18 |
| Region Count | 0.92 |

As the figures show, the ranges of the generalization gaps are now consistent across the figures (14-24). Under these 18 hyperparameter settings, the correlation with region count remains significantly higher than that of the other metrics.

In addition, we also conduct the experiments for the three measures in Figure 2 using the 27 hyperparameter settings from Table 1. The updated correlation plots can be found at the anonymous link below https://anonymous.4open.science/r/icml-rebuttal-B813/correlation_prime.pdf. The correlation results under these hyperparameters are compared as follows:

| Proxy | Correlation |
| --- | --- |
| Frobenius Norm | 0.74 |
| Margin | 0.60 |
| Margin/Frobenius Norm | 0.16 |
| Region Count | 0.98 |

The results show that our method also outperforms these metrics under the hyperparameters in Table 1. Besides, we would like to note that the first three rows of the table in the original rebuttal were taken directly from our manuscript, so they should be replaced with the corresponding rows from the 27-hyperparameter table above; the remaining rows of the rebuttal table are already the correct 27-hyperparameter results.

We apologize for the lack of clarity in Section 3 of our paper and will revise this part in the next version of our paper.

Review (Rating: 3)

In this work, authors propose region count as a metric to quantify implicit bias / generalizability of neural networks. A region is defined as a set of input points which are classified in the same way by the network; authors show that fewer regions lead to increased generalizability (quantified as gap between test error and train error). They empirically show that the number of regions exhibits a high correlation with the generalizability in a number of convolutional architectures.

Update after rebuttal

After reading the authors' comments and the other reviewers' comments, I maintain my score.

Questions for Authors

  • Given the link between region count and generalizability, do you think it would be possible to explicitly aim for low region count in the objective function of a neural network training? Perhaps with some regularization term. Do you think this would help in training better models? Could this term be applied from the start of the training, or perhaps when the network has already "stabilized" after some iterations?

  • I find the link between region count and data augmentation very interesting. I think that data augmentation can help in decreasing region count, thus helping generalization; perhaps this could explain the success of contrastive learning methods? I would like to hear the authors' thoughts about this. On this topic, I would suggest the work [1], which is reminiscent of the connectedness property in your work.

[1] https://proceedings.mlr.press/v202/dufumier23a/dufumier23a.pdf

Claims and Evidence

  • The main claim of this work is that the number of regions is correlated with the generalizability of the model

  • The evidence is convincing, and it indeed seems that better performing models tend to have fewer regions

  • This work may also provide theoretical justification as to why higher learning rate and smaller batch size can work well in practice

Methods and Evaluation Criteria

  • I think the setup of the paper is rigorous and the evaluation criteria are satisfactory

Theoretical Claims

  • While I did not carefully check the details of the proofs in the appendix, I am convinced by the correctness of the presented analysis in the main text.

Experimental Design and Analysis

I believe the experimental design is convincing. Personally, I would like to see some results for networks which did not converge and exhibit a high test error (the lowest result in Tab. 2 is 0.78 on imagenet). To what extent (going down in generalizability) does the correlation hold? I think this may be interesting to analyze also to derive potential regularization terms aimed at enforcing low region count [see below]

Supplementary Material

I briefly checked the supplementary material.

Relation to Broader Literature

I believe this work can potentially have a significant impact on the deep learning field in general.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

  • The work is clearly presented and understandable even if theoretical

  • I believe the experimental validation is done correctly, as it is often lacking in more theoretical works

  • Some empirical results are missing on the lower end of the generalization spectrum. As stated above, I think it would be interesting to see whether this correlation exists also in those cases

Other Comments or Suggestions

I think this work is very interesting and can have significant impact, thus may be considered for acceptance. I have some questions about its potential applications [see below]

Author Response

We thank the reviewer for the detailed comments. We address the main concerns below:

Q1: Personally, I would like to see some results for networks which did not converge and exhibit a high test error (the lowest result in Tab. 2 is 0.78 on imagenet). To what extent (going down in generalizability) does the correlation hold?

A1: We appreciate the reviewer's insightful question. We observed relatively low training and test accuracy on ImageNet: about 70% on the training set and 45% on the test set. However, the correlation in Table 2 remains high, though slightly lower than on CIFAR-10/100.

Additionally, we conduct experiments with traditional ML models. When training decision trees and random forests on CIFAR-10 with various hyperparameters, most configurations failed to converge (yielding high test error). The hyperparameters are as follows:

| Hyperparameter | Value |
| --- | --- |
| Depth | 3, 4, ..., 17 |
| Criterion | gini, entropy |
| Splitter | best, random |

The correlation figure is shown in Figure 1 in the anonymous github website https://anonymous.4open.science/r/icml-rebuttal-B813/icml%202025%20rebuttal.pdf. We achieve a correlation of 0.96 in decision trees and 0.98 in random forests, demonstrating that our measure remains robust despite low test accuracy.

Q2: Given the link between region count and generalizability, do you think it would be possible to explicitly aim for low region count in the objective function of a neural network training? Do you think this would help in training better models? Could this term be applied from the start of the training, or perhaps when the network has already "stabilized" after some iterations?

A2: We thank the reviewer for this thoughtful suggestion. Since the region count computation is currently non-differentiable, we cannot explicitly incorporate it as a regularization term. However, the data augmentation method mixup implicitly minimizes region count by encouraging smooth transitions between labels rather than predicting unrelated classes. This suggests our method may partially explain mixup’s performance benefits.
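As a reminder of the mechanism referred to here, a minimal mixup sketch in its standard form; the function name and Beta parameter are illustrative, not the paper's implementation.

```python
# Mixup: interpolate both inputs and one-hot targets with a Beta-distributed
# coefficient, encouraging smooth transitions between labels.
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=1.0):
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```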

While it cannot be directly used as a regularization term, we find it effective for early-stage hyperparameter evaluation. We conduct experiments analyzing the correlation between early-training region counts and the final generalization gap (distinct from Table 3's per-timestep analysis). We train ResNet18 on CIFAR-10 for 200 epochs. The results are as follows:

| Epoch | Correlation |
| --- | --- |
| 10 | -0.07 |
| 20 | 0.24 |
| 30 | 0.92 |
| 40 | 0.95 |
| 60 | 0.97 |
| 80 | 0.96 |
| 100 | 0.97 |
| 200 | 0.98 |

The results show a strong correlation between early region counts and the final generalization gap (from epoch 30 onward), and region counts stabilize quickly during training. This enables early stopping for poor hyperparameter configurations by monitoring initial region counts, reducing computational costs. We will add these findings to the paper.

Q3: I find very interesting the link between region count and data augmentation. I think that data augmentation can help in decreasing region count, thus helping generalization; perhaps this could explain the success of contrastive learning methods[1]?

A3: We thank the reviewer for this insightful question. For now, our analysis applies to supervised learning with overparameterized models. For contrastive learning, the pretrained representations obtained through this approach can serve as feature extractors to improve performance in downstream classification tasks, particularly when labeled data is scarce. If kernel-based methods [1] are used to connect more positive samples when the network's initial label predictions are uncertain, this leads to simpler learned regions rather than mixed positive and negative samples, which also helps minimize the learned region count and thereby enhances performance. We believe this is a promising direction for future research.

We thank the reviewer once again for the valuable and helpful suggestions. We would be happy to provide further clarifications if the reviewer has any additional questions.

References

[1] Dufumier, Benoit, et al. "Integrating prior knowledge in contrastive learning with kernel." International Conference on Machine Learning. PMLR, 2023.

Reviewer Comment

I appreciate the authors' response, and I confirm my partially positive stance towards acceptance.

About A2, would it be possible to perform that experiment on some slightly harder datasets, as CIFAR-10 is too "easy" for ResNet18? Perhaps also CIFAR-100 could be enough.

Also, what I would like to see besides the generalization gap or correlation is the real train/test performance at each timestep. I think it would paint a more complete picture, as the generalization gap may also be low when both training and test errors are high.

Author Comment

Thank you for your timely and detailed feedback. We have added experiments using ResNet-18 on CIFAR-100, training for 200 epochs with the hyperparameters in Table 1. The results are as follows:

| Epoch | Correlation |
| --- | --- |
| 10 | 0.63 |
| 20 | 0.38 |
| 30 | 0.75 |
| 40 | 0.84 |
| 60 | 0.93 |
| 80 | 0.92 |
| 100 | 0.94 |
| 200 | 0.96 |

The results show that region count can still predict the final generalization gap at early training stages.

Regarding the train and test accuracy for each epoch, for most hyperparameter settings on CIFAR-10 and CIFAR-100, the training accuracy reaches nearly 100%. As an example, for ResNet-18 on CIFAR-100 with 15 different hyperparameter configurations, the error curves for each epoch can be found in this anonymous link https://anonymous.4open.science/r/icml-rebuttal-B813/Training%20and%20test%20accuracy%20over%20epoch.pdf. We will include this part in the appendix of the next version of the paper.

Final Decision

This paper studies implicit bias by counting decision regions over subspaces in the input space. While a previous work by Somepalli et al. (2022) already counted decision regions over truncated planes (i.e., truncated two-dimensional subspaces), the current submission includes new contributions such as

  • a theoretical analysis of the decision region count over one-dimensional subspaces;
  • an empirical analysis of the dependency of decision region count on the learning rate and batch size;
  • empirical counting of decision regions on subspaces of dimension other than two.

These provide sufficient novelty and significance to this submission.

Nevertheless, the submitted paper misses the important reference to the paper by Somepalli et al. (2022) that was the first to propose the idea of empirically analyzing deep networks based on counting decision regions on subspaces of the input space, and exemplifying its relation to generalization and double descent. In the work by Somepalli et al., the decision regions are counted on truncated planes defined by triplets of training samples. They defined a fragmentation score, which is the average number of decision regions over many planes, and showed it for a varying width of ResNet-18 and with respect to the generalization performance in terms of test error. Moreover, their paper used the decision regions over the planes to show the implicit bias of various deep network architectures.

In an AC-Authors discussion during the rebuttal phase, the authors acknowledged the importance of the missing reference to the paper by Somepalli et al. and agreed to revise their paper accordingly. This issue was considered also in the AC-Reviewers discussion.

Therefore, based on reading the paper, the reviews, the author rebuttal and following discussions, the recommendation is to Accept this paper conditioned on

  • adding a detailed discussion in the related work section that explains how the current paper extends the previous work by Somepalli et al.;
  • revising the following places in the text to at least mention and cite the paper by Somepalli et al., so as not to give the impression that the current paper is the first to count decision regions over subspaces:
    • Line 042, right: "Our research identifies a metric called region count..."

    • Line 088, left: "We introduce a novel measure..."

    • Lines 157-160, right: "Our motivation can be summarized by a simple idea..."

    • Lines 189-192, right: "we propose a computationally efficient surrogate by calculating the region counts on low dimensional subspace spanned by training data points."

Reference:

Somepalli et al., "Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent From the Decision Boundary Perspective", CVPR 2022.