PaperHub
Overall rating: 5.7/10 · Rejected · 3 reviewers
Ratings: 3, 6, 8 (min 3, max 8, std. dev. 2.1)
Confidence: 3.0
Correctness: 2.3 · Contribution: 2.0 · Presentation: 2.7
ICLR 2025

Provable Data-driven Hyperparameter Tuning for Deep Neural Networks

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

Keywords

learning theory, data-driven algorithm design, hyperparameter tuning, neural architecture search, graph neural networks, sample complexity

Reviews and Discussion

Official Review
Rating: 3

The first formal study of hyperparameter tuning that considers a discontinuous and oscillating optimization landscape, introducing a new analysis that draws on techniques from differential geometry and constrained optimization.

Strengths

Solid paper.

The subject studied is an important one. We have seen major empirical breakthroughs in training, hyperparameter tuning, fine-tuning, and so on, while theory has been hard pressed to keep up with this progress. Work like this paper should definitely be encouraged.

The authors apply a lot of advanced mathematics to the topic. The writing on the theory side is clear and pretty. The authors go very deep into analyzing the circumstances described by their theoretical assumptions.

Weaknesses

The settings are not well explained. "We assume that we have a series of deep learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks." Why not fine-tune for each different task? Why is it necessary to introduce the difficulty of a task distribution for the problem of hyperparameter tuning? These questions naturally arise, and there is a lack of explanation.

Some sentences are hard to parse. "A major difficulty is that the loss as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. This is unlike previous work in data-driven design, where one can typically explicitly model the algorithmic behavior as a function of the hyperparameters." I'm confused by these two sentences. A volatile function is still a function of the hyperparameters, so the intended meaning is likely not well conveyed.

The mathematics is not necessarily relevant to the topic. The introduced techniques from differential geometry and algebraic geometry are not necessarily relevant to the real difficulties of hyperparameter tuning, but might just be useful because the assumptions are taken in their favor. This leads to the question: how relevant are the assumptions to reality? The authors argue that the landscape of hyperparameter tuning is very volatile, which gives me the impression that algebraic and differential geometry are too smooth to be applicable. The authors could have presented a visualization of a neural network hyperparameter landscape that confirms their assumptions. The whole paper would be much more convincing if the actual landscape were shown to obviously satisfy the assumptions in the paper. For example, this blog https://sohl-dickstein.github.io/2024/02/12/fractal.html gives a very good presentation of beautiful fractals made by neural network training. Sometimes blogs are much better at faithfully presenting information, and more polished, than papers. It feels like the paper's assumptions allow far less discontinuity than reality exhibits.

There is a lack of discussion of implications that could be useful for the empirical community. I could not find any relevant empirical experiments in this paper. As the nature of the subject is quite empirical, it is essential to have empirical support and evidence of significance for improving existing results.

Questions

Included in weaknesses already.

Why doesn't the abstract say "we use both algebraic and differential geometry"?

Comment

References

[1] Balcan et al., How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design, STOC’21

[2] Balcan et al., Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems, COLT’17

[3] Balcan et al., Learning to Link, ICLR’20

[4] Bartlett et al., Generalization Bounds for Data-Driven Numerical Linear Algebra, COLT’22

[5] Balcan et al., Structural Analysis of Branch-and-Cut and the Learnability of Gomory Mixed Integer Cuts. NeurIPS’22

[6] Balcan et al., Sample Complexity of Tree Search Configuration: Cutting Planes and Beyond. NeurIPS’21

[7] Balcan et al., Dispersion for Data-Driven Algorithm Design, Online Learning, and Private Optimization. FOCS’18

[8] Cheng and Basu, Learning Cut Generating Functions for Integer Programming, NeurIPS’24

[9] Cheng et al., Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut, NeurIPS’24

[10] Amos et al., Meta Optimal Transport, ICML’23

[13] Balcan et al., Provably Tuning the ElasticNet Across Instances, NeurIPS’22

[15] Balcan et al., New Bounds for Hyperparameter Tuning of Regression Problems Across Instances, NeurIPS’23

Comment
  1. "Differential and algebraic geometry might only be useful because the assumptions are taken in their favor. What about the blog https://sohl-dickstein.github.io/2024/02/12/fractal.html? It looks like the assumed discontinuities are much milder than in reality."

    We thank the reviewer for referring to the interesting blog. However, we want to clarify the following points:

    1. Our focus is on tuning model hyperparameters in data-driven settings, where the dual utility function $f_x(\alpha, w)$ admits a piecewise polynomial structure. Of course, one might think of hyperparameter tuning in DNNs differently, for example tuning hyperparameters of the optimization algorithm instead, and we agree that our analysis is not applicable in that case. But, as the reviewer stated, "... theory has been hard pressed to keep up with this progress. Work like this paper should definitely be encouraged.", the theoretical analysis of hyperparameter tuning is challenging, and it is worthwhile to first study a specific setting, under some (reasonable) assumptions, to obtain an initial result and understanding of the problem from a learning-theoretic lens.

      To clarify this point, we changed our title to "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions" as mentioned above, to emphasize that we focus on studying the sample complexity only, and only on a specific class of hyperparameters. We also incorporated multiple changes in our revised draft to clarify this point (l.95-101, l.147-157, and a detailed discussion in Appendix B).

    2. "Differential and algebraic might only be useful because the assumptions are taken in their favor":

      1. Actually, the idea of using differential and algebraic geometry is inspired by the observation that the hyperparameter-and-parameter loss landscape $f_x(\alpha, w)$ often admits a piecewise polynomial structure.
      2. For the simpler problem of parameter tuning, it is known that the piecewise polynomial structure holds. We show that the same structure holds as we vary both the parameters and the hyperparameter (see Section 6 on applications), but even then it is not obvious that it implies generalization guarantees for hyperparameter tuning. Showing that it does is the main technical challenge (Sections 4 and 5). When this particular piecewise structure holds, Assumption 1 holds almost everywhere (see l.321-323, due to a fundamental result in differential geometry, Sard’s theorem, E.10).
    3. "What about the blog https://sohl-dickstein.github.io/2024/02/12/fractal.html?": though this blog provides a nice visualization of the hyperparameter loss landscape, we emphasize that it concerns optimization-algorithm hyperparameters and is not applicable in our case. Moreover, it only provides visualizations of a few example instances without any theoretical evidence, while the piecewise structure in our application is proven to hold for any network and input instance.

  2. Lack of experiments: We thank the reviewer for the comment. However, the main purpose of our work is theoretical. We note that Learning Theory is an explicit area of interest in the Call for Papers, and this paper is also listed as having Learning Theory as its primary area.

Additional questions

  1. "Why not saying that "we use both algebraic and differential geometry"?: Thank you for pointing it out, we have just added it in the abstract.

Summary

Again, we thank the reviewer for the constructive feedback. We made several modifications (in the title and main body) to address the reviewer’s concerns and to emphasize the scope and setting of our study. We are happy to answer further questions raised by the reviewer. We respectfully request that the reviewer reevaluate our paper in light of our rebuttal.

Comment

We thank the reviewer for spending time reviewing our paper and raising some good points. We appreciate that the reviewer finds our paper solid and important, and agrees that work in this direction should be encouraged given the lack of theoretical understanding of the topic. We are glad that the reviewer found the theory side of our paper clear. We address the reviewer's concerns as follows.

Questions/clarity of settings/assumptions/experiments.

  1. Extra elaboration on the setting, and on why we introduce a task distribution for the problem of hyperparameter tuning:

    1. As stated by the reviewer and mentioned in the title, we focus on the data-driven setting, which assumes that there is an application-specific problem (task) distribution $\mathcal{D}$ from which the problem instance (task) $x$ comes. In this setting, we tune the hyperparameter for the problem distribution $\mathcal{D}$, not for a single problem instance $x$. We note that this is not an uncommon setting in machine learning, and it has been investigated on both the theoretical and empirical sides (see [1,2,3,4,5,6,7,8,9,10] for a non-exhaustive list). This setting naturally captures cross-validation, but is more general and also applies to multitask hyperparameter tuning [13].

    2. On introducing the task distribution: By assuming a distribution $\mathcal{D}$ over tasks and the availability of task samples $x_1, \dots, x_N$ from $\mathcal{D}$, we can provide a generalization guarantee for the hyperparameter tuned using those available tasks. Note that this setting makes sense if we have to solve multiple related tasks repeatedly [9, 10], but it also captures cross-validation as a special case (where random validation folds correspond to different samples drawn from a fixed training set). A schematic code sketch of this setting is given at the end of this response.

    3. However, we agree with the reviewer that this point should be clarified more carefully. Hence, we made the following changes to emphasize our setting and contribution:

      1. Title change: We are changing the title of our paper to "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions". It emphasizes that:
        1. We focus on analyzing the sample complexity of hyperparameter tuning specifically in the data-driven setting,

        2. We specifically focus on tuning model hyperparameters (not, for example, optimization hyperparameters), and

        3. We focus on the case where the dual utility function $f_x(\alpha, w)$ admits a piecewise polynomial structure. However, we note that this case is not uncommon when tuning model hyperparameters in the data-driven setting, as shown in many prior works [1,2,3,4,5,6,7,8,9].

      Please let us know what you think about this title.

      2. Main body changes: We made extra clarifications in the main body (l.95-101, l.147-157) and a detailed discussion in Appendix B to justify the positioning of our paper.
      3. Our problem is challenging and requires novel techniques: We note that our setting requires technical novelty compared to prior work in statistical data-driven algorithm hyperparameter tuning [1, 2, 3, 4, 13, 15]. As far as we are aware, in most prior work [1,2,3,4], the hyperparameter tuning process does not involve the parameter $w$, meaning that given any fixed hyperparameter $\alpha$, the behavior of the algorithm is determined. In some other cases that do involve the parameter $w$, one has a precise analytical characterization of how the optimal parameter behaves for any fixed hyperparameter [13], or at least a uniform approximate characterization [15]. However, our setting does not belong to those cases and requires a novel proof approach to handle the challenging case of hyperparameter tuning for neural networks (see Appendix B in our revised draft for a detailed discussion).
  2. Clarifying "A major difficulty is that the loss as a function of the hyperparameter is very volatile and it is given implicitly by an optimization problem over the model parameters. This is unlike previous work in data-driven design, where one can typically explicitly model the algorithmic behavior as a function of the hyperparameters.":

    1. The second sentence means that in prior work [1,2,3,4,5,6,7,8,9], the structure of the function $u^*_x(\alpha)$ is simple: it is a closed-form piecewise polynomial/rational function of the hyperparameter $\alpha$, and the main challenge is establishing this structure.
    2. In contrast, there are many cases where $u^*_x(\alpha)$ cannot be written as an explicit function of $\alpha$, but is implicitly defined, as in our case. A natural question now is: can we still perform data-driven hyperparameter tuning by establishing learning-theoretic guarantees in this case? That is the meaning of the first sentence and the motivation of our main results (Theorem 5.1, Lemma 4.2).

We incorporated this discussion into the revised draft (Appendix B).
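
To make the setting above concrete, here is a minimal illustrative sketch of the data-driven tuning pipeline (grid search over a finite candidate set, with generic task and training oracles; the names sample_task, train_to_optimum, and loss are placeholders and not from the paper, and the paper's analysis covers a continuous hyperparameter range rather than a grid):

```python
import numpy as np

def dual_utility(task, alpha, train_to_optimum, loss):
    """u_x(alpha) = min over w of f_x(alpha, w), approximated via a training oracle."""
    w_star = train_to_optimum(task, alpha)   # inner ERM over the model parameters w
    return loss(task, alpha, w_star)         # f_x(alpha, w_star)

def tune_hyperparameter(sample_task, train_to_optimum, loss, alpha_grid, num_tasks=50):
    """Hyperparameter ERM over tasks x_1, ..., x_N drawn from the task distribution D.

    The tasks may be genuinely distinct problem instances or random validation folds
    of a single dataset (the cross-validation special case mentioned above).
    """
    tasks = [sample_task() for _ in range(num_tasks)]
    avg_utility = [
        np.mean([dual_utility(x, alpha, train_to_optimum, loss) for x in tasks])
        for alpha in alpha_grid
    ]
    return alpha_grid[int(np.argmin(avg_utility))]  # empirical minimizer over the grid
```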

Comment

I thank the authors for their well-written responses. Much of my confusion has been cleared up.

Still, I believe experiments are necessary. The type of topic in this paper needs experiments. This comes from my own experience of doing machine learning theory. The gap between learning theory and practice is so large that it is our responsibility to show that the theories are relevant to the real world. And only rarely in learning theory can one develop deep and profound theoretical results with broad implications across disciplines, which would be excuse enough for a lack of experiments. This paper does not qualify as such.

Furthermore, it would not take much time to do these experiments. Their absence might suggest that the assumptions are not well suited to describing real-world circumstances.

Official Review
Rating: 6

The paper proposes a learning-theoretic approach to hyperparameter optimization. The authors show that, using results from learning theory, it is possible to estimate the error of hyperparameter optimization in the case where the function $f(x, \alpha, \omega)$ representing the neural network is piecewise constant or piecewise polynomial. The authors prove that this assumption holds for two instances of hyperparameter optimization for deep neural networks, providing corresponding learning guarantees.

Strengths

  • Applying a learning theoretic approach to hyperparameter optimization is original, challenging and novel.
  • Hyperparameter optimization is a very heuristic research domain and it is refreshing to see some efforts towards more principled understanding and characterisation of the problem.
  • The paper presents an impressive piece of theoretical work.
  • The paper is well structured.

Weaknesses

I would like to state that I was not able to check the proofs completely since I am not an expert in learning theory. I think that this paper would at least require an examination of these proofs by an expert in the field before acceptance.

Major

  • My main problem is about the positioning of the paper. The paper claims to provide guarantees for hyperparameter optimization for deep neural networks, but in reality, it provides guarantees:
    • In a setting that is unusual for hyperparameter optimization, i.e. the setting where $\alpha_{ERM}$ is obtained after sampling several whole datasets from $\mathcal{D}$ (see l.163), whereas usually hyperparameter optimization is performed for one single dataset. So this setting does not correspond to reality, diminishing the impact of the work. In addition, the paper focuses on the optimization of one single hyperparameter, which misses the stakes and challenges of hyperparameter optimization, which are more about the large number of hyperparameters and their interactions.
    • It is not applicable to "hyperparameter optimization of deep neural nets" but rather to (i) the interpolation parameter of the activation function in a (debatable, see questions) application of one hyperparameter optimization algorithm called DARTS, and (ii) the kernel parameter of graph neural nets. These are two very specific and not-so-common instances of hyperparameter optimization, and each of them required significant theoretical work to prove that it matches the piecewise constant/polynomial assumptions.
  • l.177 $u_\alpha$ is computed by solving an optimization problem. This problem is stochastic by nature, whereas $u_\alpha$ is presented as a function (see questions).
  • Some problems in the proofs:
    • (potentially serious) I did not find any proof of Lemma 3.3 in the Appendix, whereas it seems central, and the authors clearly state that Lemma 3.1 is not applicable in the case of Lemma 3.3 (so it needs a proof all the more).
    • (presentation) Warren's theorem is used in the proof of Th. 6.1 but without a reference; we don't know what it is. It is not a standard theorem.
    • (presentation) l.926, 1424, 1476 "standard learning theory results gives us...": which standard learning theory result?

Minor

  • The presentation could be improved:
    • l.56: "random search methods [...] only work for a discrete and finite grid": this is not true.
    • l.93-95: sentence not correct
    • l.249: state that the result holds thanks to Th. 2.1
    • l.258: piece function not defined; $c_i$ is introduced and then no longer used (why not define it as in 5?)
    • l.282: Theorem 4.1, you meant Lemma 4.1 right?
    • l.308: notation $R_{x,t}$ different from before
    • l.309: "behaves regularly" not defined
    • l.310: preimage introduced but not defined (defined below, but you should define it prior to using it)
    • Assumption 1 is really difficult to grasp. The authors say that it is "relatively mild", but this is not at all conveyed by the presentation.

Questions

  • What is the link between Assumption 1 and deep neural networks?
  • Can you clarify the link with DARTS? To my understanding, what they do is not what is stated in the paper. They use weights to encode the probability of using $o_1$ and $o_2$, to make the NAS differentiable. It is not an interpolation.
  • l.177 $u_\alpha$ is computed by solving an optimization problem. This problem is stochastic by nature, whereas $u_\alpha$ is presented as a function. How do you cope with this stochasticity in your analysis?
Comment

We thank the reviewer for the constructive feedback. We appreciate that the reviewer finds our paper novel, well-structured, and an impressive piece of theoretical work. The reviewer's main concerns are about the positioning of the paper and clarification of the proofs, which we address below.

On major clarification/paper positioning

A. Paper positioning:

The reviewer raises some good points about how we should position our work. Though to some degree we agree with the reviewer, we also want to clarify the following points:

  1. On our hyperparameter tuning setting:

    1. It is true that we are focusing on the data-driven hyperparameter tuning setting, as stated in the title. In this setting, one can think of tuning $\alpha_{ERM}$ using multiple problem instances $x_1, \dots, x_N$ drawn from an application-specific problem distribution $\mathcal{D}$. The problem instance $x_i$ could be a dataset as the reviewer stated, but it could also be something simpler, such as a random validation fold from a fixed training set (usual cross-validation).

      This setting is not uncommon in machine learning (see [1,2,3,4,5,6,7,8,9,12,17] for a non-exhaustive list of examples, including clustering, semi-supervised learning, decision trees, ...). It naturally captures cross-validation, but is more general and also applies to multitask hyperparameter tuning [12].

    2. To position the paper better as the reviewer suggested, we made the following changes (marked in red in our revised draft):

      1. Title change: We are changing the title of our paper to "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions". It emphasizes that:
        1. We focus on analyzing the sample complexity of hyperparameter tuning specifically in the data-driven setting,
        2. We specifically focus on tuning model hyperparameters (not, for example, optimization hyperparameters), and
        3. We focus on the case where the dual utility function $f_x(\alpha, w)$ admits a piecewise polynomial structure. However, we note that this case is not uncommon when tuning model hyperparameters in the data-driven setting, as shown in many prior works [1,2,3,4,5,6,7,8,9].
      2. Main body changes: We made extra clarifications in the main body (l.95-101, l.147-157) and a detailed discussion in Appendix B to justify the positioning of our paper.
      3. Our problem is challenging and requires novel techniques: We note that our setting requires technical novelty compared to prior work in statistical data-driven algorithm hyperparameter tuning [1, 2, 3, 4, 13, 15]. As far as we are aware, in most prior work [1,2,3,4], the hyperparameter tuning process does not involve the parameter $w$, meaning that given any fixed hyperparameter $\alpha$, the behavior of the algorithm is determined. In some other cases that do involve the parameter $w$, one has a precise analytical characterization of how the optimal parameter behaves for any fixed hyperparameter [13], or at least a uniform approximate characterization [15]. However, our setting does not belong to those cases and requires a novel proof approach to handle the challenging case of hyperparameter tuning for neural networks (see Appendix B in our revised draft for a detailed discussion).

  2. "Is the result only applicable to (1) interpolation hyperparameter of activation function, and (2) kernel parameter of graph neural networks?"

    1. We note that our main results (Theorem 5.1, Theorem 4.2) are applicable to model hyperparameter tuning problems for which the dual function $f_x(\alpha, w)$ admits a piecewise constant/polynomial structure. We have worked out implications in two interesting cases, but we expect it to apply in other settings as well.
    2. Prior work [11, 14] has shown that the network's loss on a fixed instance $x$, viewed as a function of just the parameters, has a piecewise polynomial structure. So we expect our techniques to be useful for any hyperparameter for which the dual function $f_x(\alpha, w)$ also possesses a piecewise polynomial structure (a toy example illustrating such a structure is given right after this list).
    3. However, we agree with the reviewer that our result cannot capture all scenarios of hyperparameter tuning in DNNs, but rather focuses on model hyperparameters in the data-driven setting. As suggested by the reviewer, we made changes in the title and main body to clarify this point (see above for details).
    4. As the reviewer stated, "... Hyperparameter optimization is a very heuristic research domain and it is refreshing to see some efforts towards more principled understanding and characterization of the problem ...". Though our work focuses on specific scenarios, we believe it will still benefit future research on the theoretical understanding of hyperparameter tuning.
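
As a purely illustrative toy example (our construction here, not taken from the paper) of such a piecewise polynomial dual structure, consider a single neuron with an interpolated activation $\sigma_\alpha(z) = \alpha\,\mathrm{ReLU}(z) + (1-\alpha)\,z$ and squared loss on an instance $x = (a, y)$:

```latex
% Toy example: piecewise polynomial structure of the dual function f_x(alpha, w)
% for a single neuron with activation sigma_alpha(z) = alpha*ReLU(z) + (1-alpha)*z.
f_x(\alpha, w) = \bigl(\sigma_\alpha(w^\top a) - y\bigr)^2
= \begin{cases}
    (w^\top a - y)^2, & w^\top a \ge 0,\\
    \bigl((1-\alpha)\, w^\top a - y\bigr)^2, & w^\top a < 0,
  \end{cases}
```

so $f_x(\alpha, w)$ is a polynomial of total degree at most 4 in $(\alpha, w)$ on each of the two regions, and the region boundary is the zero set of the polynomial $w^\top a$.
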
Comment

Other questions

  1. "What is the link between Assumption 1 and DNNs?": We will break it down as follows:

    1. The objects (problem instances, piece, and boundary functions) considered in Assumption 1: In our data-driven setting, there is a problem distribution $\mathcal{D}$ from which the problem instance $x$ comes. Assumption 1 puts a condition on the boundary and piece functions $f_{x,i}$, $h_{x,i}$ induced by the structure of the utility function $u^*_x(\alpha)$ (or $f_x(\alpha, w)$), for a realization of the problem instance $x$. In the case of DNNs (like the two examples we consider in Section 6, Applications), the piece and boundary functions dictate the piecewise polynomial structure of the loss landscape over the hyperparameter $\alpha$ and parameters $w$.
    2. "The link between Assumption 1 and DNNs": in the case of DNNs, Assumption 1 essentially assumes that the piece and boundary functions $f_{x,i}$, $h_{x,i}$ dictating the loss landscape over the hyperparameter $\alpha$ and parameters $w$ are in general position. See the discussion above about the intuition of Assumption 1 and why it is mild.
  2. "Clarifying the link with DARTS [12]?": as the reviewer commented, it is true that the activation function interpolation setting is not exactly what the DARTS paper does, but is rather a simplified version of DARTS. Instead of the probabilistic interpolation used in DARTS, as pointed out by the reviewer, we consider a linear interpolation to simplify the setting. We never claim to solve the DARTS setting, but only a simplified setting motivated by DARTS, as mentioned in l.429-431, l.846. We emphasized this point again in the revised draft (l.848-851).

Summary

Overall, we thank the reviewer for the constructive feedback and for raising good points on how we should clarify and position our contribution. As the reviewer suggested, we made changes in both the title and the main body of the paper to clarify our contribution and scope, and fixed the typos pointed out by the reviewer. We hope that our answers and changes address the reviewer's concerns, and we kindly request that the reviewer reevaluate our paper in light of our rebuttal. After all, as the reviewer stated, there is a lack of theoretical understanding in this hyperparameter tuning direction, and we believe that our work serves as a good starting point and will benefit future research on the theoretical understanding of hyperparameter tuning.

References

[1] Balcan et al., How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design, STOC’21

[2] Balcan et al., Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems, COLT’17

[3] Balcan et al., Learning to Link, ICLR’20

[4] Bartlett et al., Generalization Bounds for Data-Driven Numerical Linear Algebra, COLT’22

[5] Balcan et al., Structural Analysis of Branch-and-Cut and the Learnability of Gomory Mixed Integer Cuts. NeurIPS’22

[6] Balcan et al., Sample Complexity of Tree Search Configuration: Cutting Planes and Beyond. NeurIPS’21

[7] Balcan et al., Dispersion for Data-Driven Algorithm Design, Online Learning, and Private Optimization. FOCS’18

[8] Cheng and Basu, Learning Cut Generating Functions for Integer Programming, NeurIPS’24

[9] Cheng et al., Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut, NeurIPS’24

[11] Anthony and Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press

[12] Liu et al., Darts: Differentiable architecture search, ICLR’19

[13] Balcan et al., Provably Tuning the ElasticNet Across Instances, NeurIPS’22

[14] Bartlett, Maiorov, and Meir, Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks, NeurIPS’98

[15] Balcan et al., New Bounds for Hyperparameter Tuning of Regression Problems Across Instances, NeurIPS’23

[16] Balcan et al., Learning to Branch, ICML’18

[17] Balcan and Sharma, Data-driven Semi-supervised Learning, NeurIPS’22

[18] Suggala and Netrapalli, Online Non-convex Learning: Following the Perturbed Leader is Optimal, ALT’20

Comment

B. "$u_\alpha(x)$ is computed by solving an optimization problem, but it is also presented as a function; how can that be?"

We do not quite understand this question, but we will answer the potential meanings we can think of:

  1. If it is about the stochasticity of the problem instance $x$: it is true that the problem instance $x \sim \mathcal{D}$ is drawn from the problem distribution $\mathcal{D}$ over the set of problem instances $\mathcal{X}$. However, given any realization of the problem instance $x$, the function $u^*_x(\alpha)$ is defined deterministically as $u^*_x(\alpha) = \min_{w \in \mathcal{W}} f_x(\alpha, w)$. In other words, given a fixed problem instance $x$, there is no randomness in the definition of $u^*_x(\alpha)$.

  2. If it is about the stochasticity of the optimization algorithm for solving $u^*_x(\alpha)$:

    1. By defining $u^*_x(\alpha) = \min_{w \in \mathcal{W}} f_x(\alpha, w)$, we assume that we are using an ERM oracle here. We will make sure to emphasize this point again in the revised draft. Besides, we note that such an assumption is quite common in machine learning theory (see [18] for example).
    2. Moreover, as the reviewer stated, a theoretical understanding of hyperparameter tuning is challenging, and applying a learning-theoretic approach is even more original, challenging, and novel. We believe the reviewer will agree that taking initial steps in a challenging direction requires some foundational assumptions.
    3. We added the clarification about the ERM oracle and its necessity in the revised draft (l.948-l.951).

We hope that the above addresses what the reviewer had in mind, and we are happy to be corrected by the reviewer if that is not the case.

C. Other concerns of the proofs:

  1. "Missing proof of Lemma 3.3." Sorry for confusing the reviewer. The proof of Lemma 3.3 is straightforward and can be directly derived from the oscillation definition (Definition 1), which is why we did not incorporate it in the main paper. As the reviewer requested, we reincorporated it into the revised version (See Appendix C).
  2. "What is Warren’s theorem in the proof of Theorem 6.1.". We were referring to the Lemma E.8. (Warren). To make it more clear, we added an explicit reference to that lemma in the proof.
  3. "l.926, 1424, 1476, which standard learning theory results?." Sorry for confusing the reviewer, we are talking about the results summarized in Appendix C. Additional background on learning theory. We added the references to that appendix section as requested.

On other minor comments on the presentation

  1. "l93-95 typos": thank you for pointing it out. We were missing the part "... admits a specific piecewise structure." We have added that part in the revised version.
  2. "l.282 Theorem 4.1 you meant Lemma 4.1 right?": that is correct, sorry for the typo. We fixed it in the revised draft.
  3. "Why Assumption 1 is mild? What is the intuition behind it?":
    1. Intuition: Assumption 1 is about the regularity of the boundary functions, which are frequently mentioned as "general positions" in algebraic geometry literature. Roughly speaking, it says that the intersections of boundary functions behave regularly, for example, the intersection of two hyperplanes in 3-dimension space should be a line, or the intersection of two lines in 2-dimension should be a point, etc.
    2. Why it is mild?: as mentioned in l.322-349, due to Sard’s Theorem (Theorem E.10), the set of non-regular values (basically determining where the non-regularity of boundary functions occur) has Lebesgue measure zero. It generally means that Assumption 1 almost always holds.
  4. Other clarification (e.g. preimages, . . . ): We made changes in the main draft to clarify the points raised by the reviewer.
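
For reference, the standard statement behind the Sard's-theorem argument above (paraphrased here; the paper's exact formulation is its Theorem E.10) is:

```latex
% Sard's theorem (standard statement, paraphrased).
\textbf{Theorem (Sard).} Let $g : \mathbb{R}^n \to \mathbb{R}^m$ be smooth. Then the set of
critical values of $g$ (the images of points at which the Jacobian $Dg$ has rank $< m$) has
Lebesgue measure zero in $\mathbb{R}^m$. Consequently, for almost every $c \in \mathbb{R}^m$,
the level set $g^{-1}(c)$ is either empty or a smooth embedded manifold of dimension $n - m$.
```

Applied to the boundary functions, this is why the "general position" requirement in Assumption 1 can fail only on a measure-zero set of values.
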
Comment

We are reaching out to follow up on our response and to check whether you have any further questions. In our rebuttal, we addressed your concerns regarding the data-driven setting and its benefits, and clarified the scope of our work. We also discussed and fixed your comments on the proofs and presentation, as well as other minor issues. We hope this resolves the weaknesses outlined in your review and would appreciate a prompt confirmation, or any additional questions or requests for clarification. Thank you again for your thoughtful feedback.

Comment

Dear authors, I appreciate that you took my concerns about the presentation and positioning into account. Even if the setting seems a bit far from practical hyperparameter optimization, I think such theoretical work should be emphasized, especially nowadays, when most of the focus is captured by intensive empirical work.

Hence, I increase my rating to 6. I cannot do more because I am not proficient enough in learning theory and was not able to check the proofs appropriately, though I am reassured by the review of reviewer GxVs.

Comment

We thank the reviewer for their understanding and for positively re-evaluating our work. We are glad our response resolved the reviewer's concerns. Happy Thanksgiving!

Official Review
Rating: 8

This paper proves statistical learning generalization bounds for the task of optimizing architectural hyperparameters (i.e., those static during training) when allowed multiple problem instances. In particular, using machinery derived from piecewise decompositions of the associated utility function, the authors are able to (under some assumptions on geometric regularity of the decomposition) derive PAC-style sample complexity results for two architectural optimization applications of interest: (1) learning an optimal interpolation between piecewise-polynomial activations and (2) optimizing a certain parameterized polynomial adjacency kernel in graph convolutional networks. The proofs rely on recent statistical learning results for problems with piecewise-structured utilities, and adapt/apply these techniques to more specific and involved settings.

Strengths

I think the overall approach to getting generalization bounds for this type of architecture search (which is fairly understudied) is solid. Finding piecewise/local structure (w.r.t. $\alpha$) in the costs of optimized networks, deriving learning complexity results for these piecewise structures, and bootstrapping up to a generalization bound for the overall problem feels like a very natural high-level process. Since the main statistical learning theory workhorse (the [Balcan '21] paper) expresses complexity in terms of oscillation, the authors proceed to use certain problem structures (in particular, piecewise constant, piecewise poly) to control these oscillations in a reasonable way -- the latter setting requires some clever geometric reasoning to pull off. To me, the main strength of this paper is the overall perspective taken on architecture search and the method of translating geometric structure of the meta-loss landscape to learning complexity results. Furthermore, the approximation method to relax Assumption 2 to Assumption 1 is very smooth.

Weaknesses

I will list several considerations that are mainly in terms of presentation/contextualization of the results. I am sorry in advance for how long this section is, but I wanted to be helpful and thorough :)

  1. This paper sets out to prove generalization bounds for the empirical risk minimizer $\hat{\alpha}$, and it does so. This is a worthwhile theoretical goal on its own (and I really like how you did it!), but I think you should be careful to separate this goal from the results that would be applicable and impactful directly to practitioners. For one, (as is often the case in statistical learning), the analysis is not structured to yield an efficient, implementable algorithm for your hyperparameter tuning nor any clues for how to design one (i.e. unless we can figure out how to find the piecewise structure dynamically and make use of it, one is still stuck doing grid search, hypergradient descent, or bandit methods). In fact, I would argue that the way you've set up your analysis is counterproductive in terms of designing such an algorithm (again, I recognize that algorithm design is not the goal of your paper, but I think the presentation would benefit from being more clear about this). For one concrete example of this point, note that your initial definition of the utility function $u_\alpha$ on p. 3 implicitly assumes finding a global optimizer of the loss w.r.t. the network weights, which will be wildly non convex and NP-hard for DL applications. Because of this, it cannot be claimed that $u_{\alpha_1} > u_{\alpha_2}$ implies $\alpha_1$ is a better architectural choice than $\alpha_2$ since there is no telling that an efficient optimization method (like GD) will find local minima that prefer $\alpha_1$ to $\alpha_2$. In a sense, this subtlety is the whole point of papers like [Li '21 (Geometry-Aware...)] -- a useful architecture optimization algorithm needs to improve the architecture in a way that GD can find. To reiterate, I understand that your paper does not set out to design such an algorithm, but I think to call the architecture optimization problem "learnable" at all requires some sort of recognition of this subtlety in a way that is clear to the reader. My philosophy is that the theoretical setting should capture the interesting phenomena you wish to explain and ignore those you don't: if you want to work in a setting that is farther from practical considerations, you should make that clear and perhaps reconsider the phrase "for deep neural networks" in the title?

  2. I think this paper would benefit from being more careful about presentation. To start, the phrase "hyperparameter tuning" is often (and perhaps more ubiquitously) used to refer to tuning hyperparameters of an optimization algorithm (such as learning rate), whereas your setting is focused on a static choice of $\alpha$ during the NN training and the applications are therefore architectural. This is a bit of my own bias as an optimization theorist, but I think that provable hyperparameter tuning in this dynamic sense is quite different (and, perhaps wrongly, more studied, see "meta-optimization", "learning to learn", and many adaptive LR papers) than provable NAS, but the abstract and much of your exposition is written in a way that is vague about this difference -- it confused me in the beginning and might confuse others similarly. Perhaps more importantly, I think the claim that you provide the "first precise sample complexity bounds for applications" and "first analysis for the learnability of parameterized algorithms involving both parameters and hyperparameters" should be treated with more care. For one example, I would call Prop 4 of https://jmlr.org/papers/volume18/16-558/16-558.pdf a quantitative sample complexity bound and a (constructive) analysis of learnability -- it's not clear at first glance if your and their results are comparable or one stronger than the other, but at the very least this problem has been looked at before. I am fairly sure that you are the first to present a sample complexity bound for your two applications (and this is nontrivial through your analysis, since it requires unveiling particular piecewise structure of the applications), but I see no reason why one couldn't get some result in these applications as a corollary of Theorem 1 of the Hyperband paper. Again, we would have to see if your geometric analysis gives any advantage over such a bound -- I am not saying that your bound is worse or equivalent, I am saying that it's a slightly cavalier thing to say that your bound is the "first" in such a setting where prior results seem directly applicable.

  3. In terms of the proof methodology, I would like to know more about how to control the # of piecewise components in the piecewise decompositions you use. Your Lemma D.1/E.8 are in a sense the exponential worst-case bounds, but perhaps there is more advantage to be had from a closer look into when the number of pieces is smaller? As an example, here is a cool result https://arxiv.org/pdf/1901.09021 showing that, for piecewise-linear activation functions, while the worst-case number of linear regions is exponential in the # of neurons, the average case is actually linear (proven on initialization, experimentally verified during training). Such investigations shed significant light on the relationship between complexity and expressivity as measured by # of pieces, and would really help contextualize the strength/pessimism of your bounds (and may even directly strengthen them in certain settings!). I am not sure if there are more useful results in the literature, but I think it's worth revisiting.

Questions

To recap the suggestions that I kind of roped into the "Weaknesses" section, I would advise that you:

  1. be more specific about the differences between your theoretical setup and the practical setting (i.e. local optima w.r.t. NN weights instead of global) and maybe highlight which parts of your analysis help toward designing practical hyperparameter-tuning algorithms
  2. be more thorough and precise about the particular type of hyperparameter-tuning you study (i.e. exemplify differences with the dynamic perspective on hyperparameter-tuning) and the relationships with prior results (such as Hyperband paper and others you cite)
  3. look a bit deeper into how '# of pieces' has been studied by the deep learning theorists as a measure of complexity in order to gain insight, contextualize your results better, or perhaps even strengthen them

To finish, I want to ask a question that was lingering in my head while thinking about your paper. I feel that the [Balcan '21] paper is the main statistical learning workhorse of these results, and it all boils down to the use of 'oscillation of the utility function' as the measure of learning complexity. My geometric intuition tells me that the suitable condition is specified in terms of curvature of the meta-loss landscape, perhaps through smoothness assumptions such as Lipschitz gradients (ie bounded Hessians) -- at a high level, bounded second derivatives implies reduced oscillation. One could appeal to something like the differential part of Danskin's theorem (which is similar to your Lemma E.2) to convert smoothness of the loss to smoothness of the utility and proceed that way... do you think that a plan of attack along these lines could be more direct and result in proofs that are sharper, more transparent, or more general? I feel that the polynomial formulation may be an indirect way of doing a morally similar thing -- while the polynomial structure allows you to use Bezout and co., perhaps it puts too much emphasis on things like degree and zero sets and obscures the more fundamental object of oscillation. Do you think there are sufficient analytic/geometric tools to carry out such an approach? And do you think there is something particularly special about the (piecewise-)polynomial structure w.r.t. generalization bounds, or do you view it more as a mechanical tool that we know how to prove things about?

Thank you for reading this review! Have a lovely day.

Ethics Concerns

N/A

Comment

References

[1] Balcan et al., How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design, STOC’21

[2] Balcan et al., Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems, COLT’17

[3] Balcan et al., Learning to Link, ICLR’20

[4] Bartlett et al., Generalization Bounds for Data-Driven Numerical Linear Algebra, COLT’22

[5] Balcan et al., Structural Analysis of Branch-and-Cut and the Learnability of Gomory Mixed Integer Cuts. NeurIPS’22

[6] Balcan et al., Sample Complexity of Tree Search Configuration: Cutting Planes and Beyond. NeurIPS’21

[7] Balcan et al., Dispersion for Data-Driven Algorithm Design, Online Learning, and Private Optimization. FOCS’18

[8] Cheng and Basu, Learning Cut Generating Functions for Integer Programming, NeurIPS’24

[9] Cheng et al., Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut, NeurIPS’24

[13] Balcan et al., Provably Tuning the ElasticNet Across Instances, NeurIPS’22

[15] Balcan et al., New Bounds for Hyperparameter Tuning of Regression Problems Across Instances, NeurIPS’23

Comment

On other comments/questions

  1. Is oscillation the main statistical learning workhorse?:

    A: Prior work provides sample complexity bounds if the duals have bounded oscillation. We build on that result; however, the main technical challenge is to show that bounded oscillation holds in our setting (see Appendix B in the revised draft, or the discussion above, for details). We could actually present our supporting lemmas (Lemmas 3.1 and 3.3, bounding the pseudo-dimension via the discontinuities and local maxima of the dual function) without mentioning oscillations at all, but: (1) it would make the presentation less elegant, and (2) we want to give credit to prior work and establish a clear relation with it.

    What is the true workhorse?: the main challenge in our work is to control the number of discontinuities and local maxima of the dual function $u^*_x(\alpha) = \max_w f_x(\alpha, w)$. The novel technical tools are:

    1. Differential geometry:
      1. allows us to determine the potential shape (smooth 1-manifolds) of the set of $(\alpha, w)$ that solve $\max_w f_x(\alpha, w)$ as $\alpha$ changes,

      2. allows us to rewrite $u^*_x(\alpha)$ as the pointwise maximum of $f_x(\alpha, w)$ along the monotonic curves (Definition 12), and

      3. allows us to control the discontinuities/local maxima of $u^*_x(\alpha)$ using the properties of $f_x(\alpha, w)$ along the monotonic curves. We note that most of our supporting results (Definition 12, Lemma E.16, Proposition E.12) are not readily available in the machine learning literature, which also reflects the novelty and challenge of our work.

    2. Algebraic geometry: combined with point (3) above, it allows us to give upper bounds on the number of discontinuities and local extrema, leveraging the piecewise polynomial structure of $f_x(\alpha, w)$.
    3. Tools from constrained optimization (the Lagrangian method). We added this discussion to the main body (l.147-157) and to the appendix (Appendix B). Moreover, the proof methodology is not trivial and is a separate contribution of our work. We believe this proof methodology will be helpful for future theoretical work on data-driven algorithm design in general, not restricted to tuning model hyperparameters in DNNs.
  2. "Could the local curvature of meta-loss landscape control the oscillations?":

A: It is an interesting point, but we doubt that it is true. For example, consider the function vϵ(α)=ϵsin(α)v_\epsilon(\alpha) = \epsilon sin(\alpha), which has the bounded derivative everywhere and the bound can be made arbitrarily small (by decreasing ϵ). However, for any ϵ\epsilon, the function has infinite oscillations. Moreover, in our case, the function ux(α)u^*_{x}(\alpha) does not typically have nice properties on the curvature, for example, the local smoothness does not typically hold at the piece boundaries e.g. when using a ReLU activation function.

In our analysis, we show that the number of discontinuities (which includes non-smooth points) and local maxima in the meta-loss landscape are related to oscillations. We do not rule out that the information on the curvature can somehow be used to get generalization, and we think that it is an interesting question for potentially interesting restricted settings (e.g. smooth activations).

  1. "Is the polynomial formulation an indirect way of doing a morally similar thing? Is there something particularly special about the piecewise polynomial structure w.r.t generalization bound? Do you view it more as a mechanical tool that we know how to prove things around"

    A: We expect that our techniques can be extended beyond the polynomial formulation, but that could require further technical work e.g. appropriate extensions of Bezout’s and Warren’s theorems to more general functions.
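
For completeness, here is a one-line worked version of the counterexample from point 2 above (our illustration, counting an "oscillation" at each local extremum):

```latex
v_\epsilon(\alpha) = \epsilon \sin(\alpha), \qquad
|v_\epsilon'(\alpha)| = \epsilon\,|\cos(\alpha)| \le \epsilon, \qquad
|v_\epsilon''(\alpha)| = \epsilon\,|\sin(\alpha)| \le \epsilon,
```

yet $v_\epsilon$ has a local extremum at every $\alpha = \pi/2 + k\pi$, $k \in \mathbb{Z}$, so on an interval of length $T$ it has on the order of $T/\pi$ local extrema no matter how small $\epsilon$ is; bounded curvature alone therefore cannot control the number of oscillations.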

Summary

Overall, we thank the reviewer for the constructive feedback to improve the presentation of the paper, and we think that it is a very enjoyable and productive discussion! We have made several modifications (title and main body) to address the reviewer’s concerns, and we are happy to answer/discuss more points raised by the reviewer if needed. We respectfully request that the reviewer reevaluate our paper in light of our comments above. Have a lovely day!

Comment
  1. "A more sophisticated way to control the number of piecewise components (for example, Complexity of Linear Regions in Deep Networks) could help."

    A. We thank the reviewer for pointing out this interesting paper. However, we want to clarify the following crucial points:

    1. The piecewise structure in the paper Complexity of Linear Regions in Deep Networks is fundamentally different from the piecewise structure we are considering. Concretely, the piecewise linear structure in that paper is over the input space, obtained when one fixes the parameters of the neural network. In contrast, we consider the piecewise polynomial structure of the function $f_x(\alpha, w)$ over the space of the hyperparameter $\alpha$ and parameters $w$, obtained when one fixes the input problem instance $x$. Therefore, the technique introduced in that paper does not apply to our scenario.
    2. Furthermore, to the best of our knowledge, there is no other technique that gives a more refined bound for the number of connected components (and boundaries) that is applicable in our scenario (the piecewise polynomial structure in the parameter and hyperparameter space).

    Nevertheless, we agree with the reviewer that a more sophisticated result controlling the number of pieces and boundaries could help improve Theorems 6.1 and 6.2 in Section 6, Applications. That is also why we state our main results (Theorems 4.2 and 5.1) in a form that gives generalization guarantees in terms of the number of regions and boundaries. We are not ruling out the existence of other, more refined connected-component bounds that could be helpful, and we are open to hearing the reviewer’s suggestions!

    Furthermore, we believe that developing average-case analyses for parameter-space partitioning (analogous to the paper Complexity of Linear Regions in Deep Networks, but in parameter space instead of input space) could be an interesting direction in the future. However, we think that providing generalization guarantees using such results would require fundamentally different techniques than those presented in our paper.

Comment
  1. "It might be cavalier to say that this bound is the first in such a setting where prior works might be applicable. I see no reason why one couldn’t get some results in these applications as a corollary of Theorem 1 of the Hyperband paper. . . . What about Proposition 4 in Hyperband?"

    A: The setting of Hyperband is significantly different from ours, especially in the following points:

    1. Most results (including Thm 1 and Prop 4) in Hyperband assume finitely many distinct hyperparameter values (arms) and guarantees with respect to the best arm in that set. Even their infinite arm setting considers a distribution over the hyperparameter space from which n arms are sampled. It is assumed that n is large enough to sample a good arm with high probability without actually showing that this holds for any concrete hyperparameter loss landscape. It is not clear why this assumption will hold in our case. In sharp contrast, we seek optimality over the entire continuous hyperparameter range for concrete loss functions that satisfy a piecewise polynomial dual structure.

    2. The notion of “sample complexity” in Hyperband is very different from ours. Intuitively, their goal is to find the best hyperparameter from learning curves over fewest training epochs, assuming the test loss converges to a fixed value for each hyperparameter after some epochs. By ruling out (successively halving) hyperparameters that are unlikely to be optimal early, they speed up the search process (by avoiding full training epochs for suboptimal hyperparameters). In contrast, we focus on model hyperparameters and assume the network can be trained to optimality for any value of the hyperparameter. We ignore the computational efficiency aspect and focus on the data (sample) efficiency aspect which is not captured in Hyperband analysis.

    3. Learning setting: Hyperband assumes the problem instance is fixed, and aims to accelerate the random search of hyperparameter configuration for that problem instance with constrained budgets (formulated as a pure-exploration non-stochastic infinite-armed bandit). In contrast, our results assume a problem distribution D\mathcal{D} (data-driven setting) and bounds the sample complexity of learning a good hyperparameter for the problem distribution D\mathcal{D}.

    Conclusion. The Hyperband paper and our work do not compete but complement each other, as the two papers view the hyperparameter tuning problem from different perspectives, and our results cannot be directly compared to theirs. We only claim that our analysis is unique in the data-driven setting. As the reviewer suggested, we (1) changed the title of our work to "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions", and (2) added clarification about this point in the revised draft and removed the "first analysis" claim so as not to confuse future readers. The changes are also reflected in Discussion 2 above and in our revised draft.

Comment
  1. "Need to be more careful about the presentation, as people often think of hyperparameter tuning in DNNs as tuning optimization-algorithm hyperparameters rather than model hyperparameters like NAS. Maybe consider changing the title.":

    A: It is true that our framework does not capture the case where the tuned hyperparameter belongs to the optimization algorithm, and we are assuming an ERM oracle in our analysis. We agree with the reviewer that changing the title and adding clarification will make the paper clearer on this point. Here are our modifications:

    1. Title change: We are changing the title of our paper to "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions". It emphasizes that:
      1. We focus on analyzing the sample complexity of hyperparameter tuning specifically in the data-driven setting,

      2. We specifically focus on tuning model hyperparameters (not, for example, optimization hyperparameters), and

      3. We focus on the case where the dual utility function $f_x(\alpha, w)$ admits a piecewise polynomial structure. However, we note that this case is not uncommon when tuning model hyperparameters in the data-driven setting, as shown in many prior works [1,2,3,4,5,6,7,8,9].

    Please let us know what you think about this title.

    2. Main body changes: We made extra clarifications in the main body (l.95-101, l.147-157) and a detailed discussion in Appendix B to justify the positioning of our paper.
    3. Our problem is challenging and requires novel techniques: We note that our setting requires technical novelty compared to prior work in statistical data-driven algorithm hyperparameter tuning [1, 2, 3, 4, 13, 15]. As far as we are aware, in most prior work [1,2,3,4], the hyperparameter tuning process does not involve the parameter $w$, meaning that given any fixed hyperparameter $\alpha$, the behavior of the algorithm is determined. In some other cases that do involve the parameter $w$, one has a precise analytical characterization of how the optimal parameter behaves for any fixed hyperparameter [13], or at least a uniform approximate characterization [15]. However, our setting does not belong to those cases and requires a novel proof approach to handle the challenging case of hyperparameter tuning for neural networks (see Appendix B in our revised draft for a detailed discussion).
Comment

We thank the reviewer for constructive feedback and suggestions. Indeed, we find the reviewer's comments enjoyable and on point. We are glad that the reviewer finds our paper solid and likes our approaches. We will address the reviewer's concerns as follows.

On reviewer’s comments/suggestions/questions on the paper

  1. "The paper provides generalization bounds for ERM α^ERM\hat{\alpha}_{ERM} in the data-driven setting. This is a worthwhile theoretical goal on its own (and I really like how you did it!), but I think you should be careful to separate this goal from the results that would be applicable and impactful directly to practitioners":

    A: We thank the reviewer for the question. We want to clarify there are actually two ERMs in our formulation, one for parameter (network weights) and another for hyperparameter tuning. The latter can be replaced by a different optimization algorithm due to uniform convergence properties (see points below) and the former is important for a clean theoretical abstraction for hyperparameter tuning that is independent of the training algorithm used.

    1. If the reviewer is asking why we only prove a generalization bound for $\hat{\alpha}_{ERM}$: We note that by providing a pseudo-dimension upper bound for the utility function class $\mathcal{U}$, we obtain a uniform convergence guarantee that applies to any $\alpha$, i.e., a high-probability bound on

    \left| \frac{1}{N}\sum_{i = 1}^{N} u_{\alpha}(x_i) - E_{x \sim \mathcal{D}}\, u_{\alpha}(x) \right|

    This means that the generalization guarantee holds for any output $\hat{\alpha} = \mathcal{A}(x_1, \dots, x_N)$, where $\mathcal{A}$ is an algorithm that takes the problem instances $x_1, \dots, x_N$ as input and outputs a hyperparameter $\hat{\alpha}$. However, to compete with the best hyperparameter $\alpha^* = \arg\min_{\alpha} E_{x \sim \mathcal{D}}\, u_{\alpha}(x)$, one should consider the empirical risk minimizer $\hat{\alpha}_{ERM}$ (the standard argument is sketched at the end of this response).

    2. If the reviewer is asking about the ERM oracle that we use when defining $u^*_{x}(\alpha) = \min_{w \in \mathcal{W}} f_x(\alpha, w)$: It is true that we assume an ERM oracle here, and this does not imply an efficient algorithm for computing $u^*_{x}(\alpha)$. However, as the reviewer stated, it is a worthwhile theoretical goal on its own: we first want to know whether our setting is learnable in a learning-theoretic sense before asking whether it is efficiently learnable. Once learnability is established, many common techniques can be employed to improve the computational tractability of the learning problem, for example using a surrogate loss with better properties or an approximate function class, which might be of interest to practitioners.

    However, we agree that this point should be clearer to the reader, and, as suggested by the reviewer, we made changes to the main body and the title to clarify it. The changes are listed in detail below.
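    For concreteness, here is the standard argument behind the claim in point 1 above (stated with $u_{\alpha}$ treated as a bounded loss to be minimized, matching the ERM convention above; the rate is the usual pseudo-dimension-based uniform convergence bound, up to constants, and is not a claim specific to our paper). With probability at least $1 - \delta$ over $x_1, \dots, x_N \sim \mathcal{D}^N$,

    \sup_{\alpha} \left| \frac{1}{N}\sum_{i = 1}^{N} u_{\alpha}(x_i) - E_{x \sim \mathcal{D}}\, u_{\alpha}(x) \right| \le \epsilon, \qquad \epsilon = O\!\left( H \sqrt{ \frac{ \mathrm{Pdim}(\mathcal{U}) + \log(1/\delta) }{ N } } \right),

    where $H$ bounds the range of $u_{\alpha}$. Consequently, for $\hat{\alpha}_{ERM} = \arg\min_{\alpha} \frac{1}{N}\sum_{i} u_{\alpha}(x_i)$,

    E_{x \sim \mathcal{D}}\, u_{\hat{\alpha}_{ERM}}(x) \le \frac{1}{N}\sum_{i = 1}^{N} u_{\hat{\alpha}_{ERM}}(x_i) + \epsilon \le \frac{1}{N}\sum_{i = 1}^{N} u_{\alpha^*}(x_i) + \epsilon \le E_{x \sim \mathcal{D}}\, u_{\alpha^*}(x) + 2\epsilon .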

Comment

I would like to thank the authors for a very thorough response and adjustment to the paper! The change in title and addition of some clarifying information regarding the positioning of your results in relation to existing perspectives and approaches addresses my main concern about presentation. I think this work is valuable to the community for being a particular instance of a more general question (how the structure of a base learning problem can inform us about complexity of meta-learning) that makes strong use of the assumptions (polynomial structure and remaining static for each learning instance, i.e. model hyperparam not optimization param) to answer the question well. Now that it's phrased as such, I am happy to increase my score to a 6! :)

However, I think my question about the effects of assuming a global minimizer to each learning task (the inner ERM oracle that learns model parameters) still stands. To me, it is not just a question about efficiency, tractability, or even finding/analyzing an algorithm that implements it -- rather, I feel that the nature of how the meta-loss changes w.r.t. hyperparameter $\alpha$ can be qualitatively different when evaluating the meta-loss at e.g. stationary points vs a global minimizer. Imagine an optimization problem with very reasonable local minima and one super sharp, difficult to find global minimum (such as a spurious one that interpolates the training data but generalizes poorly). If the model is parameterized in a way that changing the hyperparameter $\alpha$ (e.g. an interpolation parameter for the activation function) affects this spurious global minimum a lot but leaves the more reasonable local minima alone, you would see a very different behavior of the meta-loss landscape.

I know that this is vague, perhaps a precise formulation would be something like: Consider a setting where each learning problem is to fit a degree-$n$ polynomial $\sum_j a_j x^j$ to some noisy data (the coefficients $a_j$ are some nonconvex, but simple function $g$ of the model parameters $\theta$, and there is some model hyperparameter $\alpha$ that tweaks these so that $a_j = g(\theta_j, \alpha)$). For $n$ data points, there may be polynomials $(a_j)_j$ that are quite nice and appropriately represent the data in a reasonable way, but the global minimum may be some $(\tilde{a}_j)_j$ that is very spurious and tries to interpolate the data (depending on choice of $g$, it may not be able to do so) -- these are expressed via a quite reasonable $\theta$ and a very bizarre $\tilde{\theta}$, respectively. If we are looking at a reasonable local minimum, changing the model hyperparameter $\alpha$ to $\alpha'$ may change the loss $L(\theta, \alpha)$ very smoothly and nicely, but change $L(\tilde{\theta}, \alpha)$ in a completely different way. Once you deploy your machinery on controlling discontinuities, extrema, and oscillations w.r.t. $\alpha$, you may find that the meta-learning problem has a very different complexity when the meta-loss is $\alpha \mapsto L(\tilde{\theta}, \alpha)$ as opposed to $\alpha \mapsto L(\theta, \alpha)$. Still vague, but I hope this gets my point across.

To me, the potential for this instability is one of the reasons that hyperparameter tuning can be so finicky -- an approximate solution to the inner learning problem can behave very differently to changes in the hyperparameter than an exact one. I am not asking you to introduce any surrogate losses or track implicit biases of optimization procedures or anything like that, but I would appreciate any clarity you are able to provide for how to think about this. Do you think your machinery would have some robustness to this, and more broadly are there any techniques that could be used in follow-up works to capture it? It would be nice to include a sentence or footnote or something pointing out the subtlety surrounding this assumption, since otherwise I feel the reader has to work hard to notice it.

Comment

We thank the reviewer for their constructive feedback. We understand the point raised by the reviewer: the implicit bias (flatness-seeking) property of optimization algorithms for the inner optimization problem might have strong effects on the tuned hyperparameter of the outer optimization problem. We acknowledge this point, but, as the reviewer pointed out, providing a theoretical analysis of hyperparameter tuning is a very challenging problem, and it is reasonable to start with a clean formulation.

Moreover, we suggest that our framework might still be useful for analyzing the generalization of hyperparameter tuning when local flatness is also taken into account. Consider the following (over)simplified scenario: instead of optimizing $f_x(\alpha, w)$ in the inner optimization, we optimize a surrogate $f'_x(\alpha, w)$. The surrogate has the same discontinuity structure as the original function, but in each region $R_{x, i}$ where $f_x(\alpha, w)$ admits the polynomial form $f_{x, i}(\alpha, w)$, the value of the surrogate is $f_{x, i}(\alpha, w)$ minus a curvature regularization term (a Hessian norm); see Appendix H for the detailed construction and the sketch below for a shorthand version. By optimizing this surrogate instead, we can capture the locally flat behavior suggested by the reviewer within our analytical framework. Because our main result is general, we can instantiate a generalization guarantee for this case: the regularization term is also a polynomial in $(\alpha, w)$, so $f'_x(\alpha, w)$ again admits a piecewise polynomial structure. We are happy to incorporate this discussion with the reviewer into the final version, as well as a discussion of how to better model the phenomenon pointed out by the reviewer (since the above is only a simplified scenario).
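As shorthand for the construction above (our notation; the precise form is the one in Appendix H, and the weight $\lambda \ge 0$ below is introduced only for illustration), on each region $R_{x, i}$ one can take

f'_{x, i}(\alpha, w) = f_{x, i}(\alpha, w) - \lambda \left\| \nabla^2_{w} f_{x, i}(\alpha, w) \right\|_F^2 ,

where the curvature term is written with a squared Frobenius norm so that it stays polynomial (the sign convention depends on whether $f_x$ is treated as a utility to be maximized or a loss to be minimized). Since $f_{x, i}$ is a polynomial in $(\alpha, w)$, every entry of its Hessian in $w$ is again a polynomial, and hence so is the squared norm; therefore $f'_x$ inherits the piecewise polynomial structure of $f_x$ with the same region boundaries, and the piecewise polynomial machinery applies, with the degree of each piece increasing accordingly.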

Comment

Thank you for the thorough addressing of my concerns. I now have no significant criticisms, and I am happy to increase my score to 8. I think this paper’s methods are a useful and flexible contribution to the field. Happy Thanksgiving!

Comment

We thank the reviewer for the constructive feedback and the positive evaluation of our paper. We believe that the discussion with the reviewer has greatly improved the presentation of our work. Thank you, and happy Thanksgiving!

Comment

We thank all the reviewers for their constructive feedback. We are glad that the reviewers consider our work novel, challenging, and impressive (reviewer 4928) and solid (reviewers GxVs, w2AP), and that they appreciate our theoretical contributions (reviewers GxVs, w2AP).

A main concern of the reviewers is the positioning/scope of our paper (reviewers 4928, GxVs, w2AP). As suggested by reviewers 4928 and GxVs, we made the following major changes to our paper:

  1. Title change: "Sample complexity of data-driven tuning model hyperparameters in neural networks with piecewise polynomial dual functions", to emphasize that (1) we focus on analyzing generalization guarantees, (2) we consider tuning model hyperparameters (our analysis does not apply to optimization hyperparameters such as the learning rate), and (3) we focus on the special case in which the dual functions $f_x(\alpha, w)$ admit a piecewise polynomial structure, which is inspired by prior work.

  2. Main body changes: we added discussions in the main body (l.96-101, l.190-193) and a detailed discussion in Appendix B to further explain the technical challenges we have to overcome.

  3. Challenge and novelty of our contribution: we also clarify the novelty and the challenges behind our main contributions (Lemma 4.2, Theorem 5.2) in l.147-156 (with a detailed discussion in Appendix B).

We hope that these changes make the positioning of our paper clearer and highlight our contributions. We kindly request the reviewers to reevaluate our paper in light of our rebuttal.

AC Meta-Review

The paper presents a framework for data-driven hyperparameter tuning and establishes sample complexity results for two specific settings of the framework. The overall exposition is clear, and the assumptions and technical results are clearly presented. The reviewers mostly liked the advance and acknowledged the motivation behind the work. Several concerns were raised by the reviewers, including alignment with real hyperparameter settings, the focus on a single hyperparameter (which is arguably unrealistic), the lack of a computationally efficient algorithm for the proposed framework, and related optimization challenges. The authors addressed some of these concerns and clarified and improved aspects of their exposition, which the reviewers acknowledged. Some aspects remained unresolved, and addressing them would strengthen the paper, e.g., whether the techniques and results extend to k hyperparameters and whether the dependence on k is exponential; whether the number of regions can be characterized for realistic neural networks, possibly via both worst-case and average-case analysis; etc.

There was an additional thread of discussion on empirical evaluation, where the authors and a reviewer had differing perspectives. While empirical evaluations should be included if they help the storyline of a paper, the AC does not think they should be a necessity for every ML paper or a necessary criterion for acceptance.

Additional Comments on Reviewer Discussion

The reviewers engaged with the authors during the discussion phase. The discussions led to increases in scores as well as vastly differing perspectives on certain aspects.

Final Decision

Reject