PaperHub
Overall rating: 5.8 / 10 · Decision: Rejected · 4 reviewers
Individual ratings: 5, 5, 8, 5 (min 5, max 8, std. dev. 1.3)
Average confidence: 3.3 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.8
ICLR 2025

Revisiting Convergence: A Study on Shuffling-Type Gradient Methods

Submitted: 2024-09-27 · Updated: 2025-02-05

Keywords

shuffling-type gradient methods, convergence analysis, relaxed smoothness assumptions

Reviews and Discussion

Review
Rating: 5

This paper examines the convergence rate of shuffling-type gradient methods without assuming Lipschitz smoothness, achieving results that match the current best-known convergence rates. The theoretical analysis covers non-convex, strongly convex, and non-strongly convex cases under both random reshuffling and arbitrary shuffling schemes.
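
For readers unfamiliar with the setting, below is a minimal sketch of a generic shuffling-type gradient method for a finite sum of n component functions. This is not the paper's own Algorithm 1; the function name, signature, and scheme labels (random reshuffling, shuffle-once, incremental/arbitrary fixed order) are illustrative assumptions matching the schemes discussed in the reviews.

```python
import numpy as np

def shuffling_gradient_method(grad_fi, w0, n, epochs, eta,
                              scheme="random_reshuffling", rng=None):
    # grad_fi(w, i): gradient of the i-th component function at w.
    # Each epoch visits every index exactly once, in an order set by `scheme`.
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w0, dtype=float)
    fixed_perm = rng.permutation(n)            # reused by the shuffle-once scheme
    for _ in range(epochs):
        if scheme == "random_reshuffling":     # fresh permutation every epoch
            perm = rng.permutation(n)
        elif scheme == "shuffle_once":         # one permutation, fixed across epochs
            perm = fixed_perm
        else:                                  # incremental / arbitrary fixed order
            perm = np.arange(n)
        for i in perm:
            w = w - eta * grad_fi(w, i)        # one incremental gradient step
    return w
```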

Strengths

  1. The theoretical analysis is rigorous and considers multiple cases for different function properties.
  2. The authors discuss the limitations of their work and suggest directions for future research.

Weaknesses

My main concern is the experimental section, which feels too limited and simple to fully support the theoretical findings.

  1. The data used in the experiments is relatively simple. It would be valuable to see if this method remains effective on more complex datasets, such as image datasets or applications with large language models.
  2. Additionally, since the theoretical analysis includes both random reshuffling and arbitrary shuffling, it would strengthen the paper to show results for both methods compared to the baseline SGD.
  3. Similarly, since the analysis considers three different cases (non-convex, strongly convex, and non-strongly convex), conducting experiments separately under each case would add depth to the findings.

Questions

See the Weaknesses above.

Comment

We thank the reviewer for their valuable comments.

We have updated the experimental section to include both strongly convex and non-strongly convex cases, as well as results for all three shuffling schemes. These updates can be viewed in the revised PDF. Additionally, we are currently working on experiments with image datasets and will include them in future updates.

Comment

We thank the reviewer for their constructive feedback. In response, we have completed experiments on the CIFAR-10 image dataset, covering all three shuffling schemes as well as the strongly convex, non-strongly convex, and nonconvex cases. Please refer to our revised manuscript for detailed results and analysis.

Comment

Thank you for your valuable feedback on our paper. We have carefully addressed all the points raised and incorporated them into the revised manuscript. To ensure we can further improve the submission based on your insights, we kindly ask if you could provide any additional feedback at your earliest convenience.

Comment

Thank you for your responses. I sincerely appreciate the effort. After consideration, I have decided to maintain my score.

Best.

Comment

Thanks for the response. While we appreciate the thoroughness of the review process, we are disappointed by the conclusion that the score should be maintained. We believe our experiments have addressed all the concerns you mentioned. If they resolve your concerns, we kindly ask for a reconsideration of the rating. If any additional issues remain, please let us know and we would be happy to provide further clarifications.

Review
Rating: 5

The paper considers shuffling-type gradient descent under a more general smoothness assumption -- $(L_0, L_1)$-smoothness.

Strengths

Results match the standard Lipschitz smoothness rates

Weaknesses

The variance assumptions are stronger than the most standard bounded variance assumption.

Questions

  • Definition 2.2 seems to be missing a subscript for $x$ on the right-hand side, and it is unclear what the authors mean. It could be either symmetric ($x$ is any of $x_1$ or $x_2$) or non-symmetric $(L_0, L_1)$-smoothness (maximization over the interval between $x_1$ and $x_2$), see [1]. Or something different? The symmetric case is much easier to analyze. I tried to work through the proofs, and it was very strange to see that everywhere (e.g., Lemma A.2, Lemma A.4) the authors used $\|\nabla F(\omega)\| \le G$, and then said (line 672) that "From definition 2.2, we can use Lipschitz smoothness" -- the standard smoothness, AFAIU. So the question: can $G$ be huge? Is it just hidden in the $\mathcal{O}$? And is the analysis mainly like that for standard smoothness? It seems to me that it actually is, because in lines 615, 704, 890, 1000, etc. the authors use standard smoothness inequalities. I can simply bound $\|\nabla f(x)\| = \|\nabla f(x) - \nabla f(x^*)\| \le (L_0 + L_1\|\nabla f(x^*)\|)R = RL_0$, which could be a $G$, and then my effective smoothness is $L = L_0 + RL_0L_1$. In the non-symmetric case we have extra exponents as multipliers, according to [1]. Is this what the authors effectively did? Of course one can recover the same rate as for standard smoothness. The problem is that the constant will be huge.

  • What is $r := G/L$ in Theorem 4.5, Theorem 4.6, Lemma A.1, Lemma A.2? I couldn't find where the authors refer to $r$. What is it for? The results of the mentioned theorems and lemmas do not depend on $r$. Then $r$ appears in Lemma A.4 and in the bound on the norm of the difference between two subsequent trajectory points, which mainly coincides with my above bound on the gradient (if we plug in the step).

I briefly checked the proofs, and it seems they adapt my above bound on the gradient norm to the stochastic case (which is where the weaker variance assumption is used -- no expectation, and the difference between the full gradient and its stochastic counterpart is estimated).

However, the correct approach is to allow the step size to increase as the gradient norm approaches zero. For example, [1] suggests clipping with the gradient norm in the denominator -- when the norm is large the stepsize is small, and vice versa.
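
For concreteness, here is a minimal sketch of the kind of gradient-norm-dependent step size described above. This is not the exact update from [1]; the function name, signature, and the specific rule min(eta, gamma/||g||) are illustrative assumptions.

```python
import numpy as np

def clipped_gradient_descent(grad_f, w0, steps, eta, gamma):
    # Generic clipped-step-size gradient descent: the effective step is
    # min(eta, gamma / ||grad||), so large gradients get small steps and
    # the step approaches the nominal eta as the gradient norm shrinks.
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = grad_f(w)
        step = min(eta, gamma / (np.linalg.norm(g) + 1e-12))
        w = w - step * g
    return w
```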

If I am mistaken, I would be glad to investigate the proofs more carefully if the authors argue that I am wrong. My understanding is that the gradient norm is just trivially bounded and the contribution is poor.

References: [1] Z. Chen et al. (2023). Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization.

Comment

We thank the reviewer for the comments. Below we will try to address the concerns and questions of the reviewer.

For the Weakness:

We have updated the conclusions to rely on weaker assumptions. Please refer to the updated version of the paper.

For the Questions:

  • Definition 2.2 Missing Subscript for $x$:
    We apologize for the lack of clarity. What we meant is that for any $x$ in the domain and any $x_1, x_2$ close enough to $x$, the property should hold. Specifically, we should have included "$\forall x \in \text{dom}(f)$" in the definition. However, the $x$ on the right-hand side does not need a subscript. Definition 2.2 is therefore distinct from both the symmetric and asymmetric generalized smoothness definitions in [1]. We have updated the definition accordingly.

  • Use of $\|\nabla F(w)\| \leq G$ in Lemmas A.2 and A.4:
    In Definition 2.2, if $\|\nabla f(x)\|$ is bounded above, we recover standard Lipschitz smoothness for any pair of points close enough to $x$. The statement on line 672 is justified by the following:

    1. From the induction hypothesis, $w^{(t)}_{k-1}$ is close enough to $w^{(t)}_0$.
    2. The term $\|\nabla F(w^{(t)}_0)\|$ is bounded due to the definition of $\tau$ and Lemma A.3.
  • Can $G$ Be Large, and Is It Hidden in $\mathcal{O}$?
    According to Theorem 4.4, $G = O\left(\left(\frac{4\Delta_1}{\delta}\right)^{\frac{1}{2-p}}\right)$, where $0 \leq p < 2$ is the degree of the $\ell$ function. It is possible for $G$ to be quite large, which is reasonable given the relatively loose assumption on the gradient. In some cases, the gradient can indeed be very large, but it is polynomial in $\Delta_1$ and $1/\delta$. The term $G$ is hidden in the $\mathcal{O}$ notation because it is independent of $n$ and $\epsilon$.

  • Bounding $\|\nabla f(x)\|$:
    We cannot directly bound $\|\nabla f(x)\| = \|\nabla f(x) - \nabla f(x^*)\| \leq (L_0 + L_1 \|\nabla f(x^*)\|)R = RL_0$ because, in Definition 2.2, the points $x_1, x_2$ need to be close enough to $x$. There is no guarantee that $x$ is close to $x^*$ here. Additionally, we do not make an assumption such as $\|x - x^*\| \leq R$.

  • Similarity to Standard Smoothness Analysis:
    Our analysis is not equivalent to standard smoothness analysis. The key differences are as follows:

    1. We do not always assume standard smoothness. Since we do not have a bound for $\|\nabla f(w;i)\|$, it is not possible to give an upper bound for $\|\nabla F(w)\|$ along the trajectory. Instead, we introduce a threshold $G$ and control the probability of the gradient norm exceeding $G$.
    2. To apply standard smoothness properties, we condition all results on the event $t < \tau$, where $\tau$ is the time when the gradient norm exceeds $G$. Under this conditioning, many results from standard smoothness cannot be directly applied.
  • Redundant Notation $r$:
    The variable $r$ was used to simplify notation but is now redundant. We have removed $r$ from the draft. Thank you for pointing this out.

Comment

Since the reviewer referred to [1] multiple times, we would like to provide a comparison between [1] and our work.

Our definition of $\ell$-smoothness is derived from Definition 2 in [2], and it is equivalent to the condition $\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$ almost everywhere, where $\ell$ is a sub-quadratic, non-decreasing function. The generalized smoothness in [1] can be viewed as a special case of $\ell$-smoothness with $\ell(x) = L_0 + L_1 x^\alpha$ for $\alpha \in [0, 1]$.
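
For readability, the relationship just described can be displayed explicitly; this simply restates the formulas already given in the text.

```latex
% \ell-smoothness (Definition 2 in [2]), with \ell sub-quadratic and non-decreasing:
\[
  \|\nabla^2 f(x)\| \;\le\; \ell\bigl(\|\nabla f(x)\|\bigr) \quad \text{almost everywhere},
\]
% and the generalized smoothness of [1] as the special case of a power-type \ell:
\[
  \ell(u) = L_0 + L_1 u^{\alpha}, \qquad \alpha \in [0,1].
\]
```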

In terms of proof techniques, [1] uses normalization to handle potentially large gradients, whereas our approach allows for large gradients but ensures that they occur only with a small probability $\delta/2$. We further demonstrate that the gradient norm bound $G$ is not excessively large, as it is independent of $\epsilon$ and $n$ in the nonconvex case. Consequently, the challenges addressed in our work are significantly different from those in [1].

References:

[1] Chen, Z., Zhou, Y., Liang, Y., and Lu, Z. (2023). Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning (pp. 5396-5427).

[2] H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

Comment

Thank you for providing a comparison between [1] and your work. I was indeed very biased, because it seemed to me that the setup you considered led to a pessimistic constant. Now I see I was right, but you considered a different assumption.

The reason I was biased is that the result of D. Vankov, A. Rodomanov, A. Nedich, L. Sankar, and S. U. Stich, "Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods", https://arxiv.org/abs/2410.10800, states that even in the case leading to the pessimistic constant in your approach, it is possible to almost recover the standard rates.

Yes, the paper was published right after the submission deadline. I just want to say now that the analysis might be significantly improved.

That said, it is clearly a contribution to consider the stochastic case, but this was done in

H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

Generalizing the result of (Li et al.) to shuffling-type methods seems to be an incremental contribution to me.

Am I right in saying that you basically generalized the work of (Li et al.) to shuffling-type methods? If this is true, I am adjusting my scores.

Comment

We sincerely thank the reviewer for their thoughtful comments and for raising the score. Below, we provide detailed responses to the questions raised:

  • Our $\ell$-smoothness assumption, which can be expressed as $\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$ almost everywhere for a sub-quadratic $\ell$, generalizes the $(L_0, L_1)$-smoothness condition. Specifically, when $\ell$ is a linear function, the $\ell$-smoothness assumption reduces to $(L_0, L_1)$-smoothness as a special case.

  • Since our $\ell$-smoothness assumption is more general than $(L_0, L_1)$-smoothness, the techniques used in [1] cannot be directly extended to improve our results. Even if we generalize $(L_0, L_1)$-smoothness to $\ell(x) = L_0 + L_1 x^p$ for $0 \leq p < 2$ (a specific case of our generalized smoothness), the proof techniques in [1] fail to hold. This is because their analysis relies heavily on the linearity of the smoothness assumption. For instance, in the proof of Lemma 2.2 in [1], the term $\nu(t)$ in the second inequality would become $\nu^p(t)$ for $p \neq 1$, invalidating the third inequality.

  • Analyzing shuffling-type algorithms is fundamentally different from, and often more challenging than, analyzing SGD. Unlike SGD, where each sampling step is independent, the sampling steps in a shuffling method are dependent within one epoch. This dependency introduces significant analytical complexity, making it non-trivial to generalize existing techniques to the shuffling setting. For example, [2] demonstrates the substantial effort required to adapt momentum methods to the shuffling framework, and [3] extends this effort to Nesterov acceleration. In our work, we tackle the unique challenge of analyzing shuffling gradient methods without assuming Lipschitz smoothness or independence between steps—issues that are rarely addressed in prior literature. As such, we believe our contribution goes beyond being incremental.

References:
[1] Vankov, Daniil, et al. "Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods." arXiv preprint arXiv:2410.10800 (2024).
[2] Tran, Trang H., Lam M. Nguyen, and Quoc Tran-Dinh. "SMG: A shuffling gradient-based method with momentum." International Conference on Machine Learning. PMLR, 2021.
[3] Tran, Trang H., Katya Scheinberg, and Lam M. Nguyen. "Nesterov accelerated shuffling gradient method for convex optimization." International Conference on Machine Learning. PMLR, 2022.

Comment

We greatly appreciate the time and effort you put into providing constructive feedback. Regarding your questions, we have attempted to answer them all. We kindly request that you review the revised submission and our rebuttal and consider providing additional feedback. Thank you once again for your valuable input and for contributing to the quality of our work.

Comment

Thank you for your valuable feedback on our paper. We have carefully addressed all the points raised and incorporated them into the revised manuscript. To ensure we can further improve the submission based on your insights, we kindly ask if you could provide any additional feedback at your earliest convenience.

Comment

I have decided to increase my score. The strengths outweigh the weaknesses, and overall the paper makes a contribution.

Comment

Dear Reviewer h6tH:

Thank you very much for reevaluating our work and raising your rating. Since you said the "strengths outweigh the weaknesses", we are not sure whether you mean to accept this work. If so, please note that the rating of 5 you gave corresponds to marginal rejection, while 6 corresponds to marginal acceptance.

Thanks, Authors

Review
Rating: 8

This paper revisits the case of random shuffling-type stochastic gradient methods for finite-sum minimization problems. More precisely, the paper considers objectives without the traditional structural assumption of Lipschitz smoothness. In doing so, the authors focus on non-convex, strongly convex, and convex objectives which satisfy, as a smoothness "surrogate", the notion of $\ell$-smoothness. To that end, the authors provide respective convergence rates for each case.

Strengths

The paper is extremely well written and enjoyable to read. Moreover, as far as I checked, the math seems correct and sound. Without being an expert myself on the respective literature, I find the respective problem very interesting and challenging from a theoretical standpoint.

Weaknesses

My main concerns involve two factors:

First, the notion of $\ell$-smoothness introduces additional parameters to be tuned, as becomes apparent from the definitions of the step-sizes in the main theorems. It would be good to include some discussion of how difficult these are to evaluate, both in real-life scenarios and in theory. More precisely, do the authors believe that the respective toolbox from adaptive algorithms like in [1] can be incorporated?

Secondly, the proposed step-size policies seem to rely on prior knowledge of the iteration horizon $T$. Do the authors believe that an any-time convergence rate guarantee can be achieved?

[1] Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization, NeurIPS 2022.

Questions

See weaknesses

Comment

We thank the reviewer for the comments. Below we will try to address the concerns and questions of the reviewer.

  • Additional Parameters Introduced by $\ell$-Smoothness:
    Since the $\ell$ function and the value of $\sigma$ are difficult to obtain in practice, the results in our paper may not provide optimal guidance for choosing hyperparameters. However, the primary contribution of this work lies in providing convergence guarantees under weaker smoothness assumptions.

  • Incorporating Adaptive Algorithms:
    Regarding the possibility of incorporating the respective toolbox from adaptive algorithms, there has been work on shuffling algorithms with variance reduction techniques, such as in [1]. Moreover, since variance reduction has been successfully combined with generalized smoothness in [2], we believe it is indeed feasible to integrate such techniques into shuffling algorithms. This represents a promising direction for future research.

  • Step-Size Policies and Any-Time Convergence Guarantees:
    The proposed step-size policies rely on prior knowledge of the iteration horizon $T$. However, the step-size and $T$ can be determined simultaneously at the start of the algorithm, as demonstrated in our parameter choices under Theorem 4.4. Achieving an any-time bound is also possible but may not be very useful. By replacing the iteration horizon $T$ with the current iteration index $t$, the $1 - \delta$ probability and the bound for the average gradient norm (previously $\epsilon^2$) would adjust accordingly.

References:

[1] Malinovsky, Grigory, Alibek Sailanbayev, and Peter Richtárik. "Random reshuffling with variance reduction: New analysis and better rates." Uncertainty in Artificial Intelligence. PMLR, 2023.

[2] Li, Haochuan, Alexander Rakhlin, and Ali Jadbabaie. "Convergence of adam under relaxed assumptions." Advances in Neural Information Processing Systems 36 (2024).

Review
Rating: 5

This paper studies the shuffling method under the generalized smoothness assumption, which was proposed recently to fit many modern machine learning tasks. The authors prove that, under properly chosen parameters, the shuffling method converges under the weak smoothness condition for nonconvex, strongly convex, and convex objectives. Numerical experiments are also conducted to support the theory.

Strengths

I appreciate that the paper tries to tackle more realistic problems (i.e., functions with non-uniform smoothness) and studies the shuffling algorithm, an arguably more common scheme in practice.

Weaknesses

Major points.

  1. All convergence results are proved in a high-probability manner. However, the dependence on the margin $\delta$ is of order $\mathrm{poly}(1/\delta)$, which makes the results weak. I also suggest the authors explicitly give the dependence on $\delta$ in the final sample complexity.

  2. Some descriptions of the existing works contain wrong facts. (addressed)

    • Lines 90-91, the bounded variance assumption is not required to improve the rate $\mathcal{O}(1/T^2)$ to $\mathcal{O}(1/(nT^2))$ in Nguyen et al. (2021). Instead, Nguyen et al. (2021) can handle unbounded variance.

    • Lines 92-93, results in both Mishchenko et al. (2020) and Nguyen et al. (2021) hold under unbounded variance condition. The current description is not correct.

  3. The conditions on noises, i.e., Assumptions 4.3, 4.4, and 4.7, are strong compared with the existing literature, which significantly reduces the impact of the work. I will elaborate more on this point in the following.

  4. Nonconvex part. (addressed)

    • Random reshuffling scheme.

      • In this case, previous works only consider Assumption 4.7, or even a weaker version, i.e., non-uniformly bounded variance, to obtain the $\mathcal{O}(\sqrt{n}/\epsilon^{3})$ sample complexity.

      • However, to recover the same rate, this work requires much stronger Assumptions 4.3 and 4.4 in Theorems 4.5 and 4.6, respectively. Hence, the results in this paper are not directly comparable to prior literature.

      • When only Assumption 4.7 holds, Corollary 4.8 has an extra dependence on $n$, as indicated by the authors.

      • In addition, I am not sure why Corollary 4.8 is a corollary and cannot find its proof. Did I miss anything?

    • Arbitrary shuffling scheme.

      • Again, the authors require the stronger Assumption 4.3 to make their sample complexity as good as the previous results. However, the latter can hold under non-uniformly bounded variance, e.g., see Nguyen et al. (2021). As such, the claim in Lines 63-64 is misleading.
    • Moreover, imposing three assumptions on noises may confuse the reader. Especially, the proofs under Assumptions 4.3 and 4.4 are similar as claimed in Lines 860-863. I didn't see a necessary reason for the authors to do so.

  5. Strongly convex and convex parts.

    • Assumption 4.3 is strong and not assumed in previous best-known results, making the contributions weak. (addressed)

    • As far as I know, the previous best results don't need any assumption on the noises for convex problems (Assumption 4.14). Hence, whichever condition among Assumptions 4.3, 4.4, and 4.7 is used, the result is not an improvement in my opinion.

  6. The writing can be improved. Some theorems give a threshold on the stepsize (e.g., Theorem 4.5) but others give an exact choice (e.g., Theorem 4.12). Can the author present a unified statement? (addressed)

Minor points. (addressed)

  1. Line 290, the second $T=\mathcal{O}(\frac{1}{\epsilon^3})$ should be $\mathcal{O}(\frac{n}{\epsilon^3})$.

  2. Line 310, "Delta_1" should be $\Delta_1$.

Questions

See Weaknesses.

Comment

We thank the reviewer for the comments. Below we will try to address the concerns and questions of the reviewer.

  1. Noise Assumption:
    We have updated the noise assumption in the new draft to match the assumption in [1]. In the strongly convex case, the results now align with [1] under the same assumption, as outlined in the second paragraph of Theorem 1 in [1].

  2. Dependence on $\delta$:
    We have explicitly added the dependence on $\delta$ in the remarks below Theorems 4.4, 4.8, and 4.11. However, we humbly disagree with the assertion that the polynomial dependence on $\delta$ is weak. Under the $\ell$-smoothness assumption, the dependence of $T$ on $\delta$ is commonly polynomial. For instance:

    • Theorem 6.2 in [2],
    • Theorem 5.3 in [3],
    • Theorem 4.13 in [4].

    This is because, under $\ell$-smoothness, $\delta$ accounts for the probability that Lipschitz smoothness does not hold, a consideration absent in standard Lipschitz smoothness settings. Therefore, it is not appropriate to directly compare the dependence on $\delta$ here with that under Lipschitz smoothness.

    Furthermore, there are results under $\ell$-smoothness with a $\log(1/\delta)$ dependency (e.g., Theorems 4.1 and 4.2 in [2]). However, in those cases, it is proved that the gradient norm is always bounded before $T$ in the probability space, which does not hold in our case. Improving the dependency to $\log(1/\delta)$ here would require a significant advancement in techniques under $\ell$-smoothness and is beyond the scope of this paper.

  3. Descriptions of Existing Works:
    We have corrected inaccuracies in the descriptions of some existing works. Thank you for pointing these out.

  4. Threshold vs. Exact Stepsize:
    Regarding the comment about stepsize thresholds (e.g., Theorem 4.5) versus exact choices (e.g., Theorem 4.12), we now provide explicit stepsize choices in the paragraphs following the relevant theorems. However, we have retained the threshold form in some cases as it provides a more accurate description of our results.

References:

[1] Nguyen, Lam M., et al. "A unified convergence analysis for shuffling-type gradient methods." Journal of Machine Learning Research, 22.207 (2021): 1-44.

[2] Haochuan Li, Alexander Rakhlin, Ali Jadbabaie. "Convergence of Adam Under Relaxed Assumptions." NeurIPS 2023.

[3] Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, Ali Jadbabaie. "Convex and Non-convex Optimization Under Generalized Smoothness." NeurIPS 2023.

[4] Wenhan Xian, Ziyi Chen, Heng Huang. "Delving into the Convergence of Generalized Smooth Minimax Optimization." ICML 2024.

Comment

I thank the authors for their feedback. Given the new results, I have increased my score. I cannot give a higher score because the assumption is still strong in the general convex case, as explained in the original review.

Comment

We thank the reviewer for their constructive feedback. In response, we have updated the draft to further generalize Theorems 4.9 and 4.12 to cases without any variance assumptions. This update aligns our results with those in [1]. We invite you to review the revised manuscript for details.

[1] Nguyen, Lam M., et al. "A unified convergence analysis for shuffling-type gradient methods." Journal of Machine Learning Research, 22.207 (2021): 1-44.

Comment

Thank you for your valuable feedback on our paper. We have carefully addressed all the points raised and incorporated them into the revised manuscript. To ensure we can further improve the submission based on your insights, we kindly ask if you could provide any additional feedback at your earliest convenience.

Comment

I thank the authors for the further update. My comments are as follows:

  1. Theorem 4.11 (i.e., the random reshuffling scheme) still needs a variance assumption. In contrast, existing works like [1] do not have any variance assumption for the random reshuffling scheme. Hence, Theorem 4.11 is still weaker in my view.

  2. I understand the authors removed Assumption 4.3 in Theorem 4.12 (i.e., any shuffling scheme). But this result cannot reflect the better sample complexity when the random reshuffling scheme is employed, i.e., a $\sqrt{n}$ improvement as indicated by Lines 347 (taking $p=0$ for simplicity) and 361.

As such, the result for the general convex case is still not satisfactory.

  1. Another minor point is that Assumption 4.1 in every theorem seems to link to the wrong place. The authors may need to check this issue.

References:

[1] Mishchenko, Konstantin, Ahmed Khaled, and Peter Richtárik. "Random reshuffling: Simple analysis with vast improvements." NeurIPS 2020.

Comment

Thank you for your continuous feedback and constructive suggestions. We have addressed your concerns as follows:

  1. Theorem 4.11 and the Variance Assumption:
    At present, we are unable to prove Theorem 4.11 without the variance assumption. Previous analyses of convex optimization under $\ell$-smoothness (e.g., Theorem 4.2 in [1]) have only been conducted for gradient descent, where the gradient norm decreases in every iteration. In contrast, our analysis for arbitrary schemes requires a certain bound on the trajectory, which does not hold in random reshuffling. As a result, prior methods are invalid in our case. While we are actively exploring this case, we currently do not have a definitive conclusion.

  2. Strongly Convex Case (Theorem 4.8):
    Our proof for Theorem 4.9 can be directly generalized to Theorem 4.8, yielding results consistent with [2]. On the other hand, the current version of Theorem 4.8 aligns with Theorem 1 in [2].

  3. Contributions Despite Challenges in the Convex Case:
    While we acknowledge that resolving the convex case without the variance assumption is challenging (and potentially very difficult), we would like to highlight the significant contributions made in our work. These include advancements in settings of shuffling algorithms, including nonconvex cases, strongly convex cases, and non-strongly convex cases with arbitrary shuffling schemes. Additionally, we have employed variance assumptions that are weaker than any previously used under $\ell$-smoothness assumptions and conducted extensive numerical experiments to validate our findings. We kindly request the reviewer to reconsider the score to reflect the contributions we have made.

  4. Assumption 4.1 Link Update:
    We have updated the link for Assumption 4.1, and this change will be reflected in the next version.

Thank you for your time and thoughtful consideration!

[1] H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

[2] Nguyen, Lam M., et al. "A unified convergence analysis for shuffling-type gradient methods." Journal of Machine Learning Research 22.207 (2021): 1-44.

Comment

We have updated our paper in response to the reviewers' valuable suggestions. Below is a summary of the key changes we have made:

  • We are now using a weaker noise assumption (Assumption 4.3), aligning with [1]. All the theorems and proofs have been modified accordingly.
  • The flow of the proofs in the appendix has been reorganized to improve readability.
  • Typos have been corrected.

We also plan to work on numerical experiments soon, as suggested by reviewer RYK7.

[1] Nguyen, Lam M., et al. "A unified convergence analysis for shuffling-type gradient methods." Journal of Machine Learning Research, 22.207 (2021): 1-44.

Comment

We sincerely thank the reviewers for their constructive feedback, which has greatly contributed to improving our manuscript. As the revision has reached a temporary conclusion, we would like to summarize the changes we have made:

  1. Weaker assumptions: We now use a weaker assumption of general bounded variance, aligning it with the assumption made under Lipschitz smoothness.
  2. Expanded analysis: We have extended our analysis to cover both strongly convex and non-strongly convex cases without any variance assumptions.
  3. Additional experiments: We conducted more numerical experiments, including:
    • All three shuffling schemes,
    • Both strongly convex and non-strongly convex cases,
    • Image datasets.

We believe we have now addressed all the concerns raised by the reviewers. We look forward to further responses and suggestions to continue refining and improving the manuscript. Thank you again to all the reviewers for their invaluable feedback.

AC Meta-Review

This paper studies the convergence guarantees of shuffling-based gradient methods in non-Lipschitz settings. The authors show that this algorithm, with a careful schedule, converges under weak assumptions. The authors consider various settings, e.g., nonconvex, convex, and assuming bounded variance. They further provide empirical studies for demonstration.

This paper was reviewed by four reviewers and received the following Scores/Confidence: 5/4, 5/4, 5/3, 8/2. I think the paper studies an interesting topic, but the authors were not able to convince the majority of the reviewers sufficiently well of the importance of their contributions. The following concerns were brought up by the reviewers:

  • There were many concerns raised by the reviewers and I appreciate that the authors provided a major revision of their work. However, given the number of points raised, addressing all reviewer concerns would require significant revision, which then requires another set of reviews.

  • Novelty of technical analysis.

  • Experimental results are limited and do not sufficiently support main claims.

  • Convergence guarantees have a weak dependence on the margin.

Three out of four reviewers reject the paper. As such, based on the reviewers' suggestions, as well as my own assessment of the paper, I recommend not including this paper in the ICLR 2025 program.

Additional Comments from the Reviewer Discussion

There were many concerns raised by the reviewers and I appreciate that the authors provided a major revision of their work. However, addressing all reviewer concerns would require significant revision, which then requires another set of reviews.

Final Decision

Reject