6.0

/10

Rejected4 位审稿人

最低5最高8标准差1.2

4.8

置信度

正确性3.3

贡献度3.0

表达3.3

ICLR 2025

Efficient optimization with orthogonality constraint: a randomized Riemannian submanifold method

Andi Han,Pierre-Louis Poirion,Akiko Takeda

OpenReview PDF

提交: 2024-09-26更新: 2025-02-05

摘要

关键词

OprimizationOrthogonality constraintRiemannian optimizationStiefel manifold

评审与讨论

审稿意见

评分: 8置信度: 52024-10-25

The paper deals with large-scale optimization problems with orthogonality constraints. To overcome the computational bottleneck of the retraction operator, the authors propose a novel "subspace" technique that updates variables on random submanifolds, reducing per-iteration complexity. They introduce two strategies for selecting these submanifolds and analyze the convergence of their methods for both general nonconvex functions and those with local Riemannian PL conditions. Besides, the approach is also applicable to certain quotient manifolds. Experimental results demonstrate the method's effectiveness across various problems.

优点

This paper proposes a novel random submanifold descent method (with two sample strategies), which enables low-cost retractions and low per-iteration complexity.
The authors provide a thorough convergence analysis under the non-convex setting, the local Riemannian PL setting, and the stochastic setting.
The proposed algorithm outperforms other existing methods in various problems.

缺点

In line 87, the authors mentioned the poor parallelization of existing methods. However, they did not say too much about why RSDM can be implemented in parallel. In line 93, the authors say “allowing parallel computation”. Adding more discussion and exploration in this aspect can greatly strengthen this paper. For example, elaborating more on the parallel implementation of RSDM or even showing the speedup compared to the sequential implementation in the experiment part would be helpful.
We know from Definition 1 that the Riemannian PL holds locally. In Line 304 of Theorem 1, it says “Suppose k0 large enough such that $X_{k0} \in U$ ”. I understand that this assumption is because the iterates may not always stay in the local region due to algorithmic randomness. Is it possible to get rid of such an assumption? This seems difficult to me. Similar things happen in other theorems.

问题

In Line 69 and Line 92, the authors say that the complexity of non-standard linear algebra reduces to O(r^3) from O(nr^2). I understand O(r^3) comes from the cost of retraction operation. However, the per-iteration cost is still O(npr) or O(nr^2) (see Section 4.1 and Section 4.2). It would be better to clarify these in the statement.
In line 142 and line 146, it is not suitable to use “denote … as …”. Besides, in line 155, what is the meaning of $X^*$ ?
It can be better to explicitly write down the expression grad F_k(I_n) arising in (4) using the Euclidean gradient. Although it is very simple, it can help readers understand lines 212-213.
In Line 222, why does the gradient computation only require O(nr^2)?
Could the authors add more discussions on the parallel implementation?
Could the authors get rid of the assumptions in Line 304?
In line 484 of Section 7.3, it should be Figure 4 instead of Figure 2. In addition, if possible, could the authors also compare their RSDM with RSSM in (Cheung-Wang-Yue-So ‘24) in Line 577?
In Line 467-470, Can the phenomenon of switching from sublinear rate to linear rate be explained by the error bound property (or the Riemannian PL property)? See Theorem 1 and Theorem 2 in “Liu H, So A M C, Wu W. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods[J]. Mathematical Programming, 2019, 178(1): 215-262.” The authors can add some discussion if this is related.
The update of RSDM is similar to the algorithm in "A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints" (https://arxiv.org/pdf/2304.03641). For example, the formula in Line 215 is similar to (12) in the reference. Could the authors compare RSDM with the algorithm in the reference and highlight the differences more explicitly?

伦理问题详情

评论- Responses to Reviewer Cgj8 (Part 1)

2024-11-20

We sincerely appreciate the reviewer for acknowledging the strengths of our work. We also thank the reviewer for all the thoughtful comments and constructive feedback, which we address in details below. All revisions are highlighted in red in the revised manuscript.

1. (W1 + Q5) Parallelization of RSDM compared to existing methods.

Thank you for raising this point. We believe the use of term "parallel" may have caused some confusion. What we really mean is the following: Although the total complexity is the same as existing coordinate-based methods [3,4], RSDM converges in fewer iterations despite its higher per-iteration complexity. This characteristic makes RSDM more suited to modern hardware, such as GPUs, which are optimized for parallel processing. To this end, we have modified the wording in Section 1.1 by removing the "parallel" discussion.

To be more precise, let $n=p$ and coordinate based methods [3,4] requires $O(n^2\epsilon^{-2})$ iterations for convergence to $\epsilon$ -stationary points and each iteration costs $O(n)$ , which gives total complexity of $O(n^3\epsilon^{-2})$ . In contrast, proposed RSDM (with permutation sampling) requires $O(n^2r^{-2} \epsilon^{-2})$ iterations and each iteration costs $O(nr^2)$ , yielding the same total complexity. Therefore, existing methods require more iterations (sequentially), which limits their efficiency on modern hardware designed for parallel processing.

This claim can be validated from Figure 2(a) and Figure 6(a), where the runtime complexity of RCD [3,4] is significantly higher despite a lower per-iteration cost. On the contrary, the proposed RSDM is more favorable for large-scale settings when implemented on hardware such as GPUs.

2. (W2 + Q6) On the assumption of iterates staying in the local region.

Thank you for the question. We can get rid of such an assumption when $X^*$ is an isolated local minima in the neighborhood $U$ . This is because RSDM converges to a stationary point almost surely (shown in Theorem 2). Therefore, once $X_{k_0} \in U$ . Then we can show $X_k \in U$ for all $k \geq k_0$ by a quadratic growth argument as shown in [5]. We have modified Theorem 1 and 2 accordingly, along with the proof.

Nevertheless, in the stochastic setting where we fix the stepsize, the almost sure convergence to $X^*$ is lost. For such cases, we rewrite Theorem 3 by showing linear convergence only for points $X_{k_0}, ..., X_{k_1} \in U$ . It is worth noticing that many prior works on stochastic Riemannian optimization directly assume all iterates are contained in $U$ , e.g., Theorem 3 in [6], Theorem 4.6 in [7].

3. (Q1) Clarify the statements on per-iteration cost and non-standard linear algebra.

Thank you for the comment. Indeed, the main aim of the paper is to reduce the complexity of non-standard operations associated with retraction, which is the main bottleneck for optimization with orthogonality constraints (see e.g., [8]).

Apart from the retraction, the per-iteration cost also includes gradient and iterate computation via matrix multiplication. Although these costs require higher worst-case complexities, they are less of concern on modern hardware such as GPUs, where they can be easily accelerated.

4. (Q2) Not suitable to use "denote...as...". The meaning of $X^\star$ in line 155.

Thank you for the suggestion. We have now corrected the wording involving "denote...as...". We have also clarified the meaning of $X^*$ as any local minimizer.

5. (Q3) Write grad F_k(I_n) with Euclidean gradient.

Thank you for the suggestion. We have now added the expression explicitly in terms of Euclidean gradient.

6. (Q4) Why gradient computation in line 222 require $O(nr^2)$ ?

This is because the cost of truncated permutation, required for forming $A = P_k(r) \nabla F(X_k)$ and $B = P_k(r) X_k$ is smaller than the matrix multiplication of $AB^T$ , which costs $O(nr^2)$ .

7. (Q7) Line 484 should be Figure 4 instead of Figure 2.

Thank you for noticing the typo. We have now fixed it.

8. (Q7) Comparison to RSSM [9].

Although we had briefly compared proposed RSDM to RSSM [9] in Section 1.1, we have now included a dedicated section in Appendix E of the revised manuscript to provide a detailed and comprehensive comparison. We also include a numerical comparison despite that RSSM [9] does not report experimental results or provide the code.

Specifically, we show that RSDM requires a lower retraction cost of $O(r^3)$ while RSSM [9] requires $O(nr^2)$ . In addition, RSDM allows easy extension to quotient manifolds, such as Grassmann manifold, while RSSM [9] appears challenging to adapt to such manifolds.

Our experimental results clearly demonstrate that the proposed RSDM converges significantly faster than RSSM in terms of runtime, further validating the advantages of RSDM over RSSM.

评论- Responses to Reviewer Cgj8 (Part 2)

2024-11-20

9. (Q8) Discussion on sublinear to linear rate in relation to [1]

Thank you for the suggestion of the reference and indeed it is relevant. We agree that the transition from sublinear to linear convergence may be due to that random submanifold descent allows a quick search for the regions where local PL or local error bound condition holds. We have now incorporated this discussion and cited the reference in the revised manuscript.

10. (Q9) Comparison of RSDM to [2].

Thank you for the comment. We agree that the formulation of RSDM is similar to eq. (12) of [2]. We have now added Appendix G in the revised version for properly comparing our development to [2] in more details.

We first highlight that [2] is primarily designed for nonsmooth problems and thus the settings, optimality conditions and convergence analysis are largely different to ours. Nevertheless, we still adapt its algorithm for the smooth case.

Below we summarize the main differences:

One of the critical differences is the update of [2] follows a projected Euclidean gradient descent, while RSDM follows a Riemannian gradient descent, which leads to entirely different analysis techniques.
They show convergence to block- $k$ stationary points, which is weaker than our established convergence to stationary points (as shown in Theorem 5.5 of [2]).
Their convergence rate in Theorem 6.3 depends on potentially large binomial coefficient $C_n^r$ while our convergence only has a coefficient $n^2 r^{-2}$ .
They only consider $U_B$ to be truncated permutation while we consider both truncated permutation and general orthogonal matrix.
Unlike [2], we have shown convergence in the stochastic settings, and show how our developments extend to other quotient manifolds.
Unlike [2] that only shows convergence in expectation, we also show convergence in high probability and almost surely.

To further verify the empirical advantages of RSDM over [2], we conducted a numerical experiment on the PCA problem. The results clearly demonstrate that RSDM achieves superior convergence compared to [2]. This suggests that the difference in the update direction plays a critical role in convergence, with Riemannian gradient descent often being more effective than projected Euclidean gradient descent.

References

[1] Liu H, So A M C, Wu W. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. 2019.

[2] Yuan, G. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. Preprint. 2023.

[3] Han, A., Jawanpuria, P., & Mishra, B. Riemannian coordinate descent algorithms on matrix manifolds. In ICML 2024.

[4] Gutman, D. H., & Ho-Nguyen, N. Coordinate descent without coordinates: Tangent subspace descent on Riemannian manifolds. Mathematics of Operations Research. 2023.

[5] Rebjock, Q., & Boumal, N. Fast convergence to non-isolated minima: four equivalent conditions for C2 functions. Mathematical Programming. 2024.

[6] Zhang, H., J Reddi, S., & Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In NeurIPS 2016.

[7] Kasai, H., Sato, H., & Mishra, B. (2018). Riemannian stochastic recursive gradient algorithm. In ICML 2018.

[8] Ablin, P., Vary, S., Gao, B., & Absil, P. A. Infeasible deterministic, stochastic, and variance-reduction algorithms for optimization under orthogonality constraints. Preprint. 2023.

[9] Cheung, A. Y. M., Wang, J., Yue, M. C., & So, A. M. C. Randomized Submanifold Subgradient Method for Optimization over Stiefel Manifolds. Preprint 2024.

评论- Raising the score

2024-11-23

Thanks a lot for the detailed reply. I also read other reviewer's comments. The authors really did a great job and now the draft looks perfect to me (in writing, theory, and experiments). I decide to increase the score.

Just one quick question: In Appendix G (the comparison between OBCD and RSDM), I guess the core difference is the OBCD consider the Euclidean gradient direction of F in line 1579 while the RSDM considers the Riemannian gradient direction of F (that is why RSDM has the skew-symmetric operation). Is this exactly the case? Or can we directly derive the RSDM starting from line 1579 by performing a Riemannian gradient step?

评论- Thank you for your positive feedback

2024-11-24

We are deeply motivated by your positive feedback on our work and are sincerely grateful that you have increased the score. Your constructive reviews have significantly strengthened the paper.

Regarding your question on RSSM, the use of Euclidean versus Riemannian gradient direction is the key difference between OBCD and our RSDM. We believe the Riemannian gradient update cannot be derived starting from line 1579, as the Euclidean gradient appears in the global solution to the subproblem (19). This is mainly because the subproblem is defined extrinsically in the ambient space, rather than intrinsically on the manifold.

We hope this clarifies your question. If further explanation is needed, we are more than happy to elaborate.

审稿意见

评分: 5置信度: 52024-11-02

This paper proposes a randomized Riemannian submanifold descent method (RSDM) to address large-scale optimization with orthogonality constraints. Specifically, it mitigates the computational costs associated with the retraction in each iteration by updating on a low-dimensional random submanifold rather than the full Stiefel manifold. Two sampling strategies, orthogonal and permutation sampling, are provided. Theoretical analysis guarantees convergence, and extensive experiments demonstrate the effectiveness and efficiency of RSDM.

优点

The proposed RSDM is novel, which updates on a low-dimensional random submanifold instead of the full Stiefel manifold. In this way, it reduces the computational costs in each iteration, contributing to large-scale optimization.
The paper provides a comprehensive theoretical analysis, including general nonconvex optimization problems, nonconvex functions satisfying the PL condition, and the results in stochastic settings. The discussion about the trade-off between efficiency and convergence is interesting.
The effectiveness and efficiency of the proposed algorithm are validated on an array of numerical experiments, and the source code is available.

缺点

The contribution seems limited in the scenario when $p=\Omega(n)$ . However, numerous applications typically involve the setting $p\ll n$ . Specifically, as discussed in Remark 3, the total complexity of the standard Riemannian gradient descent is $O(np^2\epsilon^{-2})$ , while the proposed method costs $O(n^3\epsilon^{-2})$ , which appears as a theoretical gap.

问题

As cited in line 88 of the manuscript, the literature [1] proposed a randomized method, RSSM, which also updates on a low-dimensional random submanifold. Could you compare it with the method in this work in terms of computational cost per iteration? Additionally, is there any relationship between RSSM and RSDM?
In line 382 of the manuscript, it states that analysis for Theorem 3 can be easily extended to mini-batch gradient descent. Can the authors explain this point briefly?
In the experiments, the parameter $p$ is consistently close to $n$ , for which the proposed RSDM outperforms other methods. It would be beneficial to demonstrate at what percentage of $n$ the value of $p$ needs to reach before RSDM begins to show its advantage.
In Section 7.4, the numerical results of the deep learning task seem confusing, where the accuracy hovers around $40\\%$ to classify CIFAR10. Additionally, only the tendency of accuracy is presented; including the training/test loss would be more illustrative.

I would like to increase my grade if the concerns are well addressed.

[1] Randomized submanifold subgradient method for optimization over Stiefel manifolds. 2024

评论- Responses to Reviewer xyJ5 (Part 1)

2024-11-20

We are immensely grateful for the reviewer’s positive feedback on our contributions. We also deeply appreciate the insightful comments and questions, which we address in detail below. All revisions are highlighted in red in the revised manuscript.

1. (W1 + Q3) Contribution limited to $p = \Omega(n)$ while many applications involve $p \ll n$ . Experiments across various $p$ .

Thank you for the comments and questions on $p$ . First, the $O(n^3 \epsilon^{-2})$ complexity of the proposed RSDM represents the worst-case complexity, which may not reflect its empirical runtime performance. While the per-iteration cost of RSDM is $O(nr^2)$ , it is dominated by matrix multiplication (as shown in Section 4), which can benefit significantly from modern hardware (such as GPUs). In contrast, the per-iteration complexity of RGD includes $O(np^2)$ for retraction, which involves non-standard linear algebra operations that cannot be easily accelerated. As shown in Section 7, RSDM achieves faster convergence in runtime than RGD across various settings, despite the complexity analysis. This suggests the theoretical complexity could be tightened with a refined analysis, which we leave as future work.

Additionally, we have conducted further experiments for PCA problem under various settings of $p = 100, 300, 500, 800, 1000, 1200, 1500, 1800$ , with $n = 2000$ fixed. The results are given in Appendix D of the revised manuscript. We observe that RSDM shows superior convergence than RGD across all settings except for $p = 100$ . This empirically validates the advantages of RSDM over RGD over a range of $p$ (even when $p$ is small relative to $n$ ). However, when $p$ is significantly smaller, the cost of retraction becomes negligible, which can lead to RGD outperforming RSDM in such cases.

In our future work, we plan to enhance RSDM for cases where $p$ is significantly small, e.g., by developing an adaptive submanifold selection scheme that allows RSDM to behave similarly to RGD when $p$ approaches zero, while maintaining its superiority for larger $p$ .

2. (Q1) Comparison to RSSM [1].

Thank you for the question. Although we have briefly compared RSDM to RSSM in Lines 89-92, we have now included a dedicated section (Appendix E of the revised manuscript) where we detail the updates for RSSM and compare it to RSDM. For your convenience, we create the following table summarizing the per-iteration complexity of RSDM and RSSM [1].

	Gradient + others	Retraction	Total
RSSM [1]	$O(npr)$	$O(nr^2)$	$O(npr + nr^2)$
RSDM-P	$O(nr^2)$	$O(r^3)$	$O(nr^2 + r^3)$

From the table, it is clear that RSDM (with permutation sampling) requires significantly lower per-iteration complexity than RSSM, especially in terms of the cost of retraction, which requires non-standard linear algebra operations and cannot be easily accelerated compared to matrix multiplication.

Although both RSDM and RSSM [1] share the same goal of reducing the retraction cost, they employ largely different strategies. One of the major differences is that RSDM operates on the rows while RSSM [1] operates on the columns. This distinction provides RSDM with additional advantages, including easy generalization to quotient manifolds, such as the Grassmann manifold (as discussed in Section 6). In contrast, this generalization appears challenging for RSSM.

Additionally, while RSSM [1] did not include any numerical experiment nor provide the code, we have implemented and compared RSDM to RSSM numerically in Appendix E of the revised manuscript. The results show that RSDM achieves significantly faster convergence compared to RSSM [1], thus further validating the benefits of proposed RSDM.

3. (Q2) On the extension of Theorem 3 to mini-batch gradient descent.

Thank you for the comment. We have now extended Theorem 3 to the mini-batch gradient descent setting, in Appendix F of the revised manuscript.

In summary, we consider $F(X) = \frac{1}{N} \sum_{i=1}^N f_i(X)$ where each component function $f_i$ has bounded variance $\sigma^2$ and the mini-batch of size $B$ is sampled uniformly with replacement. We prove that using a mini-batch reduces the gradient variance from $\sigma^2$ (as in the stochastic case) to $\sigma^2/B$ . Consequently, the convergence rate remains the same as in the stochastic case, with the variance term $\sigma^2$ replaced by $\sigma^2/B$ for mini-batch setting.

评论- Responses to Reviewer xyJ5 (Part 2)

2024-11-20

4. (Q4) On the experiments of CIFAR10 hovering around 40%. Only accuracy is presented, including training/test loss would be more illustrative.

Thank you for the comment. The absolute accuracy of the experiments is not particularly meaningful, as we consider the network to be MLP with orthogonal weights (unlike the standard settings). We have now included the training loss in the Appendix A of the revised manuscript. We observe that despite some variability, RSDM still observes slight improvement in early-stages in terms of training loss.

评论- Looking forward to your reply

2024-11-25

Dear Reviewer xyJ5

We sincerely appreciate your great efforts in reviewing our work and your valuable comments. Your insights have been incredibly helpful in improving our submission.

In the rebuttal, we have tried our best to address your concerns by providing detailed responses and significantly revising the submission based on your valuable suggestions. Would you mind checking our responses and confirming that if there are any remaining questions so that we could clarify?

Best,

Authors

2024-11-25

The reviewer thanks the authors for the detailed reply. Endeavoring a better (or optimal) theoretical complexity would be interesting. Moreover, since most applications often benefit from formulating fewer columns ( $p$ ), it is still worth finding a way for the proposed submanifold method to be in favor of the other solvers. In view of these limitations, the reviewer would maintain the score.

评论- Further responses to Reviewer xyJ5

2024-11-25

We thank the reviewer for the further comments. However, we respectively disagree with the comment that "most applications often benefit from formulating fewer columns ( $p$ )". Claiming that most applications are formulating with smaller $p$ is a relatively subjective argument without sufficient justification.

In fact, many optimization problems only work with $n = p$ , e.g., quadratic assignment problem [1], orthogonal recurrent neural network [2,3], token orthogonalization [4] orthogonal fine tuning for large langugage models [5,6]. This has motivated a line of works on efficient optimization designed only for orthogonal manifold (i.e., with $n=p$ ) [3,7,8]. The fact that our method operates efficiently in the regime $p = \Omega(n)$ represents already a significant improvement over these existing works.

Additionally, our method is already complexity-optimal for the setting $p = \Omega(n)$ , which encompasses a wide range of important applications. We further emphasize that the aim of this work is to address scenarios where the cost of retraction is a critical bottleneck. In contrast, when $p$ is significantly smaller, the retraction cost is relatively insignificant compared to other matrix operations, making retraction efficiency less critical in such cases.

In summary, our work introduces a principled framework for improving the efficiency of retraction, particularly in the challenging regime of $p = \Omega(n)$ . We believe these developments are significant both theoretically and empirically, which pave the way for future research towards scalable Riemannian optimization.

[1] Wen & Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming. 2013.

[2] Helfrich et al. Orthogonal recurrent neural networks with scaled Cayley transform. ICML 2018.

[3] Massart & Abrol. Coordinate descent on the orthogonal group for recurrent neural network training. AAAI 2022.

[4] Huang et al. Orthogonal transformer: An efficient vision transformer backbone with token orthogonalization. NeurIPS 2022.

[5] Liu et al. Parameter-efficient orthogonal finetuning via butterfly factorization. ICLR 2024.

[6] Ma et al. Parameter efficient quasi-orthogonal fine-tuning via givens rotation. ICML 2024.

[7] Shalit & Chechik. Coordinate-descent for learning orthogonal matrices through Givens rotations. ICML 2014.

[8] Ablin & Peyré. Fast and accurate optimization on the orthogonal manifold without retraction. AISTATS 2022.

审稿意见

评分: 5置信度: 42024-11-03

The paper proposes a randomized submanifold algorithm for solving optimization problems with orthogonality constraints. The authors provide convergence guarantees for the proposed algorithm and conduct numerical experiments. Theoretical and numerical comparisons with existing literature are also presented.

优点

The proposed randomized submanifold method includes several coordinate descent methods as specific examples and allows flexibility in setting the size of the submanifold.
Convergence is established in both expectation and high-probability settings, with clear derivations provided.

缺点

The algorithm does not achieve exact convergence, potentially due to the stochasticity in subspace selection. Is there a way to ensure exact convergence, for example, by adding restrictions on the sampling process?
It appears that the total computational complexity shows no improvement compared to Riemannian gradient descent and may even be higher when using orthogonal sampling. Additionally, the algorithm only converges to a neighborhood rather than a first-order stationary point.
The numerical comparisons are insufficient. It would be helpful to see the effects of different submanifold sizes $r$ and small $p$ values.

问题

Theorem 1. How to ensure $X_k$ remains inside $U$ under $X_{k_0} \in U$ ? The stochasticity induced by sampling seems problematic. Purely assuming all $X_k \in U$ may be too strong.
Remark 3. The retraction is at cost of $\mathcal{O}(np^2)$ , while calculating the Euclidean gradient is easy to be $\mathcal{O}(n^2p)$ flops, as seen in problems like PCA. Why does the Riemannian gradient descent become impractical?
Theorem 2. Is it possible to design a verison of submanifold method that has exact convergence rather than high probability.
Figure 2. Many Stiefel maniofold applications are with small $p$ . Could you also add some numerical experiments on small $p$ to see the comparsions?
Figure 5. The improvement on the accuracy appears marginal. The early-stage superior performance might also be achieved by using a larger step size.

评论- Responses to Reviewer hfHE (Part 1)

2024-11-20

We sincerely thank the reviewer for acknowledging the strengths of our work and for providing constructive comments and questions. Below, we provide detailed responses to each of the points raised. All revisions are highlighted in red in the revised manuscript.

1. (W1 + Q3) On the possibility of exact convergence.

Yes. We have now (in Appendix C of the revised version) derived an exact convergence of our method by adding a condition on the sampling. In particular, we design a two-loop procedure for RSDM where we construct a set of $P_k^s(r), s=0,...,S-1$ , for each outer iteration $k$ . Then as long as the sampling of $P_k^s(r)$ satisfy some non-degeneracy condition regarding the gradient norm (eq. (14)). Then we can show the exact convergence rate of $O(1/k)$ .

We also give one such example of permutation sampling that satisfies this condition, i.e., by ensuring a non-overlapping over the set of (unique) truncated permutation matrix (details in Appendix C).

These results confirm the possibility of exact convergence of our proposed RSDM.

2. (W2) Total complexity shows no improvement compared to Riemannian gradient descent.

We highlight that all subspace based methods (including coordinate descent, subspace gradient) cannot achieve improved total complexity over gradient descent for general nonconvex functions, either in the Euclidean space [1] or on Riemannian manifolds [3,4].

However, this does not de-merit subspace based methods because when $n,p$ becomes extremely large, our proposed RSDM (with permutation sampling) can make progress once $O(nr^2)$ operations, whereas Riemannian gradient descent (RGD) can only make progress after $O(np^2)$ operations.

It is also important to note that the complexities presented are worst-case complexities and may not reflect the actual runtime behavior. In fact, the per-iteration cost of RSDM, i.e., $O(nr^2)$ is dominated by matrix multiplication whose cost could be much lower when exploiting modern software, such as GPU. However, the cost of RGD, i.e., $O(np^2)$ is dominated by non-standard linear algebra operations (by retraction), which is not as easily accelerated. This suggests that in practice, the total complexity of RSDM could be much lower than RGD.

In our experiments, we have indeed demonstrated our proposed RSDM consistently requires lower runtime compared to Riemannian gradient descent on various problem instances.

3. (W2) "The algorithm only converges to a neighbourhood rather than a first-order stationary point"

This is in fact incorrect. As we have shown in Theorem 2 that RSDM under both permutation and orthogonal sampling converges to a first-order stationary point almost surely.

4. (W3 + Q4) Experiments on different $r$ and small $p$ .

Regarding the choice of submanifold sizes $r$ , we have already shown in Figure 3(a) on the performance of proposed RSDM with different $r$ . We observe there exists a range of $r$ values where RSDM performs better than Riemannian gradient descent (RGD) method.

For settings with small $p$ , we have now conducted additional experiments on PCA problem under a range of $p$ values: $p =100, 300, 500, 800, 1000, 1200, 1500, 1800$ with $n = 2000$ fixed. In addition, we have evaluated the performance of RSDM with different $r$ values for each $p$ . The experiment results are presented in Appendix D of the revised manuscript.

The results demonstrate that indeed the performance gap of proposed RSDM over RGD decreases when $p$ becomes smaller, as predicted by our theoretical analysis. However, RSDM is still able to show faster convergence than RGD across all settings, except for the case when $p = 100$ . This verifies the benefits of RSDM over RGD across a range of $p$ . Nevertheless, when $p$ is significantly smaller, the cost of retraction becomes negligible, which can lead to RGD outperforming RSDM in such cases.

In our future work, we believe RSDM can be enhanced for scenarios where $p$ is significantly small, e.g., by designing an adaptive submanifold selection scheme such that RSDM behaves similarly to RGD when $p$ is small while retaining its advantages for larger $p$ .

5. (Q1) How to ensure $X_k$ remains inside $U$ under $X_{k_0} \in U$ ?

Thank you for your thoughtful question. In Theorem 1, we can ensure $X_k$ remains in $U$ almost surely when $X^*$ is an isolated local minima in $U$ . This is because as we have shown in Theorem 2, RSDM converges almost surely in gradient norm. Thus, we can show $F(X_k)$ converges to $F(X^*)$ (once $X_k \in U$ ) and by quadratic growth condition (equivalent to PL condition for $C^2$ functions) in [2], we have $||X_k - X^* || \rightarrow 0$ almost surely. As a result, we can guarantee all $X_k \in U$ almost surely for $k \geq k_0$ .

We have now modified the Theorem 1 and 2 in this regard. We have also added the proof accordingly in the Appendix.

评论- Responses to Reviewer hfHE (Part 2)

2024-11-20

6. (Q2) The retraction is at cost of $O(np^2)$ , while Euclidean gradient is easy to be $O(n^2 p)$ as in PCA.

Thank you for the question. We would like to highlight again that the retraction requires non-standard linear algebra operations beyond matrix multiplication. Therefore, the retraction computation cannot be easily accelerated on modern hardware, such as GPUs. As a result, retraction has been identified as the main bottleneck for Riemannian optimization with orthogonality constraints (see [5,6]).

In contrast, the Euclidean gradient often relies on matrix multiplication that can be parallelized and significantly accelerated. Additionally, automatic differentiation can also largely improve the computation efficiency of gradient computation in practice. Therefore, compared to retraction, complexity of gradient computation is generally less of a concern.

7. (Q5) The improvement on the early-stage accuracy can be achieved using larger step size.

Thank you for the comment. While using a larger step size can improve early-stage accuracy, it also introduces significant fluctuations during the later stages of training. For this experiment, we carefully tuned the step size to ensure the best overall performance of both RSDM and RGD across all iterations.

Reference

[1] Kozak, D., Becker, S., Doostan, A., & Tenorio, L. A stochastic subspace approach to gradient-free optimization in high dimensions. Computational Optimization and Applications. 2021.

[2] Rebjock, Q., & Boumal, N. Fast convergence to non-isolated minima: four equivalent conditions for $\mathrm {C}^ 2$ functions. Mathematical Programming. 2024.

[3] Han, A., Jawanpuria, P., & Mishra, B. Riemannian coordinate descent algorithms on matrix manifolds. In ICML 2024.

[4] Gutman, D. H., & Ho-Nguyen, N. Coordinate descent without coordinates: Tangent subspace descent on Riemannian manifolds. Mathematics of Operations Research. 2023.

[5] Ablin, P., & Peyré, G. Fast and accurate optimization on the orthogonal manifold without retraction. In AISTATS 2022.

[6] Ablin, P., Vary, S., Gao, B., & Absil, P. A. Infeasible deterministic, stochastic, and variance-reduction algorithms for optimization under orthogonality constraints. Preprint. 2023.

评论- Looking forward to your reply

2024-11-25

Dear Reviewer hfHE

We sincerely appreciate your great efforts in reviewing our work and your valuable comments. Your insights have been incredibly helpful in improving our submission.

Best,

Authors

审稿意见

评分: 6置信度: 52024-11-04

The paper proposes a randomized Riemannian submanifold method for solving optimization problems constrained to the set of $n\times p$ orthogonal frames, also called the Stiefel manifold. The main innovation of the algorithm is to randomly select an $r$ -dimensional subspace, in which the current iterate is rotated in a way that locally decreases the objective. In this way, the method can be seen as a Riemmanian analogue to the random block coordinate descent for Euclidean optimization.

There are two sampling distributions considered for the random selection of the $r$ -dimensional subspace: Haar uniform sampling on the orthogonal group and random selection of indices (referred to in the paper as a random permutation matrix). Both of these have the complexity of $nr^2$ although the latter can be computed in a more efficient way due to not needing to perform QR decomposition.

The paper provides global convergence analysis to critical points and local to a local minima (under PL-ineq.) in expectation and also in high-probability. The method can be also extended/analyzed to the stochastic variant when the objective gradients are random with bounded variance.

优点

The idea of updating the iterate only in a subspace by rotations is not yet well explored and practically useful generalization of the random Riemannian coordinate descent. The paper provides extensive convergence analysis for the methods (although I haven't checked the proofs in the appendices in great detail). The paper is also well written and clearly explains the main concepts. There are multiple numerical experiments in the paper.

缺点

Writing comments

Although the paper is well written overall, I have some remarks on formulation of certain ideas:

Line 161: the authors state that the random submanifold is defined by the random orthogonal matrix $P_k$ . This is not true strictly speaking, in fact, I think it is only the first $r$ -columns of the matrix $P_k$ that define the subspace for the rotation.
Adding on the remark above: Isn't it then enough to generate only a random orthogonal frame from $St(n,r)$ that defines the subspace?
Lines 172 - 178: This part describes one of the the main technical observations: that it is possible to do a local approximation of the $\tilde F(Y)$ by computing retraction around an identity matrix. Can this be theoretically shown that this is a local approximation? It is not immediately clear to me why this is the case?
Eq. 4: there is a mistake/typo after the first equality
Line 198: this is very minor remark, but does the order of the first $r$ indices in the sampled permutation matrix matter? In this sense, wouldn't it be sufficient to sample $r$ indices without replacement instead of the permutation? Practically this might be the same so it might not matter.

References on sketching

This method seems to be related to random subspace sketching. For example a very recent work of (Shustin & Avron, 2024; https://www.jmlr.org/papers/volume25/21-1022/21-1022.pdf) considers random sketching algorithm for fast optimization over orthogonal constraints. This would be good to discuss in the paper (if indeed related), and possibly also to compare against in the numerical experiments.

Numerical experiments

The paper provides numerous numerical experiments but I would like to see the study of dependency on the choice of the random subspace dimension $r$ . In the first experiment, the orthogonal procrustes, I am missing the information of the choice of $r$ altogether. I would also like to see a discussion on the computational cost of the random Haar uniform sampling of O(n) as I imagine this will be computationally complex operation.

Overall, I like the main paper idea, but I am missing more direct comparison with sketching methods in Riemnannian setting as well as the explanation why we can expect (4) to hold as a good first order approximation to minimizing $\tilde F$ . I would also like to see the numerical section improved with more extensive discussion on the choice of $r$ and how it relates to the per-iteration cost.

问题

Could you use fast random subsampling methods, such as random Fast JL transform?

评论- Responses to Reviewer 8cig

2024-11-20

We thank the reviewer for the thoughtful feedback and general appreciation of our work. Here are our detailed responses to the comments. All revisions are highlighted in red in the revised manuscript.

1. "Only the first $r$ -columns of the matrix $P_k$ that define the subspace for the rotation."

We appreciate the reviewer for pointing this out. Indeed the random submanifold is only determined by the first $r$ rows of $P_k$ . To address this, we have explicitly clarified this by specifying "the first $r$ rows of $P_k$ " in line 173 of the revised manuscript.

2. "Enough to generate only a random orthogonal frame from St(n,r) that defines the subspace?"

This is correct. We have already addressed this in lines 209–211 for the uniform orthogonal sampling. Additionally, we have now clarified this in lines 218–219 for the permutation sampling, where only the first $r$ indices are relevant for defining the subspace.

3. On the local approximation of $\tilde F(Y)$ .

As discussed in Section 3, we wish to minimize $\tilde F(Y) = F(U(Y) X_k)$ locally around $Y = I$ such that the updated $X_{k+1} = U(Y) X_k$ is not far from $X_k = U(I) X_k$ . Due to the proximity of $Y$ to $I$ , minimizing the local approximation of $\tilde F(Y)$ is natural, which gives rise to the Riemannian gradient update of $Y$ .

Based on Riemannian gradient update, we have rigorously derived convergence in expectation (Theorem 1), in high probability and almost surely (Theorem 2). These results justify that minimizing the local approximation is both reasonable and effective.

4. Mistake/typo after the first equality in Eq. 4

After carefully reviewing the equation, we have not identified any mistakes or typos to the best of our knowledge. We would be greatly appreciated if the reviewer could be more specific on the potential mistakes/typos so that we can address and provide clarification.

5. On the order of the first $r$ indices of permutation.

The order of the indices does not matter and one can sample $r$ indices without replacement. This is exactly what we have implemented in the experiments. We have now added this remark in line 220 of the revised version.

6. Comparison to random subspace sketching [1].

Thank you for pointing out the connection to random subspace sketching method. However, the reference [1] is not relevant to our setting to the best of our knowledge.

In their paper (Page 3, second paragraph), the setting is optimization over generalized Stiefel manifold with constraint $X^T B X = I$ , where $B = Z^T Z + \lambda I$ with $X \in R^{d \times p}$ , $Z \in R^{n \times d}$ with $n >> p$ . The aim of [1] is to develop a sketching method to reduce complexity dependency on $n$ and improve the conditioning of the optimization problem. However, we focus on when $B = I$ and to reduce the retraction complexity of $O(dp^2)$ . Hence, their method is not directly applicable to our setting.

We have now included a discussion to this work in the Relate work section (Section 1.1).

7. Experiments on the choice of $r$ . Missing information of $r$ for procrustes problem.

We have already provided an experiment comparing the different choices of $r$ for the PCA problem. See Figure 3(a), where we compare various choices of $r$ for both orthogonal and permutation sampling. We have noticed that there exists a range of $r$ that can lead to improvement over the Riemannian gradient descent method. In addition, we have performed additional experiments on varying both $r$ and $p$ for the PCA problem. The results are included in Appendix D of the revised version.

Regarding the choice of $r$ for Procrustes problem, we set $r = 150$ for the small-size problem and $r = 900$ for the large-size problem. We have now added this information in the revised version.

8. Discussion on the computational cost on Haar uniform sampling on $O(n)$ .

As we have discussed in Section 4, it is not necessary to sample the entire orthogonal matrix from $\mathcal{O}(n)$ . We only need to sample the first $r$ rows of the orthogonal matrix, which can be efficiently achieved using QR decomposition (via the Gram-Schmidt process) on a randomly generated Gaussian matrix. The computational cost of this sampling process is $O(nr^2)$ .

In our experiments, we observe that despite the higher computational cost of orthogonal sampling compared to permutation sampling, RSDM with orthogonal sampling can still achieve better performance than Riemannian gradient descent.

9. "Use fast random subsampling methods, such as random fast JL transform?"

Thank you for the comment. We believe random fast JL transform is not directly applicable in our setting due to the orthogonality constraint we address in this paper. Nevertheless, we believe this is an interesting direction to explore in the future.

[1] Shustin, B., & Avron, H. (2024). Faster Randomized Methods for Orthogonality Constrained Problems. JMLR, 25(257), 1-59.

评论- Brief reply

2024-11-20

Thank you for the reply to my comments. Few quick comments:

I apologise for my mistake: there is not typo in eq (4). Initially I thought that $\tilde{F}_f(I_r)$ is of size $n\times p$ which confused me.
Thank you for relating your work to the work of (Shustin & Avron, 2024). I see they are not immediately comparable, although, if in their problem $B=Z^T Z$ is chosen to subsample random indices of $I_r$ , it might translate to your setting. Since the works are not directly comparable, I leave it up to the authors opinion whether mentioning of (Shustin & Avron, 2024) is needed in their paper.
Re: cost of computation:
- In the complexity estimate in the introduction you do not mention the cost of computing the gradient which is $O(npr)$ for Haar and $O(nr^2)$ for permutation as these are ``fast'' matrix multiplications.
- This is reasonable, but it might be a good idea to mention it in the list above Sect 1.1, where you write: ``This reduces the complexity of non-standard linear algebra operations from $O(np^2)$ to $O(r^3)$ , where $r$ is the dimension of the submanifold selected.''. Could you consider specifying here something about the cost of sampling, maybe along the lines: This reduces the complexity of non-standard linear algebra operations from $O(np^2)$ to $O(r^3)$ in the case of the uniform random subsampling of the canonical basis at the added cost of only fast matrix multiplication on the order of $O(npr)$ ?

I will address the other changes in the paper in more detail at a later point.

评论- Thank you for the prompt reply

2024-11-20

On the additional complexity apart from retraction.

We thank the reviewer for the prompt reply. We would like to point out that the gradient complexity is also reduced from $O(np^2)$ (of Riemannian gradient descent) to $O(nr^2)$ or $O(npr)$ under permutation and orthogonal sampling strategies respectively. However, orthogonal sampling incurs an additional cost of $O(nr^2)$ by QR decomposition during the sampling process. Meanwhile, the complexity of sampling a permutation matrix is negligible.

We have now added the remark in the introduction section above Section 1.1.

评论- Looking forward to your reply

2024-11-25

Dear Reviewer 8cig

We sincerely appreciate your great efforts in reviewing our work and your valuable comments. Your insights have been incredibly helpful in improving our submission.

We would like to kindly follow up as we have not heard back from you since our last response. We are keen to know whether our responses have sufficiently addressed your concerns. If there are any remaining questions, please let us know, and we would be happy to provide further clarification.

Best,

Authors

2024-11-25

Thank you for clarifying this. I really like this paper's idea and also authors' response to reviewers' comments. I decided the increase the score of my review.

I also fully agree with the authors reply to the reviewer xyJ5 that the case of $n=p$ is an important one and that the argument that the retractions are fast enough when $p\ll n$ is subjective since ``fewer columns'' is a subjective statement. One could take this argument to an extreme by considering $p=1$ , when the retraction corresponds to just a norm normalisation of a vector, but, even though it is very fast to compute, this is not where the useful structure of the orthogonality is employed.

评论- Thank you for your positive feedback

2024-11-26

We are greatly encouraged by your positive feedback on our work. Your insightful comments have been highly constructive in improving our submission. We also sincerely appreciate you sharing your perspectives on small $p$ , which further reinforces the significance and necessity of our contributions.

Once again, thank you for your invaluable time and thoughtful feedback.

评论- Global response and summary of revisions

2024-12-01

Dear AC and Reviewers,

We thank all the reviewers for the thoughtful reviews. We have individually responded to each reviewer and updated a revision. For your convenience, here we provide a summary of our major revisions to the manuscript.

We have modified Theorem 1 & 2 by removing the assumption that iterates stay in the local neighborhood where PL condition holds. Instead, we show the iterates are bounded in the neighborhood almost surely when $X^*$ is an isolated local minima.
In Appendix C, we prove that proposed RSDM achieves exact convergence by requiring a non-vanishing condition on the sampling process. Further, we show an example of such sampling process that satisfies the condition.
In Appendix D, we have conducted additional experiments on various settings of $p$ for the PCA problem and observed that RSDM shows superior convergence across a range of $p$ values.
In Appendix E, we have compared our method to a related work RSSM (Cheung et al., 2024) in details. We demonstrate our method not only achieves lower per-iteration complexity than RSSM, but also allows easy generalization to other quotient manifolds. We have also conducted numerical comparisons with RSSM. The experiments suggest our method indeed attains faster convergence than RSSM.
In Appendix F, we have generalized the analysis of stochastic settings to mini-batch settings, and have derived the convergence guarantees in expectation and with high probability.
In Appendix G, we have compared our method to OBCD (Yuan, 2023) in details. We revealed a critical difference that OBCD employs (projected) Euclidean gradient descent update for the subproblem while our method employs Riemannian gradient update. This leads to significant differences in the analysis techniques. Numerically, we conducted additional experiments to verify that our method converges significantly faster than OBCD due to our use of more intrinsic updates.

We sincerely appreciate the reviewers’ valuable feedback and constructive suggestions, which have helped improve the quality and presentation of our work.

We thank the reviewers who have responded to our rebuttal. Given the deadline of the discussion period is approaching, we would like to kindly inquire if there are any remaining questions, particularly from reviewers who have not yet responded. We are more than happy to address any additional comments to further clarify and enhance the understanding of our contributions.

Best regards,

Authors

AC 元评审

2024-12-27

The paper develops and analyzes a Riemannian optimization method for orthogonal constrained optimization problems which restricts to a random submanifold at each iteration. This can be considered a generalization of random subspace / coordinate descent to Riemannian manifolds. The main benefit of this approach is to reduce the per-iteration complexity of Riemannian optimization, since projection onto (or more generally, retraction to) a lower dimensional submanifold can be less computationally costly. The paper studies two methods for introducing random submanifolds: random orthogonal transformation and random permutation. Random permutation is particularly effective in reducing the per-iteration cost from O(np^2) to O(nr^2) where r is the dimension of the random submanifold.

The main strength of the paper is that it produces a method with cheap iterates - especially when one uses permutation sampling. This method is especially applicable to learning high dimensional orthogonal matrices (i.e., p = n). Experiments show improvements in function value versus wall clock time compared to methods that use full gradients as well as a competing random submanifold method [RSSM]. As discussed below, the main reviewer concerns related to the overall complexity analysis (which currently shows an improvement when p = \Omega(n)) and the theory, which (at least in the analysis of fast convergence under PL) requires iterates to stay in a neighborhood U. After discussion and considering author responses, reviewers retained a mixed evaluation of the paper, placing it below the bar for acceptance.

审稿人讨论附加意见

Reviewers produced a mixed evaluation of the paper, praising its clarity, and noting that it proves convergence both in expectation and with high probability. Several issues were raised on review, including

The overall complexity of the method ( O(n^3/\epsilon^2) to achieve \epsilon accuracy, compared to gradient methods which require O(np^2/ \epsilon^2) ) [xyJ5, 8cig]
Details of the theoretical analysis (iterates remain in neighborhood) [hfHE, Cgj8]
Experimental evaluation of varying rank, and comparison to previous work on random submanifolds
Clarity issues [8cig]

The discussion and revision addressed concerns around the clarity of the paper, and clarified the time complexity accounting in the main document. In particular, reviewer [8cig] found that the revision improves clarity. The discussion also points out that (a) cheap iterates can be helpful even if the overall worst case complexity is not improved, and (b) in some circumstances (such as orthogonal procrustes) p is comparable to n.

The theoretical results on linear convergence under PL also evoked some discussion; the initial submission assumed that this was the case. The revision added argumentation which aimed to show, using almost sure convergence in the gradient norm, that the iterates actually converge to X^, the unique minimizer in the neighborhood U. At least one reviewer retained concerns about these revised arguments, since almost sure convergence of the gradient norm may not directly imply subsequence convergence to the point X^.

Reviewers had a mixed reaction to these arguments, with several [xyJ5, hfHE] maintaining their overall evaluation, while [cgi8, Cgj8] raised their scores.

最终决定Reject

2025-01-22

Reject