PaperHub
Overall rating: 5.3 / 10 (Rejected, 4 reviewers)
Individual ratings: 6, 4, 6, 5 (min 4, max 6, std. dev. 0.8)
Confidence: 3.3 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.3
NeurIPS 2024

Block Coordinate Descent Methods for Optimization under J-Orthogonality Constraints with Applications

Submitted: 2024-05-14 · Updated: 2024-11-06

Keywords: Orthogonality Constraints, Nonconvex Optimization, Nonsmooth Composite Optimization, Block Coordinate Descent, Convergence Analysis

Reviews and Discussion

Review (Rating: 6)

The paper introduces the JOBCD (J-Orthogonal Block Coordinate Descent) algorithm, a novel method designed to tackle optimization problems under J-orthogonality constraints. JOBCD includes two variants: GS-JOBCD (Gauss-Seidel strategy) and VR-J-JOBCD (Jacobi strategy with variance reduction). Theoretical analyses establish the algorithms' complexity and convergence, while extensive experiments show JOBCD's superior performance compared to state-of-the-art methods in various applications.

Strengths

The strengths of this paper are listed as follows:

Originality: This paper introduces JOBCD as a novel approach to handling J-orthogonality constraints. It offers GS-JOBCD and VR-J-JOBCD, showcasing flexibility and innovation in optimization strategies.

Quality: This paper provides comprehensive complexity and convergence analyses. Extensive experiments demonstrate superior performance on real-world and synthetic data.

Clarity: The structure is logical.

Significance: This work is relevant to various statistical learning and data science fields.

Weaknesses

Some proofs for Section 4 are hard to follow.

Questions

  1. Do the optimization problems in (S3) of Algorithm 1 and Algorithm 2 require an especially exact solution? What happens if they are solved approximately?
  2. Line 34: The notation $\mathcal{Q}(X^+; X)$ is undefined.
  3. Line 109: Should the notation $\mathrm{B}$ be replaced by $\mathrm{B}^t$?
  4. Line 124, formula (4): Should $I_2$ be replaced by $I_4$?

Limitations

NA

Author Response

Dear Reviewer hZf8, we appreciate your dedication to reviewing our manuscript. In the following, we will respond to your concerns point by point.

Question 1. Some proofs for Section 4 are hard to follow.

Response: We now provide the proof sketch for the global convergence of the JOBCD algorithm (including both GS-JOBCD and VR-J-JOBCD).

  1. We use the Lipschitz continuity condition of the gradient and the properties of the optimal point to establish a sufficient decrease condition.

  2. We recursively apply the sufficient decrease condition to derive an inequality relationship between the $\epsilon$-BS-point and $[f(\mathbf{X}_0) - f(\bar{\mathbf{X}})]$ (a schematic version is sketched after this list).

  3. Based on the settings of the variance reduction strategy, we determine the number of arithmetic operations per iteration, and then derive the complexity of the arithmetic operations.
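Schematically, steps 1 and 2 combine as follows (a hedged sketch; the constant $c > 0$ is a placeholder for the explicit constant in the paper, which depends on $\theta$ and the block Lipschitz parameters):

$$f(\mathbf{X}^{t}) - f(\mathbf{X}^{t+1}) \;\geq\; c\,\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_F^{2} \quad\Longrightarrow\quad \min_{0 \leq t < T}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_F^{2} \;\leq\; \frac{1}{cT}\sum_{t=0}^{T-1}\big(f(\mathbf{X}^{t})-f(\mathbf{X}^{t+1})\big) \;\leq\; \frac{f(\mathbf{X}_0)-f(\bar{\mathbf{X}})}{cT},$$

so an $\epsilon$-BS-point is reached after on the order of $[f(\mathbf{X}_0) - f(\bar{\mathbf{X}})]/(c\epsilon)$ iterations.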

We now provide the proof sketch for strong convergence of the JOBCD algorithm under the KL Assumption:

The paper establishes the finite length property of the algorithm iterates, which characterizes the algorithm's behavior within a finite number of steps (e.g., convergence rate and error bounds). We use $\sum_{i=1}^{t} \|\mathbf{X}^{i+1} - \mathbf{X}^{i}\|_F$ to describe the finite length property.

  1. Since we can easily obtain an inequality relationship between $\mathbf{V}^t$ and $\mathbf{X}^t$, we first show, via the sufficient decrease condition from the global convergence proof, that $f(\mathbf{X}^t) - f(\mathbf{X}^{t+1})$ can be bounded in terms of $\mathbf{V}^t$.

  2. Using the KL inequality, we establish the relationship between $f(\mathbf{X}^t) - f(\mathbf{X}^{t+1})$ and the Frobenius norm of the Riemannian gradient of $f(\mathbf{X}^t)$.

  3. We prove the relationship between the Frobenius norm of the Riemannian gradient of $f(\mathbf{X}^t)$ and the Frobenius norm of the Riemannian gradient of the majorization function used in the algorithm.

  4. We prove the relationship between the Frobenius norm of the Riemannian gradient of the majorization function used in the algorithm and $\mathbf{V}^t$.

  5. Combining steps 1-4, we obtain an inequality involving only the variable $\mathbf{V}^t$, with all other terms being constants. From this, using basic tools such as the triangle inequality, we construct a recursive formula to complete the proof; a schematic summary follows below.
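Schematically, the chain of inequalities in steps 1-4 can be summarized as follows (a hedged sketch; $\varphi$ denotes the desingularizing function in the KL assumption, and $\kappa_1, \kappa_2 > 0$ are placeholders for the coefficients derived in Lemmas E.2-E.5):

$$\varphi'\big(f(\mathbf{X}^t)-f(\bar{\mathbf{X}})\big)\cdot\operatorname{dist}\big(0,\nabla_{\mathcal{J}} f(\mathbf{X}^t)\big) \geq 1, \qquad \operatorname{dist}\big(0,\nabla_{\mathcal{J}} f(\mathbf{X}^t)\big) \leq \kappa_1 \|\mathbf{V}^t\|_F, \qquad f(\mathbf{X}^t)-f(\mathbf{X}^{t+1}) \geq \kappa_2 \|\mathbf{V}^t\|_F^2.$$

Combining these with the concavity of $\varphi$ yields the finite length property $\sum_{t} \|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_F < \infty$ in step 5.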

Question 2. Do the optimization problems in (S3) of Algorithm 1 and Algorithm 2 require an especially exact solution? What happens if they are solved approximately?

Response: We require an exact solution; otherwise, our algorithm will lose the strong theoretical optimality guarantee of the BS point.

Moreover, solving these problems approximately makes it difficult to measure the degree of approximation to the non-convex optimization problem, except by using the definition of critical points.

We solve the subproblems in Algorithm 1 and Algorithm 2 exactly to obtain stronger BS points. However, if we use gradient descent to solve these subproblems, we will only achieve weak optimality at a critical stationary point.

Question 3. Line 34: The notation Q(X+;X) is undefined.

Response: $\mathcal{Q}(X^+; X)$ should be the right-hand side of Inequality (2). Thank you for pointing this out.

Question 4. Line 109: Should the notation B be replaced by Bt?

Response: It should be $B^t$. Thank you for pointing this out.

Question 5. Line 124, formula (4): Should I2 be replaced by I4?

Response: It should be $I_4$. Thank you for pointing this out.

Comment

Thank you to the authors for their responses. I will maintain my score.

Review (Rating: 4)

This paper proposes two block coordinate descent (BCD) methods for solving J-orthogonality constrained problems. One is of Gauss-Seidel type; the other is of Jacobi type and additionally addresses the finite-sum setting using variance reduction strategies. Convergence guarantees are proved under KL conditions. Numerical experiments show the advantages of the proposed algorithms.

Strengths

The proposed algorithm decomposes the matrix variable into a row-block structure, yielding a block coordinate descent algorithm with small-sized subproblems. The numerical performance is very impressive.

Weaknesses

  1. This paper is based on the paper " [51] Ganzhao Yuan. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. ArXiv, abs/2304.03641, 2023." The main difference is the constraint in this paper becomes J-orthogonality constraint. However, the framework follows almost the same as [51]. The authors should highlight the novelty of the algorithm or difficulty in the extension.

  2. The authors of the reference [31] may be wrong. Besides, the UMCM algorithm in [31] solves an orthogonality-constrained problem. Is there any difference in the implementation when solving the J-orthogonality problem? The objective value of UMCM is far from the JOBCD method. I'm curious about the reasons.

  3. In the numerical experiments, how do you select the subset from the dataset (see line 309)?

Questions

  1. Why cannot we design variance reduced algorithm for Gauss-Seidel type BCD?
  2. line 323: it claims that the lower objective achieved by GS-JOBCD can be explained by the stronger stationary point result. However, it is based on the KL condition. I think it is not the main reason?
  3. line 324: infinite sum?

Limitations

N/A

Author Response

Dear Reviewer swMz, thank you for your efforts in evaluating our manuscript. In the following, we will respond to your concerns point by point.

Question 1. This paper is based on the paper " [51] Ganzhao Yuan. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. ArXiv, abs/2304.03641, 2023." The main difference is the constraint in this paper becomes J-orthogonality constraint. However, the framework follows almost the same as [51]. The authors should highlight the novelty of the algorithm or difficulty in the extension.

Response: This paper differs from the work of [51] in several key aspects:

  1. The nature of the problems differs: solutions to optimization problems with orthogonality constraints are compact, whereas solutions with $J$-orthogonality constraints can be unbounded.

  2. We extend the OBCD method [51] to handle optimization problems with J-orthogonality constraints, addressing a broader class of optimization problems.

  3. We introduce a parallel Jacobi strategy, marking the first application of modern parallelization techniques to BCD methods.

  4. We incorporate variance reduction strategies into the JOBCD framework, transitioning from a deterministic to a stochastic algorithmic framework.

  5. JOBCD presents, for the first time, the first-order optimality condition for optimization problems with J-orthogonality constraints, together with the tangent space of the underlying manifold and the corresponding convergence properties.

  6. The comprehensive consideration of all these aspects leads to a significantly higher level of difficulty in proving the properties of JOBCD compared to OBCD.

Question 2. The authors of the reference [31] may be wrong.

Response: Thank you for pointing this out. We will change it to: "[31] Nachuan Xiao, Xin Liu, and Ya-xiang Yuan. A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optimization Methods and Software, 37(4):1205–1241, 2022."

Question 3. Besides, the UMCM algorithm in [31] solves an orthogonality-constrained problem. Is there any difference in the implementation when solving the J-orthogonality problem? The objective value of UMCM is far from the JOBCD method. I'm curious about the reasons.

Response: We use Lemma 3.1 to design the UMCM algorithm. The derivation process is detailed in the comment: "Implementation of UMCM algorithm for Optimization Problem under J-orthogonality Constraints".

For the HEVP problem, we obtain the following equivalent unconstrained (quartic) optimization problem:

$$\min_{\mathbf{X} \in \mathbb{R}^{n \times n}} \operatorname{tr}(\mathbf{X}^{\top} \mathbf{C} \mathbf{X}) - \tfrac{1}{2} \langle \mathbf{J}\mathbf{X}^{\top}\nabla f(\mathbf{X}),\ \mathbf{X}^{\top}\mathbf{J}\mathbf{X} - \mathbf{J} \rangle,$$

where $\mathbf{C} = -\mathbf{D}^{\top}\mathbf{D}$ and $\nabla f(\mathbf{X}) = \mathbf{C}\mathbf{X}$.

In the experiments, we used the built-in Adagrad optimizer in PyTorch to perform gradient descent. However, the results showed that the solution obtained by solving the above unconstrained optimization problem did not always strictly satisfy the J-orthogonality constraint. Therefore, we dynamically adjusted the step size so that the UMCM solution was obtained within a certain J-orthogonal constraint, even though this might limit the objective value to some extent.
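For reference, a minimal PyTorch sketch of this procedure (not the exact experiment code: `D` is random here, the gradient follows the convention $\nabla f(\mathbf{X}) = \mathbf{C}\mathbf{X}$ stated above, and the dynamic step-size adjustment is reduced to a simple step-decay schedule):

```python
import torch

n, m = 20, 30
p = n // 2
D = torch.randn(m, n)
C = -D.T @ D                                      # f(X) = tr(X^T C X)
J = torch.diag(torch.cat([torch.ones(p), -torch.ones(n - p)]))

X = torch.eye(n, requires_grad=True)              # start from the identity matrix
opt = torch.optim.Adagrad([X], lr=1e-2)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)  # crude stand-in for the dynamic step size

for it in range(1000):
    opt.zero_grad()
    grad_f = C @ X                                # gradient convention from the response above
    # unconstrained (quartic) objective: tr(X^T C X) - 0.5 <J X^T grad f(X), X^T J X - J>
    obj = torch.trace(X.T @ C @ X) - 0.5 * torch.sum((J @ X.T @ grad_f) * (X.T @ J @ X - J))
    obj.backward()
    opt.step()
    sched.step()

violation = (X.T @ J @ X - J).abs().mean()        # constraint-violation metric
print(float(obj), float(violation))
```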

Question 4. In numerical experiment, how do you select subset from the dataset, see line 309.

Response: We performed uniform random sampling on the dataset. The Python sampling code is as follows:

import scipy.io as scio
import torch

# Load the sparse data matrix, uniformly sample cr*cc nonzero entries at random,
# and reshape them into a (cr, cc) dense tensor.
data = scio.loadmat('./sector_train.mat')
cr, cc = 500, 1000
selected_indices = torch.randint(0, data['x'].nnz, (cr * cc, 1)).squeeze()
tensor_var = torch.tensor(data['x'].data[selected_indices.numpy()]).view(cr, cc)

Question 5. Why cannot we design variance reduced algorithm for Gauss-Seidel type BCD?

Response:

  • We can design a variance reduction algorithm for GS-JOBCD. GS-JOBCD represents a partial-gradient method, while J-JOBCD represents a full-gradient method. Neither J-JOBCD nor GS-JOBCD dominates the other, similar to how the classic GS and Jacobi methods do not have one being better than the other.

  • J-JOBCD updates using the full gradient, making it simpler to utilize variance reduction strategies, and experiments have shown it to be faster. In contrast, the GS method only uses partial gradients, meaning each iteration of GS-JOBCD has a smaller computational cost but requires more iterations.

  • Although incorporating variance reduction strategies into GS-JOBCD is feasible, the implementation process would be more complex.

Question 6. line 323: it claims that the lower objective achieved by GS-JOBCD can be explained by the stronger stationary point result. However, it is based on KL condition. I think it is not the main reason?

Response:

  • The reason JOBCD achieves a stronger stationary point is due to the breakpoint search strategy used in the objective search, not the KL inequality.

  • The optimality analysis in Section 3 does not rely on the KL condition.

  • We only used the KL condition to prove the strong convergence in Section 4.2.

Question 7. line 324: infinite sum?

Response: It should be finite-sum. Thank you for pointing this out.

Comment

We consider minimizing the following differentiable function under J-orthogonality constraints:

$$\min_{\mathbf{X} \in \mathbb{R}^{n \times n}} f(\mathbf{X}), \quad \text{s.t. } \mathbf{X}^{\top} \mathbf{J} \mathbf{X} = \mathbf{J}.$$

We derive the Lagrangian function of the above problem with $\Lambda \in \mathbb{R}^{n \times n}$:

$$\mathcal{L}(\mathbf{X}, \Lambda) = f(\mathbf{X}) - \tfrac{1}{2} \langle \Lambda,\ \mathbf{X}^\top\mathbf{J}\mathbf{X} - \mathbf{J} \rangle.$$

Setting the gradient of $\mathcal{L}(\mathbf{X}, \Lambda)$ w.r.t. $\mathbf{X}$ to zero yields:

$$\nabla f(\mathbf{X}) - \mathbf{J} \mathbf{X} \Lambda = 0.$$

Multiplying both sides by $\mathbf{X}^\top$ and using the fact that $\mathbf{X}^\top \mathbf{J}\mathbf{X} = \mathbf{J}$, we have $\mathbf{J} \Lambda = \mathbf{X}^\top \nabla f(\mathbf{X})$. Multiplying both sides by $\mathbf{J}^\top$ and using $\mathbf{J}^\top \mathbf{J} = \mathbf{I}$, we have $\Lambda = \mathbf{J}\mathbf{X}^\top \nabla f(\mathbf{X})$. Thus, we obtain the following equivalent unconstrained optimization problem:

$$\min_{\mathbf{X} \in \mathbb{R}^{n \times n}} f(\mathbf{X}) - \tfrac{1}{2} \langle \mathbf{J}\mathbf{X}^\top \nabla f(\mathbf{X}),\ \mathbf{X}^\top\mathbf{J}\mathbf{X} - \mathbf{J} \rangle.$$

Finally, we solve it using a gradient-based approach, specifically employing the Adagrad optimizer built into PyTorch in our paper.

Comment

Thank you for the clarification. However, I noticed that in [31], the penalty function used is the augmented Lagrangian, whereas in this response, the Lagrangian is used. Additionally, you mentioned that the main difference between J-orthogonality and orthogonality lies in boundedness. Could you please indicate where the boundedness makes your analysis unique compared to [51]?

Comment

Question 9. The main difference between J-orthogonality and orthogonality lies in boundedness. Could you please indicate where the boundedness makes your analysis unique compared to [51]?

Response: To handle the J-orthogonal constraint, we introduce additional novel strategies in the proof of convergence. We illustrate these with examples:

  1. Lemma E.5: Based on the KL assumption, the key proof step derives the inequality relating $\operatorname{dist}(0, \nabla f^\circ(\mathbf{X}))$ and $\operatorname{dist}(0, \nabla_{\mathcal{J}} f(\mathbf{X}))$ via the optimal solution, characterized in Lemma E.4, of the tangent-space projection problem $\bar{\mathbf{Y}} = \operatorname{argmin}_{\mathbf{Y} \in T_{\mathbf{X}} \mathcal{J}} h(\mathbf{Y}) = \frac{1}{2} \|\mathbf{Y} - \mathbf{G}\|_F^2$. Since the orthogonal constraint is compact, we can directly obtain the closed-form solution for the orthogonal tangent space. However, for the J-orthogonal matrix, we cannot obtain the optimal solution. Therefore, in our proof, we establish an inequality based on the necessary (not sufficient) condition for the optimal solution: $h(\bar{\mathbf{Y}}) \leq h(\mathbf{G} - \mathbf{J} \mathbf{X} \mathbf{G}^\top \mathbf{X} \mathbf{J})$.

  2. Lemma E.2: In proving the "Riemannian Gradient Lower Bound for the Iterates Gap", we need to rely on the first-order optimality condition of the optimization problem under J-orthogonality constraints, constructed for the first time in Lemma 3.1, $0 = \nabla_{\mathcal{J}} f(\bar{\mathbf{X}}) = \nabla f(\bar{\mathbf{X}}) - \mathbf{J} \bar{\mathbf{X}} [\nabla f(\bar{\mathbf{X}})]^\top \bar{\mathbf{X}} \mathbf{J}$, to complete the proof. This results in a completely different key coefficient, $\phi$, for the lemma.

  3. Similarly, in proving Lemma E.3 regarding $\operatorname{dist}(0, \nabla_{\mathcal{J}} f(\mathbf{X}))$ and $\nabla_{\mathcal{J}} \mathcal{T}(\mathbf{I}_2; \mathbf{X}^t, B)$, there are also many novel aspects, with significant differences in the proof process and the key coefficients in the results.

Moreover, the unbounded nature of the optimization problem under J-orthogonality constraints adds significant complexity to the proofs related to the variance reduction stochastic strategy and the parallel strategy.

Comment

"Since the orthogonal constraint is compact, we can directly obtain the closed-form solution for the orthogonal tangent space." I don't think so. The closed-form solution is not due to the compactness of the orthogonal constraint. The tangent space has a nice decomposition formula; see [1]. Therefore, we can get the projection solution exactly. Indeed, as the tangent space is a linear subspace, as stated in your Lemma A.3, one can solve the projection problem using the standard projection technique onto a linear constraint.

By the way, which Riemannian metric did you use to define the Riemannian gradient? If the metric is induced from the Euclidean one, then it is the projection of the Euclidean gradient onto the tangent space, for which you have the expression of the Riemannian gradient in Lemma E.3.

Comment

Question 8. In [31], the penalty function used is the augmented Lagrangian, whereas in this response, the Lagrangian is used.

Response:

  • The unconstrained objective function obtained using the multiplier correction strategy is equivalent and well-defined whether or not it includes quadratic terms. We have chosen the simpler version.

  • Since we have already compared with the ADMM method based on the augmented Lagrangian function, using this more straightforward approach also helps to distinguish it from ADMM.

  • Now, we illustrate that the equivalent minimization function (without the quadratic term) is also well-defined for the HEVP problem. We have:

$$\min_{\mathbf{X} \in \mathbb{R}^{n \times n}} f(\mathbf{X}) - \tfrac{1}{2} \langle \mathbf{J}\mathbf{X}^\top \nabla f(\mathbf{X}),\ \mathbf{X}^\top\mathbf{J}\mathbf{X} - \mathbf{J} \rangle,$$

where $f(\mathbf{X}) = \operatorname{tr}(\mathbf{X}^\top\mathbf{C}\mathbf{X})$ and $\mathbf{C} = -\mathbf{D}^\top\mathbf{D}$.

In the above expression, the function $f(\mathbf{X})$ is quadratic, and the Lagrangian term contains a quartic function $-\langle \mathbf{J}\mathbf{X}^\top\mathbf{C}\mathbf{X},\ \mathbf{X}^\top\mathbf{J}\mathbf{X} \rangle$. Now, we show that this expression is always non-negative:

$$-\langle \mathbf{J}\mathbf{X}^\top\mathbf{C}\mathbf{X},\ \mathbf{X}^\top\mathbf{J}\mathbf{X} \rangle = -\langle \mathbf{J}\mathbf{X}^\top\mathbf{C}\mathbf{X},\ \mathbf{J} \rangle = -\langle \mathbf{X}^\top\mathbf{C}\mathbf{X},\ \mathbf{I} \rangle = -\operatorname{tr}(\mathbf{X}^\top\mathbf{C}\mathbf{X}).$$

Since $\mathbf{C} = -\mathbf{D}^\top\mathbf{D}$, the term $-\operatorname{tr}(\mathbf{X}^\top\mathbf{C}\mathbf{X})$ is always non-negative. Therefore, we conclude that this quartic term is always non-negative and dominates the other lower-order terms, ensuring that the objective function is well-defined.

  • Upon request, we also report the results for the UMCM algorithm including the quadratic term (i.e., using the augmented Lagrangian function). Notably, we consider the following minimization problem:

$$\min_{\mathbf{X} \in \mathbb{R}^{n \times n}} f(\mathbf{X}) - \tfrac{1}{2} \langle \mathbf{J}\mathbf{X}^\top \nabla f(\mathbf{X}),\ \mathbf{X}^\top\mathbf{J}\mathbf{X} - \mathbf{J} \rangle + \tfrac{\beta}{2} \| \mathbf{X}^\top\mathbf{J}\mathbf{X} - \mathbf{J} \|^2_F.$$

Please refer to the table below. These results will be included in the revised manuscript.

| HEVP dataset | cifar | CnnCaltech | gisette | mnist | randn10 |
| --- | --- | --- | --- | --- | --- |
| (m-n-p) | (1000-100-50) | (2000-1000-500) | (3000-1000-500) | (1000-780-390) | (10-10-5) |
| Lagrangian | -4.81e+02 (1.0e-06) | -8.61e+01 (8.6e-07) | -1.70e+06 (8.1e-07) | -2.79e+04 (1.0e-06) | -3.38e-01 (1.0e-06) |
| Augmented Lagrangian | -4.86e+02 (1.0e-06) | -7.33e+01 (3.4e-07) | -1.33e+06 (5.0e-07) | -2.80e+04 (1.0e-06) | -1.41e+00 (1.0e-06) |

| HEVP dataset | randn100 | randn1000 | sector | TDT2 | w1a |
| --- | --- | --- | --- | --- | --- |
| (m-n-p) | (100-100-50) | (1000-1000-500) | (500-1000-500) | (1000-1000-500) | (2470-290-145) |
| Lagrangian | -1.13e+02 (1.0e-06) | -1.28e+04 (6.8e-08) | -1.39e+03 (6.8e-07) | -1.73e+06 (6.8e-07) | -2.89e+02 (1.0e-06) |
| Augmented Lagrangian | -1.42e+02 (1.0e-06) | -1.11e+04 (4.9e-08) | -1.11e+03 (4.2e-07) | -1.38e+06 (4.3e-07) | -3.08e+02 (1.0e-06) |

Supplementary experiments for HEVP (limited to 30 s). Each cell reports $f(\mathbf{X}^t) - f(\mathbf{X}^0)$ for UMCM with the Lagrangian and augmented Lagrangian ($\beta = 10$) methods. The value in parentheses is the constraint violation $\tfrac{1}{n^2} \sum_{i,j} |\mathbf{X}^{\top}\mathbf{J}\mathbf{X} - \mathbf{J}|_{ij}$.

Comment

Regarding the UMCM method, the original algorithm should be implemented exactly, including the quadratic term, regardless of whether the subproblem is well defined or not. Deviating from this makes it a different algorithm altogether. I checked the code and noticed that the step size decreases by half every two iterations, which would degrade its performance. Since UMCM is not a stochastic algorithm, a diminishing step size is not necessary.

Comment

Question 11. "Since the orthogonal constraint is compact, we can directly obtain the closed-form solution for the orthogonal tangent space." I don't think so. The closed-form solution is not due to the compactness of the orthogonal constraint.

Response: It should be "For the orthogonality constraint which is compact, we can directly obtain the closed-form solution for the orthogonal tangent space".

Question 12. By the way, which Riemannian metric did you use to define the Riemannian gradient? If the metric is induced from the Euclidean one, then it is the projection of the Euclidean gradient onto the tangent space, for which you have the expression of the Riemannian gradient in Lemma E.3.

Response:

  • Projecting the Euclidean gradient onto the tangent space requires solving the following optimization problem: $\min_{\mathbf{Y}} \|\mathbf{Y} - \mathbf{G}\|_F^2, \ \text{s.t. } \mathbf{X}^\top \mathbf{J}\mathbf{Y} + \mathbf{Y}^\top \mathbf{J}\mathbf{X} = 0$. Although this problem involves linear constraints, obtaining a closed-form solution is challenging, so we chose to avoid solving it. In other words, we did not use the projection metric.

  • Following the strategy outlined by [Wen2013], we derived the first-order optimality condition $\mathcal{J}(\mathbf{X}) = 0$ for any feasible solution $\mathbf{X}$ to the optimization problem, using the first-order information of the objective function and the symmetric property of the multiplier. We refer to $\mathcal{J}(\mathbf{X})$ as the Riemannian gradient.

  • It should be noted that our JOBCD algorithm does not use the Riemannian gradient $\mathcal{J}(\mathbf{X})$, as we rely solely on the Euclidean gradient. The Riemannian gradient is employed only for theoretical analysis.

[Wen2013] Wen, Z., & Yin, W. (2013). A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1), 397-434.

Comment

Question 10. Regarding the UMCM method, the original algorithm should be implemented exactly, including the quadratic term, regardless of whether the subproblem is well defined or not. Deviating from this makes it a different algorithm altogether. I checked the code and noticed that the step size decreases by half every two iterations, which would degrade its performance. Since UMCM is not a stochastic algorithm, a diminishing step size is not necessary.

Response:

  • In the supplementary experiments, we essentially used a fixed step size of $10^{-5}$ for both methods, with and without the quadratic term, which is a relatively small step size.

  • During the experiment design, we tested various step sizes, including $\{10^{-3}, 10^{-5}, 10^{-7}\}$, and found that $10^{-5}$ generally yielded better results.

  • One possible reason for requiring such a small step size is that the penalty term in the objective function is a quartic function, which is highly sensitive to the step size scale, making a very small step size necessary.

  • In the actual experiments, we adopted a dynamic step size, gradually decreasing to $10^{-5}$ to achieve better outcomes.

Review (Rating: 6)

The paper proposes two block coordinate descent methods for minimization of a finite sum subject to J-orthogonality constraints: one based on a Gauss-Seidel strategy, the other based on variance reduction and a Jacobi strategy. The convergence is proved, with a global convergence rate of $O(N/\epsilon)$ and $O(\sqrt{N}/\epsilon)$ respectively, and a local convergence rate that depends on the desingularization in the KL-condition assumption.

Strengths

The algorithms proposed are novel and might be useful in practice: the update rules involve solving small-sized subproblems and are therefore very simple, and the convergence is proved theoretically under reasonable assumptions.

Weaknesses

The paper is relatively dense, and I find it a bit hard to keep track of all the terms introduced. For instance, the parameter theta is used in the algorithms but I’m not sure where it is introduced; in Assumption 4.8 KL function is mentioned but it’s not defined…

Questions

The blocks in the proposed JOBCD algorithms consist of just 2 indices. I’m wondering if there is benefit in updating more than 2 indices in each iteration?

I might miss something while reading the paper… what is the parameter theta in the algorithm 1?

Limitations

Yes, the paper discusses the assumptions of the theorems.

Author Response

Dear Reviewer JyL5, we appreciate your dedication to reviewing our manuscript. In the following, we will respond to your concerns point by point.

Question 1. For instance, the parameter theta is used in the algorithms but I’m not sure where it is introduced

Response: $\theta$ is first defined in Inequality (3) as a user-defined, strictly positive parameter, with a default value of $10^{-6}$. We introduce it to ensure convergence guarantees for the algorithm. It plays a significant role in our convergence theorems, such as Theorem 4.6 and Theorem 4.7.

Question 2. in Assumption 4.8 KL function is mentioned but it’s not defined…

Response: A KL function is a function that satisfies the inequality $\operatorname{dist}(0, \nabla f(X^{\prime}))\, \phi^{\prime}(f(X^{\prime}) - f(X)) \geq 1$. KL functions form a broad class, which makes Assumption 4.8 a relatively mild assumption. In particular, semi-algebraic functions satisfy the KL property; they are widely used in applications and include real polynomial functions, finite sums and products of semi-algebraic functions, and indicator functions of semi-algebraic sets.
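For completeness, a standard statement of the KL property reads as follows (given here in a generic, hedged form; the exact formulation in Assumption 4.8 of the paper may differ in minor details): a proper lower semicontinuous function $f$ satisfies the KL property at $\bar{X}$ if there exist $\eta > 0$, a neighborhood $U$ of $\bar{X}$, and a concave desingularizing function $\phi : [0, \eta) \to [0, \infty)$ with $\phi(0) = 0$, $\phi$ continuously differentiable on $(0, \eta)$, and $\phi' > 0$, such that

$$\phi'\big(f(X') - f(\bar{X})\big)\cdot \operatorname{dist}\big(0, \partial f(X')\big) \ \geq\ 1 \quad \text{for all } X' \in U \text{ with } f(\bar{X}) < f(X') < f(\bar{X}) + \eta.$$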

Question 3. The blocks in the proposed JOBCD algorithms consist of just 2 indices. I’m wondering if there is benefit in updating more than 2 indices in each iteration?

Response: There is no exact solution or theoretical guarantee when $k>2$. Initially, we considered the case $k>2$, but it introduces many challenging problems. We use $k=3$ as an example to illustrate. According to Proposition 2.2, when $k=3$, $p$ can take values in $\{0, 1, 2, 3\}$:

  1. When $p=0$ or $p=3$, the problem degenerates into an optimization problem under orthogonality constraints, for which exact solutions can be obtained based on existing research.

  2. When $p=2$, $U_1$ and $V_1$ are different $2 \times 2$ orthogonal matrices, requiring two sets of sine and cosine functions $\cos(x_1), \sin(x_1), \cos(x_2), \sin(x_2)$ for the modeling, while $c$ and $s$ need to be modeled using one set of hyperbolic functions $\cosh(x_3), \sinh(x_3)$. Although we can simplify as in the paper, using $\tan(x) = \sin(x)/\cos(x)$ and $\tanh(x) = \sinh(x)/\cosh(x)$, we ultimately still need to solve a 3-variable optimization problem in which $\tan(x_1), \tan(x_2), \tanh(x_3)$ are coupled together (e.g., $\tan(x_1)\tan(x_2)\tanh(x_3)$, $\tan(x_1)\tanh(x_3)^2$), and there is currently no exact method to solve such complex nonlinear relationships.

  3. When $p=1$, the problem to be solved is structurally similar to that in the $p=2$ case.

The above is a simplified analysis for the case $k=3$ (the $k=2$ case is contrasted below). Because exact solutions cannot be obtained, theoretical guarantees are lacking, and therefore this case is not presented in the paper. However, your suggestion could be a very interesting topic for future research.
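For contrast, the $k=2$ case handled in the paper reduces to a single-variable parametrization of the $2 \times 2$ update block (sketched below under the assumption that, ignoring possible reflection factors, the block is a plane rotation when both selected rows lie in the same signature block and a hyperbolic rotation otherwise):

$$p \in \{0, 2\}: \begin{pmatrix} \cos x & -\sin x \\ \sin x & \cos x \end{pmatrix}, \qquad p = 1: \begin{pmatrix} \cosh x & \sinh x \\ \sinh x & \cosh x \end{pmatrix},$$

so each subproblem involves only one scalar variable $x$, which is why an exact breakpoint search is possible for $k=2$ but not for $k \geq 3$.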

Comment

Thank the authors for the clarifications! In the revised version, please be sure to properly introduce parameters/variables/terms before using them (such as theta and KL function). I will keep my score.

Review (Rating: 5)

This paper proposes a block coordinate descent method for solving optimization problems with J-orthogonality constraints. Several variants of the method are introduced within this framework, and convergence results are established. Extensive numerical results are also presented to demonstrate the efficiency of the proposed methods. However, I have some concerns regarding the novelty of this paper as well as the numerical results.

Strengths

It appears that optimization with J-orthogonality constraints has not been thoroughly studied in the literature. This paper proposes an efficient method for addressing this problem.

Weaknesses

  1. My major concern is that the novelty of this paper might be insufficient since the row-based approach is very similar to that in [51], even though the two papers tackle different problems.

  2. For the numerical results shown in Table 1, the proposed method fails to return a feasible solution for some instances, such as randn(10-10-5) and w1a (2470-290-145), as well as some other instances in the appendix. This is strange since the paper describes a BCD-type method, which should always return a feasible solution.

  3. The information of the reference [31] might be incorrect.

  4. For the GS-JOBCD method, there are two options for choosing $\mathbf{Q}$, whereas J-JOBCD only has one option. The authors should provide an explanation for this difference.

  5. The presentation could be further improved. Here are a few examples: the formulation of $P_i$ after equation (12) could be simplified by removing the notation $\mathrm{mat}$; it is unclear if the requirement on $\underline{Q}$ in equation (4) is sufficient to guarantee convergence (probably not, since $\underline{Q} = 0$ also satisfies this condition).

Questions

  1. How do you implement ADMM and choose its parameters for solving this problem? It seems that the corresponding method lacks a convergence guarantee.

  2. The authors should provide more details on the implementation of J-JOBCD and GS-JOBCD. Did you use parallel techniques in J-JOBCD? What would happen if these techniques were not used? Should you compare the iteration numbers of the methods? How do you choose the parameter in equation (12)? Moreover, if J-JOBCD consistently performs much better than GS-JOBCD, it seems unnecessary to introduce GS-JOBCD.

Limitations

At the beginning of the paper, the authors claim that equation (2) can imply $\|\nabla f_i(X) - \nabla f_i(X^+)\| \leq L_f \|X - X^+\|$, which is incorrect. Note that the converse is correct. The other assumptions in Assumptions 4.1 and 4.2 essentially assume the compactness of the iterates.

Comment

Question 8. Did you use parallel techniques in J-JOBCD?

Response: Yes, we utilized parallel techniques in J-JOBCD by expanding the matrix dimensions and defining the variables as tensors in PyTorch to facilitate parallel computations.

Question 9. What would happen if these techniques were not used?

Response:

  1. Without using parallel techniques, the method would degrade into a sequential BCD method. The parallelization strategy in this paper is an effort to accelerate the BCD framework by fully utilizing modern parallel architectures.

  2. We could consider using other parallelization techniques for experiments, such as more efficient multi-core parallel computing or efficient GPU-based parallel computing. It is worth noting that using PyTorch to define variables as tensors for parallel computations is very efficient for small-dimensional computations, as it avoids the communication costs between different machines.

Question 10. Should you compare the iteration numbers of the methods?

Response: Thank you for your suggestion. We have conducted relevant experiments, as shown in Table 3 and Figure 2 of the attached PDF file. The JOBCD algorithm consistently achieves better optimization results in fewer iterations.

Question 11. How do you choose the parameter in equation (12)?

Response: In the experiments of the paper, we set $\theta = 10^{-6}$ and let $\zeta$ be the Lipschitz constant of the blocks of different functions. Specifically, in the HEVP experiments, we used the maximum absolute value of the chosen blocks in the Hessian matrix $C$ as the block Lipschitz constant. In the HSPP experiments, we heuristically specified the Lipschitz constants empirically for different datasets and selected the most effective constant. We chose within the range of $\{100, 500, 1000\}$, and generally, 500 yielded the best results.

Question 12. Moreover, if J-JOBCD consistently performs much better than GS-JOBCD, it seems unnecessary to introduce GS-JOBCD.

Response: We could avoid GS-JOBCD and directly present the J-JOBCD algorithm. However, doing so would not illustrate the practical impact of parallelization and variance reduction techniques on the algorithm's convergence.

To better demonstrate the influence of these modern techniques, we described GS-JOBCD and J-JOBCD separately in the writing.

Additionally, GS-JOBCD can serve as a basic deterministic algorithmic framework for further designs.

Question 13. At the beginning of the paper, the authors claim that equation (2) can imply $\|\nabla f_i(X) - \nabla f_i(X^+)\|_F \leq L_f \|X - X^+\|_F$, which is incorrect. Note that the converse is correct.

Response: We will explicitly assume $\|\nabla f_i(X) - \nabla f_i(X^+)\|_F \leq L_f \|X - X^+\|_F$ for some constant $L_f$. Thank you for pointing it out.

Author Response

Dear Reviewer TAHW, thank you for your efforts in evaluating our manuscript. In the following, we will respond to your concerns point by point.

Question 1. My major concern is that the novelty of this paper might be insufficient since the row-based approach is very similar to that in [51], even though the two papers tackle different problems.

Response: This paper differs from the work of [51] in several key aspects:

  1. The nature of the problems differs: solutions to optimization problems with orthogonality constraints are compact, whereas solutions with $J$-orthogonality constraints can be unbounded.

  2. We extend the OBCD method [51] to handle optimization problems with J-orthogonality constraints, addressing a broader class of optimization problems.

  3. We introduce a parallel Jacobi strategy, marking the first application of modern parallelization techniques to BCD methods.

  4. We incorporate variance reduction strategies into the JOBCD framework, transitioning from a deterministic to a stochastic algorithmic framework.

  5. JOBCD presents, for the first time, the first-order optimality condition for optimization problems with J-orthogonality constraints, together with the tangent space of the underlying manifold and the corresponding convergence properties.

  6. The comprehensive consideration of all these aspects leads to a significantly higher level of difficulty in proving the properties of JOBCD compared to OBCD.

Question 2. For the numerical results shown in Table 1, the proposed method fails to return a feasible solution for some instances, such as randn(10-10-5) and w1a (2470-290-145), as well as some other instances in the appendix. This is strange since the paper describes a BCD-type method, which should always return a feasible solution.

Response: This is because, in the paper, we use the CS decomposition method to generate the initial random J-orthogonal matrix. This method relies on the functions $\cosh(x) = \frac{e^x + e^{-x}}{2}$ and $\sinh(x) = \frac{e^x - e^{-x}}{2}$, which may lead to data overflow issues. To address this problem, we use the identity matrix as the initial value for JOBCD. The output results are shown in Table 1 of the attached PDF file. Additionally, the last column of each result table includes experiments where the JOBCD algorithm is continued based on other algorithms. This verifies that JOBCD maintains the J-orthogonal constraint throughout the iterative process.
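A minimal NumPy sketch of this kind of hyperbolic-CS-style initialization (not the paper's exact generator; for simplicity it assumes $p = n - p$) illustrates both the construction and the overflow risk for large hyperbolic angles:

```python
import numpy as np

def random_j_orthogonal(n, p, scale=1.0, seed=0):
    """Draw a random X with X^T J X = J for J = diag(I_p, -I_{n-p}).
    Hedged sketch assuming p == n - p; a large `scale` makes cosh/sinh
    overflow, which is the issue described above."""
    assert 2 * p == n, "this sketch assumes p = n - p"
    rng = np.random.default_rng(seed)
    # Random orthogonal factors via QR of Gaussian matrices.
    U1, _ = np.linalg.qr(rng.standard_normal((p, p)))
    U2, _ = np.linalg.qr(rng.standard_normal((p, p)))
    V1, _ = np.linalg.qr(rng.standard_normal((p, p)))
    V2, _ = np.linalg.qr(rng.standard_normal((p, p)))
    t = scale * rng.standard_normal(p)        # hyperbolic angles
    C, S = np.diag(np.cosh(t)), np.diag(np.sinh(t))
    M = np.block([[C, S], [S, C]])            # M^T J M = J since cosh^2 - sinh^2 = 1
    Z = np.zeros((p, p))
    D1 = np.block([[U1, Z], [Z, U2]])
    D2 = np.block([[V1, Z], [Z, V2]])
    return D1 @ M @ D2.T

n, p = 10, 5
J = np.diag(np.concatenate([np.ones(p), -np.ones(n - p)]))
X = random_j_orthogonal(n, p)
print(np.abs(X.T @ J @ X - J).max())          # ~1e-12: the constraint holds
print(np.cosh(800.0))                         # inf: overflow for large angles
```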

Question 3. The information of the reference [31] might be incorrect.

Response: Thank you for pointing this out. It should be:

"[31] Nachuan Xiao, Xin Liu, and Ya-xiang Yuan. A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optimization Methods and Software, 37(4):1205–1241, 2022."

Question 4. For the GS-JOBCD method, there are two options for choosing Q, whereas J-JOBCD only has one option. The authors should provide an explanation for this difference.

Response: As mentioned in Lemma 2.4, only when $\mathbf{Q} = \xi \mathbf{I}_4$ can we ensure that the sub-optimization problems between any two blocks $B_i$ and $B_j$ are independent.

When Q\mathbf{Q} has an inseparable structure, the sub-optimization problems between any two blocks become interdependent, making parallel problem-solving impossible.

Question 5. The presentation could be further improved. Here are a few examples: the formulation of Pi after equation (12) could be simplified by removing the notation mat

Response: This description was intended to align with algorithm 1, highlighting the similarities and differences between the two algorithms. We will simplify it as suggested to make the algorithm more concise.

Question 6. It is unclear if the requirement on $\underline{Q}$ in equation (4) is sufficient to guarantee convergence (probably not, since $\underline{Q} = 0$ also satisfies this condition).

Response: $\underline{Q}$ is an inherent parameter of the problem, related to the maximum eigenvalue of the submatrix of the Hessian matrix (which also reflects the curvature information of the objective function). It is generally non-zero and is not manually specified. The specific expression is given in Lemma 2.1, Part (c). $\zeta$ is a parameter of the algorithm and needs to be greater than or equal to the inherent parameter $\underline{Q}$.

Question 7. How to perform and choose the parameters in ADMM for solving this problem? It seems that the corresponding method lacks a convergence guarantee.

Response:

  1. We have provided a detailed derivation process of the ADMM algorithm in the comment "Implementation of ADMM Algorithm for Optimization Problem under J-Orthogonality Constraints".

  2. In the ADMM algorithm, $\lambda$ is an important parameter that balances constraint adherence and the optimization objective. We offer two methods for choosing $\lambda$: one is a fixed $\lambda$ that remains constant throughout the entire ADMM iteration process, and the other is a dynamic $\lambda$ that increases every specified number of iterations until it reaches an upper limit.

  3. Table 2 and Figure 1 in the attached PDF file show the objective function values and constraint violation for solving the HEVP problem with different $\lambda$ values when ADMM converges. Generally, the dynamic $\lambda$ setting performs better than the fixed $\lambda$ setting. However, with the dynamic setting, a small initial $\lambda$ can lead to larger constraint violation.

  4. To effectively compare with feasible methods such as CSDM and JOBCD, we chose the ADMM algorithm with the dynamic $\lambda$ setting, initializing $\lambda$ at 1e4.

Comment

Thank you to the authors for the clarification. Some of my concerns have been well addressed. Therefore, I would like to increase my score to 5.

Comment

We consider minimizing the following differentiable function subject to J-orthogonality constraints:

$$\min_{X \in \mathbb{R}^{n \times n}} f(X), \quad \text{s.t. } X^{\top} J X = J.$$

Defining $Y = JX \in \mathbb{R}^{n \times n}$, we have:

$$\min_{X, Y \in \mathbb{R}^{n \times n}} f(X), \quad \text{s.t. } X^{\top} Y = J, \ Y = JX.$$

Introducing Lagrangian multipliers $Z \in \mathbb{R}^{n \times n}$ and $W \in \mathbb{R}^{n \times n}$, and the penalty parameter $\beta \in \mathbb{R}$, we have the following augmented Lagrangian function:

$$\mathcal{L}(X, Y; Z, W) = f(X) + \langle X^\top Y - J, Z \rangle + \langle JX - Y, W \rangle + \frac{\beta}{2}\|X^\top Y - J\|_F^2 + \frac{\beta}{2}\|JX - Y\|_F^2.$$

Suppose $f(X)$ has an $l$-Lipschitz continuous gradient: $f(X) \leq f(X^t) + \langle X - X^t, \nabla f(X^t) \rangle + \frac{l}{2}\|X - X^t\|_F^2$.

We consider minimizing $\mathcal{L}(X, Y; Z, W)$ with respect to $X$. We derive the following majorization function $Q^t(X)$ of $\mathcal{L}(X, Y^t; Z^t, W^t)$ at $X^t$:

$$\mathcal{L}(X, Y^t; Z^t, W^t) \leq Q^t(X) \triangleq f(X^t) + \langle X - X^t, \nabla f(X^t) \rangle + \frac{l}{2}\|X - X^t\|_F^2 + \langle X^\top Y^t - J, Z^t \rangle + \langle JX - Y^t, W^t \rangle + \frac{\beta}{2}\|X^\top Y^t - J\|_F^2 + \frac{\beta}{2}\|JX - Y^t\|_F^2.$$

We solve the following subproblems to update $X^{t+1}$ and $Y^{t+1}$ alternately:

$$X^{t+1} = \operatorname{argmin}_X Q^t(X)$$

$$Y^{t+1} = \operatorname{argmin}_Y P^t(Y) \triangleq \mathcal{L}(X^{t+1}, Y; Z^t, W^t)$$

$$Z^{t+1} = Z^t + \beta\,({X^{t+1}}^\top Y^{t+1} - J)$$

$$W^{t+1} = W^t + \beta\,(JX^{t+1} - Y^{t+1})$$

Using the first-order optimality conditions of $\min_X Q^t(X)$ and $\min_Y P^t(Y)$, we obtain the update formulas for $X^{t+1}$ and $Y^{t+1}$:

$$X^{t+1} = -\big(lI + \beta(Y^t {Y^t}^\top + I)\big)^{-1}\big(\nabla f(X^t) - lX^t + Y^t {Z^t}^\top + JW^t - \beta Y^t J - \beta J Y^t\big)$$

$$Y^{t+1} = -\big(\beta(X^{t+1} {X^{t+1}}^\top + I)\big)^{-1}\big(X^{t+1} Z^t - W^t - \beta X^{t+1} J - \beta J X^{t+1}\big)$$

In the experiment, we used the following strategy to adjust $\lambda$ (the penalty parameter): if $t \,\%\, 2 == 0$ and $\lambda \leq 10^{9}$, then $\lambda$ is doubled.
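A minimal NumPy sketch of this ADMM loop (illustrative only, under the assumptions that $f(X) = \operatorname{tr}(X^\top C X)$ as in the HEVP experiments, that $l$ is an upper bound on the gradient's Lipschitz constant, and that $\beta$ plays the role of the $\lambda$ adjusted above):

```python
import numpy as np

def admm_j_orth(C, J, iters=200, beta=1e4):
    """Hedged sketch of the ADMM scheme derived above for f(X) = tr(X^T C X),
    using the closed-form X/Y updates and the dual updates for Z, W.
    The penalty parameter is doubled every 2 iterations and capped at 1e9."""
    n = C.shape[0]
    I = np.eye(n)
    l = 2.0 * np.linalg.norm(C, 2)            # Lipschitz constant of grad f(X) = 2 C X
    X, Y = np.eye(n), J.copy()                # feasible start: X = I, Y = J X
    Z, W = np.zeros((n, n)), np.zeros((n, n))
    for t in range(iters):
        grad_f = 2.0 * C @ X
        # X-update: stationarity of Q^t(X)
        rhs = grad_f - l * X + Y @ Z.T + J @ W - beta * Y @ J - beta * J @ Y
        X = -np.linalg.solve(l * I + beta * (Y @ Y.T + I), rhs)
        # Y-update: stationarity of P^t(Y)
        rhs = X @ Z - W - beta * X @ J - beta * J @ X
        Y = -np.linalg.solve(beta * (X @ X.T + I), rhs)
        # dual updates
        Z = Z + beta * (X.T @ Y - J)
        W = W + beta * (J @ X - Y)
        if t % 2 == 0 and beta <= 1e9:
            beta *= 2.0
    return X
```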

Author Response

Dear reviewers, thank you for taking the time to review our paper.

Your valuable feedback and constructive comments are greatly appreciated. Please, find the answers to your questions below.

Please note that we have added tables and figures in the attached pdf to support our responses to the reviewer TAHW.

Final Decision

This paper had borderline scores, slightly positive (avg 5.25) but below the standard acceptance threshold (~5.7). That means the Area Chair took a very close look at the paper and comments.

The reviewers are mildly positive and the biggest strength seems to be that this is a pretty reasonable method to solve optimization under J-Orthogonality constraints (for reference, this means $X$ must satisfy $X^\top J X = J$, where $J$ is block diagonal with one identity block and one negative identity block, hence the "hyperbolic" style). There are not a lot of methods to solve this, and of applicable methods, the numerical results show that the proposed approach does very well.

That said, the reviewers raise two main issues:

  1. the key algorithmic innovation seems to be mild, since much of this is similar to the work in [51]. The authors argue against this in their rebuttal, mentioning the addition of several things including an optimality condition they derived and variance reduction. They also claim that J-orthogonality constraints, not representing a compact set, are fundamentally different. Reviewer swMz thought that this compactness argument didn't have a direct role. Ultimately, the rebuttal slightly increased one or two reviewers' scores but didn't make any reviewer extremely positive.
  2. there were some issues about the implementations of the methods compared against, raised by a few reviewers. The authors have altered other methods a bit (which seems to have been in good faith, in order to make those methods perform well) but some reviewers were worried this made those methods potentially worse. As this was raised by more than one reviewer, I think it is significant. I'd suggest that the authors compare to two versions of existing methods: one implemented exactly as in the existing papers, and one tweaked.

A final point raised later in discussion by a reviewer is that there are techniques to convert many constrained problems into unconstrained problems, and that this should be discussed (in particular, cf. Xiao, Nachuan, Xin Liu, and Kim-Chuan Toh. "Dissolving constraints for Riemannian optimization." Mathematics of Operations Research 49.1 (2024): 366-397. )

Adding my own thoughts, I think that these hyperbolic style constraints are interesting, and I appreciate the applications listed at the top of page 2 (e.g., to the hyperbolic eigenvalue problem, ultrahyperbolic knowledge graph embedding, ...), but ultimately this is a rather specialized topic. That is fine, but I think with such a specialized topic, the bar for publication should be just a little bit higher.

Thus in the end, I am unfortunately not recommending this paper for NeurIPS this year, given that the topic is so specialized and the borderline ratings from the reviewers.