Linear programming using diagonal linear networks
摘要
评审与讨论
This paper shows that gradient descent on a diagonal linear network (with a quadratic parametrization) under specific initialization leads to an approximate solution to the entropy-regularized solution to a linear programming problem. The convergence results are presented for both gradient flow and gradient descent algorithms.
优点
- This paper presents a new idea that solves LPs via training a diagonal linear network, which exploits the implicit bias of diagonal networks.
- linear convergence results are shown for training diagonal linear networks with GD, which is different from what has been established, for example, the results in Vaškevičius et. al. 2019.
缺点
The main weaknesses are the presentation and the significance of the results, mainly for Theorem 3.4 and 3.5.
- There needs more discussion on the upper bound on the step size, and the linear rate in Theorem 3.4, specifically their dependence on 1) the underlying LP problem , and its scale (# of decision variables, #of constraints, etc.); 2) the initialization . For example, if either is ill-conditioned, or the initialization is close to the origin, then I believe should be close to one. Merely showing that GD converges linearly does not make a significant contribution if what authors propose is to implement this GD algorithm for solving real LP problems.
- Theorem 3.5 only shows that the GD converges to some that is close to the desired solution to the LP, but the result is weak in the sense that it doesn't suggest an upper bound on the # of GD iterations for achieving certain accuracy. Specifically, the convergence result one expects is that given some , the GD with some step size takes iterations to achieve either 1) , where is the true optimal solution; 2) or the optimality gap is less than .
- Another concern I have is that I don't find, from the discussions and experiments in this paper, any evidence that the proposed algorithm has advantages in solving certain LPs, compared to existing methods.
问题
See "weaknesses"
The upper bound on was explicitly written down in equation (6.1) in the appendix where and the value of can be large in the worst case (leading to a conservative constant for the convergence rate), it can be much better in cases where the matrix satisfies some regularity properties. For example, if all the entries of are i.i.d. Gaussian, then one can show that with high probability, has a polynomial dependence on the problem sizes.
The dependence of on , and the initial point can be derived from the proof of Theorem 3.4. This dependence is hard to be written in a clean way, as we have used few constants that can be shown to exist but cannot be written in a closed form (see Appendix H.1). Furthermore, since our problem is non-convex in nature, it makes it lot more harder to delineate exact expression of the linear rate. However, we agree that if is ill-conditioned or the initial point is close to , the value of is close to .
Finally, we note that conservative constants (e.g. Hoffman constant) have been commonly used in deriving the linear convergence of first order methods for solving LP problems. Although such constants may not reflect the actual performance of the algorithm, it provides some hint on the convergence properties of the algorithm. Some of the impactful papers from the past with similar problem includes [1] and [2].
This is an excellent question. We definitely like to include more details in the Appendix. To emphasize how the stepsize depends on the gap, let us define
where is the solution of the LP problem and is the same constant as in the response of the previous comment. Now a simple but tedious computation would show that if the regularization and stepsize is chosen in way so that , then the optimality gap after large enough time will be less than .
The a number of iterations to reach an -optimality gap is more subtle. Since our analysis here is significantly different than usual first order methods, it is hard to write a closed-form expression of . This is partly because our results shows global convergence rate of GD for a non-convex problem. Unlike the analysis of a convex problem, the typical per-iteration analysis is not working here. We have to define some global constants that can be shown to exist but cannot be written in a closed form (see Appendix H.1). The dependence on is therefore encoded in a sophisticated way.
We have run a new experiment comparing our method against the commercial LP solver Gurobi and another first-order method: an ADMM-based solver SCS. In particular, we simulate the data with the same data-generating process as in Section 4 of the paper, but with and . Here is a brief comparison of the methods in this example: Gurobi, SCS and our algorithm achieve respectively 0, -0.971,0.005 primal gaps and 0, 2.21e-10, 4.17e-11 feasibility gaps in 16.77, 40.49 and 9.26 seconds.
Here the primal gap is defined as , with be the solution by an algorithm and is the optimal solution (by Gurobi). The feasibility gap is defined as . First, it can be seen that our method is much faster than SCS, where our method computes a solution with a smaller primal gap in a much shorter time. Our method is also faster than Gurobi if a low-accuracy solution with a primal gap is satisfactory for the underlying application. Moreover, we note that Gurobi takes seconds for preprocessing, which is already similar to the runtime of our method.
Admittedly, our method is not able to obtain high-accuracy solutions as fast as Gurobi, but it is useful for quickly obtaining an approximate solution. Moreover, since Algorithm 1 only involves matrix-vector multiplications, it is easily implemented in a GPU-acceleration setting, which can potentially outperform Gurobi for larger problems.
[1] Hong, M., Luo, Z.-Q. On the, linear convergence of the alternating direction method of multipliers. (2019) Mathematical Programming.
[2] Luo, Z.-Q., Tseng, P. On the linear convergence of descent methods for convex essentially smooth minimization (1992) SIAM Journal on Control and Optimization
Dear Reviewer 9DAk,
As the deadline for the discussion phase is fast approaching, we were curious if you had any feedback for our response. Please also let us know if you have any further questions. Thank you and looking forward to hearing from you.
Sincerely,
Authors.
The discussion on is helpful but not fully satisfactory. My main concern is that the paper should ideally show the computation complexity for getting an -optimal solution to their regularized LP problem, but neither Theorem 3.4 nor Theorem 3.5 does it. I understand that complexity analysis for this algorithm may be non-trivial, but I believe it should be an essential part of the theoretical analysis. Therefore, I keep my score.
Minor: for comparison, the proposed algorithm would be of much interest if it can scale well. So I suggest having more experiments with various problem sizes when comparing with Gurobi.
Dear Reviewer 9DAk,
Thank you for your prompt response; your comments are highly valued and a source of inspiration for us.
Upon a thorough reevaluation of our computations, we have successfully quantified an expression for . Following a meticulous and detailed analysis, we have determined that for
M=M(A,b,R):= \sup_{u\in \bar B_R(0)} \| A^\top (A(u\circ u) - b) \|_2, \quad K= K(A,b,R) := \Big\lceil 8\max(1,M)\Big\rceil. $$ the running time $T(\epsilon)$ could be upper bounded by $$(\eta(\epsilon)^{-1}+1)(\lambda(\epsilon)^{-1}+1)K^{\gamma}$$ where $\gamma= C\|A(u_0\circ u_0)-b\|^2+1$ for some constant $C>0$. Here $\delta$ and $C$ are model dependent constant which does not necessarily grow with the size of the problem. In our response on the dependence of $\eta(\epsilon)$ and $\lambda(\epsilon)$, we have indicated that $\eta(\epsilon), \lambda(\epsilon) = O(\epsilon)$. In other word, $T(\epsilon) = O(n^{\gamma}\epsilon^{-2})$ where $n$ is the model size and $\gamma$ is constant which depends on $A,b$ and initialization $u_0$. Recall that $u_0= exp(-c/2\lambda(\epsilon))$. Therefore, for $\epsilon\leq 1/(\max_i c_i\log n)$ and $\|A\|,\|b\|\leq 1$, $\gamma$ could be upper bounded as $$\gamma \leq C(\|A\|^2\|u_0\circ u_0\|^2+\|b\|^2)+1\leq C(1\cdot 1+1)+1 = 2C+1.$$ We would be very happy to add this in our main results of the revised manuscript. Furthermore, in the process of deriving $T(\epsilon)$, we got also bound on the linear rate $\rho$ which is another of your query. We are pleased that your inquiries have sparked a new set of developments on our end. $**Response on the minor comments about more simulation:**$ We completely agree that more simulations with different problem size would make our claim more stronger and we are preparing to add more simulation in the Appendix like as you have indicated. We would greatly appreciate it if you could consider increasing your score should our response meet the criteria of your inquiry. Thank You, Authors.I am glad to see the complexity result and appreciate the authors' effort they put in the rebuttal phase. I have raised my score.
This manuscript studies the implicit bias of re-parametrized gradient descent in which the macroscopic learning rates are used. By leveraging the characterization of the implicit bias and the convex geometry of a linear program, they prove the linear convergence of GD on the linear regression problem under the quadratic parametrization. Importantly, they make minimal assumptions to prove their results.
优点
The extension of the previous results [1,2] on reparametrized gradient descent/flow under minimal assumptions on data and macroscopic choice of learning rate is a nontrivial result. The analysis under this generality introduces additional complexity, but this work successfully establishes (linear) convergence for this setting.
[1] Woodworth, B.E., Gunasekar, S., Lee, J., Moroshko, E., Savarese, P.H., Golan, I., Soudry, D., & Srebro, N. (2019). Kernel and Rich Regimes in Overparametrized Models. ArXiv, abs/2002.09277.
[2] Even, M., Pesme, S., Gunasekar, S., & Flammarion, N. (2023). (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability. ArXiv, abs/2302.08982.
缺点
The presentation can be further improved in several ways to enhance clarity and readability:
- Firstly, I could not follow how the similarity between Algorithm 1 and the Sinkhorn algorithm is used in the paper.
- Additionally, to the best of my knowledge, the result in [1] proves a similar result in the manuscript in a fairly general setting too. I think it would be helpful for the readers if the authors could elaborate on these points more in their revised manuscript.
Another aspect that requires attention is the lack of characterization of the effect of using large step sizes in the paper. In particular, [1] studies the effect of large step size in the same setting on the generalization; however, this study does not provide insight about the consequence of using large step size.
[1] Even, M., Pesme, S., Gunasekar, S., & Flammarion, N. (2023). (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability. ArXiv, abs/2302.08982.
问题
In the experiments, the authors show that for the large step size cases, re-parametrized GD converges faster than the exponentiated gradient algorithm (or mirror descent with the entropy potential). Can they comment on why this is the case and if their results imply anything in that direction?
Certainly, this is an intriguing question. We acknowledge that the elucidation of the comparison between our algorithm and the Sinkhorn algorithm might not be entirely clear. Our intention was to delineate the similarities and distinctions between these two algorithms by scrutinizing the update rules in each iteration. For instance, we demonstrated in Section 2.3 that our algorithm behaves as row and column rescaling algorithms just like as the Sinkhorn algorithm.
Similar to the Sinkhorn, our neural reparametrized gradient descent initialize the coupling matrix of OT at where . When the stepsize is small, like as in Sinkhorn, the gradient descent rescales the rows and columns of the coupling matrix. In other words, as the step size gets smaller, gradient descent iterates look like where and are respectively the vectors of residuals of the row sums and column sums of the coupling matrix . Hence the updates above can also be viewed as row and column rescalings, although with different rescaling rules than the Sinkhorn. Given the parallels between the Sinkhorn algorithm and the updates in our gradient descent, it's reasonable to assert that our approach introduces a Sinkhorn-like algorithm for solving linear programming problems within the context of diagonal linear networks. As far as our knowledge extends, this novel application has not been documented in prior works.
Although our work bears certain similarities to the references [1] and [2] cited by the reviewer, the primary distinction lies in the diverse approaches to reparametrization. In [1] and [2], the authors have used the reparametrization in the basis pursuit problem while we use a slightly different reparametrization which leads to different final representations of the limit of the gradient descent in our case. However, in both cases, the limit of the gradient descent solve regularized -norm minimization problem of . But the nature of regularization is different in our work compared to [1] and [2]. Moreover, at the time of submitting our work, the rate of convergence result for gradient descent, as presented in [1] or [2], was not available. To the best of our knowledge, this result was initially introduced in our work. In the following, we discuss the effect of stepsize in our gradient descent in the light of [2] and [3].
As the reviewer rightly mentioned, the stepsize play a crucial role in shaping the quality of the algorithm. Theoretical thresholds for the step size necessary for global linear convergence are detailed in Theorem 3.5. However, recent investigations, exemplified in the reference [2] pointed out by the reviewer and the recently published paper in arxiv [3], scrutinize the impact of an increasing step size. These studies, especially the latter one, reveals that, as the step size amplifies, gradient descent in quadratic regression traverses distinct phases. Beyond the theoretical bounds, the loss function oscillates around a diminishing trend, signifying the emergence of the 'catapult' phase. Notably, during the catapult phase, the application of specific ergodic averaging methods has been demonstrated to enhance generalization error. Further escalation of the step size leads to the onset of the periodic phase, marked by periodic oscillations in the loss function without a discernible downward trend. Subsequent to the periodic phase, the system transitions into the chaotic phase, characterized by a lack of convergence signals, ultimately culminating in divergence. Given that our gradient descent iterates mimic those of the least square quadratic regression, we anticipate that the insights from [3] will be applicable to our framework. We would be pleased to incorporate a comprehensive discussion of this in the revised version.
[3] Chen, X., Balasubramanian, K., Ghosal, P., Agrawalla, B. (2023). From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression. ArXiV: 2310.01687
That is an excellent question. We posit that our algorithm outpaces mirror descent due to the tendency of mirror descent updates to overshoot frequently, resulting in unnecessary excursions near the limit for extended periods. Nevertheless, it remains plausible that a judicious selection of the step size in mirror descent, akin to our algorithm, could potentially enhance its speed.
Dear Reviewer p2u6,
As the deadline for the discussion phase is fast approaching, we were curious if you had any feedback for our response. Please also let us know if you have any further questions. Thank you and looking forward to hearing from you.
Sincerely,
Authors.
Comparison with the Sinkhorn algorithm: As far as I can follow, to establish the similarity with the Sinkhorn algorithm, the authors use a well-known property the exponentiated gradient descent algorithm, i.e., for small , see [Chapter 4.4, 1] and [Chapter 2.7, 2]. Therefore, I still cannot follow how the similarity pointed out in the manuscript provides us with new insight.
For the other points, I thank the authors for their explanations. Overall, I still think that the manuscript requires more work to improve the presentation. Therefore, I keep my score the same.
[1]. Kivinen, J., & Warmuth, M.K. (1997). Exponentiated Gradient Versus Gradient Descent for Linear Predictors. Inf. Comput., 132, 1-63.
[2] Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games.
The authors prove linear convergence rates for the discrete and continuous versions of gradient on diagonal linear networks. In addition, they show that the continuous and discrete versions converge toward the solution of an entropically regularized linear problem.
优点
The theoretical results are new and elegant, I am not aware of previous results connecting linear programming and diagonal linear networks.
缺点
-
Section 2.2 and 2.3 are a little bit scientifically "loose", it draws some connections with other methods, but I am not sure there are proper theoretical results that can be extracted from these parts
-
Experiments. I understand this is a theoretical paper, but I think the paper would have more impact with more extensive experiments. The current experimental section illustrates the linear convergence of the algorithm and has one comparison to the mirror descent. Maybe the authors could provide an experiment with optimal transport and compare the proposed method to the standard Sinkhorn algorithm
问题
- In the experiment section, the authors mention that the theoretical step size found is too conservative. This conservative step size problem can usually be overcome with coordinate descent-like methods, that use a larger coordinate-specific step size. In addition, the authors mention some previous results on coordinate descent for diagonal linear networks. I was wondering if it is possible to extend the theoretical results of the authors to coordinate descent.
伦理问题详情
NA
We are very grateful to you evaluation. This is indeed a great question. Certainly, the selection of the step-size is a subject that merits deeper exploration, and we intend to delve into this aspect, drawing insights from recent studies elucidating the impact of step size on gradient descent. Exploring unique step-sizes for coordinate-wise descent presents a fascinating avenue for investigation. We are optimistic about the potential acceleration of our algorithm through coordinate descent; nevertheless, the analysis of its convergence and rate appears to pose distinctive challenges compared to our current approach.
Thank you for the question! Actually we compared our method with standard Sinkhorn, but unfortunately, we found that Sinkhorn is better for OT -- it more directly uses the structure of the problem. Nonetheless, we found that our algorithm can be useful for general LPs where Sinkhorn method is not applicable. For instance, we have performed a new experiment, comparing our method with Gurobi and another ADMM-based method. In particular, we simulate the data with the same data-generating process as in Section 4 of the paper, but with with and . Here is a brief comparison of the methods in this example: Gurobi, SCS and our algorithm achieve respectively 0, -0.971,0.005 primal gaps and 0, 2.21e-10, 4.17e-11 feasibility gaps in 16.77, 40.49 and 9.26 seconds.
Here the primal gap is defined as , with be the solution by an algorithm and is the optimal solution (by Gurobi). The feasibility gap is defined as . First, it can be seen that our method is much faster than SCS, where our method computes a solution with a smaller primal gap in a much shorter time. Our method is also faster than Gurobi if a low-accuracy solution with a primal gap is satisfactory for the underlying application. Moreover, we note that Gurobi takes seconds for preprocessing, which is already similar to the runtime of our method.
Admittedly, our method is not able to obtain high-accuracy solutions as fast as Gurobi, but it is useful for quickly obtaining an approximate solution. Moreover, since Algorithm 1 only involves matrix-vector multiplications, it is easily implemented in a GPU-acceleration setting, which can potentially outperform Gurobi for larger problems.
Dear Reviewer xDbj,
As the deadline for the discussion phase is fast approaching, we were curious if you had any feedback for our response. Please also let us know if you have any further questions. Thank you and looking forward to hearing from you.
Sincerely,
Authors.
This work looks into the dynamics of gradient descent (or GD) optimization, specifically focusing on the reparameterized GD for linear programmings. By reparameterizing the problem, the author(s) claim to provide a clearer understanding of the implicit bias of GD, particularly how it induces sparsity in solutions. The paper's theoretical contributions demonstrate that this reparameterized GD converges to zero-loss solutions and biases the flow toward solutions with better sparsity properties than conventional vanilla GD. The author support their claims with mathematical proofs and experiments that compare the performance of reparameterized GD with traditional GD and mirror-descent methods, suggesting advantages in terms of sparsity and convergence.
优点
This study's approach, reparameterizing GD for linear programs, offering a novel perspective on the implicit bias of GD. The quality is demonstrated through rigorous mathematical analyses, which include bounding the iterates of the algorithm and characterizing the limit points of the convergence.
Clarity is another strength, with the paper presenting its methodology and findings in a structured and understandable manner.
缺点
-
a notable weakness is the limited scope of the experimental setup; their simulation relies on isotropic Gaussian features, which may not be representative of real datasets that often contain features with varying scales and correlations. Moreover, the paper does not discuss the impact of non-Gaussian noise or different initialization schemes, which could potentially affect the generalization of the results
- can the author provide some results on real-world benchmarks? If conditions permit, I also suggest that the author compare it with sota linear programming algorithms (like some commercial solvers) and plot learning curves of the objective func value decreasing over time t
-
the paper's analysis assumes a batch size of , which may not scale well or apply directly to the common practice of using mini-batches. Moreover, the discussion on the impact of step size and batch size on the effective initialization scale is not sufficiently detailed, potentially limiting the applicability of their findings
-
while the paper offers a comparison with mirror descent, it may benefit from a broader comparison with other optimization algorithms in the community (empirically, or theoretically) to establish a more comprehensive understanding of its advantages
问题
-
In practical applications, mini-batch GD is common; could the authors speculate on how their results might change with the introduction of mini-batches? Could the authors discuss the limitations of their assumptions regarding initialization and step sizes in more technical depth, possibly suggesting how these might be relaxed or generalized?
-
Are there any theoretical insights from the paper that could suggest practical guidelines for tuning hyperparameters (like the step-size) in GD to leverage the sparsity-inducing properties observed in the reparameterized model?
-
The theoretical framework is focused on diagonal linear networks; could the authors discuss the potential challenges and modifications required to extend their framework to deep / non-linear networks?
We appreciate your inquiries. The substitution of minibatch gradient descent for gradient descent is not only feasible but can potentially offer advantages. Upon closer examination of our proofs and drawing on similarities between GD and one pass SGD on DLN, it becomes evident to us that the theoretical findings presented in our paper remain robust for minibatch GD with an appropriate choice of the batch size.
Exploring the impact of step size and initialization is indeed a crucial avenue of investigation. The initialization of our gradient descent proves to be a pivotal factor in guiding the algorithm toward the desired solution. As demonstrated in Theorem ~2.2 and 3.5, we established that initializing the gradient descent with leads to convergence toward the solution of Indeed, selecting a small positive value for results in a very small initialization. This, in turn, enhances the quality of the solution, given that the regularization term diminishes as approaches .
The impact of the step size is pivotal in achieving a superior generalization error. The theoretical bounds on the step size required for global linear convergence have been established in Theorem. However, recent works, such as in [1], delve into the effect of an increasing step size. In [1], it is demonstrated that, as the step size increases, gradient descent in quadratic regression undergoes various phases. Beyond theoretical bounds, the loss function oscillates around a diminishing trend, indicating the onset of the 'catapult' phase. Notably, in the catapult phase, employing certain ergodic averaging methods has been shown to lead to improved generalization error. Further increasing the step size leads to entry into the periodic phase, characterized by periodic oscillations in the loss function with no discernible downward trend. Subsequently, the chaotic phase ensues, devoid of any signs of convergence, ultimately culminating in divergence. Given that our gradient descent iterates mirror those of least square quadratic regression, we anticipate that the observations made in [1] will hold for our framework. We would be delighted to incorporate a detailed discussion of this in the revised version.
[1] Chen, X., Balasubramanian, K., Ghosal, P., Agrawalla, B. (2023). From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression. ArXiV: 2310.01687
We have run a new experiment comparing our method against the commercial LP solver Gurobi and another first-order method: an ADMM-based solver SCS. In particular, we simulate the data with the same data-generating process as in Section 4 of the paper, but with and . Here is a brief comparison of the methods in this example: Gurobi, SCS and our algorithm achieve respectively 0, -0.971,0.005 primal gaps and 0, 2.21e-10, 4.17e-11 feasibility gaps in 16.77, 40.49 and 9.26 seconds.
Here the primal gap is defined as , with be the solution by an algorithm and is the optimal solution (by Gurobi). The feasibility gap is defined as . First, it can be seen that our method is much faster than SCS, where our method computes a solution with a smaller primal gap in a much shorter time. Our method is also faster than Gurobi if a low-accuracy solution with a primal gap is satisfactory for the underlying application. Moreover, we note that Gurobi takes seconds for preprocessing, which is already similar to the runtime of our method.
Admittedly, our method is not able to obtain high-accuracy solutions as fast as Gurobi, but it is useful for quickly obtaining an approximate solution. Moreover, since Algorithm 1 only involves matrix-vector multiplications, it is easily implemented in a GPU-acceleration setting, which can potentially outperform Gurobi for larger problems.
Instead of a diagonal linear network, one might consider employing an -layer deep diagonal linear network. It amounts to parametrizing the gradient descent as over layers for . The gradient descent initialized at then converges to the solution of . Extending this result to intricate architectures of deep non-linear networks is open. However, the optimal choice of initialization in such scenarios remains an area of active research.
Dear Reviewer FLD4,
As the deadline for the discussion phase is fast approaching, we were curious if you had any feedback for our response. Please also let us know if you have any further questions. Thank you and looking forward to hearing from you.
Sincerely,
Authors.
I appreciate the authors' explanation, and will retain my positive score.
The authors propose to solve the problem of determining whether a linear program is feasible using a reparameterized gradient descent approach, and analyse a continuous time version of the approach showing that it relates to an entropy regularized solution to the linear program under technical assumptions. However, the limitations imposed by those assumptions is unclear, and further, the gap between the continuous time and practical discrete time formulations is unclear. Experimentally, the authors simply compare their approach against mirror descent, and ignore the vast literature on entropy regularized optimal transport algorithms. Given the limitations of the theoretical results and experiments, I recommend rejection.
为何不给更高分
Neither theoretical nor empirical results justify the value of the approach proposed by the authors.
为何不给更低分
N/A
Reject