Streaming Algorithms For $\ell_p$ Flows and $\ell_p$ Regression
We give new streaming algorithms for underconstrained regression when we see the columns one at a time, and obtain related results for flows.
Abstract
Reviews and Discussion
The $\ell_p$ regression problem is fundamental in machine learning, data science, and numerical linear algebra. When $p = 2$, it is the classical linear regression problem and has a closed-form solution given by the pseudoinverse, $x^* = A^{\dagger} b$. The general $\ell_p$ regression problem has been well studied; in particular, when $A$ is the vertex-edge incidence matrix of a graph, it is the $\ell_p$-norm flow problem, which corresponds to transshipment, a Laplacian solver, and maximum flow for $p = 1$, $p = 2$, and $p = \infty$, respectively. This paper considers the scenario in which the matrix $A \in \mathbb{R}^{n \times d}$ has $n \le d$, so the linear system is underdetermined. In this case it is impractical to store all the columns of $A$, which motivates the column-arrival streaming model. In addition, there are two kinds of solutions to $\ell_p$ regression: (1) a value that approximates the optimal norm (the cost); (2) a vector that approximates the optimal solution $x^*$. For the first kind, this paper gives an algorithm whose space complexity is polynomial in $n$ (with an exponent that grows as $p$ approaches $1$), and also gives lower bounds. For the second kind, this paper provides both lower and upper bounds for different values of $p$.
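For reference, a minimal sketch of the underdetermined formulation described above and its $p = 2$ closed form (this is the standard minimum-norm setup; the paper's exact, possibly weighted, formulation may differ in details):

```latex
% Underdetermined setting: A has n rows and d >= n columns, and the
% columns of A arrive one at a time in the stream.
\begin{align*}
  \text{(primal)}\qquad & \min_{x \in \mathbb{R}^{d}} \; \|x\|_p
      \quad \text{s.t.} \quad Ax = b, \\
  \text{($p = 2$)}\qquad & x^{*} = A^{\top}\!\left(AA^{\top}\right)^{-1} b
      = A^{\dagger} b
      \quad \text{(closed form, assuming $A$ has full row rank).}
\end{align*}
```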
Strengths
(1) This paper studies the general $\ell_p$ regression problem under the column-arrival streaming model, whereas existing work only considers special cases, such as $p = 2$ or $A$ being the vertex-edge incidence matrix of a graph.
(2) For general $p$, this paper establishes both lower and upper bounds on the space complexity for the two kinds of solutions: cost estimation and the vector-valued problem. These new results strengthen the paper's contributions.
(3) This paper has an extensive set of references and a comprehensive summary of the related work.
Weaknesses
The $\ell_p$-norm flow problem defined in (1) (Line 61) is not exact. The correct one should be the weighted formulation $\min_{f:\,B^{\top} f = b} \|Cf\|_p$ for a diagonal weight matrix $C$, where $B$ is the edge-vertex incidence matrix and $b$ the demand vector. To be specific, when $p = 1$ and $C = \operatorname{diag}(c)$ where $c$ is the cost vector, it is the transshipment problem; when $p = 2$ and $C = \operatorname{diag}(\sqrt{r})$, where $r$ is the resistance vector, it reduces to a Laplacian solver; when $p = \infty$ and $C = \operatorname{diag}(1/u)$, where $u$ is the capacity vector, it is the maximum flow problem.
With this formulation, it becomes the same as formulation (2). As a special case, when $A$ is a vertex-edge incidence matrix in the graph streaming setting, there is some existing work (Lines 91-97) on transshipment, electrical flow, maximum flow, etc. Could you compare this work with their results?
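To make the specializations the reviewer lists explicit, here is a sketch in conventional notation (the exact statement of formulations (1)/(2) in the paper is not reproduced here):

```latex
% Weighted \ell_p-norm flow: B is the edge-vertex incidence matrix,
% b the demand vector, and C a diagonal matrix of edge weights.
\begin{align*}
  \min_{f:\;B^{\top} f = b} \; \|Cf\|_p ,
  \qquad
  C =
  \begin{cases}
    \operatorname{diag}(c)        & p = 1:\ \text{costs $c$ (transshipment)},\\
    \operatorname{diag}(\sqrt{r}) & p = 2:\ \text{resistances $r$ (electrical flow / Laplacian)},\\
    \operatorname{diag}(1/u)      & p = \infty:\ \text{capacities $u$ (max flow via min congestion)}.
  \end{cases}
\end{align*}
```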
Questions
(1) In Theorem 1, Line 153, should "rows" be "columns"? At Line 164, should the second occurrence be corrected as well? Please verify.
(2) In Theorem 1, when $p$ is close to $1$, e.g. $p = 1 + o(1)$, the corresponding exponent in the bound would be a large constant. In this case, could you discuss the behavior of the proposed algorithm? If it is indeed a non-trivial issue, could you give a brief analysis or comment in the paper on the case $p \to 1$?
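A small worked illustration of this concern, assuming the parameter in question is the dual exponent $q = p/(p-1)$ (i.e., $1/p + 1/q = 1$), which blows up as $p \to 1$:

```latex
% q = p/(p-1): a few values as p approaches 1.
\begin{align*}
  p = 2 \Rightarrow q = 2, \qquad
  p = 1.5 \Rightarrow q = 3, \qquad
  p = 1.1 \Rightarrow q = 11, \qquad
  p = 1.01 \Rightarrow q = 101 .
\end{align*}
% Any bound whose exponent scales with q therefore degrades rapidly as p -> 1.
```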
Weaknesses: Agreed, equation (2) is meant to capture the general setting in which the $\ell_p$ norms are weighted, and it covers the cases you mention. Surprisingly, there are in fact not many algorithmic results known for streaming flow problems. The works mentioned in Lines 91-97 are lower bounds against the directed version of the problem, which is hard for streaming, whereas we present new algorithms for the undirected version. Known results are presented in Table 1.
(1) Thank you for catching the typo of "rows" vs. "columns"; this has been fixed. In the graph setting the columns are sparse, so the number of bits is asymptotically the same as the number of columns. (2) Indeed, as $p$ goes to $1$ there are inherent difficulties, as we discuss at Line 187 and complement with a lower bound in Theorem 3.
I thank the authors for their feedback! I will retain my score.
The paper initiates the study of $\ell_p$-norm regression in the one-pass streaming setting. The authors consider the underdetermined setting where $A$ has size $n \times d$ with $n \le d$ and receives column updates, for a range of values of $p$. When the instances are graphs, these updates correspond to edge insertions.
The first result is a small-space algorithm for a "$p$-norm flow sparsifier", a generalization of graph sparsifiers to objectives that minimize the $\ell_p$-norm. This algorithm follows from running the online Lewis weight sampling algorithm of Woodruff and Yasuda '22 on the dual problem. The other algorithmic result gives a tradeoff between the size of the sparsifier and the objective error when the goal is to maintain the entire solution vector $x$; the first algorithm only maintains the objective value.
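For readers unfamiliar with Lewis weight sampling, below is a minimal offline sketch of the idea (the classical fixed-point iteration plus row sampling; this is not the online algorithm of Woodruff and Yasuda that the paper uses, and the oversampling constant and convergence details are simplified assumptions):

```python
import numpy as np

def lewis_weights(A, p, iters=50):
    """Approximate the l_p Lewis weights of the rows of A via the standard
    fixed-point iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2},
    which is known to converge for p < 4. Offline, illustrative sketch only."""
    n, _ = A.shape
    w = np.ones(n)
    for _ in range(iters):
        W_pow = w ** (1.0 - 2.0 / p)                # diagonal of W^{1-2/p}
        M = A.T @ (W_pow[:, None] * A)              # M = A^T W^{1-2/p} A
        Minv = np.linalg.inv(M)
        tau = np.einsum("ij,jk,ik->i", A, Minv, A)  # tau_i = a_i^T M^{-1} a_i
        w = np.maximum(tau, 1e-12) ** (p / 2.0)
    return w

def lp_row_sample(A, p, oversample=20.0, seed=0):
    """Keep each row i independently with probability ~ oversample * w_i
    (capped at 1) and rescale it by prob^{-1/p}, so that ||SAx||_p^p is an
    unbiased estimate of ||Ax||_p^p. Log factors / failure probability omitted."""
    rng = np.random.default_rng(seed)
    w = lewis_weights(A, p)
    probs = np.minimum(1.0, oversample * w)
    keep = rng.random(A.shape[0]) < probs
    scales = probs[keep] ** (-1.0 / p)
    return scales[:, None] * A[keep]

# Tiny usage example: sparsify a tall random matrix in the l_{1.5} norm.
if __name__ == "__main__":
    A = np.random.default_rng(1).standard_normal((2000, 10))
    SA = lp_row_sample(A, p=1.5)
    x = np.ones(10)
    print(SA.shape, np.linalg.norm(A @ x, 1.5), np.linalg.norm(SA @ x, 1.5))
```

Roughly speaking, in the column-arrival setting one would apply such sampling to the transposed (dual-side) matrix, whose rows correspond to the arriving columns, and in an online rather than offline fashion.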
The paper also presents several information-theoretic lower bounds that match these upper bounds. They also give some extra lower bounds for a particular regime of $p$.
Strengths
The paper studies an important problem and makes a significant amount of progress in the new direction of streaming algorithms for $\ell_p$ regression. The authors also leave some good open questions in the paper. The techniques are quite simple and build on previous works effectively. The main technical core is the lower bound part, which is of independent interest.
Weaknesses
It seems like there are too many ideas and previous results used in the paper. It would be useful to have some more preliminaries in the appendix or somewhere.
Questions
- Can the authors add a comparison with what is known in online algorithms in the corresponding settings?
- The tables in the paper that give the lower and upper bounds for different settings: can these be split in a different way so that it's easier to contrast the lower and upper bounds across all settings? This would also make clear which settings are still unknown.
It seems like there are too many ideas and previous results used in the paper. It would be useful to have some more preliminaries in the appendix or somewhere.
We have added a preliminaries section in the appendix with the most important definitions, and if the reviewer has any further suggestions as to what would be helpful, we would be happy to add them.
- We have added a discussion on the work of https://arxiv.org/abs/1604.05448, which is the only other work we are aware of on online graph sparsification.
- The tables in the current draft are organized as Table 1 for algorithms and lower bounds for estimating the cost, and Table 2 for algorithms and lower bounds for outputting the solution. We are happy to take suggestions on what other organization would be more clear to readers.
The paper gives upper and lower bounds for streaming underdetermined linear regression in the column-arrival model. The results apply both to the value of the best fit and to the actual solution. There are various results, depending on the value of $p$ and the approximation guarantees. The arguments use duality and sparsification.
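For context, the duality in question is presumably the standard primal-dual pairing for minimum-$\ell_p$-norm solutions (a sketch, with $q$ denoting the dual exponent):

```latex
% Standard convex duality for 1 < p < infinity, with 1/p + 1/q = 1
% (assuming b lies in the range of A, so the primal is feasible):
\begin{align*}
  \min_{x:\,Ax = b} \ \|x\|_p
  \;=\;
  \max_{y:\,\|A^{\top} y\|_q \le 1} \ b^{\top} y .
\end{align*}
% Sampling rows of A^T (i.e., columns of A) so that \|A^{\top} y\|_q is
% preserved for all y therefore preserves the optimal cost of the primal.
```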
Strengths
The paper explores the full range of values of $p$.
Weaknesses
The methods rely substantially on past work.
Questions
None
This paper provides streaming algorithms for solving linear regression problems in the $\ell_p$ norm. In the streaming model, the data arrives one column at a time. The reviewers deem the results to be good, and this is definitely publishable.
In my own reading, this seems to be an algorithms-focused paper, and it may not find its proper or broadest audience at ICLR. However, this is no fault of the authors. The reviews have also been short, but positive.
Overall I recommend acceptance.
Additional Comments from Reviewer Discussion
The reviews have been positive and uniform; few issues were raised, and the authors provided precise responses.
Accept (Spotlight)