Measurement Manipulation of the Matrix Sensing Problem to Improve Optimization Landscape
Abstract
Reviews and Discussion
The authors are concerned with the so-called RIP constant of linear operators. The authors motivate their study with results from the matrix sensing literature. There exists a plethora of randomly constructed linear operators that operate at an optimal sample complexity (i.e., the number of measurements equals, up to polylog terms in the ambient dimension, the degrees of freedom of the problem) and have the RIP with high probability. Almost all of these rest upon the assumption of near isometry, i.e. that . The authors argue that this assumption is not given in many applications, and propose to extend the results in two directions. First, they prove a perturbation result: in essence, if an operator is close to an operator with the RIP, it also has the RIP with a slightly higher RIP constant. Secondly, they propose a preconditioning scheme, based on replacing the sensing operator with its right singular vectors (or, equivalently, preconditioning it with ). They prove a bound on the RIP constants of the preconditioned matrix, and showcase the efficiency of their approach with a numerical experiment.
Strengths
The authors study a relevant problem and connect it nicely to the matrix sensing literature. The article is written in a way which is easy to follow. As far as the scientific content goes, their result on the RIP of perturbed Gaussian matrices can be highlighted. The appeal of the result is that no assumption whatsoever is put on the perturbation -- it may in particular be deterministic. Since the RIP constant is continuously dependent on the matrix operator, it is no surprise that such a result can be proven, but I am unaware of a concrete bound like this in the literature.
Weaknesses
The second part of the paper, related to the improvement of the RIP constants through conditioning, unfortunately suffers from weaknesses.
Questionable novelty
The idea to replace a matrix with its left singular vectors is a very natural one, and it has indeed been proposed before in the literature -- in the manuscript of Chen and Lin (2021) that the authors cite. In there, it is motivated via and formulated in the context of sparse recovery, but the mathematics are more or less equivalent. This severely impacts the value of the empirical success of their method.
I am not convinced by Lemma 1 -- the right-singular vector matrix will only be Haar distributed if the distribution of the operator fulfills some orthogonal invariance, which is not the case for many of the distributions they use in their numerical experiments.
Unclear interpretation of Theorems 4 and 5
From my point of view, Theorems 4 and 5 do little to explain the empirical success of the preconditioning strategy. As the authors point out, Theorem 4 only provides a better bound on the RIP constant if . Since , this can essentially only be the case if the singular values of are biased downward. In particular, in the Gaussian setting that they consider in Theorem 5, it is not. Indeed, for all values of -- i.e., the 'direct bound' that is given by previous results is better. In their proof, they seem to argue that since has the -RIP, it also has the -RIP, and then compare their bound to instead of . I have a hard time understanding the latter reasoning.
I also think that the presence of the term in their bound is very limiting. Does this not necessitate the number of measurements to be proportional to , i.e. grow as the ambient dimension? Much of the compressed sensing literature is about avoiding this situation.
Impact of noise
When performing the preconditioning in practice, it will not only transform the measurement operator, but also noise in measurements . Since the success of their method relies on the singular values of being small (see Theorem and Remark 4), this means that the noise necessarily is amplified. A discussion on this, and possible mitigations, such as in (Chen, Lin; 2021) would make the work more complete.
Small inaccuracies
While the paper generally is well written, it would benefit from another round of proofreading. There are some inconsistencies in the notation: the norm is never defined; on page 9, there is a term that should be a ; and so forth.
Questions
The points that I think are most crucial for the authors to clarify are my concerns about Theorems 4 and 5 above, i.e.
- The necessity of being in the order of .
- The fact that the new lower bound the theorems provide never seems to be smaller than the original .
iii) I also think that the presence of the term in their bound is very limiting. Does this not necessitate the number of measurements to be proportional to , i.e. grow as the ambient dimension? Much of the compressed sensing literature is about avoiding this situation.
Thank you for bringing up an important issue. We could directly get the following result from the proof of Theorem 4:
Since rescaling will not affect the landscape of the matrix sensing problem, we multiply all of the above inequalities by and the tightest RIP bound we could have is , which is
If we plug in the bound in Assumption 1 that
$
\operatorname{Pr}\left\{\sqrt{n^2/m}-1-t \leq \sigma_i(\mathbf{A}) \leq 1+\sqrt{n^2/m}+t,\ i \in [m]\right\} \geq 1-2\exp\left(-mt^2/2\right), \quad \forall t>0.
$
For , can go to infinity while , we could have
As , in the order of , we could have , and the right-hand side upper bound goes to . Thanks a lot for mentioning this concern; we will update the new bound in our revised paper. There is no need for to be on the order of and it should be .
- Small inaccuracies
While the paper generally is well written, it would benefit from another round of proofreading. There are some inconsistencies in the notation: the norm is never defined; on page 9, there is a term A(M) that should be A(X); and so forth.
Thanks a lot for pointing that out. We will fix the typos and proofread the paper one more time.
- Unclear interpretation of Theorems 4 and 5
i) From my point of view, Theorems 4 and 5 do little to explain the empirical success of the preconditioning strategy. As the authors point out, Theorem 4 only provides a better bound on the RIP constant if . Since , this can essentially only be the case if the singular values of are biased downward.
For the simulation experiments, we can see that, for the uniform, correlated normal, and Poisson distributions, the original sensing operator has an RIP constant close to 1 and the Gaussian assumption is violated. These are the main situations we care about, and the pre-conditioning yields a great improvement on those distributions. The method improves these worst cases, and for the violation of , the original is not so large as to substantially harm the landscape.
ii) In particular, in the Gaussian setting that they consider in Theorem 5, it is not. Indeed, for all values of -- i.e., the 'direct bound' that is given by previous results is better. In their proof, they seem to argue that since has the -RIP, it also has the -RIP, and then compare their bound to instead of . I have a hard time understanding the latter reasoning.
The main theorem explaining the success of pre-conditioning is Theorem 3, which states that the RIP would be zero if its domain is restricted to . Theorems 4 and 5 aim to show that pre-conditioning does not deteriorate the situation for operators whose RIP was already good. So, Theorem 3 shows why pre-conditioning is expected to work, and Theorems 4 and 5 show that there is no harm in using pre-conditioning. Since we never know in advance whether a problem is already easy or not, we should always apply pre-conditioning, and it would be detrimental if pre-conditioning turned an easy problem into a hard one. We showed that this would not happen.
More precisely, we cannot claim that pre-conditioning improves the RIP constant for nearly isometrically distributed operators. We can only claim a high-probability result for the two events to happen at the same time, namely and . As long as you can bound , you can also bound . Rather than saying that Gaussian (which is already an easy problem) can benefit from pre-conditioning, the claim is that pre-conditioning will not hurt even if the original sensing operator already has the best sample efficiency.
If we look at the simulation results, we can see that for large n, m values, the empirical RIP values for the original and pre-conditioned operators are almost the same for the normal distribution, and therefore there is no improvement (and no performance deterioration due to pre-conditioning). If the original sensing operator is already good, like i.i.d. Gaussian, the pre-conditioning will not help much.
Dear Reviewer oF7T,
We greatly appreciate the time and effort the reviewer has put to read our work and provide constructive comments.
- Questionable novelty
i) The idea to replace a matrix with its left singular vectors is a very natural one, and it has indeed been proposed before in the literature -- in the manuscript of Chen and Lin (2021) that the authors cite. In there, it is motivated via and formulated in the context of sparse recovery, but the mathematics are more or less equivalent. This severely impacts the value of the empirical success of their method.
The idea of preconditioning is indeed not new. The main difficulty is how to transfer the sparsity assumption from compressed sensing to the low-rank assumption in matrix sensing. In Theorems 3 and 4, we theoretically proved the RIP bound for the low-rank case. Our experimental results show that the preconditioning also works in the matrix sensing setting. We also show that preconditioning works even for matrices with special structures, such as low-rank or sparse structure. While there are many papers on RIP, none of them has applied the pre-conditioning idea, and we believe the complexity lies in the analysis of low-rank matrices rather than sparse vectors. To understand the importance of this result, let us first summarize our results:
- The existing literature states that if RIP is less than 0.5, the learning problem is easy.
- Another result says that RIP can decrease below 0.5 and approach zero if the samples are from i.i.d. Gaussian and the number of samples increases.
- RIP may never become less than 1/2 for non-Gaussian distributions even if there are infinitely many samples.
- Calculating RIP is NP-hard and it may not change continuously or smoothly with respect to changes in the probability distribution of the measurement models. Now, Section 2 asks whether RIP changes smoothly if there is a deviation from the assumptions made in the literature (i.e., from i.i.d. Gaussian).
- We showed that if the deviation from Gaussian is not too much, then increasing the number of samples would lower the RIP to the level of Gaussian and then it can become less than 0.5. To do so, we showed that there is an upper bound on RIP and it can be reduced by increasing the number of samples. This is a very strong result because for distributions far away from Gaussian it is known that even with infinitely many samples RIP can still be very large, but we showed that this is not the case if the deviation from Gaussian is modest.
- In Section 3, we consider a distribution that is far away from Gaussian.
- It is known that for such distributions, the RIP is very high and may not decrease to below 0.5 even in the presence of many samples.
- We proposed a pre-conditioning technique that transforms the distribution into a new one that is closer to Gaussian. In other words, after pre-conditioning, the individual entries of the new sensing matrices are approximately Gaussian and the preconditioned operators are likely to act as i.i.d. Gaussian with small perturbation.
- Then, the results of Section 2 would be applied to the pre-conditioned operator.
Our results allow a matrix sensing problem with many spurious solutions to be transformed into another one without any spurious solutions. This can by no means be inferred from Chen and Lin (2021), and we believe the results are powerful.
ii) I am not convinced by Lemma 1 -- the right-singular vector matrix will only be Haar distributed if the distribution of the operator A fulfills some orthogonal invariance, which is not the case for many of the distributions they use in their numerical experiments.
In "Elizabeth S. Meckes, The Random Matrix Theory of the Classical Compact Groups, Cambridge University Press, 2019" we found the following results, which support our claim that the individual entries of a random orthogonal matrix are approximately Gaussian for large matrices:
Corollary 2.6: For each , let be a random orthogonal matrix. Then the sequence converges weakly to the standard Gaussian distribution, as .
Theorem 2.9: Let $\{U_n\}$ be a sequence of random orthogonal matrices with for each , and suppose that , with . Let denote the top-left block of , and let denote a random matrix of i.i.d. standard normal random variables. Then
Our intuition is that random orthogonal matrices have asymptotic behavior similar to Gaussian for large n values. We agree that a deeper analysis would be useful. However, all of the bounds and theoretical results we currently have in the paper are correct, and are supported by our simulations.
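To make this intuition concrete, below is a small numerical sketch (our own illustration, not part of the paper) that samples a Haar-distributed orthogonal matrix via the QR decomposition of a Gaussian matrix and checks that its rescaled entries behave approximately like standard normals, in the spirit of Corollary 2.6 above:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure.

    Standard recipe: QR of a Gaussian matrix, with a sign correction on the
    columns so that the resulting distribution is exactly Haar."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n = 500
U = haar_orthogonal(n, rng)

# Entries of a Haar orthogonal matrix have variance 1/n, so sqrt(n) * U_ij
# should be approximately standard normal for large n.
scaled = np.sqrt(n) * U.ravel()
print("mean %.3f, std %.3f, tail fraction P(|x|>2) %.3f"
      % (scaled.mean(), scaled.std(), np.mean(np.abs(scaled) > 2.0)))
# For a standard normal: mean 0, std 1, and P(|X| > 2) is about 0.046.
```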
- Impact of noise
- When performing the preconditioning in practice, it will not only transform the measurement operator, but also noise in measurements . Since the success of their method relies on the singular values of A being small (see Theorem and Remark 4), this means that the noise necessarily is amplified. A discussion on this, and possible mitigations, such as in (Chen, Lin; 2021) would make the work more complete.
Thank you for bringing up an important issue. We respectfully disagree with your conclusion that our method would amplify the noise. To understand the issue, consider the noisy case , where denotes noise. The optimization problems
$
\min_X \|\mathcal A(X)-y\|\quad \text{s.t.}\quad \text{rank}(X)=r
$
and
$
\min_X \|2\mathcal A(X)-2y\|\quad \text{s.t.}\quad \text{rank}(X)=r
$
have the same solutions. This means that if we multiply all measurements by a factor of 2, the optimization solution will not change even if the noise value is increased by a factor of 2. In fact, it can be shown that the RIP value also does not change under scalar multiplication. What matters is the relative magnitude of the elements of compared to , and multiplying both of them by the same constant will not affect their relative values. We should mention that the general definition of RIP is
$
(1-\delta)\|M\|_F^2\leq \text{constant} \times \|\mathcal A(M)\|^2\leq (1+\delta)\|M\|_F^2
$
where "constant" can be any arbitrary number. Many papers normalize and consider the constant to be 1. That is why in our bounds we have , and otherwise we can re-scale it as re-scaling will not affect the optimization solution. We will add a discussion to the paper to address your comment and emphasize that this method does not directly make the noise effect severe.
I have read the authors' detailed response, and have carefully thought about it. Let me first say that response 4 has helped my understanding. I was wrong to say that small singular values in and of themselves are problems -- it is rather the ratio between the largest and smallest singular values (i.e. the condition number of A) that is relevant.
The rest of the responses have however not convinced me. In short, it seems to me that the authors in their response present a very different interpretation of their results in Section 3 than in the paper itself. I find this concerning for the soundness of the paper, and I will therefore not raise my score.
In their response, the authors write that
we cannot claim that pre-conditioning improves the RIP constant for nearly isometrically distributed operators.
I.e., that the RIP-constant of the preconditioned matrix is not better than the non-preconditioned one, but not significantly worse, and that that is the intention of Section 3. However, in said section (in Remark 5), they write
Hence, the preconditioning algorithm can improve the RIP constant for nearly isometrically distributed random matrices with high probability.
These two statements diametrically contradict each other. Their updated result in particular also showcases this change of interpretation: a claim that the method improves the RIP constant has become a claim that, for large values of m and n, the RIP constant does not change (it should be acknowledged that their updated result does not need to grow as , which is good).
This begs the question: Which interpretation is correct? If it is the one given in the response, I think it is fair to say that the theory does little to show that the preconditioning improves the RIP, and therefore does not explain the success of the method.
Also, although Theorem 3 gives a nice intuitive picture, it ultimately only proves that the RIP of the matrix is small if all low-rank matrices lie close to -- whether this is true for moderately sized is left entirely open. If one relies on to contain the entire set of rank- matrices, one will need to be the entire space (since ), and hence that .
Let me also clarify some things related to the responses 1.
(i) I understand that the authors formulate their results in a context of matrix sensing problems. This interpretation is nice. However, I think that as far as the theorems they prove are concerned, proving RIP for sparse signals is mathematically more or less equivalent to proving RIP for low-rank matrices (or any other set of structured signals, for that matter).
(ii) I agree that the entries of Haar matrices are approximately Gaussian distributed; I do not question the validity of this theorem. What I'm rather not convinced about is how relevant the theorem is to say something about the RIP-property of the preconditioned matrix: unless is rotationally symmetrically distributed, is not Haar-distributed. Since the authors say that they want to treat distributions that are far away from Gaussian measurements (which is the example of rotationally symmetrically distributed matrices they appeal to) in this section, it is not clear to me why Lemma 1 should apply to it.
Thank you so much for reading our rebuttal and providing further comment. We are happy to hear that your concern about singular values has been addressed. About your other concern, we would like to emphasize that the results in the paper are correct but probably the words we have used in our response to interpret them have been a bit confusing. Let us first summarize what we have done and then respond to your concerns:
- If the probability distribution is far away from Gaussian, then pre-conditioning works. We proved it theoretically and also showed that the RIP should be reduced from numbers close to 1 to numbers less than 1/2.
- If the probability distribution is close to Gaussian, the problem looks simpler in general. However, there are two scenarios:
  - 2a) The number of measurements is not large. In that case, the RIP would be improved. The improvement may not be significant, since our pre-conditioning aims to make the distribution close to Gaussian, and when it is already almost Gaussian the benefit may not be as high as in case 1. However, if the RIP is close to 1, pre-conditioning can bring it to numbers less than 1/2. This is what Corollary 2 and Remark 5, which you referred to, say.
  - 2b) The number of measurements is large. In that case, the RIP is less than 1/2 and the problem is easy, and pre-conditioning does not deteriorate the RIP.
Thank you very much for sharing your thoughts on Theorem 3. Finding the correct value of RIP is NP-hard and so providing a theorem that precisely characterizes the effect of pre-conditioning on RIP would be unlikely. What we did indeed can be interpreted as follows:
RIP amounts to solving an optimization over a non-convex low-rank set (let’s call the feasible set R). Let’s focus on a subset of R, named S, which is obtained by using a sparse combination. We said the objective function of the RIP optimization over the subset R reduces after pre-conditioning. We can argue that due to continuity this may imply that pre-conditioning may also lower the objective value of the RIP optimization outside of R. But our main observation was that the gap between R and S reduces as the number of measurements increases. We agree this theorem may not be as concrete as what you had hoped for, but calculating RIP is NP-hard and finding the exact relationship between RIP and the pre-conditioning matrix may be unlikely. We have done extensive simulations to empirically support the usefulness of our results.
Haar matrices are used to provide some rough intuition and we did not use them in our proofs. We would be happy to make the language of this part milder.
We greatly appreciate the time and effort you have put to provide thoughtful comments. We will certainly improve the presentation of our paper and add more expositions based on these comments.
Thank you for responding again. First, let me state that I think that I understand your setting, and I in particular agree that improving the RIP through pre-conditioning per se is sound and potentially of practical use. The measurement matrix is not always something that one can choose, as the authors have correctly motivated with their example.
The reason why I am still confused (and not convinced) is the interpretation of Corollary 2 and Remark 5. Let me be very concrete here: Denote . Surely, . Cor. 2 and Rem. 5 state that the RIP of the perturbed matrix is bounded by . The authors now say that it is interesting that this new bound can be made smaller than 1/2 even when is close to . However, in order for , we need that , i.e. , i.e. , and is smaller than , since . Hence, in order for the new bound to be smaller than , the old RIP-constant of needs to be smaller than . (Note that is the (upper bound) for the RIP of -- this is the definition of the event in the proof). I therefore do not understand how we can use said corollaries to argue that the RIP constant goes from something above 1/2 to something below 1/2. If the authors can explain my error in reasoning here, and convince me that the theory shows that the method can improve a RIP constant above 1/2 to one under 1/2, my view of the paper would change significantly.
The paper discusses the robustness of the restricted isometry property constant of matrices to perturbations, and a pre-processing algorithm to improve the RIP constant of a large class of matrices, with the matrix sensing problem in mind. The main application discussed is the matrix sensing problem.
Strengths
The RIP constant is a fundamental property with a great deal of practical importance in compressed sensing, matrix sensing, and other regression tasks. The paper is mostly self-contained, the proofs are quite simple and easy to follow, and convincing numerics are also provided to showcase the proposed algorithm for various types of matrix ensembles. The algorithm seems very robust. The overall writing quality is high.
Weaknesses
The two main weaknesses from my viewpoint are: firstly, it is not clear that ICLR is the right venue for this paper, which is very strongly signal processing oriented. Secondly, the motivation of the pre-processing algorithm is not clear: if one can design the sensing matrices at will, what is then the advantage of starting from given ones and trying to improve them? Why not directly take Gaussian matrices with a good RIP constant? This needs to be discussed, as it appears nowhere.
Questions
- I would ask the authors to discuss the above point: if one can design the sensing matrices at will, what is then the advantage of starting from given ones and trying to improve them? Why not directly take Gaussian matrices with a good RIP constant? Provide an example where the re-design is natural.
- The first line below Def. 2 is not clear. What A are we talking about that is nearly isometric? Please rephrase.
- What is a variance « proxy » in Corollary 1 or Theorem 2?
- Rmk 2: where does it come from that A has RIP O(1/sqrt m)? Is this automatic from nearly isometrically distributed?
- The text below the proof of Theorem 3 is important but a bit too condensed to be well understood. Please expand it.
Dear Reviewer 27fu,
We greatly appreciate the time and effort the reviewer has put to read our work and provide constructive comments.
- The motivation of the pre-processing algorithm is not clear: if one can design the sensing matrices at will, what is then the advantage of starting from given ones and trying to improve them? Why not directly take Gaussian matrices with a good RIP constant? This needs to be discussed, as it appears nowhere.
We believe that there has been some major confusion about the main idea of our work. Before responding to your comment, we would like to provide a real-world example to better explain how is designed from . The low-rank matrix sensing problem studied in this paper naturally appears in power systems, where the problem is called state estimation and is solved every 5 minutes by power system operators. A power system is a graph with nodes and a set of edges . Each node of the system has a voltage parameter to be learned. Each measurement of the network is in the form of
$
b_j=\sum_{i: (j,i)\in\mathcal E} \frac{x_j(x_j-x_i)}{z_{ji}}
$
where is a known line parameter and is the power flowing over line . The right-hand side of the measurement can be written as
$
b_j=\langle A_j,xx^T\rangle
$
for some matrix that depends on the parameters and the topology of the graph (note: is the vector of all nodal voltages). We cannot change any measurement model directly. Changing an entry of means removing/adding lines to a physical power grid or changing the reactances of the transmission lines on the streets, which is impossible (the goal is to learn the voltages from the data given by the sensors rather than changing the infrastructure). The existing methods requiring to be Gaussian are not applicable at all since is heavily structured for power systems. We propose the following idea:
- Recall that sensors return the measurements .
- We design some coefficient , and create a mixed measurement . We replace measurement 1 with this new combined measurement. Then, the new measurement can be written as , where is equal to .
- Note that the mixing idea cannot generate arbitrary values for . For example, if there is no physical line between nodes 2 and 3, then the (2,3) entry of all matrices are zero and so the (2,3) entry of is also zero no matter what coefficients 's we select.
- We then proceed and replace measurement 2 with a new mixed measurement . We proceed with the replacement of all measurements.
- Using this idea, we exploit the existing measurements/sensors, and do not require new measurements that would be physically infeasible. The question is: how can 's, 's, etc. be designed so that the process of learning becomes simpler?
To explain the above idea in a general context, the pre-conditioning trick mentioned in this paper allows us to take linear combinations of the original sensing matrices, and we cannot change to an arbitrary operator like a Gaussian i.i.d. sensing operator. We use mixed measurements to obtain
The new sensing operator can only lie in the linear span of the original sensing matrices. In Algorithm 1, we have
This is equivalent to calculating a weight matrix and performing the left-side multiplication . If we were to choose another matrix, say , and obtain the new operator via , since has a rectangular shape, the set of possibilities for is limited and none of the choices may be close to Gaussian. In the context of power systems, we explained that no matter how we perform the mixing, the (2,3) entry of the new operator is zero if there is no physical line (2,3) in the system.
We will revise the paper to clarify the setting of the paper in response to your constructive comment. We will include the above example to illustrate the main idea of the paper.
- first line below def 2 is not clear. What A are we talking about that is nearly isometric? Please rephrase
Thanks. We will improve the writing. Line 118 says that for any A satisfying Definition 2 (nearly isometrically distributed), the RIP constant of A is upper bounded by with high probability if .
- what is a variance "proxy" in corollary 1 or theorem 2?
Variance proxy is used in the definition of Sub-Gaussian distribution. For a sub-Gaussian random variable, there exists a parameter that bounds the tails of its distribution in a way similar to how the variance bounds the tails of a Gaussian (normal) distribution. To be more exact, if there exists some such that for all , then is called a "variance proxy".
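For completeness, one common way to write this definition (the symbols below are ours, since the inline formula did not render on this page) is

$
\mathbb{E}\left[e^{\lambda (X - \mathbb{E}[X])}\right] \leq e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda \in \mathbb{R},
$

and any $\sigma^2$ for which this holds is called a variance proxy; it implies the Gaussian-like tail bound $\Pr(|X - \mathbb{E}[X]| \geq t) \leq 2 e^{-t^2/(2\sigma^2)}$ for all $t > 0$.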
- rmk 2: where does it come from that A has RIP O(1/sqrt m)? Is this automatic from nearly isometrically distributed?
Yes, if is nearly isometrically distributed, then there exist positive constants and such that, with probability at least , we have
- the text below the proof to Theorem 3 is important but a bit too condensed to be well understood. Please expand it
Thanks. We will add more exposition to the paper. For , we could have , which means that the RIP would be zero if we allow its domain to be restricted to . As increases, the span eventually consists of all the orthonormal bases in , and hence covers the low-rank space, which is . This means that the true RIP would get close to zero as we add more orthonormal sensing matrices.
The paper considers the matrix sensing problem where a low-rank matrix is sensed via inner products with a set of sensing matrices, resulting in a set of scalar measurements. The results consider the restricted isometry property (RIP) constants for perturbed matrix constructions, where the baseline are matrices with i.i.d. Gaussian entries. The paper also shows that by orthogonalizing the rows of a matrix the RIP constant improves, and by mixing multiple sensing matrices the RIP constants improve as well.
Strengths
The numerical results verify the benefit of pre-conditioning the sensing matrices to be orthogonal to one another.
Perturbing sensing matrices to improve their performance is an interesting idea.
Weaknesses
The analytical results show that the RIP constants increase with perturbation; it is not clear then what is the benefit of perturbation if it requires more measurements (e.g., Remark 2). The matrix perturbation proposed is not tested numerically.
The broadest results I am aware of for RIP of random matrices are based on subgaussianity of the underlying distribution of its entries. Thus focusing on Gaussian matrices in a comparison may be too narrow; and I note that some of the improvements claimed here involve subgaussian distributions already. Furthermore, sums of Gaussian and subgaussian matrices are subgaussian.
It is well known in the CS literature that orthogonalizing the rows of the sensing matrix improves its performance (these are so-called orthogonal projectors). See for example https://doi.org/10.1109/TIT.2005.862083 and https://mdav.ece.gatech.edu/publications/dwb-tr-2006.pdf - this may also relate to the Haar distributed random matrices used in Section 3.1, although they have not been defined in this paper. The preconditioning algorithm is therefore commonly applied in the CS literature, and that proposed here is a straightforward extension from measurement vectors to sensing matrices.
It is also straightforward to see that a weighted mixture of matrices will have a distribution closer to Gaussian (akin to the law of large numbers). In addition, sums of subgaussian random variables are subgaussian as well (akin to Corollary 1).
It is not clear why Theorem 1 defines its constants in terms of when a bound of itself is already included in the assumptions; the best constant can be obtained by having meet its bound.
Questions
In line 86, the RIP constant is defined as the smallest number such that the inequality holds; so it should be unique if it exists.
In line 119, a statement is made about nearly isometry, but a distribution for is not established. Is it the one in the following sentence? (i.i.d. Gaussian)
In line 163, it appears that the mat(.) operator must know the size of the target matrix (as an additional input?)
In Theorem 3, is it implicitly assumed that ? Otherwise you would not be able to have all orthogonal.
Typo: Line 52 board -> broad
For the question part:
- In line 86, the RIP constant is defined as the smallest number such that the inequality holds; so it should be unique if it exists.
Yes, thanks for pointing that out. We will change our wording to "If satisfies the RIP inequality (3), every number greater than will satisfy the RIP inequality (3)".
- In line 119, a statement is made about nearly isometry, but a distribution for A is not established. Is it the one in the following sentence? (i.i.d. Gaussian)
Not exactly; there is a wide range of distributions included in the definition of nearly isometry (examples are provided in the response to Comment 2). The statement holds for every distribution satisfying Definition 2.
- In line 163, it appears that the mat(.) operator must know the size of the target matrix (as an additional input?)
Yes, you are correct. We need to know the size of the target matrix. Since we know the size of our sensing matrix , we can do this operation. We will add this description to our updated paper.
- In Theorem 3, is it implicitly assumed that . Otherwise you would not be able to have all orthogonal.
No, we do not need this assumption. In fact, the problem is trivial if . The practical regime is (and in fact the hope is that is close to rather than since the whole point of matrix sensing is to learn a matrix from a limited number of measurements). We could have
and at the same time
They are orthogonal to each other, while .
- It is also straightforward to see that a weighted mixture of matrices will have a distribution closer to Gaussian (akin to the law of large numbers). In addition, sums of subgaussian random variables are subgaussian as well (akin to Corollary 1).
There has been a major confusion here. We would like to provide a real-world example to explain our answer. The low-rank matrix sensing problem studied in this paper naturally appears in power systems, where the problem is called state estimation and is solved every 5 minutes by power system operators. A power system is a graph with nodes and a set of edges . Each node of the system has a voltage parameter to be learned. Each measurement of the network is in the form of
$
b_j=\sum_{i: (j,i)\in\mathcal E} \frac{x_j(x_j-x_i)}{z_{ji}}
$
where is a known line parameter and is the power flowing over line . The right-hand side of the measurement can be written as
$
b_j=\langle A_j,xx^T\rangle
$
for some matrix that depends on the parameters and the topology of the graph (note: is the vector of all nodal voltages). We cannot change any measurement model directly. Changing an entry of means removing/adding lines to a physical power grid or changing the reactances of the transmission lines on the streets, which is impossible (the goal is to learn the voltages from the data given by the sensors rather than changing the infrastructure). The existing methods requiring to be Gaussian are not applicable at all since is heavily structured for power systems. We propose the following idea:
- Recall that sensors return the measurements .
- We design some coefficient , and create a mixed measurement . We replace measurement 1 with this new combined measurement. Then, the new measurement can be written as , where is equal to .
- Note that the mixing idea cannot generate arbitrary values for . For example, if there is no physical line between nodes 2 and 3, then the (2,3) entry of all matrices are zero and so the (2,3) entry of is also zero no matter what coefficients 's we select.
- We then proceed and replace measurement 2 with a new mixed measurement . We proceed with the replacement of all measurements.
- Using this idea, we exploit the existing measurements/sensors, and do not require new measurements that would be physically infeasible. The question is: how can 's, 's, etc. be designed so that the process of learning becomes simpler?
As explained above, no matter how we perform the mixing, the (2,3) entry of the new operator is zero if there is no physical line (2,3) in the system. This contradicts your statement: "It is also straightforward to see that a weighted mixture of matrices will have distribution closer to Gaussian." The mixing technique does not change the number of measurements and also in the context of power systems, it cannot change the values of the zero entries corresponding to non-existent power lines, and therefore, the law of large numbers mentioned in your comment does not hold.
The underlying idea behind this confusion is similar to the issue discussed for Comment 2. being sub-Gaussian bounded does not imply i.i.d. subgaussianity of . We only assume that the largest magnitude of perturbation is bounded with high probability. We do not assume any distribution on perturbation, and it can be asymmetric, not centered around 0. We will add some clarification to the paper about the important issue you have raised to avoid any confusion.
- It is not clear why Theorem 1 defines its constants in terms of when a bound of itself is already included in the assumptions; the best constant can be obtained by having meet its bound.
Thanks a lot for your advice. We will apply the upper bound of to our result to avoid confusion. Right now the best upper bound we could have due to first-order approximation is
Hence, the influence of perturbation on the RIP constant is upper bounded by a controllable constant term (We will update our paper accordingly).
- The broadest results I am aware of for RIP of random matrices are based on subgaussianity of the underlying distribution of its entries. Thus focusing on Gaussian matrices in a comparison may be too narrow; and I note that some of the improvements claimed here involve subgaussian distributions already. Furthermore, sums of Gaussian and subgaussian matrices are subgaussian.
Thank you for your comment. The most general result of our paper is Theorem 1, which states that for arbitrary and any bounded perturbation operator , the RIP constant will increase by at most . There is no assumption on subgaussianity, and this is a deterministic result.
After Theorem 1, its following Corollary and Theorem are high-probability results, based on the assumption that is nearly isometrically distributed, which includes a large number of distributions. For example, it includes the i.i.d. symmetric Bernoulli distribution
or the following where two-thirds of the entries are zero:
Regarding the deviation/perturbation, we assume is sub-Gaussian bounded. Here, we need to clarify that we do not assume i.i.d. sub-Gaussianity for all entries of and only require the largest magnitude of the perturbation to be bounded with high probability, which is
For example, can be [0,1] uniformly distributed with correlation close to 1, while at the same time is upper bounded by 1, and hence is sub-Gaussian. We do not assume any distribution on the perturbation; it can be asymmetric, not centered around 0, and highly correlated. Since there is no i.i.d. sub-Gaussian assumption on the perturbation entries, we cannot directly apply the sub-Gaussianity results for RIP.
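As a small sketch of such a perturbation (a hypothetical construction of our own, only to illustrate the point), the entries below are uniform on [0,1], highly correlated, not centered at 0, and their maximum magnitude is bounded by 1:

```python
import numpy as np

rng = np.random.default_rng(3)
reps, d = 2000, 6      # d stands in for the number of perturbation entries

# For each realization: one shared Uniform[0,1] draw plus a small independent
# Uniform[0,1] part, so every entry lies in [0,1] (max magnitude <= 1) while
# the entries are highly correlated and not centered around 0.
shared = rng.random((reps, 1))
indep = rng.random((reps, d))
N = 0.95 * shared + 0.05 * indep

print(N.min(), N.max())                          # inside [0, 1]
print(np.corrcoef(N, rowvar=False)[0, 1])        # pairwise correlation near 0.997
```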
- It is well known in the CS literature that orthogonalizing the rows of the sensing matrix improves its performance (these are so-called orthogonal projectors). See for example https://doi.org/10.1109/TIT.2005.862083 and https://mdav.ece.gatech.edu/publications/dwb-tr-2006.pdf - this may also relate to the Haar distributed random matrices used in Section 3.1, although they have not been defined in this paper. The preconditioning algorithm is therefore commonly applied in the CS literature, and that proposed here is a straightforward extension from measurement vectors to sensing matrices.
The idea of preconditioning is indeed not new. The main difficulty is how to transfer the sparsity assumption from compressed sensing to the low-rank assumption in matrix sensing. In Theorems 3 and 4, we theoretically proved the RIP bound for the low-rank case. The existing results on orthogonalization are in a different context and cannot lead to these conclusions. We would like to emphasize that although the proposed technique may seem natural, it indeed leads to a powerful result: the RIP is often above 0.5, in which case there are many spurious solutions, but pre-conditioning reshapes the landscape and makes the spurious solutions disappear. Many papers have been published on RIP in NeurIPS, ICML, ICLR, etc. in the past 10 years, and while several of them tried to improve the RIP, none of them was able to make the observation of our paper. We believe that this is because the idea of preconditioning for low-rank matrices is not as trivial as it is for the sparse vectors previously studied in CS.
Dear Reviewer iS9D,
We greatly appreciate the time and effort the reviewer has put to read our work and provide constructive comments.
- The analytical results show that the RIP constants increase with perturbation; it is not clear then what is the benefit of perturbation if it requires more measurements (e.g., Remark 2). The matrix perturbation proposed is not tested numerically.
Before responding to your comment, we would like to first summarize the goals of Section 2:
- The existing literature states that if RIP is less than 0.5, the learning problem is easy.
- Another result says that RIP can decrease below 0.5 and approach zero if the samples are from i.i.d. Gaussian and the number of samples increases.
- RIP may never become less than 1/2 for non-Gaussian distributions even if there are infinitely many samples.
- Calculating RIP is NP-hard and it may not change continuously or smoothly with respect to changes in the probability distribution of the measurement models. Now, Section 2 asks whether RIP changes smoothly if there is a deviation from the assumptions made in the literature (i.e., from i.i.d. Gaussian).
- We showed that if the deviation from Gaussian is not too much, then increasing the number of samples would lower the RIP to the level of Gaussian and then it can become less than 0.5. To do so, we showed that there is an upper bound on RIP and it can be reduced by increasing the number of samples. This is a very strong result because for distributions far away from Gaussian it is known that even with infinitely many samples RIP can still be very large, but we showed that this is not the case if the deviation from Gaussian is modest.
Then, moving on to Section 3, we consider a distribution that is far away from Gaussian. We proposed a pre-conditioning technique that transforms the distribution to another one that makes the new distribution closer to Gaussian. Then, the results of Section 2 would be applied to the pre-conditioned operator.
In summary, we do not perturb a given operator; what we mean by perturbation is that when an original operator or a preconditioned operator deviates from Gaussian (in which case we call it a perturbed Gaussian), we would like to understand how much the RIP would increase due to that deviation. We will improve the text of the paper and clarify that by perturbation we mean deviation.
In Section 2, we derive the upper bound for RIP under deviation from (perturbation to) Gaussian. In Theorem 2, based on our assumption, the perturbation is mean 0 and variance proxy (to match with the magnitude "Gaussian i.i.d. sensing operators satisfy RIP"), and that for
The RIP upper bound we derive is , and hence the influence of perturbation on RIP is at a controllable constant level. This also means that we can compensate for the RIP increase due to deviation from Gaussian by slightly increasing the number of measurements . That is the intuition behind the pre-conditioning algorithm. After pre-conditioning, the individual entries of the new sensing matrix are approximately Gaussian, and hence these preconditioned operators are likely to act as i.i.d. Gaussian with small perturbation. As long as the new upper bound after perturbation is smaller than 0.5, we can obtain good properties like global optimality for matrix sensing problems. We will improve the presentation of our results based on your constructive comment.
For the numerical results on matrix perturbation, we can see that there is only a small influence on the empirical RIP value as long as the assumption on the perturbation holds: .

Thanks for the response. My interpretation of N as a perturbation of A was based on the context given by the paper title ("manipulation") and the perturbation itself (which is given first as an addition of a deterministic bounded matrix, instead of being described in terms of the distribution of A). I can conceive of a practical setting for this setup, e.g., when the matrix is generated according to a Gaussian distribution and is then quantized. The second setting is more difficult to assess in terms of practical impact, as both A and N are drawn from specific distributions and, as other reviewers have pointed out, it's not clear why it is better to perturb and add measurements than to keep the original matrix distribution at the larger size, which should also improve the RIP constant.
Also, my last question was wrong and the matrices can only be orthogonal if m < n^2, so we are in agreement.
This paper considers the matrix sensing problem through the Restricted Isometry Property (RIP). Since a larger RIP constant leads to performance degradation and a substantial number of practical sensing operators fall into this case, the authors propose to reduce the RIP constant through a pre-conditioning trick.
Strengths
- This paper is quite easy to follow.
- The analysis seems to be sound. I randomly sampled some proofs and can verify their correctness.
Weaknesses
- My major concern comes from the setting itself. The authors basically change the sensing operator, e.g., from to in Algorithm 1. If this operation is allowed, I wonder why the authors bother with the pre-conditioning trick; they can simply ignore the original operator and use a Gaussian i.i.d. sensing operator. The RIP constant can be guaranteed.
- The justification of the pre-conditioning is not convincing enough. In Section 2, they prove the RIP constant will increase in the presence of perturbation. There are several issues. First, their analysis is on the upper bound on RIP. It's unclear how the RIP constant itself changes. The RIP constant can remain constant while the upper bound on the RIP constant changes from to . Second, the upper bound is a little confusing to me. It seems that there is an optimal sensing number . Intuitively, the performance should keep improving as the sampling number increases, since the deviation becomes smaller.
- The techniques in the analysis are quite standard.
- A minor comment regarding Figure 2 is to plot the error bars of the simulated RIP constant. Unless you exhaustively check all possible rank- matrices, there is no possibility of getting the exact RIP constant. Based on the content, it seems that you use Monte-Carlo simulation to find the RIP constant, which is fine. Still, it would look better to include the variance.
Questions
See above
Details of Ethics Concerns
NA

- The justification of the pre-conditioning is not convincing enough. In Section 2, they prove the RIP constant will increase in the presence of perturbation. There are several issues. First, their analysis is on the upper bound on RIP. It's unclear how the RIP constant itself changes. The RIP constant can remain constant while the upper bound on the RIP constant changes from c to 2c. Second, the upper bound is a little confusing to me. It seems that there is an optimal sensing number m. Intuitively, the performance should keep improving as the sampling number increases, since the deviation becomes smaller.
Section 2 is regarding perturbation analysis, while Section 3 is about pre-conditioning. Let us first summarize the goals of Section 2:
- The existing literature states that if RIP is less than 0.5, the learning problem is easy.
- Another result says that RIP can decrease below 0.5 and approach zero if the samples are from i.i.d. Gaussian and the number of samples increases.
- RIP may never become less than 1/2 for non-Gaussian distributions even if there are infinitely many samples. So, your statement that the performance should improve (or RIP should decrease) by having more samples is unfortunately proven not to hold in general.
- Calculating RIP is NP-hard and it may not change continuously or smoothly with respect to changes in the probability distribution of the measurement models. Now, Section 2 asks whether RIP changes smoothly if there is a deviation from the assumptions made in the literature (i.e., from i.i.d. Gaussian).
- We showed that if the deviation from Gaussian is not too much, then increasing the number of samples would lower the RIP to the level of Gaussian and then it can become less than 0.5. To do so, we showed that there is an upper bound on RIP and it can be reduced by increasing the number of samples. This is a very strong result because for distributions far away from Gaussian it is known that even with infinitely many samples RIP can still be very large, but we showed that this is not the case if the deviation from Gaussian is modest.
Now, let us summarize the goals of Section 3:
- We consider a distribution that is far away from Gaussian.
- It is known that for such distributions, the RIP is very high and may not decrease to below 0.5 even in presence of many samples.
- We proposed a pre-conditioning technique that transforms the distribution to another one that makes the new distribution closer to Gaussian. In other words, after pre-conditioning, the individual entries of the new sensing matrices are approximately Gaussian and the preconditioned operators are likely to act as i.i.d. Gaussian with small perturbation.
- Then, the results of Section 2 would be applied to the pre-conditioned operator.
Regarding your second concern, that is due to technical details of the proof. We also assume that the perturbation has mean 0 and variance proxy , so as to match the magnitude at which "Gaussian i.i.d. sensing operators satisfy RIP". Our RIP bound in Theorem 2 can be decomposed into two parts: the first is the RIP bound for a nearly isometrically distributed , and the second part is the influence of noise. As m increases, the first part decreases, and the second part seemingly increases on the order of . However, we can see that there is an assumption for :
By taking the magnitude of into consideration, your intuition is absolutely correct. The performance keeps improving as the number of samples increases, since the deviation becomes smaller.
- The techniques in the analysis are quite standard.
The problem defined in this paper is new to matrix sensing: how to improve the landscape when it is not possible to change the sensors that collect data (as discussed through the power systems example above). The problem has previously been studied in signal processing for sparse vectors. However, in our paper, the analysis is for low-rank matrices. The framework is similar but the details differ. Although the technique may seem standard, the observation that a simple mixing technique makes the RIP become less than 0.5 is powerful and very useful for practical purposes.
- A minor regarding to Figure 2 is to plot the error bar of the simulated RIP constant. Unless you exhaustively check all possible rank-s matrices, there is no possibility of getting the exact RIP constant. Based on the content, it seems that you use Monte-Carlo simulation to find the RIP constant, which is fine. Still, it looks better to include the variance.
Thanks a lot for your advice. We have re-run the experiments and added the variance to the plot.
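For reference, below is a minimal sketch (with assumed parameters, not the exact experiment in the paper) of the Monte-Carlo procedure: sample random unit-Frobenius-norm rank-r matrices, record how far ||A(X)||^2 deviates from 1, and report the worst observed deviation together with its spread across trials.

```python
import numpy as np

def rip_lower_bound(A_flat, n, r, trials, rng):
    """Monte-Carlo lower bound on the rank-r RIP constant of a sensing operator.

    A_flat has shape (m, n*n); each row is vec(A_j), normalized so that an
    exact isometry would give ||A(X)||^2 = ||X||_F^2."""
    deviations = []
    for _ in range(trials):
        X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
        X /= np.linalg.norm(X)                       # ||X||_F = 1
        ratio = np.linalg.norm(A_flat @ X.ravel()) ** 2
        deviations.append(abs(ratio - 1.0))
    deviations = np.array(deviations)
    return deviations.max(), deviations.std()

rng = np.random.default_rng(4)
n, m, r = 10, 300, 2
A_flat = rng.standard_normal((m, n * n)) / np.sqrt(m)   # i.i.d. Gaussian operator
delta_hat, spread = rip_lower_bound(A_flat, n, r, trials=2000, rng=rng)
print("estimated RIP lower bound %.3f (std of deviations %.3f)" % (delta_hat, spread))
```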
Dear Reviewer aJfp,
We greatly appreciate the time and effort the reviewer has put to read our work and provide constructive comments.
- My major concern comes from the setting itself. The authors basically change the sensing operator, e.g., from to in Algorithm 1. If this operation is allowed, I wonder why the authors bother with the pre-conditioning trick, they can simply ignore the original operator A and use a Gaussian i.i.d. sensing operator. The RIP constant can be guaranteed.
We believe that there has been some major confusion about the main idea of our work. Before responding to your comment, we would like to provide a real-world example to better explain how is designed from . The low-rank matrix sensing problem studied in this paper naturally appears in power systems, where the problem is called state estimation and is solved every 5 minutes by power system operators. A power system is a graph with nodes and a set of edges . Each node of the system has a voltage parameter to be learned. Each measurement of the network is in the form of
$
b_j=\sum_{i: (j,i)\in\mathcal E} \frac{x_j(x_j-x_i)}{z_{ji}}
$
where is a known line parameter and is the power flowing over line . The right-hand side of the measurement can be written as
$
b_j=\langle A_j,xx^T\rangle
$
for some matrix that depends on the parameters and the topology of the graph (note: is the vector of all nodal voltages). We cannot change any measurement model directly. Changing an entry of means removing/adding lines to a physical power grid or changing the reactances of the transmission lines on the streets, which is impossible (the goal is to learn the voltages from the data given by the sensors rather than changing the infrastructure). The existing methods requiring to be Gaussian are not applicable at all since is heavily structured for power systems. We propose the following idea:
- Recall that sensors return the measurements .
- We design some coefficient , and create a mixed measurement . We replace measurement 1 with this new combined measurement. Then, the new measurement can be written as , where is equal to .
- Note that the mixing idea cannot generate arbitrary values for . For example, if there is no physical line between nodes 2 and 3, then the (2,3) entry of all matrices are zero and so the (2,3) entry of is also zero no matter what coefficients 's we select.
- We then proceed and replace measurement 2 with a new mixed measurement . We proceed with the replacement of all measurements.
- Using this idea, we exploit the existing measurements/sensors, and do not require new measurements that would be physically infeasible. The question is: how can 's, 's, etc. be designed so that the process of learning becomes simpler?
To explain the above idea in a general context, the pre-conditioning trick mentioned in this paper allows us to take linear combinations of the original sensing matrices, and we cannot change to an arbitrary operator like a Gaussian i.i.d. sensing operator. We use mixed measurements to obtain
The new sensing operator can only lie in the linear span of the original sensing matrices. In Algorithm 1, we have
This is equivalent to calculating a weight matrix and performing the left-side multiplication . If we were to choose another matrix, say , and obtain the new operator via , since has a rectangular shape, the set of possibilities for is limited and none of the choices may be close to Gaussian. In the context of power systems, we explained that no matter how we perform the mixing, the (2,3) entry of the new operator is zero if there is no physical line (2,3) in the system.
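To make the left-multiplication concrete, here is a minimal sketch of the preconditioning step as described above (the function names are ours; the weight matrix orthonormalizes the rows of the flattened operator and the same mixing is applied to the measurements):

```python
import numpy as np

def precondition(A_flat, y):
    """Orthonormalize the rows of the flattened sensing operator.

    A_flat: (m, n^2) matrix whose rows are vec(A_j); y: measurements.
    Returns the preconditioned operator W @ A_flat (with orthonormal rows)
    and the correspondingly mixed measurements W @ y."""
    U, s, Vt = np.linalg.svd(A_flat, full_matrices=False)
    W = U @ np.diag(1.0 / s) @ U.T        # W = (A_flat A_flat^T)^{-1/2}
    return W @ A_flat, W @ y

rng = np.random.default_rng(5)
n, m, r = 8, 60, 2
A_flat = rng.random((m, n * n))           # a poorly conditioned, non-Gaussian operator
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
y = A_flat @ X_true.ravel()

A_pre, y_pre = precondition(A_flat, y)
print(np.allclose(A_pre @ A_pre.T, np.eye(m)))       # rows are now orthonormal
print(np.allclose(A_pre @ X_true.ravel(), y_pre))    # measurements stay consistent
```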
We will revise the paper to clarify the setting of the paper in response to your constructive comment. We will include the above example to illustrate the main idea of the paper.
Dear reviewers:
Since the discussion period would end in a week, could you please read our rebuttal and let us know if you have any further concerns or comments? We provided extensive answers to your previous comments and hope to discuss them with you during this discussion period. We appreciate your time and effort.
This paper focuses on the matrix sensing problem, introducing an approach to improve the Restricted Isometry Property (RIP) constants of a measurement matrix by employing a preconditioning trick, which involves replacing sensing matrices with their weighted sum. Using this approach, the authors claim that the RIP constants can be decreased. However, the authors have not clearly conveyed the main idea of the paper, and the reviewers raised multiple concerns regarding various aspects of the proposed method, including technical details in the proofs and the interpretation of the results. Despite the authors’ efforts during the rebuttal phase, I believe that the current version of the paper needs substantial revisions, both in motivating the core idea and in clarifying the presentation and interpretation of the main findings. Therefore, I recommend rejecting this paper in its current form.
Additional Comments on Reviewer Discussion
The authors provided responses to the reviewers' comments during the rebuttal phase. However, these responses did not seem to satisfy the reviewers, who maintained their initial scores. While reviewers generally did not engage extensively during the rebuttal phase, Reviewer oF7T had a discussion with the authors regarding certain technical details and the interpretation of the theoretical results. The explanations provided by the authors were not convincing enough.
Reject