Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements
Abstract
Reviews and Discussion
The paper considers a new problem setting of recovering a sparse binary vector from generalized linear measurements. The authors propose a simple algorithm based on Plan et al. (2017) and prove its performance guarantee, complemented by a nearly tight lower bound. This then yields a tight resolution of noisy 1-bit compressed sensing and logistic regression. Interestingly, for SparseLinearReg, when , keeping only the sign information suffices. Also, for SparseLinearReg, the authors prove tighter upper and lower bounds based on the MLE.
Questions for Authors
- The proof of Theorem 2.1 doesn't seem to utilize the fact that the entry is Gaussian explicitly; rather, the only facts used were 1. and 2. it satisfies the power constraint (Eqn. (2)). Then are there examples of sub-Gaussian distributions (other than Gaussian) that satisfy the power constraint? If so, perhaps Theorem 2.1 could be stated in a more general form. One immediate example may be a binary sensing matrix, with for some ?
- In Theorem 2.1, what is the meaning of ? This reminds me of the Polyak-Łojasiewicz condition, with a statistical twist(?). What do the authors think? Also, are there (subGaussian) GLMs that do not satisfy this condition? Lastly, has this condition been considered before in the GLM literature?
- The statements of Corollaries 2.2–2.4 should be either "... if ", or "there exists a such that when , Algorithm 1 recovers the unknown signal".
- The authors mention that the lower bound applies to the average error prob, while the upper bound is for the max prob. Then is there a chance that there may be a better algorithm (or better analysis) for deriving an upper bound for the average error prob, somewhat closing the gap between upper and lower bound ( vs. )? Or is the conjecture of Gamarnik & Zadik (2017a) basically saying that the gap (for now) is inevitable regardless of whether one looks at max error prob or average error prob?
- By the way, in the abstract, the authors mention that there is no statistical-computational gap for 1bCSbinary and logistic regression. Why is this the case? To my understanding, the upper bounds hold for and the lower bound requires ? ( for logistic regression).
- In the proof of Theorem 2.9 (Appendix A.3), can the authors elaborate on the last equality (equivalent formulation of log likelihood testing)?
- Is there a combinatorial (or something similar) intuition behind ?
- It seems to me that the MLE-based approach is valid for GLM observations beyond SparseLinearReg, as we have a parametric assumption on the distribution, which makes well-defined...? Then, am I correct in saying that the improved upper and lower bound arguments (Theorems 2.9, 2.10) cannot be extended trivially to GLM observations?
Claims and Evidence
See Theoretical Claims.
Methods and Evaluation Criteria
Not applicable.
Theoretical Claims
I checked all proofs and concur that they are generally correct. There are some minor issues and typos regarding absolute constants and tightness of the results, but the orderwise guarantees remain intact. Typos are deferred to the end. Here are some minor issues that I saw:
- In line 338 (second column), it should be , not .
- In the proof of Theorem 2.5, when Fano's inequality is applied, why is there a "+1" in the ? To my understanding, the total number of possible k-sparse binary vectors is precisely $\binom{n}{k}$? Then one could bound as , yielding a slightly tighter inequality.
- In the last part of the proof of Theorem 2.5, when the authors invoke the result of Topsøe (2001), there is a missing. Also, the range of should be , not .
- In many parts of the other proofs (e.g., Proof of Theorem 2.10), the authors use the inequality . It may be minor, but still I would prefer if the authors write the proofs and statements via .
Experimental Design and Analysis
Not applicable.
Supplementary Material
Yes. I have reviewed the entire supplementary material. I have skimmed through the detailed calculations, such as integrals and algebraic manipulations.
Relation to Broader Scientific Literature
- To the best of my knowledge, the paper tackles a new problem setting
- Nearly tight sample complexity bounds, which cannot be obtained by prior algorithms or analyses
- Interesting proof techniques that may be of interest to the statistics and information theory communities
Essential References Not Discussed
None to the best of my knowledge.
Other Strengths and Weaknesses
Strengths:
- Clearly and well written
- A simple yet effective resolution to a well-studied statistics problem, with tight theoretical guarantees extending to GLM observations.
- Although the proof flow itself is standard in information theory and statistics, there are several novel details that I appreciated: correlation vs. non-correlation based on the support of and appropriate use of independence arguments (virtually all theorems, to my understanding), and the coupling(?)-based tighter lower bound for SparseLinearReg (Theorem 2.10).
Weakness:
- Given the algorithm's simplicity, it would have been better to include some toy experiments that showcase the tightness of the theoretical results presented.
- Writing can be overall improved; see below.
Other Comments or Suggestions
Typos:
- The notation is used interchangeably for column vectors and row vectors throughout the paper. For instance, in Eqn. (2), is a row vector, but then shouldn't it be , not ? But then in Section 2.1, the same notation is overloaded as a column vector. Also, in some parts, this is denoted as (e.g., line 360 left column).
- Line 52 (first column): is not defined
- Line 91 (second column): Guassian => Gaussian
- Line 128 (first column): Theorem 2.8 => Corollary 2.8
- Line 639: what is the definition of ?
- Line 754: what is the definition of ? Actually, it seems that this is defined in line 978 as the differential entropy.... It should be defined sooner
- Line 801:
- Line 1085: Jenson => Jensen
- In Theorem 2.10,
Suggestions:
- I would suggest putting in a table comparing prior results and the results shown in this paper. Right now, all the comparisons to prior works are crammed into Section 1.1, which was a bit hard for me to parse at first, and I had to go back and forth after going through the theorems. Also, the table would help the readers organize the theorem results more clearly, as they are subtly different depending on the setting: specifically, whether or not.
- Some notations are left undefined, e.g., . The definitions should be included, at least in a footnote, for completeness.
- Line 913: maybe put in a reference for the inequality (Theorem 11.1.3 of Cover & Thomas (2006))
- Throughout the paper, I see things like [Example 2.5.8](Vershynin, 2018). If the authors meant to do \citet[Example 2.5.8]{vershynin2018highdim}, which should yield Vershynin (2018, Example 2.5.8), the authors should fix the bibtex errors accordingly.
- Appendix B is not very self-contained... Notation such as , , and isn't defined; it seems that the reader must look it up in Plan et al. (2017)
Thank you very much for a careful review and all your interesting insights and questions. We completely agree with the suggestions, minor issues and typos mentioned. We will incorporate them in the next revision of the paper. In particular, we will fix all the minor issues and typos, incorporate the suggestions including adding a table of results and defining a separate notation for a row vector and column vector of .
Below we give responses to the questions asked.
(Question 1) Your observation is correct. As you mentioned, there are other subgaussian distributions, for example, each entry of chosen i.i.d. uniformly from the set {0,1} or {-1,1}, that can be used for the measurement matrix. The general proof technique works for these distributions too. However, the theorem statement in its current form is true only for a Gaussian matrix where each entry is chosen i.i.d. . The proof uses the fact that each entry of the measurement matrix has zero mean, and it uses the distribution of the entries when applying Stein's lemma between (11) and (12). As you noticed, the proof outline can easily be adapted to any other distribution on the measurement matrix. We will mention this in the revised version of the paper.
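For readers unfamiliar with it, the standard form of Stein's lemma (a textbook fact stated here for reference, not a quotation from the paper) is
$$\mathbb{E}\big[Z\, g(Z)\big] = \mathbb{E}\big[g'(Z)\big], \qquad Z \sim \mathcal{N}(0,1),$$
for any differentiable $g$ with $\mathbb{E}\,|g'(Z)| < \infty$; this identity is what ties the current statement of the theorem to Gaussian entries specifically.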
(Question 2) This is an interesting observation. While there might be some relation to an optimization paradigm, rather than being a condition on the model, it can be thought of as a definition of , which determines the rate of convergence. Note that we can also write , or omit it altogether and write instead of in (7) (Theorem 2.1). Since could be vanishing with , as is the case in this paper, as defined it will always exist. The numerator in can also be written as (see the series of transformations between (11) and (12) on page 7). This quantity has previously been used in Plan, Vershynin, and Yudovina (2017); see the definition of in Proposition 1.1. It arises naturally when analyzing the convergence of the linear estimator, since the estimator measures the contribution of each coordinate to the measurement outcomes.
(Question 3) You are right. Thank you for noticing this.
(Question 4) It turns out that for an i.i.d. Gaussian design the average error and max error criteria are equivalent. Given any decoder $D$ with good average error performance, we can design a decoder $D'$ with good max error performance as follows:
On input $(A, y)$, sample a uniform $n \times n$ permutation matrix $P$; compute $\hat{x} = D(AP, y)$, and output $P\hat{x}$.
Then, for any $x$, we have $Ax = (AP)(P^{-1}x)$, so $y$ can be viewed as a measurement of $P^{-1}x$ under the design $AP$. Moreover, $AP$ is an i.i.d. Gaussian design, because a uniform permutation of the columns of a Gaussian matrix is still i.i.d. Gaussian. Since $AP$ is i.i.d. Gaussian and $P^{-1}x$ is a uniformly random $k$-sparse binary vector, the max error of $D'$ is the same as the average error of $D$.
Thus, the conjectured hardness by Gamarnik and Zadik is true for both maximum probability of error and average prob of error.
We will mention it in the revised version.
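To make the reduction concrete, here is a minimal illustrative sketch in NumPy (our own rendering, not code from the paper; `avg_error_decoder` is a hypothetical placeholder for any decoder with a good average-error guarantee):

```python
import numpy as np

def max_error_decoder(A, y, avg_error_decoder, rng=None):
    """Wrap an average-error decoder D into a max-error decoder D'.

    A : (m, n) i.i.d. Gaussian sensing matrix
    y : (m,) measurements generated from an unknown k-sparse binary x
    avg_error_decoder : callable (A, y) -> estimate of the k-sparse signal
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    perm = rng.permutation(n)             # uniform random permutation of coordinates
    P = np.eye(n)[:, perm]                # permutation matrix: (A @ P)[:, j] = A[:, perm[j]]
    x_hat = avg_error_decoder(A @ P, y)   # estimates P^{-1} x, which is uniform k-sparse
    return P @ x_hat                      # undo the permutation to estimate x
```

Because the permutation is drawn fresh at decoding time, the guarantee holds for every fixed x, which is exactly the max-error criterion.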
(Question 5) In the prior literature, and are not distinguished for the purpose of information-computation gap, which makes sense if for any . When for some constant , we can use an identity matrix to recover in measurements. Note that in this case. Therefore altogether and may not be differentiated for the purpose of this problem.
(Question 6) We believe that you are referring to the last equality on page 15. We should have added another step here. It follows from substituting the values of the probability densities, as we show below. For any and set , the density . Thus, The last quantity here can be manipulated to obtain the last equality on page 15.
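To illustrate the type of substitution involved (in our own notation, which may differ from the paper's), for SparseLinearReg with i.i.d. Gaussian noise of variance $\sigma^2$ the density of the observations given a candidate $k$-sparse vector $x$ is
$$p(y \mid A, x) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(y_i - \langle a_i, x\rangle)^2}{2\sigma^2}\right),$$
so the log-likelihood ratio between two candidates $x$ and $x'$ equals $\frac{1}{2\sigma^2}\big(\lVert y - Ax'\rVert_2^2 - \lVert y - Ax\rVert_2^2\big)$, and likelihood-ratio testing is equivalent to comparing squared residuals.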
(Question 7) You are right. In principle, the bounds of Theorems 2.9 and 2.10 can be extended to GLM observations, though it will be more difficult to analyse them as the probabilities become more complex. We did not try to extend them to one-bit compressed sensing and logistic regression because the given upper and lower bounds already match up to constants. Recall that we can use an identity sensing matrix when is linear in .
I thank the authors for addressing all of my concerns and questions! I intend to keep my score and champion its acceptance.
Out of curiosity, could the authors respond to my Question #7 regarding the intuition behind ? The authors' response to Question #7 seems to be for my Question #8. Thanks!
After the authors' second rebuttal comment:
I'm satisfied with the authors' responses. Thank you, and congratulations on this nice work! I hope that the authors will incorporate all the comments and suggestions from me and other reviewers into the future revision, which will further strengthen the paper! I keep my score.
We apologise for missing the answer to Question 7. Please find it below.
This value of maximizes . To see this, note that can be thought of as a conditional entropy , where and are both Bernoulli random variables, with and and . The marginal on is given by . Thus, . This upper bound is achieved by substituting in . Note that approximately gives the number of -sparse vectors at Hamming distance from a given vector. Therefore, this value of approximately corresponds to the Hamming distance at which the number of such vectors is maximized.
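For context, the counting fact behind the last remark can be spelled out as follows (our own statement, since the exact quantity from the paper is not reproduced above): a $k$-sparse binary vector at Hamming distance $2d$ from a fixed $k$-sparse binary vector is obtained by turning off $d$ coordinates inside the support and turning on $d$ coordinates outside it, so there are $\binom{k}{d}\binom{n-k}{d}$ such vectors. By the standard entropy bound on binomial coefficients,
$$\frac{2^{k H_b(\alpha)}}{k+1} \le \binom{k}{\alpha k} \le 2^{k H_b(\alpha)}, \qquad H_b(\alpha) = -\alpha \log_2 \alpha - (1-\alpha)\log_2(1-\alpha),$$
maximizing this count over the distance amounts to maximizing binary-entropy terms.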
We again thank you for your careful and positive review of the paper.
The paper studies the problem of recovering sparse binary vectors from noisy generalized linear measurements. For simplicity, I am stating the special-case problems SparseLinearReg and 1bCSbinary here: {} is the unknown -sparse vector that needs to be estimated. is a sensing matrix (possibly random) with the power constraint . The linear observations using that matrix are the entries of the vector for SparseLinearReg, or for 1bCSbinary. The goal is to use an appropriate and an algorithm that takes and produces , which is correct whp. The more general version of the problem uses generalized linear measurements with a link function in the standard way, which captures the above two problems as well as Logistic Regression as special cases. The paper uses having standard Gaussian entries and a very simple algorithm from Plan et al. (2017) that works by outputting the heaviest elements of the vector (which intuitively will correspond to the support of , because is correlated with iff ). In this way, the paper gives an upper bound on the sample complexity (achieved by the aforementioned computationally efficient algorithm) of the general problem, which yields corollaries for each of the three problems: SparseLinearReg, 1bCSbinary, and LogisticRegression. Notably, if , only the sign information in is enough for linear regression (i.e., 1bCSbinary is not harder than linear regression in that regime). The upper bounds also contradict the conjecture from Gamarnik and Zadik (2017) that no computationally efficient algorithm can exist for SparseLinearReg with the provided sample complexity. The paper provides an information-theoretic lower bound based on Fano's inequality for the general problem, which gives corollaries for each of the three problems. This gives a tight characterization of the sample complexity for 1bCSbinary and LogisticRegression. Finally, for exact recovery in SparseLinearReg, the authors complement the lower bound with an analysis of the MLE that gives an almost matching upper bound.
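As an illustration of the "heaviest elements of a linear estimate" procedure described above, here is a minimal NumPy sketch (our own simplified rendering with an unnormalized correlation score; the paper's Algorithm 1 may differ in scaling and details):

```python
import numpy as np

def topk_linear_estimator(A, y, k):
    """Estimate a k-sparse binary x from generalized linear measurements y.

    A : (m, n) sensing matrix with i.i.d. standard Gaussian entries
    y : (m,) measurements, e.g. y = sign(A @ x + noise) for 1bCSbinary
    k : sparsity of the unknown binary vector x
    """
    scores = A.T @ y / len(y)          # linear estimate: coordinate i is large when
                                       # column A[:, i] is correlated with y
    support = np.argsort(scores)[-k:]  # indices of the k largest coordinates
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = 1.0
    return x_hat


# Illustrative usage on a synthetic 1bCSbinary-style instance (parameters chosen arbitrarily).
m, n, k, sigma = 2000, 500, 10, 0.5
rng = np.random.default_rng(0)
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = 1.0
A = rng.standard_normal((m, n))
y = np.sign(A @ x + sigma * rng.standard_normal(m))
x_hat = topk_linear_estimator(A, y, k)
```

The estimator is one-shot: a single matrix-vector product followed by a top-k selection, which is what makes it computationally efficient.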
Questions for Authors
Related to the conjecture in Gamarnik and Zadik, 2017, do the authors think that a statistical-computational gap still exists but for some weaker threshold for the number of samples?
Claims and Evidence
The main body and of the paper contains proofs for the claimed results.
Methods and Evaluation Criteria
Not applicable.
Theoretical Claims
I went through the main body of the paper and the claims seemed reasonable. That being said, I have not done a thorough check of all the proofs.
Experimental Design and Analysis
Not applicable.
Supplementary Material
Not applicable.
Relation to Broader Scientific Literature
The paper discusses related work adequately.
Essential References Not Discussed
I do not have additional suggestions.
Other Strengths and Weaknesses
Strengths
- Unified approach: the main result holds for the general problem that uses generalized linear measurements, and gives as corollaries the results for the three other problems.
- The algorithm is very simple to state and is computationally efficient
- It is interesting (and perhaps a bit surprising) that in the regime only the sign information suffices for sparse linear regression.
- The paper shows tight results for exact linear regression.
Weaknesses
- The gap between upper and lower bounds is hard to quantify, as the bound is expressed as some optimization problem.
- It remains an open problem to find an efficient algorithm with better sample complexity than for linear regression.
Overall, I feel that the paper makes important contributions and would like to recommend acceptance.
Other Comments or Suggestions
.
Thank you very much for your careful and positive review of the paper. Regarding your question: for the 1bCSbinary problem, there is no computational-statistical gap, since the information-theoretic lower bound matches the sample complexity of the linear estimator (up to constants). However, the constant factors can be different, and for other GLMs there may be a computational-statistical gap, depending on the quantities (in Thm 2.1) and the mutual information in Thm 2.5. We hope this answers your question.
This paper addresses the problem of recovering a -sparse binary vector from generalized linear measurements. Given observations , which are related to a sparse vector through an inverse link function such that:
the goal is to accurately reconstruct . Since is binary, the problem reduces to support recovery.
It presents a simple "linear estimation followed by top-k selection" algorithm – essentially a one-shot version of an iterative greedy scheme – and provides tight sample complexity guarantees for several settings, including noisy one-bit compressed sensing (1bCSbinary), sparse linear regression (SparseLinearReg), and logistic regression. In addition to the algorithmic upper bounds, the paper offers nearly matching information-theoretic lower bounds. Notably, logistic regression and 1-bit compressed sensing emerge as special cases of this framework.
The authors build upon the simple linear estimation algorithm from Plan et al. (2017), focusing on its application to binary vectors. They establish both upper and lower bounds on the sample complexity required for successful recovery:
Upper Bound: Sample Complexity of Algorithm 1 (Theorem 2.1)
If the generalized linear model (GLM) ensures that each is a subgaussian random variable with norm , and for some ,
then Algorithm 1 successfully recovers with high probability, provided the number of measurements satisfies:
where is a constant.
Lower Bound for GLMs (Theorem 2.5)
For any sensing matrix , if is a uniformly chosen -sparse vector, then any algorithm that attempts to recover satisfies:
only if the number of measurements meets the condition:
for some mutual information term satisfying for all . Specifically, when takes binary values (), the expected squared inverse link function satisfies:
where the expectation is taken over the randomness in and .
Update After Rebuttal:
I think the authors addressed my review sufficiently, and vote to accept the paper.
Questions for Authors
N/A
Claims and Evidence
The claims made in the submission are supported by clear and convincing proofs.
Methods and Evaluation Criteria
This is a theoretical paper, and this does not apply.
Theoretical Claims
I checked the first 9 pages and they seem fine to me.
Experimental Design and Analysis
This paper does not have experiments.
Supplementary Material
I did not review the supplementary material.
Relation to Broader Scientific Literature
The work builds on the foundational ideas from compressed sensing and sparse linear regression as developed by Candès et al. (2006), Donoho (2006), and Tibshirani (1996). The literature on one-bit compressed sensing—initiated by Boufounos and Baraniuk (2008) and later refined by Jacques et al. (2013)—has dealt with the challenges posed by extreme quantization (only sign information is available). This paper extends that line of work by incorporating Gaussian noise before quantization (the 1bCSbinary model), and it rigorously shows that the sample complexity remains optimal () even when only sign information is used.
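Concretely, the 1bCSbinary observation model described here can be written (in our notation) as
$$y_i = \mathrm{sign}\big(\langle a_i, x \rangle + z_i\big), \qquad z_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.},$$
so that only the noisy sign of each linear measurement is observed.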
Essential References Not Discussed
I wonder if it might make sense to also cite work on lower bounds for sparse recovery (such as "Lower Bounds on Sparse Recovery" (Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff)), since they also have these communication complexity style proofs.
Other Strengths and Weaknesses
Strengths:
- The paper has a unified treatment of multiple regression models.
- The proposed algorithm -- a linear estimator followed by top-k selection -- is simple.
- The insight into not having a stat-comp gap here is interesting.
Weaknesses:
- Many of the results make strong assumptions on the measurement matrices.
- I feel the paper doesn't communicate the main novelty in analysis or algorithm clearly.
Other Comments or Suggestions
None
We thank you for your constructive and positive review.
Thanks also for suggesting the additional reference. We will cite this paper as a relevant lower bound for sparse recovery. Since the input signal is binary in our case, the lower bounding techniques are somewhat different, as we can use information theoretic inequalities directly.
Regarding the strong assumptions on the measurement matrices, we would like to mention that our lower bounds (Sec 2.2 and Thm 2.5) work for any family of measurement matrices. Regarding the upper bound: we can make Thm 2.1 work for a more general class of matrices, at the expense of some clarity. Indeed, the only properties of the Gaussian measurements that we use, conditioned on the other assumptions of the theorem, are centeredness and Stein's lemma. So the theorem holds for other matrices too, albeit with slightly more complicated expressions (see some more details in the response to Reviewer mHu2).
This paper studies the problem of exact recovery of sparse binary vectors from generalized linear measurements and presents a simple linear estimation algorithm with sample complexity bounds, together with nearly matching information-theoretic lower bounds.
During the discussion, the authors addressed all comments raised, to the reviewers' satisfaction. While no experiments were included, reviewers agreed this did not detract from the paper's strong theoretical contributions. The strengths of the paper include its simplicity, clarity, and unification of several sparse recovery settings; this includes a nice insight that no statistical-computational gap exists for 1bCSbinary and logistic regression.
I encourage the authors to incorporate the reviewers' suggestions to improve clarity, especially by fixing notation, defining constants carefully, and including a summary table of results. All in all, this is a solid paper.