Near-optimal Active Regression of Single-Index Models
We provide a near-optimal query complexity for the active regression of single-index models.
Abstract
Reviews and Discussion
This paper studies the problem of learning single-index models in the active learning setting: given a data matrix and an activation function, the goal is to output a parameter vector, by querying entries of the label vector, whose error is close to that of the best parameter vector. Previous works on this problem give several upper bounds on the query complexity for different values of p, but can only approximate the optimal error within a constant factor. In this work, the authors develop active learning algorithms that achieve a (1+ε)-approximation for different values of p. For part of this range, the query complexity of the algorithms matches the lower bound provided by this paper; for the remaining range, the paper gives an upper bound together with a lower bound.
Strengths
Learning single-index models and active learning are both central problems in learning theory. Compared to active learning for classification problems, active learning for regression problems is much less understood. This paper gives a nearly optimal query complexity for the corresponding regression problem for a wide range of p. The algorithm builds on a sample-and-learn framework similar to those used in previous works, but uses several solid new ideas to analyze and improve the error guarantee and the query complexity. The results and new techniques would be of interest to the ML and theory communities.
Weaknesses
I think one main weakness of the paper is its presentation. The paper is very technical and involves deep mathematical tools, such as Lewis weights, that are frequently used for other problems such as subspace embeddings. As someone who does not work on these problems and is not familiar with these tools, I sometimes found it hard to follow the paper, as the related background is not clearly provided in the main body. Furthermore, I feel that to fully understand the techniques, the paper requires the reader to be familiar with the preceding line of work. For example, in lines 219-226:
we obtain the following concentration bound .... which has been the central tool in the work on subspace embeddings... In short, when the number of queries is m...
These statements are confusing for a reader who is not familiar with the techniques and related literature. It is hard to get an intuition for how the query complexity relates to the preceding error analysis.
- Another weakness I found concerns the computational complexity of the problem. To implement the algorithm in the paper, one has to solve a sub-optimization problem, which I believe has super-polynomial time complexity with respect to the error parameter. In fact, there have been works showing that agnostically learning single-index models is usually computationally hard (see my questions below). Given these hardness results, I am not sure whether pursuing a (1+ε)-approximation, such a strong guarantee, is still well motivated. I think adding some numerical simulations would help show that the algorithm is useful and that a (1+ε)-approximation is still a reasonable goal for some reasonable choice of ε.
Questions
- I would like to suggest the authors improve the presentation and move some related background from the appendix to the main body of the paper.
- There have been papers, such as [DKR23], that show that even if the data is sampled from a Gaussian and the activation function is a ReLU, it is computationally hard to achieve such an approximation guarantee. Can you help explain the relation between these hardness results and the results in this paper? Given the hardness result, why is it still well motivated to design algorithms that learn a (1+ε)-approximate solution?
Diakonikolas, Ilias, Daniel Kane, and Lisheng Ren. "Near-optimal cryptographic hardness of agnostically learning halfspaces and ReLU regression under Gaussian marginals." International Conference on Machine Learning. PMLR, 2023.
We thank the reviewer for the comments. Below we address the reviewer's questions.
- Presentation: We agree with the reviewer that the paper is very technical and that providing background on mathematical tools such as Lewis weights will help readers unfamiliar with these concepts better understand our results. We will move this background from the appendix to the main text with additional explanations to improve the presentation.
In Lines 219-226, the techniques used to derive the bound have been employed in several earlier works related to the active regression problem (such as (Yasuda, 2024)) or to subspace embeddings. Therefore, we intentionally treated them as black boxes, as they are not our main contribution. We will make this clearer in the revised version. The most novel contribution lies in the section titled "Further Improvement", which enables us to achieve a near-optimal dependence on ε in the query complexity.
- Motivation for a (1+ε)-approximation: We would like to clarify that our paper does not consider the agnostic case; the activation function is given. The statement of Theorem 1 says "given ..." and Algorithm 2 lists the activation function as part of the input. Therefore, the hardness result for the agnostic case does not apply here. Even in the agnostic case, where one can only have a PTAS-style guarantee for a (1+ε)-approximation, a smaller query complexity implies solving a smaller-scale regression problem, which is advantageous, particularly when solving such problems is computationally hard.
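To make the point about scale concrete, here is a minimal, purely illustrative sketch (with hypothetical function and parameter names; this is not the exact formulation of Algorithm 2) of the reduced problem one actually solves after sampling: only the sampled rows and queried labels enter the optimization.

```python
import numpy as np
from scipy.optimize import minimize

def solve_subsampled(A_s, y_s, weights, f, reg):
    """Illustrative solver for the reduced problem after sampling (hypothetical
    names, not the paper's exact objective): minimize an importance-weighted
    loss over the m sampled rows plus a simple regularizer."""
    d = A_s.shape[1]
    def loss(x):
        residual = f(A_s @ x) - y_s   # single-index prediction error on sampled rows
        return weights @ np.abs(residual) + reg * (x @ x)
    return minimize(loss, np.zeros(d), method="Nelder-Mead").x
```

The point is that only the m sampled rows (with m much smaller than the total number of data points) enter this optimization, so even a computationally expensive solver only has to handle an m-by-d problem.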
I am a little bit confused about the term "agnostic" in this rebuttal. Agnostic learning usually means that the label does not equal the model's prediction even for the optimal parameter vector. This has no relation to whether the activation function is provided or not. In the hardness paper that I mentioned, the activation is also given; it is just a ReLU function. So I believe the agnostic ReLU learning problem studied in [DKR23] is a special case of the setting studied in this paper, and the rebuttal does not fully address my concern.
We thank the reviewer for the explanation and acknowledge the inaccuracy in our earlier response. Admittedly, we do not have a good answer to the reviewer's question on computational hardness. In our formulation, each row of the data matrix is a data point, so our regression formulation corresponds to a uniform distribution over a finite point set in the context of halfspace/ReLU learning. The hardness result may not hold for a finite support if the support size is not large enough. For a continuous marginal distribution, one can approximate the distribution using a finite support of sufficiently large size and then apply our framework. However, approximating the distribution may require a very large support size, which is why this does not contradict existing hardness results on time complexity.
This paper presents an algorithm that provides an epsilon-approximate solution to the active regression problem. The authors show that their number of query points is minimax optimal.
Strengths
S1 - The paper is generally well written.
S2 - The technical claims in the paper are generally well explained.
S3 - The idea is interesting.
Weaknesses
W1 - Literature review is a bit lacking.
W2 - No numerical experiments.
W3 - The presentation can be improved by first introducing the algorithm. In the current version, the algorithm is somewhat built up during the upper bound proofs.
Questions
Q1 - Following the impossibility result of Gajjar et al., how does your result compare?
Q2 - Why are the success probabilities different in Theorems 1, 2, and 3?
Q3 - In addition to the constant-probability results, do you have high-probability results, where the number of queries is a function of that probability?
We thank the reviewer for the comments. Below we address the reviewer's questions.
- Q1: As explained in the introduction, the impossibility result of Gajjar et al. (2024) states that one cannot hope for a purely multiplicative error guarantee with an algorithm using a polynomial number of queries. Therefore, their error guarantee contains an additive term. Note that their result is for their setting with a constant-factor approximation. For us, this hardness result remains, and our result has a similar error guarantee to that of Gajjar et al. (2024): we obtain a (1+ε)-approximation for general p with an analogous additive term (see Theorem 1). By rescaling ε by at most a constant factor, we can rewrite the error guarantee above in the same form as theirs.
- Q2: Theorem 1 is an upper bound while Theorems 2 and 3 are lower bounds. The success probabilities in these theorems can be boosted to any constant by adjusting the constants (the hidden constants in the big-O and big-Omega notations usually depend on the failure probability). We chose these particular constants to simplify the analyses and will make appropriate adjustments to improve the readability.
- Q3: At present, we do not have a high-probability result, as our primary focus is on achieving near-optimal query complexity. For failure probability δ, the current argument introduces a poly(1/δ) factor in the query complexity, owing to the application of Markov's inequality at the initial step. It may be possible to improve the dependence on δ using techniques similar to those in (Musco et al., 2022), and we leave this as a direction for future work.
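To give a rough sense of why the failure probability enters this way (a generic gloss, not the exact argument in the paper): a single application of Markov's inequality certifies failure probability δ only at a multiplicative cost of 1/δ,
\[
\Pr\big[X \ge \mathbb{E}[X]/\delta\big] \le \delta ,
\]
whereas running $O(\log(1/\delta))$ independent repetitions and keeping the best (or median) outcome would reduce the overhead to a $\log(1/\delta)$ factor, which is the type of improvement alluded to above.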
Regarding the weaknesses,
- W1: There are relatively few papers with provable query complexities for this problem, and, to the best of our knowledge, we have cited all known theoretical works in this line of research. On the other hand, we believe that there are other papers related to our problem and that including them may help readers better understand our problem and active regression in general. We will provide a more comprehensive literature review of the relevant areas in the revised version.
- W2: This submission focuses on the theoretical aspects of query complexity, with a particular emphasis on the ε-dependence of the query complexity. While fewer queries may suffice for certain real-world datasets, we consider this a potential direction for future work.
- W3: The algorithm is presented in Section 3. We agree with the reviewer that moving it earlier in the paper may enhance readability.
Does the main difference to (Gajjar et al., 2024) reside in the generation of the sampling matrix?
The main difference from (Gajjar et al., 2024) is that we achieve a (1+ε)-approximation, which requires a significantly harder analysis, in particular the development of new techniques (described in the section titled "Further Improvement") to optimize the ε-dependence. The sampling matrix is different, though similar: (Gajjar et al., 2024) sample rows using leverage scores, as they consider only p = 2, while we use Lewis weights for general p.
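For readers less familiar with the distinction, the following is a minimal numpy sketch of the standard fixed-point iteration of Cohen and Peng (2015) for approximating ℓ_p Lewis weights; it is illustrative only and is not the exact procedure used in our paper. At p = 2 it reduces to the usual leverage scores.

```python
import numpy as np

def lewis_weights(A, p=1.0, iters=50):
    """Approximate the l_p Lewis weights of the rows of A via the fixed-point
    iteration w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}, which converges
    for p < 4 (Cohen & Peng, 2015). At p = 2 this returns leverage scores."""
    n, d = A.shape
    w = np.ones(n)
    for _ in range(iters):
        # Reweighted Gram matrix A^T diag(w^{1 - 2/p}) A
        scaling = w ** (1.0 - 2.0 / p)
        gram_inv = np.linalg.pinv(A.T @ (scaling[:, None] * A))
        # Quadratic forms a_i^T G^{-1} a_i for every row, then the p/2 power
        quad = np.einsum("ij,jk,ik->i", A, gram_inv, A)
        w = np.maximum(quad, 1e-12) ** (p / 2.0)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 10))
    print(lewis_weights(A, p=1.0)[:5])    # l_1 Lewis weights of the first rows
    print(lewis_weights(A, p=2.0).sum())  # leverage scores sum to rank(A) = 10
```

Rows would then be sampled with probabilities proportional to these weights, and labels are queried only for the sampled rows.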
In summary, this paper studies the single-index (label-efficient) active regression problem, in which the label vector is queried as sparingly as possible. It contributes a (1+ε)-approximation, replacing the constant-factor approximations in the literature, with near-optimal query complexities.
Strengths
Originality: It presents novel findings from theoretical and algorithmic perspectives.
Quality: The approach is well motivated, including being well-placed in the literature.
Clarity: The general approach is observable.
Significance: Regarding potential value and impact, the paper has the goal of better addressing a known application/problem, namely single-index active regression, with new theoretical findings of improved query complexities.
Weaknesses
The submission is not sufficiently clear, with many occasions of ambiguous wording. It seems technically correct but hard to gauge at various times. The experimental analysis is missing, hence not rigorous or reproducible.
The paper does not sufficiently support its claims. A key issue is that the analysis accompanying the theoretical results on query complexities includes many approximations, the significance of which is mostly omitted without necessary explanation, resulting in somewhat lacking scientific rigor.
The paper could bring new knowledge of sufficient value to the community. However, it needs to be revised to better present its soundness.
Questions
Questions:
- Line 148: why is this approximate?
- Line 163: what is the relation between and ?
- Line 172: what is this approximate inequality defined as?
- Line 197: is this inequality exact?
- Line 215: again, what is this inexact inequality?
- Line 220: how do you conclude ?
- Line 299: how do you apply (13) repeatedly, AM-GM inequality? Don't the regularization terms also accumulate?
- Line 397: is the second line an inequality?
- Line 423: how is the last inequality obtained?
Major Suggestions:
- There are too many approximative parts in the analysis, which hinders verifiability. The writing should be improved to ease digestion.
- Line 68: improvement from [2023b] to [2024] is unclearly worded.
- Line 191: error terms need to be expanded and explained more clearly.
- Line 278: need more intuition into (10) and (13).
- Line 461: this multi-dimensional extension could be better explained; it feels a bit rushed.
Minor Suggestions:
- Line 16: V1, V2? It would be better to write these (and similar ones) out explicitly when first introduced, to ease reading.
The following are our responses to the reviewer's questions.
- Line 148: Here we are looking at the relevant norm, and it can be proved that exact equality (i.e., an isometric embedding) is generally impossible. Normally the approximation in subspace embeddings is multiplicative, i.e., one has a distortion parameter and asks that the norm be preserved up to that factor (the standard form of this guarantee is displayed after this list of answers).
- Line 163: In the definition in Equation (5), we use the optimal solution to the regression problem. Here, we are saying that it can actually be replaced with an arbitrary point in the definition, and the argument for the upper bound still works. Indeed, in our detailed analysis we will pick different points, and hence we would like to highlight this flexibility.
- Line 172: The symbol "≲" indicates an inequality up to a constant factor, similar to big-O notation. Specifically, we write a ≲ b if there exists an absolute constant C such that a ≤ C·b. This notation is explained in the Preliminaries section (Section A) of the appendix, and we will move it to the main text to improve the presentation.
- Line 197: This inequality is exact.
- Line 215: This inequality uses "≲", indicating that a constant factor is lost.
- Line 220: Equation (9) shows the stated bound. This fits the definition (with appropriate constants hidden in the notation), and the conclusion follows.
- Line 299: To repeat the process, observe that Line 296 gives a better bound, and so we can define a new, smaller value of the corresponding parameter. Using this new value and repeating the argument, we obtain a better bound, and so on. The regularization term does accumulate, but the accumulated error remains bounded and is absorbed into the final bound. We will make this clearer in the revised version.
- Line 397: We thank the reviewer for pointing out this confusion. This line should be read as an approximate equality, which follows from Line 377. We can make this approximation rigorous as an inequality by applying a Chernoff bound, as detailed in the appendix. Here we substitute the expectation in order to convey the intuition behind our idea.
- Line 423: There is a typo: one symbol should be replaced by another. Recalling the relevant definition, the last inequality then follows.
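Regarding the Line 148 answer above, the standard multiplicative subspace-embedding guarantee has the following generic form (stated here schematically; the exact norm and constants in our paper may differ):
\[
(1-\varepsilon)\,\|Ax\| \;\le\; \|SAx\| \;\le\; (1+\varepsilon)\,\|Ax\| \qquad \text{for all } x \in \mathbb{R}^d ,
\]
where $S$ is the (rescaled) sampling matrix and $\varepsilon$ is the distortion parameter; an exact (isometric) version is generally impossible for a sampling matrix with few rows.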
We thank the reviewer for the detailed comments. We would like to clarify that the main body of the paper presents only an outline of our approach, focusing on the core ideas, while the full rigorous proof is provided in the appendix. Given that rigorous proofs have been included, we do not think the paper is "not rigorous or reproducible" or lacks "scientific rigor".
We would also like to clarify that this submission focuses on the theoretical aspects of query complexity, with a particular emphasis on the ε-dependence of the query complexity. While fewer queries may suffice for certain real-world datasets, we consider this a potential direction for future work.
Below, we address the reviewer's major suggestions:
- Line 68: There is a typo. Gajjar et al. (2023b) achieved a certain query bound, which was later improved by Gajjar et al. (2024).
- Line 191: The main purpose of this part (Lines 187-200) is to explain why we set the regularization parameter the way we do. In this equation, we would like to explain that the regularization parameter should be sufficiently small: as explained in Line 193, the final guarantee contains a term involving it, and hence we should set it small enough regardless of the error term. Hence, the error term plays a minor role in this argument, and we intentionally did not fully expand it. Regularized regression was previously employed in the work of Gajjar et al. (2024), and our submission uses the same regularized regression problem. While we could have omitted this explanation and used the regularized version as our starting point, we chose to include it for completeness.
- Equations (10) and (13): These error bounds are similar to those in previous work, such as that of Yasuda (2024). The most novel part of this submission is the section titled "Further Improvement", which enables us to obtain a near-optimal dependence on ε in the query complexity. Therefore, we intentionally treated these bounds as black boxes, as they are not our main contribution. We will make this clearer in the revised version.
- The multi-dimensional extension (Line 461) is not as difficult as the base case. Once the argument for the base case works out, extending it is more or less a standard technique. Hence, we presented the material in a way that focuses on our main contribution, which is the base case. The full proof of the extension can be found in the appendix. We will polish the paper to improve the presentation.
Can we say that you achieve a (1+ε)-approximation with the query complexity of Gajjar et al. (2024)?
Additionally, regarding (13) and "Further Improvement", is your main contribution showing how the iterative refinement of (13) results in tighter guarantees?
Your work is not experimentally rigorous, or reproducible in general, since no experiments are provided.
Similarly, it seems that many approximations are given without sufficient justification, e.g., how some terms can easily be omitted without issues in the subsequent calculations. This makes the analysis (at least in the main text) lacking in rigor. You could potentially have framed your work in a more digestible manner.
Moreover, the appendix feels convoluted. There are many assumptions, for which the corresponding justifications seem to be missing. Some examples:
- Line 978: How did the error term in Corollary 16 turn into this expression, and where is the corresponding term on the left side? Consequently, it is not clear how (27) is reached.
- Line 1058: Where is this assumption coming from?
- Line 1080: Following here, it is not clear to me how you compute this quantity or how such a choice exists.
Regarding Line 299, you need to carefully explain how the omitted additive and multiplicative terms accumulate with successive iterations. It is hard to see and judge in the current approximative form.
We thank the reviewer for further questions and comments, which we address below.
- The reviewer is correct that we achieve a (1+ε)-approximation with the same query complexity (up to log factors) as in (Gajjar et al., 2024). We would also like to note that (Gajjar et al., 2024) consider only p = 2, while we handle the full range of p. Our analysis for controlling the error is also simpler and more generalizable than theirs, though we acknowledge that similar arguments have appeared in prior work and we do not claim this as our contribution. We also prove the near-optimality of our query complexity for part of the range of p, and the near-optimality of the ε-dependence for the rest.
- The section "Further Improvement" improves the ε-dependence of the query complexity from a more straightforward bound to the near-optimal one.
- Given the length of the full analysis, only the key ideas are presented in the main body. To better emphasize the proof ideas, it is common practice to omit minor terms or use approximations, avoiding an overly complex presentation. Readers with doubts or questions may refer to the detailed proofs for clarification.
- The algorithm is explicitly provided and is straightforward to implement; we do not see issues with its implementation. The challenge lies in the analysis rather than in the algorithm itself. Since (1) the query complexity is rigorously established through proof and shown to be nearly optimal, and (2) the claims on query complexity are derived from theoretical analysis rather than empirical experiments, we do not see concerns regarding reproducibility or the necessity of numerical experiments to validate our claims.
- We do not introduce additional assumptions in the full proof. It is common for lemmas or intermediate results to be stated with specific assumptions, which are then verified when these lemmas are invoked in proofs elsewhere.
- Line 978: We use a shorthand here, since one can assume that ε is not too small (as mentioned in Line 958); otherwise, one can simply query all the entries for such a small ε. Under this assumption, the stated bound holds.
- Line 1058: The assumption comes from Line 865, which is part of the assumptions stated in the statement of Lemma 13.
- Line 1080: There is a typo in this line. Using the definition in Line 1071 together with the bounds mentioned above, one can verify that the claimed choice exists.
- Line 299: We agree with the reviewer that the current presentation may cause a misunderstanding about the hidden constant potentially becoming large (although the detailed proof shows this is not the case). In the revised version, we will include a better description of the proof idea; a schematic illustration of why the repetition does not blow up is given below.
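As a purely schematic illustration, with made-up constants rather than the actual quantities in our proof: if each round of the argument produced a bound of the form
\[
T_{k+1} \;\le\; \tfrac{1}{2}\,T_k + E ,
\]
then unrolling would give $T_k \le 2^{-k} T_0 + 2E$, so after $k = O(\log(T_0/E))$ rounds the bound stabilizes at $O(E)$; the additive terms accumulate only to a constant multiple of $E$ rather than growing with the number of rounds. Our actual recursion is more involved, but the mechanism preventing blow-up is of this flavor.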
Thank you for the reply. Although I am not fully convinced by the state of the paper, I am raising my score after considering your rebuttals.
The paper considers the problem of active regression, where the objective is to solve a single-index minimization problem in which the link function is an unknown Lipschitz function and the design matrix is known. The learner aims to solve the minimization problem by querying only a fraction of the entries of the label vector. The problem has been well studied in special cases; in the general case, existing results only obtain a constant-factor approximation. In this work, the authors obtain a (1+ε)-approximate solution in the general setting with a new algorithm and analysis.
Strengths
I think this is an interesting paper overall. While the problem appears to be simple, there is quite a bit of machinery involved in establishing the results and designing the algorithm. I believe there are several novel aspects in the algorithm design and analysis that might be useful in general for future studies in this direction.
Weaknesses
I don't see any glaring weakness in the paper.
Questions
Can the authors elaborate on why the techniques in Yasuda (2024) do not carry forward to this case? As far as I understand, the hard instance only needs to be updated with the appropriate replacement, and the analysis should then work out. I am trying to understand what does not work out with this choice.
We thank the reviewer for the comments. Below we explain the difficulty in extending the lower bound from the one-dimensional construction to general dimension in our setting:
We first briefly summarize the constructions in (Yasuda, 2024) and in our paper for the one-dimensional case. In (Yasuda, 2024), the two hard instances yield two different pairs of values; the same is true of the two hard instances in our construction. The key difference is that Yasuda's construction is asymmetric (one of the values is trivial), whereas our construction is symmetric.
When extending the result to general dimension, Yasuda (2024) constructs a large set of unit vectors in which any two vectors are almost orthogonal; the existence of such a set is shown in previous work. The asymmetry can then be exploited as follows: duplicate the first hard instance in each of these directions, then choose one of these copies and flip it to the second hard instance with some probability. The optimal value then differs depending on whether the chosen copy is flipped, up to a small error. This small error is negligible because the other directions are almost orthogonal to the chosen direction and because of the value taken by the first hard instance.
Unfortunately, our construction is symmetric and neither value can play this role. To extend the lower bound to general dimension using the set defined above, we cannot follow the same approach, because there is no "base" instance analogous to the first hard instance in Yasuda's construction. This prevents us from setting all copies to be the "base" instance with a random flip on one of them; the small error mentioned above is then no longer negligible.
Regarding the possibility of an asymmetric construction in our case, we found that the symmetry is crucial for obtaining our improved one-dimensional lower bound. Therefore, we encounter this dilemma when extending the lower bound to general dimension.
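As a quick numerical illustration of the almost-orthogonal set used above (our own sketch, not taken from the paper or from (Yasuda, 2024)): random unit vectors in d dimensions have pairwise inner products on the order of sqrt(log N / d), and this scaling is what allows exponentially large almost-orthogonal sets to exist.

```python
import numpy as np

# Illustration (not from the paper): N random unit vectors in R^d are pairwise
# nearly orthogonal, with max |<u_i, u_j>| on the order of sqrt(log(N) / d).
rng = np.random.default_rng(1)
d, N = 200, 2000
U = rng.standard_normal((N, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
G = np.abs(U @ U.T)
np.fill_diagonal(G, 0.0)
print("max off-diagonal |inner product|:", G.max())
print("reference scale sqrt(log N / d): ", np.sqrt(np.log(N) / d))
```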
I see, the issue is a bit more subtle than I previously thought. You might have thought about this, but does taking a scaled version of the problem help here? In particular, after flipping, you set the hard instance to be a scaled version of the current one. The optimal value will then change without affecting the other quantities.
Thanks for the question.
If we scale the flipped copy only, it seems that we would need only a small number of queries (relative to the number of copies) to identify the chosen copy. This is because, in our construction, we plant the large value in an all-ones vector, and hence scaling the entire copy makes it immediately distinguishable without even considering the planted entry. In this case, the scaling may not be very helpful.
To complement our response, we would like to further point out that the reason the small error mentioned previously is negligible also comes from the structure of the first instance: if we pick one of the vectors in the almost-orthogonal set mentioned before, then part of the error also comes from its inner products with the other vectors in the set. By the almost-orthogonality property, this contribution to the error is small. Hence, in Yasuda's construction, both of these properties are crucial to the argument.
The paper considers the following active learning problem: find an (approximate) solution that minimizes the regression objective by making queries to the entries of the label vector. The paper obtains an approximation for this problem using a certain number of queries, for which some lower bounds are also provided. The approximation is somewhat unusual in that the additive term is not simply the natural one, but involves the min-norm optimizer. The inverse link function is assumed to be Lipschitz. The problem is interesting, though the results for single-index models are not as clean as in the case of linear regression. The techniques appear to be relatively standard.
Additional Comments on Reviewer Discussion
The two most negative reviewers didn't provide very useful comments as part of their reviews or in subsequent discussion. They also declared low confidence in their own reviews. As a result, I've discounted them substantially, pushing this paper into accept range. However, I wouldn't object if it were rejected.
Accept (Poster)