Tolerant Algorithms for Learning with Arbitrary Covariate Shift
Reviews and Discussion
This paper studies the problem of PAC learning with covariate shift. In particular, it examines two specific learning frameworks: one is PQ learning, where the learner is allowed to abstain on some test samples but is required to achieve good accuracy on the retained samples; the other is TDS learning, where the learner is allowed to abstain entirely if the test distribution is detected to be far from the training distribution. From a technical point of view, the paper is restricted to cases where the covariate distribution is Gaussian or uniform and the concept class is well approximated by low-degree polynomials (via the so-called "sandwiching" approximators). The main contribution appears to be extending prior results to handle arbitrary fractions of outliers and providing tighter bounds.
Strengths
While I'm not an expert in PAC learning using polynomial regression, this paper appears to provide some interesting technical results. The outlier removal algorithm introduced in Theorem 3.1 seems to be an interesting technical primitive for handling outliers in low-degree polynomial regressions. Although the settings considered in this paper are fairly restrictive, the results seem to provide a valuable step toward understanding learning with abstention in covariate shift.
Weaknesses
The main weakness of the paper, in my opinion, is the presentation. The paper is very hard to read, especially for those who are not immediately familiar with the relevant literature.
I outline the following specific comments:
- The main text as written does not provide much technical information. For example, Theorem 3.1 is supposed to be the main technical ingredient of the paper, but the proof overview provides nearly no information. Due to the limited NeurIPS review period, I did not have time to delve into the technical details provided in the appendix. I suggest the authors provide a much more detailed overview in the main text. It does not necessarily need to include all the technical details, but it should provide enough information on the logical flow. Perhaps the "Comparison with [DKS18]" section can be omitted to save space?
- The theorem statements are sometimes framed very sloppily. For example, the parameter in Theorem 4.1 was never defined (though it appears in the proof); in Lemma B.2, the sentence "we have f(x) = 0 for all x such that..." does not make sense to me, as the subsequent quantifier has nothing to do with x, so it reads as f(x) = 0 for all x. Since there are too many sloppy statements like this, I did not have the energy to check the details carefully.
- The authors claim to provide "efficient learning algorithms," but all the computational complexities scale exponentially with respect to the error parameter. Did I miss something? Can this parameter be set to a constant and then boosted in some way?
- Is there a specific reason why Theorem 3.1 is stated for the second moment? Is it due to Lemma B.1?
For these reasons, although I believe this paper has some interesting technical contributions, it would require substantial revision to be suitable for publication at NeurIPS.
Questions
See above.
Limitations
N/A
We thank the reviewer for their time and effort.
Question 1. We note that, while we provide most of the technical details of our proofs in the appendix, we do provide a high-level explanation of our results in the main part, including Theorem 3.1 (see lines 189–223). In fact, the comparison with [DKS18] partly explains the technical differences between Theorem 3.1 and results on outlier removal from prior work. Given the reviewer’s feedback, however, we will add some further technical details in the main paper.
Question 2. We respectfully disagree with the reviewer’s comment that our statements lack precision. Note that the parameter in question is defined in the preliminaries (lines 146–148). Moreover, in Lemma B.2, the premise is that f(x) = 0 for any x that satisfies a certain property (the one described in line 693, namely that some polynomial with low second moment exceeds a threshold at x), and not for all x. We will clarify these points in the main paper, and we are happy to address any further specific concerns that the reviewer may have.
Question 3. The algorithms we propose are efficient in the dimension of the input, which is a standard goal in learning theory (see, for example, the classical work of [KKMS08]). In the presence of label noise, the exponential dependence on the error parameter cannot be removed, even if one assumes that there is no distribution shift (see [DKPZ21]). Additionally, when there is no label noise, the algorithm of Theorem 4.3 for PQ learning of halfspaces is quasi-polynomial in all parameters and is optimal in the SQ framework (see [KSV24a]).
Question 4. Theorem 3.1 is stated for the second moment for two reasons. First, using the second moment enables the use of spectral methods for the outlier removal procedure and is a common approach for prior results on outlier removal. Second, it is sufficient for our purposes, since the difference of the upper and lower sandwiching approximators is itself a polynomial with low second moment under the target distribution. Subsequently, we can use the outlier removal procedure to find a subset of the input set over which the bound on the squared difference of the sandwiching approximators is preserved.
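To illustrate the second point in simplified notation (suppressing the exact parameters from the paper): if $p_{\mathrm{up}}, p_{\mathrm{down}}$ are degree-$k$ L2-sandwiching approximators for a concept $f$ under the target marginal $\mathcal{D}$, i.e.,

$$p_{\mathrm{down}}(x) \le f(x) \le p_{\mathrm{up}}(x) \ \text{ for all } x, \qquad \mathbb{E}_{x \sim \mathcal{D}}\big[(p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x))^2\big] \le \epsilon,$$

then $q = p_{\mathrm{up}} - p_{\mathrm{down}}$ is a degree-$k$ polynomial whose second moment under $\mathcal{D}$ is at most $\epsilon$, and the outlier removal procedure yields a subset of the test sample over which the empirical average of $q^2$ remains $O(\epsilon)$, up to the multiplicative factor appearing in Theorem 3.1.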
[KKMS08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
[DKPZ21] Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Nikos Zarifis. The optimality of polynomial regression for agnostic learning under Gaussian marginals in the SQ model. In Conference on Learning Theory, pages 1552–1584. PMLR, 2021.
[KSV24a] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and SQ lower bounds. 37th Annual Conference on Learning Theory (COLT), 2024.
I thank the authors for the clarifications. I raise my rating to 6. Since I'm not an expert in this particular field, I will have no objection if the other reviewers lean towards acceptance.
The authors consider the problem of efficiently learning a concept class under distribution shift, i.e. in the setting where the training data and testing data are drawn from different distributions. They study two frameworks: PQ learning and TDS learning. In the former, the learner is allowed to abstain from classifying a part of the test samples. In the latter, the learner is allowed to abstain entirely if the test distribution is 'visibly' different from the training distribution.
The paper has two main contributions. First, in the PQ setting, the authors obtain the first dimensionally-efficient learning algorithms, under the assumption that the training data is nicely distributed (Gaussian or uniform on the cube) and the concept class is simple (intersection of halfspaces or low-complexity formulas). Second, in the TDS setting, under the same assumptions, they provide the first efficient algorithms that tolerate small distribution shift in TV-distance. This generalizes earlier results which only tolerate shift 'in the moments' of the distribution (which is a weaker notion).
The proof of these results consists roughly of two steps. First, the authors adapt/improve an existing spectral technique for outlier-removal to show that the test data can be pruned so that it satisfies a 'low-degree spectral boundedness property', without removing too many samples. Second, they show that this spectral boundedness property suffices to apply low-degree polynomial regression to the PQ/TDS-learning problems (in certain settings). In order to do so, they rely on the notion of 'L2-sandwiching polynomials' of [KSV24b]. The important distinction w.r.t. [KSV24b] is that, there, the authors rely on a 'moment-matching property' (which is stronger than spectral boundedness, and in particular, not always satisfied even if the training and test data are close in TV-distance).
Strengths
- The paper achieves strong and novel results in an area which I believe to be of interest to the NeurIPS community. It falls in a line of research focusing on relaxing (or testing) distributional assumptions in algorithmic learning theory, which has recently received a lot of interest. In particular, I like that the paper goes beyond the 'moment-matching' approach of earlier works.
- I find the combination of ideas from [KSV24b] and outlier removal to obtain the main results very insightful.
- The paper is very well written: the key technical ideas are exposed clearly, and it is easy even for the non-expert to grasp the main concepts. Furthermore, the results and some of their consequences are positioned clearly in the existing literature. Lastly, the technical background (on learning models, previous algorithms etc.) is well presented.
- The work leaves open some directions for future research, and I think it is likely that its ideas will be applied to obtain future results in this area.
Weaknesses
- Arguably, much of the technical 'heavy-lifting' to obtain the results of this paper is done by the 'L2-sandwiching' of [KSV24b], and to a lesser extent the outlier removal technique of [DKS18] (which the authors do significantly improve on).
- The results apply only to very simple concept classes. It would have been nice to have guarantees in more complicated settings as well, e.g. for (low-degree) polynomial threshold functions. Similarly, the distributional assumptions on the training data are quite restrictive (although it should be noted that this is not uncommon in the literature).
Questions
- l62-63: In testable learning, one does not test whether the underlying distribution is actually Gaussian (which is not possible), but rather whether it shares some important characteristics with a Gaussian distribution (e.g. its low-degree moments match). I guess the word 'indirectly' is meant to allude to this distinction, but I think this phrasing could be confusing.
- l111-117: It is not clear to me from this paragraph whether any work on distribution-specific PQ-learning was done before this paper. In particular, I think it would be good to clarify whether 1) this work gives the first efficient distribution-specific PQ algorithm (but inefficient distribution-specific algorithms had been considered), or 2) this work is the first to consider distribution-specific PQ-learning at all.
Limitations
Adequately addressed
We thank the reviewer for their time and appreciation of our work.
- Lines 62-63: We will clarify this point, thank you for pointing it out.
- Lines 111-117: Our work is indeed the first to consider distribution-specific PQ-learning at all. Prior work involved reductions to other learning primitives like agnostic or reliable agnostic learning, but the reductions did not preserve the marginal distributions and, hence, the resulting algorithms were inherently distribution-free.
This work proposes methods for learning under covariate shift. In particular, it studies the PQ learning and TDS learning models. It provides an algorithm based on a filtering technique. Given sample access to two distributions, the algorithm produces a filter that rejects points from the second distribution that attain large values under some degree-k polynomial, while not rejecting many points from the first distribution.
This algorithm is then applied to learning some function classes (halfspaces and classes with low sandwiching degree) in the PQ learning model and in the tolerant TDS learning model. In the former model the authors obtain a computationally efficient algorithm, and in the latter an algorithm that does not reject a distribution even if there is a small but non-negligible shift.
Strengths
- The paper provides novel results in PQ learning and TDS learning. Making progress in these learning models is important, since in practice algorithms need to perform well under covariate shift.
- The core algorithm of the paper is versatile, as the authors show how to apply it to multiple other learning settings (e.g. learning with nasty noise).
- For TDS learning, prior works rejected distributions with even a very small distributional shift. This work provides an algorithm that gives non-trivial results even when the distance between the distributions is larger.
- This work suggests a novel technique for bounding the number of iterations of the algorithm through the use of a potential function.
- The main text of the paper is well-written.
Weaknesses
I find that the main weakness of the paper is the quality of the proofs in the appendix. They are often written inaccurately, and the typos make the arguments time-consuming to follow. Moreover, there seem to be some inaccuracies in the proofs of the core results beyond typos, which I highlight in the 'Questions' section.
Questions
My main question is about Appendix B, which I believe is crucial for the proofs of all the technical theorems. I find this section not carefully written, with inconsistent notation and typos, which makes the arguments hard to follow.
Next, I focus on the parts which were not merely typos:
- I do not follow Equation (B.2). First of all, it should be the square of the Frobenius norm. But even so, the second and third steps are unclear. Could the authors explain these steps in more detail?
- If I understand correctly, Equation (B.3) requires tameness for degree-2k polynomials rather than degree-k polynomials, because the relevant polynomials are of degree 2k.
- Equation (B.4): I do not follow the first inequality; where does the additive term come from? From the preceding bound, it seems that no additive term should appear there.
- Line 922: Can the authors clarify why such a threshold exists? It seems that if I increase it, the value on the left decreases to 0, while the value on the right stays positive.
I will consider changing my score once the authors address these questions.
Limitations
There are no ethical limitations in this work.
We thank the reviewer for their constructive feedback. We will carefully go through our proofs in the appendix to fix all typos for future revisions. Regarding the specific questions of the reviewer:
Question 1. You are correct that Equation (B.2) should have the square of the Frobenius norm on the left side. Thank you for pointing out that typo. This will in turn slightly change the bound in the equation between lines 677 and 678, where we will instead use a correspondingly adjusted inequality; the statement on line 678 remains true for a sufficiently large constant.
We now give the omitted details for Equation (B.2), starting with the second equality. The relevant quantity is zero in one case and 1 otherwise (by the equation starting directly after line 669); combining this with the two accompanying identities yields the stated expression.
We now explain the third equality, after adding the expectation over the random selection of the elements of the sample at the beginning of the second line (this expectation is missing due to a typo). Since the sample consists of i.i.d. draws from the reference distribution, the quantity in question equals the variance of the corresponding empirical average (the randomness being over the choice of the sample). Because the elements of the sample are drawn i.i.d., the variance of their sum equals the sum of their variances. Overall, the variance of the empirical average equals the single-sample variance divided by the sample size, and substituting the sample size completes the step.
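In generic notation (writing $S$ for the i.i.d. sample from the reference distribution $\mathcal{D}$ and $p$ for the polynomial in question, as placeholders rather than the paper's exact symbols), the step reads:

$$\operatorname{Var}_{S}\!\left[\frac{1}{|S|}\sum_{x \in S} p(x)^2\right] \;=\; \frac{1}{|S|^2} \sum_{x \in S} \operatorname{Var}_{x \sim \mathcal{D}}\!\big[p(x)^2\big] \;=\; \frac{1}{|S|}\,\operatorname{Var}_{x \sim \mathcal{D}}\!\big[p(x)^2\big] \;\le\; \frac{1}{|S|}\,\mathbb{E}_{x \sim \mathcal{D}}\!\big[p(x)^4\big].$$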
Question 2. Even though the polynomial in question is of degree 2k, it has the extra property of being a product of two degree-k polynomials. If the product is large, then at least one of the two factors has to be large, and the probability of either of those events can be bounded by appealing to the definition of tameness.
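Concretely (writing $p_1, p_2$ for the two degree-$k$ factors and $t$ for the relevant threshold, as placeholders for the paper's exact parameters):

$$|p_1(x)\,p_2(x)| \ge t \;\Longrightarrow\; |p_1(x)| \ge \sqrt{t} \ \text{ or } \ |p_2(x)| \ge \sqrt{t},$$

so by a union bound $\Pr_x\big[|p_1(x)p_2(x)| \ge t\big] \le \Pr_x\big[|p_1(x)| \ge \sqrt{t}\big] + \Pr_x\big[|p_2(x)| \ge \sqrt{t}\big]$, and each term on the right is controlled by applying the tameness condition to the degree-$k$ polynomials $p_1$ and $p_2$ separately.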
Question 3. We confirm that the additive term is indeed not necessary in the second line of Equation (B.4), and we will correct this typo in the revision. We note, however, that the third line of Equation (B.4) should still have an additive term (adjusting its value entails a corresponding change in the line after line 701 and does not affect the conclusion on line 702). The third line of Equation (B.4) follows from the second line (with the additive term removed) as follows: by the triangle inequality, the second line can be upper-bounded by a sum of terms, and bounding each of these terms individually yields the expression in the third line.
Question 4. Thank you for pointing this out; we have added a short claim proving that a value of the threshold satisfying the condition indeed exists, assuming certain high-probability events take place. The proof is based on a simple averaging argument. In particular, the threshold is defined as a value above which the current set contains unreasonably many points with large polynomial value. If no such threshold existed, then the current set would fail to satisfy line 919, which is a contradiction.
More formally, suppose for the sake of contradiction that the displayed inequality (1) holds for every candidate threshold. Since every element of the current set satisfies the stated property, we also obtain inequality (2). Combining Equations (1) and (2), and using that the current set is assumed to satisfy property (3) in Claim 1 and that the sample is assumed to satisfy Equation (E.1) (see line 929), bounds the quantity of interest by a value that contradicts the premise, finishing the proof.
I thank the authors for the detailed answers to my and the other reviewers' questions, and I will increase my score.
This work considers learning under distribution shift. Here, a learner receives i.i.d. labeled data from a training distribution, along with i.i.d. unlabeled data from a test distribution. It then builds a classifier with the goal of achieving high accuracy over the test distribution. Under arbitrary distribution shift this task is impossible, so in the PQ-learning framework the classifier is allowed to abstain from prediction on points, with a frequency close to the total-variation distance between the two distributions. This work also considers the TDS setting, in which the learner is allowed to completely abstain from prediction if it believes its inputs are drawn from a different data distribution.
The main technical idea of this work is an outlier detection scheme, which (efficiently) computes an outlier detection function meant to flag the points in the test sample that are "far" from the support of the training distribution. Their outlier procedure works by iteratively utilizing a quadratic program to find a polynomial of degree k that strongly distinguishes an outlier set. They then remove points until the outlier set only consists of points where the polynomial solution evaluates to very large values. The overall idea here is that similar distributions will lead to similar polynomial evaluations over them, and consequently the existence of a "separating" polynomial serves as a means of detecting where the distributions differ.
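To make the loop concrete, here is a rough, self-contained sketch (my own reconstruction, not the authors' code) of the iterative filtering procedure as I understand it; the monomial feature map, the use of a generalized eigenvalue problem in place of the paper's exact quadratic program, and all thresholds are placeholder choices:

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.linalg import eigh

def monomial_features(X, degree):
    """Map each row of X to all monomials of total degree <= degree."""
    n, d = X.shape
    feats = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            feats.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(feats)

def remove_outliers(X_test, X_ref, degree=2, ratio_bound=4.0, frac_removed=0.01):
    """Prune X_test until every polynomial of degree <= `degree` has empirical
    second moment over the kept points at most `ratio_bound` times its second
    moment over the reference sample X_ref."""
    Phi_ref = monomial_features(X_ref, degree)
    M_ref = Phi_ref.T @ Phi_ref / len(X_ref)        # reference second-moment matrix
    M_ref += 1e-8 * np.eye(M_ref.shape[0])          # regularize so it is positive definite
    Phi_test = monomial_features(X_test, degree)
    keep = np.ones(len(X_test), dtype=bool)
    while keep.sum() > 0:
        Phi = Phi_test[keep]
        M_emp = Phi.T @ Phi / keep.sum()            # second-moment matrix of kept points
        # The worst-case ratio E_emp[p^2] / E_ref[p^2] over degree-<=k polynomials p
        # is the largest generalized eigenvalue of the pencil (M_emp, M_ref).
        vals, vecs = eigh(M_emp, M_ref)
        if vals[-1] <= ratio_bound:
            break                                   # spectrally bounded: stop filtering
        p = vecs[:, -1]                             # coefficients of the offending polynomial
        scores = np.abs(Phi_test @ p)
        scores[~keep] = -np.inf                     # ignore points that were already removed
        k_remove = max(1, int(frac_removed * keep.sum()))
        keep[np.argsort(scores)[-k_remove:]] = False  # drop the largest |p(x)| values
    return keep                                     # boolean mask of retained test points
```

The actual guarantee is of course more delicate: the paper controls how many "clean" points such a filter can remove and bounds the number of iterations via a potential argument, but this is the basic shape of the procedure.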
This idea is reflected in their first result, which shows that the output of this scheme is a function such that, upon filtering with it, polynomial evaluations over the test and training distributions must be bounded within a factor of each other (depending on the tolerance parameters). They additionally bound the runtime of their algorithm. The only requirement for this theorem to hold is that the training distribution must satisfy regularity conditions based on the degree of polynomials being used, and they subsequently show that these conditions are met by isotropic log-concave distributions.
Utilizing this technical idea, this work proceeds by giving a PQ-learning algorithm for halfspaces over Gaussian training distributions. Despite the limited scope of this case, the work nevertheless provides the first dimension-efficient solution to this problem. It then continues with a more general result for PQ-learning, which gives an efficient learning algorithm under a "reasonableness" condition. Under the same condition, the work concludes by giving results for the related TDS setting.
Strengths
This paper offers a solution to a well-known theoretical problem that achieves the best known bounds. I particularly like that their algorithms are computationally efficient (in addition to enjoying the typical performance guarantees). Their technical idea for outlier detection (which forms the core of the paper) is relatively well explained and seems to me an innovative way to approach outlier detection.
Weaknesses
The latter half of the paper is fairly burdened with algebra and consequently a bit difficult to follow. I would have appreciated a greater focus on intuition with more of the technical details being left to the appendix.
Questions
- Could you give some intuition for why you began your PQ results with Gaussian distributions? The outlier-detection procedure works for a more general set of distributions, so it would be interesting if you could spell out the additional properties of Gaussians that make your first result possible.
- Could you similarly give more intuition about the "reasonableness" condition? The definition feels rather extensive (given the 3 conditions in it), and I consequently feel it could merit a longer discussion. As it is currently written, it feels a bit convoluted to me.
Limitations
Yes.
We wish to thank the anonymous reviewer for their constructive feedback and suggestions.
Gaussian assumption: The reviewer is correct that the outlier removal procedure works under weaker assumptions than Gaussianity. In particular, it will work for any tame distribution (see lines 155–158 for a definition). However, in order to obtain our PQ learning results, we also make use of the existence of low-degree sandwiching approximators. The existence of such approximators has been proven in prior work for the Gaussian distribution as well as for the uniform distribution over the hypercube. Nonetheless, we believe that simple classes like halfspaces and halfspace intersections admit low-degree sandwiching approximators with respect to other distributions as well, and establishing appropriate bounds is an interesting direction for future work. The relevant properties are, in general, both the concentration and the anti-concentration of the Gaussian distribution, which are, for example, also satisfied by strongly log-concave distributions.
Reasonable pairs: Definition 4.5 indeed consists of 3 properties. The first one is the existence of sandwiching approximators, which prior work on TDS learning [KSV24b] has shown to be important for learning in the presence of distribution shift. The second one is the tameness condition, which is important for the outlier removal procedure and was introduced in the work of [DKS18] for similar purposes. The third one ensures generalization for polynomial regression. We will add an appropriate discussion in future revisions.
[DKS18] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061–1073, 2018.
[KSV24b] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. 37th Annual Conference on Learning Theory, 2024.
Thank you for your response. Overall, I'll maintain my current score but increase my confidence.
This paper studies two fundamental learning setups, namely PQ learning and TDS learning (Testable Learning with distribution shift), both motivated by covariate shift.
In PQ learning, the algorithm is given labeled samples from a training distribution over the feature space, and unlabeled samples from some test distribution over the same space. Assuming that there is some common hypothesis that achieves a small error under both distributions, the goal of the learning algorithm is to output a selector and a classifier such that (i) the classifier achieves a small error with respect to the test distribution restricted to the acceptance region of the selector, and (ii) most samples from the test distribution are not rejected by the selector. The main contribution is an algorithm that works when (i) the training marginal satisfies some polynomial concentration properties (including isotropic log-concave distributions) and (ii) the hypothesis class admits sandwiching polynomials under the training marginal. In particular, it achieves a rejection rate and an accuracy guarantee that trade off against each other, and the bounds are optimal when the trade-off parameter is set to balance the rejection rate and the accuracy.
In tolerant TDS learning, the algorithm similarly has labeled sample access to the training distribution and unlabeled sample access to the test distribution. In addition, the algorithm is allowed to reject when it finds that the two marginals are sufficiently far from each other in TV distance. The name "tolerant" comes exactly from the constraint that the algorithm is not allowed to reject when the two marginals are close but not identical to each other. Their algorithm works in a setup similar to the PQ-learning case and achieves a final error depending on the optimal error of some common hypothesis under the training and test distributions (the sum of the two errors), the tolerance in TV distance, and the error parameter.
The core technique is a spectral outlier removal procedure, resembling that of [DKS18], which filters the dataset until the second moment of any low-degree polynomial under the empirical distribution is at most a constant factor of its second moment under the reference distribution (the training marginal). However, the analysis in the current work is tighter, and in particular gives non-trivial guarantees even if the test distribution is very far from the training distribution in total variation distance, which is crucial for the technique's application to PQ learning. After ensuring that the spectrum of the dataset is appropriately bounded, the algorithm runs polynomial regression, and its guarantees mainly follow from the existence of sandwiching polynomials.
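In schematic notation (with $S$ the retained test sample, $D$ the training marginal, $k$ the degree, and $C$ a constant, all written generically), the filtering stops once

$$\max_{\substack{p:\ \deg p \le k \\ p \ne 0}} \ \frac{\tfrac{1}{|S|}\sum_{x \in S} p(x)^2}{\mathbb{E}_{x \sim D}\big[p(x)^2\big]} \ \le\ C,$$

i.e., once no low-degree polynomial can distinguish the kept points from the reference distribution by much in second moment.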
Strengths
The application of spectral outlier removal in the context of distribution shift is natural and turns out to be quite powerful. The most surprising part is that the procedure works even when the TV distance between the test and training distributions is large. A naive application of spectral outlier removal will not work in this case, as the filter may end up removing all points from the dataset (since the standard guarantee only says that the filter removes more "bad" points than "good" points). The analysis of this work circumvents the issue, and characterizes a tradeoff between the rejection rate and the final bound on the norm of polynomials. This makes PQ learning possible even when the TV distance between the training and test distributions approaches its maximum (albeit at the cost of a larger learning error in the end).
Weaknesses
A recent work [CKKSV24] shows that non-tolerant TDS learning can also be accomplished when the hypothesis class has sandwiching polynomials. This makes TDS learning of some important hypothesis classes, such as circuits and quadratic threshold functions, possible. However, the technique of spectral outlier removal in this work seems applicable only when one has L2-sandwiching polynomials, as otherwise one cannot easily replace moment-matching with spectral boundedness.
Questions
In Definition 4.5, should the distribution in "from some distribution ..." be the labeled training distribution instead?
Limitations
Yes, they have.
Thank you for your time and for appreciating our work. In Definition 4.5, the distribution in question represents some distribution over the features, which is unlabeled. We instead use a different symbol to denote the labeled training distribution (see lines 136–137 and also Definition 4.1).
This is a theory submission about the important problem of learning under covariate shift. The authors consider the two recent frameworks of testable learning [KSV24b] and PQ learning [GKKM20], where both frameworks allow abstentions (but in different ways). The goal is to have a computationally efficient learner that does not make a lot of unnecessary abstentions. The authors achieve this for some classes of hypotheses (e.g., intersections of halfspaces) and some distributions (e.g., Gaussians) through the use of a spectral outlier removal procedure. The authors' significant contributions were appreciated by the reviewers. There is room for improvement in terms of the clarity of the proofs, especially in the appendices.