PaperHub
6.1 / 10
Poster · 4 reviewers (ratings: 3, 5, 4, 1; min 1, max 5, std. dev. 1.5)
ICML 2025

PAC Learning with Improvements

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We show that the learnability of agents that can improve is not characterized by the VC dimension, analyze the sample complexity of learning several fundamental concept classes in this setting and provide an empirical study on real datasets.

Abstract

One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes at least $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have the ability to improve by a small amount if doing so will allow them to receive a (desired) positive classification. In that case, we may actually be able to achieve zero error by just being "close enough". For example, imagine a hiring test used to measure an agent's skill at some job such that for some threshold $\theta$, agents who score above $\theta$ will be successful and those who score below $\theta$ will not (i.e., learning a threshold on the line). Suppose also that by putting in effort, agents can improve their skill level by some small amount $r$. In that case, if we learn an approximation $\hat{\theta}$ of $\theta$ such that $\theta \leq \hat{\theta} \leq \theta + r$ and use it for hiring, we can actually achieve error zero, in the sense that (a) any agent classified as positive is truly qualified, and (b) any agent who truly is qualified can be classified as positive by putting in effort. Thus, the ability for agents to improve has the potential to allow for a goal one could not hope to achieve in standard models, namely zero error. In this paper, we explore this phenomenon more broadly, giving general results and examining under what conditions the ability of agents to improve can allow for a reduction in the sample complexity of learning, or alternatively, can make learning harder. We also examine both theoretically and empirically what kinds of improvement-aware algorithms can take into account agents who have the ability to improve to a limited extent when it is in their interest to do so.
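To make the threshold example concrete, here is a minimal numerical sketch; the specific numbers and the uniform skill distribution are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.6       # true qualification threshold (illustrative)
r = 0.05          # how much an agent can improve by putting in effort (illustrative)
theta_hat = 0.63  # any learned threshold with theta <= theta_hat <= theta + r

x = rng.uniform(0, 1, 100_000)   # agents' true skill levels
qualified = x >= theta           # truly qualified before any effort

# Agents below the deployed threshold improve by up to r if that earns them a positive label.
x_after = np.where((x < theta_hat) & (x + r >= theta_hat), x + r, x)
positive = x_after >= theta_hat

err_a = np.mean(positive & (x_after < theta))  # (a) someone classified positive is not truly qualified
err_b = np.mean(qualified & ~positive)         # (b) a truly qualified agent cannot reach a positive label
print(err_a + err_b)                           # 0.0
```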
Keywords
agents improvement, PAC learning, sample complexity, learning from graphs

Reviews and Discussion

Review (Rating: 3)

The paper studies a scenario in which "agents" adapt in response to the classifier deployed by the learner. Within the PAC framework, the authors derive an upper bound for certain standard hypothesis classes.

Questions for the Authors

I am unsure whether the loss function accurately reflects the scenario the authors aim to capture. Given a deployed hypothesis $h$, the loss function is defined as

$$\ell(x, h, f^{\star}) = \max_{z \in \Delta_h(x)} \mathbb{I}[h(z) \neq f^{\star}(z)].$$

Consider a case where $h(x) = 0 = f^{\star}(x)$, and the improvement set is $\Delta_h(x) = \{z_1, z_2\}$, with $h(z_1) = h(z_2) = 1$. Suppose $f^{\star}(z_1) = 1$ but $f^{\star}(z_2) = 0$.

Here, the hypothesis $h$ correctly classifies the base point $x$, and there exists at least one point in the improvement set, $z_1$, where the agent can move and still be classified correctly. However, the loss is

$$\ell(x, h, f^{\star}) = \max\left\{\mathbb{I}[h(z_1) \neq f^{\star}(z_1)],\ \mathbb{I}[h(z_2) \neq f^{\star}(z_2)]\right\} = \max\{0, 1\} = 1.$$

So, my question is: why should the hypothesis $h$ be penalized at point $x$ despite being correct there and offering at least one valid improvement option? This appears to be a significant weakness in the formalization of the problem.
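A small sketch of the computation above; the function and variable names are ours, and the empty-set fallback is an assumption, but the max itself follows the quoted loss definition.

```python
def improvement_loss(x, h, f_star, delta_h):
    """Worst case over the points the agent at x may move to under h."""
    reachable = delta_h(x)
    if not reachable:                       # assumption: with nowhere to move, fall back to 0-1 loss at x
        return int(h(x) != f_star(x))
    return max(int(h(z) != f_star(z)) for z in reachable)

# The reviewer's example: h is correct at x and z1 is a valid improvement,
# but the invalid option z2 drives the max to 1.
h       = {"x": 0, "z1": 1, "z2": 1}.get
f_star  = {"x": 0, "z1": 1, "z2": 0}.get
delta_h = lambda point: ["z1", "z2"]        # both z1 and z2 are classified positive by h

print(improvement_loss("x", h, f_star, delta_h))  # prints 1
```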

That said, I may be misunderstanding the intended reasoning, and I am open to engaging with the authors during the rebuttal process to clarify this issue. If a justification is provided, I am willing to reconsider my assessment. However, in its current form, I cannot recommend acceptance.

Claims and Evidence

Yes, all the theoretical claims are justified by either proof or a sketch of it.

Methods and Evaluation Criteria

Experiments were conducted on three real-world datasets: Adult UCI, OULAD, and Law School. While these datasets do not precisely reflect a scenario in which agents actively improve based on the deployed model, they broadly capture the underlying concept, serving as a proof of concept.

Theoretical Claims

I verified the details of the examples and proofs included in the paper's main text.

Experimental Design and Analysis

The experimental setup looks good.

Supplementary Material

I did not check the supplementary material in detail.

Relation to Prior Literature

This problem is closely related to research in fairness and performative prediction. Additionally, the loss function formulation suggests a connection to the literature on adversarial robustness.

Missing Important References

The authors should provide a more thorough comparison with the adversarial robustness literature. In adversarial robustness, the robustness set is predefined and independent of the deployed hypothesis, whereas in this setting, it depends on the hypothesis itself. While these two problems are not always equivalent, their equivalence depends on how the robustness set is defined. A well-known result by [MHS2019] shows that the VC dimension is generally insufficient for proper learnability but sufficient for improper learnability. Given the similar findings in this paper, there may be a deeper connection worth exploring.

To that note, is the closure algorithm always a valid PAC learner, although it becomes improper when the hypothesis class is not intersection-closed? If so, this would lead to a conclusion similar to [MHS2019] in the context of this paper: PAC learning with improvements is possible for all VC classes, but only in an improper manner.

  1. Montasser, Omar, Steve Hanneke, and Nathan Srebro. "VC classes are adversarially robustly learnable, but only improperly." Conference on Learning Theory. PMLR, 2019.

Other Strengths and Weaknesses

The problem is interesting and well-motivated in the context of modern AI development.

However, it is unclear whether the loss function accurately represents the intended setting. See the questions below for further clarification.

Other Comments or Suggestions

N/A

Author Response

“why should the hypothesis $h$ be penalized at point $x$ despite being correct there and offering at least one valid improvement option?”

Thank you for your question.
Let's consider a motivating scenario: suppose that in order to do well in your research group, an applicant needs to (a) know basic concepts from probability and (b) be able to apply them. A test $h$ that only tested knowledge of definitions would not be a good test, because an applicant $x$ might then only study the definitions and move to point $z_2$, rather than studying the definitions and practicing solving problems with them, moving to point $z_1$. In general, our definition assumes that if agents are able to move to a point $z$ where $h(z)=1$ but $f^*(z)=0$, then they will do so, under the theory that if you're not sure, it's better to assume the worst. Note that this makes our positive results stronger than assuming any specific tie-breaking behavior.

Another perspective on this modeling is as follows. Imagine that we want to grant a loan to an agent. If $h$ decides that the agent can improve to qualify for the loan, then from the agent's perspective, they will indeed receive the loan (as they are only affected by $h$'s decision). However, it is possible that the policymaker ($h$) makes a mistake: this agent may not actually be able to improve with respect to $f^*$. Our loss function is the most restrictive for the policymaker, as it encourages $h$ to make conservative decisions, but from the agent's perspective, "what they see is what they get."

Furthermore, we would like to point out that prior work on strategic classification (where agents move to deceive the classifier instead of genuinely improving) also makes the same assumption that an agent moves to the positive point that maximizes the learner’s loss whenever it can do so (cf. Sundaram et al., 2023: “When there are multiple best responses, we assume the data point may choose any best response and thus will adopt the standard worst-case analysis.”). We agree that a specific tie-breaking behavior could capture different modeling aspects of the problem.

“…is the closure algorithm always a valid PAC learner, although it becomes improper when the hypothesis class is not intersection-closed?”

This is a great question. The answer is no. This is implied by Example 2 (the hypothesis class is indeed not intersection-closed here), where the VC dimension is finite, but our argument rules out the existence of any PAC learner (proper or improper, including the closure algorithm) for the problem with access to finitely many samples.

“A more thorough comparison with the adversarial robustness literature.”

Thank you for the helpful suggestion. We will make sure to include the distinction between our setting and the adversarial robustness literature, along with the following points of comparison:

  • Finite VC dimension is not sufficient for PAC learnability with improvements (Example 2), as opposed to adversarial robustness, where it is sufficient for (improper) learning. This is a major difference between the models. On the other hand, infinite VC dimension does not rule out learnability in either model (Example 1 and MHS2019).
  • In adversarial robustness, the learner's error can only increase compared to the case where there is no adversary. In PAC learning with improvements, this is not necessarily the case, and we present several examples where the error is smaller than in standard PAC learning.
  • Modeling differences: While in adversarial robustness the roles of the labels 0/1 are symmetric (the adversary simply tries to force a misclassification), in our setting there is an asymmetry between the labels: the agent won’t even react if it is classified as 1. Another point of distinction is that, as you mentioned, the robustness set is predefined and independent of the deployed hypothesis, whereas in our setting the reaction set is a function of the hypothesis itself (a code sketch contrasting the two losses follows this list).
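To make the first and third contrasts concrete, here is a minimal sketch; the function names and the stay-put conventions are our own assumptions rather than the paper's exact definitions.

```python
# Adversarial-robust 0/1 loss: the perturbation set U(x) is fixed in advance, the point is
# perturbed regardless of how h currently labels it, and it keeps its original label f_star(x).
def adversarial_loss(x, h, f_star, U):
    return max(int(h(z) != f_star(x)) for z in U(x))

# Improvement-aware loss: only a negatively classified agent reacts, it can only move to
# points that h labels positively, and the ground truth is evaluated at the new point,
# since the improvement is genuine.
def improvement_loss(x, h, f_star, Delta):
    reachable = [z for z in Delta(x) if h(z) == 1]
    if h(x) == 1 or not reachable:  # positive agents, or agents with nowhere to go, stay put
        return int(h(x) != f_star(x))
    return max(int(h(z) != f_star(z)) for z in reachable)
```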

In general, we are very familiar with the techniques in the adversarial robustness literature and have even tried to employ some of them, but it seems that the two models behave quite differently due to the above differences. For example, our risk-averse algorithm is very different from the robust-ERM approaches in the adversarial literature [Cullina et al., 2018] and sample compression scheme algorithms [Montasser et al. 2019, Attias et al. 2023].
Our results indicate that the properties of H that govern learnability (or the characterizing dimension) should be different for the two models. However, due to analogies between the frameworks, it is meaningful to ask future research questions along the lines of the directions pursued in the adversarial robustness literature e.g. extensions to unknown or uncertain improvement sets (cf. [Montasser et al. 2021, Lechner et al. 2023]). Any suggestions are welcome.

We will add the above discussion in our revision. We are happy to answer and clarify any further questions.

Reviewer Comment

I thank the authors for providing thoughtful examples that illustrate cases where the loss function aligns with the intended setting. However, I still feel that these examples seem somewhat retrofitted to the specific loss studied in the paper. If the goal was truly to capture the scenarios outlined in the response, then wouldn’t that suggest the abstract and introduction are somewhat misleading? As written, they give the impression that the paper tackles a setting strictly easier than standard PAC learning due to the possibility of post-deployment improvements. That said, this discrepancy between how the paper is pitched in the introduction and how it is formalized is just my personal opinion, and I don't believe it should influence the overall judgment of the work. Since I did not find any technical errors, I am happy to recommend a weak acceptance.

Thank you also for the clarification regarding the closure algorithm. I appreciate the comparison to the adversarial robustness literature as well. Given the structural similarity of the loss function and the apparent intent to build conservativeness into decision-making, I believe a brief discussion of adversarial robustness formulations would be helpful to a reader.

Author Comment

We thank the reviewer for their careful consideration of our rebuttal and their re-evaluation of our work. We also appreciate the thoughtful follow-up comment and would like to offer the following point of clarification.

We note that it is not the goal of the paper to tackle a setting that is strictly easier than standard PAC. Rather, we highlight our ability to achieve zero error for a variety of standard concept classes as a remarkable property of learnability that emerges when learning with agents that can improve. Zero error is not possible for non-trivial concept classes in any previously studied models we are aware of, including the standard PAC, strategic PAC, and adversarial PAC frameworks. Importantly, in our setting, we count an error whenever the learned hypothesis $h$ allows for the possibility that a point could improve incorrectly (i.e., $h(x') = 1$ while $f^*(x') = 0$). From this perspective, our achievement of zero error is even more striking, as it holds under a stricter notion of error than in the alternative “optimistic” model proposed by the reviewer, in which an error corresponds only to the case where all possible improvements are invalid.

Finally, we note the phrase “…or alternatively, can make learning harder…” in the abstract, which implies that learning in our model can sometimes be more difficult than in the standard PAC setting. See Example 2, where the goal is to learn a union of two intervals. Although the VC dimension is 4, we show that there exists an improvement function $\Delta$ for which any learner (proper or improper) suffers a constant excess risk. We did not intend to mislead and will make this point clearer in the revision by including the above discussion in the final version. We are open to any suggestions for improving the presentation.

Review (Rating: 5)

This paper continues the line of work on strategic classification and related settings. The authors consider a binary PAC learning setting where negatively predicted points $x$ can change their prediction slightly (defined by some set $\Delta(x)$); the expected loss is then evaluated on these potentially altered predictions. Various results are proven:

  1. Learnability is here not captured by the VC dimension.
  2. state a distinction for the case when the target $f\notin\mathcal{H}$
  3. learning halfspaces (with the natural definition of $\Delta$)
  4. learning intersection-closed spaces $\mathcal{H}$ with arbitrary $\Delta$, relying on the Closure algorithm
  5. learning with finite domains where $\Delta$ is given by a graph. Here they achieve near-tight sample complexity bounds to get 0 loss, which is not possible in standard PAC.

Finally, some experiments on smaller datasets are provided (UCI etc.).

update after rebuttal

Changed score to 5 to emphasize the overall quality of this work and the fact that the authors clearly addressed my concerns and those of the other reviewers.

Questions for the Authors

  1. The threshold/halfspace examples have vague similarities with learning with margin (or perhaps a one-sided version of it). Is there any relationship between the setting here and learning margin halfspaces or partial concept classes in general?
  2. Is there some natural (sufficient) requirement on $\Delta$ (and/or $H$ and the used learning algorithm as in Thm 4.7) such that finite VC is sufficient for learning here? Perhaps some sort of monotonicity or smoothness?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I checked the proofs, they are correct.

Experimental Design and Analysis

Solid experimental evaluation with simple UCI datasets and related.

Supplementary Material

Checked the proofs, all correct.

Relation to Prior Literature

Solid discussion of related work. Perhaps mention that the graph case you consider is very related to the manipulation graph in previous papers (Cohen et al., Lechner & Urner, ...).

Missing Important References

None.

Other Strengths and Weaknesses

Interesting results continuing an important line of work. Clearly fits ICML and is very well written. I vote for acceptance.

One small weakness: The different considered problems (halfspaces, intersection-closed, graph case) are somewhat scattered and disconnected; there seems to be no clear overarching general picture or theory (which, however, makes sense, as for general $\Delta$ the situation is rather complex).

Other Comments or Suggestions

Not important comment: Note that what you call "non-realizable" (the fact that $f^*\notin \mathcal{H}$), while still assuming there is an $h$ that has 0 true loss, is typically still referred to as "realizable" in the literature (see, e.g., Def 2.1 in the "Understanding Machine Learning" book (2014)).

Author Response

We thank the reviewer for their constructive comments and very interesting questions which inspire directions for promising future research.

“The threshold/halfspace examples have vague similarities with learning with margin (or perhaps a one-sided version of it). Is there any relationship between the setting here and learning margin halfspaces or partial concept classes in general?”

Thank you for the question. Our results indicate that risk-averse learners work well across different concept classes. Specifically, a learner that outputs the positive agreement region $f_+$ of the labeled sample, i.e. the function that labels positively the intersection of the positive regions of the classifiers consistent with the data, achieves low error in our setting. For learning halfspaces, this function may not be a halfspace, and if we are restricted to proper learning, the learner may output a halfspace $h_+$ with sufficient margin to positively classify only a subset of the positive agreement region. This would involve sacrificing some accuracy, unless the data is separable by a large-margin classifier (margin large enough relative to the agent's ability to move) such that there is no probability mass between $f_+$ and $h_+$. In other words, when learning on separable data with a sufficiently large margin, our setting would coincide with the standard setting, but not in general (there may be some loss in accuracy).
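As a concrete illustration, a minimal sketch of the positive agreement region for one-dimensional thresholds; the names are ours and this is only the threshold special case, not the paper's general construction.

```python
import numpy as np

def positive_agreement_threshold(xs, ys):
    """Risk-averse threshold learner: predict positive only where every threshold
    consistent with the labeled sample agrees on a positive label. For thresholds
    on the line, that region is simply x >= (smallest positive example)."""
    positives = xs[ys == 1]
    if positives.size == 0:
        return lambda x: np.zeros_like(x, dtype=int)  # no positive evidence: predict all-negative
    theta_hat = positives.min()
    return lambda x: (x >= theta_hat).astype(int)

xs = np.array([0.2, 0.4, 0.55, 0.7, 0.9])
ys = np.array([0,   0,   0,    1,   1])
f_plus = positive_agreement_threshold(xs, ys)
print(f_plus(np.array([0.6, 0.7, 0.95])))  # [0 1 1]
```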

“Is there some natural (sufficient) requirement on $\Delta$ (and/or $H$ and the used learning algorithm as in Theorem 4.7) such that finite VC is sufficient for learning here? Perhaps some sort of monotonicity or smoothness?”

The intersection-closed property is an example of a sufficient requirement on $H$ such that, for an arbitrary $\Delta$, finite VC dimension is sufficient for learning. There might be weaker properties that are sufficient for some restricted classes of $\Delta$ such that the above property holds. As an extreme case on the other hand, if we make no other assumption on $H$, then $\Delta(x)=\{x\}$ is an example where a finite VC dimension is sufficient, as this reduces to the standard PAC learning setting. Examining intermediate cases, involving some combination of sufficient natural assumptions on $\Delta$ and $H$, is an interesting and natural question for further research.

Not important comment: Note that what you call "non-realizable"...

We agree that the terminology in this context might be confusing and that it could be improved. Thanks for pointing it out. We changed this paragraph to the following: “As another example of the distinction between our model and the standard PAC setting, consider the following example. We construct a distribution D that is realizable by the class H, yet the target function $f^*$ does not belong to H. That is, there exists h in H with zero error on D, but h differs from the ground truth $f^*$. However, in our PAC learning model with improvement, the best function h in H must incur an error of 1/2 for the same ground truth function $f^*$.” We will also change the terminology in the proof (Example 3 in the appendix).

On the different problems considered and the overarching picture

As the reviewer notes, the overall picture is quite complex, and our problems capture different aspects of it. For proper learnability (when the learner must output a hypothesis from a fixed set H), the intersection-closed property is sufficient and essentially necessary. Our halfspaces example shows that if the intersection-closed property is not satisfied, there is still hope to learn using improper learners. The graph model captures that if we can reasonably express (say, by some discretization) the instance space X and the improvement function $\Delta$ using a graph, then for any H we can achieve learning with zero error given enough samples (that depend on the graph and the target concept $f^*$).
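For intuition on the graph model, a toy sketch with made-up data; we read $\Delta(x)$ as the out-neighbors of x in the graph and assume positively classified agents stay put, which is one natural reading rather than the paper's exact formulation.

```python
# Vertices are instances; an edge u -> v means u can genuinely improve to v.
graph  = {"a": ["b"], "b": [], "c": ["b", "d"], "d": []}
f_star = {"a": 0, "b": 1, "c": 0, "d": 0}   # ground-truth labels
h      = {"a": 0, "b": 1, "c": 0, "d": 1}   # learned labels

def vertex_loss(x):
    # A negatively classified agent moves to a positively classified neighbor if one
    # exists; the loss is the worst case over the points it might end up at.
    if h[x] == 1:
        return int(h[x] != f_star[x])
    reachable = [v for v in graph[x] if h[v] == 1]
    if not reachable:
        return int(h[x] != f_star[x])
    return max(int(h[v] != f_star[v]) for v in reachable)

print({x: vertex_loss(x) for x in graph})
# {'a': 0, 'b': 0, 'c': 1, 'd': 1} -- c might move to the wrongly-positive d, and d is a false positive
```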

We will incorporate the above discussions and clarifications in our revised version. Additionally, we will add a remark that our graph model is related to manipulation graphs previously studied under strategic classification; the difference being that here the edges correspond to potential true improvements instead of “admissible feature manipulations”. We would be happy to answer and clarify any further questions.

Review (Rating: 4)

This paper introduces a novel framework called PAC learning with improvements. The manuscript outlines the setting as one in which data points correspond to individuals that can improve themselves. A key feature of this framework is that the improvement is real and, as such, a classifier can be correct on a point that would have been misclassified before the improvement is made. For each data point there exists a set of improvement points each individual can transition to after observing the classifier h. In this setting, the paper provides the following results: a separation from classical PAC learning, a proof of learnability of classes that are closed under intersection, and finally learnability under a graph model. Finally, the paper provides an empirical study of the ideas developed in the manuscript.

Questions for the Authors

Q1. In the experiments, what is the relationship between false positive weight and imbalance in the corresponding datasets?

Q2. Can you provide an intuition for the behavior when $w_{FP}$ is increased? Why is the error consistently larger across all datasets?

Q3: What does a small improvement budget in Section 6 correspond to? What does $r=2$ mean? At which point are all $x$ able to obtain a positive label?

Q4 We know that neural networks learn low-dimensional representations, often explainable in 2 dimensions in which data points become separable. Are you aware of any work relating the intersection-closed property to encoders/embeddings of trained neural networks?

Claims and Evidence

The first claimed contribution of the paper is a novel framework for PAC learning under improvement which is given in section 2. The second claim is that there is a separation between PAC learning and PAC learning with improvement which is provided in Theorem 3.1. The third claim is that intersection-closed classes are learnable. Evidence is provided in Theorem 4.7. The learnability of the graph-based model is proven in Theorem 5.2. Finally, in the experimental section it is claimed that among risk-averse strategies, loss-based risk aversion proves most useful in the learning with improvement setting. Furthermore, it is shown that only small budgets of improvement are sufficient to increase performance and that the dataset characteristics influence the required level of risk aversion. Support for these claims is provided in Figure 2.

All theorems are appropriate for the respective claims, and the experimental section’s conclusions are accurate but broadly stated. I think it would be good to explicitly tie the language to the selected datasets, since generalizing claims about parameters is difficult. The sentence in line 436L makes it sound like the findings are general, but in line 410R the text specifically talks about diminishing gains under a specific $r$. The $r$ under which benefits diminish is highly data dependent, as we can see in the Appendix. One could point out that in all cases, after $r$ reaches a certain threshold, performance gains diminish.

Methods and Evaluation Criteria

The experimental section 6 uses three tabular datasets as well as one synthetic 8-dimensional dataset. A true labeling function is obtained via a zero-error model trained on each dataset. The datasets all represent social situations in which humans are the data points and in theory these points could improve. This seems like a sensible choice for an evaluation. Given that the claims are about 0 model-error, evaluating model error is also the correct metric. Plotting error against budget is a sensible choice. Overall, the experiments provide some nice intuition about the risk-averseness of a model and learning with improvement.

Theoretical Claims

I was able to follow all proofs in the paper and did not find anything immediately wrong with them. I did not carefully check additional proofs in the appendix. However, the results in section 4 seem quite intuitive.

Experimental Design and Analysis

I did verify the experimental design. One thing that I am missing in the text is a direct relation between the theoretical results that were demonstrated and the empirical design. This is mostly an issue of exposition, I believe. But I do think that several interesting questions arise from this connection; see, e.g., Q4.

Another thing that is missing is a measure of variance. It is not immediately clear if the results are statistically significant. However, given that the results are consistent across multiple datasets, it is likely that the interpretation of these results is valid.

The experimental design is interesting as it outlines a general approach to improvement, but the improvement function is difficult to interpret. I think adding more intuition as to how the improvement functions impact these results would have been nice. One thing that is not clear is at which point classification becomes trivial, i.e., when an all-1 classifier gets 0 error. I think it would be good to address this in the main text. See Q3 below.

Supplementary Material

I took a closer look at section F to understand the experimental setup better.

Relation to Prior Literature

Overall, I think this discussion is decent. However, especially in cases where frameworks are very similar to each other, I believe it is useful to have a clear formal description of how the proposed framework differs. In this case, it might be useful to relate this more clearly to existing work such as strategic classification.

Missing Important References

I’m not sufficiently familiar with the literature to make a statement here.

Other Strengths and Weaknesses

The paper is very well written and has a clear structure. The manuscript does a good job of providing intuition with examples before stating full proofs which makes it easy to read and makes it easy to extract the key properties of the proposed framework.

Other Comments or Suggestions

It may be worthwhile to consider making the title a little more explicit. When I first saw the title, I thought the paper was going to provide improved PAC bounds, not that there would be a new PAC framework.

Author Response

We thank the reviewer for appreciating our work, and for the insightful questions and suggestions.

Relationship between FP weight and imbalance

At this moment, we cannot confirm a specific relationship. However, the choice of false positive and false negative weights required to train a positively reliable classifier is affected more by class separability than by class imbalance. Specifically, when a dataset exhibits strong class separability, we may not need to impose an aggressive false positive-to-false negative weight ratio to achieve a positively reliable (risk-averse) classifier, as discussed in our paper.

Intuition on the behavior when $w_{FP}$ is increased and why the error is consistently larger

Increasing the ratio $w_{FP}/w_{FN}$ (e.g., $4.0/0.001$) causes the model to penalize false positive mistakes more heavily, prioritizing their reduction at the cost of increased false negatives compared to when $w_{FP}/w_{FN} = 1$ (i.e., a BCE-trained model). Consequently, across all datasets, the initial error at $r = 0$ (before any agent improves) is higher than that of the BCE model. However, as agents improve, the false negatives decrease. We will also include more evaluation results to provide more intuition on how agents' movement impacts the resultant TPs and FPs.
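A minimal sketch of the weighted BCE objective described above; the parameter names and exact form are our assumptions, and the paper's implementation may differ.

```python
import numpy as np

def weighted_bce(y_true, p_pred, w_fp=4.0, w_fn=0.001, eps=1e-7):
    """Binary cross-entropy with separately weighted error types: w_fp scales the
    penalty for scoring true negatives as positive (false-positive side), w_fn scales
    the penalty for scoring true positives as negative (false-negative side)."""
    p = np.clip(p_pred, eps, 1 - eps)
    fp_term = -w_fp * (1 - y_true) * np.log(1 - p)
    fn_term = -w_fn * y_true * np.log(p)
    return np.mean(fp_term + fn_term)

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.4, 0.2, 0.1])
print(weighted_bce(y, p))            # risk-averse weighting, heavily discourages false positives
print(weighted_bce(y, p, 1.0, 1.0))  # reduces to standard BCE
```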

“What does a small improvement budget in Section 6 correspond to? What does $r=2$ mean? At which point are all $x$ able to obtain a positive label?”

Agents improve within an $\ell_{\infty}$ ball of radius $r$ (lines 417–435). A smaller improvement budget restricts agent movement to a smaller radius. The point at which all agents $x$ can receive a positive label depends on dataset separability, the risk aversion (e.g., a weighted BCE loss-trained model), and the radius of the improvement ball.
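As one illustration, a hypothetical grid-search sketch of how a negatively classified agent's move within an $\ell_{\infty}$ ball of radius $r$ could be simulated; the toy classifier and all names are ours, not the paper's implementation.

```python
import numpy as np
from itertools import product

def best_response(x, h, r, steps=5):
    """If h(x) is negative, search a grid over the l_inf ball of radius r around x and
    move to the first positively classified grid point found (not necessarily the
    least-effort move); otherwise stay put."""
    if h(x) == 1:
        return x
    offsets = np.linspace(-r, r, 2 * steps + 1)
    for delta in product(offsets, repeat=len(x)):
        z = x + np.array(delta)
        if h(z) == 1:
            return z
    return x  # no positively classified point within budget: the agent does not move

# Toy classifier: positive iff the feature sum clears a threshold (illustrative).
h = lambda z: int(z.sum() >= 1.0)
x = np.array([0.3, 0.4])
print(best_response(x, h, r=0.2))  # the agent raises its features just enough to be classified positive
```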

Connection of the experiments to the theoretical results.

Thank you for pointing this out. We will include the following paragraph at the beginning of the evaluation section to better connect its motivation to the theoretical discussion: "This section empirically examines various improvement-aware algorithms, specifically practical risk-aversion strategies (loss-based and threshold-based) that consider agents improving within a limited improvement budget $r$, since risk-averse classifiers perform best according to our theory. We also explore whether and under what conditions the model error can be reduced to zero when negatively classified agents can improve within an $\ell_{\infty}$ ball of radius $r$. Recall that zero error is a remarkable property achievable in the learning-with-improvements setting, even for fairly general concepts (cf. Section 5, where the instance space is discrete but the concepts can be arbitrary functions).”

“…it might be useful to relate this more clearly to existing work such as strategic classification.”

Thank you for the helpful suggestion. While we already include a discussion of how our work compares with existing work on strategic classification, we agree that it is very useful to establish a clearer connection. Note that in strategic classification, the agents move in order to deceive the learner, while in our setting they try to truly improve. This subtle difference in the definition results in very different characteristics of learnability under the two settings. In particular, there are instances where learning with improvements is more challenging than strategic classification and vice versa.

Concretely, prior work by Sundaram et al. (2023) shows that a bounded “strategic VC” dimension is sufficient for learnability in strategic classification. This is, however, not true for our setting. It is possible to construct a concrete instance along the lines of Example 2, where the strategic VC dimension is constant but no learner (proper or improper) can achieve learnability with improvements. On the other hand, learnability with improvements can in some cases be easier than strategic classification. For example, consider the setting where the agents can move anywhere and the concept class consists of classifiers that label exactly one point in X as positive, plus the classifier that labels every point negative. Suppose the target $f^*$ and data distribution $D$ are such that $\Pr_D[f^*(x)=0]=\Pr_D[f^*(x)=1]=\frac{1}{2}$. Given $\log(1/\delta)$ samples, the learner sees at least one positive point with high probability. Now the learner can output the hypothesis that only labels this positive point in the sample as positive and achieve zero error in learning with improvements (all agents will move to this point). But in strategic classification any classifier suffers error at least $\frac{1}{2}$ (if any point is classified positive, all negative points will successfully deceive; else all positive points are classified incorrectly).

We will add the above examples that establish a separation between our model and strategic classification to clarify the comparison with the strategic classification model.

Review (Rating: 1)

The paper proposes a variant of PAC learning where the data points are conceived as agents that can move around by a small amount and thus if they are close to the decision boundary, they can be classified as desired (typically positive; e.g., for receiving a loan after some small "effort" in order to move their precise coordinates). The authors formulate this approach and consider sample complexity results; e.g., situations where learning is easier or situations where learning is harder. While the authors provide several theoretical results, they also provide experimental evidence in the end in order to show the effectiveness of the proposed method in real-world datasets.

Questions for the Authors

Q1. If you think your understanding of the hypothesis space and concept class are appropriate as you have defined things in the paper, please argue a bit more here and we will find the "truth" together.

Claims and Evidence

For the largest part the paper has clear and convincing evidence.

However, the authors have some issues with traditional terminology. In particular the authors mix the notions of "concept class" and "hypothesis class/space" (e.g., line 78 right column; they treat them as if they are the same, when they are not). Based on what I read in the paper, I believe that the authors fail to realize that the term "hypothesis" corresponds to the function that is being learnt from data and is the output of a learning algorithm. The name is fitting because it is really a "guess" for the ground truth function, which we typically call "concept". The reason that we call a function a concept probably has its roots in the early days of learning theory where functions were (and still are) seen as languages over an instance space that describe a particular concept of interest (that we want to learn).

In other words, the ground truth is a concept (function) from the concept class and a hypothesis is a function that we learn and it is one of the permissible functions from the hypothesis space (or, hypothesis class). Due to this misunderstanding, the authors even propose the existence of "hypotheses" outside of the "hypothesis class" (page 2, footnote 1), which is completely preposterous. Along these lines the authors mix things again (line 120, right column) where they want to make a connection between learnability and the VC-dimension. While such a connection exists, the connection really refers to the VC-dimension of the hypothesis space; the space of functions that the learner considers and may produce in the output; not on the space of functions from where the ground truth is potentially coming from and is the claim of the authors.

In line 190 we read about a "non-realizable target concept f^*" that is not part of the hypothesis space. While in general this would not be a problematic statement, if we think a little bit about the definitions that the authors have provided in the paper, they allow on one hand hypotheses to be outside of the hypothesis class and apparently now they also allow target concept functions (ground truth functions) to be outside of the hypothesis class as well!! What is the hypothesis class then? Again, in order to help the authors: the hypothesis class is the set of functions among which the learner has to choose one and output as a result of the learning process given a particular dataset.

In general, I think the authors have a good story to tell and some very interesting results. However, some of the claims are outright confusing because of all the above. For this reason, I believe a revised version of the paper will be much more interesting for everyone and this is why my current recommendation will not be for accepting this paper.

Methods and Evaluation Criteria

Yes, they are ok.

Theoretical Claims

Yes, I checked several proofs. However, there are several leaps of faith due to the non-standard and inconsistent use of terminology for machine learning theory.

Experimental Design and Analysis

The experiments look ok.

Supplementary Material

I skimmed through the supplementary material and it looks ok to me.

Relation to Prior Literature

Again, I need to stress the fact that the authors use traditional terminology in a very non-standard way. While I really like the ideas of the paper and the results that the authors have, nevertheless, these need to be stated in the appropriate/standard context for machine learning theory. Unfortunately, this is not where the paper is standing right now.

Missing Important References

I think everything is fine w.r.t. the references.

Other Strengths and Weaknesses

I think this paper has very interesting ideas. However, these need to be presented in the appropriate context. And this is especially true because ultimately the authors want to provide results that separate their model from the traditional model of learning. But if their description of the traditional model of learning has many holes, then, unfortunately, the paper is not ready yet. Again, I like the paper a lot, but a revision is needed before it gets published. This way all the results will make more sense and can be seen in the appropriate context.

Other Comments or Suggestions

There were a couple of typos here and there, but they were really minor.

Author Response

We thank the reviewer for taking the time to review our work. Below are responses to the questions they raised.

Terminology: “hypothesis class” [H] vs. “concept class” [C]

We indeed make the common assumption of implicitly taking H=C. While earlier papers in the field considered the case where a hypothesis class H is used to learn a concept class C (aiming to compete with the optimal hypothesis in H), we can always take H=C, which is particularly natural in the realizable setting—our main focus in the paper. That's why we use these terms interchangeably. We will clarify this terminology in our revised version but emphasize that all statements are correct as they are, without any changes. See, for example, Def 3.1 of PAC Learning in “Understanding Machine Learning: From Theory to Algorithms” [1]. A recent example is [2] (see e.g. Pg 2 or Pg 20) or see [3] for an example from earlier literature (first sentence says most work assumes H=C).

[1] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[2] Alon, Noga, Steve Hanneke, Ron Holzman, and Shay Moran. "A theory of PAC learnability of partial concept classes." In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 658-671. IEEE, 2022.

[3] Kobayashi, Satoshi, and Takashi Yokomori. "On approximately identifying concept classes in the limit." In International Workshop on Algorithmic Learning Theory, pp. 298-312. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995.

To be clear, when we say that a distribution D is realizable by a hypothesis/concept class H, we mean that there exists h in H with zero error on the distribution D (see, for example, Def 2.1 in “Understanding Machine Learning: From Theory to Algorithms”). When we say an “improper learning algorithm,” we mean that the algorithm can output any function while competing with the best hypothesis in H.

“hypotheses" outside of the "hypothesis class" (page 2, footnote 1)”

Actually, we wrote, “An improper learner can output any function h : X → {0, 1} that may not be in H.” We will update the variable name $h: X \rightarrow \{0,1\}$ to $f: X \rightarrow \{0,1\}$ for clarity. This is consistent with the definition of improper learning; see, for example, Remark 3.2 in “Understanding Machine Learning: From Theory to Algorithms”.

“they allow on one hand hypotheses to be outside of the hypothesis class and apparently now they also allow target concept functions (ground truth functions) to be outside of the hypothesis class as well!!”

We agree that the terminology in this context might be confusing and that it could be improved. Thanks for pointing it out. We changed this paragraph to the following: “As another example of the distinction between our model and the standard PAC setting, consider the following example. We construct a distribution D that is realizable by the class H, yet the target function f^* does not belong to H. That is, there exists h in H with zero error on D (in the standard PAC model), but h differs from the ground truth f^*. However, in our PAC learning model with improvement, the best function h in H must incur an error of 1/2 for the same ground truth function f^*.” We will also change the terminology in the proof (Example 3 in the appendix). We believe this terminology is easier to understand, but we emphasize that the statement is already correct as it is, without any further changes.

We would be happy to answer and clarify any further questions.

Reviewer Comment

Dear Authors,

Thank you for the comment and clarifications. My apologies I could not comment earlier.

For our discussion, below when we refer to a book, let's refer to the book that you brought up: "Understanding Machine Learning: From Theory to Algorithms" [1]

Based on your response I will try to go directly to the main points of my concerns.

From what I understand you consider the case $\mathcal{H} = \mathcal{C}$ and you allow the learner to output functions from some set $\mathcal{F}$ which can be a proper superset of $\mathcal{H}$. Indeed, in this case you would work in the realizable case. However, in order for someone to apply the ERM principle and ultimately do PAC learning, in the general case, one needs to argue about the combinatorial parameters of $\mathcal{F}$, not of $\mathcal{H}$ as you do in the paper. (Though, sometimes, I have the feeling that you also discuss the case $\mathcal{F} = \mathcal{H}$ in the paper too, but that is a separate issue; let's ignore it for now.)

To illustrate my point, consider the following two cases.

Case 1. Let $\mathcal{H}$ be a set of functions, conveniently defined in some way, such that $\mathcal{H}$ has VC-dimension $d$. If we consider the set of functions in $\mathcal{H}$ together with some additional functions (e.g., partial functions of the functions in $\mathcal{H}$) that do not give rise to additional functions that can shatter a larger set of points, then it makes sense for someone to argue about the statistical complexity of $\mathcal{F}$ and talk about $\mathcal{H}$ (as you do in the paper).

However, when one does representation-independent learning and the realizability assumption holds, this is usually the case that I describe below.

Case 2. Now consider the case where $\mathcal{H}$ is extended in such a way that the VC dimension increases, or if you want to consider finite sets of functions, the cardinality of the set simply increases. Now, in Corollary 3.2 or in Theorem 6.8 from [1], the statistical complexity will depend on $\mathcal{F}$ (which you never care to define in the paper), not on $\mathcal{H}$.

That is, sometimes we do representation-independent learning, and we help ourselves for performing empirical risk minimization, even if this comes with some toll in the sample complexity that we need for solving the learning problem. Along these lines, there is the paper "Computational limitations on learning from examples" by Pitt and Valiant [2], as well as the paper by one of the two authors of the book [1], titled "Using More Data to Speed-up Training Time" [3]. The point here is that the sample complexity increases according to some combinatorial parameter that governs $\mathcal{F}$ -- not $\mathcal{H}$. It just happens to be the case that ERM is now easy in the bigger set of functions even if the sample complexity has increased as it no longer depends on $\mathcal{H}$.


Coming back to your paper, I think it is this last bit that needs to be clarified when your paper is written. Because it appears that you want to do representation-independent learning in the sense of Case 2 above (well, mostly, since in some cases I think you discuss $\mathcal{F} = \mathcal{H} = \mathcal{C}$), but somehow you only mention $\mathcal{H}$ and you never even bother to define $\mathcal{F}$.

To me, the above is very confusing and I would like to see the paper written in a way that these notions are clear and everybody can understand what you are talking about. Again, I want to emphasize that I like your paper and that I genuinely believe that you have a very nice story to tell. But I do not believe that you are there yet for the reasons explained above.

Closing, now that I think we have identified these three sets $\mathcal{F}$, $\mathcal{H}$, and $\mathcal{C}$, can you please explain if indeed $\mathcal{F}$ is a strict superset of $\mathcal{H} = \mathcal{C}$ in your paper? Is this always the case in your paper? Is this true in some cases in your paper? Why do your statistical bounds always bring up $\mathcal{H}$ for comparison, when the learner may return a function from the difference between $\mathcal{F}$ and $\mathcal{H}$ as well (and therefore a different sample complexity may be used as well as a different approach in order to apply the ERM principle)?

Thank you for your time.

[1] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[2] Pitt and Valiant. Computational limitations on learning from examples. URL: https://dl.acm.org/doi/10.1145/48014.63140 For example, the relevant part is that one can use $\mathcal{F}$ = k-CNF formulae and $\mathcal{H} = \mathcal{C}$ = k-term DNF, where $\mathcal{F} \supsetneq \mathcal{H}$.

[3] Shai Shalev-Shwartz, Ohad Shamir, Eran Tromer. Using More Data to Speed-up Training Time. URL: https://proceedings.mlr.press/v22/shalev-shwartz12.html

Author Comment

Thank you for clarifying your question. We think the following should alleviate your concerns:

First, speaking generally, let's consider Case 2 in your comment (representation-independent learning of H). For representation-independent learning, to claim that an algorithm is PAC-learning H, one always requires sample complexity to be polynomial in the VC-dimension of H, not the VC-dimension of the broader class the learner's function is coming from (and the learning algorithm is not restricted to return a function from H). For example, in the Pitt-Valiant case of learning k-term DNF formulas using hypotheses that are k-CNF formulas, it is crucial that k is constant, so that the $O(n^k)$ sample size required is only polynomially larger than the VC-dimension of the class of k-term DNF formulas. Their algorithm would not be viewed as a PAC-learning algorithm for the case $k=\log(n)$, for instance, because the time and sample size would no longer be polynomial in the VC-dimension of H. To put it another way, one is not allowed to "cheat" and claim a PAC-learning algorithm for H by using a hypothesis that is just a truth-table (F = all Boolean functions) which will learn with time and samples proportional to the VC-dimension of F (which is $2^n$ for Boolean functions over $\{0,1\}^n$).
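For reference, the realizable-case sample complexity that this discussion appeals to has the standard form (up to constants; cf. Theorem 6.8 of [1]):

$$m_{\mathcal{H}}(\epsilon, \delta) = O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right), \qquad d = \mathrm{VCdim}(\mathcal{H}),$$

the point being that the governing quantity $d$ is the VC-dimension of the class $\mathcal{H}$ containing the target, whether or not the learner's output is restricted to $\mathcal{H}$.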

Next, speaking in terms of our paper, we think you will find that all our results are in terms of the correct quantities. For example,

  • Theorem 3.1:
    - Example 1: Here we show that if $\Delta=X$, then any class H is learnable, independent of the VC-dimension of H.
    - Example 2: Here we show that for H = unions of two intervals (VC-dimension(H) = 4), there exists $\Delta$ such that no algorithm can learn H from a finite sample size, even if the algorithm is allowed arbitrary functions, and is not restricted to unions of two intervals.
  • Theorems 4.1, 4.6, 4.7, 4.8: these all involve proper learning.
  • Theorem 4.9: this is an algorithm for learning halfspaces that uses a classifier that is not a halfspace (it is a convex set). The sample size bound is in terms of the VC-dimension of halfspaces, which is the appropriate quantity for representation-independent learning here.

To be explicit, in PAC learning there are several general paradigms that researchers study:

  1. (proper, realizable learning of H): it is assumed the target function belongs to H and the learner's output must belong to H. E.g. Sections 4.1, 4.2 and 5.1 in our work.
  2. (improper, realizable learning of H): it is assumed the target function belongs to H but the learner's output may be arbitrary (e.g., boosting). E.g. Section 4.3. in our work.
  3. (proper, agnostic learning of H): the target function might not belong to H but the learner's output must belong to H. Performance is compared to that of the minimum error achievable using h in H.
  4. (improper, agnostic learning of H): the natural combination of 2 and 3.

In summary, even when the learner is allowed to use “F”, a strict superset of the class H to which the concepts belong, the correct quantity for sample complexity is the VC-dimension of H and not F. We note that all but one of our sample complexity theorems in Sections 4 and 5 are about proper learning, where the learner must output a function from H. The only exception is Theorem 4.9, which involves improper learning. As stated clearly in the paper (cf. Footnote 1, Pg 2), an improper learner is allowed to output any function $X \rightarrow \{0,1\}$, i.e. “F” is the class $\{0,1\}^X$ of all functions that label the instance space X, matching the standard case 2 above.

We hope this answers any questions the reviewer may have, and are happy to provide any further clarification.

Final Decision

The paper deals with a learning-theoretic scenario where points (e.g., agents) can "improve" their features. While a similar setting was studied extensively in the "strategic classification" literature, this paper poses a novel challenge: Under such improvements, can we learn what is unlearnable in the standard PAC framework?

The reviewing committee had several disagreements, and all discussions (rebuttal, author-reviewer, reviewer-reviewer, and AC-reviewer) were extensive and engaging. All reviewers appreciate the results of this paper and believe it is novel and contains non-trivial contributions that deserve to be published at a top ML conference.

The main concern raised by the reviewers was the use of imprecise or misleading terminology in several parts of the paper, primarily around the distinctions between proper vs. improper learning and agnostic vs. realizable settings. This lack of clarity makes certain arguments and claims in the paper difficult to interpret. The authors acknowledged this issue in their rebuttal and committed to addressing it upon acceptance. In addition, they provided, as part of their rebuttal, a detailed breakdown of the key theoretical statements and clarified how each aligns with the appropriate definitions.

After all discussions, during which one reviewer increased their score and suggested championing the paper, another reviewer suggested that the required modifications may be non-trivial and warrant a new revision of the paper.

To make my final decision, I have read the paper and all the associated reviews and discussions carefully. I recognize both the merits of the paper and the critical perspective. Ultimately, I see this as a question of risk assessment. In my view, the necessary revisions are minor and well within the scope of a camera-ready version, and I trust the authors' clear commitment to address them. I acknowledge that this judgment could prove optimistic; for example, some statements may turn out to be ill-defined and require further clarification. However, I believe that delaying publication on these grounds would be unfair to the authors, most of the reviewers, and the community. I therefore vote for accepting this paper.