Optimal Mistake Bounds for Transductive Online Learning
Abstract
Reviews and Discussion
The authors consider the setting of transductive online learning for a concept class with Littlestone dimension d. The setting differs from standard online learning in that the sequence of samples is chosen before the learning stage and is revealed to the learner. The object of interest is how many mistakes the learner must make given that the concept class has Littlestone dimension d. The problem has been studied since the 90s, with the best known lower and upper bounds being log d and 2d/3, respectively (notably, the upper bound dates from 1997). The authors give tight upper and lower bounds showing that the correct answer is \sqrt{d}.
For the lower bound, the sequence of samples is chosen so that it satisfies the properties in lines 143-146. The authors show that the adversary's strategy is to use a hard threshold on the ratio by which the version space shrinks, forcing a mistake only when the version space does not shrink by more than an epsilon fraction. It is easy to verify that if both properties are satisfied, such a strategy forces \sqrt{d} mistakes overall. The authors do not provide further intuition in the main body on how to select a sequence satisfying the two required properties.
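To make this concrete, here is a minimal sketch of one round of such a threshold-based adversary, written as Python pseudocode. This is a schematic reconstruction of the description above, not the paper's actual Algorithm 1; the representation of the version space as an explicit list of hypotheses, the function name, and the interface are assumptions made for illustration.

```python
def adversary_round(version_space, x, learner_prediction, epsilon):
    """One round of a threshold-based adversary (illustrative sketch).

    version_space: list of hypotheses, each a dict mapping points to {0, 1}.
    x: the current point from the pre-announced sequence.
    learner_prediction: the learner's guess for the label of x.
    epsilon: hard threshold on how much the version space may shrink.
    Returns (revealed_label, forced_mistake, new_version_space).
    """
    flipped = 1 - learner_prediction
    # Hypotheses that survive if the adversary answers against the learner.
    survivors = [h for h in version_space if h[x] == flipped]
    if len(survivors) >= epsilon * len(version_space):
        # Forcing a mistake keeps at least an epsilon fraction of the
        # version space, so the adversary forces it.
        return flipped, True, survivors
    # Otherwise concede the round: agree with the learner's prediction and
    # keep only the hypotheses consistent with that label.
    agreeing = [h for h in version_space if h[x] == learner_prediction]
    return learner_prediction, False, agreeing
```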
For the upper bound, the authors need to construct a particular function class with Littlestone dimension d such that the learner can learn with at most \sqrt{d} mistakes. The main intuition is to maximize the information obtained from each mistake the learner makes (e.g., if the version space is split half/half between 0 and 1 on a sample, then a mistake there reveals only one bit to the learner). So the high-level idea is to make each sample take value 1 on only a very small subset of the version space; the learner then always guesses 0, so that any error immediately shrinks the version space by a lot (a back-of-the-envelope accounting of this intuition appears right after the list of difficulties below). While this idea seems plausible at a high level, the authors show that significant technical work is required to implement it. In particular, there are two types of samples, on-path samples and off-path samples. The strategy cannot be implemented on on-path samples, as those are needed to maintain the required Littlestone dimension, so it can only be applied to off-path samples. Then, from my understanding, the difficulties are the following:
- The adversary might be able to pile up on-path samples at the front of the sequence, before the learner can use off-path samples to learn the sparse encoding.
- The learner does not know whether a sample is on-path or off-path, since this depends on the unknown target function.
- It is unclear when to transition from limiting mistakes on on-path samples to halving the version space.
The authors resolve these problems by using a multiplicative-weights algorithm over the splitting experts.
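To see why making labels rare on the version space can buy a \sqrt{d} bound, here is a back-of-the-envelope halving-style accounting. The specific parameters (a class of size at most 2^d, and each mistake retaining at most a 2^{-\sqrt{d}} fraction of the version space) are illustrative assumptions, not the paper's exact construction.

```latex
% Illustrative accounting (assumed parameters, not the paper's construction):
% suppose $|\mathcal{H}| \le 2^{d}$ and every mistake shrinks the version
% space by a factor of at least $2^{\sqrt{d}}$, while correct predictions
% never enlarge it. Since the target function always survives,
\[
  1 \;\le\; |V_T| \;\le\; \bigl(2^{-\sqrt{d}}\bigr)^{m}\,|\mathcal{H}|
  \;\le\; 2^{\,d - m\sqrt{d}}
  \quad\Longrightarrow\quad m \;\le\; \sqrt{d},
\]
% where $V_T$ is the final version space and $m$ is the number of mistakes.
% A mistake on a point whose label is 1 for only a $2^{-\sqrt{d}}$ fraction
% of the version space thus reveals about $\sqrt{d}$ bits, rather than the
% single bit revealed when the labels are split evenly.
```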
Strengths and Weaknesses
The authors established the tight mistake bound in the setting of transductive online learning (for learning a concept class with Littlestone dimension d). This solves a long-standing open problem. The paper is also well presented, with enough intuition to explain the high-level construction behind both the upper and lower bounds. Regarding weaknesses, I do feel that a more concrete (formal) definition of on-path/off-path labels would help readers better understand the obstacles described in lines 223-247, if possible.
Questions
While the sparse encoding strategy for the upper bound makes sense, it does require the domain of the function class to be sufficiently large (from my understanding). As someone who is not familiar with online learning, would it make sense to consider this setting with a restriction on the domain size? And if so, would that change the sample complexity?
Limitations
No, although I don't see any particular limitation that needs to be addressed here. The authors have established the tight lower bound for the problem. Although it would be better if they could discuss open problems and potential future directions.
Formatting Issues
N/A
We thank the reviewer for their generous score and for raising thoughtful and technically insightful questions. We especially appreciate the technical depth reflected in their questions, which is all the more notable given the self-reported low confidence. We intend to consider and include these open questions in a future revision, with proper acknowledgment of the anonymous reviewer.
Weaknesses
Regarding weaknesses, I do feel that a more concrete (formal) definition of on-path/off-path labels would help readers better understand the obstacles described in lines 223-247, if possible.
Yes, this makes sense of course. Conceptually, given a Littlestone tree, each function in the class corresponds to a root-to-leaf path in the tree. The on-path nodes for a function are simply the nodes on that root-to-leaf path. This is mentioned in line 116 and in footnote 9, and is illustrated in Figure 1c. The formal definition is stated in Definition A.5 (lines 463-470). We will explicitly clarify in Definition A.5 that “on-path” refers to nodes that are on the path, as it is defined in that location.
Questions:
While the sparse encoding strategy for the upper bound makes sense, it does require the domain of the function class to be sufficiently large (from my understanding). As someone who is not familiar with online learning, would it make sense to consider this setting with a restriction on the domain size? And if so, would that change the sample complexity?
Thank you for noticing this — yes, and this is a very insightful point!
At present, we show an upper bound of \sqrt{d} mistakes using a class of functions over a suitably large domain.
Clearly, it is not possible to obtain such an upper bound with a domain of size d, because a class of Littlestone dimension d over a domain of size d is simply the set of all 2^d functions on the domain, and so the adversary can force a mistake on each of the d points in the domain (in other words, the VC dimension will be d as well).
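In symbols, this follows from the standard chain of inequalities relating the Littlestone dimension, the class size, and the domain size (a well-known fact, stated here for completeness):

```latex
\[
  d \;=\; \mathrm{Ldim}(\mathcal{H}) \;\le\; \log_2 |\mathcal{H}| \;\le\; |\mathcal{X}|,
\]
% so if $|\mathcal{X}| = d$ then $|\mathcal{H}| = 2^{d}$, i.e. $\mathcal{H}$ is
% the set of all functions on the domain, and the adversary can force a
% mistake on every one of the $d$ domain points.
```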
But what about domains of intermediate size? Is there some nice tradeoff between the domain size and the best achievable upper bound? These are very good questions, and we don’t know the answers. We will include them in the Open Questions section in the camera-ready version.
Thanks again for contributing this great technical question — it shows significant insight into our work.
Limitations
It would be better if they could discuss open problems and potential future directions.
Definitely! The final version of the paper is allowed to include an additional page, and in that page we will have a Future Directions section. It will include:
- Open question (contributed by Reviewer zo5C): Is it possible to achieve the upper bound with an efficient learning algorithm? (See further discussion in our answer to Reviewer zo5C.)
- Open question (contributed by you, Reviewer bNtP): for a class of Littlestone dimension d, what is the tradeoff between the domain size and the best upper bound? (As discussed above.)
- Open question: Can we get finer asymptotic bounds? I.e., what are the “correct” (tightest) constants hidden by the O(\cdot) and \Omega(\cdot) notation in the lower and upper bounds? For example, is there some constant c such that the answer is c\sqrt{d}?
Thank you so much for your review!
The Author(s)
This paper solves a long-standing open problem concerning the power of unlabeled data in online learning. The authors prove rigorously that, in the transductive online setting, the optimal mistake bound is \sqrt{d}, where d denotes the Littlestone dimension of the hypothesis space. Such an optimal bound constitutes an exponential improvement over the previous lower bound of \log d, and also improves upon the best known upper bound of 2d/3. Overall, this work provides milestone theoretical results that demonstrate the power of unlabeled data in transductive online learning.
Strengths and Weaknesses
Strengths
(1) The proofs are novel and truly technical. I must first admit that I am not an expert in the field of mistake bounds for transductive online learning, and I am sorry to say that I am not familiar with the combinatorial techniques used to deal with the Littlestone dimension. Therefore, it is hard for me to check the proof in full detail, and I could only read the proof as carefully as I could. However, based on my experience, the proofs of the optimal mistake bound are elegant and look correct.
(2) The exponential improvement over the previous lower mistake bound of \log d is a great breakthrough. I read the technical ideas in Section 2.2 and am convinced that it is true.
(3) The improvement from the best known upper bound of 2d/3 to \sqrt{d} is also very technical and novel.
Weaknesses
(1) I am convinced that this is a very solid work. However, for the reader’s convenience (especially for readers outside this field), it would help to explain more clearly which parts of the proof are new and which ideas are borrowed from the existing work of Hanneke, Moran, and Shafer (2023).
(2) Regarding the exposition of the proof idea for the upper bound, it is truly hard for me to read the lengthy prose in Sections 2.3.1-2.3.5. I would suggest deferring the lengthy proof-technique description to the appendix and giving more concise lemmas/claims/mathematical notation in the main paper, to show mathematically how the improved mistake upper bound is derived. In the current version, it is still quite hard for me to grasp the essence of the main proof strategy.
(3) The related work section does not introduce the optimal mistake bound for the standard online learning setting. It would be easier to see the benefit of access to unlabeled data in online learning if it were compared with the optimal mistake bound in the standard online setting.
(4) This paper actually exceeds the page limit.
Questions
(1) Could you please explain further which proof ideas are essential for deriving the improved mistake bounds, compared with the latest work of Hanneke, Moran, and Shafer (2023)?
(2) Could you please introduce more related works about the optimal mistake bound for standard online learning?
Limitations
(1) This paper exceeds the page limit (nine pages).
Final Justification
Thanks to the authors for their detailed responses. It is true that presenting a Technical Overview instead of mathematical notation in the main paper has its merits, especially for a truly technical paper. Overall, I believe this is a very nice paper, given its very strong optimal mistake bound and its technical proofs. I will raise my score; congratulations to the authors.
Formatting Issues
(1) This paper exceeds the page limit (nine pages).
Thank you very much for your thoughtful and constructive review. Your comments — including your suggestions to clarify what is new in our work, to improve the presentation of the upper bound, and to make the standard online learning setting more accessible — will help us improve the clarity and presentation of the paper.
Weaknesses
(1) I am convinced that this is a very solid work. However, for the reader’s convenience (especially for readers outside this field), it would help to explain more clearly which parts of the proof are new and which ideas are borrowed from the existing work of Hanneke, Moran, and Shafer (2023).
Absolutely! The upper bound is new, and does not have a direct analog in [1]. For the lower bound, there are a number of important differences between our proof and theirs:
- They use an adversary that presents all the nodes in the Littlestone tree, in breadth-first order. However, one can show an upper bound of O(\log d) mistakes against any adversary that presents all the nodes in the Littlestone tree in breadth-first order, so in that sense their technique cannot be improved. To overcome this limitation, we devise an adversary that presents only a subset of the nodes in the tree (see Algorithm 2, “ConstructSequence”). The intuition here is that presenting all the nodes in the tree can be bad for the adversary, because some nodes are easy to guess but still leak a small amount of information about the correct labeling function. Therefore, to obtain a stronger lower bound, one can try to present only a subset of the (“harder to guess”) nodes in the tree.
- The proof of the lower bound in [1] actually contains a somewhat complex adversary that maintains a parameter that is an inverse of a double exponential in the number of mistakes that have been forced so far (see Algorithm 1 in [1]). Because we present a shorter sequence with labels that are “harder to guess”, we are actually able to simplify this somewhat cumbersome aspect of their proof, and use a single value that remains constant (see Algorithm 1 in our paper).
(2) Regarding the exposition of the proof idea for the upper bound, it is truly hard for me to read the lengthy prose in Sections 2.3.1-2.3.5. I would suggest deferring the lengthy proof-technique description to the appendix and giving more concise lemmas/claims/mathematical notation in the main paper, to show mathematically how the improved mistake upper bound is derived. In the current version, it is still quite hard for me to grasp the essence of the main proof strategy.
We can understand a preference for mathematical notation over textual description. However, at present, the appendices contain 24 pages of nontrivial mathematical proofs, including definitions, pseudocode, equations, etc. We don’t feel that the proofs could really be made to fit in the 9 pages of the main text, even in a condensed and abbreviated form. If we tried to do so, the result might be hard to understand without the full definitions and pseudocode, etc., and it would likely be confusing and inaccurate.
Therefore, the choice we made for presenting our work was to include a clear (but condensed) statement of our results in Theorem 1.1, and then provide a Technical Overview that explains the main ideas, ingredients, and steps of our proofs at an intuitive level, while referring the reader to specific formal definitions, lemmas, and algorithms in the appendix. Our hope is that after reading the Technical Overview, readers will have a general idea of how the proofs work, and they will be able to read the details of the formal proofs more easily.
We feel that this method of presentation makes sense, but we are of course open to hearing suggestions. Are there any specific changes to the Technical Overview that you feel would make it clearer?
Per your feedback, using the space from the additional 10th page in the camera-ready version of the paper, we will try to add to the main paper a back-of-the-envelope calculation that at least gives a feeling for how the quantity \sqrt{d} arises in the bounds.
(3) The related work section does not introduce the optimal mistake bound for the standard online learning setting. It would be easier to see the benefit of access to unlabeled data in online learning if it were compared with the optimal mistake bound in the standard online setting.
Thank you for pointing this out! Presently, the classic characterization of the mistake bound for the standard setting using the Littlestone dimension is mentioned already in the abstract, and it is mentioned again in the Related Work sections — but you are correct that the formal statement of this result appears only later on, in the appendix (specifically, in Theorem A.7). To make the paper more readily accessible to a broad audience, we will make the following additions:
- On page 2, right after defining the number of mistakes in the standard setting, we will note that this quantity is characterized by the Littlestone dimension, and refer the reader to Theorem A.7 for the full details.
- In the Related Work section, when mentioning this result, we will again refer the reader to Theorem A.7 for full details.
(4) This paper actually exceeds the page limit.
Respectfully, all content in our main submission PDF ends at the bottom of page 9 — and subsequent pages contain only references and the NeurIPS checklist. This adheres to the page limit in the NeurIPS Call for Papers.
In the supplementary material, for which no page limit exists, the main text minimally exceeds 9 pages due to merging documents and slight edits.
Questions:
(1) Could you please explain further which proof ideas are essential for deriving the improved mistake bounds, compared with the latest work of Hanneke, Moran, and Shafer (2023)?
[Repeating our response to Weakness (1):]
Absolutely! The upper bound is new, and does not have a direct equivalent in [1]. For the lower bound, there are a number of important differences between our proof and theirs:
- They use an adversary that presents all the nodes in the Littlestone tree, in breadth-first order. However, one can show an upper bound of O(\log d) mistakes against any adversary that presents all the nodes in the Littlestone tree in breadth-first order, so in that sense their technique cannot be improved. To overcome this limitation, we devise an adversary that presents only a subset of the nodes in the tree (see Algorithm 2, “ConstructSequence”). The intuition here is that presenting all the nodes in the tree can be bad for the adversary, because some nodes are easy to guess but still leak a small amount of information about the correct labeling function. Therefore, to obtain a stronger lower bound, one can try to present only a subset of the (“harder to guess”) nodes in the tree.
- The proof of the lower bound in [1] actually contains a somewhat complex adversary that maintains a parameter that is an inverse of a double exponential in the number of mistakes that have been forced so far (see Algorithm 1 in [1]). Because we present a shorter sequence with labels that are “harder to guess”, we are actually able to simplify this somewhat cumbersome aspect of their proof, and use a single value that is constant (see Algorithm 1 in our paper).
(2) Could you please introduce more related works about the optimal mistake bound for standard online learning?
Definitely! The paper that established this result is Littlestone [2]. This paper is already cited a number of times in the submission, but as we discuss in our response to Weakness (3) above, we will add direct references to the formal statement of this result (Theorem A.7) from page 2 and from the Related Works. Additionally, we will add references to expositions that are more accessible, including Chapter 21 in Shalev-Shwartz & Ben-David [3], Chapter 8 in Mohri, Rostamizadeh & Talwalkar [4], and the relevant sections in Cesa-Bianchi & Lugosi [5].
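For completeness, the classic result in question can be stated as follows. This is Littlestone's well-known theorem, quoted here as standard background; the notation M(H) is introduced only for this statement.

```latex
% Littlestone (1988): in the standard (non-transductive) realizable online
% setting, the optimal worst-case number of mistakes equals the Littlestone
% dimension, and is achieved by the Standard Optimal Algorithm (SOA).
\[
  \mathrm{M}(\mathcal{H}) \;=\; \mathrm{Ldim}(\mathcal{H}),
\]
% where $\mathrm{M}(\mathcal{H})$ denotes the minimum over deterministic
% learners of the maximum number of mistakes over all realizable sequences.
```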
Limitations
(1) This paper exceeds the page limit (nine pages).
[Repeating our response to Weakness (4):]
Respectfully, all content in our main submission PDF ends at the bottom of page 9 — and subsequent pages contain only references and the NeurIPS checklist. This adheres to the page limit in the NeurIPS Call for Papers.
In the supplementary material, for which no page limit exists, the main text minimally exceeds 9 pages due to merging documents and slight edits.
[1] Hanneke, S., Moran, S., & Shafer, J. (2023). A trichotomy for transductive online learning. Advances in Neural Information Processing Systems, 36, 19502-19519.
[2] Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4), 285-318.
[3] Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
[4] Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. MIT press.
[5] Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
Dear Reviewer EVTg,
We happily thank you once more for your careful reading of our paper and for your thoughtful suggestions. We hope that our rebuttal has addressed all your comments to your full satisfaction, and of course — we are very willing to engage in further dialogue as needed, and we look forward to any additional remarks you might like to share.
We were totally thrilled to receive some very positive comments from you, highlighting the paper’s results, novelty, elegance, and technical depth. We would be very grateful if you would be willing to consider increasing your score, to more fully reflect the merits of the work.
Best, The author(s)
This paper provides improved lower and upper bounds on the number of mistakes in transductive online learning, yielding tight guarantees and showing a gap between transductive and standard online learning. The proof technique is intuitive to follow and relies on several techniques from the field, including the halving algorithm and the multiplicative weights algorithm.
Strengths and Weaknesses
The main strengths of the paper are its novelty and clarity. It solves a 30-year-old open problem by providing matching upper and lower bounds on the number of mistakes in transductive online learning. The proof is also presented intuitively in the main paper, which makes it easier to understand. I believe the proof is correct based on the exposition in the main paper, but I have not verified the calculations in the appendix. I do not see any major weaknesses in the work.
Questions
- What is the computational efficiency of the algorithm used to prove the upper bound? It seems that in the worst case there could be exponentially many experts (or perhaps even super-exponentially many). Is that correct? Do the authors think the computational efficiency can be improved?
Limitations
I think this paper does not have any major limitations, but a discussion of the computational efficiency of the algorithm could also add some value.
Formatting Issues
None
Thank you for your supportive review!
What is the computational efficiency of the algorithm used to prove the upper bound? It seems that in the worst case there could be exponentially many experts (or perhaps even super-exponentially many). Is that correct? Do the authors think the computational efficiency can be improved?
Thanks for raising the question of computational efficiency, which is a very interesting topic! We will include this in the Open Questions section in the final version. More broadly, we will also add a discussion on this topic along the following lines.
First, one needs to define what computational efficiency means in this context. Kakade & Kalai [1] consider a transductive learner to be efficient if it runs in time polynomial in n, where n is the length of the sequence of unlabeled examples. In their setting they consider the VC dimension to be constant, ignoring the runtime dependence on the VC dimension, the size of the class, etc.
In our upper bound, we construct a class of Littlestone dimension d over a suitably large domain, together with a learning algorithm that makes at most \sqrt{d} mistakes. The algorithm runs in time polynomial in the length of the sequence and in the number of experts. Each expert is associated with a subset of the hypothesis class, and at each time step these subsets are disjoint, so the number of experts is at most the number of hypotheses in the class. Therefore the algorithm runs in time polynomial in the size of the class, which is not efficient in the sense above. (However, the runtime is not super-exponential.)
Note, however, that in our case the class is constructed randomly. Therefore it would be reasonable to think of the learning algorithm as receiving a description of the hypothesis class as one of its inputs. Such a description would essentially be a binary matrix, with one row per hypothesis and one column per domain point. So under this view, our runtime is polynomial (and potentially even sublinear) in the input length.
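To illustrate the runtime claim, here is a minimal sketch of a generic weighted-majority learner that treats each hypothesis (a row of the binary matrix) as an expert. This is not the splitting-experts algorithm from the paper; the function name, the interface, and the fixed update factor are assumptions made for illustration. Its per-round cost is linear in the number of experts, which is polynomial in the size of the class but can be exponential in d.

```python
import numpy as np

def weighted_majority_transductive(H, xs, get_true_label, eta=0.5):
    """Generic weighted-majority learner over a finite class (illustrative).

    H: binary matrix of shape (num_hypotheses, domain_size); H[i, x] is the
       label that hypothesis i assigns to domain point x.
    xs: the sequence of domain points, known to the learner in advance.
    get_true_label: callback revealing the true label after each prediction.
    Runtime: O(len(xs) * num_hypotheses), i.e. polynomial in the number of
    experts but not in d when the class itself is exponentially large.
    """
    weights = np.ones(H.shape[0])
    mistakes = 0
    for x in xs:
        # Weighted vote of the (down-weighted) hypotheses on point x.
        votes_for_one = weights[H[:, x] == 1].sum()
        prediction = int(votes_for_one >= weights.sum() / 2)
        y = get_true_label(x, prediction)
        if prediction != y:
            mistakes += 1
        # Multiplicatively penalize every hypothesis that was wrong on x.
        weights[H[:, x] != y] *= eta
    return mistakes
```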
It is an interesting open question whether there exist a learning algorithm A and a sequence of classes H_1, H_2, H_3, ..., such that for every integer d:
- H_d has Littlestone dimension d, and
- Given as input the index d and a sequence of unlabeled examples x_1, ..., x_n, the algorithm A runs in time polynomial in n and d, and makes at most roughly \sqrt{d} mistakes, assuming that the labels are realizable by H_d.
A natural variation is to ask for an oracle-efficient algorithm: an algorithm that runs in polynomial time given access to an ERM oracle for the hypothesis class (or some other similar oracle).
[1] Kakade, S., & Kalai, A. T. (2005). From batch to transductive online learning. Advances in Neural Information Processing Systems, 18.
In the standard online learning model, the minimum number of mistakes made by the learner is exactly characterized by the Littlestone dimension of the hypothesis class. This paper exactly characterizes the minimum number of mistakes in the transductive version (where the learner knows the test data in advance) by showing that there are classes of Littlestone dimension d on which no learner can make fewer than roughly \sqrt{d} mistakes, and conversely that whenever the Littlestone dimension is d, the learner can always learn with roughly \sqrt{d} mistakes.
Strengths and Weaknesses
- Exceptionally well-written paper. Well written proofs, good mix of intuition and formalisms, detailed pseudocode for everything. I enjoyed reading it.
- Resolves a ~30-year-old open question with a tight bound. I haven't read the previous bounds on the problem, so I don't know how much the paper innovated in terms of proof ideas.
- I wish I wrote it.
Questions
This is sort of a history question: What do you think was the big insight or connection that let you solve the problem? Or from another perspective, what was blocking the previous papers from getting this bound? All the steps seemed almost obvious when reading your paper (which to be clear I don't think is a bad thing).
Limitations
Yes
Final Justification
No change
Formatting Issues
none noticed
We’re very grateful for the reviewer’s heartwarming comments and score. This kind of encouraging feedback is both humbling and motivating. Thank you for engaging with our work so thoughtfully.
This is sort of a history question: What do you think was the big insight or connection that let you solve the problem? Or from another perspective, what was blocking the previous papers from getting this bound? All the steps seemed almost obvious when reading your paper (which to be clear I don't think is a bad thing).
Thank you for this question! Looking back at our work, we actually don’t believe that there is a single big insight that leads to this result. Rather, it followed from a process (typical in mathematics) of tinkering with a number of specific examples, carefully considering some proofs and seeing whether they can be pushed further, or whether one can prove lower bounds that would preclude improvements, etc.
As a first point, one needs to understand what is the correct statement to be proved, namely, bounds of \sqrt{d}. But it is hard to say where the \sqrt{d} comes from. Consider:
- In the PAC setting, unlabeled data doesn’t help. So one could conjecture that there is a lower bound of d.
- On the other hand, the previous lower bound of \log d is actually tight (cannot be improved) if the adversary presents a sequence consisting of the entire Littlestone tree in breadth-first order (as was the case in [1]). So one could conjecture that this is true for all sequences, and therefore that there is an upper bound of O(\log d).
- Even now, it is not clear that there is a simple and concise intuition for where the \sqrt{d} comes from.
Next, consider the upper bound. It involves constructing a hypothesis class at random, and then utilizing properties of that class in order to learn with few mistakes. While the probabilistic method is of course very common in discrete math, this is arguably not a very typical use case — randomized constructions are more commonly used to obtain “hard instances”: random functions are hard to learn, adding random noise is conjectured to make it hard to learn linear functions (LPN), refuting random 3-SAT instances is conjectured to be hard [3], random strings are provably hard to compress (Kolmogorov), etc. In contrast, in our proof, we construct a randomized class so that it will be easy to learn.
Additionally, the proofs of the upper bound and lower bounds are very different from each other. Often there is some sort of “correspondence” between elements in the proofs of matching lower and upper bounds, but this does not appear to be the case here.
For the lower bound, the paper [2] offered a characterization of the transductive mistake bound, but that characterization somewhat obscures or deemphasizes the tree structure of the hypothesis class, which turns out to play an important role in our proof and in [1].
That being said, one important insight that allowed us to obtain our improved results, was that in the proof of the previous lower bound in [1], the adversary presents all the nodes in the Littlestone tree, and this can be bad for the adversary because some nodes are easy to guess, but still leak a small amount of information about the correct labeling function. Therefore, to obtain a stronger lower bound, one can try to present only a subset of the (“harder to guess”) nodes in the tree.
[1] Hanneke, S., Moran, S., & Shafer, J. (2023). A trichotomy for transductive online learning. Advances in Neural Information Processing Systems, 36, 19502-19519.
[2] Ben-David, S., Kushilevitz, E., & Mansour, Y. (1997). Online learning versus offline learning. Machine Learning, 29(1), 45-63.
[3] Feige, U. (2002). Relations between average case complexity and approximation complexity. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (pp. 534-543).
Thank you so much for your review!
The Author(s)
This paper solves a long-standing open problem concerning the power of unlabeled data in online learning. It provides improved lower and upper bounds on the number of mistakes in transductive online learning, yielding tight guarantees and showing a gap between transductive and standard online learning.
Reviewers are all very positive about the contributions as well as the quality of presentation.