PaperHub
Average rating: 6.3 / 10
Poster · 3 reviewers (ratings: 4, 7, 8; min 4, max 8, std 1.7)
Confidence: 2.7 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 2.3
NeurIPS 2024

A Theory of Optimistically Universal Online Learnability for General Concept Classes

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-15
TL;DR

We provide a full characterization of the concept classes that are optimistically universally online learnable with {0, 1} labels.

Abstract

Keywords
Online Learning · Statistical Learning · Consistency

Reviews and Discussion

Review
Rating: 4

This work considers the theory of universal consistency and optimistically universal learning rules for binary classification where, instead of considering all measurable target functions, the analysis is parameterized by a specific class $\mathcal{H}$ of binary classifiers. Two types of classes $\mathcal{H}$ are considered (the difference, made explicit after the two definitions below, is only in the order of quantifiers):

  • Strongly universal: for every $\mathcal{H}$-realizable process $\mathbb{X}$ there is a strongly universally consistent online learning algorithm for $\mathcal{H}$ and $\mathbb{X}$.

  • Optimistically universal: there is an online learning algorithm that is strongly universally consistent for $\mathcal{H}$ and $\mathbb{X}$ for every $\mathcal{H}$-realizable process $\mathbb{X}$.
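
The difference between the two notions is the order of quantifiers. Writing $A$ for a generic online learning algorithm (a symbol introduced here only for brevity):

$$\text{strongly universal: } \forall\, \mathbb{X}\ (\mathcal{H}\text{-realizable})\ \exists\, A, \qquad\quad \text{optimistically universal: } \exists\, A\ \forall\, \mathbb{X}\ (\mathcal{H}\text{-realizable}),$$

where in both cases $A$ is required to be strongly universally consistent for $\mathcal{H}$ and $\mathbb{X}$.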

The main results of this work (as far as I understand) are the following:

  • $\mathcal{H}$ is strongly universal if and only if it has no infinite VCL tree (Theorem 9).

  • $\mathcal{H}$ is optimistically universal if and only if it has no infinite Littlestone tree (Theorem 10).

Recall that if $\mathcal{H}$ has no infinite Littlestone tree, then $\mathcal{H}$ has no infinite VCL tree.

Further results extend this characterization to the agnostic case.

Strengths

The work considers a variant of universal consistency for online learning in which there is a reference class $\mathcal{H}$ of binary classifiers. A full characterization of universal online learnability is provided.

Weaknesses

The clarity of the presentation is below the NeurIPS standards.

Questions

Please elaborate on the connections between this work and Section 3 of arxiv.org/pdf/2011.04483.

Please clarify the apparent inconsistency in Theorems 9, 10, and 12.

[The above questions have been addressed in the rebuttal]

Limitations

There is no section on limitations, but the work is purely theoretical with no foreseeable societal impact.

Author Response

We appreciate your helpful comments and we address your questions below.

First, let us re-iterate the definitions of the paper. For a given concept class $\mathcal{H}$, a process $\mathbb{X} = (X_1, X_2, \ldots)$ admits a universally consistent online learner if there exists a learning algorithm guaranteeing that, for any realizable classification of the sequence, the learner's number of mistakes grows only sublinearly. So the property of admitting universal online learning is a property of a concept class $\mathcal{H}$ and a stochastic process $\mathbb{X}$. In contrast, the property of admitting an optimistically universal online learner is a property of a concept class $\mathcal{H}$ alone: namely, $\mathcal{H}$ admits an optimistically universal online learner if there exists an algorithm which is universally consistent (i.e., makes sublinearly many mistakes for realizable classifications) for every process $\mathbb{X}$ which admits a universally consistent online learner. See the paper (e.g., lines 50-52), and references therein, for further explanation of this terminology and the history of this subject (in the context of prior work, our innovation is incorporating the concept class $\mathcal{H}$ into the theory).
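
In symbols (with $\hat{Y}_t$ denoting the learner's prediction at time $t$ and $h \in \mathcal{H}$ a target concept, notation introduced here only for illustration; the precise probabilistic sense of the guarantee is the one defined in the paper), universal consistency of a learner for $(\mathcal{H}, \mathbb{X})$ amounts to

$$\frac{1}{n} \sum_{t=1}^{n} \mathbb{1}\{\hat{Y}_t \neq h(X_t)\} \;\longrightarrow\; 0 \quad \text{as } n \to \infty,$$

for every target $h \in \mathcal{H}$ (and, more generally, every $\mathcal{H}$-realizable labeling of the sequence); $\mathcal{H}$ admits an optimistically universal online learner when a single algorithm achieves this simultaneously for every process $\mathbb{X}$ that admits some universally consistent learner.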

There is no inconsistency in the results. As we prove (Theorem 9), any class with an infinite VCL tree has the property that the set of processes $\mathbb{X}$ admitting a universally consistent online learner is not the set of all processes: only some processes admit a universally consistent online learner. In contrast, Theorem 9 also establishes that any class with no infinite VCL tree has the opposite property: every process admits a universally consistent online learner.

Note that this is different from the property of admitting an optimistically universal online learner; the latter additionally requires that there exists a single online learning algorithm which is universally consistent for every such process. In the case of classes having no infinite VCL tree, Theorem 9 implies every process admits a universally consistent online learner, so an optimistically universal learner would need to be universally consistent for every process; our Theorem 10 establishes that, among classes with no infinite VCL tree, such an optimistically universal learner exists if and only if $\mathcal{H}$ also has no infinite Littlestone tree. For classes which do have an infinite VCL tree, the property of being an optimistically universal learner only requires the learner to be universally consistent under those processes $\mathbb{X}$ which admit a universally consistent online learner; it need not be consistent for all processes (which is impossible, by our Theorem 9). For such classes, our condition A describes precisely which processes $\mathbb{X}$ admit universally consistent learners (Theorem 11), and our condition B describes which concept classes $\mathcal{H}$ with infinite VCL trees admit optimistically universal learners (Theorem 12).

We indeed discuss (and use, with citation) the results of the Bousquet et al. paper referenced in the review. The sufficiency half of our results for classes having no infinite Littlestone tree (Theorem 10) is directly based on that work. We note that their work considers whether there exists an online learner successful under arbitrary realizable data sequences. In contrast, we are interested in understanding which pairs $(\mathcal{H}, \mathbb{X})$ admit the existence of successful online learners, and in whether there can exist a learner which achieves this adaptively (i.e., without dependence on the distribution of the sequence): the former is the question of which processes admit universally consistent learners, and the latter is the question of which concept classes admit optimistically universal learners. For instance, consider the class $\mathcal{H}$ of thresholds on the real line. This class admits an infinite Littlestone tree, so Theorem 10 (which is similar to, but not quite the same as, Bousquet et al.'s Theorem 3.1) shows there is no algorithm universally consistent under all realizable data sequences; but since it has no infinite VCL tree, our Theorem 9 reveals that every process $\mathbb{X}$ admits a (process-specific) universally consistent online learner. Hopefully this example clarifies the distinction: universal consistency for a given process is about whether that process admits a (process-dependent) universally consistent learner, whereas optimistic universality is about whether there is a single algorithm (defined independently of the process) which is universally consistent for every process $\mathbb{X}$ that admits one.
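
For concreteness, the standard binary-search construction (well known, and not spelled out above) exhibits an infinite Littlestone tree for the class of thresholds $h_a(x) = \mathbb{1}\{x \ge a\}$ with $a \in [0,1]$: label the root with the point $\tfrac{1}{2}$; at any node, the thresholds consistent with the labels along the path so far form a subinterval of $[0,1]$ of positive length, and we label that node with the midpoint of this subinterval, so both child labels remain realizable. Every finite branch is thus realizable by some $h_a$, and the tree can be grown to arbitrary depth. On the other hand, for $x < x'$ the labeling $(1,0)$ is never realizable by a threshold, so no pair of points is shattered, in line with the absence of an infinite VCL tree noted above.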

Comment

Thank you for the clarification. I now better understand that, for classes that have an infinite VCL tree, the property of being an optimistically universal learner only requires the learner to be universally consistent under all processes which admit a universally consistent online learner. This resolves the apparent contradiction I had mentioned in my original review. As a consequence, I have increased my score. However, my reservation about the quality of the presentation remains.

Review
Rating: 7

This paper considers characterizations of optimistically universal online learnability, a notion that refers to the existence of learning rules that are simultaneously optimal for any data-generating process satisfying some minimal assumption, i.e., admitting an optimal process-dependent learning rule. Previous works have studied this notion of learnability when the function being learned can be any measurable function, a very broad class of functions. In this paper, the authors consider characterizations of learnability when the labels of the samples are realizable by some known binary hypothesis class $\mathcal{H}$, and also consider the agnostic task of achieving low regret with respect to the best hypothesis in $\mathcal{H}$. They give characterizations of when $\mathcal{H}$ is optimistically learnable based on VCL and Littlestone trees, as well as some additional assumptions in certain cases. They provide clear examples to illustrate these characterizations.

Strengths

I think the question of characterizing optimistic learnability for a given concept class is very interesting, and the paper is well-written in that it clearly explains the intuition and interpretations of the results, and connects to prior work.

Weaknesses

I found some of the notation, especially around VCL trees and games, extremely difficult to parse. For instance, it was confusing that a capital $X$ may refer either to a tuple of $X$-values or to a single $X$ value. It would be great if this could be made more readable.

Questions

  • Do these results/techniques have any implication for optimistic online learning beyond the binary-label setting?

Limitations

The authors have adequately addressed the limitations and assumptions made in their results.

Author Response

We appreciate your helpful comments.

We will work to make the notation more readable, for instance by changing the capital $X$ in the VCL game to make it less confusing.

As for the problem of optimistically universal online learnability beyond the binary-label setting, this is a very interesting question. We believe the techniques used here extend easily to finite-label multiclass learning. Beyond that, we have found that further research is needed to understand the problem for infinite-label multiclass learning, regression, and learning with general losses, and we leave this for future work.

Comment

Thanks for your reply and answers to my questions! I will keep my score as-is.

Review
Rating: 8

This work continues the line of work by Hanneke and Blanchard et al. on universal online learning under general stochastic processes (instead of the more common i.i.d. or adversarial settings) with binary labels. While previous papers focused on the case of all measurable functions (as an implicit hypothesis space), the current work gives a characterization for arbitrary hypothesis spaces.

The two main characterizations are as follows. Fix a hypothesis space $\mathcal{H}$.

  1. For any data-generating process $\mathbb{X}$ there exists some algorithm that "learns" (i.e., average regret goes to $0$) on $\mathbb{X}$ iff $\mathcal{H}$ has no infinite VCL tree (a combinatorial dimension, like VC, used in universal learning).
  2. For any data-generating process $\mathbb{X}$ there exists some algorithm that learns on $\mathbb{X}$, and there exists a single algorithm that learns under any data process "under which learning is possible", iff $\mathcal{H}$ has no infinite Littlestone tree (a standard combinatorial dimension, characterizing online learning).

By the first characterization, the existence of an infinite VCL tree thus only implies that there are some data processes on which learning fails; it is unclear whether learning is still possible on other specific processes. To tackle this, the authors offer an additional, more fine-grained condition for each fixed (unlabeled) stochastic process $\mathbb{X}$ (for the case where $\mathcal{H}$ has an infinite VCL tree):

  1. $\mathbb{X}$ is learnable iff there exists a countable set of experts (which can depend on $\mathbb{X}$) satisfying a certain no-regret condition.
  2. There is a single algorithm that learns "whenever learning is possible" (i.e., for any process that admits learning) iff there exists a countable set of experts (with no dependence on $\mathbb{X}$) satisfying the no-regret condition for all processes $\mathbb{X}$ that admit learning.

Both of these conditions collapse to a previously studied condition if $\mathcal{H}$ is the set of all measurable functions.

Strengths

Strong progress for an important and recent learning problem.

This work continues the characterization of distribution-dependent (universal) learning under stochastic processes, which fits into a broad range of recent results in universal learning and combinatorial characterizations.

The expert-advice algorithm Squint is used as a nice black-box reduction from the agnostic to the realizable case, similar to the weighted-majority-based reductions in adversarial online classification.
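
As an illustration of the general pattern behind such reductions (this is not the paper's actual Squint-based procedure; the expert predictions, learning rate, and function name here are purely illustrative), a plain exponential-weights aggregation over a finite pool of experts looks like this:

```python
import numpy as np

def exponential_weights(expert_preds, labels, eta=0.5):
    """Aggregate binary expert predictions by a weighted-majority vote.

    expert_preds: (T, K) array with {0,1} predictions from K experts.
    labels:       length-T array with the {0,1} true labels.
    Returns the learner's predictions and its total number of mistakes.
    """
    T, K = expert_preds.shape
    log_w = np.zeros(K)                     # log-weights; start uniform
    predictions = np.zeros(T, dtype=int)
    mistakes = 0
    for t in range(T):
        w = np.exp(log_w - log_w.max())     # normalize for numerical stability
        vote_for_one = w @ expert_preds[t] / w.sum()
        predictions[t] = int(vote_for_one >= 0.5)
        mistakes += int(predictions[t] != labels[t])
        # multiplicatively penalize every expert that erred on this round
        log_w -= eta * (expert_preds[t] != labels[t])
    return predictions, mistakes
```

In the agnostic-to-realizable reductions mentioned above, the expert pool is built from realizable-case learners; Squint replaces the fixed learning rate with a more refined, second-order update.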

Weaknesses

The paper lacks a bit of exposition to guide the reader and a more thorough discussion of related work.

Questions

The paper only seems to discuss qualitative statements (learning is possible iff there is no infinite tree, ...). As far as I can see, no quantitative rates (like those in uniform/universal learning) are provided. Are such rates, even very loose ones, possible (perhaps as future work)?

Limitations

See questions.

Author Response

We appreciate your helpful comments. We will provide more discussion of related work in the final version and help the reader follow our results more easily. In our work, we focus on the consistency problem. The problem of quantitative rates is an interesting future direction, though it will certainly require significant work even to formulate the question appropriately: for instance, it could be a question of, for a given rate function $R(n)$, which function classes and processes admit learning with the number of mistakes growing at rate $R(n)$. It is definitely an interesting direction, though perhaps challenging, and worthy of separate future work.
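
In symbols, one possible (illustrative, not settled) formulation would be: for a given rate function $R(n)$, characterize the pairs $(\mathcal{H}, \mathbb{X})$ for which some online learner guarantees

$$\mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{\hat{Y}_t \neq Y_t\}\Big] = O(R(n)) \quad \text{for every } \mathcal{H}\text{-realizable labeling of } \mathbb{X},$$

where $\hat{Y}_t$ and $Y_t$ denote the prediction and the true label at time $t$ (notation introduced here only for illustration).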

Comment

This is a strong and important paper. I think it should be accepted and hence I increase my score.

Author Response

We want to thank all reviewers for their helpful suggestions and comments. We will polish our paper to make it easier to read and follow.

Here we clarify our results again:

First, we would like to clarify the motivation of our work. This line of work on optimistically universal learning is trying to capture the minimal assumption needed for learnability. We investigate the universal consistency problem beyond i.i.d. data-generating processes and for general concept classes.

To emphasize our novelty and contribution: we provide a full characterization of which data processes admit universal online learning under a concept class $\mathcal{H}$ and which concept classes $\mathcal{H}$ have an optimistically universal learning rule. The formal statements are in Theorems 9, 10, 11, and 12, and, for the agnostic case, Theorems 24 and 25. These results are all new, except the sufficiency of Theorem 10, which can be derived from Bousquet et al. (2021). The sufficiency of Theorem 9 is entirely novel, though we use some techniques from Bousquet et al. (2021) and Alon et al. (2021). The necessity of Theorem 10 is also (subtly) different from the necessity part of Theorem 3.1 in Bousquet et al. (2021), since our data process must be fixed in advance (rather than adaptively constructed by an adversary) and we need to show not only that the number of mistakes is infinite but that it does not grow sublinearly.

Final Decision

This work studies online learning under general stochastic processes (instead of the more common i.i.d. or adversarial settings) with binary labels, continuing the line of work by Hanneke [2021]. While previous papers focused on the case of all measurable functions, the current work gives a characterization for arbitrary hypothesis spaces. At a technical level the problem is a natural generalization of well-studied settings, and the results constitute significant progress. At the same time, the paper is rather theoretical even by the standards of ML theory. In addition, the presentation is terse and hard to read even for experts on online learning. Still, I think that the strengths of the results compensate for these limitations, and I recommend acceptance.