Stabilizing Linear Passive-Aggressive Online Learning with Weighted Reservoir Sampling
We combine weighted reservoir sampling with passive-aggressive online algorithms to dramatically mitigate test accuracy fluctuations.
Abstract
Reviews and Discussion
The authors are interested in sequential learning algorithms, within an IID data setup. That is, the ultimate objective is expected loss for a single algorithm output, not the regret incurred by a sequence of candidates. More specifically, they are interested in linear binary classification, and their overarching goal is to design an algorithm which is less sensitive to idiosyncratic data points, especially those near the end of the learning process.
Broadly speaking, their proposal is a procedure to construct a weighted sum of "good candidates" for the final algorithm output, ideally being far less sensitive to outliers that can lead traditional online learning algorithms astray. Their procedure "wraps" around a base update (here, PAC or FSOL), and selects candidates for inclusion in a "reservoir" in a stochastic fashion, and the core mechanism underlying their weighting is to use the number of "passive steps" as an indicator of quality. Passive steps are those steps, given a certain candidate, where the 0-1 loss is zero (i.e., correct classification) and no updates are made.
The underlying idea itself is heuristic, but the authors provide theoretical analysis in the form of risk bounds for candidates selected based on the number of cumulative passive steps. Essentially, when the minimum achievable risk is small, the passive step-based choice is not likely to be much worse than the best choice among the stocked candidates. They also conduct a series of empirical tests, which suggest that their proposed approach can be applied in a straightforward way to achieve better accuracy and stability when compared with the base procedure their algorithm wraps around, on a variety of datasets.
Strengths
The authors present a procedure which is intuitive, can be given some theoretical grounding, and appears to work well in practice. The paper is well-structured, and the writing (aside from some technical points to be mentioned later) is very clear. While the basic idea of taking a weighted sum of good candidates to "smooth out" the performance of sequential learning algorithms is a well-studied tactic, the authors consider a rather unique approach applying weighted reservoir sampling to passive-aggressive base algorithms.
Weaknesses
While I highlighted writing clarity as one of the paper's strengths, I found the methodology presented here rather difficult to understand. The WRS approach is positioned as a central part of the authors' strategy, but what part of Algorithm 1 is specific to WRS? If we remove the uniformly sampled random variable and just determine inclusion in the $K$-sized set of candidates by the number of passive steps, would things really change all that much for the worse?
In addition, the use of formal notation in the paper is in my opinion quite sloppy in parts. The technical material feels rushed, and makes it difficult for readers/reviewers to effectively parse the procedure being proposed. I will provide some concrete examples in the next field.
Finally, considering the fact that there is a massive literature on taking averages of candidates from iterative learning procedures, I think many readers will find it troubling that the empirical investigations are completely inward-facing, i.e., they simply evaluate the base algorithms and the proposed wrapper under different settings, and do not treat other alternative averaging strategies which are far more generally applicable (e.g., averaging most recent candidates, downweighting over time, using so-called "anytime" sub-routines, and so forth).
Questions
I do not have any critical questions for the authors. For the most part, I get their idea and I understand the investigation that they have carried out. Below is a handful of small points I tripped up on while reading the paper.
- Abstract, "Our reservoir thus contains $K$ previous intermediate...": $K$ here really has no meaning at all. Why not just say "a subset of previous intermediate weight vectors..." or something?
- First sentence of 2.1: If my reading is correct, the first time "linear binary classification" is mentioned is here at the start of 2.1. I think most readers, having read up to this point (including the abstract!), will be shocked that the scenario considered by the authors is so narrow. This should be established earlier, and more clearly.
- The symbol $T$ is used for iteration numbers as well as for transposing the weight vector in the main equation presented in 2.1; this should be avoided if possible.
- Reading through 2.3, I found the WRS exposition to be quite mysterious. What in particular about the problem being studied here makes WRS a natural fit? What parts of Algorithm 1 use the original WRS as-is, and which parts are original? It's not clear to me.
- 3.2, "A given online learning algorithm outputs model $w_t$ after seeing $z_t$, with loss $\ell(w_t, z_t)$": I found this sentence clunky. Shouldn't the output be $w_{t+1}$, considering we have seen the data and loss for step $t$ already?
- 3.2, third paragraph: thus far the "model" has been characterized by $(w, b)$ or simply $w$, but suddenly $h$ appears here. Plus, the $x$ here (not bold) is inconsistent with previous data notation.
- 3.2, fourth paragraph: how are the "updated models" defined? What is the set into which they are collected? This is all totally unclear.
- Algorithm 1, line 15: this critical definition of the key $k_i$ includes $w_i$, an undefined quantity. The only place I can find it is earlier in 2.3, regarding "weights" $w_i$ used in WRS. Are these supposed to be pre-fixed?
- Algorithm 1, line 16: here $T$ is used as a "threshold" of sorts, but the notation clashes with $T$ used for the number of iterations in the for-loop.
- Algorithm 1, line 19: to update "accordingly" means what? Are the key and passive-step arrays supposed to be paired up with the elements of the reservoir, so that when an element is removed from the reservoir, the corresponding elements of the other two arrays are removed as well? This is unclear. I can infer it from line 29 with some confidence, but readers shouldn't have to infer such things.
- Algorithm 1, lines 27 and 29: the set notation here is bad; within Algorithm 1, the symbol is said to denote a set of promising solutions, but here it magically transforms into a set of the indices of promising solutions. This is even more problematic because the unsubscripted symbol is all we have within Algorithm 1, so the subscripted version is effectively undefined in this context. Note it also completely clashes with the notation used in line 32.
- Algorithm 1: when WS is "Standard", then the weight $s_i$ is updated, but under AS as "Simple Average," $s_i$ plays no subsequent role; is this correct? If so, considering the fact that "Simple Average" performs very similarly to "Weighted Average," one wonders if the passive-step count based approach is actually meaningful at all. Note also that the critical check in Algorithm 1 is unclear due to the undefined $w_i$.
Limitations
Limitations have been discussed.
We thank the reviewer for encouraging us to elaborate more on the differences between WRS-Augmented Training (WAT) vs. simply taking the $K$ candidate solutions with the largest numbers of passive steps (denoted as ``top-$K$"). The primary operational difference between WAT and top-$K$ lies in the probabilistic sampling steps in Lines 10-15 of Algorithm 1. We hypothesize that WAT's probabilistic approach imbues an element of exploration that may yield dividends in some cases where the number of passive steps might not be a perfect indicator of performance, compared to deterministic top-$K$.
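For concreteness, the two selection rules can be sketched as follows. This is a minimal, illustrative sketch, not Algorithm 1 verbatim: the function names are ours, and we assume the standard Efraimidis-Spirakis weighted-reservoir key u^(1/s) for the probabilistic rule.

```python
import random

def wat_key(s: int) -> float:
    # Probabilistic WAT key (illustrative): u^(1/s) with u ~ Uniform(0, 1).
    # Larger s yields stochastically larger keys, but any candidate with
    # s >= 1 retains some chance of entering the reservoir (exploration).
    u = random.random()
    return u ** (1.0 / s)

def topk_key(s: int) -> float:
    # Deterministic alternative: the key *is* the passive-step count, so
    # the reservoir is exactly the top-K candidates by survival time.
    return float(s)
```

In both cases the reservoir keeps the $K$ largest keys; only the key definition changes, which is why the two wrappers have essentially identical computational cost.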
Emphasizing our discussion in Section 4.4 and Tables 2-4, holding the reservoir/set size $K$ fixed, we find that PAC-WRS (simple average + standard weights) successfully stabilized test accuracy performance on 13 out of 16 datasets, compared to PAC + top-$K$'s (simple average) 10 out of 16 datasets, as measured via Relative Oracle Performance and averaged across 5 trials. We focus on simple averaging of reservoir solutions and standard weights (i.e., the number of passive steps as opposed to the exponentiated number of passive steps) because, overall, these settings appear to be the most effective and robust. We also found that FSOL-WRS successfully stabilized test accuracy performance on all 16 of 16 datasets, compared to 15 datasets for FSOL + top-$K$.
Figure 3 in our 1-page rebuttal PDF shows the test accuracy curves over time for WAT and top-$K$ (both with the same $K$) on three datasets --- Avazu (App), Avazu (Site), and Criteo, with PAC as a base model. The shaded regions represent the minimum and maximum test accuracy values at each timestep, taken over each algorithm's 5x trials. The thin lines inside the shaded regions represent the 5x individual trials of each algorithm. The solid lines represent the means taken across the 5x trials. From this Figure 3, we see that the performances of WAT and top-$K$ are not only statistically different, but also oftentimes do not even overlap across the extremes of multiple trials, with WAT being the higher performing method.
Furthermore, synthesizing Table 4 in Section 4.4 and Tables 5-6 in Appendix C, using Wilcoxon Signed-Rank Tests on ROP measured with respect to base PAC/FSOL, we find that top-$K$ cannot yield nearly as statistically-significant increases in test accuracy stability as WAT, for both PAC and FSOL. At significance level $\alpha$, while both WAT and top-$K$ can produce statistically-significant increases in final test accuracy compared to base FSOL, top-$K$'s $p$-values are more than an order of magnitude larger than their WAT counterparts. Finally, neither WAT nor top-$K$ produces statistically-significant increases in final test accuracy on PAC.
To further probe the differences between WAT and top-$K$, for this rebuttal, we also computed Wilcoxon Signed-Rank Tests on ROP, treating top-$K$ as the control method and viewing WRS as a probabilistic treatment/modification of the deterministic top-$K$ control method (using the same $K$). Under this framework, we found that PAC-WRS outperformed PAC + top-$K$ on 12 out of 16 datasets, averaged across 5x trials, with a statistically-significant $p$-value at significance level $\alpha$. In other words, using PAC as the base model, injecting this probabilistic treatment to form WAT was statistically significant at improving test accuracy stability, compared to deterministic top-$K$. On FSOL, under this second hypothesis testing framework, FSOL-WRS and FSOL + top-$K$ achieved a statistical tie. Note (per the one-page PDF), WAT always improved stability where other methods did not.
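For reference, this paired test is straightforward to reproduce with SciPy; the arrays below are placeholders for illustration, not our measured ROP values.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired ROP values, one entry per dataset (16 total);
# these are NOT our measured numbers, just an illustration of the test.
rng = np.random.default_rng(0)
rop_wat  = rng.normal(-0.02, 0.01, size=16)  # treatment: WAT / WRS
rop_topk = rng.normal( 0.00, 0.01, size=16)  # control: deterministic top-K

stat, p = wilcoxon(rop_wat, rop_topk)        # paired, two-sided by default
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
```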
Nonetheless, aggregating all of the aforementioned results and analyses, it is clear that the probabilistic WRS-augmented training method is superior to the deterministic top-$K$ method, with minimal additional computational cost compared to the naive base algorithm.
I thank the authors for their detailed response. I think with proper revisions, the essential elements and insights of the paper will be more readily accessible, and I am okay with raising my score.
We greatly appreciate the raised score and all of your detailed feedback towards making our paper clearer, more accessible, and better at highlighting our key contributions and novelty.
In particular, your recommendation to compare WAT against other traditional, well-established averaging mechanisms like the moving average and exponential average was especially pivotal. It helped us clearly demonstrate WAT's decisive contributions in both computational efficiency and performance/stability.
Please let us know if any other questions and/or feedback arise. Thank you so much for your time, and we hope you have an amazing weekend!
In addition, the use of formal notation in the paper is in my opinion quite sloppy in parts. The technical material feels rushed, and makes it difficult for readers/reviewers to effectively parse the procedure being proposed. I will provide some concrete examples in the next field.
Thank you for taking the time to provide us with so many helpful revisions. We will address each of these one-by-one below with proposed line edits:
- Abstract: We will replace K with "a subset of intermediate weight vectors" as recommended.
- We will update the last sentence of our abstract to "We demonstrate our WRS approach for binary linear classification with the Passive-Aggressive..."
- We will use the correct symbol '\top' for transpose.
- We will better motivate that we aim to distill an ensemble of online models into a single final model and explain how sampling models with high survival times is a good proxy for this. We will then discuss that WRS is a linear-time, low-memory method to efficiently accomplish this.
- Agreed, we will fix $w_t$ to $w_{t+1}$.
- Thank you for catching our oversight; we will change $h$ to $w$ and bold the $x$.
- The symbol in question denoted the reservoir in the theory, but to be consistent with our Algorithm block, we will change it, clarify which symbol denotes the reservoir, and state that we "collect a model into the reservoir only when it is first updated".
- Algo 1, line 15: Thank you for catching our typo. The correct symbol should be the passive-step count $s_i$, rather than $w_i$.
- Algo 1, line 16: We agree that using $T$ creates a notational clash. We will instead use $\tau$ for the threshold.
- Algorithm 1, line 19: Thank you for this catch. We will clarify when initializing our data structures (all of which are size-$K$ arrays) that the elements of the reservoir, key, and passive-step arrays are paired with each other, so that when we remove an element from the reservoir, the corresponding elements of the other two arrays are removed as well.
- Algorithm 1, lines 27 and 29: Agreed, we will fix the notational overloading by clarifying that the reservoir contains the candidate solution vectors, one per slot, while a second size-$K$ array contains the corresponding scalar keys and a third contains the corresponding scalar passive-step counts. We would then revise Lines 16-19 of Algorithm 1 to identify the minimum key in the reservoir together with its index and, if the incoming key exceeds that minimum (assuming the reservoir is full), replace the stored vector, key, and passive-step count at that index with those of the current candidate solution. Line 29 will then be restated in terms of indices into the reservoir.
Algorithm 1: when WS is "Standard", then the weight $s_i$ is updated, but under AS as "Simple Average," $s_i$ plays no subsequent role; is this correct?
Yes, this is correct because we are just taking a simple average of the candidate solutions in the reservoir (i.e., equal weightings of $1/K$ on each member of the reservoir). However, we clarify that entry into the reservoir, for both AS as "Simple Average" and "Weighted Average," is still based on the candidate solutions' (potentially exponentiated) passive steps. The difference between "Simple Average" and "Weighted Average" is how we form our ensemble solution from the vectors in our reservoir --- membership in the reservoir is still based on the (potentially exponentiated) number of passive steps. For discussion on the meaningfulness of using this passive-step count based approach vs. simply taking the most recent $K$ vectors, for example, please see the "Additional Wrappers" discussion in the Author Rebuttal. Nonetheless, with the notational correction on Line 15, the critical check should now be fully defined.
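As a minimal illustration of the distinction (names ours; this is not Algorithm 1 verbatim):

```python
import numpy as np

def form_ensemble(reservoir, passive_steps, scheme="simple"):
    # reservoir: list of K candidate weight vectors; passive_steps: their
    # (possibly exponentiated) survival weights. Entry into the reservoir
    # already depended on these weights; `scheme` only controls averaging.
    R = np.stack(reservoir)                    # shape (K, d)
    if scheme == "simple":
        return R.mean(axis=0)                  # equal 1/K weights
    w = np.asarray(passive_steps, dtype=float)
    return (w[:, None] * R).sum(axis=0) / w.sum()
```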
Finally, considering the fact that there is a massive literature on taking averages of candidates from iterative learning procedures, I think many readers will find it troubling that the empirical investigations are completely inward-facing, i.e., they simply evaluate the base algorithms and the proposed wrapper under different settings, and do not treat other alternative averaging strategies which are far more generally applicable (e.g., averaging most recent candidates, downweighting over time, using so-called "anytime" sub-routines, and so forth).
We agree that this is a very important point. We will also cite prior work on online-to-batch averaging (references in our response to v2Wj). Please see ``Additional Wrappers" in our Author Rebuttal for further base-method and wrapper alternatives, showing that our method holds up against a wider spectrum of alternatives.
The passive-aggressive algorithm is a seminal method in online learning. However, it may be unstable when outliers arise. This paper uses weighted reservoir sampling to stabilize linear passive-aggressive online learning. The key idea is that the subsequent number of passive steps can reflect the generalization error. Weighted reservoir sampling is then used to sample the models with low generalization error. The final model is the ensemble of the sampled models.
Strengths
This paper is clear and easy to follow. The proposed method makes good use of the characteristics of the passive-aggressive algorithm. The experiments show the superiority of the proposed method.
Weaknesses
The theoretical analysis assumes i.i.d. condition, which is not consistent with the motivation that individual outliers exist and do harm to PA classifier.
Questions
I wonder whether the online gradient descent with momentum can perform well, because it can also be seen as a special ensemble method. I suggest authors add this method for comparison.
Limitations
None
The theoretical analysis assumes i.i.d. condition, which is not consistent with the motivation that individual outliers exist and do harm to PA classifier.
Our method falls into a category of techniques that convert online algorithms into a low-risk model which generalizes well to unseen data. In order to bound the risk in such a manner, there has to be an assumption on the distribution, which is usually an arbitrary i.i.d. data distribution. We do wish to highlight, though, that even under the i.i.d. model, the base passive-aggressive algorithms are often unstable because their update steps can be too aggressive or change the angle of the separating hyperplane too much, so to speak. Also, because the distribution is arbitrary, it can be a mixture distribution with a fraction of "outlier" points, and our theory still holds. The reviewer does bring up an interesting point, which is investigating underlying data distributions that explicitly contain outliers -- to our knowledge this hasn't been studied in the context of online-to-batch conversion and would be interesting future work.
Please also see the response to reviewer v2Wj and the one-page PDF; we have added a new theory that strengthens the manuscript.
I wonder whether the online gradient descent with momentum can perform well, because it can also be seen as a special ensemble method. I suggest authors add this method for comparison.
Thank you for your suggestion. Please see our ``Additional Base Models" response in the Author Rebuttal, where we investigate Stochastic Gradient Descent with Momentum alongside two similar methods Truncated Gradient Descent and AdaGrad. We also investigate the effectiveness of adapting WRS-Augmented Training to these non-passive aggressive base methods.
Thanks for your reply. It addresses my questions. I have decided to raise my score to 6.
We are very glad that we were able to address your questions and are grateful for the raised score and vote of confidence. Please let us know if any other questions or feedback arise. Thank you so much!
This paper resolves the outlier sensitivity problem in online learning. The proposed approach (WAT) can stabilize passive-aggressive online learning algorithms and does not introduce common overheads like hold-out evaluation sets or additional passes.
Strengths
- The proposed WAT shows a significant reduction in test accuracy fluctuations.
- WAT maintains, and in some cases enhances, the sparsity of solutions in FSOL, which is crucial for high-dimensional data.
- The implementation of WAT does not require additional passes over the training data or hold-out evaluation sets.
Weaknesses
- While FSOL-WRS shows statistically significant improvements in final test accuracy, PAC-WRS does not consistently show the same level of improvement.
- Although minimal, there is still an additional memory and computational overhead associated with managing the reservoir of candidate solutions.
Questions
- Can the authors provide more detailed explanations on how intermediate weight vectors are selected and maintained in the reservoir, and discuss the impact of different reservoir sizes on the algorithm's performance?
- I think more compared baselines are needed, such as ADAGRAD or Truncated Gradient.
Limitations
This paper comprehensively discusses its limitations regarding the requirements on the base algorithms and the assumptions on data distributions.
We appreciate the reviewer encouraging us to clarify the candidate solution selection procedure for our WRS-Augmented Training (WAT), revisiting our discussion in Section 3 and, in particular, Algorithm 1. The main idea is that WAT is a wrapper that builds around an existing passive-aggressive algorithm like PAC or FSOL.
Notationally, let $w_t$ be the intermediate candidate solution retained by the base model (e.g., PAC) at timestep $t$. Given PAC's passive-aggressive nature, PAC only updates the intermediate candidate solution at timestep $t$ if it makes a mistake (i.e., incurs nonzero hinge-loss) on the current observation in our data stream.
Suppose PAC's intermediate candidate solution made a mistake at time $t$ and was updated to $w_{t+1}$. Then, imagine that this intermediate candidate solution did not make any mistakes at timesteps $t+1, \ldots, t+s-1$, and only made its next mistake at time $t+s$ (where $s$ is short for passive steps). By the definition of passive-aggressive, we know that $w_{t+1} = w_{t+2} = \cdots = w_{t+s}$ (because no update was made). However, at time $t+s$, the base PAC algorithm makes an update to its intermediate candidate solution using the PAC update rule: $w_{t+s+1} = w_{t+s} + \tau_{t+s} y_{t+s} x_{t+s}$, where $\tau_{t+s}$ is the PA step size. At this timestep $t+s$, we say that our outgoing intermediate candidate solution $w_{t+s}$ had survived $s$ passive timesteps without making a mistake and requiring an aggressive update. Then, at timestep $t+s$, we will insert $w_{t+s}$ into our reservoir with a probability that is a function of $s$ --- the larger $s$ is, the more likely $w_{t+s}$ is to enter our reservoir (see Lines 10-19 in Algorithm 1 for specific schemes). If $w_{t+s}$ enters our reservoir, to maintain our reservoir of size $K$ (with $K$ predetermined), we will need to remove an older member of the reservoir (please see Lines 16-18 of Algorithm 1).
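To summarize the walkthrough, here is a compact, illustrative sketch. It is not Algorithm 1 verbatim: the bias term and weighting-scheme variants are omitted, the capacity value is illustrative, and we assume the standard Efraimidis-Spirakis key u^(1/s) for the probabilistic insertion.

```python
import random

K = 64  # reservoir capacity (illustrative value)
reservoir, keys, survivals = [], [], []       # paired size-K structures

def offer_to_reservoir(w_out, s):
    """Offer an outgoing candidate (numpy array) that survived s passive steps."""
    if s < 1:
        return                                    # survived nothing; skip
    k = random.random() ** (1.0 / s)              # key grows with s in expectation
    if len(reservoir) < K:
        reservoir.append(w_out.copy()); keys.append(k); survivals.append(s)
    else:
        j = min(range(K), key=keys.__getitem__)   # weakest member = smallest key
        if k > keys[j]:                           # replace it, keeping arrays paired
            reservoir[j], keys[j], survivals[j] = w_out.copy(), k, s

def pac_wrs_epoch(stream, w):
    s = 0                                         # passive steps since last update
    for x, y in stream:                           # y in {-1, +1}; x, w numpy arrays
        loss = max(0.0, 1.0 - y * float(w @ x))   # hinge loss
        if loss == 0.0:
            s += 1                                # passive step: no update
        else:
            offer_to_reservoir(w, s)              # outgoing candidate survived s steps
            w = w + (loss / float(x @ x)) * y * x # classic PA update (Crammer et al.)
            s = 0
    return w
```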
While FSOL-WRS shows statistically significant improvements in final test accuracy, PAC-WRS does not consistently show the same level of improvement.
This question has helped us realize we did not fully convey the utility of WRS. The manuscript clearly demonstrates same-or-improved accuracy (ROP, introduced in Section 4) but does not emphasize as strongly the stability of the test accuracy over time. Having equal accuracy and more stability would be an improvement, and we deliver equal or better accuracy (depending on the base method) plus stability.
We now also include a new metric, which we call Worst-Case Performance Difference (WPD). Intuitively, we would also like to know the worst possible decrease in test accuracy performance between consecutive observations. For example, a model that has a demonstrated risk of dropping 50% in test accuracy between consecutive online observations is not a safe model for deployment. We define WPD to be the minimum (i.e., most negative) percent change in test set accuracy between consecutive observations in the second half of the epoch (as early iterations have natural asperity, and 1/2 epoch is well into ``this should be usable now'' territory).
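A minimal sketch of the metric, assuming test accuracy is recorded after each online observation and expressing the change as a signed relative change (the function name is ours):

```python
import numpy as np

def wpd(acc):
    """Worst-Case Performance Difference: the most negative relative change
    in test accuracy between consecutive evaluations, restricted to the
    second half of the epoch (early iterations are naturally noisy)."""
    acc = np.asarray(acc, dtype=float)
    second_half = acc[len(acc) // 2:]
    rel_change = np.diff(second_half) / second_half[:-1]  # assumes acc > 0
    return float(rel_change.min())
```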
On WPD, we find that both PAC-WRS and FSOL-WRS uniformly outperform PAC and FSOL, respectively, on all 16 of 16 datasets, averaged across 5 trials, with statistically-significant $p$-values. In particular, on Avazu (App), PAC-WRS achieves a WPD of only -7.3e-05, compared to base PAC's -0.596. For another example, on News20, the WPD values for PAC-WRS and base PAC are -0.00137 and -0.335, respectively. Please find the WPD values for PAC/FSOL and PAC/FSOL-WRS in the table below:
| Dataset | PAC-WRS | PAC | FSOL-WRS | FSOL |
|---|---|---|---|---|
| Avazu (App) | -7.3e-05 | -0.595765 | -5.3e-05 | -0.576633 |
| Avazu (Site) | -4.6e-05 | -0.012645 | -5.2e-05 | -0.127786 |
| Criteo | -0.000139 | -0.007106 | -0.000356 | -0.088427 |
| Dexter | -0.005556 | -0.057778 | -0.011111 | -0.193333 |
| Dorothea | -0.001739 | -0.005797 | -0.003478 | -0.868986 |
| KDD2010 (Algebra) | -1.1e-05 | -0.02043 | -8.6e-05 | -0.108299 |
| MNIST8 (4+9) | -0.000174 | -0.150834 | -0.000457 | -0.171815 |
| News20 | -0.00137 | -0.335327 | -0.002572 | -0.337632 |
| Newsgroups (Binary, CS) | -0.002111 | -0.159265 | -0.002912 | -0.27692 |
| PCMAC | -0.00446 | -0.064151 | -0.004803 | -0.185592 |
| RCV1 | -0.000152 | -0.160971 | -0.000159 | -0.097962 |
| Real-Sim | -0.00048 | -0.173778 | -0.000554 | -0.133466 |
| SST-2 | -0.000832 | -0.120622 | -0.000881 | -0.061439 |
| URL | -6.2e-05 | -0.071025 | -0.00034 | -0.122971 |
| W8A | -0.000371 | -0.090931 | 0 | -0.777236 |
| Webspam | -0.000491 | -0.112117 | -0.00075 | -0.273266 |
The main takeaway from our WPD analyses is that PAC-WRS and FSOL-WRS are overwhelmingly and statistically-significantly effective at preventing worst-case decreases in test set accuracy compared to their base algorithms. Combined with the already existing results that WAT has same-or-improved accuracy, WAT essentially comes at no cost beyond a minor amount of additional computation.
Can the authors provide more detailed explanations on how intermediate weight vectors are selected and maintained in the reservoir?
See additional comment due to space limitation.
Although minimal, there is still an additional memory and computational overhead associated with managing the reservoir of candidate solutions.
Yes, we agree that there is still a minimal cost towards managing the candidate solutions in the reservoir. However, we invite the reviewer to view our ``Additional Wrappers" response in the Author Rebuttal, where we show that compared to other averaging schemes like moving average and exponential average, our WRS-augmented training mechanism wins decisively on both efficacy and computational cost.
[Can the authors] discuss the impact of different reservoir sizes on the algorithm's performance?
On the role of reservoir size $K$, echoing our presentations and figures in Sections 4.1 - 4.3, in general, the bigger $K$ is, the better the performance. Larger values of $K$ (we tried a range up to 64) are associated with both higher mean final test accuracies and stronger mean Relative Oracle Performances, with better consistency across 5x trials. However, from Figure 3 in the main text and similar figures in Appendix D, we do see that there are diminishing returns to increasing $K$ past a certain point. Of course, there is also a slightly increased storage cost from needing to store more vectors in our reservoir, though this cost is not significant. In general, based on our results across the 16 datasets, we recommend a larger reservoir (e.g., $K = 64$) as a starting point.
I think more compared baselines are needed, such as ADAGRAD or Truncated Gradient.
Thank you for your suggestion. Please see our ``Additional Base Models" response in the Author Rebuttal, where we investigate AdaGrad and Truncated Gradient Descent alongside Stochastic Gradient Descent with Momentum. We also investigate the effectiveness of adapting WRS-Augmented Training to these non-passive-aggressive base methods.
I thank the authors for the detailed rebuttal and additional results. I still think this is a good paper and it should be accepted. The additional results give me more confidence to accept this paper.
The paper proposes a new algorithm for a binary online learning problem where the streaming data consists of i.i.d. samples. The authors propose a variant of the seminal Passive-Aggressive classifier algorithm. The PAC algorithm only updates its current model when a misclassification occurs. The main idea of the paper is to augment this algorithm with a weighted reservoir sampling of previous models, where the weight of each model is proportional to how long the model "lived" before an update occurred. At each time step t, the weighted ensemble solution from the reservoir is used for prediction.
The paper presents theoretical results for the risk of their algorithm, and conducts experiments to show the superiority of the method compared against 2 online algorithm baselines.
Strengths
The idea of using reservoir sampling for an online algorithm is novel, and I have not seen it before.
The experimental section seems thorough, and a lot of datasets are used for the comparison with the baselines.
Weaknesses
I believe that the theory is the weaker part of this paper. Also, the comparison with the previous baselines is not very well motivated.
I am confused by the theoretical setting of this work. In online learning, it is often assumed that the data is generated by an adversary and does not come from a single distribution. For example, the analysis of [1,5] (baselines used and described by this work) never uses an i.i.d. assumption, and they upper bound measures of error related to regret (this inconsistency also appears in lines 3-4 of the abstract). In this context, there is no distribution, so the sentence at line 64 is false: "The goal is that as t goes to infinity, w_t will generalize well to out-of-sample examples" (there is no distribution to generalize w.r.t., and the examples are not samples from a distribution). I think this is the case for the analysis of the online algorithms described in lines 65-74 other than [1,5] (I did not check).
To this end, the PAC (and FSOL) algorithm is trying to solve a different problem. The intuition that a solution is more promising if it is correct for a longer period of time only applies if the data is i.i.d., which is a different problem from the one of the paper introducing PAC. The same concern applies to the experimental section, where due to the random split (and thus shuffling) of the data, it can be considered as i.i.d., which is a different setting than the online learning setting.
The theoretical results (Thm 1 and Thm 2) are very hard to interpret, and it is unclear whether they provide any meaningful contribution.
If the data is i.i.d., a desirable property is that the gap between R_D(w_T) and R_D(w^*) goes to zero as T goes to infinity, where w_T is the model of the algorithm at time T (i.e., once I observe a large amount of data, my error converges to the error of w^*), most likely with rate 1/sqrt(T). Alternatively, if I do not assume that the data is i.i.d. and I focus on an online setting, I would like to see an upper bound on the regret of order O(sqrt(T)). [This could possibly also imply the former with an online-to-batch conversion].
Additionally, both Thm 1 and Thm 2 are very hard to read. The definition of w^* is unclear (isn't the minimum risk achievable by any model w always less than R_{D}(w^*)?). Several other quantities in the statements are also left undefined.
Theorem 2 depends on assumptions on the models stored by the algorithm, and it is unclear whether they are ever satisfied. Theorem 2 has dependencies on those assumptions that make it unclear whether the result is ever meaningful (e.g., is the probability in Thm 2 ever nontrivial?). Also, what is epsilon in Thm 2 and lines 185-189 (is it arbitrarily chosen?).
I found line 168 confusing. It seems that R(h) is defined but never used in the main paper, only R_{D}(w). Also it is unclear to me what loss is used, since l(w,z) is previously defined as the hinge loss, but l(h,z) is the 0-1 loss (line 81).
Suggestions: I would formally define concepts outside of related work (e.g, sparsity in line 73-74).
Questions
See above.
Limitations
The paper discusses limitations in the conclusion.
Also, the comparison with the previous baselines is not very well motivated.
The goal of our proposed method, WRS-Augmented Training (WAT), is to stabilize the test accuracy over time of a passive-aggressive base model, like PAC or FSOL, which originally can have massive fluctuations in test accuracy over time (see Figure 1 in the main text). Applying WAT to PAC and FSOL as a lightweight wrapper yields two new models, PAC-WRS and FSOL-WRS. Our WAT is considered successful if the resultant PAC-WRS and FSOL-WRS output test accuracies over time that are much stabler than those of their base methods, PAC and FSOL, respectively. While test accuracy and (preservation of) sparsity are standard metrics in online binary classification, to directly quantify how much PAC/FSOL-WRS stabilize test accuracy relative to their base algorithms PAC/FSOL, we introduced a new metric, Relative Oracle Performance (ROP). Specifically, ROP measures the difference between the cumulative maximum test accuracy of the base method PAC/FSOL versus that of PAC/FSOL-WRS, averaged across timepoints t. The motivation with ROP is that if a method has very stable test accuracy over time (with minimal fluctuations), then its test accuracy curve should look very similar to its cumulative maximum test accuracy curve, implying an ROP very close to 0, if not negative. It follows that we use PAC and FSOL as our main baselines, with the discussed metrics, for comparison in the main text. If accepted, we will add the above clarification to our camera-ready manuscript.
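For concreteness, one plausible formalization of this description in a short sketch (the precise definition is in Section 4 of the manuscript; the names and sign convention below are illustrative):

```python
import numpy as np

def rop(method_acc, base_acc):
    # One reading of the description above: compare the method's accuracy
    # curve to the running best ("oracle") accuracy of the base method,
    # averaged over timepoints t. A stable, accurate method tracks the
    # oracle curve closely, giving ROP near 0 or even negative.
    oracle = np.maximum.accumulate(np.asarray(base_acc, dtype=float))
    return float(np.mean(oracle - np.asarray(method_acc, dtype=float)))
```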
I believe that the theory is the weaker part of this paper.
We appreciate the reviewer pointing out the difference in our underlying setting compared to the PA model. Our method does fall in the category of online-to-batch conversion techniques like prior works [1-3]: We aim to take an online algorithm with known regret bounds on a fixed sequence of data and find a stable transformation of the model sequence to a low-risk final model in the iid setting. We believe this was not clearly conveyed and will more clearly differentiate our setting.
We also agree with the reviewer's request for bounds showing that the gap between R_D(w_T) and R_D(w^*) goes to 0 as T -> infinity.
We have developed a new risk bound for our method, which is given in the PDF (Theorem 1). As can be inferred from the bound, the deviation in risk of our final model from the optimal risk shrinks over time with high probability. This is guaranteed to hold if the reservoir size grows at any rate over time. This is supported experimentally since large reservoir sizes (e.g. 64) are especially effective on our large datasets.
We sketch the proof steps and sequence of bounds/inequalities that give the new theorem below. A more detailed and well-formatted version (OpenReview doesn't support full formatting) will be added to the camera-ready copy. First, using an indexing trick, we express the weighted reservoir model as a weighted average of the sequence of models, i.e. w_wrs = WAvg(w_1, ..., w_T), which has cumulative loss M_T.
(1) risk(w_wrs) <= WAvg(risk(w_t)) (Jensen's inequality)
(2) WAvg(risk(w_t)) <= WAvg(loss(w_t, z_t)) + error1 w.h.p. (adapting a Bernstein's maximal inequality for martingales, e.g. [2], [3]). error1 here refers to the log(T/delta)/(K_T s)-type terms.
(3) WAvg(loss(w_t, z_t)) <= (1/T) sum_t loss(w^*, z_t) + regret(T)/T + error2 for any (optimal) w^* (applying PA algo-specific regret rates). Here error2 is of order regret(T)/T, with many online algos achieving either regret(T) = O(sqrt(T)) or O(log T).
(4) (1/T) sum_t loss(w^*, z_t) <= risk(w^*) + error3 w.h.p. (Hoeffding bound). The error3 term is of order sqrt(log(1/delta)/T).
Combining the steps and using union bound gives our result.
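Schematically, the chain assembles as below (the notation here is illustrative: \bar{w} is the WRS ensemble, \alpha_t the normalized reservoir weights, r(T) the base algorithm's regret bound, and \epsilon_1, \epsilon_2, \epsilon_3 the three error terms above; we gloss over the indexing trick relating the weighted average to the cumulative loss):

```latex
\begin{align*}
R_D(\bar{w}) &\le \textstyle\sum_t \alpha_t\, R_D(w_t)
    && \text{(1) Jensen}\\
 &\le \textstyle\sum_t \alpha_t\, \ell(w_t, z_t) + \epsilon_1(T)
    && \text{(2) martingale concentration}\\
 &\le \tfrac{1}{T}\textstyle\sum_t \ell(w^*, z_t) + \tfrac{r(T)}{T}
       + \epsilon_1(T) + \epsilon_2(T)
    && \text{(3) base-algorithm regret}\\
 &\le R_D(w^*) + \tfrac{r(T)}{T} + \epsilon_1(T) + \epsilon_2(T) + \epsilon_3(T)
    && \text{(4) Hoeffding}
\end{align*}
```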
Specific theory qs:
Definition of w^*: We will clarify our notation. Our w^* depends on the observed sequence, so it's a random variable, while the comparator in the bound must be an instance-independent value for the bound to make sense. We will use visibly distinct notation for the two to clarify that.
Thm 2 assumptions: It is true that strong assumptions were needed to make any reasonable bounds for a finite-size reservoir, because we are analyzing an inverse problem. To clarify, if we know the underlying risks of the sequence, we can compute exact probabilities on the survival distributions, but we can't reason backwards from observed survivals to risks without placing either (1) strong assumptions or (2) Bayesian priors on what the underlying risks are.
Our new theorem (see PDF) instead applies martingale concentration inequalities to perform the inverse inference in a large-sample setting, and thus we believe it is a stronger, more intuitive result which addresses the reviewer's concerns.
R(h) and the loss used: Thank you for pointing that out; we will revise to use R_D(w), which refers to risk under the 0-1 loss only. When the hinge loss is used (which was only mentioned after Thm 2), we will use a visibly distinct symbol for the hinge risk.
[1] Cesa-Bianchi N, Conconi A, Gentile C. "On the generalization ability of on-line learning algorithms." (2004).
[2] Cesa-Bianchi N, Claudio G. "Improved risk tail bounds for on-line algorithms." (2008).
[3] Dekel O. "From online to batch learning with cutoff-averaging." (2008).
Suggestions: I would formally define concepts outside of related work (e.g, sparsity in line 73-74).
We agree with the reviewer, and will move the formal definitions of sparsity and test accuracy into the second paragraph of the Introduction for our camera-ready manuscript. We propose the following insertion: ``In this paper, for full clarity, we define test accuracy as the proportion of points classified correctly (with no margin considerations involved) in a hold-out test set, operationally fixed to a set percentage of the relevant full dataset. We define sparsity as the proportion of entries in a vector that are zeroes."
I thank the authors for their detailed response.
After carefully reading the rebuttal and the other reviews, I have decided to keep my score.
The motivation of my score stems from my concerns on the lack of meaningful contribution from the theory results, and for the motivation of the setting of the work. I do not believe that the authors' rebuttal addresses my concerns.
(1) There is a disconnect between the (adversarial) online learning setting of the Passive-Aggressive classifier (PAC) which the authors build on, and their proposed work.
The PAC classifier (and much of the other referenced theory work) is introduced for the more challenging setting of online learning, where the points are generated adversarially and there is no concept of a distribution (I understand there is the online-to-batch conversion to the i.i.d. setting). The paper is often not precise in quoting those results to motivate its work, and this should be further motivated and clarified in the paper:
Examples:
- In lines 4-5, it is claimed that Passive-Aggressive classifiers enjoy low theoretical risk bounds. However, the analysis of the original PAC paper targets regret.
- In lines 64-65, it is written that the goal is to generalize well to out-of-sample examples. This is not the actual goal of the works referenced in lines 65-74, as they address the online setting where "out-of-sample" is not defined.
(2) I still believe the theory results are weak.
(2.a) The original Thm 1 and Thm 2 are hard to interpret, and it is not clear whether they provide any meaningful bounds. The theorems depend on a lot of assumptions that are not even clear when they are satisfied, and even if they are satisfied, it is hard to interpret whether the results are meaningful. As a crucial example, theorem 2 still depends on epsilon, which is not defined in the statement and it is not clear how it is chosen for J_b (refer to my original review), since J_b needs to satisfy many assumptions. Again, Theorem 1 and Theorem 2 also do not seem to provide any meaningful result in the i.i.d. setting, see also (3) below.
[R_D(w^*) should be the risk of the optimal classifier with respect to the distribution D, i.e. (*) w^* = argmin_w R_D(w). This is the usual definition of w^* in ML work. The definition of w^* in line 178 only refers to "an optimal w^*" with respect to their reservoir. I will use definition (*) in the remainder of the discussion, which I think is also the same as used in the new THM of the rebuttal.]
(2.b) The authors study the i.i.d. setting, and they agree with my original review that in this case, it is important to show that the algorithm will eventually generalize, i.e. the risk converges to the risk of the optimal classifier with respect to the distribution D.
To this end, the authors provide a new theoretical result (new THM in the .pdf of the rebuttal). I believe this theorem suffers from the same problems of Thm 1 and Thm 2 in the fact that new THM does not provide any meaningful contribution, and it is hard to interpret.
- First of all, in order to apply the online-to-batch conversion, the most important step is to actually characterize the regret of the algorithm. The authors do not characterize the regret, hence the application of this reduction does not provide any meaningful result.
New THM depends on the regret r(T) and cumulative regret M_T. The regret r(T) of the algorithm is not characterized (and this characterization is actually the most important/meaningful part of the analysis). In order to show that R_D(w_{wrs}) converges to R_D(w^*), it is important to show that the regret r(T) is sublinear in T, so that r(T)/T goes to zero. The authors argue that the regret of many online algo is O(sqrt(T)), however, this does not imply that the particular algorithm introduced by the authors satisfies this regret (this has to be proven, and it is arguably the most challenging part of the proof).
- Aside from the characterization of the regret, new THM contains a term log(T/delta)/(K_T s) [here, I could not find what s is]. This term does not go to zero unless K_T goes to infinity, which means infinite memory. It seems to me that even if the authors show that r(T) is O(sqrt(T)), their specific analysis would still not show that R_D(w_{wrs}) converges to R_D(w^*).
Re: (2b).
We will clarify that the regret in our theorem is actually that of the underlying online algorithm (e.g. PAC/FSOL) used in our wrapper. In a similar manner to other online-to-batch ensembles, our method accepts an online algorithm as the base, with a known regret bound, and extracts an ensemble model with accompanying risk bound as a function of the original regret bound.
To make this more evident we can highlight an intermediate result first: a bound of the form R_D(w_wrs) <= R_D(w^*) + r(T)/T + concentration error terms, where r(T) is the actual regret bound of PA/FSOL/etc. (and give our original theorem after). This form is more aligned with prior works such as (Cesa-Bianchi et al. 2008), (Dekel 2009).
We will also provide the actual regret bounds from the underlying online algos' (e.g. PA/FSOL/AdaGrad) papers for the reader as a table in our updated paper. For example, FSOL has r(T) = O(sqrt(T)) disregarding constants (Thm 2 of their paper).
Re: the log(T/delta)/(K_T s) term: Here s is the minimal survival time of the models within the reservoir. We clarify that increasing K_T is a sufficient but not necessary condition for the error to shrink to 0. Another sufficient condition is s growing larger over time, and this is likely because the online learning solutions naturally improve over time. We agree that ideally we would enjoy seeing a term of T in the denominator, but note that since the other error terms have a 1/sqrt(T) rate, a growth rate of sqrt(T) for the product K_T * s is sufficient to have a convergence rate not worse than the other terms.
We thank the reviewer for their detailed feedback on our presentation and theory. We reviewed our exposition and agree that we were imprecise in our writing in those areas. We have fixed the writing as follows:
Re: (1). We clarify that the underlying online algorithms bound regret rather than risk. Lines 4-5 (abstract) now say "While such algorithms enjoy low theoretical regret, in real-world deployment ... "
Section 2.1: "The goal is that as t goes to infinity, w_t will enjoy low cumulative regret."
To resolve the disconnect, we now clarify that there are two categories of prior work: (i) online learning algorithms such as PAC/FSOL/AdaGrad which bound regret for any sequence of points, and (ii) online-to-batch conversion from a sequence of online solutions to a final model, with bounded risk of the final model under the i.i.d. assumption. We clarify that our work falls into the second category, as our method wraps any underlying online learning method from (i) to produce a stable ensemble. Like other works in (ii), our goal is to achieve a low-risk final model which generalizes to unseen examples.
We also point out that although the online learning methods in (i) give regret bounds, at least some of them do ultimately care about generalization. For example, the FSOL and AdaGrad papers evaluate the final model on a train-test split over real datasets. This indicates the generalization/iid use case is of interest as an end goal even in those original papers! In addition numerous applied works have directly used PA algorithms for out-of-sample prediction, e.g. [1], [2].
[1] Kiritchenko et al. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews
[2] Petrovic et al. RT to Win! Predicting Message Propagation in Twitter
We have expanded the above discussion with more references into a new subsection in our Related Works.
Furthermore, in the abstract, we now say: "We design a weighted reservoir sampling (WRS) approach to obtain a stable ensemble model from the sequence of solutions without requiring additional [...]"
Re: (2.a). We will restructure our theory section and clarify that our original Thms 1/2 are finite-sample bounds to assess the validity of our approach, i.e. to demonstrate that the approach of using high survival as a proxy measure is reasonable, rather than to provide learning/generalization bounds. For the latter (i.i.d. setting) we will rely on and emphasize the new theorem in the PDF.
As in our previous response we do acknowledge that assumptions are needed in Thm 2 because it's an inverse problem with little "evidence". However these assumptions are actually readily satisfied, and making them clearer below further strengthens our work -- thank you.
Epsilon can be any small number that induces a partition within the reservoir separating "good" from "bad" models -- e.g., any epsilon smaller than the spread of the risks works. For demonstration, suppose the underlying risks of the models in the reservoir range from 0.1 to 0.3, with 0.1 being the best risk. Then, for a given small epsilon, the "good" set J_g contains the models with risks in [0.1, 0.1 + epsilon], and J_b contains those with risks in (0.1 + epsilon, 0.3]. The only time this isn't satisfied for any epsilon is if one of the partitions always ends up empty, which only happens if all the models have the same risk, a highly unlikely scenario where it is obvious that an ensemble isn't needed.
We have added the above explanation to the paragraph introducing Theorem 2.
Re: R_D(w^*). As suggested, we will change the notation to be consistent so that w^* is always the optimal classifier w.r.t. distribution D. We will use a separate symbol instead for the lowest-risk model actually contained in the reservoir.
See next comment for reply to (2b).
For all reviewers: insufficient time exists for us to run all datasets under all settings. In our responses below, we have results for 14 out of 16 datasets. The largest two (Criteo and Avazu (Site)) could not be completed in the allotted time. KDD2010 is further a partial result, as not all jobs have finished on the full KDD2010 set. Because we are studying an online algorithm, we can share the partial-epoch results as they exist today. Camera-ready will have no issues getting all experiments done on time. The general trends are very clear from these 14 datasets, though, with WRS's advantage increasing as datasets get larger.
Additional Base Models: Stochastic Gradient Descent with Momentum (SGD+M), Truncated Gradient Descent (TGD), and AdaGrad.
We thank Reviewers We6T and eWmu for encouraging us to explore additional baseline methods' test accuracy performances over time, and their particular suggestions of Stochastic Gradient Descent with Momentum (SGD+M), Truncated Gradient Descent (TGD), and AdaGrad.
In general, these three additional base algorithms are still susceptible to experiencing concerning fluctuations in test accuracy. For example, in Figure 2 of our 1-page rebuttal PDF, we see that on MNIST8 (4+9), SGD+M and TGD both experience significant fluctuations in test accuracy over time. On this particular dataset, AdaGrad did not experience significant fluctuations.
Note that these 3 new baselines are not Passive-Aggressive algorithms (they always update the weight vector), and so WAT is performing well despite being pushed beyond its designed purpose. To test these 3 baselines, we defined a pseudo-passive step as one that makes no classification error, and took the last weight vector before a mistake was made. The rest of WAT operated as normal under this pseudo-passive step weighting.
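Concretely, this adaptation can be sketched as follows, reusing the reservoir-insertion helper from our sketch earlier in the thread (`base_update` is a placeholder for the SGD+M/TGD/AdaGrad step, not an actual function in our code):

```python
def wat_wrap_non_pa(stream, w, base_update):
    """WAT around an always-updating learner: a pseudo-passive step is one
    with no classification error; the candidate offered to the reservoir is
    the last weight vector before a mistake, weighted by its survival run."""
    s = 0
    for x, y in stream:
        if y * float(w @ x) > 0:          # correct sign: pseudo-passive step
            s += 1
        else:
            offer_to_reservoir(w, s)      # last w before the mistake
            s = 0
        w = base_update(w, x, y)          # non-PA learners update every step
    return w
```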
Critically, we find that for all three of SGD+M, TGD, and AdaGrad, our modified WAT effectively mitigates test accuracy instability when it exists, and does no harm when it does not. For example, referencing Figure 2 in our 1-page rebuttal PDF, modified WAT significantly improves test accuracy stability on SGD+M and TGD, even achieving performance higher than that of the corresponding oracles. On AdaGrad, where the test accuracy was already rather stable, adding modified WAT still yields a slight increase in stability. As such, our WAT method is still useful and adaptable to a wider class of base models, which can constitute fruitful future work.
This will be added to the manuscript, which shows that WAT has more general utility in stabilizing online algorithms even if they are not PA.
Additional Wrappers
We thank Reviewers We6T and NvJo for encouraging us to think more about the computational overhead of our WRS-augmented training (WAT) method, and to compare our WAT wrapper against alternative averaging strategies. Below, we compare WAT against averaging the most recent candidates (which we denote as moving average) and downweighting over time (which we denote as exponential average). Specifically, for both WAT and moving average, we use the same reservoir/window size K. For exponential average, at timestep t, we recursively form our ensemble vector as ens_t = (1 - beta) * ens_{t-1} + beta * w_t for a fixed beta in (0, 1), where w_t is the base algorithm's candidate solution at timestep t. This recursive update rule indeed exponentially downweights the contributions of previous candidate solutions.
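For clarity, minimal sketches of these two baselines (the K and beta defaults shown are illustrative, not the exact settings in our experiments):

```python
from collections import deque
import numpy as np

def moving_average(candidates, K=64):
    # Average the K most recent candidate solutions; every new candidate
    # must be inserted, unlike WAT's selective, stochastic insertions.
    window = deque(maxlen=K)
    for w in candidates:
        window.append(w)
        yield np.mean(window, axis=0)

def exponential_average(candidates, beta=0.1):
    # ens_t = (1 - beta) * ens_{t-1} + beta * w_t: most weight stays on
    # recent candidates, so instability of the base model leaks through.
    ens = None
    for w in candidates:
        ens = w.copy() if ens is None else (1.0 - beta) * ens + beta * w
        yield ens
```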
Overall, we first find that the exponential average scheme is consistently ineffective at mitigating the test accuracy instability of the base models PAC and FSOL. The stability of the exponential average scheme is usually not much better than that of the base model, which makes sense because it still puts the majority of the weighting on the most recent candidate solution. We invite the reviewer to observe the results of exponential average on Avazu (App), KDD2010 (Algebra), RCV1, and MNIST8 (4+9) in Figure 1 of our 1-page rebuttal PDF, drawing the reviewer's attention to the massive oscillations of the red lines.
Second, we find that the moving average also has very mixed effectiveness. On some datasets, like RCV1 and MNIST8 (4+9) in the aforementioned Figure 1 (indicated by the blue lines), it performs only slightly worse than WAT. However, on other datasets like Avazu (App) and KDD2010 (Algebra), it performs significantly worse compared to WAT in terms of stabilizing test accuracy. This makes sense because, with WAT, we are much more selective about the quality of the candidate solutions that we retain in our reservoir, compared to moving average, which necessarily must include poor-performing solutions as they appear.
Third, from Table 1 in our 1-page rebuttal PDF, we find that for the higher-dimensional datasets, the moving average method can be significantly computationally slower per iteration than WAT. For example, with PAC as the base model, the moving average method is, on average, several times slower per iteration than WAT on URL, and up to 10x slower per iteration than WAT on Avazu (App). Similar trends hold when using FSOL as the base. This makes sense because with WAT, we do not always add candidate solutions to our reservoir/set for averaging, while with the moving average method, we must always add new candidate solutions into our set. These insertion and deletion costs accrue over time. On smaller datasets the moving average is faster or slower depending on the dataset, but runtime is dominated by IO and all methods finish within minutes. This shows that in more real-world, large-scale settings where evaluation and checking are expensive, WAT is the fastest, most accurate, and most reliable method compared to all baselines.
Dear Reviewers,
Thank you for your reviews and time. The authors have provided a rebuttal with clarifications, additional theoretical analysis, and empirical results.
If you have not already, please take a careful look and follow up with additional questions and/or please update your review accordingly.
If you would like to discuss in private, please start a separate thread.
Thank you, Area Chair for Submission 7523
The paper seeks to address the issue of instability in passive-aggressive (PA) online learning algorithms, which are highly sensitive to individual data points, particularly outliers. These instabilities can lead to significant fluctuations in the model's accuracy, especially when an outlier is encountered late in the data stream. To mitigate this, the authors propose a novel approach called Weighted Reservoir Sampling (WRS), which augments the standard PA learning process by maintaining a reservoir of high-quality weight vectors that are less prone to overfitting due to outliers. This method is tested on two PA algorithms—Passive-Aggressive Classifier (PAC) and First-Order Sparse Online Learning (FSOL)—across several datasets, showing that WRS significantly stabilizes accuracy without the need for additional data passes or hold-out sets.
Strengths
Originality: The introduction of Weighted Reservoir Sampling (WRS) as a method to stabilize PA algorithms is a novel contribution. The idea of leveraging the number of passive updates as a proxy for the quality of solutions is innovative, especially in the context of online learning.
Quality: The paper is methodologically rigorous, providing a thorough theoretical analysis to support the proposed WRS method. The empirical validation across multiple datasets strengthens the claim that WRS enhances the stability of PA learning algorithms.
Clarity: The paper is well-written and clearly explains the problem, the proposed solution, and the results. The step-by-step presentation of the WRS method, along with detailed explanations of the experiments, makes the paper accessible to readers familiar with online learning. The authors also use lots of wrap tables, making best use of the space.
Significance: The significance of the work lies in its potential to improve the robustness of PA algorithms in real-world applications, where data streams often contain noisy or outlier data points. This method could be particularly valuable in fields like online advertising, real-time recommendation systems, and other high-dimensional streaming data applications.
Weaknesses
Limited Applicability: While the WRS method is shown to work well with PAC and FSOL, its applicability to other online learning algorithms, especially those that are not passive-aggressive, is not explored. This limits the generality of the method.
Assumption on Data Distribution: The method assumes that the data distribution remains relatively stable over time. However, in real-world scenarios where data distributions might shift (concept drift), the effectiveness of WRS may diminish. This limitation is acknowledged but not deeply explored in the paper.
Questions
Maintaining a reservoir of high-quality weight vectors adds computational overhead, particularly in terms of memory usage and processing time. While the paper claims this overhead is minimal, I would like to see if there are more detailed analysis or comparisons with baseline methods in terms of computational efficiency.
Limitations
na
Thank you for this additional review. We know it is a bit much to add your review and then see so much other content in reviews/rebuttals, but we are encouraged by the significant overlap in your review and the other positive reviews we have received.
Limited Applicability:
Other reviews have brought this concern up as well, and we have worked to remediate it. If you look at the all-author rebuttal (and our response to We6T and NvJo), we have added three additional base algorithms to stabilize. These algorithms are not technically passive-aggressive either, showing that our theory and method empirically generalize and that non-PA algorithms can be adapted for use with WAT. Thus, our method is shown to stabilize the PA, FSOL, Truncated Gradient Descent, SGD + Momentum, and AdaGrad based learning methods.
I would like to see if there are more detailed analysis or comparisons with baseline methods in terms of computational efficiency.
In addition to the added baselines, we also added other alternative "wrapper" algorithms. This includes the simple moving average and the exponentially weighted average baselines (see "Additional Wrappers" in the Author Rebuttal) and the top-K ablation (see our rebuttal + comment to Reviewer NvJo). In summary, these other wrapper alternatives can help sometimes, but WAT always equals or outperforms them in test accuracy, whereas the others each have a few datasets on which they degrade. Moreover, compared to the well-established moving average and exponentially-weighted average methods, as the datasets get larger, WAT is the fastest to run by up to 10x as it does extra work on a less frequent (stochastic) basis, and that extra work is computationally very cheap. Again, see the all-author rebuttal for a table with the speedup factors.
WAT only performs additional work when an error is made. This additional work entails a) the computation of a key via sampling a single uniform random variable and 3 floating point operations (extremely cheap, line 15 of Algorithm 1) and b) finding the smallest value/corresponding index in a size-K array (also very cheap, with K = 64 being the upper limit in our testing). The weight vector + bias are only added to the reservoir (with another vector + bias pair being replaced) if the stochastic reservoir-insertion check (also very cheap) is satisfied (lines 17-19 of Algorithm 1). To be precise, if the reservoir is updated, the size-K key and passive-step arrays also each get updated at one index (very cheap). That is all the extra work required during the training process. All operations are extremely cheap, and the more accurate an algorithm is, the cheaper it gets, as the reservoir insertion check becomes less frequent.
Assumption on Data Distribution
We thank the reviewer for this interesting point and will add more discussion in the paper. The use of the IID assumption in our theoretical analysis is a requirement to discuss the risk/generalization error, and is aligned with prior work on converting online algorithms to ensemble models. See also the all-author rebuttal for a new theorem that provides a conversion from the original base-algorithm specific regret into the risk bound for our WRS (proof sketch in reply to v2Wj due to character limits). While more empirical exploration of concept drift would be valuable, we believe it extends beyond the scope of the work and our time remaining in the rebuttal period.
We appreciate your review again, and we know there is a lot of other reviewer/rebuttal content to process at the last minute. We hope your positive review, which aligns well with our other positive and constructive reviews, is a good indication of the value of our work and that others will find interest/inspiration from our unique approach to a reservoir of weight vectors.
After reviewing the submission, reviewer comments, robust author rebuttal, and further discussion with reviewers, I recommend accepting this paper (poster). The author rebuttal goes a long way towards addressing the majority of reviewer criticisms, including expanding the use of WRS to additional base online learning algorithms and additional baseline stabilizing approaches (moving/exponential averaging), both of which further strengthen the message of the paper and I would expect to see in the final version of the paper. Finally, the other main criticism I think arises from the fact that this work studies online learning in the iid setting, which I agree is an important setting worth studying. But this should be clarified and discussed early on in the work to correctly set expectations.