Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?
The paper addresses the challenge of determining model size in machine learning, focusing on online Gaussian processes. It presents a method for automatically adjusting model capacity during training while ensuring high performance.
Abstract
Reviews and Discussion
The paper addresses the problem of choosing an appropriate number of inducing points for a sparse Gaussian process in the context of continual learning, where batches of data are observed sequentially, such that the total number of data points is not known before training, which prevents the use of heuristics that depend on the size of the dataset. An "online ELBO", which lower bounds the marginal likelihood based only on the current batch of training data, is introduced and used to optimize noise, kernel, and variational parameters. A corresponding upper bound is derived and used to dynamically increase the number of inducing points. Empirical experiments demonstrate that the proposed strategy leads to efficient usage of resources while maintaining performance.
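For context, the batch-setting analogues of these bounds (the paper's online versions condition on the approximation carried over from previous batches) are the standard collapsed lower bound of Titsias (2009) and the upper bound of Titsias (2014), written here with $\mathbf{Q}_{\mathbf{ff}} = \mathbf{K}_{\mathbf{fu}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{uf}}$ and $t = \operatorname{tr}(\mathbf{K}_{\mathbf{ff}} - \mathbf{Q}_{\mathbf{ff}})$:

$$
\mathcal{L} \;=\; \log \mathcal{N}\!\left(\mathbf{y}\,\middle|\,\mathbf{0},\,\mathbf{Q}_{\mathbf{ff}} + \sigma^2\mathbf{I}\right) - \frac{t}{2\sigma^2}
\;\le\; \log p(\mathbf{y}) \;\le\;
-\frac{N}{2}\log 2\pi - \frac{1}{2}\log\left|\mathbf{Q}_{\mathbf{ff}} + \sigma^2\mathbf{I}\right| - \frac{1}{2}\mathbf{y}^{\top}\!\left(\mathbf{Q}_{\mathbf{ff}} + (\sigma^2 + t)\,\mathbf{I}\right)^{-1}\!\mathbf{y} \;=\; \mathcal{U}.
$$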
Update after rebuttal
I am satisfied by the rebuttal response and I continue to support acceptance of this submission. I maintain my score of 4.
Questions for Authors
- It seems like all experiment results in the main paper only consider RMSE, whereas Gaussian processes are celebrated for their ability to quantify uncertainty, which can be evaluated using e.g. the predictive log-likelihood. Is there any particular reason why you only include NLPD results in the appendix? I find this particularly relevant because sparse Gaussian processes are known to struggle with producing good uncertainty estimates.
- For the experiment in Section 5.3, the data was sorted along the first dimension and divided into batches to simulate continual learning. This seems quite arbitrary (why not sort by the last dimension instead?). Do you have any specific reason for this? And how does this compare to batches which are simply sampled uniformly at random?
Claims and Evidence
The main claim of the paper is to "develop a method to automatically adjust model size while maintaining near-optimal performance" in the context of continual learning, where model size refers to the number of inducing points in sparse Gaussian processes. This is achieved by introducing an online ELBO and an online upper bound on the log marginal likelihood which (together with a baseline noise model) are used to select the number of inducing points such that the approximation error will be below a certain threshold.
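As a schematic illustration of this criterion (a sketch only; the function and variable names below are hypothetical, not the authors' implementation):

```python
def grow_inducing_points(Z, batch, lower_bound, upper_bound, threshold, candidates):
    """Illustrative sketch: add inducing points chosen from `candidates`
    (e.g. input locations of the new batch) until the estimated gap between
    the upper and lower bound falls below `threshold`."""
    while candidates:
        gap = upper_bound(Z, batch) - lower_bound(Z, batch)
        if gap <= threshold:          # approximation error certified small enough
            break
        Z = Z + [candidates.pop(0)]   # otherwise enlarge the inducing set
    return Z
```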
Methods and Evaluation Criteria
The proposed method is specifically designed for the problem at hand and makes sense. In terms of experiments, the paper uses the popular combination of a simple 1D toy experiment and standard UCI regression benchmarks, which is acceptable but arguably somewhat underwhelming. The proposed method is compared to two alternatives for automatic selection of the number of inducing points, which is also appropriate, given that there is not a lot of existing work on this topic (as far as I know). Additional interesting results on a real-world dataset are provided in Appendix D.4. I encourage the authors to include this experiment in the main paper.
Theoretical Claims
The main theoretical claim is a "Guarantee" in Section 4.2 which upper bounds the KL divergence between the actual and the optimal variational posterior over the latent function after observing the latest batch of training data. The argument given in line 291 right below the statement is sensible. I did not check the complete proof in Appendix C.
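For reference, in the batch setting the corresponding argument is the standard sandwich: for the optimal collapsed variational distribution, $\log p(\mathbf{y}) - \mathcal{L} = \mathrm{KL}\big(q\,\|\,p(f\mid\mathbf{y})\big)$, and since $\log p(\mathbf{y}) \le \mathcal{U}$,

$$
\mathrm{KL}\big(q\,\|\,p(f\mid\mathbf{y})\big) \;\le\; \mathcal{U} - \mathcal{L}.
$$

The guarantee in Section 4.2 is the online analogue, which additionally accounts for the quality of the approximation carried over from the previous batch.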
Experimental Design and Analysis
- For the experiment in Section 5.3, the data was sorted along the first dimension to create batches to simulate continual learning. This seems quite arbitrary (why not sort by the last dimension instead / what if the data is not sorted at all?).
- All considered datasets (including the real-world data used in the experiment discussed in Appendix D.4) seem to be quite small. In particular, they would be small enough to fit an exact Gaussian process on a modern GPU. I acknowledge that the number of inducing points was comparatively small (a few dozen or a few hundred), but this may not faithfully represent a scenario where continual learning would actually be necessary because the whole dataset becomes too large to keep track of.
Supplementary Material
The supplementary material contains the source code and experiment configurations. Although I did not execute the code myself, it seems to be well-documented and of good quality.
Relation to Existing Literature
Selecting the number of inducing points for sparse Gaussian processes is an unsolved problem and relevant to the whole research area of sparse Gaussian processes. While there are a few approaches in the literature, many practitioners simply choose an arbitrary value based on the amount of available computational resources. This paper provides a principled way of selecting the number of inducing points in the context of continual learning, and demonstrates empirically that the proposed method performs better than existing alternatives.
Essential References Not Discussed
I do not know of any essential reference which is not currently discussed in the paper.
Other Strengths and Weaknesses
Strengths:
- clear definition and motivation of the addressed research problem
- principled solution with theoretical arguments, which also seems to work well empirically
- detailed manuscript and appendix with thorough descriptions, pseudocode, derivations, etc.
Weaknesses:
- experiments only consider somewhat small datasets which might not be realistic for continual learning
Other Comments or Suggestions
- I encourage the authors to include (some) NLPD results and the experiment from Appendix D.4 in the main paper
- Figure 3 currently uses quite a lot of space in the main paper without providing a lot of information (low "information density")
Thanks for your detailed review and your clear recommendations for improvement. We appreciate your positive feedback on our work.
Suggestions on the presentation of the main results, and why NLPD results appear only in the appendix
Thank you for your suggestions on which results to include in the main paper. Space constraints were the main reason for placing some of the results in the appendix, but we agree that including them in the main text would be valuable.
Data sorted by the first dimension: why not sort by the last dimension instead / what if the data is not sorted at all? Do you have any specific reason for this? And how does this compare to batches which are simply sampled uniformly at random?
The reason we sorted the data along the first dimension is to follow the experimental setup used by Chang et al. (2023) for the UCI experiments, but any other dimension could have been used. If we were to sample batches uniformly at random, we would expect a behaviour similar to the middle column of Figure 1. After a few initial batches, enough of the input space would be covered, causing the number of inducing points to asymptote to a particular value.
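As a minimal illustration of the two batching protocols discussed here (placeholder data and batch counts, not our experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # placeholder inputs

# Sorted streaming: order by the first input dimension, then split into batches.
order = np.argsort(X[:, 0])
sorted_batches = np.array_split(order, 10)

# Random streaming: batches drawn uniformly at random, for comparison.
random_batches = np.array_split(rng.permutation(len(X)), 10)
```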
All considered datasets [...] seem to be quite small...
We agree with the reviewer that adding larger-scale datasets could improve the paper. The main reason for not including them was that finding suitable large-scale real-world datasets for which GPs are an appropriate model is challenging. However, we are considering including a large-scale synthetic dataset in the final version, which would allow us to evaluate the scalability of our method in such scenarios.
I thank the authors for providing answers and clarifications. I continue to support the acceptance of this submission.
The paper introduces a new criterion for determining the number of inducing variables in the context of continual learning with single-output GP regression models. The general idea is to automatically adjust the model size while maintaining near-optimal performance, without needing to see future data points.
Questions for Authors
N/A
Claims and Evidence
Some points of strength that I consider relevant for this subsection on claims and evidence:
- Very interesting approach to continual learning, particularly with the focus on computational resources -- as stated mainly in the intro (column 2, p. 1).
- Correct identification of issues and challenges, limitations of current SOTA methods, and in general, a nice willingness to provide rigorous continual learning methods for GP regression close to full-batch performance.
- I do believe the problem statement + literature review + method proposal is of the highest scientific quality. Additionally, the work builds the methodology on top of three well-recognised solutions: the "Titsias GP bound" from Titsias et al. (2014), streaming (sparse) GPs from Bui et al. (2017), and the Burt et al. (2019, 2020) line of research on optimally finding the number M of inducing points.
Methods and Evaluation Criteria
I will add to this section all the questions/points of curiosity that I would like to hear about from the technical, methodological, and theoretical sides. Additionally, I see the utility of, and personally liked, the following decisions:
- Use of the re-parametrization from Panos et al. (2018) to obtain an extra likelihood parametrization on the old inducing points, such that the online ELBO in Eq. (7) is more interpretable and later allows building Eq. (8) for the regression case considered.
- The way of selecting the threshold, inspired by Grunwald & Roos (2019), is certainly nice and inspirational in this case. Also the way that the noise model is later used as the baseline.
- There is a key point on the clarity of the authors when it is stated in Section 4.3 that the selection strategy varies depending on whether the batch is very large or not. Quite interesting indeed.
Theoretical Claims
Some questions on claims and details that I did not find clear enough or did not understand very well while reading:
- [Question] What is the main reason behind the focus on (single-output) regression problems only? The streaming sparse method of Bui et al. (2017) was applicable to both classification and regression problems. I do see that it must be due to the guarantees and the bound built on top of the exact optimal bound of Titsias et al. (2014), which (as far as I remember) was developed only for GP regression with Gaussian likelihoods. Is there any other reason?
- [Question] Between Eq. (4) and Eq. (5), a term is omitted because it cannot be computed. I see that some comments on its properties are added later, but is Eq. (5) really still a bound? Is the theoretical rigour of the bound preserved despite this omission?
- [Comment] To me, the last term in Eq. (7) could be ignored from an optimization point of view, right? As it does not depend on the new variational parameters or new hyperparameters.
- [Question] The way in which, and the reasons why, the inducing set is augmented for both the lower and the upper bound in Eq. (12) remain a bit of a mystery to me. Must the number of inducing points used for the upper bound vary in the same way as the number used for the lower bound? Am I missing some details here?
- [Question] How do the old and new variational parameters mix together for the third iteration? I am missing the algorithmic point of view, and how the method "refreshes" itself for each new batch from a variational-parameter perspective.
- [Question] From the description in the first paragraph of the Experiments section, I am assuming that the likelihood noise hyperparameter is fixed. What happens if it is not? Does this cause issues with the stability of the method?
Experimental Design and Analysis
I do like the way experiments were designed, the empirical results and the perspective brought in both Figure 1 (types of data distribution in the batches) and Figure 2 (continual learning GP vs full-batch exact GP).
Some points that concern me somewhat:
- Figure 2A is a bit confusing, since the same vertical axis shows curves with M=8 and M=10 (from the legend). Maybe there is a better way to show this information without curves with different M values intersecting at the same points.
- Continual learning is a problem that deals a lot with the idea of an "unstoppable" flow of input data, such that one should never keep data points in memory, revisit them, etc. From Figure 1, both time and performance look fantastic, but I don't really see that the method has been tested in "stress" situations (e.g. 1k or 10k batches). Additionally, such a long-term analysis would have been great with respect to memory allocation. (This is not a call for additional experiments in the rebuttal, just a comment/suggestion for improvement.)
- To me, the work inherits much of the spirit of Bui et al. (2017). However, the main weakness, or point of technical struggle, of that method was the management of old variational parameters, inducing points, and hyperparameters. Does VIPS do better somehow? At the moment, the experiments do not tell me anything new in this direction.
Supplementary Material
I (quickly) proofread Section C of the Appendix, on the guarantee that the two bounds are equivalent. So far, I did not detect any mistake or misleading detail that made me distrust the proof.
Relation to Existing Literature
Good review of literature.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Thank you for your detailed and thorough review. We appreciate your perspective on the significance of our work.
[Q] Focus on regression problems.
As you noted, the main reason is the theoretical guarantees and the bound from Titsias (2014), which was derived for GP regression with Gaussian likelihoods. In this case, the variational KL divergence can be made arbitrarily small by increasing the number of inducing points. This no longer holds in classification, where the likelihood is non-Gaussian.
[Q] Validity of Eq. (5) without the omitted term
Eq. (5) is no longer a bound on the marginal likelihood of the full dataset, so the theoretical rigour of the original bound is, strictly speaking, lost when the term is omitted. However, it remains a valid lower bound for the new data, given the posterior carried over from previous batches. As we add more inducing points, the omitted term approaches zero, and in the limit, Eq. (5) recovers the full bound.
This is where our argument becomes empirical: we show that by adding enough inducing points, we can keep this term small enough for Eq. (5) to behave as a proper ELBO during training, and that this is sufficient to maintain performance on the full dataset.
[Comment] The last term of Eq. (7) could be ignored from an optimization point of view.
From an optimisation point of view, yes, this is right.
[Q] Must the number of inducing points used for the upper bound vary in the same way as the number used for the lower bound?
We are free to use a different number of inducing points to calculate the upper bound compared to what we use in the lower bound. Using more inducing points for the upper bound leads to a stricter stopping criterion, allowing us to stop adding points to the lower bound sooner. This is useful since only the inducing points used for the lower bound are retained for future batches. So, by slightly increasing the computation for the upper bound, we ultimately reduce the number of inducing points retained, lowering the overall cost.
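To make this concrete in batch notation (a sketch only; our online bounds are the per-batch analogues, and the symbols $M_{\mathrm{lb}}$, $M_{\mathrm{ub}}$, $\varepsilon$ are illustrative), the check is

$$
\mathrm{KL}\big(q_{M_{\mathrm{lb}}}\,\|\,p(f\mid\mathbf{y})\big) \;\le\; \mathcal{U}_{M_{\mathrm{ub}}} - \mathcal{L}_{M_{\mathrm{lb}}} \;\le\; \varepsilon,
$$

which is valid for any $M_{\mathrm{ub}}$ because any valid upper bound exceeds $\log p(\mathbf{y})$; a tighter upper bound (larger $M_{\mathrm{ub}}$) shrinks the estimated gap, so the threshold can be met with fewer retained lower-bound points $M_{\mathrm{lb}}$.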
[Q] How do the old and new inducing variables mix for the third iteration?
In our algorithm, we keep the old inducing point locations fixed and choose the new inducing points from among the input locations in the new batch. Consider three batches D_1, D_2, D_3 with corresponding inducing variables u_1, u_2, u_3 at locations Z_1, Z_2, Z_3. For the first batch, we construct the variational approximation using inducing points Z_1. For the second batch, we keep Z_1 and select new inducing points Z_2 from the batch, forming the joint set Z_{1:2} = Z_1 ∪ Z_2; the variational posterior is then updated over the joint inducing variables (u_1, u_2). For the third batch, this posterior over (u_1, u_2) now summarises all past information. We select new inducing points Z_3, form Z_{1:3} = Z_{1:2} ∪ Z_3, and perform the same update over (u_1, u_2, u_3), with the kernel matrices and statistics recomputed using Z_{1:3} instead of Z_{1:2}.
[Q] Is the noise hyperparameter fixed, as stated in the Experiments section?
Thank you for pointing this out. This is a typo: the noise hyperparameter is not fixed. The text should read, "the variational distribution, noise and kernel hyperparameters are optimised [...]". We will correct this.
Figure 2A: Thank you for the suggestion. We will emphasise that the value of M shown at the top is for VIPS by adding a subscript.
Stress testing: Thank you for your suggestions and for clearly indicating that additional experiments were not required at this stage. Please see our response to Reviewer Me8M regarding larger datasets.
Technical challenges in Bui et al. (2017). Does VIPS do better?
As you pointed out in your review, our bound in Eq. (7) is a more interpretable version of the online lower bound introduced by Bui et al. (2017), obtained using the reparametrisation of Panos et al. (2018), so both methods share the same properties. The novelty of our method is that, unlike Bui et al. (2017), which used a fixed number of inducing points and heuristically retained 30% of the old ones, we propose a principled way to decide how many and which new inducing points to add to maintain the approximation quality.
I want to thank the authors for their detailed responses to my comments and concerns. I was just in need of some clarifications, particularly on the validity of Eq. (5), the question about the bounds, and the update at the third iteration. I see the current work as even stronger and of higher technical/scientific quality, so I am glad to update my score and thus reach full agreement among all reviewers on acceptance. Additionally, I invite the authors to update the manuscript (if accepted) with some of the proposed points, even if some content must go in the Appendix due to space constraints.
In a streaming data setting, where access to previously observed batches of data is not available, one cannot use Gaussian process methods with non-degenerate (i.e., full-rank) kernels. A very popular approach is to approximate the full Gaussian process with a variational approximation, in which the posterior is computed using a fixed-size set of inducing points. Nonetheless, in the streaming setting, a poor choice of the number of inducing points can lead to either poor performance or wasted computational resources.
The authors propose a new criterion based on a previously known online version of the variational ELBO. Specifically, when the gap between the true posterior and the online approximate posterior becomes sufficiently large, the model capacity is increased to alleviate this gap. The correctness of this criterion depends on the quality of the approximation at the previous step.
This proposal is evaluated using synthetic data, UCI datasets in a streaming setting, and a real-world dataset collected in a streaming fashion.
Questions for Authors
None.
Claims and Evidence
The authors' claims are validated by experimental results. They assert that their method achieves results close to the exact non-streaming GP (Sec. 5.2) and produces models with smaller footprints (Sec. 5.3). The results are easy to understand and well explained.
Methods and Evaluation Criteria
The evaluation criteria consider real-life constraints of streaming datasets, assess different hyperparameter combinations, and compare only those that meet a specific RMSE or NLPD threshold. This evaluation should fairly assess the methods while accounting for their predictive distributions and the problem's constraints.
Theoretical Claims
I have briefly checked the correctness of the theoretical claims (Sec. 4.1 and 4.2), and they appear valid given the assumptions; specifically, that the quality of the previous iteration affects the global quality of the current iteration.
Experimental Design and Analysis
Yes, as discussed in the evaluation criteria, their analysis is sound and follows best practices.
Supplementary Material
I have not reviewed the supplementary material in detail, beyond the complete description of the experimental details.
Relation to Existing Literature
This paper fits into the field of sparse online Gaussian processes, a domain where the contributions by Csató and Opper (2001) are well known in the literature on sparse GPs. Increasing the capacity of the sparse model on the fly is an important problem, and the solution presented by the authors is, to the best of my knowledge, the first that addresses this by upper-bounding the gap, in terms of KL divergence, between the approximate model and the full GP model. It is unclear to me how this work could interact with alternative approaches, such as the expanding memory approach of Chang et al. (2023).
Essential References Not Discussed
Given the paper's focus on methods that do not use memory or replay buffers, the references discussed seem appropriate to me.
Other Strengths and Weaknesses
The focus on reducing the number of hyperparameters and using already established criteria for inducing point selection is a strength of this paper, as hyperparameters and their selection can be a significant limitation on the applicability of Gaussian process methods.
Other Comments or Suggestions
The authors left a stray quotation mark in their Impact Statement.
In the authors’ introduction to Gaussian processes, I would suggest stating that infinite-width neural networks are a well-known subclass of GPs rather than implying that GPs are either that or something else.
While the discussion in Section 3.2 is quite interesting, none of the authors' experiments use inner-product kernels or further explore the connection with neural networks.
Thank you for your thorough review and your suggestions. We appreciate your positive feedback.
Connection of our work with NNs
As we note in response to other reviewers, we view VIPS as a first step towards adaptive size in more general settings. Since GPs and NNs share structural similarities, the aim of Section 3.2 was to introduce these ideas as a foundation for future work on adaptive neural architectures. We are actively exploring such extensions in ongoing work.
It is unclear to me how this work could interact with alternative approaches, such as the expanding memory approach of Chang et al. (2023).
We believe our approach is complementary to Chang et al. (2023). Both approaches aim to retain a selected set of points (whether data or inducing points) to ensure a good approximation. Chang et al. (2023) achieve this by expanding their memory sequentially, adding a fixed number of data points at each step. In contrast, our method dynamically adjusts the number of inducing points to maintain a desired approximation quality. One promising direction could be to adapt our criterion to guide memory growth in their framework; this is an extension we are currently considering.
We will also take your other suggestions into account for the camera-ready version.
The submission proposes a method that dynamically adjusts the model size (i.e., the number of inducing points in a sparse Gaussian process) while maintaining near-optimal performance in a continual learning setting, where data is presented as a stream and data storage is not allowed. The proposed method requires only a single hyper-parameter (the threshold) to balance accuracy and complexity.
Questions for Authors
Q1: Optimizing the inducing points, rather than selecting them from the training data, may lead to better solutions. Is it straightforward to implement this in the current model formulation?
Q2: Related to the item Other Strengths and Weaknesses, is it straightforward to extend the current base prediction model to a (sparse) deep (multi-layered) Gaussian model?
Claims and Evidence
Main claim: The submission develops a criterion for model size adjustment based on existing variational bounds and demonstrates its performance by comparing it to existing inducing point selection methods using UCI datasets and robot data.
Methods and Evaluation Criteria
The evaluation metric used is the number of inducing points learned by the methods, which makes sense given that the target performance is consistent across all methods. Therefore, a smaller number of inducing points is preferred. However, it seems that the UCI datasets used in the experiments do not include very large-scale data, and interpreting the results based solely on the real-world (robot) data may be challenging for those who are not familiar with this specific dataset or task.
Theoretical Claims
I did not verify each proof individually, but the methods used to derive the bounds appear technically sound.
Experimental Design and Analysis
The experiments appear to be conducted correctly, and the text is well-written, making it easy to understand the experimental results and the interpretation the authors intended to convey.
Supplementary Material
The Supplementary Material includes detailed implementations of the proposed methods, complete proofs of the theorems presented in the main text, and comprehensive experimental results. All parts of the Supplementary Material were reviewed.
Relation to Existing Literature
The idea of adaptive model size presented in the submission is relevant to current trends in the machine learning community, as continual learning (dealing with streaming data) has gained significant attention.
Essential References Not Discussed
It appears that the submission covers the essential references related to the research topic.
Other Strengths and Weaknesses
Strengths:
- The idea of dynamically adjusting the model size in continual learning appears novel, and integrating this approach within Gaussian processes is promising.
- The use of a single hyper-parameter may reduce the need for extensive fine-tuning.
Weaknesses:
- The proposed approach is not a generic model adaptation method but is limited to a specific model (sparse Gaussian processes).
- GPs and sparse GPs are limited in their ability to model highly nonlinear or non-Gaussian data distributions.
Other Comments or Suggestions
It seems that the reference for Conditional Variance (CV), used as a baseline, is missing in the text.
Thank you for your encouraging feedback and for recognising the relevance of our work. We appreciate your positive comments on the clarity of our writing and experimental design.
Q1: Optimizing the inducing points, rather than selecting them from the training data, may lead to better solutions. Is it straightforward to implement this in the current model formulation?
Yes, it is straightforward to implement with our formulation. As our method builds on the variational framework of Bui et al. (2017), which itself builds on Titsias (2009), the inducing point locations can be jointly optimised with the other hyperparameters of the model using the streaming variational lower bound, just as in the batch setting. In this work, we opt to select inducing point locations from the data for simplicity. That said, our framework is compatible with gradient-based optimisation of inducing locations if desired.
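As a minimal batch-mode sketch of what joint optimisation of inducing locations looks like (shown with GPflow purely for illustration; this is not our implementation, and the continual setting would replace the standard objective with the online bound):

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
Y = np.sin(12.0 * X) + 0.1 * rng.normal(size=(200, 1))
Z = X[:20].copy()  # initialise inducing locations from the data

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)
# Inducing locations are trainable by default; freezing them recovers pure selection:
# gpflow.utilities.set_trainable(model.inducing_variable, False)

gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
```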
Q2: Related to the item Other Strengths and Weaknesses, is it straightforward to extend the current base prediction model to a (sparse) deep (multi-layered) Gaussian model?
The extension is not straightforward. In the deep case, there is no analogous bound to Titsias' collapsed bound, which the base model in Bui et al. (2017) builds upon. However, we are actively working on extending VIPS to deep models.
The proposed approach is not a generic model adaptation method but is limited to a specific model (sparse Gaussian Processes). Additionally, GPs and sparse GPs are limited in their ability to model highly nonlinear or non-Gaussian data distributions.
We appreciate your observation. While our current focus is on sparse GPs, we view VIPS as a first step towards adaptive size in more general settings. In our work, we provide a principled criterion that adjusts model size to maintain accuracy with incoming data. In particular, since GPs and NNs share structural similarities, we hope that the ideas introduced here can inspire similar mechanisms for adaptive neural architectures. We are actively exploring such extensions in ongoing work.
Four knowledgeable reviewers recommend Accept and I agree. Reviewer Me8M recommends including a large-scale experiment, and the authors seem to agree it would strengthen the results. Please include such an experiment in an appendix.