PaperHub
Overall rating: 6.0 / 10
Decision: Rejected (4 reviewers)
Individual ratings: 6, 5, 5, 8 (min 5, max 8, std 1.2)
Average confidence: 3.3
Venue: ICLR 2024

Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Submitted: 2023-09-15 | Updated: 2024-02-11

Abstract

Keywords
Gaussian Process Latent Variable Models, variational inference, Annealed Importance Sampling

Reviews and Discussion

Official Review (Rating: 6)

The traditional algorithm for learning Bayesian GPLVMs is limited to simple data structures and faces challenges for high-dimensional data. Compared with variational inference (VI), importance-weighted sampling (IS) provides a more directed way of estimating and maximizing the marginal log-likelihood. To increase the effectiveness of this estimator, this paper proposes a sequence of bridge proposal distributions for sampling the latent variable $H$, and develops the corresponding stochastic gradient annealed algorithm. Results on the toy datasets show outstanding performance of AIS compared to other baseline methods. Besides, the new model and algorithm are able to learn the model with missing data and obtain better predictions for unseen data.

Strengths

  • The introduction of a sequence of bridge distributions for the proposal distribution is interesting and effective for increasing the performance of the model.
  • Maths are introduced step-by-step, which is clear and intuitive.
  • Algorithms are presented in a clear way.
  • Tables in this paper are good, showing the better performance of the newly proposed ULA-AIS method compared with the other two baselines.

Weaknesses

  • The notation $\tilde p(X, H)$ is a bit confusing since this is actually an estimator of the marginal $p(X)$, but it looks like a joint distribution.
  • The claim in the introduction "the dimension of the additional latent variable is limited to one" from the reference paper only applies to deep GPs. The latent variable of the GPLVM has no such strong drawback.
  • Typo: undajusted.
  • No definition of the abbreviation: LV-GP or LVGP, MH, HMC.
  • Overall, the experiments are okay and the results are good. However, the experiments are not adequate to fully convince me of the validity of the new method compared with the other two baseline methods. For example, is it possible to compare them on a synthetic dataset where the data is really generated from a GPLVM? That way, we would fully understand whether the new method improves the learning of GPLVMs over other traditional solvers, rather than improving for some other reason.
  • In summary, a detailed analysis on these seems to be necessary, since introducing a sequence of bridge distributions for the proposal distribution is delicate, complicated, and needs to be thoroughly explained and clarified for readers.
  • The quality of the experiment results can be improved, see questions.

Questions

  • Below Eq. 7, what is $q_k(H_{k-1})$? Should that be $q_{k-1}(H_{k-1})$?
  • Does the inverse length-scale plot in Fig. 1 show the result learned by ULA-AIS? Is it possible to also visualize the class labels in the left plot, so that it can show us the latent recovery accuracy?
  • For Fig. 2 and Fig. 3, what about the reconstruction results from the other two baseline methods? And what about the latent space recovered by the other two baseline methods?
  • For Fig. 3 left, the bottom is the true data and the top row is the predicted? Figures in this paper are not explained clearly in their corresponding captions. Color usage, image order, method order, etc. sometimes make me confused.
Comment

Weaknesses: The notation

R: Thank you for your comment. We appreciate your careful review of our work. You are correct in pointing this out. After rechecking the symbol, we have corrected the notation $\hat{p}(H,K)$ to avoid confusion.

The claim in the introduction "the dimension of the additional latent variable is limited to one" from the reference paper is only for deep GP. The latent of GPLVM has no such strong drawback.

R: We apologize for the misunderstanding. In the revised version, we have removed the statement claiming that the dimension of the additional latent variable is limited to one and provided a new explanation.

Typo: undajusted. No definition of the abbreviation: LV-GP or LVGP, MH, HMC. R: We have already corrected these typos. Please refer to the rebuttal revision for the updated version.

Overall, the experiments are okay and the results are good. However, the experiments are not adequate to fully convince me of the validity of the new method compared with the other two baseline methods. For example, is it possible to compare them on a synthetic dataset where the data is really generated from a GPLVM? That way, we would fully understand whether the new method improves the learning of GPLVMs over other traditional solvers, rather than improving for some other reason.

R: Thank you for your suggestion. Our experimental dataset is primarily constructed following previous GPLVM verification methods. Through this approach, we ensure that our experimental dataset exhibits similar characteristics and distributions to traditional methods, enabling fair and accurate comparisons.

We acknowledge and agree with the suggestion to use synthetic datasets generated by GPLVM as an excellent way to compare the performance of our new method against other benchmark methods. In line with this, we plan to synthetically generate additional simpler datasets for further validation, which will be included in the final version of our work.

Questions:

1. Below Eq. 7...

R1: We have reviewed the equation and found no errors. $q_k(H_{k-1})$ represents the probability density of a specific bridging density on $H_{k-1}$.

2. Does the inverse length-scale plot in Fig. 1 show the result learned by ULA-AIS? Is it possible to also visualize the class labels in the left plot, so that it can show us the latent recovery accuracy?

R2: The inverse length-scale plot shown in Fig. 1 does indeed represent the results learned by ULA-AIS. We appreciate the reviewer's suggestion to visualize the class label in the left plot. However, due to space limitations, we have included the results in the appendix. Please refer to the rebuttal revision for further details.

3. For Fig. 2 and Fig. 3, what about the reconstruction results from the other two baseline methods? And what about the latent space recovered by the other two baseline methods?

R3: Thank you for the reviewer's inquiry regarding Fig. 2 and Fig. 3. As per your request, we have included the reconstruction results from the other two baseline methods and the recovered latent space visualizations in the rebuttal revision. Please refer to the appendix section in our response for more detailed information.

4. For Fig. 3 left, the bottom is the true data and the top row is the predicted? Figures in this paper are not explained clearly in their corresponding captions. Color usage, image order, method order, etc. sometimes make me confused.

R4: Thank you for the reviewer's question regarding Fig. 3. We apologize for the lack of clarity in the captions, which caused confusion. Indeed, in Fig. 3 left, the bottom row represents the true data, and the top row depicts the data predicted by our AIS method on the MNIST dataset. To address this issue, we will provide clearer explanations for all figures in the final version, ensuring that readers can accurately understand the color usage, image order, and method arrangement.

We would like to inform the reviewer that we have made the necessary modifications as per your request. We have addressed concerns related to the clarity of figure captions and have taken steps to improve them for better understanding. Additionally, we have incorporated the requested visualizations for the reconstruction results from other baseline methods and the recovered latent space. We kindly request the reviewer to reconsider our paper in light of these revisions. Thank you for your time and valuable feedback.

Comment

Thank you for your detailed response. I still have the following questions.

  • Q2:
    • Fig. 1 left still lacks a legend (i.e., color-class label). From Fig. 1 right, readers would assume red, green, and blue are different methods. If the different colors in Fig. 1 left represent different class labels, please clarify it explicitly.
    • Although ULA-AIS's neg ELBO is the best after 3000 iterations, the latent visualization in Fig. 1 is not visually better than that in Fig. 7 and Fig. 8. In that case, does the neg ELBO correctly reflect the significant superiority of ULA-AIS?
  • Q3:
    • Same as Q2, is there any significant difference between the three methods, in terms of reconstruction? If so, please state it explicitly. This is a dimensionality reduction task, but it seems like the dimensionality reduction results from the three methods look similar.
  • References and Appendices are mixed up right now.
  • Q4:
    • Same as Q2, is there any significant difference between the three methods, in terms of reconstruction?

Discussions on these further questions are highly appreciated. Thanks.

Comment

Fig. 1 left still lacks a legend (i.e., color-class label). From Fig. 1 right, readers would assume red, green, and blue are different methods. If the different colors in Fig. 1 left represent different class labels, please clarify it explicitly.

R: Thank you for your input. We appreciate your feedback on the need for a legend in Figure 1. We have revised the figure to include a legend, and the updated version has been included in the rebuttal revision.

Although ULA-AIS's neg ELBO is the best after 3000 iterations, the latent visualization in Fig. 1 is not visually better than that in Fig. 7 and Fig. 8. In that case, does the neg ELBO correctly reflect the significant superiority of ULA-AIS?

R: Thank you very much for your review comments on our paper. We understand your concerns regarding the difficulty in visually demonstrating the superiority of our method through the visualization of the latent space. Additionally, you have also raised questions about the effectiveness of negative ELBO as an evaluation metric. Let me explain further.

Firstly, the visualization of the latent space offers only one intuitive way to visually represent the clustering and distribution of data points in the latent space. However, it does not provide a comprehensive evaluation metric like negative ELBO. Negative ELBO reflects the degree of match between the model and the true data distribution, which complements the results of latent space visualization. We emphasize the importance of negative ELBO to ensure a comprehensive evaluation of the model's fit.

Secondly, our ULA-AIS method not only performs well on negative ELBO but also has other advantages, such as smaller MSE and a lower negative expected log-likelihood. These aspects are demonstrated in the experimental results shown in Table 1.

Thirdly, we do acknowledge that in our previous experiments, some improper adjustments of certain hyperparameters, due to our oversight, led to less than ideal visualization results. We apologize for the confusion caused. Given this, we have taken your advice and retested the model to achieve better visualization results. We have updated the images of the latent space structure in the rebuttal revision to address this. A comparison of the updated Figures 1, 6, and 7 reveals that while all three methods achieve data dimensionality reduction and reconstruction, our method shows better clustering results. For example, the boundaries between different categories appear clearer, and there are fewer misclassified samples.

I hope this explanation clarifies our position. If you have any further concerns or questions, please let us know.

Same as Q2, is there any significant difference between the three methods, in terms of reconstruction? If so, please state it explicitly. This is a dimensionality reduction task, but it seems like the dimensionality reduction results from the three methods look similar.

R: If you are referring to the dimensionality reduction task, we have already addressed that in Q2.

References and Appendices are mixed up right now.

R: Thank you for bringing this to our attention. We have separated the References and Appendices in the latest version of the paper to ensure clarity and organization. We appreciate your feedback on this matter.

Q4: Same as Q2, is there any significant difference between the three methods, in terms of reconstruction?

R: It seems you asked the same question as in Q3; I assume this might be a typo.

We kindly remind you that the discussion period is coming to an end. If you have any further questions, please feel free to discuss them with us promptly. We are more than happy to answer them for you.

Comment

Sorry for the confusion.

My Q3 means the third of my original questions.

For Fig. 2 and Fig. 3, what about the reconstruction results from the other two baseline methods? And what about the latent space recovered by the other two baseline methods?

So, is there any significant difference between the three methods, in terms of reconstruction? If so, please state it explicitly. This is a dimensionality reduction task, but it seems like the dimensionality reduction results from the three methods look similar.

My Q4 means the fourth of my original questions.

For Fig. 3 left, the bottom is the true data and the top row is the predicted? Figures in this paper are not explained clearly in their corresponding captions. Color usage, image order, method order, etc. sometimes make me confused.

So, is there any significant difference between the three methods, in terms of reconstruction?

Many thanks!

Comment

Dear Reviewer G6Yr,

Thank you very much for your attention and valuable comments on our work. We would like to address your question regarding the reconstruction results in Figure 2 and Figure 3. According to your request, we have included the reconstructions of the baseline methods on the Frey Face and MNIST datasets in the appendix, and provided visualizations of the results.

From the visual appearance, it may seem that all three methods produce similar reconstructions. However, upon closer inspection, we can observe differences in certain details such as brightness and contrast. While these differences may be difficult to discern with the naked eye, we have quantified them using the mean squared error (MSE) between the reconstructed images and the ground truth. The MSE results for all three methods on the test set are reported in Table 2 in the main text.

In particular, our method exhibits significant improvement in terms of MSE on the Frey Face dataset. Although the performance difference between our method and the baseline method is relatively small on the MNIST dataset, our method achieves a lower negative log-likelihood, indicating a more robust uncertainty estimation.

As for Q4, we understand the reviewer's confusion regarding Figure 3 left. We have added appropriate captions to all the figures in the latest version of the manuscript to clearly indicate the order of the images and methods. We have also revised the color usage in the images to ensure consistency with the corresponding captions, image order, and method order.

We sincerely apologize for the lack of clear explanation of the figures in the previous version. We appreciate the valuable input from the reviewer and thank you for your careful observation. We hope that these changes will help readers better understand and interpret our research findings.

Thank you again for your valuable input and guidance. We will make sure to provide additional clarification and explanation of these results in the final version of the paper. We hope this response addresses your concerns and improves your perception of our paper.

Comment

Thank you very much for these answers. I have no further questions. I have raised my score to 6. But I do think there is still space to improve the overall quality of the paper. For example, the error bars in the negative ELBO loss plot; the figure layout for better comparison between different methods; etc. Btw, currently the main content has slightly exceeded the 9-page limit. Probably the presentation quality needs to be improved further.

Comment

We sincerely appreciate your recognition and valuable suggestions for our work! We understand that this is a challenging task and we are truly grateful for the acknowledgment you have given to our efforts. We will carefully consider your suggestions regarding error bars and chart layout and strive to improve the overall quality of the paper.

Additionally, we will pay attention to controlling the length of the paper and continue to enhance the presentation quality.

Thank you again for elevating the rating of our paper! Your recognition and endorsement mean a great deal to us, and they leave us filled with excitement and motivation! We will continue to strive for improving the quality of the paper and presenting our research findings to higher standards. We appreciate your support for our work, and your encouragement will continue to inspire us to make continuous progress!

Official Review (Rating: 5)

The manuscript proposes an annealed importance sampling scheme to perform scalable variational inference in Gaussian process latent variable models. A set of experiments on small datasets and two image datasets compares two variational inference algorithms.

Strengths

  • The GPLVM is well-established and a widely used tool.
  • The proposed techniques to achieve efficient sampling are also established.
  • The manuscript comes with code and can hence be easily reproduced.

Weaknesses

  • Motivation
    • The paper does not convincingly motivate potential shortcomings of existing methods that should be overcome by the described approach.
  • Experiments
    • Proper comparison to other models needs to be improved: e.g. comparison to a standard GPLVM or some other simple density model (KDE) is missing.
    • Proper analysis of computational effort is missing (runtime analysis).
  • Typos
    • Abstract: "tighter" rather than "lower"?, VI is undefined
    • Intro: "in high-dimensional spaces"
    • Algorithm 1: "stepsizes" rather than "stepsides"
    • Section 3.1 last paragraph: "It is obvious that", "Therefore, the first three terms"
    • References: Capitalisation needs to be reviewed, e.g. Langevin, Eyring-Kramers, Gaussian, Bayes, Monte Carlo
    • Citations: Please properly distinguish between textual citation and parenthetical citations. In the manuscript, you only use textual citation.

Questions

  • How do you come to the conclusion that your method shows "more robust convergence" as claimed in the contributions at the end of the introduction section? Which experiment backs this claim? Figure 4 shows a strong peak shortly before 600 iterations and a strange decay at 350.
  • Is it fair to compare MSE and NELL after a fixed number of iterations? To me, one should either fix the computational budget or compare after convergence. What happens if you evaluate at 5000 iterations?
Comment

Weaknesses:

Motivation.

R: We highly appreciate your concern about the motivation behind our proposed method and understand your skepticism. In this response, we will provide a detailed explanation of the motivation behind our proposed method and how it effectively overcomes the limitations of existing approaches. For more detailed information, please refer to the introduction section of this paper.

Our research motivation stems from the potential applications of GPLVMs in function estimation and unsupervised learning. Previous studies have mainly focused on training GPLVMs using variational inference (VI) methods and replacing the true probability terms with approximate posterior probability distributions to reduce model complexity.

However, we have found that in VI methods, there are still challenges in dealing with high-dimensional spaces and the increase in relative variance as the latent variable dimension increases, due to the limitations of standard importance sampling. These limitations make it difficult to construct a well-performing proposal distribution in high-dimensional spaces, thereby limiting the performance of existing methods.

To overcome these issues, we propose a new method called SG-AIS, based on the well-established theory of AIS [1]. The advantages of this method are primarily manifested in the following aspects:

  • By introducing SG-AIS, we effectively address the challenges in high-dimensional spaces and reduce the increase in relative variance.
  • The concept and design of SG-AIS are based on the AIS method, but with the introduction of stochastic gradient techniques and annealing strategies, we achieve higher efficiency and better convergence performance. Additionally, we approximate the posterior distribution using Langevin stochastic flow, which leads to significant improvements.
  • We also design an efficient algorithm to reparameterize all variables in the evidence lower bound (ELBO) and propose a stochastic variant algorithm that uses data subsets to estimate gradients. This improves the speed and scalability of the algorithm.

We demonstrated the superiority of our approach through experiments on multiple benchmark datasets of GPLVMs. Our method not only exhibited tighter variational bounds and higher log-likelihood, but also demonstrated more robust convergence. These experimental results validate the effectiveness and feasibility of our proposed method.
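
To make the mechanics of SG-AIS more concrete, the following is a minimal, generic sketch of annealed importance sampling with unadjusted Langevin (ULA) transitions over the latent variables. It is an illustration only: the function name, its arguments (`log_prior`, `log_joint`, `h0`, `K`, `eta`), and the PyTorch implementation are hypothetical simplifications rather than the exact algorithm of the paper.

```python
import torch

def ula_ais_log_weight(log_prior, log_joint, h0, K=10, eta=1e-2):
    """Sketch of AIS with unadjusted Langevin (ULA) transitions.

    log_prior(H) returns log q_0(H), the initial/proposal density (scalar);
    log_joint(H) returns log p(X, H), the unnormalized target (scalar).
    The bridging densities follow the geometric path
        log pi_k(H) = (1 - beta_k) * log q_0(H) + beta_k * log p(X, H).
    The returned log importance weight lower-bounds log p(X) in expectation.
    """
    betas = torch.linspace(0.0, 1.0, K + 1)
    h = h0.clone().requires_grad_(True)
    log_w = torch.zeros(())

    for k in range(1, K + 1):
        # incremental weight: log pi_k(h_{k-1}) - log pi_{k-1}(h_{k-1})
        with torch.no_grad():
            log_w = log_w + (betas[k] - betas[k - 1]) * (log_joint(h) - log_prior(h))

        # one ULA step targeting pi_k
        log_pi_k = (1 - betas[k]) * log_prior(h) + betas[k] * log_joint(h)
        grad = torch.autograd.grad(log_pi_k, h)[0]
        with torch.no_grad():
            h = h + 0.5 * eta * grad + (eta ** 0.5) * torch.randn_like(h)
        h.requires_grad_(True)

    return log_w
```

In the full method, the Langevin noise and the remaining variables in the ELBO are reparameterized so that this bound can be differentiated with respect to model and variational parameters, and mini-batches of the data are used to obtain stochastic gradients.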

Experiments

1. Proper comparison to other models needs to be improved:

R1: We completely understand your expectation for a more comprehensive comparison.

We chose to compare against Classical Sparse VI based on mean-field (MF) approximation and Importance-weighted (IW) VI methods because these approaches are closely related to our research motivation, which is to better address the posterior inference problem in the GPLVM model. Furthermore, previous work on variational inference for GPLVMs has primarily focused on these two aspects. Hence, our experimental tasks align well with these methods, such as tasks like making predictions in unseen data.

The standard GPLVM you mentioned may be referring to the original GPLVM work. However, due to the cubic computational complexity of Gaussian processes, it may require more computation time and memory resources. This is why we introduce the sparse inducing point method. We aim to improve the GPLVM model by using these methods and achieve better efficiency in terms of computational resources. (Please refer to the text below due to limited space.)

Comment
  2. Runtime analysis.

R2: We have updated the time consumption and other relevant information of the proposed algorithm in the "rebuttal revision" appendix. In addition, we now present the main results of the experiments in the following table for your convenience:

| Dataset | Method | Time |
|---|---|---|
| Frey Faces | MF | 0.32s |
| Frey Faces | IW | 1.46s (K=5), 2.85s (K=10), 4.06s (K=15), 5.45s (K=20), 7.03s (K=25) |
| Frey Faces | AIS (ours) | 1.53s (K=5), 2.65s (K=10), 3.79s (K=15), 4.80s (K=20), 5.93s (K=25) |

In our experiments, we observed that the time complexity of Importance-weighted (IW) VI and Annealed Importance Sampling (AIS) almost linearly increases with K as K increases.

In the IW algorithm, the time complexity mainly stems from the K repeated samplings of latent variables to data, which is determined by the time complexity of the GPLVM model itself, O(NM^2). As a result, as we increase the number of samples K, the frequency of repeated samplings increases, leading to a linear increase in time complexity.

In the AIS algorithm, only one sampling of latent variables to data is required, while the intermediate variable sampling is allocated to the annealing procedure, specifically the computation of Langevin stochastic flow. This sampling process is relatively less complex compared to the time complexity of the GPLVM model itself. Therefore, on this dataset, compared to IW, the time complexity of AIS becomes lower as K reaches a certain threshold.

Typos.

R: Thank you for identifying the typos in the paper. We appreciate your keen eye for detail, and we will make the necessary corrections. We apologize for any confusion caused by these errors.

Questions:

  1. How do you come to the conclusion that your method shows "more robust convergence" as claimed in the contributions at the end of the introduction section? Which experiment backs this claim? Figure 4 shows a strong peak shortly before 600 iterations and a strange decay at 350.

R1: The claim of "more robust convergence" in our contributions is based on the comparison of our proposed method with the baseline methods in terms of convergence behavior. The experiment that supports this claim is presented in Figure 4.

In Figure 4, we show the convergence curves for our method and the baseline methods over iterations. The strong peak shortly before 600 iterations that you mentioned is an expected behavior in the optimization process. It indicates that the model is actively exploring the solution space to find the optimal solution. The strange decay at 350 iterations may be due to fluctuations in the optimization process, but it does not affect the overall convergence behavior and the final performance of our method. As we described in the main text: "This can be attributed to the fact that, by adding Langevin transitions, the algorithm's variational distribution gradually moves from the current distribution towards the true posterior distribution, resulting in sudden drops in the loss function when reaching the target distribution. Thus, such phenomena can be regarded as a common feature of annealed importance sampling and it becomes even more obvious in high-dimensional datasets."

  2. Is it fair to compare MSE and NELL after a fixed number of iterations? To me, one should either fix the computational budget or compare after convergence. What happens if you evaluate at 5000 iterations?

R2: As mentioned before, we have already analyzed the time complexity of AIS compared to IW on the Frey Faces dataset. Our proposed method does not show an increase in time complexity compared to the baseline method IW (and is sometimes even lower). Therefore, even though we used a fixed number of iterations, we can ensure the fairness of the experiments. Regarding your mention of the results after 5000 iterations, we have already obtained preliminary results, and our method still outperforms the previous method. In response to your feedback, we will include these results, even the results after convergence, in the revised article to address your concerns.

Thank you very much for providing us with valuable feedback and suggestions on our research work. We kindly request you to reconsider our paper and recognize the significant improvement in the performance of GPLVM modeling through our proposed method. If you have any further questions or concerns, we would be more than willing to provide additional explanations and experimental evidence to further support our research.

References: [1] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Comment

Many thanks for your detailed reply.

The cubic computational effort should not be an issue for datasets of sizes $n = 1000, 1599, 1965, 2163$, so the comparison to the "original GPLVM" should be feasible. Without this comparison (potentially on a subset of the data), it is hard for me to judge the "superiority" of your approach.

Comment

Dear Reviewer NbSz,

We appreciate your feedback and have made significant efforts to address your concerns within the given time frame. Specifically, we have conducted additional experiments comparing our proposed approach to the Standard GPLVM [1]. We performed experiments 8 times and averaged the results, analyzing the performance (in terms of MSE and NLL) on four different datasets. Due to limited computational resources, we were only able to run the Standard GPLVM on a subset of the image datasets. For the image reconstruction task, we randomly selected 300 images as the training set and used consistent hyperparameters for the other experiments, as stated in our main paper.

| Dataset | Method | MSE | NLL |
|---|---|---|---|
| Oilflow (1000,12) | Standard GPLVM | 2.45 (0.05) | -12.42 (0.07) |
| Oilflow (1000,12) | AIS (Ours) | 1.71 (0.04) | -15.81 (0.04) |
| Wine Quality (1599,11) | Standard GPLVM | 30.53 (0.03) | 2.82 (0.02) |
| Wine Quality (1599,11) | AIS (Ours) | 30.79 (0.04) | 2.42 (0.03) |
| Frey Faces (300,560) | Standard GPLVM | 130 (7) | 2632 (6) |
| Frey Faces (300,560) | AIS (Ours) | 115 (6) | 2417 (5) |
| MNIST (300,784) | Standard GPLVM | 0.36 (0.01) | -484 (3) |
| MNIST (300,784) | AIS (Ours) | 0.31 (0.01) | -496 (2) |


In addition, we would like to emphasize that one of our baselines for comparison is the MF method [2]. As discussed in its experimental section, even the authors of the MF approach suggest that the performance of the Standard GPLVM may not match that of the Bayesian GPLVM on the Oilflow and Frey Faces datasets. This highlights the advantages of Bayesian methods, which offer greater flexibility, uncertainty modeling, and generalization capabilities compared to the maximum likelihood estimation of the Standard GPLVM. Furthermore, the focus of our paper is to explore a way to combine variational inference, AIS, and Bayesian methodology to better estimate the posterior distribution. We hope that this response addresses your concerns and improves your perception of our paper.

[1] Lawrence, N. (2003). Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 16.
[2] Titsias, M., & Lawrence, N. D. (2010). Bayesian Gaussian process latent variable model. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 844–851). JMLR Workshop and Conference Proceedings.

Official Review (Rating: 5)

The paper presents a novel approach to enhance the learning process of Gaussian Process Latent Variable Models (GPLVMs) through the integration of Annealed Importance Sampling within the framework of variational inference. This newly introduced inference methodology is anticipated to provide a more accurate and effective variational approximation, particularly in scenarios involving high-dimensional data. The experimental analysis conducted to evaluate the effectiveness of the proposed approach demonstrates notable improvements over conventional techniques, such as vanilla variational inference and importance-weighted VI.

Specifically, the results indicate that the proposed method achieves a significant reduction in the negative Evidence Lower Bound (ELBO), a crucial measure of the model's performance. Moreover, the approach yields a notable enhancement in the expected log likelihood and the Mean Squared Error (MSE) of reconstruction, indicating a substantial advancement in the model's ability to accurately represent and reconstruct complex datasets. The evaluation of these results was carried out on diverse datasets, including both digit and face datasets, further highlighting the applicability of the proposed method across various domains.

Strengths

The paper's well-organized flow introduces GPLVM models, followed by a comprehensive exploration of variational inference methods, including vanilla and importance-weighted schemes. Its main contribution is the introduction of AIS variational inference using Langevin diffusion. This novel approach effectively addresses dimensionality reduction and prediction tasks in high-dimensional data spaces, showcasing its potential and effectiveness in advancing the field.

Weaknesses

The paper could benefit from a more comprehensive empirical validation, particularly in terms of rigorous experimentation and benchmarking against diverse data sets. Including a broader range of experiments and datasets would provide a more holistic understanding of the proposed AIS-based variational inference methods and their applicability across various contexts.

A more explicit discussion on the assumptions and constraints underlying the AIS variational inference approach would provide a clearer understanding of its potential constraints and practical applicability. This would help in contextualizing the scope and generalizability of the proposed method.

There are several typos in the paper (but not limited to): (p4) "the fisrt three term" ==> "the first three terms"; (p4) "jointly optimizes" ==> "jointly optimize".

Addressing these potential weaknesses would not only strengthen the overall credibility of the paper but also provide a more comprehensive and balanced perspective for readers and researchers in the field.

Questions

For experiments with the Oilflow dataset, with only three runs, do you think the standard deviation is reliable? How many runs for the data reported in Table 2?

Comment

Thank you for your valuable feedback and suggestions. We appreciate your emphasis on the need for a more comprehensive empirical validation and benchmarking in our paper. We agree that including a broader range of experiments and datasets would enhance the understanding of our proposed AIS-based variational inference methods and their applicability across different contexts. In the revised version, we will thoroughly scrutinize various experimental settings, including the number of runs conducted, to ensure accuracy and reliability. Thank you for reminding us to pay careful attention to these details.

Regarding the assumptions and constraints underlying the AIS variational inference approach, we apologize for not explicitly discussing them in the paper. We would like to emphasize that the main focus of our paper is to present our own method, including theory, formulas, experiments, etc., due to the limited space. Additionally, we have appropriately referenced the background of our method and we have included some related work in the supplementary material. In order to address your concerns, in the revised version, we will include a more explicit discussion on the assumptions and constraints, providing a better understanding of their practical applicability.

Thank you for identifying the typos in the paper. We appreciate your keen eye for detail, and we will make the necessary corrections, including changing "the fisrt three term" to "the first three terms" and "jointly optimizes" to "jointly optimize". We apologize for any confusion caused by these errors.

In response to your question about the reliability of the standard deviation in the experiments with the Oilflow dataset, we acknowledge that with only three runs, the accuracy and reliability of the standard deviation might be limited. We understand the concerns raised by the small sample size. As discussed earlier, we plan to increase the number of runs in the final version to enhance the estimation of the standard deviation and improve the credibility of our results. In Table 2, the reported data is also based on three runs. We appreciate your concern, and we assure you that we will increase the number of runs in our experiments to address this limitation. Additionally, in the revised version, we will provide additional information regarding the number of runs conducted for the experiments presented in Table 2. Thank you for bringing this to our attention.

Thank you very much for providing us with valuable feedback and suggestions on our research work. We kindly request you to reconsider our paper and recognize the significant improvement in the performance of GPLVM modeling through our proposed method. If you have any further questions or concerns, we would be more than willing to provide additional explanations and experimental evidence to further support our research.

Official Review (Rating: 8)

The paper proposes an Annealed Importance Sampling (AIS) approach for Variational Learning of Gaussian Process Latent Variable Models. This approach addresses the limitations of existing methods in analyzing complex data structures and high-dimensional spaces. The authors introduce a transition density and use a Langevin diffusion to approximate the posterior density. They also propose the usage of the reparameterization trick to simplify gradient computation and suggest employing stochastic gradient descent for sampling. Experimental results demonstrate that their method outperforms state-of-the-art approaches in terms of variational bounds, log-likelihoods, and convergence robustness.

Strengths

  1. The authors present their contribution alongside a very thorough theoretical discussion. The amount of detail provided in the derivation of the proposed method is impressive and it shows the amount of work put forth by the authors.

  2. The article is quite well written, and although at some points it can be quite dense and difficult to follow, this stems from the amount of information that is trying to be condensed in just a few pages. While this is a double-edged sword and certainly improvements could be made, I think the effort made by the authors is quite clear.

  3. The AIS approach allows for the analysis of complex data structures and high-dimensional spaces, which were challenging with previous methods. This increases the applicability and usefulness of Gaussian Process Latent Variable Models (GPLVMs) in various domains by combining Sequential Monte Carlo samplers and Variational Inference (VI). This enables a wider range of posterior distribution exploration, leading to better understanding and modelling capabilities.

  4. The authors propose an efficient algorithm by reparameterizing all variables in the ELBO, which leads to a simpler gradient computation and therefore an easier process to optimize model parameters during training.

  5. In the experiments performed, the authors show that the proposed method outperforms previous models in different aspects related to its tighter variational bounds, higher log-likelihoods, and more robust convergence. This indicates that their proposed approach is effective at capturing underlying patterns in data.

Weaknesses

  1. The usage of annealing techniques as well as MC samplers can mean strong computational demands, particularly when dealing with high-dimensional data or complex models. Several experiments are conducted in related settings, and scalability results are only reported in terms of iterations, while no information on the hardware used is provided. I expect the authors to provide this information in the final version of the draft.

  2. On the same line as the previous point, I expect the authors to release some version of the code used to produce these results. This is not mentioned in the text and I deem reproducibility to be considered an important factor.

  3. The selection of hyperparameter values appears to play a crucial role in the performance of the presented method. Finding optimal values for parameters such as the step size ($\eta$) or the selected number of bridging densities ($K$) may prove to be expensive and quite important to the final results. Moreover, even though it appears to be more restricted due to their usage, the choice for the evolution of the $\beta_k$ coefficients, while contained in $[0, 1]$, needs to be properly crafted as well.

  4. The proposed approach introduces additional complexity compared to traditional methods for Gaussian Process Latent Variable Models. This may make it more challenging to implement and understand, especially for researchers or practitioners with limited experience in this area.

Minor:

  • Considering the information conveyed in the text, I think the article is quite well written. However, at some points, I think it could be managed so that the discussion flows better. I suggest the authors summarize further the introduction in favour of section 2 so that further details can be provided about the required background since currently it is only touched on lightly and some ideas are pivotal here. On this same line, I suggest the authors include further references in the part of the text surrounding Eqs. 2,3 and 4, especially when mentioning something deemed to be "the classical MF-ELBO" or a "typical approximation". These are small points, but I deem them relevant nonetheless.

  • There seem to be some small typos throughout the text such as "fisrt" on page 4 or using "d" instead of "D" for dimensions on page 7 (e.g. "2d projections" instead of "2D"). I do not consider these important corrections at all since the text is very clean, I only suggest doing a final pass to fix these tiny mishaps.

Questions

  1. How does the current method scale with the amount of data present in terms of running time? What are the computational requirements to run experiments such as the ones presented in the article?

  2. How sensitive is the method to different choices of reparameterizations?

  3. Can you provide more insights into how the proposed annealing process affects the exploration of posterior distributions? What are some potential trade-offs or considerations when choosing an appropriate annealing schedule?

  4. The paper mentions that experimental results demonstrate improved performance on toy datasets and image datasets, but what about other types of data or real-world applications? Are there any known limitations or challenges when applying this approach to different domains? How about running experiments for bigger datasets such as Imagenet?

Comment

Weaknesses:

  1. The hardware. We have updated our experiments and conducted them on a Tesla A100 GPU in the revised "rebuttal revision".

  2. The code. In fact, we plan to publicly release the code and experimental setup related to our research paper upon publication. We will ensure that the code is user-friendly and comprehensible so that other researchers can validate and reproduce our experimental results.

  3. The selection of hyperparameter values. In Supplementary Material Appendix D of our paper, we have provided the settings for the number of bridging densities $K$, the step size ($\eta$), and the coefficients ($\beta_k$). However, we acknowledge that finding optimal values for these parameters still poses challenges, and we look forward to future work addressing your concerns.

  4. Time complexity. We have updated the time consumption and other relevant information of the proposed algorithm in the "rebuttal revision" appendix. Please refer to it for more details.

Minor: Fluency.

For the fluency of the article, we have added more references in Section 2 in the "rebuttal revision." Furthermore, we plan to include additional background knowledge in Section 2 in the final version to provide readers with a better understanding of our method.

Typos. We have corrected these typos and will perform a final check.

Questions:

  1. Running time and computational requirements to run experiments.

R1: The computational requirements to run experiments, as presented in the article, depend on factors such as the size of the dataset, the number of inducing points, and the desired number of bridging densities ($K$). The time complexity of the proposed method is $O(nm^2)$, where $n$ is the dataset size and $m$ is the number of inducing points. Additionally, the complexity linearly depends on the number of bridging densities ($K$) due to the introduction of annealed importance sampling. Hence, the computational requirements will vary depending on these factors. It is important to select these values appropriately to balance algorithmic time complexity and practical effectiveness.

2. How sensitive is the method to different choices of reparameterizations?

R2: While we have not conducted additional experiments specifically addressing this sensitivity issue, we have carefully considered and compared several reparameterization choices in our original study. We believe that the findings presented in our paper can provide insights into the impact of these choices on the method's performance. As you previously mentioned, the reparameterization method, such as Euler-Maruyama discretization, ''leads to a simpler gradient computation and therefore an easier process to optimize model parameters during training'' . However, we acknowledge that further investigations and experiments can certainly enhance our understanding of the influence of different reparameterizations. We will take this suggestion into account for future studies.

  3. More insights into how the proposed annealing process affects the exploration of posterior distributions? Some potential trade-offs or considerations when choosing an appropriate annealing schedule?

R3: The proposed annealing process in SG-AIS plays a crucial role in exploring the posterior distributions.

In the context of this paper, the posterior distribution refers to the distribution of the latent variables given the observed data. This distribution is often intractable and challenging to sample from directly. SG-AIS aims to approximate this posterior distribution by transforming it into a sequence of intermediate distributions, which can be more tractable and easier to sample from.

The annealing process gradually transforms the posterior distribution by introducing a temperature parameter $\beta$. By annealing from $\beta = 0$ to $\beta = 1$, we move from an initial distribution, where the posterior is approximated by a simpler distribution, to the target posterior distribution itself. The key idea behind annealing is that it allows for a smoother exploration of the posterior space. At each intermediate distribution, we can use importance sampling to estimate the evidence by sampling from the proposal distribution and reweighting the samples using the ratios of the target and proposal distributions.
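
For concreteness, a standard way to write this geometric bridging path and the resulting AIS weight (a textbook formulation in the spirit of Neal's AIS; the paper's exact notation may differ) is

$$\pi_k(H) \;\propto\; q_0(H)^{1-\beta_k}\, p(X, H)^{\beta_k}, \qquad 0 = \beta_0 < \beta_1 < \dots < \beta_K = 1,$$

$$\log w \;=\; \sum_{k=1}^{K} \bigl[\log \pi_k(H_{k-1}) - \log \pi_{k-1}(H_{k-1})\bigr] \;=\; \sum_{k=1}^{K} (\beta_k - \beta_{k-1})\bigl[\log p(X, H_{k-1}) - \log q_0(H_{k-1})\bigr],$$

where $H_{k-1}$ is the sample before the $k$-th transition, so that $\mathbb{E}[w] = p(X)$ and, by Jensen's inequality, $\mathbb{E}[\log w]$ is a lower bound on $\log p(X)$.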

As the annealing process progresses, the samples from the proposal distribution gradually become more representative of the target distribution. This means that the exploration of the posterior space is not limited to a specific region but covers a wider range of possible configurations of the latent variables. (Due to limited space, please refer to the text below ...)

Comment

The benefit of this exploration is that it allows for a more accurate estimation of the evidence, which corresponds to a tighter lower bound in the variational learning framework. By gradually annealing the temperature and exploring different distributions, SG-AIS can capture more complex structures in the posterior distribution, leading to better variational approximations in complex data and high-dimensional spaces.

When choosing an appropriate annealing schedule for Stochastic Gradient Annealed Importance Sampling (SG-AIS), there are several trade-offs and considerations to keep in mind:

  1. Computational Efficiency: The annealing schedule should be carefully designed to balance the computational resources required for estimating the evidence. Too many bridging densities can lead to excessive computational burden, while too few densities may result in less accurate estimates.

  2. Exploration vs Exploitation: The annealing schedule should strike a balance between exploration and exploitation of the posterior distribution. An aggressive schedule that moves quickly from the base distribution to the posterior may lead to exploration limitations, while a slow schedule may lead to insufficient exploration and inefficiency.

  3. Smoothness of Transition: The annealing schedule should ensure a smooth transition between bridging densities. Abrupt changes in the densities can result in high-variance importance weights, which may lead to inaccurate estimates. Smooth transitions can be achieved by gradually adjusting the temperature or using appropriate interpolation functions (see the schedule sketch after this list).

  4. Base Distribution: The choice of a suitable base distribution is critical for effective sampling in AIS. It should be computationally efficient to sample from and cover the entire support of the target distribution. The base distribution should strike a balance between simplicity and representativeness.
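
As an illustration of how the $\beta_k$ can be placed more smoothly or more aggressively (a hypothetical helper, not code from the paper), two simple choices are a uniform grid and a sigmoidal warp that spaces neighbouring densities more closely near the endpoints:

```python
import numpy as np

def beta_schedule(K, kind="linear"):
    """Return an annealing schedule 0 = beta_0 < ... < beta_K = 1."""
    t = np.linspace(0.0, 1.0, K + 1)
    if kind == "linear":
        return t                                      # uniform spacing
    if kind == "sigmoid":
        s = 1.0 / (1.0 + np.exp(-10.0 * (t - 0.5)))   # S-shaped warp
        return (s - s[0]) / (s[-1] - s[0])            # rescale so beta_0 = 0, beta_K = 1
    raise ValueError(f"unknown schedule: {kind}")

# beta_schedule(10, "sigmoid") places more bridging densities near beta = 0
# and beta = 1, at the cost of larger jumps in the middle of the path.
```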

Question 4: The paper mentions that experimental results demonstrate improved performance on toy datasets and image datasets, but what about other types of data or real-world applications? Are there any known limitations or challenges when applying this approach to different domains? How about running experiments for bigger datasets such as Imagenet?

R4: The paper indeed focused on demonstrating the improved performance of SG-AIS on toy datasets and image datasets. However, when applying this approach to other types of data or real-world applications, there may be some limitations or challenges to consider.

One potential limitation could be the scalability of the method. As the size of the dataset increases, the computational resources required for estimating the evidence using SG-AIS may become more demanding. This is particularly true for large-scale datasets such as ImageNet, which contain millions of images. Running experiments on such massive datasets might pose challenges in terms of computational efficiency and memory requirements.

We would like to emphasize that there is no precedent of GPLVM work using ImageNet as a dataset. Given that ImageNet involves higher-dimensional data, it may be more appropriate to combine GPLVM with other deep learning tools, such as convolutional neural networks (CNNs) and transformers. We are currently exploring broader application scenarios to incorporate these tools effectively and leave room for future work. Thank you for bringing up this point, as it is important to consider the specific requirements and complexities of high-dimensional datasets like ImageNet when exploring the applicability of SG-AIS.

Additionally, the annealing schedule plays a crucial role in the exploration of the posterior distribution. Designing an appropriate annealing schedule may require domain knowledge or trial and error experimentation. It might be necessary to tune the schedule to ensure a balance between exploration and exploitation, as well as a smooth transition between bridging densities.

Regarding the applicability of SG-AIS in real-world applications, its performance may depend on the specific characteristics and requirements of the domain. Different datasets and applications may exhibit unique challenges, such as data sparsity, high dimensionality, or non-linear relationships, which could affect the effectiveness of SG-AIS. Evaluating the performance of SG-AIS in different domains and addressing these challenges would require further experimentation and investigation.

In summary, further research and experimentation are needed to assess its performance and address these challenges in real-world applications.

Comment

I am writing to express my sincere gratitude for your time and effort in reviewing our article. We are deeply grateful for your valuable insights, thoughtful evaluation, and constructive suggestions. Your feedback has undoubtedly played an integral role in enhancing the quality of our research. We consider ourselves fortunate to have had the opportunity to benefit from your expertise.

Thank you for your dedication to ensuring the rigor and excellence of scientific research. We appreciate your continued support and remain at your disposal for any further inquiries or suggestions.

Comment

I want to express my gratitude to the authors for addressing my questions and concerns about the paper in such detail. I value the effort put into responding not just to my points but also to those raised by other reviewers. While I believe the contribution is good, there's room for enhancement in the experimental section. Specifically, detailing the limitations of the proposed approach in the regimes of large and/or wide datasets and challenging synthetic experiments could be beneficial. Despite these considerations, based on the information provided, I am inclined to uphold my initial positive score.

Comment

We understand the importance of highlighting the limitations of our proposed approach in the context of large and/or wide datasets, as well as the significance of conducting challenging synthetic experiments. We will certainly take this into consideration and strive to incorporate these aspects into our paper to further enhance its contribution.

Once again, we would like to express our gratitude for your positive score and for recognizing the value of our work. Your support means a lot to us, and we are committed to addressing the suggested improvements to deliver a stronger and more comprehensive paper. Thank you for your continued support and encouragement.

AC Meta-Review

This paper proposes an inference method for variational learning of Gaussian process latent variable models (GPLVM) based on annealed importance sampling. The main motivation is that of alleviating the deficiencies of previous approaches such as the importance-weighting method of Salimbeni et al (2019).

I thank the authors and the reviewers for the engaging discussions. Some of the strengths highlighted by the reviewers are: (i) the thoroughness and clarity of the mathematical development and the novelty of the proposed solution. The main concern raised during the discussion is the lack of more rigorous experimentation, for example: (a) evaluation of hyper-parameter sensitivity and practical ways to set these; (b) more challenging datasets; and (c) more competitive baselines. I agree with the reviewers that these are weaknesses that diminish the contributions of the paper. One of the reviewers pointed out the missing comparison with the standard GPLVM, and the authors have provided some additional results under very constrained settings.

Because of the reasons above, my recommendation is that of a weak rejection, as I believe the paper has potential to be published at a top ML venue.

Why not a higher score

A more solid experimental evaluation is required.

Why not a lower score

N/A

Final Decision

Reject