PaperHub
Average rating: 6.0/10 · Decision: Rejected · 4 reviewers
Ratings: 6, 5, 5, 8 (min 5, max 8, std. dev. 1.2)
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

In-Context Learning for Full Bayesian Inference

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-02-05
TL;DR

We demonstrate that in-context learning can be effectively used for full Bayesian inference on real-world data.

Abstract

Keywords

In-Context Learning · Prior-Fitted-Networks · Bayesian Inference

Reviews & Discussion

Review (Rating: 6)

This work proposes an in-context learning (ICL) approach for full Bayesian inference. Specifically, PFN is used as a prior, and flow matching is adopted for posterior inference. Experiments show that ICL can perform similarly to the Hamiltonian Monte Carlo sampler, and its sample quality is better than variational inference.

Strengths

  • The experimental results demonstrate that the proposed method approximates the posterior more accurately than VI-based approaches. In particular, Figure 2 clearly illustrates that the ICL's posterior aligns well with the ground-truth HMC's, while the posteriors by VIs fail to approximate multi-modal distributions. Interestingly, the autoregressive flow-based VI also fails in that way, contrary to its expected property.

Weaknesses

  • How to infer the posterior using the proposed method is not explicitly explained.
  • The writing could be improved. For example, the current manuscript carefully explains continuous normalizing flows, which is appreciated, but it is unclear how they are used in the proposed method.
  • To my understanding, in the ICL framework, the (foundation) model, typically a Transformer, learns algorithms implicitly through pre-training (e.g., Garg's work). In this work, however, the model seems designed for inference and pre-trained on many related tasks. I would call such a framework meta-learning, rather than ICL.

Questions

  • It would be interesting if the runtime of both pre-training and inference of the proposed method were shown.
Comment

Thank you for taking the time to read our manuscript and providing detailed feedback.

We address your questions and comments below. We use double quotation marks (") to mark parts directly quoted from the revised manuscript. In the manuscript itself, all added parts have the text color blue.

How to infer the posterior using the proposed method is not explicitly explained [...] how [continuous normalizing flows] are used in the proposed method

To address these points, we have added a new section "Implementing Flow Matching" (Section 3.5) to the manuscript that details how exactly the flow matching framework is used during training and how posterior samples are generated based on a pre-trained in-context learner.

Below is the added text quoted from the revised manuscript for your convenience:

Implementing Flow Matching

“During the training phase, a tuple $(\mathbf{z}_1, \mathbf{x})$ is drawn from the distribution $P^{\mathbf{z},\mathbf{x}}$. Additionally, a time step $t \sim \mathcal{U}[0,1]$ and a sample $\mathbf{z}_0$ are drawn from the base distribution $P_{\mathcal{B}}$, which is a standard Gaussian for all our applications. Subsequently, the ground-truth conditional flow $\psi(\mathbf{z}_0 \mid \mathbf{x}) = (1 - (1 - \sigma_{\min})t)\,\mathbf{z}_0 + t\,\mathbf{z}_1$ is computed, pushing forward $P_{\mathcal{B}}$ into $P^{\mathbf{z} \mid \mathbf{x}}$ up to time point $t$. The transformer encoder processes $\mathbf{x}$, and the decoder takes the encoder's representation into account in order to output $v_{t,\mathbf{x}}^{\theta}(\psi(\mathbf{z}_0 \mid \mathbf{x}))$. This output should match the vector field that describes how the ground-truth flow $\psi(\mathbf{z}_0 \mid \mathbf{x})$ continues at time $t$. The discrepancy to the ground-truth vector field is measured with the MSE loss in Equation 6.

In the sampling phase, we are given $\mathbf{x}$ and the goal is to sample from $P^{\mathbf{z} \mid \mathbf{x}}$. To do so, first a vector $\mathbf{z}_0 \sim P_{\mathcal{B}}$ is drawn. The data $\mathbf{x}$ is passed through the encoder. The decoder defines a function that maps a time point $t$ and a vector $\boldsymbol{\nu}$ onto a vector field, $(t, \boldsymbol{\nu}) \mapsto v_{t,\mathbf{x}}^{\theta}(\boldsymbol{\nu})$, taking $\mathbf{x}$ into account. This function is given to an ODE solver in order to forward-solve the corresponding ODE over $0 \leq t \leq 1$.”

The new section provides a computational perspective on flow matching as it is used in conjunction with the introduced architecture.
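For concreteness, the training objective and sampling loop described in the quoted section can be sketched as follows. This is a minimal NumPy illustration, not the implementation from the paper: `model_vf` is a stub for the transformer encoder/decoder $v_{t,\mathbf{x}}^{\theta}$ (conditioning on $\mathbf{x}$ is folded into it), the value of `SIGMA_MIN` is an assumption, and a plain Euler scheme stands in for the actual ODE solver.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed value; the manuscript's sigma_min is not quoted here

def conditional_flow(z0, z1, t, sigma_min=SIGMA_MIN):
    """Ground-truth OT conditional flow psi_t(z0 | x) = (1 - (1 - sigma_min) t) z0 + t z1."""
    return (1.0 - (1.0 - sigma_min) * t) * z0 + t * z1

def target_vector_field(z0, z1, sigma_min=SIGMA_MIN):
    """Time derivative of the conditional flow: z1 - (1 - sigma_min) z0 (constant in t)."""
    return z1 - (1.0 - sigma_min) * z0

def fm_loss(model_vf, z0, z1, t):
    """MSE between the model's vector field at psi_t(z0) and the ground truth
    (the role played by Equation 6)."""
    zt = conditional_flow(z0, z1, t)
    return np.mean((model_vf(t, zt) - target_vector_field(z0, z1)) ** 2)

def sample(model_vf, z0, n_steps=100):
    """Sampling phase: forward-solve dz/dt = v_theta(t, z) from t=0 to t=1
    with a plain Euler scheme (a stand-in for the actual ODE solver)."""
    z, dt = z0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        z = z + dt * model_vf(i * dt, z)
    return z
```

With the exact target field plugged in as `model_vf`, the loss vanishes and the Euler solve transports $\mathbf{z}_0$ approximately onto $\mathbf{z}_1$.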

To my understanding, in the ICL framework, the (foundation) model, typically, Transformer, learns algorithms implicitly through the pre-training (e.g., Garg's work). In this work, however, the model seems designed for inference and pre-training on many related tasks. I would call such a framework meta-learning, rather than ICL.

We fully agree that ICL is a special case of meta-learning and included an additional explanation in the revised manuscript in the first paragraph of Section 2:

“ICL is a special case of meta-learning [1] characterized by using a large pre-trained model to learn from a context dataset without explicitly updating task-specific parameters. [...]”

We use the more specific term in-context learning instead of meta-learning to emphasize that our approach yields posterior samples directly based on a context set processed by a large transformer network, without requiring additional parameter updates. Taking seminal work on meta-learning into account, e.g. [2,3,4], we believe the term in-context learning is useful to contrast the underlying process of our approach with that of classical meta-learning methods.

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[1] Vilalta, Ricardo, and Youssef Drissi. "A perspective view and survey of meta-learning." Artificial intelligence review 18 (2002): 77-95.

[2] Finn, Chelsea, et al. "Model-agnostic meta-learning for fast adaptation of deep networks." International conference on machine learning. PMLR, 2017.

[3] Sung, Flood, et al. "Learning to compare: Relation network for few-shot learning." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[4] Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in neural information processing systems 29 (2016).

Comment

It would be interesting if the runtime of both pre-training and inference of the proposed method were shown.

Thank you for this suggestion. We now include the runtime of all methods during pre-training and inference in a new section in the manuscript ("Appendix E Runtimes"). This demonstrates that while in-context learning requires a relatively large amount of computational resources during the pre-training phase, generating samples based on a new dataset can be done much more efficiently with ICL compared to HMC (for FA, even an eight-fold speed-up). For your convenience, we also include the results (in Appendix E in the revised manuscript) below.

| Scenario | Method | Mean Runtime (s) |
|---|---|---|
| GLM | Laplace Approximation | 10.48 ± 0.25 |
| GLM | VI: DiagonalNormal | 12.02 ± 0.26 |
| GLM | VI: MultivariateNormal | 13.70 ± 0.29 |
| GLM | VI: Structured Normal | 19.81 ± 0.98 |
| GLM | VI: IAF | 15.44 ± 0.30 |
| GLM | HMC | 120.24 ± 13.94 |
| GLM | ICL | 107.79 ± 17.36 |
| FA | Laplace Approximation | 17.85 ± 0.21 |
| FA | VI: DiagonalNormal | 20.94 ± 0.66 |
| FA | VI: MultivariateNormal | 20.84 ± 0.28 |
| FA | VI: Structured Normal | 36.17 ± 0.61 |
| FA | VI: IAF | 23.75 ± 0.38 |
| FA | HMC | 248.26 ± 57.88 |
| FA | ICL | 31.49 ± 4.97 |
| GMM | Laplace Approximation | 27.52 ± 0.40 |
| GMM | VI: DiagonalNormal | 29.74 ± 0.57 |
| GMM | VI: MultivariateNormal | 30.50 ± 0.41 |
| GMM | VI: Structured Normal | 42.44 ± 0.44 |
| GMM | VI: IAF | 33.39 ± 0.49 |
| GMM | HMC | 239.67 ± 32.71 |
| GMM | ICL | 93.88 ± 10.47 |

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

Comment

Dear Reviewer prH2,

Thank you again for providing highly valuable feedback on our manuscript.

As the discussion period will end in approximately 72 hours, we would like to ask if our answers so far have successfully addressed your concerns and answered your questions.

In summary, we addressed the points raised by you as follows: We added a detailed explanation on how flow matching is used in our approach to the main body of the manuscript, and explain why we use the term in-context learning instead of meta-learning. We also found your proposal to have the runtimes of all methods in the paper very valuable and thus now include them in the revised version.

If there are any remaining concerns, please let us know. In case we sufficiently addressed your questions, we would greatly appreciate it if you consider raising the score.

Comment

Thank you for the detailed clarification, which helped me understand your work more deeply. This raises another question: why is a Transformer necessary in your work?

Comment

FA: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Encoder | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | MLP | 0.579 (± 0.015) | 0.017 (± 0.006) | 0.364 (± 0.029) | 0.634 (± 0.014) | 0.013 (± 0.004) | 0.331 (± 0.010) |
| Scenario 1 | Transformer | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) | 0.606 (± 0.038) | 0.068 (± 0.069) | 0.265 (± 0.078) |
| Scenario 2 | MLP | 0.562 (± 0.038) | 0.037 (± 0.042) | 0.308 (± 0.097) | 0.632 (± 0.068) | 0.182 (± 0.407) | 0.339 (± 0.174) |
| Scenario 2 | Transformer | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) | 0.622 (± 0.032) | 0.098 (± 0.039) | 0.287 (± 0.046) |
| Scenario 3 | MLP | 0.539 (± 0.025) | 0.023 (± 0.022) | 0.278 (± 0.116) | 0.680 (± 0.019) | 0.268 (± 0.044) | 0.253 (± 0.017) |
| Scenario 3 | Transformer | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) | 0.609 (± 0.019) | 0.124 (± 0.037) | 0.179 (± 0.018) |

GMMs: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Encoder | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | MLP | 0.873 (± 0.045) | 0.242 (± 0.363) | 2.203 (± 1.098) | 0.917 (± 0.067) | 0.891 (± 1.150) | 4.528 (± 2.701) |
| Scenario 1 | Transformer | 0.760 (± 0.092) | 0.303 (± 0.548) | 2.095 (± 1.692) | 0.847 (± 0.082) | 0.486 (± 0.623) | 4.054 (± 2.782) |
| Scenario 2 | MLP | 0.921 (± 0.035) | 0.291 (± 0.205) | 2.870 (± 0.710) | 0.992 (± 0.005) | 0.399 (± 0.127) | 5.505 (± 1.144) |
| Scenario 2 | Transformer | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) | 0.937 (± 0.041) | 0.282 (± 0.131) | 3.947 (± 1.055) |
| Scenario 3 | MLP | 0.999 (± 0.000) | 0.438 (± 0.181) | 11.502 (± 9.719) | 1.000 (± 0.000) | 1.001 (± 0.149) | 26.282 (± 3.731) |
| Scenario 3 | Transformer | 0.999 (± 0.001) | 0.267 (± 0.154) | 7.234 (± 2.974) | 1.000 (± 0.000) | 1.155 (± 0.258) | 26.956 (± 3.114) |
Comment

For your convenience, we also include the results (in Appendix H in the revised manuscript) below.

GLMs: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Encoder | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 2 | MLP | 0.942 (± 0.093) | 1.783 (± 1.048) | 2.503 (± 0.814) | 0.968 (± 0.012) | 1.528 (± 0.394) | 2.271 (± 0.315) |
| Scenario 2 | Transformer | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) | 0.768 (± 0.033) | 0.143 (± 0.089) | 0.411 (± 0.094) |
| Scenario 3 | MLP | 0.957 (± 0.075) | 2.236 (± 1.218) | 2.681 (± 1.130) | 0.972 (± 0.012) | 1.658 (± 0.450) | 2.076 (± 0.427) |
| Scenario 3 | Transformer | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) | 0.576 (± 0.027) | 0.037 (± 0.026) | 0.257 (± 0.044) |
| Scenario 5 | MLP | 0.845 (± 0.115) | 1.066 (± 0.859) | 1.166 (± 0.996) | 0.890 (± 0.055) | 1.223 (± 0.791) | 1.102 (± 0.383) |
| Scenario 5 | Transformer | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) | 0.610 (± 0.045) | 0.046 (± 0.020) | 0.242 (± 0.038) |
Comment

Dear Reviewer prH2,

We are pleased to hear that our clarifications helped you understand our work. We would also like to thank you for inquiring about the role of the transformer in our paper, which is indeed very insightful.

This raises another question: why is a Transformer necessary in your work?

There are several conceptual reasons why transformers are central to our approach:

First, transformers are a commonly used baseline architecture working extremely well across a broad variety of fields such as computer vision, NLP, or even tabular data [1,2,3]. In addition, the papers introducing the PFN methodology [4,5], a central pillar of our approach, heavily rely on the transformer. Therefore, using a transformer is a natural and reasonable choice for our proposal, whose central point is to extend [4,5] and demonstrate the feasibility of in-context learning for full Bayesian inference. Furthermore, the notion of in-context learning itself, which we want to extend to full Bayesian inference, is closely connected to the transformer architecture [6,7,8]. This is confirmed by findings suggesting the attention mechanism to play a crucial role for in-context learning [9,10].

Ablation Study

Intrigued by your question, we conducted another ablation experiment to study your question from an empirical point-of-view and investigate in which cases the transformer architecture is indeed a crucial component. For the ablation, we exchange the transformer-encoder in our approach with an MLP using skip connections and batch normalization and ensure that the transformer encoder and the MLP have approximately the same number of parameters. We then ran the previous experiments using the different models (GLMs, FA, GMMs) in three different scenarios each; every scenario has 50 synthetic and 17 real-world datasets.

In summary, the results confirm our initial hypothesis that a transformer encoder is important for complex conditioning on the input data.

For the GLM scenarios, this is especially pronounced with substantial performance drops across all scenarios when using the MLP baseline instead of the transformer. In the case of FA and GMMs, the improvement of the transformer-encoder is smaller, albeit consistent. In particular, there is no case where the MLP performance is above one standard error of the transformer’s performance. For FA and GMMs, the gap between the MLP and the transformer encoder is, for example, especially noticeable in scenarios 1 and 3 of the FA cases on the real-world data and in scenarios 1 and 2 for the GMM cases.

If you have any remaining concerns or questions, please let us know.

References

[1] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." International Conference on Learning Representations. 2020.

[2] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

[3] Gorishniy, Yury, et al. "Revisiting deep learning models for tabular data." Advances in Neural Information Processing Systems 34 (2021): 18932-18943.

[4] Müller, Samuel, et al. "Transformers Can Do Bayesian Inference." International Conference on Learning Representations.

[5] Hollmann, Noah, et al. "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second." The Eleventh International Conference on Learning Representations.

[6] Panwar, Madhur, Kabir Ahuja, and Navin Goyal. "In-Context Learning through the Bayesian Prism." The Twelfth International Conference on Learning Representations.

[7] Dong, Qingxiu, et al. "A survey on in-context learning." arXiv preprint arXiv:2301.00234 (2022).

[8] Garg, Shivam, et al. "What can transformers learn in-context? a case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022): 30583-30598.

[9] Akyürek, Ekin, et al. "In-Context Language Learning: Architectures and Algorithms." Forty-first International Conference on Machine Learning.

[10] Lee, Ivan, Nan Jiang, and Taylor Berg-Kirkpatrick. "Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability." The Twelfth International Conference on Learning Representations.

Comment

Thank you for answering my question.

Comment

Dear Reviewer prH2,

We are happy to hear that we could answer your question. Do you have any further concerns or questions?

If not, we would be grateful if you could consider raising the score.

Thank you again for your insightful comments and active participation in the rebuttal.

Review (Rating: 5)

This paper explores using in-context learning within transformers to perform full Bayesian inference on complex statistical models. The authors propose a framework leveraging continuous normalizing flows and flow matching, enabling transformers to infer posterior distributions in-context without explicit parameter updates. The approach is applied to GLMs, latent factor models, and GMMs, which often require techniques like Hamiltonian Monte Carlo (HMC) or variational inference (VI). The reliance on synthetic data for training may introduce biases when tested on real-world distributions that differ from synthetic priors. The results highlight ICL's potential as an efficient, flexible Bayesian inference tool that can operate directly on a diverse range of models and posterior complexities. This paper opens up further exploration into using ICL for full probabilistic modeling, potentially extending its use beyond traditional Bayesian inference.

Strengths

  1. Innovative Framework for Bayesian Inference: The paper introduces a novel approach to using in-context learning (ICL) with transformers for Bayesian inference. By employing continuous normalizing flows (CNFs) and flow matching, the framework effectively handles high-dimensional posterior distributions, making it relevant for complex statistical models like generalized linear models (GLMs) and Gaussian mixture models (GMMs).

  2. Strong Empirical Comparisons: The authors conduct extensive experiments, benchmarking ICL-based Bayesian inference against both Hamiltonian Monte Carlo (HMC) and various variational inference (VI) methods. Their approach shows promising results, especially in scenarios where VI models struggle with multimodality or non-normal distributions.

  3. Real-World Applications: This paper evaluates performance on real-world datasets, enhancing the practical applicability of the proposed method. This broadens the scope of ICL by demonstrating its potential to rival or even outperform established inference techniques on complex, non-synthetic data.

Weaknesses

  1. Computational Expense and Scalability Issues. While ICL may yield accurate approximations of posterior distributions, it requires a high computational investment in training. This can limit the approach’s scalability, especially in real-world applications requiring quick adaptability across various data regimes. The paper's acknowledgment of this limitation highlights the need for optimization and scalability improvements, which were not sufficiently addressed.

  2. Lack of Detailed Architectural Justification. Although the authors introduce a transformer-based architecture with flow matching for Bayesian inference, the rationale for specific architectural choices, such as the reliance on continuous normalizing flows over other methods, is underexplored. This omission leaves unanswered questions regarding the necessity or superiority of these choices compared to simpler or more established architectures for similar tasks.

Questions

N/A. See weaknesses.

Comment

Thank you for taking the time to read our manuscript and providing detailed feedback.

We address your questions and comments below. We use double quotation marks (") to mark parts directly quoted from the revised manuscript. In the manuscript itself, all added parts have the text color blue.

While ICL may yield accurate approximations of posterior distributions, it requires a high computational investment in training.

In-context learning is inherently expensive during the pre-training phase, and our method is no exception: the arguably most prominent in-context learners, Large Language Models, are currently trained on more than ten thousand GPUs in order to achieve state-of-the-art performance [1]. We have revised the limitations section in Section 5 of the manuscript to emphasize this aspect even more clearly. It is noteworthy, however, that when practically applied to a new dataset, our approach can sample the posterior distribution very quickly. To substantiate this claim, we have conducted additional runtime comparisons of all methods. While some VI methods are faster than ICL, at the cost of inferior performance, our approach is consistently faster than HMC (for FA, even an eight-fold speed-up).

For your convenience, we also include the results (in Appendix E in the revised manuscript) below.

| Scenario | Method | Mean Runtime (s) |
|---|---|---|
| GLM | Laplace Approximation | 10.48 ± 0.25 |
| GLM | VI: DiagonalNormal | 12.02 ± 0.26 |
| GLM | VI: MultivariateNormal | 13.70 ± 0.29 |
| GLM | VI: Structured Normal | 19.81 ± 0.98 |
| GLM | VI: IAF | 15.44 ± 0.30 |
| GLM | HMC | 120.24 ± 13.94 |
| GLM | ICL | 107.79 ± 17.36 |
| FA | Laplace Approximation | 17.85 ± 0.21 |
| FA | VI: DiagonalNormal | 20.94 ± 0.66 |
| FA | VI: MultivariateNormal | 20.84 ± 0.28 |
| FA | VI: Structured Normal | 36.17 ± 0.61 |
| FA | VI: IAF | 23.75 ± 0.38 |
| FA | HMC | 248.26 ± 57.88 |
| FA | ICL | 31.49 ± 4.97 |
| GMM | Laplace Approximation | 27.52 ± 0.40 |
| GMM | VI: DiagonalNormal | 29.74 ± 0.57 |
| GMM | VI: MultivariateNormal | 30.50 ± 0.41 |
| GMM | VI: Structured Normal | 42.44 ± 0.44 |
| GMM | VI: IAF | 33.39 ± 0.49 |
| GMM | HMC | 239.67 ± 32.71 |
| GMM | ICL | 93.88 ± 10.47 |

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

Comment

the rationale for specific architectural choices, such as the reliance on continuous normalizing flows over other methods, is underexplored.

While exploring other architectural choices is an interesting avenue, we would like to note that [2] introduces theoretical results backing the choice of continuous normalizing flows over, for instance, various variants of Diffusion methods. They also empirically find flow matching to be a more robust and efficient objective. [3] confirms these results across several simulation-based inference tasks that are closely related to those in our paper.

To again substantiate our claims, we added additional empirical results to the manuscript by replacing flow matching based on optimal transport (OT) paths in our approach with a diffusion objective based on variance preserving (VP) diffusion probability paths from [4]. When comparing samples based on the VP diffusion probability paths and our initial flow matching implementation with OT transport paths, we find that our original approach consistently yields superior samples.
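For reference, the two families of conditional probability paths being compared can be written as follows. This is a sketch following the notation of [2] and [4]; the exact parameterization in the revised manuscript may differ slightly.

```latex
% Optimal transport (OT) conditional path, as used in our approach:
p_t(\mathbf{z} \mid \mathbf{z}_1)
  = \mathcal{N}\!\left(\mathbf{z} \,\middle|\, t\,\mathbf{z}_1,\;
      \bigl(1 - (1 - \sigma_{\min})\,t\bigr)^2 \mathbf{I}\right)

% Variance-preserving (VP) diffusion conditional path [4],
% with noise schedule \beta(s):
p_t(\mathbf{z} \mid \mathbf{z}_1)
  = \mathcal{N}\!\left(\mathbf{z} \,\middle|\, \alpha_{1-t}\,\mathbf{z}_1,\;
      \bigl(1 - \alpha_{1-t}^2\bigr)\,\mathbf{I}\right),
\qquad
\alpha_t = \exp\!\Bigl(-\tfrac{1}{2}\textstyle\int_0^t \beta(s)\,ds\Bigr)
```

The OT path moves mass along straight lines with linearly shrinking noise, whereas the VP path follows the curved schedule induced by $\beta(s)$, which [2] argues leads to less efficient training and sampling.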

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[1] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

[2] Lipman, Yaron, et al. "Flow matching for generative modeling." arXiv preprint arXiv:2210.02747 (2022).

[3] Wildberger, Jonas, et al. "Flow matching for scalable simulation-based inference." Advances in Neural Information Processing Systems 36 (2024).

[4] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).

Comment

For your convenience, we also include the results when comparing samples based on the VP diffusion probability paths and our initial flow matching implementation with OT transport paths (in Appendix C in the revised manuscript) below.

GLMs: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 2 | Diffusion paths | 0.961 (± 0.040) | 1.525 (± 0.777) | 3.354 (± 1.333) | 0.961 (± 0.016) | 1.347 (± 0.365) | 2.025 (± 0.270) |
| Scenario 2 | OT paths | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) | 0.768 (± 0.033) | 0.143 (± 0.089) | 0.411 (± 0.094) |
| Scenario 3 | Diffusion paths | 0.903 (± 0.111) | 1.080 (± 0.564) | 1.733 (± 0.408) | 0.936 (± 0.013) | 1.002 (± 0.203) | 1.442 (± 0.103) |
| Scenario 3 | OT paths | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) | 0.576 (± 0.027) | 0.037 (± 0.026) | 0.257 (± 0.044) |
| Scenario 5 | Diffusion paths | 0.691 (± 0.074) | 0.211 (± 0.143) | 0.708 (± 0.233) | 0.681 (± 0.038) | 0.182 (± 0.093) | 0.554 (± 0.090) |
| Scenario 5 | OT paths | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) | 0.610 (± 0.045) | 0.046 (± 0.020) | 0.242 (± 0.038) |

FA: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | Diffusion paths | 0.622 (± 0.043) | 0.207 (± 0.121) | 0.692 (± 0.192) | 0.595 (± 0.012) | 0.089 (± 0.011) | 0.475 (± 0.019) |
| Scenario 1 | OT paths | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) | 0.606 (± 0.038) | 0.068 (± 0.069) | 0.265 (± 0.078) |
| Scenario 2 | Diffusion paths | 0.826 (± 0.036) | 0.768 (± 0.238) | 1.219 (± 0.276) | 0.878 (± 0.028) | 0.793 (± 0.154) | 1.056 (± 0.084) |
| Scenario 2 | OT paths | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) | 0.622 (± 0.032) | 0.098 (± 0.039) | 0.287 (± 0.046) |
| Scenario 3 | Diffusion paths | 0.751 (± 0.048) | 0.387 (± 0.216) | 0.834 (± 0.163) | 0.944 (± 0.008) | 1.514 (± 0.056) | 1.332 (± 0.028) |
| Scenario 3 | OT paths | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) | 0.609 (± 0.019) | 0.124 (± 0.037) | 0.179 (± 0.018) |

GMMs: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | Diffusion paths | 0.924 (± 0.024) | 0.241 (± 0.381) | 2.195 (± 1.431) | 0.958 (± 0.030) | 0.890 (± 0.912) | 5.328 (± 2.544) |
| Scenario 1 | OT paths | 0.760 (± 0.092) | 0.303 (± 0.548) | 2.095 (± 1.692) | 0.847 (± 0.082) | 0.486 (± 0.623) | 4.054 (± 2.782) |
| Scenario 2 | Diffusion paths | 0.942 (± 0.020) | 0.213 (± 0.187) | 2.748 (± 0.659) | 0.984 (± 0.012) | 0.411 (± 0.162) | 5.397 (± 1.458) |
| Scenario 2 | OT paths | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) | 0.937 (± 0.041) | 0.282 (± 0.131) | 3.947 (± 1.055) |
| Scenario 3 | Diffusion paths | 1.000 (± 0.000) | 0.582 (± 0.280) | 8.708 (± 4.945) | 1.000 (± 0.000) | 1.869 (± 0.342) | 33.230 (± 8.095) |
| Scenario 3 | OT paths | 0.999 (± 0.001) | 0.267 (± 0.154) | 7.234 (± 2.974) | 1.000 (± 0.000) | 1.155 (± 0.258) | 26.956 (± 3.114) |
Comment

Dear Reviewer rWy9,

Besides the ablation we provided for your initial review regarding the diffusion objective, we have now also conducted an extensive additional ablation study for the transformer encoder, a key component in our ICL approach.

We think that providing further conceptual justification as well as empirical evidence on this choice is important for addressing your concern that

the rationale for specific architectural choices [...] is underexplored

Conceptual reasons

There are several conceptual reasons why transformers are central to our approach:

Transformers are a widely used baseline architecture that perform exceptionally well across various domains, including computer vision, natural language processing (NLP), and tabular data [1,2,3]. Moreover, the PFN methodology introduced in [4,5], which forms a cornerstone of our approach, is built directly on the transformer architecture. Consequently, employing a transformer is both a logical and appropriate choice for our proposal, which focuses on extending [4,5] to showcase the feasibility of in-context learning for full Bayesian inference. Additionally, the concept of in-context learning, which we aim to generalize to full Bayesian inference, is intrinsically linked to the transformer architecture [6,7,8]. This connection is further supported by research highlighting the critical role of the attention mechanism in enabling in-context learning [9,10].

Empirical Results

We conducted another ablation experiment to study your question from an empirical point-of-view and investigate in which cases the transformer architecture is indeed a crucial component.

For the ablation, we exchange the transformer-encoder in our approach with an MLP using skip connections and batch normalization and ensure that the transformer encoder and the MLP have approximately the same number of parameters. We then ran the previous experiments using the different models (GLMs, FA, GMMs) in three different scenarios each; every scenario has 50 synthetic and 17 real-world datasets.
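To make the ablation's baseline concrete, a residual MLP block of the kind described (skip connections plus batch normalization) can be sketched as below. This is an illustrative sketch, not the exact encoder from the revised manuscript: the width, ReLU activation, random initialization, and the omission of learned batch-norm scale/shift parameters are all our assumptions.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each feature over the batch dimension (training-mode BN,
    without learned scale/shift, for brevity)."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

class SkipMLPBlock:
    """One MLP block with a skip connection and batch normalization,
    an illustrative stand-in for the ablation's MLP encoder."""

    def __init__(self, dim, rng):
        self.w1 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        self.w2 = rng.normal(scale=dim ** -0.5, size=(dim, dim))

    def __call__(self, x):  # x: (batch, dim)
        h = np.maximum(batch_norm(x @ self.w1), 0.0)  # BN + ReLU
        return x + h @ self.w2                        # residual (skip) connection

def n_params(block):
    """Parameter count, used to match the MLP to the transformer encoder."""
    return block.w1.size + block.w2.size
```

Stacking such blocks and adjusting the width until `n_params` roughly matches the transformer encoder's parameter count mirrors the parameter matching described above.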

In summary, the results confirm our initial hypothesis that a transformer encoder is important for complex conditioning on the input data.

For the GLM scenarios, this is especially pronounced with substantial performance drops across all scenarios when using the MLP baseline instead of the transformer. In the case of FA and GMMs, the improvement of the transformer-encoder is smaller, albeit consistent. In particular, there is no case where the MLP performance is above one standard error of the transformer’s performance. For FA and GMMs, the gap between the MLP and the transformer encoder is, for example, especially noticeable in scenarios 1 and 3 of the FA cases on the real-world data and in scenarios 1 and 2 for the GMM cases.

If there are any remaining concerns or questions, please let us know. If not, we would be grateful if you could consider raising the score.

References

[1] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." International Conference on Learning Representations. 2020.

[2] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

[3] Gorishniy, Yury, et al. "Revisiting deep learning models for tabular data." Advances in Neural Information Processing Systems 34 (2021): 18932-18943.

[4] Müller, Samuel, et al. "Transformers Can Do Bayesian Inference." International Conference on Learning Representations.

[5] Hollmann, Noah, et al. "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second." The Eleventh International Conference on Learning Representations.

[6] Panwar, Madhur, Kabir Ahuja, and Navin Goyal. "In-Context Learning through the Bayesian Prism." The Twelfth International Conference on Learning Representations.

[7] Dong, Qingxiu, et al. "A survey on in-context learning." arXiv preprint arXiv:2301.00234 (2022).

[8] Garg, Shivam, et al. "What can transformers learn in-context? a case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022): 30583-30598.

[9] Akyürek, Ekin, et al. "In-Context Language Learning: Architectures and Algorithms." Forty-first International Conference on Machine Learning.

[10] Lee, Ivan, Nan Jiang, and Taylor Berg-Kirkpatrick. "Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability." The Twelfth International Conference on Learning Representations.

Comment

For your convenience, we also include the results (in Appendix H in the revised manuscript) below.

GLMs: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Encoder | Syn. C2ST (↓) | Syn. MMD (↓) | Syn. W₂ (↓) | Real C2ST (↓) | Real MMD (↓) | Real W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 2 | MLP | 0.942 (± 0.093) | 1.783 (± 1.048) | 2.503 (± 0.814) | 0.968 (± 0.012) | 1.528 (± 0.394) | 2.271 (± 0.315) |
| Scenario 2 | Transformer | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) | 0.768 (± 0.033) | 0.143 (± 0.089) | 0.411 (± 0.094) |
| Scenario 3 | MLP | 0.957 (± 0.075) | 2.236 (± 1.218) | 2.681 (± 1.130) | 0.972 (± 0.012) | 1.658 (± 0.450) | 2.076 (± 0.427) |
| Scenario 3 | Transformer | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) | 0.576 (± 0.027) | 0.037 (± 0.026) | 0.257 (± 0.044) |
| Scenario 5 | MLP | 0.845 (± 0.115) | 1.066 (± 0.859) | 1.166 (± 0.996) | 0.890 (± 0.055) | 1.223 (± 0.791) | 1.102 (± 0.383) |
| Scenario 5 | Transformer | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) | 0.610 (± 0.045) | 0.046 (± 0.020) | 0.242 (± 0.038) |

FA: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Type of Encoder | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W₂ (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | MLP | 0.579 (± 0.015) | 0.017 (± 0.006) | 0.364 (± 0.029) | 0.634 (± 0.014) | 0.013 (± 0.004) | 0.331 (± 0.010) |
| Scenario 1 | Transformer | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) | 0.606 (± 0.038) | 0.068 (± 0.069) | 0.265 (± 0.078) |
| Scenario 2 | MLP | 0.562 (± 0.038) | 0.037 (± 0.042) | 0.308 (± 0.097) | 0.632 (± 0.068) | 0.182 (± 0.407) | 0.339 (± 0.174) |
| Scenario 2 | Transformer | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) | 0.622 (± 0.032) | 0.098 (± 0.039) | 0.287 (± 0.046) |
| Scenario 3 | MLP | 0.539 (± 0.025) | 0.023 (± 0.022) | 0.278 (± 0.116) | 0.680 (± 0.019) | 0.268 (± 0.044) | 0.253 (± 0.017) |
| Scenario 3 | Transformer | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) | 0.609 (± 0.019) | 0.124 (± 0.037) | 0.179 (± 0.018) |
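As background on the MMD columns in these tables: maximum mean discrepancy measures the distance between two sample sets via average kernel similarities, so matching posteriors yield values near zero. A minimal pure-Python sketch with an RBF kernel (illustrative only; the kernel choice and bandwidth here are our assumptions, not necessarily the paper's evaluation setup):

```python
import math
import random

def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy
    between the sample sets xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

random.seed(0)
posterior_a = [[random.gauss(0, 1)] for _ in range(200)]  # e.g. reference samples
posterior_b = [[random.gauss(0, 1)] for _ in range(200)]  # same distribution
shifted     = [[random.gauss(3, 1)] for _ in range(200)]  # a poor approximation

# Matching distributions give a (near-)zero MMD, mismatched ones a large MMD
assert mmd2(posterior_a, posterior_b) < mmd2(posterior_a, shifted)
```

In the tables, the compared sets would be posterior samples from the evaluated method and from the reference sampler.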
Comment

GMMs: Comparison when using an MLP-based encoder and a transformer encoder on 50 synthetic and 17 real-world datasets for three different scenarios.

| Scenario | Type of Encoder | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W₂ (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | MLP | 0.873 (± 0.045) | 0.242 (± 0.363) | 2.203 (± 1.098) | 0.917 (± 0.067) | 0.891 (± 1.150) | 4.528 (± 2.701) |
| Scenario 1 | Transformer | 0.760 (± 0.092) | 0.303 (± 0.548) | 2.095 (± 1.692) | 0.847 (± 0.082) | 0.486 (± 0.623) | 4.054 (± 2.782) |
| Scenario 2 | MLP | 0.921 (± 0.035) | 0.291 (± 0.205) | 2.870 (± 0.710) | 0.992 (± 0.005) | 0.399 (± 0.127) | 5.505 (± 1.144) |
| Scenario 2 | Transformer | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) | 0.937 (± 0.041) | 0.282 (± 0.131) | 3.947 (± 1.055) |
| Scenario 3 | MLP | 0.999 (± 0.000) | 0.438 (± 0.181) | 11.502 (± 9.719) | 1.000 (± 0.000) | 1.001 (± 0.149) | 26.282 (± 3.731) |
| Scenario 3 | Transformer | 0.999 (± 0.001) | 0.267 (± 0.154) | 7.234 (± 2.974) | 1.000 (± 0.000) | 1.155 (± 0.258) | 26.956 (± 3.114) |
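As background on the C2ST columns reported above: a classifier two-sample test trains a discriminator to tell the two sample sets apart, so an accuracy near 0.5 indicates indistinguishable samples and an accuracy near 1.0 indicates completely separable ones. A toy leave-one-out 1-nearest-neighbour stand-in (our illustrative proxy; the actual metric presumably uses a trained classifier):

```python
import random

def nn_c2st(xs, ys):
    """Leave-one-out 1-nearest-neighbour accuracy for separating the two
    sample sets: ~0.5 means indistinguishable, ~1.0 means fully separable."""
    data = [(x, 0) for x in xs] + [(y, 1) for y in ys]
    correct = 0
    for i, (point, label) in enumerate(data):
        # nearest point other than the query itself
        neighbour = min(
            (other for j, other in enumerate(data) if j != i),
            key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o[0])),
        )
        correct += neighbour[1] == label
    return correct / len(data)

random.seed(2)
ref      = [[random.gauss(0, 1)] for _ in range(100)]  # reference samples
matching = [[random.gauss(0, 1)] for _ in range(100)]  # same distribution
far_off  = [[random.gauss(4, 1)] for _ in range(100)]  # clearly different

assert nn_c2st(ref, far_off) > 0.85   # near-perfect separation
assert nn_c2st(ref, matching) < 0.75  # close to chance level
```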
Comment

Dear Reviewer rWy9,

Thank you again for providing highly valuable feedback on our manuscript. We are pleased that you point out that we propose an “innovative framework for Bayesian inference”, and include “strong empirical comparisons”.

As the discussion period will end in approximately 72 hours, we would like to ask if our answers so far have successfully addressed your concerns and answered your questions.

In summary, we addressed the points raised by you as follows: We added a table comparing the inference time of our ICL method with that of all other approaches, which shows that ICL is consistently faster than HMC during deployment. Additionally, we conducted an ablation study using a diffusion objective to address your concern regarding the reliance on continuous normalizing flows. Our new empirical results confirm the original choice of using flow matching with optimal transport paths.

Thank you again for your time. If there are any remaining concerns, please let us know. In case we sufficiently addressed your questions, we would greatly appreciate it if you consider raising the score.

Comment

Dear Reviewer rWy9,

We would like to kindly remind you that the extended rebuttal period will end in less than 72 hours.

We would be happy to hear if we have addressed your concerns so far and if you have further questions. In case we have sufficiently addressed your questions, we would greatly appreciate it if you consider raising the score.

Thank you again for your time.

Review
5

This work aims to investigate whether in-context learning (ICL) using a transformer can yield samples similar to those of full Bayesian inference. To this end, the authors first set the distribution over datasets induced by GLMs and factor models as the target distribution. Then, they demonstrate that when the transformer-based model is trained using a variant of the ICL loss $d_{CFM}$ and a continuous normalizing flow (CNF), the trained model can yield a posterior distribution that approximates the target posterior more accurately than the existing Laplace and variational inference methods.

Strengths

  • This work demonstrates that in-context learning can yield posterior distributions on complex real-world datasets that are comparable to the samples obtained by full Bayesian inference.

  • The work is clearly written, making the proposed methodology easy to understand.

Weaknesses

  • It has already been shown in [1] that transformer-based in-context learning (ICL) can perform Bayesian inference. Without presenting the advantage of full Bayesian inference via ICL, demonstrating this extension of ICL's capability seems an incremental contribution.

  • In the experiments, the baseline models for Laplace and VI (except for IAF) seem small compared to the size of the transformer used for ICL, because the LA and VI baselines use GLM or factor models, whereas ICL uses a transformer.

[1] Transformers Can Do Bayesian Inference - ICLR 22

Questions

  • Why do the authors consider extending the ICL capability to full Bayesian inference? What is the advantage of extending the ICL capability to full Bayesian inference beyond Bayesian inference? In this work, most experiments seem to focus on demonstrating a good approximation of the true posterior distribution for GLMs and factor models by ICL, without demonstrating its downstream benefits.

  • To strengthen the novelty of this work, it would be better if the authors could demonstrate that the ICL capability of being good at posterior approximation can be effective for improving the performance of Bayesian optimization or few-shot learning, as conducted in [1].

  • If ICL uses other distribution matching methods such as VAEs or score matching instead of the continuous normalizing flow (CNF), could ICL still approximate the posterior distribution accurately? It seems less clear how important the type of loss in Eq. (5) is for an accurate approximation in ICL. It would be better to conduct an ablation study comparing the loss of Eq. (5) with the losses of other distribution matching methods, and thus explain why the loss of Eq. (5) is necessary.

  • In Tables 1, 2, 3, and 4, how complex are the models used for the baselines of VI and the Laplace approximation (LA)? Could you compare the model size used for LA, VI, and ICL? If the model size of ICL is too large compared to that of LA and VI, I am worried about whether this comparison is fair. If each model size is different, could you compare LA, VI, and ICL under the same model size?

[1] Transformers Can Do Bayesian Inference - ICLR 22

Ethics Concerns

N/A

Comment

Thank you for taking the time to read our manuscript and providing detailed feedback.

We address your questions and comments below. We use double quotation marks (") to mark parts directly quoted from the revised manuscript. In the manuscript itself, all added parts have the text color blue.

Why do the authors consider extending the ICL capability to fully Bayesian inference? What is the advantage of extending the ICL capability to full Bayesian inference beyond Bayesian inference?

In contrast to previous literature such as [1], that focuses on the posterior predictive distributions, we tackle a much more complex and challenging problem, namely learning the posterior of latent variables. In contrast to [1], this posterior is multivariate and thus much more difficult to obtain. Except for cases where the model is linear in the parameters and a conjugate prior is chosen, obtaining such a posterior distribution of latent variables usually requires MCMC methods and a careful choice of the sampling procedure. While variational and local approximations exist, these do not provide nominal coverage guarantees. Hence neither [1] nor variational approaches would be able to, e.g., provide a 95%-credible interval for the effect of the covariates on medical charges in a hospital. Full Bayesian inference is a fundamental approach in probabilistic modeling which is basically always desirable, even though often infeasible.

To emphasize why full Bayesian inference is a central paradigm, we added some extra content in Section 1.1 and marked it blue in the updated manuscript. We think that this addition will provide valuable context especially to readers less familiar with Bayesian inference.

Below is the added text quoted from the revised manuscript for your convenience:

“Full Bayesian inference is thus fundamental for a wide range of applications in Bayesian statistics and probabilistic machine learning, even though it cannot always be achieved.”

To strengthen the novelty of this work, it would be better if the authors could demonstrate that the ICL capability of being good at posterior approximation, can be effective for improving the performance of Bayesian optimization or few-shot learning as conducted in [1].

Our scope is fundamentally different from that of Mueller et al. [1]. While [1] shows that transformers can learn univariate and discrete posterior predictive distributions, we target continuous multivariate distribution posteriors of latent variables. This implies that the tasks considered in [1], for instance on Bayesian Optimization, are not well-suited to benchmark our approach.

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[1] Müller, Samuel, et al. "Transformers Can Do Bayesian Inference." International Conference on Learning Representations.
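To make the distinction between posterior predictive inference and full posterior inference over latent variables concrete, consider the simplest conjugate case, where the posterior over a latent parameter is available in closed form. A minimal sketch (our toy example, not from the paper; priors and noise variance are illustrative):

```python
def gaussian_mean_posterior(data, prior_mu=0.0, prior_var=1.0, noise_var=1.0):
    """Closed-form posterior over the latent mean of a Gaussian likelihood
    with known noise variance and a conjugate Gaussian prior on the mean."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + sum(data) / noise_var)
    return post_mu, post_var

# With many observations the posterior concentrates near the sample mean
mu, var = gaussian_mean_posterior([2.0] * 100)
assert abs(mu - 2.0) < 0.1 and var < 0.05
```

Outside such conjugate settings, this latent-variable posterior is exactly the multivariate distribution that otherwise requires MCMC, and that the ICL approach targets.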

Comment

If ICL uses other distribution match methods such as VAE or Score matching instead of the continuous normalization flow (CNF), could the ICL still approximate the posterior distribution accurately? It seems less clear how important the type of loss in Eq. (5) is in the sense of accurate approximation for ICL. It would be better to conduct the ablation study comparing the loss of Eq. (5) with the loss of other distribution matching methods, and thus explaining why the loss of Eq. (5) is necessary.

We would like to highlight that [2] provides theoretical justification for the use of continuous normalizing flows and empirically demonstrates that flow matching is a more robust and efficient objective compared to various variants of diffusion objectives. Similarly, [3] corroborates these findings across various simulation-based inference tasks closely related to those discussed in our paper.

To further support our claims, we incorporated additional comprehensive empirical results into the manuscript across three selected scenarios for GLMs, GMMs, and FA, with 50 synthetic and 17 real-world datasets each, replacing the flow matching objective with optimal transport (OT) paths from our original approach by variance-preserving (VP) diffusion probability paths as proposed in [4]. A comparison of samples generated using VP diffusion probability paths and our original flow matching implementation with OT paths reveals that our initial approach consistently produces higher-quality samples.

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[2] Lipman, Yaron, et al. "Flow matching for generative modeling." arXiv preprint arXiv:2210.02747 (2022).

[3] Wildberger, Jonas, et al. "Flow matching for scalable simulation-based inference." Advances in Neural Information Processing Systems 36 (2024).

[4] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
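As background on the two probability paths compared in this ablation: under the OT conditional path of Lipman et al. [2], a noise sample is transported linearly toward the data sample, and the network regresses the corresponding constant vector field. A minimal sketch (illustrative; variable names and the σ_min value are our assumptions, not the paper's code):

```python
import random

SIGMA_MIN = 1e-4  # assumed small terminal noise scale, as in Lipman et al.

def ot_path(x0, x1, t):
    """OT conditional probability path: x_t interpolates from noise x0
    (at t=0) to data x1 (at t=1); u_t is the flow-matching regression target."""
    xt = [(1 - (1 - SIGMA_MIN) * t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - (1 - SIGMA_MIN) * a for a, b in zip(x0, x1)]
    return xt, ut

random.seed(1)
x1 = [0.7, -1.2]                       # a "data" point (here: a latent draw)
x0 = [random.gauss(0, 1) for _ in x1]  # base noise sample

xt0, _ = ot_path(x0, x1, 0.0)
xt1, _ = ot_path(x0, x1, 1.0)
assert all(abs(a - b) < 1e-12 for a, b in zip(xt0, x0))  # starts at the noise
assert all(abs(a - b) < 1e-3 for a, b in zip(xt1, x1))   # ends near the data
```

In the paper's setting, x1 would be a posterior sample of the latent variables, and the regressed vector field would additionally condition on the dataset and on t; the VP diffusion variant replaces this linear interpolation with the diffusion schedule of [4].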

Comment

For your convenience, we also include the results when comparing samples based on the VP diffusion probability paths and our initial flow matching implementation with OT transport paths (in Appendix C in the revised manuscript) below.

GLMs: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W₂ (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 2 | Diffusion paths | 0.961 (± 0.040) | 1.525 (± 0.777) | 3.354 (± 1.333) | 0.961 (± 0.016) | 1.347 (± 0.365) | 2.025 (± 0.270) |
| Scenario 2 | OT paths | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) | 0.768 (± 0.033) | 0.143 (± 0.089) | 0.411 (± 0.094) |
| Scenario 3 | Diffusion paths | 0.903 (± 0.111) | 1.080 (± 0.564) | 1.733 (± 0.408) | 0.936 (± 0.013) | 1.002 (± 0.203) | 1.442 (± 0.103) |
| Scenario 3 | OT paths | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) | 0.576 (± 0.027) | 0.037 (± 0.026) | 0.257 (± 0.044) |
| Scenario 5 | Diffusion paths | 0.691 (± 0.074) | 0.211 (± 0.143) | 0.708 (± 0.233) | 0.681 (± 0.038) | 0.182 (± 0.093) | 0.554 (± 0.090) |
| Scenario 5 | OT paths | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) | 0.610 (± 0.045) | 0.046 (± 0.020) | 0.242 (± 0.038) |

FA: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W₂ (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | Diffusion paths | 0.622 (± 0.043) | 0.207 (± 0.121) | 0.692 (± 0.192) | 0.595 (± 0.012) | 0.089 (± 0.011) | 0.475 (± 0.019) |
| Scenario 1 | OT paths | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) | 0.606 (± 0.038) | 0.068 (± 0.069) | 0.265 (± 0.078) |
| Scenario 2 | Diffusion paths | 0.826 (± 0.036) | 0.768 (± 0.238) | 1.219 (± 0.276) | 0.878 (± 0.028) | 0.793 (± 0.154) | 1.056 (± 0.084) |
| Scenario 2 | OT paths | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) | 0.622 (± 0.032) | 0.098 (± 0.039) | 0.287 (± 0.046) |
| Scenario 3 | Diffusion paths | 0.751 (± 0.048) | 0.387 (± 0.216) | 0.834 (± 0.163) | 0.944 (± 0.008) | 1.514 (± 0.056) | 1.332 (± 0.028) |
| Scenario 3 | OT paths | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) | 0.609 (± 0.019) | 0.124 (± 0.037) | 0.179 (± 0.018) |

GMMs: Comparison of the OT flow matching and the VP diffusion objective on 50 synthetic and 17 real-world datasets for three different scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W₂ (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W₂ (↓) |
|---|---|---|---|---|---|---|---|
| Scenario 1 | Diffusion paths | 0.924 (± 0.024) | 0.241 (± 0.381) | 2.195 (± 1.431) | 0.958 (± 0.030) | 0.890 (± 0.912) | 5.328 (± 2.544) |
| Scenario 1 | OT paths | 0.760 (± 0.092) | 0.303 (± 0.548) | 2.095 (± 1.692) | 0.847 (± 0.082) | 0.486 (± 0.623) | 4.054 (± 2.782) |
| Scenario 2 | Diffusion paths | 0.942 (± 0.020) | 0.213 (± 0.187) | 2.748 (± 0.659) | 0.984 (± 0.012) | 0.411 (± 0.162) | 5.397 (± 1.458) |
| Scenario 2 | OT paths | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) | 0.937 (± 0.041) | 0.282 (± 0.131) | 3.947 (± 1.055) |
| Scenario 3 | Diffusion paths | 1.000 (± 0.000) | 0.582 (± 0.280) | 8.708 (± 4.945) | 1.000 (± 0.000) | 1.869 (± 0.342) | 33.230 (± 8.095) |
| Scenario 3 | OT paths | 0.999 (± 0.001) | 0.267 (± 0.154) | 7.234 (± 2.974) | 1.000 (± 0.000) | 1.155 (± 0.258) | 26.956 (± 3.114) |
Comment

In Tables 1, 2, 3, and 4, how complex are the models used for the baselines of VI and the Laplace approximation (LA)? Could you compare the model size used for LA, VI, and ICL?

Since our approach operates in-context and is thus a meta-learner (learning multiple tasks at the same time, whereas LA, VI, etc. only learn on one specific dataset), it is impossible to directly compare the numbers of parameters. From a theoretical perspective, all benchmarked methods (variational inference, Hamiltonian Monte Carlo, and our in-context learning approach) implement inference for the same models (generalized linear models, factor analysis, Gaussian mixture models), which therefore always have the same number of parameters. While these other methods thus require solving a separate optimization problem for every single task, our method amortizes this computation via the meta-learned ICL model. To nevertheless provide some comparison between the methods, we performed a runtime benchmark. The benchmark highlights the efficiency of our ICL approach during the deployment phase. While some VI methods are faster than ICL, at the cost of inferior performance, our approach is consistently faster than HMC (for FA even achieving a speed-up by a factor of 8).

For your convenience, we also include the results (in Appendix E in the revised manuscript) below.

| Scenario | Method | Mean Runtime (s) |
|---|---|---|
| GLM | Laplace Approximation | 10.48 ± 0.25 |
| GLM | VI: DiagonalNormal | 12.02 ± 0.26 |
| GLM | VI: MultivariateNormal | 13.70 ± 0.29 |
| GLM | VI: Structured Normal | 19.81 ± 0.98 |
| GLM | VI: IAF | 15.44 ± 0.30 |
| GLM | HMC | 120.24 ± 13.94 |
| GLM | ICL | 107.79 ± 17.36 |
| FA | Laplace Approximation | 17.85 ± 0.21 |
| FA | VI: DiagonalNormal | 20.94 ± 0.66 |
| FA | VI: MultivariateNormal | 20.84 ± 0.28 |
| FA | VI: Structured Normal | 36.17 ± 0.61 |
| FA | VI: IAF | 23.75 ± 0.38 |
| FA | HMC | 248.26 ± 57.88 |
| FA | ICL | 31.49 ± 4.97 |
| GMM | Laplace Approximation | 27.52 ± 0.40 |
| GMM | VI: DiagonalNormal | 29.74 ± 0.57 |
| GMM | VI: MultivariateNormal | 30.50 ± 0.41 |
| GMM | VI: Structured Normal | 42.44 ± 0.44 |
| GMM | VI: IAF | 33.39 ± 0.49 |
| GMM | HMC | 239.67 ± 32.71 |
| GMM | ICL | 93.88 ± 10.47 |

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

Comment

Dear Reviewer nf5T,

Thank you again for providing highly valuable feedback on our manuscript. We are pleased that you find our manuscript “well written”, and point out that our ICL approach can yield the posterior distribution on complex real-world datasets.

As the discussion period will end in approximately 72 hours, we would like to ask if our answers so far have successfully addressed your concerns and answered your questions.

In summary, we addressed the points raised by you as follows: We explain why extending ICL to full Bayesian inference is a valuable contribution. We also clarify this point further in the main body of the manuscript. Additionally, we conducted an ablation study using a diffusion objective to address your concern regarding the reliance on continuous normalizing flows, which confirms our original choice of using flow matching with optimal transport paths.

To compare the computational expense for running ICL, LA, VI and HMC in practical applications, we have included the runtimes of all methods in the paper and find that ICL is consistently faster than HMC during inference.

Thank you again for your time. If there are any remaining concerns, please let us know. In case we sufficiently addressed your questions, we would greatly appreciate it if you consider raising the score.

Comment

Dear Reviewer nf5T,

We would like to kindly remind you that the extended rebuttal period will end in less than 72 hours.

We would be happy to hear if we have addressed your concerns so far and if you have further questions. In case we have sufficiently addressed your questions, we would greatly appreciate it if you consider raising the score.

Thank you again for your time.

Review
8

This paper introduces a novel approach to in-context learning that leverages transformer architectures and continuous normalizing flows for a fully Bayesian inference framework. The main objective of this method is to simplify the requirements of full Bayesian inference by employing in-context learning with a transformer architecture, which is widely used in large language models and is recognized for its generalization capabilities across various tasks. Unlike existing algorithms that rely on a pre-defined tractable distribution for the latent variables, this approach enables the model to implicitly learn the relationship between latent and output variables. This implicit learning, driven by the continuous normalizing flow, aligns with both the data and a tractable distribution, thus removing the need for precise proxy distributions.

To validate the effectiveness of their method, the authors performed experiments on latent variable models, specifically GMM, Factor Analysis, and GLM. Results indicate that the in-context learning approach provides a more generalizable and efficient means of utilizing learned information across synthetic datasets, consistently achieving robust generalization on latent variable models. Further analysis of the marginal distributions demonstrates that the proposed method more closely aligns with the true latent distributions compared to existing approximate techniques.

Strengths

  • The paper offers comprehensive explanations of each concept, articulating the roles and functions of all elements with clarity. Additionally, it effectively contextualizes the research within the scope of existing studies, highlighting connections to prior work and establishing a solid foundation in the literature.

  • The paper provides a clear account of the experimental design and the significance of the findings. By testing the model on diverse latent variable models, it illustrates the model’s applicability and potential impact, offering valuable insights into its relevance and effectiveness across multiple contexts.

Weaknesses

  • The paper does not clearly present the inherent limitations of its approach. Specifically, except for the GMM case, there is insufficient discussion regarding concerns and constraints related to model specification. I believe that these limitations are not fully addressed in most cases. To strengthen the paper, it is essential to explicitly outline potential failure cases for this model and specify how it may underperform compared to existing models in certain scenarios by varying the model specification. Without such clarification, there is a potential risk that the contributions of this paper may be overrated.

  • Although the paper briefly mentions synthetic data, a more thorough exploration of the design of the synthetic datasets is essential. I observed a brief description of the synthetic data in the appendix; however, it is important to consider how variations in the datasets or hyper-parameters impact model performance. Additionally, I believe that an analysis of the model's behavior in out-of-distribution scenarios would provide valuable insights. Including these aspects would offer a more comprehensive understanding of the model's robustness and applicability across diverse conditions.

Questions

I would like to inquire whether a comparison with gradient descent-based VI approaches, such as Langevin dynamics[1] or stochastic variational inference[2], would be feasible in terms of evolving latent samples. Since this paper uses flow-matching to model changes, comparing the latent variable sampling steps with these gradient-based VI methods could provide further insights. Such a comparison may help demonstrate that the proposed approach more effectively captures variations in the latent variables, potentially highlighting its advantages in representing the implicit distribution of latent spaces.

Reference

[1] Welling, Max, and Yee W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.

[2] Hoffman, Matthew D., et al. "Stochastic variational inference." Journal of Machine Learning Research (2013).

Ethics Concerns

No concern

Comment

Thank you for taking the time to read our manuscript and providing detailed feedback.

We address your questions and comments below. We use double quotation marks (") to mark parts directly quoted from the revised manuscript. In the manuscript itself, all added parts have the text color blue.

there is insufficient discussion regarding concerns and constraints related to model specification. I believe that this limitations are not fully addressed in the most cases.

Regarding issues with model specification, we would like to clarify that the structure of the Bayesian model we want to learn in-context directly determines the distribution of synthetic data we have to use to fit the in-context learner. Our goal is to demonstrate that established GLM, GMM, and FA approaches with standard modeling assumptions and commonly used priors can be fitted in-context. Therefore, the generative processes are fixed for our analysis. Changing this process would result in models not estimating the correct posterior distribution and would remove access to a ground-truth or gold-standard model (as given by the analytic solution or HMC in our experiments). In particular, this would mean that we cannot judge whether one approach works better or worse than another approach, as no reference distribution is available.

We thank the reviewer for the comment and fully acknowledge that discussing the issue of model (mis-)specification is indeed very valuable. We, therefore, extended Section 3.3 to include a discussion regarding model misspecification for FA and GMMs.

Below is the added text for your convenience:

“To be able to access a reference or ground truth distribution, the data generating processes in our experiments need to match the structure of the GLM, FA and GMM approaches. While the generative processes of FA and GMMs directly prescribe how all parts of the data are generated, this can potentially cause a discrepancy between synthetically generated and real-world datasets. However, our empirical results (Section 4.1) demonstrate that the in-context learner can generalize to real-world data despite the discrepancy to the simulated datasets.”

We further added a discussion of model misspecification and the associated limitations in the discussion part (Section 5) of our manuscript. Below is the added text quoted from the revised manuscript for your convenience:

“Even though our experiments show that ICL works well despite being trained on data that is potentially very different from real-world data, the approach will only be as flexible as the data and model structures it was trained on. As a result, ICL might fail if the model, which implies the synthetic data generation, is severely misspecified. However, this is the same limitation as when misspecifying the hypothesis space of, e.g., a deep neural network or other machine learning approaches, effectively providing the model with the wrong inductive bias.”

Although the paper briefly mentions synthetic data, a more thorough exploration of the design of synthetic datasets is essential.

Key aspects of this point have been answered in our response above. The structure of the standard GLM, GMM and FA approaches directly determines how the synthetic data has to be generated, where GLMs are somewhat of an exception. As noted above, we will make this more clear in a revised manuscript version and thank the reviewer for bringing up this important point.

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

Comment

I would like to inquire whether a comparison with gradient descent-based VI approaches, such as Langevin dynamics[1] or stochastic variational inference[2], would be feasible in terms of evolving latent samples.

We would like to point out that our benchmarks already include state-of-the-art gradient-based VI approaches using variational inference as introduced in [2]. However, because the considered datasets comfortably fit into the RAM of our hardware, we can afford full-batch gradient steps.

Furthermore, we understand the stochastic gradient Langevin dynamics (SGLD) approach proposed in [1] to be intended for sample-based inference rather than stochastic VI. In order to compare our method with this approach, we are currently implementing and benchmarking SGLD, and we will post the results and add them to the paper as soon as we have them.

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[1] Welling, Max, and Yee W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.

[2] Hoffman, Matthew D., et al. "Stochastic variational inference." Journal of Machine Learning Research (2013).

Comment

Thank you again for proposing to additionally compare the samples from our ICL approach against SGLD. We conducted an extensive benchmark study comparing samples generated via ICL and SGLD. We use the same setup used to compare ICL against the VI methods.

In summary, we find that across six different GLM scenarios, six different FA scenarios, and four different GMM scenarios with 50 synthetic and 17 real-world datasets each, ICL yields consistently better samples than SGLD. This is somewhat expected since SGLD is primarily designed as a scalable method that is also applicable to large datasets [4]. Numerous theoretical and empirical findings confirm that, even though SGLD is computationally inexpensive and very scalable, it is substantially outperformed by, for instance, HMC, in terms of sample quality, which is especially pronounced when the posterior distributions are complex and parameters are correlated [3, 4, 5, 6, 7].

Thank you again for your constructive feedback. Please let us know if our response addresses your questions and concerns, and if there is anything else we need to clarify.

References

[3] Chen, Tianqi, et al. "Stochastic gradient Hamiltonian Monte Carlo." International Conference on Machine Learning. PMLR, 2014.

[4] Izmailov, Pavel, et al. "What are Bayesian neural network posteriors really like?" International Conference on Machine Learning. PMLR, 2021.

[5] Li, Chunyuan, et al. "Preconditioned stochastic gradient Langevin dynamics for deep neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. No. 1. 2016.

[6] Brosse, Nicolas, et al. "The promises and pitfalls of stochastic gradient Langevin dynamics." Advances in Neural Information Processing Systems 31 (2018).

[7] Mangoubi, Oren, et al. "Nonconvex sampling with the Metropolis-adjusted Langevin algorithm." Conference on Learning Theory. PMLR, 2019.
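For background on the SGLD baseline discussed above: the update of Welling & Teh [1] takes a half gradient step on the log-posterior and adds Gaussian noise with variance equal to the step size. A toy sketch on a standard-normal target (ours for illustration; step size and chain length are arbitrary, and the benchmark code differs):

```python
import math
import random

def sgld_chain(grad_log_post, theta0, step=0.05, n_steps=20000, seed=0):
    """Stochastic gradient Langevin dynamics (Welling & Teh, 2011):
    theta <- theta + (step / 2) * grad log p(theta | data) + N(0, step)."""
    rng = random.Random(seed)
    theta, samples = theta0, []
    for _ in range(n_steps):
        theta = (theta
                 + 0.5 * step * grad_log_post(theta)
                 + rng.gauss(0.0, math.sqrt(step)))
        samples.append(theta)
    return samples

# Toy target: standard-normal posterior, so grad log p(theta) = -theta
samples = sgld_chain(lambda th: -th, theta0=5.0)[2000:]  # drop burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
assert abs(mean) < 0.4 and abs(var - 1.0) < 0.4  # roughly N(0, 1)
```

On a dataset, `grad_log_post` would use a minibatch gradient estimate, which is what makes SGLD scalable but also hurts sample quality on correlated posteriors, consistent with the benchmark results above.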

Comment

I appreciate the updates in the revised manuscript, particularly the inclusion of discussions on the possibility of misspecification and the more detailed comparisons between gradient-based methods and the proposed approach. The presented results appear promising and provide meaningful statements considering the standard of ICLR 2025.

However, there remains a significant gap in addressing the generalizability of the proposed method to novel data. Specifically, the authors should demonstrate how the method performs when there is a slight increase in the discrepancy between training and test data, and whether such shifts result in a notable drop in performance. This would provide more clarity on the robustness of the method.

Additionally, I am concerned about reproducibility. While the updated experimental sections seem valuable for readers who explore the method, the implementation still does not reflect the updates from the rebuttal phase.

Depending on the updates, I am inclined to raise my score from 6 to 8. I will continue to monitor the responses.

Comment

Dear Reviewer EBeR,

Thank you for your prompt response! We are pleased that our results “appear promising” and “provide meaningful statements considering the standard of ICLR 2025”.

We are further grateful for the reminder to update the anonymized repository. We have now uploaded the code and configuration files to run all additional benchmarks and ablation studies from the rebuttal and provide a short explanation in the readme file on how to reproduce the results.

Discrepancy between train and test data

Following up on your concerns regarding the performance of our method when faced with a discrepancy in the distribution of training and test data, we conducted an additional ablation study investigating this situation. For this, we use our in-context learner trained on a specific generative process and then test the approach on data stemming from a different generative process. We consider in total 20 different setups for GLMs, FA, and GMMs, and vary several aspects of the data-generating processes to create a mismatch between train and test distributions. For GLMs, these aspects include a mismatch in the mean or the variance of the coefficients’ distribution and the variance of the additive noise. For FA, we investigate discrepancies in the train and test data by altering the variance of the factor loadings. For the GMMs, we vary the uniformity of the number of samples per cluster, the variance in the covariances of the mixture components, as well as the variance of the mean vectors.

Results

While we find that the performance of the in-context learner can decrease in specific setups, the overall sample quality remains stable across all cases. When the performance decreases, the drop is usually moderate and often not significant.

We thank you again for your valuable feedback. We think that especially these new results underline the strength and flexibility of our method. Please let us know if our response addresses your concerns, and if there is anything else left to clarify.

Comment

For your convenience, we also include the results (in Appendix G in the revised manuscript) below. In Appendix G, we also include a description of the different considered scenarios.

OOD Performance: Evaluation on 50 synthetic datasets for 8 different GLM scenarios. All results within two standard errors of the non-OOD result for each scenario are marked in bold.

| Scenario | C2ST (↓) | MMD (↓) | W2 (↓) |
| --- | --- | --- | --- |
| Scenario 2 | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) |
| Scenario 2.B | 0.809 (± 0.055) | 0.410 (± 0.095) | 2.250 (± 0.916) |
| Scenario 2.C | 0.857 (± 0.105) | 0.634 (± 0.318) | 3.067 (± 1.759) |
| Scenario 2 | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) |
| Scenario 2.D | 0.840 (± 0.109) | 0.916 (± 1.123) | 4.007 (± 3.261) |
| Scenario 2.E | 0.932 (± 0.120) | 1.556 (± 1.127) | 4.850 (± 2.261) |
| Scenario 3 | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) |
| Scenario 3.B | 0.667 (± 0.080) | 0.210 (± 0.117) | 1.172 (± 0.258) |
| Scenario 3.C | 0.720 (± 0.108) | 0.362 (± 0.248) | 1.891 (± 0.678) |
| Scenario 5 | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) |
| Scenario 5.B | 0.831 (± 0.121) | 0.479 (± 0.200) | 1.762 (± 0.541) |
| Scenario 5.C | 0.920 (± 0.064) | 0.753 (± 0.424) | 3.159 (± 1.254) |

OOD Performance: Evaluation on 50 synthetic datasets for 6 different FA scenarios. All results within two standard errors of the non-OOD result for each scenario are marked in bold.

| Scenario | C2ST (↓) | MMD (↓) | W2 (↓) |
| --- | --- | --- | --- |
| Scenario 1 | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) |
| Scenario 1.B | 0.826 (± 0.066) | 0.656 (± 0.384) | 0.929 (± 0.321) |
| Scenario 1.C | 0.855 (± 0.060) | 0.837 (± 0.494) | 1.135 (± 0.461) |
| Scenario 2 | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) |
| Scenario 2.B | 0.580 (± 0.069) | 0.087 (± 0.191) | 0.393 (± 0.291) |
| Scenario 2.C | 0.589 (± 0.076) | 0.089 (± 0.113) | 0.446 (± 0.233) |
| Scenario 3 | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) |
| Scenario 3.B | 0.544 (± 0.028) | 0.030 (± 0.021) | 0.285 (± 0.094) |
| Scenario 3.C | 0.533 (± 0.025) | 0.021 (± 0.015) | 0.347 (± 0.152) |

OOD Performance: Evaluation on 50 synthetic datasets for 6 different GMM scenarios. All results within two standard errors of the non-OOD result for each scenario are marked in bold.

| Scenario | C2ST (↓) | MMD (↓) | W2 (↓) |
| --- | --- | --- | --- |
| Scenario 2 | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) |
| Scenario 2.B | 0.829 (± 0.050) | 0.233 (± 0.161) | 2.595 (± 0.998) |
| Scenario 2.C | 0.816 (± 0.057) | 0.149 (± 0.135) | 2.272 (± 0.654) |
| Scenario 2 | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) |
| Scenario 2.D | 0.812 (± 0.076) | 0.148 (± 0.091) | 2.557 (± 0.837) |
| Scenario 2.E | 0.880 (± 0.057) | 0.231 (± 0.109) | 3.535 (± 1.003) |
| Scenario 2 | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) |
| Scenario 2.F | 0.821 (± 0.076) | 0.216 (± 0.214) | 2.700 (± 1.044) |
| Scenario 2.G | 0.844 (± 0.046) | 0.197 (± 0.124) | 2.675 (± 0.552) |
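For readers unfamiliar with the sample-quality metrics reported throughout these tables, the following numpy sketch illustrates the standard definitions of the classifier two-sample test (here a simple leave-one-out 1-NN variant) and the RBF-kernel MMD. This is a generic illustration, not the paper's evaluation code, and the variable names (`ref`, `good`, `bad`) are hypothetical.

```python
import numpy as np

def c2st_1nn(p, q):
    """Leave-one-out 1-NN classifier two-sample test.
    Accuracy near 0.5 => the two sample sets are statistically indistinguishable;
    accuracy near 1.0 => they are easy to tell apart."""
    X = np.vstack([p, q])
    y = np.r_[np.zeros(len(p)), np.ones(len(q))]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # a point may not vote for itself
    return float((y[d2.argmin(axis=1)] == y).mean())

def mmd_rbf(p, q, gamma=1.0):
    """(Biased) squared maximum mean discrepancy with an RBF kernel."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(gram(p, p).mean() + gram(q, q).mean() - 2.0 * gram(p, q).mean())

rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 2))            # e.g. ground-truth HMC posterior samples
good = rng.normal(size=(200, 2))           # an approximation that matches ref
bad = rng.normal(3.0, 1.0, size=(200, 2))  # an approximation that is clearly off
```

A faithful posterior approximation scores near 0.5 on C2ST and near 0 on MMD, which is why lower is better in the tables above.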
Comment

I appreciate the updates and I am pleased to see that your revision includes new experimental results such as SGLD and OOD settings. All my concerns have been addressed, so I have raised my score from 6 to 8.

Comment

SGLD vs. ICL: Evaluation on 50 synthetic and 17 real-world datasets for four different GMM scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W2 (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W2 (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario 1 | SGLD | 1.000 (± 0.001) | 2.629 (± 0.868) | 3.279 (± 1.330) | 1.000 (± 0.000) | 3.421 (± 0.877) | 6.510 (± 1.763) |
| Scenario 1 | ICL | 0.760 (± 0.092) | 0.303 (± 0.548) | 2.095 (± 1.692) | 0.847 (± 0.082) | 0.486 (± 0.623) | 4.054 (± 2.782) |
| Scenario 2 | SGLD | 1.000 (± 0.000) | 3.046 (± 1.041) | 6.015 (± 4.265) | 1.000 (± 0.000) | 2.487 (± 0.521) | 6.858 (± 1.618) |
| Scenario 2 | ICL | 0.812 (± 0.061) | 0.159 (± 0.154) | 2.314 (± 0.926) | 0.937 (± 0.041) | 0.282 (± 0.131) | 3.947 (± 1.055) |
| Scenario 3 | SGLD | 1.000 (± 0.000) | 4.631 (± 1.169) | 23.247 (± 30.646) | 1.000 (± 0.000) | 2.655 (± 0.437) | 26.356 (± 2.699) |
| Scenario 3 | ICL | 1.000 (± 0.000) | 0.582 (± 0.280) | 8.708 (± 4.945) | 1.000 (± 0.000) | 1.869 (± 0.342) | 33.230 (± 8.095) |
| Scenario 4 | SGLD | 1.000 (± 0.000) | 3.464 (± 1.098) | 6.995 (± 5.554) | 1.000 (± 0.000) | 2.555 (± 0.494) | 9.477 (± 3.432) |
| Scenario 4 | ICL | 1.000 (± 0.000) | 2.451 (± 0.868) | 8.333 (± 4.202) | 1.000 (± 0.000) | 2.518 (± 0.694) | 11.938 (± 2.956) |
Comment

For your convenience, we also include the results of benchmarking SGLD below (please also refer to Appendix F in the revised manuscript).

SGLD vs. ICL: Evaluation on 50 synthetic and 17 real-world datasets for six different GLM scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W2 (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W2 (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario 1 | SGLD | 0.992 (± 0.015) | 2.846 (± 1.411) | 1.951 (± 0.917) | 0.980 (± 0.013) | 2.191 (± 1.183) | 0.865 (± 0.438) |
| Scenario 1 | ICL | 0.765 (± 0.123) | 0.767 (± 0.727) | 0.585 (± 0.301) | 0.614 (± 0.074) | 0.175 (± 0.219) | 0.310 (± 0.138) |
| Scenario 2 | SGLD | 0.999 (± 0.004) | 5.650 (± 1.762) | 8.295 (± 5.629) | 0.994 (± 0.006) | 2.699 (± 1.093) | 1.289 (± 0.454) |
| Scenario 2 | ICL | 0.839 (± 0.072) | 0.707 (± 0.658) | 1.111 (± 0.300) | 0.768 (± 0.033) | 0.143 (± 0.089) | 0.411 (± 0.094) |
| Scenario 3 | SGLD | 0.997 (± 0.008) | 3.320 (± 1.595) | 3.011 (± 1.036) | 0.983 (± 0.013) | 2.152 (± 1.194) | 0.935 (± 0.523) |
| Scenario 3 | ICL | 0.611 (± 0.070) | 0.089 (± 0.114) | 0.423 (± 0.348) | 0.576 (± 0.027) | 0.037 (± 0.026) | 0.257 (± 0.044) |
| Scenario 4 | SGLD | 1.000 (± 0.000) | 6.626 (± 1.215) | 15.674 (± 8.100) | 0.994 (± 0.006) | 2.927 (± 1.564) | 1.606 (± 1.022) |
| Scenario 4 | ICL | 0.753 (± 0.049) | 0.171 (± 0.153) | 0.631 (± 0.294) | 0.762 (± 0.015) | 0.105 (± 0.046) | 0.597 (± 0.104) |
| Scenario 5 | SGLD | 0.999 (± 0.003) | 3.308 (± 1.728) | 2.216 (± 1.247) | 1.000 (± 0.000) | 4.012 (± 1.413) | 0.996 (± 0.406) |
| Scenario 5 | ICL | 0.621 (± 0.063) | 0.067 (± 0.080) | 0.299 (± 0.195) | 0.610 (± 0.045) | 0.046 (± 0.020) | 0.242 (± 0.038) |
| Scenario 6 | SGLD | 0.998 (± 0.001) | 2.681 (± 0.565) | 2.419 (± 0.510) | 0.998 (± 0.002) | 2.845 (± 0.590) | 1.851 (± 0.319) |
| Scenario 6 | ICL | 0.532 (± 0.019) | 0.016 (± 0.008) | 0.590 (± 0.066) | 0.556 (± 0.017) | 0.035 (± 0.015) | 0.504 (± 0.038) |
Comment

SGLD vs. ICL: Evaluation on 50 synthetic and 17 real-world datasets for six different FA scenarios. All results within two standard errors of the best average result for each scenario are marked in bold.

| Scenario | Model | Synthetic C2ST (↓) | Synthetic MMD (↓) | Synthetic W2 (↓) | Real-World C2ST (↓) | Real-World MMD (↓) | Real-World W2 (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario 1 | SGLD | 0.996 (± 0.006) | 2.883 (± 1.552) | 1.776 (± 0.694) | 0.995 (± 0.003) | 2.676 (± 0.710) | 1.608 (± 0.381) |
| Scenario 1 | ICL | 0.552 (± 0.028) | 0.034 (± 0.034) | 0.289 (± 0.083) | 0.606 (± 0.038) | 0.068 (± 0.069) | 0.265 (± 0.078) |
| Scenario 2 | SGLD | 0.997 (± 0.003) | 2.950 (± 0.786) | 1.892 (± 0.533) | 0.995 (± 0.003) | 2.517 (± 0.583) | 1.500 (± 0.268) |
| Scenario 2 | ICL | 0.542 (± 0.006) | 0.017 (± 0.006) | 0.244 (± 0.033) | 0.622 (± 0.032) | 0.098 (± 0.039) | 0.287 (± 0.046) |
| Scenario 3 | SGLD | 0.998 (± 0.005) | 3.662 (± 1.099) | 2.086 (± 0.919) | 0.956 (± 0.025) | 1.580 (± 0.819) | 0.311 (± 0.108) |
| Scenario 3 | ICL | 0.537 (± 0.023) | 0.024 (± 0.021) | 0.259 (± 0.088) | 0.609 (± 0.019) | 0.124 (± 0.037) | 0.179 (± 0.018) |
| Scenario 4 | SGLD | 1.000 (± 0.000) | 4.127 (± 0.635) | 3.047 (± 0.972) | 0.950 (± 0.021) | 1.520 (± 0.512) | 0.141 (± 0.031) |
| Scenario 4 | ICL | 0.684 (± 0.060) | 0.198 (± 0.141) | 0.918 (± 0.246) | 0.988 (± 0.003) | 1.764 (± 0.026) | 1.248 (± 0.008) |
| Scenario 5 | SGLD | 0.999 (± 0.001) | 3.465 (± 0.939) | 1.981 (± 0.938) | 0.962 (± 0.024) | 1.945 (± 1.383) | 0.393 (± 0.243) |
| Scenario 5 | ICL | 0.535 (± 0.016) | 0.021 (± 0.011) | 0.279 (± 0.060) | 0.886 (± 0.017) | 1.207 (± 0.101) | 1.002 (± 0.042) |
| Scenario 6 | SGLD | 0.997 (± 0.004) | 3.395 (± 1.199) | 2.358 (± 1.458) | 0.950 (± 0.040) | 2.177 (± 1.643) | 0.342 (± 0.224) |
| Scenario 6 | ICL | 0.543 (± 0.021) | 0.023 (± 0.015) | 0.345 (± 0.173) | 0.666 (± 0.020) | 0.200 (± 0.034) | 0.224 (± 0.014) |
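For reference, the SGLD baseline benchmarked above follows the standard Langevin update θ ← θ + (ε/2)∇log p(θ | x) + √ε · ξ with ξ ~ N(0, I). The following is a minimal numpy sketch on a toy target, not the paper's implementation; with full-batch gradients, as here, it reduces to the unadjusted Langevin algorithm.

```python
import numpy as np

def sgld_sample(grad_log_post, theta0, step=1e-2, n_steps=5000, rng=0):
    """Vanilla SGLD: theta <- theta + (step/2)*grad + sqrt(step)*noise."""
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        theta = theta + 0.5 * step * grad_log_post(theta) \
                + np.sqrt(step) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)

# Toy target: standard normal posterior, so grad log p(theta) = -theta.
chain = sgld_sample(lambda t: -t, theta0=np.zeros(1), step=0.05, n_steps=20000)
burned = chain[5000:]   # discard burn-in; mean ~ 0, variance ~ 1
```

Note the small step-size bias of the unadjusted chain (stationary variance slightly above 1); on multimodal posteriors such as GMMs, this kind of single-chain sampler also tends to get stuck in one mode, which is consistent with the high C2ST values for SGLD in the tables above.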
Comment

Dear Area Chair and dear Reviewers,

We would like to thank the Reviewers again for their valuable feedback and comments. We are thankful for the active participation of reviewers EBeR and prH2 during the rebuttal phase and are glad that all their concerns were resolved.

We are furthermore confident that our new ablation study, which thoroughly tests a diffusion objective as an alternative, addresses the concern of reviewer nf5T regarding the choice to use flow matching. This ablation, combined with new results where we benchmark an MLP baseline, also directly addresses the question of reviewer rWy9 about architectural aspects of our method. Finally, in response to the runtime and complexity considerations raised by reviewers nf5T, rWy9, and prH2, we included the time needed for training and inference for all methods benchmarked in our paper.

We would like to summarize our revision and point out the key improvements to our manuscript:

  • Justifications for using flow matching: We conducted an ablation study demonstrating that using a flow matching objective with optimal transport paths yields substantially better results than an objective with variance preserving diffusion paths.

  • Architectural choices: Our additional experimental results show that using an alternative architecture based on an MLP with the same size leads to consistently worse results than relying on the originally proposed transformer architecture.

  • Performance under a mismatch of train and test data: Our new ablation study investigates how our ICL approach performs under a distribution shift between train and test data and reveals that our method can handle those situations robustly across different scenarios and setups.

  • Runtime concerns: We included runtime comparison of all benchmarked methods showing that our ICL method is consistently faster than HMC.

  • Comparison with SGLD: We now also benchmark against SGLD and find that it leads to samples with inferior quality compared to our ICL method.

  • Clarifications for how flow matching is used in our approach: To further clarify how flow matching is used in our approach, we added a new section to the revised version of the manuscript that details its role during training and inference.

Again, we greatly appreciate the reviewers’ feedback. We firmly believe that we have thoroughly addressed the concerns and questions raised by all reviewers and thank them for helping to substantially improve our manuscript.
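For readers unfamiliar with the flow matching objective with optimal transport paths mentioned in the summary above, the following numpy sketch shows how one conditional-flow-matching training example is constructed. This is a generic illustration of the technique (straight-line interpolation between a base noise sample and a target sample, with the constant displacement as regression target), not the authors' code.

```python
import numpy as np

def ot_flow_matching_pair(x1, rng):
    """Build one conditional-flow-matching training example with the
    optimal-transport (straight-line) probability path:
        x_t = (1 - t) * x0 + t * x1,   target velocity u = x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # base noise sample
    t = rng.uniform()                    # time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, t, u

def fm_loss(v_pred, u_target):
    """Per-example flow matching regression loss ||v - u||^2."""
    return float(((v_pred - u_target) ** 2).sum())

rng = np.random.default_rng(0)
x1 = rng.standard_normal(2)              # a "posterior sample" seen during pre-training
xt, t, u = ot_flow_matching_pair(x1, rng)
loss_perfect = fm_loss(u, u)             # a velocity field matching the target has zero loss
```

A useful sanity check of the straight-line path: following the target velocity from (xt, t) for the remaining time 1 − t lands exactly on x1, i.e. xt + (1 − t)·u = x1.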

AC Meta-Review

This paper develops a meta-learning approach for Bayesian inference. The “in-context” learner takes as input a datum x and learns to predict the posterior on the latents p(z | x) using data collected from a continuous normalizing flow. Multiple synthetic training datasets are used to train an attention-based network. These synthetic data are from generalized linear models, Gaussian mixture models, and factor analysis.

This paper resulted in an extensive discussion. The updated manuscript is substantially improved from the initial version, and the authors have done a wonderful job of adding more ablation experiments, clarifying some of the details of their approach, and positioning the approach more carefully.

Let me summarize the main concern that has not been addressed. For an approach that is so general, one needs to develop a more rigorous understanding of when it will work and when it will not work. For example, the meta-learner will not work on “out of distribution” samples and while the authors have analyzed this experimentally, I would encourage them to modify the title and introduction to reflect this issue. It stands to reason that the only way to meta-learn Bayesian inference, is if the underlying set of problems share some regularities, e.g., all of them come from GLM models. And this is what the authors have shown. The evidence presented in this paper does not mean that the meta-learner will generalize to Bayesian inference for other distributions, e.g., using training data from a GLM but predicting for a GMM. This needs to be reflected in the narrative of the paper.

The current manuscript is very poorly written. I would recommend this not be accepted on the basis of the reviewer-author discussion. But this decision can be bumped up.

Additional Comments on Reviewer Discussion

Reviewer EBeR had a broadly positive rating, but they were concerned about potential limitations of such a general approach and wanted a better analysis of the synthetic data experiments. The authors discussed answers to these questions. After this discourse, the reviewer raised their score to 8. There was some confusion about one of the questions: can SGLD be used to create the training samples? The authors seem to have conducted experiments using SGLD as a sampler for the latent variables, not as a replacement for continuous normalizing flows. I found this quite surprising.

Reviewer prH2 had questions on positioning the approach (meta-learning vs in-context learning), calculating the posterior, and some details of the method. The authors have responded to these questions satisfactorily and also modified the manuscript to take these comments into account.

Reviewer nf5T had concerns about connections to existing work, the fact that the approach of this paper uses conditional normalizing flows to train but compares to variational inference or score matching during evaluations, the size of the model, etc. I share these concerns. The authors have addressed some of these concerns, except one, I think. Instead of training a network to replicate something akin to variational inference or score matching, the authors instead modified their approach to use optimal-transport-based trajectories (instead of those from conditional normalizing flows) for training. This does not really address Reviewer nf5T's (and my) concern.

Reviewer rWy9 had concerns about the computational cost of the in-context learning approach and similar concerns as those of Reviewer nf5T on why continuous normalizing flows are being used here. The authors had responded to the first concern satisfactorily, but not the second one. They also added a few more ablation experiments on using multi-layer perceptrons instead of self-attention-based networks to encode the test datum x.

Final Decision

Reject