Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders
Multimodal learning with VAEs where the dependence between expert distributions is taken into account.
Abstract
Reviews and Discussion
The authors challenge the assumption of independence between unimodal experts when computing the joint posterior in multimodal VAEs. To this end, they propose CoDE-VAE, which uses a Bayesian approach to compute the joint posterior over unimodal experts, modelling the dependence between them. Experimental results are positive and show that the idea is effective, although it does not significantly outperform the most recent alternative approaches.
Questions for Authors
- Is the dependency between expert errors assumed to be constant across the D latent dimensions? Why do the authors make this assumption?
- Which beta values and latent space dimensions are chosen to get the results for the MMVAE+ model on the MNIST-SVHN-Text dataset, reported in the main text? It is not really clear to me, even after having a look at the Appendix. It seems somehow strange that the model achieves a relatively low performance in this dataset, keeping in mind the results on PolyMNIST and CUB datasets.
Claims and Evidence
- Empirical results are positive and show the effectiveness of the approach, despite not outperforming SOTA in the multimodal VAE literature.
- I think there seems to be some confusion in the paper about the concept of subsampling modalities in the ELBO, related to the limitations highlighted in [1]. The authors state "It is noteworthy that CODE-VAE does not rely on sub-sampling techniques, which have been shown to harm the performance of multimodal VAEs", but in the CoDE-VAE ELBO in eqn 3 subsampling actually happens. To see it, it is sufficient to notice that computing a given term of the sum in the ELBO requires reconstruction of all modalities given only a subset used for inference, hence sub-sampling of modalities happens and the CoDE-VAE is also subject to the theoretical limitations outlined in [1], as also confirmed in the experimental results.
[1] Daunhawer et al On the limitations of multimodal VAEs, ICLR, 2022.
Methods and Evaluation Criteria
As also outlined below, the datasets chosen to benchmark the approach are sensible, although already well studied in the multimodal VAE literature. As for the proposed method, challenging the assumption of independence between unimodal experts in approximating joint posterior inference for multimodal VAEs is a valuable research direction. Moreover, the proposed method appears to be effective.
Theoretical Claims
The authors justify their approach as a Bayesian method to approximate the joint posterior assuming a dependence between unimodal experts. The derivations seem to be correct and back up their theoretical claims.
Experimental Design and Analysis
The datasets chosen for the experiments are fairly standard in the multimodal VAE literature, and existing models already achieve convincing results in these setups. While the comparisons on these datasets are valid, I think the authors could have picked a novel, more challenging dataset to highlight the benefit of their proposed model. I think the experiments on the chosen datasets are properly conducted, and the results are properly commented on. While model performance does not surpass certain recent approaches (e.g. MMVAE+), the results are still somewhat positive and show that the suggested idea works. I strongly suggest the authors compare with a recent paper [2] that is shown to outperform alternative multimodal VAEs. While the model in [2] has the option to be equipped with diffusion decoders, which would make the comparison with CoDE-VAE unfair in terms of generative quality, its authors show that their ELBO without diffusion decoders still improves over alternative multimodal VAEs. Hence it seems relevant to include it in the comparisons in this paper.
[2] Palumbo et al. Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders, ICLR, 2024.
Supplementary Material
I reviewed some parts, including derivations, qualitative results, and metrics.
Relation to Prior Literature
The idea discussed in this paper fits nicely in the literature of multimodal VAEs as it explores the direction of modelling dependence between unimodal experts in computing the joint posterior, which was not explored thus far to my knowledge.
Missing Important References
Recent relevant work is not discussed, specifically [2]. As mentioned above, I strongly suggest the authors at least discuss this paper in the related work, and I also advise including it in the experimental comparisons.
[2] Palumbo et al. Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders, ICLR, 2024.
Other Strengths and Weaknesses
Weaknesses:
- Notation is sometimes confusing. E.g., in Section 3 the dimension appears at times as a superscript and at times as a subscript, even within the same equation.
- Certain experimental comparisons could be more thorough. For instance, on PolyMNIST it would be more appropriate to assess the generative quality gap on all modalities (and possibly average the performance), instead of focusing only on generating a single modality.
Other Comments or Suggestions
- I think clarity in the paper could be improved in section 3.
- I suggest that, when grid-searching hyperparameters, the authors make explicit (e.g. in the MNIST-SVHN-Text experiment) which hyperparameters achieve the best performance for each model and are hence used to report the results. This information can also be left to the Appendix.
We appreciate your thoughtful and detailed comments. We will respond to your concerns and questions point by point.
Confusion about the concept of sub-sampling modalities
Thank you for pointing this out. In the paper, we use the concept of sub-sampling to refer to ELBO sub-sampling and to the use of mixture distributions to approximate consensus distributions. We will make this clear in the revised version.
Alternative Datasets
Based on your comment and that of Reviewer FV9Z, we provide results on the CELEB-A data. Due to limited time, we only considered a 32-dimensional latent space. For CoDE-VAE we use a value of the dependence parameter that we observed to perform well in other experiments. We obtained the following results:
|  | CoDE-VAE | MMVAE+ |
|---|---|---|
| Conditional FID | 92.11 (0.61) | 97.30 (0.40) |
| Unconditional FID | 87.41 (0.36) | 96.91 (0.42) |
| Conditional Coherence | 0.38 (0.001) | 0.46 (0.001) |
| Unconditional Coherence | 0.23 (0.003) | 0.31 (0.030) |
| Classification | 0.38 (0.066) | 0.37 (0.003) |
which may improve further with cross-validation of CoDE-VAE's hyperparameters.
Comparison to Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders
Thank you for pointing out this paper, which we were not aware of. We acknowledge the importance of the paper in the field of multimodal VAEs and will discuss it in the related work section and will include the experimental comparison on PolyMNIST in the appendix of the revised version of the paper, as the Clustering Multimodal VAE (CMVAE) model introduced in [2] is not 100% comparable with our proposed CoDE-VAE model. The main focus of our research is to present a novel approach to estimate consensus distributions and to learn the contribution of each ELBO subset, while the goal in CMVAE is to couple multimodal VAEs with clustering tasks by leveraging clustering structures in the latent space and to introduce diffusion decoders, which is certainly a novel and relevant line of research. CMVAE captures clustering structures using a mixture model as a prior, and we hypothesized that this flexible prior in CMVAE plays an important role in the performance of unconditional generative tasks. We will include this discussion and the relation between the methods in the updated manuscript.
Notation and Clarity
Thank you for pointing this out. We will change the notation in the revised version to ensure consistency in the use of subscripts and superscripts, which will improve the clarity of Section 3.
Thorough comparison on the generative quality gap for PolyMNIST
We agree that an average generative quality gap would provide a more robust comparison. However, the computational cost of such an experiment is significant: assessing the generative quality of each of the 5 modalities as a function of the number of input modalities requires training each model at least 12*3=36 times (considering 3 different runs to report standard deviations). This would require 252 runs in total, as there are 7 different models in the evaluation of the quality gap (not counting the unimodal VAEs). Given that our research already includes 3 datasets, 6 benchmark models, and several ablation experiments, we leave such a robust comparison for future research.
Add grid-search hyperparameters
Thank you for pointing this out. We will add to the Appendix the hyperparameters for the grid search and their optimal values, including the ones for the new experiments on the CELEB-A data.
Is the dependency between expert errors assumed to be constant across the D latent dimensions?
Yes, for simplicity, we assume a common correlation parameter for all dimensions. However, CoDE is a flexible approach that does not impose any restriction on the way this parameter is specified. We leave it to future research to explore whether model performance can be improved by using different correlation values for different dimensions.
Beta values and latent space for MMVAE+ on the MNIST-SVHN-Text data.
For all models in Section 4.1, we cross-validate β over a grid of values. All models except MMVAE+ assume that the latent space has 20 dimensions, as in previous research. To select the dimensionality of the common and modality-specific (MS) variables in MMVAE+, we follow a similar approach as in the original paper, where the authors choose the dimension of each of these two to be equal to the dimension of the latent space in MMVAE (a model without MS variables) divided by the number of modalities. Therefore, MMVAE+ assumes that both common and MS variables have 7 dimensions. These details are explained in Appendix D3. We also tested using 10 dimensions for both common and MS variables, so that the decoders in the MMVAE+ model would generate modalities based on 20 dimensions, just like the other models. We did not observe significant differences.
The paper introduces a new method for aggregating multimodal expert distributions in Variational Autoencoders (VAEs) by incorporating the dependence between experts, which has traditionally been ignored in models like the product of experts (PoE) and mixture of experts (MoE). This method, called Consensus of Dependent Experts (CoDE), aims to improve the estimation of joint likelihoods in multimodal data by accounting for the dependencies between different modality-specific distributions. The paper proposes the CoDE-VAE model, which enhances the trade-off between generative coherence and quality, improving log-likelihood estimations and classification accuracy. The authors claim that CoDE-VAE performs better than existing multimodal VAEs, especially as the number of modalities increases.
Questions for Authors
- How does the performance of CoDE-VAE change when there is a significant imbalance between the available modalities (e.g., when one modality is much more informative than the others)?
- Could you elaborate on how the CoDE method handles scenarios where the assumption of expert dependence does not hold (e.g., in highly independent modalities)?
- The paper mentions that CoDE-VAE reaches generative quality similar to unimodal VAEs in certain cases—could you provide more concrete examples of these cases, and how the model behaves with increasing modality count?
- What are the computational complexities of CoDE-VAE compared to existing models like PoE and MoE, particularly as the number of modalities increases?
Claims and Evidence
The paper provides clear empirical evidence to support the claims about the CoDE-VAE’s superior performance. The experimental results on datasets such as MNIST-SVHN-Text, PolyMNIST, and CUB support the assertion that CoDE-VAE balances generative coherence and quality better than existing models. Additionally, the paper argues that CoDE’s consideration of expert dependence leads to better log-likelihood estimations and reduced generative quality gaps compared to models relying on modality sub-sampling. However, the discussion could benefit from more in-depth comparisons in specific edge cases where other models might outperform CoDE-VAE.
Methods and Evaluation Criteria
The methodology behind CoDE is sound, introducing a principled Bayesian approach to account for expert dependence. The CoDE-VAE model builds on existing multimodal VAEs, addressing key challenges like missing modalities and the imbalance in the contribution of different ELBO terms. The evaluation criteria, including generative coherence, log-likelihood estimation, and classification accuracy, are appropriate for comparing multimodal models. However, the explanation of how the model behaves in extreme cases (e.g., when only one modality is available) is not fully addressed.
Theoretical Claims
The theoretical claims regarding the new ELBO formulation and the aggregation method using CoDE are well-supported by the paper’s derivations and lemmas. The paper provides a solid mathematical foundation for the method, with proofs of key results, such as the posterior distribution and consensus distributions. There are no apparent issues with the correctness of these proofs.
Experimental Design and Analysis
The experimental design is robust, with comprehensive comparisons to multiple baseline models. The use of multiple datasets (MNIST-SVHN-Text, PolyMNIST, and CUB) provides a good cross-section of real-world multimodal problems. However, more detailed ablation studies or analyses of edge cases where the assumptions about expert dependence may not hold could further strengthen the paper.
Supplementary Material
Yes.
Relation to Prior Literature
The paper does a good job of relating its contributions to the broader literature on multimodal VAEs and expert aggregation methods. The work is clearly motivated by existing challenges in multimodal learning, such as missing modalities and independent expert assumptions. The comparison with PoE and MoE methods, along with references to key multimodal VAE papers, establishes the novelty of the approach.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- The CoDE-VAE method is a novel and theoretically sound approach that addresses the challenge of dependent expert distributions.
- The experimental results are convincing, showing that CoDE-VAE outperforms existing methods in key areas like generative coherence and log-likelihood estimation.
Weaknesses:
- Some parts of the experimental setup could be explained more clearly, particularly regarding the optimization process for learning the contribution of each ELBO term.
- More detailed comparisons with edge cases or failures of the model would help to solidify the generalizability of the results.
Other Comments or Suggestions
- Consider adding more analysis on how CoDE-VAE behaves in cases with missing data or when some modalities are not available.
- A clearer distinction between the CoDE-VAE approach and similar models would benefit readers unfamiliar with multimodal VAEs.
Thank you for your thoughtful and detailed comments. We will address your concerns and questions point by point.
Experiments and analysis on edge cases
We agree that analyzing CoDE-VAE in edge cases helps to understand its behavior and makes our research more robust. We believe that the experiments in Section 4.2 on the generative quality gap (Figure 3), in Appendix D.4 on PolyMNIST (Figure 13), and the classification results in Figures 8 and 10 provide a good indication that the performance of CoDE-VAE improves with the number of modalities and with the cardinality of the subset on which consensus distributions are conditioned.
To address the scenario where the assumption about dependent experts may not hold, we trained CoDE-VAE on PolyMNIST using one of its modalities in the following way. We apply 3 different levels of noise to this modality: 0%, 25%, and 95%. We then pair each noisy version with the original modality to obtain bi-modal data. For each of these datasets, we train CoDE-VAE under two assumed values of the dependence parameter and generate the non-noisy version of the modality. When CoDE-VAE is trained on the data with 0% noise, both modalities are identical, and we expect the setting that assumes dependent experts to achieve relatively high generative quality. Conversely, when CoDE-VAE is trained on the data with 95% noise, the modalities are uncorrelated, and the setting that assumes independent experts is expected to achieve relatively high generative quality. We obtain the following average FID scores (link1, backup_link):
|  | 0% | 25% | 95% |
|---|---|---|---|
|  | 29.0 | 31.27 | 48.22 |
|  | 26.12 | 29.90 | 53.68 |
showing that CoDE-VAE correctly captures the dependency between expert distributions through the dependence parameter.
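For concreteness, below is a minimal sketch of how such a noisy bi-modal pair could be constructed. The function name and the use of uniform pixel corruption are our own illustration, not necessarily the exact corruption scheme used in the experiment.

```python
import torch

def make_noisy_pair(m0: torch.Tensor, noise_level: float):
    """Pair the original modality with a corrupted copy of itself.

    m0: batch of images with values in [0, 1], shape (B, C, H, W).
    noise_level: fraction of pixels replaced by uniform noise (e.g. 0.0, 0.25, 0.95).
    """
    mask = (torch.rand_like(m0) < noise_level).float()   # pixels to corrupt
    noise = torch.rand_like(m0)                          # uniform replacement values
    m0_noisy = (1.0 - mask) * m0 + mask * noise
    return m0, m0_noisy                                  # bi-modal training pair

# Example: the 25% corruption level used above.
# x0 = torch.rand(8, 3, 28, 28)
# pair = make_noisy_pair(x0, noise_level=0.25)
```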
CoDE-VAE in cases with missing data or when modalities are not available.
The evaluation setup in our research follows the standard practice in multimodal VAEs, where all possible combinations of missing modalities are evaluated at test time (Appendix B). Handling missing data during training is not trivial, as we need to estimate consensus distributions: to estimate a consensus distribution for a given subset of modalities, any aggregation method requires the same number of observations for each modality in the subset. This problem could be overcome by using only the complete observations and re-weighting the ELBO terms that have fewer samples, which could be an interesting direction to pursue in future work.
CoDE-VAE when one modality is much more informative?
CoDE-VAE learns the contribution of each k-th ELBO term to the optimization, balancing the importance of relatively more informative modalities. The empirical results of Section 4.4 show that the text modality in MNIST-SVHN-Text is relatively more important to the optimization of the ELBO, as shown by the weight learned for the subset containing that modality. This result seems reasonable, as there is more noise in the MNIST and SVHN modalities. The ablation experiments of Section 4.5 show that CoDE-VAE achieves higher performance (coherence and FID) when the contribution of each subset to the optimization, and thus of each modality, is learned.
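To make the mechanism concrete, here is a minimal sketch, under our own naming and parameterization, of how per-subset ELBO terms can be combined with learnable, normalized weights; the actual CoDE-VAE objective and its weight parameterization may differ.

```python
import torch
import torch.nn as nn

class WeightedELBO(nn.Module):
    """Combine per-subset ELBO terms with learnable, normalized contributions."""

    def __init__(self, num_subsets: int):
        super().__init__()
        # One logit per ELBO subset; e.g. 7 non-empty subsets for 3 modalities.
        self.logits = nn.Parameter(torch.zeros(num_subsets))

    def forward(self, elbo_terms: torch.Tensor) -> torch.Tensor:
        # elbo_terms: shape (num_subsets,), one ELBO estimate per modality subset.
        weights = torch.softmax(self.logits, dim=0)   # contributions sum to one
        return (weights * elbo_terms).sum()           # weighted objective to maximize
```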
CoDE where the assumption of expert dependence does not hold?
CoDE is a flexible approach that does not impose any restriction on the way the experts' covariance matrix is specified, as long as it is invertible. For independent modalities, it should be enough to set the dependence parameter to zero (see the answer on edge cases above).
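As an illustration of this point, the sketch below combines two Gaussian experts through a joint error covariance with an off-diagonal correlation rho; the formulas follow the standard combination of correlated estimators and are our own simplification, not necessarily the exact CoDE expressions. With rho = 0 the consensus mean reduces to the familiar precision-weighted product-of-experts combination.

```python
import numpy as np

def consensus(mu, sigma, rho):
    """Consensus of two dependent Gaussian experts (illustrative formulation).

    mu:    expert means, shape (2,)
    sigma: expert standard deviations, shape (2,)
    rho:   assumed correlation between the experts' errors
    """
    C = np.array([[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
                  [rho * sigma[0] * sigma[1], sigma[1] ** 2]])  # joint error covariance
    w = np.linalg.solve(C, np.ones(2))   # unnormalized precision weights
    var = 1.0 / w.sum()                  # consensus variance
    mean = var * (w @ mu)                # consensus mean
    return mean, var

mu, sigma = np.array([0.3, 1.1]), np.array([0.5, 1.0])
dep_mean, _ = consensus(mu, sigma, rho=0.4)   # dependent experts
ind_mean, _ = consensus(mu, sigma, rho=0.0)   # independent experts
poe_mean = (mu / sigma ** 2).sum() / (1.0 / sigma ** 2).sum()
print(dep_mean, ind_mean, poe_mean)           # ind_mean equals poe_mean
```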
CoDE-VAE generative quality. Could you provide more concrete examples?
We recognize that the wording of this claim could be improved and made more concrete. We have replaced the original sentence with "CoDE-VAE minimizes the generative quality gap as the number of modalities increases, achieving quality similar to unimodal VAEs measured by unconditional FID scores.", which corresponds to the experiments in Section 4.2 on the generative quality gap. Furthermore, Figure 3 shows that CoDE-VAE achieves higher generative performance as the number of modalities used for model training increases, something most of the benchmark models are not able to achieve. We have added figures (backup) that show the generated modality as a function of the input modalities, as well as samples generated by the unimodal VAE for qualitative comparison.
Computational complexities of CoDE-VAE
We agree that this is an important aspect to consider. In Appendix C we therefore mention that CoDE-VAE has a relatively high computational cost that grows with the number of modalities M. However, given that the size of the covariance matrix depends only on the number of modalities (see the question from reviewer FV9z), model training is feasible on a single GPU even for 5-modality datasets.
The authors' rebuttal is quite professional and successfully addresses some of my concerns. I will consider editing my initial review and rating after carefully going through the rebuttals to the other reviewers (but will not require any additional details or raise further questions to the authors).
Thank you for your positive feedback. We are pleased to have been able to address your concerns.
This paper introduces the Consensus of Dependent Experts (CoDE) in the context of multimodal learning with Variational Autoencoders (VAEs). Current approaches for this task, such as (i) the product of experts or (ii) the mixture of experts, assume cross-modal independence, which is restrictive. To address this, the current work proposes a novel Evidence Lower Bound (ELBO) that estimates the joint likelihood by learning the contribution of each modality. The proposed method can strike a balance between generative coherence and generative quality. Empirical evaluations are conducted on several datasets.
Update after rebuttal
I am convinced by the authors' responses to my questions. Hence, I am raising my score.
Questions for Authors
The following are some questions for the authors:
- L89, Col 1: Does the proposed approach really "minimize the generated quality as the number of modalities increase". This seems counter-intuitive. Or is this a typo?
- What would happen if instead of a Categorical distribution, a softmax distribution is used to weigh the experts?
Claims and Evidence
Generally, the claims make sense. However, the chief concern with the claims is:
- The paper does not show how accurately the ELBO is minimized for the different datasets. This is an important shortcoming of the present work.
Methods and Evaluation Criteria
While overall the method is intuitive, the chief concern is as follows:
- L201, Col 2: Estimating the off-diagonal elements of the covariance matrix in the forward pass could be computationally expensive in high-dimensional scenarios.
- Also, how does the model deal with cases where the covariance matrix is not full rank?
Theoretical Claims
Generally, the theoretical claims seem accurate.
Experimental Design and Analysis
Overall, the experimental evaluation is pretty broad but has the following shortcomings:
- Some of the more complex real-world image datasets, e.g. CELEB-A [a] and CELEB-HQ [b], have not been experimented with.
- Generation Quality (as measured by FID scores) and Classification accuracy are not the best.
References: [a] Liu, Z., Luo, P., Wang, X. and Tang, X., 2018. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15(2018), p.11. [b] Karras, T., Aila, T., Laine, S. and Lehtinen, J., 2018, February. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations.
Supplementary Material
Yes, I reviewed the entire supplementary material.
Relation to Prior Literature
The idea of combining multimodal distributions relates to prior works that combine hidden Markov model distributions (Brown & Hinton, 2001), combine modalities to generate images with generative adversarial networks (Huang et al., 2022), use large language models for comparative assessment of texts (Liusie et al., 2024), combine predictions in early-exit ensembles (Allingham & Nalisnick, 2022), or aggregate distillation of diffusion policies in diffusion models (Zhou et al., 2024).
Missing Important References
Most important related works have been discussed.
Other Strengths and Weaknesses
Other strengths:
- The proposed model can deal with scenarios when one or more modalities are missing.
- The current formulation allows for the estimation of uncertainties of each of the experts.
Other weaknesses: See previous sections.
Other Comments or Suggestions
The following are some additional comments:
- L98, Col 1: "....method, the derivation..." -> "....method, for the derivation..."
- L376, Col 1: "...significant..." -> "...significantly..."
Thank you for your careful and comprehensive comments. We will address your concerns and questions, point by point:
How accurately is ELBO minimized:
We are not completely sure that we follow your concern. If the comment refers to how the ELBO is maximized during training, we train the CoDE-VAE model until the ELBO converges, confirmed by visual inspection. These plots (backup backup backup) show the convergence of the ELBO. If your concern is about how close the ELBO is to the intractable marginal log-likelihood, we calculate log-likelihoods on the test sets using importance sampling (shown in Figures 2 and 4 for all models). Please let us know if we are misunderstanding your concern.
Covariance matrix costly in high-dimensional data.
The covariance matrix is not a sample covariance matrix, and its size depends on the number of expert distributions entering a consensus distribution (CD). It is therefore limited by the number of modalities M, which is typically small, and only one CD is conditioned on all modalities. In our research the largest M is 5 (PolyMNIST). Therefore, the computational cost of finding the inverse of the covariance matrix is affordable.
What if the covariance matrix is not full rank?
The covariance matrix $\Sigma$ is guaranteed to be full rank by construction, as $x^\top \Sigma x > 0$ for all $x \neq 0$. To see this, we need to show that the quadratic form $x^\top \Sigma x = 0$ is only satisfied by the zero vector $x = 0$. Let $\lambda_{\min}$ be the smallest eigenvalue of $\Sigma$, which is positive by construction. Therefore, $x^\top \Sigma x \geq \lambda_{\min} \lVert x \rVert^2$. Since $\lambda_{\min} > 0$, the only solution that satisfies $x^\top \Sigma x = 0$ is the zero vector $x = 0$. We will add this discussion in the revised version.
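As a quick numerical sanity check of this argument, the sketch below builds an illustrative M x M expert covariance with a shared correlation value and verifies that its smallest eigenvalue is positive; the equicorrelation construction is our own assumption, not necessarily the paper's exact parameterization.

```python
import numpy as np

def expert_covariance(sigmas, rho):
    """Illustrative M x M expert error covariance with a shared correlation rho."""
    s = np.asarray(sigmas, dtype=float)
    C = rho * np.outer(s, s)          # off-diagonal terms rho * sigma_i * sigma_j
    np.fill_diagonal(C, s ** 2)       # per-expert variances on the diagonal
    return C

C = expert_covariance(sigmas=[0.5, 0.8, 1.2, 0.9, 1.1], rho=0.3)  # M = 5 experts
lam_min = np.linalg.eigvalsh(C).min()
print(lam_min > 0)                    # smallest eigenvalue positive => full rank
print(np.linalg.inv(C).shape)         # inverting a 5 x 5 matrix is cheap
```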
More complex datasets
We added a new section in the appendix with experiments on CELEB-A for CoDE-VAE and MMVAE+, which is the model that stands out in the other experiments. Due to limited time, we evaluate both models with a single hyperparameter configuration, assuming a 32-dimensional latent space. For CoDE-VAE we use a value of the dependence parameter that we observed to perform consistently well in other experiments. We obtained the following results:
|  | CoDE-VAE | MMVAE+ |
|---|---|---|
| Conditional FID | 92.11 (0.61) | 97.30 (0.40) |
| Unconditional FID | 87.41 (0.36) | 96.91 (0.42) |
| Conditional Coherence | 0.38 (0.001) | 0.46 (0.001) |
| Unconditional Coherence | 0.23 (0.003) | 0.31 (0.030) |
| Classification | 0.38 (0.066) | 0.37 (0.003) |
which may improve further with cross-validation of CoDE-VAE's hyperparameters.
Generation Quality and Classification accuracy
Multimodal VAEs trade off generative quality against generative coherence [1]. The experimental setup in our research shows that CoDE-VAE performs as well as or better than SOTA multimodal VAEs in terms of balancing the trade-off between generative coherence and generative quality. When we assess in isolation whether multimodal VAEs are able to improve generative quality as the number of modalities increases, CoDE-VAE clearly shows higher performance (Figure 3). It is possible to add modality-specific (MS) latent variables to CoDE-VAE to further improve generative quality, but this requires a careful design of the number of dimensions of the common and MS variables to avoid the shortcut problem [1]. As for the classification results, Figures 8 and 10 in the appendix show that models using the mixture of experts are not able to achieve higher classification accuracy when latent representations are learned from subsets with more modalities. CoDE-VAE, on the other hand, ranks 1st and 3rd, while balancing the trade-off between generative quality and generative coherence at the same time.
[1] Palumbo et al. Enhancing the Generative Quality of Multimodal VAEs Without Compromises, ICLR, 2023.
Typos and L89, Col 1
You are correct that there is a typo in L89, Col 1. The sentence should read "Furthermore, CoDE-VAE minimizes the generative quality gap as the number of modalities increases...", referring to the results in Figure 3. We will fix this in our revised version of the paper.
Softmax instead of categorical
We assume that you are referring to the Gumbel-Softmax distribution, which is useful when we need to sample and backpropagate at the same time. As our main interest is learning the parameters, we do not expect significantly different behavior when using the Gumbel-Softmax distribution. The main concern when using a different distribution is whether the entropy term in the ELBO of the CoDE-VAE model can still be evaluated, ideally in closed form.
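For reference, a minimal PyTorch sketch of the two options; the variable names and the number of subsets are hypothetical, and whether the relaxed weights keep the entropy term tractable is exactly the open question mentioned above.

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(7, requires_grad=True)   # one logit per ELBO subset (e.g. 7 for 3 modalities)

# Deterministic categorical probabilities via a plain softmax:
weights = torch.softmax(logits, dim=0)
entropy = torch.distributions.Categorical(probs=weights).entropy()   # closed form

# Reviewer's alternative: a Gumbel-Softmax relaxation, useful when one wants to
# sample a (nearly) one-hot weighting and still backpropagate through the sample.
relaxed_weights = F.gumbel_softmax(logits, tau=0.5, hard=False)
```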
I am grateful to the authors for addressing my concerns.
Thank you for your positive feedback. We're glad we were able to address your concerns and hope you'll consider reviewing your score based on our rebuttal.
This work introduces CoDE-VAE, a model that challenges the common assumption of independence between unimodal experts in multimodal Variational Autoencoders by modeling their dependencies. Using a Bayesian approach, CoDE-VAE aims to improve joint posterior estimation and enhance generative coherence and classification accuracy. Experimental results show that while CoDE-VAE performs well, its improvements over the latest alternatives are modest.
While this work has some limitations (more thorough experiments, improved overall clarity, and more detailed comparisons with edge cases or failures of the model would help to solidify the generalizability of the results), the strengths outweigh the limitations. The authors propose a method that is novel and theoretically sound, and it addresses the challenge of dependent expert distributions. I think this work is of interest to the ML community.