CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
Abstract
Reviews and Discussion
The paper proposes CyIN, a unified framework to jointly handle both complete and incomplete multimodal learning via a cyclic informative latent space. The core contributions include (1) a dual Information Bottleneck (IB) module at token- and label-levels to extract semantically rich yet compact features, and (2) a cyclic cross-modal translation module that enables forward and reverse reconstruction of missing modalities using cascaded residual autoencoders. The method is evaluated on four datasets under complete, fixed missing, and random missing settings, demonstrating state-of-the-art performance. However, the paper does not clearly distill the core problem it aims to address, and further refinement is needed to improve its clarity and focus.
Strengths and Weaknesses
Strengths:
- CyIN elegantly bridges the gap between complete and incomplete modality inputs with one model, avoiding the need to retrain for each missing-modality configuration.
- The use of token-level and label-level Information Bottlenecks effectively filters noise and preserves task-relevant information, leading to better fusion and reconstruction.
- CyIN achieves SOTA or near-SOTA performance on MOSI, MOSEI, IEMOCAP, and MELD, consistently outperforming baselines in both complete and incomplete setups.
Weaknesses:
- The authors state in the abstract that the proposed method aims to address the challenge of missing modalities. However, in the fourth paragraph, they enumerate several limitations of existing methods—such as insufficient exploitation of missing information, lack of unified modeling, and poor generalization and robustness. Despite listing these issues, the proposed method does not explicitly address all of them, which may lead to confusion regarding the scope and objectives of the paper.
- While Information Bottleneck is a sound principle, the theoretical grounding of combining IB with cyclic autoencoding for modality reconstruction is not rigorously established.
- Parameter count, FLOPs, and training time are not reported. Furthermore, all experiments focus on affective/emotion datasets; effectiveness on large-scale or non-affective tasks remains unknown.
- Typo: builds -> build (in line 9)
Questions
- The motivation of the paper.
- What theoretical guarantees, if any, does the proposed method offer?
- Training overhead analysis.
Limitations
yes
Final Justification
The authors have adequately addressed my main concerns in the rebuttal. Therefore, I will increase my score.
Formatting Issues
none.
We would like to express our sincere gratitude for your thoughtful questions and valuable feedback on our strengths, including "elegantly bridge the gap", "effectively filter noise and preserve task-relevant information", and "consistently outperforming baselines". Our responses to your concerns are presented below. We are eager to engage in further discussions with you to address your concerns and enhance the quality of our work.
Q1: About the addressed limitation.
A1: The fourth paragraph outlines several known limitations of prior work, which motivated the design of our method. Below, we explain how CyIN attempts to address each limitation:
- Insufficient exploitation of missing information: CyIN incorporates a cyclic cross-modal translation mechanism that allows available modalities to reconstruct missing ones via bottleneck latents, promoting deeper exploitation of cross-modal information. Unlike previous direct imputation methods, our model leverages available modalities to reconstruct the missing ones in a compact informative space.
- Task-unrelated noise interruption: CyIN introduces IB to filter out irrelevant noise and redundancy by compressing representations. The informative space retains task-relevant information for multimodal representation, which not only reduces the reconstruction difficulty in incomplete multimodal learning but also benefits representation compression in complete multimodal learning (a minimal sketch of such an IB encoder is given after this list).
- Lack of unified modeling for varying missing-modality scenarios: CyIN can handle arbitrary missing-modality patterns without separate retraining by explicitly modeling diverse missing-modality scenarios with cyclic translation in a shared informative latent space, which improves generalization under unknown missing circumstances in the real world.
- Sacrificing complete multimodal performance: CyIN consistently maintains state-of-the-art performance for both complete and incomplete modality inputs.
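To make the noise-filtering role of the IB concrete, a minimal sketch of a token-level variational IB encoder is given below. This is an illustrative sketch only: module, dimension, and variable names (TokenIB, d_in, d_bottleneck) are assumptions rather than our released implementation. Each unimodal token sequence is mapped to a stochastic bottleneck latent, and the KL term toward a standard normal prior penalizes the information kept in the latent, so the task loss retains the label-relevant part while task-irrelevant noise is squeezed out.

```python
import torch
import torch.nn as nn

class TokenIB(nn.Module):
    """Token-level variational IB encoder (illustrative sketch, not the released code)."""
    def __init__(self, d_in: int, d_bottleneck: int):
        super().__init__()
        self.mu = nn.Linear(d_in, d_bottleneck)      # mean of q(b | s)
        self.logvar = nn.Linear(d_in, d_bottleneck)  # log-variance of q(b | s)

    def forward(self, s: torch.Tensor):
        # s: (batch, seq_len, d_in) unimodal token features
        mu, logvar = self.mu(s), self.logvar(s)
        b = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized bottleneck latent
        # KL(q(b|s) || N(0, I)) upper-bounds the compression term I(S; B)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return b, kl

if __name__ == "__main__":
    ib = TokenIB(d_in=768, d_bottleneck=64)
    tokens = torch.randn(8, 50, 768)   # e.g., text tokens from a pretrained language model
    b, kl = ib(tokens)
    print(b.shape, kl.item())           # compact latents plus the compression penalty
```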
Q2: Theoretical grounding of combining IB with cyclic translation.
A2: We present a further explanation of the theoretical analysis of CyIN and of the optimization objective obtained by combining IB with cyclic translation. Considering multimodal learning with two modalities, without loss of generality, the proposed CyIN can be formulated as follows:
S₁ → B₁ ⇄ B₂ ← S₂
↓ ↓
T₁ T₂
- The left and right sides, S₁ → B₁ → T₁ and S₂ → B₂ → T₂, denote the information bottleneck chains for the two modalities. The source and target states take different forms in the token-level IB and the label-level IB, both of which have been theoretically derived in our paper.
- The middle part, B₁ ⇄ B₂, denotes the cyclic translation: one direction denotes the translation process, while the other denotes alignment with the original bottleneck. With a translator network between the bottlenecks, the translation process can be divided into cross-modal reconstruction across the diverse unimodal latents and cyclic translation from the reconstructed latent back to the original modality; a schematic form of the resulting objective is sketched below.
- When both modalities are present, i.e., complete multimodal learning, we can regard the cyclic translation as a form of cross-modal interaction that enhances the extraction of modality-shared dynamics. Such cross-modal synergies benefit the multimodal fusion process in attaining the final discriminative representation.
- When one of the modalities is missing, the framework conducts incomplete multimodal learning. Assuming S₂ is missing, we can obtain the bottleneck latent B₁ from modality S₁ and use it to reconstruct the bottleneck latent B₂, which can then be decoded as supplementary information for the multimodal fusion process.
The aforementioned two-modality scenarios can be generalized to arbitrary modality pairs without constraint, thereby facilitating the efficient scaling ability of CyIN.
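To make the structure of the combined objective explicit, a schematic form is given below using the notation of the diagram above; f₁→₂ and f₂→₁ denote the translator networks, and β and γ are trade-off weights. This is a structural sketch only, and the exact deduction follows the derivation in the paper.

```latex
% Schematic form of the combined objective (illustrative; the exact derivation is in the paper).
\begin{aligned}
\mathcal{L}_{\mathrm{IB}}  &= \sum_{m\in\{1,2\}} \big[\, -I(B_m; Y) + \beta\, I(S_m; B_m) \,\big],\\
\mathcal{L}_{\mathrm{cyc}} &= \big\| f_{1\to 2}(B_1) - B_2 \big\|_2^2
                              + \big\| f_{2\to 1}\!\big(f_{1\to 2}(B_1)\big) - B_1 \big\|_2^2
                              + \text{(symmetric terms for the } 2 \to 1 \text{ direction)},\\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{task}} + \mathcal{L}_{\mathrm{IB}} + \gamma\, \mathcal{L}_{\mathrm{cyc}}.
\end{aligned}
```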
Q3: Parameter count, FLOPs, and training time.
A3: We have reported the computational efficiency in Table I below. Compared with the state-of-the-art methods, the proposed CyIN achieves the lowest total parameter count (123.49M) and FLOPs (1.594T). The inference time of CyIN (22.41 s/iteration) is also substantially lower than that of GCNet (70.62 s) and IMDer (103.75 s).
Due to the cyclic translation process during training, the training time of CyIN is slightly slower than GCNet, which leverages a Graph Neural Network for modality reconstruction, but still faster than IMDer, which utilizes diffusion models for reconstruction.
Table I. Computational efficiency comparison on the MOSI dataset.
| Model | Total Param (M) | Total Training Time (h) | Inference Time (s/iteration) | FLOPs (T) |
|---|---|---|---|---|
| GCNet | 144.34 M | 1.47h | 70.62s | 3.747 |
| IMDer | 168.31 M | 1.89h | 103.75s | 5.466 |
| CyIN (ours) | 123.49 M | 1.61h | 22.41s | 1.594 |
Q4: Experiments on large-scale or non-affective tasks.
A4: We have extended experiments on 6 non-affective datasets with 2–4 modalities, including:
- Multimodal Recommendation SOTA paper: [1] Jin Li et al. 2025. Generating with Fairness: A Modality-Diffused Counterfactual Framework for Incomplete Multimodal Recommendations. In Proceedings of the ACM on Web Conference 2025 (WWW'25).
- Multimodal Face Anti-spoofing and Dense Prediction SOTA paper: [2] Shicai Wei et al. 2024. Robust Multimodal Learning via Representation Decoupling. In Proceedings of the 18th European Conference on Computer Vision (ECCV'24).
- Multimodal Medical Segmentation SOTA paper: [3] Vittorio Pipoli et al. 2025. IM-Fuse: A Mamba-based Fusion Block for Brain Tumor Segmentation with Incomplete Modalities. In Proceedings of the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI'25).
The experimental results shown below illustrate that CyIN reaches state-of-the-art performance, further demonstrating its effectiveness in generalizing to diverse mainstream architectures.
The scale of the datasets varies from 1.4k samples (NYU v2) to 65k samples (Tiktok), 87k samples (CASIA-SURF), and 139k samples (Amazon Baby), which partially demonstrates the scaling behavior of the proposed CyIN. We leave exploration on larger-scale datasets to future work.
Table. Experiment for the Multimodal Recommendation task: accuracy and fairness (%) under the random missing protocol with a 0.4 missing rate, as in the SOTA¹ paper.
| Dataset | Model | Recall@10 ↑ | Recall@20 ↑ | Precision@10 ↑ | Precision@20 ↑ | NDCG@10 ↑ | NDCG@20 ↑ | F@10 ↑ | F@20 ↑ | F_fuse@10 ↑ | F_fuse@20 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon Baby | SOTA¹ | 5.26 | 8.54 | 0.56 | 0.45 | 2.76 | 3.60 | 87.24 | 90.12 | 1.11 | 0.90 |
|  | CyIN (ours) | 5.43 | 8.56 | 0.57 | 0.45 | 2.89 | 3.68 | 90.10 | 92.70 | 1.14 | 0.90 |
| Tiktok | SOTA¹ | 4.81 | 7.39 | 0.48 | 0.37 | 2.95 | 3.60 | 88.15 | 92.27 | 0.96 | 0.74 |
|  | CyIN (ours) | 4.91 | 7.57 | 0.49 | 0.38 | 3.15 | 3.81 | 89.02 | 92.31 | 0.98 | 0.75 |
| Allrecipes | SOTA¹ | 2.49 | 3.36 | 0.24 | 0.16 | 1.33 | 1.55 | 96.82 | 89.52 | 0.49 | 0.33 |
|  | CyIN (ours) | 2.61 | 3.43 | 0.26 | 0.17 | 1.45 | 1.66 | 99.57 | 91.94 | 0.52 | 0.34 |
Table. Experiment for the Multimodal Face Anti-spoofing task on the CASIA-SURF dataset with ACER (↓) under the fixed missing protocol, as in the original SOTA² paper.
| RGB | Depth | IR | SOTA² | **CyIN (ours)** |
|---|---|---|---|---|
| ✓ |  |  | 7.33 | 4.48 |
|  | ✓ |  | 2.13 | 2.83 |
|  |  | ✓ | 10.41 | 7.75 |
| ✓ | ✓ |  | 1.02 | 2.26 |
| ✓ |  | ✓ | 3.88 | 3.06 |
|  | ✓ | ✓ | 1.38 | 1.20 |
| ✓ | ✓ | ✓ | 0.69 | 0.66 |
| Average |  |  | 3.84 | 3.18 |
Table. Experiment for Multimodal Dense Prediction task on NYU v2 dataset with mIoU (↑) under fixed missing protocol as in the original SOTA² paper.
| RGB | Depth | SOTA² | CyIN (ours) |
|---|---|---|---|
| ✓ |  | 44.06 | 43.93 |
|  | ✓ | 41.82 | 43.84 |
| ✓ | ✓ | 49.89 | 50.46 |
| Average |  | 45.26 | 46.07 |
Table VI. Experiment for Multimodal Medical Segmentation task on BraTS2023 dataset with DSC (%) (↑), averaged over four modalities under fixed modality setting in the original SOTA³ paper.
| Class | SOTA³ | CyIN (ours) |
|---|---|---|
| Enhancing Tumor | 74.6 | 75.1 |
| Tumor Core | 85.0 | 85.8 |
| Whole Tumor | 90.6 | 91.2 |
Q5: Typo.
A5: Thank you for pointing this out. We will carefully check the manuscript.
Thank you for the clarifications regarding the contributions of this work. My concerns have been addressed. Based on this, I am willing to raise my score.
We sincerely appreciate your thorough suggestions and your consideration of raising the score. Thank you very much for improving this work and for your valuable time and effort!
This paper addresses the gap between complete and incomplete multimodal learning by proposing the CyIN (Cyclic INformative Learning) framework. The core idea is to construct an “information-purified latent space” via a two-stage Information Bottleneck (IB) mechanism—Token-level IB and Label-level IB—and then perform cross-modal cyclic translation to reconstruct missing modalities within this space. The authors evaluate their approach on four datasets under three settings.
Strengths and Weaknesses
Strengths
- Unified handling of complete and arbitrarily missing inputs with a single model, avoiding the common practice of training separate models for different missing patterns.
- Clear writing, with Figure 1 providing an intuitive breakdown of the framework and loss components.
- The experiments are relatively comprehensive, and the results demonstrate their effectiveness.
Weaknesses
- Both the Information Bottleneck and cyclic reconstruction approaches (CRA) are established methods; the main contribution lies in their combination. The authors need to explain the connection between the two techniques and why their combination constitutes an optimal solution.
- The paper does not quantify the cost introduced by the two-stage training, multiple translators, and multi-head cross-attention—specifically in terms of parameter count or inference latency.
Questions
Refer to the above weaknesses.
Limitations
Yes
Final Justification
I appreciate the authors' efforts in addressing my concerns. I have carefully read the rebuttal. I will keep my score.
Formatting Issues
None
We would like to express our sincere gratitude for your thoughtful questions and valuable feedback on our strengths, including "unified handling", "clear writing", and "comprehensive experiments". Our responses to your concerns are presented below. We are eager to engage in further discussions with you to address your concerns and enhance the quality of our work.
Q1: Connection between IB and Cyclic Translation.
A1: Our key contribution lies in the novel integration of IB and cyclic translation within a unified framework designed for both complete and incomplete multimodal learning.
Specifically, the IB serves as an effective representation learning principle by encouraging compact and task-relevant latent spaces, which helps reduce the complexity of high-dimensional inputs while preserving discriminative modality-specific and -shared information. Cyclic translation, on the other hand, enhances cross-modal reconstruction by reinforcing consistency and structural alignment between modalities. It captures modality-shared features, facilitating robust reconstruction when some modalities are missing.
We have added a theoretical analysis of combining these two approaches to our work, as follows. Considering multimodal learning with two modalities, without loss of generality, the proposed CyIN can be formulated as follows:
S₁ → B₁ ⇄ B₂ ← S₂
↓ ↓
T₁ T₂
Combining IB with cyclic translation, we can deduce the corresponding theoretical loss.
Minimizing this reconstruction loss, we obtain informative unimodal bottleneck latents B₁ and B₂, which contain inter-modal features that support productive multimodal fusion. Compared with directly fusing the unimodal representations S₁ and S₂, a fusion process implemented on the bottleneck latents benefits from the compression ability of the information bottleneck and leads to less reconstruction difficulty in cross-modal translation.
- When both modalities S₁ and S₂ are presented, known as complete multimodal learning, we can regard the cyclic translation as a form of cross-modal interaction that enhances the extraction of modality-shared dynamics. Such cross-modal synergies benefit the multimodal fusion process in attaining the final discriminative representation.
- When one of the modalities is missing, the framework conducts incomplete multimodal learning. Assuming S₂ is missing, we can obtain the bottleneck latent B₁ from modality S₁ and use it to reconstruct the bottleneck latent B₂, which can then be decoded as supplementary information for the multimodal fusion process.
In summary, IB provides compact, robust latent spaces for both understanding and generation, while cyclic translation enhances modality interaction and recovery. Their combination allows CyIN to unify complete and incomplete modality scenarios in the same framework. We believe this integration is not only novel but also essential for scalable and flexible multimodal learning.
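As a concrete illustration of how the two components interact during training and at test time, a minimal sketch of the cyclic translation step between two bottleneck latents is given below. Names, shapes, and the simple MLP translators are illustrative assumptions; in CyIN the translators are cascaded residual autoencoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
t12 = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # translator B1 -> B2
t21 = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # translator B2 -> B1

def cyclic_translation_loss(b1: torch.Tensor, b2: torch.Tensor) -> torch.Tensor:
    # forward translation: estimate each bottleneck latent from the other one
    b2_hat, b1_hat = t12(b1), t21(b2)
    # reverse (cyclic) translation: map the estimates back to their source latents
    b1_cyc, b2_cyc = t21(b2_hat), t12(b1_hat)
    return (F.mse_loss(b2_hat, b2) + F.mse_loss(b1_hat, b1)
            + F.mse_loss(b1_cyc, b1) + F.mse_loss(b2_cyc, b2))

b1, b2 = torch.randn(8, d), torch.randn(8, d)  # bottleneck latents from the IB encoders
loss = cyclic_translation_loss(b1, b2)          # complete inputs: acts as a cross-modal consistency regularizer

# Incomplete inputs: if modality 2 is missing, its latent is imputed from modality 1
b2_imputed = t12(b1)                            # and passed to the fusion module in place of b2
```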
Q2: Cost for two-stage training and different modules.
A2: To address this concern, we have quantified the additional computational costs introduced by the two-stage training, VIB networks, multiple translators, and multimodal fusion (multi-head cross-attention) in Table I across GCNet, IMDer, and our proposed CyIN.
Table I. Comparison on model complexity for the state-of-the-art methods and the proposed CyIN, including the model parameter (M) and proportion percentage (Pc).
| Model | Total Param (M) | Increased Param (M/Pc) | Detail Module Param (M/Pc) | Total Training Time (h) | Inference Time (s/iteration) | Detail Module Latency (s) |
|---|---|---|---|---|---|---|
| GCNet | 144.34 M | 34.86 M / 24.15% | - | 1.47h | 70.62 s | - |
| IMDer | 168.31 M | 51.89 M / 32.16% | - | 1.89h | 103.75 s | - |
| CyIN (ours) | 123.49 M | 14.0 M / 11.35% | VIB (1.74 M / 12.42%) | 1.61h | 22.41s | VIB (0.37 s / 1.65%) |
|  |  |  | Translator (1.45 M / 10.35%) |  |  | Translator (17.51 s / 78.13%) |
|  |  |  | Fusion (9.64 M / 68.80%) |  |  | Fusion (0.26 s / 1.16%) |
CyIN achieves the lowest total parameter count (123.49M) and significantly lower inference time (22.41s) compared to GCNet (70.62s) and IMDer (103.75s). Besides, the training of CyIN is split into two stages, where Stage 1, which trains the informative space, is relatively fast.
The added parameters and inference latency from the additional modular components, namely the VIB and Translator modules, are relatively minor. Specifically:
- VIB introduces only 1.74 M parameters (1.41% of the total, or 12.42% of the increased parameters) and 0.37 s latency.
- Translator adds 1.45 M parameters (1.18% of the total, or 10.35% of the increased parameters) and 17.51 s latency (reconstruction for all missing-modality scenarios; 2.92 s for each missing scenario).
- The Fusion module accounts for the majority of the increased parameters (9.64 M, or 68.80% of the increased parameters) and 0.26 s inference latency.
These results demonstrate that the proposed modular extensions are lightweight relative to the overall architecture. Most of the additional cost arises from the fusion mechanism, not from the cyclic translators or VIB modules. Despite these additions, CyIN remains the most efficient in terms of both parameter size and inference latency. We will explore faster translation schemes, such as one-step flow matching or consistency models, to further speed up the translation process in the future.
Dear Reviewer WC7q, Thank you once again for your valuable feedback and thoughtful insights, which have been instrumental in helping us improve our work. As the discussion phase draws to a close, we kindly request you to let us know if the above clarifications and the previously added experiments have adequately addressed your remaining questions. Should there be any further concerns, we would be glad to address them before the discussion phase concludes. We are deeply grateful for your constructive and patient engagement during the entire review and discussion process. Warm regards, CyIN Author
The authors develop a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning, which builds an informative latent space by adopting token- and label-level Information Bottleneck (IB) cyclically among various modalities.
Strengths and Weaknesses
- the paper is well written
- the technique details are sound
- the experimental results are convincing
- the novelty is not enough
Questions
Multimodal latent representation learning was once a popular research topic in AI, with classical methods such as CCA widely explored. However, recent progress in this area appears limited. The main change lies primarily in the backbone architecture—shifting from traditional deep neural networks to today’s large language models (LLMs).
I did not see significant contributions from this work to the multimodal learning community. Its primary contribution appears to be the straightforward application of existing multimodal learning techniques—specifically those involving the information bottleneck—on modern LLMs.
The following is a more subjective comment from my personal perspective: VAE and the information bottleneck framework often feel like "one-size-fits-all" solutions that can be applied to virtually any representation learning problem. While they are well-suited for academic exploration and publishing, the complex training procedures involved—like those described in this paper—are unlikely to scale effectively for solving real-world AI tasks. For this reason, I personally do not prefer this type of work.
Limitations
N/A
Final Justification
The authors clarify the contribution of this work, which provides a new perspective of the information bottleneck to the multimodal learning community in bridging complete and incomplete multimodal learning.
Formatting Issues
N/A
We would like to express our sincere gratitude for your thoughtful questions and valuable feedback on our strengths, including "well written paper", "sound technique details", and "convincing experimental results". Our responses to your concerns are presented below. We are eager to engage in further discussions with you to address your concerns and enhance the quality of our work.
Q1: About novelty and contribution.
A1: We appreciate the reviewer’s feedback and would like to clarify and reaffirm the key contributions of our work to the multimodal learning community:
- Addressing the Limitations of Prior Work: Many existing approaches either neglect the performance degradation caused by real-world challenges such as sensor failure and corrupted inputs, or suffer from task-irrelevant noise when reconstructing them. Moreover, strategies such as data augmentation or imputation often reduce performance on complete multimodal inputs and demand prior knowledge about the missing situation, making it difficult to jointly handle complete and incomplete multimodal scenarios in real-world applications.
- A Unified Framework for Complete and Incomplete Multimodal Learning: Our proposed CyIN framework bridges this gap by unifying complete and incomplete multimodal learning within a shared informative latent space, offering a new direction for building robust multimodal systems capable of handling diverse scenarios without requiring separate models or heuristics. The token- and label-level IB with the cyclic translation mechanism over informative latents improves cross-modal interaction in multimodal fusion and enables effective reconstruction of missing modalities.
- Extensive Empirical Experiment Validation: We have validated CyIN across 6 sub-tasks and 10 diverse multimodal datasets involving 2–4 modalities. The framework consistently demonstrates strong performance under various missing modality settings.
Q2: Well-suited for academic exploration but not scalable for solving real-world AI tasks.
A2: We'd like to provide an open-minded discussion of both the academic and application value of our work. Admittedly, VAE and IB are often seen as "one-size-fits-all" solutions, as you suggested. However, academic exploration continues to yield valuable insights, and these models still have broad applicability. For example:
- VAEs are widely adopted in image latent encoding for recent AIGC applications such as Stable Diffusion [4].
- IB is actively being explored for enhancing safety in large language models [5].
Thus, we believe deeper theoretical exploration of VAE and IB in the context of multimodal models is just as important as pursuing application-focused research.
Moreover, to further validate the applicability of our proposed method, we conduct extensive experiments on current industrial multimodal applications, including:
- Sentiment analysis: MOSI, MOSEI datasets
- Emotion recognition: IEMOCAP, MELD datasets
- Multimodal recommendation [1]: Amazon Baby, Tiktok, Allrecipes datasets
- Face anti-spoofing [2]: CASIA-SURF dataset
- Dense prediction [2]: NYUv2 dataset
- Medical segmentation [3]: BraTS2023 dataset
Table. Experiment comparison for Multimodal Recommendation task of accuracy and fairness performance (%) on three datasets, including Amazon Baby, Tiktok, and Allrecipes. The modality setting follows the random missing protocol with 0.4 missing rate as in the SOTA¹ [1] paper.
| Dataset | Model | Recall@10 ↑ | Recall@20 ↑ | Precision@10 ↑ | Precision@20 ↑ | NDCG@10 ↑ | NDCG@20 ↑ | F@10 ↑ | F@20 ↑ | F_fuse@10 ↑ | F_fuse@20 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon Baby | SOTA¹ | 5.26 | 8.54 | 0.56 | 0.45 | 2.76 | 3.60 | 87.24 | 90.12 | 1.11 | 0.90 |
|  | CyIN (ours) | 5.43 | 8.56 | 0.57 | 0.45 | 2.89 | 3.68 | 90.10 | 92.70 | 1.14 | 0.90 |
| Tiktok | SOTA¹ | 4.81 | 7.39 | 0.48 | 0.37 | 2.95 | 3.60 | 88.15 | 92.27 | 0.96 | 0.74 |
|  | CyIN (ours) | 4.91 | 7.57 | 0.49 | 0.38 | 3.15 | 3.81 | 89.02 | 92.31 | 0.98 | 0.75 |
| Allrecipes | SOTA¹ | 2.49 | 3.36 | 0.24 | 0.16 | 1.33 | 1.55 | 96.82 | 89.52 | 0.49 | 0.33 |
|  | CyIN (ours) | 2.61 | 3.43 | 0.26 | 0.17 | 1.45 | 1.66 | 99.57 | 91.94 | 0.52 | 0.34 |
Table. Experiment comparison for Multimodal Face Anti-spoofing task with Average Classification Error Rate (ACER ↓) on the CASIA-SURF dataset. The modality setting follows the fixed missing protocol as in the original SOTA² [2] paper.
| RGB | Depth | IR | SOTA² (ACER ↓) | CyIN (ours) (ACER ↓) |
|---|---|---|---|---|
| ✓ |  |  | 7.33 | 4.48 |
|  | ✓ |  | 2.13 | 2.83 |
|  |  | ✓ | 10.41 | 7.75 |
| ✓ | ✓ |  | 1.02 | 2.26 |
| ✓ |  | ✓ | 3.88 | 3.06 |
|  | ✓ | ✓ | 1.38 | 1.20 |
| ✓ | ✓ | ✓ | 0.69 | 0.66 |
| Average |  |  | 3.84 | 3.18 |
Table. Experiment comparison for Multimodal Dense Prediction task with mIoU (↑) on the NYUv2 dataset. The modality setting follows the fixed missing protocol as in the original SOTA² [2] paper.
| RGB | Depth | SOTA² (mIoU ↑) | CyIN (ours) (mIoU ↑) |
|---|---|---|---|
| ✓ |  | 44.06 | 43.93 |
|  | ✓ | 41.82 | 43.84 |
| ✓ | ✓ | 49.89 | 50.46 |
| Average |  | 45.26 | 46.07 |
Table. Experiment comparison for Multimodal Medical Segmentation task with Dice Similarity Coefficient (DSC %) (↑) on the BraTS2023 dataset. The results are averaged over four modalities: FLAIR, T1, T1c, and T2, following the fixed modality setting in the original SOTA³ [3] paper.
| Class | SOTA³ (DSC ↑) | CyIN (ours) (DSC ↑) |
|---|---|---|
| Enhancing Tumor | 74.6 | 75.1 |
| Tumor Core | 85.0 | 85.8 |
| Whole Tumor | 90.6 | 91.2 |
The experimental results shown above illustrate that CyIN achieves state-of-the-art performance. Besides, we'd like to point out that the training procedure is flexible across different architectures, including Transformer-based (ours), graph-based (SOTA¹ [1]), CNN-based (SOTA² [2]), and Mamba-based (SOTA³ [3]) designs, for both modality-specific feature extraction and multimodal fusion.
Lastly, we scale up CyIN with different sizes of PLMs including BERT, RoBERTa, and DeBERTa-V3 as follows, indicating efficient scaling ability similar to the scaling law of PLM training.
Table. Comparison of the proposed CyIN using different sizes of language models on MOSI dataset under both complete and randomly incomplete multimodal settings.
| Setting | Model Variant (PLM) | Acc7 ↑ | F1 ↑ | MAE ↓ | Corr ↑ |
|---|---|---|---|---|---|
| Complete | BERT | 48.0 | 86.3 | 0.712 | 0.801 |
|  | RoBERTa | 49.5 | 88.2 | 0.692 | 0.823 |
|  | DeBERTa-V3 | 50.3 | 90.1 | 0.671 | 0.841 |
| Random Missing (MR=0.7) | BERT | 28.0 | 65.9 | 1.117 | 0.530 |
|  | RoBERTa | 29.3 | 67.2 | 0.992 | 0.556 |
|  | DeBERTa-V3 | 32.2 | 69.5 | 0.965 | 0.578 |
In summary, we believe our work provides a new perspective to the multimodal learning community in bridging complete and incomplete multimodal learning.
We sincerely hope that the above responses address your concerns and provide another perspective on the application of the proposed method.
[1] Jin Li et al. 2025. Generating with Fairness: A Modality-Diffused Counterfactual Framework for Incomplete Multimodal Recommendations. In Proceedings of the ACM on Web Conference 2025 (WWW'25).
[2] Shicai Wei et al. 2024. Robust Multimodal Learning via Representation Decoupling. In Proceedings of the 18th European Conference on Computer Vision (ECCV'24).
[3] Vittorio Pipoli et al. 2025. IM-Fuse: A Mamba-based Fusion Block for Brain Tumor Segmentation with Incomplete Modalities. In Proceedings of the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI'25).
[4] R. Rombach et al. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
[5] Z. Liu et al. 2024. Protecting Your LLMs with Information Bottleneck. In Advances in Neural Information Processing Systems (NeurIPS'24).
Thanks for your feedback, which clarifies the contributions of this work. I will raise my score.
Thank you very much for your kind response and for considering raising your score. We are grateful for the opportunity to improve our paper, and truly appreciate your precious time and effort.
The paper presents a novel framework, CyIN, designed to address the challenge of multimodal learning in scenarios with both complete and incomplete multimodal inputs. The authors introduce a cyclic informative latent space that enables efficient cross-modal interaction and multimodal fusion. This space is constructed using token-level and label-level Information Bottlenecks (IB) to capture task-relevant features and filter out noise. Additionally, the paper proposes a cross-modal cyclic translation mechanism to reconstruct missing modalities through forward and reverse propagation, enhancing robustness in incomplete input scenarios. Extensive experiments on four datasets demonstrate that CyIN achieves state-of-the-art performance in both complete and diverse incomplete multimodal learning settings, highlighting its effectiveness and generalization ability. The contributions include the development of the informative latent space, the cyclic interaction and translation mechanisms, and the unified framework for joint optimization of complete and incomplete multimodal learning.
Strengths and Weaknesses
Strengths
- The paper proposes a novel Cyclic Informative Learning framework (CyIN) that bridges the gap between complete and incomplete multimodal learning by constructing an informative latent space using token-level and label-level Information Bottlenecks.
- The introduced cross-modal cyclic translation mechanism enhances the model's robustness and performance in scenarios with missing modalities by reconstructing the missing information through forward and reverse propagation.
Weaknesses
- The paper assumes that all modalities contribute equally to the final task, but in practice, different modalities may have different importance to the task. For example, in some sentiment analysis tasks, text modalities may be more important than audio and visual modalities. This unbalanced contribution between modalities may affect the performance and generalization ability of the model.
- Although the paper proposes recurrent interaction and translation mechanisms based on information bottlenecks, the computational efficiency and scalability of these mechanisms can be a potential issue when dealing with large-scale multimodal data, for example, when the number of modalities increases or the data dimension is high.
- The experiments mainly focus on sentiment analysis and emotion recognition, and lack corresponding experimental verification for some broader multimodal tasks, such as multimodal question answering and multimodal recommendation. In addition, although the paper provides detailed hyperparameter settings and training details, no in-depth analysis of the sensitivity of different hyperparameters is performed in the experiments. For example, the impact of the β parameter in the information bottleneck on the model performance and how to adjust β for the best performance under different missing rates can be further explored.
Questions
see above
Limitations
yes
Final Justification
The response resolves most of my concerns, and I will maintain the score.
Formatting Issues
n/a
We would like to express our sincere gratitude for your thoughtful questions and valuable feedback on our strengths, including "novel framework" and "enhancing robustness". Our responses to your concerns are presented below. We are eager to engage in further discussions with you to address your concerns and enhance the quality of our work.
Q1: About the imbalanced contribution of diverse modalities.
A1: In our work, we assumed equal contribution across modalities to ensure generality and stability, particularly under randomly missing modality scenarios. However, we acknowledge that in practice, different modalities may contribute unequally depending on the task [4], summarized as balanced multimodal learning [5], another issue in the field of multimodal learning.
To partially address this, we add experiments varying the number of cross-modal translator (CRA) layers in Table I. We observe that allocating more layers to reconstruction allows the model to learn richer semantic representations, and the results indicate a more important contribution from the text modality than from the others; a structural sketch of the cascaded translator is given after the table below.
We plan to explore adaptive balancing strategies as part of our future work as stated in Section Limitations.
Table I. Performance comparison of CyIN using different CRA layer ratios for cross-modal translation across the three modalities, under the random missing setting (MR=0.7) on the MOSI dataset.
| CRA Layer Ratio | Acc7 ↑ | F1 ↑ | MAE ↓ | Corr ↑ |
|---|---|---|---|---|
| 4:4:1 | 28.7 | 67.5 | 1.114 | 0.532 |
| 2:2:1 | 28.5 | 67.2 | 1.122 | 0.521 |
| 1:1:1 (Original) | 28.0 | 65.9 | 1.117 | 0.530 |
| 1:1:2 | 27.5 | 66.3 | 1.119 | 0.531 |
| 1:1:4 | 27.4 | 65.6 | 1.116 | 0.534 |
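For reference, the sketch below illustrates a cascaded residual autoencoder (CRA) translator with a configurable number of blocks; a larger n_blocks corresponds to a larger share in the layer ratios above. Class and argument names (ResidualAEBlock, CRATranslator, n_blocks) are illustrative assumptions, not the identifiers used in our code.

```python
import torch
import torch.nn as nn

class ResidualAEBlock(nn.Module):
    """One autoencoder block that refines the previous estimate residually."""
    def __init__(self, d: int, d_hidden: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, d_hidden), nn.ReLU())
        self.dec = nn.Linear(d_hidden, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dec(self.enc(x))

class CRATranslator(nn.Module):
    """Cascade of residual autoencoder blocks translating one bottleneck latent into another."""
    def __init__(self, d: int, n_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualAEBlock(d) for _ in range(n_blocks))

    def forward(self, b_src: torch.Tensor) -> torch.Tensor:
        x = b_src
        for blk in self.blocks:
            x = blk(x)
        return x  # estimated bottleneck latent of the target modality

translator = CRATranslator(d=64, n_blocks=4)   # e.g., the "4" in a 4:4:1 layer ratio
print(translator(torch.randn(8, 64)).shape)
```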
Q2: About computational efficiency and scalability with modality number and data dimension.
A2: As shown in Table II, CyIN achieves the best inference efficiency and the lowest FLOPs when handling three input modalities, compared with state-of-the-art methods such as GCNet and IMDer.
Table II. Computational efficiency for the state-of-the-art methods and the proposed CyIN on MOSI dataset.
| Model | Total Param (M) | Total Training Time (h) | Inference Time (s/iteration) | FLOPs (T) |
|---|---|---|---|---|
| GCNet | 144.34 M | 1.47h | 70.62s | 3.747 |
| IMDer | 168.31 M | 1.89h | 103.75s | 5.466 |
| CyIN (ours) | 123.49 M | 1.61h | 22.41s | 1.594 |
Moreover, CyIN is designed to be scalable as the number of modalities increases. The cross-modal translation mechanisms introduce only minimal computational overhead relative to prior methods (each cross-modal translator adds only 1.45 M parameters, about 1.18% of the total model parameters). This is due to the modular nature of the architecture and the efficient reconstruction performance in the informative space. Since reconstruction is performed in this compressed latent space, the computational cost remains manageable in scenarios with high-dimensional data, where IB serves as a compact and informative representation extractor [6,7].
To further validate scalability, we conducted additional experiments on datasets involving 2 to 4 modalities, as detailed in the next response. These results demonstrate the robustness and efficiency of CyIN across varying modality configurations.
Q3: About experimental verification for broader multimodal tasks.
A3: To further validate the applicability of the proposed method, we conduct extensive experiments on current industrial multimodal applications including:
- Multimodal Recommendation SOTA¹ (3 modalities: visual, audio, textual) paper: [1] Jin Li et al. 2025. Generating with Fairness: A Modality-Diffused Counterfactual Framework for Incomplete Multimodal Recommendations. In Proceedings of the ACM on Web Conference 2025 (WWW'25).
- Multimodal Face Anti-spoofing (3 modalities: RGB, Depth, IR) and Dense Prediction (2 modalities: RGB, Depth) SOTA² paper: [2] Shicai Wei et al. 2024. Robust Multimodal Learning via Representation Decoupling. In Proceedings of the 18th European Conference on Computer Vision (ECCV'24).
- Multimodal Medical Segmentation (4 modalities: FLAIR, T1, T1c, T2) SOTA³ paper: [3] Vittorio Pipoli et al. 2025. IM-Fuse: A Mamba-based Fusion Block for Brain Tumor Segmentation with Incomplete Modalities. In Proceedings of the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI'25).
The experimental results shown below illustrate that CyIN reaches state-of-the-art performance.
Table III. Experiment comparison for Multimodal Recommendation task of accuracy and fairness performance (%) on three datasets, including Amazon Baby, Tiktok, and Allrecipes. The modality setting follows the random missing protocol with 0.4 missing rate as in the SOTA¹ paper.
| Dataset | Model | Recall@10 ↑ | Recall@20 ↑ | Precision@10 ↑ | Precision@20 ↑ | NDCG@10 ↑ | NDCG@20 ↑ | F@10 ↑ | F@20 ↑ | F_fuse@10 ↑ | F_fuse@20 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon Baby | SOTA¹ | 5.26 | 8.54 | 0.56 | 0.45 | 2.76 | 3.60 | 87.24 | 90.12 | 1.11 | 0.90 |
|  | CyIN (ours) | 5.43 | 8.56 | 0.57 | 0.45 | 2.89 | 3.68 | 90.10 | 92.70 | 1.14 | 0.90 |
| Tiktok | SOTA¹ | 4.81 | 7.39 | 0.48 | 0.37 | 2.95 | 3.60 | 88.15 | 92.27 | 0.96 | 0.74 |
|  | CyIN (ours) | 4.91 | 7.57 | 0.49 | 0.38 | 3.15 | 3.81 | 89.02 | 92.31 | 0.98 | 0.75 |
| Allrecipes | SOTA¹ | 2.49 | 3.36 | 0.24 | 0.16 | 1.33 | 1.55 | 96.82 | 89.52 | 0.49 | 0.33 |
|  | CyIN (ours) | 2.61 | 3.43 | 0.26 | 0.17 | 1.45 | 1.66 | 99.57 | 91.94 | 0.52 | 0.34 |
Table IV. Experiment comparison for Multimodal Face Anti-spoofing task with Average Classification Error Rate (ACER ↓) on the CASIA-SURF dataset. The modality setting follows the fixed missing protocol as in the original SOTA² paper.
| RGB | Depth | IR | SOTA² (ACER ↓) | CyIN (ours) (ACER ↓) |
|---|---|---|---|---|
| ✓ |  |  | 7.33 | 4.48 |
|  | ✓ |  | 2.13 | 2.83 |
|  |  | ✓ | 10.41 | 7.75 |
| ✓ | ✓ |  | 1.02 | 2.26 |
| ✓ |  | ✓ | 3.88 | 3.06 |
|  | ✓ | ✓ | 1.38 | 1.20 |
| ✓ | ✓ | ✓ | 0.69 | 0.66 |
| Average |  |  | 3.84 | 3.18 |
Table V. Experiment comparison for Multimodal Dense Prediction task with mIoU (↑) on the NYUv2 dataset. The modality setting follows the fixed missing protocol as in the original SOTA² paper.
| RGB | Depth | SOTA² (mIoU ↑) | CyIN (ours) (mIoU ↑) |
|---|---|---|---|
| ✓ |  | 44.06 | 43.93 |
|  | ✓ | 41.82 | 43.84 |
| ✓ | ✓ | 49.89 | 50.46 |
| Average |  | 45.26 | 46.07 |
Table VI. Experiment comparison for Multimodal Medical Segmentation task with Dice Similarity Coefficient (DSC %) (↑) on the BraTS2023 dataset. The results are averaged over four modalities: FLAIR, T1, T1c, and T2, following the fixed modality setting in the original SOTA³ paper.
| Class | SOTA³ (DSC ↑) | CyIN (ours) (DSC ↑) |
|---|---|---|
| Enhancing Tumor | 74.6 | 75.1 |
| Tumor Core | 85.0 | 85.8 |
| Whole Tumor | 90.6 | 91.2 |
Q4: About hyper-parameter sensitivity.
A4: We have conducted additional experiments, presented in Table VII, which systematically evaluate the effect of varying the most important hyper-parameters β and γ. The former controls the trade-off degree of mutual information in the IB, and the latter adjusts the contribution of the translation loss.
Table VII. Hyper-parameter sensitivity on β and γ of the proposed CyIN on the MOSI dataset with incomplete multimodal learning under the random missing setting (MR=0.7).
| Model Variants | Acc7↑ | F1↑ | MAE↓ | Corr↑ |
|---|---|---|---|---|
| β=2 | 24.6 | 59.4 | 1.248 | 0.461 |
| β=4 | 27.5 | 61.3 | 1.226 | 0.478 |
| β=8 | 30.0 | 63.6 | 1.199 | 0.519 |
| β=16 | 28.0 | 65.9 | 1.117 | 0.530 |
| β=32 | 26.4 | 66.0 | 1.118 | 0.506 |
| β=64 | 23.3 | 64.1 | 1.206 | 0.463 |
| γ=1 | 25.8 | 64.1 | 1.197 | 0.437 |
| γ=5 | 27.5 | 63.1 | 1.135 | 0.488 |
| γ=10 | 28.0 | 65.9 | 1.117 | 0.530 |
| γ=15 | 26.8 | 62.9 | 1.272 | 0.519 |
| γ=20 | 25.5 | 60.4 | 1.311 | 0.490 |
- Setting β=16 gives better performance as a balanced trade-off of mutual information between compression and prediction. Very small or large values of β lead to clear degradation in all metrics, showing that improper bottleneck strength harms performance.
- Setting γ=10 yields the best overall performance across metrics. A small γ results in weak regularization, making it hard to align modalities under severe missing conditions, while a large γ hurts performance greatly due to neglecting the construction of an effective informative space.
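For reference, β and γ act as the weights of the compression term and of the translation loss, respectively; the line below is a schematic placement only, and the exact formulation is given in the paper.

```latex
% Schematic placement of the two hyper-parameters (illustrative only).
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{task}}
\;+\; \beta \cdot \mathcal{L}_{\mathrm{compress}}
\;+\; \gamma \cdot \mathcal{L}_{\mathrm{translation}}
```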
[4] Peng X, Wei Y, Deng A, et al. Balanced multimodal learning via on-the-fly gradient modulation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) 2022.
[5] Zhang H, Wang W, Yu T. Towards robust multimodal sentiment analysis with incomplete data[J]. Advances in Neural Information Processing Systems (NeurIPS) 2024.
[6] Shwartz-Ziv R, Tishby N. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
[7] Saxe A M, Bansal Y, Dapello J, et al. On the Information Bottleneck Theory of Deep Learning. International Conference on Learning Representations (ICLR) 2018.
Dear Reviewer oakX, Thank you once again for your valuable feedback and thoughtful insights, which have been instrumental in helping us improve our work. As the discussion phase is approaching its end, we kindly ask whether the above clarifications and the previously added experiments have addressed your remaining questions. We would be happy to address any additional points you may have during the remaining time of the discussion phase. We are deeply grateful for your constructive and patient engagement during the entire review and discussion process. Warm regards, CyIN Author
Hi, authors,
Thanks for your response. It resolves most of my concerns, and I will maintain the score.
Thank you for your prompt response. We're glad to hear that most of your concerns have been addressed. We are grateful for your constructive suggestions to improve our paper, and truly appreciate your valuable time and effort.
The paper introduces a method called “CyIN”, short for Cyclic INformative Learning framework, which builds a joint latent space using token- and label-level information bottlenecks applied cyclically across modalities. This space enables cross-modal reconstruction of missing modalities through forward and reverse propagation. The approach achieves good performance on several datasets.
Strengths:
- Addresses a real-world problem where multimodal systems often encounter missing or corrupted inputs during deployment
- Interesting combinations of information theory with cross-modal translation
- Clear presentation with intuitive figures and comprehensive experimental evaluation
Weaknesses:
- The novelty lies primarily in combining existing techniques (e.g. information bottleneck and cyclic reconstruction) rather than introducing fundamentally new concepts
- The paper assumes equal importance across modalities, which may not reflect real-world scenarios where certain modalities dominate
- Some reviewers found the theoretical justification for combining information bottleneck with cyclic translation could be more rigorous
- The multi-stage training process and multiple components add complexity that may limit practical adoption
The consensus is borderline but on the side of acceptance, suggesting this paper is suitable for NeurIPS.