PaperHub
ICLR 2025 · Decision: Rejected
Overall rating: 4.8 / 10 (4 reviewers; min 3, max 6, std dev 1.1)
Individual ratings: 3, 5, 5, 6
Confidence: 3.3 · Correctness: 2.5 · Contribution: 2.0 · Presentation: 2.8

M3CoL: Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Submitted: 2024-09-19 · Updated: 2025-02-05
TL;DR

We introduce a novel Mixup-based contrastive learning method to capture shared relations inherent in real-world multimodal data, improving SOTA multimodal classification performance.

Abstract

Keywords
Contrastive learning, multimodal learning, representation learning, multimodal classification

Reviews and Discussion

Review (Rating: 3)

The paper introduces M3CoL (Multimodal Mixup Contrastive Learning), a method aimed at capturing shared, non-pairwise relationships within multimodal data. The framework includes a mixup-based contrastive loss to align mixed samples across modalities, facilitating more robust representations for multimodal classification tasks.

Strengths

  • Innovative way to perform contrastive learning: The use of Mixup in a contrastive learning setting for multimodal data is quite novel and is experimentally shown to have positive effects.
  • Experiments on attention maps between text and image regions provide a good illustration of the effectiveness of the alignment process.

Weaknesses

  • The motivation of the manuscript is not strong. Aligning positive and negative pairs in a pairwise manner does not necessarily ignore the shared relational information that exists between samples. There are lines of contrastive learning work (e.g., [1]) that align representations of samples within the same class. Why does mixup improve performance compared to these approaches?
  • The rationale for using the MixUp technique is not well stated. Is there any reason behind the choice of MixUp as a way to combine samples? Additional ablation studies could be provided to strengthen the choice empirically.
  • Besides the MixUp contrastive learning strategy, the rationale for applying the unimodal downstream loss is also insufficiently explained. While the ablation study shows an improvement, why does it help the overall system?

[1] Zhang, Shu, et al. "Use all the labels: A hierarchical multi-label contrastive learning framework." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Questions

Please refer to Weaknesses for related questions.

Comment

The motivation of the manuscript is not strong. Aligning positive and negative pairs in a pairwise manner does not necessarily ignore the shared relational information that exists between samples. There are lines of contrastive learning work (e.g., [1]) that align representations of samples within the same class. Why does mixup improve performance compared to these approaches?

  • We appreciate the reviewer’s feedback but respectfully disagree that the motivation of our manuscript is weak. Our key contribution addresses a critical limitation of traditional contrastive learning: its reliance on strict one-to-one correspondences across modalities, an assumption often violated in real-world data. As discussed in Line [49-76] and illustrated in Figure 1 (Left panel), traditional methods align paired modalities as positives and treat non-corresponding ones as negatives, failing to capture shared relations across different samples. For example, shared elements like "tomato sauce" and "basil" can relate across separate samples, which such methods overlook. In contrast, M3CoL (Figure 1, Right panel) generates mixed samples through convex combinations of data points, allowing the model to learn shared multimodal relationships beyond pairwise associations. This is crucial for modeling complex real-world relationships, such as imperfect bijections, which enhance generalization. While related works (e.g., Zhang et al., 2022) cluster semantically similar samples within class boundaries, they rely on rigid label hierarchies. M3CoL explicitly captures shared structures that may not align strictly with predefined classes, leveraging mixup-based contrastive loss to map representations in a latent space respecting multimodal semantics. This is especially valuable in datasets like ROSMAP/BRCA (shared biological pathways) and Food-101 (overlapping food attributes). Empirically, M3CoL demonstrates state-of-the-art performance on N24News, ROSMAP, and BRCA and comparable results on Food-101, validating its ability to model shared relations and generalize across domains. We believe this directly addresses the limitations of traditional approaches, offering a robust alternative for multimodal learning.
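
For illustration, the following is a minimal sketch (not our exact implementation; the names, the soft-target weighting, and the temperature value are placeholders) of how a mixup-based cross-modal contrastive loss of this kind can be computed, with mixed anchors treated as soft positives of both of their constituent samples:

```python
import torch
import torch.nn.functional as F

def mixup_contrastive_loss(z1_mixed, z2, partner_idx, lam, temperature=0.1):
    """Illustrative cross-modal contrastive loss with mixed anchors.

    z1_mixed:    embeddings of mixed modality-1 samples, shape (N, d); row n embeds
                 lam * x_n^1 + (1 - lam) * x_{partner_idx[n]}^1
    z2:          embeddings of the unmixed modality-2 samples, shape (N, d)
    partner_idx: index of the sample each row was mixed with, shape (N,), assumed != n
    lam:         mixing coefficient (scalar or per-sample tensor of shape (N,))
    """
    z1_mixed = F.normalize(z1_mixed, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1_mixed @ z2.T / temperature  # (N, N) cross-modal similarities

    # Soft targets: each mixed anchor is a positive of both constituents,
    # weighted by lam and (1 - lam), rather than a single one-to-one positive.
    n = z1_mixed.size(0)
    targets = torch.zeros_like(logits)
    targets[torch.arange(n), torch.arange(n)] = lam
    targets[torch.arange(n), partner_idx] = 1.0 - lam

    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```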

The rationale for using the MixUp technique is not well stated. Is there any reason behind the choice of MixUp as a way to combine samples? Additional ablation studies could be provided to strengthen the choice empirically.

  • We respectfully disagree with the reviewer’s concern regarding the rationale for Mixup, as it is clearly addressed in lines 77-86. Mixup and its variants (e.g., Zhang et al., 2017; Yun et al., 2019) are well-established for enhancing feature spaces, improving robustness, and mitigating overfitting, particularly in low-sample settings. Our method leverages Mixup not only as a data augmentation tool but as a mechanism to generate mixed samples that capture shared relationships across modalities, addressing limitations of traditional contrastive methods. Building on recent advances (e.g., Shen et al., 2022; Kim et al., 2020), M3CoL adapts Mixup to multimodal contexts, providing a principled approach for modeling complex shared relationships.
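
For reference, standard input mixup (Zhang et al., 2017) forms a convex combination of two samples with a Beta-sampled coefficient; a minimal sketch, with alpha as a placeholder value:

```python
import numpy as np

def mixup(x_i, x_j, alpha=0.4):
    """Standard input mixup (Zhang et al., 2017): convex combination of two
    samples with lambda drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_i + (1.0 - lam) * x_j, lam
```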

Besides the MixUp contrastive learning strategy, the rationale for applying the unimodal downstream loss is also insufficiently explained. While the ablation study shows an improvement, why does it help the overall system?

  • We respectfully disagree with the concern that the rationale for applying the unimodal downstream loss is insufficiently explained. This is explicitly discussed in lines 114-117 of the manuscript. The unimodal prediction modules provide additional supervision during training through classifiers 1 and 2. This strategy enables deeper integration of modalities by allowing the system to leverage complementary strengths of different modalities. For instance, when one modality is weak or noisy, the other can compensate, enhancing overall robustness and performance. The ablation study further validates this design choice, demonstrating its effectiveness in improving the system's capability to learn robust representations.
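
For illustration only, a hypothetical sketch of how the unimodal classifier heads can add supervision alongside the fused prediction and the contrastive term (the weights and function names are placeholders, not the values used in the paper):

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def total_loss(fusion_logits, logits_m1, logits_m2, mix_contrastive, y,
               w_uni=1.0, w_con=1.0):
    """Hypothetical loss composition: the fused prediction is trained jointly with
    per-modality classifier heads (extra supervision per modality), plus the
    mixup contrastive term. The weights w_uni and w_con are placeholders."""
    supervised = ce(fusion_logits, y) + w_uni * (ce(logits_m1, y) + ce(logits_m2, y))
    return supervised + w_con * mix_contrastive
```
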
Comment

Dear Reviewer 8Mrb, We hope we have addressed all the concerns you pointed out. Please let us know if you need any further clarifications; we will be happy to address them.

Review (Rating: 5)

This study introduces M3CoL, a deep multimodal learning method for capturing complex relationships in real-world data. M3CoL captures shared multimodal relationships by employing a contrastive loss based on mixed samples, and introduces a fusion module with supplementary supervision for multimodal classification tasks.

Strengths

  1. M3CoL uses a smart technique to find and learn common patterns across different data types. It’s like having a tool that can spot similarities in things that might not look alike, making it good at understanding complex data relationships.
  2. The experiments and analysis are extensive, involving multiple datasets with various types of data and analyses.

Weaknesses

The reasons for the training sample selection strategy are not explained, and some experimental results are incomplete.

Questions

  1. The ACC for the Body section in N24News has not been provided.
  2. Samples from modality 1 (x_i^1,x_j^1) and modality 2 (x_i^2,x_k^2), along with their respective mixed data, are fed into encoders to generate embeddings. How were samples j and k selected, and why can’t they both be j?
Comment

The reasons for the training sample selection strategy are not explained, and some experimental results are incomplete.

  • As stated in the paper (Line 159), the mixing indices j and k are drawn randomly, without replacement, from [1, N] for both modalities, where N denotes the total number of samples in the batch. j and k are kept distinct to ensure diversity in the mixed samples. We respectfully ask the reviewer to specify any perceived incompleteness in the results, as we believe all key experiments have been included.

The ACC for the Body section in N24News has not been provided.

  • Thank you for pointing this out. We did not report the ACC for the Body section in N24News because the major baselines do not provide corresponding numbers for comparison; this ensures a fair and consistent evaluation across all reported metrics.

Samples from modality 1 (x_i^1,x_j^1) and modality 2 (x_i^2,x_k^2), along with their respective mixed data, are fed into encoders to generate embeddings. How were samples j and k selected, and why can’t they both be j?

  • Thank you for your observation regarding the selection of mixing indices j and k for modalities 1 and 2, respectively. As mentioned in the paper (Line 159), the indices j and k are drawn randomly, without replacement, from the set [1, N], where N denotes the total number of samples in the batch. The indices are kept distinct to ensure diversity in the mixed samples.
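
Purely as an illustration of this sampling strategy (the exact implementation may differ), one independent permutation can be drawn per modality, with a re-draw so that j and k differ at every position:

```python
import torch

def sample_mixing_indices(batch_size, max_tries=10):
    """Draw mixing partners j (modality 1) and k (modality 2) as random
    permutations of the batch, re-sampling k until it differs from j at
    every position, so each sample mixes with two distinct partners."""
    j = torch.randperm(batch_size)
    k = torch.randperm(batch_size)
    for _ in range(max_tries):
        if not torch.any(j == k):
            break
        k = torch.randperm(batch_size)
    return j, k
```
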
Comment

Dear Reviewer MUir, We hope we have addressed all the concerns you pointed out. Please let us know if you need any further clarifications; we will be happy to address them.

Review (Rating: 5)

The paper introduces M3CoL, a novel multimodal learning approach that leverages mixup contrastive learning to capture nuanced shared relations across modalities, going beyond traditional pairwise associations. The key contribution is a mixup-based contrastive loss function that aligns mixed samples from one modality with corresponding samples from others. The work highlights the importance of learning shared relations for robust multimodal learning and has implications for future research.

Strengths

  1. M3CoL uses mixup-based contrastive learning to capture shared relations in multimodal data, offering a new perspective on multimodal representation learning.
  2. The theoretical analysis of M3CoL, including the contrastive loss and the integration of unimodal and fusion modules, contributes to the theoretical understanding of multimodal learning.
  3. The paper is well written, with clear explanations of the methodology, experiments, and results, making it accessible to readers.

Weaknesses

  1. The paper does not deeply address how M3CoL scales with very large datasets, which could be a limitation given the increasing size of real-world datasets.
  2. There is a potential risk of overfitting with mixup, especially in early training stages. More analysis on balancing generalization and overfitting would be valuable.
  3. M3CoL's effectiveness relies heavily on the quality of mixed samples. A discussion of how data quality variations across modalities might affect performance is lacking.

Questions

See the Weaknesses above.

Comment

The paper does not deeply address how M3CoL scales with very large datasets, which could be a limitation given the increasing size of real-world datasets.

  • We appreciate this concern. Our experiments demonstrate M3CoL's effective handling of moderately large datasets (Food-101: 60K samples, N24 News: 48K samples). The model's architecture enables batch processing and scales linearly with sample size, while memory requirements depend primarily on batch size rather than total dataset size. These characteristics make M3CoL practically applicable for large-scale deployments. Additionally, the consistent performance across datasets of varying sizes (from hundreds to tens of thousands of samples) indicates robust scalability.

There's a potential risk of overfitting with mixup, especially in early training stages. More analysis on balancing generalization and overfitting would be valuable.

  • We respectfully disagree with the concern about overfitting during early training stages. As clearly demonstrated in Figure 6(a) and 6(b), our test accuracy curves show stable and consistent improvement throughout training, with no signs of performance degradation that would indicate overfitting. In fact, the learning curves demonstrate that both mixup and M3CoL variants maintain steady improvement or stable performance even after 40,000 steps, suggesting effective regularisation rather than overfitting. The smooth, monotonically increasing nature of these curves provides empirical evidence that our approach successfully balances model generalisation and learning capacity.

M3CoL's effectiveness relies heavily on the quality of mixed samples. Discussion on how data quality variations across modalities might affect performance is lacking.

  • We acknowledge this valuable point about modality-specific data quality variations. While our current work focuses on establishing M3CoL's core methodology and demonstrating its effectiveness on standard benchmark datasets, analysing the impact of varying data quality across modalities represents an important future direction. We plan to explore this through systematic experiments with controlled quality variations in our future work.
Comment

Dear Reviewer mvky, We hope we have addressed all the concerns you pointed out. Please let us know if you need any further clarifications; we will be happy to address them.

Review (Rating: 6)

The paper introduces M3CoL to capture complex shared relationships in multimodal data by aligning mixed samples from one modality with corresponding samples from others. This method leverages a Mixup-based contrastive loss with controlled mixup factor, extending beyond typical pairwise associations. A SoftClip-based loss is also adopted to enable many-to-many relationships between the two modalities. M3CoL also incorporates a novel multimodal learning framework that integrates unimodal prediction modules and a fusion module to improve classification. Experimental results show that M3CoL outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, and achieves comparable performance on Food-101.

Strengths

Clarity: The paper is well-structured, with clear explanations of the methodology, including detailed descriptions of the Mixup-based contrastive loss and the unimodal and fusion modules. 

Significance: M3CoL advances multimodal classification by addressing the limitations of traditional contrastive methods, offering improved generalization across domains. Its contributions are valuable for future research in multimodal learning, especially for nuanced multimodal relationships such as those found in medical datasets.

Weaknesses

Originality: Incorporating Mixup in contrastive learning is not new [1-3], even in a multimodal setting ([4-6]; see Questions). The reviewer would truly appreciate the authors' further discussion of [4-6].

Significance: the datasets, especially the non-medical ones, are relatively small. The effectiveness of the method is yet to be demonstrated on larger, real-world datasets. Since the method is relatively straightforward, larger-scale experiments would improve the significance of the submission.

[1] Zhao, Tianhao, et al. "MixIR: Mixing Input and Representations for Contrastive Learning." IEEE Transactions on Neural Networks and Learning Systems (2024).

[2] Liu, Zixuan, et al. "ChiMera: Learning with noisy labels by contrasting mixed-up augmentations." arXiv preprint arXiv:2310.05183 (2023).

[3] Bandara, Wele Gedara Chaminda, Celso M. De Melo, and Vishal M. Patel. "Guarding Barlow Twins Against Overfitting with Mixed Samples." arXiv preprint arXiv:2312.02151 (2023).

Questions

Could the authors kindly discuss the following related work ([6] being concurrent):

[4] Wang, Teng, et al. "Vlmixer: Unpaired vision-language pre-training via cross-modal cutmix." International Conference on Machine Learning. PMLR, 2022.

[5] Georgiou, Efthymios, Yannis Avrithis, and Alexandros Potamianos. "PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis." arXiv preprint arXiv:2312.12334 (2023).

[6] Bafghi, Reza Akbarian, et al. "Mixing Natural and Synthetic Images for Robust Self-Supervised Representations." arXiv preprint arXiv:2406.12368 (2024).

Comment

Originality: Incorporating Mixup in contrastive learning is not new [1-3], even in a multimodal setting ([4-6]; see Questions). The reviewer would truly appreciate the authors' further discussion of [4-6].

  • We appreciate the reviewer highlighting these relevant works. While mixup in contrastive learning has indeed been explored, our work differs in several key aspects: previous works [1-3] focus on single-modality scenarios, while M3CoL introduces a novel multimodal contrastive framework that specifically addresses cross-modal interactions. Regarding the multimodal works:
    - VLMixer [4] focuses on vision-language pre-training using CutMix, while M3CoL proposes a more general framework for any number of modalities.
    - PowMix [5] is specific to sentiment analysis, whereas M3CoL is domain-agnostic.
    - [6] focuses on mixing synthetic and natural images, while M3CoL addresses general multimodal data mixing.
    Our key novelty lies in combining multimodal supervision with mixup-based contrastive learning in a unified framework that can handle arbitrary numbers of modalities, which has not been explored in previous works.

Significance: the datasets, especially the non-medical ones, are relatively small. The effectiveness of the method is yet to be demonstrated on larger, real-world datasets. Since the method is relatively straightforward, larger-scale experiments would improve the significance of the submission.

  • Thank you for the valuable feedback. Our experiments demonstrate M3CoL's robust scalability, with consistent performance across datasets ranging from hundreds to tens of thousands of samples (e.g., Food-101: 60K, N24 News: 48K). The model's architecture scales linearly with dataset size, with memory requirements tied to batch size, making it practical for larger datasets. While our medical datasets are smaller, they include more modalities, demonstrating M3CoL's ability to handle diverse multimodal scenarios. Testing on larger real-world datasets is a promising future direction to further validate its effectiveness.

Could the authors kindly discuss the following related work ([6] being concurrent):

  • As discussed in our previous response, we have highlighted the key differences between M3CoL and works [1-6]. We will rewrite our related work section to include comprehensive discussions of all these works to better contextualise our contributions.
Comment

The reviewer appreciates the authors’ responses. They mostly addressed the reviewer’s concern, and the reviewer updated the score.

Comment

We are pleased to hear that our response has addressed your concerns, and we will certainly revise the related work section as you suggested. Thank you for your time and consideration!

Comment

Dear Reviewers,

We appreciate your valuable feedback and have provided detailed responses to all your comments. We kindly ask you to review our responses. Should you have any further questions, please do not hesitate to initiate a discussion with us.

Thank you for your time and consideration!

AC Meta-Review

The paper introduces M3CoL, a method for capturing intricate shared relationships in multimodal data by aligning samples across different modalities. It utilizes a Mixup-based contrastive loss with controlled mixup factors to go beyond traditional pairwise associations. Additionally, a SoftClip-based loss is employed to facilitate many-to-many relationships between modalities. M3CoL incorporates a unique multimodal learning framework that integrates unimodal prediction modules and a fusion module to enhance classification accuracy.

Additional Comments from the Reviewer Discussion

After the rebuttal, the majority of the reviewers kept their ratings. One reviewer raised the rating to borderline accept, while the remaining reviewers were still not convinced by the rebuttal, with one reviewer suggesting clear rejection. The major points include the limited technical innovation with respect to previous relevant papers and the insufficiently motivated use of the MixUp technique, among others. Reviewer 8Mrb responded to the AC during the discussion and remained unconvinced by the rebuttal. After considering all factors, the AC thinks that the paper is not yet ready and suggests that the authors consider the reviewers' suggestions to improve the paper for a future submission.

Final Decision

Reject