Improving Multimodal Learning Balance and Sufficiency through Data Remixing
Abstract
Reviews and Discussion
This paper introduces a method called Data Remixing to alleviate modality laziness and modality clash, ensuring both learning sufficiency and multimodal balance. The authors demonstrate that batch-level gradient direction conflicts lead to modality imbalance. First, the authors divide the samples into K subsets based on which modality the model learns worst, indicated by the KL divergence between the unimodal prediction distribution and the uniform distribution. Second, the authors use these subsets to reassemble batch data and update the model. The ablation experiments show the effectiveness of the method.
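A minimal sketch of this KL-based routing step (the function names and the per-modality logits interface below are illustrative assumptions, not the paper's code):

```python
import math
import torch
import torch.nn.functional as F

def kl_from_uniform(logits):
    """Per-sample KL(p || uniform) with p = softmax(logits); equals log C - H(p)."""
    log_p = F.log_softmax(logits, dim=-1)
    num_classes = logits.size(-1)
    return (log_p.exp() * (log_p + math.log(num_classes))).sum(dim=-1)

def assign_to_weakest_modality(unimodal_logits):
    """unimodal_logits: dict {modality_name: (N, C) logits}.
    Routes each sample to the modality whose prediction is closest to uniform,
    i.e. the modality the model has learned worst on that sample."""
    names = list(unimodal_logits)
    scores = torch.stack([kl_from_uniform(unimodal_logits[m]) for m in names])  # (M, N)
    return [names[i] for i in scores.argmin(dim=0).tolist()]
```

Each subset formed this way would then be used to reassemble batches that contain only the corresponding modality.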
Questions for Authors
- How does the KL divergence distribution change during the training process?
- Given that different samples inherently contain varying degrees of modality information, how do you ensure that using KL divergence as your discrimination metric doesn't cause the model to learn from excessively noisy signals?
- What is the specific computational overhead during model training?
Claims and Evidence
The paper presents a framework for addressing modality laziness and clash through batch-level optimization, supported by empirical evidence. While the method shows promise in audio-visual tasks, its effectiveness remains untested in broader scenarios. This raises doubts about whether the approach truly generalizes beyond the tested modalities.
Additionally, while the paper states there is "no additional computational overhead during inference," this assertion would be strengthened by including specific measurements. A comparative analysis of inference time and memory usage between the proposed method and established baselines would provide valuable insights into the practical scalability of the approach in real-world applications.
Methods and Evaluation Criteria
The method provides a framework from a data perspective to alleviate modality laziness. The evaluation criteria are intuitive and sound, but additional evidence would strengthen the demonstration of its effectiveness. Specifically, the paper would benefit from: (1) comprehensive comparisons between unimodal performance baselines and the corresponding unimodal branches in the proposed method, (2) ablation studies testing alternative metrics to KL divergence for modality evaluation, and (3) quantitative measurements of computational efficiency during both training and inference phases. These additional empirical validations would more conclusively establish the method's advantages over existing approaches.
Theoretical Claims
The theoretical proofs are correct, but their contribution is not particularly significant.
Experimental Design and Analyses
The experimental designs and analyses in this paper are reasonable but insufficient. First, the paper lacks comparisons of unimodal performance across methods and architectures. For example, it should report unimodal baselines (audio-only, visual-only) on CREMAD and compare their accuracy with the corresponding unimodal branches in Data Remixing and other imbalanced multimodal learning methods; this would further validate the improvement in multimodal learning balance. Second, the paper lacks experiments validating the effectiveness of the KL divergence-based evaluation; ablation studies comparing KL divergence with alternative metrics would help. Third, the paper argues that low-KL samples are "insufficiently trained", but this is an interpretation without proof. The authors should visualize feature distributions (e.g., t-SNE) of low-KL samples before and after remixing, or ablate the remixing step for low-KL samples and check whether their accuracy drops significantly. Finally, the paper claims training efficiency but provides no quantitative evidence, such as training time or FLOPs, to support this claim.
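One concrete way to run the suggested t-SNE check, as a sketch assuming the low-KL sample features before and after remixing are available as NumPy arrays (`feats_before` and `feats_after` are hypothetical names):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_low_kl_tsne(feats_before, feats_after, path="low_kl_tsne.png"):
    """Embed low-KL sample features from both checkpoints in a shared t-SNE map."""
    feats = np.concatenate([feats_before, feats_after], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n = len(feats_before)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="before remixing")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="after remixing")
    plt.legend()
    plt.savefig(path, dpi=200)
```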
Supplementary Material
Not provided in the original paper.
Relation to Prior Work
This paper makes an important contribution to the field of imbalanced multimodal learning by addressing the fundamental challenges of modality laziness and modality clash. The authors' analysis of batch-level optimization conflicts and their KL-based method for evaluating modality-specific learning provides valuable insights for the community. This work builds upon previous approaches but takes a more data-centric perspective. However, while the proposed Data Remixing method shows promising results, the paper would be strengthened by more thorough experimental comparisons with state-of-the-art imbalanced multimodal learning techniques such as gradient modulation methods, prototype learning, and knowledge distillation approaches.
Missing Essential References
The paper's citations appear comprehensive, covering major works in the field.
Other Strengths and Weaknesses
Strengths:
- The paper introduces a novel sample-level evaluation method for assessing unimodal learning capacity, which is computationally efficient and readily extensible to multiple modalities.
- The proposed framework demonstrates strong flexibility as it operates independently of model architecture, allowing it to be combined with various existing multimodal learning methods.
- The work provides an insightful analysis of modality clash at the batch-level optimization stage, offering an intuitive explanation that advances our theoretical understanding of multimodal learning challenges.
Weaknesses:
- Despite claims of efficiency, the paper lacks concrete measurements quantifying the method's training overhead in terms of training time, memory usage, or computational complexity compared to baseline methods.
- The effectiveness of using KL divergence as the sample-level evaluation metric remains insufficiently justified. The paper lacks comparative analysis with alternative metrics and doesn't provide empirical evidence showing why this particular approach is optimal for identifying modality-specific training needs.
- The experimental validation is limited to audio-visual tasks with relatively simple fusion methods. Additional experiments on more diverse modality combinations and complex multimodal scenarios would better demonstrate the method's generalizability.
Other Comments or Suggestions
- Some expressions in the paper lack clear references. For example, the Resample method in the experiment section is not properly cited, and the headers for dropout and head in the ablation study are confusing.
- The rationale for selecting KL divergence as the evaluation metric deserves more thorough explanation. The paper should explicitly discuss why this metric is suitable for assessing modality learning capacity.
Weakness1: There are no concrete measurements to demonstrate the efficiency of Remix.
Response:
- Our method focuses on variations at the data level, and we have reported the size of the training set in Table 2 of the main text to demonstrate efficiency.
- We measure the training time of four methods under the same conditions, as shown below. We observe that sample-level evaluations tend to increase the training time, whereas Remix does not expand the training set, making it more efficient.
| Dataset | Baseline | Remix | Resample | MLA |
|---|---|---|---|---|
| CREMAD(sec) | 1536 | 2357 | 4525 | 6128 |
| KS(sec) | 3849 | 4946 | 10362 | 12868 |
Weakness2: Lack of justification for the effectiveness of the KL metric and evaluation of unimodal performance.
Response:
- We are inspired by the measurement of uncertainty in Active Learning when choosing KL divergence as a sample-level metric for selecting modality-specific training samples.
- We have provided training results using other feasible metrics for comparison, with results presented in the following table, and we summarize their weaknesses below.
- Entropy: Mathematically equivalent to KL against a uniform reference (a short derivation follows the table below).
- Loss: Loss tends to overly bias the data toward the weak modality, preventing the strong modality from training effectively.
- Shapley: Shapley values sometimes fail to distinguish between modalities, as noted in Footnote 1 of the main text.
| Dataset | Baseline | Loss | Shapley | KL |
|---|---|---|---|---|
| CREMAD(A+V) | 64.52 | 69.89 | 68.28 | 72.72 |
| CREMAD(V) | 41.67 | 54.17 | 53.49 | 53.63 |
| CREMAD(A) | 53.76 | 52.69 | 53.90 | 54.57 |
| KS(A+V) | 50.23 | 54.78 | 53.93 | 55.63 |
| KS(V) | 29.03 | 44.80 | 42.41 | 42.68 |
| KS(A) | 40.32 | 40.21 | 40.90 | 44.06 |
- Another important point: by examining the unimodal accuracies, we can see that Remix alleviates insufficient modality learning, and even the strong modalities are improved.
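For completeness, the entropy/KL equivalence noted above follows directly from the definition of KL divergence against the uniform distribution $u$ over $C$ classes:

```latex
\mathrm{KL}(p \,\|\, u) \;=\; \sum_{c=1}^{C} p_c \log \frac{p_c}{1/C} \;=\; \log C - H(p),
```

so ranking samples by KL divergence from uniform and ranking them by (negative) entropy give the same ordering.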
Weakness3: The performance in other modality combinations and multimodal scenarios.
Response:
- Our method is not restricted to specific modalities. In our theoretical analysis, we make no prior assumptions about modality properties, ensuring its general applicability. Meanwhile, the key steps of Remix, decoupling and reassembling, are modality-agnostic. The process only considers the accuracy relationship between modality pairs at the sample level without imposing any constraints on the modality type.
- Our method is not limited by the number of modalities. As the number of modalities increases, our method remains applicable by simply retaining the modality with the lowest KL divergence during the decoupling process. The selection mechanism remains valid in more complex scenarios involving three or more modalities.
- To further demonstrate the broad effectiveness of Remix, we conduct additional experiments on UCF101 with two modalities (optical flow, vision) and CMU-MOSEI with three modalities (text, vision, audio). As shown in the table, Remix consistently improves performance, further validating its wide applicability.
| Dataset | Baseline | GBlend | OGM | PMR | Resample | MLA | Remix |
|---|---|---|---|---|---|---|---|
| UCF101 | 80.78 | 82.82 | 82.55 | 81.87 | 84.09 | 83.03 | 84.59 |
| CMU-MOSEI | 83.32 | 84.45 | 85.03 | 84.13 | 84.50 | 82.84 | 85.89 |
Weakness4: About Dropout and Head in ablation study.
Response:
- Dropout and Head are strategies to obtain unimodal outputs. Dropout refers to masking other modalities and using the output of the multimodal classification head as the unimodal output. Head involves adding an independent classification head to each encoder and using its output as the unimodal output.
- In the Remix method, we select the Head approach for more accurate unimodal results. To synchronously update the parameters, we incorporate a unimodal loss into the loss function, which has been shown to improve multimodal capabilities. This ablation study aims to demonstrate that the improvements brought by Remix do not stem from the introduction of the unimodal loss. A sketch contrasting the two strategies follows below.
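A minimal sketch of the two strategies on a concatenation-fusion model (the class and its interface are assumptions for illustration, not the paper's implementation):

```python
import torch
import torch.nn as nn

class AVModel(nn.Module):
    def __init__(self, dim=512, num_classes=6):
        super().__init__()
        self.enc_a, self.enc_v = nn.LazyLinear(dim), nn.LazyLinear(dim)  # placeholder encoders
        self.fusion_head = nn.Linear(2 * dim, num_classes)               # multimodal head
        self.head_a = nn.Linear(dim, num_classes)                        # "Head": per-encoder heads
        self.head_v = nn.Linear(dim, num_classes)

    def forward(self, audio, video):
        fa, fv = self.enc_a(audio), self.enc_v(video)
        return self.fusion_head(torch.cat([fa, fv], dim=-1)), self.head_a(fa), self.head_v(fv)

def unimodal_logits_dropout(model, audio, video):
    """'Dropout': mask the other modality and read the multimodal head."""
    audio_only, _, _ = model(audio, torch.zeros_like(video))
    video_only, _, _ = model(torch.zeros_like(audio), video)
    return audio_only, video_only

def unimodal_logits_head(model, audio, video):
    """'Head': read the independent per-encoder classification heads."""
    _, logits_a, logits_v = model(audio, video)
    return logits_a, logits_v
```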
Weakness5: How to ensure that using KL divergence doesn't cause the model to learn from excessively noisy signals?
Response:
- We also encountered similar issues in our experiments. Therefore, we have acknowledged this potential limitation in the summary of the main text. Our current approach to addressing this challenge involves setting additional thresholds.
- In UCF101, we observe that the KL divergence values for optical flow are consistently smaller than those for the visual modality. We adopted two strategies (a sketch of both follows below):
- Scaling-Based Adjustment: We introduce a hyperparameter to scale the KL divergence of the optical flow modality before comparing it with that of the visual modality, achieving a performance of 84.59%.
- Minimum KL Threshold: We set a lower bound on KL divergence to ensure the model does not overly focus on samples with minimal information content, resulting in a performance of 84.01%.
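A sketch of the two adjustments, assuming per-sample KL tensors `kl_flow` / `kl_visual`; the hyperparameter values are placeholders, not the ones used in the paper:

```python
import torch

def flow_is_weaker(kl_flow, kl_visual, alpha=2.0):
    """Scaling-based adjustment: rescale the optical-flow KL before the comparison
    so its systematically smaller magnitude does not dominate the routing."""
    return alpha * kl_flow < kl_visual   # True -> route the sample to the flow subset

def informative_mask(kl_flow, kl_visual, kl_min=0.05):
    """Minimum-KL threshold (one possible realization): samples whose KL stays
    below the floor even for their weakest modality carry little usable signal
    and are not rerouted."""
    return torch.minimum(kl_flow, kl_visual) >= kl_min
```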
Thank you very much for your response. That clarified my concerns, so I increased my score.
The authors address the problems of modality laziness and cross-modal clash in multimodal joint learning simultaneously. They propose a method that remixes the original multimodal input pairs: multimodal data are decoupled into unimodal subsets, difficult samples are selected for each modality to avoid modality dominance, and the data are then reassembled at the batch level to enable gradient alignment and avoid cross-modal interference. Experiments on public datasets show large improvements over existing methods, generality across various fusion models, and compatibility with other methods.
Questions for Authors
- This method includes two steps, data decoupling and reassembling. Can it be used for online training when the data is dynamically changing?
- Apart from KL divergence, have the authors considered other metrics to evaluate the performance of the modalities and determine the decoupling results? For example, would entropy be more robust?
- How necessary is the warm-up phase, and can we reduce its duration or directly use pre-trained unimodal encoders instead to further improve the efficiency of multimodal training?
Claims and Evidence
Yes, the claims are well supported by extensive experiments and analyses.
Methods and Evaluation Criteria
Yes, the proposed method is theoretically feasible and effective.
Theoretical Claims
I have verified the correctness of the theoretical claims, and there are no issues.
Experimental Design and Analyses
I reviewed the experimental designs of the paper. The experiments and analyses are sufficient to support the claims.
Supplementary Material
This submission does not include supplementary material.
Relation to Prior Work
The motivation and ideas differ from others:
- Previous methods only focused on modality laziness, while this paper explicitly proposes that the impact of modality imbalance is bidirectional: weak modalities are suppressed by strong modalities, and at the same time, the two modalities interfere with each other's optimization directions.
- Previous methods did not address the issue of modality conflict in joint learning from the deeper perspective of batch-level gradient inconsistency. To the best of my knowledge, this is the first work to address modality imbalance at the batch level.
Missing Essential References
No essential references are left undiscussed.
Other Strengths and Weaknesses
Strengths
- Novelty: the data remixing method is novel, especially the idea of batch-level data reassembly to prevent cross-modal gradient interference.
- Practicality: the method does not require dataset expansion.
- Large improvement and generalizability: experiments on different multimodal fusion methods and architectures are conducted, and the improvement is large and consistent, verifying its efficacy and generalizability.
- Compatibility: it can be integrated with other existing methods such as MLA and Resample.
- The paper is well organized.
Weaknesses
- For the issue of multimodal fusion balance, although most existing works focus primarily on two modalities, I still hope to see the performance of this approach on three or even more modalities. Of course, this would bring new challenges to the design of the reassembly strategy, but it could further demonstrate the generality of the proposed method. Adding this part in the future would further enhance the contribution of the proposed method.
- In Section 4.4, the authors report consistent improvement on different fusion architectures. However, I notice that the improvement varies and is smaller on more complex fusion architectures. I speculate that the reason for this phenomenon lies in the fact that these more complex fusion models inherently include some adaptive adjustment strategies for feature selection. For example, MMTM recalibrates feature channels from different streams, and CentralNet simultaneously considers the contributions of individual modality features and fused features in decision-making. The authors should include a deeper analysis of this variation in improvement magnitude in the paper. This would better reveal the fundamental causes of modality imbalance and provide a basis for understanding the effectiveness of various strategies based on their contributions.
Other Comments or Suggestions
It is recommended to add more descriptions to Fig. 1 and Fig. 2 to make them easier to follow. For example, in Fig. 1, the motivation for the middle column (masking the strong modality) could be added. In Fig. 2(c), the batch-level reassembly should be presented more clearly.
Weakness1: The applicability of the Remix method on three or even more modalities.
Response:
- Our method is not limited by the number of modalities. As the number of modalities increases, our method remains applicable by simply retaining the modality with the lowest KL divergence during the decoupling process. The selection mechanism remains valid and ensures that our method remains effective in more complex scenarios involving three or more modalities.
- To further demonstrate the broad effectiveness of Remix, we conduct additional experiments on CMU-MOSEI with three modalities (text, vision, audio). As shown in the table, Remix consistently improves performance, further validating its wide applicability.
| Dataset | Baseline | GBlend | OGM | PMR | Resample | MLA | Remix |
|---|---|---|---|---|---|---|---|
| CMU-MOSEI | 83.32 | 84.45 | 85.03 | 84.13 | 84.50 | 82.84 | 85.89 |
Weakness2: Analysis of modality imbalance in fusion-based decision-making.
Response:
- We conducted a further analysis of feature fusion during model decision-making. We observe that modality imbalance is not limited to unimodal encoders but also manifests significantly at the feature fusion layer. Taking the concatenation method as an example, both the audio and video modality outputs are 512-dimensional, resulting in a 1024-dimensional fused feature. We compute the weighting distribution of each modality, and the results are presented in the table below.
| Dataset | Baseline Audio | Baseline Video | Baseline Ratio (A/V) | Remix Audio | Remix Video | Remix Ratio (A/V) |
|---|---|---|---|---|---|---|
| CREMAD | 0.0258 | 0.0179 | 1.441 | 0.0215 | 0.0199 | 1.080 |
| KineticSound | 0.0314 | 0.0219 | 1.433 | 0.0287 | 0.0245 | 1.171 |
- Our findings indicate that the Remix method not only promotes modality balance at the unimodal encoder level but also enhances balance at the fusion layer (a sketch of how these per-modality weights can be computed follows at the end of this response).
- For more complex models (MMTM or CentralNet), the improvements brought by Remix are relatively smaller. We attribute this to the fact that early-stage interactions may be somewhat suppressed due to our decoupling process. However, this trade-off is offset by the enhanced unimodal capabilities, which ultimately contribute to overall performance improvements.
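A sketch of how such per-modality weights can be computed for a concatenation classifier (assuming a single linear fusion head over the 1024-dimensional concatenated feature with the audio slice first; the layout is an assumption):

```python
import torch.nn as nn

def fusion_weight_ratio(fusion_head: nn.Linear, dim_audio=512, dim_video=512):
    """Average absolute classifier weight over each modality's slice of the
    concatenated feature, plus the audio/video ratio reported in the table."""
    w = fusion_head.weight.detach().abs()   # (num_classes, dim_audio + dim_video)
    w_audio = w[:, :dim_audio].mean().item()
    w_video = w[:, dim_audio:dim_audio + dim_video].mean().item()
    return w_audio, w_video, w_audio / w_video
```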
Weakness3: Other metrics to evaluate the performance of the modalities to determine the decoupling results?
Response:
- We are inspired by the measurement of uncertainty in Active Learning when choosing KL divergence as a sample-level evaluation method and a criterion for selecting modality-specific training samples.
- We have provided training results using other feasible metrics for comparison, with results presented in the following table, and we summarize their weaknesses below.
- Entropy: In classification tasks, Entropy and KL divergence are mathematically equivalent.
- Loss: Loss tends to overly bias the data toward the weak modality, preventing the strong modality from training effectively.
- Shapley Value: Shapley values sometimes fail to distinguish between modalities, as noted in Footnote 1 of the main text.
| Dataset | Baseline | Loss | Shapley | KL Divergence |
|---|---|---|---|---|
| CREMAD(A+V) | 64.52 | 69.89 | 68.28 | 72.72 |
| KS(A+V) | 50.23 | 54.78 | 53.93 | 55.63 |
This paper mainly proposes to combat the issues of modality laziness and modality clash. Both issues arise when multimodal models prioritize learning from the strong modality and when batch gradients interfere across modalities. The authors propose the Data Remixing method to solve these problems. Specifically, sample-level decoupling of multimodal data is used to emphasize the learning of weak-modality samples, and batch-level reassembling of unimodal data is proposed to ensure each batch only contains data from a single modality. The authors conduct experiments on the CREMAD and Kinetics-Sounds datasets to demonstrate the effectiveness.
Questions for Authors
Please refer to the previous parts. My major concerns are that the method works against the purpose of multimodal learning by neglecting the learning of mutual information, together with its limited novelty and insufficient experiments.
Claims and Evidence
The modality laziness that this paper proposes to solve is a well-known issue of multimodal imbalance and worthy of research. In contrast, the proposed modality clash is somewhat less known and seems like a consequence or a phenomenon caused by modality imbalance rather than a separate issue as the authors claim. Thus the novelty of this paper may be over-claimed, since the main contribution of multimodal decoupling is merely a sample-wise refinement of previous methods, and the reassembling part is problematic in my opinion, as explained in the following Methods part.
Methods and Evaluation Criteria
My major concern is about the proposed method. While I am convinced that the proposed Data Remixing can solve the problem of modality laziness, such a strategy arguably violates the principles of multimodal learning. I am mostly fine with the decoupling method, as it explicitly discriminates samples of the weak modality. However, the selected data are then reassembled into unimodal-form batches, where the multimodal model independently learns from a single modality in each batch with the inputs of other modalities masked with zeros. If there is no misunderstanding (supported by Fig. 1(c), "0 masked audio" in Fig. 2(c), and pseudo Algorithm 1), the cross-modal mutual information is never learned throughout the Data Remixing procedure. The method is therefore suspected of being a simple ensemble of unimodal models, where modality laziness of course does not exist. The authors only provide experiments on two audio-visual classification datasets, and such tasks can be accomplished with little dependency on cross-modal knowledge. I wonder if the authors can provide further experiments on more complicated understanding tasks such as cross-modal retrieval or reasoning, on multimodal scenarios with more mutual information such as image-text modalities, or provide more explanation of how Data Remixing learns such mutual information.
Theoretical Claims
The authors provide theoretical analysis of how Data Remixing solves the problem of modality clash in lines 220-250, which I consider trivial. Such details are recommended to be placed in the Appendix.
Experimental Design and Analyses
The authors only conduct experiments on two small-scale dual-modality datasets under classification tasks. More experiments on more popular datasets such as Conceptual Captions and YFCC in image-text scenarios, with diverse downstream tasks including retrieval, reasoning, VQA, and grounding, should be considered.
Supplementary Material
The authors do not write an appendix.
Relation to Prior Work
The considered issue of modality laziness is important; however, the novelty of the method is considered limited.
Missing Essential References
No extra references recommended.
Other Strengths and Weaknesses
The legend and footnotes of Fig. 1 and Fig. 2 are too small to read. The writing of the method section could also be improved.
Other Comments or Suggestions
Please refer to previous parts.
Weakness1: Modality Clash and Modality Imbalance
Response:
- In summary, Modality Imbalance is unidirectional, while Modality Clash is bidirectional. Modality Imbalance refers to a scenario where a strong modality dictates the learning process, preventing other modalities from being sufficiently trained. Modality Clash describes interference between modalities. Even if modality balance is achieved, differences between modalities may still lead to insufficient learning across all modalities.
- The unimodal accuracies presented in the following table demonstrate that all modalities benefit from our approach, further validating its effectiveness.
| Dataset | Baseline | Remix | Dataset | Baseline | Remix |
|---|---|---|---|---|---|
| CREMAD(A+V) | 64.52 | 72.72 | KS(A+V) | 50.23 | 55.63 |
| CREMAD(V) | 41.67 | 53.63 | KS(V) | 29.03 | 42.68 |
| CREMAD(A) | 53.76 | 54.57 | KS(A) | 40.32 | 44.06 |
Weakness2: The Mutual Information in Multimodal Tasks.
Response: We understand the reviewer's concerns regarding the potential decrease in MI due to alternating unimodal training. This issue is carefully considered in our approach.
- First and foremost, it is important to clarify that our task is multimodal co-decision, which involves integrating multiple modalities to make a final decision—abstractly represented as A + B → C. The goal is to perform cross-modal feature selection and integration based on the learning outcomes of individual modalities. This differs from tasks such as retrieval, where decision-making relies on capturing shared information across modalities—abstractly represented as A → B. These tasks require highlighting consistency information through similarity measurements, which inherently mitigates modality imbalance issues present in co-decision tasks.
- For these reasons, existing co-decision network architectures are inherently weakly interactive. For example, they often rely on direct concatenation of unimodal representations or gated fusion at the decision level. As a result, our data decomposition does not significantly impact interaction, since these models already operate with minimal cross-modal dependency. We support this perspective by measuring the MI between modality-specific features before and after applying our method (one possible estimator is sketched at the end of this response).
| Dataset | Baseline | Remix |
|---|---|---|
| CREMAD-MI | 0.078 | 0.077 |
| KineticSound-MI | 0.016 | 0.013 |
- Therefore, in co-decision tasks, the performance bottleneck lies in modality imbalance and modality clash, which hinder effective unimodal learning and limit cross-modal synergy. Our work is specifically designed to address this issue by ensuring that each modality is adequately learned before integration, ultimately improving overall model performance.
- Additionally, we provide extra data to further support our argument. We construct a confusion matrix analyzing the relationship between model decision correctness and the presence of correctly predicted modalities. The results can be found at: https://anonymous.4open.science/r/ICM-Rebuttal-17C8/Fig1.png. By analyzing it, we find that in co-decision tasks, it is common for a model to have at least one modality predict correctly while the final decision is incorrect. However, after applying the Remix method, this type of misjudgment is significantly reduced. This suggests that Remix can guide samples to the appropriate modality space and improve decision-making accuracy.
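Since the MI estimator is not specified above, the following is only one plausible sketch: project each modality's features onto their first principal direction and estimate MI with a simple 2-D histogram (all choices here are assumptions, not necessarily the procedure used for the table).

```python
import numpy as np

def first_pc(feats):
    """Scalar projection of (N, D) features onto their first principal direction."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def modality_mi(feat_audio, feat_video, bins=20):
    """Histogram estimate (in nats) of MI between the two scalar projections."""
    x, y = first_pc(feat_audio), first_pc(feat_video)
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())
```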
Weakness3: The method and experiments are limited.
Response:
- Our method is not restricted to specific modalities. In our theoretical analysis, we make no prior assumptions about modality properties, ensuring its general applicability. Meanwhile, the key steps of Remix, decoupling and reassembling, are modality-agnostic. The process only considers the accuracy relationship between modality pairs at the sample level without imposing any constraints on the modality type.
- Our method is not limited by the number of modalities. As the number of modalities increases, our method remains applicable by simply retaining the modality with the lowest KL divergence during the decoupling process. The selection mechanism remains valid and ensures that our method remains effective in more complex scenarios involving three or more modalities.
- To further demonstrate the broad effectiveness of Remix, we conduct additional experiments on UCF101 with two modalities (optical flow, vision) and CMU-MOSEI with three modalities (text, vision, audio). As shown in the table, Remix consistently improves performance, further validating its wide applicability.
| Dataset | Baseline | GBlend | OGM | PMR | Resample | MLA | Remix |
|---|---|---|---|---|---|---|---|
| UCF101 | 80.78 | 82.82 | 82.55 | 81.87 | 84.09 | 83.03 | 84.59 |
| CMU-MOSEI | 83.32 | 84.45 | 85.03 | 84.13 | 84.50 | 82.84 | 85.89 |
Thank you for your additional explanation and experiments. The experiments in Weakness3 resolve my concerns about limited diversity and number of modality in experiment settings.
Regarding the mutual information part, the authors state that the proposed method aims at co-decision while mutual information is of less concern, which I do not fully agree with. The application of the method may be limited to simple classification tasks, and I do not see a clear difference between such a method and a weighted combination of predictions from unimodally trained encoders. However, it seems that the other reviewers do not share similar concerns. As a result, I hold my concerns while raising my score to 2, and I look forward to further replies from the authors and other reviewers.
Thank you for your prompt response. We fully understand and acknowledge that the multimodal fusion strategy appears intuitively central to performance gains. However, our key insight—aligned with recent Balanced Multimodal Learning (BML) studies—reveals another important bottleneck: insufficient single-modality learning due to modality clash. Prioritizing modality-specific maturity creates the groundwork for collaborative gains. Our motivation is to address this foundational challenge.
Q1: Difference between Remix and weighted combination of unimodal-trained encoders.
Response: Remix can be distinguished from it in 3 key dimensions: model performance, modality balance, and learning efficiency (results based on CREMAD).
- Model Performance: We train two unimodal models independently and then incorporate them as pretrained branches into the multimodal model. During joint training, the model learns combination weights to fuse their outputs, implementing the weighted combination of unimodal-trained encoders; we mark it as Pretrain.
| Method | Baseline | Pretrain | Remix |
|---|---|---|---|
| Multi | 64.52% | 65.73% | 72.72% |
| Video | 41.67% | 48.92% | 53.63% |
| Audio | 53.76% | 55.78% | 54.57% |
From the results, we observe that pretrained unimodal models do improve the unimodal performances, but the multimodal performance remains limited. In contrast, Remix delivers a more significant improvement in overall performance. We attribute this to Remix's improved modality balance at the fusion layer.
- Modality Balance: We conduct a further analysis of feature fusion and observe that modality imbalance also exists at the fusion layer. Taking concatenation as an example, both modalities' outputs are 512-dimensional, resulting in a 1024-dimensional fused feature. We compute the average absolute weights of each modality and their ratios.
| | Baseline | Pretrain | Remix |
|---|---|---|---|
| Audio | 0.0258 | 0.0220 | 0.0215 |
| Video | 0.0179 | 0.0173 | 0.0199 |
| Ratio (A/V) | 1.441 | 1.272 | 1.080 |
The results indicate that when using pretrained unimodal models, modality imbalance still persists at the fusion layer, while Remix introduces modality-specific training in the multimodal model, which promotes modality balance at both the unimodal encoders and the fusion stage. This fundamental difference highlights why Remix outperforms the weighted combination of unimodal-trained encoders, offering a more balanced approach to multimodal learning.
- Learning Efficiency: Since the pretraining-based method requires training unimodal models separately and fine-tuning in multimodal tasks, it is inherently less efficient. To ensure a fair comparison, we measure the training time required for both methods to converge under the same experimental settings. The results show that Remix achieves significantly higher training efficiency.
| | Baseline | Remix | Pretrain |
|---|---|---|---|
| Time (sec) | 1536 | 2357 | 4869 |
Q2: Remix is limited to simple classification tasks.
Response:
- First, it is important to clarify that current research on BML has followed a consistent experimental paradigm. Starting from G-Blending (2021), through representative methods like OGM (2022) and AGM (2023), and up to the recent MLA (2024), all works focus on classification tasks. That's because classification provides the most direct and interpretable way to evaluate the representational capacity of modalities and the effectiveness of multimodal integration. In line with this established convention, we also center our experiments around classification to ensure meaningful and fair comparisons with existing methods.
- As we have mentioned, Remix is particularly well-suited for co-decision tasks. However, this does not imply a limitation to classification only, as Remix only requires evaluating unimodal learning performance at the sample level. We also evaluate Remix on video anomaly detection (VAD) and semantic segmentation, both showing improved performance.
- SHT is a benchmark for VAD. We extract RGB frames and optical flow and use AUC as the evaluation metric. For both modalities, we extract features from the previous 5 frames and then concatenate them for VAD (Baseline).
| Dataset | Method | Baseline | Remix |
|---|---|---|---|
| SHT | Multi | 0.617 | 0.641 |
| SHT | Flow | 0.472 | 0.493 |
| SHT | RGB | 0.589 | 0.604 |
- SUN-RGBD is a benchmark for semantic segmentation. We utilize RGB and Depth as two modalities and use IoU as the evaluation metric. For both modalities, we employ ResNet50 as encoders. The extracted features are concatenated at the final layer and subsequently fed into a decoder to generate the final output (Baseline).
| Dataset | Method | Baseline | Remix |
|---|---|---|---|
| SUN-RGBD | Multi | 0.451 | 0.467 |
| SUN-RGBD | RGB | 0.402 | 0.424 |
| SUN-RGBD | Depth | 0.297 | 0.312 |
Thank you for your thoughtful questions, which are invaluable in helping us refine our contributions.
The paper tackles the problem of multimodal learning and specifically how to make all the modalities to contribute equally to the training objectives. The authors suggest multiple steps to alleviate the issue, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference. The authors demonstrate the effectiveness of their method on CREMAD and Kinetic-Sound dataset.
Questions for Authors
None
Claims and Evidence
The authors claim to solve the problems of modality laziness and modality clash when jointly training multimodal models. They provide experimental evidence comparing to other methods as well as an ablation study. My major concern is that the authors do not demonstrate the effectiveness of their technique for MLLM training, where adding modalities to language without degrading language performance is hard.
Methods and Evaluation Criteria
The proposed method and evaluation criteria make sense, but they are limited to audio and vision modalities only.
Theoretical Claims
There are no theoretical claims; the method consists of several heuristic/greedy steps. First, the complete dataset with multimodal inputs is used for warm-up training to ensure the model has basic representational capability. Then, the model is optimized through alternating steps in which the multimodal inputs are decoupled based on the KL divergence of the unimodal prediction probabilities and the data are reassembled at the batch level according to the remaining modality. Afterwards, specific training for each modality is performed using the reassembled dataset.
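A high-level sketch of that alternating procedure (the data loader, the model interface returning fused logits first, and the `split_by_weakest_modality` helper are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def train_remix(model, loader, optimizer, warmup_epochs=5, remix_epochs=50):
    # 1) Warm-up on complete multimodal batches so unimodal predictions are meaningful.
    for _ in range(warmup_epochs):
        for audio, video, labels in loader:
            optimizer.zero_grad()
            F.cross_entropy(model(audio, video)[0], labels).backward()
            optimizer.step()

    for _ in range(remix_epochs):
        # 2) Decouple: route each sample to its worst-learned modality (lowest KL from uniform).
        subsets = split_by_weakest_modality(model, loader)   # hypothetical helper -> {modality: loader}
        # 3) Reassemble: each batch carries one modality; the other input is zero-masked.
        for modality, subset_loader in subsets.items():
            for audio, video, labels in subset_loader:
                if modality == "audio":
                    video = torch.zeros_like(video)
                else:
                    audio = torch.zeros_like(audio)
                optimizer.zero_grad()
                F.cross_entropy(model(audio, video)[0], labels).backward()
                optimizer.step()
```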
Experimental Design and Analyses
Yes, the experimental design and analyses make sense; however, they are limited to certain modalities.
Supplementary Material
No supplementary analysis is provided.
Relation to Prior Work
The authors provide a comprehensive overview of the related work.
Missing Essential References
The related work does not discuss the ImageBind paper, published by Meta AI, which introduced a model that learns a joint embedding space across six modalities – images, text, audio, depth, thermal, and IMU data, enabling cross-modal retrieval and other applications.
Other Strengths and Weaknesses
Overall, the paper is well written and easy to follow.
Other Comments or Suggestions
None
Weakness1: The method and experiments are limited to certain modalities.
Response:
- Our method is not restricted to specific modalities. In our theoretical analysis, we make no prior assumptions about modality properties, ensuring its general applicability. Meanwhile, the key steps of Remix, decoupling and reassembling, are modality-agnostic. The process only considers the accuracy relationship between modality pairs at the sample level without imposing any constraints on the modality type.
- Our method is not limited by the number of modalities. As the number of modalities increases, our method remains applicable by simply retaining the modality with the lowest KL divergence during the decoupling process. The selection mechanism remains valid and ensures that our method remains effective in more complex scenarios involving three or more modalities.
- To further demonstrate the broad effectiveness of Remix, we conduct additional experiments on UCF101 with two modalities (optical flow, vision) and CMU-MOSEI with three modalities (text, vision, audio). As shown in the table, Remix consistently improves performance, further validating its wide applicability.
| Dataset | Baseline | GBlend | OGM | PMR | Resample | MLA | Remix |
|---|---|---|---|---|---|---|---|
| UCF101 | 80.78 | 82.82 | 82.55 | 81.87 | 84.09 | 83.03 | 84.59 |
| CMU-MOSEI | 83.32 | 84.45 | 85.03 | 84.13 | 84.50 | 82.84 | 85.89 |
Weakness2: Effectiveness on Text modality and MLLMs.
Response:
- Multimodal Large Language Models (MLLMs) differ significantly from our task in terms of architecture and underlying principles. MLLMs typically use the language modality as the foundation, mapping other modalities onto it. This differs from the modality imbalance issue we address in co-decision tasks. That's also the main difference between our method and ImageBind.
- Additionally, we follow the existing model structures in the Balanced Multimodal Learning (BML) domain to ensure a fair comparison and demonstrate the effectiveness of our proposed method.
- Additionally, we have demonstrated the effectiveness of the text modality within our method. In the table above, we have supplemented our results with the CMU-MOSEI dataset, which includes text, video, and audio modalities. In the baseline model, the accuracy of the text modality was 79.96%, and after applying Remix, it improved to 81.29%. This result indicates that Remix is also beneficial for the language modality, further validating its effectiveness across different modality types.
Weakness3: There are no theoretical claims.
Response:
- In this paper, we introduce a novel phenomenon termed modality clash, which refers to the bidirectional interference between modalities in multimodal learning (as illustrated in Figure 3 of the main text, we quantify the optimization direction deviation of strong modalities; a sketch of one way to measure such deviation follows this response). This perspective differs from the traditional modality imbalance problem, which primarily emphasizes the one-way suppression of weaker modalities by stronger ones.
- Accordingly, our Remix method is designed to directly address this issue. To mitigate modality clash, we need to control the composition of each training batch, which is achieved through the assembling step. To select appropriate samples and enable batch control, we first perform decoupling of multimodal inputs based on sample-level evaluation.
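One simple way to quantify such optimization-direction deviation (a sketch, not necessarily the paper's exact metric; it assumes a model whose inputs can be zero-masked per modality and which returns the fused logits first):

```python
import torch
import torch.nn.functional as F

def modality_gradient_cosine(model, audio, video, labels):
    """Cosine similarity between parameter gradients induced by each modality alone
    on the same batch; values near -1 indicate conflicting update directions."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params, allow_unused=True)
        return torch.cat([torch.zeros_like(p).reshape(-1) if g is None else g.reshape(-1)
                          for g, p in zip(grads, params)])

    g_audio = flat_grad(F.cross_entropy(model(audio, torch.zeros_like(video))[0], labels))
    g_video = flat_grad(F.cross_entropy(model(torch.zeros_like(audio), video)[0], labels))
    return F.cosine_similarity(g_audio, g_video, dim=0).item()
```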
This paper addresses the common multimodal learning problems of "modality laziness" (over-reliance on one data type) and "modality clash" (conflicting learning signals). The authors propose a "Data Remixing" method, which involves decoupling data into single modalities, focusing on harder-to-learn samples for each, and then reassembling these into specialized batches. Experiments on the CREMAD and Kinetics-Sounds datasets demonstrate that this technique improves model performance by ensuring more balanced contributions from all modalities and mitigating interference during training.
The reviewers were generally positive about the submission and agreed that the experiments demonstrated the effectiveness of the proposed approach. There was still concern about the design principle of the reassembling step, with which the model never explicitly learns the mutual information or the relationships between the different data types. The authors are encouraged to add more clarification and experiments on this point in the next version.