PaperHub
Overall: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0) · Confidence: 3.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

CMoB: Modality Valuation via Causal Effect for Balanced Multimodal Learning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose CMoB, a novel causal-aware method that quantifies dynamic variations in sample-level modality contributions for balanced multimodal learning.

Abstract

Keywords
Multimodal Learning · Causal Effect · Modality Valuation

Reviews and Discussion

Review 1 (Rating: 4)

This paper introduces Causal Modality Evaluation (CMoB) to address modality imbalance in multimodal learning. CMoB builds a gain function on Shannon information theory, combines it with causal learning to measure fine-grained modality contributions within samples, and applies dynamic optimization to strengthen weak modalities.

Strengths and Weaknesses

Pros

  1. Combining causal learning with information theory to dynamically evaluate modality contributions at the sample level breaks through the limitations of traditional modality-level analysis.
  2. The benefit function is constructed based on Shannon information theory, and the causal-effect derivation is complete.
  3. Comparisons with more than 10 baseline methods on 5 cross-modal datasets show the best performance in all cases.

Cons

  1. The global nature of ITE is mentioned in the paper, but the consistency under different modality combinations is not sufficiently verified.
  2. Lacking ablation experiments to verify the effectiveness of each module.
  3. The specific implementation of the Mask(·) function in Equation 5 is not explicitly described.
  4. It is recommended to add efficiency tests on large-scale datasets and cross-modal number extension experiments.

Questions

Can the quantification of causal effects explain the synergistic/competitive relationships between modalities? For example, the complementarity of audio-video modalities in emotion recognition.

Limitations

yes

Justification for Final Rating

The authors' reply addresses most of my concerns. So I keep my initial score.

Formatting Concerns

no

Author Response

Q1: The global nature of ITE is mentioned in the paper, but the consistency under different modality combinations is not sufficiently verified.

R1: Thank you for your comment. The "global nature" means our individual treatment effect (ITE) evaluation applies to different modality combinations, but the ITE values for each modality vary across these combinations. We conducted experimental verification on four modality combinations, {T+A}, {T+V}, {V+A}, and {T+A+V}, using the three-modality CMU-MOSEI dataset. The Concat method is the baseline. The comparative experiments demonstrate the superiority and generalizability of our method, as shown in Table 1.

Table 1. Comparison of different modality combinations on the CMU-MOSEI dataset. The evaluation metric is accuracy (ACC).

| Dataset   | Method | T+A   | T+V   | V+A   | T+A+V |
|-----------|--------|-------|-------|-------|-------|
| CMU-MOSEI | Concat | 0.745 | 0.751 | 0.656 | 0.789 |
| CMU-MOSEI | Ours   | 0.796 | 0.773 | 0.701 | 0.812 |

Q2: Lacking ablation experiments to verify the effectiveness of each module.

R2: Thanks for your comment. We conducted an ablation study to demonstrate the effectiveness of the proposed method on the CREMA-D and KineticsSounds datasets; the results are shown in Table 2. Here, CQM denotes our proposed causal-aware modality contribution quantification method, and RE denotes the dynamic modality optimization strategy. The ablation study clearly demonstrates the critical contribution of the proposed modules to overall performance.

Table 2. Ablation analysis of the proposed modules. The evaluation metric is accuracy (ACC).

| Dataset        | CQM | RE | ACC   |
|----------------|-----|----|-------|
| CREMA-D        | ×   | ×  | 65.50 |
| CREMA-D        | ✓   | ×  | 76.42 |
| CREMA-D        | ✓   | ✓  | 79.75 |
| KineticsSounds | ×   | ×  | 65.63 |
| KineticsSounds | ✓   | ×  | 70.96 |
| KineticsSounds | ✓   | ✓  | 72.03 |

Q3: The specific implementation of the Mask(·) function in Equation 5 is not explicitly described.

R3: Thank you for your comment. We employed Time Masking for the audio modality and Spatial Masking for the video modality.
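For concreteness, here is a minimal PyTorch sketch of what these two masking operations could look like. The tensor shapes, the mask ratio, and the function names are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def time_mask(audio: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Zero out a contiguous span of time frames (Time Masking).

    Assumes audio features of shape (batch, time, feat); the 0.3 mask
    ratio is an illustrative choice, not the paper's setting.
    """
    masked = audio.clone()
    t = audio.size(1)
    span = max(1, int(t * ratio))
    start = int(torch.randint(0, t - span + 1, (1,)))
    masked[:, start:start + span, :] = 0.0
    return masked

def spatial_mask(video: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random rectangular patch in every frame (Spatial Masking).

    Assumes video of shape (batch, time, channels, height, width).
    """
    masked = video.clone()
    h, w = video.size(-2), video.size(-1)
    ph, pw = max(1, int(h * ratio)), max(1, int(w * ratio))
    top = int(torch.randint(0, h - ph + 1, (1,)))
    left = int(torch.randint(0, w - pw + 1, (1,)))
    masked[..., top:top + ph, left:left + pw] = 0.0
    return masked
```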

Q4: It is recommended to add efficiency tests on large-scale datasets and cross-modal number extension experiments.

R4: Following your suggestion, we conducted an experiment on the large-scale VGGSound dataset, which consists of both audio and video modalities. VGGSound contains 310 classes covering a wide range of everyday audio events; it includes 168,618 videos for training and validation and 13,954 videos for testing. The experimental results show the superiority of our method, as shown in Table 3.

Table 3. Comparison with different methods on the VGGSound dataset.

| Method     | mAP    | ACC    |
|------------|--------|--------|
| AGM        | 51.98% | 47.11% |
| MLA        | 54.73% | 51.65% |
| ReconBoost | 53.87% | 50.97% |
| MMPareto   | 54.74% | 51.25% |
| Ours       | 54.98% | 51.74% |

In the manuscript we have already submitted, we conducted experiments on the more-than-two-modality datasets CMU-MOSEI and NVGesture. To further validate the effectiveness of our method under scenarios with more modalities, we follow [1] and conduct further experiments on the Caltech101-20 dataset. We compare our method with the Concat method and the Shapley value method. The experimental results demonstrate the effectiveness of our method, as shown in Table 4.

Table 4. Accuracy of our method on the Caltech101-20 dataset.

| Num. of modalities | Concat | Shapley | Ours  |
|--------------------|--------|---------|-------|
| 2                  | 82.91  | 83.47   | 83.71 |
| 3                  | 87.71  | 87.99   | 88.22 |
| 4                  | 93.64  | 94.07   | 94.35 |
| 5                  | 94.63  | 94.73   | 94.86 |

Q5: Can the quantification of causal effects explain the synergistic/competitive relationships between modalities? For example, the complementarity of audio-video modalities in emotion recognition.

R5: Thank you for your question. We can indeed explain synergistic/competitive relationships between modalities by quantifying the causal effect of each modality. We treat each modality as a treatment variable and the emotion label as the outcome. Using our defined benefit function B(M), we compute the causal effect of each modality. For example, when processing audio and video modalities in emotion recognition:

Synergistic: During iterative training, when a single modality cannot predict the emotion accurately but adding the other modality achieves a correct prediction, we have
$$\mathrm{ITE}(\text{audio/video}) = B\bigl(\hat{F}(S(x))\bigr) - B\Bigl(\hat{F}\bigl(S(x) \mid do(t = x^{(\text{video/audio})})\bigr)\Bigr) = B(\text{multimodal}) - B(\text{video/audio}) = 2,$$
indicating a synergistic relationship where audio and video are complementary.

Competitive: During iterative training, if the audio modality alone successfully predicts the emotion label, yet adding the video modality yields an accurate prediction with lower confidence than audio alone, then
$$\mathrm{ITE}(\text{video}) = B\bigl(\hat{F}(S(x))\bigr) - B\Bigl(\hat{F}\bigl(S(x) \mid do(t = x^{(\text{audio})})\bigr)\Bigr) = -1,$$
and the relationship between the two modalities is competitive.

When the causal effect of a modality is less than 1, it negatively impacts emotion prediction; if it is greater than 1, the impact is positive.
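To make the sign convention above concrete, the following is a hedged per-sample sketch. The names `benefit` and `video_ite`, the model interface, and the masking intervention are hypothetical stand-ins for the paper's B(·), F̂(S(x)), and Mask(·); the softmax-confidence benefit only approximates the entropy-inspired B(M).

```python
import torch
import torch.nn.functional as F

def benefit(logits: torch.Tensor, label: int) -> float:
    # Illustrative stand-in for the paper's entropy-inspired B(M):
    # the softmax confidence assigned to the true emotion label.
    return F.softmax(logits, dim=-1)[label].item()

def video_ite(model, x_audio, x_video, label, mask_fn) -> float:
    # ITE of the video modality for one sample: benefit with both
    # modalities minus benefit under the do(t = x_audio) intervention,
    # here approximated by masking out the video input.
    with torch.no_grad():
        full = benefit(model(x_audio, x_video), label)
        intervened = benefit(model(x_audio, mask_fn(x_video)), label)
    # full - intervened > 0: video adds information (synergy);
    # full - intervened < 0: video hurts the prediction (competition).
    return full - intervened
```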

[1] Wei Y, Feng R, Wang Z, et al. Enhancing multimodal cooperation via sample-level modality valuation[C]. Proceedings of the Conference on Computer Vision and Pattern Recognition, 2024: 27338-27347.

Comment

Thank the authors for the detailed reply! It addresses most of my concerns. I will keep my initial score.

Review 2 (Rating: 4)

This paper proposes the CMoB framework to address the modality imbalance problem in multimodal learning: it first uses an entropy-inspired sample benefit function to measure the confidence gain of each sample after adding different modalities, then treats each modality as a "treatment" to calculate the individual treatment effect (ITE), and finally dynamically masks and enhances weak modalities. Experiments on five datasets show that CMoB generally outperforms or approaches the latest balancing methods in terms of accuracy/F1, and its improvement of weak modalities is verified by t-SNE and Grad-CAM analysis.

Strengths and Weaknesses

Strengths:

  • This work is the first to incorporate causal Individual Treatment Effect (ITE) estimation into sample-level modality valuation, enabling a fine-grained quantification of each modality’s contribution. The introduction of causal inference (ITE estimation) into the modality rebalancing task is novel and insightful. It enriches the modality contribution evaluation beyond simple gradient or attention-based metrics.
  • The shift from global, modality-level contribution estimation to per-sample analysis allows CMoB to dynamically capture fine-grained imbalance and perform more targeted optimizations.
  • Extensive experiments across multiple benchmarks substantiate the effectiveness and robustness of the proposed approach.

Weaknesses:

  • Per-sample ablation and dynamic masking are likely to incur substantial additional training time and computational overhead.
  • The paper lacks an ablation study that disentangles the respective contributions of the benefit function, the ITE computation, and the dynamic masking mechanism.
  • It is not recommended to share code via GitHub during the review process, as it may compromise the double-blind review principle. Instead, it is advisable to use an anonymous code-sharing platform such as https://anonymous.4open.science/.
  • Typo: "modality-spcifice" should be "modality-specific" in line 157.

Questions

  • How do the benefit‐function values and the final performance figures change if temperature-scaled logits or negative log-likelihood are used instead of the default soft-max confidence?
  • For each benchmark, what is the percentage increase in wall-clock training time and peak GPU memory consumption relative to the strongest baseline?
  • Please report an ablation study that quantifies the individual gains contributed by each module of the proposed framework.

Limitations

yes

Justification for Final Rating

The authors’ response has satisfactorily addressed my concerns. I keep my positive score.

Formatting Concerns

none.

Author Response

Q1: How do the final performance figures change if temperature-scaled logits or negative log-likelihood are used instead of the default soft-max confidence?

R1: Following your suggestion, we replaced the default softmax confidence with temperature-scaled logits, setting the temperature parameter to T=2 and T=0.5. The evaluation metric is accuracy (ACC). The experimental results demonstrate that using temperature-scaled logits (T=2) is helpful for the target task. In our method we use cross-entropy loss, which is essentially softmax + negative log-likelihood. Our method (CMoB) is not restricted to cross-entropy loss and softmax confidence; we mainly use them for a fair comparison with other methods. A sketch of the temperature-scaled confidence appears after the table below.

| Method                                    | CREMA-D | KineticsSounds |
|-------------------------------------------|---------|----------------|
| CMoB (softmax)                            | 79.75   | 72.03          |
| CMoB (temperature-scaled logits, T = 2)   | 79.92   | 72.41          |
| CMoB (temperature-scaled logits, T = 0.5) | 79.04   | 71.88          |
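For reference, a minimal sketch of the temperature-scaled confidence compared in the table above; the function name and interface are assumptions, only the scaling itself is standard.

```python
import torch
import torch.nn.functional as F

def confidence(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Max class probability after temperature scaling.

    T = 1 recovers the default softmax confidence; T > 1 softens the
    distribution and T < 1 sharpens it.
    """
    probs = F.softmax(logits / temperature, dim=-1)
    return probs.max(dim=-1).values

# The same logits scored at the three settings reported above.
logits = torch.tensor([[2.0, 1.0, 0.2]])
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: confidence={confidence(logits, t).item():.3f}")
```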

Q2: For each benchmark, what is the percentage increase in wall-clock training time and peak GPU memory consumption relative to the strongest baseline?

R2: Thanks for your comment. Our modality selection and dynamic masking incur higher computational overhead. We compare training time and computational overhead against the best-performing baseline: MLA is the strongest baseline on the CREMA-D dataset, and MMPareto is the strongest baseline on the KineticsSounds dataset. Preliminary results show that our method increases GPU memory consumption by approximately 20% to 35%. We consider this overhead acceptable given the improvement in accuracy, as shown in the table below. Compared with methods that use data processing strategies (e.g., Shapley value) to address modality imbalance, our method achieves lower computational overhead and reduced training time.

| Method        | Training time (CREMA-D) | GPU memory (CREMA-D) | ACC (CREMA-D) | Training time (CMU-MOSEI) | GPU memory (CMU-MOSEI) | ACC (CMU-MOSEI) |
|---------------|-------------------------|----------------------|---------------|---------------------------|------------------------|-----------------|
| MMPareto      | 4h35min                 | 10672 MiB            | 76.87         | 7h29min                   | 15598 MiB              | 81.18           |
| MLA           | 4h14min                 | 9295 MiB             | 79.43         | 6h53min                   | 13841 MiB              | 78.65           |
| Shapley value | 5h52min                 | 13514 MiB            | 77.82         | 9h41min                   | 24384 MiB              | 79.87           |
| CMoB          | 5h37min                 | 13014 MiB            | 79.75         | 8h34min                   | 19030 MiB              | 81.24           |

Q3: Please report an ablation study that quantifies the individual gains contributed by each module of the proposed framework.

R3: Thanks for your comment. We conducted an ablation study to demonstrate the effectiveness of the proposed method on the CREMA-D and KineticsSounds datasets; the results are shown in the table below. Here, CQM denotes our proposed causal-aware modality contribution quantification method, and RE denotes the dynamic modality optimization strategy. The ablation study clearly demonstrates the critical contribution of the proposed modules to overall performance.

| Dataset        | CQM | RE | ACC   |
|----------------|-----|----|-------|
| CREMA-D        | ×   | ×  | 65.50 |
| CREMA-D        | ✓   | ×  | 76.42 |
| CREMA-D        | ✓   | ✓  | 79.75 |
| KineticsSounds | ×   | ×  | 65.63 |
| KineticsSounds | ✓   | ×  | 70.96 |
| KineticsSounds | ✓   | ✓  | 72.03 |
Comment

I thank the authors for their rebuttal. My concerns have been resolved, and I maintain my positive score.

Review 3 (Rating: 4)

The paper introduces a causal-aware modality valuation approach (CMoB) for addressing the problem of modality imbalance in multimodal learning. This method allows for the balancing of modality contributions at the sample level, helping to improve the performance of weak modalities while mitigating modality imbalance. The paper demonstrates the effectiveness of CMoB through experiments on several multimodal datasets.

Strengths and Weaknesses

Strengths:

  1. The method focuses on an important issue in multimodal learning, modality imbalance, and provides a potential solution to improve the integration of diverse modalities in complex datasets.
  2. The paper validates its approach with experiments on multiple multimodal datasets, showcasing the effectiveness of the proposed method.

Weaknesses:

  1. The motivation behind the paper is not clear. The authors mention being inspired by the human nervous system but then shift to a causal learning perspective without clearly explaining the connection between these two concepts. The introduction section lacks logical coherence in presenting the proposed approach.
  2. The proposed method claims to solve modality imbalance from a causal learning perspective. However, the actual operation involves comparing the effects of including or excluding a modality, which is conceptually similar to existing methods like Shapley value. What are advantages of this approach over Shapley value or other similar techniques?
  3. The paper's writing style needs improvement, especially in the introduction, which spends excessive time discussing related work but fails to clearly explain the motivation behind the proposed method. In addition, Section 3.1 is too detailed and spends considerable space describing concepts like cross-entropy, which do not directly contribute to the core argument of the paper. This section could be more concise and focused.
  4. In Section 4.3, the authors claim to replicate the experiment from [45] on the "scarcely informative modality" case but using the original CREMA-D dataset. However, [45] created such a case by adding significant noise to the audio modality of the CREMA-D dataset. The original CREMA-D dataset does not inherently contain "scarcely informative modality" cases.
  5. The quality of the images in the paper, particularly Figures 2 and 3, is poor. The legends in these figures are difficult to read, which diminishes the clarity of the data presented.

Questions

Please refer to the above section.

Limitations

Please refer to the above section.

Justification for Final Rating

After all the authors' responses, many of my concerns are addressed. Hence, I raise my score.

Formatting Concerns

No formatting concerns.

Author Response

Q1: The authors mention being inspired by the human nervous system but then shift to a causal learning perspective without clearly explaining the connection between these two concepts.

R1: Thanks for your comment. Analyzing the human cognitive system's processing of multimodal data, we observe that it first extracts discriminative causal features from raw multimodal inputs via experientially guided mechanisms, and then uses information entropy to dynamically quantify causal contributions across modalities; this process corresponds to the principle of "causal effect" in causal learning. Therefore, our framework simulates the cognitive system's dynamic evaluation of modality contribution by quantifying causal effects at the sample level.

Q2: What are the advantages of this approach over Shapley value or other similar techniques?

R2: Thanks for your comment. The Shapley value method inevitably leads to exponentially longer training times and heavier computational burdens. Our method (CMoB) reduces this burden substantially while providing theoretical guarantees through causal intervention principles. For example, when evaluating the contribution of modality $M^3$ in a dataset containing three modalities $\{M^1, M^2, M^3\}$, the Shapley value method must evaluate the four subsets $\{\emptyset, M^1, M^2, M^1{+}M^2\}$, whereas our causal learning framework only evaluates the $\{M^1{+}M^2\}$ coalition. The peak GPU memory consumption on the CMU-MOSEI dataset was 19030 MiB for CMoB and 24384 MiB for the Shapley value method. Comparison experiments against the Shapley value baseline [1] further demonstrate the superiority of our method, as shown in the table below; a toy counting sketch follows the table.

Table 1. Comparison with Shapley value algorithms. The evaluation metric is accuracy (ACC).

| Method        | CREMA-D | KineticsSounds | UCF-101 | CMU-MOSEI | NVGesture |
|---------------|---------|----------------|---------|-----------|-----------|
| Concat        | 65.50   | 65.63          | 81.80   | 78.99     | 81.33     |
| Shapley value | 77.82   | 68.01          | 85.25   | 79.87     | 82.87     |
| Ours          | 79.75   | 72.03          | 86.82   | 81.24     | 84.06     |
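A toy sketch of the counting argument above; the function name and set representation are illustrative. Exact Shapley valuation of one modality enumerates every subset of the remaining n-1 modalities (2^(n-1) coalitions), while the causal intervention evaluates only the full coalition.

```python
from itertools import combinations

def shapley_coalitions(modalities, target):
    # All subsets of the remaining modalities that an exact Shapley
    # valuation of `target` must evaluate: 2^(n-1) coalitions.
    others = [m for m in modalities if m != target]
    return [c for k in range(len(others) + 1)
            for c in combinations(others, k)]

print(shapley_coalitions(["M1", "M2", "M3"], "M3"))
# [(), ('M1',), ('M2',), ('M1', 'M2')] -> the four subsets named in the
# rebuttal; the causal-intervention estimate evaluates only the full
# coalition {M1 + M2}.
```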

Q3: The paper's writing style needs improvement, especially in the introduction, which spends excessive time discussing related work but fails to clearly explain the motivation behind the proposed method. In addition, Section 3.1 is too detailed and spends considerable space describing concepts like cross-entropy, which do not directly contribute to the core argument of the paper. This section could be more concise and focused.

R3: Thanks for your question. To address the modality imbalance problem, current methods primarily rely on gradient variations to measure modality contribution. These modality-level contribution assessments measure gaps in representational capability and enhance poorly learned modalities, but they overlook the dynamic variations of modality contributions across individual samples. Although Shapley value-based methods can measure modality contributions at the sample level, they cause a sharp increase in computational overhead and training time on datasets with more than two modalities. Inspired by human cognitive science, we propose a causal-aware modality contribution quantification method to capture fine-grained changes in modality contributions within samples, and we dynamically select and optimize modalities based on real-time changes in their contributions.

In addition, Section 3.1 primarily formalizes the modality imbalance problem. We will simplify Section 3.1 and emphasize only the components essential to understanding modality imbalance.

Q4: In Section 4.3, the authors claim to replicate the experiment from [45] on the "scarcely informative modality" case but using the original CREMA-D dataset. However, [45] created such a case by adding significant noise to the audio modality of the CREMA-D dataset. The original CREMA-D dataset does not inherently contain "scarcely informative modality" cases.

R4: Thanks for your comment. We may not have expressed this clearly. We modify the audio data of the CREMA-D dataset, adding extra white Gaussian noise to make it noisier and scarcely discriminative. In this section, we conduct comparative experiments using the processed CREMA-D dataset.
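For reference, a minimal sketch of this kind of corruption; the SNR-based interface and its default are assumptions, since the rebuttal does not specify the noise level used for the processed CREMA-D data.

```python
import torch

def add_white_gaussian_noise(audio: torch.Tensor, snr_db: float = 0.0) -> torch.Tensor:
    """Corrupt an audio signal with white Gaussian noise at a target SNR.

    Lower snr_db means heavier corruption; at 0 dB the noise power
    matches the signal power, making the audio scarcely discriminative.
    """
    signal_power = audio.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(audio) * noise_power.sqrt()
    return audio + noise
```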

Q5: The quality of the images in the paper, particularly Figures 2 and 3, is poor. The legends in these figures are difficult to read, which diminishes the clarity of the data presented.

R5: Thanks for your comment. We acknowledge that the current presentation does not adequately convey critical details; the legends are difficult to decipher because the annotation elements were not scaled appropriately. In the next version, we will use a higher resolution (600 dpi), set the font size of all annotations to at least 10 pt, and increase the line thickness by 50% in Figures 2 and 3.

[1] Wei Y, Feng R, Wang Z, et al. Enhancing multimodal cooperation via sample-level modality valuation[C]. Proceedings of the Conference on Computer Vision and Pattern Recognition, 2024: 27338-27347.

Comment

Thank you for the authors' response, which has addressed some of my concerns. However, as per the current version, I am still unconvinced of the correlation between the proposed method and cognitive science. It seems unnecessary to introduce cognitive science here, as it only adds confusion.

One more question: it seems that there is no accuracy comparison reported in the experiments for the scarcely informative modality case.

I prefer to keep my original score, and strongly recommend refining the paper thoroughly, particularly the motivation and method descriptions.

Comment

Q1: It seems that there is no accuracy comparison reported in the experiments for the scarcely informative modality case.

R1: We sincerely thank you for your suggestion. We conducted a comparison experiment to demonstrate the effectiveness of the proposed method on the processed CREMA-D dataset (scarcely informative modality case). The experimental results demonstrate the effectiveness of our method, as shown in Table 1.

Table 1. Comparison with imbalanced multimodal learning methods on the processed CREMA-D dataset (scarcely informative modality case).

| Method     | ACC   | F1    |
|------------|-------|-------|
| OGM        | 62.63 | 65.07 |
| Greedy     | 63.17 | 63.83 |
| PMR        | 65.73 | 65.33 |
| AGM        | 62.87 | 63.73 |
| Relearning | 67.49 | 68.28 |
| MLA        | 68.34 | 69.81 |
| Ours       | 68.75 | 69.88 |

Q2: The correlation between the proposed method and cognitive science.

R2: The focus of our manuscript is the challenge of modality imbalance in multimodal learning. Existing research in cognitive science [1-4] demonstrates that, as the number of modalities in a sample space increases, the human brain can dynamically assess and extract discriminative feature information from heterogeneous multimodal data, continuously refining cognitive processes through adaptive learning. This aligns with the two key properties summarized in our manuscript: 1) modality contribution valuation, and 2) granular adjustment at the sample level. Our research aims to establish a multimodal learning framework that computationally formalizes this cognitive process, thereby mitigating the modality imbalance problem.

We reiterate our appreciation for your question regarding "the connection between cognitive science and our proposed method." This inquiry has deepened our reflection on the rationale for the correlation between the two. We will articulate their relationship more explicitly in the revised version.

Should our interpretation not fully address your concerns, we remain open to further scholarly discourse to contribute to this field of study.

[1] Silveira I, Varandas R, Gamboa H. Cognitive lab: A dataset of biosignals and HCI features for cognitive process investigation[J]. Computer Methods and Programs in Biomedicine, 2025, 269: 108863.

[2] Shen X, Hu X, Zhang R, et al. A Lightweight Triple-Modal Fusion Network for Progressive Mild Cognitive Impairment Prediction in Alzheimer's Disease[J]. Frontiers in Neuroscience, 2025, 19: 1637291.

[3] Occhipinti A, Verma S, Doan T A C. Mechanism-aware and multimodal AI: beyond model-agnostic interpretation[J]. Trends in Cell Biology, 2024, 34(2): 85-89.

[4] Xuelong L. Multi-Modal Cognitive Computing[J]. SCIENTIA SINICA Informationis, 2022, 53(1): 1-32.

Review 4 (Rating: 4)

The authors discuss the limitations of existing modality rebalancing methods and point out that they neglect the dynamic variations of modality contributions at the sample level during training. They propose a causal-aware modality valuation approach for balanced multimodal learning. An intervention method is introduced to evaluate the causal effect, quantifying changes in modality contributions at the sample level. This fine-grained evaluation enables targeted optimizations across modalities at the sample level, effectively mitigating the issue of multimodal imbalance. Experimental results on multiple datasets show the effectiveness of the proposed algorithm.

Strengths and Weaknesses

Strengths: The authors propose a causal-aware modality valuation method to evaluate the sample-level modality contribution in multimodal training. The authors design an optimization strategy for modality selection at the sample-level according to the contribution degree of each modality in order to mitigate the modality imbalance problem. The authors validate the effectiveness of the proposed method by comparison and ablation experiments on publicly available data.

Weaknesses:

  1. Some important references are missing. Some related works about sample-level modality valuation and balanced learning, such as PDF (Cao, et al., ICML 2024), SMLS (Zhou, et al., Information Fusion 2025), and ARL (Wei, et al., ICML 2025), should be discussed in the related works.

  2. The authors should compare one or some relatively new algorithms, such as “Zhou, et al.: Dataset-aware Utopia modality contribution for imbalanced multimodal learning. Information Fusion 2025,” “Ma, et al.: Improving Multimodal Learning Balance and Sufficiency through Data Remixing. ICML 2025,” and “Zhou, et al.: Sample-level Self-paced Learning to Tackle Multimodal Imbalance Problem. ICASSP 2025.”

  3. The language quality of the submission is not very good. Some presentation issues should be addressed: (1) For example, “We” should be “we” in Line 143. (2) “we define a benefit function” should be “We define a benefit function” in Line 8 of the Abstract. (3) “we addresses the above problem” should be “we address the above problem” in Line 69 and Line 70. (4) “modality-specifice” should be “modality-specific,” which is used many times in this paper, such as in Line 157, Line 161, and Line 170.

In addition to the above issues, there are some confusing descriptions, such as "modality data"; I guess it should be "modality-specific data." I truly suggest the authors carefully check the whole paper and improve the overall presentation quality.

Questions

The authors should compare their work with some relatively new algorithms.

Limitations

no

Justification for Final Rating

The authors have compared the proposed method with some recent methods, validating their effectiveness. I have accordingly raised my rating to 4. But the writing quality should be further improved.

Formatting Concerns

no

Author Response

Q1: Some important references are missing. The sample-level modality contribution was discussed in "Cao, et al.: Predictive Dynamic Fusion. ICML 2024".

R1: Thanks for your comment. Although both our method and PDF [1] quantify sample-level modality contributions, the two methods focus on different aspects. Our method (CMoB), using causal learning and intervention-based contribution estimation, offers better interpretability. CMoB focuses on identifying dominant/weak modalities at the sample level during training, whereas PDF focuses on assessing quality disparities among modalities. CMoB emphasizes enhancing learning for weak modalities, whereas PDF aims to reduce reliance on low-quality modalities via Relative Calibration (RC). We conducted comparative experiments with the PDF method on the CREMA-D dataset; the results show that our proposed method exhibits strong robustness, as shown in Table 1. We will discuss Cao et al. (ICML 2024) in the Related Work section and include experimental comparisons in the revision.

Q2: The authors should compare some relatively new algorithms.

R2: Following your suggestion, we conducted comparative experiments with the latest modality imbalance algorithms on the CREMA-D dataset, as shown in Table 1. To further validate our approach, we add Gaussian noise to 50% of the modalities, where ε denotes the noise degree. The experimental results show the superiority of our method. Our manuscript also compares against relatively new algorithms: MMPareto, MLA, Relearning, and MMCooperation were proposed in 2024, and MMCosine, PMR, and AGM in 2023.

Table 1. Comparison with the latest algorithms on the CREMA-D dataset. The evaluation metric is accuracy (ACC).

| Method      | ε = 0.0    | ε = 5.0    |
|-------------|------------|------------|
| Concat      | 61.56±1.37 | 52.33±3.32 |
| Late Fusion | 61.81±2.13 | 49.84±3.72 |
| PDF [1]     | 63.31±1.11 | 57.85±2.04 |
| Shapley [2] | 77.82      | 70.44      |
| OPM [3]     | 68.45      | 64.25      |
| ERL-MR [4]  | 74.91      | 69.53      |
| Utopia [5]  | 78.67      | 68.89      |
| Ours (CMoB) | 79.75      | 70.72      |

Q3: The language quality of the submission is not very good.

R3: Thanks for your comment. We will comprehensively revise the manuscript to improve its clarity, grammatical accuracy, and conciseness in the next version.

[1] Cao B, Xia Y, Ding Y, et al. Predictive Dynamic Fusion[C]. Proceedings of the International Conference on Machine Learning. PMLR, 2024: 5608-5628.

[2] Wei Y, Feng R, Wang Z, et al. Enhancing multimodal cooperation via sample-level modality valuation[C]. Proceedings of the Conference on Computer Vision and Pattern Recognition. 2024: 27338-27347.

[3] Wei Y, Hu D, Du H, et al. On-the-fly modulation for balanced multimodal learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, 47(1): 469-485.

[4]Han W, Cai C, Guo Y, et al. Erl-mr: Harnessing the power of euler feature representations for balanced multi-modal learning[C]. Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 4591-4600.

[5]Zhou Y, Liang X, Xu Y, et al. Dataset-aware Utopia modality contribution for imbalanced multimodal learning[J]. Information Fusion, 2025: 103383.

Comment

Thanks for the response. The newly added results addressed my concerns, and I decided to raise my rating to 4 (borderline accept). But I still suggest the authors carefully check this paper and improve the overall quality thoroughly.

Final Decision

The paper tackles the problem of modality imbalance in the multimodal learning paradigm via causality-aware modality contribution quantification that captures more granular variations in modality contribution degrees within samples. Some notable positive aspects mentioned by the reviewers are:

  • the paper addresses an important issue in multimodal learning from a causality perspective
  • it is the first work to introduce causal individual treatment effect (ITE) estimation
  • extensive experimental results show effectiveness and robustness

Some concerns identified by the reviewers were:

  • how different the approach is from Shapley value techniques
  • the motivation is not clear
  • substantial cost due to per-sample ablation
  • the writing needs improvement in some places
  • missing references to some relevant related works

After the post-rebuttal discussion period, the reviewers acknowledged that their major concerns had been resolved; for example, the comparison with the latest methods, the computational cost overhead relative to baselines, the cognitive-science motivation, and the new accuracy-comparison results were all addressed. Therefore, the AC recommends acceptance of the paper and recommends that the authors incorporate the important reviewer comments in the final version.