PaperHub
Score: 6.4/10 · Poster · 4 reviewers (min 4, max 4, std 0.0)
Ratings: 4, 4, 4, 4 · Confidence: 4.3
Novelty: 3.0 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Plug-and-play Feature Causality Decomposition for Multimodal Representation Learning

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Representation Learning · Causal Inference · Multimodal Learning

Reviews and Discussion

Review
Rating: 4

This paper introduces a novel module, Feature Causality Decomposition (FCD), designed to enhance multimodal representation learning by explicitly modeling and separating synergistic, unique, and redundant components of unimodal features. Drawing from principles of causal inference, particularly backdoor adjustment, the method aims to remove modality-specific task-agnostic noise while preserving task-relevant information. FCD integrates two core submodules: Causality Components Decomposition (CCD) and Synergistic Distribution Alignment (SDA). The proposed method is plug-and-play and can be inserted into existing intermediate fusion models without structural modifications. The authors evaluate FCD on five benchmark datasets across various tasks and report consistent performance improvements over several state-of-the-art (SOTA) methods.

Strengths and Weaknesses

Strengths

  • The authors provide a theoretical framework connecting backdoor adjustment (from causal inference) with mutual information optimization, supporting their decomposition strategy.

  • Applying causality to multimodal learning is an interesting trend, and the proposed study is well motivated.

  • Experiments cover conventional multimodal learning datasets (CMU-MOSI, MOSEI, Food101, etc.) and a range of base models. Ablation studies and Grad-CAM visualizations add interpretability and support the claimed benefits of each component.

Weaknesses

  • Despite the sufficient motivation, feature disentanglement via backdoor adjustment is not novel; see [1] for similar work applying backdoor adjustment to graph representation learning. The methodological contribution is therefore limited.

  • While the paper leverages causal language and theoretical constructs, the actual role of causality remains largely conceptual. The core assumption—that modality-specific noise can be removed via backdoor adjustment—is not empirically validated. The SCM and do-operator usage is not grounded in real causal discovery or data-generating processes.

  • Following the previous comment, the assumed causal structure (e.g., the existence and confounding role of modality-specific features) is not validated via data-driven methods such as causal discovery, nor is any synthetic experiment conducted to test causal robustness. This undermines the practical credibility of the theoretical claims.

  • Scalability is not demonstrated empirically, as the authors only perform experiments on CMU-MOSI-like datasets, which are small in scale. Empirical evaluations on larger-scale datasets (e.g., MIMIC-CXR) are encouraged to demonstrate scalability.

[1] Sui, Yongduo, et al. "Causal attention for interpretable and generalizable graph classification." Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. 2022.

Questions

  • Could the authors clarify whether this SCM was constructed based on empirical data, prior knowledge, or intuition? Additionally, can you discuss how sensitive your method is to potential misspecifications of this causal structure?

  • Empirical evaluations on larger-scale datasets (e.g., MIMIC-CXR) are encouraged to demonstrate scalability.

Limitations

The paper states in the checklist that limitations are discussed, but no such discussion appears in the main text or the appendix. From what I can see, scalability is a significant limitation.

Final Justification

Thanks for the rebuttal. I agree that validating the real causal structure would be difficult, and I also acknowledge the effectiveness of causal debiasing. The authors' rebuttal demonstrates the principle behind the proposed framework, hence I will increase my score. However, I would also encourage the authors to perform experiments on modern high-dimensional (not only large-sample-size) datasets such as MIMIC-CXR, which would greatly strengthen the empirical evaluation, particularly regarding scalability.

Formatting Issues

N/A

Author Response

We are truly grateful for your insightful suggestions and are pleased to provide our responses.


Weaknesses

W1. Backdoor adjustment has been widely used (novelty is limited).

A1. Backdoor adjustment has been widely used across various tasks, including graph representation learning and multimodal sentiment analysis [1]. Backdoor adjustment itself was proposed long ago [2] and usually serves as a tool for debiasing. Many researchers have employed it to remove the causal effect of confounders in their systems, e.g., by generating counterfactual samples or combining it with frontdoor adjustment, and so do we. However, unlike these works:

  • (1) We use it in our theoretical derivation to remove the causal effect of unimodal uncertainty noise on task prediction.

  • (2) We apply it only at the theoretical level, which avoids the potential irrationality and poor diversity of counterfactual samples. For example, in [3], backdoor adjustment is used to generate counterfactual samples by replacing the backgrounds of samples within a batch; the diversity of the generated samples is tied to the dataset scale, since only backgrounds contained in the dataset can be used for intervention. Unlike that approach, we apply backdoor adjustment in the theoretical derivation without explicitly generating counterfactual samples, which avoids the impact of dataset scale (i.e., the diversity of the confounder).

  • (3) FCD is a plug-and-play module that can be integrated into existing methods for unimodal uncertainty-noise removal.

In short, FCD is a plug-and-play representation learning module distinct from existing ones. Just as Bayes' theorem is fundamental in probability theory [4], backdoor adjustment is a basic concept in causality theory.
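For reference, the standard backdoor-adjustment formula from [2] on which such derivations rest (generic notation, not the paper's Eq. 6): for a variable set $Z$ satisfying the backdoor criterion relative to $(X, Y)$,

$$
P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\,P(z),
$$

i.e., the interventional distribution averages the conditional over the confounder's marginal rather than its posterior.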

[1] Changqin Huang et al. AtCAF: Attention-based causality-aware fusion network for multimodal sentiment analysis. Information Fusion 2025.

[2] Judea Pearl. Causality. Cambridge University Press 2009.

[3] Sarah Rastegar et al. Background no more: Action recognition across domains by causal interventions. Computer Vision and Image Understanding 2024.

[4] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press 2003.


W2. The actual role of causality remains largely conceptual.

A2. We aim to extract the complementary and consistent information while removing the uncertainty noise mixed into the unimodal features. Since they have been widely proven effective for deconfounding (debiasing), backdoor adjustment and causality are also employed in our work. However, we aim to provide a generic plug-and-play module, so we try to simplify the actual complexity of FCD as much as possible through theoretical analysis and deduction. To this end, we propose Theorem 4.1, which proves that the unimodal uncertainty noise can be removed via Eq. 6, based on backdoor adjustment, under certain conditions. Although this may seem "conceptual," the actual roles of causality and backdoor adjustment are vital in our paper.


W3. Validation of causal structure.

A3. Please refer to A1 in the Questions below.


W4. Experiments on larger-scale datasets.

A4. Please refer to A2 in the Questions below.


Questions

Q1. The motivation behind the SCM construction.

A1. Previous research mainly focused on better utilization of the complementary and consistent information (e.g., fusion strategies, dynamic fusion) or on mitigating aleatoric uncertainty within each modality. We summarize the related work and empirically abstract the unimodal feature into three types of factors (synergistic, unique, and redundant), corresponding to [1], which proposed these concepts and validated them both theoretically and with demonstration experiments.

We acknowledge your suggestion that data-driven causal-structure validation would improve the persuasiveness of the proposed SCM, though we note that this direction has rarely been explored in the existing literature [2-5]. To validate the causal structure of our proposed SCM and its correspondence with [1], we follow [1] and conduct experiments using Self-MM on the CMU-MOSI test set, with the normalized information increment as the metric. Please refer to [1] for the theoretical derivation and explanation.

We verify the causal relationships between $\mathbf{z}$, $\mathbf{h}$, $\mathbf{s}$, $\mathbf{u}$, $\mathbf{r}$, and $\hat{y}$. We first store the variables of each modality and then apply PCA to the hidden embeddings ($\mathbf{z}$, $\mathbf{h}$, $\mathbf{s}$, $\mathbf{u}$, $\mathbf{r}$) along the dimension axis, reducing all vectors to scalars. We then apply the Causal Additive Models (CAM) algorithm [6] for causal discovery, repeating this procedure 1000 times to calculate edge confidence. Since figures are not allowed during the rebuttal period, we describe the result as follows:

|  | $\mathbf{z}$ | $\mathbf{h}$ | $\mathbf{s}$ | $\mathbf{r}$ | $\mathbf{u}$ | $\hat{y}$ |
|---|---|---|---|---|---|---|
| $\mathbf{z}$ | 0.00 | 0.75 | 0.83 | 0.55 | 0.31 | 0.23 |
| $\mathbf{h}$ | 0.19 | 0.00 | 0.31 | 0.72 | 0.83 | 0.40 |
| $\mathbf{s}$ | 0.23 | 0.27 | 0.00 | 0.20 | 0.31 | 0.80 |
| $\mathbf{r}$ | 0.29 | 0.45 | 0.37 | 0.00 | 0.38 | 0.27 |
| $\mathbf{u}$ | 0.42 | 0.55 | 0.47 | 0.39 | 0.00 | 0.78 |
| $\hat{y}$ | 0.34 | 0.37 | 0.54 | 0.23 | 0.34 | 0.00 |

Each element of the table is the frequency with which the corresponding edge occurs across the repeated runs; the higher the frequency, the stronger the causal relationship between the two variables. It is apparent that the synergistic $\mathbf{s}$ and the modality-specific task-related $\mathbf{u}$ have stronger causal relationships with the target $\hat{y}$, and the overall results support the causal structure of our proposed SCM.
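A minimal sketch of this edge-confidence procedure (our own illustration: we assume the 1000 repetitions use bootstrap subsampling, and `run_cam` is a hypothetical wrapper around a CAM implementation such as `cdt.causality.graph.CAM`):

```python
import numpy as np
from sklearn.decomposition import PCA

def edge_confidence(variables, run_cam, n_runs=1000, frac=0.8, seed=0):
    """Estimate edge confidence by repeated causal discovery.

    variables: dict mapping a name (e.g. 'z', 'h', 's', 'u', 'r', 'y')
               to an (n_samples, dim) array of hidden embeddings.
    run_cam:   hypothetical callable taking an (n_samples, n_vars) array
               and returning a binary adjacency matrix of discovered edges.
    """
    names = list(variables)
    # Reduce each embedding to one scalar per sample via 1-component PCA.
    cols = [PCA(n_components=1).fit_transform(variables[n]).ravel()
            for n in names]
    data = np.stack(cols, axis=1)  # shape: (n_samples, n_vars)

    rng = np.random.default_rng(seed)
    counts = np.zeros((len(names), len(names)))
    for _ in range(n_runs):
        idx = rng.choice(len(data), int(frac * len(data)), replace=False)
        counts += run_cam(data[idx])  # accumulate edge occurrences
    return names, counts / n_runs     # edge frequencies in [0, 1]
```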

[1] Álvaro Martínez-Sánchez et al. Decomposing causality into its synergistic, unique, and redundant components. Nature Communications 2024.

[2] Yongduo Sui et al. Causal attention for interpretable and generalizable graph classification. SIGKDD 2022.

[3] Changqin Huang et al. AtCAF: Attention-based causality-aware fusion network for multimodal sentiment analysis. Information Fusion 2025.

[4] Yanan Zhang et al. Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective. NeurIPS 2024.

[5] Weixing Chen et al. Cross-Modal Causal Intervention for Medical Report Generation. TIP 2025.

[6] Peter Bühlmann et al. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics 2014.


Q2. Empirical evaluations on larger-scale datasets (e.g., MIMIC-CXR).

A2. We evaluated our method on datasets of different scales (CMU-MOSI with 2199 samples, UPMC Food101 with nearly $10^5$ samples; please refer to Appendix C for more details). Besides, as mentioned in Weaknesses A1, since FCD does not actually generate counterfactual samples, the impact of dataset size is theoretically reduced.

Although we applied for the MIMIC-CXR dataset as soon as we could, it requires certification (the CITI program) and verification to access, and it is too large to download and train on without delaying the rebuttal process. Practically, following CMCRL [1], we use the preprocessed dataset and conduct the quantitative experiment on CMCRL. Because of the complexity of CMCRL and the scale of the dataset, the experiment is still ongoing; we will update the results as soon as they are available.

[1] Weixing Chen et al. Cross-Modal Causal Intervention for Medical Report Generation. TIP 2025.

Comment

Thanks for the rebuttal. I agree that validating the real causal structure would be difficult, and I also acknowledge the effectiveness of causal debiasing. The authors' rebuttal demonstrates the principle behind the proposed framework. Hence I will increase my score. However, I would also encourage the authors to perform experiments on modern high-dimensional (not only large-sample-size) datasets such as MIMIC-CXR, which would greatly strengthen the empirical evaluation, particularly regarding scalability.

Comment

We sincerely appreciate your positive evaluation and recognition of our work. Should you have any remaining questions or concerns regarding our manuscript, research, or the innovations presented therein, we would be delighted to engage in further discussion or provide additional clarification.

Review
Rating: 4

This paper proposes a plug-and-play Feature Causality Decomposition (FCD) framework for multimodal representation learning. The method performs a two-stage decomposition process: first, it separates unimodal features into modality-invariant and modality-specific components; second, the modality-specific component is further decomposed into unique and redundant parts using a backdoor-adjustment mechanism grounded in causal inference. Extensive experiments are conducted on two benchmark datasets for multimodal sentiment analysis (MSA), as well as three additional datasets, to validate the effectiveness of the proposed approach.

Strengths and Weaknesses

Pros:

  1. The paper addresses a critical problem in multimodal representation learning—namely, the challenges of redundancy and conflict across modalities. Effectively decomposing multimodal representations remains a central research challenge, and the proposed method directly tackles this issue.

  2. The proposed approach is conceptually straightforward. By introducing an additional FCD loss to guide the learning process, the method integrates seamlessly with existing architectures and enhances performance without introducing significant complexity.

  3. The experiments are comprehensive and convincing. The use of CMU-MOSI and CMU-MOSEI—two widely adopted benchmarks in multimodal sentiment analysis (MSA)—demonstrates the effectiveness of the method in standard settings. Moreover, the evaluation on additional classification tasks further supports the generalizability of the approach, which is a notable strength.

Cons:

  1. Potential missing citation for Theorem 4.1: It appears that Theorem 4.1 is derived from prior work, but no citation is provided. Proper attribution is important for clarity and academic integrity.

  2. There are several textual and formatting typos throughout the paper. For example, line 267 mentions five methods, while Table 2 only presents four. Additionally, there are typographical errors in lines 290 and 318. These inconsistencies may affect the readability and perceived rigor of the work.

  3. The ablation study and case study are conducted on different datasets, which may reduce the coherence of the analysis. It would strengthen the paper to conduct both studies on the same dataset. For instance, providing a case study on the MOSI dataset—similar to Figure 5—could offer more consistent and interpretable insights into the effectiveness of each component.

Questions

  1. How should FCD be introduced into existing work, given that each prior method has its own design? At which layer or block is it inserted for the five MSA methods? Is extra effort needed to tune the hyperparameters? If so, what is the motivation behind the tuning?

  2. Does the number of modalities affect the performance or the applicability of FCD?

  3. What is the difference between FCD and methods following the disentanglement paradigm, such as DMD [1] and its follow-up papers?

[1] Decoupled Multimodal Distilling for Emotion Recognition, CVPR 2023.

Limitations

Yes.

Final Justification

The authors address most of my concerns and give details on my questions. I think this work is meaningful and prefer to accept it if the authors incorporate the rebuttal content and add some recent related work to the performance comparison, especially decoupling-based frameworks, which are highly related to the proposed method.

[1] Li et al. Decoupled multimodal distilling for emotion recognition. CVPR 2023.

[2] Wang et al. DLF: Disentangled-language-focused multimodal sentiment analysis. AAAI 2025.

Formatting Issues

NA

Author Response

We truly value your expertise and the careful consideration you've given to our work, which has helped us refine our approach.


Cons (Weaknesses)

W1. Citation of Theorem 4.1.

A1. We are sorry for this carelessness. Theorem 4.1 builds on causal inference [1], information theory [2], and probability theory [3]. We will add the necessary citations in the camera-ready version if accepted.

[1] Judea Pearl. Causality. Cambridge University Press 2009.

[2] Ronghao Lin et al. Multi-Task Momentum Distillation for Multimodal Sentiment Analysis. IEEE Transactions on Affective Computing 2024.

[3] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press 2003.


W2. Typos in the paper.

A2. Thanks again for your careful review. We will revise these typos and recheck the content of the paper.


W3. Different datasets in ablation study and case study.

A3. We conducted these experiments on different datasets mainly because of the characteristics of the datasets themselves. We did not conduct the case study on the CMU-MOSI dataset because its visual modality consists of videos, which cannot be directly displayed in the paper. Instead, we performed the ablation study on MVSA-S using CLMLF; the results are as follows.

| $\mathcal{L}_{\text{MI}}$ | $\mathcal{L}_{\text{Dis}}$ | $\mathcal{L}_{\text{SDA}}$ | Eq. 16 | Eq. 13 | Acc-2 | F1 |
|---|---|---|---|---|---|---|
|  |  |  |  |  | 70.89 | 69.49 |
|  |  |  |  |  | 71.56 | 70.70 |
|  |  |  |  |  | 71.78 | 70.95 |
|  |  |  |  |  | 71.11 | 69.43 |
|  |  |  |  |  | 71.78 | 71.06 |
|  |  |  |  |  | 72.22 | 71.62 |

From the table, we can see a tendency similar to that of the ablation study on CMU-MOSI using Self-MM. Eq. 16 and Eq. 13 are the ablation cases for task-related feature fusion and the shortcut in the SDA module, respectively. The results also show the collective effect of our designs, consistent with Section 5.4.


Questions

Q1. Introduce FCD to existing works and the motivation behind tuning the hyperparameter.

A1. As you mentioned, previous works have their own unique designs, so we abstract them into a unified processing framework (Fig. 2(a)) consisting of three parts: unimodal encoders, a fusion module, and a prediction head. FCD is designed to decompose the synergistic, unique, and redundant features from the unimodal representations, so it is attached between the unimodal encoders and the fusion module (Fig. 2(b)).

Since FCD is a plug-and-play module, we only tune the hyperparameters introduced by FCD and fix the original ones. We report the hyperparameter search process in Appendix E: we first range each hyperparameter over $\{0.5, 0.05, 0.005, 0.0005, 0.00005\}$ to find a suitable scale, and then perform a finer-grained search within that scale to find the final value (see the sketch below). Please refer to Appendix E for more details about the tuning process.
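A minimal sketch of this coarse-to-fine search (our own illustration; `train_and_eval` and the fine-grid multiples are assumptions, not taken from Appendix E):

```python
def coarse_to_fine_search(train_and_eval,
                          coarse=(0.5, 0.05, 0.005, 0.0005, 0.00005)):
    """train_and_eval: hypothetical callable mapping one coefficient
    value to a validation score (higher is better)."""
    # Coarse pass: find the best order of magnitude.
    best_scale = max(coarse, key=train_and_eval)
    # Fine pass: search multiples around the winning scale.
    fine_grid = [best_scale * f for f in (0.2, 0.5, 1.0, 2.0, 5.0)]
    return max(fine_grid, key=train_and_eval)
```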


Q2. The impact of the number of modalities on FCD.

A2. The number of modalities significantly affects the training time cost because of the pairwise comparison $\mathcal{L}_{\text{SDA}}$, which has $O(M^2)$ complexity. The purpose of $\mathcal{L}_{\text{SDA}}$ is to reduce the differences between the distributions of different modalities, ensuring that the extracted synergistic features are mapped into the same representation space and are indeed shared among the $M$ modalities. Moreover, the calculation of the Sinkhorn divergence may also become time-consuming as $M$ increases. To mitigate this, we can randomly sample two modalities in each iteration, so that only one Sinkhorn divergence between two distributions is calculated, achieving an approximate alignment effect (see the sketch below).
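A minimal sketch of this random-pair variant (our own illustration, assuming the `geomloss` package for the Sinkhorn divergence; the authors' actual implementation may differ):

```python
import random
from geomloss import SamplesLoss  # pip install geomloss

# Debiased Sinkhorn divergence between two batches of feature vectors.
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def sda_loss_sampled(synergistic_feats):
    """O(1)-per-step stand-in for the O(M^2) pairwise alignment loss.

    synergistic_feats: list of M per-modality tensors, each of shape
    (batch, dim), holding the extracted synergistic features s^m.
    One random modality pair is aligned per iteration instead of
    all M*(M-1)/2 pairs.
    """
    i, j = random.sample(range(len(synergistic_feats)), 2)
    return sinkhorn(synergistic_feats[i], synergistic_feats[j])
```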


Q3. The difference between FCD and methods following the disentanglement paradigm, such as DMD [1] and its follow-up papers.

A3. FCD not only focuses on the disentanglement of modality-invariant and modality-specific information (as DMD [1] does), but also explicitly considers the unimodal uncertainty noise contained in each modality, caused by data-acquisition processes and sensor characteristics. In short, FCD simultaneously takes into account the modality-invariant (consistent) information, the modality-specific task-related (complementary) information, and the modality-specific task-agnostic information (unimodal uncertainty noise). Based on causal inference, we then remove the unimodal noise mixed into the modality-specific feature to obtain the complementary information, while the consistent information is also extracted. With both consistent and complementary information available, the task-related information can be better utilized. Moreover, FCD is a plug-and-play module that can be integrated into existing methods to improve their performance by eliminating unimodal uncertainty noise.

[1] Yong Li et al. Decoupled Multimodal Distilling for Emotion Recognition. CVPR 2023.

Comment

Thanks for the responses. Most of my concerns have been resolved, and I also think this work is valuable. However, I noticed that some results reported in this paper for the base case do not align with those in the original papers. For example, MMML's original paper reports Acc2 (89.69) and F1 (89.67) on the MOSI dataset, and Acc2 (88.02) and F1 (88.15) on the MOSEI dataset.

Could you give some details on why this happened?

Comment

Thanks for your question. The proposed FCD is a plug-and-play module to be integrated into existing methods. To evaluate the performance when FCD is integrated into each method, we reproduced every original method used in our experiments following its main hyperparameter settings. This avoids the influence of unrelated factors, such as hardware and dependency-package versions, on the comparison of experimental results between the two settings. The mismatch between the original methods' results and those reported in our paper may therefore be caused by differences in hardware and software environments.

Besides, we did not put much effort into making the reproduced results exactly match the original papers, again because of FCD's plug-and-play characteristic: the effectiveness of FCD depends on the change in each method's results before and after applying FCD, rather than on absolute values. As for MMML, considering its training and FCD hyperparameter-tuning overhead, we set the early-stopping patience to 3, smaller than the 8 used in their paper. Although this may sacrifice performance compared with the original paper, it is still a fair validation of FCD's effectiveness since the settings are identical between the ''Base'' and ''Ours'' cases.

To sum up, we reproduced each method locally and compared the relative results between the ''Base'' and ''Ours'' cases to evaluate the effectiveness of FCD, without precisely reproducing the numbers reported in the original papers.

Comment

Thanks for the clarification. I got the details, and it makes sense. I have no further questions.

Comment

We are glad you found our answers to your questions and concerns useful, and we sincerely appreciate that you find our work valuable. If you have any more questions before the discussion period ends, please feel free to post them, and we will do our best to answer.

Review
Rating: 4

This paper proposes a plug-and-play feature causality decomposition method for multimodal representation learning, addressing the limitations of existing approaches that fail to distinguish between complementary information and aleatoric uncertainty noise within modalities. The method disentangles unimodal features into modality-invariant (synergistic information shared across modalities) and modality-specific components, with the latter further decomposed into unique and redundant features using backdoor adjustment to eliminate noise while preserving task-relevant information. The resulting synergistic and unique features are then integrated into existing fusion modules, with causality theory supporting the noise removal process. Extensive experiments demonstrate the effectiveness of this approach in enhancing multimodal representations.

Strengths and Weaknesses

Strengths: This article introduces a novel Structural Causal Model (SCM) assumption for multimodal data, providing an innovative approach to feature representation. The proposed method includes a feasible feature disentanglement technique, which is rigorously supported by both theoretical analysis and experimental validation. The framework demonstrates strong potential for advancing multimodal data modeling.

Weaknesses: While the proposed SCM assumption is promising, the article lacks a detailed rationality analysis of this assumption. Additionally, the presentation could be refined to enhance clarity and precision. Further improvements in explanatory depth and linguistic expression would strengthen the work.

Questions

  1. The size of this plug-in module appears to be relatively large, with a considerable number of parameters. Typically, plug-in modules should maintain a compact design without excessive parameters. Could you clarify how the parameter scale of the proposed FCD module compares to that of the base model?

  2. The authors should clearly explain why URD is a measure-preserving bijective function and what ''measure-preserving'' specifically refers to in this context. In mathematics, ''measure'' has a strict definition; what exactly does ''measure'' refer to here?

  3. The rationality of the assumed information model in Fig. 1 requires further elaboration. Specifically, it should be clarified whether common multimodal data align with the proposed SCM. The authors could strengthen the justification with concrete examples: for instance, in image-text multimodal tasks, which factors are task-related, which are synergistic features, and what are the other features mentioned in Fig. 1?

  4. The motivation and necessity of SDA are not sufficiently justified. Is such a complex design truly necessary for SDA? If replaced with a simpler strategy—for instance, using a shared MLP across all modalities—would the performance degrade significantly?

Additional comments:

  1. Some typos: line 226 and 338

  2. The authors should more clearly and directly clarify what kind of claim or what kind of conclusion their theorem supports.

  3. The font of Definition should be different from that of the main text.

  4. For the case study, it is suggested to select and analyze challenging cases where prior methods proved ineffective. By demonstrating that the proposed approach succeeds in these scenarios—particularly when aligned with the SCM hypothesis—its critical role can be highlighted. Specifically, these cases should strongly support the hypothesis, illustrating that without it, the model would fail. Conversely, if the hypothesis is not rigorously validated, there remains a risk that the method's effectiveness is coincidental rather than causally linked to the proposed mechanism.

Limitations

Yes

Final Justification

The authors have addressed my concerns and provided clarifications that improved my understanding of the paper's contributions and core principles. I appreciate their efforts in revising the manuscript based on my earlier feedback. However, I have some reservations regarding the extent of the revisions. The original submission contained several key points that required clarification or updates. While the authors have resolved these issues, I am uncertain whether making substantial revisions to a conference paper at this stage is appropriate. If these critical points are not adequately incorporated into the final version, it may affect readers' ability to fully grasp the work. While I maintain my score, I remain generally positive about this work.

Formatting Issues

NA

Author Response

We greatly appreciate the time and effort you have taken to review our work, and we believe your suggestions have significantly improved the quality of the manuscript.


Q1. The parameter size of FCD.

A1. FCD integrates only a few MLPs, and their parameter sizes depend on the hidden dimensions of the modality encoders' outputs.

Specifically, for each modality, the CCD module has four MLPs with input and output size $d^m$ ($\mathcal{F}^m_{\text{h}}$, $\mathcal{F}^m_{\text{s}}$, $\mathcal{F}^m_{\text{URD-f}}$, and $\mathcal{F}^m_{\text{URD-d}}$), so its parameter scale is $O((d^m)^2)$. The SDA module has $M$ MLPs that map each unimodal feature to a common space of dimension $d$ and then back to the original dimension, giving a parameter scale of $O(d^2 + d\sum_{m=1}^{M} d^m)$.

Since the base methods have their own unique designs, there is no fixed parameter-size ratio between FCD and the base model. We report the training time cost in Appendix F, which partly reflects the relative parameter sizes: the growth rate introduced by FCD is relatively large for models with short original training times, and relatively small otherwise.

Therefore, the FCD parameter scale is tied to the base model's unimodal-encoder output dimensions: if the output dimensions are large, the FCD parameter size will be larger than for models with small output dimensions.
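As a rough worked example (the dimension here is our assumption, not a value from the paper): with a single modality of encoder output size $d^m = 768$, the four CCD MLPs alone contribute about $4 \times 768^2 \approx 2.4\text{M}$ weights for that modality, which is negligible next to a transformer encoder of that width but noticeable for lightweight base models.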


Q2. The measure-preserving bijection of URD and the ''measure'' in our paper.

A2. Technically, we apply SVD and construct a skew-symmetric matrix to constrain URD; pseudocode for training FCD is given in Appendix B. Specifically, every singular value in $\Sigma$ is enforced to be greater than $10^{-5}$, ensuring that the weight matrix of the linear layer in URD remains invertible. A skew-symmetric matrix with the exponential map is then constructed to achieve the measure-preserving property. Since the composition of two measure-preserving functions is still measure-preserving, URD as a whole is a measure-preserving bijective function. Please refer to Appendix B for more details on the training process and the measure-preserving bijective URD.

We require URD to be ''measure-preserving'' so that Lemma A.1 in Appendix A holds. If $\mathcal{F}$ were not measure-preserving, it could change the probability distributions of the variables. For example, let $Y = 2X$ with $X$ uniformly distributed with density $P(X) = 1/2$; then the density of $Y$ is $P(Y) = 1/4$, and Lemma A.1 does not hold. ''Measure-preserving'' therefore ensures that the probability distributions of the variables remain the same; the ''measure'' here refers to the size of subsets within a stochastic variable's domain, which allows probability to be assigned to events.
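A minimal sketch of the skew-symmetric/exponential-map construction described above (our own illustration, not the authors' code): the matrix exponential of a skew-symmetric matrix is orthogonal, so the resulting linear map is a bijection that preserves Lebesgue measure.

```python
import torch
import torch.nn as nn

class MeasurePreservingLinear(nn.Module):
    """Linear map x -> x Q^T with Q = exp(A - A^T).

    A - A^T is skew-symmetric, so Q is orthogonal (det Q = 1):
    the layer is invertible and volume/measure-preserving.
    """
    def __init__(self, dim):
        super().__init__()
        self.raw = nn.Parameter(0.01 * torch.randn(dim, dim))

    def weight(self):
        skew = self.raw - self.raw.T          # skew-symmetric generator
        return torch.linalg.matrix_exp(skew)  # orthogonal matrix Q

    def forward(self, x):
        return x @ self.weight().T

    def inverse(self, y):
        return y @ self.weight()              # Q^{-1} = Q^T
```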


Q3. Further elaboration of SCM (Fig. 1).

A3. As stated in the Introduction (Section 1) and Related Works (Section 2), previous research mainly focused on better utilizing the complementary and consistent information or on mitigating aleatoric uncertainty within each modality. We summarize the related work and empirically abstract the unimodal feature into three types of factors (synergistic, unique, and redundant).

For example, in an image-text task:

  • The task-related parts are the ones that contribute to the final task prediction. The synergistic parts are included in both the image and the text modalities and jointly contribute to the final task prediction.

  • The unique parts are the semantic contents that appear only in the image modality (or text modality) and not in the other. Both synergistic and unique factors are task-related and derived from the task-related factor.

  • The redundant parts are the uncertainty noise within one modality, such as typos in the text modality or the quality of the image, and are derived from the noise factor.

  • The multimodal data store this information in their different formats (the specific factor), but the contents remain associated at the high semantic level (the synergistic factor).

In the case study (Section 5.5), we visualize the synergistic and unique parts in both the image and text modalities, which makes it easier to directly see which regions of the original sample correspond to the different feature types.


Q4. The motivation and necessity of SDA.

A4. The SDA module constrains the extraction of the synergistic feature from each modality. The synergistic feature represents the consistent information present in multiple modalities, facilitating alignment of the modalities within a common representation space. Without SDA, there is no explicit mechanism constraining the features to be alignable in a common representation space, and the synergistic feature extraction fails.

Similar to your idea, we also employ a shared MLP across all modalities ($\mathcal{F}_{\text{SDA}}$, see Fig. 4). To further align the features, we propose $\mathcal{L}_{\text{SDA}}$ (Eq. 15) to explicitly optimize $\mathcal{F}_{\text{SDA}}$ and guide features from different modalities towards a common representation space. In our ablation study (Section 5.4), we set the coefficient of $\mathcal{L}_{\text{SDA}}$ ($\lambda_3$) to $0$ to validate its effectiveness; in that case only the parameter-sharing MLP is employed and there is no constraint on the synergistic-information alignment. Table 3 shows that performance declines without this explicit constraint, so it is necessary to explicitly supervise the module via $\mathcal{L}_{\text{SDA}}$ (Eq. 15). Please refer to Section 5.4 for more details.


Additional comments:

Q1. Some typos: line 226 and 338

A1. Thank you again for your meticulous review. We will double check the details in the article.


Q2. The authors should more clearly and directly clarify what kind of claim or what kind of conclusion their theorem supports.

A2. Theorem 4.1 states that if the extraction function $\mathcal{F}^m_{\text{mb}}$ is a measure-preserving bijection, then effectively extracting the task-related feature $\mathbf{u}^m_n$ from the modality-specific feature $\mathbf{h}^m_n$, while separating out the modality noise $\mathbf{r}^m_n$ via backdoor adjustment, is equivalent to maximizing the mutual information between the annotation and $\mathbf{u}^m_n$ after causal intervention. Based on this, URD extracts the unique feature $\mathbf{u}$ via Eq. 6, with the unimodal noise removed simultaneously.


Q3. The font of Definition should be different from the main contexts.

A3. Thank you for your suggestion. We will revise it in our camera-ready version if accepted.


Q4. For the case study, it is suggested to select and analyze challenging cases where prior methods proved ineffective. By demonstrating that the proposed approach succeeds in these scenarios—particularly when aligned with the SCM hypothesis—we can highlight its critical role.

A4. Thank you for this rigorous suggestion regarding the case study. We will conduct this experiment and include it in the camera-ready version if accepted.

Comment

Dear Reviewer KESk,

We hope this message finds you well. We would like to kindly remind you that the author rebuttal for Submission 16534 has been submitted and is awaiting your response. As part of the NeurIPS 2025 review process, it is important to review the authors' responses to your initial comments and provide any follow-up feedback or a mandatory acknowledgement to complete the review cycle.

If you have any questions or need further clarification, please feel free to reach out.

Thank you for your dedication to the review process.

Best regards,

Area Chair

Comment

The authors have addressed my concerns and helped me better understand the contributions and fundamental principles of this paper. I appreciate the authors' efforts in addressing my previous concerns. However, I have a concern: the original submission contains many key points that require clarification or updating. Although the authors have resolved them, I'm uncertain whether making extensive revisions to a conference paper is appropriate. If these key points cannot be incorporated into the final version, it may hinder readers' comprehension of this work.

Comment

We sincerely appreciate your positive evaluation and recognition of our work. Should you have any remaining questions or concerns regarding our manuscript, research, or the innovations presented therein, we would be delighted to engage in further discussion or provide additional clarification.

Review
Rating: 4

This paper introduces a plug-and-play module named "Feature Causality Decomposition (FCD)," which aims to address a core challenge in multimodal learning: how to differentiate and utilize valuable complementary information while simultaneously eliminating modality-specific noise. To achieve this fine-grained decomposition, FCD primarily relies on several key modules and constraints. The Causality Components Decomposition (CCD) module first performs an initial separation of the raw feature z into a modality-invariant part (i.e., the synergistic feature s) and a modality-specific part h. The Synergistic Distribution Alignment (SDA) module then employs pairwise Sinkhorn divergence to align the s features from all modalities to enforce their consistency. The most critical step occurs in the processing of the modality-specific part h. The model applies backdoor adjustment theory, using a Unique Redundant Decompose (URD) submodule to further decompose h into u and r. This process is guided by a mutual information-based loss function, L_MI, which aims to maximize the association between u and the task label, thereby isolating task-irrelevant components into r. Experiments on multiple datasets and existing models demonstrate the effectiveness of this decomposition strategy.

Strengths and Weaknesses

Strengths:

  1. The paper addresses an important and subtle problem in multimodal learning: how to distinguish modality noise from effective complementary information. Its core contribution lies in moving beyond the traditional shared-specific binary decomposition. It introduces a causal inference framework to propose a more fine-grained, three-part decomposition (synergistic, unique, redundant).

  2. The paper's methodology is well-designed and technically sound. The model's architecture is clearly defined (e.g., CCD, SDA, URD modules), and it employs mature techniques from the field (e.g., Sinkhorn divergence, InfoNCE loss) to achieve its objectives. The experimental validation is thorough; by applying the FCD module to 9 different state-of-the-art models across 5 standard datasets, the paper provides substantial empirical evidence for the method's effectiveness and its "plug-and-play" generalization ability.

  3. The paper is well-structured and clearly written. The authors effectively use diagrams (such as the causal graph and model architecture) to illustrate their complex theoretical model and workflow, which greatly aids the reader's understanding of the core ideas.

Weaknesses:

  1. A main concern is that the separation of synergistic (s) and specific (h) features relies entirely on the indirect guidance of downstream loss functions. The model lacks an explicit mechanism (e.g., an orthogonality constraint or a mutual-information minimization loss) to actively prevent information leakage between the two branches. The ''purity'' of the decomposition therefore becomes a core assumption that is not directly verified, which to some extent undermines the theoretical rigor.

  2. Although the experimental coverage is broad, it still needs further statistical analysis. All performance results are reported without standard deviations or significance tests. This makes it difficult to determine to what extent the reported performance gains are stable and reproducible, rather than being a result of random fluctuations during model training.

  3. The design of the Synergistic Distribution Alignment (SDA) module has a notable limitation. As it relies on computing pairwise Sinkhorn divergence between modalities, the computational complexity may grow rapidly.

Questions

  1. The paper reports consistent performance improvements across multiple baseline models. To assess the stability of these results, could the authors clarify whether the data in the tables are based on single runs or are averages of multiple independent experiments? Reporting the mean and standard deviation over several runs is necessary to establish the statistical significance of the results.

  2. Can the feature decomposition guarantee that h contains only specific information and s only synergistic information? The model uses indirect loss functions to guide the information separation between the s (synergistic) and h (specific) branches. This raises a question: is this mechanism sufficient to guarantee that there is no information leakage between the two? For example, is it possible for the modality-specific branch h, in order to optimize the final task, to also learn and encode some cross-modal synergistic features? Without an explicit constraint to prevent such cases in the current model, how can the purity of the decomposition be guaranteed?

  3. The data in Appendix F demonstrates the computational overhead of FCD for 2-3 modalities. Considering that the L_SDA loss is based on pairwise comparisons, its complexity grows quadratically (O(M^2)) with the number of modalities M. How is the scalability of the method envisioned for scenarios with many more modalities (e.g., M > 5)? Would this growth in computational complexity become a bottleneck for its application?

  4. The L_MI and L_Dis loss functions used in the model are essentially contrastive learning, whose effectiveness is often related to the number of negative samples provided by the batch size. The paper does not include a sensitivity analysis for this hyperparameter. Could the authors provide the specific batch sizes used in the experiments and comment on the FCD's performance sensitivity to this parameter?

Limitations

Yes.

Final Justification

Thank you for the rebuttal. I appreciate the clarifications regarding feature decoupling (Q2) and the added sensitivity analysis (Q4). However, my primary concern about experimental rigor (Q1) remains. The claim of "no stochastic mechanism" implies a fixed seed, yet the rebuttal neither confirms its consistent use across baselines nor provides the necessary multi-run results. This lack of statistical validation raises significant concerns about potential cherry-picking, especially on larger datasets where such low variance would be highly unusual. I also concur on the computational overhead challenges (Q3). As a result, I will maintain my original rating.

Formatting Issues

N/A

Author Response

Thank you very much for your thoughtful and constructive comments, which would be greatly helpful to improve the quality of our work.


Q1. The standard deviation over several runs of our experiments.

A1. We conducted multiple independent experiments and found that the results were essentially identical across runs. This is because we followed the hyperparameters reported by each base method or in their official scripts, and no stochastic mechanism is designed into our approach. The standard deviations are therefore very small and were not indicated in the paper. We are sorry for the unclear statement and will refine our description of the experimental design and results.

| Method | MAE | Corr | Acc-2 | F1 |
|---|---|---|---|---|
| Self-MM | 0.00 | 0.00 | 0.00/0.00 | 0.00/0.00 |
| MMIM | 0.00 | 0.00 | 0.00/0.00 | 0.00/0.00 |
| MCL-MCF | 0.00 | 0.00 | 0.00/0.00 | 0.00/0.00 |
| AtCAF | 0.00 | 0.00 | 0.00/0.00 | 0.00/0.00 |
| MMML | 0.00 | 0.00 | 0.00/0.00 | 0.00/0.00 |

These are the standard deviations over 5 independent experiments on the CMU-MOSI dataset.


Q2. The guarantee of separation between the s (synergistic) and h (specific) branches (purity).

A2. The separation between the $\mathbf{s}$ and $\mathbf{h}$ branches is guaranteed by the direct optimization targets, i.e., the term $\mathcal{L}_{\text{FCD}}$ (Eq. 17).

  • (1) On the one hand, to separate the synergistic feature $\mathbf{s}$, $\mathcal{L}_{\text{SDA}}$ (Eq. 15) is employed to optimize $\mathcal{F}_{\text{s}}$ and the SDA module, constraining $\mathbf{s}$ to be mapped into a common representation space.

  • (2) On the other hand, to separate the modality-specific feature $\mathbf{h}$, two other optimization targets are employed: $\mathcal{L}_{\text{Dis}}$ (Eq. 19) and $\mathcal{L}_{\text{MI}}$ (Eq. 6). These two terms jointly supervise the extraction of $\mathbf{h}$, ensuring it contains not only modality-specific but also task-related information.

Since the unique feature $\mathbf{u}$ is mixed into $\mathbf{h}$, there exists a complex high-level semantic overlap between $\mathbf{s}$ and $\mathbf{h}$, because $\mathbf{s}$ and $\mathbf{u}$ describe the same object (the task annotation is the same for a single data point). It is tricky to separate them directly, since the semantic content varies across samples. If we directly added constraints (such as an orthogonality constraint), the two outputs might instead split into task-related and task-agnostic parts; in that case we could not guarantee that the task-related part contains both complementary and consistent information, which may cause information distortion.

Inspired by [1], we use two parallel MLPs to map the original feature to the synergistic $\mathbf{s}$ and the modality-specific $\mathbf{h}$ under the supervision of $\mathcal{L}_{\text{FCD}}$, as mentioned above. Although there are no regularization terms between these MLPs, their outputs are still constrained by $\mathcal{L}_{\text{FCD}}$, which ensures they effectively extract the corresponding features (see the sketch after the reference below).

[1] Ying Zhou et al. Triple disentangled representation learning for multimodal affective analysis. Information Fusion 2025.
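A minimal sketch of the two parallel branches described above (our own naming and layer sizes, not the authors' code; the separation comes from the loss terms, not from an architectural constraint):

```python
import torch.nn as nn

class ParallelDecomposition(nn.Module):
    """Two parallel MLPs splitting a unimodal feature z into a
    synergistic part s and a modality-specific part h."""
    def __init__(self, dim):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))
        self.f_s = mlp()  # synergistic branch, supervised by L_SDA
        self.f_h = mlp()  # modality-specific branch, supervised by L_Dis, L_MI

    def forward(self, z):
        return self.f_s(z), self.f_h(z)
```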


Q3. The computational overhead of the pairwise comparison $\mathcal{L}_{\text{SDA}}$.

A3. Your concern is very reasonable and worth addressing. The purpose of this pairwise comparison loss is to reduce the differences between any two modality distributions, in order to ensure that the extracted synergistic features are indeed shared among the $M$ modalities. As you point out, this strict constraint incurs $O(M^2)$ complexity. To improve this, we can randomly sample two modalities in each iteration, so that only one Sinkhorn divergence between two distributions is calculated, achieving an approximate alignment effect.


Q4. The sensitivity analysis of batch size hyperparameter.

A4. We reproduced and conducted our experiments using the batch sizes given in each compared paper or its scripts. Since FCD is a plug-and-play module and we aim to validate performance under the original experimental settings, we did not tune the batch size (or other base hyperparameters). Here we conduct a sensitivity analysis on Self-MM with FCD on the CMU-MOSI dataset; the results are as follows.

| Batch size | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| MAE | 67.83 | 68.02 | 68.29 | 68.11 | 67.94 |
| Corr | 80.34 | 79.86 | 80.62 | 80.27 | 80.57 |
| Acc-2 | 84.74/86.96 | 84.43/86.27 | 84.95/87.09 | 85.15/87.41 | 85.74/87.85 |
| F1 | 84.74/87.01 | 84.48/86.31 | 84.93/87.11 | 85.16/87.42 | 85.71/87.83 |

These results show that a larger batch size may enhance model performance because of the increased negative-sample diversity in the contrastive losses. We will add the results to the camera-ready version if accepted.

Comment

We sincerely appreciate your positive evaluation and recognition of our work. Should you have any remaining questions or concerns regarding our manuscript, research, or the innovations presented therein, we would be delighted to engage in further discussion or provide additional clarification.

Comment

We sincerely appreciate the valuable time and effort of the AC and the reviewers. We are pleased that the reviewers recognize the significance and contribution of our work (Reviewer #B7dA, Reviewer #KESk, Reviewer #hx2Y, Reviewer #Qg9n), its theoretical and methodological rigor (Reviewer #B7dA, Reviewer #KESk, Reviewer #hx2Y), its novelty (Reviewer #KESk, Reviewer #Qg9n), the thorough experimental design (Reviewer #B7dA, Reviewer #KESk, Reviewer #hx2Y), and the well-written manuscript (Reviewer #B7dA, Reviewer #hx2Y). Below, we summarize how we addressed each reviewer's concerns in detail:

Reviewer #B7dA:

We provided additional experiments and results clarifying the sensitivity to batch size and the standard deviations, addressing the concerns about the impact of negative samples in the InfoNCE losses and the stability of the quantitative results. Besides, we provided a theoretical analysis of the guarantee of separation between the $\mathbf{s}$ (synergistic) and $\mathbf{h}$ (specific) branches, and clarified the necessity of the pairwise comparison $\mathcal{L}_{\text{SDA}}$. However, the reviewer only submitted the Mandatory Acknowledgement without replying to the rebuttal, so we hope the rebuttal content resolves the reviewer's concerns (score = 4).

Reviewer #KESk:

We added further elaboration on the parameter size of FCD, the SCM, the measure-preserving bijectivity of URD, and the motivation and necessity of the proposed module. The reviewer acknowledged that all concerns were resolved (score = 4).

Reviewer #hx2Y:

We provided an additional ablation study and its results to improve the coherence of the analysis. Besides, we further clarified the tuning strategy, the impact of the number of modalities on FCD, and the differences from previous works. The reviewer acknowledged that most concerns had been addressed and found the work valuable (score = 4).

Reviewer #Qg9n:

We further clarified the differences from previous research. We also provided an additional experiment and analysis on the causal structure of the proposed SCM; the reviewer acknowledged the principle behind the proposed framework, confirmed satisfaction with the additional analysis, and decided to raise the original score (score ≥ 4).

We hope this summary demonstrates our efforts to address all reviewers' feedback and highlights the improvements made to our manuscript during the rebuttal process.

Final Decision

This paper introduces a plug-and-play Feature Causality Decomposition (FCD) module for multimodal representation learning, aimed at separating task-relevant synergistic and modality-specific features from redundant noise via causal inference (backdoor adjustment). FCD integrates seamlessly with existing fusion models and demonstrates consistent performance gains across nine state-of-the-art models and five benchmark datasets.

Reviewers praised the paper for addressing an important problem with a theoretically grounded framework, clear architecture (CCD, SDA, URD modules), and thorough experiments. Concerns included potential feature leakage, missing statistical analyses, SDA computational complexity, and novelty of backdoor adjustment. The authors’ rebuttal clarified these issues, providing standard deviations, explicit loss-based feature separation, strategies to reduce SDA complexity, and distinctions from prior disentanglement methods.

Given its originality, strong empirical validation, and effective response to reviewer concerns, the paper meets NeurIPS standards for technical rigor and impact.