PaperHub
Rating: 5.3 / 10 (Poster; 4 reviewers; min 4, max 7, std. dev. 1.1)
Individual ratings: 5, 5, 7, 4
Confidence: 4.3
Correctness: 2.0 · Contribution: 2.0 · Presentation: 2.5
NeurIPS 2024

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
balanced multimodal learning, gradient modulation

Reviews and Discussion

Review
Rating: 5

This paper proposes a balanced multimodal learning method. Compared to existing methods that only consider the gradient magnitude, it also considers the direction of the gradient.

Strengths

The experiments include multiple datasets and multiple tasks.

Weaknesses

There is little visualization analysis of the experiments.

Questions

  1. It is interesting that the author calculates the improvement of each modality (such as the change in accuracy) instead of the current performance. However, it is hoped that more convincing theoretical proofs can be added.
  2. It is recommended to visualize changes in indicators before and after modulation (such as utilization rate, etc.), including only adjusting the gradient size, gradient direction, and both. It is best to also add the performance change process for each modality to avoid "using the indicators proposed by yourself to measure your performance."
  3. Can it be implemented using only one classifier, instead of one classifier for each modality?
  4. Please add a brief introduction to the experimental comparison method, such as AGM, PMR, etc.
  5. The method in this paper is similar to the OGM method, but the original OGM also includes a part that uses Gaussian noise to improve generalization. Is this part of the method used in this paper? Can you add relevant discussions?

Limitations

Yes, the authors have adequately addressed the limitations.

Author Response

Thanks for your valuable time and comments.

(W1): Theoretical analysis.

We provide an analysis of why the combination of improvements can balance the training. In Sec 3.2, we know the dominant modality will be updated faster than the others, which makes its gradient $\partial\Omega/\partial\phi$ much larger. Therefore, according to Eq.(3,4), $\Delta\theta^{\phi_i}$ will be larger than that of the other modalities, indicating a larger step towards the optimum, which in turn influences $\Delta\epsilon$ (i.e. $\partial\Omega/\partial\phi(\uparrow)\rightarrow\Delta\theta^{\phi_i}(\uparrow)\rightarrow\Delta\epsilon(\uparrow)$). Then, for modality $i$, the change in the balance term will be $\frac{\sum_{k=1,k\ne i}^M\Delta\epsilon_k}{\sum_{k=1}^M\Delta\epsilon_k}=\frac{\sum_{k=1}^M\Delta\epsilon_k-\Delta\epsilon_i}{\sum_{k=1}^M\Delta\epsilon_k}=1-\frac{\Delta\epsilon_i}{\sum_{k=1}^M\Delta\epsilon_k}$. If $i$ is the dominant modality ($\Delta\epsilon_i(\uparrow)$ increases), then $\frac{\Delta\epsilon_i(\uparrow)}{\sum_{k=1}^M\Delta\epsilon_k(-)}$ will increase, so the balance term $1-\frac{\Delta\epsilon_i}{\sum_{k=1}^M\Delta\epsilon_k}$ will decrease. According to the CGGM update rule ($\theta^{\phi_i}=\theta^{\phi_i}-\rho(1-\frac{\Delta\epsilon_i}{\sum_{k=1}^M\Delta\epsilon_k})\nabla g_i$), $\Delta\theta^{\phi_i}$ will decrease, slowing down the optimization of the dominant modality. For the other modalities, $1-\frac{\Delta\epsilon_i}{\sum_{k=1}^M\Delta\epsilon_k}$ will increase, thus accelerating their optimization. During training, the balancing term keeps changing with the optimization, thus making all modalities sufficiently optimized.
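For illustration, a minimal sketch of this update rule is given below; the variable names and the way per-modality improvements are obtained are assumptions for the example, not the exact implementation in the paper.

```python
import torch

def cggm_magnitude_step(encoders, delta_eps, rho):
    """One CGGM-style modulated update (sketch).

    encoders : dict of modality name -> nn.Module whose .grad fields are already
               populated by backpropagation.
    delta_eps: dict of modality name -> improvement Delta-epsilon_i of its
               unimodal metric between two consecutive evaluations (assumed given).
    rho      : base learning rate.
    """
    total = sum(delta_eps.values()) + 1e-12          # sum_k Delta-epsilon_k
    for name, enc in encoders.items():
        # balance term 1 - Delta-epsilon_i / sum_k Delta-epsilon_k:
        # a dominant modality (large improvement) gets a smaller coefficient.
        coeff = 1.0 - delta_eps[name] / total
        with torch.no_grad():
            for p in enc.parameters():
                if p.grad is not None:
                    p -= rho * coeff * p.grad        # theta <- theta - rho * coeff * grad
```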

(W2): Visualization.

We have uploaded figures in a PDF in the general response. We have made the following visualizations: the change in losses under four scenarios, the change in the balancing term, and the change in accuracy, gradient magnitude, and direction.

  • Compared with Fig.2 in the paper (0.7 after 1 epoch), the accuracy of the text modality does not increase very fast with CGGM (0.55 after 1 epoch), indicating that CGGM imposes constraints on it during optimization (see Fig.3 in the PDF).
  • In Fig.2 in the paper, the dominant modality always has the largest gradient, while in Fig.1 in the PDF, the gradient magnitude of the text modality decreases at first, indicating that CGGM slows down its optimization.
  • In Fig.2 in the paper, $\cos(g_a, g_{mul}) < 0$ during training, indicating an opposite optimization direction between the unimodal and multimodal branches, thus hindering the optimization process. In Fig.1 in the PDF, we observe that $\cos(g_i, g) > 0$ for all modalities, indicating that all modalities have the same direction as the multimodal branch.
  • In Fig.2 in the PDF, the loss of the dominant modality drops much more slowly than in Fig.2(a), and the losses of all modalities in Fig.2(b-d) are smaller than those in (a), indicating the effectiveness of CGGM.
  • In Fig.3 in the PDF, when the value is above the red line, the modality is promoted; when it is below the red line, the modality is suppressed. In the first few iterations, the dominant modality is suppressed, ensuring that the other modalities are fully optimized. During optimization, the balancing terms of the three modalities rise and fall, ensuring that each modality is sufficiently optimized.

(W3): Using only one classifier instead of one classifier for each modality.

It is difficult for a single classifier to capture the unimodal gradients and features accurately. During optimization, such a classifier takes all modalities as input and absorbs multimodal information, thus failing to reflect unimodal utilization rates effectively. A comparison on IEMOCAP is shown below.

| | 3 classifiers | 1 classifier |
| --- | --- | --- |
| Acc | 75.4 | 72.1 |

Besides, the additional classifiers do not require much memory, because the classifiers do not pass gradients to the modality encoders during backpropagation.

| | No classifier | 3 classifiers |
| --- | --- | --- |
| Memory (MB) | 4438 | 4446 |
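This low overhead comes from detaching the unimodal features before they enter the auxiliary classifiers, so no gradient flows back into the encoders. A minimal sketch of this wiring follows; the module names and the fusion interface are illustrative assumptions, not the paper's exact code.

```python
import torch.nn as nn

class MultimodalModel(nn.Module):
    def __init__(self, encoders: nn.ModuleDict, fusion: nn.Module,
                 unimodal_heads: nn.ModuleDict):
        super().__init__()
        self.encoders = encoders              # one encoder per modality
        self.fusion = fusion                  # multimodal fusion + task head
        self.unimodal_heads = unimodal_heads  # auxiliary per-modality classifiers

    def forward(self, inputs: dict):
        feats = {m: enc(inputs[m]) for m, enc in self.encoders.items()}
        multimodal_out = self.fusion(list(feats.values()))
        # detach() blocks gradients from the auxiliary heads to the encoders,
        # so the heads only monitor modality utilization and update themselves.
        unimodal_out = {m: self.unimodal_heads[m](feats[m].detach())
                        for m in feats}
        return multimodal_out, unimodal_out
```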

(W4): Brief introduction to comparison methods.

Thanks for your suggestion. G-Blending computes an optimal blending of modalities based on their overfitting behaviors to balance the learning. Greedy proposes a conditional learning speed to capture the relative learning speed of each modality and balance the learning. OGM balances multimodal learning by monitoring the discrepancy in the modalities' contributions to the learning objective and modulating their gradients. AGM proposes a metric built on mono-concept to represent the competition state of a modality. PMR introduces prototypes for each class, accelerating the slow-learning modality by enhancing its clustering toward the prototypes and reducing the inhibition from the dominant modality with prototypical regularization.

(W5): Discussion on OGM and CGGM.

There are crucial differences between OGM and CGGM:

  • OGM is based on cross-entropy loss and cannot be applied to other tasks such as regression.
  • OGM calculates the discrepancy ratio between two modalities. However, when there are $M$ modalities, it needs to calculate $M$ ratios for each modality, and OGM does not consider how to use these $M^2$ ratios to balance the training. In contrast, CGGM first calculates the term individually for each modality and then combines them to represent the utilization rate, indicating its universality.
  • OGM overlooks the influence of gradient direction.

OGM also includes a component that adds Gaussian noise to improve generalization. We do not use it because it relies on the generalization property of SGD, whose gradient noise follows a Gaussian distribution when the batch size $m$ is large enough. However, in many tasks other optimizers may be chosen and the batch size may be small, so adding Gaussian noise may hinder the optimization process. For example, we use AdamW on Food101, and when we add Gaussian noise to the gradient, the accuracy drops from 92.9 to 92.5.
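For reference, this generalization-enhancement component amounts to perturbing each gradient with zero-mean Gaussian noise before the optimizer step; a hedged sketch is below, where the noise scale `sigma` is a hypothetical hyperparameter rather than a value from either paper.

```python
import torch

def add_gradient_noise(model, sigma=1e-3):
    """Add zero-mean Gaussian noise to every populated gradient (sketch of the
    generalization-enhancement idea discussed above; sigma is illustrative)."""
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad += sigma * torch.randn_like(p.grad)
```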

Comment

Dear reviewer, a reminder to take a look at the author's rebuttal and other reviews. Did the rebuttal address your concerns?

Comment

Dear Reviewer kP2D,

Thank you for your valuable time and comments on our manuscript. The rebuttal period is set to end soon, and we are looking forward to your feedback. During the rebuttal stage, we dedicated significant time and effort to address the concerns you raised in your initial review.

Please let us know if our response addresses your concerns and if you have any other questions or require any further information from us. We are happy to provide additional details or clarification as needed. We appreciate your time and consideration, and look forward to hearing back from you.

Best regards

Comment

Dear Reviewer kP2D,

Thank you again for your valuable time and comments. The rebuttal period is set to end today, and we are looking forward to your feedback. During the rebuttal stage, we dedicated significant time and effort to address the concerns you raised in your initial review. Please feel free to reach out if you have any other questions. We are happy to provide additional details or clarification as needed.

Review
Rating: 5

This paper proposes a balanced multi-modal learning method, Classifier-Guided Gradient Modulation (CGGM), which considers both the magnitude and direction of the gradients, with no limitations on the type of task, the optimizer, or the number of modalities.

Strengths

  1. Balanced multi-modal learning considering both the magnitude and directions of the gradients is a reasonable idea.
  2. The proposed method is easy to follow.

Weaknesses

  1. Balanced multi-modal learning considering the directions of the gradients is not novel. A previous work [1] has already analyzed the issue of modality dominance caused by gradient conflicts. The difference from and comparison with this method should be considered in detail. Besides, the approach to controlling gradient magnitude is similar to the ideas of OGM [2] and PMR [3].
  2. This framework still does not explore the imbalance issue of multi-modal learning in more flexible task formats, such as the potential imbalance in tasks like AVQA and multi-modal generation. Expanding task formats to regression and segmentation tasks is only a minor improvement. Existing work can also be extended to these tasks with minor adjustments.

[1] Wang, H., Luo, S., Hu, G. and Zhang, J., 2024, March. Gradient-Guided Modality Decoupling for Missing-Modality Robustness. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 14, pp. 15483-15491).

[2] Peng, X., Wei, Y., Deng, A., Wang, D. and Hu, D., 2022. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8238-8247).

[3] Fan, Y., Xu, W., Wang, H., Wang, J. and Guo, S., 2023. PMR: Prototypical modal rebalance for multimodal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20029-20038).

Questions

The weaknesses above should be carefully considered.

Limitations

The authors have addressed the limitations.

Author Response

Thanks for your valuable time and constructive comments.

(W1): Difference and comparison with previous methods.

The differences between CGGM and [1] are:

  • [1] considers a fixed loss term for direction during the training process, while CGGM employs a dynamic loss term to balance the training. The fixed loss term in [1] may be problematic because the gradient directions of different modalities keep changing during training. For example, PMR also employs a regularization loss term, but this loss is only added in the first several epochs and needs to be deleted manually, according to the optimization process, to avoid performance damage. In contrast, CGGM employs an adaptive loss function with the balancing term to train the model; it dynamically adjusts the loss according to the utilization rate of each modality during the training process.
  • [1] only considers the direction and overlooks the impact of magnitude while CGGM considers both.
  • [1] aims to address missing modality issues with direction loss while CGGM aims to address the imbalanced multimodal learning.

We compare the direction loss in [1] with the loss in CGGM and present the results on IEMOCAP below.

| | baseline | Loss in [1] | Dynamic loss in CGGM |
| --- | --- | --- | --- |
| Accuracy | 70.7 | 71.6 | 73.3 |
| F1 score | 69.5 | 70.9 | 72.8 |

From the results, we can observe that our dynamic loss can adjust the directions of gradients better than the fixed loss term in [1].

Additionally, there are several important differences between CGGM and PMR:

  • PMR needs to calculate a prototype for each class. This indicates that PMR can only be applied to pure classification tasks. In contrast, CGGM can be applied to various tasks.
  • The whole PMR and its formulations are based on cross-entropy loss. In contrast, CGGM has no limitations for this problem.
  • PMR discusses the gradient direction, but it does not balance the multimodal learning from the perspective of direction explicitly. It proposes two loss terms for modal acceleration and regularization. PMR adds this regularization term in the first few epochs and needs to delete it manually to avoid performance damage. In contrast, CGGM can modulate magnitude and direction dynamically with the training process.

Besides, there are also several important differences between OGM and CGGM:

  • OGM is based on cross-entropy loss and cannot be applied to other tasks such as regression.
  • The balance term in OGM is designed for the two-modality situation: it calculates the discrepancy ratio between two modalities. However, when there are $M$ modalities, it needs to calculate $M$ ratios for each modality, and OGM does not consider how to use these $M^2$ ratios to balance the training. In contrast, CGGM first calculates the term individually for each modality and then combines the terms from different modalities to represent the utilization rate, indicating its universality.
  • OGM overlooks the influence of gradient direction.

[1] Wang, H., Luo, S., Hu, G. and Zhang, J., 2024, March. Gradient-Guided Modality Decoupling for Missing-Modality Robustness. In Proceedings of the AAAI Conference on Artificial Intelligence.

(W2): More flexible tasks.

To show the universality and superiority of CGGM, we conduct experiments on more flexible tasks: multimodal retrieval and video question answering. Specifically, we use the MSRVTT and MSRVTT-QA datasets. For the multimodal retrieval task, we use the common multimodal retrieval transformer MMT pre-trained on HowTo100M as the backbone and use the extracted features. There are seven modalities in the extracted MSRVTT features (motion, audio, scene, OCR, face, speech and appearance). For video QA, we use the pre-trained VALOR-B as the backbone. MMT and VALOR-B both consist of modality encoders and a multimodal fusion module or decoder for the downstream task, and we initialize the unimodal classifiers with the multimodal fusion module of the model. The results are shown below.

| | Text→Video R@5 (↑) | R@10 (↑) | MnR (↓) | Video→Text R@5 (↑) | R@10 (↑) | MnR (↓) |
| --- | --- | --- | --- | --- | --- | --- |
| None | 54.1 | 67.3 | 26.9 | 54.8 | 66.9 | 23.9 |
| CGGM | 59.2 | 70.2 | 21.2 | 59.6 | 69.8 | 20.1 |

| | None | CGGM |
| --- | --- | --- |
| MSRVTT-QA | 46.7 | 47.2 |

From the results, we can observe that even on these more flexible tasks, CGGM improves the performance of the baseline model. Because CGGM relies on the difference of evaluation metrics, it can be easily applied to these tasks, indicating its universality and effectiveness.

Thank you for your valuable feedback. We will incorporate these details in our paper.

Comment

Thanks to the authors for the explanations and supplementary experiments. Considering the opinions of the other reviewers and the contribution of this work, I have decided to raise the rating to Borderline Accept. I think the innovation in this work is a bit limited and that it is an incremental work.

Comment

Thank you for your considered review and feedback. We appreciate you raising the score and your thoughtful assessment of our work. We believe the insights and findings of our paper can make a meaningful impact and extend previous research in a way that advances the state of the art and universality. Please feel free to reach out if you have any other suggestions or questions.

Review
Rating: 7

This paper focuses on the notorious modality imbalance problem in multi-modal learning. To alleviate the imbalance, the proposed method modulates the gradient magnitude and direction simultaneously. Experiments on various multi-modal datasets demonstrate its effectiveness.

Strengths

  • This paper explores an interesting problem. In the current joint learning paradigm, the dominant modality overpowers the learning process and the resulting gradient prohibits further exploration of the features in the weak modality.
  • This paper is well-written and easy to follow.

Weaknesses

  • The motivation is not stated clearly. From Eq(3) and Eq(4), we can observe that the gradient magnitude can affect the update of a specific modality. However, the explanation of how the gradient direction between the specific modality and their fusion influences the modality update is unconvincing.
  • The experiments are not convincing. The authors ignore recent state-of-the-art methods, such as UMT/UME [1], QMF [2], and ReconBoost [3]. It is recommended that the authors compare these methods. Additionally, the authors should plot the gradient direction, accuracy curve, and gradient profile after using their method and compare these to Fig. 2 to better highlight the effectiveness of their approach.
  • The related work section lacks discussion on recent research. UMT [1] distills well-trained uni-modal features to assist multi-modal learning. QMF [2] provides a quality-aware multimodal fusion framework to mitigate the influence of low-quality multimodal data. ReconBoost [3] finds that the major issue arises from the current joint learning paradigm. They propose an alternating learning paradigm to fully harness the power of multi-modal learning.
  • Some notations are confusing. Please see the questions below.

For now, I recommend a borderline rating for this paper, leaning toward reject. If the concerns in the weaknesses can be addressed in the rebuttal phase, I am willing to raise my score and accept this paper.

[1] On Uni-Modal Feature Learning in Supervised Multi-Modal Learning. ICML 2023.

[2] Provable Dynamic Fusion for Low-Quality Multimodal Data. ICML2023

[3] ReconBoost: Boosting Can Achieve Modality Reconcilement. ICML 2024.

Questions

In Section 3.2, $\mathcal{L}$ denotes both the overall empirical loss and the loss of individual samples. It is recommended to denote the loss of individual samples as $\ell$.

Equations (3) and (4) are incorrect. The chain rule of differentiation for a scalar with respect to multiple vectors should be applied, rather than the chain rule for a scalar with respect to another scalar.

$$\frac{\partial z}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top}\cdot \frac{\partial z}{\partial y}$$

Limitations

None.

Author Response

Thanks for your valuable time and constructive comments.

(W1): The explanation of how the gradient direction between the specific modality and their fusion influences the modality update.

In Eq.(3) and Eq.(4), the parameter $\theta$ and the terms $\partial\mathcal{F}$, $\partial\Omega$, $\partial\phi$ are all vectors, not scalars. Therefore, these terms already include the direction information of the gradient. And just like the gradient magnitude, the gradient direction can also affect the update of modalities through Eq.(3) and Eq.(4).

Additionally, we can use the cosine similarity $\cos(g_i, g_j)$ to represent the direction relationship between two gradient vectors. As shown in Fig.2(c), $\cos(g_a, g_{mul}) < 0$, indicating that they optimize towards opposite directions, thus hindering the gradient update of the multimodal branch. When we apply CGGM, in Fig.1(c) in the PDF, $\cos(g_i, g_{mul}) > 0$ for all modalities, indicating that they optimize towards the same direction.
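The cosine similarities reported here can be computed from the gradients of two losses with respect to the same shared parameters; a minimal sketch follows (a hypothetical helper, not the exact code used for the figures).

```python
import torch
import torch.nn.functional as F

def grad_cosine(loss_a, loss_b, shared_params):
    """Cosine similarity between the gradients of two losses w.r.t. the same
    shared parameters (e.g. the fusion-layer parameters). Sketch only."""
    ga = torch.autograd.grad(loss_a, shared_params, retain_graph=True, allow_unused=True)
    gb = torch.autograd.grad(loss_b, shared_params, retain_graph=True, allow_unused=True)
    ga = torch.cat([g.flatten() for g in ga if g is not None])
    gb = torch.cat([g.flatten() for g in gb if g is not None])
    return F.cosine_similarity(ga, gb, dim=0).item()
```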

(W2): Experiments.

Thanks for the reminder. We will include these additional baselines in our paper. Since MOSI is a regression task where the label is a score and the loss function is the mean absolute error, methods such as QMF and ReconBoost, which require class distributions, cannot be applied directly. Additionally, we use a combination of Dice loss and entropy loss on the BraTS dataset in our paper, so we modified the setup slightly to fit these methods.

| | UPMC-Food 101 | MOSI | IEMOCAP | BraTS |
| --- | --- | --- | --- | --- |
| None | 90.3 | 81.2 | 70.7 | 69.2 |
| UMT | 91.8 | 81.8 | 70.8 | 69.5 |
| UME | 90.7 | 80.8 | 71.5 | 70.3 |
| QMF | 92.9 | - | 72.1 | 71.6 |
| ReconBoost | 92.5 | - | 73.1 | 71.8 |
| CGGM | 92.9 | 82.8 | 75.4 | 73.9 |

Meanwhile, we visualize the balancing process and the performance comparison in the PDF file which can be downloaded in the general response. Comparing the Fig.2 in the paper and Fig.1 in the additional PDF, we have several observations:

  • Performance: Compared with Fig.2 in the paper (around 0.7 after one epoch), the accuracy of the text modality does not increase very fast with CGGM (around 0.55 after one epoch), which indicates that CGGM imposes constraints on the dominant modality during the optimization process (see Fig.3 in the PDF). Besides, the accuracies of all the modalities and the fusion improve, indicating the effectiveness of CGGM.
  • Gradient magnitude: In Fig.2 in the paper, the dominant modality always has the largest gradient while in Fig.1 in the PDF, the gradient magnitude of the text modality decreases at first, indicating that CGGM slows down its optimization and accelerates other modalities' optimization, helping each modality learn sufficiently, thus improving the multimodal performance.
  • Gradient direction: In Fig.2 in the paper, $\cos(g_a, g_{mul}) < 0$ during the training process, indicating an opposite optimization direction between the unimodal and multimodal, thus hindering the optimization process. In Fig.1 in the PDF, we observe that $\cos(g_i, g) > 0$ for all modalities, indicating that all modalities have the same direction as the multimodal fusion.

(W3): Recent work.

We apologize for missing some recent work. We will add UMT/UME, QMF and ReconBoost methods in the related work section and incorporate the experimental comparisons with our methods.

(W4): Notations.

We appreciate your feedback on the notation. We will update the $\ell$ and the equations for a better presentation: $\left(\frac{\partial\mathcal{F}}{\partial\Omega}\right)^{\top}\frac{\partial \ell(\hat{y}^n, y^n)}{\partial\mathcal{F}}$.

Comment

Dear reviewer, a reminder to take a look at the author's rebuttal and other reviews. Did the rebuttal address your concerns?

Comment

Thanks for your efforts during the rebuttal phase. Most of my concerns have been addressed. I will raise my score to accept it.

Comment

Thank you for your valuable time and feedback. We are glad to hear that your initial concerns have been addressed through our rebuttal. We will incorporate these additional results and clarifications into our manuscript. We believe these modifications will make our manuscript more solid and convincing.

Comment

Dear Reviewer evvZ,

Thank you for your valuable time and comments on our manuscript. The rebuttal period is set to end soon, and we are looking forward to your feedback. During the rebuttal stage, we dedicated significant time and effort to address the concerns you raised in your initial review.

Please let us know if our response addresses your concerns and if you have any other questions or require any further information from us. We are happy to provide additional details or clarification as needed. We appreciate your time and consideration, and look forward to hearing back from you.

Best regards

Review
Rating: 4

This paper proposes CGGM, a novel strategy to balance the multimodal training process. Compared with existing methods, it can deal with the unbalanced multimodal learning problem with different optimizers, different tasks, and more than two modalities.

Strengths

The motivation is sufficient and the experiments on different tasks and datasets prove that the proposed method solves the problem well.

CGGM stands out by considering both the magnitude and direction of gradients for balancing multimodal learning. This combined approach effectively addresses the modality competition problem and ensures that all modalities contribute equally to the model’s performance.

Weaknesses

  1. The reviewer is curious about the computational complexity of the additional classifier or decoder. Is there any experimental result?

  2. As presented in Line 149, the classifier $f_i$ consists of 1-2 multi-head self-attention (MSA) layers and a fully connected layer for classification and regression tasks. Does this apply to all models or classification tasks? Why is it set up like this? Why not just use the same classifier structure as the multimodal head?

  3. What's the light decoder used for segmentation tasks?

  4. The introduction of unimodal classification may limit the learning of multimodal tasks, such as gradient conflicts. How to deal with this problem?

  5. PMR has also discussed the problem of gradient direction and introduced unimodal loss to assist multimodal learning. What's the difference between CGGM and PMR?

  6. The reviewer is concerned about the accuracy of using the difference between two consecutive $\epsilon$ to denote the modality-specific improvement for each iteration. In my experience, the loss of the dominant modality will quickly drop to the magnitude of 1e-2 to 1e-3, while the magnitude of the weak modality stays around 1e-1. At this point, the loss change of the weak modality will be larger, and according to the authors it would then be regarded as the dominant modality.

  7. What's the performance of the proposed method on the CREMA-D and AVE datasets? They are also widely used in previous studies.

Questions

See the weaknesses above.

Limitations

No

Author Response

Thanks for your valuable time and constructive comments.

(W1): Computational complexity of the additional classifiers and experimental results.

The additional classifiers will need more computational resources during training. However, during inference, the classifiers will be discarded. Therefore, they have no impact during the inference stage. We report the memory cost (MB) of the additional classifiers in the table below.

| Setting | Food101 | MOSI | IEMOCAP | BraTS |
| --- | --- | --- | --- | --- |
| With classifiers | 8846 | 3902 | 4446 | 18072 |
| Without classifiers | 8838 | 3894 | 4438 | 18048 |

From the table, we can observe that the additional computational increase is low. There are two main reasons: (1) the classifiers or decoders are light with only a few parameters; (2) the classifiers only use the gradients to update themselves and do not pass the gradients to the modality encoders during backpropagation. Therefore, there is no need to store the gradient for each parameter, thus reducing memory cost.

(W2): About the additional classifiers setting.

Initially, we set the classifier the same as the multimodal head. However, according to the experiments, we observe that there is little performance gain compared with the classifiers with only a few layers. The results are shown below.

| Classifier | $f_1$ | $f_2$ | $f_3$ | Multimodal |
| --- | --- | --- | --- | --- |
| multimodal head | 55.1 | 56.2 | 67.5 | 75.6 |
| CGGM (1-2 layers) | 54.9 | 56.8 | 67.4 | 75.4 |

In the table, $f_1, f_2, f_3$ denote the accuracies of the three modality classifiers. As we can see, more layers do not bring much improvement to the model and add additional computational resources and parameters. From another perspective, the function of the classifiers in our paper is to reflect the utilization rates of the different modalities, not to maximize their own accuracies. Therefore, as long as these classifiers can reflect the utilization rate of each modality, we can design them to be as light as possible. Additionally, as discussed in W1, the additional classifiers or decoders do not bring much computational cost.
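For concreteness, a light head of the kind described above (1-2 MSA layers followed by a fully connected layer) could be sketched as follows; the feature dimension, head count, layer count, and class count are illustrative assumptions.

```python
import torch.nn as nn

class LightUnimodalHead(nn.Module):
    """Sketch of a light per-modality classifier: 1-2 self-attention layers + FC."""
    def __init__(self, dim=256, num_heads=4, num_layers=2, num_classes=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.msa = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):            # tokens: (batch, seq_len, dim)
        x = self.msa(tokens)
        return self.fc(x.mean(dim=1))     # pool over the sequence, then classify
```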

(W3): Light decoder in segmentation task.

In PyTorch, we implement the decoder with two convolutional layers, concatenating the low-level features generated by the encoders and using interpolation functions for upsampling.
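Based on this description, a minimal 2D sketch of such a decoder is shown below; the channel sizes are assumptions, and the 3D BraTS case would use the corresponding 3D layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Sketch of a light segmentation decoder: two conv layers, low-level feature
    concatenation, and interpolation for upsampling."""
    def __init__(self, high_ch=256, low_ch=64, num_classes=3):
        super().__init__()
        self.conv1 = nn.Conv2d(high_ch + low_ch, 128, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, high_feat, low_feat, out_size):
        # upsample high-level features to the low-level resolution, then concatenate
        x = F.interpolate(high_feat, size=low_feat.shape[-2:], mode='bilinear',
                          align_corners=False)
        x = torch.cat([x, low_feat], dim=1)
        x = F.relu(self.conv1(x))
        x = self.conv2(x)
        # interpolate to the output (input image) resolution
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
```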

(W4): Unimodal branch may limit the learning of multimodal tasks. How to deal with this?

The unimodal classifiers have no influence on the multimodal tasks because the unimodal classifiers only use the independent loss function to update themselves and do not pass the gradients to the multimodal branch. The reason is that the classifier is designed to reflect the utilization of each modality. Besides, this design can reduce the computational resources because there is no need to store the gradient for each parameter.

(W5): What's the difference between CGGM and PMR?

There are several important differences between CGGM and PMR:

  • PMR needs to calculate a prototype for each class. This indicates that PMR can only be applied to pure classification tasks. In contrast, CGGM can be applied to various tasks.
  • The whole PMR and its formulations are based on cross-entropy loss. In contrast, CGGM has no limitations for this problem.
  • PMR discusses the gradient direction, but it does not balance the multimodal learning from the perspective of direction explicitly. It proposes two loss terms for modal acceleration and regularization. PMR adds this regularization term in the first several epochs and needs to delete it manually to avoid performance damage. In contrast, CGGM can modulate magnitude and direction dynamically with the training process.

(W6): Concerns about two consecutive $\epsilon$

Yes. In the initial iterations, if no constraints are added, the loss of the dominant modality drops quickly in our experiments (see Fig.2(a) in the additional PDF in the general response). However, our method prevents this. From iteration 0, CGGM calculates $\epsilon$, which serves as the constraint on the dominant modality. If the dominant modality shows much larger improvements than the weak modalities, CGGM modulates its gradients to slow down its update and accelerate the weak modalities' updates. Therefore, after CGGM is applied, the loss of the dominant modality no longer drops quickly (see Fig.2(b) in the additional PDF: the loss of text drops much more slowly than in (a)). Besides, CGGM calculates the balancing term dynamically during the training process according to the optimization state. Therefore, although text is the dominant modality of the task overall, during training the "dominant modality" keeps changing according to the dynamic balancing term in CGGM: at some point text is the dominant modality, and at another point audio may become dominant (see Fig.3 in the PDF). We can also reach this conclusion by comparing the accuracy change in Fig.1 in the PDF and Fig.2 in the paper.

(W7): Performance on CREMA-D and AVE

We conduct these experiments for more evaluations of CGGM. For backbones, we use ResNet as encoders. For both datasets, we adopt SGD as optimizers. For additional classifiers, we use an MLP layer. The results are shown below.

| | CREMA-D | AVE |
| --- | --- | --- |
| None | 61.5 | 64.8 |
| CGGM | 79.2 | 73.6 |

Thanks for your suggestion. We will incorporate these details in our paper.

Comment

Dear reviewer, a reminder to take a look at the author's rebuttal and other reviews. Did the rebuttal address your concerns?

Comment

Dear Reviewer S6e7,

Thank you for your valuable time and comments on our manuscript. The rebuttal period is set to end soon, and we are looking forward to your feedback. During the rebuttal stage, we dedicated significant time and effort to address the concerns you raised in your initial review.

Please let us know if our response addresses your concerns and if you have any other questions or require any further information from us. We are happy to provide additional details or clarification as needed. We appreciate your time and consideration, and look forward to hearing back from you.

Best regards

Comment

Dear Reviewer S6e7,

Thank you again for your valuable time and comments. The rebuttal period is set to end today, and we are looking forward to your feedback. During the rebuttal stage, we dedicated significant time and effort to address the concerns you raised in your initial review. Please feel free to reach out if you have any other questions. We are happy to provide additional details or clarification as needed.

Author Response

We sincerely thank all reviewers for their great effort and constructive comments on our manuscript. During the rebuttal period, we have been focusing on these beneficial suggestions from the reviewers and doing our best to add several experiments and revise our manuscript.

According to the reviewers' suggestions, we have included more recent baselines and more flexible tasks, which demonstrates the effectiveness and universality of CGGM. Besides, we have included several visualizations for a deeper analysis of CGGM. These visualizations include:

  • Visualization of the changes in losses under four different scenarios.
  • Visualization of the changes in the balancing term.
  • Visualization of the changes in performance, gradient size, and direction.

These visualizations can be downloaded with this response. We believe these additions will make our experiments and methods more comprehensive and insightful. Additionally, we have provided further clarifications about the setting of the classifiers, the additional computational resources required by the classifiers, the differences between CGGM and other methods, as well as the theoretical analysis of CGGM. These modifications aim to make our manuscript more solid and convincing.

Final Decision

This paper was reviewed by four experts in the field. The paper received mixed reviews of Borderline Reject, Accept, Borderline Accept, and Borderline Accept after the rebuttal. During the rebuttal, two reviewers (evvZ, hVC3) raised their scores and the other two reviewers did not respond to the rebuttal.

The AC carefully read the paper, the reviews, and the authors' responses. Reviewer S6e7's concern is mainly about clarification of technical details. The AC believes that the authors' rebuttal should address the questions. Therefore, the decision is to recommend the paper for acceptance. The reviewers did raise some valuable suggestions in the discussion that should be incorporated in the final camera-ready version of the paper. The authors are encouraged to make the necessary changes to the best of their ability.

Why not higher: Many concerns were raised by reviewers. The authors' rebuttal needs to be incorporated into the final paper to improve the work.