Probabilistic Conformal Distillation for Enhancing Missing Modality Robustness
We propose a PCD method to handle the missing-modality problem, which transfers privileged information from the modality-complete representation by considering the indeterminacy in the mapping from incompleteness to completeness.
Abstract
Reviews and Discussion
By formulating the missing-modality representation as a probability distribution, this work proposes probabilistic conformal distillation (PCD), an objective function that encourages (1) consistent latent representations of multimodal embeddings for data points in the same class and (2) geometric consistency of the inter-class latent representations. Simulations show the empirical success of this strategy in improving the missing-modality robustness of student models.
Strengths
- The overall structure of the paper is well organized and easy to follow.
- The proposed loss function (PCD) intuitively makes sense.
- Simulations show a clear advantage for the proposed strategy in classification.
Weaknesses
- My major concern about PCD lies in its potential use cases, for two reasons. First, the whole framework relies on a definition of positive and negative points, which is not obvious for many multi-modal learning problems (for example, cross-modal retrieval without class labels, missing-modality imputation, etc.). At this point, the paper has only demonstrated its advantage in classification (I will talk about segmentation later). Second, the scalability of the proposed approach is a minor concern, mainly due to the loss L_g. As this loss seems to scale quadratically with batch size, the time and memory complexity of PCD needs to be formally analyzed.
- For segmentation, the improvements seem marginal at best; therefore, it is important to provide error bars to validate whether there is any statistically significant difference.
- Still for segmentation, it is mentioned in line 134 that the positive group for each sample only contains itself. In this case, L_u, if I understand correctly, just encourages the latent representations of all samples to be distinct from each other, and I do not see the purpose of L_g here, as there is nothing special about the inter-class geometry at this point. It would be important for the authors to conduct a simulation similar to Table 2, but for segmentation, to see whether the inclusion of these losses makes any difference (especially given that the segmentation simulations only show marginal improvements).
- There are some writing issues that need to be addressed. For example, equation (1) is about the mode of a distribution, while it is described as probabilistic peak expectation in line 118. Please note that there is a rich literature on modal regression, and these two concepts should not be confused. There is also a value inconsistency in Table 1. Notice that in the first column of the CASIA-SURF results, the difference between ETMC and PCD is 7.91 - 7.23 = 0.68, while the improvement is marked as 0.74.
Questions
See weakness
Limitations
No
Q1: First, the whole framework relies on a definition of positive and negative points, which is not obvious for many multi-modal learning problems (for example, cross-modal retrieval, missing-modality imputation, etc.). Second, as the loss seems to scale quadratically with batch size, the time and memory complexity of PCD needs to be formally analyzed.
- We are sorry that there may be some misunderstanding about the mission of PCD. PCD focuses on robust multimodal fusion and requires that the maximum number of input modalities be greater than 1. Therefore, mainstream cross-modal retrieval and generation tasks, which typically have only a single input modality, are not suitable for the missing-modality problem explored by PCD. As for the recently popular multimodal retrieval task, where the query text prompts modifications to the query image, it becomes invalid if any input modality is missing, so it does not align with the problem that PCD aims to solve. Besides, although PCD relies on the definition of positive and negative points, class labels are not necessary. When class labels are available (e.g., in classification tasks), modality-complete representations sharing the same class with the modality-missing input are considered positive points, while those from different classes are negative points. When class labels are unavailable (e.g., in segmentation tasks), only the modality-complete representation of the same sample is considered positive, and all others are considered negative, as we have set up in the segmentation task.
- To address the concern about computational cost, we measured the memory consumption and the time per iteration of PCD and the other three methods with the same batch size (64) on CeFA. It can be seen that PCD does not significantly increase training time or memory consumption.
| Method | MD | RAML | MMANet | PCD |
|---|---|---|---|---|
| Memory (G) | 4.371 | 3.285 | 3.621 | 3.809 |
| Time per iteration (s) | 0.155 | 0.104 | 0.106 | 0.174 |
In addition, we explore the relationship between batch size, memory consumption, and computation time. The results indicate that a larger batch size is not always better; the optimal batch size (64) has a relatively low computational cost.
| Batch Size | 32 | 64 | 128 | 192 |
|---|---|---|---|---|
| Memory (G) | 3.145 | 3.809 | 5.739 | 8.081 |
| Time per iteration (s) | 0.151 | 0.174 | 0.195 | 0.268 |
| Avg ACER (%) | 22.70 | 22.63 | 23.79 | 25.16 |
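For completeness, a minimal sketch of how such per-iteration time and peak GPU memory could be collected in PyTorch is shown below; the model, loader, and loss are placeholders rather than the actual PCD training code.

```python
import time
import torch

def profile_training(model, loader, criterion, optimizer, device="cuda"):
    """Rough per-iteration time and peak GPU memory for one pass over the loader."""
    model.train()
    torch.cuda.reset_peak_memory_stats(device)
    iter_times = []
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        torch.cuda.synchronize(device)
        start = time.time()
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize(device)
        iter_times.append(time.time() - start)
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return sum(iter_times) / len(iter_times), peak_gb
```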
Q2: For segmentation, the improvements seem marginal at best, therefore, it is important to provide error bars to validate whether there is any statistically significant difference.
In Table 6 in the Appendix, we detail the stability experiments for PCD across four datasets. Each experiment is repeated three times, allowing us to report the average score along with the standard deviation. The results reveal that, even in its worst case, PCD outperforms the best competing methods. These outcomes not only underscore PCD's superior performance but also attest to its stability and consistency across a wide range of segmentation testing conditions. In addition, to further confirm the effectiveness of PCD on segmentation tasks, we conduct experiments on a larger dataset, SUN RGB-D [1], which contains 5,285 RGB-Depth pairs for training and 5,050 pairs for testing. The results are shown below. We can see that PCD is effective even on this larger segmentation dataset.
| Methods | {R} | {D} | {R,D} | Avg |
|---|---|---|---|---|
| Separate Model | 43.94 | 39.81 | 47.84 | 43.86 |
| MMANET | 44.73 | 39.94 | 47.54 | 44.07 |
| PCD | | | | |
| Improvement | 0.90 | 1.49 | -0.30 | 0.68 |
Q3: For segmentation, L_u just encourages the latent representations of all samples to be distinctive from each other, and I do not see the purpose of L_g, as there is nothing special about the inter-class geometry. It would be important to conduct ablation studies for segmentation to see whether the inclusion of these losses makes any difference.
We are sorry for the misunderstanding caused by the lack of clarity. PCD aims to model different modality-missing representations as distinct distributions to fit their unknown PDFs. This is realized by considering properties of positive and negative points on the modeled distribution. It is important to note that all positive and negative points are generated by a pretrained modality-complete model and remain fixed during training. In a segmentation task, for a modality-missing input, there is only one positive point for the PDF of the input's mapping variable in the modality-complete space, namely, its corresponding modality-complete representation. Here, the negative points are the modality-complete representations of all other samples. The loss $L_u$ is used to maximize the probability of the modeled distribution at the positive point and minimize it at the negative points. This implicitly encourages the mean of the distribution to be closer to the positive point and further away from the negative points. Besides, in $L_g$, the structures are represented by calculating the distances among the peak points of the modality-missing distributions and among the modality-complete representations, respectively, where $z^{\star}$ is no longer divided into positive and negative sets. The complete ablation study is shown in Table 7 in the Appendix (the same as Table 1 in the attached pdf), which verifies the effectiveness of $L_g$.
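Assuming, as in the paper's setup, that the modeled modality-missing distribution is a Gaussian with mean $\mu_i$ (we write $q_i$ for its density and $\sigma^2$ for its variance as placeholder notation for this response), the implicit effect on the mean can be read off directly:

$$
q_i(z) \propto \exp\!\Big(-\tfrac{\lVert z-\mu_i\rVert^2}{2\sigma^2}\Big)
\;\;\Longrightarrow\;\;
\text{maximizing } q_i(z^{\star}_i) \text{ pulls } \mu_i \text{ toward } z^{\star}_i,\;
\text{while minimizing } q_i(z^{\star}_b),\, b\neq i, \text{ pushes } \mu_i \text{ away from } z^{\star}_b.
$$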
Q4: There are some writing issues that need to be addressed. Equation (1) is about the mode of a distribution, while it is described as probabilistic peak expectation in line 118. There is also a value inconsistency in Table 1.
We really appreciate the reviewer's detailed review and will carefully proofread the whole submission to fix the imprecise expressions and typos. Specifically, in the revision, we will change the description in line 118 to the peak of the probability and recalculate the values in Table 1.
Thank you for the response.
I think the minor point in Q2 has been properly addressed, although the main issues regarding segmentation and other potential cross-modal learning tasks still remain.
Q1: First, I am not sure why the author talks about the number of input modalities, as cross-modal retrieval and cross-modal imputation are not limited to one input modality. In fact, most cases, like cross-modal imputation for sensory data, include a large set of input modalities. Second, I think the author reaffirms my concern, which is that when there is no clear label information, the notion of inter-class geometry is basically no longer obvious (in the sense that n points will simply become n distinct classes).
Q3: I am confused by the response explaining L_g. The authors say that z* is no longer divided into positive and negative sets, but the loss function L_g is actually aligning the geometric vector g_i with the geometric vectors of other points in the positive group G_p, right? Why do the authors say that there are no positive and negative points anymore? That said, the new ablation study on L_g kind of addressed my concern, conditioned on the validity of the results.
To summarize, my concern regarding the intuition of PCD for groupless tasks remains (which in my opinion is not well explained or well supported theoretically); simulation results have shown that it does provide tangible benefits, and the inter-class geometry loss L_g has been evaluated to be useful in groupless tasks. However, the response from the authors regarding the intuition behind the mechanism of L_g further confused me. The mathematical motivation and empirical results are disjoint in the context of groupless tasks. Therefore, I would like to keep my score at this moment.
Q1: First, I am not sure why the author talks about the number of input modalities, as cross-modal retrieval and cross-modal imputation are not limited to one input modality. In fact, most cases, like cross-modal imputation for sensory data, include a large set of input modalities. Second, I think the author reaffirms my concern, which is that when there is no clear label information, the notion of inter-class geometry is basically no longer obvious (in the sense that n points will simply become n distinct classes).
A1: Thank you for the follow-up discussion. We will address the concerns one by one.
First, we sincerely apologize for misunderstanding the modality number in cross-modal tasks. Now we have a better understanding of the reviewer's meaning and agree with the possibility of conducting experiments on these tasks. However, due to time constraints, we have to admit that we cannot finish the experiments before the deadline (Aug 13, AoE). We promise that we will report the experimental results of these tasks as soon as possible and include them in the revision. Here, we can only provide a brief overview of the implementation: if the task has labels, the implementation of PCD will be similar to that of the classification task; without labels, it will resemble the implementation of the segmentation task.
Second, regarding the reviewer's concern, we would like to explain that it is not a problem, since the algorithm is by default compatible with both the label-aware and label-free settings. A similar example regarding contrastive learning [1] and supervised contrastive learning [2] may help clarify this. In contrastive learning, one sample is augmented into two views, and the representations of the two views are pulled together and pushed away from the representations of other samples. In supervised contrastive learning, the representations of samples from the same class are pulled together and pushed away from the representations of samples from other classes. Generally, if label information is available, supervised contrastive learning performs better than contrastive learning, which to some extent aligns with the reviewer's opinion.
The difference between the contrastive learning-based losses of PCD in the classification and segmentation tasks is analogous to that between supervised contrastive learning and contrastive learning: the former considers all modality-complete representations $z^{\star}$ sharing the same class as the input as positive samples, whereas the latter uses only the $z^{\star}$ from the same instance as the positive sample. There is no concept of inter-class geometry in PCD; the calculation of the geometric vectors is label-independent, and the significance of $L_g$ remains valid in segmentation, as described in A3.
[1] He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 9729-9738.
[2] Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning[J]. Advances in neural information processing systems, 2020, 33: 18661-18673.
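To make the analogy concrete, here is a minimal, hypothetical sketch of how the positive set could be selected in the two settings; the function and argument names are ours, so it is illustrative only and not the paper's code.

```python
import torch

def positive_mask(batch_size, labels=None):
    """Boolean mask M where M[i, j] is True if z*_j is a positive point for sample i.
    With labels: same-class samples are positives (supervised-contrastive style).
    Without labels: only the sample's own modality-complete representation is positive."""
    if labels is None:
        return torch.eye(batch_size, dtype=torch.bool)
    labels = labels.view(-1, 1)
    return labels == labels.t()  # broadcasting yields a (B, B) same-class mask
```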
Q3: I am confused by the response explaining L_g. The authors say that z* is no longer divided into positive and negative sets, but the loss function L_g is actually aligning the geometric vector g_i with the geometric vectors of other points in the positive group G_p, right? Why do the authors say that there are no positive and negative points anymore? That said, the new ablation study on L_g kind of addressed my concern, conditioned on the validity of the results.
A3: We apologize for the potentially unclear description. Here, we re-explain the computation procedure of $L_g$ for the segmentation task based on its equation. First, we list some important notations in the following table.
| Notation | Explanation |
|---|---|
| $x_i$ | The modality-missing input for sample $i$. |
| $\mu_i$ | The mean value of the modeled modality-missing distribution for $x_i$. |
| $z^{\star}_i$ | The representation of the modality-complete input corresponding to $x_i$. |
| $g_i$ | The modality-missing geometric vector with $\mu_i$ as the core, computed across all $\mu_b$ in the batch. |
| $g^{\star}_i$ | The positive modality-complete geometric vector with $z^{\star}_i$ as the core, computed across all $z^{\star}_b$ in the batch. |
| $g^{\star}_j$ | The negative modality-complete geometric vector with $z^{\star}_j$ as the core, computed across all $z^{\star}_b$ in the batch. |
| $L_g$ | The contrastive learning-based loss about $g^{\star}_i$ and $g_i$. |
The term $L_g$ aims to align $g_i$ with its single positive modality-complete counterpart $g^{\star}_i$ in a contrastive learning-based manner, as expressed by the following equation:
$$
L_g = -\log s(g^{\star}_i, g_i) = -\log \frac{\exp\big(\beta(g^{\star}_i, g_i)/\tau\big)}{\exp\big(\beta(g^{\star}_i, g_i)/\tau\big) + \exp\big(\beta(g^{\star}_j, g_i)/\tau\big)},
$$
where $\beta(\cdot,\cdot)$ calculates the cosine similarity between two geometric vectors and $\tau$ is the temperature coefficient. It is worth noting that in segmentation, due to the high dimensionality of the multimodal features, only one negative vector $g^{\star}_j$ is selected to conserve computational resources (this is what the earlier claim about positive and negative vectors in $L_g$ refers to). Although the only positive vector is $g^{\star}_i$, $L_g$ still contributes to the conformal relationship between the peak points of the modality-missing distributions and the modality-complete representations, thanks to the alignment between $g^{\star}_i$ and $g_i$. We obtain $g^{\star}_i$ and $g^{\star}_j$ from the distances among the modality-complete representations $z^{\star}_b$, and $g_i$ from the distances among the means $\mu_b$, namely:
$$
g^{\star}_i(b) = \alpha(z^{\star}_i, z^{\star}_b), \qquad g^{\star}_j(b) = \alpha(z^{\star}_j, z^{\star}_b), \qquad g_i(b) = \alpha(\mu_i, \mu_b),
$$
where $g^{\star}_i$, $g^{\star}_j$, and $g_i$ are $B$-dimensional vectors with $z^{\star}_i$, $z^{\star}_j$, and $\mu_i$ as their cores, respectively, and $B$ is the batch size. Theoretically, $\alpha(\cdot,\cdot)$ can be any formula for calculating the distance between vectors. Notice that these geometric vectors are computed across all samples in the batch, without distinguishing between positive and negative points (this is what the earlier claim about not dividing $z^{\star}$ into positive and negative sets refers to). The ablation studies on NYUv2 and Cityscapes in Table 7 in the Appendix validate the effectiveness of $L_g$ in segmentation tasks. Specifically, PCD with all loss components outperforms the two ablated variants by an average of 1.37% and 0.86%, respectively.
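For readers who prefer code, a minimal sketch of this computation for one sample $i$ is given below; it merely illustrates the two equations above, taking $\alpha$ as the Euclidean distance (one admissible choice) and a single, arbitrarily chosen negative index $j$, and is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def geometric_vector(core, points):
    """alpha(core, points[b]) for every b in the batch; Euclidean distance as one choice of alpha."""
    return torch.norm(points - core.unsqueeze(0), dim=1)  # shape (B,)

def l_g_single(mu, z_star, i, j, tau=0.1):
    """L_g for sample i with a single negative core j, following the displayed equation.
    mu, z_star: (B, D) tensors of modality-missing means and modality-complete representations."""
    g_i      = geometric_vector(mu[i], mu)            # modality-missing geometric vector
    g_star_i = geometric_vector(z_star[i], z_star)    # positive modality-complete geometric vector
    g_star_j = geometric_vector(z_star[j], z_star)    # negative modality-complete geometric vector
    pos = F.cosine_similarity(g_star_i, g_i, dim=0) / tau  # beta(., .) as cosine similarity
    neg = F.cosine_similarity(g_star_j, g_i, dim=0) / tau
    return -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(neg)))
```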
We appreciate the reviewer's challenges regarding the clarity of this part and will carefully improve it with the notation table and a more detailed explanation in the revision. All of the reviewer's advice will be incorporated to improve the submission.
Dear Reviewer UGMA,
As the deadline is approaching, we would greatly appreciate it if you could check our responses at your earliest convenience. Your feedback is invaluable to us, and we want to ensure that we have addressed any remaining concerns promptly.
Thank you so much for your time and consideration.
Best,
Authors of Paper 14962.
This paper studies the missing-modality robustness problem in multimodal training by introducing probabilistic conformal distillation to address the stringent determinate alignment problem given the irreparable information asymmetry. Specifically, PCD aligns the extremum of the distribution while maintaining geometric consistency among modality relations. With extensive experiments, the authors demonstrate the consistent improvement of PCD over current state-of-the-art methods on several benchmarks.
Strengths
(1) The idea of probabilistic conformal distillation is interesting, since brute-force distillation from the complete modality to a missing modality is usually ill-posed given the information asymmetry. The authors designed a mild way to simultaneously propagate task-relevant information and avoid overfitting.
(2) The instantiation of the probability extremum and geometric consistency for PCD is elegant due to its simplicity and minimal extra implementation cost. This makes the proposed method easily extendable to different missing-modality cases, as shown in the experimental parts, i.e., segmentation and classification.
(3) The writing is easy to follow, and the experiments on well-used benchmarks show promising improvements over current state-of-the-art methods. A range of experiments for ablation and further analysis confirms the superiority of PCD.
Weaknesses
Although the submission is overall good, there are still some minor concerns that need to be validated and addressed, which I summarize as follows.
(1) Some equations are misleading. For example, in Eq. (2), it is not clear whether the authors mean that the accumulative probability of $z_p^{\star}$ is larger than that of $z_n^{\star}$, or exactly mean each $z_p^{\star}$ versus each $z_n^{\star}$. A similar case happens in Eq. (3). I think the authors should clarify this part due to the intrinsic difference and different implications. Besides, the similarity measure $s(\cdot, \cdot)$ cannot be any metric, since it occurs inside the log operation (Eq. (5)) and should satisfy a positivity constraint.
(2) It is not clear why Eq. (7) and Eq. (9) share the coefficient $\lambda$ in Eq. (10). Is this optimization stable? Are they on a similar scale in terms of loss range? At the least, the authors should give some evidence or some ablations to show the rationality of the shared $\lambda$ in Eq. (10).
(3) It seems that the improvements of the experiments in Table 1 for NYUv2 and Cityscapes are smaller than one. It would be better to run these experiments with multi-round trials and report the std for statistical significance.
Overall, I think the idea of probabilistic conformal distillation is an interesting and promising way to combat the brute-force alignment between missing and complete modalities during robust training. Carefully improving the submission by considering the above weaknesses will make it more convincing.
Questions
see weakness.
Limitations
The limitations are discussed in the paper.
Thank you for your time devoted to reviewing this paper and your constructive suggestions.
Q1: Some equations are misleading. For example, in Eq.(2), it is not clear whether the authors mean that the accumulative probability of $z_p^{\star}$ is larger than that of $z_n^{\star}$, or exactly mean each $z_p^{\star}$ versus each $z_n^{\star}$. A similar case happens in Eq.(3). I think the authors should clarify this part due to the intrinsic difference and different implications. Besides, the similarity measure $s(\cdot,\cdot)$ cannot be any metric, since it occurs inside the log operation (Eq.(5)) and should satisfy a positivity constraint.
A1: Eq.(2) requires that the probability of any single positive point $z_p^{\star}$ is larger than the probabilities of all negative points $z_n^{\star}$. Eq.(3) has a similar meaning, namely, for any pair of $z_p^{\star}$ and $z_n^{\star}$, the inequality should hold. As for $s(\cdot,\cdot)$, thank you for pointing out the areas where our presentation is unclear.
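In symbols, writing $q_i(\cdot)$ for the modeled PDF of sample $i$ and $G_p$, $G_n$ for the positive and negative sets (placeholder notation for this response, not necessarily the paper's), the intended reading is a pairwise, pointwise inequality rather than a comparison of accumulated probabilities:

$$
\forall\, z_p^{\star} \in G_p,\ \forall\, z_n^{\star} \in G_n:\qquad q_i(z_p^{\star}) > q_i(z_n^{\star}).
$$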
Q2: It is not clear why Eq. (7) and Eq. (9) share the coefficient $\lambda$ in Eq. (10). Is this optimization stable? Are they on a similar scale in terms of loss range? At least, the authors should give some evidence or some ablations to show the rationality of the shared $\lambda$ in Eq. (10).
A2: Eq.(7) and Eq.(9) are realizations of the Probability Extremum and Geometric Consistency components of Eq.(5), respectively, and Eq.(5) is a simplification of the objective in Eq.(4). In this simplification process, no distinct coefficients are generated for the two terms in Eq.(5); thus we share the coefficient $\lambda$ between Eq.(7) and Eq.(9) in Eq.(10). Besides, we also conduct experiments on CeFA to explore the impact of using different coefficients. The results are shown below. As can be seen, setting different hyperparameters does not cause significant performance fluctuations.
| Coefficient of Eq.(7) | 1.4 | 1.6 | 1.8 | 2.0 | 2.2 | Best Baseline |
|---|---|---|---|---|---|---|
| Avg ACER (%) | 23.36 | 24.69 | 22.63 | 23.85 | 22.56 | 27.94 |

| Coefficient of Eq.(9) | 1.4 | 1.6 | 1.8 | 2.0 | 2.2 | Best Baseline |
|---|---|---|---|---|---|---|
| Avg ACER (%) | 22.10 | 22.99 | 22.63 | 23.41 | 22.15 | 27.94 |
Q3: It seems that the improvements of the experiments in Table 1 for NYUv2 and Cityscapes are smaller than one. It would be better to run these experiments with multi-round trials and report the std for statistical significance.
A3: In Table 6 in the Appendix, we detail the stability experiments for PCD across four datasets. Each experiment is repeated three times, allowing us to report the average score along with the standard deviation. The results reveal that, even in its worst case, PCD outperforms the best competing methods. These outcomes not only underscore PCD's superior performance but also attest to its stability and consistency across a wide range of segmentation testing conditions. In addition, to further confirm the effectiveness of PCD on segmentation tasks, we conduct experiments on a larger dataset, SUN RGB-D [1], which contains 5,285 RGB-Depth pairs for training and 5,050 pairs for testing. The results are shown below. We can see that PCD is effective even on this larger segmentation dataset.
| Methods | {R} | {D} | {R,D} | Avg |
|---|---|---|---|---|
| Separate Model | 43.94 | 39.81 | 47.84 | 43.86 |
| MMANET | 44.73 | 39.94 | 47.54 | 44.07 |
| PCD | | | | |
| Improvement | 0.90 | 1.49 | -0.30 | 0.68 |
I read the replies, which addressed most of my concerns. I would like to raise my score.
Dear Reviewer EqMG,
We sincerely appreciate you taking the time to review our responses and contributing to improving this paper. We will carefully follow the reviewer's advice to incorporate all the addressed points with additional exploration in the updated version. Thank you once again for your dedicated and valuable contribution in reviewing our paper!
Best,
Authors of Paper 14962.
Summary: This paper studies robustness under missing-modality scenarios. In multi-modal learning, missing modality is a very common problem that can hinder the performance of many existing strategies. The authors assume that the modalities' information redundancy could help learning under missing modalities. Specifically, for a missing modality, a probability distribution can be formed as an estimate of the real modality value. To leverage this assumption, two learning properties are proposed, namely the extremum property and the conformal property. Based on these two properties, the authors estimate the potential modality distribution as a Gaussian distribution; data points with complete representations are considered positive points, and negative points otherwise. By encouraging the probability extremum objective and the geometric consistency objective, the multimodal learning framework can be successfully formulated. Through extensive experiments, the effectiveness of the proposed method is carefully evaluated and justified.
Strengths
Strengths:
- This paper studies a very interesting research topic and could have a potential impact in the field of multimodal learning, as well as for realistic application purposes.
- The proposed solution is technically solid and novel, which is a good contribution. It would be better if a theoretical framework were proposed to further justify the proposed method.
- The experiments are extensive and sufficient. Both quantitative comparisons with many recent baseline methods and detailed analyses are provided, such as an ablation study on different modules, analysis of the knowledge distillation strategy, hyperparameter sensitivity analysis, and computational overhead. The evaluation is conducted on many well-known datasets, which makes the results convincing.
- The performance improvement is promising.
Weaknesses
Weaknesses:
- There is some abuse of notation, and several notations are not clearly defined, which makes the reading a bit hard. Moreover, the writing should be further improved. There is no general logic in formulating the methodology; the authors mostly just plainly demonstrate what is done.
- A stronger motivation should be provided to justify the two proposed properties. When introducing the two properties, it is suggested to first identify the key problem in learning with missing modalities. For example, with some empirical evidence that missing modalities still follow a data distribution similar to that of the complete ones, it would be reasonable to propose the probability extremum property and further design the learning objective. In future versions, addressing this part could further improve the quality and readability of this paper.
- In the experiments, how did the authors control the missing rate of modalities? If some modalities of some instances are missing, the missing rate could significantly affect the final learning results. Moreover, the sampling strategy could be essential: how did the authors sample missing modalities? Is the sampling random or uniform?
- Moreover, what is the difference between learning on datasets with three modalities and with two modalities? Does the change in the number of modalities affect the learning performance of the proposed method? In my opinion, if the modality number changes, the estimation of the probability distribution could change. With more modalities, the estimation could be more accurate. However, if the missing rate also increases due to the introduction of additional modalities, the influence could be unpredictable. Can the authors provide some explanation of this part?
Questions
Please see the weakness part.
Limitations
The authors have discussed limitation in the main paper.
Thank you for the time devoted to your comments. We would like to provide more detailed explanations to address your concerns.
Q1: Notations are not clearly defined. Moreover, the writing shall be further improved. There is no general logic in formulating the methodology.
A1: We really appreciate the reviewer's detailed review. We will comprehensively polish the writing by clarifying the general logic, and we will carefully proofread the submission to fix the typos and grammatical errors.
Q2: A stronger motivation should be proposed to justify the two proposed properties. For example, with some empirical evidence that missing modalities still follow a data distribution similar to that of the complete ones, it would be reasonable to propose the probability extremum property and further design the learning objective.
A2: Thanks for your suggestions on our motivation. We use t-SNE to visualize the distributions of the modality-complete, RGB, Depth, and IR representations of the unified model without PCD distillation. The results are shown in Figure 1 in the pdf. It can be observed that each unimodal distribution is similar to the modality-complete distribution, which provides empirical evidence for PCD to consider the indeterminacy in the mapping from incompleteness to completeness.
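A minimal sketch of how such a visualization can be produced with scikit-learn, assuming the representations have already been extracted as NumPy arrays (all variable names below are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(reps_by_setting):
    """reps_by_setting: dict mapping a setting name (e.g. 'complete', 'RGB') to an (N, D) array."""
    names = list(reps_by_setting)
    stacked = np.concatenate([reps_by_setting[n] for n in names], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(stacked)
    start = 0
    for name in names:
        n = len(reps_by_setting[name])
        plt.scatter(emb[start:start + n, 0], emb[start:start + n, 1], s=4, label=name)
        start += n
    plt.legend()
    plt.show()
```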
Q3: In the experiments, how did the authors control the missing rate of modalities? Moreover, the sampling strategy could be essential: how did the authors sample missing modalities? Is the sampling random or uniform?
A3: We are sorry for omitting the information about the augmentation setting. In this paper, we conduct experiments under two modality-missing settings: 1) training with modality-complete data and testing with modality-missing data; 2) training and testing with modality-missing data.
- Most of the experiments are under setting (1). During training, we augment each modality-complete sample by simulating all potential missing-modality scenarios and randomly sample one of the augmented versions as the training sample for the current epoch. Thus, the missing rate is $(2^M-2)/(2^M-1)$, where $M$ is the number of modalities, and which modalities are missing for each sample varies across epochs (a minimal sketch of this augmentation is given after this list). Furthermore, to investigate the impact of the random augmentation strategy, we conduct additional experiments under extreme augmentation conditions, namely, excluding the simulations in which only the RGB, Depth, or IR modality is available. The results on CASIA-SURF are shown below. It can be seen that when the random augmentation is no longer fully applied, performance decreases. These results indicate that appropriately simulating and sampling various missing scenarios helps enhance multimodal robustness. During testing, we build various testing sets with different missing cases, where each set contains only one missing case using the entire testing data. We report the results on each testing set, as well as the average results across them.
| Augmentations | {R} | {D} | {I} | {R,D} | {R,I} | {D,I} | {R,D,I} | Average |
|---|---|---|---|---|---|---|---|---|
| w/o {R} | 7.97 | 2.36 | 10.22 | 1.18 | 4.05 | 1.22 | 0.92 | 3.99 |
| w/o {D} | 7.39 | 4.48 | 8.59 | 1.53 | 3.55 | 1.42 | 0.65 | 3.94 |
| w/o {I} | 7.00 | 2.19 | 15.43 | 1.03 | 4.00 | 1.59 | 0.82 | 4.58 |
| PCD | 6.54 | 1.67 | 8.13 | 0.80 | 2.76 | 0.82 | 0.54 | 3.03 |
- The experiments in Table 5 in the main paper and Table 9 in the Appendix are under setting (2). During training, we augment each sample by simulating all possible modality-missing cases. During testing, we build the same testing sets as in setting (1). In Table 9, we evaluate the performance on both the CASIA-SURF and CeFA datasets, where each modality of the training data has either 30% or 40% of its data missing. As can be seen, the missing rate of the training data can affect the final results under setting (2): the larger the missing rate, the worse the performance. This is because PCD is only applied to data that has a modality-complete counterpart, so a large missing rate means that less data is used for distillation.
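As referenced in the first bullet above, here is a minimal sketch of the per-epoch augmentation under setting (1); it assumes modalities are stored in a dict and that a missing modality is simulated by zeroing its tensor, which is our simplification rather than the paper's exact masking scheme.

```python
import random
from itertools import combinations
import torch

def sample_missing_case(sample, modalities=("rgb", "depth", "ir")):
    """Randomly pick one of the 2^M - 1 non-empty modality subsets and zero out the rest,
    so a sample has at least one missing modality with rate (2^M - 2)/(2^M - 1), i.e. 6/7 for M = 3."""
    subsets = [set(c) for r in range(1, len(modalities) + 1)
               for c in combinations(modalities, r)]
    kept = random.choice(subsets)
    return {m: (sample[m] if m in kept else torch.zeros_like(sample[m])) for m in modalities}
```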
Q4: What is the difference between learning on datasets with three modalities and with two modalities? Does the change in the number of modalities affect the learning performance of the proposed method? Can the authors provide some explanation of this part?
A4: To answer the reviewer's question, we explore the impact of the number of modalities by controlling it on CASIA-SURF. The experiments are performed under setting (1), and the results are shown below. Notice that the more modalities there are, the larger the missing rate of the training set; for example, the missing rate is 2/3 for two modalities and 6/7 for three modalities. From the results, it can be observed that with more modalities, better modality-complete representations can be provided, so more privileged information can be transferred. This results in a model that is relatively robust across various missing cases. When the number of modalities is small, there are fewer missing cases to consider, and the model may focus more on fitting an easy missing case, resulting in marginal improvements in that case. For example, when the complete modalities are RGB and Depth, the result of the two-modality model at {D} (1.73%, lower is better) is better than that of the three-modality model (2.20%).
| Training Modalities | {R} | {D} | {I} | {R,D} | {R,I} | {D,I} | {R,D,I} |
|---|---|---|---|---|---|---|---|
| RGB, Depth | 7.67 | 1.73 | \ | 1.18 | \ | \ | \ |
| RGB, IR | 7.14 | \ | 14.52 | \ | 5.61 | \ | \ |
| Depth, IR | \ | 1.61 | 6.62 | \ | \ | 1.54 | \ |
| RGB, Depth, IR | 7.23 | 2.20 | 5.66 | 0.99 | 2.86 | 0.89 | 0.74 |
Thanks for further providing experimental results and carefully addressing my concerns; there are no other questions left, so I have decided to keep my current score.
Dear Reviewer upEV,
Thank you for your constructive feedback and comments on our submission. We believe that these comments have significantly strengthened our work. We would greatly appreciate it if, upon reviewing our revisions, you could help champion our submission in the next phase or consider raising the score to support our submission, given the currently divergent ratings. Thank you once again for your thoughtful feedback and for considering our request.
Sincerely,
The authors of 14962
We gratefully thank all the reviewers for their devoted efforts and constructive suggestions on this paper. We are glad that the reviewers have some positive impressions of our work, including:
- The overall structure of the paper is well-organized and easy to follow. (Reviewer EqMG, UGMA)
- The explored research topic is interesting and could have a potential impact in multimodal learning and realistic application purposes. (Reviewer upEV, EqMG)
- The method is novel, technically solid, and can easily be extended to different missing-modality tasks. (Reviewer upEV, EqMG)
- Extensive, sufficient and rigorous experiments with comprehensive ablation study and analysis (Reviewer upEV, EqMG, UGMA).
We have addressed the reviewers' comments and concerns in individual responses to each reviewer. The reviews allowed us to improve our draft, and the changes made in our responses are summarized below:
- We conduct several experiments to analyze the impact of the missing rate, the sampling strategy, and the number of modalities (A3, A4 to Reviewer upEV).
- We explore the sensitivity of using different hyperparameters to optimize the loss function in Eq.(10) (A2 to Reviewer EqMG).
- We showcase the stability of PCD on four datasets (A3 to Reviewer EqMG, A2 to Reviewer UGMA).
- To address the reviewers' concern about computational cost, we compare PCD with other SOTA methods in terms of memory consumption and time per iteration (A1 to Reviewer UGMA).
- We provide further clarifications on several points, including the empirical evidence for the motivation (A2 to Reviewer upEV), the details of the equations (A1 to Reviewer EqMG), the application scope of the algorithm (A1 to Reviewer UGMA), and the effectiveness of the two loss terms (A3 to Reviewer UGMA).
We appreciate all reviewers' great effort again! We are looking forward to your reply.
Below is the list of references used in the responses.
[1] Song S, Lichtenberg S P, Xiao J. Sun rgb-d: A rgb-d scene understanding benchmark suite[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 567-576.
The paper addresses the problem of missing modalities in multi-modal learning. The authors propose leveraging the information redundancy across modalities to improve learning under missing modalities.
Specifically, the authors assume that for a missing modality, a probability distribution can be estimated to approximate the real modality value. They introduce two key properties - the extremum property and the conformal property - to guide the estimation of this probability distribution, modeled as a Gaussian in the paper.
The two properties are considered in two objective loss functions as follows:
A. An "extremum probability" objective, which states that the complete-modality value in the latent representation occurs at the maximum of the missing-modality probability distribution; B. A "geometric consistency" objective, which maintains the geometric relationship between the latent representations of missing modalities and the complete modality within a batch of samples, exploiting a contrastive learning loss.
The experimental evaluation is based on several benchmarks in image classification and segmentation and shows that this "Probabilistic Conformal Distillation" (PCD) method consistently outperforms state-of-the-art approaches in improving robustness to missing modalities.
In summary, the key contributions are the probabilistic modeling of missing modalities and the novel PCD learning objective that leverages extremum and conformal properties to enhance multimodal learning performance in the presence of missing data. Experimental results are encouraging.
Concerns from the reviewers mostly come from the clarity and precision of the elaboration of the methodology, as well as the application scenarios. Among the various notational and clarity issues raised by the reviewers, a particular one concerns the usage of the contrastive learning loss in the geometric consistency objective. The confusion mostly comes from the fact that the paper is introduced mainly from the perspective of classification, where positive samples arise from the same class (e.g., the definitions in Eqs. (7)-(9)), while in segmentation there is only a single positive sample, which is discussed only briefly in line 134. However, in practice, positive samples in contrastive learning may also be generated from various data augmentation techniques applied to the same image. In the case of a single positive sample, the geometric consistency loss is probably a stronger objective than the extremum objective, as it imposes more geometric relationships on different samples within the batch.
Although confusions exist in those parts about notations (e.g., a quantity described as a distance while an inner product is not one) and elaborations, such concerns could be clarified in the final version without too many new studies. Therefore, the current manuscript could be accepted if the authors carefully make all the clarifications raised by the reviewers.