PaperHub

NeurIPS 2024 · Poster

Overall rating: 5.0/10 from 4 reviewers (individual scores: 5, 3, 5, 7; min 3, max 7, std 1.4)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.3 · Presentation: 2.5

Learning the Latent Causal Structure for Modeling Label Noise

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2025-01-16

Keywords: Label noise, transition matrices

Reviews and Discussion

Review (Rating: 5)

This paper argues that traditional label-noise learning methods based on noise transition matrices have limitations. Specifically, only the noise transition matrices of certain special examples can be estimated effectively, while the transition matrices of other examples must be estimated based on similarity. However, these similarity assumptions are difficult to satisfy in many real-world applications. Therefore, this paper proposes that learning the latent causal structure governing the generative process of noisy data can help estimate noise transition matrices. The paper constructs a novel structural causal model to describe the generative process of noisy data and integrates semi-supervised learning techniques, ultimately achieving state-of-the-art results.

Strengths

  1. In the structural causal model, the assumption of dependency relationships among the latent variables z is novel, and the analysis of its reasonableness is convincing.
  2. The theoretical analysis of the identifiability of latent variables in the appendix is reasonable.
  3. The experimental design of this paper is comprehensive.

Weaknesses

  1. The structural causal model of the labeling process is not reasonable. According to the labeling process, the example feature $x$ should be the cause of the noisy label $\tilde{y}$. If only the clean label $y$ is the cause of the noisy label, then this structural causal model still models class-dependent noise rather than instance-dependent noise.
  2. In line 150, the assumption that latent noise variables are generated by the clean label is not reasonable.
  3. The CSGN method, like other label-noise learning methods based on semi-supervised learning, relies on the small-loss criterion and the selection of clean samples.
  4. The CSGN method introduces semi-supervised learning techniques for warmup; however, the experimental section does not include ablation studies.

Questions

  1. Can the authors explain why the clean label is the cause of the latent noise variables?
  2. Can the authors show the model performance without using semi-supervised learning?

Limitations

The authors have discussed some limitations, and the discussion seems reasonable.

Author Response

Response to Reviewer 22VP

Q1: The structural causal model of the labeling process is not reasonable. The example feature $X$ should be the cause of the noisy label $\tilde{Y}$. If only the clean label $Y$ is the cause of the noisy label, then this structural causal model is still modeling class-dependent noise rather than instance-dependent noise.

A1: We believe this is a misunderstanding. Our model indeed addresses instance-dependent noise. Specifically, the generation of the instance $X$ and the noisy label $\tilde{Y}$ involves common latent causes $\mathbf{Z}$, such as $Z_3$ depicted in Figure 2(b). In causality theory, the presence of common causes between two variables, such as $X$ and $\tilde{Y}$, establishes dependence between them [1,2]. This indicates that our model is inherently instance-dependent, not instance-independent.

Moreover, when there exists label noise, the real cause of the noise is some latent causal factors rather than the image itself, which is very intuitive. For example, we usually say that light and angle cause noise, but we do not say that the image causes noise. This implies that the direct causes of label noise are light and angle rather than the image. However, previous work [6] ignores this point: in their generation model, the instance $X$ is the direct cause of the noisy label $\tilde{Y}$. This is one of the insights of our paper. We have also provided experiments to verify the effectiveness of our structural causal model, as shown in Tables 3 and 4 of the PDF. The experiment details are in the response to Reviewer S6tG.

Q2: In line 150, the assumption that latent noise variables are generated by the label is not reasonable.

A2: Thank you for highlighting this concern. It is challenging to identify latent factors when they are not independent but causally related. The assumption that latent noise variables are generated by the label provides a theoretical guarantee of identifiability. Though this assumption introduces certain constraints, compared to other models like CausalNL, our generative framework offers a more realistic representation of the underlying data generation processes.

To empirically validate our model, we conducted experiments by integrating our generative framework into CausalNL and InstanceGM. These experiments were carried out on the CIFAR-10 dataset, which features instance-dependent label noise. We refer to these adaptations as "CausalNL'" and "InstanceGM'", respectively. The results, shown in Tables 3 and 4 of the PDF, demonstrate the effectiveness of our model.

Q3: The CSGN method, like other label-noise learning methods based on semi-supervised learning, relies on the small-loss criterion and clean samples. Can the authors show the model performance without using semi-supervised learning?

A3: Thanks for your advice. The CSGN can also work with the early stopping method, PES [5]. We follow the setting in PES and conduct experiments on the CIFAR10 dataset. We refer to the version of CSGN that works with PES as “CSGN-PES”. Notably, CSGN-PES does not rely on semi-supervised learning, the small-loss criterion, or the selection of clean samples. The experiment results are shown in Table 11 of the PDF. The experiment results demonstrate that CSGN-PES maintains robust performance even in the absence of semi-supervised learning techniques.

Q4: The CSGN method introduces semi-supervised learning techniques for warmup; however, the experimental section does not include ablation studies.

A4: Thank you for pointing this out. We have conducted ablation studies to assess the impact of removing the semi-supervised learning warmup phase from the CSGN method. In these studies, we replaced the semi-supervised warmup with a regular early-stopping approach, where the neural networks were trained for 10 epochs on the training data. The variant of CSGN without the semi-supervised learning warmup is denoted as CSGN-WOSM. The experiment results are shown in Table 9 of the PDF. The results indicate that CSGN retains its effectiveness even without the semi-supervised learning warmup.

Comment

The authors have provided a detailed response to your questions. How has your opinion changed after reading their response? Have they appropriately addressed your questions and concerns?

Comment

Dear AC,

As the authors would have noticed, the score has already been updated after the authors posted the rebuttal.

Best, Reviewer qzn3

Review (Rating: 3)

In learning with noisy labels, estimating an instance's noise transition matrix is crucial for inferring its clean label. Current studies assume predefined relations between transition matrices, which may not hold in real-world scenarios. Motivated by the idea that relations between noise transition matrices are established through the causal structure, this paper proposes to learn the latent causal structure via a learnable graphical model.

Strengths

Using a DAG for a flexible latent causal structure.

Weaknesses

  • This paper points out that other studies estimate the transition matrix based on some similarity assumption, e.g., class identity or the manifold assumption. However, there is also instance-dependent transition matrix modeling, in which the transition matrix differs sample by sample. This suggests that there is no similarity assumption in the modeling of instance-dependent transition matrices.

  • Furthermore, in this paper's modeling, it is assumed that the relation structure is established from the causal structure. This implies some arbitrary similarity assumption, namely that similar latents lead to similar transition matrices, even when the latent variable is not interpretable.

  • After all, the similarity in the manifold and the similarity of the latent structure could be very close, since samples generated from the same latent variable would have similar features and thus could lie in similar regions of the manifold space in practice. If samples with similar latent variables lay in very different regions of the manifold, considering the latent would be a must.

  • In one sentence, the motivation to learn the latent causal structure is not convincing enough.

  • For implementation, the method will incur substantial time and memory costs since it must generate X (the input), which is a limitation of this method.

  • For experiments, the baselines are outdated. The most recent baseline seems to have been published in 2022. There is more recent research on learning with noisy labels that should be compared.

  • Why are NPC and SOP not included as baselines, although they are methods that use a generative process for the noisy-label problem? Since this paper utilizes a generative model for the classification task, NPC and SOP must be compared as important baselines.

  • The position of Figure 5 is inadequate. It should be located at the experiment part.

  • No ablation studies.

Questions

  • How can I know the learned latent causal structure is good? Also, how can I measure whether a good latent structure will lead to good transition matrix estimation?
  • According to Figure 5, the transition matrix estimation error of MEIDTM is smaller than that of CSGN on the CIFAR-100 dataset. However, considering Table 3, the accuracy of MEIDTM is not better than CSGN's. This means the proposed method does not always surpass other approaches in the estimation of noise transition matrices. What does this mean?
  • Any examples of similarity-based assumption failure cases? Can modeling latent causal structure solve those failures?

Limitations

.

Author Response

Response to Reviewer pfn2

Q1: The transition matrix is different sample by sample in instance-dependent transition matrix modeling. This means there is no similarity assumption for the instance-dependent transition matrix.

A1: We believe that there is a misunderstanding. The transition matrix varying from sample to sample does not imply the absence of a similarity assumption. For example, part-dependent label noise [3] represents an instance-dependent label noise scenario in which it is assumed that transition matrices for certain parts of instances are similar. Additionally, without some form of similarity between samples, it is impossible to estimate transition matrices for the entire dataset, as only the noisy label is observable and the clean label remains hidden. Therefore, existing methods employ predefined similarities to estimate transition matrices across a dataset.

Q2: Any examples of similarity-based assumption failure cases? Can modeling latent causal structure solve those failures?

A2: Yes, our model performs well when the similarity-based assumption fails. We conducted an experiment on the moon dataset, synthesizing noisy labels such that the transition matrices on the same manifold are not the same. The test accuracy of our method is 98.07±0.69%, and the estimation error of the transition matrix is 0.10±0.07. In contrast, the test accuracy of MEIDTM is 91.06±0.75%, and its estimation error of the transition matrix is 0.45±0.16. The results show that our method surpasses MEIDTM in terms of both test accuracy and transition matrix estimation error.

Furthermore, we tested our model in a scenario where the similarity-based assumption is valid, i.e., the transition matrix is the same across the same manifold. The test accuracy of our method is 98.35 ± 0.19%, and the estimation error of the transition matrix is 0.08±0.05. In contrast, the test accuracy of MEIDTM is 96.23±0.25%, and the estimation error of the transition matrix is 0.42±0.11. The experiment results show the generalizability and robustness of the proposed method.

Q3: The paper assumes that the relation structure is established from the causal structure. This implies some arbitrary similarity assumption that similar latents lead to similar transition matrices. The similarity in the manifold and the similarity of the latent structure could be very close, since samples generated from the same latent variable would have similar features.

A3: We do not manually define similarities as the existing methods do. Instead, our method aims to recover the causal factors that cause noisy labels, and the relations among different transition matrices are established based on these causal factors. For example, an annotator is likely to annotate pictures with the feature "fur" as "dog": when this annotator annotates pictures containing the feature "fur," their noisy labels are probably "dog."

The similarity captured by our model is not the same as the similarity in the manifold, since our method still performs well when the manifold similarity assumption is not met.

Q4: The motivation to learn the latent causal structure is not convincing enough.

A4: Learning the latent causal structure can recover the causes of noisy labels and then establish the relations among different transition matrices without predefined similarity. The empirical results in A2 also demonstrate the effectiveness of our method.

Q5: The proposed method requires substantial time and memory since it must generate X, which is a limitation of this method.

A5: Our method is a generative method that requires an additional generative network, leading to more computation cost. We acknowledge this limitation and will include it in Appendix C. However, the existing works CausalNL and InstanceGM also have this limitation.

Q6: The baselines are outdated. Since this paper utilizes a generative model for the classification task, NPC and SOP are important baselines.

A6: We have updated our set of baselines. RENT [4], which was published at ICLR 2024, is included. Additionally, NPC and SOP have been included as baselines, as shown in Tables 1 and 2 of the PDF.

Q7: The position of Figure 5 is inadequate. It should be located at the experiment part.

A7: We will relocate Figure 5 in the final version.

Q8: No ablation studies.

A8: We have conducted ablation studies, shown in Tables 9 and 10 of the PDF.

Q9: How can I know the learned latent causal structure is good? Also, how can I measure that a good latent structure will lead to good transition matrix estimation?

A9: The efficacy of the learned latent causal structure can be empirically validated by comparing the performance of our model against that of other models. Specifically, we conducted experiments using the moon dataset, where our method achieved a test accuracy of 98.07±0.69% and a transition matrix estimation error of 0.10±0.07. In comparison, the CausalNL model, which does not model the latent causal structure, recorded a test accuracy of 97.88±0.75% and a transition matrix estimation error of 0.12±0.06. The results show that the learned latent causal structure is good and leads to good transition matrix estimation.

Q10: The transition matrix estimation error of MEIDTM is smaller than that of CSGN on the CIFAR-100 dataset, but the accuracy of MEIDTM is not better than CSGN's.

A10: Thanks for your comment. Other factors can also affect test accuracy. For example, MEIDTM employs only one neural network to select clean data and train the classifier. Moreover, MEIDTM does not employ the semi-supervised learning technique. CausalNL is the closest to our setting; when the settings of two methods are similar, the one with a smaller transition matrix estimation error has better accuracy.

Comment

Dear Reviewer pfn2,

We sincerely appreciate your valuable time and effort in reviewing this paper. As we approach the end of the author-reviewer discussion period, we would be grateful for any additional feedback or confirmation regarding whether your concerns have been satisfactorily addressed.

Note that our rebuttal carefully addresses your concerns. Some major responses are summarized as follows:

  1. we explained that the transition matrix varying from sample to sample does not imply the absence of a similarity assumption, and we gave an example. More details can be found in A1 of the rebuttal;

  2. we conducted experiments on scenarios where the similarity-based assumption fails and where it is valid. The results show that our method works well in both scenarios, while the performance of MEIDTM drops significantly when the similarity-based assumption fails. More details can be found in A2 of the rebuttal;

  3. we empirically verified that the similarity in the manifold and the similarity of the latent structure are different. Instead of using a predefined similarity, our method captures the relations (or similarity) among different transition matrices. More details can be found in A3 of the rebuttal;

  4. we have added RENT (published at ICLR 2024), NPC, and SOP as baselines; their empirical results are shown in Tables 1 and 2 of the PDF;

  5. we have conducted sufficient ablation studies, as shown in Tables 9 and 10 of the PDF;

  6. we have empirically verified that a good latent structure can lead to good transition matrix estimation. More details can be found in A9 of the rebuttal.

Thanks again for your valuable comments. If there are any remaining concerns, we can discuss them.

Best regards,

Authors

Comment

Thanks to the authors for their sincere answers. Although the authors have shown thorough effort in their studies, I am still not convinced on the following points.

  • I think part-dependent label noise is rather an old and specific version of instance-dependent transition matrix studies. Rather, I think there are also no manual similarity assumptions in BLTM [1].

  • After all, I think the motivation for why we should use the paper's method to estimate the transition matrix is mainly based on its performance. However, I am not convinced of its utility, since the performance gap does not seem that large (as reviewer afzG also pointed out).

  • Furthermore, there should be some empirical findings showing that the learned latent causal structure is optimized. If there are other factors that can affect the test accuracy, as the authors said in the rebuttal, then test accuracy cannot be an adequate metric for showing whether the proposed method captures the true transition matrix or the true causal structure well.

  • I doubt whether the baselines are well reproduced. For example, BLTM [1] should perform better under instance-dependent noise than naive cross-entropy, since it is a method mainly targeting instance-dependent noise. However, according to the authors' experimental results, it performs even worse than naive cross-entropy in the CIFAR-10 instance-dependent label noise setting.

  • Similar patterns also occur for many baselines, including MentorNet, PTD, CausalNL, MEIDTM, BLTM, NPC, and RENT (considering the CIFAR-10 results). I am not sure what those papers' assumptions are for estimating the transition matrix, but I don't think each paper's similarity assumption is the problem, because methods including Reweight, Forward, and CCR show better performance than CE, as expected.

Therefore, I will keep my initial score.

[1] Yang, S., Yang, E., Han, B., Liu, Y., Xu, M., Niu, G., & Liu, T. (2022, June). Estimating instance-dependent bayes-label transition matrix using a deep neural network. In International Conference on Machine Learning (pp. 25302-25312). PMLR.

Comment

Thank you for the time and effort in reviewing our work. We believe most of the concerns are minor and caused by misunderstanding. Please kindly let us know if you have any major concerns.

Q1: There are no manual similarity assumptions in BLTM.

A1: Thanks for your comment. Though BLTM does not have manual similarity assumptions, it relies on some strong assumptions.

1) BLTM models the probability distribution of the transition matrix $p(\tilde{Y}|Y,X)$ with a function $f$. They directly assume that the optimal $f^*$ can be identified given only the distilled dataset. 2) They also assume the noise rate is upper bounded. Note that these assumptions are quite strong, as they generally cannot be verified in real-world cases. Fig. 5 of our main paper also shows that the estimation error of BLTM is worse than that of both our method and MEIDTM on the challenging CIFAR-100 dataset under different noise rates.

Q2: The concern about the performance.

A2: The performance gap is small on some easy datasets with small noise rates, but it is large on complex datasets with high noise rates and on large-scale real-world datasets. For example, under instance-dependent label noise with a noise rate of 0.5, the accuracy (percentage) of our method on the CIFAR-100 dataset is 74.60 ± 0.17, whereas that of the best baseline is 61.54 ± 0.34. On the WebVision dataset, the accuracy of our method is 79.84, whereas that of the best baseline is 77.78.

Q3: Empirical findings to show the learned latent causal structure is optimized.

A3: Thanks for your insightful comment. We can empirically verify the learned latent causal structure on a synthetic dataset. Specifically, we conduct an experiment on the moon dataset. To create noisy labels caused by a single factor, we manually corrupted the labels, with the noise rate for each data point depending on its second-dimension value. Note that the causal factors of the moon dataset are independent. We trained our model on this synthetic dataset with the dimension of the latent factor $Z$ set to 2. After training, the causal weight between the two causal factors is -0.0008; this influence is small enough to indicate that they are independent. The values of the mask variable $M_{\tilde{Y}}$ for noisy labels were [0.0000, 0.0232], which shows that our mask mechanism effectively identifies and selects the critical latent factor responsible for generating noisy labels.

We also compare the performance of our method with CausalNL. In CausalNL, the direct cause of noisy labels is the image, which is not aligned with the generation process of the moon dataset. Our method can achieve a test accuracy of 98.07±0.69 and an estimation error for the transition matrix of 0.10±0.07. In comparison, the CausalNL model, which does not model the latent causal structure, recorded a test accuracy of 97.88±0.75 and a transition matrix estimation error of 0.12±0.06. The results show that a good causal structure can lead to good transition matrix estimation.

Q4: The performance of the BLTM model falls short of that achieved using naive cross-entropy.

A4: BLTM assumes the noise rate is upper bounded in order to learn transition matrices well. Their original paper modifies the instance-dependent noise generation method of PTD [1]. For comparison, our paper follows the original instance-dependent noise generation method in PTD when reproducing BLTM. Under this setting, the bounded-noise assumption does not hold, which lowers the quality of the distilled dataset; the function $f$ trained on the distilled dataset then cannot model the transition matrices well. Thus, the performance of BLTM is poor.

Q5: Similarity assumptions of the existing papers are not important and not a problem.

A5: The similarity assumptions are important for existing methods to estimate the transition matrix. The existing methods can only estimate the transition matrices of some training samples since only noisy labels are given. To estimate the matrices for the rest of the training samples, the existing methods usually establish relations between the transition matrices by predefined similarity. However, when these assumptions do not hold in the real world, the estimation error of the transition matrices will be large.

In this paper, instead of predefined similarity, we propose a method that can capture the relation between transition matrices by recovering the causes of noisy labels. We believe our work is interesting to readers in the label noise community.

The empirical results on a large-scale dataset demonstrate this: with our method, the test accuracy on the WebVision dataset reaches 79.84, whereas the best baseline achieves 77.78 and Forward achieves 61.12, which supports that the similarity assumptions of existing methods are indeed a problem.

Reference

[1] Part-dependent label noise: Towards instance-dependent label noise.

Comment

Thanks to the authors for their prompt responses.

I think there is some misunderstanding. My concern is whether the experimental results are well reproduced, since there seem to be some gaps between the original papers' experimental results and the reproduced results. I know that the similarity assumption itself has already been widely considered and thoroughly researched in transition-matrix-related studies.

Comment

Thanks for your reply. The experimental results are well reproduced. We have carefully compared our results with the original papers' results, and the performance of the baselines is comparable. We provide the details for each baseline we used as follows.

For PTD and CausalNL, which also run experiments on instance-dependent label noise, our experimental results are identical to theirs.

Some baselines were not evaluated on instance-dependent label noise, so we reproduced their results under it. Our results are comparable to those reported in existing papers. Specifically,

  • CE, the standard cross-entropy loss. We reproduce it with PreAct-ResNet-18. A paper [1] also reproduces these results; its accuracies are 52.19±1.42 and 42.26±1.29 on CIFAR-100 under noise rates of 0.2 and 0.4. Our corresponding results are 54.98 ± 0.19 and 43.65 ± 0.15, which are comparable with [1].

  • MentorNet pretrains a classification network to select reliable examples for the main classification network. The original paper does not experiment on instance-dependent label noise. We replace the CNNs used in MentorNet with PreAct-ResNet-18 and reproduce the results. A paper [2] also reproduces these results; its accuracies are 51.73±0.17 and 40.90±0.45 on CIFAR-100 under noise rates of 0.2 and 0.4. Our corresponding results are 55.98 ± 0.32 and 43.79 ± 0.48, which are comparable with [2].

  • Coteaching uses two classification networks to select reliable examples for each other. The original paper does not experiment on instance-dependent label noise. We replace the CNNs used in Coteaching with PreAct-ResNet-18 and reproduce the results. A paper [1] also reproduces these results; its accuracies are 57.24±0.69 and 45.69±0.99 on CIFAR-100 under noise rates of 0.2 and 0.4. Our corresponding results are 61.54 ± 0.06 and 49.50 ± 0.10, which are comparable with [1].

  • Forward uses the transition matrix to correct the loss function. The original paper does not experiment on instance-dependent label noise. We replace the ResNets used in Forward with PreAct-ResNet-18 and reproduce the results. A paper [1] also reproduces these results; its accuracies are 58.76±0.66 and 44.50±0.72 on CIFAR-100 under noise rates of 0.2 and 0.4. Our corresponding results are 56.59 ± 0.25 and 46.03 ± 0.65, which are comparable with [1].

  • DivideMix divides the noisy examples into labeled and unlabeled examples and trains the classification network using the semi-supervised technique MixMatch. The original paper does not experiment on instance-dependent label noise. We follow their original settings to reproduce the results. A paper [3] also reproduces these results; its accuracies are 77.07 and 70.80 on CIFAR-100 under noise rates of 0.2 and 0.4. Our corresponding results are 76.81 ± 0.14 and 73.12 ± 0.32, which are comparable with [3].

There are two baselines, CCR and Reweight, that model class-dependent label noise; their original papers do not experiment on instance-dependent label noise, and we have not found related papers that report their performance on it. Specifically,

  • CCR uses forward-backward cycle-consistency regularization to learn noise transition matrices. We use PreAct-ResNet-18 as the backbone and follow their original settings to reproduce the results.

  • Reweight estimates an unbiased risk defined on clean data using noisy data via the importance reweighting method. We use PreAct-ResNet-18 as the backbone and follow their original settings to reproduce the results.

Comment

For BLTM and MEIDTM, our results differ from those in their papers because we use the instance-dependent label noise generation method from PTD [4], whereas their papers used different instance-dependent label noise generation methods. Specifically,

  • For BLTM, they assume the noise rates have upper bounds. As stated in the problem setting section of their paper [5]: "This paper focuses on a reasonable IDN setting that the noise rates have upper bounds $\rho_{max}$ as in (Cheng et al., 2020)". To generate such instance-dependent label noise, they modify the instance-dependent label noise in PTD [4]. As shown in the second line of Algorithm 2 in [5]: "Sample instance flip rates $q_i$ from the truncated normal distribution $\mathcal{N}(\eta, 0.1^2, [0, \rho_{max}])$", which differs from the original algorithm in PTD [4]: "Sample instance flip rates $q \in \mathbb{R}^n$ from the truncated normal distribution $\mathcal{N}(\tau, 0.1^2, [0, 1])$;". That is, the truncated normal distributions in the two algorithms are different.

  • For MEIDTM, they follow the instance-dependent label noise generation method in the work [6], which is also different from the method in PTD. Line 3 of Algorithm 1 in [6] is "Sample $W \in \mathcal{R}^{S \times K}$ from the standard normal distribution $\mathcal{N}(0, 0.1^2)$;", while the original step in PTD is "Independently sample $w_1, w_2, \dots, w_c$ from the standard normal distribution $\mathcal{N}(0, 0.1^2)$;". Correspondingly, Line 4 of Algorithm 1 in [6] is "$p = x_n \cdot W$", while the original step in PTD is "$p = x_i \cdot w_{y_i}$". The original noise generation algorithm has $c$ parameters $w_i$, where $c$ is the number of classes, but the noise generation method in MEIDTM has only one parameter $W$ (see the sketch after this list).
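To make the flip-rate difference concrete, here is a minimal, hedged sketch of the sampling step; the function and parameter names are ours for illustration and are not from either paper's code:

```python
# Minimal sketch of per-instance flip-rate sampling (illustrative names).
from scipy.stats import truncnorm

def sample_flip_rates(n, mean, std=0.1, lower=0.0, upper=1.0, seed=0):
    """Draw flip rates q_i ~ TruncatedNormal(mean, std^2) restricted to [lower, upper]."""
    a, b = (lower - mean) / std, (upper - mean) / std  # standardized bounds
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=n, random_state=seed)

q_ptd = sample_flip_rates(n=50000, mean=0.4)              # PTD: support [0, 1]
q_bltm = sample_flip_rates(n=50000, mean=0.4, upper=0.6)  # BLTM-style: capped at rho_max
```

The only change is the upper truncation bound, which is exactly the bounded-noise-rate assumption discussed in A4 above.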

Reference

[1] Bai, Yingbin, et al. "Understanding and improving early stopping for learning with noisy labels." Advances in Neural Information Processing Systems 34 (2021): 24392-24403.

[2] Yuan, Suqin, Lei Feng, and Tongliang Liu. "Late stopping: Avoiding confidently learning from mislabeled examples." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[3] Garg, Arpit, et al. "Instance-dependent noisy label learning via graphical modelling." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2023.

[4] Xia, Xiaobo, et al. "Part-dependent label noise: Towards instance-dependent label noise." Advances in Neural Information Processing Systems 33 (2020): 7597-7610.

[5] Yang, Shuo, et al. “Estimating Instance-dependent Bayes-label Transition Matrix using a Deep Neural Network.” arXiv preprint arXiv:2105.13001 (2021).

[6] Cheng, Hao, et al. "Learning with instance-dependent label noise: A sample sieve approach." arXiv preprint arXiv:2010.02347 (2020).

Comment

Dear Reviewer pfn2,

Thank you very much for your time and comments. To further clarify our paper, we will follow your valuable advice and include the aforementioned details in our revision. We hope this can address your concern well.

The author-reviewer discussion ends soon; please kindly let us know if there are any additional concerns or suggestions. We are happy to provide answers.

Many thanks,

Authors

Review (Rating: 5)

The work tackles learning with noisy labels in the context of classification. The proposal leverages a causal model to embed the relationships among features, labels, and noisy labels. The authors show that the proposed causal model can be identified even under noisy data, which enables learning the noise transition matrices and ultimately improves classification accuracy.

Strengths

Overall, the paper is well organized. The generative model is stated clearly. The proposed causal model enables identifying the transition matrix, which is a good step toward instance-dependent noise. The proposed learning algorithm is discussed in detail, and its effectiveness is demonstrated via several experiments on synthetic and real datasets.

Weaknesses

  • The latent factors generating $\mathbf{X}$ and the latent factors generating the noisy label $\widetilde{Y}$ are different, as stated in several places, e.g., page 6, line 260. I agree with the intuition that only a subset of latent factors should affect the generation of $\mathbf{X}$ (or $\widetilde{Y}$), hence the use of the L1 norm in (3). But I don't get why they have to be different. And how is this enforced?

  • Performance on the real dataset, i.e., CIFAR-10N, is not very convincing, as the gaps between the best and second-best methods are less than 1% in several cases.

  • Performance-wise, it is unclear what the contributing factor to the performance of Algorithm 1 is. In particular, how significant is the step of selecting clean samples? Can we have another baseline that trains the classifier only on these clean samples?

  • The optimization problem is complicated, as it serves as a criterion to learn both the causal model structure and its parameters.

In addition, I would suggest having more discussion on the identifiability result. Given that there are several missing details (articulated in the Questions), I'll consider changing my score when those questions are answered.

Questions

  • What is the dimension of $\mathbf{Z}$, i.e., the number of latent factors $Z_i$?
  • What are $g_Y^1, g_Y^2$ in Algorithm 1?
  • How is $q_\psi$ updated after the warm-up step?
  • How are the mask variables $M_X, M_{\widetilde{Y}}$ updated? Or are they considered as variables?
  • How many clean samples have been selected in the experiment? How accurate are they?
  • Under the proposed causal model, can we have $P(\widetilde{Y} \mid Y, X) = P(\widetilde{Y} \mid Y) P(Y \mid X)$? I suggest having such a discussion to clearly distinguish the proposed model from the instance-independent confusion matrix.

Limitations

Yes

Author Response

Response to Reviewer afzG

Q1: I don't get why the latent factors generating $\mathbf{X}$ and the latent factors generating the noisy label $\tilde{Y}$ have to be different. And how is this enforced?

A1: Thank you for your question. The latent factors for generating the instance $X$ and the noisy label $\tilde{Y}$ can be the same, as our method is flexible enough to allow the masks to select any subset of latent factors, including potentially all factors, for generating $X$ and $\tilde{Y}$.

We introduce the masks to select factors because not all latent features of $X$ cause $\tilde{Y}$. For instance, in Figure 1(a), the feature "fur" causes the noisy label "dog," whereas other features like "wall" and "floor" are irrelevant to the noisy label. To effectively isolate the minimal necessary latent factors for generating noisy labels, we employ an L1 norm constraint on the mask variables.

To empirically validate our method, we conducted an experiment on the "moon dataset". More settings for the "moon dataset" can be found in the global response. After training, the values of the mask variable $M_{\tilde{Y}}$ for noisy labels are [6.0166e-05, 2.3246e-02], which means that the second factor is selected. These results demonstrate that our mask mechanism can effectively select the latent factor for generating noisy labels.

Q2: The performance gap between the best and second-best methods is less than 1% in several cases on CIFAR-10N.

A2: CIFAR-10 is a relatively easy dataset on which models can achieve high accuracy. For instance, PreAct-ResNet-18 achieves an accuracy of 93.17 ± 0.04% when trained on clean data. However, our model demonstrates significant effectiveness on more challenging datasets with limited samples, such as CIFAR-100N. Here, our method achieves an accuracy of 74.60 ± 0.17%, outperforming DivideMix, whose accuracy is 61.54 ± 0.34%. Moreover, our model also shows robust performance on the large-scale real-world dataset WebVision, achieving a top-1 accuracy of 79.84%, compared to 77.32% for DivideMix.

Q3: How significant is the step of selecting clean samples? Can we have another baseline that trains the classifier only on these clean samples?

A3: Thank you for your question. The selection of clean samples is a crucial step in many noise-robust algorithms, including CausalNL, InstanceGM, and DivideMix, which use a heuristic approach of alternately training classifiers and selecting clean samples. The accuracy of the selected clean samples improves as the classifier improves. Our method follows this heuristic process.

We train a baseline using cross-entropy loss on the selected clean samples, denoted "CE-clean". The experiment results are in Tables 5 and 6 of the PDF.
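As background, a minimal sketch of the small-loss selection heuristic in the style of DivideMix (fit a two-component Gaussian mixture to per-sample losses and treat the low-loss component as clean); the function and threshold below are our illustration, not the authors' code:

```python
# Sketch of small-loss clean-sample selection (DivideMix-style, illustrative).
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean mask: True where a sample's clean-probability exceeds the threshold."""
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)  # normalize
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses.reshape(-1, 1))
    clean = gmm.means_.argmin()  # the component with the smaller mean loss
    p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean]
    return p_clean > threshold
```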

Q4: The optimization problem is complicated, as it serves as a criterion to learn both the causal model structure and parameters.

A4: Though the optimization problem is complicated, the optimization process of our model is stable when using modern optimizers such as SGD and Adam.

Q5: I would suggest having more discussion on the identifiability result.

A5: Thanks for your advice. Here, we provide more discussion of the identifiability result: the theoretical results indicate that when the number of causal factors is 4, $n_s \times 15$ confident examples from distinct classes are required to identify the causal model, where $n_s$ is the number of different style combinations. If changes in style combinations do not affect the parameters of our causal model, only 15 confident examples from distinct classes are required. This discussion will be added to the final version of our paper.

Q6: What is the number of latent factors $Z_i$?

A6: Thank you for your suggestion. The number of latent factors is 4, as mentioned in Line 632 of our paper. To clarify this and prevent any confusion, we will add a statement in Section 4 of the final version of the paper to emphasize that the number of latent factors is 4.

Q7: What are $g_Y^1, g_Y^2$ in Algorithm 1?

A7: $g_Y^1$ and $g_Y^2$ are two classification networks used to model the distribution $q_\psi(Y|X)$; they are defined in line 197 of the paper. Specifically, we follow previous work (DivideMix) in adopting a Co-Teaching learning paradigm. We will clarify this in Appendix D.

Q8: How is $q_\psi$ updated after the warm-up step?

A8: After the warm-up step, the entire model, including $q_\psi$, is optimized end-to-end by minimizing the loss defined in Equation 9.

Q9: How are the mask variables $M_X, M_{\tilde{Y}}$ updated? Or are they considered as variables?

A9: The mask variables $M_X$ and $M_{\tilde{Y}}$ are learnable parameters that are updated through the optimization of Equation 9. During this optimization, two key constraints apply to these variables: first, the causal factors selected by $M_X$ and $M_{\tilde{Y}}$ must be able to generate $X$ and $\tilde{Y}$; second, the mask variables are required to be sparse. These constraints are integral to the loss defined in Equation 9. We will clarify the update of the mask variables in the final version.

Q10: How many clean samples have been selected in the experiment? How accurate are they?

A10: Thanks for your insightful question. We report the number and the accuracy of the selected clean samples on the CIFAR-10 and CIFAR-100 datasets, as shown in Tables 7 and 8 of the PDF. We will report these statistics in the Appendix of our paper.

Q11: Under the proposed causal model, can we have $P(\tilde{Y}|Y,X) = P(\tilde{Y}|Y) P(Y|X)$? I suggest having such a discussion to clearly distinguish the proposed model from the instance-independent confusion matrix.

A11: Thanks for your insightful question. We believe you mean $P(\tilde{Y}|Y,X) = P(\tilde{Y}|Y)$. Our model is instance-dependent (see details in the global response). We will clarify this distinction in the introduction of our paper.

Comment

Dear authors,

Thank you for addressing my concerns/questions. I have a few more questions.

Q1. In the paper, it is stated clearly that they are different. Does that mean the method allows the latent factors generating the instance and the noisy label to be different, but they do not have to (or should not?) be in general?

Q6. Why is it 4 in particular? Does 4 play a crucial role in the result, theoretically or empirically?

Q7, 8, 9. I am really confused. Why are there two classifiers $g_Y^1, g_Y^2$? What is the difference between them?

In Algorithm 1, $g_Y^1, g_Y^2$ are outputs of the WarmUp step, but the Warmup section on page 5 never mentions the two classifiers. In particular, in line 197, there is only one "classification network $g_Y$", which is used to "model the distribution $q_\psi(Y|X)$".

Also, is $g_Y$ or the $q_\psi(Y|X)$, or both, trainable?

Q9. The term mask is usually used to imply binary vectors. Are those mask vectors $M_X, M_Y$ not binary vectors? Do they have any constraints besides sparsity?

Comment

Thanks for your comments.

Q1: The term mask is usually used to imply binary vectors. Are those mask vectors not binary vectors? Do they have any constraints besides sparsity?

A1: Those mask vectors are not binary vectors in our implementation. Our mask vectors are continuous, so they can be optimized easily using gradient-based methods such as SGD and Adam. In contrast, binary vectors are hard to optimize with gradient-based methods since they are discontinuous and non-differentiable. To enforce sparsity in the learned mask vectors, we use an L1 loss to constrain them.

Besides sparsity, the subsets of causal factors selected by $M_X$ and $M_{\tilde{Y}}$, i.e., $M_X \odot \mathbf{Z}$ and $M_{\tilde{Y}} \odot \mathbf{Z}$, must be able to generate the instance $X$ and the noisy label $\tilde{Y}$. With the sparsity constraint and the generation constraint, the mask variables learn to select the minimal but essential latent variables for generating the instance $X$ and the noisy label $\tilde{Y}$.
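A minimal sketch of such continuous masks with an L1 penalty, as described above (this is our own illustration under stated assumptions, not the authors' code; all names are hypothetical):

```python
# Sketch: continuous, learnable masks with an L1 sparsity penalty.
import torch
import torch.nn as nn

class FactorMasks(nn.Module):
    def __init__(self, n_factors: int):
        super().__init__()
        # Continuous, not binary, so gradient-based optimizers (SGD, Adam) apply.
        self.m_x = nn.Parameter(torch.ones(n_factors))        # mask M_X
        self.m_y_tilde = nn.Parameter(torch.ones(n_factors))  # mask M_Y~

    def forward(self, z: torch.Tensor):
        # Element-wise selections M_X ⊙ Z and M_Y~ ⊙ Z of the causal factors.
        return self.m_x * z, self.m_y_tilde * z

    def l1_penalty(self) -> torch.Tensor:
        # Added to the generation losses to keep only minimal essential factors.
        return self.m_x.abs().sum() + self.m_y_tilde.abs().sum()
```

Here the generation constraint corresponds to whatever reconstruction losses consume the two masked factor subsets; the L1 term is what pushes redundant mask entries toward zero.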

Q2: In the paper, it is stated clearly that they are different. Does that mean the method allows the latent factors generating the instance and the noisy label to be different, but they do not have to (or should not?) be in general?

A2: Thank you for your insightful question. The latent factors generating instances and noisy labels are different in general; thus, we state that they are different in the paper. However, this does not mean that our method cannot work in settings where the latent factors generating instances and noisy labels are the same. Our method allows the masks to select any subset of latent factors, including two identical subsets. When the factors generating the instance $X$ and the noisy label $\tilde{Y}$ are the same, the two subsets of latent variables selected by $M_X$ and $M_{\tilde{Y}}$ are the same. This is because the constraint that the selected latent variables can generate the instance $X$ and the noisy label $\tilde{Y}$ allows the mask variables to select the essential latent variables for generation, and with the sparsity constraint, redundant latent variables are removed.

To avoid confusion, we will state it in our paper.

Q3: Why is it 4 in particular? Does 4 play a crucial role in the result, theoretically or empirically?

A3: Thank you for your insightful question. We conducted a sensitivity test and selected 4 as the number of causal factors. Specifically, we performed a grid search on the CIFAR-10 dataset under instance-dependent noise with a noise rate of 0.5. The experiment results are shown below:

| Number of causal factors | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (%) | 95.77 ± 0.06 | 95.78 ± 0.06 | 95.89 ± 0.06 | 95.77 ± 0.07 | 95.86 ± 0.12 | 95.75 ± 0.11 |

| Number of causal factors | 8 | 10 | 12 | 14 | 16 | 18 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (%) | 95.70 ± 0.07 | 95.74 ± 0.09 | 95.84 ± 0.07 | 95.37 ± 0.30 | 90.72 ± 3.49 | 13.93 ± 10.60 |

The experiment results demonstrate that the proposed method is not sensitive to the number of causal factors; i.e., four is not crucial to the result.

Q4: Why are there two classifiers $g_Y^1$, $g_Y^2$? What is the difference between them? In Algorithm 1, $g_Y^1$, $g_Y^2$ are outputs of the WarmUp step, but the Warmup section on page 5 never mentions the two classifiers. In particular, in line 197, there is only one "classification network $g_Y$", which is used to "model the distribution $q_\psi(Y|X)$".

A4: We sincerely apologize for the confusion we caused. To avoid error accumulation, we follow previous work [1] in using two classifiers that select clean samples for each other. The classifiers have the same network structure but different initial parameters, and they are trained on different clean samples. To avoid confusion, we will clarify in the Warmup section that we use two classifiers.

Q5: Is $g_Y$ or the $q_\psi(Y|X)$, or both, trainable?

A5: The distribution $q_\psi(Y|X)$ is modeled by the neural network $g_Y$. For example, given a sample $(x, y)$, $g_Y(x)$ can model $q_\psi(Y|X=x)$. We do not have another network to model $q_\psi(Y|X=x)$. The neural network $g_Y$ is trainable.

Reference

[1] Han, Bo, et al. "Co-teaching: Robust training of deep neural networks with extremely noisy labels." Advances in neural information processing systems 31 (2018).

Comment

Thank you for your answers. I now have a better understanding of the proposed method. However, I do not think the contribution is significant enough. The algorithm includes several independent techniques (a warmup step to pick out a large number of clean samples, using different classifiers for robustness) that play crucial roles in the final performance. While these inclusions are justified performance-wise, they also make it difficult to recognize the advantage of the proposed ideas (the causal model as well as the optimization criterion) in comparison to all baseline methods.

After all consideration, I would like to keep my initial rating.

Comment

Thanks for your reply. We would appreciate it if you could re-evaluate our contributions:

  1. We offer a new perspective on estimating noise transitions. Existing methods rely on manually defined similarities for transition matrices, which may not hold in the real world. In our work, we do not predefine the similarity; instead, our model can capture the relations between noise transition matrices by recovering the causal factors that generate noisy labels.

  2. We have identified two important oversights in existing generative methods. 1) Existing methods assume that an instance $X$ directly causes the noisy label $\tilde{Y}$. However, the real cause of the label noise is some latent causal factors rather than the image itself, which is very intuitive: we usually say that light and angle cause noise, but we do not say that the image causes noise. 2) Existing methods assume that the latent variables $\mathbf{Z}$ are independent. However, latent variables are not independent but causally related in the real world; for example, the existence of a lamp or the sun can influence the brightness of an image. These oversights lead to a gap between the generation process of existing methods and the real-world generation process.

  3. Building on the oversights identified, we propose a more realistic model that can be effectively optimized. First, in our causal model, the cause of the noisy labels is some latent causal factors rather than the image itself, and the causal factors in our model are not independent but causally related. A method that can recover the causal factors and their associations is developed. Second, motivated by the fact that not all features in the images cause noisy labels, a masking mechanism is also designed. Specifically, two masks $M_X$ and $M_{\tilde{Y}}$ are employed to select subsets of causal factors to generate the instance $X$ and the noisy label $\tilde{Y}$. With the L1 norm constraint, the mask variables learn to select the minimal but essential causal factors for generation. Ablation studies in Tables 3 and 4 of the PDF verify the advantage of our method.

  4. Empirically, the performance of our method is state-of-the-art on both synthetic and real-world datasets. For example, the accuracy (percentage) of our method is 74.60 ± 0.17, while that of the best baseline is 61.54 ± 0.34 on the CIFAR-100 dataset under instance-dependent label noise with a noise rate of 0.5. On the real-world dataset WebVision, the accuracy of our method is 79.84, while that of the best baseline is 77.78.

The above summary will be added to the introduction of the final version of the paper.

Regarding your concerns about the independent techniques of our algorithm: our causal model can also work without clean samples and multiple classifiers. We provide an empirical result in which our method works with PES [1], referred to as CSGN-PES. This version does not rely on clean samples and has only one classifier. The experiment results are shown in Table 11 of the PDF; they show that our method clearly improves the performance of PES.

With regard to the advantage of the proposed ideas, we have sufficient empirical evidence that our causal model (Figure 2(b) of the paper) is superior to the existing one (Figure 2(a) of the paper). Specifically, we replace the causal model in CausalNL [2] and InstanceGM [3] with our causal model. The modified versions of CausalNL and InstanceGM are referred to as "CausalNL'" and "InstanceGM'", respectively. We keep the experiment settings the same as in their original experiments. For example, CausalNL uses two classifiers and does not use the MixMatch technique; thus, CausalNL' uses two classifiers and does not use the MixMatch technique. Similarly, InstanceGM uses two classifiers and the MixMatch technique; thus, InstanceGM' uses two classifiers and the MixMatch technique. The experiment results are shown in Tables 3 and 4 of the PDF. Our method clearly performs better than the existing method, which demonstrates that the proposed idea, our causal model, is better than the existing one.

Note that baselines such as CausalNL and DivideMix also use some tricks, including selecting clean samples and using different classifiers for robustness. For comparison, our algorithm includes these techniques. However, the experiments in Tables 3 and 4 of the PDF demonstrate the advantage of our model.

Reference

[1] "Understanding and improving early stopping for learning with noisy labels." Advances in Neural Information Processing Systems 34 (2021): 24392-24403.

[2] "Instance-dependent label-noise learning under a structural causal model." Advances in Neural Information Processing Systems 34 (2021): 4409-4420.

[3] "Instance-dependent noisy label learning via graphical modelling." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2023.

Review (Rating: 7)

The author addresses the problem of instance-dependent label noise. While previous research has used models that generate images from true labels and then predict noisy labels from the images and true labels, the author takes a different approach. The proposed model generates images from some latent factors derived from the true labels and predicts noisy labels from other latent factors derived from the same true labels. This process is implemented using a sparse constraint on the latent factors that contribute to the image and noisy label generation, along with a VAE (Variational Autoencoder) method. This generative approach necessitates true labels, and the author employs a semi-supervised model (MixMatch) to predict the true labels.

Strengths

The proposed method by the author is both intuitive and implemented in a logical manner, demonstrating remarkable performance improvements across various benchmarks. Although the proposed method involves multiple loss functions for different objectives, which introduces several hyperparameters, experimental results show that the sensitivity to these hyperparameters is not significant.

Weaknesses

Although the proposed method is constructed with a sound idea and logical progression, it seems highly dependent on MixMatch. In other words, there is a lack of analysis on whether the model in Figure 2(b) is more effective than the model in Figure 2(a). By referencing representative papers of Figure 2(a), such as CausalNL and InstanceGM, a comparison with InstanceGM using MixMatch could provide a better explanation of the effectiveness of the model in Figure 2(b). However, in the experiments, only CausalNL was compared.

Questions

To better demonstrate the effectiveness of the proposed model, I recommend either reporting the performance of InstanceGM or describing the results of applying various classifier training methods instead of relying solely on MixMatch.

Limitations

Author Response

Response to Reviewer S6tG

Q1: There is a lack of analysis on whether the model in Figure 2(b) is more effective than the model in Figure 2(a). To better demonstrate the effectiveness of the proposed model, I recommend either reporting the performance of InstanceGM or describing the results of applying various classifier training methods instead of relying solely on MixMatch.

A1: Thank you for your suggestion. We have conducted experiments to assess whether the model in Figure 2(b) outperforms the model in Figure 2(a). Specifically, we replaced the generative models in CausalNL and InstanceGM (as shown in Figure 2(a)) with our proposed model (Figure 2(b)), while maintaining identical settings for the other experimental parameters. These experiments were conducted on the CIFAR-10 and CIFAR-100 datasets, which feature instance-dependent label noise. We refer to the modified versions of CausalNL and InstanceGM as "CausalNL'" and "InstanceGM'", respectively. The results, presented in Tables 3 and 4 of the PDF, indicate that the model in Figure 2(b) is indeed more effective than the model in Figure 2(a). We will include these results in the final version of our paper.

Comment

Thank you for your response. Are you indicating that the CausalNL and InstanceGM methods included in Tables 3 and 4 are algorithms that utilize the MixMatch technique?

Comment

Thanks for your comments. We follow the same setting as their original methods. CausalNL’ and CausalNL do not utilize the MixMatch technique; InstanceGM’ and InstanceGM utilize the MixMatch technique.

Comment

Thank you for providing valuable additional experimental results under tight time constraints. These results have allowed me to assess the practicality of the causal structure. It seems that you have provided a high-performing algorithm based on a sound and reasonable idea. I will maintain my decision.

Comment

Thanks again for your valuable time and feedback. We deeply appreciate your recognition of our work.

Best regards,

Authors

Author Response

Global response

We sincerely appreciate the time and effort the reviewers invested in reviewing our manuscript. Your insightful comments and constructive advice have been instrumental in enhancing the quality and clarity of our work. We are grateful for your detailed feedback and guidance. In response to several frequently asked questions, we present a clarification and introduce a new experimental setting. The reference list below includes all papers cited in our responses.

Distinguishing our model from instance-independent approaches

In our model, the generation of the instance $X$ and the noisy label $\tilde{Y}$ involves common latent causes $\mathbf{Z}$, such as $Z_3$ depicted in Figure 2(b). In causality theory, the presence of common causes between two variables, such as $X$ and $\tilde{Y}$, establishes dependence between them [1,2]. This indicates that our model is instance-dependent, not instance-independent ($P(\tilde{Y}|Y,X) = P(\tilde{Y}|Y)$).
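As a brief illustration in our own notation (not an equation from the paper), marginalizing over a shared latent cause ties the noisy label to the instance:

```latex
% Illustrative only: dependence induced by a common latent cause Z.
p(\tilde{y} \mid y, x) = \int p(\tilde{y} \mid y, z)\, p(z \mid y, x)\, \mathrm{d}z
```

Because the right-hand side depends on $x$ through $p(z \mid y, x)$, the induced transition matrix $T(x) = [\,p(\tilde{y}=j \mid y=i, x)\,]$ varies across instances.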

The setting on the moon dataset.

We conducted an experiment using a synthetic dataset known as the "moon dataset". The data points have two dimensions and are categorized into two distinct categories. To create noisy labels caused by a single factor, the noise rate for each data point depends on the value of its second dimension. The training data are shown in Figure 1 of the PDF. We trained our model on this synthetic dataset with the dimension of the latent factor $Z$ set to 2.
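A minimal, hedged sketch of this setup, assuming scikit-learn's two-moons generator; the exact mapping from the second coordinate to a flip probability is our choice for illustration, as the response does not specify it:

```python
# Sketch: noisy labels whose flip rate depends on the second dimension.
import numpy as np
from sklearn.datasets import make_moons

# Two-dimensional, two-class "moon dataset".
X, y = make_moons(n_samples=2000, noise=0.1, random_state=0)

# Map the second coordinate to a per-sample flip probability in [0, 0.4]
# (this particular mapping is an illustrative assumption).
x2 = (X[:, 1] - X[:, 1].min()) / (X[:, 1].max() - X[:, 1].min())
flip_prob = 0.4 * x2

rng = np.random.default_rng(0)
flip = rng.random(len(y)) < flip_prob
y_noisy = np.where(flip, 1 - y, y)  # binary labels, so a flip is 1 - y
```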

Reference list

[1] Schölkopf, Bernhard. "Causality for machine learning." Probabilistic and causal inference: The works of Judea Pearl. 2022. 765-804.

[2] Reichenbach, Hans. The Direction of Time. Berkeley, CA, USA: University of California Press, 1956.

[3] Xia, Xiaobo, et al. "Part-dependent label noise: Towards instance-dependent label noise." Advances in Neural Information Processing Systems 33 (2020): 7597-7610.

[4] Bae, HeeSun, et al. "Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning." The Twelfth International Conference on Learning Representations, 2024.

[5] Bai, Yingbin, et al. "Understanding and improving early stopping for learning with noisy labels." Advances in Neural Information Processing Systems 34 (2021): 24392-24403.

[6] Yao, Yu, et al. "Instance-dependent label-noise learning under a structural causal model." Advances in Neural Information Processing Systems 34 (2021): 4409-4420.

Comment

Dear Reviewers,

We really appreciate your constructive comments, which helped us improve this paper. We have carefully addressed the concerns you raised. If there are any unresolved concerns, we would be glad to have further discussions.

Best regards,

Authors

Final Decision

Two reviewers had highly critical reviews. However, the authors engaged substantially with both reviewers during the discussion period. It is my opinion that the concerns of both reviewers were sufficiently addressed by the authors, even if the reviewers did not acknowledge this.

In one case, after a very fruitful back-and-forth, a reviewer decided that the independent contributions of the different modeling components were unclear and thus that the paper did not represent a meaningful methodological innovation. However, the authors countered this with results from a compelling experiment in which they replaced the causal models of two baseline methods (CausalNL and InstanceGM) with their proposed causal model and showed that the performance of both baselines improved substantially. The authors further provided results from ablation studies in their rebuttal which suggested that the impressive performance of their proposed approach was not owed simply to a "bag of tricks". I found these supplementary experiments very convincing and commend the authors for supplying them on short notice.

In another case, a reviewer expressed skepticism that the paper had correctly implemented the myriad of baselines to which they compare the proposed method. In response, the authors very carefully walked through each baseline method, showing how the numbers obtained by their implementation closely matched the numbers reported in each of the relevant papers. I found this convincing. It is of course fair for the reviewer to ask this question, particularly since the paper reports SOTA results in every single experiment, along every metric, as compared to over 10 recent baselines. Such sweeping advantage is uncommon in my experience, and skepticism is warranted. However, some level of trust is inherent to the conference reviewing process, and the authors have done enough to earn it.

Beyond these two critical reviewers (whose concerns were appropriately addressed by the authors), the remaining reviewers were excited by the paper. They appreciated the elegance and conceptual clarity of the proposed causal model, found the motivation to be timely and important, and were very impressed by the proposed approach's empirical performance. I think this is a very solid paper overall.

If accepted, the authors should edit the camera-ready version to incorporate all of the supplemental experiments they reported during the discussion period.