Teaching Humans Subtle Differences with DIFFusion
Abstract
Reviews and Discussion
The authors combine popular methods and models, such as CLIP embeddings and diffusion models, in a novel way to propose an algorithm that detects subtle discriminative differences and synthetically introduces them into images while keeping individual examples recognizable. The methodology boasts the ability to train on small unpaired datasets and also claims to pick up differences missed by domain experts. This work could have meaningful impact in scientific education and training, particularly in domains where visual expertise is crucial but difficult to teach through traditional methods.
Strengths and Weaknesses
Strengths:
- The proposed methodology works on unpaired datasets as well as on limited training data, which makes it useful for niche subjects with less available data.
- Using machine learning to provide a method to teach humans subtle differences is a novel idea and provides a significant improvement in human classification ability.
- Comprehensive quantitative testing across multiple domains, with appropriate baseline comparisons including recent methods like TIME and Concept Sliders.
- The paper includes thoughtful analysis of dataset bias visualization and the relationship between training set size and performance.
Weaknesses:
- While the authors have provided the basic details behind the sub-components, further detail on the reasoning behind choosing the exact methodology is missing. Furthermore, the experiments focus on pairwise classification; it would be good to have experiments that capture performance on multi-class problems.
- More commentary analyzing the trade-off between identity preservation and class change would have given the proposed algorithm a stronger backbone.
- While there is an experiment showcasing the teaching capability, the paper lacks expert verification to ensure the generated changes align with scientific facts. It would be good to benchmark against current knowledge before expanding further.
- Heavy adversarial testing is needed to make sure this methodology is not influenced by inherent biases in machine-learning models (especially the CLIP model used for embedding) if it is to be used for teaching.
Questions
- How do you ensure the difference of means approach doesn't amplify false correlations in training data?
- What happens when class boundaries are not linearly separable in CLIP embedding space?
- How would you validate that discovered visual patterns represent genuine scientific phenomena rather than dataset artifacts?
Limitations
Yes
Final Justification
While the answers clarify my questions, they do not push the paper from a borderline accept to an accept, since I still feel the flaws in the technique persist.
Formatting Issues
No
Thank you very much for your feedback. We are glad to read that you found our method novel and the core idea meaningful (along with 3qb4). We are also happy to find that you are satisfied with the experiments (along with Qza6). We hope we’ve answered your questions and addressed the weaknesses in the following sections.
“What is the reasoning behind the exact methodology?”
The goal of our method is to perform an edit to an image that only modifies the category, with as few modifications to the instance as possible. As such, we must disentangle the category from the instance. We chose EF-DDPM as the inversion technique since this inversion “imprints” the original image into the noise maps better, so when we perform our edit, it corrupts the instance of the image less. To capture category differences, we propose simple algebra in the image-conditioning space, which automatically captures the discriminative features, since the subtraction of the representative vectors captures the key differences between the categories.
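For concreteness, here is a minimal sketch of that conditioning-space arithmetic (this is not the authors' released code; `invert_ddpm` and `generate` are hypothetical placeholders for the edit-friendly DDPM inversion and conditional sampling stages, and the file paths are illustrative):

```python
import torch
import clip  # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mean_embedding(paths):
    """Average unit-normalized CLIP image embedding over a set of images."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    feats = model.encode_image(batch).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

source_paths = ["data/source/000.png"]  # placeholder class image lists
target_paths = ["data/target/000.png"]

# Difference of class means captures the discriminative direction.
delta_c = mean_embedding(target_paths) - mean_embedding(source_paths)

# Shift the input's conditioning by a scaled delta_c; omega trades off
# class change against identity preservation.
omega = 0.8
c_edit = mean_embedding(["data/source/000.png"]) + omega * delta_c

# noise_maps = invert_ddpm(input_image)            # hypothetical helper
# edited = generate(noise_maps, c_edit, t_skip=2)  # hypothetical helper
```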
“Could you provide an experiment that captures performance for multi-class problems?”
While DIFFusion is currently framed around a binary source-to-target transformation, the mechanism of computing the diff arithmetic vector that discriminates between classes can be extended to multi-class scenarios. One approach involves designating the positive embedding as the target class mean and the negative embedding as the average of all other classes. This “one-vs-rest” approach would allow us to distinguish the features of one class against all other classes in the dataset. We demonstrate this technique on an extended AFHQ dataset of 3 classes (cat, dog, and wildlife) as an example. We trained 3 separate classifiers (cat vs. dog+wildlife, dog vs. cat+wildlife, wildlife vs. cat+dog) and computed the corresponding three sets of positive and negative average embeddings ((positive: cat, negative: dog+wildlife), (positive: dog, negative: cat+wildlife), (positive: wildlife, negative: cat+dog)). We then ran inference with these 3 corresponding sets of classifiers and positive/negative average embeddings. We evaluated flip rate and LPIPS when going from the validation sets of the negative classes (e.g., dog+wildlife) to the positive class (e.g., cat), and repeated this for cat+wildlife to dog and cat+dog to wildlife. Since we cannot submit images, we report flip rate and LPIPS values; we will add the qualitative examples to the camera-ready's supplemental. A code sketch of this construction follows the table below.
In the table below, each column is the starting class and each row is the target class; each cell reports LPIPS / flip rate.
| Target \ Source | Dog | Cat | Wildlife |
|---|---|---|---|
| Dog | x | 0.2642 / 1.0 | 0.3687 / 1.0 |
| Cat | 0.2926 / 1.0 | x | 0.4190 / 1.0 |
| Wildlife | 0.3096 / 1.0 | 0.2676 / 1.0 | x |
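A minimal sketch of the one-vs-rest construction (reusing the hypothetical `mean_embedding` helper from the earlier sketch; the path lists are placeholders, and pooling all non-target images for the negative mean is one reading of the description above):

```python
# One-vs-rest class-difference directions for a 3-class AFHQ-style setting.
classes = {
    "cat": ["afhq/cat/000.png"],            # placeholder path lists
    "dog": ["afhq/dog/000.png"],
    "wildlife": ["afhq/wildlife/000.png"],
}

def one_vs_rest_delta(target: str):
    """Positive = target class mean; negative = mean over all other classes."""
    positive = mean_embedding(classes[target])
    rest = [p for name, paths in classes.items() if name != target for p in paths]
    negative = mean_embedding(rest)
    return positive - negative

deltas = {name: one_vs_rest_delta(name) for name in classes}
```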
“What is the trade-off between identity preservation and class change?”
The trade-off between identity preservation (dominated by LPIPS) and class change (dominated by success rate) is further analyzed in Appendix B.1, where we plot SR vs. LPIPS curves. Generally speaking, a higher AUC in these curves indicates an overall better classifier flip rate and identity preservation relative to the input.
“How do you guarantee alignment with scientific facts?”
Identifying whether the edits reflect dataset biases, artifacts, or true scientific discriminative features would require an expert. All three cases are interesting, though: if an edit reflects an artifact or a dataset bias, it is good to know about it so that it can be avoided in future datasets that might be used for ML training. As a result, all three scenarios are useful, but we agree that an expert is needed to disentangle them.
“How does this method get influenced by CLIP’s biases?”
This is a good point. Fortunately, fine-tuning for a few steps ensures that the conditioning provided by CLIP ends up generating images that are in distribution for the categories, in large part overriding CLIP's biases. However, we agree that this should be an active area of research before this method gets deployed in the real world, and we plan to explore it in future work.
“How do you ensure the difference of means approach doesn't amplify false correlations in training data?”
This could definitely be the case; it depends on what is meant by “false”. For example, in the spurious-correlations experiment (see Section 4.5), we show that “false” correlations, such as Dachshunds always appearing in deserts and Corgis in jungles, definitely influence the final edit: Dachshunds are converted to Corgis, but the environment also becomes slightly more “jungle”-like, and vice versa. The point of our method is that it picks up on the strongest discriminative feature; if that feature is “false”, the edit will highlight this. Our method can thus be used for automatic dataset-bias identification.
“What happens when class boundaries are not linearly separable in CLIP embedding space?”
We make the reasonable assumption that the classes are linearly separable in some feature space. The choice of feature space is important, and if the classes are not linearly separable in it, there are many representation-learning frameworks that can make them separable.
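As a hedged illustration of checking this assumption, one can fit a linear probe on frozen embeddings and inspect held-out accuracy (the arrays below are random placeholders standing in for precomputed CLIP features and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))    # placeholder for CLIP embeddings
y = rng.integers(0, 2, size=200)   # placeholder binary class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out linear-probe accuracy: {probe.score(X_te, y_te):.3f}")
```

High probe accuracy suggests the classes are approximately linearly separable in that feature space; chance-level accuracy suggests a different feature space or an additional representation-learning step is needed.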
Thank you for addressing my questions. While the clarifications are helpful, they do not significantly improve my view of the paper’s validity. Therefore, my evaluation and rating remain unchanged.
The paper proposes DIFFusion, a counterfactual image-editing framework to teach humans to notice subtle differences between visually similar classes. DIFFusion starts from a real image, inverts it into diffusion-model noise maps, adds a class-difference vector in CLIP-embedding space, and finally re-decodes.
By controlling the strength of this vector and the number of denoising steps skipped, DIFFusion outputs identity-preserving images in which only the minimal discriminative features change. Experiments are run on six binary-class datasets.
Strengths and Weaknesses
Strengths:
- Proposes an interesting idea: diffusion-based image editing not only as an explainability tool but as an explicit machine-teaching mechanism. The approach synthesizes counterfactual variations of a single example, then trains humans to recognise class-distinguishing micro-features that experts may find challenging to verbalise. The core idea of subtracting the CLIP embedding mean of the source class from that of the target class to create a class-difference vector Δc, then adding a scaled version of Δc to the conditioning signal of a diffusion model, may come across as simple but could have implications in many domains. The overall concept fills a gap where labelers cannot articulate decision boundaries but still need to learn them.
- The approach is methodologically sound, and Eqs. 1–8 provide a clean mathematical derivation. The ablations on ω and Tskip are also principled.
- The experimental suite spans six markedly different datasets, three of which almost never appear in mainstream CV benchmarking papers.
- The authors have open-sourced code, pre-trained checkpoints, and scripts for all experiments.
Weaknesses:
- DIFFusion is intrinsically framed around a single source→target transformation, and the arithmetic vector Δc is computed as the difference between two CLIP class means. Here, success means an oracle classifier flips from one class label to the other, and the user study tests before/after recognition of exactly two categories. Extending this to multi-class involves a non-trivial redesign of the sampling policy and likely new evaluation metrics.
- The other challenge is that the entire pipeline hinges on CLIP’s semantic space being both linear and meaningful for a given domain. On out-of-distribution domains, CLIP embeddings can collapse, misalign, or over-compress small cues, potentially steering Δc in arbitrary directions. There is no quantification of how sensitive performance is to embedding quality.
- The user evaluation comprises 30 participants split into three groups of ten, a sample that yields wide confidence intervals and low statistical power, especially when broken down by dataset and expertise level. Although the appendix displays error bars, the NeurIPS checklist marks “statistical significance” as NA, and no formal power analysis is provided.
- Each image requires DDPM inversion plus generation of ten candidate edits, of which only the first LPIPS-minimal, label-flipping image is kept. On an A100 GPU this takes tens of seconds per image, and roughly 40 GPU-hours for the six datasets. Such overhead may be acceptable for expert teaching tools but is a limitation for real-time or mobile deployment.
- The paper is missing an analysis of adversarial or privacy risks. An attacker could craft deceptive counterfactuals or exploit the inversion step to recover information about proprietary training images.
Questions
- Please quantify how the visual algebra automatically isolates discriminative features; that is, are the Δc directions regularized?
- Please provide a legend for ω and Tskip.
- Since selecting the first edit that flips the prediction could bias toward minimal LPIPS, how would you verify that no adversarial artefacts emerge with this approach?
- Please report standard deviations over the 5 ω/Tskip samples to gauge stability.
- Can you provide the exact video frame rate and whether participants could replay? Also, please clarify the IRB approval ID.
- Please clarify if DIFFusion can chain two vectors, for example Monarch->Viceroy->Queen without any re-inversion?
- What is the computational cost of DIFFusion?
Limitations
N/a
Final Justification
I'm happy with the response the authors have provided, and I'll keep my score since it is already high.
Formatting Issues
None.
Thank you very much for your feedback. We’re very glad that we were able to demonstrate our method as both a technique for explainability and a mechanism for machine-teaching, and that these were both interesting to you! We’re glad that you found our approach methodologically sound (along with reviewer KpBL). We’re also happy to hear that you found our experiments to be principled (along with Qza6 and FpZn). We hope that we’ve answered your questions and addressed the weaknesses in the following sections.
“How can we extend our method to multi-class settings?”
While DIFFusion is currently framed around a binary source-to-target transformation, the mechanism of computing the diff arithmetic vector that discriminates between classes can be extended to multi-class scenarios. One approach involves designating the positive embedding as the target class mean and the negative embedding as the average of all other classes. This “one-vs-rest” approach would allow us to distinguish the features of one class against all other classes in the dataset. We demonstrate this technique on an extended AFHQ dataset of 3 classes (cat, dog, and wildlife) as an example. We trained 3 separate classifiers (cat vs. dog+wildlife, dog vs. cat+wildlife, wildlife vs. cat+dog) and computed the corresponding three sets of positive and negative average embeddings ((positive: cat, negative: dog+wildlife), (positive: dog, negative: cat+wildlife), (positive: wildlife, negative: cat+dog)). We then ran inference with these 3 corresponding sets of classifiers and positive/negative average embeddings. We evaluated flip rate and LPIPS when going from the validation sets of the negative classes (e.g., dog+wildlife) to the positive class (e.g., cat), and repeated this for cat+wildlife to dog and cat+dog to wildlife. Since we cannot submit images, we report flip rate and LPIPS values; we will add the qualitative examples to the camera-ready's supplemental.
In the table below, each column is the starting class and each row is the target class; each cell reports LPIPS / flip rate.
| Target \ Source | Dog | Cat | Wildlife |
|---|---|---|---|
| Dog | x | 0.2642 / 1.0 | 0.3687 / 1.0 |
| Cat | 0.2926 / 1.0 | x | 0.4190 / 1.0 |
| Wildlife | 0.3096 / 1.0 | 0.2676 / 1.0 | x |
“How do you account for CLIP embeddings collapsing/misaligning/overcompressing small cues?”
Great point! We noticed this to be the case in some datasets too, particularly OOD datasets. This is where fine-tuning even a little (< 2000 steps) helps a lot. Without fine-tuning, for example, the Butterfly and Retina datasets struggle, because these types of images were likely not seen during CLIP's initial training.
“Is the user evaluation statistically significant?”
We performed an independent two-sample t-test between the results of our method compared to unpaired results, and the best baseline compared to unpaired results. The null hypothesis under which the p-value is computed is that the means of the two underlying distributions are the same. For our result, that probability is less than 5% in both studies, whereas for the best baseline it is above 80%. While the number of participants in our study is low, such a striking difference in p-values gives us confidence in our results. However, we agree that the number of participants is on the lower side, and we will run a larger study in future work.
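For reference, a sketch of the test described above (the per-participant accuracy arrays are invented placeholders, not the actual study data):

```python
from scipy.stats import ttest_ind

# Placeholder per-participant accuracies for two study groups.
ours     = [0.90, 0.85, 0.80, 0.95, 0.75, 0.85, 0.90, 0.80, 0.85, 0.90]
unpaired = [0.60, 0.55, 0.65, 0.60, 0.50, 0.55, 0.60, 0.65, 0.55, 0.60]

# Null hypothesis: both groups' accuracies share the same underlying mean.
t_stat, p_value = ttest_ind(ours, unpaired)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # reject H0 at the 5% level if p < 0.05
```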
“Please quantify how the visual-algebra automatically isolates discriminative features, meaning are the ∆c directions regularized?”
The approach is discriminative because the difference of means is the same as an LDA (linear discriminant analysis) classifier with an identity covariance. However, our approach would work with any discriminative linear classifier as well, such as logistic regression or a linear SVM.
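To make the equivalence explicit (standard LDA notation, not symbols from the paper):

```latex
% Fisher/LDA projection direction with shared class covariance \Sigma:
w = \Sigma^{-1}\,(\mu_{+} - \mu_{-})
% Under the identity-covariance assumption \Sigma = I, this reduces to the
% difference of class means used by the method:
w = \mu_{+} - \mu_{-}
```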
“Please provide a legend for ω and Tskip.”
Thank you for pointing this out! We will update Figure 8 to label ω and Tskip there in the camera-ready.
“Since selecting the first edit that flips prediction could bias toward minimal LPIPS, how would you verify no adversarial artefacts emerge with this approach?”
Since the conditioning space is the CLIP image encoder's latent space, and the diffusion model we used was trained to be conditioned on this space, the prior is to generate images that lie on the natural-image manifold. We verify this qualitatively.
“Please report standard deviations over the 5 ω/Tskip samples to gauge stability.”
Mean ± standard deviation by dataset and manipulation value ω (columns correspond to three ω settings):
| Dataset | ω₁ | ω₂ | ω₃ |
|---|---|---|---|
| afhq | 0.87 ± 0.07 | 0.89 ± 0.04 | 0.89 ± 0.04 |
| butterfly | 0.54 ± 0.22 | 0.60 ± 0.20 | 0.71 ± 0.11 |
| celebahq | 0.79 ± 0.16 | 0.82 ± 0.11 | 0.86 ± 0.05 |
| kermany | 0.54 ± 0.25 | 0.71 ± 0.17 | 0.79 ± 0.12 |
| kikibouba | 0.68 ± 0.19 | 0.70 ± 0.17 | 0.75 ± 0.12 |
| madsane | 0.88 ± 0.06 | 0.90 ± 0.02 | 0.90 ± 0.00 |
In Table 2, we reported results for a single ω per dataset, selected according to the best LPIPS; AFHQ uses one ω value, and the rest share another.
“Can you provide the exact video frame rate and whether participants could replay? Also, please clarify the IRB approval ID.”
The frame rate was 1 FPS (so a 2-second GIF). Our study is under IRB approval, and we will release the approval ID in the camera-ready.
“Please clarify if DIFFusion can chain two vectors, for example Monarch->Viceroy->Queen without any re-inversion?”
While the core mechanism of manipulating conditioning vectors could theoretically be extended, directly chaining multiple vectors without re-inversion is not explicitly supported by the current method. It would likely require either re-inverting the image after each transformation to obtain new noise maps for the intermediate image, or developing a more complex sampling strategy, which is a promising direction for future work.
“What is the computational cost of DIFFusion? How does it compare to existing methods / is it a blocker for real-time deployment?”
- For evaluation, as noted, we generate ten candidate edits per image to thoroughly explore the SR versus LPIPS trade-off against baselines. However, since ω and Tskip are only used during sampling, the inversion is performed only once per input image. Therefore, the total computational cost for each image is actually 1×(inversion cost) + 10×(generation cost), not 10×(inversion + generation); see the cost sketch after this list.
- In our experiments, and in the table above, we can see that a choice of Tskip = 2 and ω between 0.8 and 0.9 is usually near-optimal for most if not all datasets. Therefore, for real-world scenarios, the search space could be dramatically reduced when running on a new domain.
- The effective per-image runtime (if run in a batch) is approximately 5 seconds on an A100 GPU.
- As diffusion models continue to evolve towards faster and more efficient architectures, such as distilled diffusion models (which can operate with only a few or even a single step), DIFFusion's runtime will inherently improve.
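As a small illustration of this accounting (function name and timings are hypothetical placeholders, not measured numbers):

```python
# Back-of-the-envelope cost model implied by the breakdown above: one
# inversion is shared across all candidate edits, so the per-image cost
# grows only with the generation term. Timings are hypothetical.
def per_image_cost(t_invert: float, t_generate: float, n_candidates: int = 10) -> float:
    return 1 * t_invert + n_candidates * t_generate

print(per_image_cost(t_invert=3.0, t_generate=0.2))  # e.g. 5.0 seconds
```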
Given that the discussion phase approaches its end, we would like to ask the reviewer whether our rebuttal is clear and if there is anything else we can provide to clarify further. Thank you!
I'm happy with the response the authors have provided and encourage them to include these in the updated version. I'll keep my score since it is already high.
DIFFusion is a diffusion-based method presented to help human experts understand subtle discriminative features in images that are difficult to articulate. The method is said to perform well in challenging scenarios, which is shown in six experimental settings and evaluated with multiple metrics.
Disclaimer: I do want to highlight my lack of expertise in this particular field. I can assess the soundness of this paper; however, I am less confident in assessing whether the chosen baselines are the correct ones to compare to, or whether the evaluation is standardized. For these matters I'd defer to the other reviewers.
Strengths and Weaknesses
Strengths:
- The experiments are designed to rigorously evaluate the effectiveness of the DIFFusion system in generating counterfactual visualizations and its utility in teaching humans to discern subtle visual differences.
- The user study seems convincing, showing solid improvements from using DIFFusion.
- Table 2 paired with Fig. 3 shows superior performance across the chosen datasets, which is convincing and compelling.
- The theory seems sound.
Weaknesses:
- In the user study, the best baselines often perform worse than a user analyzing unpaired data. This makes me think that either the experiment is not useful for evaluating the claims, or the baselines are not correctly chosen. I'd like the authors to clarify this point.
- It is not clear where the technical novelty lies. Please clarify.
- It is said that other teaching methods "typically require aligned, abundant data and focus on single modalities" and that "our work extends these efforts". I think it would strengthen the paper to clarify what has been done to mitigate the data problems, and whether DIFFusion would perform as well as the baselines in the presence of abundant data.
Questions
- Why do the best baselines sometimes perform worse than a user analyzing unpaired data in the user study, raising concerns about the experiment's utility or the baseline selection?
- Could you please elaborate on the core technical novelty of DIFFusion, distinguishing its contributions from existing state-of-the-art image editing and counterfactual generation methods?
- Could the authors clarify the specific mechanisms within DIFFusion that mitigate these data requirements, especially regarding "aligned, abundant data"? Furthermore, how would DIFFusion's performance, particularly its LPIPS, compare to baselines when abundant data is available, and would it still offer a significant advantage?
Limitations
Sensitivity to Dataset Bias and Limited Precise Control: A primary limitation is that DIFFusion edits images based on differences between class mean embeddings. This makes the method sensitive to dataset bias.
Performance on "Common Objects" Datasets: While DIFFusion excels on scientific datasets and those where visual details are hard to describe textually (like Black-Holes and KikiBouba), its LPIPS (perceptual distance) performance on "common objects" datasets like AFHQ and CelebA-HQ is not always the best compared to baselines.
Final Justification
My concerns about technical novelty remain. However, the importance of the experiments and why baselines perform poorly has been addressed. I will increase my score.
Formatting Issues
None
Thank you very much for your feedback. We are grateful that you found our experiments to be rigorously evaluated, both from a metrics point of view and from a user’s educational standpoint. We are glad that you found our experiments compelling and convincing (along with reviewer 3qB4 and FpZn), our results good (along with KpBL), and that the theory seemed reasonable to you. We hope we’ve answered your questions and addressed the weaknesses in the following sections.
“Why do the best baselines sometimes perform worse than a user analyzing unpaired data in the user study, raising concerns about the experiment's utility or the baseline selection?”
The chosen baselines perform well on familiar datasets where humans already understand class differences (e.g., AFHQ, CelebA-HQ), confirming their appropriateness for counterfactual image generation. However, they struggle on the scientific datasets (Table 2), likely due to visual corruptions that distract users. Meanwhile, we can only conduct meaningful user studies on these challenging scientific datasets, where participants must learn new concepts, which is precisely where the baselines fail. This methodological constraint actually highlights DIFFusion's value: when genuine learning is required on unfamiliar content, DIFFusion's minimal, identity-preserving changes that reveal true discriminative features become crucial for effective human learning.
“Could you please elaborate on the core technical novelty of DIFFusion, distinguishing its contributions from existing state-of-the-art image editing and counterfactual generation methods?”
DIFFusion introduces a novel perspective on diffusion-based image editing as a machine-teaching tool, generating counterfactuals to help humans learn class-discriminative features that are hard to verbalize. Unlike prior counterfactual generation work that relies on instance-level perturbations [1,2], DIFFusion uses simple latent-space arithmetic that enables realistic, fine-grained edits on limited data.
“Could the authors clarify the specific mechanisms within DIFFusion that mitigate these data requirements, especially regarding "aligned, abundant data"?”
DIFFusion avoids the need for paired data by defining a semantic shift between classes. The shift can be applied to any input to guide the generation toward the target class. To address data scarcity, DIFFusion (1) relies on the robustness of CLIP embedding averages, where only a few samples per class are needed to define meaningful class directions, and (2) uses LoRA when needed to fine-tune a large pretrained diffusion model with just tens to hundreds of examples.
"Furthermore, how would DIFFusion's performance, particularly its LPIPS, compare to baselines when abundant data is available, and would it still offer a significant advantage?"
DIFFusion performs better than the baselines even when abundant data is available. To clarify, the results in Table 2 are obtained after training and computing the directions over the full training set of each dataset.
[1] Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Diffusion models for counterfactual explanations, 2022.
[2] Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Adversarial counterfactual visual explanations, 2023.
Given that the discussion phase approaches its end, we would like to ask the reviewer whether our rebuttal is clear and if there is anything else we can provide to clarify further. Thank you!
My concerns about technical novelty remain. However, the importance of the experiments and why baselines perform poorly has been addressed. I will increase my score.
The paper presents a method that leverages diffusion models to highlight differences between visual categories in images, with the goal of teaching humans to recognize those differences. The method is tailored for domains where textual descriptions are not easy, such as scientific imaging. The model consists of four main steps: inversion, space arithmetic, generation, and optional fine-tuning. The results indicate that the method works well and produces counterfactual images that preserve the identity of the original image while making sufficient changes to flip the decision of a pre-trained reference classifier.
Strengths and Weaknesses
Strengths:
- Simple method that leverages existing components, including pretrained diffusion models and CLIP encoders.
- Almost perfect scores in terms of class flipping.
- Realistic image generation.
Weaknesses:
- The selected baselines are not necessarily well aligned with the task. Machine teaching has been used to approach this problem in the past, and there are solutions that are even multi-class (e.g., [1]). The selected baselines are recent models that can be adapted for this task but are not specific to it. Can the authors compare to a baseline that also aims to teach humans?
- The technical innovation is limited, and it is unclear what the novelty is. For instance, the solution is built on binary classifiers, whereas previous works have approached the problem in multi-class settings.
- The methodology makes sense, but it seems to be a straightforward application of existing methods, using image-manipulation strategies that are well known in the community, including space arithmetic and latent inversion.
[1] Becoming the expert - interactive multi-class machine teaching. CVPR 2015.
Questions
- What makes the proposed methodology different from previous works in machine teaching for vision problems?
- What other directly related baseline, designed specifically for human teaching, could be used to assess the significance of the proposed contributions?
- How can the model be extended to multi-class settings?
Limitations
Yes
Final Justification
The authors have addressed the questions in my initial review, and the paper seems ready for publication.
Formatting Issues
No concerns
Thank you very much for your feedback. We are glad that you found our method both simple and capable of strong and realistic counterfactual image generation (along with Qza6). We hope we’ve answered your questions and addressed the weaknesses in the following sections.
“What is the technical innovation?”
Our technical innovation lies in unlocking the capability of performing subtle edits on limited data, which was previously not possible in scientific domains, for example, as demonstrated by the baselines' qualitative and quantitative results (see Table 2).
“How can the model be extended to multi-class settings?”
While DIFFusion is currently framed around a binary source-to-target transformation, the mechanism of computing the diff arithmetic vector that discriminates between classes can be extended to multi-class scenarios. One approach involves designating the positive embedding as the target class mean and the negative embedding as the average of all other classes. This “one-vs-rest” approach would allow us to distinguish the features of one class against all other classes in the dataset. We demonstrate this technique on an extended AFHQ dataset of 3 classes (cat, dog, and wildlife) as an example. We trained 3 separate classifiers (cat vs. dog+wildlife, dog vs. cat+wildlife, wildlife vs. cat+dog) and computed the corresponding three sets of positive and negative average embeddings ((positive: cat, negative: dog+wildlife), (positive: dog, negative: cat+wildlife), (positive: wildlife, negative: cat+dog)). We then ran inference with these 3 corresponding sets of classifiers and positive/negative average embeddings. We evaluated flip rate and LPIPS when going from the validation sets of the negative classes (e.g., dog+wildlife) to the positive class (e.g., cat), and repeated this for cat+wildlife to dog and cat+dog to wildlife. Since we cannot submit images, we report flip rate and LPIPS values; we will add the qualitative examples to the camera-ready's supplemental.
In the table below, each column is the starting class and each row is the target class; each cell reports LPIPS / flip rate.
| Target \ Source | Dog | Cat | Wildlife |
|---|---|---|---|
| Dog | x | 0.2642 / 1.0 | 0.3687 / 1.0 |
| Cat | 0.2926 / 1.0 | x | 0.4190 / 1.0 |
| Wildlife | 0.3096 / 1.0 | 0.2676 / 1.0 | x |
“Is the methodology novel, or is it a straightforward application of existing methods and image-manipulation strategies?”
While previous work can generate counterfactual images in common domains, as demonstrated in our quantitative and qualitative results, it does not perform well in scientific domains. The technical novelty lies in developing a system that automatically identifies the key discriminative features and applies them just enough to modify the class-relevant structure without corrupting the instance, which is useful for teaching humans subtle differences between categories. We believe that the simplicity of the method is its strength.
“What makes the proposed methodology different with respect to previous works in machine teaching for vision problems?”
Thank you for pointing this out! We view this work as highly complementary. While [1] and similar machine-teaching methods focus on optimal example selection for multi-class teaching, our approach fundamentally differs by generating synthetic counterfactual examples rather than selecting from existing data. The key distinction lies in our method's ability to automatically identify and visualize discriminative features that explicitly demonstrate the minimal edit necessary to flip the classifier's prediction. We will make sure to cite this work.
[1] Becoming the expert - interactive multi-class machine teaching. CVPR 2015.
Given that the discussion phase approaches its end, we would like to ask the reviewer whether our rebuttal is clear and if there is anything else we can provide to clarify further. Thank you!
(a) Summary: The paper proposes to use conditional diffusion models to generate counterfactual examples (CEs) for the specific case of human teaching. It argues that in domains where subtle differences between classes may be difficult to describe textually, showing visual examples can help humans learn them. The paper proposes to rely on CLIP embeddings of the images and embedding arithmetic to move from the class of origin to the target class, and to combine these conditioning vectors with isolated noise maps from the diffusion model to preserve the structural features of the original image. The experimental results show favorable performance of the method in producing CEs compared to selected baselines, both quantitatively and qualitatively. The paper also contains a user study showing that the method can help human learning.
(b) Strengths: The use of visual CEs for human teaching is an interesting idea with potential for further development. The paper shows promising results, especially on scientific domains.
(c) Weaknesses: Little technical novelty, relying on existing ideas (diffusion models for CEs, CLIP embedding arithmetic). Missing technical/methodological and empirical comparisons with other diffusion-based methods for visual CEs. Technical details remain opaque, e.g., the quality/appropriateness of CLIP embeddings for scientific OOD domains, diffusion-model fine-tuning for OOD datasets, the quality of classifiers for complex scientific domains and its effects, and the computational costs of generating candidate edits.
(d) Reasons to reject: Though using CEs for human teaching is an interesting idea with further potential, the current paper brings little technical novelty, combining standard techniques. Some questionable properties with potentially detrimental effects are not sufficiently discussed, e.g., the appropriateness of CLIP embeddings and linear arithmetic for OOD domains, the possibility of adversarial as opposed to counterfactual edits, the quality of initial class annotations, the scientific validity of image edits, and safeguards against biases. The user study, though helpful, is limited in size and needs elaboration to support claims of favorable effects on human learning.
(e) Review discussion: The reviewers generally welcomed the idea of using CEs for human learning. They raised concerns about the limited technical novelty and the added value compared to existing methods. The authors see this value in the ability to operate over scientific domains with limited data. Though seen as a potential benefit, this raised additional questions, such as the appropriateness of CLIP embeddings and arithmetic in these domains, the validity of the image edits, and the risk of biasing the human learning process. The authors discussed these topics actively in the rebuttal, to partial yet not complete satisfaction of the reviewers.