Explainable Concept Generation through Vision-Language Preference Learning
A method to automatically articulate concepts to explain neural networks.
Abstract
Reviews and Discussion
Some concept-based XAI methods, such as [1,2], require the creation of a concept-specific set of images to pass through the network. To eliminate the human-in-the-loop, the authors explore using a combination of reinforcement learning and generative models to create concept-specific image sets. The authors' stated goals include: (1) producing concept sets that are beyond what human practitioners may be able to discover on their own; (2) producing concepts at a variety of abstraction levels that they are able to control with a parameter; (3) producing concept sets that demonstrate an increase in novelty, abstractness, diversity, generalizability (to non-vision domains), and actionability (utility).
The core of the method is to optimize a LoRA-adapted [3] Stable Diffusion model to produce concept sets that trigger the target model (the model to be explained). The RL agent is used to search over a set of text prompts for the diffusion model.
[1] Kim, Been, et al. "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)." International Conference on Machine Learning. PMLR, 2018.
[2] Schut, Lisa, et al. "Bridging the human-AI knowledge gap: Concept discovery and transfer in AlphaZero." arXiv preprint arXiv:2310.16410 (2023).
[3] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
Strengths
(S1) The authors identify an interesting question: how to choose a probe set for extracting concepts from models? The proposed method considers an interesting direction of using generative methods to create images that may not be present in the dataset. (S2) The formulation of considering TCAV scores as preference scores is interesting. (S3) The authors make an effort to quantitatively assess the concept images generated by their method.
Weaknesses
(W1) The authors' stated goals for this work are misaligned with the goals of explainable AI in general. The primary goal of XAI is to improve a user's understanding of a model. However, the authors focus on "abstractness" and "novelty" as the goals of their method. While the authors propose several experiments to measure abstractness and novelty, they fail to provide any convincing experiments on the utility of their method for explaining network behavior. For example, [1] provides a clear, well-motivated experiment, i.e., users are asked to predict model behavior when given an explanation.
(W2) Additionally, I find the term novelty to be misleading. The term novelty can be interpreted in at least two ways; for example, it could be interpreted as how distinct different discovered concepts are from each other. When measuring novelty in this manner, it seems that the authors' method may not exhibit novelty, since in Line 288 they state, "it is possible for different actions to lead to the same explainable state." Instead, the authors measure novelty as the distance between concept images and images in the test set. They use this metric to compare their work to prior work, resulting in the fairly trivial result that generated images are farther from the test set than those of methods that use the test set.
(W3) The ablation study for choosing TCAV is unnecessary. The authors choose to use either a human, an LLM, or TCAV scores from the model to make preference judgements. However, since the goal is to explain the model, the first two options are completely unnecessary.
(W4) The authors claim to be able to generate concept sets at different levels of "abstraction". However, whether this "abstraction" is due to SD or due to the target model is unclear. Since the goal is to understand the target model, once again, I find this experiment to be insufficient to say anything interesting about the target model.
(W5) Finally, the computational cost of this method is quite high and it is unclear if the results are worth the cost.
In summary, the authors lack a key experiment measuring the utility of their method. Additionally, I find some of the experiments the authors conduct to have trivial results.
There are several places with typos, e.g., L91 "mode" -> "model"; L534: the term NUT is introduced without being defined?
[1] Colin, Julien, et al. "What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods." Advances in neural information processing systems 35 (2022): 2832-2845.
Questions
The most important question is: can the authors provide any experimental result that concretely shows that their method is able to explain model decisions?
Some secondary questions are:
- Why RL at all? Why not just optimize every "seed prompt"?
- How often do seed prompts converge to the same outputs?
- This work brings up the question, under what domain should we care about explaining model behavior? The generated images are OOD for the model. When do we want to explain OOD behavior for the model? Is this useful?
- How does your method relate to adversarial attack methods?
- How long does it take for your method to run?
We are thankful to the reviewer for the detailed feedback. While we appreciate the reviewer's recognition of the interesting questions posed and the novel directions explored in our work, we believe there are some key misunderstandings regarding the goals and contributions of the proposed method.
Since there are various types of contributions at ICLR, such as frameworks, datasets, algorithms, insights, and applications, we would like to clarify that our paper is an algorithmic-contribution paper. Please see our detailed answers below.
Please refer to this Anonymized GitHub Link where we have compiled detailed explanation and images/plots for better understanding.
W1.
Are our goals misaligned with XAI? We would like to clarify that the stated goals of our work are not misaligned with the broader goals of explainable AI (XAI). Our primary objective, consistent with XAI and with what the reviewer states, is to improve the user's understanding of the model. However, there is more nuance to this broad goal. The focus on "novelty," which most existing methods cannot provide, is not a deviation from this goal but rather an added dimension that furthers it. While we appreciate the reviewer's perspective, we probably do not want to fix the goals of XAI as it is an evolving field.
Unlike in the most classical setting of XAI, where we want to come up with explanations of a particular decision that help a non-ML-expert user (e.g., a medical doctor), our goal is to provide insights to expert machine learning engineers or data scientists to identify what the neural network has learned (so that if there are any vulnerabilities they can fix before deployment). As highlighted in recent discussions [1] and our paper (Figure 2), humans cannot understand everything a neural network has learned because networks do not learn using the same concepts that we do; their learning manifold is different from ours. If a human were to manually come up with a concept set, they are going to miss important concepts because they do not know about the neural network's learning manifold, and therefore the engineers cannot fix the vulnerabilities of the network. That is why we need a method to automatically probe millions of concept configurations and see what really matters. Since trying millions of configurations is not computationally feasible, we formulate a deep RL solution that generates concepts using diffusion. By doing so, we do not deviate from the original goal.
We believe the confusion happened because “human” was used as an overloaded term, resulting in a perspective clash. In the classical TCAV setting, two groups of humans are involved: those who create concept sets offline (“creator humans”) and those who utilize these concepts online (“user humans”). When we state “humans" we are specifically referring to the creator humans, as our contribution is developing an algorithm for concept set creation/generation rather than the downstream application of an existing method. We argue that the concepts that the model indeed uses can be divided into two groups: concepts that creator humans can think of/retrieval methods can create and those they cannot (i.e., novel concepts). Our method can generate concepts from both groups. The latter group is more useful to debug models as shown in our concluding experiment (Figure 9).
We also believe that "abstractness" is an interesting byproduct of RLPO, as correctly identified by R1, R3, and R4. These abstraction levels provide a layered understanding of the model's reasoning, revealing both high-level and low-level concepts that contribute to its decisions—something that, to the best of our knowledge, the XAI community has not seen before.
While we like the utility metric introduced in [2], it does not help measure novelty. If we crop part of an image as a concept, the utility score will be high, though this concept is not novel at all. Our argument is that there are other patterns/concepts that trigger the neural network (the goal of this work) and identifying them is important to fix the issues. Please see Table 2 where we explain this gap between “what the human thinks” vs. “what the neural network has actually learned.” One motivation for closing this gap stems from our attempt to understand why neural networks perform badly for certain complex decision-making tasks.
- Schut, Lisa, et al. "Bridging the human-ai knowledge gap: Concept discovery and transfer in alphazero." arXiv preprint arXiv:2310.16410 (2023).
- Colin, Julien, et al. "What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods." Advances in neural information processing systems 35 (2022): 2832-2845.
W2.
We understand that the term "novelty" could be interpreted in multiple ways, and we appreciate the opportunity to clarify its intended meaning and significance in our work.
Novelty as in different concepts. This is not what we originally meant, but because of RLPO's exploration strategy we indeed get different concepts (see Figure 6). Thank you for helping us identify this added advantage. Lines 285-288 read, “Different actions may result in different explainable states, reflecting various high-level concepts inherent to f(·)....Also, it is possible for different actions to lead to the same explainable state.” The first sentence refers to novelty in the sense of different concepts. The second sentence means that if we have two semantically similar seeds (but the RL algorithm "initially" does not know they are similar), the RL algorithm will force SD to learn similar concepts. Therefore, we do not have to worry much about having semantically repeated seeds.
Please see the experiment below to validate this. We ran the RLPO algorithm 3 times (i.e., 3 trials) for the same seed prompt set. During inference, we calculated the CLIP embedding similarity among the top 3 concepts (stripes, running, and mud for the zebra class; see Figure 6). The high Wasserstein distances and low cosine similarities indicate that the generations are not similar. A very high Hotelling's T-squared score (a generalization of the t-test to multiple dimensions) also indicates that the generated images are not from the same distribution (for reference, this score is 2.4 for stripe-stripe).
Pairwise concept comparisons for the "zebra" class across three trials:
| Metrics | Stripes-Running Concept | Running-Mud Concept | Mud-Stripes Concept |
|---|---|---|---|
| Average Cosine similarity | 0.677 ± 0.010 | 0.699 ± 0.0004 | 0.734 ± 0.0004 |
| Average Wasserstein distance | 8.1533 ± 0.057 | 7.850 ± 0.022 | 7.480 ± 0.033 |
| Average Hotelling's T-squared score | 7598.507 ± 84.5 | 13069.681 ± 2147.81 | 7615.731 ± 538.06 |
| Are they from the same distribution? | No | No | No |
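For readers who want to reproduce this style of comparison, here is a minimal sketch (not necessarily the authors' exact evaluation script) of how the three statistics in the table can be computed from two sets of CLIP image embeddings; the CLIP checkpoint ("ViT-B/32") and the per-dimension averaging of the 1-D Wasserstein distance are our assumptions.

```python
import numpy as np
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image
from scipy.stats import wasserstein_distance

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(paths):
    """Return an (n, d) array of L2-normalized CLIP image embeddings."""
    ims = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    with torch.no_grad():
        z = model.encode_image(ims).float()
    return torch.nn.functional.normalize(z, dim=-1).cpu().numpy()

def mean_cosine(a, b):
    """Average pairwise cosine similarity between two sets of unit embeddings."""
    return float((a @ b.T).mean())

def mean_wasserstein(a, b):
    """1-D Wasserstein distance per embedding dimension, averaged over dimensions."""
    return float(np.mean([wasserstein_distance(a[:, j], b[:, j])
                          for j in range(a.shape[1])]))

def hotelling_t2(a, b):
    """Two-sample Hotelling's T^2 on the embedding means."""
    n1, n2 = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(a, rowvar=False) +
                (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    return float(n1 * n2 / (n1 + n2) * diff @ np.linalg.pinv(s_pooled) @ diff)

# Example: compare two generated concept sets, e.g. "stripes" vs. "running".
# stripes, running = embed(stripes_paths), embed(running_paths)
# print(mean_cosine(stripes, running), mean_wasserstein(stripes, running),
#       hotelling_t2(stripes, running))
```

When the number of images per concept set is much smaller than the embedding dimension, the pooled covariance is rank-deficient, which is why the sketch falls back to a pseudo-inverse for Hotelling's T-squared.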
Novelty as in deviation from the test set. We believe there is a misunderstanding here. We are not just generating concept images that are far away from test images. We are generating images that are far away from test images but "still provide a high TCAV score." This is a challenging constrained optimization problem that we address using deep RL and diffusion. As the other reviewers also correctly identify, we do not think developing an algorithm to do this is trivial. Please see the graph in the Anonymized GitHub Link (and Table 4) for comparison. The comparison to prior work is not trivial. Retrieval-based methods rely directly on the dataset, in the simplest form "cropping" parts of existing images to produce explanations. While such an approach can highlight important features in input images that help non-expert users understand the network's decisions, it is inherently limited to patterns present in the dataset. RLPO, on the other hand, explores beyond the dataset, generating concepts that trigger the network; unveiling these vulnerabilities of a neural network helps engineers fix its issues.
W3.
From an XAI perspective, the comparison might look unnecessary. However, since the major contribution of this work is developing an algorithm, and given the popularity of LLM feedback and human feedback these days, the question arises of why we use XAI feedback. We wanted to highlight the infeasibility of using LLM and human feedback mechanisms in our framework. If the reviewer thinks this comparison is confusing to have in the main paper, we can move it to the supplementary material. Let us know.
W4.
The key assumption in our work is that Stable Diffusion can generate realistic images given a prompt (and generative models will continue to improve in the coming years). If SD generates images that do not help us explain the output, RL will not optimize it further. Therefore, we do not see how this is a problem. Referring to Figure 5, the explanation of the tiger class progresses through levels of abstraction from high to low: the importance of zoos, followed by animals in zoos, then striped animals in zoos, and finally orange-and-black striped animals, and so on. At the highest level of abstraction, this means that images of zoos trigger the tiger class whereas unrelated concepts, such as a beach, do not. This indeed helps engineers verify that the neural network, in this example, has learned the correct representation.
W5.
We can still obtain explanations in real-time because TCAV can run in real-time. However, we agree that RLPO cannot create concept sets in real-time, mainly because of the diffusion fine-tuning step (unfortunately, good concepts, whether designed by a human or by diffusion, come at a cost). However, we do not think there is a need to create concept sets in real time. For instance, if we apply TCAV for identifying a disease from an X-ray, we can create the concept set using a test set before deployment, which will take a few hours, and then run TCAV in real-time. Hence, concept set creation is a one-time investment. In case of a long-term distribution shift in an application, we can keep adding concepts to the dictionary if RLPO discovers anything new. On a different note, please also note that the traditional method of manually creating a concept set is not only slow and labor-intensive but can also miss important concepts. In a real-world setting, retrieval methods or human-designed concepts can be used as a starting set of concepts and expanded using RLPO to generate what retrieval methods or humans could not see/think of.
Thank you for pointing out the typos. We have fixed them.
Q1.
Why RL? The RL action space is a combination of seed prompts. Assuming 30 minutes per run and at most 20 seed prompts, brute-forcing all 2^20 combinations would take roughly 2^20 × 30 minutes ≈ 60 years. Since RL intelligently and dynamically picks which prompt combinations to use (and not use), RLPO takes only ~8 hours. Therefore, unlike a static ranking approach, our RL-based framework is much more pragmatic for handling unbounded generative models. The high-epsilon case in Table 1 is somewhat similar to (yet better than) brute-forcing through the seed prompts. To see the quality of generated images with and without RL (seed prompt "stripes" for the zebra class after 300 steps), please see the images in this Anonymized GitHub Link.
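To make the intuition concrete, below is a toy epsilon-greedy sketch of action selection over seed prompts. It is our simplification, not the authors' DQN + LoRA fine-tuning pipeline, and `tcav_reward` is a hypothetical stand-in for the expensive generate-and-score step. The point it illustrates is that prompts which stop yielding reward quickly stop being selected, so nowhere near all 2^20 combinations are ever evaluated.

```python
import random

def tcav_reward(prompt: str) -> float:
    """Hypothetical placeholder: in RLPO this step would fine-tune SD with the
    chosen seed prompt, generate a concept set, and return a TCAV-based
    preference signal from the target model."""
    return random.random()

def epsilon_greedy_search(seed_prompts, steps=100, epsilon=0.2):
    q = {p: 0.0 for p in seed_prompts}  # running value estimate per prompt
    n = {p: 0 for p in seed_prompts}    # how often each prompt was chosen
    for _ in range(steps):
        if random.random() < epsilon:           # explore
            prompt = random.choice(seed_prompts)
        else:                                   # exploit the best prompt so far
            prompt = max(q, key=q.get)
        r = tcav_reward(prompt)
        n[prompt] += 1
        q[prompt] += (r - q[prompt]) / n[prompt]  # incremental mean update
    return q  # low-value prompts are rarely revisited

if __name__ == "__main__":
    print(epsilon_greedy_search(["stripes", "running", "mud", "zoo"]))
```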
Q2.
Though theoretically plausible (e.g., when the seeds are "stripe" and "stripes"), we would say the probability is negligible because 1) as explained in Appendix C2, we remove repeated seeds in the first place, and 2) RL will stop taking the same action if it is not getting rewards. In our experiments, that never happened.
Q3.
It will generate both in-distribution and out-of-distribution concepts. Even with retrieval methods or human-created datasets, we will have both, and we do not see why it is a problem, as we are generating concept images, not test images. If the question is where testing OOD behavior would be useful:
- Safety-Critical Domains: In applications like healthcare or autonomous driving, understanding how the model generalizes to rare or unexpected scenarios (potentially OOD-like) is crucial for trust and reliability.
- Bias Detection and Debugging: Generated concepts can reveal biases or spurious correlations learned by the model, allowing for more targeted mitigation strategies.
- Generalization and Robustness Analysis: Abstract concepts help analyze how the model reasons under different conditions, providing insights into its robustness.
Q4.
Our method seeks to generate explainable concept images that reflect the model's internal representations. As an application, if an adversarial method (e.g., one perturbing the inputs) finds adversarial test points, we can explain them using TCAV. Our method helps obtain better concepts for TCAV that humans might not have thought of but that are useful to explain those test points.
Q5.
Training the RLPO framework takes approximately 6–8 hours for a particular class on a workstation with a gaming GPU; most of the time is spent on SD fine-tuning. As explained under W5 above, please note that we have to run this only once (to pre-compute the concepts), not for every single test image. At test time, the run-time is the same as TCAV's: a few milliseconds to seconds.
Thank you for your response.
Our primary objective, consistent with XAI and what the reviewer states, is to improve the user’s understanding of the model. However, there is more nuance to this broad goal. The focus on "novelty," that most existing methods cannot provide, is not a deviation from this goal but rather an added dimension to further this goal. While we appreciate the reviewer's perspective, we probably do not want to fix the goals of XAI as it’s an evolving field.
If this is the primary objective, I still believe that the link between novelty, abstractness, etc... and improving the user's understanding is missing. I am not trying to "fix the goals of XAI", but in order to claim that these secondary properties are valuable to the primary objective, a clear effect, showing that the method improves human understanding, should be demonstrated.
As highlighted in recent discussions [1] and our paper (Figure 2), humans cannot understand everything a neural network has learned because they do not learn using the same concepts that we learn–their learning manifold is different to ours. If a human were to manually come up with a concept set, they are going to miss important concepts because they do not know about the neural network’s learning manifold and therefore the engineers cannot fix the vulnerabilities of the network.
I have no problem with the goal of generating concepts that may be different than what humans come up with, this is one of the strengths of the work as I indicated. My concern is that the results in Figure 4 do not immediately strike me as useful for human understanding. If it was clear that they were useful, experiments showing that the explanations help human understanding would not have been as necessary. However, the results are, as you indicate, abstract. Thus, you are making a strong claim that this abstractness is a positive and not a negative for human understanding of the model. Unfortunately, there are no experiments that back this claim up. I still believe this work needs a clear demonstration of usefulness for explainability. Also, it's a reach to claim that this method surfaces vulnerabilities in networks without providing experiments that either exploit or defend against these vulnerabilities.
While we like the utility metric introduced in [2], it does not help measure novelty.
I suggested this work because it directly measures usefulness, not novelty. The authors should convince the readers that novelty is worthwhile for explainability through experiments. This experiment may not be the best fit, for example, your work may be best suited for helping users understand failure modes on OOD images.
Our argument is that there are other patterns/concepts that trigger the neural network (the goal of this work) and identifying them is important to fix the issues.
It's known that "strange" patterns and concepts can trigger neurons, this fact is exploited in adversarial attack research. This work does not show adversarial attacks using the method nor fixes those attacks. This would also be an interesting contribution.
In summary, my primary concern remains unaddressed. I am not convinced that the applications you write about are possible with the method provided. As you show clearly in your work, your results are more novel and abstract than prior work. While I have no issues with measuring novelty and abstractness, I don't believe these traits can be used as a proxy metric for usefulness as a tool for explainability.
I'm not as familiar with RL and I find your responses adequate for Q1, Q2.
Training the RLPO framework takes approximately 6–8 hours for a particular class...
This is a lot of time spent to analyze a single class of a model. If you were to analyze all of imagenet this would result in 6000 hours (250 days). Once again, this strikes me as not particularly useful for most users unless the method can provide extremely valuable insights.
We thank the reviewer for going through our response. To summarize our response, we address several misunderstandings: we clarify that the human survey effectively measures usefulness, that run-time is not a limiting factor, and we resolve the reviewer’s confusion regarding the meaning and application of abstractions.
We would also highly appreciate it if the reviewer could suggest concrete and correct experiments that are feasible and fair to fit in a standard ML methods paper.
We by no means argue that retrieval methods such as CRAFT are bad in all aspects; they have their own merits, such as being simple and fast. In fact, we considered using retrieval methods as potential priors for the generative model. As clearly shown in Figure 2, this paper wants to expand the set (from orange to blue). In order to expand (in our case, generate novel concepts), we have to sacrifice something (in our case, runtime is higher than that of retrieval methods, as highlighted under limitations). This expansion is what the reviewer identifies as "your work may be best suited for helping users understand failure modes on OOD images."
Our primary objective, consistent with XAI and what the reviewer states, is to improve the user’s understanding of the model. However, there is more nuance to this broad goal. The focus on "novelty," that most existing methods cannot provide, is not a deviation from this goal but rather an added dimension to further this goal. While we appreciate the reviewer's perspective, we probably do not want to fix the goals of XAI as it’s an evolving field.
If this is the primary objective, I still believe that the link between novelty, abstractness, etc... and improving the user's understanding is missing. I am not trying to "fix the goals of XAI", but in order to claim that these secondary properties are valuable to the primary objective, a clear effect, showing that the method improves human understanding, should be demonstrated.
In machine learning conferences, different papers have diverse contributions such as methods, frameworks, datasets, applications, theory, etc. As we highlighted, this is a methods paper. Therefore, we focused on the aspects of the algorithm (e.g., why RL is a better choice) and conducted extensive experiments to validate it. We hope that the reviewer agrees that it is unfair to ask us to deploy a model in a real-world application to assess the downstream usefulness (unfortunately, this is the only proper way to assess the real usefulness), as it requires significantly more effort and is typically beyond the contributions of a methods paper (99% of ML papers do not have such deployment usefulness assessments), though it might be fair to ask for in a journal paper. Below we discuss why our human survey indeed measures usefulness. If there is a concrete yet feasible experiment that the reviewer can suggest, we would be happy to conduct it. Unfortunately, the authors find the reviewer's request rather vague.
As highlighted in recent discussions [1] and our paper (Figure 2), humans cannot understand everything a neural network has learned because they do not learn using the same concepts that we learn–their learning manifold is different to ours. If a human were to manually come up with a concept set, they are going to miss important concepts because they do not know about the neural network’s learning manifold and therefore the engineers cannot fix the vulnerabilities of the network.
I have no problem with the goal of generating concepts that may be different than what humans come up with, this is one of the strengths of the work as I indicated. My concern is that the results in Figure 4 do not immediately strike me as useful for human understanding. If it was clear that they were useful, experiments showing that the explanations help human understanding would not have been as necessary. However, the results are, as you indicate, abstract. Thus, you are making a strong claim that this abstractness is a positive and not a negative for human understanding of the model. Unfortunately, there are no experiments that back this claim up. I still believe this work needs a clear demonstration of usefulness for explainability. Also, it's a reach to claim that this method surfaces vulnerabilities in networks without providing experiments that either exploit or defend against these vulnerabilities.
We believe there is a misunderstanding about terminology. We mean abstractions not as in "abstract art" but as abstraction levels as in computer science—a representation of an idea or concept, often removing specific instances or details to focus on the broader principle. In our case, we go from zoomed-out to zoomed-in views of the neural network's activation space (to clarify, not physically zooming): zoo -> animals -> animals with stripes -> stripes, etc.
We do not think we make a strong claim about abstractions as what we state is “we observe the progression of output concepts generated by the SD” and, we visualize it to help understand the progression theoretically (Figure 1 and Appendix B). Also, please note that abstraction levels are a byproduct—the main contribution is generating novel concepts, for which we have provided plenty of experiments (the main results figure is Figure 6, not Figure 4.). Since it is not a must for the user to use all levels of abstractions (just using the final abstraction level is totally fine for some applications as in most XAI methods), we do not see how it is a negative aspect. For applications such as long-horizon planning, abstractions are important, though not mandatory for every use case. Instead, they serve as an additional layer of insight.
We do not claim “method surfaces vulnerabilities”; what we claim is that “if there are any vulnerabilities, they can be fixed before deployment.” The first sentence is a definite claim and the second one is an ability/opportunity. If it was misunderstood, we can further tone it down. The typical workflow of a 9-page methods paper is to provide the method and experimentally validate that it is correct, ideally with some theoretical insights. We went a step further by illustrating the future potential for model debugging as depicted in Figure 9. We believe it is not fair to ask us to run adversarial examples and show how to fix them, as that is an application worthy of a completely different contribution/paper.
While we like the utility metric introduced in [2], it does not help measure novelty.
I suggested this work because it directly measures usefulness, not novelty. The authors should convince the readers that novelty is worthwhile for explainability through experiments. This experiment may not be the best fit, for example, your work may be best suited for helping users understand failure modes on OOD images.
Before showing an explanation is useful for an application, we must first come up with correct explanations. Although the contribution of the paper is about coming up with explanations, we went a step beyond to show that the explanations are useful for the user to explore and understand a model's inner workings. The human survey (Table 2) in fact measures the usefulness. All the concept images we showed the humans are concepts, either from retrieval methods or from our generative method. The participants tell us how much information they gained about the model after seeing the explanation. This new information gain is the usefulness/utility of the explanation.
Here is a simple method to come up with an incorrect explanation that gives a high utility score using [2]: given an image that contains a zebra to test, run an object detector to crop the image and pick the bounding box that has the highest similarity to the test image (i.e., bounding box around the zebra). If we use the metric in [2], the usefulness will be 100% and will beat any other XAI method to-date because the user associates the zebra test image with the zebra concept image. But is the explanation correct and trustworthy? No (unless lucky), because it never looked inside the neural network.
Utility should always be measured with respect to the end-goal, which, in our case, is finding novel concepts. Simply adapting [2], which measures the association between the test input and the concept, is not correct in our case because what we want to measure is what the user did not expect (information gain, as in information theory). This setup is different from retrieval methods, which show that, compared to a random crop, cropping from the zebra itself is better. To reiterate, our comparison was not meant to show that retrieval methods cannot explain, but to show that we can expand the concept set beyond retrieval methods.
Our argument is that there are other patterns/concepts that trigger the neural network (the goal of this work) and identifying them is important to fix the issues.
It's known that "strange" patterns and concepts can trigger neurons, this fact is exploited in adversarial attack research. This work does not show adversarial attacks using the method nor fixes those attacks. This would also be an interesting contribution.
It is a long-term goal. We believe it is not fair to ask us to run adversarial examples and show how to fix them, as that is an application worthy of a completely different contribution/paper.
In summary, my primary concern remains unaddressed. I am not convinced that the applications you write about are possible with the method provided. As you show clearly in your work, your results are more novel and abstract than prior work. While I have no issues with measuring novelty and abstractness, I don't believe these traits can be used as a proxy metric for usefulness as a tool for explainability.
The usefulness of our tool is the ability to find new concepts. The human survey exactly measures that and it is not a proxy. We believe the reviewer’s confusion stems from trying to compare with some other literature whose objective might be different. As this confusion can happen to the broad readership of XAI, we will clarify this in the paper.
Training the RLPO framework takes approximately 6–8 hours for a particular class...
This is a lot of time spent to analyze a single class of a model. If you were to analyze all of imagenet this would result in 6000 hours (250 days). Once again, this strikes me as not particularly useful for most users unless the method can provide extremely valuable insights.
Thank you for pointing this out. We, however, believe that there is a misunderstanding here. The delay is due to the current speed of generative models, which are expected to be significantly faster in the future [1]. Additionally, the reported time is for the most standard fine-tuning process (LoRA) on a small-scale workstation GPU. If faster variants of LoRA are used on GCP/AWS, run-time is not a significant bottleneck.
On a different note, ImageNet is a standard dataset, not an application. When neural networks are adapted to real-world applications, even though the network is pre-trained on ImageNet, the last few layers of the network are removed and fine-tuned for the actual number of classes. In most applications (e.g., medical image analysis, autonomous driving, etc.), the number of classes that the decision-making module has to deal with is relatively low.
- Chadebec, Clement, et al. "Flash diffusion: Accelerating any conditional diffusion model for few steps image generation." arXiv preprint arXiv:2406.02347 (2024).
Thank you for your response and clarifications.
My understanding of your claims are:
- You are presenting a method to expand into the "blue" region, producing more novel concepts.
- Your work aims to test the method in the context of its ability to produce such concepts. Specifically, you test that the concepts are outside of the space that humans can generate themselves.
I believe the authors have successfully demonstrated that they discover visual concepts that activate the network that are novel to humans (Table 2).
Unfortunately, I believe we disagree on the value of this contribution to XAI (which, to my understanding, is primarily explored in Fig. 9). While most ML papers do not deploy their methods in real-world use cases, most fields have clear metrics. It is challenging to come up with clear metrics in XAI and while I can agree that an extensive evaluation of the usefulness is out of the scope of the paper, I think it is reasonable that some evidence of potential usefulness to XAI (or some other domain) is demonstrated. In my opinion, Fig. 9 does not sufficiently demonstrate this potential usefulness.
On abstractness, I appreciate the clarification and I understand the intention better. In the paper the authors state:
These abstractions hint us about what the model prefers when it is looking for tiger, starting from a four-legged orange furred animal, to black and white stripes with orange furred animal, to black and white stripes with orange furred and whiskers.
However, this seems to me heavily confounded by both the prompt and the path through stable diffusion's weight space. I am not sure it can be ascribed to the model's preferences. I also find similar issues with ClipSeg, which introduces another model's biases as an intermediary to interpreting the target model.
Overall, I prefer to maintain my rating. Thank you for the discussion.
I am not factoring the following points into my decision, but I would like to point them out.
Here is a simple method to come up with an incorrect explanation that gives a high utility score using [2]: given an image that contains a zebra to test, run an object detector to crop the image and pick the bounding box that has the highest similarity to the test image (i.e., bounding box around the zebra). If we use the metric in [2], the usefulness will be 100% and will beat any other XAI method to-date because the user associates the zebra test image with the zebra concept image. But is the explanation correct and trustworthy? No (unless lucky), because it never looked inside the neural network.
This is an incorrect representation of the experiment proposed in [2]. The classes selected by the authors of [2] are specific and designed to eliminate trivial explanations.
our goal is to provide insights to expert machine learning engineers or data scientists to identify what the neural network has learned (so that if there are any vulnerabilities they can fix before deployment). We do not claim “method surfaces vulnerabilities,” what we claimed was method “so that if there are any vulnerabilities they can fix before deployment.” The first sentence is a definite claim and the second one is an ability/opportunity.
I find that the writing style in the paper and the responses leans towards strong claims which have been followed up with re-interpretations in the discussion. For example, the first sentence (to me) strongly implies that your method would help engineers surface vulnerabilities, I do not believe this is a surprising or incorrect interpretation.
Thank you for the response! The reviewer’s new understanding about the claims is correct. Thank you for acknowledging that the paper has successfully demonstrated its claims.
Since we have clarified all of the previous concerns and the new concerns below through new experiments, we sincerely hope the reviewer can reconsider the score. As the reviewer can also see, we honestly put a lot of extra effort into rebutting 6 reviewers, mostly because this is not a classical XAI-style paper. We hope the reviewer agrees that the contribution of the paper, the amount of experiments we have run, and the rebuttal warrant updating the score. Please see our answers below.
The results presented in Figure 9 of the paper highlight a potential use case of the proposed method. The figure shows how concepts shift with fine-tuning and demonstrates the proposed method's ability to detect these new concepts.
To further showcase the usefulness, we conducted an additional experiment. In this experiment, we chose a pre-trained GoogLeNet classifier for the Tiger class whose important seed prompts were ‘orange black and white’, ‘orange and black’, and ‘blurry image’ with TCAV scores of 0.66, 0.66, and 0.62, respectively. Of these seed prompts, ‘orange black and white’ and ‘orange and black’ highlight the tiger pixels while the ‘blurry image’ seed prompt highlights the background pixels (see sample explanations in the anonymous GitHub). This means that, in order to classify a tiger, GoogLeNet looks at both the foreground and the background. Now the engineers want the classifier to classify the tiger based on tiger pixels, not its background (note: from the classical Wolf-Husky example in LIME [1], we know about the spurious correlation with background).
To this end, we generated 100 tiger images based on concepts related to ‘orange black and white’ and ‘orange and black’ using a separate generative model and fine-tuned our GoogLeNet model. Running RLPO on this fine-tuned model revealed that the model learned some new concepts such as ‘whiskers’ and also revealed that previous concepts such as ‘orange black and white’ and ‘orange and black’ are now more important, with TCAV scores of 1.0 and 1.0, respectively. This means that the classifier is now only looking at tiger pixels, not the background (see dataset samples and the shift plot in the anonymous GitHub). This experiment clearly shows how the proposed method can be used to correct a neural network’s undesirable behavior.
- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why should I trust you?': Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
This paper introduces an algorithm for generating concept images to post-hoc explain a model. It hereby focuses on the use of the TCAV score, an RL learning setting and stable diffusion.
Strengths
The paper tackles an important topic, and I find the combination of previous approaches that the authors propose with their work intuitive and interesting.
Weaknesses
Although I find the algorithmic approach interesting enough, I am unsure what the actual underlying goal of the approach is and how this is evaluated/evidenced in the experimental section. I.e., in the intro the authors mention "Therefore, it is important to automatically find human-centric concepts that matter to the DNN’s decision-making process." So my understanding is that the ultimate goal is to provide concepts that are helpful for humans to understand the decision-making process, but without the need for extensive concept data precollection. But I am missing a clear experiment to evaluate the usefulness of the generated concept images. Table 2 seems to be hinting at something along this line, but the details are not clear to me from the text. In any case, this should be one of the key evaluations to perform. Especially as the exemplary concept images look far too abstract for me to be able to understand what they should represent. Please point me towards the relevant sections in case this is missing.
Overall, I am leaning towards accept, but would like the following points clarified first:
- What is the exact goal of the method?
- How do the experimental evaluations provide evidence for reaching this goal?
Minor: Overall, the paper could benefit from a grammar check, e.g., "also explains what type of features is the model focuses on." in line 426
Questions
Maybe I missed it, but is there any information on the training time of the proposed algorithm? If not, I think this could be valuable information to provide at least in the appendix, and mentioned in the main text. The authors mention limitations, but it would be good to have real numbers.
What's the difference between the concept images in Figures 5 and 6? In Figure 5 they look very abstract, whereas in Figure 6 they are high-quality images.
Details of Ethics Concerns
N/A
We thank the reviewer for the constructive comments and for finding the method interesting. We have addressed the concerns raised in the weaknesses and questions as follows.
Please refer to this Anonymized GitHub Link where we have compiled detailed explanation and images/plots for better understanding.
W.
To summarize:
Our claim: When considering automated concept set creation techniques (that are retrieval based), the proposed generative method can come up with novel concepts that trigger the neural network (unveiling such new patterns can help engineers fix models).
Experimental evidence to support the claim:
- Quantitative: Table 4 shows that our method can generate new concepts.
- Qualitative: Figure 4 illustrates some results of Table 4.
- Human evaluation: Table 2 shows that most humans cannot imagine that certain patterns will even trigger the neural network.
Let us elaborate on this.
To assess each concept's relevance to the model, we evaluate the method’s success through multiple metrics, as reflected in the experiments section. Table 4 provides quantitative evidence for the relevance of the generated concepts by measuring TCAV scores. Higher TCAV scores indicate that the discovered concepts are aligned with the model’s internal representations, confirming their significance for decision-making. This shows that the generated concepts "matter" to the model. We use metrics such as cosine similarity and Euclidean distance to show that the generated concepts are novel—generated concepts are farther away from the class images, indicating that concepts generated by our method are not a subset of the class data as in retrieval methods.
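For context, here is a minimal sketch of the standard TCAV score computation (Kim et al., 2018) underlying Table 4. It is a generic re-implementation under our assumptions, not the authors' code; `head_fn` stands for the part of the network that maps the hooked layer's activations to class logits.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """CAV = unit normal of a linear boundary separating concept activations
    from random-image activations at the chosen layer."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)

def tcav_score(head_fn, class_acts: np.ndarray, class_idx: int, cav: np.ndarray) -> float:
    """Fraction of class examples whose class logit increases when the layer
    activation is nudged along the CAV (positive directional derivative)."""
    acts = torch.tensor(class_acts, dtype=torch.float32, requires_grad=True)
    logits = head_fn(acts)  # assumed: differentiable layers after the hooked layer
    grads = torch.autograd.grad(logits[:, class_idx].sum(), acts)[0]
    directional = grads.numpy() @ cav
    return float((directional > 0).mean())
```

A score near 1.0 (as in the fine-tuning experiment above) means almost every class example becomes more "class-like" when pushed toward the concept direction.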
We also conducted a human experiment (results shown in Table 2), where, when asked to identify relevant concepts among generated, retrieved, or both, most volunteers, laymen and experts alike, mostly picked retrieved concepts, even though both were equally important to the network. This experiment indicates that it is not easy for humans to imagine such concepts by themselves, as the neural network’s learning process is different from ours.
Furthermore, to help humans understand what each relevant concept represents, we made use of ClipSeg to identify the intersection between generated concept images and test images. Figure 6 highlights the regions each relevant concept represents in the test image.
Q1.
For the experiments presented in the paper, the RLPO framework with DQN+Diffusion typically requires approximately 8 hours per class to train on a machine equipped with an NVIDIA RTX 4090 GPU, with the most computationally intensive step being the iterative fine-tuning of the generative model. This concept set creation is a one-time investment for an application of interest. Evaluating TCAV is essentially real-time once we have the concept set. We will include this discussion in the revised version of the paper.
Q2.
The images in Figure 6 are final outputs of the RLPO framework–this is what most people want as explanations. They are the lowest level of abstraction in explanations. However, if someone wants higher levels of abstractions in explanations, as shown in Figure 5, they can also obtain them. In Figure 5, the explanation of the tiger class progresses through levels of abstraction from high to low: the importance of zoos, followed by animals in zoos, then striped animals in zoos, and finally orange-and-black striped animals, and so on.
Thank you for your time; I now understand the claims better. However, I don't yet quite understand the interpretation of the Table 2 results. What do these results tell us about how the concepts can be utilized by humans? It seems the concepts are not likely to be understood by humans. This seems undesirable, unless the point is about generating concepts that the model indeed uses. Just the interpretation of this evaluation with respect to the global goal is a little unclear.
So the gist of my questions: what is the benefit of having "explanations" if they are not understandable by humans? I understand that the combination with ClipSeg provides aid here. It would just be good if the authors can elaborate more on this explainable aspect. In other words the Evaluation section clarifies whether the identified concepts are novel and reliable, but not whether they "explain" the model. I ask these questions because in the methods section the authors state "To this end, we leverage the state-of-the-art text-to-image generative models to generate high quality explainable concepts." But for me, clear evaluations on the explainable aspect are missing. I hope this clarifies the issue.
Do concepts explain the model?
We believe the philosophical question of whether the concepts really explain the model is valid for all concept-based techniques. If a human collects a concept set, is it guaranteed that another human will perceive the same pattern? Not necessarily, as cognitive biases can influence interpretation. If a retrieval method crops and collects some parts of images, are they always understandable to humans? We believe the same argument is true for generated concepts as well. The concepts our method generates are explainable by design, since RLPO iteratively fine-tunes the diffusion model to generate images with high TCAV scores. Additionally, to further verify whether these generated concepts indeed explain the neural network, we performed the classical c-deletion experiment. As shown in Fig. 8, gradually removing these concepts from the input leads to a drop in average class accuracy. If they were irrelevant concepts, we would not have seen such a drop.
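To illustrate the c-deletion check mentioned above, here is a hedged sketch of one way it can be run; the per-pixel concept-relevance map (e.g., obtained via ClipSeg) is assumed to be given, and the masking-by-zeroing choice is ours rather than necessarily the authors' exact protocol.

```python
import torch

@torch.no_grad()
def c_deletion_curve(model, image, class_idx, relevance,
                     fractions=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0)):
    """image: (3, H, W) tensor; relevance: (H, W) concept-relevance map.
    Returns (fraction_deleted, class_probability) pairs; a steep drop means
    the concept's pixels matter to the prediction."""
    order = relevance.flatten().argsort(descending=True)  # most relevant pixels first
    curve = []
    for frac in fractions:
        masked = image.clone().flatten(start_dim=1)       # (3, H*W)
        k = int(frac * order.numel())
        masked[:, order[:k]] = 0.0                        # delete top-k concept pixels
        prob = model(masked.view_as(image).unsqueeze(0)).softmax(dim=-1)[0, class_idx]
        curve.append((frac, float(prob)))
    return curve
```

Averaging such curves over a class's test images gives the kind of accuracy/confidence drop summarized in Fig. 8.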
The user human would see something similar to Figure 6 in manuscript (see Anonymized GitHub Link). For a zebra test point, the user will see three sets of concept images shown on the right, with the most important at the top. The red set (highest TCAV) highlights stripes, and on the left, it shows which parts of the test image these concepts are most relevant to. The other two sets provide supplementary explanations. The green set illustrates a green wooded background (note that Stable Diffusion’s seed for this was running, but RLPO fine-tuned it to generate such backgrounds—seeds are used only for our analysis and are not important for the user human). The blue concept set depicts a brown background with some green on the horizon. These supplementary explanations primarily describe the background. Consolidating everything, while the most important concept for the network to classify a zebra is black-and-white stripes, its habitat, including a brown background and green trees, also contributes to the classification.
“To this end, we leverage the state-of-the-art text-to-image generative models to generate high quality explainable concepts.” What we meant by high quality is the quality of the images generated by the model. We can rephrase and tone it down. We acknowledge that our perspective was rooted in the creator human’s mindset, as we were on a quest to automate the concept set creation process. As highlighted in the concluding experiments of Section 4.5, we truly hope our work will 1) help engineers debug issues in neural networks (Figure 9) and 2) make the use of TCAV easier in downstream applications (Figure 10).
We thank the reviewer for their prompt response and giving us the opportunity to clarify the queries.
Please refer to this updated Anonymized GitHub Link where we have compiled detailed explanation and images for better understanding.
We believe the confusion happened because “human” was used as an overloaded term. In the classical TCAV setting, two groups of humans are involved: those who create concept sets offline (“creator humans”) and those who utilize these concepts online (“user humans”). When we state “humans cannot,” we are specifically referring to the creator humans, as our contribution is developing an algorithm for concept set creation/generation rather than the downstream application of an existing method. To clarify, we do not claim that the generated concepts are not understandable to the user humans. Rather, we argue that the concepts that the model indeed uses can be divided into two groups: concepts that creator humans can think of and those they cannot (i.e., novel concepts). Our method generates explanations from both groups, and both types of concepts are understandable to user humans.
"Understandable" should be corrected to "guessed." For instance, the reviewer's summary becomes: "concepts are not likely to be ~~understood~~ guessed by creator humans." Vanilla TCAV requires creator humans to come up with (i.e., guess) a set of concepts to test ahead of time. In other words, it assumes the creator humans know a set of explanations ahead of time, as TCAV only picks from the set which concept/explanation is correct. When we tried to apply TCAV in some real-world applications, we found out that it is impossible for creator humans to guess all such concepts/explanations ahead of time. That motivated us to generate concepts.
Interpretation of results in Table 2:
Hypothesis: Creator humans struggle to guess all important concepts that truly matter to the network.
Experimental setup: Rather than asking creator humans to come up with large concept sets, we present them with two concepts and ask them which one they would include in the concept set. To this end, as detailed in Appendix D6 (and the following image), we show creator humans a test image along with two explanations—one obtained from a retrieval-based method and the other generated by our method (without revealing the experimental setup to them). Those creator humans were asked to determine which of the two explanations could have influenced the network's decision that the test image, for example, represents a zebra. They could choose the first explanation, the second, or both.
Metrics: All generated concepts matter to the network as they have a high TCAV score. Out of them, how many of them are identifiable by creator humans?
Results: The results in Table 2 show that creator humans often recognized explanations from retrieval-based methods, as these align with cropped elements of the test images. However, they were less likely to guess generated explanations, even though these are valid concepts that influence the neural network's decision and have high TCAV scores. This confirms that our method successfully discovers valid explanations that are not immediately apparent to creator humans.
Ok, thank you, yes, this helps to clarify the difference. I suggest the authors incorporate such a clarification in the paper. Maybe adding a sentence in both the introduction and methods section would suffice, though some details on this in the preliminary section could be good too. I have raised my score with the trust that the authors will do so.
We thank the reviewer for participating in the discussion. We have updated the paper and further clarified the contribution in the revised version.
The authors propose a new algorithm, Reinforcement Learning-based Preference Optimization (RLPO), designed to generate high-level visual concepts that explain the decisions of DNNs. Unlike traditional concept-based explanation methods (such as TCAV), RLPO creates sets of concept images, eliminating the need for manual concept image collection, thus making the process of explaining DNN decisions more efficient and generalizable. RLPO fine-tunes a Stable Diffusion model with preference optimization to generate images that effectively explain neural network decisions.
Strengths
The idea is very innovative, and the writing is clear. By using a generative model instead of traditional manual concept collection, it reduces human intervention and improves efficiency. RLPO can generate concepts at different levels of abstraction, offering a more detailed explanation of the DNN’s internal decision-making process.
Weaknesses
In some sections (such as the algorithm explanation and mathematical proofs), the descriptions appear overly lengthy, which might hinder readers' understanding. The description of how RLPO is applied in sentiment analysis tasks is somewhat vague, and further detailing the specific steps of this experiment could be helpful. Certain terms (such as “concept generation” and “concept extraction”) are defined and used inconsistently throughout the paper, which could lead to confusion.
Questions
Same as weakness
Thank you for recognizing the innovation and clarity of RLPO, as well as its contribution to improving the efficiency and generalizability of concept-based explanation methods. We also appreciate your valuable feedback on areas requiring clarification, which we will address comprehensively in the revised version.
W1.
We acknowledge that some sections, such as the algorithm explanation and mathematical proofs, may appear overly lengthy. To address this, we plan to streamline these sections in the final version by:
- Providing concise summaries alongside detailed explanations to improve accessibility for readers.
- Moving some of the more intricate details (e.g., mathematical derivations) to the appendix, ensuring the main text remains reader-friendly while preserving the rigor.
We believe this restructuring will make the algorithm and proofs more digestible without compromising their completeness.
W2.
Given the page limit we were not able to include all the details on the sentiment analysis experiment in the main paper. We have added additional explanation for our experiment on sentiment analysis tasks in the Appendix D.4. We will highlight it in the main paper.
W3.
We appreciate your feedback regarding the inconsistent use of terms like “concept generation” and “concept extraction.” We will fix these in the revised paper.
Thanks for your feedback. I will keep the positive score.
The paper proposes to treat concept set creation as an optimisation problem. By means of deep reinforcement learning, it iteratively refines the concepts obtained using the Stable Diffusion generative model, so one can generate concepts instead of retrieving them from existing data. They also introduce concept generation with respect to an abstraction level. The authors additionally explore the ideas of analysing the concepts during fine-tuning and of NLP problems.
Strengths
Novelty: the work builds upon existing methods in interpretability, such as concept-based explanation, but contains a set of novel ideas including (1) the problem statement and the method for optimisation of the concept set through a reinforcement learning technique and LoRA, and (2) the idea to use such concept set generation to produce explanations at different abstraction levels
Clarity: the paper is clearly written and well presented.
Motivation: The authors give a strong motivation for the proposed method: the retrieved data might not fully describe the concepts embedded in the model, and the optimisation of such concepts using generative models and RL can be a promising alternative
Significance: I think this work is significantly improving the available options for explaining visual (and potentially language) models and therefore I believe it is significant enough.
Correctness: I checked the paper and I believe the claims and the maths are correct
Reproducibility: as far as I can see, taking into account both the paper and the code, the experiments look reproducible to me.
Weaknesses
Clarity: there are a few questions regarding the limitations of this work (see below in the questions section)
Questions
Questions:
- “As a specific application, we see what concepts are removed and added, as well as how the concept importance changes when we fine-tune ResNet50 model on ImageNet to improve accuracy” If we want to track the evolution of concepts during the training procedure, do we need to retrain the concepts every time we update the model, or is there any workaround which would help track the concepts during the finetuning process? Either way is fine, however might be good to clarify it in the paper or in the appendix.
- Figure 9 should have the x-axis annotated (I understand it represents the concept identifier)?
- One concern might be the lack of a systematic quantitative comparison with retrieval-based explanations. Is there any possibility to compare the explanations numerically with state-of-the-art prototypical retrieval-based explanation methods, such as CRAFT (Fel et al., 2023)?
- I think there is one important limitation of the proposed method which I would like the authors to cover in the paper: the explanations also, in a way, depend upon the inner workings of the generative model. Imagine, for example, that the generative model has mode collapse and does not represent the whole set of patterns available to the model we explain. In this case, it would lead us to obtaining the best possible explanation amongst suboptimal ones, which may mean the explanations do not represent some aspects of the model's inner workings. In a safety-critical application, that mode collapse might correspond to some anomalous event which is not covered by the set of explanations achievable by the generator and therefore would not allow us to understand the reasons behind the model's behaviour. A variation of this problem is explanation when the generator only offers poor-quality image output (for whatever reason, e.g. it does not cover some particular concept). To be clear, I understand that there is a counterpart to this limitation in the case of standard, retrieval-based, prototypical explanation: there might not be a piece of data which closely matches the phenomenon. It is therefore, in one way or another, a limitation of many post hoc explanation methods. I wonder if the authors agree with this limitation, and in any case I would like to ask them to include the discussion.
- In relation to that, perhaps not mandatory, but another idea for an experiment: how would the numerical performance of the algorithm (e.g. Figure 8) compare for different image generators? Does a GAN model, which is prone to mode collapse, result in worse C-insertion metrics?
Thank you for your detailed and insightful review! We appreciate the reviewer recognizing the novelty of our ideas. Please see our answers below.
W1.
To track the evolution of concepts over iterations, it is necessary to retrain the RL algorithm iteratively. This ensures that we can account for changes in the network as it learns new features and potentially forgets old ones during fine-tuning. However, we can maintain a concept set and keep adding newly discovered concepts to the set. Conditioning on a prior set, how to efficiently find a novel set would be a very new and interesting research direction for generative models in general.
W2.
Thanks for pointing this out! The x-axis represents the concept identifiers, and we will include appropriate annotations in the revised manuscript to make this clear.
W3.
In terms of comparisons, we extensively compared the novelty of concepts generated by our method and by retrieval methods using a variety of metrics (Table 4) as well as through human participant evaluation (Table 2). The challenge with retrieval methods is that they leak information about the test set into the explanation, as, in their simplest form, the concepts are cropped parts of class images. Human-created or generated concepts do not have this limitation. However, we find it difficult to design a fair and solid experiment to show this aspect numerically. This is not specific to our method: even human-created concepts and retrieval-based concepts are hard to compare, as they offer different perspectives.
W4.
We acknowledge that the quality and diversity of generated outputs depend on the generative model's capabilities. Issues such as mode collapse or insufficient representation of certain patterns could limit the range of explanations. However, such limitations are not unique to generative approaches; they are also inherent to retrieval-based methods, which are constrained by the available data, and to human-collected concepts, which are influenced by cognitive bias. Although mode collapse was an issue in GANs, it is hard to find solid literature on mode collapse in SOTA diffusion models. Nevertheless, we agree that this is a good point, and we will include a discussion of this limitation in the revised manuscript. Specifically, we will highlight the dependency of the explanations on the generative model's capability to produce high-quality and diverse outputs. Thank you for pointing it out.
W5.
Although the experiment the reviewer suggested is really interesting, it would be difficult to replicate RLPO with GANs, since we cannot generate arbitrary images from arbitrary text prompts with GANs. As explored in past research [1], it is possible to develop a GAN that generates images from textual descriptions, but it has to be trained specifically for a particular dataset. That is, in a way, one of the major limitations of GANs: it is hard to use them beyond the task they were trained for. Instead, to test the reviewer's hypothesis, we ran the RLPO experiment with an older version of Stable Diffusion (SD v1.1), which is known to produce biased (arguably, a result of mode collapse) and suboptimal images.
[1] Reed, Scott, et al. "Generative adversarial text to image synthesis." International Conference on Machine Learning. PMLR, 2016.
Area under the curve (lower is better) obtained after applying c-deletion to the top 3 concepts generated by SD v1.1 and SD v1.5 for the “zebra” class:
| Concept | Stable Diffusion v1.1 | Stable Diffusion v1.5 |
|---|---|---|
| C1 | 4.182 | 1.265 |
| C2 | 5.957 | 2.525 |
| C3 | 6.180 | 2.905 |
We obtain the lowest area under the curve for SD v1.5, indicating that concepts generated by SD v1.5 (the better generator) are more closely related to the class features learned by the neural network. Therefore, the quality of the generator matters.
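For reference, here is a minimal sketch of how a c-deletion AUC can be computed. It is illustrative only, with a hypothetical deletion curve; the exact axis normalization used in our evaluation may differ from this toy example.

```python
import numpy as np

# Illustrative c-deletion AUC: concept-related image regions are removed in
# increasing fractions and the model's class score is re-measured each time.
# A lower area under the resulting curve means the concept matters more,
# because the prediction collapses quickly once the concept is removed.
def c_deletion_auc(deleted_fraction, class_score):
    deleted_fraction = np.asarray(deleted_fraction, dtype=float)
    class_score = np.asarray(class_score, dtype=float)
    # Trapezoidal rule over the (fraction deleted, class score) curve.
    return float(np.sum(0.5 * (class_score[1:] + class_score[:-1])
                        * np.diff(deleted_fraction)))

# Hypothetical curve for an important concept: the score drops quickly.
frac = np.linspace(0.0, 1.0, 6)
score = [0.95, 0.55, 0.30, 0.15, 0.08, 0.05]
print(f"c-deletion AUC: {c_deletion_auc(frac, score):.3f}")
```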
Many thanks for thoroughly answering my comments. This answers my questions (and the last experiment in particular is much appreciated). I am reading the responses by the other reviewers.
This paper focuses on the generation of concept images to explain black-box image classification models. It proposes a reinforcement learning-based preference optimization (RLPO) algorithm to fine-tune a diffusion model for generating images that can maximize the TCAV scores. It also proposes to use DQN to search appropriate actions from the seed prompts. Experiments show that the proposed approach can generate complex and abstract concepts aligning with the test class.
Strengths
- Optimizing diffusion models to maximize TCAV scores sounds interesting and original to me.
- The authors performed extensive experiments to demonstrate the effectiveness of RLPO.
- Most parts of the paper are clearly written and easy to follow.
- The visual results presented are helpful in illustrating the advantages introduced by RLPO.
Weaknesses
- While diffusion models make the whole system more flexible than "retrieval methods" by generating new images that may not be presented in current datasets, the quality of such images is questionable. For example, in Figure 5, the generated image at timestep 30 looks unrealistic and is not closely aligned with the seed prompt (worse than the initial image).
- Though Table 1 shows that RL helps the model select appropriate seed prompts from the given set, I'm still concerned about the necessity of using DQN in the framework. A VLM/LLM can be used to generate a small, high-quality seed prompt set, and even rank the prompts based on their potential importance (e.g., the experiment in this paper uses a set of only 20 prompts). One could use the whole set and focus on the fine-tuning of the diffusion model.
- The efficiency seems to be a problem.
Questions
- While Table 4 shows that the concept images generated by RLPO are more diverse, how do we know whether they faithfully reflect the same concept instead of overfitting to TCAV?
- Why should the problem be designed as a sequential decision-making problem?
- Do people need to fine-tune one RLPO framework with DQN+Diffusion for each class? Would it be more efficient and equally effective to just use an untuned diffusion model and prompt LLMs/VLMs to generate more detailed text descriptions for the target class given the seed prompt?
- What is in Property 2?
We appreciate your thoughtful review, which acknowledges the originality, experiments, and clarity of our approach. Please see our response below. Please refer to this Anonymized GitHub Link where we have compiled detailed explanation and images/plots for better understanding.
W1.
Are generated images not representative of the seed prompt? We would like to clarify a potential misunderstanding: the generated images are not meant to look like the seed prompt. With the example described in the paper (Figure 5), we wanted to show that there are millions of very diverse images that can be generated from the seed prompt (in this case, “zoo”). However, because of RLPO, the diffusion model learns to generate only images that explain the network’s decisions. Figure 5 demonstrates this process: the generations for the seed prompt “zoo” become more animal-like at t=10, the animals get more stripes at t=20, colors appear at t=30, and so on. We have now added this description to the paper.
Are images unrealistic? Neural network classifiers can produce the same output for different input patterns/concepts. Some of those concepts are expected, whereas others are unexpected or unrealistic. As we argue with reference to Figure 2, we want RLPO to find concepts that a human might not think of. Therefore, the ability to reveal unrealistic concepts that give a high TCAV score is exactly what we wanted (note that the image quality does not degrade in this case, as the TCAV score is higher and the aesthetics score is > 3). Having said that, for the specific case in Figure 5, please note that t=30 is not the final image; we only showed timesteps until the class label flips to tiger.
Aesthetics scorer: https://github.com/discus0434/aesthetic-predictor-v2-5?tab=readme-ov-file
W2.
What’s the necessity of RL? Using only GPT was indeed our first attempt, which proved to be highly inefficient. Note that the action space is not the 20 seed prompts but the combinations of the 20 seed prompts. Assuming 30 mins per run, brute-forcing all 2^20 combinations would take roughly 60 years of compute. Since RL intelligently and dynamically picks which prompt combinations to use (and not use), RLPO takes only ~8 hours. Therefore, unlike a static ranking approach, our RL-based framework is much more pragmatic for handling unbounded generative models. The high-epsilon case in Table 1 is somewhat similar to (yet better than) brute-forcing through the seed prompts. To see the quality of generated images with and without RL (seed prompt “stripes” for the zebra class after 300 steps), please see the images in this Anonymized GitHub Link.
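For reference, the back-of-the-envelope arithmetic behind this comparison is reproduced below; the 30-minute-per-run cost is the assumption stated above, and this is an illustrative calculation rather than a benchmark.

```python
# Back-of-the-envelope cost of brute-forcing every seed-prompt combination,
# versus the reported ~8 hours for the RL-guided search.
N_SEEDS = 20
MINUTES_PER_RUN = 30                      # assumed cost of one fine-tune + TCAV pass

combinations = 2 ** N_SEEDS               # all subsets of the 20 seed prompts
brute_force_minutes = combinations * MINUTES_PER_RUN
brute_force_years = brute_force_minutes / (60 * 24 * 365.25)

print(f"{combinations:,} combinations -> {brute_force_years:.1f} years of compute")
# 1,048,576 combinations -> 59.8 years, versus ~8 hours for RLPO
```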
W3.
Is computational complexity an issue? We agree that we should have clarified this point in the paper. We can still obtain explanations in real-time because TCAV can run in real-time. However, we agree that RLPO cannot create concept sets in real-time, mainly because of the diffusion fine-tuning step (RL is very fast). However, we do not think there is a need to create concept sets in real time. For instance, if we apply TCAV for identifying a disease from an X-ray, we can create the concept set using a test set ahead of time before deployment, which will take a few hours, and then run TCAV in real-time. Hence, concept set creation is a one-time investment. In case of a long-term distribution shift in a particular application, we can keep adding concepts to the dictionary, if RLPO discovers anything new. Please also note that the traditional method of manually creating a concept set can not only be slow and labor intensive but also can miss the important concepts.
Q1.
Are explanations faithful? Good point. Compared to a random search algorithm, RL dynamically rewards updates that give a high TCAV score. Therefore, conceptually, we do not expect RLPO to generate totally different concept sets every run. We have now run the RLPO algorithm 3 times (i.e., 3 trials) for the same seed prompt set. During inference, we calculated the CLIP embedding similarity between trials for the top three seed prompts (stripes, running, and mud for the zebra class; see Figure 6). The low Wasserstein distances and high cosine similarities indicate that the generations are similar. A low Hotelling's T-squared score (a generalization of the t-test to multiple dimensions) also indicates that the generated images are from the same distribution (for reference, this score is 7488.86 for stripes-running).
Inter-trial Concept Comparisons for “zebra” class across three trials:
| Metrics | Stripes-Stripes concept | Running-Running concept | Mud-Mud concept |
|---|---|---|---|
| Avg. Cosine similarity | 0.996 ± 0.0008 | 0.997 ± 0.0004 | 0.997 ± 0.0004 |
| Avg. Wasserstein distance | 0.955 ± 0.074 | 0.828 ± 0.074 | 0.823 ± 0.068 |
| Avg. Hotelling's T-squared score | 2.462 ± 0.091 | 2.365 ± 0.081 | 2.663 ± 0.229 |
| Are they from the same distribution? | Yes | Yes | Yes |
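For concreteness, here is a minimal sketch of how such an inter-trial comparison can be computed with off-the-shelf CLIP embeddings. The model name, file paths, and the 1D Wasserstein proxy are illustrative assumptions; the exact formulation behind the table above (including the Hotelling's T-squared statistic) may differ.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from scipy.stats import wasserstein_distance
from transformers import CLIPModel, CLIPProcessor

# Embed each trial's concept images with CLIP, then compare the two sets.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)            # unit-norm CLIP embeddings

# Hypothetical file paths for two trials of the same seed prompt.
trial_a = embed(["trial1/stripes_000.png", "trial1/stripes_001.png"])
trial_b = embed(["trial2/stripes_000.png", "trial2/stripes_001.png"])

# Average pairwise cosine similarity between the two embedding sets.
cos = (trial_a @ trial_b.T).mean().item()

# A simple 1D proxy for the Wasserstein distance: compare the distributions
# of all embedding coordinates across the two trials.
w = wasserstein_distance(trial_a.flatten().numpy(), trial_b.flatten().numpy())

print(f"avg cosine similarity: {cos:.3f}, 1D Wasserstein proxy: {w:.3f}")
```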
Q2.
In our approach, SD is fine-tuned iteratively: generating a set of images, assessing their relevance, and then refining the model to improve alignment with explainability objectives. This sequential framework allows the model to adaptively optimize towards explanations by building upon the outcomes of previous steps. Each step in the sequence informs the next, ensuring that the fine-tuning process converges towards increasingly meaningful and interpretable concept representations. This dynamic adjustment is essential for steering SD toward generating high-quality explanations that progressively align with the target class, rather than relying on static, one-shot methods that lack adaptability. Please see the animation we posted in the Anonymized Github.
Q3.
Yes, fine-tuning one RLPO framework with DQN+Diffusion per class is necessary. This is because each class may have unique features and abstractions that require tailored exploration to generate meaningful and interpretable concept representations. The RLPO framework ensures that the generated concepts align closely with the specific nuances of the class as learned by the model, which a generic approach cannot achieve.
“An untuned diffusion model and prompt LLMs” is exactly what is behind most VLMs such as GPT-4 (it calls DALL·E 2, which is based on a diffusion model). In other words, the question is akin to asking why we don’t use GPT-4 to generate images directly. Finding a single long, descriptive prompt that effectively encapsulates the target class's abstractions is highly challenging, if not impossible, except for something obvious such as “stripes.” Therefore, even in that case we would have to develop an automated prompt engineer using RL. In our experience, it is much more efficient and controllable to fine-tune Stable Diffusion than to develop an automated prompt engineer.
Q4.
Thank you for catching this oversight in the appendix. We have rectified this in the revised paper. The symbol in question refers to the time at which the system reaches an explainable state. After this point, although we can keep fine-tuning SD, the TCAV score does not change significantly.
Once again, thank you for taking the time to provide us with your constructive feedback on our submission. We understand that the reviewer may not have had the time to thoroughly go through the rebuttal yet. If there are any further clarifications or additional details that we can provide to address your queries, please let us know. We are more than happy to provide any additional information or explanations to ensure our approach is clearly conveyed.
Thank you for your detailed responses, which have partially addressed my concerns.
I now understand better why RL is utilized as you model the trajectory as the sequence of prompt seeds entered into the model. But I'm still confused as you claim each action is a combination of prompt seeds (the action space is not 20 seed prompts but the combination of the 20 seed prompts). At least from your results, I don't think I see any experiments of using multiple seeds simultaneously as one action in a step.
By showing the "without RL" results, do you mean the model is trained by iteratively entering prompt seeds or by using the selected prompt alone? My question on the RL design is consistent with my question on the design of "sequential decision-making", as it would be straightforward for researchers to just tune one SD model for each prompt seed. Though you claim that brute-forcing all the seed combinations would take decades of compute without RL, what I have in mind is to tune one model for each seed independently (20 * 30 min = 10 h), which is close to the speed of RLPO and does not require complex decision-making. Your response to my Q2 shows how RLPO makes decisions in a sequential way, but it does not really demonstrate the benefits of doing so. I think this is also a shared concern raised by Reviewers AFj3 and D5KJ.
My last remaining concern is about the usefulness mentioned by Reviewers 5Zn5 and D5KJ, which is consistent with my motivation for raising W1 and Q1. As you mentioned, the primary objective of this research is to improve the user’s understanding of the model. Therefore, the interpretability of the generated images themselves should be very important. For example, are users really able to understand what concepts are represented by the generated images, such as those shown in Figures 13-17?
Due to the issues mentioned above, I will keep my original score.
Thank you for your response! Below we clarify the reviewer's questions on the RL optimization routine and the usefulness concern also raised by other reviewers. We have also added new experiments to further support the latter. Since we have addressed all previous and new concerns, we sincerely hope the reviewer will reconsider the score.
Clarification on the seed prompts
Apologies for the confusion. RL chooses one action at a time, but multiple actions are combined into the prompt for Stable Diffusion. For instance, at t=1, action={cat}, prompt={cat}; at t=2, action={water}, prompt={cat, water}; and so on. If the combination is not useful (based on TCAV), the RL agent decides not to proceed along this trajectory. Please see Appendix D.3.2 for a plot of the number of action combinations used over time. As a side note, Stable Diffusion generates a very different set of images for prompt={cat} at t=1 versus t=2, because at each timestep Stable Diffusion is fine-tuned. Please also see the Anonymous GitHub, which shows that updating on one seed prompt at a time without RL does not lead to meaningful explanations.
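For clarity, this accumulation of actions into a prompt can be sketched as below; `select_action`, `tcav_score`, and `finetune_diffusion` are hypothetical stand-ins for the RL policy, the TCAV preference signal, and the LoRA update, not our actual interfaces.

```python
import random

def rollout(seed_prompts, steps, select_action, tcav_score, finetune_diffusion,
            threshold=0.5):
    """Accumulate one seed prompt per RL step into the diffusion prompt."""
    active = []                                       # seed prompts kept so far
    for _ in range(steps):
        action = select_action(seed_prompts, active)  # RL picks ONE seed per step
        candidate = active + [action]                 # e.g. ["cat"] -> ["cat", "water"]
        prompt = ", ".join(candidate)
        if tcav_score(prompt) >= threshold:
            finetune_diffusion(prompt)                # LoRA update toward this concept
            active = candidate                        # keep the useful combination
        # otherwise this trajectory is dropped and not pursued further
    return active

# Toy call with random stand-ins, just to show the control flow.
kept = rollout(
    seed_prompts=["cat", "water", "stripes"], steps=3,
    select_action=lambda seeds, active: random.choice(
        [s for s in seeds if s not in active] or seeds),
    tcav_score=lambda prompt: random.random(),
    finetune_diffusion=lambda prompt: None,
)
print(kept)
```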
Clarification on Without RL experiment and results:
We demonstrate the benefits of the sequential approach both conceptually and experimentally. In our benchmark experiment on training the diffusion model without an RL agent, we fine-tuned the same diffusion model for each prompt by looping over the seed list. If we only had a few seed prompts, we could have trained separate diffusion models, but when we consider the many combinations of seeds, it is not feasible to maintain many diffusion models (for N seeds, we would need 2^N of them). Therefore, we have only one diffusion model, and RL decides which optimization trajectories are not useful. The GIF in the anonymous GitHub visualizes this sequential process. As illustrated in the results (anonymous GitHub), if we do not use RL to optimize, the 'stripes' seed does not converge to a good-quality concept compared to the one obtained using RL for the same time budget. The main reason is that the RL agent learns over time which trajectories are worth optimizing and drops the less explainable ones.
Clarification on usefulness:
Below we clarify the reviewer's misunderstanding of our primary objective and then present new experimental results. The primary objective of this research is to come up with novel concepts that trigger the neural network (i.e. understanding the neural network).
Our claim: When considering automated concept set creation techniques (that are retrieval based), the proposed generative method can come up with novel concepts that trigger the neural network (unveiling such new patterns can help engineers fix models).
Experimental evidence to support the claim:
- Quantitative: Table 4 shows that our method can generate new concepts.
- Qualitative: Figure 4 illustrates some results of Table 4.
- Human evaluation: Most humans cannot imagine certain patterns will even trigger the neural network.
We have recently updated the manuscript to clearly specify the contribution.
Having said that, we have now conducted an additional human survey to measure the usefulness of the provided explanation. The experiment is twofold:
- Step 1: We first asked 19 ML engineers to choose the generated concepts they considered relevant for a GoogLeNet classifier classifying the zebra class, without telling them that all of the shown images are actual concepts. All the engineers selected the ‘stripes’ concept as most important, and some also selected the ‘mud’ concept, but most missed the ‘running’ concept. This indicates that engineers cannot think of all the important concepts that activate the neural network.
- Step 2: Then we showed engineers the concept-explanation mapping on a random input image (figure 6 in paper) and asked them if the provided explanation helped them understand the model better and if it provided new insights. More than 90% of the engineers agreed that the explanation helped in better understanding the neural network and around 84% agreed that it provided new insights about the neural network that they didn’t have previously. This result clearly suggests that the new concepts discovered by the proposed method help engineers discover new patterns that they did not imagine before. (anonymous github)
Taking an additional step, through another experiment, we demonstrate how these new explanations help engineers fix issues in neural networks.
To further showcase the usefulness, we conducted an additional experiment. We chose a pretrained GoogLeNet classifier and examined the Tiger class, whose important seed prompts were ‘orange black and white’, ‘orange and black’, and ‘blurry image’ with TCAV scores of 0.66, 0.66, and 0.62, respectively. Of these seed prompts, ‘orange black and white’ and ‘orange and black’ highlight the tiger pixels, while ‘blurry image’ highlights the background pixels (see sample explanations in the anonymous GitHub). This means that, in order to classify a tiger, GoogLeNet looks at both the foreground and the background.
Now suppose the engineers want the classifier to classify the tiger based on tiger pixels, not its background (note: from the classic Wolf-Husky example in LIME [1], we know the background can be a spurious correlate). To this end, we generated 100 tiger images based on concepts related to ‘orange black and white’ and ‘orange and black’ using a separate generative model and fine-tuned our GoogLeNet model. Running RLPO on this fine-tuned model revealed that the model learned some new concepts, such as ‘whiskers’, and that the previous concepts ‘orange black and white’ and ‘orange and black’ are now more important, both with TCAV scores of 1.0. This means that the classifier is now looking only at tiger pixels, not the background (see dataset samples and the shift plot in the anonymous GitHub). This experiment clearly shows how the proposed method can be used to correct a neural network’s undesirable behavior.
[1] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why should I trust you?' Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
Concept-based explanation methods typically require practitioners to guess and gather multiple candidate concept image sets, a process that can be imprecise and labor-intensive. To address this challenge, this paper redefines the creation of concept image sets as an image generation problem. To this end, in this work, the authors introduce an RL-based preference optimization algorithm that fine-tunes the vision-language generative model using approximate textual descriptions of concepts. They also conduct extensive sets of experiments to demonstrate that the proposed method effectively articulates complex and abstract concepts that align with the target class, which are often difficult to create manually.
Strengths
The paper is well-written and easy to follow. The problem definition is novel and interesting. The paper effectively reframes the generation of concept image sets as an image generation problem, providing a novel perspective that addresses the limitations of traditional concept-based explanation methods. This shift enhances the efficiency and effectiveness of generating meaningful concepts. Through a series of well-designed experiments, the paper demonstrates the capability of the proposed method to generate complex and abstract concepts that align with the target class. This empirical evidence strengthens the paper's contributions.
Weaknesses
While the authors demonstrate the effectiveness of their proposed method through both qualitative and quantitative analyses, several aspects of the framework are not thoroughly justified. For instance:
(a) The concept sets generated using SD+LoRA are randomly divided into two groups. Why is "random" grouping considered optimal? Could this be problematic?
(b) The use of reinforcement learning (RL) is also not justified. Is it really necessary to implement an RL policy for the purpose mentioned in the paper? Why not simply calculate the TCAV score for each possible seed prompt kt, and then fine-tune the SD+LORA weights based on the seed prompt that yields the highest TCAV value?
(c) Incorporating each component of the proposed framework increases computational demands and time constraints. Consequently, such an analysis may not be feasible in real-time, as generating images with Stable Diffusion, training the DQN-based RL policy, and fine-tuning the Stable Diffusion model with preferences all require significant processing time. Is this complex pipeline truly necessary?
(d) I also struggle to see a practical significance for the proposed problem. Why is it important to generate a diverse set of concept images at all? It seems more crucial to generate concepts that directly explain the task at hand.
(e) Furthermore, why choose DQN? As we continuously update the LORA weights based on the TCAV preference score, the underlying RL environment becomes non-stationary. This means that the same action taken by an RL policy could lead to different reward values at different times. How can this issue be addressed?
Questions
Please see my comments on the weakness section.
Thank you for identifying the novelty and appreciating the experiments. We hope the following explanations will clarify the queries for the reviewer.
Please refer to this Anonymized GitHub Link where we have compiled detailed explanation and images/plots for better understanding.
Wa.
To compute TCAV scores, we need two concept image groups. The reason for random grouping is that, initially, we do not have TCAV scores for individual concept images. Random grouping is not problematic; it is an unbiased sampling method. This works because the images in the two groups differ, so the TCAV score of one group is slightly higher. We then fine-tune Stable Diffusion to generate images similar to the group with the higher TCAV score, regenerate two groups of images from the fine-tuned model, and iterate this process. Although the images in the initial groups are highly variable, over time the model learns to generate images of a particular type/concept.
We can also think of a different analogy. This sampling step is somewhat analogous to rejection sampling and Metropolis-Hastings (M-H). In rejection sampling, we pick a sample from a proposal distribution, typically a uniform distribution, and reject it based on some criterion. Similarly, we generate two sets of images randomly and evaluate which one to reject. In M-H, we also compare two points (though they are old and new points). Through such iterative rejections and model updates, we can converge to the target distribution.
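As a minimal sketch of this grouping-and-preference loop (illustrative only; `generate_images`, `tcav_score`, and `lora_update` are hypothetical placeholders rather than our released code):

```python
import random

def preference_step(generate_images, tcav_score, lora_update, prompt, n=16):
    """One preference update of the kind described above (illustrative)."""
    images = generate_images(prompt, n)       # batch from the current SD+LoRA model
    random.shuffle(images)                    # unbiased random split into two groups
    group_a, group_b = images[: n // 2], images[n // 2:]
    score_a, score_b = tcav_score(group_a), tcav_score(group_b)
    preferred, rejected = ((group_a, group_b) if score_a >= score_b
                           else (group_b, group_a))
    lora_update(prompt, preferred, rejected)  # push generations toward the preferred group
    return max(score_a, score_b)

# Toy call with stand-ins, iterated a few times as in the described loop.
for step in range(3):
    best = preference_step(
        generate_images=lambda p, n: [f"{p}_img_{i}" for i in range(n)],
        tcav_score=lambda group: random.random(),
        lora_update=lambda p, pref, rej: None,
        prompt="stripes",
    )
    print(step, round(best, 3))
```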
Wb.
We would like to clarify that the action space is not just the 20 seed prompts but the combinations of the 20 seed prompts. Assuming 30 mins per run, brute-forcing all 2^20 combinations would take roughly 60 years of compute. Since RL intelligently and dynamically picks which prompt combinations to use (and not use), RLPO takes only ~8 hours. Therefore, unlike a static ranking approach, our RL-based framework is much more pragmatic for handling unbounded generative models. The high-epsilon case in Table 1 is somewhat similar to (yet better than) brute-forcing through the seed prompts.
To see the quality of generated images with and without RL (seed prompt “stripes” for the zebra class after 300 steps), please see the images in this Anonymous Github.
Wc.
We agree that we should have clarified this point in the paper. We can still obtain explanations in real-time because TCAV can run in real-time. However, we agree that RLPO cannot create concept sets in real-time, mainly because of the diffusion fine-tuning step (RL is very fast). However, we do not think there is a need to create concept sets in real time. For instance, if we apply TCAV for identifying a disease from an X-ray, we can create the concept set using a test set ahead of time before deployment, which will take a few hours, and then run TCAV in real-time. Hence, concept set creation is a one-time investment. In case of a long-term distribution shift in a particular application, we can keep adding concepts to the dictionary, if RLPO discovers anything new. Please also note that the traditional method of manually creating a concept set can not only be slow and labor intensive but also can miss the important concepts.
Wd.
Let us explain with an analogy. Why does it snow on Mount Denali in Alaska? It could be due to its high elevation, its location in the Arctic, or the orographic effect—all valid explanations. Similarly, if an autonomous vehicle hits a pedestrian, why did it happen? Perhaps the pedestrian was occluded, the AV struggled to identify pedestrians wearing pants, or a reflection might have confused its sensors.
If engineers could obtain the range of reasons why a neural network triggers for a particular output, they could assess the vulnerabilities of the network and fix them. Humans cannot think of all these reasons because they do not understand the neural network’s learning process. That is where our method shines.
While generating multiple concepts is not mandatory, it provides an added advantage. In critical applications, relying on a single explanation can be risky, especially if it fails to capture the full scope of the model's behavior. A diverse set of concepts ensures that users and domain experts can explore multiple dimensions of the model's reasoning. These explanations could also potentially allow experts to learn from the model itself [1].
[1] Schut, Lisa, et al. "Bridging the human-ai knowledge gap: Concept discovery and transfer in alphazero." arXiv preprint arXiv:2310.16410 (2023).
We.
Compared to, say, PPO, DQN is stable for discrete action spaces and more sample efficient. However, DQN can be replaced with any deep RL algorithm that supports discrete action spaces. While the LoRA weights are updated iteratively, the TCAV scores (or rewards) generated for a given set of concepts remain consistent for the same configuration of weights (we do not change the neural network under test).
Once again, thank you for taking the time to provide us with your feedback on our submission. We understand that the reviewer may not have had the time to thoroughly go through the rebuttal yet. If there are any further clarifications or additional details that we can provide to address your queries, please let us know. We are more than happy to provide any additional information or explanations to ensure our approach is clearly conveyed.
I thank the authors for the clarifications. However, I am still not convinced about its practical significance. Each example (such as "Why does it snow on Mount Denali in Alaska?" or "Why did an autonomous vehicle hit a pedestrian?") demands a task-specific explanation that is both tailored to the particular requirements of the task and diverse in its approach. It is not enough to provide a generic variety of explanations; the diversity must directly align with and enhance the task-specific goals. This distinction ensures that explanations are not only varied but also relevant and meaningful in the context of the task.
How does the current approach succeed in generating this level of diversity within task-specific explanations?
Thank you for your response! Since we have clarified all previous concerns and, below, address the new concerns experimentally, we sincerely hope the reviewer will reconsider the score.
The proposed method indeed generates a diverse set of task-specific explanations. Here is what a practical pipeline would look like. Consider the case where an autonomous vehicle hits a pedestrian. If we apply the proposed method, it will provide a diverse set of explanations (say, 5 explanations, whereas typical XAI methods would provide only one). The analyzing human then looks at each explanation to identify the real cause. Without a diverse set of explanations, there is a good chance we would have missed the real cause.
Below we provide a concrete experiment that shows how diverse explanations help resolve issues in neural networks. In this experiment, we chose a pre-trained GoogLeNet classifier and examined the Tiger class, whose important seed prompts were ‘orange black and white’, ‘orange and black’, and ‘blurry image’ with TCAV scores of 0.66, 0.66, and 0.62, respectively. Of these seed prompts, ‘orange black and white’ and ‘orange and black’ highlight the tiger pixels, while ‘blurry image’ highlights the background pixels (see sample explanations in the anonymous GitHub). This means that, in order to classify a tiger, GoogLeNet looks at both the foreground and the background. Now suppose the engineers want the classifier to classify the tiger based on tiger pixels, not its background (note: from the classic Wolf-Husky example in LIME [1], we know the background can be a spurious correlate).
To this end, we generated 100 concept images based on concepts related to ‘orange black and white’ and ‘orange and black’ using a separate generative model and fine-tuned our GoogLeNet model. Running RLPO on this fine-tuned model revealed that the model learned some new concepts, such as ‘whiskers’, and that the previous concepts ‘orange black and white’ and ‘orange and black’ are now more important, both with TCAV scores of 1.0. This means that the classifier is now looking only at tiger pixels, not the background (see dataset samples and the shift plot in the anonymous GitHub). This experiment clearly shows how the proposed method can be used to correct a neural network’s undesirable behavior.
[1] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why should I trust you?' Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
We sincerely thank all the reviewers for their detailed and constructive feedback. We deeply appreciate your recognition of the novelty and potential impact of our approach, as well as the thoughtful questions. In this response, we categorize the feedback into two sections for clarity:
- Weaknesses (W): Concerns raised in the reviews.
- Questions (Q): Specific queries or clarifications sought regarding the methodology, results, or framework.
We address all points comprehensively, providing additional context, experimental evidence where necessary, and detailing revisions planned for the manuscript to address your feedback effectively.
We sincerely thank all the reviewers for the discussion and appreciate AC’s efforts in overseeing the review process. We appreciate all the detailed and constructive feedback. It has helped us refine and clarify our work. We have addressed all the concerns and clarified our contributions as well as potential misunderstandings.
We explained the distinction between the roles of "creator humans" (who design concept sets) and "user humans" (who interpret the results). Our method primarily targets creator humans by automating concept creation, a tedious and perhaps even impossible task for humans. We also elaborated on the role of abstraction in our method, emphasizing that abstractions, such as moving from "zoo" to "stripes," are a natural byproduct of the approach, offering optional and insightful layers of understanding without being a primary claim. Furthermore, we validated the novelty of the generated concepts through quantitative metrics and qualitative assessments. We also showed the importance of RL in our framework, as requested by reviewers AFj3, KnkW, and D5KJ. Reviewers KnkW and TZBx already agree that RL plays an important role in learning over time which trajectories to optimize and which to drop so that explanations are found as quickly as possible.
The additional experiments conducted during the rebuttal process further solidify our contributions. As requested by reviewers KnkW and 5Zn5, we conducted additional experiments to validate the faithfulness of generated concepts. These experiments showed that the generated concepts are unique, consistent across runs, and represent distinct distributions for different seed prompts.
Most importantly, we show that our method can provide actionable/useful insights through our additional experiments, in which concept shifts were effectively directed through fine-tuning (addressing concerns raised by reviewers KnkW and D5KJ). Finally, a human survey validated that the generated concepts improve the understanding of model behavior, showing their practical utility for debugging and analysis.
We are grateful for the reviewers' input, which has greatly strengthened our submission. With these changes, we believe we have addressed all the concerns raised by the reviewers, and we hope that reviewers D5KJ, AFj3, and KnkW will reconsider their original reviews.
Scientific Claims and Findings:
This paper introduces a novel algorithm called Reinforcement Learning-based Preference Optimization (RLPO) to automatically generate visual concepts that explain the decisions of deep neural networks (DNNs). The algorithm addresses the limitation of traditional concept-based explanation methods (like TCAV) which require manual collection of concept images. The authors claim that RLPO generates more novel and abstract concepts compared to traditional methods, offering a detailed understanding of the DNN's decision-making process. They also suggest that RLPO is generalizable to non-vision domains, specifically NLP tasks.
Strengths:
The paper is well-written and easy to follow.
The proposed method is novel and addresses a significant limitation in existing XAI methods.
The use of a generative model to create concept sets is innovative and reduces human intervention.
The method can generate complex and abstract concepts, offering a detailed explanation of the DNN's decision-making.
Weaknesses:
Some reviewers found the explanations of the algorithm and mathematical proofs to be overly lengthy.
The practical significance of generating a diverse set of concept images wasn't immediately clear to all reviewers.
The necessity of using RL and the complexity of the pipeline were questioned.
The paper could benefit from a grammar check and consistent use of terms.
Missing elements:
Clear and concrete examples of how the generated concepts can be used to debug or improve DNNs.
A more thorough justification for design choices, such as the use of RL and DQN.
A clearer explanation of how the method is applied in sentiment analysis tasks.
Reasons for Rejection:
The paper does not adequately demonstrate the practical usefulness of the proposed method for explaining network behavior.
The experiments focus heavily on novelty and abstractness, but their connection to improving user understanding is not well established.
The paper lacks a key experiment directly measuring the usefulness of the generated concepts for explaining model decisions.
Additional Comments from the Reviewer Discussion
The rebuttal period involved extensive discussion on several key points:
Practical Usefulness and Novelty: Reviewers questioned the practical significance of the method and the meaning of novelty in the context of XAI. The authors clarified the notion of novelty as the ability to generate concepts that are beyond what humans can typically conceive, emphasizing its importance in identifying model vulnerabilities. They also provided additional experiments to demonstrate the usefulness of the method in debugging and improving DNNs.
Necessity of RL: The necessity of using reinforcement learning (RL) in the framework was questioned, with suggestions to use simpler methods like brute-forcing seed prompts. The authors defended the use of RL by highlighting its efficiency in exploring the vast space of prompt combinations, arguing that brute-forcing would be computationally infeasible. They also presented experimental results comparing the performance of RLPO with and without RL, demonstrating the benefits of their approach.
Abstraction Levels: The concept of abstraction levels generated by the method was discussed, with reviewers seeking clarification on its meaning and relevance to XAI. The authors explained that abstraction levels provide a layered understanding of the model's reasoning, revealing both high-level and low-level concepts that contribute to its decisions. They acknowledged that while not mandatory for every use case, abstraction levels offer an additional layer of insight into the model's behavior.
Human Understanding: The ability of humans to understand the generated concepts and their role in improving user understanding was a key point of discussion. The authors clarified the distinction between two types of humans involved in XAI: those who create concept sets and those who use them. They emphasized that their method targets the former, automating the process of concept creation, and that the generated concepts are indeed understandable to the latter. They also presented results from a human survey to support their claim that the generated concepts improve user understanding.
Weighing in on the Rebuttal:
The authors' clarifications on novelty and the additional experiments provided valuable insights, but they did not fully alleviate concerns about the practical usefulness of the method.
The defense of RL usage was convincing, especially with the experimental comparison.
The explanation of abstraction levels was helpful, but their practical relevance to XAI remains somewhat unclear.
The distinction between different types of humans in XAI was insightful, but the human survey results were not entirely convincing in demonstrating a significant improvement in user understanding.
Overall, the rebuttal addressed some of the initial concerns but did not fully resolve the primary issue of demonstrating the practical usefulness of the method for XAI.
Reject