4.0

/10

withdrawn4 位审稿人

最低3最高5标准差1.0

3.8

置信度

正确性2.3

贡献度2.0

表达1.3

ICLR 2025

From Feature Visualization to Visual Circuits: Effect of Model Perturbation

geraldin nanfack,Michael Eickenberg,Eugene Belilovsky

OpenReview PDF

提交: 2024-09-28更新: 2024-11-15

摘要

关键词

feature visualizationvisual circuitsrobustness of interpretabilityadversarial model manipulation

评审与讨论

审稿意见

评分: 3置信度: 32024-10-22

This paper introduces a novel explanation attack method that simultaneously manipulates two types of explanations: feature visualization and visual circuits. The experimental results demonstrate that the proposed method can manipulate both explanations. While the authors claim this is the first method to target visual circuits, the evaluation lacks sufficient comparisons to related approaches. Furthermore, additional discussion of the results is necessary. Given these issues, I am inclined to reject this paper in its current form.

优点

The related works and background are well presented, and the proposed method is clearly explained, making it easy to follow.
The paper reveals an intriguing phenomenon—visual circuits exhibit robustness to feature visualization attacks, which adds insight to the field.

缺点

The proposed method is only compared against Push-Up\Down, which limits the scope of the evaluation. Additionally, results for Push-Up and Push-Down on ResNet-50 are missing in Table 2, leaving it unclear whether the performance improvement comes from greater accuracy degradation in ProxPulse. More comprehensive comparisons with a wider range of related methods are needed.
The clarity of the visualizations could be improved. Labels and names in the illustrations are too small and difficult to read, which detracts from the readability of the results.
The paper does not provide sufficient insight into why visual circuits are robust to ProxPulse attacks. Since the proposed method falls within the existing paradigm of explanation manipulation, the authors should offer a deeper discussion or analysis to explain this phenomenon, which is crucial to understanding the significance of the findings.
Although the authors validate their findings with several results, the lack of experiments comparing their method to related explanation attack methods that manipulate visual circuits raises concerns. Without these comparisons, it is difficult to determine whether the robustness observed is a general property of visual circuits or specific to the ProxPulse attack.

问题

Could you provide an explanation for why visual circuits are robust to ProxPulse? If this topic is considered out of scope, please clarify why.
Is the observed performance improvement due to increased accuracy degradation? In Figure 38, the accuracy drop seems still significant—can you provide additional results from other attack methods to confirm that the drop is not substantial? Additionally, why are the Push-Up and Push-Down results for ResNet-50 missing from your evaluation?

审稿意见

评分: 5置信度: 52024-10-31

This paper aims to solve the interpretability problem of large-scale deep neural networks and achieve mechanical interpretability by identifying circuits in the form of subgraphs that can be understood by humans. Specifically, this paper propose a new attack called ProxPulse that can manipulate two types of feature visualizations simultaneously. Furthermore, this paper introduce a new attack based on ProxPulse that reveals the manipulability of visual circuits and highlights their lack of robustness.

优点

This paper addresses the interpretability of DNNs from the perspective of visual circuits and reveals that circuit-based explanations lack uniqueness and stability.

缺点

The overall writing of the article is not concise. It is hard to understand whether it is explaining DNN or proposing a new method for attack tasks.
The limitations of existing methods are not clear in the abstract.
The introduction section does not clearly describe the tasks that this method is applied to.
There is a lack of description of attack methods in related work.

问题

Is the focus of the paper on improving the attack method?
Can re-describe the relationship between ProxPulse and DNN interpretability?

审稿意见

评分: 5置信度: 32024-11-02

This paper addresses model manipulation attacks aimed at altering model parameters without affecting overall performance, but significantly changing interpretability methods. Specifically, they focus on feature visualizations and mechanistic interpretability, introducing two objective functions: ProxPulse and CircuitBreaker. ProxPulse effectively changes feature visualizations but has limited impact on mechanistic interpretability circuits. To address this, the authors propose CircuitBreaker, which they demonstrate to be effective in modifying neuron activations and activation circuits. Moreover, ProxPulse can also be used to alter synthetic feature visualizations. They empirically validate their attack methods with qualitative and quantitative experiments

优点

Major Strengths:

They provide an attack mechanism for mechanistic interpretability circuits, which the authors claim to be the first to do so.
They design and evaluate their attack mechanism for feature visualizations for both natural and synthetic features, as opposed to one or the other.

Minor Strengths:

The authors provide a myriad of quantitative and qualitative experiments.
Attacking the synthetic activations reveals an interesting pattern where the feature visualizations take a similar form across different channels and layers which could lead to further explorations.

缺点

Major Weaknesses/Concerns:

The objective of ProxPulse (Eq 5.), while explained to push up the smallest activations, is not very well motivated or explored, why and how is this objective different/better than the push-up baseline in [1]? The constant C is described as “very large”, how is this hyperparameter tuned and what’s the effect on the resulting model as we change it? Why the inverse logarithmic formulation, why the +1? I can only guess for numerical stability. These should all be discussed.
Similar to the ProxPulse objective, the CircuitBreaker objective (Eq. 7) is difficult to decipher. The second term is straightforward and the motivation is clearly stated, but the authors describe the first term as the same component in the ProxPulse objective with the purpose of “maintaining the initial feature visualization”. This raises major questions: firstly, why? Isn’t the objective of this loss term to alter the circuit activations? Two, as described by the authors and what seems from the mathematical form of the ProxPulse term, it simply increases small activations rather than “maintains” them. Overall, it is unclear to me what the connection is between ProxPulse and CircuitBreaker and why the first term in CircuitBreaker is included.
The CircuitBreaker objective is based on the Attr function described in Eq. 3, which depends on the partial derivative of the activation f. Therefore, optimizing CircuitBreaker as a loss function using gradient descent methods should require computing the Jacobian of the neural network activations, or at the very least, its products with a vector. This is generally considered to be an arduous task due to the high number of parameters. However, I did not find any mention of this in the paper. Perhaps I’m missing something, but I’d appreciate it if the authors clarify how they went about optimizing this objective.

Minor concerns (not affecting the decision):

The authors refer to [2] as they employ a norm-constrained maximization for a first-order approximation. It should be noted that [2] employs first-order Taylor approximation for weight perturbation, which is a different objective than perturbing the input. This might be confusing/misleading to readers whether they are familiar or unfamiliar with that paper. I think talking about a first-order approximation maximizer is sufficient, and [2] can be discussed in related works.
Line 370 is referring to a missing figure (Fig ??)
Some of the CircuitBreaker-related figures contain red outlines for better readability and some don’t. This is cause for confusion, I would suggest adding the red outlines to all the related figures for better comprehension and consistency.
The empirical metrics such as kendal-\tau and clip-\delta are adopted from [1] but not explained in detail with equations in the main text or the supplement.

References:

[1] Nanfack et. al. Adversarial attacks on the interpretation of neuron activation maximization.

[2] Foret et. al. Sharpness-aware minimization for efficiently improving generalization.

问题

Refer to the major weaknesses listed. Overall, I believe the problem is interesting, the results are somewhat convincing and the authors consider a slightly different problem setting than existing works. However, the methodology and the objective functions are vague and not well motivated, and the differences with that of the existing literature are not explored very well, or at the very least, the exploration is not presented well in the text. Perhaps if the authors were to provide explicit clarifications about the stated concerns I would be inclined to change my rating, however, I do feel the presentation is lacking in the method section.

Refer to minor concerns listed for improvements to help the paper (but not affecting the decision).

审稿意见

评分: 3置信度: 42024-11-02

This paper introduces two model manipulation techniques, ProxPulse and CircuitBreaker, applied to image classifiers to demonstrate the vulnerability of feature visualization and mechanistic interpretability techniques. The study evaluates these attacks on models like AlexNet, ResNet-50, ResNet-152, and DenseNet-201. Initially, the authors show that ProxPulse can effectively alter both synthetic and natural feature visualizations without reducing model accuracy. They then demonstrate that ProxPulse is ineffective against visual circuits, prompting the introduction of CircuitBreaker, which successfully disrupts visual circuits and reveals the lack of stability in circuit-based interpretability techniques.

优点

ProxPulse and CircuitBreaker are relatively novel techniques that advance the understanding of manipulation effects on feature visualization and circuit-based interpretability, offering a fresh perspective and contributing novel insights to mechanistic interpretability literature.
This work specifically demonstrates the vulnerability of visual circuits based on structured pruning, an area not previously studied.
The experiments span multiple model architectures (AlexNet, ResNet-50, ResNet-152, DenseNet-201) and the ImageNet dataset, ensuring the findings are relatively robust across different models. The use of various correlation and similarity metrics provides a rigorous evaluation framework.

缺点

Despite proposing novel methods to highlight the vulnerability of feature visualization and visual circuits, previous research has already demonstrated that both types of feature visualizations are susceptible to model perturbations [1,2]. Consequently, the newly proposed attack lacks a fresh message, and the results do not appear surprising or particularly impactful. This limits the paper’s motivation and relevance.
Given this point, the first part of the paper, which reiterates prior findings on feature visualization vulnerabilities, lacks strong motivation. The second part of the paper, focusing on visual circuits, could stand as a separate work, potentially more engaging for the audience. This section would benefit from further content supporting its claims, such as extending the analysis to non-vision models (e.g., language models) and alternative architectures beyond CNNs.
Another major weakness is that the proposed attacks are not tested on more robust models, such as those adversarially trained. This limitation raises questions about the effectiveness of ProxPulse and CircuitBreaker when applied to more resilient architectures.
The study’s focus on vision-based models also limits the generalizability of ProxPulse and CircuitBreaker to other domains, such as language or multimodal networks. Moreover, feature visualization and visual circuits represent only a subset of interpretability techniques, and the proposed methods may not be applicable to the broader range of interpretability methods in machine learning.
The experiments are conducted on relatively simple CNN architectures and datasets, leading to questions about whether these findings would hold for more complex, state-of-the-art models.
The presentation includes numerous visualizations but lacks sufficient summary analysis, which can make it difficult for readers to discern overarching patterns across experiments. Additionally, the presentation could be substantially improved: spacing needs refinement, many figure captions are too small, and several images are so tiny they require zooming to be readable. This hinders the reader’s comprehension and overall experience with the paper.
Although the paper identifies vulnerabilities in interpretability techniques, it does not offer potential defense mechanisms or mitigation strategies, which would make the work more comprehensive and practical.

[1] Geirhos, Robert, et al. "Don't trust your eyes: on the (un) reliability of feature visualizations." arXiv preprint arXiv:2306.04719 (2023).

[2] Nanfack, Geraldin, et al. "Adversarial attacks on the interpretation of neuron activation maximization." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 5. 2024.

问题

In addition to the weaknesses mentioned above, please address the following questions:

Have the authors experimented with robust classifiers? These models often exhibit better saliency explanations and more stable representations.
What about other architectures, such as MobileNet, EfficientNet, or ConvNeXt?
Do the authors provide a full comparison with baseline methods for visual circuits? It appears that this is only partially covered in Section A.6.
In line 370, there is an undefined figure reference (Fig. ??) that needs correction.

撤稿通知

2024-11-15

We thank the reviewers for their time and constructive feedback. We will incorporate these suggestions into the revised version of our paper.