PaperHub

Rating: 6.0/10 · Poster · 4 reviewers (min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6, 6
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.3

NeurIPS 2024

MambaLRP: Explaining Selective State Space Sequence Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-15

Abstract

Keywords
Explainable AI, explainability, interpretability, state space models, Mamba, long-range dependency

Reviews and Discussion

Review
Rating: 6

The paper presents a method to correctly apply LRP to Mamba models. Through careful analysis, the authors demonstrate that applying LRP directly results in poor performance and propose modifications to recover the propagation rules. The obtained method outperforms the alternatives, grounded by theory, and the authors show several applications of their method, including identifying gender biases and measuring the long-range abilities of S6.

Strengths

  1. Impact: Mamba is an emerging architecture (~600+ citations, 10K+ stars on GitHub). Thus, providing an attribution method for these models is crucial. In this sense, correctly applying LRP is an important direction that helps the community improve and understand these models.

  2. Simplicity: The method and modifications are very simple, with few hyper-parameters. While this might be seen as a lack of novelty, I view it as an advantage, making the method easy to use and adaptable to various applications, such as other variants of Mamba (Mamba2) and other domains (DNA, speech and more).

  3. Informative ablations: The ablation studies and justifications of design choices are insightful. For example, Figure 3 and Tables 1, 2, 6, 9 provide comparisons with naive LRP and also present ablation studies that allow the reader to measure the contribution of each modification to the method.

  4. Use cases: Section 6 shows that the method is applicable in practice, allowing the authors to explore the gender bias and long-range abilities of Mamba models and to provide insightful analyses of them.

Weaknesses

  1. The comparison with previous work should be improved:
  • 1.1) Metrics and Benchmarking: Can the authors highlight which results are reproduced by them and which results are taken from previous work? Moreover, as far as I understand, the work in [4] is the only one that developed an interpretability method for Mamba before (or perhaps in parallel to) this work. If I understand correctly, although the two methods share and use the same pre-trained models, there is no overlap in the metrics. Am I correct? If so, is there a reason for this discrepancy? Can the authors compare results directly with previous work using previously proposed metrics to ensure the gap doesn't arise from employing [4] incorrectly?

  • 1.2) Informative Comparison: Additionally, I think the comparison with [4] can be more accurate. The method in [4] applies only to S6 layers (without gating and convs), while MambaLRP is an end-to-end method (which is a strength of MambaLRP). However, it is still important to make some apples-to-apples comparisons. Can the authors check both methods on a model without conv and gating layers to determine which approach is better at explaining S6 layers? Alternatively, it seems that the issue with [4] is fixed in [Uni]. Would the authors be able to compare their method to [Uni]? Providing this comparison would be highly valuable to the community. (I understand that this is a very new paper, so I will not decrease my score if the results are less favorable than those of [Uni].)

  2. Insufficient empirical analysis:
  • 2.1) For the long-range experiments (Figure 6), it would be very insightful to compare the behaviour of Mamba to Transformers (Pythia pre-trained models can serve as the Transformer baselines since they are trained on the same data). Additionally, analyzing the trend with larger models could yield valuable information on how increased model size enhances long-context capabilities in practice; perhaps the 7B models from [Sca] can be used (which are available on Hugging Face).

  • 2.2)

“Residual lack of conservation is due to the presence of biases in linear and convolution layers, which are typically non-attributable”

Can the authors empirically justify this claim? This can be easily validated by taking a pre-trained model, omitting the biases, fine-tuning it for several epochs, and then checking the conservation again.

  3. Novelty: The novelty of the method is somewhat limited. One could argue that it merely involves a few applications of detach and the simple half-propagation rule (in addition to previously proposed contributions), which can be easily summarized in a few lines of code (as detailed in Algorithm 2). However, I believe this is not a significant drawback, particularly given the demand for such simple tools in the community. Additionally, the thorough evaluation, including insightful ablation studies, novel test case evaluations, and theoretical justification, is sufficiently robust.
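For illustration, the kind of detach-based rule in question can be sketched in a few lines of PyTorch (my own sketch of the general pattern, not the authors' Algorithm 2; `gate_half_propagation` is a hypothetical name):

```python
import torch

def gate_half_propagation(x, g):
    # The gate z = x * g, rewritten so that Gradient x Input splits the
    # output relevance equally between the branches: the forward value is
    # unchanged (0.5*(x*g + x*g) = x*g), but d/dx = 0.5*g and d/dg = 0.5*x,
    # so R(x) = R(g) = 0.5*x*g and R(x) + R(g) = z (conservation).
    return 0.5 * (x * g.detach() + x.detach() * g)

x = torch.randn(4, 8, requires_grad=True)
g = torch.randn(4, 8, requires_grad=True)
out = gate_half_propagation(x, g).sum()   # scalar output to explain
out.backward()
R_x, R_g = x * x.grad, g * g.grad         # Gradient x Input relevances
assert torch.allclose((R_x + R_g).detach(), (x * g).detach(), atol=1e-5)
```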

[Uni] A Unified Implicit Attention Formulation for Gated-Linear Recurrent Sequence Models. Zimerman et al.

[Sca] An Empirical Study of Mamba-based Language Models. Waleffe et al.

Questions

  1. Mamba LRP uses "Detach" on the SSM recurrent matrices. Similarly, [4] uses an attention matrix determined by A, B, and C parameters. It seems that both methods ignore the influence of the input on the system matrices (which is the core of the selection mechanism of Mamba). Am I right? Could addressing this issue provide a way to improve both methods? I would be glad to hear what the authors think about it.
  2. Minor: Traditionally in the SSM domain, Delta denotes the step size. I suggest the authors replace \Delta with \delta when discussing the differences between the two scores.
  3. In the Needle-in-a-haystack experiment (Figure 7), is there a reason not to increase the context length? It seems that the most interesting part of the figure is missing (which could show if there are edge cases, for example, regimes where the model succeeds in finding the needle, but Mamba-LRP fails).
  4. Minor: Perhaps a relevant work that is missing is “Does Transformer Interpretability Transfer to RNNs?” by Paulo et al.
  5. There is a standard trend in the SSM literature to omit D (and treat it as a parameter-based skip connection). Is it used in Mamba LRP, or is it ignored (like other biases, which are typically non-attributable)? It would be better if it were written explicitly in the paper.
  6. I wonder if the authors can explore the potential limitations or failure cases of the proposed Mamba-LRP. Such information can help the community improve the method in the future. Are there cases where [4] or naive LRP might be better than Mamba-LRP?
  7. Half propagation: While I'm not an expert in LRP, I suspect there are more effective methods to manage the gating mechanism. For instance, instead of normalizing the scores by averaging ($0.5(x + y)$), it may be better to use a weighted approach such as $(1 - a) \cdot x + a \cdot y$, where the value of a is determined by the actual norms of x and y, with different a values for each channel. Is there something I'm missing? Can this method improve the conservation properties?
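Concretely, one possible instantiation of this suggestion could look as follows (a sketch under my own assumptions; the per-channel, norm-based weighting is illustrative):

```python
import torch

def gate_norm_weighted(x, g, eps=1e-9):
    # Mixing weight a in [0, 1] per channel, from the branches' detached norms.
    nx = x.detach().norm(dim=0, keepdim=True)
    ng = g.detach().norm(dim=0, keepdim=True)
    a = ng / (nx + ng + eps)
    # Forward value is still x * g; only the relevance split changes:
    # R(x) = (1 - a)*x*g and R(g) = a*x*g under Gradient x Input.
    return (1.0 - a) * x * g.detach() + a * x.detach() * g
```

Note that conservation would hold for any a in [0, 1] in this construction, since R(x) + R(g) = x * g regardless; the open question is which split yields the most faithful attributions.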

Limitations

From my perspective, the authors address most of the limitations, except those pointed out in the weaknesses section and question 6 (failure cases).

Author Response

Thanks for the valuable comments. We address them below and add more discussions as an official comment.

1.1: We use the official code of [1] to produce the results of AttnRoll and G×AttnRoll (Mamba-Attr). We evaluated MambaLRP's performance against these approaches using flipping and insertion metrics that are well-established for analyzing faithfulness [4,6] and are directly related to those used by [1]. In our metrics, $A^F_{MoRF}$/$A^F_{LeRF}$ corresponds to what [1] calls positive/negative perturbation. [4] suggests combining both metrics to derive a more resilient metric, resulting in $\Delta A^F = A^F_{LeRF} - A^F_{MoRF}$, used in [4,5]. The only difference is that [1] tracks accuracy in positive/negative perturbations, while we track changes in the output logit, which is standard practice in the XAI community [4,5,6]. Both metrics highly overlap, as shown in Tab. B of the attached PDF. When using faithfulness metrics, various factors can impact the result, e.g. how the input is masked. To ensure a fair and consistent analysis across all models, tasks and XAI methods, we use a unified evaluation metric, as done in [3,5,6], and have not directly used the numbers reported in [1].
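For concreteness, the flipping metric can be sketched as follows (illustrative only; `model_logit` is a hypothetical interface mapping a 1-D feature array to the scalar logit of the explained class, and masking details vary across setups):

```python
import numpy as np

def flipping_auc(model_logit, x, relevance, order="morf", steps=10, mask_value=0.0):
    # Occlude features block by block in relevance order, tracking the logit.
    idx = np.argsort(relevance)
    if order == "morf":                      # most relevant first
        idx = idx[::-1]
    logits, x = [model_logit(x)], x.copy()
    for chunk in np.array_split(idx, steps):
        x[chunk] = mask_value                # remove the next block of features
        logits.append(model_logit(x))
    return float(np.trapz(logits)) / steps   # normalized area under the curve

# Delta A^F = flipping_auc(..., order="lerf") - flipping_auc(..., order="morf"):
# a faithful explanation collapses the logit quickly under MoRF and keeps it
# high under LeRF, so larger Delta A^F is better.
```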

As requested, we follow the metric of [1] and mask pixels from least-to-most relevant and vice versa (10%-90%), track the top-1 accuracy, and calculate the AUC for the prediction on the ImageNet val set for Vim-S (Tab. C in the attached PDF). As shown, MambaLRP outperforms [1].

1.2: [1] argues that Mamba uses implicit attention mechanisms and provides XAI tools, which aggregate attention across layers and may not capture the impact of certain layers. In contrast, MambaLRP, as you noted, considers the full model structure, highlighting its strength. Moreover, [1] introduced Mamba-Attr, which, as stated in the paper, "exploits the gradients of both the S6 mixer and the gating mechanism". In our evaluations, we compared MambaLRP with G×AttnRoll (Mamba-Attr). As the methods of [1] are designed to explain the predictions of Mamba models, it is both valid and fair to compare MambaLRP with them in terms of the faithfulness of the explanations, as done in [3,5]. Despite [2] being released after the NeurIPS deadline, we compared MambaLRP's performance with the numbers reported in [2] for Vim-S in Tab. C of the PDF. This shows that MambaLRP outperforms the recent approach of [2]. We have added [2] to our related works.

2.1: As requested, we did a direct comparison to Transformers for our LRD use case (Tab. A of the attached PDF). Many widely used Transformers, e.g. GPTNeoX and Pythia, unfortunately do not allow inputs longer than 2048 tokens or do not generate sensible text from long inputs. Instead, we used Llama-2 and Llama-3. We found that Llama-3 uses information from more intermediate mid-range dependencies than Mamba, though both favor tokens close to the end of the context. Given Llama-3's larger size (8B) compared to Mamba (130M) and their different training settings, the analysis supports that Mamba can effectively use long-range information. Please refer to our answer to reviewer 4ExW for details.

2.2: Please find the results in Fig. A of the attached PDF; conservation is fully preserved for biases set to 0.
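A minimal sketch of the bias-removal step in this check (the subsequent fine-tuning and conservation measurement are separate; illustrative only):

```python
import torch.nn as nn

def zero_biases_(model: nn.Module):
    # Zero all linear/convolution biases in place; after fine-tuning such a
    # bias-free model, any residual conservation gap attributable to biases
    # should vanish.
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)) and m.bias is not None:
            nn.init.zeros_(m.bias)
```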

3: As many XAI methods, including LRP, are not model-agnostic, tailored approaches must be developed for emerging classes of DNNs like Selective State Space Sequence Models (Mamba models). Mamba models include previously unstudied components, making existing LRP rules unreliable for these models, as shown in Tab. 1 of the paper. Since the explainability of Mamba models using LRP has not been addressed in the literature, we propose novel LRP rules specifically for SSM components, SiLU, and multiplicative gates. These rules are not chosen heuristically but are the result of our thorough analysis of how relevance propagates through each component and when conservation breaks. Please refer to Appendix A and B for our theoretical analysis and derived LRP rules. As ease of implementation is desirable for XAI methods, we provided straightforward implementations that bypass the need to implement complex LRP rules. These highlight the strengths of our approach: 1) theoretical soundness, 2) more faithful explanations, and 3) ease of implementation. Our contributions are the theoretical analysis of relevance propagation in different Mamba components, proposing new LRP rules for those that violate conservation, providing straightforward implementations, thorough performance evaluation against other XAI approaches, and insightful use cases, showing the value of MambaLRP for other lines of XAI research.
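For illustration, the conservation property guiding this analysis can be checked directly (a minimal sketch; `model` is assumed to map input embeddings to logits):

```python
import torch

def conservation_gap(model, embeddings, class_idx):
    # Ideally the relevances sum to the explained score, sum_i R_i == f_c(x);
    # the returned value measures how strongly this is violated.
    embeddings = embeddings.clone().detach().requires_grad_(True)
    score = model(embeddings)[..., class_idx].sum()
    score.backward()
    relevance = embeddings * embeddings.grad   # Gradient x Input per feature
    return (relevance.sum() - score.detach()).abs().item()
```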

Questions

1: In the forward pass, $A$, $B$, and $C$ are calculated based on the input; in the backward pass they are treated as constants. By detaching them, we do not consider the gradient flow through them; however, their influence is still accounted for, since they weight the input in the backward pass.
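To illustrate, a toy version of such a detached recurrence could look as follows (schematic only, not our optimized implementation; shapes are illustrative and the state matrix is diagonal, as in Mamba):

```python
import torch

def selective_scan_detached(x, A_bar, B_bar, C):
    # Diagonal selective scan h_t = A_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>.
    # The input-dependent parameters are detached, so in the backward pass
    # relevance flows only through the input sequence x itself.
    A_bar, B_bar, C = A_bar.detach(), B_bar.detach(), C.detach()
    h = torch.zeros_like(B_bar[:, 0])            # state: (batch, state_dim)
    ys = []
    for t in range(x.shape[1]):                  # x: (batch, seq_len)
        h = A_bar[:, t] * h + B_bar[:, t] * x[:, t:t+1]
        ys.append((C[:, t] * h).sum(-1))
    return torch.stack(ys, dim=1)                # y: (batch, seq_len)
```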

3: We initially limited the context length to 2048 to match the model's training and focus on retrieval accuracy limitations and MambaLRP solutions. At your request, we extended the experiment to 4k context length, with results in the PDF. Performance drops beyond 2560 and we found no cases where the model retrieved correctly but MambaLRP failed to explain reliably.

5: $D$ is used in MambaLRP. As it does not violate conservation, no modifications are necessary.

6: We investigated edge cases using Mamba-1.4B trained on Med-BIOS and found that MambaLRP outperforms Mamba-Attr [1] in about 95% of instances. In the most extreme edge case, both methods mostly select the same relevant tokens, but their score distributions differ slightly (see Fig.B of PDF).

7: In our work, $\alpha$ is not treated as a hyperparameter but set to 0.5 based on theoretical considerations. Specifically, in Equation (12) of Appendix A.3, we show that $\mathcal{R}(x) = 2\mathcal{R}(y)$, leading to the choice of 0.5.

We will include your suggestions in the paper.

Comment

Due to space constraints, we had to keep our earlier responses brief. We would now like to take this opportunity to expand on some of our answers to ensure greater clarity and convenience for you.

Further elaboration on the experimental results of the long-range dependency (LRD) experiment

Inspired by your comments, we conducted comparisons using two state-of-the-art Transformers: Llama-2 and Llama-3.

Setup: We employed LRP propagation rules of [3] to extract explanations for the Llama models. As in the Mamba experiment, we generated 10 additional tokens from HotPotQA's input and analyzed the prediction of the generated token at each step.

Results: Our findings are summarized in Table A of the attached PDF. Notably, Llama-2, which was trained with a context length of 4096 tokens, begins to produce less sensible text when given inputs exceeding this length. This behavior aligns with observations in recent studies [7,8].

In contrast, Llama-3 and Mamba are capable of generating sensible text even with longer context lengths, as shown in Table A of the PDF. Although Llama-2's histogram suggests it uses the full context window and appears to identify more relevant long-range input tokens than Llama-3 and Mamba, closer inspection reveals that this long-range capability is based on unspecific token information. The most relevant tokens for Llama-2 are often non-semantic, such as the newline token <0x0A> and the beginning-of-sentence token <s>, which often appear at the beginning of the context paragraph. This reliance on non-semantic tokens explains the long-range information retrieval seen in the histogram, particularly for context sizes exceeding 4K tokens (the shaded areas in Llama-2's histogram).

In contrast, Llama-3 and Mamba primarily attribute relevance to semantically meaningful tokens, particularly those near the end of the input context window. Llama-3’s histogram indicates it uses more intermediate mid-range dependencies compared to Mamba. Given Llama-3’s significantly larger size (8B) compared to Mamba (130M), our initial analysis shows that Mamba effectively uses long-range information. We also find that this ability is not exclusive to SSMs but can also be achieved by larger Transformers (e.g. Llama-3).

In this initial investigation, MambaLRP has proven valuable in comparing the long-range context capabilities across models. We hope our work will facilitate further comparative studies in this area. We will include a summary of this comparison in our final paper.


Questions

  1. Is there a reason not to increase the context length in needle-in-a-haystack of Figure 7, and are there any failure cases?

We did not include context lengths beyond 2048 in the paper, as the model was trained using a context length of 2048 and the study was not designed to assess extrapolation beyond this limit. Our study highlights some limitations of the retrieval accuracy metric and introduces an explanation-aware measure. This new measure ensures that the retrieved information is not only correct but also correct for the right reasons. As requested, we extended the experiment to context lengths beyond 2048, up to 4096, and included the results in Figure C of the PDF. As can be seen, the model's retrieval performance begins to drop as the context length goes beyond 2560.

In our analysis, we did not encounter any edge cases where the model successfully retrieved the needle but MambaLRP failed to provide the correct explanations. However, we have shown cases where the model found the needle based on incorrect features, which MambaLRP was successfully able to detect. Please refer to Appendix C.7 (Figure 13).

  2. Including "Does Transformer Interpretability Transfer to RNNs?"

Thanks for your suggestion. We will add it to our related works.

  3. Is $D$ used in MambaLRP?

Skip connections are not ignored in MambaLRP. As they do not violate the conservation property, no modifications are necessary. This is why we did not mention them in the paper. However, based on your suggestion, we will include it in the paper.

If you have any further comments or questions, we are happy to address them during the discussion period.

Comment

Thank you for your response. I appreciate the clarifications and the new material, which significantly enhance the paper. The updated results, particularly Figure A and Table C, are highly informative and insightful, and I strongly recommend including them in the final version.

Since weaknesses 1 and 2 (comparison with previous work and insufficient empirical analysis) were partially addressed, I am raising my score from 5 to 6 and my confidence from 4 to 5. The score is not higher than 6 due to the relatively limited novelty, as I (and other reviewers) point out in weakness 3:

The novelty of the method is somewhat limited. One could argue that it merely involves a few applications of detach and the simple half-propagation rule (in addition to previously proposed contributions), which can be easily summarized in a few lines of code (as detailed in Algorithm 2). However, I believe this is not a significant drawback, particularly given the demand for such simple tools in the community. Additionally, the thorough evaluation, including insightful ablation studies, novel test case evaluations, and theoretical justification, is sufficiently robust.

One final request - the tools proposed in the paper are valuable and could be useful for the community. However, implementing them accurately and efficiently may be challenging for researchers unfamiliar with Mamba or LRP. To fully realize the potential of this work, it is critical to make it accessible and relevant. Therefore, I suggest that the authors upload their code (with minor modifications to support Mamba 2), alongside the camera-ready version.

Comment

Thank you for your valuable feedback and suggestions. We are happy to see that you have increased your score. We have included the new experiments in our paper.

Regarding making the code accessible

We understand the importance of providing accessible and clean code to the community. Therefore, we are preparing a user-friendly GitHub repository, and we will include the link in the camera-ready version. We further plan to extend MambaLRP and our code to include Mamba variants including the recent Mamba2 model.

Thank you again for the support of our work, and we look forward to seeing how open access to reliable XAI for Mamba can lead to domain insights and model improvements in our community.

Review
Rating: 6

Post-rebuttal update

Given the authors' additional experiments and changes-to-be-made to the manuscript, raising my original score from a 5 to a 6.

Original review

The authors tackle the problem of explainability in Mamba SSMs, which have been recently proposed and widely adopted. Towards this end, they leverage layer-wise relevance propagation (LRP) and derive the necessary procedures/equations to enable LRP in Mamba models, similar to those previously proposed in the Conservative Propagation paper for Transformers (Ali et al., 2022). They benchmark their procedure on both Mamba language and vision models across a range of benchmarks and other widely used explainability methods to demonstrate the effectiveness of their approach.

优点

Originality - Given the widespread adoption of Mamba SSMs since their recent debut, explainability of these models is a very important topic. While the specific detach approach to enable LRP in LLMs is not new, the adaptation to Mamba models is new and an excellent application of this methodology.

Quality - The proposed approach is a good addition to the growing list of Mamba works, and will make a good tool for practitioners/researchers to explain the decisions of Mamba models going forward. The extensive benchmarks and evaluation of both NLP and vision models and tasks are also solid contributions. However, the evaluation of Mamba LLMs was somewhat limited (see weaknesses for further discussion).

Clarity - For the most part, the paper is well written. However, important details regarding evaluation are either missing or could be better highlighted (discussed in weaknesses).

Significance - As previously mentioned, Mamba models continue to gain traction, particularly with the recent release of Mamba 2. Thus, this work has a significant opportunity for impact as the need to explain the decision-making processes behind Mamba models grows.

Weaknesses

Quality

As previously noted, there is significant room for improvement regarding the evaluation of Mamba LLMs. In particular, only the smallest and second largest checkpoints are used for the majority of experiments (with the exception of the 2.8B checkpoint used in the needle-in-the-haystack tests). However, for a paper dedicated to enabling explainability for this line of LLMs, this seems insufficient; why are the 130M and 1.4B checkpoints used for the majority of tests? In particular, the 130M model performs the worst among the suite of Mamba 1 models, but is used to demonstrate qualitative claims about the method (e.g., Figure 3, Figure 4, and reported runtimes). Given the 2.8B is the most accurate and, thus, most desirable for practical use, it should ideally be evaluated along with the 130M (and 1.4B) models to demonstrate the efficacy of this approach at the 2.8B parameter size.

Clarity

For both the presented experiments and compared methods, several important details were not clear in the paper. In particular, for Figure 6, how is position difference calculated? E.g.:

We use the HotpotQA [70] subset from the LongBench dataset [11], designed to test long context understanding. After selecting all 127 instances, containing sequences up to 8192 tokens, we prompt the model to summarize the full paragraph by generating ten additional tokens. Fig. 6 shows the distribution of the positional difference between a relevant token and the currently generated token. How are the relevant tokens determined? Upon first pass, I wondered whether they were somehow derived from the supporting-fact labels in the HotpotQA dataset, but it seems to be calculated as the difference between the token's generated position and its position in the input. Can you clarify this, as well as how (exactly) the histogram in Figure 6 is calculated? Also, the HotpotQA subset of LongBench contains 200 instances, yet the paper evaluates 127; what is the source of this discrepancy?

Most importantly, it is not clear whether the authors are comparing to Attention Rollout, which the Ali et al. 2024 Mamba explainability paper was based on, or Mamba-Attribution (i.e., the new mechanism derived for Mamba models in the Ali et al. 2024 paper). As this is the most relevant Mamba-specific method given the proposed method, this point requires clarification within the main text.

Other portions which are unclear:

  • "The results without fast CUDA kernels" <- what are fast CUDA kernels in this context? The hardware-optimized selective SSM scan kernel included in Mamba (in contrast to the full PyTorch SSM kernel)? Or are these kernels the authors are contributing to speed up their described method?
  • "All models were trained for a maximum of 10 epochs, with an early stopping mechanism in place." <- What is the early stopping criterion?
  • Lines 192-193 should forward-reference the fine-tuning details further described in "C.1 Models and dataset." Otherwise, it is easy to miss the small excerpt on line 193 which alludes to the 130M and 1.4B models being fine-tuned for the various tasks (which could lead to thinking pretrained Mamba models are being evaluated).

Questions

"We use an instruction-finetuned Mamba-2.8B model" <- In order to make the paper as self-contained as possible, could the authors summarize the fine-tuning recipe for this model within the supplementary (rather than forward referencing to the huggingface page)?

Limitations

Could the authors include foreseeable limitations of their work? E.g., potentially large memory utilization to enable explainability in Mamba models? Also, in the case that the compared method AttnRoll was not Mamba-Attribution (or not evaluated using the released code of the Ali et al. 2024 paper), these would be important evaluation limitations to list.

Author Response

Thanks for your detailed review and useful suggestions. We will address your points below.

Evaluation of Mamba LLMs also on larger 2.8B models

With our empirical evaluations in Table 1, we aimed to cover a representative range of tasks (four different text classification tasks, image classification), models (Mamba, Vision Mamba) and model sizes (130M and 1.4B), which confirmed that our proposed approach consistently achieves the highest faithfulness scores compared to existing explanation methods, independent of the performance, architecture, and domain of the specific Mamba model.

It is important to note that our experiments aim to evaluate the effectiveness of our explanation method against other methods applicable to Mamba-based models, rather than evaluating the capabilities of the Mamba architecture itself. To our knowledge, this is the most comprehensive evaluation of different attribution methods for Mamba models in the literature. Previous work [MambaAttr2024] was tested on the Mamba-130M and Vim-S models.

To further extend our evaluations and follow your suggestion, we fine-tuned the Mamba-2.8B model on the SST-2 dataset using the same protocol described in Appendix C.1. The following table presents the performance evaluation of different explanation methods on this model, confirming the superior performance of MambaLRP. We are currently running this experiment on the other NLP datasets as well and will add the results to the final version of the paper.

Mamba-2.8B results:

| Method                              | $\Delta A^F$ |
|-------------------------------------|--------------|
| Random                              | 0.007        |
| GI                                  | -0.043       |
| SmoothGrad                          | -0.675       |
| IG                                  | 0.322        |
| AttnRoll                            | 0.452        |
| G×AttnRoll (Mamba-Attribution)      | 0.341        |
| LRP (LN-rule)                       | 0.820        |
| MambaLRP (Ours)                     | 1.157        |

How are the relevant tokens in Figure 6 determined and how is the histogram calculated? Why are there 127 of 200 HotpotQA samples considered?

As mentioned in Section 6, we used the top-k most relevant tokens that received positive relevance scores when applying MambaLRP and plotted the resulting histogram of relevant tokens over the positional differences between the generated and identified relevant tokens. In the paper, we present results for $k=10$. We have further validated that the shape of the histogram distribution of relevant tokens remains consistent for $k \in \{1,3,5\}$. By applying a context length cutoff of 8096 to filter HotpotQA, we obtained the 127 samples used in our experiment.
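Schematically, the histogram can be computed as follows (a simplified sketch; `relevance_rows` is a hypothetical list holding one relevance vector per generated token, covering the positions seen so far):

```python
import numpy as np

def positional_difference_histogram(relevance_rows, k=10, bins=50):
    diffs = []
    for R in relevance_rows:
        top = np.argsort(R)[-k:]                        # k most relevant positions
        diffs += [len(R) - p for p in top if R[p] > 0]  # positive relevance only
    return np.histogram(diffs, bins=bins)               # counts over distances
```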

Clarification regarding comparison to [MambaAttr2024]

We compared our proposed method to both Attn-Rollout and Mamba-Attr introduced for the Mamba models in [MambaAttr2024]. We referred to them in our paper as AttnRoll and G×AttnRoll, for consistency with the respective methods proposed for Transformers. We have changed the name G×AttnRoll to Mamba-Attr in our paper.

What are fast CUDA kernels in this context?

When we mention fast CUDA kernels, we are referring to the hardware-optimized selective SSM scan kernel included in Mamba. Our proposed explanation method does not involve any additional optimizations.

What is the early stopping criterion?

In our experiments, we employ early stopping and end training as soon as the validation loss ceases to improve. We have now added this information to the paragraph on training details in Supplement C.1.
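For clarity, this criterion corresponds to a zero-patience stopper (a minimal sketch):

```python
class EarlyStopper:
    """Stop as soon as the validation loss fails to improve on the best seen."""
    def __init__(self):
        self.best = float("inf")

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            return False
        return True
```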

Adding references to Appendix C.1 and adding details about instruction-finetuned model

Thanks for your suggestions. We have now added a reference to enhance accessibility and clarify our experimental setup in Section 5. We will also add more information about the instruction-finetuned model to the supplementary material.

Regarding including foreseeable limitations of our work and potential memory limitations, clarification on comparison to [MambaAttr2024]

Regarding memory consumption, LRP is a backpropagation-based explanation method that requires storing activations and gradients. The memory usage depends on the model and input size. To reduce memory consumption, techniques such as gradient checkpointing can be employed. This is also true for other gradient-based methods.
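As an example, with PyTorch's built-in checkpointing (a sketch; `blocks` stands for a hypothetical list of Mamba blocks):

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, hidden):
    # Recompute each block's activations during the backward pass instead of
    # storing them, trading compute for memory when extracting relevances.
    for block in blocks:
        hidden = checkpoint(block, hidden, use_reentrant=False)
    return hidden
```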

Regarding the comparison with [MambaAttr2024], we did evaluate the performance of MambaLRP against Mamba-Attr (see also our answer to your previous point above). We have now clarified this potential misunderstanding in the main text and apologize for any confusion about the specific baseline method we used. For our evaluation of Mamba-Attr and Attn-Rollout, we have used the official code provided by the authors, as described in Appendix C.3, providing a fair and reproducible benchmark evaluation.

A potential limitation of gradient-based explanation methods, like MambaLRP, is that the gradient information might not be accessible, e.g., due to proprietary constraints. In these situations, one possible solution is to approximate the gradient information. We will use the additional space to add a dedicated Limitations section to outline and discuss these aspects of our method.

[MambaAttr2024] A. Ali, et al. The hidden attention of mamba models. arXiv:2403.01590, 2024.

Comment

I thank the authors for their detailed response. My concerns have been addressed, and the authors have agreed to incorporate these changes into the next draft of the paper.

As previously stated, the submitted work is especially important as the Mamba family of models continues to be widely adapted.

It is important to note that our experiments aim to evaluate the effectiveness of our explanation method against other methods applicable to Mamba-based models, rather than evaluating the capabilities of the Mamba architecture itself. To our knowledge, this is the most comprehensive evaluation of different attribution methods for Mamba models in the literature. Previous work [MambaAttr2024] was tested on the Mamba-130M and Vim-S models.

The breadth of evaluation is appreciated. Note that MambaAttr2024 is an unpublished manuscript, so their evaluation has not faced scrutiny for NeurIPS publication. For NeurIPS, rigor is necessary. Evaluating the proposed method on the largest checkpoint is representative of how well MambaLRP will work on the Mamba model most likely to be used by LLM practitioners; in personal experience, the behavior/performance of the 5 Mamba LLM checkpoints varies significantly. Thus, generalization across the checkpoints is not a given, and the authors' additional experiments to confirm this greatly help to address this concern. I look forward to the other NLP experiments evaluated on the 2.8B model, and I plan to raise my score from a five to a six.

As noted by another reviewer, the reproducible implementation of this method is a critical contribution, as well as the benchmarking suite. Mamba and Mamba2 have proven temperamental for features considered standard for Transformer models (e.g., unmodified use in Huggingface's SFTTrainer), so there is a large margin for error involved in researchers reproducing these results from scratch. What are the authors' plans for releasing code to reproduce results from the paper?

Comment

Thank you for your thoughtful comments and for recognizing the improvements made during the short rebuttal period. We are glad to hear that your concerns have been addressed and that you are increasing your score. We are actively working on including all NLP experiments for the 2.8B model in the camera-ready version to ensure a comprehensive evaluation across different checkpoints. Additionally, the results for the SST-2 dataset, which were ready during the rebuttal, have already been added to the paper.

Regarding plans for releasing the code

We understand the importance of sharing our code to support ease of use and ensure reproducibility. To this end, we are preparing a user-friendly GitHub repository to provide easy access. The link to this repository will be included in the final camera-ready version of our work. Additionally, detailed instructions for replicating our results, along with demo Jupyter Notebooks, will be available in the repository.

Review
Rating: 6

The paper introduces an LRP framework for Mamba. The method breaks the Mamba architecture into three parts (SiLU activation, selective SSM, and multiplicative gating) and analyzes the layers using relevance scores. The evaluations on language and image tasks show that the proposed method is more precise and faithful than other explanation methods. They also show several use cases including gender bias, long-range capabilities, and the needle-in-a-haystack test.

Strengths

  • The paper is easy to follow.
  • The introduced LRP framework is faster and provides more faithful explanations than other explanation methods.
  • The experiments show interesting interpretations of Mamba specifically for gender bias, long-context capabilities, and needle-in-a-haystack test in Mamba.

Weaknesses

  • The current paper is limited to the explainability of Mamba only. I believe LRP can be applied to Transformers. It would be more interesting to compare the behavior of different methods instead of showing the behavior of one method.

  • This point is related to the first point. The use case experiments are interesting, but it would be more useful if this could be compared with different methods. For instance, in the long context capability of Mamba, the paper mainly shows that Mamba can use earlier context. Instead, the paper can show that one method has better long context capability than the other in terms of accuracy, and it is verified by LRP. Also, the behavior of different methods can be compared such as 1) if the Transformer-based method is not as good as Mamba in terms of long-range modeling, what does the Transformer model focus on 2) how does the behavior change with tasks (generation vs QA) 3) how does it change with the training datasets, etc.

Overall, the paper lacks a more interesting and useful analysis of the methods using LRP as explained above. This limits the contribution of the paper.

Questions

  1. Tables 10 and 11 show the runtime comparisons with and without using fast CUDA kernels. When using fast CUDA kernels, other methods including Gradient × Input, SmoothGrad, and Integrated Gradients speed up considerably (22-25x faster), but the proposed MambaLRP has only a 1.3x speed-up. What is the reason?

Limitations

The paper briefly explains the limitations in Section 7.

Author Response

Thanks for your detailed feedback. We are happy to see that you appreciate the potential of our approach in providing faithful explanations. We will address your points below.

Regarding the paper being limited to the explainability of Mamba

The class of selective state space sequence models (Mamba models) presents a significant change in model architecture compared to Transformers, necessitating tailored XAI methods to address their unique components. As LRP is not model-agnostic, new LRP rules must be developed for emerging classes of deep neural networks, as they may contain new components for which no existing LRP rules apply. For instance, the LRP framework has already been successfully extended to include Transformers in [3,5].

Mamba models, however, include new components absent in Transformers, and no prior propagation rules exist for them. Consequently, following the work of [3,5,9,10,17] who have extended the LRP framework to new architectures (e.g. Transformers, CNNs, GNNs, regression networks), we derived new LRP rules for Mamba models. Proposing these rules is not trivial; it requires performing a detailed analysis of the relevance propagation process, guided by the conservation axiom. This resulted in more faithful explanations compared to naive applications of existing LRP rules not designed for Mamba and also other approaches, presented in our main paper (Table 1).

Regarding comparing to the behavior of Transformers

While we agree that comparing the capabilities of Transformer and Mamba is an interesting line of research, our work focuses on deriving and thoroughly evaluating our proposed explanation approach for Mamba models. In this, we follow the standard approach to evaluate new explanation methods in terms of faithfulness [4, 6], which is inherently model-specific. Therefore, a comparison to transformer models is not included in the evaluation section of our paper. Instead, to show the versatility and robustness of our proposed method, we tested MambaLRP across various selective state space model architectures and sizes.

To compare the capabilities of Mamba and Transformer using their explanations, we have extended our long-range dependency experiments to include Llama-2 and Llama-3 Transformers (see the attached PDF for results). We describe our findings just below. Moreover, in our bias detection use case in Section 6 of our paper, we have already compared the performance of Mamba-130M and Mamba-1.4B to several Transformer models.

Comparing long-range dependencies of Mamba and Transformers

Thank you for this suggestion. We also found comparing Mamba and Transformer models in terms of their long-range capabilities very interesting. Thus, we investigated this further and did a direct comparison to Transformers for our long-range dependency use case. Many widely used Transformers, e.g. GPT-2, GPTNeoX, and Pythia, unfortunately do not allow inputs longer than 2048 tokens or do not generate sensible text from these long inputs. Instead, we use two state-of-the-art Transformers: Llama-2 and Llama-3, and extract attributions using the LRP rules of [3]. As in our Mamba experiment, we generate 10 additional tokens from the HotPotQA input and at each step explain the prediction of the generated token. Please find the results in Table A of the attached PDF.

For Llama-2, trained with a context length of 4096, the generated text becomes increasingly less sensible and repetitive for contexts longer than 4k, a limitation noted in [7,8]. When analyzing histogram distributions across models, it seems that Llama-2 uses information more uniformly across the entire context and identifies more relevant long-range dependent tokens compared to Llama-3 and Mamba. However, its output becomes non-sensible for lengths above 4K tokens, and the identified relevant tokens are typically non-semantic, such as the newline token <0x0A> and the beginning-of-sentence token <s> found at the start of the context paragraphs. For Llama-3 and Mamba, the attributions identify meaningful relevant tokens. When directly compared, Llama-3 uses information from more intermediate mid-range dependencies than Mamba, though both favor tokens close to the end of the input as relevant. Given Llama-3's much larger size (8B) compared to Mamba (130M) and their different training settings, this analysis supports that Mamba can effectively use long-range information. We also find that this ability is not exclusive to SSMs but can also be achieved by Transformers.

In this first step, MambaLRP allowed us to compare long-range context capabilities across models, and we hope our work facilitates future comparative studies. We will add a summary of this comparison to our final paper.

Regarding the reduced speed-up of MambaLRP compared to Gradient × Input, SmoothGrad, and Integrated Gradients.

We apologize for the confusion caused by a typo in the number. The correct runtime for MambaLRP listed in Table 11 is 0.03063. This means when using or not using fast CUDA kernels, the runtime of MambaLRP is comparable to GI.

Please let us know if you have further questions or comments. At this stage, we are still able to make revisions to the draft and actively participate in the discussion.

Comment

Thank you for the answers. The additional analysis provided during the rebuttal is helpful. Although the rebuttal response partially addresses my concerns (limited empirical analysis), it certainly improves the quality of the paper. I also believe that the proposed method would be useful to the community. I increase my rating from 4 to 6. Please add the full analysis between Transformers and Mamba (like Figures 6 and 7 in the paper) to the revised version.

Comment

Thank you for your feedback. We are glad to hear that you found our additional analyses helpful and that our proposed method is useful to the community. We are also pleased that you have raised your score. As promised, we will include the additional experiments in the paper.

Review
Rating: 6

The paper applies Layer-wise Relevance Propagation to Mamba layers. To maintain the relevance conservation law, the authors propose three fixes, to the SiLU activation, S6, and the multiplicative gating operators respectively, using the technique of gradient blocking. The proposed method significantly improves the faithfulness of LRP explanations for Mamba.

Strengths

  1. The paper is well written and easy to follow.
  2. The improvement of the faithfulness is significant compared to the baselines without the fixes proposed in the paper.

Weaknesses

  1. The contribution of the work seems incremental and the techniques used are well established. The main novelty here is applying the existing LRP techniques to the Mamba architecture.
  2. The explanations generated by MambaLRP are not surprising and do not bring any new insights into the behavior of architectures. It is unclear how these explanations can be used to improve the model with respect to alleviating gender bias and retrieving long-range information.

Questions

Can we leverage the explanations generated by MambaLRP to improve the model's behavior on the downstream gender-debiasing and long-context retrieval tasks?

Limitations

No. The authors should discuss the limitations of the usefulness of their method.

Author Response

Thank you for your review and useful comments. We will address them in our response below.

Regarding the main novelty of applying LRP to the Mamba

As new model architectures, such as SSMs and Mamba models, are developed, the field of XAI is challenged to develop faithful attribution methods to explain their predictions, especially given that the naive applications of existing XAI methods often fail in this regard. As LRP is not a model-agnostic framework, new LRP rules must be derived as new classes of deep neural networks emerge, since these models may include new layers or components for which no LRP rules exist. LRP was initially introduced in [9] to explain predictions for kernel-based classifiers and for multilayered neural networks. Over the years, it has been extended to accommodate new architectures. For instance, as the existing LRP rules were insufficient for explaining models such as Transformers, [3,5] introduced novel LRP rules specifically for Transformer models. Similarly, with the advent of Graph Neural Networks (GNNs), [10] proposed GNN-LRP to explain their predictions.

We would like to emphasize that for the class of selective state space sequence models (i.e., Mamba models), there exist no propagation-based techniques such as LRP that effectively address the Mamba architecture. A naive application of existing LRP techniques, which are not designed for Mamba, leads to far inferior and unreliable feature attributions as shown in Table 1. To overcome this shortcoming, we have (1) identified the unreliable model components and (2) derived novel propagation rules for them, which form MambaLRP. Our novel LRP rules are grounded in our theoretical analysis of the relevance propagation process through different layers/components within Mamba, which identified those that violate the conservation property. Consequently, we proposed novel LRP rules for the SSM components, SiLU activation functions, and multiplicative gates. The explicit LRP rules that we have proposed for Mamba models can be found in Appendix B. Given that ease of implementation is a key characteristic of XAI methods, we have further contributed by proposing straightforward implementation strategies that bypass the need for the implementation of complex LRP rules in Section 4.

Regarding MambaLRP explanations not being surprising, without new insights into the behavior of architectures.

In the field of XAI, one line of research focuses on developing reliable explanation methods for DNNs [3,5,9,10,13,14] and another complementary line of research on using these methods for in-depth model analysis [12,15] and insight [16]. XAI applications, such as model debugging, identifying biases, examining fairness, and assessing capabilities like long-range dependencies, rely on high-quality explanations. Thus, our study specifically targets the challenge of generating high-quality explanations for the novel class of selective state space sequence models, as their highly non-linear and complex architectures make explaining them a significant challenge.

As we have proposed a novel explanation algorithm for Mamba, our main experiments are focused on analyzing the faithfulness of the explanations generated by MambaLRP against those produced by other methods, shown in Table 1. We demonstrate that our method provides more faithful explanations compared to existing alternatives while being computationally more efficient (see Appendix C.8). We observed that this approach does allow us to identify unexpected model behaviors, as illustrated in our image classification experiment in Figure 5, where the prediction of the class 'carton' is influenced by the presence of a watermark on the image. To bring new insights into the model's behavior, we used MambaLRP to uncover gender bias, investigate long-range capabilities of Mamba, and propose an explanation-aware measure for the Needle-in-a-haystack test, shown in Section 6.

Regarding leveraging MambaLRP explanations for model improvement

Please note that our use cases (Section 6) focus on model diagnosis rather than model improvement. Identifying a model's weaknesses is the essential first step before any improvements can be made. For example, as illustrated in Figure 5, MambaLRP can be employed to analyze the model's sensitivity to Clever-Hans features, such as watermarks in Chinese. Additionally, it can be used to detect gender biases, as discussed in Section 6. In the case of gender debiasing, as an example, if ground-truth explanations are available, the model can be fine-tuned to align the generated explanations with the ground-truth as done in [11,15] for other XAI approaches and bias types. Regarding long-context retrieval tasks, the needle-in-a-haystack test is designed to evaluate such capabilities in LLMs. In Section 6, we introduced a new metric based on the explanations generated by MambaLRP to measure the model's retrieval performance in an explanation-aware manner. This means that instead of solely checking retrieval accuracy, one can also assess if the model retrieves the right information for the right reasons. For more detailed and interesting results, please refer to Figure 13 in Appendix C.7. In that figure, we have shown a case where the model retrieved the correct information based on incorrect evidence, which MambaLRP was successfully able to detect. This failure case could not be identified by the retrieval accuracy metric typically used in the needle-in-a-haystack experiment, highlighting the value of MambaLRP in providing practical insights. Leveraging XAI methods for improving safety aspects of models is a very active line of research and crucially requires reliable explanation methods. With MambaLRP, we have proposed a faithful and robust explanation method for this purpose.

Comment

Dear Reviewer,

I would appreciate if you could comment on the author's rebuttal, in light of the upcoming deadline.

Thank you, Your AC

Comment

Thanks for the clarification. The answers have addressed my questions and I have raised my score from 5 to 6.

Comment

Thank you for your comment. We are glad that your questions were addressed and that you raised your score.

Author Response

We thank the reviewers for their comments and valuable feedback. We responded to the comments and made the following changes to our submission. In particular:

  • We extended our faithfulness evaluation in Table 1 (main paper) to the larger Mamba-2.8B model trained on SST-2, confirming MambaLRP consistently outperforms all other approaches. To our knowledge, this is currently the most extensive evaluation of XAI methods for Mamba in the literature.

  • Reviewer 4ExW and Reviewer dNt6 expressed interest in a direct comparison to Transformers to investigate long-range dependencies. We performed this experiment for Llama-2 and Llama-3 using LRP for Transformers. Our analysis reveals that Mamba effectively uses long-range information, matched by the more sophisticated and larger Transformer model (Llama-3), and notably not Llama-2. Due to space constraints, details for this experiment are given in response to Reviewer 4ExW.

Clarification regarding the novelty and contributions of MambaLRP

As new classes of deep neural networks are developed, such as selective state space sequence models, XAI needs to keep up with these model innovations by providing reliable explanations. We contribute to this goal via:

  • A novel, thorough analysis of the relevance propagation through Mamba components, guided by the conservation axiom (see Section 4 and Appendix A).
  • Proposal of new LRP rules to mitigate violations of conservation, leading to state-of-the-art explanations for Mamba (see Appendix B).
  • Efficient implementation of these proposed rules (see Section 4).
  • Thorough evaluation of attribution methods for Mamba (see Section 5).
  • Insightful and practical use cases, addressing aspects of AI safety and transparency (see Section 6).

While we focused on Mamba, the proposed propagation rules are generally applicable to other models using these components, e.g. multiplicative gates in the recent models HGRN [18], RWKV [19] and MEGA [20]. We have now summarized our contributions and methodological novelty more clearly in Sections 1, 2 and 4.

We also appreciate the positive remarks, such as:

  • The improvement of the faithfulness is significant compared to the baselines. (Reviewer wgXY)
  • The introduced LRP framework is faster and provides more faithful explanations than other explanation methods. (Reviewer 4ExW)
  • The experiments show interesting interpretations of Mamba specifically for gender bias, long-context capabilities, and needle-in-a-haystack test in Mamba. (Reviewer 4ExW)
  • Explainability of Mamba models is a very important topic. The proposed adaptation to Mamba models is new and an excellent application of this methodology. (Reviewer mtfa)
  • This work has a significant opportunity for impact as the need to explain the decision-making processes behind Mamba models grows. (Reviewer mtfa)
  • LRP for Mamba is an important direction that facilitates the community in improving and understanding these models. (Reviewer dNt6)
  • The method is applicable and allows the authors to provide insightful analyses about Mamba models. (Reviewer dNt6)
  • The paper is well written and easy to follow. (Reviewer wgXY)

In summary, additional experiments prompted during rebuttal (1) confirmed the robustness of our evaluation, (2) provided detailed descriptions addressing soundness and novel contributions of MambaLRP, and (3) extended our use cases with additional comparisons, demonstrating the reliability and practical benefits of MambaLRP to the community.

We responded to each reviewer's initial comments below and hope our answers appropriately address all their questions and concerns. We are happy to answer any remaining questions during the discussion phase.

To avoid redundancies, we added references from the individual rebuttals below.

Rebuttal References

[1] Ali et al. The hidden attention of mamba models. arXiv, 2024.

[2] Zimerman et al. A Unified Implicit Attention Formulation for Gated-Linear Recurrent Sequence Models. arXiv, 2024.

[3] Ali et al. XAI for transformers: Better explanations through conservative propagation. ICML, 2022.

[4] Blücher, et al. Decoupling Pixel Flipping and Occlusion Strategy for Consistent XAI Benchmarks. TMLR, 2024.

[5] Achtibat et al. AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers. ICML, 2024.

[6] Samek et al. Evaluating the visualization of what a Deep Neural Network has learned. IEEE TNNLS, 2017.

[7] Chen et al. Clex: Continuous length extrapolation for large language models. ICLR, 2024.

[8] Huang et al. Training-free long-context scaling of large language models. ICML, 2024.

[9] Bach et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 2015.

[10] Schnake et al. Higher-Order Explanations of Graph Neural Networks via Relevant Walks. PAMI, 2022.

[11] Anders et al. Finding and removing clever hans. Inf Fusion, 2022.

[12] Grabowicz et al. Marrying Fairness and Explainability in Supervised Learning. ACM Conf on Fairness, Accountability, and Transparency, 2022.

[13] Samek et al. Explaining deep neural networks and beyond: A review of methods and applications. Proc of IEEE, 2021.

[14] Arrieta et al. Explainable Artificial Intelligence: Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion, 2020.

[15] Ali et al. Explainable Artificial Intelligence: What we know and what is left to attain Trustworthy Artificial Intelligence. Inf Fusion, 2023.

[16] Roscher et al. Explainable machine learning for scientific insights and discoveries. IEEE Access, 2020.

[17] Letzgus et al. Toward explainable artificial intelligence for regression models: A methodological perspective. IEEE Signal Proc, 2022.

[18] Qin et al. Hierarchically gated recurrent neural network for sequence modeling. NeurIPS, 2023.

[19] Peng et al. RWKV: Reinventing RNNs for the Transformer Era. EMNLP, 2023.

[20] Ma et al. Mega: Moving average equipped gated attention. ICLR, 2023.

Final Decision

After a fruitful discussion, the reviewers agree that the paper is well written and clear (especially when improved with the details provided in the rebuttal) and that the proposed approach (MambaLRP) is a significant contribution to understanding an increasingly popular class of models. The proposed approach provides significant faithfulness improvements, which are very welcome, as showcased by the extensive experiments. I strongly encourage the authors to release the source code, as promised, with the final version of the paper, as it is essential to enable reproducibility.