PaperHub
Rating: 5.8 / 10 (withdrawn; 4 reviewers; min 3, max 8, std. dev. 1.8)
Individual ratings: 3, 8, 6, 6
Confidence: 4.8
Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Efficient Visual Transformer by Information Bottleneck Inspired Token Merging

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-01-23
TL;DR

We propose Information Bottleneck inspired Token Merging (IBTM), which performs token merging in a learnable manner inspired by the information bottleneck principle and renders efficient vision transformers with competitive performance.

Abstract

Keywords
Visual Transformer, Token Merging, Information Bottleneck

Reviews and Discussion

Official Review
Rating: 3

This paper proposes an innovative Information Bottleneck inspired Token Merging (IBTM) block for ViTs. IBTM block generates a mask for merging tokens by minimizing the difference between the mutual information of the merged tokens with the original tokens and the corresponding class label. With the IBTM block, the merged tokens can better simulate the behaviour of the original tokens during inference. The proposed approach is adapted to four different pre-trained ViT backbones. Comparisons against three token merging methods demonstrate the effectiveness and generalizability of the IBTM block.

Strengths

  1. This paper has compared state-of-the-art token merging methods on various backbones.

  2. The proposed token merging strategy based on the variational formulation is interesting.

Weaknesses

  1. Insufficient and unclear motivation: In the Introduction section, the authors keep arguing that existing token merging methods [1, 2, 3] significantly sacrifice prediction accuracy and claiming their goal is to merge tokens without largely sacrificing performance. However, the reasons why existing token merging methods do not work well and potentially lead to significant performance drops are not clearly explained or justified. As a result, the knowledge gap that IBTM intends to narrow and the drawbacks of existing methods that IBTM is designed to resolve are fairly unclear. In short, the motivation that simply “method A does not work well so we propose a better method B” is weak. The weak motivation essentially affects this manuscript, lowering its academic contribution.

  2. Imprecise argument: In the Introduction section, the authors claim that "existing token merging methods largely sacrifice the prediction accuracy of the original transformer networks for reduced computation costs." However, this assertion is misleading, as with an appropriate token merging ratio, these methods do not experience significant accuracy loss. For instance, ToMe [1] achieves a 20% speed-up on ViT-B/AugReg with only a 0.4% accuracy drop, which cannot be described as "largely sacrificing prediction accuracy."

  3. Disorganized structure and weak readability of the Introduction section: Following Weaknesses 1 and 2, the whole Introduction section is poorly structured. It is overly repetitive and includes unnecessary content, which detracts from the clarity of the main argument. For instance, Lines 67-96 contain repetitive statements and too many details for the Introduction, which should be summarized for readability. And there is no need for a separate subsection 1.1 on the contributions.

  4. Unfair comparisons and potential over-claiming: The authors neglect the finetuning-free nature of some existing token merging methods [1,2] and compare their finetuned IBTM results to the off-the-shelf (i.e., without finetuning) performance of these methods. Such comparisons are not fair and do not accurately reflect the proposed IBTM’s advantages, which also lead to potential over-claiming of the superiority of IBTM.

  5. Lack of explanation on applying token merging to hierarchical ViTs: The paper fails to clarify how existing token merging methods, like ToMe [1] and ToFu [2], are adapted for hierarchical ViTs, such as Swin-Transformer [4]. In fact, most token merging (and token pruning) methods are designed for plain-structured ViTs, and applying them to hierarchical models can conflict with window-based self-attention and downsampling layers. In Table 1, where the authors report the performance of these methods on Swin Transformer and other hierarchical ViTs [5], it is necessary to explain how they adapted the existing token merging methods and their proposed IBTM for these hierarchical ViTs.

  6. Missing technical details: In the Formulation section, it is unclear whether the token merging mask is dynamically generated w.r.t. different inputs or fixed for all the inputs. If the mask is dynamically generated, based on Equations 3 and 4, I do not think this method would be faster than ToMe. If the mask is fixed after finetuning, it is necessary to analyze the mask pattern, which may reflect some hidden token relationships and bring deeper insights.

  7. Performance issues: As shown in Table 7 in Appendix D.2, the proposed IBTM also suffers from a noticeable performance drop when the compression ratio decreases, which undermines its claimed advantages.

  8. Writing and presentation issues:

    8.1. Inconsistent and confusing terminology: In Lines 15-16, IBTM is introduced as an abbreviation for "Information Bottleneck inspired Token Merging," while in Lines 52-53, it is referred to as "Transformer with Learning Token Merging," causing confusion. In Line 1059, the proposed method is again referred to as LTM.

    8.2. Grammar errors: There are quite a few grammatical errors. For instance, in Line 63, the word "less" should be "fewer". I suggest the authors carefully proofread this manuscript.

    8.3. Reference formatting errors: In Line 175, the reference format is incorrect at the start of the sentence. In Line 408, ToFu (Kim et al., 2024) is cited twice unnecessarily.

    8.4. Line 47 cites Attention Augmented Convolutional Networks [6], which was published earlier than ViT and seems irrelevant to the corresponding claim. In addition, some recent token merging-based methods [7,8] should be cited, which provide necessary insights in this direction.

[1] Bolya, Daniel, et al. "Token merging: Your vit but faster." ICLR, 2023.

[2] Kim, Minchul, et al. "Token fusion: Bridging the gap between token pruning and token merging." WACV, 2024.

[3] Bonnaerens, Maxim and Dambre, Joni. "Learned thresholds token merging and pruning for vision transformers." TMLR, 2023.

[4] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." ICCV, 2021.

[5] Cai, Han, et al. "Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction." ICCV, 2023.

[6] Bello, Irwan, et al. "Attention augmented convolutional networks." ICCV, 2019.

[7] Xu, Xuwei, et al. "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation." WACV, 2024.

[8] Wei, Siyuan, et al. "Joint token pruning and squeezing towards more aggressive compression of vision transformers." CVPR, 2023.

Questions

  1. According to Weakness 4, could the author please provide IBTM's performance without finetuning on the four ViT backbones for a fair comparison against ToMe and ToFu?

  2. According to Weakness 5, could the author please provide details on adapting existing token merging methods (e.g., ToMe) to hierarchical ViTs (e.g., Swin Transformer), including but not limited to how to perform the downsampling layer, how to split the windows after reduction, how many tokens are there in each layer?

  3. According to Weakness 6, could the author please explain how the token merging mask is generated for various inputs?

  4. Could the author please compare their model with token pruning methods, such as EViT [1] and ATS [2], under the same experimental environment? EViT and ATS are strong baseline models, which achieve competitive trade-offs between accuracy and efficiency compared to the token merging methods in various standard experiments [3].

  5. Could the authors please explain why choosing efficient ViTs (i.e., FLOPs < 1G) as the backbone for experiments rather than large ViTs? Token reduction technologies (including token pruning and token merging) are initially designed to resolve the massive computational complexity problem of ViTs. They should target expediting large ViTs rather than efficient ViTs. I do not find any significant contribution by reducing the FLOPs of EfficientViT from 0.52G to 0.44G after token merging in terms of real-world application.

[1] Liang, Youwei, et al. "Not all patches are what you need: Expediting vision transformers via token reorganizations." ICLR, 2022.

[2] Fayyaz, Mohsen, et al. "Adaptive token sampling for efficient vision transformers." ECCV, 2022.

[3] Haurum, Joakim Bruslund, et al. "Which tokens to use? investigating token reduction in vision transformers." ICCV, 2023.

Comment

Thank you for your review. We appreciate the suggestions, and the raised issues are addressed below.

Responses to the Weaknesses

1. “Insufficient and unclear motivation...”

We respectfully point out that this review missed the important insight and motivation that have been acknowledged by other reviewers such as Reviewer m6sM. Our key insight, which motivates token merging with the IB principle, is as follows. As evidenced in Table 2 and Table 5 in the paper, existing token merging methods, such as ToMe [1] and LTMP [3], already reduce the IB loss of the base models, since they remove redundant tokens carrying less relevant information for classification; this adheres to the IB principle, which aims at learning representations more correlated with class labels while decreasing their correlation with the inputs. By explicitly reducing the IB loss in the token merging process, IBTM-Transformers enjoy a lower IB loss compared to existing token merging methods and show better performance with even lower computational costs.
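For context, the generic Information Bottleneck objective referred to throughout this discussion can be written as follows (standard IB notation; the paper's exact IB loss and its variational upper bound may be parameterized differently):

$$
\mathcal{L}_{\mathrm{IB}} \;=\; I(X;\tilde{X}) \;-\; \beta\, I(\tilde{X};Y),
$$

where $X$ denotes the input features, $\tilde{X}$ the (merged) token representation, $Y$ the class label, and $\beta > 0$ the trade-off weight; a lower $\mathcal{L}_{\mathrm{IB}}$ means the representation discards more input-specific redundancy while remaining predictive of the label.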

2. “Imprecise argument...”

We respectfully point out that there is a common-sense mistake in this review and a misunderstanding of the claim that "existing token merging methods largely sacrifice the prediction accuracy of the original transformer networks for reduced computation costs." As a common practice in the token merging and general model compression literature, we cannot look at the accuracy drop only at a low compression rate, such as the 20% speed-up on ViT-B/AugReg with only a 0.4% accuracy drop mentioned by this reviewer. Instead, we need to examine the performance of the compressed models across various compression rates. Our claim should be understood from a comparative perspective, which is factually evidenced by the results in Table 1, where IBTM outperforms ToMe with comparable FLOPs/inference time. In addition, Table 8 of the revised paper shows that IBTM, with a 20% speed-up on ViT-B, still outperforms the vanilla ViT-B without compression.

In the revised paper, we clearly indicate that the IBTM causes less drop in prediction accuracy compared to the state-of-the-art token merging methods after this claim. The revised argument reflects this comparative aspect more accurately, acknowledging that while methods like ToMe can achieve efficiency gains with minor accuracy drops, IBTM demonstrates a better trade-off between the computational efficiency and the model performance, benefiting from the IB principle.

3. “Disorganized structure and weak readability of the Introduction section...”

We remark that all the other reviewers appreciate the organization of the introduction section. On the other hand, we respect the opinion of this reviewer and have revised the Introduction in the revised paper, focusing on eliminating redundancy and emphasizing the key points succinctly. In addition, we integrated the contributions directly into the main text of the Introduction, eliminating the need for a separate subsection.

4. “Unfair comparisons and potential over-claiming...”

We respectfully point out that there is a factual mistake in this review about the unfair comparison and potential over-claiming. As a standard practice in the token merging literature where the token merging models have learnable parameters, these learnable parameters need to be trained in a fine-tuning process. For example, LTMP [3], a token merging method with learnable parameters, loads the pre-trained vision transformers and fine-tunes the additional learnable parameters in the token merging modules. The underlying reason is that these learnable parameters are initialized randomly, so we can only obtain subpar performance without training these parameters. Moreover, it is important to note that the token merging methods without fine-tuning, such as ToMe and ToFu, have a computational step with token matching at all the layers to ensure similar original tokens are merged so that these models still need a computational process before performing token merging.

IBTM requires fine-tuning since it incorporates randomly initialized parameters in the token merging modules that require training. We follow the standard settings of the fine-tuning based token merging in LTMP [3] for our experiments. As evidenced in Table 1, IBTM outperforms the LTMP after fine-tuning for different numbers of epochs. In addition, it is worthwhile to mention that although ToMe [1] and ToFu [2] are free of fine-tuning, they require additional computation costs in performing the token matching at different layers.

Comment

5. “Lack of explanation on applying token merging to hierarchical ViTs...”

For the experiments with Swin Transformers, the token merging is performed between the multi-head self-attention modules with either regular or shifted windows and the MLP layers. As a result, the input to the MLP layers has fewer tokens, thereby decreasing the computational costs in the MLP layers. Following the MLP layers, the merged tokens are padded with the same features to restore the original number of tokens. This ensures compatibility with the hierarchical structure of the Swin Transformer, maintaining alignment with subsequent operations such as patch merging and window-based self-attention. This design allows the token merging process to integrate seamlessly with the Swin Transformer architecture while preserving its structural and computational efficiency.
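To make the placement concrete, here is a minimal sketch (our illustration, not the authors' released code) of a Swin-style block with a merging module between the window attention and the MLP; `window_attn`, `mlp`, and `token_merge` are hypothetical stand-ins, and `token_merge` is assumed to return an index mapping each original token position to its merged token.

```python
import torch
import torch.nn as nn

class MergedSwinBlock(nn.Module):
    """Sketch: a Swin-style block with token merging between window attention and the MLP.

    `window_attn`, `mlp`, and `token_merge` are hypothetical stand-ins. `token_merge` is
    assumed to map (B, N, D) features to (B, P, D) merged tokens plus an index tensor
    `assign` of shape (B, N) recording which merged token each original position went to.
    """

    def __init__(self, dim, window_attn, mlp, token_merge):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.window_attn = window_attn      # regular or shifted-window self-attention
        self.mlp = mlp
        self.token_merge = token_merge      # merging module inserted after attention

    def forward(self, x):                   # x: (B, N, D), N = H * W
        x = x + self.window_attn(self.norm1(x))
        merged, assign = self.token_merge(x)                  # (B, P, D), (B, N)
        merged = merged + self.mlp(self.norm2(merged))        # MLP cost now scales with P < N
        # Pad back to N tokens: copy each merged token to every position merged into it,
        # so the output keeps shape (B, N, D) and the next patch merging / window attention
        # sees tokens in their original spatial arrangement.
        idx = assign.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return torch.gather(merged, 1, idx)
```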

6. “Missing technical details...”

The token merging masks are dynamically generated w.r.t. different inputs following Equation (3) and Equation (4) in Section 3.2 of the paper. As shown in Proposition 3.2, the update formulation of the token merging mask is based on the features at the corresponding layers. Although ToMe is faster than IBTM at the same token compression ratio, models compressed by IBTM exhibit higher classification accuracy than ToMe even under a more aggressive compression setting. For example, we set the compression ratio of the IBTM models to 0.7 and the compression ratio of all the competing token merging methods to 0.75 for the experiments in Table 1. It is observed that our IBTM models achieve higher top-1 accuracy with even fewer FLOPs and faster inference speed compared to the current state-of-the-art token merging methods.
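As a rough illustration of what "dynamically generated w.r.t. different inputs" means, the sketch below shows an input-dependent soft mask in the spirit of LTMP-style importance scoring; it does not reproduce the actual update rule of Equations (3)-(4) or Proposition 3.2, which additionally involves the IB-derived term.

```python
import torch
import torch.nn as nn

class DynamicMergingMask(nn.Module):
    """Illustration of an input-dependent token mask (LTMP-style importance scoring).

    This is NOT the exact rule of Equations (3)-(4) / Proposition 3.2 in the paper; it only
    shows that the mask is a function of the current layer's features and therefore changes
    with each input, while its parameters are trained during fine-tuning.
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # learnable parameters

    def forward(self, x):                       # x: (B, N, D), features at the current layer
        return torch.sigmoid(self.score(x)).squeeze(-1)   # (B, N) soft importance per token
```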

7. “Performance issues...”

The performance drop with increasing compression is inevitable for token merging. The compression ratio directly affects the number of retained tokens, which has a fundamental impact on the model's representation power. Reducing the number of tokens too aggressively inevitably removes informative content, which leads to a decrease in prediction accuracy. However, the proposed IBTM aims to strike a balance by optimizing the merging process based on the Information Bottleneck principle, ensuring that the merged tokens retain maximum relevant information. This allows IBTM to maintain a competitive performance level, even as we push towards higher compression ratios, outperforming other existing methods in the same context. The aforementioned implementation details have been added to Section E.1 in the appendix of the revised paper.

As evidenced in Table 1 in the paper, IBTM consistently outperforms existing state-of-the-art token merging methods, including ToMe [1], ToFu [2], and LTMP [3]. Although IBTM suffers from performance drops at more aggressive compression, the drop is significantly less severe than for existing token merging methods. As shown in Table 7, IBTM-ViT-B with a compression ratio of 0.65 still achieves the same performance as the original ViT-B model.

8. “Writing and presentation issues...”

Thanks for pointing out the errors and typos. We have fixed them in the revised paper. In addition, the two recent works [7, 8] related to token merging are discussed in the related works of the revised paper.

Responses to the Questions

1. “According to Weakness 4...IBTM's performance without finetuning...”

Please refer to the response to Weakness 4.

2. “According to Weakness 5...details on adapting existing token merging methods (e.g., ToMe) to hierarchical ViTs (e.g., Swin Transformer)...”

Please refer to the response to Weakness 5.

3. “According to Weakness 6...how is the token merging mask generated for various inputs?”

Please refer to the response to Weakness 6.

Comment

4. "...compare their model with token pruning methods, such as EViT [9] and ATS [10]..."

We compare IBTM with the token pruning methods EViT [9] and ATS [10] as suggested by the reviewer. For a fair comparison, we apply EViT and ATS to the same pre-trained ViT-B backbone used in Table 1 of the paper and apply the same data augmentation strategies as reported in Section 4.1. As both EViT and ATS require fine-tuning, we follow the settings in Section 4.1.1 and fine-tune the models compressed with EViT and ATS, namely EViT-ViT-B and ATS-ViT-B, for 1, 5, 10, 25, and 50 epochs, and compare them with IBTM models fine-tuned for the same numbers of epochs. The results are shown in the table below. It is observed that, with an even faster inference speed, IBTM-ViT-B achieves higher top-1 accuracy on ImageNet compared to the state-of-the-art token pruning methods. For example, IBTM-ViT-B outperforms EViT-ViT-B by 0.37% in top-1 accuracy after fine-tuning for 50 epochs, demonstrating the superiority of IBTM in compressing vision transformers.

| Methods | Inference Time (ms/batch) | Top-1 Acc. (%) Epoch 0 | Epoch 1 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | 37.2 | 83.74 | - | - | - | - | - |
| EViT-ViT-B | 31.9 | - | 82.22 | 82.54 | 83.15 | 83.49 | 83.62 |
| ATS-ViT-B | 31.2 | - | 82.49 | 82.85 | 83.05 | 83.32 | 83.38 |
| IBTM-ViT-B (Fine-tuned) | 30.7 | - | 83.35 | 83.57 | 83.76 | 83.91 | 83.96 |

5. "...explain why choosing efficient ViTs (i.e., FLOPs < 1G) as the backbone for experiments..."

In our experiments, we evaluate the effectiveness of IBTM on both large ViTs, such as ViT-B and Swin-B, and efficient ViTs, such as MobileViT and EfficientViT. By applying IBTM to vision transformers of different sizes, we aim to demonstrate its applicability and scalability across different computational scales and architectural frameworks. The focus on efficient ViTs is particularly relevant for real-world applications that require a balance between accuracy and computational efficiency, such as mobile devices and embedded systems, where even modest reductions in computational load have significant implications. Although efficient vision transformers already exhibit low computational costs, further compressing these models is still critical in such resource-constrained environments.

References

[1] Bolya, Daniel, et al. "Token merging: Your vit but faster." ICLR, 2023.

[2] Kim, Minchul, et al. "Token fusion: Bridging the gap between token pruning and token merging." WACV, 2024.

[3] Bonnaerens, Maxim and Dambre, Joni. "Learned thresholds token merging and pruning for vision transformers." TMLR, 2023.

[4] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." ICCV, 2021.

[5] Cai, Han, et al. "Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction." ICCV, 2023.

[6] Bello, Irwan, et al. "Attention augmented convolutional networks." ICCV, 2019.

[7] Xu, Xuwei, et al. "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation." WACV, 2024.

[8] Wei, Siyuan, et al. "Joint token pruning and squeezing towards more aggressive compression of vision transformers." CVPR, 2023.

[9] Liang, Youwei, et al. "Not all patches are what you need: Expediting vision transformers via token reorganizations." ICLR, 2022.

[10] Fayyaz, Mohsen, et al. "Adaptive token sampling for efficient vision transformers." ECCV, 2022.

Comment

Dear Reviewer b1MG,

This is a gentle reminder regarding your feedback. All the concerns in your original review have been addressed. Since today is the last day for your feedback, we really look forward to hearing from you, and we will clarify any further doubts or concerns if there are any. Thank you for your time!

Best Regards,

Authors

Comment

Dear Authors,

I have thoroughly reviewed your responses and the revised manuscript. I appreciate the effort put into addressing my concerns, including but not limited to:

  • Providing experiments on large ViT models.

  • Providing experiments of ATS and EViT.

  • Providing explanations to my concerns on the technical details and motivations.

  • Revising the manuscript.

However, I still have some further concerns:

1. Regarding Weakness 1:

My concern was not about the significance of incorporating the Information Bottleneck (IB) principle into token merging methods. Instead, I was pointing out that the primary goal and motivation stated in the Introduction (Lines 054–058) are weak. The stated goal, "to merge the output tokens of all the transformer blocks into fewer tokens without largely sacrificing the prediction accuracy of the original vision transformer," is broadly applicable to all token merging methods and does not highlight the unique value of your work.

While you claim strong motivation for token merging with the IB principle, it is essential to explain why existing methods perform poorly in terms of the IB principle. Without this context in the Introduction, the rationale for studying IB is unclear. This critical point is not adequately addressed in either the original or revised manuscript.

2. Regarding Weakness 4:

I fully understand that certain token merging (or pruning) methods require training before being effective, such as DynamicViT. However, my concern was about fairness in your comparisons. Specifically, I was asking for results where ToMe and ToFu are fine-tuned for the same number of epochs as IBTM. Without this, the comparison appears biased. While it is reasonable to claim that IBTM achieves better performance with fewer fine-tuning epochs, claiming superiority with more fine-tuning epochs does not convincingly reflect the advantages of your approach.

3. Regarding Weakness 5:

Your explanation is still unclear to me.

3.1. You mentioned that "merged tokens are padded with the same features to restore the original number of tokens." However, what does the "same features" mean here? Are they the original unmerged tokens?

3.2. If I understand correctly, the padded tokens should not participate in the following computations. But this raises further questions: What happens if all tokens in a window are merged into only a few tokens? How are tokens placed back into the window? How does the shifted window mechanism function in this scenario?

For instance, if $N = H \times W$ tokens are merged into $P = H' \times W'$ tokens as mentioned in the manuscript, how are they organized in the hierarchical ViTs? Are there spatial relationships among the $P$ tokens? In the original hierarchical ViTs, the tokens divided into the same window are spatially related.

Comment

Thank you for your feedback, and below are our further responses to your remaining concerns.

(1) Further Response for Weakness 1

We will revise the stated goal, "to merge the output tokens of all the transformer blocks into fewer tokens without largely sacrificing the prediction accuracy of the original vision transformer," to "to merge the output tokens of all the transformer blocks into fewer tokens, without largely sacrificing the prediction accuracy of the original vision transformer, by adhering to the principled Information Bottleneck framework, which allows for an informative token merging process." In this way, the unique value of this work is highlighted in the revised goal.

...why existing methods perform poorly in terms of the IB principle. The existing token merging methods merge the original tokens into target tokens by the weighted average [1, 2, 3]. Since each target token is an aggregation of the original tokens and the number of target tokens is less than that of the original tokens, the token merging process can be viewed as an information compression process where the rich information in the original tokens is compressed into the limited target tokens. As a result, the target tokens are less correlated with the input training features compared to the original tokens, so the mutual information between the target tokens and the input training features is smaller than that between the original tokens and the input training features, leading to a smaller IB loss compared to the baseline model without token merging. As shown in Table 2 in Section 4.2 and Table 7 in Section E.1 of the appendix, the baseline token merging methods, ToMe and LTMP, can already reduce the IB loss compared to the baseline models.

However, in existing token merging methods such as ToMe and LTMP, this information compression process is not effective enough at preserving the rich information in the original tokens, because there is no information-theoretic measure ensuring that more informative tokens receive higher importance weights in the weighted-average merging, so that the target tokens retain as much of the original information as possible. This drawback is also quantitatively reflected by the IB loss: the IB loss of the existing token merging methods is not small enough. To this end, we propose IBTM, which employs a novel informative token merging process based on a principled information-theoretic measure, the Information Bottleneck, so that more informative and more important tokens contribute more to the merged tokens through larger importance weights in the weighted-average merging. IBTM achieves this goal quantitatively via a novel, informative token merging mask that further reduces the IB loss compared to existing token merging methods.

Comment

(2) Further Response for Weakness 4

“I was asking for results where ToMe and ToFu are fine-tuned for the same number of epochs as IBTM.”

In order to compare ToMe [1] and ToFu [2] with our IBTM in the fine-tuning setup, we construct two fine-tunable baseline models for each of ToMe and ToFu. For example, the two ToMe baselines are termed ToMe (backbone only) and ToMe (LTMP mask). ToMe (LTMP mask) employs the learnable sigmoid-generated mask from LTMP [3], and the final mask for token merging is the product of the binary mask of ToMe and the learnable mask of LTMP. The weights of the neural backbone and the weights for the sigmoid are trained when fine-tuning the ToMe (LTMP mask) model. In contrast, ToMe (backbone only) does not use the learnable mask from LTMP [3], and only the weights of the neural backbone are trained when fine-tuning it. Similarly, we design two baseline models for ToFu, namely ToFu (backbone only) and ToFu (LTMP mask). We remark that ToMe (LTMP mask) and ToFu (LTMP mask) have the same number of learnable parameters for token merging as our IBTM model when the same neural backbone (such as ViT-B) is used.

We have fine-tuned the two baseline models for both ToMe and ToFu for the same numbers of epochs as IBTM-ViT-B (1, 5, 10, 25, and 50 epochs), following the settings in Section 4.1.1 of our paper. The results are shown in the table below. It is observed that although fine-tuning improves the performance of the ToMe and ToFu models, IBTM-ViT-B still outperforms the ToMe (backbone only) and ToFu (backbone only) models. For example, IBTM-ViT-B outperforms ToFu (backbone only) by 0.46% in top-1 accuracy when fine-tuned for 50 epochs, with even faster inference speed. In addition, IBTM-ViT-B also outperforms ToMe (LTMP mask) and ToFu (LTMP mask). For instance, IBTM-ViT-B outperforms ToFu (LTMP mask) by 0.41% in top-1 accuracy when fine-tuned for 50 epochs, with even faster inference speed, demonstrating the superiority of IBTM in token merging under the fine-tuning setup.

All the baseline models for both ToMe and ToFu use the same neural backbone (ViT-B) as IBTM-ViT-B.

| Methods | Inference Time (ms/batch) | Top-1 Acc. (%) Epoch 0 | Epoch 1 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | 37.2 | 83.74 | - | - | - | - | - |
| ToMe (backbone only) | 31.0 | 82.86 | 82.90 | 83.10 | 83.35 | 83.44 | 83.48 |
| ToMe (LTMP mask) | 31.4 | - | 83.02 | 83.20 | 83.33 | 83.48 | 83.54 |
| ToFu (backbone only) | 31.5 | 83.22 | 83.24 | 83.30 | 83.40 | 83.48 | 83.50 |
| ToFu (LTMP mask) | 32.0 | - | 83.25 | 82.38 | 83.45 | 83.52 | 83.55 |
| IBTM-ViT-B (Fine-tuned) | 30.7 | - | 83.35 | 83.57 | 83.76 | 83.91 | 83.96 |

Comment

(3) Detailed process in token merging in hierarchical ViTs.

To answer all your questions, including the raised questions 3.1 and 3.2, we provide a detailed description about the token merging process in hierarchical ViTs as follows.

The token merging process in hierarchical ViTs. In the token merging process for hierarchical ViTs, the token merging modules are added between the attention modules and the MLP layers. Let the original tokens $X \in R^{N \times D}$ be the features before token merging, where $N = H \times W$ is the number of tokens. Let $\tilde X \in R^{P \times D}$ be the features after token merging, where $P$ is the number of merged tokens. We explain how to obtain the output features of size $N \times D$ from $\tilde X$ as follows.

$\tilde X$ will be fed into the MLP layers in the transformer block. Let $\tilde Z = \mathrm{MLP}(\tilde X) \in R^{P \times D}$ be the output features of the MLP layers. Before feeding $\tilde Z$ to the window attention of the next layer, we need to pad the features $\tilde Z$ so that the spatial size of $\tilde Z$ is still $N = H \times W$. Let $Z$ denote the padded version of $\tilde Z$. This padding is achieved by repeating each merged token at the positions of the original tokens that were merged into it. For example, if two original tokens, $X_j$ and $X_k$, are merged into a target token $\tilde X_i$ in the token merging process, both $Z_j$ and $Z_k$, co-located at the positions of $X_j$ and $X_k$ in the feature map, are filled by copying the target token $\tilde Z_i$. In this way, the features $\tilde Z$ are padded to the feature map $Z$ with a shape of $N \times D$, which can be reshaped to $H \times W \times D$. Therefore, the tokens in the padded features $Z$, as the input to the window attention in the next transformer block, have the same spatial relationship as the original tokens in $X$. In addition, the computational cost is reduced since the number of tokens processed in the MLP layers is reduced from $N$ to $P$.
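A tiny self-contained example (our construction, not the authors' code) of this merge-then-pad rule, with plain averaging standing in for the learned weighted merging:

```python
import torch

# Toy example with H = W = 2, so N = 4 tokens of dimension D = 3.
# Tokens 1 and 2 are merged into one target token.
D = 3
X = torch.randn(4, D)                                  # original tokens X_0 .. X_3
assign = torch.tensor([0, 1, 1, 2])                    # position -> merged-token index, P = 3
P = int(assign.max()) + 1

# Merge: average the tokens mapped to the same target (the paper uses learned weights instead).
X_tilde = torch.zeros(P, D).index_add_(0, assign, X)
X_tilde = X_tilde / torch.bincount(assign, minlength=P).unsqueeze(-1).float()

Z_tilde = X_tilde * 2.0                                # stand-in for MLP(X_tilde) on P < N tokens

# Pad: copy each merged token back to every position that was merged into it.
Z = Z_tilde[assign]                                    # (N, D); rows 1 and 2 are identical
Z = Z.reshape(2, 2, D)                                 # back to H x W x D for window partitioning
```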

To answer your question 3.1, the “same features” are the copied target tokens in the description titled The token merging process in hierarchical ViTs.

To answer your question 3.2, please refer to the token merging process in the same description above. Even though the tokens in a window are merged into only a few tokens, the input features to the next window attention will still have the same shape as in the vanilla Swin model due to the padding strategy. Therefore, the features $Z \in R^{N \times D}$ have the same shape as in the vanilla Swin model, and the tokens can be placed into the windows in the same manner as in the vanilla Swin model. Again, as mentioned in The token merging process in hierarchical ViTs, the tokens in the padded features $Z$, as the input to the window attention in the next transformer block, have the same spatial relationship as the original tokens that are the input to the current transformer block.

References

[1] Bolya, Daniel, et al. "Token merging: Your vit but faster." ICLR, 2023.

[2] Kim, Minchul, et al. "Token fusion: Bridging the gap between token pruning and token merging." WACV, 2024.

[3] Bonnaerens, Maxim and Dambre, Joni. "Learned thresholds token merging and pruning for vision transformers." TMLR, 2023.

Official Review
Rating: 8

The authors propose a novel and compact transformer block (IBTM) that implements token merging in a learnable manner. IBTM is compatible with various popular and compact transformer architectures, effectively reducing both FLOPs and inference time in vision transformers while maintaining or even enhancing prediction accuracy.

Strengths

  1. The authors' motivation is intuitive, and their theoretical analysis from the perspective of the Information Bottleneck framework adds solid support to their claims.

  2. The experiments are comprehensive (in particular, there are detection and segmentation results in the appendix) and the results indicate that the proposed approach achieves notable gains and advantages.

  3. The authors provide corresponding open-source code, enhancing the credibility and reproducibility of their results.

Weaknesses

  1. The related work on efficient vision transformers could be more comprehensive; I suggest including relevant studies such as [1].

  2. The description of the method could be more concise and clearer; for example, Figure 1 is a bit too brief in its presentation.

[1] Wu X, Zeng F, Wang X, et al. PPT: Token Pruning and Pooling for Efficient Vision Transformers. arXiv preprint arXiv:2310.01812, 2023.

Questions

I wonder if incorporating token pruning techniques could lead to better results?

Comment

We appreciate the review and the suggestions in this review. The raised issues are addressed below.

Responses to the Weaknesses

1. “The related work on efficient vision transformers could be more comprehensive...”

In Section 2.1 of the revised paper, we have conducted more comprehensive discussions on the related works on efficient vision transformers, including [1].

2. “The description of the method could be more concise and clearer...”

We appreciate the suggestion, and we promise to revise the method section to be more concise and clearer in the final revision.

Responses to the Questions

1. “I wonder if incorporating token pruning techniques could lead to better results?”

Token merging methods can be easily combined with token pruning methods. For instance, token pruning modules can be applied right after the token merging modules in transformers to further compress the vision transformers. We have performed a comparison between the token pruning methods and IBTM in Section B.4 of the appendix of the revised paper. We will further investigate the effectiveness of combining token pruning methods with IBTM.

References

[1] Wu X, Zeng F, Wang X, et al. PPT: Token Pruning and Pooling for Efficient Vision Transformers. arXiv preprint arXiv:2310.01812, 2023.

Comment

Dear authors,

Thanks for your response; I tend to keep my score.

Good luck!

Reviewer

Official Review
Rating: 6

This paper introduces a new token merging method on vision transformer, which leverages information bottleneck theory to learn the merging mask. A novel upper bound of IB is derived to optimize the mask module. Experiments of various ViT variants are conducted on ImageNet to validate the effectiveness.

Strengths

  1. This paper proposes a learnable way to merge tokens, which is inspired by information bottleneck and is theoretically feasible for achieving better compression-performance balance compared to previous training-free methods.

  2. The performance improvements are consistent and significant.

Weaknesses

  1. The major concern with this method is the requirement of training. According to Table 1, the performance does not saturate even after fine-tuning for 50 epochs, which raises concerns about the computation cost.

  2. Lack of experiments to validate the transferability. ViTs, as foundation models, are widely adopted in downstream tasks such as transfer learning, object detection, and visual-language understanding. As the token merging is trained on ImageNet, it is unknown whether this learned preference generalizes to other tasks and whether the method is better than other token merging methods on transfer tasks.

Questions

  1. What is the superiority of the proposed upper bound of IB objective compared to the existing upper bounds such as CLUB [1]?

[1] Cheng, Pengyu, et al. "Club: A contrastive log-ratio upper bound of mutual information." International conference on machine learning. PMLR, 2020.

Comment

We appreciate the review and the suggestions in this review. The raised issues are addressed below.

Responses to the Weaknesses

1. “The major concern of this method is the requirement of training...”

As a standard practice in the token merging literature where the token merging models have learnable parameters, these learnable parameters need to be trained in a fine-tuning process. For example, LTMP [1], a token merging method with learnable parameters, loads the pre-trained vision transformers and fine-tunes the additional learnable parameters in the token merging modules. The underlying reason is that these learnable parameters are initialized randomly, so we can only obtain subpar performance without training these parameters. Moreover, it is important to note that the token merging methods without fine-tuning, such as ToMe and ToFu, have a computational step with token matching at all the layers to ensure similar original tokens are merged so that these models still need a computational process before performing token merging.

2. “Lack of experiments to validate the transferability...”

We respectfully point out that we have adopted the IBTM models pre-trained on ImageNet as the feature backbone for other tasks. Our work focuses on performing token merging in vision transformers for computer vision tasks, and we have assessed the transferability of IBTM models on object detection and semantic segmentation in Section A.2 and Section A.3 of the appendix of our paper. We adopt the IBTM models pre-trained on ImageNet as the feature backbone for the object detection and semantic segmentation networks. It is observed in Table 4 and Table 5 that the IBTM models show consistent improvements over different vision tasks.

Responses to the Questions

3. “...compared to the existing upper bounds such as CLUB [1]?”

Although CLUB [2] also proposes an upper bound for the Information Bottleneck (IB), the derivation of the upper bound in CLUB assumes that $p(\tilde X \mid X)$ is known, where $\tilde X$ and $X$ are the random variables representing the learned feature and the input feature. When $p(\tilde X \mid X)$ is unknown, they adopt a Gaussian Mixture Model (GMM) parameterized by an external neural network to approximate the upper bound. In contrast, the variational upper bound for the IB derived in our paper does not require such an assumption. As shown in Lemma C.1 in Section C.2 of the appendix of our paper, $p(\tilde X \mid X)$ can be directly computed from the training data without the need to train another neural network.
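For reference, the CLUB estimator of Cheng et al. [2] upper-bounds the mutual information as (restated here in the notation of this response)

$$
I(X;\tilde{X}) \;\le\; \mathbb{E}_{p(X,\tilde{X})}\!\left[\log p(\tilde{X}\mid X)\right] \;-\; \mathbb{E}_{p(X)}\,\mathbb{E}_{p(\tilde{X})}\!\left[\log p(\tilde{X}\mid X)\right],
$$

and, when $p(\tilde{X}\mid X)$ is unknown, CLUB replaces it with a learned variational approximation, which is the assumption discussed above.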

References

[1] Bonnaerens, Maxim and Dambre, Joni. "Learned thresholds token merging and pruning for vision transformers." TMLR, 2023.

[2] Cheng, Pengyu, et al. "Club: A contrastive log-ratio upper bound of mutual information." International conference on machine learning. PMLR, 2020.

Comment

Dear Authors,

Thank you for your detailed response. Most of my concerns are addressed. I have carefully read the comments and responses of other reviewers, and decided to raise my confidence to 5 but keep my rating.

Reason for not giving a higher score: The method clearly has better performance than the previous method LTMP, which also requires fine-tuning, so I leaned towards acceptance of this paper considering the performance strength and technical novelty. However, the fine-tuning requires additional training, which raises concerns about scalability and generalizability, so I decided not to give a higher score.

Thanks,
Reviewer

Official Review
Rating: 6

This paper studied the problem of token merging/pruning in vision transformers. The authors took inspiration from the connection between token merging and information bottleneck (IB) and introduced a new strategy to leverage the IB principle to guide the token merging process. They derived a variational upper bound loss from the IB loss and proposed optimizing this objective to generate token merging masks. Experiments on ImageNet dataset with multiple tiny-/small-size vision transformers showed considerable improvement of the proposed method over prior works.

Strengths

  1. The paper is well-motivated. Connecting token merging with the IB principle is straightforward and makes a lot of sense. An algorithm guided by this principle is novel and might bring new insights into this research direction.

  2. The derivation of the variational upper bound for the IB loss looks technically sound. Empirical analysis also suggests that reducing this upper bound indeed results in lower IB losses for token merging.

  3. Experiments on the ImageNet dataset with multiple tiny-/small-size vision transformers, and a few experiments on MS-COCO object detection/instance segmentation, showed considerable and consistent improvements of the proposed method over prior works, sometimes by a large margin.

Weaknesses

  1. The proposed method, IBTM, requires either additional finetuning epochs to repurpose pretrained models for token merging or extra training time (~30%) to train the vision transformers and the token merging module from scratch.

  2. IBTM uses hard masks for token merging, which discard a certain amount of the tokens during inference. I wonder whether soft masks will also work for the same purpose. Additionally, the models with IBTM (thus fewer tokens) perform even better than the original models with more tokens, which is somewhat counter-intuitive. Are there any explanations for these behaviors?

Questions

  1. Table 1 suggests that using IBTM will not increase the number of parameters compared to the original models yet Tables 3 and 4 report different scenarios. What is the cause of this difference?

  2. Why is IBTM faster than LTMP in inference? These two methods should have the same parametrization scheme and inference pipeline from my understanding of this work.

Comment

We appreciate the review and the suggestions in this review. The raised issues are addressed below.

Responses to the Weaknesses

1. “...additional finetuning epochs...extra training time to train from scratch...”

As a standard practice in the token merging literature where the token merging models have learnable parameters, these learnable parameters need to be trained in a fine-tuning process. For example, LTMP [3], a token merging method with learnable parameters, loads the pre-trained vision transformers and fine-tunes the additional learnable parameters in the token merging modules. The underlying reason is that these learnable parameters are initialized randomly, so we can only obtain subpar performance without training these parameters. Moreover, it is important to note that the token merging methods without fine-tuning, such as ToMe and ToFu, have a computational step with token matching at all the layers to ensure similar original tokens are merged so that these models still need a computational process before performing token merging.

2. “...whether soft masks will also work for the same purpose...IBTM (thus fewer tokens) perform even better than the original models...”

IBTM first generates a binarized token mask $M^{(l)}$ for the $l$-th IBTM block. This mask is then elementwise multiplied by $\tilde G^{(l)}$, which is derived from the gradient of the variational upper bound of the IB loss. Since $\tilde G^{(l)}$ is a soft mask, the token merging mask used by the IBTM block is indeed a soft mask. Regarding the second point, it has been observed in the model compression literature [1, 2] that compressing parameter-heavy models, such as vision transformers, can even improve accuracy rather than compromising it, due to the reduced over-fitting of the compressed models. Moreover, features learned by the IBTM models exhibit lower IB loss compared to the vanilla models, as evidenced in Table 2 of the paper. A lower IB loss indicates a more effective representation of information, which could further explain the superior performance of the IBTM models.
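A minimal schematic (not the authors' code) of the elementwise combination described here; how $\tilde{G}^{(l)}$ is actually obtained from the gradient of the variational IB upper bound is specific to the paper and is only represented by a placeholder argument below.

```python
import torch

def ibtm_merging_mask(m_binary: torch.Tensor, g_soft: torch.Tensor) -> torch.Tensor:
    """Combine a binarized LTMP-style mask M^(l) with a soft weight G~^(l).

    m_binary: (B, N) tensor in {0, 1}, the binarized token mask.
    g_soft:   (B, N) tensor in [0, 1]; in the paper it comes from the gradient of the
              variational upper bound of the IB loss (placeholder here, not reproduced).
    The product of a binary mask and a soft mask is itself a soft mask.
    """
    return m_binary * g_soft
```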

Responses to the Questions

1. “Table 1 suggests that using IBTM will not increase the number of parameters compared to the original models yet Table 3 and 4 report different scenarios... ”

IBTM first generates a binarized token mask $M^{(l)}$ for the $l$-th IBTM block, employing the same module as used in LTMP [3]. This mask is then multiplied by $\tilde G^{(l)}$, which is derived from the gradient of the variational upper bound of the IB loss. Therefore, IBTM incorporates the same number of additional parameters as LTMP. Nevertheless, the increase in parameters is marginal and does not alter the first digit after the decimal point of the number of parameters. The parameter sizes reported in Tables 3 and 4 were incorrect due to typographical errors, and they have been corrected in the revised version of the paper.

2. “Why is IBTM faster than LTMP in inference...”

IBTM demonstrates faster inference compared to LTMP because we configure the token compression ratio of IBTM to be lower than that of LTMP to assess the performance of IBTM with lower FLOPs/inference time. For our experiments in Table 1, we set the compression ratio for IBTM models at 0.7 and for all other competing token merging methods at 0.75. It is observed that our IBTM models not only achieve superior top-1 accuracy but also do so with fewer FLOPs and enhanced inference speed, outperforming state-of-the-art token merging methods.

References

[1] Chen, Tianlong, et al. "Chasing sparsity in vision transformers: An end-to-end exploration." Advances in Neural Information Processing Systems 34 (2021): 19974-19988.

[2] Kong, Zhenglun, et al. "Spvit: Enabling faster vision transformers via latency-aware soft token pruning." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.

[3] Bonnaerens, Maxim and Dambre, Joni. "Learned thresholds token merging and pruning for vision transformers." TMLR, 2023.

Comment

Thank you for your response. After carefully reading the comments from other reviewers and the authors' responses to all reviewers, I think that my raised concerns are resolved, and I remain positive about the acceptance of this paper. Overall, this work has a clear motivation and presents an effective method for token merging. While it incurs extra training time and the methodology itself is not particularly novel/exciting, it shows decent results across multiple benchmarks compared to all the baselines.

Thus, I decided to maintain my original score.

Comment

Dear all reviewers,

Our reviewer-author discussion will end soon. Please check all the files and see if there is anything you'd like to discuss with the authors.

Best, Your AC

Comment

Dear AC and Reviewers,

Thank you for your time reviewing and handling this paper. We would like to let you know that we have addressed all the concerns in the reviews.

Regarding the following remaining concerns from Reviewer b1MG,

(1) highlighting the value of this work in the statement of the goal, and explaining why existing methods perform poorly in terms of the IB principle;

(2) comparing the fine-tuned ToMe and ToFu with our IBTM;

(3) detailed description about the token merging process by IBTM in hierarchical ViTs,

we have provided Further Responses to Reviewer b1MG (Parts 1-3), addressing the above three concerns respectively, in our response to Reviewer b1MG. We sincerely hope that Reviewer b1MG and you can review these further responses.

Thank you again for your time!

Best Regards,

The Authors

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.