PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; lowest 5, highest 7, std. dev. 0.7)
Individual ratings: 7, 6, 6, 5
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Accelerating Transformers with Spectrum-Preserving Token Merging

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

A new method for token merging in Transformer architectures.

Abstract

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior work has proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top $k$ similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop for ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop for CLIP on Flickr30k compared to 4.5% for others), and analogously in visual question answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve the intrinsic spectral properties of the original token space under mild conditions.
Keywords
token merging · vision transformer · model compression

Reviews and Discussion

Review (Rating: 7)

This paper proposes Protect Informative Tokens Merging (PIToMe), a token merging method that safeguards crucial information-bearing tokens prior to the merging step. PIToMe focuses on the token-splitting step before token merging. Specifically, PIToMe defines and calculates an energy score for tokens to be merged, and marks the highest scored tokens as suitable candidates for merging. Experiments show that PIToMe consistently outperforms previous token merging methods such as ToMe, DiffRate, ToFu, etc. Theoretical analysis reveals that PIToMe preserves the intrinsic spectral properties to the original token space.

Strengths

  1. As the first method to consider the graph relations of tokens during token merging, PIToMe is novel in its design. Thorough theoretical analysis demonstrates the design insights of PIToMe.
  2. Although some improvements are not very significant, PIToMe successfully outperforms the previous token merging methods. Considering how extensively token merging has already been investigated, I consider the improvements, albeit minor, as strengths.
  3. Visualizations demonstrate the design insights of PIToMe. From those demonstrations, I could see that PIToMe merges the correct tokens in the image.

Weaknesses

  1. ToMe is actually capable of conducting token merging at a very high compression rate (see Table 10 of the ToMe paper; ViT-L and ViT-H can achieve about 2.5x speedup). Another recent work on arXiv [1] also demonstrates that this compression rate can even increase to >3x. However, in the experiments reported by PIToMe, the biggest speedup I could find is about 2x. I wonder whether PIToMe still works out for higher compression rates, since the metric vital for finding token clusters gradually defines more tokens as informative tokens (Eq. 4 in this PIToMe article).

[1] PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation (arXiv 2403.09192)

Questions

  1. Also on the topic of compression rate. In ToMe, r can be randomly set to achieve any compression rates. Can I randomly set the r values for each layer in PIToMe?

Limitations

See weaknesses & questions. I may change the rating after carefully checking the replies and reviews from other reviewers.

Author Response

Thank you for your feedback and high score, and for recognizing the main contribution of our work. We will now address your concerns regarding the compression rate of PiToMe.

1. ToMe is actually capable of conducting token merging at a very high compression rate (see Table 10 of ToMe paper; ViT-L and ViT-H can achieve about 2.5x speedup). Another recent work on arXiv [1] also demonstrates that this compression rate can even increase to >3x. However, in the experiments reported by PIToMe, the biggest speedup I could find is about 2x.


Our PITOME can indeed achieve up to a 3× speedup, as evidenced in Figures 5 and 6 (in the submission). In these experiments, we benchmarked PiToMe against various baselines using different ratios of $r$, where the compression rate ranged from approximately 70% down to 30%. Generally, performance dropped by approximately 0% to 3%, keeping more than 95% of the baseline model's performance. Remarkably, on text-classification tasks (Figure 10 in the submission), PiToMe can even compress down to 20% of the number of tokens. This resulted in training and inference speeds approximately 5× faster while maintaining accuracy above 85% (about a 7% accuracy drop compared to the baseline). Please refer to Figures 2, 3, and 4 in the global rebuttal, where we conduct experiments on a wider range of the ratio $r$.

Thank you for recommending arXiv [1], which was accepted to ECCV 2024 on July 01, 2024. Indeed, modulating token features during token merging can mitigate the performance drop under a low compression rate and achieve up to a 3× speedup. However, we do not recommend compression rates higher than 60%, because compressing beyond this rate leads to significant performance degradation on many tasks. As shown in arXiv [1] (PYRA) and ToMe, achieving a 3× faster inference speed typically results in a 4% to 7% drop in accuracy compared to the baseline on image classification tasks, which is undesirable. Also, please note that factors other than the compression rate, such as hardware and batch size, can affect the model's speed. Therefore, we are unable to fully compare PYRA with PiToMe, as the authors have not published their code.

2. I wonder whether PITOME still works out for higher compression rates since the metric vital for finding token clusters gradually defines more tokens as informative tokens (Eq. 4 in this PITOME article)


First, we want to clarify that Equation 4 defines the token energy score. In our approach, informative tokens (low energy) are clustered into a preserved group, which will not be merged, and we only merge the less informative tokens (the top $2k$ tokens with the highest energy). Like other compression methods, PITOME can handle higher compression rates; however, there is a trade-off between performance and compression rate.

It is important to note that the compression process primarily merges less informative tokens while preserving the informative ones, which holds as long as a reasonable ratio $r$ is maintained. This is demonstrated in all the trade-off figures, where PiToMe maintains model performance until $r$ becomes too low, at which point a significant drop occurs. In the final version, we will include more ablation studies on higher compression rates with various ratios $r^i$ for each layer $l^i$, as suggested by the reviewer.
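As a rough illustrative sketch of this protect-then-merge selection (our own example, not the paper's released implementation; the helper name `select_merge_candidates` and the use of NumPy are assumptions), ranking tokens by energy and exposing only the top $2k$ to the matching step could look like:

```python
import numpy as np

def select_merge_candidates(energy: np.ndarray, k: int):
    """Split token indices into 2k merge candidates and a protected set.

    energy : per-token energy scores (higher = more redundant / mergeable).
    k      : number of merges this layer, so the top 2k tokens enter matching.
    """
    order = np.argsort(-energy)       # indices sorted by descending energy
    candidates = order[:2 * k]        # high-energy tokens: eligible for bipartite matching
    protected = order[2 * k:]         # low-energy (informative / isolated) tokens: kept as-is
    return candidates, protected
```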

3. Also, on the topic of compression rate. In ToMe, r can be randomly set to achieve any compression rates. Can I randomly set the r values for each layer in PIToMe?


In PITOME, the compression rate can be flexibly controlled based on the ratio $r$, representing the fraction of remaining tokens. Specifically, each layer $l^i$ is controlled by a ratio $r^i$, meaning that $1 - r^i$ of the tokens will be merged. The overall percentage of remaining tokens after compression is given by:

$$r_{\text{remain}} = \frac{\sum_{i=1}^{N} r^i}{N}$$

For example, if we consider a ViT with 12 layers ($N = 12$) and $r = 0.9$, the percentage of remaining tokens after compression is $r_{\text{remain}} = 0.538$. Similarly, for a BLIP2 model with 48 layers, $r_{\text{remain}} = 0.362$ with $r = 0.9$.

Thus, it is feasible to set the value $r^i$ for each layer randomly. However, it is important to consider the trade-off between performance and compression. As shown in our results, performance drops dramatically when $r < 0.85$ for all layers.
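For reference, a few lines of Python reproduce this calculation (the helper name `remaining_token_fraction` is ours; a constant per-layer ratio is assumed, as in the 12-layer ViT example above):

```python
def remaining_token_fraction(ratios):
    """Average fraction of the original tokens processed across layers.

    ratios[i] is the fraction of tokens kept at layer i relative to layer i-1,
    so the fraction still alive at layer i is the cumulative product of ratios[:i+1].
    """
    total, alive = 0.0, 1.0
    for r in ratios:
        alive *= r            # tokens surviving after this layer
        total += alive
    return total / len(ratios)

print(round(remaining_token_fraction([0.9] * 12), 3))   # 0.538 for a 12-layer ViT with r = 0.9
```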

Comment

Dear Reviewer 4Z7M,

We thank you again for your positive feedback on our work. Regarding your remaining questions about the merging capacity of PiToMe, we wonder whether our explanation is clear enough. If you have other questions, please feel free to raise them. We are happy to discuss them before the discussion period ends in two days.

Thank you very much

Regards

The Authors

Review (Rating: 6)

This paper proposes a new strategy for reducing the number of tokens used by a vision transformer by merging similar tokens without retraining. Such methods are very sensitive to the type of clustering/partitioning used on the tokens. The proposed approach, PITOME, explicitly identifies highly informative tokens, which should be kept out of the merging process.

More specifically, the tokens are placed in a graph where each node corresponds to a key embedding and the edges carry the cosine distance between said keys. Based on this graph, an "energy" is computed for each token, capturing how mergeable it is. Only the top $2k$ are considered mergeable, while the rest are preserved. The following merging step then follows a similar procedure to previous token merging methods.

In summary, an "energy" score is computed for each token, using their key embeddings as base features. Using these energy scores, PITOME sets itself apart from other merging techniques such as ToMe in two ways:

  • explicitly exclude some tokens from the merging process
  • split tokens across the two sets $A$ and $B$ for bipartite matching so as to maximise the chance of matching

Strengths

  • The proposed method is thoroughly evaluated on image, text and image+text tasks
  • Generally, PITOME achieves higher accuracy while preserving the expected number of FLOPs and memory usage
  • The issues addressed by PITOME are well explained and motivated.

Weaknesses

  • Generally, the experiments section is really hard to parse:

    • the figure placement is a bit all over the place
    • some results would benefit from being moved to the appendix; for instance, in my opinion, the trade-off curves are much more informative than the tables
    • some figures are hard to read (Figure 3) (e.g. small captions, overlapping curves due to the y/x axes scales)
    • ablation experiments are much too condensed
  • In terms of ablation, there is little discussion of the term $m$ when computing the energy score: it seems to be an important hyperparameter, as it impacts the "mergeability" of the tokens as the layer depth increases. Since ToMe and other related works are presented as lacking robustness to the token-splitting strategy, it would be interesting to also show how robust PITOME is to this hyperparameter.

  • Some claims didn't seem substantiated enough to me

    • e.g. line 120, "PITOME remains robust to token partitioning strategies": however, PITOME uses a specific token partitioning algorithm, and the ablation on it (random split in Table 1) shows that it actually suffers from a different partitioning
    • line 276, "We contend that this improvement stems from the merging of less significant tokens in PITOME potentially enhancing the robustness of the language model" -> this seems like a very vague hypothesis which in my opinion does not really need a justification to begin with, since some other merging methods also outperform the baseline. It may as well be a consequence of the noisiness of question answering datasets.

Questions

  • I didn't fully understand the role of Table 2: isn't PITOME agnostic to the choice of architecture? It would send a stronger message to combine PITOME with various backbones, rather than compare it to different architectures.

Limitations

There is a limitations section discussing some current drawbacks of token merging methods in general.

Author Response

Thank you for your thoughtful feedback and for pointing out claims that were not sufficiently substantiated. We will now address your concerns one by one.

1. The experiments section is hard to parse

We will address these concerns in the final revision using the additional page. This will allow us to provide more details about the experimental results. Furthermore, we will move some less important parts to the appendix. For example, in Figure 3, we will remove redundant space at the bottom of the top-row figure and the top of the second-row figure to increase the size of the curves. Regarding the ablation study, we will include more detailed descriptions for each setting and provide an analysis of their impacts to improve clarity.

2. Discussion about the term $m$ in the ablation when computing the energy score

We acknowledge that the ablation study for $m$ is a valuable addition to the paper. To support readability, we include Figure 1 in the global rebuttal to visualize the impact of $m$ when experimenting with multiple margins on different tasks using various thresholds. From Figure 1, it can be seen that, in both tasks, the adaptive version $m = 0.9 - 0.9 \cdot l_i/l$ achieves the best results, while models with a fixed $m$ tend to have their accuracy drop sharply when $r$ falls below some threshold. The reason might be that, as the token space becomes more sparse in deeper layers, PiToMe with a fixed $m$ will likely assign the same energy score to every token, which is undesirable since we want to identify isolated tokens to protect them while merging the others.

3. Some claims didn’t seem substantiated enough

  • Line 120, which is “PITOME remains robust to token partitioning strategies.”

We apologize for the confusion the statement has caused. Indeed, this was a typo. What we meant to convey is that "PITOME remains robust compared to other token partitioning strategies." In addition, we want to highlight that PITOME offers a better trade-off between speed and accuracy compared to previous BSM-based algorithms (e.g., ToMe, DiffRate), while still being able to preserve the spectral properties of the token space, similar to loop-based algorithms (e.g., k-means, spectral clustering, graph coarsening).

  • In line 276, "We contend that this improvement stems from the merging of less significant tokens in PITOME potentially enhancing the robustness of the language model":

We agree that the dataset may contain noise. However, this noisiness can be interpreted as either biased or diverse perspectives. A possible explanation for the performance improvement in our token merging method is that the model treats all objects in the image equally, regardless of their size and the number of tokens they have. Large objects have their tokens merged, while smaller objects' tokens are preserved. Therefore, the number of tokens representing an object is not much different. This approach allows the model to avoid bias caused by object size and focus on the objects themselves rather than being biased by larger objects with a greater number of tokens. Additionally, by paying equal attention to all scene objects, the model aims to capture diverse perspectives.

4. The role of Table 2

In Table 2, we aim to demonstrate the following points:

  • PITOME can be combined with various backbones (as mentioned by the reviewer). In particular, we test PITOME with 2 backbones in Table 2 (BLIP, BLIP2) and two additional backbones in Figure 3 (CLIP and ALBEF).
  • We show that PiToMe, when combined with various backbones, still outperforms state-of-the-art architectures (CLIP, ALBEF, UNITER, ViLT, etc.).
  • In addition to the image-text retrieval task, we evaluated PITOME on various other tasks using multiple backbones. For the image classification task, we tested our algorithm on six backbones with two different pre-training styles (DeiT and MAE). We used two backbones (LLaVA-7B and LLaVA-13B) for the visual question-answering task. We tested on two backbones (BERT and DistilBERT) in the text classification task. These experiments confirm the versatility of our algorithms.
Comment

Dear authors, thank you for your response/clarifications and additional results. It is indeed good to see that the adaptive version of $m$ performs well across tasks. I am currently inclined to keep my score of 6, as I think the paper is technically solid and with good/robust evaluation, but with moderate to high impact, as the improvement over the ToMe baseline is not always very significant.

Comment

Thank you for your feedback. We will integrate those points into the final camera-ready version if the paper is accepted.

Regards

Authors

Comment

Dear Reviewer jrpm,

We would like to thank you very much for your feedback, and we hope that our response addresses your previous concerns, for example, about Table 2 in the experimental section.

If there are any points we have not responded to so far, please feel free to let us know soon, as the discussion period is expected to end shortly. We would be more than happy to address any additional concerns from you.

Thank you again for spending time on the paper; we really appreciate it!

Sincerely

Authors.

Review (Rating: 6)

This paper introduces an energy-based approach to the token merging process, utilizing energy scores to avoid erroneous merges by distinguishing between informative or isolated tokens and redundant tokens. This method enhances the efficiency of heuristic merging approaches and preserves the spectral properties between the original and merged graph, thus accelerating token-based architectures such as ViT. Specifically, the technique employs cosine similarity with neighboring tokens to evaluate the energy score. Tokens with a high energy score are considered redundant and can be merged, indicating they belong to large clusters (e.g., background). Conversely, tokens with low energy scores are treated as foreground tokens. Furthermore, the authors also present theoretical findings demonstrating that the proposed method more effectively approximates the spectral distance between the initial token spaces and the merged token set compared to existing approaches. The method has shown competitive results across four tasks: image-text retrieval, visual question answering with LLMs, image classification, and text classification, underscoring its effectiveness and significance in the field.

Strengths

  • This paper is well-written and clearly presents both the proposed solution and the experimental results.
  • In addition to thorough experiments demonstrating the effectiveness of the proposed method, the paper provides a theoretical derivation to explain why PiToMe outperforms another related approach (e.g., ToMe).

Weaknesses

  • Some formulas and inferences are ambiguous, leading to misunderstandings that undermine the correctness of the theoretical explanations.
  • The lack of results for higher merging ratios weakens the practicality of the proposed approach.
  • There is a lack of discussion on related works (e.g., CrossGET [1], TRIPS [2]).

[1] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers, ICML 2024

[2] TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection, EMNLP 2022

Questions

  • There are some ambiguities that need to be clarified in the methodology section. The authors claim that $m$ is a fixed constant margin; however, according to the context, $m$ seems to be the threshold of the margin, and $1-m$ is the margin. If I am wrong, the following description is confusing: in lines 162-163, "$m = 0.9 - 0.9 \times l_i/l$, ... indicating an increasing margin as tokens move to deeper layers." This sentence seems to be the complete opposite of the conclusion. Also, does $\alpha = 1.0$ mean that when token $j$ is outside the margins, $f_m(x)$ is a negative constant? Furthermore, there are no related ablation studies or discussions of the impact of $\alpha$.
  • In most of the experiments, the ratio $r$ is 0.95, 0.975, 0.9, and 0.925. Only the image classification results include a ratio $r = 0.7$. Existing approaches (e.g., ToMe) can achieve ratios of $r = 0.5 \sim 0.6$. I am curious about the consistency of the proposed method at higher merging ratios.
  • Some existing work, such as TRIPS and CrossGET, has also investigated token merging. The lack of thorough discussion or comparison with previous work weakens PiToMe's claimed novelty. I suggest the authors more clearly present their merits and differences in the paper.

Limitations

  • The authors discuss some limitations in the last section.
Author Response

Thank you for your helpful feedback. In what follows, we will address your concerns individually.

1. The authors claim that $m$ is a fixed constant margin; however, according to the context, $m$ seems to be the threshold of the margin, and $1-m$ is the margin.

We're sorry for the confusion. As we use cosine similarity to compute the energy score, the value of $x$ in Eq. 4 will be in the range $(-1, 1)$. A value close to 1 indicates that the two vectors point in the same direction (identical tokens). Thus, since $m = 0.9 - 0.9 \cdot l_i/l$, the margin at the first layer has a value of 0.9, and only tokens with a cosine similarity larger than 0.9 are considered as neighbors. Tokens outside this margin have the value $x$ replaced by a negative constant $\alpha$. The margin becomes adaptively "wider" in deeper layers as $m$ decreases.
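As an illustration only (not the authors' released code; the function name `energy_scores` and the exact averaging are our assumptions), an energy score with this margin behaviour, using the smoothed penalty $\alpha(\exp(x-m)-1)$ below the margin mentioned in the next answer, might be sketched as:

```python
import numpy as np

def energy_scores(keys: np.ndarray, m: float = 0.9, alpha: float = 1.0) -> np.ndarray:
    """Toy per-token energy: mean of f_m(cosine similarity) over the other tokens.

    keys  : (N, d) key embeddings of one layer.
    m     : margin; similarities >= m are kept as-is ("same cluster").
    alpha : weight of the smoothed penalty alpha * (exp(x - m) - 1) below the margin.
    """
    normed = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    sim = normed @ normed.T                          # cosine similarities in (-1, 1]
    f = np.where(sim >= m, sim, alpha * (np.exp(sim - m) - 1.0))
    np.fill_diagonal(f, 0.0)                         # ignore self-similarity
    # High mean value -> token sits in a large cluster (mergeable);
    # low mean value  -> isolated / informative token to be protected.
    return f.mean(axis=-1)
```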

2. Does $\alpha = 1.0$ mean that when the token is outside the margins, $f_m(x)$ is a negative constant? Furthermore, there are no related ablation studies or discussions of the impact of $\alpha$.

Yes. Due to the strict page limits, we could not include the ablation regarding these parameters, the margin $m$ and the lower bound $\alpha$. Recall that we use $\alpha(\exp(x-m) - 1)$ to smoothen the $f_m(\cdot)$ function so as to consider neighbor tokens that are outside but close to the margin $m$. The larger $\alpha$, the more consideration these neighbors receive, and $\alpha = 0$ means we ignore them completely. The table below shows experimental results on the image-text retrieval task using different values of $\alpha$ and $m$.

| $r$ | $\alpha=1.0$ | $\alpha=0.5$ | $\alpha=0.0$ |
|---|---|---|---|
| 0.85 | 519.98 | 518.66 | 515.9 |
| 0.875 | 545.9 | 544.22 | 542.54 |
| 0.9 | 562.82 | 562.42 | 561.92 |
| 0.925 | 571.88 | 571.1 | 570.62 |
| 0.95 | 577.5 | 577.43 | 577.4 |
| 0.975 | 580.24 | 579.82 | 579.76 |

3. In most of the experiments, the ratio is 0.95, 0.975, 0.9, and 0.925. Question about the consistency of the proposed method at higher merging ratios.

Thank you for your comment. We define the ratio $r$ differently from the ratio mentioned in your question. We define $r$ as the fraction of tokens remaining in each layer after merging, not as the percentage of total remaining tokens after compression. Since PITOME is applied to all layers, $(1-r)$ of the tokens will be merged after each layer. For example, let's consider a ViT model with 12 layers and a ratio $r = 0.9$. This means that 10% of the tokens will be merged after each layer. The percentage of remaining tokens after compression is thus given by:

$$r_{\text{remain}} = \frac{\sum_{i=1}^{12} 0.9^i}{12} = 0.538.$$

$r_{\text{remain}}$ is the ratio $r$ mentioned in your question, which means reducing nearly half the number of tokens the model needs to process and leading to about two times the throughput. This percentage can further decrease for larger models with more layers. For instance, in the BLIP2 model with 48 layers, we can achieve $r_{\text{remain}} = 0.362$ if $r = 0.9$. For large language models like LLaVA, $r_{\text{remain}}$ can be even lower (in our paper we tested with $r_{\text{remain}}$ ranging from 0.732 down to 0.277).

4. PiToMe compared to TRIPS and CrossGET

For better readability we added a table to compare these works:

| | TRIPS | CrossGET | PiToMe |
|---|---|---|---|
| Method | Selectively protects and merges image tokens based on attention scores from the [CLS] token of the text encoder | Introduces two components: (i) Complete-Graph Soft Matching (CGSM), which merges tokens with the highest similarity scores; (ii) Cross-Guided Matching and Ensemble, which introduces learnable tokens to provide cross-modal importance scores, guiding the CGSM algorithm and enabling weighted-average merging | Introduces energy scores to identify mergeable and isolated (informative) tokens. Tokens in large clusters have high energy scores and are merged |
| Strength | Boosts model accuracy at a lower computational cost than the unaccelerated baseline model | (i) Can be applied to both modality-dependent and modality-independent models; (ii) achieves better accuracy than the previous method (ToMe) both with and without training | (i) Shares the same strengths as CrossGET; (ii) robust against imbalanced clusters |
| Weakness | Tailored for modality-dependent models like ALBEF and still needs pretraining before inference | CGSM is very sensitive to the token space's distribution and is suboptimal when facing a token space with imbalanced clusters (confirmed by the authors in the appendix section) | Applied only to the ViT encoder of the VL model, without any cross-modal guidance |

Since these papers do not provide their source code, it is difficult to reproduce their results. Also, we would like to explain more clearly the weakness of the CGSM algorithm used in CrossGET:

  • CGSM's implementation divides all tokens into two disjoint sets, $T_s$ and $T_d$, allowing only one token from $T_s$ to merge with one token in $T_d$. This is suboptimal for imbalanced clusters, such as large clusters with many tokens (e.g., backgrounds) and small clusters (e.g., small objects), which are common in practice.
  • For example, let's assume we have four tokens where three are close to each other and one is isolated. CGSM would merge two of the three close tokens and then merge the remaining close token with the isolated one, causing significant information distortion. The correct behavior should be to merge the three close tokens and leave the isolated one untouched.
  • PiToMe, on the other hand, assigns high energy scores to tokens in large clusters, prioritizing their merging while protecting isolated tokens. In the example above, the three close tokens are identified as mergeable, while the isolated token remains protected. Moreover, PiToMe can also benefit from cross-modality guidance techniques as add-ons to enhance performance.
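To make this four-token example concrete, the toy snippet below uses the mean cosine similarity to the other tokens as a simplified stand-in for the energy score; the key vectors are invented purely for illustration:

```python
import numpy as np

# Three nearly identical tokens plus one isolated token (hypothetical 2-D keys).
keys = np.array([[1.00, 0.00],
                 [0.99, 0.05],
                 [0.98, -0.04],
                 [-0.70, 0.70]])
normed = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, 0.0)

# Mean similarity to the other tokens, used here as a simplified energy proxy:
# the three clustered tokens score high (mergeable), while the isolated one
# scores low and would therefore be protected rather than merged.
energy = sim.mean(axis=-1)
print(energy.round(2))   # first three values are clearly positive, the last is strongly negative
```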
Comment

Dear Reviewer ZkPm,

Because we have around two days left before the end of the discussion phase, we wonder whether our rebuttal above answers your concerns. We have provided answers to your major questions about the discussion of related works, compression rates, and the ablation studies on the margin $m$. Please let us know if there are any points that are still unclear.

Thank you very much for taking the time to give us feedback. We look forward to hearing your response soon.

Regards

The Authors

Comment

Dear Authors,

Thank you for your clarifications and the additional results. After reading your responses and the comments from other reviewers, I am raising my score to 6. As most of my concerns have been addressed, I believe that adding the relevant precise descriptions and ablation studies will further enhance the quality of the revised version.

Comment

Dear Reviewer ZkPm,

We appreciate your reading our rebuttal and increasing the score. If the paper is accepted, we will add those ablation studies to the revised version.

Regards

Authors

Review (Rating: 5)

This paper proposes a novel method called PITOME, which enhances the efficiency of Vision Transformers (ViTs) by merging tokens in a way that preserves crucial information. Unlike previous methods, PITOME uses an energy score to prioritize and protect informative tokens while merging redundant ones. This approach reduces the computational and memory requirements by 40-60% with minimal impact on accuracy. PITOME achieves superior performance in tasks such as image classification, image-text retrieval, and visual question answering, demonstrating its effectiveness and robustness compared to existing token merging techniques. Furthermore, it theoretically preserves the spectral properties of the original tokens, contributing to its efficiency and performance stability.

Strengths

  1. The introduction of the energy score metric for token merging is a novel idea that addresses the limitations of previous methods. This innovative approach effectively distinguishes between informative and redundant tokens, ensuring the preservation of critical information during the merging process.
  2. The theoretical foundation for preserving the spectral properties of the original token space is a unique contribution. This aspect of the work sets it apart from existing methods, which often do not consider the impact on the spectral properties of the token space.
  3. The paper provides extensive empirical validation across multiple tasks, including image classification, image-text retrieval, and visual question answering.
  4. The theoretical analysis that supports the empirical findings enhances the overall quality of the paper. By demonstrating that PITOME preserves spectral properties, the authors provide a solid theoretical justification for the observed performance improvements.

Weaknesses

  1. The proposed method, PITOME, does not show significant improvement over the baseline, particularly ToMe, in many tasks. The empirical results indicate that in some cases, PITOME's performance is only marginally better or even comparable to existing methods, which raises questions about its practical benefits.
  2. As shown in Figure 11, PITOME fails to outperform ToMe in visualizations, and in some instances, the token merging appears unreasonable. For example, in Figure 11(d), the merging around "a tennis racquet" is poorly executed, leading to doubts about the method's effectiveness in preserving critical details.
  3. The paper includes extensive mathematical derivations and proofs, which significantly impact its readability. The dense mathematical content makes it difficult for readers to quickly grasp the core contributions and the practical implementation of the proposed method.
  4. The paper primarily focuses on improving ToMe by addressing its token merging strategy. While this is a valid research direction, the scope of the problem tackled is relatively narrow. The significance of the improvement may not be high enough to warrant extensive interest or application, as it addresses a specific aspect of an existing method rather than introducing a fundamentally new concept or problem.

Questions

  1. In Figure 11, PITOME's token merging appears unreasonable in some cases (e.g., the tennis racquet example). Could you provide a deeper explanation or justification for these visualizations and any potential improvements to address these issues?
  2. The mathematical derivations and proofs are quite complex and impact readability. Are there ways to simplify these explanations or provide more intuitive summaries of the key points to improve accessibility for readers?

Limitations

The authors have acknowledged some limitations of their work and provided suggestions for future research directions. The paper does not explicitly address any potential negative societal impacts. Given the focus on efficiency improvements in Transformers, the primary societal impact would likely relate to the broader implications of making these powerful models more accessible and efficient.

Author Response

Thank you for your valuable feedback and for recognizing the strengths of our work. We appreciate your thoughtful comments. We will now address each of your concerns individually.

1. Performance comparison with ToMe

We would like to emphasize that PiToMe demonstrates significant improvements over the baselines, particularly ToMe. We acknowledge that the condensed presentation of experiments in the paper might have made it difficult to discern the performance gains of our method. We highlight the performance gains of PiToMe over ToMe as follows:

  • Image-text retrieval: At a 60% compression rate, PiToMe outperforms ToMe by a margin of approx 11% for recall score at top 1.
  • Visual Question Answering: With 50% compression rate, PiToMe maintains a more stable performance drop below 0.05% while ToMe's drop exceeds 1%.
  • Text classification: PiToMe achieves a better trade-off, compressing over 60% of tokens with accuracy above 90% (about a 2% drop) and over 80% of tokens with accuracy above 85%.
  • Image classification: PiToMe saves 40-60% of FLOPs, with an average performance drop of 1.4% compared to 2.6% for ToMe.

2. Figure 11

We kindly argue that the observation that the merging is unreasonable might be caused by a misunderstanding. For clarity, we first discuss how we generate these visualizations and then provide a definition of good token merging:

  • Implementation: Tokens are represented as grid boxes with black borders, colored by the mean color of the pixels they cover. Tokens with higher attention scores will have a bolder cyan border. During the merging process, result tokens will have their colors replaced by the mean color of all the merged tokens. If merged tokens are adjacent, the black borders will be removed.
  • Good merging criteria: Objects represented by a small number of tokens should be preserved, while larger objects with repeated textures, such as background areas, can be considered for merging.

In Figure 11d, we observe that PITOME effectively minimizes information distortion by preserving important tokens representing "the tennis racquet," "the man," and the "Adidas logo," while merging tokens that contain redundant information representing the background. Despite some mis-merged tokens in the background regions, PITOME largely preserves the distribution of all tokens. This is illustrated by the attention map, which closely resembles that of the original model. In ToMe, on the other hand, there are several wrongly merged tokens; for example, all tokens representing the "tennis racquet" are merged into a single token, and similarly for the "Adidas logo." Also, ToMe merges tokens representing the man's arm and the white skirt with background tokens, which can be seen from the resulting false color.

3. Mathematical derivations and proof

The theory section, with mathematical derivations and proofs, is intended to show the spectrum preservation property of a graph built based on merged tokens using our method compared to the original token graphs in ViT. To improve readability, we will emphasize the key steps in deriving our theoretical results and provide high-level summaries in the final version.

4. Scope and significance

We focus on ToMe because it pioneered this line of research, making it a prime candidate to demonstrate our theoretical results. To address the reviewer's concern, we first provide a comparison table below:

| Method | Targeted architecture | Description | Weaknesses |
|---|---|---|---|
| ToMe / ToFu | ViT | BSM with odd & even indices partitioning | Risks damaging tokens in later layers |
| DiffRate | ViT | BSM with partitioning based on the attention score from the [CLS] token | Tightly coupled with the attention scores of the classification task and performs poorly on several tasks |
| PuMer | VL models | BSM + cross-modal guidance | Cannot be used for modality-independent models like CLIP, and is tightly coupled with a vision-language model |
| CrossGET | VL models | CGSM + cross-modal guidance | Tightly coupled with vision-language models and sensitive to token distribution |
| DCT | Language models | Uses the FFT operator to filter out high frequencies with low amplitude | Performs poorly in the vision domain |
| PITOME | Transformer-based | Identifies large clusters and partitions tokens based on their ordered energy scores; thus, it is robust against token distribution and can be used for any Transformer-based architecture | — |

We summarize our contributions:

  • Conceptual novelty: We introduce a new concept called the energy score. Applying it to other applications such as LLMs, 3D point clouds, video processing, or PDEs could open new research directions for designing an optimal energy score that captures specific underlying data distributions. Furthermore, energy minimization theory has been well studied in fields such as chemistry and physics; thus, we believe this will pave the way for new research directions in those fields.
  • Theoretical results: Our theoretical results provide a solid foundation for understanding and applying the energy score concept, offering valuable insights for further research and development.
  • Extensive applications: To demonstrate the versatility of PITOME, we provided extensive experiments for multiple scenarios (image-text retrieval, image classification, text classification, and VQA using LLaVA-7B and 13B). Furthermore, we compared our method with SOTA compression models in each specific downstream task, showcasing the broad applicability and effectiveness of our approach. Moreover, PITOME's accuracy can also benefit from existing techniques (differential merging schedules, cross-modal guidance, etc.).

Most importantly, due to the widespread use of Transformers and the associated high computational resources required, significant efficiency improvements can lead to substantial resource savings and open up new applications.

Comment

Dear Reviewer zfJf,

We sincerely appreciate the time you have taken to provide feedback on our work, which will help us to improve its clarity, among other attributes. We want to follow up to check if our response fully addressed your concerns/questions before ending the discussion period.

Sincerely,

The Authors

Comment

Dear Reviewer zfJf,

We have around five hours from this post before ending the discussion phase. Could you please let us know whether our responses answer your main concerns?

Regards

Authors

Author Response

First, we are grateful to the reviewers for their valuable comments and detailed feedback. We are pleased that the reviewers recognize our energy-based token merging as a novel idea (Reviewer zfJf and Reviewer 4Z7M) with a theoretical foundation explaining the underlying mechanism (Reviewer zfJf and Reviewer ZkPm). All reviewers also acknowledged our extensive experiments on image, text, and image-text retrieval, demonstrating that PITOME achieves higher accuracy while maintaining the expected number of FLOPs and memory usage (Reviewer JRPM and Reviewer 4Z7M).

Then, we would like to summarize the main contributions of our work:

  • We introduce a new concept called the energy score, which can be applied to various applications such as LLMs, 3D point clouds, video processing, or PDEs. This could open new research avenues for designing an optimal energy score that accurately represents specific underlying data distributions. Additionally, it could lead to new research directions in fields like chemistry and physics.
  • We provide a robust and solid theoretical foundation for understanding and applying the energy score concept, offering valuable insights for further research and development.
  • We demonstrated PITOME's versatility through extensive experiments in various scenarios, including image-text retrieval, image classification, text classification, and VQA using LLAVA-7B and 13B. Additionally, we compared our method with top compression models in each specific task, showcasing its broad applicability and effectiveness. PITOME's accuracy can also benefit from existing techniques to improve its performance further.
  • Significant efficiency improvements can lead to substantial resource savings and open up new applications, given the widespread use of Transformers and the high computational resources they require.

The reviewers raised several concerns, which we addressed in the individual response. We summarize the highlights of our responses as follows:

  • Performance comparison with ToMe: We discuss the advantages of our approach over the baselines, including ToMe, across various tasks with notable performance gains. Furthermore, we provide a detailed analysis of the speed and compression rate comparison between our PITOME and ToMe.
  • Impact of the variables $m$ and $\alpha$: We include additional ablation studies on various values of $m$ and $\alpha$ to showcase the effectiveness and robustness of our proposed PITOME.
  • Compression rate: We clarify the merging ratio $r$ used in our paper and discuss in detail the trade-off between performance and compression rate.

Finally, we have carefully addressed all the reviewers' comments and questions. We will revise and update our final version, using one additional page, based on all the reviewers' suggestions.

Final Decision

This paper proposes a new token merging method, i.e., PIToMe, to enhance the efficiency of Vision Transformers (ViTs) by merging tokens in a way that preserves crucial information. Different from previous methods, PIToMe uses an energy score to prioritize and protect informative tokens while merging redundant ones. In this way, the computational and memory requirements are reduced by 40-60% with minimal impact on accuracy. Superior performance is achieved in multiple tasks such as image classification, image-text retrieval, and visual question answering, demonstrating its effectiveness and robustness compared to existing token merging techniques. Furthermore, it theoretically preserves the spectral properties of the original tokens, contributing to its efficiency and performance stability.

This paper received four reviews, with ratings of 1 accept, 2 weak accept, and 1 borderline accept. All reviewers are overall positive about this paper. They recognized the novelty and contributions of this paper, such as novel ideas, extensive empirical validation, higher performance, and good presentation.

Meanwhile, the reviewers pointed out some suggestive comments that should be considered in preparing the final version, such as a relatively narrow problem, dense mathematical content (difficult for readers to quickly grasp the idea), ambiguous formulas and inferences, lack of results for higher merging ratios, and lack of discussion on related works. The authors provided meaningful responses to the reviewers' concerns and the reviewers are generally satisfied with the rebuttal. I recommend acceptance and strongly suggest the authors take the reviewers' suggestive and detailed comments and the authors' promised revisions into consideration in the final version.