PaperHub
Score: 7.8/10
Decision: Poster · 4 reviewers
Ratings: 5, 5, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

Compress & Cache: Vision token compression for efficient generation and retrieval

Submitted: 2025-05-10 · Updated: 2025-10-29

Keywords

token compression, llava

Reviews and Discussion

Review (Rating: 5)

The authors propose a method to optimize the inference speed of VLMs in a RAG scenario.

Specifically, given a pretrained VLM, the authors perform a first inference pass using an image, a pre-defined summarization prompt, and a small set of summary tokens. The summary tokens are then cached and used for further inferences alongside a query prompt.

The authors claim that this should significantly improve speed, since during "online" inference the model then processes only around 6% of the image tokens, which they note constitute the overwhelming part of the input sequence compared to the textual query.

To make this possible, the authors fine-tune a pretrained VLM, in their case LLaVA 1.5, via LoRA adapters, using two losses:

  1. A contrastive loss, applied on the summary tokens after the first forward pass.
  2. An auto-regressive loss, applied on the generations produced by the model, given the text instructions and the summary tokens.

The two losses are then summed and jointly optimized using two LoRA adapters, one for summarization and one for generation. The adapters are trained for 10,000 steps using data from LLaVA-665K and CC3M, representing generative and contrastive tasks, respectively. The authors aim for a 1:1 sampling ratio between the two.
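To make the training flow concrete, below is a minimal PyTorch-style sketch of one such double-forward training step as described above; all names (`vlm`, `forward_summarize`, `forward_generate`, `set_active_adapter`, the mean pooling, and the temperature) are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def cc_training_step(vlm, image, summary_prompt, query, answer_ids, text_embed):
    # Pass 1 (compression): the VLM reads the full set of image tokens plus a
    # summarization prompt and k learnable summary tokens, with the
    # summarization LoRA adapter active. (Hypothetical API.)
    vlm.set_active_adapter("summarize")
    summary = vlm.forward_summarize(image, summary_prompt)       # (B, k, d)

    # Contrastive loss on pooled summary tokens vs. paired caption embeddings.
    img_feat = F.normalize(summary.mean(dim=1), dim=-1)          # (B, d)
    txt_feat = F.normalize(text_embed, dim=-1)                   # (B, d)
    logits = img_feat @ txt_feat.t() / 0.07                      # temperature: assumed
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_disc = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # Pass 2 (generation): the query is answered from the summary tokens only,
    # with the generation LoRA adapter active.
    vlm.set_active_adapter("generate")
    logits_ar = vlm.forward_generate(summary, query)             # (B, T, vocab)
    loss_ar = F.cross_entropy(logits_ar.flatten(0, 1), answer_ids.flatten())

    # The two losses are summed; gradients of loss_ar also flow back through
    # the summary tokens into the summarization adapter.
    return loss_ar + loss_disc
```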

The fine-tuned models are then comprehensively evaluated across several text generation and image/text retrieval benchmarks, and the results show that C&C largely outperforms existing token pruning and matryoshka methods, even surpassing the original LLaVA baseline in several cases.

Strengths and Weaknesses

Strengths

  1. The paper is well-written and easy to follow.
  2. The paper addresses a relevant problem, namely improving the inference speed of VLMs, with a focus on RAG/indexing scenarios, which are relevant for real-world use cases.
  3. Their method is comprehensively evaluated across various datasets and tasks and compared to several recent token pruning and matryoshka methods, largely improving over them.

Weaknesses

My primary concern regarding this paper is the confounding effect of LoRA fine-tuning. In particular, I am surprised to see that C&C improves over the baseline in several cases, e.g., +3.1% on VisWiz with 32 tokens, +0.1% on MMBench with only 16 tokens, representing a 97% reduction in the original number of tokens.

Thus, I wonder how much of this performance improvement (both over LLaVA and other methods) can be attributed to the LoRA fine-tuning and how much to C&C. I believe that a comparison between a LLaVA model fine-tuned via LoRA using a similar procedure (as much as possible) and C&C would provide clearer insights.

Beyond that, there are some minor details:

  1. I believe that the dataset size comparison between EVA-CLIP and C&C on line 271 is unfair, as it compares the size of the dataset used for training this specific CLIP model versus the size of a fine-tuning dataset.
  2. Table 1, for MMB, the best-performing method with 36 tokens is Matryoshka Multi, but its entry is not bolded.
  3. Small typo in line 47 of the supplementary, "performs" should be "performance".

Questions

  1. How are the summary tokens initialized?
  2. While the authors provide a FLOPs-based comparison between indexing and inference in Figure 5, I wonder how the inference speed would vary (and compare to other token pruning methods) under different scenarios, e.g., when the image is given by the user, and no retrieval is performed. Would it then be more efficient to use a matryoshka method compared to two VLM inference steps?
  3. The paper states that they aim for a 1:1 sampling ratio between LLaVA 665k and CC3M. Do the authors aim for an overall 1:1 ratio, or do they strive to maintain this ratio in each batch? Would a different sampling ratio improve performance?
  4. It would be interesting to know the total runtime of the LoRA fine-tuning process.
  5. To my understanding, the first-stage LoRA adapters are optimized for both the contrastive and auto-regressive objectives, while the second-stage adapters are optimized only for the auto-regressive objective. Is this correct?

Limitations

The authors address some limitations of their work.

Justification for Final Rating

The authors properly addressed all my concerns. I have thus increased my score to accept.

Formatting Concerns

N/A.

Author Response

We thank the reviewer for their time and feedback. We hope our responses below address their remaining concerns.

Q1. On the confounding effect of LoRA fine-tuning and C&C improving over the baseline in several cases: how much of this performance improvement (both over LLaVA and other methods) can be attributed to the LoRA fine-tuning and how much to C&C? I believe that a comparison between a LLaVA model fine-tuned via LoRA using a similar procedure (as much as possible) and C&C would provide clearer insights.

A1. We believe the reasons behind the observed behaviours are multiple: (1) As shown in Table 1 of the paper, multiple compression methods report improvements over the uncompressed LLaVA-1.5 baseline on certain datasets (e.g. [28] on SQA and VisWiz, [3,28] on VisWiz, [3] on MMB). This suggests that the compressor acts as a filter that removes some of the distractors, focusing more on prominent objects/concepts. (2) Performance on these datasets tends to vary between adjacent checkpoints during training, and the last checkpoint is not always the best one across all datasets, despite being the best on average. (3) The rest is indeed down to the LoRA fine-tuning, as shown by the result of the experiment you suggested, reported below:

| Compressor | GQA | MMB | MME | POPE | SQA | TextVQA | VisWiz |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5 | 62.0 | 64.3 | 1510.7 | 85.9 | 66.8 | 58.2 | 50.0 |
| + LoRA | 63.4 | 66.1 | 1496.9 | 86.4 | 68.4 | 58.2 | 49.8 |

Q2. I believe that the dataset size comparison between EVA-CLIP and C&C on line 271 is unfair, as it compares the size of the dataset used for training this specific CLIP model versus the size of a fine-tuning dataset.

A2. What we meant is the size of the dataset used for contrastive training (which is optimal for retrieval tasks). We will make this clear, thank you.

Q3. Table 1, for MMB, the best-performing method with 36 tokens is Matryoshka Multi, but its entry is not bolded. And a small typo in line 47 of the supplementary, "performs" should be "performance".

A3. Thank you! We have fixed them in the updated manuscript.

Q4. How are the summary tokens initialized?

A4. The summary tokens are initialized randomly. We explored multiple choices (e.g., zero init, character-based init) but ultimately did not notice a significant difference.

Q5. While the authors provide a FLOPs-based comparison between indexing and inference in Figure 5, I wonder how the inference speed would vary (and compare to other token pruning methods) under different scenarios, e.g., when the image is given by the user, and no retrieval is performed. Would it then be more efficient to use a matryoshka method compared to two VLM inference steps?

A5. Following your suggestion, we benchmark the LLaVA-1.5 7B model on an RTX 4090 GPU. Each result is averaged over 100 runs following a warm-up period.

Original LLaVA model cost: 0.0587 sec/image (out of which 0.00353 sec spent for the vision encoder)

Caching cost: 0.0584 sec/image

Online C&C cost (16 tokens): 0.00158 sec/image

Matryoshka (16 tokens): 0.00521 sec/image

Online C&C cost (4 tokens): 0.000406 sec/image

Matryoshka (4 tokens): 0.004001 sec/image

Caching cost + Online C&C (16 tokens): 0.0599 sec/image

Caching cost + Online C&C (4 tokens): 0.0588 sec/image

Notice that the practical measurements closely align with the expected approximate improvements. Moreover, in the online setting, once the features are cached, we are notably faster at building the context.

Furthermore, our method can also make use of a sliced LLM to speed up the compression. To showcase this, we train a model with the following configuration: a LLaVA-OV 0.5B generator with a compressor instantiated using the first 25% of the LLM's layers, as opposed to 100%. This results in a 4x faster and smaller compressor. To recover the performance drop, we distill the features produced by the full-sized compressor. The results for 16 tokens (i.e. 45x compression rate, 729/16) are presented in the table below:

| Compressor | GQA | MMB | MME | POPE | SQA | TextVQA | VisWiz | realworldQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| none | 58.3 | 52.9 | 1461 | 88.3 | 67.2 | 65.8 | 47.4 | 54.1 |
| 100% (full) | 57.8 | 52.9 | 1516 | 88.2 | 70.9 | 64.3 | 48.9 | 54.5 |
| 25% | 57.1 | 49.2 | 1405 | 84.3 | 70.3 | 61.8 | 44.2 | 53.0 |
| 25% + distill | 58.0 | 51.1 | 1477 | 86.4 | 73.7 | 64.6 | 48.5 | 53.9 |

This means that we can reduce the compression time by up to 4x without notable performance degradation.
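A minimal sketch of how such a sliced compressor with feature distillation could be set up; the attribute names (`llm.layers`, `compress`) and the use of a plain MSE loss on the summary tokens are our assumptions, not necessarily the exact recipe used here.

```python
import copy
import torch
import torch.nn.functional as F

def build_sliced_compressor(full_compressor, keep_ratio=0.25):
    # Keep only the first 25% of the LLM's decoder layers (hypothetical layout),
    # yielding a roughly 4x cheaper compressor.
    sliced = copy.deepcopy(full_compressor)
    n_keep = max(1, int(len(sliced.llm.layers) * keep_ratio))
    sliced.llm.layers = sliced.llm.layers[:n_keep]
    return sliced

def distillation_step(sliced, full, image, summary_prompt):
    # Match the sliced compressor's summary tokens to the full compressor's.
    with torch.no_grad():
        teacher_summary = full.compress(image, summary_prompt)   # (B, k, d)
    student_summary = sliced.compress(image, summary_prompt)     # (B, k, d)
    return F.mse_loss(student_summary, teacher_summary)
```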

Q6. The paper states that they aim for a 1:1 sampling ratio between LLaVA 665k and CC3M. Do the authors aim for an overall 1:1 ratio, or do they strive to maintain this ratio in each batch? Would a different sampling ratio improve performance?

A6. We aim for a 1:1 sampling ratio globally, not within a batch. The batches themselves are simply constructed to maximize throughput, grouping the samples by length. We did try a few different sampling ratios. The model is generally robust in the range 1:1 - 3:1 (665K:CC3M).
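As an illustration only, one simple way to realize a global (rather than per-batch) 1:1 mix is to draw each sample's source with equal probability before the length-based batching; the dataset names below are placeholders.

```python
import random

def draw_sample(llava_665k, cc3m, p_generative=0.5):
    # Global 1:1 ratio in expectation; individual batches may deviate, since
    # batches are later regrouped by sequence length for throughput.
    if random.random() < p_generative:
        return llava_665k[random.randrange(len(llava_665k))]   # generative task
    return cc3m[random.randrange(len(cc3m))]                   # contrastive task
```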

Q7. It would be interesting to know the total runtime of the LoRA fine-tuning process.

A7. A training run took 40 hours on 24 GPUs.

Q8. To my understanding, the first-stage LoRA adapters are optimized for both the contrastive and autoregressive objectives, while the second-stage adapters are optimized only for the autoregressive objective. Is this correct?

A8. That's correct, we will make this clearer in the updated paper.

Comment

Thank you for your hard work preparing the rebuttal! You have properly addressed all my concerns. I will raise my score to accept.

Comment

We thank the reviewer for checking our rebuttal and for the valuable suggestions made. We are happy to hear that our response has fully addressed your concerns.

Review (Rating: 5)

This paper presents a novel method, named C&C, to compress vision tokens into more compact features for a given VLM with minimal performance loss. The proposed method employs a “double-forward” calculation flow with two training objectives, enabling the VLM itself to extract compressed tokens that are simultaneously suitable for generation, discrimination, and storage. Extensive experiments across various scenarios demonstrate the effectiveness and efficiency of the proposed method.

Strengths and Weaknesses

Strengths

  1. The overall writing and organization of this paper are good.
  2. The proposed method is simple, effective, and straightforward.
  3. The experiments are sufficient to demonstrate the effectiveness of the proposed method.

Weaknesses

The illustrations can be further refined for clarification and better readability. For example, Figure 2 does not show that the proposed method does not require finetuning the entire LLM, and the stage-specific LoRA modules are also missing from this figure.

Questions

Regarding the storage perspective, the proposed method needs to store a summarized version of images with shape $\mathbb{R}^{k' \times d}$, where $d$ is the hidden dimension of the LLM. In contrast, the original LLaVA-1.5 only needs to store features with shape $\mathbb{R}^{k \times d_v}$, where $d_v < d$ is the dimension of the vision encoder output. It would be better to clarify this difference to illustrate the actual compression difference, rather than just comparing the number of tokens.

Limitations

Yes.

Justification for Final Rating

The authors' response has already resolved my concerns well. Since I had already expressed a clear intention to accept the paper, I will keep my score unchanged.

Formatting Concerns

No.

Author Response

We thank the reviewer for their feedback and for recognizing the novelty of our approach. We hope our responses below address their remaining concerns.

Q1. The illustrations can be further refined for clarification and better readability. For example, Figure 2 does not show that the proposed method does not require finetuning the entire LLM, and the stage-specific LoRA modules are also missing from this figure.

A1. Thank you for your suggestion. We have updated the figures and will include them in the updated manuscript.

Q2. Regarding the storage perspective, the proposed method needs to store a summarized version of images with shape $\mathbb{R}^{k' \times d}$, where $d$ is the hidden dimension of the LLM. In contrast, the original LLaVA-1.5 only needs to store features with shape $\mathbb{R}^{k \times d_v}$, where $d_v < d$ is the dimension of the vision encoder output. It would be better to clarify this difference to illustrate the actual compression difference, rather than just comparing the number of tokens.

A2. You are generally right: if we opt to store the vision features prior to the projection layer that maps them to the input dimension of the LLM, then generally $d_v < d$. The caveat is that we would have to rerun the projector in this case. For the LLaVA-1.5 7B model, $d = 4096$ and $d_v = 1024$, while for the smaller LLaVA-OV 0.5B model $d_v$ is a bit larger than $d$, with $d = 896$ and $d_v = 1152$.

We will make this distinction clearer in the updated paper.
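As a back-of-the-envelope illustration (our own arithmetic, assuming the 16-summary-token setting and counting stored values only): for LLaVA-1.5 7B, caching the pre-projector vision features costs $576 \times 1024 = 589{,}824$ values per image, whereas caching 16 summary tokens costs $16 \times 4096 = 65{,}536$ values, i.e. roughly a $9\times$ reduction in stored values even though $d > d_v$.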

Comment

Thank you to the authors for their response to my comments. Since I had already expressed a clear intention to accept the paper, I will keep my score unchanged.

Comment

Dear Reviewer h3Po,

Please note that submitting mandatory acknowledgement without posting a single sentence to authors in discussions is not permitted. Whether your concerns have been addressed or not, please do tell the authors.

Thanks,

AC

Review (Rating: 4)

This paper introduces C&C (Compress & Cache), a novel paradigm for compressing vision tokens in Large Vision-Language Models (LVLMs). C&C decouples compression from inference by performing a one-time, offline indexing step where the LVLM itself compresses numerous vision tokens into a few compact "summary tokens." These cached tokens are then used for efficient online inference. The method is trained with a "double-forward pass" strategy, jointly optimized by an autoregressive loss for generation and a contrastive loss for discrimination. C&C aims to create a unified, nearly lossless representation for both generative and discriminative tasks, and experimental results show it sets a new state-of-the-art on various benchmarks, particularly in high-compression settings and for Visual RAG.

Strengths and Weaknesses

Strengths

  1. The core contribution is the "Compress & Cache" paradigm, shifting compression from an on-the-fly task to an offline process. This is a highly novel and practical approach, perfectly suited for real-world applications like Retrieval-Augmented Generation (RAG) and on-device deployment, where pre-indexing is feasible. It elegantly bypasses the inherent limitations of online compressors.

  2. The "double-forward pass" strategy, which enables the LVLM to act as its own compressor, is a key technical highlight. This "self-compression" mechanism leverages the LLM's intrinsic capabilities for information integration, allowing it to create high-fidelity summary tokens without relying on external modules, which is crucial for its near-lossless performance. Robust and Comprehensive Experiments: The paper's claims are substantiated by rigorous and comprehensive experiments. The extensive ablation studies (Tables 7, 8, 9) systematically validate each design choice, including the dual-loss function, the double-forward pass, and stage-specific LoRA adapters, providing strong evidence for the method's effectiveness.

Weaknesses

  1. The method's core strength is also a limitation. It is not suitable for real-time applications (e.g., live video analysis) where pre-processing is impossible. While this is an intrinsic property of the paradigm, it constrains the method's scope of application.

  2. The paper claims C&C is a general method, yet all experiments are confined to the LLaVA family. The LVLM landscape includes diverse architectures (e.g., Qwen-VL, CogVLM) with different vision-language interfaces. The effectiveness of the "self-compression" mechanism is not guaranteed to transfer across these architectures. By limiting validation to a single model family, the paper's claim of generalizability remains insufficiently supported by empirical evidence.

Questions

n/a

Limitations

n/a

Justification for Final Rating

I have reviewed the authors’ response as well as their discussion with the other reviewers. My concerns were addressed and I would like to maintain my recommendation to accept the paper.

Formatting Concerns

n/a

Author Response

We're grateful for your time and valuable feedback, and for recognizing the novelty of our approach! We hope our responses below address your remaining concerns.

Q1. The method's core strength is also a limitation. It is not suitable for real-time applications (e.g., live video analysis) where pre-processing is impossible. While this is an intrinsic property of the paradigm, it constrains the method's scope of application.

A1. Our main focus is indeed on settings naturally suitable for offline indexing (e.g. RAG-based applications).

However, we note that our method is sufficiently versatile, and the cost can be in part alleviated by combining our method with techniques like distillation and LLM slicing. To showcase this, we train a model with the following configuration: a LLaVA-OV 0.5B generator with a compressor instantiated using the first 25% of the LLM's layers, as opposed to 100%. This results in a 4x faster and smaller compressor. To recover the performance drop, we distill the features produced by the full-sized compressor. The results for 16 tokens (i.e. 45x compression rate, 729/16) are presented in the table below:

| Compressor | GQA | MMB | MME | POPE | SQA | TextVQA | VisWiz | realworldQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| none | 58.3 | 52.9 | 1461 | 88.3 | 67.2 | 65.8 | 47.4 | 54.1 |
| 100% (full) | 57.8 | 52.9 | 1516 | 88.2 | 70.9 | 64.3 | 48.9 | 54.5 |
| 25% | 57.1 | 49.2 | 1405 | 84.3 | 70.3 | 61.8 | 44.2 | 53.0 |
| 25% + distill | 58.0 | 51.1 | 1477 | 86.4 | 73.7 | 64.6 | 48.5 | 53.9 |

Q2. The paper claims C&C is a general method, yet all experiments are confined to the LLaVA family. The LVLM landscape includes diverse architectures (e.g., Qwen-VL, CogVLM) with different vision-language interfaces. The effectiveness of the "self-compression" mechanism is not guaranteed to transfer across these architectures. By limiting validation to a single model family, the paper's claim of generalizability remains insufficiently supported by empirical evidence.

A2. Thank you for your suggestion. We would like to note that we have already validated the efficacy of our approach using two different architectures, LLaVA-1.5 and LLaVA-OneVision (LLaVA-OV). Although they partly share a name, they have different architectures and are trained on different data:

  - Vision encoder: CLIP ViT-L/14 at 336px (LLaVA-1.5) vs SigLIP ViT-SO400m/14 at 384px (LLaVA-OV);
  - LLM: Vicuna-1.5 (LLaVA-1.5) vs Qwen2 (LLaVA-OV);
  - Input: a single patch/image at a fixed 336px resolution (LLaVA-1.5) vs multi-patch (i.e. high resolution) with a 384-2304px operating range and multi-image support (LLaVA-OV);
  - Number of vision tokens: 576 per image (LLaVA-1.5) vs 729 per patch (LLaVA-OV).

In the main manuscript, the LLaVA-OV model was evaluated on the VisRAG suite. For consistency, in the table above (from answer A1), we report results on a shared suite of datasets. The same conclusions hold: our compressed models match and, in some cases, even outperform the original model.

We note that many of the existing alternative models (e.g., QwenVL) are trained on closed-source data, making the retraining of the compressor under a fair setting impractical.

Comment

"We note that many of the existing alternative models (e.g., QwenVL) are trained on closed-source data, making the retraining of the compressor under a fair setting impractical."

Does this mean that large-scale training data is needed for your method?

Comment

Thank you for your question and for checking our rebuttal.

It's not about the scale of the training data, but about having access to the same data that the original uncompressed model was trained on. All methods we compare with in Table 1 have the same requirement (i.e., access to the original training data).

Comment

My concerns are addressed.

Review (Rating: 5)

This paper proposes Compress & Cache (C&C) to compress vision tokens in Large Vision-Language Models (LVLMs). Unlike prior works that focus on on-the-fly token compression, C&C introduces an offline token compression and caching strategy, effectively decoupling the compression process from inference. The authors propose a double-forward pass training strategy, which uses the LVLM itself to generate compact visual summary tokens suitable for both generative and discriminative tasks. C&C employs both an autoregressive loss and a contrastive loss, and further improves performance using stage-specific LoRA adapters. The method achieves state-of-the-art results on a wide range of VQA, image captioning, and image-text retrieval benchmarks.

Strengths and Weaknesses

Strengths

  1. The novelty of the paper is good. The paper introduces a double-forward pass strategy that reuses the same LVLM for both compression and generation.

  2. The incorporation of both autoregressive loss (L_AR) and contrastive loss (L_disc) allows C&C to perform well in both domains.

  3. The experiments performed in the paper are comprehensive. The authors conducted a sufficient range of tasks with C&C and compared against many baseline methods. The results show the robustness of the method.

  4. The ablation and FLOPs results are very helpful.

Weaknesses

  1. The most significant weakness of the proposed method is the lack of runtime latency results on real hardware. The FLOPs analysis is good, but it remains a theoretical computation. Comprehensive testing on a GPU or other computing devices would be more convincing.

  2. For any compression work, the main purpose is to maintain performance while reducing computation. Even if the performance is good, it is still unknown whether the robustness is affected. Some works report the result variance, or test the method on perturbed or adversarial inputs to show how robust the compressed model is. My concern with this paper is how C&C performs in such scenarios.

Questions

Please refer to the weaknesses.

Limitations

Yes.

Justification for Final Rating

The rebuttal addresses my concerns. This paper demonstrates very solid work on vision token compression, which has potential for future research on VLMs. Therefore I suggest acceptance of this paper.

Formatting Concerns

No.

Author Response

We thank the reviewer for their feedback and for recognizing the novelty of our approach. We hope our responses below address their remaining concerns.

Q1. The most significant weakness of the proposed method is the lack of runtime latency results on real hardware. The FLOPs analysis is good, but it remains a theoretical computation. Comprehensive testing on a GPU or other computing devices would be more convincing.

A1. We reported FLOPs estimates because the timings themselves are subject to the specific implementation. Following your suggestion, we benchmark the LLaVA-1.5 7B model on an RTX 4090 GPU. Each result is averaged over 100 runs following a warm-up period.

Original LLaVA model cost: 0.0587 sec/image (out of which 0.00353 sec spent for the vision encoder)

Caching cost: 0.0584 sec/image

Online C&C cost (16 tokens): 0.00158 sec/image

Online C&C cost (4 tokens): 0.000406 sec/image

Notice that the practical measurements closely align with the expected approximate improvements: 576/16 = 36 theoretical vs. 0.0587/0.00158 = 37.1 measured, and 576/4 = 144 theoretical vs. 0.0587/0.000406 = 144.5 measured. The practical gains are slightly higher because, in our case, the vision encoder does not need to be rerun.
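For reference, a generic sketch of the measurement protocol (warm-up followed by averaging timed runs); the helper below uses standard `torch.cuda.Event` timing and is an illustration rather than the exact benchmarking script.

```python
import torch

def benchmark(fn, warmup=10, runs=100):
    """Average latency (seconds) of a CUDA callable after a warm-up period."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(runs):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / runs / 1000.0  # elapsed_time returns ms

# e.g. benchmark(lambda: model.generate_answer(cached_summary, query))  # names illustrative
```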

Q2. For any compression work, the main purpose is to maintain performance while reducing computation. Even if the performance is good, it is still unknown whether the robustness is affected. Some works report the result variance, or test the method on perturbed or adversarial inputs to show how robust the compressed model is. My concern with this paper is how C&C performs in such scenarios.

A2. Thank you for your suggestion. Following [1], we evaluate our approach under a varied set of perturbations, e.g. zoom blur, elastic transformation, pixelation, JPEG compression, shot noise, brightness jitter, contrast jitter, Gaussian noise, etc. For brevity, we report below the results in terms of relative performance drop on a subset of them. Notice that both approaches, with and without compression, have similar robustness.

| Noise type | MMB | MME | POPE | SQA | TextVQA | realworldQA |
| --- | --- | --- | --- | --- | --- | --- |
| Zoom Blur (baseline) | 20.45 | 0.0 | 0.0 | 7.31 | 0.0 | 17.15 |
| Zoom Blur (compressed) | 16.91 | 0.0 | 0.0 | 6.70 | 0.0 | 13.44 |
| Snow (baseline) | 11.04 | 0.0 | 0.0 | 2.51 | 0.0 | 7.73 |
| Snow (compressed) | 10.79 | 0.0 | 0.0 | 2.23 | 0.0 | 12.50 |
| Defocus Blur (baseline) | 12.50 | 0.0 | 0.0 | 3.17 | 0.0 | 7.49 |
| Defocus Blur (compressed) | 11.81 | 0.0 | 0.0 | 2.09 | 0.0 | 12.72 |
| Blank Image (baseline) | 73.38 | 42.21 | 43.42 | 9.52 | 90.14 | 22.95 |
| Blank Image (compressed) | 72.45 | 41.87 | 43.32 | 11.45 | 89.31 | 24.82 |
| Saturate (baseline) | 0.16 | 0.0 | 0.0 | 1.70 | 0.0 | 2.90 |
| Saturate (compressed) | 0.73 | 0.0 | 0.0 | 1.26 | 0.0 | 0.45 |
| Elastic Transform (baseline) | 5.52 | 0.0 | 0.0 | 2.36 | 0.0 | 3.62 |
| Elastic Transform (compressed) | 4.52 | 0.0 | 0.0 | 0.42 | 0.0 | 4.91 |
| Pixelate (baseline) | 8.44 | 1.99 | 9.00 | 0.89 | 68.08 | 13.77 |
| Pixelate (compressed) | 7.58 | 2.52 | 9.35 | 3.35 | 67.93 | 11.64 |
| Spatter (baseline) | 7.47 | 4.94 | 1.94 | 1.18 | 12.26 | 6.52 |
| Spatter (compressed) | 4.96 | 1.75 | 2.41 | 0.77 | 8.39 | 7.81 |
| Speckle Noise (baseline) | 10.88 | 3.01 | 3.29 | 2.36 | 15.24 | 8.21 |
| Speckle Noise (compressed) | 11.66 | 0.37 | 3.48 | 1.89 | 14.30 | 10.04 |
| JPEG Compression (baseline) | 2.60 | -0.89 | 2.24 | 0.0 | 5.68 | 4.83 |
| JPEG Compression (compressed) | 1.60 | 1.80 | 2.82 | -0.63 | 3.72 | 4.48 |
| Shot Noise (baseline) | 12.66 | 3.46 | 4.69 | 1.62 | 16.83 | 9.18 |
| Shot Noise (compressed) | 11.52 | 2.01 | 4.57 | 0.35 | 16.27 | 10.05 |
| Impulse Noise (baseline) | 12.01 | 4.64 | 4.35 | 1.99 | 16.30 | 8.21 |
| Impulse Noise (compressed) | 9.04 | 5.74 | 4.79 | 0.77 | 14.74 | 9.60 |
| Brightness (baseline) | 3.90 | 0.0 | 0.0 | 0.66 | 0.0 | 3.62 |
| Brightness (compressed) | 2.92 | 0.0 | 0.0 | -0.07 | 0.0 | -0.45 |
| Contrast (baseline) | 3.08 | 3.06 | 2.11 | 1.55 | 4.63 | 5.80 |
| Contrast (compressed) | 5.10 | 2.66 | 1.70 | 0.14 | 4.12 | 6.25 |
| Gaussian Noise (baseline) | 12.01 | 4.58 | 4.75 | 1.92 | 15.33 | 7.97 |
| Gaussian Noise (compressed) | 12.24 | 4.37 | 4.81 | 0.28 | 13.82 | 7.05 |
| Motion Blur (baseline) | 12.34 | 4.33 | 5.73 | 3.03 | 0.0 | 6.76 |
| Motion Blur (compressed) | 12.10 | 4.58 | 6.10 | 2.51 | 0.0 | 8.28 |

(negative values denote cases where the performance increases marginally post-transformation).
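For clarity, and assuming the standard definition, the relative performance drop above is

$$\text{relative drop} = \frac{\text{score}_{\text{clean}} - \text{score}_{\text{perturbed}}}{\text{score}_{\text{clean}}} \times 100,$$

which is why negative entries correspond to a perturbed score marginally above the clean one.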

For completeness, we also report for a subset of transformations how the performance degradation changes under varying augmentation strengths (levels 3 and 5). The same conclusions hold true.

| Noise type | MMB | MME | POPE | SQA | TextVQA | realworldQA |
| --- | --- | --- | --- | --- | --- | --- |
| Elastic Transform 5 (baseline) | 18.34 | 0.0 | 0.0 | 3.10 | 0.0 | 10.39 |
| Elastic Transform 5 (compressed) | 17.78 | 0.0 | 0.0 | 2.44 | 0.0 | 10.85 |
| Elastic Transform 3 (baseline) | 5.52 | 0.0 | 0.0 | 2.36 | 0.0 | 3.62 |
| Elastic Transform 3 (compressed) | 4.52 | 0.0 | 0.0 | 0.42 | 0.0 | 4.91 |
| Shot Noise 5 (baseline) | 27.44 | 9.30 | 9.82 | 5.17 | 37.90 | 14.73 |
| Shot Noise 5 (compressed) | 27.55 | 11.87 | 9.89 | 4.26 | 36.73 | 14.30 |
| Shot Noise 3 (baseline) | 12.66 | 3.46 | 4.69 | 1.62 | 16.83 | 9.18 |
| Shot Noise 3 (compressed) | 11.52 | 2.01 | 4.57 | 0.35 | 16.27 | 10.05 |
| Gaussian Noise 5 (baseline) | 27.27 | 11.11 | 11.13 | 3.62 | 37.97 | 11.35 |
| Gaussian Noise 5 (compressed) | 22.59 | 11.06 | 11.47 | 3.91 | 37.27 | 11.86 |
| Gaussian Noise (baseline) | 12.01 | 4.58 | 4.75 | 1.92 | 15.33 | 7.97 |
| Gaussian Noise (compressed) | 12.24 | 4.37 | 4.81 | 0.28 | 13.82 | 7.05 |

[1] Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models, Chen et al, NeurIPS 2023

Comment

I appreciate the effort the authors put into the rebuttal. My concerns are fully addressed. I suggest the authors add the hardware testing results in the final version of the paper. I will raise my score to accept.

Comment

We thank the reviewer for checking our rebuttal and for the valuable suggestions. We are happy to hear that our response has fully addressed their concerns, and we will certainly incorporate the suggested hardware testing results into the final manuscript.

Comment

Dear Reviewers,

Thanks for your hard work during the review process. We are now in the author-reviewer discussion period.

Please (1) carefully read all other reviews and the author responses; (2) if you still have concerns, start discussing with the authors as early as possible so that they have enough time to respond; (3) acknowledge and update your final rating. Your engagement in this period is crucial for the ACs to make the final recommendation.

Thanks,

AC

Final Decision

This paper presents a novel framework that compresses visual tokens in Large Vision-Language Models for both generative and discriminative tasks. Reviewers acknowledged the contribution and strong performance of the proposed method, while initially raising concerns regarding its efficiency and missing important experiments.

After the rebuttal, the authors addressed most of the concerns, and all reviewers agreed to accept this paper. The AC read all the reviews, the author rebuttals, and the paper, believes this is a strong paper, and recommends acceptance.