CAT: Content-Adaptive Image Tokenization
We develop an adaptive image tokenizer that compresses images into variable-sized latent features based on their content complexity.
Abstract
Reviews and Discussion
This paper introduces the Content-Adaptive Tokenizer (CAT), a novel approach for image tokenization that dynamically adjusts the number of tokens (with three predefined scales) based on image complexity derived from text descriptions. Traditional image tokenizers use fixed compression ratios, which are inefficient for simple images and insufficient for complex ones. CAT addresses this by combining a caption-driven complexity evaluator (using LLMs) with a nested Variational Autoencoder (VAE) architecture. Experiments on different datasets and tasks show the effectiveness of the proposed CAT. However, my major concern is that the performance gap comes from the help of the LLM, which is not used in most of the evaluated methods.
Strengths and Weaknesses
This paper presents the Content-Adaptive Tokenizer (CAT), which dynamically adjusts token counts for images based on text-derived complexity via an LLM evaluator and a nested VAE. Its strengths include an innovative adaptive framework that improves reconstruction quality for complex images while reducing token usage on natural images, strong empirical validation across several datasets and downstream tasks (classification, generation), and robust LLM-driven complexity assessment.
The main weaknesses involve limited compression flexibility (fixed 8x/16x/32x ratios), a reliance on text descriptions that complicates use of the VAE, added LLM inference overhead, and insufficient discussion of LLM biases or environmental impact in the broader implications.
Questions
- Directly use a VLM as a scorer to produce the compression ratio without using an LLM. The impact of using and not using an LLM should be discussed.
- In Table 3, the proposed adaptive CAT outperforms the other 16x methods. Are the parameter counts of adaptive CAT and the fixed 16x baseline model the same? It would be better to provide parameter counts and FLOPs in the table.
- When evaluating on the ChartQA dataset, rFID is not suitable, since it is mainly computed by convolutional neural networks, which are not designed for text. It is suggested to evaluate the reconstructed text images with OCR methods, which are designed to recognize the text in images.
- When evaluating CAT on the classification downstream task, the classification head takes the 8x latent features as input. However, the classification heads of the evaluated methods take the 16x latent features as input, which means the classification head in CAT has more parameters than the classification heads of the evaluated methods, resulting in an unfair comparison.
Limitations
Yes
Final Justification
Thanks for the authors' response. Most of my concerns have been addressed. I decide to keep my rating at 4 (Borderline accept).
Formatting Issues
None
Thank you for recognizing CAT’s contributions, including its “robust LLM-driven complexity assessment,” “innovative adaptive framework,” and strong performance across diverse tasks. To address your questions on evaluation, we add new results regarding ChartQA OCR accuracy. We also perform additional analysis on compute costs. Please find our detailed response below.
Question 1
Directly use a VLM as a scorer to produce the compression ratio without using an LLM. The impact of using and not using an LLM should be discussed.
This is an excellent idea that we actually explored in our work. As detailed in L174-L177, we evaluated both VLM + LLM pipelines (e.g., InstructBLIP for captioning followed by Qwen 2.5/Llama 3 for scoring) and VLM-only pipelines (e.g., LLaVA-1.5, which handles both captioning and complexity scoring in a single pass). The compute analysis on 512x512 images is shown below:
| Task | Model Type | # of Model Calls (per image) | Avg. Wallclock Time on H100 (per image) |
|---|---|---|---|
| Reconstruction (Image Data Only) | VLM | 1 | ~1.63s |
| Reconstruction (Image Data Only) | Captioner + LLM | 2 | ~1.84s |
| Generation (Text Data Only) | LLM | 1 | ~1 s |
In short, the complexity evaluation step involves a single forward pass through a VLM for reconstruction tasks and a single forward pass through an LLM for generation tasks. At training time, this adds <1% to the total compute budget across 1M steps. At inference time, the average wall-clock time per image is ~1s on an NVIDIA H100. When processed in batches, this cost is further amortized, making the overhead negligible relative to the downstream model cost.
We also note that, as mentioned in Section 6, users can specify token counts manually during generation, which eliminates the need for LLM calls. Overall, the computational benefit from reduced token usage significantly outweighs the minimal overhead from LLM-based scoring.
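For illustration, the sketch below shows only the final decision step of the pipeline, mapping a complexity score to one of the three compression ratios. The score range (0-10) and the thresholds here are hypothetical placeholders; the actual rubric is the one defined in Appendix B.

```python
# Minimal sketch of the final decision step only: mapping an LLM-assigned complexity
# score to one of CAT's three compression ratios. The score range (0-10) and the
# thresholds below are hypothetical placeholders; the actual rubric is in Appendix B.

def select_compression_ratio(complexity_score: float) -> int:
    """Map a complexity score to a compression ratio (8x, 16x, or 32x)."""
    if complexity_score >= 7.0:    # highly detailed content (text, faces, dense charts)
        return 8                   # least compression, most tokens
    elif complexity_score >= 4.0:  # moderately complex scenes
        return 16
    else:                          # simple natural images
        return 32                  # most compression, fewest tokens

print(select_compression_ratio(8.5))  # e.g., a chart-heavy image -> 8x
print(select_compression_ratio(2.0))  # e.g., a plain sky photo  -> 32x
```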
Question 2
In Table 3, the proposed adaptive CAT outperforms the other 16x methods. Are the number of parameters of adaptive CAT and the fixed 16x baseline model the same? The number of parameters and FLOPs are better to be provided in the table.
We provide the model size and FLOPs of CAT and baselines below:
| Aspect | CAT | Fixed 16x Baseline | Fixed 32x Baseline |
|---|---|---|---|
| Adaptive Encoding | ✓ | x | x |
| Average Latent Dim on ImageNet | 29x29 | 32x32 | 16x16 |
| Param Count | 187M | 169M | 187M |
| Average FLOPs (G) on ImageNet | 602 | 640 | 822 |
CAT has a slightly higher parameter count than the fixed 16x baseline due to an additional downsampling block used to support 32x compression. This module processes 16x latent features into a more compact 32x representation, which the fixed 16x model does not require. However, on standard datasets like ImageNet, CAT is more compute-efficient overall, as many images are assigned higher compression ratios. This results in a lower average latent dimension and reduced FLOPs per image compared to the fixed 16x baseline. These savings highlight CAT’s efficiency in both training and inference, driven by its ability to adapt compression based on image content.
Question 3
While evaluating on ChartQA dataset, rFID is not suitable since it is mainly computed by Convolution Neural Networks, which are not dedicated for texts. It is suggested to evaluate the reconstructed text images with some OCR methods
Thank you for this suggestion! We agree that rFID may not fully capture the semantic fidelity of text in datasets like ChartQA. To address this, we evaluated the reconstructed images using an OCR-based metric. Specifically, we used the image_to_data() function from the PyTesseract library to extract per-word confidence scores, and computed the average OCR confidence across all words in each image. The following results show that CAT achieves higher OCR confidence than the fixed baselines, further supporting its ability to preserve fine-grained textual content in visual representations. We will include the experiment details and score computation script in the updated paper.
| Model | Compression Ratio | OCR Confidence ↑ | rFID ↓ |
|---|---|---|---|
| Fixed 8x | 8x | 78.4 | 8.21 |
| Fixed 16x | 16x | 71.2 | 8.67 |
| Fixed 32x | 32x | 65.7 | 10.79 |
| CAT (Ours) | 8x / 16x / 32x , avg 8.12x | 84.6 | 5.27 |
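For reference, a minimal sketch of the OCR-confidence computation described above is given below. It assumes the PyTesseract library with a local Tesseract installation; file paths are placeholders, and the full evaluation script will be included with the updated paper.

```python
# Minimal sketch of the OCR-confidence metric described above (PyTesseract assumed,
# with a local Tesseract installation). Paths are placeholders.
import pytesseract
from pytesseract import Output
from PIL import Image

def avg_ocr_confidence(image_path: str) -> float:
    """Average per-word Tesseract confidence over all detected words in an image."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    # Tesseract reports a confidence of -1 for non-word boxes; keep real word detections only.
    confs = [float(c) for c, w in zip(data["conf"], data["text"])
             if str(c) != "-1" and w.strip()]
    return sum(confs) / len(confs) if confs else 0.0

# Dataset-level score: average over all reconstructed images, e.g.
# scores = [avg_ocr_confidence(p) for p in reconstructed_image_paths]
```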
Question 4
the classification heads in evaluated methods take the 16x latent features as input, which means the number of parameters of the classification head in CAT is more than the parameters of other classification heads in the evaluated methods, resulting in an unfair comparison.
Great observation! However, as shown in Appendix Table 10, the majority of images from the classification datasets are compressed at 16x or 32x, with fewer than 5% using the 8x compression ratio. This means that, in practice, most CAT latent features are shorter than the fixed 16x representations. Since we apply zero-padding to standardize input lengths for the classification head (as described in L718), many of the classification head parameters are effectively inactive due to being multiplied by zeros. In this sense, the experimental setup is actually slightly unfavorable to CAT, as a large portion of images are more aggressively compressed, resulting in shorter effective representations.
That said, we agree that a perfectly fair comparison is inherently challenging due to the adaptive nature of CAT. To address this concern, we will take the following steps in the final version:
- Include fixed 8x and fixed 32x baselines in Table 6 for reference
- Add the above explanation to the paper to clarify the trade-offs in representation length and classifier parameter usage.
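To make the padding scheme concrete, below is a minimal PyTorch sketch of zero-padding shorter latents to the largest spatial size before the classification head. The latent spatial sizes follow the ratios discussed in this response; the channel count is an illustrative placeholder, not our exact configuration.

```python
# Minimal PyTorch sketch of the zero-padding described above. For 512x512 inputs,
# CAT latents are 64x64 (8x), 32x32 (16x), or 16x16 (32x); smaller latents are
# zero-padded to the largest size so a single classification head sees a fixed-length
# input. Padded positions are zeros, so the corresponding head weights are inactive.
# The channel count (16) is an illustrative placeholder.
import torch
import torch.nn.functional as F

MAX_SIZE = 64  # spatial size of the 8x latent at 512x512 resolution

def pad_latent(z: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (B, C, h, w) latent to (B, C, MAX_SIZE, MAX_SIZE)."""
    h, w = z.shape[-2:]
    return F.pad(z, (0, MAX_SIZE - w, 0, MAX_SIZE - h))  # pad right and bottom

z16 = torch.randn(4, 16, 32, 32)  # batch of 16x-compressed latents
z32 = torch.randn(4, 16, 16, 16)  # batch of 32x-compressed latents
print(pad_latent(z16).shape, pad_latent(z32).shape)  # both (4, 16, 64, 64)
```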
Weakness 1: limited compression flexibility (fixed 8x/16x/32x ratios)
Thank you for pointing this out. We have acknowledged this limitation explicitly in Section 7. We chose fixed ratios for CAT mainly for deployment and empirical reasons.
Deployment-wise, fixed ratios enable efficient batching on GPUs. With just three discrete levels, CAT already spans a wide compression range (8x-32x), which we found sufficient to differentiate between high-fidelity and high-efficiency settings.
Empirically, our analysis in Section 3.1 shows that over 56% of images can be compressed at 16x or higher with negligible quality loss. Table 1 also demonstrates that CAT assigns 8x to nearly all ChartQA images (96%), while applying higher compression to simpler natural images. This reflects meaningful, data-driven differentiation, i.e., the selection of 8x, 16x, and 32x compression is sufficient to demonstrate the effectiveness of our text-driven compression concept.
That said, we fully agree enabling more flexible compression is a promising direction for future work. We are actively exploring ways to combine our LLM-based complexity scoring with transformer-based tokenizers that support flexible token counts. We are also studying patch-level adaptation, where different regions of an image receive different compression levels. While these extensions are exciting, we believe they build upon, rather than diminish, the core contribution of this paper.
Weakness 2: reliance on text descriptions that results in more complex usage of the VAE
We would like to clarify that the LLM-based complexity evaluation is fully decoupled from the VAE architecture. Once the compression ratio is selected, the VAE functions identically to standard tokenizers, using only the relevant subset of encoder and decoder blocks. This design does not add runtime complexity to the VAE itself, and in fact simplifies deployment by supporting multiple compression levels within a single model (see Fig. 1).
While the text-based complexity evaluation introduces an additional step in the overall compression pipeline, it provides substantial practical benefits, in particular saving token count and compute for diverse reconstruction and generation tasks. Moreover, in settings like text-to-image generation where only text descriptions are available at inference time, CAT is the only method that allows content-adaptive tokenization, whereas existing adaptive tokenizers all require image inputs and thus are not applicable to generative settings.
Weakness 3: added LLM inference overhead
Please see response to Question 1.
Weakness 4: insufficient discussion of LLM biases or environmental impacts in broader implications.
We appreciate the reviewer raising this point, and we will expand the broader impact section in the final version to include discussion on LLM biases and environmental impact.
This paper proposes an adaptive image tokenization strategy, named CAT, that varies the number of tokens to represent an image based on its complexity. For measuring the complexity, they leverage LLMs using captioning and VQA data extracted from the image, and assign a complexity score to the image. A nested VAE architecture is used to represent images with variable compression rates (three compression rates are used: 8x, 16x, 32x). Evaluations are performed on image reconstruction, generation, and classification tasks to show the effectiveness of the approach.
Strengths and Weaknesses
Developing adaptive visual tokenizers is an interesting and promising research direction. I also find the paper's focus on developing a better measure of image complexity intriguing. The method is well-motivated and explained, and the empirical results seem promising. Overall, I am leaning towards accepting this paper, but there are a few points that require further clarification:
- I found the first part of Sec. 3.1 a bit confusing. Exactly because of the reason in L118, is MSE a good metric to measure complexity? If not, what is the takeaway here?
- The LLM-based complexity measure proposed in this paper can be a potential solution; however, how do we ensure that it is well calibrated and better aligned with human preferences? A user study could be useful here. I found the correlation study in Table 2 not fully conclusive.
- I would be curious to see the generation performance on other popular datasets, e.g. COCO.
- Finally, I suggest including comparisons to relevant baselines. Both FlexTok [A] and Semanticist [B] are recent adaptive tokenization strategies that are missing from the comparisons here. While not adaptive, TiTok [C] is also a relevant and recent baseline that can compress images quite substantially. Having these comparisons would be helpful to better understand the effectiveness of the method.
I am happy to hear authors' opinions on these.
[A] FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
[B] “Principal Components” Enable A New Language of Images
[C] An Image is Worth 32 Tokens for Reconstruction and Generation
Questions
- Is MSE a good metric to measure complexity? If not, what is the main takeaway from Sec. 3.1?
- How do we ensure that the proposed complexity measurement is well calibrated and better aligned with human preferences?
- What is the generation performance on other benchmarks, e.g. COCO?
- How do you compare against the relevant baselines listed above?
- Will the code and models be released?
Limitations
Yes.
Final Justification
Thanks to the authors for their response addressing most of my concerns. I am happy to keep my rating as accept.
Formatting Issues
This paper is quite densely packed, which makes some figures hard to read, e.g. Figure 4.
Thank you for the positive review. We’re encouraged by your statement that “The method is well-motivated and explained, and the empirical results seem promising.” To address your questions around the generation dataset and related work, we add a new COCO generation comparison and expand the related work section. Below are detailed responses to your specific questions.
Weakness 1 & Question 1: Clarification on MSE
I found the first part of Sec. 3.1. a bit confusing. Exactly because of the reason in L118, is MSE a good metric to measure complexity? If not, what is the take away here?
Is MSE a good metric to measure complexity? If not, what is the main take away from Sec. 3.1?
Thanks for asking about our MSE analysis. To clarify: MSE is not a good metric to measure perceptual complexity, which is precisely our main takeaway from Section 3.1.
The proof-of-concept analysis in Section 3.1 serves two purposes. First, we want to demonstrate the potential for adaptive compression: we show that 56% of images can be compressed to 16x with negligible MSE increase, so significant computational savings are possible. Second, we want to highlight the limitations of existing metrics: we show that MSE, JPEG size, and LPIPS all have low Pearson correlations (r ≤ 0.36) with the acceptable compression ratio. Moreover, images with fine-grained elements (text, faces) may have low MSE but poor visual quality at high compression. Thus, existing metrics can fail to capture perceptual complexity. This motivates our LLM-based approach, which considers semantic and perceptual factors rather than pixel-level differences.
We will clarify this interpretation in the revision to make the takeaway more explicit.
Weakness 2 & Question 2: Calibrating Complexity Score
LLM-based complexity measure proposed in this paper can be a potential solution, however, how do we ensure that it's well-calibrated and better aligned with human preferences? A user study could be useful here.
How do we ensure that the proposed complexity measurement is well-calibrated and better aligned with human preferences?
Thank you for asking this. We use a carefully designed scoring rubric (Appendix B) along with in-context examples to ensure the robustness of our evaluation pipeline. In the paper, we have provided several pieces of evidence to show the quality of our complexity measure:
Quantitatively, we show that LLM-based scores achieve the highest Pearson r = 0.55 and 62% agreement with the “ideal” compression ratio. We also conducted robustness ablations across LLMs and captioning styles (Appendix Table 10), finding little variation. The strong downstream performance of CAT also shows the effectiveness of our LLM-based complexity evaluation system.
Qualitatively, manual inspection confirms perceptually challenging images receive high complexity scores. Examples in Figure 4 also align with intuitive expectations.
We agree that a user study would be a valuable next step to validate alignment with human perception, and we plan to pursue this in future work. However, conducting such a study requires IRB approval and annotator recruitment, which is not feasible within the rebuttal period. In the meantime, we hope that the above evidence addresses the reviewer’s concerns.
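As a side note, the Pearson correlation and agreement rate reported above are standard quantities; a minimal sketch of their computation is shown below (SciPy/NumPy assumed, with synthetic placeholder data rather than our actual evaluation data).

```python
# Minimal sketch of the calibration metrics mentioned above (SciPy/NumPy assumed).
# The arrays below are synthetic placeholders, not our actual evaluation data.
import numpy as np
from scipy.stats import pearsonr

ideal_ratio = np.array([8, 32, 16, 32, 8])   # "ideal" compression ratio per image
pred_ratio  = np.array([8, 32, 16, 16, 8])   # ratio selected from the LLM complexity score

r, _ = pearsonr(pred_ratio, ideal_ratio)               # Pearson correlation
agreement = float(np.mean(pred_ratio == ideal_ratio))  # fraction of exact matches
print(f"Pearson r = {r:.2f}, agreement = {agreement:.0%}")
```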
Weakness 3 & Question 3: COCO Results
I would be curious to see the generation performance on other popular datasets, e.g. COCO.
What is the generation performance in other benchmarks, e.g. COCO?
Thank you for this helpful suggestion. To address this, we directly evaluated the ImageNet-trained DiT + CAT on COCO image generation. We report FID-5K by generating 512x512 images in a zero-shot fashion on 5k prompts from the MS-COCO 2017 validation set. Other diffusion configurations are kept the same as the ImageNet experiment. Note that we do not retrain from scratch using the COCO training data because of the limited time for rebuttal. We will leave retraining for future work.
| Model | Avg. Tokens | FID ↓ |
|---|---|---|
| DiT + LDM VAE | 256 | 27.31 |
| DiT + Fixed 16x | 256 | 26.6 |
| DiT + CAT (Ours) | 212 | 23.57 |
We find that CAT continues to outperform fixed-token baselines, maintaining strong performance while using significantly fewer tokens on average. This result reaffirms our method’s efficacy. We will include full COCO generation results in the final version.
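For reproducibility, one possible way to compute this zero-shot FID-5K number is sketched below, assuming the generated samples and resized COCO validation images have been saved to disk. The clean-fid package and the paths are used purely as examples and are not necessarily our exact evaluation setup.

```python
# One possible way to compute the zero-shot FID-5K number above, assuming 5k generated
# images and 5k resized MS-COCO 2017 val images have been saved to disk. The clean-fid
# package is used here purely as an example tool; paths are placeholders.
from cleanfid import fid

fid_score = fid.compute_fid(
    "outputs/coco5k_dit_cat_512",       # placeholder: DiT + CAT samples from 5k COCO prompts
    "data/coco2017_val_reference_512",  # placeholder: reference COCO validation images
)
print(f"FID-5K: {fid_score:.2f}")
```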
Weakness 4 & Question 4: Related Work
Finally, I suggest including comparisons to relevant baselines. Both FlexTok [A] and Semanticist [B] are recent adaptive tokenization strategies that are missing in the comparisons here. While not adaptive, TiTok [C] is also a relevant and recent baseline that can compress images quite substantially. Having these comparisons would be helpful to better understand the effectiveness of the method.
How do you compare against the relevant baselines listed above?
We appreciate these pointers. Here's our perspective on the baselines you mentioned:
FlexTok and Semanticist: These methods focus on 1D sequence representation. Our work is complementary as it addresses content-adaptive compression in VAE-based 2D tokenizers. In terms of evaluation, these methods do not report results on the datasets used in our paper, such as ImageNet, COCO, or ChartQA. While we are happy to include direct comparisons in the final version, doing so fairly would require retraining their tokenizers at comparable model scale and on the same training datasets as CAT. This is a nontrivial effort that is unfortunately infeasible within the rebuttal period.
TiTok: While TiTok achieves impressive compression, it uses a fixed tokenization strategy. We actually include TiTok in our comparison (Table 2) where our adaptive approach (0.46 rFID) outperforms TiTok-B-128 (1.52 rFID) on ImageNet-512.
To sum up, CAT’s strong results are precisely due to the integration of LLM-based complexity evaluation, which none of the existing works study. We will add the above discussion and relevant works to the paper.
Question 5: Code Release
Will the code and models be released?
Yes. We already include our training code in the supplementary material and plan full public release later.
Thanks for your response addressing most of my concerns. I am happy to keep my rating. One note about FlexTok, Semanticist, and TiTok though:
In terms of evaluation, these methods do not report results on the datasets used in our paper, such as ImageNet, COCO, or ChartQA.
They both actually evaluate on ImageNet (see Table 1 in the corresponding papers), so I think a comparison would still be useful to have for reconstruction and generation performance.
We actually include TiTok in our comparison (Table 2) where our adaptive approach (0.46 rFID) outperforms TiTok-B-128 (1.52 rFID) on ImageNet-512.
Yes, I meant comparison for generative performance, but I can see it may not be clear from the writing. Overall, I think the current Table 7 is limited in terms of comparisons, so having those relevant baselines here would be nice.
Thank you for clarifying the questions and sorry for the initial misunderstanding regarding reconstruction vs. generation results. To address your comments, we will do the following:
- For the ImageNet Reconstruction results in Table 2, we will add a FlexTok baseline with rFID 1.45 and a Semanticist baseline with rFID 0.72. These represent the best-performing configurations reported in their respective papers.
- For the ImageNet Generation results in Table 7, we will add a separate table to provide context with other relevant baselines, as shown below. We will also add this to the discussion surrounding Table 7.
| Method | # tokens | gFID ↓ |
|---|---|---|
| MaskGIT VQ-GAN | 16x16 | 6.06 |
| Taming VQ-GAN | 16x16 | 5.20 |
| LlamaGen | 16x16 | 3.06 |
| TiTok-L | 32 | 2.77 |
| TiTok-B | 64 | 2.48 |
| TiTok-S | 128 | 1.97 |
| FlexTok d12-d12 | 1-256 | 3.83 |
| FlexTok d18-d18 | 1-256 | 2.02 |
| FlexTok d18-d28 | 1-256 | 1.86 |
| SEMANTICIST w/ DiT-XL | 256 | 2.57 |
However, we note that the methods in this new table are not directly comparable to our proposed method for two key reasons:
(1) Resolution mismatch: FlexTok and Semanticist generate 256x256 images, whereas we generate 512x512 images
(2) Architecture and training setup are different: e.g., CAT uses DiT-XL and trains for only 400K steps, whereas FlexTok is based on AR Transformers and trains with 200B tokens.
To ensure a fair comparison, it might be better to keep the rows within Table 7 directly comparable, which is why we will present these other results in a separate table and discussion. As part of our future work, we plan to re-evaluate CAT in different settings (e.g., 256x256 resolution, longer training) to provide a more comprehensive comparison against these methods.
Thank you again for your constructive feedback, which has helped us improve the quality of our work!
The paper proposes the Content-Adaptive Tokenizer (CAT), which dynamically adjusts token counts for images based on content complexity. It uses LLMs to evaluate complexity via captions and a nested VAE for variable-rate compression. CAT reduces token usage by 18% on natural images, improves rFID by 15% on complex datasets, boosts classification accuracy, and enhances generation throughput by 23%. It integrates language-guided evaluation with an adaptive architecture for efficient, high-quality image representation. The key components include: (1) an LLM-based system that analyzes image captions and perceptual queries (e.g., presence of text/faces) to assign optimal compression ratios (8x, 16x, 32x); (2) a nested VAE that routes intermediate features to shared latent blocks, enabling variable-rate compression in a single model. This design allows CAT to encode simpler images with fewer tokens while preserving details in complex ones.
Strengths and Weaknesses
Strengths
1. Combines LLMs with a nested VAE to dynamically adjust token counts based on content complexity inferred from text descriptions, addressing inefficiencies of fixed-token approaches. This allows simpler images to be encoded with fewer tokens while preserving details in complex ones.
2. The experiments are comprehensive, including comparisons of maximum acceptable compression ratios, performance comparisons across multiple datasets, ablation studies on model hyperparameters, and ablation studies on LLM models.
3. Reduces token usage by 18% on natural image datasets (e.g., ImageNet, COCO) while maintaining high reconstruction quality. Achieves an average 15% reduction in rFID on detail-rich datasets, outperforming fixed baselines in preserving fine details like text and faces.
4. Improves top-1 accuracy in image classification across diverse datasets and boosts training throughput by 23% in text-to-image generation. CAT’s adaptive representations enhance both efficiency and performance in reconstruction, classification, and generation.
Weaknesses
1. The core innovation of the work lies in using prompt-based image complexity prediction from LLMs. However, the design of the tokenizer still adopts a traditional architecture, showing limited novelty.
2. In the Table 7 experiments related to Image Generation, add a comparison with DiT + Fixed 8x.
Other questions can be found in the Questions section.
Questions
Although CAT reduces token consumption, the LLM-based complexity evaluation adds an extra inference step. How does this affect computational budget? Please explain in combination with the experimental data.
The significance of Adaptive Image Tokenization lies in enabling a flexible trade-off between computational cost and output quality through an adaptive token quantity. However, the article does not provide a comparison of computational overhead for CAT, including comparisons with fixed-compression-ratio methods and existing adaptive tokenization approaches in terms of inference time, GPU memory usage, etc.
What is the advantage of CAT over Fix 8x in terms of computational overhead, and what additional benefits does it offer? As can be seen from Table 3, Fix 8x significantly outperforms CAT in accuracy. Is the computational overhead advantage of CAT large enough to justify sacrificing Fix 8x's accuracy edge for adopting CAT?
Limitations
Yes.
Final Justification
The rebuttal has addressed most of my concerns. While the computational cost of the LLM seems high, I think the paper makes an important contribution, and have decided to keep my score.
Formatting Issues
No formatting issues.
Thank you for your thoughtful review and for recognizing the significance of content-aware tokenization. To address your questions regarding baselines, we add new results for a DiT + fixed 8x tokenizer on ImageNet generation. We also update the paper with a detailed computational cost analysis. Please find our response below.
Weakness 2: Adding Baseline
In the Table 7 experiments related to Image Generation, add a comparison with DiT+Fixed 8x.
Thank you for pointing this out! We previously ran DiT + fixed 8x (400K training steps) as a baseline. We did not include it in the main text because the comparison might be unfair---fixed 8x uses 1024 tokens per image, which is nearly 5 times the number of tokens used by CAT (~197 tokens per image). However, for reference purposes, we will still update Table 7 to include the following DiT + fixed 8x baseline:
| ImageNet Generation | Tokenizer Compression Ratio | Avg. Tokens | FID↓ | sFID↓ | IS↑ | Precision↑ | Recall↑ | Eval FLOPs↓ |
|---|---|---|---|---|---|---|---|---|
| DiT + Fixed 8x | 8x | 1024 | 4.24 | 8.64 | 204.88 | 0.75 | 0.57 | 4x |
| DiT + Fixed 16x | 16x | 256 | 5.89 | 11.81 | 187.47 | 0.72 | 0.49 | 1x |
| DiT + CAT (Ours) | 8x / 16x / 32x | 197.44 | 4.56 | 10.55 | 191.09 | 0.75 | 0.49 | 0.82x |
Despite using significantly fewer tokens, CAT achieves similar FID (4.56 vs. 4.24) to DiT+fixed 8x. Thus, CAT offers a better efficiency-performance trade-off than both fixed 16x and fixed 8x setups.
Question 1: LLM Evaluation Overhead
Although CAT reduces token consumption, the LLM-based complexity evaluation adds an extra inference step. How does this affect computational budget? Please explain in combination with the experimental data.
As noted on L174-L177, we tested two pipelines for complexity evaluation: (1) a unified pipeline using LLaVA1.5 7B for both captioning and scoring in a single pass, and (2) a separated pipeline using InstructBLIP for captioning and Qwen2.5 7B/LLaMA3.1 8B for scoring. Both pipelines achieve similar complexity score distribution due to our carefully designed scoring rubric. Please find the compute analysis on 512x512 images below:
| Task | Model Type | # of Model Calls (per image) | Avg. Wallclock Time on H100 (per image) |
|---|---|---|---|
| Reconstruction (Image Data Only) | VLM | 1 | ~1.63 s |
| Reconstruction (Image Data Only) | Captioner + LLM | 2 | ~1.84s |
| Generation (Text Data Only) | LLM | 1 | ~1 s |
In short, the complexity evaluation step involves a single forward pass through a VLM for reconstruction tasks and a single forward pass through an LLM for generation tasks. At training time, this adds <1% to the total compute budget across 1M steps. At inference time, the average wall-clock time per image is ~1s on an NVIDIA H100. When processed in batches, this cost is further amortized, making the overhead negligible relative to the downstream model cost.
We also note that, as mentioned in Section 6, users can specify token counts manually during generation, which entirely eliminates the need for LLM calls. Reviewer mnGd noted this flexibility: CAT "allows users to specify the desired token count at inference time, enabling a flexible trade-off between computational cost and output quality." Overall, the computational benefit from reduced token usage significantly outweighs the minimal overhead from LLM-based scoring.
Question 2: Compute Comparison
The significance of Adaptive Image Tokenization lies in enabling a flexible trade-off between computational cost and output quality through an adaptive token quantity. However, the article does not provide a comparison of computational overhead for CAT, including comparisons with fixed-compression-ratio methods and existing adaptive tokenization approaches in terms of inference time, GPU memory usage, etc.
We appreciate the reviewer’s suggestion. Below is a detailed computational overhead comparison on 512x512 images.
Encoding Task
| Aspect | CAT | Fixed 16x Baseline | Fixed 8x Baseline | Fixed 32x Baseline |
|---|---|---|---|---|
| Adaptive Encoding | ✓ | x | x | x |
| Param Count | 187M | 169M | 147M | 187M |
| Latent Dim | Adaptive (avg. 29 on ImageNet) | 32 | 64 | 16 |
| Inference Time | LLM eval (~50ms) + VAE encoding (adaptive) | VAE encoding only | VAE encoding only | VAE encoding only |
We note that existing adaptive tokenization methods, such as ElasticTok (210M parameters) and FlexTok (286.7M parameters), support multiple compression ratios, but do not include a mechanism for deciding which ratio to apply to a given image. In contrast, CAT provides an explicit LLM-based complexity evaluation system that determines the appropriate compression level per image. As a result, it is difficult to measure or compare their effective compression behavior on a dataset without retraining or manually selecting ratios.
Generation Task
| Aspect | DiT + CAT | DiT + Fixed 16x Baseline | DiT + Fixed 8x Baseline | DiT + Fixed 32x Baseline |
|---|---|---|---|---|
| Token Count | Adaptive (avg. 197 tokens on ImageNet) | 256 tokens | 1024 tokens | 64 tokens |
| Memory Usage | Lower (due to fewer tokens) | Medium | Higher | Lowest |
| Generation FLOPs | ~0.82x | 1x | ~4x | ~0.25x |
We see that CAT generally reduces the token count and thus improves efficiency over fixed-ratio baselines. Moreover, unlike many existing adaptive tokenizers that require an image input for token count estimation, CAT’s text-only evaluation allows it to adapt compression even in generative settings without access to ground-truth images. As CAT also requires only a single tokenizer model, we believe our work offers distinct computational advantages over existing approaches to adaptive tokenization.
Question 3: Trade-Off between Quality and Compute
What is the advantage of CAT over Fix 8x in terms of computational overhead, and what additional benefits does it offer? As can be seen from Table 3, Fix 8x significantly outperforms CAT in accuracy. Is the computational overhead advantage of CAT large enough to justify sacrificing Fix 8x's accuracy edge for adopting CAT?
We first note that on complex datasets such as ChartQA, our rFID (5.27) is actually better than the fixed 8x baseline (8.21). For natural image datasets like COCO and ImageNet, we clarify the comparison below:
| Metric | Fixed 8x | CAT (Adaptive) | Computational advantage of CAT |
|---|---|---|---|
| COCO rFID ↓ | 0.48 | 0.65 | - |
| ImageNet rFID ↓ | 0.24 | 0.46 | - |
| Avg. Compression (COCO) | 8x (Fixed) | 16.07x (Adaptive) | ~2x |
| Avg. Compression (ImageNet) | 8x (Fixed) | 17.46x (Adaptive) | ~2.81x |
| Latent Dim (COCO) | 64x64 | 31x31 | ~4.26x |
| Latent Dim (ImageNet) | 64x64 | 29x29 | ~4.87x |
While fixed 8x achieves better reconstruction metrics on natural image datasets, CAT is 4-5x more compute efficient. This is because CAT preserves quality where it matters, such as in images with fine details like facial features or text. Thus, while fixed 8x offers slightly stronger raw quality, CAT delivers selective quality at significantly lower computational cost, making it a more efficient and scalable solution; the key advantage of CAT lies in its adaptivity.
Weakness 1: Tokenizer Architecture Novelty
The core innovation of the work lies in using prompt-based image complexity prediction from LLMs. However, the design of the tokenizer still adopts a traditional architecture, showing insufficient innovativeness.
Thank you for noting our contribution in proposing LLM-based complexity evaluation. However, we believe that the nested VAE architecture used by CAT (Fig. 1) differs from existing VAE architectures in important ways:
- Skip connections with channel matching: Our design routes intermediate encoder features to shared latent blocks, enabling multiple compression ratios in a single forward pass
- Shared Gaussian parametrization: The shared middle block maintains scale consistency across different compression levels while processing variable spatial dimensions
- Parameter sharing: We strategically increase parameters in shared modules to handle multi-scale features effectively
Compared with traditional tokenizer architectures that support only fixed compression and need separate models for different ratios, our nested VAE is a single model that supports multiple compression levels, enabling adaptive compression without the storage and computational overhead of multiple models. This shows our architectural novelty.
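To make these points concrete, below is a simplified, illustrative PyTorch sketch of the nested encoding path: features from the 8x path are reused and routed through optional extra downsampling stages, and a single shared block produces the Gaussian parameters for all three ratios. Channel sizes and block definitions are placeholders, not our exact architecture.

```python
# Simplified, illustrative PyTorch sketch of the nested encoding path. Channel sizes
# and block definitions are placeholders, not our exact architecture.
import torch
import torch.nn as nn

def down_block(c_in, c_out):
    # one 2x spatial downsampling stage
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.SiLU())

class NestedEncoder(nn.Module):
    def __init__(self, latent_ch=16):
        super().__init__()
        # base path: 512x512 input -> 64x64 features (8x compression)
        self.stem = nn.Sequential(down_block(3, 64), down_block(64, 128), down_block(128, 256))
        self.down16 = down_block(256, 256)  # extra stage: 8x -> 16x
        self.down32 = down_block(256, 256)  # extra stage: 16x -> 32x
        # channel-matching projections route features from any depth into the shared block
        self.proj = nn.ModuleDict({r: nn.Conv2d(256, 256, 1) for r in ("8", "16", "32")})
        # shared middle block: one Gaussian parametrization for all compression levels
        self.shared = nn.Conv2d(256, 2 * latent_ch, 3, padding=1)

    def forward(self, x, ratio: int):
        h8 = self.stem(x)                         # 8x features
        if ratio == 8:
            h = self.proj["8"](h8)
        elif ratio == 16:
            h = self.proj["16"](self.down16(h8))  # reuse the 8x path, add one stage
        else:
            h = self.proj["32"](self.down32(self.down16(h8)))
        mean, logvar = self.shared(h).chunk(2, dim=1)
        return mean, logvar

enc = NestedEncoder()
for r in (8, 16, 32):
    mean, _ = enc(torch.randn(1, 3, 512, 512), ratio=r)
    print(r, mean.shape)  # spatial size 64, 32, or 16
```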
The rebuttal has addressed most of my concerns. However, in the rebuttal to Question 1, the authors claim that "the computational benefit from reduced token usage significantly outweighs the minimal overhead from LLM-based scoring" without a quantitative comparison. 1s/image on H100 seems too expensive to me, considering that complexity evaluation is relatively easy. In future work, a possible improvement is to use a small dedicated model to evaluate complexity, which may be distilled from LLM. Overall, I think the paper makes an important contribution and will keep my score.
The paper proposes to use an LLM to predict how many tokens should be used to compress an image using a VAE architecture. The reviewers appreciate the comprehensive empirical validation across several datasets and different downstream tasks. Reviewer mS49 considers the work interesting and promising, and the focus on developing a better measure of image complexity intriguing. The reviewers, however, had several concerns like the computational overhead of the LLM, the limited compression flexibility, as well as the calibration of the LLM-based complexity measure. The rebuttal addressed these points, but it also showed that the computational overhead is very high. After the rebuttal, the reviewers unanimously recommend the acceptance of the paper.
The AC agrees with the recommendation of the reviewers. While the current approach has many limitations like the very high computational overhead for inference and limited compression flexibility, the paper proposes an interesting idea and future works might address these limitations.