PaperHub
8.2 / 10 · Spotlight · 4 reviewers
Ratings: 4, 5, 5, 6 (min 4, max 6, std 0.7)
Confidence: 3.8 · Novelty: 2.8 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

When Worse is Better: Navigating the Compression Generation Trade-off In Visual Tokenization

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We study how different changes to the auto-encoder rate-distortion trade-off affect discrete auto-regressive generation performance through scaling laws.

Abstract

Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off: do we compress more aggressively to make the latent distribution easier for the stage 2 model to learn, even if it makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization improves stage 2 generation performance by making the tokens easier to model, without affecting the stage 1 compression rate and while only marginally affecting distortion: we are able to improve compute efficiency 2-3$\times$ over baseline. Finally, we combine CRT with further optimizations to the visual tokenizer setup to obtain a generative pipeline that matches LlamaGen-3B generation performance (2.18 FID) with half the tokens per image (256 vs. 576) and a fourth the total model parameters (775M vs. 3.1B) while using the same architecture and inference procedure.
Keywords

Representation Learning · Scaling Laws · Image Generation

Reviews and Discussion

Review (Rating: 4)

In this paper, the authors systematically investigate the trade-off between compression and generation, including neural scaling, sequence length scaling, and codebook size scaling. Based on these analyses, the authors show that the ideal amount of image compression varies with generation model capacity. In addition, they introduce Causally Regularized Tokenization (CRT) to train tokenizers with a causal inductive bias. CRT improves generation performance across ImageNet and LSUN.

Strengths and Weaknesses

Strengths

  • This paper provides a systematic investigation of the trade-off between compression and generation. The analysis of sequence length scaling and codebook size scaling is somewhat new.
  • This paper imbues the tokenizer with the causal inductive bias of the generation model via Causally Regularized Tokenization (CRT). This simple causal regularization boosts generation performance across various datasets.

Weaknesses

  • The link between the systematic investigation and CRT is not very strong, which makes their combination in one paper feel somewhat disjointed.
  • Although this paper provides a systematic investigation of the trade-off between compression and generation, these investigations are not entirely new or insightful. Neural scaling is widely studied in previous works, such as VAR. Sequence length scaling and codebook size scaling are implicitly studied in existing works, such as LlamaGen.
  • Limited experiments on CRT. In this paper, the effectiveness of CRT is only validated with a codebook of size 16k and 256 tokens per image. CRT is designed for image tokenizers, so it is essential to provide more experiments showing its effectiveness across various codebook sizes and token counts per image.
  • Missing discussion of Figure 9 (center right). In the main text, there is no discussion of skewness vs. vocabulary size.
  • Typos. 1) line 156: higher -> lower; 2) line 199: (1.7 vs 2.21 vs 3.0 rFID) -> (3.0 vs 2.21 vs 1.7 rFID).

Questions

See Weakness.

Limitations

yes

Final Justification

My concerns have been resolved; thus, I raise my score to Borderline accept.

Formatting Concerns

None

Author Response

Addressing Weaknesses

The link between the systematic investigation and CRT is not very strong, which makes their combination in one paper feel somewhat disjointed.

The first half of the paper shows that increasing bits per pixel above a 16k codebook size and 256 tokens per image generally makes scaling worse. Further, while reducing to a 1k codebook size improves the scaling law marginally, explicitly reducing the reconstruction capacity to such a degree greatly harms generation performance close to saturation. This motivates us to look for a way to navigate the rate-distortion trade-off that does not directly shrink the bottleneck, of which CRT is one. The analysis in Section 4 expands on this concept. We will make this explicit in the text, which should improve the overall flow of the paper.
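For concreteness, a back-of-the-envelope version of the rate calculation implied here, assuming 256x256 inputs and treating each token as a full codebook index (notation ours, not the paper's):

$$
\text{bits per pixel} = \frac{N_\text{tokens}\cdot\log_2|\mathcal{V}|}{H\cdot W} = \frac{256\cdot\log_2(16384)}{256\cdot 256} = \frac{256\cdot 14}{65536}\approx 0.055,
$$

so moving from a 16k to a 131k ($2^{17}$) codebook at a fixed 256 tokens raises the rate to $256\cdot 17/65536\approx 0.066$ bits per pixel, while going from 256 to 576 tokens at 16k codes raises it to $576\cdot 14/65536\approx 0.123$.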

Although this paper provides systematic investigation about the trade-off between compression and generation, these investigations are not absolutely new and insightful (e.g. VAR or LlamaGen).

We would like to clarify that the existence of scaling laws in this scenario is not surprising: scaling laws have been widely studied and validated since Kaplan et al. 2020. However, the connection between scaling laws and the rate-distortion trade-off (which we measure as bits per pixel) has not been studied in prior art. The reviewer points to VAR and LlamaGen; we address our novelty relative to these individual studies below:

VAR. VAR includes scaling laws with regard to a proposed generation procedure (next-scale generation) with a specific, fixed stage 1 tokenizer. We study the effect changing the tokenizer's compression rate has on stage 2 scaling, a dimension not explored in VAR.

LlamaGen. LlamaGen includes some results with increasing token count and stage 2 model capacity; however, they do not attempt to establish scaling laws and thus miss much of the interplay between compute scale and model capacity. They also do not study the effect of codebook size on scaling. Further, their codebook scaling does not achieve 100% codebook utilization or a uniform codebook distribution, resulting in subpar scaling beyond a 16k codebook size. For scaling laws to be valid, components of the pipeline have to be tuned and optimized.

We emphasize that the setup of studying stage 1 rate-distortion in connection to stage 2 compute scaling laws is, to our knowledge, unique in the literature. Having the factors related to stage 1 bottleneck capacity studied together in a unified, systematic setting is crucial for understanding the interplay of rate-distortion and stage 2 modeling.

Limited experiments about CRT. In this paper, the effectiveness of CRT is only validated on a codebook of size 16k and 256 tokens per image. CRT is designed for image tokenizers. It is essential to provide more experiments to show effectiveness with various codebook size and tokens per image.

We thank the reviewer for this important point. Given the depth of our scaling law study, we decided to focus CRT on the setting which demonstrated the best trade-off between reconstruction and generation in the first part of our paper (16k codebook size with 256 tokens per image). We show additional results at multiple scales with more tokens per image below (400 and 576). We observe that across tokens per image configurations and stage-2 model size, CRT outperforms the baseline.

400 tokens per image

| Parameter Count | Iterations | Method | FID |
|---|---|---|---|
| 111M | $7.5 \times 10^5$ | Baseline | 5.3745 |
| 111M | $7.5 \times 10^5$ | CRT | 4.7140 |
| 111M | $2.25 \times 10^6$ | Baseline | 5.4305 |
| 111M | $2.25 \times 10^6$ | CRT | 4.3158 |
| 211M | $7.5 \times 10^5$ | Baseline | 3.7601 |
| 211M | $7.5 \times 10^5$ | CRT | 3.3526 |
| 211M | $2.25 \times 10^6$ | Baseline | 3.4438 |
| 211M | $2.25 \times 10^6$ | CRT | 3.0547 |
| 775M | $2.25 \times 10^6$ | Baseline | 2.4247 |
| 775M | $2.25 \times 10^6$ | CRT | 2.2492 |

576 tokens per image


| Parameter Count | Iterations | Method | FID |
|---|---|---|---|
| 111M | $7.5 \times 10^5$ | Baseline | 6.2174 |
| 111M | $7.5 \times 10^5$ | CRT | 4.7268 |
| 111M | $1.5 \times 10^6$ | Baseline | 4.7259 |
| 111M | $1.5 \times 10^6$ | CRT | 4.5559 |
| 211M | $7.5 \times 10^5$ | Baseline | 3.6164 |
| 211M | $7.5 \times 10^5$ | CRT | 3.3401 |
| 211M | $1.5 \times 10^6$ | Baseline | 3.6185 |
| 211M | $1.5 \times 10^6$ | CRT | 3.1098 |
| 775M | $7.5 \times 10^5$ | Baseline | 2.4453 |
| 775M | $7.5 \times 10^5$ | CRT | 2.1984 |
| 775M | $1.5 \times 10^6$ | Baseline | 2.2381 |
| 775M | $1.5 \times 10^6$ | CRT | 2.0993 |

We also ran $CRT_{opt}$ in settings with greater codebook sizes. We show those results below.

| Parameter Count | Codebook Size | Method | FID |
|---|---|---|---|
| 111M | 16384 | Baseline | 4.8837 |
| 111M | 16384 | $CRT_{opt}$ | 4.2323 |
| 111M | 131072 | Baseline | 5.0036 |
| 111M | 131072 | $CRT_{opt}$ | 4.4945 |
| 550M | 16384 | Baseline | 2.8497 |
| 550M | 16384 | $CRT_{opt}$ | 2.5793 |
| 550M | 131072 | Baseline | 2.8454 |
| 550M | 131072 | $CRT_{opt}$ | 2.4488 |
| 775M | 16384 | Baseline | 2.5452 |
| 775M | 16384 | $CRT_{opt}$ | 2.2079 |
| 775M | 131072 | Baseline | 2.7080 |
| 775M | 131072 | $CRT_{opt}$ | 2.1833 |

It is computationally infeasible for us to generate full scaling law studies for each specific setting, but we hope the results above are evidence that our conclusion is not specific to a particular tokenizer configuration.

Missing discussion about Figure 9 (center right). In the main text, there is no discussion about skewness vs. vocabulary size.

We thank the reviewer for pointing this out. Skewness is another view of the concentration-of-codes property demonstrated by Figure 9 (center left). We compute skewness as $1 - \frac{2^{\text{entropy per position}}}{2^{\text{total entropy}}}$, demonstrating that as the codebook size increases, so does codebook specialization per position. We will include this discussion in the final revision.
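For reference, a minimal sketch of how such a skewness statistic could be computed from sampled token grids; the precise definitions of "entropy per position" and "total entropy" (per-position vs. pooled empirical code entropies) are our assumptions, and all function names are hypothetical:

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (in bits) of an empirical code histogram."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def positional_skewness(tokens, vocab_size):
    """Skewness = 1 - 2^{H_position} / 2^{H_total} for an (N, L) array of codes.

    H_position is the average entropy of the code distribution at each of the
    L positions; H_total is the entropy of the pooled, position-agnostic code
    distribution.  The statistic grows as codes specialize to positions.
    """
    n, length = tokens.shape
    h_pos = np.mean([
        entropy_bits(np.bincount(tokens[:, i], minlength=vocab_size))
        for i in range(length)
    ])
    h_total = entropy_bits(np.bincount(tokens.reshape(-1), minlength=vocab_size))
    return 1.0 - 2.0 ** h_pos / 2.0 ** h_total

# Toy check: i.i.d. uniform codes have essentially no positional specialization,
# so the skewness should be close to 0 (up to sampling noise).
rng = np.random.default_rng(0)
print(positional_skewness(rng.integers(0, 64, size=(4096, 16)), vocab_size=64))
```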

If we have adequately responded to the concerns with our experiments, we request that the reviewer raise their score.

Comment

I appreciate the authors' thorough response and the effort they put into the rebuttal. My concerns have been resolved; thus, I raise my score to Borderline accept.

Comment

We thank the reviewer for pointing out key experiments which will strengthen the presentation of our work. We appreciate that the rebuttal satisfied the reviewer's concerns.

Review (Rating: 5)

This paper presents a thorough scaling exploration of AR image generation models based on discrete tokens. The authors highlight the trade-off between the reconstruction performance of autoencoders and the generation performance of AR models. They further demonstrate that increasing computational scale can help overcome this trade-off and push the performance boundaries of the model. Additionally, the paper introduces Causally Regularized Tokenization, which incorporates a next-token prediction loss to instill an inductive bias into the tokenizer. This approach leads to improved generation performance.

Strengths and Weaknesses

Strengths

  • This paper provides extensive experiments demonstrating that: (1) a trade-off exists between reconstruction (or compression) performance and generation quality; (2) scaling compute helps push the model's limits by analyzing this trade-off through the lens of scaling laws; and (3) incorporating an autoregressive inductive bias improves generation performance.
  • These experimental results offer valuable insights for designing AR-based generative models.
  • The proposed CRT and its optimized version are intuitively sound and empirically effective.
  • The paper is well-written, logically structured, and supported by clear figures.

Weaknesses

  • The authors could provide a qualitative comparison of the reconstructed results between the baseline VQGAN, CRT, and CRT with CE loss, as the rFID differences are quite subtle and may not translate into noticeable visual differences.
  • It would be better if the authors could conduct some T2I generation experiments, which would make the work more generalizable.

Questions

Please see the weaknesses.

Limitations

Yes.

Final Justification

The authors address my concerns by:

  • Committing to include qualitative visual comparisons in the final revision, explaining that larger rFID gaps cause more visible degradation, especially in small high-frequency regions.
  • Conducting additional large-scale T2I experiments, demonstrating gains in gFID and CLIPScore.

Therefore, I keep my initial rating.

Formatting Concerns

No formatting concerns.

Author Response

We thank the reviewer for their insightful review. We are glad that the reviewer appreciates our extensive experiments and writing. Below we respond to constructive feedback.

The authors could provide a qualitative comparison of the reconstructed results between the baseline VQGAN, CRT, and CRT with CE loss, as the rFID differences are quite subtle and may not translate into noticeable visual differences.

We thank the reviewer for this suggestion. Given the limitations around sharing visual media during the rebuttal period, we commit to showing these comparisons in the final revision. Qualitatively, the difference between the reconstruction performance of VQGAN and VQGAN + CRT is hard to discern visually, and this is reflected in their rFID (2.21 vs 2.36 with CRT). We see this as an advantage of our method, as it mitigates damage to visual reconstruction while improving scaling. For larger gaps, e.g. 2.2 to 2.6 (VQGAN + CE CRT), we see a greater degradation that becomes more noticeable. This is prevalent in small high-frequency patches (e.g. grass or tree branches) and in the bottom right corner of images (as the AR bias is applied in a raster scan).

It would be better if the authors could conduct some T2I generation experiments, which would make the work more generalizable.

We thank the reviewer for pointing this out. To alleviate this concern and supplement our results, we ran a text-to-image experiment. For this experiment, we took the BLIP-3o long-caption [1] pre-training dataset (27M image-text pairs vs 1.2M ImageNet images), tokenized the text with OpenAI's CLIP L/14 text tower [2], and trained an auto-regressive image generation model conditioned on these text embeddings. This dataset is diverse, being a combination of web-crawled images and high-quality data from JourneyDB. We trained our XL-sized model (775M) for 600k iterations at batch size 256. The results from this experiment are in the table below:

| Tokenizer | gFID | CLIPScore |
|---|---|---|
| Baseline | 4.61 | 0.33 |
| CRT (ours) | 4.15 | 0.36 |

We use CLIPScore [3] (higher is better) with OpenAI CLIP B/32 to measure image-text alignment. We see a solid improvement with our method, even though we are re-using the image tokenizer from our ImageNet experiments and this dataset is thus OOD. We will include full details of this experiment in the final revision.
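For reference, a minimal sketch of how a CLIPScore-style alignment number could be computed with OpenAI CLIP B/32 via the Hugging Face `transformers` API; the batching, prompt handling, and the choice to report the raw (unrescaled) cosine similarity are assumptions on our part, not the paper's exact evaluation code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_alignment(images: list, captions: list) -> float:
    """Mean cosine similarity between paired image and caption embeddings.

    CLIPScore as defined by Hessel et al. additionally clips negatives at 0
    and rescales by 2.5; we return the clipped, unrescaled mean here.
    """
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1)
    return cos.clamp(min=0).mean().item()

# usage (hypothetical files): clip_alignment([Image.open("gen_0.png")], ["a red bicycle"])
```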

[1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset, Chen et al 2025

[2] Learning Transferable Visual Models From Natural Language Supervision, Radford et al 2021

[3] CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Hessel et al. 2021

Comment

My concerns have been addressed, and I will keep my current rating.

Comment

We thank the reviewer for their consideration of our work and are glad the rebuttal addressed their concerns.

Review (Rating: 5)

This paper studies the fundamental trade-off between image compression (stage 1) and generative modeling (stage 2) in two-stage image generation pipelines. Specifically, it investigates how the design of the tokenizer—namely the number of tokens and the codebook size—interacts with compute scaling laws. Through comprehensive empirical analysis, the authors find that under limited compute, high compression (i.e., fewer tokens and smaller codebooks) is beneficial: although it reduces reconstruction quality (rFID), it improves generative performance (gFID). To better navigate this trade-off, the authors introduce Causally Regularized Tokenization (CRT), a simple yet effective method that incorporates the autoregressive inductive bias of stage-2 decoders into the stage-1 tokenizer. CRT achieves significant efficiency gains (2–3×) and matches the performance of LlamaGen-3B with far fewer tokens and reduced model complexity.

Strengths and Weaknesses

Strengths:

  1. The paper provides lots of evidence of how stage-1 tokenization choices (number of tokens and codebook size) affect stage-2 modeling performance, and how these trade-offs evolve under different compute budgets.

  2. The proposed CRT mechanism is conceptually straightforward but effective, leveraging the AR modeling bias to regularize tokenizer outputs for better downstream performance.

  3. The authors conduct targeted ablations to isolate the effect of key CRT components (e.g., applying the L2 loss before quantization), offering actionable insights into its design.

Weakness:

  1. The analysis is limited to a fixed tokenizer architecture (VQ-GAN) and autoregressive stage-2 decoders. Interestingly, the experiments suggest that the specific choice of token count and codebook size does not strongly affect the compute-scaling behavior. This raises the question of whether other, potentially more impactful factors—such as the type of tokenizer or the choice of decoder architecture—might play a larger role in shaping the trade-off. However, this dimension remains unexplored.

  2. CRT explicitly sacrifices reconstruction quality for better modeling ease. However, existing scaling laws suggest that rFID becomes a bottleneck at large model scales. Since the experiments are limited to models up to 800M parameters, it remains unclear whether CRT continues to help, plateaus, or even hurts performance in the high-capacity regime.

  3. The first half of the paper focuses on empirical analysis of scaling laws and tokenizer design, while the second half introduces CRT as a solution. However, the connection between these two parts feels somewhat loose—does the first half merely identify a design choice (fewer tokens), or does it provide deeper insight that motivates CRT?

Questions

  1. The paper states (line 158) that models on the training compute Pareto frontier of validation loss are also on the gFID Pareto frontier. How can this be concluded directly from Figure 4? Can you elaborate?

  2. In line 220, you mention reducing training epochs by two to control training FLOPs. Was model convergence affected in those cases?

  3. In Figure 7, CRT's rFID appears to be 2.36, but why does the fitted curve have an intercept of 2.21?

  4. CRT relies on the autoregressive modeling bias to guide tokenizer regularization. For other modeling paradigms (e.g., MaskGIT, diffusion), what inductive biases would be appropriate? Would CRT need to be redesigned accordingly? Any insights would be appreciated.

Limitations

yes

Final Justification

After reading the rebuttal and discussion, I have increased my score to a weak accept (5). The authors have adequately addressed my concerns regarding the scope of tokenizer choices, the effect of CRT near saturation, and the connection between the scaling analysis and the proposed method. The paper offers valuable empirical insights and a practical method (CRT) with strong efficiency gains. While some writing and structural issues remain—particularly the somewhat loose connection between the first half (scaling analysis) and the second half (CRT proposal)—the contributions are solid and of interest to the community.

Formatting Concerns

No Paper Formatting Concerns.

Author Response

Addressing Weaknesses

The analysis is limited to a fixed tokenizer architecture (VQ-GAN) and autoregressive stage-2 decoders.

We focused on VQGAN as it is the current SOTA tokenizer architecture for auto-regressive image generation. We note that the VQGAN architecture is still used for top discrete token models and has been dominant since it was proposed in [1]. We agree that tokenizer architecture would be an interesting avenue to study; however, given the expense and depth of our scaling studies, we leave this axis of study to future work. We hypothesize that more flexible (e.g. transformer-based) architectures can potentially gain more from causal regularization.

CRT explicitly sacrifices reconstruction quality for better modeling ease... it remains unclear whether CRT continues to help, plateaus, or even hurts performance in the high-capacity regime

We agree that very close to saturation, reconstruction performance becomes important. We point out the following:

  1. CRT only harms reconstruction performance minimally (2.21 vs 2.36 rFID), which is hard to differentiate visually.
  2. rFID is not always the lower bound of generation model performance, although it usually serves as a good proxy. In scaling laws, $L_{min}$ (the scaling law's lower bound), $\alpha$ (the scaling exponent), and $\lambda$ (the intercept) are fit together [2, 3]; a minimal fitting sketch is given after this list. In our case, when we fit $L_{min}$ for gFID curves it generally matched rFID, so we used the convention of rFID as $L_{min}$. In two cases this was not true: (1) Figure 6 (far left, 1k codebook size) and (2) the CRT scaling law. In Figure 6 (far left) we used the fitted $L_{min}$ (2.7 instead of 3.0 rFID), and in the CRT scaling law the fitted $L_{min} = 2.2$, which essentially matches the baseline tokenizer's rFID. Therefore, scaling laws suggest that CRT and the baseline tokenizer have the same saturation point. We will make this clear in the final revision.
  3. We introduce the $CRT_{opt}$ recipe in the paper for the very purpose of optimizing for both absolute reconstruction performance and causal regularization. This improves performance near saturation (2.18 vs 2.35 gFID).
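A minimal sketch of the joint fit mentioned in point 2 above; the saturating power-law form and every number below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, l_min, lam, alpha):
    # Saturating power law: gFID approaches l_min as training compute grows.
    # `compute` is expressed in units of 1e18 FLOPs to keep the fit well-scaled.
    return l_min + lam * compute ** (-alpha)

compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # toy compute grid (x 1e18 FLOPs)
gfid = np.array([9.0, 6.8, 5.1, 3.9, 3.1])          # toy measured gFID values

popt, _ = curve_fit(
    scaling_law, compute, gfid,
    p0=[2.0, 7.0, 0.3],                              # initial guess for (L_min, lambda, alpha)
    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 2.0]),
)
l_min, lam, alpha = popt
print(f"fitted L_min = {l_min:.2f}, scaling exponent alpha = {alpha:.3f}")
```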

However, the connection between the scaling laws and CRT feels somewhat loose: does the first half merely identify a design choice (fewer tokens), or does it provide deeper insight that motivates CRT?

We thank the reviewer for this constructive observation. The first half of the paper shows that increasing bits per pixel above a 16k codebook size and 256 tokens per image generally makes scaling worse. Further, while reducing to a 1k codebook size improves the scaling law marginally, explicitly reducing the reconstruction capacity to such a degree greatly harms generation performance close to saturation. This motivates us to look for a way to navigate the rate-distortion trade-off that does not directly shrink the bottleneck, of which CRT is one. The analysis in Section 4 expands on this concept. We will make this explicit in the text, which should improve the overall flow of the paper.

Addressing Questions

The paper states (line 158) that models on the training compute Pareto frontier of validation loss are also on the gFID Pareto frontier. How can this be concluded directly from Figure 4? Can you elaborate?

This fact can be concluded directly from the figure, although it is admittedly difficult to see, and we will improve the clarity in the final revision. To see it in the figure itself: each line is a specific parameter scale at increasing compute, so points in the validation loss curve directly correspond to points in the gFID curve. The Pareto frontier consists of the minimum points at each compute interval. For example, in the validation loss curve, the last 6 points of the 775M parameter model lie on the Pareto frontier; the same is true for that model scale in the gFID curve. To make this clear in the final revision, we will use stars to denote validation loss points that are on the Pareto frontier and propagate those to the gFID curve.
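For concreteness, a minimal sketch of the frontier check described above; the model sizes and metric values are illustrative placeholders, not numbers from the paper:

```python
# Checkpoints pooled across model sizes; the frontier is the running minimum
# of a metric as training compute increases.
def pareto_frontier(ckpts, metric):
    """Checkpoints that improve on every lower-compute checkpoint for `metric`."""
    frontier, best = [], float("inf")
    for ckpt in sorted(ckpts, key=lambda c: c["compute"]):
        if ckpt[metric] < best:
            best = ckpt[metric]
            frontier.append((ckpt["model"], ckpt["compute"]))
    return frontier

ckpts = [
    {"model": "111M", "compute": 1e18,   "val_loss": 6.10, "gfid": 9.5},
    {"model": "111M", "compute": 4e18,   "val_loss": 5.80, "gfid": 7.9},
    {"model": "211M", "compute": 4e18,   "val_loss": 5.70, "gfid": 7.4},
    {"model": "111M", "compute": 1.6e19, "val_loss": 5.75, "gfid": 7.6},  # off the frontier
    {"model": "211M", "compute": 1.6e19, "val_loss": 5.50, "gfid": 5.9},
    {"model": "775M", "compute": 1.6e19, "val_loss": 5.40, "gfid": 5.6},
    {"model": "775M", "compute": 6.4e19, "val_loss": 5.20, "gfid": 4.1},
]
# The claim in the text: checkpoints on the validation-loss frontier are the
# same ones that sit on the gFID frontier.
assert pareto_frontier(ckpts, "val_loss") == pareto_frontier(ckpts, "gfid")
```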

In line 220, you mention reducing training epochs by two to control training FLOPs. Was model convergence affected in those cases?

Model convergence was not affected to any real degree. We trained versions at 38 and 40 epochs, and their rFIDs were 2.36 and 2.34 respectively, which is within training noise.

In Figure 7, CRT's rFID appears to be 2.36, but why does the fitted curve have an intercept of 2.21?

We expand on this above (rebuttal point 2 with regard to weakness point 2). Note that varying $L_{min}$ between 2.2 and 2.3 for CRT does not change the scaling exponent, so our conclusions remain the same. We will make this nuance (fitting $L_{min}$ vs. using rFID as the lower bound) clear in the text.

CRT relies on the autoregressive modeling bias to guide tokenizer regularization. For other modeling paradigms (e.g., MaskGIT, diffusion), what inductive biases would be appropriate? Would CRT need to be redesigned accordingly? Any insights would be appreciated.

Yes, CRT would need to be re-designed if the stage 2 modeling design were changed. This is an active area of research for us. Looking towards diffusion, generally the closer the latent distribution is to an isotropic Gaussian, the easier it is to do flow matching (since all the marginal flows cancel and you are left with no flow). However, if the latent prior is completely normal and there is no flow, then only the VAE decoder matters for generation. Further, the latent space would have a very low signal-to-noise ratio and the reconstruction would be blurry. This is one of the fundamental trade-offs for diffusion. This trade-off partially exists in the literature already with the KL regularization term for the VAE prior, which pushes the latent distribution towards an isotropic normal, but there are other ways to accomplish this. For example, linearly interpolating the latent with noise during auto-encoder training accomplishes the same thing while allowing for a more controllable signal-to-noise ratio. Ultimately, for any stage 2 model, introducing such an inductive bias into the stage 1 model requires analyzing the specific processes and assumptions inherent to stage 2 sampling.
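As an illustration of the "interpolate the latent with noise" idea mentioned above (the interpolation schedule, weighting, and names are our assumptions, not the authors' recipe):

```python
import torch

def noise_regularized_latent(z: torch.Tensor, snr_weight: float = 0.8) -> torch.Tensor:
    """Blend the encoder latent with Gaussian noise before decoding.

    snr_weight in (0, 1] controls the signal-to-noise ratio: 1.0 recovers the
    plain auto-encoder, while smaller values push the marginal latent
    distribution closer to an isotropic Gaussian (easier for a downstream
    flow/diffusion model, at the cost of blurrier reconstructions).  The
    variance-preserving mix keeps the latent scale roughly constant.
    """
    eps = torch.randn_like(z)
    return snr_weight * z + (1.0 - snr_weight ** 2) ** 0.5 * eps

# during auto-encoder training (hypothetical encoder/decoder modules):
#   x_hat = decoder(noise_regularized_latent(encoder(x)))
```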

[1] Taming Transformers for High-Resolution Image Synthesis, Esser et al 2020.

[2] Training Compute-Optimal Large Language Models, Hoffmann et al 2022.

[3] Scaling Laws for Neural Language Models, Kaplan et al 2020.

Comment

Thank you for your detailed response and clarification. Most of my concerns have been addressed, and I am happy to raise my rating to 5 (Accept).

Comment

We appreciate the response and are pleased that all concerns have been addressed. We further thank the reviewer for noting where we can improve the paper for clarity and believe this discussion has strengthened our final revision.

Review (Rating: 6)

The submission has two main contributions.

The first is a thorough study on the design space of discrete image tokenizers (stage 1), and what effect it has on downstream generative modeling with vanilla autoregressive models (stage 2). To that end, the paper sets up scaling laws and studies the relation of tokenizer vocabulary size and token sequence length with stage 2 model scale & compute.

The second main contribution is a training modification on 2D tokenizers that regularizes the learned latent space to be more amenable to generative modeling with AR models. The authors call this "Causally Regularized Tokenization" (CRT) and it is performed by co-training a tiny AR model with the tokenizer whose objective it is to perform next-token prediction on the tokenizer latents. By doing that, the tokens are trained to be more predictable in an AR manner. The authors observe a slight reduction in reconstruction performance, but at the benefit of significantly stronger generative performance, with better scaling behavior compared to the non-CRT baseline.
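For readers, a minimal sketch of this co-training setup as described; the module names, loss weights, and the exact form of the next-token objective (an L2 loss against pre-quantization latents, per the submission's ablations) are assumptions on our part, and GAN/perceptual terms are omitted:

```python
import torch
import torch.nn.functional as F

def crt_training_step(encoder, quantizer, decoder, tiny_ar, images,
                      recon_weight=1.0, crt_weight=0.1):
    z = encoder(images)                      # (B, L, D) continuous latents
    z_q, vq_loss = quantizer(z)              # quantized latents + commitment loss
    recon_loss = F.mse_loss(decoder(z_q), images)

    # Causal regularization: a tiny AR model predicts the latent at position
    # t+1 from latents <= t.  Gradients flow back into the encoder, nudging it
    # to produce tokens that are easy to predict in raster order.
    pred = tiny_ar(z_q[:, :-1])
    crt_loss = F.mse_loss(pred, z[:, 1:])

    return recon_weight * (recon_loss + vq_loss) + crt_weight * crt_loss
```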

Strengths and Weaknesses

Strengths:

  • The investigations into how a tokenizer's codebook size and token sequence length should be chosen for different sizes and compute budgets of downstream generative models are highly interesting and well-designed.
  • The design of CRT is simple and effective, and incurs a minimal compute overhead. The efficiency improvements / improved scaling law slope, and the demonstration of outperforming LlamaGen with smaller models and shorter sequence lengths are impressive.
  • The authors seem to have put great effort into accounting for the compute overhead of various interventions and into compute-matching the various experiments. In general, the experimental setup and procedure appear very thorough and well thought through.
  • The ablations on CRT are thorough and answered all the key questions I had w.r.t. loss balancing, size of the AR regularizer, loss function, etc.
  • The paper provides sufficient training details to reproduce the experiments and the method.
  • The paper's writing is very clear, the sections are well-formatted, and the figures are easy to parse. The paper is packed with interesting and useful insights, and it was a joy to read. It also challenges some common assumptions from previous works, which would be useful to study in more detail.

Weaknesses:

  • Sec 3.4: The authors note that more compressed sequences are generally more compute optimal, but how much of this may be due to the tokenizer operating on out-of-distribution resolutions? It is noted that for these experiments, the same tokenizer is used, but at higher resolutions than what it was trained at. Would the results change if tokenizers were trained at those resolutions? Would we see further improvements for smaller models if the sequence length is further decreased, e.g. to 14*14 or below (with frozen or specially trained tokenizers)?
  • The paper focuses mostly on class-conditional ImageNet generation, which to me is one of its main weaknesses. There are some results on LSUN, but as for IN1K generation, the benchmark is somewhat close to saturation and susceptible to gaming if one tries to just overfit a model to the train set (since gFID is measured against it). The paper's findings are valuable, but could be significantly stronger if they were demonstrated to A) hold on more difficult datasets, B) on more nuanced tasks like text-to-image where the conditioning alignment could be measured, and C) in a data regime that is not data-limited, unlike ImageNet-1k.
  • Besides the main contributions, the VQ training recipe proposed by the authors seems to outperform simpler baselines like FSQ. This is somewhat surprising, as the VQ recipe in this paper seems similar to the one studied in FSQ, and it would be useful to investigate this further and show if the proposed CRT also works well for LFQ.
  • The paper could benefit from including a discussion about related work like JetFormer [1] that jointly trains a tokenizer with an AR model, as well as alternative methods of regularizing tokenizers with a 1D causal structure, like SEED [2], ElasticTok [3], One-D-Piece [4], FlexTok [5], or Semanticist [6]. I note that the submission already includes a discussion about LARP's use of AR regularization.

[1] JetFormer: An Autoregressive Generative Model of Raw Images and Text, Tschannen et al. 2024

[2] Planting a SEED of Vision in Large Language Model, Ge et al. 2023

[3] ElasticTok: Adaptive Tokenization for Image and Video, Yan et al. 2024

[4] One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression, Miwa et al. 2025

[5] FlexTok: Resampling Images into 1D Token Sequences of Flexible Length, Bachmann et al. 2025

[6] "Principal Components" Enable A New Language of Images, Wen et al. 2025

Questions

  • The authors note that raster-scan order has been observed to work best for AR generation, but how would the use of CRT change this? Is this simply an inductive bias that may be arbitrarily changed by regularizing the tokenizer to encode information in a different order?
  • Fig 3: It may not be practical for stage 2 training, but it would be interesting to see the trendlines for scaling the codebook size extended beyond 2^17. It looks like for PSNR, scaling the codebook size may scale better.

Limitations

The authors have adequately addressed potential limitations.

Final Justification

My initial review for this submission was positive, and the authors' rebuttal addressed most of my concerns and even provided additional insights (especially with respect to the VQ training recipe and the order in which CRT is applied) which could further strengthen the paper. My main issue with the submission was that the main evaluation was on ImageNet generation, but the authors showed that CRT is effective for text-to-image generation on a more diverse and large-scale dataset too. For these reasons I will further increase my score.

Formatting Concerns

  • L54: The bullet points in the summary of contributions are too compact
  • Nit: The current Fig. 3 placement is somewhat far from the section that discusses it
Author Response

We thank the reviewer for their detailed and insightful review. We were particularly gratified that the reviewer found our paper "a joy to read" and that "the experimental setup and procedure appears very thorough and thought-through". We address the reviewer's constructive feedback and criticisms below.

Addressing Weaknesses

The authors note that more compressed sequences are generally more compute optimal, but how much of this may be due to the tokenizer operating on out-of-distribution resolutions?

We thank the reviewer for pointing this out! In order to address the point of operating at OOD resolutions, we have results for:

  • Tokenizers greater than 16x16 resolution: We do not see significant improvement in rFID or other distortion metrics when training with a natively higher input resolution, nor do we see stage 2 improvements. We ran these experiments with a 211M parameter stage 2 model for 500k iterations (batch size 256).

| Tokens per Image | rFID / gFID | Native resolution training |
|------------------|-------------|----------------------------|
| 400 | 1.30 / 3.76 | no |
| 400 | 1.21 / 3.72 | yes |
| 576 | 0.99 / 3.61 | no |
| 576 | 0.93 / 3.66 | yes |
  • Tokenizers less than 16x16 resolution: It is likely possible to see some improvement since training a lower resolution tokenizer often damages PSNR; however, we note that training such a tokenizer requires architectural modifications in order to avoid resizing the image below 256x256. This introduces an architectural confound outside the scope of this paper. We will include this comparison when the model is finished training. We thank the reviewer for their great observation.

The paper focuses mostly on class-conditional ImageNet generation ... The paper's findings are valuable, but could be significantly stronger if they were demonstrated to A) hold on more difficult datasets, B) on more nuanced tasks like text-to-image where the conditioning alignment could be measured, and C) in a data regime that is not data-limited, unlike ImageNet-1k.

We thank the reviewer for pointing this out. To alleviate this concern and supplement our results, we ran a text-to-image experiment. For this experiment, we took the BLIP-3o long-caption [1] pre-training dataset (27M image-text pairs vs 1.2M ImageNet images), tokenized the text with OpenAI's CLIP L/14 text tower [2], and trained an auto-regressive image generation model conditioned on these text embeddings. This dataset is diverse, being a combination of web-crawled images and high-quality data from JourneyDB. We trained our XL-sized model (775M) for 600k iterations at batch size 256. The results from this experiment are in the table below:

| Tokenizer | gFID | CLIPScore |
|---|---|---|
| Baseline | 4.61 | 0.33 |
| CRT (ours) | 4.15 | 0.36 |

We use CLIPScore [3] (higher is better) with OpenAI CLIP B/32 to measure image-text alignment. We see a solid improvement with our method, even though we are re-using the image tokenizer from our ImageNet experiments and this dataset is thus OOD. We will include full details of this experiment in the final revision.

Besides the main contributions, the VQ training recipe proposed by the authors seems to outperform simpler baselines like FSQ ... it would be useful to investigate this further and show if the proposed CRT also works well for LFQ.

This point has two parts: exploring the VQ recipe and discussing LFQ.

  • VQ recipe exploration. The performance of the VQ recipe was also surprising to us. Our VQ recipe is similar to that of ViT-VQGAN and LlamaGen but ended up being more scalable than both. We credit this to a long learning rate warmup period, which is the only substantial deviation from the ViT-VQGAN/LlamaGen settings (a minimal sketch of such a warmup schedule follows this list). [4] finds that long learning rate warmup periods lend greater stability through training runs even when instabilities appear far after the warmup period, and it is possible this has a similar effect here. Not only is our codebook utilization ~100%, but Figure 9 (center left) also implies that the distribution of code usage is close to uniform without any entropy losses. This is a deviation from the result in FSQ, where both FSQ and VQ are substantially below the uniform compression cost threshold (see Figure 3, bottom right, in FSQ). We show the effect with and without learning rate warmup below for our setting, where we see a clear improvement with warmup across codebook sizes:
| Codebook Size | With Warmup (rFID) | Without Warmup (rFID) |
|---|---|---|
| $2^{12}$ | 2.60 | 3.04 |
| $2^{13}$ | 2.53 | 2.95 |
| $2^{14}$ | 2.21 | 2.24 |
| $2^{15}$ | 1.91 | 2.29 |
| $2^{16}$ | 1.76 | 2.45 |
  • Regarding LFQ. We have started training a tokenizer with LFQ, but properly training LFQ requires carefully tuning the entropy and commitment losses, and therefore it is not finished as of the rebuttal deadline. We will include this comparison when it is complete. We speculate that it will not outperform our VQ recipe in terms of rFID, as we are already achieving almost uniform codebook utilization (Figure 9).
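For reference, a minimal sketch of a long-linear-warmup learning rate schedule of the kind credited above; the warmup length, base learning rate, and cosine decay are illustrative choices, not the paper's exact hyperparameters:

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=20_000,
               total_steps=1_000_000, min_lr=1e-5):
    """Long linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```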

The paper could benefit from a discussion of JetFormer and more 1D tokenization works.

We thank the reviewer for pointing out these important works. We will include a discussion of them in the final revision under the 1D tokenizer section in our related work.

Addressing Questions

The authors note that raster-scan order has been observed to work best for AR generation, but how would the use of CRT change this?

We have done experiments with alternate token orders (Hilbert space filling curve, spiral in, spiral out, snake order from top left) at small scale (<200M stage 2 parameters). In all of these experiments, raster scan ended up being the best empirically, often substantially (>1 gFID). When applying CRT to other orderings (at small scale), CRT would improve performance but not enough to overcome the absolute difference between raster-scan and worse orderings. We are not the first to note the superiority of raster-scan. Indeed a comparison exists in [5] in Figure 47 (although with a sliding window attention pattern instead of full causal). It is possible that a more flexible base architecture than that of VQGAN would allow for more flexible re-ordering. However our implementations of ViT-VQGAN underperformed VQGAN when compute-matched, so we focused on VQGAN for this study.
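To make the orderings concrete, a small sketch of how such alternate orders can be implemented as fixed permutations of the flattened 16x16 token grid (raster scan is the identity; the snake order below is our construction, shown purely for illustration):

```python
import numpy as np

def raster_order(h=16, w=16):
    """Raster scan: the identity permutation over the flattened grid."""
    return np.arange(h * w)

def snake_order(h=16, w=16):
    """Row-major order that reverses every other row ("snake" from top left)."""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1].copy()
    return idx.reshape(-1)

# Apply a permutation to the token sequence before stage 2 training and
# invert it before decoding; codes here are random placeholders.
tokens = np.random.randint(0, 16384, size=(4, 256))
perm = snake_order()
reordered = tokens[:, perm]              # what the AR model would see
restored = np.empty_like(reordered)
restored[:, perm] = reordered            # inverse permutation
assert np.array_equal(restored, tokens)
```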

Fig 3: It may not be practical for stage 2 training, but it would be interesting to see the trendlines for scaling the codebook size extended beyond 2^17. It looks like for PSNR, scaling the codebook size may scale better.

This is an interesting observation. In response, we trained a tokenizer with $2^{18}$ codes. Note that if we trained a stage 2 model, the embedding parameters for a 770M GPT model would be ~650M parameters, which, as the reviewer notes, is cumbersome. We actually do see a slight improvement over increasing tokens per image (rFID: 1.72, PSNR: 21.6, MS-SSIM: 0.695); however, it is not great enough to justify the added compute. To stabilize this run during the GAN loss phase, we had to introduce warmup when increasing the weight of the discriminator loss (over 2000 iterations). We did not have to do this for codebook size 131k.

[1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset, Chen et al 2025

[2] Learning Transferable Visual Models From Natural Language Supervision, Radford et al 2021

[3] CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Hessel et al 2021

[4] Small-scale proxies for large-scale Transformer training instabilities, Wortsman et al 2023

[5] Taming Transformers for High-Resolution Image Synthesis, Esser et al 2020.

Comment

I thank the authors for their rebuttal, which addressed most of my concerns and questions. I think that the text-to-image results and insights into the effects of warmup for VQ training are especially interesting and worth including in the submission. I would also recommend to include the study of the choice of token order, since it seems in line with previous findings and the insights that CRT improves all variants but cannot overcome the choice of ordering are useful to better understand what the CRT regularization does. I currently do not have other major concerns about the submission.

Comment

We are glad that our rebuttal addressed the reviewer's concerns. We particularly appreciate the detail and care put into this review and appreciate the constructive criticisms, which we believe will enhance our paper.

Comment

Dear Reviewer SM8A,

As we're approaching the end of author-reviewer discussion period, please read the rebuttal and start discussion with the authors as soon as possible. If all your concerns have been addressed, please do tell them so. Please note that submitting mandatory acknowledgement without posting a single sentence to authors in discussions is not permitted. Please also note that non-participating reviewers will receive possible penalties of this year's responsible reviewing initiative and future reviewing invitations.

Thanks,

AC

Comment

Dear Reviewers,

Thanks for your hard work during the review process. We are now in the author-reviewer discussion period.

Please (1) carefully read all other reviews and the author responses; (2) start discussion with the authors as early as possible if you still have concerns, so that the authors have enough time to respond; (3) acknowledge and update your final rating. Your engagement in this period is crucial for the ACs to make the final recommendation.

Thanks,

AC

Final Decision

This paper studies the trade-off between image tokenization and generative modeling in two-stage image generation methods, and proposes a regularization term for tokenization that benefits the training of the second stage. Reviewers acknowledged the major contributions and valuable insights of the proposed method, while initially raising concerns about the experimental design and missing T2I results.

After the rebuttal, the authors addressed most of the concerns, and all reviewers agreed to accept this paper. The AC read all the reviews, the author rebuttals, and the paper, believes this is a strong paper, and recommends acceptance.