PaperHub
Rating: 6.8/10
Poster · 4 reviewers (min 4, max 5, std 0.4)
Individual ratings: 4, 4, 5, 4
Confidence: 2.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

OpenReview · PDF
Submitted: 2025-04-09 · Updated: 2025-10-29

Abstract

Keywords
autoencoder, image generation

Reviews and Discussion

Official Review
Rating: 4

This paper introduces GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework for image representation and generation. Unlike conventional grid-based tokenizers, GPSToken uses parametric 2D Gaussians to dynamically model regions of varying shapes, positions, and textures, enabling non-uniform and content-aware tokenization. The proposed method partitions images into texture-homogeneous regions, initializes Gaussian parameters based on entropy-driven complexity, and refines them via a transformer. A splatting-based renderer converts tokens back into feature maps for decoding. GPSToken decouples spatial layout (Gaussian parameters) from texture features, facilitating a two-stage generation pipeline that first synthesizes structural layouts and then generates textures. Experiments on ImageNet demonstrate state-of-the-art performance.

Strengths and Weaknesses

Strengths

  1. The main idea of this paper is quite interesting, and the proposed methods are reasonable.

  2. The writing of this paper is easy to follow.

Weaknesses

  1. The experiments seem unfair in some cases. For example, in Tab. 1, the parameter size of VAVAE is only 69.8M, significantly smaller than that of the proposed method.

  2. It would be better to discuss the training overhead of the proposed method more thoroughly and compare it with existing SOTA methods, rather than only with SiT-XL/2.

I am not an expert in this field, and I may not understand all the strengths and weaknesses of this paper.

Questions

  1. In Tab. 2, I notice that the proposed method and MAETok have similar performance. However, the proposed method has a larger generator, while MAETok has a larger tokenizer. Which part of the model do you think influences performance the most? Are there any experimental results that reflect this?

Limitations

Yes

Justification for Final Rating

Overall, I think leveraging parametric 2D Gaussians to model different image regions is a very interesting idea, and the authors also provide a reasonable pipeline to realize the proposed idea. While I initially had some concerns about the experiments, the authors addressed them satisfactorily in their rebuttal. Thus, I maintain my positive score.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for the constructive comments and suggestions. We hope the following point-by-point responses address the reviewer's concerns.


Q1: Parameter size.

We thank the reviewer for the insightful comment. Although VAVAE has fewer parameters (69.8M) than our method (128M), our model size is moderate among the 14 methods compared in Tab. 1 of the main paper. Moreover, a direct comparison based solely on parameter count is not meaningful given the significant architectural and training differences: SDXL-VAE uses residual blocks, TiTok employs transformers, VAVAE leverages a large pretrained vision model, FlexTok incorporates a generative decoder, and our GPSToken introduces Gaussians as a novel image representation. Notably, FlexTok has about eight times as many parameters as our GPSToken, yet it performs significantly worse — demonstrating that parameter count alone does not determine performance.

More importantly, the number of tokens has a greater impact on the downstream generator than the tokenizer's parameter count, as it directly affects the optimization difficulty and the computational cost—note that the generator typically contains far more parameters (e.g., 708M in our case). Consequently, most prior works focus on improving reconstruction quality under the same token count, placing less emphasis on tokenizer parameter efficiency. In Tab. 1 of the main paper, under this standard and fair setting (same token count), our GPSToken achieves the best reconstruction performance across all metrics.

We will make these points clearer in the revision.


Q2: Training overhead.

We sincerely thank the reviewer for the insightful comment. We agree that a comprehensive discussion on the training overhead in comparison with existing SOTA methods is valuable.

However, due to the substantial computational cost—our SiT-XL/2 baseline takes approximately 9 days for 400k iterations on our GPU cluster—conducting extensive retraining across multiple SOTA frameworks within the rebuttal period is unfortunately infeasible.

Instead, we can provide an indirect comparison based on results reported in the literature. As stated in their papers, REPA achieves performance comparable to SiT-XL/2 (400k iters) in just 100k iterations, and MAETok reaches similar performance in about 50k iterations. In contrast, our GPSToken matches the 400k-iteration performance of SiT-XL/2 in approximately 150k iterations. Thus, in terms of convergence speed measured by iteration count, REPA and MAETok converge faster than our two-stage framework.

However, per-iteration efficiency and wall-clock training time are equally important. REPA relies on a large pretrained vision model (DINOv2-L) on top of SiT-XL/2, leading to high computational and memory overhead per step. In contrast, our method significantly improves per-step efficiency over SiT-XL/2 (see our response to Q2 of reviewer xDKG). Therefore, despite requiring more iterations, our approach achieves competitive or even superior wall-clock training speed in practice.

It should be noted that our method is orthogonal to existing acceleration techniques, such as auxiliary losses (e.g., in MAETok, REPA). We believe these techniques can be integrated into our framework for even greater efficiency. We would also like to clarify that the primary contribution of GPSToken is a novel Gaussian-based visual tokenization method and a two-stage generation framework, enabling effective image representation and structured generation. Reduced training overhead is a beneficial byproduct, not our main focus.

We will revise the manuscript to include a more comprehensive discussion on these aspects.


Q3: Tokenizer vs. generator.

Thank you for the insightful observation. The difference in model scale—our GPSToken generator is slightly larger while MAETok employs a relatively larger tokenizer—stems primarily from the disparate experimental setups and architectural choices across methods, rather than intentional design for performance tuning. Notably, our approach employs a conventional residual block-based decoder, whereas MAETok utilizes a transformer-based decoder. These fundamental architectural distinctions naturally lead to variations in parameter distribution and model size.

Regarding which component—tokenizer or generator—has a greater influence on overall performance, this remains an open question in the field. Prior work exhibits diverse emphases: methods such as VQVAE and TiTok focus heavily on improving tokenizer fidelity under the premise that better reconstruction leads to stronger generative performance. In contrast, approaches such as REPA employ auxiliary losses to guide generator training, effectively treating the tokenizer as a secondary component. More recently, emerging frameworks such as FlexTok and REPA-E advocate for joint optimization or unified modeling of tokenization and generation, blurring the boundary between the two modules. This evolving landscape makes it challenging to isolate the relative importance of each part under the current paradigms.

In our view, the tokenizer plays a more critical role in the overall system. It shapes the structure of the latent space, thereby determining the intrinsic complexity of the generative task and establishing the theoretical performance ceiling for the generator. A poorly structured latent space can impose optimization barriers that even a powerful generator may struggle to overcome.

To date, there is a lack of comprehensive empirical studies that systematically ablate or compare the contributions of tokenizers and generators under controlled settings. This gap is largely due to the significant computational resources required and the difficulty in establishing fair comparisons across diverse architectures, training objectives, and theoretical foundations. We believe that developing standardized benchmarks for modular evaluation—such as plug-and-play tests between tokenizers and generators—would be a valuable direction for future work.

Comment

Dear Reviewer pm2Q,

Many thanks for your time in reviewing our paper and your constructive comments. We have submitted the point-by-point responses. We would appreciate it if you could let us know whether your concerns have been addressed, and we are happy to answer any further questions.

Best regards,

Authors of paper #1542

Official Review
Rating: 4

This paper addresses the limitations of conventional grid-based image tokenizers, which lack spatial adaptivity and content-awareness. The authors propose GPSToken, a Gaussian-parameterized tokenization framework that dynamically models image regions using 2D Gaussians to encode their shape, position, and texture. By disentangling layout from appearance, GPSToken enables a two-stage image generation pipeline—first synthesizing structural layouts, then generating textures. Experiments on ImageNet show that GPSToken achieves state-of-the-art performance on both image reconstruction and generation tasks, with significantly fewer tokens and faster convergence.

Strengths and Weaknesses

The research problem is meaningful, and the authors present their ideas clearly, with careful and thorough explanation of the modeling process. The writing is solid and communicates the technical components effectively.

However, a key weakness is that while the core idea—using 2D Gaussians to dynamically model variable region shapes and positions—is appealing for reducing redundancy in simple regions and enabling finer granularity in complex ones, the motivation for introducing 2D Gaussian functions is not clearly established: the paper lacks a deeper analysis connecting their specific properties to the concrete demands of the tokenization or generation tasks. I would encourage the authors to further articulate the theoretical motivation and necessity for using this specific formulation.

In addition, the experimental section seems to lack necessary ablation studies to justify design choices, and the setup feels somewhat underexplored. Stronger empirical evidence would improve the credibility and completeness of the work.

Questions

  1. What is the basis for setting $\sigma_x^{(i)}$ and $\sigma_y^{(i)}$ to $\frac{1}{6}$ of $w_i$ and $h_i$ to ensure full coverage during rendering? Is this choice empirically optimal? How sensitive are the results to this initialization?

  2. How does the method perform on real-world datasets beyond ImageNet? Are there any plans to test on more diverse datasets (e.g., COCO, FFHQ)?

Limitations

The authors have briefly acknowledged the heuristic limitation in Gaussian initialization and the need for more specialized generation architecture. However, it would be helpful to explicitly discuss:

  1. Potential failure cases (e.g., cluttered scenes, occlusions).
  2. How token count influences spatial bias or missed details.
  3. Scalability to higher resolutions.

Formatting Issues

no

Author Response

We sincerely thank the reviewer for the constructive comments and suggestions. We hope the following point-by-point responses address the reviewer's concerns.


Q1: Motivation for 2D Gaussians in tokenization.

Our choice of 2D Gaussian functions for image region modeling is grounded in their mathematical properties, which align well with the requirements of adaptive image tokenization.

Standard grid-based tokenization methods suffer from spatial redundancy—simple regions are over-partitioned, while complex ones lack sufficient resolution. An ideal tokenization scheme should have the following properties:

(1) flexibly represent semantic regions of arbitrary location and scale;
(2) model fuzzy boundaries in a soft, probabilistic manner;
(3) be fully differentiable to enable end-to-end optimization with gradient-based learning; and
(4) maintain low parametric complexity to ease downstream generation tasks.

After evaluating alternatives such as bounding boxes and segmentation masks, we find that 2D Gaussians offer a principled and effective solution:

  • Spatial adaptivity: Each 2D Gaussian is parameterized by a mean $\mu = (\mu_x, \mu_y)$ and a covariance matrix (determined by $\sigma_x$, $\sigma_y$, and $\rho$), enabling anisotropic shapes that adapt to region geometry. Multiple Gaussians can be combined to model complex structures.

  • Soft boundary modeling: The Gaussian density function provides a smooth, continuous weight distribution, naturally capturing the uncertainty and gradual transitions at object boundaries in natural images.

  • End-to-end differentiability: All parameters are differentiable, and feature aggregation from CNN or ViT feature maps can be performed via soft weighting (inspired by differentiable rendering in 3DGS), enabling seamless integration into gradient-based training.

  • Parameter efficiency: A 2D Gaussian requires only five parameters to describe a spatial region—$\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, $\rho$—offering a compact yet expressive representation that avoids overburdening the generative decoder.

In contrast, bounding boxes are limited to axis-aligned rectangles and exhibit hard, non-differentiable boundaries. Segmentation masks offer precise shapes but they are high-dimensional, discrete, and incompatible with differentiable optimization. Therefore, our use of 2D Gaussians is not merely inspired by 3D Gaussian Splatting (3DGS), but is also a well-motivated choice that satisfies the core demands of adaptive, efficient, and learnable tokenization.
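To make this parameterization concrete, the following minimal NumPy sketch (our illustration, not the paper's released code) builds the covariance matrix from the five token parameters and evaluates the resulting soft weight map over a feature grid; the aggregation shown in the final comment is likewise an assumed, simplified form of the splatting-based feature pooling:

```python
import numpy as np

def gaussian_weight_map(mu, sigma, rho, height, width):
    """Soft spatial weights of one 2D Gaussian token over an H x W grid."""
    sx, sy = sigma
    # Covariance built from the five parameters (mu_x, mu_y, sigma_x, sigma_y, rho).
    cov = np.array([[sx * sx, rho * sx * sy],
                    [rho * sx * sy, sy * sy]])
    inv_cov = np.linalg.inv(cov)
    ys, xs = np.mgrid[0:height, 0:width]
    d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)   # (H, W, 2) offsets from the mean
    # Unnormalized density exp(-0.5 * d^T Sigma^{-1} d), one value per pixel.
    m = np.einsum('hwi,ij,hwj->hw', d, inv_cov, d)
    return np.exp(-0.5 * m)

# Example: an anisotropic, tilted region centered at (32, 24).
w = gaussian_weight_map(mu=(32.0, 24.0), sigma=(10.0, 4.0), rho=0.3, height=64, width=64)
# A token's texture feature could then be a soft average of a feature map `feats`
# of shape (H, W, C): token = (w[..., None] * feats).sum((0, 1)) / w.sum()
```

Because every operation above is smooth in the five parameters, gradients flow to position, scale, and orientation alike, which is exactly the differentiability property claimed in the list above.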

We will add more discussions in the revision, particularly in the introduction and method sections.


Q2: Missing ablation studies.

Actually, we have included a comprehensive analysis in Section C of the Appendix (starting at L42). This section presents both quantitative results and visualizations of the learned Gaussian parameters, which could help elucidate the role of each component in GPSToken. Specifically, the "Refine." module adjusts Gaussian distributions to better align with local textures and consistently improves performance across metrics. The "Init." module reallocates Gaussians from homogeneous to texture-rich regions, enabling finer semantic modeling without degrading reconstruction quality.

To further justify our design and address this reviewer's concern, we have conducted additional sensitivity studies on key hyperparameters, including the entropy threshold $\lambda$, the support factor $s$, and the minimum region size $s_{min}$. Please refer to our response to Q2 of Reviewer EBwp, where we show that the model performance remains stable within reasonable ranges and our default settings achieve favorable performance.
We will better highlight these results in the revision.


Q3: Why set $\sigma_x^{(i)}, \sigma_y^{(i)} = \frac{1}{6}w_i, \frac{1}{6}h_i$?

This choice is based on the 3σ rule — a well-established empirical principle for normal distributions — which states that nearly 99.7% of the mass of a normal distribution lies within $\mu \pm 3\sigma$. Accordingly, each region is designed to have a width of approximately $6\sigma$ (spanning from $\mu - 3\sigma$ to $\mu + 3\sigma$), ensuring coverage of the vast majority of relevant pixels. Notably, this initialization has minimal impact on the final outcome, as we subsequently refine the $\sigma$ values through the transformer blocks in the encoder.
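The 99.7% figure can be checked directly; a one-line Python verification (ours, for illustration):

```python
from math import erf, sqrt

# Mass of a normal distribution within mu +/- 3*sigma is erf(3 / sqrt(2)).
# A region of width w spanning mu - 3*sigma .. mu + 3*sigma gives 6*sigma = w,
# hence the initialization sigma = w / 6.
print(f"{erf(3 / sqrt(2)):.4%}")  # -> 99.7300%
```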


Q4: Performance on COCO and FFHQ.

As suggested, we further evaluate GPSToken on two additional real-world datasets: COCO2017 and FFHQ, randomly sampling 5,000 images from each. As demonstrated in the following table, GPSToken consistently outperforms its competitors — MAETok (128 tokens) and VAVAE (256 tokens), representing the state of the art in 1D and 2D tokenization, respectively — across all metrics and datasets under the same token counts.

| Dataset | Token Count | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rec. FID ↓ |
|---|---|---|---|---|---|---|
| COCO2017 | 128 | MAETok | 22.67 | 0.623 | 0.101 | 8.91 |
| COCO2017 | 128 | GPSToken-M128 | 23.47 | 0.657 | 0.083 | 4.72 |
| COCO2017 | 256 | VAVAE | 25.01 | 0.736 | 0.052 | 6.01 |
| COCO2017 | 256 | GPSToken-L256 | 27.41 | 0.794 | 0.035 | 2.23 |
| FFHQ | 128 | MAETok | 25.53 | 0.707 | 0.064 | 4.66 |
| FFHQ | 128 | GPSToken-M128 | 26.35 | 0.745 | 0.050 | 3.72 |
| FFHQ | 256 | VAVAE | 28.06 | 0.808 | 0.027 | 1.95 |
| FFHQ | 256 | GPSToken-L256 | 30.02 | 0.846 | 0.019 | 1.51 |

Specifically, GPSToken-L256 achieves the best results on all three benchmarks: 30.02 PSNR / 0.846 SSIM on FFHQ, 28.81 PSNR / 0.809 SSIM on ImageNet (see Tab. 1 in the main paper), and 27.41 PSNR / 0.794 SSIM on COCO2017. The observed performance trend (FFHQ > ImageNet > COCO2017) aligns well with the inherent data complexity: the structured nature of human faces in FFHQ facilitates reconstruction, whereas ImageNet, with its 1,000 diverse object categories, presents a greater challenge. COCO2017, featuring complex scenes with multiple objects at varying scales and rich contextual interactions, poses the most difficult reconstruction task.

The consistent superiority across datasets with different characteristics underscores the generalization capability of GPSToken.


Q5: More discussion on limitation.

Thanks for the constructive comments. We will include more discussions in the Limitations section based on this reviewer's suggestion.

Potential failure cases. GPSToken adaptively allocates fewer tokens to simple regions and more tokens to complex regions, enabling efficient representation without compromising reconstruction. One limitation of this strategy is that it may offer limited benefit for uniformly simple images, and may allocate dense tokens everywhere for images dominated by high-frequency details or noise, reducing its efficiency. Our method is robust to occlusions because token allocation depends on the observed complexity of the visible region rather than on predefined object structures; changes in local content caused by occlusion naturally influence the token distribution according to the resulting complexity. We will show examples of these cases in the revision.

Token counts. GPSToken supports flexible token allocation during inference, allowing adaptation to varying computational budgets or fidelity requirements. As demonstrated in Fig. 6 of the main paper, reducing the total number of tokens preserves global structure while gradually sacrificing fine-grained local details. In contrast, increasing the token count will enhance the local detail reconstruction across the image.

Crucially, this adaptability does not introduce significant spatial bias in token distribution. Our region partitioning strategy balances both spatial extent and local complexity, with a minimum region size constraint during initialization. This design ensures that even under low-token regimes, no region will be entirely omitted, thereby maintaining a spatially balanced representation.

Higher resolution. GPSToken demonstrates favorable scalability to higher-resolution images. Empirically, we observe that reconstruction quality (measured by PSNR and SSIM) remains consistent when scaling both image resolution and token count proportionally. For instance, as shown in our response to Q4 of reviewer EBwp, reconstructing a $512 \times 512$ image using 512 tokens (or a $1024 \times 1024$ image using 2048 tokens) achieves fidelity comparable to reconstructing a $256 \times 256$ image with 128 tokens. This result aligns with prior tokenization methods, suggesting that our approach preserves representational efficiency across resolutions.

Comment

Dear Reviewer pWj1,

Many thanks for your time in reviewing our paper and your constructive comments. We have submitted the point-by-point responses. We would appreciate it if you could let us know whether your concerns have been addressed, and we are happy to answer any further questions.

Best regards,

Authors of paper #1542

Official Review
Rating: 5

GPSToken proposes a Gaussian parameterized spatially-adaptive tokenization framework to address the inflexibility of conventional 2D/1D grid tokenization in representing regions with varying shapes, textures, and locations. The framework achieves non-uniform image tokenization by: 1) partitioning images into texture-homogeneous regions using an entropy-driven algorithm; 2) parameterizing each region as a 2D Gaussian (mean for position, covariance for shape) with texture features; 3) optimizing Gaussian parameters via a specialized Transformer for continuous adaptation and content-aware feature extraction. During decoding, Gaussian-parameterized tokens are reconstructed into 2D feature maps using a differentiable splatting-based renderer, enabling end-to-end training with standard decoders. By decoupling spatial layout (Gaussian parameters) from texture features, GPSToken enables a two-stage generation pipeline: structural layout synthesis via lightweight networks, followed by structure-conditioned texture generation.

Strengths and Weaknesses

Strengths:

  1. Unlike prior grid-based or 1D tokenization, GPSToken introduces a continuous, parametric representation of image regions using 2D Gaussians, which dynamically adapts to local structures. This is a departure from discrete or fixed-grid methods.
  2. The two-stage generation pipeline (structure-first, texture-later) aligns with human perception and reduces computational complexity, making it feasible for real-world applications. The ability to adjust token count at inference enhances flexibility for quality-efficiency trade-offs.
  3. The paper outlines the entropy-driven partitioning algorithm, Gaussian parameterization, and transformer refinement process in sufficient detail, including mathematical formulations (e.g., Eq. 2 for the modified Gaussian, Eq. 3 for splatting). Experimental settings (dataset, training configurations, metrics) are well-documented.
  4. The paper provides extensive quantitative results on ImageNet, comparing against a wide range of state-of-the-art tokenizers (e.g., SDXL-VAE, TiTok, MAETok) across reconstruction and generation tasks. The performance gains (e.g., FID 1.64 with 128 tokens) are significant and supported by multiple metrics (PSNR, SSIM, LPIPS).

Weakness:

  1. Experiments are primarily conducted on ImageNet 256×256. More experiments on ImageNet 512×512 or text-to-image (T2I) generation could be conducted.
  2. The paper highlights faster training convergence in line 63 but lacks concrete metrics (e.g., GPU hours, memory footprint) to quantify computational efficiency against baselines (e.g., SiT-XL/2). How does GPSToken optimize practical deployment costs? Please supplement with detailed computational benchmarks (e.g., training time per epoch, inference time, VRAM usage) and comparative analysis.

Questions

  1. In the "GPSToken-driven Two-stage Image Generation" framework, which specific loss function is employed to optimize the model, and how do these losses contribute to the overall generation quality and semantic consistency of the generated images?
  2. What is the detailed training recipe adopted in "GPSToken-driven Two-stage Image Generation"? For instance, are there any specific layers or components, such as the decoder, whose weights are frozen during certain training phases? And how do these strategies impact the model's convergence speed and performance in generating high-fidelity images?

Limitations

yes

Justification for Final Rating

Thanks for the response from the authors. The rebuttal has addressed my concerns. I hope the authors include the additional experiments in their manuscript to better support their claims.

Formatting Issues

No major concerns

Author Response

We sincerely thank the reviewer for the constructive comments and suggestions. We hope the following point-by-point responses address the reviewer's concerns.


Q1: Higher resolution and T2I tasks.

Regarding higher-resolution image reconstruction tasks, we refer the reviewer to our response to Q4 of Reviewer EBwp, where additional experiments on $512 \times 512$ and $1024 \times 1024$ images are presented. The successful tokenization and faithful reconstruction at these scales indicate that our method can reliably encode fine-grained visual details -- a critical prerequisite for high-resolution generation.

Regarding high-resolution or text-to-image (T2I) generation tasks, we acknowledge that additional experiments would further validate the applicability of our tokenizer in generative frameworks. However, such experiments generally require substantial computational resources, typically involving several weeks to months of training on large GPU clusters. Due to these practical constraints, we are currently unable to complete them within the rebuttal timeline. We plan to conduct these studies as part of our future work.

We believe that our experiments are well-aligned with recent advances in the context of image tokenizers. Many recent works (e.g., MaskBit (ICLR 2025), VAVAE (CVPR 2025)) primarily focus on reconstruction and generation tasks on ImageNet $256 \times 256$ images. Our experimental design follows this established convention, prioritizing the analysis of the tokenization process itself.


Q2: Computational benchmarks.

We thank the reviewer for the insightful comment. In response, we provide comprehensive computational benchmarks by comparing GPSToken with the SiT-XL/2 baseline in both training and inference.

As shown in the table below, at 1M iterations, GPSToken achieves a significantly lower FID score (7.61 vs. 14.50), with 42% less training time (256h vs. 439h), 73% higher training throughput (1.09 vs. 0.63 iters/s), and 35% lower VRAM consumption (41,498 MB vs. 63,684 MB). During inference, although VRAM usage increases slightly (9,636 MB vs. 9,126 MB), our method nearly doubles throughput (0.129 vs. 0.067 samples/s), reducing latency by approximately half.

| Method | Metric | 500K | 1000K | T-Mem (MB) | T-Thpt (iters/s) | I-Mem (MB) | I-Thpt (samples/s) |
|---|---|---|---|---|---|---|---|
| Baseline | FID | 19.07 | 14.50 | 63,684 | 0.63 | 9,126 | 0.067 |
| Baseline | Time (h) | 219 | 439 | | | | |
| Ours | FID | 9.57 | 7.61 | 41,498 | 1.09 | 9,636 | 0.129 |
| Ours | Time (h) | 128 | 256 | | | | |

Notes: T-Mem: Training Memory, T-Thpt: Training Throughput, I-Mem: Inference Memory, I-Thpt: Inference Throughput

These efficiency gains are attributed to two key design choices:

(i) GPSToken reduces the number of effective tokens, lowering computational and memory overhead (see the cost note after this list);
(ii) the two-stage generation framework simplifies the learning objective and stabilizes optimization, facilitating faster convergence to higher-quality solutions.
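As a general note on point (i) (standard transformer cost accounting, not a figure from the paper): the per-layer cost of a transformer generator over $N$ tokens with hidden width $d$ scales as

$$\underbrace{\mathcal{O}(N^2 d)}_{\text{self-attention}} + \underbrace{\mathcal{O}(N d^2)}_{\text{projections / MLP}},$$

so halving the token count cuts the attention term by roughly 4× and the linear terms by roughly 2×, in addition to the activation memory saved.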

Overall, GPSToken not only improves generation quality but also significantly reduces training cost and inference latency.


Q3: Loss functions for generators.

Thank you for the question. We apologize for the lack of detail on the loss functions for the generators.

Both the layout and texture generators use the standard velocity matching loss from SiT: the model predicts the diffusion velocity $\mathbf{v}_\theta(\mathbf{x}_t, t)$, and we apply an $L_2$ loss to match the target $\mathbf{v}^*$:

$$\mathcal{L} = \mathbb{E}\left[\|\mathbf{v}_\theta(\mathbf{x}_t, t) - \mathbf{v}^*\|^2\right].$$

This loss promotes high generation quality by modeling data dynamics accurately. In the layout stage, it encourages semantically coherent Gaussian arrangements. In the texture stage, it ensures fidelity and alignment with layout conditions, enhancing overall semantic consistency.
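For concreteness, here is a minimal PyTorch sketch of this objective under one common linear-interpolant schedule ($\mathbf{x}_t = (1-t)\mathbf{x} + t\boldsymbol{\epsilon}$, giving $\mathbf{v}^* = \boldsymbol{\epsilon} - \mathbf{x}$). SiT supports several interpolants, so the schedule below is an illustrative assumption, not necessarily the paper's exact configuration:

```python
import torch

def velocity_matching_loss(model, x, cond=None):
    """L2 velocity matching under a linear interpolant x_t = (1 - t) * x + t * eps.

    With this schedule, v* = d x_t / d t = eps - x. `model` predicts
    v_theta(x_t, t, cond); `x` is a batch of clean latents (e.g., GPS tokens).
    """
    eps = torch.randn_like(x)
    t = torch.rand(x.shape[0], device=x.device)
    t_b = t.view(-1, *([1] * (x.dim() - 1)))   # broadcast t over feature dims
    x_t = (1 - t_b) * x + t_b * eps
    v_target = eps - x
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```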


Q4: Training recipe for generators.

Both the layout and texture generators are trained independently and end-to-end, with no component frozen — all parameters are updated during training.

Specifically, the layout generator uses a SiT-B/4 architecture and is trained to model $g_{init}$ (which encodes only the position and shape of the initialized Gaussians) using the standard velocity matching loss. Since $g_{init}$ represents a simple geometric prior, its distribution is relatively easy to learn. The conditional texture generator employs a SiT-XL/2 architecture and learns to generate the refined layout $g$ and image features $f$, using $\{g, f\}$ as targets and the same velocity matching loss. Although modeling their joint distribution is challenging, conditioning on the generated $g_{init}$ — from which $g$ is refined — provides strong structural guidance, significantly reducing optimization difficulty.

At inference, a calibration step is employed to refine $g_{init}$ and correct minor errors (L190 in the main paper), mitigating potential misalignment from independent training.

As shown in Tab. 2 and Figs. 7/8 in the main paper, this strategy enables faster convergence and high-fidelity generation, validating the effectiveness of our decoupled yet coherent two-stage design.
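Putting the two stages together, a schematic sketch of the sampling pipeline implied by this recipe; all function and module names are hypothetical placeholders for illustration, not the released API:

```python
import torch

def calibrate(g_init):
    # Placeholder for the inference-time calibration step (L190, main paper),
    # which refines g_init to correct minor errors; identity here for the sketch.
    return g_init

@torch.no_grad()
def generate_images(layout_gen, texture_gen, renderer, decoder, n=8):
    # Stage 1: the SiT-B/4 layout generator samples g_init
    # (position/shape of the initialized Gaussians only).
    g_init = calibrate(layout_gen.sample(batch_size=n))
    # Stage 2: the SiT-XL/2 texture generator samples the refined layout g and
    # texture features f, conditioned on g_init for structural guidance.
    g, f = texture_gen.sample(batch_size=n, cond=g_init)
    # Splat the Gaussian tokens into a 2D feature map, then decode to pixels.
    return decoder(renderer(g, f))
```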

Comment

I appreciate the authors' detailed response. All my concerns have been resolved, and I will raise my score.

Comment

Many thanks for your support of our work! We will incorporate your suggestions into the revision of the manuscript.

Authors of paper #1542

Comment

Dear Reviewer xDKG,

Many thanks for your time in reviewing our paper and your constructive comments. We have submitted the point-by-point responses. We would appreciate it if you could let us know whether your concerns have been addressed, and we are happy to answer any further questions.

Best regards,

Authors of paper #1542

Official Review
Rating: 4

The paper proposes GPSToken, a way to break an image into variable-sized “Gaussian” tokens instead of a fixed grid. Each token stores its position/shape (as a 2-D Gaussian) and a texture feature. A differentiable renderer turns these tokens back into an image-like map so standard networks can use them. For image generation, the authors first predict a coarse layout of Gaussians and then fill in texture with a diffusion model, achieving better FID scores on ImageNet-256 than previous tokenization methods while converging faster.

Strengths and Weaknesses

Paper Strengths

  • Strong quantitative gains over many recent 1-D/2-D tokenizers on ImageNet reconstruction & generation.
  • Paper is well organized; code release promised.

Weaknesses

  • Model is large (≈128 M tokenizer + 675 M generator) and still needs 64–256 tokens, so actual memory/FLOPs vs. standard VQ/VAE remain unclear.
  • Many hyper-parameters (entropy threshold λ, support factor s, calibration algorithm) are given without a sensitivity study.
  • Lacks wall-clock latency / memory benchmarks and statistical significance (no error bars).
  • Speed-ups are only theoretical; pooling overhead & two extra mat-muls may erode gains for small images—no profiling provided.

Questions

  • The tokenizer alone has >120 M parameters, and splatting requires per-pixel Gaussian evaluation. Please report (i) encoder/decoder FLOPs, (ii) GPU memory during training and inference, and (iii) wall-clock throughput compared with VQGAN and TiTok at equal image size. Demonstrating competitive efficiency would strengthen the practical impact.

  • Can GPSToken scale to 512×512 or 1024×1024 without exploding token counts? A small experiment or theoretical discussion on how σ scales with resolution—and whether the same model can tokenize non-natural images (e.g., medical, satellite)—would clarify robustness.

Limitations

Yes

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for the constructive comments and suggestions. We hope the following point-by-point responses address the reviewer's concerns.


Q1: Params, FLOPs, memory, latency and throughput.

In terms of model size, our GPSToken tokenizer contains approximately 128M parameters, which is comparable to GaussianToken (130M) and larger than VQVAE-f16 (89M). Meanwhile, TiTok-B64 has 205M parameters and FlexTok has 950M. Overall, GPSToken is highly competitive in model scale.

As shown in the following table, GPSToken has moderate computational cost. While the FLOPs of its decoder are higher than those of VQVAE-f16 and TiTok-B64, its latency is competitive with GaussianToken and significantly better than FlexTok, which employs a heavy autoregressive decoder. In terms of memory, GPSToken uses less GPU memory than FlexTok and GaussianToken during inference, and less memory than VQVAE-f16 during training, exhibiting favorable memory efficiency in both phases.

| Method | Params (M) | Encoder FLOPs (G) | Decoder FLOPs (G) | Train Memory (MB) | Inference Memory (MB) | Latency (ms) | Throughput (samples/s) |
|---|---|---|---|---|---|---|---|
| VQVAE-f16 | 89.65 | 56 | 1014 | 78222 | 2627 | 72 ± 0.05 | 111.44 |
| TiTok-B64 | 204.82 | 209 | 73 | 45892 | 2359 | 96 ± 0.08 | 83.44 |
| GPSToken-M128 | 127.8 | 383 | 2689 | 50793 | 2567 | 180 ± 0.15 | 44.56 |
| GaussianToken | 130.61 | 706 | 2285 | 60781 | 5352 | 181 ± 0.17 | 44.32 |
| FlexTok | 949.7 | 283 | 7665 | – | 7275 | 2714 ± 3.1 | 2.88 |

Table: Comparison of parameter size, computational cost, memory usage, latency, and throughput for generating $256 \times 256$ images on an A100 GPU. Batch size is 8 for inference and 16 for training. Latency is averaged over 20 runs with the standard deviation reported. Throughput is measured in samples per second. Training memory for FlexTok is unavailable due to the lack of released training code.

While GPSToken ranks mid-tier in standalone efficiency, it is worth mentioning that tokenizers typically play a minor role in the overall latency of generation pipelines. The diffusion generator dominates the inference time; hence, tokenizer efficiency is not our primary design goal. For instance, generating eight $256 \times 256$ images takes the diffusion model about 60 seconds, while decoding the latents with GPSToken requires only 0.15 seconds, which is negligible compared to the total latency.

We hope the above analysis clarifies the efficiency profile of GPSToken and demonstrates its practical viability in real-world pipelines.


Q2: Hyper-parameter experiments.

The experimental results on $\lambda$, $s$, and $s_{min}$ (the hyper-parameter in the calibration algorithm) are as follows.

| Hyper-parameter | PSNR | SSIM | LPIPS | rec. FID | FID |
|---|---|---|---|---|---|
| $\lambda = 0$ | 23.52 | 0.638 | 0.110 | 1.02 | 2.59 |
| $\lambda = 5$ | 24.06 | 0.652 | 0.083 | 0.68 | 2.23 |
| $s = 1$ | 17.55 | 0.4388 | 0.3435 | 160 | 165 |
| $s = 3$ | 24.07 | 0.657 | 0.080 | 0.66 | 2.18 |
| $s = 7$ | 24.06 | 0.658 | 0.080 | 0.66 | 2.16 |
| $s_{min} = 8$ | 24.05 | 0.656 | 0.080 | 0.66 | 2.18 |
| $s_{min} = 16$ | 24.03 | 0.657 | 0.081 | 0.66 | 2.19 |
| Ours ($\lambda = 2.5$, $s = 5$, $s_{min} = 4$) | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |

Entropy threshold $\lambda$: As stated in Eq. 4 of the main paper, $\lambda$ balances region size and complexity. A larger value ($\lambda = 5$) encourages Gaussians to concentrate on complex regions, leading to minor performance degradation. In contrast, setting $\lambda = 0$ allocates Gaussians solely based on region size, resulting in a uniform spatial distribution. This causes a significant drop in performance: LPIPS increases from 0.08 to 0.11, and rec. FID rises from 0.65 to 1.02. We set $\lambda$ to 2.5 based on experimental experience.
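To illustrate the entropy-driven criterion discussed here, a small sketch of a plausible split-priority score follows; Eq. 4's exact form is in the main paper and not reproduced in this thread, so the additive combination below is our assumption, chosen only so that $\lambda = 0$ degenerates to size-only splitting as described above:

```python
import numpy as np

def region_entropy(gray_region):
    """Shannon entropy of a grayscale region, a standard texture-complexity proxy."""
    hist, _ = np.histogram(gray_region, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def split_priority(gray_region, lam=2.5):
    # Hypothetical size-plus-complexity score: with lam = 0 only region size
    # matters, reproducing the uniform allocation of the lambda = 0 ablation.
    return np.log(gray_region.size) + lam * region_entropy(gray_region)
```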

Support factor $s$: As stated in Eq. 2 of the main paper, $s$ controls the effective rendering support of each Gaussian. Performance degrades significantly when $s = 1$, but stabilizes for $s \geq 3$. This aligns with the $3\sigma$ rule: 99.7% of the mass of a 2D Gaussian lies within three standard deviations of the mean. To ensure full coverage, we set $s = 5$ in all experiments.

Minimal region size $s_{min}$: We set $s_{min}$ in the calibration algorithm to match its value in the initialization algorithm, where $s_{min}$ determines the minimum width or height of each region. We observe that increasing $s_{min}$ from 4 to 16 results in negligible performance degradation. This is expected because, with 128 tokens representing a $256 \times 256$ image, the average spatial extent per token is approximately $22 \times 22$ pixels. Consequently, most segmented regions naturally have a width or height of at least 16, making the choice of $s_{min}$ within this range largely inconsequential for the final tokenization.


Q3: Speed-ups for two-stage generator.

Existing methods that use pooling layers for speed-up often sacrifice performance (e.g., DiT shows FID degradation from 19.47 to 43.01 with larger pooling strides). In contrast, our GPSToken delivers practical speed-ups in training.

As shown in the following table, compared to the baseline (SiT-XL/2), our method achieves better FID (9.57 vs. 19.07 at 500K iterations) while reducing training time by 42% (128h vs. 219h) and boosting throughput by 73% (1.09 vs. 0.63 iters/s) with a reduced memory usage. These results confirm tangible speed-ups of our two-stage generator.

| Method | Metric | 500K | 1000K | T-Mem | T-Thpt |
|---|---|---|---|---|---|
| Baseline | FID | 19.07 | 14.50 | 63,684 | 0.63 |
| Baseline | Time (h) | 219 | 439 | | |
| Ours | FID | 9.57 | 7.61 | 41,498 | 1.09 |
| Ours | Time (h) | 128 | 256 | | |

Notes: T-Mem: Training Memory (MB), T-Thpt: Training Throughput (iters/s)


Q4: Higher resolution.

Thanks for the insightful question on the scalability of GPSToken to higher resolutions.

Similar to prior work, GPSToken requires token count to scale linearly with pixel count for consistent reconstruction quality. As shown in the table below, 128 tokens for 256×256 images achieve performance (e.g., LPIPS: 0.080, FID: 2.18) comparable to 512 tokens for 512×512 or 2048 for 1024×1024. However, using only 128 tokens for 512×512 images sharply degrades performance (LPIPS: 0.274, FID: 21.0), indicating severe loss of detail.

| Resolution | Token Count | PSNR | SSIM | LPIPS | rFID | FID |
|---|---|---|---|---|---|---|
| $256 \times 256$ | 128 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| $512 \times 512$ | 128 | 21.92 | 0.586 | 0.274 | 18.13 | 21.01 |
| $512 \times 512$ | 512 | 26.44 | 0.754 | 0.084 | 0.61 | 2.04 |
| $1024 \times 1024$ | 512 | 23.47 | 0.729 | 0.257 | 7.52 | 9.54 |
| $1024 \times 1024$ | 2048 | 27.16 | 0.848 | 0.093 | 0.88 | 2.32 |

This is visually evident in Fig. 6 of the main paper (32 tokens at 256×256), where coarse structures remain but fine textures blur due to insufficient token density—similar to using 128 tokens for 512×512, supporting our observation.

Notably, PSNR and SSIM improve at higher resolutions with linearly scaled tokens. This occurs because content within each token’s receptive field becomes smoother and less textured, easing reconstruction and inflating pixel-wise and structural similarity scores.

Regarding the parameter $\sigma$, which controls the spatial extent of each GPS token, we hypothesize that it should scale with the average area covered per token. Specifically, the average area per token is proportional to $H \times W / N$, where $H \times W$ is the image resolution and $N$ is the number of tokens. To maintain consistent local modeling capacity, $\sigma$ should scale with the square root of this area:

$$\sigma \propto \sqrt{\frac{H \times W}{N}}.$$

Thus, when the resolution increases but $N$ remains fixed, $\sigma$ must increase to cover larger regions, resulting in over-smoothing and loss of detail. When both the resolution and $N$ scale proportionally, $\sigma$ can remain unchanged, preserving local fidelity.
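As a worked instance of this proportionality (our arithmetic, consistent with the figure quoted in Q2 above):

$$\sqrt{\frac{H \times W}{N}} = \sqrt{\frac{256 \times 256}{128}} \approx 22.6,$$

which matches the average spatial extent of roughly $22 \times 22$ pixels per token for 128 tokens on a $256 \times 256$ image, and stays fixed when both the pixel count and $N$ are scaled by the same factor (e.g., $512 \times 512$ with 512 tokens).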


Q5: Non-natural images.

To evaluate the robustness and generalization of our GPSToken to non-natural images, we conducted experiments on public datasets: STARE (retinal fundus images in medical imaging) and WHU_RS19 (satellite remote sensing images). We compare against VAVAE (256 tokens, a leading 2D tokenizer) and MAETok (128 tokens, a top-performing 1D tokenizer).

In the table below, we report PSNR, SSIM, and LPIPS scores, which are more suitable for evaluating reconstruction quality on non-photorealistic images compared to metrics like FID.

| Dataset | Token Count | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| STARE | 128 | MAETok | 32.98 | 0.818 | 0.051 |
| STARE | 128 | GPSToken-M128 | 34.75 | 0.868 | 0.036 |
| STARE | 256 | VAVAE | 36.32 | 0.896 | 0.019 |
| STARE | 256 | GPSToken-L256 | 37.60 | 0.915 | 0.014 |
| WHU_RS19 | 128 | MAETok | 21.73 | 0.506 | 0.195 |
| WHU_RS19 | 128 | GPSToken-M128 | 23.20 | 0.560 | 0.127 |
| WHU_RS19 | 256 | VAVAE | 23.57 | 0.619 | 0.142 |
| WHU_RS19 | 256 | GPSToken-L256 | 26.33 | 0.731 | 0.064 |

GPSToken consistently outperforms competitors at comparable token counts, e.g., achieving higher PSNR (26.33 vs. 23.57) than VAVAE on WHU_RS19. Due to space limits, visual comparisons are deferred to the revision. GPSToken-L256 also better reconstructs fine details (e.g., capillaries, trees, cars) while preserving large-scale structures.

These findings suggest that GPSToken is not only effective on natural images, but also generalizes well to diverse image domains, making it a robust and versatile tokenizer for various vision tasks.

Comment

Dear Reviewer EBwp,

Many thanks for your time in reviewing our paper and your constructive comments. We have submitted the point-by-point responses. We would appreciate it if you could let us know whether your concerns have been addressed, and we are happy to answer any further questions.

Best regards,

Authors of paper #1542

Comment

Dear Reviewers of paper #1542,

Many thanks for your time and engagement in reviewing our manuscript. We have uploaded our responses to your valuable comments. Since the deadline of the reviewer-author discussion period is approaching, we would deeply appreciate it if you could let us know whether your concerns have been addressed. We are also happy to answer any further questions.

Best regards,

Authors of paper #1542

Final Decision

This paper tackles spatially-adaptive image tokenization with a Gaussian parameterized approach. In contrast to fixed spatial correspondence, tokens are initialized with an iterative splitting algorithm to represent regions of variable size. Empirical evidence shows its advantage in reconstruction quality, generation quality, and training convergence speed. The final version should reflect the suggestions from reviewers and the promises in the rebuttal.