PaperHub
4.4 / 10 · Poster · ICML 2025
4 reviewers (ratings: 3, 3, 2, 2; min 2, max 3, std 0.5)

Visual Autoregressive Modeling for Image Super-Resolution

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24
TL;DR

This paper introduces VARSR, a novel visual autoregressive framework for image super-resolution.

Abstract

Keywords
visual autoregressive modeling; image super-resolution

Reviews and Discussion

Review (Rating: 3)

This paper proposes VARSR, a visual autoregressive framework for image super-resolution (ISR), addressing the trade-off between fidelity and realism. By leveraging next-scale prediction, prefix tokens for LR conditioning, scale-aligned rotary positional encodings (SA-RoPE), and a diffusion refiner for quantization residual modeling, VARSR achieves promising results in perceptual quality while maintaining computational efficiency. The authors also introduce a large-scale dataset and an image-based classifier-free guidance (CFG) mechanism to enhance realism. Extensive experiments demonstrate VARSR’s superiority over GAN- and diffusion-based methods in both qualitative and quantitative metrics, with significant efficiency gains.

Questions for the Authors

  1. How does the computational cost of VARSR scale with output resolution? The paper mentions tiling for higher resolutions but lacks empirical analysis.

  2. Could the diffusion refiner be replaced with a lightweight alternative (e.g., GAN-based) without sacrificing performance?

  3. The prefix token approach assumes LR and HR scales are spatially aligned. How does VARSR handle severe misalignment (e.g., rotation or perspective distortion in LR inputs)?

Claims and Evidence

The paper provides strong empirical evidence to support its claims, including quantitative results, qualitative comparisons, ablation studies, and user evaluations. These claims are well-aligned with the broader goals of ISR and generative modeling, and the evidence demonstrates VARSR’s effectiveness in addressing key challenges in the field.

Methods and Evaluation Criteria

The methods and evaluation criteria in the paper are well-designed and comprehensive. VARSR’s innovations (e.g., next-scale prediction, prefix tokens, SA-RoPE, diffusion refiner, CFG) are rigorously validated through quantitative metrics, qualitative comparisons, ablation studies, and human evaluation. The use of both reference-based and non-reference metrics ensures a balanced assessment of fidelity and perceptual quality, while the user study provides valuable insights into real-world applicability.

Theoretical Claims

The theoretical claims in the paper are well-supported by both theoretical justifications and empirical evidence. The use of a large-scale dataset and a robust training pipeline further enhances the model’s generative priors, making it a strong candidate for real-world ISR applications.

Experimental Design and Analysis

The experimental designs and analyses in the paper are comprehensive and well-structured. They effectively validate VARSR’s performance through quantitative metrics, qualitative results, ablation studies, and human evaluation. The use of both synthetic and real-world datasets ensures a robust evaluation, while the ablation studies provide valuable insights into the contributions of each component.

Supplementary Material

The supplementary material covers implementation specifics, additional ablation studies, and visualizations, reinforcing the paper’s claims and demonstrating VARSR’s effectiveness in ISR. The inclusion of limitations and real-world benchmarks further highlights the practical applicability and areas for future improvement.

Relation to Prior Literature

  • Image Super-Resolution (ISR)

ISR is a well-studied problem in computer vision, aiming to reconstruct high-resolution (HR) images from low-resolution (LR) counterparts. Traditional methods (e.g., SRCNN, VDSR) focus on pixel-level fidelity but struggle with real-world degradations and perceptual quality.

  • Autoregressive Modeling

Autoregressive models, popularized in language modeling (e.g., GPT, LLaMA), have recently been adapted to vision tasks (e.g., VQVAE, DALL-E). These models generate images by predicting tokens sequentially, often in a coarse-to-fine manner.

  • Generative Models

Generative models, including GANs, VAEs, and diffusion models, have revolutionized image synthesis and restoration. Each has strengths (e.g., GANs for realism, diffusion for detail) and weaknesses (e.g., GAN instability, diffusion inefficiency).

Missing Essential References

The references are relatively comprehensive.

Other Strengths and Weaknesses

  • Strengths
  1. First application of visual autoregressive (VAR) modeling to ISR, introducing next-scale prediction as a core mechanism.

  2. The large-scale dataset (4M images) and training pipeline (C2I pretraining + ISR finetuning) enhance generative priors.

  3. Comprehensive evaluations across synthetic (DIV2K-Val) and real-world datasets (RealSR, DRealSR) using both reference (PSNR, SSIM) and non-reference metrics (MANIQA, CLIPIQA, MUSIQ).

  4. VARSR outperforms SOTA methods (e.g., PASD, SeeSR) in perceptual metrics (e.g., +6.7% MANIQA, +3.7% CLIPIQA) while matching diffusion models in fidelity.

  5. Efficiency: 10× faster inference than diffusion methods (0.59s vs. 5.85s for DiffBIR).

  • Weaknesses
  1. Some implementation details are under-explained, e.g., the exact architecture of the diffusion refiner, the interaction between SA-RoPE and multi-scale tokens, and the training dynamics of the VQVAE with scale dropout.

  2. The user study (50 images, 20 participants) is relatively small-scale. Expanding this would strengthen claims about human preference.

  3. The large-scale dataset’s curation process (e.g., semantic balance, filtering thresholds) is described briefly. A deeper discussion of potential biases or limitations (e.g., domain coverage) is needed.

  4. Limited testing on niche or highly degraded real-world scenarios (e.g., historical photographs, extreme compression artifacts).

Other Comments or Suggestions

None.

Author Response

A1. Implementation details.

Thank you for your suggestions. We will elaborate further on these details in our supplementary materials, for example the MLP architecture of the diffusion refiner, which consists of linear and activation layers. Additionally, our code will be open-sourced so readers can verify the implementation details.
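For readers who want a concrete picture, below is a minimal sketch of an MLP of the kind described here (alternating linear and activation layers). The class name, depth, width, and choice of activation are our own illustrative assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class RefinerMLP(nn.Module):
    """Small MLP of alternating linear and activation layers (hypothetical sizes)."""
    def __init__(self, dim: int = 768, hidden: int = 1024, num_layers: int = 6):
        super().__init__()
        layers, in_dim = [], dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(in_dim, hidden), nn.SiLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))  # project back to the token/latent dimension
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# refine per-token features of shape (batch, tokens, dim)
out = RefinerMLP()(torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 256, 768])
```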

A2. User Study.

We conducted a larger-scale user study in which 20 participants evaluated 100 images. This scale is already large compared to previous research (e.g., PASD in ECCV 2024 involved 15 participants evaluating 40 images). The results further validate our claim: VARSR attains the highest selection rate of 57.1%, significantly surpassing alternative methods. This underscores VARSR's capability to produce lifelike images in real-world settings that align with human aesthetics.

Method          | BSRGAN | Real-ESRGAN | StableSR | PASD  | SeeSR  | VARSR (Ours)
Selection Rate  | 0.35%  | 0.7%        | 3.5%     | 17.3% | 21.05% | 57.1%

A3. Domain coverage of our large-scale dataset.

To achieve diversity and balance across domains, we conducted semantic clustering and supplemented data for specific categories. As shown in the table below, the dataset covers a wide range of scene categories in relatively balanced proportions (scenes with a broader semantic scope correspond to a higher proportion of images). These include portraits, people, food, animals, natural landscapes, cartoons, cityscapes, and indoor and outdoor scenes, ensuring comprehensive coverage of visual concepts and rich scene content.

Admittedly, given the limited number of images, some rare scenes may not be well covered, such as the semantic scenes discussed in Sec. B.4 (Limitations) of our supplementary material.

Category | Indoor | Outdoor | Nature | Human | Plant | Object | Animal | Text | Food | Cartoon | Others
Rate     | 10%    | 9%      | 13%    | 28%   | 8%    | 9%     | 6%     | 4%   | 8%   | 2%      | 3%
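As an illustration of the kind of semantic clustering and balancing described above, the sketch below clusters precomputed image embeddings and reports per-cluster proportions. The embedding source, the number of clusters, and the use of scikit-learn's KMeans are our assumptions for illustration, not the authors' actual curation pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in for image embeddings (e.g., from a pretrained vision encoder), L2-normalized
embeddings = np.random.randn(5000, 512).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# cluster into semantic groups and inspect how balanced the collection is
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(embeddings)
counts = np.bincount(kmeans.labels_, minlength=11)
rates = counts / counts.sum()
print({f"cluster_{i}": f"{r:.1%}" for i, r in enumerate(rates)})
```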

A4. Highly degraded scenarios.

In Sec. B.2 of our Supp., we conducted evaluations on the RealLR200 dataset and achieved SOTA results compared to previous methods. The RealLR200 dataset already contains many highly degraded real-world images, such as historical photographs and extreme compression artifacts.

We further provided restoration results in these extreme degradation scenarios, which can be accessed at https://figshare.com/articles/figure/extreme_pdf/28668452?file=53243348. It can be observed that VARSR can still produce faithful and high-quality results in such extreme degradation scenarios, validating the robust capability of VARSR.

A5. Tiling for High-resolution images.

VARSR adopts a tiling approach for generating high-resolution images, consistent with diffusion-based methods. Specifically, we uniformly divide the high-resolution image to be generated into overlapping grids, each with a standard resolution of 512×512. VARSR generates the SR result for each grid separately and then tiles the grids together to obtain the complete restored image.

Therefore, the computational cost increases linearly with the output resolution. However, different grids can be batched and processed in parallel to accelerate inference, substantially reducing the actual wall-clock time. Importantly, because VARSR and the diffusion models use the same tiling approach, the roughly tenfold efficiency advantage over diffusion models persists when handling images of various resolutions.
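A minimal sketch of the overlapping-grid tiling described above is given below. The 512-pixel tile size matches the response, but the overlap and the simple averaging of overlapping regions are illustrative assumptions; the paper's exact blending scheme may differ.

```python
import torch

def tiled_sr(lr_up, sr_fn, tile=512, overlap=64):
    """Restore overlapping tiles with sr_fn, then average the overlaps back together.

    lr_up: (C, H, W) tensor already on the output-resolution grid.
    sr_fn: callable mapping a (C, h, w) crop to a restored crop of the same size.
    """
    _, h, w = lr_up.shape
    stride = tile - overlap
    ys = list(range(0, max(h - tile, 1), stride)) + [max(h - tile, 0)]
    xs = list(range(0, max(w - tile, 1), stride)) + [max(w - tile, 0)]
    out = torch.zeros_like(lr_up)
    weight = torch.zeros(1, h, w)
    for y in ys:
        for x in xs:
            crop = lr_up[:, y:y + tile, x:x + tile]
            out[:, y:y + tile, x:x + tile] += sr_fn(crop)   # tiles could also be batched
            weight[:, y:y + tile, x:x + tile] += 1.0
    return out / weight.clamp(min=1.0)

# toy usage with an identity "restorer"
full = tiled_sr(torch.rand(3, 1024, 1024), sr_fn=lambda t: t)
```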

A6. Clarification of Diffusion Refiner.

Please see A4 to Reviewer yFzJ.

A7. Spatial Misalignment.

In the field of image restoration, our goal is to recover the original HR image from severely degraded LR images, where degradation refers to distortions such as blur, noise, compression, and other artifacts introduced during capture or transmission, as defined in the classic ISR work SRCNN (TPAMI 2015).

Spatial distortions such as rotation or flipping typically do not fall within the scope of degradations addressed in the ISR field. When such distortions occur, preprocessing methods (e.g., corner detection) can be used to restore the spatial positioning, followed by the image restoration process. Therefore, we can assume that LR and HR images are always spatially aligned across scales. Our proposed SA-RoPE effectively conveys this spatial structural consistency, thereby enhancing ISR performance.
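To make the spatial-alignment argument concrete, the sketch below shows one plausible reading of scale-aligned rotary positions: the token grid at every scale is mapped to a shared normalized coordinate frame before standard RoPE frequencies are applied, so tokens covering the same image region receive matching phases. This is our interpretation for illustration only, not the paper's exact SA-RoPE formulation.

```python
import torch

def aligned_coords(scale_hw):
    """Map each token (i, j) of an s_h x s_w grid to coordinates in [0, 1)^2,
    so grids of different scales share one spatial coordinate frame."""
    s_h, s_w = scale_hw
    ys = (torch.arange(s_h) + 0.5) / s_h
    xs = (torch.arange(s_w) + 0.5) / s_w
    return torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (s_h, s_w, 2)

def rope_angles(coords, dim=64, base=10000.0):
    """Standard RoPE frequencies applied to the shared coordinates (half of dim per axis)."""
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, 2).float() / half)
    ang_y = coords[..., 0:1] * freqs
    ang_x = coords[..., 1:2] * freqs
    return torch.cat([ang_y, ang_x], dim=-1)  # rotate query/key feature pairs by these angles

print(rope_angles(aligned_coords((4, 4))).shape)    # torch.Size([4, 4, 32])
print(rope_angles(aligned_coords((16, 16))).shape)  # torch.Size([16, 16, 32])
```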

Review (Rating: 3)

This work presents a novel generative model for image super-resolution, leveraging visual autoregressive modeling. The generative process is conditioned on the low-resolution input, treated as a prefix token. To enhance performance, the authors introduce scale-aligned positional encoding and a diffusion-based refinement step to mitigate quantization losses. The proposed method surpasses previous diffusion-based models in both fidelity and efficiency.

after rebuttal: weak accept

Questions for the Authors

  • How does the model perform when using a standard super-resolution model as the refiner instead of a diffusion model? Does the efficiency also improve?

  • How does the model perform on IQA metrics that were not used for data filtering?

Claims and Evidence

The authors identify the challenges of adapting autoregressive models for image super-resolution and propose tailored solutions for each issue.

Methods and Evaluation Criteria

Yes, please refer to experimental design for more details.

Theoretical Claims

No, as there are no proofs or theoretical claims.

Experimental Design and Analysis

Yes. The experimental validation is consistent with previous work, and sufficient ablations regarding the contributions are provided. However, the data collection and filtering processes remain questionable.

Supplementary Material

Yes, I read the entire supplement.

Relation to Prior Literature

This paper explores a novel generative framework recently popularized by VAR, while also tackling efficiency, adding further value to the approach.

Missing Essential References

Essential references are discussed.

Other Strengths and Weaknesses

Strengths:

  • Overall, this is a strong and well-written paper, supported by extensive experimental results.

  • The authors clearly articulate their motivation and provide thorough explanations of their methodology, which significantly improves the efficiency of generative image super-resolution; this is a valuable contribution.

Weaknesses:

  • The use of classifier-free guidance noticeably reduces SSIM, likely affecting PSNR similarly. Established perceptual metrics like LPIPS also show a negative impact.

  • The larger dataset appears to negatively influence SSIM and LPIPS. Since a filtering stage based on MUSIQ/MANIQA was applied, it seems that the selected images are optimized specifically for these metrics, raising concerns about data representativeness.

  • The overall performance of VARSR seems strong specifically on the metrics used for pre-filtering the data. Compared to previous methods like SeeSR, which use smaller and lower-quality datasets, this brings the value of the collected data into question and the fairness of evaluation.

  • The training setup is complex and resource-intensive for a method that aims to be a more efficient alternative to diffusion, yet diffusion processes remain integral to the approach.

Other Comments or Suggestions

None.

Author Response

A1. Effectiveness of CFG.

CFG leads to a certain reduction in fidelity metrics but significantly improves the perceptual quality of the image. Existing fidelity metrics have limitations in accurately measuring human perceptual quality, especially when the original HR image quality is low. In Fig.4, generated images of higher perceptual quality for humans can lag behind in certain fidelity metrics, as overly smoothed low-quality images tend to score better on these metrics.

These limitations have been confirmed in many previous studies (e.g., SUPIR in CVPR2024 and PIPAL in ECCV2020), and mathematical derivations have verified the inherent contradiction between fidelity and quality (The Perception-Distortion Tradeoff, CVPR2018). In Fig. 9, the introduction of CFG results in significantly richer textures in the generated images, substantially improving perceptual quality to match human preferences while maintaining correct semantics.

A2. Other IQA Metrics.

Thanks for your valuable comments. First, as mentioned in A1, fidelity and IQA metrics can exhibit certain contradictions. Using our large-scale datasets for training can generate images that retain semantics and have higher quality.

Second, the good performance of VARSR in IQA metrics does not originate from specialized data filtering but from its ability to generate high-quality images. We further conduct evaluation using other well-adopted IQA metrics that were not used for data filtering, including CNNIQA (CVPR2014), HyperIQA (CVPR2020), and TOPIQ (TIP2024). As shown in the Tab below, VARSR continues to achieve SOTA results, surpassing other methods.

Dataset | Metric   | BSRGAN | RealESR | SwinIR | LDM    | StableSR | DiffBIR | PASD   | SeeSR  | VARSR (Ours)
DIV2K   | CNNIQA   | 0.5492 | 0.5652  | 0.5402 | 0.5579 | 0.6274   | 0.6413  | 0.6269 | 0.6613 | 0.6661
        | HyperIQA | 0.5682 | 0.5586  | 0.5235 | 0.5225 | 0.6100   | 0.6164  | 0.6158 | 0.6666 | 0.7031
        | TOPIQ    | 0.5413 | 0.5182  | 0.4796 | 0.4695 | 0.5923   | 0.6105  | 0.6165 | 0.6793 | 0.7020
RealSR  | CNNIQA   | 0.5513 | 0.5624  | 0.5281 | 0.5637 | 0.6029   | 0.6077  | 0.5938 | 0.6594 | 0.6692
        | HyperIQA | 0.5617 | 0.5231  | 0.5093 | 0.4936 | 0.5703   | 0.5690  | 0.6001 | 0.6746 | 0.7038
        | TOPIQ    | 0.5502 | 0.5137  | 0.4882 | 0.4762 | 0.5579   | 0.5580  | 0.5920 | 0.6854 | 0.6991
DRealSR | CNNIQA   | 0.4989 | 0.4849  | 0.5017 | 0.5367 | 0.5518   | 0.6025  | 0.5794 | 0.6132 | 0.6445
        | HyperIQA | 0.5305 | 0.4938  | 0.5074 | 0.5050 | 0.5537   | 0.5992  | 0.6008 | 0.6583 | 0.6866
        | TOPIQ    | 0.5058 | 0.4622  | 0.4694 | 0.4807 | 0.5330   | 0.5831  | 0.5963 | 0.6534 | 0.6800
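For reference, no-reference scores of this kind can be computed with an open-source IQA toolbox such as pyiqa; the snippet below assumes pyiqa exposes these metrics under the names shown and is not the evaluation code used by the authors.

```python
import torch
import pyiqa  # pip install pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
# assumed pyiqa identifiers for the three no-reference metrics above
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("cnniqa", "hyperiqa", "topiq_nr")}

img = torch.rand(1, 3, 512, 512, device=device)  # stand-in for a restored image in [0, 1]
print({name: float(m(img)) for name, m in metrics.items()})
```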

A3. Complexity of the training setup.

Our model is no more complex than diffusion methods, both requiring the same three-step process for application in ISR tasks: (1) training the VAE, (2) pre-training on C2I/T2I tasks, and (3) fine-tuning on ISR. The apparent simplicity of training in previous diffusion methods stems from directly using pre-trained Stable-Diffusion and its VAE as base models. These base models have already undergone training in the first two steps, requiring only fine-tuning for the ISR task.

However, the open-source VAR base model falls short of our needs, as it can only generate 256×256 images and its generated image quality is limited. Therefore, we need to conduct training in all three stages, which makes the setup appear more complex. We intend to open-source the base models trained in the first two stages to ease the training burden for future research and contribute more to the community.

A4. Clarification of Diffusion Refiner.

The Diffusion Refiner does not introduce an additional ISR process; it provides a mapping to the continuous residual distribution. As highlighted in lines 185-195, quantizing images into discrete vectors incurs information loss, thereby restricting the upper bound of restoration, since VAR can only predict the quantized discrete vectors of the image. Therefore, we proposed a refiner that converts predictions over the categorical vector distribution into a continuous-valued space through a diffusion loss, raising the upper bound of VAR's capacity. This idea has been validated in previous works such as MAR (NeurIPS 2024) and HART (ICLR 2025).

The refiner solely maps the probability distribution of the quantization residuals, with VAR features as the condition, and does not itself act as an ISR model. A lightweight network (only 37M parameters, about 3% of the 1.1B model) suffices for this discrete-to-continuous mapping. In the table below, a larger refiner does not yield significant gains, which was also observed in MAR (NeurIPS 2024). Thus, we believe that using a standard SR model as the refiner would not lead to improvement.

Dataset | Refiner  | PSNR  | SSIM   | LPIPS  | DISTS  | FID    | MANIQA | CLIPIQA | MUSIQ
RealSR  | 6-layer  | 24.61 | 0.7169 | 0.3504 | 0.2470 | 137.55 | 0.5570 | 0.7006  | 71.26
        | 18-layer | 24.51 | 0.7184 | 0.3492 | 0.2478 | 138.75 | 0.5521 | 0.7043  | 70.97
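To illustrate the MAR/HART-style mechanism referred to above, the sketch below trains a tiny denoiser with a diffusion-style loss on the continuous quantization residual, conditioned on the VAR feature of the same token. All names, dimensions, and the noise schedule are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Tiny MLP predicting the noise added to the continuous residual,
    conditioned on the VAR feature for the same token (sizes hypothetical)."""
    def __init__(self, res_dim=32, cond_dim=768, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(res_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, res_dim),
        )

    def forward(self, noisy_res, cond, t):
        return self.net(torch.cat([noisy_res, cond, t], dim=-1))

def refiner_diffusion_loss(model, residual, cond):
    """Simple DDPM-style noise-prediction loss on the quantization residual."""
    t = torch.rand(residual.size(0), 1)              # continuous timestep in (0, 1)
    alpha = (1.0 - t).clamp(min=1e-3)                # toy linear schedule
    noise = torch.randn_like(residual)
    noisy = alpha.sqrt() * residual + (1 - alpha).sqrt() * noise
    return ((model(noisy, cond, t) - noise) ** 2).mean()

model = ResidualDenoiser()
loss = refiner_diffusion_loss(model, torch.randn(8, 32), torch.randn(8, 768))
loss.backward()
```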
Review (Rating: 2)

This paper proposes an image super-resolution framework built on top of the VAR framework (next-scale prediction). The authors modify the original VAR architecture to ingest tokens from low-resolution image inputs and then leverage it to upsample up to the finest granularity. Given the quantized signals, they further add a diffusion refiner to close the quantization gap, i.e., to produce continuous tokens from the VAR prediction. In this process, a scale-aligned RoPE is introduced and a modified CFG is presented. Experiment-wise, they compare mostly against generative image super-resolution priors, including GAN- and diffusion-based methods, and demonstrate higher quality and efficiency.

Questions for the Authors

none

Claims and Evidence

yes.

Methods and Evaluation Criteria

Method-wise, I am confused about the necessity of the diffusion refiner and the introduction of CFG together with its positive and negative embeddings. If the message is that VAR is great for ISR, the framework should focus on pushing the upper bound of the VAR-related components. Instead, it ends up mixing diffusion with VAR, which makes the key message vague to me.

Evaluation criteria makes sense.

Theoretical Claims

no theoretical claims.

Experimental Design and Analysis

I am not sure BSRGAN or Real-ESRGAN is SOTA; both were published back in 2021. More recent GAN baselines, such as GigaGAN (CVPR 2023, https://mingukkang.github.io/GigaGAN/), are not compared directly. Similarly, a recent diffusion-based baseline, SinSR (https://github.com/wyf0912/SinSR, CVPR 2024), is missing.

Supplementary Material

yes.

Relation to Prior Literature

Repurposes VAR for image super-resolution.

Missing Essential References

They have discussed the essential prior work, e.g., VAR and VQVAE

Other Strengths and Weaknesses

Strengths:

  1. The authors did a fantastic job in ablation over their design choices.
  2. The qualitative and quantitative results show their greater quality against the selected baselines. And it makes their ablation justification easy to follow.
  3. The metrics are carefully selected and discussed.

Weakness:

  1. As stated above, I am confused about the necessity of the diffusion refiner and the introduction of CFG together with its positive and negative embeddings. If the message is that VAR is great for ISR, the framework should focus on pushing the upper bound of the VAR-related components. Instead, it ends up mixing diffusion with VAR, which makes the key message vague to me.

  2. The role of the diffusion refiner is very marginal. As can be seen from Table 5, the improvement it brings is tiny. For this level of improvement, an additional VAR scale might suffice without introducing the diffusion component. The need for CFG is likewise questionable to me.

Other Comments or Suggestions

none

Author Response

A1. Comparison with more SOTA.

BSRGAN and Real-ESRGAN are still commonly used SOTA GAN-based models due to their excellent performance. Other ISR works (e.g., SeeSR, PASD) also chose them as GAN-based baselines for comparison.

GigaGAN does not provide open-source models or code for testing on its homepage. Therefore, we further compare against the more recent DASR (GAN-based, ECCV 2022) and SinSR (diffusion-based, CVPR 2024). As shown in the table below, the results are consistent with the findings in the paper: VARSR leads by a significant margin in perceptual quality metrics, validating its strong performance.

Dataset | Method | PSNR  | SSIM   | LPIPS  | DISTS  | FID    | MANIQA | CLIPIQA | MUSIQ
DIV2K   | DASR   | 24.46 | 0.6253 | 0.3696 | 0.2533 | 57.37  | 0.3104 | 0.4960  | 53.96
        | SinSR  | 24.22 | 0.5922 | 0.3429 | 0.2157 | 42.17  | 0.4101 | 0.6411  | 61.46
        | VARSR  | 23.91 | 0.5890 | 0.3260 | 0.2218 | 35.51  | 0.5340 | 0.7347  | 71.27
RealSR  | DASR   | 25.47 | 0.7575 | 0.3240 | 0.2267 | 133.33 | 0.2470 | 0.3198  | 41.21
        | SinSR  | 24.86 | 0.7191 | 0.3472 | 0.2500 | 142.31 | 0.3985 | 0.6162  | 60.57
        | VARSR  | 24.61 | 0.7169 | 0.3504 | 0.2470 | 137.55 | 0.5570 | 0.7006  | 71.26
DRealSR | DASR   | 29.75 | 0.8262 | 0.3099 | 0.2275 | 155.36 | 0.2809 | 0.3813  | 42.41
        | SinSR  | 28.26 | 0.7443 | 0.3743 | 0.2495 | 173.42 | 0.3843 | 0.6302  | 55.28
        | VARSR  | 28.16 | 0.7652 | 0.3541 | 0.2526 | 155.87 | 0.5362 | 0.7240  | 68.15

A2. Necessity of Diffusion Refiner.

As mentioned in lines 043-051 of our paper, we believe that VAR is effective for ISR tasks. Similar to other improvements made to VAR-related components (e.g. Prefix Tokens, SA-RoPE), the Diffusion Refiner is also an attempt to enhance the upper bound of VAR in ISR tasks.

As highlighted in lines 185-195, quantizing images into discrete vectors incurs information loss. Even if VAR accurately predicts all quantized discrete tokens, the generated image still suffers from quantization loss, which limits the upper bound of restoration. Therefore, we proposed a refiner that converts predictions over the categorical discrete vector distribution into a continuous-valued space, raising the upper bound of VAR's representational capacity. The refiner uses a diffusion-form loss to model the probability of the quantization residuals rather than the complete latents, which accelerates convergence. The idea of using a diffusion refiner for discrete-to-continuous mapping has been validated in previous works such as MAR (NeurIPS 2024) and HART (ICLR 2025).

In Tab. 5, the introduction of the diffusion refiner led to improvements in all metrics, especially the perceptual quality metrics. Notably, MANIQA achieved an average improvement of 2.2%, and SSIM improved by 0.82%. Fig. 8 shows that the diffusion refiner reintroduces many image details lost in quantization (e.g., textures of the clothes and the flowers). Furthermore, the diffusion refiner is an extremely lightweight module, with only 37M parameters accounting for just 3% of the 1.1B model. We believe the diffusion refiner effectively raises the representational upper bound of VAR in ISR tasks with a very small parameter increase.

A3. Necessity of CFG.

Similarly, the introduction of CFG is also aimed at expanding the upper bound of the image quality generated by VAR, thereby generating more realistic images through guided sampling.

As mentioned in lines 152-156, VAR, GANs, and Diffusion models all target fidelity as the optimization objective in ISR tasks, which may result in generated images being overly smooth and lacking in detail. It tends to retain distortions such as blur from LR images, leading to lower human-perceived quality. To address this, we propose an image-based CFG that follows the principles of the standard CFG. By learning low-quality image distributions during training, it allows us to guide the probability distribution during sampling towards generating higher-quality images, thereby expanding the upper bound of the image quality.

Image-based CFG introduces a new form of CFG into the VAR framework, striking a balance between realism and fidelity similar to diffusion models. Results in Tab. 6 and Fig. 9 confirm the effectiveness of image-based CFG: it yields significantly richer textures in the generated images, substantially improving perceptual quality to match human preferences while maintaining correct semantics.
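For intuition, the sketch below shows how classifier-free-style guidance is commonly applied to autoregressive token logits, with the unconditional branch replaced by a negative (low-quality) image condition as described above. The function name, guidance scale, and exact placement of the guidance formula are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def guided_sample(logits_pos, logits_neg, scale=2.0, temperature=1.0):
    """Blend logits from the positive (LR) condition and a negative (low-quality)
    condition, then sample the next token. scale=1.0 recovers the positive branch."""
    logits = logits_neg + scale * (logits_pos - logits_neg)
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# toy usage over a vocabulary of 4096 codebook entries
tok = guided_sample(torch.randn(1, 4096), torch.randn(1, 4096), scale=2.0)
print(tok.shape)  # torch.Size([1, 1])
```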

Review (Rating: 2)

This paper presents VARSR, a Visual Autoregressive Model for Image Super-Resolution (ISR). VARSR leverages autoregressive modeling with next-scale prediction, prefix tokens for integrating low-resolution conditions, Scale-aligned Rotary Positional Encoding (SA-RoPE) for preserving spatial structure, and a Diffusion Refiner for pixel-level fidelity. The model is trained on a large-scale dataset of over 4 million high-quality images. Experiments show that VARSR outperforms existing methods in both fidelity and realism, offering higher computational efficiency compared to diffusion-based models.

Questions for the Authors

Please see weakness

Claims and Evidence

  • The paper compares VARSR with several state-of-the-art methods (GAN-based, diffusion-based, and autoregressive-based), and VARSR consistently performs well in both quantitative and qualitative evaluations across different datasets. It achieves superior results on no-reference IQA metrics, such as MANIQA, CLIPIQA, and MUSIQ, and performs on par with diffusion-based methods in terms of reference-based metrics like PSNR and SSIM.

  • The paper demonstrates that VARSR is more computationally efficient than diffusion models, requiring only 0.59s for inference, significantly reducing the number of steps needed compared to diffusion-based methods.

Methods and Evaluation Criteria

Nan

Theoretical Claims

Nan

Experimental Design and Analysis

Yes

Supplementary Material

Yes

Relation to Prior Literature

  • The paper builds upon prior research in autoregressive image generation (e.g., VQGAN, DALL-E) and image super-resolution (e.g., GAN-based methods, diffusion models). It extends the work on autoregressive models by introducing next-scale prediction and addressing issues specific to ISR tasks, such as pixel-level fidelity and the preservation of spatial structure.
  • I keep up with the literature in this area.

Missing Essential References

Nan

Other Strengths and Weaknesses

  • The originality of the approach lies in its integration of autoregressive modeling with image super-resolution, making it capable of achieving high fidelity and realism with computational efficiency. The extensive experiments, including a user study, validate the proposed method's performance in real-world scenarios, highlighting its potential practical application.

Other Comments or Suggestions

  • When comparing with other generative models, it appears that the authors did not re-train these models on the newly proposed large-scale dataset. As shown in Table 9, VAR benefits from training on the large-scale dataset. Therefore, directly comparing it with other generative models trained on ImageNet is not a fair comparison.
  • The motivation for introducing VAR into the super-resolution field is not clearly explained. The authors claim that VAR preserves structure better than the Markov process in diffusion models, but the final results show that its reference-based metrics are generally weaker than those of diffusion-based methods, which seems contradictory. Additionally, the authors argue that VAR is more efficient than DDPM, but there are already many efficient designs for diffusion models and one-step generation methods. Therefore, the claim of efficiency improvement for basic diffusion models alone is insufficient to support the motivation of this paper. I believe VAR is an excellent work, but its primary contribution lies in aligning the generative paradigms of vision and text via autoregressive methods. Using it as a simple replacement for diffusion models lacks a clear motivation.
Author Response

A1. Training with different datasets.

Thanks for your comments. In fact, the baseline methods are trained on different datasets (e.g., SeeSR uses LSDIR and PASD uses DIV2K/FFHQ, with a tenfold difference in data quantity). There are also differences in the pretrained models, which are important for providing generative priors for ISR (e.g., diffusion methods use Stable Diffusion, while GAN methods are trained from scratch). As stated in lines 258-270, the primary objective of using large-scale data is pretraining to acquire generative priors. The open-source VAR base model falls short of our needs, as it can only generate 256×256 images and its generated image quality is limited. Diffusion methods, by contrast, leverage the powerful Stable Diffusion as their base model, which is pretrained on billions of image-text pairs, far surpassing our scale of 4 million. Thus, when evaluating ISR methods, previous works (e.g., SeeSR/PASD) typically compare the performance of the final models themselves.

Based on your advice, we present VARSR results trained on the same datasets as the baselines. As shown in the table below, when pretraining with our large-scale data and fine-tuning on the same LSDIR data as SeeSR, VARSR still performs exceptionally well, far surpassing other methods in perceptual quality metrics. This is consistent with our conclusion when training on large-scale data, validating the superiority of the VARSR framework.

Dataset | Method                   | PSNR  | SSIM   | LPIPS  | DISTS  | MANIQA | CLIPIQA | MUSIQ
DRealSR | DiffBIR                  | 26.57 | 0.6516 | 0.4537 | 0.2724 | 0.4602 | 0.6445  | 61.06
        | PASD                     | 27.45 | 0.7539 | 0.3331 | 0.2322 | 0.4551 | 0.6365  | 63.69
        | SeeSR                    | 28.13 | 0.7711 | 0.3142 | 0.2230 | 0.5077 | 0.6893  | 64.75
        | VARSR (trained on LSDIR) | 27.87 | 0.7536 | 0.3716 | 0.2620 | 0.5368 | 0.7206  | 67.74

A2. Advantages of VARSR in preserving semantics.

VAR has an advantage over diffusion models in human-perceived semantic fidelity and quality.

(1) Firstly, in terms of fidelity metrics such as PSNR and SSIM, VARSR performs similarly to SeeSR and outperforms all other diffusion methods on the DIV2K and DRealSR datasets (Tab. 1). However, existing fidelity metrics have limitations in measuring human perceptual quality. Fig. 4 shows that images with higher human perceptual quality may not score well on fidelity metrics, as overly smoothed images tend to perform better. These limitations have been confirmed in previous studies (e.g., SUPIR in CVPR2024, PIPAL in ECCV2020), and mathematical derivations verify the contradiction between fidelity and quality (The Perception-Distortion Tradeoff, CVPR2018).

(2) Secondly, the user study and numerous examples support VARSR's superiority in maintaining spatial structure and preserving semantics. In Tab. 10 (user study), the results of VARSR align more closely with human preferences. The examples in Fig. 5/15/16/17 illustrate that VARSR excels at generating textures faithful to the original image, outperforming diffusion models. In the last two rows of Fig. 5, only VARSR accurately restores architectural semantics and smooth details. In the 2nd and 3rd cases of Fig. 17, VARSR generates clear foliage and walnuts, while diffusion methods exhibit hallucination issues (SeeSR produces fabricated content). These results highlight VARSR's superior preservation of semantics and spatial structure.

A3. Efficiency Comparison.

Thanks for your comments. We believe that the superior efficiency of VARSR stems from the advantages of the VAR framework over the standard diffusion framework. Therefore, there is no need to compare against models that are specifically optimized for efficiency within a particular framework, because these efficiency designs can also be applied to VARSR to improve it further. For example, one-step diffusion models (e.g., OSEDiff in NeurIPS 2024 and SinSR in CVPR 2024) often undergo knowledge distillation to match the performance of multi-step generation. Our VARSR can also adopt this approach to simplify the inference steps (e.g., reduce the number of inference scales).

Tab. 3 and the accompanying explanations (lines 380-384 and 407-413) demonstrate VARSR's notable efficiency gains over diffusion models, validating our motivation. Compared to the one-step diffusion method SinSR (see A1 to Reviewer vbZD), VARSR outperforms it on most metrics, showcasing its effectiveness.

A4. Motivation.

VAR provides a novel and effective approach for addressing ISR tasks, offering advantages over diffusion models rather than a straightforward substitute. Our work is just an initial attempt, and there is vast potential for further leveraging VAR in ISR tasks. In addition to the advantages in preserving structural features and efficiency, as you mentioned, VAR aligns the generative paradigm for both vision and text. Thus, we believe that VAR has the potential to be integrated with LLMs, enabling the direct utilization of human preferences to guide ISR through optimization forms such as DPO or GRPO, which is a promising avenue for future research.

Final Decision

This paper introduces visual autoregressive modeling to image super-resolution. The paper originally received 2x Weak Reject and 2x Weak Accept. The main concerns include unfair comparison, unclear motivation, insufficient evaluation, and reliance on diffusion. The authors provided rebuttals, after which all reviewers kept their ratings. In the discussion phase, Reviewer yFzJ acknowledged that although the method currently relies on the diffusion refiner, it is the first to introduce VAR to ISR and establishes a strong baseline for future research. Considering the rebuttal and the discussions among all reviewers, the ACs recommend accepting this paper. The authors are encouraged to carefully revise the paper and incorporate the newly conducted experiments according to the comments and discussions.