PaperHub
Overall: 6.8 / 10 · Poster · 4 reviewers (min 3, max 5, std 0.8)
Individual ratings: 5, 3, 4, 5
Confidence: 4.3
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

QBasicVSR: Temporal Awareness Adaptation Quantization for Video Super-Resolution

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29

Abstract

While model quantization has become pivotal for deploying super-resolution (SR) networks on mobile devices, existing works focus on quantization methods only for image super-resolution. Different from image super-resolution, temporal error propagation, shared temporal parameterization, and temporal metric mismatch significantly degrade the performance of a quantized video SR model. To address these issues, we propose QBasicVSR, the first quantization method for video super-resolution. We present a novel temporal awareness adaptation post-training quantization (PTQ) framework for video super-resolution with flow-gradient video bit adaptation and temporal shared layer bit adaptation. Moreover, we put forward a novel fine-tuning method for VSR under the supervision of the full-precision model. Our method achieves performance comparable to state-of-the-art efficient VSR approaches while delivering up to $\times$200 faster processing speed and utilizing only 1/8 of the GPU resources. Additionally, extensive experiments demonstrate that the proposed method significantly outperforms existing PTQ algorithms on various datasets. For instance, it attains a 2.53 dB increase on the UDM10 benchmark when quantizing BasicVSR to 4-bit with 100 unlabeled video clips. The code and models will be released on GitHub.
Keywords
video super resolution (VSR), post training quantization (PTQ), efficient VSR

Reviews and Discussion

Official Review
Rating: 5

The paper introduces QBasicVSR, the first post-training quantization (PTQ) framework specifically designed for video super-resolution (VSR), addressing unique challenges such as temporal error propagation and shared temporal parameters that hinder the direct application of image SR quantization methods. QBasicVSR employs a dual adaptation strategy: Flow-Gradient Video Bit Adaptation (FG-VBA) for video complexity-aware bit allocation, and Temporal Shared Layer Bit Adaptation (TS-LBA) for layer-wise sensitivity-based quantization. Without requiring ground-truth data, the framework uses only 100 unlabeled video clips for calibration and introduces a lightweight fine-tuning procedure guided by a full-precision BasicVSR. Experiments show that QBasicVSR significantly outperforms existing PTQ and efficient VSR methods in both accuracy and resource efficiency, achieving significant speedup and maintaining high visual quality even with 4-bit quantization.

Strengths and Weaknesses

Strengths:

  1. The paper proposes the first dedicated quantization method tailored for VSR, addressing a previously unexplored area. While quantization has been extensively studied for image super-resolution, its extension to video tasks poses new challenges due to temporal dependencies and recurrent structures.
  2. The paper demonstrates a successful adaptation of quantization methods to the video domain by incorporating spatiotemporal information through two novel modules: Flow-Gradient Video Bit Adaptation (FG-VBA) and Temporal Shared Layer Bit Adaptation (TS-LBA). These components are designed to dynamically allocate bit-widths based on video complexity and layer sensitivity, effectively mitigating issues such as temporal error propagation, shared temporal parameterization, and temporal metric mismatch—all of which are critical but often overlooked in existing quantization literature.
  3. The proposed framework achieves highly efficient training and inference, requiring significantly fewer computational resources compared to existing efficient VSR methods. Specifically, the method offers more than ×200 speedup and operates on just 1/8 of the GPU resources used by prior approaches. This efficiency is particularly impactful for practical applications such as real-time video enhancement on edge devices, where both speed and resource constraints are crucial.

Weaknesses:

  1. The proposed quantization method is only applied to BasicVSR with a fixed resolution upsampling rate of ×4, which raises concerns about its generalizability to other models and scaling factors. While BasicVSR is a representative baseline, it remains unclear whether the method—or its key components—can be effectively applied to other VSR architectures. Evaluating the approach on additional models and upsampling scales would help demonstrate its broader applicability.
  2. The efficiency evaluation of the proposed QBasicVSR is questionable. Although the authors claim that QBasicVSR is 200 times faster and uses only 1/8 of the GPU resources compared to state-of-the-art efficient VSR methods, this comparison appears incomplete. The reported time and GPU consumption only reflect the fine-tuning phase of QBasicVSR, while the method itself starts from a pretrained BasicVSR model. The omission of the training time of BasicVSR makes the claimed efficiency gains less convincing. We also suggest including the inference time in terms of frames per second (FPS) in the main paper, as the authors mentioned real-time VSR.
  3. The calibration dataset plays a crucial role in both the calibration and fine-tuning phases. However, the main paper does not discuss how this dataset is sampled or the potential performance variability resulting from different samples; instead, this information is deferred to the supplementary material. This discussion should be included in the main text. Furthermore, the rationale for selecting 100 samples for the calibration dataset is not clearly explained.
  4. There are also several issues related to the writing of the paper. First, based on the reviewer's understanding, the Flow-Gradient Video Bit Adaptation module adjusts bit-widths at the video level for all layers uniformly, yet Line 143 implies the adjustment is performed on a per-layer basis, which is inconsistent. Second, there is an error in the latter part of Equation (6). Third, the term “FP network” appears in Line 222 without explanation; while it likely stands for “full-precision network,” this should be clearly defined in the text. We recommend that the authors clarify these points for better readability.

Questions

The reviewer notes that a 6-bit setting is used in Table 1, while later experiments adopt a 4-bit setting. This raises several questions:

  1. Is it necessary to change the calibration dataset when switching from 6-bit to 4-bit?
  2. Are the hyperparameters listed in the Experiment Section intended for the 4-bit or 6-bit setting?
  3. Additionally, the processing time for either setting is not reported, and it is unclear which bit-width is used in Table 4. These details should be clarified for consistency and reproducibility.

Additional question:

  1. What is the performance of the proposed method without calibration or fine-tuning? We suggest including this as part of the ablation study in Table 3.

Limitations

Yes

Final Justification

The main novelty of the paper lies in applying quantization to video super resolution (VSR). A shared concern among reviewers was the generality of the proposed method. I appreciate the authors’ efforts to address this in the rebuttal, and the additional experiments provided have resolved my concerns. As a result, I am raising my score to 5.

Formatting Issues

No major formatting issue

Author Response

Response to Reviewer 45MJ – Weakness 1: Generalizability to Other Models and Scaling Factors

We thank the reviewer for this suggestion. In the video SR community, most prior works typically evaluate on 4× upscaling rather than 2×, as the latter is relatively easier and less studied. Our focus followed this standard setting to address the more challenging 4× scenario.

To further assess generalizability, we additionally tested our method on BasicVSRuni, a simplified variant of BasicVSR with the backward propagation branch removed. The results showed consistent improvements over MinMax and Percentile, indicating that our framework generalizes well beyond the original BasicVSR model.

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| BasicVSR-MinMax | | 4/4 | 6min | 28.12 / 0.7896 | 34.89 / 0.9182 | 26.03 / 0.7585 | 35.37 / 0.9251 | 34.32 / 0.9106 | 25.64 / 0.7412 |
| BasicVSR-Percentile | | 4/4 | 3min | 27.78 / 0.7841 | 34.37 / 0.9159 | 25.23 / 0.7305 | 35.06 / 0.9341 | 34.11 / 0.9135 | 25.08 / 0.7275 |
| BasicVSR-MinMax+FT | | 4/4 | 70min | 29.21 / 0.8238 | 35.22 / 0.9212 | 26.17 / 0.7601 | 36.44 / 0.9393 | 34.88 / 0.9176 | 26.11 / 0.7584 |
| BasicVSR-Percentile+FT | | 4/4 | 70min | 28.30 / 0.8054 | 34.40 / 0.9159 | 25.26 / 0.7314 | 35.32 / 0.9362 | 34.20 / 0.9124 | 25.26 / 0.7340 |
| BasicVSR-Ours | | 4/4 MP | 90min | 30.26 / 0.8637 | 35.82 / 0.9311 | 26.29 / 0.7752 | 37.59 / 0.9536 | 35.95 / 0.9339 | 26.81 / 0.8025 |
| BasicVSRuni | - | 32/32 | - | 30.54 / 0.8694 | 36.99 / 0.9429 | 27.03 / 0.8163 | 39.29 / 0.9646 | 37.27 / 0.9473 | 27.54 / 0.8419 |
| BasicVSRuni-MinMax | | 4/4 | 4min | 28.48 / 0.8012 | 34.40 / 0.9103 | 25.63 / 0.7342 | 35.59 / 0.9291 | 34.60 / 0.9153 | 25.84 / 0.7550 |
| BasicVSRuni-Percentile | | 4/4 | 3min | 29.28 / 0.8321 | 35.45 / 0.9274 | 26.11 / 0.7717 | 37.19 / 0.9516 | 35.78 / 0.9330 | 26.41 / 0.7963 |
| BasicVSRuni-MinMax+FT | | 4/4 | 35min | 28.88 / 0.8157 | 34.86 / 0.9167 | 25.80 / 0.7441 | 36.34 / 0.9397 | 34.93 / 0.9188 | 26.00 / 0.7581 |
| BasicVSRuni-Percentile+FT | | 4/4 | 30min | 29.43 / 0.8366 | 35.49 / 0.9275 | 26.07 / 0.7681 | 37.32 / 0.9519 | 35.78 / 0.9326 | 26.50 / 0.7979 |
| BasicVSRuni-Ours | | 4/4 MP | 35min | 29.57 / 0.8460 | 35.96 / 0.9323 | 26.32 / 0.7784 | 37.78 / 0.9550 | 36.22 / 0.9366 | 26.71 / 0.8067 |
| BasicVSRuni-MinMax | | 4/4 | 4min | 28.45 / 0.7997 | 34.31 / 0.9088 | 25.56 / 0.7308 | 35.53 / 0.9274 | 34.55 / 0.9142 | 25.82 / 0.7531 |
| BasicVSRuni-Percentile | | 4/4 | 3min | 28.12 / 0.7984 | 33.94 / 0.9096 | 24.81 / 0.7030 | 34.76 / 0.9318 | 33.97 / 0.9111 | 24.87 / 0.7138 |
| BasicVSRuni-MinMax+FT | | 4/4 | 35min | 28.86 / 0.8135 | 34.70 / 0.9146 | 25.72 / 0.7384 | 36.33 / 0.9392 | 34.88 / 0.9182 | 26.14 / 0.7667 |
| BasicVSRuni-Percentile+FT | | 4/4 | 30min | 28.35 / 0.8074 | 34.03 / 0.9107 | 24.92 / 0.7105 | 35.05 / 0.9341 | 34.06 / 0.9125 | 25.09 / 0.7242 |
| BasicVSRuni-Ours | | 4/4 MP | 35min | 29.55 / 0.8434 | 35.75 / 0.9296 | 26.15 / 0.7674 | 37.29 / 0.9517 | 35.72 / 0.9319 | 26.53 / 0.7938 |

These quantitative improvements across multiple benchmarks and models convincingly demonstrate the effectiveness of our quantization framework.

Response to Reviewer 45MJ – Weakness 2: Efficiency Evaluation

Thank you for your comment. Table 4 is intended to compare compression methods for pretrained VSR models, not quantization versus original VSR training. For example, SSL (pruning + fine-tuning) is reported to take 8 A6000 GPUs for 370 hours, excluding the original BasicVSR training. Likewise, our quantization-based compression takes only 1 A6000 GPU for 105 minutes. We will clarify this scope explicitly in the revision and include the time ratio (22,200 min ÷ 105 min ≈ 211×) for transparency. We will also include the inference time in terms of frames per second (FPS) in the main paper instead of the supplementary material. We sincerely thank the reviewer again for this valuable suggestion.

Response to Reviewer 45MJ – Weakness 3: Calibration Dataset

We thank the reviewer for this suggestion. We agree that the role and construction of the calibration dataset should be discussed more explicitly in the main paper. In the revised version, we will add a dedicated description of how the calibration video clips are sampled and include an analysis of the effect of calibration set size on performance in the main paper.

To address your concern about the choice of 100 videos, we conducted an additional experiment varying the number of calibration videos and measured its impact on efficiency and reconstruction quality. The results, shown below, indicate that while performance improves as the number of calibration video clips increases, the benefit plateaus at around 100 clips, whereas the processing time grows substantially. Based on this trade-off, we selected 100 videos as a balanced setting for all experiments. The evaluation results are reported on the REDS4 dataset.

| num | FQ | W/A | Process Time | PSNR | SSIM |
|---|---|---|---|---|---|
| 32 | | 4/4 | 48min | 29.97 | 0.8512 |
| 64 | | 4/4 | 67min | 30.11 | 0.8565 |
| 100 | | 4/4 | 90min | 30.26 | 0.8637 |
| 128 | | 4/4 | 146min | 30.14 | 0.8582 |
| 160 | | 4/4 | 190min | 29.87 | 0.8512 |

Response to Reviewer 45MJ – Weakness 4: Writing of the Paper

We thank the reviewer for these helpful comments regarding the writing.

  1. Regarding the Flow-Gradient Video Bit Adaptation module: the reviewer is correct that the adjustment is applied at the video level, and in practice the bit-width assigned to a video is changed uniformly across all layers. This is a specific manifestation of the algorithm at the video-level granularity in the current implementation. The description in Line 143 was misleading, and we will clarify this implementation detail in the revision to avoid confusion.
  2. We also acknowledge the error in the latter part of Equation (6). This will be corrected in the revised manuscript.
  3. Finally, the reviewer is absolutely correct that “FP network” refers to the full-precision network. We will explicitly define this term at its first appearance to ensure clarity.

We sincerely appreciate the reviewer’s careful reading, which will help us improve the clarity and accuracy of the paper.

Response to Reviewer 45MJ – Question 1: “Whether the Calibration Dataset Changes When Switching from 6-bit to 4-bit”

We thank the reviewer for this question. In our experiments, the calibration dataset remains fixed throughout each quantization setting; we do not change it when switching from 6‑bit to 4‑bit. For reproducibility, we fix the random seed to 1 and sample 100 calibration video clips once for each training domain. Specifically:

  • From the REDS dataset, we randomly select 100 video clips (seed = 1) as the calibration set to obtain the quantized model, which is then evaluated on REDS4.
  • From Vimeo90K‑Train (BI), we sample another 100 video clips in the same way to build a quantized model, which is evaluated on Vid4 (BI) and Vimeo90K‑Test (BI).
  • From Vimeo90K‑Train (BD), we again select 100 video clips for calibration, and the resulting model is evaluated on Vid4 (BD), Vimeo90K‑Test (BD) and UDM10.

The same fixed calibration sets are used regardless of the bit-width (6‑bit or 4‑bit). Therefore, the change in bit-width does not involve re‑sampling or changing the calibration dataset.
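For concreteness, a minimal sketch of this sampling step is shown below (the clip identifiers are placeholders; only the fixed seed and the count of 100 follow the description above):

```python
import random

def build_calibration_set(train_clip_ids, num_clips=100, seed=1):
    """Sample a fixed set of unlabeled low-resolution clips for calibration.

    The same set is reused for every bit-width setting (6-bit, 4-bit, ...);
    no ground-truth frames are needed.
    """
    rng = random.Random(seed)                  # fixed seed = 1 for reproducibility
    return sorted(rng.sample(train_clip_ids, num_clips))

# hypothetical clip identifiers for one training domain (e.g. REDS)
train_clips = [f"clip_{i:03d}" for i in range(240)]
calibration_set = build_calibration_set(train_clips)
```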

Response to Reviewer 45MJ – Question 2: “Hyperparameters for 4‑bit vs. 6‑bit Settings”

We thank the reviewer for this question. The hyperparameters reported in the Experiment Section are identical for both the 4‑bit and 6‑bit settings. We kept all training and fine‑tuning settings fixed across bit-widths, and the consistent performance under these same hyperparameters demonstrates the robustness and generalization ability of our method.

Response to Reviewer 45MJ – Question 3: “Processing Time and Bit‑Width in Table 4”

We thank the reviewer for this helpful suggestion. The processing times have been reported in the table above. Table 4 corresponds to the 6-bit non-FQ quantization setting. The previously reported 105 min in the main paper included calibration, fine-tuning, and the evaluation of the quantized model. For fairness, we now report only the time required to obtain the quantized model, which is about 90 min.

Response to Reviewer 45MJ – Question 4: Ablation Study

Thank you for your valuable suggestion. We appreciate you recommending an ablation study on the performance without calibration.

Your point is absolutely valid—adding this detail is necessary to help readers better understand our method. Calibration is the essential step where we determine the quantization parameters (e.g., clipping ranges). Without the calibration stage, we cannot derive these parameters, and thus, we cannot obtain an actual quantized model. In this scenario, the model remains and operates as the original full-precision (FP32) model.

Therefore, as you insightfully suggested, we will add this specific ablation point (performance of the full-precision model, i.e., without Calibration) to Table 3 in our revision. This will clearly demonstrate the baseline performance before any quantization is applied.

We also note that our original Table 3 already includes results without Fine-Tuning (FT), showing the effect of quantization alone after calibration.

Once again, we are very grateful for these insightful and constructive suggestions, which have helped us strengthen the study and clarify these significant points.

Comment

Thank you for your rebuttal. Your additional experiments and detailed explanations have sufficiently addressed my primary concern regarding generality. However, I still have some reservations about the use of the calibration dataset, specifically how one would select an appropriate calibration set in real-world applications. Therefore, I will keep my rating.

Comment

Response to Reviewer 45MJ – On Calibration Set Selection in Real‑World PTQ

We sincerely thank the reviewer for this valuable question. We clarify how calibration data are obtained in practical PTQ deployment and why their selection is not a limitation for our method.

  1. Calibration in PTQ does not require ground truth.
    PTQ only requires a small set of unlabeled low‑resolution clips to estimate activation distributions and determine quantization parameters. In real‑world deployment, such inputs are naturally available—the system receives the same raw video streams it will later enhance. Clips can therefore be reused for calibration without any additional annotation or data‑collection effort.

  2. Our method is robust to calibration‑set choice.
    We tested five calibration sets generated with random seeds 0, 1, 2, 3 and 321. The resulting quantized models show extremely low variance across all datasets (Table 1), demonstrating that even randomly selected clips yield stable performance. Manual selection or domain‑specific heuristics are unnecessary.

  3. Practical deployment process.
    We recommend sampling short, unlabeled clips directly from the target application domain—the same low‑resolution input streams the VSR model will process post‑deployment. For example, the system can buffer a handful of clips during the first seconds of operation or periodically sample during idle cycles. These clips naturally reflect the deployment distribution (e.g., device camera, streaming platform, or surveillance feed) and thus form a representative calibration set.
    Table 1 confirms robustness across random samples, so no manual selection is required—any small subset of the deployment stream suffices. Hence the calibration data are not only easy to obtain but inherently “appropriate,” because they originate from the same data distribution the model is deployed to operate on. We will clarify this point in the main paper and explicitly state that no special heuristics are needed to construct the calibration set.

Table 1  Error‑bar evaluation of 4‑bit QBasicVSR (PSNR / SSIM)
Calibration sets are sampled with random seeds 0, 1, 2, 3, 321.

| Method | Seed | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|
| BasicVSR | - | 31.42 / 0.8909 | 37.18 / 0.9450 | 27.24 / 0.8251 | 39.69 / 0.9694 | 37.53 / 0.9498 | 27.96 / 0.8553 |
| Ours | 0 | 30.09 / 0.8562 | 36.03 / 0.9335 | 26.29 / 0.7772 | 37.78 / 0.9554 | 35.93 / 0.9336 | 26.80 / 0.8040 |
| Ours | 1 | 30.26 / 0.8637 | 35.82 / 0.9311 | 26.29 / 0.7752 | 37.59 / 0.9536 | 35.95 / 0.9339 | 26.81 / 0.8025 |
| Ours | 2 | 29.90 / 0.8511 | 35.81 / 0.9311 | 26.22 / 0.7738 | 37.94 / 0.9563 | 36.02 / 0.9347 | 26.77 / 0.8032 |
| Ours | 3 | 30.19 / 0.8618 | 35.93 / 0.9320 | 26.19 / 0.7712 | 37.85 / 0.9557 | 36.08 / 0.9356 | 26.93 / 0.8114 |
| Ours | 321 | 30.09 / 0.8598 | 35.77 / 0.9299 | 25.84 / 0.7530 | 37.78 / 0.9546 | 35.97 / 0.9335 | 26.88 / 0.8056 |
| Mean | | 30.10 / 0.8583 | 35.87 / 0.9315 | 26.17 / 0.7701 | 37.79 / 0.9567 | 35.99 / 0.9335 | 26.84 / 0.8053 |
| Std | | 0.11 / 0.0042 | 0.10 / 0.0011 | 0.17 / 0.0088 | 0.12 / 0.0027 | 0.05 / 0.0019 | 0.06 / 0.0032 |

Std = standard deviation across the five seeds.

Comment

Thank you for your efforts in addressing my concerns regarding the calibration dataset. Your response has resolved my doubts, and I believe including a brief explanation in the main paper would strengthen it further. I will raise my rating accordingly.

Comment

Thank you very much for your thoughtful reconsideration and kind words. We're truly grateful that our clarification helped address your concern about calibration set selection. We will make sure to include a brief explanation in the final version to improve the paper’s clarity for broader readers. We sincerely appreciate your time and constructive feedback throughout the review process.

Comment

Thank you once again for your thoughtful evaluation and for taking the time to acknowledge our rebuttal so thoroughly. We're sincerely grateful for your constructive feedback and engagement throughout the review process.

Official Review
Rating: 3

The authors proposed a new solution for quantizing video super-resolution (VSR) models. Inspired by AdaBM, a PTQ method for image super-resolution, the authors extended it to VSR by replacing AdaBM's image-complexity-to-bit mapping with a video-complexity-to-bit mapping, and its spatial-sensitivity-based layer-to-bit mapping with a spatio-temporal-sensitivity-based layer-to-bit mapping. The method is applied to a single VSR baseline, BasicVSR, and experimental results support its effectiveness.

Strengths and Weaknesses

Strengths

As the authors claim, the proposed method can be considered the first PTQ approach for VSR. The designs of the two key modules, flow-gradient video bit adaptation and temporal shared layer bit adaptation, are technically sound, and their effectiveness is validated through ablation studies.

Weaknesses

  1. The proposed method is a relatively straightforward extension of AdaBM to VSR, as noted in the Summary section. In addition to reusing its core ideas, the paper closely follows much of the AdaBM paper’s structure, including its organization, figures, and equations.
  2. Given that this is the first PTQ work for VSR, readers may expect more than a direct adaptation of a prior method; they would look for task-specific motivations. For instance, early quantization studies in image super-resolution began with empirical analyses of activation distributions, which provided a clear rationale for novel quantization strategies. In contrast, the authors briefly mention three VSR-specific challenges in Lines 51–56 but do not support them with any theoretical or empirical analysis.
  3. As a result, the proposed method in its current form appears insufficiently tailored to VSR. Rather than briefly noting this as a limitation, the authors should consider applying their method to additional video processing tasks to demonstrate broader applicability.
  4. The method is evaluated only on a single VSR baseline [3]. To demonstrate generalization capability, it should be tested on multiple baselines, especially since the method involves at least seven empirically chosen hyperparameters, as noted in Lines 285–287.

Questions

In addition to the major concerns outlined above, several minor issues remain:

  1. Experimental results using different numbers of calibration images would help clarify the trade-off between training complexity and image quality.
  2. Beyond the standard PTQ baselines such as MinMax and Percentile, stronger baseline methods should be included for comparison, e.g., AdaRound and AdaQuant. The method proposed in [R1], as well as others reviewed therein, could also be considered.   [R1] Pushing the Limit of Post-Training Quantization, TPAMI 2025.
  3. The main results in Table 1 should include variations of the proposed method with different bit configurations to provide a clearer view of performance under different settings.
  4. The layer-wise and frame-wise bit adaptation results are neither analyzed nor visualized. Reporting only the overall average PSNR and SSIM makes it difficult for readers to understand the internal behavior of the adaptive bit allocation process.

Limitations

The authors have clearly discussed the limitations of their work.

Final Justification

After reviewing the other reviewers' comments and the authors' feedback, I still find that the proposed method is not fully justified. A major concern shared by most reviewers is the limited generalization ability, as the method was evaluated only on BasicVSR. Although the authors provided additional results using BasicVSRuni during the rebuttal phase, it is essentially a simplified version of BasicVSR. The authors emphasize their contributions over AdaBM; however, some level of modification is always expected when extending a single-image method to video frames. From my perspective, the claimed contributions still appear relatively straightforward. Furthermore, the authors did not provide sufficient theoretical or empirical analysis to support the proposed methodology, relying primarily on performance results. Therefore, I keep my original rating.

Formatting Issues

NA

Author Response

Response to Reviewer zMGT – Question 1: On the Novelty and Differences from AdaBM

We sincerely thank the reviewer for raising this important point about the relation between our method and AdaBM. While we did briefly mention AdaBM in the related work section, we acknowledge that the distinctions were not sufficiently emphasized in the main paper. This was not due to an intention to downplay its relevance, but rather because we initially focused on presenting the core design of our approach within the limited space. We appreciate the reviewer’s suggestion and are happy to clarify the differences in detail below.

Although our work draws conceptual inspiration from AdaBM, particularly in the idea of bit allocation, we would like to emphasize that our method is not a direct extension. We introduce several critical innovations specifically tailored for video super-resolution (VSR), which fundamentally change the way the method operates.

Below, we highlight the key innovations that we believe are crucial to the success of our method in the VSR setting, and which fundamentally differentiate it from AdaBM.

1. Temporal-Aware Adaptation (Core Technical Extension) While AdaBM targets single-image super-resolution and utilizes spatial sensitivity, our method incorporates temporal modeling through FG-VBA and TS-LBA. These components explicitly capture spatio-temporal variance, which is critical in VSR but absent in AdaBM.

2. Customized Finetuning Strategy (Training Efficiency)

AdaBM requires 10 full epochs of fine-tuning, which is computationally affordable for image SR. However, this is prohibitively expensive in VSR, where input sequences are longer and the cost per epoch is significantly higher.

In contrast, our method adopts a 3-stage progressive optimization schedule, where each stage updates only one group of parameters:

  • Stage 1: Weight clipping
  • Stage 2: Activation clipping
  • Stage 3: Bit adaptation

This staged approach offers:

  • Fast convergence in only 3 epochs
  • Stable optimization behavior
  • Clear attribution of effects to each component,

making it far more practical for PTQ in video models.
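A minimal sketch of this three-stage schedule is given below. The accessor methods (weight_clip_params, act_clip_params, bit_adapt_params) are hypothetical names for the three parameter groups described above; this is an illustration of the staged optimization, not the paper's exact training code:

```python
import torch

def progressive_finetune(qmodel, fp_model, calib_loader, loss_fn, lr=1e-4):
    """Three PTQ fine-tuning stages, each updating one parameter group,
    supervised by the frozen full-precision model (no ground truth needed)."""
    stages = [
        qmodel.weight_clip_params(),   # Stage 1: weight clipping ranges
        qmodel.act_clip_params(),      # Stage 2: activation clipping ranges
        qmodel.bit_adapt_params(),     # Stage 3: video/layer bit-adaptation factors
    ]
    fp_model.eval()
    for stage_params in stages:        # one epoch per stage -> 3 epochs total
        optimizer = torch.optim.Adam(list(stage_params), lr=lr)
        for lr_clips in calib_loader:  # unlabeled low-resolution clips
            with torch.no_grad():
                target = fp_model(lr_clips)           # full-precision teacher output
            optimizer.zero_grad()
            loss = loss_fn(qmodel(lr_clips), target)  # e.g. pixel + feature-transfer loss
            loss.backward()
            optimizer.step()
    return qmodel
```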

3. Task-Specific Loss Design (Temporal Structure Awareness)

AdaBM primarily uses a spatial reconstruction loss. We extend this with a temporal structure consistency loss, encouraging smooth transitions across frames to prevent flickering and preserve perceptual coherence.
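The rebuttal does not give the exact formula; the sketch below shows one common way such a temporal structure consistency term can be written, matching the frame-to-frame changes of the quantized output to those of the full-precision output (an assumed form, not necessarily the paper's equation):

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(sr_q, sr_fp):
    """sr_q, sr_fp: (B, T, C, H, W) outputs of the quantized and full-precision models.
    Penalizes mismatched frame-to-frame differences to discourage flickering."""
    diff_q = sr_q[:, 1:] - sr_q[:, :-1]     # temporal differences of the quantized output
    diff_fp = sr_fp[:, 1:] - sr_fp[:, :-1]  # temporal differences of the FP teacher
    return F.l1_loss(diff_q, diff_fp)
```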

4. Flexible Bit Space (No Bit Penalization Loss)

AdaBM includes a bit-penalization loss to explicitly push bit-widths lower. We intentionally do not include such a loss, as we found it harms convergence and stability in video quantization. Instead, our method allows bit adaptation to emerge naturally through data-driven optimization, without explicitly penalizing higher bits.

This avoids unstable optimization trajectories that may result from over-aggressive bit constraints, especially in temporally variant features across video frames.

Conclusion

While AdaBM inspired our initial direction, the core design of our method is purpose-built for VSR. From temporal-aware adaptation modules to efficient optimization and loss design, our contributions form a substantially different and VSR-specific framework, going far beyond a simple adaptation.

We will include a detailed comparison with AdaBM in the supplementary material to further clarify the differences in design details and implementations. We sincerely thank the reviewer for pointing out this important issue and for the opportunity to clarify our contributions.

Response to Reviewer zMGT – Question 2: Calibration Videos Trade-off

We thank the reviewer for this suggestion. To address your concern, we conducted an additional study varying the number of calibration video clips to assess the trade-off between calibration set size, training time, and final quality. As shown in the table below, increasing the number of calibration video clips generally improves performance up to a point, but the gain saturates beyond 100 video clips while the processing time continues to grow significantly. We therefore selected 100 video clips as a practical compromise between efficiency and accuracy in all our experiments. The evaluation results are reported on the REDS4 dataset.

| num | FQ | W/A | Process Time | PSNR | SSIM |
|---|---|---|---|---|---|
| 32 | | 4/4 | 48min | 29.97 | 0.8512 |
| 64 | | 4/4 | 67min | 30.11 | 0.8565 |
| 100 | | 4/4 | 90min | 30.26 | 0.8637 |
| 128 | | 4/4 | 146min | 30.14 | 0.8582 |
| 160 | | 4/4 | 190min | 29.87 | 0.8512 |

Response to Reviewer zMGT – Question 3: On the Choice of PTQ Baselines and Stronger Methods

We thank the reviewer for the constructive comments. To further assess generalization, we additionally evaluated our method on BasicVSRuni, a widely used variant of BasicVSR with the backward propagation branch removed to reduce computation cost (also adopted in prior studies such as SSL). Our method again outperforms MinMax, Percentile, and DBDC (Toward Accurate Post-Training Quantization for Image Super Resolution, CVPR’23) across multiple datasets.

Regarding AdaRound, AdaQuant, and [R1]Pushing the Limit of Post-Training Quantization (TPAMI 2025), these approaches are primarily designed for classification or static image tasks and are not directly compatible with recurrent, flow-guided VSR architectures. Previous PTQ works for image SR (e.g., DBDC, AdaBM) also used MinMax and Percentile as baselines for fairness, so we followed the same protocol. We will add a proper citation to [R1] (Pushing the Limit of Post-Training Quantization, TPAMI 2025) in the introduction to acknowledge it as a recent breakthrough in PTQ research, while noting that post-training quantization for video super-resolution has not been explored before our work.

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| BasicVSR-DBDC+FT | | 4/4 | 95min | 29.22 / 0.8223 | 35.45 / 0.9243 | 26.31 / 0.7703 | 36.82 / 0.9434 | 35.25 / 0.9228 | 26.42 / 0.7764 |
| BasicVSR-Ours | | 4/4 MP | 90min | 30.26 / 0.8637 | 35.82 / 0.9311 | 26.29 / 0.7752 | 37.59 / 0.9536 | 35.95 / 0.9339 | 26.81 / 0.8025 |

Response to Reviewer zMGT – Question 4: On Table 1 Bit Configurations

We thank the reviewer for this helpful suggestion. We will revise Table 1 in the final version to present our method under different bit configurations for improved clarity. The corresponding experimental data already exist in the main paper and supplementary material; since no new data are generated, we omit them here to save space.

Response to Reviewer zMGT – Question 5: On Bit Adaptation Analysis

We thank the reviewer for this helpful suggestion. We will revise Table 1 in the final version to present our method under different bit configurations more clearly.

Due to space constraints, we do not repeat the results of other baseline methods here, as they are already included in the main paper. If you are interested in the raw numbers, we kindly refer to those tables for detailed comparisons. Here, we provide the feature average bit-width (FAB) results of our method across three stages for clarity; FAB is averaged over the video clips of the test dataset. Dynamic adaptation applies only to activations; weight bit-widths stay fixed. Notably, even in Stage 1, our method achieves competitive or superior performance to baselines with lower bit-width usage, demonstrating the effectiveness of our approach.

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| Ours-Stage1 | | 4/4 | 40min | 29.82 / 0.8466 / 3.98 | 35.46 / 0.9266 / 3.95 | 26.12 / 0.7666 / 3.75 | 36.69 / 0.9464 / 3.7 | 35.45 / 0.9272 / 3.95 | 26.32 / 0.7716 / 3.75 |
| Ours-Stage2 | | 4/4 | 65min | 29.86 / 0.8479 / 3.98 | 35.51 / 0.9270 / 3.95 | 26.21 / 0.7695 / 3.75 | 36.76 / 0.9471 / 3.7 | 35.54 / 0.9283 / 3.95 | 26.38 / 0.7741 / 3.75 |
| Ours-Stage3 | | 4/4 | 90min | 30.26 / 0.8637 / 5.50 | 35.82 / 0.9311 / 5.03 | 26.29 / 0.7752 / 4.69 | 37.59 / 0.9536 / 4.78 | 35.95 / 0.9339 / 5.08 | 26.81 / 0.8025 / 4.98 |

Each entry reports PSNR / SSIM / FAB.

Response to Reviewer zMGT – Question 6: Generalization Capability and Other Questions

We thank the reviewer for this helpful suggestion. To assess generalization, we additionally validated our method on BasicVSRuni (a variant of BasicVSR without the backward branch), achieving consistent improvements over the MinMax, Percentile, and DBDC+Pac baselines; this confirms that our approach generalizes beyond a single VSR model. Due to space limitations, we omit the detailed table here, but we can provide it in the following discussion if you are interested. VSR introduces unique PTQ challenges that are harder to visualize than static image SR; we have reported FAB per stage above to show the adaptive behavior. Our framework is also applicable to other pixel-to-pixel video restoration tasks (e.g., denoising, deblurring), but large-scale PTQ experiments across multiple video tasks are extremely time-consuming, and we plan to investigate this in future work. We sincerely thank the reviewer again.

Comment

Thank you very much for engaging in such a constructive discussion and for updating your rating. We truly appreciate the time and effort you spent reviewing our work and providing valuable suggestions throughout the process. We will make sure to incorporate the clarifications and additional details in the final version, as discussed. Thank you again for your thoughtful feedback and support.

Comment

Dear Reviewer zMGT,

Thank you again for taking the time to read our rebuttal and for engaging thoughtfully with our responses. We deeply appreciate your feedback and the opportunity to address your concerns.

If there are any remaining concerns, uncertainties, or points you'd like to further discuss, we would be more than happy to engage in additional discussion during the author–reviewer discussion phase.

We sincerely hope to have the opportunity to clarify any aspects that may still be unresolved. Your feedback is valuable to us.

Thank you again for your time and consideration.

Best regards, The authors

Comment

Thank you for your response to my comments. I have read the rebuttal and have no further comments.

Comment

Thank you very much for your thoughtful response to our rebuttal and for taking the time to review our work. We are grateful for your constructive feedback and are pleased to hear that our clarifications addressed your concerns.

We especially appreciate your recognition of the changes made to improve the clarity and quality of our paper. Your input has been invaluable in helping us strengthen the manuscript.

We also understand that there are no further comments or concerns at this stage, and we are happy that the adjustments we made align with your expectations.

Once again, thank you for your time, effort, and support in the review process.

Official Review
Rating: 4

The paper introduces QBasicVSR, the first post-training quantization (PTQ) framework tailored for video super-resolution (VSR). It identifies three challenges unique to VSR: temporal error propagation, shared temporal parameterization, and temporal metric mismatch. To address these problems, temporal awareness adaptation post-training quantization (PTQ) is introduced, which comprises two complementary modules: (1) Flow-Gradient Video Bit Adaptation (FG-VBA) computes a spatiotemporal complexity metric over bidirectional optical flows and image gradients to adapt the global bit-width for an input video; (2) Temporal Shared Layer Bit Adaptation (TS-LBA) analyzes per-layer spatial and temporal sensitivities from activation statistics to assign layer-wise bit-width factors. These modules are initialized via a small set of unlabeled calibration clips and then jointly fine-tuned under supervision of the full-precision model, using both pixel and feature-transfer losses.

Strengths and Weaknesses

Strengths:

S1. This paper proposes a well-motivated dual adaptation framework: Flow-Gradient Video Bit Adaptation (FG-VBA) and Temporal Shared Layer Bit Adaptation (TS-LBA), which are rigorously derived and mathematically formalized.
S2. The methodology is explained with clear diagrams (e.g., Figure 1) and detailed formulas that clarify the inner workings of FG-VBA and TS-LBA.
S3. Speedup (>×200 faster vs. SSL) and resource reduction (1 GPU vs. 8 GPUs) make the work highly relevant for practical large-scale video applications.

Weaknesses:

W1. Experiments only target BasicVSR and its lite version. Can this prove the generalizability of the quantization method? For example, the following articles verify the effectiveness of their quantization methods on multiple existing networks:
[1] Content-Aware Dynamic Quantization for Image Super-Resolution.
[2] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer.
[3] AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution.

W2. Real-world evaluation (e.g., on actual hardware or mobile deployment scenarios) is missing.

W3. Some parameters, such as l_v2b, u_v2b, l_space and u_space, depend on percentage thresholds. Is it necessary to determine different initial statistical thresholds for different datasets?

W4. Is it possible to add real-world video super-resolution evaluation instead of videos from the dataset?

W5. In Flow-Gradient Video Bit Adaptation (FG-VBA), how does C_video map to the global bit width factor b_video?

W6. In Temporal Shared Layer Bit Adaptation (TS-LBA), spatial sensitivity s_space^k computes the standard deviation across the channel dimension (dim=1). Is dim=2?

Questions

The experiments can be further enhanced. Please see weakness for details.

Limitations

Yes.

Final Justification

According to the rebuttal, I'll maintain my positive rating.

Formatting Issues

NO

Author Response

Response to Reviewer gBYd – Question 1: Generalizability of the Quantization Method

We thank the reviewer for pointing out the importance of evaluating generalizability across diverse model architectures. To address this point, we have conducted additional experiments beyond the original BasicVSR, including retraining a unidirectional variant of BasicVSR (denoted as BasicVSR-uni, where the backward propagation branch is removed). This setup provides a significantly different architectural structure while retaining the core super-resolution logic.

As shown in the updated results table below, our quantization method maintains strong performance across this newly introduced architecture. Even under full quantization, our approach consistently achieves competitive or superior results compared to full-precision baselines and other efficient VSR models, demonstrating the robustness and adaptability of our method to different network variants.

| Methods | W/A | FQ | Params(M) | GT | REDS4 (BI) | Vimeo-90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|---|
| BasicVSR-uni | 32/32 | - | 2.6 | | 30.54 / 0.8694 | 36.99 / 0.9429 | 27.03 / 0.8163 | 39.29 / 0.9646 | 37.27 / 0.9473 | 27.54 / 0.8419 |
| BasicVSR-uni-lite | 32/32 | - | 0.7 | | 29.95 / 0.8561 | 36.38 / 0.9372 | 26.68 / 0.8012 | 38.24 / 0.9586 | 36.38 / 0.9388 | 26.87 / 0.8157 |
| L1-norm-uni | 32/32 | - | 0.7 | | 29.97 / 0.8570 | 36.45 / 0.9381 | 26.70 / 0.8031 | 38.43 / 0.9601 | 36.53 / 0.9405 | 26.89 / 0.8187 |
| ASSL-uni | 32/32 | - | 1.3 | | 30.02 / 0.8589 | 36.49 / 0.9385 | 26.76 / 0.8051 | 38.48 / 0.9603 | 36.61 / 0.9416 | 27.02 / 0.8236 |
| SSL | 32/32 | - | 0.5 | | 30.24 / 0.8633 | 36.56 / 0.9392 | 27.01 / 0.8148 | 38.68 / 0.9615 | 36.77 / 0.9429 | 27.18 / 0.8296 |
| Ours | 6/6 | | 0.55 | | 30.35 / 0.8646 | 36.63 / 0.9396 | 26.81 / 0.8060 | 38.70 / 0.9610 | 36.83 / 0.9434 | 27.29 / 0.8310 |
| Ours | 6/6 | | 0.8 | | 30.40 / 0.8665 | 36.81 / 0.9413 | 26.91 / 0.8101 | 38.97 / 0.9633 | 37.11 / 0.9460 | 27.40 / 0.8372 |

To further validate the effectiveness of our quantization strategy itself, we conducted a direct comparison with other standard quantization methods under the same network setting — using BasicVSR-uni as a shared backbone.

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| BasicVSRuni | - | 32/32 | - | 30.54 / 0.8694 | 36.99 / 0.9429 | 27.03 / 0.8163 | 39.29 / 0.9646 | 37.27 / 0.9473 | 27.54 / 0.8419 |
| BasicVSRuni-MinMax | | 4/4 | 4min | 28.45 / 0.7997 | 34.31 / 0.9088 | 25.56 / 0.7308 | 35.53 / 0.9274 | 34.55 / 0.9142 | 25.82 / 0.7531 |
| BasicVSRuni-Percentile | | 4/4 | 3min | 28.12 / 0.7984 | 33.94 / 0.9096 | 24.81 / 0.7030 | 34.76 / 0.9318 | 33.97 / 0.9111 | 24.87 / 0.7138 |
| BasicVSRuni-MinMax+FT | | 4/4 | 35min | 28.86 / 0.8135 | 34.70 / 0.9146 | 25.72 / 0.7384 | 36.33 / 0.9392 | 34.88 / 0.9182 | 26.14 / 0.7667 |
| BasicVSRuni-Percentile+FT | | 4/4 | 30min | 28.35 / 0.8074 | 34.03 / 0.9107 | 24.92 / 0.7105 | 35.05 / 0.9341 | 34.06 / 0.9125 | 25.09 / 0.7242 |
| BasicVSRuni-Ours | | 4/4 MP | 35min | 29.55 / 0.8434 | 35.75 / 0.9296 | 26.15 / 0.7674 | 37.29 / 0.9517 | 35.72 / 0.9319 | 26.53 / 0.7938 |

These findings confirm that our quantization strategy generalizes well across multiple network forms, thereby addressing your concern about architecture-specific overfitting.

Response to Reviewer gBYd – Question 2: Real-World Evaluation

We thank the reviewer for highlighting the importance of real-world deployment evaluation. To address your concern, we conducted inference latency and memory footprint measurements on two representative hardware platforms: an x86-based Intel Xeon CPU and an ARM-based Apple M1 Pro chip. To simulate realistic deployment scenarios, the evaluation was performed with batch size set to 8 and input resolution of 180×320, which reflects a typical low-resolution video processing workload. The detailed results are provided below, which demonstrate that our quantized model achieves significant reductions in inference time and memory usage compared to the full-precision BasicVSR and the strong SSL baseline.

| Methods | BasicVSR | BasicVSR-SSL | BasicVSR-Ours |
|---|---|---|---|
| x86 CPU – latency | 31.59 sec | 23.00 sec (×1.373) | 14.01 sec (×2.255) |
| x86 CPU – peak memory | 11847.04 MB | 6029.70 MB (×1.965) | 7331.03 MB (×1.616) |
| ARM CPU – latency | 61.79 sec | 18.03 sec (×3.427) | 18.89 sec (×3.271) |
| ARM CPU – peak memory | 10725.58 MB | 9207.39 MB (×1.165) | 7029.81 MB (×1.526) |

We also note that current inference libraries primarily support INT4/INT8 acceleration, so intermediate bit-widths are upcasted to INT8 for compatibility. Despite this limitation, our method already demonstrates improved efficiency. We anticipate further gains once native support for intermediate-bit-width quantization becomes more widely available. Overall, these results clearly demonstrate the deployment potential of our quantized framework, showcasing substantial improvements in inference latency and memory usage under realistic settings.

Response to Reviewer gBYd – Question 3: Parameter Threshold Generality

We thank the reviewer for raising the question regarding the use of threshold-based parameters such as $l_{v2b}$, $u_{v2b}$, $l_{space}$, and $u_{space}$. In all our experiments, we adopt a fixed set of threshold values, which are shared across all datasets, models, and training stages. These values were not specifically tuned for any individual dataset. We maintain consistent settings as described in the experimental details, without performing dataset-specific fine-grained calibration. This design decision highlights an important strength of our method: it achieves strong and stable performance across diverse benchmarks without requiring extensive hyperparameter search or specific adjustments. Such robustness underscores the generalizability and practicality of our approach.

Response to Reviewer gBYd – Question 4: Real-world Video Evaluation

We appreciate the reviewer’s suggestion regarding evaluation on real-world video content. In principle, our framework can be directly applied to real-world videos. However, since real-world footage lacks corresponding high-quality ground truth (GT) references, commonly used full-reference metrics such as PSNR and SSIM cannot be computed. In such cases, only perceptual quality metrics (e.g., LPIPS, NIQE) or user studies can be used for evaluation. To maintain consistency and ensure fair, quantitative comparisons with prior work, we adopt standard datasets where GT is available. In other words, our framework is model-agnostic and fully compatible with arbitrary video input. Users may feed any real-world video through our quantized models to visually inspect results or evaluate using perceptual metrics.

Response to Reviewer gBYd – Question 5: Mapping from $C_{video}$ to $b_{video}$

We thank the reviewer for the question regarding how the global bit-width factor $b_{video}$ is derived from the video complexity $C_{video}$ in our Flow-Gradient Video Bit Adaptation (FG-VBA) module.

To clarify: the thresholds $l_{v2b}$ and $u_{v2b}$ are computed only once during the calibration phase, using a set of 100 video clips (the calibration dataset). We calculate the 100 complexity scores $C_{video}$ for these video clips and determine:

  • $l_{v2b}$ as the $p_V$-th percentile of the $C_{video}$ distribution

  • $u_{v2b}$ as the $(100 - p_V)$-th percentile of the $C_{video}$ distribution

This process constitutes the calibration phase. QBasicVSR then fine-tunes $l_{v2b}$ and $u_{v2b}$ in the fine-tuning phase. This statistical analysis sets fixed bounds that remain unchanged during inference. During inference, when a new video is processed, its $C_{video}$ is first computed. Then, the global bit-width adjustment factor $b_{video} \in \{-1, 0, +1\}$ is assigned as follows:

  • If $C_{video} > u_{v2b}$, the video is considered complex, and we set $b_{video} = +1$ to increase bit-widths and preserve fidelity.

  • If $C_{video} < l_{v2b}$, the video is deemed simple, and we set $b_{video} = -1$ to reduce bit-widths and save computation.

  • Otherwise, we keep $b_{video} = 0$, using the default configuration.

In this way, our framework automatically adapts bit-widths for unseen videos, using a content-aware rule established from the calibration dataset. This strategy ensures generalization across inputs without manual tuning per video.
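A minimal sketch of this calibration-then-inference rule (the percentile value p_V is a placeholder; the computation of C_video from flows and gradients is defined in the paper and abstracted away here):

```python
import numpy as np

def calibrate_v2b_thresholds(calib_complexities, p_v=10.0):
    """Compute l_v2b and u_v2b once from the C_video scores of the calibration clips."""
    l_v2b = np.percentile(calib_complexities, p_v)          # p_V-th percentile
    u_v2b = np.percentile(calib_complexities, 100.0 - p_v)  # (100 - p_V)-th percentile
    return l_v2b, u_v2b

def video_bit_factor(c_video, l_v2b, u_v2b):
    """Map a new video's complexity score to b_video in {-1, 0, +1}."""
    if c_video > u_v2b:
        return +1   # complex video: raise bit-widths to preserve fidelity
    if c_video < l_v2b:
        return -1   # simple video: lower bit-widths to save computation
    return 0        # otherwise keep the default configuration
```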

Response to Reviewer gBYd – Question 6: Clarification on Spatial Sensitivity Dimension

We sincerely thank the reviewer for the careful reading and for pointing out the discrepancy in our description of the spatial sensitivity computation in the Temporal Shared Layer Bit Adaptation (TS-LBA) module.

You are absolutely correct — the standard deviation for the spatial sensitivity term $s_{space}^k$ is computed along the spatial dimension (dim=2), not the channel dimension as mistakenly written. This was a misstatement in the textual description, and we apologize for any confusion it may have caused. To clarify, the correct computation aggregates the standard deviation across the spatial locations of the feature map to measure spatial variance per channel, which then guides the shared bit allocation logic. The implementation is correct and consistent with our intended design; only the textual description contains the incorrect dimension. We will ensure this is corrected in the final version of the paper. We sincerely thank the reviewer again for identifying this issue and contributing to the clarity and accuracy of our manuscript.
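To make the corrected dimension concrete, a small sketch of the intended computation (the final aggregation into a single score is an assumption for illustration):

```python
import torch

def spatial_sensitivity(feat):
    """feat: activation of one layer, shape (B, C, H, W).
    The std is taken over spatial locations (dim=2 after flattening H*W),
    giving the spatial variance per channel."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)      # dim 2 now indexes spatial locations
    per_channel_std = flat.std(dim=2)     # std across spatial positions, per channel
    return per_channel_std.mean()         # assumed aggregation into s_space^k
```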

Comment

I thank the authors for their detailed rebuttal. The authors have addressed some of the concerns I previously raised. I still have the following concerns. 1. For evaluation on real-world video content, since the authors claimed that the method can process any kind of input, it would be better to provide perceptual quality metrics and add these results to the supplemental materials. 2. Would it be possible to include a comparison of qualitative and quantitative results across different hardware platforms? The authors could add this part to the supplemental materials.

Comment

We sincerely thank the reviewer for emphasizing the importance of real-world evaluation and deployment scenarios. We clearly recognize and appreciate the reviewer’s concern regarding the practical applicability of our method in real deployment environments. Please rest assured — this is fully within our consideration and will be thoroughly addressed.

As ground truth is not available for real-world videos, common metrics such as PSNR and SSIM cannot be used in this task. Therefore, we will adopt two commonly used no-reference metrics, NIQE [1] and BRISQUE [2], to supplement the qualitative comparison in the supplementary material against the efficient VSR methods and all the different settings of BasicVSR and BasicVSR-uni mentioned in the paper. As our current framework does not yet include these perceptual metrics, we will need approximately two weeks to modify the code and complete this evaluation.

This task, as rightly pointed out by the reviewers, is both important and feasible, and we believe that accomplishing it will contribute to the progress of the field. We will also publicly release our first video super-resolution quantization framework, enabling researchers to focus more on the design of quantization methods and conveniently use our framework directly.

[1] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 2013.

[2] Anish Mittal, Anush K Moorthy, and Alan C Bovik. Blind/referenceless image spatial quality evaluator. 2011.

It is a valuable suggestion to include a comparison of qualitative and quantitative results across different hardware platforms.

We evaluate our model across several widely adopted benchmark datasets, including REDS4, Vid4 (with both BI and BD degradation settings), and UDM10 (BD). Although Vimeo-90K contains 7,824 test clips and is a standard benchmark, we exclude it from our evaluation due to the excessive testing time required on CPU platforms.

To ensure fair and stable latency measurements, we report the runtime, as shown in the comparison table included in our rebuttal submission.

For memory usage, we employ the memory_profiler.memory_usage() utility, which continuously samples memory consumption during the model’s forward pass. We report the peak memory usage, reflecting the realistic upper-bound memory footprint during inference.
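For reference, a minimal sketch of this measurement with memory_profiler (the model and input are placeholders, with shapes chosen to match the evaluation setting described above):

```python
import torch
from memory_profiler import memory_usage

def forward_pass(model, lr_frames):
    # single inference pass whose memory footprint we want to profile
    with torch.no_grad():
        return model(lr_frames)

def peak_memory_mb(model, lr_frames, interval=0.1):
    """Sample resident memory while the forward pass runs and return the peak (MiB)."""
    samples = memory_usage((forward_pass, (model, lr_frames), {}), interval=interval)
    return max(samples)

# usage sketch: a batch of 8 LR frames at 180x320
# peak = peak_memory_mb(quantized_model, torch.randn(1, 8, 3, 180, 320))
```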

This evaluation setup ensures that both latency and memory measurements are accurate and reproducible. We will conduct the additional evaluations as suggested and include the complete results in the supplementary materials.

We truly appreciate the reviewer’s concern regarding real-world evaluation, which is of strong practical value. We also sincerely thank the reviewer for the insightful suggestions, which contribute to advancing the field and making our work more comprehensive and rigorous. If there are any further questions or clarifications needed, we would be more than happy to discuss them. We highly value the opportunity to engage with the reviewer and deeply appreciate the constructive feedback once again.

Comment

Table 2: Inference latency and memory usage on x86 and ARM Platforms on REDS4.

(a) Intel Xeon Gold 5218R (dual socket; 2×20 cores; 40 threads used; 80 logical CPUs; AVX-512; Linux; x86 architecture)

| Method | Latency (s) | Peak Mem (MB) | Speedup vs BasicVSR |
|---|---|---|---|
| BasicVSR | 2677.44 | 18823.56 | 1x |
| SSL | 2203.05 | 19047.57 | 1.22x |
| Ours (W=6-bit†, A=6-bit‡, FQ) | 1963.63 | 19068.67 | 1.36x |

(b) Apple M1 Pro (10-core CPU; 16 GB RAM; 8 threads used; macOS; ARM architecture)

| Method | Latency (s) | Peak Mem (MB) | Speedup vs BasicVSR |
|---|---|---|---|
| BasicVSR | 1293.22 | 10506.25 | 1x |
| SSL | 1009.10 | 10122.02 | 1.28x |
| Ours (W=6-bit†, A=6-bit‡, FQ) | 659.82 | 10034.67 | 1.96x |

Notes. Latency is wall-clock time for the full REDS4 sequence (including optical-flow warping and all pre/post steps). Peak memory is the maximum resident set size during the forward pass. Intel is based on x86 architecture; M1 Pro is based on ARM architecture. Intel used 40 CPU threads; M1 Pro used 8 CPU threads. “SSL” denotes Structured Sparsity Learning for Efficient Video Super-Resolution, a structured-pruning variant of BasicVSR.

W=6-bit (INT8-exec): weights are quantized to 6-bit (64 levels) but stored/executed in INT8 containers due to current library constraints.
A=6-bit (INT8-exec): activations are executed via INT8 kernels; the effective activation precision is 6-bit and dynamically adapted at runtime (upcast to INT8 for execution).

In our cross-platform CPU evaluation on REDS4, we ran CPU-only inference on an Intel(R) Xeon(R) Gold 5218R system (dual socket, 2×20 cores; Linux; 40 threads) and on an Apple M1 Pro laptop (10-core CPU, 16 GB RAM; macOS). Latency is wall-clock time for the full sequence (including optical-flow warping and all pre/post steps), and memory is the peak resident set size during the forward pass.

Besides BasicVSR and SSL, we report our quantized model, where weights are quantized to 6-bit (64 levels), and activations are computed with dynamic 6-bit precision but executed via INT8 kernels during computation. For deployment, both weights and activations are upcast and executed via INT8 kernels due to current library constraints, as mainstream CPU inference stacks primarily accelerate INT8 (and increasingly INT4). We emphasize that while activations are stored and processed in INT8 format for execution due to hardware constraints, they are dynamically computed with 6-bit precision (adapted per input), which does not impact precision or quality. We expect further speedups and lower memory once inference stacks natively support sub-8-bit weights and dynamic activation precision (avoiding upcasting). The flow-warping/alignment operator currently runs in FP32 with lightweight de/re-quantization, which preserves reconstruction quality while sometimes shifting the peak-memory profile.

Under these settings, our method achieves consistent speedups: on Intel we observe 1.36× vs. BasicVSR (−26.7%) and 1.12× vs. SSL (−10.9%), while on M1 Pro we obtain 1.96× vs. BasicVSR (−49.0%) and 1.53× vs. SSL (−34.6%). We attribute the larger speedup on the ARM system to higher per-core efficiency, a unified low-latency memory subsystem without NUMA penalties, and favorable system library implementations, whereas the dual-socket Xeon run incurs cross-socket traffic and scheduler overhead even at 40 threads. Peak memory is essentially flat on Intel and slightly lower on M1 Pro. Accuracy metrics (PSNR/SSIM/NIQE/BRISQUE) are platform-independent for a fixed model, so we report them once in the main paper; complete cross-platform tables and reproduction details will be included in the supplementary material.

We would like to sincerely thank the reviewer for their valuable feedback and insightful suggestions. The suggestions provided have helped us improve both the clarity and quality of our work. Specifically, the points raised regarding the use of perceptual metrics and the evaluation across different hardware platforms have allowed us to present a more comprehensive and balanced assessment of our method. We truly appreciate the opportunity to enhance our manuscript and believe that these revisions have significantly strengthened our contributions. Thank you once again for your thoughtful and constructive comments.

Comment

Table 1: Quantitative comparison (PSNR ↑ / SSIM ↑ / NIQE ↓ / BRISQUE ↓). All results are calculated on the Y-channel except REDS4 (RGB-channel).

| Methods | W/A | FQ | Params(M) | GT | REDS4 (BI) | Vid4 (BI) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|
| BasicVSR | 32/32 | - | 4.9 | | 31.42 / 0.8909 / 20.4533 / 42.0603 | 27.24 / 0.8251 / 21.2263 / 46.6820 | 27.96 / 0.8553 / 20.9968 / 43.2148 |
| SSL | 32/32 | - | 1.0 | | 31.06 / 0.8833 / 20.5275 / 42.5137 | 27.15 / 0.8208 / 21.2449 / 46.4442 | 27.56 / 0.8431 / 20.9479 / 44.2100 |
| Ours | 6/6 | | 1.3 | | 31.26 / 0.8879 / 20.5416 / 42.6729 | 27.18 / 0.8215 / 21.1865 / 46.8981 | 27.84 / 0.8495 / 21.0283 / 44.0302 |
| Ours | 6/6 | | 1.0 | | 31.17 / 0.8849 / 20.5608 / 41.9935 | 27.05 / 0.8140 / 21.1727 / 45.7232 | 27.69 / 0.8405 / 21.0390 / 43.9466 |
| Ours | 4/4 | | 1.0 | | 30.34 / 0.8657 / 20.9095 / 46.9496 | 26.26 / 0.7764 / 21.0902 / 47.0901 | 27.02 / 0.8161 / 21.5588 / 44.8271 |
| Ours | 4/4 | | 0.7 | | 30.26 / 0.8637 / 20.7069 / 46.0018 | 26.28 / 0.7752 / 21.1959 / 45.3042 | 26.81 / 0.8025 / 21.2659 / 44.4966 |
| MinMax | 4/4 | | 0.7 | | 28.12 / 0.7896 / 20.5620 / 42.6086 | 26.03 / 0.7585 / 21.7752 / 44.2720 | 25.64 / 0.7412 / 21.6568 / 40.8614 |
| MinMax+FT | 4/4 | | 0.7 | | 29.21 / 0.8238 / 21.1095 / 45.4638 | 26.17 / 0.7601 / 21.9205 / 45.7476 | 26.11 / 0.7584 / 22.2404 / 43.5693 |
| Percentile | 4/4 | | 0.7 | | 27.78 / 0.7841 / 20.6565 / 46.4639 | 25.23 / 0.7305 / 21.0079 / 48.0859 | 25.08 / 0.7275 / 21.3866 / 45.0816 |
| Percentile+FT | 4/4 | | 0.7 | | 28.30 / 0.8054 / 20.8440 / 46.1643 | 25.26 / 0.7314 / 21.3187 / 48.1076 | 25.26 / 0.7340 / 21.5298 / 45.3654 |

Note on distortion vs. perceptual metrics.
As shown in Table 1, improvements on distortion-oriented metrics (PSNR ↑, SSIM ↑) do not necessarily translate into better no-reference perceptual scores (NIQE ↓, BRISQUE ↓). This is expected. PSNR/SSIM reward pixel-wise fidelity and are often maximized by smoothing or attenuating high-frequency textures, whereas NIQE/BRISQUE assess overall naturalness and can prefer sharper, more textured reconstructions. Consequently, the correlation between these families of metrics is positive on average, but it is not monotonic and varies with content and degradation settings.

Importantly, in classical supervised SR with paired ground truth, the task is defined as strict pixel-level reconstruction of the reference image; therefore, PSNR/SSIM remain the primary evaluation criteria for this strict SR setting. At the same time, to reflect perceptual naturalness—especially for real-world content where ground truth is unavailable—we additionally report NIQE/BRISQUE as complementary no-reference indicators. We thus present both families of metrics (together with visual comparisons) to provide a balanced and reproducible assessment, without implying that one family is universally superior to the other. All corresponding numerical results, including the additional NIQE/BRISQUE evaluations on real-world videos, will be provided in the supplementary material.
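
For reference, a minimal sketch of the Y-channel PSNR protocol used in Table 1, assuming 8-bit inputs and the BT.601 luma conversion commonly used in SR evaluation; NIQE and BRISQUE rely on pretrained natural-scene-statistics models and are not reproduced here.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """BT.601 luma from an 8-bit RGB image (H, W, 3), as commonly used for
    Y-channel PSNR/SSIM in SR benchmarks."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr(ref: np.ndarray, dist: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a reference and a distorted frame."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Usage: Y-channel PSNR as in Table 1 (REDS4 would instead use all RGB channels).
ref = np.random.randint(0, 256, (180, 320, 3), dtype=np.uint8)
out = np.clip(ref + np.random.randint(-3, 4, ref.shape), 0, 255).astype(np.uint8)
print(psnr(rgb_to_y(ref), rgb_to_y(out)))
```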

Takeaways for Table 1 (why Ours):
Under the same 4/4-bit + FQ setting, our method consistently delivers higher fidelity and strong perceptual quality than standard PTQ baselines. On REDS4, Ours (4/4, FQ) improves PSNR over MinMax+FT by +1.05 dB (30.26 vs. 29.21) with a lower NIQE (20.71 vs. 21.11). On Vid4 (BI), Ours improves PSNR over Percentile+FT by +1.02 dB (26.28 vs. 25.26) and yields better NIQE/BRISQUE (21.20/45.30 vs. 21.32/48.11). On Vid4 (BD), Ours is +0.70 dB higher in PSNR than MinMax+FT (26.81 vs. 26.11) with lower NIQE (21.27 vs. 22.24). Versus Percentile+FT, Ours shows broad gains across all three datasets in PSNR/SSIM and NIQE, and also reduces BRISQUE (e.g., Vid4-BI: 45.30 vs. 48.11). In short, unlike other PTQ methods (MinMax/Percentile, ±FT) that often trade off distortion and perception, our adaptive spatio-temporal bit allocation achieves reconstructions that are both quantitatively accurate and perceptually natural.

Against pruning (SSL) and full-precision baselines:
Although SSL prunes BasicVSR to 1.0M params, our 6/6-bit models (1.0–1.3M params) are competitive or better in PSNR/SSIM while using far lower precision; e.g., Ours (6/6, FQ) on REDS4/Vid4(BD) matches or exceeds SSL, and on Vid4(BI) remains close. Compared with BasicVSR (32/32, 4.9M), Ours (6/6) narrows the fidelity gap to within 0.2–0.3 dB on average while using 4–5× fewer parameters, and with perceptual scores that are comparable or better in several cases.

On NIQE vs. BRISQUE:
NIQE and BRISQUE capture complementary aspects of perceptual quality and need not move in lockstep: an improvement in one does not necessarily imply an improvement in the other. We therefore report both families alongside PSNR/SSIM and provide visual comparisons in the supplementary material to present a balanced and reproducible assessment.

Review
5

The authors propose QBasicVSR: Temporal Awareness Adaptation Quantization for Video Super-Resolution, a quantized model based on BasicVSR for video super-resolution. The authors' main design focus is the quantization bit depth. Specifically, QBasicVSR uses a dual adaptation strategy, flow-gradient video bit adaptation and temporal shared layer bit adaptation, optimizing bit-width allocation for both video streams and shared network layers. Additionally, they propose a fine-tuning method to refine the quantization parameters, supervised by the full-precision model. Experiments demonstrate that the proposed method outperforms comparative methods.

Strengths and Weaknesses

Strengths

  1. The authors propose a dual adaptation strategy combining flow-gradient video bit adaptation and temporal shared layer bit adaptation to adapt bit-widths based on video complexity and layer sensitivity, effectively reducing quantization error.
  2. A fine-tuning strategy is introduced to further optimize bit allocation and clipping ranges under full-precision supervision.
  3. Testing on multiple datasets shows that the quantized model maintains high accuracy under low-bit quantization.
  4. The authors conduct thorough experiments on various modules in the ablation study, proving the effectiveness of each proposed method.

Weaknesses

  1. The comparison of the quantization method with VSR methods in terms of training time and resource consumption is strange (Table 4). The claim in the article that it is 200 times faster than SSL is confusing.
  2. The static quantization methods used for comparison are outdated (MinMax, 2018; Percentile, 2019). Comparisons with more recent quantization methods, e.g., DBDC+Pac, SVDQuant, and SmoothQuant, are lacking.
  3. As L239 mentions, the bit count for different videos is dynamically determined, which may be unfriendly for actual deployment on hardware.

Questions

  1. Why is the comparison of resource and time consumption made between quantization methods and VSR methods, when the quantization method is based on the VSR method?
  2. Please include comparisons with more recent quantization methods that can be applied to VSR models.
  3. Clarify how the bit number "w6a6" in the experimental result table is obtained, since the quantization bit-width is dynamically optimized. Is it the average value or b_{base}? If it's the latter, what would the average bit be, and is it possible that the actual bit-width could be significantly larger than "w6a6" due to b_V and b_L^k remaining large?
  4. The video-wise bit adaptation factor b_{video} is dynamically determined during inference. How much computational load will this introduce? Is it acceptable for quantization?

Limitations

The authors have discussed relevant limitations.

Justification for Final Rating

The proposed method is well-designed and effective. Moreover, the authors have resolved my concerns during the rebuttal phase. I think the paper deserves a score of 5.

Formatting Issues

N/A

Author Response

Response to Reviewer E16n - Question 1 (Comparison with SSL):

Thank you for raising this point. We agree that comparing quantization methods directly with core VSR methods in terms of training time and resources might seem unusual at first glance. However, Table 4 is specifically intended to compare the efficiency of different VSR model compression techniques applied to pretrained VSR models, not to compare quantization against core VSR training itself. Both belong to the family of efficient VSR methods.

1. SSL as a Compression Baseline: SSL is a VSR model compression technique applied to pretrained VSR models. Its compression process (pruning + fine-tuning the pre-trained VSR model) is reported to require 8 A6000 GPUs for 370 hours.

2. Our Method as a Quantization Compression: Similarly, our quantization method is a model compression technique applied to the same pretrained VSR models. Our quantization process requires only 1 A6000 GPU for 105 minutes.

3. Revision Plan: We will clarify this intent explicitly in the revised manuscript, emphasizing that the comparison is between existing SOTA compression techniques for VSR models, not between quantization and core VSR training. We will also explicitly show the time calculation (370 hours = 22,200 minutes; 22,200 min / 105 min ≈ 211×) for transparency.

Response to Reviewer E16n - Question 2 (Comparison with Recent Quantization Methods):

We appreciate the reviewer’s valuable suggestion. We agree that recent quantization methods provide more meaningful baselines. Accordingly, we have added a comparison with DBDC, which is one of the few recent methods explicitly designed for low-bit quantization in super-resolution (SR) tasks. DBDC shares a similar target with our approach and provides a more relevant reference than earlier static quantization methods.

Regarding SVDQuant and SmoothQuant, while they are indeed recent and impactful quantization approaches, they are not directly applicable to SR models, especially those with flow-guided temporal propagation in VSR. Specifically:

  • SVDQuant is designed around SVD-based matrix decomposition and is typically applied to classification models. Its assumptions and compression pipeline are not tailored for the unique challenges of SR, such as spatial fidelity and temporal consistency.

  • SmoothQuant relies on cross-layer fusion and assumes the presence of skip-connections across layers. In SR architectures like BasicVSR, particularly within the propagation modules, such skip-connections are structurally absent or non-trivial due to warping and alignment stages. Thus, direct application is not possible without additional architecture modification or fine-tuning, which would go beyond the scope of a fair out-of-the-box comparison.

Furthermore, as highlighted in the paper “Toward Accurate Post-Training Quantization for Image Super Resolution”, SR tasks pose unique difficulties for quantization, including:

  • Highly dynamic activation ranges
  • Asymmetric and long-tailed distributions
  • Absence of normalization layers like BatchNorm

These properties make the direct transfer of classification-oriented quantization or NLP quantization methods (e.g., SVDQuant or SmoothQuant) not only ineffective but also potentially misleading unless carefully adapted and re-trained. In fact, even in the DBDC paper, the authors only compare against Minmax and Percentile methods — reinforcing the lack of universally accepted strong quantization baselines for SR.

In summary, we intentionally focus on comparing with DBDC as the most relevant and compatible recent method for SR. Including SVDQuant or SmoothQuant would require non-trivial adjustments and retraining under our method’s supervision, potentially biasing the comparison or altering their original design intents.

Each dataset cell reports PSNR / SSIM / FAB (feature average bit-width); all rows use 4-bit full quantization (see the note below).

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| BasicVSR-DBDC | ✓ | 4/4 | 30 min | 28.07 / 0.7817 / 4.0 | 35.23 / 0.9225 / 4.0 | 26.24 / 0.7720 / 4.0 | 35.96 / 0.9339 / 4.0 | 34.77 / 0.9172 / 4.0 | 26.01 / 0.7581 / 4.0 |
| BasicVSR-DBDC+FT | ✓ | 4/4 | 95 min | 29.22 / 0.8223 / 4.0 | 35.45 / 0.9243 / 4.0 | 26.31 / 0.7703 / 4.0 | 36.82 / 0.9434 / 4.0 | 35.25 / 0.9228 / 4.0 | 26.42 / 0.7764 / 4.0 |
| BasicVSR-Ours | ✓ | 4/4MP | 90 min | 30.26 / 0.8637 / 5.5 | 35.82 / 0.9311 / 5.0 | 26.29 / 0.7752 / 4.7 | 37.59 / 0.9536 / 4.8 | 35.95 / 0.9339 / 5.1 | 26.81 / 0.8025 / 5.0 |

Due to space limitations, we only report the 4-bit full quantization results in all the tables.

Response to Reviewer E16n - Question 3 (The bit number):

We appreciate the reviewer’s detailed question. A reported bit-width configuration such as "w6a6" refers specifically to the base bit-width b_{base}, which defines the initial quantization level before dynamic adaptation is applied. For example, in the "w4a4" case, both weights and activations are initialized at 4-bit.

To clarify how the final bit-widths are derived, our fine-tuning process consists of three sequential optimization stages:

  1. Stage 1: Optimize weight clipping parameters while freezing activations and bit adaptation parameters.
  2. Stage 2: Optimize activation clipping parameters, with others still fixed.
  3. Stage 3: Optimize bit adaptation parameters, enabling more precise dynamic bit-width adjustment during inference.

Each stage independently fine-tunes a subset of parameters, while keeping others fixed, and the entire process completes in only three epochs.
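
As illustration, a minimal sketch of this staged schedule is given below; it assumes PyTorch, an L1 distillation loss against the full-precision teacher, and illustrative parameter-group names ('weight_clip', 'act_clip', 'bit_adapt') that may differ from our actual implementation.

```python
import torch

def finetune_in_stages(model, calib_loader, fp_teacher,
                       lr=1e-4, epochs_per_stage=1):
    """Three-stage PTQ fine-tuning under full-precision supervision:
    each stage unfreezes exactly one parameter group and keeps the rest fixed.
    Group names ('weight_clip', 'act_clip', 'bit_adapt') are illustrative."""
    fp_teacher.eval()
    for stage in ("weight_clip", "act_clip", "bit_adapt"):
        # Freeze everything except the parameters belonging to this stage.
        for name, p in model.named_parameters():
            p.requires_grad = stage in name
        opt = torch.optim.Adam(
            [p for p in model.parameters() if p.requires_grad], lr=lr)
        for _ in range(epochs_per_stage):
            for lq_clip in calib_loader:              # unlabeled low-res clips
                with torch.no_grad():
                    target = fp_teacher(lq_clip)      # full-precision output as target
                loss = torch.nn.functional.l1_loss(model(lq_clip), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model
```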

Due to the dynamic nature of our method, the actual bit-width per layer may deviate from the base setting. Importantly, even in Stages 1 and 2, our model already outperforms existing baselines (e.g., MinMax, Percentile, DBDC) while using only a low feature average bit-width (FAB), averaged over the video clips of the test dataset. The bit adaptation modules are present across all stages, but they are optimized only in Stage 3, where the model learns bit allocation based on content and layer dynamics. This further improves performance at the cost of a slightly higher average bit-width (e.g., up to 5.5 on REDS4), offering a tunable trade-off between precision and efficiency depending on deployment constraints.

Even after only Stage 1, our method already outperforms existing baselines (e.g., MinMax, Percentile, DBDC) in both accuracy and average bit cost. This design allows flexibility:

  • If low-bit efficiency is prioritized, one may adopt the results from Stage 1 or 2.
  • If higher performance is desired, Stage 3 offers further improvement with a modest increase in average bit-width.

This staged approach demonstrates the dynamic adaptability of our framework and offers a tunable trade-off between compression and performance. Notably, the bit-width of weights remains fixed at 4 bits throughout all stages in "w4a4" and does not dynamically change; only the bit-width of activations is dynamically adapted.

Each dataset cell reports PSNR / SSIM / FAB (feature average bit-width); all rows use 4-bit full quantization.

| Methods | FQ | W/A | Process Time | REDS4 (BI) | Vimeo90K-T (BI) | Vid4 (BI) | UDM10 (BD) | Vimeo90K-T (BD) | Vid4 (BD) |
|---|---|---|---|---|---|---|---|---|---|
| BasicVSR-DBDC | ✓ | 4/4 | 30 min | 28.07 / 0.7817 / 4.0 | 35.23 / 0.9225 / 4.0 | 26.24 / 0.7720 / 4.0 | 35.96 / 0.9339 / 4.0 | 34.77 / 0.9172 / 4.0 | 26.01 / 0.7581 / 4.0 |
| Ours-Stage1 | ✓ | 4/4 | 40 min | 29.82 / 0.8466 / 3.98 | 35.46 / 0.9266 / 3.95 | 26.12 / 0.7666 / 3.75 | 36.69 / 0.9464 / 3.70 | 35.45 / 0.9272 / 3.95 | 26.32 / 0.7716 / 3.75 |
| Ours-Stage2 | ✓ | 4/4 | 65 min | 29.86 / 0.8479 / 3.98 | 35.51 / 0.9270 / 3.95 | 26.21 / 0.7695 / 3.75 | 36.76 / 0.9471 / 3.70 | 35.54 / 0.9283 / 3.95 | 26.38 / 0.7741 / 3.75 |
| Ours-Stage3 | ✓ | 4/4 | 90 min | 30.26 / 0.8637 / 5.50 | 35.82 / 0.9311 / 5.03 | 26.29 / 0.7752 / 4.69 | 37.59 / 0.9536 / 4.78 | 35.95 / 0.9339 / 5.08 | 26.81 / 0.8025 / 4.98 |

Response to Reviewer E16n - Question 4 (Computation load):

The computation of the video-wise bit adaptation factor b_{video} introduces negligible computational overhead, as it is performed once per video based on simple statistical properties extracted during a forward pass. This process does not require iterative optimization and is implemented with minimal additional cost.
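
For illustration only, the sketch below shows how a once-per-video flow statistic could be mapped to an activation bit-width offset; the statistic (mean flow-gradient magnitude), thresholds, and mapping are hypothetical placeholders, not the exact FG-VBA rule used in the paper.

```python
import torch

def video_bit_offset(flows: torch.Tensor, b_base: int = 4,
                     thresholds=(0.05, 0.15)) -> int:
    """Map a per-video motion-complexity statistic to a small activation
    bit-width offset. `flows` holds the estimated optical flow for the clip
    with shape (T-1, 2, H, W); thresholds here are illustrative."""
    # Spatial gradients of the flow field as a cheap complexity proxy.
    gx = flows[..., :, 1:] - flows[..., :, :-1]
    gy = flows[..., 1:, :] - flows[..., :-1, :]
    complexity = 0.5 * (gx.abs().mean() + gy.abs().mean()).item()
    # Harder clips get a higher activation bit-width; easy clips keep b_base.
    offset = sum(complexity > t for t in thresholds)
    return b_base + offset

# Computed once per clip from the flow already produced by the forward pass.
flows = torch.randn(6, 2, 64, 64) * 0.1
print(video_bit_offset(flows))
```

The whole computation amounts to a few element-wise operations over an already-computed flow field, which is negligible next to a single propagation step of the network.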

To empirically validate the efficiency of our approach, we report the processing time required for different methods under the 4-bit full quantization setting above. Our method achieves faster processing compared to DBDC+FT across all stages. For instance:

  • Even our full-stage method (Stage 3), which includes the optimization of the bit adaptation parameters, completes in only 90 minutes and outperforms DBDC+FT in both speed and accuracy.

This clearly demonstrates that the integration of b_{video} remains efficient and practical for deployment, and does not introduce any prohibitive computational load. We thank the reviewer once again for these valuable comments.

Comment

Thank you to the authors for their response. My concerns regarding runtime, quantization parameters, and comparison methods have been addressed. Therefore, I will raise my score.

Comment

Thank you very much for your thoughtful feedback and for taking the time to review our rebuttal so thoroughly. We truly appreciate your recognition of our efforts to address your concerns regarding runtime, quantization parameters, and comparison methodology. We are especially grateful for your decision to raise your score — your support and encouragement mean a great deal to us.

We will take your suggestions seriously as we revise the manuscript to make our intentions and technical contributions clearer for the broader community. Thank you again for your constructive input and kind acknowledgment.

Final Decision

This paper proposes an effective temporal awareness adaptation quantization method for video super-resolution. The proposed method achieves better performance with fewer GPU resources.

The major concerns include generalization ability, the need for comparisons with more recent methods, limited novelty, and missing running-time comparisons.

Based on the provided rebuttals, the authors resolve most of the reviewers' concerns. However, Reviewer zMGT still has concerns about generalization ability. The authors are encouraged to address this concern carefully in the camera-ready version.