Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu

Submitted: 2023-09-15 | Updated: 2024-03-26

Abstract

Keywords

video super-resolution, adaptive, memory and computation efficiency

Reviews and Discussion

Review

Rating: 3 | Confidence: 4 | 2023-10-28

This paper proposes a Transformer-based recurrent video super-resolution model, termed MIA-VSR. The aim is to reduce redundant computation in VSR models. To achieve this goal, the authors propose two components: an intra-frame and inter-frame attention block (IIAB or MIA) and an adaptive mask predicting module (MPM). MIA aims to provide supplementary information from the previously enhanced features (temporal information). MPM aims to generate block-wise masks to reduce computation. Experiments show that MIA-VSR achieves good results on several datasets.
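To make the summary concrete, here is a minimal sketch of one plausible reading of the attention block: queries come only from the current frame, while keys and values combine the current frame (intra) with previously enhanced features (inter). It is not the authors' implementation. The function name `intra_inter_attention`, the toy token values, and the omission of learned projections are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intra_inter_attention(cur_tokens, prev_enhanced_tokens):
    """Hypothetical sketch: queries from the current frame only;
    keys/values = current tokens (intra) + previous enhanced tokens (inter).
    Learned Q/K/V projections are omitted (identity) for brevity."""
    context = cur_tokens + prev_enhanced_tokens
    dim = len(cur_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in cur_tokens:
        weights = softmax([dot(q, k) * scale for k in context])
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(dim)])
    return out

# Toy data (assumed): two current-frame tokens, one token from the previous frame.
cur = [[1.0, 0.0], [0.0, 1.0]]
prev = [[0.5, 0.5]]
result = intra_inter_attention(cur, prev)
print(result)
```

The output has one vector per current-frame token, so the previous frame only supplements the current features and never adds tokens of its own.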

Strengths

In the field of video super-resolution, redundant computation is well worth studying. The problem explored in this paper is meaningful, and reducing feature computation through a mask strategy sounds reasonable.
The key idea is simple and easy to understand. From the experimental results, the method in this paper seems to be effective.

Weaknesses

The core problem this paper targets is redundant computation in VSR, so a mask strategy is proposed to skip computation on unimportant features. However, attention mechanisms that exploit temporal information have been discussed in many previous works, and the paper's second contribution (MIA) does not seem to differ from existing attention mechanisms. In other words, directly computing attention on the output features after the mask strategy is also computation-intensive. If the authors claim the attention mechanism as a contribution, it should be compared against this baseline.
Regarding the reduction of redundant computation through the mask strategy, the authors should discuss how it differs from other approaches such as Token Merging and TTVSR. TTVSR also reduces computation by limiting attention to optical-flow trajectories, which it obtains by computing the temporal relationship of the optical flow. The authors should cite these works and discuss the differences.
Experiments. The mask has a binary ratio, and the authors should run ablation experiments on it, reporting FLOPs, rather than only choosing 0.5. Many recent VSR methods, such as TTVSR and FTVSR, are neither referenced nor compared.
Writing. The second paragraph of the introduction reads like related work. The figures are cluttered and hard to understand; in Fig. 1, for example, the many arrows are easy to misread. Why do two arrows point to mask M? Is it the output of MPM?
The authors claim to address the problem of heavy computational burden. However, according to Tab. 9, BasicVSR++ achieves 7.3M/92ms/32.39dB, whereas MIA-VSR achieves 16.5M/822ms/32.78dB. These results do not show an advantage. To sum up, I think the innovation and contribution of this paper are not clear enough, and the writing and experiments are insufficient. This paper is not sufficient for acceptance by ICLR.
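The Tab. 9 comparison in the last point can be reduced to a few ratios. The sketch below uses only the figures quoted in this review and assumes the triples mean parameters (millions), runtime (ms per frame), and PSNR (dB). The dictionary keys and the helper `frames_per_second` are names chosen here for illustration.

```python
# Figures as quoted by the reviewer from Tab. 9.
# Assumed meaning: parameters in millions, runtime in ms per frame, PSNR in dB.
basicvsr_pp = {"params_m": 7.3, "runtime_ms": 92.0, "psnr_db": 32.39}
mia_vsr = {"params_m": 16.5, "runtime_ms": 822.0, "psnr_db": 32.78}

param_ratio = mia_vsr["params_m"] / basicvsr_pp["params_m"]
runtime_ratio = mia_vsr["runtime_ms"] / basicvsr_pp["runtime_ms"]
psnr_gain = mia_vsr["psnr_db"] - basicvsr_pp["psnr_db"]

def frames_per_second(runtime_ms):
    # Throughput implied by a per-frame runtime in milliseconds.
    return 1000.0 / runtime_ms

print(f"parameters: {param_ratio:.2f}x")
print(f"runtime:    {runtime_ratio:.2f}x")
print(f"PSNR gain:  {psnr_gain:.2f} dB")
print(f"BasicVSR++: {frames_per_second(basicvsr_pp['runtime_ms']):.2f} fps")
print(f"MIA-VSR:    {frames_per_second(mia_vsr['runtime_ms']):.2f} fps")
```

Under these assumptions, MIA-VSR uses roughly 2.3 times the parameters and 8.9 times the runtime of BasicVSR++ for a gain of 0.39 dB, which is the trade-off the reviewer is questioning.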

Questions

See weaknesses.

Review

Rating: 5 | Confidence: 4 | 2023-10-30

The paper proposes a new framework called MIA-VSR for video super-resolution (VSR) tasks. The framework utilizes a feature-level masked processing approach to reduce computational burden and memory footprint, making it more suitable for deployment on constrained devices. The key component of MIA-VSR is the intra-frame and inter-frame attention block, which considers the roles of past features and input features and only uses previously enhanced features to provide supplementary information. Additionally, an adaptive block-wise mask predicting module is developed to skip unimportant computations based on feature similarity between adjacent frames. Ablation studies and comparisons with state-of-the-art VSR methods demonstrate that MIA-VSR improves memory and computation efficiency without sacrificing PSNR accuracy.
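The idea of skipping blocks based on feature similarity between adjacent frames can be sketched as follows. This is a stand-in, not the paper's module: the review describes the mask predictor as adaptive, whereas this sketch uses a fixed, hand-set threshold on the mean absolute difference per block. The function name `block_mask`, the block size, and the threshold are assumptions.

```python
def block_mask(prev_feat, cur_feat, block=2, threshold=0.1):
    """Hypothetical sketch: return one 0/1 flag per block.
    1 = features changed enough, recompute; 0 = similar to previous frame, skip."""
    height, width = len(cur_feat), len(cur_feat[0])
    mask = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            diffs = [
                abs(cur_feat[y][x] - prev_feat[y][x])
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ]
            mean_diff = sum(diffs) / len(diffs)
            row.append(1 if mean_diff > threshold else 0)
        mask.append(row)
    return mask

# Toy 4x4 single-channel feature maps (assumed): only the top-left block changes.
prev = [[0.0] * 4 for _ in range(4)]
cur = [row[:] for row in prev]
cur[0][0] = 1.0

mask = block_mask(prev, cur)
kept = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
print(mask, kept)
```

In this toy case only one of four blocks is flagged for recomputation. The fraction of kept blocks is the quantity that would drive any compute saving.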

Strengths

The authors try to accelerate Transformer-based VSR at multiple levels, and I think masked processing is reasonable.
The comparative experiments are objective and detailed.

Weaknesses

1. My main concern is that the effect of this method is not significantly improved compared to previous work.

1.1 From Table 3, choosing a Transformer for this task does not bring obvious benefits, especially since CNNs are compatible with more inference acceleration frameworks. Compared with BasicVSR++, subsequent work uses an order of magnitude more computation but has not delivered an improvement that I think is worth it. The impression this paper gives me is that it hopes to improve the practicality of this type of method by improving the processing efficiency of VSR. However, the results in the paper do not seem to achieve this goal. After all, BasicVSR++ is already very slow for users.

1.2 In terms of visual effects comparison, overall there seems to be no significant advantage over PSRT-recurrent.

2. About masked processing

2.1 I'm worried it's not novel enough. In low-level vision, this kind of block-wise processing is not uncommon. Here are a few examples:

Image SR: Restore Globally, Refine Locally: A Mask-Guided Scheme to Accelerate Super-Resolution Networks
Background Matting: Real-Time High-Resolution Background Matting

Although this paper does so by considering temporal continuity, given the effects shown in Table 1, I think this contribution is insufficient.

Questions

How do you avoid the blocking artifacts that masked processing may introduce?
I think Figure 1 needs to be redrawn. What is the key message this figure is trying to highlight?
If generating 720p video requires one second per frame to process, in what scenario do we need video SR?

Review

Rating: 3 | Confidence: 5 | 2023-10-31

To address the heavy computational burden and large memory footprint in Transformer-based Video Super-resolution (VSR), this paper proposes a masked intra and inter frame attention (MIA-VSR). MIA-VSR uses feature-level temporal continuity between adjacent frames. The experiments demonstrate the effectiveness of the proposed method.

Strengths

This paper proposes an intra-frame and inter-frame attention block to enhance SR features, and an adaptive mask predicting module to mask out unimportant regions between adjacent frames.
Compared with existing Transformer-based VSR methods, the proposed method has a lower computational cost and a smaller memory footprint.

Weaknesses

The novelty of this paper is not clear.
The performance gain on benchmark datasets is not significant. Although the proposed method has a lower computational cost and a smaller memory footprint than existing Transformer-based VSR methods, it is still challenging to deploy on smartphones (the main issue that the authors set out to solve).

Questions

The paper is motivated by reducing the computational burden and the large memory footprint, with the goal of a VSR method for smartphones and consumer electronic products. However, the model is large and not very efficient. For real applications, BasicVSR++ has more advantages than the proposed MIA-VSR. Compared with MIA-VSR, RVRT has a smaller model size, a shorter runtime, and comparable PSNR.
Some details in Figure 1 are not clear. For example, the inputs of MPM are not clear. How is x_m^{t-2} in the orange block obtained? What are the blue dashed lines? Why are the output video results poor?
The performance gain is not significant under different metrics. In addition, in Figure 4, it would be better to provide BasicVSR++ results instead of BasicVSR or EDVR.

Details of Ethics Concerns

None

Review

Rating: 6 | Confidence: 4 | 2023-11-01

This paper presents a novel Transformer-based video super-resolution model called MIA-VSR (Masked Intra and Inter-frame Attention Video Super-Resolution). The model aims to improve the efficiency of video super-resolution by leveraging temporal continuity between adjacent frames and reducing redundant computations. The key components of MIA-VSR include an intra-frame and inter-frame attention block (IIAB) and an adaptive mask predicting module.

Strengths

Improved efficiency: MIA-VSR reduces computational complexity and memory footprint without sacrificing video super-resolution performance.
Effective use of temporal information: The model leverages temporal continuity between frames to avoid unnecessary computations and provide better results.
Adaptive masking: The adaptive mask predicting module generates block-wise masks to skip unimportant computations, further improving efficiency.

Weaknesses

Complexity: The model may be more complex to implement and train compared to simpler video super-resolution methods.
Limited applicability: The effectiveness of MIA-VSR may be limited to specific video super-resolution tasks and datasets.
Runtime: Although MIA-VSR reduces computational complexity, its runtime may still be slower than some other methods due to the Transformer architecture.

Questions

In the comparison with state-of-the-art methods, you mentioned that MIA-VSR achieves better trade-offs between accuracy and efficiency. How does MIA-VSR handle the trade-off between model size and computational efficiency? Can you provide more quantitative analysis or visualizations to support this claim?
Can you provide some insights on the design choices for the Intra-frame and Inter-frame Attention Block (IIAB)? How does it differ from other attention mechanisms used in video super-resolution models?