PaperHub
Rating: 6.6 / 10
Poster · 4 reviewers
Scores: 2, 4, 4, 4 (lowest 2, highest 4, standard deviation 0.9)
ICML 2025

QMamba: On First Exploration of Vision Mamba for Image Quality Assessment

OpenReview · PDF
Submitted: 2025-01-14 · Updated: 2025-07-24


Keywords
Image Quality Assessment, State Space Model, Prompt Tuning

Reviews and Discussion

Review
Rating: 2

This work claims to be the first to introduce the Mamba architecture into IQA, and it proposes a StylePrompt tuning mechanism to enhance transfer capability. The proposed method achieves better results with lower FLOPs across multiple datasets.

Questions for Authors

Please refer to the strengths and weaknesses.

Claims and Evidence

The proposed method does not achieve leading performance on classic datasets such as LIVE, CSIQ, and LIVEFB under similar parameter counts or FLOPs.

Methods and Evaluation Criteria

It follows established protocols.

Theoretical Claims

I checked the method part and the corresponding formulas. There are no significant issues.

Experimental Design and Analysis

The experimental designs make sense, including the quantitative results in Tables 1–4 and the visualizations in Figure 3.

Supplementary Material

I reviewed the tables in the appendix.

Relation to Prior Work

It explores the application of the Mamba design in the IQA field, which may provide some inspiration for future research on IQA architectures.

Missing Important References

Some recent comparison methods are missing. For instance:

[1] TOPIQ: A Top-Down Approach From Semantics to Distortions for Image Quality Assessment

[2] Attention Helps CNN See Better: Hybrid Image Quality Assessment Network

[3] Exploring Rich Subjective Quality Information for Image Quality Assessment in the Wild

[4] SF-IQA: Quality and Similarity Integration for AI Generated Image Quality Assessment

Some of these methods achieve better results on some datasets and metrics.

Other Strengths and Weaknesses

This work explores the application of the Mamba structure in IQA tasks. However, overall, its effectiveness on some datasets has not been fully demonstrated, and I believe this work still requires further refinement.

Other Comments or Suggestions

Please refer to the strengths and weaknesses.

Author Response

Q1: Some clarification about the suboptimal performance on some small datasets, i.e., LIVE, CSIQ, LIVEFB.

A1: Thanks for your great suggestions. We will clarify the reasons for the suboptimal performance of our method on some classical datasets, e.g., LIVE, CSIQ, and LIVEFB, under similar parameters/FLOPs.

(i) First, we would argue that it is very challenging for any existing IQA model to achieve the best performance on all IQA datasets, because the datasets differ in distribution and size and therefore stress different capabilities of an IQA model, e.g., local/global representation modeling and distortion perception. Nevertheless, our Mamba-based model achieves optimal performance on 5 of the 8 datasets, especially the three challenging real-world datasets and the two large-scale IQA datasets, showing a clear advantage over other methods, which only achieve optimal performance on 1 or 2 classic datasets.

(ii) Secondly, our proposed LQMamba backbone is a fundamental framework, and as a backbone design it is complementary to, rather than in competition with, other strategies. Notably, existing works that achieve strong performance on traditional small datasets, e.g., LIVE and CSIQ, introduce additional strategies on top of their backbones to improve data-efficient training. For example, LoDa introduces a hybrid CNN-Transformer architecture to extract hierarchical degradation representations, thereby achieving strong performance on LIVE, and DEIQT introduces multiple attention panels to extract different quality perspectives, which enhances the quality assessment of its transformer-based architecture. These strategies could in principle also be applied to our Mamba-based architecture, further enhancing its capability on some traditional datasets.

(iii) Moreover, the real-world and large-scale IQA datasets are more aligned with practical applications and better demonstrate the potential and generalization of the designed models in the real world. As shown in Table 1, our Mamba-based method achieves the best or second-best performance on all real-world and large-scale IQA datasets, which demonstrates the effectiveness of our Q-Mamba compared with existing works.

(iv) In terms of technical contribution, beyond the first Mamba-based IQA backbone, we also propose StylePrompt, a lightweight tuning paradigm that enables effective cross-domain transfer using only 4% of the total parameters while achieving near full fine-tuning performance.

We will further improve the generalization capability of Q-Mamba and explore its potential on classic IQA datasets in future work.

Q2: Suggestion about Comparisons with Recent Methods.

A2: We thank the reviewer for the helpful suggestions. We have carefully reviewed and compared our method with the suggested recent works, including TOPIQ, RichIQA, AHIQ, and SF-IQA.

Method  | LIVE  | CSIQ  | TID2013 | CLIVE | KonIQ-10k | SPAQ
TOPIQ   | 0.984 | 0.980 | 0.958   | 0.884 | 0.939     | 0.924
RichIQA | -     | -     | -       | 0.912 | 0.950     | 0.923
Ours    | 0.962 | 0.940 | 0.965   | 0.913 | 0.947     | 0.934

Although TOPIQ achieves better results than QMamba on two small-scale datasets (LIVE and CSIQ), we outperform it on more complex and diverse datasets such as TID2013, CLIVE, KonIQ-10k, and SPAQ, which are more representative of real-world IQA challenges. We believe this reflects the stronger generalization and robustness of our model.

Additionally, AHIQ and SF-IQA are tailored for specific competition settings and only report results on a few small datasets (1–3), often under special constraints. In contrast, we evaluate our method on 10 datasets, covering a broad range of synthetic, authentic, and AIGC-related distortions.

Hence, while we appreciate the value of these recent works, we believe that our comprehensive, consistent, and large-scale evaluation across diverse scenarios offers a more complete and robust comparison. Our results suggest that QMamba is highly competitive and practically effective across both standard and challenging IQA tasks.

Review
Rating: 4

This paper introduces QMamba and LQMamba, a new network architecture based on Mamba for image quality assessment. QMamba operates through a global scanning approach, while LQMamba operates through a local scanning approach. In addition, a style prompt injector is proposed to adjust the mean and variance of features, which enables easy adaptation to downstream IQA tasks. Both QMamba and LQMamba achieve SOTA performance in the experiments.

Questions for Authors

.

Claims and Evidence

The style prompt injection is a simple yet effective idea, and its effectiveness is demonstrated in Table 3.

I'm uncertain whether LQMamba is a brand-new architecture because it is similar to LocalMamba. What is the key difference between LQMamba and LocalMamba?

Are there any advantages or characteristics of LQMamba compared to QMamba? These aspects are not shown or discussed in the paper.

Methods and Evaluation Criteria

The proposed method is evaluated on ten popular IQA datasets, which is a sufficient amount.

Theoretical Claims

The style prompt tuning paradigm, which is a kind of intrinsic style manipulation, seems reasonable because similar types of manipulation have long been used, for example in StyleGAN.

Experimental Design and Analysis

The achievement of SOTA performance and the t-SNE results are reasonably good. In addition, the proposed method shows significant improvement in the cross-validation tests according to Table 3.

According to Table 1 and Table 2, the difference between QMamba and LQMamba seems negligible even though their scanning approaches are very different, and this phenomenon is not discussed in detail.

Supplementary Material

Additional experimental results are shown in the appendix.

Relation to Prior Work

By demonstrating that Vision Mamba-based models contribute to improved IQA performance, particularly in cross-validation, this work may facilitate the development of new IQA model architectures.

Missing Important References

Missing IQA methods.

  • Re-iqa: Unsupervised learning for image quality assessment in the wild
  • Quality-aware pre-trained models for blind image quality assessment
  • Blind image quality assessment via vision-language correspondence: A multitask learning perspective

Other Strengths and Weaknesses

.

Other Comments or Suggestions

.

Author Response

Q1: Key difference between LQMamba and LocalMamba.

A1: Thank you for pointing this out. We want to clarify that while LQMamba is inspired by LocalMamba, it differs in design motivation and technical implementation, specifically with a hierarchical structure for image quality assessment (IQA).

LocalMamba uses adaptive scan selection with multiple window configurations via attention-based routing. While suitable for high-level classification tasks, this introduces unstable inference behavior and higher computational cost, especially in IQA, where local distortions dominate.

In contrast, LQMamba adopts a fully hierarchical design for both architecture and scanning strategy. Each layer processes visual tokens within a fixed-size local window, with the window size progressively varying with network depth. This enables the model to:

  • Capture multi-scale perceptual cues, from fine-grained distortions in early layers to broader context in deeper layers;
  • Preserve stability by avoiding dynamic path selection;
  • Reduce computational overhead while achieving strong generalization.

This design reflects our task-specific and universal motivation: IQA requires consistent perception across varying distortion types and scales, and hierarchical processing is ideal.
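
To make the hierarchical fixed-window scanning concrete, here is a minimal sketch, assuming tokens laid out on a square grid and window sizes that evenly divide the grid; the function name and the 14x14 grid are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a fixed-window local scan order (illustrative only;
# assumes the window size evenly divides the token grid).
import numpy as np

def local_scan_order(height: int, width: int, win: int) -> np.ndarray:
    """Return token indices so that each win x win window is scanned
    contiguously (row-major inside a window; windows in row-major order)."""
    idx = np.arange(height * width).reshape(height, width)
    blocks = [idx[y:y + win, x:x + win].reshape(-1)
              for y in range(0, height, win)
              for x in range(0, width, win)]
    return np.concatenate(blocks)

# Smaller windows in shallow layers focus on fine-grained local distortions;
# a full-size window in deeper layers degenerates to a global scan.
order_shallow = local_scan_order(14, 14, 7)
order_deep = local_scan_order(14, 14, 14)
```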

To validate the effectiveness of our structure, we compared LQMamba-T and LocalMamba-T across four IQA benchmarks:

Dataset | LocalMamba-T  | LQMamba-T
LIVEC   | 0.843 / 0.791 | 0.903 / 0.863
KADID   | 0.861 / 0.870 | 0.938 / 0.923
KonIQ   | 0.900 / 0.890 | 0.943 / 0.928
SPAQ    | 0.882 / 0.881 | 0.933 / 0.927

Table 1. Performance comparison between LocalMamba-T and LQMamba-T on four IQA benchmarks.

These results show that LQMamba consistently outperforms LocalMamba, especially on authentic and distortion-diverse datasets. The superior performance confirms that hierarchical fixed-window scanning stabilizes the process and captures IQA-relevant structures more effectively.

We will add this clarification and an architectural illustration in the revised version to highlight the hierarchical nature of our design.

Q2: Clarifying the Performance Difference Between QMamba and LQMamba.

A2: We sincerely appreciate the reviewer’s valuable observation. While the average performance gap between QMamba and LQMamba appears marginal in Table 1 and Table 2, a closer examination reveals more nuanced insights.

In fact, LQMamba outperforms QMamba on most individual datasets. The seemingly negligible overall improvement stems primarily from relatively inferior performance on small and simple datasets such as LIVE and CSIQ, which lowers the averaged metrics. These datasets contain fewer distortion types (e.g., 5 in LIVE vs. 25 in KADID-10k) and tend to feature less challenging scenarios where local distortion-sensitive modeling (as introduced by LQMamba) cannot fully demonstrate its advantage.

However, in more complex datasets like TID2013 and KADID-10k — which include a broader range of fine-grained distortions — LQMamba consistently shows stronger perceptual performance. For example:

  • TID2013: QMamba-B (0.949), LQMamba-B (0.964)
  • KADID: QMamba-B (0.932), LQMamba-B (0.941)

This suggests that LQMamba’s local scanning scheme is especially beneficial in challenging real-world conditions, where local artifacts are more critical and nuanced. Hence, we believe LQMamba is an optional and complementary alternative to QMamba, particularly suitable for scenarios requiring finer local distortion modeling.

We will clarify this phenomenon with detailed dataset-level breakdowns and further analysis in the revised version to avoid potential misunderstandings and better highlight the advantage of the proposed local scan mechanism.

Q3: About the Comparison with Suggested Methods

A3: We sincerely appreciate the reviewer’s suggestion of several creative and inspiring methods. To provide a fair comparison, we refer to the PLCC results reported in their original papers. As shown in the table below, our proposed QMamba consistently achieves leading performance on most datasets, demonstrating its strong generalization ability across various distortion types and data distributions.

Method | TID2013 | KADID | CLIVE | KonIQ | SPAQ
Re-IQA | 0.880   | 0.892 | 0.854 | -     | 0.925
QPT    | -       | -     | 0.914 | 0.941 | 0.927
LIQE   | -       | 0.931 | 0.910 | 0.908 | -
Ours   | 0.965   | 0.943 | 0.913 | 0.947 | 0.934

Table 2. PLCC comparison of different methods across multiple datasets.

We believe these results demonstrate the strong performance and versatility of our approach across various datasets.

Reviewer Comment

The authors have addressed my questions well, so I will increase my rating to accept. I hope that the final revision will include the explanations from this rebuttal, if accepted.

Author Comment

We sincerely appreciate your positive feedback and the increased rating. The explanations provided in the rebuttal will be carefully integrated into the final version of the paper.

Review
Rating: 4

In this paper, an algorithm named QMamba is proposed for NR-IQA. QMamba is based on Mamba, but it employs a style prompt tuning method to boost performance with a small number of learnable parameters. Specifically, style prompt tuning consists of two steps: SPG and SPI. SPG generates the style prompt from input features using GAP and a 1x1 convolution. Then, SPI predicts affine parameters from the generated style prompt to adjust the input features. Experimental results on various IQA benchmarks show that the proposed algorithm achieves better performance than existing methods.
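
A minimal PyTorch sketch of this two-step mechanism, reconstructed only from the description above (GAP and a 1x1 convolution to generate the style prompt, then predicted affine parameters applied to the features), is shown below; the module and attribute names, the residual-style modulation, and all shapes are assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of SPG (style prompt generation) and SPI (style prompt
# injection); not the authors' code.
import torch
import torch.nn as nn

class StylePromptTuning(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                           # SPG: global average pooling
        self.spg = nn.Conv2d(channels, channels, kernel_size=1)      # SPG: 1x1 conv -> style prompt
        self.spi = nn.Conv2d(channels, 2 * channels, kernel_size=1)  # SPI: predict affine parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prompt = self.spg(self.gap(x))                  # (B, C, 1, 1) style prompt
        gamma, beta = self.spi(prompt).chunk(2, dim=1)  # per-channel scale and shift
        return x * (1 + gamma) + beta                   # adjust feature statistics (mean/variance)

feats = torch.randn(2, 96, 56, 56)
out = StylePromptTuning(96)(feats)  # same shape as the input features
```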

Questions for Authors

Overall, I think the proposed algorithm has meaningful results and sufficient technical contribution. I only have a few concerns, as below:

  • The proposed QMamba achieves better performance than conventional algorithms overall. However, it shows relatively low performance on the LIVE and CSIQ datasets. It would be helpful to have an explanation of the reasons behind these results.

  • Also, for relatively small-sized datasets such as LIVE and CSIQ, QMamba tends to show lower performance as the model size increases. Is this because of over-fitting?

  • The efficiency of the model is a key contribution of the proposed QMamba. Therefore, it would be good to have an inference speed comparison as well.

Claims and Evidence

Yes, the claims are supported by experimental results including extensive ablation studies.

Methods and Evaluation Criteria

Yes, the proposed algorithm is technically sound and the evaluation process seems fair.

Theoretical Claims

This paper does not propose any theoretical claim.

Experimental Design and Analysis

Yes, this paper follows the standard evaluation protocol in this field.

Supplementary Material

Yes, I reviewed the supplementary material as well.

Relation to Prior Work

Recently, state space models have been applied to various deep learning tasks such as image classification, video understanding, image segmentation, and point cloud analysis. However, in IQA, the SSM approach has been under-researched. This paper applies the SSM approach to IQA tasks and proposes a simple but effective algorithm.

Missing Important References

Some recent papers are not addressed and compared. It would be better to compare with these algorithms as well.

  • [1] Learning generalizable perceptual representations for data-efficient no-reference image quality assessment. WACV24
  • [2] Blind image quality assessment based on geometric order learning. CVPR24

Other Strengths and Weaknesses

Please find weaknesses in questions for authors section.

Other Comments or Suggestions

N/A

Ethics Review Issues

N/A

Author Response

Q1: About the reason for relatively lower performance on LIVE and CSIQ Datasets

A1: Thanks for your positive and constructive comments. We will provide a more thorough explanation for this result in the revision from two perspectives:

(i) Limited dataset scale and diversity in LIVE and CSIQ. LIVE and CSIQ are early synthetic IQA benchmarks with limited image counts (799 and 866, respectively) and fewer distortion types. This leads to constrained data diversity compared with other datasets such as TID2013 and KADID-10k, which include over 3,000 and 10,000 images with 24–25 distortion types, providing a broader distortion spectrum.

(ii) Mismatch between model size and dataset scale. Notably, model size usually needs to scale consistently with the dataset: the performance of large-scale models on IQA tasks often relies on training with diverse and large-scale datasets to fully activate their capabilities. As shown in Table 1 of our manuscript, lightweight IQA models (e.g., DBCNN) tend to perform well on smaller datasets such as CSIQ, but their effectiveness drops significantly on real-world datasets like LIVEC and KonIQ, as well as large-scale synthetic datasets such as TID2013 and KADID. In contrast, recent large-scale models, e.g., ResNet152, Swin-B, ViT-B, and our Q-Mamba, demonstrate superior performance on more complex real-world and large-scale synthetic datasets while still maintaining acceptable results on smaller ones. However, these models achieve relatively lower performance on LIVE and CSIQ due to insufficient dataset diversity and overfitting risks.

Q2: About the inconsistency between model size and performance on relatively small datasets, i.e., LIVE and CSIQ.

A2: As stated in response A1, the mismatch between model size and dataset scale can hinder the ability of an IQA model to fully demonstrate its potential, thereby preventing it from achieving optimal performance. On small datasets, the observed inconsistency between model size and performance can be attributed to two main factors:

(i) The overfitting risk, where the large model tends to memorize the limited patterns in small IQA datasets, resulting in poor generalization capability to unseen testing data.

(ii) Dataset bias. It is well known that small IQA datasets often contain significant subjective bias in human-provided scores. This bias is especially impactful for large models, leading to unstable training and inconsistent performance.

Q3: About inference speed comparison.

A3: We sincerely thank you for highlighting this important point. As model efficiency is a central goal of QMamba’s design, we conducted a comprehensive inference speed comparison to better support our claims. We randomly sampled a total of 20,000 images across multiple IQA datasets, including synthetic and authentic distortion types, and evaluated the inference latency of three representative models of similar scale: QMamba-Tiny, ViT-Small, and Swin-Tiny. The results are summarized below:

Model    | Params / GFLOPs | Total Time (s) | Time / Image (s)
ViT-S    | 21.67M / 4.61G  | 226.42         | 0.0113
Swin-T   | 27.52M / 4.51G  | 363.47         | 0.0182
QMamba-T | 27.99M / 4.47G  | 211.95         | 0.0106

Table 1: Comparison of inference efficiency among QMamba-Tiny and popular backbones.

As shown, QMamba-Tiny achieves the lowest average inference time per image, while maintaining comparable model size and computational complexity. This confirms the practical efficiency of our method in real deployment scenarios and complements the theoretical analysis in the main paper. We appreciate the reviewer’s suggestion and will incorporate these results into the final version to more fully demonstrate QMamba’s efficiency advantage.
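
For readers who want to reproduce this kind of measurement, a minimal per-image latency sketch is given below; it assumes a generic PyTorch model and a list of preprocessed image tensors on a CUDA device, and it is not the authors' benchmarking script.

```python
# Assumed latency-measurement sketch (not the authors' script).
import time
import torch

@torch.no_grad()
def seconds_per_image(model: torch.nn.Module, images, device: str = "cuda") -> float:
    model = model.eval().to(device)
    for img in images[:10]:                  # warm-up so CUDA initialization is excluded
        model(img.unsqueeze(0).to(device))
    torch.cuda.synchronize()
    start = time.time()
    for img in images:
        model(img.unsqueeze(0).to(device))
    torch.cuda.synchronize()                 # wait for all kernels before stopping the clock
    return (time.time() - start) / len(images)
```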

Q4: About the comparison of some suggested methods

A4: Thank you for your suggestions. These are very creative methods, and we will incorporate them into the comparison in the final version. Since they only report results on a few datasets in their papers, we briefly present part of the comparison results here.

Method | CLIVE (PLCC / SRCC) | KonIQ (PLCC / SRCC) | SPAQ (PLCC / SRCC)
GRepQ  | - / 0.822           | - / 0.855           | - / -
QCN    | 0.893 / 0.875       | 0.945 / 0.934       | 0.928 / 0.923
Ours   | 0.913 / 0.888       | 0.947 / 0.933       | 0.934 / 0.929
Reviewer Comment

The authors have addressed my concerns well. Thank you for the detailed response. I will raise my score to 4. I hope the points discussed in the rebuttal will also be reflected in the final paper.

Author Comment

We really appreciate your kind response and the increased score. We're glad our answers were helpful, and we’ll make sure the key rebuttal points are reflected in the final paper.

Review
Rating: 4

This paper proposes a no-reference image quality measure; specifically, it is the first work to explore Vision Mamba for blind IQA. Experimental results on task-specific, universal, and transferable IQA tasks demonstrate the advantages of the proposed method. The whole work is interesting and may be useful for follow-up studies.

Questions for Authors

See the above comments.

Claims and Evidence

The claims are well supported by the experimental validations.

Methods and Evaluation Criteria

The paper follows the common evaluation procedures for IQA methods as frequently used in this area.

Theoretical Claims

Not involved.

Experimental Design and Analysis

The experimental validation is well-conducted, which follows the common procedures.

Supplementary Material

Yes the supplementary material is fine.

Relation to Prior Work

This Mamba-based image quality evaluation model has some potential impact on other fields. For example, it can be used as a reward model when improving the perceptual quality of image processing systems.

Missing Important References

There are no important references that are not discussed.

Other Strengths and Weaknesses

Some comments especially regarding to the weaknesses are as follows:

  1. The authors introduce a new Mamba-based framework and several models built on it (with different backbones). Models with different backbones show advantages on different databases. It would be better if an all-in-one model could work well on all databases.
  2. More experimental validation should be provided, and more state-of-the-art methods should be compared, for example the traditional hand-crafted method BMPRI, the more recent CvT-based method RichIQA, and the latest LMM-based method MINT-IQA.
  3. The authors may discuss whether the introduced methodology can be generalized to video quality assessment and even audio-visual quality assessment.
  4. Some surveys on image and video quality assessment should be cited to better cover the related topics.

Other Comments or Suggestions

More intuitive visualizations should be provided, especially in the experimental validation part; only 3 figures are given in the paper.

Author Response

Q1: About the suggestion to develop an all-in-one model that performs well on all databases.

A1: Thanks for your great questions and valuable suggestions. We have conducted a thorough analysis of why the differences occur and propose below how an all-in-one model could be designed.

The differences stem from:

  1. The noise and small size of some datasets, i.e., LIVE and CSIQ. The larger LQMamba variants are susceptible to overfitting on such small datasets, which causes performance similar to or slightly lower than that of the smaller LQMamba-T.
  2. The mismatch between backbone size (e.g., the -S, -T, and -L variants) and dataset scale prevents the IQA representation capability from being fully exploited, which causes the inconsistency between model size and performance on LIVE and CSIQ.

Based on the above analysis, we believe the essential question for an all-in-one model that performs best on all datasets is how to increase the dynamic capability of Q-Mamba across different datasets. Based on a careful survey, we believe the following strategies can lead to an all-in-one Q-Mamba, which we will investigate in future work.

(i) Dataset-Aware Prompt Tuning. We implemented StylePrompt for lightweight domain adaptation via feature modulation. We plan to extend this with dataset-specific prompts to activate different perception pathways based on the input domain, allowing dynamic adaptation without modifying backbone weights.

(ii) Multi-Domain Joint Training. We are extending our IQA experiments (Tables 2 and 6) with a multi-domain training protocol incorporating domain generalization losses (e.g., feature alignment or contrastive losses) to reduce domain gaps.

(iii) Preliminary Unified Model Experiments. In the final version, we plan to add experiments with a single QMamba variant enhanced with dataset prompts, trained jointly on all datasets. Early results (Tables 2 and 6) across six domains have shown the feasibility for universal deployment.

Q2: About the suggestion to compare with more representative and recent IQA methods.

A2: We thank the reviewer for the constructive suggestion. We fully agree that incorporating comparisons with both classical and recent SOTA methods such as BMPRI, RichIQA, and MINT-IQA can provide a more comprehensive evaluation.

To address this, we have collected partial results from these methods on several popular IQA datasets and compared them with our QMamba framework. As shown in the table below, QMamba achieves highly competitive PLCC scores, outperforming or matching recent strong baselines on many benchmarks.

Method   | TID2013 | CLIVE | KonIQ | SPAQ
BMPRI    | 0.608   | 0.392 | 0.424 | 0.611
MINT-IQA | 0.899   | 0.925 | 0.945 | 0.932
RichIQA  | -       | 0.912 | 0.950 | 0.923
Ours     | 0.965   | 0.913 | 0.947 | 0.934

Table 1: PLCC comparison with classical and recent IQA methods

In the final version, we will further extend the comparisons to include more datasets if available, ensuring a thorough and fair benchmarking.

Q3: Some discussions on the generalization to video and audio-visual quality assessment.

A3: We appreciate the reviewer’s forward-looking suggestion. Indeed, exploring the extension of our proposed architecture to video and audio-visual quality assessment is a promising direction, and it is part of our planned future work.

To provide an initial insight, we conducted preliminary experiments by adapting QMamba to the video domain. As shown in the table below, our QMamba (Video) achieves performance comparable to FastVQA while consuming fewer GFLOPs:

Model                   | GFLOPs | PLCC  | SRCC
FastVQA (27.70M)        | 279.1G | 0.876 | 0.877
LQMamba (Video, 27.99M) | 239.3G | 0.879 | 0.876

Table 2: Preliminary results for video quality assessment.

Video quality assessment (VQA) typically demands much higher computational resources due to temporal modeling. Our efficient SSM-based architecture, originally designed for image quality perception, offers a solid foundation to balance performance and computational cost in VQA tasks.

Moreover, since audio signals inherently possess sequential structures, we believe the state space modeling capability of our architecture is well-suited for audio or audio-visual quality assessment. We envision that our work can serve as a strong baseline for future research on applying selective state space models in both video and audio domains.

Q4: About the suggestion to include more surveys on image and video quality assessment.

A4: We appreciate the reviewer’s suggestion. In the final version, we will include a more comprehensive survey of image and video quality assessment literature and explore recent works that can be meaningfully integrated with our proposed framework.

Reviewer Comment

The authors have addressed my concerns well. The updated content should be included in the final paper if accepted. I have increased my overall rating.

Author Comment

We sincerely thank you for your constructive comments and the improved rating. We will carefully integrate the updated content from the rebuttal into the final version of the paper.

Final Decision

This paper proposes QMamba, the first exploration of the Mamba state space model for no-reference image quality assessment (NR-IQA). With a lightweight StylePrompt tuning strategy, QMamba achieves strong performance and transferability across various IQA tasks, outperforming prior methods. Three out of four reviews are positive. The remaining reviewer maintains a negative score, citing that while the authors provided comprehensive comparisons on 10 datasets, the rebuttal results only covered 6 datasets. The reviewer argues that methods like TOPIQ and RichIQA showed competitive or better results in some cases, and without full comparisons it is hard to assess QMamba's overall advantage. Despite this, the AC believes the strengths of the paper outweigh the concerns and recommends acceptance. The authors are expected to address these issues more thoroughly and include the rebuttal content in the camera-ready version.