PaperHub
Rating: 6.4 / 10
Poster · 4 reviewers · Scores: 5, 2, 4, 5 (min 2, max 5, std 1.2)
Confidence 3.8 · Novelty 3.3 · Quality 3.0 · Clarity 2.8 · Significance 3.0
NeurIPS 2025

4KAgent: Agentic Any Image to 4K Super-Resolution

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We present the first agent system for super-resolution that is capable of upscaling any image of arbitrary degradation to high-quality 4K resolution.

Abstract

We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at $256\times 256$, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We release all the code, models, and results at: https://4kagent.github.io.
Keywords
Super-Resolution · Agent · Image Restoration · Low-level Vision

Reviews and Discussion

Review (Rating: 5)

This work proposes an agentic super-resolution system to upscale any image to 4K resolution. The framework contains a Perception Agent, which uses a deployed VLM to produce a restoration plan, and a Restoration Agent, which executes the plan step by step with various restoration models on standby. Face restoration is specially optimized via an embedded detection and restoration sub-system. DIV4K-50 is built as a challenging benchmark for 4K SR. Extensive experiments are conducted across various domains, including satellite images, scientific images, AI-generated images, and so on, validating the effectiveness of the method.

Strengths and Weaknesses

Strength:

  1. The authors build a systematic agent framework for 4KSR and validate its effectiveness across various domains;
  2. The overall pipeline is clear, with an embedded plan-execution-evaluation-reflection loop, and the face restoration is specially optimized for practical applications;
  3. The authors expose some configurations to enable users to build their own customized 4KSR agent;

Weaknesses:

  1. It would be better to provide a fast/slow option to balance performance and processing time, as the current execution step, which runs all models in the toolbox for evaluation, may be time-consuming and could be further optimized;
  2. It is unclear whether pasting the restored face back could introduce unnatural artifacts along the edges; the integration of the face region with the overall image does not seem to be considered;
  3. Evaluation is an important step in the agentic system because it directs the system's preferences; therefore, the role of the evaluation metrics should be further investigated;
  4. It would be better to provide the restoration plans and agendas as exhibits;

Questions

  1. I think the proposed pipeline could be a potential framework for investigating traditional restoration models by observing their winning rates in the MoE policy when faced with different images, which may identify which existing restoration models are suited to different domains.

Limitations

The authors have provided the limitations of their work.

Final Justification

All of my issues have been solved, and I will maintain my score.

Formatting Issues

No

Author Response

We sincerely thank Reviewer eAhT for highlighting the strengths of our work. We are glad that you found our agentic framework for 4K super-resolution systematic, well-structured, and practically relevant. Your recognition of our clear plan–execution–evaluation–reflection pipeline, the dedicated face restoration subsystem, and the flexibility we provide for user customization is greatly appreciated.

W1: Provide the fast/slow option to balance the performance and process time

A1: TL;DR: We develop a Fast4K mode for 4KAgent to balance the performance and process time.

Thank you for pointing this out. We developed a Fast4K mode based on 4KAgent. Specifically, when the image resolution exceeds a user-set threshold s (longer side > s), 4KAgent disables long-running methods in the toolbox to reduce processing time. Since s is user-configurable, users can flexibly adjust it to balance performance and processing time.
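To make the gating concrete, here is a minimal sketch of what such a resolution-threshold filter could look like; the `ToolSpec` structure, latency field, and cutoff value are our illustrative assumptions, not the released 4KAgent API:

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    avg_latency_s: float  # average per-call runtime; values are illustrative

# Hypothetical toolbox entries (names follow tools mentioned in the paper).
TOOLBOX = [
    ToolSpec("DiffBIR", 60.0),
    ToolSpec("OSEDiff", 7.0),
    ToolSpec("PiSA-SR", 7.0),
]

def active_tools(width: int, height: int, s: int = 1024,
                 slow_cutoff_s: float = 30.0) -> list:
    """Fast4K-style gating: once the longer side exceeds the user-set
    threshold s, drop long-running tools from the candidate set."""
    if max(width, height) > s:
        return [t for t in TOOLBOX if t.avg_latency_s <= slow_cutoff_s]
    return list(TOOLBOX)

# A 2048x2048 intermediate image with s=1024 skips the slow diffusion tool.
print([t.name for t in active_tools(2048, 2048, s=1024)])
```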

Based on the Fast4K mode, we conduct an experiment on DIV4K-50 by setting different resolution thresholds s (s = 1024 / 4096) in 4KAgent and evaluating the running time and performance. Experiment results are shown below (running time is measured on an RTX 4090 GPU):

| Method | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ | Avg. Running Time (s) |
| --- | --- | --- | --- | --- | --- |
| DiffBIR [1] (16x) | 2.65 | 0.7078 | 38.59 | 0.5858 | 976.2 |
| OSEDiff [2] (16x) | 8.37 | 0.5680 | 25.07 | 0.4210 | 106.6 |
| PiSA-SR [3] (16x) | 9.30 | 0.5549 | 24.51 | 0.3861 | 107.8 |
| 4KAgent (s=1024) | 3.51 | 0.7310 | 44.51 | 0.5689 | 582.7 |
| 4KAgent (s=4096) | 3.15 | 0.7585 | 44.16 | 0.5928 | 1551.8 |

It shows that by simply setting s to 1024, the average running time of 4KAgent decreases from 1551.8 s to 582.7 s (a 62.3% reduction) with minimal effect on performance. Under this setting, 4KAgent is even faster than the model-based approach DiffBIR.

W2: Unnatural edge effects in the face restoration pipeline

A2: TL;DR: The face restoration pipeline introduces boundary artifacts at a low frequency of occurrence, and its benefits in 4KAgent outweigh the impact of the boundary artifacts it introduces.

Thank you for pointing this out. When designing the face restoration pipeline in 4KAgent, we follow the ‘whole image restoration’ procedure from previous methods (e.g., GFPGAN [4], CodeFormer [5], DifFace [6]). Specifically, we use the toolbox in GFPGAN to extract faces in the image and paste them back after restoration. Therefore, it may introduce boundary artifacts at the edge. We conduct an experiment for this.

Specifically, we select 100 LQ images in the WebPhoto-Test dataset as test images. Then, we input these images to GFPGAN, CodeFormer, and DifFace for face restoration under the ‘whole image restoration’ procedure. We also collect result images from 4KAgent. After obtaining result images, we conduct a user study to explore the boundary artifacts. We define the boundary artifacts as: Jagged edges or color halos along the face boundary. We ask users to evaluate the result images and count the occurrence of the boundary artifacts. Then we calculate the average occurrence of the boundary artifacts. Experiment results are shown below:

| Method | GFPGAN | CodeFormer | DifFace | 4KAgent |
| --- | --- | --- | --- | --- |
| Avg. boundary artifacts | 9.75 / 100 | 8.75 / 100 | 25.75 / 100 | 7.75 / 100 |

It shows that both SOTA face restoration methods and 4KAgent introduce boundary artifacts when following the 'whole image restoration' procedure. However, the frequency of occurrence is generally low (below 10% for all methods except DifFace), and 4KAgent achieves the lowest frequency thanks to its system design.

We also conduct an ablation study of the face restoration pipeline. Specifically, we compare the result images from 4KAgent with different profiles (ExpSR-s4-P, ExpSRFR-s4-P) on the WebPhoto-Test dataset. The key difference between profiles and corresponding experiment results is shown below:

| Method | Restore Option | Face Restore | NIQE ↓ | CLIB-IQA ↑ | DSL-FIQA ↑ |
| --- | --- | --- | --- | --- | --- |
| 4KAgent (ExpSR-s4-P) | super-resolution | False | 5.11 | 0.6415 | 0.7194 |
| 4KAgent (ExpSRFR-s4-P) | super-resolution | True | 4.53 | 0.6602 | 0.7237 |

where ‘Face Restore’ controls the activation of the face restoration pipeline. It shows that when the face restoration pipeline is activated, both general image perceptual quality and face-related IQA metrics improve. We also provide a visual comparison in Figure 8 in the Appendix, where the enhanced facial and hair details in the 4KAgent output with the face restoration pipeline activated further demonstrate its benefit.

Therefore, following the established procedure, the face restoration pipeline in 4KAgent introduces boundary artifacts at a low frequency of occurrence, while enhancing both overall image quality and face quality. We believe the benefits of the face restoration pipeline in 4KAgent outweigh the impact of the boundary artifacts it introduces. We will further refine the face restoration pipeline to reduce these artifacts.

W3: The role of the evaluation metrics should be further investigated

A3: Thank you for pointing this out. When designing the quality score used for evaluation in the 4KAgent system, we combine commonly used no-reference IQA metrics with HPSv2 scores (for human preference). Experimental results show that 4KAgent achieves SOTA results, especially in terms of perceptual quality. In the future, we will further study the role of evaluation metrics, for example by using block processing at ultra-high resolution to obtain more accurate evaluation results and by incorporating more advanced aesthetic scores.

W4: Provide examples of restoration plans and agendas for exhibition

A4: Thank you for pointing this out. We show some examples here.

For the first image in Figure 1 with the tag ‘4K Upscaling’, the restoration plan and sequence are: Denoising@SwinIR(σ=50) → Super-resolution@DiffBIR → Super-resolution@SwinIR (Real-ISR).

For the lower-left image in Figure 1 with the tag ‘Human Faces’, the restoration plan and sequence are: Super-resolution@OSEDiff → Denoising@Restormer → JPEG Compression Artifact Removal@FBCNN (BQF) → Defocus Deblurring@DiffPlugin → Super-resolution@DiffBIR. BQF in FBCNN indicates 'Blind to Quality Factor'.

For the left image in Figure 4, the restoration plan and sequence are: Super-resolution@PiSA-SR → Dehazing@RIDCP.

For the right image in Figure 4, the restoration plan and sequence are: Deraining@Restormer → Denoising@SwinIR(σ=15) → Super-resolution@OSEDiff.

We will show more examples of image restoration plans and the intermediate images produced by each step in the Appendix of the next version of the paper.
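For readers following along, a restoration plan like the ones above can be viewed as an ordered list of (task, tool) pairs that the Restoration Agent executes sequentially; the sketch below is our illustrative rendering, with a hypothetical `tools` registry of callables, not the system's actual code:

```python
# Illustrative plan for the '4K Upscaling' example above.
plan = [
    ("denoising", "SwinIR(sigma=50)"),
    ("super-resolution", "DiffBIR"),
    ("super-resolution", "SwinIR (Real-ISR)"),
]

def execute_plan(image, plan, tools):
    """Run each restoration step in order; each expert's output becomes
    the input to the next step (tools: dict mapping name -> callable)."""
    for task, tool_name in plan:
        image = tools[tool_name](image, task=task)
    return image
```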

Q1: Observing the winning rate in the MoE policy to identify restoration models suited to different domains

A5: Thank you for pointing this out. As we conduct experiments on images from different domains, we calculate the winning rate of SR methods from these experiments. The result is shown below:

| Task | Benchmark (# of images) | Leading SR Tools (# of selections) |
| --- | --- | --- |
| Real-World Image SR (x4) | RealSR (100) | DiffBIR (55), PiSA-SR (22), OSEDiff (16) |
| AIGC SR (x4) | GenAIBench (100) | DiffBIR (61), OSEDiff (25), PiSA-SR (14) |
| Remote Sensing SR (x4) | DOTA (183) | HMA (79), HAT-L (54), PiSA-SR (15) |
| Fluorescence Microscopic Image SR (x4) | SR-CACO-2 (300) | DRCT (238), HMA (23), HAT-L (17) |
| Pathology Image SR (x4) | bcSR (200) | X-Restormer (84), HAT-L (64), HMA (30) |
| Medical Image SR (x4) | Chest X-ray 2017 (624) | HMA (312), DRCT (171), X-Restormer (73) |

It shows that for images from different domains, different SR methods win under the MoE policy. We believe that analyzing these winning rates can provide insights into model selection for SR tasks across different image domains.
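Computing such winning rates amounts to counting which expert the Q-MoE policy selects per image; a minimal sketch assuming a hypothetical per-image selection log:

```python
from collections import Counter

# Hypothetical log: the expert the Q-MoE policy picked for each image.
selections = ["DiffBIR", "PiSA-SR", "DiffBIR", "OSEDiff", "DiffBIR"]

win_counts = Counter(selections)
total = sum(win_counts.values())
for tool, wins in win_counts.most_common(3):
    print(f"{tool}: {wins} wins ({wins / total:.0%})")
```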

Reference

[1] Lin, X., He, J., Chen, Z., Lyu, Z., Dai, B., Yu, F., ... & Dong, C. (2024). Diffbir: Toward blind image restoration with generative diffusion prior. ECCV.

[2] Wu, R., Sun, L., Ma, Z., & Zhang, L. (2024). One-step effective diffusion network for real-world image super-resolution. NeurIPS.

[3] Sun, L., Wu, R., Ma, Z., Liu, S., Yi, Q., & Zhang, L. (2025). Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. CVPR.

[4] Wang, X., Li, Y., Zhang, H., & Shan, Y. (2021). Towards real-world blind face restoration with generative facial prior. CVPR.

[5] Zhou, S., Chan, K., Li, C., & Loy, C. C. (2022). Towards robust blind face restoration with codebook lookup transformer. NeurIPS.

[6] Yue, Z., & Loy, C. C. (2024). Difface: Blind face restoration with diffused error contraction. TPAMI.

Comment

Dear Reviewer eAhT,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Comment

Thanks for the authors' detailed response. Basically, all of my concerns have been addressed, and I believe this work should be accepted.

Comment

Dear Reviewer eAhT,

Thank you for taking the time to read our response and for your careful, constructive review. We are very glad to hear that our rebuttal addressed your concerns and that you recommend acceptance. Your feedback and endorsement are much appreciated.

Sincerely,

Authors

Review (Rating: 2)

In this paper, the authors propose an image enhancement system consisting of a profiling module, a Perception Agent, and a Restoration Agent; the authors argue that the system can universally upscale any image to 4K resolution.

Strengths and Weaknesses

  1. The authors have shown some interesting results in the paper.
  2. The paper is well-written.
  3. The authors have evaluated the proposed method on many datasets.

Questions

  1. Recently, generative SR methods have obtained promising results for super-resolving low-resolution images with complex degradations, so I am not sure whether image super-resolution needs such a complex system. Although some generative SR approaches have been adopted as comparison methods, the comparison is not fair because the authors establish new experimental settings. I suggest the authors follow the experimental settings of recent SOTA generative SR methods and conduct a side-by-side comparison with them; then we can see the advantage of introducing such a complex system.
  2. Can the authors discuss in more detail how natural language descriptions could benefit an SR model in recovering low-level details?
  3. Following question 2) above, I think the authors need to provide ablation studies to analyze the advantages of their design choices. For example, do the good results come from a good restoration plan or from powerful tools? I think the authors should give examples of how different tools work together on specific images; for the real-world image super-resolution task, I suspect most of the tools will not be used.
  4. Can the authors report some information about runtime?
  5. I do not think rain-streak removal should be conducted automatically in an image restoration system, because people do take pictures on rainy days.

Limitations

  1. I think this paper introduces a system with unnecessarily complex modules to deal with a fake problem. Many components in the "toolbox" are unnecessary for the SR task; for example, people usually do not change image brightness or conduct dehazing/deraining when they super-resolve images. Instead of focusing on tasks irrelevant to SR, such as deraining/dehazing, I think the authors should pay more attention to noise and artifacts, which are the major difficulties in SR. The authors could mimic ISP processing and generate low-quality images with special artifacts; for example, process the challenging See-in-the-Dark dataset with a conventional denoiser, which will leave some method noise, then apply gamma correction, color correction, and tone mapping, yielding images with spatially variant structural artifacts. Such images are genuinely "real-world low-quality images", and I am curious whether the proposed method can deal with such spatially variant structural artifacts. If you just generate fake low-quality images with cascaded known degradations (perfect Gaussian noise, defocus blur, etc.), it is not significant that you can handle the image by trying different combinations of cascaded processing pipelines.
  2. The authors established many new experimental settings that have not been considered by previous methods. While interesting new settings help readers understand the novelty of this paper, I think beating the existing methods under standard settings (adopted by previous methods) is also very important; otherwise, I am unable to evaluate the necessity of introducing such a complex system.

Final Justification

As I have listed clearly in my comments, I think this paper just introduces a system with unnecessarily complex modules to deal with a fake problem. Many design choices have not been validated in the paper, and the authors just establish new experimental settings to show the "advantage" of the proposed system.

Formatting Issues

This paper is well-written.

Author Response

We sincerely thank Reviewer wkHw for your positive feedback. We appreciate your recognition of the overall clarity of our writing, the interesting results presented, and the thorough evaluation of our proposed image enhancement system across a wide range of datasets. Your comments are encouraging and affirm the value of our work.

Q1 & L2: Follow the experimental settings of recent SOTA generative SR methods, and conduct a side-by-side comparison

A1: We thank the reviewer for pointing this out. In Section 3.2 of the paper, we adopt the same experimental setup as recent state-of-the-art generative super-resolution (SR) models (e.g., SinSR [1], OSEDiff [2], PiSA-SR [3]) and perform 4x SR on real-world image SR benchmarks: the RealSR and DRealSR datasets. Quantitatively, 4KAgent outperforms SOTA generative SR methods on perceptual metrics (NIQE, CLIPIQA, MUSIQ, and MANIQA). We also provide experimental results on the DRealSR dataset here:

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR [4] | 26.84 | 0.6660 | 0.4446 | 0.2706 | 167.38 | 6.02 | 0.6292 | 60.68 | 0.5902 |
| SinSR [1] | 28.41 | 0.7495 | 0.3741 | 0.2488 | 177.05 | 7.02 | 0.6367 | 55.34 | 0.4898 |
| OSEDiff [2] | 27.92 | 0.7835 | 0.2968 | 0.2165 | 135.29 | 6.49 | 0.6963 | 64.65 | 0.5899 |
| PiSA-SR [3] | 28.31 | 0.7804 | 0.2960 | 0.2169 | 130.61 | 6.20 | 0.6970 | 66.11 | 0.6156 |
| AgenticIR [5] | 23.06 | 0.6145 | 0.4775 | 0.2973 | 182.02 | 6.11 | 0.6542 | 63.59 | 0.5927 |
| 4KAgent | 23.11 | 0.6126 | 0.4579 | 0.2866 | 178.36 | 4.65 | 0.7092 | 69.30 | 0.6219 |

Qualitative results, provided in Section C.2 in the Appendix, further illustrate that 4KAgent produces images with sharper and more realistic details. This demonstrates that, despite the strong performance of existing generative SR approaches on real-world images, 4KAgent surpasses them by decoupling the complex distortions in the image and performing the corresponding restoration tasks, while using the Q-MoE strategy to select the highest-quality image at each step, which yields superior perceptual quality.

For the 16x SR experiments in Section 3.4 and the 4K upscaling experiments in Section 3.5, 4KAgent operates in a training-free manner: we apply it directly to the 16x SR task without additional model training. To ensure a fair comparison, we likewise treat the leading generative SR models as training-free in this setting and evaluate them under both the 4x→4x and the 16x strategies. We present visual comparisons in the main paper, showing that 4KAgent produces more natural and realistic details. Quantitative results in Sections C.5 and C.6 in the Appendix confirm that 4KAgent achieves state-of-the-art performance on the NIQE, MUSIQ, and CLIPIQA metrics.

Q2: How natural language descriptions could benefit SR model for recovering low-level details?

A2: Natural language descriptions are used in the reflection stage to select the image with the best quality (based on its alignment with the image description). They are not used by any single SR model in the 4KAgent system.

Q3: Provide ablation studies to see whether good results come from a good restoration plan or from powerful tools, and analyze tool usage on the real-world image super-resolution task

A3: Thank you for pointing this out. We conduct an ablation study on the DIV4K-50 dataset for 16x super-resolution. Specifically, we fix the plan as 'super-resolution → super-resolution' and apply different SOTA generative SR methods (DiffBIR [4], OSEDiff [2], PiSA-SR [3]) as the tool in each super-resolution step. We use 3 metrics for evaluation: NIQE, MUSIQ, and MANIQA-pipal. Experiment results are shown below:

| Method | NIQE ↓ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- |
| 4KAgent (super-resolution@DiffBIR → super-resolution@DiffBIR) | 3.36 | 37.17 | 0.5916 |
| 4KAgent (super-resolution@OSEDiff → super-resolution@OSEDiff) | 4.88 | 39.88 | 0.5482 |
| 4KAgent (super-resolution@PiSA-SR → super-resolution@PiSA-SR) | 5.01 | 38.22 | 0.5364 |
| 4KAgent | 3.15 | 44.16 | 0.5928 |

It shows that using powerful tools can achieve good perceptual quality, but combining them with a good plan can further improve perceptual quality.

Q4: Report some information about runtime

A4: Thank you for pointing this out. As a multi-agent system, 4KAgent supports multi-GPU deployment. Specifically, 4KAgent assigns different agents (Perception Agent, Restoration Agent) to different GPUs to conserve memory. Most of our experiments were conducted using two NVIDIA RTX 4090 GPUs, so we evaluate the running time on NVIDIA RTX 4090 GPUs. The running time of 4KAgent depends on the chosen profile, the quality of the input image, and the length of the restoration plan. We report the fastest and slowest cases observed in our experiments. The fastest case is super-resolving images (x4) on the B100 dataset under the ExpSR-s4-F profile. The slowest case is jointly restoring and upscaling LQ images from the DIV4K dataset to 4K resolution under the Gen4K-P profile. The running times for these two cases are shown below:

| Profile Nickname | Task | Resolution | Benchmark | Length of plan | Running time (s) |
| --- | --- | --- | --- | --- | --- |
| ExpSR-s4-F | super-resolution (x4) | 120 x 80 → 480 x 320 | B100 | 1.0 ± 0.0 | 50.96 ± 2.01 |
| Gen4K-P | joint restoration + 4K upscaling | 256 x 256 → 4096 x 4096 | DIV4K-50 | 3.4 ± 0.6 | 1551.76 ± 230.73 |

For accelerating 4KAgent, we developed a Fast4K mode. Specifically, when the image resolution exceeds a user-set threshold s (longer side > s), 4KAgent disables long-running methods in the toolbox to reduce processing time. Since s is user-configurable, users can flexibly adjust it to balance performance and processing time.

Based on the Fast4K mode, we conduct an experiment on DIV4K-50 by setting different resolution thresholds s (s = 1024 / 4096) in 4KAgent and evaluating the running time and performance. Experiment results are shown below:

| Method | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ | Avg. Running Time (s) |
| --- | --- | --- | --- | --- | --- |
| DiffBIR [4] (16x) | 2.65 | 0.7078 | 38.59 | 0.5858 | 976.2 |
| OSEDiff [2] (16x) | 8.37 | 0.5680 | 25.07 | 0.4210 | 106.6 |
| PiSA-SR [3] (16x) | 9.30 | 0.5549 | 24.51 | 0.3861 | 107.8 |
| 4KAgent (s=1024) | 3.51 | 0.7310 | 44.51 | 0.5689 | 582.7 |
| 4KAgent (s=4096) | 3.15 | 0.7585 | 44.16 | 0.5928 | 1551.8 |

It shows that by simply setting s to 1024, the average running time of 4KAgent decreases from 1551.8 s to 582.7 s (a 62.3% reduction) with minimal effect on performance. Under this setting, 4KAgent is even faster than the model-based approach DiffBIR.

Q5: Rain-streak removal should not be conducted automatically in an image restoration system

A5: Thank you for pointing it out. One of the core contributions of 4KAgent is that it is a highly customizable system. By setting different profiles, 4KAgent can be customized to perform different image restoration tasks. We provide specific profile settings and examples in Section A.1 in the Appendix. Therefore, for images containing rain streaks, users can flexibly choose whether to perform rain removal.
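As a rough illustration of this kind of profile switch (the field names here are assumptions for exposition; the actual profile definitions are in Section A.1 of the Appendix):

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Hypothetical rendering of a 4KAgent profile."""
    restore_option: str = "auto"            # e.g., "auto", "super-resolution"
    restore_preference: str = "perception"  # or "fidelity"
    enable_deraining: bool = True           # opt out of rain-streak removal

# A user who deliberately photographs rain simply disables deraining:
keep_the_rain = Profile(enable_deraining=False)
```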

L1: Handle real-world data

A6: Thank you for pointing this out. In Section 3.4, we test 4KAgent on RealSRSet, where each low-quality image is collected from the real world and therefore contains real-world noise and artifacts. We upscale these images by a scale factor of 16 and evaluate the resulting images with 3 metrics: NIQE, MUSIQ, and MANIQA-pipal. Experiment results are shown below:

| Method | NIQE ↓ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- |
| DiffBIR [4] (4x → 4x) | 3.63 | 44.86 | 0.6076 |
| OSEDiff [2] (4x → 4x) | 5.40 | 48.42 | 0.5362 |
| PiSA-SR [3] (4x → 4x) | 5.70 | 48.20 | 0.5464 |
| 4KAgent | 3.53 | 50.84 | 0.5913 |

It shows that 4KAgent performs on par or better than SOTA generative SR methods, indicating its ability to handle real-world data.

Reference

[1] Wang, Y., Yang, W., Chen, X., Wang, Y., Guo, L., Chau, L. P., ... & Wen, B. (2024). Sinsr: diffusion-based image super-resolution in a single step. CVPR.

[2] Wu, R., Sun, L., Ma, Z., & Zhang, L. (2024). One-step effective diffusion network for real-world image super-resolution. NeurIPS.

[3] Sun, L., Wu, R., Ma, Z., Liu, S., Yi, Q., & Zhang, L. (2025). Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. CVPR.

[4] Lin, X., He, J., Chen, Z., Lyu, Z., Dai, B., Yu, F., ... & Dong, C. (2024). Diffbir: Toward blind image restoration with generative diffusion prior. ECCV.

[5] Zhu, K., Gu, J., You, Z., Qiao, Y., & Dong, C. (2024). An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809.

Comment
  1. A1: Although the proposed method obtains good no-reference IQA indexes, its LPIPS and DISTS metrics are inferior to state-of-the-art generative SR methods. I think the proposed method is a conditional image generation approach rather than a super-resolution approach: it generates fake image details that are less consistent with the input LR image. Therefore, I still think the proposed method just establishes a meaningless setting (automatically removing rain streaks and fog in any image) and introduces an elaborately designed method for solving that problem. For the real "real-world image super-resolution" task, the proposed method is not as good as recent state-of-the-art methods.

  2. A2: According to Fig. 2 in the manuscript, I think the low-resolution figure contains sufficient semantic information, and the description "a young girl with long brown hair, xxxx" cannot provide any additional information. If the same description can be obtained from the low-quality image, how can it provide supplementary information for image enhancement?

  3. A3: Refer to my further questions on A1; good no-reference IQA is meaningful only when comparable fidelity can be guaranteed. Furthermore, I still suggest you give examples of how images are corrupted and what their corresponding restoration paths are.

  4. A4: After reading the authors' response to question 4, I think this paper just introduces an elaborately designed method for solving a fake problem. Instead of paying attention to automatically removing rain streaks and fog in any image, I think the authors should consider a real SR agent that could analyze the blur kernel, local (probably spatially variant) artifacts (including method artifacts, fake textures, and so on), and other challenges for SR.

  5. A5: You should validate your "highly customized system" on a real-world image restoration task or a commonly used testbed; otherwise, the comparison with current methods is unfair. According to the authors' A1, I think the restoration capability of such a complex automatic system is significantly inferior to state-of-the-art methods.
Comment

Dear Reviewer wkHw,

We are thankful for your detailed comments. Please find the response below.

Q1 & Q5: Although 4KAgent excels in no-reference IQA metrics, its LPIPS and DISTS still trail state-of-the-art generative SR methods, suggesting it hallucinates details rather than truly recovering them, and it lacks validation on real-world restoration benchmarks, undermining the fairness and credibility of the comparisons.

A1: TL;DR: 4KAgent offers two configurable modes: fidelity for maximizing traditional SR metrics (PSNR, SSIM) and perception for non-reference perceptual scores (NIQE, CLIPIQA, MUSIQ, MANIQA) while remaining competitive on LPIPS and DISTS. This flexible design addresses the fundamental perception–distortion trade-off, letting users choose higher accuracy or more creative “hallucinations” (e.g., enhancing AI-generated images and old photos). Across benchmarks (B100, RealSR, DRealSR, GenAIBench-4K, DiffusionDB-4K, WorldStrat), 4KAgent matches or exceeds state-of-the-art performance in the selected mode.

We thank the reviewer for the question. We’d like to clarify that 4KAgent is a highly customizable system (with a simple argparse-like config): when performing SR tasks, users can pre-set different modes, prioritizing either higher fidelity or higher perceptual quality, which correspond to more faithful recovery or higher-quality but arguably more creative outputs (cf. the reviewer’s point on conditional generation), respectively. We’d also like to mention that it is almost impossible to achieve the highest scores on both perceptual and fidelity metrics; this phenomenon has been established by the perception-distortion tradeoff paper [1]. We also observe in Table 1 in the main paper that recent state-of-the-art diffusion-based models like OSEDiff and PiSA-SR can achieve high perceptual scores but still struggle on fidelity metrics. Because of this fundamental tradeoff, we have made this a configurable choice for the user: you can flexibly choose to be more accurate (i.e., the traditional SR problem) or more “creative” (which we think is very useful for handling AI-generated content and old photos, where more “hallucination” is needed).
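For reference, the tradeoff in [1] can be stated via the perception-distortion function (our restatement of that formulation):

```latex
% Perception-distortion function (Blau & Michaeli, 2018):
% the minimal distortion attainable under a perceptual-quality budget P.
D(P) \;=\; \min_{p_{\hat{X}\mid Y}} \; \mathbb{E}\bigl[\Delta(X,\hat{X})\bigr]
\quad \text{s.t.} \quad d\bigl(p_X,\, p_{\hat{X}}\bigr) \le P
```

Here $\Delta$ is a distortion measure (e.g., the squared error underlying PSNR), $d$ is a divergence between the natural-image distribution $p_X$ and the output distribution $p_{\hat{X}}$, and $D(P)$ is shown in [1] to be non-increasing and convex, so pushing perceptual quality up (smaller $d$) necessarily costs fidelity.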

For the traditional SR problem, we follow the experiment setting from previous methods and present the experiment results on the B100 dataset here. The top two performances are marked in bold and italic. (Detailed experiment results on all traditional SR benchmarks are provided in Tables 5 ~ 9 in the Appendix.)

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SwinIR [2] | 27.92 | 0.7489 | 0.3548 | 0.2005 | 94.57 | 6.27 | 0.5373 | 57.71 | 0.5860 |
| HAT-L [3] | 28.08 | 0.7547 | 0.3440 | 0.1952 | 89.52 | 6.20 | 0.5477 | 58.71 | 0.5991 |
| DiffBIR [4] | 24.99 | 0.6156 | 0.2719 | 0.1666 | 84.99 | 3.92 | 0.7483 | 68.23 | 0.6750 |
| OSEDiff [5] | 24.35 | 0.6495 | 0.2408 | 0.1634 | 73.23 | 4.08 | 0.7422 | 68.54 | 0.6725 |
| PiSA-SR [6] | 25.00 | 0.6520 | 0.2111 | 0.1471 | 61.82 | 4.04 | 0.7384 | 68.47 | 0.6829 |
| AgenticIR [7] | 22.51 | 0.5853 | 0.3078 | 0.1907 | 102.92 | 4.08 | 0.7474 | 68.36 | 0.6752 |
| 4KAgent (F) | 28.09 | 0.7540 | 0.3453 | 0.1953 | 88.89 | 6.02 | 0.5516 | 59.12 | 0.5994 |
| 4KAgent (P) | 24.64 | 0.6294 | 0.2387 | 0.1606 | 73.64 | 3.86 | 0.7546 | 69.42 | 0.6851 |

It shows that under the fidelity mode (4KAgent (F), choose to be more accurate), 4KAgent achieves superior performance on fidelity-related metrics (e.g., PSNR, SSIM). When switching to perception mode (4KAgent (P), choose to be more creative), 4KAgent achieves SOTA performance on non-reference perceptual metrics (NIQE, CLIPIQA, MUSIQ, MANIQA) with competitive performance on full-reference perceptual metrics (LPIPS, DISTS).

For AI-generated content, we report experimental results on GenAIBench-4K and DiffusionDB-4K benchmarks in Table 17 in the Appendix. We also employ SOTA generative SR methods for comparison. The experimental result based on SANA [8] is shown below:

Datasets: GenAIBench-4K (left three metric columns), DiffusionDB-4K (right three metric columns)

| Model | NIQE ↓ | CLIPIQA ↑ | MANIQA ↑ | NIQE ↓ | CLIPIQA ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| SANA-4K | 4.02 | 0.6172 | 0.3673 | 3.74 | 0.6005 | 0.3425 |
| SANA-1K + OSEDiff | 3.63 | 0.6806 | 0.3922 | 4.38 | 0.6655 | 0.4052 |
| SANA-1K + PiSA-SR | 3.43 | 0.6786 | 0.4049 | 4.20 | 0.6666 | 0.4081 |
| SANA-1K + 4KAgent (P) | 3.03 | 0.7050 | 0.4735 | 3.04 | 0.7082 | 0.4715 |

We find that upscaling images generated by SANA-1K using 4KAgent produces better perceptual quality than directly generating 4K images with SANA-4K. 4KAgent also surpasses state-of-the-art generative SR methods in perceptual quality. As shown in Figure 10 in the main paper, SANA-1K + 4KAgent yields richer textures and better aesthetic alignment.

For old photos, Figure 5 in the Appendix shows that 4KAgent generates more realistic and detailed restorations, with finer hair strands, eyebrow patterns, and eyes compared to previous methods.

Comment

For the real-world image SR task, we conducted experiments on two commonly used real-world image super-resolution benchmarks: RealSR and DRealSR. They are real-world super-resolution (SR) datasets built by capturing paired low-quality (LQ) and high-quality (HQ) images of the same static scenes using fixed DSLR cameras at different focal lengths (e.g., 28 mm for LQ vs. 105 mm for HQ). The LQ images contain authentic distortions, including spatially varying blur, sensor noise, and lens artifacts, which differ from bicubic downsampling (as used in synthetic datasets). Therefore, they are widely used by SOTA generative SR methods [4,5,6] for evaluation. Following the same experimental setting as these methods, we compare 4KAgent with SOTA generative SR methods and an agentic image restoration system. Experimental results are shown below:

Benchmark: RealSR

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| StableSR | 24.69 | 0.7052 | 0.3091 | 0.2167 | 127.20 | 5.76 | 0.6195 | 65.42 | 0.6211 |
| DiffBIR [4] | 24.88 | 0.6673 | 0.3567 | 0.2290 | 124.56 | 5.63 | 0.6412 | 64.66 | 0.6231 |
| SinSR | 26.30 | 0.7354 | 0.3212 | 0.2346 | 137.05 | 6.31 | 0.6204 | 60.41 | 0.5389 |
| OSEDiff [5] | 25.15 | 0.7341 | 0.2921 | 0.2128 | 123.50 | 5.65 | 0.6693 | 69.09 | 0.6339 |
| PiSA-SR [6] | 25.50 | 0.7417 | 0.2672 | 0.2044 | 124.09 | 5.50 | 0.6702 | 70.15 | 0.6560 |
| AgenticIR [7] | 22.45 | 0.6447 | 0.3745 | 0.2503 | 140.38 | 5.81 | 0.6506 | 65.87 | 0.6210 |
| 4KAgent (F) | 27.55 | 0.7732 | 0.3958 | 0.2423 | 131.67 | 8.17 | 0.3427 | 30.55 | 0.3711 |
| 4KAgent (P) | 22.55 | 0.6557 | 0.3509 | 0.2468 | 134.63 | 4.78 | 0.6666 | 71.77 | 0.6564 |

Benchmark: DRealSR

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| StableSR | 28.04 | 0.7460 | 0.3354 | 0.2287 | 147.03 | 6.51 | 0.6171 | 58.50 | 0.5602 |
| DiffBIR [4] | 26.84 | 0.6660 | 0.4446 | 0.2706 | 167.38 | 6.02 | 0.6292 | 60.68 | 0.5902 |
| SinSR | 28.41 | 0.7495 | 0.3741 | 0.2488 | 177.05 | 7.02 | 0.6367 | 55.34 | 0.4898 |
| OSEDiff [5] | 27.92 | 0.7835 | 0.2968 | 0.2165 | 135.29 | 6.49 | 0.6963 | 64.65 | 0.5899 |
| PiSA-SR [6] | 28.31 | 0.7804 | 0.2960 | 0.2169 | 130.61 | 6.20 | 0.6970 | 66.11 | 0.6156 |
| AgenticIR [7] | 23.06 | 0.6145 | 0.4775 | 0.2973 | 182.02 | 6.11 | 0.6542 | 63.59 | 0.5927 |
| 4KAgent (F) | 30.62 | 0.8326 | 0.4381 | 0.2701 | 156.92 | 10.29 | 0.3546 | 24.18 | 0.3247 |
| 4KAgent (P) | 23.11 | 0.6126 | 0.4579 | 0.2866 | 178.36 | 4.64 | 0.7092 | 69.30 | 0.6219 |

The experiment result leads to a similar conclusion on the traditional SR task: under the fidelity mode (choose to be more accurate), 4KAgent achieves superior performance on fidelity-related metrics (e.g., PSNR, SSIM). When switching to perception mode (choose to be more creative), 4KAgent achieves SOTA performance on non-reference perceptual metrics (NIQE, CLIPIQA, MUSIQ, MANIQA). Due to space limitations in the main paper, we provided the detailed experimental results in Table 10 in the Appendix. We will update the corresponding experimental results in the next version of the paper.

We also perform real-world remote sensing super-resolution experiments using the WorldStrat dataset (Corresponding to section F.1 in the Appendix). Each high‑resolution (HR) image in this dataset is a low-cloud Airbus SPOT 6/7 capture (1.5 m pan, 6 m RGB+NIR), while low‑resolution (LR) images consist of up to 16 Sentinel-2 revisits (10–60 m, 12–13 bands). WorldStrat thus provides a realistic benchmark for remote sensing image SR. Experiment results are shown below:

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DISTS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR [4] | 20.65 | 0.5150 | 0.6781 | 0.3712 | 227.68 | 7.55 | 0.6075 | 53.57 | 0.5475 |
| OSEDiff [5] | 25.97 | 0.6316 | 0.4460 | 0.2562 | 176.26 | 8.63 | 0.5096 | 46.51 | 0.4988 |
| PiSA-SR [6] | 23.93 | 0.6179 | 0.4581 | 0.2748 | 170.04 | 7.22 | 0.5010 | 48.64 | 0.5152 |
| TransENet | 24.49 | 0.6943 | 0.3270 | 0.2106 | 133.70 | 7.78 | 0.2246 | 27.90 | 0.3152 |
| AgenticIR [7] | 19.59 | 0.5188 | 0.6716 | 0.3686 | 224.80 | 8.71 | 0.5402 | 54.06 | 0.5166 |
| 4KAgent (F) | 22.35 | 0.6470 | 0.3702 | 0.2324 | 166.27 | 9.54 | 0.2875 | 34.37 | 0.3011 |
| 4KAgent (P) | 20.15 | 0.5379 | 0.6363 | 0.3664 | 223.49 | 8.55 | 0.6236 | 56.84 | 0.5547 |

TransENet is a specialized aerial SR method trained on remote sensing data. In fidelity mode, 4KAgent performs competitively on SSIM, LPIPS, and DISTS. In perception mode, it achieves state-of-the-art results on no-reference perceptual metrics like CLIPIQA, MUSIQ, and MANIQA.

Experiments across diverse datasets show that 4KAgent is a flexible agentic system capable of producing either more faithful or higher-quality but arguably more creative results, depending on user preferences. We validate its superiority across traditional image SR, AI-generated image SR, and real-world natural and remote sensing image SR, with consistently strong performance.

Comment

Q2: The usage of image description in 4KAgent.

A2: In the 4KAgent system, the textual description of an image is used in the MoE module of the Restoration Agent. Specifically, when the Restoration Agent executes the restoration plan provided by the Perception Agent, it processes the input image with multiple restoration methods at each execution step, producing multiple candidate images, from which the best one must be selected. For this purpose, we designed a quality score that considers two aspects: perception quality, for which we combine no-reference IQA metrics such as NIQE, MUSIQ, MANIQA, and CLIPIQA into a score; and human preference, for which we use the HPSv2 [9] framework. Since HPSv2 requires images and their corresponding descriptions for evaluation, we need to extract image descriptions to evaluate human preference and assist the MoE module in making selections.
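A minimal sketch of the selection step just described; the weighting, normalization, and function names are stand-ins, since the exact 4KAgent scoring internals are not spelled out here:

```python
def quality_score(img, caption, iqa_fns, hps_fn, w_iqa=0.5, w_pref=0.5):
    """Blend no-reference IQA metrics with an HPSv2-style preference score.
    iqa_fns: dict of metric name -> callable returning a score in [0, 1]
    hps_fn:  callable(img, caption) -> preference score in [0, 1]."""
    perception = sum(fn(img) for fn in iqa_fns.values()) / len(iqa_fns)
    preference = hps_fn(img, caption)  # HPSv2 needs the image description
    return w_iqa * perception + w_pref * preference

def select_best(candidates, caption, iqa_fns, hps_fn):
    """Q-MoE-style step: keep the candidate with the highest quality score."""
    return max(candidates,
               key=lambda img: quality_score(img, caption, iqa_fns, hps_fn))
```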

Q3: Good no-reference IQA is meaningful only when comparable fidelity can be guaranteed. Provide examples of how images are corrupted and what their corresponding restoration paths are.

A3: Based on our answer to Q1 & Q5 (A1 in part 1), 4KAgent can be customized to be more accurate (i.e., the traditional SR problem) or more “creative” (which we think is very useful for handling AI-generated content and old photos, where more “hallucination” is needed), depending on user preferences. In both modes, 4KAgent delivers superior performance and outperforms SOTA generative SR methods on multiple image SR tasks.

For the RealSRSet experiment in our A3 in the rebuttal: since the RealSRSet dataset only contains real-world low-quality (LQ) images, we use no-reference IQA metrics for evaluation after 16x SR. Additionally, Figure 5 in the main paper provides corresponding visualization results. Compared to previous methods (the SOTA classical SR method HAT-L, the SOTA generative SR method DiffBIR, and the agentic image restoration system AgenticIR), the images generated by 4KAgent are more faithful to the input LQ images (e.g., the rock and grass textures). Furthermore, we provide additional visualization results in Figure 5 in the Appendix, and the conclusion is the same.

We provide some specific restoration paths of 4KAgent when processing the RealSR and RealSRSet dataset (since the images in these datasets are collected from the real world, there is no process involving synthetic LQ images):

Figure 2 in the Appendix (top): Defocus Deblurring@DRBNet → Super-Resolution@PiSA-SR.

Figure 2 in the Appendix (bottom): Super-Resolution@DiffBIR → JPEG Compression Artifact Removal@FBCNN (BQF) → Defocus Deblurring@LaKDNet.

Figure 5 in the Appendix (bottom) (also the lower-left image in Figure 1 with the tag ‘Human Faces’): Super-Resolution@OSEDiff → Face Restoration@CodeFormer → Denoising@Restormer → JPEG Compression Artifact Removal@FBCNN (BQF) → Defocus Deblurring@DiffPlugin → Super-Resolution@DiffBIR. BQF in FBCNN indicates 'Blind to Quality Factor'.

For synthesized low-quality images and their restoration path, we provided examples on the subset of the MiO dataset:

Figure 4 in the main paper (left): containing haze and x4 downsampling.

Restoration path: Super-Resolution@PiSA-SR → Dehazing@RIDCP.

Figure 4 in the main paper (right): containing rain streaks, Gaussian noise, and x4 downsampling.

Restoration path: Deraining@Restormer → Denoising@SwinIR(σ=15) → Super-Resolution@OSEDiff.

4KAgent successfully perceives distortions in the image and executes the corresponding restoration plan.

Q4: Consider building a real SR-agent that could analyze the blur kernel, local artifacts, and other challenges for SR.

A4: TL;DR: 4KAgent integrates nine specialized restoration methods and a customizable profile module to flexibly address diverse distortions, such as defocus blur, rain streaks, fog, noise, and more, based on user preferences and task requirements. On RealSR and DRealSR datasets, it correctly identifies and applies defocus deblurring in about 70% of cases, underscoring its accurate perception and restoration capabilities. Tested across 26 benchmarks, 4KAgent consistently outperforms both general and domain-specific super-resolution methods. Its dynamic selection of multiple SR tools per image breaks the single-model limitation, greatly expanding its real-world applicability.

4KAgent is designed for broad applicability across image restoration and super-resolution tasks, integrating methods for nine types of restoration. Additionally, following previous agentic AI systems, we equip 4KAgent with a profile module for customization based on user preferences (e.g., “super-resolve this image with high fidelity” → set ‘restore preference’ to ‘fidelity’) and requirements (e.g., “denoise the image” → set ‘restore option’ to ‘denoise’).

This enhances the usability of 4KAgent to support diverse scenarios, including not only rain, fog, but also blur and noise removal.

Comment

We validated the effectiveness of 4KAgent in real-world image SR tasks. For more details, please refer to RealSR and the DrealSR experiment in our answer to Q1 & Q5 (A1 in part 1).

Notably, the dominant distortion in the LQ images arises from blur. Because both datasets collect LQ images by reducing the focal length while keeping exposure, distance, and tripod fixed, the dominant blur is optical defocus blur. We conducted a statistical analysis of the tool chains produced by 4KAgent when processing images from these two datasets, specifically counting how many times "defocus deblurring" appeared in the tool chain. Experiment results are shown below:

| Dataset | Number of images | Number of "defocus deblurring" in tool chain |
| --- | --- | --- |
| RealSR | 100 | 59 (59%) |
| DRealSR | 93 | 76 (81.7%) |
| Total | 193 | 135 (69.9%) |

4KAgent added the defocus deblurring task for about 70% of the images overall, which indicates that 4KAgent accurately identifies defocus blur in LQ images and assigns the defocus deblurring task in the restoration plan. The leading perceptual quality in the experimental results also demonstrates the effectiveness of this mechanism. We believe that by constructing an agentic system, accurately identifying distortions in images, and employing advanced restoration methods, we can significantly expand the application scenarios of this system. We demonstrated the high usability of 4KAgent across 26 benchmarks spanning 12 tasks.

In addition to natural images, we conducted image SR experiments on AIGC images, remote sensing images, and biomedical images. The results demonstrate the effectiveness of 4KAgent on SR tasks across various image domains (e.g., remote sensing images, biomedical images), including synthetic data and real-world data (e.g., the WorldStrat dataset for remote sensing image SR and the SR-CACO-2 dataset for fluorescence microscopy image SR). 4KAgent not only outperforms SOTA general image SR methods but also outperforms SR methods specifically designed for the corresponding domains in these tasks.

Additionally, we statistically analyze the win rate of 4KAgent when processing different types of images, which refers to the top three most selected SR tools by 4KAgent when performing SR on images from this dataset, as shown in the table below:

| Task | Benchmark (# of images) | Leading SR Tools (# of selections) |
| --- | --- | --- |
| Real-World Image SR (x4) | RealSR (100) | DiffBIR (55), PiSA-SR (22), OSEDiff (16) |
| AIGC SR (x4) | GenAIBench (100) | DiffBIR (61), OSEDiff (25), PiSA-SR (14) |
| Remote Sensing SR (x4) | DOTA (183) | HMA (79), HAT-L (54), PiSA-SR (15) |
| Fluorescence Microscopic Image SR (x4) | SR-CACO-2 (300) | DRCT (238), HMA (23), HAT-L (17) |
| Pathology Image SR (x4) | bcSR (200) | X-Restormer (84), HAT-L (64), HMA (30) |
| Medical Image SR (x4) | Chest X-ray 2017 (624) | HMA (312), DRCT (171), X-Restormer (73) |

It shows that:

(1) For different types of image SR tasks, the leading SR tools are not the same.

(2) The total number of times the leading SR tools are selected is less than the number of images in the dataset, indicating that 4KAgent uses more than three different SR tools when processing images in these datasets.

It demonstrates that 4KAgent can flexibly and appropriately invoke SR methods within the system when addressing different tasks, which breaks through the limitations of a single model and significantly enhances the applicability and application scenarios of 4KAgent.

Reference

[1] Blau, Y., & Michaeli, T. (2018). The perception-distortion tradeoff. CVPR.

[2] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., & Timofte, R. (2021). Swinir: Image restoration using swin transformer. ICCV.

[3] Chen, X., Wang, X., Zhang, W., Kong, X., Qiao, Y., Zhou, J., & Dong, C. (2023). Hat: Hybrid attention transformer for image restoration. arXiv preprint arXiv:2309.05239.

[4] Lin, X., He, J., Chen, Z., Lyu, Z., Dai, B., Yu, F., ... & Dong, C. (2024). Diffbir: Toward blind image restoration with generative diffusion prior. ECCV.

[5] Wu, R., Sun, L., Ma, Z., & Zhang, L. (2024). One-step effective diffusion network for real-world image super-resolution. NeurIPS.

[6] Sun, L., Wu, R., Ma, Z., Liu, S., Yi, Q., & Zhang, L. (2025). Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. CVPR.

[7] Zhu, K., Gu, J., You, Z., Qiao, Y., & Dong, C. (2024). An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809.

[8] Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., ... & Han, S. (2024). Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629.

[9] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., & Li, H. (2023). Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.

Review (Rating: 4)

4KAgent presents a unified, agentic framework for up-scaling any image—whether real photos, AI‑generated art, scientific scans, or medical imagery—to 4K resolution. It employs a two‑stage pipeline—first a Perception Agent that dynamically plans restoration steps via LLM/VLM reasoning, then a Restoration Agent that executes those steps using a mixture‑of‑experts toolbox with rollback capability—and demonstrates state‑of‑the‑art results across 26 diverse benchmarks.

Strengths and Weaknesses

Strengths:

(1) 4KAgent is the first single model capable of upscaling any image—real photographs, AI‑generated content, scientific scans—to 4K resolution via a unified pipeline, avoiding the need for specialist networks per domain.

(2) The two‑stage Perception Agent+Restoration Agent architecture, augmented with a Quality‑Driven Mixture‑of‑Experts policy and rollback mechanism, enables dynamic, content‑aware restoration plans and optimization at each step.

(3) The authors evaluate on 26 benchmarks across classical SR, RealSR, multi‑degradation restoration, face enhancement, large‑factor upscaling, AIGC, remote sensing, microscopy, and biomedical imaging—demonstrating consistent SoTA performance and real‑world applicability.

Weaknesses:

(1) The multi‑agent framework, large toolbox of restoration tools, LLM/VLM components, and rollback loops introduce significant engineering and inference overhead, potentially hindering real‑time or resource‑constrained deployment.

(2) The pipeline involves restoring faces separately and pasting them back into the original image, but the paper does not analyze whether this operation introduces boundary artifacts or inconsistencies, especially under misalignment or inconsistent lighting conditions.

(3) The related work section provides limited discussion of existing super-resolution methods, especially recent advances in generalist or large-scale SR models.

Questions

(1) Could you provide quantitative runtime and memory measurements (e.g., end‑to‑end latency on a standard GPU) for typical restoration tasks, and compare them against specialist SR models? This would clarify practical deployment trade‑offs.

(2) When restored faces are pasted back into the original image, do visible seams or boundary artifacts occur?

(3) Could the authors expand the related work section to include more recent and representative super-resolution methods, particularly generalist or unified frameworks?

Limitations

YES

Final Justification

I read the other reviews and the authors’ rebuttal. My concerns have been well addressed by the authors. I suggest borderline acceptance of the paper.

Formatting Issues

None

Author Response

We sincerely thank Reviewer jm8i for your detailed and encouraging feedback. We appreciate your recognition of 4KAgent as a unified, domain-general framework for 4K image restoration across real photos, AI-generated content, and scientific/medical imagery. We are grateful for your acknowledgement of our two-stage Perception–Restoration Agent architecture, and the Quality-Driven Mixture-of-Experts policy with rollback. Your recognition of our extensive evaluation across various benchmarks and the consistent state-of-the-art performance across diverse and practical scenarios is especially motivating for us.

Q1: Runtime and memory measurements

A1: Thank you for pointing this out. As a multi-agent system, 4KAgent supports multi-GPU deployment. Specifically, 4KAgent assigns different agents (Perception Agent, Restoration Agent) to different GPUs to conserve memory. Most of our experiments were conducted using two NVIDIA RTX 4090 GPUs. For acceleration, we developed a Fast4K mode: when the image resolution exceeds a user-set threshold s (longer side > s), 4KAgent disables long-running methods in the toolbox to reduce processing time. Since s is user-configurable, users can flexibly adjust it to balance performance and processing time.

Based on the Fast4K mode, we conduct an experiment on DIV4K-50 by setting different resolution thresholds s (s = 1024 / 4096) in 4KAgent and evaluating the running time and performance. Experiment results are shown below:

| Method | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ | Avg. Running Time (s) |
| --- | --- | --- | --- | --- | --- |
| DiffBIR [1] (16x) | 2.65 | 0.7078 | 38.59 | 0.5858 | 976.2 |
| OSEDiff [2] (16x) | 8.37 | 0.5680 | 25.07 | 0.4210 | 106.6 |
| PiSA-SR [3] (16x) | 9.30 | 0.5549 | 24.51 | 0.3861 | 107.8 |
| 4KAgent (s=1024) | 3.51 | 0.7310 | 44.51 | 0.5689 | 582.7 |
| 4KAgent (s=4096) | 3.15 | 0.7585 | 44.16 | 0.5928 | 1551.8 |

It shows that by simply setting s to 1024, the average running time of 4KAgent decreases from 1551.8 s to 582.7 s (a 62.3% reduction) with minimal effect on performance. Under this setting, 4KAgent is even faster than the model-based approach DiffBIR. We will keep working on accelerating 4KAgent to improve its practicality.

Q2: Boundary artifacts for face restoration pipeline

A2: TL;DR: The face restoration pipeline introduces boundary artifacts at a low frequency of occurrence, and its benefits in 4KAgent outweigh the impact of the boundary artifacts it introduces.

Thank you for pointing this out. When designing the face restoration pipeline in 4KAgent, we follow the ‘whole image restoration’ procedure from previous methods (e.g., GFPGAN [4], CodeFormer [5], DifFace [6]). Specifically, we use the toolbox in the GFPGAN method to extract faces in the image and paste them back after restoration. Therefore, it may introduce boundary artifacts at the edge. We conduct an experiment for this.

Specifically, we use WebPhoto-Test as the testing dataset and select 100 LQ images as test images. Then, we input these images to GFPGAN, CodeFormer and DifFace for face restoration under the ‘whole image restoration’ procedure. We also collect result images from 4KAgent. After obtaining all the result images, we conduct a user study to explore the boundary artifacts. We define the boundary artifacts as: Jagged edges or color halos along the face boundary. Based on this definition, we ask users to evaluate the result images and count the occurrence of the boundary artifacts. Then we calculate the average occurrence of the boundary artifacts. Experiment results are shown below:

| Method | GFPGAN | CodeFormer | DifFace | 4KAgent |
| --- | --- | --- | --- | --- |
| Avg. boundary artifacts | 9.75 / 100 | 8.75 / 100 | 25.75 / 100 | 7.75 / 100 |

It shows that both SOTA face restoration methods and 4KAgent introduce boundary artifacts when following the 'whole image restoration' procedure. However, the frequency of occurrence is generally low (below 10% for all methods except DifFace), and 4KAgent does not increase this frequency, thanks to its system design.

We also conduct an ablation study of the face restoration pipeline in Section D in the Appendix. Specifically, we compare the result images from 4KAgent with different profiles (ExpSR-s4-P, ExpSRFR-s4-P) on the WebPhoto-Test dataset. The key difference between profiles and corresponding experiment results is shown below:

| Method | Restore Option | Face Restore | NIQE ↓ | CLIB-IQA ↑ | DSL-FIQA ↑ |
| --- | --- | --- | --- | --- | --- |
| 4KAgent (ExpSR-s4-P) | super-resolution | False | 5.11 | 0.6415 | 0.7194 |
| 4KAgent (ExpSRFR-s4-P) | super-resolution | True | 4.53 | 0.6602 | 0.7237 |

where ‘Face Restore’ controls the activation of the proposed face restoration pipeline. It shows that when the face restoration pipeline is activated, both overall image perceptual quality and face-related IQA metrics improve. We also provide a visual comparison in Figure 8 in the Appendix, where the enhanced facial and hair details in the 4KAgent outputs under the ExpSRFR-s4-P and GenSRFR-s4-P profiles (which activate the face restoration pipeline) further demonstrate its benefit.

Therefore, following the established procedure, the face restoration pipeline in 4KAgent introduces boundary artifacts at a low frequency of occurrence, while enhancing both overall image quality and face quality. We believe the benefits of the face restoration pipeline in 4KAgent outweigh the impact of the boundary artifacts it introduces. We will further refine the face restoration pipeline to reduce these artifacts.

Q3: Expand the related work section

A3: Thank you for pointing this out. Due to space limitations, the related work section in the main paper is an abbreviated version. We discuss the latest progress of SR models in detail in Section B.1 of the Appendix. For example, we discuss the recent progress of generative SR methods (e.g., StableSR [7], SeeSR [8], OSEDiff [2], PiSA-SR [3]) and all-in-one image restoration methods (e.g., AirNet [9], ADMS [10]). We also discuss the recent progress of agentic image restoration frameworks (e.g., RestoreAgent [11], AgenticIR [12], MAIR [13], Q-Agent [14]).

At the same time, we also discuss the progress and application of SR models in Section G of the Appendix. For example, we discuss the application of SR models in video conferencing, surveillance, and gaming. We will continue to conduct in-depth research on the progress of SR models and provide a richer and more comprehensive discussion on the related work section in the next version of the paper.

Reference

[1] Lin, X., He, J., Chen, Z., Lyu, Z., Dai, B., Yu, F., ... & Dong, C. (2024). Diffbir: Toward blind image restoration with generative diffusion prior. ECCV.

[2] Wu, R., Sun, L., Ma, Z., & Zhang, L. (2024). One-step effective diffusion network for real-world image super-resolution. NeurIPS.

[3] Sun, L., Wu, R., Ma, Z., Liu, S., Yi, Q., & Zhang, L. (2025). Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. CVPR.

[4] Wang, X., Li, Y., Zhang, H., & Shan, Y. (2021). Towards real-world blind face restoration with generative facial prior. CVPR.

[5] Zhou, S., Chan, K., Li, C., & Loy, C. C. (2022). Towards robust blind face restoration with codebook lookup transformer. NeurIPS.

[6] Yue, Z., & Loy, C. C. (2024). Difface: Blind face restoration with diffused error contraction. TPAMI.

[7] Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. IJCV, 132(12), 5929-5949.

[8] Wu, R., Yang, T., Sun, L., Zhang, Z., Li, S., & Zhang, L. (2024). Seesr: Towards semantics-aware real-world image super-resolution. CVPR.

[9] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., & Peng, X. (2022). All-in-one image restoration for unknown corruption. CVPR.

[10] Park, D., Lee, B. H., & Chun, S. Y. (2023). All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. CVPR.

[11] Chen, H., Li, W., Gu, J., Ren, J., Chen, S., Ye, T., ... & Zhu, L. (2024). Restoreagent: Autonomous image restoration agent via multimodal large language models. NeurIPS.

[12] Zhu, K., Gu, J., You, Z., Qiao, Y., & Dong, C. (2024). An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809.

[13] Jiang, X., Li, G., Chen, B., & Zhang, J. (2025). Multi-Agent Image Restoration. arXiv preprint arXiv:2503.09403.

[14] Zhou, Y., Cao, J., Zhang, Z., Wen, F., Jiang, Y., Jia, J., ... & Zhai, G. (2025). Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model. arXiv preprint arXiv:2504.07148.

Comment

Dear Reviewer jm8i,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Review
5

A first agentic framework was proposed that unifies classical, real-world, and multi-degradation super-resolution under a single, highly customizable system. By leveraging a two-stage Perception Agent–Restoration Agent architecture, enriched with the proposed Quality-Driven Mixture-of-Experts policy and a dedicated face-restoration pipeline, 4KAgent achieves state-of-the-art performance across a spectrum of benchmarks. Beyond standard benchmarks, the versatility of 4KAgent was demonstrated on AIGC imagery, remote sensing, fluorescence microscopy, and biomedical scans, highlighting its broad applicability.

Strengths and Weaknesses

Strengths

Sufficient novelty and experiments. A first agentic framework was proposed that unifies classical, real-world, and multi-degradation super-resolution under a single, highly customizable system. By leveraging a two-stage Perception Agent–Restoration Agent architecture, enriched with the proposed Quality-Driven Mixture-of-Experts policy and a dedicated face-restoration pipeline, 4KAgent achieves state-of-the-art performance across a spectrum of benchmarks, significantly enhancing facial details in portrait and selfie photos. 4KAgent is rigorously evaluated across 12 distinct task categories encompassing a total of 26 diverse benchmarks, setting a new state-of-the-art on a broad spectrum of imaging domains.

Weaknesses

Questions

1. In line 55, the paper claims that 4KAgent can achieve arbitrary-scale super-resolution, but in the implementation (line 112 and line 180), only specific integer upscaling factors are provided. This needs to be clarified. 2. Why is super-resolution conducted twice in the Perception Agent and the Restoration Agent in Figure 2? Please provide a detailed explanation. 3. Some writing errors should be corrected, such as the left parenthesis in line 54. Please review the entire paper.

Limitations

Please see the weaknesses and questions.

Justification for Final Rating

My concerns were addressed. I keep my score.

Formatting Issues

N/A

Author Response

We sincerely thank Reviewer APnX for recognizing the novelty of our proposed agentic framework, the comprehensiveness of our experimental validation, and the broad applicability of our method. We greatly appreciate your positive feedback on the unified design of 4KAgent, the effectiveness of our two-stage Perception-Restoration Agent architecture, the introduction of the Quality-Driven Mixture-of-Experts policy, and the dedicated face-restoration pipeline. Your acknowledgment of our state-of-the-art performance across various benchmarks, as well as the impact of our work on diverse domains such as AIGC, remote sensing, and biomedical imaging, is very encouraging.

Q1: Misalignment between lines 112/180 and line 55

A1: Thank you for pointing this out. As shown in line 112 and line 180 of the main paper, 4KAgent can upscale an image with a scale factor of 2, 4, 8, or 16, and can achieve larger scale factors (e.g., 32, 64, …) if applied recursively. This is indeed misaligned with the claim of ‘arbitrary-scale super-resolution’ in line 55, so we will provide a more rigorous statement there in the revised version of the paper.

Q2: Conduct twice super-resolution in perception agent and restoration agent in Figure 2

A2: In a single run, 4KAgent can upscale images with a resolution of 250 x 250 or larger to 4K resolution (i.e., the long side of the image exceeds 4000 pixels), depending on the profile set by the user. Specifically, after perceiving the distortions of the input image and forming the corresponding restoration agenda, the Perception Agent calculates the scale factor from the resolution of the input image and the 4K target, using the formula in line 112. The input image in Figure 2 has a resolution of 259 x 194, so the scale factor required to enlarge it to 4K resolution is 16. To the best of our knowledge, SOTA super-resolution (SR) methods (e.g., SwinIR [1], HAT-L [2], DiffBIR [3], OSEDiff [4], PiSA-SR [5]) either cannot directly perform 16x SR or perform poorly when doing so. We also conducted an experiment on the DIV4K-50 benchmark comparing applying 4x super-resolution twice against directly applying 16x super-resolution. The results are shown below:

| Method | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- |
| OSEDiff [4] (4x → 4x) | 4.88 | 0.7201 | 39.88 | 0.5482 |
| OSEDiff [4] (16x) | 8.37 | 0.5680 | 25.07 | 0.4210 |
| PiSA-SR [5] (4x → 4x) | 5.01 | 0.7141 | 38.22 | 0.5364 |
| PiSA-SR [5] (16x) | 9.30 | 0.5549 | 24.51 | 0.3861 |
| 4KAgent (Direct 16x SR) | 8.91 | 0.4780 | 22.78 | 0.3939 |
| 4KAgent | 3.15 | 0.7585 | 44.16 | 0.5928 |

The results show that applying 4x super-resolution twice outperforms directly applying 16x super-resolution, both for SOTA SR methods and for our 4KAgent. Therefore, we add two 4x super-resolution tasks to the restoration agenda to achieve the goal of 16x upscaling, and then schedule the restoration plan based on the updated agenda.
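
To make the scheduling concrete, here is a minimal sketch of how a scale factor could be derived from the input resolution and then decomposed into 4x stages; the helper name is hypothetical, and the authoritative formula is the one in line 112 of the paper:

```python
import math

def plan_upscale(h: int, w: int, target_long_side: int = 4000):
    # Smallest power-of-two factor s such that the long side reaches the target.
    s = 2 ** math.ceil(math.log2(target_long_side / max(h, w)))
    s = max(2, min(s, 16))  # 4KAgent supports x2/x4/x8/x16 in a single run
    # Decompose into a sequence of x4 stages, with a final x2 if needed.
    stages, remaining = [], s
    while remaining >= 4:
        stages.append(4)
        remaining //= 4
    if remaining == 2:
        stages.append(2)
    return s, stages

# Figure 2 example: a 259 x 194 input needs x16, scheduled as two x4 tasks.
print(plan_upscale(194, 259))  # -> (16, [4, 4])
```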

Q3: Correct writing errors

A3: Thank you for pointing this out. We will carefully check the paper for writing errors and correct them in the next version.

References

[1] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., & Timofte, R. (2021). SwinIR: Image restoration using Swin transformer. ICCV.

[2] Chen, X., Wang, X., Zhang, W., Kong, X., Qiao, Y., Zhou, J., & Dong, C. (2023). HAT: Hybrid attention transformer for image restoration. arXiv preprint arXiv:2309.05239.

[3] Lin, X., He, J., Chen, Z., Lyu, Z., Dai, B., Yu, F., ... & Dong, C. (2024). DiffBIR: Toward blind image restoration with generative diffusion prior. ECCV.

[4] Wu, R., Sun, L., Ma, Z., & Zhang, L. (2024). One-step effective diffusion network for real-world image super-resolution. NeurIPS.

[5] Sun, L., Wu, R., Ma, Z., Liu, S., Yi, Q., & Zhang, L. (2025). Pixel-level and semantic-level adjustable super-resolution: A dual-LoRA approach. CVPR.

Comment

My concerns were addressed. I have decided to keep my score.

Comment

Dear Reviewer APnX,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Comment

We sincerely thank all reviewers for their valuable comments and constructive suggestions. To facilitate discussion among the reviewers and the area chair, we summarize the reviewers' feedback below:

Strengths highlighted across the reviews (R-APnX, R-wkHw, R-jm8i, R-eAhT):

  • Unified agentic 4K SR
  • Two-stage planner + executor
  • Broad multi-domain evaluation
  • Strong empirical performance
  • Dedicated face restoration
  • Clear & configurable pipeline
  • Convincing qualitative results
  • Well-written

Weaknesses raised:

  • Double SR rationale unclear
  • Unfair SOTA comparisons
  • Need more ablations
  • Missing runtime info
  • Deraining/Dehazing justification
  • Face-paste artifact risk
  • Related work incomplete
  • Eval-metric influence unclear
  • Limited ISP-like validation

Ratings: R-APnX: 5, R-wkHw: 2, R-jm8i: 3, R-eAhT: 5

We appreciate the reviewers' acknowledgment that our work establishes a unified agentic framework for 4K super-resolution (SR) and their recognition of several strengths, including the unified and configurable agentic 4K SR system, a well-designed two-stage planner-executor pipeline, and strong performance across a broad multi-domain evaluation.

To address the reviewers' comments on limitations, we provided experiment results and additional clarifications in our rebuttal as follows:

  • Fair Real-World Benchmarking & Evaluation (R-wkHw): We strictly follow the experimental settings of recent SOTA generative SR methods (e.g., DiffBIR, OSEDiff, PiSA-SR) on the real-world SR datasets RealSR and DRealSR to ensure fair, side-by-side comparison. 4KAgent achieves superior perceptual metric scores.
  • Flexible SR Modes for Perception–Distortion Trade-off (R-wkHw): 4KAgent supports two configurable modes: fidelity mode and perception mode. This flexible design addresses the perception-distortion trade-off, allowing users to choose between faithful restoration (e.g., traditional SR) and creative enhancement (useful for AI-generated content and old photos); a configuration sketch follows this list.
  • Extensive Real-World Benchmarking (R-wkHw): We conducted comprehensive experiments across real-world SR benchmarks (RealSR, DRealSR) and a remote sensing dataset (WorldStrat) to support the superiority of 4KAgent in handling real-world image SR tasks.
  • Ablation Study on Restoration Plan vs. Tools (R-wkHw): We conducted an ablation study on DIV4K-50 using fixed plans with different SOTA tools. The results show that powerful tools achieve good quality, but a good plan in 4KAgent improves it further.
  • Potential Artifacts in Face Restoration Pipeline (R-jm8i, R-eAhT): We follow the ‘whole image restoration’ approach from prior face restoration methods, which can introduce boundary artifacts at a low frequency (<10%). Our user study shows 4KAgent has the lowest artifact rate among SOTA methods.
  • Computational Efficiency (R-wkHw, R-jm8i, R-eAhT): 4KAgent supports multi-GPU deployment (e.g., two NVIDIA RTX 4090 GPUs) by assigning different agents to separate GPUs to save memory. We provided a running time analysis in our rebuttal, and we introduced the Fast4K mode so users can control the running time of 4KAgent.
  • Super-resolution Scaling Factor Analysis (R-APnX): We validate that two-stage 4× → 4× SR outperforms direct 16× SR, explaining our choice with supporting experiments on the DIV4K-50 dataset.
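
As a rough illustration of the configurability mentioned above (fidelity vs. perception mode, Fast4K, face restoration), a user-facing profile might be expressed as follows; all field names here are hypothetical and do not reflect the released API:

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    scale: int = 4              # per-run upscale factor: 2, 4, 8, or 16
    mode: str = "perception"    # "fidelity" or "perception" (perception-distortion trade-off)
    face_restore: bool = False  # activate the dedicated face restoration pipeline
    fast4k: bool = False        # Fast4K mode: trade some quality for shorter runtime

# e.g., a configuration in the spirit of the ExpSRFR-s4-P profile from the ablation:
profile = AgentProfile(scale=4, mode="perception", face_restore=True)
```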

Most of the experimental results are from our submitted materials (main paper and appendix), and we presented these results again in the rebuttal with additional clarification.

Furthermore, we addressed additional questions raised by the reviewers:

  • Restoration Plans and Agendas (R-wkHw, R-eAhT): We provided concrete examples of restoration sequences for various images, illustrating how 4KAgent applies different restoration plans for different images.
  • Rain-Streak Removal Flexibility (R-wkHw): We emphasize that 4KAgent’s customizable profiles allow users to decide whether to apply rain-streak removal, ensuring restoration is tailored to the user's needs.
  • Image Description Usage (R-wkHw): We clarified that image descriptions are used for evaluation in the MoE module, not for modeling degradation.
  • MoE Winning Rates (R-wkHw, R-eAhT): We analyzed the winning rates of different SR models across diverse image domains, showing that the MoE policy effectively selects domain-suited methods (see the sketch after this list).
  • Expanded Related Work (R-jm8i): We provided detailed discussions of recent advanced SR methods and agentic image restoration frameworks in the Appendix.
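
To illustrate what the winning rates refer to, below is a minimal sketch of a quality-driven selection step: each candidate tool's output is scored by a set of IQA experts, and the best-scoring result "wins". The names are hypothetical, and the actual 4KAgent policy may combine metrics differently:

```python
from typing import Callable, Dict
import numpy as np

def select_best_output(
    image: np.ndarray,
    tools: Dict[str, Callable[[np.ndarray], np.ndarray]],
    iqa_experts: Dict[str, Callable[[np.ndarray], float]],
) -> str:
    # Run every candidate tool once, average the (higher-is-better) expert
    # scores of its output, and return the name of the winning tool.
    scores = {}
    for name, tool in tools.items():
        output = tool(image)
        scores[name] = float(np.mean([iqa(output) for iqa in iqa_experts.values()]))
    return max(scores, key=scores.get)
```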

We sincerely thank all reviewers and the area chair for their time, patience, and thoughtful feedback.

Final Decision

This paper presents the first attempt at an agentic framework that integrates tools and methods for multi-degradation real-world super-resolution into a unified, highly customizable system. After the rebuttal, 3 out of 4 reviewers are positive about the paper, with increased confidence (and an increased rating for 1 reviewer), as the authors' rebuttal appears to have effectively addressed the points they raised. Only 1 reviewer (wkHw) did not change their score, which remained negative; although they raised reasonable points for discussion, they failed to engage further following the authors' very detailed responses. Overall, the consensus on this paper seems rather positive, hence accept.