PaperHub
Overall rating: 8.7/10 · Spotlight · 4 reviewers
Individual ratings: 5, 5, 5, 6 (min 5, max 6, std 0.4)
Confidence: 3.5 · Novelty: 3.0 · Quality: 3.3 · Clarity: 3.5 · Significance: 3.3
NeurIPS 2025

GenColor: Generative and Expressive Color Enhancement with Pixel-Perfect Texture Preservation

OpenReview · PDF
Submitted: 2025-04-11 · Updated: 2025-10-29

Abstract

Keywords
color enhancement

Reviews and Discussion

Review
Rating: 5

This paper presents GenColor, a diffusion model-based framework for image color enhancement. By reformulating the color enhancement task as a conditional image generation problem and introducing a dedicated texture preservation network, it achieves expert-level retouching under complex lighting and content conditions while maintaining pixel-perfect texture fidelity. The authors also constructed ARTISAN-1M, a high-quality image enhancement dataset containing 1.2 million samples, and conducted both objective and subjective evaluations across multiple benchmark datasets.

Strengths and Weaknesses

Strengths

  • The methodology is well-designed, innovatively proposing a color enhancement framework integrating diffusion models and a texture-preserving color transfer network, which effectively improves image quality.
  • By simulating degradation processes to construct synthetic training data, the approach not only reduces reliance on real-world data but also significantly enhances the model’s generalization capability.
  • A large-scale, high-quality dataset, ARTISAN, was constructed, providing solid data support for training high-performance color enhancement models.
  • Comprehensive comparative experimental analyses are provided in the appendix and supplementary materials, demonstrating that the proposed method achieves superior performance across multiple benchmark datasets, exhibiting significant advantages in both objective evaluation metrics and subjective visual quality.

Weaknesses

  • Many of the existing comparison methods include lightweight models; however, the method proposed in this paper still has a relatively long runtime, which is one of its main drawbacks.

Questions

  • Regarding the weight mixing strategy, would it be possible to simply select an intermediate weight between the Mid and Late stages to achieve a comparable effect? In addition, please clarify how the mixing ratios are allocated in the current weight mixing strategy.

  • It is also recommended to include an ablation study where the generative module is removed, retaining only the texture preservation module and the global filter, in order to further validate the contribution of the generative module to the overall performance.

Limitations

Yes

Final Justification

The rebuttal has addressed my concerns. I maintain my rating as "Accept" and look forward to the authors releasing the dataset and relevant code.

Formatting Concerns

Line 72: to to -> to

Author Response

We thank this reviewer for the positive feedback and for highlighting the different strengths of our work. We address your comments/questions below, and will update our paper accordingly.

Weakness: the method proposed in this paper still has a relatively long runtime

Response: We agree that the inference speed is a crucial consideration for real-world applications. The computational cost of GenColor's three-stage pipeline is dominated by Phase 1, the iterative denoising process in our diffusion-based Color Generation Module. Our other modules—Phase 2 (Texture Preservation) and Phase 3 (Global Adjustment)—are single-pass networks and are generally very efficient.

Therefore, to speed up the overall runtime, we focus our analysis on the primary bottleneck. Prompted by your question, we have conducted a detailed analysis of the trade-off between the number of denoising steps in Phase 1 and the final enhancement quality. The full results on the Adobe5K dataset are presented below.

| Denoising Steps | Q-Align↑ | LAION↑ | LIQE↑ | NoR-VDP↑ |
|---|---|---|---|---|
| 1 | 2.75 | 4.86 | 1.91 | 68.75 |
| 2 | 2.84 | 4.92 | 2.01 | 68.95 |
| 3 | 3.88 | 5.42 | 3.30 | 69.09 |
| 5 | 4.17 | 5.67 | 3.61 | 69.20 |
| 10 | 4.28 | 5.80 | 3.74 | 69.19 |
| 15 | 4.29 | 5.82 | 3.75 | 69.18 |
| 20 | 4.29 | 5.83 | 3.76 | 69.15 |
| 25 | 4.29 | 5.83 | 3.76 | 69.12 |
| 30 (Our Paper) | 4.29 | 5.83 | 3.76 | 69.16 |

Key Finding: Our analysis shows that while the performance is low with very few steps (1-5), the quality rapidly converges. We can reduce the number of denoising steps from 30 down to 15—effectively halving the runtime of our most computationally intensive module—with no obvious degradation in quality across all key metrics; the performance at 15 steps is virtually identical to that at 30 steps. This offers a favorable speed-quality trade-off, making the method more practical for real-world applications.
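For readers who wish to reproduce this kind of sweep, the sketch below shows one way it could be scripted with an off-the-shelf ControlNet pipeline from the diffusers library; the checkpoint path is a placeholder and this is not the authors' released code.

```python
# Minimal sketch (not the authors' code): sweep the number of denoising steps
# of a ControlNet-conditioned Stable Diffusion pipeline and save each result
# for offline scoring with no-reference metrics (Q-Align, LIQE, ...).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "path/to/color-controlnet", torch_dtype=torch.float16  # placeholder checkpoint
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

cond = Image.open("input.jpg").convert("RGB")  # conditioning image for ControlNet

for steps in (1, 2, 3, 5, 10, 15, 20, 25, 30):
    out = pipe(prompt="", image=cond, num_inference_steps=steps,
               generator=torch.Generator("cuda").manual_seed(0)).images[0]
    out.save(f"enhanced_{steps:02d}_steps.png")  # score these outputs offline
```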

Future Work and Broader Context: Beyond this immediate optimization, the efficiency of GenColor can be further improved by incorporating techniques from the rapidly advancing field of fast diffusion sampling. Methods such as knowledge distillation (e.g., Progressive Distillation) or consistency models could potentially reduce the step count to as few as 1-4 steps while maintaining high quality, and we consider this as a promising direction for future work.

Question: Regarding the weight mixing strategy, would it be possible to simply select an intermediate weight between the Mid and Late stages to achieve a comparable effect? In addition, please clarify how the mixing ratios are allocated in the current weight mixing strategy.

Response: Our weight mixing strategy is a key component for achieving optimal results, and its design is based on empirical evidence from comparing both single checkpoints and blended weights.

We have conducted an ablation study to analyze the effect of different model checkpoints and their combinations. This includes the "mid" stage (high expressiveness), the "late" stage (high fidelity), a single checkpoint taken from between these two stages ("mid-late"), and our proposed 50/50 blend.

The results below demonstrate that a 50/50 blend of the 'mid' and 'late' checkpoints significantly outperforms any single checkpoint, including the intermediate one.

| Mid Weight | Mid-Late Weight | Late Weight | Q-Align↑ | LAION↑ | LIQE↑ | NoR-VDP↑ | C-VAR↑ |
|---|---|---|---|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 4.08 | 5.66 | 3.54 | 68.96 | 22.91 |
| 0.0 | 1.0 | 0.0 | 4.24 | 5.77 | 3.72 | 68.87 | 17.87 |
| 0.0 | 0.0 | 1.0 | 4.25 | 5.77 | 3.73 | 69.09 | 16.69 |
| 0.5 | 0.0 | 0.5 | 4.29 | 5.83 | 3.76 | 69.16 | 16.96 |

Key Finding: This data directly answers both parts of the question.

  1. Intermediate Checkpoint vs. Blend: Simply selecting a single intermediate checkpoint ("Mid-Late Weight") is suboptimal compared to our blending strategy across all key quality metrics (e.g., Q-Align 4.24 vs 4.29).
  2. Allocation and Optimality: Our final method uses a 50/50 blend (Mid Weight = 0.5, Late Weight = 0.5). This configuration achieves the highest scores on all four perceptual quality metrics (Q-Align, LAION, LIQE, NoR-VDP) while maintaining a strong C-VAR score. This makes the 50/50 blend a clear, data-driven choice for balancing aesthetic expressiveness with photorealistic quality.
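For concreteness, a 50/50 checkpoint blend of this kind can be realized by a simple parameter-wise average of the two saved state dicts. The sketch below is only an illustration under that assumption; the file names are placeholders, and the actual pipeline may blend only a subset of modules (e.g., the ControlNet weights).

```python
# Sketch only: parameter-wise 50/50 blend of the 'mid' and 'late' checkpoints.
# File names are placeholders; non-floating-point buffers are copied unchanged.
import torch

alpha = 0.5  # Mid weight = 0.5, Late weight = 0.5
mid = torch.load("checkpoint_mid.pt", map_location="cpu")
late = torch.load("checkpoint_late.pt", map_location="cpu")

blended = {
    k: alpha * mid[k] + (1.0 - alpha) * late[k] if mid[k].is_floating_point() else late[k]
    for k in mid
}
torch.save(blended, "checkpoint_blended.pt")
```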

Question: It is also recommended to include an ablation study where the generative module is removed...

Response: We agree that understanding the contribution of the generative module is essential. In fact, this experiment was included in our comprehensive ablation studies presented in Tables 14, 15, and 16 and visualized in Figure 19.

Specifically, a pipeline without the Color Generation module (C) would not have a color reference to transfer, making the Texture Preservation module (T) inapplicable. Therefore, the pipeline effectively reduces to just the final Global Filter module (G). The configuration labeled "G" (first row) in our ablation tables represents exactly this scenario. As the results show, removing the generative module leads to a dramatic drop in performance across all key metrics, especially expressiveness. For instance, in Table 14 (Adobe5K), the C-VAR score plummets from 16.96 (full model) to 10.35 (Global Filter only). This quantitatively confirms the critical and substantial contribution of our generative module.

Formatting: Line 72: to to -> to

Response: We thank this reviewer for the careful reading of our paper. We will correct this typo in our revision.

Comment

The rebuttal has addressed my concerns. I maintain my rating as "Accept" and look forward to the authors releasing the dataset and relevant code.

Comment

Thank you for your positive feedback and for confirming that our rebuttal has addressed your concerns. We are committed to releasing the ARTISAN dataset and our code and look forward to making them available to the research community.

Review
Rating: 5

This paper proposes a three-stage approach to image enhancement: (1) perform conditioned generation from a diffusion model to predict an enhanced version of the input image, (2) which is then corrected to preserve texture from the input image, (3) before a final global adjustment performed with standard global filters. Along with the method, authors also introduced a new dataset for image enhancement composed of 1.2M curated public images. By combining such a rich dataset and the presented method, they demonstrate state-of-the-art performance on two standard datasets, i.e. Adobe5K and PPR10K, compared with unsupervised, supervised and generative baselines on diverse metrics.

Strengths and Weaknesses

Strengths:

  • S1: The paper is well written and the contributions are properly motivated, with proper reference to previous work and their pros and cons.
  • S2: Leveraging diffusion models for image enhancement is a relevant direction, in particular as authors propose a way to address some of the challenges diffusion models face.
  • S3: Many qualitative examples are presented, and the interactive html page is particularly appreciated.

Weaknesses:

  • [Major] W1: The Color Generation Module expects a textual prompt as input. To the best of my knowledge, such text is not used (indeed Figure 12 in appendix shows the robustness of the generation to diverse prompts), but is rather a consequence of using Stable Diffusion as a part of the pipeline. Wouldn’t it be possible to have a generation module without textual modality? In addition, maybe such prompt should not be presented as a part of the method in Figure 4 as it is in fact ignored by the model.
  • [Major] W2: The need to generate a dataset to train the texture preservation model is not clear to me. Couldn't the authors train it on outputs from the Color Generation Module instead, as this would allow training on exactly the distribution of inputs that will be fed to the preservation module at inference time?
  • [Minor] W3: L81-82: the paper explains that the proposed method is more expressive than the human expert. Could authors elaborate a bit more on why this would be the case?

Questions

The following questions are related to mentioned weaknesses:

  • Q1: Could the authors leverage a diffusion model that does not require a textual prompt within the Color Generation Module? (See W1)
  • Q2: Could the texture preservation module be trained on a dataset generated from outputs of the Color Generation Module? (See W2)
  • Q3: What makes the proposed method more expressive than the human expert? (See W3)

Limitations

Yes

Final Justification

All other reviewers and myself were already suggesting acceptance before the rebuttal period. I raised a few concerns that were properly answered by the authors and am thus updating my rating to "Accept", conditioned on the integration of changes and additional discussions promised by authors in their rebuttal.

Formatting Concerns

No paper formatting concern

Author Response

We thank this reviewer for the positive feedback on our paper writing (S1), motivation (S1), and novelty (S2-S3). We address your comments/questions below, and will update our paper accordingly.

[Major] W1/Q1: The Color Generation Module expects a textual prompt as input... Wouldn’t it be possible to have a generation module without textual modality?

Response: We agree that the role of the text prompt was not clearly defined in our paper. Prompted by your question, we have conducted a rigorous ablation study to determine its true impact.

Our findings confirm your intuition: the text prompt is not a critical component. The model's performance is driven by the visual conditioning from ControlNet. As shown in the following new ablation results, using a null text prompt yields virtually identical performance to using BLIP-generated captions.

| Method | Q-Align↑ | LAION↑ | LIQE↑ | NoR-VDP↑ | C-VAR↑ |
|---|---|---|---|---|---|
| GenColor (w/ Null Text) | 4.29 | 5.83 | 3.76 | 69.15 | 16.98 |
| GenColor (w/ BLIP) | 4.29 | 5.83 | 3.76 | 69.16 | 16.96 |

These results demonstrate that the BLIP captioning step adds unnecessary complexity without obvious performance benefits. Therefore, to make our method cleaner and more accurately reflect its core mechanism, we have streamlined our pipeline.

Action:

  1. We will revise Figure 4 to remove the BLIP component, simplifying the diagram to show a fully automatic, visually-conditioned pipeline.
  2. We will update Section 4.2.1 to state that our method uses a null text prompt by default, clarifying that text is not a required input.
  3. The ablation table above will be added to the appendix to fully justify this improved design choice.

We thank the reviewer again for this valuable suggestion, which has helped us simplify and improve our model as well as the clarity of our paper.

[Major] W2/Q2: The need to generate a dataset to train the texture preservation model is not clear to me. Couldn’t authors train it on outputs from the Color Generation Module instead?

Response: This suggestion to train the texture preservation module using the outputs of our color generation module is interesting. However, this direct approach may not be feasible in our case due to the lack of ground truth, which is why we proposed a self-supervised training strategy to effectively address this limitation.

1. The Lack of GT Problem:

Supervised training of our texture network would require a target image that possesses both the exact colors of the diffusion-generated reference and the pixel-perfect texture of the original input. Such a ground truth image does not exist—it is, in fact, the desired output of the network itself. Without this target, a direct supervised training objective cannot be formulated.

2. Our Solution: Principled Self-Supervised Training Strategy:

To solve this, we designed a self-supervised task that precisely simulates the problem we want the network to solve.

  • Synthetic Data Generation: We start with a high-quality source image and synthetically degrade it to mimic the key artifacts of a diffusion output. This degradation specifically introduces texture inconsistencies alongside color shifts and minor spatial misalignments.
  • The Training Objective: The network is then tasked with reversing this specific degradation. It takes the degraded image as input and, using the original, high-quality image as a texture guide, learns to reconstruct the original's pristine texture while retaining the color profile of the degraded version.

This self-supervised strategy is a principled solution, which not only circumvents the "missing ground truth" issue but also provides two critical advantages:

  • Generality: It forces the network to learn the general skill of texture-aware color transfer, rather than overfitting to the specific artifacts of our single generator.
  • Training Stability: It provides a fixed, consistent training objective, avoiding the instability of learning from the "moving target" of a generator's evolving outputs.

In short, our self-supervised training strategy is a necessary and robust solution to an otherwise intractable problem.
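To make the training setup concrete, here is a toy sketch of how such degraded inputs could be synthesized from a clean source; the specific operations and magnitudes are illustrative guesses, not the paper's exact recipe.

```python
# Toy sketch (illustrative only): build a degraded "pseudo diffusion output"
# from a clean photo by combining a color/tone shift, texture loss, and a
# slight spatial misalignment. The clean photo serves as the texture guide.
import random
from PIL import Image, ImageEnhance, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    # 1) Global color/tone shift (stands in for the diffusion color change).
    img = ImageEnhance.Color(img).enhance(random.uniform(0.6, 1.6))
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
    # 2) Texture degradation: downscale-upscale plus blur removes fine detail.
    w, h = img.size
    s = random.uniform(0.25, 0.5)
    img = img.resize((int(w * s), int(h * s)), Image.BILINEAR).resize((w, h), Image.BILINEAR)
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    # 3) Minor spatial misalignment via a small random translation.
    dx, dy = random.randint(-4, 4), random.randint(-4, 4)
    return img.transform((w, h), Image.AFFINE, (1, 0, dx, 0, 1, dy), Image.BILINEAR)

clean = Image.open("source.jpg").convert("RGB")
degraded = degrade(clean)  # network input: color reference with corrupted texture
target = clean             # texture guide and reconstruction target
```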

[Minor] W3/Q3: L81-82: the paper explains that the proposed method is more expressive than the human expert. Could authors elaborate a bit more on why this would be the case?

Response: We thank the reviewer for this excellent question and for pointing us to this specific sentence. You are right to ask for more elaboration here, and we appreciate the opportunity to provide a more detailed explanation.

"It demonstrates how our approach achieves more expressive results than Human Expert C—a limitation inherent to the supervised learning methods, which have relied on Adobe5K due to the absence of fine-grained paired datasets..."

The comparison in this sentence is indeed very specific. It refers to 'Human Expert C' from the Adobe5K dataset, whose edits serve as a well-known example of a global-only adjustment paradigm. We are happy to elaborate below on why this particular comparison is meaningful for understanding our contribution.

1. The Inherent Limitation of Global Adjustments in Adobe5K

The Adobe5K dataset, while a cornerstone of image enhancement research, primarily features edits created using global adjustment layers in Adobe Lightroom. This means that a single set of parameters (e.g., for exposure, contrast, saturation) is applied uniformly across the entire image. This is analogous to applying one filter to the whole photograph.

  • Expert C's Constraint: "Expert C" is one of the five experts who provided such global retouches for the dataset. While their aesthetic judgment is professional, their method of interaction with the image is limited to these global tools. They cannot, for instance, brighten a subject's face without also brightening the background behind it.

2. GenColor's Expressiveness through Content-Aware Local Edits

GenColor operates on a fundamentally different principle. By leveraging a diffusion model (ControlNet) trained on our large-scale ARTISAN-1M dataset, it learns to perform spatially-varying, content-aware, and local adjustments.

  • Learning from Diverse Edits: The ARTISAN-1M dataset is not limited to global edits. It contains 1.2 million high-quality photographs curated for their professional aesthetic, which often includes sophisticated local adjustments (e.g., dodging and burning, selective color grading, local contrast enhancement).
  • Semantic Understanding: The diffusion model learns complex relationships between semantic content and color. This enables GenColor to understand that "sky" should be treated differently from "building" and "shadowed grass" within the same image.

This allows for a much higher degree of expressiveness. For example, in Figure 1, our method:

  • Selectively brightens the sunlit facade of the building, enhancing its golden tones.
  • Simultaneously deepens the blue of the sky to create a rich, vibrant contrast.
  • Intelligently lightens the shadowed areas on the lower right without washing out the entire scene.

An expert limited to global adjustments (like Expert C in the Adobe5K workflow) cannot achieve all these effects simultaneously. Increasing global saturation might make the sky bluer but would also oversaturate the building. Increasing global exposure would brighten the shadows but also blow out the highlights on the facade. GenColor's ability to apply these distinct, targeted edits within a single operation is what makes it fundamentally more expressive in this context.

This approach is necessitated by a fundamental challenge in the field: collecting paired data for fine-grained, local enhancements (e.g., images with before-and-after local adjustment masks) is prohibitively expensive and difficult to scale. Consequently, no large-scale dataset for supervised local enhancement exists, making it currently infeasible to train a model directly for this task. Our generative approach, trained on unpaired but high-quality images, cleverly circumvents this data bottleneck.

3. Supporting Evidence from Our Results

Our quantitative and qualitative results strongly support this claim:

  • Quantitative Metrics (Table 2): Our method surpasses Expert C on key perceptual metrics like C-VAR (Color Variance/Expressiveness), scoring 16.96 vs. Expert C's 10.91. This metric is specifically designed to measure fine-grained, diverse color transformations, and our superior score quantitatively demonstrates this higher expressiveness.
  • LLM Preference (Table 2b): When we asked a vision-language model to choose the most "visually beautiful" photo from a color perspective, it preferred GenColor's output over Expert C's 62% of the time. This suggests that the local, expressive adjustments provided by GenColor lead to more aesthetically pleasing results than the global-only edits.
  • Visual Evidence (Figure 1): The difference map for Expert C shows a uniform, global change, whereas the difference map for GenColor reveals highly selective, region-specific adjustments, visually confirming its advanced capability.

In summary, our claim is not that GenColor is superior to a human expert with an unrestricted toolset (e.g., Photoshop with masking). Rather, it is that GenColor's learned, local, content-aware enhancement capability is fundamentally more expressive and powerful than the global-only adjustment paradigm that defines the "Expert C" benchmark in Adobe5K.

We sincerely appreciate the reviewer's constructive suggestions and believe that the additional experiments, analysis, and explanations help significantly improve the quality of our submission.

Comment

I thank the authors for their answers to my different concerns. Removing the input text prompt (W1/Q1) will indeed make the method simpler and the paper clearer. Authors should update the main paper accordingly, as stated in their answer. They should also incorporate additional motivation about the interest of their self-supervised objective (W2/Q2) and elaborate more on their comparison with human artists (W3/Q3) as presented in their respective answers.

Comment

Dear Reviewer mrwF,

Thank you for your review and acknowledgment of our rebuttal. We will update the manuscript as requested, incorporating the points discussed. Specifically, we will:

  1. Revise Figure 4 and the accompanying text to remove the input text prompt, simplifying the method (W1/Q1).
  2. Incorporate the additional motivation regarding our self-supervised objective (W2/Q2).
  3. Elaborate on the comparison with human artists, using the details provided in the rebuttal (W3/Q3).

We appreciate your constructive feedback, which has helped improve the manuscript.

Review
Rating: 5

This paper presents a new no-reference fully automatic method for expressive color enhancement for photos. The proposed method starts with a new diffusion-based color generation module, trained on a new high quality dataset. The results from this module may contain artifacts, so a second new texture preservation module is in charge of transferring just the colors over the original textures at the original resolution. Finally, some global adjustments are applied to compensate for some contrast limitations of the first module. Results are of high quality, showing consistency and robustness with respect to some parameters in the pipeline. Evaluation is extremely thorough, analyzing almost every technical decision made.

Strengths and Weaknesses

  • Strengths:

  • New fully-automated method that generates high quality results according to many different metrics and human judgements.

  • New diffusion-based expressive color generation module.

  • New texture-preserving color transfer module.

  • New dataset of high quality photographic color edits, way larger than existing ones.

  • Extensive analysis of the method.

  • Weaknesses:

  • The method seems to be almost deterministic for a given input. This poses questions about user control, or what kind of data is required to capture personalized styles.

  • Figure/Table placement could be improved.

  • Previous methods trained on the proposed dataset seem to perform worse, raising questions about its general usefulness.

Questions

Related to the weaknesses listed before:

  • Figures 12 and 13 show the robustness and predictability of the method. But these, along with the way the ARTISAN dataset was collected and used for training, raise questions about user control and the ability of the method to learn specific styles through other carefully curated datasets. More comments on this would be helpful. Otherwise, a tool that is only able to generate a single type of style without allowing user control feels quite limited, especially given the kind of network models involved.

  • Table 1 shows methods [28] and [13] perform worse when trained over ARTISAN. Why is that? Could that mean that ARTISAN is more useful for approaches like the one presented in this paper? That would limit the impact of the dataset.

  • The evolution shown in Figure 14 is quite interesting. Could there be a way to force the model to stabilize in between 'mid' and 'late' through some additional losses, rather than the current proposal, which works but feels a bit ad-hoc?

Limitations

The authors discuss a couple of limitations in Appendix J that seem minor compared with the points I mentioned above. User control, personalization of styles, and the general usefulness of the ARTISAN dataset seem more crucial to the potential impact of the paper.

Final Justification

The discussion with the authors was very insightful. I recommended adding that to the paper. I was already recommending acceptance, and based on this and the other reviews and discussions I’m keeping it.

Formatting Concerns

Some Figures and Tables are quite far from the text that addresses them.

Author Response

We thank this reviewer for the positive and constructive comments on our work. We address your comments/questions below, and will update our paper accordingly.

Weakness/Question: The method seems to be almost deterministic... raise questions about user control, and the ability of the method to learn specific styles... a tool that is only able to generate a single type of style without allowing user control feels quite limited.

Response: We thank the reviewer for this insightful comment, which touches on the important distinction between automatic and user-guided enhancement.

Our primary focus is on the fully-automatic task, a grand challenge in this field that many previous works (e.g., 3D-LUT, DeepLPF) address. User studies, particularly with mobile phone users, consistently show that a powerful, one-click automatic solution is one of the most significant demands. The difficulty lies in creating an algorithm with enough expressive power to replicate expert-level aesthetics without any guidance. Our main contribution is in advancing the state of the art for this specific, highly challenging domain.

We view automatic enhancement and user control as largely orthogonal goals. Our work presents an effective solution for the former, providing a high-quality automatic baseline. This approach, however, is not at odds with user control; in fact, it can serve as an ideal starting point upon which such features can be built.

As the reviewer rightly suggests, extending our pipeline to be equipped with user control is a valuable direction for future work. For instance, our framework could be conditioned on user-provided examples or fine-tuned for specific styles. We will revise the paper to clarify this distinction and expand on these exciting future possibilities in the limitations section.

Weakness/Question: Table1 shows methods [28] and [13] perform worse when trained over ARTISAN. Why is that? Could that mean that ARTISAN is more useful for approaches like the one presented in this paper? That would limit the impact of the dataset.

Response: This is an excellent question that allows us to clarify the distinct value of ARTISAN. The reviewer's observation is correct, and it stems from a fundamental difference in the learning challenge posed by our dataset compared to previous datasets.

  1. Learning from a Stylistically Consistent, Curated Source: Earlier datasets like Adobe5K provide a valuable but constrained learning signal. Each of the five experts provides a high-quality but stylistically consistent set of retouched images. For a learning algorithm, the task becomes to model a mapping to one of these few, relatively uniform aesthetic distributions. This is a well-defined problem, and methods like [28] and [13] are well-suited for it. However, the resulting models are inherently tuned to these specific styles, which can limit their general expressiveness.
  2. Modeling a Vast, 'In-the-Wild' Distribution: ARTISAN introduces a different and more complex learning paradigm. With over 1.2 million images, it does not represent a handful of curated styles. Instead, it captures a massive, 'in-the-wild' distribution of what constitutes a high-quality photograph. The learning task is no longer to mimic a specific expert, but to model this incredibly complex distribution, which contains numerous distinct and equally valid aesthetic styles. For any given scene, a model must learn to produce a coherent and aesthetically pleasing result from a wide range of possibilities, which requires a significantly higher model capacity.

This shift in task complexity explains the performance difference. The architectures of prior models are not primarily designed to capture such a broad and varied data space. Their difficulty with ARTISAN is not an indictment of their capabilities, but rather a testament to the new level of challenge presented.

Weakness/Question: The evolution shown in Figure 14 is quite interesting. Could there be a way to force the model to stabilize in between 'mid' and 'late' through some additional losses, rather than the current proposal, which works but feels a bit ad-hoc?

Response: This reviewer proposes to design a new loss function to directly guide the model to an optimal state, rather than blending weights at inference. This is a very interesting suggestion.

Our current weight-blending method is a pragmatic and effective solution that we find to be robust and simple to implement. It reliably achieves the desired balance between the aesthetics of the 'mid' stage and the fidelity of the 'late' stage without altering the complex training dynamics. Exploring and properly tuning new loss functions would require a series of lengthy training runs, which unfortunately is not feasible within the limited rebuttal period. However, we believe this is a valuable path for future investigation.

We will add a brief discussion in Appendix B.4 to acknowledge this reviewer's suggestion as a promising direction for future research, while justifying our current method as a practical and effective choice.

Formatting: Some Figures and Tables are quite far from the text that addresses them.

Response: We thank this reviewer for the careful reading of our paper. We will revise the placement of all figures and tables in our revision to improve the paper's readability.

Comment

Thank you for your detailed replies. With respect to user control and ARTISAN 'in-the-wild' distribution, I didn't mean for the user to be able to control the output directly. I appreciate the need for fully-automatic deterministic results. My point is that the current method seems to produce almost deterministic results for a given input, and that comes from ARTISAN, which contains 'numerous distinct and equally valid aesthetic styles'. You don't have to solve it in this paper, but it could be interesting to briefly discuss this fact, and its implications when it comes to 1) Unlock these other aesthetic styles present in ARTISAN, so future work could build on top of this paper, aiming for more variety on the outputs. 2) Change the output of the current model, steering it towards other desired styles. What kind of dataset other than ARTISAN would be needed for that? Would it just need a smaller number of samples with more similar styles? How many of them? Would the model 'snap' into the different substyles without interpolating in between? Right now your model and the dataset seem tightly coupled, and that is fine. Just discuss this fact so future researchers are more aware of this.

Comment

Dear Reviewer vwzM,

Thank you for the insightful clarification and for prompting a deeper discussion on the nuances of learning from a diverse, "in-the-wild" dataset like ARTISAN. We appreciate you recognizing our focus on fully-automatic results while highlighting the crucial point regarding the apparent deterministic nature of our current output, despite the "numerous distinct and equally valid aesthetic styles" present in the training data.

We welcome the opportunity to elaborate on this phenomenon and its implications. We agree this discussion is valuable for future researchers building on our work, and we will incorporate these points into the revised manuscript (in the Limitations and Future Work section), with some detailed discussions included in the supplementary material due to space limitations.

The Paradox of Deterministic Output from Diverse Data

Your observation is accurate. The nearly deterministic output stems from the optimization dynamics of our current training objective.

While ARTISAN’s data distribution is highly diverse and multi-modal (representing many styles), the model is strongly conditioned on the input image via ControlNet. The training process is optimized to find the most probable, high-quality enhancement. This encourages the model to converge towards a high-probability density region within the distribution—often resembling a "mean" or dominant aesthetic mode. The current objective function prioritizes a robust, broadly appealing automatic solution rather than diversity of output.

Implications and Future Directions

As you rightly suggest, this realization opens up significant avenues for future work to build upon the GenColor foundation.

1. Unlocking the Aesthetic Variety within ARTISAN

To move beyond a single deterministic output and unlock the variety inherent in ARTISAN, the model needs mechanisms to navigate the different modes of the aesthetic distribution. Future work could explore:

  • Conditional Generation: The most promising direction is transitioning to conditional generation. By incorporating style embeddings or exemplar images during training, the model could learn to navigate the diverse aesthetic landscape within ARTISAN.
  • Disentanglement of Style Codes: Introducing explicit or latent style codes during training could disentangle different aesthetic modes. During inference, sampling different codes would yield diverse, yet equally valid, outputs for the same input.

2. Steering the Model Towards Desired Styles

The current tight coupling between GenColor and the dominant aesthetic of ARTISAN can be adapted. The strength of our approach is that the model has already learned the complex task of high-quality, content-aware enhancement, providing a robust foundation for specialization.

  • Dataset Requirements for Steering: To steer the model toward a specific style, another massive dataset is not required. As you hypothesized, a smaller, highly curated dataset exhibiting a consistent target style would be effective for fine-tuning.
  • Data Efficiency and Adaptation: Because the model is already well-initialized by ARTISAN, we anticipate this adaptation could be highly data-efficient, potentially requiring only a few dozen to a few thousand samples. Techniques like Low-Rank Adaptation (LoRA) could make this style specialization highly practical.

3. Model Behavior: "Snapping" vs. Interpolation

The question of whether the model would "snap" into different sub-styles or interpolate between them is excellent. We hypothesize that the behavior would depend on the implementation of the steering mechanism:

  • Interpolation: Given the continuous nature of the diffusion latent space, if style is controlled via continuous embeddings or adjustable weights (e.g., varying the influence of a LoRA or using classifier-free guidance), the model would likely be capable of smoothly interpolating between styles.
  • Snapping: If the steering is achieved by switching between distinct fine-tuned models, or if the fine-tuning signal for a new style is very strong and distinct from the ARTISAN mean, the output would likely "snap" to the new style (mode-switching).

Comment

Thank you for the detailed follow up. This should definitely be incorporated in the paper and supplementary material. I don't have any other questions.

Comment

Thank you for your thoughtful engagement throughout the review process. We are pleased that our responses have addressed your questions.

We will be sure to incorporate the detailed discussion into the final version of our paper and supplementary material, as you have suggested. Your insights have been invaluable in helping us strengthen the manuscript.

Review
Rating: 6

Color enhancement in digital photography remains a challenging task, requiring precise control, adaptability, and texture preservation. Existing methods often compromise one or more of these aspects. Authors present GenColor, a diffusion-based framework that reformulates enhancement as conditional image generation. By leveraging ControlNet and a custom training strategy, GenColor achieves high-quality, adaptable color transformations.

Strengths and Weaknesses

Strengths:

  1. Very good result quality
  2. Good methodology for result evaluation; it is important to use user studies for this type of task
  3. The main SotA methods have been used for comparison
  4. A new dataset is prepared

Overall, I see almost no weaknesses in this paper. The only thing I would propose is to show not only mean quality but also variance.

Questions

  1. Is it possible to speed up the proposed solution?
  2. Have you considered pair-wise image comparisons for more robust results?

Limitations

All the limitations are described in a clear manner.

Final Justification

I believe that this paper is pretty good. The authors spent an enormous amount of time to produce these results and provide excellent formatting and description. In my opinion, it is a good example for such a conference.

Formatting Concerns

No formatting concerns have been noticed

Author Response

We thank this reviewer for the positive feedback and the insightful comments/suggestions for our work. We address your comments/questions below, and will update our paper accordingly.

Weakness: The only thing I would propose is to show not only mean quality, but also variance.

Response: We agree that understanding the variance, in addition to the mean, provides a more complete picture of the consistency and reliability of our method.

1. Variance in Our User Study: We would like to clarify that for our user study results in Figure 20, we have already included error bars representing the standard deviation of the user ratings. These bars visually demonstrate the spread of opinions for each method, directly addressing the need for variance analysis in our subjective evaluation.

2. Variance Analysis for Objective Metrics: As suggested, we have computed the standard deviation for all objective metrics across the Adobe5K dataset. The full results are presented in the following table.

| Method | Std Q-Align↓ | Std LAION↓ | Std LIQE↓ | Std NoR-VDP↓ | Std C-VAR↓ |
|---|---|---|---|---|---|
| 3D-LUT | 0.43 | 0.91 | 0.91 | 4.07 | 6.81 |
| RSFNet | 0.43 | 0.91 | 0.91 | 3.84 | 6.61 |
| DeepLPF | 0.45 | 0.92 | 0.92 | 3.72 | 6.50 |
| ICELUT | 0.45 | 0.92 | 0.91 | 4.17 | 6.64 |
| D&R | 0.31 | 0.83 | 0.45 | 6.33 | 9.24 |
| D&R (ARTISAN) | 0.29 | 0.85 | 0.42 | 6.31 | 9.07 |
| Exposure | 0.45 | 0.92 | 0.85 | 4.58 | 8.63 |
| Exposure (ARTISAN) | 0.44 | 0.91 | 0.84 | 4.67 | 8.52 |
| GenColor (Ours) | 0.43 | 0.93 | 0.84 | 3.45 | 8.18 |

Key Insight from the Results: The above results reveal an important finding. For most metrics (Q-Align, LAION, LIQE), the standard deviations across all top-performing methods, including ours, are highly comparable. This indicates that these methods exhibit a similar level of performance consistency across the metrics. While some methods like D&R show a lower variance on certain metrics, they have a significantly lower mean performance. In contrast, GenColor not only achieves the highest mean scores but does so with a variance that is on par with, or even lower than (e.g., NoR-VDP) other leading approaches.

Question: Is it possible to speed up proposed solution?

Response: This is an interesting question/suggestion. The computational cost of GenColor's three-stage pipeline is dominated by Phase 1, the iterative denoising process in our diffusion-based Color Generation Module. Our other modules—Phase 2 (Texture Preservation) and Phase 3 (Global Adjustment)—are single-pass networks and are generally very efficient.

Therefore, to speed up the overall runtime, we focus our analysis on the primary bottleneck. Prompted by your question, we have conducted a detailed analysis of the trade-off between the number of denoising steps in Phase 1 and the final enhancement quality. The full results on the Adobe5K dataset are presented below.

| Denoising Steps | Q-Align↑ | LAION↑ | LIQE↑ | NoR-VDP↑ |
|---|---|---|---|---|
| 1 | 2.75 | 4.86 | 1.91 | 68.75 |
| 2 | 2.84 | 4.92 | 2.01 | 68.95 |
| 3 | 3.88 | 5.42 | 3.30 | 69.09 |
| 5 | 4.17 | 5.67 | 3.61 | 69.20 |
| 10 | 4.28 | 5.80 | 3.74 | 69.19 |
| 15 | 4.29 | 5.82 | 3.75 | 69.18 |
| 20 | 4.29 | 5.83 | 3.76 | 69.15 |
| 25 | 4.29 | 5.83 | 3.76 | 69.12 |
| 30 (Our Paper) | 4.29 | 5.83 | 3.76 | 69.16 |

Key Finding: Our analysis shows that while the performance is low with very few steps (1-5), the quality rapidly converges. We can reduce the number of denoising steps from 30 down to 15—effectively halving the runtime of our most computationally intensive module—with no obvious degradation in quality across all key metrics; the performance at 15 steps is virtually identical to that at 30 steps. This offers a favorable speed-quality trade-off, making the method more practical for real-world applications.

Future Work and Broader Context: Beyond this immediate optimization, the efficiency of GenColor can be further improved by incorporating techniques from the rapidly advancing field of fast diffusion sampling. Methods such as knowledge distillation (e.g., Progressive Distillation) or consistency models could potentially reduce the step count to as few as 1-4 steps while maintaining high quality, and we consider this as a promising direction for future work.

Question: Have you considered pair-wise image comparisons for more robust result?

Response: We thank the reviewer for this insightful suggestion. We would like to point out that our original evaluation framework was already designed around this core principle, and we describe below how we further strengthen it with a new study inspired by your suggestion.

1. Our Existing Framework was Built on Comparative Judgment:

Our original evaluation already employed two distinct methods to encourage or enforce comparison:

  • Side-by-Side Comparative Rating (User Study): While our primary user study collected ratings on a 5-point scale, it was designed as a comparative task. For each source image, we presented the outputs from all 7 competing methods simultaneously on a single screen. This side-by-side presentation naturally forces participants to make relative assessments before assigning a score to any individual image. This design effectively mitigates the noise of isolated, absolute scoring and produces Mean Opinion Scores (MOS) that reflect a strong comparative user preference.
  • Forced-Choice LLM Evaluation: To complement this, our evaluation with GPT was a strict three-alternative forced-choice (3AFC) task. The model was required to select the single best result from a set including our method, the original, and the high-quality Expert C baseline. GenColor's definitive win in this direct comparison provides strong, preference-based evidence that corroborates our user study.

2. New Randomized Pairwise Study for Further Validation:

To provide even more direct evidence and fully embrace your suggestion, we have conducted an additional, dedicated randomized pairwise comparison (2AFC) experiment.

  • Methodology: We have recruited 25 participants, and each completed 20 comparison trials. In each trial, the system randomly selects two of the seven methods for a head-to-head comparison on a given image. Participants then choose the more visually appealing result. This process produces a total of 500 pairwise judgments.
  • Results: We have analyzed the data using the Bradley-Terry model to derive a preference score for each method. The results, shown below, indicate that GenColor achieved the highest preference score, significantly outperforming all baselines.

| Rank | Method | Bradley-Terry Score↑ |
|---|---|---|
| 1 | GenColor (Ours) | 100.00 |
| 2 | RSFNet | 72.10 |
| 3 | DeepLPF | 67.07 |
| 4 | ICELUT | 58.00 |
| 5 | 3D-LUT | 52.31 |
| 6 | Exposure | 25.92 |
| 7 | D&R | 17.18 |

Crucially, the final ranking of all methods from this new pairwise study is highly consistent with the ranking derived from our original side-by-side rating study. This agreement between two different comparative methodologies provides strong support for our conclusions.
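For reference, preference scores like those in the table above can be derived from raw pairwise judgments with a simple maximum-likelihood Bradley-Terry fit. The sketch below uses random dummy counts in place of the actual study data.

```python
# Sketch only: fit Bradley-Terry strengths from pairwise win counts by maximum
# likelihood. wins[i, j] = number of times method i was preferred over method j.
# The counts below are random placeholders, not the actual study judgments.
import numpy as np
from scipy.optimize import minimize

methods = ["GenColor", "RSFNet", "DeepLPF", "ICELUT", "3D-LUT", "Exposure", "D&R"]
rng = np.random.default_rng(0)
wins = rng.integers(0, 15, size=(7, 7)).astype(float)  # replace with real counts
np.fill_diagonal(wins, 0.0)

def nll(s):
    # P(i beats j) = exp(s_i) / (exp(s_i) + exp(s_j)); the small L2 term fixes
    # the additive degree of freedom in the log-strengths s.
    diff = s[:, None] - s[None, :]
    return np.sum(wins * np.log1p(np.exp(-diff))) + 1e-4 * np.sum(s ** 2)

res = minimize(nll, np.zeros(len(methods)), method="BFGS")
scores = np.exp(res.x)
scores = 100.0 * scores / scores.max()  # rescale so the top method is 100
for name, score in sorted(zip(methods, scores), key=lambda t: -t[1]):
    print(f"{name:10s} {score:6.2f}")
```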

Final Decision

This paper introduces GenColor, a three-stage diffusion-based framework for expressive and texture-preserving color enhancement, trained on a new 1.2M-image dataset (ARTISAN-1M). The approach integrates a diffusion-based generation module, a texture-preserving transfer network, and a global adjustment step, achieving state-of-the-art results on standard benchmarks and producing outputs comparable to expert edits. Code, dataset, and supplementary demos are promised to be released. The paper was reviewed by four reviewers who recommended its acceptance.

Strengths: High-quality results, innovative methodology combining diffusion with texture transfer, valuable dataset contribution, and thorough evaluation (metrics, user studies, ablations). The paper is clearly written with transparent limitations.

Weaknesses: Limited user control/personalization, questions about ARTISAN’s generality, reliance on unused text prompts, and an ad hoc global adjustment stage. Runtime is longer than lightweight baselines.

Rebuttal: Strongly addressed runtime (showed quality maintained with fewer diffusion steps), clarified weight mixing superiority, and confirmed necessity of the generative module via ablations. Broader concerns (personalization, dataset generality) remain but are secondary.

Recommendation: Spotlight accept. The reviewers and I agree that this is a technically strong paper with broad interest and practical impact.