UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
Abstract
Reviews and Discussion
This paper proposes a method to change the lighting of a pre-specified foreground subject within a still image or a video. There are two steps. The first is generating a dataset of relit videos, which they call LumosData; they constructed that dataset from Panda70M, yielding 110K relit video pairs. It's not clear (please clarify) whether that dataset will be made public; line 674 states, "We do not release any data or model that has high risks for misuse". An image relighting dataset certainly could have a risk of misuse, so does that mean they won't release the dataset, or that they don't feel it has a risk of misuse? The second step: they run a latent diffusion process with an innovation, a "physics-plausible" feedback mechanism, to encourage physically realistic illumination.
Strengths and Weaknesses
Strengths: The results look nice, and the comparison of inference times against related algorithms is impressive.
Weaknesses: We see just a few examples, even including the appendix and supplemental materials. It would be nice to see visually the effect of ablations regarding the claimed essential contribution of the paper, the physics-plausible feedback mechanism. What do the re-rendered images look like when it is turned off?
Some parts of the paper make it seem that the paper wasn't proofread: e.g., the two successive sentences in lines 55-59.
Is the physics-aware feedback module indeed aware of the physics? Do you have justification for calling it physics-aware? I feel like I'm missing various details from the paper. What is the precise form of the Feedback listed in Figure 3? Will you release the LumosData? Will you release the code?
I have concerns about the reproducibility and clarity of the paper.
Please expand on lines 217 - 222.
Questions
See questions within Strengths and weaknesses
Limitations
Authors didn't address possible concerns about faking videos or placing someone in a place where they never were.
Final Justification
Authors addressed concerns about the paper writing and the physics-aware module.
Formatting Issues
none
Thanks for your comments, and we address them below.
Q1 We see just a few examples, even including the appendix and supplemental materials. It would be nice to see visually the effect of ablations regarding the claimed essential contribution of the paper, the physics-plausible feedback mechanism. What do the re-rendered images look like when it is turned off?
A1 We have added Fig. 5 to the revised manuscript to visually illustrate the effect of removing the physics-plausible feedback. Compared to the full UniLumos model, the variant without feedback produces flatter lighting with weaker shading and less realistic highlights. With feedback, the results show more coherent lighting aligned with subject geometry, including softer shadows and improved depth cues. This visual comparison highlights the impact of the feedback mechanism and supports its effectiveness as a core component of our method.
Q2 Some parts of the paper make it seem that the paper wasn't proofread: eg, the two successive sentences in lines 55-59.
A2 We have removed the redundant sentences and thoroughly proofread the manuscript to improve clarity throughout.
Q3: Is the physics-aware feedback module indeed aware of the physics? Do you have justification for calling it physics aware?
A3 Yes, we use the term “physics-plausible” to indicate that the model is explicitly guided by scene geometry—rather than simulating full physical light transport. The feedback loss (L_phy) enforces consistency between depth and normal maps extracted from the output and a reference image. This encourages lighting effects that respect 3D structure, helping prevent artifacts like misaligned shadows.
Q4 What is the precise form of the Feedback listed in Figure 3?
A4 The “Feedback” in Fig. 3 refers to the physics-plausible loss (L_phy). It compares depth and normal maps extracted from the model output (via a frozen estimator) to those from the reference, using normalized L2 loss.
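For concreteness, below is a minimal sketch of how such a feedback loss could be computed. The names `depth_net` and `normal_net` are placeholders for the frozen single-step estimators mentioned above, and the per-map standardization is our reading of "normalized L2"; this is an illustrative sketch, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def physics_feedback_loss(pred_rgb, ref_rgb, depth_net, normal_net, eps=1e-6):
    """L_phy-style feedback: compare depth/normal maps extracted from the
    model output and the reference with a normalized L2 distance.
    depth_net / normal_net stand in for frozen single-step estimators
    returning (B, 1, H, W) and (B, 3, H, W) maps."""
    with torch.no_grad():                 # the reference branch carries no gradient
        d_ref = depth_net(ref_rgb)
        n_ref = normal_net(ref_rgb)
    d_pred = depth_net(pred_rgb)          # gradients flow through the output branch
    n_pred = normal_net(pred_rgb)

    def normalized_l2(a, b):
        # standardize each map spatially before the L2 distance
        a = (a - a.mean(dim=(-2, -1), keepdim=True)) / (a.std(dim=(-2, -1), keepdim=True) + eps)
        b = (b - b.mean(dim=(-2, -1), keepdim=True)) / (b.std(dim=(-2, -1), keepdim=True) + eps)
        return F.mse_loss(a, b)

    return normalized_l2(d_pred, d_ref) + normalized_l2(n_pred, n_ref)
```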
Q5 Will you release the code?
A5 Yes, we will release the full source code and the LumosBench benchmark.
Q6 What is LumosBench?
A6 Existing relighting benchmarks often evaluate lighting holistically, lacking the granularity to assess controllability over specific illumination attributes. This makes it difficult to diagnose model behavior or understand where control fails.
To address this, we propose LumosBench, a structured benchmark targeting six core lighting attributes from our annotation protocol: direction, source type, intensity, color temperature, temporal dynamics, and optical effects. Each of the 2,000 test prompts isolates a single attribute (e.g., front vs. back light), while keeping others fixed.
We use the vision-language model Qwen2.5-VL to assess whether the intended attribute is correctly expressed in the output. This disentangled evaluation allows precise measurement of lighting controllability, with scores reported per dimension and averaged for overall performance (see the table below).
Table: Quantitative comparison of attribute-level controllability. Bold numbers indicate the best performance.
| Model | #Params | Direction | Light Source Type | Intensity | Color Temperature | Temporal Dynamics | Optical Phenomena | Avg. Score |
|---|---|---|---|---|---|---|---|---|
| General Models | ||||||||
| LTX-Video | 1.9B | 0.794 | 0.644 | 0.487 | 0.708 | 0.487 | 0.403 | 0.587 |
| CogVideoX | 5.6B | 0.837 | 0.692 | 0.552 | 0.739 | 0.532 | 0.449 | 0.634 |
| HunyuanVideo | 13B | 0.863 | 0.741 | 0.599 | 0.802 | 0.655 | 0.481 | 0.690 |
| Wan2.1 | 1.3B | 0.842 | 0.685 | 0.436 | 0.741 | 0.504 | 0.433 | 0.607 |
| Wan2.1 | 14B | 0.871 | 0.794 | 0.674 | 0.829 | 0.737 | 0.505 | 0.735 |
| Specialized Models | ||||||||
| IC-Light Per-Frame | 0.9B | 0.793 | 0.547 | 0.349 | 0.493 | 0.284 | 0.339 | 0.468 |
| Light-A-Video + CogVideoX | 2.9B | 0.787 | 0.581 | 0.327 | 0.536 | 0.493 | 0.373 | 0.516 |
| Light-A-Video + Wan2.1 | 2.2B | 0.801 | 0.603 | 0.361 | 0.582 | 0.557 | 0.412 | 0.553 |
| UniLumos w/o lumos captions | 1.3B | 0.868 | 0.774 | 0.529 | 0.798 | 0.543 | 0.457 | 0.662 |
| UniLumos | 1.3B | 0.893 | 0.847 | 0.832 | 0.813 | 0.662 | 0.592 | 0.773 |
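To make the LumosBench protocol above concrete, here is a minimal sketch of attribute-level scoring with a VLM judge. The `query_vlm` wrapper (assumed to call Qwen2.5-VL and answer "yes" or "no"), the sample format, and the prompt wording are illustrative placeholders rather than the benchmark's actual implementation.

```python
ATTRIBUTES = ["direction", "source_type", "intensity",
              "color_temperature", "temporal_dynamics", "optical_effects"]

def score_lumosbench(samples, query_vlm):
    """samples: list of dicts with keys 'frames', 'attribute', 'target'
    (e.g. attribute='direction', target='front light').
    query_vlm(frames, question) is an assumed wrapper around Qwen2.5-VL."""
    per_attr = {a: [] for a in ATTRIBUTES}
    for s in samples:
        question = (f"Does the lighting in this video match the following "
                    f"{s['attribute'].replace('_', ' ')}: {s['target']}? Answer yes or no.")
        verdict = query_vlm(s["frames"], question).strip().lower()
        per_attr[s["attribute"]].append(1.0 if verdict.startswith("yes") else 0.0)
    scores = {a: sum(v) / len(v) for a, v in per_attr.items() if v}
    scores["avg"] = sum(scores.values()) / len(scores)  # average over attribute dimensions
    return scores
```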
Q7 Please expand on lines 217 - 222.
A7 To balance physical supervision and training efficiency, we adopt a selective optimization strategy inspired by path consistency scheduling [1]. In each training iteration, we divide the batch based on supervision type, following an 80/20 split to avoid prohibitive costs from full supervision while still maintaining effective learning signals. 20% of each batch is allocated to compute the path consistency loss L_{fast}, which involves three forward passes and one backward pass to enforce consistency across timesteps. The remaining 80% is used for the standard flow-matching loss L_0, with 50% of these samples further supervised using RGB-space geometry feedback via L_phy (i.e., depth and normal alignment). This probabilistic scheduling ensures high training throughput while allowing the model to benefit from multi-level supervision. To further enhance illumination diversity during training, we apply randomized lighting augmentations on the degraded subject V_deg, which introduces realistic lighting variability without the need for explicitly paired captures.
[1] One step diffusion via shortcut models, arXiv:2410.12557.
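The following is a minimal sketch of the selective-supervision split described above. The loss helpers (`compute_fast_loss`, `compute_flow_loss`, `compute_phy_loss`) and the `"video"` batch key are hypothetical placeholders; only the 80/20 split and the 50% physics-feedback fraction come from the answer above.

```python
import torch

def batch_slice(batch, lo, hi):
    # helper: slice every tensor field of a batch dict along the batch dimension
    return {k: (v[lo:hi] if torch.is_tensor(v) else v) for k, v in batch.items()}

def training_step(model, batch, compute_fast_loss, compute_flow_loss, compute_phy_loss,
                  p_fast=0.2, p_phy=0.5):
    """Selective supervision: ~20% of the batch gets the path-consistency loss
    L_fast, the rest gets the flow-matching loss L_0, and half of that subset
    additionally gets the RGB-space geometry feedback L_phy."""
    b = batch["video"].shape[0]
    n_fast = max(1, round(p_fast * b))

    fast_part = batch_slice(batch, 0, n_fast)
    flow_part = batch_slice(batch, n_fast, b)

    loss = compute_fast_loss(model, fast_part)          # consistency across timesteps
    loss = loss + compute_flow_loss(model, flow_part)   # standard flow matching (L_0)

    n_phy = round(p_phy * (b - n_fast))                 # geometry feedback on half the L_0 samples
    if n_phy > 0:
        loss = loss + compute_phy_loss(model, batch_slice(flow_part, 0, n_phy))
    return loss
```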
We thank the reviewer for acknowledging the strong visual results and the inference speed of our method. In this rebuttal, we provide new qualitative results showing that removing our physics-plausible feedback leads to flatter lighting and weaker shading, highlighting its visual impact (Q1). We clarify that this feedback enforces alignment between predicted and reference normals/depths, making it physically grounded though not a full light transport simulation (Q3, Q4). We have corrected clarity issues (Q2), detailed our selective supervision strategy for efficient training (Q7), and explained the structure and purpose of LumosBench (Q6). Regarding reproducibility, we confirm that the full code and benchmark will be released (Q5).
Overall, UniLumos introduces (1) a geometry-guided relighting framework with efficient few-step training, (2) a structured annotation and evaluation suite for disentangled lighting control, and (3) strong empirical performance with 20x speedup, delivering practical and controllable relighting under minimal input assumptions. We hope our clarifications address your concerns.
This paper proposes an image and video relighting method. The method is built upon a video diffusion Wan 2.1 and extends it to handle lighting. It constructs a new dataset for training such a model. The method introduces a shortcut-based consistency loss that enables faster inference with variable step sizes. Additionally, it leverages single-step diffusion models for depth and normal prediction to provide geometry-aware supervision before and after relighting. It achieves temporally consistent, physically plausible, and efficient relighting in videos.
Strengths and Weaknesses
Strengths:
- Quality: The relighting results are visually compelling, with strong temporal consistency across frames and realistic lighting effects. The proposed method demonstrates solid performance, particularly in portrait scenarios.
- Clarity: The technical exposition is generally sound, and the paper provides sufficient implementation details for reproducibility. The pipeline is well-structured and logically motivated.
- Significance: The paper tackles a relatively unexplored task of lighting control in video diffusion models. The direction, ideas, and results are promising and the problem is quite interesting.
- Originality: The paper uses single-step diffusion models for additional depth/normal supervision and a shortcut-based consistency loss for efficient inference. These elements contribute to faster and geometry-aware relighting.
Weaknesses:
- Quality: One major concern with the paper is that it misses a lot of related work. Despite the paper’s focus on scene relighting, the method largely targets portrait relighting. It would be better to cite and discuss other important methods. Major ones include, but are not limited to:
- a. Total Relighting: Learning to Relight Portraits for Background Replacement, SIGGRAPH 21.
- b. Lumos: Learning to Relight Portrait Images via a Virtual Light Stage and Synthetic-to-Real Adaptation, SIGGRAPH Asia 22.
- c. Neural Video Portrait Relighting in Real-time via Consistency Modeling, ICCV 21.
- d. Real-time 3D-aware Portrait Video Relighting, CVPR 24.
Furthermore, while the geometric supervision via normals and depth is helpful, it can compromise appearance fidelity in some cases. For instance, in Figure 4 (right), the appearance of the cup deviates noticeably from the input and other methods, indicating a drop in visual consistency.
- Clarity: The manuscript would benefit from thorough proofreading. The abstract contains grammatical issues (e.g., the phrase “…where controlling illumination under ensures physical consistency…” is unclear). Other minor issues include:
- a. Section title "1. Introudction" should be "1. Introduction"
- b. References [2] and [3] are duplicated
- Significance: The results are shown primarily on portrait videos with limited motion, which constrains the method’s applicability to more diverse or dynamic real-world video scenarios.
- Originality: The use of degraded data as input closely resembles prior work such as RelightVid. Similarly, treating single images as one-frame (T=1) videos is a straightforward extension within the video diffusion framework. The major contributions lie in path consistency loss for faster inference and using single-step pretrained diffusion models for extra supervision. As a result, the overall novelty of the approach is somewhat limited.
Questions
- In Fig.4 (left) and Fig.3 (Appendix), the input video clips appear to depict the same scene under different lighting conditions. However, there seems to be a slight difference in scale, particularly noticeable in the bottom-left corner where the three circles appear in Fig.4. Could the authors clarify if any cropping or resizing was applied, and whether this affects evaluation?
- The dataset includes optical attributes such as “Transmission (Glass)”, yet the paper does not show examples involving such materials. Given that the visual results primarily focus on portrait relighting, could the authors comment on the performance of the method when applied to scenes involving transparent or refractive materials?
- Have the authors evaluated how the method performs across a diverse range of subjects, including individuals with varying skin tones and non-human objects? This would help clarify the model’s generalization ability beyond the portrait-centric examples shown.
Limitations
yes
Final Justification
The rebuttal and discussion have addressed most of my concerns; therefore, I would like to raise my rating, under the authors' promise that the revised version should reflect the response during this discussion period.
Formatting Issues
None.
Thanks for your comments, and we address them below.
Q1 One major concern with the paper is that it misses a lot of related work. Despite the paper’s focus on scene relighting, the method largely targets portrait relighting. It is better to cite and discuss other important methods. Below are major ones but not limited to:
A1 We have revised the Related Work to include and discuss key portrait relighting methods such as Total Relighting (SIGGRAPH '21), Lumos (SIGGRAPH Asia '22), Neural Video Portrait Relighting (ICCV '21), and Real-time 3D-aware Portrait Video Relighting (CVPR '24). While these works often rely on portrait-specific priors like 3DMMs, our method is designed as a general-purpose relighting framework applicable to both images and videos without being limited to a specific object category.
Q2 Furthermore, while the geometric supervision via normals and depth is helpful, it can compromise appearance fidelity in some cases. For instance, in Figure 4 (right), the appearance of the cup deviates noticeably from the input and other methods, indicating a drop in visual consistency.
A2 There is an inherent trade-off between geometric consistency and appearance fidelity. In Fig. 4 (right), the model prioritizes physically plausible lighting, which can lead to deviations in secondary objects like the cup near subject boundaries. This results from the multi-objective loss balancing fidelity L_0 and physical realism L_{phy}. Removing geometric feedback nearly doubles the geometric error (0.147->0.297), confirming its impact.
To reduce appearance shifts, we adopt a two-stage training strategy: the model is first trained with L_0 only, then fine-tuned with the full loss. This improves appearance consistency while preserving physical plausibility, which can reduce visual artifacts by 19% compared to joint training from scratch.
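As an illustration of this two-stage recipe, the sketch below switches from fidelity-only training to the full objective after a fixed number of steps. The step counts, the loss weight `lambda_phy`, and the `flow_loss` / `phy_loss` helpers are illustrative placeholders, not our actual hyperparameters or code.

```python
from itertools import cycle

def train_two_stage(model, loader, optimizer, flow_loss, phy_loss,
                    stage1_steps=10_000, stage2_steps=5_000, lambda_phy=0.1):
    """Two-stage schedule: fidelity-only training with L_0 first, then
    fine-tuning with the full objective including L_phy."""
    for step, batch in enumerate(cycle(loader)):        # cycle the loader across both stages
        loss = flow_loss(model, batch)                  # stage 1: appearance fidelity (L_0)
        if step >= stage1_steps:                        # stage 2: add physics feedback (L_phy)
            loss = loss + lambda_phy * phy_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= stage1_steps + stage2_steps:
            return model
```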
Q3 The manuscript would benefit from thorough proofreading. The abstract contains grammatical issues (e.g., the phrase “…where controlling illumination under ensures physical consistency…” is unclear). Other minor issues include:
A3 We have rewritten the abstract for clarity, corrected the typo in the “Introduction” heading, and removed the duplicate reference. A full proofread of the manuscript will be completed before submission.
Q4 The results are shown primarily on portrait videos with limited motion, which constrains the method’s applicability to more diverse or dynamic real-world video scenarios.
A4 While some examples focus on portraits, our method is designed as a general-purpose relighting framework. The model is trained on diverse videos from Panda70M, including objects, animals, and dynamic scenes—not limited to human subjects. Figures in the main paper and appendix (e.g., beach and parking lot scenes) already reflect this variety. We will add additional qualitative results on non-human, non-portrait scenarios in the appendix to further demonstrate applicability to general video relighting beyond portrait settings.
Q5 Originality: The use of degraded data as input closely resembles prior work such as RelightVid. Similarly, treating single images as one-frame (T=1) videos is a straightforward extension within the video diffusion framework. The major contributions lie in path consistency loss for faster inference and using single-step pretrained diffusion models for extra supervision. As a result, the overall novelty of the approach is somewhat limited.
A5 While we adopt a degraded-to-real training setup similar to RelightVid, our approach differs in both supervision design and evaluation framework. Rather than relying on latent-space consistency, we introduce explicit RGB-space physical feedback via a differentiable loss based on estimated normals and depth, enforcing alignment between lighting effects and scene geometry. This physically grounded signal is absent in prior work. In addition, our LumosData pipeline and LumosBench benchmark enable structured, attribute-level control and evaluation across a six-dimensional lighting space. Together, these contributions target controllability and physical realism in a unified framework, going beyond prior single-object or latent-only approaches.
Q6 In Fig.4 (left) and Fig.3 (Appendix), the input video clips appear to depict the same scene under different lighting conditions. However, there seems to be a slight difference in scale, particularly noticeable in the bottom-left corner where the three circles in Fig.4. Could the authors clarify if any cropping or resizing was applied, and whether this affects evaluation?
A6 The two clips are from the same source video, shown under different lighting to illustrate relighting behavior. The slight scale mismatch is a rendering artifact from figure layout and does not affect evaluation—models were trained and tested on standardized-resolution videos (e.g., 832x480) without cropping. To avoid confusion, we will update the final version to use different source videos in these figures.
Q7 The dataset includes optical attributes such as “Transmission (Glass)”, yet the paper does not show examples involving such materials. Given that the visual results primarily focus on portrait relighting, could the authors comment on the performance of the method when applied to scenes involving transparent or refractive materials?
A7 Although “Transmission (Glass)” is part of our annotation protocol, such scenes are rare in the dataset, and our current model is not designed to explicitly handle transparent or refractive materials. As an initial reference, we will add qualitative examples involving glass-like materials in the appendix, along with a brief discussion of current limitations and directions for future work.
Q8 Have the authors evaluated how the method performs across a diverse range of subjects, including individuals with varying skin tones and non-human objects? This would help clarify the model’s generalization ability beyond the portrait-centric examples shown.
A8 Yes, as noted in our response to Q4, we will include additional results on non-human objects in the appendix to demonstrate generalization beyond portraits. Regarding skin tone diversity, our model is trained on Panda70M, which includes a broad range of human appearances. Qualitatively, we have not observed performance degradation across skin tones. However, we have not performed a formal fairness analysis with disaggregated metrics.
We thank the reviewer for highlighting the visual quality, temporal consistency, and efficient design of UniLumos, as well as its promise in addressing video relighting with controllable lighting. In this rebuttal, we have addressed concerns regarding missing related work by incorporating and discussing prior portrait relighting methods (Q1), and clarified our framework’s broader applicability beyond portraits, with additional qualitative results on non-human and dynamic scenes (Q4, Q8). We also analyzed the trade-off between geometric supervision and appearance fidelity (Q2), and explained our two-stage training strategy to mitigate artifacts. Clarifications were made regarding scale inconsistencies (Q6), transparent materials (Q7), and textual issues (Q3).
Overall, UniLumos introduces (1) a unified relighting framework that leverages physics-guided geometry feedback for controllable lighting, (2) a structured annotation and evaluation benchmark for disentangled illumination control, and (3) a fast, generalizable system achieving state-of-the-art quality with 20x speedup. We hope our response sufficiently resolves your concerns and clarifies the novelty of our contributions.
I thank the authors for their rebuttal and appreciate the additional effort and details provided.
However, I still have some concerns about the response:
- For A2: The trade-off appears to impose an excessive cost on appearance fidelity, as the scenes are not preserved accurately, which could significantly compromise downstream applications.
- For A4: I still have concerns about the model's generalization to non-human scenes, given that the majority of examples focus on portraits. While the method performs well on portrait scenes, it does not compare against portrait-specific approaches and claims to be a general method, yet provides hardly any results on non-portrait scenarios, which undermines the novelty of the proposed architecture.
- For A8: The paper would benefit from more diverse results.
Overall, my major concern remains mainly with the method's performance. Therefore, I am still leaning towards rejection.
For A4: I still have concerns about the model's generalization to non-human scenes...
Q2: We acknowledge that the initial manuscript may not have sufficiently showcased non-portrait results, which may have led to concerns about UniLumos being portrait-centric. We take this opportunity to clarify that UniLumos is a general-purpose relighting model, both in its architectural design and training data, and we provide additional quantitative and qualitative evidence to fully substantiate this claim.
- General-Purpose Architecture without Semantic Priors
- UniLumos is inherently designed as a generalizable architecture. It does not rely on any domain-specific semantic priors such as 3D Morphable Models, facial landmarks, or skin segmentation, which are commonly used in portrait-specific relighting. Instead, the model builds upon general physical and geometric constraints, making it naturally extensible to arbitrary scenes. This design choice enables the model to remain agnostic to scene semantics and applicable beyond human-centric content.
- Diverse Training Data for Broad Generalization
- The training data further reinforces this generality. UniLumos is trained on the large-scale Panda70M dataset, which includes not only portrait videos but also full-body human scenes featuring a diverse range of clothing, materials, and accessories, as well as numerous instances of people interacting with objects such as guitars, bags, and tools. In addition, the dataset contains a substantial portion of object-centric sequences without any human presence. This inherent diversity equips the model with a strong inductive bias for generalization to non-human and non-portrait subjects.
- Why Portraits Were Emphasized in the Paper
- The emphasis on portrait scenes in the main paper was a deliberate decision rather than a limitation of the model. Portrait videos are known to be the most challenging for relighting tasks, as the human visual system is highly sensitive to lighting inconsistencies on skin and facial features. Even subtle artifacts in these regions can be easily noticed, making portrait scenes an effective stress test for model robustness. Moreover, many practical downstream applications of relighting, such as e-commerce livestreaming, virtual try-on, and video conferencing, are human-centric and involve relighting not only the person but also their clothing and nearby products.
- Implicit Generalization Already Visible in Paper Figures
- While the examples in the paper do include human subjects, they also implicitly demonstrate generalization through complex non-human elements. For instance, in Fig. 1, the lighting on the subject’s clothing shows faithful interaction with fabric folds and textures. In Fig. 4 (left), the person is holding a wooden guitar, which dominates the scene visually. The guitar's specular highlights, shadows, and overall reflectance behavior are rendered more accurately by UniLumos compared to other methods, illustrating the model’s ability to handle material diversity and complex lighting conditions beyond the human face.
- Direct Evaluation on Object-Centric Benchmarks (StanfordOrb+Navi)
- To further demonstrate the model’s generalization to non-human scenes, we conducted additional evaluations on two public object-centric relighting benchmarks: StanfordOrb and Navi. These datasets include objects and sculptures under a variety of lighting environments and are completely disjoint from our training data.
- StanfordOrb contains canonical 3D scanned objects such as the Stanford bunny and dragon, while Navi includes a wide range of everyday objects like containers, toys, and mugs.
- Despite the significant domain gap and without any test-time fine-tuning, UniLumos achieves state-of-the-art results across perceptual (LPIPS), structural (SSIM), and physical (R-Motion) metrics, outperforming all baselines as shown in the tables below.
Navi
| Model | PSNR↑ | SSIM↑ | LPIPS↓ | R-Motion↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 22.021 | 0.883 | 0.125 | 1.974 |
| Light-A-Video + CogVideoX | 23.912 | 0.891 | 0.121 | 1.378 |
| Light-A-Video + Wan2.1 | 23.474 | 0.903 | 0.116 | 1.341 |
| UniLumos | 24.977 | 0.911 | 0.120 | 1.203 |
StanfordOrb
| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | R-Motion↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 24.132 | 0.914 | 0.126 | 1.742 |
| Light-A-Video + CogVideoX | 25.617 | 0.923 | 0.108 | 1.279 |
| Light-A-Video + Wan2.1 | 25.784 | 0.926 | 0.104 | 1.241 |
| UniLumos | 26.512 | 0.934 | 0.097 | 1.103 |
- Summary
- These results confirm that UniLumos generalizes well across domain shifts without relying on scene-specific priors. We also expand the appendix with qualitative examples spanning animals, household objects, and outdoor environments, demonstrating consistent preservation of global lighting and fine-grained material details in diverse non-portrait settings.
For A8: The paper would benefit from more diverse results.
Q3: We agree with the reviewer that the paper would benefit from showcasing more diverse results. In response, we have revised both the main paper and the appendix to include additional qualitative examples that cover a broader range of scenarios, including non-human subjects, varied indoor and outdoor environments, and challenging materials such as specular and transparent surfaces. All outputs are generated by the same trained UniLumos model without any dataset-specific tuning or test-time adaptation.
In addition to these qualitative additions, we emphasize that our quantitative evaluation already includes results on object-centric benchmarks that are highly diverse in terms of content and lighting conditions. Specifically, on the Navi and StanfordOrb datasets—which feature a wide array of toys, everyday objects, and sculptural forms—UniLumos consistently outperforms baseline methods across perceptual (LPIPS), structural (SSIM), and physical (R-Motion) metrics:
Navi
| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | R-Motion ↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 22.021 | 0.883 | 0.125 | 1.974 |
| Light-A-Video + CogVideoX | 23.912 | 0.891 | 0.121 | 1.378 |
| Light-A-Video + Wan2.1 | 23.474 | 0.903 | 0.116 | 1.341 |
| UniLumos | 24.977 | 0.911 | 0.120 | 1.203 |
StanfordOrb
| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | R-Motion ↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 24.132 | 0.914 | 0.126 | 1.742 |
| Light-A-Video + CogVideoX | 25.617 | 0.923 | 0.108 | 1.279 |
| Light-A-Video + Wan2.1 | 25.784 | 0.926 | 0.104 | 1.241 |
| UniLumos | 26.512 | 0.934 | 0.097 | 1.103 |
These results reinforce that UniLumos is capable of maintaining high visual fidelity and lighting realism across a wide range of real-world scenarios—not limited to portrait scenes. We believe these additions significantly improve the breadth and completeness of our paper.
I thank the authors for the timely and thorough response to my remaining issues. The additional comparisons make the performance contribution of this paper more convincing. I would like to raise my rating to BA, under the authors' promise that the revised version should reflect the response during this discussion period.
We sincerely thank the reviewer for the encouraging feedback and for raising the rating. We greatly appreciate your recognition of our clarifications and improvements. As promised, all discussed updates—including expanded qualitative results and additional comparisons—will be fully reflected in the revised version.
We thank the reviewer again for the follow-up comments. We would like to respectfully address the remaining concerns.
For A2: The trade-off appears to impose an excessive cost on appearance fidelity, as the scenes are not preserved accurately, which could significantly compromise downstream applications.
Q1: We fully agree that achieving geometric consistency without sacrificing appearance fidelity is crucial for practical applications. Your concern led us to revisit this trade-off, resulting in a new two-stage training strategy that resolves the issue while enhancing the model’s performance across multiple dimensions. Our core improvement is that the new method significantly enhances visual fidelity while maintaining—or even slightly improving—geometric consistency, directly addressing the concern of an excessive cost.
- Quantitative Analysis: A Dual Improvement in Fidelity and Consistency. To directly address the concern regarding fidelity, we present the following quantitative comparison:
| Model | PSNR↑ | SSIM↑ | LPIPS↓ | Dense L2↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 22.021 | 0.883 | 0.125 | 0.432 |
| UniLumos (original) | 24.977 | 0.911 | 0.120 | 0.147 |
| UniLumos (2-stage) | 25.431 | 0.934 | 0.107 | 0.143 |
- Appearance Fidelity: The reviewer's primary concern is fidelity. Our model reduces the **LPIPS score from 0.120 to 0.107** (~11% relative improvement). Since LPIPS aligns closely with human perception, this drop reflects fewer artifacts and more realistic details from a visual standpoint.
- Geometric Consistency: This improvement in fidelity does **not** come at the expense of structural accuracy. The Dense L2 error slightly decreases from 0.147 to 0.143, indicating equally strong or even better geometric alignment.
- Qualitative Analysis: Scenes Are Accurately Preserved. To further validate that “scenes are preserved accurately,” we include new qualitative comparisons in the appendix and rebuttal attachment. These include challenging cases (e.g., original Fig. 4) where our previous model showed minor artifacts such as color shifts. The two-stage variant visibly resolves these issues, producing more stable lighting and finer texture details. These results confirm that deviations were isolated cases and have been systematically addressed.
- Impact on Downstream Applications: From Trade-off to Benefit. We also appreciate the reviewer’s emphasis on downstream utility. Our improvements translate into more robust applications:
- Portrait editing and relighting tasks benefit from higher fidelity, as post-processes like lighting manipulation rely on accurate input. The new model offers a cleaner and more realistic foundation, reducing error propagation and improving visual realism.
- Because our method improves both LPIPS (appearance) and Dense L2 (structure), it provides stable, high-fidelity inputs for a variety of synthesis and editing pipelines.
Overall, the proposed two-stage training strategy not only eliminates the trade-off between appearance fidelity and geometric supervision, but also improves both. We will further elaborate on these improvements in the revised manuscript, including expanded discussions and visual evidence to clearly demonstrate scene preservation and application robustness.
UniLumos is a framework building on the Wan 2.1 video diffusion model to enhance the speed and physical plausibility of image and video relighting. The work introduces several techniques for accelerated inference to the task of video and image relighting. Specifically, a few-step flow-matching backbone is extended with path consistency learning. Additionally, monocular depth and normal data are incorporated into an additional loss to improve physical consistency by comparing the extracted depth and normals for the generation to the pseudo ground truth. For training, a data generation pipeline, LumosData, is introduced, which generates labeled data using a VLM and the prior relighting work IC-Light for annotation and augmentation. The evaluation shows state-of-the-art results at much lower inference times compared to similar methods.
Strengths and Weaknesses
Strengths:
- The method shows an impressive improvement on inference times.
- Qualitative results look promising.
- The data generation pipeline presents a helpful approach to semi-self-supervised training of relighting tasks by generating augmentations and annotations using existing (potentially weaker) models.
Weaknesses:
- By using existing generative models for data generation, there is a distribution gap between the training data and real-world videos. The model is only evaluated quantitatively using the data generated with this pipeline.
- The method reuses existing modules that have also been applied to related tasks. The normal and depth constraints have also been employed for supervision of, e.g., optimizations of NeRFs (e.g. [1]) and 3D Gaussian Splatting (e.g. [2]). Path consistency is novel to the relighting task but is also an established idea by now.
- Dynamic light does not seem to work well (based on the example clips in the supplements). This might be an artifact of the augmentation scheme (with a temporally inconsistent model). In case 2 of the supplementary material the guitar keeps some specular reflections through all illumination settings, which looks like an artifact.
- Sometimes the paper has somewhat redundant text passages (e.g. at the beginning of the ablation study).
Overall, the paper presents some nice engineering effort and promising results but offers limited insights into algorithmic advancements to the relighting task as it mostly (re-)combines existing ideas. The evaluation is lacking comparisons outside the training data distribution.
[1] Niemeyer et al. RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs. CVPR 2022.
[2] Turkulainen et al. DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing. WACV 2025.
Questions
- How does the model perform on real-world relighting data like e.g. the Navi or StanfordOrb dataset for still images and SDSD (dvlab) or LightAtlas (RelightVid) for video? Rendered, synthetic ground truth would also work for the evaluation here as long as it is path traced using a physical renderer.
- Related works: Controllable diffusion models could also include the strand of works around ConceptSliders [1] which offers parametric control in the context of diffusion models. Why are the following relighting methods not included in the related works section and (potentially) evaluation? Intrinsic Image Diffusion [2], Neural Gaffer [3], DI-Light [4]. Have they been evaluated and why has IC Light been chosen for augmentation as it might not be the best method for the task? Is the limited consistency of IC Light actually helping the robustness of the training? This could be analyzed in some additional ablation.
- p.1, l.5: Why are overexposed highlights physically implausible? In tone-mapped images this is a natural occurrence that often makes it even look more realistic than compressed highlights.
- Table 1 and Figure 4: What data has been used here? It would be helpful to add the evaluation data sources to the captions. I don't think "relited" exists as a word (Fig. 4 caption, for example). Probably, relit would be fitting here.
[1] Gandikota et al. Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models. ECCV 2024.
[2] Kocsis et al. Intrinsic Image Diffusion for Indoor Single-View Material Estimation. CVPR 2024.
[3] Jin et al. Neural Gaffer: Relighting Any Object via Diffusion. NeurIPS 2024.
[4] Zeng et al. DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation. SIGGRAPH 2024.
Limitations
Limitations and broader impact are discussed in the appendix but it would be better to add a condensed version to the main paper.
Final Justification
I appreciate the authors' detailed response and the additional evaluations on the Navi and Stanford Orb relighting datasets, which provide promising evidence of the model’s effectiveness. The inclusion of more diverse qualitative results further strengthens the submission. With the manuscript being revised into a more polished form (related works section, specification of test set), I am inclined to raise my final score. That said, the overall impact of the work could be further enhanced by releasing the dataset or the dataset generation pipeline to the community.
Formatting Issues
There are no concerns.
Thanks for your comments, and we address them below.
Q1 By using existing generative models for data generation, there is a distribution gap between the training data and real-world videos. The model is only evaluated quantitatively using the data generated with this pipeline.
A1 All evaluations in our paper are conducted on real, held-out videos from our internal VCG dataset—not on synthetic or IC-Light-generated inputs. During testing, the model receives real foreground and background videos along with a lighting caption, and generates relit outputs accordingly. The strong performance shown in Table 1 demonstrates that our model generalizes well to real-world video, despite being trained on augmented data. This ability is largely due to our physics-guided feedback mechanism, which helps the model generalize beyond the artifacts of synthetic training data.
Q2 Using existing modules that have also been used for the given task. The normal and depth constraints have also been employed for supervision of e.g. optimizations of NeRFs (e.g. [1]) and 3D Gaussian Splatting (e.g. [2]). Path consistency is novel to the relighting task but is also an established idea by now.
A2 While we adopt existing components, our contribution lies in designing a unified framework for fast and physically-plausible video relighting—rather than proposing individual modules. The core of our system is the non-trivial integration of RGB-space geometric feedback with few-step path consistency, applied in a low-supervision setting where conventional geometric signals are weak. Beyond the model, we also introduce a structured annotation pipeline (LumosData) and a benchmark (LumosBench) that enable fine-grained, attribute-level evaluation. Together, these form a complete relighting system tailored to controllability, generalization, and efficiency.
Q3 Dynamic light does not seem to work well (based on the example clips in the supplements). This might be an artifact of the augmentation scheme (with a temporally inconsistent model). In case 2 of the supplementary material the guitar keeps some specular reflections through all illumination settings, which looks like an artifact.
A3 Our current framework does not explicitly model dynamic lighting, which can cause artifacts under complex conditions. Future work will explore incorporating dynamic cues, such as time-varying lighting descriptors or motion-aware supervision, to better handle such cases.
Q4 Sometimes the paper has somewhat redundant text passages (e.g. at the beginning of the ablation study).
A4 We have revised the manuscript for clarity and conciseness, with particular focus on removing redundancy in the introduction and ablation study sections.
Q5 How does the model perform on real-world relighting data like e.g. the Navi or StanfordOrb dataset for still images and SDSD (dvlab) or LightAtlas (RelightVid) for video? Rendered, synthetic ground truth would also work for the evaluation here as long as it is path traced using a physical renderer.
A5 Our initial focus was on a simple, user-friendly pipeline using RGB inputs, which made datasets requiring HDR or complex scene data (e.g., LightAtlas) less suitable. However, we recognize the value of established benchmarks and have added evaluations on Navi and StanfordOrb (RelightVid is not publicly available). Despite being trained on Panda70M, our model performs well on these datasets, demonstrating good generalization. Visualized results are included in the updated paper.
In the revised manuscript, we report results (see below) on three datasets: Navi, StanfordOrb, and our LumosData. Across all benchmarks, UniLumos consistently outperforms prior methods in PSNR, SSIM, and temporal consistency, while maintaining strong perceptual quality (LPIPS). These results confirm that our model generalizes well to diverse lighting domains, including real-world and synthetic, and validates its robustness beyond the training distribution.
Table: Quantitative comparison. Bold numbers indicate the best performance.
Navi
| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | R-Motion ↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 22.021 | 0.883 | 0.125 | 1.974 |
| Light-A-Video + CogVideoX | 23.912 | 0.891 | 0.121 | 1.378 |
| Light-A-Video + Wan2.1 | 23.474 | 0.903 | 0.116 | 1.341 |
| UniLumos | 24.977 | 0.911 | 0.120 | 1.203 |
StanfordOrb
| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | R-Motion ↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 24.132 | 0.914 | 0.126 | 1.742 |
| Light-A-Video + CogVideoX | 25.617 | 0.923 | 0.108 | 1.279 |
| Light-A-Video + Wan2.1 | 25.784 | 0.926 | 0.104 | 1.241 |
| UniLumos | 26.512 | 0.934 | 0.097 | 1.103 |
LumosData (our)
| Model | PSNR↑ | SSIM↑ | LPIPS↓ | R-Motion↓ |
|---|---|---|---|---|
| IC-Light Per Frame | 20.132 | 0.851 | 0.133 | 2.437 |
| Light-A-Video + CogVideoX | 19.851 | 0.859 | 0.124 | 1.784 |
| Light-A-Video + Wan2.1 | 20.784 | 0.876 | 0.129 | 1.582 |
| UniLumos | 25.031 | 0.891 | 0.109 | 1.436 |
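For reference, here is a minimal sketch of how the per-frame image metrics above (PSNR, SSIM, LPIPS) could be computed with standard libraries; R-Motion is our own temporal metric and is not reproduced here. The tensor layout and the scikit-image/lpips usage reflect common practice rather than our exact evaluation code.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric; weights are downloaded on first use

def video_metrics(pred, gt):
    """pred, gt: float arrays in [0, 1] with shape (T, H, W, 3).
    Returns mean PSNR, SSIM, and LPIPS over frames (R-Motion omitted here)."""
    psnr = [peak_signal_noise_ratio(g, p, data_range=1.0) for p, g in zip(pred, gt)]
    ssim = [structural_similarity(g, p, channel_axis=-1, data_range=1.0)
            for p, g in zip(pred, gt)]
    with torch.no_grad():
        # LPIPS expects NCHW tensors in [-1, 1]
        t_pred = torch.from_numpy(pred).permute(0, 3, 1, 2).float() * 2 - 1
        t_gt = torch.from_numpy(gt).permute(0, 3, 1, 2).float() * 2 - 1
        lp = lpips_fn(t_pred, t_gt).mean().item()
    return float(np.mean(psnr)), float(np.mean(ssim)), lp
```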
Q6 Related works: Controllable diffusion models could also include the strand of works around ConceptSliders [1] which offers parametric control in the context of diffusion models. Why are the following relighting methods not included in the related works section and (potentially) evaluation? Intrinsic Image Diffusion [2], Neural Gaffer [3], DI-Light [4]. Have they been evaluated and why has IC Light been chosen for augmentation as it might not be the best method for the task? Is the limited consistency of IC Light actually helping the robustness of the training? This could be analyzed in some additional ablation.
A6 We have added relevant controllable diffusion models to the related work section. Some prior relighting methods were not included due to differences in task formulation—many require detailed scene geometry, HDR inputs, or full-scene relighting, which do not align with our goal of a lightweight pipeline based on foreground, background, and caption inputs. Our focus is on controllable relighting under minimal, user-friendly input assumptions.
IC-Light was selected for data augmentation due to its speed, availability, and ability to generate diverse lighting effects. In future work, we plan to incorporate multiple complementary augmentation models into the data pipeline to improve temporal consistency and enrich lighting diversity.
We explored incorporating IC-Light–style consistency during UniLumos training, but observed no clear performance gain (see below). This is likely due to the highly compressed nature of the latent spatiotemporal space, where consistency signals are weak—even with learnable MLP-based transitions. As a result, the added supervision provided little benefit in practice.
Q7 p.1, l.5: Why are overexposed highlights physically implausible? In tone-mapped images this is a natural occurrence that often makes it even look more realistic than compressed highlights.
A7 Our concern was with physically inconsistent highlights, such as those misaligned with surface geometry or lighting direction. We’ve revised the text to clarify this point, replacing “physically implausible” with more precise terms like “misaligned highlights” or “geometrically inconsistent lighting.”
Q8 Table 1 and Figure 4: What data has been used here? It would be helpful to add the evaluation data sources to the captions. I don't think "relited" exists as a word (Fig. 4 caption, for example). Probably, relit would be fitting here.
A8 All results use data from the VCG test set. We will update the captions accordingly and have corrected the typo “relited” to “relit” in the revised manuscript.
We appreciate your positive feedback on UniLumos’s motivation, efficiency, and data pipeline. In this rebuttal, we clarify that all evaluations were conducted on real held-out videos, and we include new results on Navi and StanfordOrb to demonstrate generalization beyond the training distribution (Q1, Q5). While we adopt existing modules, our novelty lies in the unified integration of geometric feedback and path consistency within a fast and controllable relighting framework (Q2). We acknowledge current limitations in dynamic lighting (Q3), address clarity and redundancy issues (Q4, Q8), and respond to concerns about dataset choice, related work, and the role of IC-Light (Q6, Q7).
Overall, UniLumos introduces (1) a unified relighting framework with physically guided supervision, (2) a structured annotation and evaluation protocol for fine-grained lighting control, and (3) extensive validation demonstrating strong quality and 20x speedup. We hope our response fully addresses your concerns.
I thank the authors for the detailed response. The additional evaluations on Navi and Stanford Orb relighting datasets looks promising and seems to verify the model's effectiveness. Together with more diverse qualitative results this will strengthen the quality of the submission significantly. Given that the text is being revised into a more polished form, I am willing to consider raising the final score. The impact of the submission could be improved by making the dataset or dataset generation pipeline public.
We sincerely appreciate your positive feedback and are glad that the additional evaluations on Navi and StanfordOrb helped address your concerns. We are especially encouraged by your openness to reconsidering the final score, and genuinely value your thoughtful engagement with the paper.
Regarding your suggestion on releasing the dataset or data generation pipeline, we have organized LumosData and are planning to release both the data and pipeline to support the broader research community. Thank you again for your constructive feedback and supportive review.
This submission develops a framework, from data collection to modelling, for unified image and video relighting. The authors collect a large video and image relighting dataset, LumosData, using an automated pipeline. Based on the data, the authors propose a text-conditioned video relighting model that works better than prior training-free approaches or image relighting approaches.
Strengths and Weaknesses
Strengths:
- According to the results in Fig. 4 and Tab. 1, the proposed framework clearly improves over the previous training-free approach (Light-A-Video) and the trained image-relighting baseline (IC-Light). This shows a very promising signal along the presented technical direction.
- The LumosData, if the authors can release it, can be a valuable contribution to the community. My final decision will consider the release plan for this data.
- The path consistency objective and the normal / depth consistency objective, while neither is brand new, seem quite effective in improving the relighting results, according to Tab. 1 in the supplementary.
Weaknesses:
- First, I’d like to challenge the task itself of text-based relighting, i.e. relighting using text to describe the new lighting condition and generate an image / video for it. The biggest concern is that it is fairly challenging to describe a lighting condition. Imagine how hard it is to “reconstruct” an exact lighting condition just from the text description of it. Also imagine how many possible lighting conditions can fit into the description of “Front Light, Artificial Light, Moderate, Neutral, Static Light”. Practically, it’s not how the film industry or game industry describes lighting either, considering these might be the use cases of the proposed method. I hope the authors could justify this task in the rebuttal. This is important to justify the value of the proposed data and model because both are dependent on the assumption the lighting is described in text.
- While the quantitative results showed promising improvements, the qualitative results still reveal clear limitations. For instance, in case 2 and case 3 of the videos, for the parking garage scene, it is obvious that the specularity on the face caused by the lamps on top are not lit correctly because they don’t appear to move although the background video is clearly moving forward.
- The technical novelty is limited. The proposed framework is largely built on slightly modified existing methods and architectures. This should not be a big problem for acceptance if other concerns are addressed. However, this is the reason for not giving a higher acceptance rating.
- While it is generally challenging to evaluate the effect of the LumosData, it can be interesting to understand it via fine-tuning IC-Light on the image only portion and see how it improves after using LumosData.
Questions
- As mentioned in the weakness section, please justify the task of text-based relighting.
- For Table 1 in the supplementary, what is the “UniLumos w/o Path Consistency” baseline? Is it w/o the consistency loss but using the same sampling iterations? Or is it w/o the loss but sampling more iterations? It looks better than w/ the consistency loss so it’s necessary to clarify what it is.
Limitations
Yes.
Final Justification
I thank the authors for their efforts in preparing a polished rebuttal. My questions in the initial review were addressed. In particular, the argument for the task setup as attribute-based textual description makes sense to me now, so I'll keep my original acceptance rating. As I mentioned, the reason for not giving a higher rating is the scale of technical novelty; the authors' response didn't add dimensions of contribution beyond those I had already acknowledged. Additionally, the LumosData release is not promised either. So my final rating is still borderline accept.
Formatting Issues
None.
Thanks for your comments, and we address them below.
Q1 First, I’d like to challenge the task itself of text-based relighting, i.e. relighting using text to describe the new lighting condition and generate an image / video for it. The biggest concern is that it is fairly challenging to describe a lighting condition. Imagine how hard it is to “reconstruct” an exact lighting condition just from the text description of it. Also imagine how many possible lighting conditions can fit into the description of “Front Light, Artificial Light, Moderate, Neutral, Static Light”. Practically, it’s not how the film industry or game industry describes lighting either, considering these might be the use cases of the proposed method. I hope the authors could justify this task in the rebuttal. This is important to justify the value of the proposed data and model because both are dependent on the assumption the lighting is described in text.
A1 We agree that describing lighting through free-form text alone is ambiguous and insufficient for precise control. That’s why our method does not rely on unconstrained natural language. Instead, we define lighting via a structured 6D attribute representation—direction, intensity, source type, color temperature, temporal dynamics, and optical effects. Text is merely a user-facing interface for presenting these attributes.
This design has two key benefits: (1) it makes the task more learnable for the model, since each attribute corresponds to interpretable and disentangled supervision; and (2) it improves accessibility for users, compared to HDR-based or physically-based lighting control, which often requires professional expertise or complex scene representations. In contrast, our system supports controllable, prompt-based lighting manipulation using only RGB inputs, which is more practical for many real-world applications.
To ensure this structured control is actually meaningful, we introduced LumosBench, which evaluates how well the model responds to each attribute individually. High scores across categories (e.g., Direction: 0.893, Color Temp: 0.813) demonstrate the effectiveness and learnability of our attribute-driven design.
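To illustrate what this structured interface could look like in practice, below is a minimal sketch of a six-attribute lighting specification rendered into a text prompt. The field vocabularies and the caption template are illustrative assumptions; only the six attribute names come from our annotation protocol.

```python
from dataclasses import dataclass

@dataclass
class LightingSpec:
    """Illustrative container for the six lighting attributes."""
    direction: str          # e.g. "front", "back", "side"
    source_type: str        # e.g. "artificial", "sunlight", "candle"
    intensity: str          # e.g. "soft", "moderate", "strong"
    color_temperature: str  # e.g. "warm", "neutral", "cool"
    temporal_dynamics: str  # e.g. "static", "flickering", "sweeping"
    optical_effects: str    # e.g. "none", "lens flare", "volumetric rays"

    def to_caption(self) -> str:
        # render the structured attributes as the user-facing text prompt
        return (f"{self.direction} light, {self.source_type} source, "
                f"{self.intensity} intensity, {self.color_temperature} tone, "
                f"{self.temporal_dynamics} lighting, {self.optical_effects}")

# Example: a prompt such as "Front Light, Artificial Light, Moderate, Neutral,
# Static Light" corresponds to one point in this attribute space.
spec = LightingSpec("front", "artificial", "moderate", "neutral", "static", "none")
print(spec.to_caption())
```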
Q2 While the quantitative results showed promising improvements, the qualitative results still reveal clear limitations. For instance, in case 2 and case 3 of the videos, for the parking garage scene, it is obvious that the specularity on the face caused by the lamps on top are not lit correctly because they don’t appear to move although the background video is clearly moving forward.
A2 The static specular highlight results from the model not explicitly modeling full 3D scene geometry or light source positions. Although we use depth and normals, the lack of full 3D reconstruction limits view-dependent effects. Addressing this requires more explicit 3D reasoning, which we note as future work.
Q3 The technical novelty is limited. The proposed framework is largely built on slightly modified existing methods of architecture. This should not be a big problem for acceptance if other concerns are addressed. However, this is the reason for higher acceptance ratings.
A3 Our contributions lie in combining existing components to address a new task with new constraints: (1) integrating RGB-space feedback with few-step path consistency to balance physical accuracy and efficiency, and (2) building a full pipeline (LumosData + LumosBench) for structured, controllable relighting. We appreciate that novelty is not a major blocker and hope our responses clarify the value of this framework.
Q4 While it is generally challenging to evaluate the effect of the LumosData, it can be interesting to understand it via fine-tuning IC-Light on the image only portion and see how it improves after using LumosData.
A4 Fine-tuning IC-Light is not applicable in our setting, as it is an image-based method while our pipeline and benchmark are designed for videos. Instead, we directly assess the value of LumosData through targeted ablations. In Table 3, removing the structured captions drops controllability from 0.773 to 0.662, showing their importance for fine-grained control.
To systematically evaluate controllability, we introduce LumosBench, a benchmark that tests each of the six lighting attributes (e.g., direction, intensity, color temperature) in isolation. It contains 2k structured prompts, each targeting one attribute while holding others fixed. We use a vision-language model (Qwen2.5-VL) to assess whether the intended lighting change is correctly expressed. This disentangled, attribute-level evaluation provides a precise and interpretable way to measure how well models respond to structured lighting control (see below).
Table: Quantitative comparison of attribute-level controllability. Bold numbers indicate the best performance.
| Model | #Params | Direction | Light Source Type | Intensity | Color Temperature | Temporal Dynamics | Optical Phenomena | Avg. Score |
|---|---|---|---|---|---|---|---|---|
| General Models | ||||||||
| LTX-Video | 1.9B | 0.794 | 0.644 | 0.487 | 0.708 | 0.487 | 0.403 | 0.587 |
| CogVideoX | 5.6B | 0.837 | 0.692 | 0.552 | 0.739 | 0.532 | 0.449 | 0.634 |
| HunyuanVideo | 13B | 0.863 | 0.741 | 0.599 | 0.802 | 0.655 | 0.481 | 0.690 |
| Wan2.1 | 1.3B | 0.842 | 0.685 | 0.436 | 0.741 | 0.504 | 0.433 | 0.607 |
| Wan2.1 | 14B | 0.871 | 0.794 | 0.674 | 0.829 | 0.737 | 0.505 | 0.735 |
| Specialized Models | ||||||||
| IC-Light Per-Frame | 0.9B | 0.793 | 0.547 | 0.349 | 0.493 | 0.284 | 0.339 | 0.468 |
| Light-A-Video + CogVideoX | 2.9B | 0.787 | 0.581 | 0.327 | 0.536 | 0.493 | 0.373 | 0.516 |
| Light-A-Video + Wan2.1 | 2.2B | 0.801 | 0.603 | 0.361 | 0.582 | 0.557 | 0.412 | 0.553 |
| UniLumos w/o lumos captions | 1.3B | 0.868 | 0.774 | 0.529 | 0.798 | 0.543 | 0.457 | 0.662 |
| UniLumos | 1.3B | 0.893 | 0.847 | 0.832 | 0.813 | 0.662 | 0.592 | 0.773 |
Within the specialized group, we conduct an ablation to assess the importance of our proposed lumos captions.
The "w/o lumos captions" variant uses only vanilla scene-level captions during training, omitting structured lighting tags. The performance drop, particularly in controllable dimensions like intensity and optical phenomena, confirms that our semantic annotations play a key role in teaching the model fine-grained illumination control. Compared to strong baselines, UniLumos achieves superior scores across nearly all dimensions, demonstrating, via LumosBench, the model's improved understanding and control of illumination.
Q5 As mentioned in the weakness section, please justify the task of text-based relighting.
A5 We clarify that the task of “text-based relighting” in our work refers specifically to structured, attribute-driven control—not free-form language description. We formulate lighting as a 6D vector (e.g., direction, intensity, source type, etc.), which makes the task well-defined, learnable, and interpretable. While text serves as the interface, the core formulation is structured control, enabling practical and generalizable relighting without relying on complex HDR inputs or scene geometry.
Q6 For Table 1 in the supplementary, what is the “UniLumos w/o Path Consistency” baseline? Is it w/o the consistency loss but using the same sampling iterations? Or is it w/o the loss but sampling more iterations? It looks better than w/ the consistency loss so it’s necessary to clarify what it is.
A6 “UniLumos w/o Path Consistency” refers to a model trained without the consistency loss (L_fast), evaluated with the same few-step inference schedule as the full model. While it scores slightly higher in PSNR/SSIM, it performs worse on perceptual (LPIPS: 0.113 vs. 0.109) and temporal (R-Motion: 1.438 vs. 1.436) metrics. The baseline lacks few-step consistency training, so pixel-level gains may be incidental. We will clarify this in the revised manuscript.
We thank the reviewer for recognizing the strong improvements over prior relighting methods, the potential impact of LumosData, and the effectiveness of our geometry-guided consistency objectives. In this rebuttal, we clarify that our method is not based on free-form text descriptions, but rather on a structured 6D attribute space that enables disentangled, interpretable, and practical control over lighting (Q1, Q5). To address your concerns about qualitative limitations, we explain the source of static highlights and identify future work on 3D-aware modeling (Q2). We also clarify the meaning of the “w/o Path Consistency” baseline (Q6) and discuss why fine-tuning IC-Light is not applicable in our video-centric setting, opting instead for ablations and structured benchmarks to evaluate data quality (Q4).
Overall, UniLumos introduces (1) a unified, geometry-aware relighting framework balancing physical plausibility and efficiency, (2) LumosData and LumosBench for structured lighting supervision and evaluation, and (3) strong generalization with 20x speedup and state-of-the-art controllability across six lighting attributes. We hope our responses address your concerns and clarify the value of our contributions.
Dear Reviewers, ACs, SACs, and PCs,
We thank you and all four reviewers for the valuable feedback that helped us strengthen our work.
We have provided detailed responses to each reviewer’s concerns.
- Reviewer hicd (Confidence: 5) raised questions on generalization beyond training data, novelty, and dynamic lighting;
- Reviewer gRR4 (Confidence: 4) focused on non-portrait generalization and the trade-off between geometric consistency and appearance fidelity;
- Reviewer 4Tza (Confidence: 5) raised concerns about the practicality of the proposed lighting control interface and dataset evaluation;
- Reviewer hxVQ (Confidence: 2) raised concerns about qualitative evidence of physics-plausible feedback, reproducibility, and clarity.
For each, we conducted additional experiments, added results on Navi and StanfordOrb datasets to demonstrate strong generalization beyond the training distribution, and clarified our methodology. These efforts received positive feedback, with two reviewers acknowledging that their concerns had been addressed and raising their scores. We are grateful for their recognition of our work.
We have not yet received a follow-up from Reviewer hxVQ. We have supplemented the paper with new visual comparisons, clarified the technical descriptions, and confirmed code and benchmark release, which we believe sufficiently address Reviewer hxVQ’s concerns. We hope Reviewer hxVQ will consider re-evaluating the contributions of our work during the reviewer–AC discussion phase.
Overall, UniLumos introduces (1) a geometry-guided relighting framework with efficient few-step training, (2) a structured annotation and evaluation benchmark for disentangled lighting control, and (3) strong empirical performance with 20x speedup, delivering practical and controllable relighting under minimal input assumptions.
We kindly request that you carefully consider the contributions of our paper.
Best regards,
Authors.
After a strong rebuttal, all reviewers leaned toward the positive side (with 4 borderline accepts). The AC agrees with the reviewers that video relighting is a relatively underexplored direction and finds the direction, ideas, and results promising. The AC recommends accepting the paper, but asks the authors to carefully address the concerns raised by the reviewers in the camera-ready version, including releasing the LumosData, providing the preprocessing pipeline, giving clearer explanations of the evaluation, and including additional comparison results, among others.