TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer
Abstract
Reviews and Discussion
This paper proposes a novel method to adapt an image-based relighting network, IC-Light, to relight dynamic long videos. The main challenge is to maintain temporal consistency in the video relighting results. To this end, the paper proposes a 2-stage pipeline to post-process the initially relit video. In the first stage, it aligns the exposure across frames. In the second stage, it further refines illumination and texture details based on the Unique Video Tensor representation. Quantitative experiments demonstrate that it produces better results than prior art.
Strengths and Weaknesses
Strengths
- The paper works on a promising direction: relighting any video.
- The proposed technique seems simple but effective.
- The Unique Video Tensor representation is novel and efficient.
Weaknesses
- My main concern is about the quality of the proposed method. I think a paper working on this task should provide video relighting results as supplementary material. Otherwise, it is really hard for me to judge the performance of this work. Although the paper reports some quantitative metrics and the method seems to produce the best results, they are not enough to evaluate the paper. For example, without a side-by-side qualitative comparison, I cannot translate the quantitative performance gain into actual visual improvement.
- I suggest that the author add the MemFlow method to the Preliminaries section. It is quite confusing when the flow ID first appears in L162; I only understood its meaning after checking the supplementary material.
- The paper mentions multiple times that their method can boost the embodied AI field, but there is no experiment to support this claim. For example, in the conclusion, it says: "making it particularly well suited for sim2real and real2real data scaling in embodied AI," and the paper's keywords also contain embodied AI.
As I cannot evaluate the performance gain in the current version, I lean negatively on this paper.
Questions
In L165, the paper says, "All pixels with identical κ are gathered via averaging to form one element of the one-dimensional vector." In my understanding, κ is a vector of [flow_id, RGB value]. What is the meaning of identical κ? Let κ1=[1,1,1,1], κ2=[1,1,1,1], κ3=[1,2,3,4]. Is κ1=κ2=κ3, or just κ1=κ2? I think the former is more reasonable. If so, "All pixels with identical κ" should be changed to "All pixels with identical flow id"?
Limitations
Previous methods for relighting use either a physically-based inverse rendering technique (e.g., NeRFactor [a] and InvRender [b]) or a data-driven approach (training a network on a Light Stage dataset to learn the rendering process, e.g., IC-Light) to ensure the results are physically plausible, but this paper only uses heuristics to post-process the lighting effects. I suggest that the author discuss this point in the paper.
[a] NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination, SIGGRAPH Asia 2021.
[b] Modeling Indirect Illumination for Inverse Rendering, CVPR 2022.
Final Justification
Most of my concerns are addressed. Thus, I raise the score to borderline accept. I suggest that the author include all the discussions and experiments in the final version.
Formatting Issues
N/A
We sincerely thank you for your attentive review and thoughtful comments. Below, we address the key concerns, with all citations consistent with the original paper.
Q1: Video Results and Enhanced Evaluation
Thank you for the feedback—we fully understand the importance of video results for evaluating visual quality. Videos with corresponding prompts are provided in the Video_Comparison folder linked anonymously in the supplementary (Line 9). This folder includes seven representative scenarios across both synthetic and real-world settings, including a ground-truth benchmark (described in the following paragraph). As shown, our method significantly outperforms IC-Light and VidToMe in temporal consistency, avoids the blurriness and unnatural lighting seen in COSMOS-Transfer1, and surpasses Slicedit in instruction adherence. TC-Light consistently delivers high-quality, stable results across dynamic scenes. All videos will be made publicly available on our paper's project page.
Additionally, following suggestions from reviewers phEG and xtUc, we include results on the Virtual KITTI 2 benchmark—a rare ground-truth-based relighting dataset with consistent geometry under varying lighting. We selected five scenes and relit them to match the illumination of morning, sunset, rain, overcast, and fog settings, using the COSMOS prompt upsampler to extract lighting and weather descriptors from the ground-truth video as target prompts. Each sequence averages 281 frames at a resolution of 1248×384.
With access to ground truth, we replace CLIP-T and user preference with SSIM and LPIPS [26]. Results are shown below, following the same symbol definitions as Table 2 in the main paper. Our method outperforms all baselines in perceptual and structural similarity, as well as motion consistency (Motion-S, Warp-SSIM), while remaining efficient.
| Virtual KITTI 2 | SSIM↑ | LPIPS↓ | Motion-S↑ | Warp-SSIM↑ | Time(s) | Mem.(G) ↓ |
|---|---|---|---|---|---|---|
| IC-Light* [58] | 0.5102 | 0.4470 | 95.23 | 68.13 | 1770 | 10.25 |
| VidToMe [31] | 0.5359 | 0.4262 | 95.95 | 71.33 | 444 | 6.96 |
| Slicedit [10] | 0.5080 | 0.4237 | 96.91 | 80.74 | 2346 | 17.68 |
| VideoDirector [56] | OOM | OOM | OOM | OOM | OOM | OOM |
| Light-A-Video [60] | OOM | OOM | OOM | OOM | OOM | OOM |
| RelightVid [16] | OOM | OOM | OOM | OOM | OOM | OOM |
| COSMOS-T1 [3] | 0.4833 | 0.4841 | 97.81 | 84.35 | 3314 | 34.83 |
| Ours-light | 0.5855 | 0.4026 | 98.51 | 90.94 | 580 | 15.21 |
| Ours-full | 0.5910 | 0.3971 | 98.62 | 92.38 | 1002 | 15.21 |
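As a reference for the SSIM/LPIPS numbers in the table above, below is a minimal sketch of how per-frame SSIM and LPIPS against a ground-truth video could be computed and averaged. It assumes the standard scikit-image and lpips packages; the variable names and the uint8 RGB frame format are our own assumptions, not the released evaluation code.

```python
# Hypothetical sketch: compute frame-wise SSIM and LPIPS between a relit video
# and the ground-truth video, then average over the sequence. Assumes the
# standard scikit-image and lpips packages; frames are HxWx3 uint8 RGB arrays.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance, expects NCHW in [-1, 1]

def to_lpips_tensor(frame):
    # HxWx3 uint8 in [0, 255] -> 1x3xHxW float in [-1, 1]
    t = torch.from_numpy(frame).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def evaluate_sequence(pred_frames, gt_frames):
    ssim_scores, lpips_scores = [], []
    for pred, gt in zip(pred_frames, gt_frames):
        ssim_scores.append(ssim(pred, gt, channel_axis=2, data_range=255))
        with torch.no_grad():
            lpips_scores.append(
                lpips_fn(to_lpips_tensor(pred), to_lpips_tensor(gt)).item())
    return float(np.mean(ssim_scores)), float(np.mean(lpips_scores))
```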
Q2: Illustration for MemFlow
Sorry for the confusion. We will revise the original phrasing from "where the first element is the flow ID (grouping pixels with shared optical flow)" to "where the first element is the flow ID (pixels connected by the optical flow predicted by MemFlow share the same flow ID)" for improved clarity. As suggested, we will also add a brief explanation of MemFlow in the Preliminaries section to provide proper context.
Q3: Claim Regarding Embodied AI
Thank you for highlighting this concern. To clarify, our mention of embodied AI was not intended to imply direct improvements to agent training, but to highlight a broader application context where dynamic video relighting serves as a foundational tool. Our method addresses critical limitations of prior work—namely, temporal inconsistency and computational inefficiency—making it applicable to use cases beyond traditional domains such as filmmaking or AR. To reduce confusion and improve precision, we will revise the keyword from "embodied AI" to "world-to-world transfer", and adjust the conclusion to: "endowing it with value and potential for broader application areas such as sim2real and real-world video scaling in embodied AI training and validation pipelines."
Q4: Meaning of Identical κ
Sorry for the confusion. Lines 161–165 present the general definition of κ. A crucial detail in interpreting "identical κ" is that the RGB component is quantized. Without quantization, "identical κ" would require both identical flow connections and exact color matches—too strict for cases like view-dependent effects. By applying 7-bit quantization, we allow pixels within a slight color variation (~2/255) to be grouped together, enabling more robust gathering and filtering out unreliable flow connections. In our final implementation, κ uses the flow ID and 7-bit quantized RGB. To enhance clarity, we will explicitly add this explanation of the physical meaning of "identical κ" to Line 165.
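To make the gathering step concrete, the following is a minimal sketch of how pixels sharing the same κ (flow ID plus 7-bit quantized RGB) could be averaged into a one-dimensional lookup. The tensor names and integer packing scheme are our own illustration, not the released UVT implementation.

```python
# Sketch: group pixels by kappa = (flow_id, 7-bit quantized RGB) and average them.
# flow_id: (N,) long tensor of per-pixel flow-track IDs; rgb: (N, 3) float in [0, 1].
# Hypothetical names; 7-bit quantization tolerates ~2/255 color variation.
import torch

def build_uvt(flow_id: torch.Tensor, rgb: torch.Tensor):
    rgb_q = (rgb * 255).round().long() >> 1          # 8-bit -> 7-bit quantization
    # Pack (flow_id, r, g, b) into one integer key per pixel.
    key = (flow_id << 21) | (rgb_q[:, 0] << 14) | (rgb_q[:, 1] << 7) | rgb_q[:, 2]
    uniq, inverse = torch.unique(key, return_inverse=True)   # one slot per unique kappa
    uvt = torch.zeros(len(uniq), 3, dtype=rgb.dtype)
    counts = torch.zeros(len(uniq), dtype=rgb.dtype)
    uvt.index_add_(0, inverse, rgb)                  # sum colors within each group
    counts.index_add_(0, inverse, torch.ones(len(rgb), dtype=rgb.dtype))
    uvt /= counts.unsqueeze(1)                       # average -> 1-D lookup of RGB values
    return uvt, inverse                              # inverse maps each pixel to its UVT slot
```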
Q5: Physical Plausibility
The physical plausibility of our method is inherited from IC-Light, which is pre-trained on high-quality Light Stage data and has learned a physically grounded relighting process. Our main contribution lies in improving temporal consistency without altering the illumination priors embedded in the base model.
As detailed in Section 3.2, we introduce a video model inflation mechanism based on token merging/unmerging and decayed multi-axis denoising to enable temporal feature-level information exchange. Since these components do not alter the prior knowledge encoded in the base model, the distribution of edited illumination aligns with that of IC-Light while enhancing consistency across frames.
In Section 3.3, we propose a two-stage post-optimization strategy. The first stage smooths global exposure transitions using an appearance embedding, following practices in physically-based rendering methods like NeRF-W [CVPR’21] and 3DGS [SIGGRAPH’23]. The second stage refines local fluctuations without altering the overall lighting. Thus, our final results maintain the physically plausible qualities of IC-Light, while significantly improving temporal coherence. As shown in Figure 3 of the main paper and the video results in Q1, our illumination remains qualitatively aligned with IC-Light and VidToMe, but with fewer artifacts and greater temporal stability.
Thanks to the strong priors of IC-Light, our method focuses on temporal coherence and computational efficiency. Compared to optimization-heavy approaches such as NeRFactor [a] and InvRender [b], our post-optimization stage takes only ~2 minutes, with the entire pipeline completing in ~10 minutes—substantially faster than training NeRF or 3DGS models, as discussed in Lines 88–93 of the main paper. As suggested, we will incorporate the above discussion into the revised manuscript and add citations for both NeRFactor and InvRender.
Thanks for the rebuttal. Most of my concerns are addressed. I would consider raising the score to borderline accept.
Thank you for your response and for considering our rebuttal. We appreciate your time and are glad to hear that most concerns have been addressed!
This paper introduces a novel method for relighting long and highly dynamic videos. It extends the image-based relighting model IC-Light to the video domain using training-free techniques, followed by a two-stage post-optimization process to enhance temporal consistency and visual fidelity. Experimental results demonstrate that the proposed method achieves superior performance with low computational cost.
Strengths and Weaknesses
Strengths
- The paper presents a novel and efficient approach to video relighting, achieving strong performance with reduced computational cost. This makes it promising for various downstream applications.
- The authors propose a new benchmark that specifically targets long and highly dynamic videos. The release of this benchmark could benefit the research community.
Weaknesses
- My primary concern is the overall stability of the pipeline. The method comprises multiple submodules, each relying on different base models or hand-crafted optimization losses. A discussion on the success rate, along with visualizations of failure cases, would provide a clearer picture of the method's robustness.
- The paper lacks video result demonstrations. Since this work focuses on video editing, quantitative comparisons based on only a few frames are insufficient to evaluate temporal consistency and overall quality. Including video comparisons would significantly strengthen the evaluation.
Questions
- In line 142, the paper mentions that the first stage introduces a per-frame appearance embedding, but it is not previously defined or explained.
- In Table 2, the proposed method underperforms on the CLIP-T metric. Please explain the reason behind this discrepancy.
Limitations
yes
Final Justification
The authors have addressed most of my concerns, and I am inclined to recommend acceptance.
Formatting Issues
none
We greatly appreciate your willingness to recommend our paper for acceptance and your insightful advice. Below, we address the key concerns, with all citations consistent with the original paper.
Q1: Video Results and Enhanced Evaluation
Thank you for the suggestion—we fully agree on the importance of video results for assessing temporal quality. Videos with corresponding prompts are provided in the Video_Comparison folder linked anonymously in the supplementary (Line 9). This folder includes seven representative scenarios across both synthetic and real-world settings, including a ground truth benchmark (described in the following paragraph). As shown, our method significantly outperforms IC-Light and VidToMe in temporal consistency, avoids the blurriness and unnatural lighting seen in COSMOS-Transfer1, and surpasses Slicedit in instruction adherence. TC-Light consistently delivers high-quality, stable results across dynamic scenes. All videos will be made publicly available on our paper's project page.
Additionally, following suggestions from reviewers phEG and xtUc, we include results on the Virtual KITTI 2 benchmark—a rare ground-truth-based relighting dataset with consistent geometry under varying lighting. We selected five scenes and relit them to match the illumination of morning, sunset, rain, overcast, and fog settings, using the COSMOS prompt upsampler to extract lighting and weather descriptors from the ground-truth video as target prompts. Each sequence averages 281 frames at a resolution of 1248×384.
With access to ground truth, we replace CLIP-T and user preference with SSIM and LPIPS [26]. Results are shown below, following the same symbol definitions as Table 2 in the main paper. Our method outperforms all baselines in perceptual and structural similarity, as well as motion consistency (Motion-S, Warp-SSIM), while remaining efficient.
| Virtual KITTI 2 | SSIM↑ | LPIPS↓ | Motion-S↑ | Warp-SSIM↑ | Time(s) | Mem.(G) ↓ |
|---|---|---|---|---|---|---|
| IC-Light* [58] | 0.5102 | 0.4470 | 95.23 | 68.13 | 1770 | 10.25 |
| VidToMe [31] | 0.5359 | 0.4262 | 95.95 | 71.33 | 444 | 6.96 |
| Slicedit [10] | 0.5080 | 0.4237 | 96.91 | 80.74 | 2346 | 17.68 |
| VideoDirector [56] | OOM | OOM | OOM | OOM | OOM | OOM |
| Light-A-Video [60] | OOM | OOM | OOM | OOM | OOM | OOM |
| RelightVid [16] | OOM | OOM | OOM | OOM | OOM | OOM |
| COSMOS-T1 [3] | 0.4833 | 0.4841 | 97.81 | 84.35 | 3314 | 34.83 |
| Ours-light | 0.5855 | 0.4026 | 98.51 | 90.94 | 580 | 15.21 |
| Ours-full | 0.5910 | 0.3971 | 98.62 | 92.38 | 1002 | 15.21 |
Q2: Robustness Analysis
The influence of different submodules is mainly reflected in the fluctuation of the metrics. First, TC-Light inherits IC-Light's inability to relight hard shadows, as discussed in Section 4.4 of the paper. As shown in the Virtual KITTI video example in Q1, the tree shadows in ours_full.gif align with 0_input.gif rather than 1_groundtruth.gif. This causes SSIM to deviate from the average value of 0.59 down to 0.47. Second, the performance is relatively stable with respect to the design of the UVT index κ. Comparing rows 6, 8, and 9 of Table 3 in the main paper, the additional depth and instance mask cues bring only slight performance gains or drops. Third, unreliable flow estimation in some textureless regions may introduce artifacts, as discussed in Section 4.4 of the paper. Under the Video_Comparison/bad_case folder of Q1, an example from the AgiBot subset illustrates how unreliable flow leads to instability along the robotic arm's edge, with CLIP-T falling from the average value of 0.275 to 0.259. We will expand these discussions in the revised paper to provide a clearer assessment of robustness, including visualizations of failure cases.
Q3: Illustration
Sorry for the confusion. We explain in Lines 143-144 that "Inspired by [26], we model [the appearance embedding] as a 3 × 4 affine transformation matrix, initialized to the identity and optimized via Adam [29]." We will highlight this text and give an example of the appearance embedding for easier understanding in the revision.
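For illustration, a minimal sketch of such a per-frame 3 × 4 affine appearance embedding follows: a 3 × 3 matrix plus a 3-vector offset per frame, initialized to the identity and optimized with Adam. The exposure-alignment objective (matching a temporally smoothed per-frame mean color) and all names are our own simplification, not the paper's exact loss.

```python
# Sketch: per-frame appearance embedding as a 3x4 affine color transform,
# initialized to the identity and optimized with Adam. The loss below
# (matching a temporally smoothed mean color) is illustrative only.
import torch
import torch.nn.functional as F

def optimize_appearance_embeddings(frames, steps=200, lr=1e-2):
    # frames: (T, H, W, 3) float tensor in [0, 1]
    T = frames.shape[0]
    A = torch.eye(3).repeat(T, 1, 1).requires_grad_(True)      # (T, 3, 3) linear part
    b = torch.zeros(T, 3, requires_grad=True)                  # (T, 3) offset -> 3x4 affine
    target = frames.mean(dim=(1, 2))                           # (T, 3) per-frame mean color
    kernel = torch.ones(3, 1, 9) / 9.0                         # temporal box filter
    target = F.conv1d(target.t().unsqueeze(0), kernel,
                      padding=4, groups=3).squeeze(0).t()      # smoothed exposure curve
    opt = torch.optim.Adam([A, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mapped = torch.einsum('tij,thwj->thwi', A, frames) + b[:, None, None, :]
        loss = ((mapped.mean(dim=(1, 2)) - target) ** 2).mean()  # align global exposure
        loss.backward()
        opt.step()
    return A.detach(), b.detach()    # together: one 3x4 affine transform per frame
```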
Q4: CLIP-T Performance
The drop in CLIP-T reflects a trade-off between perceptual alignment and temporal consistency. Our post-optimization smooths flickering textures and illumination, slightly simplifying details and causing minor shifts in CLIP features. While CLIP-T decreases by ~1%, Warp-SSIM improves by over 20%. For example, in the video results of the CARLA subset in Q1, VidToMe and TC-Light achieve CLIP-T scores of 0.2556 and 0.2540, respectively, while Warp-SSIM improves from 83.90 to 93.66. As evidence, ours_full.gif is significantly smoother and more consistent than vidtome.gif.
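To illustrate what the consistency side of this trade-off measures, a warping-based score in the spirit of Warp-SSIM can be sketched as below: backward-warp the next frame into the current frame's coordinates using the estimated flow, then compute SSIM on the result. Function names are hypothetical and occlusion masking is omitted; this is not the benchmark's exact implementation.

```python
# Hypothetical sketch of a Warp-SSIM-style consistency score: backward-warp
# frame t+1 into frame t's coordinates using the flow from t to t+1, then
# compute SSIM against frame t. Occlusion handling is deliberately omitted.
import torch
import torch.nn.functional as F
from skimage.metrics import structural_similarity as ssim

def backward_warp(next_frame, flow):
    # next_frame: (1, 3, H, W) in [0, 1]; flow: (1, 2, H, W) flow from t to t+1 in pixels
    _, _, H, W = next_frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack((xs, ys), dim=-1).float()            # (H, W, 2) pixel coordinates
    coords = grid + flow[0].permute(1, 2, 0)                # sample frame t+1 at x + flow(x)
    coords[..., 0] = 2 * coords[..., 0] / (W - 1) - 1       # normalize to [-1, 1] for grid_sample
    coords[..., 1] = 2 * coords[..., 1] / (H - 1) - 1
    return F.grid_sample(next_frame, coords.unsqueeze(0), align_corners=True)

def warp_ssim(frame_t, frame_t1, flow_t_to_t1):
    with torch.no_grad():
        warped = backward_warp(frame_t1, flow_t_to_t1)
    a = warped[0].permute(1, 2, 0).numpy()
    b = frame_t[0].permute(1, 2, 0).numpy()
    return ssim(a, b, channel_axis=2, data_range=1.0)
```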
Thank you for your response. It addressed my concerns well. I will maintain my recommendation as Borderline Accept.
We are very grateful for your insightful suggestions and support of our work. We are happy to see that our rebuttal has addressed your concerns.
This paper introduces TC-Light, a video relighting method focused on long videos. It proposes a two-stage post-optimization framework that refines an initial relighting result produced by inflating an image-based model to achieve both global lighting consistency and temporal alignment while maintaining low computational cost. A new benchmark dataset is also proposed, containing 58 long video clips of about 256 frames across diverse environments and conditions to evaluate video relighting performance under complex dynamic scenes.
Strengths and Weaknesses
TC-Light achieves state-of-the-art temporal consistency while staying prompt-controllable with its two-stage framework. Its UVT representation is lightweight yet enforces cross-frame coherence, and the authors back their claims with a large, diverse long-video benchmark. However, the method assumes reasonably accurate optical-flow grouping and benefits from depth when available, so very noisy flow or missing depth may leave residual artifacts. The heavy consistency constraints seem to slightly lower prompt fidelity and image detail. Even with the proposed new benchmark dataset, the quantitative evaluation is limited because no ground-truth relit videos exist, making results rely on proxy metrics and user studies.
Questions
- Although the paper mentions this in passing, please state explicitly which parts of TC-Light are novel and which are adapted from VidToMe, Slicedit, or other prior work. A concise side-by-side comparison table would help show that the method is more than a straightforward combination of existing techniques.
- Discuss TC-Light's behaviour when optical-flow estimates are poor (e.g., very fast motion, heavy occlusions, etc.) and provide qualitative or quantitative evidence of any failures, along with mitigation strategies if any.
- For the proposed dataset, detail the clip-selection process, domain balance, and user-study protocol, and explain how test clips were kept fully separate from development. Clarify whether any ground-truth relighting pairs exist or if the benchmark is solely for subjective/relative evaluation, and include an expanded section or appendix summarizing the dataset's content and statistics.
Limitations
yes
Final Justification
The authors addressed my questions diligently in the rebuttal. Originality attribution (Q1) and dataset details (Q3) are now clearly documented. They added a robustness experiment that relights videos after sub-sampling every two to four frames to mimic faster motion, yet in my view this does not fully resolve Q2. Severe scenarios such as large occlusions, blur, transparency, or complete flow failure are still underexamined. I have read and agree with the other reviewers’ concerns as well and have reflected them in my final rating.
Formatting Issues
None that I've found.
We really appreciate your affirmation of our work and constructive advice! Below, we address the key concerns, with all citations consistent with the original paper.
Q1: Side-by-side comparison of TC-Light Modules
| Modules | VidToMe [33] | Slicedit [10] | 3DGS [25] | TC-Light |
|---|---|---|---|---|
| Token Merging | ✔ | | | ✔ |
| Multi-Axis Denoising | | ✔ | | ✔ |
| Decayed Noise Weight | | | | ✔ |
| Noise Statistics Alignment (AIN) | | | | ✔ |
| Appearance Embedding Optimization | | | ✔ | ✔ |
| Unique Video Tensor Optimization | | | | ✔ |
As suggested, we provide a side-by-side comparison table for clarity. Note that 3DGS is used for reconstruction instead of video editing. We will include this table and summarize our module distinctions at the start of Sections 3.2 and 3.3 to highlight our contributions.
Q2: Results on Groundtruth Benchmark
| Virtual KITTI 2 | SSIM↑ | LPIPS↓ | Motion-S↑ | Warp-SSIM↑ | Time(s) | Mem.(G) ↓ |
|---|---|---|---|---|---|---|
| IC-Light* [58] | 0.5102 | 0.4470 | 95.23 | 68.13 | 1770 | 10.25 |
| VidToMe [31] | 0.5359 | 0.4262 | 95.95 | 71.33 | 444 | 6.96 |
| Slicedit [10] | 0.5080 | 0.4237 | 96.91 | 80.74 | 2346 | 17.68 |
| VideoDirector [56] | OOM | OOM | OOM | OOM | OOM | OOM |
| Light-A-Video [60] | OOM | OOM | OOM | OOM | OOM | OOM |
| RelightVid [16] | OOM | OOM | OOM | OOM | OOM | OOM |
| COSMOS-T1 [3] | 0.4833 | 0.4841 | 97.81 | 84.35 | 3314 | 34.83 |
| Ours-light | 0.5855 | 0.4026 | 98.51 | 90.94 | 580 | 15.21 |
| Ours-full | 0.5910 | 0.3971 | 98.62 | 92.38 | 1002 | 15.21 |
We appreciate the suggestion and agree that ground-truth evaluation remains scarce but essential in generative video relighting. To this end, we evaluate on Virtual KITTI 2, which renders consistent geometry across varied lighting conditions in autonomous driving scenarios. We selected five scenes and relit them to match the illumination of morning, sunset, rain, overcast, and fog settings, using the COSMOS prompt upsampler to extract lighting and weather descriptors from the ground-truth video as target prompts. Each sequence averages 281 frames at a resolution of 1248×384.
With access to ground truth, we replace CLIP-T and user preference with SSIM and LPIPS [26]. Results are shown above, following the same symbol definitions as Table 2 in the main paper. Our method outperforms all baselines in perceptual and structural similarity, as well as motion consistency (Motion-S, Warp-SSIM), while remaining efficient.
Q3: Robustness Analysis
| Sample Interval | Motion-S↑ | Warp-SSIM↑ | CLIP-T↑ |
|---|---|---|---|
| 1 (Normal) | 97.48 | 89.94 | 0.2971 |
| 2 (Fast) | 96.28 | 87.40 | 0.2986 |
| 4 (Very Fast) | 95.75 | 85.44 | 0.3018 |
First, we evaluate TC-Light's robustness under fast motion using the NavSim subset (5 sequences, each with ≥ 584 frames), keeping the total frame count at 145 while varying the sampling interval. As shown in the table above, when motion speed increases, CLIP-T and Motion Smoothness remain stable (<2% fluctuation), while Warp-SSIM declines by ~5% under very fast motion due to less reliable optical flow. Second, in the scand subfolder of Q1's video results, we demonstrate TC-Light's handling of heavy occlusion. As shown in ours_full.gif, our model correctly preserves the appearance of the same identity, with only a minimal drop (<1%) in Warp-SSIM and Motion Smoothness compared to the SCAND subset average. These discussions will be added to our revised manuscript.
Q4: Details about the benchmark
Thank you for the valuable suggestion. Key details are as follows:
- The user study protocol is described in Section D of the supplementary, while dataset statistics appear in Table 1 (main paper) and Table S2 (supplementary). Dataset content, including scenarios and camera setups, is outlined in Lines 195–196 (main paper) and Lines 23–30 (supplementary). We plan to add thumbnails of each subset for clearer visualization.
- For clip selection, sequences shorter than 300 frames are used fully; otherwise, 300 consecutive frames are randomly sampled. Statistics are provided in Table 1 of the main paper.
- Domain balance is maintained between synthetic and real environments (25 vs. 28 videos), also balanced within sub-domains: autonomous driving (12 synthetic, 10 real), robotic manipulation (8 synthetic, 12 real), indoor navigation (5 synthetic, 6 real). The aerial subset is excluded from balance due to limited long dynamic drone videos in the simulation environment and serves mainly for generalization validation.
- Test clip separation is strictly ensured. IC-Light, trained solely on images, has never seen these video datasets. Our temporal inflation is zero-shot, and the post-optimization process is input-agnostic. Text prompts are randomly selected from IC-Light prompts or COSMOS-generated ones, with unreasonable prompts resampled (e.g., "a modern interior room" for a driving scenario). This selection process is fully independent of model development.
- This benchmark is solely for subjective/relative evaluation. However, we have also supplemented results on the Virtual KITTI 2 dataset, which contains ground-truth videos (see Q2). The table in Q2 sufficiently validates the superiority of our method's performance.
As recommended, we will add a dedicated section in the paper to comprehensively introduce the benchmark and include these details.
Dear xtUc, we actually need you to show up and say something, in addition to clicking the "mandatory acknowledgement" button.
Thank you for the rebuttal. Authors have addressed all of my questions, and I'm satisfied with most of them. I have also read other reviewers’ rebuttal and will maintain my rating.
We are glad to see that our rebuttal has addressed most of your concerns. Thanks again for your attentive review and constructive suggestions!
This paper proposes a method, TC-Light, that tackles the long-standing challenge of editing illumination in long, highly dynamic videos by inflating the single-image relighting model (IC-Light) into a zero-shot video diffusion framework and introducing decayed multi-axis denoising that blends per-frame (xy) and time-slice (yt) predictions for improved temporal coherence. A subsequent two-stage post-optimization first aligns global exposure through a learned per-frame affine "appearance embedding," then refines local flicker using a Unique Video Tensor (UVT)—a compact 1-D RGB lookup indexed by flow, depth, and color cues. Qualitative results and several hand-crafted metrics showcase the effectiveness of the proposed method, with extensive ablation studies. However, metrics computed against ground-truth relit videos are absent due to the scarcity of such data. In addition, no qualitative videos are presented in the supplementary material, which is important for papers working on video generation.
Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow.
- The evaluations especially ablation studies are extensive, including strong zero-shot and training-based baselines, ablations, user study, and efficiency measurements.
- By using decayed multi-axis denoising, it is an elegant way to inject motion priors without training, which improves consistency with negligible extra VRAM.
- The proposed UVT representation is simple yet effective, improving visual quality while lowering the runtime VRAM.
Weaknesses:
- Flow dependence and error analysis: Optical-flow errors propagate into UVT, occasionally causing blur or ghosting in low-texture areas; robustness analysis is limited.
- Groundtruth Benchmark: Although capturing video under different lighting conditions is hard, evaluation against ground truth, which is typically lacking in the field of generative video relighting, is still encouraged. It would provide a more accurate and robust probe of the performance.
- No Video results: As a paper working on video-related generative tasks, I would place more weight on the qualitative results presented in a demo video, which could clearly show the temporal smoothness and overall quality of the relit videos. However, there is no video presented in the supplementary material, which makes it hard to determine the effectiveness of the proposed method.
- Reproducibility: Will the authors release the code? Exact UVT implementation (index hashing, GPU kernels) may be non-trivial to replicate without a reference.
Questions
- How sensitive is performance to the composition of κ (e.g., color + flow vs flow + depth)? Did you explore learned index hashing or adaptive bin sizes?
- The method currently edits entire frames uniformly via a text prompt. Could UVT be extended to support spatially varying lighting (e.g., spotlight on a moving actor / near-field point lights)?
- Have you tried TC-Light with recent transformer-flow models (e.g., Flowformer)? How does flow noise degrade Warp-SSIM, and could self-supervised flow refinement within stage II help?
- As mentioned in your Limitations section, given IC-Light’s limitations, how does it perform on those failed corner cases?
Limitations
Yes
Final Justification
Most of my concerns have been properly addressed. My acceptance recommendation is conditioned on that the authors will include all the additional discussions, evaluations, and limitations in the revision.
Formatting Issues
NA
We sincerely thank you for the very professional comments. We are inspired and hope our discussion brings more insights. Below, we address the key concerns, with all citations consistent with the original paper.
Q1: Results on Groundtruth Benchmark
| Virtual KITTI 2 | SSIM↑ | LPIPS↓ | Motion-S↑ | Warp-SSIM↑ | Time(s) | Mem.(G) ↓ |
|---|---|---|---|---|---|---|
| IC-Light* [58] | 0.5102 | 0.4470 | 95.23 | 68.13 | 1770 | 10.25 |
| VidToMe [31] | 0.5359 | 0.4262 | 95.95 | 71.33 | 444 | 6.96 |
| Slicedit [10] | 0.5080 | 0.4237 | 96.91 | 80.74 | 2346 | 17.68 |
| VideoDirector [56] | OOM | OOM | OOM | OOM | OOM | OOM |
| Light-A-Video [60] | OOM | OOM | OOM | OOM | OOM | OOM |
| RelightVid [16] | OOM | OOM | OOM | OOM | OOM | OOM |
| COSMOS-T1 [3] | 0.4833 | 0.4841 | 97.81 | 84.35 | 3314 | 34.83 |
| Ours-light | 0.5855 | 0.4026 | 98.51 | 90.94 | 580 | 15.21 |
| Ours-full | 0.5910 | 0.3971 | 98.62 | 92.38 | 1002 | 15.21 |
We appreciate the suggestion and agree that ground-truth evaluation remains scarce but essential in generative video relighting. To this end, we evaluate on Virtual KITTI 2, which renders consistent geometry across varied lighting conditions in autonomous driving scenarios. We selected five scenes and relit them to match the illumination of morning, sunset, rain, overcast, and fog settings, using the COSMOS prompt upsampler to extract lighting and weather descriptors from the ground-truth video as target prompts. Each sequence averages 281 frames at a resolution of 1248×384.
With access to ground truth, we replace CLIP-T and user preference with SSIM and LPIPS [26]. Results are shown above, following the same symbol definitions as Table 2 in the main paper. Our method outperforms all baselines in perceptual and structural similarity, as well as motion consistency (Motion-S, Warp-SSIM), while remaining efficient. Related video comparisons are provided in Q2.
Q2: Video Results
Thank you for the suggestion—we fully agree on the importance of video results for assessing temporal quality. Videos with corresponding prompts are provided in the Video_Comparison folder linked anonymously in the supplementary (Line 9). This folder includes seven representative scenarios across both synthetic and real-world settings, including the ground truth benchmark in Q1. As shown, our method significantly outperforms IC-Light and VidToMe in temporal consistency, avoids the blurriness and unnatural lighting seen in COSMOS-Transfer1, and surpasses Slicedit in instruction adherence. TC-Light consistently delivers high-quality, stable results across dynamic scenes. All videos will be made publicly available on our paper's project page.
Q3: Reproducibility of TC-Light
Yes, we are planning to release the code upon acceptance. In fact, our code is already in a state where it can be open-sourced at any time. We have also used ChatGPT to improve the quality and readability of the whole repository. We hope the code release can benefit the research community.
Q4: Alternative Design Choices
| Components | Motion-S↑ | Warp-SSIM↑ | CLIP-T↑ | FPS↑ | Mem.(G) ↓ |
|---|---|---|---|---|---|
| flow+color | 96.44 | 91.05 | 0.2866 | 0.559 | 11.81 |
| flow+depth | 95.20 | 91.38 | 0.2778 | 0.553 | 11.62 |
| flow+color+depth | 96.56 | 91.12 | 0.2862 | 0.569 | 11.57 |
| flow+color+depth+inst. | 96.50 | 91.01 | 0.2851 | 0.545 | 11.67 |
| flow+color+depth, +vq | 97.25 | 87.98 | 0.2830 | 0.420 | 18.12 |
We evaluate different compositions of κ and the use of learned index hashing (via the vector-quantize-pytorch library). "inst." and "vq" denote instance masks and vector quantization, respectively. Rows 1, 3, and 4 match rows 6, 8, and 9 of Table 3 in the main paper.
Color proves essential in κ: flow+depth leads to a notable drop in Motion-S and CLIP-T compared with flow+color+depth. This is because even when pixels are close in 3D or flow-connected, they may belong to different objects or textures—color helps disambiguate such cases and avoid incorrect grouping. In contrast, the performance is relatively stable with respect to the depth and instance mask cues.
Learned index hashing breaks spatiotemporal correspondence: we found that this process introduced significant overhead while degrading Warp-SSIM and CLIP-T. The learned codebook often failed to capture true spatiotemporal correspondence, grouping unrelated pixels and receiving conflicting supervision. Hence, we excluded it from our model design.
Q5: Robustness Analysis
| Flow | Noise Level | Denoise | Motion-S↑ | Warp-SSIM↑ | CLIP-T↑ |
|---|---|---|---|---|---|
| MemFlow | 0.0 | | 96.56 | 91.12 | 0.2873 |
| MemFlow | 0.5 | | 96.46 | 89.64 | 0.2863 |
| MemFlow | 1.0 | | 96.41 | 88.22 | 0.2866 |
| MemFlow | 2.0 | | 96.32 | 86.17 | 0.2855 |
| MemFlow | 2.0 | ✔ | 96.37 | 86.13 | 0.2851 |
| FlowFormer++ | 0.0 | | 96.56 | 88.16 | 0.2854 |
Here, we provide quantitative results under different flow conditions. The data subsets used align with those of the ablation study in the paper, and row 1 here comes from row 8 of Table 3 in the main paper. The noise level indicates the standard deviation of the Gaussian noise added to the flow. As can be observed, Motion-S and CLIP-T are relatively stable under flow noise (dropping less than 1%). From low to high noise, the drop in Warp-SSIM ranges from 2% to 5%. Qualitatively, the frames become noisy, and flickering becomes more salient. We also tried flow denoising supervised by the warping error, but it adds 6-10 GB of VRAM cost, and the results show no clear benefit. We also tried FlowFormer++ during model development. As validated in [14], FlowFormer++ underperforms MemFlow, and as shown in the table, a weaker flow estimator also degrades Warp-SSIM. Therefore, it is excluded from our model design.
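For clarity, the perturbation itself is simple: zero-mean Gaussian noise with the stated standard deviation (in pixels) is added to the estimated flow before it is used downstream. A minimal sketch under hypothetical names follows.

```python
# Hypothetical sketch of the flow-perturbation test: add zero-mean Gaussian
# noise with a given standard deviation (in pixels) to the estimated flow,
# then re-run the pipeline and re-measure the consistency metrics.
import torch

def perturb_flow(flow: torch.Tensor, noise_std: float) -> torch.Tensor:
    # flow: (T, 2, H, W) per-frame optical flow in pixels
    if noise_std <= 0:
        return flow
    return flow + noise_std * torch.randn_like(flow)
```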
Q6: Extension to Spatially Varying Lighting
Yes, the top row of Figure S1 (supplementary) demonstrates our model's ability to handle spatially varying lighting. In the middle three images, a white car drives from left to right. Initially, it is illuminated by orange street lamps, reflecting an orange hue. As it moves right, its rear remains orange-lit, while the front becomes blue due to a nearby advertising screen. Eventually, the car is fully bathed in blue light. This dynamic lighting response indicates that our model can correctly handle spatially varying lighting.
Q7: Limitation Analysis
As discussed in Section 4.4 of the main paper, the limitations are fourfold. First, TC-Light inherits IC-Light's inability to relight hard shadows. As shown in the Virtual KITTI video example in Q2, the tree shadows in ours_full.gif align with 0_input.gif rather than 1_groundtruth.gif. This causes the SSIM to deviate from the average value of 0.59 down to 0.47. Second, the model struggles when the input resolution is significantly lower than that used to train IC-Light, leading to more severe flickering. For instance, downscaling the NavSim subset from 960×536 to 480×264 causes Warp-SSIM to drop from the average value of 90.46 to 88.36, although CLIP-T remains at 0.304. Third, flow estimation becomes unreliable in textureless regions, introducing artifacts. Under the Video_Comparison/bad_case folder of Q2, the example from the AgiBot subset shows how unreliable flow leads to instability along the robotic arm's edge, with CLIP-T falling from the average value of 0.275 to 0.259. Fourth, the temporal consistency loss tends to smooth the texture of flickering areas. In the Carla subset of Q2, ours_full.gif shows smoother road and sky textures than iclight.gif, trading off realism for consistency. We will expand these discussions in the revised paper to provide a clearer assessment of robustness, including visualizations of failure cases.
Thanks for the response from the authors. Most of my concerns have been addressed. I will make the final decision in light of the discussion with other reviewers, thanks!
Thank you again for your thoughtful review and constructive feedback! We do appreciate your time and consideration!
This paper proposes a method for relighting long videos. The main idea is to give an image-based relighting model a new temporal axis, aligning the exposure across time within the video and then refining the illumination and texture details. The paper also includes a new benchmark dataset, with 58 diverse "long" clips (256 frames). This paper received borderline/positive reviews: 5, 4, 4, 4. The reviewers find the paper well-written and easy to follow, and are impressed with the results, particularly given the low computational cost. The authors are strongly encouraged to revise the paper according to the facts clarified in the rebuttal, especially relating to the originality attribution. The AC recommendation is to accept.