ROGR: Relightable 3D Objects using Generative Relighting
We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model.
Abstract
Reviews and Discussion
The paper proposes a method to reconstruct a relightable 3D model of an object. The authors first construct a dataset of posed images under different illumination conditions based on a multi-view diffusion model. Then, they train a NeRF model conditioned on illumination with this dataset, such that the model can render images under novel lighting by conditioning on unseen environment maps.
Strengths and Weaknesses
Strengths:
- The ablation study is thorough and investigates the effects of key design choices.
- The experiments include comparisons with many existing methods and demonstrate the effectiveness of the proposed approach.
Concerns:
- The Method section contains a lot of information in text; however, I find some parts somewhat hard to follow. For example, it might be easier for readers to understand if some of the concepts in Section 3.1 were visualized in Figure 2, which is currently not very informative. More clarity-related questions are listed below.
- It would be nice if Figure 6 were larger; currently it is hard to tell the difference between (a), (b), (e), and the proposed model.
- The paper would be more insightful if the authors discussed why the baselines fail or why their method performs better in the Results section. This would help the community better understand the core contributions.
Questions
- Line 80: There might be a typo? It’s unclear why the second sentence begins with “However”, the transition feels abrupt.
- Line 137: Looks like the conditioning signals are illustrated in Figure 3 instead of Figure 2.
- In Section 3.2, the new relit dataset D_relit is for a single target object. Is that correct? It might be clearer to emphasize this in Section 3.2.
- It would be helpful if the authors explicitly stated the input and output of the NeRF model. I assume it’s something like NeRF(xyz, ray direction, general conditioning, specular conditioning), where the two conditioning signals are trained with D_relit to learn to handle unseen environment maps, is this correct? If so, I would suggest making this overall idea clearer in Section 3.2.
Limitations
Yes. It would be nice if the authors could provide a short overview of the limitations in the main paper.
Justification for Final Rating
This paper proposes a pipeline that combines multi-view diffusion with a NeRF model to achieve relightable neural rendering. While the multi-view diffusion component is largely based on existing works, the proposed pipeline demonstrates improved performance over multiple existing methods. The authors have addressed several concerns and are willing to polish the clarity in the final version. Therefore, I am leaning towards accepting this paper.
Formatting Issues
n/a
Q1: Visualize more concepts in Section 3.1 and make Figure 6 larger.
Thank you for the suggestion. We will update Figure 2 and integrate the concepts of Section 3.1 into it. We will also update Figure 6 to make the differences more visible. Note that compared to (b) and (e), our final model produces more realistic specular highlights in the regions marked by the inset (note in particular the “ketchup”). Although (a) appears visually similar, our method achieves slightly better quantitative metrics, as shown in Table 3.
Q2: Add discussion in the results section—why baselines fail or why our method performs better.
Methods based on inverse rendering such as NeRFactor, TensoIR, and R3DGS rely on accurate decomposition of geometry and materials, making them susceptible to compounded errors that lead to unrealistic relighting under novel illumination. In contrast, our method bypasses this decomposition by directly generating relit images. Compared to generative relighting methods like IllumiNeRF and Neural Gaffer, our approach leverages multi-view relighting diffusion, which produces more view-consistent outputs for relightable NeRF optimization. Additionally, our environment map-conditioned relightable NeRF more effectively utilizes lighting information for radiance prediction, resulting in higher-fidelity and physically plausible relighting. We will add this discussion to the Results section in the revision.
Q3: Limitation discussion.
We provided a limitation discussion in the supplementary material, but we are happy to move it into our main paper in the final revision.
Q4: Use of ‘However’ in Line 80
Thank you. We will revise "[...] generalization. However [...]" to "[...] generalization, and [...]".
Q5: Figure reference issue in Line 137
Thank you for pointing this out. We will fix the sentence to: “The conditioning signals are illustrated in Fig. 3.”
Q6: Clarification on the relit dataset
Correct. We will emphasize in Section 3.2 that the relit dataset D_{\text{relit}} is constructed for a single target object.
Q7: Inputs and outputs of the NeRF model.
The inputs to our relightable NeRF include 3D coordinates (x,y,z), ray direction, general conditioning, and specular conditioning. The geometry MLP predicts roughness, normals, and density, while the color MLP outputs the RGB values. We will explicitly describe these components in Section 3.2 to improve clarity.
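For concreteness, here is a minimal sketch of that interface; all module names, layer widths, and the omission of positional encoding and volume rendering are our own illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class RelightableNeRF(nn.Module):
    """Illustrative sketch of the conditioned-NeRF interface described in the answer above."""

    def __init__(self, feat_dim=256, general_dim=128, specular_dim=128):
        super().__init__()
        # Geometry MLP: 3D position -> per-point feature, density, normal, roughness.
        self.geometry_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim + 5),  # feature + density(1) + normal(3) + roughness(1)
        )
        # Color MLP: feature + view direction + the two lighting conditionings -> RGB.
        self.color_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + general_dim + specular_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir, general_cond, specular_cond):
        g = self.geometry_mlp(xyz)
        feat, density = g[..., :-5], g[..., -5]
        normal, roughness = g[..., -4:-1], g[..., -1]
        rgb = self.color_mlp(
            torch.cat([feat, view_dir, general_cond, specular_cond], dim=-1))
        return rgb, density, normal, roughness
```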
Thank you for the rebuttal. I do not have further questions and will take the authors' responses into account when finalizing my review.
By utilizing multi-view diffusion to generate multi-view images of the same object under various environment lights, a NeRF conditioned on those lights can be trained. This enables rapid, high-quality feed-forward relighting under entirely new lighting conditions. The model was evaluated on the TensoIR and Stanford-ORB datasets, outperforming state-of-the-art methods across multiple metrics.
Strengths and Weaknesses
Strengths
- The motivation of fine-tuning a multi-view diffusion model conditioned on environment maps to generate multi-view, multi-illumination images of a single object is highly reasonable.
- Experimental validation demonstrates that the rendering results predicted by ROGR reflect the input environment map and exhibit clear specular effects.
- The paper is well-written and easy to follow.
Weaknesses
- The paper lacks experiments and analysis on fine-tuning the multi-view diffusion model. The performance and generalization ability of the relightable NeRF largely depend on the generation quality, viewpoint consistency, and strict adherence to lighting conditions of the multi-view diffusion model. It is recommended to include quantitative metrics evaluating the multi-view diffusion model on the relighting task, to verify that the proposed method can generate high-quality multi-view images under varying lighting conditions.
- There is some confusion regarding the reconstruction of "shadow" and "specular." In L247-248, the authors claim that the method captures convincing specular reflections and shadows. However, neither the relighting multi-view diffusion model nor NeRF-Casting appears to have any strategy or design that aids in accurately reconstructing shadows. As for "specular," while the authors provide persuasive experimental results on "specular"/"glossy" surfaces, there is some doubt as to whether the model can be applied to reconstruct objects with purely specular or metallic surfaces. Shadows and specular highlights are highly sensitive to the distribution of high HDR values in the environment map and to viewpoint variations. Could the authors provide theoretical analysis or experimental evidence explaining how the model reconstructs "shadow" and "specular"?
- Recent works such as RelitLRM [1] and DiffusionRenderer [2] can achieve feed-forward rendering or reconstruction under new environment lighting, even without requiring dense view inputs or scene-specific optimization. These methods can achieve results similar to this work on both synthetic and real datasets. What are the advantages of this work over these relightable feed-forward 3D models? It is suggested to include a discussion of these methods in the related works section.
[1] RelitLRM: Generative Relightable Radiance for Large Reconstruction Models. ICLR 2025.
[2] DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models. CVPR 2025.
Questions
Kindly refer to the [Weaknesses].
Limitations
Yes.
Justification for Final Rating
The authors’ response and additional results have largely addressed my concerns. I will increase my score accordingly. Introducing generative models to tackle challenges such as shadows and specular highlights under complex illumination is a promising and worthwhile research direction.
Formatting Issues
None
Q1: Quantitative metrics to evaluate the multi-view diffusion in the relighting task.
Good idea! The table below shows the effect of increasing the number of views input to the diffusion model. For less reflective objects, the inconsistency from 4-view diffusion was already small enough for the NeRF model to reconcile, so adding more views did not have a major impact. We observed qualitatively that increasing the number of views was more important for shinier objects, such as the car and sewing machine objects shown in the supplemental “in the wild” capture setup.
| Method/Metrics | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| 4-view | 31.04 | 0.86 | 0.082 |
| 16-view | 31.86 | 0.90 | 0.077 |
| 64-view | 31.88 | 0.91 | 0.075 |
Q2: Can the method handle shadows and specular highlights? What strategies are used to handle them?
Shadows and reflections are two challenging phenomena that our model is designed to handle in somewhat different ways. Shadows are strongly lighting-dependent, but they are not view-dependent. In order for our model to render correct shadows, the conditioning signals (and in particular the general conditioning) must encode the lighting in a way that allows the NeRF model to render them. This is enabled by our multi-view relighting model being trained to output correct shadows. Our model’s ability to handle shadows correctly is demonstrated in our supplementary material—see for example the lego scene with the rotating environment map at the top of the main page (index.html) or the car scene of StanfordORB (in stanford_orb.html).
Specular highlights are even more challenging since they are both strongly lighting-dependent and view-dependent. For rendering specular highlights, the diffusion model must be able to correctly relight the input images, and the NeRF must be able to use the conditioning signals (and in particular the specular conditioning) to create fast variations in appearance as the view direction changes. The strong dependence on view direction is achieved through the specular conditioning being concatenated to NeRF-Casting’s reflection feature, which was originally designed to enable rendering highly specular environments. We show that this pipeline is indeed effective in our results—see for example the hotdog scene at the top of the main page (index.html) and under additional lights in the TensoIR page (tensoir.html), and multiple objects like teapot and pepsi from StanfordORB (in stanford_orb.html).
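As a rough illustration of the specular-conditioning lookup described above, the sketch below presupposes a stack of environment maps blurred at increasing roughness levels and queries it along the reflection direction. The function names, the nearest-level selection, and the equirectangular convention are assumptions for exposition, not the paper's actual pipeline:

```python
import numpy as np

def reflect(view_dir, normal):
    """Mirror the view direction about the surface normal: r = d - 2(d.n)n."""
    d = view_dir / np.linalg.norm(view_dir, axis=-1, keepdims=True)
    n = normal / np.linalg.norm(normal, axis=-1, keepdims=True)
    return d - 2.0 * np.sum(d * n, axis=-1, keepdims=True) * n

def dir_to_equirect_uv(d):
    """Map a unit direction (y-up convention assumed) to (u, v) in [0, 1)."""
    theta = np.arccos(np.clip(d[..., 1], -1.0, 1.0))  # polar angle
    phi = np.arctan2(d[..., 2], d[..., 0])            # azimuth
    return (phi / (2 * np.pi)) % 1.0, theta / np.pi

def specular_condition(env_pyramid, view_dir, normal, roughness):
    """Single-sample lookup into a list of HxWx3 env maps, ordered from sharp
    to heavily blurred; the blur level is chosen by the predicted roughness."""
    r = reflect(view_dir, normal)
    u, v = dir_to_equirect_uv(r)
    level = int(np.clip(roughness, 0.0, 1.0) * (len(env_pyramid) - 1))
    env = env_pyramid[level]
    h, w, _ = env.shape
    return env[int(v * (h - 1)), int(u * (w - 1))]
```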
Q3: Discussion of RelitLRM and DiffusionRenderer.
RelitLRM is a concurrent work to ours that directly recovers the 3D geometry and appearance of an object. While their method produces results similar to ours, it is not guaranteed to produce the same geometry for different environment maps. Our two-stage strategy enables reconstruction of constant geometry under different illumination conditions.
Compared to DiffusionRenderer, which uses a video diffusion model to generate relighting results, our method enforces 3D consistency by design, since we explicitly model the geometry and rendering, whereas DiffusionRenderer does not guarantee 3D consistency. Additionally, our model enables fast inference as it avoids the long sampling times typically required by diffusion-based generation.
Thank you for your detailed rebuttal. I appreciate the clarifications provided and have no further questions at this time.
This paper proposes a method to reconstruct a relightable 3D model from multi-view images taken under the same illumination. The authors first use a multi-view diffusion model to generate a dataset consisting of the same object relit under various environment maps (a total of 111 illuminations, to be specific). They then employ the generated dataset to train a relightable NeRF-Casting-based model that takes novel environment maps as conditions and supports fast rendering (less than 1 second per frame). Specifically, they propose to condition the relightable NeRF-Casting model on the novel environment map in two ways, general (via DINO features) and specular (blurring the environment map by roughness and querying the corresponding reflection direction), to model both diffuse and specular effects during relighting. Experiments demonstrate improvements compared to previous works.
Strengths and Weaknesses
Strengths:
- Quality: The paper is technically sound. The roadmap of first relighting the object jointly (and therefore more consistently) through the diffusion model and then training a relightable NeRF model is well discussed and evaluated.
- Clarity: The paper is clearly written with a clear structure and provides enough detail along with its supplementary material.
- Originality: Regarding the method, the multi-view relighting diffusion model and the relightable modification to NeRF-Casting show originality.
Weaknesses:
- Significance: The experimental results show noticeable improvement compared to previous works. As an end product, the relightable NeRF model can be useful as it can be rendered efficiently. However, as the method claims to excel at specular highlights, additional comparisons with methods more dedicated to such effects would be appreciated.
Questions
Main concerns:
- Since the paper highlights specular highlights in its visual comparisons, I would be curious about comparisons with methods that are more dedicated to reflective objects, such as [1] and [2], which are mentioned in the paper but not included in any comparisons. Specifically, for [2] the target illumination needs to be provided in a different way, via a rendering of the object (which can be obtained either from the GT image or from the relighting diffusion model as proposed).
- Additional analysis of the geometry / normal map quality of the relightable NeRF model would be appreciated to evaluate its capability in reconstruction.
- It would be appreciated to include an analysis of the quality of the multi-view relighting diffusion results, to give a hint about the supervision signal quality, e.g., comparing the relighting from the relightable rendering, the diffusion model, and the GT (although this might only be possible in the original camera views).
Other minor concerns:
- Ablation study or discussion on choosing the hyper-parameter of specular blur kernel size would be appreciated.
- It is not very clear how many images are used to train the relightable model. The paper mentions running diffusion inference on 111 environment maps. So, supposing N camera poses are provided, does that mean a total of N×111 images are used for training the relightable NeRF model? Was there any specific selection behind the 111 environment maps, or why was this number chosen?
- In Fig. 5 the cactus model of Neural-PBIR doesn't seem to be the same as in their paper and is degraded too much. Including Neural-PBIR in the TensoIR benchmark would also be appreciated.
[1] NeRO: Neural Geometry and BRDF Reconstruction of Reflective Objects from Multiview Images.
[2] Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation.
Overall the current evaluation is fair, and would be stronger with more evaluation.
Limitations
Yes.
Justification for Final Rating
The authors’ rebuttal provides satisfying additional information, including meaningful experiments, thorough ablation studies, and extra technical details. I will keep my accept recommendation for this paper.
Formatting Issues
N.A.
Q1: Comparisons with NeRO, Neural-PBIR, and Generative Multi-view Relighting
The table below shows a comparison of our method with NeRO and Neural-PBIR on the TensoIR dataset. We will include these results in our revised version. As pointed out by the reviewer, the setting for Generative Multi-view Relighting (GMR) is different from ours: we aim to relight objects given an unseen environment map, whereas GMR expects an observed image under the target lighting. Furthermore, GMR did not release their source code.
| Method/Metrics | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Neural-PBIR | 27.09 | 0.925 | 0.085 |
| NeRO | 27.00 | 0.935 | 0.074 |
| Ours | 30.74 | 0.950 | 0.069 |
Q2: Additional analysis of the geometry / normal map quality of the relightable NeRF.
The quality of our method’s normals is comparable to prior inverse rendering techniques like TensoIR, as well as novel view synthesis approaches like NeRF-Casting which explicitly encourage geometry to be more surface-like. We will add normal visualizations to our final supplemental appendix.
Q3: Analysis of the quality of the multi-view relight diffusion results.
As suggested, we compare the output of the multi-view relighting model with renderings from our full pipeline which includes the relightable NeRF. Note that the metrics are computed over the test views of the TensoIR dataset. While the NeRF results only use the relit training images, meaning that our pipeline never observed the test images, the evaluation of the relighting model required providing the test views to the relighting model as input. Even though our pipeline does relighting and novel view synthesis, it still performs only slightly worse than a model that gets access to the test views and only needs to relight them.
| Method/Metrics | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| NeRF output | 31.88 | 0.91 | 0.075 |
| Relighting model output | 32.32 | 0.97 | 0.074 |
Q4: Discussion on choosing the hyper-parameter of specular blur kernel size.
A 2D Gaussian kernel with a radius of 20 pixels (resulting in a 40 × 40 kernel) corresponds to an angular range of (40 / 512) × 360° = 28.125° centered around the reflection ray. A 2D Gaussian kernel with a radius of 40 pixels aggregates lighting information within an angular range of (80 / 512) × 360° = 56.25° centered around the reflection ray. Based on our empirical findings, this hyperparameter selection captures the light transport behavior of most reflective surface materials.
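The pixel-to-angle conversion above follows from the equirectangular parameterization, where the 512-pixel map width spans 360° of azimuth; a one-line helper (illustrative only) makes the arithmetic explicit:

```python
def kernel_angular_range(kernel_radius_px, env_width_px=512):
    """Angular span of a blur kernel on an equirectangular environment map:
    kernel diameter over map width, times 360 degrees."""
    return (2 * kernel_radius_px / env_width_px) * 360.0

print(kernel_angular_range(20))  # 28.125 (degrees)
print(kernel_angular_range(40))  # 56.25 (degrees)
```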
Q5: Number of images used to train the relightable NeRF.
Given N cameras and M environment maps, we use a total of N×M images for training the relightable NeRF (e.g., N = 100 views and M = 111 environment maps would yield 11,100 training images).
Q6: The reason for choosing 111 environment maps.
We chose roughly 100 environment maps because our method’s quality increases with the number of environment maps and starts saturating around that point. In the table below we present relighting metrics on the hotdog scene from TensoIR for a varying number of environment maps used.
| # Environment Maps | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| 10 | 29.12 | 0.82 | 0.086 |
| 50 | 31.78 | 0.88 | 0.079 |
| 111 | 31.88 | 0.91 | 0.075 |
| 150 | 31.90 | 0.92 | 0.075 |
Q7: Cactus result from Neural-PBIR in Fig.5
Thank you for noticing this! We will fix the cactus visualization issue of Neural-PBIR in Fig.5.
I am satisfied with the authors' response, which includes meaningful experiments, thorough ablation studies, and additional technical details. I feel that most of my concerns have been well addressed.
I do have one point that needs clarification: When analyzing the quality of the multi-view relight diffusion results (Q3), which data is actually used? I noticed that the "NeRF output" refers to the paper's own pipeline, but the PSNR score reported for the TensoIR dataset (31.88) does not match the value shown elsewhere (30.74 in Table 1 and Q1). Am I correct in assuming that only a subset of the data was used for this analysis?
Yes, that is correct. Our ablations are run on the hotdog scene. The experimental setup for this ablation is the same as Table 3 in the paper.
This paper proposes ROGR, a method that takes multiple views of an object as input and reconstructs a lighting-conditioned Neural Radiance Field (NeRF) that can be relit under different lighting conditions. The two main components are a multi-view diffusion model and a relightable neural radiance field with general and specular conditioning to better encode lighting. Experiments are conducted on the TensoIR and Stanford-ORB datasets, where ROGR demonstrates state-of-the-art relighting image quality.
Strengths and Weaknesses
Strengths
- The main idea of obtaining 3D relightable objects without per-illumination optimization makes sense. The proposed solution with a multi-view 2D relighting diffusion model and a relightable neural radiance field seems to be a reasonable solution to the idea.
- The performance of the relighting is solid: ROGR achieves state-of-the-art performance on widely-used benchmarks, competing favorably against other top methods. At the same time, it is fairly efficient.
- The proposed 'General Conditioning' and 'Specular Conditioning' could be a good contribution to NeRF community to encode rich lighting information in 3D scenes. It might be useful for other 3D neural rendering models as well.
Weaknesses
- Of the two components proposed, the multi-view 2D relighting diffusion model is heavily borrowed from existing work (e.g., CAT3D and Neural Gaffer) in various parts. There could be an argument here about the relatively weak technical novelty.
- While the overall metric about the relighting quality is good in the experiment section, there are some missing pieces that require additional experiments. See below for questions about the experiment.
Questions
- One of the baselines that should be included in the experiment section (maybe as an ablation study) is: using the same multi-view relighting diffusion model + original NeRF and doing per-illumination optimization. In other words, for each target lighting condition (env map), use the multi-view relighting model to generate relit images, and then use the original NeRF to do 3D reconstruction on those images. It is reasonable to believe ROGR has an advantage over such a baseline in efficiency, but the quality differences are unclear.
- While ROGR clearly outperforms others on TensoIR, its performance on Stanford-ORB is fairly close to others. Can the authors explain the performance gap more here?
- I'm a bit surprised to see that the number of views plays a small role in the performance gain, as shown in Table 4. I would imagine it to be very important in obtaining high-quality 3D reconstruction and therefore the relit images. Did I miss something here?
Overall, I'm fairly positive about this paper. If authors could answer questions and clarify points in the questions above, I would raise my score.
Limitations
There is no discussion about the limitations.
Justification for Final Rating
After reviewing the authors' response and the other reviews, I updated my score to Accept. This paper proposes a solid approach to address the multi-view relighting problem. It is a good contribution to the field.
Formatting Issues
None
Q1: Additional ablation study on per-illumination optimization using multi-view relit images and the original NeRF-Casting method.
Thank you for the suggestion! See the results of the proposed ablation study in the table below, where we optimize NeRF-Casting directly for each illumination. We will add it to our final revision. While our method performs on par with NeRF-Casting, it has the obvious advantage that it performs relighting in a feed-forward manner and does not require optimization per illumination.
The experimental setup is the same as Table 3 in our paper, which is from the hotdog scene from TensoIR. See the text and Fig. 6 for corresponding qualitative results and additional explanations.
| Method/Metrics | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| per-illumination optimization | 31.91 | 0.93 | 0.07 |
| Ours | 31.88 | 0.91 | 0.075 |
Q2: The performance gap between TensoIR and Stanford-ORB.
The Stanford-ORB benchmark provides illumination captured using a light probe that was repositioned for each image in the dataset, and is not colocated with the object. As such, their reported “environment maps” don’t capture the actual lighting condition at the object's position. The dataset's inaccuracies mean that any metrics derived from this benchmark should be interpreted with caution. This was also pointed out by IllumiNeRF [1] (Section 4.2) and Eclipse [2] (Section 5 and Section S1). This issue causes mismatches between the provided images and the relighting results obtained by every method, which effectively introduces noise to the metrics. This explains why it’s easier to see our method’s quality boost in the TensoIR data compared with Stanford-ORB.
[1] IllumiNeRF: 3D Relighting Without Inverse Rendering
[2] Eclipse: Disambiguating Illumination and Materials using Unintended Shadows
Q3: Ablation study on the effect of varying the number of views in diffusion models for relightable NeRF reconstruction.
Good idea! The table below shows the effect of increasing the number of views input to the diffusion model. For less reflective objects, the inconsistency from 4-view diffusion was already small enough for the NeRF model to reconcile, so adding more views did not have a major impact. We observed qualitatively that increasing the number of views was more important for shinier objects, such as the car and sewing machine objects shown in the supplemental “in the wild” capture setup.
| Method/Metrics | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| 4-view | 31.04 | 0.86 | 0.082 |
| 16-view | 31.86 | 0.90 | 0.077 |
| 64-view | 31.88 | 0.91 | 0.075 |
Q4: No discussion about limitations
We provided a limitations discussion in the supplementary material. If the reviewer would like us to, we can move it into the main paper in the final version.
Thanks for the response. I don't have further questions. Please include the experiments in the final version.
Thank you for your suggestion! We will include the rebuttal experiments in the final version.
This paper was reviewed by 4 experts in the field. After the authors' feedback and internal discussion, all reviewers agreed that this is solid work and recommended acceptance (rating: 5, 5, 5, 5).
Specifically, all reviewers agreed that the paper is well written, easy to follow, and clearly illustrated. The experimental results are also very solid, with comparisons against most recent methods on various benchmarks. The provided supplementary material also shows the effectiveness of the solution over previous methods. The proposed solution is simple, novel, and effective. The area chairs also agreed that this is high-quality work.
Considering all these, the decision is to recommend the paper for acceptance to NeurIPS 2025. We will also recommend this work for a spotlight presentation, for the following two reasons: 1) Excellent performance across thorough experiments. 2) Significant contributions on the relighting of 3D objects, a topic of substantial recent interest.
Finally, we recommend that the authors carefully read all reviewers' final feedback and revise the manuscript as suggested for the camera-ready version. We congratulate the authors on the acceptance of their paper!