PaperHub
Rating: 7.3 / 10
Poster · 4 reviewers
Individual ratings: 5, 4, 5, 4 (lowest 4, highest 5, std. dev. 0.5)
Confidence: 4.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

IntrinsiX: High-Quality PBR Generation using Image Priors

Links: OpenReview · PDF
Submitted: 2025-04-25 · Updated: 2025-10-29
TL;DR

We introduce a text-to-intrinsic (albedo, roughness, metallic, normal) model, whose outputs can be re-rendered under arbitrary lighting conditions or even distilled into room-scale PBR textures.

Abstract

Keywords
Image Synthesis, Diffusion, Intrinsic Images, Rendering, Texturing, Deep Learning

Reviews and Discussion

Review
Rating: 5

This paper proposes a method to generate intrinsic properties images (including albedo, roughness & metallic, and normal) from a given text description. The method uses the pretrained FLUX.1-dev model as the generative backbone and finetunes it using 3 separate LoRAs for each intrinsic property. After separate finetuning, the method finetunes the LoRAs jointly using a cross-intrinsic attention and a re-rendering loss with importance-based lighting direction sampling. The paper showcases the applications of the proposed method and conducts extensive experiments to show the superiority of the proposed method against baseline methods and ablation studies to show the effectiveness of the proposed techniques.

Strengths and Weaknesses

Strengths:

  1. The proposed two-stage training strategy with the cross-intrinsic attention (though this attention resembles the commonly used cross-view or cross-frame attention and lacks true novelty) and the re-rendering loss is somewhat novel: though the re-rendering loss is widely used in previous CNN-based intrinsic image decomposition methods such as [33], it is the first time (as far as I know) that it is used in a diffusion-based framework. The proposed importance-based lighting direction sampling for training is also interesting and effective.
  2. The paper proposes an efficient pipeline that takes only 26h to train on a single A100 GPU and 12s to generate an image on a single A6000 GPU, and also gives state-of-the-art results in the task of intrinsic image generation (though it is a new task proposed by this paper).
  3. Extensive quantitative and qualitative comparisons and ablation studies.
  4. The authors provide an entire section to showcase the applications of the proposed method, including relightable image generation and PBR texturing.
  5. Good presentation and easy-to-follow paper.

Weaknesses:

  1. The quantitative evaluation may not be very rigorous.
  • Firstly, the paper uses FID to compare the quality of the generated albedo image across different baselines. However, the testing dataset contains only 100 (for in-distribution) + 40 (for out-of-distribution) images, which is far fewer than the common requirement for computing FID, i.e., 10,000 images. The small number of images used to compute FID may render the evaluation unreliable. Is it good practice to use 100 and 40 images to compute FID? If not, why not use more images?
  • Secondly, I am a little bit confused about the definition of "in-distribution" and "out-of-distribution" in L263-L272. While the sampled images are "in-distribution" regarding the proposed method as it is trained on InteriorVerse, are they still "in-distribution" regarding the comparing methods, including IID, RGBX, and ColorfulShading, as these baselines may not be trained on InteriorVerse? Could the authors provide a more in-depth discussion of "in-distribution" and "out-of-distribution", as the ability to handle "out-of-distribution" is stated as a major difference between the proposed method and baseline methods (L33, L44, L90, L297)?
  2. The statement about the inherent ambiguity of intrinsic image decomposition is not persuasive. Intuitively, generating PBR images from mere text inputs is more ambiguous than from an input shaded RGB image, as even the scene structure is not determined yet. So I fail to understand why "the inherent ambiguity of decomposing an RGB image into its intrinsic properties makes it difficult to match the PBR distribution for out-of-domain samples" (L90, L298), and how the proposed method overcomes this difficulty using an even more ambiguous task.
  3. The above weakness also motivated me to question the proposed task, intrinsic image generation. I think in essence it is similar to the traditional intrinsic image decomposition, except that the input is changed to a text description instead of an image. Personally, I prefer to use an image as the input in many cases, as I may want to edit an existing image instead of generating a new image from scratch. Thus, I think the proposed task is not very useful (the texturing task in Section 4 is also actually achieved by finetuning the model to enable it to condition on an image). Could the authors compare the usefulness of the proposed intrinsic image generation task and the traditional intrinsic image decomposition task? Also, what is the possible outcome of using the proposed techniques for intrinsic image decomposition (i.e., changing the input from text to image)?

Trivia:

  1. Typos:
  • L129: "with utilizing" -> "with" or "utilizing".
  • L274: "A-PP" -> "A-PQ".
  • L337: "the preserve" -> "to preserve".
  2. Eq. (3) seems incorrect. E.g., the BRDF evaluation should be related to the incoming lighting direction $\omega_i$ and the cosine term should be related to the surface normal.

Questions

Corresponding to the weaknesses above, I summarize the questions below:

  1. The paper uses FID to compare the quality of the generated albedo image across different baselines. Is it a common practice to use 100 and 40 images to compute FID? If not, why not use more images?
  2. Could the authors provide a more in-depth discussion of "in-distribution" and "out-of-distribution", as the ability to handle "out-of-distribution" is stated as a major difference between the proposed method and baseline methods (L33, L44, L90, L297)?
  3. Why "the inherent ambiguity of decomposing an RGB image into its intrinsic properties makes it difficult to match the PBR distribution for out-of-domain samples" (L90, L298) and how does the proposed method overcome this difficulty using an even more ambiguous task?
  4. Could the authors compare the usefulness between the proposed intrinsic image generation task and the traditional intrinsic image decomposition task?
  5. What is the possible outcome of using the proposed techniques for intrinsic image decomposition (i.e., changing the input from text to image)?

I will raise my rating once the authors provide clarifications to these questions.

Limitations

Regarding the rendered image: The method does not discriminate between foreground and background regions. Sometimes relighting the background region like a plain surface feels weird. Also, the renderer cannot render cast shadows as the depth information is absent. These can be left for future work to make the rendered image more realistic.

Final Justification

The rebuttal thoroughly addresses my question 1 (about the size of the evaluation dataset), question 2 (about the concrete definition of ID and OOD), question 4 (about the application scenario of text-to-PBR image compared to image-to-PBR image methods), and question 5 (about using the proposed method for intrinsic image decomposition). I really appreciate the efforts made by the authors during the rebuttal period, which enlarged the evaluation set by more than 10 times and provided comprehensive additional experimental results. I will raise my rating to "Accept" as promised.

Formatting Issues

None.

Author Response

Motivation for Image-Space PBR Priors

Text-based generative models can act as a generic prior over images, which has shown great value for numerous downstream tasks: improved intrinsic image decomposition (IID [29], RGBX [67]), depth estimation (Marigold, Ke et al., CVPR 2024), etc. Our goal is to establish a similar generic prior for PBR materials to support downstream applications like UV-based texturing.

The key challenge of the field of PBR generation is the lack of large-scale, diverse datasets of PBR UV textures. Unlike RGB data, material datasets are scarce, and crucially, consistent UV parametrization (mapping from object coordinates to texture coordinates) across assets practically does not exist. This makes training direct generative models in UV texture space infeasible.

Instead, we learn a prior in image space. To distill image-space information into texture-space, we use SDS, which aggregates image-space signals from multiple views. This enables PBR UV texture generation for 3D assets, even though trained with relatively small amounts of image-space PBR data. In contrast, decomposition methods require an image as input and it is not obvious how one would synthesize consistent PBR UV textures for novel 3D assets.
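To make the distillation step concrete, here is a minimal PyTorch-style sketch of one SDS update of a learnable UV texture against an image-space PBR prior. The names `render_gbuffer`, `diffusion.encode`, `diffusion.add_noise`, `diffusion.predict_noise`, and `encode_text` are placeholders for whatever renderer and diffusion backbone are used; this illustrates the general SDS mechanism under those assumptions and is not the authors' implementation.

```python
import torch

def sds_texture_step(texture, render_gbuffer, diffusion, encode_text, prompt,
                     optimizer, guidance_scale=7.5):
    """One hypothetical SDS update of learnable UV maps (albedo/rough/metal/normal).

    `texture` is an nn.Parameter holding the UV maps; `render_gbuffer` rasterizes the
    textured asset from a random view into image-space PBR maps; `diffusion` is the
    frozen image-space PBR prior. All names are placeholders."""
    pbr_image = render_gbuffer(texture)            # (B, C, H, W) image-space PBR maps
    latents = diffusion.encode(pbr_image)          # map into the prior's latent space

    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = diffusion.add_noise(latents, noise, t)

    with torch.no_grad():
        eps_cond = diffusion.predict_noise(noisy, t, encode_text(prompt))
        eps_uncond = diffusion.predict_noise(noisy, t, encode_text(""))
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS: treat (eps - noise) as a gradient on the latents, skipping the U-Net Jacobian.
    grad = eps - noise
    loss = (grad.detach() * latents).sum()         # surrogate loss whose latent-gradient is `grad`

    optimizer.zero_grad()
    loss.backward()                                # backprops through encode/render into the UV texture
    optimizer.step()
```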

Inherent Ambiguity

We believe that intrinsic image decomposition methods are constrained to match the appearance of the input image. Thus, the model can learn to rely more on the input, instead of focusing on staying in the PBR distribution; therefore, facing an out-of-distribution input image poses a challenge to these models to produce faithful PBR maps, corresponding to the respective distributions.

L90 and L298 are also grounded in empirical observations from our experiments. Concretely, the decomposition baselines perform well on in-domain samples (indoor scenes), but struggle to generalize out-of-distribution. In contrast, our method does not have such limitations (see Tab. 1, Fig. 6, 7). Thus, we argue that decomposition is inherently more ambiguous than our task of directly generating the PBR maps.

ID vs OOD

Our diffusion-based baselines are only trained on indoor rooms, partially or fully on the InteriorVerse dataset. These methods perform well on indoor scenes, since this is in-domain for them. Our goal is to generate unconstrained scenes in diverse domains (e.g. single objects, composited scenes, outdoor, nature, etc.); therefore, we curated a dataset of complex scenes from various domains (OOD dataset). Since the baselines are image-conditional, when facing an out-of-domain input, their performance decreases. In contrast, our method does not depend on an input image, but rather learns the distribution of the PBR properties while maintaining the prior knowledge of the base model; thus, it can better generalize, as our qualitative and also quantitative results show.

Additional Results

We extend our test set and re-evaluate all the baselines and our method. For in-domain experiments, we use the full InteriorVerse test set (2595 samples, as done in IID [29]). Evaluating the out-of-domain performance is challenging since there exists no large-scale PBR dataset of complex scenes from various domains. To evaluate one specific domain, we use the single-object G-Buffer ObjaVerse (Zuo et al., ECCV 2024) dataset. We take 1000 samples from the diverse "Daily-Used" category. To show that these numbers are representative, we evaluate the FID with different numbers of samples. The relative improvements over baselines align with the original FID numbers in the main paper and the metrics are consistent. Our method achieves better generalization to out-of-distribution domains.

We provide three additional experiments to better support our claims (as also requested by the reviewers); two are listed below, and the third, an image-conditional variant, is described in the following section.

  • Ours w/ BRDF-sampling: we sample more diverse lighting directions using BRDF-sampling during our rendering loss to provide more gradient signal about the BRDF behavior. This variant slightly improves our generalization results, showing the importance of lighting direction sampling.
  • IID+FLUX LoRA: we re-train the IID baseline with FLUX as backbone and use LoRA layers similar to our method. This improves the fidelity, but this still does not reach the quality of our method, showing that learning a generalizable prior over PBR distributions is beneficial for generalization. For more details, we refer to our rebuttal to reviewer 48oZ.

We are happy to include these detailed results.

Image-conditional IntrinsiX

Our model can be made image-conditional in a similar fashion as for our downstream task of normal-conditioned PBR generation (Fig. 5, 15): we add the RGB image as input and apply input dropout during fine-tuning (as in L228). Our image-conditional method produces sharper results and outperforms decomposition baselines on OOD data (see Ours w/ Image-cond below). This demonstrates the benefits of learning a generative prior over PBR modalities.
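As an illustration of the input-dropout scheme mentioned above, a minimal sketch; the function name and dropout probability are hypothetical and not taken from the paper.

```python
import torch

def build_conditioning(rgb_latents, drop_prob=0.1):
    """Hypothetical input dropout for image conditioning: with probability `drop_prob`
    the RGB condition is zeroed per sample, so the model also learns the text-only case
    and remains usable without an input image at inference."""
    keep = (torch.rand(rgb_latents.shape[0], device=rgb_latents.device) > drop_prob)
    return rgb_latents * keep.view(-1, 1, 1, 1).float()
```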

# Samples vs FID on InteriorVerse

| # Samples | 100 | 500 | 1000 | 1500 | 2000 | 2595 |
| --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 113.96 | 92.21 | 83.32 | 77.13 | 72.08 |

# Samples vs FID on G-Buffer ObjaVerse

| # Samples | 100 | 300 | 500 | 700 | 800 | 900 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 123.54 | 99.14 | 85.44 | 80.46 | 75.28 | 71.39 |
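A sample-size sweep like the one above could be reproduced along these lines; this sketch uses torchmetrics' FID implementation as a stand-in for whatever implementation the authors actually used, and assumes uint8 image tensors in NCHW layout.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_for_subset(real_images, fake_images, n):
    """Compute FID on the first `n` real/generated albedo images (uint8, N x 3 x H x W).
    Illustrative only; the paper's exact FID implementation is not specified here."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images[:n], real=True)
    fid.update(fake_images[:n], real=False)
    return fid.compute().item()

# e.g. [fid_for_subset(real, fake, n) for n in (100, 500, 1000, 1500, 2000, 2595)]
```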

FID Metrics

| Method | In-domain InteriorVerse | Out-of-domain G-Buffer ObjaVerse |
| --- | --- | --- |
| IID | 78.77 | 98.77 |
| RGBX | 61.36 | 90.12 |
| ColorfulShading | 91.10 | 86.48 |
| IID+FLUX LoRA | 103.36 | 79.29 |
| Ours w/ BRDF-sampling | 75.26 | 68.87 |
| Ours w/ Image-cond | 98.08 | 70.51 |
| Ours | 72.09 | 71.39 |

Writing

We fixed the typos as suggested and clarify eq. (3): $\hat{\mathbf{I}} = f(\omega_o; \omega_i; \hat{\mathbf{X}}_0) \cdot L_i \cdot (\hat{x}_n^T \omega_i)$
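For concreteness, a simplified per-pixel evaluation of this corrected equation for a single light might look as follows. The Lambert-plus-GGX microfacet BRDF below is a standard stand-in, assumed here for illustration, and is not claimed to be the paper's exact renderer.

```python
import math
import torch
import torch.nn.functional as F

def render_single_light(albedo, roughness, metallic, normal, w_i, w_o, L_i):
    """Evaluate I = f(w_o; w_i; X0) * L_i * (n . w_i) with a simplified Cook-Torrance
    BRDF. albedo/normal/w_i/w_o are (..., 3); roughness/metallic are (..., 1)."""
    n = F.normalize(normal, dim=-1)
    h = F.normalize(w_i + w_o, dim=-1)
    n_dot_i = (n * w_i).sum(-1, keepdim=True).clamp(min=1e-4)
    n_dot_o = (n * w_o).sum(-1, keepdim=True).clamp(min=1e-4)
    n_dot_h = (n * h).sum(-1, keepdim=True).clamp(min=1e-4)
    h_dot_o = (h * w_o).sum(-1, keepdim=True).clamp(min=1e-4)

    # GGX normal distribution, Schlick Fresnel, and Smith-Schlick visibility.
    a2 = (roughness ** 2).clamp(min=1e-4) ** 2
    D = a2 / (math.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)
    F0 = 0.04 * (1.0 - metallic) + albedo * metallic
    Fr = F0 + (1.0 - F0) * (1.0 - h_dot_o) ** 5
    k = (roughness + 1.0) ** 2 / 8.0
    G = (n_dot_i / (n_dot_i * (1 - k) + k)) * (n_dot_o / (n_dot_o * (1 - k) + k))

    diffuse = (1.0 - metallic) * albedo / math.pi
    specular = D * Fr * G / (4.0 * n_dot_i * n_dot_o)
    return (diffuse + specular) * L_i * n_dot_i   # f(...) * L_i * (n . w_i)
```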

Comment

The rebuttal thoroughly addresses my question 1 (about the size of the evaluation dataset), question 2 (about the concrete definition of ID and OOD), question 4 (about the application scenario of text-to-PBR image compared to image-to-PBR image methods), and question 5 (about using the proposed method for intrinsic image decomposition). I really appreciate the efforts made by the authors during the rebuttal period, which enlarged the evaluation set by more than 10 times and provided comprehensive additional experimental results. I will raise my rating to "Accept" as promised.

Regarding question 3 about the inherent ambiguity of the text-based method versus the image-based method, I am convinced to a certain extent that the text-based method proposed by the authors is more suitable for OOD samples. However, I still do not think decomposition is inherently more ambiguous than directly generating the PBR maps, as the input image provides more constraint than the input text, and more constraint means less ambiguity. Maybe changing the wording "inherent ambiguity" to something else would be better? I do not know.

Comment

Thanks a lot for the quick response! Regarding the term "inherent ambiguity" in comparing text- and image-based methods, we agree that this term may not be the most accurate phrasing. We will revise these parts in the final version to better reflect that the key difference is that decomposition methods are more constrained and rely on the input image rather than modelling the PBR distribution:

  • L31: ``However, finding the correct decomposition to a given input image is a constrained task, usually causing over-smoothed or simplified predictions on out-of-domain samples.”
  • L89: ``However, the decomposition of an RGB image into its intrinsic properties is a constrained task, making it hard to generalize to out-of-distribution input images.”
  • L298: ``Furthermore, intrinsic image decomposition is constrained to match the appearance of the input image, making it difficult to rather focus on the PBR distribution for out-of-domain samples.”

Please let us know if this fully clarifies your question.

Comment

The revision made is reasonable and these revised statements should cause much less confusion now.

On the other hand, in the final version of the paper, maybe thoroughly checking and revising related statements about the difference between your method and image-based methods to be in accordance with the reasoning in your rebuttal would provide a smoother logic flow. For example, emphasize that your method relies less on the given constraint and more on the distribution it learned (more self-contained, I would say) to produce PBR material images, which is the reason why your method addresses OOD inputs better, even for the version that takes an image as input.

Anyway, my question 3 is now fully clarified. Thanks for your efforts!

Comment

Thanks again for the swift response! We will revise the related statements as well and emphasize the constraint differences.

Review
Rating: 4

The paper addresses the problem of text-conditioned PBR generation using a pretrained generative model. The authors adapt an image generation model with LoRA for the text-based generation of different PBR components. This results in three sets of LoRA weights: one for albedo, one for normals, and one for specularity (channel-packing roughness and metalness). Naturally, these models are not aligned and produce inconsistent outputs. To address this, the authors introduce a second stage where traditional self-attention is replaced by cross-task attention. In this setup, the activation for a given component is computed based on its correlation with the activations of other components (including itself). Additionally, a rendering loss is applied in image space between the re-rendered ground truth and the rendering of the generated samples, using a single point light source. The model produces high-quality, sharp results that are consistent across PBR properties. Finally, the authors demonstrate that their method can be successfully applied to scene texturing via score distillation sampling.

Strengths and Weaknesses

Paper strengths

  • I find the explanation of why joint learning (with cross-talk between different material properties) leads to unsatisfactory results quite interesting. As the authors hypothesize, since attention mechanisms are correlation-based, they tend to encourage similar features, which may hinder diversity among outputs.

  • The quality of the results is good and the material property maps remain sharp.

  • I also appreciate the extension to more complex scenes which does not look like a trivial addition.

Task definition

  • The formulation of the problem is a bit clumsy. For example, the phrase “we formulate PBR generation as a generative approach” (l.91) is redundant, PBR generation is inherently a generative task. And since the downstream task is not material decomposition, it's already framed as a conditional generation task by definition.

  • The authors claim that generating PBR maps “from scratch” helps alleviate the ambiguity of the task. First, “from scratch” is a somewhat odd choice of words, especially considering the model is not unconditional. More importantly, it's unclear how this approach resolves ambiguity. One could argue that the task is simply easier, since it does not rely on an RGB reference. In fact, any prediction in the parameter space can result in a plausible output under the BRDF analytic model. In contrast, I would argue that decomposition methods (conditioned on RGB images) tackle a more complex problem and suffer from inherent bias, precisely because they are forced to meet specific constraints. The task chosen in this paper, text-conditioned generation, is less constrained: the model is free to generate any plausible decomposition, as long as it aligns with a coherent scene. It’s not about explaining the properties of a specific image, but rather about sampling a plausible material configuration without regard to a particular appearance. In that sense, the comparison with decomposition methods is not entirely fair.

Novelty and scope

  • The authors claim: “For the first time, we directly generate PBR maps from text as input in a probabilistic diffusion process.” This is a rather bold statement. Indeed, text prompts have already been used for PBR generation in "Diffusion-based G-buffer Generation and Rendering" (Xue et al. arXiv 2025). In the context of single-material generation, this has been explored in works such as MatFuse (Vecchio et al., CVPR 2024), ControlMat (Vecchio et al., 2023), and DreamPBR (Xin et al., arxiv 2024).

  • The idea of modeling the joint distribution has also been addressed before, notably in ControlMat, although it targets single materials. In fact, they first train a text-conditioned model and then add a ControlNet on top. If joint learning of the distribution is indeed so beneficial, this naturally raises the question of whether adding a ControlNet, or any other form of image conditioning on the resulting model would in fact yield a more powerful decomposition. Compared to a model trained directly for decomposition from the start.

Related works

  • I am wondering whether the cross-task attention is similar to how Tune-A-Video (Wu et al., ICCV 2023) models attention between the input frames? I think it's important to position this work with respect to such approaches, which also rely on cross-task mechanisms to model interactions. See also Material Picker (Ma et al., SIGGRAPH 2025), which uses a video diffusion model to learn interactions between PBR maps. There currently seems to be no discussion positioning this part of the contribution. Also, it would be valuable if the authors positioned their approach relative to the literature on cross-task attention mechanisms. This is a common mechanism in the multi-task literature: for example, MTFormer (Xu et al., ECCV 2022).

  • On line 201, I don’t think it is accurate to say the light direction is “obtained”; rather, it is chosen as the reflected direction with respect to the outgoing direction. It would be good to mention that a similar rendering loss has been applied in prior works, e.g., SVBRDF capture (Deschaintre et al., Siggraph 2018), as well as the fact that they enforce sampling of the reflected direction.

Qualitative results

  • Overall, the generations look good, but my concern is whether the model is simply outputting random properties that remain internally consistent within image segments. Since the model is conditioned on text prompts alone, what needs to be verified is whether it has actually learned the correct distribution of object properties: whether it understands which objects should be metallic or non-metallic, and what their reflectance characteristics should be (glossy or rough).

  • Based on the results in Figure 12, I'm not so convinced it does so very well. Plausible roughness values are difficult to assess and quite subjective. However, the metallic maps can be more easily judged. For instance, several object parts are labeled with high metalness values despite being clearly non-metallic: the horse’s tail, the guitar soundhole, the spacesuit helmet visors (both), the guitar player’s eyes, and the windows of the wooden cabin. Similarly, in Figure 13, the water and tree in the Taj Mahal scene are predicted as metallic, while the sword and key are assigned rather low metalness values.

Quantitative results

  • Regarding the metric in Table 1, I am not very convinced by the results of the user study (particularly for the rendered images, specularity assessment, and coherence with the input prompt). I notice that the uncertainty in the results is greater than the difference between the baseline and the proposed method. Therefore, I do not fully agree with the conclusions drawn in Section 5.1.

  • Regarding the FID metric: since the albedo maps are generated via text prompts, they are naturally not aligned with the ground-truth albedo annotations. As a result, the authors resort to using the FID score. However, for the FID evaluation, only 100 images are used for the ID set and 40 for the OOD set. Is this sample size truly representative for a metric like FID? I believe that such distribution-based measures typically require larger sample sizes to be reliable.

Minor comments:

  • On line 97, authors said they "align them [each modality] by leveraging cross-intrinsic". The term align is a bit vague; perhaps learning the joint distribution would be more accurate

  • On line 201, the sentence "Then, we obtain the light direction as the reflectance \omega_i", shouldn't it be "as the reflected direction \omega_i" instead?

Typos:

  • line 54: "pysically"

  • line 67: It is grammatically incorrect to use citation numbers within sentences like "They typically model the diffusion process following [19] or [36]". Same on line 84 with "generate PBR textures of entire 3D scenes following [7]." and line 108 with "Unfortunately, existing datasets, such as [34, 71, 46]".

  • Line 124: I find "paired datasets for each intrinsic property" to be imprecise

  • Line 129: Simplifying the grammatical structure would help fix the sentence "Our method with utilizing a pre-trained image prior does not require extensive PBR datasets"

Questions

  • Wouldn't you say there are also inherent limitations to relying on text-based conditioning? I remain puzzled about the practical need for such a prompt-based approach. Text prompts in diffusion models have been shown to be relatively imprecise, as they tend to map to broad regions in the appearance space; as shown in Material Palette (Lopes et al., CVPR 2024) and TexSlider (Guerrero-Viu et al., SIGGRAPH 2024).

  • Regarding the importance sampling, here the strategy used is rather simplistic, shouldn’t directions other than the reflected direction also be sampled? It seems natural that sampling this direction exclusively would prevent the model from learning rougher properties.

  • On line 294, the authors say that baselines perform worse because they rely on indoor datasets. Again, I am convinced that the problem tackled by these methods is different from the one proposed in this paper. Anyhow, I think this still raises the question of whether the qualitative results stem from model size or from the fact that the model is learning the joint distribution. I wonder if the generative capability depends primarily on model size. If the authors had compared their approach with a smaller model, such as Stable Diffusion (v1 or v2), we could better understand whether the improvements come from model capacity or the problem formulation.

  • Have the authors investigated which layers are most critical for adapting the model to the downstream task? Is fine-tuning all layers necessary, and how effective is this relative to the model size? In the current setup, how many parameters are trained when applying LoRA?

  • Regarding the dataset size ablation (line 937), the fact that larger dataset sizes yield worse results sounds counterintuitive. Do you have an explanation for why this occurs? Isn’t this a limitation if the model’s learning capacity doesn’t scale well with more data? How does the model behave when a new domain is added to the training set, especially for the two tasks trained with fewer examples (normals and albedo)?

  • In the scene texture examples, I find the roughness and metallic values rather flat, that is, the values seem clustered around an average. Do you have any idea why this might be?

  • Regarding the image-space loss, how is backpropagation handled through all the steps? Also, have you tried rendering with multiple light directions or even an environment map during inference?

Limitations

The limitations section does not provide an in-depth analysis of the method’s failure modes. Instead, it mainly explains how a larger dataset could improve results. I believe a deeper understanding of the method’s real limitations and how future research could address them would be very beneficial, and this aspect is definitely missing from the paper.

Final Justification

The paper addresses the problem of text-conditioned PBR generation using a pretrained generative model. It delivers high-quality results with sharp material property maps and successfully extends the approach to handle more complex scenes, which is a non-trivial achievement. The authors have addressed the various questions and concerns I raised in my review. In particular, they provided an extended evaluation of the FID scores and detailed their positioning relative to prior works. As mentioned in my response to the authors, I have therefore decided to raise my score.

Formatting Issues

  • There are many inconsistent margins throughout the paper, which conflict with the conference template guidelines. More space is supposed to be maintained around section, subsection titles, as well as around figures.

  • Are half-column figures allowed?

  • The authors should fix their bibliography formatting, as the citation styles are inconsistent.

Author Response

Related works

The work of Xue et al. (Diffusion-based G-buffer Generation and Rendering, arXiv 2025) addresses text-conditional complex PBR generation. They train a shared ControlNet for Stable Diffusion v2 and use a diffusion renderer for editability. In contrast, our method first learns a prior over the PBR properties independently, then aligns them with cross-intrinsic attention using a fixed renderer to ensure compatibility with standard rendering engines. We will include a thorough discussion; at the same time, we want to highlight that this work is concurrent (it appeared on arXiv at almost the same time as our submission).

Single-material generation and capture

Single-material generation methods focus on texture generation; they cannot handle complex environments. ``SVBRDF capture” uses a feed-forward model with a similar rendering loss using the reflected viewing direction of a random surface point for lighting. Orthogonal to this direction, we aim to generate the PBR properties of complex scenes in image space to enable downstream applications. We introduce a rendering loss to the diffusion framework using a roughness-weighted importance sampling for the lighting direction.

Cross-frame attention

We discuss related approaches that use cross-frame attention in L77-80. Tune-A-Video similarly uses cross-frame attention, but as other multi-view image generation methods, they have the same modality across the ``frames" (=batch elements); thus, they can share the LoRA weights. MTFormer applies cross-task attention on the dense prediction heads of a deterministic transformer model. In contrast, our method uses cross-intrinsic attention, where all the LoRAs are different, but still tuned together for a generative task.

We will include these discussions.
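To make the distinction concrete, a minimal sketch of the kind of cross-intrinsic attention described here: each modality keeps its own (LoRA-adapted) projections, but its queries attend over the keys and values of all modalities stacked along the batch dimension. The tensor layout and function name are illustrative assumptions, not the authors' exact layer.

```python
import torch
import torch.nn.functional as F

def cross_intrinsic_attention(q, k, v):
    """q, k, v have shape (M, heads, tokens, dim), one batch entry per intrinsic modality
    (albedo, rough/metal, normal). Each modality's queries attend over the concatenated
    keys/values of all M modalities, which is what couples the separate LoRA branches."""
    M, H, T, D = k.shape
    k_all = k.permute(1, 0, 2, 3).reshape(1, H, M * T, D).repeat(M, 1, 1, 1)
    v_all = v.permute(1, 0, 2, 3).reshape(1, H, M * T, D).repeat(M, 1, 1, 1)
    return F.scaled_dot_product_attention(q, k_all, v_all)   # (M, heads, tokens, dim)
```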

Comparison against Decomposition

To the best of our knowledge, there is no other text-conditional PBR map generation method that operates on complex, unconstrained scenes in image space; therefore, we compare to decomposition methods. We agree that PBR generation is less constrained than decomposition, and our results demonstrate this: direct PBR generation is better suited for high-quality PBR generation (Tab.1, Fig.6,7).

Additional Results

We extend our test set and re-evaluate all the baselines and our method. For in-domain experiments, we use the full InteriorVerse test set (2595 samples, as done in IID [29]). Evaluating the out-of-domain performance is challenging since there exists no large-scale PBR dataset of complex scenes from various domains. To evaluate one specific domain, we use the single-object G-Buffer ObjaVerse (Zuo et al., ECCV 2024) dataset. We take 1000 samples from the diverse "Daily-Used" category. To show that these numbers are representative, we evaluate the FID with different numbers of samples. The relative improvements over baselines align with the original FID numbers in the main paper and the metrics are consistent. Our method achieves better generalization to out-of-distribution domains. We are happy to include these detailed results.

# Samples vs FID on InteriorVerse

| # Samples | 100 | 500 | 1000 | 1500 | 2000 | 2595 |
| --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 113.96 | 92.21 | 83.32 | 77.13 | 72.08 |

# Samples vs FID on G-Buffer ObjaVerse

| # Samples | 100 | 300 | 500 | 700 | 800 | 900 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 123.54 | 99.14 | 85.44 | 80.46 | 75.28 | 71.39 |

FID Metrics

| Method | In-domain InteriorVerse | Out-of-domain G-Buffer ObjaVerse |
| --- | --- | --- |
| IID | 78.77 | 98.77 |
| RGBX | 61.36 | 90.12 |
| ColorfulShading | 91.10 | 86.48 |
| IID+FLUX LoRA | 103.36 | 79.29 |
| Ours w/ BRDF-sampling | 75.26 | 68.87 |
| Ours w/ Image-cond | 98.08 | 70.51 |
| Ours | 72.09 | 71.39 |

Image-conditional IntrinsiX

Our model can be made image-conditional in a similar fashion as for our downstream task of normal-conditioned PBR generation (Fig. 5, 15): we add the RGB image as input and apply input dropout during fine-tuning (as in L228). Our image-conditional method produces sharper results and outperforms decomposition baselines on OOD data (see Ours w/ Image-cond above). This demonstrates the benefits of learning a generative prior over PBR modalities.

Lighting Direction Sampling

We sample light directions that maximize specular highlights to provide strong gradients about reflectance. We train a BRDF-sampled variant (Disney model) for more diverse lighting directions (denoted as Ours w/ BRDF-sampling above). This yields slightly worse results on the in-domain dataset, but improves generalization, showing that lighting direction sampling is crucial for our task.
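A minimal sketch of this kind of highlight-seeking light sampling, assuming per-pixel normals, view directions, and roughness; the roughness-scaled jitter is an illustrative choice, not the exact sampler from the paper.

```python
import torch
import torch.nn.functional as F

def sample_light_direction(normal, w_o, roughness):
    """Pick the mirror reflection of the view direction (where the specular highlight is
    strongest) and jitter it with roughness-scaled noise so rougher surfaces see more
    diverse light directions. normal/w_o are (..., 3), roughness is (..., 1)."""
    n = F.normalize(normal, dim=-1)
    w_r = 2.0 * (n * w_o).sum(-1, keepdim=True) * n - w_o   # mirror-reflected view direction
    w_i = F.normalize(w_r + roughness * torch.randn_like(w_r), dim=-1)
    return w_i
```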

Model Inconsistency

We re-trained the IID [29] baseline using the FLUX backbone, both with full fine-tuning (as in IID) and with LoRA (matching our model, 300M trainable parameters). A stronger backbone improves results. Full fine-tuning performs better in-domain (InteriorVerse) but generalizes worse (IID+FLUX Full). LoRA helps to retain the original prior and improves generalization (IID+FLUX LoRA). While sample sharpness improves, IID still shows artifacts (e.g., washed-out or misaligned textures). We believe image-conditioned models struggle with OOD generalization due to over-reliance on the input image, whereas our method learns a prior over PBR distributions that generalizes better. We are happy to include an extensive version of these experiments.

IID + FLUX

| Method | ID | OOD | In-domain InteriorVerse | Out-of-domain G-Buffer ObjaVerse |
| --- | --- | --- | --- | --- |
| IID | 188.34 | 224.83 | 78.77 | 98.77 |
| IID+FLUX Full | 178.05 | 245.41 | 127.41 | 83.58 |
| IID+FLUX LoRA | 185.11 | 224.27 | 103.36 | 79.29 |

Rough and Metal Results

We agree that evaluating material properties alone is challenging, which is why we include renderings and relighting videos and also a user study assessing specular and rendering quality (Tab. 1). The first four rows of Fig. 12 show our first-stage model, trained without rendering loss or cross-intrinsic attention, hence the predictions are incoherent and do not understand the rough and metal properties well. After stage two (bottom four rows), we can see clear improvements. The horse's tail and the guitar soundhole are not metallic anymore (guitar strings correctly remain metallic). The helmet visor, guitar player's eye and the windows of the cabin should have glass-like reflections, which can be best matched with metallic (mirror-like) reflection. Training with more diverse and high-quality PBR maps should improve the generalization.

User Study

Since PBR quality is difficult to assess, user studies naturally exhibit a large uncertainty. However, we rate each method separately, and the uncertainties of the different methods are close to each other. Thus, comparing the mean values is fair.

Details on LoRA Finetuning

Each LoRA has ~75M params (0.625% of the base model), giving a total of 225M params. It is not necessary to fine-tune all layers; we insert LoRAs only into the attention blocks, augmenting the to_q, to_k, to_v, and to_out layers. While we explored skipping some blocks to reduce parameters, we found that cross-intrinsic attention performs best when LoRA is applied throughout.
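For reference, a hypothetical peft configuration matching this description: rank-64 LoRA restricted to the attention projections. The module names follow diffusers' attention blocks and would need to be checked against the actual backbone; one such adapter would be trained per intrinsic modality.

```python
from peft import LoraConfig

# Illustrative LoRA setup: rank 64 on the attention projections only.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
# transformer.add_adapter(lora_config)  # e.g., on a diffusers FLUX transformer (assumed API)
```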

Dataset Size

LoRA finetuning can adapt T2I models effectively with few high-quality samples (see LoRA paper Sec. F.3). However, available larger PBR datasets are domain-specific (indoor). While images might become more diverse, the prompts do not vary that much, so the LoRA links style mainly to that domain. Therefore, out-of-distribution prompts (Fig. 9) don’t adopt the new style and remain shaded RGB images. The ideal solution is a large PBR dataset, with diverse, complex domains, which currently does not exist. Adding a new domain to the training improves the generalization, as discussed above (Rough and Metal Results).

Rendering

During training, we process only a single diffusion timestep, avoiding backpropagation through all steps, which enables training with a rendering loss. Starting from a noisy sample, a single-step (x0) prediction produces clean PBR maps that receive useful gradients when rendered. At inference, multiple lights can be rendered using a fixed forward renderer, as demonstrated in our scene texturing results with complex lighting (supplementary video at 1:25).
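A minimal sketch of this single-timestep scheme under a rectified-flow parameterization (which FLUX-style models use); `model`, `render`, and the exact parameterization are placeholders and may differ from the paper's pipeline.

```python
import torch

def rendering_loss_step(model, x0_pbr, t, text_emb, render, gt_render):
    """Noise the clean PBR latents once, form a single-step x0 estimate, render it,
    and compare against the re-rendered ground truth. Illustrative only."""
    noise = torch.randn_like(x0_pbr)
    x_t = (1 - t) * x0_pbr + t * noise            # rectified-flow interpolation
    v_pred = model(x_t, t, text_emb)              # predicted velocity (noise - x0)
    x0_pred = x_t - t * v_pred                    # single-step x0 estimate
    return torch.nn.functional.mse_loss(render(x0_pred), gt_render)
```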

SDS

Flat colors often occur in SDS methods due to averaging predictions. To avoid this, high guidance scales are used, but these cause oversaturation (ProlificDreamer, Wang et al., Fig. 3). This is unacceptable for rough/metal as it alters reflectance. Thus, we use lower guidance and apply an observation-frequency weighted loss and normalized flow direction to preserve details (L983-990). Other works that overcome this limitation can be additionally applied to our downstream application (e.g., LoRA adaption in ProlificDreamer, consistent noising in SyncDreamer, Liu et al.).

Limitations

We’ll expand Sec. C and add a figure showing failure cases (flat decompositions, ambiguous metallic, noisy normals). T2I models have shown value as generic image priors for numerous tasks, e.g. improvements in intrinsic image decomposition. Our goal is to get a similar prior for PBR maps, i.e. focusing on broad domain coverage, and not on exact prompt precision.

Writing

We adopt the suggestions to improve the clarity of our exposition/method details and fixed the typos in L91,97,201,124,129,201.

Formatting

We will fix the formatting issues. Since other NeurIPS papers (e.g., "Light Field Networks", Sitzmann et al., NeurIPS 2021) use half-column figures, we believe this is allowed.

Comment

We thank again for the reviewer's feedback and hope our rebuttal addressed their questions. Please let us know if there are still open concerns.

Comment

I appreciate the detailed response the authors have provided to my (rather long) review and list of comments. In particular, you have convinced me of the differences compared to prior works, especially with respect to single-material decomposition models. The extension of your evaluation protocol regarding the FID scores is reasonable to me. Assuming the authors will consider all proposed changes in the final version of the paper, fix the bibliography, and, most importantly, detail the limitations section, I am inclined to increase my score.

Comment

Thank you for your feedback! We are more than happy to include the new evaluations together with the discussion of related works and expand the limitations section, as discussed above. We appreciate your proposed changes and will incorporate them and fix the references.

Comment

Dear Reviewer,

Please respond to the rebuttal from the authors. Please note that, this year it is not allowed to finalize without discussion with authors.

Best,

Your AC

Review
Rating: 5

This paper introduces IntrinsiX, a novel approach for generating high-quality intrinsic images directly from text descriptions. Unlike traditional text-to-image models that produce RGB images with baked-in lighting effects, IntrinsiX generates physically-based rendering (PBR) maps (albedo, roughness, metallic, and normal maps) using a probabilistic diffusion process. The method involves a two-stage training process: first, training intrinsic priors for each PBR component separately using LoRA layers on curated datasets, and then aligning these models via cross-intrinsic attention and a new rendering loss with importance-based light sampling. This ensures semantically coherent and detailed PBR predictions. The core contributions of this work include being the first to directly generate usable PBR maps from text, decomposing strong image priors into intrinsic components, and combining cross-intrinsic attention with a novel rendering objective.

Strengths and Weaknesses

Strengths

  1. It's the first method to directly generate PBR maps from text, addressing the limitations of existing text-to-image models that only output RGB images with baked-in lighting effects.
  2. The two-stage training process with cross-intrinsic attention and novel rendering loss is well-designed and effectively combines the strong image prior of pretrained T2I models with intrinsic component generation.
  3. The paper provides comprehensive evaluations with quantitative metrics and qualitative comparisons against baselines. The user study further supports the claims about the quality of the generated PBR maps.

Weaknesses

The method requires specific datasets for training each PBR component and may struggle with more diverse or complex scenarios beyond the training data. The paper only uses 40 images for out-of-distribution evaluation, which may not reflect the true generalization ability of the method.

Questions

How was the decision made to use a rank of 64 in the LoRA training? Were other values explored during the experimentation process? If so, how did they compare in terms of performance and efficiency? If not, what was the reasoning behind choosing 64 without exploring other options?

Limitations

Yes

Final Justification

The authors basically addressed my concerns in the rebuttal. The significance of this research lies in its advancement of text-to-image generation for practical graphics applications, such as gaming and VR, by producing outputs suitable for re-lighting and editing. I remain positive about the paper.

Formatting Issues

I do not notice any major formatting issues.

Author Response

FID Metrics

We extend our test set and re-evaluate all the baselines and our method. For in-domain experiments, we use the full InteriorVerse test set (2595 samples, as done in IID [29]). Evaluating the out-of-domain performance is challenging since there exists no large-scale PBR dataset of complex scenes from various domains. To evaluate one specific domain, we use the single-object G-Buffer ObjaVerse (Zuo et al., ECCV 2024) dataset. We take 1000 samples from the diverse "Daily-Used" category. To show that these numbers are representative, we evaluate the FID with different numbers of samples. The relative improvements over baselines align with the original FID numbers in the main paper and the metrics are consistent. Our method achieves better generalization to out-of-distribution domains.

We provide three additional experiments to better support our claims (as also requested by the reviewers).

  • Ours w/ Image-cond: we make our model image-conditional similarly, as we did for our normal-conditional model during the SDS-based texturing (L228). This variant maintains sharp results and outperforms baselines on OOD data, demonstrating the benefit of a learned PBR prior.
  • Ours w/ BRDF-sampling: we sample more diverse lighting directions using BRDF-sampling during our rendering loss to provide more gradient signal about the BRDF behavior. This variant slightly improves our generalization results, showing the importance of lighting direction sampling.
  • IID+FLUX LoRA: we re-train the IID baseline with FLUX as backbone and use LoRA layers similar to our method. This improves the fidelity, but this still does not reach the quality of our method, showing that learning a generalizable prior over PBR distributions is beneficial for generalization. For more details, we refer to our rebuttal to reviewer 48oZ.

We are happy to include these detailed results.

# Samples vs FID on InteriorVerse

| # Samples | 100 | 500 | 1000 | 1500 | 2000 | 2595 |
| --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 113.96 | 92.21 | 83.32 | 77.13 | 72.08 |

# Samples vs FID on G-Buffer ObjaVerse

| # Samples | 100 | 300 | 500 | 700 | 800 | 900 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours (FID) | 186.60 | 123.54 | 99.14 | 85.44 | 80.46 | 75.28 | 71.39 |

FID Metrics

| Method | In-domain InteriorVerse | Out-of-domain G-Buffer ObjaVerse |
| --- | --- | --- |
| IID | 78.77 | 98.77 |
| RGBX | 61.36 | 90.12 |
| ColorfulShading | 91.10 | 86.48 |
| IID+FLUX LoRA | 103.36 | 79.29 |
| Ours w/ BRDF-sampling | 75.26 | 68.87 |
| Ours w/ Image-cond | 98.08 | 70.51 |
| Ours | 72.09 | 71.39 |

LoRA Rank

We explored different LoRA ranks and noticed differences in (a) image quality and (b) generalizability to out-of-distribution samples. A too low LoRA rank fails to produce high-quality images from the target domain. In contrast, a too high LoRA rank negatively impacts generalizability in a similar fashion as our analysis of dataset size for LoRA finetuning (Fig.9/Tab.2). Thus, we chose rank 64 as the best middle-ground that produces high-quality images and still generalizes well to out-of-distribution domains. We'll include these findings into our Fig.9.

Comment

The authors addressed my concerns about the limited size of the evaluation test set and provided a reasonable explanation for the choice of the LoRA rank. I will maintain the Accept rating. I hope the authors can provide some quantitative results on different choices of the LoRA rank in the revised manuscript.

Comment

We are more than happy to include the qualitative and quantitative ablation on the LoRA rank in the paper.

Review
Rating: 4

The paper introduces a framework for generating PBR maps directly from text prompts. It leverages a pretrained flow model (FLUX.1-dev) and employs LoRAs for effective fine-tuning. To further enhance the quality of generated PBR maps, the authors use a cross-view/multi-view attention that enables effective data interaction across the different attributes of the PBR maps (e.g., albedo, roughness/metallic, normal) along the batch dimension. The paper also applies a rendering loss to enhance the modeling of PBR map distributions.

Strengths and Weaknesses

Strengths

  • The paper is well-written and easy to follow. The figures are intuitive and helpful for understanding the method.
  • The proposed method shows promise in generating high-quality PBR maps from text prompts, especially showing improvements in generating clean albedo maps.

Weaknesses

  • Although the proposed method produces cleaner albedo maps, it still faces challenges in accurately handling reflections. Additionally, the generated normal maps frequently appear noisy. Consistency among different attributes (albedo, roughness/metallic, normal) is also lacking, as each attribute often presents disparate details.

The paper is motivated by the limitations of existing methods in decomposing RGB images into distinct PBR maps, a task inherently ambiguous and ill-posed. The authors opt for generating PBR maps directly from text prompts instead. However, for practical real-world applications, decomposing or generating PBR maps from RGB images aligns more closely with common workflows, such as image editing and relighting. Therefore, the primary advantage of the paper, generating cleaner PBR maps, is simultaneously its principal limitation regarding practicality in real-world scenarios. Nonetheless, the overall approach shows promising results, particularly given the limited training data. The proposed solution is logically sound and well-justified. Considering these points, I am leaning towards borderline acceptance.

Questions

  • Exactly how many images are used for training the albedo and normal maps? The mention in Line 130 of "20 synthetic examples of albedo and normal maps" is somewhat unclear.
  • Does the cross-view/multi-view attention mechanism introduce new attention layers (e.g., new linear layers for queries, keys, and values)? If not, reusing the existing attention layers might interfere with pretrained priors due to altered attention behavior, potentially complicating the second-stage training process that still only employs LoRAs.

Limitations

Please refer to Weaknesses.

Final Justification

The authors’ rebuttal addresses most of my concerns, and training an image-conditional IntrinsiX resolves my main issue about the motivation of the paper. So, I recommend accepting this paper.

Formatting Issues

I see no formatting issues; the submission follows the required template.

Author Response

Motivation for Image-Space PBR Priors

Text-based generative models can act as a generic prior over images, which has shown great value for numerous downstream tasks: improved intrinsic image decomposition (IID [29], RGBX [67]), depth estimation (Marigold, Ke et al., CVPR 2024), etc. Our goal is to establish a similar generic prior for PBR materials to support downstream applications like UV-based texturing.

The key challenge of the field of PBR generation is the lack of large-scale, diverse datasets of PBR UV textures. Unlike RGB data, material datasets are scarce, and crucially, consistent UV parametrization (mapping from object coordinates to texture coordinates) across assets practically does not exist. This makes training direct generative models in UV texture space infeasible.

Instead, we learn a prior in image space. To distill image-space information into texture-space, we use SDS, which aggregates image-space signals from multiple views. This enables PBR UV texture generation for 3D assets, even though trained with relatively small amounts of image-space PBR data. In contrast, decomposition methods require an image as input and it is not obvious how one would synthesize consistent PBR UV textures for novel 3D assets.

Accurate Reflections and Normal Maps

We propose an image-space rendering loss with our lighting direction sampling to improve the specular reflections. We show in Fig. 8 that without this rendering loss all modalities are averaged and washed out. The rendering loss helps to ground the individual PBR properties and leads to more details. Our lighting direction sampling further enhances this effect by strengthening the gradient signal coming from the specular highlights. Compared to the baselines (Fig. 6), our method exhibits much more detail in the rough and metal maps, which ultimately facilitates higher-quality renderings. That being said, we fully agree that there is still room for further improvement (e.g., with more diverse datasets); at the same time, we argue that our approach makes a significant step forward.

Implementation Details

We curate 20 views from the InteriorVerse dataset across different scenes. Then, we select the corresponding albedo and normal maps for training our LoRA layers. Please also refer to the supplementary material L942 for more details.
Our cross-view attention mechanism utilizes the new LoRA layers inside of the attention blocks. Concretely, we add LoRA in every attention block for the to_q, to_k, to_v, to_out linear layers in our first stage training. In the second stage, we fine-tune these new layers with our cross-intrinsic attention. In both stages, we do not modify the pretrained prior weights, thus preserving the generalization capabilities. We are happy to elaborate further in the final revision.

Comment

Thanks to the authors for their rebuttal. Most of my concerns have been addressed. However, I’m still not fully convinced by using SDS to distill image-space information into texture-space. This doesn’t seem like the best way to generate high-quality textures. Instead, directly reprojecting the image into UV space would create a lossless texture. In this sense, training an image-conditional IntrinsiX is more important. Based on the rebuttal and other reviews (e.g., Reviewer 1FNt), I understand that the authors have indeed trained an image-conditional IntrinsiX. I agree that the text-based IntrinsiX can serve as a generic prior over images, so I strongly suggest including results from the image-conditional IntrinsiX in the paper. Additionally, providing comparisons with other image-intrinsic methods would further support the claim of a generic prior.

Comment

We are more than happy to include the image-conditional results together with qualitative comparisons in the paper. This will showcase the advantage of learning our direct PBR-prior (both text-conditioned and image-conditioned) in comparison to the decomposition methods.

We chose SDS-based distillation that enables consistent supervision, while avoiding challenges like seam artifacts and accumulated errors often seen in reprojection-based approaches [7]. We are happy to provide a balanced discussion about these two alternatives.

[7] Chen et al, Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors, CVPR2024

Comment

We thank again for the reviewer's feedback and hope our rebuttal addressed their questions. Please let us know if there are still open concerns.

Comment

Dear Reviewers,

Please discuss with the authors, especially if the rebuttal did not solve your concerns.

Best, Your AC

Final Decision

This submission introduces a novel method for generating high-quality, text-conditioned PBR maps using a diffusion-based framework. The reviewers initially raised concerns about the novelty of cross-intrinsic attention, the practicality of text-based conditioning, and the robustness of quantitative evaluations (e.g., limited FID sample sizes). The authors provided a comprehensive rebuttal, clarifying that their approach uniquely combines independent PBR priors with cross-intrinsic attention and a rendering loss to ensure semantic coherence. The expanded evaluations (e.g., testing on 2595 in-domain and 1000 out-of-domain samples) demonstrated image-conditional variants and justified design choices (e.g., LoRA rank, BRDF-sampling). While some weaknesses remain (e.g., noisy normals, reflections), the method’s technical soundness, strong empirical results, and potential for applications like relighting and texture generation were widely acknowledged. After discussion, all reviewers upgraded their ratings, recognizing the paper’s contributions. Acceptance is suggested.