PaperHub
Overall rating: 7.8 / 10 (Poster; 4 reviewers; min 4, max 5, std 0.4)
Individual ratings: 5, 5, 4, 5
Average confidence: 3.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Submitted: 2025-05-03 · Updated: 2025-10-29

Abstract

Keywords: Object Compositing; Multi-view Harmonization; Shadow Generation

Reviews and Discussion

Official Review
Rating: 5

The paper presents a compositing framework capable of handling both 2D single-image and 3D multi-view compositing. The authors use an image transformer in the 2D stage. For the 3D stage, they apply a Hilbert curve to convert the 3D point cloud into a 2D image representation, allowing them to reuse the feature maps from the 2D stage as input. This conversion enables them to continue using the same image transformer architecture.

Strengths and Weaknesses

Strengths:

  1. The newly proposed dataset can be very useful as training data for similar tasks.
  2. The Hilbert curve mapping is interesting and can be used for feature mapping in other tasks.

Weaknesses:

  1. The method assumes that the composited object is perfectly aligned with the scene (3D representations of both the composited object and the background are available). This is a significant limitation. In the 2D case, a view of the object that precisely matches the scene's viewpoint is not available, and this alignment is non-trivial in practice when the 3D model is unavailable, which is the case in the general compositing problem. In contrast, diffusion-based approaches can internally handle misalignment by generating a new view that blends naturally with the scene. This is most apparent in the Supp-Fig. 4 results.

  2. In the 3D case, the quality of the generated depth map depends heavily on the number and diversity of input views. However, the paper does not discuss how varying view coverage or sparse geometry affects the final compositing quality. There is also no discussion of robustness to errors in geometry estimation or alignment.

  3. The paper uses a Hilbert curve to convert the 3D Gaussian representation into a 2D texture to be processed by an image transformer, but this design choice is not well motivated. A standard transformer can operate directly on token sequences, including the 3D Gaussian features, without requiring 2D spatial layout. The ablation study includes a row where the Hilbert curve is removed, but it is unclear what architectural changes are made in that variant. More clarity is needed on how this version is implemented and how a basic transformer would perform without spatial mapping.

  4. Using metrics like SSIM and PSNR to compare against diffusion-based methods can be unfair. These metrics rely on pixel-level alignment with ground truth, while diffusion models often modify the composited region significantly. Therefore, a lower score does not necessarily reflect lower perceptual quality. Although the authors include a user study in the supplementary material, they do not explain how it was conducted — including details such as the number of samples, the dataset used, the number of participants, or the evaluation protocol.

  5. The paper does not specify which datasets were used during training. It is also unclear which evaluations are zero-shot (i.e., tested on entirely unseen datasets) and which are conducted on held-out splits from the same datasets used for training.

Questions

I would appreciate if authors could clarify my concerns mentioned in weaknesses section.

  1. The motivation behind the use of the Hilbert curve + image transformer versus a standard transformer.
  2. Clarification on the training data, evaluation data, and user study.

Limitations

While the paper discusses some limitations, it does not address the strong assumption that both the 3D model of the scene and the composited object are available. This assumption significantly limits the practical applicability of the method — especially in the 2D case, where acquiring a correctly aligned view of the object is non-trivial. The absence of any discussion on this constraint is notable.

Final Justification

The additional ablations and the clarifications the authors provided during the rebuttal address my concerns, and I am convinced to increase my rating to accept.

Formatting Issues

None

Author Response

We thank you for all of your valuable comments on our paper. The detailed responses are shown below.

W1: Assumption of Perfect Object Alignment

Response: Thank you for your valuable suggestions. We acknowledge that our method requires an initial 3D alignment of the object. For example, we can align the bottom surface of the object with the plane of the scene.

However, we clarify that the core contribution of our paper is illumination harmonization and consistent lighting and shadow generation, not reconstructing the background scene or aligning the composited object, as described in Section 3.1.

While diffusion-based models can often blend a misaligned 2D object into a single image, this generative flexibility becomes a critical weakness in a multi-view 3D context. Running a diffusion model independently on each view would almost certainly produce inconsistent object geometry, lighting, and shadows, shattering the 3D illusion. Our method is explicitly designed to solve this multi-view consistency problem, which necessitates a consistent 3D position for the object as a starting point.

W2: Impact of Sparse Geometry or View Coverage

Response: This point relates to the scope of our work. Our method assumes a pre-existing, high-quality 3D representation of the background scene (e.g., a pre-trained 3DGS). Consequently, the depth maps required by our model are rendered from this representation rather than being generated from sparse views. Since the 3D representation is known, the input views can be chosen freely; thus, we did not analyze varying view coverage or sparse geometry in the main text.

To explore this boundary condition, our supplement (Section A.7) includes an experiment where we use VGGT to estimate geometry from unposed images. The results show that our adjustments to the Gaussian reconstruction process effectively mitigated the impact of noisy input geometry on the harmonization quality.

W3: Motivation for Hilbert Curve vs. Standard Transformer

Response: The motivation is to preserve spatial locality. A standard transformer requires the input sequence order to be meaningful. For an unordered set of 3D points like Gaussians, an arbitrary ordering provides no useful spatial bias for the attention mechanism. The Hilbert curve imposes a meaningful order, ensuring that Gaussians that are close in 3D space are also close in the 1D sequence, which aids learning.

For the "w/o Hilbert transform" ablation in Table 3, the architectural change is in the input preparation. Instead of Hilbert ordering, we simply take the list of N Gaussians and perform a naive reshape into a pseudo-image. This is equivalent to ordering them by their arbitrary initial index. The resulting performance drop (e.g., PSNR in the 3D task falls from 30.29 to 28.99) provides clear empirical evidence for the Hilbert curve's benefit. We will add this clarification to Section 4.4.

W4: Evaluation Metrics and User Study

Response: We agree that pixel-wise metrics have limitations for evaluating generative models. We use them because they are the most established quantitative standards for image restoration and translation tasks. To provide a more complete picture, we supplement these metrics with two crucial elements:

  1. Extensive Qualitative Results: We provide numerous visual comparisons in Figure 47 and Figure 46 in the supplementary material, which allow for direct perceptual judgment of quality.
  2. Formal User Study: The user study was conducted via an online questionnaire and we collected 374 valid responses from human participants. The study presented participants with side-by-side comparisons and asked them to choose which result demonstrated more realistic and consistent lighting and shadows. The two questions corresponded to the single-view and multi-view compositing tasks, respectively, with the results summarized in the pie charts on Figure 3 in the supplementary material.

W5: Training/Test Datasets and Zero-Shot Evaluation

Response: Thank you for the reminder, and we will make it explicit in Section 4.1.

  • Training Data: The model was trained exclusively on the "simple synthetic scenes" from our proposed DTC-MultiLight dataset, except for 50 scenes held out for evaluation.
  • In-Domain Testing: A held-out split of 50 "simple synthetic scenes" from DTC-MultiLight was used for in-domain evaluation, as shown in Figure 4&5 and Figure 5 in the supplementary material.
  • Zero-Shot Testing: All other evaluations were conducted in a zero-shot setting on entirely unseen datasets to test generalization. This includes:
    • our "complex synthetic scenes" (Figure 4&5 and Figure 5 in the supplementary material).
    • our "real captured scenes" (Figure 6).
    • FOSCom and Objects With Lighting (OWL) dataset (Figure 4 and Figure 4 in the supplementary material).
Comment

I thank the authors for the clarifications. I still have the following concerns:

W1: I understand the core motivation behind introducing a 2D stage before the 3D one. However, my concern lies in the assumption of a pre-aligned composite in the 2D application, which represents a limitation that should be explicitly clarified. Unlike this method, most 2D compositing approaches do not rely on such a constraint. If the authors claim that their 2D compositing stage can be used independently—as a standalone compositing method rather than solely as input to the second stage—this limitation needs to be discussed and acknowledged.

W3: As demonstrated in [1], a complete neural rendering engine can be implemented using transformer models, which challenges the claim that transformers inherently struggle with capturing spatial locality. While it's true that naively feeding Gaussian tokens as input may prevent the network from learning effective attention patterns, I believe that even a simple positional encoding based on the Gaussian center positions would serve as a stronger baseline. This would make a more meaningful ablation comparison when evaluating the benefits of using Hilbert curve-based localization. I believe such an ablation is needed to show how strong the Hilbert curve is compared to a naive transformer architecture, rather than randomly feeding tokens without a meaningful positional encoding.

[1] RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination; Zeng et al., 2025
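For illustration only, a minimal sketch of the positional-encoding baseline described above: a sinusoidal encoding of the 3D Gaussian centers concatenated to the token features of a standard transformer (frequency count and names are illustrative assumptions):

```python
import torch

def sinusoidal_encoding_3d(centers: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """centers: (N, 3) Gaussian centers -> (N, 6 * num_freqs) positional encoding."""
    freqs = (2.0 ** torch.arange(num_freqs, device=centers.device)) * torch.pi
    angles = centers.unsqueeze(-1) * freqs                 # (N, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

# Usage: concatenate the encoding with per-Gaussian features and feed the token
# sequence to a standard transformer encoder (no Hilbert reordering involved).
# tokens = torch.cat([gaussian_feats, sinusoidal_encoding_3d(centers)], dim=-1)
```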

Comment

Thank you for your detailed feedback.

W1: We understand your concerns. Our method does indeed require the assumption of a pre-aligned composite in the 2D application, which is a limitation. While earlier diffusion-based methods could adjust object-scene alignment through their generative capabilities, they also modified the object's shape, texture, and size. Our feed-forward approach, however, aims to efficiently and stably generate highlights and shadows for inserted objects while keeping them unchanged, which is more user-friendly for inserting objects in AR scenes.

W3: We agree that our ablation study on the Hilbert curve was insufficient. Following your suggestion, we will use a pre-trained point transformer to extract positional and color features from the Gaussian primitives, then merge them with the image features as input to the network. We will provide the results of this comparative experiment before the end of the discussion period.

Comment

W3 Experiment: We compare our method against using a Point Transformer to extract Gaussian features; the quantitative results are shown below. We train on the simple scenes of DTC-MultiLight for 40k iterations and test on simple, complex, and real-world scenes.

Method            | Simple Synthetic (PSNR / SSIM / LPIPS) | Complex Synthetic (PSNR / SSIM / LPIPS) | Real Captured (PSNR / SSIM / LPIPS) | Inference Time (s) | Memory (MB)
Hilbert curve     | 28.45 / 0.915 / 0.057                  | 27.64 / 0.887 / 0.073                   | 24.24 / 0.852 / 0.096               | 1.08               | 35.01
Point Transformer | 28.31 / 0.899 / 0.066                  | 27.70 / 0.885 / 0.074                   | 23.98 / 0.839 / 0.103               | 1.16               | 565.69

The results show that our method achieves superior quantitative results while maintaining its advantages in inference time and memory. Due to time constraints, we will incorporate the 100k-iteration comparison results into the final version of the ablation experiment.

Official Review
Rating: 5

This paper proposes a method for object composition and scene relighting. A large-scale synthetic 3D compositing dataset is collected to train a feed-forward model. The model adopts a two-stage framework, consisting of 2D image-level compositing and 3D scene-level compositing, which are connected via a Hilbert curve-based mapping. Extensive experiments demonstrate that the proposed method outperforms previous state-of-the-art approaches.

Strengths and Weaknesses

Strengths:

  1. The quality of the results appears strong and promising. The paper is well-written and easy to follow.
  2. The Hilbert curve-based mapping is a novel texture mapping method for 3D GS.
  3. The experiments are thorough and demonstrate that the method achieves state-of-the-art performance.
  4. The proposed large-scale synthetic dataset could be valuable to the community and serve as a strong benchmark for this task.

Weaknesses: No major weaknesses identified.

Questions

  1. It is suggested that the video rendering results be visualized.
  2. It is suggested that more details be provided regarding the mapping function based on the 3D Hilbert curve. From Fig. 3, it appears that the 3D Gaussians in the scene occupy only a small portion of the full space. How is it ensured that multiple 3D Gaussian primitives are not assigned to the same curve point? How are curve points that have no primitives assigned dealt with?

Limitations

yes

Final Justification

After reading the authors' responses to my questions and to the other reviewers, I can see the authors' efforts to address the issues and improve the paper. Although my biggest remaining concern is the visual quality of the video results, I believe the authors can handle this well in their revision, so I am inclined to maintain my score.

Formatting Issues

No formatting issues noticed.

Author Response

We thank you for your recognition of and suggestions for our paper. The detailed responses are shown below.

Q1: Visualization of Video Results

Response: Due to time constraints, we were unable to produce the final videos for the initial submission. We have since rendered several high-quality videos showcasing multi-object composition with smooth camera motion in complex scenes. These will be included in the camera-ready version of the paper.

Q2: Details of Hilbert Curve Mapping

Response: Thank you for these insightful questions. We will add a more detailed explanation of the mapping process to the revised paper.

  1. Space Occupancy and Collisions: The purpose of constructing the Hilbert curve is to preserve the spatial information of the sparse Gaussians, and we do not explicitly prevent multiple Gaussians from mapping to the same discrete point on the curve. Our process is as follows: we build a KD-Tree on the Hilbert curve's points for efficient lookup. Each Gaussian is then mapped to its nearest neighbor on the curve. The final 1D sequence is ordered first by position along the curve. If a "collision" occurs (multiple Gaussians map to the same point), their relative order is resolved using a stable sort based on their original index in the input list.
  2. Empty Curve Points: Curve points that have no assigned Gaussians are simply ignored. The final sequence passed to the transformer consists only of the ordered Gaussian features, with no gaps (see the sketch below).
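For illustration, a minimal sketch of this assignment procedure, assuming NumPy/SciPy and hypothetical array names (not the authors' implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def order_gaussians_along_curve(curve_points, gaussian_centers, gaussian_feats):
    """curve_points: (M, 3) discrete Hilbert-curve points; gaussian_centers: (N, 3);
    gaussian_feats: (N, C). Returns the features ordered along the curve."""
    # 1. KD-tree over the curve points for efficient nearest-neighbour lookup.
    tree = cKDTree(curve_points)
    _, nearest_curve_idx = tree.query(gaussian_centers)

    # 2. Order by position along the curve; collisions (Gaussians mapped to the same
    #    curve point) keep their original relative order via the stable sort.
    order = np.argsort(nearest_curve_idx, kind="stable")

    # 3. Empty curve points are ignored implicitly: only the ordered Gaussian
    #    features are kept, with no gaps in the output sequence.
    return gaussian_feats[order], order
```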
Comment

Thanks for your reply. It has resolved my questions. I tend to maintain my score.

Official Review
Rating: 4

MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation introduces a framework for image harmonization in the context of object compositing in 2D or 3D. Given a composited image or a sequence of images from a 3D scene, the original background image, corresponding depth maps, and camera poses, the pipeline first performs 2D adjustments on each image individually using a transformer network, before transforming the results into point clouds and subsequently Gaussians, which are projected using a 3D Hilbert curve before being processed by a second transformer. The output color information of the second network is projected back into Gaussian space (using the inverse Hilbert projection) and then rendered to apply a photometric loss or to extract the final result image. For 2D input, the second part of the pipeline can be omitted. The method is evaluated on the FOSCom and Objects With Lighting benchmarks as well as a newly curated dataset. The proposed method achieves state-of-the-art performance and shows visually promising results, even on self-emitting objects, which are often not well represented in prior work.

Strengths and Weaknesses

Strengths:

  • Fast inference that proves to be effective for both 2D and 3D inputs.
  • Quantitative results look promising. Interestingly also the
  • Extensive evaluation including multiple open benchmarks and captured real-world data.

Weaknesses:

  • The motivation for the individual components could be made clearer in the method section, maybe also incorporating prior works. The ablation study (sec.4.4) is the only part where some motivation is given for the component choices.
  • The Gaussian representation needs to be initialized for each scene which takes about a minute extra. This means that the fast inference only pays off if many views of a single scene are processed.
  • Long training times. The proposed pipeline seems complex, featuring two transformer networks and multiple representation changes.
  • Image resolution is small (256 x 256) limiting the applicability.
  • Text clarity could be improved in some areas. E.g., the first sentence of the abstract (l.3) could introduce object compositing with some motivation; Section 3.1, l.117 starts with an incomplete sentence (or a typo at "Begin"?). Overall, it is not that clear what types of inputs are supported during inference and whether a Gaussian representation needs to be fitted to the input at all times. For the single-view evaluation, only the 2D network is run, isn't it? The training dataset used should be mentioned in the training section.
  • It seems like the model has only been trained on scenes featuring a table with objects placed on top. It is not clear how well this generalizes to other scene content.

Overall, the paper proposes a complex but effective method to harmonize composited scenes in 2D or 3D that is limited by the low processing resolution and the need for camera pose data.

Questions

  • How well does the method generalize to arbitrary scenes that don't feature a table with objects placed on top (e.g. landscape background and some foreground objects or even people)?
  • What is the "iterative bias" (abstract, l.12) of diffusion-based methods in this context? Do you mean the exposure bias? Overall it seems like diffusion models can generate very high quality results by now, especially for tasks like image manipulation and generation.
  • How can the image resolution be increased? 256 × 256 is very small for today's image processing pipelines.
  • Can the training times be reduced by using more available pre-trained components (especially the 2D object network)?
  • What perceptual loss is used during training (sec. 3.4)? Is it LPIPS as during evaluation?
  • Would the method work with a single image input and monocular depth?
  • What is the user study documented in the supplementary material? How many subjects were asked? Were they human? If so, the NeurIPS checklist also needs to be updated. I would suggest either deleting the study or discussing the results properly.

Limitations

Limitations are (only) discussed in the supplementary material.

Final Justification

Given the other proposed refinements from the rebuttal, specifically the motivation for the Hilbert Curve mapping and choice of losses, the information on the user study and the revised text including a clearer overview of model inputs, I can see the work fit for acceptance to the conference. The dependence and relationship to the input 3D representation should be discussed in more detail as it seems to be the main limitation.

Formatting Issues

No concerns.

Author Response

We thank you for all of your valuable comments on our paper. The detailed responses are shown below.

W1: Motivation for Individual Components

Response: We clarify the motivation for the two key components below.

  • Swin Transformer: The Swin Transformer introduces shifted-window attention, achieving linear computational complexity while maintaining hardware efficiency through its hierarchical pyramid structure. It demonstrates superior performance over the original transformer across multiple vision tasks (+2.7 box AP on COCO detection and +3.2 mIoU on ADE20K segmentation), while also requiring fewer FLOPs (-45.5%) and parameters (-35.8%) at the same input resolution. Consequently, it serves as the backbone for several widely used models such as Grounding DINO and DINO-X. In our work, we adopt the Swin Transformer as the backbone for both our 2D and 3D object compositing models. (The quantitative comparisons are from the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".)
  • Hilbert Curve Mapping: Inspired by recent advances in processing point clouds with transformers, such as Point Transformer V3, we utilize the Hilbert curve because it maximally preserves spatial locality, meaning that points that are close in 3D remain close in the 1D sequence. The ordered sequence is then folded into a 2D grid, allowing our transformer to effectively process the geometric structure of the scene. The ablation study (e.g., PSNR in the 3D task falls from 30.29 to 28.99) provides clear empirical evidence for the Hilbert curve's benefit.

W2: Concerns about Gaussian Initialization

Response: We appreciate the reviewer's concern regarding efficiency. We clarify two key points:

  1. The Gaussian initialization is a one-time, per-scene preprocessing step, not a per-inference overhead. We clarify that the one-minute training for the Gaussians was chosen to achieve the highest reconstruction quality, and we can trade off this training time for quality. To demonstrate this, we conducted a comparison experiment, shown below. Once this 3D scene representation is constructed, our framework can perform harmonization for any number of viewpoints, or even for newly inserted objects, at a rapid speed (e.g., 0.07 seconds per frame). This initial cost should be viewed as analogous to loading or preparing a 3D asset, after which real-time interaction becomes possible.
  2. Our method is fully compatible with the latest advancements in feed-forward 3D reconstruction. Recent geometry foundation models can generate 3DGS representations directly from sparse inputs in a single feed-forward pass. To demonstrate this, we conducted experiments using AnySplat, which generates a 3DGS scene from input frames without per-scene optimization, drastically reducing the initial setup time.
Paradigm     | Method           | Simple Synthetic (PSNR / SSIM / LPIPS) | Complex Synthetic (PSNR / SSIM / LPIPS) | Real Captured (PSNR / SSIM / LPIPS) | Time (s)
Per-scene    | 3DGS (#10k iter) | 30.29 / 0.960 / 0.030                  | 30.13 / 0.952 / 0.027                   | 26.39 / 0.927 / 0.040               | 64.8
Per-scene    | 3DGS (#1k iter)  | 28.43 / 0.926 / 0.076                  | 28.54 / 0.895 / 0.092                   | 25.24 / 0.899 / 0.081               | 6.5
Per-scene    | 3DGS (#250 iter) | 26.50 / 0.892 / 0.115                  | 27.16 / 0.869 / 0.126                   | 22.03 / 0.853 / 0.148               | 1.6
Feed-forward | AnySplat         | 22.28 / 0.705 / 0.211                  | 30.03 / 0.935 / 0.050                   | 24.46 / 0.864 / 0.095               | 0.8

The results reveal a clear trade-off between reconstruction paradigm, speed, and scene complexity. AnySplat achieves the fastest reconstruction time at under one second and delivers quality nearly on par with the highest-quality, per-scene optimized 3DGS in complex, forward-facing scenes. However, its performance significantly degrades in simple, surround-view scenes, likely due to the inherent challenge of jointly estimating camera poses. Meanwhile, reducing the training iterations of vanilla 3DGS to 1k produces comparable results in just a few seconds, offering a practical compromise between the instant inference of AnySplat and the lengthy optimization of a full 10k-iteration run.

W3: Training Times and Model Complexity

Response: We acknowledge that the training time (15+3 = 18 days on 16 A100 GPUs) is quite long. However, this is standard for training large-scale, feed-forward vision models from scratch on a new, large dataset. For context, comparable large-scale models require similar budgets: VGGT requires 9 days of training on 64 A100 GPUs, and Fast3R requires 6.13 days of training on 128 A100 GPUs. In terms of total GPU-hours, our training budget is in a similar range.

Regarding complexity, while our pipeline involves two stages, the core transformer architecture is nearly identical for both the 2D and 3D models, as illustrated in Figure 2 of our supplementary material. We propose a unified architecture for both 2D and 3D tasks with different data representations.

W4&Q3: Image Resolution

Response: Our model supports inference at a resolution of 2K, as shown in Figure 7. Furthermore, we are also conducting training with higher resolution (512x512) and will release the pretrained models with different resolutions.

W5: Text Clarity and Supported Inputs

Response: We thank the reviewer for these actionable suggestions. We will incorporate all proposed textual edits, including adding a definition for object compositing in the abstract and correcting the typo in Section 3.1. We will also clarify the supported input modalities and the training data details.

  • Supported Inputs:
    1. Single-view image: For a single view input, only the 2D Object Compositing model is used.
    2. Multi-view images with known poses: This is the primary use case for our full pipeline, where the 3D model leverages the 3DGS representation to enforce multi-view consistency.
    3. Multi-view images without known poses: This is also supported as an extension. As demonstrated in our supplementary material (Section A.7), camera poses can first be estimated with VGGT, after which our pipeline can proceed.

The training dataset we used consists of the simple synthetic scenes from our proposed DTC-MultiLight dataset. The complex synthetic scenes and real captured scenes are used for out-of-domain evaluation.

W6&Q1: Generalization to Arbitrary Scenes

Response: We apologize if this was unclear. While we train models on tabletop scenes (simple synthetic scenes), our evaluation is deliberately performed on a much wider variety of out-of-domain content to test generalization:

  • Complex Indoor Scenes: Figure 4 and Figure 4&5 in the supplementary material show strong performance on the Objects With Lighting (OWL) dataset and complex synthetic scenes of DTC-MultiLight dataset, which features complex indoor environments.
  • Diverse Outdoor Scenes: Figure 4 in the supplementary material shows qualitative results on the FOSCom dataset, which includes varied scenes such as outdoor streets.
  • Real-World Captures: Figure 6 demonstrates robust performance on casually captured real-world scenes that differ significantly from our synthetic training data.

Q2: Clarification of "Iterative Bias"

Response: Yes, we mean the exposure bias. While these models produce high-quality textures, their stochastic and iterative nature can lead to results that are visually plausible but physically inconsistent, especially for precise effects like shadows and highlights.

Q4: Use of Pre-trained Components

Response: Thank you for your suggestion. We agree that using available pre-trained components, as VGGT does, is one way to accelerate training. However, our network backbone is a specific architecture that takes a multi-channel input tensor concatenating the composite image, background, and depth map. Standard pre-trained weights from datasets like ImageNet are trained on 3-channel RGB images and would not be directly applicable. Meanwhile, the hyperparameters of our network architecture also differ.

Q5: Perceptual Loss Details

Response: The perceptual loss Lp used during training is a VGG19-based feature loss, not the LPIPS metric. While both are derived from VGG features, LPIPS involves an additional learned linear layer on top of the features to better align with human perceptual judgments. Our training loss uses a standard L2 distance in the VGG feature space, a common and effective practice for image-to-image tasks.
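For illustration, a minimal sketch of a VGG19 feature-space L2 loss of this kind (the layer selection and equal weighting are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class VGG19PerceptualLoss(nn.Module):
    """Plain L2 distance between frozen VGG19 feature maps (no LPIPS linear layers).
    Inputs are assumed to be ImageNet-normalized RGB tensors of shape (B, 3, H, W)."""

    def __init__(self, layer_ids=(3, 8, 17, 26)):  # relu1_2, relu2_2, relu3_4, relu4_4 (assumed choice)
        super().__init__()
        features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        self.features = features
        self.layer_ids = set(layer_ids)
        self.last_layer = max(layer_ids)

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean((x - y) ** 2)  # L2 in VGG feature space
            if i == self.last_layer:
                break
        return loss
```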

Q6: Monocular Depth Input

Response: For all our experiments on public datasets like FOSCom and OWL where ground-truth depth was unavailable, the depth maps were generated by Depth Anything v2.
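As an illustration of this preprocessing step, a minimal sketch using the Hugging Face `transformers` depth-estimation pipeline (the checkpoint id and file paths are assumptions, not taken from the paper):

```python
from transformers import pipeline
from PIL import Image

# Hypothetical checkpoint id; any Depth Anything V2 checkpoint on the Hub would do.
depth_estimator = pipeline("depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("composite_view.png")           # hypothetical input image
result = depth_estimator(image)
result["depth"].save("composite_view_depth.png")   # relative depth map as an image
```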

Q7: User Study Details

Response: The user study was conducted via an online questionnaire and we collected 374 valid responses from human participants. The study presented participants with side-by-side comparisons and asked them to choose which result demonstrated more realistic and consistent lighting and shadows. The two questions corresponded to the single-view and multi-view compositing tasks, respectively, with the results summarized in the pie charts on Figure 3 in the supplementary material.

Comment

Thank you for the detailed response. I would like to see a summary of the motivations as provided in the rebuttal also in the main paper. Backing up the choice of the Hilbert Curves with the experimental data helps the overall understanding.

Regarding W3: The main question here would be if you tried fine-tuning existing architectures. Giving some justification for the training effort in the final paper would be good, I think.

I would suggest to feature the user study more prominently in the paper and add some information on the participants. Given the other proposed refinements I'm happy to raise my score accordingly.

Comment

We appreciate your valuable suggestions and will make the following revisions in the final version:

  • We will add a summary of the motivations to the beginning of the method section.
  • Our model was trained from scratch, not fine-tuned. We will add the justification for our training effort to the implementation details section.
  • We also agree that user studies are an important indicator, so we will incorporate the results into the main text and elaborate on the specific details in the supplementary materials.
Official Review
Rating: 5

The paper presents a method for inserting novel objects into existing scenes such that shading, shadows, occlusions, and interreflections are properly modeled. Instead of performing intensive inverse rendering, the approach uses a feed-forward prediction network trained on many (about 480k) synthetic object insertion tasks with different illumination. The authors present a 2D pipeline where insertion into a single image is possible, as well as a multi-view 3D setup where the resulting color adjustment of the newly inserted object, as well as of the existing parts of the scene, is predicted by altering the colors of a 3DGS representation. Here, the 3D neighborhood of Gaussians is mapped into 2D by use of 2D and 3D space-filling Hilbert curves. In both the 2D and the 3D object compositing models, a Swin ViT block is trained to predict the necessary scene alterations. Extensive comparisons with other existing approaches, including diffusion and inverse rendering attempts, indicate vastly better output quality for both synthetic and real-world scenes at very fast inference speeds (0.07 s).

Strengths and Weaknesses

Strengths

  • The paper addresses the problem of object compositing as a pure feed-forward approach by training on a massive synthetic dataset.
  • The presented results are highly convincing as long as the depth and the background mask are given.
  • The inference speed is way faster than existing approaches.
  • The object insertion data set is interesting.
  • The proposed mapping from 3D Gaussian to 2D maps using 3D and 2D Hilbert curves to train a 2D SWIN ViT for a 3D task is highly original.
  • The paper is well written, with a clear motivation and a quite clear presentation of the approach.
  • The authors provide a quite exhaustive comparison to existing work.

Weaknesses

  • The authors do not discuss failure cases. It would be good to show some of the worst-performing examples and hint at why they might be challenging.

Questions

For the task of inserting luminous objects, which is also quite convincing: are there sufficiently many such cases in the original training dataset, or were those added later on?

Limitations

yes.

Final Justification

After reading the other reviews and seeing the additional discussion necessary on the initial 3D alignment, I would slightly lower my score to 5.

Formatting Issues

none.

Author Response

We thank you for your appreciation of and suggestions for our paper. The detailed responses are shown below.

W1: Discussion of Failure Cases

Response: Thanks for your excellent suggestion. We agree that a transparent discussion of limitations is essential. Due to space constraints, the limitations are presented in Section A.8 of the supplementary material. We also summarize the major failure cases as follows:

  • Shadow Precision: Our model can learn the general direction of highlights and shadows from the scene, but sometimes it fails to generate accurate and hard-edged shadows, as shown in the last column of Figure 5. The challenge lies in the fact that the information in the picture cannot determine the position of the light source, so creating sharp and accurate shadows is difficult.
  • Reflective Surfaces: Our model, trained primarily on diffuse materials, struggles to generate physically accurate reflections of composited objects on specular or mirrored surfaces. The challenge lies in capturing the characteristics of the materials from the images, and reflections and shadows are often easily confused.

Q1: Training Data for Luminous Objects

Response: Thank you for highlighting this. The original training dataset did not contain a large number of luminous objects. To specifically probe our model's ability to generalize to complex light transport phenomena, we curated a targeted dataset (about 1/3 of the scenes in the original dataset) where some objects were assigned emissive material properties. The strong performance on this task, as shown in Figure 7, even with relatively limited exposure during training, demonstrates the powerful generalization capability of our framework. We will clarify this in Section 4.3.

Comment

Thank you for your clarification. After reading the other reviews and seeing the additional discussion necessary on the initial 3D alignment, I would slightly lower my score to 5.

Final Decision

This paper considers the task of image harmonization for inserting objects into existing scenes (given by an RGBD image or set of RGBD images and camera poses) such that various aspects relating to lighting (including shadows, shading, interreflections, occlusions, and self-emission) are modeled realistically. The identified strengths include the quality of the results, the inference speed, and the approach for mapping unstructured 3D Gaussians into a structured 2D grid via Hilbert curves, which could potentially be used in more general settings for processing Gaussian mixtures. The identified weaknesses mirror some of these strengths. Most reviewers identified the motivation and effectiveness of the Hilbert curve approach, especially compared to standard positional encoding approaches, as a cause for potential concern. Other concerns include the inference time, after considering the time required for per-scene initialization, the lack of failure cases, the lack of video results, and the hard requirement of perfectly scene-aligned composited objects.

The initial ratings were quite divergent (two borderline rejects, an accept and a strong accept). There was a considerable amount of discussion between authors and reviewers, which ultimately led the reviewers to form a consensus to accept the paper, contingent on several recommended revisions. The authors have consented to the following revisions:

  1. Providing the motivations for the main design choices (Hilbert curve, Swin Transformer), as outlined in the author response;
  2. Discussing the failure cases and assumptions (e.g., object alignment);
  3. Clarifying the data splits;
  4. Including the new Hilbert curve ablation experiment; and
  5. Including video results.

Given this commitment, the AC sees no reason to override the consensus of the reviewers and is satisfied by the technical merit of the paper and its potential to make an impact on the field.

However, in addition to the revision items raised by the authors in their final response, the AC requests that the revision acknowledge that the underlying idea of using a Hilbert curve to process unstructured data comes from prior work, such as [A, B], which should be cited. In addition, the AC concurs with reviewer UGzN that a stronger baseline would be a standard transformer with a 3D positional encoding based on the Gaussian center positions; i.e., lending an inductive bias of spatial locality during self-attention. It is not clear from the author response ("we will use a pre-trained point transformer to extract positional and color features from the Gaussian primitives") that the provided experiment does this. One would assume that the primary benefit of the proposed approach over a standard positional encoding is that the resulting 2D grid facilitates the use of efficient 2D convolutions and Swin transformer layers.

[A] Wu et al., Point Transformer v3: Simpler, Faster, Stronger, CVPR 2024

[B] Chen et al., Efficient point cloud analysis using Hilbert curve, ECCV 2022