LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering
the first scene-level language-embedded interactive radiance field; interaction datasets;
Abstract
Reviews and Discussion
The paper addresses the complex challenge of reconstructing and controlling multiple interactive objects in complex scenes from monocular videos without prior modeling of geometry and kinematics. This task is critical for advancing fields like virtual reality, animation, and robotics, where understanding and interacting with 3D environments are essential. The proposed framework decomposes interactive scenes into local deformable fields. This factorization allows for precise modeling of individual objects’ interactions. Additionally, a multi-scale interaction probability sampling strategy is introduced to accurately sample interaction-relevant 3D points within these fields, enabling effective control over object dynamics in complex environments. The interaction-aware language embedding method generates dynamic language embeddings that adapt to varying interaction states. This allows users to control interactive objects through natural language commands, enhancing the interface's intuitiveness and accessibility. Authors also contributed OmniSim and InterReal datasets. These datasets are the first to offer scene-level physical interaction data, comprising 28 scenes with a total of 70 interactive objects. They provide a valuable resource for evaluating the performance of interactive scene modeling methods.
Strengths
- The paper is well motivated and sets a clear difference from previous works. Results look good and promising. Figures and illustrations are helpful and informative.
- The factorization technique that decomposes complex scenes into local deformable fields allows for more granular control and precise modeling of individual object interactions within a complex scene, addressing the high-dimensional challenge that previous methods struggled with.
- By embedding language control within the interactive radiance fields, the framework allows users to manipulate and interact with 3D environments using simple language commands, greatly enhancing user accessibility and interaction fidelity.
- The authors provided abundant demos on their project page, which is helpful.
Weaknesses
- Might need to slightly enlarge texts in figures.
- Not necessarily a weakness, but the authors could consider visualizing some latent features (instead of illustrations like in Fig.2-4) to better show the decomposition of the high-dimensional feature space.
- Lacks some qualitative results on real-world and existing public datasets. Also, in the only InterReal qualitative results (Fig.11), K-Planes results were missing.
Questions
What is the memory cost of the proposed method? What is the finest granularity of interaction it can handle (i.e., operating on a very thin/small object)? How does the model handle repetitive objects in the scene? In Supp. Fig.16, the language query is "top cabinet", but it seems the model finds the microwave?
Limitations
Authors discussed limitations in terms of closed vocabulary, caused by OpenCLIP. Given that authors did not show much real-world scene manipulation results, potential limitations in real-world scenes should also be discussed as such scenes are in general more complicated. No societal impact was discussed.
Thank you for reviewing our paper and for the valuable feedback!
#Q1. Enlarge texts in figures.
Thanks. We have carefully revised the manuscript according to your comments.
#Q2. Visualization of some latent features.
We provide additional interaction feature visualizations of the x-, y-, and z-planes in Fig.1(a) of the Attached PDF to illustrate the latent feature distribution. It can be seen that the features are clustered around the spatial coordinates of interactive objects, corresponding to the local deformable fields in Sec.3.2 of the manuscript.
#Q3. More qualitative results on real-world and public datasets. K-Planes results were missed.
To our knowledge, existing view synthesis datasets for interactive scene rendering are primarily limited to a few interactive objects, falling short of scene-level interactive reconstruction. Fig.5 of the Attached PDF provides visual comparisons with CoNeRF[1] and CoGS[3] on the CoNeRF Controllable dataset. As the results show, LiveScene outperforms existing SOTA methods for interactive rendering, achieving higher-quality results. More detailed experiments will be provided in the revised version. We must emphasize that K-Planes[2] was originally designed for 4D reconstruction, so a direct comparison on controllable rendering would be neither suitable nor fair. Meanwhile, our enhanced baselines, MK-Planes and MK-Planes* (Line 218 of Sec.5 of the manuscript), require dense interactive variable inputs, which are unavailable in the InterReal and CoNeRF Controllable datasets, making the comparison with K-Planes infeasible.
#Q4. Memory cost of the proposed method
As shown in Table 1, we report GPU memory usage on the seq002_Rs_int sequence of the OmniSim dataset with official parameter settings. Notably, LiveScene (w/o language grounding) requires approximately 8 GB of GPU memory, which is lower than that of the other methods.
| Method | Batch Size | Ray Samples | Runtime Memory (A100) |
|---|---|---|---|
| CoNeRF | 1024 | 256 | 71931 MiB |
| MK-Planes | 4096 | 48 | 12781 MiB |
| MK-Planes* | 4096 | 48 | 12185 MiB |
| CoGS | 512 | —— | 25505 MiB |
| LiveScene w/o lang | 4096 | 48 | 8441 MiB |
#Q5. The finest granularity of interaction the method can handle.
The minimum granularity of controllable objects is hard to measure, as it depends on scene complexity, object number, and camera view. Our method still achieves precise control in extreme cases, such as the chest in the video demos at the anonymous link. In this work, we primarily focus on articulated objects in indoor scenes, and finer-grained control is still feasible as long as the dataset provides the relevant mask and control variable labels.
#Q6. Handle repetitive objects.
Good question. In fact, due to the limitations of CLIP's ability to understand spatial relationships, our method does not perform well in distinguishing between repeated objects. This limitation is common among 3D vision-language field methods, such as LERF[4] and OpenNeRF[5]. We’ll clarify this in the revised version.
#Q7. Annotation error in Supp. Fig16.
Thanks. It’s a typo, and the correct language query is "microwave".
#Q8. More discussion of real-world potential limitations and societal impact.
Beyond the existing discussions in Sec.6 of the manuscript, a known limitation in real-world scenarios is that occlusion between objects can degrade the interactive rendering quality. Besides, our method currently requires dense GT control variable annotation, which can be time-consuming and labor-intensive to obtain in real-world scenarios. We plan to explore sparse GT input methods (e.g., 3-frame annotation) to improve efficiency.
Regarding societal impact, our method is committed to building interactive simulators from real-world scenarios, providing real-to-sim interactive environments for Embodied AI, e.g., navigation, grounding, and action. However, as with all work that enables editable models, our method has the potential to be misused for malicious purposes such as deepfakes. We’ll clarify this in the revised version.
References
[1] CoNeRF: Controllable Neural Radiance Fields, CVPR 2022
[2] K-Planes: Explicit Radiance Fields in Space, Time, and Appearance, CVPR 2023
[3] CoGS: Controllable Gaussian Splatting, CVPR 2024
[4] LERF: Language Embedded Radiance Fields, ICCV 2023
[5] OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views, ICLR 2024
They introduce a language-embedded "interactive neural radiance field" that efficiently reconstructs and controls multiple objects within scenes. Factorization decomposes the scene into local fields that can each achieve local deformation.
Strengths
My sense is that the technical novelty of this paper is high, though I'm not an expert in this domain. Additionally the evaluation seems thorough and the method seems well-considered.
Weaknesses
At times, I feel the language of the methods section can be made a little clearer. I had trouble following the motivation for a lot of the design decisions.
Questions
"As illustrated in Fig 2, interactive objects exhibit mutual independence and interaction features F_alpha unevenly distribute in the 3+alpha-dimensional interacive space \mathcal X and ggregate into cluster centers..." Figure 2 shows this? This was hard to parse and to see from figure 2. I guess what you're trying to say is isntead of storing the motion state variables at every x, you'd much rather define cluster centers . However, I don't understand why this means you can project the point in 3+alpha-space dowen to a 4-dimensional space.
Limitations
No issues here.
We sincerely appreciate your kind comments and voluntary review suggestions! We have carefully reviewed and corrected the entire manuscript to improve the paper's organization and presentation.
#Q1. Improve the presentation of the methods section
Thanks! We have carefully revised the manuscript to eliminate presentation issues.
#Q2. Explain the Interaction Space Factorization
Please refer to Fig.2 and Sec.3.2 of the manuscript, where we decode and render the interactive space using spatial coordinates and interaction variables as inputs. The basic idea is to divide the space into independent local deformable fields, each of which is interacted with through 4 coordinates (the 3 spatial coordinates plus one interaction variable). Specifically, we use an interaction probability decoder (an MLP operating on the plane features) in Fig.2 to predict the probability distribution of the ray samples, thereby determining the local deformable field region to which each sampling point belongs. In the interactive ray sampling, each sample is associated with a distinct time coordinate and interaction variable. Therefore, through probability maximization, we can convert this (3+α)-dimensional sampling into a (3+1)-dimensional sampling, wherein the index of the interaction variable is derived by maximizing the probability distribution of the decoder Θ. In this way, we map the interaction variables to the most probable cluster region in the 4D space and accomplish the "projection":
$$\Theta(\boldsymbol{\kappa}, \boldsymbol{\theta}_s(\mathbf{x}))_I,$$
where $\boldsymbol{\theta}_s(\mathbf{x})$ denotes the probability features at position $\mathbf{x}$, sampled from the 3D feature planes.
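For intuition, here is a minimal PyTorch-style sketch of this probability-maximization "projection"; the module name, feature dimension, and region count are illustrative placeholders under our own assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class ProbDecoder(nn.Module):
    """Toy interaction probability decoder: plane feature -> per-region logits."""
    def __init__(self, feat_dim: int, num_regions: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_regions)
        )

    def forward(self, theta_x: torch.Tensor) -> torch.Tensor:
        return self.mlp(theta_x)  # (N, num_regions) logits

def project_samples(theta_x: torch.Tensor, kappa: torch.Tensor, decoder: ProbDecoder):
    """Assign each 3D sample the interaction variable of its most probable region,
    i.e. reduce a (3+alpha)-D sampling to a (3+1)-D one via argmax."""
    probs = decoder(theta_x).softmax(dim=-1)   # (N, num_regions)
    region_idx = probs.argmax(dim=-1)          # (N,) most probable region per sample
    kappa_sel = kappa[region_idx]              # (N,) selected interaction variable
    return kappa_sel, region_idx

# Usage with random placeholders:
decoder = ProbDecoder(feat_dim=32, num_regions=5)
theta_x = torch.randn(1024, 32)   # probability features at sampled positions
kappa = torch.rand(5)             # one interaction variable per region
kappa_sel, idx = project_samples(theta_x, kappa, decoder)
```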
Additionally, we provide a more detailed figure illustrating the latent feature distribution in Fig.1(a) of the Attached PDF, and the learning process of the local deformable fields from 0 to 1000 training steps in Fig.4 of the Attached PDF. Please refer to the PDF for details.
This paper tackles an important problem in reconstructing interactive 3D scenes with language grounding. The authors proposed to use object-based modeling of different deformation fields over the dynamic NeRF pipeline and equip it with language embeddings for grounding interactions. The authors constructed two synthetic datasets OmniSim and InterReal for data collection and evaluations. Experimental results show that their methods significantly outperform prior methods on reconstruction and language grounding.
Strengths
- The problem of reconstructing interactive scenes with language grounding is an important topic to embodied AI, we have seen many related works that boost the development of robot perception.
- The construction of OmniSim and InterReal datasets could be beneficial for research in dynamic reconstruction and robotics.
- The authors showed significant performance improvement on their constructed dataset, outperforming existing articulated object reconstruction models by a large margin.
Weaknesses
- Despite the good motivation, one major concern about this paper is its poor presentation in terms of notations and details. Several key design details are omitted in both Sec.3/4 and the supplementary.
- How are disjoint regions defined? It seems from the start of the description that this knowledge is already given. How do we determine the number of subregions? Is any prior used for learning the deformation in each subregion? Jointly optimizing the belonging relationship of each point to each region and the deformation for each field is pretty difficult as far as I know.
- The notations used are confusing, especially in Sec.3.2, which is an important portion of text for understanding the methodology. In Eq.3, what does Θ mean? It is different from the one in Eq.2, but is it just another MLP prediction? How is it determined for each region?
- As for the dataset curation, will there be multiple objects being interacted with in each data sample (since you modeled many sub-deformation fields)?
- The authors should mention whether any priors were added to the implemented baselines as well, because methods like CoGS were not originally designed to handle multiple objects, if I'm understanding it correctly.
- The current dynamic reconstruction model still stays at the rendering level; some explorations on extending it to 3D meshes or simulated environments might provide more insights on using this model for future research.
Questions
See the weakness section.
Limitations
The authors have addressed the limitations.
Thank you for the constructive and voluntary review suggestions on both methodology and writing. We have carefully reviewed and corrected the entire manuscript, striving to eliminate any organizational issues and typos. We sincerely hope the proposed method and dataset in this paper will contribute to the field of interactive scene reconstruction.
#Q1. The definition and determination of disjoint regions 𝑅. How is the number of subregions determined?
As shown in Fig.3(b) of the manuscript, the basic idea of disjoint regions is that we divide a complex interactive space into regions, namely disjoint regions 𝑅, each with an independent local deformable field, where interactions within the local field are manipulated through interaction variables. As the reviewer pointed out, jointly optimizing the assignment of points to regions and the deformation of each field is challenging; hence, we utilize mask supervision with the focal loss in Eq.7 to segment regions during training:
$$\mathcal{L}_{focal} = -\alpha\,(1 - p_y)^{\gamma}\log(p_y),$$
where $p_y$ is the predicted probability, rendered from the interactive probability field, of the ground truth label $y$, $\alpha$ is the balancing factor, and $\gamma$ is the focusing parameter.
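For illustration only, a minimal PyTorch sketch of such a focal loss on rendered region probabilities is given below; the function signature, region count, and default α/γ values are assumptions, not the exact implementation of Eq.7.

```python
import torch

def focal_loss(probs: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss on rendered region probabilities.
    probs: (N, R) per-ray region probabilities rendered from the interactive
    probability field; target: (N,) ground-truth region indices from the masks."""
    p_y = probs.gather(1, target.unsqueeze(1)).squeeze(1).clamp(1e-6, 1.0)
    return (-alpha * (1.0 - p_y) ** gamma * p_y.log()).mean()

# Usage with random placeholders (5 hypothetical regions, 4096 rays):
probs = torch.softmax(torch.randn(4096, 5), dim=-1)
target = torch.randint(0, 5, (4096,))
loss = focal_loss(probs, target)
```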
The number, shape, and density of regions are gradually established during training by maximizing the probability distribution outputs of the interaction probability decoder shown in Fig.2 of the manuscript. We provide additional experiments in Fig.4 of the Attached PDF to illustrate the learning process of disjoint regions from 0 to 1000 training steps. The results demonstrate a clear trend that, as training advances, the proposed method progressively converges to the vicinity of the interactive objects, thereby establishing interactive regions. Furthermore, we provide interaction feature visualizations of the x-, y-, and z-planes in Fig.1(a) of the Attached PDF to illustrate the latent feature distribution. It can be seen that the features are clustered around the spatial coordinates of interactive objects, corresponding to the disjoint regions 𝑅.
#Q2. What does Θ in Eq.3 mean? Presentation of notations and paper organization.
Θ in Eq.3 represents the projection operation in Eq.2, mapping the interaction variables to the most probable cluster region. It is implemented with the interaction probability decoder (an MLP operating on the plane features) in Fig.2 of the manuscript. In addition, all regions are split by the probability distribution yielded by a single interaction probability decoder according to Eq.3. The interaction probability decoder is primarily trained with the constraints of the mask (focal loss in Eq.7) and RGB (rendering loss in Eq.6). We have carefully revised our manuscript to clarify the notations.
#Q3. Will there be multiple objects in each data sample?
Yes. Our method and dataset are designed for complex scenes containing multiple objects. Please refer to the anonymous link in supplemental Sec.C and Tab.5 for details.
#Q4. Any priors were added to the implemented baselines?
Yes, we introduce extensions of K-Planes[1], namely MK-Planes and MK-Planes*, which enable control capabilities by generalizing the space-time feature planes to planes that additionally span the interaction variables, as elaborated in Sec.5 of the manuscript. In addition, CoGS[2] is able to control multiple object elements according to the original paper. We emailed the authors and extended CoGS based on Deformable Gaussians[3] to tackle multiple interactive objects. We’ll clarify this in the revised version.
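To illustrate the general idea (this is a toy sketch under our own assumptions, not the actual MK-Planes code), the snippet below extends a K-Planes-style plane factorization from the (x, y, z, t) coordinates with an extra interaction dimension κ; the class name, resolution, and product-based feature combination are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionPlanes(nn.Module):
    """Toy K-Planes-style factorization over (x, y, z, t, kappa): one 2D feature
    plane per coordinate pair, combined by element-wise product (illustrative only)."""
    def __init__(self, res: int = 64, feat_dim: int = 16, num_dims: int = 5):
        super().__init__()
        self.pairs = [(i, j) for i in range(num_dims) for j in range(i + 1, num_dims)]
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res)) for _ in self.pairs]
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 5) in [-1, 1]; returns (N, feat_dim)
        feat = 1.0
        for plane, (i, j) in zip(self.planes, self.pairs):
            grid = coords[:, [i, j]].view(1, -1, 1, 2)                # (1, N, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feat = feat * sampled.squeeze(0).squeeze(-1).t()          # (N, C)
        return feat

planes = InteractionPlanes()
coords = torch.rand(1024, 5) * 2 - 1   # (x, y, z, t, kappa) samples
features = planes(coords)
```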
#Q5. Future explorations on extending it to 3D meshes or simulated environments.
Thanks. Building interactive simulations from real-world scenarios holds great promise, particularly for Embodied AI. In the future, we will further explore this area, including explicit scene representations such as 3DGS and meshes, as well as interactive generation, etc.
References
[1] K-Planes: Explicit Radiance Fields in Space, Time, and Appearance, CVPR 2023
[2] CoGS: Controllable Gaussian Splatting, CVPR 2024
[3] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction, CVPR 2024
Thank the authors for the clarifications. Now that the formulations are clear, I'm willing to increase my score to positive, hoping that the authors will refine the manuscript in future iterations.
Thank you for your constructive feedback and suggestions. We will continue to refine the paper and explore future research.
This paper proposed LiveScene, a NeRF-based approach to enable indoor-scale controllable scene reconstruction and novel view synthesis. By extending K-Planes with an object-aware multi-scale space factorization, scene-level 3D space with articulated objects can be modeled with motion patterns via densely collected monocular video with camera poses. On existing benchmark datasets and the newly proposed OmniSim/InterReal datasets, the proposed method LiveScene achieves the best overall performance.
Strengths
- The overall method is well motivated to address the more challenging indoor-scene-level controllable NeRF. The introduction of control variables and their spaces makes the overall training feasible.
- The extensive experiments as well as demos prove the effectiveness of LiveScene, both qualitatively and quantitatively.
Weaknesses
Though the overall results seem promising, I still have several concerns regarding the formulation and the lack of clarification on certain key aspects.
- The only additional attribute of the overall space is modeled as a (3+alpha)-dimensional space; how does it cope with time variations? Is the time dimension implicitly encoded within the control variables to cope with motions (such as opening/closing the door)? Or is there any explicit formulation of the time dimension?
- What is the potential maximum number of objects within the scene? And what are the potential limitations when scaling up to more objects? In Tab.6 of the supplement, 6 objects at most are validated. How about more diverse objects?
- Is it possible to encode more complex or fine-grained object control (e.g., opening the left side door of a double-door fridge), especially when the training data mainly contains the fully open and fully closed states? Specifically, I am wondering about the interpolation capability of the proposed interaction-aware feature space and its generalization capability to unseen but correlated states.
- As mentioned in Appendix D, 'Interaction Variable MSE' indicates that the interaction values are fully supervised and GT labels are also used during inference to enable control. It would be good to see, in practical cases without GT interaction values, what the performance degradations are, which could further strengthen the potential applications and reveal potential limitations.
Questions
I think overall this paper is quite interesting and effective. I would like to see more diverse demos to highlight its strengths and more clarifications on the details of this paper. Initially, the paper was a bit hard for me to follow clearly. Therefore, in addition to the concerns raised in the weaknesses section, I would encourage the authors to improve the organization by clearly describing the pipeline details and improving the correspondence between text and figures.
Limitations
Limitations have been partly addressed in the main paper.
#Q1. How to cope with time variations? Is the time dimension encoded within the control variables?
Yes. In our implementation, the timestep variables, 3D features, and interaction variables are hybridized and fed into the interaction probability decoder to yield the probability distribution, as shown in Fig.2 of the manuscript. It is critical to distinguish between 4D reconstruction and our interactive scene modeling. Unlike 4D reconstruction, where all properties synchronously vary over time, interactive scene modeling involves independent changes to individual objects, which is precisely why we require additional dimensions to model the scene. Hence, each object forms its own local 4D scene, where interactions within a local scene are manipulated through interaction variables.
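A minimal sketch of this input hybridization, assuming simple concatenation and illustrative dimensions (the fusion in the actual implementation may differ), could look like:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 32-D plane feature, scalar timestep, scalar interaction variable,
# and 5 candidate regions; all values are placeholders for illustration.
decoder = nn.Sequential(nn.Linear(32 + 1 + 1, 64), nn.ReLU(), nn.Linear(64, 5))

feat = torch.randn(1024, 32)   # 3D features sampled from the feature planes
t = torch.rand(1024, 1)        # timestep variable per sample
kappa = torch.rand(1024, 1)    # interaction variable per sample

logits = decoder(torch.cat([feat, t, kappa], dim=-1))  # (1024, 5) per-region logits
probs = logits.softmax(dim=-1)                         # probability distribution over regions
```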
#Q2. Potential maximum number of objects and limitations within the scene.
In the following table and Fig.1(b) of the Attached PDF, we validate the performance of LiveScene in scenarios with up to 10 complex interactive objects. Notably, our method demonstrates robustness in rendering quality, which does not degrade significantly as the object number increases. The number of objects is not a major limiting factor, and our method remains feasible as long as the dataset provides mask and control variable labels. In contrast, occlusion and topological complexity between objects do affect the reconstruction results, which will be discussed in the limitations section.
| object number scaling | #2 | #4 | #6 | #8 | #10 |
|---|---|---|---|---|---|
| PSNR | 34.19 | 33.39 | 32.64 | 32.63 | 32.09 |
| SSIM | 0.958 | 0.950 | 0.948 | 0.948 | 0.936 |
| LPIPS | 0.081 | 0.077 | 0.093 | 0.104 | 0.110 |
#Q3. Is it possible to control more complex or fine-grained objects? The interpolation and generalization capability of the interaction-aware feature.
Yes. In Fig.1(c) of the Attached PDF, we demonstrate the fine-grained control capability of LiveScene on a refrigerator and cabinet dataset without part-based labels. Our method can control a part of the object even though there are no individual part-based interaction variable labels. However, the effect is not entirely satisfactory, due to the lack of labels and CLIP's limited understanding of spatial relationships.
Additionally, we conducted an experiment to examine the method's interpolation capability for unseen yet correlated states. As shown in Fig.2 of the Attached PDF, we mask the camera view and control variable labels from 100% to 30% to increase the unseen yet correlated states. The results show that our method achieves good interpolation and generalization performance, as the image rendering quality remains stable between 100% and 40%. However, the algorithm deteriorates and causes artifacts, as the perspective and labels are increasingly missing in the last column of the table and picture in Fig.2 of the Attached PDF. Hence, the proposed feature interpolation can only maintain interaction and view consistency but does not completely address the extreme view missing issue. We’ll clarify this in the limitation section of the revised version.
#Q4. GT labels during inference control and performance degradations without GT interaction values.
The interactive variable labels are only required during training, not during inference control. As shown in Fig.2 of the Attached PDF, the decrease in interactive variable density only starts to have an adverse impact once it reaches a certain threshold (40%~30%), which has been clarified in the limitations (Sec.6) of the manuscript. Additionally, we provide an experiment in Fig.3 of the Attached PDF which compares the rendering results with and without GT interaction variables. Note that we only mask the GT interaction variables but provide RGB supervision in both settings. According to the results, even without interaction variable supervision, the proposed method can still achieve satisfactory rendering quality (31.45 vs 31.58 in PSNR) but loses the ability to control. Specifically, our method is unable to open the dishwasher and degenerates into 4D reconstruction without any interaction variable supervision. We have carefully revised our manuscript and will provide more detailed experiments to illustrate this.
#Q5. More diverse demos to highlight and more clarifications on details of the paper.
Thanks for your kind comment. We have provided a project page in supplemental Sec.C with an anonymous link to demonstrate the interactive scene reconstruction and multimodal control capabilities. Please refer to the anonymous link for more information. Besides, we have carefully revised our manuscript to eliminate all typos and presentation issues.
We sincerely thank reviewers #PajG, #aZ7V, #a4DY, and #94zG for their thoughtful and constructive comments and suggestions. We have carefully revised our manuscript according to their comments. An attached one-page PDF is provided to show additional experiments, which can be summarized as:
- Visualization of the x-, y-, and z- interaction feature planes to illustrate the interaction feature distribution.
- A complex scenario illustration with up to 10 complex interactive objects.
- Part-level controlling experiment to show the fine-grained controlling ability.
- Rendering quality and visualization as the supervision of view and control variables decreases from 100% to 30%, to show the effectiveness of feature interpolation.
- Rendering results with and without GT interaction variables.
- Visualization of disjoint regions (local deformable field) learning process from 0 to 1000 training steps.
- More View Synthesis Comparison results on the public CoNeRF-Controllable dataset.
In the original supplemental materials (supplemental Sec.C), we provide an anonymous project page to demonstrate the detailed datasets, interactive scene reconstruction, and multimodal control capabilities of LiveScene. Please refer to the anonymous link for more information.
Thank you to the Area Chair and reviewers for the voluntary review suggestions. We are pleased to see that the reviewers have acknowledged the contributions and effectiveness of our work. Specifically, Reviewers #PajG and #a4DY consider our paper to be well motivated and to set a clear difference from previous works, Reviewer #aZ7V believes that the proposed OmniSim and InterReal datasets could be beneficial for research in dynamic reconstruction and robotics, and Reviewer #94zG thinks that our paper has high novelty.
At the same time, the reviewers have also raised some concerns, such as real-world qualitative results, baseline implementation, GPU consumption, and paper presentation. We have carefully addressed these concerns in our rebuttal, providing detailed explanations and additional experiments, and thoroughly polishing the entire paper. We hope this helps to clarify the points raised by the reviewers. If there are any remaining questions, we would be delighted to provide further explanation.
Once again, we sincerely thank the reviewers for their time and effort.
The paper received very positive scores from the reviewers. They appreciated the problem, the proposed datasets, and performance improvements done in this work. The authors addressed most of the limitations during the discussion period. AC agrees and recommends acceptance. Congrats!