PaperHub
ICLR 2024 · Poster · 5 reviewers
Average rating: 7.0 / 10 (min 5, max 10, std 1.8)
Individual ratings: 10, 8, 6, 5, 6
Average confidence: 4.4

LEAP: Liberate Sparse-View 3D Modeling from Camera Poses

OpenReview · PDF
Submitted: 2023-09-16 · Updated: 2024-03-28

Abstract

Are camera poses necessary for multi-view 3D modeling? Existing approaches predominantly assume access to accurate camera poses. While this assumption might hold for dense views, accurately estimating camera poses for sparse views is often elusive. Our analysis reveals that noisy estimated poses lead to degraded performance for existing sparse-view 3D modeling methods. To address this issue, we present LEAP, a novel pose-free approach, thereby challenging the prevailing notion that camera poses are indispensable. LEAP discards pose-based operations and learns geometric knowledge from data. LEAP is equipped with a neural volume, which is shared across scenes and is parameterized to encode geometry and texture priors. For each incoming scene, we update the neural volume by aggregating 2D image features in a feature-similarity-driven manner. The updated neural volume is decoded into the radiance field, enabling novel view synthesis from any viewpoint. On both object-centric and bounded scene-level datasets, we show that LEAP significantly outperforms prior methods when they employ predicted poses from state-of-the-art pose estimators. Notably, LEAP performs on par with prior approaches that use ground-truth poses while running $400\times$ faster than PixelNeRF. We show LEAP generalizes to novel object categories and scenes, and learns knowledge that closely resembles epipolar geometry.
Keywords
3D Reconstruction, Sparse-view 3D, Generalizable NeRF, Pose-free, Camera Pose

Reviews and Discussion

Official Review
Rating: 10

Summary: The paper proposes a 3D modelling method that predicts a NeRF volume from sparse input views without requiring camera poses at inference time.

Method: The main contribution is an attention-based structure and a feature volume that associate image features across all views and between 2D and 3D. Essentially, the method is similar to PixelNeRF and IBRNet, but does not need camera poses at inference time.

Evaluation: The evaluation is comprehensive and convincing. It would be better to add some cross-dataset evaluation.

Strengths

  1. The paper is well written and easy to follow.
  2. Code is provided in supplementary.
  3. The proposed method is novel. It alleviates the need for camera poses while remaining simple and effective, by applying attention across image features from all source views and attention between a feature volume and the source views.
  4. The evaluation is convincing. The paper shows clear improvements compared with previous methods and offers extensive analysis and discussion.

Weaknesses

While the experiments are convincing, it appears that all experiments are trained and evaluated on train/test splits of the same datasets. It would be interesting to see cross-dataset performance and a comparison with other methods, i.e., training on dataset A and evaluating on dataset B.

A minor issue: missing a reference to Sajjadi, Mehdi SM, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lučić, and Klaus Greff. "RUST: Latent Neural Scene Representations from Unposed Imagery." In CVPR 2023.

Questions

See weakness section.

Author Response

We appreciate the highly positive feedback! Thank you for acknowledging that our paper is well written, the method is novel, and the experiments are convincing. We address your comments as follows.

  1. [Cross-dataset evaluation] Thanks for pointing this out. Cross-domain generalization is important for real-world applications, and we appreciate your insight. We study the cross-dataset transfer capability in Table 1 of the paper, where we train on 13 ShapeNet categories (Kubric-ShapeNet-seen) and test generalization on 10 novel categories (Kubric-ShapeNet-novel). LEAP demonstrates robust generalization in this test, achieving ~3 dB higher PSNR than the next-best method, FORGE. To further address your concern, we include another cross-domain evaluation, where the model is tested on Google Scanned Objects (GSO) [1], collected from another domain. LEAP again performs better than FORGE in this setting (24.12 vs. 22.78 PSNR). We will add more discussion in the revision accordingly.

[1] Downs, Laura et al. “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.” ICRA 2022.


  2. [Minor] Thanks for pointing out the related work. We will cite the paper and add a discussion. RUST removes the need for target camera poses by prompting the model with a partial target image, which makes SRT more applicable in real applications.
Official Review
Rating: 8

This paper describes LEAP, a method for novel view synthesis from 5 or fewer images. Unlike other novel view synthesis methods, LEAP does not require camera poses to be provided at test time. However, LEAP still requires camera poses at training time.

LEAP generates novel views using a sequence of image- and voxel-based attention layers. Given a set of input views, it arbitrarily designates one view as the canonical view. It uses this view to define a world coordinate space in which novel views can be queried. After having picked the canonical view, LEAP encodes the context views, alternating between self-attention among all views and cross-attention between queries from the non-canonical views and keys/values from the canonical view. It then lifts the resulting image features to 3D via cross-attention between learned (shared across scenes) query tokens in a 3D voxel grid and key/value tokens from the images. Finally, it maps the voxel grid's features to a density and renderable feature at each location, uses NeRF-style volumetric rendering to composite these features, and finally converts the rendered features to an RGB image using a 2D convolutional network.

LEAP is trained using a mean-squared error loss on RGB values, an image-based perceptual loss, and a mask loss (if masks are available). It is important to note that although the rendering process itself does not require camera poses, the training process does, since the views that are rendered to compute these losses are queried using known poses.
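To make the 2D-to-3D lifting step concrete, the sketch below shows cross-attention between learned voxel-grid query tokens and multi-view image tokens, as described above. The class name, feature dimension, grid size, and toy shapes are illustrative assumptions, not LEAP's released code.

```python
import torch
import torch.nn as nn

class LiftingLayer(nn.Module):
    """One 2D-to-3D lifting block: learned voxel queries cross-attend to image tokens."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vol_tokens, img_tokens):
        # vol_tokens: (B, V**3, C) voxel-grid query tokens (shared across scenes)
        # img_tokens: (B, N_views * H * W, C) encoded multi-view image features
        upd, _ = self.attn(query=vol_tokens, key=img_tokens, value=img_tokens)
        vol_tokens = self.norm(vol_tokens + upd)
        return vol_tokens + self.mlp(vol_tokens)

# Toy usage with assumed sizes: 5 views of 16x16 feature maps, a 16^3 voxel grid.
B, V, C = 2, 16, 256
vol = torch.randn(1, V ** 3, C).expand(B, -1, -1)   # learned queries, broadcast over the batch
imgs = torch.randn(B, 5 * 16 * 16, C)
lifted = LiftingLayer(C)(vol, imgs)                  # (B, V**3, C)
```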

Strengths

  • LEAP is among the first methods that perform novel view synthesis without known poses at test time. While SRT also performs novel view synthesis without known poses at test time, LEAP has better results on object-level scenes and guarantees consistency between viewpoints by using an explicit 3D representation.

  • The authors provide video results that convincingly show LEAP working qualitatively on a large number of objects.
  • LEAP outperforms the provided baselines on object-level scenes.
  • The description of the method is clear and easy to follow. Additionally, the authors provide concise code for the LEAP model that is helpful for understanding the details.

Weaknesses

  • The authors claim that they significantly outperform prior methods on scene-scale datasets. However, LEAP only presents a comparison on the small DTU dataset. This comparison is somewhat flawed, since it prevents the authors from benchmarking against SRT, which is focused on scene-scale novel view synthesis. It would be much more convincing to see results on MultiShapeNet, which is a publicly available scene-scale dataset that SRT is trained and compared on. Showing more scene-scale results seems particularly important because LEAP's use of a 3D voxel grid with fixed extents seems likely to limit its usefulness on scene-scale datasets. It is also worth noting that the scenes in the DTU dataset have small camera baselines, meaning that classical pose estimation would produce near-perfect estimated poses on DTU. Using a wide-baseline dataset for scene-scale comparison (where pose estimation is more likely to fail) would be more convincing.
  • Figure 8 appears to be flawed and somewhat misleading. The experiment setup, as I understand it, is that the authors feed a synthetic scene consisting of a single dot projected onto five images into a trained LEAP network. The authors show that the neural volume's density is highest along the ray that corresponds to the dot's location in the canonical view. This is good evidence that the network has "understood" the meaning of the canonical view and has learned to map pixels to rays within the volume. However, the authors claim that the network then "leverages the multi-view information to resolve the depth ambiguity of the ray." Yet estimating pose given images of a single dot is impossible, since the scene's appearance is invariant to rotation around the dot. Since pose and correspondence are both needed to correctly place a 3D point, this means that in this case, the network cannot actually be doing what the authors claim. It seems much more likely that because the training dataset consisted of objects/surfaces that were all at roughly the same depth, the network is simply placing the dot near the mean of those depths.
  • The authors should explicitly clarify both in the intro and the methods section that their method requires camera poses at training time. Currently, the first sentence of section 3 might suggest that their method doesn't require poses at training time either.
  • The authors cannot both claim that their method is the first to introduce the pose-free paradigm (first paragraph, page 4) while also stating that the SRT already introduced the same paradigm (which it very much did). The authors should soften that claim to saying that they merely implement a new way of solving this pose-free problem, rather than introducing it.

Minor Points

  • The authors state that the 2D-to-3D step happens in a coarse-to-fine manner, but do not describe how. It seems that the network emergently shows coarse-to-fine behavior (boundaries getting sharper during successive lifting attention layers), not that there's an upscaling operation between lifting attention layers.
  • The paper contains a large number of typos and grammatical errors that need to be fixed (too many to list).
  • In figure 2, the colors for fusion with high weight and fusion with low weight are extremely similar when printed.
  • It would be worth mentioning "RUST: Really Unposed SRT," "View Matching Neural Radiance Fields," and "GAN-Based Neural Radiance Field Without Posed Camera" in the related works section.

Questions

  • Was the SRT used for comparison trained for long enough? In particular, how many training steps were used for the SRT? The SRT results shown here are significantly blurrier than those shown in the SRT paper.
  • Is rendering features and decoding them with a CNN to produce the final image necessary? I would be curious to see an ablation regarding this design choice.
  • Does LEAP show any robustness to noisy views at training time?
  • How does the cross-attention layer (equation 1) compare with simply concatenating learnable vectors for canonical vs. non-canonical views to the transformer tokens when using self-attention (equation 2)?
  • Can a trained LEAP network be fine-tuned to estimate pose? It would be very interesting to see if a pose estimation network/head could be trained on top of LEAP image features from just before the 3D lifting step. The RUST paper contains a similar experiment.

Please also address the weaknesses in the "weaknesses" section.


I thank the authors for addressing my questions. I have updated my score accordingly, and recommend acceptance of the paper!

Author Response

We highly appreciate the detailed feedback! Thanks for acknowledging that the paper provides convincing results. We will address your concerns as follows. (1/2)

  1. [Comparison with SRT] Thanks for the suggestion. We described the limitations of LEAP on unbounded scenes on page 9 of our paper. We further clarify the differences from SRT: i) LEAP targets 3D modeling -- it uses a 3D representation that can export the density field and render depth (with evaluation results shown below). Dealing with unbounded scenes is a long-standing problem for 3D-aware methods [1] and is not our current focus. SRT targets novel view synthesis based on 2D representations; it can be applied to unbounded scenes, but it cannot produce 3D geometry directly. To address your concern, we will restrict the scene-level claims to bounded scenes as you suggested. ii) The absence of SRT in the DTU experiments is because SRT requires large amounts of training data and cannot adapt to the small DTU dataset -- this is a property of SRT rather than an intentional exclusion. In contrast, LEAP works better with limited training data. All comparisons on objects and bounded scenes are fair and extensive, which is acknowledged by the other reviewers and demonstrates the effectiveness of LEAP. We will clarify these points in the revision.
|            | FORGE | LEAP |
| ---------- | ----- | ---- |
| Depth err. | 0.16  | 0.11 |

[1] Barron, Jonathan T. et al. “Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields.” CVPR 2022.


  2. [The pose-free approach] We never claim to be "the first" in the first paragraph of page 4. What we claim is "a novel pose-free paradigm" -- LEAP is indeed different from SRT, as it is based on 3D representations for 3D modeling. We will make this clearer. We give credit to SRT, which is the first pose-free work and performs novel view synthesis based on 2D representations.

  3. [DTU details] Pose estimation on DTU is not solved. As shown in Fig. 7 of SPARF, the pose errors are still large with sparse inputs.

  4. [Fig. 8] The location of the point visualized in the volume clearly matches the GT views. Simply placing the density near the mean depth would not match the GT. Besides, the reviewer mentions that "estimating pose given images of a single dot is impossible"; however, LEAP does not estimate any poses. As further evidence, Fig. 16 shows that LEAP has large depth errors with one input, which are almost resolved with two inputs, verifying that LEAP uses multi-view information to resolve depth ambiguity.

  5. [Camera poses during training] We describe this in the introduction. We will make it clearer to avoid misunderstanding. Thanks for your detailed feedback.

  6. [Coarse-to-fine 2D-to-3D mapping] Your understanding is correct. It is a learned behavior. We will clarify it.

  7. [Typos and figure colors] We will revise them. Thanks for the detailed review!

  8. [Related work] We will cite and further discuss the mentioned works. RUST removes the need for the target camera pose by prompting the model with a partial target image. VMRF and GNeRF focus on the dense-view setting.

  9. [SRT training details] We strictly follow the training protocol with the officially verified code (4M iterations). Our visualization results are comparable to those in the SRT GitHub repo. We also clarify the difference between the object-centric datasets used in this work and in the SRT paper. The NMR data used in the SRT paper is rendered from only 24 fixed camera poses, which leads to better performance on those fixed views, but the trained model cannot generalize to other camera poses (a problem reported in the SRT repo). In contrast, the camera poses we use to train LEAP and SRT are randomly sampled, which introduces more difficulty. Moreover, the datasets we experiment with contain more complicated real-object textures and geometry.

  10. [Rendering features] We observe a 0.72 PSNR drop when rendering RGB directly, showing the effectiveness of rendering features.
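For reference, the sketch below shows generic feature-space volume rendering followed by a small CNN decoder, the kind of pipeline discussed above. All tensor sizes, layer widths, and the sample spacing are assumptions for illustration and do not reflect LEAP's exact implementation.

```python
import torch
import torch.nn as nn

# Composite per-sample features along each ray with standard volume-rendering
# weights, then decode the rendered feature map into RGB with a small CNN.
B, H, W, S, C = 1, 32, 32, 64, 32          # image size, samples per ray, feature channels
R = H * W                                   # one ray per pixel
sigma = torch.rand(B, R, S)                 # predicted densities
feat = torch.randn(B, R, S, C)              # predicted per-sample features
delta = torch.full((B, R, S), 0.02)         # distance between consecutive samples

alpha = 1.0 - torch.exp(-sigma * delta)
trans = torch.cumprod(
    torch.cat([torch.ones(B, R, 1), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
weights = (alpha * trans).unsqueeze(-1)     # (B, R, S, 1) compositing weights
feat_map = (weights * feat).sum(dim=2)      # (B, R, C) rendered feature per ray
feat_map = feat_map.reshape(B, H, W, C).permute(0, 3, 1, 2)

decoder = nn.Sequential(                    # lightweight 2D decoder to RGB
    nn.Conv2d(C, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1))
rgb = decoder(feat_map)                     # (B, 3, H, W)
```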
Author Response

We continue to address your comments. (2/2)

  11. [Robustness to noisy views] LEAP can handle some input noise during training. Since "noisy views" can be interpreted as either "noisy image content" or "noisy image viewpoint/pose", we address both. For the former, we note that the segmentation masks from the OmniObject3D dataset are not perfect, which leads to noisy training image content. For the latter, we experiment with adding Gaussian noise (sigma 0.03) to the poses during training. LEAP reaches 27.59 PSNR in this setting, a reasonable decrease compared with 29.10 without pose noise.
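For concreteness, one way such pose noise could be injected is sketched below (a left-multiplied axis-angle perturbation plus translation noise). This is an illustrative assumption, not necessarily the exact procedure used in the experiment above.

```python
import torch

def perturb_pose(R, t, sigma=0.03):
    # R: (3, 3) rotation, t: (3,) translation of a training camera (assumed convention)
    w = sigma * torch.randn(3)            # small axis-angle noise vector
    K = torch.zeros(3, 3)                 # skew-symmetric matrix of w
    K[0, 1], K[0, 2] = -w[2], w[1]
    K[1, 0], K[1, 2] = w[2], -w[0]
    K[2, 0], K[2, 1] = -w[1], w[0]
    dR = torch.linalg.matrix_exp(K)       # exponential map so(3) -> SO(3)
    return dR @ R, t + sigma * torch.randn(3)
```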

  12. [Comparison with concatenation + self-attention] We experimented with the suggested method. The concatenation + self-attention (denoted c+s) method reaches 24.33 PSNR on OmniObject3D, lower than the cross-attention. We conjecture the reason is shifting canonical-view features: because the canonical-view features are also updated after each c+s operation, we have to apply it sequentially with each non-canonical view, and during this process the canonical-view features drift and are weakened. In contrast, with cross-attention (the NVU layer), only the non-canonical-view features are updated. Besides, the c+s method doubles the number of tokens in the attention, requiring roughly 4x the computation due to the quadratic cost of attention.
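To make the two alternatives concrete, the schematic below contrasts them with assumed token counts and dimensions; it is illustrative only and does not reproduce the exact NVU layer.

```python
import torch
import torch.nn as nn

dim, heads, n = 256, 8, 16 * 16             # token dim, heads, tokens per view (assumed)
canon = torch.randn(1, n, dim)              # canonical-view tokens
other = torch.randn(1, n, dim)              # one non-canonical view

# (a) Cross-attention (NVU-style): only the non-canonical tokens are updated;
#     the canonical features serve as fixed keys/values.
cross = nn.MultiheadAttention(dim, heads, batch_first=True)
other_updated, _ = cross(query=other, key=canon, value=canon)

# (b) Concatenation + self-attention: both token sets are updated, so the canonical
#     features drift after each pass, and the doubled token count costs ~4x compute.
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
both = torch.cat([canon, other], dim=1)
both_updated, _ = self_attn(both, both, both)
canon_updated, other_updated_b = both_updated.split(n, dim=1)
```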

  13. [LEAP for estimating pose] That is a great point! We appreciate the valuable suggestion. We experimented with using LEAP's image features for estimating poses on OmniObject3D. As shown below, the fine-tuned DINO backbone of LEAP performs better than the original DINO, and the features after the multi-view encoder perform even better, showing that LEAP learns cross-view correspondence cues. We use a concatenation-regression pose estimator for this experiment (a sketch of such a head follows the table below). We believe the synergy between pose estimation and shape prediction can be explored further; RUST is an inspiring work in this direction.
|           | DINO  | DINO-LEAP | LEAP-MV encoder |
| --------- | ----- | --------- | --------------- |
| rot. err. | 16.53 | 14.17     | 9.85            |
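A concatenation-regression pose head of the kind mentioned above could look like the following sketch; the feature dimension, MLP widths, and quaternion output are assumptions for illustration rather than the exact head used in the experiment.

```python
import torch
import torch.nn as nn

class PairwisePoseHead(nn.Module):
    """Concatenate pooled features of two views and regress their relative rotation."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 4))                      # unit quaternion for the relative rotation

    def forward(self, f_a, f_b):
        # f_a, f_b: (B, feat_dim) per-view features, e.g. pooled encoder outputs
        q = self.mlp(torch.cat([f_a, f_b], dim=-1))
        return q / q.norm(dim=-1, keepdim=True)

# Toy usage with an assumed feature size.
head = PairwisePoseHead()
q = head(torch.randn(4, 768), torch.randn(4, 768))  # (4, 4) unit quaternions
```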

We sincerely appreciate your dedication and the considerable time and effort you invested in providing such a detailed and high-quality review.

Official Review
Rating: 6

This paper proposes a pose-free method for novel view synthesis in sparse-view settings. Unlike previous methods that estimate/optimize camera poses, the proposed approach uses a novel transformer-based 2D-3D mapping to aggregate 2D image features in 3D space. After training on large-scale data with ground-truth poses, the method is able to generalize to new scenes without pose input. The paper shows that the proposed method outperforms previous approaches that use estimated camera poses on both object-level and scene-level datasets.

Strengths

  1. The idea is novel. Unlike previous methods that try to predict or estimate camera poses in sparse views, the proposed method does not use camera poses at all to build the 3D volume representation.

  2. The paper is well-written. It is easy to read and understand the motivation, background, problem, and high-level ideas to address the challenge.

  3. The proposed multi-view encoder and the 2D-3D information mapping layers are novel, and their efficacy is demonstrated in the ablation study.

  4. The experimental results are evaluated on both object-level and scene-level datasets, showing that the proposed method outperforms previous methods when they use estimated camera poses.

Weaknesses

  1. Unlike traditional pose-based projection, the proposed 2D-3D mapping layers perform a weighted fusion of 2D features. The mapping may be more robust than pose-based projection when poses are inaccurate, but it may be limited in accuracy. Consequently, the reconstruction and rendering are often blurred, as shown in Figure 6. Do the authors have ideas to improve this?

  2. Sparse-view reconstruction is an ill-posed problem because the input images contain incomplete scene information. Although the proposed method is better than previous approaches, the performance is still rather limited. In this scenario, most existing methods can refine the results by using more input images, whereas the proposed method only aggregates information in a single forward pass at inference time. Is it possible to refine the result when more input images are given?

  3. Does the proposed method work well with different numbers of frames at inference time? As the proposed 2D-3D mapping aggregates information from all frames, does it consume a huge amount of memory with dense views (e.g., more than 100 frames)?

Questions

See weakness.

Author Response

We thank the reviewer for the positive acknowledgment of the novel idea, the clarity of writing, and the diverse experiments. We will address your comments as follows.

  1. [Performance upper bound] Your understanding is correct: a pose-free approach is a better strategy for handling pose-estimation errors. However, we argue that being pose-free does not limit accuracy, as LEAP outperforms prior generalizable NeRFs, i.e., PixelNeRF and FORGE, when they use estimated poses. It is reasonable that performance is upper-bounded by the results with GT poses, and in some cases LEAP is actually better than PixelNeRF with GT poses. The blurry results in Fig. 6 are caused by a different problem, which we discuss below.

  2. [Reconstruction quality] We appreciate your detailed observation. The result in Fig. 6 (DTU experiments) tests LEAP on small-scale training data: the DTU dataset contains only 88 scenes for training, and all test scenes are unseen during training. Thus, the performance in Fig. 6 is not "perfect" because of the limited training data scale rather than the pose-free framework itself. When trained on datasets of normal scale (e.g., > 1K training instances), LEAP performs well, as shown in Fig. 4, Fig. 5, Fig. 12, and Fig. 13 for OmniObject3D, ShapeNet, and Objaverse. Improving performance with limited training data is an open problem for deep learning-based methods. We believe better transfer-learning strategies on models pre-trained with normal- or larger-scale data, as well as more geometry-level regularization on the refined neural volume and predicted radiance field, can improve performance in this setting.

  3. [Refining results with more incoming images] We appreciate the valuable suggestion. Refining the prediction with more incoming images is a very practical problem, as in incremental reconstruction [1]. To address your concern, we provide a study using 5 existing views and 1 incoming view. Our solution is: i) reconstruct the neural volume using the 5 existing views as an initialization; ii) use the multi-view encoder to propagate information between the canonical view and the incoming view; and iii) refine the existing neural volume with the image features of the incoming and canonical views via the 2D-3D information mapping attention (see the sketch after the reference below). Using the refined neural volume to predict the radiance field improves novel view synthesis by 0.37 PSNR over the result from the existing 5 views; for reference, direct inference on 6 images yields a 0.45 PSNR improvement. The small gap shows the feasibility of this solution and LEAP's potential for incremental reconstruction. We believe this is a promising research direction with many details to consider (beyond the current scope of this paper), e.g., whether the input views are temporally correlated (video frames) or whether the lighting conditions change across incoming views. We thank you for the insightful suggestion and plan to explore it in future work.

[1] Yuan, Ze-Huan and Tong Lu. “Incremental 3D reconstruction using Bayesian learning.” Applied Intelligence 2012.
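A minimal sketch of the three-step update described above is shown below; all function names and the model interface are hypothetical placeholders, not LEAP's released API.

```python
# Hypothetical sketch of the three-step incremental update. The function names
# (encode_views, lift_to_volume, decode_radiance_field) and the `model` interface
# are placeholders for illustration only.
def incremental_update(model, existing_views, new_view):
    # i) initialize the neural volume from the existing views
    feats = model.encode_views(existing_views)              # multi-view encoder
    volume = model.lift_to_volume(model.init_volume, feats)

    # ii) propagate information between the canonical (first) view and the new view
    canon = existing_views[:1]
    new_feats = model.encode_views(canon + [new_view])

    # iii) refine the existing volume with the canonical + incoming view features
    volume = model.lift_to_volume(volume, new_feats)
    return model.decode_radiance_field(volume)
```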


  4. [Inference with different numbers of inputs] Our method works with different numbers of inputs. As shown in Fig. 8 (bottom), LEAP can work with fewer images (2-4 views) when trained with 5 views. We further evaluate with 7 and 10 images (shown below), which reaches the upper bound of the number of views in prior sparse-view research [1]. In the dense-view setting (>100 images, as mentioned), pose estimation accuracy is no longer a problem, so pose-aware reconstruction, e.g., COLMAP, can be used. As you mentioned, extending LEAP to dense views would require hierarchical information-aggregation techniques, since aggregation from all input views can be computationally expensive. Working on dense views is beyond the current scope of LEAP, and unifying sparse-view and dense-view reconstruction is an interesting open research problem. We appreciate your insightful comments!
|       | 5 Views | 7 Views | 10 Views |
| ----- | ------- | ------- | -------- |
| FORGE | 26.56   | 27.16   | 27.08    |
| LEAP  | 29.10   | 29.75   | 30.11    |

[1] Zhang, Jason Y. et al. “RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild.” ECCV 2022.

Official Review
Rating: 5

The paper introduces a new methodology designed to transition NeRF from a pose-based optimization framework to a pose-independent reconstruction paradigm. In the fusion phase, the network generates a feature volume, with the initial view serving as the canonical reference, and subsequently integrates further perspectives by means of feature similarity between the 3D volume and 2D image features. This design keeps the entire process agnostic to pose, obviating the necessity for explicit pose optimization.

The paper conducts an exhaustive series of experiments to substantiate the robustness and efficacy of the proposed methodology, employing both object-centric and scene-centric datasets. The findings conclusively demonstrate that the proposed approach yields results on par with previous works that rely on ground-truth poses, while exhibiting superior generalizability in comparison to its predecessors.

Strengths

Originality: The reviewer did not identify a comparable concept within the existing literature, suggesting that the idea presented in this paper is novel and distinctive. Furthermore, the concept exhibits significant potential for broader applications across various use cases, further underscoring its relevance and practical utility.

Clarity: The problem statement, literature review, and a portion of the methods and experimental details are clear to me.

Quality and significance: The paper exhibits a well-structured organization that facilitates ease of comprehension. The experimental design effectively showcases the efficacy of the proposed methodology.

Weaknesses

  1. Clarity in the exposition of the method's approach to generating predicted images during the optimization process would be beneficial.
  2. It would be advantageous if the paper delved deeper into the reasons behind its enhanced speed and the trade-offs involved in achieving such acceleration.
  3. The proposed methodology presents certain limitations, particularly concerning relative poses. A more detailed exploration of how the network achieves accurate scale predictions without pose information would be insightful.

Questions

The paper is not well written:

  1. Concerning the proposed methodology, the process of generating predictions remains unclear. Specifically, after enhancing the 3D neural volume, how is the establishment of a 2D to 3D association executed for rendering the input view during optimization? Regrettably, this paper does not provide a satisfactory response to this inquiry.

  2. A noticeable absence of local consistency, which is essential for ensuring a robust 3D to 2D association, prompts the question of how the proposed approach manages challenges such as occlusion and variations in lighting conditions. Clarity regarding the method's strategy for addressing these issues would be greatly appreciated.

  3. The impact of an increasing number of input views on the method's prediction accuracy compared to baseline methods remains unaddressed. It is crucial to understand whether the accuracy is expected to decrease or improve with a greater quantity of input views. The author's insights on this matter would be valuable.

Author Response

We appreciate your valuable comments. We thank you for acknowledging the extensiveness of our experiments and the originality of our idea. We will address your comments as follows.

  1. [Rendering input views during training/optimization] Different from original NeRF-style works, where the model is trained and tested on the same instance, LEAP models novel instances captured as sparse, unposed images at inference time: it is trained and tested on different object/scene instances, and all test instances are unseen during training. As described in the third paragraph of page 2, we use the ground-truth poses of the input images of training scenes to render the predicted input views. Note that these ground-truth poses are only used during training to learn the 2D-3D information mapping, and using ground-truth poses during training is a common setting in prior generalizable NeRF works, e.g., PixelNeRF, SRT, and FORGE. The ground-truth poses used during training are also easily accessible, as they are provided in existing datasets, e.g., OmniObject3D, Objaverse, and DTU, annotated using virtual cameras for rendering or careful camera-calibration procedures. Once trained, LEAP performs direct inference on novel objects/scenes captured as unposed images with zero-shot generalization. We will clarify this point in the revision.

  2. [Improved speed] The fast inference of LEAP comes from its ability to predict the radiance field in a single feed-forward pass at inference, without any test-time optimization (as described in the first paragraph of page 2). In contrast, most prior works, e.g., SPARF, RelPose, and FORGE, require per-scene optimization at test time, which is slow to converge. We will clarify this in the revision.

  3. [Scale prediction] We perform all experiments (both training and evaluation) at a normalized scale, a common setting in RGB-based reconstruction methods [1]. Jointly scaling the scene and the camera translations up or down leads to the same RGB observations at those camera viewpoints, so depth information is required to resolve the scale ambiguity. Recovering the absolute scale from RGB images alone is an ill-posed problem (absent any category-level prior, e.g., the average height of humans). We will clarify this in the revision.

[1] Schönberger, Johannes L. and Jan-Michael Frahm. “Structure-from-Motion Revisited.” CVPR 2016.
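As a brief illustration of the ambiguity described above (assuming a standard pinhole camera with intrinsics $K$ and extrinsics $[R \mid t]$):

```latex
% Joint scaling of the scene points X and the camera translation t by s > 0 leaves
% every pixel unchanged, because perspective projection \pi is invariant to scaling
% of homogeneous coordinates:
\[
\pi\big(K\,(R\,(sX) + s\,t)\big) \;=\; \pi\big(s\,K\,(RX + t)\big) \;=\; \pi\big(K\,(RX + t)\big).
\]
```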


  4. [Local consistency] LEAP learns local 2D-3D consistency. As shown in Fig. 11 (a) and (b), two neighboring on-surface 3D voxels are associated with consistent 2D regions. We generally agree that incorporating more regularization could further improve the 2D-3D association accuracy.

  5. [Increasing the number of views] As shown in Fig. 8 (bottom), we compare LEAP with prior works using 2-5 input views, where all methods are trained with 5 images per scene/object. LEAP performs better than prior works as more images become available. We also note that LEAP focuses on sparse-view reconstruction, where usually only 2-5 input views are available. To further address your concern, we experiment with 7 and 10 input views, which reaches the upper bound of the number of views in prior sparse-view research [2]. As shown below, the performance of the best baseline, FORGE, drops beyond 7 views; we conjecture the reason is compounding pose-estimation error. In contrast, LEAP continues to improve with more inputs.
|       | 5 Views | 7 Views | 10 Views |
| ----- | ------- | ------- | -------- |
| FORGE | 26.56   | 27.16   | 27.08    |
| LEAP  | 29.10   | 29.75   | 30.11    |

[2] Zhang, Jason Y. et al. “RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild.” ECCV 2022.

Official Review
Rating: 6

The paper proposes LEAP, a pose-free approach for 3D modeling from a set of unposed sparse-view images. By appropriately setting the 3D coordinate frame and aggregating 2D image features, LEAP demonstrates satisfactory novel view synthesis quality.

Strengths

++ As a pose-free approach, LEAP discards pose-based operations and learns geometric knowledge from data.

++ LEAP is equipped with a neural volume, which is shared across scenes and is parameterized to encode geometry and texture priors. For each incoming scene, it updates the neural volume by aggregating 2D image features in a feature-similarity-driven manner. The updated neural volume is decoded into the radiance field, enabling novel view synthesis from any viewpoint.

++ The experimental evaluations and ablation studies are extensive.

Weaknesses

-- Novel view synthesis is defined as rendering images at a specific camera pose and time (for dynamic scenes). When the camera poses are not available and not estimated, as in this paper, how does one perform NVS for given camera poses, i.e., how are the given camera poses aligned with the training-set images?

-- Essentially, the method incorporates feature correspondences into the overall optimization; it would thus be interesting to compare with estimated correspondences from optical flow, where dense matching is learned.

-- It is worth comparing with related work such as unposed NeRF to verify the effectiveness of the proposed method.

Questions

Please refer to my questions as listed in the above Weaknesses section.

Author Response

Thanks for the positive comments, and your acknowledgment that our experiments and ablation studies are extensive. We will address your comments as follows.

  1. [NVS with a given camera pose] We use relative camera poses for both training and inference. The relative camera pose specifies the rotation and translation of a target view with respect to the canonical view (the first image) [1, 2]. Thus, during testing, the target NVS viewpoint is controlled by specifying a small or large relative camera transformation. This definition of the target NVS camera pose is not affected by differing definitions, or the absence, of "absolute" camera poses for the input views, since the reconstructed radiance field is defined in the local frame of the canonical view (see the relation after the references below). Furthermore, we follow prior work [3] in performing reconstruction at a normalized scale, which eliminates the scale ambiguity of camera translations caused by RGB-only observations. We will include more details in the revision, and we thank you for the helpful feedback. Additionally, the supplementary shows 360-degree visualizations obtained by controlling the relative poses of the target NVS views.

[1] Zhang, Jason Y. et al. “RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild.” ECCV 2022.
[2] Jiang, Hanwen et al. “Few-View Object Reconstruction with Unknown Categories and Camera Poses.” 3DV 2024.
[3] Schönberger, Johannes L. and Jan-Michael Frahm. “Structure-from-Motion Revisited.” CVPR 2016.
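For clarity, the relative-pose convention mentioned above can be written compactly (assuming world-to-camera extrinsics; the paper's exact convention may differ):

```latex
% With world-to-camera extrinsics T_i, the target view is queried by its pose
% relative to the canonical (first) view:
\[
T_{\mathrm{rel}} \;=\; T_{\mathrm{target}}\, T_{\mathrm{canon}}^{-1},
\]
% so the radiance field lives in the canonical camera frame and is independent of
% any absolute pose convention for the input views.
```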


  2. [Comparison with optical flow] We generally agree that a more explicit use of fine-grained cross-view correspondences/matching could facilitate learning. However, existing correspondence models are not well suited to finding correspondences in the sparse-view setting. For example, optical-flow methods typically work on nearby frames of a video with only slight appearance differences, whereas input images in the sparse-view setting usually show extremely large variations caused by wide-baseline cameras; current optical-flow methods do not work well in this setting [4]. Besides, one of our baselines, SPARF, relies on dense correspondences from off-the-shelf methods, shows strong artifacts when the correspondences are inaccurate, and suffers from slow convergence. Finally, we believe pixel-level dense correspondence and sparse-view reconstruction with feature-level correspondence can benefit from each other, and exploring their synergy, e.g., in a joint learning framework, is a promising future direction. We appreciate your insightful comments.

[4] Chen, Qiao and Charalambos (Charis) Poullis. “Motion estimation for large displacements and deformations.” Scientific Reports 12 2022.


  3. [Comparison with unposed NeRF] We clarify that this paper focuses on sparse-view reconstruction, where only 2-to-5 unordered images are available. In contrast, current unposed NeRFs usually assume either dense-view (~50 images) or sequential (video) inputs [5, 6]. This line of work shows limited performance in the sparse-view setting, as it still estimates camera poses internally. We ran Nope-NeRF [5] on the DTU dataset under the sparse-view setting, where it reaches 11.32 PSNR. This gap (compared with its strong performance on dense-view inputs) verifies the necessity of this work. Besides, one of our baselines, FORGE, is an unposed and generalizable NeRF variant designed for sparse views, and LEAP performs better than FORGE. We will add more discussion on this in the revision.

[5] Bian, Wenjing et al. “NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior.” CVPR 2023.
[6] Fu, Yang et al. “MonoNeRF: Learning Generalizable NeRFs from Monocular Videos without Camera Poses.” ICML 2023.

AC Meta-Review

The submission received mostly positive reviews. Although EgpU has some reservations on the clarity, the other reviewers generally appreciate the presentation, recognize the novelty of the method, and are convinced by the positive experimental results. After reading the paper, the reviewers' comments and the authors' rebuttal, the AC agrees with the decision by the reviewers and recommends acceptance.

Why Not a Higher Score

I believe the paper should demonstrate more significant superiority over existing methods or applicability to a broader domain (e.g. more difficult datasets) for it to be considered for oral/spotlight presentations. The reviewers are also positive about the paper in general but without a very strong consensus on a clear accept.

Why Not a Lower Score

3D modeling from sparse observations with unknown poses is a very challenging ill-posed problem. I believe this would be considered exploratory work in a new direction that deserves acceptance.

Final Decision

Accept (poster)