PaperHub
5.5 / 10
Poster · 3 reviewers
Ratings: 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

A two-stage pipeline for generating high-quality 3D assets in a feed-forward manner.

Abstract

Keywords
3D Generation · 3D Reconstruction · Large 3D Models

Reviews and Discussion

Review (Rating: 3)

This paper proposes a two-stage framework called Flex3D for 3D generation and reconstruction. The first stage leverages multi-view and video diffusion models to generate a large candidate set of views, then filters them according to both visual quality and multi-view consistency. The second stage uses a flexible reconstruction module (FlexRM)—a Transformer-based approach—to convert the curated views into 3D Gaussian points, aiming to achieve fast rendering and high-quality 3D outputs. The authors claim that Flex3D is capable of generating consistent and improved 3D objects under various input conditions (e.g., text or single images), and they report advantages over several recent baselines in their experiments.

Questions for the Authors

  1. Accuracy of filtering: Could you provide quantitative measures (e.g., how often “bad” back views are successfully excluded, or how many “good” side views get missed)? How sensitive is the final 3D quality to mistakes in this step?
  2. Large-model filtering: Have you tested more advanced or larger models (e.g., GPT-4V) for quality checks? Would that significantly improve the system’s overall generation?
  3. 3D metrics: Will you include surface-based measures (e.g., Chamfer Distance)? How does the view selection process specifically improve geometry consistency, if at all?

Claims and Evidence

From the reported experiments, there is some supporting evidence that providing more and better-filtered input views can improve 3D quality. However, direct quantitative demonstrations of how effectively the “view selection” pipeline finds the actual “best angles” or significantly resolves multi-view inconsistencies are somewhat limited. Much of the discussion about “view selection” improvements rests on qualitative results or the authors’ own metrics. Stronger, more thorough experiments—especially on how well their selection strategy consistently selects “optimal” viewpoints—would bolster the paper’s claims.

Methods and Evaluation Criteria

Method design:

Stage 1: Two specially fine-tuned diffusion models (one focusing on elevation angles, the other on azimuth sweeps) produce candidate images; a quality classifier plus feature matching (LoFTR) filter out inconsistent or low-quality views.

Stage 2: A tri-plane + 3D Gaussian Splatting network (FlexRM) renders a 3D object from the selected images, with extra camera encoding and noise simulation strategies to manage varying numbers and qualities of input views.

Evaluation metrics: The authors primarily use 2D image-based quality measures (PSNR, SSIM, LPIPS) plus CLIP-based semantic scores, and a user study. While these are relevant, more explicit 3D metrics (e.g., Chamfer Distance, surface normal consistency) or direct benchmarks of the selection accuracy would further strengthen the results.

Adequacy: The chosen metrics address visual quality but fall short of fully capturing 3D geometric consistency. Some additional metrics or more detailed reports on how well the selection step contributes to consistent geometry would be beneficial.
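As an illustration of the kind of surface-based metric suggested here, below is a minimal sketch of the symmetric Chamfer Distance between two sampled point clouds; the KD-tree implementation and the random stand-in point clouds are my own assumptions, not the paper's evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds."""
    tree_a, tree_b = cKDTree(points_a), cKDTree(points_b)
    dist_a_to_b, _ = tree_b.query(points_a)  # nearest neighbor in B for each point of A
    dist_b_to_a, _ = tree_a.query(points_b)  # nearest neighbor in A for each point of B
    return dist_a_to_b.mean() + dist_b_to_a.mean()

# Toy usage with random stand-in clouds; real use would sample the GT and predicted surfaces.
pred = np.random.rand(2048, 3)
gt = np.random.rand(2048, 3)
print(f"Chamfer Distance: {chamfer_distance(pred, gt):.4f}")
```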

Theoretical Claims

The paper does not present new theoretical derivations or formal proofs; most contributions lie in the design of a two-stage pipeline and its architectural modifications. As such, there is no major theoretical analysis to be audited.

Experimental Design and Analysis

Dataset variety and generalization: The paper mentions a fairly large synthetic dataset and tests with GSO-like scanned objects. While this shows decent coverage, additional experiments on more diverse categories or real-world scenes would clarify the method’s broader applicability.

User study: The user study is rather high-level, lacking details on participant demographics, rating procedures, or how statistical significance was assessed.

Supplementary Material

There is no supplementary material attached.

Relation to Prior Work

Flex3D follows the increasingly popular “generate multi-view images first, then reconstruct” paradigm. Its main focus is on engineering a pipeline that (1) selects higher-quality and more consistent input views, and (2) learns a robust tri-plane + 3D Gaussian network to handle input imperfection. Compared to direct 3D diffusion approaches, this two-stage system might be simpler to integrate with current 2D diffusion models but does not necessarily provide novel insights beyond an engineering standpoint. The paper might have benefited from a deeper comparison or discussion regarding other advanced multi-view quality-control approaches or large multimodal models for automated viewpoint filtering.

Missing Essential References

I believe the related works that are essential to understanding the key contributions are already cited in the paper.

Other Strengths and Weaknesses

Strengths

  1. The overall pipeline is clearly presented and practically oriented.
  2. The tri-plane + Gaussian Splatting design has potential for efficient rendering.
  3. The method provides a coherent solution for integrating multi-view diffusion models with a flexible 3D reconstructor.

Weaknesses

  1. Limited novelty: The paper primarily refines existing ideas, essentially Instant3D with filtered multi-view inputs, with most contributions being incremental engineering or pipeline reorganizations.
  2. View selection performance: The paper lacks rigorous experiments demonstrating that their filter truly finds optimal angles or reliably removes inconsistent samples.
  3. Evaluation scope: The focus on 2D metrics and a relatively small user study might not fully substantiate claims of improved 3D consistency or geometry quality.

Other Comments or Suggestions

  1. More direct analysis of view selection: More quantitative metrics would clarify the effectiveness of the filtering and how errors in filtering affect final 3D quality.
  2. Compare to multimodal LLM-based filtering: Attempting view selection via GPT-4V or similar might highlight the pros and cons of simpler geometric feature matching vs. large-model approaches.
  3. Expanding user study details: Clarify participant backgrounds, rating methodology, and any significance testing. This would make user study findings more credible.
Author Response

We thank you for your thorough review and constructive feedback. We are encouraged by your comments on our pipeline's clarity, the tri-plane + Gaussian Splatting design's performance and potential impact, and the solution's coherence for common two-stage 3D generation pipelines. We provide our responses below, addressing the specific weaknesses and questions raised.

1: Limited novelty.

Two-stage 3D generation pipelines, such as Instant3D and many others, represent a popular and effective class of frameworks. However, a significant limitation of all these approaches is that while their reconstructors perform well with sparse-view reconstruction, the final 3D quality remains constrained by the quality of the generated multi-views. Our work directly tackles the challenge of handling suboptimal outputs from this initial stage. To achieve this, we propose three key methods: view selection, a flexible view reconstruction architecture, and noise simulation. We believe both the core concept of mitigating first-stage limitations and the specific proposed methods possess some novelty. For instance, the view selection process utilizes 3D geometric priors, and our reconstruction model, combining tri-planes with 3DGS, offers both speed and the flexibility to handle varying numbers of input views.

2: View selection performance.

Please see our response to Reviewer kNaj on point 1.

3: Compare view selection pipeline to multimodal LLM-based filtering.

Using the IoU metric, we compare our performance with three MLLMs: GPT-4o, GPT-4o mini, and Gemini 1.5 Pro. Their performances are 0.64, 0.49, and 0.57, respectively, all worse than ours (0.72). For the "ramen" example in Figure 5, our pipeline selected [1, 2, 3, 5, 6, 10, 11, 12, 13, 18, 19, 20]. In comparison, GPT-4o selected [1, 2, 5, 6, 10, 12, 13, 17, 18], GPT-4o mini selected [1, 5, 7, 10, 11, 13, 17, 18], and Gemini selected [1, 5, 7, 13, 17]. Compared with our pipeline, Gemini rejected frames [2, 3, 6, 10, 11, 12, 18, 19, 20] and selected bad frames [7, 17] where the chopsticks are missing or blurry. GPT-4o mini also selected these bad frames [7, 17] while missing several high-quality frames like [2, 3, 6, 12, 19, 20]. GPT-4o performed better, selecting mostly high-quality frames, but still missed potentially useful views like [11, 19, 20].

In conclusion, while MLLMs are not yet as effective and efficient as our proposed pipeline for this specific task, using them for view selection holds strong potential.

4: More direct analysis of view selection; how do errors in filtering affect final 3D quality, and how sensitive is it?

Although our view selection pipeline is generally strong, achieving 93% accuracy for back view assessment and 0.72 IoU for overall view selection, errors can negatively impact the final 3D quality. Table 4 shows how view selection affects final quality quantitatively. Generally, incorporating bad views degrades the quality of the final 3D output, and the strength of this effect tends to be related to the number of good views. For example, when a larger number of high-quality views are selected as input, the negative impact of incorporating a poor view tends to be less significant. Missing a high-quality view also degrades the final 3D output quality; similarly, this impact is less significant when many other good views are already included.

5: User study details.

Participants: Five computer vision or machine learning researchers participated in the evaluation. Two were from the US, two from Europe, and one from Asia.

Methodology: Participants viewed paired 360° rendered videos—one generated by Flex3D and one by a baseline method—presented via a Google Form. Video pairs were presented in random order and randomized left/right positions. Participants selected the video they preferred based on overall visual quality.

Statistical Significance: We collected 1400 valid results (5 participants * 7 baselines * 40 videos). Flex3D was preferred in at least 92.5% of comparisons across all 7 baselines, strongly suggesting better visual quality.
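The reported numbers above are raw preference rates rather than a formal test; purely as an illustration (not something the authors state they ran), a one-sided binomial sign test against the 50% chance level could look like the sketch below, where 185 of 200 is a hypothetical per-baseline tally consistent with the reported 92.5% floor.

```python
from scipy.stats import binomtest

# Hypothetical per-baseline tally: 200 pairwise comparisons (5 raters x 40 videos),
# with Flex3D preferred 185 times (92.5%). H0: preference is at chance (p = 0.5).
result = binomtest(k=185, n=200, p=0.5, alternative="greater")
print(f"One-sided sign-test p-value: {result.pvalue:.2e}")
```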

6: No 3D metrics for evaluation. How does the view selection process improve geometry consistency?

In Table 2, we reported 3D metrics for the reconstruction task, including Chamfer Distance and Normal Correctness, where our FlexRM model clearly outperforms other baselines. Evaluating geometry consistency for generation tasks is challenging due to the absence of GT. We thus employ VGGSfM as a proxy to assess the 3D quality of the generated models. Specifically, we render 16 views covering a 360° azimuth from the generated 3D Gaussians and measure the success rate of VGGSfM in estimating poses (correct poses for at least 8 views). Among 404 results, our full pipeline with view selection achieves a 65.6% success rate, higher than the 59.6% rate obtained without it. This confirms that our view selection strategy improves geometric consistency.
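A schematic sketch of the proxy described above; `render_views` and `estimate_correct_poses_vggsfm` are hypothetical placeholders for the rendering and VGGSfM pose-estimation steps, since their exact interfaces are not given in the response.

```python
def geometry_consistency_success_rate(generated_objects, min_correct_poses: int = 8) -> float:
    """Fraction of generated objects for which VGGSfM recovers enough camera poses.

    Hypothetical helpers: `render_views` renders 16 views covering a 360-degree azimuth
    sweep from the 3D Gaussians; `estimate_correct_poses_vggsfm` returns how many of
    those views received a correct pose estimate.
    """
    successes = 0
    for obj in generated_objects:
        views = render_views(obj, num_views=16)
        num_correct = estimate_correct_poses_vggsfm(views)
        if num_correct >= min_correct_poses:
            successes += 1
    return successes / len(generated_objects)
```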

Reviewer Comment

I appreciate the detailed rebuttal from the authors. I am convinced by the user study details, the use of 3D metrics for evaluation, and the comparison against naive multimodal LLM-based filtering methods. As for the novelty concern, I believe it is a matter of perspective and subjective judgment. Since the remaining two reviewers did not raise any concerns regarding novelty, I am willing to revise my score to Weak Accept, and I trust the Area Chair to make the final decision on this matter.

Author Comment

Thank you for your thoughtful follow-up comment. We truly appreciate the effort you put into your detailed review; your suggestions were very constructive and insightful. We are grateful for your updated assessment.

Review (Rating: 3)

This paper introduces two novel modules for achieving high-fidelity 3D generation. The first is a candidate view generation and selection module, which generates a pool of novel-view images and adopts an SVM scorer to select high-fidelity novel views; these are then sent to the second module, the Flexible Reconstruction Model, to reconstruct the final 3D models. The idea of selecting high-fidelity novel-view images from a large pool is interesting. Since there are many novel view synthesis methods, this approach allows us to combine them all to achieve better performance.

Questions for the Authors

N.A.

Claims and Evidence

The claim that novel view generation is challenging and cannot be guaranteed to be consistent is true and makes sense. The claim that more consistent views lead to better reconstruction results is also demonstrated in Table 2 of the paper.

Methods and Evaluation Criteria

The proposed method makes sense to me, and the evaluation criteria are consistent with prior works.

Theoretical Claims

No such claims are made in the paper.

Experimental Design and Analysis

The experiments for this method should be divided into two parts. The first part is novel view synthesis, which mainly focuses on the view selection module, and the second part is the reconstruction module, focusing on reconstruction quality. I appreciate that the authors conduct thorough experiments (including ablation experiments) for the reconstruction module. However, I think the evaluation of view selection is not sufficient. I would like to see more discussion of this module:

  • What is the performance of this module? The authors may report the classification accuracy.
  • The SVM is trained with only 2,000 labeled samples; is such a small amount of data enough for the SVM to generalize to different generation cases?
  • If the back view is not selected, could the model still work well with only side views as input?

Supplementary Material

Yes

Relation to Prior Work

The paper introduces a novel 3D generation method, which may benefit fields like 3D generation, reconstruction, and understanding. Prior related works such as Instant3D [1] and LRM [2] have been demonstrated to have a certain level of influence in the field of 3D vision.

[1] Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model (ICLR 2024)

[2] LRM: Large Reconstruction Model for Single Image to 3D (ICLR 2024)

Missing Essential References

N.A.

Other Strengths and Weaknesses

Strengths:

  • The idea of adopting a view selection strategy to filter inconsistent views is interesting and novel to me.
  • Although the idea of involving more views for reconstruction has been proposed in previous work such as [1], the paper introduces several strategies that help improve performance, which could provide a good baseline for future work.
  • The paper is well written and easy to understand.

[1] GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation (ECCV 2024)

Weakness:

  • Please see the experiments section for the discussion of the evaluation experiments.
  • Although the paper proposes a selection module to filter inconsistent inputs, the performance is still bounded by the quality of novel view synthesis models.

Other Comments or Suggestions

N.A.

Author Response

Thank you for your thorough review and insightful feedback. We are encouraged by your recognition of our core view selection strategy as interesting and novel, and we appreciate you noting its potential to improve reconstruction quality by filtering inconsistent views. We also value your positive comments regarding the paper's clarity, motivation, and its potential to serve as a useful baseline for future 3D generation research alongside related works. We address each specific weakness and question below.

1: Performance of view selection module, report classification accuracy.

To evaluate our view selection module, we first manually labeled 404 videos generated by our multi-view EMU model (from deduplicated prompts from DreamFusion). We manually established a ground truth set (GT_Set) for each video. To mitigate subjective bias in determining the absolute 'best' views among all 20 frames, we focused on selecting approximately 10 clearly high-quality views per video to serve as the GT_Set. The labeling process is similar to that described in our response to Reviewer PAZq (2.2: Can we trust manual labeling?). The authors first carefully labeled 20 sample videos. The remaining videos were then assigned to two labelers for annotation.

We used the Intersection over Union (IoU) metric to evaluate the quality of view selection per video. For each video, we compared the ground truth set (GT_Set) with the set of views selected by our model (Selected_Set). The IoU is calculated as:

IoU = |GT_Set ∩ Selected_Set| / |GT_Set ∪ Selected_Set|

Our model achieves an average IoU of 0.72. In terms of accuracy, treating results with an IoU > 0.6 as accurate, we achieve an accuracy of 88.6%. This result indicates strong performance and demonstrates the effectiveness of our proposed selection module, which also operates in near real-time.
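A minimal sketch of this per-video IoU and the IoU > 0.6 accuracy criterion, assuming views are referred to by integer frame indices; the example sets are made up for illustration.

```python
def view_selection_iou(gt_set: set, selected_set: set) -> float:
    """IoU between the ground-truth and selected view-index sets for one video."""
    return len(gt_set & selected_set) / len(gt_set | selected_set)

def selection_accuracy(per_video_sets, iou_threshold: float = 0.6) -> float:
    """Fraction of videos whose selection IoU exceeds the threshold."""
    ious = [view_selection_iou(gt, sel) for gt, sel in per_video_sets]
    return sum(iou > iou_threshold for iou in ious) / len(ious)

# Toy example with hypothetical frame indices (not taken from the paper's data).
gt = {1, 2, 3, 5, 6, 10, 11, 12}
selected = {1, 2, 5, 6, 10, 12, 13, 18}
print(f"IoU = {view_selection_iou(gt, selected):.2f}")  # 0.60
```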

2: The SVM is trained with only 2,000 labeled samples; is such a small amount of data enough for the SVM to generalize to different generation cases?

Yes, we found 2,000 labeled samples to be sufficient for training an accurate filter. This sample size also aligns with practices in related work, such as the data curation pipeline in Instant-3D. Its effectiveness stems from leveraging powerful pre-trained image encoders (DINO).

To validate accuracy and robustness, we tested on 100 held-out videos from our method and 100 from SV3D (out-of-distribution). The classifier achieved 93% accuracy on our videos and 90% on SV3D's. These high rates, especially on OOD data, show the 2,000 samples were sufficient and the filter is robust.
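For concreteness, a hedged sketch of the kind of classifier described: an SVM over frozen image-encoder features. The random features stand in for DINO embeddings, and the kernel and embedding dimension are my assumptions rather than the paper's exact settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-ins: `features` would be frozen DINO embeddings of the ~2,000 labeled frames,
# `labels` the good/bad quality annotations. Random data is used here for illustration.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 768))   # assumed embedding dimension
labels = rng.integers(0, 2, size=2000)    # 1 = good view, 0 = bad view

X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.1, random_state=0
)
clf = SVC(kernel="rbf", C=1.0)            # plain RBF SVM; the paper's settings may differ
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_val, y_val):.3f}")
```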

3: If the back view is not selected, could the model still work well with only side views as input?

Yes, our pipeline functions effectively even without selected back views.

Selection Module: Our filter achieves high accuracy (>90%). If it excludes the back view, this typically indicates poor generation quality for the back side. The module correctly removes it and tends to retain other, higher-quality views (often front/side), which is crucial for the final 3D quality.

Reconstruction Model: Our FlexRM model is trained to handle a variable number and arbitrary combination of input views. While it can generate a full 3D object even with fewer or missing views (e.g., no back view), the final reconstruction quality definitely benefits from having more, high-quality input views. This is precisely why the selection module is valuable.

Therefore, filtering out poor-quality back views and feeding the reconstruction model with the remaining, better views leads to higher final 3D quality compared to forcing the inclusion of the low-quality one. The CLIP text similarity on 404 DreamFusion prompts is 0.277 vs. 0.270 for these two cases (with back-view selection vs. always selecting the back view), supporting our claims.
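For reference, a minimal sketch of a CLIP text-image similarity computation of the kind used for this metric, based on the openai/CLIP package; which CLIP variant and how scores are aggregated over rendered views are not specified here, so treat this only as an illustration.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed variant

def clip_text_similarity(image_path: str, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of one rendered view and its text prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()
```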

4: Although the paper proposes a selection module to filter inconsistent inputs, the performance is still bounded by the quality of novel view synthesis models.

We acknowledge that final performance is influenced by the novel view synthesis (NVS) model's quality, a characteristic common to two-stage pipelines. However, our selection module specifically mitigates this limitation. By filtering NVS outputs and selecting only the highest-quality views available, our approach makes better use of the NVS model's capabilities and reduces the negative impact of inconsistent views, achieving a tighter performance bound. Consequently, even with the same underlying NVS model, our method yields superior final 3D quality. Furthermore, NVS models themselves are expected to improve, potentially benefiting significantly from advancements in large-scale video generation models which learn implicit 3D consistency from vast video data.

Review (Rating: 3)

This paper introduces Flex3D, a novel two-stage framework designed for high-quality 3D generation from text, single images, or sparse views. In the first stage, the framework employs multi-view diffusion models to generate multiple images from diverse viewpoints, coupled with a view selection mechanism to filter out inconsistent or low-quality views. In the second stage, the varying selected views are fed into a transformer-based architecture that leverages a tri-plane representation, which is subsequently decoded into 3D Gaussians for efficient and high-fidelity 3D reconstruction.

Update after rebuttal

Based on the authors' response, I keep my weak accept rating.

Questions for the Authors

n/a

Claims and Evidence

The main claims made in the paper are generally supported by clear and convincing evidence.

Methods and Evaluation Criteria

The view selection mechanism seems reasonable and works well in experiments, but it feels a bit too engineered. The quality assessment is trained on just 2,000 manually labeled samples—is that enough for accurate filtering, and can we trust manual labeling? Also, the consistency check uses a fixed threshold (60% matching points), which might not be the most robust approach. Are there better, more systematic ways to evaluate consistency, like learned metrics? Exploring these could make the method more reliable and generalizable.

Theoretical Claims

N/A

Experimental Design and Analysis

N/A

Supplementary Material

n/a

Relation to Prior Work

n/a

Missing Essential References

n/a

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-organized and clearly demonstrates the entire pipeline.
  2. The view selection mechanism effectively filters out low-quality or inconsistent views, providing higher-quality and more consistent inputs for the reconstruction process.
  3. The reconstruction model can handle a varying number of input views, making it more practical for real-world applications.
  4. The paper proposes a series of effective modules in the pipeline, like view selection and robust reconstruction. If open-sourced, these could be useful for future work in the community.

Weakness:

  1. This work builds on existing technologies (e.g., multi-view diffusion models, 3D Gaussian splatting, Tri-plane) and introduces a set of engineering tricks (e.g., view selection with SVM and LoFTR). Although these contributions lead to performance gains, exploring fundamental challenges or valuable insights would further enhance the paper's impact.
  2. The view selection mechanism seems reasonable and works well in experiments, but it feels a bit too engineered. The quality assessment is trained on just 2,000 manually labeled samples—is that enough for accurate filtering, and can we trust manual labeling? Also, the consistency check uses a fixed threshold (60% matching points), which might not be the most robust approach. Are there better, more systematic ways to evaluate consistency, like learned metrics? Exploring these could make the method more reliable and generalizable.
  3. The training process is pretty resource-intensive, requiring 32, 64, and even 128 A100 GPUs, and the whole training process is quite complicated, making it hard to reuse or replicate. A simpler and more efficient training approach might be better—something that still delivers strong results but is easier for others to adopt and build on. This could make the method more accessible and practical for the wider research community.
  4. The additional ablation in the supplementary material shows only tiny improvements from the imperfect input simulation, so it is not clear whether this really makes the model more robust to noisy inputs. The idea makes sense, but since the results do not show a big impact, it seems unnecessary and redundant. Maybe trying other ways to handle noise, like adversarial training or more varied noise types, could make this part more convincing.

Other Comments or Suggestions

n/a

Author Response

Thank you for your thorough review and valuable feedback. We appreciate you acknowledging several strengths, including the paper's clear organization, the effectiveness of our view selection in improving input quality, the practicality of handling varying view numbers in reconstruction, and the potential community value of the proposed modules. We provide point-by-point responses below, addressing each weakness and question raised.

1: Exploring fundamental challenges or valuable insights would further enhance the paper's impact.

We concur with the reviewer's suggestion regarding the value of discussing fundamental challenges and insights. While our primary focus is mitigating suboptimal outputs from the first stage in common two-stage 3D generation models, we discuss insights and future directions (e.g., feed-forward 3D/4D generation, generative reconstruction) in Appendix A.

2.1: Is 2000 labeled samples enough for accurate filtering?

Yes, we found 2,000 labeled samples to be sufficient for training an accurate filter. This sample size also aligns with practices in related work, such as the data curation pipeline in Instant-3D. Its effectiveness stems from leveraging powerful pre-trained image encoders (DINO). To validate accuracy and robustness, we tested on 100 held-out videos from our method and 100 from SV3D (out-of-distribution). The classifier achieved 93% accuracy on our videos and 90% on SV3D's. These high rates, especially on OOD data, show the 2,000 samples were sufficient and the filter is robust.

2.2: Can we trust manual labeling?

We conducted a rigorous labeling process. For manual labeling, the authors first carefully labeled 100 sample videos. These were then provided to two labelers, and each labeler was asked to label approximately 1,000 videos, resulting in a total of 2,000 labeled videos. The trustworthiness is corroborated by the strong empirical performance of the classifier trained on these labels (high accuracy detailed in 2.1). Furthermore, results in Table 4 and Figures 5-6 show this filter significantly improves our generation pipeline's overall performance, indicating the manual labeling was reliable for its purpose.

2.3: Are there better, more systematic ways to evaluate consistency?

We conducted a sensitivity analysis that indicates that the final generation results are relatively robust to variations in this threshold within a reasonable range (50% to 70%). For instance, varying the threshold between 50% and 70% yielded comparable final generation quality, as measured by CLIP text similarity (ranging from 27.4 to 27.7) and Video CLIP text similarity (ranging from 25.3 to 25.7). We agree that more sophisticated methods like learned metrics or adaptive thresholds are promising future research directions for potentially more optimal filtering.
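A schematic sketch of a matching-ratio consistency check with a configurable threshold, in the spirit of the 60% rule discussed above; `match_keypoints` is a hypothetical wrapper around a feature matcher such as LoFTR, and the exact rule used in the paper may differ.

```python
def passes_consistency_check(view, reference_views, match_ratio_threshold: float = 0.6) -> bool:
    """Keep a candidate view if enough of its keypoints match at least one reference view.

    `match_keypoints(view, ref)` is a hypothetical helper returning
    (num_matched_keypoints, num_total_keypoints) for the candidate view.
    """
    for ref in reference_views:
        num_matched, num_total = match_keypoints(view, ref)
        if num_total > 0 and num_matched / num_total >= match_ratio_threshold:
            return True
    return False
```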

3: A simpler and more efficient training approach might be better—something that still delivers strong results but is easier for others to adopt and build on.

We agree the full pipeline (NeRF pre-training, 3D Gaussian training, imperfect view simulation training) is resource-intensive. However, efficiencies can facilitate adoption: Stage 1 (NeRF) can be bypassed by initializing Stage 2 (GS) directly from available pre-trained Instant-3D weights. Stage 3, which enhances robustness to imperfect inputs, is optional if the primary application is high-quality reconstruction from clean views. These optimizations reduce the core training requirement primarily to Stage 2, substantially lowering the resource barrier and complexity compared to training everything end-to-end from scratch.

For broad adoption, the FlexRM architecture is designed with a minimalist philosophy and can be easily reproduced based on Instant-3D, making it straightforward for others to adopt and implement. Nevertheless, we recognize that further reducing the computational demands of large-scale 3D generative models remains an active and important research challenge across the field.

4: Maybe trying other ways to handle noise, like adversarial training or more varied noise types.

We tested adding Gaussian noise (σ up to 0.05), salt-and-pepper noise (density up to 0.05), and combining Gaussian noise with our simulation. A comparison using 4-view reconstruction showed these other noise types degraded performance noticeably. In contrast, our proposed simulation pipeline using 3D Gaussians slightly enhanced performance, suggesting it is a more suitable approach in this context.

| Approach | PSNR↑ | SSIM↑ | LPIPS↓ | CLIP image sim↑ |
|---|---|---|---|---|
| No noise | 25.51 | 0.893 | 0.075 | 0.893 |
| Proposed | 25.55 | 0.894 | 0.074 | 0.893 |
| Gaussian | 24.93 | 0.871 | 0.083 | 0.874 |
| Salt and pepper | 25.18 | 0.882 | 0.078 | 0.880 |
| Proposed + Gaussian | 25.12 | 0.879 | 0.080 | 0.881 |
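For clarity, a small sketch of the two alternative corruptions compared above (Gaussian noise with σ up to 0.05 and salt-and-pepper noise with density up to 0.05), applied to images with values in [0, 1]; the paper's own imperfect-input simulation via rendered 3D Gaussians is not reproduced here.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Additive Gaussian noise on an image with values in [0, 1]."""
    noisy = img + np.random.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper_noise(img: np.ndarray, density: float = 0.05) -> np.ndarray:
    """Set a random fraction of pixels to 0 (pepper) or 1 (salt)."""
    noisy = img.copy()
    corrupt = np.random.rand(*img.shape[:2]) < density
    salt = np.random.rand(*img.shape[:2]) < 0.5
    noisy[corrupt & salt] = 1.0
    noisy[corrupt & ~salt] = 0.0
    return noisy
```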
Reviewer Comment

Thank you for the detailed explanations. Most of my comments have been addressed. Although some technologies are not particularly novel, I believe this paper meets the acceptance threshold and may inspire the community. Therefore, I recommend a weak acceptance.

Author Comment

Thank you for taking the time to review our rebuttal and for acknowledging our explanations, as well as your original thoughtful review. We appreciate you noting that most comments were addressed and are encouraged by your assessment that the paper may inspire the community!

Final Decision

This paper initially received mixed feedback before the rebuttal, including one weak reject and two weak accepts, with some concerns raised about novelty and other details. However, the authors provided a convincing rebuttal that successfully addressed the main points of criticism, leading all reviewers to converge on recommending acceptance (final scores 3, 3, 3). While initial novelty concerns were noted, the proposed method's ability to handle flexible view inputs and the strong experimental results highlight its contribution and potential impact.

Therefore, I recommend accepting this paper for ICML, and I strongly encourage the authors to integrate the feedback from all reviewers into their final camera-ready version.