PaperHub
Rating: 5.0 / 10
Poster · 4 reviewers
Min 4 · Max 6 · Std 0.7
Individual ratings: 6, 4, 5, 5
Confidence: 4.5
Correctness: 2.8
Contribution: 2.8
Presentation: 3.0
NeurIPS 2024

Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

A novel 3D-consistent diffusion framework that utilizes a pretrained 2D diffusion prior for 3D reconstruction and uses the reconstructed 3D to guide the 2D sampling process.

Abstract

Keywords
3D Reconstruction · 3D Human Reconstruction · Diffusion Models · 3D Generative Models · 2D Foundation Models

Reviews and Discussion

Review
Rating: 6

The paper presents a pipeline for creating a 3D model of a full-body avatar given a single input image. While previous methods perform single-image-to-3D using a 2D diffusion model such as ImageDream, as 2D models they suffer from 3D inconsistencies across views. This paper combines the benefits of a large-scale 2D pre-trained multi-view image model and a 3D-consistent generative model (here, Gaussian splatting is used as the representation).

Given a single input image, the method first creates 4 orthogonal views of the input using the 2D ImageDream model. While this creates realistic views, those views might be mutually inconsistent. To create more consistent views, the method re-noises the images and passes them to a 3D generative model based on a Gaussian representation, which produces a 3D-consistent version of the 2D output. The output of the 3D generative model is 3D consistent, but its texture might not be as good as the 2D one, so the model finally re-decodes the 3D renderings with a 2D model to create the final 3D-aware renderings.

Both the 2D ImageDream and the 3D Gaussian-based generative models are trained jointly in a tightly coupled manner. To enable the tight coupling, the 3D generative model also takes a noisy image as well as a time step t, just like its 2D counterpart. The 3D generative part is built on top of the U-Net decoder from LGM, which takes four-view images and produces Gaussian splatting parameters. The model is trained on a combined dataset of 6000 human scans, which are rendered to images via Blender.

The paper demonstrates convincing results on multiple challenging datasets (Sizer, IIIT-Human, and GSO) with extensive numerical / qualitative comparisons.

Strengths

  • Convincing results are demonstrated on the challenging Sizer, IIIT-Human, UBC Fashion, and GSO (Google Scanned Objects) datasets.
  • While the paper mostly focuses on humans, the method itself is fairly generic and could also work on general objects (some results are shown)
  • Extensive numerical and qualitative comparisons are provided, and the proposed method performs better than previous work.
  • While many previous methods used mostly 2D generative priors, the paper shows how to use both 2D and 3D generative models to create a better result.
  • The paper is overall easy to follow, and a good amount of technical detail is provided (though not all).

缺点

  • The paper provides most of the technical details, but some key details are still missing, such as the exact architecture of the noise-conditioned Gaussian-based 3D generative model and how long it takes to process an image.
  • The overall pipeline seems fairly complicated in that it needs to train both 2D and 3D generative models, and it relies on 3D scans, which are fairly limited for humans (only 6000).
  • The paper text has some English grammar errors and typos and could benefit from proofreading.

Questions

  • How long will it take to process an image?
  • How is X^tgt_{t} exactly generated? In Fig. 2, it looks like the noisy images have some structured noise rather than regular Gaussian noise added. Why is that?
  • The visualization of the used views in Figs. 9 and 10 is inconsistent. For some subjects only front views are shown and for other subjects only back views are shown, rather randomly. Please make them consistent or show both.
  • The following papers are relevant for citation: Generative Novel View Synthesis with 3D-Aware Diffusion Models (ICCV 2023); ReconFusion: 3D Reconstruction with Diffusion Priors (CVPR 2024); ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image.

Typos

  • line225: typo, extra period after "reconstruction"
  • Figure 6: Visualization "of" intermediate sampling steps
  • line781: Follow[ing] [71]

Limitations

The paper discusses both the broader impact and the limitations.

Some more limitations could be clarified:

  • Currently, limited materials are supported (e.g., the shiny green dress in the right column of Fig. 13).
  • Fine-scale details such as stuffed-toy fur (e.g., first image of Fig. 14) seem to have blobby 3D details, likely from the 3DGS, which does not appear to be merely a problem of the low-resolution generator.
Author Response

We sincerely thank the reviewer for recognizing our method's strengths, including its generalization, potential extensions, performance, and clarity. We have done our best to answer every comment thoroughly and hope our response addresses the reviewer's concerns. If there are any remaining questions, we are more than happy to address them.


Q1: The paper provided most of the technical details, but some key details are still missing, such as the exact architecture of the diffusion-based 3D-GS generative model

A1: Thanks for pointing out this problem. We start from the asymmetric U-Net proposed by LGM for our 3D-GS generator. The key modifications are:

  • Following Stable Diffusion, we inject a time embedding into the U-Net.
  • We add a clean context image in addition to the other 4 views.
  • We add 3D attention across the views and the context view to inject conditioning information during the generation process.

We hope this clarifies the details, and we will add these details to our supplementary.
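For illustration, here is a minimal PyTorch sketch of how these two ingredients (time-embedding injection and 3D attention over the target views plus the clean context view) can be wired into a U-Net block; the module and tensor names are our own placeholders, not the exact layers of LGM or of our implementation.

```python
import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    """Illustrative block: FiLM-style time conditioning + attention across all views."""

    def __init__(self, dim: int, t_dim: int, heads: int = 8):
        super().__init__()
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)  # time embedding -> (scale, shift)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, N, dim) -- V = 4 noisy target views + 1 clean context view, N tokens each
        # t_emb: (B, t_dim)     -- diffusion time-step embedding
        B, V, N, D = feats.shape
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        feats = feats * (1 + scale[:, None, None, :]) + shift[:, None, None, :]
        # "3D attention": every token attends to tokens of all views, including the clean
        # context view, which is how the conditioning image is injected during generation.
        tokens = self.norm(feats.reshape(B, V * N, D))
        attended, _ = self.attn(tokens, tokens, tokens)
        return feats + attended.reshape(B, V, N, D)
```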


Q2: The overall pipeline seems fairly complicated, requiring training of both 2D and 3D generative models, and the method relies on only a limited set of 6000 subject scans

A2: We agree that the number of human scans is small compared to object datasets such as Objaverse (800K to 10M objects). However, we empirically did not observe overfitting. On the contrary, we show that our approach generalizes well to unseen subjects with various types of clothing and appearance (supp. Fig. 9-12) and also to general objects (Fig. 5, Fig. 14). In our analysis, the pretrained multi-view diffusion model and the initialization from LGM help prevent overfitting.

Moreover, as the reviewer mentioned, our proposed approach is generic, so we can pretrain on a large-scale object dataset and finetune on the limited human scan dataset. To validate this, we conducted a toy experiment: we trained the model jointly on Thuman2.0 only (500 subjects, simulating limited human data) and on Thuman2.0 + ShapeNet (12K objects + 500 subjects, simulating large-scale object data plus small-scale human data); for both experiments we adopt pretrained ImageDream and use LGM to initialize the layer weights of the 3D-GS generator. We also finetuned the model pretrained on Objaverse on Thuman (500 human scans) and did not observe a significant drop in performance; see more details in R2Q3.


Q3: Training requires 3D scans, which are limited for humans (only 6k).

A3: We would like to point out that we do not rely on full 3D supervision to train the model. Instead, we compute losses on the multi-view images only. Therefore, in principle, we can also train our model on multi-view images with good camera poses, which are much more abundant than 3D scan data. Future work can extend our method to train on multi-view or video data, and we believe there is a lot of potential in this direction.


Q4: How long will it take to process an image?

A4: Thanks for the question. We also discuss this with reviewers KYnm (Q4) and uByW (Q5). Please refer to Q2 in the general rebuttal for the table. In summary, it takes around 22.5 seconds to process an image on one Nvidia A100. We further evaluate each key component inside an individual diffusion step and report the runtime of each subcomponent here (a schematic sketch of one sampling step follows the list):

  • each DDIM step of the 2D-3D joint diffusion: 0.46 seconds
    • 2D denoising step: 0.08 seconds
    • 3D denoising step: 0.38 seconds
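To make these numbers concrete, below is a schematic sketch of one joint 2D-3D sampling loop as described above: the 2D MVD denoises the views, the 3D-GS generator reconstructs an explicit 3D, and its consistent renderings are re-noised for the next step. All callables (mvd, gs_generator, renderer, scheduler) are placeholders, not our released API.

```python
import torch

@torch.no_grad()
def joint_2d_3d_sampling(mvd, gs_generator, renderer, scheduler, context_img, num_views=4):
    """Schematic DDIM-style loop coupling the 2D multi-view diffusion model with the
    diffusion-based 3D-GS generator. All arguments are placeholder objects."""
    # 4 noisy target views, initialized from pure Gaussian noise at step T
    x_t = torch.randn(1, num_views, 3, 256, 256)
    gaussians = None
    for i, t in enumerate(scheduler.timesteps):      # e.g. 50 DDIM steps (~0.46 s each)
        # 2D denoising step (~0.08 s): refine the noisy multi-view images
        x0_2d = mvd.predict_x0(x_t, t, context_img)
        # 3D denoising step (~0.38 s): reconstruct 3D Gaussians conditioned on the
        # (possibly 3D-inconsistent) 2D prediction and the clean context image
        gaussians = gs_generator(x0_2d, t, context_img)
        x0_3d = renderer(gaussians)                  # renderings are 3D-consistent
        # re-noise the consistent renderings to the next noise level for the next step
        if i + 1 < len(scheduler.timesteps):
            x_t = scheduler.renoise(x0_3d, scheduler.timesteps[i + 1])
    return gaussians
```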

Q5: How is X^tgt_{t} exactly generated? In Fig. 2, it looks like the noisy images have some structured noise rather than regular Gaussian noise added.

A5: X^tgt_{t} is initialized from Gaussian noise at reverse diffusion step T. We visualize the intermediate step results in Fig. 2, where the MVD predictions already have some structure. Visualizations of more intermediate steps can be found in supp. Figure 6. We will clarify this better in Fig. 2.


Q6: The visualizations of the used views in Figs. 9 and 10 are not all aligned. For some subjects only front views are shown and for other subjects only back views are shown, rather randomly. Please make them consistent or show both.

A6: Thanks for pointing this out. We showed different views for each subject in order to showcase diverse viewing angles of our results. In our rebuttal PDF we show comparisons with consistent viewing angles. We will update Figures 9-12 with consistent viewing angles.


Q7: Some more limitations could be clarified

A7: Thanks for the suggestion. We agree that these points are indeed limitations of our method. They are also unsolved problems in current SOTA multi-view diffusion models and 3D reconstruction methods. We will add this discussion to the limitations.


Q8: Typos and English grammar errors

A8: Thanks for pointing them out. We will correct them in the final manuscript.


Q9: Missing citations of relevant papers

A9: Thanks for the additional references. We will add paper [1] to L125 and L767, paper [2] to L125 and L767, and paper [3] to L74. [1] Generative Novel View Synthesis with 3D-Aware Diffusion Models (ICCV 2023). [2] ReconFusion: 3D Reconstruction with Diffusion Priors (CVPR 2024). [3] ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image.

Comment

I read the entire rebuttal, including all the reviews by the other reviewers. The rebuttal sufficiently addressed most of my concerns and I did not find new concerns. I will keep my original rating. The figures (second Ninja Turtle example) are still inconsistent, though, in terms of rendering poses in Figure 1 of the rebuttal PDF.

Review
Rating: 4

This paper proposes an image-conditioned 3D-GS generation model for 3D human reconstruction. 2D diffusion models fall short in offering 3D consistency for multi-view shape priors. To address this, the authors introduce a method that combines the strengths of 2D multi-view diffusion and 3D reconstruction models to create avatars with explicit 3D consistency. They propose an image-conditioned 3D Gaussian Splats model that uses 2D diffusion priors to enhance the 3D reconstruction and guide a 2D reverse sampling process, yielding both geometry and appearance. It is capable of jointly training a 3D generative model and a 2D multi-view diffusion model end-to-end.

Strengths

  1. The motivation and insight behind this paper are reasonable. I agree that only by obtaining 2D priors with good 3D consistency can one achieve high-quality 3D Gaussian reconstruction results.
  2. An interesting idea of jointly training the 2D diffusion model and 3D Gaussians end-to-end.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. The resolution supported by the model is too low.
  2. By observing the qualitative results, I don't think the method has superior performance in geometry or texture; it seems to have a significant gap compared to existing methods that were not compared. However, the quantitative metrics are surprisingly good, which puzzles me.
  3. Comparative experiments lack comparisons with many other works. I suggest adding comparative experiments with the following studies:
     • TeCH: Text-guided Reconstruction of Lifelike Clothed Humans (3DV 2024)
     • HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion (CVPR 2024)
     • Human-SGD: Single-Image 3D Human Digitization with Shape-Guided Diffusion (SIGGRAPH Asia 2023)
     • ECON: Explicit Clothed humans Optimized via Normal integration (CVPR 2023)
     • ICON: Implicit Clothed humans Obtained from Normals (CVPR 2022)
     • FOF: Learning Fourier Occupancy Field for Monocular Real-time Human Reconstruction (NeurIPS 2022)

Questions

  1. How long is the training period? Is it possible to encounter the phenomenon of Gaussian overfitting before the diffusion model has converged? If it occurs, how is it resolved?
  2. What is the inference speed like, and does it have an advantage compared to other methods?
  3. The multi-view human generation results of your 2D Diffusion model should be displayed.

Limitations

Limited reconstruction quality and lack of comparison results with state-of-the-art methods.

Author Response

We sincerely thank the reviewer for recognizing our insight, motivation, and the interesting idea of the proposed framework. We notice that the reviewer has concerns about the performance comparisons with SOTA methods and thus looks forward to more comparisons and results. We address these concerns here and are happy to take part in further discussion.


Q1: The supported resolution is low

A1: We assume the reviewer refers to the resolution of our 2D multi-view diffusion model. We adopt the pretrained ImageDream multi-view diffusion model, whose resolution is limited to 256x256. We note this as a limitation (L288) and believe it can be resolved by leveraging a more powerful high-resolution multi-view diffusion model such as SV3D (576x576). We would like to emphasize that our proposed approach is not limited to ImageDream. Our framework allows us to use any 2D multi-view diffusion model and further improve it with our proposed tight coupling of 2D and 3D generation models. Even with low-resolution input, our method achieves SOTA results, highlighting the strength of our idea.


Q2: Qualitative results do not appear superior to previous SOTA works, but quantitative results are high

A2: Thanks for raising this concern. The examples shown in Fig. 2 might not fully represent the superiority of our method. We add 10 more comparisons with SiTH, SiFU, ECON, and ICON in our rebuttal. Furthermore, we randomly selected 40 subjects from the test set and asked 70 users to select which method has the best reconstruction quality. Results suggest that our method is preferred by 80.3% of the users, which is aligned with the quantitative results we reported in the paper. We would like to point out that, even though ECON showed impressive results and robustness to diverse clothing and challenging poses, it heavily relies on SMPL estimation, which can be inaccurate in challenging cases. As shown in the rebuttal PDF, inaccurate SMPL estimation leads to incorrect human shape and clothing geometry. In contrast, our method does not rely on SMPL and is more flexible in representing different clothing, accessories, and children. Therefore, we obtain better results on the IIIT and Sizer test sets (initial submission) and on CAPE and CustomHuman (commonly used benchmarks).


Q3: Comparative experiments lack baselines; suggest adding comparisons with TeCH, HumanRef, Human-SGD, ICON, ECON, FoF

A3: Thanks for pointing out this extensive list of comparable works. We compared with SiTH and SiFU, both SOTA human reconstruction methods published at CVPR 2024. Since they already outperform the prior SOTA ECON/ICON, we omitted those comparisons in the initial submission. We do understand that a thorough comparison with more baselines can strengthen the arguments. However, some of the listed baselines either have a very long runtime (TeCH, >6h per image, requiring more than 48GB of GPU memory), a poorly maintained codebase (FoF, no instructions), were released after the NeurIPS deadline (HumanRef, released June 19, 2024), or have no code release (Human-SGD). Therefore, we additionally compare only with ICON and ECON, as they are the most popular baselines. The results are reported in the table below.

| Method | Published at | SMPL prior | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- |
| PiFU | ICCV 2019 | | 2.83 | 0.769 | 0.333 |
| ICON | CVPR 2022 | | 4.06 | 0.728 | 0.230 |
| ECON | CVPR 2023 | | 3.52 | 0.749 | 0.278 |
| SiTH | CVPR 2024 | | 3.92 | 0.735 | 0.250 |
| SiFU | CVPR 2024 | | 3.60 | 0.739 | 0.235 |
| LGM | ECCV 2024 | | 3.29 | 0.562 | 0.275 |
| TripoSR | 03.2024 | | 2.59 | 0.771 | 0.360 |
| InstantMesh | 04.2024 | | 2.47 | 0.787 | 0.338 |
| Our | - | | 1.35 | 1.38 | 1.31 |

The quantitative results show that our approach outperforms all baselines. We include additional qualitative examples in the rebuttal PDF. We also conducted a user study to evaluate the qualitative results (see R1Q1). Overall, our method is preferred over ICON, ECON, SiTH, SiFU by approximately 80% of 70 users. This clearly shows that our method outperforms baselines.


Q4: How long is the training period? Is it possible to encounter the phenomenon of Gaussian overfitting before the diffusion model has converged?

A4: It takes 5 days on 8 A100 (80GB) GPUs to train our model (L213). Preventing large models from overfitting to small datasets is an open research question. In our setting we did not observe this problem, as our model generalizes to subjects with diverse appearance and geometry (supp. Fig. 9-12) and even to general objects (Fig. 5, Fig. 14). We believe overfitting is mitigated by the following aspects:

  • Pretraining on a large-scale 3D dataset. We reuse some model weights from ImageDream (pretrained on LAION-5B, Objaverse) and LGM (pretrained on 80k Objaverse objects). This pretraining provides a strong prior for reasoning about 3D shapes even from very noisy multi-view images and a single clean input image.
  • Data augmentation for camera poses. We added small noise to the camera poses when first training the 3D-GS model alone. This helps the 3D-GS generator quickly adapt to the 3D-inconsistent multi-view images from the 2D MVD model (see the sketch after this list).
  • Small learning rates for fine-tuning. As is common in the literature, we employ small learning rates to fine-tune the pretrained MVD (1e-5 with cosine annealing) and the 3D-GS generator (5e-5).
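A minimal sketch of the camera-pose augmentation in the second point above; the noise magnitudes and matrix convention (4x4 world-to-camera) are illustrative assumptions, not the exact values used in training.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_camera(w2c, rot_std_deg=1.0, trans_std=0.01, rng=None):
    """Apply a small random rotation and translation to a 4x4 world-to-camera matrix."""
    rng = np.random.default_rng() if rng is None else rng
    noise_rot = R.from_rotvec(np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))).as_matrix()
    out = w2c.copy()
    out[:3, :3] = noise_rot @ w2c[:3, :3]              # jitter the orientation
    out[:3, 3] += rng.normal(0.0, trans_std, size=3)   # jitter the position
    return out
```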

Q5: What is the inference speed, and does it have an advantage compared to other methods?

A5: Our generation time is approximately 22 seconds. In contrast, baseline methods depend on SMPL estimation and test-time optimization, which significantly slows them down, taking 2 to 5 times longer. For more details, please refer to Q2 in the general rebuttal.


Q6: The multi-view human generation results of your 2D Diffusion model should be displayed

A6: Thanks for the suggestion. We showed one example in Fig. 6 (last row, 2nd column). We include more examples of 2D diffusion outputs in the rebuttal PDF (Figure 3). We will add these to the supplementary.

Comment

Thank you for the clarifications and the rebuttal response. After reading the authors’ response and the comments of the other reviewers, I have the following concerns:

  1. It's important to stress that a qualified paper needs to include comprehensive comparison experiments with previous related works. I acknowledge that it is difficult to make a detailed quantitative comparison with TeCH (high-quality geometry and texture), FOF (fast inference for high-quality geometry), and other works within a limited time. Still, I think it is unreasonable not to show qualitative comparison results.

  2. As stated by reviewer uByW, "your model seems to perform poorer than SOTA baselines". Although the proposed approach has an advantage for large loose skirts, children, and anime characters, as shown in the rebuttal PDF, the authors do not compare frontal face views.

So, I think this paper does not achieve SOTA performance, and I will keep my rating.

Comment

Thanks for addressing my concerns about comparison methods. However, I still think the qualitative results are not good, and the limited input resolution of the proposed method is not a valid excuse but rather a shortcoming of the method. Considering the results of the qualitative comparison, I will raise the rating to borderline reject and maintain my negative attitude.

Comment

Q1: Comparison with TeCH and FoF

We appreciate the feedback.

As recommended by the reviewer, we test the additional baselines TeCH and FoF.

We are happy to provide additional results for FoF. We use the SMPL-X estimation from ECON as the body prior for FoF. We augment the quantitative evaluation tables as follows:

| Method | Published at | SMPL prior | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- |
| FoF | NeurIPS 2022 | Yes | 5.36 | 0.685 | 0.195 |
| ECON | CVPR 2023 | Yes | 3.52 | 0.749 | 0.278 |
| Our | - | No | 1.35 | 1.38 | 1.31 |

This clearly shows that FoF fails to reconstruct the clothed human accurately, and further demonstrates that our proposed approach achieves SOTA reconstruction performance quantitatively.

For TeCH, despite the protracted and costly inference process, we consistently obtain textured meshes with extremely noisy surfaces. For visual examples, we kindly ask reviewers to check Fig. 6a) in PuzzleAvatar [1], which is produced by the same authors as TeCH. We have also consulted the authors of TeCH, and they confirmed to us that TeCH indeed struggles to produce smooth surfaces in many cases. For the quantitative comparison, we run TeCH on the same 8 unseen subjects from CustomHuman and report the results below. Testing TeCH on more datasets is not possible as it takes 6h per image and can only run on an A100 with 80GB memory. It can be seen in the table that the normal consistency score is significantly lower than our method's, which is consistent with the visual results.

| CustomHuman | Published at | SMPL prior | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- |
| TeCH | 3DV 2024 | Yes | 3.37 | 0.64 | 0.31 |
| Our | - | No | 1.03 | 0.85 | 0.66 |

[1] PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Comment

Q2: Your method has advantages for large loose skirts and children, but frontal face views are not compared

We understand the reviewer's concern about the performance of facial feature reconstruction. We regret not including more direct frontal views in the rebuttal PDF. We hope the following explanation will satisfactorily address the reviewer's concerns:

Firstly, we wish to highlight that our user study, which included views from 45 degrees frontal right and front left, demonstrates a preference for our approach by 80.3% of participants over SOTA baselines such as SiTH, SiFU, ICON, and ECON. Additionally, the side views provided in the rebuttal PDF illustrate that our method surpasses these baselines in facial appearance (facial color, hair, and helmets) and geometry (eyes, noses, and hairstyles).

We appreciate the reviewer’s observation regarding superior facial detail in some baseline models, especially as depicted in Figure 5 of the SiTH paper. In response, we offer the following clarifications:

  1. Input resolution: Unlike SiTH, which uses 1024x1024 input, our approach operates at a lower 256x256 resolution due to model capacity and training cost considerations. We believe that employing a higher-resolution multi-view diffusion model capable of processing images at 512x512 would significantly enhance detailed facial regions.
  2. Underlying SMPL prior: Unlike baselines which estimate SMPL to provide a body shape prior, our method does not rely on the SMPL template. Thus, our approach has no additional information about detailed face and hand geometry from SMPL. However, we argue that estimating SMPL accurately from real-world images is still an open challenge. As illustrated in rebuttal PDF Fig. 1, an inaccurate SMPL template brings disadvantages in representing loose clothing.
  3. Training data: The aforementioned methods rely on ground-truth geometry (SDF) information, providing significant supervision for face geometry reconstruction. We rely only on RGB information, which is more flexible and allows using multi-view image and video datasets.

In this paper, we propose a general framework for monocular reconstruction that can handle particularly challenging scenarios, e.g., loose clothing, using the flexible 3D-GS representation. Our framework elegantly combines a 2D multi-view diffusion model with a 3D-GS generation model. Reviewer iJQG highlights that our framework extends beyond human reconstruction, with potential for generic object or 3D face reconstruction. We demonstrate that our framework obtains better overall human reconstruction, but one could also apply our method to further improve face and hand reconstruction. We agree that accurate facial reconstruction is important for 3D human reconstruction, but we also want to emphasize that overall accuracy, including clothing, is also crucial for realistic avatar creation. We hope the reviewer will value our paper not only on the facial results but on its novelty and generality as well.

Given the reviewer's positive feedback acknowledging the reasonableness of our motivation, the interesting nature of our idea, the quality of our presentation, and the demonstrated performance advantages in scenarios involving large loose skirts, children, and anime characters, we want to understand if there are any additional concerns that might prevent the acceptance of our paper. We are committed to addressing any further issues to ensure our research meets the high standards expected for publication.

Comment

We are thankful for the feedback and the increased score of our submission.

Regarding the limited resolution of the 2D multi-view diffusion model, we wish to highlight that the pretrained 2D multi-view approach is designed to ensure robust generalization, a key advancement over prior baselines, as demonstrated in Fig. 1 of our rebuttal PDF.

Importantly, our proposed framework is not bound to any single model; we initially utilized ImageDream [1] (256x256), the SOTA pretrained model available during our development phase. This choice underscores our model's adaptability, not its limitation. As higher-resolution models such as MVDiffusion++ [2] and CAT3D [3] (both 512x512, not public yet) become available, our approach is fully capable of integrating these advancements, further enhancing its applicability and performance in future applications.

[1] ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation
[2] MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction
[3] CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Review
Rating: 5

The paper introduces a framework that combines 2D Multi-view Diffusion model and Gaussian Splatting to achieve the task of 3D clothed human body reconstruction from a single view. The focus of the paper is to deal with the 3D inconsistency present in 2D multi-view diffusion models.

Strengths

A novel framework that has reasonable motivations. The ablation studies support the various components introduced by the authors.

The method appears to be robust to various different input data.

Weaknesses

Point-to-surface (P2S) metric, widely used in established works like PIFu, PIFuHD, ICON, and ECON, is not used in this paper.

In terms of the resolution of the generated meshes (especially for facial features), the proposed model seems to perform poorer than what has been observed in SOTA like ECON or ICON. Can the authors provide a comparison of the proposed model with the SOTA methods but show only the geometry and not the texture?

Questions

Are all the SOTA methods trained with the same training set (with the same human subjects) as your model?

What is the inference time required to generate each mesh, and how does that compare against the existing SOTA methods?

Limitations

No potential negative societal impact.

I hope the authors address the aforementioned concerns during the rebuttal.

Author Response

Q1: Point-to-surface (P2S) metric not reported in the paper

A1: Thanks for the question. We would like to point out that the Chamfer distance (CD) reported in the paper is a bidirectional point-to-mesh distance. It measures the distance both from point to surface (P2S, reconstructed mesh to GT scan) and from surface to point (S2P, GT scan to reconstructed mesh). We understand and agree with the reviewer that reporting P2S and S2P separately helps analyze the performance. Hence, we report the numbers for the reconstructions from Table 1 below (a minimal sketch of the metric computation follows the table). We additionally report the results of ICON and ECON as suggested by R3 (RJbF). We will integrate these numbers into Table 1.

| Accuracy | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- |
| Our | 1.35 | 1.38 | 1.31 |
| SiTH | 3.92 | 4.18 | 3.66 |
| SiFU | 3.60 | 3.50 | 3.70 |
| ECON | 3.52 | 3.54 | 3.49 |
| ICON | 4.06 | 4.03 | 4.08 |
| PiFU | 2.83 | 2.94 | 2.70 |
| LGM | 3.29 | 2.83 | 3.775 |
| TripoSR | 2.59 | 2.65 | 2.52 |
| InstantMesh | 2.47 | 2.59 | 2.34 |
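For readers who want to reproduce the metric split, here is a minimal sketch of the bidirectional evaluation using densely sampled surface points and a KD-tree (a point-to-point approximation of the point-to-mesh distances used in our script); distances are in the units of the input points (cm in the table above).

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_p2s_s2p(pred_pts: np.ndarray, gt_pts: np.ndarray):
    """pred_pts (N,3): points sampled from the reconstructed mesh.
    gt_pts (M,3): points sampled from the GT scan.
    Returns (CD, S-to-P, P-to-S) as mean nearest-neighbor distances."""
    p2s = cKDTree(gt_pts).query(pred_pts)[0].mean()   # reconstructed mesh -> GT scan
    s2p = cKDTree(pred_pts).query(gt_pts)[0].mean()   # GT scan -> reconstructed mesh
    return 0.5 * (p2s + s2p), s2p, p2s
```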

Q2: Geometry Reconstruction performance and comparison with SOTA methods like ICON and ECON

A2: We have presented the geometry reconstruction results of ICON and ECON in our rebuttal PDF (Figure 1). Our model demonstrates superior performance compared to ICON and ECON. This advantage arises from the limitations of ICON and ECON in modeling loose clothing, accessories, and children due to their dependency on SMPL. In contrast, our model is not constrained in this way. Additionally, we further assessed the quality of the geometry through a user study. We used 20 subjects, randomly sampled from IIIT, Sizer, and CustomHumans, and conducted a user study involving 70 people. In this study, users chose the geometry reconstructed by our method over ICON and ECON 73.8% of the time.

Moreover, previous works like ECON use Poisson reconstruction and Laplacian smoothing as post-processing. ECON also directly replaces the reconstructed face with the facial part of SMPL-X and stitches it in using Poisson reconstruction. Our method uses 3D-GS to represent the human, which is flexible enough to represent diverse clothing geometry, but extracting high-quality meshes from 3D-GS is still an open question. Nevertheless, we do not apply extensive post-processing to optimize the geometry and still obtain better reconstructions. With the rapid advancement of remeshing for 3D-GS (e.g., ref. [1]), we believe our geometry can become even better, and the advantage of our 3D-GS representation will become more pronounced in the future. [1] DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing.


Q3: Settings of SOTA baselines

A3: Thanks for raising this question. We reuse some weights from LGM, which was trained on Objaverse, and fine-tune on 6k human scans. We also fine-tune LGM on the same human data (LGM_human in the paper). For the human reconstruction baselines, we use the officially released model from each baseline to evaluate performance. This is a standard setting in most recent papers in this area (e.g., SiTH, SiFU). It is not possible to retrain all baselines due to compute limitations. Apart from compute, it is also impossible to train baselines like SiTH, SiFU, ICON, and ECON that require GT SMPL fits to scans, as it is very difficult to fit SMPL to scans with wide clothing, clutter, children, or missing parts. No method currently achieves this reliably; it remains an open challenge and poses a limitation for methods dependent on SMPL GT. Our method does not rely on SMPL, which allows us to train on 6k scans, of which only 1500 have GT SMPL fits. However, we fully understand that ablating the contributions of model design and data is important for future work. Thus, we trained our model on Thuman 2.0 only, which is the same training dataset as SiTH, SiFU, ICON, and ECON. We adopt the 2D multi-view diffusion (MVD) model pretrained on Objaverse and the diffusion-based 3D-GS generator pretrained on ShapeNet. We report the performance as follows:

| Accuracy | PSNR | SSIM | LPIPS | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- | --- | --- | --- |
| SiTH | 20.88 | 0.907 | 0.074 | 3.92 | 4.18 | 3.66 |
| SiFU | 20.39 | 0.896 | 0.085 | 3.60 | 3.50 | 3.70 |
| Our (Thuman2.0 only) | 21.21 | 0.907 | 0.066 | 1.60 | 1.66 | 1.63 |
| Our | 21.5 | 0.918 | 0.060 | 1.35 | 1.38 | 1.31 |

The results show that our model trained on the same data (Thuman 2.0) still outperforms the baselines and achieves SOTA performance. We would like to emphasize that the Thuman 2.0 dataset is smaller (approximately 500 samples) and offers less diversity in terms of clothing, subjects, and poses compared to our full training dataset. Despite this, our model outperforms SOTA methods, and its performance is very close to that of our model trained on the full dataset. This demonstrates the strength and effectiveness of our proposed model. We thank the reviewer for encouraging us to run this experiment.


Q4: Inference Time of proposed approach

A4: In the introduction section, we compared our model with other models in terms of memory and processing time. Although our method is based on a diffusion approach, which involves iterative sampling, it only requires 50 feed-forward steps (DDIM), resulting in a generation time of approximately 22 seconds. In contrast, baseline methods depend on SMPL estimation and test-time optimization, which significantly slows them down, taking 2 to 5 times longer.

For more details and table, please refer to the introduction section.

Comment

Q1: Ok, that answered my question.

Q2: I looked at Rebuttal PDF (Figure 1). I appreciate the effort, but I actually asked to see facial features of the generated meshes. From what I can observe, I do believe that existing papers (e.g. SiTH, Fig. 5 from their paper) may be able to do this better. Nevertheless, Rebuttal PDF (Figure 1) does show the structural accuracy of your proposed method. Overall, I do not believe my concern here was well-addressed.

Q3: The table you showed does somewhat address my concern, but I do not agree with the reasons you cited for not doing this in the first place. In particular, you could have picked other methods that do not use SMPL as baselines. I am also not sure why the evaluation datasets are combined into one during quantitative evaluation. Overall, my concern is partially addressed here.

Q4: This is one of my more minor concerns, but I looked at your paper and did not find it. Please specify the line number and the table number.

Overall, I find your rebuttal response to be a good effort although I feel it was mixed in terms of convincing me. Hence, I feel it is appropriate for me to retain my original score.

Comment

Q1: additional P2S metric

We are happy that we addressed the reviewer's concern.


Q2: facial features comparison in Rebuttal PDF

We understand the reviewer's concern about the performance of facial feature reconstruction. We regret not including more direct frontal views in the rebuttal PDF. We hope the following explanation will satisfactorily address the reviewer's concerns:

Firstly, we wish to highlight that our user study, which included views from 45 degrees frontal right and front left, demonstrates a preference for our approach by 80.3% of participants over SOTA baselines such as SiTH, SiFU, ICON, and ECON. Additionally, the side views provided in the rebuttal PDF illustrate that our method surpasses these baselines in facial appearance (facial color, hair, and helmets) and geometry (eyes, noses, and hairstyles).

We appreciate the reviewer’s observation regarding superior facial detail in some baseline models, especially as depicted in Figure 5 of the SiTH paper. In response, we offer the following clarifications:

  1. Input resolution: Unlike SiTH, which uses 1024x1024 input, our approach operates at a lower 256x256 resolution due to model capacity and training cost considerations. We believe that employing a higher-resolution multi-view diffusion model capable of processing images at 512x512 or even higher resolutions would significantly enhance the detail in the reconstructed facial regions.
  2. Underlying SMPL prior: Unlike SiTH, SiFU, ICON, and ECON, which estimate SMPL to provide a body shape prior, our method does not rely on the estimated SMPL template. Thus, our approach has no additional information about detailed face and hand geometry, which can be directly provided by SMPL. However, we argue that estimating SMPL accurately from real-world images is still an open challenge. As illustrated in rebuttal PDF Fig. 1, an inaccurate SMPL template can also bring disadvantages in representing loose clothing.
  3. Training data: The aforementioned methods rely on ground-truth geometry (SDF) information, providing significant supervision for face geometry reconstruction. We rely only on RGB information, which is more flexible and allows using multi-view image and video datasets.

We believe this analysis and the discussion reported here will be beneficial for future works that embrace our proposed novel idea of combining 3D-GS generation with 2D diffusion models. We also thank the reviewer for acknowledging our results on complex cases never addressed by other methods, such as loose clothing and children. We agree that accurate facial reconstruction is important for 3D human reconstruction, but we also want to emphasize that overall accuracy, including clothing, is also crucial for realistic avatar creation.

In this paper, we propose a general framework for monocular reconstruction that can handle particularly challenging scenarios such as loose clothing, using the flexible 3D-GS representation. This framework elegantly combines a 2D multi-view diffusion model with a 3D-GS generation model. Reviewer iJQG highlights that our framework extends beyond human reconstruction, with potential for generic object or 3D face reconstruction without altering the model's architecture. We demonstrate in this paper that our framework obtains better overall human reconstruction, but one could also apply our method to further improve head, face, hair, and hand reconstruction. We hope the reviewer will value our paper not only on the facial results but on its novelty and generality as well.

Comment

Q3: Same settings as SOTA baselines

Having provided additional quantitative results of our approach trained on the same data as baseline works such as SiTH and SiFU, it is clear that our model design can outperform the baselines given the same number of subjects seen during training.

As requested by the reviewer, we report the evaluation datasets separately as follows:

| Sizer denoise | PSNR | SSIM | LPIPS | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- | --- | --- | --- |
| SiTH | 18.9 | 0.912 | 0.063 | 3.38 | 3.38 | 3.37 |
| SiFU | 18.0 | 0.912 | 0.068 | 2.69 | 2.56 | 2.80 |
| Our (Thuman2.0 only) | 20.54 | 0.916 | 0.060 | 1.52 | 1.63 | 1.41 |
| Our | 21.3 | 0.928 | 0.047 | 1.06 | 1.05 | 1.07 |

| CAPE | PSNR | SSIM | LPIPS | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- | --- | --- | --- |
| SiTH | 22.2 | 0.908 | 0.082 | 3.76 | 3.75 | 3.76 |
| SiFU | 22.0 | 0.907 | 0.085 | 3.72 | 3.70 | 3.73 |
| Our (Thuman2.0 only) | 21.1 | 0.908 | 0.075 | 2.23 | 2.19 | 2.25 |
| Our | 21.5 | 0.916 | 0.064 | 1.89 | 1.86 | 1.91 |

| CustomHuman | PSNR | SSIM | LPIPS | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- | --- | --- | --- |
| SiTH | 20.8 | 0.915 | 0.073 | 2.82 | 2.81 | 2.84 |
| SiFU | 20.1 | 0.899 | 0.087 | 3.10 | 3.08 | 3.11 |
| Our (Thuman2.0 only) | 21.61 | 0.909 | 0.069 | 2.08 | 2.08 | 2.09 |
| Our | 22.3 | 0.926 | 0.048 | 1.03 | 1.05 | 1.02 |

| IIIT | PSNR | SSIM | LPIPS | CD (cm) | S-to-P (cm) | P-to-S (cm) |
| --- | --- | --- | --- | --- | --- | --- |
| SiTH | 22.7 | 0.906 | 0.077 | 3.31 | 3.54 | 3.08 |
| SiFU | 22.6 | 0.899 | 0.087 | 4.24 | 4.32 | 4.17 |
| Our (Thuman2.0 only) | 21.8 | 0.900 | 0.071 | 1.67 | 1.70 | 1.64 |
| Our | 22.1 | 0.905 | 0.065 | 1.44 | 1.49 | 1.39 |

Q4: Inference time

Sorry for the misunderstanding. We reported the comparison in the introduction section of the general rebuttal, and we are happy to provide the details here:

Our Human-3Diffusion is a diffusion-based feed-forward approach, avoiding the SMPL estimation and test-time optimization required by models like ICON, ECON, SiTH, and SiFU. This approach significantly boosts our model’s inference speed. We provide a runtime comparison on an Nvidia A100 GPU below, detailing the inference time from an RGB image to the final 3D representation:

| | Our | SiTH | SiFU | ICON | ECON |
| --- | --- | --- | --- | --- | --- |
| Time (s) | 22.6 | 106.2 | 48.9 | 60.5 | 45.3 |
| VRAM (GiB) | 11.7 | 22.0 | 12.0 | 6.3 | 5.9 |

For mesh extraction, we use Gaussian Opacity Fields (11.5s, resolution=256) and TSDF-Fusion (14.8s, across 24 views, resolution=256). We will provide comprehensive details in the revised manuscript.
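As an illustration of the TSDF-Fusion path (not our exact script), a minimal Open3D sketch that integrates per-view RGB-D renderings of the 3D-GS into a triangle mesh; the voxel size, truncation, and camera conventions here are placeholder choices.

```python
import open3d as o3d

def fuse_views_to_mesh(colors, depths, intrinsic, extrinsics, voxel_length=2.0 / 256):
    """colors: list of (H,W,3) uint8 images; depths: list of (H,W) float32 depth maps (meters, assumed);
    intrinsic: o3d.camera.PinholeCameraIntrinsic; extrinsics: list of 4x4 world-to-camera matrices."""
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length,
        sdf_trunc=4 * voxel_length,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
    )
    for color, depth, extr in zip(colors, depths, extrinsics):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=3.0, convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsic, extr)  # fuse this view into the TSDF volume
    return volume.extract_triangle_mesh()
```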

Comment

Thank you for your clarifications, I have no further questions. I acknowledge the good quantitative performance and the limitations that you stated. Before the rebuttal, I was deliberating whether to downgrade or upgrade my initial score for your paper, but your rebuttal has helped me understand that the initial positive score is appropriate.

Review
Rating: 5

In this paper, the authors propose to create realistic avatar representations by coupling the 2D multi-view diffusion and 3D reconstruction models which complement each other. Specifically, the 3D Gaussian Splatting (3D-GS) reconstruction leverages the priors from 2D diffusion models and produces an explicit 3D representation. Meanwhile, the rendered images from 3D-GS representations further guide the reverse sampling of 2D diffusion models to improve 3D consistency. Experimental improvements on some examples are achieved to demonstrate the empirical effectiveness of the proposed method.

Strengths

  • How to infer 3D structure from a single image is an important problem, which affects many downstream applications.
  • The Human 3Diffusion method improves the existing methods in some scenarios.
  • The paper is well written with nice figures. I can follow it easily.

Weaknesses

  • In Figure 3, to me, the outputs of Human 3Diffusion are similar to those of SiTH in terms of image quality. Actually, some examples in Figure 8 even show that SiTH can produce better results than the proposed method, such as the right hand being more reasonable in the 4th row.
  • Although the authors argue why they do not evaluate on the CAPE dataset, which is a standard testbed for previous methods, I think the Sizer and IIIT datasets are also not perfect. For example, as shown in Figure 7, some inputs contain a noisy ground part, while the SiTH method produces a cleaner background than Human 3Diffusion. I guess this might also yield worse metrics for SiTH in Table 1. So I suggest the authors additionally report the metrics on the CAPE and CustomHuman datasets, following the evaluation settings of SiTH, such that readers can better perceive the empirical improvements from the proposed contribution.
  • Some wordings are a bit too strong. For example, in L198, the authors state that they "guarantee the 3D consistency...". I understand that the developed 2D multi-view sampling could improve 3D consistency, but it is hard to say it can fully address this problem.

Questions

  • In Table 4, why is only FID used for evaluation? Could the authors provide results with other metrics like PSNR, SSIM, and LPIPS?
  • How many camera views and how many examples are used for evaluation? Will the authors release their exact evaluation settings in the future?
  • It seems that the results in Tables 2, 3, and 4 are obtained under different settings. Could the authors align all ablated revisions with the settings used in Table 1? This would better help readers comprehend the importance of each key technical idea compared to previous baselines.

I am looking forward to hearing about these points from the authors.

Limitations

The authors address the limitation and societal impact in the appendix.

Author Response

We thank the reviewer for acknowledging the importance of our task and for highlighting that our paper is well written and that our method improves on existing methods. We address the concerns raised below and are open to further discussion and questions.


Q1: Performance of human reconstruction similar to SiTH

A1: We acknowledge SiTH as a strong baseline (CVPR'24, code released in April). While our examples may not fully highlight our method's advantages, Table 1 shows quantitative superiority over SiTH. We provide further qualitative comparisons in the rebuttal PDF (Figure 1). SiTH relies on SMPL estimation, which might produce good hands or faces, but it cannot represent geometry that deviates significantly from the SMPL body. Our method offers greater flexibility in modeling challenging scenarios like loose clothing, interaction, and children, as demonstrated in Supplementary Figures 9-12 and rebuttal PDF Figure 1.

To further evaluate reconstruction quality, we conducted a user study detailed in the general rebuttal section. The results show that 86.6% of participants prefer our reconstructions, clearly indicating superior quality over SiTH and other SOTA baselines.


Q2: Evaluation on new Datasets like CAPE and CustomHuman

A2: Thank you for the suggestion. We have now included results for the CAPE and CustomHuman datasets under the same settings as IIIT and Sizer. Since some high-quality CustomHuman scans were used in training, we report only the results for unseen subjects (ID0636 - ID0641). The FID score is higher in this case because fewer examples were available to compute the image distribution. It can be seen that our method consistently outperforms SiTH and SiFU on both datasets. We will add these evaluations to Table 1.

| CAPE | PSNR | SSIM | LPIPS | FID | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our | 21.5 | 0.916 | 0.064 | 16.40 | 1.89 | 0.80 | 0.49 |
| SiTH | 22.2 | 0.908 | 0.082 | 28.46 | 3.76 | 0.78 | 0.27 |
| SiFU | 22.0 | 0.907 | 0.085 | 43.63 | 3.72 | 0.77 | 0.27 |

| CustomHuman | PSNR | SSIM | LPIPS | FID | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our | 22.3 | 0.926 | 0.048 | 28.94 | 1.03 | 0.85 | 0.66 |
| SiTH | 20.8 | 0.915 | 0.073 | 60.37 | 2.82 | 0.82 | 0.30 |
| SiFU | 20.1 | 0.908 | 0.081 | 87.09 | 3.10 | 0.81 | 0.31 |

We hope these additional experiments address the reviewer's concern.

Q3. Noise on the Sizer test set.

A3: We agree that the Sizer evaluation set is not perfect due to the noise on the ground. We removed the floor noise and redid the evaluation. Results are reported below, and our method consistently outperforms the baselines. We will release the evaluation dataset with the rendered input images for more convenient benchmarking.

| Sizer denoise | PSNR | SSIM | LPIPS | FID | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our | 21.3 | 0.928 | 0.047 | 10.01 | 1.06 | 1.06 | 0.63 |
| SiTH | 18.9 | 0.912 | 0.063 | 21.87 | 3.38 | 0.75 | 0.28 |
| SiFU | 18.0 | 0.912 | 0.068 | 36.64 | 2.69 | 0.78 | 0.33 |

Q4: Some wordings like 'guaranteed 3D consistency' are too strong.

A4: Thank you for pointing this out. We agree that we should avoid over-claiming in the paper. However, we believe this might be due to a misunderstanding of L198: there we meant that the renderings of the predicted 3D-GS are guaranteed to be 3D consistent. This is true because we have an explicit 3D-GS representation. To help the 2D diffusion model, we add noise to the renderings as the input to the next step (L7, Alg. 2). These noised renderings are indeed not 3D consistent. We will clarify this better in L198-200. We are open to further discussion and feedback from the reviewer to improve the manuscript.
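For clarity, a minimal sketch of the re-noising step mentioned above (L7, Alg. 2), assuming a standard DDPM-style forward process with a precomputed cumulative-alpha schedule; the actual schedule follows the 2D MVD model.

```python
import torch

def renoise_renderings(renders: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """q(x_t | x_0): add Gaussian noise to the 3D-consistent renderings (treated as x_0)
    so they can serve as the input of the next sampling step.

    renders: (B, V, 3, H, W) renderings of the predicted 3D-GS
    alphas_cumprod: 1D tensor of cumulative alpha products, indexed by timestep
    """
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(renders)
    return a_bar.sqrt() * renders + (1.0 - a_bar).sqrt() * noise
```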


Q5: Why is only the FID metric used in Table 4? Can the authors provide other metrics like PSNR, SSIM, and LPIPS?

A5: Thank you for the suggestion. We provide the other metrics below:

| Ablation Tab. 4 | PSNR | SSIM | LPIPS | FID | CD (cm) | NC | F-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our w/o 2D prior | 20.98 | 0.912 | 0.068 | 11.70 | 1.75 | 0.795 | 0.498 |
| Our | 21.49 | 0.918 | 0.060 | 9.57 | 1.35 | 0.798 | 0.550 |

In the original submission, the baseline was trained with a relative camera system, which means the camera poses of the input and the 3D reconstruction are unknown. This makes it difficult to compute PSNR and SSIM, which require good image alignment. We retrained the baseline with a global camera system, which makes the comparison possible and also explains the difference in FID compared to the original Table 4. However, the conclusion is the same: the 2D prior helps 3D reconstruction. We will update Table 4 with the new numbers.


Q6: Details about the evaluation setting

A6: We use 32 uniformly rendered views around the human with zero elevation angle (Supp. L840-844). The number of subjects evaluated in each dataset is: IIIT (155), Sizer (136), CAPE (107), and CustomHuman (8). We will release the full evaluation setting, including the Blender rendering and metric calculation scripts. To ensure reproducibility, we will also release the processed, denoised Sizer dataset.
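For reference, a small sketch of how such an evaluation camera ring can be constructed (32 azimuths at zero elevation, converted to look-at world-to-camera matrices). The radius and axis conventions are illustrative assumptions, not our exact Blender setup.

```python
import numpy as np

def evaluation_cameras(num_views: int = 32, radius: float = 2.0):
    """Return 4x4 world-to-camera matrices evenly spaced in azimuth at zero elevation,
    all looking at the origin. Assumes a z-up world and an OpenCV-style camera frame."""
    target = np.zeros(3)
    cams = []
    for az in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        eye = np.array([radius * np.cos(az), radius * np.sin(az), 0.0])
        forward = (target - eye) / np.linalg.norm(target - eye)   # camera +z
        right = np.cross(forward, np.array([0.0, 0.0, 1.0]))      # camera +x
        right /= np.linalg.norm(right)
        down = np.cross(forward, right)                           # camera +y (points down)
        w2c = np.eye(4)
        w2c[:3, :3] = np.stack([right, down, forward])            # rows: camera axes in world coords
        w2c[:3, 3] = -w2c[:3, :3] @ eye
        cams.append(w2c)
    return cams
```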


Q7: Do Tables 2, 3, and 4 have the same setting as Table 1? What is the difference between these ablations?

A7: Tables 1, 3, and 4 all have the same evaluation setting, and our numbers are aligned across them. Table 2 ablates the influence of our 3D model on the 2D multi-view diffusion outputs. Hence, we evaluate only on the 4 output images from the MVD instead of the 32 views used in the other tables, which leads to different numbers for our method. We will move Table 2 after Tables 3 and 4, making it easier for readers to connect the information across these tables.

Comment

I thank the authors for offering a detailed response. I really appreciate it. Generally, the response addresses most of my concerns. However, I am a bit confused about why the scores of SiTH in Q2 (Evaluation on new datasets like CAPE and CustomHuman) do not match Tab. 1 in SiTH's paper. Could the authors explain this a bit?

Comment

We are happy that most of the reviewer's concerns have been addressed.


Q1: Different SiTH numbers between our table and SiTH paper

For the evaluation, we use the official inference pipeline from the SiTH GitHub repo to obtain the reconstructions and use the official alignment script to align with the GT meshes before evaluation. The only difference is that our test images are rendered with perspective cameras, while SiTH uses orthographic cameras. Due to this difference, there is some pose offset in the SMPL fitting results, leading to the gap. It is worth mentioning that estimating 3D SMPL from perspective images can lead to less faithful 3D, as also discussed in CLIFF [1] and SPEC [2]. We observe similar artifacts in the SMPL estimation, such as legs bending backwards, as shown in our rebuttal PDF. Moreover, we reran SiTH on orthographic renderings and reproduced numbers similar to those reported in the SiTH paper. We will clarify this in the experiment section. We really appreciate the great results produced by SiTH and thank SiTH's authors for helping figure out the evaluation settings. Nevertheless, it is crucial to acknowledge that nearly all real-world images are captured with perspective cameras. Thus, we argue that our evaluation setting, designed to handle perspective images, offers a more accurate reflection of performance in real-world conditions.

[1] CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation, ECCV2022
[2] SPEC: Seeing People in the Wild with an Estimated Camera, ICCV2021

Comment

I thank the authors for providing the detailed explanations. Based on the current response, I would retain my positive attitude.

Author Response

Dear Reviewers and Area Chairs,

We sincerely thank all reviewers and ACs for their time and insightful feedback. We are glad that they found our work novel and addressing an important task (R1) and that they appreciated the technical contribution of integrating 3D Gaussian Splatting generation (R3, R4) within 2D diffusion. Reviewers also appreciated our experiments, since we improve over existing methods (R1, R4), validate with a comprehensive ablation study (R2), and show robustness to different inputs (R2, R4) as well as general objects (R4).

The main concerns raised in the reviews are the evaluation and qualitative comparison with more baselines, and the inference time cost. We address these by adding comparisons with ICON and ECON on our initial evaluation datasets as well as on CAPE and CustomHuman. We also report the inference time and compare against the baselines. Please see Q1 and Q2 below. We answer each question in more detail in the replies to each reviewer. We sincerely hope that our replies address all the concerns. We are also open to discussion and happy to clarify or address any further questions.


Q1: The qualitative results are not outstanding compared to SOTA baselines; why is that?

A1: We appreciate the reviewer’s assessment. However, we maintain that our results, as shown in Figure 3 of the main paper and Figures 7 and 8 in the supplementary material, indeed surpass current SOTA methods. To further support our position, we have included additional qualitative results as suggested by reviewers and a user study in our rebuttal.

We have included additional qualitative results (see Figure 1) in the Rebuttal PDF, where we compare our method against SiTH (CVPR2024), SiFU (CVPR2024), ICON (CVPR2022), and ECON (CVPR2023). Our results highlight the advantages of our approach in handling challenging scenarios such as large loose skirts, children, anime characters, and diverse accessories, where prior methods often struggle and sometimes completely fail to generate anything reasonable. SOTA methods rely on the SMPL template which might produce good hands or faces, but it is limited by the naked body shape. In contrast, our method does not rely on SMPL and is more flexible. We invite reviewers to examine Figures 9-12 in the supplementary materials and Figure 1 in the Rebuttal PDF. This flexibility is also evident in our quantitative results, where SOTA baselines falter due to SMPL estimation inaccuracies.

Moreover, we thoroughly assessed our results through quantitative analysis and a user study with 70 participants. This study compared 20 textured subjects against SiTH and SiFU, and 20 geometry-only subjects against ICON and ECON, using subjects from the IIIT, Sizer (w/o floor noise), Cape, and CustomHuman test sets. Participants were asked to select the best reconstruction among three options. Details and a demo of the study are provided in the Rebuttal PDF.

| User study | Our | SiTH | SiFU |
| --- | --- | --- | --- |
| Appearance & geometry | 86.6% | 7.6% | 5.8% |

| User study | Our | ICON | ECON |
| --- | --- | --- | --- |
| Geometry only | 73.8% | 8.0% | 18.2% |

In summary, our approach is preferred by 80.3% of participants in our user study, indicating a significant preference over the baseline models.

We trust that our additional results comprehensively address the concerns raised. We remain open to further suggestions on how to better demonstrate the advantages of our approach over the baselines.


Q2: What is the runtime to infer one image? Does Human-3Diffusion have an advantage in efficiency compared to other works?

A2: Thanks for the question. Our Human-3Diffusion is a diffusion-based feed-forward approach, avoiding the SMPL estimation and test-time optimization required by models like ICON, ECON, SiTH, and SiFU. This approach significantly boosts our model’s inference speed. We provide a runtime comparison on an Nvidia A100 GPU below, detailing the inference time from an RGB image to the final 3D representation:

| Model | Time (s) | VRAM (GiB) |
| --- | --- | --- |
| SiTH | 106.2 | 22.0 |
| SiFU | 48.9 | 12.0 |
| ICON | 60.5 | 6.3 |
| ECON | 45.3 | 5.9 |
| Ours | 22.6 | 11.7 |

For mesh extraction, we use Gaussian Opacity Fields (11.5s, resolution=256) and TSDF-Fusion (14.8s, across 24 views, resolution=256). We will provide comprehensive details in the updated version.


For other individual comments, we have addressed each within the respective sections assigned to each reviewer. We deeply appreciate all the effort and time invested by the reviewers and Area Chairs.

Best,
Authors

Comment

We would like to thank AC and all reviewers for the valuable feedback and active involvement in the discussion. We summarize the merits of our work and additional results for rebuttal and discussion below:

Brief summary of our contributions

  • Novel method for 3D reconstruction. We introduce a general framework for image-based 3D reconstruction, which leverages the strong prior of 2D multi-view diffusion (MVD) models and the 3D consistency of the explicit 3D Gaussian splatting representation. At the core of our method is a novel joint diffusion process that synchronizes the 2D MVD and 3D generation models, enhancing the performance of both.
  • SOTA human reconstruction: Our method achieves SOTA results in human reconstruction and surpasses 10 baselines (the most recent from ECCV 2024) (Fig. 1 in rebuttal PDF, Figs. 7-8 in paper, Q3 to reviewer RJbF).
  • Strong generalization: Our design leverages the 2D multi-view model effectively, enabling strong generalization to diverse unseen cases like large loose clothing, animals, anime characters, and human-object interaction, as demonstrated in various figures across the paper and rebuttal (Figs. 5, 14-16 in paper and Fig. 1 in rebuttal PDF).

Brief summary of rebuttal stage

We are very happy to see that there is no criticism of our method's novelty and significance. The concerns raised in the initial reviews focused on the quality of 3D reconstruction, inference time, and the need for more extensive quantitative comparisons with other baselines. We addressed these in the rebuttal in the following aspects:

  • We show more qualitative comparisons in the rebuttal PDF, clearly demonstrating that our Human-3Diffusion outperforms all SOTA baselines. Furthermore, we conducted a user study with 70 participants evaluating 40 randomly selected subjects. The study showed a strong preference for our method, with 80.3% favoring it over others.
  • We also reported a runtime of ~22 seconds per image, significantly faster than baselines such as ICON, ECON, and SiTH, which take at least 45 seconds due to SMPL estimation and surface-optimization post-processing.
  • We expanded our comparison to include four additional baselines (with code available): ICON, ECON, FoF, and TeCH. Our Human-3Diffusion model is now compared against 10 recent SOTA methods and consistently outperforms all baselines on 4 challenging datasets.

We are grateful that the additional data and comparisons have addressed most concerns. Reviewers KYnm, uByW, and iJQG maintain a positive attitude toward accepting our paper, and RJbF acknowledged our enhanced comparisons and qualitative results, subsequently raising his/her score. We kindly request that all reviewers consider revising their scores upwards if they are satisfied with our revisions. Thank you again for all the constructive feedback and discussions.

Respectfully,
Authors

Final Decision

This paper proposes a framework that utilizes 2D multi-view diffusion models and ensures 3D consistency through explicit 3D Gaussian splatting to generate 3D clothed avatars. The initial scores were mixed, with two borderline accepts (5), one weak accept (6), and one reject (3). Following the rebuttal, the reject was revised to a borderline reject (4). The main concerns raised by the reviewers focused on the quantitative and qualitative evaluations. While the reviewer RJbF still has reservations about the qualitative comparison results and input resolution, the other reviewers found the responses satisfactory and maintained a positive attitude during rebuttal. Given the innovative joint diffusion process that synchronizes 2D multi-view diffusion with 3D generation, which was acknowledged by all reviewers, I recommend accepting this paper.