Robust-PIFu: Robust Pixel-aligned Implicit Function for 3D Human Digitalization from a Single Image
Using Transfer Learning and Latent Diffusion Models to improve the robustness of pixel-aligned implicit models to occlusions and to achieve SOTA performance.
Abstract
Reviews and Discussion
This paper proposes a method to address the common challenges in single-image to 3D human reconstruction, specifically the performance degradation caused by self-occlusion and inter-person occlusion. The authors introduce the Disentangling Diffuser, a diffusion model used for inpainting occluded regions to resolve inter-person occlusion, and the Penetrating Diffuser, which generates multi-layer normal maps to handle self-occlusion. Using these outputs, the paper proposes a PIFu-based structure for 3D human shape reconstruction.
Strengths
This work takes an approach to resolving self-occlusion and inter-person occlusion without significantly altering the foundational PIFu method. It does so through a divide-and-conquer strategy built on the Disentangling Diffuser and Penetrating Diffuser diffusion models.
Weaknesses
- The method's novelty is limited. The proposed Disentangling Diffuser is identical to the Stable Diffusion [1] inpainting model in both its structure and functionality, making it difficult to consider this a unique contribution of the paper. Similarly, the Penetrating Diffuser is analogous to ICON [2], which utilizes the SMPL model for normal prediction. It is also challenging to find novelty in the Pixel-aligned Implicit Model.
- The paper's structure is difficult to follow, and the notations lack clarity. In particular, Section 3 lacks a formal formulation of the inputs and outputs for each model, and there is excessive repetition regarding input and output dimensions that could be streamlined. Significant revisions are needed to improve the paper's quality.
- Figures 3, 4, and 5 are excessively large and occupy too much space.
- The placement of Figures 4, 8, and 10 is inconvenient and reduces readability.
- The paper does not reference any methods from 2024, and all comparison methods are from 2023 or earlier. Additional experiments with newer methods are needed.
- The experimentation and result analysis are insufficient. In particular, a quantitative evaluation of the inter-occlusion claimed in this paper is necessary.
- There is no explanation or reference for the evaluation metrics. A reader who does not know the evaluation-metric abbreviations would find them difficult to understand.
- The requirement to manually define occlusion regions (mask) limits practical applicability.
- The ablation study on the layer depth of the outer and inner normal maps is insufficient, particularly concerning variations in limb length and pose.
- There is a lack of detail on which diffusion and PIFu models were used and the training and optimization procedure, making reproduction challenging.
[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models.
[2] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J Black. ICON: Implicit Clothed Humans Obtained from Normals.
Questions
- Why are there no references and comparisons to methods published after 2023?
- Is there a specific rationale for defining the Penetrating Diffuser views as front, back, right, and left? Could additional views be used?
- How does the inference runtime or computational complexity compare to other methods?
- When using the Disentangling Diffuser and Penetrating Diffuser for inference, how many denoising steps were performed, which scheduler was used, and was classifier-free guidance (CFG) applied?
- Is there an additional study on the reliability and performance of images generated by the diffusion model (particularly normal maps), given that diffusion model results can vary with different seed values?
- How does the viewpoint of a single-view image affect performance, and to what extent does it impact reconstruction quality?
Weakness 1a:
We list a few of the important differences between our Disentangling Diffuser and the Stable Diffusion’s inpainting model here:
- Stable Diffusion's inpainting model does inpainting and thus requires a mask that indicates exactly which part/region of the input image needs to be infilled. In contrast, our Disentangling Diffuser does not do inpainting and is only given a Human Subject Selection Map, which is a map that indicates which human subject in the input image is to be retrieved. This map can be any of three arbitrary but fixed patterns, and it certainly does not indicate the exact location of the human subject that we wish to retrieve. For example, if the given Human Subject Selection Map is the first pattern, then the leftmost human subject in the input image will be retrieved, regardless of whether the leftmost human subject is physically located at the leftmost or rightmost end of the input image.
- Stable Diffusion's inpainting model is used for the inpainting task, while our Disentangling Diffuser is used for the task of external occlusion removal. These two tasks are very different and have very different inputs and outputs. In inpainting, you are given a mask that indicates which region of the input image needs to be filled, and the task is to fill that region using context from the non-masked parts of the input image. Our task of external occlusion removal is more complicated and involves many different sub-tasks. Firstly, an input image can have more than one human subject, so our Disentangling Diffuser needs to understand how many human subjects are in the input image. After that, the diffuser needs to find out which of the human subjects is the target human subject. In addition, not only do the non-target human subjects need to be removed, our Disentangling Diffuser also needs to find out whether any body parts of the target human subject are being occluded by the non-target human subjects. This means that our Disentangling Diffuser needs to understand which body part belongs to which human subject. The occluded body parts of the target human subject need to be filled in autonomously, since we do not provide any hint of where or what body parts are being occluded. Extraneous objects, such as chairs, tables, etc., need to be removed autonomously as well, again without any hint of what these objects are or where they will be. If these extraneous objects are occluding the target human subject's body parts, then the occluded body parts need to be generatively predicted as well. In addition, if you refer to the 3rd and 4th rows of Fig. 11 in our manuscript, you will see that our Disentangling Diffuser has the additional responsibility of correcting/optimizing the camera position. Moreover, regions of the input image that are occluded by darkness (i.e. the image has inadequate lighting) need to be generatively predicted by our Disentangling Diffuser as well. These various sub-tasks are certainly not part of the inpainting task.
- Our Disentangling Diffuser uses a CLIP Image Embedder to encode the input image and then applies Cross-Attention to incorporate the input image as a conditional prior. Stable Diffusion's inpainting model uses neither a CLIP Image Embedder nor Cross-Attention for its conditional prior.
- Our Disentangling Diffuser is adapted to the domain of human images, while Stable Diffusion's inpainting model works in the domain of natural images. For our Disentangling Diffuser, we trained a specialized autoencoder to learn latent encodings of images that contain at least one and at most three human subjects. The majority of these images have external and/or internal occlusions, which means the human subject is usually only partially visible. In contrast, Stable Diffusion's inpainting model is trained on natural images. Consequently, it has difficulties dealing with images that contain humans (please see the 3rd and 4th rows in Fig. 21 of the Stable Diffusion arXiv paper at https://arxiv.org/pdf/2112.10752).
Weakness 1b:
In addition, we list a few of the important differences between our Penetrating Diffuser and ICON:
- ICON is a CVPR 2022 paper and it generates front and back normal maps. Our Penetrating Diffuser does not claim the generation of front and back normal maps as part of our novelty. In fact, our Penetrating Diffuser, as the name suggests, generates the normal maps that lie in between the front and back normal maps. This is important to us because we are trying to solve self-occlusion (e.g. a human subject's arms cover his torso, resulting in errors in the 3D reconstruction of his torso). Moreover, our Penetrating Diffuser generates right and left normal maps and the normal maps that lie in between the right and left normal maps. These normal maps are not generated by ICON either.
- ICON does not use a diffusion model at all, but our Penetrating Diffuser is a latent diffusion model. Given the immensely under-constrained nature of the task given to our Penetrating Diffuser (see the previous point), it is not feasible to depend on ICON to provide reasonable outputs for this task. In contrast, our Penetrating Diffuser makes use of diffusion and very large-scale pre-training to successfully achieve the task.
- ICON trains two different networks that separately predict the front and back normal maps, whereas the Penetrating Diffuser is a self-contained network that predicts 8 different normal maps. In addition, in each forward pass, each of ICON's networks produces 1 normal map, whereas a forward pass of the Penetrating Diffuser produces 4 normal maps. Moreover, the Penetrating Diffuser, unlike ICON's networks, utilizes sequential conditioning, where the previous outputs (4 normal maps) are used as additional conditional priors for the next forward pass.
Weakness 1c: There is no existing method that incorporates multi-layered normal maps into a Pixel-aligned Implicit Model. Our work changes that by proposing the Layered-Normals Pixel-aligned Implicit Model, which is designed with two different mechanisms to integrate multi-layered normal maps. One of these mechanisms is the Multi-dimensional Layered Normals Grid, which aligns the 8 layered normal maps in a 3D space using a 3D grid; a 3D-CNN is then used to learn to extract important features from this grid.
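To make the grid mechanism more concrete, here is a minimal sketch (ours, not the authors' released code) of how 8 pixel-aligned normal maps could be stacked along a depth axis and processed by a small 3D-CNN; the layer ordering, grid depth, and channel sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LayeredNormalsGridEncoder(nn.Module):
    """Illustrative sketch of a Multi-dimensional Layered Normals Grid encoder:
    the 8 layered normal maps form the depth axis of a 3D volume and a small
    3D-CNN extracts pixel-aligned features. Sizes are assumptions."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # 3 channels = (nx, ny, nz)
            nn.ReLU(inplace=True),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, layered_normals):
        # layered_normals: (B, 8, 3, H, W) -> (B, 3, 8, H, W), 8 layers as the depth axis
        grid = layered_normals.permute(0, 2, 1, 3, 4)
        return self.net(grid)   # (B, feat_dim, 8, H, W) pixel-aligned 3D features

# usage sketch
feats = LayeredNormalsGridEncoder()(torch.randn(1, 8, 3, 128, 128))
```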
Weakness 2:
Our paper is structured as follows:
- Firstly, in Section 1, we introduce and explain the problem that we want to resolve. Terms and jargon are defined here. We explain the motivations behind our work and list our work's contributions.
- Next, in Section 2, we introduce related works and explain relevant concepts.
- Then, in Section 3, we first introduce our proposed model and its various components from a high-level point of view. Then, in Sections 3.1 - 3.4, we explain each of the components with more specific details.
- After that, in Section 4, we first describe the datasets that we used and the benchmark models that we compared against. Following that, we show our proposed model's qualitative and quantitative results. This is followed by ablation studies of the different components that form our proposed model.
- Finally, in Section 5, we describe our manuscript's limitations before ending the manuscript with a conclusion.
If you find the above order confusing or in need of adjustment, do let us know and we will make the required adjustments. In addition, it would help us a lot if you could let us know which of our notations "lack clarity".
In response to your feedback, we have added a formal formulation for our models (i.e. D and P) in our revised manuscript (see Eqn. 1 in Sect. 3.1 and Eqn. 2 in Sect. 3.2). We have also removed repetitions pertaining to input and output dimensions by moving this information to our Appendix ("A.14 Implementation Details").
Weakness 3: We have reduced the size of Figs. 3,4 and 5 in our revised manuscript.
Weakness 4: We have shifted the position of Fig. 4, Fig. 8, and Fig. 10 in our revised manuscript.
Weakness 5:
Our work addresses the problem of Occlusion and Robustness in Single-view Clothed Human Digitalization (Sect 2.2 of our manuscript). The most recent, relevant work that we found is the work by Wang et al. [1], which was published in 2023 and included as a benchmark model in our paper. We did not find any relevant work published in 2024.
[1] - Junying Wang, Jae Shin Yoon, Tuanfeng Y Wang, Krishna Kumar Singh, and Ulrich Neumann. Complete 3d human reconstruction from a single incomplete image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8748–8758, 2023b.
Weakness 6: The quantitative evaluation shown in our manuscript includes randomly applied inter-occlusions. These inter-occlusions are illustrated in Fig. 1, Fig. 7, and Fig. 8.
Weakness 7: The abbreviations are explained in the first paragraph of Sect. 4.2 in our manuscript. In addition, we added an explanation for the metrics in our revised manuscript in the first paragraph of Sect. 4.2.
Weakness 8: No, we do not require occlusion regions to be manually defined. Our Human Subject Selection Map is explained in Sect. A2 of our Appendix. The Human Subject Selection Map is not a mask. It does not identify where the occlusion is, and it does not identify where the target human subject is. It is simply one of three arbitrary patterns that indicate to our D whether we want to target the leftmost, the second leftmost, or the third leftmost human subject in the input image. The three patterns can be arbitrarily chosen, but they must be consistent during training and testing.
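For concreteness, a minimal sketch of how such a selection map could be generated is shown below; the stripe-shaped patterns are purely our illustrative choice, since the response above only requires three fixed, consistent patterns.

```python
import numpy as np

def human_subject_selection_map(subject_index: int, size: int = 512) -> np.ndarray:
    """Return one of three arbitrary but fixed patterns telling D to target the
    1st, 2nd, or 3rd leftmost human subject. The vertical-stripe pattern here is
    an illustrative assumption; any three consistent patterns would do."""
    assert subject_index in (0, 1, 2)
    pattern = np.zeros((size, size), dtype=np.float32)
    third = size // 3
    pattern[:, subject_index * third:(subject_index + 1) * third] = 1.0
    return pattern
```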
Weakness 9:
It is not practical to increase the number of layers for the normal maps. Firstly, this would likely double the testing time required by the Penetrating Diffuser (P), because separate runs are required for quality. Secondly, this would only cater to very rare body poses; for most body poses, the increased number of layers would simply mean additional testing time spent generating blank or repeated normal maps. Thirdly, because only very rare body poses require more than 4 layers in a direction, we do not have sufficient training data for P to learn to produce reasonable outputs.
Reducing the number of layers is not reasonable either, as it would defeat the purpose of proposing our Penetrating Diffuser in the first place, and it would only reduce the number of body poses that our proposed model can handle.
Weakness 10: Those details are in Section "A.14 Implementation Details" of our Appendix.
Question 1: Please see our above response to your “Weakness 5”
Question 2: In order to construct a realistic, sharp, and detailed 360-degree view of a human avatar, we require views that cover 360 degrees. Previous works like ECON use only the front and back views, and their results show irregular and erroneous right and left views (e.g. ears are distorted). By using front, back, right, and left views, we are able to overcome the shortcomings of ECON. These 4 views are the minimum number of views required to cover the 360-degree view appropriately. Using additional views leads to replication of the same information and increases the inference time of P by at least 1.5 times.
Question 3: Please see Section “A.12 Computation Time Required by our Robust-PIFu” of our Appendix.
Question 4: We use 200 DDIM sampling steps with a DDIM scheduler, and CFG is applied.
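For readers unfamiliar with this setup, a minimal sketch of a 200-step DDIM sampling loop with classifier-free guidance is given below, assuming a diffusers-style conditional U-Net; the guidance scale and latent shape are placeholder values, not the authors' exact settings.

```python
import torch
from diffusers import DDIMScheduler

def ddim_sample_with_cfg(unet, cond_emb, uncond_emb, guidance_scale=7.5,
                         num_steps=200, latent_shape=(1, 4, 64, 64)):
    """Illustrative DDIM sampling with classifier-free guidance (CFG).
    `unet` is assumed to follow the diffusers UNet2DConditionModel interface."""
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)                  # 200 DDIM steps
    latents = torch.randn(latent_shape)
    for t in scheduler.timesteps:
        with torch.no_grad():
            eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
            eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
        # CFG: push the prediction towards the conditional branch
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```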
Question 5: We acknowledge that hallucination and variability in output are known problems of diffusion models and other generative models. In order to prevent wildly hallucinated results, we use strong conditional priors, such as SMPL-X normal maps and the outputs generated from the previous iteration. These conditional priors keep the outputs stable and narrow the output space of the diffusion models. From our experiments, we did not find any egregiously unreasonable results from our diffusion models.
Question 6: If the viewpoint of the input image is front-facing, then we expect the back-facing view of the 3D avatar to be the most challenging to reconstruct (since there is the least information there). Our proposed model is designed to deal with such situations; thus, as you can see from Fig. 21 of our manuscript, our proposed model is able to handle these situations and maintain high reconstruction quality. There are also examples that compare our proposed model against SOTA models in such situations in Figs. 1, 2, 7, and 19. These examples demonstrate the significance of our method.
Thank you for the revisions and your responses to my questions.
After reviewing the additional explanations and the revised paper appendix, I have reconsidered my evaluation and adjusted the score. However, there are still areas for improvement in the revised version.
- The explanation of the background knowledge for the method presented in the paper is insufficient, and it is written in a way that may be difficult for readers who are not familiar with the related methodologies to understand.
- There is a lot of verbal explanation of the method that has not been expressed in mathematical formulas, which reduces clarity. Such formalization would also have helped simplify the paper; because it was not done, the paper became longer and important content could not be included in the main paper.
- The literature review is lacking, particularly in terms of mentioning and comparing recent methods. Papers such as [1], [2], and [3] are recent works related to this paper and were published before the ICLR submission deadline. Some of these methods also utilize diffusion models.
- The rationale for dividing the human subject selection map in "DISENTANGLING DIFFUSER" into three sections is unclear, and results for various poses and human locations are not provided. There are likely limitations, but there is no analysis of failure cases, nor any mention of potential improvements.
Regardless of the outcome of this submission, I hope that these points will be taken into consideration for further revisions.
[1] Chen, Mingjin, et al. "Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail."
[2] Lee, Jungeun, et al. "PIDiffu: Pixel-aligned Diffusion Model for High-Fidelity Clothed Human Reconstruction."
[3] Ho, I., Jie Song, and Otmar Hilliges. "Sith: Single-view textured human reconstruction with image-conditioned diffusion."
Pointer 1: Thank you for your feedback. We made a new revision of our manuscript and added more background information in Sections A.18.1 and A.18.2. The information added there will be useful for readers who are not familiar with the related methodologies. We have also updated our "Related Works" section to help these readers.
Pointer 2: We understand that although we added two mathematical formulas in our revised manuscript, you still wish to see more mathematical formulas being used in place of our textual explanations. Thus, we revised our manuscript again and added two more mathematical formulas (Eqn 3 in Sect. 3.3 and Eqn 4 in Sect. A.7). With these additions, all components of our model are now given a mathematical formula.
Pointer 3: We will add these three references to our literature review. These methods may not be suitable as benchmark models because they do not address external occlusions, which are central to the problem that we are trying to resolve.
Pointer 4:
Thank you for your feedback. There is no need to divide the human subject selection map into three sections. We only do this for human-interpretability and to make it intuitive for readers. In other words, the human subject selection map can be set to any three arbitrary (but consistent) patterns, and it will work the same.
Results for various poses and human locations for our Disentangling Diffuser can be seen in Figs. 10, 11, 12, 20, and 26. In particular, the 2nd and 3rd rows in Fig. 10 show the leftmost human subject being located away from the leftmost 1/3 of the input image. Despite that, the Disentangling Diffuser successfully recognized and located the leftmost human subject.
We discussed our limitations in Sect. 5, Sect. A.12, and Sect. A.13 of our manuscript. Thank you for your suggestion; we revised our manuscript and added Sect. A.19 to analyse the failure cases in our work. We also added Sect. A.20 to our newly revised manuscript to discuss potential improvements and future works.
The authors propose a method to reconstruct 3D human(s) from a single image by leveraging pretrained latent diffusion models and an implicit function representation. The models are utilized for disentangling external occlusions (overlapping people) and recovering from internal occlusions (viewpoint-specific missing details), with an optional normal-map super-resolution module to improve fidelity. The diffusion models and implicit models are trained using THuman2.0. The method is evaluated on the MultiHuman and BUFF datasets and shows superior quality and performance compared to prior works.
Strengths
The method is simple and the writing is easy to follow. To highlight specific positives:
- Leveraging open-source diffusion models: This method highlights the ability of large open-source pretrained diffusion models to tackle 2D human inpainting (disentangling diffuser) and cross-domain conditional image generation (penetrating diffuser). The pipeline chains the outputs in a straightforward manner and obtains good quality results.
- Superior result quality: From the evaluations, it is quite clear that prior work cannot address the problem of inter-penetrating and self-occluding human reconstruction. This method achieves much higher quality reconstructions, which is reflected in the qualitative and quantitative experiments.
- Experimental results and observations: The authors have provided extensive results and ablations across the paper and the appendix, which is much appreciated.
Weaknesses
The approach overall is quite straightforward. However, I have a few concerns regarding the details of the method. Specifically:
- Need for SMPL-X mesh: The paper mentions that prior work ECON and Wang et al. require accurate SMPL-X meshes for good predictions which Robust-PIFu does not need (L51-53). In fact, not using the SMPL-X model causes the model to have errors as shown in Fig. 14. Please clarify this point and show results when the SMPL-X input has some error.
- Method's robustness to diverse data: The results in the main paper are quite impressive. But I would urge the authors to add more qualitative results on a larger diversity of data to get a more holistic view of the method's robustness (to live up to its name :) ):
- E.g., real-world images under multiple settings (single/multi-humans with varying degrees of occlusion).
- Predicted normal maps N_s: What is the reason behind choosing the arbitrary number 8? There are numerous human poses that do not get covered by just 8 maps. How does the method handle such scenarios?
Questions
- Fig. 3: consider adding a caption.
- Fig. 4 caption has incorrect layout for the titles.
- Disentangling Diffuser:
- How well does the disentangling diffuser work for overlapping humans at different depths? I presume the training data had humans at similar depths.
- Were any augmentations applied to add external occlusions to human images? L306 is not clear about this.
- Penetrating Diffuser:
- What is the reason behind generating the normal maps in two parts? Is it mostly due to memory and compute or due to bad predictions?
- Layered Normals PIFu model:
- What is the need for adding the SMPL-X normal maps N_s here, since all the required information to reconstruct the human is present in N_c? An ablation on this would be useful.
- Refining Diffuser:
- Why is R run 8 separate times to obtain the super-resolved normal maps? Would it be possible to super-resolve all 8 at once or 4 at a time (similar to P's output format)?
- Fig. 7 (Row 4) - Show results for both humans.
- For ablations on P (L481) and the implicit model (L492), what are the inputs to the implicit model in each setting?
- For the ablation on P (Fig. 8), the hidden arms of the person in each image should have been reconstructed correctly if P outputs the correct normal maps. Does P fail when there are multiple people in the input image, since it's OOD?
- Fig. 22: It would be interesting to show results on these examples. This would show that even though the GT mesh has holes, the method can reconstruct plausible surfaces.
Details of Ethics Concerns
No ethical concerns.
Question 8:
For the ablation on P (previously L481, now at L500-L506 in the revised manuscript), when P is not used, the pixel-aligned implicit model is still given the disentangled image (from D), N_s, and N_c, but N_c consists of only two normal maps (front and back). Also, the two normal maps are not generated by P, but by a separate normal predictor that is commonly used in existing works (i.e. PIFuHD, ICON, and ECON). The reason for still including the two normal maps is that existing works typically use these two normal maps in their models. If we exclude these two normal maps, we would be over-exaggerating the impact of our P.
For the ablations on the implicit model (previously L492, now at L506-L512 in the revised manuscript), when the Multi-dimensional Layered Normal Grid is not used, the values (processed or not) from the Multi-dimensional Layered Normal Grid are not given as inputs to the MLP of our pixel-aligned implicit model.
When the concatenation of N_s + N_c is not used, we do not feed N_s and N_c as inputs to the stacked hourglass network of our pixel-aligned implicit model. Instead, we use front and back normal maps that are predicted by a separate normal predictor (not P) that is commonly used in existing works like PIFuHD, ICON, and ECON. The front and back normal maps are concatenated with the disentangled image (from D) before being fed as inputs to the stacked hourglass network. Again, we include these two normal maps in order to ensure fairness.
Hence, when both Multi-dimensional Layered Normal Grid and concatenation of N_s + N_c are not used, the only inputs to the pixel-aligned implicit model will be the disentangled image and a pair of normal maps (front and back).
Question 9: Thank you for your question. Fig. 8 is an ablation on D instead of P. When D is not used, the best we can do is to use an off-the-shelf instance segmentation model to segment out the pixels that belong to each human subject (same as what is done by ECON). Then, we separately feed the image of each segmented human subject into our P. As each of these images will be missing a number of pixels of the human subject, P will not be able to generate normal values at those missing pixels.
Question 10: Thank you for your feedback. Please refer to the last row of Fig. 26 in our revised manuscript.
Thank you to the authors for clarifying my questions and concerns. The minor concerns I had have been answered and my scores remain the same. This method shows a simple way to chain together diffusion models to reconstruct humans with varying degrees of occlusion. However, I feel that for it to work well in the wild, there has to be more diversity in the training data / a larger training set.
Thank you for your kind comments and feedback. While we demonstrated our in-the-wild performance in Figs. 10, 19, 20, and 26, we acknowledge that our model's in-the-wild performance could always be improved if we had more training data. Thus, in Sect. A.20 of our revised manuscript, we acknowledge this limitation and describe our future intention to build a large-scale dataset that directly addresses external and internal occlusion problems. Building such a dataset may be the only way to increase the diversity of training data for models that are involved in this task.
Weakness 1:
Indeed, as we wrote in L46-53, the work by Wang et al. assumes that the groundtruth SMPL-X mesh is given. This can be confirmed by reading the work by Wang et al. (the 1st paragraph of their "3. Method" section). However, in practical scenarios, the SMPL-X mesh must instead be predicted from the input image. If you refer to Figure 2 in their work, you will observe that the input image has occlusions (i.e. the left arm and both legs are missing). Given such an occluded input image, it is very unlikely that the SMPL-X mesh predicted from the input image can be reasonably accurate. Thus, the assumption that a perfect/groundtruth SMPL-X mesh is given is a very strong assumption that is unlikely to be practical. ECON likewise does not address the occlusions before trying to predict the SMPL-X mesh, so their results have the same problems (see either Fig. 1d, the 4th-5th rows of Fig. 10, or the 4th row of Fig. 20 in our manuscript).
Our Robust-PIFu's approach is different because we do not attempt to predict the SMPL-X mesh from an occluded input image. Instead, we remove the occlusions from the input image first. As shown in Fig. 3 of our manuscript and the 1st paragraph of our Sect. 3, we use the Disentangling Diffuser (D) to remove such occlusions from the input image (i.e. the occluded regions are generatively filled in). Then, once such occlusions are removed, predicting the SMPL-X mesh becomes a straightforward task. The difference can be observed by comparing our results and ECON's results in the 4th-5th rows of Fig. 10 or the 4th row of Fig. 20 in our manuscript.
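To make the ordering of the stages concrete, here is a minimal pseudocode sketch of the pipeline as described above; the function names and signatures are illustrative assumptions rather than the exact interfaces of Robust-PIFu.

```python
def reconstruct(input_image, selection_map, D, smplx_predictor, P, implicit_model):
    """Illustrative ordering of the stages described above (names are assumptions)."""
    # 1. Disentangling Diffuser (D): remove external occlusions and generatively
    #    fill in the occluded body parts of the target human subject.
    disentangled = D(input_image, selection_map)

    # 2. With occlusions removed, predicting the SMPL-X mesh becomes straightforward.
    smplx_mesh, smplx_normals = smplx_predictor(disentangled)

    # 3. Penetrating Diffuser (P): generate the 8 layered normal maps (N_c),
    #    conditioned on the disentangled image and the SMPL-X normal maps (N_s).
    layered_normals = P(disentangled, smplx_normals)

    # 4. Layered-Normals Pixel-aligned Implicit Model reconstructs the 3D mesh.
    return implicit_model(disentangled, smplx_normals, layered_normals)
```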
Weakness 2: Besides Fig. 10, we also provide Fig. 20 to show both single and multiple humans at different degrees of occlusion. In addition, we have revised our manuscript to add Fig. 26. If you require more results on this, please also refer to Fig. 26 of our revised manuscript.
Weakness 3:
We took a number of factors into account:
Firstly, we examined the 3D meshes given in various human datasets, such as BUFF, THuman2.0, and MultiHuman, before deciding on the design choice of using these specific 8 normal maps. Among the meshes that we examined, we found it rare that more than 8 normal maps would be required. Moreover, if we tried to increase the number of normal maps by increasing the number of layers, there would not be enough 3D meshes that require so many layers of normal maps, and thus we would not have adequate training examples to train our Penetrating Diffuser to produce reasonable outputs for the additional layers.
Secondly, increasing the number of normal maps will require additional sampling runs of the latent diffusion model, and this translates to a significant increase in inference time. On the other hand, reducing the number of normal maps will translate to less coverage and defeat the purpose of having our Penetrating Diffuser. Hence, we have to strike a balance between inference time and coverage.
Thirdly, in order to construct a detailed 360-degree view of a clothed human avatar, we find that, at a minimum, only 4 view directions (front, back, right, left) are required. Adding more view directions (and thus more normal maps) would lead to overlapping and replicated information. On the other hand, reducing the number of view directions to fewer than 4, as is done in ECON (only 2 view directions: front and back), results in poorly constructed right and left views (see the 1st row of Fig. 2d and the 1st row of Fig. 7d in our manuscript).
Question 1: Thank you for your feedback. We added a caption for Fig. 3 in our revised manuscript.
Question 2: Thank you for your feedback. We fixed the caption of Fig. 4 in our revised manuscript.
Question 3a: You are correct. We followed existing works on Single-View Clothed Human Digitalization (e.g. PIFu, PIFuHD, IntegratedPIFu, ICON, and more) and used a weak-perspective camera to generate the training images. As such, the human subjects that appear in the same image are assumed to be at the same depth. Consequently, when given an input image with human subjects at different depths, the furthest human subject (which appears smallest) will be assumed to have the smallest body size compared to the nearer human subjects. However, this can be fixed if we generate training images using a perspective camera with variable parameters, and then use these images to finetune our Disentangling Diffuser (D). The effect will be similar to the 3rd row of Fig. 11, where the camera view/parameters are automatically inferred and adjusted by D.
Question 3b: To add the external occlusions, we added random objects onto the input image, added random human subjects onto the input image, randomly adjusted the camera’s elevation, randomly varied the lighting during the rendering process, and randomly cropped the input image. In terms of data augmentation, we used the standard techniques that are used in PIFu, such as random flip, random blur, and random crop.
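As an illustration only, the occlusion synthesis described above could look roughly like the sketch below; the helper names and sampling choices are our assumptions, not the authors' exact pipeline.

```python
import random
from PIL import Image, ImageOps

def add_external_occlusions(subject_img, occluder_cutouts):
    """Illustrative occlusion synthesis: paste random RGBA cutouts (objects or extra
    human subjects) onto the rendered image, then apply a standard PIFu-style
    augmentation (random horizontal flip). Parameters are assumptions."""
    img = subject_img.copy()
    k = random.randint(0, min(2, len(occluder_cutouts)))
    for occluder in random.sample(occluder_cutouts, k=k):
        x = random.randint(0, max(0, img.width - occluder.width))
        y = random.randint(0, max(0, img.height - occluder.height))
        img.paste(occluder, (x, y), occluder)   # alpha channel acts as the paste mask
    if random.random() < 0.5:
        img = ImageOps.mirror(img)              # random horizontal flip
    return img
```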
Question 4: Our initial design was to have our Penetrating Diffuser generate all eight normal maps in a single forward run. We realized early on that this does not work well for a latent diffusion model. The reason is that there is no spatial relationship (i.e. the pixels do not align) between the front-back normal maps and the left-right normal maps. This causes a problem when we have to use an autoencoder to learn to encode the eight normal maps into a single latent encoding, because the learning becomes exceptionally challenging. Despite many attempts, we were unable to obtain a latent encoding that achieves low reconstruction error. Whenever we decoded the latent encoding back into the eight normal maps, we observed that the reconstructed front-back normal maps contained pixels that belong to the left-right normal maps, and the left-right normal maps contained pixels that belong to the front-back normal maps. It was clear to us that, during the encoding process, the front-back normal maps and left-right normal maps negatively interfere with each other. Thus, we decided to have the Penetrating Diffuser generate the front-back normal maps and left-right normal maps in two separate runs instead.
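A minimal sketch of this two-run, sequentially conditioned scheme is shown below; the `sample` interface and argument names are our illustrative assumptions, not the authors' actual API.

```python
def generate_layered_normals(P, disentangled_img, smplx_normals):
    """Illustrative two-run scheme: run 1 generates the four front/back layered
    normal maps; run 2 generates the four left/right maps, with run 1's outputs
    fed back as additional conditioning (sequential conditioning)."""
    front_back = P.sample(image=disentangled_img, smplx_normals=smplx_normals,
                          orientation="front_back")                         # 4 maps
    left_right = P.sample(image=disentangled_img, smplx_normals=smplx_normals,
                          orientation="left_right", prev_maps=front_back)   # 4 maps
    return front_back + left_right   # 8 layered normal maps in total
```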
Question 5: Thank you for your feedback; we added Section A.15 and Table 3 to our revised manuscript to include this. The inclusion of N_s was treated as a hyperparameter that we tuned during cross-validation. We observed that it was beneficial to include N_s rather than exclude it. We believe this is because N_s does not contain pixels that pertain to clothes or hair. The combination of N_s and N_c thus gives our pixel-aligned implicit model information about which pixels belong to clothes and hair. Clothes and hair in a 3D mesh are often thin rather than thick, and this information is helpful for guiding the pixel-aligned implicit model when constructing the 3D mesh.
Question 6: This is also a design that we tried early in our project, but it did not work out well. There seems to be negative interference among the 8 normal maps. We observe a significant improvement in the quality of the super-resolved normal maps when we super-resolve a single normal map at a time.
Question 7: Our Fig. 7 (Row 4) only showed the first human subject because only he is occluded in the input image. We revised our manuscript and added Fig. 25 in the Appendix. Please refer to Fig. 25 for the results on the second human subject.
This paper proposes Robust-PIFu, a novel approach designed to reconstruct 3D human models from a single image, especially under the occlusion setting. By harnessing the power of large-scale pre-trained latent diffusion models (a disentangling latent diffusion model and a penetrating latent diffusion model), Robust-PIFu effectively separates each human and inpaints the occluded areas of the human. Subsequently, a layered-normals pixel-aligned implicit model, coupled with an optional super-resolution mechanism, is employed to reconstruct the 3D clothed human model. Experimental results demonstrate that Robust-PIFu outperforms current state-of-the-art methods in both qualitative and quantitative evaluation.
Strengths
- Robust-PIFu provides a novel solution to the problem of external and internal occlusions in 3D human reconstruction from a single image by means of pretrained diffusion models.
- Extensive experiments prove the effectiveness of the proposed method.
Weaknesses
- The writing and the presentation of the paper need improving.
- Robust-PIFu seems to be primarily effective at reconstructing 3D humans from laboratory datasets. Its generalization experiments on real-image reconstruction scenarios are simple, as they lack more complex real-image reconstruction results.
- Lack of training details for the Disentangling Diffuser.
Questions
- I am curious about the reconstruction performance of the Disentangling Diffuser on real images when there are significant missing areas. Specifically, how well can it reconstruct the head portion if it is occluded?
- In the images presented in the paper, the characters are arranged side by side. If the characters were arranged vertically, such as one person standing and another crouching in front of them, would the disentangling strategy of Disentangling Diffuser still be effective?
Weakness 1:
Thank you for your feedback. We would really appreciate it if you could specify which parts were unclear to you. Our paper is structured as follows:
- Firstly, in Section 1, we introduce and explain the problem that we want to resolve. Terms and jargon are defined here. We explain the motivations behind our work and list our work's contributions.
- Next, in Section 2, we introduce related works and explain relevant concepts.
- Then, in Section 3, we first introduce our proposed model and its various components from a high-level point of view. Then, in Sections 3.1 - 3.4, we explain each of the components with more specific details.
- After that, in Section 4, we first describe the datasets that we used and the benchmark models that we compared against. Following that, we show our proposed model's qualitative and quantitative results. This is followed by ablation studies of the different components that form our proposed model.
- Finally, in Section 5, we describe our manuscript's limitations before ending the manuscript with a conclusion.
Weakness 2: Please see our response to your "Question 1" below.
Weakness 3: Besides the last two paragraphs in Section 3.1 of our manuscript, we also provide more information on the training details of the Disentangling Diffuser in Section "A.14 Implementation Details" of our Appendix.
Question 1: We understand that you want to see how our model performs on real images that are different from those already shown in our manuscript. Specifically, you asked for real images where the head of the human subject is occluded. Thus, we revised our manuscript and added Fig. 26 in the Appendix. Please refer to Fig. 26 of our revised manuscript. As shown in the figure, our model can provide reasonable outputs for these challenging scenarios as well.
Question 2: The last row of Fig. 26 in our revised manuscript shows how our Disentangling Diffuser will handle these situations. In addition, we also wrote more information on these situations in Section “A.3 How D deals with Vertically Occluded Humans” of our Appendix. In short, as long as the training images are prepared as required, the disentangling strategy of Disentangling Diffuser will still be effective. We could train our Disentangling Diffuser with images of vertically occluded humans and ask it to target the topmost human subject whenever pattern 1 (Human Subject Selection Map) is given as input, and the Disentangling Diffuser will learn to target the topmost human subject when the human subjects in the input image are perfectly aligned in a vertical line.
Thanks for the response. It has addressed some of my questions. However, I still have reservations about the disentangling diffuser's actual effectiveness. Furthermore, I’m curious whether the disentangling diffuser mentioned in the paper will be released. Can I directly apply it to in-the-wild images for human inpainting tasks? The authors are recommended to provide a few more evaluations on in-the-wild images.
I also hope the authors can enhance the quality of the illustrations in the article, making them nicer and more engaging.
Yes, we will be releasing our proposed model, which includes our disentangling diffuser. This is also mentioned in L35 of our manuscript.
Yes, you can directly apply it to in-the-wild images for occlusion removals.
Thank you for your feedback; we updated Fig. 26 and added more results on in-the-wild images.
We appreciate your feedback on making the illustrations in our manuscript nicer and more engaging. We plan to improve the graphics in our figures by changing the textbox style, reducing the number of words inside the figures, and enlarging the images used within the figures. Do let us know if this is not what you are looking for or if you have other suggestions pertaining to this.
This paper proposes a robust method for single-image human reconstruction by building upon the PIFu framework and addressing its limitations. The improvements focus on four main aspects. First, a diffusion model is employed to remove external occlusions from the image, ensuring a clearer view of the target human. Second, another diffusion model is used to predict multi-layer normal maps to mitigate the impact of self-occlusion. Third, the pixel-aligned implicit model is modified to use both the original image and the multi-layer normal maps as inputs for predicting 3D occupancy labels. Finally, an optional refinement diffusion model is introduced to enhance the resolution of the normal maps if finer details are required.
Strengths
- The proposed approach significantly improves the robustness of PIFu by addressing specific limitations, such as handling occlusions and introducing normal maps. The modifications are logical and well-justified, leading to more accurate reconstructions.
- The method demonstrates better performance on the BUFF and MultiHuman datasets compared to existing approaches, underscoring its effectiveness in general human reconstruction tasks.
Weaknesses
- The necessity of eight distinct normal maps (both outer and inner maps for each direction) in the Penetrating Diffuser is not fully intuitive. It is unclear why both outer and inner maps are required for each direction, as they may simply represent flipped orientations. Further explanation of the additional information provided by the inner normal maps would enhance understanding.
- The need to run the Penetrating Diffuser separately for the front-back and left-right orientations introduces significant computational overhead. The left-right normal map prediction is conditioned on the front-back prediction; however, simultaneous prediction of all eight maps could potentially enhance the conditioning across all pairs. This two-stage execution doubles the computational load, requiring two diffusion models and multiple sampling runs, which could be optimized.
- Clarification on the MLP in the Pixel-Aligned Implicit Model (Section 3.3): Additional clarification on the structure of the MLP in the Pixel-Aligned Implicit Model would be beneficial. Specifically, it is unclear if this MLP follows the structure of the original PIFu model, and whether all normal maps and features are concatenated directly as inputs to the network.
Questions
Please justify the design of the Penetrating Diffuser (see Weakness 1 and 2). Please justify why inner and outer normal maps are needed, and why separate models for front-back and left-right are needed.
Please also add more explanations to the architecture of the Pixel-Aligned Implicit Model (see Weakness 3).
Weakness 1: The necessity of the eight normal maps is illustrated in Fig. 23 of our manuscript under Section A.4 of the Appendix, but we agree that additional explanation will make it clearer. Thus, we revised Section A.4 of our manuscript and expanded its 2nd paragraph to provide additional explanation.
Weakness 2: The design choice that you suggest was originally considered by us. Indeed, the initial design for our Penetrating Diffuser was to generate all eight maps (i.e. both front-back and left-right orientations) at once. We realized early on that this does not work well for a latent diffusion model. The reason is that there is no spatial relationship (i.e. the pixels do not align) between the front-back normal maps and the left-right normal maps. This causes a problem when we have to use an autoencoder to learn to encode the eight normal maps into a single latent encoding, because the learning becomes exceptionally challenging. Despite many attempts, we were unable to obtain a latent encoding that achieves low reconstruction error. Whenever we decoded the latent encoding back into the eight normal maps, we observed that the reconstructed front-back normal maps contained pixels that belong to the left-right normal maps, and the left-right normal maps contained pixels that belong to the front-back normal maps. It was clear to us that, during the encoding process, the front-back normal maps and left-right normal maps negatively interfere with each other. Thus, we decided to have the Penetrating Diffuser generate the front-back normal maps and left-right normal maps in two separate runs instead.
Weakness 3: Thank you for your feedback; we revised the fourth paragraph of Section "A.14 Implementation Details" in our Appendix to clarify the structure of the MLP in our pixel-aligned implicit model. In short, our MLP follows the structure of the original PIFu, except that its input dimension is increased in order to allow the additional features from the eight normal maps to be used as inputs. Yes, these additional features are concatenated directly as inputs to the network.
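A simplified sketch of such an MLP is shown below; unlike the original PIFu MLP it omits the skip connections, and the feature dimensions are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class OccupancyMLP(nn.Module):
    """Illustrative pixel-aligned occupancy MLP with an enlarged input dimension so
    that image features, features from the eight layered normal maps, and the depth
    value can be concatenated as input. Dimensions are assumptions."""
    def __init__(self, img_feat_dim=256, normals_feat_dim=32, z_dim=1):
        super().__init__()
        in_dim = img_feat_dim + normals_feat_dim + z_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.1),
            nn.Linear(512, 256), nn.LeakyReLU(0.1),
            nn.Linear(256, 1), nn.Sigmoid(),     # inside/outside occupancy probability
        )

    def forward(self, img_feat, normals_feat, z):
        return self.mlp(torch.cat([img_feat, normals_feat, z], dim=-1))
```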
Question 1: Please see our response to your “Weakness 1” and “Weakness 2” above.
Question 2: Please see our response to your “Weakness 3” above.
Thanks for the responses. They have resolved my questions; for weakness 2 in particular, the explanation makes sense. For weakness 1, I understand you want to input more geometry information by considering more intersections between rays and the surface. I suggest, however, an ablation with and without the 'Inner' maps, to demonstrate how much this "two-layer intersection" can improve the results.
Thank you for your suggestion. We have added an ablation with and without our inner maps in Section A.21 and Fig. 30 of our revised manuscript.
This paper proposes a pixel-aligned implicit model for reconstructing 3D human models from a single image, especially under occlusion. By harnessing the power of large-scale pre-trained latent diffusion models (a disentangling latent diffusion model and a penetrating latent diffusion model), the proposed method effectively separates each human and inpaints the occluded areas of the human. A layered-normals pixel-aligned implicit model, coupled with an optional super-resolution mechanism, is then employed to reconstruct the 3D clothed human model. The proposed solution to the problem of external and internal occlusions in 3D human reconstruction from a single image by leveraging open-source diffusion models should be appreciated. Also, the proposed method significantly improves the robustness of PIFu by addressing specific limitations, such as handling occlusions and introducing normal maps. On the other hand, the reviewers raised concerns regarding unclear explanations of the proposed method, the lack of evaluation on more complex real images, and novelty. The authors provided additional experiments to address the concerns about the experiments and gave detailed explanations of the unclear parts of the method in the rebuttal. Differences between the proposed disentangling diffuser and Stable Diffusion inpainting, as well as differences between the penetrating diffuser and ICON, were detailed. The authors' rebuttal with the revised manuscript and the following discussion between the authors and the reviewers convinced three of the four reviewers. Despite the discussion, Reviewer 8tGb feels that writing issues still remain. The main points raised by Reviewer 8tGb are that the paper is not kind to readers who are not familiar with the topic of this paper and that the allocation of content between the main text and the appendix is not preferable. The AC thinks that if the writing quality were poor enough to prevent readers from understanding the work, the paper should be rejected even if the contribution is sufficient. However, this is not the case. With the current revised manuscript, readers will be able to understand the contributions of this paper. On balance, the contributions of this paper outweigh Reviewer 8tGb's remaining concerns. This paper should be accepted, accordingly. The AC suggests that the authors improve the paper for the broader community.
Additional Comments from Reviewer Discussion
See above.
Accept (Poster)