PaperHub
Score: 5.8 / 10 · Poster · 4 reviewers
Ratings: 7, 5, 7, 4 (min 4, max 7, std 1.3)
Confidence: 4.5 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.0
NeurIPS 2024

Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We train a state-of-the-art generalist human pose and shape estimation model that can localize any point of the human body.

Abstract

Keywords
3D human pose estimation · human shape estimation · computer vision · human mesh recovery

Reviews and Discussion

Review (Rating: 7)

This paper proposes Neural Localizer Field, a continuous field of point localizers, for localizing any point of the human body in 3D from a single RGB image. The method enables mixed-dataset training using various skeleton or mesh annotation formats. The method has three main parts: a point localizer network, a neural localizer field, and a body model fitting algorithm. Trained on a mix of datasets with different annotations, the model achieves good performance across various benchmarks.

Strengths

  • Novelty. The idea of utilizing a neural field to unite different data sources is interesting and novel. The paper does a good job in explaining the motivation as well as laying out the technical details.

  • Impressive performance on extensive benchmarks and experiments. The method enables training with multiple annotation sources, and it achieves better performance as compared to SoTA. Notably, it achieves good performance on shape prediction, which is a hard problem in human mesh recovery due to the lack of training data with shape annotations.

Weaknesses

  • Lack of discussion on inference speed. Since this method requires on-the-fly point inference, I wonder what the inference speed is and whether this method is suitable for real-time inference.

Questions

How does the model do on small details such as fingers?

Limitations

Limitations (lack of temporal cues) are explained.

Author Response

We thank Reviewer R9xH (R4) for the assessment and questions. R4 considers the idea "interesting and novel" and finds that we do "a good job in explaining" the motivation and technical details. R4 further sees the performance as "impressive" on "extensive benchmarks and experiments".

Lack of discussion on inference speed. [...] whether this method is suitable for real-time inference.

The method is suitable for real-time inference. NLF-S has a batched throughput of 410 fps and unbatched throughput of 79 fps on an Nvidia RTX 3090 GPU. For NLF-L these are 109 fps and 41 fps respectively. (Bounding box detection needs to be performed on top of this, but fast off-the-shelf detectors are readily available.)

Note that NLF’s inference-time overhead (for predicting weights through the field MLP) can be eliminated by precomputing the weights once for a chosen set of canonical points. (Typically one wants to localize the same points, i.e. same skeleton formats, for many images.) For reference, in case of NLF-S with no image batching, and about 8000 points to be predicted (mesh vertices and skeletons), forwarding the field MLP to obtain convolution weights takes 7.7 ms, while the rest of the network including the backbone takes 12.7 ms. For NLF-L with batch size 64 the latter takes 587 ms, making the MLP cost negligible in comparison even if we do not precompute it. We will add this information to the paper.
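For illustration, here is a minimal PyTorch sketch of this amortization pattern (not the released code; module names and sizes are toy placeholders): the field MLP is evaluated once for a fixed set of canonical query points, and only the backbone plus a dynamic 1x1 convolution remain as per-image cost.

```python
import torch
import torch.nn as nn

FEAT, HEAT, POINTS = 64, 8, 128  # backbone channels, heatmap channels per point, #points (toy)

field_mlp = nn.Sequential(               # canonical (x, y, z) -> per-point conv weights
    nn.Linear(3, 256), nn.GELU(), nn.Linear(256, FEAT * HEAT))
backbone = nn.Conv2d(3, FEAT, 3, padding=1)  # stand-in for the real backbone

canonical_points = torch.rand(POINTS, 3)     # chosen skeleton/mesh query points

with torch.no_grad():                        # one-time cost, reused for every image
    w = field_mlp(canonical_points).view(POINTS, HEAT, FEAT)

def localize(images):
    feats = backbone(images)                 # (B, FEAT, H, W)
    flat = feats.flatten(2)                  # (B, FEAT, H*W)
    maps = torch.einsum('phf,bfn->bphn', w, flat)   # dynamic 1x1 conv per point
    return maps.reshape(images.shape[0], POINTS, HEAT, *feats.shape[-2:])

print(localize(torch.rand(2, 3, 64, 64)).shape)     # (2, 128, 8, 64, 64)
```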

How does the model do on small details such as fingers?

For this, we refer to our AGORA results on hand keypoints (Table 5, RHan and LHan columns), where we achieve second-best results. However, given our focus on body pose and shape, most of our training datasets do not contain detailed finger annotations, and hence we consider the second-best results obtained for this subset of keypoints to still be a strong result (see also our answer to R3).

Review (Rating: 5)

The authors propose a Neural Localizer Field (NLF) to learn a continuous representation of the canonical human pose by learning to predict a set of functions that map a query point in the canonical human volume to a point in the posed human space, given a single RGB image. By introducing a meta-learning architecture, they are able to train on diverse datasets with different annotations in both 2D and 3D. The authors claim that this scaling to a large number of datasets allows for a better pose predictor than prior work and show relevant results.

Strengths

  • Clear insight/idea: The insight is simple, clearly explained and well motivated. Having a single architecture that ingests all sorts of human pose and shape annotations would certainly benefit from the diversity if handled correctly during training. Although the current architecture might not be the best design choice, the paper does show that a simple architecture + large datasets boosts metrics.

  • Impressive results: The quantitative metrics in Tables 2-6 are quite impressive and are better on most comparison axes. The shape estimation results are also convincing and show benefits from better pose prediction.

  • Although not trained for temporal stability, the method does show some temporal stability in the supp. video.

Weaknesses

  • Better data inspection: The core contribution is that a simple architecture + more data gives better results. Since data is the main focus here, a thorough ablation on the data sources is missing. It's not clear if the performance is derived from just a few data sources or all of them, i.e. how each dataset affects results. Without this understanding, it's hard to argue that more diverse data improves metrics, since a few datasets might account for most of the quality impact.

  • Extent of generalizability: The usual suspects for human pose & shape estimation failures/limitations are loose clothing, occluded views, and unusual poses. It would be nice to see how the method works on such cases and whether it generalizes well to them. Additionally, points that are often not annotated in pose estimation datasets might be prone to failure. The lower accuracy for face and hands in Table 5 makes me believe this could be the case. It would be great if the authors could comment on the performance of the method in such cases.

  • 3D Loss weighting: Given the 3D loss for 3D datasets, the network would have to account for the different dataset scales which might be widely off. This can affect training and test time results.

Questions

  • It's not clear what the canonical space representation is. Since the query points share the same domain across multiple datasets, I presume all of them are sampled from a single canonical space. But Fig. 4, column 1, shows query points that are defined with respect to each dataset.

  • Corresponding to the previous comment, how are 3D losses weighted across datasets such that scale is handled appropriately?

Limitations

Yes, authors have adequately addressed limitations and societal impact of the work.

Author Response

We thank Reviewer YzeS (R3) for the review and questions. R3 considers that our "insight is simple, clearly explained and well motivated", and finds the quantitative metrics "impressive" and "convincing".

It's not clear if the performance is derived from just a few data sources or all of them, i.e. how each dataset affects results.

While an extensive ablation for assessing each individual dataset's contribution is computationally not feasible, we provide ablations for using only synthetic or only real 3D data. Please refer to the global answer regarding this.

Extent of generalizability: Usual suspects for human pose&shape estimation failures/limitations are loose clothing, occluded views, unique poses. It would be nice to see how the method works on such cases and if the method generalizes well to such cases.

We include further qualitative examples in the attached PDF that cover such cases. (Such challenging examples are rare in quantitative benchmarks, hence the qualitative examples.)

lower accuracy for face and hands in Table 5

Although our focus throughout the paper is mainly on body pose and body shape, we still achieve second-best scores for hands and faces in Table 5 (AGORA benchmark). It is true that many of the training datasets do not provide detailed annotations for hands and faces, and we indeed attribute our lack of SOTA results on hands and faces to this property of the training data.

different dataset scales which might be widely off

We adjust for different dataset scales (i.e. different number of training examples in each) by sampling more training examples from larger datasets. We did not tune these sampling-proportion hyperparameters, as this can lead to combinatorial explosion and would be very resource-intensive.
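For concreteness, a minimal sketch of such size-proportional sampling (our own illustration; the dataset names and sizes below are made up, not the actual training mix):

```python
import numpy as np

dataset_sizes = {"synthetic_a": 500_000, "real_3d_b": 120_000, "coco_2d": 60_000}  # placeholders
names = list(dataset_sizes)
probs = np.array([dataset_sizes[n] for n in names], dtype=float)
probs /= probs.sum()                       # sampling probability proportional to dataset size

rng = np.random.default_rng(0)
batch_sources = rng.choice(names, size=64, p=probs)   # which dataset fills each batch slot
print(dict(zip(*np.unique(batch_sources, return_counts=True))))
```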

Its not clear what the canonical space representation is.

The canonical space is defined in reference to a T-like pose of the default SMPL mesh. All other keypoints (as e.g. shown in the referenced Fig. 4 column 1) are represented in this same coordinate system, as points within this canonical human volume. Again, no explicit conversion between formats is necessary and the exact locations of points in the canonical volume are tuned automatically during training. We will include this important point in the paper, and we thank R3 for calling attention to this omission.

Comment

Thank you for the clarifications. The qualitative results for the hard cases show the strength of the approach.

Review (Rating: 7)

This paper deals with the task of 3d human pose estimation. It contains three main contributions:

  1. A hypernetwork that takes as input a point in a 3d body volume (in a canonical pose) and outputs the weights of a network (a single layer, really) that, when applied to the features of a vision backbone, is able to localize said 3d point in R^3 given an image (plus a 2d point and 2d uncertainty).
  2. An application of this approach to train on multiple datasets with SMPL and SMPL-like annotations, 3d, and 2d annotations.
  3. An algorithm to fit SMPL parameters given joints and vertices.

These approaches combined (plus a series of engineering tricks, such as creating a synthetic dataset and treating some annotations as themselves learnable) result in a network that yields state-of-the-art results on several 3d pose estimation benchmarks.

Strengths

Originality

Using a hypernetwork to predict arbitrary points in a human volume is a novel idea. Putting together a super-dataset for this task is very nice and novel as well, as far as I am aware.

Quality

The results, whether they come from a novel architecture or a novel super-dataset, are strong across the board.

Significance

Regardless of the soundness of the contributions, the fact that the paper promises to make all the contributions easily reproducible is a big plus. The field could really benefit from a way of sourcing multiple datasets together, and I can see multiple people building upon the ideas presented here if everything is released in decent shape.

Weaknesses

Soundness

The main weakness of this paper is the lack of experiments that independently test the importance of each of the contributions. The paper proposes two main ideas: a hypernetwork for 3d human modelling, and a superset of datasets used to train this system; the former being primarily a methodological contribution, and the latter being primarily an engineering contribution. Unfortunately, there is no experiment or ablation distilling the importance of each contribution. Concretely, this could be achieved by, for example

  • Training the novel architecture on a single dataset
  • Training the novel architecture on a subset of the compiled datasets (eg, on the datasets with SMPL annotations), or
  • Training a baseline architecture on the superset of datasets (or a subset thereof, such as the ones with SMPL annotations)

These results would help the readers understand whether and to what extent the access to more data or the novel architecture make a difference in the SOTA results reported. As is, this crucial question remains unfortunately unanswered, and takes away from what would otherwise be a very, very strong paper.

I think these experiments are extra important because the paper is implicitly making a very bold and counterintuitive claim: that by posing the task of 3d human pose estimation as 3d registration (a more complicated task), it is possible to achieve better 3d poses than SOTA. Furthermore, this is achieved by exploiting data that is not annotated for 3d registration; this is very counterintuitive and, in my opinion, likely to be untrue. Therefore, I am inclined to think that it is the extra data that helps the most towards the strong results.

Clarity

In my opinion, the treatment of the "localizer field" is overly convoluted. While yes, it is true that the localizer field technically defines a neural field of functions, the paper makes it sound like this is a very new idea (L163-164 "Although neural fields are typically used to predict points or vectors, here we use them to predict localizer functions"). This is not the case; at the end of the day this is a hypernetwork, which has been a staple of work in human modelling for a long time (eg [a, b]). The authors seem to be aware of this connection, since the paper mentions that the localizer field "modulates" (L731) the convolutional layer of the point localization network, which is the terminology used in [a] for hypernets. I believe S3.2 could benefit from rewriting to make this part clearer and more in line with previous notation and descriptions.

Re: Efficient body model fitting. The method is described as really fast, compared to the official code, which is said to take 33 minutes and achieve a slightly lower error. Most optimization methods have exponential error decreases, so it is not uncommon to see exponentially longer times for slightly lower errors. I think it would be clearer to plot the error as a function of time for both the official and new methods.

Re: Using 2d and 3d annotations. I am unable to understand how datasets annotated with only 3d poses are used to supervise an approach to volumetric registration -- the description in the paper is very terse (1 line). Is this done by fitting SMPL to the 3d points and obtaining an approximate place in the human volume? If so, it seems like training with these fitted SMPL meshes would be another baseline worth trying; ie, bring all the datasets to SMPL, then train on it. This would further disambiguate whether the architecture or the use of extra data is the main contribution.

[a] Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR'19
[b] Chen et al., Authentic Volumetric Avatars from a Phone Scan, SIGGRAPH'22

Questions

  1. Could the authors elaborate on how datasets with 2d and 3d annotations are used for training? How does the "approximate initialization" work? Is this some approximate initialization to 3d registration (via SMPL fitting)?

  2. The supplementary material discusses the creation of a large synthetic dataset using SMPL fittings of the DFAUST dataset, which is not mentioned in the abstract or the paper -- how important is this for the overall results?

  3. What is the dimensionality of the volumetric heatmap? Is this depth defined over the entire scene or only over the depth of the human body? If so, is the range of the function over the entire R^3 in the human body, or a discretized subset?

  4. Why does the architecture predict a 2d and a 3d heatmap? Is it possible for the 2d heatmap to disagree with the projection of the 3d prediction?

  5. The last two layers of Fig 6 show FC layers going from 1024 to 384 dimensions, and later going from 1024 to 384 again. Is this a typo? If so, what does the actual architecture look like?

  6. The paper uses the number 384 several times in seemingly unrelated areas

  • The size of the images used in the larger network variant
  • The number of channels predicted by the localization field (or is it both the size of the input plus output?)
  • The number of points sampled from the interior of the human volume

Is this a coincidence?

  7. What is the time it takes the official SMPL fitting code to achieve an error comparable to the one achieved by the proposed method?

Limitations

Limitations are addressed adequately.

Author Response

We are glad that R1 sees our idea as "novel", further acknowledging the novel aspect of putting together such a "super-dataset". R1 further praises the model quality as "strong across the board" and foresees significant community impact.

Importance of the hypernetwork and datasets: see the global answer.

Complicated explanation of localizer field, simply a hypernetwork: our description emphasizes the connection to neural fields, given the similar 3D-spatial input and the use of positional encodings. However, we agree that the hypernetwork view is also important. While we do mention this connection and cite the HyperNetworks paper (Ha et al., 2017), we will extend the manuscript with further connections to other relevant uses of hypernetworks.

how datasets with 2d and 3d annotations are used for training? How does the "approximate initialization" work? Is this done by fitting SMPL to the 3d points[...]?

We will make our existing explanation around L197-214 clearer. Importantly, our formulation allows us to sidestep the generally ill-posed problem of converting annotations to a single format, such as SMPL. Each dataset with 3D joint annotations can have different skeleton formats, i.e., the joints may designate different anatomical locations, e.g. Human3.6M's shoulder point is not the same as the shoulder point of the MPI-INF-3DHP dataset. We directly train the network with points that are annotated for a training example, i.e. we query those points for which we have annotations, so we can compute and minimize the average loss for those points. For this, we need to know where to query the field, e.g. where the Human3.6M shoulder point is in the canonical space. (For SMPL this is given, since the canonical space is derived from a SMPL mesh of mean shape and a T-like pose). Training can be started with an approximate placement of e.g. the Human3.6M shoulder point in the canonical human volume in the general shoulder area, and we let the gradient-based optimization update all parameters jointly, including the backbone parameters, the neural field parameters and the 3D query location of each skeletal format in the canonical space. Datasets with 2D point annotations are treated similarly, except that the prediction is projected before computing the loss in 2D.
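As a rough illustration of this training scheme (our own simplification, not the authors' code; the localizer interface, joint count, and initialization values are assumed), the canonical query points of a skeleton format can be held as learnable parameters, and each example contributes a loss only on its annotated points:

```python
import torch
import torch.nn as nn

NUM_H36M_JOINTS = 17

# Rough initial placement of the Human3.6M joints in the canonical volume;
# these query locations are themselves optimized jointly with the network.
h36m_canonical = nn.Parameter(torch.rand(NUM_H36M_JOINTS, 3))

def training_loss(localizer, image, gt_3d, valid_mask):
    """localizer(image, canonical_points) -> predicted 3D points; its interface is assumed.
    For 2D-annotated data the prediction would be projected before computing the loss."""
    pred = localizer(image, h36m_canonical)            # (J, 3)
    err = (pred - gt_3d).norm(dim=-1)                  # per-joint Euclidean error
    return (err * valid_mask).sum() / valid_mask.sum().clamp(min=1)

# Toy stand-in for the real network, just to make the sketch runnable:
dummy_localizer = lambda img, pts: pts * img.mean()
loss = training_loss(dummy_localizer, torch.rand(3, 64, 64),
                     torch.rand(NUM_H36M_JOINTS, 3), torch.ones(NUM_H36M_JOINTS))
loss.backward()                                        # gradients also reach h36m_canonical
print(h36m_canonical.grad.abs().sum() > 0)
```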

bold and counterintuitive claim: that by posing the task of 3d human pose estimation as 3d registration (a more complicated task), it is possible to achieve better 3d poses than SOTA. Furthermore, this is achieved by exploiting data that is not annotated for 3d registration

We do not make such a "bold claim" in the paper; instead, our claims are as follows. We are able to jointly train on heterogeneous data sources, some annotated for 3D/2D skeleton pose estimation with different formats as well as some for 3D human mesh recovery (also different ones: SMPL/SMPLX/SMPLH, male/female/neutral), obtaining a single model that achieves SOTA on both kinds of tasks. Our work enables this common treatment by designing a generalist model that can estimate any arbitrarily chosen points, and then by casting each task as a point localization task (with a different point set) that is simple to tackle with the proposed generic point localizer. The end goal is to obtain a strong model for body pose and shape estimation that is test-time configurable for user-chosen skeleton and mesh formats. Our model is designed to make spatially smooth predictions w.r.t. the selected points, resulting in consistent predictions across the different output formats.

The supplementary material discusses the creation of a large synthetic dataset using SMPL fittings of the DFAUST dataset, which is not mentioned in the abstract or the paper -- how important is this for the overall results?

Individually ablating the effect of each dataset is computationally infeasible. However, we provide experimental results with using only synthetic or only real 3D-annotated training examples, see the global answer.

dimensionality of the volumetric heatmap? Is this depth defined over the entire scene or only over the depth of the human body? Why does the architecture predict a 2d and a 3d heatmap? Is it possible for the 2d heatmap to disagree with the projection of the 3d prediction?

The heatmap does not cover the entire scene but a cube of side length 2.2 meters around the human. The 2D and 3D heatmaps are used in order to estimate the human scale and distance from the camera. They could theoretically disagree, but we did not observe such a problem in practice - training them together results in compatible predictions.
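A hedged sketch of how a soft-argmax over such a volumetric heatmap can yield a continuous metric location inside the 2.2 m cube (our own illustration; the actual heatmap resolution and parametrization are not specified here):

```python
import torch

CUBE_SIDE = 2.2   # meters, as stated above
D = H = W = 8     # toy heatmap resolution

def soft_argmax_metric(heatmap):                        # heatmap: (D, H, W) logits
    prob = heatmap.flatten().softmax(0).view(D, H, W)
    # Normalized grid coordinates in [-0.5, 0.5] along depth, vertical, horizontal axes.
    zs, ys, xs = [torch.linspace(-0.5, 0.5, s) for s in (D, H, W)]
    z = (prob.sum((1, 2)) * zs).sum()                   # expectation along each axis
    y = (prob.sum((0, 2)) * ys).sum()
    x = (prob.sum((0, 1)) * xs).sum()
    return torch.stack([x, y, z]) * CUBE_SIDE           # meters relative to the cube center

print(soft_argmax_metric(torch.randn(D, H, W)))
```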

The last two layers of Fig 6 show FC layers going from 1024 to 384 dimensions, and later going from 1024 to 384 again. Is this a typo? If so, what does the actual architecture look like?

The architecture was designed like this to express the Global Point Signature-based positional encoding, which is inspired by [72] in 2D surface modeling. The 1024-dimensional vector is hence initially trained to output the Global Point Signature. As explained in the paper from L268, we found it best to finetune the full MLP after this initialization. Without this special pretraining, the two linear layers could indeed be replaced by a single layer without losing expressivity.
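A rough sketch of this two-stage scheme as we read it (layer sizes, training loop, and targets are assumptions, not the paper's exact recipe): the 1024-d intermediate output is first regressed against precomputed GPS targets, then the full MLP, including the final 1024-to-384 layer, is finetuned with the task loss.

```python
import torch
import torch.nn as nn

mlp_trunk = nn.Sequential(nn.Linear(3, 512), nn.GELU(), nn.Linear(512, 1024))
final = nn.Linear(1024, 384)

points = torch.rand(256, 3)            # canonical points (toy)
gps_targets = torch.rand(256, 1024)    # precomputed GPS encodings (stand-in values)

# Stage 1: pretrain the trunk to reproduce the GPS encoding.
opt = torch.optim.Adam(mlp_trunk.parameters(), lr=1e-3)
for _ in range(10):
    opt.zero_grad()
    loss = (mlp_trunk(points) - gps_targets).pow(2).mean()
    loss.backward()
    opt.step()

# Stage 2: finetune trunk + final layer jointly with the task loss (omitted here).
full_field_mlp = nn.Sequential(mlp_trunk, final)
print(full_field_mlp(points).shape)    # (256, 384) positional encodings
```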

The paper uses the number 384 several times in seemingly unrelated areas

There is no special connection there; the reason is simply that 384 is the sum of two high powers of 2 (256+128) and hence a "round number" in binary. Tensor sizes divisible by high powers of two are often more convenient and hardware-efficient in practice.

What is the time it takes the official SMPL fitting code to achieve an error comparable to the one achieved by the proposed method?

Please refer to Table 2 in the PDF. The official code runs for about 7 minutes (for all 33 samples included with the code) achieving an average error of 8.0 mm, while our method achieves 7.8 mm in just 13 milliseconds.

Comment

Thanks for clarifications and discussion on my questions.

Training can be started with an approximate placement of e.g. the Human3.6M shoulder point in the canonical human volume in the general shoulder area, and we let the gradient-based optimization update all parameters jointly

My question is about how exactly this initialization is done, and this paragraph does not provide an answer. When the authors say that training "can be started with an approximate placement" on a canonical human volume, how exactly is this done? Is it manual? automatic? via optimization? I am more interested to hear how the initialization was done in this paper, rather than how it can be done in the abstract.

Comment

We are glad to provide the precise details for this part. We trained a model for predicting the separate skeleton formats (similar to the new baseline architecture in our rebuttal, but only for sparse keypoints, not vertices). We then ran inference with this predictor on the SURREAL dataset, learned linear regressors to interpolate from the SURREAL GT vertices to the predicted keypoints, and applied these regressors to the canonical template to obtain the approximate initialization. We will make sure to also include these details in the final version.
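A hedged reconstruction of this initialization as a least-squares problem (our own sketch with toy sizes and stand-in tensors; SMPL actually has 6890 vertices):

```python
import torch

N_SAMPLES, N_VERTS, N_JOINTS = 200, 400, 17   # toy sizes; the real template has 6890 vertices

gt_vertices = torch.rand(N_SAMPLES, N_VERTS, 3)        # SURREAL ground truth (stand-in)
pred_keypoints = torch.rand(N_SAMPLES, N_JOINTS, 3)    # predictions of the keypoint model
canonical_template = torch.rand(N_VERTS, 3)            # T-posed mean-shape template vertices

# Solve per-joint vertex weights W (N_VERTS x N_JOINTS) with least squares:
# gt_vertices combined with W should reproduce pred_keypoints, stacking samples and xyz as rows.
A = gt_vertices.permute(0, 2, 1).reshape(-1, N_VERTS)     # (N_SAMPLES*3, N_VERTS)
B = pred_keypoints.permute(0, 2, 1).reshape(-1, N_JOINTS) # (N_SAMPLES*3, N_JOINTS)
W = torch.linalg.lstsq(A, B).solution                     # (N_VERTS, N_JOINTS)

# Apply the learned regressor to the canonical template to get initial query points.
init_canonical_joints = (canonical_template.T @ W).T      # (N_JOINTS, 3)
print(init_canonical_joints.shape)
```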

Comment

Got it. SURREAL uses SMPL for GT, so I guess once you have the regressors you can obtain an approximate landing on this canonical space (please correct me if I'm wrong).

Thanks for clarifying! This is the main part I couldn't quite figure out, and sounds like a clever use of synthetic data for initialization -- you should definitely put this in the paper IMO.

Congratulations again on your very strong work.

Comment

Yes, that's correct, and we will include this explanation. Thank you for the kind words.

Review (Rating: 4)

This paper focuses on 3D human pose and shape estimation from a single RGB image. The main insight is that, to avoid the problem that different human pose datasets define different skeletons in their annotations, a point-based representation allows the model to learn from many datasets without suffering from the skeleton misalignment between existing datasets. The idea follows the mechanism of dynamic convolution: a point (inside the 3D human body) in canonical coordinates is encoded as the weights of a dynamic convolution, which converts the image features into a heatmap that estimates the 3D position of the point on the target 3D human body mesh. The proposed model is trained with nearly 50 datasets with different annotations, including SMPL parameters, 3D/2D keypoints, DensePose, etc. The authors then compare the performance on multiple benchmarks.

Strengths

  1. Extensive ablation studies exploring the effects of many important settings, e.g. different ways of encoding the canonical position and uncertainty estimation. These results would be helpful for HMR developers to make better decisions on model design.
  2. Great qualitative results on Internet videos; especially the 2D alignment seems quite good.
  3. The video demo of sampling random canonical points demonstrates the effectiveness of the positional encoding.

Weaknesses

  1. The quantitative comparison is not fair and cannot verify the superiority of the proposed point-based representation over previous ones. The model is trained with nearly 50 datasets, while none of the compared methods use the same experimental settings. Without fair experimental settings, readers can't tell whether the proposed point-based representation helps or not. See the following questions section for details.
  2. Some typos. For example, a period is missing between "process" and "I" at L#213. At L#215, "We use EfficientNetV2-S (256 px) and EfficientNetV2-L (384 px) [98]".

Questions

  1. About fair comparisons. If we use the same training dataset but remove the point-based representation, will the results be similar? To what extent does this new point-based representation help? However, the current paper does not answer this fundamental question very well.

In the rebuttal, this question does not get answered well. The concern about what the single solid technical contribution of this paper is still stands.

Limitations

No.

Author Response

We thank Reviewer M7cf (R2) for the suggestions. R2 notes that we performed "extensive ablation studies", which are seen as "helpful" for future model design decisions, and further praises our strong qualitative results with good pixel alignment.

If we use the same training dataset but remove the point-based representation, will the results be similar? To what extent does this new point-based representation help?

Please refer to the global response regarding the role of the neural-field-based point querying.

Global Author Response

We thank all reviewers for their thoughtful suggestions and questions. Their assessments are unanimously on the positive side, recommending acceptance. R1 (Dcsh) sees both our architectural idea and our extensive dataset combination as "novel" and foresees significant community impact. R3 (YzeS) finds that our "insight is simple, clearly explained and well motivated". R4 (R9xH) considers our idea "interesting and novel" and finds that we do "a good job in explaining" the motivation and technical details. R2 (M7cf) highlights our "extensive ablation studies" as "helpful" and R4 commends the "extensive benchmarks and experiments". All four reviewers emphasize our results that are "strong across the board" (R1), have accurate pixel alignment (R2), are "impressive" (R3, R4) and "convincing" (R3).

A common question by reviewers is about separately analyzing the contribution of our novel architecture and the effect of data scale. First, we emphasize that the two are linked: thanks to our architectural contribution, we can train from multiple datasets that are annotated with different pose and skeleton formats. This would otherwise require tedious and difficult conversions – for example, it is not well-defined to convert a sparse skeleton to the full body pose of SMPL, as some degrees of freedom, such as shape and axial arm rotation, are missing, and differences in the skeleton definitions and joint placements can introduce further problems. It is even less well-defined if there can be any number of missing joints in each training example, which is typically the case for datasets that are triangulated from multi-view 2D predictions.

Nevertheless, we include additional experimental results obtained with newly trained ablation models to demonstrate that both our architectural contribution and the data scale are important.

The architectural contribution (i.e., localizer functions encoded as a neural field / hypernetwork) is ablated by training a baseline model where separate, explicit convolutional weights are learned for localizing every skeletal point and every SMPL vertex, instead of predicting these weights via an MLP. As shown in Fig. 1 (note that the 3D views show a rotated side view), the different skeleton formats in the baseline prediction are visibly inconsistent with each other and with the SMPL mesh; see e.g. in the bottom example how the green H36M arm is outside the SMPL body for the baseline. This is because the weights for localizing each point have no enforced relation to each other in the baseline. This also results in scattered and disorganized vertex predictions (see e.g. the hand region). NLF, by contrast, ensures that the different skeletons are localized consistently with each other and with the mesh, and the predicted mesh is spatially smooth and less scattered. (Note that these aspects are not straightforward to measure quantitatively on individual benchmarks.) Note also that the baseline architecture requires one to pre-determine and fix the number and definition of the points before training; they cannot be changed at runtime by the user. Furthermore, increasing the number of points that the baseline network can predict requires linearly scaling the number of network parameters in the prediction head. In contrast, NLF allows choosing arbitrary points at test time, and its number of parameters is independent of how many different points we want to be able to localize.
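To illustrate the parameter-scaling argument in the paragraph above (our own toy sketch, not the actual model sizes): the baseline head stores explicit weights for a fixed, predetermined point set, so it grows linearly with the number of points, whereas the field MLP's size is independent of how many points are queried.

```python
import torch.nn as nn

FEAT, HEAT = 64, 8   # toy backbone-channel and heatmap-channel counts

def baseline_head(num_points):                 # fixed point set, weights stored explicitly
    return nn.Conv2d(FEAT, num_points * HEAT, kernel_size=1)

field_mlp = nn.Sequential(                     # point -> weights; size independent of #points
    nn.Linear(3, 256), nn.GELU(), nn.Linear(256, FEAT * HEAT))

count = lambda m: sum(p.numel() for p in m.parameters())
for n in (24, 8000):
    print(f"baseline head, {n} points: {count(baseline_head(n)):,} params")
print(f"field MLP (any number of points): {count(field_mlp):,} params")
```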

The data contribution is ablated by training our novel architecture also on subsets of the 3D datasets - first only on the synthetic datasets (computer graphics renderings) then only on the real ones (photos). (The 2D-annotated real datasets are always used). As Tab. 2 shows, the best results are achieved when combining these data sources. (All results in Tab. 2 are obtained with the small model and a shorter training of 100k steps compared to the 300k used in the main paper, due to rebuttal time constraints.)

We answer the further, individual questions as a reply to each review.

Final Decision

The paper was championed for acceptance. One condition that was explicitly mentioned is the release of the codebase, especially if it is easy to use and enables the use of multiple datasets for 3d human pose estimation as a single unit. This was recognized as a limitation that is holding back progress in the area, and this paper has the potential to change that.

The main concerns remaining were soundness and presentation. In particular, it seems that all pre-processing steps require some sort of explicit mapping to SMPL. The reviewer argued that this step may be inevitable today, and this paper does the next best thing by using synthetic data to bridge that gap.