RAPTR: Radar-based 3D Pose Estimation using Transformer
Estimate 3D human poses from multi-view radar data using 2D image-plane keypoints and 3D BBox labels, rather than more expensive 3D keypoint labels.
Abstract
Reviews and Discussion
This paper proposes an advanced 3D human pose estimation method using radar signals and Transformers. Annotations are provided in the form of bounding boxes and 2D keypoints on images, offering significantly better scalability compared to conventional 3D joint coordinate annotations. The proposed Transformer model, RAPTR, consists of an encoder that applies deformable cross-attention to multi-view radar signals and a pose/joint decoder that performs similar attention operations in a pseudo-3D space. A custom loss function is also introduced to accommodate both 2D pose and 3D bounding box annotations. Extensive experiments on two datasets demonstrate that the method significantly outperforms existing approaches.
Strengths and Weaknesses
Strengths:
- By using a scalable annotation scheme with 3D bounding boxes and 2D joint positions, the method achieves very high accuracy and demonstrates strong practical applicability in real-world scenarios. It also makes a valuable contribution toward enabling future large-scale dataset development.
- As shown in Table 1, the proposed annotation setting clearly improves accuracy compared to existing methods.
- The results in Figure 7 show that the proposed loss function is highly effective, confirming its strong suitability for scalable annotation settings.
- The paper is well-written, and the figures are very clear and easy to understand. The methodology is also clearly explained.
Weaknesses:
- The technical contributions of the paper mainly consist of the architecture and the loss function. However, the effectiveness of the proposed architecture remains unclear. A close look at the ablation study suggests that the performance gains are primarily due to the two-stage structure of the pose decoder and joint decoder. As shown in Figure 9 of the supplementary material, the proposed decoder using pseudo 3D space essentially merges the offset computation branches of the deformable attention used in the existing method QRFPose. This architectural change appears to be incremental. Although the author claims that the proposed method is simpler than the existing QRFPose, a comparison regarding how much the switch from QRFPose to RAPTR improves model size (memory usage) or inference time is insufficient.
Although adding pseudo-3D deformable attention slightly worsens performance, the increase in error is only about 1 cm, and the change mainly serves to simplify the attention mechanism for two-view radar input.
- To fairly demonstrate the effectiveness of the architecture, it would be necessary to apply the {Pose + Joint} decoder structure to QRFPose and evaluate the resulting performance.
- (Minor) Regarding the proposed Template Loss, the paper does not describe how the template pose is calculated for each subject. I understand that global translation is computed using 3D gravity centers, but the paper does not clearly specify how the template keypoint at the coordinate origin is calculated.
Questions
A comparison that clearly demonstrates the effectiveness of the proposed architecture, which I pointed out as a weakness, is highly desirable. Additionally, if my understanding of the template loss is incorrect, I would appreciate it if the authors could clarify or correct it.
Limitations
yes
Justification for Final Rating
The authors have adequately addressed my concerns by demonstrating the effectiveness of pseudo-3D attention in terms of both accuracy and computational complexity. They also showed that their scalable training strategy, based on 3D bounding boxes and a template loss, achieves only minimal performance degradation compared to training with full 3D keypoint ground truth. Based on these clarifications and improvements, I have raised my score and now consider the paper a borderline accept.
Formatting Issues
No
Summary & Strengths
Response
We thank the reviewer for recognizing our motivation to use weak supervision for 3D radar perception—an unexplored direction in this domain. We also appreciate your positive feedback on our loss function ablation (Fig. 7) and experiments (Table 1). As noted, our work has implications for future large-scale dataset development. We will release full training/evaluation code and pretrained checkpoints to support such efforts.
Weaknesses
- ...However, the effectiveness of the proposed architecture remains unclear. A close look at the ablation study suggests that the performance gains are primarily due to the two-stage structure of the pose decoder and joint decoder.
Response
Thanks for raising this concern. While we do not claim the two-stage decoder as a core contribution, we deliberately adopt PETR's two-stage design as a strong, empirically validated baseline for our purpose: 3D radar perception under weak supervision.
With additional ablation studies, we summarize our contributions in two parts (as you pointed out): (1) the structural design of loss functions under weak supervision; and (2) the architectural modifications, including the cross-view (CV) radar encoder and pseudo-3D deformable attention.
First, our structured design of the loss functions—particularly the use of 3D Gravity Loss and 3D Template Loss across the two decoder stages—is key to the substantial performance gains. Along with the suggestion from Reviewer vQxv, Table 1 below provides a complete picture of the significance of each loss function and the decoder stage where it is applied.
- The 3D template prior at the first-stage decoder helps localize initial pose estimates near the correct spatial region (e.g., close to the 3D bounding box center) and promotes human-like keypoint configurations, avoiding degenerate or implausible shapes.
- The second-stage joint decoder refines initial pose estimates by projecting 3D keypoints to the 2D image plane and minimizing 2D keypoint loss (K2D), which leverages the fine granularity of 2D annotations. However, 2D keypoint loss alone cannot constrain absolute scale or depth due to depth ambiguity—where a distant large object may appear identical in 2D to a closer small object. To resolve this, we combine 2D keypoint loss with 3D Gravity Loss, allowing the model to learn depth- and scale-aware corrections during pose refinement. A minimal sketch of this loss combination is given below.
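To make the interplay of these two losses concrete, the snippet below gives a minimal PyTorch-style sketch. The helper `project_to_view` (a calibrated world-to-image projection), the tensor shapes, and the L1 penalty are illustrative assumptions, not the exact implementation in the paper.

```python
import torch.nn.functional as F


def weak_pose_loss(pred_kpts_3d, gt_kpts_2d, gt_bbox_center_3d,
                   project_to_view, w_k2d=1.0, w_g3d=1.0):
    """Illustrative combination of 2D keypoint loss (K2D) and 3D gravity loss (G3D).

    pred_kpts_3d:      (P, J, 3) predicted 3D keypoints for P people and J joints
    gt_kpts_2d:        (V, P, J, 2) 2D keypoint labels for V views
    gt_bbox_center_3d: (P, 3) centers of the 3D bounding-box labels
    project_to_view:   callable (kpts_3d, v) -> (P, J, 2); assumed calibrated
    """
    # K2D: project the predicted 3D keypoints onto each image plane and compare
    # them with the fine-grained 2D keypoint labels.
    num_views = gt_kpts_2d.shape[0]
    k2d = sum(F.l1_loss(project_to_view(pred_kpts_3d, v), gt_kpts_2d[v])
              for v in range(num_views)) / num_views

    # G3D: pull the gravity center (mean of the predicted keypoints) toward the
    # 3D BBox center, which anchors absolute depth and scale.
    g3d = F.l1_loss(pred_kpts_3d.mean(dim=1), gt_bbox_center_3d)

    return w_k2d * k2d + w_g3d * g3d
```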
Table 1: Effect of our structural loss function on RAPTR
Results with ± values are newly added and report mean ± std from at least 3 runs with different random seeds.
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 50.96 | (A) in Fig.7 |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 27.36 | (B) in Fig.7 |
| 2S | CV | 3D | T3D | K2D | 30.06 | new result (suggested by Reviewer vQxv) |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | RAPTR |
Symbols are defined below Table 1 of the response to Reviewer bgMt.
Our structural loss design is also compatible with PETR. As shown in Table 2 below, the original PETR with 2D keypoint loss alone completely fails to generalize to 3D radar perception, with MPJPE over 370 cm due to depth ambiguity. Adding gravity and template losses reduces the error to under 20 cm—a 10× improvement.
Table 2: Effect of our structural loss function on PETR
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | SV | 2D | K2D | K2D | 372.03 | PETR with 2D Keypoint Loss only |
| 2S | SV | 2D | K2D+G3D | K2D+G3D | 21.03 | PETR with 3D Gravity Loss |
| 2S | SV | 2D | T3D | K2D+G3D | 19.52 | PETR with 3D Gravity Loss + 3D Template Loss |
Second, pseudo-3D attention works better with stronger, view-consistent radar features—such as those provided by the cross-view (CV) encoder. Thanks to Reviewer bgMt's suggestion to evaluate +(1)+(3), we include the result in the third row of Table 3 below, alongside Table 3 of the main paper (we re-run +(1) and +(1)+(2)+(3) with 3+ random seeds and report mean ± std for more rigorous evaluation). It is observed that using either the cross-view encoder (CV, 3rd row) or pseudo-3D attention (3D, 2nd row) alone results in slightly worse performance than the base two-stage setup (first row). However, when pseudo-3D attention is paired with the cross-view encoder (last row), it achieves the best performance. This suggests that the effectiveness of pseudo-3D attention depends heavily on the quality and consistency of the underlying radar features, which are better provided by the cross-view encoder than the separate-view encoder. We provide the following justification by looking into the pseudo-3D deformable attention in Fig. 4 of the main paper.
Table 3: Ablation Study on RAPTR architectural modules (Encoder, Decoder, Attention)
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | SV | 3D | T3D | K2D+G3D | 22.83 | 20.78 | +(1)+(2) in Table 3 |
| 2S | CV | 2D | T3D | K2D+G3D | 24.73 | 20.57 | +(1)+(3) (suggested by Reviewer bgMt) |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Weaknesses
- As shown in Figure 9 of the supplementary material, ...This architectural change appears to be incremental...a comparison regarding how much the switch from QRFPose to RAPTR improves model size (memory usage) or inference time is insufficient.
Response
As you pointed out, Figure 9 in the supplementary material illustrates key differences between the QRFPose and RAPTR:
- decoupled 2D offsets in QRFPose versus 3D offsets in RAPTR. While conceptually simple, our 3D offsets better align with radar physics—ensuring that the same human keypoint projects to consistent depths across multiple radar views. In contrast, the 2D offsets in QRFPose may select unrelated features at different depths, limiting cross-view consistency.
- decoupled attention weights in QRFPose versus integrated attention weights in RAPTR to aggregate the view-wise or cross-view radar features.
In terms of memory footprint and inference time, we do not expect large differences between the 2D and pseudo-3D deformable attention, as pseudo-3D saves one parameter per offset prediction.
- ...it would be necessary to apply the {Pose + Joint} decoder structure to QRFPose and evaluate the resulting performance.
Response
We have taken this suggestion. As shown in Table 4, the second row reports the MPJPE of the {Pose + Joint} decoder structure added to QRFPose with 2D keypoint and gravity losses, showing significant gains over the base one-stage QRFPose (first row).
This {Pose + Joint} structure is further improved by applying our structured loss design (e.g., 3D template loss at the first-stage decoder), as shown in the third row.
With additional architectural enhancements—such as the cross-view (CV) encoder and pseudo-3D deformable attention (3D)—the last row achieves the best performance across all architecture and loss configurations on the HIBER (MULTI) dataset.
Table 4: QRFPose versus RAPTR (with 3D template loss)
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 1S | SV | 2D | K2D+G3D | -- | 56.02 | Base QRFPose of Table 3 |
| 2S | SV | 2D | K2D+G3D | K2D+G3D | 21.03 | QRFPose with 2-stage decoder (suggested by Reviewer Zb8G) |
| 2S | SV | 2D | T3D | K2D+G3D | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
- (Minor) ...the paper does not describe how the template pose is calculated for each subject...
Response
Thanks for pointing out the missing description of how the 3D template is constructed. In implementing our 3D Template Loss, we used a standard standing pose of a 1.6 m-tall person (see Fig. 3 in the main paper). More specifically, we used the ground-truth 3D keypoint labels at one frame to construct the template (minus its center position).
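For concreteness, a minimal sketch of how such a template loss could be written is shown below; the tensor shapes, the L1 penalty, the use of the 3D BBox center as the global translation, and the name `template_loss_3d` are our illustrative assumptions rather than the authors' exact code.

```python
import torch.nn.functional as F


def template_loss_3d(pred_kpts_3d, template_kpts, gt_bbox_center_3d):
    """Illustrative 3D Template Loss (T3D) for the first-stage pose decoder.

    pred_kpts_3d:      (P, J, 3) initial 3D pose estimates
    template_kpts:     (J, 3) fixed standing-pose template, centered at the origin
    gt_bbox_center_3d: (P, 3) 3D BBox centers, used here as the global translation
    """
    # Place the origin-centered template at each subject's 3D BBox center.
    target = template_kpts.unsqueeze(0) + gt_bbox_center_3d.unsqueeze(1)  # (P, J, 3)
    # Pull the initial estimates toward a human-like, correctly located pose.
    return F.l1_loss(pred_kpts_3d, target)
```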
Related to your comment, we also conducted an ablation study evaluating the impact of different 3D templates at the first-stage pose decoder (in response to a comment from Reviewer WQeC). Please refer to Table 4 in our response to Reviewer WQeC.
You have clearly stated that the key contributions of this paper lie in (1) the structural design of loss functions under weak supervision, and (2) the architectural modifications. I would like to address each of these points in turn.
On the Contribution of the Loss Functions
Regarding Tables 1 and 2 in your rebuttal, as you rightly pointed out, supervision based on 2D projections of predicted joint coordinates and 2D keypoint ground truth cannot resolve depth ambiguity. Therefore, it is somewhat expected that your method outperforms this baseline, and I do not believe this makes for a sufficiently strong comparison. Consequently, the technical contribution of the template loss feels limited. To properly validate its effectiveness as a core contribution, I believe it is important to demonstrate (i) how well your method performs compared to training with (costly and unscalable) 3D ground truth labels (e.g., on the HIBER dataset), and (ii) how much it improves over the existing 2D-to-3D lifting pipeline using only 2D supervision and pre-trained lift-up models. Without such evidence, I do not find the template loss and gravity loss to be strong enough contributions for a top-tier conference.
On the Necessity of Pseudo-3D Attention
I appreciate your explanation in the rebuttal that pseudo-3D attention works best when combined with a cross-view encoder, and that high-quality features from the encoder enhance spatial understanding. However, I am still unclear on the necessity of pseudo-3D attention itself. Is the only stated advantage over the 2D sampling offsets used in QRFPose that
conceptually simple, our 3D offsets better align with radar physics
?
If so, this conceptual benefit is somewhat limited, and I believe it is necessary to demonstrate concrete advantages in either accuracy or computational efficiency to justify the use of pseudo-3D attention.
Thank you for your follow-up comment.
Comment 1
- On the Contribution of the Loss Functions...consequently, the technical contribution of the template loss feels limited. (i) how well your method performs compared to training with (costly and unscalable) 3D ground truth labels (e.g., on the HIBER dataset), and (ii) how much it improves over the existing 2D-to-3D lifting pipeline using only 2D supervision and pre-trained lift-up models.
Response
Thank you for your comment on the effectiveness of the 3D Template Loss. First, we emphasize that the 3D Template Loss (T3D) plays a key role in enhancing performance:
- Quantitatively, as shown in Table 1 below, comparing rows 1 to 3 demonstrates a clear trend of gradual performance improvement with different first-stage decoder loss functions: no loss (1st row), 2D keypoint loss (K2D) + 3D gravity loss (G3D) (2nd row), and T3D (3rd row), while fixing the second-stage decoder loss to K2D + G3D. These results highlight the effectiveness of incorporating the 3D Template Loss and exploiting a 3D human shape prior at the first-stage pose decoder.
- Qualitatively, we refer the reviewer to Figure 6 in the main paper, which visualizes the initial pose estimates from the first-stage pose decoder for 5 different frames. These initial poses closely align with the fixed 3D template (1st row of Fig. 6), and are further refined by the second-stage joint decoder to produce more accurate final pose estimates (2nd row of Fig. 6).
Second, regarding (i) comparison with full 3D supervision, we took your suggestion and have included an evaluation of RAPTR under full 3D supervision (using 3D keypoint loss, K3D) in Table 1 below (4th row). We agree that this setting provides a meaningful upper bound on achievable performance, and allows us to quantify the performance gap between weak and full supervision. Under full supervision, RAPTR achieves an MPJPE of 8.93 cm, whereas our best weakly supervised variant—leveraging 3D Template Loss (T3D) and 3D Gravity Loss (G3D)—achieves 18.99 cm. The weakly supervised model thus incurs a loss of about 10 cm in MPJPE compared with the fully supervised result. Nevertheless, RAPTR with the structural loss (T3D and K2D+G3D) outperforms the other variants without T3D.
Table 1: RAPTR with Weak and Full Supervision
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 50.96 | RAPTR + Weak Supervision |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 27.36 | RAPTR + Weak Supervision |
| 2S | CV | 3D | T3D | K2D+G3D | **18.99 ± 0.16** | RAPTR + Weak Supervision |
| 2S | CV | 3D | K3D | K3D | **8.93** | RAPTR + Full Supervision |
Third, regarding (ii) comparison with pretrained 2D-to-3D uplifting models, we appreciate your suggestion. It is important to note that RAPTR—like other 3D radar baselines—directly estimates 3D poses in the absolute world coordinate system, utilizing radar’s capability to provide accurate depth (range) measurements. These 3D poses can be transformed into other coordinate systems (e.g., camera or radar) using known/calibrated transformations between the world and sensor frames, and can also be projected onto the image plane using a standard pinhole camera model.
We are currently implementing a pipeline for your suggested (2D keypoint estimates + pretrained 2D-to-3D uplifting) option, but with a key challenge:
- Most pretrained uplifting models output 3D keypoints in a relative coordinate system, centered on the pelvis of each subject. While they preserve inter-joint structure, they lack absolute depth and require additional processing to convert these poses into the absolute 3D coordinate system, so that we can directly compare 3D pose estimation performance with RAPTR in absolute 3D space.
We aim to provide an updated evaluation within the next day.
Comment 2
- On the Necessity of Pseudo-3D Attention
- ..I believe it is necessary to demonstrate concrete advantages in either accuracy or computational efficiency to justify the use of pseudo-3D attention.
Response
Thanks for your comment. For accuracy, we believe that Table 3 in the previous response to you has demonstrated the effectiveness of the pseudo-3D deformable attention, particularly given its improvement on the more challenging MULTI split over the single-subject-only WALK split.
For computational efficiency, we are conducting a more detailed evaluation of 2D and pseudo-3D deformable attention as a function of configuration parameters (e.g., number of queries, offsets, radar views,...). We will provide an updated complexity comparison within the next day.
We now address the two remaining comments.
Comment 1 (ii) how much it improves over the existing 2D-to-3D lifting pipeline using only 2D supervision and pre-trained lift-up models.
Response
As noted earlier, most pretrained 2D-to-3D uplifting models output 3D keypoints in a relative coordinate system, centered at the pelvis without absolute depth. A translation vector is therefore required to place each pose in the absolute 3D coordinate system.
Based on this note, we implemented a 2D-to-3D uplifting baseline using the same configuration (2S, CV, 3D) as RAPTR for a fair comparison (a minimal sketch follows the list below):
- Estimated 2D keypoints on the image plane, supervised with only 2D keypoint loss (K2D)
- Estimated 3D object center for each object, supervised with the 3D distance loss between the estimated center and the 3D BBox center (the only 3D label available under weak supervision)
- Applied a pretrained 2D-to-3D uplifting model to lift the estimated 2D keypoints into a relative 3D coordinate system
- Translated 3D keypoints into the absolute 3D coordinate system using the estimated 3D object center
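A minimal sketch of this baseline is given below; `estimate_2d_keypoints`, `estimate_3d_center`, and `lift_2d_to_3d` are placeholders for the K2D-supervised keypoint head, the BBox-center-supervised 3D-center head, and the pretrained uplifting model of [1], respectively, and are not actual function names from our code.

```python
import numpy as np


def uplifting_baseline(radar_features,
                       estimate_2d_keypoints, estimate_3d_center, lift_2d_to_3d):
    """Illustrative 2D-to-3D uplifting baseline (placeholder components)."""
    # Step 1: 2D keypoints on the image plane, trained with K2D only.
    kpts_2d = estimate_2d_keypoints(radar_features)               # (P, J, 2)
    # Step 2: a 3D center per person, supervised by the 3D BBox center.
    centers_3d = estimate_3d_center(radar_features)               # (P, 3)
    # Step 3: pretrained uplifting -> pelvis-centered (relative) 3D poses.
    kpts_3d_rel = np.stack([lift_2d_to_3d(k) for k in kpts_2d])   # (P, J, 3)
    # Step 4: translate into the absolute world frame using the estimated centers.
    return kpts_3d_rel + centers_3d[:, None, :]
```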
Using this pipeline, we updated Table 1 to include the evaluation result of the uplifting model on the MULTI data split (3rd row). In our experiments, under the chosen architecture configuration, the uplifting pipeline resulted in an MPJPE of 33.85 cm, which is much larger than the MPJPE of 18.99 cm of RAPTR.
A possible explanation for the weaker performance of the uplifting baseline is that the 3D object center is supervised only through the 3D BBox center, which may not align precisely with the anatomical pelvis center. This lack of fine-grained 3D supervision may lead to less accurate absolute pose estimates of the uplifting baseline.
Note that the current implementation uses the pretrained 2D-to-3D uplifting model from [1]. We are evaluating the uplifting baseline on the WALK split and testing other pretrained models.
Table 1: RAPTR with Weak and Full Supervision
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 50.96 | |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 27.36 | |
| 2S | CV | 3D | K2D | K2D | 33.85 | pretrained 2D-to-3D pose uplifting |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | RAPTR + Weak supervision |
| 2S | CV | 3D | K3D | K3D | 8.93 | RAPTR+Full supervision |
[1] A simple yet effective baseline for 3D human pose estimation. ICCV, 2017.
Comment 2
- Computational efficiency to justify the use of pseudo-3D attention.
Response
Thank you for your insightful suggestion, which motivated us to conduct a more detailed analysis of decoupled 2D and pseudo-3D deformable attention with respect to configuration parameters: number of queries Q, sampling offsets N, radar views V, and query dimension C. This deeper investigation revealed that:
- Pseudo-3D attention eliminates redundant per-view offset estimation, offering better scalability as the number of radar views increases.
- Notably, pseudo-3D attention achieves a computational saving of 17.5% with V = 5 radar views and 21.3% with V = 10 radar views, compared to decoupled 2D attention.
The detailed analysis is shown below:
- For decoupled 2D deformable attention, the overall complexity is O(8VQNC), where:
  - 3D reference point projected to V 2D radar views: O(6VQ)
  - Offset estimation: O(2VQNC) — 2 for the 2D (x, y) offsets and V for the radar views
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC) — 5 due to bilinear interpolation and weighted sum
- For pseudo-3D deformable attention, the total complexity is O(6VQNC + 3QNC), where:
  - Offset estimation: O(3QNC) — 3 for the 3D (x, y, z) offsets, shared across views
  - 3D offsets projected to V 2D radar views: O(6VQ)
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC)
Note that O(6VQ) is excluded from the final complexity expression as 6VQ ≪ 5VQNC in practice.
Table 2: Complexity comparison of Pseudo-3D vs. Decoupled 2D deformable attention
| Queries (Q) | Views (V) | 2D Att (8VQNC) | Pseudo-3D Att (6VQNC + 3QNC) | Ratio (3D/2D) | Savings |
|---|---|---|---|---|---|
| 10 | 2 | 160·NC | 150·NC | 0.94 ↓ | 6.25% |
| 10 | 5 | 400·NC | 330·NC | 0.83 ↓ | 17.5% |
| 10 | 10 | 800·NC | 630·NC | 0.79 ↓ | 21.3% |
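As a sanity check on Table 2, the numbers above can be reproduced directly from the complexity expressions in the analysis (8VQNC for decoupled 2D attention and (6V + 3)QNC for pseudo-3D attention); the short script below only restates that arithmetic.

```python
# Reproduce the savings in Table 2 from the stated complexities:
#   decoupled 2D attention : ~ 8 * V * Q * N * C operations
#   pseudo-3D attention    : ~ (6 * V + 3) * Q * N * C operations
# The common N * C factor cancels in the ratio.
Q = 10  # number of queries, as in Table 2
for V in (2, 5, 10):  # number of radar views
    cost_2d = 8 * V * Q
    cost_pseudo3d = (6 * V + 3) * Q
    savings = 100.0 * (1.0 - cost_pseudo3d / cost_2d)
    print(f"V={V:2d}: savings = {savings:.2f}%")
# Prints roughly 6.25%, 17.50%, and 21.25%, matching Table 2 up to rounding.
```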
It seems I had misunderstood the interpretation of the quantitative results regarding the pseudo-3D attention. Indeed, the table provided in the rebuttal clearly demonstrates the effectiveness of pseudo-3D attention in terms of accuracy.
Additionally, the performance gap between using ground-truth 3D keypoints and the proposed method is within approximately 10 cm, which is significantly smaller than the gap between the proposed method and the baseline approach using bounding box center prediction followed by 2D-to-3D lift-up.
However, regarding the number of views and the cost-efficiency of the proposed method, the author claims that the complexity can be reduced by approximately 20% when using 10 views. But isn’t the typical number of views in practice closer to 2? As far as I understand, the HIBER dataset used in the paper also provides only 2 views (horizontal and vertical). Could the authors provide concrete examples or scenarios where 10-view setups are realistic or commonly used?
We thank the reviewer for engaging in the discussion and for the suggestions that motivated us to further clarify our contributions and better demonstrate the effectiveness of the proposed approach.
Comment 1 But isn’t the typical number of views in practice closer to 2? As far as I understand, the HIBER dataset used in the paper also provides only 2 views (horizontal and vertical). Could the authors provide concrete examples or scenarios where 10-view setups are realistic or commonly used?
Response
Thank you for the comment. You are correct that both HIBER and MMVR employed only two radar views (horizontal and vertical) in their data collection. It is worth noting that both datasets utilized high-resolution cascading radar evaluation boards (each with 12 transmitters, 16 receivers), which are significantly more expensive than single-chip radar boards (3 transmitters, 4 receivers).
We anticipate that multi-view radar systems with more than two sensors (e.g., 5 or 10) will be increasingly considered and deployed in practice in the following scenarios:
- Large-area surveillance in environments such as transportation hubs, factories, and residential buildings for perimeter surveillance and intrusion detection, where a wide monitoring field-of-view is a critical requirement. In such cases, multiple radar sensors are often required to cover the full area of interest.
- Indoor safety monitoring in crowded or occlusion-prone settings such as eldercare homes, hospitals, assisted-living facilities, and open office space, where deploying multiple radar sensors can mitigate the occlusion and improve the robustness of human perception, especially for critical events such as fall detection and abnormal motion.
- Distributed monitoring aimed at low cost, where multiple single-chip radar sensors are placed throughout a space (different heights, different angles) to enable seamless 360° coverage. For example, the hardware cost of 2 cascading radar sensors (24 transmitters, 32 receivers) is similar to that of 8 distributed single-chip radar sensors. While the latter have lower spatial resolution, their placement flexibility allows broader spatial diversity.
In the context of 3D non-human radar perception, multi-view radar setups are also critical to cover parking lots, logistics centers, warehouses, wildlife areas, and low-altitude airspace. These scenarios often involve autonomous vehicles (cars, drones, robots, AGVs) that require full-scene coverage and precise awareness of their position, motion, and interaction for safe and effective operation.
We also observe that commercial multi-view radar systems—supporting up to 5 low-cost single-chip radars—are currently available for purchase online. However, due to rebuttal policy constraints, we are unable to disclose specific product names or links.
The authors have adequately addressed my concerns by demonstrating the effectiveness of pseudo-3D attention in terms of both accuracy and computational complexity. They also showed that their scalable training strategy, based on 3D bounding boxes and a template loss, achieves only minimal performance degradation compared to training with full 3D keypoint ground truth. Based on these clarifications and improvements, I have raised my score and now consider the paper a borderline accept.
We thank the reviewer for the constructive feedback, which has helped improve the clarity and quality of our paper. We will incorporate these changes in the revised paper accordingly.
This paper introduces RAPTR, a radar-based 3D human pose estimation framework that avoids costly 3D keypoint labels by instead using cheaper 3D bounding boxes and 2D keypoints. RAPTR builds upon PETR, a vision-based 2D pose transformer, and adapts it for radar input via a pseudo-3D deformable attention mechanism and a two-stage decoding strategy. The authors propose novel 3D losses (template and gravity loss) and demonstrate competitive performance on two indoor radar datasets (HIBER and MMVR), outperforming prior radar-based baseline methods. The approach is shown to be scalable, privacy-preserving, and suitable for various challenging environments.
Strengths and Weaknesses
Strengths
- 1) Efficient Use of Weak Supervision: The method relies on cheap and scalable labels (3D BBoxes + 2D keypoints) rather than expensive 3D motion capture data, improving data accessibility for real-world radar applications.
- 2) Extensive Experiments and Ablation Studies: The work presents comprehensive experimental results, including detailed per-joint error analysis, directional breakdowns, and ablations on architecture and loss functions.
- 3) Solid Engineering and Architectural Design: The proposed pseudo-3D deformable attention and two-stage decoding pipeline demonstrate good adaptation of vision transformer ideas to radar-based pose estimation.
- 4) Robust to Multi-Person and Cluttered Scenes: The method shows good performance in multi-person scenarios, which are typically challenging for radar-based perception systems.
Weaknesses
- 1) Limited Methodological Novelty: The method is a modified version of PETR, a 2D pose transformer, adapted to radar data. While practical, this reduces its originality, especially since most contributions are engineering extensions.
- 2) Occlusion and Human-to-Human Interference Remain Unsolved: The paper does not adequately address the inherent limitations of radar, particularly signal interference in multi-person occlusion-heavy scenarios—one of the core challenges in radar-based perception.
- 3) Strong Assumption on Template Priors: The use of a fixed 3D human pose template in the decoder can be limiting, especially when generalizing to diverse body shapes, poses, or actions. A more flexible or learned template bank might improve robustness.
In a word, RAPTR is a well-executed system that makes radar-based 3D pose estimation more feasible under weak supervision. However, its conceptual originality is modest considering that the novelty bar is high in NeurIPS, and it leaves open key challenges like occlusion robustness and template generality.
Questions
- 1) How does RAPTR handle human-to-human occlusions and radar signal interference in multi-person scenes? Is there a mechanism to explicitly model or mitigate inter-person radar reflections?
- 2) To what extent does the fixed 3D human pose template limit generalization to diverse body types or motion patterns? Would a learned or dynamic template set improve robustness?
- 3) Is pseudo-3D deformable attention significantly better than simple view-wise fusion or 3D CNN alternatives? Can ablations with more diverse attention mechanisms help validate this choice?
Limitations
- 1) Limited originality in core methodology: RAPTR is largely a radar-adapted version of PETR, and while technically solid, the core innovation lies more in adaptation than conceptual novelty.
- 2) Lack of effective modeling for occlusion and multi-person interference: Radar-based human-to-human occlusion remains a key challenge that RAPTR does not explicitly solve.
- 3) Rigid prior assumptions in the 3D template: The reliance on a fixed pose template may reduce generalization to out-of-distribution poses or body shapes.
Justification for Final Rating
Despite some shortcomings, I am impressed by the strong empirical results and extensive experimental tests of this paper. Therefore, I promise to maintain the positive evaluation. I ask the authors to clarify the concerns we discussed in the final version.
Formatting Issues
No significant formatting issues.
Summary & Strengths
Response
We thank the reviewer for recognizing our core motivation—advancing 3D radar perception under weak supervision, which we believe is the first to explore this direction. We appreciate the positive feedback on our 3D loss design and experiments. To support reproducibility, we will release full code and pretrained checkpoints.
Weaknesses
- Limited Methodological Novelty:... While practical, this reduces its originality, especially since most contributions are engineering extensions.
Response
We thank the reviewer for raising this concern, which allows us to clarify our core contributions below. With additional ablation studies, we summarize our contributions in two parts: (1) the structural design of loss functions under weak supervision; and (2) the architectural modifications, including the cross-view (CV) radar encoder and pseudo-3D deformable attention.
First, our structured design of the loss functions—particularly the use of 3D Gravity Loss and 3D Template Loss across the two decoder stages—is key to the substantial performance gains. Along with the suggestion from Reviewer vQxv, Table 1 below provides a complete picture of the significance of each loss function and the decoder stage where it is applied.
- The 3D template prior at the first-stage decoder helps localize initial pose estimates near the correct spatial region (e.g., close to the 3D bounding box center) and promotes human-like keypoint configurations, avoiding degenerate or implausible shapes.
- The second-stage joint decoder refines initial pose estimates by projecting 3D keypoints to the 2D image plane and minimizing 2D keypoint loss (K2D), which leverages the fine granularity of 2D annotations. However, 2D keypoint loss alone cannot constrain absolute scale or depth due to depth ambiguity—where a distant large object may appear identical in 2D to a closer small object. To resolve this, we combine 2D keypoint loss with 3D Gravity Loss, allowing the model to learn depth- and scale-aware corrections during pose refinement.
Table 1: Effect of our structural loss function on RAPTR
Results with ± values are newly added and report mean ± std from at least 3 runs with different random seeds.
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 50.96 | (A) in Fig.7 |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 27.36 | (B) in Fig.7 |
| 2S | CV | 3D | T3D | K2D | 30.06 | new result (suggested by Reviewer vQxv) |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | RAPTR |
Symbols are defined below Table 1 of the response to Reviewer bgMt.
Second, our structural loss design is also compatible with PETR. As shown in Table 2 (new results), PETR with only 2D keypoint loss completely fails to generalize to 3D radar, with MPJPE over 370 cm due to depth ambiguity. Adding gravity and template losses reduces the error to under 20 cm—a 10× improvement.
Table 2: Effect of our structural loss function on PETR
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | SV | 2D | K2D | K2D | 372.03 | PETR with 2D Keypoint Loss |
| 2S | SV | 2D | K2D+G3D | K2D+G3D | 21.03 | PETR with 2D Keypoint Loss + 3D Gravity Loss (ours) |
| 2S | SV | 2D | T3D | K2D+G3D | 19.52 | PETR with 2D Keypoint Loss + 3D Gravity Loss (ours) + 3D Template Loss (ours) |
Third, pseudo-3D attention works better with stronger, view-consistent radar features—such as those provided by the cross-view (CV) encoder. Thanks to Reviewer bgMt's suggestion to evaluate +(1)+(3), we include the result in the third row of Table 3 below, alongside Table 3 from the main paper (we re-run +(1) and +(1)+(2)+(3) with 3+ random seeds and report mean ± std for more rigorous evaluation). It is observed that using either the cross-view encoder (CV, 3rd row) or pseudo-3D attention (3D, 2nd row) alone results in slightly worse performance than the base two-stage setup (first row). However, when pseudo-3D attention is paired with the cross-view encoder (last row), it achieves the best performance. This suggests that the effectiveness of pseudo-3D attention depends heavily on the quality and consistency of the underlying radar features, which are better provided by the cross-view encoder than the separate-view encoder. We provide the following justification by looking into the pseudo-3D deformable attention in Fig. 4 of the main paper.
Table 3: Ablation Study on RAPTR architectural modules (Encoder, Decoder, Attention)
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | SV | 3D | T3D | K2D+G3D | 22.83 | 20.78 | +(1)+(2) in Table 3 |
| 2S | CV | 2D | T3D | K2D+G3D | 24.73 | 20.57 | +(1)+(3) (suggested by Reviewer bgMt) |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Weaknesses
- Lack of effective modeling for occlusion and multi-person interference:...
Response
For occlusion, our preliminary investigation shows that RAPTR with weak supervision has some capability in recovering 3D poses even when some body parts are occluded (e.g., by furniture or other people). This is particularly observed on the MMVR dataset where the subjects are often partially occluded by furniture. We observed that:
- The 2D keypoint labels—generated by pre-trained vision models—often extrapolate occluded keypoints accurately in the image pixel domain.
- Our weakly supervised framework lifts these 2D keypoints to 3D using the combination of 2D keypoint loss, 3D template loss, and 3D gravity loss.
These results appear to suggest that RAPTR can handle small-to-moderate occlusion via strong image-based 2D supervision loss and structured 3D regularization (template + gravity loss). However, due to the absence of ground-truth 3D keypoints in MMVR, we cannot provide a quantitative analysis under varying occlusion levels.
For multi-person interference, this is a good point. In both the HIBER and MMVR datasets, we did not observe strong interference effects, potentially because 77 GHz radar waveforms are absorbed by the human body and clothing and multi-bounce signals are strongly attenuated, resulting in weak ghost targets. One possible direction is to incorporate the temporal dynamics of human motion. We leave this as a future topic.
Weaknesses
- Rigid prior assumptions in the 3D template: The reliance on a fixed pose template may reduce generalization to out-of-distribution poses or body shapes.
Response
Thanks for the insightful comment on the 3D template prior. In implementing our 3D Template Loss, we used a standard standing pose of a 1.6 m-tall person (see Fig. 3 in the main paper). To address this comment, we conducted an ablation study evaluating the impact of different 3D templates at the first-stage pose decoder. Results are shown in Table 4 below. Specifically, we experimented with:
- A standing pose scaled by two factors: 0.5× and 1×
- A sitting pose of a 1.6 m-tall person
- A learned scaling factor applied to the standard standing pose
Table 4 suggests that the choice of 3D template has some minor impact on final MPJPE, likely due to the refinement capability of the second-stage decoder, as long as the first-stage decoder generates a reasonable, human-like initial pose.
Table 4: The effect of 3D Template
| Decoder | Encoder | Attn | Pose Loss | 3D Template | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | CV | 3D | T3D | Standing (scale=1) | K2D+G3D | 18.99 ± 0.16 | existing result |
| 2S | CV | 3D | T3D | Standing (scale=0.5) | K2D+G3D | 19.49 | new result |
| 2S | CV | 3D | T3D | Sitting (scale=1) | K2D+G3D | 19.86 | new result |
| 2S | CV | 3D | T3D | Standing (learned scale) | K2D+G3D | 19.71 | new result |
Questions
- ...Can ablations with more diverse attention mechanisms help validate this choice?
Response
We refer the reviewer to Table 3 above, which evaluates the effect of RAPTR's architectural modules—including the attention mechanisms (2D vs. pseudo-3D)—in response to your concern. We appreciate your suggestion and will consider more attention mechanisms in the updated paper.
We sincerely thank the authors for their detailed rebuttal and thoughtful follow-up experiments, especially the new ablation studies and additional quantitative evidence regarding loss design and architectural choices. The commitment to reproducibility through planned code and checkpoint release is also appreciated.
That said, I would like to raise two remaining concerns:
1. On the Combined Use of Cross-View Encoder and Pseudo-3D Attention (Table 3):
While the authors provide ablations comparing separate and combined use of the proposed cross-view encoder and pseudo-3D attention modules, the analysis lacks a strong explanatory justification. Specifically, Table 3 shows that each module individually results in no performance gain (or even slightly worse results) compared to the base model, but the combination leads to a significant improvement. This raises the question:
- Is the observed gain purely empirical, or is there a principled architectural rationale explaining their synergy?
The current explanation, which attributes performance differences to "underlying radar feature consistency," feels speculative. I would encourage the authors to further investigate this interplay, perhaps through visualization of intermediate attention maps or analysis of representational similarity, rather than relying solely on empirical evidence.
2. On Occlusion and Multi-Person Interference Handling:
Although the authors mention that weakly supervised RAPTR can recover poses under small to moderate occlusions (e.g., using extrapolated 2D keypoints and 3D structural losses), I still find the approach insufficiently robust for realistic occlusion scenarios. In real-world applications, occlusion is often dynamic, partial, and non-stationary, especially during interactions or crowd settings. Given that radar inherently lacks the fine-grained spatial resolution of vision, I would argue that radar-only perception remains limited in handling such complexities.
A more promising direction may be to fuse radar with complementary modalities (e.g., vision or depth sensors), enabling the system to automatically detect modality degradation and rely on more informative signals. I recommend the authors discuss this in the Limitations section, as it represents a critical challenge for future deployment in cluttered, dynamic environments.
Despite the above reservations, I recognize the strong empirical results and extensive experimental validation presented in the paper. The structured loss design, two-stage decoding, and radar-oriented adaptation of PETR represent a solid engineering contribution. I encourage the authors to clarify the remaining concerns in the final version, and I believe the work is worthy of positive recognition.
Thank you for your follow-up comment. We have the following point-to-point response.
Comment 1
- On the Combined Use of Cross-View Encoder and Pseudo-3D Attention (Table 3)
- Is the observed gain purely empirical, or is there a principled architectural rationale explaining their synergy?
Response
Thank you for raising this insightful question regarding the architectural rationale behind the observed synergy between the cross-view (CV) encoder and pseudo-3D attention, as reported in Table 3.
In addition to the empirical evidence in Table 3, we provide the following architectural justification. It is worth noting this justification was included in our rebuttal to other reviewers (e.g., Reviewer bgMt) but was omitted from the rebuttal to you due to space limitations and the need to prioritize other major concerns. The justification centers on the structural design of pseudo-3D deformable attention (illustrated in Fig. 4 of the main paper) and its link to the CV encoder:
- Pseudo-3D attention predicts 3D spatial offsets—unlike the decoupled 2D offsets in QRFPose—to align and extract spatially consistent features across radar views.
- These 3D offsets are projected into each radar view to determine where features should be sampled.
- The sampled features from different views are then aggregated using normalized attention weights to form the pseudo-3D deformable attention output.
- For this mechanism to function effectively, the underlying radar features need to be spatially coherent and consistent across views—a property better provided by the cross-view (CV) encoder.
- In contrast, when pseudo-3D attention is combined with the separate-view (SV) encoder, the lack of shared multi-view context may limit its ability to extract reliable 3D features, leading to a less powerful aggregated multi-view radar feature at the pseudo-3D deformable attention output.
We hope this explanation clarifies the principled architectural rationale behind the synergy between the CV encoder and pseudo-3D attention, and provides some insight into why their combination leads to better performance.
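To make these steps concrete, below is a heavily simplified, single-head sketch of pseudo-3D deformable attention. The helper `project_to_view` (a calibrated projection returning normalized sampling coordinates), the `offset_head`/`weight_head` layers, and the tensor shapes are our assumptions for illustration; the actual RAPTR implementation differs in details such as multi-head attention and batching.

```python
import torch.nn.functional as F


def pseudo3d_deformable_attention(queries, ref_points_3d, view_feats,
                                  project_to_view, offset_head, weight_head,
                                  num_offsets=4):
    """Simplified single-head pseudo-3D deformable attention.

    queries:         (Q, C) query embeddings
    ref_points_3d:   (Q, 3) 3D reference points attached to the queries
    view_feats:      list of V radar feature maps, each of shape (1, C, H, W)
    project_to_view: callable (points_3d, v) -> (Q, K, 2) coords in [-1, 1]
    offset_head:     layer mapping (Q, C) -> (Q, num_offsets * 3)
    weight_head:     layer mapping (Q, C) -> (Q, V * num_offsets)
    """
    Q, C = queries.shape
    V = len(view_feats)
    # One set of 3D offsets per query, shared across all radar views.
    offsets_3d = offset_head(queries).view(Q, num_offsets, 3)
    sample_points_3d = ref_points_3d[:, None, :] + offsets_3d            # (Q, K, 3)
    # Attention weights normalized jointly over all (view, offset) pairs.
    attn = weight_head(queries).view(Q, V * num_offsets).softmax(dim=-1)

    out = queries.new_zeros(Q, C)
    for v in range(V):
        # Project the shared 3D sampling points into this radar view.
        grid = project_to_view(sample_points_3d, v).view(1, Q, num_offsets, 2)
        sampled = F.grid_sample(view_feats[v], grid, align_corners=False)  # (1, C, Q, K)
        sampled = sampled[0].permute(1, 2, 0)                              # (Q, K, C)
        w = attn[:, v * num_offsets:(v + 1) * num_offsets, None]           # (Q, K, 1)
        out = out + (w * sampled).sum(dim=1)
    return out
```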
Comment 2
- On Occlusion and Multi-Person Interference Handling...
- A more promising direction may be to fuse radar with complementary modalities (e.g., vision or depth sensors) enabling the system to automatically detect modality degradation and rely on more informative signals.....I recommend the authors discuss this in the Limitations section, as it represents a critical challenge for future deployment in cluttered, dynamic environments.
Response
We agree with the reviewer that, in real-world applications, occlusion is often dynamic, partial, and non-stationary—particularly in interactive and crowded environments. Thank you also for the suggestion regarding multi-sensor or multi-modal fusion. We find this to be a very promising direction for enabling truly 3D multi-sensor perception (e.g., combining vision, depth, and radar) under weak supervision, and incorporating the ability to automatically assess modality degradation and adaptively control feature flow based on modality quality.
We will incorporate a discussion on these important aspects—including occlusion, multi-person interference, and future directions—in the Limitation section of the revised paper.
Comment 3
- ...I recognize the strong empirical results and extensive experimental validation presented in the paper...I encourage the authors to clarify the remaining concerns in the final version, and I believe the work is worthy of positive recognition.
Response
Thank you for the positive feedback, constructive suggestions, and engaged discussion during the rebuttal process. We will incorporate the key rebuttal responses to your comments in the revised version of the paper.
This paper introduces RAPTR, a novel method for radar-based 3D human pose estimation. The core innovation of the proposed method is that it uses only low-cost, easily accessible 3D bounding boxes (BBoxes) and 2D keypoints for supervision, instead of relying on expensive and difficult-to-collect 3D keypoint labels. The whole framework is built on PETR to output initial 3D pose and refined 3D pose, which is supervised by the proposed 3D template loss, and 2D keypoint and 3D gravity losses, respectively. Experiments on two public radar datasets demonstrate the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths
1. The proposed method significantly reduces data annotation costs. The proposed method is trained using low-cost 2D keypoints and 3D BBox labels instead of expensive 3D keypoint labels, which allows the model to be trained on larger-scale and more diverse scenarios and enhances the scalability of the proposed method.
2. This paper is well-written and easy to understand.
Weakness
The methodological novelty of the proposed method is somewhat limited. The overall framework of RAPTR is mainly based on PETR. Its primary modification is to "lift" the 2D deformable attention mechanism into a "pseudo-3D" space. However, as shown in Table 3, pseudo-3D deformable attention is harmful to the model performance, which cannot show the effectiveness of the proposed component. What is the performance of +(1)+(3)?
Questions
In weakness.
Limitations
In weakness.
Justification for Final Rating
The rebuttal addresses my concerns, and I keep my rating of borderline accept.
Formatting Issues
No
Summary & Strengths
- The core innovation of the proposed method is that it uses only low-cost, easily accessible 3D BBoxes and 2D keypoints for supervision, instead of relying on expensive and difficult-to-collect 3D keypoint labels. The whole framework is built on PETR to output initial 3D pose and refined 3D pose, which is supervised by the proposed 3D template loss, and 2D keypoint and 3D gravity losses, respectively...
- The proposed method significantly reduces data annotation costs...allows the model to be trained on larger-scale and more diverse scenarios and enhances the scalability of the proposed method.
Response
We sincerely thank the reviewer for the positive feedback and recognition of our core motivation—advancing 3D radar perception under weak supervision—which, to the best of our knowledge, is the first work to explore this direction in the radar domain. We appreciate your acknowledgment of the broader impact of our approach, particularly our effort to enable scalable training and deployment using inexpensive labels, which hopefully encourages more extensive radar data collection without the need for costly 3D annotations. We will release the full training and evaluation code, along with the pretrained model checkpoints, to support reproducibility and future open research in 3D radar perception.
Weakness
- The methodological novelty of the proposed method is somewhat limited. The overall framework of RAPTR is mainly based on PETR. Its primary modification is to "lift" the 2D deformable attention mechanism into a "pseudo-3D" space. However, as shown in Table 3, pseudo-3D deformable attention is harmful to the model performance, which cannot show the effectiveness of the proposed component. What is the performance of +(1)+(3)?
Response
Thanks for raising this concern, which allows us to clarify our core contributions more clearly. We also conducted additional ablation studies on the HIBER dataset, evaluating a comprehensive set of configurations involving architectural modules and loss functions with random seeds, including your suggested combination of +(1)+(3).
First, while we do not claim the two-stage decoder as a core contribution, we deliberately adopt PETR's two-stage design as a strong, empirically validated baseline for our purpose. Rather than reinventing the wheel for the sake of novelty, we reuse PETR's backbone in a plug-and-play manner, which lets us focus on advancing 3D radar perception under weak supervision.
Second, PETR with only 2D keypoint loss completely fails to generalize to 3D radar perception, as evidenced by the first row of Table 1 below, which shows drastically worse performance—with an error exceeding 3 meters. The core issue is depth or depth-scale ambiguity, a limitation well illustrated by the behavior of a pinhole camera or flashlight projection: A larger object that is farther away can appear identical in 2D to a smaller object that is closer. In such cases, 2D keypoint supervision alone provides no constraint on absolute scale or depth, making 3D recovery fundamentally under-constrained—especially in weakly supervised settings.
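For completeness, the ambiguity can be written out with the standard pinhole projection (focal length f); this is a textbook identity rather than a result specific to our method.

```latex
u = \frac{fX}{Z}, \qquad v = \frac{fY}{Z},
\qquad\text{and for any scale } k > 0:\quad
\frac{f(kX)}{kZ} = u, \qquad \frac{f(kY)}{kZ} = v .
```

That is, scaling a subject's size and depth by the same factor leaves its 2D keypoints unchanged, which is why the 3D Gravity and Template losses are needed to pin down absolute depth and scale.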
Table 1: Effect of our structural loss function on PETR
Results with ± values are newly added and report mean ± std from at least 3 runs with different random seeds.
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | SV | 2D | K2D | K2D | 372.03 | PETR with 2D Keypoint Loss only |
| 2S | SV | 2D | K2D+G3D | K2D+G3D | 21.03 | PETR with 3D Gravity Loss |
| 2S | SV | 2D | T3D | K2D+G3D | 18.99 ± 0.16 | PETR with 3D Gravity Loss + 3D Template Loss |
Symbol Definitions
- Decoder: 1S = one-stage, 2S = two-stage
- Encoder: SV = separate-view encoder, CV = cross-view encoder
- Attn (Attention): 2D = 2D deformable attention, 3D = pseudo-3D deformable attention
- Loss Terms:
  - T3D = 3D Template Loss (weak-supervision label)
  - G3D = 3D Gravity Loss (weak-supervision label)
  - K2D = 2D Keypoint Loss (weak-supervision label)
Third, it is our structured design of the loss functions—particularly the use of 3D Gravity Loss and 3D Template Loss across the two decoder stages—that leads to substantial performance gains. As shown in the second and third rows of Table 1, these additions reduce the MPJPE of PETR from over 370 cm to under 20 cm—a more than 10× improvement. We also observe in Table 2 that our structured loss function significantly outperforms other loss combinations, further validating its effectiveness on RAPTR.
Table 2: Effect of our structural loss function on RAPTR
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 50.96 | (A) in Fig.7 |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 27.36 | (B) in Fig.7 |
| 2S | CV | 3D | T3D | K2D | 30.06 | new result (suggested by Reviewer vQxv) |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | RAPTR |
Fourth, Pseudo-3D attention works better with stronger, view-consistent radar features—such as those provided by the cross-view (CV) encoder. Thanks to your suggestion to evaluate +(1)+(3), we include the result in the third row of Table 3 below, alongside the original Table 3 from the main paper (we re-run +(1) and +(1)+(2)+(3) with 3+ random seeds and report mean ± std for more rigorous evaluation). It is observed that using either the cross-view encoder (CV, 3rd row) or pseudo-3D attention (3D, 2nd row) alone results in slightly worse performance than the base two-stage setup (first row). However, when pseudo-3D attention is paired with the cross-view encoder (last row), it achieves the best performance at the two HIBER data splits (WALK, MULTI). This suggests that the effectiveness of pseudo-3D attention depends heavily on the quality and consistency of the underlying radar features, which are better provided by the cross-view encoder than the separate-view encoder. We provide the following justification by looking into the pseudo-3D deformable attention in Fig. 4 of the main paper.
- Pseudo-3D attention predicts 3D spatial offsets—unlike the decoupled 2D offsets in QRFPose—to align and extract spatially consistent features across radar views.
- These 3D offsets are projected into each radar view to determine where features should be sampled.
- The sampled features from different views are then aggregated using normalized attention weights to form the pseudo-3D deformable attention output.
- For this mechanism to function effectively, the underlying radar features need to be spatially coherent and consistent across views—a property better provided by the cross-view (CV) encoder.
- In contrast, when pseudo-3D attention is combined with the separate-view (SV) encoder, the lack of shared multi-view context may limit its ability to extract reliable 3D features, leading to a less powerful aggregated multi-view radar feature at the pseudo-3D deformable attention output.
Table 3: Ablation Study on RAPTR architectural modules (Encoder, Decoder, Attention)
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | SV | 3D | T3D | K2D+G3D | 22.83 | 20.78 | +(1)+(2) in Table 3 |
| 2S | CV | 2D | T3D | K2D+G3D | 24.73 | 20.57 | +(1)+(3) (suggested by Reviewer bgMt) |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Summary
We thank Reviewer bgMt for the thoughtful and constructive feedback. Your positive evaluation of our work's quality, clarity, significance, and originality (all rated 4) affirms our core contributions to weakly supervised 3D radar perception. We hope that the added clarification on pseudo-3D deformable attention, additional ablations (including your suggested +(1)+(3)), and new results with random seeds further strengthen the impact and completeness of our submission.
We thank the reviewer for the constructive feedback, and hope that the added evaluation result on +(1)+(3) and our detailed explanation have addressed your concerns. Please let us know if you have any remaining questions or feedback.
In more recent responses to other reviewers (e.g., Reviewer Zb8G), we have included updated results on a new 2D-to-3D uplifting baseline and a computational complexity comparison between decoupled 2D and pseudo-3D deformable attention. We believe these additions may further clarify the advantages of using pseudo-3D attention over its decoupled 2D counterpart. We kindly ask the reviewer to refer to our recent responses to Reviewer Zb8G for more details.
Below, we repeat the computational complexity comparison that was recently shared with other reviewers. Specifically, we conducted a more detailed analysis of decoupled 2D and pseudo-3D deformable attention with respect to the following configuration parameters: number of queries Q, sampling offsets N, radar views V, and query dimension C. This deeper investigation revealed that:
- Pseudo-3D attention eliminates redundant per-view offset estimation, offering better scalability than decoupled 2D attention as the number of radar views increases.
- Notably, pseudo-3D attention achieves a computational saving of 17.5% with V=5 radar views and 21.3% with V=10 radar views, compared to decoupled 2D attention.
The detailed analysis is shown below:
- For decoupled 2D deformable attention, the overall complexity is O(8VQNC), where:
  - 3D reference point projected to V 2D radar views: O(6VQ)
  - Offset estimation: O(2VQNC); the factor 2 covers the 2D (x, y) offsets, computed for each of the V radar views
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC); the factor 5 is due to bilinear interpolation and the weighted sum
- For pseudo-3D deformable attention, the total complexity is O(6VQNC + 3QNC), where:
  - Offset estimation: O(3QNC); the factor 3 covers the 3D (x, y, z) offsets, shared across views
  - 3D offsets projected to V 2D radar views: O(6VQ)
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC)

Note that O(6VQ) is excluded from the final complexity expressions since 6VQ ≪ 5VQNC in practice.
Table 2: Complexity comparison of Pseudo-3D vs. Decoupled 2D deformable attention
| Queries (Q) | Views (V) | 2D Att | Pseudo-3D Att | Ratio (3D/2D) | Savings |
|---|---|---|---|---|---|
| 10 | 2 | O(160NC) | O(150NC) | 0.94 ↓ | 6.25% |
| 10 | 5 | O(400NC) | O(330NC) | 0.83 ↓ | 17.5% |
| 10 | 10 | O(800NC) | O(630NC) | 0.79 ↓ | 21.3% |
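The savings in Table 2 follow directly from the two expressions above; the short plain-Python check below (illustrative only) reproduces the ratio and savings columns, since the common QNC factor cancels:

```python
# Cost ratio of pseudo-3D vs. decoupled 2D deformable attention:
# the common QNC factor cancels, so the ratio depends only on the view count V.
def cost_ratio(V: int) -> float:
    decoupled_2d = 8 * V       # O(8VQNC), in units of QNC
    pseudo_3d = 6 * V + 3      # O(6VQNC + 3QNC), in units of QNC
    return pseudo_3d / decoupled_2d

for V in (2, 5, 10):
    r = cost_ratio(V)
    print(f"V={V:2d}  ratio={r:.4f}  savings={100 * (1 - r):.2f}%")
# Matches Table 2: savings of 6.25% (V=2), 17.5% (V=5), and ~21.3% (V=10).
```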
This paper addresses 3D human pose estimation using multi-view millimeter wave radar with 3D bounding boxes and 2D keypoints as weak supervision. The framework consists of pseudo-3D pose and joint deformable transformers as decoders to predict poses and joints. The authors then enforce the gravity and template loss to regularize predicted 3D keypoints. The authors have conducted experiments on the HIBER and MMVR dataset to show the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths:
- Using learnable queries to represent 3D keypoints and offsets in deformable transformer is interesting.
- Using weak supervision makes this work more practical in real-world applications.
- The experimental results are promising, highlighting the potential of using weak supervision.
Weaknesses:
- Lack of clarity in the proposed method
- Template loss and fine-grained 2D Loss regularize the predicted 3D/2D keypoints. Both losses are stricter than gravity loss which only encourages the predicted and bounding box centroids to be closer, but not the individual keypoints. The authors should elaborate on the intuition of using gravity loss.
- Lack of essential experiments and unconvincing results
- The authors should conduct the experiment of using 3D keypoints as supervision. The proposed method uses template loss, meaning that the model has a certain amount of access to 3D keypoints. Training RAPTR with 3D keypoints would give reviewers a clear understanding of how well the method can perform relative to full supervision.
- The authors did not provide the ablation study on with and without using gravity loss in Figure 7 and therefore, it raises concerns about the effectiveness of gravity loss.
- Table 3 shows that 3D-pseudo deformable attention is not useful, achieving approximately 1cm worse in MPJPE. Because the pseudo-3D deformable attention is the main difference between RAPTR and QRFPose [32], the inferior results make the proposed design less impactful. The authors should explain why pseudo-3D deformable attention mechanism becomes a contribution regarding performance and complexity.
- Minor weaknesses
- What is SNR in line 91?
- In Figure 3, the arrow from Pose Decoder to Joint Decoder is missing, as Joint Decoder takes the output from Pose Decoder as inputs (line 135 - 136).
Questions
Important questions and concerns for the authors to address. Responses will strongly influence the final rating
- Elaborate on the intuition of using gravity loss.
- Conduct the experiment of training the method with full supervision (3D keypoints) to show the upper bound performance for comparison.
- Conduct the ablation study of training with and without gravity loss.
- Explain why pseudo-3D deformable attention mechanism becomes a contribution regarding performance and complexity, specifically compared to QRFPose [32].
Optional questions and concerns for the authors to address. Good responses can increase the final rating
- Address the concerns listed in the minor weakness section.
Please find the relevant details of some questions in the weakness section.
Limitations
yes
Final Justification
The authors have addressed most of my concerns. Specifically, the complexity analysis demonstrates that the proposed method does offer advantages over the baselines. However, in my opinion, the improvement of using pseudo-3D deformable attention still falls within the range of statistical bias.
Overall, I consider this a borderline paper, slightly leaning toward acceptance.
Formatting Issues
no major formatting issues
Summary & Strengths
Response
We thank the reviewer for recognizing our core motivation—advancing 3D radar perception with weak supervision—which, to our knowledge, is the first to explore this direction. We appreciate your summary of our contributions, including pseudo-3D attention, structured loss design, and evaluations on two large indoor radar datasets. We will release full code and pretrained checkpoints to support future research.
Weaknesses & Questions
- Elaborate on the intuition of using gravity loss and conduct the ablation...with and without gravity loss.
Response
First, we highlight the motivation for using a hierarchical and structured loss function:
- (a) a 3D Template Loss (T3D) at the first-stage pose decoder, using coarse 3D bounding boxes and a human keypoint template to guide initial 3D pose estimates, and
- (b) a combination of 2D Keypoint Loss (K2D) and 3D Gravity Loss (G3D) at the second-stage joint decoder to refine the pose predictions.
Second, to clarify the role of gravity loss, we follow your suggestion and include an ablation (last row of Table 1). The second-stage joint decoder refines poses using the 2D keypoint loss (K2D), but K2D alone cannot constrain depth due to depth ambiguity (a limitation illustrated by the behavior of a pinhole camera or flashlight projection: a larger object that is farther away can appear identical in 2D to a smaller object that is closer). Adding the 3D gravity loss (G3D) regularizes depth and scale. Comparing the 3rd and last rows of Table 1 shows that removing G3D increases the error by more than 50% (e.g., 18.99 cm → 30.06 cm on MULTI), confirming the necessity of G3D in the second-stage decoder.
Third, we justify the template loss at the first-stage decoder using Table 1. Comparing rows 1–3 shows that adding 3D shape priors reduces MPJPE from 50.96 cm and 27.36 cm to 18.99 cm. The template not only localizes poses near the 3D box center but also encourages human-like keypoint structures, avoiding degenerate or implausible shapes that arise without this human shape prior.
Table 1: RAPTR with Weak Supervision (2D Keypoint Loss (K2D), 3D Gravity Loss (G3D), and 3D Template Loss (T3D))
Results with ± values are newly added and report mean ± std from at least 3 runs with different random seeds.
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 29.12 | 50.96 | (A) in Fig.7 |
| 2S | CV | 3D | K2D+G3D | K2D+G3D | 23.77 | 27.36 | (B) in Fig.7 |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | (C) in Fig. 7 (RAPTR) |
| 2S | CV | 3D | T3D | K2D | 40.92 | 30.06 | new result (suggested by Reviewer vQxv) |
Symbols are defined below Table 1 of the response to Reviewer bgMt.
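To make the roles of the three weak-supervision loss terms concrete, here is an illustrative sketch (not our exact formulation; matching, weighting, and batching are omitted, the L1 reduction is an assumption, and the `project` camera helper is a placeholder):

```python
import torch.nn.functional as F

def weak_supervision_losses(pred_3d, bbox_center, template, kp_2d_gt, project):
    """Illustrative sketch of the three weak-supervision terms.

    pred_3d:     (J, 3) predicted 3D keypoints for one subject
    bbox_center: (3,)   center of the ground-truth 3D bounding box
    template:    (J, 3) template keypoints defined around the origin
    kp_2d_gt:    (J, 2) 2D keypoint annotations on the image plane
    project:     placeholder projection mapping (J, 3) points to (J, 2) image coords
    """
    # T3D: pull first-stage predictions toward a template pose translated
    # to the 3D box center (a coarse, human-shaped prior).
    t3d = F.l1_loss(pred_3d, template + bbox_center)

    # G3D: constrain only the centroid (gravity center) of the prediction,
    # which regularizes depth/scale without dictating individual joints.
    g3d = F.l1_loss(pred_3d.mean(dim=0), bbox_center)

    # K2D: projected keypoints must match the 2D image-plane annotations.
    k2d = F.l1_loss(project(pred_3d), kp_2d_gt)
    return t3d, g3d, k2d
```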
Weaknesses
- Conduct the experiment of training the method with full supervision (3D keypoints) to show the upper bound performance for comparison.
Response
Suggestion has been taken. We have included the evaluation of RAPTR under full supervision (using 3D keypoint loss, K3D) in Table 2 below. We agree that this setting provides a meaningful upper bound on achievable performance, and allows us to quantify the performance gap between weak and full supervision.
Under full supervision, RAPTR achieves an MPJPE of 8.93 cm, whereas our best weakly supervised variant—leveraging 3D Template Loss (T3D) and 3D Gravity Loss (G3D)—achieves 18.99 cm. The weakly supervised model thus incurs a gap of roughly 10 cm in MPJPE relative to the fully supervised result. For completeness, we also include additional weakly supervised variants from Table 1 to highlight how each loss term contributes to final performance.
Table 2: RAPTR with Weak and Full Supervision
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 2S | CV | 3D | -- | K2D+G3D | 29.12 | RAPTR + Weak Supervision (K2D+G3D) |
| 2S | CV | 3D | T3D | K2D | 40.92 | RAPTR + Weak Supervision (K2D+T3D) |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | RAPTR + Weak Supervision (K2D+T3D+G3D) |
| 2S | CV | 3D | K3D | K3D | 8.93 | RAPTR+Full supervision (suggested by Reviewer vQxv) |
Symbol Definitions
- Loss Terms:
T3D = 3D Template loss (weak-supervision label)
G3D = 3D Gravity loss (weak-supervision label)
K2D = 2D Keypoint loss (weak-supervision label)
K3D = 3D Keypoint loss (full-supervision label)
Weaknesses
- ...The authors should explain why pseudo-3D deformable attention mechanism becomes a contribution regarding performance and complexity...
Response
Thanks for this insightful comment. Pseudo-3D attention works better with stronger, view-consistent radar features—such as those provided by the cross-view (CV) encoder. Thanks to your suggestion to evaluate +(1)+(3), we include the result in the third row of Table 3 below, alongside the original Table 3 from the main paper (we re-ran +(1) and +(1)+(2)+(3) with 3+ random seeds and report mean ± std for more rigorous evaluation). It is observed that using either the cross-view encoder (CV, 3rd row) or pseudo-3D attention (3D, 2nd row) alone results in slightly worse performance than the base two-stage setup (first row). However, when pseudo-3D attention is paired with the cross-view encoder (last row), it achieves the best performance. This suggests that the effectiveness of pseudo-3D attention depends heavily on the quality and consistency of the underlying radar features, which are better provided by the cross-view encoder than by the separate-view encoder. We provide the following justification by looking into the pseudo-3D deformable attention in Fig. 4 of the main paper.
- Pseudo-3D attention predicts 3D spatial offsets—unlike the decoupled 2D offsets in QRFPose—to align and extract spatially consistent features across radar views.
- These 3D offsets are projected into each radar view to determine where features should be sampled.
- The sampled features from different views are then aggregated using normalized attention weights to form the pseudo-3D deformable attention output.
- For this mechanism to function effectively, the underlying radar features need to be spatially coherent and consistent across views—a property better provided by the cross-view (CV) encoder.
- In contrast, when pseudo-3D attention is combined with the separate-view (SV) encoder, the lack of shared multi-view context may limit its ability to extract reliable 3D features, leading to a weaker aggregated multi-view radar feature at the pseudo-3D deformable attention output.
In terms of complexity, we provide such an analysis of RAPTR in Appendix A. As shown in Fig. 9 of the supplementary material, we do not expect large differences between the 2D and pseudo-3D deformable attention, as pseudo-3D saves one parameter per offset prediction.
Table 3: Ablation Study on RAPTR architectural modules (Encoder, Decoder, Attention)
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | SV | 3D | T3D | K2D+G3D | 22.83 | 20.78 | +(1)+(2) in Table 3 |
| 2S | CV | 2D | T3D | K2D+G3D | 24.73 | 20.57 | +(1)+(3) (suggested by Reviewer bgMt) |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR) |
Minor Weaknesses
- What is SNR in line 91?
Response
SNR denotes Signal-to-Noise Ratio. Raw radar waveforms are often affected by random thermal noise from the hardware and incoherent multi-path reflections. Integrating radar waveforms over the Doppler domain typically improves SNR.
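As a toy illustration of why integration helps (using simple slow-time averaging as a stand-in for Doppler-domain integration; the pulse count and noise level below are arbitrary assumptions, not values from the datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pulses, n_trials = 64, 10_000
signal = 1.0                                    # constant (coherent) target return
noise = rng.normal(size=(n_trials, n_pulses))   # unit-power thermal noise

# Averaging N pulses keeps the signal level fixed while the noise power drops
# by roughly 1/N, i.e. an SNR gain of about 10*log10(N) dB.
snr_single = signal**2 / noise[:, 0].var()
snr_integrated = signal**2 / (signal + noise).mean(axis=1).var()
print(f"measured gain: {10 * np.log10(snr_integrated / snr_single):.1f} dB, "
      f"theoretical: {10 * np.log10(n_pulses):.1f} dB")
```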
Minor Weaknesses
- In Figure 3, the arrow from Pose Decoder to Joint Decoder is missing, as Joint Decoder takes the output from Pose Decoder as inputs (line 135 - 136).
Response
We will revise the figure in the updated manuscript.
Summary
We appreciate the reviewer’s constructive feedback, which has helped us clarify our technical contributions and strengthen the manuscript. With the added ablation studies, full-supervision upper bound, and improved explanation of the loss design and pseudo-3D attention, we hope that our paper now clearly demonstrates its quality, clarity, significance, and originality.
I thank the authors for their detailed responses. They have clarified and addressed some of my concerns. My main concern remaining is the proposed pseudo-3D deformable attention, which still seems not effective, even with CV encoder.
- In the submission, the performance of +(1)+(2)+(3) is 0.5% and 0.4% worse than +(1) on WALK and MULTI, respectively. In the rebuttal, although the performance of +(1)+(2)+(3) is better than +(1), the improvement is still in the range of bias. Note that the performance reported in the submission carries more weight than the one presented in the rebuttal.
- Pseudo-3D deformable attention does not benefit from the complexity and has the same amount of computational cost as QRFPose.
- The proposed design does not convince me that it is a novel approach, as it combines two decoupled 2D offsets into one 3D offset, which still falls within the scope of QRFPose.
Overall, the intuition behind the proposed method seems reasonable. However, it demonstrates only minor novelty and offers limited improvements in performance and complexity.
Thank you for your reply and engagement. We have the following point-to-point response.
Comment 1
- In the submission, the performance of +(1)+(2)+(3) is 0.5% and 0.4% worse than +(1) on WALK and MULTI, respectively. In the rebuttal, although the performance of +(1)+(2)+(3) is better than +(1), the improvement is still in the range of bias. Note that the performance reported in the submission carries more weight than the one presented in the rebuttal.
Response
We share this concern and would like to clarify that the initial results in the submission were based on a single training run. The updated results for +(1) and +(1)+(2)+(3) are now averaged over 3 independent runs with different random seeds. Due to time constraints, we could not update +(1)+(2) and +(1)+(3) in Table 3, which would offer a more complete picture of individual component contributions.
We also assure the reviewer that we will publicly release all training and evaluation code to ensure full reproducibility.
Comment 2
- Pseudo-3D deformable attention does not benefit from the complexity and has the same amount of computational cost as QRFPose.
Response
This is correct: the pseudo-3D deformable attention maintains a computational cost similar to that of QRFPose. Our main design goal was to improve representational capacity and, in turn, 3D radar pose estimation performance under weak supervision.
Comment 3
- The proposed design does not convince me that it is a novel approach, as it combines two decoupled 2D offsets into one 3D offset, which still falls within the scope of QRFPose.
Response
We respectfully disagree with this comment.
The original QRFPose uses a single-stage (1S) decoder, a separate-view (SV) encoder, and decoupled 2D deformable attention. More importantly, it relies on strong 3D supervision via ground-truth 3D keypoints.
When we simply apply QRFPose under weak supervision, it fails completely. As shown below, using only the 2D keypoint loss (K2D) in the original QRFPose results in an MPJPE exceeding 3.7 meters (1st row), due to the lack of 3D depth constraints. Even with the addition of 3D gravity loss (G3D) at QRFPose's one-stage decoder, the error remains above 50 cm (2nd row).
By contrast, the +(1) configuration of Table 3 (3rd row here) incorporates two key modifications relative to QRFPose: 1) the use of the 2-stage decoder and 2) the utilization of our proposed structural loss (3D template loss (T3D) at the first-stage pose decoder and K2D+G3D at the second-stage joint decoder). This version already includes one core contribution of our paper: the structural loss functions.
Further improvements from (+1) to (+1)+(2)+(3)—which include the cross-view (CV) encoder and pseudo-3D deformable attention—yield additional performance gains, supporting the effectiveness of our architectural design.
In summary, as consistently stated in both the submission and rebuttal, our contributions are two-fold:
- Structural loss design under weak supervision, especially when paired with the two-stage decoder backbone, enabling effective 3D radar pose estimation without full 3D ground truth;
- Architectural modifications, including the cross-view (CV) radar encoder and pseudo-3D deformable attention.
Based on all ablation studies conducted so far, the impact of the structural loss (e.g., from QRFPose to +(1) of Table 3) appears to be more significant than that of the architectural modifications (e.g., from +(1) to +(1)+(2)+(3)). Nonetheless, our overall design of RAPTR (the combined structural loss + CV + pseudo-3D deformable attention) shows much better performance than both QRFPose and PETR (paired with K2D or K2D+G3D, without our structural loss).
Table 1: A more complete comparison between the original QRFPose (with K2D and with K2D+G3D), (+1) of Table 3, and RAPTR
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|
| 1S | SV | 2D | K2D | -- | 376.02 | QRFPose with 2D Keypoint Loss |
| 1S | SV | 2D | K2D+G3D | -- | 56.02 | QRFPose with 2D keypoint loss and 3D gravity loss; Base in Table 3 |
| 2S | SV | 2D | T3D | K2D+G3D | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | CV | 3D | T3D | K2D+G3D | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Thank you again for your response. The effectiveness of the proposed method, cross-view (CV) encoder and pseudo-3D deformable attention, after averaging over multiple runs remains questionable. Specifically, the performance of +(1)+(2)+(3) (22.32 ± 0.06) is still worse than +(1) (22.95 ± 1.21) on MPJPE (WALK), or is still in the range of bias, and the authors did not show it in Table 1 in the latest authors’ response.
- Why +(1)+(2)+(3) is not effective on WALK? What makes the proposed method ineffective?
Thank you for your further engaged discussion which we really appreciate.
Comment 1
- The effectiveness of the proposed method, cross-view (CV) encoder and pseudo-3D deformable attention, after averaging over multiple runs remains questionable. Specifically, the performance of +(1)+(2)+(3) (22.32 ± 0.06) is still worse than +(1) (22.95 ± 1.21) on MPJPE (WALK), or is still in the range of bias, and the authors did not show it in Table 1 in the latest authors’ response.
Response
We thank the reviewer for the observation. First, on the WALK split and averaged over 3 random seeds, +(1)+(2)+(3) performs slightly better (not worse) than +(1) with a lower mean MPJPE (22.32 vs 22.95) and smaller standard deviation (0.06 vs 1.21), indicating more consistent results across random seeds. That said, we agree the difference may fall within the range of statistical bias.
In the updated Table 1 below, we also include averaged performance over the WALK split for a more complete comparison.
Table 1: A more complete comparison between the original QRFPose, QRFPose variants, (+1) in Table 3, and RAPTR
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 1S | SV | 2D | K2D | -- | 366.31 | 376.02 | QRFPose with 2D Keypoint Loss |
| 1S | SV | 2D | K2D+G3D | -- | 49.71 | 56.02 | QRFPose with 2D keypoint loss and 3D gravity loss; Base in Table 3 |
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Comment 2
- Why +(1)+(2)+(3) is not effective on WALK? What makes the proposed method ineffective?
Response
We thank the reviewer for the comment. If we understand your concern correctly, it is that +(1)+(2)+(3) shows only a slight improvement over +(1) on the WALK split (from 22.95 ± 1.21 to 22.32 ± 0.06), which is smaller than the more noticeable gain on the MULTI split (from 20.33 ± 0.73 to 18.99 ± 0.16).
With updated results averaged over 3 random seeds, +(1)+(2)+(3) achieves a lower mean MPJPE and reduced variance on both the WALK and MULTI splits. The smaller improvement on WALK may be attributed to its simpler nature—it features only a single subject and fewer occlusions than the MULTI split. Nevertheless, the overall trend demonstrates that the combination of architectural components improves performance, particularly in more challenging scenarios such as MULTI.
We would like to follow up with the reviewer regarding the effectiveness of the proposed RAPTR and its comparison to baselines.
Response
First, regarding performance, we have added an additional baseline (as suggested by Reviewer Zb8G) in Table 1 below (3rd row) to further highlight the improvement achieved by RAPTR. This new baseline first estimates 2D keypoints on the image plane and then lifts them to 3D space using a pretrained 2D-to-3D pose uplifting model [1], which was trained on pure vision-based datasets such as Human3.6M. A detailed explanation of this baseline implementation is provided in our recent responses to Reviewer Zb8G.
Using this uplifting pipeline, we updated Table 1 (see 3rd row). Under the selected configuration, the uplifting pipeline resulted in an MPJPE of 33.85 cm on MULTI and 40.40 cm on WALK, which are both higher than those of RAPTR.
[1] A simple yet effective baseline for 3D human pose estimation. ICCV, 2017.
Table 1: Full comparison with a new uplifting baseline
| Decoder | Encoder | Attn | Pose Loss | Joint Loss | MPJPE (WALK) | MPJPE (MULTI) | Notes |
|---|---|---|---|---|---|---|---|
| 1S | SV | 2D | K2D | -- | 366.31 | 376.02 | QRFPose with 2D Keypoint Loss |
| 1S | SV | 2D | K2D+G3D | -- | 49.71 | 56.02 | QRFPose with 2D keypoint loss and 3D gravity loss |
| 2S | CV | 3D | K2D | K2D | 40.40 | 33.85 | pretrained 2D-to-3D uplifting; suggested by Reviewer Zb8G |
| 2S | SV | 2D | T3D | K2D+G3D | 22.95 ± 1.21 | 20.33 ± 0.73 | +(1) in Table 3 |
| 2S | CV | 3D | T3D | K2D+G3D | 22.32 ± 0.06 | 18.99 ± 0.16 | +(1)+(2)+(3) in Table 3 (RAPTR (ours)) |
Second, regarding computational complexity, we appreciate the reviewer's suggestion, which motivated us to conduct a more detailed analysis of decoupled 2D and pseudo-3D deformable attention with respect to the configuration parameters: number of queries Q, sampling offsets N, radar views V, and query dimension C. This deeper investigation revealed that:
- Pseudo-3D attention eliminates redundant per-view offset estimation, offering better scalability as the number of radar views increases.
- Notably, pseudo-3D attention achieves a computational saving of 17.5% with V=5 radar views and 21.3% with V=10 radar views, compared to decoupled 2D attention.
The detailed analysis is shown below:
- For decoupled 2D deformable attention, the overall complexity is O(8VQNC), where:
  - 3D reference point projected to V 2D radar views: O(6VQ)
  - Offset estimation: O(2VQNC); the factor 2 covers the 2D (x, y) offsets, computed for each of the V radar views
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC); the factor 5 is due to bilinear interpolation and the weighted sum
- For pseudo-3D deformable attention, the total complexity is O(6VQNC + 3QNC), where:
  - Offset estimation: O(3QNC); the factor 3 covers the 3D (x, y, z) offsets, shared across views
  - 3D offsets projected to V 2D radar views: O(6VQ)
  - Attention weights: O(VQNC)
  - Feature aggregation: O(5VQNC)

Note that O(6VQ) is excluded from the final complexity expressions since 6VQ ≪ 5VQNC in practice.
Table 2: Complexity comparison of Pseudo-3D vs. Decoupled 2D deformable attention
| Queries (Q) | Views (V) | 2D Att | Pseudo-3D Att | Ratio (3D/2D) | Savings |
|---|---|---|---|---|---|
| 10 | 2 | O(160NC) | O(150NC) | 0.94 ↓ | 6.25% |
| 10 | 5 | O(400NC) | O(330NC) | 0.83 ↓ | 17.5% |
| 10 | 10 | O(800NC) | O(630NC) | 0.79 ↓ | 21.3% |
Summary
We sincerely hope that the reviewer will reconsider the evaluation of our work’s quality, clarity, significance, and originality based on the results presented in the main paper, along with the additional experimental evaluation in the rebuttal. We also hope that we have sufficiently addressed the main concerns raised, specifically:
- Clarified the intuition behind gravity loss and included additional ablation study to demonstrate its effectiveness.
- Added experimental results using full 3D keypoint supervision to establish an upper-bound performance benchmark.
- Provided further experimental evidence of our contributions (structural loss function and pseudo-3D deformable attention), showing improvements in both performance and computational complexity over baselines.
I sincerely thank the authors for their response. The authors have clarified 1) the intuition behind gravity loss, 2) conducted experiments using 3D keypoints as supervision, 3) conducted experiments with and without using gravity loss, and 4) provided detailed experimental results and analysis demonstrating the improvements of pseudo-3D deformable attention in both performance and computational complexity.
Most of my concerns have been addressed. Specifically, the complexity analysis demonstrates that the proposed method does offer advantages over the baselines. However, in my opinion, the improvement of using pseudo-3D deformable attention may still fall within the range of statistical bias, which is agreed by the authors as well.
Overall, I consider this a borderline paper, but I am leaning toward raising my final rating.
We thank the reviewer for actively engaging in the rebuttal process and for the suggestions that have helped improve the clarity and quality of our paper. We will incorporate these changes in the revised version accordingly.
This paper introduces RAPTR, a novel method for radar-based 3D human pose estimation. The core innovation of the proposed method is that it uses only low-cost, easily accessible 3D bounding boxes (BBoxes) and 2D keypoints for supervision, instead of relying on expensive and difficult-to-collect 3D keypoint labels. The whole framework is built on PETR to output an initial 3D pose and a refined 3D pose, supervised by the proposed 3D template loss and by the 2D keypoint and 3D gravity losses, respectively. Experiments on two public radar datasets demonstrate the effectiveness of the proposed method.
Strengths include:
- Well-written paper
- Significant reduction in data annotation costs
- Interesting ideas: using learnable queries to represent 3D keypoints and offsets in the deformable transformer
- Solid engineering and architectural design
- Extensive experiments and ablation studies
- Promising results: robust to multi-person and cluttered scenes
- Efficient use of weak supervision
Weaknesses include:
- Somewhat limited methodological novelty
- Clarity could be improved
- Experimental validation could be improved
- Occlusion and human-to-human interference remain unsolved
- Strong assumption on template priors
The authors provide solid rebuttals. There was an engaging discussion among authors and reviewers after the rebuttal, after which all four reviewers were positive - grades 4,4,4,4. I recommend accepting the paper.