Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery
Abstract
Reviews and Discussion
This paper proposes the first large-scale, real-world, high-quality dataset for 3D reconstruction of hand–instrument interactions in ophthalmic surgery settings. Based on this dataset, it further presents two benchmarks for bimanual hand pose estimation and two-hand–two-tool pose estimation. The two baselines achieve comparable or better performance than SOTA methods in general settings.
Strengths and Weaknesses
Strengths
- This is the first work that considers hand-object pose estimation in a surgery setting, which may benefit future researchers from both the general and medical areas.
- The multi-stage automatic annotation pipeline is reasonable and cost-effective.
Weaknesses
- No detailed application is discussed in this paper. What could the model be used for after training on this dataset, beyond reconstructing hand and object motion?
- When it comes to hand motion estimation, the temporal consistency should be considered. This could be measured using the Acceleration and Jerk error, like in [1].
- For hand performance, is there any comparison with the most advanced model, HaMeR [2] (both without domain-specific training and trained exclusively on this dataset)?
- Lastly, and most importantly, considering the medical background of this paper, could any metrics that measure surgical quality be reported and analyzed in the benchmarks, such as time and motion, flow of operation, and overall performance, as in [3]?
[1] Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera, CVPR 2025. [2] Reconstructing hands in 3D with transformers, CVPR 2024. [3] JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling, MICCAI workshop 2014.
Questions
Please refer to the weaknesses in the Strengths and Weaknesses part. For a medical-related paper, it is important to consider more medical-related evaluations and applications.
Limitations
yes.
Justification for Final Rating
I was on a trip over the past few days and missed the last chance to discuss with the authors, yet I trust they have addressed the concerns I raised. I also hope to see the dataset released to benefit the research community, so there are no strong reasons to reject this paper.
Formatting Issues
No.
We appreciate that the reviewer acknowledges the significance of our work. We would also like to thank the reviewer for the suggestions on clinical applications, evaluation, and experiments; they have been crucial for improving our work.
W1 & Q1 - Clinical applications:
We sincerely thank the reviewer for their insightful comments regarding the practical applications of our work. As we emphasized in the introduction, accurate three-dimensional reconstruction provides substantially richer information compared to traditional two-dimensional imaging methods. Capturing the precise spatial relationships, depth information, and temporal dynamics of hand-instrument interactions forms the foundation for numerous downstream clinical applications. Our 3D reconstruction framework captures critical information inherently missing from 2D methods: precise positioning of instrument tips in 3D space, accurate hand joint poses under occlusion, authentic spatial relationships in bimanual interactions, and motion trajectories with depth perception. These 3D features are essential prerequisites for the following applications:
(1) Quantitative surgical skill assessment: Unlike 2D video analysis, our 3D reconstruction enables measurement of precise motion metrics, including 3D path length, economy of motion, millimeter-level tremor amplitude, and authentic bimanual coordination patterns (a minimal sketch of such metrics is given after this list). Additionally, we have included synchronized ophthalmic microscope videos along with surgical/practice experience duration data from 40 surgeons, facilitating more comprehensive and systematic analysis.
(2) Intelligent surgical training systems: Detailed 3D motion trajectories support the development of AI-driven training systems with real-time feedback on instrument angles, force vectors, and spatial precision—elements that cannot be accurately assessed from 2D projections alone.
(3) Surgical workflow optimization: Our phase-annotated 3D data can be used to analyze motion efficiency, workspace utilization, and ergonomic patterns, thereby providing evidence for improving surgical techniques and reducing operator fatigue.
(4) Next-generation robot-assisted surgery: Millimeter-precision 3D hand-instrument coordination data provides crucial training data for developing semi-autonomous surgical robots capable of replicating expert-level minimally invasive surgical movements.
(5) Risk assessment and safety monitoring: 3D reconstruction can detect potentially hazardous movements and proximity to critical structures, and can predict complication risks based on deviations from optimal trajectories.
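To make the kind of 3D skill metrics mentioned in (1) concrete, below is a minimal NumPy sketch of how 3D path length, economy of motion, and band-limited tremor amplitude could be computed from a reconstructed instrument-tip trajectory. The function names, the assumed frame rate, and the 6-14 Hz tremor band are illustrative assumptions and not part of the paper.

```python
import numpy as np

def path_length(tip_xyz):
    """Total 3D path length (mm) of an instrument-tip trajectory of shape (T, 3)."""
    return float(np.linalg.norm(np.diff(tip_xyz, axis=0), axis=1).sum())

def economy_of_motion(tip_xyz):
    """Straight-line displacement divided by travelled path (1.0 = perfectly direct)."""
    direct = float(np.linalg.norm(tip_xyz[-1] - tip_xyz[0]))
    total = path_length(tip_xyz)
    return direct / total if total > 0 else 1.0

def tremor_amplitude(tip_xyz, fps=30.0, low_hz=6.0, high_hz=14.0):
    """RMS amplitude (mm) of the trajectory restricted to an assumed tremor band."""
    detrended = tip_xyz - tip_xyz.mean(axis=0)          # remove the mean position
    freqs = np.fft.rfftfreq(len(tip_xyz), d=1.0 / fps)
    spec = np.fft.rfft(detrended, axis=0)
    spec[(freqs < low_hz) | (freqs > high_hz)] = 0.0    # keep only the tremor band
    band = np.fft.irfft(spec, n=len(tip_xyz), axis=0)
    return float(np.sqrt((band ** 2).sum(axis=1).mean()))
```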
We recognize these application domains need comprehensive discussion to fully demonstrate our work's clinical impact. In the final version, we will expand the discussion section to include detailed use cases, implementation pathways, and clinical validation strategies, highlighting the transformative potential of 3D surgical scene understanding in advancing ophthalmic surgical practice.
W2 & W3 – Temporal consistency and comparison with HaMeR:
We thank the reviewer for these valuable suggestions; we fully agree that assessing temporal motion quality is important. We answer this from two aspects: 1) the data annotation pipeline and 2) the baseline method, by comparing both to HaMeR [1].
(1) Annotation pipeline vs HaMeR. Since we are constructing the ground truth itself, it is inherently challenging to perform standard quantitative evaluation such as MPJPE or acceleration error without a reference. Nevertheless, we have implemented multiple physical constraints and temporal smoothness regularisation terms to guarantee annotation reliability. We now provide evaluations using the Acceleration (not the error term) and Jerk metrics following Dyn-HaMR [2], as well as Penetration (between hands and instruments), each designed to quantify motion stability and interaction plausibility (also in response to MyX8 and feMU); a sketch of how such metrics can be computed follows the table below. Although reference-based error metrics cannot be computed in this setting, we still observe significant improvements over HaMeR: our annotations exhibit much lower Acceleration, Jerk, and Penetration, indicating high temporal consistency and the effectiveness of the annotation method. Additionally, the strong performance of downstream models trained on our annotations, together with the superior values across the temporal metrics, offers indirect yet compelling evidence of annotation quality.
| Method | Acc ↓ | Pen ↓ | Jerk ↓ |
|---|---|---|---|
| HaMeR + Instrument (Init.) | 8.72 (± 0.21) | 6.79 (± 0.57) | 52.76 (± 8.23) |
| OphNet-3D (Our dataset) | 1.45 (± 0.11) | 1.39 (± 0.09) | 0.76 (± 0.05) |
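For reference, the Acceleration and Jerk statistics above can be obtained with simple finite differences over the per-frame hand joints. The sketch below (NumPy, assumed 30 fps, units following the input) illustrates one such computation and is not necessarily the exact normalisation used in our tables.

```python
import numpy as np

def motion_smoothness(joints, fps=30.0):
    """Mean acceleration and jerk magnitudes of a joint trajectory.

    joints: array of shape (T, J, 3); fps is an assumed frame rate.
    Returns the mean L2 norms of the 2nd and 3rd finite differences.
    """
    dt = 1.0 / fps
    vel = np.diff(joints, axis=0) / dt      # (T-1, J, 3)
    acc = np.diff(vel, axis=0) / dt         # (T-2, J, 3)
    jerk = np.diff(acc, axis=0) / dt        # (T-3, J, 3)
    mean_acc = float(np.linalg.norm(acc, axis=-1).mean())
    mean_jerk = float(np.linalg.norm(jerk, axis=-1).mean())
    return mean_acc, mean_jerk
```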
(2) OH-Net vs HaMeR. We appreciate the reviewer highlighting HaMeR. As noted above, HaMeR is used as initialization in our data annotation pipeline and is also evaluated separately, where we demonstrate significantly improved temporal consistency, stability, and plausibility.
However, it is important to note that HaMeR is a large-scale foundation model trained on 2.7M samples with a huge ViT backbone, and is designed primarily for per-frame local hand pose estimation. In contrast, our baseline methods are specifically developed to jointly reconstruct hands and instruments and to support interaction-aware reasoning. Because HaMeR is designed exclusively for hand pose estimation, without accounting for object interactions, we instead benchmark our approach against state-of-the-art hand-object (instrument) pose estimation baselines, as reported in Tables 2 and 3.
Moreover, both HaMeR and our baseline models are per-frame methods and do not explicitly model temporal consistency. Nevertheless, to provide additional insight, we evaluated HaMeR using its official checkpoint on our dataset under the same protocol: we randomly selected 100 clips from the test split and repeated the evaluation 5 times (500 sequences in total) to ensure consistency and robustness (a sketch of this protocol follows the table below). Due to the scale of HaMeR and the substantial computational and time cost of retraining a foundation model, domain-specific retraining was not feasible within the limited time and scope of the rebuttal. We thank the reviewer for the advice and will carefully add this discussion to the final version.
As shown in the table below, our methods consistently achieve lower Acceleration and Jerk, indicating superior temporal stability. This further supports the value of our annotation pipeline and task-specific modeling.
| Method | Acc Err ↓ | Jerk ↓ |
|---|---|---|
| HaMeR (official checkpoint) | 5.24 (± 0.16) | 52.76 (± 7.23) |
| H-Net (Ours) | 3.76 (± 0.07) | 35.34 (± 4.55) |
| OH-Net (Ours) | 4.13 (± 0.09) | 39.45 (± 5.23) |
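The repeated-sampling protocol above (100 random test clips, 5 repeats, mean ± std) can be summarised by the following sketch; `clip_ids` and `metric_fn` are hypothetical placeholders for the test-split clip identifiers and a per-clip metric function.

```python
import numpy as np

def repeated_eval(clip_ids, metric_fn, n_clips=100, n_repeats=5, seed=0):
    """Report mean and std of a per-clip metric over repeated random subsets."""
    rng = np.random.default_rng(seed)
    run_means = []
    for _ in range(n_repeats):
        subset = rng.choice(clip_ids, size=n_clips, replace=False)
        run_means.append(np.mean([metric_fn(c) for c in subset]))
    return float(np.mean(run_means)), float(np.std(run_means))
```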
W4 - No surgical quality metrics:
In JIGSAWS, the authors conducted quantitative assessment of technical skill for three fundamental surgical tasks (Suturing, Knot-Tying, and Needle-Passing) along six dimensions (Respect for tissue, Suture/needle handling, Time and motion, Flow of operation, Quality of final product, and Overall performance). Each dimension is rated on a 1-5 Likert scale (1 = poorest performance, 3 = competent/adequate, 5 = best performance), and the total score is the sum of the dimension scores. The annotations were performed through video review by an experienced gynecological surgeon with extensive expertise in robotic and laparoscopic surgery. It is worth noting that JIGSAWS' assessment method still relies on subjective expert scoring; this direct, observation-based rating approach not only has limited interpretability but also inevitably suffers from rater bias. In contrast, our work provides a more fundamental and objective analytical framework. As emphasized in the introduction and in W1, we establish a new technical foundation for surgical skill assessment by constructing a large-scale dataset and developing advanced 3D reconstruction techniques. Although space constraints prevent us from incorporating additional annotations or experiments, we believe our core contributions in this work are pioneering.
More importantly, our framework represents a paradigm shift: from 2D video analysis relying on subjective scoring to objective data analysis based on precise 3D reconstruction. This transformation is particularly crucial in ultra-high precision ophthalmic surgery, where millimeter-level operational errors can lead to severe consequences. By integrating 3D hand joint positions, 3D instrument poses, depth information, and visualizations, we not only provide quantifiable, verifiable, and reproducible objective measurements, but more importantly, establish a technical platform that can eliminate human biases and support standardized assessment. This objectivity and precision, unattainable through traditional subjective scoring methods, enables: (1) surgeons to make more scientific decisions based on precise spatial data; (2) training systems to provide more accurate feedback; and (3) skill assessment to be based on objective metrics rather than subjective impressions. We will cite JIGSAWS and include a comparative analysis in the final manuscript. We believe this foundational work will drive a new generation of surgical education, training, and assessment systems, while offering essential tools and data resources for researchers in the field.
References:
[1] G. Pavlakos et al., “Reconstructing Hands in 3D with Transformers,” CVPR ’24.
[2] Z. Yu et al., “Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera,” CVPR ’25.
Dear Reviewer 3Ziv,
We sincerely appreciate the time and effort you have invested in reviewing our submission. Your insightful feedback has been invaluable to us, and we have diligently worked to address all the concerns you raised in our rebuttal. As the author-reviewer discussion phase is drawing to a close, we would like to confirm whether our responses have effectively addressed your concerns. We are more than happy to provide any further details or explanations. Thank you once again for your thoughtful review and consideration.
Best regards,
The Authors
Dear authors,
Thanks for your detailed explanation and supplementary experiments.
I understand the main contribution of this work is the dataset. However, I'm still doubtful about the comparison between H-Net/OH-Net and HaMeR. Comparing H-Net/OH-Net and HaMeR in the same training setting (i.e., both trained on OphNet-3D) might not be fair due to the large data prior learned by HaMeR; yet, it is expected that H-Net/OH-Net trained on OphNet-3D can beat HaMeR without fine-tuning (i.e., the official checkpoint) in terms of hand reconstruction accuracy (MPJPE, MPVPE, etc.). If not, that would be strange, as it would show that domain-specific training cannot surpass large-scale general-domain training.
In addition, after reading other reviewers' questions and your rebuttals, a small question comes to my mind: regarding the interpenetration loss, have you considered the penetration between the left and right hands, in addition to hand-instrument penetration? If not, how do you ensure that the two hands do not penetrate each other in the annotation?
Thank you for considering my questions.
Best, Reviewer
Dear Reviewer 3Ziv,
Thank you very much for your professional, detailed, and valuable reviews. As suggested, we have randomly selected 100 clips and repeated the evaluation 5 times to ensure stability and fairness.
We would greatly appreciate it if you would consider increasing the score in light of these clarifications and additional insights. We thank you again for your valuable feedback.
Best regards, The Authors
Additional experiments on HaMeR:
We thank the reviewer for their thorough evaluation and for raising this important point. To fully address this concern, we evaluated HaMeR using its official checkpoint on our test set, following the protocol described above: we randomly selected 100 clips and repeated the evaluation 5 times to ensure stability and fairness.
| Method | MPJPE (mm) ↓ | MPVPE (mm) ↓ |
|---|---|---|
| HaMeR (official checkpoint) | 31.68 (± 0.16) | 33.56 (± 0.42) |
| H-Net (Ours) | 17.44 (± 0.13) | 18.37 (± 0.21) |
| OH-Net (Ours) | 17.36 (± 0.09) | 18.43 (± 0.18) |
As shown in the table, our H-Net and OH-Net, trained with domain-specific supervision, outperform HaMeR in positional accuracy, achieving lower MPJPE and MPVPE under our task setting. This result is expected given that HaMeR is a general-purpose, large-scale hand model not fine-tuned for our surgical domain.
Importantly, HaMeR is known to suffer from failure cases such as hallucinated hand configurations (e.g., wrong-handedness or missing detections) in challenging scenarios involving gloves or extreme lighting, and such mismatches lead to higher MPJPE. These issues are exacerbated by the absence of post-processing steps such as non-maximum suppression in its default pipeline, resulting in high MPJPE and MPVPE despite reasonable single-hand performance. In contrast, our method benefits from domain-specific annotations and a bimanual-restricted initialization, allowing for more robust disambiguation of the left and right hands, which is particularly critical in our setting.
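For clarity, MPJPE and MPVPE in the table are standard position errors; a minimal sketch under the common root (wrist) alignment convention is given below. The exact alignment protocol used in the paper may differ.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints, root_idx=0):
    """Mean Per-Joint Position Error (mm) after root (wrist) alignment.

    pred_joints, gt_joints: arrays of shape (T, J, 3) in millimetres.
    """
    pred = pred_joints - pred_joints[:, root_idx:root_idx + 1]
    gt = gt_joints - gt_joints[:, root_idx:root_idx + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def mpvpe(pred_verts, gt_verts, pred_root, gt_root):
    """Mean Per-Vertex Position Error (mm) on root-aligned MANO meshes.

    pred_verts, gt_verts: (T, V, 3); pred_root, gt_root: (T, 3) wrist joints.
    """
    pred = pred_verts - pred_root[:, None]
    gt = gt_verts - gt_root[:, None]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```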
On interpenetration between hands:
We thank the reviewer for raising this point. To verify the validity of our annotation in terms of hand–hand interpenetration, we inspected the entire dataset as well as consulted our experienced ophthalmic surgeon. According to expert feedback, interpenetration between the left and right hands during ophthalmic surgery is extremely rare. In standard practice, the surgeon's hands typically operate independently and often hold separate instruments with sufficient spacing. Even when both hands interact with the same instrument, their points of contact are well-separated, often aided by layers of irrigation or adhesive materials that physically separate the hands.
As such, our annotation pipeline does not explicitly impose loss terms penalizing hand–hand intersection, in order to speed up the optimization process. However, we did not observe any notable cases of interpenetration during our annotation and visual verification process. We appreciate the reviewer’s suggestion and will consider adding an explicit hand–hand collision term in future work for completeness, although it appears unnecessary in the context of ophthalmic surgical motion.
Thank you for your detailed review. We hope the experiments address your concern.
This paper introduces a new dataset for bi-manual hand and instrument motion in surgical settings. The authors design a multi-stage optimization pipeline to obtain ground-truth MANO hand parameters and 6-DoF object motion sequences. In addition, they propose H-Net and OH-Net for reconstructing hand and object motions. The paper provides extensive implementation and annotation details, making the dataset and methods reproducible and informative for future research.
Strengths and Weaknesses
Strengths:
- The modeling of bi-manual hand and instrument motion is novel and relevant, especially in the context of surgical applications.
- The dataset is valuable, containing multi-view RGB-D sequences with dense MANO hand meshes and 6-DoF instrument annotations.
- The authors provide extensive implementation details and thorough documentation of the data collection and annotation pipeline.
Weaknesses:
- While Line 8 claims the annotation pipeline is automatic, Line 103 reveals that it still relies on manual corrections for per-frame instance masks.
- While a few visual examples are provided, the overall quality of the ground-truth annotations is not systematically evaluated. This is not a clear weakness, as large-scale annotation quality remains a challenging open problem, but it would benefit from further discussion.
- No video results are shown for the reconstructed motions from the proposed OH-Net.
Questions
The paper claims the annotation pipeline is automatic (Line 8), but Line 103 mentions that manual corrections are required for per-frame instance masks. To what extent is the pipeline automated in practice? Could the authors clarify how much manual intervention is still involved?
While the authors provide video examples to demonstrate the quality of the generated annotations, these individual cases are insufficient to assess the overall quality of the dataset. Evaluating the accuracy and consistency of large-scale automatically annotated data is a challenging but essential problem, particularly when the main contribution lies in the proposed annotation pipeline.
Limitations
yes
Justification for Final Rating
I increased the final score given the authors' rebuttal.
Formatting Issues
good
We appreciate the reviewer’s recognition of the contributions of our work. We are glad that the novelty of modeling bi-manual hand and instrument motion in surgical contexts was well received, and that the dataset was found valuable and significant.
W1 & Q1 - About the annotation pipeline:
We thank the reviewer for raising this point and apologize for any confusion. Our annotation pipeline does not rely on manual corrections during the optimization process after the initialization. Overall, manual correction was applied only to the mask initialization in the first frame of each video to improve hand-instrument tracking. All subsequent masks were automatically inferred by SAM2. We have now clarified this more explicitly in the revision.
W2 & Q2 - More systematic evaluation on dataset:
We appreciate the reviewer’s comment regarding the systematic evaluation of annotation quality. Since we are constructing the ground truth itself rather than evaluating against an existing dataset, it is inherently challenging to perform standard quantitative evaluation. Nevertheless, we have implemented multiple checks to ensure annotation reliability, including thorough visual inspection across all sequences, cross-view 3D consistency checks, and sanity checks on temporal smoothness. Our method builds on state-of-the-art initializations and, in the subsequent stages, refines the hand–instrument motion in terms of temporal consistency, contact, and hand pose accuracy. To support this claim, we provide a comparison on the test split between our data and a SoTA initialization (HaMeR [1] plus the instrument initialization described in Section 3.2, transformed into HaMeR's coordinate system) in terms of motion quality, using the Acceleration, Jerk, and Penetration (between hand and instrument) metrics in the table below, following Dyn-HaMR [2], together with the ablation study provided in the response to [MyX8]; these demonstrate the effectiveness of the annotation method and the resulting quality. Note that although other standard error metrics (e.g. MPJPE, MPVPE, Acc Err) cannot be computed in this case because we are constructing the ground truth, we can still analyse the Acceleration and Jerk of the hands and the Penetration rate between hand and instrument.
| Method | Acc ↓ | Pen ↓ | Jerk ↓ |
|---|---|---|---|
| HaMeR + Instrument (Init.) | 8.72 (± 0.21) | 6.79 (± 0.57) | 52.76 (± 8.23) |
| OphNet-3D (Our dataset) | 1.45 (± 0.11) | 1.39 (± 0.09) | 0.76 (± 0.05) |
We observe significant improvements over the SoTA initialization (e.g. HaMeR): the much better Acceleration and Jerk demonstrate the stability of our annotation pipeline, and the lower Penetration shows the effectiveness of our physical constraints and regularisation terms. The results are obtained by randomly selecting 100 clips from the test split and repeating the evaluation 5 times.
W3 & Q2 - More qualitative result:
We fully agree with the reviewer that more visualizations can offer a better understanding of the dataset. In addition to the operational-frame visualizations in the appendix, our supplementary video presents, beginning at 33s, 8 operation clips along with their reconstruction visualizations (4 shorter and 4 longer clips). While we already included video samples of the predictions in our original submission, we regret that the NeurIPS rebuttal policy does not allow us to share additional materials at this stage. We acknowledge the value of presenting more examples and agree that doing so can further demonstrate the effectiveness and limitations of our method. We will provide a comprehensive and diverse set of results in the final version, and we invite the reviewer to follow our final release, where we will also share extended visualizations on the project page.
References:
[1] G. Pavlakos et al., “Reconstructing Hands in 3D with Transformers,” CVPR ’24.
[2] Z. Yu et al., “Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera,” CVPR ’25.
Thank you for the detailed and thoughtful response. The clarification on the annotation process and the additional evaluation are helpful and appreciated. It’s good to see more metrics supporting the dataset quality. The rating will be updated accordingly. It would be great if these evaluations, along with more visualizations, could be included in the final version.
Dear Reviewer feMU,
Thank you for your positive feedback and for indicating that you would update the rating. We greatly appreciate your constructive comments and support.
We noticed that the final rating might not have been submitted yet. As the author-reviewer discussion deadline is approaching, we wanted to kindly remind you in case it was overlooked.
Thank you again for your time and consideration.
Best regards, The Authors
Dear Reviewer feMU,
We sincerely appreciate the time and effort you have invested in reviewing our submission. Your insightful feedback has been invaluable to us, and we have diligently worked to address all the concerns you raised in our rebuttal. As the author-reviewer discussion phase is drawing to a close, we would like to confirm whether our responses have effectively addressed your concerns. We are more than happy to provide any further details or explanations. Thank you once again for your thoughtful review and consideration.
Best regards,
The Authors
Thank you for your detailed feedback and positive assessment. We are pleased that our response addressed your concerns.
We kindly request you to consider increasing the score, as we believe the clarifications and additional evidence strengthen our work.
We will include further evaluation results and enhanced visualizations to better highlight the dataset’s quality. We sincerely appreciate your constructive suggestions and the time you dedicated to reviewing our paper.
This work aims to facilitate dynamic 3D reconstruction of hands and surgical instruments in ophthalmic surgery. The authors (i) introduce an automatic annotation pipeline, (ii) construct a large-scale dataset using 8 RGB-D cameras, (iii) establish two benchmarks, and (iv) present two baseline methods. The main contribution of this paper is the OphNet-3D dataset, which contains 7.1M frames featuring 40 surgeons, 12 surgical phases, and 10 surgical instruments. Based on this dataset, the authors introduce two benchmarks and baseline methods. The proposed methods achieve state-of-the-art results for hand reconstruction and hand-instrument joint reconstruction.
Strengths and Weaknesses
< Strengths >
This paper has the following strengths:
(i) The OphNet-3D dataset has a significant contribution to the research field. It is carefully constructed with 8 synchronized RGB-D cameras using the proposed annotation pipeline. As shown in Table 1 and Figure 1, the dataset is substantially larger and more diverse than existing datasets. It includes detailed annotations such as MANO hand meshes and 6-DoF instrument poses.
- As described in NeurIPS Paper Checklist, the authors will open-source all data, code, and model weights. This contribution can facilitate kinematic analysis of critical operational details (e.g., surgeon hand movements, surgical tool usage).
(ii) The authors establish two benchmarks and introduce baseline methods for bimanual hand motion estimation and two-hand-instrument interactions. They also evaluate existing methods on the proposed OphNet-3D dataset, which supports the dataset’s utility as a benchmarking tool.
(iii) This paper is interesting and well-organized. The techniques used for dataset construction and baseline methods are generally appropriate.
< Weaknesses >
This paper has the following weaknesses:
(i) Despite the dedicated efforts for instrument motion annotation (Sec. 3.2), the proposed baseline struggles to capture fine-grained instrument articulation, as shown in the supplementary material. This limitation is understandable for now, given the challenges posed by the fine-grained motion scale in ophthalmic surgery.
(ii) This paper contains several incomplete or unclear sentences, which may hinder readability and clarity:
- L228-230: "Based on the data split, bimanual hand pose estimation and (2) hand-instrument interactions."
- L257-258: "Mean Per Joint Positional Error between each hand-instrument interaction pair (MRRTEh,o)"
- L367: "on all metrics ()"
- L840: "CoordConv []"
- Equation (4) defines "2d", but equations (9), (10), and (22) use "pj2d".
(iii) This paper lacks ablation studies to analyze the contribution of individual components in the proposed method, limiting assessment of design choices and their effectiveness.
Questions
(i) How effective are the losses L_inter and L_sdf in addressing interpenetration? An ablation study on these losses would be valuable, given that hand-instrument interaction is the central focus of this work.
(ii) The proposed OH-Net is trained with monocular training images from all 8 camera views. Is the same training protocol applied to the compared methods? Clarification is important for fair benchmarking.
(iii) Can the authors provide a comparison of inference speeds between their baselines and other methods? Since the baselines rely on multiple large pre-trained models, they may be much slower than existing approaches.
Limitations
yes
Justification for Final Rating
Most concerns were resolved during the discussion period. I believe this paper makes a valuable contribution and is well-qualified for acceptance. Therefore, I maintain my original rating of 5: Accept.
Formatting Issues
There is no major concern.
We thank the reviewer for their thoughtful and constructive feedback. We are pleased that they recognize the impact of our proposed dataset and its significant contribution to the research community and applications such as the analysis of critical operational workflows. We also appreciate the reviewer’s constructive assessment of our benchmark system design, which they found appropriate and supportive of the dataset’s utility as a benchmarking instrument. We are encouraged that the reviewer has a clear understanding of the goals and contributions of our work.
W1 - Challenges in fine-grained motion estimation:
We agree with the reviewer. As pioneering work on recovering 3D bimanual instrument interactions in the challenging context of ophthalmic surgery, capturing fine-grained instrument articulation remains difficult, particularly during delicate procedures (e.g., capsulorhexis) that involve subtle motions and instruments such as toothed forceps and capsulorhexis forceps under extreme lighting and occlusion; these phases may exhibit slightly reduced alignment accuracy compared with other phases. We will continue to polish the dataset and keep updating it with new features.
W2 - Minor corrections & typos:
We thank the reviewer for bringing these up. We have now carefully corrected them and incorporated them into our final revision of the paper.
W3 & Q1 - Ablation study:
We absolutely agree with the reviewer that, due to the novelty of the task and our benchmark, a detailed ablation study can further benefit the community. This is why we ablate the essential components of the baseline OH-Net in Table 3 of the main paper. Moreover, we here also provide an ablation study for the loss terms used in the annotation pipeline, including L_inter and L_sdf, by analysing the Acceleration and Jerk of the hands following Dyn-HaMR [1] and the Penetration rate between hand and instrument:
| Terms | | | | | |
|---|---|---|---|---|---|
| Acc ↓ | 1.89 (± 0.08) | 1.92 (± 0.09) | 1.29 (± 0.13) | 1.47 (± 0.10) | 1.45 (± 0.11) |
| Pen ↓ | 4.46 (± 0.15) | 4.12 (± 0.11) | 3.45 (± 0.08) | 2.98 (± 0.13) | 1.39 (± 0.09) |
| Jerk ↓ | 0.98 (± 0.07) | 1.02 (± 0.07) | 0.72 (± 0.05) | 0.78 (± 0.06) | 0.76 (± 0.05) |
The table shows the impact of the different terms in our joint loss function. Specifically, we found that the temporal smoothness term speeds up convergence and increases stability, while L_inter and L_sdf help achieve better interaction modelling with less inter-penetration. Other regularization terms are included in all experiments to avoid unrealistic hand poses. The results are obtained by randomly selecting 100 clips from the test split and repeating the evaluation 5 times.
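As a rough illustration of how an SDF-based interpenetration penalty of this kind can be formulated (a generic sketch, not the paper's exact L_sdf; the `instrument_sdf` callable is a hypothetical query that returns signed distances that are negative inside the instrument):

```python
import torch

def sdf_penetration_loss(hand_verts, instrument_sdf):
    """Penalise hand vertices that fall inside the instrument surface.

    hand_verts: (B, V, 3) MANO vertices expressed in the instrument's frame.
    instrument_sdf: callable (B, V, 3) -> (B, V) signed distances,
                    negative inside the mesh (e.g. queried from a precomputed grid).
    """
    d = instrument_sdf(hand_verts)   # signed distance of every hand vertex
    penetration = torch.relu(-d)     # > 0 only where a vertex is inside the instrument
    return penetration.mean()
```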
Q2 - Model training clarification:
First of all, we would like to kindly point out that only the proposed annotation pipeline relies on multiple large pre-trained models, while the baseline methods (e.g. OH-Net, H-Net) do not require any pre-training. For the annotation pipeline, the methods used for initialization use the officially released checkpoints without any additional pre-training. For the comparison with the baseline OH-Net, all methods in the experiments were trained and evaluated on our dataset with the same data split and protocol.
Q3 - Runtime analysis:
For our baseline method OH-Net, predicting the poses from a monocular RGB image takes about 0.1-0.3 seconds on a single NVIDIA A100 GPU, whereas the SoTA method HaMeR takes around 1.4 seconds. As mentioned above, only our annotation pipeline needs a reasonable initialization from existing pose estimation models, so its average runtime varies depending on the initialization method. For the runtime of the annotation pipeline, please refer to Section D.2 of the Appendix.
[1] Z. Yu et al., “Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera,” CVPR ’25.
Dear Reviewer MyX8,
We sincerely appreciate the time and effort you have invested in reviewing our submission. Your insightful feedback has been invaluable to us, and we have diligently worked to address all the concerns you raised in our rebuttal. As the author-reviewer discussion phase is drawing to a close, we would like to confirm whether our responses have effectively addressed your concerns. We are more than happy to provide any further details or explanations. Thank you once again for your thoughtful review and consideration.
Best regards,
The Authors
Dear Authors,
Thank you for the thorough rebuttal. It has effectively addressed all my comments, and I have no further major concerns.
- With regard to the proposed OH-Net, I initially assumed it relied on pre-trained models, likely due to the absence of architectural details (e.g., Segmentation Head). The paper only mentions feature dimensions. Providing more detailed architectural information would be helpful.
It seems that most of the other reviewers' concerns have also been resolved. In my opinion, this paper makes a meaningful contribution and deserves to be accepted.
Best,
Reviewer MyX8
Thank you for your valuable feedback and strong support for our work. We will update the manuscript to include detailed architectural specifications for all network components to improve clarity and reproducibility. If you have no further concerns, we would be deeply grateful if you would consider updating your rating. We are more than happy to provide any further details or explanations. Thank you again for your thoughtful review and consideration.
(a) The paper proposes a large-scale RGB-D dataset for dynamic 3D reconstruction in ophthalmic microsurgery (41 sequences from 40 surgeons; ~7.1M frames) with fine-grained labels (12 surgical phases, 10 instrument categories, dense MANO meshes, 6-DoF instrument poses). It further proposes a multi-stage automatic annotation pipeline and establishes two benchmarks: 1) bimanual hand pose estimation and 2) hand–instrument interaction, together with baseline models H-Net (dual-hand) and OH-Net (two-hands–two-instruments) that incorporate weak-perspective spatial reasoning and collision-aware, center-based representations. Reported gains over prior baselines include more than 2 mm in MPJPE for hands and up to 23% in ADD-S for instruments.
(b) + The dataset and pipeline fill a clear gap in surgical vision. + The benchmarks and baselines are thoughtfully constructed, and the rebuttal resolved the core technical and evaluation concerns raised by the reviewers.
(c) - Need clearer architectural specifications (e.g., segmentation head). - Need more comprehensive ablation and clarification on the temporal stability.
(d)(e) The paper received positive ratings from all reviewers after the rebuttal {5, 5, 4}. The three reviews focused on dataset value, fairness/metrics, automation claims, and practicality. The rebuttal was thorough and empirical, and resolved most of the reviewers' concerns. The AC agrees with the acceptance decision and suggests that the authors include their clarifications in the final camera-ready version.