Interactive Anomaly Detection for Articulated Objects via Motion Anticipation
An interactive anomaly detection method for articulated objects
Summary
Reviews and Discussion
In this work, the authors propose to tackle interactive anomaly detection (IAD), that is, the detection of anomalies that occur only after interacting with an object (e.g., opening and closing a drawer). The authors propose a new benchmark and an approach to interact with objects and detect this kind of anomaly. Specifically, the proposed approach generates a set of actions to interact with the object and predicts its motion. If the predicted motion differs from the actual motion beyond a certain degree, the model flags an anomaly. The authors define several baselines for comparison.
Strengths and Weaknesses
Strengths
-
The proposed formulation of IAD is original and well-motivated. Most existing anomaly detection benchmarks focus on passive visual cues. In contrast, this work emphasizes that many real-world anomalies only present themselves through physical interaction, a critical insight for robotics and quality assurance in manufacturing.
-
The proposed method introduces a joint modeling of interaction and motion anticipation. While each component (e.g., motion prediction, action sampling) builds on prior work, their combination into an end-to-end interactive anomaly detection pipeline is novel. The idea of detecting anomalies through discrepancies between expected and observed motion is both intuitive and effective.
-
The paper is clear, logically organized, and provides strong qualitative and quantitative results.
-
The PartNet-IAD benchmark fills a gap in the community by offering a dataset for interactive anomaly detection. The design of the anomalies and interaction protocols is realistic, and the evaluation across both seen and unseen categories demonstrates the model’s generalizability.
-
The proposed approach achieves substantial performance gains (e.g., AUROC improvements of ~6 points over the strongest baseline), suggesting its practical utility.
Weaknesses
-
While the core idea is strong, the experimental section could benefit from more thorough ablation studies. For instance, the paper briefly mentions the use of confidence-based weighting (Eq. 6) and a multi-step filtering strategy for action selection. However, the impact of these design choices is not isolated and quantified. A study showing the effect of varying λ values in the loss would strengthen the empirical analysis.
-
While the baselines are well-designed, most rely on Where2Act, and it is unclear how much performance gain comes from better interaction versus better anomaly scoring. A more comprehensive comparative analysis with a broader range of interaction strategies would be beneficial.
Questions
-
Please provide more detail on how sensitive the method is to the loss weight hyperparameters (e.g., the different λ values in Eq. 6). A sweep or sensitivity analysis would be valuable.
-
How robust is the approach to noisy or partial depth observations, especially since it relies on point clouds?
-
How critical is the use of the long-horizon motion estimation compared to atomic action steps?
Limitations
Limitations have been discussed.
Final Justification
The authors provided thorough and convincing responses to all concerns. The ablations reported in the response clarify the impact of loss weighting and noise filtering. They justify the joint modeling of interaction and motion, discuss robustness to noisy inputs and the necessity of long-horizon motion estimation. The rebuttal strengthens the paper’s contributions, and the accept rating remains appropriate.
Formatting Issues
N/A
Thank you for your positive feedback. We address your questions and concerns below.
1. Ablation for confidence-based loss and noise filtering
Table 2 (main paper) quantitatively and Figure A2 (appendix) qualitatively show the impact of multi-step noise filtering on the motion estimation performance. As further suggested, we also conduct ablation studies on the test categories to assess the individual contributions of the confidence-based loss and noise filtering strategy to our IAD task, using the AUROC metric.
| Method | AUROC |
|---|---|
| w/o filtering | 81.4 |
| w/ conf-based noise filtering | 82.7 |
| w/ conf-based + DBSCAN-based noise filtering (our method) | 83.9 |
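For concreteness, the two-stage filtering idea can be sketched as follows. This is an illustrative sketch only, not our actual implementation: the function name, the confidence threshold, and the DBSCAN parameters are all hypothetical, and we cluster motion directions merely as one plausible way to isolate the dominant, consistent motion.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_noisy_interactions(flows, confidences, conf_thresh=0.5,
                              eps=0.3, min_samples=4):
    """Hypothetical two-stage filter: (1) drop low-confidence motion
    predictions, (2) keep only the largest DBSCAN cluster of the
    remaining motion directions, treating the rest as noise."""
    keep = confidences >= conf_thresh            # stage 1: confidence gate
    idx = np.flatnonzero(keep)
    if idx.size == 0:
        return idx
    dirs = flows[idx]
    dirs = dirs / (np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-8)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(dirs).labels_
    valid = labels[labels >= 0]
    if valid.size == 0:
        return idx                               # no cluster found: keep gated set
    largest = np.bincount(valid).argmax()        # stage 2: dominant motion cluster
    return idx[labels == largest]
```

In this sketch, predictions that survive the confidence gate but disagree with the dominant motion cluster are discarded, mirroring the "conf-based + DBSCAN-based" row of the table above.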
2. Ablation for loss weights
The first two weights in Eq. (6) are relatively insensitive and are therefore fixed to 1 across all experiments. We tune the weight of the confidence loss over the range [0.005, 1]. We find that a high value (e.g., close to 1) can cause training instability and slow convergence by dominating the primary motion loss, whereas a very low value leads to uniformly low confidence predictions across inputs. We set it to 0.05 for optimal performance. We will add this discussion in the final draft.
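As an illustration of this trade-off only: the snippet below uses a generic confidence-weighted loss, not the paper's Eq. (6), and the function name and the log-penalty form are assumptions. It shows why the confidence weight must be small but nonzero: the penalty term discourages the network from driving all confidences to zero.

```python
import numpy as np

def confidence_weighted_loss(motion_err, conf, lam=0.05):
    """Generic stand-in for a confidence-weighted loss: per-sample motion
    error is down-weighted by predicted confidence, while -lam*log(conf)
    penalizes uniformly low confidence. A large lam lets the penalty
    dominate the motion term; a tiny lam lets conf collapse toward zero."""
    conf = np.clip(conf, 1e-6, 1.0)
    return float(np.mean(conf * motion_err - lam * np.log(conf)))
```

With zero motion error, shrinking the confidence still incurs a positive penalty, which is the mechanism that keeps confidence predictions informative.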
3. It is unclear how much performance gain comes from better interaction versus better anomaly scoring
Our method consists of three integral steps: (1) learning an interaction (action), (2) motion modeling, and (3) anomaly scoring. Each step plays a critical role in the overall performance.
A good interaction agent can be helpful as it systematically explores the full articulation range of a part. Since functional anomalies are typically not visible without motion, the full articulation range of all parts needs to be thoroughly explored and tested to detect such anomalies.
While Where2Act [R1] offers a simple yet effective affordance-based interaction framework, it falls short in our specific IAD task due to its lack of motion-aware long-term planning. This limitation restricts its ability to fully explore articulation ranges, which is crucial for anomaly detection.
Learning an accurate motion estimator is equally important for our task, as it directly drives the final anomaly prediction by comparing against observed motion. Moreover, our motion estimator also facilitates long-term planning, enabling the selection of more informative future actions that lead to better exploration of the part articulations. In summary, our joint modeling of both action and motion is mutually reinforcing—better interaction policies lead to more informative motions, while accurate motion modeling supports more effective action planning.
The final step, our anomaly scoring as defined in Eq. (9), is straightforward and performs well provided the anticipated normal motion is accurate.
We also note that while more advanced interaction models may exist, Where2Act integrates seamlessly with our motion estimator, making it the most suitable baseline for our framework. Incorporating alternative models would require significant modifications to our motion model beyond the scope of this work.
4. Impact of noisy or partial depth observations
Our method inherently works well with partial point clouds, as we train the model using depth observations from a single viewpoint. Figure A1 (appendix) shows motion estimation results on real-world scanned articulated objects, demonstrating that it can handle noisy input to a certain extent. To improve robustness, artificial noise can be injected into the training depth data to model real-world artifacts.
5. How critical is the use of the long-horizon motion estimation compared to atomic action steps?
Without long-horizon motion estimation, our model struggles with multi-step interactions. The lack of motion history often results in back-and-forth movements without progressing to new states. As shown in Table 1, our method outperforms baseline E, which uses atomic action steps with Where2Act [R1] as the interaction agent.
[R1] Where2Act: From Pixels to Actions for Articulated 3D Objects, ICCV'21
Thank you to the authors for addressing all the concerns. I will keep the accept rating.
The authors propose a novel task, namely Interactive Anomaly Detection (IAD), which aims at classifying an articulated object as anomalous in case of functional problems. This represents a step forward with respect to image-based AD and can be useful in several industrial scenarios.
The authors also contribute a novel method for IAD, in which they compare possible movements of the articulated object's movable parts with the same actions as performed on a functioning replica of the inspected object.
This work presents a novel dataset (even though it originates from a pre-existing one) that is used for benchmarking the proposed method against five baselines, with the proposed method reporting the best results.
Strengths and Weaknesses
STRENGTHS The paper is well written and full of details. The proposed task seems valuable and novel. The proposed method, while being directly inherited from classical Anomaly Detection techniques, is well posed and articulated. The proposed baselines are not very challenging, but they are informative when compared with the proposed method. The integration of CV methods, robotics, and simulated environments is interesting.
WEAKNESSES When proposing a novel task, the motivations are very important. While the authors provide some motivations, I think that the importance of the proposed IAD task could be expressed more clearly. For example, I imagine that IAD could be very interesting for industries, which could benefit from inspecting virtual copies of their produced goods. The benchmark is limited to a selection of objects. While I acknowledge that this paper represents only the first work in IAD, I also think that it would be challenging to extend it to real-world objects (virtual copies of industrial objects) in order to be valuable for industries. It is not very clear to me how the threshold on the anomaly score is calculated, as it could change from object to object.
Questions
See weaknesses
Limitations
The main limitation of this work is its extendability. It seems very challenging to construct a dataset similar to the one the authors propose, and this could limit the interest in IAD for the CV community.
Final Justification
The authors addressed all my concerns.
Regarding the motivation of the IAD benchmark and its extendability, the authors have committed to adding a discussion about the former in the final version of their paper and described possible ways to extend the proposed benchmark in simulated environments.
Regarding the metric, the authors kindly (but rigorously) clarified the properties of the used metric (correcting a mistake I made in my review) and experimented with alternative metrics when addressing reviewer osxE's concerns.
In light of this, I will confirm my initial positive score for this work.
Formatting Issues
I did not find any major formatting issue
We appreciate your valuable feedback and recognition of our work's novelty. We address your points below.
1. Practical importance of the proposed IAD task
Thank you for your feedback. We agree that our proposed IAD task holds significant value for industries, which could benefit from inspecting digital twins of their manufactured products with movable parts. These include the furniture (e.g., cabinets) and consumer electronics (e.g., laptops) industries, for quality control and durability testing. IAD may also be used in product development and prototyping to identify design flaws and improve functionality, as well as in the service industry for inspecting and repairing articulated objects and for warranty assessments. We will add a discussion of this in the final draft.
2. Extensibility to real-world industry settings
We would like to clarify that our method does not require anomaly samples during training, which significantly reduces the effort of dataset construction and improves scalability.
Regarding extension to industrial settings, we note that many industries already maintain digital twins of their products as part of modern manufacturing pipelines. These digital twins can be used within simulation environments to generate rich interaction data. Recent advances in digital twin modeling and the availability of advanced simulation tools (e.g., NVIDIA Omniverse) further support the generation of realistic, industry-standard datasets for IAD.
Furthermore, imperfections from real-world scans or digitization artifacts can be explicitly modeled during simulation, enhancing our model's robustness. Our method is general and designed to accommodate such variations. Future work can further incorporate additional physical properties such as friction to increase realism.
3. On threshold anomaly score
Following standard practice in AD benchmarks, our evaluation uses the AUROC metric, which is threshold-independent: it is obtained by sweeping over all possible threshold values. A higher AUROC score means better separability between normal and anomalous samples, i.e., stronger AD capability. In a practical setup, a fixed threshold is typically chosen based on the desired tolerance for false positives. This can be determined by calibrating on normal data to achieve an acceptable false positive rate (FPR).
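For concreteness, this evaluation-plus-calibration procedure can be sketched as follows. The function name and the quantile-based calibration are illustrative, not part of the paper's pipeline:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_and_threshold(scores, labels, target_fpr=0.05):
    """AUROC is computed without any fixed threshold; a deployment
    threshold can then be calibrated as the (1 - target_fpr) quantile
    of anomaly scores on normal samples (labels == 0), so that roughly
    target_fpr of normal objects would be flagged."""
    auroc = roc_auc_score(labels, scores)
    normal_scores = scores[labels == 0]
    threshold = np.quantile(normal_scores, 1.0 - target_fpr)
    return auroc, threshold
```

Note that the calibration step uses only normal data, consistent with an anomaly-free training setup.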
Thank you once again for your review. We hope that our detailed rebuttal has addressed your concerns. If there are any remaining questions or points you would like us to clarify, please let us know; we would be happy to address them during the remaining rebuttal period.
I would like to thank the authors for their thorough responses and clarifications. They have adequately addressed all of my original concerns, and I found the discussion of issues raised by other reviewers to be particularly insightful, further reinforcing my view on the relevance of the proposed benchmark.
I therefore confirm my initial positive score and expect the authors to implement all the suggested revisions in the final version of the manuscript.
The paper presents a benchmark for interactive anomaly detection of articulated objects, using motion anticipation as the basis. Alongside the benchmark, it introduces a learning-based approach to anomaly detection. The method detects anomalies by learning how the object is normally interacted with through a robotic agent. It learns to predict normal functional behaviour through interactions with normal objects. During testing, the method detects anomalies by comparing the observed motion of the object with the model's learned normal behaviour. The approach is evaluated on the proposed benchmark and compared to several baselines using standard evaluation metrics.
Strengths and Weaknesses
- The paper would be better suited to the benchmark track. The method is not entirely novel. Instead, the main contribution is the benchmark.
- Limited novelty: the proposed method does not make significant contributions. The idea of using motion modelling for anomaly detection is well established in the literature (e.g. automated driving trajectory-based anomaly prediction; see the Shift dataset). The proposed approach is also quite similar to Where2Act. It should be made clear what the proposed novelties are.
- Benchmark clarity: The selection of categories for training and testing lacks clarity. While the paper mentions a 'mutually exclusive set', it is unclear whether the categories are chosen randomly or manually, or if they are functionally disjoint (e.g. a closet door versus a regular door). This ambiguity limits the strength of the generalisation claims. It would be valuable to conduct an ablation study analysing how training category selection affects performance.
- Benchmark clarity: The paper provides limited information on how anomalies are generated in the dataset. For example, the anomaly shown in Figure 1 appears somewhat unrealistic, which could reduce the relevance of the benchmark.
- More evaluation metrics: The only metric reported for anomaly detection performance is AUROC. However, this metric is insensitive to class imbalance, which is usually the case in anomaly detection since, by definition, anomalies are rare. This can lead to deceptively high AUROC scores even when the model performs poorly on the rare class (i.e. anomalies). To get a more complete picture of anomaly detection performance, benchmarks typically require precision-recall area under the curve (AUPRC) and false positive rate at 95% recall to be reported.
- Method clarity: The method description lacks clarity and uses inconsistent terminology. For example, line 125 refers to 'decoders' for motion and segmentation, whereas later sections refer to 'heads'. Maintaining consistent terminology would improve readability.
- The paper appears somewhat unpolished in places. For example, the abstract could be improved by providing a brief explanation of the broader context and potential applications of interactive anomaly detection to motivate the problem. A condensed version of the first paragraph of the introduction could achieve this.
Minor points:
- Section 4.4 (Limitations and Failure Cases) would benefit from a more detailed analysis of failure modes, including potential causes of false positives/negatives, as well as suggestions for how these could be mitigated in future work.
- The term 'online data sampling' is introduced in Section 4.2 without explanation. Explaining what this entails and why it is beneficial would improve reproducibility.
Questions
- How does the proposed method differ from Where2Act, and how directly comparable are the chosen baselines to it?
- How are object categories for training and testing selected? Are they chosen randomly or according to specific criteria?
- Have you computed additional evaluation metrics, such as AUPRC or FPR@95% recall? These could help to improve our understanding of the model's performance.
- Are the failure cases shown in Section 4.4 representative of all typical failures, or have other types of false detection been observed?
Limitations
Yes.
Final Justification
After reading the rebuttal, I found all points valid and could follow them. However, I am fundamentally against accepting the dataset paper as a NeurIPS paper. It could be submitted to the dataset track of the conference or to another robotics venue, e.g., IROS/ICRA. There is no new theory in this work. I will maintain my 'Reject' score.
Formatting Issues
It's fine.
Thank you for reviewing our work. Below we address your questions and the weaknesses mentioned:
1. Concerns regarding novelty
Thank you for the feedback. We would like to clarify the novel contributions of our method compared to Where2Act [R1] as follows:
- Where2Act focuses on predicting which actions are likely to move an articulated part, but it does not model how the part moves. In contrast, our method estimates the rigid motion flow that fully characterizes the part’s kinematic response to the action. This is crucial for our interactive anomaly detection (IAD) task, as it provides the anticipated ‘normal’ motion necessary for comparison with the observed motion.
- Our formulation enables long-horizon, temporally consistent action exploration, which Where2Act lacks.
- By explicitly estimating motion, our method can filter out noisy actions by identifying inconsistencies in predicted motions.
Our model leverages rich supervisory signals from simulation during training, including the motion mask and rigid motion matrix tied to each input action, whereas Where2Act uses only discrete supervision (motion vs. no motion). Our richer supervision allows the model to capture the continuous range of articulation dynamics more accurately.
Comparison with autonomous driving trajectory anomaly detection: In autonomous driving, anomaly detection is typically based on passive observation of agents (vehicles or pedestrians) moving along structured paths governed by traffic rules. These methods rely on detecting spatial deviations in smooth trajectories. In contrast, our task involves an embodied agent actively manipulating articulated objects, where anomalies arise from deviations in expected part dynamics (e.g., stuck or broken joints). This demands modeling of contact-driven motion, causality, and part-specific behavior. We will discuss the related papers in the final draft.
We would like to highlight that other reviewers have acknowledged the novelty of our work. For instance, Reviewer dCsY acknowledged our joint modeling of interaction and motion anticipation as a novel approach, Reviewer bQ9u remarked that our proposed method is well-posed and clearly articulated, and both reviewers recognized the novelty and practical importance of the IAD task itself.
[R1] Where2Act: From Pixels to Actions for Articulated 3D Objects, ICCV'21
2. Selection of training categories
Our chosen category split follows the same protocol as in [R2]. As suggested, we also conduct experiments with different combinations of training categories. While selecting these combinations, we ensure that each combination contains a sufficient number of articulated objects covering representative articulation types, such as both revolute and prismatic joints. The StorageFurniture category is included in all sets, as it is the most representative category in the PartNet-Mobility dataset. The test categories remain consistent across all experiments. As shown in the results below, the choice of training categories has a relatively low impact on overall performance, although some test categories might benefit from the presence of a functionally similar category in the training set (e.g., Microwave and Safe in SetA). Note that while the overall performance is slightly lower than in Table 1, this is expected since training was conducted on a smaller subset of the data.
| Set | Training categories | Interaction pairs (offline) | Box | Phone | Dishwasher | Safe | Oven | WashingMachine | Table | KitchenPot | Bucket | Door |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SetA | {StorageFurniture, Microwave, Switch, Stapler} | 77M | 69.1 | 88.3 | 79.8 | 84.1 | 79.0 | 84.5 | 88.1 | 75.7 | 72.3 | 73.9 |
| SetB | {StorageFurniture, Refrigerator, Laptop, Window} | 90M | 71.4 | 91.5 | 82.4 | 81.1 | 81.6 | 85.8 | 88.9 | 80.5 | 73.4 | 72.1 |
| SetC | {StorageFurniture, TrashCan, Toilet, Kettle} | 81M | 73.7 | 92.6 | 80.2 | 81.0 | 80.3 | 86.1 | 88.4 | 87.6 | 79.3 | 72.3 |
[R2] UMPNet: Universal Manipulation Policy Network for Articulated Objects, ICRA'22
3. Anomaly generation details
Our dataset contains a broad range of plausible functional anomalies covering types such as a tilted axis of motion, a wrong joint type, and restricted motion. A detailed description of the anomaly generation process is provided in Appendix A.1. Currently, the dataset focuses on 1-DoF prismatic and revolute joints due to limitations in publicly available articulation datasets, such as PartNet-Mobility, which includes only these joint types. However, our methodology uses a general motion parameterization that is agnostic to joint type and can be extended to more complex articulations.
4. Additional metrics
Our evaluation set is relatively balanced (see Table A1 in appendix). As suggested, we also report AUPRC and FPR@95% recall on the test categories in the table below.
| Method | AUPRC (↑) | FPR@95% (↓) |
|---|---|---|
| Where2Act + MotionPrior | 74.4 | 14.2 |
| Our Method | 81.9 | 9.5 |
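Both metrics can be computed with standard tooling; the following is a minimal illustrative sketch (the helper name is ours, not our actual evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def auprc_and_fpr_at_recall(scores, labels, recall_level=0.95):
    """AUPRC via average precision; FPR@95 is the false positive rate at
    the first ROC operating point whose TPR (recall) reaches 95%."""
    auprc = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = float(fpr[np.searchsorted(tpr, recall_level)])
    return auprc, fpr95
```

Unlike AUROC, average precision is sensitive to the positive-class ratio, which is why it complements AUROC under class imbalance.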
5. On clarity and inconsistent terminology
Thanks for your feedback. We will fix the inconsistencies in the final draft.
6. Analysis of failure cases
In addition to the analysis in Section 4.4, Figure A.3 (appendix) presents more examples of representative failure cases. The primary cause of failure is inaccurate motion prediction, which manifests in several ways, such as misclassifying the joint type (e.g., predicting rotation instead of translation), defining an incorrect axis of motion, or generating physically implausible movements that ignore part collisions. Please see Section A.3 (appendix) for more discussion.
7. Explanation of Online data sampling
While offline data generation using random actions can partially explore the action space, it is often inefficient due to the large size of the action space. Focusing on actionable interactions (i.e., actions likely to induce motion) yields more informative training data. Online data sampling provides an efficient way of collecting actionable interaction data by leveraging the knowledge already acquired by the network. Specifically, we sample actions using the normalized movability score distribution in Eq. (8) and simulate additional interactions for those high-likelihood actions. We skip noise filtering in this step to sample actions faster, since this process is not sensitive to outliers. This online data collection strategy significantly increases the proportion of interaction pairs that induce positive motion.
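A minimal sketch of this sampling step (illustrative only; the function name is hypothetical, and our actual implementation operates on per-point action proposals rather than a flat score vector):

```python
import numpy as np

def sample_online_actions(movability_scores, num_samples=16, rng=None):
    """Sample candidate action indices in proportion to their normalized
    movability scores (cf. Eq. 8), so that actions likely to induce
    motion are simulated more often than random exploration would."""
    rng = np.random.default_rng(rng)
    scores = np.asarray(movability_scores, dtype=float)
    probs = scores / scores.sum()          # normalized movability distribution
    return rng.choice(len(scores), size=num_samples, p=probs)
```

Sampling proportionally, rather than greedily taking the top score, preserves some exploration while still concentrating the simulation budget on actionable interactions.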
Dear Reviewer osxE, As the author-reviewer discussion deadline approaches, we would appreciate it if you could let us know whether our rebuttal has addressed your initial concerns. If there are any remaining questions or points you’d like us to clarify, we’d be glad to respond.
Thank you again for your valuable comments and feedback. We hope that our detailed rebuttal has addressed your concerns. If there are any remaining questions or points you would like us to clarify, please let us know; we would be happy to address them during the remaining rebuttal period.
Dear Reviewer osxE,
Could you kindly clarify which of your comments you feel have not been adequately addressed? Additionally, could you specify the areas where the paper still requires significant revision?
I encourage both the reviewers and the authors to actively engage in the discussion to help move the process forward constructively.
Best regards,
AC
Dear Authors, dear AC,
Thank you for clarifying the proposed dataset. I found all of your points valid and could follow them. However, I am fundamentally against accepting the dataset as a NeurIPS paper. As I wrote to you, it could be submitted to the dataset track, or to IROS/ICRA, or to another robotics venue. My perspective might not be correct, but that is why there are two more reviews and ACs to make the joint decision.
I will maintain my 'Reject' score and allow the ACs and PCs to discuss whether this could be a NeurIPS paper.
Best regards, osxE
We thank Reviewer osxE for participating in the discussion.
For the benefit of the AC and the final discussion, we emphasize that our paper’s contributions extend beyond the proposed dataset, with our primary contributions as follows:
-
A Novel Task: The formulation of Interactive Anomaly Detection (IAD) for articulated objects, a new problem space that requires active agent interaction.
-
A Novel Method: A new, learning-based approach that uniquely integrates interaction and motion anticipation to solve this task. We have detailed its novelty and significant differences from prior work like Where2Act in our rebuttal.
-
A New Benchmark: The creation of the PartNet-IAD benchmark, which is essential for evaluating methods on this new task, but is not the sole contribution of our paper.
Although all three reviewers acknowledge the technical novelty and experimental details, reviewer osxE recommends rejection, suggesting that the paper is more suitable for IROS/ICRA or the benchmark track.