PaperHub

Overall rating: 4.3 / 10 (withdrawn; 4 reviewers)
Individual ratings: 3, 5, 3, 6 (min 3, max 6, std 1.3)
Average confidence: 3.3
Correctness: 2.3 · Contribution: 2.0 · Presentation: 2.3

ICLR 2025

PTAD: Prototype-Oriented Tabular Anomaly Detection via Mask Modeling

Submitted: 2024-09-26 · Updated: 2025-01-14


Keywords
Tabular anomaly detection · Prototype learning · Mask modeling

Reviews and Discussion

Review (Rating: 3)

The authors propose a novel reconstruction-based anomaly detection (AD) method for tabular data that combines previous approaches [4], [2] from the AD literature and the SSL literature [5]. Their approach relies on mask reconstruction as done in [4], [2] in the original data space but is also augmented by three components:

  • (i) mask selection in the original data space as done in [4],
  • (ii) inter-sample prototype generation in the hidden space to measure distance to the generated prototypes for (1) mask generation in the hidden space during training and (2) anomaly scoring in inference,
  • (iii) inter-feature prototypes generation to measure distance to those prototypes for both training and inference.

The main novelty and contribution of the present paper is the use of previously proposed concepts to improve tabular anomaly detection by:

  • (1) guiding the learning process towards more desirable equilibria by including new loss functions and generating better suited masks,
  • (2) augmenting the vanilla reconstruction-error anomaly score to include more precise measures of inter-feature and inter-sample discrepancies between normal samples and anomalies.

They evaluate their method on a benchmark of 20 tabular datasets and show strong performance in comparison to existing methods. The authors conduct extensive ablation studies to provide evidence of the relevance of their approach and explain how some key hyperparameters were selected.

Strengths

  • The paper is well-structured, and the context of the study is well introduced. The authors extensively detail the existing work in the relevant literature, identify some weaknesses, and propose to address them.
  • The present work augments the existing mask reconstruction scheme for AD with (i) proper mask selection during training and inference as done in [4], and (ii) prototype-based anomaly scores inspired by prototype generation as done in [5], and shows through experiments that the method performs well in comparison with existing methods.
  • Their approach is novel and combines the work showcased in PTaRL [5] in the SSL for tabular data literature and MCM [4] as well as NPT-AD [2], that all have proven to work well for tabular data.
  • The ablation studies are extensive and well detailed, demonstrating the relevance of each addition to the existing frameworks.

Weaknesses

  • While well-structured, the paper can be complicated to follow, as the style is sometimes wordy and contains several typos that hinder the overall understandability of the paper. For example, line 235: "$M_{nh}^k$ denotes the mask value of the $h$-th feature dimension of n-th sample concerning about the k-th basis vector $\beta^{k}$ (...) Furthermore, multiple masks encourage the model to reconstruct samples with various masks, anomalies are prone to be detected by a comprehensive measurement". Moreover, the term nominal is sometimes used to refer to normal samples, which might be confusing; we encourage the authors to stick to the term normal.

  • Experiments performed by the authors include only 3 iterations for 3 different seeds, which can be considered low in comparison with previous papers that performed experiments 10 or 20 times [1], [2] (while we also acknowledge that [4] also only conducted 3 runs per dataset). No statistical tests are mentioned; were any performed? Some values are highlighted in the tables while they are unlikely to be significantly different from the second-highest value (e.g. on the Wine dataset, PTAD gets an AUPRC of 0.9323 vs. 0.9269 for MCM [4]).

  • The presented results are not reproducible. We acknowledge that the authors provide the overall code for the main experiments. However, the authors mention a grid-search strategy for the learning rate and do not explicitly mention (i) what objective was used to select the learning rate in this grid search (we assume the loss achieved after 200 epochs?), or (ii) the obtained learning rate for each dataset so the experiments can be run again. Only the hyperparameters for the cardio dataset are provided. Moreover, none of the ablation studies can be reproduced without excessive effort, while they constitute critical evidence of the relevance of the proposed approach. Overall, none of the experiments conducted in section 5 can be reproduced either. We strongly encourage the authors to provide the full, easy-to-run code to support their statements.

Questions

Typos:

  • Line 150: the authors stipulate that [2] introduce NPT, while [3] actually introduced it and [2] use it for AD. Moreover, the authors should consider including [2] in the related work section on Tabular Anomaly Detection to avoid confusion.
  • The paragraph Projection-Space Mask Generation in section 4.1 is not easy to follow. In equation (4), the authors write $(z_{nh} - \beta^{k})^2$ and do not define $z_{nh}$. We hypothesize that it corresponds to the $h$-th element of the $H$-dimensional vector $z_n$, in which case we do not understand the operation between a scalar and the basis vector of dimension $H$.

Questions and clarifications

There are a few elements that need clarification:

  • (i) In Equation (3) in the paper, the authors include weights $\mathbf{W}_i$ for $i\in\{1,2,3\}$ used to construct the masks, as done in [4]. We suppose that those weights are learned jointly with the rest of the parameters. Given the overall learning objective, (i-1) how do you ensure that the learned weights do not yield a trivial solution, e.g. no features are masked or the masks are identical? As there is no constraint on the $\hat{\mathbf{X}}$ representation or on the obtained masks, it could very well be that $\mathbf{X} = \hat{\mathbf{X}}$. For example, in [4] the authors include a diversity loss that ensures that masks are diverse and non-trivial (a generic sketch of such a constraint is given after this list). [4] mention that "it raises another vital problem: how to prevent the mask generator generating same and redundant masks. This determines whether MCM can extract different and diverse correlations. Our solution is to constrain the similarity between different masking matrices." (i-2) What motivated the authors to not include this loss?

  • (ii) In the original paper that proposes NPTs [3] and in [2], which relies on NPT for AD, both approaches include a mask token that increases the feature representation by one dimension, $(x_i, \mathrm{mask})$, where $x_i$ is a scalar for numerical features and the one-hot representation for categorical features. This representation is then mapped to an $h$-dimensional representation using a linear layer. (ii-1) How do you perform data-space masking for the categorical features using equation (3) in the paper? Similarly, is a mask token involved in the pipeline? A detailed description of the pipeline, as done in App. C.3 of [3] or App. B of [2], might clarify things. We suggest the authors include such a description in the appendix.

  • (iii) During training, prototypes are obtained from the encoded masked representations of normal samples. While constructing the basis vectors can be natural for unmasked samples, it appears very complicated to capture normal behaviors in normal samples that may have been masked very differently. Let us consider a very simplistic example where we have a batch of 3 samples, and the mask produced in equation (3) in the paper is

    $$M^{ds} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

    how can the basis vectors be relevant in that case? More precisely, in the general case, if the mask produced in equation (3) is a diagonal matrix, which is possible since no constraint on the mask is included, (iii-1) how can the obtained basis vectors contain meaningful information? In particular, in the more general case, as the number of basis vectors is set to 5 in all tested datasets, what if the $M^{ds}$ matrix produces significantly more than 5 mask types? It seems very unlikely that those basis vectors will properly disentangle the representations.

  • (iv) Prototype generation is also used to capture normal inter-feature dependencies. We could be wrong, but it appears to us that aiming to learn inter-feature dependencies on representations that result from a double masking strategy (in the data space and in the encoded space) is particularly challenging, and we are surprised that the model can perform such a task. (iv-1) How do you ensure that the inter-feature dependencies learned by the NPT itself do not suffice to discriminate between normal samples and anomalies?

  • (v) In inference, since $\mathbf{X}$ contains both normal samples and anomalies, how do you ensure that the basis vectors computed in the pipeline still capture normal behavior (both for $s_n^{ap}$ and $s_n^{bv}$)? (v-1) If anomalies originate from a similar distribution (but distinct from the normal distribution), couldn't some obtained basis vectors also capture anomalous behavior, thus polluting the anomaly score through anomaly leakage?

  • (vi) An average over three runs seems very low. While we acknowledge that [4] also reports an average over 3 runs, other recent work [1], [2] reports an average over 20 runs, which seems more reasonable. Increasing the number of runs might be relevant, in particular since the training cost seems reasonable as reported in table 5. Moreover, (vi-1) could the authors report the standard deviations (as done in [1], [2])? (vi-2) Are bold results significantly higher than competing methods? No mention of a statistical test can be found in the paper.

  • (vii) Some previous papers [1], [2] have also relied on the F1-score; it would be best if the authors could include this metric used in previous benchmarks. In particular, the provided code shows that the F1-score is stored, so it could be included without too much effort.
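To make the diversity constraint referenced in (i) concrete, here is a generic sketch of one way to penalize similarity between masking matrices (our own illustration; not necessarily the exact diversity loss used in [4]):

```python
import numpy as np

def mask_diversity_penalty(masks):
    """Generic sketch: penalize pairwise cosine similarity between flattened
    masking matrices so the generator avoids producing identical or redundant
    masks (illustration only; [4]'s exact formulation may differ).

    masks : array of shape (K, N, D), i.e. K masking matrices over N x D cells
    """
    K = masks.shape[0]
    flat = masks.reshape(K, -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    sim = flat @ flat.T                           # (K, K) cosine similarities
    off_diag = sim - np.eye(K)                    # ignore self-similarity
    return (off_diag ** 2).sum() / (K * (K - 1))  # 0 when masks are mutually orthogonal

masks = np.random.default_rng(0).random((3, 4, 5))  # three toy 4x5 soft masks
print(mask_diversity_penalty(masks))
```

Minimizing such a term alongside the reconstruction objective would push the learned masks away from the trivial identical-mask solution.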

Overall, we believe that the present work is promising, as it combines existing ideas to provide a novel pipeline for tabular anomaly detection. However, given the listed weaknesses and open questions, we lean towards reject. We are nonetheless very open to discussion and would be happy to increase our score if the authors are able to address our concerns and provide the code to run all the experiments presented in the paper.

[1] Tom Shenkar and Lior Wolf. Anomaly detection for tabular data with internal contrastive learning. In International conference on learning representations, 2022.

[2] Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, and Bich-Liên Doan. Beyond individual input for deep anomaly detection on tabular data. In Forty-first International Conference on Machine Learning, 2024.

[3] Jannik Kossen, Neil Band, Clare Lyle, Aidan N Gomez, Thomas Rainforth, and Yarin Gal. Self-attention between datapoints: Going beyond individual input-output pairs in deep learning. Advances in Neural Information Processing Systems, 2021.

[4] Jiaxin Yin, Yuanyuan Qiao, Zitang Zhou, Xiangchao Wang, and Jie Yang. Mcm: Masked cell modeling for anomaly detection in tabular data. In The Twelfth International Conference on Learning Representations, 2024.

[5] Hangting Ye, Wei Fan, Xiaozhuang Song, Shun Zheng, He Zhao, Dandan Guo, and Yi Chang. PTaRL: Prototype-based tabular representation learning via space calibration. In International Conference on Learning Representations, 2024.

Comment

Response to W3 about the code and hyperparameters: Thanks for your comments. Our learning rate is selected by the convergence speed of the training loss, following the same selection criterion as NPT-AD [2] (an illustrative sketch of such a criterion is given after the table below). Furthermore, we have provided the full set of hyperparameters for all datasets in Appendix D. Besides, to facilitate the reproduction of our results, we have provided an anonymized link (https://anonymous.4open.science/r/PTAD-D76D/), where the pre-trained models can be directly downloaded for validation. This allows researchers to verify our results without requiring extensive training efforts.

Table 8: Dataset hyperparameters. A batch size of -1 means the entire training set is used as input.

Dataset            Epochs   Batch size   Learning rate
Arrhythmia         200      -1           0.01
Breastw            200      -1           0.01
Campaign           200      512          0.001
Cardio             200      512          0.00001
Cardiotocography   200      512          0.001
Census             50       32           0.01
Fraud              200      512          0.001
Glass              200      -1           0.0001
Ionosphere         200      -1           0.01
Mammography        200      512          0.02
NSL-KDD            200      512          0.01
Optdigits          200      512          0.01
Pima               200      -1           0.01
Pendigits          200      512          0.01
Satellite          200      512          0.01
Satimage-2         200      512          0.01
Shuttle            200      512          0.01
Thyroid            200      512          0.01
Wbc                200      -1           0.00001
Wine               200      -1           0.00001
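Regarding the learning-rate selection criterion mentioned above (convergence speed of the training loss), the following is a hypothetical sketch of how such a criterion could be operationalized; the authors do not specify the exact rule, so the threshold and the `pick_lr_by_convergence` helper are our assumptions:

```python
import numpy as np

def pick_lr_by_convergence(loss_curves, target_frac=0.1):
    """Hypothetical illustration: for each candidate learning rate, count the
    epochs needed for the training loss to fall below target_frac of its
    initial value, and pick the fastest-converging one. (The authors only
    state 'convergence speed of the training loss'; this rule is assumed.)

    loss_curves : dict mapping learning rate -> per-epoch training losses
    """
    def epochs_to_converge(curve):
        curve = np.asarray(curve, dtype=float)
        below = np.nonzero(curve <= target_frac * curve[0])[0]
        return below[0] if below.size else np.inf

    return min(loss_curves, key=lambda lr: epochs_to_converge(loss_curves[lr]))

curves = {0.01: [1.0, 0.5, 0.08], 0.001: [1.0, 0.8, 0.6, 0.4, 0.09]}
print(pick_lr_by_convergence(curves))  # -> 0.01 (reaches 10% of the initial loss soonest)
```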

Response to Question-typos-1 about the reference [2]:

Thanks for your suggestions. What we originally meant to say in Line 150 ('Thimonier et al. (2024) introduces NPT and showcases its effectiveness and superiority in tabular AD') is in accordance with your point, but the expression may lead to a misunderstanding. Therefore, we revised the sentence to 'Motivated by NPT, Thimonier et al. (2024) introduce NPT-AD by incorporating both sample-sample and feature-feature dependencies in tabular AD, which showcases its effectiveness and superiority for tabular data.' Additionally, following your suggestion, we have included the work proposed by Thimonier et al. (2024) in the related work section.

Response to Question-typos-2 about the Projection-Space Mask Generation:

Thanks. Following your suggestion, we have clarified the paragraph about Projection-Space Mask Generation.

In our formulation, the $H$-dimensional vector $z_n$ denotes the representation of the $n$-th sample, where $z_{nh}$ is the $h$-th element of $z_n$. The basis vector $\beta^k$ is also $H$-dimensional, and $\beta^k_h$ denotes its $h$-th element. We compute the element-wise distance between $z_n$ and the basis vectors $\beta^k$ in the projection space, which is used to generate the projection-space mask. Specifically, each mask value $M_{nh}^{k}$ is computed from the Euclidean distance between the $h$-th value of the $n$-th representation, i.e., $z_{nh}$, and the $h$-th value of the $k$-th basis vector, i.e., $\beta^k_h$. This distance-based computation helps the model identify relevant patterns in the latent space and determine which features should be masked as suspicious anomalies.
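For concreteness, here is a minimal NumPy sketch of this element-wise, distance-based mask computation (our own illustration; the squashing of distances into (0, 1) is an assumption, and the paper's Eq. (4) may use a different normalization):

```python
import numpy as np

def projection_space_masks(Z, B, temperature=1.0):
    """Illustrative sketch: element-wise, distance-based masks between sample
    representations and basis vectors (not the authors' exact implementation).

    Z : (N, H) encoded representations z_n
    B : (K, H) basis vectors beta^k
    Returns M of shape (K, N, H), where M[k, n, h] is derived from the
    squared distance (z_{nh} - beta^k_h)^2.
    """
    dist = (Z[None, :, :] - B[:, None, :]) ** 2   # (K, N, H) element-wise distances
    # Squash distances into (0, 1): a larger distance gives a value closer to 1,
    # i.e. the feature looks more "suspicious" relative to that basis vector.
    return 1.0 - np.exp(-dist / temperature)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))   # 4 samples, H = 8
B = rng.normal(size=(5, 8))   # K = 5 basis vectors
print(projection_space_masks(Z, B).shape)  # (5, 4, 8)
```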

Review (Rating: 5)

This paper develops a complex framework for tabular data anomaly detection, consisting of multiple components, such as data-space mask, projection-space mask, and prototype learning.

Strengths

  • I am not sure if this is a strength point or not, but I do think the proposed framework is pretty complex.

  • The proposed approach can achieve good performance on multiple datasets and outperform various baselines.

  • A relatively comprehensive ablation study is also conducted to demonstrate the effectiveness of each component.

Weaknesses

  • Although the proposed framework is complex, it seems each component is based on existing work. The combination itself is not a weakness. However, it would be better to highlight the unique challenges of the combination and why each component is essential for the whole framework.

  • Because the whole framework consists of so many pieces, it is not easy to follow, and some notations are not very consistent.

    • In Lines 135 -- 154, the mixed use of $N$ and $n$ is confusing but still guessable. However, based on Equation 1, I cannot see how to derive $Z$ as defined in Line 224.

    • If I understand correctly, the framework should first obtain the basis vectors $\mathcal{B}$ via the projection space learning and then derive the projection-space mask. If so, it may be better to present the projection space learning first.

    • In the projection space learning part, what does $\theta_E$ indicate? What does $\sigma_{\beta^k}$ in $Q(\mathcal{B})$ indicate? What is the relationship between $\sigma_{\beta^k}$ and the basis vector $\beta^k$?

    • To be honest, I already felt lost when reading Sec 4.2. Similar questions as above: what does $P(\pi)$ indicate in Line 296? What is the purpose of the objective function defined in Eq 7? What is the expectation for this objective?

Questions

The term "anomaly leakage" is new to me. What is anomaly leakage? In what situation will we observe the issue? Why can the masking strategy solve this issue?

Comment

Response to W1 about the challenges when introducing our method:

Thanks for your insightful advice on improving our paper. Firstly, we hope to emphasize that our model is problem-oriented and elaborately designed for tabular data. As the reviewer mentioned, it is challenging to design an appropriate method for tabular AD, which is reflected in the following aspects:

  1. Developing a masking strategy is challenging due to the intricate, heterogeneous, and unstructured nature of tabular data. Directly applying the commonly used random masking may leak a large amount of abnormal information into the reconstruction of tabular data, which is highly unstructured and entangled, leading to irregular reconstruction. Thus, we introduce data-adaptive masks to find suspicious anomaly locations and reconstruct them with normal information, resulting in small deviations for normal samples and large deviations for anomalies.

  2. How do we select data-adaptive suspicious abnormal masks? The straightforward way is to compare the representations with normal ones. However, the learned representations of existing deep tabular methods are usually entangled and thus cannot support finding the suspicious abnormality. Therefore, we encourage learning disentangled representations within the projection space. To this end, we introduce a group of learnable orthogonal basis vectors. Furthermore, to capture various data characteristics and learn the diverse inherent relationships, we introduce a multiple-mask strategy and apply it only in the decoder to save computation. Please note that introducing the data-adaptive strategy not only in the observation space but also in the latent space is novel and non-trivial.

  3. To complement the feature-level patterns mentioned in 1) and 2), we also investigate the correlation patterns between features to facilitate the modeling of tabular normal patterns and the detection of anomalies by introducing a novel association prototype learning. This is beneficial for fitting the heterogeneous and complex characteristics of tabular data.

To sum up, our model aims to improve the tabular AD from two perspectives: one is designing the adaptive mask and the other is learning the correlation patterns across features based on the adaptive masks. It is the interaction and complementarity between these two parts that make our method effective for tabular anomaly detection.

Response to W2:

Sorry for any confusion caused; we have revised our paper to make the descriptions clearer and the logic easier to follow.

a) $N$ is the number of samples, and $n$ is a typo here, revised to $N$. $Z = \Phi_E(\hat{X}; \theta_E) \in \mathbb{R}^{N \times H}$ is the encoded representation of the $N$ samples. Here, $\Phi_E$ is the NPT encoder layer, composed of an ABD and an ABA layer and parameterized by $\theta_E$. $\hat{X}$ is the masked input of the $N$ samples obtained from Eq. (3).

b) Thanks for your insightful advice. As mentioned in Weakness-2), we introduce the masks before the projection space learning, as the latter serves as a support for the former, which selects data-adaptive suspicious abnormal masks by learning a disentangled projection space. We have revised the paper to make the logic clearer and easier to understand.

c) $\theta_E$ refers to the parameters of the encoder. $\delta_{\beta^k}$ is the Dirac function that places a unit point mass at the $k$-th basis vector $\beta^k$, and $Q(\mathcal{B}) = \frac{1}{K}\sum_{k=1}^K \delta_{\beta^k}$ denotes a $K$-component uniform discrete distribution over the basis vectors. This is commonly used notation in typical OT problems [1][2][3].

d) $P(\pi)$ represents the discrete uniform distribution over the $N$ association vectors. The logic of section 4.2 can be stated as follows: i) we first extract the correlations between features in each input sample as $\pi_n$, denoted as association vectors. ii) we record the normal correlations as $M$ association prototypes, which need to be learned. iii) the OT-based optimization objective in Eq. (7) minimizes the distance between the association vectors and the corresponding nearest association prototype (a toy sketch of this nearest-prototype intuition follows this list). This objective is utilized to learn the association prototypes to record normal features. The intuition behind this is normal patterns tend to approach one of the association prototypes rather than its fusions to alleviate the potential collapse.
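As a toy, layman-level illustration of the nearest-prototype intuition in d) (our own sketch, not the paper's Eq. (7); a full OT formulation would additionally impose marginal constraints on the assignment, e.g. via Sinkhorn iterations):

```python
import numpy as np

def nearest_prototype_cost(pi, prototypes):
    """Toy sketch: each association vector pi_n should sit close to ONE
    prototype rather than a blend of several. Returns the mean distance of
    each pi_n to its nearest prototype and the hard assignments.

    pi         : (N, D) association vectors extracted from the inputs
    prototypes : (M, D) learnable association prototypes
    """
    cost = ((pi[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise costs
    assign = cost.argmin(axis=1)                                     # nearest prototype per sample
    return cost[np.arange(len(pi)), assign].mean(), assign

rng = np.random.default_rng(0)
pi = rng.normal(size=(6, 4))          # 6 association vectors, D = 4
prototypes = rng.normal(size=(3, 4))  # M = 3 prototypes
loss, assign = nearest_prototype_cost(pi, prototypes)
print(loss, assign)
```

Minimizing such a cost over the prototypes makes them act as anchors for normal feature correlations; association vectors that are far from every prototype then indicate anomalous inter-feature behavior.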

Furthermore, for easier understanding of the information flow and logic, we have included our algorithm and training pipeline in Appendix A and Appendix B, respectively.

Comment

Thanks for providing the revised manuscript. This version is much much better. I guess the issue is that I have limited knowledge of OT. Therefore, the two key objective functions defined in Equations 5 and 7 still feel hard to follow. Could you use layman's language to explain objective functions?

For example, the comment above states, "This objective is utilized to learn the association prototypes to record normal features. The intuition behind this is normal patterns tend to approach one of the association prototypes rather than its fusions to alleviate the potential collapse."

My question could be, what does the following statement, "the association prototypes to record normal features", indicate? What is the meaning of "normal patterns tend to approach one of the association prototypes rather than its fusions to alleviate the potential collapse"? What is the connection between OT and the statement here? It might be helpful to provide any examples to illustrate the justification here.

I slightly tuned up the score and lowered my confidence, as I am not an expert on OT, which seems critical for the approach.

Comment

Response to the question about anomaly leakage:

Thanks for your question. 'Anomaly leakage' is a typical issue in the anomaly detection field [4,5,6]. It has various alternative names or variants, such as 'identical shortcut' and 'identical reconstruction'. Reconstruction-based methods detect anomalies by assuming that a well-trained model (trained only on normal samples) always produces normal samples, regardless of whether anomalies are within the inputs. In this way, there will be large reconstruction errors for anomalous samples, making them distinguishable from the normal ones. However, with the entire data as input, the reconstruction model can sometimes reconstruct both the normal and anomalous regions well, which amounts to returning a good copy of the input regardless of its content. As a result, the anomalous samples can be well recovered by the learned model and hence become hard to detect. This leads to lower anomaly scores in abnormal regions and thus failure of anomaly detection. Especially when encountering complex data distributions, such as heterogeneous unstructured tabular data, reconstruction methods tend to suffer from this 'anomaly leakage' problem: the model would have to work extremely hard to learn the complex data distribution, so learning an "identical shortcut" appears as a far easier solution. Moreover, a model trained on this kind of data is more likely to overfit. Introducing our masking strategy blocks the anomalous information at the source and prevents the model from reconstructing anomalous regions, as those suspicious anomalous regions are already masked. As a result, the anomalous samples cannot be well recovered from the normal information and hence become easier to detect.
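For intuition, here is a minimal sketch of how the 'identical shortcut' manifests in reconstruction-based scoring and how masking the input blocks it (a generic illustration, not the PTAD pipeline; the all-zero mask is just an extreme example):

```python
import numpy as np

def reconstruction_scores(model, X, mask=None):
    """Generic reconstruction-based anomaly scoring sketch.

    model : callable that reconstructs its input, X_hat = model(X_in)
    X     : (N, D) test samples (normal + anomalous)
    mask  : optional (N, D) soft mask in [0, 1]; 1 = keep a cell, 0 = hide it.
            Hiding suspicious cells forces the model to fill them in from
            normal information instead of copying the (possibly anomalous) input.
    """
    X_in = X if mask is None else X * mask        # hide suspicious entries
    X_hat = model(X_in)                           # reconstruct from what is visible
    return ((X - X_hat) ** 2).mean(axis=1)        # per-sample anomaly score

identity_model = lambda x: x                      # a learned "identical shortcut"
X = np.random.default_rng(0).normal(size=(5, 4))
print(reconstruction_scores(identity_model, X))                         # ~0 everywhere: anomalies leak through
print(reconstruction_scores(identity_model, X, mask=np.zeros_like(X)))  # errors reappear once inputs are masked
```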

[1] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.

[2] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pp. 1367–1376. PMLR, 2018.

[3] Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.

[4] P. Perera, R. Nallapati, and B. Xiang. OCGAN: One-class novelty detection using GANs with constrained latent representations. In CVPR, 2019.

[5] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. In Advances in Neural Information Processing Systems, volume 35, pp. 4571–4584, 2022.

[6] Xincheng Yao, et al. One-for-all: Proposal masked cross-class anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023.

Comment

Thank you for sharing the references.

Review (Rating: 3)

The PTAD (Prototype-Oriented Tabular Anomaly Detection) model introduces advanced mask modeling and prototype learning techniques to enhance anomaly detection in tabular data. To address the issue of anomaly leakage, PTAD applies mask modeling in both data and projection spaces. A soft masking strategy selectively conceals potentially abnormal regions in the data space. Simultaneously, in a disentangled projection space, PTAD creates learnable masks based on orthogonal basis vectors, emphasizing feature-wise discrepancies that aid in obscuring suspicious areas while maintaining nominal patterns for accurate reconstruction. PTAD’s prototype learning further strengthens its anomaly detection by capturing global feature dependencies in tabular data. These “association prototypes” represent typical correlations among nominal features and are learned as part of the model's decoding process. This is framed as an optimal transport (OT) problem, aligning the reconstructed data distribution with the prototype distribution to emphasize nominal patterns and assist in accurate anomaly scoring. Both the projection-space mask modeling and association prototype learning are formulated as OT problems, which calibrate the anomaly scoring based on the degree of deviation from normal patterns. The model’s multi-branch decoder allows for parallel reconstructions of masked representations, capturing complex feature relationships. This integration of data-adaptive masks, OT-aligned prototype learning, and a multi-branch decoding approach allows PTAD to capture nominal feature dependencies effectively, resulting in robust and interpretable anomaly detection across a range of tabular benchmarks.

Strengths

  • The manuscript is well-written and easy to follow. The maths is clearly set out, the individual parts justified and the influence demonstrated.
  • The paper offers a new perspective and a novel optimization strategy for the application of mask modeling in combination with transformers in the field of anomaly detection.
  • The proposed approach performs best in the existing setup compared to the chosen baselines.

Weaknesses

  • The criteria for selecting the datasets used in the evaluation are not set out (see questions).
  • Although current baselines were used, baselines that perform strongly in other works were omitted (see questions).
  • Due to the first two points, and their combination, the results lack comparability, which prevents a definitive assessment of the method.

Questions

Q1: The ODDS repository contains over 30 purely tabular datasets, of which only 12 are used; ADBench contains 57 datasets, not just the eight used in this work (there are partial overlaps between the datasets, but this does not change the main problem). How were the datasets selected, why were the others not used, and what is the performance on the other datasets? (Answering these questions could lead to further questions.)

Q2: Why was a separate benchmark created instead of following previous work such as [1] and [2] (whose methods were used as baselines and are therefore known and could probably have been reused directly), which all use more datasets or well-known benchmarks?

Q3: Some models from the PyOD library were used as baselines. How were these models selected? Why were methods such as LUNAR, KPCA and GMM, which show strong results in papers such as [3], not used?

Despite the apparent weaknesses regarding evaluation and comparability, this work is well-written and interesting. I would put it at marginal acceptance for now, but I will be happy to raise my score if my questions are answered appropriately and the results hold up.

[1] Victor Livernoche, Vineet Jain, Yashar Hezaveh, and Siamak Ravanbakhsh. On diffusion modeling for anomaly detection. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=lR3rk7ysXz.

[2] Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, and Bich-Lien Doan. Beyond individual input for deep anomaly detection on tabular data. arXiv preprint arXiv:2305.15121, 2023.

[3] Roel Bouman, Zaharah Bukhsh, and Tom Heskes. Unsupervised anomaly detection algorithms on real-world data: how many do we need? Journal of Machine Learning Research, 25(105):1–34, 2024

Review (Rating: 6)

This paper aims to tackle two key challenges in tabular anomaly detection: representation entanglement and the lack of global information. It proposes a new framework that combines mask modeling with prototype learning, leveraging optimal transport to align data-adaptive masks with association prototypes. Extensive experiments on 20 tabular benchmarks demonstrate the effectiveness of the proposed approach.

Strengths

  1. The paper proposes a new framework for tabular anomaly detection.

  2. The experiments, including comprehensive ablation studies, are thorough and effectively demonstrate the method’s performance.

Weaknesses

  1. I was wondering about the role of the masking strategy in the raw data space. Given that tabular features are typically unordered and lack the natural sequential or contextual relationships found in images or text, how does the masking strategy effectively capture meaningful correlations in tabular data?

  2. The writing and some notations could be improved for clarity. As I am not an expert in anomaly detection, I found certain sections, particularly the motivations behind key design choices, challenging to follow.

Questions

  1. In line 139, what does the dimension $e$ represent, and how is $H^0$ constructed? Additionally, in Eq. (1), what is the dimension $h$, and why does stacking on axis $n$ yield a tensor with dimensions $N \times d \times e$? Clarifying these aspects would help in understanding the operations in Non-Parametric Transformers.

  2. What specific advantage does soft masking offer in the data space?

Withdrawal Notice

Thanks for all your efforts in reviewing our paper.