PaperHub

Score: 8.2/10

Poster · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.8

NeurIPS 2025

Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits

OpenReview · PDF
Submitted: 2025-05-01 · Updated: 2025-10-29

Abstract

Keywords

Concept Bottleneck Models · Neuro-Symbolic AI · Interpretability · Adversarial Robustness

Reviews & Discussion

Official Review (Rating: 5)

In this paper, the authors analyze and improve the adversarial robustness of Neural Probabilistic Circuits (NPCs). One of the key findings of the paper is that an NPC's robustness to adversarial attacks is determined only by its attribute recognition model and is not affected by the probabilistic circuit. Based on this insight, the authors propose a new model called RNPC (Robust Neural Probabilistic Circuit) that uses a class-wise integration strategy rather than NPC's original node-wise integration. The authors demonstrate that this technique improves adversarial robustness by aggregating probabilities over attribute neighborhoods rather than relying on exact combinations. The authors then mathematically derive robustness bounds and show that the proposed approach keeps the compositional estimation error bounded. Finally, extensive experiments are conducted on multiple datasets to validate that RNPC significantly improves robustness while preserving benign accuracy.

Strengths and Weaknesses

Strengths:

  • The paper is well-structured, with clear descriptions of models, threats, and mathematical formulations. Thus, the paper is comprehensible and easy to navigate.
  • The theoretical analysis done by the authors is comprehensive and well-presented.
  • Experiments are comprehensive, covering various datasets and attack configurations.

Weaknesses:

  • The model partitions the attribute space into $V_y$ for each class and assumes these partitions are non-overlapping, which is a bit concerning given that some attribute combinations can correspond to multiple classes.
  • RNPC defines the neighborhood $\mathcal{N}(y, r)$ using the Hamming distance between attribute vectors. Since attributes may differ in importance or semantic meaning, and the Hamming distance treats all attribute mismatches equally, RNPC may overestimate the safety of a neighborhood if attributes differ in critical ways.
  • The implicit assumption that if an attribute is perturbed, it still stays in the neighborhood is a critical concern. A strategic adversary can craft perturbations to cross the inter-class distance.

Questions

  • The authors wrote on page 5: "If such an attack shifts the probabilities originally assigned to nodes of $V_y$ to any other nodes within $\mathcal{N}(y, r)$, the class-wise integration ensures that the weight of class $y$ is barely affected". However, this depends on the capabilities of the adversary. If a strategic adversarial perturbation is introduced such that the impact crosses the inter-class distance, then what will happen?

  • For Equation 2, it is written: “RNPC's prediction becomes a weighted sum of all classes’ contributions to $Y$”. This phrasing is potentially confusing, since $Y$ is the random variable representing the class itself, so classes cannot meaningfully “contribute” to $Y$. Would the authors clarify whether RNPC's prediction is a weighted sum over classes of their contributions to the predicted class distribution, rather than to $Y$ itself? Please consider rephrasing this statement for precision.

  • The authors have used norm-bounded attacks (PGD, CW) without considering adaptive attackers who might deliberately design perturbations to cross the inter-class distance threshold; hence, the attribute vector crosses into another class’s neighborhood. How does RNPC function under such adaptive threat models? If authors could include an analysis or experiments to measure RNPC's resilience to such attacks, it would prove the robustness and versatility of RNPC against different types of attacks.

  • Hamming distance treats all attribute mismatches equally. But attributes may differ in importance or semantic meaning. For example, changing an attribute from 'red' to 'green' (color) may be more impactful than changing it from 'circle' to 'oval' (shape). Thus, the proposed framework may overestimate the 'safety' of a neighborhood if attributes differ in critical ways. Will it be feasible to consider weighting the Hamming distance to reflect the relative importance or semantic impact of different attributes?

  • The model partitions the attribute space into VyV_{y} for each class, assuming these are non-overlapping. But, again, there can be some attribute combinations can correspond to multiple classes. Can authors clarify this?

Limitations

  • The approach assumes the class-specific attribute partitions $V_y$ are non-overlapping, but some attribute combinations may belong to multiple classes.
  • Hamming distance treats all attributes equally, ignoring their differing semantic importance. Thus, using it to determine the inter-class distance, then the minimum inter-class distance, and finally the neighborhood of a class (the core design principles) is a critical limitation of the paper.
  • RNPC is not evaluated against adaptive attacks.

Final Justification

After considering the authors' rebuttal, I am updating my score to accept. The authors' response effectively addressed my primary concerns. The authors clearly explained the construction of non-overlapping attribute partitions and neighborhoods, clarified the phrasing of Equation (2), and, importantly, proposed a feasible formulation using the weighted Hamming distance as a future direction.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for the thoughtful feedback! We're glad you found the paper well-structured and easy to follow, with comprehensive theoretical analysis and extensive experiments!

Q1: The model partitions the attribute space into $V_y$ for each class, assuming these are non-overlapping. But some attribute combinations may correspond to multiple classes. Can the authors clarify this?

A1: We answer this question with the following three steps.

  • Clarify why the sets $\{V_y\}$ are non-overlapping.

    We first review the construction process of $\{V_y\}$. Specifically, for any attribute assignment $a_{1:K}$, we assign it to the class $y^* = \arg\max_{y \in \mathcal{Y}} \mathbb{P}_D(Y=y \mid A_{1:K}=a_{1:K})$; that is, $a_{1:K} \in V_{y^*}$. In the event of a tie, we can break it arbitrarily (e.g., by coin flipping), ensuring that each $a_{1:K}$ is assigned to a unique $y^*$. Thus, the sets $\{V_y\}$ are non-overlapping by construction (see the sketch after this list).

  • Clarify why the sets $\{\mathcal{N}(y, r)\}$ are non-overlapping.

    We further interpret the reviewer's concern as questioning the non-overlap of the neighborhoods $\{\mathcal{N}(y, r)\}$. This property holds under a mild and implicit assumption: the radius $r$ of the attribute set is non-negative ($r \geq 0$), which corresponds to a minimum Hamming distance of at least 1 ($d_{\min} \geq 1$). This generally holds in the context of concept bottleneck models, where each class is associated with distinct concept combinations. In cases where this does not hold, we propose the following strategy.

  • Propose a strategy for overlapping scenarios.

    One potential approach is to incorporate additional unsupervised attributes [1] during training to help capture fine-grained distinctions between classes sharing the same observed attribute values.
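To make the construction in the first bullet concrete, here is a minimal Python sketch; the function name, the tabulated conditional distribution, and the deterministic tie-break rule are our illustrative assumptions, not the paper's code:

```python
from collections import defaultdict

def build_partitions(p_y_given_a):
    """Assign each attribute assignment a_{1:K} to its most likely class.

    p_y_given_a: dict mapping an attribute tuple a to a dict
    {class y: P_D(Y=y | A_{1:K}=a)}. Any fixed tie-break rule keeps
    the partitions disjoint; here ties go to the smallest class index.
    """
    partitions = defaultdict(set)  # class y -> V_y
    for a, class_probs in p_y_given_a.items():
        # argmax over classes; iterating in sorted order makes the
        # coin-flip tie-break deterministic.
        best_y = max(sorted(class_probs), key=lambda y: class_probs[y])
        partitions[best_y].add(a)  # each a lands in exactly one V_y
    return dict(partitions)
```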

Thanks for this question. It helps us clarify the scenarios considered in our work and point out potential scenarios. We will include this discussion in the paper!

[1] "Concept bottleneck model with additional unsupervised concepts." IEEE Access. 2022.

Q2: Would the authors clarify that RNPC's prediction is a weighted sum over classes of their contributions to the predicted class distribution, rather than to the variable $Y$ itself? Please consider rephrasing this statement for precision.

A2: Thanks for the suggestion. Below is our rephrased version that improves the clarity of Eq. 2:

$$\Phi_{\theta, w}(Y=y \mid X=x) = \sum_{\tilde{y} \in \mathcal{Y}} \left( \mathbb{P}_{\theta}\left(A_{1:K} \in \mathcal{N}(\tilde{y}, r) \mid X=x\right) \cdot \sum_{a_{1:K} \in V_{\tilde{y}}} \mathbb{P}_{w}\left(Y=y \mid A_{1:K}=a_{1:K}\right) \right).$$

Compared to the original Eq. 2, we add $Y=y$ and $X=x$, making it clearer that RNPC's prediction is a weighted sum over classes of their contributions to the predicted class $y$.
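To illustrate the class-wise integration, a minimal numpy sketch of this rephrased Eq. 2; the dictionary-based inputs and all names are our assumptions, not the paper's implementation:

```python
import numpy as np

def rnpc_predict(p_attr_given_x, neighborhoods, partitions, p_y_given_a, num_classes):
    """Class-wise integration of the rephrased Eq. 2 (unnormalized).

    p_attr_given_x: dict {attribute tuple a: P_theta(A_{1:K}=a | x)}.
    neighborhoods:  dict {class y~: set of tuples in N(y~, r)}.
    partitions:     dict {class y~: set of tuples in V_y~}.
    p_y_given_a:    dict {attribute tuple a: np.array of P_w(Y | A=a)}.
    """
    scores = np.zeros(num_classes)
    for y_tilde, region in neighborhoods.items():
        # Probability mass the recognition model places anywhere in N(y~, r).
        region_mass = sum(p_attr_given_x.get(a, 0.0) for a in region)
        # Circuit outputs aggregated over the core set V_y~.
        circuit_sum = sum(p_y_given_a[a] for a in partitions[y_tilde])
        scores += region_mass * circuit_sum
    return scores  # normalize to obtain the distribution over classes
```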

Q3: How does RNPC perform when confronted with adaptive attacks that craft adversarial perturbations capable of crossing the inter-class distance? If the authors could include an analysis or experiments to measure RNPC's resilience to such attacks, it would demonstrate the robustness and versatility of RNPC against different types of attacks.

A3: We answer this question with the following three steps.

  • Clarification on the role of the radius $r$.

    First, we'd like to clarify the relationship between RNPC's performance and the radius of the attribute set, denoted by $r := \lfloor \frac{d_{\min} - 1}{2} \rfloor$. As stated in Lines 167-170, when the number of attacked attributes is not greater than $r$, the perturbed probabilities are transferred to nodes within $\mathcal{N}(y^*, r)$, where $y^*$ denotes the ground-truth class. RNPC is designed to aggregate these probabilities via Eq. 2, thereby tolerating such perturbations. Theoretically, Lem. 4.6 ensures that the prediction perturbation $\Delta_{\theta, w}^{\mathrm{RNPC}}$ is small in this case. Empirically, Fig. 2 demonstrates that RNPC achieves high adversarial accuracy on the MNIST-Add3 ($r = 1$), MNIST-Add5 ($r = 2$), and CelebA-Syn ($r = 1$) datasets when a single attribute is attacked. These results validate RNPC's robustness when the number of attacked attributes is within $r$.

  • Define adaptive attacks for RNPC.

    Based on this understanding of the role of $r$, we can now define adaptive attacks for RNPC. Specifically, knowing the value of $r$, an adaptive attacker for RNPC will deliberately generate adversarial perturbations that target more than $r$ attributes. How, then, does RNPC perform against such adaptive attacks?

  • Evaluation against adaptive attacks.

    In fact, this question has been answered in our paper. Specifically, Fig. 3 (a-b) in Sec. 5.3 illustrates the performance of RNPC under varying numbers of attacked attributes. For your convenience, we summarize the results in the tables below. As shown in Table 14 and Table 15, RNPC consistently outperforms baseline models. In particular, when the number of attacked attributes is within $r$, the performance gap is significant. For instance, RNPC achieves up to 45% higher adversarial accuracy than the best-performing baseline on MNIST-Add3 when one attribute is attacked. As the number of attacked attributes exceeds $r$, RNPC's advantage narrows, yet it still maintains the highest robustness among all models.

Overall, on these datasets, compared to baseline models, RNPC exhibits strong robustness against the adaptive attacks that target more than $r$ attributes.

Table 14: Adversarial accuracy on MNIST-Add3 under the PGD attack with varying numbers of attacked attributes.

| #Attacked attributes | CBM | DCR | NPC | RNPC |
|---|---|---|---|---|
| 1 (=r) | 5.55 | 13.07 | 19.27 | 64.30 |
| 2 | 5.09 | 12.41 | 18.20 | 37.55 |
| 3 | 4.91 | 11.79 | 15.28 | 30.32 |

Table 15: Adversarial accuracy on MNIST-Add5 under the PGD attack with varying numbers of attacked attributes.

| #Attacked attributes | CBM | DCR | NPC | RNPC |
|---|---|---|---|---|
| 1 | 14.07 | 30.28 | 26.99 | 81.81 |
| 2 (=r) | 13.31 | 26.10 | 18.22 | 63.53 |
| 3 | 12.47 | 22.05 | 22.62 | 52.14 |
| 4 | 12.00 | 21.84 | 22.25 | 47.01 |
| 5 | 11.67 | 22.41 | 23.90 | 43.29 |

Q4: Would it be feasible to consider weighting the Hamming distance to reflect the relative importance or semantic impact of different attributes?

A4: Yes, our framework could definitely be extended to incorporate weights associated with each attribute when defining the distance between two attribute assignments.

Formally, let each attribute $A_i$ be assigned a weight $w_i$, such that $\sum_i w_i = K$. We can then define a weighted Hamming distance between two attribute assignments $a_{1:K}$ and $a_{1:K}'$ as:

$$d_w(a_{1:K}, a_{1:K}') = \sum_i w_i \, \mathbb{I}(a_i \neq a_i')$$

Based on this distance, we can define a new neighborhood:

$$\mathcal{N}_w(y, r) := V_y \cup \left\{ a_{1:K}^c \in V^c : \min_{a_{1:K} \in V_y} d_w\left(a_{1:K}^c, a_{1:K}\right) \leq r \right\}$$

This formulation allows certain attributes to be emphasized by assigning them higher weights, while others can be downweighted with smaller values of $w_i$. However, determining appropriate weights would still require some prior domain knowledge specified by humans. Thanks for this suggestion, which presents an interesting direction to extend in future work.
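A minimal sketch of this extension (the function names and the explicit `omega` attribute space are our illustrative assumptions); with uniform weights $w_i = 1$ it reduces to the ordinary Hamming neighborhood used in the paper:

```python
import numpy as np

def weighted_hamming(a, b, weights):
    """d_w(a, b) = sum_i w_i * I(a_i != b_i)."""
    a, b, weights = np.asarray(a), np.asarray(b), np.asarray(weights)
    return float(np.sum(weights * (a != b)))

def weighted_neighborhood(partitions, omega, y, r, weights):
    """N_w(y, r): V_y plus every complement assignment within weighted
    distance r of V_y. omega is the full attribute space; V^c is the
    part of omega not covered by any V_y."""
    core = partitions[y]
    v_complement = set(omega) - set().union(*partitions.values())
    near = {
        a for a in v_complement
        if min(weighted_hamming(a, v, weights) for v in core) <= r
    }
    return core | near
```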

Once again, thank you for your insightful questions, which help us clarify our work more thoroughly and explore promising future directions!

Comment

Thanks for your constructive comments! Given the approaching end of the discussion phase, we wonder whether the provided responses have addressed your questions. We greatly value your feedback, since it

  • allows us to explain our method with better clarity,
  • gives us the opportunity to demonstrate the robustness of RNPC against adaptive attacks,
  • helps us exploit new potential directions.

We would appreciate knowing whether these responses have satisfactorily addressed your concerns. Also, your acknowledgement of the comprehensiveness of our theoretical and empirical results has encouraged us a lot! Thanks again!

Comment

Dear Authors, thank you for the detailed answers. Your explanations addressed my main concerns about how the attribute partitions and neighborhoods are constructed to be non-overlapping, and detailed how RNPC's robustness behavior is well-characterized both within and beyond the radius $r$. This addresses my concern about neighborhood-crossing perturbations.

Comment

Thank you for your positive feedback!

We could truly feel your dedication and thoughtfulness in your comments, which have been a source of encouragement for us to respond as clearly and thoroughly as possible! We would sincerely like to know if our responses have resolved all your concerns. If you have any further questions, please feel free to ask. We would be more than happy to provide additional clarification!

We sincerely hope that our proposed method, together with the valuable input from all reviewers, can help advance the understanding and enhancement of robustness in concept bottleneck models.

Official Review (Rating: 5)

The authors study the robustness of neural probabilistic circuits (NPCs), a new class of bottleneck models comprising an attribute recognition model and a probabilistic circuit for reasoning. They analyze the robustness of NPC and show that it depends only on the robustness of the attribute recognition model. They propose a novel algorithm for training a robust attribute recognition model. Empirically, the proposed RNPC outperforms existing concept bottleneck models.

Strengths and Weaknesses

Strengths:

  • Novel analysis of robustness of NPC models, showing that their robustness depends only on the robustness of the attribute recognition model.
  • Robust class-wise integration, which improves the robustness of NPC models without expensive adversarial retraining.
  • Experimental evaluation shows improved robustness of RNPC model compared to the baseline methods.

Weaknesses:

  • RNPC is vulnerable to attack propagation. The authors acknowledge this issue in the limitations section but do not propose any way to mitigate it.
  • The theoretical analysis depends on the dataset having a sufficiently large minimum Hamming distance. If the distance is small and the number of attributes is large, the number of partitions can be exponential, leading to poor scalability of the proposed approach.
  • The experimental evaluation is limited to $\ell_\infty$-norm-bounded attacks.

Questions

  • How does the proposed approach for class-wise prediction scale to tasks with many attributes?
  • Does the RNPC method achieve similar robustness improvements against other types of adversarial attacks and natural corruptions? Can the authors include robustness results against corruptions and other types of adversarial attacks, e.g., $\ell_1$, $\ell_2$, stAdv?
  • How sensitive is RNPC to the parameter $r$ used for partitioning the attribute space?

Limitations

The authors extensively discussed limitations of the proposed work and suggested preliminary solutions to the limitations of RNPC method.

Final Justification

I thank the authors for their thorough and detailed responses, which addressed my main concerns and clarified several key aspects of their work. In particular:

  • Expanded Robustness Evaluation: The authors added experiments on robustness to $\ell_1$-norm attacks. Combined with the existing $\ell_2$-norm results from the appendix, this resolves my initial concern about the limited scope of adversarial evaluation and strengthens the empirical claims.
  • Clarification of Parameter $r$: The explanation of the dataset-dependent radius $r$, along with the accompanying ablation study on its impact, provides a much clearer understanding of this parameter.
  • Defense against Attack Propagation: The authors proposed and empirically validated an adversarial training approach to mitigate this vulnerability.

Considering these responses, along with clarifications provided to other reviewers, I maintain my positive assessment. My final recommendation remains the same: accept.

Formatting Concerns

No concerns.

Author Response

We sincerely thank the reviewer for the comprehensive feedback! We're glad you found our robustness analysis insightful (model robustness only depends on robustness of the attribute recognition model) and appreciated the efficiency and effectiveness of RNPC (only requiring inference-time changes)!

Q1: Can the authors include robustness results against corruptions and $\ell_1$- and $\ell_2$-norm-bounded attacks?

A1: We answer this question in two steps.

  • Point out existing results against $\ell_2$-norm-bounded attacks

    The evaluations in this paper are not limited to $\ell_\infty$-norm-bounded attacks. We performed extensive evaluations using both $\ell_2$-norm-bounded PGD and CW attacks with varying budgets. Results are presented in Appendix G.2 (Fig. 6&7).

  • Explore new results against $\ell_1$-norm-bounded attacks and corruptions

    • Setting: We adopt the EADL1 attack [1] with varying values of $\beta$, which controls the $\ell_1$ regularization strength (a larger value yields a sparser perturbation), and use different types of corruptions, respectively.
    • Results: As shown in Table 9, RNPC consistently outperforms other baselines across EADL1 attacks with different $\beta$ values, demonstrating the robustness of RNPC against $\ell_1$-norm-bounded attacks. For corruptions, as shown in Table 10, all models appear robust to defocus blur and Gaussian noise. CBM and DCR are notably vulnerable to fog, whereas RNPC still exhibits high robustness.

Overall, these additional results for corruptions and $\ell_1$-norm-bounded attacks, together with the results for $\ell_\infty$- and $\ell_2$-norm-bounded attacks provided in the paper, validate the robustness of RNPC under various types of perturbations.

[1] "Ead: elastic-net attacks to deep neural networks via adversarial examples." AAAI.

Table 9: Adversarial accuracy against EADL1 on MNIST-Add3.

| $\beta$ | CBM | DCR | NPC | RNPC |
|---|---|---|---|---|
| 0.0001 | 13.47 | 46.11 | 98.96 | 99.21 |
| 0.001 | 12.86 | 80.79 | 98.75 | 99.10 |
| 0.01 | 12.03 | 95.27 | 97.83 | 98.77 |

Table 10: Performance against corruption on MNIST-Add3.

| Corruption | CBM | DCR | NPC | RNPC |
|---|---|---|---|---|
| Defocus Blur | 99.11 | 98.53 | 99.35 | 99.37 |
| Gauss Noise | 96.06 | 95.36 | 99.10 | 99.11 |
| Fog | 42.80 | 45.95 | 81.71 | 83.72 |

Q2: How sensitive is RNPC to the parameter $r$ used for partitioning the attribute space?

A2: We answer this question in two steps.

  • Clarification on radius. We clarify that, as defined in Def. 4.2, $r$ denotes the radius of the dataset, which is determined by its intrinsic structure. This value is fixed unless the dataset itself is modified. To avoid confusion, we explicitly denote this radius as $r^*$. In contrast, it's possible to adjust the value of $r$ used in RNPC (see Eq. 2).
  • Explore the effect of changing $r$ on the performance of RNPC
    • Setting: We conduct this analysis using MNIST-Add5, which has 5 attributes and a radius of $r^* = 2$. We vary $r$ in the range [0, 5] and evaluate the performance of RNPC under an $\ell_\infty$-norm-bounded PGD attack with a perturbation budget of 0.11.
    • Benign Accuracy: As $r$ increases from 0 to 4, we observe a decreasing logit gap between the top-1 and top-2 predicted classes. Nevertheless, as shown in Table 11, RNPC consistently achieves near-perfect accuracy. When $r$ reaches 5, the accuracy drops to 29.9%. This is expected because, at $r = 5$, $\mathbb{P}_\theta(A_{1:K} \in \mathcal{N}(\tilde{y}, r) \mid X) = \mathbb{P}_\theta(A_{1:K} \in \Omega \mid X) = 1$, meaning Eq. 2 no longer contains any information from the input $X$.
    • Adversarial Accuracy: When $r$ deviates from $r^*$, the adversarial accuracy declines. Despite this drop, RNPC ($r \neq 5$) still outperforms other baseline models, which achieve below 40% accuracy (see Fig. 2 (middle)). Notably, when $r = 5$, adversarial accuracy drops to 29.9%, matching the benign accuracy and thus indicating a prediction perturbation of $\Delta_{\theta, w}^{\mathrm{RNPC}} = 0$. This aligns with theoretical results, because in this case, the right-hand side of Lem. 4.6 becomes zero.

Overall, when $r \neq$ #attributes, RNPC demonstrates strong resilience to changes in $r$ under benign settings; while adversarial accuracy declines when $r$ deviates from $r^*$, RNPC remains superior to baseline models.

Table 11: Performance of RNPC with various $r$ on MNIST-Add5

| | $r=0$ | $r=1$ | $r=2=r^*$ | $r=3$ | $r=4$ | $r=5$ |
|---|---|---|---|---|---|---|
| Benign Acc | 99.53 | 99.53 | 99.51 | 99.46 | 99.24 | 29.90 |
| Adv Acc | 71.84 | 81.53 | 81.86 | 78.47 | 60.56 | 29.90 |
| Acc Drop | 27.68 | 18.00 | 17.65 | 20.99 | 38.68 | 0.00 |
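For reference, the dataset radius $r^* = \lfloor (d_{\min}-1)/2 \rfloor$ discussed above can be computed directly from the class partitions; a minimal sketch, assuming a hypothetical `partitions` mapping from classes to sets of attribute tuples (our naming, not the paper's code):

```python
from itertools import combinations

def hamming(a, b):
    """Plain Hamming distance between two attribute tuples."""
    return sum(x != y for x, y in zip(a, b))

def dataset_radius(partitions):
    """r* = floor((d_min - 1) / 2), where d_min is the minimum Hamming
    distance between attribute assignments of two different classes."""
    d_min = min(
        hamming(a, b)
        for v_y, v_z in combinations(partitions.values(), 2)
        for a in v_y
        for b in v_z
    )
    return (d_min - 1) // 2
```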

Q3: The theoretical analysis depends on the dataset having a sufficiently large minimum Hamming distance. If the distance is small and the number of attributes is large, the number of partitions can be exponential. How does the proposed approach scale to tasks with many attributes?

A3: We answer this question in three steps.

  • Clarifications

    First, we clarify that the theoretical results do not assume a specific radius $r$ or minimum Hamming distance $d_{\min}$. The results hold for any non-negative radius $r \geq 0$, which corresponds to $d_{\min} \geq 1$. Technically, however, larger values of $r$ (and thus larger $d_{\min}$) are expected to provide higher robustness.

    Second, the number of partitions is determined solely by the number of classes in the dataset, i.e., $|\{V_y\}| = |\mathcal{Y}|$, and is independent of both $d_{\min}$ and the number of attributes $K$.

  • Scaling to large $K$

    We interpret the reviewer's concern as follows: when $K$ increases, the size of the attribute space $\prod_{k=1}^K |\mathcal{A}_k|$ grows exponentially. How does RNPC handle this scaling?

    As shown in Prop. 4.4, the computational complexity of RNPC is:

$$O\left(\sum_{k=1}^K |f_k| + |S| \cdot |V|\right)$$

Here, $|\mathcal{Y}| \leq |V| = \sum_{y \in \mathcal{Y}} |V_y| \leq \prod_{k=1}^K |\mathcal{A}_k|$. Therefore, although $|V|$ can be exponential in $K$ in the worst case, it can also be as small as $|\mathcal{Y}|$ (when every class has only one high-probability attribute assignment), depending on the dataset's structure. In contrast, the computational complexity of NPC is always exponential in $K$.

  • More discussion

    In fact, scaling remains an open challenge in neuro-symbolic models, especially those built upon graphical models. Some recent work explores approximation-based solutions. We will include this point in the discussion section.

Nevertheless, as demonstrated in Prop. 4.4, RNPC significantly reduces computational complexity compared to NPC, with improved inference time shown in Table 2 of Appendix E.

Q4: The authors discuss the vulnerability of RNPC to attack propagation in the Limitations section but do not propose ways to mitigate it.

A4: In Appendix C, we provide a detailed discussion of strategies to mitigate the attack propagation effect. Here, we also introduce an adversarial training–based approach for mitigation and evaluate its effectiveness on GTSRB-Sub, where spurious correlations between attributes are strong and result in significant attack propagation (see Fig. 3(d)).

  • Intuition

    Suppose a model learns a spurious correlation between two attribute values, e.g., red color and circle shape. That is, the model may rely on features of "red" to identify "circle". To mitigate this correlation, we need to disentangle their features. Specifically, by generating adversarial examples that perturb the color attribute, we can shift the feature representation of "red" toward another color, thereby weakening its connection with "circle".

  • Method and Evaluation

    Following this idea, during training, we generate adversarial samples that target a random attribute and train the model on both these samples and benign inputs. Table 13 reports the performance of the resulting models. Compared to models trained without adversarial training (Table 12), the adversarially trained models show significantly smaller drops in the accuracy of different attributes when color is attacked, which indicates a reduction in the attack propagation effect. Consequently, the advantage of RNPC over NPC becomes evident again.

Overall, the adversarial-training-based approach leverages adversarial perturbations during training to disentangle spurious correlations between attributes, which effectively mitigates the attack propagation effect and enhances the robustness of RNPC on real-world datasets.

Table 12: Col 2-4: Relative accuracy drop (%) of different attributes when color is attacked by PGD. Col 5-6: Task accuracy (%) of NPC and RNPC on GTSRB-Sub with a standard attribute recognition model.

| Budget | $\Delta$Shape | $\Delta$Symbol | $\Delta$Text | NPC | RNPC |
|---|---|---|---|---|---|
| 0.07 | 30.3 | 70.7 | 43.7 | 28.4 | 31.8 |
| 0.09 | 40.0 | 79.9 | 49.3 | 17.8 | 21.6 |
| 0.11 | 45.5 | 87.1 | 53.9 | 10.9 | 12.9 |

Table 13: Results with an adversarially trained attribute recognition model

| Budget | $\Delta$Shape | $\Delta$Symbol | $\Delta$Text | NPC | RNPC |
|---|---|---|---|---|---|
| 0.07 | 6.4 | 5.6 | 6.3 | 76.2 | 78.4 |
| 0.09 | 10.0 | 7.9 | 8.5 | 62.4 | 69.3 |
| 0.11 | 13.6 | 11.2 | 10.6 | 50.5 | 60.0 |

Once again, thanks for your questions, which lead us to understand and evaluate our method more comprehensively!

Comment

Thanks for your constructive comments! Given the approaching end of the discussion phase, we wonder whether the provided responses have addressed your questions. We greatly value your feedback, since it gives us the opportunity to validate the robustness of RNPC under various types of perturbations! We also greatly appreciate your acknowledgment of the novelty of our theoretical robustness analysis. Thanks again!

Comment

Dear authors, thanks for your detailed explanations. Your rebuttal clarifies my main concerns regarding attack propagation and addresses my misunderstandings of some of your results, in particular the results for multi-norm robustness.

Official Review (Rating: 5)

This paper proposes a robust approach to tackle adversarial attacks on Neural Probabilistic Circuits (NPCs). While NPCs produce interpretable results through an attribute recognition model, a carefully designed attack based on the attributes can reduce their performance. Therefore, this paper adopts a different inference strategy to tackle adversarial attacks without sacrificing benign accuracy. The theoretically grounded inference approach is validated on four different datasets under adversarial attacks, with superior performance.

优缺点分析

The strengths of the paper are

1. Well-written: The paper is well-written with a clear motivation behind designing a robust NPC inference method. The distinction between NPC and RNPC is highlighted in the overview figure. Intuitions and theoretical discussions are also well-crafted and aptly put in the paper.

2. Good results: The method shows comparable performance under benign settings and significantly improved performance under adversarial settings, bolstering the hypothesis.

3. Proper ablation studies: The paper performs adequate ablation studies to further verify their hypothesis. Impact of the number of attributes is studied and a future research direction is proposed based on the spurious correlation shown in Fig 5 (d).

The weaknesses of the paper are:

1. No ablation study on the neighbourhood radius. The weighted inference is based on a neighbourhood of radius $r$. It would be interesting to see how the performance varies with this radius.

2. Is RNPC a generalized version of NPC? Is there any way to collapse the RNPC formulation to the NPC one? Under which conditions will Eqn (2) be identical to Eqn (1)?

Questions

See weaknesses

Limitations

N/A

Final Justification

In the provided rebuttal, the authors have successfully addressed both of my questions. Therefore, I increase my rating to 'Accept'.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for the constructive feedback! We're glad that you found the paper well-written with clear motivation and sound theoretical analysis, the empirical results compelling, and the ablation studies effective in validating our hypothesis and highlighting meaningful future directions!

Q1: The proposed inference is based on neighbourhoods of radius $r$. It would be interesting to see how the performance of RNPC varies with this radius.

A1: We answer this question with the following steps.

  • Clarification on radius. First, we'd like to clarify that, as defined in Def. 4.2, $r$ denotes the radius of the dataset, which is determined by its intrinsic structure. This value is fixed unless the dataset itself is modified. To avoid confusion, we explicitly denote this radius as $r^*$. In contrast, it's possible to adjust the value of $r$ used in RNPC (see Eq. 2).
  • Explore the effect of changing $r$ on the performance of RNPC.
    • Setting: We conduct this analysis using the MNIST-Add5 dataset, which has 5 attributes and a radius of $r^* = 2$. We vary $r$ in the range [0, 5] and evaluate both the benign accuracy and the adversarial accuracy of RNPC under an $\ell_\infty$-norm-bounded PGD attack with a perturbation budget of 0.11. Results are provided in Table 8.
    • Benign Accuracy: As $r$ increases from 0 to 4, we observe a decreasing logit gap between the top-1 and top-2 predicted classes. Nevertheless, as shown in Table 8, RNPC consistently achieves near-perfect accuracy. However, when $r$ reaches 5, the accuracy drops to 29.9%. This is expected because, at $r = 5$, we have $\mathbb{P}_\theta(A_{1:K} \in \mathcal{N}(\tilde{y}, r) \mid X) = \mathbb{P}_\theta(A_{1:K} \in \Omega \mid X) = 1$, meaning Eq. 2 no longer contains any information from the input $X$.
    • Adversarial Accuracy: When $r \leq r^*$ ($r \geq r^*$), reducing $r$ (increasing $r$) results in worse performance. That is, deviating from the dataset radius $r^*$ leads to a drop in adversarial accuracy. Despite this drop, RNPC ($r \neq 5$) still outperforms other baseline models, which achieve below 40% accuracy (see Fig. 2 (middle)). Notably, when $r = 5$, adversarial accuracy drops to 29.9%, matching the benign accuracy and thus indicating a prediction perturbation of $\Delta_{\theta, w}^{\mathrm{RNPC}} = 0$. This aligns with our theoretical result, because in this case, the right-hand side of Lem. 4.6 becomes zero.

Overall, when $r \neq$ #attributes, RNPC demonstrates strong resilience to changes in $r$ under benign settings; while adversarial accuracy declines when $r$ deviates from $r^*$, RNPC remains superior to baseline models.

Table 8: Performance of RNPC with various $r$ on MNIST-Add5.

| | $r=0$ | $r=1$ | $r=2=r^*$ | $r=3$ | $r=4$ | $r=5$ |
|---|---|---|---|---|---|---|
| Benign Acc | 99.53 | 99.53 | 99.51 | 99.46 | 99.24 | 29.90 |
| Adv Acc | 71.84 | 81.53 | 81.86 | 78.47 | 60.56 | 29.90 |
| Acc Drop | 27.68 | 18.00 | 17.65 | 20.99 | 38.68 | 0.00 |

Q2: Under which conditions will Eq. 2 be identical to Eq. 1? Is RNPC a generalized version of NPC?

A2: We answer this question in two steps. First, we directly compare $\mathbb{P}_{\theta, w}(Y \mid X)$ (Eq. 1) with $\Phi_{\theta, w}(Y \mid X)$ (Eq. 2), showing why RNPC is not a generalized version of NPC. Second, we provide a more accurate comparison between $\mathbb{P}_{\theta, w}(Y \mid X)$ (Eq. 1) and the normalized probability distribution $\hat{\Phi}_{\theta, w}(Y \mid X)$ (Line 183), showing when these two distributions are the same.

Step 1: Under Assump. 3.2 and the assumption that $\Omega = \bigcup_{\tilde{y}} \mathcal{N}(\tilde{y}, r)$, we can write Eq. 1 as follows:

$$\mathbb{P}_{\theta, w}\left(Y=y \mid X=x\right) = \sum_{a_{1:K}} \mathbb{P}_{\theta}\left(A_{1:K}=a_{1:K} \mid X=x\right) \cdot \mathbb{P}_{w}\left(Y=y \mid A_{1:K}=a_{1:K}\right) \\ = \sum_{\tilde{y}} \sum_{a_{1:K} \in \mathcal{N}(\tilde{y}, r)} \mathbb{P}_{\theta}\left(A_{1:K}=a_{1:K} \mid X=x\right) \cdot \mathbb{P}_{w}\left(Y=y \mid A_{1:K}=a_{1:K}\right)$$

Meanwhile, Eq. 2 is:

$$\Phi_{\theta, w}(Y \mid X) = \sum_{\tilde{y} \in \mathcal{Y}} \left( \mathbb{P}_{\theta}\left(A_{1:K} \in \mathcal{N}(\tilde{y}, r) \mid X\right) \cdot \sum_{\tilde{a}_{1:K} \in V_{\tilde{y}}} \mathbb{P}_{w}\left(Y \mid A_{1:K}=\tilde{a}_{1:K}\right) \right) \\ = \sum_{\tilde{y} \in \mathcal{Y}} \sum_{a_{1:K} \in \mathcal{N}(\tilde{y}, r)} \left( \mathbb{P}_{\theta}\left(A_{1:K}=a_{1:K} \mid X\right) \cdot \sum_{\tilde{a}_{1:K} \in V_{\tilde{y}}} \mathbb{P}_{w}\left(Y \mid A_{1:K}=\tilde{a}_{1:K}\right) \right)$$

These two equations differ only in their final terms. Specifically, the final term in NPC is independent of the class $\tilde{y}$, while the final term in RNPC depends on $\tilde{y}$, reflecting the class-wise integration.

Therefore, RNPC is not a generalized version of NPC; they represent fundamentally different ways of integrating over attribute probabilities.

Step 2: Consider a scenario where we learn an optimal attribute recognition model and an optimal probabilistic circuit, i.e., $\mathbb{P}_{\theta}(A_{1:K} \mid X) = \mathbb{P}^*(A_{1:K} \mid X)$ and $\mathbb{P}_{w}(Y \mid A_{1:K}) = \mathbb{P}^*(Y \mid A_{1:K})$. Consequently, we have $\mathbb{P}_{\theta, w}(Y \mid X) = \mathbb{P}^*(Y \mid X)$ and $\hat{\Phi}_{\theta, w}(Y \mid X) = \hat{\Phi}^*(Y \mid X)$ (refer to Prop. 4.8).

Then, the comparison between $\mathbb{P}_{\theta, w}(Y \mid X)$ and $\hat{\Phi}_{\theta, w}(Y \mid X)$ reduces to evaluating the difference between $\mathbb{P}^*(Y \mid X)$ and $\hat{\Phi}^*(Y \mid X)$.

As shown in Thm. 4.11, the expected total variation distance between these two distributions is upper bounded by:

$$\mathbb{E}_X\left[ \max_{\tilde{y}} \; d_{\mathrm{TV}}\left( \bar{\mathbb{P}}^*(Y \mid A_{1:K} \in V_{\tilde{y}}),\; \mathbb{P}^*(Y \mid X) \right) \right]$$

The detailed proof is provided in Appendix H.4, specifically Lines 1033-1038.

Hence, when this bound is zero and both modules are optimally learned, we have $\mathbb{P}_{\theta, w}(Y \mid X) = \hat{\Phi}_{\theta, w}(Y \mid X)$, i.e., the predicted distributions of NPC and RNPC are the same.

Once again, thanks for your questions, which lead us to better illustrate our method and interpret our theoretical results!

Comment

Thanks for your constructive comments! Given the approaching end of the discussion phase, we wonder whether the provided responses have addressed your questions. We greatly value your feedback, since

  • your first question gives us the opportunity to show that the value of $r$ used in our method turns out to be optimal,
  • your second question allows us to use our theoretical results to gain a deeper understanding of the relationship between NPC and RNPC.

We would appreciate knowing whether these responses have satisfactorily addressed your concerns. Also, we're greatly encouraged by your acknowledgement of our writing ("Motivations, intuitions and theoretical discussions are well-crafted and aptly put in the paper.")! Thanks again!

Comment

Thanks for your detailed answer to both of my questions. I believe the analysis regarding $r$ should be included in the main paper if possible, or at least in the appendix. Also, my confusion regarding the generalizability of RNPC has been properly addressed. Therefore, I raise my rating to 'Accept'.

Comment

Thank you for your positive feedback! We will definitely include the analysis regarding $r$ in the paper, which enriches our discussion of the proposed method. Thanks again for this great suggestion! If you have any further questions or concerns regarding our paper, please feel free to let us know.

Official Review (Rating: 5)

The authors propose a new version of Neural Probabilistic Circuits (NPCs) which is robust against adversarial attacks. First, the authors establish that the vulnerability of NPC to adversarial attacks is due to how the outputs of the attribute recognition modules are combined to generate the prediction label. Once an adversary fools enough attribute classifiers, the downstream probabilistic circuit will end up outputting the incorrect prediction label. Based on this insight, the authors propose a new way of combining the outputs of the attribute recognition modules that makes it less susceptible to adversarial attacks. Authors provide theoretical and empirical evidence to demonstrate that this new version of NPC, called RNPC, preserves performance on benign inputs while significantly boosting performance on adversarial inputs.

Strengths and Weaknesses

Strengths

  1. The paper is well-written and easy to follow. The authors motivate the problem clearly, and provided an eloquent description of the proposed method (RNPC).

  2. Authors provide detailed theoretical evidence to justify the effectiveness of RNPC. All theorems are adequately defined and backed up with proofs. All assumptions and limitations are clearly stated.

  3. RNPC only requires inference-time changes, and therefore has much better usability than training-based defenses.

  4. Authors discuss several key ablation experiments that help understand RNPC in greater detail. For example, the discussion on spurious correlations demonstrated how certain attributes can be attacked together based on the characteristics of the dataset. This highlights a key vulnerability of CBMs.

  5. Empirical results demonstrate that RNPC yields strong robustness compared to baseline while reducing inference time.

Weaknesses

  1. Authors don't systematically analyze the attack propagation phenomenon. Specifically, they don't show whether there is a direct correlation between the co-occurrence of attributes in a dataset and the possibility of attack propagation between them.

  2. The proposed defense scales poorly with the complexity of the dataset. This is evident in Fig. 3(c), where NPC and RNPC have similar robustness on GTSRB-Sub. Authors explain that this is because of the attack propagation effect. It appears that attack propagation becomes more severe as dataset complexity increases. Therefore, the proposed defense seems to be applicable only to small-scale toy datasets like MNIST.

  3. Authors back up key findings using results on small-scale toy datasets only. Results on more complex datasets are either lacking or not convincing.

  4. Authors propose remedies to mitigate the attack propagation effect in the Appendix. However, there is no result available that can provide evidence for whether these remedies will be effective. For example, adding synthetic training data may prevent the model from learning spurious correlations, but it may also reduce the model's performance on benign data as these spurious correlations may be driving some of the performance in the benign setting. It is important to develop a comprehensive understanding of how problematic attack propagation really is, which the paper doesn't provide.

  5. The authors only consider an adversary that targets the attribute recognition module. However, an adversary can also target the end-to-end RNPC prediction pipeline. Authors don't consider such an adaptive attack scenario in the paper. Strong robustness guarantees in settings where the adversary is oblivious of the defense are not as impactful.

  6. The authors present results using a very restrictive adversarial budget ($\epsilon$). When attacking traditional MNIST neural network classifiers, $\epsilon = 3.0$ is used. But the authors use $\epsilon = 0.11$ at most for MNIST-Add3 and MNIST-Add5. This dilutes the impact of the proposed defense.

Questions

  1. What if we train the recognition module adversarially? How will this affect the robustness of the NPC and the RNPC inference methods?
  2. How do NPC and RNPC compare against each other when attacking the end-to-end prediction pipeline?
  3. Can we predict which attributes can be attacked together based on how often attributes co-occur in the training data?

Limitations

The authors have adequately discussed limitations of the proposed method in the paper.

Final Justification

Authors have addressed several of my comments adequately. Therefore, I am increasing my rating by 1.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for the thoughtful feedback! We're glad that you found our paper well-written, the theoretical analysis sound, the proposed method effective (only requiring inference-time changes), and the ablation studies insightful!

Q1: How does training the recognition module adversarially affect the robustness of NPC and RNPC?

A1: To explore the impact of adversarial training, during training we use the PGD attack with an $\ell_\infty$ bound of 0.11 to generate adversarial samples that perturb the recognition module's predictions, and we train the recognition module on both adversarial and benign samples.

Denote the NPC (RNPC) with an adversarially trained recognition module as NPC-AT (RNPC-AT). We evaluate performance on MNIST-Add3 under PGD attacks targeting the first attribute. Results are provided in Table 3.

  • AT significantly improves the robustness of the recognition module.

    Without AT, the accuracy of predicting the first attribute declines from 99.3% to 2.2% after the attack. In contrast, with AT, this accuracy merely drops to 85%.

  • AT also improves the robustness of both NPC and RNPC.

    Without AT, the task accuracy of NPC and RNPC declines to 19% and 64% after the attack. In contrast, with AT, their accuracy remains as high as 85.8% and 86.3%, respectively. This is attributed to the fact (Thm. 3.4 and Lem. 4.6) that the robustness of NPC and RNPC is only bounded by the robustness of the recognition module.

  • AT leads to a slight performance drop in the benign setting.

    When using AT, the model's benign task accuracy drops by less than 0.5%.

Overall, adversarial training significantly strengthens the robustness of both NPC and RNPC by improving the robustness of their recognition module, and RNPC is still more robust than NPC thanks to the class-wise integration.

Table 3: Accuracy of predicting the attacked attribute ("Attr1 Acc") and class labels ("Task Acc").

| Setting | Model | Attr1 Acc | Task Acc |
|---|---|---|---|
| No attack | NPC | 99.3 | 99.3 |
| | RNPC | 99.3 | 99.4 |
| | NPC-AT | 99.0 | 99.0 |
| | RNPC-AT | 99.0 | 99.2 |
| PGD ($\ell_\infty = 0.11$) | NPC | 2.2 | 19.2 |
| | RNPC | 2.2 | 64.3 |
| | NPC-AT | 85.0 | 85.8 |
| | RNPC-AT | 85.0 | 86.3 |
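For concreteness, a minimal PyTorch-style sketch of an $\ell_\infty$-bounded PGD attack like the one used above; the step size, iteration count, and random start are our assumptions rather than the paper's exact configuration. For adversarial training, `loss_fn` would be the attribute-prediction loss of the recognition module, and each batch would mix these adversarial samples with benign ones:

```python
import torch

def pgd_linf(model, x, y, loss_fn, eps=0.11, alpha=0.02, steps=20):
    """l_inf-bounded PGD: repeatedly ascend the loss and project the
    perturbation back into the eps-ball around the clean input."""
    x_adv = x.clone().detach()
    # Random start inside the eps-ball, as in standard PGD.
    x_adv = torch.clamp(x_adv + torch.empty_like(x_adv).uniform_(-eps, eps), 0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)                   # valid pixel range
    return x_adv.detach()
```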

Q2: How do NPC and RNPC compare against each other when attacking the end-to-end prediction pipeline?

A2: We begin by clarifying our threat model. NPC is a compositional model that integrates a graphical model, specifically a probabilistic circuit, and the only component with a black-box nature (which may be pretrained and obtained from a third-party source) is the attribute recognition model. Therefore, we consider a threat model in which the adversary targets the attribute recognition model, rather than the entire pipeline.

Next, to explore the impact of attacking the entire pipeline, we apply the PGD attack with varying $\ell_\infty$ bounds to the end-to-end prediction pipelines of NPC and RNPC. In addition to these compositional models, we also train an end-to-end ViT-B/32 model and attack it for comparison. Results are provided in Table 4.

Similar performance is observed for NPC and RNPC, which is reasonable since the adversarial samples are now crafted to perturb the final class predictions. This setting doesn't align with RNPC's design objective, i.e., to robustly integrate incorrect attribute predictions with the outputs of the probabilistic circuit. Despite the performance drop, both NPC and RNPC perform much better than ViT, highlighting the advantage of compositional models. This finding also aligns with the one in [1].

Overall, both NPC and RNPC are vulnerable to attacks targeting the entire pipelines; however, they exhibit much higher robustness than the end-to-end black-box model.

[1] Exploring the impact of conceptual bottlenecks on adversarial robustness of deep neural networks. IEEE Access. 2024.

Table 4: Task accuracy against attacks targeting the end-to-end pipelines.

| Task Acc | NPC | RNPC | ViT |
|---|---|---|---|
| No Attack | 99.3 | 99.4 | 99.5 |
| PGD ($\ell_\infty = 0.05$) | 91.1 | 91.0 | 2.3 |
| PGD ($\ell_\infty = 0.11$) | 20.4 | 19.4 | 0.0 |

Q3: Can we predict which attributes will be attacked together based on how often they co-occur in the training data?

A3: We answer the question using three steps.

  • Quantify the co-occurrence between attributes

    We use the mutual information (MI) metric, defined as $MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \left( \frac{p(x, y)}{p(x)\, p(y)} \right)$. A higher MI indicates a stronger dependency between two attributes. Table 5 displays the MI between color and the other attributes. The attribute with the highest MI is symbol.

  • Measure the influence on other attributes when a specific attribute is attacked

    Table 5 also reports the relative accuracy drop of other attributes when color is attacked. The attribute with the largest accuracy drop is also symbol.

  • Assess the correlation between these metrics

    To statistically assess this correlation, we use the Pearson correlation coefficient (PCC), defined as $\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$. We observe a PCC of 0.92, indicating a strong positive linear correlation between MI and accuracy drop.

Overall, these results suggest that it's possible to utilize the mutual information in the training data to find out which attributes are more likely to be affected when a specific attribute is attacked.

Table 5: Mutual information between color and other attributes, and relative accuracy drop when color is attacked on GTSRB-Sub.

| PCC = 0.92 | Shape | Symbol | Text |
|---|---|---|---|
| MI | 0.346 | 0.757 | 0.222 |
| Acc Drop | 45.7% | 87.2% | 54.0% |
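A minimal numpy sketch of this diagnostic, assuming attributes are given as integer-coded label arrays (the names and input format are ours):

```python
import numpy as np

def mutual_information(x, y):
    """MI(X, Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ),
    estimated from empirical counts of two integer-coded label arrays."""
    x, y = np.asarray(x), np.asarray(y)
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # zero cells contribute 0 * log(0) = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Pearson correlation between per-attribute MI and accuracy drop,
# plugging in the values reported in Table 5.
mi = np.array([0.346, 0.757, 0.222])       # Shape, Symbol, Text
acc_drop = np.array([45.7, 87.2, 54.0])
pcc = np.corrcoef(mi, acc_drop)[0, 1]      # approx 0.92
```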

Q4: Authors discuss remedies to mitigate the attack propagation effect in appendices. Is there any empirical evidence?

A4: In Appendix C, we provide a detailed discussion of strategies to mitigate the attack propagation effect. Here, inspired by your previous question, we introduce an adversarial training–based approach for mitigation and evaluate its effectiveness on GTSRB-Sub, where spurious correlations between attributes are strong and result in significant attack propagation (see Fig. 3(d)).

  • Intuition

    Suppose a model learns a spurious correlation between two attribute values, e.g., red color and circle shape. That is, the model may rely on features of "red" to identify "circle". To mitigate this correlation, we need to disentangle their features. Specifically, by generating adversarial examples that perturb the color attribute, we can shift the feature representation of "red" toward another color, thereby weakening its connection with "circle".

  • Method and Evaluation

    Following this idea, during training, we generate adversarial samples that target a random attribute and train the model on both these samples and benign inputs. Table 7 reports the performance of the resulting models. Compared to models trained without adversarial training (Table 6), the adversarially trained models show significantly smaller drops in accuracy of different attributes when color is attacked, which indicates a reduction in the attack propagation effect. Consequently, the advantage of RNPC over NPC becomes evident again.

Overall, the adversarial-training-based approach leverages adversarial perturbations during training to disentangle spurious correlations between attributes, which effectively mitigates the attack propagation effect and enhances the robustness of RNPC on real-world datasets.

Table 6: Col 2-4: Relative accuracy drop (%) of different attributes when color is attacked by the PGD attack. Col 5-6: Task accuracy (%) of NPC and RNPC on GTSRB-Sub with a standard attribute recognition model.

| Budget | $\Delta$Shape | $\Delta$Symbol | $\Delta$Text | NPC | RNPC |
|---|---|---|---|---|---|
| 0.07 | 30.3 | 70.7 | 43.7 | 28.4 | 31.8 |
| 0.09 | 40.0 | 79.9 | 49.3 | 17.8 | 21.6 |
| 0.11 | 45.5 | 87.1 | 53.9 | 10.9 | 12.9 |

Table 7: Col 2-4: Relative accuracy drop (%) of different attributes when color is attacked by the PGD attack. Col 5-6: Task accuracy (%) of NPC and RNPC on GTSRB-Sub with an adversarially trained attribute recognition model.

| Budget | $\Delta$Shape | $\Delta$Symbol | $\Delta$Text | NPC | RNPC |
|---|---|---|---|---|---|
| 0.07 | 6.4 | 5.6 | 6.3 | 76.2 | 78.4 |
| 0.09 | 10.0 | 7.9 | 8.5 | 62.4 | 69.3 |
| 0.11 | 13.6 | 11.2 | 10.6 | 50.5 | 60.0 |

Q5: The authors use a small adversarial budget ($\epsilon = 0.11$) for MNIST datasets compared to a standard value of 3.0.

A5: We'd like to clarify that the commonly cited value of 3.0 applies to $\ell_2$-norm-bounded attacks, not $\ell_\infty$-norm-bounded ones. Fig. 2 in Sec. 5.2 corresponds to the latter. In fact, we did use $\ell_2$-norm-bounded PGD attacks with $\epsilon$ ranging from 3.0 to 11.0 on MNIST datasets. Please refer to Fig. 6 of Appendix G.2 for detailed results.

Once again, thanks for your insightful questions, which inspire us to better understand our method and propose effective solutions to potential limitations!

Comment

Thanks for your constructive comments! Given the approaching end of the discussion phase, we wonder whether the provided responses have addressed your questions. We greatly value your feedback, since it

  • helps us demonstrate that the robustness of RNPC can be further enhanced through adversarial training,
  • motivates us to investigate the mechanism behind attack propagation and explore ways to mitigate it.

We are inspired a lot by your questions, and we would appreciate it if our responses have satisfactorily addressed your concerns.

Also, your acknowledgement of the soundness of our theoretical results (“All theorems are adequately defined and backed up with proofs. All assumptions and limitations are clearly stated.”) has encouraged us a lot! Thanks again!

Final Decision

The paper aims to analyze and improve the adversarial robustness of concept bottleneck models, specifically neural probabilistic circuits (NPCs). The paper provides theoretical evidence that the adversarial robustness of an NPC depends primarily on the attribute recognition model rather than on the probabilistic circuit, which leads to the conception of robust NPCs (RNPCs) through inference-stage changes alone. This provides merit over defenses rooted in training. Beyond these key theoretical strengths, the paper is well written, clearly presented, and features several empirical verifications, including rigorous ablation studies. A main weakness can perhaps be attributed to the limited scale of the experimentation, which relies largely on small MNIST-like datasets. Given the scope of the work and the present inherent limitations of NPCs, this seems reasonable.

Reviewers were generally favorable toward the paper from the start. All agree that the method is valuable, the theoretical contributions are substantial, and the empirical investigation is mostly satisfactory. A few initial concerns were pointed out, including concerns over hyperparameter choices, the lack of larger datasets, the method's vulnerability to attack propagation, and some theoretical caveats on the assumptions regarding the neighborhood and Hamming distance for specific dataset configurations. The rebuttal resolved some of these concerns, specifically the questions regarding the empirical procedures and the assumptions made. The vulnerability to attack propagation is explicitly acknowledged in the limitations section of the paper, and reviewers are generally satisfied with this point not being resolved immediately. The AC agrees with this take, namely that not all scenarios in various complicated set-ups can be addressed immediately, and agrees with the reviewers on the general value of the contribution.

The AC recommends accepting the paper.