PaperHub
Overall: 7.6/10 · Poster · 3 reviewers (ratings 5, 4, 5; min 4, max 5, std 0.5)
Confidence: 2.7 · Novelty: 3.7 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.7
NeurIPS 2025

Collaborative and Confidential Junction Trees for Hybrid Bayesian Networks

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Bayesian Networks · Collaboration · Secure Multi-Party Computation

Reviews and Discussion

Review
Rating: 5

Bayesian networks are a way to represent distributions over multivariate data: each node is a variable, a directed edge between nodes indicates dependence, and the network forms a DAG. Collaborative Bayesian networks arise when multiple parties each determine an individual Bayesian network, possibly with some common variables; the goal is to combine the networks and perform inference on the combined Bayesian network. We may also require that agents cannot see each other's private local variables.

Junction trees are a standard technique for performing inference over (non-collaborative) Bayesian networks by clustering factors in a graphical model (e.g. a Bayesian network) into nodes called cliques. Within each clique one may apply variable elimination, followed by message passing across cliques for exact inference; the tree structure ensures that message passing converges in a single pass.

This paper considers the problem of performing inference over a collaborative Bayesian network, containing both continuous and discrete variables, while preserving the confidentiality of individual parties' local variables and their distributions.

The proposed solution is a custom protocol to build a junction tree that creates two cliques for each agent, one with all of the agent's discrete variables and one with all of the agent's continuous variables, plus two central cliques, one for communal discrete variables and one for communal continuous variables; these are connected so as to ensure that no discrete node has a continuous parent.

The method is applied to 9 different datasets synthetically generated from known Bayesian network models and shows improvements over baselines.

Strengths and Weaknesses

I am familiar with graphical models and junction trees, but I am not familiar with all the topics covered in the paper (e.g. SMC) and may have misunderstood details; I apologize in advance for any mistakes.

In general, I am positive about the paper and the approach seems reasonable. My comments are mostly minor.

Strengths

  • The idea seems intuitive, clean, and simple.

Weaknesses

  • There is no mention of Multiply Sectioned Bayesian Networks; presumably one could augment these with HE for interface cliques and perform loopy belief propagation? Or do the continuous variables require a tree structure and strong rooting, so such a method cannot be used?
  • The interface cliques must contain shared variables as well as all threshold variables (both private and shared); is there a privacy leak, or can we just treat private threshold variables as invisible to irrelevant agents (see questions)?
  • It is stated that the two-agent case allows each agent to infer the other agent's distribution over shared variables; is this a slight data leak?

https://ieeexplore.ieee.org/document/858473

  • Inherits the same exponential cost in the largest clique as junction trees

Questions

  • How does it work with 3+ agents? If I understand correctly, we make a star (hub-and-spoke) network of discrete cliques with the interface at the hub and one agent per spoke, a similar hub-and-spoke network for the continuous-variable cliques, and link the two hubs?

  • If the interface nodes are the only link from discrete to continuous variables, for an agent with a private discrete parent and private continuous child node, the necessary (private) message must still pass through the interface nodes. How are private threshold variables preserved if they still pass through the interface node? Is it safe to assume that they can be encrypted/ignored by all other agents as they are not required and just pass through? If so, is this equivalent to appending a private continuous clique to each private discrete clique? I understand this preserves tree structure and strong rootedness.

  • What is the “#Arc” field in the datasets table?

Limitations

  • Cost is exponential in the number of variables in each agent (presumably one may use nested junction trees?)

  • In the two-party case, each agent can infer information from the aligned result (a fundamental fact of HE rather than a problem with Hybrid CCJT). I see this is mentioned in the limitations (Appendix H), but it could be mentioned more prominently in the main paper.

Final Justification

My original comments were mostly minor and my questions were answered accordingly and as expected.

After reading other reviews, particularly rRs9's thoughts on impact and novelty and the issue being addressed, I don't see a clear reason not to accept and have raised my score accordingly.

Formatting Issues

I did not notice any formatting errors.

Author Response

Thank you for the insightful review. Please see our reply to the outlined points. We would be happy to clarify any unclear or unaddressed aspects.

Weaknesses

[W1] No mention of Multiply Sectioned Bayesian Networks

Performing loopy belief propagation over Multiply Sectioned Bayesian Networks would (indeed) break the strong root constraint and lead to poor inference quality in hybrid domains. Furthermore, loopy belief propagation entails sending more messages (due to the iterative nature of the algorithm) and is proven to converge only on graphs with at most one cycle [1]. We will update the related work section with this information.

[W2] Is there a privacy leak over threshold variables?

Please see the answer to Q2.

[W3] Possible data leak in two-party scenario

Indeed, when only two parties are involved, the proposed method does not protect the private posterior of interface variables, as doing so is impossible. Intuitively, consider computing the sum of $N$ secret numbers $x_1, \ldots, x_N$ (Equations 7 and 8 in the paper). If $N > 2$, a given party $i$ knows $x_i$ and $y = \sum_{j=1}^{N} x_j$, from which party $i$ cannot reconstruct the values of $x_{j \neq i}$. If $N = 2$, party $i$ can reconstruct $x_j = (x_i + x_j) - x_i = y - x_i$. Similar reasoning also holds for products.
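To make this argument concrete, here is a minimal stand-alone sketch (ours, not the paper's code) of what a party can deduce once the revealed aggregate $y$ is known:

```python
def simulate_sum_disclosure(secrets):
    """Each party i holds a secret x_i; the protocol reveals y = sum(x_i).
    Returns what party 0 can deduce about the other parties' secrets."""
    y = sum(secrets)
    others_total = y - secrets[0]  # party 0 always learns the sum of the others
    if len(secrets) == 2:
        # With two parties, the residual sum *is* the other party's secret.
        return {"x1 (fully recovered)": others_total}
    # With three or more parties, only the aggregate of the others is known;
    # any split of that total among them is equally consistent.
    return {"sum of others (ambiguous)": others_total}

print(simulate_sum_disclosure([3.0, 5.0]))       # two parties: x1 = 5.0 leaks
print(simulate_sum_disclosure([3.0, 5.0, 7.0]))  # three parties: only 12.0 leaks
```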

The work from Kearns et al. [2] builds upon this basic idea to explore the limitations of confidentiality in Bayesian Networks and specific methodologies including sum-product message passing. Notably, that work establishes a generalized definition for confidentiality in Bayesian Network applications (Definition 1 from [2]):

Let $\Pi$ be any protocol for the $k$ parties to jointly compute the value $y = f(x_1, \ldots, x_k)$ from their $n$-bit private inputs. We say that $\Pi$ is privacy-preserving if, for every $1 \leq i \leq k$, anything that party $i$ can compute in polynomial time in $n$ following the execution of $\Pi$, they could also compute in polynomial time given only their private input $x_i$ and the value $y$.

We abide by this definition which we include in Appendix B.

Questions

[Q1] How does the protocol work with 3+ parties?

The junction tree forms 2 star networks corresponding to discrete and continuous variables, respectively (as also remarked by the reviewer), joined via their hubs (i.e., the interface cliques). Please note that, as stated in the paper, the Junction Tree dictates the computational logic within the protocol, not the actual distribution of data across parties. Thus, an interface clique is not centralized or controlled by any party. Instead, parties solve inference collaboratively over this clique.

[Q2] How are threshold variables protected?

Unlike interface variables, whose protection depends on homomorphic encryption and secret sharing during alignment and inference, threshold variables are not shared between parties and are only used locally. We will better emphasize this aspect in the methodology section.

[Q3] What is the “#Arc” field in the datasets table?

"#Arc" in Table 5 refers to the number of edges in the original network. It describes the number of relationships between variables in the network. However, having more edges does not necessarily imply an increase in the number of parameters (#Params column) reported in the same table. #Params depends on the network's edge distribution and the representations of variables. For instance, a discrete binary variable with two binary parents will require 23=82^3=8 parameters. Whereas, one discrete variable with one parent both with three states will require 32=93^2=9 parameters.

Limitations

[L1] Cost is exponential in the number of variables in each party

By the nature of discrete Bayesian Networks, inference is NP-hard. However, our algorithm is agnostic to the inference algorithm used locally by each party. Thus, instead of Variable Elimination (as employed in our experiments), parties could indeed autonomously use auxiliary junction trees for local inference that is exponential in the treewidth rather than in the number of variables. Note that using nested Junction Trees would not affect communication costs.

[L2] Undermentioned caveats in two-party case

As correctly identified by the reviewer, there is an unavoidable confidentiality leak in the two-party case. We will update the paper's method section to emphasize the phenomenon. More specifically, the issue stems from the need to disclose the output of a bijective function with two inputs, with one of the inputs also being known beforehand. Thus, no secure multiparty computation technique, including homomorphic encryption, can address the issue. Please refer to the answer of W3 for additional details.

References

[1] Zivan, Roie, Omer Lev, and Rotem Galiki. "Beyond trees: Analysis and convergence of belief propagation in graphs with multiple cycles." AAAI Conference on Artificial Intelligence (AAAI), 2020.

[2] Kearns, Michael, Jinsong Tan, and Jennifer Wortman. "Privacy-preserving belief propagation and sampling." Advances in Neural Information Processing Systems 20 (NeurIPS), 2007.

Comment

Reviewer 6zT8, the authors have provided an extensive response to your original review. How has their response changed your opinion of the paper?

Comment

My original comments were mostly minor and my questions were answered accordingly and as expected.

After reading other reviews, particularly rRs9's thoughts on impact and novelty and the issue being addressed, I don't see a clear reason not to accept and have raised my score accordingly.

Comment

We would like to thank the AC and Reviewer 6zT8 for following up. We are glad we could address the reviewer's comments, and are grateful for the positive feedback.

Review
Rating: 4

Bayesian Networks (BN) are a tool that allows different parties to collaboratively optimize their production process. State-of-the-art frameworks for collaborative and confidential inference only support discrete data and have high communication costs, while frameworks for hybrid scenarios (mix of discrete and continuous data) are limited to single-party settings. The authors solve this by providing a collaborative and confidential framework for BNs in the hybrid setting, constructing a strongly rooted Junction Tree then running a belief-propagation-like algorithm to combine inference results confidentially. Results show 32% average improvement in predictive accuracy and up to 331× reduction in communication costs.

Strengths and Weaknesses

Strengths:

  1. The authors' use of Junction Trees gives a clear advantage in terms of communication costs compared to vanilla belief propagation.
  2. The authors provide a proof of confidentiality under an adversarial model and a clear collaborative strongly-rooted JT protocol.
  3. The experimental results are strong: clear improvement in accuracy in hybrid setting, communication cost reduction in discrete setting.

Weaknesses:

  1. Could the authors provide a high-level proof sketch for the proof of confidentiality? It would help readers understand why the collaborative inference protocol always hides the marginal posteriors on private variables.
  2. The Collaborative inference protocol is quite sophisticated and high-level. More implementation details would be appreciated if possible.
  3. Limited communication cost improvement analysis in the hybrid setting compared to the extensive discrete results (though Table 1 does show some improvements for hybrid datasets).

Questions

Please address weaknesses.

Limitations

yes

Final Justification

This is definitely a technically strong paper. Authors have addressed my concerns and I am satisfied with their response. I didn't recommend 5 (my score would be 4.5) because I do not have the domain expertise to judge if this is high-impact and which areas of machine learning this paper can heavily influence (confidence 2). That said, the other reviewers seem to agree that this is a "significant and previously unaddressed challenge".

Formatting Issues

None

Author Response

Thank you for your valuable comments and feedback; they help us further improve this work. Please see our reply to the highlighted weaknesses. We will be happy to follow up on any of the points.

Weaknesses

[W1] No high-level proof of confidentiality

Below, we outline why both of Hybrid CCJT's collaborative phases preserve confidentiality during inference:

  1. Collaborative Discrete Inference: CPD tables of interface variables (say, $X$) are "augmented" to allocate space for all parents $\text{pa}(X)$ of $X$, some of which might be private. This step is required to ensure the correct outcome of HE, CPD product, and CPD normalization as in Equation 4. We show that this step does not leak any information about such private parent variables by proving that their marginals yield a uniform distribution $\mathcal{U}(\Omega(\text{pa}(X)))$, where $\Omega(X)$ is the set of states of variable $X$. We prove this in Proposition B.2. Then, we prove that, after message passing, the marginal over these parent variables remains uniform. This ensures no information about such private variables leaks at inference time. We prove this in Proposition B.5. Furthermore, note that parties encrypt the names of variables and their states to enhance confidentiality further.
  2. Collaborative Continuous Inference: Unlike discrete variables, continuous counterparts do not require computing and correctly applying normalization factors potentially defined over private factors. Thus, continuous private variables are inherently protected during alignment. At inference time, the means and variances of shared continuous variables are calculated using multi-party secret sharing, which updates the parameters of such shared continuous interface variables. This approach inherently safeguards the information of private variables, as they are not involved in the update process.
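As a toy illustration of the first point (a simplified numpy sketch under our own assumptions about how the augmentation replicates the CPD; not the paper's implementation), replicating a CPD uniformly across the states of a private parent leaves that parent's marginal uniform:

```python
import numpy as np

# CPD P[X | S] of an interface variable X with a shared parent S
# (rows: states of X, columns: states of S); each column sums to 1.
cpd = np.array([[0.7, 0.2],
                [0.3, 0.8]])

# "Augment" the table to allocate space for a private parent P with k states
# by replicating it uniformly along a new axis (axis 2 = states of P).
k = 3
augmented = np.repeat(cpd[:, :, None], k, axis=2) / k  # shape (|X|, |S|, |P|)

# Viewing the augmented table as a factor over (X, S, P), the (normalized)
# marginal over the private parent P is uniform, so nothing about P leaks.
marginal_p = augmented.sum(axis=(0, 1))
marginal_p /= marginal_p.sum()
print(marginal_p)  # [0.3333... 0.3333... 0.3333...]
```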

[W2] Missing implementation details on the inference protocol

We will prepare a more detailed explanation of the inference protocol in the methodology section, plus more detailed pseudocode and a diagram to be included in the appendix. The inference is divided into three steps (of which two are collaborative) and follows the logic of the Lauritzen algorithm [1].

  1. Collaborative Continuous Inference: The first step requires marginalizing continuous variables and finding the strong marginals over the threshold variables. Recall that threshold variables are the discrete variables with at least one continuous child. In short, we aim at computing $P[T \mid \Gamma = e]$, where $T$ is the set of threshold variables and $\Gamma = e$ is some continuous evidence. To do so, parties collaboratively compute the updated parameters of interface continuous variables given the set of continuous evidence as in
$$\mu(X) = \sum_{i \in \mathbb{P}} w_i \mu_i(X), \quad \sigma(X) = \sum_{i \in \mathbb{P}} [w_i \sigma_{i}(X)] + \sum_{i \in \mathbb{P}} [w_i (\mu - \mu_{i}(X))^2],$$

where the parameters $\mu_i$ and $\sigma_i$ are obtained locally by computing $P[I_\Gamma \mid \Gamma = e]$. Once these new parameters are obtained, they can help compute the marginals over the threshold variables by just marginalizing all the continuous variables.

  2. Collaborative Discrete Inference: Once parties find the marginals of threshold variables, the querying party needs to find the strong marginals over the remaining discrete variables. For this, parties compute $P[I_\Delta \mid \Delta = e]$, the posterior of discrete interface variables given their discrete evidence. The querying party receives these posteriors and merges them via secret sharing to find the geometric mean:
$$P[I_\Delta] = \prod_{i \in \mathbb{P}} P_i[I_\Delta]^{w_i}.$$

The querying party can then use this factor to update its local probability distribution via sum-product message passing [2] (Equation 2 from the paper). For a more detailed explanation of sum-product message passing, please refer to Section 10.2 of the book by Koller and Friedman [3].

  3. Local query: At this point in the protocol, the querying party can find the posterior of any variable it owns. For discrete variables, the posterior is found by marginalizing non-target variables, as in a normal junction tree. For continuous variables, the posterior is found by marginalizing all non-threshold variables and performing weak marginalization:
$$P[X] = \mathcal{N}\Bigl(X;\; \sum_{s \in \Omega(\Delta_{Q})} P[\mathcal{T}_Q = s]\, \mu_{s}(X),\; \sum_{s \in \Omega(\Delta_{Q})} P[\mathcal{T}_Q = s] \bigl\{\sigma_{s}(X) + (\mu - \mu_{s}(X))^2\bigr\} \Bigr).$$
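To make the three collaborative steps above concrete, here is a minimal numerical sketch of the combination formulas (illustrative values, simplified one-dimensional quantities, and a renormalization of the geometric mean added by us; not the paper's implementation):

```python
import numpy as np

# --- Step 1: combine per-party Gaussian parameters of a shared continuous
# interface variable X (party weights w_i sum to 1).
w     = np.array([0.5, 0.3, 0.2])
mu_i  = np.array([1.0, 1.4, 0.8])          # per-party means of X
var_i = np.array([0.20, 0.35, 0.10])       # per-party variances of X
mu  = np.sum(w * mu_i)
var = np.sum(w * var_i) + np.sum(w * (mu - mu_i) ** 2)

# --- Step 2: merge per-party posteriors over a discrete interface variable
# via a weighted geometric mean, then renormalize.
P_i = np.array([[0.6, 0.4],                # party 1's posterior
                [0.7, 0.3],                # party 2's posterior
                [0.5, 0.5]])               # party 3's posterior
merged = np.prod(P_i ** w[:, None], axis=0)
merged /= merged.sum()

# --- Step 3: weak marginalization of a continuous query variable given the
# posterior over the (discrete) threshold configurations s.
P_T   = merged                             # P[T_Q = s], reusing step-2 output
mu_s  = np.array([0.9, 1.6])               # per-configuration means
var_s = np.array([0.15, 0.25])             # per-configuration variances
mu_q  = np.sum(P_T * mu_s)
var_q = np.sum(P_T * (var_s + (mu_q - mu_s) ** 2))
print(mu, var, merged, mu_q, var_q)
```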

[W3] Limited communication cost analysis on hybrid domains

We repeated the experiments from Table 1 with new variable splits to derive further insights into the communication costs of Hybrid CCJT. Table C below isolates the communication costs of discrete and continuous collaborative inference. This allows us to better analyze the communication cost gains obtained by natively handling continuous data. Furthermore, we report the standard deviation of the communication costs.

Table C: Communication cost split for discrete and continuous variables

Columns give the dataset with the number of parties and the overlap percentage.

| Method | Variant | Healthcare (2, 10%) | Healthcare (2, 30%) | Healthcare (4, 10%) | Healthcare (4, 30%) | Sangiovese (2, 10%) | Sangiovese (2, 30%) | Sangiovese (4, 10%) | Sangiovese (4, 30%) |
|---|---|---|---|---|---|---|---|---|---|
| Hybrid CCJT | Discrete | 11±4.2 | 5±4.4 | 20±11.6 | 48±43.9 | 32±0 | 32±0 | 96±0 | 96±0 |
| Hybrid CCJT | CLG | 0±0 | 21±4.8 | 0±0 | 192±62.1 | 29±7.5 | 59±10.5 | 182±44.9 | 541±73.6 |
| Δ-CCJT | 3 states | 11±3.9 | 21±11.2 | 14±5.6 | 70±24.2 | 21±1.0 | 177±37.9 | 243±45.0 | 3313±1251.6 |
| Δ-CCJT | 5 states | 10±3.8 | 20±6.7 | 10±1.6 | 121±78.8 | 38±1.1 | 225±36.0 | 279±46.7 | 35591±19277.8 |
| Δ-CCJT | 10 states | 8±3.1 | 39±15 | 10±1.8 | 229±148.7 | 27±2.2 | 334±146.0 | 539±100.2 | 5131±1842.4 |
| CCBNet | 3 states | 58±27.3 | 61±15.6 | 20±5.0 | 120±30.6 | 240±77.1 | 1021±298.3 | 25609±11412.6 | 28505±11896.8 |
| CCBNet | 5 states | 14.7±3.7 | 72±29.8 | 10±1.6 | 888±428.7 | 55±7.0 | 1557±506.7 | 5210±1971.2 | 271053±125799.6 |
| CCBNet | 10 states | 11±3.2 | 290±139 | 10±1.8 | 2068±1043 | 169±40.8 | 990±247.0 | 4766±1438.7 | 577838±295695.1 |

References

[1] Lauritzen, Steffen L., and Frank Jensen. "Stable local computation with conditional Gaussian distributions." Statistics and Computing (SC), 2001.

[2] Shenoy, Prakash P., and Glenn Shafer. "Axioms for probability and belief-function propagation." Machine intelligence and pattern recognition, 1990.

[3] Koller, Daphne, and Nir Friedman. "Probabilistic graphical models: principles and techniques". MIT Press, 2009.

评论

I appreciate the authors' thorough response. It has addressed most of my concerns. I will keep my score which leans towards acceptance.

评论

We thank the reviewer for their kind reply and helpful review, which constitutes a valuable source for further improvements to strengthen our work.

Review
Rating: 5

This paper addresses the challenge of performing confidential, collaborative inference in Bayesian Networks (BNs) that span multiple parties and involve hybrid data (i.e., both discrete and continuous variables). The authors identify that existing privacy-preserving methods are typically limited to discrete-only scenarios and often incur high communication overhead.

To overcome these limitations, the paper introduces Hybrid CCJT, a novel framework for secure multi-party inference in hybrid BNs. The proposed method is twofold: first, it presents a novel technique for constructing a single, strongly rooted Junction Tree (JT) across all parties using "interface cliques" to connect their local network structures. Second, it develops a secure protocol based on Multi-Party Computation (MPC) to perform belief propagation on this shared JT, enabling inference while keeping model parameters and structures confidential.

The authors conduct an empirical evaluation of Hybrid CCJT on nine datasets of varying types (mixed, discrete, and continuous). They compare its performance against a state-of-the-art baseline (CCBNet) and an ablation variant of their method (Δ-CCJT), measuring prediction accuracy (using MSE for continuous and Brier score for discrete variables) and communication cost.

Strengths and Weaknesses

  • Strengths. The paper addresses a significant and previously unaddressed challenge: confidential inference in collaborative, hybrid Bayesian Networks. It is the first to extend the Junction Tree (JT) algorithm to this domain, enabling the native handling of both discrete and continuous variables without discretization. This greatly enhances the practical relevance of collaborative probabilistic modeling, especially for industrial applications.

The proposed Hybrid CCJT methodology is technically sound, introducing "interface cliques" to construct a strongly-rooted JT in a distributed fashion. This is a critical feature for valid inference in hybrid Conditional Linear Gaussian (CLG) models. The paper’s design is complete, including theoretical proofs and integration of cryptographic primitives to ensure confidentiality.

The empirical evaluation is comprehensive, covering nine datasets with varying characteristics and scales. The comparison against CCBNet and an ablation variant (Δ-CCJT) clearly demonstrates the contribution of the paper's specific innovations. The reported results show substantial performance improvements, including a 32% reduction in Mean Squared Error (MSE) and up to a 331× reduction in communication costs, providing strong evidence of the method's superiority.

  • Weaknesses
  • Security Guarantee Scope. The security of the protocol is based on a semi-honest, non-colluding adversary model. As noted in Appendix H, the protocol cannot withstand collusion among N-1 participants. While this is a common and explicitly stated assumption, it represents a significant limitation when deployed in environments with low trust or high incentives for collusion. Briefly mentioning this boundary condition in the main text would provide readers with a clearer understanding of the threat model.
  • Practical Assumptions for Data Alignment. The successful operation of the framework relies on two key practical assumptions: (a) shared variables between parties can be correctly identified (e.g., through private set intersections on names), and (b) there is consensus on the representation of these shared variables (e.g., the states of discrete variables, applicable units, and discretization schemes). In practice, achieving semantic interoperability is a significant challenge. The importance of this as a prerequisite could be emphasized more clearly.
  • Context of the 331x Communication Improvement Claim. The paper reports a communication cost reduction of up to 331x on the Munin #2 dataset (Table 3), which is a highly impressive result. However, this peak performance figure arises under a specific, best-case scenario for the proposed method (a large, purely discrete network where the communication overhead of the baseline is maximal). While the result is technically correct and showcases the potential of the method, highlighting this peak number so prominently (e.g., in the abstract) could be slightly misleading. The average improvement across all datasets and settings might be a more representative measure of the expected gains.

Questions

  • Could the authors clarify the computational overhead introduced by the cryptographic tools used in the protocol, and how they compare with communication costs in larger models?
  • Is the local marginalization in Equation 9 the primary reason for communication efficiency compared to Variable Elimination methods like CCBNet? A brief confirmation of this mechanism would be helpful.
  • How does the protocol handle inconsistencies in data representation (e.g., mismatched units or state definitions across parties)?

Limitations

  • The paper does not provide a detailed analysis of the computational cost, especially when using cryptographic primitives, which is crucial for practical implementations.
  • Data misalignment issues are not sufficiently addressed, and there is no discussion on the protocol's sensitivity to such inconsistencies.
  • Scalability concerns with larger datasets or more participants are not fully explored.

Formatting Issues

NO or VERY MINOR ethics concerns only

Author Response

Thank you for your positive feedback. Please see our reply to the limitations, weaknesses, and questions. We look forward to any further comments.

Weaknesses

[W1] Underspecified security guarantee scope in main text

We will update the abstract and introduction to mention the semi-honest, non-collusion setting. Moreover, in the second paragraph of subsection 3.2, we will visually highlight the "semi-honest" denomination alongside the existing one for the "adversarial model".

[W2] Underemphasized practical assumptions for data alignment in main text

We will explicitly enumerate the assumptions for successful alignment. Namely, as correctly identified by the reviewer, shared variables across parties should have (i) the same name when defining identical concepts, and (ii) the same discrete states or continuous measurement units. Furthermore, we will also point to an appendix with a misalignment-related discussion based on the Q3 answer below.

[W3] Undercontextualized 331× communication improvement claim

We concur that the 331× communication improvement is in a scenario that heavily favours the proposed method. We will additionally report the median (10.4×) communication improvement observed alongside the maximum in the abstract and introduction for a fairer presentation.

Questions

[Q1] What is the computational overhead of the cryptographic tools the protocol employs?

The protocol relies on private set intersection, secret sharing (additive/product-based), and homomorphic encryption (HE) at different points in its lifecycle. Among the enumerated cryptographic tools, only HE involves a significant computational overhead, while the primary bottleneck for the others is network latency. HE is used exclusively during discrete factor alignment. The results below split the HE computation time across:

  1. Context Generation: Generation of HE keys and parameters.
  2. Encrypted Column Summation: The actual HE-encrypted computation; used to compute column normalization factors (i.e., $\alpha$ in Equation 4 in the paper).
  3. CPD combination: Alignment of party CPDs based on the HE-computation outputs.
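For illustration only, a minimal sketch of substeps 1–2 using TenSEAL's CKKS interface (an assumed choice of HE library; the paper does not prescribe a specific implementation, and the column values are hypothetical):

```python
import tenseal as ts  # assumed CKKS HE library; not necessarily the one used in the paper

# Substep 1: context generation (CKKS parameters and keys).
ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40

# Substep 2: encrypted column summation -- each party encrypts its CPD column,
# the ciphertexts are added, and only the aggregate column is decrypted, so no
# party sees another party's individual column.
col_party_a = [0.20, 0.50, 0.30]  # hypothetical CPD column of party A
col_party_b = [0.10, 0.60, 0.30]  # hypothetical CPD column of party B
enc_sum = ts.ckks_vector(ctx, col_party_a) + ts.ckks_vector(ctx, col_party_b)

# The decrypted aggregate feeds the normalization factors used in substep 3's
# plaintext CPD combination.
alpha = enc_sum.decrypt()
print([round(v, 3) for v in alpha])  # approximately [0.3, 1.1, 0.6]
```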

We report the per-factor mean and standard deviation of each HE substep and the total time required for all factors.

Table A: Results on discrete datasets (10% overlap):

| Dataset | #Parties | Context Gen. (ms) | Col. Sums (ms) | CPD comb. (ms) | Total (ms) |
|---|---|---|---|---|---|
| Asia | 2 | 153±0 | 18.7±0 | 0.27±0 | 176 |
| | 4 | 644±0 | 802±0 | 0.53±0 | 1449 |
| | 8 | 2155±0 | 767±0 | 0.746±0 | 2926 |
| Alarm | 2 | 136±5.7 | 51401±82611 | 9.57±14.6 | 206191 |
| | 4 | 672±50 | 3639±4780 | 0.749±0.4 | 17250 |
| | 8 | 1900±150 | 96745±134754 | 4.19±4.69 | 394601 |
| Child | 2 | 143±5 | 144±84 | 0.33±0.034 | 577 |
| | 4 | 703±26 | 17151±9176 | 1.78±0.6 | 35713 |
| | 8 | 1984±57 | 83010±79570 | 3.9±2.9 | 169999 |
| Link | 64 | 133±7 | 324.9±381 | 0.4±0.1 | 431291 |

As expected, substep 2, i.e., encrypted column summation, is the most time-consuming one, taking up to 96 seconds in the worst-case scenario, but remains under one second in 6 out of 10 experiment scenarios. Its time requirements increase overall with the number of parties sharing factors. In contrast, substeps 1 and 3, i.e., context generation and CPD combination, run in under one second and one millisecond, respectively. Moreover, we note that the HE procedure represents a one-off setup cost, after which parties may perform unlimited queries until one of them updates their local network.

Table B: Results on hybrid datasets (30% overlap):

| Dataset | #Parties | Context Gen. (ms) | Col. Sums (ms) | CPD comb. (ms) | Total (ms) |
|---|---|---|---|---|---|
| Healthcare (Hybrid) | 2 | 127±0 | 26±0 | 0.23±0 | 154 |
| Healthcare (Discrete, 3 states) | 2 | 142±8.24 | 82.4±67 | 0.3±0.05 | 675 |
| Healthcare (Discrete, 5 states) | 2 | 71.35±13 | 71.4±62.4 | 0.271±0.03 | 641 |
| Healthcare (Discrete, 10 states) | 2 | 139.7±10.2 | 165.5±152.6 | 0.316±0.04 | 918 |
| Healthcare (Hybrid) | 4 | 646.9±6.04 | 244±90.5 | 0.43±0.02 | 1785 |
| Healthcare (Discrete, 3 states) | 4 | 678.46±49.3 | 300.2±139.2 | 0.48±0.06 | 2940 |
| Healthcare (Discrete, 5 states) | 4 | 682.9±46.7 | 3007.8±3660 | 0.662±0.295 | 11076 |
| Healthcare (Discrete, 10 states) | 4 | 677±51 | 5956±7221.4 | 0.86±0.48 | 19905 |
| Sangiovese (Hybrid) | 2 | 127±0 | 12.1±0 | 0.25±0 | 141 |
| Sangiovese (Discrete, 3 states) | 2 | 141±9.7 | 827±1080 | 0.46±0.24 | 2906 |
| Sangiovese (Discrete, 5 states) | 2 | 139.8±9.2 | 710.3±905 | 0.403±0.18 | 2553 |
| Sangiovese (Discrete, 10 states) | 2 | 141±11.9 | 621±489 | 0.424±0.09 | 2289 |
| Sangiovese (Hybrid) | 4 | 127±0 | 12.1±0 | 0.25±0 | 140 |
| Sangiovese (Discrete, 3 states) | 4 | 139±9.7 | 813.7±1063 | 0.43±0.24 | 2861 |
| Sangiovese (Discrete, 5 states) | 4 | 659.3±49.6 | 188826.7±239278.3 | 13.2±16.4 | 758015 |
| Sangiovese (Discrete, 10 states) | 4 | 838±51 | 1082339±1591371 | 51±72.8 | 4332927 |

When dealing with hybrid domains, we notice that encryption takes significantly more time on the discretized variants. While Hybrid CCJT maintains the entire encryption process under one second for both Sangiovese and Healthcare datasets, Δ-CCJT's encryption cost grows as the discretization coarseness is reduced (i.e., more states). These results also reaffirm the gains in memory and communication costs of natively handling CLG variables compared to discretizing them, as further highlighted in our answer to Q2.

We will add the above tables and discussion as a new appendix.

[Q2] Is local marginalization in Equation 9 the primary reason for improved communication efficiency?

Equation 9's local marginalization is the primary reason for improved communication when only considering discrete variables. When incorporating continuous variables, a significant portion of Hybrid CCJT's efficiency also comes from reducing the representation space from the $O(2^N)$ states required by the discretization used in Δ-CCJT or CCBNet down to the $O(N^2)$ parameters required to represent continuous variables in canonical form.

For more details on the canonical representation of CLG variables, please see Appendix C of the paper. Table 1 from the paper contains quantitative results on the improvements given by Equation 9 (i.e., Δ-CCJT vs CCBNet; for example, 15.5× on average for 5 states) and by natively handling CLG variables (i.e., Hybrid CCJT vs Δ-CCJT; for example, 1.9× on average for 5 states).
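As a back-of-the-envelope illustration of this gap (our own toy calculation, not results from the paper): discretizing $N$ jointly dependent continuous variables into $s$ states each requires a table that grows as $s^N$, whereas the canonical form of an $N$-dimensional Gaussian needs only a precision matrix, a potential vector, and a scalar.

```python
def discretized_table_size(n_vars, n_states):
    """Entries in a joint table over n_vars variables discretized into n_states states."""
    return n_states ** n_vars

def canonical_form_size(n_vars):
    """Parameters of an n_vars-dimensional Gaussian in canonical form:
    precision matrix K (n^2 entries), potential vector h (n), and scalar g."""
    return n_vars ** 2 + n_vars + 1

for n in (5, 10, 20):
    print(n, discretized_table_size(n, 5), canonical_form_size(n))
# e.g. n=10: 9,765,625 table entries vs. 111 canonical-form parameters
```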

[Q3] How does the protocol handle data representation inconsistencies?

While the current protocol assumes no data representation inconsistencies, the name-matching procedure is flexible in being updated to mitigate such issues. Inconsistencies can appear at the level of variable definition (parties using different names for the same concept) or their representation (state/measurement mismatches for discrete and continuous variables, respectively).

Some possible mitigation strategies are:

  • parties giving synonyms for variable names;
  • normalizing variable names (e.g., as lowercase);
  • for state alignment mismatches in discrete variables, states not present in all parties could get mapped to a single miscellaneous state;
  • for measurement unit mismatches in continuous variables, parties could update their local representations to an agreed-upon unit.
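As a minimal sketch of the first two strategies (hypothetical names and helper, not part of the protocol; in the actual protocol the matching would run through the private set intersection rather than plain sets):

```python
def normalize(name, synonyms):
    """Map a local variable name to a canonical lowercase form."""
    canon = name.strip().lower().replace(" ", "_")
    return synonyms.get(canon, canon)

# Synonym table agreed upon out of band (strategy 1); lowercasing/underscoring (strategy 2).
syn = {"precipitation": "rainfall"}

party_a = {"Grape_Yield", "Rainfall", "Soil Acidity"}
party_b = {"grape_yield", "Precipitation", "Temperature"}

shared = {normalize(v, syn) for v in party_a} & {normalize(v, syn) for v in party_b}
print(shared)  # {'grape_yield', 'rainfall'}
```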

We will update Appendix H with the above discussion.

Limitations

[L1] No detailed computation time analysis (for cryptographic primitives)

The computational complexity of homomorphic encryption in solving an overlap is $O(L \cdot N \log(N))$, bounded by $L$ chained multiplications of a size-$N$ ciphertext. $L$ is the maximum number of parties in an overlap. $N$ is a power of two (in the range $2^{10}$–$2^{15}$), derived from the desired precision, the maximum output scale, and $L$. Successive multiplications increase complexity linearly [1]. The cost of a single multiplication is linearithmic [2].

In our implementation, solving a private set intersection between two parties takes $O(m \log(m))$ time, where $m$ is the maximum number of variables among the two parties.

For an empirical overview of homomorphic encryption overhead, please refer to our answer to Q1.

[L2] Data misalignment issues insufficiently addressed

Please see the answer to Q3.

[L3] Scalability concerns not fully explored

We acknowledge that larger-scale tests on continuous and hybrid networks would be beneficial. We note, however, that the primary determiner of our method's scalability compared to a centralized approach is the total number of overlaps and the number of parties involved in each. The sizes of each party's local network or the number of parties are not the bottleneck. For time complexity, see the answer to L1. For communication overhead, cost is predominantly related to processing discrete variables. As evidenced by our main evaluation results and our reply to L3 of Reviewer HuYN, continuous variables have much lower communication requirements, making them even less of an issue for scaling.

Moreover, in our motivating example, we expect overlap percentages to be lower than 30% as covered in our experiments. Also, in such settings with highly specialized participants, at most a handful of parties would model most concepts. Thus, the insights from our experiments with up to 8 parties remain relevant. However, we acknowledge that other setups are also plausible (e.g., a central concept modeled by many parties).

References

[1] Cheon, Jung Hee, et al. "Homomorphic encryption for arithmetic of approximate numbers." International conference on the theory and application of cryptology and information security (ASIACRYPT), 2017.

[2] Al Badawi, Ahmad, et al. "OpenFHE: Open-source fully homomorphic encryption library." 10th workshop on encrypted computing & applied homomorphic cryptography (WAHC), 2022.

Comment

Reviewer rRs9, the authors have provided extensive details in response to your questions about computational overhead and communication efficiency, among other points raised in your review. How has your opinion of the paper updated since reading their response?

Final Decision

This paper considers a highly original setting wherein multiple parties (e.g., companies with proprietary data/IP) seek to collaboratively optimize a process (e.g., manufacturing) by performing inference in a large Bayesian network that is composed of separate party-specific Bayesian networks, each representing their respective system or domain. The key challenge in this setting is permitting collaboration while guaranteeing confidentiality---i.e., performing inference in the large Bayesian network while ensuring that parties cannot access the propriety knowledge encoded in other parties' Bayesian (sub-)networks. The paper introduces the "first framework for performing confidential multiparty inference in hybrid domains", where hybrid domains are those involving both discrete and continuous variables.

The reviewers all appreciated the originality and novelty of the setting and the proposed framework. They found the methodology to be sound and compelling, and were satisfied with the empirical results. The following is a representative excerpt from one reviewer's comments:

"The paper addresses a significant and previously unaddressed challenge [...] The proposed methodology Hybrid CCJT is technically sound [...] The paper's design is complete [...] The empirical evaluation is comprehensive."

The authors were highly responsive during the discussion period and largely assuaged the reviewers' concerns, including responding in detail to questions about computational overhead, giving a proof sketch, and providing additional experimental results. The authors also presented a clear plan for revising the paper in response to the reviewers' comments, which I find reasonable and achievable for camera-ready.