PaperHub | 6.0/10 | Poster | NeurIPS 2025
4 reviewers; ratings 4, 5, 2, 4 (min 2, max 5, std. dev. 1.1)
Confidence 2.5 | Novelty 2.8 | Quality 3.0 | Clarity 2.5 | Significance 3.0

Distribution Learning Meets Graph Structure Sampling

OpenReview | PDF
Submitted: 2025-05-11 | Updated: 2025-10-29
TL;DR

We develop efficient algorithms for non-uniformly sampling over directed acyclic graph structures, and use these along with results from online learning, to develop efficient algorithms for agnostically-learning Bayes nets in KL divergence.

Abstract

This work establishes a novel link between the problem of PAC-learning high-dimensional graphical models and the task of (efficient) counting and sampling of graph structures, using an online learning framework. The problem of efficiently counting and sampling graphical structures, such as spanning trees and acyclic orientations, has been a vibrant area of research in algorithms. We show that this rich algorithmic foundation can be leveraged to develop new algorithms for learning high-dimensional graphical models. We present the first efficient algorithm for (both realizable and agnostic) learning of Bayes nets with a chordal skeleton. In particular, we present an algorithm that, given integers $k,d > 0$, error parameter $\varepsilon > 0$, an undirected chordal graph $G$ on $n$ vertices, and sample access to a distribution $P^\ast$ on $[k]^n$; (1) returns a Bayes net $\widehat{P}$ with skeleton $G$ and indegree $d$, whose KL-divergence from $P^\ast$ is at most $\varepsilon$ more than the optimal KL-divergence between $P^\ast$ and any Bayes net with skeleton $G$ and indegree $d$, (2) uses $\widetilde{O}(n^3k^{d+1}/\varepsilon^2)$ samples from $P^\ast$ and runs in time $\mathrm{poly}(n,k,\varepsilon^{-1})$ for constant $d$. Prior results in this spirit were only for trees ($d=1$, tree skeleton) via Chow-Liu, and in the realizable setting for polytrees (arbitrary $d$ but tree skeleton). Thus, our result significantly extends the state-of-the-art in learning Bayes net distributions. We also establish new results for learning tree and polytree distributions.
Keywords
Distribution learning, Bayesian networks, Chordal graphs, Online learning

Reviews and Discussion

Review (Rating: 4)

This paper uses ideas from online learning to make progress on PAC learning of Bayes networks. They present an efficient (agnostic) PAC learning algorithm when the Bayes network has a known chordal skeleton, and improve in several ways on the case where the graph is a known tree, which was studied in recent work [11].

Strengths and Weaknesses

The idea of treating different orientations as different experts is definitely attractive. Similarly, improving the efficiency by using conditional independence and estimating conditional probabilities from samples is very natural.

Of course, restricting to a known graphical structure where only the orientation is unknown is theoretically interesting, but maybe hard to motivate from applications?

The ideas of the proofs are presented at a rather high level, and I am not sure how confident I am in their correctness.

Questions

While the high-level picture is clear, I am not quite sure I see the argument, though this may be due to space constraints. In particular, it would be good to at least hint in the body of the paper as to:

  1. Where do you use the fact that the graph is chordal?
  2. It is not clear how local estimates are sufficient to compute the probability of each particular orientation.
  3. Moreover, given that you consider potentially unbounded degree graphs, it is not clear how much data do you need in order to estimate the local weights especially as these need to be very accurate to apply randomized weighted majority.

Limitations

The main limitation is the sparsity of the arguments provided. There is just not enough detail to convince the reader that everything works out.

Final Justification

This is nice theoretical progress on a hard problem. It is not clear in which cases the graphical model is known but not the orientation. Finding some neat application would make the paper stronger. Similarly, hinting at how the algorithm works in the body of the paper would make it more attractive, if possible.

Formatting Issues

None

Author Response

Thank you for your review. We address your comments and questions below:

Use of the chordal property of the graph: In the chordal distribution learning algorithm (Algorithm 7 in Section C of the supplementary material), we crucially use the clique tree decomposition of the skeleton of the chordal graph in the subroutine SamplingChordalDist (Algorithm 6). Moreover, the chordal structure assumption is used in the correctness proof of our algorithm (Lemma C.6).

Computing the probability of each particular orientation using local estimates: We would like to note that we formally prove that local estimates (of the add-one distribution) are sufficient (Lemma B.4 and Lemma B.6 in Section B of the supplementary material). This relies on tensorization properties of the KL divergence, among other tools.
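For readers unfamiliar with the "add-one distribution" mentioned here, the following is a minimal sketch of an add-one (Laplace-smoothed) estimate of a conditional probability table; the function name and interface are illustrative, not the paper's actual subroutine.

```python
from collections import Counter

def add_one_cpt(samples, child, parents, k):
    """Illustrative add-one (Laplace) estimate of P(child | parents).

    samples: list of tuples over the alphabet [k] = {0, ..., k-1}
    child:   index of the child variable
    parents: tuple of parent indices
    k:       alphabet size
    """
    counts = Counter()
    for x in samples:
        pa = tuple(x[p] for p in parents)
        counts[(pa, x[child])] += 1

    def prob(child_val, parent_vals):
        pa_total = sum(counts[(parent_vals, v)] for v in range(k))
        # Add-one smoothing never assigns probability zero, which keeps
        # the KL divergence to the true conditional finite.
        return (counts[(parent_vals, child_val)] + 1) / (pa_total + k)

    return prob
```

For example, with samples where the child copies its parent, `add_one_cpt([(0,0),(1,1),(0,0),(1,1)], 1, (0,), 2)` yields a conditional estimate of 3/4 for agreement instead of the unsmoothed 1.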

Estimating the local weights: We formally prove the sample complexity bounds, taking the indegree as a parameter. The sample complexities are, of course, polynomial only when the indegree is constant, but this is consistent with the information-theoretic lower bounds for learning Bayesian networks (Canonne et al., 2020, Appendix B), which are exponential in the indegree. In particular, our improper-learning sample complexities for general $d$ match the lower bounds up to log-factors (for indegree $n-1$, exponentially many samples are required in any case, since general distributions on $[k]^n$ can be represented by such Bayes nets).

Sparsity of the arguments: Unfortunately, space constraints prevented us from including the arguments in the main paper. We have included the complete arguments in the supplementary material.

Reference

Clément L Canonne, Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Testing Bayesian Networks, IEEE Trans. Inf. Theory, 2020.

Review (Rating: 5)

This paper proposes a novel distribution learning algorithm for graphical models. The proposed method is the first to enable PAC learning of bounded indegree chordal-structured distributions using only the skeleton of the graphical model as prior information. It can also be applied to learning polytree-structured and tree-structured distributions, and is shown to outperform existing methods theoretically. The key idea behind the proposed approach is to leverage algorithms from the online learning literature, such as the Exponential Weighted Average algorithm and the Randomized Weighted Majority algorithm. By exploiting structural properties of graphical models, the method enables efficient sampling of experts within the algorithm.
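The online-learning component this review refers to can be illustrated with a minimal Exponentially Weighted Average update over a small explicit list of experts. This is a generic sketch with illustrative names, not the paper's algorithm, where the experts correspond to acyclic orientations and are never enumerated explicitly.

```python
import math

def ewa_weights(expert_losses, eta):
    """Generic Exponentially Weighted Average sketch: each expert's
    weight decays exponentially in its cumulative loss.

    expert_losses: list of rounds; each round is a list of per-expert losses
    eta:           learning rate
    """
    n = len(expert_losses[0])
    log_w = [0.0] * n
    for round_losses in expert_losses:
        for i, loss in enumerate(round_losses):
            log_w[i] -= eta * loss
    # Normalize in log-space for numerical stability.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [wi / z for wi in w]
```

After two rounds in which expert 0 suffers loss 0 and expert 1 suffers loss 1, the mixture concentrates on expert 0; the paper's contribution is making this aggregation tractable when the expert set is exponentially large.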

Strengths and Weaknesses

As I am not an expert in the learning theory of graphical models, I may not be able to accurately assess the validity or correctness of the authors' claims. I have not fully followed all the proofs provided in the appendix.

Strengths

  • The proposed method is innovative and elegant. It applies techniques from online learning to estimating distributions over graphical problems and suggests promising directions for future research.
  • The method is extensively analyzed from a theoretical perspective, and improvements over existing approaches are clearly demonstrated.

Weaknesses

  • Certain assumptions limit the applicability of the proposed method, such as the required knowledge of the graphical model's skeleton and the distribution's chordal structure.
  • The paper has a complex structure, making it difficult to follow. Understanding the overall framework requires reading the appendix in detail. While key components necessary for understanding the method are deferred to the appendix, some arguably less critical content (e.g., the paragraph "Why KL divergence?") remains in the main text. The organization between the main body and the appendix warrants reconsideration.

Questions

  • While the proposed method relies on several assumptions—such as the distribution being a bounded indegree chordal-structured distribution and the skeleton being known—can these limitations be addressed or relaxed through extensions of the proposed framework?
  • In the context of practical applications, to what extent do the assumptions made by the proposed method (e.g., known skeleton, chordal structure) pose a barrier? Are they relatively benign, or could they significantly hinder real-world deployment?

Limitations

The proposed method makes various assumptions about the shape and input of graphical models, limiting the problems to which it can be applied.

Final Justification

My concerns have been resolved. Since I had originally given a score leaning toward acceptance, I will keep the score as it is.

Formatting Issues

N/A

Author Response

Thank you for your review. We address your comments and questions below:

Assumptions about the knowledge of the graphical model's skeleton and the distribution's chordal structure on the proposed method: The general framework presented in Section B of the supplementary material does not rely on these assumptions. Of course, the weakness is that the meta-algorithms (Algorithms 3 and 4 in Section B) would take exponential time (even with good sample complexity) and are thus not useful in practice. The tree-structured distribution assumption and the known chordal skeleton assumption are used in the algorithms developed in Sections C and D (Algorithm 8 and Algorithm 7 in the supplementary material) to make the learning process computationally efficient. Note that the sample complexity as well as the computational efficiency of learning Bayes nets in this framework is tied to the indegree bound. This is because without an indegree bound or some other bound on the number of parameters, the sample complexity of learning Bayes nets would be exponential since they can represent general distributions of exponential size.

Complex structure: Thanks for your suggestion. Unfortunately, space constraints in the main paper led us to this organization. We will make suitable changes in the final version.

Assumptions of the proposed method in the context of practical applications: The algorithm in practice does not rely on the known skeleton assumption (of course, it is required in the analysis). Any structure learning algorithm (not necessarily with theoretical guarantees) that outputs a chordal skeleton (with bounded indegree) $G_{err}$ (when $G$ is the input) can be combined with our distribution learning algorithm to give an efficiently-samplable distribution as the output. However, the agnostic learning guarantee would only hold with respect to the set of distributions with structure $G_{err}$ rather than the true skeleton $G$. On the other hand, the chordal-structure and bounded-indegree assumptions are fundamentally required for our computationally-efficient algorithm.

Comment

Thank you for the rebuttal; it has fully resolved my questions. So, I will keep my current score. As noted in my review, I am not an expert on the work’s novelty or significance; thus, my confidence in those evaluations is limited.

Review (Rating: 2)

This paper tackles the PAC-learning of Bayesian network distributions by reducing the problem to online expert learning over all DAGs with a given chordal skeleton. The authors propose treating each possible DAG as an expert, aggregating them via efficient combinatorial counting of acyclic orientations. They prove that for a chordal skeleton with max indegree $d$, their algorithm learns a distribution within $\varepsilon$ in KL-divergence using $\widetilde{O}(n^3k^{d+1}/\varepsilon^2)$ samples, extending classic Chow–Liu tree and polytree results.
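For intuition on the counting ingredient this summary mentions: chordal graphs admit a closed-form count of acyclic orientations, since the chromatic polynomial factors along an ordering in which each vertex's earlier neighbors form a clique (the reverse of a perfect elimination ordering), and Stanley's theorem turns that into a product formula. A hedged sketch follows; the paper's actual routine performs non-uniform sampling, which this toy count does not.

```python
def count_acyclic_orientations_chordal(order, neighbors):
    """Count acyclic orientations of a chordal graph.

    order:     vertices listed so that each vertex's EARLIER neighbors
               form a clique (reverse perfect elimination ordering)
    neighbors: dict mapping vertex -> set of adjacent vertices

    By Stanley's theorem, the count is the product of
    (1 + number of earlier neighbors) over the ordering.
    """
    seen = set()
    total = 1
    for v in order:
        earlier = len(neighbors[v] & seen)
        total *= 1 + earlier
        seen.add(v)
    return total
```

For the triangle this gives 1 * 2 * 3 = 6, matching the fact that 6 of its 8 orientations are acyclic.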

优缺点分析

Strengths

  • Novel theoretical bridge: Clever reduction of Bayes net learning to online no-regret expert aggregation using combinatorial graph algorithms.
  • Generalization: First efficient agnostic PAC learner for chordal-skeleton Bayes nets, subsuming tree and polytree cases.
  • Rigorous bounds: Provable sample complexity and runtime guarantees grounded in well-established online learning theory.

Weaknesses

  • Known skeleton assumption: Sidesteps the hardest part—structure discovery—raising questions about practical usability when the skeleton is unknown or estimated.
  • No experiments: Lacks any empirical validation or simulations to gauge performance, leaving constants and scalability unclear.
  • Misleading framing: Title mentions “latent confounder” and causality, but the work focuses purely on distribution learning under a known graph, which may confuse readers.

Questions

  1. Skeleton acquisition: How would one obtain a correct chordal skeleton in practice? Would skeleton estimation errors undermine the guarantees?
  2. Unknown structure: Can the online framework be extended to learn the skeleton itself (beyond the tree case), or is that inherently intractable?
  3. Causal interpretation: Please clarify the role of “latent confounders” in this setting—does your method handle hidden-variable models, or is it purely observational?
  4. Empirical feasibility: Have you attempted small-scale simulations (e.g., n≈20) to evaluate sample complexity and runtime in practice?

Limitations

The authors do not experimentally assess limitations; a discussion of skeleton errors, runtime constants, and potential failure modes would be beneficial.

Final Justification

This paper makes a good theoretical contribution by giving the first efficient agnostic PAC learner for bounded‐indegree chordal‐skeleton Bayes nets, extending known results for trees and polytrees. The online learning–based reduction to structure sampling is elegant and well‐proved. However, the reliance on a known skeleton, no analysis of robustness to skeleton errors, and no empirical validation limit the work’s applicability. Given these gaps, I lean towards weak reject to reject recommendation, despite appreciating the theoretical advances.

Formatting Issues

None

Author Response

Thank you for your review. We address your comments and questions below:

Known skeleton assumption: We can still combine our algorithm with a practical structure learning algorithm, as noted in our reply to Reviewer c9e7. If the structure learning algorithm has a theoretical guarantee that it "exactly" recovers the structure, we can get a theoretical guarantee for this combined algorithm. Otherwise, this does not give the proper theoretical guarantee with respect to the true distribution. Unfortunately, to the best of our knowledge, no such efficient structure learning algorithm with exact recovery guarantees is known.

Misleading framing: We are afraid that we do not know why the reviewer makes such a comment. Our title (“Distribution Learning Meets Graph Structure Sampling”) clearly does not mention “latent confounder” or “causality”. Furthermore, we do not mention “latent confounders” anywhere in our paper, and the only place where we mention “causality” is lines 99-100 where we say that chordal graphs play a crucial role in the study of causal Bayesian networks. We stand by this statement and we do not feel it is misleading in any way.

Skeleton acquisition: There have been existing works on practical structure learning algorithms (see the survey by Daly et al., 2011). Unfortunately, these often do not have theoretical guarantees for "exact recovery". If we combine a structure learning algorithm with our distribution learning approach, it would give us the proper guarantee as we state in the paper (without the skeleton assumption) if the correct skeleton is output by the structure learning algorithm. If the structure learning phase outputs a wrong skeleton $G_{err}$, we can get an agnostic learning guarantee with respect to the set of distributions with that structure $G_{err}$ (not with respect to the true skeleton $G$). The KL divergence (from the output distribution to the true distribution) in this case may be high, even for "structurally-close" skeletons (e.g., if the learnt skeleton $G_{err}$ is missing a single edge from $G$).

Unknown structure: The general framework in Section B in the supplementary material does not rely on having a known skeleton, and neither does the online-learning based tree-structured distribution learning algorithm in Section D in the supplementary material. In that sense, the online learning framework can indeed be applied to efficiently learn the skeleton as well as the probabilities without knowing the skeleton in advance. But this efficient algorithm does rely on a strong combinatorics result (the generalization of the matrix-tree theorem, Lemma D.13 in the supplementary material), which applies only to rooted trees as far as we know. We do not have a general computationally-efficient algorithm that goes over all possible skeletons for polytrees/chordal graphs.
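For reference, the classical matrix-tree theorem that the cited Lemma D.13 generalizes can be sketched in a few lines: the number of spanning trees equals any cofactor of the graph Laplacian. This is a generic illustration using exact rational Gaussian elimination, not the paper's generalized lemma.

```python
from fractions import Fraction

def count_spanning_trees(n, edges):
    """Kirchhoff's matrix-tree theorem: delete one row and column of
    the Laplacian and take the determinant (computed exactly here
    with rational arithmetic)."""
    # Build the graph Laplacian L = D - A.
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    # Delete the last row and column, then compute the determinant.
    m = n - 1
    M = [row[:m] for row in L[:m]]
    det = Fraction(1)
    for i in range(m):
        piv = next((r for r in range(i, m) if M[r][i] != 0), None)
        if piv is None:
            return 0  # disconnected graph: no spanning tree
        if piv != i:
            M[i], M[piv] = M[piv], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for c in range(i, m):
                M[r][c] -= f * M[i][c]
    return int(det)
```

For the complete graph $K_4$ this returns $4^{2} = 16$, consistent with Cayley's formula.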

Causal interpretation: Unfortunately, our work does not mention “latent confounders” anywhere in the main paper, as well as in the supplementary material.

Empirical feasibility: In this work, we focused on proving sound theoretical guarantees of our results, which will continue to hold in practice. We defer the practical implementation of our results for future work.

Reference:

Daly, R, Qiang, S & Aitken, S, Learning Bayesian Networks: Approaches and Issues, Knowledge Engineering Review, 2011.

Comment

Thank you for your detailed replies. Some clarifications are given below:

  • Skeleton estimation error remains unquantified: You concede that no efficient exact-recovery algorithm exists for chordal skeletons and that a single missing edge can cause unbounded KL loss. Yet you offer neither a finite-sample bound nor even a toy experiment to illustrate how skeleton mistakes affect final performance. Without at least a simple bound or empirical curve, readers cannot judge whether this method is fragile or robust in realistic settings. (Actually, as far as I understand, even learning bounded-treewidth networks is NP-hard, and so this is unlikely.)

  • Unknown-skeleton claims must be clearly limited to trees: Your reference to Section B's framework gives the impression that unknown-skeleton learning extends beyond trees. In fact, it relies entirely on the matrix-tree result and applies only to tree skeletons. Please revise the paper to state explicitly that unknown-skeleton learning is tractable only for trees, and that extending to chordal or polytree skeletons remains a nontrivial open problem.

  • Latent-variable extension is out of scope and needs a disclaimer: Although latent confounders were never mentioned in the paper, readers may nevertheless wonder whether hidden nodes can be handled (recall that one mixture node can yield a high-treewidth network once it is eliminated, and thus most real-world networks have a large number of latent/hidden nodes). Your framework does not support any latent variables. Please add a clear sentence in the introduction such as "We assume that all variables are observed and leave the extension to hidden-variable models for future work."

Comment

Thank you for your comments. We address them below:

(i) Skeleton estimation: We agree that combining the skeleton estimation is not very useful as is, which is why we do not discuss this in the paper and rather include the known-skeleton assumption throughout the paper.

(ii) Unknown-skeleton: The framework in Section B encompasses learning the skeleton. We would like to note that we have never claimed it to be tractable. In our tractable results for polytrees and chordal graphs, we do explicitly mention the known skeleton assumption. Moreover, in Lines 113-115 of our main paper, we explicitly state that no computational hardness result is known under the skeleton assumption. Additionally, there have been prior works with the skeleton assumption (Sharma, 2024, and Choo et al., 2024); we have mentioned these references in Lines 115-118 and Lines 136-139 of our paper. We do not feel the paper has to be revised.

(iii) Latent-variable extension:  We would like to note that we do not mention semi-Markovian models or unobserved confounders anywhere in our paper. We believe that extensions to causal graphs are beyond the scope of the current work. 

References:

1. Vidya Sagar Sharma. A fixed-parameter tractable algorithm for counting markov equivalence classes with the same skeleton. In AAAI, 2024.

2. Davin Choo, Joy Qiping Yang, Arnab Bhattacharyya, and Clément L. Canonne. Learning bounded-degree polytrees with known skeleton. ALT, 2024.

Comment

Regarding your first comment on the Skeleton estimation ("(Actually, as far as I understand even learning bounded treewidth networks is NP-hard and so this is unlikely)"), we would like to note that the hardness result only holds for exact structure learning. Currently, no hardness result is known for distribution learning.

Review (Rating: 4)

This paper gives computationally efficient on-line learning algorithms for certain families of Bayesian Networks: models whose known skeleton is a chordal graph of bounded in-degree, and trees. A computationally efficient algorithm for trees was known (indeed, the algorithm itself, by Chow and Liu, is very old) but the proposed method obtains optimal sample complexity -- the dependence on the size of the range of the random variables is improved to quadratic, rather than cubic. The method yields guarantees for agnostic improper learning of such structures. (A further extension to learning Bayes Nets with a moralization with a O(1)-size vertex cover, using another prior work for sampling, is described in an appendix.)

These results are obtained by leveraging the following simple observation: the multiplicative updates used in existing on-line learning methods can be distributed across the factors of the probabilistic model. This is essentially equivalent to running a separate learner for the conditional probability tables for each of the factors; this corresponds to a mixture of models that can be sampled by independently sampling a table for each factor.
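The per-factor update and independent sampling described above can be sketched as follows. All names are illustrative, and the sketch deliberately omits the global acyclicity constraint that the paper's structure-sampling routine handles.

```python
import math
import random

def sample_factorized_model(node_log_weights, rng):
    """Sketch of factor-wise multiplicative weights: instead of one
    weight per full model, each node keeps log-weights over its own
    candidate conditional probability tables / parent sets, and a model
    from the mixture is drawn by sampling each factor independently.

    node_log_weights: dict mapping node -> {candidate: log-weight}
    rng:              a random.Random instance
    """
    choice = {}
    for node, log_w in node_log_weights.items():
        cands = list(log_w)
        m = max(log_w[c] for c in cands)
        probs = [math.exp(log_w[c] - m) for c in cands]
        z = sum(probs)
        r, acc = rng.random() * z, 0.0
        for c, p in zip(cands, probs):
            acc += p
            if r <= acc:
                choice[node] = c
                break
    return choice
```

Because the weights factorize, a multiplicative update touches only one node's table, which is what makes the per-round cost polynomial rather than proportional to the number of full models.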

Strengths and Weaknesses

The paper has two main strengths. The first is that the underlying idea is very clean and well-suited to these kinds of probabilistic models. The second is that it indeed obtains quantitatively improved guarantees for specific problems, like learning tree-structured distributions.

The main weaknesses are first, that the approach relies on running the learner on a discretization of the conditional probability tables. Although the conditional probability tables themselves are inherently exponential in the number of assignments to the parents, and such a dependence is to be expected, the requirement to maintain parameters for the individual table values seems somewhat impractical. Second, the chordal graph learner relies on possession of a known skeleton for the DAG, and this approach doesn't address the structure learning problem that poses the more fundamental barrier to the use of probabilistic graphical models in practice. It is a theoretical improvement -- I am not aware of a method that is guaranteed to learn such structures -- but it is not clear if it makes a difference in practice.

The "novel link" between learning graphical models and sampling structures, described in the abstract, seems like an over-sell. The work uses sampling methods for improper learning of the models, but this is (just) an application, not the discovery of a deeper connection between the problems. (For example, the paper does not seem to establish a connection in the other direction.)

Questions

Do you have a deeper link between counting/sampling of graph structures and learning graphical models? (Besides the application here, and the casting of the computation of partition functions as a counting problem; I mean, is there a tight link between these problems, as the abstract and introduction seems to suggest? Where is it spelled out in the paper?)

Limitations

Yes

Final Justification

My recommendation remains more or less unchanged:

On the positive side, there is still a cute theoretical idea in the maintenance of the weights node-wise that yields a polynomial time algorithm where one was previously not known.

But, on the negative side, the discussion with the authors seems to confirm the main weaknesses: (1) that the algorithm is still wildly impractical on account of the number of discretized conditional probability tables, each of which must maintain a weight per node, as well as (2) that the "connection" to sampling methods is, rather, just an application and (3) the paper only gives methods for fitting the local tables and edge orientations, not learning the structure.

The main idea has merit, and there is something to learn from the paper, but the weaknesses temper my enthusiasm. I am not sure about the impact. Hence, "weak accept".

Formatting Issues

None

Author Response

Thank you for your review. We address your comments and questions below:

Exponential size conditional probability table: We would like to note that while the EWA/RWM-based algorithms implemented “as is” (as described in Section B of the supplementary material) would use exponential-size tables and take exponential time, the way we implement it for trees/bounded-indegree chordal graphs via sampling does not use exponential-size tables (in the chordal graph case, if we assume that the indegree is constant). Thus, our results on learning tree and bounded-indegree chordal and polytree distributions work in polynomial time, as opposed to exponential time if we had used an exponential-size table. The computational efficiency of our algorithms is the non-trivial contribution of our paper.

Known skeleton assumption: This is indeed a part left open by our paper (in terms of getting formal guarantees for structure learning). In practice, any structure learning method might be used, and the agnostic learning guarantee will hold with respect to the closest Bayes net (to the input distribution) having that structure.

Novel link between learning graphical models and sampling structures: In this work, we have used the online learning and recursive graph structure sampling framework to efficiently learn tree, polytree and chordal structured distributions in KL divergence. We note in Section 3 that we can replace the online learning framework with maximum likelihood estimation (sacrificing the near-optimal sample complexities for improper learning), but even then the graph structure sampling is crucial for getting computationally-efficient algorithms. Unfortunately, we are not aware of a connection in the other direction (graph structure sampling via distribution learning), which seems unlikely in general. It is possible that given a learnt Bayes-net distribution as a density oracle, it could be used to learn the graph structure of that distribution in some non-trivial way which differs from learning the structure through samples (which is NP-hard as shown in Chickering et al. (2004)).

Reference: Max Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-hard. JMLR, 2004.

Comment

Regarding the conditional probability tables per node, the size is still exponential w.r.t. the in-degree (= number of parents), is it not? If not, two questions: (1) why do you require that the in-degree to be constant, and (2) how are you actually representing these conditional probabilities?

Comment

Thank you for your question. Yes, the size of the conditional probability table for every node is exponential w.r.t. the in-degree $d$. In fact, any Bayes net with $n$ nodes and indegree $d$ requires $nk^{d+1}$ parameters. As a result, any polynomial-sample (in $n, k, 1/\varepsilon$) learning algorithm needs to either bound the indegree or the number of parameters. Note that, without such a bound, Bayes nets on $[n]$ can represent general distributions over the $n$-dimensional hypercube, which would require sample complexity exponential in $n$ for learning.

For tree-structured distributions, as $d=1$, we obtain an algorithm with polynomial time and sample complexity. For chordal and polytree distributions, the sample complexity scales as $\tilde{O}\left(\frac{n^4}{\varepsilon^4} + \frac{nk^{d+1}}{\varepsilon}\right)$ and $\tilde{O}\left(\frac{n^3}{\varepsilon^2 \delta^2} + \frac{nk^{d+1}}{\varepsilon}\right)$ for agnostic improper and proper learning, respectively, for alphabet size $k$; these become polynomial for constant $d$.

Comment

Let's put aside the sample complexity and talk about the computational complexity, which is the more relevant issue here. It looks like you have proposed to maintain weights for every value of the conditional probability table, is that correct? While I agree that for constant $d$ this is polynomial time, in order to attain a suitable accuracy overall, the discretization parameter probably needs to scale with the number of variables, right? So it seems like you are maintaining something like $\sim (kn/\epsilon)^d$ weights per node that you need to update in every round. Is this correct, or do you have a more computationally efficient way of maintaining the weights?

Comment

Thank you for your comment. Let us clarify the runtime. At every node, for every choice of the $d$ parents, we learn how the node conditionally depends on its parents. This is done at the outset, and after that, the algorithm focuses only on the combinatorial problem of learning the acyclic orientation. So, the runtime contribution from the node-distribution learning part is $\tilde{O}((\Delta k)^d \cdot n^2/\varepsilon)$, where $\Delta$ is the max degree of the skeleton, since $n \cdot \binom{\Delta}{d}$ bounds the number of all possible (node, parent set) pairings and $\tilde{O}(k^d n/\varepsilon)$ is the runtime of the add-1 algorithm at each node (Def. B.7). Note that the runtime is polynomial even if both $\Delta$ and $d$ are $O(\log n)$.

If $\Delta$ or $d$ is unbounded, then indeed the runtime can grow to be $n^d$. Again, note that for unbounded $d$, exponential dependence is inevitable, since degree-$n$ Bayes nets can capture arbitrary $n$-dimensional distributions.
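The parameter and pairing counts discussed in this exchange can be made concrete with a small sketch (the function names are illustrative, not from the paper):

```python
from math import comb

def cpt_parameters(n, k, d):
    """Free parameters of a Bayes net on n nodes with alphabet size k
    and indegree d: each node has at most k^d parent configurations,
    each a distribution with k-1 free entries, matching the n*k^(d+1)
    scaling quoted above up to constants."""
    return n * (k ** d) * (k - 1)

def node_parent_pairings(n, delta, d):
    """Upper bound n * C(delta, d) on the number of (node, parent-set)
    pairings used in the runtime accounting above, for a skeleton with
    max degree delta."""
    return n * comb(delta, d)
```

For instance, a 3-node binary tree-structured net ($n=3$, $k=2$, $d=1$) has 6 free parameters, while a skeleton with $n=10$, $\Delta=4$, $d=2$ yields at most 60 pairings to learn at the outset.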

Thank you very much for the opportunity to reply. We will make sure to discuss this explicitly in the revision.

Final Decision

This paper considers the task of learning directed graphical models with specific graphical structure. The main contribution is a method that leverages online learning to reduce learning to sampling certain high-dimensional distributions. This reduction is used to obtain improved sample complexity bounds and, in some cases, tractable computational complexity. With the exception of a single review with an unjustifiably low score, the other reviewers viewed this contribution as borderline with an inclination towards accepting the work. Given my own reading and understanding, I recommend borderline acceptance.