PaperHub
Overall score: 7.8/10, Rejected (4 reviewers; lowest 4, highest 5, std 0.4)
Ratings: 4, 5, 5, 5 · Average confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

$\psi$DAG: Projected Stochastic Approximation Iteration for Linear DAG Structure Learning

OpenReview · PDF
Submitted: 2025-04-11 · Updated: 2025-10-29
TL;DR

We propose a scalable framework for learning DAGs using Stochastic Approximation with SGD and efficient projection methods, offering improved performance and computational efficiency for large-scale problems.

Abstract

Keywords
Structure Learning, Continuous Optimization, Directed Acyclic Graphs

Reviews and Discussion

Review (Rating: 4)

This paper proposes $\psi$DAG, a stochastic gradient descent (SGD) method for learning Directed Acyclic Graphs (DAGs) using differentiable acyclicity constraints. It reformulates DAG learning with linear Structural Equation Models (SEMs) in a stochastic manner (Section 3) and leverages vertex ordering to reduce the constraint space (Section 4). Experimental results demonstrate that $\psi$DAG scales effectively to large graphs with up to 10,000 nodes.

Strengths and Weaknesses

Strengths

  • S1 [Quality] The paper is clearly written and well-structured, making the overall approach easy to follow.
  • S2 [Clarity] The motivation and methodological components are well articulated. The claims and corresponding experimental results are also clearly presented.
  • S3 [Significance] The experiments thoroughly evaluate the proposed method using synthetic data, structure recovery tasks, scalability comparisons, and a real-world application.
  • S4 [Significance] DAG learning with differentiable acyclicity constraints is a highly active area of research, and scalability remains a major challenge. $\psi$DAG addresses this issue via stochastic optimization, and its contribution is likely to have significant impact.
  • S5 [Originality] The integration of SGD with DAG learning via differentiable acyclicity constraints is novel and original.

Weaknesses

  • W1 [Significance] The role and justification of vertex ordering require further clarification. Lines 172--193 suggest that using the true vertex ordering for the true DAG (solution) can improve the objective value, but in real-world scenarios, the true ordering is generally unavailable during optimization.
  • W2 [Significance] Extending $\psi$DAG to nonlinear SEMs would further increase the method's utility, as many recent approaches support both linear and nonlinear formulations.

Questions

My main concern is W1 described above. I would appreciate a response to W1.

In addition, I have the following questions:

  • Q1. Does a vertex ordering always exist for $W_{k}^{(1/3)}$?
  • Q2. As I understand it, $W_{k}^{(1/3)}$, $W_{k}^{(2/3)}$ and $W_{k+1}$ become dense matrices. Is this correct? If so, how is sparsity induced in $W_{K}$? I could not find a sparsity-inducing regularization term similar to the one used in NOTEARS.
  • Q3. Can the projection method described in Algorithm 3 also be applied to deterministic settings such as NOTEARS and DAGMA? I am particularly interested in a comparison of the computational time between regularization-based approaches (e.g., NOTEARS, DAGMA) and the projection-based method proposed in Algorithm 3 under a deterministic setting.
  • Q4. What is the role of weights L in Algorithm 3?

Limitations

yes

Final Justification

The authors have addressed all of my questions, and most of my concerns have been resolved. My only remaining concern is that the paper considers only the linear SEM case. However, based on the experimental results, the method demonstrates a practicality that is considerably higher than that of previous approaches, which is worthy of recognition. As a result, I raised my rating.

Formatting Issues

N/A

Author Response

We sincerely thank the Reviewer for the thorough and thoughtful evaluation of our work. We appreciate your recognition of our paper’s clarity, comprehensive experiments, and the novelty and significance of our approach to scalable DAG learning. Your insights are invaluable, and we believe that addressing them will significantly strengthen our work.

W1 [Significance] The role and justification of vertex ordering require further clarification. Lines 172--193 suggest that using the true vertex ordering for the true DAG (solution) can improve the objective value, but in real-world scenarios, the true ordering is generally unavailable during optimization.

Our method does not assume the true vertex ordering; it explores candidate orderings to guide optimization.

The motivation for using vertex orderings stems from the fact that the space of possible orderings ($d!$) is exponentially smaller than the space of all DAGs ($2^{d^2-d}$). This makes projection into the ordering space significantly more tractable than directly optimizing over the full DAG space. We would like to clarify that our method does not assume access to the true vertex ordering, which, as the Reviewer correctly points out, is not available in practice. Instead, our algorithm iteratively explores candidate orderings, and for each one, solves a convex optimization problem to find the optimal DAG consistent with that ordering. This procedure provides meaningful upper bounds and helps guide the search toward better solutions within a highly non-convex landscape.
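
To make this concrete, here is a minimal sketch (our own numpy illustration, not the paper's implementation) of the convex subproblem for one candidate ordering under the linear SEM least-squares loss: each node is regressed only on the nodes that precede it in the ordering, so the fitted graph is acyclic by construction.

```python
import numpy as np

def fit_dag_for_ordering(X, pi, ridge=1e-6):
    """Best linear SEM consistent with the topological ordering pi (a list of
    node indices, parents before children). Decomposes into d independent
    ridge-regularized least-squares problems, one per node."""
    n, d = X.shape
    W = np.zeros((d, d))
    for j in range(1, d):
        child, parents = pi[j], list(pi[:j])
        P = X[:, parents]                      # only predecessors may be parents
        coef = np.linalg.solve(P.T @ P + ridge * np.eye(len(parents)),
                               P.T @ X[:, child])
        W[parents, child] = coef               # W[i, j] != 0 encodes edge i -> j
    return W                                    # acyclic by construction
```

Under these assumptions, the convex problem attached to each ordering is simply a collection of independent regressions.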

We will update the manuscript to clarify that the true ordering is not used or assumed during optimization. We hope this addresses your concerns, and we would be happy to elaborate further if needed.

W2 [Significance] Extending $\Psi$DAG to nonlinear SEMs would further increase the method's utility, as many recent approaches support both linear and nonlinear formulations.

Extending $\Psi$DAG to nonlinear SEMs is indeed a valuable direction and is entirely feasible within our framework. However, in this work, we deliberately focused on the linear setting to highlight the core methodological contributions and scalability advantages of our approach. We have also acknowledged this limitation in the paper.

Q1. Does a vertex ordering always exist for $W_{k}^{(1/3)}$?

Yes, a vertex ordering always exists due to the construction of the projection step. While the ordering may not be unique, the projection is explicitly designed to yield an acyclic structure, which guarantees at least one valid topological ordering.
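
For illustration only (this is a generic routine, not the projection of Algorithm 3), one valid topological ordering of an acyclic weighted adjacency matrix can be recovered with a standard Kahn-style peeling:

```python
import numpy as np
from collections import deque

def topological_order(W, tol=1e-8):
    """Return one topological ordering of the DAG encoded by W (W[i, j] != 0
    means edge i -> j). Raises if the graph actually contains a cycle."""
    A = (np.abs(W) > tol).astype(int)
    indeg = A.sum(axis=0)                       # number of parents per node
    queue = deque(np.flatnonzero(indeg == 0))   # start from source nodes
    order = []
    while queue:
        u = queue.popleft()
        order.append(int(u))
        for v in np.flatnonzero(A[u]):          # remove u's outgoing edges
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != A.shape[0]:
        raise ValueError("graph contains a cycle")
    return order
```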

Q2. As I understand it, $W_{k}^{(1/3)}$, $W_{k}^{(2/3)}$ and $W_{k+1}$ become dense matrices. Is this correct? If so, how is sparsity induced in $W_{K}$? I could not find a sparsity-inducing regularization term similar to the one used in NOTEARS.

Sparsity is not enforced, but can be easily added.

Both $W_{k}^{(2/3)}$ and $W_{k+1}$ are DAGs by construction. While our current implementation does not include a sparsity-inducing regularizer, such as an $\ell_1$-penalty, it can easily be added to encourage sparsity if desired. Our framework is fully compatible with sparsity regularization.
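
As a hypothetical illustration of how such a penalty could be added (this helper is our own sketch, not part of the paper), an $\ell_1$ term $\lambda \|W\|_1$ is typically handled with a proximal soft-thresholding step after each gradient update:

```python
import numpy as np

def soft_threshold(W, lam, lr):
    """Proximal step for lam * ||W||_1 after a gradient update with step size lr.
    Entries smaller than lr * lam in magnitude are set exactly to zero,
    which is what induces sparsity."""
    return np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)
```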

Q3. Can the projection method described in Algorithm 3 also be applied to deterministic settings such as NOTEARS and DAGMA? I am particularly interested in a comparison of the computational time between regularization-based approaches (e.g., NOTEARS, DAGMA) and the projection-based method proposed in Algorithm 3 under a deterministic setting.

Our projection can be applied to deterministic settings, but $\Psi$-DAG remains more efficient due to stochastic updates.

Yes, the projection method described in Algorithm 3 can, in principle, be applied to deterministic settings such as NOTEARS by removing the acyclicity constraint h(A) and instead performing a projection onto the DAG space after each update. However, one of the key advantages of our approach lies in its computational efficiency: unlike NOTEARS and DAGMA, which rely on computing full gradients over the entire dataset at each step, our method uses Stochastic Gradient Descent (SGD)-type updates. This significantly reduces the computational burden, especially in high-dimensional or large-sample regimes. A direct empirical evaluation under a fully deterministic setting falls beyond the scope of the current work.

Q4. What is the role of weights L in Algorithm 3?

The weight matrix $L$ encodes the relative importance of edges for the projection.

Great question. The weights $L$ originate from an optimization perspective, particularly inspired by coordinate descent methods, where some directions may have a larger impact on the loss than others. For instance, consider the optimization problem $f(x,y) = 100(x-1)^2 + 0.01(y+1)^2$. In this case, an error of $\varepsilon = 0.1$ in the x-coordinate ($x = x^* + \varepsilon$) results in a significantly higher increase in loss than the same error in the y-coordinate. This reflects the fact that the x-direction is more important to the objective function.

Analogously, in our framework, the weight matrix L captures the relative importance of each edge during the projection step. Additional discussion on the benefits and role of L can be found in Appendix E.
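
A quick numerical check of the two-dimensional example above (plain Python; values approximate due to floating point):

```python
def f(x, y):
    return 100 * (x - 1) ** 2 + 0.01 * (y + 1) ** 2

eps = 0.1
print(f(1 + eps, -1))   # ~1.0   : an eps-error along x is costly
print(f(1, -1 + eps))   # ~1e-4  : the same error along y is almost free
```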

We hope that we answered all of the questions posed in the review. We are happy to answer any additional questions in the discussion period.

Comment

I sincerely thank the authors for the detailed response.

Regarding W1, I now have a better understanding of the proposed method thanks to the response. However, I find it somewhat puzzling that convergence is guaranteed even though the candidate orderings may not necessarily include the true DAG. For example, in Algorithm 1, if the result of the projection $\psi$ in line 4 were to vary significantly at each iteration, it seems that ensuring convergence would be challenging.

Regarding W2, the reformulations in equations (8) and (9) appear to be feasible only for linear SEMs, which suggests that extending the approach to nonlinear SEMs may not be straightforward.

Comment

We sincerely thank the Reviewer for their questions and for engaging during the rebuttal process.

We appreciate the opportunity to provide further intuition on the convergence results and the geometric properties of the problem.

First, we would like to emphasize that the DAG constraint is highly non-convex, which makes the original optimization problem challenging in terms of finding a global minimum. As a result, theoretical guarantees can only aim at convergence to a local minimum.

In our method, Step 5 of Algorithm 1 is the crucial part for convergence to a local minimum. After each step, it returns a point $W_{k+1}$ that satisfies the DAG constraint. Importantly, $W_{k+1}$ is close to the global minimum $W_{\pi}^{\ast}$ of the problem restricted to the linear subspace corresponding to the chosen topological ordering $\pi$ (Theorem 7). The point $W_{\pi}^{\ast}$ is either a local minimum of the original problem, due to the non-convexity of the constraint set, or it belongs to the orthogonal subspace $\pi_2$ with a different minimum. In the latter case, Step 3 of Algorithm 1 allows the algorithm to escape from $\pi_1$ and move toward the minimum in subspace $\pi_2$, thus improving the objective. As an additional safeguard, one could track all previously visited orderings and their associated minima, and avoid projecting onto already explored orderings. Although we did not employ this safeguard in our current experiments, it can easily be incorporated to further enhance performance.
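
A hypothetical sketch of that safeguard (the two callables stand in for Steps 2 and 3 of Algorithm 1 and are not part of the paper's code):

```python
def explore_ordering(W, X, project_to_ordering, solve_fixed_ordering, visited):
    """Skip orderings that were already explored. `visited` maps an ordering
    (as a tuple) to the best objective value found under it."""
    pi = tuple(project_to_ordering(W))          # Step 2: projection -> ordering
    if pi in visited:
        return None                             # already explored; perturb and retry
    W_pi, obj = solve_fixed_ordering(X, pi)     # Step 3: convex problem on ordering
    visited[pi] = obj
    return W_pi
```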

We also note that many state-of-the-art methods (e.g., NOTEARS, GOLEM, NOCURL) provide convergence guarantees only to stationary points of their regularized formulations. In contrast, our method benefits from a structured reformulation, which allows us to guarantee convergence to a local minimum, as stated in Theorem 8.

Regarding W2 and Nonlinear SEMs, from our perspective, the main challenge lies in modifying the projection step for nonlinear SEMs as formulated in Problem (8) of [1]. Equations (8) and (9) would need to be adapted according to the specific loss function of the nonlinear SEM. While we do have a clear strategy for extending our approach, the required changes would introduce additional theoretical and algorithmic complexities that merit a more focused and in-depth treatment. For these reasons, we believe this extension is best addressed in future work. We intentionally cited [1] in the Future Work section to highlight this direction.

We hope our responses address the Reviewer’s questions and would be happy to clarify further if needed.

[1] Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pp. 3414–3425. PMLR, 2020.

Comment

Dear Reviewer jmb6,

As the discussion period is approaching its end, we wanted to kindly follow up to ask whether our responses have addressed all of your concerns. We greatly appreciate your engagement and would be happy to clarify any remaining questions if needed.

Comment

Thank you very much for your detailed response. I now have a clearer understanding of the key points. I believe the proposed method is practical and plan to raise my score accordingly. I also apologize for the delay in my response.

Comment

Thank you very much for your kind response and for taking the time to revisit our work. We are very glad to hear that our clarifications were helpful, and we truly appreciate your thoughtful consideration and support. We are grateful for your engagement throughout the review process.

Review (Rating: 5)

The paper proposes a new linear DAG structure learning algorithm based on stochastic approximation with novel projection methods to enforce the DAG constraints so the problem can be solved using SGD. The empirical results are impressive compared to existing continuous optimization approaches, both in terms of accuracy and runtime/scalability (the proposed method can be applied to larger graphs).

Strengths and Weaknesses

Strengths:

  1. The approach is novel and the methodological contribution appears to be significant (I am unable to check proofs).
  2. The new optimization method appears to be more stable than existing continuous optimization approaches, leading to both better accuracy and runtime.
  3. The approach enables DAG structure learning on larger graphs than other existing methods.

Weaknesses:

  1. The approach is limited to linear DAGs
  2. Since the focus is on linear models, there may be some other baselines worth including that also focus on the linear case, e.g. high-dimensional PC, LiNGAM
  3. There are several assumptions/approximations required at different steps in the process, which makes it unclear what kind of correctness guarantee there may be.

Questions

  1. Is there a correctness guarantee that can be stated formally?
  2. Have the authors tried running their approach in the nonlinear setting? If so, how do these results compare?
  3. Did the authors consider any additional experiments to specifically measure the stability of the optimization relative to other approaches?

Limitations

The linear assumption should be stated more clearly in the paper.

Formatting Issues

N/A

Author Response

We sincerely thank the Reviewer for their comments and for recognizing the novelty, significance, and robust performance of our method. We are particularly grateful for the acknowledgment of its stability and strong gains over prior continuous optimization approaches. We will incorporate high-dimensional PC and LiNGAM baselines in the camera-ready version as suggested. While our approach is extendable to nonlinear settings, we focused on the linear case to clearly highlight the core contributions, a point we will make more explicit in the revised manuscript.

  1. Is there a correctness guarantee that can be stated formally?

Exact DAG recovery is NP-hard, but we provide local convergence guarantees.

We agree that theoretical guarantees are important. Recovering the true underlying DAG is known to be NP-hard in general. As such, no algorithm, including ours, can guarantee exact recovery of the true DAG in polynomial time without performing an exponential number of steps. Moreover, the DAG learning problem is inherently non-convex and may contain exponentially many local minima, making it theoretically hard to guarantee convergence to a global minimum in reasonable time.

From a theoretical perspective, most existing continuous optimization-based approaches (NOTEARS, GOLEM, NoCurl) can only guarantee convergence to stationary points due to this non-convexity. In contrast, our method benefits from a structured reformulation, which allows us to guarantee convergence to a local minimum, as stated in Theorem 8.

  2. Have the authors tried running their approach in the nonlinear setting? If so, how do these results compare?

We focused on establishing a novel and scalable framework for differentiable DAG learning, leveraging a stochastic formulation to improve computational efficiency and scalability. To clearly isolate and evaluate the core contributions, our experiments are currently limited to linear SEMs. However, extending the proposed approach to nonlinear SEMs, such as those considered in Zheng et al. (2020) [1], is a natural direction we left for future work.

  3. Did the authors consider any additional experiments to specifically measure the stability of the optimization relative to other approaches?

We evaluated stability across multiple runs and under varying noise; our method is consistently robust.

Great question. All results presented in the main paper are averaged over multiple runs with different random initializations. We report the mean values and indicate the standard error using shaded regions. We note that some baseline methods failed to converge under certain random seeds, as described in lines 273–279, highlighting their sensitivity to initialization.

In addition, we explicitly evaluated robustness under varying noise ratios, shown in Figure 4. While other continuous optimization baselines (GOLEM and NOCURL) exhibit increasing SHD (lower is better) as the noise ratio increases, our method maintains a consistently low SHD across all noise levels. This demonstrates the robustness and stability of our approach under more challenging conditions. Notably, DAGMA failed to converge for noise ratios above 15, further emphasizing the advantage of our method in terms of both convergence and stability.

[1] Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pp. 3414–3425. PMLR, 2020.

Comment

Dear Reviewer M4a3,

As the discussion period is approaching its end, we wanted to kindly follow up to ask whether our responses have addressed all of your concerns. We greatly appreciate your engagement and would be happy to clarify any remaining questions if needed.

Review (Rating: 5)

This paper proposes a gradient-based method to learn the structure of directed acyclic graphs (DAG). The authors formulate the DAG structure learning as a set of linear structural equation models and formulate DAG learning as stochastic approximation. The novelty of the method is to add a projection after SGD steps, and an estimated adjacency matrix is introduced as a mask to ensure computational efficiency and scalability. The authors evaluated the proposed method against multiple baselines and on various DAG learning tasks. The proposed method demonstrates consistent improvements in accuracy (structural hamming distance) and runtime or scalability compared to the baselines.

Strengths and Weaknesses

Strengths

  • The key strengths are the scalability and accuracy of the proposed method compared to many baselines.
  • The principles and deduction of the solution seem to be reasonable and mathematically sound.

Weaknesses

  • The paper lacks an intuitive explanation of the applicability of the linear SEM and the corresponding DAG. It seems that the linear SEM only applies to stochastic graphical models, but in many real-world applications conditional SEMs are more common; in other words, some nodes have deterministic inputs or features rather than purely stochastic ones. The proposed method does not seem able to cover these applications.
  • It is unclear how many observations are required for the proposed approach to estimate DAG. In many real-world applications, the observations or data samples for DAG are limited, and may be partial or unbalanced across nodes. How does it perform when the observations are limited? The SGD method seems to rely on a large number of observations X.
  • It is unclear if or how the proposed method works for non-i.i.d. noise. The paper seems to be evaluated on well-behaved noise, such as Gaussian, exponential, or Gumbel distributions, and the noise is i.i.d. across nodes. This is questionable since non-i.i.d. noise would be more common in such causal inference or DAG models, since different nodes typically represent different entities.
  • The presentation could be improved in terms of organization and consistency of notation. It is unclear on a first reading what the typical applications of the method are, what the required observations X are and how they relate to real-world data, and what assumptions the method makes. The notation in Algorithm 3, lines 1, 3, and 4, is not clear ($\mathbf{1}\mathbf{1}^T$, the dots between W and L, and [:]).

Questions

  • What are the typical applications or applicable problems for the linear SEM / DAG, where random noise serves as external inputs on all the nodes?
  • What assumptions are required for the formulation and method? Are i.i.d. noise or a known noise distribution required? Does it support discrete noise distributions?
  • Many systems represented by DAGs have deterministic or semi-deterministic inputs, and/or the inputs for different nodes follow different distributions. Is the proposed method capable of working on these conditional SEMs / DAGs?
  • What are the requirements on the amount of samples for the proposed method to work?
  • What are the limitations of the proposed method?

Limitations

  • May not work well for problems with limited sample/observations.
  • May not work well for non-i.i.d. noise inputs or when some nodes have deterministic external inputs or the inputs follow discrete distributions.

Final Justification

The authors addressed most of my concerns; therefore, I am going to raise my rating. However, there is still a lack of quantitative assessment of the minimum number of observations required for the proposed method to work.

Formatting Issues

NA

Author Response

We thank the Reviewer for recognizing the strengths of our work, including its scalability, accuracy, and mathematical soundness. Your insights are invaluable, and we believe that addressing them will significantly strengthen our work.

(W1+Q3): Applicability beyond linear SEMs

Yes, $\Psi$DAG supports conditional SEMs with deterministic or fixed inputs. Below we show empirical results where $\Psi$DAG outperforms baselines in such settings.

SHD (↓)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS              4        8        8        31
GOLEM                9       25       36        79
NOCURL               2       10       50        24
DAGMA                4        8       12        15
Ψ-DAG (ours)         0        0        0         0

F1 Score (↑)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS            0.8      0.8      0.8       0.8
GOLEM              0.7      0.6      0.7       0.7
NOCURL             0.9      0.7      0.7       0.8
DAGMA              0.8      0.8      0.8       0.8
Ψ-DAG (ours)         1        1        1         1

The proposed method is fully compatible with conditional SEMs where some variables are deterministic or fixed across samples. From an optimization perspective, the algorithm operates on observed data, regardless of whether the input values are drawn stochastically or held constant. However, it is important that the overall model includes a stochastic component in the generation of each node (i.e., additive noise) to maintain identifiability. If some structural weights were strictly deterministic and the noise component entirely absent, identifiability would be compromised, not due to our method specifically, but as a general limitation in the causal discovery literature. In this regard, our method adopts the same assumptions as widely used methods (e.g., NOTEARS, GOLEM, NOCURL), and remains applicable to conditional SEM settings where inputs are partly deterministic.
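
For concreteness, here is a hedged data-generation sketch (our own, not the script behind the table above) of a conditional linear SEM in which selected nodes receive a constant deterministic input while the others keep an additive noise term; the function name and arguments are hypothetical:

```python
import numpy as np

def sample_conditional_sem(W, n, fixed=None, noise_scale=1.0, seed=0):
    """W is a strictly upper-triangular weight matrix (W[i, j] = edge i -> j).
    Nodes listed in `fixed` (e.g. {0: 1.5}) get a constant input instead of
    noise; note that removing noise everywhere would hurt identifiability,
    as discussed above."""
    rng = np.random.default_rng(seed)
    d = W.shape[0]
    fixed = {} if fixed is None else fixed
    X = np.zeros((n, d))
    for j in range(d):                          # generate in topological order
        drive = fixed[j] if j in fixed else rng.normal(0.0, noise_scale, n)
        X[:, j] = X @ W[:, j] + drive
    return X
```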

(W2+Q4): Limited sample performance

$\Psi$-DAG performs well even in low-sample regimes, despite leveraging stochastic optimization.

From an optimization perspective, our method handles sum-type objectives such as the mean squared error $F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i)$ by treating them as stochastic expectation problems of the form $E_\xi[f(w, x_{\xi})]$ where $\xi \in \{1,\ldots,n\}$. This allows the use of stochastic optimization techniques even when the number of samples is limited.
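
A minimal sketch of one such SA-style update on the least-squares SEM objective, using plain mini-batch SGD (the paper reports using the Universal Stochastic Gradient Method, so this is only illustrative):

```python
import numpy as np

def sgd_step(W, X, lr=1e-2, batch=64, rng=None):
    """One stochastic update on F(W) = (1/n) * ||X - X W||_F^2,
    estimated from a random mini-batch of rows of X."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.choice(n, size=min(batch, n), replace=False)
    Xb = X[idx]
    resid = Xb - Xb @ W
    grad = -2.0 / len(idx) * Xb.T @ resid       # unbiased stochastic gradient
    return W - lr * grad
```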

Although $\Psi$-DAG leverages stochastic gradient updates, which are typically associated with large datasets, we observe that it remains effective in low-sample regimes. In particular, our experiments on a real-world dataset (lines 283-300), where the sample size is small, show that $\Psi$-DAG outperforms baselines, indicating strong robustness to limited data.

W3: Non-i.i.d. noise

$\Psi$-DAG is compatible with non-i.i.d. noise and shows strong empirical robustness in such settings. We include below preliminary results under non-i.i.d. noise (up to $d=100$), showing $\Psi$-DAG remains competitive.

Our primary focus in this work was on addressing the DAG constraint, which we consider to be the most computationally challenging component of the problem. The noise distribution primarily influences the form of the loss function, and our framework is agnostic to whether the noise is i.i.d. or non-i.i.d. across nodes.

We conducted additional experiments using synthetic data with heterogeneous noise. We first sample a random DAG from the Erdős–Rényi model and randomly select two variables whose structural mechanisms vary across subgroups. For each node, data is generated following a topological order: if a node has parents, its value is computed as a weighted sum of its parents passed through a randomly selected nonlinear function ($\sin$, $\tanh$, square, or linear), plus noise. The noise is sampled from either a Gaussian or uniform distribution with varying scales, introducing non-i.i.d. noise across nodes and subgroups.
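
A hedged sketch of such a generator, based on our reading of the description above (not the authors' exact script):

```python
import numpy as np

def sample_noniid_sem(d=10, n=500, p=0.3, seed=0):
    """Erdos-Renyi DAG with a random nonlinearity per node and heterogeneous
    (Gaussian or uniform, varying scale) noise across nodes."""
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((d, d)) < p, k=1)         # DAG over the order 0..d-1
    W = A * rng.uniform(0.5, 2.0, size=(d, d))
    funcs = [np.sin, np.tanh, np.square, lambda z: z]
    X = np.zeros((n, d))
    for j in range(d):                               # follow the topological order
        scale = rng.uniform(0.5, 2.0)
        noise = (rng.normal(0.0, scale, n) if rng.random() < 0.5
                 else rng.uniform(-scale, scale, n))
        parents = np.flatnonzero(W[:, j])
        if parents.size:
            f = funcs[rng.integers(len(funcs))]
            X[:, j] = f(X[:, parents] @ W[parents, j]) + noise
        else:
            X[:, j] = noise
    return X, W
```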

We report SHD (lower is better) and F1 score (higher is better) below, and $\Psi$-DAG consistently outperforms all baselines. We will update the paper to include these new results and clarify the generality of our method to non-i.i.d. noise settings.

SHD (↓)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS              6       24       47       156
GOLEM                8       43       75       128
NOCURL              10       38      100       297
DAGMA               11       27       50       143
Ψ-DAG (ours)         1       12       31        81

F1 Score (↑)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS            0.4      0.4      0.4       0.3
GOLEM              0.2      0.2      0.1       0.1
NOCURL             0.3      0.4      0.3       0.2
DAGMA              0.3      0.5      0.4       0.0
Ψ-DAG (ours)       0.9      0.8      0.7       0.5

Q1: Linear SEM applicability

Linear SEMs with additive noise are broadly applicable in real-world domains and remain essential for identifiability.

  • In genomics, linear SEMs are used to infer gene regulatory networks from gene expression data, where each gene’s expression is influenced by others plus latent biological noise [1]

  • In biology, SEMs have been used by Sachs et al. [2] to model causal protein-signaling networks from single-cell data, capturing interactions between molecules under measurement noise.

  • In economics and social science, SEMs model interactions among variables like education, income, and policy indicators [3,4]

These models rely on additive noise to capture unobserved variability and ensure identifiability, as discussed in works such as [5,6].

Identifiability requirement: The presence of stochasticity (i.e., a noise term) at every node is a key assumption for ensuring identifiability in many causal discovery methods, including NOTEARS and GOLEM.

Q2: Noise distribution

Our method is distribution-agnostic and works with both continuous and discrete, i.i.d. and non-i.i.d. noise.

$\Psi$-DAG does not require any explicit assumptions about the form of the noise distribution. In particular, it does not assume that the noise is i.i.d. across nodes, nor does it require the noise distribution to be known. The formulation is flexible and can accommodate both i.i.d. and non-i.i.d. noise structures. As shown above, it works under non-i.i.d. settings as well, although for benchmarking purposes we focused on standard assumptions.

Furthermore, $\Psi$-DAG does not impose restrictions on the type of noise; it can handle continuous and discrete distributions. While our experiments focused on standard continuous noise settings (e.g., Gaussian, Gumbel), the framework itself is general and applicable to settings with discrete noise as well.

Q5: Limitations

$\Psi$-DAG does not guarantee global optimality, but can serve as a fast and effective subroutine in exact solvers.

The primary goal of $\Psi$-DAG is to efficiently find high-quality feasible solutions in a short amount of time, leveraging stochastic optimization and projection techniques. However, the method does not guarantee globally optimal solutions due to the combinatorial nature of the DAG learning problem.

For settings where exact recovery of the true DAG is essential, $\Psi$-DAG can be effectively combined with a branch-and-bound strategy over the vertex ordering space. While the space of DAGs is of size $2^{d^2-d}$, the space of vertex orderings is significantly smaller at $d!$. In such frameworks, $\Psi$-DAG can be used to provide both upper and lower bounds for each ordering, which are crucial for efficient pruning during the search. This makes it particularly suitable as a bounding subroutine within exact combinatorial solvers.

We hope that we answered all of the questions posed in the review. We are happy to answer any additional questions in the discussion period.

[1] Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A., Zhang, C., Xie, T.,Tran, L., Dobrin, R., et al. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease.

[2] Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D., and Nolan, G. Causal protein-signaling networks derived from multiparameter single-cell data.

[3] Bollen, K. A. Structural Equations with Latent Variables. Wiley.

[4] Morgan, S. and Winship, C. Counterfactuals and causal inference. Cambridge University Press, 2015.

[5] Peters et al., “Causal discovery with continuous additive noise models,”

[6] Kalisch & Bühlmann, “Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm,”

Comment

Thank you for the detailed comments. Most of my concerns are addressed, I would be willing to raise my rating by 1 point.

Comment

Dear Reviewer csPF,

Thank you very much for your kind response and for taking the time to revisit our rebuttal. We truly appreciate your positive evaluation and are glad to hear that your concerns have been addressed.

Review (Rating: 5)

In this paper, the authors propose $\Psi$-DAG, a novel structure learning method for linear Additive Noise Models. Unlike existing methods, $\Psi$-DAG consists of a three-step algorithm that: (i.) finds a possibly cyclic preliminary solution, (ii.) refines it into an acyclic solution from which to extract the topological order, and (iii.) further refines it by keeping the topological order. Overall, the method outperforms competing structure learning methods on the selected benchmark.

Strengths and Weaknesses

Strengths:

  • Empirically, $\Psi$-DAG outperforms existing structure learning methods (PC, FGES, NOCURL, GOLEM, DAGMA) and does not incur computational costs as high as the others, apart from DAGMA, which is still faster. However, I have concerns about the chosen benchmark (see questions).

Weaknesses:

  • The idea of extracting a topological order from a preliminary solution before fitting the final solution after fixing the order is similar to the NOCURL approach. A detailed comparison between the two is not discussed in the paper.
  • The authors do not discuss whether simulated data has been normalized or not. This is important as non-normalized benchmarks are known to be flawed for the evaluation of structure learning algorithms ("Beware of the simulated DAG!", Reisach et al. 2021).
  • I might have missed it, but it is not clear to me how the "stochastic reformulation" built in Section 3 differs from the usual MSE problem.

Questions

  • Why are results of NOTEARS on the synthetic dataset missing in the main body? When reported in the Appendix (Figure 7), NOTEARS seems to perform better than $\Psi$-DAG.
  • How is the acyclic optimization for the second step of your algorithm performed? Which acyclicity constraint are you employing?
  • How does the extraction of the topological ordering differ from the procedure used by NOCURL? In their work, they also find a preliminary solution and then extract a topological order to then refit the final weighted adjacency matrix.
  • How would the method perform on normalized data?

Limitations

yes

Final Justification

My main concerns were about the methodological comparison of $\psi$-DAG with NOCURL, the interpretation of their loss function, and the lack of results on normalized data. In their rebuttal, the authors addressed the first two issues and provided preliminary results on the last weakness. Given the rebuttal and the discussion, I now recommend acceptance of the paper.

Formatting Issues

none

Author Response

We sincerely thank the reviewer for their valuable feedback. Your insights help us refine our work and improve its clarity and impact. Below, we address the key points raised:

(W1 + Q3): Comparison with NOCURL

Our method avoids acyclicity constraints and uses a fundamentally different projection and optimization strategy for better scalability.

NOCURL first solves the unconstrained optimization problem $A^{\text{pre}} = \arg\min F(A, X) + \lambda h(A)$, with the NOTEARS acyclicity constraint $h(A) = \mathrm{tr}[\exp(A \circ A)] - d$ (costly matrix exponential), then applies Hodge decomposition to project to a DAG. It requires computing gradients of $h(A)$ twice with different $\lambda$s, which increases the cost (Section 4: Algorithm: DAG with No Curl [6]).
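
For reference, a minimal sketch of the NOTEARS-style penalty referred to above; each evaluation needs a dense matrix exponential, which is the cost that a projection-based approach sidesteps (assumes scipy is available):

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(A):
    """h(A) = tr(exp(A * A)) - d, with A * A taken elementwise; h is zero iff
    the weighted graph is acyclic. The matrix exponential makes every
    evaluation (and its gradient) O(d^3)."""
    d = A.shape[0]
    return float(np.trace(expm(A * A))) - d
```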

Our method avoids this entirely:

  • We never compute $h(A)$ or its gradients at any stage.

  • We perform a lightweight projection step (Algorithm 3) to directly obtain a topological ordering.

  • We solve a stochastic convex optimization (Problem 10) using this fixed ordering, which avoids expensive matrix operations and allows scalable SGD-based updates.

A detailed comparison with NOCURL is provided in Appendix A (lines 543–550) due to space limitations in the main text. We will move it into the main body in the camera-ready version for better visibility.

(W2 + Q4): Normalization of Synthetic Data

We show that our method works under realistic, non-equal noise variance conditions.

We thank the reviewer for raising this important point regarding standardization, which is sometimes referred to as a type of normalization of the simulated data. As highlighted by [1], non-standardized benchmarks can artificially inflate performance due to variance patterns (varsortability) that some algorithms exploit. Building on this, [2] demonstrated that continuous structure learning methods often fail under standardization because it inflates the effective noise ratio, breaking assumptions of equal noise variance. In our experiments (see lines 229–233), we deliberately used a non-equal noise ratio variance setting to provide a more realistic and challenging evaluation that avoids reliance on marginal variance cues. This setting aligns with the more comprehensive evaluation approach suggested by recent works and reflects real-world noise heterogeneity.

Our model demonstrated consistent performance across a wide range of noise ratios, as shown in Figure 4, indicating that it is robust to the distributional changes introduced by standardization. This suggests that our method would remain effective even when applied to standardized data, as it does not rely on marginal variance patterns or uniform noise levels. We apologize if this was not sufficiently clear in the current manuscript and will include a more detailed explanation to explicitly clarify our handling of standardization and noise variance.

W3: I might have missed it, but it is not clear to me how the "stochastic reformulation" built in Section 3 differs from the usual MSE problem.

The objective is similar, but our optimization is stochastic, enabling SGD-based scalability.

The stochastic reformulation introduced in Section 3 shares a similar objective with the standard MSE formulation, as both aim to minimize reconstruction error. However, the key difference lies in how the optimization is conducted. The stochastic formulation enables the use of Stochastic Gradient Descent (SGD)-type updates, which are computationally efficient and scalable to large graphs. In contrast, most benchmark methods, such as NOTEARS, GOLEM, and NOCURL, rely on full-batch gradient steps, which are more computationally demanding and often suffer from poorer scalability.

From a theoretical standpoint, this distinction aligns with the difference between Stochastic Approximation (SA) and Sample Average Approximation (SAA). SA methods, such as the one used in our formulation, optimize toward the true, unbiased solution by treating the data as stochastic samples. In contrast, SAA methods optimize over fixed empirical losses, which can lead to biased estimates and, consequently, the inability to consistently recover the true underlying DAG structure [3, 4]. SA-based strategies are widely adopted in other machine learning domains (e.g., deep learning), where they have been shown to outperform deterministic alternatives [5]. In our implementation (lines 199–200), we employed an advanced variant of Stochastic Gradient Descent, the Universal Stochastic Gradient Method proposed by Rodomanov et al. (2024), which adapts learning rates automatically.

Q1: Why are results of NOTEARS on the synthetic dataset missing in the main body? When reported in the Appendix (Figure 7), NOTEARS seems to perform better than $\Psi$-DAG.

NOTEARS is significantly slower and less scalable. We focused on DAGMA since it improves on NOTEARS in both accuracy and scalability.

We would like to clarify the purpose of Figure 7 in the appendix. This figure illustrates that, under the equal-variance setting, the conclusions of [1] and [2] hold, specifically, that continuous optimization methods such as NOTEARS and GOLEM achieve good structural recovery. However, as shown in Figure 7b, the runtime of NOTEARS and GOLEM increases substantially with the number of nodes, limiting their scalability. While NOTEARS achieves SHD comparable to our method in this simplified setting, our approach does so with significantly lower computational cost, as shown in Figure 7b. Given NOTEARS' scalability limitations and the fact that DAGMA improves upon NOTEARS both in accuracy and efficiency (see lines 213–214), we focused on DAGMA in the main experiments. Nonetheless, we emphasize that DAGMA becomes slower than our method as the number of nodes increases, as demonstrated in Figures 5 and 9. We will include NOTEARS in the relevant figures, despite its limited scalability, and will update the appendix to clarify the intent of Figure 7 and to highlight these distinctions more explicitly.

Q2: How is the acyclic optimization for the second step of your algorithm performed? Which acyclicity constraint are you employing?

Acyclicity is guaranteed via a projection step that yields a topological order, no gradients or acyclicity constraints needed.

We would like to clarify the acyclic optimization procedure in our method. Optimization is performed only in Steps 1 and 3 of our algorithm. In Step 2, as shown in Algorithm 3, we apply a projection operator that transforms the intermediate matrix into an acyclic graph and returns a corresponding topological ordering. This projection step enforces acyclicity by construction and does not involve gradient-based optimization. Once the topological order is obtained, Step 3 solves an optimization problem with a fixed vertex ordering, as detailed in Section 4.1 (Optimization for Fixed Vertex Ordering). Because the ordering is fixed, the optimization is constrained to the space of acyclic graphs, ensuring that every iterate during the optimization remains a DAG. More specifically, the optimization problem is reformulated as Problem (10), a stochastic convex optimization problem that can be solved using any stochastic gradient method. In our implementation (lines 199–200), we use the Universal Stochastic Gradient Method [Rodomanov et al., 2024] to ensure stable and efficient convergence.
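
To make the fixed-ordering constraint concrete, here is a small sketch in our own notation (assuming edges point from earlier to later positions in the ordering; this is illustrative and not necessarily the exact parameterization of Problem (10)):

```python
import numpy as np

def ordering_mask(pi):
    """Binary mask of edges allowed under the topological ordering pi:
    only parent pi[i] -> child pi[j] with i < j, so any masked matrix
    is a DAG by construction."""
    d = len(pi)
    M = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            M[pi[i], pi[j]] = 1.0
    return M

# Every stochastic-gradient iterate can be kept feasible by masking, e.g.:
# W = (W - lr * grad) * ordering_mask(pi)
```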

We hope that we answered all of the questions posed in the review. We are happy to answer any additional questions in the discussion period.

[1] Reisach, Alexander, Christof Seiler, and Sebastian Weichwald. "Beware of the simulated dag! causal discovery benchmarks may be easy to game." Advances in Neural Information Processing Systems 34 (2021): 27772-27784.

[2] Ng, Ignavier, Biwei Huang, and Kun Zhang. "Structure learning with continuous optimization: A sober look and beyond." Causal Learning and Reasoning. PMLR, 2024.

[3] Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A., 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4), pp.1574-1609.

[4] Lan, G., 2012. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1), pp.365-397.

[5] Bottou, L., Curtis, F.E. and Nocedal, J., 2018. Optimization methods for large-scale machine learning. SIAM review, 60(2), pp.223-311.

[6] Yu, Yue, et al. "DAGs with no curl: An efficient DAG structure learning approach." International Conference on Machine Learning. PMLR, 2021.

Comment

I thank the authors for their rebuttal.

I am satisfied with the highlighted differences between their method and NOCURL, which justify the novelty of $\psi$-DAG. Entirely avoiding the use of acyclicity constraints is valuable, and it significantly improves the scalability of the method. To better frame their work in the context of unconstrained acyclic optimization, I would suggest that the authors also comment on papers sharing the same motivation, which are currently missing from the related works, e.g., [1–3, 5]. Furthermore, I agree that the trade-off between performance and scalability is significant when dealing with NOTEARS, and I support your decision to highlight this matter in the main body.

I also thank the authors for their SA vs SAA discussion. If I had to be overly naïve, for most practitioners this would reduce to the choice of an optimization scheme to tackle the MSE problem. While lacking your theoretical emphasis, other works already explored the use of mini-batch optimization in causal structure learning [1, 3–6]. Which mini-batch size are you employing? Would your methodology still hold if solving MSE in a full-batch scheme? What would be the downsides?

Finally, I still have concerns on the empirical evaluation of the method. While I tend to agree that Ng et al. (2024) "suggests that our method would remain effective even when applied to standardized data", it would help to empirically validate this claim even in a small setting.

Overall, I thank the authors for better framing their work, of which I can better appreciate the novelty once clarified that the method is entirely unconstrained. In this case, it would help to better position the paper in the context of other unconstrained approaches. Together with my latest remaining concerns on the empirical evaluation, I am willing to revise my score accordingly to borderline accept.

  1. Annadani, Yashas, et al. "Bayesdag: Gradient-based posterior inference for causal discovery." Advances in Neural Information Processing Systems 36 (2023): 1738-1763.
  2. Charpentier, Bertrand, Simon Kibler, and Stephan Günnemann. "Differentiable DAG Sampling." International Conference on Learning Representations.
  3. Duong, Bao, et al. "Reinforcement Learning for Causal Discovery without Acyclicity Constraints." Transactions on Machine Learning Research.
  4. Lopez, Romain, et al. "Large-scale differentiable causal discovery of factor graphs." Advances in Neural Information Processing Systems 35 (2022): 19290-19303.
  5. Massidda, Riccardo, et al. "Constraint-Free Structure Learning with Smooth Acyclic Orientations." The Twelfth International Conference on Learning Representations.
  6. Nazaret, Achille, et al. "Stable differentiable causal discovery." Proceedings of the 41st International Conference on Machine Learning. 2024.

Comment

We sincerely thank the Reviewer for their comments and for engaging during the rebuttal process. We appreciate the constructive feedback and the references to relevant works, which we will incorporate into our Related Work section in the final version of the paper.

Mini-batch Optimization and SA Motivation

We used a mini-batch size of 64 in all experiments, which offers a balance between computational efficiency and stability. Below, we elaborate on the motivation behind using SA and mini-batch strategies in our framework.

For large-scale problems, where both the sample size $n$ and dimensionality $d$ are high, computing full-batch gradients becomes computationally expensive. Moreover, in existing approaches such as NOTEARS or GOLEM, mini-batch optimization alone is not sufficient, since computing the acyclicity regularizer on each mini-batch is still costly.

In contrast, our $\Psi$-DAG framework is unconstrained and employs a lightweight projection step, enabling scalable and efficient mini-batch optimization. While the method is also compatible with full-batch training, it is typically slower and less effective due to poorer exploration of the non-convex optimization landscape. In practice, SGD-based methods (as used in our mini-batch scheme) help the model escape stationary points and poor local minima, leading to better overall solutions.

Empirical Robustness on Standardized Data

We conducted additional experiments and evaluated the SHD and F1 score on standardized datasets. The results (provided below) confirm that our method remains highly robust and competitive with state-of-the-art methods, even when data is standardized, supporting our theoretical motivation and further strengthening our empirical claims.

SHD (↓)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS             10       18       43        82
GOLEM               16       36       97       204
NOCURL              14       34      105       160
DAGMA               15       24       70       113
Ψ-DAG (ours)         1        4        2        12

F1 Score (↑)

Method          d = 10   d = 25   d = 50   d = 100
NOTEARS            0.3      0.3      0.3       0.4
GOLEM              0.2      0.2      0.1       0.2
NOCURL             0.3      0.2      0.1       0.2
DAGMA              0.1      0.3      0.2       0.4
Ψ-DAG (ours)       0.9      0.9      0.9       0.9

We hope this addresses all reviewer concerns. Once again, we are grateful for the insightful suggestions and engaging discussion throughout the rebuttal.

Comment

I thank the authors for addressing my remaining concerns, I will adjust my evaluation accordingly.

Comment

Dear Reviewer aQDY,

Thank you for your follow-up and for considering our rebuttal carefully. We are pleased to know that your concerns have been addressed, and we truly appreciate your positive feedback.

Final Decision

While all reviewers were on the positive side about this paper, after going over the paper myself I conclude that it is below the bar for NeurIPS. While the empirical study is very impressive and comprehensive, the algorithmic approach itself and the technical derivations, and in particular the main convergence results that follow in a straightforward manner from known results, put this work below the bar for NeurIPS in my opinion.