PaperHub

Directed Graph Generation with Heat Kernels

ICLR 2024 · Decision: Rejected
Overall rating: 5.8/10 from 4 reviewers · Ratings: 6, 5, 6, 6 (min 5, max 6, std 0.4) · Average confidence: 2.8
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

A denoising autoencoder based on a nonhomogeneous heat equation to generate directed graphs

Abstract

Keywords

Directed graphs, Digraphs, Generative models, denoising autoencoders, heat kernel, diffusion kernel, heat diffusion

Reviews & Discussion

Review
Rating: 6

The authors propose a generative model, similar to denoising autoencoders, for directed graph generation. The encoder adds noise based on a heat-equation expression to generate a perturbed representation, which the decoder denoises to reconstruct the desired generated graph via the random walk Laplacian. The authors test their approach in empirical evaluations.

Strengths

  • Originality and Novelty: The approach that the authors propose is, to the best of my knowledge, original and novel.
  • Significance: Studying graphs, and generative models on graphs, is certainly an interesting and important topic nowadays. The topic of directed graph generation has indeed been underappreciated in the literature, so this is a welcome addition.
  • Quality: The technical claims are, to the best of my knowledge, sound and reasonable. Details and questions are given below in the next section.
  • Clarity: The article is written moderately clearly, with ample room for improvement in its exposition. Suggestions are given in the section below.

Weaknesses

  • The main weakness is in the presentation and empirical evaluation:
  1. I suggest that the authors provide more background and motivation on the mathematical prerequisites to make the paper more self-contained.

  2. The decision to set $M$ to be a constant matrix should be further motivated and explained (to people familiar with this, it is a natural choice, but it can be unclear to less initiated readers).

  3. Crucial concepts rely on very recent work, such as the Set Transformer in 2019 and the work of Veerman and Lyons in 2020. This makes the article more difficult to read. I suggest that the authors try their best to make this work more self-contained in the presentation.

  4. The empirical evaluation is limited to very simple models (ER and SBM) under the squared MMD distance for various descriptors, with hyperparameter settings of 3 blocks and $p = 0.6$, seemingly without much justification.

Questions

  1. Is there a concrete reason/justification for why the empirical evaluation is so focused on disconnected digraphs? My impression is that many interesting applications concern connected/strongly connected digraphs. Is it possible that the evaluation metrics (clustering coefficients, etc.) are just capturing the disjointness of the generated graphs, rather than the more fine-grained properties of connectivity within a single connected/strongly connected digraph?
Comment

Q: The decision to set $M$ to be a constant matrix should be further motivated and explained (to people familiar with this, it is a natural choice, but it can be unclear to less initiated readers).

A: Thank you for the suggestion. We have added in Section 2 the motivation for our formulation of $M$, which is to represent a non-informative matrix when the node representation matrix is constrained to be column stochastic. Each column of $M$ is defined as the uniform probability distribution vector, which corresponds to the maximum-entropy probability distribution vector. Each column of $M$ also corresponds to the expected value of a random variable following a flat Dirichlet distribution.
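
To make this concrete, here is a minimal sketch (ours, not the authors' code; $n$ and $d$ are arbitrary illustrative sizes) of the non-informative matrix $M$ described above:

```python
import numpy as np

n, d = 5, 3  # hypothetical number of nodes and number of diffused signals

# Non-informative matrix M: every column is the uniform distribution over
# nodes, the maximum-entropy choice among column-stochastic matrices.
M = np.full((n, d), 1.0 / n)

# Each column sums to 1, and equals the mean of a flat Dirichlet
# distribution (all concentration parameters alpha = 1).
assert np.allclose(M.sum(axis=0), 1.0)
assert np.allclose(M[:, 0], np.ones(n) / n)
```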

Q: Crucial concepts rely on very recent work, such as the Set Transformer in 2019 and the work of Veerman and Lyons in 2020. This makes the article more difficult to read. I suggest that the authors try their best to make this work more self-contained in the presentation.

A: Thank you for this suggestion. We have added a high-level description of Set Transformers and our motivation for using them (i.e. using a model that is invariant to the order of the rows of our node representations) in Section 3.1. The necessary concepts from Veerman and Lyons are given in our submission. We cite their paper only when (1) we define the (negative of the) random walk Laplacian matrix, which is a standard definition in graph theory, and when (2) we state that its matrix exponential is column stochastic, which can be verified by eigendecomposition. This last concept is actually explained in our Appendix F, where we study the eigendecomposition of our heat kernel matrix. We have therefore added a reference to Appendix F for details.
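
For readers who want to verify these two facts numerically, here is a small self-contained sketch (ours, on an arbitrary toy digraph, using the convention $\Delta = S - I$ with $S$ the column-stochastic transition matrix, consistent with the rebuttal's $(\Delta + I) = S$ below):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Arbitrary toy digraph; self-loops keep every column nonzero.
n = 6
A = (rng.random((n, n)) < 0.3).astype(float)
np.fill_diagonal(A, 1.0)

# Column-stochastic transition matrix S and (negative) random walk
# Laplacian Delta = S - I, so that Delta + I = S.
S = A / A.sum(axis=0, keepdims=True)
Delta = S - np.eye(n)

# The heat kernel e^{T Delta} is column stochastic for any T >= 0:
# entries are nonnegative and every column sums to 1.
T = 1.0
H = expm(T * Delta)
assert np.all(H >= -1e-12)
assert np.allclose(H.sum(axis=0), 1.0)
```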

Q: The empirical evaluation is limited to very simple models (ER and SBM) under the squared MMD distance for various descriptors, with hyperparameter settings of 3 blocks and $p = 0.6$, seemingly without much justification. Is there a concrete reason/justification for why the empirical evaluation is so focused on disconnected digraphs? My impression is that many interesting applications concern connected/strongly connected digraphs. Is it possible that the evaluation metrics (clustering coefficients, etc.) are just capturing the disjointness of the generated graphs, rather than the more fine-grained properties of connectivity within a single connected/strongly connected digraph?

A: We would like to emphasize that the blocks in the stochastic block models are disconnected only in the qualitative experiment of Section 5.3, so that the different modes are easy to visualize. In our experiments with quantitative results (i.e. Section 5.2 and Appendix H.2), the different blocks are still connected, but with low transition probability between blocks. The probabilities of transition between blocks are given in Equation (8) and Equation (30). Therefore, our evaluation metrics capture the fine-grained properties of connected digraphs. Moreover, we provide additional experiments with larger graphs in Appendix H.2, with $p = 0.4$ for the Erdos-Renyi category and 5 blocks for the stochastic block models.
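
For illustration, a directed SBM of the connected kind described above can be sampled as follows (a sketch with hypothetical block sizes and probabilities; the actual values used in the paper are those of Equations (8) and (30)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical directed SBM: 3 blocks, dense within blocks, sparse but
# nonzero between blocks so the digraph remains connected.
sizes = [10, 10, 10]
p_in, p_out = 0.6, 0.05  # illustrative values, not those of Eq. (8)/(30)

n = sum(sizes)
block = np.repeat(np.arange(len(sizes)), sizes)
P = np.where(block[:, None] == block[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(int)  # sampled asymmetrically: directed
np.fill_diagonal(A, 0)
```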

Review
Rating: 5

The focus of the manuscript is on the generative process for directed graphs (digraphs). The authors propose an encoder-decoder architecture. Their encoder is based on heat diffusion (defined by the graph Laplacian) and does not require any training. The representation is then perturbed such that it corresponds to a nonhomogeneous process. The denoiser is then trained to reverse the diffusion, i.e. to match the initial condition of the process. They then provide experiments to validate their claims.

Strengths

  • The proposed method is well motivated from a theoretical point of view.
  • Focusing on digraphs is interesting, as most existing methods address undirected graphs.

Weaknesses

  • The overall writing and organization could be improved; in particular, Section 3 lacks continuity.
  • I found the experiments a bit limited; I think a few things are missing:
    • The authors should report the standard deviation in the tables.
    • I think it is important to report other distances than the current MMD with RBF kernel, either with different kernels and/or with different variance parameters.
    • In the tables, it is hard to understand the magnitude of the scores. It would be great to add a row for a random method (e.g. the random adjacency matrix used in line 2 of Alg. 1).
    • The results in Section 5.3 could be in the main body of the text by shortening other sections (e.g. related work).

Questions

  • How do you choose the initial node representation $X(0)$?

  • In Eq. (1), you also need to specify the initial condition at $t = 0$. You could also explain that $X(0)$ has $d$ signals that you diffuse following the heat equation (it might give more intuition).

  • "Finally, some denoising decoder is trained to predict the nodes and/or edges when given only $X(T)$ as input (see details in Section 3.1)."

    I don't fully understand this sentence. Looking at $\mathcal{L}_{node}$, it is trained to reverse the diffusion process. It is not trained to predict the nodes, but rather the initial $d$ signals.

  • Is the decoder conditioned on the noise level (e.g. like in score matching)?

  • Possible typo: "our decoders are learned" → "our decoders are trained"

Comment

Q: The overall writing and organization could be improved; in particular, Section 3 lacks continuity. I found the experiments a bit limited; I think a few things are missing: The authors should report the standard deviation in the tables. I think it is important to report other distances than the current MMD with RBF kernel, either with different kernels and/or with different variance parameters. In the tables, it is hard to understand the magnitude of the scores. It would be great to add a row for a random method (e.g. the random adjacency matrix used in line 2 of Alg. 1). The results in Section 5.3 could be in the main body of the text by shortening other sections (e.g. related work).

A: Thank you for the suggestion. We have shortened the related work section by moving the similarities with diffusion models to the Appendix, and we have rewritten some parts of Section 3 that might be confusing. We have also included different variance parameters ($\sigma^2 = 100, 10, 1$) in Table 2 as requested. As a random method, we have added results when sampling each column of the input of the decoder directly from a flat Dirichlet distribution. We have also tested the sampling method that replaces line 2 of Algorithm 1 by sampling the non-diagonal elements of the adjacency matrix uniformly in the interval [0, 1] (while keeping the diagonal elements equal to 1). We call that approach "continuous DGDK" because the sampling space is continuous, as opposed to our previous sampling approach, which is discrete. The former random method outperforms the baseline but is outperformed by our approach. On the other hand, the latter is competitive with our proposed sampling approach since it also exploits the learned matrix $N$. Due to lack of time, we did not include standard deviations; we intend to do so in the next version.
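
A rough sketch (ours; the sizes, the seed, and the use of the homogeneous kernel $e^{T\Delta}$ are simplifying assumptions, since the paper's encoder also includes a forcing term toward $M$) of the two sampling baselines described above:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n, d, T = 8, 4, 1.0  # hypothetical sizes and diffusion time

# Baseline 1: sample each column of the decoder input directly from a flat
# Dirichlet distribution, ignoring the learned node representation entirely.
Z_dirichlet = rng.dirichlet(np.ones(n), size=d).T  # (n, d), column stochastic

# Baseline 2 ("continuous DGDK"): sample a continuous adjacency matrix with
# off-diagonal entries uniform in [0, 1] and diagonal fixed to 1, then
# diffuse the learned representation through the resulting heat kernel.
A = rng.random((n, n))
np.fill_diagonal(A, 1.0)
S = A / A.sum(axis=0, keepdims=True)     # column-stochastic transition matrix
Delta = S - np.eye(n)                    # random walk Laplacian, as before
N = rng.dirichlet(np.ones(n), size=d).T  # random stand-in for the learned N
Z_continuous = expm(T * Delta) @ N
```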

Q: How do you choose the initial node representation?

A: The initial node representation matrix $N = X(0)$ is learned. Following Reviewer hJPx's suggestion, we have specified in Section 2 and Section 3.1 that the matrix $N$ is learned jointly with our decoders, and we have added implementation details on how we train it in Appendix B. In short, $N$ is a matrix that is made column stochastic by applying a column-wise softmax operator. It is optimized via standard gradient descent.
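
A minimal sketch (ours, in PyTorch, with a dummy objective just to exercise the update; the real training uses the paper's reconstruction losses) of this column-stochastic parametrization:

```python
import torch

n, d = 8, 4  # hypothetical sizes

# Unconstrained parameters; a column-wise softmax yields a column-stochastic N.
theta = torch.randn(n, d, requires_grad=True)

def node_representation() -> torch.Tensor:
    return torch.softmax(theta, dim=0)  # each column sums to 1

opt = torch.optim.SGD([theta], lr=0.1)
target = torch.full((n, d), 1.0 / n)  # dummy target, standing in for the real loss

loss = ((node_representation() - target) ** 2).mean()
loss.backward()
opt.step()  # theta, and hence N, is updated by plain gradient descent
```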

Q: In Eq. (1), you also need to specify the initial condition at $t = 0$. You could also explain that $X(0)$ has $d$ signals that you diffuse following the heat equation (it might give more intuition).

A: Thank you for the suggestion. We have included that intuition after Eq. (2). All the necessary conditions/constraints are included in Eq. (1) so that Eq. (2) is its solution (see e.g. Veerman and Lyons).

Q: I don't fully understand this sentence. Looking at $\mathcal{L}_{node}$, it is trained to reverse the diffusion process. It is not trained to predict the nodes, but rather the initial $d$ signals.

A: Your understanding is correct. Our decoders are trained to reconstruct the initial representation that does not contain noise. We have replaced the word "predict" with "reconstruct".

Q: Is the decoder conditioned on the noise level (e.g. like in score matching)?

A: The decoder is in a sense conditioned on the "noise level", since we formulate (1) our encoder so that its limit tends to $M$ when $T$ tends to infinity, and (2) each column of $M$ to correspond to the mean/expected value of a random variable following the flat Dirichlet distribution (with all the parameters $\alpha$ equal to 1). Therefore, the input of our decoder is "close" to the expected value of the flat Dirichlet distribution. However, this is different from the noise level conditioning in regular diffusion models trained with denoising score matching at various noise levels. In those cases, the decoder, or denoiser, is trained at multiple different noise levels simultaneously. In our case, the decoder is only trained at one, "the final", noise level. Since our method is an efficient one-shot generation technique, the decoder never needs to be trained or evaluated at "intermediate" noise levels, as in diffusion models. Hence, an explicit noise level conditioning as in diffusion models is not necessary for our method.

Review
Rating: 6

The paper describes a generative approach for directed graphs. It loosely follows the idea of denoising autoencoders: a trained input function in $\mathbb{R}^d$ over nodes is corrupted through a heat diffusion process so as to produce an almost constant function over the graph nodes. This function is then given as input to an encoder that is tasked to project the node representation into a latent space, and a decoder that, given two node embeddings, predicts the presence of an edge.

Strengths

This paper deals with an interesting problem, that of directed graph generation, which seems to be partially neglected by the main body of literature in graph generation but is nevertheless relevant. To the best of my knowledge, the proposed methodology is original. It adapts some ideas from denoising autoencoders and the more recent denoising diffusion to the domain of directed graphs, building a principled and sound method.

Weaknesses

One of the drawbacks of the paper is that the mathematical notation is a bit intricate. Many quantities are redefined during the method description, and it is difficult to keep track of all of them. Some equations are also defined, but I missed if or where they were used, e.g. Eq. 6 or the node loss (where is the model $\phi$ used in the sampling process?). Personally, until Section 3, I was imagining some sort of denoising diffusion technique (especially after Eqs. 5 and 6). It took me a while to change my “expectations” about the following sections.

The other weakness is about the comparisons. Even if there aren't many works dealing with directed graphs, it is still worth providing a solid testbench that could also be used by future works that want to compare with the proposed method. Aren't there more datasets that could be considered, or comparative methods that could be adapted?

Questions

  • The diffusion time is set to 1 in the experiments, but is it a dataset-dependent parameter? I guess that it might be somehow dependent on the graph radius?

  • It took me a while to figure out what kind of node function $N$ had to be. Maybe making it clear from the beginning that it is learned could help the reading.

  • Still regarding $N$, permutation invariance is not an easy thing to learn. How much of a problem could it be for training convergence?

  • Your method consists of diffusing an initial random graph. How important is the starting graph family? Since your noise converges to a constant function, would it be possible to sample $M$ directly?

Comment

Hi, it seems there is a mistake and the review is about another paper. The submitted paper is not about a differentiable graph descriptor based on Euler Characteristics Curves, but about the generation of directed graphs.

Comment

You're right. Now it should be fine.

Comment

Q: The diffusion time is set to 1 in the experiments, but is it a dataset-dependent parameter? I guess that it might be somehow dependent on the graph radius?

A: Yes, both the diffusion time $T > 0$ and the noise diffusivity rate $\alpha > 0$ are hyperparameters that depend on the dataset, can be tuned, and have an impact on the eigenvalues of the heat kernel matrix, as discussed in Appendix F. We perform an ablation study of the parameter $\alpha$ in Appendix H.4, and we show that careful tuning of $\alpha$ is required in order to generate digraphs that are similar to those in the training distribution. A similar ablation study could be performed for $T$.

We agree that the optimal value of $T$ is related to the radius of the digraph. The intuition is similar to the difference between ODE solvers and their approximation via the Euler method. Our method solves a linear differential equation (the heat equation) in closed form by using the matrix exponential $e^{t\Delta}$. An alternative solver could use the forward Euler method and iteratively apply $t$ times the matrix $(\Delta + I) = S$ instead, which would correspond to performing $t$ steps of message passing.
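
A small numerical sketch (ours, on an arbitrary toy digraph, homogeneous heat equation only) of this analogy between the closed-form heat kernel and Euler-style message passing:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)

# Toy digraph and its random walk Laplacian (same construction as above).
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)
S = A / A.sum(axis=0, keepdims=True)
Delta = S - np.eye(n)

N = rng.dirichlet(np.ones(n), size=3).T  # stand-in initial representation

# Closed-form solution of the heat equation at time t.
t = 4
closed_form = expm(t * Delta) @ N

# Forward Euler with unit step size: X <- (Delta + I) X = S X,
# i.e. t rounds of message passing with the transition matrix.
X = N.copy()
for _ in range(t):
    X = S @ X

print(np.abs(closed_form - X).max())  # nonzero gap: Euler only approximates
```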

Q: It took me a while to figure out what a kind of node function N had to be. Maybe making it clear from the beginning that it is learned could help the reading.

A: Thank you for this suggestion; we have included in Section 2 and Section 3.1 that the matrix $N$ is learned jointly with our decoders, and we have added experimental details on how we train it in Appendix B.

Q: Still regarding $N$, permutation invariance is not an easy thing to learn. How much of a problem could it be for training convergence?

A: In practice, this is not a problem. One reason is that the matrix $N = X(0)$ is learned. As explained in our revised Section 3.2, our way of dealing with permutations can be seen as a data augmentation scheme that adds more adjacency matrices (of isomorphic digraphs) to the training set by considering different permutation matrices $P$, and formulates noisy node representations that can be written as a function of $P^\top e^{T\Delta} P N$ instead of only $e^{T\Delta} N$ (i.e., only $P = I$). The matrix $N$ is jointly learned with the edge and node decoders, which are robust to this kind of transformation. We report the training curves of our optimization scheme with and without this data augmentation technique in Figure 6. One can see that it does not have an impact on the training loss optimization. Moreover, it obtains slightly better performance in practice.
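
A minimal sketch (ours; random stand-ins for the learned quantities) of the augmented representation $P^\top e^{T\Delta} P N$ described above:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n, d, T = 6, 3, 1.0  # hypothetical sizes and diffusion time

# Toy heat kernel and node representation (same construction as above).
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)
S = A / A.sum(axis=0, keepdims=True)
Delta = S - np.eye(n)
N = rng.dirichlet(np.ones(n), size=d).T  # stand-in for the learned matrix N

# Random permutation matrix P; P = I recovers the unaugmented case.
P = np.eye(n)[rng.permutation(n)]

# Augmented noisy node representation P^T e^{T Delta} P N.
Z_aug = P.T @ expm(T * Delta) @ P @ N
```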

Q: Your method consists of diffusing an initial random graph. How important is the starting graph family? Since your noise converges to a constant function, would it be possible to sample $M$ directly?

A: We agree that it is possible to generate the input of the decoder by sampling each of its columns from a flat Dirichlet distribution. In practice, when we tried it, many generated graphs were not as sparse as in the training set. We have included quantitative results when using that sampling strategy in the revised version of Table 2 (called "Sampling from flat Dirichlet distribution"); its performance is worse than with our proposed approach, but better than the baseline GRAN. This means that it is a viable strategy.

We have also tried the sampling strategy proposed by Reviewer APd2, where instead of sampling discrete adjacency matrices with elements equal to 0 or 1, the sampled adjacency matrices have their non-diagonal elements sampled uniformly in the continuous interval [0, 1], and their diagonal elements set to 1. In this case, we obtain results that are competitive with our proposed sampling strategy. Unlike the strategy that samples directly from the flat Dirichlet distribution, this strategy exploits the learned initial node representation matrix $N$ and the matrix exponential $e^{T\Delta}$ to construct the input of the decoder. This competitive performance suggests that the starting graph family is not that important to obtain good results.

Comment

I thank the authors for answering my questions. I'll go through the revised paper in the next few days.

Review
Rating: 6

The paper presents a novel approach to create a one-shot generative model for directed graphs. The approach represents graphs through their heat diffusion with a forcing term $Q(t)$ that forces the limit distribution to be uniform over the nodes. To generate a graph, they train an edge decoder that, given a noisy representation of the heat diffusion, predicts the presence of an edge. With the decoder to hand, they generate an Erdos-Renyi random graph, add some Bernoulli noise to the adjacency matrix, then obtain the directed Laplacian of the result, compute the heat kernel under the mentioned forcing term, and feed it to the edge decoder.

Strengths

  • Novel approach to cast the one-shot generation of graphs
  • Can handle directed as well as undirected graphs

Weaknesses

  • The experimental evaluation is a bit substandard given the relatively large recent literature on the topic. The authors should at least match SPECTRE (cited) for the evaluation protocol.

Questions

While it is clear that the link predictor tries to match what it has seen in the training set, it is not clear how their approach changes the Erdos-Renyi distribution to one more similar to the training set.

Comment

Q: The experimental evaluation is a bit substandard given the relatively large recent literature on the topic. The authors should at least match SPECTRE (cited) for the evaluation protocol.

A: We agree that there is a large literature about graph generation. However, the literature focuses mostly on undirected graphs, whereas our paper focuses on directed graphs. All our training graphs are directed, including the Erdos-Renyi and stochastic block model ones. We explain in Section 4 why Spectre cannot be applied to directed graphs: their method assumes that the Laplacian matrix lies on a Stiefel manifold, which is not the case when the graph is directed (i.e. the adjacency matrix is not symmetric).

Q: "While it is clear that the link predictor tries to match what it has seen in the training set, it is not clear how their approach changes the Erdos-Renyi distribution to one more similar to the training set."

A: About the evaluation on the Erdos-Renyi category specifically, we use a probability of $p = 0.6$ in Table 1 and $p = 0.4$ in Table 2 to generate the training digraphs in this category. In those tables, we report the evaluation metric based on the in-degree MMD between the training and generated sets. Even in the directed case, the Erdos-Renyi category is characterized by the overall in-degrees of nodes. Our approach results in a low MMD for that metric, which means that the in-degrees of training and generated graphs are similar.
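
For concreteness, a sketch (ours; conventions such as in-degrees as column sums, histogram descriptors, and the biased estimator are our simplifying choices) of the squared MMD with an RBF kernel on in-degree descriptors:

```python
import numpy as np

def squared_mmd_rbf(X, Y, sigma2=1.0):
    """Biased estimate of the squared MMD with an RBF kernel between two
    sets of descriptor vectors (one row per graph)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(5)

def in_degree_histogram(A, bins=10):
    deg = A.sum(axis=0)  # in-degrees as column sums (our convention)
    h, _ = np.histogram(deg, bins=bins, range=(0, A.shape[0]))
    return h / h.sum()

def er_digraph(n=30, p=0.6):
    A = (rng.random((n, n)) < p).astype(float)
    np.fill_diagonal(A, 0)
    return A

# Two sets of directed ER graphs drawn with the same edge probability
# should yield a small squared MMD between their in-degree histograms.
X = np.stack([in_degree_histogram(er_digraph()) for _ in range(20)])
Y = np.stack([in_degree_histogram(er_digraph()) for _ in range(20)])
print(squared_mmd_rbf(X, Y, sigma2=1.0))
```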

Comment

We thank the reviewers for their valuable feedback. We were happy to see that the paper was generally well received. Reviewers acknowledged that the paper targets "an interesting problem ... which seems to be partially neglected by the main body of literature" (hJPx, APd2, E6dr). Reviewers also emphasized the originality and novelty of the approach (t7Sc, hJPx, E6dr), said that the approach is "well motivated from a theoretical point of view" (APd2), and stated that the suggested method is a "principled and sound method" (hJPx, E6dr).

One common concern of reviewers was readability. We have made efforts to improve the readability of the manuscript and modified it by taking into account the reviewers' suggestions. Our changes are in blue. In particular, we have included the motivation for the formulation of our non-informative matrix $M$ in Section 2. We also specify that our initial node representation matrix $N = X(0)$ is a learned variable in Sections 2 and 3.1, and we give the details of its training in Appendix B. We have also tried to make the paper more self-contained in general. We are open to other suggestions.

Another concern is about the experiments. Following the suggestions of Reviewers hJPx and APd2, we have included comparisons to two other sampling techniques in Table 2. The strategy that directly samples from a Dirichlet distribution obtains decent results, but worse than our sampling strategy. On the other hand, the one that samples adjacency matrices from a continuous space but exploits our learned node representation matrix achieves competitive performance. We have also added in Figure 6 a comparison of the training loss when we apply a data augmentation technique that adds permuted training adjacency matrices to make the model robust to permutations (see revised Section 3.2). We would like to emphasize that we report in Appendix H experiments with larger graphs of different sizes (from 180 to 200 nodes). Moreover, all our training graphs are directed, which makes baselines such as Spectre not directly applicable. Our stochastic block models are also connected in all our experiments except Section 5.3, for visualization purposes. The probabilities of transition between blocks are specified in Equation (8) and Equation (30).

AC Meta-Review

The paper proposes a denoising autoencoder-based generative model, using the heat equation with the graph Laplacian, for one-shot generation of directed graphs. The formulation is accompanied by numerical experiments.

Reviewers generally agree that the proposed method is novel and interesting. However, I share the following concerns:

  • The presentation needs to be substantially improved (Reviewers hJPx, E6dr, APd2). The mathematical description needs to be much better structured. For example, the one theoretical result, namely Proposition 1, is not stated as a mathematical theorem (it's not clear what its conclusion is), and there is no proof for it.

  • The empirical validation needs to be improved (Reviewer E6dr), in particular use of the MMD should be justified.

Why Not a Higher Score

The presentation needs to be substantially improved for the paper to become publishable.

Why Not a Lower Score

N/A

Final Decision

Reject