PaperHub
NeurIPS 2024 (Poster) — 4 reviewers
Average rating: 5.3/10 (min 5, max 6, std 0.4) — individual ratings: 6, 5, 5, 5
Confidence: 2.8 · Soundness: 3.3 · Contribution: 2.5 · Presentation: 2.8

Bayesian Optimization of Functions over Node Subsets in Graphs

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

This paper extends Bayesian optimization to a novel graph setting, where the search space consists of node subsets in a generic graph.

Abstract

We address the problem of optimizing over functions defined on node subsets in a graph. The optimization of such functions is often a non-trivial task given their combinatorial, black-box and expensive-to-evaluate nature. Although various algorithms have been introduced in the literature, most are either task-specific or computationally inefficient and only utilize information about the graph structure without considering the characteristics of the function. To address these limitations, we utilize Bayesian Optimization (BO), a sample-efficient black-box solver, and propose a novel framework for combinatorial optimization on graphs. More specifically, we map each $k$-node subset in the original graph to a node in a new combinatorial graph and adopt a local modeling approach to efficiently traverse the latter graph by progressively sampling its subgraphs using a recursive algorithm. Extensive experiments under both synthetic and real-world setups demonstrate the effectiveness of the proposed BO framework on various types of graphs and optimization tasks, where its behavior is analyzed in detail with ablation studies.
Keywords
Bayesian Optimization · Graphs

Reviews and Discussion

Review (Rating: 6)

This paper addresses the challenge of optimizing functions defined on node subsets in a graph. These functions are combinatorial, black-box, and expensive to evaluate. The proposed solution utilizes Bayesian Optimization (BO), mapping each k-node subset to a node in a new combinatorial graph, and traversing this graph efficiently through a recursive algorithm. Extensive experiments demonstrate the effectiveness of this approach on various graphs and optimization tasks.

Strengths

The new framework proposed in this paper is innovative. It is function-agnostic and can be applied to various optimization problems, thus offering a wide range of applications. The experimental results also demonstrate its potential. The presentation of the paper is clear and comprehensible.

Weaknesses

The domain of the optimization problem is the set of all k-tuples of vertices on a graph. However, it seems the structure (in particular, the edges) of the graph is not mentioned. Is the graph structure not essential in this context?

Questions

I have to say I am not very familiar with the field of black-box combinatorial optimization, and I am willing to let the Area Chair weigh the opinions of the other reviewers more heavily. However, I am personally curious how such approaches achieve provably good performance when we have no assumptions on the function $f$. This seems likely to run into the no-free-lunch theorem.

Limitations

The performance scales poorly as $k$ increases.

The choice of hyperparameter is ad-hoc.

Author Response

We thank the reviewer for acknowledging the novelty and potential of our work! Please see our responses to the raised questions below.

How do such approaches achieve provably good performance when we have no assumptions on the function $f$? It is likely to fall into the no-free-lunch theorem.

We’d like to respond from the general Bayesian Optimization (BO) perspective.

First, BO is most suitable for optimizing functions that are black-box and expensive to evaluate. If we have access to the gradient of the function, or the function is easy to evaluate, then there are more appropriate alternatives such as gradient-based optimizers or Reinforcement Learning (RL) methods. However, if the function is a black box and its values are costly to observe, then sample efficiency becomes an essential criterion for the optimizer; that is, we wish to obtain the best possible location within a limited number of queries. Note that RL usually requires a large number of samples to train the reward/policy models, which is infeasible under this setting.

The performance gain in BO mainly comes from the surrogate modeling approach, where a tractable model (e.g. Gaussian Processes) is used to gradually approximate the underlying function based on the observations. At each iteration, the surrogate model is fitted to previous observations and generates predictions on the rest of the search space, which allows us to identify the most promising location based on past knowledge and then query its function value. In the next iteration, this new observation is used to re-fit the surrogate, and we repeat the process until hitting the evaluation budget.
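The loop described above can be sketched as follows. This is a generic illustration only, with a simple RBF-kernel GP and a UCB acquisition over a 1-D discrete search space; the paper's actual surrogate is a graph GP with a Laplacian-based kernel.

```python
import numpy as np

def rbf_kernel(a, b, ls=1.0):
    # Squared-exponential kernel between two 1-D point arrays.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def bo_loop(f, candidates, n_init=3, n_iter=10, noise=1e-6, seed=0):
    """Minimal BO loop over a discrete candidate set: fit a GP to past
    observations, score unvisited points with a UCB acquisition, query
    the argmax, and repeat until the budget is exhausted."""
    cand = np.asarray(candidates, dtype=float)
    rng = np.random.default_rng(seed)
    X = list(rng.choice(cand, size=n_init, replace=False))
    y = [f(x) for x in X]
    for _ in range(n_iter):
        Xa, ya = np.array(X), np.array(y)
        K = rbf_kernel(Xa, Xa) + noise * np.eye(len(Xa))
        Ks = rbf_kernel(cand, Xa)
        mu = Ks @ np.linalg.solve(K, ya)                      # posterior mean
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        ucb = mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))     # acquisition
        ucb[np.isin(cand, Xa)] = -np.inf                      # unvisited only
        x_next = cand[int(np.argmax(ucb))]
        X.append(x_next)
        y.append(f(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]
```

The acquisition trades off posterior mean (exploitation) against posterior uncertainty (exploration), which is exactly the "guided by past knowledge" behavior described above.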

This means that, even though we have no prior knowledge about $f$ at the beginning, the algorithm adaptively learns the pattern in $f$ along the search, with the next move always guided by past knowledge, which explains the superior performance of BO over other random heuristic methods.

It seems the structure (in particular, the edges) of the graph is not mentioned. Is the graph structure not essential in this context?

The underlying graph structure is very critical to the problem from the following perspectives.

  1. Unlike classical combinatorial optimization over discrete space (e.g. searching for the optimal hyper-parameter combination), the configurations (i.e. node subsets) in our setting are not independent but linked via an underlying graph structure, which needs to be considered during optimization. This is why a graph GP is adopted as the surrogate model, where the similarity between configurations is measured by a kernel on graph (Equation 2), as discussed in Section 3.2.

  2. Specifically, the above kernel is defined as a function of the graph Laplacian matrix $L$, computed as $L = D - A$, where $D$ is the degree matrix and $A$ is the adjacency matrix. This is how the edges are incorporated into modeling.
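As an illustration of how the edges enter the surrogate, here is a toy Laplacian-based diffusion kernel on a small graph. This is one standard choice for a kernel on graph nodes; the paper's Equation 2 defines its own kernel as a (different) function of $L$.

```python
import numpy as np

# Toy 4-node path graph: adjacency, degree, and Laplacian L = D - A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

# Diffusion-style kernel K = expm(-beta * L), built from the
# eigendecomposition of L; entries decay with graph distance.
beta = 0.5
evals, evecs = np.linalg.eigh(L)
K = evecs @ np.diag(np.exp(-beta * evals)) @ evecs.T
```

Because $L$ is built from $A$, any change to the edge set changes the kernel, and hence the similarity the GP assigns between configurations.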

  3. If the graph is sparse (e.g. a grid is sparser than a fully connected graph), the combinatorial graph will also be sparse. Then, with a fixed-size combo-subgraph of $Q$ combo-nodes, we can cover more hops (i.e. more distant nodes) from the center combo-node at each iteration. Thus, the algorithm will traverse the combo-graph (i.e. substitute the elements in the subset) at a faster pace, as discussed in Lemma 3.2.

  4. If the underlying graph structure is informative (e.g. a BA network is more distinctively structured, with a heterogeneous node degree distribution, as opposed to a WS network, whose node degrees are more homogeneously distributed), the combinatorial graph will also be informative. In this case, the kernel in the graph GP surrogate will capture more structural information, which eventually helps the algorithm visit better locations during optimization, as explained in Sections 3.3 and 4.

The performance scales poorly as $k$ increases.

We are glad that the reviewer noticed our discussion of the limitation regarding subset size $k$, and we would like to offer some further clarification below.

In the literature on subset selection on graphs, setting $k < 50$ (under 1% of the network) is a very common paradigm [1,2,3], since many problems have been proven to be NP-hard due to the combinatorial explosion of the search space $\binom{N}{k}$. As such, when the subset size $k$ increases, it is a general challenge in the literature that the performance gap between heuristic optimization algorithms and random baselines diminishes [4]. On top of this, the problem becomes even more challenging under our setting, since the underlying function is fully black-box and we assume no prior information about the graph structure.

Having said that, the proposed method still generally outperforms the other baselines across all experiments, and in the least favorable case, it performs comparably to the local search, which is also a novel baseline introduced in our paper since it needs to operate on the proposed combo-graph.

The choice of hyperparameter is ad-hoc.

We have conducted an ablation study in Appendix I to analyze the sensitivity of GraphComBO to the combo-subgraph size $Q$ and the non-improvement tolerance `failtol`, where the results show that the proposed method is quite robust to the latter. However, we do agree with the reviewer that the algorithm would benefit from an automatic strategy for determining the combo-subgraph size (e.g. an adaptive design of $Q$), which we will leave to future exploration.

We’d like to thank the reviewer again for the comments that helped us improve the current work. We believe that we have now addressed all the concerns from the original reviews, and hope the reviewer could kindly consider increasing the rating/confidence in light of this.

[1] Maximizing the Spread of Influence through a Social Network, KDD 2003

[2] Cost-effective Outbreak Detection in Networks, KDD 2007

[3] Robust Influence Maximization, KDD 2016

[4] Influence Maximization on Social Graphs: A Survey, TKDE 2018

Comment

Thanks for your detailed explanation! I do not have further questions and decide to maintain my positive rating.

Comment

We thank the reviewer again for engaging in the rebuttal and for the time and effort in helping improve our manuscript.

Review (Rating: 5)

The present paper proposes an approach based on Bayesian optimization for optimizing a function over subsets of nodes of size $k$ in a graph. At each step of the algorithm, a local neighborhood is constructed and the best node with respect to an acquisition function is chosen as the center for the next step. The approach is evaluated against several other classical techniques on a variety of benchmarks, where it is competitive and often outperforms the other approaches.

Strengths

I am unfamiliar with the topic and the related literature, so take my comments with a grain of salt. If applying Bayesian optimization to the problem is indeed novel, then I think it is an interesting contribution for the ML community, even if none of the ideas appears to be very involved in itself.

The strongest selling point of the paper to me is the empirical results, where the performance is consistent and tends to outperform the baselines.

Weaknesses

It is not very surprising that an approach that makes informed decisions to guide the search outperforms simple baselines like BFS or local search, somewhat weakening the significance of the results to me.

Questions

Are there other common baselines in the literature that are not based on Bayesian optimization? If one were to compare the present approach against a task-specific approach, do you have intuition on whether the present approach would perform significantly worse?

Other:

  • ego-subgraph is not defined

Limitations

Some future directions are discussed.

Author Response

We would like to thank the reviewer for acknowledging the contribution of the proposed framework. Please see our response to the review comments below.

If applying Bayesian optimization to the problem is indeed novel, then I think it is an interesting contribution for the ML community, even if none of the ideas appears to be very involved in itself.

It is not very surprising that an approach that makes informed decisions to guide the search outperforms simple baselines like BFS or local search, somewhat weakening the significance of the results to me.

As discussed in the introduction, there are many optimization problems on graphs that involve expensive black-box functions defined on node subsets, e.g. outcomes from real-world implementations, results of large numbers of simulations, and outputs from graph neural networks; at the same time, there are many applications where the underlying graph is not known a priori, e.g. offline social networks and contact networks. However, there is no general framework in the literature to optimize such black-box functions of node subsets in graphs, except for the classical graph traversal algorithms such as BFS/DFS and local search.

While we agree with the reviewer that it is very intuitive and “expected” that BO will outperform these classical algorithms, the main contribution of our work is, as pointed out by the reviewer, extending BO to this novel setup via an effective approach and obtaining good results over the existing baselines. To design such a framework, we have made the following technical contributions:

  1. We proposed a combinatorial graph tailored for node subsets in a generic graph.

  2. We introduced a recursive algorithm to sample subgraphs from the combinatorial space.

  3. To handle the combinatorial explosion problem, we use a local approach to traverse the combinatorial graph space by iteratively sampling its subgraphs and moving around the subgraph center to more promising regions, which is guided by a surrogate model.

In the early stage, we tried other localization methods such as selecting nodes inside a subgraph on the original graph as the subset. However, since we need to maintain the diversity of the subset to cover potentially distant nodes, none of them can fit into the problem as cohesively as the current method, which makes our design non-trivial.

Are there other common baselines in the literature that are not based on Bayesian optimization?

As discussed above, to the best of our knowledge, the only existing baselines are those based on graph traversal algorithms, which have been modified to the current problem setting. In particular, we have proposed $k$-random walk and $k$-local search (Appendix B) as baselines that operate on the original graph, which have shown strong results in some of our experiments. We are more than happy to discuss with the reviewer if they believe there are works in the literature that tackle the same problem setups as ours.

In addition, we would like to mention that a new experiment has been added to the general response, in which we compare our method to COMBO (nevertheless, a BO baseline) on small-scale networks. The result (Fig. 4 rebuttal PDF) shows the proposed framework consistently outperforms this baseline, and we kindly refer the readers to the general response for more details if they find it relevant.

If one were to compare the present approach against a task-specific approach, do you have intuition on whether the present approach would perform significantly worse?

We thank the reviewer for this high-level question on the suitability of our framework to specific problems, to which we’d like to respond from the general Bayesian Optimization (BO) perspective in the following.

First, BO is most suitable for optimizing functions that are black-box and expensive to evaluate. If we have access to the gradient of the function, or the function is easy to evaluate, then there are more appropriate alternatives such as gradient-based optimizers or Reinforcement Learning (RL) methods, and we believe BO, as a black-box solver, would not be an appealing approach in those settings.

However, if the function is a black box and its values are costly to observe, then sample efficiency becomes an essential criterion for the optimizer; that is, we wish to obtain the best possible location within a limited number of queries. Note that RL usually requires a large number of samples to train the reward and policy models, which is infeasible under this setting.

Based on the above discussion, if our goal is to optimize certain expensive black-box functions, where some “good guesses” can be obtained from domain knowledge or a task-specific approach, we can always incorporate that information into BO by either initializing the search at the promising locations identified by the task-specific method, or updating the prior in the surrogate with the domain knowledge. This implies that, in most cases, BO serves as a complement to the domain-specific methods rather than a complete substitute.

Other: ego-subgraph is not defined

We thank the reviewer for this reminder. In Lemma 3.2, an $\ell$-hop ego-subgraph centered at an arbitrary combo-node $\hat{v}$ on the combo-graph means that we create an ego-network around $\hat{v}$ with its $\ell$-hop neighbors, and the resulting ego-network is an ego-subgraph. We will make sure this concept is clear in the revised version.
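For illustration, extracting the node set of an $\ell$-hop ego-subgraph can be sketched with a plain breadth-first search; this is a hypothetical helper for clarity, not the paper's implementation.

```python
from collections import deque

def ego_subgraph(adj, center, hops):
    """Return the node set of the `hops`-hop ego-subgraph around `center`:
    BFS over an adjacency dict, keeping nodes within `hops` steps."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == hops:          # frontier reached the hop limit
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Path graph 0-1-2-3-4: the 1-hop ego-subgraph around node 2 is {1, 2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```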

We’d like to thank the reviewer again for the comments that helped us reflect on our work. We believe that we have now addressed the concerns from the original reviews, and hope the reviewer could kindly consider increasing the rating/confidence in light of this.

Comment

Thank you for very detailed responses.

After reading the other reviews and the rebuttals, it seems to me that the proposed approach in this setting is indeed novel. Therefore, I admit that it may be unfair to criticize the work for using "only" poorly performing classical algorithms as the baselines.

Because of this, I am more inclined to propose accepting the paper, but my lack of background on this topic makes it hard to state whether the contributions are interesting enough for the community. For this reason, I've increased my confidence for the positive score.

Comment

We thank the reviewer again for their time and efforts in the review that helped us improve our work, and we are always happy to answer any further questions.

Review (Rating: 5)

This paper introduces a framework for optimizing functions of subsets of nodes in a graph with Bayesian Optimization (BO). The framework need not know the graph structure beforehand, but does require knowledge of the cardinality $1 \leq k \leq N$ of the considered node subsets for a graph $\mathcal{G} = \{V, E\}$ where $|V| = N$. The framework works by iteratively building another graph, termed the "combo-graph", where each "combo-node" is one of the $\binom{N}{k}$ subsets of $k$ nodes. Two combo-nodes $\tilde{v}_i = \{v^{(i)}_1, \ldots, v^{(i)}_k\}$ and $\tilde{v}_j = \{v^{(j)}_1, \ldots, v^{(j)}_k\}$ are linked if $(\tilde{v}_i \cup \tilde{v}_j) \setminus (\tilde{v}_i \cap \tilde{v}_j) = \{v_l, v_m\}$ and $(v_l, v_m) \in E$. BO is then conducted on this partial combo-graph (using graph kernels) to select a promising combo-node at which to sample the objective function.
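The adjacency rule in this summary can be sketched for a toy graph. This brute-force construction is for illustration only; the paper's algorithm samples combo-subgraphs recursively rather than materializing the full combo-graph, precisely because $\binom{N}{k}$ explodes.

```python
from itertools import combinations

def combo_graph(n_nodes, edges, k):
    """Build the combo-graph per the summary above: combo-nodes are the
    k-subsets of V; two combo-nodes are adjacent iff they differ in exactly
    one element and the two differing vertices are adjacent in the graph."""
    edge_set = {frozenset(e) for e in edges}
    combo_nodes = [frozenset(c) for c in combinations(range(n_nodes), k)]
    combo_edges = []
    for a, b in combinations(combo_nodes, 2):
        diff = a ^ b  # symmetric difference of the two subsets
        if len(diff) == 2 and diff in edge_set:
            combo_edges.append((a, b))
    return combo_nodes, combo_edges

# Triangle graph on 3 nodes with k = 2: each pair of 2-subsets differs in
# two vertices that are themselves adjacent, so all three pairs are linked.
nodes, cedges = combo_graph(3, [(0, 1), (1, 2), (0, 2)], 2)
```

Note how sparsity propagates: on a path graph 0-1-2 with $k = 2$, the pair $\{0,1\}$ and $\{1,2\}$ is not linked because the differing vertices 0 and 2 share no edge.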

Strengths

In spite of suffering from the drawbacks of local optimization, the presented framework is useful to tackle the combinatorial complexity involved in the optimization of functions over discrete structures. It opens an interesting avenue for extending BO to these problems.

Overall, the article is clear and well-written.

Weaknesses

Noisy Setting. Both the preliminaries (Section 2) and the described BO on Combo-Graphs (Section 3.3) seem to suggest that the framework has access to true function values. However, in Appendix B, it seems that many experiments do take place in a noisy setting, because the objective function is explicitly defined with an additional noise term (e.g., Ackley), or because it results from an average of Monte-Carlo samples (e.g., Influence Maximization, Flattening the Curve). As a result, it is unclear to me whether the described algorithm is fit for noisy problems. The notations $y_t$ and $f(v_t)$ do not help.

Comparison with COMBO. I understand that COMBO does not address the same problem as the algorithm described in the paper. Nevertheless, given the lack of state-of-the-art baselines for this problem, it would be interesting to study how the algorithm compares to COMBO on problems involving small-enough graphs.

Questions

Here are some questions to spark the discussion with the authors.

(1) Do you account for noise in your framework?

(1.1) If not, how do you expect your algorithm to react to noisy function values? I believe its behavior would be quite different, especially because the location of the combo-graph is selected according to the observed function value, which is no longer accurate.

(1.2) If so, why does the framework apply the acquisition function to unvisited combo-nodes only? Why change the location of the combo-graph based on the observed function value instead of a more refined / optimistic estimate (e.g., the posterior mean of the surrogate model or the acquisition function value)?

(1.3) If experiments are conducted in a noisy setting, can you provide an estimate of the noise level for each experiment as a proportion of the signal variance?

(2) Do you have any insights on how the proposed algorithm compares to COMBO on experiments involving small graphs?

Limitations

I believe the authors have addressed the limitations of their work, and I am looking forward to the discussion period for more details on the noisy setting.

Author Response

We would like to thank the reviewer for the detailed feedback which helped us improve the current work. In summary, the reviewer's concerns are (1) If the framework can handle noisy observations, and (2) Comparison with COMBO on small networks. Please see below for our responses.

Do you account for noise in your framework? It is unclear to me whether the described algorithm is fit for noisy problems.

Yes, similar to other common BO methods, the proposed framework can handle noise. As pointed out by the reviewer, we used the Ackley function on a 2D grid with Gaussian noise as the underlying function in one of the synthetic experiments, meanwhile under real-world setups, we also used noisy results from simulations as the underlying functions.

For the notations, we intended to express the most standard setup in BO, that is, $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2_{\epsilon})$ being the noise term. We will make sure this is clear in the revised version.

A more comprehensive experiment under the noisy setting

To better answer the other questions from the reviewer and show our framework's capability of handling noise, we further conduct a noisy experiment at different noise levels on BA ($|V| = 10\text{k}$) and WS ($|V| = 1\text{k}$) networks with $k = 8$, where the goal is to maximize the average PageRank within a node subset, i.e., $f(\mathcal{S}) = \frac{1}{k}\sum_{i=1}^k PageRank(\mathcal{S}_i)$, with $\mathcal{S}$ being a subset of $k$ nodes $\{v_1, v_2, \ldots, v_k\}$ in the underlying graph.

Can you provide an estimate of the noise level as a proportion of the signal variance?

While it is difficult to estimate the noise level relative to the signal variance in real-world experiments because of the combinatorial space, we can construct a standardized signal under this synthetic setting with the following procedure:

First, we standardize the PageRank scores over all nodes to mean 0 and standard deviation 1 in the original space (denoted $PageRank_s$). To standardize the underlying function in the combinatorial space, we multiply the average $PageRank_s$ by $\sqrt{k}$ to obtain the final underlying function $\tilde{f}(\mathcal{S})$, that is,

$\tilde{f}(\mathcal{S}) = \sqrt{k}\, f_s(\mathcal{S}) = \frac{1}{\sqrt{k}} \sum_{i=1}^k PageRank_s(\mathcal{S}_i)$,

in which $\mathbb{E}[\tilde{f}(\mathcal{S})] = 0$ and, treating the summands as approximately independent, $Var(\tilde{f}(\mathcal{S})) = \frac{1}{k} Var(\sum_{i=1}^k PageRank_s(\mathcal{S}_i)) = \frac{1}{k} \times k \times Var(PageRank_s(v)) = 1$.

Now, we can simply add random Gaussian noise to $\tilde{f}(\mathcal{S})$. Specifically, we consider $\epsilon \sim \mathcal{N}(0, \sigma^2_{\epsilon})$ with $\sigma_{\epsilon} \in \{0.1, 0.25, 0.5, 1\}$, where the level of noise can be directly estimated since both the underlying function and the noise now have mean 0 and standard deviation 1. In addition, we further plot the estimated density of the original and noisy signals in Fig. 2 (rebuttal PDF) to visualize the difference intuitively, which is done by randomly sampling $10^5$ observed values in the combinatorial space $\binom{\mathcal{V}}{k}$.
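The standardization argument can be checked numerically. This is a sketch under stated assumptions: synthetic skewed scores stand in for actual PageRank, and subsets are sampled with replacement for simplicity.

```python
import numpy as np

# If per-node scores are standardized to mean 0 / std 1 and subset elements
# are drawn independently, then f_tilde(S) = (1/sqrt(k)) * sum of k scores
# should have mean ~0 and variance ~1, matching the derivation above.
rng = np.random.default_rng(0)
k, n_nodes, n_samples = 8, 10_000, 100_000

scores = rng.exponential(size=n_nodes)            # skewed, PageRank-like
scores = (scores - scores.mean()) / scores.std()  # standardize to (0, 1)

subsets = rng.integers(0, n_nodes, size=(n_samples, k))
f_tilde = scores[subsets].sum(axis=1) / np.sqrt(k)
```

With this normalization, adding $\epsilon \sim \mathcal{N}(0, \sigma^2_\epsilon)$ makes $\sigma_\epsilon$ directly interpretable as the noise-to-signal ratio.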

In the noisy setting, why does the framework apply the acquisition function to unvisited combo-nodes only? Why change the location of the combo-graph based on the observations instead of the posterior mean or the acquisition function?

We would like to thank the reviewer for this insightful question. As suggested, we implement GraphComBO-Noisy, which uses the best posterior mean across both visited and non-visited combo-nodes within the combo-subgraph as the new center.

How do you expect your algorithm to react to noisy function values? I believe its behavior would be quite different, especially because the location of the combo-graph is selected according to the observed function value, which is no longer accurate.

The results (Fig. 3, rebuttal PDF) show that the original method GraphComBO is robust to noisy observations on both networks at different noise levels from $\sigma = 0.1$ to $\sigma = 1$. Compared with GraphComBO-Noisy, the observation-guided method performs comparably in most cases, except for a very noisy setting at $\sigma = 1$ on WS networks, where we can observe a clear advantage for the method guided by the posterior mean. This can be explained as follows.

Unlike classical discrete combinatorial functions of independent variables, the underlying functions in our problems are highly related to the graph structure. For example, BA networks are known for rich structural information due to their scale-free characteristics (i.e. a heavy-tailed node degree distribution), which makes the distribution of the original signal heavily right-skewed with extreme values even after standardization (Fig. 2, left). By contrast, the WS small-world network (randomly rewired from a ring) has more homogeneous node degrees, and thus the original signal is more normally distributed after standardization (Fig. 2, right).

Therefore, the noise level (even at $\sigma = 1$) is less significant on BA networks once the algorithm finds the promising region, whereas on WS networks at $\sigma = 1$, just as the reviewer described, the algorithm is "misguided" by the observations compared to the posterior mean when the signal is highly corrupted.

Comparison with COMBO on small graphs

We thank the reviewer for acknowledging our different problem setting from COMBO. As suggested, we have implemented COMBO on smaller BA and WS networks of 500 nodes, which are still much larger than the ones used in their experiments ($|V| < 50$). The results are presented in Fig. 4 in the rebuttal PDF, and we can see a clear advantage of our framework over COMBO. Please refer to the general response for a more detailed analysis.

We’d like to thank the reviewer again for the insightful comments, and we will make sure to include the above discussion in the revised version. We believe that we have now addressed all the concerns from the reviews, and hope the reviewer could kindly consider increasing the rating in light of this.

Comment

Thank you for the detailed rebuttal, the interesting discussion about the noisy setting and the additional results. I've increased my score to 5.

In light of the performance of GraphComBO and GraphComBO-Noisy, do you still recommend to use GraphComBO? If so, do you have a way to detect early in the experiment when GraphComBO-Noisy would be a more beneficial strategy?

Comment

We thank the reviewer for the prompt engagement in the rebuttal, as well as their increased positive rating of our work. Please see our following explanations regarding the choice of GraphComBO (observation-guided) and GraphComBO-Noisy (posterior mean-guided).

  • Under the noiseless setting in our experiments (e.g. synthetic experiments with centrality measures), the difference in their performance is minimal after trying both methods on several synthetic graphs with different synthetic functions, where GraphComBO only slightly outperforms GraphComBO-Noisy in a few cases when $k$ is relatively large.

  • Under the noisy setting (e.g. the new experiment in our rebuttal and the real-world experiments involving simulation results), we observed that GraphComBO-Noisy has a clear advantage over the observation-guided method when the noise level is high.

Since we often lack prior knowledge of the noise level when first encountering signals in real-world experiments, we will set the suggested GraphComBO-Noisy as the default option, given its overall performance across different settings, while retaining the observation-guided method in the revised version for noiseless scenarios.

We thank the reviewer again for the suggested method and will make sure to incorporate the above changes in the main manuscript. We are more than happy to discuss any additional improvements that would make the reviewer evaluate our work more positively and potentially further raise the rating.

Review (Rating: 5)

This paper proposes a Bayesian optimization (BO) method for optimizing black-box functions with costly evaluations over subsets of nodes in a graph. The central idea is to use the combo-graph associated with these node subsets from the original graph. This approach maintains the structural information of the original graph while enabling the use of existing techniques for BO of functions defined over graph nodes (e.g., BayesOptG). Additionally, to manage the large feasible space and better balance exploration and exploitation, a local search heuristic is introduced. The proposed method outperforms several standard baselines across various synthetic and realistic problems.

Strengths

The paper is well-written overall, with sufficient technical details and figures that help build intuition. The proposed approach is technically sound, and the empirical evaluation shows significant improvements over the baselines in the problems considered.

Weaknesses

  1. (Minor) While the paper is generally well-written, some graph-theoretic concepts, such as the graph Laplacian, are not adequately defined.

  2. The technical novelty is somewhat limited, as the proposed approach mainly combines BayesOptG over the combo graph with a local search heuristic.

  3. My main concern is the practical utility of the proposed approach. Real-world application networks are typically much larger than those considered in the empirical evaluation. Additionally, larger values of $k$ are likely to be of interest in these applications. The current results suggest that the proposed approach may not offer improvements over some baselines in these more challenging scenarios.

  4. The empirical evaluation would benefit from including other BO baselines. Several methods for BO over discrete sets exist, and I believe that LADDER could be a particularly strong candidate.

Questions

  1. Is it possible to extend the proposed approach to settings where the objective function is defined over sets of variable size?

Limitations

The authors acknowledge the proposed approach's limitations, particularly the diminishing performance gains as $k$ increases. Additionally, I believe this approach is restricted to graphs significantly smaller than those typically encountered in real-world applications. Unfortunately, I believe these limitations are more concerning than suggested in the paper.

Author Response

We thank the reviewer for the detailed feedback that helped us improve our work. Please find our detailed response below.

The empirical evaluation would benefit from including other BO baselines. I believe that LADDER could be a particularly strong candidate.

To the best of our knowledge, this is the first attempt in the literature at extending BO to black-box optimization of functions over node subsets in a generic graph, which means we are addressing a new problem setup that has not been explored by existing works.

For LADDER, we argue that their setting is fundamentally different from ours, where their underlying function (from their experiments) is defined on graph structures, i.e., their search space consists of different graphs and the goal is to find the optimal graph, which requires a graph kernel to measure the similarity between two graphs. However, in our setting, the underlying function is defined on node subsets from a single graph. After mapping the subsets to nodes on the combo-graph, a kernel on graph is required to measure the similarity between two nodes.

The closest match to the current work is COMBO, where the search space consists of node subsets from different (small) graphs ($|V|<50$ in their experiments with $k<50$). Nevertheless, it cannot scale to large graphs due to the infeasible computational cost of the eigendecomposition of the underlying graphs.

With that being said, we have added the comparison with COMBO in the general response on small BA and WS graphs of $|V|=500$ (still much larger than the graphs in their experiments), where the result (Fig.4 rebuttal PDF) shows a clear advantage of our framework over COMBO.

We will add the above discussion in the revised version and are happy to discuss more if the reviewer believes there are works in the literature that tackle the same problem as ours.

Real-world application networks are typically much larger than those considered in the empirical evaluation. I believe this approach is restricted to graphs significantly smaller than those typically encountered in real-world applications.

Firstly, we validated our method across various real-world applications on contact networks, social networks, transportation networks, and molecule graphs that scale from order $10^2$ to $10^4$. This is consistent with the real-world graphs used in graph machine learning (e.g. graph neural networks: GCN, GAT, GIN) and graph data mining literature (e.g. influence maximization [1,2,3]), which are in the same order as ours -- based on this, we believe the graphs considered are comparable to the literature and not "significantly smaller" than realistic graphs.

Second, the graphs considered in other graph-related BO works, e.g. LADDER (mentioned by the reviewer), COMBO, NASBO, GRABNEL, are much smaller than ours, typically at order $10^1$ and $10^2$.

A new experiment on a larger network with a larger subset size $k$

Lastly, our framework can scale to large graphs, as it assumes no prior knowledge of the full graph and takes a local modeling approach that gradually reveals the graph structure. To better support our claim, we further test it on a large social network OGB-arXiv ($|V|=1.7\times10^5$) from the Open Graph Benchmark with $k$ up to 128, where the results in Fig.1 (rebuttal PDF) show a clear advantage of our framework over the other baselines.

Additionally, larger values of k are likely to be of interest in these applications. The current results suggest that the proposed approach may not offer improvements over some baselines in these more challenging scenarios.

We argue that setting $k<50$ (<1% of the network) is a very common paradigm in the literature of subset selection on graphs [1,2,3], where many problems have been proven NP-hard due to the combinatorial explosion of the search space $\binom{N}{k}$, e.g. $\binom{1000}{32}\approx2.3\times10^{60}$. As such, the diminishing performance gain w.r.t. subset size $k$ poses a general challenge in the literature [4], and it becomes even more challenging in our setting, since the underlying function is fully black-box and we assume no prior information about the graph structure.
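As a side note, the combinatorial-explosion figure above can be checked with Python's exact binomial coefficient:

```python
from math import comb

# Exact size of the search space quoted above: the number of
# 32-node subsets in a 1000-node graph.
n_subsets = comb(1000, 32)
print(f"{n_subsets:.1e}")  # ≈ 2.3e+60
```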

Having said that, the proposed method still generally outperforms the other baselines across all experiments, and in the least favorable case, it performs comparably to the local search, which is also a novel baseline introduced in our paper since it needs to operate on the proposed combo-graph.

The technical novelty is somewhat limited, as the proposed approach mainly combines BayesOptG over the combo graph with a local search heuristic.

As the prior work BayesOptG only considers $k=1$, which is limiting for many real-world applications, we aim to generalize the framework to a combinatorial setting with the following technical contributions:

  1. A combinatorial graph tailored for node subsets in a generic graph.
  2. A recursive algorithm to sample subgraphs from the combinatorial space.
  3. To handle the combinatorial explosion problem, we use a local approach to traverse the combinatorial graph space by iteratively sampling its subgraphs.

In the early stage, we tried other localization methods, such as selecting nodes inside a subgraph of the original graph as the subset. However, since we need to maintain the diversity of the subset to cover potentially distant nodes, none of them fit the problem as cohesively as the current method, which makes our design non-trivial.

For the variable subset size, we believe it belongs to a different but interesting setup and leave it to future work. As for the graph-theoretic concepts, we will make sure they are clearly defined in the revised version.

We believe that we have now addressed all the concerns from the original reviews, and hope the reviewer could kindly consider increasing the rating in light of this.

Please find the references in the comment chatbox below.

Comment

[1] Maximizing the Spread of Influence through a Social Network, KDD 2003

[2] Cost-effective Outbreak Detection in Networks, KDD 2007

[3] Robust Influence Maximization, KDD 2016

[4] Influence Maximization on Social Graphs: A Survey, TKDE 2018

Comment

Dear Reviewer u7DA,

As the discussion period is drawing to a close, we'd like to kindly remind you of our rebuttal, and we remain keen to discuss the issues being raised and further improve the quality of our work.

Thank you again for your time and efforts in reviewing our manuscript, and we are looking forward to your response.

Best,

Authors of Submission #4018

Comment

I would like to thank the authors for their detailed response. While I still believe that the proposed approach offers limited technical novelty, the authors have provided compelling evidence of its practical utility and competitive performance across a range of real-world applications. Therefore, I am happy to raise my score to 5.

Comment

We thank the reviewer for acknowledging the practical contribution of our work across various real-world applications!

Regarding the technical contribution, we'd like to kindly emphasize the following technical difficulties, which are critical for extending BO to the novel combinatorial setting on graphs:

  1. Unlike classical combinatorial optimization of discrete variables, the combinatorial search space in our setting contains structural information from the underlying graph, which needs to be properly defined beforehand. Note this is different from BayesOptG, where the search space is not combinatorial but simply the underlying graph.

  2. When using a kernel to measure the similarity between two node subsets, instead of directly counting their common elements, we need to further consider the similarity of their structural information.

  3. To tackle the combinatorial explosion, we need to adopt a local modeling approach that can properly balance the exploration-exploitation trade-off in the structural combinatorial space.

We believe it is a non-trivial task to address these technical challenges, which has led to the following contributions: we first proposed a combinatorial graph tailored to the novel setting, which maps node subsets to combo-nodes in the combo-graph with certain desired properties (as explained in Section 3.1). Next, we introduced a recursive sampling algorithm to sample a subgraph from the combo-graph, and then traverse the combinatorial space by moving around the subgraph center, which is guided by the local surrogate model.
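To make the traversal idea above concrete, here is a conceptual sketch, not the paper's actual algorithm: we assume (our assumption, not stated here) that combo-nodes are $k$-subsets, that two combo-nodes are adjacent when their subsets differ in exactly one element, and that the subgraph is grown breadth-first from a center until a budget of combo-nodes is collected.

```python
from collections import deque

def combo_neighbors(subset, all_nodes):
    """Assumed adjacency rule: all k-subsets obtained by swapping one element."""
    for out_node in subset:
        for in_node in all_nodes - subset:
            yield frozenset(subset - {out_node} | {in_node})

def sample_combo_subgraph(center, all_nodes, budget):
    """Grow a combo-subgraph outward from `center` (BFS) until `budget` combo-nodes."""
    seen = {center}
    queue = deque([center])
    while queue and len(seen) < budget:
        current = queue.popleft()
        for nb in combo_neighbors(current, all_nodes):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
                if len(seen) >= budget:
                    break
    return seen

nodes = frozenset(range(10))   # a toy 10-node graph (edge structure ignored here)
center = frozenset({0, 1, 2})  # a 3-node subset as the subgraph center
sub = sample_combo_subgraph(center, nodes, budget=30)
```

In the actual framework the allowed swaps would presumably be constrained by the underlying graph's structure and the center would move under guidance of the local surrogate; this sketch only illustrates the subgraph-sampling step.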

While being natural and intuitive, the proposed framework effectively handles the above technical problems and has shown strong performance under various settings in the experiments. From our perspective, such a canonical design may well be its advantage, i.e., a generic approach that can be applied to a wide range of practical applications on graphs.

Lastly, we want to once again thank the reviewer for the valuable comments and timely feedback that helped us improve our current work.

Author Response

We want to first thank all the reviewers for their detailed and valuable feedback, which we find very helpful in improving our work. We are also glad to see they acknowledged the innovation and potential impact in the community (bfWL, Y7Gf, yrTA), the technical soundness (u7DA, Y7Gf), the significance in experiment results (u7DA, Y7Gf, yrTA), and the clarity of writing (u7DA, bfWL, yrTA) in our work. We address the common concerns below.

Comparisons with the existing SOTA BO baselines (u7DA, bfWL)

We would like to emphasize that the current work is, to the best of our knowledge, the first attempt in the literature that extends Bayesian Optimization (BO) to black-box functions of node subsets in a generic graph. In particular, the local modeling approach in our framework enables the algorithm to handle situations when the underlying graph structure is not known a priori and can only be gradually revealed by queries on the fly.

The proposed optimizer has real-world applications across various domains and has been validated on contact networks, social networks, transportation networks, and molecule networks, with different underlying functions such as measures for network resilience, results from epidemic/social influence simulations, and outputs from graph neural networks.

We argue that this novel problem setting has been overlooked in the literature, where most graph-related BO methods (e.g. LADDER [1], NASBO [2], GRABNEL [3]) are designed for optimizing graph structures, i.e., their search space consists of different graphs and the goal is to find the optimal graph, in which a graph kernel is required to measure the similarity between two graphs. In our setting, however, the underlying function is defined on node subsets from a single graph, which, after being mapped to the combo-graph, requires a kernel on graphs to measure the similarity between two nodes.

The closest match to the current work is COMBO [4], where the search space consists of $k$-node subsets drawn from $k$ different (small) graphs respectively ($|V|<50$ in their experiments with $k<50$), and the goal is to find the optimal subset. Nevertheless, it assumes knowledge of the full graph structure in advance and cannot scale to large graphs due to the infeasible computational cost of the eigendecomposition of the underlying graphs.

Further experiment on the comparison with COMBO

With that being said, as suggested by reviewer bfWL, we have added the comparison with COMBO on small BA and WS graphs of $|V|=500$ (still much larger than the graphs used in COMBO’s experiments), where the result (Fig.4 rebuttal PDF) shows a clear advantage of our framework over COMBO, and we provide the following explanation.

To implement COMBO under our setting of $k$-node subsets from a single graph $\mathcal{G}$ of size $N$, we generate $k$ identical copies of $\mathcal{G}$ and form the $k$-node subset by drawing one node from each copy. This leads to a search space of size $N^k$, which is the key limitation of COMBO under this setting, since the true search space is $\binom{N}{k}$. As a result, there are many repeated and invalid locations in the search space: for example, at $k=3$, (1,2,3), (1,3,2), (2,1,3), ... are different subsets in COMBO, but they all represent the same subset under the current single-graph setting; meanwhile (1,2,2), (1,1,2), (1,2,1), … are valid subsets in COMBO, but they are invalid $k$-node combinations on a single graph. This limitation makes COMBO highly inefficient under this new problem setting, and therefore leads to inferior performance compared to our proposed method.
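The duplicate-and-invalid-tuple argument can be illustrated with a toy enumeration (toy sizes $N=6$, $k=3$, chosen for illustration only and not tied to any experiment):

```python
from itertools import combinations, product

N, k = 6, 3
nodes = range(N)

# COMBO's search space over k identical graph copies: all ordered k-tuples.
tuple_space = list(product(nodes, repeat=k))      # size N**k = 216

# The true search space on a single graph: unordered k-subsets.
subset_space = list(combinations(nodes, k))       # size C(6, 3) = 20

# Tuples with k distinct elements collapse to only C(N, k) subsets;
# each valid subset appears k! = 6 times, and the rest are invalid.
valid_distinct = {frozenset(t) for t in tuple_space if len(set(t)) == k}

print(len(tuple_space), len(subset_space), len(valid_distinct))
```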

We are happy to discuss more if the reviewers believe there are works in the literature that tackle the same problem as the current one.

Diminishing performance gain with increasing subset size $k$ (u7DA, yrTA)

In the literature of subset selection on graphs, setting $k<50$ (<1% of the network) is a very common paradigm [5,6,7], since many problems have been proven NP-hard due to the combinatorial explosion of the search space $\binom{N}{k}$. As the subset size $k$ increases, the diminishing performance gain of heuristic optimization algorithms over random baselines poses a general challenge in the literature [8]. On top of this, the problem becomes even more challenging under our setting, since the underlying function is fully black-box and we assume no prior information about the graph structure.

Having said that, the proposed method still generally outperforms the other baselines across all experiments, and in the least favorable case, it performs comparably to the local search, which is also a novel baseline introduced in our paper since it needs to operate on the proposed combo-graph.

Additional experiment on a larger network with a larger subset size $k$

Lastly, we further test GraphComBO on a large social network OGB-arXiv ($|\mathcal{V}|=1.7\times10^5$) from the Open Graph Benchmark with $k$ up to 128, where the results in Fig.1 (rebuttal PDF) show a clear advantage of our framework over all the other baselines. In this case, the local search baseline even underperforms random selection and random walk, as it overly focuses on local exploitation rather than exploration, whereas the latter is more critical on large underlying graphs.

[1] Combining latent space and structured kernels for Bayesian optimization over combinatorial spaces, NeurIPS 2021

[2] Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels, ICLR 2021

[3] Adversarial Attacks on Graph Classification via Bayesian Optimisation, NeurIPS 2021

[4] Combinatorial Bayesian Optimization using the Graph Cartesian Product, NeurIPS 2019

[5] Maximizing the Spread of Influence through a Social Network, KDD 2003

[6] Cost-effective Outbreak Detection in Networks, KDD 2007

[7] Robust Influence Maximization, KDD 2016

[8] Influence Maximization on Social Graphs: A Survey, TKDE 2018

Final Decision

This paper presents a clever approach to allow existing Bayesian Optimization methods for optimizing over nodes in a graph to support the task of optimizing over subsets of nodes in a graph by constructing an appropriate combinatorial graph coupled with a local search technique. Reviewers concur that the paper is clearly written and experiments are compelling.