PaperHub
Overall rating: 4.4 / 10 · Decision: Rejected · 5 reviewers
Reviewer ratings: 5, 5, 8, 3, 1 (lowest 1, highest 8, standard deviation 2.3)
Average confidence: 3.0
ICLR 2024

Hessian-Aware Bayesian Optimization for Decision Making Systems

Submitted: 2023-09-24 · Updated: 2024-02-11
TL;DR

We propose Hessian-Aware Bayesian Optimization for optimizing a metamodel in Decision Making Systems involving multiple actors, making progress on optimizing decision making systems with uninformative or unhelpful feedback.

Abstract

Keywords
Bayesian Optimization, Active Learning, Gaussian Process, Graphical Models, Bayesian, Probabilistic Methods, Hessian, High-dimensional optimization, Global optimization, Uncertainty, Optimization under Uncertainty

Reviews and Discussion

Official Review
Rating: 5

The paper studies the problem of learning good decision-making policies when multiple agents are involved. This problem traditionally presents several challenges, the biggest of which include reasoning about sparse feedback and unreliable gradient information, and accounting for interactions between agents. To perform effective optimization with sparse feedback and unreliable gradient information, the authors employ Bayesian optimization (BayesOpt). However, BayesOpt doesn't scale well to high dimensions, so the paper proposes using an abstraction of role and role interaction to model the agents on a high level, thus significantly simplifying the policy space to search over. To further aid scaling BayesOpt to the high-dimensional search space, the authors use a surrogate Hessian model to learn an additive structure of the space, which ultimately helps identify good regions within the space. Experimental results show promising performance of the proposed algorithm.

Strengths

The problem studied is interesting and important. The paper motivates the use of the role and role interaction abstraction well, and the proposed affinity score seems like a good way to quantify how suitable each agent is for each role. Some experiments show really strong performance from the proposed algorithm against a wide range of baselines.

Weaknesses

There seem to be many components to the algorithm: role abstraction, BayesOpt with the Upper Confidence Bound policy, and the Hessian surrogate. As far as I can tell, the ablation study doesn't give me insights into which components are useful for the final algorithm. For example, could we use a different optimization algorithm that doesn't rely on gradient information, such as DIRECT or evolutionary algorithms? How much does information about the Hessian help BayesOpt?

I would have liked to see more discussion on the quality of the Hessian surrogate. My understanding is that Hessian information can be accessed via the JIT compilation of $v(\theta)$. How good is this surrogate? What are the situations where this surrogate doesn't offer reliable information?

Questions

Please see my questions in the Weaknesses section.

Comment

Dear reviewer, here are the answers to your questions.

  • On the ablation study.

    • As is typical in reinforcement learning or deep learning style papers, we ablated the higher-order model extensively (Algorithms 1, 2, 3). We did not ablate Algorithm 4 (Bayesian optimization), as we are not familiar with this as a technique in Bayesian optimization papers.
    • This paper evolved from an earlier work in which we explored other techniques for empirically optimizing the acquisition function in BO, which has been the significant bottleneck since the work of Kandasamy et al. We have several unreleased results on optimizing or improving the DIRECT algorithm.
    • We abandoned that line of research as it was deemed no longer interesting or valuable to the AI community.
    • We are not sure how evolutionary algorithms would fare on this problem. We wanted an algorithm backed by theoretical regret guarantees, which is what Algorithm 4 provides.
    • To determine whether the Hessian helps in BayesOpt, we compared against TreeBO, which uses the exact same additive structure as our work; however, our approach uses the surrogate Hessian to learn the dependency structure. This validation is presented in Fig. 4 and Fig. 9 in Appendix C.5.
  • How good is this surrogate? What are the situations where this surrogate doesn't offer reliable information?

    • In our validation, we found the surrogate to always be helpful. We believe this has to do with the "constructed nature" of the benchmarks, which allows for significant learnability.
Comment

We want to give a very small addendum to our response. We finally understood the reviewer's comment about the ablation of Algorithm 4. We want to point out that there are only 2 hyperparameters in Algorithm 4 (Bayesian optimization); they were kept completely unchanged throughout the validation and were found through a simple grid search.

The hyperparameters for Algorithm 4 are as follows:

  • HA-GP-UCB
    • Size of the largest maximal clique: Note that in the theoretical analysis, the regret scales with the size of the largest clique. We observed that having larger maximal cliques would demonstrate slow improvement in the objective function due to an overly complex additive model. Having smaller maximal cliques would cause the objective function to asymptotically converge to a subpar value due to the limited expressiveness of the additive model.
    • Total number of additive kernels: Due to computational reasons, we limited the total number of additive kernels generated. We encountered out-of-memory and computational issues on commodity GPUs if this value was set too high.
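
To make the additive structure behind these two hyperparameters concrete, here is a minimal sketch (not the paper's implementation; the kernel choice, lengthscale, and the two-clique layout are assumptions) of a kernel that decomposes as a sum over maximal cliques, $k(\theta, \theta') = \sum_i k^{(i)}(\theta^{(i)}, \theta'^{(i)})$:

```python
import numpy as np

def rbf(x, y, lengthscale=1.0):
    # Squared-exponential kernel evaluated on a sub-vector of parameters.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

def additive_clique_kernel(theta_a, theta_b, cliques, lengthscale=1.0):
    # Sum of low-dimensional kernels, one per maximal clique of the learned
    # dependency graph; the GP prior then decomposes additively.
    return sum(rbf(theta_a[c], theta_b[c], lengthscale) for c in cliques)

# Hypothetical 6-D parameter space whose dependency graph yields two cliques.
cliques = [np.array([0, 1, 2]), np.array([3, 4, 5])]
theta_a, theta_b = np.random.rand(6), np.random.rand(6)
print(additive_clique_kernel(theta_a, theta_b, cliques))
```

Larger cliques make each $k^{(i)}$ higher-dimensional (more expressive but slower to improve), while more or smaller groups increase the number of additive kernels, matching the trade-offs described above.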

The hyperparameters for Algorithm 1, 2, 3 are as follows:

  • Higher-order model
    • Neural network sizes: Internally, our higher-order model is constructed from neural networks forming the computational components. Each neural network component consists of 2 hidden layers with 4 neurons per layer. We found that larger networks were too difficult for HA-GP-UCB to optimize and would likely not fit on memory-constrained devices.
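
As a rough illustration of how small such components are (a sketch only; the input/output widths and activation are assumptions, not the paper's architecture):

```python
import torch.nn as nn

# Two hidden layers with 4 neurons each, as reported above; the widths of the
# input (3) and output (1) are made up for this sketch.
component = nn.Sequential(
    nn.Linear(3, 4), nn.Tanh(),
    nn.Linear(4, 4), nn.Tanh(),
    nn.Linear(4, 1),
)
print(sum(p.numel() for p in component.parameters()))  # 41 parameters
```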
Official Review
Rating: 5

The paper proposes Hessian-aware Bayesian optimization for multi-agent policy optimization.

Strengths

Multi-agent policy optimization is an interesting setting.

The paper seems technically sound, but I currently cannot follow the technical details for the reasons below.

Weaknesses

Although I am not familiar with the topic (multi-agent reinforcement learning), I think the paper is not easy to follow for many readers. Even the basic problem setting is difficult to fully understand because of unclear descriptions. The meanings of several key words (abstraction, immutable, compile, and so on) are not clarified in the text, which makes the technical description difficult to understand for those not familiar with the topic.

Questions

Overall, I have severe difficulty understanding the paper. I would appreciate it if the authors could provide a more elaborate explanation.

What is the definition of the role affinity function $\Lambda^\theta_{r,i}$? What does it indicate?

What is h without subscript in line 4 of Algorithm 4?

What is a surrogate of the Hessian? How is it constructed?

The authors claim the cumulative regret of the proposed method scales with O(log D). What assumption makes this possible? The implication behind Assumption 1 is difficult for me to see. Does it assume some low-dimensional essential dependency in the underlying true function?

Comment

Dear reviewer,

We are looking into making changes to the draft to aid in readability. Here are immediate answers to your questions.

  • Basic problem setting

    • The basic problem setting is that of multiple agents interacting cooperatively to accomplish a common goal, in a memory-constrained setting (such as IoT devices or drone delivery tasks). Within this setting, we propose a variant of high-dimensional Bayesian optimization and show how it can optimize a compact policy space. Specifically, we show that in the case of malformed reward scenarios or complex multi-agent interactions, globally optimizing a small policy with our approach outperforms related work.
  • Definitions of key words

    • Abstraction: We solve several problems in our work by using abstraction, i.e., separating a complex system into components that interact in a well-defined manner. In our work we use abstraction to separate decision making into role assignment and role interaction.
    • Immutable: Immutable here means something that does not change. As we are optimizing over a parameter space, the immutable part of the policy (the instructions, Alg. 1, 2, 3) is the part that is not searched over or optimized.
    • Compile: Here compilation means that the immutable part and the mutable part (the parameters) of the policy are joined together to create the actual policy.
  • What is the definition of the role affinity function $\Lambda^\theta_{r,i}$? What does it indicate?

    • The role affinity function quantifies how suitable an agent is for a specific role: a higher value indicates a stronger affinity of that agent for that role.
  • What is h without subscript in line 4 of Algorithm 4?

    • h is a tensor (a stack of matrices) formed by collecting many Hessian matrices from line 3 in Algorithm 4.
  • What is a surrogate of Hessian? How is it constructed ?

    • The surrogate Hessian is a surrogate quantity used in the paper because access to the true Hessian is either not available or too expensive to approximate. We discuss this further in Appendix H. The construction of the surrogate is as follows (found in Appendix A).
    • To estimate the Hessian, we used the Hessian-vector product approximation. We relaxed the discrete portions of our metamodel policy into a differentiable continuous approximation for this phase, using the Sinkhorn-Knopp algorithm for the Role Assignment phase. For role interaction network connectivity, we used a sigmoid to create differentiable "soft" edges between each role. (A small illustrative sketch of this estimation is given at the end of this response.)
  • The authors claim the cumulative regret of the proposed method scale with O(log D). What assumption makes it possible? The implication behind assumption 1 is difficult to find for me. Does it assume some low dimensional essential dependency in the underlying true function?

    • The approximate high-level sketch of Assumption 1 is that the dependency graph, drawn from the Erdős-Rényi model, has low degree/connectivity. The base of the logarithm is related to the quantity $p_g$ that we discuss in Assumption 1. This is elaborated on in the proof of Theorem 1.
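
As promised above, here is a minimal sketch of the Hessian-vector-product estimation applied to a differentiable relaxation of the policy objective. It assumes a JAX-style autodiff stack; `policy_objective`, the dimensionality, and the relaxation itself are placeholders for exposition, not the paper's code.

```python
import jax
import jax.numpy as jnp

def policy_objective(theta):
    # Stand-in for the relaxed objective: in the paper, discrete role
    # assignment is relaxed via Sinkhorn-Knopp and role-interaction edges are
    # made "soft" with a sigmoid so that this function is differentiable.
    return jnp.sum(jnp.tanh(theta) ** 2)

def hvp(f, theta, v):
    # Hessian-vector product without materializing the Hessian:
    # differentiate the gradient of f along the direction v.
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

theta = jnp.ones(8)
# Assemble a surrogate Hessian estimate column by column from HVPs.
H_pi = jnp.stack([hvp(policy_objective, theta, jnp.eye(8)[j]) for j in range(8)], axis=1)
```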
Official Review
Rating: 8

This paper presents a multi-agent reinforcement learning algorithm based on Bayesian optimization. The main contribution of this paper is the idea to incorporate the Hessian information to solve the additive structure of the high dimensional problem as an additive decomposition. Because the value Hessian is unavailable, the system relies on the policy Hessian as a surrogate. The paper includes extensive evaluations and regret bounds.

Strengths

The main contribution of the paper in terms of the Hessian aware BO is very interesting and could have a very large impact on the community. In fact, I believe that the MARL application could be completely removed and still have a valuable paper, maybe extended to other high dimensional problems with additive structure. I have minimal experience with MARL, but it seems that the approach is competitive in that area as well. The paper provides an excellent theoretical and empirical analysis.

Weaknesses

The main weakness of the paper (and it is understandable) is the amount of clutter. Most figures and tables are impossible to read without zooming. The amount of information provided would be more appropriate for a journal article, although I can see the motivation behind a submission to ICLR instead. However, the paper is clear enough. Despite reading Appendix G, it is still unclear to me whether $H_\pi$ can always be used as a reliable surrogate for $H_v$, as $g$ might have some zero (or small) components. In fact, for the optimal policy, shouldn't it be 0?

Questions

- Why do you need JIT-compilation? Isn't it just instantiation with certain parameter values?
- In Algorithm 4, do you initialise the parameters with a uniform distribution? In the BO literature, it is common to use low-discrepancy sequences or sampling procedures.

Comment

Dear reviewer,

We are looking into making changes to the draft to aid in readability. Here are immediate answers to your questions.

  • Why MARL and BO?

    • In order to make Bayesian Optimization valuable and competitive, theoretically and empirically, with a clear use case, we focused on decision making systems for memory-constrained devices.
  • Why conference and not journal?

    • We believe that many excellent works in Computer Science come from conference publications, which is why we hoped to have this work accepted at a conference.
  • H_pi can always be used as a reliable surrogate for H_v?

    • This is a good question. In all of our validation scenarios, H_pi always served well in learning the dependency structure. We believe that this is due to the "constructed nature" of the reinforcement learning benchmarks. The tasks themselves are constructed in such a way that learning is possible. We're thinking about whether it is possible to make an argument along this direction in the Appendix.
  • Need for JIT-compilation?

    • The JIT-compilation was a design decision in order to have a compact policy space. We are looking for citations and references which show that this approach is a reasonable way to simplify a parameter space.
  • Algorithm 4 initialization procedure

    • We use Sobol sequences in order to quasi-randomly sample the parameter space.
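
For illustration, a quasi-random Sobol design can be generated as below (a sketch only; the dimension and bounds are placeholders, not the values used in the paper):

```python
import numpy as np
from scipy.stats import qmc

sampler = qmc.Sobol(d=16, scramble=True)           # 16-D parameter space (assumed)
unit = sampler.random(n=32)                        # 32 points in the unit hypercube
theta_init = qmc.scale(unit, np.full(16, -1.0), np.full(16, 1.0))  # map to bounds
```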
Comment

Thank you for your comments.

Correct me if I'm wrong, but the JIT-compilation is a way to deal with a parametric model. You have an algorithm/model and you evaluate it with different parameters. In fact, I don't agree that it is a meta-model. I think the explanation with a parametric model would simplify the overall explanation.

The JIT-compilation, while it could be interesting, is just a matter of experimentation/implementation (and should be discussed in that section). It could be implemented as a Python function, a C++ template, or something else. But I don't think it changes the method.

Comment

We apologize for the last-minute and quick response. Due to this issue, which we noticed ourselves, we have changed "metamodel" to HOM (higher-order model), and we have also changed "JIT-compilation" to GEN (generation) or the GEN process.

This is more in line with the literature that we are able to find on the topic (as shown in footnote 4).

Official Review
Rating: 3

The paper proposes a metamodel based on a message-passing neural network (MPNN) for multi-agent policy search. The MPNN represents a dependency graph between agents, which is claimed to be expressive and compact. A Bayesian optimization algorithm is proposed to learn the graph structure. Using direct queries of the Hessian, the dependency graph can be learned with a GP-UCB variant. Experimental results show strong performance compared to MARL, HDBO, and ablations.

Strengths

  1. The idea of using Bayesian optimization to learn the dependency graph structure is interesting and novel.
  2. The multi-agent policy representation is significant.
  3. Combination of role assignment and role interaction looks powerful in empirical results.
  4. Figures help to understand the contributions.

Weaknesses

  1. The presentation of the paper is poor.
  • More background knowledge should be introduced to further clarify the motivations, such as the necessity of the JIT compilation.
  • The notation is extensively abused, and many algorithms and notations have no formal definitions (see Questions for some of them).
  2. The motivation for one of the main contributions, using the Hessian in Bayesian optimization, is not very clear. As the objective in Bayesian optimization is a black-box function, directly querying the Hessian might be very hard.
  3. There are no proofs or lemmas for the theorems.
  4. Related work on the following topics should be added:
  • Learning the dependency graph;
  • Modeling the dependency graph as an MPNN;
  • Bayesian optimization to learn the graph structure.

Overall, it is a very interesting paper with possibly strong contributions, but the writing and presentation are very poor and confusing. I suggest the authors do a very thorough revision during the rebuttal.

Questions

Questions regarding the Hessian-aware BO:

  1. How do you observe the surrogate Hessian of the black-box function?
  2. What is the motivation for using the Hessian in BO? If you can already observe the Hessian, why not use gradient descent?

Questions overall:

  1. What do you mean by JIT compilations? The meaning is not clear in the context of the multi-agent policy structure.
  2. Why do you use an MPNN rather than graph convolution, graph attention, or other transformer variants? These models should be more expressive.

Questions about the notations:

  • What do the subscripts of $\theta$ mean? For example, sometimes you use $\theta_{r,i}$ and sometimes $\theta_{g,e}$ without any explanation of what $r, g, i, e$ are.
  • What is the superscript $D$ in Section 4.4? I only see the definition of "some dimensionality". $D$ is a very important dependency term of the regret in the theorem.
  • What does the superscript ${}^{(i)}$ mean in Section 4.4?
  • What is the mathematical expression of the regret $r(t)$ of querying the Hessian?
  • What is $\mathcal{U}$ in Algorithm 4?
  • What is the Max-Cliques operation in Algorithm 4?
Comment

Dear reviewer,

We are looking into making changes to the draft to aid in readability. Here are immediate answers to your questions.

  • The presentation of the paper is poor.

    • We are looking into improving the readability with another draft.
  • More background knowledge should be introduced to further clarify the motivations, like the necessity of the JIT compilation.

    • This is an interesting point; our approach with JIT compilation was to save on resources, as we were considering the specific use case of memory-constrained devices. We will see if we can find other related work that does the same to justify this design choice.
  • The notation is extensively abused and many algorithms and notations have no formal definitions (See questions for some of them).

    • We did our best not to abuse notation. Unfortunately, in order to make contributions that would be interesting and valuable to the decision making systems community, as well as to include theoretical results, some of our notation is slightly abused.
  • The motivation of one of the main contributions, using Hessian in Bayesian optimization is not very clear. As Bayesian optimization is a black-box function, directly querying Hessian might be very hard.

    • We discuss this in detail in Appendix H.
  • There are no proofs or lemmas for the theorems.

    • We are not sure what the reviewer means. Our proofs and lemmas were included in the supplementary materials. Could it be the case that these were not uploaded to OpenReview? If so, we will reach out to the PCs to make sure this is fixed.
  • RW on Learning the dependency graph;

    • We are not aware of approaches for learning the dependency graph other than Gibbs sampling or evolutionary algorithms, which we have included.
  • RW on Modeling dependency graph as a MPNN;

    • We are not aware of any additional work that models the dependency graph in Bayesian optimization as an MPNN. If the reviewer knows of such a work, we are happy to add it. The closest related work is that of Rolland et al.
  • RW on Bayesian optimization to learn the graph structure;

    • We do not use Bayesian optimization to learn the graph structure. We use a random sampling strategy over the domain. This is described in Algorithm 4.
  • How to observe the surrogate Hessian of the black-box function?

    • We discuss this in Appendix A: to estimate the Hessian, we used the Hessian-vector product approximation. We relaxed the discrete portions of our metamodel policy into a differentiable continuous approximation for this phase, using the Sinkhorn-Knopp algorithm for the Role Assignment phase. For role interaction network connectivity, we used a sigmoid to create differentiable "soft" edges between each role.
  • What is the motivations of using Hessian in BO? If you already observed Hessian, why not using the gradient descent?

    • The motivation for using the Hessian in BO is to generalize the theoretical arguments of Srinivas et al., Duvenaud et al., Kandasamy et al., and Rolland et al. We believed that, by using the Hessian, a tighter regret bound for high-dimensional Bayesian optimization could be obtained. We are not sure what the second part of the question means; could the reviewer clarify?
  • what do you mean by JIT compilations? The meaning is not clear in the context of multi-agent policy structure.

    • JIT compilation takes the immutable (unchanging) part of the metamodel policy (Alg. 1, 2, 3) and combines it with the mutable (changing) part of the metamodel policy (the parameters) to construct the actual model that makes decisions.
  • Why do you use MPNN rather than graph convolution, graph attention or other transformer vi rants? These models should be more expressive.

    • As we were interested in making contributions to the decision making systems field, we did not use these techniques from the deep learning literature. It could certainly be the case that these architectures would be more expressive; we are not able to comment on this, as we are not in the deep learning field.
  • Regarding $\theta_{r,i}$ and other portions of the metamodel parameter space.

    • We defined these terms implicitly when we used them. Unfortunately due to the space constraints of the conference format, this was the only way we could submit this paper to a conference.
    • These terms are defined, implicitly, in Section 4.1, 4.2, as well as in the graphical illustration of Fig. 2.
    • We also invite the reviewer to read the code included in the supplementary materials, which we have tried to make very understandable and readable.

We continue with further answers in the next comment.

Comment
  • Superscript $(i)$ in Section 4.4

    • Superscript $(i)$ indexes the maximal cliques within the additive decomposition. Here, the complex high-dimensional function space being searched over is simplified by assuming it is constructed through addition of functions over subspaces, which lets us use the linearity of addition. We touch upon this further in the proof of Proposition 1 in Appendix E.
  • What is the mathematical expression of the regret $r(t)$ of querying the Hessian?

    • The mathematical expression of the regret depends on the model of computation being analyzed. For example, time or space complexity is often analyzed in order to understand and evaluate the quality of algorithms, and communication complexity is used in distributed computing environments. In decision making algorithms, the key complexity measure is typically the regret, which compares the decisions made to the best possible decisions that could have been made with perfect information.
  • What is $\mathcal{U}$ in Algorithm 4?

    • $\mathcal{U}$ represents the uniform distribution that we sample from in order to learn the dependency structure.
  • What is the Max-Cliques operation in Algorithm 4?

    • The Max-Cliques operation in Algorithm 4 computes the maximal cliques discussed earlier in our response. This is related to finding the maximal cliques that allow us to simplify the function space, which we discuss further in the proof of Proposition 1. We note that this operation is NP-hard in general; however, this is typically not thought to be a problem within the model of computation we are analyzing (decision making).
    • This operation was approximated quickly using the NetworKit library.
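
For illustration, the operation can be sketched as follows (a hypothetical thresholding rule and networkx are used here purely for exposition; as noted above, the paper uses NetworKit):

```python
import numpy as np
import networkx as nx

# Stand-in |surrogate Hessian|: strong off-diagonal entries indicate that two
# parameters interact and should share an additive group.
H = np.abs(np.random.randn(6, 6))
H = (H + H.T) / 2

G = nx.Graph()
G.add_nodes_from(range(6))
G.add_edges_from((i, j) for i in range(6) for j in range(i + 1, 6) if H[i, j] > 0.8)

cliques = list(nx.find_cliques(G))  # maximal cliques -> groups of the additive model
print(cliques)
```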
Comment

We want to highlight our defined term, "JIT-compilation."

In the introduction we describe, "To achieve this, we use a metamodel which generates a policy model. The metamodel is divided into immutable instructions (i.e., algorithms) corresponding to the abstractions of the role and role interaction and mutable parameters that are just-in-time (JIT) compiled into a policy model during evaluation."

In section 4.3, we describe, "To overcome these challenges, we propose a metamodel which JIT-compiles into a graphical model. The JIT-compilation is conditional on the agents’ state thus capturing dynamic role interactions; in addition the compilation allows for a more compact policy space with far fewer parameters. The resultant compiled graphical model captures the state-dependent interaction between roles and yields the resultant actions for each role. After compilation, the interaction between roles are captured by the resultant conditional random field."

Also in section 4.3, we describe, "The graphical model compilation procedure is presented in Algorithm 2. Finally, Algorithm 3 drives the JIT-compilation."
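
Read schematically (all names below are hypothetical and only illustrate the idea described in the quoted passages, not the paper's code), the JIT-compilation/GEN step binds the fixed instructions to the searched-over parameters, conditioned on the agents' state, and returns the executable policy:

```python
# Schematic only: the immutable instructions (Algorithms 1-3) are fixed code,
# theta is the mutable parameter vector searched by HA-GP-UCB, and the result
# is the compiled graphical model that actually selects actions.
def gen(instructions, theta, state):
    graphical_model = instructions.build_role_graph(theta, state)  # cf. Algorithm 2
    return lambda observation: graphical_model.actions(observation)

# policy = gen(hom_instructions, theta, current_state)
# action = policy(observation)
```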

We're working on adding a table of notations to the Appendix in order to aid readability. We will be uploading this shortly.

Comment

I apologize that I didn't see the proofs in the appendix in the initial review.

However, I don't find the authors' response satisfactory. The motivation for a Hessian-aware algorithm for black-box optimization, and how costly it would be, remains unclear. Hessian-type methods are a double-edged sword, but the authors did not make clear what the disadvantages and limitations are. The authors said that they would do their best to improve the paper and would update it. However, no revision has been uploaded so far (I checked the revision history and found it empty). Because there is no update, the paper remains poorly written, and it is still difficult for me to understand and assess its contribution.

Therefore, I would keep my score.

Comment

Just to clarify, I don't know what is happening in OpenReview. I also see the revision history as empty, but the PDF I can access now is different from the one originally submitted. It has been updated.

I've seen the same behavior in other papers.

Comment
  • We will immediately reach out to the PC, as this should not be the case. We have uploaded many revisions of the paper.
  • We elaborate on the costs and benefits of the Hessian several times within the paper, and show that, due to the expensive nature of the Hessian, we focus our specific use case on decision making systems where a useful and valuable surrogate Hessian can be extracted by observing the policy generation process (a sketch of the chain-rule relationship mentioned below is given after the following excerpts).
    • Introduction: In our work, we overcome this shortcoming by observing the GEN process of the HOM. In particular, we can measure a surrogate Hessian during the GEN process which significantly simplifies the task of learning the additive structure. We term this approach Hessian-Aware GP-UCB (HA-GP-UCB) and visualize our approach in Fig. 1.
    • Section 4.4: In practice, observing the Hessian $H_v$ is not possible because $v$ is a black-box function. However, during the GEN process we can observe a surrogate Hessian, $H_\pi$. This surrogate Hessian is closely related to $H_v$, as $v(\theta)$ is determined through interaction of the policy with an unknown environment. Because the value of a policy is a function of the policy, it follows by the chain rule that $H_\pi$ is an important sub-component of $H_v$. We utilize the surrogate Hessian in our work and demonstrate its strong empirical performance in validation.
    • Appendix H: In Section 4.5 we remarked that although we cannot observe $H_v$, we can observe a surrogate Hessian, $H_\pi$, which is related to $H_v$ by the chain rule. We justify our choice here by showing how $H_\pi$ is an important sub-component of $H_v$ (Skorski, 2019).
    • Appendix H: A possible avenue of overcoming this limitation is considering Hessian estimation through zero’th order queries. Several works along this direction have recently appeared using Finite Differences (Cheng et al., 2021), as well as Gaussian Processes (Müller et al., 2021). We consider removing this dependency on the surrogate Hessian for future work.
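
As referenced above, one way to make the chain-rule remark concrete (a sketch under the simplifying assumption that the value factors as $v(\theta) = V(\pi(\theta))$ for some unknown differentiable $V$):

$$
\nabla^2_\theta\, v(\theta) \;=\; J_\pi(\theta)^{\top}\,\big(\nabla^2_\pi V\big)\, J_\pi(\theta) \;+\; \sum_{k} \frac{\partial V}{\partial \pi_k}\, \nabla^2_\theta\, \pi_k(\theta),
$$

where the second term contains the policy Hessians (blocks of $H_\pi$). The first term and the $\partial V / \partial \pi_k$ weights are not observable, which is why $H_\pi$ is treated only as a surrogate for $H_v$.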

We want to especially highlight that our approach is not a general-purpose approach to Bayesian Optimization. Our approach is an approach to solving difficult problems in decision making systems using Bayesian Optimization. General-purpose HA-GP-UCB is outside the scope of this work.

Comment

Thanks for the quick response. To make it easier to see the difference, would you mind highlighting the revised parts? Since it seems that OpenReview doesn't have a revision history, there is only one PDF, so it is difficult to compare.

Comment

We're working on that right now.

Thanks.

Comment

We have uploaded a revision showing the changes in blue text, and we have also summarized the changes made to the paper in another official comment.

We are happy to answer last minute questions or clarifications asked by the reviewer.

Sincerely,

The authors of Hessian-Aware Bayesian Optimization for Decision Making Systems.

Comment

Thank you for highlighting the revision.

I understand that the rebuttal period is time-limited, so it is difficult to do a major revision. However, the major concern is that the paper is hard to follow, and it remains hard to follow (even though I have worked on multi-agent Bayesian optimization and MARL before). I am afraid that I am unable to change my score given the current status of the paper.

Comment

We thank the reviewer for continuing to engage with us. We empathize with the reviewer, and can confidently state that this work attempts to fit as much content into the conference format as any of the other works we are currently reviewing across major machine learning conferences.

We want to highlight the key points that went into the research and construction of this paper which make it a valuable contribution:

  • Our work needed to be theoretically grounded requiring a regret bound.
  • Our work is in the subfield of high-dimensional Bayesian optimization, requiring us to fix the incorrect proof by Kandasamy et al. and to give a non-trivial and complete regret bound compared to the work of Rolland et al.
  • Our work needed to have strong empirical results in at least one subfield of machine learning to justify our approach.
  • Our work needed to have strong empirical results in a currently valuable field of machine learning to have a chance of acceptance to the conference format.

Given the above constraints, our paper attempted to satisfy all these requirements which made it hard to follow. We thank the reviewer for their work in reviewing this paper.

Official Review
Rating: 1

A decision making system determines a sequence of actions to achieve a desired goal. To solve such a problem, the authors use Bayesian optimization, in which gradient information is not actively utilized. In particular, the authors propose Hessian-aware Bayesian optimization to optimize a multi-layer architecture. Finally, some theoretical and numerical results are reported to show the validity of the proposed methods.

Strengths

  • It solves an interesting problem.

Weaknesses

  • This paper is hard to follow. There are too many components but they are not described appropriately.
  • Writing and presentation should be improved. For example, the Introduction section starts with the following sentences: "Decision Making Systems choose sequences of actions to accomplish a goal. Multi-Agent Decision Making Systems choose actions for multiple actors working together towards a shared goal. Multi-Agent Reinforcement Learning (MARL) has emerged as a competitive approach for optimizing Decision Making Systems in the multi-agent settings" These are three independent sentences. This part should be re-written. In addition, a period is missing. Also, the sentence "We propose the usage of Bayesian Optimization (BO)" is somewhat unnatural. Please revise it. There are other cases, but I will not enumerate all of them. Please revise the article carefully.
  • Figures are too small, and legends overlap with the graphs; please see Figures 4 and 5.
  • Theoretical results heavily rely on previous work, in particular (Srinivas et al., 2010).
  • I would like to ask about the results in Figure 5. First, why do the BO results only show the current maxima up to a particular iteration, i.e., monotonically increasing curves? Why do the other results fluctuate? Also, why do the results at the initial iterations differ across algorithms? It seems unfair to the baseline methods. Moreover, some baseline methods are better than the proposed methods; these results seem reasonable. Why do the authors use Bayesian optimization instead of reinforcement learning in these problems?

Questions

  • Does Table 1 show the superior performance of your algorithm? To my understanding, HA-GP-UCB does not outperform some algorithms.
Comment

Dear reviewer,

We are looking into making changes to the draft to aid in readability. Here are immediate answers for your questions:

  • This paper is hard to follow. There are too many components but they are not described appropriately.

    • We are looking into how best to edit the main paper text and the appendix in order to improve the readability of the paper.
  • Writing and presentation should be improved. For example, the Introduction section is started by the following sentences: "Decision Making Systems choose sequences of actions to accomplish a goal. Multi-Agent Decision Making Systems choose actions for multiple actors working together towards a shared goal. Multi-Agent Reinforcement Learning (MARL) has emerged as a competitive approach for optimizing Decision Making Systems in the multi-agent settings" These are three independent sentences. This part should be re-written. In addition, a period is missing. Also, this sentence "We propose the usage of Bayesian Optimization (BO)" is somewhat unnatural. Please revise it. There are other cases, but I would not enumerate all of them. Please revise an article carefully.

    • We are looking into how best to edit this part of the paper. We do want to highlight that the purpose of these three sentences is to show that our specific contribution is to the field of multi-agent decision making systems, not to reinforcement learning or Bayesian optimization. Since this is the case, we needed to frame and introduce all of these areas in the introduction, explain how they are related, and show how our work fits in their context.
  • Figures are too small and legends are overlapped with graphs; please see Figures 4 and 5.

    • We can include larger versions of Figures 4 and 5, identical in content, in the Appendix. As our contribution requires significant validation, we had to make the figures somewhat smaller in order to present all of the important validation in the main text.
  • Theoretical results heavily rely on the previous work, in particular (Srinivas et al., 2010).

    • We respectfully disagree. We believe that our work builds on top of Srinivas et al., as well as Duvenaud et al., Kandasamy et al., and Rolland et al. We do want to highlight that our theoretical results are novel, as they provide a regret bound that scales as O(log D) with the dimension, which is significantly better than other bounds in high-dimensional Bayesian optimization.
  • I would like to ask about the results in Figure 5. First off, why do BO results only show current maxima until a particular iteration, i.e., monotonically increasing? Why are the other results fluctuated? Also, why do the results at initial iterations differ across algorithms? It seems unfair to the baseline methods. Moreover, some baseline methods are better than the proposed methods. These results seem reasonable. Why do the authors use Bayesian optimization instead of reinforcement learning in these problems?

    • This is intimately related to how different approaches are evaluated in the area of decision making systems. In reinforcement-learning-based decision making systems, the policy that is evaluated is the single policy being trained through reinforcement. In Bayesian optimization, the policy that is evaluated is the best policy found so far; hence the difference between policy search (Bayesian optimization) and policy reinforcement (reinforcement learning). Our validation and results show that in resource-constrained environments with malformed or sparse rewards, our approach of policy search using Bayesian optimization outperforms the policy reinforcement methods (the various reinforcement learning methods) we validated against. (A small illustration of the best-so-far bookkeeping is given after this list.)
  • Results in Table 1

    • In Table 1 we show the conditions under which HA-GP-UCB outperforms related work. This relates to the motivation and problem statement of the paper, where we are interested in sparse or malformed rewards in cooperative multi-agent decision making scenarios.
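
As referenced above, the difference in bookkeeping can be illustrated in a couple of lines (the returns below are made-up numbers, not results from the paper):

```python
import numpy as np

returns = np.array([0.1, 0.4, 0.2, 0.7, 0.5, 0.9])  # per-iteration policy evaluations (RL-style)
best_so_far = np.maximum.accumulate(returns)         # BO-style "best policy found so far"
print(best_so_far)                                   # [0.1 0.4 0.4 0.7 0.7 0.9]
```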
Comment

Most of my concerns still remain.

In particular, I do not agree with the statement on the difference in assessment criteria between BO and RL. They should be compared using the same criteria. If the criteria are different, how do we understand and interpret the comparisons between the results?

Therefore, I would like to keep my score.

Comment

We thank the reviewer for engaging in discussion with us.

Although it is highly unusual, and we have not seen results in reinforcement learning papers displayed in such a manner, we have created a new appendix, Appendix K, where we replot the results of the reinforcement learning approaches using the "best policy found so far" convention, similar to BO. We want to point out that in practice this is not how reinforcement learning works; however, in the spirit of giving reinforcement learning the benefit of the doubt, we have created Appendix K with these results.

In this appendix, we see that all of the conclusions from the paper still stand.

In addition, we repeat the same procedure for Table 5 and Table 6, which investigate RL and MARL under sparse or malformed rewards. This is presented in Appendix L, where Table 7 and Table 8 show that all of our conclusions still stand.

We have also expanded footnote 10 to refer to Appendix J, Appendix K, and Appendix L, which show these alternative presentations of the data.

We hope that this helps the reviewer compare the two approaches of policy search and policy reinforcement. We also note that in all these cases the data presented is overly favorable to RL and MARL, and yet our conclusions and observations still hold.

We are also happy to help clarify any other concerns and engage with the reviewer further for this paper.

Comment

Hello reviewers,

We are not sure exactly how to revise or move forward with the paper. At the moment, one reviewer has given a score of 8 and another a score of 1; two reviewers have given a score of 5, and another a score of 3.

We have answered each of the questions posed by the reviewers. We hope that they are helpful.

We are not sure exactly how to revise the paper. Could reviewer z1BB give their preferred phrasing of "Decision Making Systems choose sequences of actions to accomplish a goal. Multi-Agent Decision Making Systems choose actions for multiple actors working together towards a shared goal. Multi-Agent Reinforcement Learning (MARL) has emerged as a competitive approach for optimizing Decision Making Systems in the multi-agent settings."? In addition, does reviewer 22Fs have suggestions on how to improve the paper? We are not sure how to revise the paper and still have it fit in the 9-page format. We are also not sure whether reviewer 22Fs has access to the supplementary materials; the uploaded revision is a combined PDF of both the paper and the supplementary materials for reviewer 22Fs.

We also highlight to reviewer z1BB that in the original submission as well as the uploaded revision, larger sized figures can be found in Appendix C.

At the moment, we cannot find a perfect citation for "JIT-compilation" being helpful in reducing the complexity of searching over a parameter space.

In addition, it may not necessarily be true that "decision making" is treated as a model of computation in the same way as, for example, time and space complexity.

Given the several complexities at the moment in the review process for the paper, we are open to suggestions from the reviewers.

Within the uploaded revision, we have added a reference explaining why we constructed a drone delivery task to validate our work in the multi-agent cooperative setting.

Comment

We once more thank the reviewers for reviewing this paper. We have an additional revision for the reviewers: in this revision, we add a table of notations, which can be found in Appendix B. We have also added a footnote at the beginning of the Design section (footnote 2) pointing to Appendix B.

We are happy to answer any questions and provide any more details and interact with reviewers to help them reach a decision on this paper.

We also thank the meta-reviewer for the upcoming task of making a decision on this paper given the diverse set of scores and reviews this paper has received.

Sincerely,

The authors of Hessian-Aware Bayesian Optimization for Decision Making Systems.

Comment

As the rebuttal process draws to a close, we have another revision for you. In this revision, we have done the following changes:

  • Cleaned up references to the Appendix in the main text, providing a specific appendix letter where possible.
  • We have changed "JIT-compilation" to either "GEN" or "GEN process" to more accurately reflect the generation process which is different from compilation.
  • We changed the term "metamodel" to "higher-order model" to avoid unintended association with metalearning.
  • We have added a footnote 4 to a technical report which provides some amount of evidence that our approach of using a "higher-order model" which generates a model has increased parameter efficiency.
Comment

Dear reviewers,

We thank you once again for reviewing this work. Here is a summary of the changes since the rebuttal process began. Not all of them are highlighted (in blue text) in the current revision; in particular, we only highlighted changes that were content-related.

  • Footnote 2 pointing to the table of notations (Appendix B). This table of notations is very useful for summarizing the paper at a high level. It should be possible to understand, intuitively, what our paper is doing from reading the introduction, the related work, and this table.
  • Footnote 4 pointing to a technical report which shares similarities in spirit to our "higher-order model" approach to a policy.
  • Additional pointers to Appendix A (experimental details) and Appendix C (additional experiments).
  • Footnote 10 pointing to Appendix J, K, and L which show the data in a more favorable light towards RL and MARL under which our observations and conclusions still hold.
  • Inspiration for our drone delivery task in Appendix I. Our drone delivery task was inspired by recent work in vehicle routing problems for drone delivery. Our approach considers the difficulties of this task under the POMDP setting.
  • Changing "metamodel" to "higher-order model" to avoid any mistaken association with metalearning.
  • Changing JIT-compilation to "GEN" or "GEN process" to avoid any mistaken association with the programming language concept of JIT-compilation.
  • Fixed and reworked some sentence structures and minor changes to wording.

As the rebuttal process draws to a close, we are available to answer any last minute questions.

Sincerely,

The authors of Hessian-Aware Bayesian Optimization for Decision Making Systems.

AC Meta-Review

This paper addresses the dimensionality challenge in Bayesian optimization, with an application to multi-agent decision making. To this end, it considers an additive decomposition whose additive structure is learned via a surrogate Hessian. The approach seems to be sound, although the use of the surrogate Hessian is not clearly justified. A regret analysis is also provided. Most reviewers feel that the paper is not easy to read, since it involves various modules that are not well described. The authors made efforts to revise the paper, but the reviewers feel that it should be much improved. Personally, I find the term Hessian-aware Bayesian optimization potentially misleading, since it suggests that Bayesian optimization is using knowledge of the Hessian (whereas the Hessian is used to determine the dependencies across parameters, leading to the additive decomposition, if I understand correctly). Even during the AC-reviewer discussion, most reviewers kept their initial decisions. We feel that the paper is not ready to be published at ICLR in its current version. Therefore, the paper is not recommended for acceptance in its current form. I hope the authors find the review comments informative and can improve their paper by addressing them carefully in future submissions.

Why not a higher score

The presentation should be much improved.

Why not a lower score

N/A

Final Decision

Reject