Multi-Fidelity Active Learning with GFlowNets
We present an algorithm for multi-fidelity active learning with GFlowNets. We demonstrate that it is able to discover diverse, high-scoring samples at lower cost than other relevant baselines.
Abstract
Reviews and Discussion
This paper introduces an algorithm for multi-fidelity active learning with GFlowNets, aimed at discovering diverse, high-scoring candidates in fields like science and engineering. The authors point out that while data generation in these fields is surging, present machine learning techniques struggle to efficiently query an expensive, high-fidelity, black-box target function.
To address this, their algorithm employs a multi-fidelity scheme, combining cheap, approximate evaluations of the target function with expensive, accurate ones. It incorporates GFlowNets, a generative model, to learn a simpler representation of the data for efficient querying. The authors highlight that their method outperforms RL-based alternatives in terms of data efficiency and flexibility.
To evaluate the algorithm, the authors chose molecular discovery tasks relevant to drug discovery and materials science. The results are encouraging: the algorithm identifies a range of diverse, high-scoring candidates using fewer queries than other techniques. The steps of the algorithm are outlined in Algorithm 1 and detailed further in Appendices A and B. The essential experimental details are covered in Section 4, including the data representation and the oracles for the benchmark tasks. Further experimental particulars are given in Appendix C, supporting clarity and reproducibility. Additionally, the authors have made their code publicly available.
Strengths
- Novelty: The proposed algorithm for Multi-Fidelity Active Learning with GFlowNets is a novel approach that addresses the challenge of querying a high-fidelity, black-box objective function in scientific and engineering applications. The use of GFlowNets, a generative flow-based model, to efficiently query the objective function is also a novel contribution.
- Evaluation: The authors evaluate the proposed algorithm on several molecular discovery tasks, including drug discovery and materials science. The evaluation shows promising results, with the algorithm discovering diverse, high-scoring candidates with fewer queries than other methods.
- Reproducibility: The authors provide a detailed procedure of the steps of the algorithm in Algorithm 1, and additional details about the algorithm in Appendices A and B. They provide the most relevant information about the experiments in Section 4, including a description of the data representation and the oracles for each of the benchmark tasks. The rest of the details about the experiments are provided in Appendix C for the sake of better clarity, transparency, and reproducibility. Finally, the authors include the original code of their algorithm and experiments, which has been developed as open source.
- Clarity: The paper is well-written and easy to understand, even for readers who are not experts in the field. The authors provide clear explanations of the concepts and methods used in the paper, and the figures and tables are well-designed and informative.
Weaknesses
One potential weakness of this paper is that the evaluation is limited to molecular discovery tasks, and it is unclear how well the proposed algorithm would perform on other types of scientific and engineering applications. Additionally, while the authors provide a detailed procedure of the steps of the algorithm and additional details about the algorithm in Appendices A and B, some readers may find the paper to be too technical and difficult to follow. Finally, the authors do not provide a detailed discussion of the limitations of their approach or potential future directions for research.
Questions
What is the main challenge in scientific discovery that current machine learning methods cannot efficiently tackle?
How does the proposed algorithm with GFlowNets address the challenge of querying a high fidelity, black-box objective function?
What are the advantages of multi-fidelity active learning with GFlowNets compared to RL-based alternatives?
Dear Reviewer 5Rpa, we appreciate your review of our manuscript. We are glad to read that you have appreciated the novelty of our work, the breadth of the evaluation, the clarity of the manuscript and our efforts regarding the reproducibility of our work. We are happy to also address your comments and questions below.
One potential weakness of this paper is that the evaluation is limited to molecular discovery tasks, and it is unclear how well the proposed algorithm would perform on other types of scientific and engineering applications.
We would like to note that besides the tasks on molecular discovery, we have included experiments with DNA aptamers and antimicrobial peptides, whose data representation and target functions are substantially different from those of the small-molecule tasks. Furthermore, we have also included results on two well-studied synthetic functions (Appendix C.4).
[T]he authors do not provide a detailed discussion of the limitations of their approach or potential future directions for research.
Section 5 Conclusions, Limitations and Future Work includes a discussion of the future directions for research and the limitations of our work.
Questions
What is the main challenge in scientific discovery that current machine learning methods cannot efficiently tackle?
This question is addressed explicitly in the second paragraph of the introduction (“[…] Such scenarios present serious challenges even for the most advanced current machine learning methods”) as well as in the subsequent paragraphs.
How does the proposed algorithm with GFlowNets address the challenge of querying a high fidelity, black-box objective function?
To address the challenge of querying a high fidelity, black-box function efficiently, in this work we have proposed a multi-fidelity active learning algorithm which leverages the availability of additional black-box functions with lower fidelity but much lower costs. This is the central element of our work.
What are the advantages of multi-fidelity active learning with GFlowNets compared to RL-based alternatives?
As discussed in the introduction and reflected in the results of our experiments, RL-based approaches are effective at optimisation but lack diversity in the batch of discovered candidates. The multi-fidelity active learning algorithm we propose learns to sample from the acquisition function instead of optimising it, by means of GFlowNets. This approach achieves both diversity and high scores in the batch of candidates.
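To make this contrast concrete, below is a minimal, purely illustrative sketch (the acquisition function and all values are made up, not our actual implementation): an optimiser collapses onto a single mode of the acquisition landscape, whereas a sampler that draws candidates proportionally to the acquisition value covers all high-scoring modes.

```python
import numpy as np

# Toy acquisition function with two high-value modes (values are made up).
def acquisition(x):
    return np.exp(-(x - 3.0) ** 2) + np.exp(-(x - 7.0) ** 2)

candidates = np.linspace(0.0, 10.0, 1001)
scores = acquisition(candidates)

# RL / BO-style optimisation collapses onto a single argmax ...
best = candidates[np.argmax(scores)]

# ... whereas a GFlowNet-style sampler draws candidates with probability
# proportional to the acquisition value, covering all high-scoring modes.
probs = scores / scores.sum()
batch = np.random.choice(candidates, size=64, p=probs)

print("optimiser picks:", best)
print("sampler covers modes near:", np.unique(np.round(batch)))
```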
This paper designed an Active Learning algorithm to address the challenges of "needle-in-a-haystack" problems in scientific discovery, where the goal is to discover multiple, diverse candidates with high values of the target function, rather than just finding the optimum. The proposed method was evaluated on multiple tasks like DNA and Antimicrobial tasks, and molecular tasks. The experimental results were shown to outperform its single-fidelity counterpart while maintaining diversity, demonstrating its effectiveness in dealing with high-dimensional scientific data.
Strengths
- This work is a good combination of active learning and GFlowNets.
- The experimental results demonstrate the effectiveness of the proposed model in dealing with high-dimensional scientific data.
- Rather than focusing merely on model performance (e.g., accuracy), this work focuses on selecting more diverse samples with high values of the target function.
Weaknesses
- The baselines are quite simple: the comparison is essentially between multi-fidelity active learning and single-fidelity active learning. The authors could consider comparing with other typical query-synthesis active learning methods such as [r1].
- The performance of the proposed method on tasks and domains beyond those covered in this study remains uncertain due to the limitations of the tested benchmark datasets.
[r1] Schumann, R. and Rehbein, I. Active learning via membership query synthesis for semi-supervised sentence classification. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 472-481, 2019.
Questions
Would the selected data samples be out-of-distribution samples? Since the evaluation considers mean top-K score and top-K diversity, it is still possible to select out-of-distribution samples.
Dear Reviewer TJ42, thank you for the review of our submission. We are happy to read that you positively valued the results presented in our manuscript, taking into account both metrics, the average scores and diversity. In your review, you also mention a couple of weaknesses and a question that we are happy to address in what follows.
The baselines are quite simple, just compare between multi-fidelity active learning and single-fidelity active learning.
First, we would like to note that besides the (GFlowNet-based) single-fidelity active learning baseline, we also include a multi-fidelity algorithm which optimises the BO acquisition function via the widely used reinforcement learning method PPO. This method can therefore be regarded as a multi-fidelity Bayesian optimisation method that optimises the acquisition function via reinforcement learning. This is a strong baseline that is actually effective at discovering samples with high value of the target black-box function, as is reflected by the results of our experiments. However, we also observe that it exhibits low diversity in the discovered samples. The algorithm we propose, which uses a multi-fidelity GFlowNet, addresses this limitation, as you acknowledge in your review.
The remaining baselines are designed to help us gain understanding about the novel aspects of our proposed method. Specifically, the contribution of multi-fidelity versus single-fidelity active learning and the advantages of a GFlowNet sampler that selects the fidelity alongside the sample. Moreover, we include a baseline with random samples ranked by the acquisition function, which is known to be a strong baseline in Bayesian optimisation.
The baselines are described in Section 4.2.
The author could consider comparing it with other typical query synthesis active learning methods like [r1].
Unfortunately, it would not be straightforward to adapt the methods in [r1] for our tasks, since the active learning algorithm in [r1] does not consider multiple oracles and the task at hand is classification of sentences, while our tasks are regression problems with substantially different data. Nonetheless, we appreciate the reference as we may be able to derive a suitable baseline from it in future work.
The performance of the proposed method on tasks and domains beyond those covered in this study remains uncertain due to the limitations of the tested benchmarking datasets.
There will always be uncertainty about the performance of any method on tasks and domains on which it has not been tested. Nonetheless, in the case of our experimental setup, we would like to argue that we have offered results in four tasks, using three distinct scientific discovery domains (DNA, antimicrobial peptides and small molecules), plus two additional sets of experiments on well-studied synthetic functions (Appendix C.4). Furthermore, we have performed robustness analyses of the impact of the oracle costs (Appendix E.2), the acquisition size (Appendix E.3) and the size of the final batch (Appendix E.4). In all the experiments, we have found highly consistent results, where the proposed algorithm MF-GFN is able to find high-scoring candidates with less budget than the baselines, while keeping high diversity, unlike the multi-fidelity PPO baseline.
Would the selected data samples be out-of-distribution samples? Since the evaluation considers mean top-K score and top-K diversity, it is still possible to select out-of-distribution samples.
We are not sure of having understood this question correctly, so we would appreciate a clarification. In particular, out-of-distribution with respect to what distribution? In general, the search space in our tasks is vast, therefore the chances are that the selected samples will be out of distribution with respect to the initial data set on which the surrogate is trained.
Dear authors,
I asked the last question because I noticed this paper https://arxiv.org/pdf/2210.12928.pdf
The paper you linked studies an application of GFlowNets for sampling dropout masks in neural networks, which can be useful for building Bayesian models. While GFlowOut shows promising results, it is prohibitively expensive to train in the context of the present paper. We instead rely on deep kernel learning (DKL), which has been studied and applied to such problems extensively and is much more efficient. We leverage GFlowNets only for candidate generation using DKL as surrogate model. When the GFlowNet generates candidates proportional to the acquisition function, these are likely to be out of distribution for the DKL model as they have the highest information gain. This is, however, by design since we re-train the model after the points are acquired.
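For readers unfamiliar with deep kernel learning, the following is a minimal, single-fidelity sketch of the idea (a small MLP feature extractor feeding a GP kernel, written with GPyTorch); it is purely illustrative and simpler than the multi-fidelity surrogate used in the paper, and all names and dimensions are hypothetical.

```python
import torch
import gpytorch

class DeepKernelGP(gpytorch.models.ExactGP):
    """Minimal deep kernel learning surrogate: an MLP maps candidates to a
    feature space on which a standard GP kernel operates."""
    def __init__(self, train_x, train_y, likelihood, in_dim, feat_dim=16):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, feat_dim),
        )
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        z = self.feature_extractor(x)  # learned features, not raw inputs
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

# Toy usage on random data, only to show how the pieces fit together.
train_x, train_y = torch.randn(32, 8), torch.randn(32)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DeepKernelGP(train_x, train_y, likelihood, in_dim=8)
```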
Thank you for your response, I decided to keep my score towards acceptance.
The authors propose to extend GFlowNet-AL (the ICML 2022 paper by M. Jain et al.) to the multi-fidelity setting. While the extension seems straightforward, and the general novelty contribution of the paper is limited, the paper falls short in demonstrating the effectiveness of the multi-fidelity setting on real-world applications. The authors somehow convert the existing experiments of GFlowNet-AL to a multi-fidelity setting with some synthetic simulations.
Strengths
Active learning setting is an important problem and using GFN shows to be effective in this setting.
Weaknesses
There are a couple of weaknesses with this paper:
- The novelty of this paper is very limited in terms of the model development. To cover this limitation, the paper needs to be more robust in terms of its application. However, the experiments do not support this, as they are primarily simulation-based.
- The paper is not well-written in general. 1) The plots in the experiments are densely packed, making them difficult to understand. 2) While the paper includes a lot of basic information about the "importance of new scientific discovery" in the introduction, which is not directly relevant, it lacks a proper description of the multi-fidelity problem. Additionally, GFlowNet and Active Learning in the method section are not closely related, and it would be better to refer to GFlowNet-AL in a preliminary section. Doing so may clarify the paper's novelty.
- The tasks used in the experiments are not based on real-world scenarios, which raises questions about the problem's practical importance. In general, I'm not familiar with multi-fidelity methods, and I didn't find the paper very clear in this respect, neither in the method nor in the experiments.
- The sequences are very short, leading to questions about the method's applicability to longer sequences.
- The method can also be viewed as an ensemble modeling approach, but it's not clear what the main advantage is. In the end, it seems that there is a single, expensive objective, which is the case in most real-world scenarios. And when we approximate it with multiple oracles, why would it be hard to query all of them?
Questions
Could you please elaborate on the main challenge of this model and how you addressed it?
Could you please provide more details about the statement, "cheap online simulations take a few minutes"? What exactly are considered as "cheap simulations" in the context of sequence design?
I would appreciate seeing the benefits of your approach on larger sequences and molecules (e.g. antibodies).
I'd like to see at least one real-world application where you have multiple fidelity levels with varying costs.
It's not clear to me what the term "cost" is referring to. Is it related to validation experiments or the process of querying the black box?
Dear Reviewer Cu6t, thank you for reviewing our submission. The review mentions a number of concerns that we are happy to address below.
The novelty of this paper is very limited in terms of the model development.
The review does not provide many details explaining the assessment of the novelty of our submission, so we will here provide a general overview of the novel contributions of our paper. Our paper is, to our knowledge, the first work to propose a multi-fidelity active learning algorithm using GFlowNets. Single-fidelity active learning with GFlowNets has been explored before at least by Bengio et al. (2021), Jain et al. (2022) and Jain et al. (2023a). The multi-fidelity aspect of the algorithm is far from trivial as it required, among other things, proposing a novel extension of GFlowNets to sample the fidelity alongside the candidate; training a multi-fidelity Bayesian surrogate with multi-fidelity data; incorporating a multi-fidelity acquisition function; and, regarding the evaluation, assessing the contribution of the multi-fidelity aspect on average scores, diversity, the impact of costs, etc.
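To summarise how these pieces fit together, here is a hedged, high-level sketch of one active learning round in the spirit of Algorithm 1. All helper callables (`fit_surrogate`, `acquisition`, `fit_gflownet`) are hypothetical placeholders, and folding the oracle cost directly into the reward is a simplification of the cost-adjusted, multi-fidelity acquisition used in the paper.

```python
def multifidelity_active_learning(
    oracles,        # dict: fidelity index m -> callable oracle f_m(x)
    costs,          # dict: fidelity index m -> query cost of oracle m
    dataset,        # list of (x, m, y) triples observed so far
    total_budget,   # overall query budget
    batch_size,     # candidates acquired per round
    fit_surrogate,  # callable: dataset -> multi-fidelity Bayesian surrogate
    acquisition,    # callable: (surrogate, x, m) -> acquisition value
    fit_gflownet,   # callable: reward fn over (x, m) -> sampler with .sample(n)
):
    """Hedged sketch of a multi-fidelity active-learning loop (not the exact
    Algorithm 1 of the paper; helper callables are hypothetical)."""
    spent = 0.0
    while spent < total_budget:
        surrogate = fit_surrogate(dataset)

        def reward(x, m, surrogate=surrogate):
            # Cost-aware reward: acquisition value per unit of oracle cost.
            return acquisition(surrogate, x, m) / costs[m]

        sampler = fit_gflownet(reward)           # samples (x, m) proportional to reward
        for x, m in sampler.sample(batch_size):
            dataset.append((x, m, oracles[m](x)))  # query the selected oracle
            spent += costs[m]
    return dataset
```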
Moreover, our paper is, also to the best of our knowledge, the first work presenting multi-fidelity active learning results for DNA aptamers, antimicrobial peptides and molecular design. As we discuss in our paper, the reason is that these problems present a number of challenges for existing methods, such as traditional Bayesian optimisation and reinforcement learning.
In the review, you state that “the experiments […] are primarily simulation-based.” We would like to recall that the experiments with small molecules (Section 4.3.3) are not simulations. In these tasks, we use the semi-empirical quantum chemistry method XTB with various levels of geometry optimisation as oracles. These oracles are used in practically relevant problems by the scientific community. Furthermore, the costs of the oracles are set such that they are proportional to their computational running time, as it would be done in practice. The results in these two tasks (Figure 3) demonstrate the effectiveness of MF-GFN at finding high-scoring, diverse candidates, outperforming the studied baselines, including a multi-fidelity Bayesian optimisation method using PPO as the optimiser of the acquisition function (MF-PPO). These results are consistent with the rest of the results in the paper (DNA, AMP and synthetic functions).
The plots in the experiments are densely packed
While the plots could admittedly be less packed, removing information would likely harm the completeness in the presentation of our results. Do you have a suggestion for an alternative form of presentation?
[The paper] lacks a proper description of the multi-fidelity problem
The multi-fidelity problem is described in Section 3.2 Multi-fidelity Active Learning. The description builds upon the preceding section, which describes the single-fidelity scenario. The multi-fidelity problem and the algorithm are also illustrated in Figure 1, and detailed formally in Algorithm 1. What details are missing, in your opinion, to make the description more proper?
GFlowNet and Active Learning in the method section are not closely related, and it would be better to refer to GFlowNet-AL as a preliminary section.
Section 3.1 Background introduces first the necessary background on GFlowNets and then describes the (single-fidelity) active learning problem, explaining the specific case where a GFlowNet is used as a sampler, as done by Jain et al. (2022) with GFlowNet-AL. We have added a sentence referring explicitly to GFlowNet-AL and the corresponding citation. We have also included the name GFlowNet-AL in the description of the single-fidelity baseline, for further clarification. Do these changes address your comment?
The tasks used in the experiments are not based on real-world scenarios, which raises questions about the problem's practical importance
We refer to our discussion above, answering your comment about novelty and the experiments being “simulation-based”. Furthermore, we would like to note that the kind of experiments included in this paper are similar to those found in the literature (Angermueller et al., 2020, Zhang et al., 2022). Please let us know if this aspect needs further clarification.
References
- Bengio et al. Flow network based generative models for non-iterative diverse candidate generation. NeurIPS 2021.
- Jain et al. Biological sequence design with GFlowNets. ICML 2022.
- Jain et al. Multi-objective GFlowNets. ICML 2023a.
- Angermueller et al. Model-based reinforcement learning for biological sequence design. ICLR 2020.
- Zhang et al. Unifying Likelihood-free Inference with Black-box Optimization.
The sequences are very short
The length of the DNA sequences is indeed short, since this task is used here as a proof of concept. The length of the AMP sequences in our experiments is the typical length of peptides, and matches the length of such sequences in the related literature.
While it is uncertain whether our proposed algorithm would work as well with very long sequences, we would argue that this would not invalidate the method. As a matter of fact, the method—in particular the GFlowNet—can be easily adapted to handle very long sequences, for example by applying mutations to existing sequences instead of generating sequences de novo, as demonstrated in the Multi-Objective GFlowNet work (Jain et al., 2023a).
The method can also be viewed as an ensemble modeling approach.
We are not sure we have understood this concern. We will recall and illustrate the multi-fidelity problem at hand, in case there is a misunderstanding: the goal is to discover new, diverse candidates (e.g. molecules) with high values of a certain property of interest (e.g. the ionisation potential), as measured by our best available method to estimate the property. Importantly, this method (the highest-fidelity oracle) is expensive, so we cannot afford many queries during the exploration of the candidate space. However, we have access to less expensive but less accurate methods (lower-fidelity oracles) to estimate the property. The goal is to design an exploration scheme that makes good use of the available oracles.
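As a toy illustration of this setting (all functions and costs below are made up for the sake of the example), the oracles estimate the same property at different accuracies and query costs, and the exploration scheme must decide, per candidate, which oracle is worth its cost:

```python
import numpy as np

def oracle_high(x):                  # most accurate estimate, expensive to query
    return np.sin(3 * x) + 0.5 * x

def oracle_low(x):                   # cheap but biased approximation of the same property
    return oracle_high(x) + 0.3 * np.cos(10 * x)

oracles = {0: oracle_low, 1: oracle_high}
costs = {0: 1.0, 1: 50.0}            # querying the accurate oracle costs far more
```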
Could you please elaborate on the main challenge of this model and how you addressed it?
We have described the challenges addressed by our proposed algorithm in the introduction and throughout the paper. For example, see the second paragraph of the introduction, which mentions the challenge of “exploring combinatorially large, structured and high-dimensional spaces”. The rest of the introduction goes on to describe the current challenges. Section 2 reviews the related literature and mentions some of the open challenges. Section 3 presents our proposed method and how it addresses these challenges. Section 4 provides experimental results demonstrating the effectiveness of the method at addressing the challenges.
Could you please provide more details about the statement, "cheap online simulations take a few minutes"? What exactly are considered as "cheap simulations" in the context of sequence design?
By cheap simulations, we refer to computational oracles which can approximate some properties of interest. For instance, standard software such as FoldX and recent developments such as AlphaFold can provide protein structures which can be used to compute properties such as stability or surface energy which can be used to guide the design process (Stanton et al., 2022).
I would appreciate seeing the benefits of your approach on larger sequences and molecules (e.g. antibodies).
This paper is the first demonstration of multi-fidelity active learning with GFlowNets, and its application to more complex problems such as antibody design is a likely extension in future work.
I'd like to see at least one real-world application where you have multiple fidelity levels with varying costs.
Please see the results with small molecules (Section 4.3.3).
It's not clear to me what the term "cost" is referring to. Is it related to validation experiments or the process of querying the black box?
It refers to the cost of querying each oracle (black-box function). In the small molecules task, for example, the costs are proportional to the computational time needed to evaluate one molecule with each of the three oracles.
References
- Jain et al. Multi-objective GFlowNets. ICML, 2023a
- Stanton et al. Accelerating Bayesian optimization for biological sequence design with denoising autoencoders. ICML 2022.
Thank the authors for their response! While the responses addressed some of my concerns, I still would like to further discuss the practical advantages of the proposed method.
Before delving into the detailed discussion, I would like to point out that: 1) I'm not familiar with the Multi-fidelity topic, and 2) based on my understanding, the main goal of the paper is the application of GFlowNet-AL in a multi-fidelity setting. Considering these two points, I believe the paper couldn't convince me that the combination of AL and multi-fidelity is actually a need in practice. In the related works, the authors only mentioned, "interestingly, the literature on multi-fidelity active learning (Li et al., 2022a) is scarcer"; however, they didn't mention why the topic is scarce (while there are more works on the multi-fidelity problem in general) and what the advantages and disadvantages of the previous method are and how the proposed GFlowNet-based method could address them.
Re: Novelty: In my opinion, the primary advantage of the paper is extending/applying GFlowNet-AL for the multi-fidelity problem. However, the novelty of the paper from a theoretical perspective is somewhat limited. Although I agree with the authors that it is not trivial, the main claim of this paper is "the first work to propose a multi-fidelity active learning algorithm using GFlowNets." Based on this, I think the paper should be much more mature/solid with respect to the experiments. Specifically, the authors should include related works on multi-fidelity active learning problems, state their drawbacks, and explain how they addressed them. In the experiments, the authors need to demonstrate that what they are achieving with the proposed method is practically sound. More specifically, they need to point out the computational complexity of each oracle and mention how much they could save. In an active learning setting, when we have costly and time-consuming wet-lab oracles, active learning makes total sense. Here, the authors state that the computational complexity of different oracles (which are black-box) is different but fail to highlight how much they could save in computational costs and how much complexity their method adds.
Even in section 4.3.3, the authors did not convincingly demonstrate the practical usefulness of their approach. Including such a study would make the contribution clear. In my point of view, one of the simplest baselines could involve using GFlowNet-AL on each fidelity separately and observing the advantages.
Re multi-fidelity problem: That section is your description of multi-fidelity for active learning. For someone like me, adding a background on multi-fidelity would be greatly appreciated.
Re larger molecules: Something I'm not sure about regarding the proposed method is how costly the GFlowNet itself is and how much it can help to reduce the cost of querying the oracles. To me, the main issue is that a computational analysis is missing!
Re Ensemble modeling: One can frame the problem as querying all the oracles and ensembling the results. The question then is: What would be the computational benefit of the proposed method?
I will await the authors' response, engage in discussion with the other reviewers, and adjust my score accordingly.
Dear Reviewer Cu6t, thank you for reading our response and engaging in the discussion.
We understand that you are not familiar with multi-fidelity methods, precisely because the literature is still scarce, as we point out in the paper. The reason, in our opinion, is that extending active learning or Bayesian optimisation to a multi-fidelity setting involves multiple challenges, namely training a multi-fidelity surrogate, selecting a suitable multi-fidelity acquisition function, and designing a method that is able to select both the candidates and the oracle. This is especially the case in the kind of scientific problems we tackle, involving structured, high-dimensional data like biological sequences and molecules. While the following can only be speculation, a likely reason why several multi-fidelity active search papers have appeared in the literature is that the setting is comparably simpler, that is, search in a binary-class setting.
Aware that multi-fidelity methods are not familiar to many, the central part of our paper, Section 3, provides an introduction to multi-fidelity active learning (3.2), preceded by the necessary background on single-fidelity active learning and GFlowNets (3.1). We have also included a visual summary of the algorithm (Figure 1) depicting the multi-fidelity aspects, as well as the formal details in Algorithm 1.
The scarcity of work on multi-fidelity active learning methods and, to our knowledge, the absence of previous work on multi-fidelity active learning for the scientific problems we address in our paper should highlight the novelty of our contribution. Nonetheless, in your last comment about the novelty of our contribution you wrote the following:
I think the paper should be much more mature/solid with respect to the experiments. Specifically, the authors should include related works on multi-fidelity active learning problems
We can emphasise this aspect again: to the best of our knowledge there are no works on multi-fidelity active learning problems like the ones we approach in our paper.
the authors need to demonstrate that what they are achieving with the proposed method is practically sound
Despite the lack of past work against which to compare our method, we have implemented multiple baselines to shed light on both the usefulness of the proposed method and the contribution of the novel aspects (multi-fidelity, GFlowNet variant, etc.). The baselines are described in Section 4.2, but let us focus here on the two strongest baselines:
- Multi-fidelity PPO (MF-PPO): Instantiation of multi-fidelity Bayesian optimisation where the acquisition function is optimised using reinforcement learning, specifically proximal policy optimisation (PPO).
- GFlowNet-AL with the highest fidelity
As shown in the results presented in Section 4 and in the multiple appendices, our proposed method MF-GFN systematically outperforms GFlowNet-AL in terms of sample efficiency. That is, MF-GFN discovers candidates with high scores by using a smaller budget. Furthermore, MF-GFN also systematically outperforms MF-PPO regarding the diversity of the candidates and in most cases regarding sample efficiency too.
In your most recent comments, you also point out that we "need to point out the computational complexity of each oracle and mention how much they could save". The computational complexity of each oracle, or rather what we refer to in the paper as its cost, is specified in the corresponding subsections of Section 4. Furthermore, all costs are summarised in Table 1 of the Appendix.
The answer to the question of how much it can be saved differs for each task, but the question is answered in all cases by the results presented in Figures 1-4. For example, quoting the manuscript, in the case of the DNA task, “MF-GFN reaches the best mean top-K energy achieved by its single-fidelity counterpart with just about 25 % of the budget”. In the case of the AMP task, “[MF-GFN reaches the same maximum mean top-K score as the random baselines with 10× less budget and almost 100× less budget than SF-GFN [GFlowNet-AL]”. In the molecule tasks, MF-GFN matches the average top-K scores of the single-fidelity counterpart with about half the computational budget.
Finally, we can answer your comment about ensembles:
One can frame the problem as querying all the oracles and ensembling the results. The question then is: What would be the computational benefit of the proposed method?
Querying all the oracles would arguably result in wasted computational budget for the problem at hand, since given the result from the highest fidelity oracle for a candidate, the results from lower fidelity oracle would be worthless with the current design. Therefore, the computational benefit of the proposed method with respect to the suggested approach would be even larger than with respect to GFlowNet-AL.
Thanks for the reply!
Thank you for your response!
I have already checked the plots and Table 1 in the Appendix. For me, it is hard to find a connection between λ and the running time or real cost. As I mentioned in my initial comment, interpreting the plots is very difficult due to their density. While I observe that λ is proportional, the actual running time or computational cost remains unclear. For example, a 25% improvement in the DNA task raises the question of how this translates to an improvement in running time. The central concern is understanding the extent to which the complexity of your training/inference model has increased in comparison to the original approach, and whether that cost is lower than the cost of multiple fidelities. From my perspective, these questions remain unanswered by the authors.
Regarding related works, I sought to determine if others have addressed this problem previously. I found MAPS and MAPS-SE relevant, aligning closely with my viewpoint of the issue in a real-world context. Could you please clarify the distinctions between multi-fidelity and these approaches? Additionally, I am curious why a comparison with MAPS and MAPS-SE wasn't explored as a potential baseline.
[MAPS and MAPS-SE] "Active Policy Improvement from Multiple Black-box Oracles," ICML 2023.
Dear reviewer, thank you for following up the discussion.
Oracles cost and computational time
It is common in the multi-fidelity literature to use unit-less costs since in practice the cost may refer to time, money, something else or a combination thereof.
In our previous response and in the paper we wrote that on the DNA task “MF-GFN reaches the best mean top-K energy achieved by its single-fidelity counterpart with just about 25 % of the budget”. Please note that this is different to “a 25% improvement” (from your last answer).
We can illustrate the correspondence in computational time by looking at the small molecules tasks (Section 4.3.3), since the costs assigned to the oracles are directly proportional to the computational time as measured in our own experiments. We assigned cost 1 to the lowest-fidelity oracle, which has an average running time of 0.25 s per molecule. Thus:
- Oracle 1:
- Oracle 2:
- Oracle 3:
- Total budget in molecules tasks:
Let’s take Figure 3b (molecules electron affinity task) as an example. The plot contains five curves, each corresponding to an algorithm. The X axis is the fraction of the total budget (from 0 to 1). Thus, in this case, in time: from 0 to 262.5 seconds. The Y axis is the average score of the top 100 candidates. The green curve corresponds to SF-GFN (single fidelity active learning with GFlowNet, akin to GFlowNet-AL). After using all the budget, the average score is 4. The blue curve corresponds to our proposed method, MF-GFN. It reaches an average of 4 (the best score by SF-GFN) with less than 40 % of the budget, that is with less than 106 seconds.
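Purely to make the unit conversion explicit, the arithmetic behind the figures quoted above is the following (the 262.5 s total and the ~40 % fraction are the values given in this thread):

```python
total_budget_seconds = 262.5      # full budget of Figure 3b expressed as compute time
sf_gfn_budget = 1.00              # SF-GFN needs the whole budget to reach a mean top-K of ~4
mf_gfn_budget = 0.40              # MF-GFN reaches the same score within ~40 % of the budget

print(sf_gfn_budget * total_budget_seconds)   # 262.5 s
print(mf_gfn_budget * total_budget_seconds)   # 105.0 s, i.e. under the ~106 s quoted above
```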
We hope that this sheds light on your questions about the computational time. We would like to note that in many scientific discovery problems like the example used in the introduction, the difference in cost (computational time) between oracles can be of several orders of magnitude. For example, the highest fidelity oracle may be DFT, which can take many hours or days to evaluate one candidate, while lower fidelity oracles may take a few seconds. This situation makes the development of multi-fidelity active learning methods an extremely promising avenue.
MAPS and MAPS-SE
Regarding related works, I sought to determine if others have addressed this problem previously. I found MAPS and MAPS-SE relevant, aligning closely with my viewpoint of the issue in a real-world context. Could you please clarify the distinctions between multi-fidelity and these approaches? Additionally, I am curious why a comparison with MAPS and MAPS-SE wasn't explored as a potential baseline.
Thanks for pointing to the paper. We were unaware of it, as it only appeared on arXiv 3 months before the deadline. There are several major differences between the problem that MAPS and MAPS-SE are trying to solve and our setting. Specifically, they study the problem of imitation learning given a set of oracle policies, where the goal is to select which policies from this set to imitate. The task is still to learn a single policy to be used during inference based on the oracle policies, which maximises a single fixed reward function. On the contrary, the problem we study is one where there are multiple reward functions, each of which provides an approximation of a different fidelity and has a different cost associated with it. We do not have access to oracle policies in our setup. Moreover, the goal in the scientific discovery problems we study is to discover diverse, high-reward candidates, rather than the problem of learning an optimal policy as in MAPS and MAPS-SE. In summary, MAPS and MAPS-SE solve a considerably different problem than the one we solve in our work, and as such they are not applicable to our tasks.
Finally, we would also like to re-emphasize that we disagree with the characterization of the tasks we study as being "not real-world". There have been a number of studies highlighting the importance of these problems from a scientific perspective (Murray et al., 2022, Wang et al., 2023) as well as a practical perspective (https://www.nature.com/articles/d43747-022-00104-7).
- Wang et al., Scientific discovery in the age of artificial intelligence. Nature 2023.
- Murray et al., Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 2022.
Thanks for the clarification! I will raise my score!
One question which still is not clear to me: You achieved the same results with 25% of the budget. Therefore, you reduced the cost from 262.5s to 106s. Could you please let me know how much additional computational cost your model added compared to the simpler model during training and/or inference? The cost you are referring to is the cost of oracles, and I can see that you can improve it (although, to me, it is not impressive because it still is nothing compared to wet-lab costs.). However, based on my understanding, your model is more computationally expensive during training/inference as well. So, could you add that to the account and point out the savings?
Thank you for considering raising the score!
The main differences between the single-fidelity active learning algorithm (SF-GFN) and our proposed multi-fidelity method (MF-GFN) are the following:
1. Oracles: MF-GFN can query different oracles.
2. Surrogate model training.
3. Acquisition function.
4. GFlowNet training and sampling.
Out of these four components, in practical applications of an active learning algorithm, querying the oracles (1) will most often dominate the total computational time of one active learning round. Therefore, if a multi-fidelity active learning algorithm reduces the number of queries to the more costly oracles, it will most likely reduce the overall time as well.
The remaining three components are admittedly more costly in the multi-fidelity setting. Unfortunately, we do not currently have the exact timings of these components. However, we can argue that the overhead should be small comparatively, since the main changes are the following:
- The multi-fidelity surrogate model includes an additional linear downsampling kernel to model the fidelity.
- The evaluation of the acquisition function depends on the inference time of the surrogate model.
- The multi-fidelity GFlowNet includes an additional step in the trajectories to sample the fidelity index (a minimal sketch of this step follows below).
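To make the last point concrete, here is a toy, purely illustrative sketch of such a trajectory over sequences: the candidate is built token by token as in a standard GFlowNet, and one extra action at the end selects the fidelity index of the oracle to query. The alphabet, policies and fidelity indices below are hypothetical, not the paper's implementation.

```python
import random

VOCAB = ["A", "C", "G", "T"]          # toy sequence alphabet
FIDELITIES = [0, 1, 2]                # oracle indices, lowest to highest fidelity

def sample_trajectory(token_policy, fidelity_policy, max_len=8):
    """Build a candidate step by step, then sample its fidelity as a final action."""
    seq = []
    while len(seq) < max_len:
        a = token_policy(seq)          # forward-policy action over tokens (or stop)
        if a == "EOS":
            break
        seq.append(a)
    m = fidelity_policy(seq)           # the extra multi-fidelity step
    return "".join(seq), m

# Toy uniform policies, only to make the sketch runnable.
token_policy = lambda s: random.choice(VOCAB + ["EOS"])
fidelity_policy = lambda s: random.choice(FIDELITIES)

print(sample_trajectory(token_policy, fidelity_policy))
```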
We hope that this adds more clarity, despite not being able to provide exact timings.
This paper describes a multi-fidelity optimization algorithm which uses GFlowNets to optimize the acquisition function of a multi-fidelity deep-kernel Gaussian process. Experimentally the proposed algorithm seems to outperform a number of GFlowNet baselines.
Strengths
I don't think my analysis of this paper fits nicely into the strengths/weaknesses format requested in the review form, so I will first give general feedback and then describe strengths and weaknesses.
I think the paper proposes an interesting method, but in general seems a bit... distorted. In my opinion it gives too much detail on unimportant bits (context of drug discovery) and not enough detail on the important bits (e.g. the actual method). The main emphasis is on GFlowNets, which actually don't seem to be the most important part of the method: I think a more appropriate title for the paper would be "multi-fidelity Bayesian optimization with deep kernel Gaussian processes". To me, the most important question for the paper to answer is "does the method work, and if so why." I didn't feel like the paper came even close to answering this question.
Strengths of the paper:
- The writing was clear (at least what the authors chose to write about was clear, although I think they chose to write about the wrong things in the paper).
- The proposed method is sensible and arguably has some novelty (although the exact degree of novelty was unclear)
- Experiments were fairly comprehensive (at least for the questions they investigated, which in my opinion were the wrong questions)
Weaknesses
- Related work: the paper touches on a bunch of topics which are already very well-researched: multi-fidelity optimization (e.g. [1-3]), GFlowNets [4-7], and diverse optimization (e.g. [8-10]). I did not think the paper contextualized its contributions well.
- I think BO was dismissed too hastily by claiming that BO only cares about finding a single optimum. First, this is not true: see [9,10]. Second, diverse solutions may be found incidentally when performing single-objective BO via its exploration, so it is inappropriate to dismiss BO even if diversity is not built into the objective. In fact, the method proposed in this paper also appears to have no built-in diversity objective: it just relies on GFlowNets incidentally generating a diverse set of points.
- Active search and quality-diversity optimization are mentioned in section 2, but never again. Does the proposed method have any advantages over these approaches? Would these not be sensible baselines for experimental comparison?
- The authors mention the existence of a lot of prior literature on multi-fidelity optimization, but dismiss it by saying "the literature is still scarce probably because most approaches cannot tackle the specifics of scientific discovery, such as the need for diverse samples." This is pure speculation, and I highly doubt that it is true. One cannot conclude that existing methods are insufficient just because you don't see many papers on them!
- In the past 2 years there has been a huge flood of papers about GFlowNets. The relationship between this paper and all the other papers is not clearly stated. [4,6] seem particularly relevant. The authors should clearly state the novelty (if any) from the existing GFlowNet literature.
- Surrogate model/acquisition functions: the main emphasis of the paper is on the GFlowNets, but the method also critically relies on a GP surrogate model and an acquisition function. In my experience, these choices are also incredibly important, but they are not really explored or discussed much in this work. For example, I am aware that the deep kernel GPs used by the authors as a surrogate model are very prone to overfitting [11]. Is this an issue? The training of the surrogate model is not really discussed in the text when presumably it is very important!
- Flawed metrics: the authors follow previous works and examine the scores and diversity of the top K outputs. I think this is a flawed metric which doesn't reflect how these models will be used in practice, which is to propose a set of candidate points that will be taken forward to the next stage in screening. This implies extracting a diverse subset from all outputs, not simply looking only at the top K outputs and seeing how diverse they are. I recommend instead that the authors look at a monotonic diversity metric, such as #circles (Xie et al 2022).
- Experiments mainly compare against weak baselines: the experiments contrast the authors' method with some baseline methods, which in my opinion are fairly weak (a few variations of random search, plus PPO, which is probably not sample-efficient). Critically, the authors don't compare against any other sort of BO method. I think the experiments section should really be trying to answer whether this setup has any advantage over a reasonable other BO-like setup (e.g. multi-fidelity GPs with [domain-specific] standard kernels and basic multi-fidelity acquisition functions like expected improvement / cost).
- Origin of diversity unclear: another question which is not addressed theoretically or experimentally is why this method produces more diverse outputs (or whether it even does). Is it the surrogate model? Is it the GFlowNets? This seems like the key claimed advantage of the method, so it feels odd to me that it is not investigated more.
[1] A General Framework for Multi-fidelity Bayesian Optimization with Gaussian Processes
[2] Multi-Fidelity Bayesian Optimization via Deep Neural Networks
[3] Review of multi-fidelity models
[4] Multi-Objective GFlowNets
[5] GFlowNet foundations
[6] Biological sequence design with GFlowNets
[7] GFlowNets for AI-driven scientific discovery
[8] Quality-diversity optimization: a novel branch of stochastic optimization
[9] Discovering Many Diverse Solutions with Bayesian Optimization
[10] Bayesian algorithm execution: Estimating computable properties of black-box functions using mutual information
[11] The promises and pitfalls of deep kernel learning
Questions
Some specific questions are:
- How does this differ from previous work on GFlowNets?
- How does this method differ from a BO method which one could create by taking a model/acquisition model directly from existing papers?
- What are the results of a GP baseline using Matern kernel (toy tasks), string kernel (DNA task), and Tanimoto kernel (molecule tasks) with EI/cost acquisition function and GFlowNets as the acquisition function optimizer?
Also, I have some writing suggestions if you revise the paper:
- The abstract contains ~3 sentences of introduction. I would cut this to ~1 sentence. It's good to keep the abstract short.
- The introduction contains a lot of description of black box optimization in general. While I think this is good, 9 pages is fairly short, and I think the paper does not even have enough space to properly describe the method at the moment. I would cut this down to ~1 paragraph, reference some works which discuss the problem in more detail, and try to keep the introduction to < 1 page.
- I would put the related work after the method. I recommend this slide deck by Simon Peyton Jones (a creator of Haskell) which explains why: https://www.microsoft.com/en-us/research/academic-program/write-great-research-paper/
Dear Reviewer EtrP,
Thank you for your thorough review. We appreciate that you have found our proposed algorithm interesting and sensible, and the overall article well-written and comprehensive. We understand that you disagree with several aspects of our paper, therefore in what follows we will gladly address your comments and concerns one by one.
Related work
A major concern in your review seems to be that our paper did not “contextualized its contributions well”. We would like to first note that out of the 10 citations that you provide as examples of the well-researched topics that our paper touches upon, 7 of them are already cited in our manuscript. Of the remaining 3, citation [3] is a review of multi-fidelity models but it doesn’t cover deep learning, Bayesian optimisation and active learning; citation [10] is a Bayesian optimisation method but it does not seem to tackle the diversity problem; citation [9], in contrast, seems very relevant, as it is a Bayesian optimisation method for finding diverse solutions. Accordingly, we have added a sentence at the end of the second paragraph of Section 2 acknowledging this work.
Bayesian optimisation
BO was dismissed too hastily by claiming that BO only cares about finding a single optimum. First, this is not true: see [9,10].
Our paper states that "Bayesian optimisation and reinforcement learning are designed to find the optimum of the target function" (Section 1). In Section 2, we state that "[t]he main difference between BO and the problem we tackle in this paper is that we are interested in finding multiple, diverse samples with high value of [the objective] and not only the optimum". In our opinion, this is not dismissing Bayesian optimisation but describing its goal and explaining why it is not directly suited for the problems we tackle in our paper. Rather than an opinion or a controversial claim, this is simply the problem statement found in the Bayesian optimisation literature (Frazier, 2018; Garnett, 2023). As a matter of fact, this aspect of BO is mentioned as the motivation in [9], which you cite: "the fact that BO traditionally seeks a single best optimizer may be a significant limitation". From this standpoint, [9] (AISTATS, 2023) proposes a variant of Bayesian optimisation to discover diverse solutions.
Our paper could similarly be regarded as a BO variant: Not only do we not dismiss Bayesian optimisation, but rather our method strongly builds upon it. Specifically, our algorithm relies on a Bayesian surrogate (DKL) to model the data and a BO acquisition function (multi-fidelity max-value entropy search) for the exploration of the search space—the two key ingredients of BO. Note that the manuscript makes this connection explicitly:
Optionally, we can instead train a probabilistic surrogate and use as reward the output of an acquisition function that considers the epistemic uncertainty of the surrogate model, as typically done in Bayesian optimisation (Jain et al., 2022).
A crucial difference between our algorithm and traditional BO is that instead of optimising the acquisition function, GFlowNets learn to sample from it, which results in enhanced diversity (see below). We have added a sentence in the Active Learning section of 3.1 to further clarify this idea.
You mention that BO may also find diverse solutions “incidentally”. While we agree that this may be the case, our goal is to find diverse solutions rather systematically, since diversity is an important objective in certain scientific discovery applications, as it is also discussed in [9].
the method proposed in this paper also appears to have no built-in diversity objective: it just relies on GFlowNets incidentally generating a diverse set of points.
We respectfully disagree with this view. GFlowNets do have built-in diversity in their objective and diversity is at the core of the method, as is discussed in most "GFlowNets papers", including the article where the method was introduced (Bengio et al., 2021). GFlowNets achieve diversity by learning to sample proportionally to the reward distribution. Therefore, diversity is not incidental, but systematic. The importance of diversity and its connection with GFlowNets is discussed throughout our paper, and the technical mechanisms in particular are described in Section 3.1. We also note that the ability of GFlowNets to systematically sample diverse solutions has been established in a number of prior works including but not limited to the following: Malkin et al., 2022, Zhang et al., 2023, Jain et al., 2023a,b.
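To illustrate the mechanism, below is a minimal sketch of the trajectory balance objective of Malkin et al. (2022), which is one of the losses through which a GFlowNet learns to sample terminal states with probability proportional to the reward (the values in the toy call are made up and only show the quantities involved, not our training code):

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Trajectory balance objective: at its optimum, the sampler draws terminal
    states with probability proportional to the reward, which is what yields
    diverse rather than mode-collapsed batches. log_pf / log_pb are the summed
    log forward / backward policy probabilities along one trajectory; log_Z is
    a learned scalar estimating the log partition function."""
    return (log_Z + log_pf - log_reward - log_pb) ** 2

# Toy call with made-up values, just to show the shapes involved.
loss = trajectory_balance_loss(
    log_Z=torch.tensor(0.0, requires_grad=True),
    log_pf=torch.tensor(-3.2), log_pb=torch.tensor(-2.9),
    log_reward=torch.tensor(1.5),
)
loss.backward()
```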
Active search and quality-diversity
Active search is mentioned in the related work section because there is a connection with our method. However, the problem addressed by active search is to discover samples belonging to one rare class, as opposed to the common class. In contrast, our paper considers the more general problem where multiple oracles are available that output a continuous score for the input candidate.
Quality-diversity is a class of algorithms that tackles a very similar problem to ours. However, to the best of our knowledge, there is no prior work using quality-diversity algorithms in a multi-fidelity setting for biological sequences or molecular design. Implementing such an algorithm would be a significant contribution in and of itself, in our opinion.
Scarce literature
In your review, you state that our observation that the literature on multi-fidelity methods for the kind of scientific discovery problems we tackle is still scarce is "pure speculation". We would like to remark that this observation is the outcome of a systematic review of the literature in an area where we carry out active research. If our observation is wrong and there is actually a rich literature on multi-fidelity methods for finding diverse candidates in combinatorially large, structured and high-dimensional spaces, we would be very grateful to be pointed to it.
Novelty with respect to the literature on GFlowNets
The closest method to ours in the GFlowNet literature, to the best of our knowledge, is the work by Jain et al. (2022), which provided a set of results on the problem of biological sequence design using an active learning algorithm with GFlowNets. Other works, for example by Bengio et al. (2021) and by Jain et al. (2023), have also used GFlowNets in active learning algorithms. However, to our knowledge, no previous work has extended GFlowNets for multi-fidelity active learning. This is the main novel difference of our work with respect to other GFlowNets papers, which we highlighted repeatedly throughout the paper (Sections 1, 2, 3.2, 3.3…). We would gladly welcome suggestions about how to better clarify this point.
Surrogate model/acquisition functions
We agree that exploring the influence of the choice of different surrogate models and acquisition functions would be very interesting. However, a multi-fidelity active learning algorithm consists of multiple components and it would not be feasible to study all of them in detail. As discussed in the paper, our choices were motivated by the observation of what has worked successfully in previous work. We are aware that these choices may be suboptimal, for example deep kernel GPs may be prone to overfitting as you note, but given that we still obtain good results with our proposed algorithm, we may regard this as an indication of robustness. In our opinion, it is reasonable to leave the exploration of different choices of surrogate models and acquisition functions for future work.
Metrics
In your review, you assess that our choice of metrics is "flawed". This choice is based on prior work, for example by Jain et al. (2022), and is also motivated by the needs of potential practical applications. You write that "[t]his implies extracting a diverse subset from all outputs, not simply looking only at the top K outputs and seeing how diverse they are." We respectfully disagree, for the following reasons: First, the notion of "all outputs" is ill-defined: how many is "all"? Second, the desiderata in our problem (see Section 4) is to find a batch of candidates with both high scores and high diversity. Therefore, assuming that "all" refers to the entire collection of candidates gathered across active learning iterations, it would not make sense to evaluate the entire set, including the low-scoring candidates found in the first iterations. Instead, it is reasonable to select the best candidates according to the highest-fidelity oracle. That said, the choice of K in our experiments may be seen as arbitrary, hence we performed an ablation study of this choice (Appendix E.4) and found that the results are consistent.
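For concreteness, the two metrics can be summarised with the small sketch below: the mean score of the K highest-scoring candidates, and the average pairwise distance within that top-K batch. The Hamming-like distance and the toy data are hypothetical stand-ins for the task-specific distances used in the paper.

```python
import numpy as np
from itertools import combinations

def mean_topk_score(scores, k):
    """Mean score of the K highest-scoring candidates."""
    return float(np.mean(sorted(scores, reverse=True)[:k]))

def topk_diversity(candidates, scores, k, distance):
    """Average pairwise distance within the top-K batch (higher = more diverse)."""
    topk = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:k]]
    return float(np.mean([distance(a, b) for a, b in combinations(topk, 2)]))

# Toy usage with a Hamming-like distance over equal-length strings.
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
cands = ["ACGT", "ACGA", "TTTT", "GGGA"]
scores = [0.9, 0.8, 0.7, 0.2]
print(mean_topk_score(scores, 3), topk_diversity(cands, scores, 3, hamming))
```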
Baselines
Your review assesses that our baselines are "fairly weak". First, we would like to recall here, as is also done in the paper, the reasons behind the choice of baselines. We are not aware of prior work proposing multi-fidelity methods for designing DNA, AMP or molecules. Therefore, we could not directly compare our algorithm with existing work. The GFlowNet-based baselines that we included are designed to help us gain understanding about the novel aspects of our proposed method. Specifically, the contribution of multi-fidelity versus single-fidelity active learning and the advantages of a GFlowNet sampler that selects the fidelity alongside the sample. Moreover, we include a baseline with random samples ranked by the acquisition function, which is known to be a strong baseline in Bayesian optimisation.
The review also states that we “don't compare against any other sort of BO method”. We would like to argue that this observation is not accurate: besides the GFlowNet-based baselines, we also include a multi-fidelity algorithm that optimises the BO acquisition function via the widely used reinforcement learning method PPO. Bayesian optimisation algorithms consist of the following components: a Bayesian model fit on the available data (such as our DKL surrogate model), an acquisition function, and a method that optimises the acquisition function. Under this view, our PPO baseline is a multi-fidelity Bayesian optimisation method that optimises the acquisition function via reinforcement learning. This is made explicit in the submitted manuscript:
Multi-fidelity PPO (MF-PPO) Instantiation of multi-fidelity Bayesian optimisation where the acquisition function is optimised using proximal policy optimisation (PPO).
As discussed above and in the paper, this RL-based BO approach is effective at discovering samples with high values of the target black-box function, as reflected in our experimental results. However, it also exhibits low diversity in the discovered samples. The algorithm we propose, which uses a multi-fidelity GFlowNet, addresses this limitation.
The baselines are described in Section 4.2.
Origin of diversity
According to your review, the origin of diversity is unclear or even in question. First of all, our results, presented in Figures 2 and 3 as well as in the appendix (Figures 8 and 9), provide consistent empirical evidence that MF-GFN, as well as the other GFlowNet-based baselines, discovers candidates with high diversity, as measured by the average pairwise distance of the final batch. As a matter of fact, the diversity of these methods is on par with the Random baseline, which is intrinsically diverse. Second, the results also consistently show that the diversity of MF-GFN and the GFlowNet-based baselines is substantially higher than that of the PPO baseline.
This sheds light on the questions you ask in the review about the origin of diversity (“Is it the surrogate model? Is it the GFlowNets?”). If it were due to the surrogate model, the PPO baseline would obtain equally diverse samples, since the only difference from the other methods is the candidate sampler. However, this is not the case, which is a strong indication that the high diversity is due to the use of the GFlowNet.
This is not surprising, since it is consistent with the theory of GFlowNets (Bengio et al., 2021; Bengio et al., 2023) as well as the accumulated empirical evidence. Specifically, diversity arises from learning to sample proportionally to the reward function. We refer to Section 3.1 and the referenced literature for further details.
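As a toy illustration of this point (our own sketch, not an experiment from the paper), consider a discrete one-dimensional space with a multimodal reward: sampling proportionally to the reward spreads a batch across the modes, whereas pure reward maximisation concentrates it on a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multimodal reward over a discrete 1-D space: three separated modes.
x = np.arange(100)
reward = (
    np.exp(-0.5 * ((x - 15) / 3) ** 2)
    + np.exp(-0.5 * ((x - 50) / 3) ** 2)
    + np.exp(-0.5 * ((x - 85) / 3) ** 2)
)

# GFlowNet-style objective: sample proportionally to the reward.
p = reward / reward.sum()
proportional_batch = rng.choice(x, size=16, p=p)

# Pure maximisation: every pick concentrates on the argmax.
greedy_batch = np.full(16, x[np.argmax(reward)])


def mean_pairwise_distance(batch):
    """Average absolute pairwise distance within a batch."""
    return float(np.abs(batch[:, None] - batch[None, :]).mean())


print("proportional sampling diversity:", mean_pairwise_distance(proportional_batch))
print("greedy maximisation diversity:  ", mean_pairwise_distance(greedy_batch))
```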
Questions
How does this differ from previous work on GFlowNets?
To our knowledge, ours is the first work to propose a multi-fidelity active learning algorithm using GFlowNets. Single-fidelity active learning with GFlowNets has been explored at least by Bengio et al. (2021), Jain et al. (2022) and Jain et al. (2023a). The multi-fidelity component is far from trivial, as it involved, among other things: a novel extension of the GFlowNet so as to sample the fidelity alongside the candidate; training a multi-fidelity Bayesian surrogate on multi-fidelity data; incorporating a multi-fidelity acquisition function; and, regarding the evaluation, assessing the contribution of the multi-fidelity aspect to average scores, diversity, cost efficiency, etc.
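To illustrate the first point, below is a minimal sketch of the extended sample space, in which the object generated by the GFlowNet is the pair (candidate, fidelity). The class and method names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MFTrajectory:
    """A GFlowNet trajectory that builds a candidate and then picks a fidelity."""

    tokens: List[str] = field(default_factory=list)  # partial candidate, e.g. a sequence
    fidelity: Optional[int] = None                   # chosen as the final action

    def add_token(self, token: str) -> None:
        assert self.fidelity is None, "candidate is already complete"
        self.tokens.append(token)

    def set_fidelity(self, m: int) -> None:
        # Terminal action: the sampled object is the pair (candidate, fidelity).
        self.fidelity = m


traj = MFTrajectory()
for t in ["A", "C", "G", "T"]:
    traj.add_token(t)
traj.set_fidelity(1)  # e.g. 0 = cheap approximate oracle, 1 = expensive high-fidelity oracle
print("".join(traj.tokens), "at fidelity", traj.fidelity)
```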
How does this method differ from a BO method which one could create by taking a model/acquisition model directly from existing papers?
As discussed above, we are not aware of multi-fidelity Bayesian optimisation methods for the scientific discovery problems studied in our paper. That said, recalling our discussion above, the MF-PPO baseline we implemented is a multi-fidelity BO method. From our evaluation on four tasks, we observe that MF-PPO hardly discovers diverse candidates and is also less efficient than MF-GFN at finding high-scoring candidates in most of the tasks. The lack of diversity is clearest in the results showing mean scores of diverse top-K candidates (Figure 5 in the appendix).
What are the results of a GP baseline using a Matern kernel (toy tasks), a string kernel (DNA task), and a Tanimoto kernel (molecules), with an EI/cost acquisition function and GFlowNets as the acquisition function optimizer?
We do not have results for this specific variant of the algorithm. There are many variations of our algorithm that we could try (changing the kernels, the surrogate architecture, the acquisition function, the GFlowNet policies, etc.), and it would not be feasible to provide a comprehensive analysis of all of them alongside the description and analysis of the algorithm. What aspect would this specific variant shed light on?
Additional comments
Finally, we would like to further clarify a few aspects that may have been misunderstood.
The summary of the review starts as follows:
This paper describes a multi-fidelity optimization algorithm which uses GFlowNets to optimize the acquisition function […]
We would like to recall that a crucial, distinctive feature of our algorithm, due to the use of GFlowNets, is that we do not optimise the acquisition function. Instead, the GFlowNet learns to sample proportionally to the acquisition function, which increases the diversity of the samples.
The summary proceeds and ends as follows:
Experimentally the proposed algorithm seems to outperform a number of GFlowNet baselines.
As discussed above, the baselines are not only GFlowNet-based but we also included a multi-fidelity Bayesian optimisation baseline (MF-PPO) where the acquisition function is optimised using proximal policy optimisation (PPO). Our proposed algorithm (MF-GFN) also outperforms this baseline.
The review mentions in passing that PPO is a “fairly weak” baseline because it is not sample efficient. We would like to note that the possible sample inefficiency of PPO is not relevant in our case, since PPO is not trained by querying the oracles but by querying the surrogate model, which we train until convergence with unlimited queries.
References
- Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
- Garnett. Bayesian optimization. Cambridge University Press, 2023.
- Bengio et al. Flow network based generative models for non-iterative diverse candidate generation. NeurIPS, 2021.
- Jain et al. Biological sequence design with GFlowNets. ICML, 2022.
- Jain et al. Multi-objective GFlowNets. ICML, 2023a.
- Jain et al. GFlowNets for AI-driven scientific discovery. Digital Discovery, 2023b.
- Bengio et al. GFlowNet foundations. JMLR, 2023.
- Malkin et al. Trajectory balance: Improved credit assignment in GFlowNets. NeurIPS, 2022.
- Zhang et al. Robust scheduling with GFlowNets. ICLR, 2023.
I have read your detailed response to my questions and concerns: thank you for making a clear case for the significance of your work.
- Related work / significance: thanks for the additional description, I will reconsider. Please note, though, that papers providing this context should be cited!
- Baselines/metrics: no further questions.
- GFlowNet diversity: the mere fact of sampling from a distribution does not imply diversity! I think the GFlowNet papers you cite all make an implicit assumption that the distribution of sampled inputs will be diverse. While this may often be the case in practice, it is easy to produce counterexamples showing that this will not always hold. When GFlowNets are applied to problems, it is in my opinion usually unclear whether the learned distribution is actually diverse, and for this reason I personally consider their diversity to be incidental, although I acknowledge that it is not always incidental.
- Origins of diversity: I'm not sure that you understood my comment here. When comparing two algorithms, there are usually many changes, and it is not clear which factors are important for diversity. This is what makes the origins of diversity unclear. For example, changing the acquisition function optimizer from PPO to GFlowNets simultaneously changes what kind of molecules are produced (influenced perhaps by the model architecture), but also the degree to which the acquisition function is maximized (e.g. the maximum values found may be quite different) and, if I understand correctly, also whether you select inputs by “sampling” proportionally to the acquisition function value or whether you choose greedily.
- Question about Matern/string/Tanimoto GPs: this suggested experiment was intended to disentangle which factors were responsible for the performance. All of your experiments used similar model types (deep kernel GPs) which made this less clear. I do not necessarily expect you to perform the experiment, it was just a suggestion.
Overall I am thinking about your arguments and will continue to think about them during the reviewer-AC discussion phase.
Dear reviewer, thanks for quickly reading our response, especially given that it got a bit long, and for engaging in the discussion. We would like to briefly follow up on the diversity questions.
We agree that sampling proportionally to the reward does not imply diversity of the samples under all circumstances, because the notion of diversity is application-dependent. To what extent is sampling proportionally to the reward a good idea in the scientific discovery problems we tackle? One assumption made explicit in our paper (and others) is that, besides diversity, we are also interested in finding high-reward samples. Another assumption is that the target functions in these problems typically have multiple, well-separated modes. With such target functions, and with the additional goal of discovering high-scoring samples, we argue that sampling proportionally to the reward is a desirable objective for finding diverse modes of the reward.
We also agree that it is not possible to control for every component that may influence the diversity of the samples when replacing an RL optimiser with a GFlowNet sampler. However, we observe that our results are consistent with the literature, in that RL is excellent at optimisation but less so at finding diverse optima, while GFlowNets tend to provide better diversity. Regarding the question of how the batch of inputs for the oracles is selected: first, samples are generated by either the GFlowNet (approximately proportionally to the reward, if well trained) or PPO; then, the top candidates according to the acquisition function are selected.
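A minimal sketch of this selection step, with a toy acquisition function (illustrative code, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)


def select_batch(sampled, acquisition_fn, batch_size):
    """Rank the sampler's outputs by acquisition value and keep the top ones."""
    values = np.array([acquisition_fn(x, m) for (x, m) in sampled])
    order = np.argsort(values)[::-1][:batch_size]
    return [sampled[i] for i in order]


# Toy example: candidates are scalars, fidelity m is 0 (cheap) or 1 (expensive),
# and the acquisition is an assumed value-per-cost trade-off.
sampled = [(rng.uniform(0, 1), int(rng.integers(0, 2))) for _ in range(64)]
cost = {0: 1.0, 1: 10.0}
acquisition = lambda x, m: (x + 0.1 * m) / cost[m]
batch = select_batch(sampled, acquisition, batch_size=8)
print(batch)
```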
The paper introduces a multi-fidelity active learning algorithm utilizing GFlowNets. The proposed method employs GFlowNets together with a deep kernel learning surrogate model to generate candidates. The algorithm is evaluated across several tasks, including molecular discovery, demonstrating its capability to discover diverse, high-scoring candidates with fewer queries than other methods.
Summary of strengths
- The proposed integration of Bayesian optimization with a deep-kernel GP and a GFlowNet is recognized as a sensible method with positive experimental results.
- The paper introduces an innovative multi-fidelity extension to the GFlowNet-AL framework, contributing to the advancement of active learning.
- Demonstrates efficiency in scenarios where oracle querying is time-intensive (e.g. certain wet-lab experiments).
During the discussion phase, Reviewer Cu6t raised the rating to 5. However, there are still reservations among the reviewers about the paper's current readiness for publication, due to the following concerns:
- Concerns that the experiments do not sufficiently demonstrate the method's significance.
- Questions are raised about the novelty of combining widely used methods in machine learning.
- There is a lack of discussion on whether the computational overhead of the proposed method is justified in light of the efficiencies it provides.
While the paper presents a methodologically sound approach with promising experimental results in the active learning domain, its novelty and the choice of experimental baselines are key areas of concern even after seeing the authors' responses. Additionally, there are concerns regarding the computational efficiency and scalability of the proposed method, especially in the context of larger real-world applications. These factors collectively suggest that while the paper has merits, it requires further refinement and more robust experimentation to convincingly argue for its novelty and practical significance.
Why not a higher score
The combination of well-known methods (Bayesian optimization with a deep-kernel GP and a GFlowNet) raises questions about the paper's novelty; the lack of strong experimental baselines limits the significance of the empirical results; and there is no discussion of the computational complexity and efficiency of the proposed method relative to the savings it provides.
Why not a lower score
The paper presents a sound methodological approach, integrating Bayesian optimization with GFlowNet, which is acknowledged as an effective technique. Despite concerns about baselines, the paper shows promising results, especially in scenarios with time-intensive oracle querying. The innovative extension of GFlowNet-AL to a multi-fidelity setting represents a meaningful contribution to the field of active learning.
Reject