PaperHub
Overall: 5.0/10 · Rejected · 5 reviewers (lowest 3, highest 6, std 1.1)
Ratings: 6, 5, 6, 3, 5
Confidence: 3.8 · Correctness: 2.6 · Contribution: 2.2 · Presentation: 2.8
NeurIPS 2024

Robust Guided Diffusion for Offline Black-box Optimization

OpenReview · PDF
Submitted: 2024-05-06 · Updated: 2024-11-06
TL;DR

We propose Robust Guided Diffusion for Offline Black-box Optimization (RGD), melding the advantages of proxy (explicit guidance) and proxy-free diffusion (robustness) for effective conditional generation.

Abstract

Keywords

offline model-based optimization, black-box optimization, diffusion models, score-based SDE, guided diffusion, classifier diffusion guidance, classifier-free diffusion guidance

Reviews and Discussion

Review
Rating: 6

The paper proposes RGD, a novel method for integrating classifier guidance into classifier-free guidance diffusion models for solving offline MBO problems. Experimental results and ablation studies validate that the method outperforms state-of-the-art baselines and that each proposed component is reasonable.

Strengths

  • The idea is intuitive and easy to follow.

  • The motivating example in the introduction makes it easy for the reader to understand the limitations of the prior method and the advantages of the proposed method.

  • Strong experimental results and detailed ablation studies make the proposed method more convincing.

Weaknesses

  • For the diffusion-based proxy refinement part, it seems that several estimations are required to compute the distance between $p_{\phi}(y \vert \hat{x})$ and $p_{\theta}(y \vert \hat{x})$. Furthermore, it incurs an additional hyperparameter $\alpha$, which should be carefully tuned.

Questions

  • For the diffusion-based proxy refinement part, have the authors tried other approaches for estimating $p(y)$? I wonder if the choice of estimator affects the performance significantly.

  • To compute the regularization loss term in Eq. (15), we need to collect samples from the adversarial distribution. I cannot find the detailed procedure for collecting adversarial samples ($M$, $\eta$, ...). Could the authors elaborate more on that part?

Limitations

There are a few minor comments on the manuscript.

  • For Figure 2, it seems that $\tilde{s}(x_T, y, \omega)$ should be written as $\tilde{s}(x_T, y, \hat{\omega})$. Furthermore, at first it confused me that RGD conducts classifier guidance; however, that misleading part was resolved after reading the manuscript.
Author Response

Dear Reviewer,

Thank you very much for your thorough and constructive feedback. Your insights are immensely valuable and provide essential guidance as we seek to enhance the quality and clarity of our manuscript. We truly appreciate the time and effort you have invested in reviewing our work, and we are committed to carefully considering and incorporating your suggestions in our revisions.

Weaknesses:

For the diffusion-based proxy refinement part, it seems that several estimations are required to compute the distance between $p_{\phi}(y \vert \hat{x})$ and $p_{\theta}(y \vert \hat{x})$. Furthermore, it incurs an additional hyperparameter $\alpha$, which should be carefully tuned.

Yes, the diffusion-based proxy refinement involves three computational estimates: $p(x)$, $p(x|y)$, and $p(y)$. We compute $p(x)$ and $p(x|y)$ using the learned SDE, as detailed in https://anonymous.4open.science/r/RGD-7DBB/likelihood.py, and estimate $p(y)$ using Gaussian kernel-density estimation. These methods are aligned with common practices in the field.

Regarding the hyperparameter $\alpha$, it is not manually tuned but is instead optimized based on the validation loss, as detailed in Appendix B.

For the diffusion-based proxy refinement part, have the authors tried other approaches for estimating $p(y)$?

In addition to the Gaussian kernel-density estimation discussed in our paper, we also experimented with Gaussian Mixture Models (GMM) for estimating $p(y)$. The distribution $p(y)$ estimated by GMM was quite similar to that obtained using Gaussian kernel-density estimation. When we incorporated the GMM-based $p(y)$ into the diffusion-based proxy refinement module, the results remained consistent, with a performance of 0.968 using the original method and 0.964 with GMM on the Ant task. This similarity in outcomes underscores the robustness of our estimator choice. We will incorporate this discussion in Section 4.5, Ablation Studies.
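As a concrete illustration of the two density estimators compared here, a minimal sketch with stand-in data (our own illustration using scipy's gaussian_kde and scikit-learn's GaussianMixture, not the paper's exact code) might look as follows:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

y = np.random.randn(1000)  # stand-in for the offline labels y

# Gaussian kernel-density estimate of p(y), as used in the paper
kde = gaussian_kde(y)

# GMM alternative used in this ablation (5 components chosen for illustration)
gmm = GaussianMixture(n_components=5).fit(y.reshape(-1, 1))

y_query = np.linspace(y.min(), y.max(), 200)
p_kde = kde(y_query)                                       # density values
p_gmm = np.exp(gmm.score_samples(y_query.reshape(-1, 1)))  # log-density -> density
```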

To compute the regularization loss term in Eq (15), we need to collect samples from the adversarial distribution. I cannot find the detailed procedure for collecting adversarial samples. Could authors elaborate more on that part?

Thank you for your inquiry about collecting samples from the adversarial distribution. We employ gradient ascent to generate these samples. For a comprehensive explanation of this process, please refer to the "Adversarial Sample Identification" section in the global response.

Limitations

Notation $s(x, y, \hat{\omega})$ and $s(x, y, \omega)$

Thank you for pointing out the notation inconsistency. Strictly speaking, we should use $s(x, y, \hat{\omega})$ instead of $s(x, y, \omega)$. We opted to use $s(x, y, \omega)$ in the paper as the symbol $\hat{\omega}$ had not been introduced at that point.

Furthermore, at first it confused me that RGD conducts classifier guidance; however, that misleading part was resolved after reading the manuscript.

Regarding your initial confusion about classifier guidance in RGD, we appreciate your feedback. To clarify, we will insert the sentence "Our framework is based on proxy-free diffusion" at Line 57 to better communicate this aspect from the outset.

Overall

Does our response resolve your concerns? We value your detailed feedback and look forward to more discussions in the rebuttal phase. Thank you for your contributions.

Comment

Thank you for your detailed feedback; I keep my positive rating. There are some minor comments.

For the diffusion-based proxy refinement part, have the authors tried other approaches for estimating $p(y)$?

As the authors say, ablation studies on the choice of estimator for $p(y)$ strengthen the robustness claim of the proposed method. I also recommend that the authors conduct the ablation study across at least two tasks to support this claim.

Comment

Thank you for your prompt feedback and continued support.

To further assess the robustness of our method to the choice of the $p(y)$ estimator, we have, in addition to the Ant task, conducted experiments on the TFB8 and TFB10 tasks. The performance was 0.974 with the original method and 0.975 with GMM on the TFB8 task, and 0.694 with the original method and 0.692 with GMM on the TFB10 task. These consistent results reinforce the robustness of our estimator choice.

We will ensure that all discussions are meticulously incorporated into the final manuscript, as suggested.

Review
Rating: 5

In this paper, the authors propose to combine both classifier guidance and classifier-free guidance for offline black-box optimization. In addition, the authors propose a proxy refinement procedure that minimizes the KL divergence between the proxy distribution and the diffusion distribution with respect to $y$.

Strengths

  1. The paper is well-written and well-organized.

  2. The paper introduces several refinement procedures to boost the offline optimization performance. The proposed Diffusion-based Proxy Refinement procedure is interesting.

Weaknesses

  1. Technical contribution seems to be incremental

Employing diffusion models for offline black-box optimization is not new. The technical contribution of this paper seems to be incremental. The draft extends the paper "Diffusion Models for Black-Box Optimization" [1]. However, detailed discussions about the relationship between the proposed method and the paper [1] are missing.

[1] Siddarth Krishnamoorthy, Satvik Mashkaria, and Aditya Grover. "Diffusion Models for Black-Box Optimization." ICML 2023.

  2. Some of the technical details are unclear.

(a) In Equation (12), the concrete computation procedure of $p_\theta(\hat{\boldsymbol{x}} \mid y)$ and $p_\theta(\hat{\boldsymbol{x}})$ via the diffusion model is not clear.

(b) The derivation of Equation (10) is not given. It seems that Equation (10) comes from the forward pass of the diffusion model. However, the forward pass (Eqs. 32-33 in [10]) concerns the distribution, while the concrete $\boldsymbol{x}_t$ is constructed via the backward pass with $s_\theta(\boldsymbol{x}_k, k)$ for $k \in \{T, \cdots, t+1\}$. In addition, how to choose $\mu(t)$ and $\sigma(t)$ in Equation (10) is not clear.

  3. The additional proxy training, sample refinement procedure, and proxy refinement procedure increase the computation cost.

The additional proxy training, sample refinement procedure and proxy refinement procedure increase the computation cost. However, the time comparison with baselines is missing.

  4. The additional proxy training, sample refinement procedure, and proxy refinement procedure bring many additional hyperparameters, which may overfit the offline BBO task.

In offline BBO tasks, the offline dataset is provided, and the evaluation is the black-box function value of the generated query at one time. Long-term convergence properties and the exploration/exploitation balance are not considered. As a result, there is a risk of overfitting the evaluation metric for the offline tasks. The paper introduces many additional hyperparameters, which increases the overfitting risk.

Questions

  1. Please provide more discussion about the technical details in Weakness 2.

  2. What is the computation time of the proposed method? Please provide time comparisons with baselines.

  3. Please provide explanations about the overfitting issue.

Limitations

Additional computation cost and overfitting risk may be further limitations besides those already discussed.

Author Response

Dear Reviewer,

We greatly appreciate your insightful feedback. Your suggestions are crucial for enhancing our manuscript, and we are dedicated to meticulously revising our work in accordance with your recommendations.

Weaknesses

Technical contribution seems to be incremental. Employing diffusion models for offline black-box optimization is not new. The draft extends the paper "Diffusion Models for Black-Box Optimization" [1]. However, detailed discussions about the relationship between the proposed method and [1] are missing.

We acknowledge the contributions of DDOM in applying diffusion models to offline model-based optimization. As noted, we briefly discussed the relationship between RGD and DDOM in Lines 349-351. To clarify further, we will add the following discussion:

"DDOM integrates diffusion models into offline model-based optimization without specifically addressing how to extrapolate from an existing offline dataset to obtain high-scoring samples—it relies solely on conditioning on the maximum value found in the static dataset. In contrast, our work introduces a proxy-enhanced sampling module that incorporates explicit guidance into the sampling process, enabling effective extrapolation. Furthermore, we have developed a diffusion-based proxy refinement module that leverages diffusion-specific priors to refine the proxy. This approach represents a novel advancement not previously explored in the literature."

These additions will be integrated immediately following Line 349 to provide a detailed comparison and highlight the novel aspects of our methodology.

Some of the technical details are unclear.

(a) We train a score function and then use the official library for score-based SDEs to compute the likelihood. Details are provided at https://anonymous.4open.science/r/RGD-7DBB/likelihood.py. The conditional probability $p(x|y)$ is calculated by inputting a specific $y$ label, while the unconditional probability $p(x)$ is computed using a zero label.

(b) Equation (10) employs Tweedie's formula, which is used to transform noise into clean data. We will include the citation "Robbins, H. E., 'An empirical Bayes approach to statistics,' in Breakthroughs in Statistics: Foundations and Basic Theory, Springer New York, 1992, pp. 388-394" in the manuscript as a reference for this formula. For the settings of $\mu(t)$ and $\sigma(t)$, we follow Appendix C, "SDEs in the Wild," of the paper "Score-Based Generative Modeling through Stochastic Differential Equations."

The additional proxy training, sample refinement procedure and proxy refinement procedure increase the computation cost. However, the time comparison with baselines is missing.

We have detailed the computational costs in Appendix D. The expenses are justified, considering the performance improvements and the typically high costs of real-world experiments.

The additional proxy training, sample refinement procedure, and proxy refinement procedure bring many additional hyperparameters, which may overfit the offline BBO task. In offline BBO tasks, the offline dataset is provided, and the evaluation is the black-box function value of the generated query at one time. Long-term convergence properties and the exploration/exploitation balance are not considered. As a result, there is a risk of overfitting the evaluation metric for the offline tasks. The paper introduces many additional hyperparameters, which increases the overfitting risk.

We acknowledge the introduction of additional hyperparameters in our approach. However, it is important to note that in the offline BBO tasks we address, access to the oracle black-box function is unavailable, precluding direct exploration-exploitation considerations.

Furthermore, we do not use the black-box oracle function to fine-tune these hyperparameters, thereby avoiding overfitting. For instance, the hyperparameter $\alpha$ is solely adjusted using the validation set included in the offline dataset. This approach ensures that there is no overfitting risk associated with our method.

Questions

See Weaknesses.

Overall

Have we adequately addressed your concerns? We truly appreciate your comprehensive feedback and anticipate further dialogue during the rebuttal phase. Thank you for your insights.

Comment

Thanks for the authors' detailed clarification and responses. Most of my concerns have been addressed.

I still have some concerns regarding the overfitting risk. I acknowledge the different focuses of offline and online black-box optimization, and the explanation of why the authors preclude exploration-exploitation considerations for long-term behavior. However, I am not sure which of the several proposed components is the key one that makes the whole model more robust against overfitting. In addition, what is the key component for achieving a score beyond the maximum in the dataset and outperforming the baselines?

Comment

Dear Reviewer,

Please reply to the rebuttal.

AC.

Comment

Additionally, it's worth noting that Eq. (10) in our submission aligns closely with Eq. (15) from the seminal work DDPM [r3]. In DDPM, they present the equation $x_0 \approx \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_{\boldsymbol{\theta}}(x_t)}{\sqrt{\bar{\alpha}_t}}$, which is derived in a discrete setting.

[r3] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851.

Comment

We appreciate the reviewer’s detailed feedback and are glad that most of the concerns have been addressed. We are uncertain about your use of the term 'overfitting.' We interpret this as the possibility that our proposed designs might overfit the trained proxy. Please let us know if we have misunderstood your concern.

The first key component enhancing our model's robustness is the use of proxy-free diffusion guidance. In this framework, the denoising step cannot be interpreted as a gradient-based adversarial attack, since it does not rely on proxy gradients to directly modify the input design. This concept is thoroughly discussed in the seminal work [A], particularly in Equation (6).

However, proxy-free diffusion guidance alone is insufficient, as it lacks direct guidance from the proxy, limiting its ability to extrapolate effectively. This limitation is illustrated in Figure 1 of our manuscript, titled "Motivation for Explicit Proxy Guidance." To address this, we introduce explicit proxy guidance as the second key component. This component aims to direct the sampling process toward high-property regions. Direct application of proxy gradients to the input space would result in out-of-distribution samples, potentially leading to what might be perceived as overfitting. Therefore, we apply proxy gradients to the scalar strength parameter $\omega$, which modulates both condition and diversity. This approach of optimizing scalar parameters rather than the design itself is explored in the ICML 2023 paper [B], specifically in Section 4.5 on Adaptive-$\gamma$. It demonstrates how supervision signals from the proxy can effectively update scalar hyperparameters, thereby enhancing robustness without directly modifying the input design.

In essence, our model combines (1) the robustness afforded by proxy-free diffusion, which does not rely on proxy gradients, with (2) the targeted guidance from a trained proxy, which influences only the scalar strength parameter $\omega$ to mitigate overfitting risk. These two key components form our proxy-enhanced sampling module, which is crucial for enhancing the model's resilience against overfitting and instrumental in achieving superior outcomes.

Additionally, in our diffusion-based proxy refinement module, we propose utilizing diffusion-derived distribution priors to refine the proxy, an approach not previously explored. This is also an important component. While prior work such as COMs and ROMA employed simple intuitive priors like conservative estimation and smoothness to refine the proxy, our method leverages diffusion-derived distributions, providing a more relevant and impactful signal for refining the proxy, which has proven more effective. Detailed comparisons with COMs and ROMA are given in Appendix E, "Further Ablation Studies."

Have we adequately addressed your concerns? We truly appreciate your comprehensive feedback.

[A] Ho J, Salimans T. Classifier-free diffusion guidance[J]. arXiv preprint arXiv:2207.12598, 2022.
[B] C. Chen. et al. Bidirectional Learning for Offline Model-based Biological Sequence Design. ICML 2023
Comment

Thanks for the authors' detailed response.

I now have a better understanding of how each proposed component works and how they relate to one another. I am not certain about the experiments, but I think the proposed method makes sense. Therefore, I decided to increase my score.

Comment

Thank you for your supportive feedback. We are pleased that our explanations have helped clarify the aspects of our work. We will ensure that these discussions are incorporated into the revised manuscript.

Comment

Dear authors,

I found an issue when I tried to derive Eq. (10) using Tweedie's formula as suggested by the authors.

Note that $X_0 \sim p_0(X_0)$ and $X_t = X_0 + \mathcal{N}(0, \sigma(t)^2 \boldsymbol{I})$ from the forward pass of the diffusion model (corresponding to the VE SDE, Eq. (31) in [r1]).

From Tweedie's formula, we obtain the following equation: $\mathbb{E}[X_0 \mid X_t = x_t] = x_t + \sigma(t)^2 \nabla_{x_t} \log p_t(x_t)$.

It can then be approximated using the trained score function $s_\theta(x_t)$ as $\mathbb{E}[X_0 \mid X_t = x_t] \approx x_t + \sigma(t)^2 s_\theta(x_t)$.

However, Eq. (10) in the submission drops the conditional expectation w.r.t. $X_0$ and directly obtains $x_0 \approx x_t + \sigma(t)^2 s_\theta(x_t)$.

Did I miss anything in deriving Eq. (10)? Could the authors explain more about how Eq. (10) is derived?

[r1] Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. 2021.

Comment

Thank you for your insightful question regarding Eq.(10). We will now derive Eq.(10) step by step to address any concerns.

We follow Eq. (33) from [r1], where $p(x_t \mid x_0) = \mathcal{N}(x_t; \mu(t) x_0, \sigma^2(t) I)$. Given this, we can sample $x_t$ from $x_0$ using $x_t = \mu(t) x_0 + \epsilon \sigma(t)$. To recover $x_0$ from $x_t$, we need to know $\epsilon$, which is approximated as $\epsilon \approx -\sigma(t) \cdot s_{\boldsymbol{\theta}}(x_t)$. Using this approximation, we derive $x_0 = \frac{x_t - \epsilon \sigma(t)}{\mu(t)} \approx \frac{x_t + s_{\boldsymbol{\theta}}(x_t) \cdot \sigma^2(t)}{\mu(t)}$.

This approach originates from [r1], and we utilize the implementation framework detailed in another seminal work [r2]. Specifically, our code, available at https://anonymous.4open.science/r/RGD-7DBB/lib/sdes.py , implements this process as follows:

  • Line 24 implements $\mu(t)$
  • Line 27 implements $\sigma^2(t)$
  • Line 37 describes the sampling process: $x_t = \mu(t) x_0 + \epsilon \sigma(t)$
  • Line 112 optimizes $\epsilon \approx -\sigma(t) \cdot s_{\boldsymbol{\theta}}(x_t)$, where $\epsilon$ is the target, $\sigma(t)$ is the std, and $a$ is $s_{\boldsymbol{\theta}}(x_t)$.
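For readers without access to the repository, a compact illustrative version of this recovery step (our own sketch with assumed signatures for score_fn, mu, and sigma; not the repository code) is:

```python
import torch

def recover_x0(x_t, t, score_fn, mu, sigma):
    """Invert x_t = mu(t) x_0 + sigma(t) eps, using eps ≈ -sigma(t) * s_theta(x_t)."""
    eps_hat = -sigma(t) * score_fn(x_t, t)      # predicted Gaussian noise
    return (x_t - sigma(t) * eps_hat) / mu(t)   # = (x_t + sigma(t)**2 * score) / mu(t)
```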

Apologies for any confusion. Given that most readers and reviewers come from an offline MBO background, these advanced diffusion-model concepts can be challenging. We will add a statement following Line 178, "For a more detailed derivation, please refer to the Appendix," and include the discussion outlined above in the Appendix.

Have we adequately addressed your concerns?

References:

  • [r1] Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. 2021.
  • [r2] Huang C W, Lim J H, Courville A C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 2021, 34: 22863-22876.
Comment

Thanks for the authors' detailed response.

From the authors' new response, I find that the derivation of Eq. (10) does not come from Tweedie's formula suggested by the authors in the previous rebuttal. Tweedie's formula calculates the posterior expectation of $X_0$ given $X_t = x_t$, i.e., $\mathbb{E}[X_0 \mid X_t = x_t]$, rather than a concrete sample $x_0$.

The derivation of Eq. (10) actually proceeds through the reparametrization of $X_t$ and relies on an approximation of the Gaussian noise, $\epsilon \approx -\sigma(t) \cdot s_\theta(x_t)$.

I got the derivation. Thanks again for the authors' detailed response. My concern is well addressed.

Comment

Thank you for your continued engagement and for raising these important points regarding the derivation of Eq.(10).

The reference to Tweedie's formula in our previous rebuttal was intended as a high-level citation, acknowledging the foundational idea of recovering samples from imperfect data. It was not directly used to derive Eq. (10), but rather to highlight the conceptual framework. To enhance understanding, we will include a citation of the seminal work DDPM [r3], where similar concepts are more explicitly detailed.

Regarding the approximation $\epsilon \approx -\sigma(t) \cdot s_{\boldsymbol{\theta}}(x_t)$: the essence of diffusion models is to learn a model that can predict the noise vector, thereby enabling the denoising of samples from pure noise to realistic data. In our specific case, we train the diffusion model $s_{\boldsymbol{\theta}}(x_t)$ to predict $-\frac{\epsilon}{\sigma(t)}$. This is operationalized by optimizing the loss function mentioned in Line 112 of our sdes.py, where $\epsilon$ is the target, $\sigma(t)$ is the std, and $a$ is $s_{\boldsymbol{\theta}}(x_t)$. In essence, starting with $x_0$, we sample a noise vector $\epsilon$ as the target, add the corresponding noise to $x_0$ to generate $x_t$, and then train $s_{\boldsymbol{\theta}}(x_t)$ to closely approximate $-\frac{\epsilon}{\sigma(t)}$.
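As a hedged illustration of this training step (our own sketch with assumed signatures for score_fn, mu, and sigma; not the repository code):

```python
import torch
import torch.nn.functional as F

def score_matching_loss(score_fn, x0, t, mu, sigma):
    """One denoising step of training: regress s_theta(x_t, t) onto -eps / sigma(t)."""
    eps = torch.randn_like(x0)          # sampled noise, used as the target
    x_t = mu(t) * x0 + sigma(t) * eps   # forward perturbation of the clean design
    return F.mse_loss(score_fn(x_t, t), -eps / sigma(t))
```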

Based on the latest feedback, it appears that the derivation is now clear. Thank you again for your thoughtful engagement with our work.

Review
Rating: 6

The paper proposes a framework called Robust Guided Diffusion for the problem of offline black-box optimization. The key idea is to formulate the solution as conditional generation of high-performance designs using a diffusion model that has explicit guidance from a proxy (surrogate) model. This proxy model is also refined/updated via a proxy-free diffusion procedure. Experimental analysis is shown on multiple tasks from the Design-Bench benchmark.

Strengths

  • Overall, I like the paper because it makes two simple changes to an existing approach (DDOM) that show improved performance, and the changes are validated by ablation choices.

Weaknesses

  • One major premise (repeated multiple times in the paper) is that proxy-guided conditional generation is more robust than updating the design with standard gradient ascent on the proxy. However, it is not immediately clear why this should be true, and the justification for this key point is somewhat limited. If true, this would be a much bigger insight going beyond black-box optimization. If it is only about the exploration/exploitation balance driven by $\omega$, we could also give standard gradient ascent this property by optimizing an upper/lower confidence bound on the objective. Please describe why this is the case, either via some empirical experiment or theoretical insight. Also, in Equation (11), we might evaluate the proxy far away from the training data depending on the values of $s_\theta(x_t)$, $\sigma(t)$, and $\mu(t)$.

  • The related-work coverage and the corresponding experimental analysis of the paper can be improved. This problem has seen an extensive body of work recently. Please see the references below and discuss/compare them appropriately. Some of them are included in the references but not compared in the experiments ([1], [2], [3]):

  • [1] Yuan, Ye, et al. "Importance-aware co-teaching for offline model-based optimization." Advances in Neural Information Processing Systems 36 (2023).

  • [2] Kim, Minsu, et al. "Bootstrapped training of score-conditioned generator for offline design of biological sequences." Advances in Neural Information Processing Systems 36 (2023).

  • [3] Nguyen, Tung, Sudhanshu Agrawal, and Aditya Grover. "ExPT: Synthetic pretraining for few-shot experimental design." Advances in Neural Information Processing Systems 36 (2023).

  • [4] Chemingui, Yassine, et al. "Offline model-based optimization via policy-guided gradient search." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 10. 2024.

  • [5] Yao, Michael S., et al. "Generative Adversarial Bayesian Optimization for Surrogate Objectives." arXiv preprint arXiv:2402.06532 (2024).

Questions

Some of the tasks in the Design-Bench benchmark have errors, which makes them less informative for evaluation. For example, the offline dataset in the superconductor task has multiple copies of the same inputs but with different outputs. As a result, the random forest oracle fit on this offline data is not reliable. It is mentioned that "This issue has now been rectified by the development team." How was it fixed?

Limitations

Please see weaknesses section.

Author Response

Dear Reviewer,

Thank you for your thoughtful feedback. We are committed to incorporating your suggestions in our revisions.

Weaknesses

One major premise in the paper is that proxy guidance conditional generation is more robust.

Let us clarify some concepts. Our discussion in Lines 26-30 categorizes offline BBO methods into forward and reverse approaches. (1) Employing standard gradient ascent on a proxy is a forward approach, which encounters the OOD issue due to proxy inaccuracies on unseen designs. (2) Proxy diffusion guidance, a reverse approach, maps high values to high-scoring designs using diffusion steps and proxy gradients, but faces adversarial solutions due to its reliance on proxy gradients. (3) Proxy-free diffusion guidance, another reverse approach, similarly maps high values to high-scoring designs but does not rely on proxy gradients.

The core premise of our paper is that proxy-free diffusion guidance (3) outperforms both proxy gradient ascent (1) and proxy diffusion guidance (2) in robustness due to its independence from explicit proxy gradients on the inputs, mitigating the risk of adversarial manipulation. This significant insight aligns with findings from [A], which argues that proxy-free diffusion guidance surpasses proxy diffusion guidance in robustness: proxy diffusion guidance is akin to a gradient-based adversarial attack, whereas proxy-free diffusion is not, as no proxy gradient enters the diffusion process.

[A] Ho, J. and Salimans, T. Classifier-free diffusion guidance

In Equation (11), we might evaluate the proxy far away from the training data.

Our diffusion-based proxy refinement module addresses this by identifying adversarial samples located beyond the training data. It then refines the proxy by reducing its distance to the diffusion distribution on these outliers. This refinement enhances the proxy's accuracy for samples distant from the training data. We demonstrate the superior effectiveness of this method over COMs and ROMA in Appendix E, "Further Ablation Studies."

Additionally, we optimize only the scalar strength $\omega$, which has been empirically shown to provide greater robustness compared to complete design optimization in Section 4.5, "Ablation Studies," of BIB [B].

[B] Bidirectional learning for offline model-based biological sequence design. ICML 2023.

Related work.

We have incorporated 14 baselines in our study, including recent methods like DDOM, BONET, and BDI. To address your specific points, we conducted additional experiments with [1, 2, 4]. The focus of [2] is biological sequence design, so we specifically compared against [2] on the TF8 and TF10 tasks:

| Method | TF8 | TF10 |
| --- | --- | --- |
| BOOTGEN | 0.970 ± 0.001 | 0.670 ± 0.052 |
| RGD | 0.974 ± 0.003 | 0.694 ± 0.018 |

Additionally, we present results for RGD alongside ICT and PGS:

| Method | Superc | Ant | DKitty | Rosen | TF8 | TF10 | NAS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ICT | 0.505 ± 0.014 | 0.958 ± 0.008 | 0.960 ± 0.025 | 0.778 ± 0.012 | 0.957 ± 0.010 | 0.688 ± 0.020 | 0.665 ± 0.072 |
| PGS | 0.475 ± 0.048 | 0.748 ± 0.049 | 0.948 ± 0.014 | 0.740 ± 0.019 | 0.968 ± 0.019 | 0.693 ± 0.031 | N/A |
| RGD | 0.515 ± 0.011 | 0.968 ± 0.006 | 0.943 ± 0.004 | 0.797 ± 0.011 | 0.974 ± 0.003 | 0.694 ± 0.018 | 0.825 ± 0.063 |

Our results confirm that RGD generally performs better than these methods. We opted not to include [3, 5] in our comparisons because their experiments focus on a few-shot setting, and to avoid overcrowding the experimental section with an excessive number of baselines (already 14 + 3 = 17 in total). Here, "N/A" indicates that we lacked the resources to finish that run in time, but this does not affect our overall conclusion. We will integrate these results into Section 4.4 of our manuscript, and in the related work of the revised manuscript we will discuss [1, 2, 3, 4, 5].

Questions

benchmark errors

The original SuperC task presented two issues: (1) the offline dataset contained multiple instances of the same inputs with different outputs, and (2) the oracle generated inconsistent predictions for identical inputs due to randomness in the code. A similar issue was reported earlier, which we believed was related to the second point; this was rectified by the development team, who removed the randomness in the code.

Regarding the first issue, we consulted with the Design-Bench authors who advised retaining the duplicate entries as they represent distinct observations for the same inputs. To provide clear validation, we conducted further experiments after removing duplicates to reassess our method:

| Method | SuperC |
| --- | --- |
| BO-qEI | 0.362 |
| CMA-ES | 0.380 |
| RL | 0.399 |
| Grad | 0.390 |
| COMs | 0.396 |
| ROMA | 0.407 |
| NEMO | 0.404 |
| IOM | 0.409 |
| BDI | 0.405 |
| CbAS | 0.414 |
| Auto | 0.371 |
| MIN | 0.402 |
| BONET | 0.371 |
| DDOM | 0.404 |
| RGD | 0.410 |

These results demonstrate that our method continues to perform effectively in this adjusted scenario.

We will incorporate these experimental results into the Appendix. Additionally, we will add a sentence after Line 211 of the main text stating: "We removed duplications in SuperC and reran the experiments, with details provided in the Appendix."

Overall

Have we adequately addressed your concerns? We are eager to continue this dialogue during the rebuttal phase. Thank you.

Comment

Thanks for taking the time to respond to my questions. Please see points related to your response below:

  • Regarding "This significant insight aligns with findings from [A], which argues that proxy-free diffusion guidance surpasses proxy diffusion guidance in robustness. [A] Ho, J. and Salimans, T. Classifier-free diffusion guidance."

    Unless I am missing something, the main premise in [A] is to "attain a trade-off between sample quality and diversity similar to that obtained using classifier guidance". I am not sure how the findings of this reference [A] convey that proxy-free diffusion guidance surpasses proxy diffusion guidance in robustness. Moreover, guiding diffusion models to generate images for a specific class is a relatively easier/different problem than guiding towards the optima of a function. In the former, we are interested in sampling any point from a class-conditional distribution, whereas in the latter, we explicitly want to find the optima (a rare sample in the distribution).

  • Thanks for the ablation comparing the diffusion-based proxy refinement module with the COMs/ROMA strategy. Since the ablation shows final evaluation performance on the tasks, it is useful only if the proxy refinement part of the three methods is changed and everything else is kept the same. For example, this requires fixing either gradient ascent or proxy-enhanced sampling for searching/generating candidates for evaluation. Is this the case?

  • Thanks for including the discussion about new related work and fixing the error in superconductor task.

Comment

Dear Reviewer,

Please reply to the rebuttal.

AC.

Comment

Thank you for your willingness to consider our arguments and for your constructive suggestions. We will ensure to add these points and enhance the discussion of related work in the main paper.

Comment

Thank you for your detailed feedback, which provides constructive insights for refining our paper. We will incorporate these points into the revised version.

  • In the paper [A], searching for the term 'adversarial' reveals critical arguments supporting the robustness of proxy-free diffusion guidance over proxy diffusion guidance. The text notes,

Furthermore, because classifier guidance mixes a score estimate with a classifier gradient during sampling, classifier-guided diffusion sampling can be interpreted as attempting to confuse an image classifier with a gradient-based adversarial attack.

as mentioned in the second paragraph of the introduction. Additionally, descriptions related to Equation (6) state,

Eq. (6) has no classifier gradient present, so taking a step in the $\epsilon$ direction cannot be interpreted as a gradient-based adversarial attack on an image classifier.

These points suggest that proxy-free diffusion guidance surpasses proxy diffusion guidance in robustness.

Acknowledging that generating images for a specific class is simpler than guiding towards the optima of a function, the latter scenario necessitates more robust guidance mechanisms. This is where proxy-free guidance excels: directly using proxy diffusion guidance may lead to out-of-distribution issues, whereas proxy-free diffusion guidance provides more robust guidance. We will incorporate these discussions at Line 121 of our manuscript, where we introduce guided diffusion.

  • Thank you for your comment. Yes, in our ablation study, we kept all elements except the proxy refinement part constant across the three methods. We will emphasize this in Line 482 of our manuscript where we perform further ablation studies.

Have we adequately addressed your concerns?

Comment

Thanks for the response. I am not fully convinced by the robustness argument for proxy-free diffusion guidance, but I don't want to nitpick now and would be happy to increase the score towards acceptance. Please add these points to the main paper along with a proper discussion of the related work.

Review
Rating: 3

The paper introduces a robust guided diffusion framework for offline black-box optimization, combining proxy and proxy-free diffusion for conditional generation. Key improvements include proxy-enhanced sampling and diffusion-based proxy refinement to address out-of-distribution issues. Experiments on the Design-Bench benchmark show the method outperforms existing techniques, validated by ablation studies.

Strengths

  • The regularization of the proxy using the diffusion model is interesting. Additionally, optimizing the alpha parameter in an offline manner aligns well with the offline setup, enhancing the method's consistency and applicability.
  • Experiments and ablations on four continuous and three discrete tasks validate the effectiveness of the proposed RGD method, showing improved performance and robustness.

Weaknesses

  • The paper lacks comparison with relevant approaches like ICT [1] and TRI-mentoring [2]. Despite referencing the latter in the related work section, it’s overlooked in the results.
  • It is unclear why the results without proxy-enhanced sampling still achieve competitive outcomes, surpassing the dataset $y_{\text{max}}$. This contradicts the claims in Lines 40-46. Where does the out-of-distribution (OOD) problem arise, then? What is the distribution of the 128 generated candidates with and without the sampling?
  • The reported BDI results are significantly lower than in the original paper, especially for the ANT and TFBIND8 tasks. This also seems to be the case for the BONET results. Did the authors change the evaluation setup?

[1]: Importance-aware Co-teaching for Offline Model-based Optimization, https://arxiv.org/abs/2309.11600

[2]: Parallel-mentoring for Offline Model-based Optimization, https://arxiv.org/abs/2309.11592

Questions

  • How are the initial 128 designs selected? Do you generate N designs and then use the proxy to select 128?
  • Are the diffusion parameters like T the same for DDOM?
  • Can you show, for a task, how the values (proxy/oracle) of the generated designs progress throughout the diffusion steps?
  • How were the discrete tasks handled? Were logits used, or were they kept discrete?
  • Previous methods typically evaluate on the Hopper task; why was it removed and Rosenbrock is added instead?

Limitations

The authors address the limitations and potential negative impacts in their paper.

Author Response

General Reply

Dear Reviewer,

Thank you for your valuable feedback. Your insights are instrumental in improving our paper, and we are committed to thoroughly revising our work based on your suggestions.

Weaknesses

The paper lacks comparison with relevant approaches like ICT [1] and Tri-mentoring [2].

Thank you for pointing out the omission of comparisons with ICT and Tri-mentoring. Our initial experiments did not include these methods because they use ensemble techniques, in contrast to our single-proxy approach. To address this, we have now conducted comparative experiments with both methods on the same benchmarks:

| Method | Superc | Ant | DKitty | Rosen | TF8 | TF10 | NAS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ICT | 0.505 ± 0.014 | 0.958 ± 0.008 | 0.960 ± 0.025 | 0.778 ± 0.012 | 0.957 ± 0.010 | 0.688 ± 0.020 | 0.665 ± 0.072 |
| Tri-mentoring | 0.510 ± 0.014 | 0.946 ± 0.010 | 0.950 ± 0.015 | 0.780 ± 0.006 | 0.968 ± 0.002 | 0.689 ± 0.014 | 0.760 ± 0.092 |
| Our Method | 0.515 ± 0.011 | 0.968 ± 0.006 | 0.943 ± 0.004 | 0.797 ± 0.011 | 0.974 ± 0.003 | 0.694 ± 0.018 | 0.825 ± 0.063 |

These results validate our method's effectiveness relative to ensemble-based approaches. We will include this data in Tables 1 and 2 of the revised manuscript.

It is unclear why the results without proxy-enhanced sampling still achieve competitive outcomes, surpassing the dataset $y_{\text{max}}$.

Thank you for your observation regarding the unexpectedly competitive outcomes of our model without proxy-enhanced sampling. The results surpassing the dataset $y_{\text{max}}$ can be attributed to two key factors:

  1. Conditioning on Maximum Labels: The diffusion model is conditioned on the label $y_{\text{max}}$, naturally guiding the sample generation to orbit around $y_{\text{max}}$.
  2. Diversity of Generation: The inherent diversity of the diffusion model makes it possible to occasionally surpass $y_{\text{max}}$ even in the absence of explicit guidance.

This observation does not contradict the statements made in Lines 40-46, where we discuss the model's struggles without explicit guidance:

  1. Comparative Performance: While samples without proxy-enhanced sampling can exceed $y_{\text{max}}$, the results with explicit guidance consistently outperform those without, confirming the benefits of proxy-enhanced approaches as mentioned.
  2. Frequency and Quality of High-Performance Samples: The diffusion model without explicit guidance does produce high-performance samples, but this occurs less frequently and with lower average performance compared to when explicit guidance is employed.

To provide a clearer picture, most of the 128 candidates generated without explicit guidance indeed perform below $y_{\text{max}}$, and their average performance significantly trails that of candidates generated with proxy-enhanced sampling. We will elaborate on these aspects at Line 283 of the revised manuscript.

The BDI reported results are significantly lower than in the original paper, especially for the ANT and TFBIND8 tasks. This also seems to be the case for BONET results.

Thank you for noting the discrepancies in the BDI results. We referenced the BDI results from the Tri-mentoring paper [2], where a modified network architecture was used due to computational constraints. Specifically, instead of the original 6-layer MLP kernel, a more manageable 3-layer MLP was used. We will add this at Line 239.

Regarding BONET, a deviation was noted in the candidate selection process: it evaluates 256 candidates rather than the typical 128. To align with standard practice, we reran BONET's code under standardized conditions. We will add this at Line 249.

Questions

How are the initial 128 designs selected? Do you generate N designs and then use the proxy to select 128?

In our generative model approach, we do not select from a predetermined set of "initial designs" as seen in traditional methods like gradient ascent. Instead, our method generates designs directly from pure noise, bypassing the typical optimization of existing designs. This aligns with DDOM.

Are the diffusion parameters like T the same for DDOM?

Yes, we have aligned the key diffusion parameters with DDOM. For instance, the number of diffusion time steps $T$ is set to 1000 in both our RGD and DDOM models to ensure comparability.

Can you show, for a task, how the values (proxy/oracle) of the generated designs progress throughout the diffusion steps?

We have documented the progression of the values (proxy/oracle) of the generated designs for the Ant and DKitty tasks throughout the diffusion steps in Figure 1 of the global response PDF. The data demonstrate how the design is gradually guided towards higher scores via the proxy.

How were the discrete tasks handled?

In handling discrete tasks, we utilize the map_to_logits function provided by the design-bench library. This function converts the discrete task inputs into logits.
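For reference, a short usage sketch of this conversion (the task name is chosen for illustration):

```python
import design_bench

task = design_bench.make("TFBind8-Exact-v0")  # an example discrete task
if task.is_discrete:
    task.map_to_logits()  # replaces integer tokens with continuous logits
x_continuous = task.x     # the diffusion model then operates in this continuous space
```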

Previous methods typically evaluate on the Hopper task; why was it removed and Rosenbrock is added instead?

We excluded the Hopper task due to inconsistencies between the offline dataset values and those obtained from the oracle, as discussed in DDOM [3] under "A.4. HopperController". To enhance the robustness and credibility of our method, we introduced the Rosenbrock task instead.

Overall

Have we addressed your concerns with our response? Your thorough feedback is highly valued, and we welcome continued discussion in the rebuttal phase. Thank you.

[1] Ye Yuan et al. Importance-aware co-teaching for offline MBO. NeurIPS 2023.
[2] Can Chen et al. Parallel-mentoring for offline MBO. NeurIPS 2023.
[3] Siddarth Krishnamoorthy et al. Diffusion models for black-box optimization. ICML 2023.
Comment

Thank you for your detailed responses. While you've addressed most of my concerns, two key issues are troubling: (1) The use of BDI results from the TRI-mentoring paper without proper citation, coupled with the initial omission of TRI-mentoring from the benchmarks, and (2) The results from a reduced network architecture, which leads to an unfair comparison that may inadvertently favor your method. These issues affect the credibility of the work, prompting me to revise my score to a rejection. I hope these concerns can be addressed in any future submissions to ensure a fair and accurate presentation.

Comment

Dear Reviewer,

Please reply to the rebuttal.

AC.

Comment

Thank you for your detailed feedback. Regarding the use of the BDI results from the Tri-mentoring paper, we have now included proper citations and incorporated Tri-mentoring into our benchmarks for a comprehensive comparison. We initially viewed Tri-mentoring primarily as an advanced ensemble method, which seemed distinct from our model's approach, and thus only included it in the related work section. However, recognizing the importance of clarity and completeness, we have addressed this in our revised submission.

As for the reduced network architecture, we followed the specifications outlined in the published NeurIPS 2023 Tri-mentoring paper and used the results reported there. Our approach was consistent with the standard three-layer MLP used across all methods in this comparison, ensuring a fair and uniform basis for evaluation. We hope these clarifications address your concerns. Could you please let us know if there are any further issues or suggestions?

Comment

Given your recognition that we have addressed most of the initial concerns:

While you've addressed most of my concerns,

we were surprised by the decision to downgrade our submission from 'borderline reject' to 'reject,' particularly since the remaining issues were, in our assessment, relatively minor. We are confident that the revisions we have implemented thoroughly address the issues raised. In light of these efforts, we kindly request a reevaluation of our paper.

Comment

Thank you for your follow-up and revisions, and for acknowledging the oversight. However, transparency is most effective when it's proactive, not reactive. The initial omission of Tri-mentoring while using their BDI results raises serious concerns about the credibility of your work. Even if corrected later, selective reporting undermines trust and cannot be overlooked. Therefore, I will maintain my decision to reject the paper. I strongly urge you to prioritize transparency in future submissions.

Comment

Thank you for your prompt feedback.

As previously mentioned, we acknowledge the TRI-mentoring method and have discussed it in the related work section. We initially did not include a direct comparison with TRI-mentoring because (1) it employs an advanced ensemble approach, distinctly different from our methodology with a single proxy, and (2) our analysis already included a comprehensive set of 14 baselines. Following your insightful recommendation, we have now incorporated a comparative analysis with TRI-mentoring.

We also recognize and regret the inadvertent omission of the citation for the BDI results from TRI-mentoring, which has been duly corrected. We wish to clarify that our omission was limited to the citation of results; we did not manipulate the baseline results, and thus this should not be considered as selective reporting. We appreciate the emphasis on transparency and fully agree with its importance. We understand that peer review should focus on the scientific content and contributions of a manuscript. While unintentional oversights are unfortunate, they have been rectified and should not overshadow the substantive scientific evaluations of the work.

We hope that the changes implemented demonstrate our commitment to transparency and scientific rigor.

Review
Rating: 5

The paper introduces a new method, named RGD, for offline black-box optimization (BBO). RGD incorporates an improved proxy to guide the previous proxy-free method (i.e., DDOM [4]). Key technical innovations include (1) improving the robustness of the proxy function against adversarial samples via consistency regularization with the diffusion process, and (2) dynamic per-sample reweighting between proxy-guided and proxy-free sampling. Compared to previous approaches, RGD demonstrates superior performance on Design-Bench [3].

Strengths

Methodology: RGD integrates forward and reverse approaches for BBO in a way that lets them help each other (e.g., using the forward proxy to guide the reverse sampling and using the diffusion process to improve the forward proxy), which is technically sound and interesting.

Experiment: RGD demonstrates superior performance on Design-Bench, compared to the baselines.

Ablation: Ablations on different components of RGD are provided.

Weaknesses

The reviewer would appreciate some clarifications on the method and the experiments:

i) Algorithm 1, Line 4: how are the adversarial examples identified? From Lines 187-188, it looks like gradient ascent is utilized to find the $x$ that maximizes $y$; it is unclear to the reviewer how to determine whether the obtained $x$ is an adversarial example.

ii) Algorithm 1, Line 7: refining the proxy function via Eq. (15). It would be best if the authors could provide further details on how Eq. (15) is optimized, e.g., the number of validation and adversarial samples and the number of iterations for the bi-level optimization discussed in Appendix B.

iii) Algorithm 1, Line 13: optimizing $\omega$. Again, it would be best if the authors could provide extra information on how $\omega$ is optimized. From Algorithm 1, it looks like $\omega$ is time-dependent and optimized for each time step. How many training iterations are required for each time step? The reviewer also wonders whether the obtained $\omega$ differs dramatically between time steps.

iv) From Lines 257-258, it looks like the baselines shown in Tables 1 & 2 were re-implemented. If this is the case, the authors are encouraged to include more implementation details, e.g., the model architecture for the score function. This would help follow-up works reproduce the reported results. The reviewer also wonders if the source code will be made public.

Questions

Please refer to the weaknesses.

Limitations

Limitations have been discussed in the appendix.

Author Response

General Reply

Dear Reviewer,

We sincerely appreciate the time and effort you have invested in providing such a constructive review of our manuscript. Your insights and suggestions are invaluable, and we are truly grateful for the guidance you have provided. We are fully committed to carefully considering and incorporating all your feedback to enhance the quality and clarity of our revised manuscript.

Weaknesses

i) Algorithm 1, Line 4, how to identify the adversarial examples?

Please refer to the "Adversarial Sample Identification" section in the global response.

ii) Algorithm 1, Line 7, refine the proxy function via eq 15. It would be best if the author could provide further details on how to optimize eq (15), e.g. number of validation and adversarial samples, number of iterations for the bi-level optimization discussed in Appendix B.

We refine the proxy function using batch optimization. Each batch comprises 256 training samples, 256 validation samples, and 128 adversarial samples. The bi-level optimization process, outlined in Appendix B, uses a single iteration for both the inner and outer levels to adjust the hyperparameter $\alpha$. We will include these details in Appendix B.
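A minimal sketch of one such inner/outer iteration (our own illustration with hypothetical names such as reg_fn, and a functional one-step inner update; the exact procedure is the one described in Appendix B) might look like this, where alpha is a scalar tensor with requires_grad=True registered in alpha_opt:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def bilevel_alpha_step(proxy, alpha, train_b, val_b, adv_b, reg_fn, lr_inner, alpha_opt):
    # Inner step: a functional (differentiable) proxy update with the
    # alpha-weighted regularizer of Eq. (15), so the validation loss below
    # stays differentiable with respect to alpha.
    x, y = train_b
    names, params = zip(*proxy.named_parameters())
    inner = F.mse_loss(proxy(x), y) + alpha * reg_fn(proxy, adv_b)
    grads = torch.autograd.grad(inner, params, create_graph=True)
    fast = {n: p - lr_inner * g for n, p, g in zip(names, params, grads)}

    # Outer step: update alpha from the validation loss of the refined proxy.
    xv, yv = val_b
    val_loss = F.mse_loss(functional_call(proxy, fast, (xv,)), yv)
    alpha_opt.zero_grad()
    val_loss.backward()
    alpha_opt.step()
    return val_loss.item()
```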

iii) Algorithm 1 Line 13, optimizing \omega. Again, it would be best if the author could provide extra info on how to optimize \omega. From Algorithm 1, it looks like \omega is time dependent and optimized for each time step. How many training iterations are required for each time step. The reviewer also wonder if the obtained \omega are dramatically different between different time steps.

$\omega$ is indeed optimized time-dependently, updated once per time step using an Adam optimizer with a learning rate of 0.01. We will include this specification at Line 168 of the revised manuscript. Regarding its variability, we have already shown in Figure 3 of our manuscript that $\omega$ changes significantly between time steps.
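A minimal sketch of this per-step update (our illustration: score_fn, proxy, mu, sigma, denoise_step, time_steps, y_max, and the initial x_t are assumed placeholders, and the guidance combination shown is one common form rather than necessarily the paper's exact one):

```python
import torch

omega = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([omega], lr=0.01)    # learning rate as stated above

for t in time_steps:                         # reverse-time schedule T -> 0
    with torch.no_grad():                    # scores are constants w.r.t. omega
        s_cond = score_fn(x_t, y_max, t)     # conditional score
        s_uncond = score_fn(x_t, None, t)    # unconditional score
    s_tilde = (1 + omega) * s_cond - omega * s_uncond  # guided score
    x0_hat = (x_t + sigma(t) ** 2 * s_tilde) / mu(t)   # Eq. (10)-style estimate
    loss = -proxy(x0_hat).mean()             # ascend the proxy score via omega
    opt.zero_grad(); loss.backward(); opt.step()
    x_t = denoise_step(x_t, s_tilde.detach(), t)       # reverse-SDE update
```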

iv) From Line 257-258, it looks like the baselines shown in Table 1 & 2 were re-implemented. If this is the case, the authors are encouraged to include more implementation details, e.g. the model architecture for the score function, etc. This could help follow-up works to reproduce the reported results. The reviewer also wonders if the source code will be made public.

We only modify settings where necessary. For example, since DDOM/BONET generate 256 candidates rather than the typical 128 used by most methods, we reran those experiments to ensure comparable conditions. This ensures our results are directly comparable across all methods. We plan to make all source code publicly available upon acceptance of the paper, facilitating easy reproduction of our results by the research community.

Overall

Does this response address your concerns? We appreciate your feedback and look forward to further discussions during the rebuttal phase. Thank you for your input.

Comment

Dear Reviewer,

As the discussion period is nearing its conclusion, we kindly ask you to engage in the discussion and provide notes on any concerns that have not yet been addressed, along with the reasons why.

Thank you for your attention to this matter.

AC.

Comment

Dear Reviewer,

As the discussion period is nearing its conclusion, we kindly ask you to engage in the discussion and provide notes on any concerns that have not yet been addressed, along with the reasons why.

AC.

Comment

Dear Reviewer,

Please reply to the rebuttal.

AC.

Author Response

Dear Reviewers,

We appreciate your detailed evaluation and insightful comments on our manuscript. Acknowledging your feedback, we have addressed one primary concern highlighted in your reviews within this response.

Adversarial Sample Identification

(Reviewer a4Cg) i) Algorithm 1, Line 4: how are the adversarial examples identified? From Lines 187-188, it looks like gradient ascent is utilized to find the $x$ that maximizes $y$; it is unclear to the reviewer how to determine whether the obtained $x$ is an adversarial example.

(Reviewer hUhz) To compute the regularization loss term in Eq (15), we need to collect samples from the adversarial distribution. I cannot find the detailed procedure for collecting adversarial samples.

We utilize a vanilla proxy to perform 300 gradient ascent steps, identifying samples with unusually high prediction scores as adversarial. This method is based on the limited extrapolation capability of the vanilla proxy, as demonstrated in Figure 3 of COMs [1]. These unusually high predictions indicate deviations from the normal data distribution, validating their classification as adversarial examples. We will include these details at Line 188 of our manuscript to enhance understanding.
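A minimal sketch of this procedure (our own illustration; proxy, the optimizer, and the step size are placeholders, with only the 300-step count taken from the description above):

```python
import torch

def collect_adversarial(proxy, x_init, steps=300, lr=0.01):
    """Run gradient ascent on the vanilla proxy; the over-scored end points
    are treated as adversarial samples."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = -proxy(x).sum()   # ascend the proxy prediction
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()            # designs with unusually high proxy scores
```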

Best,

Submission 2262 Authors.

[1] Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization. In Proc. Int. Conf. Machine Learning (ICML), 2021.
Final Decision

The paper proposes RGD, a method combining proxy and proxy-free diffusion approaches for offline black-box optimization. Initial reviews were mixed. Reviewers praised the comprehensive experiments and intuitive ideas but raised concerns about technical novelty, comparisons with recent methods, and potential overfitting risks.

In their rebuttal, the authors addressed many concerns by providing additional comparisons, clarifying technical details, and conducting further experiments. Most reviewers found these responses satisfactory, with some increasing their scores, and I did take the authors' concern about Reviewer AXdG into consideration. However, the overall support after the rebuttal still falls below the typical acceptance range. Thus, I regret to recommend rejection for this submission and encourage the authors to resubmit a revised version to the next top conference.