6.3

/10

Rejected3 位审稿人

最低5最高8标准差1.2

3.0

置信度

正确性3.3

贡献度2.3

表达3.0

ICLR 2025

ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

Omer Ronen,Ahmed Imtiaz Humayun,Richard Baraniuk,Randall Balestriero,Bin Yu

OpenReview PDF

提交: 2024-09-28更新: 2025-02-05

TL;DR

We develop a new general-purpose out-of-distribution score to regularize latent space optimization.

摘要

关键词

VAELatent Space OptimizationOOD

评审与讨论

审稿意见

评分: 8置信度: 32024-11-01

Latent Space Optimisation (LSO) is prone to over-exploration, leading to invalid solutions. This paper proposes to regularise the acquisition function $\\mathcal{A}_{\\hat{f}}(\\mathbf{z})$ to restrict the selected query points $\\mathbf{z}^{(new)}$ to a subset leading to valid output sequences. This is done via the proposed score, ScaLES, which approximates the log-likelihood of valid samples through the log-determinant of the decoder's Jacobian (pseudo) inverse. This score is shown to be a good proxy measure both theoretically and empirically. Furthermore, ScaLES also demonstrates SOTA computational time compared to existing methods.

优点

Quality

The paper is theoretically sound; intuitive explanations and well-detailed derivations are provided in both the main paper and the Appendix. I carefully checked the derivations and did not find any issues.
The experiments provided are extensive, using several architectures, datasets, and hyperparameters. Additional ablation results are also provided. All this shows that ScaLES generalises well across various configurations.
The limitations of ScaLES are discussed, and the impact of some theoretical assumptions in practical applications is well-detailed.

Clarity

The paper is well-written and easy to follow.
As far as I know, the contextualisation with respect to related work is done adequately. Still, LSO is not my area of research, so I may not notice missing related work.

Originality

As far as I know, the proposed method is novel. However, as I mentioned above, LSO is not my area of research, so I may not be aware of concurrent work close to the proposed method.

Significance

ScaLES is well-motivated theoretically and shows SOTA results across extensive experiments. This score can be of interest for the LSO community, both on the theoretical and practical side.

缺点

I really enjoyed reading this paper, and I only have a few minor comments.

Mathematical notation

In Alg. (1), the new samples are generally referred to as $\\cdot^{(new)}$ , except $y$ , which is referred to as $y^{new}$ . I suggest renaming it $y^{(new)}$ for consistency.
In Eq. (6), $L$ is not defined.
l. 212, the vocabulary size and sequence length are denoted in two ways: D, L and $D, L$ . Which notation is correct?
In Eq. (19), shouldn't $\\mathbf{x}$ be $\\mathbf{x}_{\\mathbf{z}}$ as in Eq. (9), or did I miss something there?
l. 648, could the numerator of $p_{\\mathbf{z}}^{(i)}$ be rewritten with $\\exp(\cdot)$ instead of $e^{\cdot}$ ? the font size of the fraction is quite small, and it is hard to distinguish which subscripts go where.

Clarity

Are the results in Fig. 9 an average over all the datasets? If so, could we also have the details per dataset (especially for the challenging Expressions dataset)? If it is only over one dataset, could the dataset used be mentioned in the legend?
In Sec.5, given the need to select a good value for $\\lambda$ , I think the claim that "ScaLES has no hyperparameter" could be misleading. I would encourage the authors to reformulate this and mention the need for the $\lambda$ parameter to weight the score given by ScaLES.

Typo

Appendix A is empty, consider removing it if not used.

问题

[1] observed that almost all types of probabilistic generative models can spuriously assign a high likelihood to OOD data samples. How would this affect ScaLES?
In 4.2, the authors mentionned that $\\lambda > 0.5$ did not improve the optimisation process. Are the results worse in that case? If so, why?
Can the authors provide some insights into why the Expressions dataset was more challenging for ScaLES? did $\\lambda > 0.05$ provided worse results or just no improvements?

References

[1] Nalisnick, Eric, et al. "Do Deep Generative Models Know What They Don't Know?." International Conference on Learning Representations. 2019

评论- answers to questions

2024-11-21

We thank the reviewer for their thorough reading of our work and for the positive feedback. We appreciate the reviewer’s suggestions on improving the mathematical notation, all of which have been incorporated in blue in the revised manuscript. Additionally, we have removed the claim regarding the absence of hyperparameters.

Regarding the Questions:

On likelihood base models and OOD Detection:
Recent work [1] has shown that, for normalizing flows, the number of singular values of the Jacobian matrix exceeding a certain threshold can help address the OOD paradox (i.e., spuriously assigning a high likelihood to OOD data samples). We believe it is a promising direction to investigate whether ScaLES could also be used in this context to improve OOD detection for likelihood-based models.
On the Expression Dataset and using larger $\lambda$ :
For the expression dataset, the (black box) problem is easier compared with the other benchmarks, which is why we think the regularization is less helpful. We believe that the gradient of the acquisition function is not as noisy as in the other datasets, and there are fewer gains in adding regularization. We note that in general we did not perform a through grid search to find the optimal $\lambda$ for each problem and it is possible that for some cases using $\lambda>5$ would be beneficial.

[1] Kamkari et al., “A Geometric Explanation of the Likelihood OOD Detection Paradox,” ICML 2024.

2024-11-23

I thank the reviewers for their clarification and paper updates.

Re OOD detection: The issue of OOD detection is not specific to normalising flows and was also observed in VAEs by Nalisnick et al. In that sense, I believe that my comment is related to the issue of heuristic validity raised by Reviewer yzw5. Based on Nalisnick et al., it is unclear whether OOD samples will always have a lower likelihood, and, as suggested by Reviewer yzw5, OOD samples may still be of interest in some configuration. Because if this, is it possible that some improvements observed with LES are due to a (spuriously) high likelihood of OOD samples ?

Given this and taking into account the observations of other revieers, I am lowering my score to 6 for now.

评论- thanks for your response

2024-11-23

We thank the reviewer for their thoughtful comments.

1. Likelihood and OOD Data

We would like to clarify that we did not claim OOD regions will always have lower density, but rather that OOD regions are more likely to have lower density—a claim quantitatively supported in Table 2. As noted in our original submission, likelihood consistently shows a strong correlation with being in-distribution, defined by decoding into valid objects. This is evidenced by the high AUROC (0.92 on average and always above 0.75) values observed across the various decoders we analyzed.

It is also important to note that Nalisnick et al.'s analysis focused on images, which involve a different data modality (continuous images versus discrete sequences) and applied a different definition of OOD, specifically based on images from different datasets rather than latent vectors and the concept of validity. A possible explanation for the differences between our findings and those of Nalisnick et al. is that our definition of validity more effectively distinguishes in-distribution data from out-of-distribution data compared to the approach used in their analysis of images from different datasets.

2. Interest in OOD Data Points

While it is true that OOD data points can sometimes be of interest, the key issue is not their potential value but whether there exists a practical algorithm to systematically and efficiently identify such points. For LSO, our analysis (i.e., the poor performance of LSO (L-BFGS) method) and previous work ([1] and [2]) suggest that no such algorithm currently exists. To substantiate this claim, we include below relevant excerpts from the cited literature in our introduction:

“Although in principle optimization can be performed over all of Z, it has been widely observed that optimizing outside of the feasible region tends to give poor results, yielding samples that are low-quality, or even invalid (e.g., invalid molecular strings, non-grammatical sentences); therefore, all LSO methods known to us employ some sort of measure to restrict the optimization to near or within the feasible region.”[1]

“Automatic Chemical Design possesses a deficiency in so far as it fails to generate a high proportion of valid molecular structures. The authors hypothesize that molecules selected by Bayesian optimization lie in ‘dead regions’ of the latent space far away from any data that the VAE has seen in training, yielding invalid structures when decoded.”[2]

We believe these examples, drawn from highly cited papers written by leading researchers in the field, provide compelling evidence that while OOD data points might occasionally be valuable, there are no practical, systematic methods to reliably identify them. We will be happy to further emphasize this point in the revised manuscript, to make it extra clear.

3. Empirical Improvements and OOD Regions

Finally, regarding the question of whether the empirical improvements we observe might stem from exploring OOD regions: we do not believe this to be the case. The following points support our reasoning:

The analysis in Table 2 demonstrates that LES consistently achieves higher values in regions of the latent space that decode into valid objects, which we and others ([1] and [2]) define as in-distribution.
Across 30 experiments, the proportion of valid solutions generated during optimization with LES is always higher than without LES. (Table 10, LES is always higher than LSO (GA))

Given these observations, we find it highly unlikely that LES explores OOD regions, as these would typically not decode into valid objects (per [1] and [2]).

References

[1] Tripp, A., Daxberger, E. and Hernández-Lobato, J.M., 2020. Sample-efficient optimization in the latent space of deep generative models via weighted retraining. Advances in Neural Information Processing Systems, 33, pp.11259-11272.

[2] Griffiths, R.R. and Hernández-Lobato, J.M., 2020. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chemical Science, 11(2), pp.577-586.

2024-11-23

I thank the authors for their detailed explanation and have updated my score accordingly.

评论- thank you!

2024-11-23

much appreciated :)

审稿意见

评分: 6置信度: 32024-11-03

This paper proposes a method for latent space optimization called ScaLES. The author's highlight that existing methods for latent space optimization often yield latents which correspond to invalid configurations in the observation space. To focus on latents within the "valid set", the authors propose to use the likelihood of a sample under the learned generative model, specifically using the decoder contribution to the likelihood. The authors show that incorporating this score into latent space optimization can be done in a computationally tractable manner and yields superior performance to existing methods on benchmark datasets.

优点

The paper address an important problem, i.e., finding effective methods for latent space optimization.
The paper is straightforward to understand and well written.
The paper does a very good job motivating the need for their method and contextualizing it relative to prior work.
The method seems simple and straightforward to implement.
The method shows promising performance in terms of identifying valid latent configurations yielding performance gains over existing methods.

缺点

CPA Assumption

A key aspect of the author's method is the assumption that all reasonable deep generative models can be well approximated by continuous piecewise affine (CPA) spline operators. I am confused exactly what the purpose is of making this approximation? If the sole purpose is to use this approximation to derive the likelihood of a sample under the generative model via Theorem 5, then the assumption seems unnecessary as such a formula exist via the well known change of variable formula or alternatively its extension to rectangular matrices [1]. Thus, unless I am missing something, the CPA assumption and Theorem 5, which constitute a large part of the method section, seem unnecessary.

On Scalability

The authors emphasize "scalability" as a key strength of the Scales method. Based on my understanding of the method, however, "scalability" does not seem to be the method's strong point. Scales requires computing the determinant of the Jacobian of a decoder. In the experiments the authors consider, the latent dimension is quite low, s.t. this computation can be done relatively efficiently, as shown in Table 1. It is known, however, that the computation of the log determinant of the decoder Jacobian scales cubicly with the latent dimension (as the authors note). I am not an expert in LSO, thus I do not know if most problems of interest require a significantly larger latent dimensionality. However, for such problems, the Scales method will not scale gracefully. Thus, I do not view "scalability" as the strong point of this method.

Presentation

I think the clarity of the presentation of the contributions and main ideas in this work can be improved. As far as I understand, the main contributions of this work are the following:

The authors propose the idea that the likelihood of a generative model can be used as a heuristic for determining whether a given latent will yield a valid observation.
The authors create a score for LSO based on this idea, using only the generator term in the likelihood formula.
The authors show that this score can be efficiently computed on the datasets considered and that it is able to yield latents which correspond to valid configurations more robustly than existing methods.

Currently, I feel that the abstract and introduction are rather vague in terms of describing the core ideas and contribution of this work. To this end, I think a more direct description, along the lines of what is written above, would be easy to write and would improve the clarity of the paper.

Validity of Scales Heuristic

I am skeptical of the whether the likelihood of a generated latent under the decoder will always be a good heuristic for whether this latent will yield a valid observed configuration. For example, many configurations of interest are very likely OOD w.r.t. the empirical data distribution, i.e., a newly discovered molecule may have zero coverage under the empirical data distribution. Consequently, whether Scales is able to identify such a latent is contingent on whether or not the decoder is able to extrapolate to these unseen regions. In other words, the success of Scales seems intimately tied to the performance of the decoders considered. In this sense, I think calling the method a "general purpose regularization method for LSO" is a bit of an overclaim. I think a deeper discussion on this point, i.e., the validity of the Scales heuristic and its relationship to the generalization of the models in question, would improve the paper.

Overclaims in Writing

I feel that some of the writing, when motivating Scales feels like a bit of an overclaim that should be toned down or at least better expounded upon. For example, the authors motivate Scales as a "exact and theoretically motivated method" or "implemented ... exactly as theoretically derived". In what sense is the method "exact" and in what sense is it "theoretically motivated" or "derived"? The latter makes it sounds like there exist a theorem on the optimality of the method, when, in reality, I presume the authors are referring to their derivation of the likelihood (Thm 5). I do not think this constitutes theoretical motivation for the method and feels like an overclaim which makes the method sound more principled than it actually is, as I understand it.

Further, as previously discussed, I do not think calling the method "general purpose" or "scalable" is particularly accurate, and I would encourage the authors to explain what is meant by this, if they include this language.

Baseline Method Descriptions

A key baseline method that the authors compare against is the "Bayesian uncertainty score". The authors, however, do not give a self-contained description of this method in the main text or appendix which makes it difficult to contextualize some of the experimental findings.

Summary

In summary, I think the core idea of this paper, i.e., that the likelihood of a generative model can serve as a good heuristic for identifying valid latents for LSO, is potentially interesting and the author's empirical results seem promising. Currently, however, I feel the manuscript obfuscates this relatively straightforward idea with superfluous theory and vague writing. Furthermore, I think several stated motivations of Scales, in particular scalability, feel like overclaims to me, for the reasons stated above.

Bibliography

http://benisrael.net/INTEGRAL-AMS.pdf

问题

1. What is the purpose of the CPA approximation?

2. What is the meaning of "scalability" in reference to Scales?

3. Is the latent dimension of the generative models considered in this work similar to what one would expect to use in more practical settings for LSO.

4. Lastly, I am curious if the authors have considered alternative instantiations of the Scales heuristic?

Specifically, a core idea of Scales, as I understand it, is that the generator likelihood, can serve as a good heuristic for the validity of latents in LSO. This likelihood consist of the prior likelihood term and a Jacobian determinant. Currently, the authors rely on the determinant term. I am curious if the prior likelihood could instead be used in some cases. For example, consider a case in which one does not force the latents to conform to a restricted prior such as a unit Gaussian as in a VAE, but instead just encodes data with a vanilla autoencoder. In this case, the latent distribution will have a complex shape which may better capture the overall likelihood of a generated latent. To measure this likelihood, one could model the unknown latent distribution via an exact likelihood method such as a normalizing flow. I am curious if the authors have intuition on whether the prior likelihood on its own, in such cases, could serve as an effective score for LSO. I ask this particularly because using the prior likelihood seems much more scalable than the Jacobian determinant, thus I am wondering if there are cases where this score could be effective.

评论- clarifications and revisions

2024-11-21

We thank the reviewer for their careful reading of our work, and we also appreciate their suggestions for improving our manuscript. We follow the original structure of the review in our response.

CPA Assumption

We thank the reviewer for their insightful comment and for highlighting the relevant reference showing that the CPA assumption is not strictly required to derive the change-of-variable formula. We appreciate the opportunity to clarify why the CPA assumption is important in our work.

Our derivation in Equation (10) enables us to compute the gradient of ScaLES as a function of the softmax probabilities and the decoder’s derivatives (the slope matrices $A_\omega$ ) in closed form. This approach reduces the computational burden compared to differentiating through $S(z)$ using automatic differentiation.
Calculating the derivative of ScaLES without the CPA assumption, as described by Ben-Israel, would require (via Jacobi's formula) computing (denoting the decoder as $\boldsymbol{G}(z)=\text{Softmax}(\boldsymbol{L}(z))$ ):
$\frac{dS(z)}{dz_i} = \text{Trace}((J\boldsymbol{G}^T J\boldsymbol{G})^{-1} \frac{dJ\boldsymbol{G}^T J\boldsymbol{G}}{dz_i})$

The term $\frac{dJ\boldsymbol{G}^T J\boldsymbol{G}}{dz_i}$ involves second-order derivatives of the decoder network, resulting in a significant computational burden that scales quadratically with the latent dimension in gradient evaluations. However, representing the decoder as a CPA implies that the second-order derivative of the logits matrix $\boldsymbol{L}$ is always zero, which alleviates the need to compute these derivatives.
Lastly, the CPA assumption provides intuition by partitioning the output space into regions, each associated with a distinct contraction or dilation, affecting the likelihood of it's image

The reviewer is correct that this important clarification was missing in the original manuscript. We have revised the manuscript to address this point (l.264–269).

On Scalability

We appreciate the reviewer’s suggestion to clarify the term "scalable," and we agree with their assessment. The reviewer is correct that while the calculation of the determinant may not constitute a computational burden in the VAEs we study (which are comparable in size to VAEs used in real applications, e.g., [1]), it will not scale gracefully to much larger VAEs. Reflecting on the reviewer’s comments, we have:

Removed the term "scalable" from the manuscript.
Emphasized the computational complexity of our method in the contributions (l. 92).

[1] Truong, G.B., et al, 2024. Discovery of Vascular Endothelial Growth Factor Receptor 2 Inhibitors Employing Junction Tree Variational Autoencoder with Bayesian Optimization and Gradient Ascent. ACS Omega.

Validity of ScaLES Heuristic

We fully agree with the reviewer’s observation that the success of ScaLES is closely linked to the performance of the decoder. This is a point we emphasized in lines 203–204. In the revised manuscript, we:

Removed the term "general-purpose."
Included a more detailed discussion of this under the limitations of Theorem 5 (l.284–286).

Please also see our response to R3 for a discussion regarding the 'Interest in OOD Data Points'.

Overclaims in Writing

We thank the reviewer for raising concerns about our claims. In response, we have removed the terms exact, general-purpose, theoretically motivated, and scalable from the manuscript.

Baseline Method Descriptions

We have added a description of the Bayesian uncertainty score in Appendix D (l.1045–1052).

Questions

What is the purpose of the CPA approximation?

As explained in our response above, the CPA approximation:
- (1) Enables closed-form expressions for the gradient of ScaLES.
- (2) Avoids the need to compute the Hessian of the decoder.
- (3) Provides a geometric interpretation of the decoder function as a map that partitions the latent space into regions, before applying the softmax function.
What is the meaning of "scalability" in reference to ScaLES?

We have removed the term "scalable" from our manuscript.
Is the latent dimension of the generative models considered in this work similar to what one would expect to use in more practical settings for LSO?

To the best of our knowledge, the VAE models considered in this study have latent dimensions comparable to those used in practical applications. We have provided a citation backing up this claim in our response above (On Scalability).
Lastly, I am curious if the authors have considered alternative instantiations of the ScaLES heuristic?

The suggestion of using normalizing flows as an alternative is intriguing and certainly worth further exploration. In this study, we focus on methods that do not require additional training and can therefore be easily integrated into existing pipelines (l.85-86).

2024-11-25

Dear reviewer, did our response address your concerns and questions? If not, we would love to carry out additional experiments and/or provide further clarification.

2024-11-26

I thank the authors for their reply and their effort in implementing the changes according to the feedback given. I think these changes have improved the quality of the paper. I am hence raising my score to a 6, thus recommending acceptance of the paper. I am also increasing my score for the paper presentation to a 3: good.

2024-11-26

Thank you for your feedback, we agree that your suggestions improved the quality of the paper.

审稿意见

评分: 5置信度: 32024-11-04

The paper proposed a sequence density function for discrete sequential data $x$ in a latent variable model setting, i.e., VAEs. The authors further demonstrated the effectiveness of this density function in detecting out-of-distribution $x$ , as well as in mitigating over-exploration during latent space optimization (LSO).

优点

The authors showed that the proposed method is more compute efficient and outperforms the baselines in LSO.
The question of LSO for sequential data is important, and using VAEs to solve it is interesting.

缺点

VAEs are generative models, i.e., modelling $p(x,z)$ with the pre-defined prior $p(z)$ and the learned likelihood $p(x|z)$ (the decoder). When we train a VAE, we need to choose the class of model for $p(x|z)$ that correctly describes the data type of $x$ . For example, in the case of RGB images, people use discrete mixture of logistics to model the discrete pixel value range from 0 to 255.

From the paper, it is not mentioned what likelihood models are used in the decoder for such discrete sequential data type (maybe I missed some parts).

If I understand correctly, the authors proposed an estimator (denoted as $S(z)$ ) for $p(x|z)$ . The part I am not sure about is: since we already have a free $p(x|z)$ from the VAE, can we simply use $p(x|z)$ to evaluate the density? What are the benefits on using ScaLES instead?

问题

Table 1, what is column ‘ScaLES/Uncertainty’? Is it the ratio or another method?
Typos:
1. Page 2 line 097, “out-out-distribution”, —> “out-of-distribution”
2. Page 8 line 396, “samples are decoded into the latent space”?
Other than VAEs, are there any method using other sequence generative models (e.g., Transformers) for this task?

评论- additional results following your comments

2024-11-21

We thank the reviewer for their careful reading of our work and for identifying typos, which have been corrected in the revised manuscript. Indeed, "ScaLES/Uncertainty" describes the ratio and not another method.

Likelihood Clarification

First, we would like to clarify that the likelihood used in our approach corresponds to that of a sequence of L categorical random variables, each with D categories.

Second, we wish to emphasize that ScaLES is not an estimator of p(X=x∣Z=z), but of $p_{X(Z)}(X = x(z))$ . This distinction is depicted in Fig. 2 in the original manuscript, and we have added further mention of this fact in l.184-185. Specifically:

ScaLES does not assume a joint probabilistic model for X and Z.
Instead, we consider X as a deterministic function of Z—the random variable—mapped through the decoder.

This is an important distinction, as many existing methods, such as the uncertainty method proposed by Notin et al., do assume that X follows the conditional distribution p(X=x|Z=z). Consequently:

ScaLES describes a single likelihood function pX(Z)(X = x(z)) over the output space.
p(X=x∣Z=z) defines a distinct likelihood function for each z.

Variational Inference and Likelihood

We would like to note that p(X=x|Z=z) is estimated through variational inference techniques, which are themselves approximate (e.g., training neural networks). Therefore, it holds no theoretical guarantees under any realistic assumptions. Rather, similar to ScaLES, it assumes that the decoder fits the data well, serving as a good-enough approximation to the true distribution.

New Likelihood-Based Score

Inspired by the reviewer’s insightful comments, we have incorporated a new score into our evaluation. This score quantifies the likelihood of the most probable output for a given latent vector z under p(X=x|Z=z), formally defined as:

$L(z) = max_x p(X=x|Z=z)$ .

Notably, to the best of our knowledge, employing this score as a constraint in LSO has not been explored in prior work.

Comparative Evaluation with Likelihood Score

To facilitate a direct comparison between these two scores, we reproduced our entire evaluation section (Tables 1–9) to include the likelihood score. We are happy to report that our findings indicate that, in general, ScaLES outperforms the likelihood score as a scoring metric. Specifically:

For the average rank of the top 20 solutions, ScaLES achieved the lowest average rank of 1.83 (lower is better) compared to 2.8 for the likelihood approach. (l.480-481)
For the top 1 solution, ScaLES obtained the lowest average rank of 1.97, outperforming the likelihood approach, which achieved an average rank of 2.93. (l.1024-1025)

Addressing Transformer Network Study

Lastly, we thank the reviewer for the suggestion to study Transformer networks. We would like to emphasize (Table 2 in the original manuscript) that we study 10 decoders with a Transformer architecture. Our study is restricted to VAEs, as all past literature we are aware of conducted LSO using a VAE.

2024-11-25

Dear reviewer, did our response address your concerns and questions? If not, we would love to carry out additional experiments and/or provide further clarification.

2024-11-25

I am really grateful for the additional experiments, and the new results are indeed quite interesting. I am also surprised that this likelihood based approach has not been examined in the prior work.

However, I am still not convinced by the explanation of the difference between ScaLES and $p(x|z)$ . In particular, the additional experiments show that the likelihood based approach perform good as well. Maybe it is worth to investigate a bit more into the difference.

Overall, based on the current story of the paper and the rebuttal, I will keep my score.

2024-11-26

We thank the reviewer for their thoughtful critique of our work. We are pleased to note that the reviewer does not dispute our conclusion that LES outperforms the likelihood score, as supported by the results presented in Tables 2, 3, 8, and 10. Below, we summarize these findings:

Table 2: LES outperforms the likelihood score in identifying OOD data points, achieving an average AUROC of 0.93 compared to 0.91 for the likelihood score (l.377).
Table 3: LES regularization demonstrates superior optimization results for the top 20 solutions:
- Average ranking (lower is better): LES achieves a value of 1.83, compared to 2.8 for the likelihood score.
- Number of times a solution is within 1 standard deviation of the best solution (higher is better): LES achieves 21 instances, compared to 14 for the likelihood score.
Table 8: LES regularization improves optimization results for the best solution found:
- Average ranking (lower is better): LES achieves 1.97, compared to 2.93 for the likelihood score.
- Number of times a solution is within 1 standard deviation of the best solution (higher is better): LES achieves 18 instances, compared to 16 for the likelihood score.
Table 10: LES regularization delivers a higher percentage of valid solutions across all tasks, with an average of 0.61 compared to 0.58 for the likelihood score.

We would like to emphasize that our manuscript explicitly recognizes the likelihood score as “a more computationally efficient alternative, which comes with some performance trade-offs” (lines 308-309).

We also appreciate that the reviewer does not dispute the conceptual differences between the two approaches, as highlighted in both our previous response and the work of Nalisnick et al., cited by Reviewer 3. To reinforce this point, we include relevant excerpts from Nalisnick et al.:

“The VAE and many other generative models are defined as a joint distribution between the observed and latent variables. However, another path forward is to perform a change of variables. In this case x and z are one and the same, and there is no longer any notion of a product space X × Z.”

“The change of variables formula is a powerful tool for generative modeling as it allows us to define a distribution p(x) entirely in terms of an auxiliary distribution p(z), which we are free to choose, and f.”

Beyond the quantitative and conceptual differences highlighted above, we conducted an additional experiment to further demonstrate the advantages of LES in recovering the true density.

In this experiment, we consider a d-dimensional (d=25,56,75 and 256 to match the latent space dimensions in our study) Gaussian vector (z) transformed into a vector of “probabilities” using the softmax transformation. For each data point, we calculated LES and the likelihood score, along with the true density of X under the softmax transformation. To visualize the differences, we sampled evenly spaced data points between -20 and 20 along the first dimension of z, while sampling the other dimensions from a Gaussian distribution with a standard deviation of 0.1. These results, presented in Appendix C1 of the revised manuscript, clearly demonstrate that LES provides a more accurate estimate of the true density of X. In contrast, the likelihood score fails to capture the true density's correct structure.

To summarize, we have demonstrated the following differences between LES and the likelihood:

Mathematical difference: LES and the likelihood score are based on different formulas.
Conceptual difference: The two approaches make distinct probabilistic assumptions.
Large differences in density estimation: LES outperforms the likelihood score in a toy model setting.
Performance difference in LSO: LES demonstrates superior results when applied to LSO.

[1] Nalisnick, Eric, et al. "Do Deep Generative Models Know What They Don't Know?." International Conference on Learning Representations. 2019

2024-11-28

Dear reviewer, did our response help in clarifying the difference?

评论- General response

2024-11-21

We would like to thank all the reviewers for their time and careful reading of our work. We are delighted that the reviewers agree on the need to mitigate over-exploration in LSO and on the ability of our proposed method to help alleviate that limitation.

We now summarize the main points raised by the reviewers and the consequential changes to our manuscript.

Changes to the Method Name and Title

Following R2’s suggestions to clarify the exact meaning of the term “scalable,” we have revised the name of the method to Latent Exploration Score (LES) (we will use ScaLES and LES in our responses interchangeably) and the title of the paper to “Mitigating over-exploration in latent space optimization using LES.” This change reflects the fact that the determinant calculation does not scale gracefully to the latent dimensions of VAEs. While most existing VAEs in LSO have manageable latent dimensions, this may not be the case for other modalities or tasks. We hope this change will prevent any confusion for the readers.

The Difference Between LES and p(X=x|Z=z)

R1 asked us to clarify whether LES serves as an estimator of p(X=x|Z=z), and if so, why we wouldn’t directly use p(X=x|Z=z) (i.e., the probability of the generated output given the latent vector). In response, we have revised the manuscript (l.184-185) to emphasize that LES estimates $p_{X(Z)}(X=x(z))$ , assuming that x (which, in this case, is the entire probabilities' matrix as defined in eq. 9, l.215) is a deterministic function of z, rather than a random variable that follows a conditional distribution.

Motivated by this comment, we explored the use of $L(z) = max_x p(X=x∣Z=z)$ as a regularization method for LSO, which we refer to as the likelihood baseline. This metric captures the probability of the most likely output given z, aligning with how sequences are typically generated in LSO. To the best of our knowledge, this approach has not been examined in prior work. As part of this investigation, we reproduced all the results from the original manuscript, incorporating this new baseline (Tables 1–9 in the revised manuscript have been updated accordingly).

Our results demonstrate that, while the likelihood score serves as a strong baseline, LES outperforms it. Specifically:

For the average rank of the top 20 solutions, ScaLES achieved the lowest average rank of 1.83 (lower is better) compared to 2.8 for the likelihood approach. (l.480-481)
For the top 1 solution, ScaLES obtained the lowest average rank of 1.97, outperforming the likelihood approach, which achieved an average rank of 2.93. (l.1024-1025)

Why the CPA Assumption is Needed

R2 sought clarification on the necessity of assuming that the decoder can be represented as a CPA, noting that the change-of-variables term can be computed without this assumption. We addressed this (l.264-269) by explaining that one can calculate LES without the CPA assumption, but this assumption is essential for several reasons:

It enables a closed-form expression for the derivative of LES.
It avoids the computationally expensive process of calculating the decoder's Hessian when computing the derivative of LES.
It provides a geometric perspective on how the decoder function contracts or dilates the latent space.

Over-Claiming in Writing

In response to suggestions from R2 and R3, we have revised the manuscript to remove the terms: “exact,” “no hyperparameters,” “theoretically motivated,” “general-purpose,” and “scalable.” We hope these changes better reflect our proposed methodology and its merits.

评论- Rebuttal summary

2024-12-04

We sincerely thank all the reviewers for their and feedback and engagement during the rebuttal period.

To the best of our understanding, all the reviews agree that our proposed method out-performs existing baselines in terms of LSO performance.

The only remaining issue is the difference between our approach and the likelihood as calculated using decoder probabilities (p(x|z)), which we addressed in detail in our response to reviewer o3Wo. We briefly summarize the main points here for convenience:

We (and others, e.g., [1]) have highlighted the conceptual differences between these two approaches. Specifically, a likelihood score derived through a change-of-variables is based on different probabilistic assumptions (e.g., X as a deterministic function of Z vs. X as a random variable conditioned on Z) compared to one calculated directly from the logits. Notably, in the context of LSO, X is typically considered the most probable sequence and is therefore treated as a deterministic function of Z.
Our experimental evaluation demonstrates a clear performance gap between the two approaches, with our method consistently outperforming the alternative.
Additionally, we conducted an experiment to show that the likelihood approach does not accurately describe the density of a random vector under the softmax transformation, highlighting the differences between the two methods (Fig. 3, l. 1057). Specifically, the likelihood score saturates, failing to capture the correct shape of the distribution.

Building on the above, we believe we provide a compelling explanation of the differences between the two scores, supported by both conceptual insights and empirical evidence, which we hope will be considered.

[1] Nalisnick, Eric, et al. "Do Deep Generative Models Know What They Don't Know?." International Conference on Learning Representations. 2019

AC 元评审

2024-12-21

This paper seeks to address the overexploration problem that exists in latent space optimization, where sufficiently powerful optimizers can easily induce a VAE to produce highly "out of distribution" samples (e.g., unrealistic molecules) in order to achieve high scores. The authors' score essentially involves scoring molecules based on their density in the latent space after the decoder transformation, computed via change-of-variables. For strengths, I think the approach is fairly clever, obviously moves a step beyond simple likelihood estimation, and seems to achieve reasonable results.

However, the experimental setup is very vague in a way that makes Table 8 extremely hard to interpret, and potentially makes LES look significantly better. To be completely frank, I think some of these unanswered questions alone mean that the paper absolutely needs another round of review with additional clarity. I detail my concerns extensively below.

Why are we reporting the average of top 20 solutions in the main text, with the top 1 solution only appearing in the supplementary materials? This is an extremely non-standard thing to report when the methods involved don't explicitly try to optimize for top 20 performance. While some methods exist in the literature in LSO that do this (e.g., Maus et al., 2023), diversity constrained optimization is not what's being done here, and this is never explained as far as I can tell.
[Read the whole point here before you think I've misunderstood the purpose of the paper.] Some of the baselines in the paper do not achieve anywhere near previously reported experimental result values even when looking at the top 1 table in the appendix (e.g., TuRBO with SELFIES 25 achieves [0.49, 0.31, 0.49] vs [>0.75, >0.90, >0.65] in Maus et al., 2022. Now, there are two possible completely reasonable explanations for this:
- Possible explanation one: proposals made by TuRBO are being filtered out via the rd_filters check, and the method is now failing because it is oblivious to the authors' (reasonable) proxy definition for a "valid" molecule. This would be the best possible explanation, because it means that the authors method is working very well!
- Possible explanation two: the methods are not being run to convergence, or for an insufficient evaluation budget. This would be less good: even if the baselines stumble due to the validity checks, if they eventually produce solutions that pass these checks the story needs to now be about optimization efficiency, but optimization performance as a function of time is not evaluated anywhere in the paper.

The crux of the problem is that it's extremely unclear from the paper which of the above two factors is dominating here.

For bullet point 2-(1), the only real ablation of the impact of filtering is the "fraction valid" table in the appendix (Table 10), which (a) are not convincingly better on every task we see much worse performance of the validity unaware methods, and (b) don't tell the whole story. Assuming the authors are doing the reasonable thing and "invalid" solutions (e.g., as measured by rd_filter) are thrown out post hoc after the run has concluded (again this should be detailed), what we need to see here is a comparison between the unfiltered scores and the filtered scores. A large gap here would convincingly indicate that yes, the authors' implementation of these baselines is working as intended, they just aren't producing valid solutions.

For point 2-(2), at a minimum the evaluation budget needs to be listed. Much better, and arguably solving both parts of concern 2, would be to simply plot both unfiltered and filtered top 1 score as a function of the evaluation budget. Such a plot would let us see (1) yes the unfiltered versions of the algorithms eventually achieve their expected performance, but (2) as soon as you filter performance levels off much lower because these methods are validity oblivious. Or, alternatively, perhaps it's the case that the unfiltered optimizers do eventually achieve good filtered scores, they just do so much slower. The results currently in the paper might indicate this is not likely to be the case, but that really depends on the evaluation budget used here.

I'm almost at character limit, so I'll just mention here that I don't mean the above lengthy points to be overwhelming criticism, I actually really quite like the method in this paper, and think it's much more clever than e.g. simple likelihood scoring or similar. However, the way the results are presented to the reader make it very hard to evaluate a few extremely key aspects of the work in a way that makes me think the entire experimental presentation simply needs to be redone.

审稿人讨论附加意见

I think the authors largely addressed most of the reviewer concerns here, and were quite detailed in their updates. Overall, I think the paper is quite close to the bar even with my additional concerns above, but my own concerns combined with a few still slightly hesitant reviewers just didn't push me over the edge here.

最终决定Reject

2025-01-22

Reject