PaperHub
Overall rating: 3.5/10 (withdrawn)
4 reviewers · min 3 · max 5 · std 0.9
Individual ratings: 3, 3, 3, 5 (average 3.5)
ICLR 2024

Rare Event Probability Learning by Normalizing Flows

Submitted: 2023-09-16 · Updated: 2024-03-26
TL;DR

We propose a method that utilizes normalizing flows to accurately estimate the occurrence probability of rare events.

Abstract

Keywords

rare event estimation, normalizing flows

Reviews and Discussion

Review
Rating: 3

The authors introduce rare event sampling via normalizing flows. For this they parameterize the rare event set via a function $g$ such that the rare event set is the set of points where $g \leq 0$. Then they introduce a sequence of decreasing sets $\Omega_{a_i}$ that converges to $\Omega$ for $i = M$. A normalizing flow is then trained to approximate each set $\Omega_{a_i}$, which corresponds to a temperature schedule for the rare event probability measure. The normalizing flows are each trained on their own using the reverse KL, and the weights up to flow $i-1$ are frozen when training flow $i$. The approach is benchmarked against other rare event sampling methods such as SUS, SIR, etc. on toy examples of varying dimension.
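The training scheme summarized above (a reverse-KL objective, a smoothed relaxation of the rare event indicator, and warm starts across nested subset levels) can be sketched on a 1-D toy problem. Everything here is illustrative: a two-parameter Gaussian stands in for the flow, a sigmoid stands in for whatever relaxation the paper uses, and finite differences stand in for autodiff; none of it reflects the authors' actual implementation.

```python
import math
import random

random.seed(0)

def log_normal(x, mu=0.0, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_target(x, a, tau=0.1):
    # Unnormalized tempered target: N(0,1) density times a sigmoid relaxation
    # of the indicator of the subset {x : g(x) <= 0}, with g(x) = a - x.
    s = (x - a) / tau
    log_sig = s if s < -30 else -math.log1p(math.exp(-s))  # stable log-sigmoid
    return log_normal(x) + log_sig

def reverse_kl(params, zs, a):
    # Monte Carlo estimate of E_q[log q(x) - log p~(x)] via reparameterized samples.
    mu, log_sigma = params
    sigma = math.exp(log_sigma)
    loss = 0.0
    for z in zs:
        x = mu + sigma * z
        loss += log_normal(x, mu, sigma) - log_target(x, a)
    return loss / len(zs)

def train_level(params, a, steps=250, lr=0.05, n=300, eps=1e-4):
    # Central finite differences with common random numbers stand in for autodiff.
    for _ in range(steps):
        zs = [random.gauss(0.0, 1.0) for _ in range(n)]
        grads = []
        for i in range(len(params)):
            up = list(params); up[i] += eps
            dn = list(params); dn[i] -= eps
            grads.append((reverse_kl(up, zs, a) - reverse_kl(dn, zs, a)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

params = [0.0, 0.0]                    # (mu, log_sigma) of the Gaussian proposal
for a in [0.0, 1.0, 2.0, 3.0, 4.0]:    # nested subsets, shrinking toward the rare set
    params = train_level(params, a)    # warm start at each level (the "anchor" idea)
```

By the last level the proposal has concentrated near the rare region $x \geq 4$; on this unimodal toy target the mode-seeking behavior of the reverse KL is harmless, which is exactly the caveat raised in the weaknesses below.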

Strengths

The paper does a good job at explaining its approach. The experimental results seem impressive and its design choices seem well-motivated via ablation studies. Furthermore, using normalizing flows makes a lot of sense for this kind of task.

Weaknesses

  1. I am not convinced of the novelty of this approach. This paper mostly cites pre-2021 papers. Please clarify the relation to more modern approaches such as [1,2].

  2. The flows are trained with the reverse KL. This comes with some caveats. First, one assumes differentiability of the function $g$. Please comment on whether this is realistic. Furthermore, the reverse KL is known to be mode-seeking. I think for most applications in the field of rare event sampling it is crucial to cover all the modes of a density. There has been a recent line of work for normalizing flows, such as [3], to overcome this, but this seems like a major limitation.

  3. Similarly, the evaluation should also include some measure of the distance to the true measure and not only the estimated probability. As far as I understand the paper, this should be possible.

  4. Please also cite relevant papers such as [4], who introduced a kind of log-det schedule for covering multimodal distributions, which I think is related to the way the different $\Omega_{a_i}$ are constructed.

  5. This paper does not come with any code. Do the authors intend to make their code public? Appendix C does not suffice for reproducibility in my opinion.

  6. The heuristic for why MCMC won't cut it for this problem makes sense for vanilla MH. But if one takes gradient-informed steps, as in HMC or MALA, I am not sure why the rationale outlined in Section 3.3 should hold true. What is the proposal for MCMC used in the experiments?

[1] A Flow-Based Generative Model for Rare-Event Simulation, Gibson et al

[2] Conditioning Normalizing Flows for Rare Event Sampling, Falkner et al

[3] Flow Annealed Importance Sampling Bootstrap, Midgley et al.

[4] Deep Probabilistic Imaging: Uncertainty Quantification and Multi-modal Solution Characterization for Computational Imaging, Sun et al.
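On point 6 above: a gradient-informed MALA step is straightforward to write down. The sketch below targets a plain standard normal purely for illustration (it is not the paper's setup); the point is the Langevin drift along the score plus the Metropolis-Hastings correction.

```python
import math
import random

random.seed(1)

def log_p(x):           # toy target density (standard normal, unnormalized)
    return -0.5 * x * x

def grad_log_p(x):      # score function: the gradient information MALA exploits
    return -x

def mala_step(x, eps=0.5):
    # Propose with Langevin drift along the score plus Gaussian noise.
    mean_fwd = x + 0.5 * eps ** 2 * grad_log_p(x)
    y = mean_fwd + eps * random.gauss(0.0, 1.0)
    # Metropolis-Hastings correction for the asymmetric Gaussian proposal.
    mean_bwd = y + 0.5 * eps ** 2 * grad_log_p(y)
    log_q_fwd = -((y - mean_fwd) ** 2) / (2 * eps ** 2)
    log_q_bwd = -((x - mean_bwd) ** 2) / (2 * eps ** 2)
    log_alpha = log_p(y) - log_p(x) + log_q_bwd - log_q_fwd
    return y if random.random() < math.exp(min(0.0, log_alpha)) else x

x, samples = 0.0, []
for t in range(20000):
    x = mala_step(x)
    if t >= 1000:                       # discard burn-in
        samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Whether such a chain still mixes into a genuinely rare region in the paper's settings is exactly the question the review raises; this sketch only shows what "gradient-informed" means concretely.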

Questions

See weaknesses. I think the paper follows a nice idea, has several benchmarks, but does a poor job at literature review. Also I think uploading the code is very important for reproducibility, since this paper is mostly applied.
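To make the mode-seeking concern from weakness 2 concrete: fitting a unit-width Gaussian to a symmetric bimodal target by minimizing the reverse KL locks onto a single mode, while the forward KL spreads over both. The target, grids, and grid-search optimizer below are arbitrary choices for illustration, not anything from the paper.

```python
import math

def log_mix(x):
    # Bimodal target: equal mixture of N(-3, 1) and N(3, 1).
    a = math.exp(-0.5 * (x + 3.0) ** 2)
    b = math.exp(-0.5 * (x - 3.0) ** 2)
    return math.log(0.5 * (a + b) / math.sqrt(2 * math.pi))

def log_q(x, mu):
    # Unit-width Gaussian approximation with mean mu.
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

DX = 0.01
XS = [-10.0 + DX * i for i in range(2001)]  # quadrature grid on [-10, 10]

def kl(log_a, log_b):
    # KL(a || b) by Riemann sum over the grid.
    tot = 0.0
    for x in XS:
        la = log_a(x)
        tot += math.exp(la) * (la - log_b(x))
    return tot * DX

mus = [-5.0 + 0.1 * i for i in range(101)]
# Reverse KL(q || p) is mode-seeking: the optimum sits on one of the modes (+-3).
rev = min(mus, key=lambda m: kl(lambda x: log_q(x, m), log_mix))
# Forward KL(p || q) is mass-covering: the optimum sits between the modes (near 0).
fwd = min(mus, key=lambda m: kl(log_mix, lambda x: log_q(x, m)))
```

If the rare event set splits into several disconnected components, the reverse-KL-trained proposal can drop all but one of them, which is why the reviewer flags this as a major limitation.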

Review
Rating: 3

The paper introduces a technique for rare event sampling that combines normalizing flows with importance sampling. The authors refer to this technique as NOFIS (NOrmalizing Flows assisted Importance Sampling). They justify their work by highlighting the limitations of standard sampling algorithms, such as MCMC, in sampling regions of low probability, where the density, denoted $p$, is approximately $10^{-X}$, with $X$ being an integer greater than 4. In this context, known as the rare event sampling regime, algorithms like MCMC would require an impractical number of samples, rendering these approaches highly inefficient. The authors propose that normalizing-flow-aided importance sampling holds promise as a solution to this problem.
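The importance-sampling mechanics being summarized can be illustrated on a textbook Gaussian tail: naive Monte Carlo essentially never hits the event at modest sample sizes, while a proposal shifted onto the rare region recovers the probability accurately. The hand-picked shifted Gaussian below stands in for the learned proposal; it is not the paper's method.

```python
import math
import random

random.seed(0)

def norm_pdf(x, mu=0.0):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

a = 4.0                                       # rare event {X >= a}, X ~ N(0, 1)
true_p = 0.5 * math.erfc(a / math.sqrt(2.0))  # exact tail probability, ~3.17e-5
n = 20000

# Naive Monte Carlo: the event is essentially never observed at this sample size.
naive = sum(1 for _ in range(n) if random.gauss(0.0, 1.0) >= a) / n

# Importance sampling with a proposal centered on the rare region, q = N(a, 1).
total = 0.0
for _ in range(n):
    x = random.gauss(a, 1.0)
    if x >= a:
        total += norm_pdf(x) / norm_pdf(x, a)  # importance weight p(x) / q(x)
is_est = total / n
```

With the shifted proposal roughly half the draws land inside the event, so the estimator's relative error at this sample size is on the order of a percent, whereas the naive estimate typically counts zero hits.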

Strengths

  • The paper flows smoothly and is enjoyable to read.
  • The authors provide a great level of detail and do not take anything for granted, which I appreciate.

Weaknesses

  • Novelty: I don’t find much novelty in the proposed paper. The technique presented by the authors has already been explored in many prior works in different fields, particularly in physics, where rare event sampling is often a challenging problem (see below).

  • Related Works: Although many prior works combining normalizing flows with importance sampling (and beyond) exist, this paper lacks a dedicated Related Work section. Several seminal works have been completely overlooked despite their significant contributions to the field of normalizing-flow-aided importance sampling in statistical physics [1], chemistry [2], and quantum field theory [3,4,5].

  • Annealed Importance Sampling: There is no reference to annealed importance sampling [6], which I believe is closely tied to the idea of the paper. Besides [6], several relevant works [7,8,9] perform annealed importance sampling within the context of normalizing flows, falling within the same category as the CRAFT method referenced in the paper, though only marginally. What these methods do closely aligns with what the authors propose: instead of learning the target distribution in one step, they 'anneal' towards that distribution by learning and sampling from intermediate distributions, ensuring that the final learned probability density has as much support as possible, including regions where the target density is small enough to fall within the rare event regime. For this paper to be published in this or any other venue, I believe it is crucial to highlight the connection to these (and the previously referenced) works.

  • Rare Event Sampling: A recent paper [10] discusses similar behaviors in training normalizing flows and combining them with importance sampling to ensure full support over the target density, including rare event regions. I would find it interesting if the authors commented on this work within the context of their findings. Some of the metrics and tools proposed in [10], such as the mode-dropping estimator, could also be used to assess the performance of a sampler in approximating regions of low probability where a shallow sampler is likely to lose some of the probability mass.

  • Idea of Anchor Points: The notion of anchor points has implicitly been explored in some of the prior works mentioned above, albeit with a slightly different connotation that may have escaped the authors' attention. For instance, in the paper by Kanwar et al. [4] (Fig. 4), the authors use a technique very similar to what is suggested in this paper, although with slightly different connotations (e.g., they use previously trained flow-based models as starting (anchor) points to sequentially train more challenging distributions).

  • Additional Related Works: Other closely related works, such as [11], are not mentioned in the manuscript despite having similar titles. This may cause confusion for potential readers.

  • Experiments: I find the results presented in the paper not entirely convincing. Although the authors compared their approach to a large set of baselines, this alone does not seem sufficient to claim the superiority of the proposed method. I am surprised that the proposed approach is not compared against prior works, such as Annealed Importance Sampling with Normalizing Flows [7], and naive RealNVP training with a sufficiently large number of couplings and no anchor points.

As a side note, I strongly recommend that the authors conduct an extensive literature search to include and acknowledge existing prior works, and eventually compare and discuss potential differences and similarities.

Questions

  • I'd like to see how the author would compare their work (and its corresponding novelty) to previous works. In particular, I'd like to see comparisons with Refs. [6-9] for the annealing aspect and Ref. [10] for the theoretical discussion regarding low-support regions (e.g., the rare event regime). Furthermore, discussing the differences concerning Ref. [11] would be helpful for the readers.

  • I'd appreciate if the authors could perform an extensive literature search and create a Related Work section to place their paper in the context of existing prior works. Please refer to Refs. [1-11].

  • I found the last paragraph in Section 3.1 and the discussion in Appendix B to be a bit unintuitive. It has been shown in the literature that using Forward KL, instead of Reverse KL, generally results in larger support and, therefore, has some benefits when combined with importance sampling. In that sense, I am surprised by the authors' claim that training using Forward KL deteriorates performance. Do the authors consider the case where NO samples are given from the target density? If so, then I may understand this point. Otherwise, when a sample set from the target density, even if small, is available, it should be possible to show that training with Forward KL is feasible.

  • It would be informative to see the density plot from Figure 4 for the other baselines as well.

  • On page 8, referring to Figure 4, the authors write "[…] the right part further reveals that when increasing $N_{IS}$, the estimation could become even more accurate." This result seems neither novel nor unexpected. Indeed, it was already demonstrated in prior works, as seen in [1,5], that the variance of importance sampling estimators scales as $N^{-1}$, with $N$ being the number of samples. Could the authors maybe comment on this?
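The $N^{-1}$ variance scaling invoked here is easy to verify empirically: quadrupling the sample size of a plain Monte Carlo probability estimator should halve its standard deviation. A minimal check (the Bernoulli probability is an arbitrary stand-in, not the paper's estimator):

```python
import math
import random

random.seed(2)

def mc_estimate(n, p=0.1):
    # Plain Monte Carlo estimate of a probability p from n Bernoulli draws.
    return sum(1 for _ in range(n) if random.random() < p) / n

def empirical_std(n, reps=2000):
    # Standard deviation of the estimator across independent repetitions.
    ests = [mc_estimate(n) for _ in range(reps)]
    m = sum(ests) / reps
    return math.sqrt(sum((e - m) ** 2 for e in ests) / reps)

# 4x the samples -> the standard deviation should drop by a factor of ~2.
ratio = empirical_std(100) / empirical_std(400)
```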

Minor

  • The quality of the plots on pages 7-8 is quite poor. The axis labels are missing, and the font size for the x-y tick labels is too small.

  • As a side note, I sometimes find the MK notation a bit confusing. However, I understand that it would require a substantial effort to rewrite the manuscript and adapt a clearer notation. Nevertheless, this may be feedback worth keeping in mind for future iterations of the manuscript.

  • I find it somewhat unintuitive to completely relegate the discussion of the datasets to the appendix. Perhaps the authors could add corresponding references in the main text when mentioning the datasets and also refer to the Appendix for further details.

  • In the conclusion, statements like using nested subset events as bridges strongly remind me of annealed importance sampling. I believe that a discussion comparing the present method to AIS, highlighting potential differences or connecting them through their analogies, is an essential element currently missing from the manuscript.

References:

Review
Rating: 3

The authors apply a normalizing flow model to rare event probability estimation, defined as the regime where the probability is less than $10^{-4}$. The normalizing flow learns proposal distributions, and the rare event probability is then estimated via importance sampling on the learned proposal distribution.

Strengths

The paper is well presented, and using normalizing flows to assist with importance sampling (as compared to the other way around, which has been done) is new.

Weaknesses

Freezing seems to provide only a marginal advantage over non-freezing. The main advantage, as the authors propose, is in speed, but that's not particularly central to the paper, as speed is measured by function calls and not wall-clock time. If we remove step 5 from NOFIS then most of the method is not particularly distinguishable from standard normalizing flows.

In addition, if we're looking for just samples from the proposal distribution, what's the advantage of using NFs over other generative models? If there is a lack of distinguishing features, then the middle portion on NFs specifically might not be needed in lieu of a general generative model construction.

Questions

Figure 2: Overlay highlighted green areas - not sure if I see the highlights?

What about just using the normalizing flow to directly estimate the likelihood of the rare event?

Review
Rating: 5

The paper proposes to use normalizing flows to sample rare events. The neural networks learn the proposal distribution for the importance sampling and then use importance sampling to estimate the rare event probability. The numerical experiments show that the proposed method uses fewer function calls and has smaller errors in the average of the estimation.

Strengths

  1. The motivation and the problem statement are clear. The paper is also easy to follow.
  2. The implementation details about the algorithm are well-explained and the math of the method is also well-written.
  3. The numerical section shows experiments with synthetic data and real-world data with multiple dimensions. The paper also compares the proposed method with five other baselines.

Weaknesses

  1. The experiments only go up to dimension 62, and the paper does not explain why sampling rare events at this dimension is difficult. How would the comparison look against traditional sampling methods, like Metropolis sampling?
  2. The method's speedup and precision improvement are not clear from the language used in the text.
  3. The experiments in Figures 2 and 3 look unrelated to rare event sampling and instead show the effectiveness of the method at approximating a given distribution. It would be beneficial to get more ideas on what these figures tell us.

Questions

  1. Does the number of anchors matter in your experiments?
  2. How do you determine the training is complete?
  3. For Tables 1 and 2, do you have the measurement of time in seconds? When you say function call, does it always take the same time for different methods? If the numbers include the time of training the neural networks, would the proposed method still be faster than other methods, especially non-ML methods?
  4. It would also be useful to see the confidence interval from the 20 estimations. Do you have them?