PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 4, 4, 3, 3 (min 3, max 4, std 0.5)
ICML 2025

Importance Corrected Neural JKO Sampling

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-13
TL;DR

We accurately sample from unnormalized densities by alternating local flow steps with non-local rejection steps.

Abstract

Keywords
Sampling · Wasserstein Gradient Flows · Normalizing Flows · Rejection Sampling

Reviews and Discussion

Review
Rating: 4

This paper presents a method to sample from a probability distribution known through its density, up to an unknown normalizing constant. The method follows the trend of neural parameterizations to solve the proximal steps of the JKO scheme to compute the Wasserstein gradient flow of the reverse Kullback-Leibler divergence. The neural parameterization is based on the Benamou-Brenier formulation of optimal transport, which makes it possible to write every step of the JKO scheme as a neural ODE and thus to parameterize the density at a given step via a continuous normalizing flow, which is tuned to minimize the reverse KL divergence. A few theoretical properties of the resulting scheme are presented (mostly in the case of a log-concave energy). Besides, to counter the fact that neural JKO schemes essentially explore the energy landscape locally, the authors propose to introduce rejection steps to enhance sampling. They modify the classical importance sampling scheme by using a proposal coming from the neural JKO scheme (to mitigate the curse of dimensionality) and by allowing the resulting probability density to be propagated, so that subsequent JKO steps can be resumed. The method is tested on a benchmark of several distributions with closed-form densities and various properties (multimodal, with narrow high-energy regions, various dimensions), and reports improvements over a number of competing methods (classical sampling methods, or more recent JKO-, NF- or diffusion-based ones). The quality of the generated samples is measured by the energy distance to ground-truth samples or by KL divergence values.

Update after Rebuttal

I thank the authors for their responses and maintain my positive opinion on the paper. I maintain my score of 4.

Questions for Authors

See my questions above about

  1. the existence of rejection schemes combined with generative models
  2. the use of JKO for sampling and the positioning of the present instantiation of neural JKO within the literature.

Claims and Evidence

The paper's main claim is the capacity of the proposed method to escape local maxima of the target density and prevent mode collapse thanks to the proposed rejection scheme. This is indeed supported by the experiments, where the method shows better capacity to sample from multimodal distributions. The proposed rejection scheme combines well with the neural JKO scheme because it allows densities to be propagated, which is essential to continue sampling using the JKO scheme.

Methods and Evaluation Criteria

The method is tested on several sampling tasks where the target distribution is known and enjoys tractable sampling for comparisons. The quantitative evaluation criteria make sense, i.e. using MMD for two-sample hypothesis testing and using estimates of log normalization constants. The method is tested against several types of methods: classical sampling methods (MALA, HMC) that are known to have trouble handling multimodality, more recent generative-model-based methods (DDS based on diffusion models, their own neural JKO without resampling), and finally CRAFT, which is based on sequential Monte Carlo methods.

Theoretical Claims

The novel theoretical results are in Corollary 3.2, Thm 3.3, Thm 4.2 and Corollary 4.3.

The first two results concern the convergence of JKO steps, reformulated using a dynamical OT formulation, towards the WGF curve of the reverse KL divergence in the log-concave case, where the KL divergence is geodesically convex (which is not the case targeted by the method). This is not a new result per se, but the statement that the functional G (detailed in App. F) is geodesically convex for small tau even when F is not is a strong motivation to use the JKO scheme instead of more direct minimization strategies. The other theorems showcase the good properties of the proposed rejection scheme and how they combine well with the JKO scheme. I only checked the proof of Thm 4.2.

Experimental Design and Analysis

While the combination of the JKO scheme and the rejection sampling outperforms all other methods, it would have been interesting to include in the comparison a few methods based on rejection schemes. Naive ones (rejection sampling or importance sampling) are bound to perform poorly in the cases that are tested, due to high rejection rates or the curse of dimensionality, but surely there must be more recent methods using similar concepts? Namely, the idea of using proposal distributions that are less naive than the default choice and are guided by a generative model has been used in some recent works.

Supplementary Material

I reviewed parts of the supplementary material, namely App. B1, C1, D, E and F. App. E is essential to fully understand the experiments, and App. F1 presents interesting considerations on neural JKO schemes.

Relation to Existing Literature

The paper seems mostly well positioned within the related literature on WGF, sampling algorithms and continuous NF. I still have two small concerns:

  • A number of papers on neural versions of the JKO scheme are cited, but the actual approach taken by the authors is not clearly positioned with respect to this work: what is the originality (if any) of the proposed neural parameterization (CNF version, Benamou-Brenier dynamical view on transport) compared to these studies? Are there, to the authors' knowledge, papers that already use JKO schemes specifically for sampling?
  • As in my comment above, the combination of rejection schemes using generative models as proposal distributions with more classical methods has been explored recently; can the authors say more about the relation between their work and those methods? I am thinking, for instance, of Gabrié et al. 2022 (and others cited in the introduction), which the authors cite but do not discuss much.

Missing Important References

See my comment above.

Other Strengths and Weaknesses

The theoretical analysis of WGF is interesting and gathers a number of results that are spread across the literature, which is nice, but it only applies in the ideal log-concave case, which is precisely the one the authors aim to extend with their method. This is a minor criticism, as the analysis is much harder in the non-log-concave case and the authors give convincing arguments about the well-behavedness of the JKO scheme in the nonconvex case (App. F).

The rejection scheme seems efficient and importantly is not at odds with the need to propagate densities in the JKO scheme, which is a nice result.

Other Comments or Suggestions

There are a number of typos in the paper that I did not list (e.g. Lebesgue, not Lebesque) but should not survive a careful proofreading of the paper.

Author Response

We would like to thank the reviewer for the detailed and valuable feedback.

Convexity Assumption

There appears to be a misunderstanding regarding the assumptions for the theoretical part: We do not assume that the density is log-concave. Instead, Assumption 3.1 assumes that the functional is $\lambda$-convex along generalized geodesics. For the KL divergence, this corresponds to the assumption that the target energy $-\log(q)$ is $\lambda$-convex for some $\lambda\in\mathbb R$. We stress that this explicitly includes the case of negative $\lambda$. This assumption is much weaker than log-concavity and is automatically fulfilled if $-\log(q)$ is smooth (plus some asymptotics for $\|x\|\to\infty$). More intuitively, the condition can be rephrased as: $-\log(q)$ is $\lambda$-convex if and only if there exists some (possibly negative) $\lambda\in\mathbb R$ such that $-\log(q(x))-\frac{\lambda}{2}\|x\|^2$ is convex.
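
For twice-differentiable energies this condition has an equivalent differential form (a standard reformulation, added here only for illustration):

$$ -\log(q)\ \text{is}\ \lambda\text{-convex} \iff \nabla^2\big(-\log q\big)(x)\succeq \lambda I \quad\text{for all } x\in\mathbb R^d, $$

so any smooth energy whose Hessian is merely bounded from below, e.g. a Gaussian mixture with equal isotropic covariances, satisfies the assumption with a possibly negative $\lambda$.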

Given that this misconception appeared in more than one review, we will include this discussion in the final version and highlight that $\lambda$-convexity is usually not an issue as long as the target distribution is smooth.

Relation to other neural JKO Schemes

We stress that we consider the theoretical results (Cor 3.2, Thm 3.3) to be the main novelty of Section 3. As outlined at the beginning of Section 3, the implementation (Sec 3.2) is close to previous papers, specifically to (Vidal et al., 2023) and (Xu et al., 2024), who also use the dynamic formulation of the Wasserstein distance. But these papers consider the generative modeling setting instead of sampling. We are not aware of a reference which adapts the dynamic formulation for sampling, which requires some smaller technical differences (e.g., a different $\mathcal F$; we start with the latent distribution, while Vidal et al. and Xu et al. start with the data distribution). However, we stress that the main contributions of our paper are:

  • The theoretical analysis (Cor 3.2, Thm 3.3) of neural JKO schemes,
  • proposing importance-based rejection/resampling steps which can maintain the density of the generated samples (Section 4) and preserve their independence,
  • combining both into an importance corrected neural JKO sampler, which achieves state-of-the-art results.

In the introduction and at the beginning of Section 3 we already write that similar approximations of the JKO steps exist in the literature. We will add a reminder at the beginning of Section 3.2 (and specifically point to Vidal et al., 2023 and Xu et al., 2024 for the use of the dynamic formulation).

Literature on Rejection Schemes with Generative Models

Indeed, there exist some approaches in the literature that combine rejection steps with generative models. Many of these approaches are based on sequential Monte Carlo techniques (e.g., CRAFT, which we used as a comparison, but also Arbel et al. and Phillips et al.). These methods implement a reweighting step by first approximating importance weights and then sampling from the empirical distribution defined by these weights. However, for SMC-based methods the analytic evaluation of the arising density is usually not possible. Moreover, after SMC reweighting steps there exist, with high probability, several samples at the exact same position. Finally, the generated samples are not exactly independent.

The paper of (Gabrié et al.) proposes to iteratively train a normalizing flow for sampling by running a Langevin process, adding Metropolis steps with the current normalizing flow as proposal, and retraining the normalizing flow with the updated samples. In contrast to our paper, the rejection steps are not part of the model, but are rather used for training a normalizing flow. In addition, they cannot include these steps in the model, because this would require evaluating the density of these steps, which is not possible for the Metropolis algorithm. Since several papers from the literature [1] found that training normalizing flows to approximate distributions with disconnected modes (like GMMs) or non-Gaussian tails (like funnel, mustache) is difficult, this might limit the expressiveness of the model.

We will extend the discussion on these methods in the final version.

[1] https://arxiv.org/abs/1907.04481, https://arxiv.org/abs/2009.02994, https://arxiv.org/abs/2206.14476

Review
Rating: 4

This paper proposes to sample from an unnormalized probability density via a sequence of interleaved continuous normalizing flows (CNFs) and importance accept/reject steps. The CNFs, which are penalized with a velocity norm regularizer as in OT-Flow (Onken et al. 2021), are interpreted as Wasserstein proximal mappings applied to the reverse KL divergence loss functional by replacing the static form of the $W_2$ distance typically used to define the proximal mapping with its equivalent dynamic formulation. The authors use this interpretation to show convergence of their OT-regularized CNF velocity fields to the velocity field corresponding to the Wasserstein gradient flow of the reverse KL divergence as the proximal mapping step-size $\tau\to 0$, and that, for nonzero $\tau$, the velocity fields at each step correspond to the OT velocity fields between the starting and ending measures at each step. The CNF scheme is implemented by representing the velocity fields as neural networks and optimizing the parameters of the networks to minimize the dynamic OT-regularized CNF loss; hence the scheme is termed "neural JKO."
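
In symbols (a generic reconstruction from this summary; the notation here is not necessarily the paper's): one JKO step with step size $\tau$ solves

$$ \mu^{k+1} \in \operatorname*{arg\,min}_{\mu}\ \mathrm{KL}(\mu\,\|\,q) + \frac{1}{2\tau} W_2^2\big(\mu,\mu^{k}\big), $$

and the Benamou–Brenier formula replaces the static $W_2^2$ term by its dynamic form

$$ W_2^2\big(\mu,\mu^k\big) = \min_{(\rho_t,v_t)} \int_0^1 \int \|v_t(x)\|^2\,\mathrm d\rho_t(x)\,\mathrm d t \quad \text{s.t.}\ \partial_t\rho_t + \nabla\cdot(\rho_t v_t)=0,\ \rho_0=\mu^k,\ \rho_1=\mu, $$

which is exactly the velocity-norm regularizer that the neural-ODE/CNF parameterization minimizes over.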

To address the issue of slow or incorrect convergence known to plague CNFs when the target density is multimodal, the authors propose to insert "importance-based rejection steps" in between some of the CNF/Wasserstein proximal steps. These rejection steps consist, essentially, of one step of rejection sampling: each particle $X$ from the current ensemble (resulting from previous CNF and rejection steps) is rejected with probability $1 - \alpha(X)$, where $\alpha(X) = \min\left\{1, \frac{g(X)}{cf(X)}\right\}$, $g(X)$ is the unnormalized density of the target measure, $f(X)$ is the unnormalized density of the current ensemble, and $c > 0$ is a tuning parameter. Thus, each particle has a lower chance of getting rejected if the importance weight $\frac{g(X)}{f(X)}$ is large. If the particle is rejected, it is replaced by repeating the entirety of the previous CNF/rejection procedure (starting from the reference density) to generate a new sample from the current particle distribution. The authors show that it is possible to write the new density of the particle ensemble after it has been transformed by this one-step rejection procedure and that the rejection sampling step decreases the KL divergence to the target. The density information is carried through the rest of the CNF/rejection sampling procedure, and as such allows the procedure to be used for density estimation as well.
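
A minimal sketch of this rejection step as described above (the helpers `sample_pipeline`, `log_f`, and `log_g` are placeholders for the previously built sampler, its log-density, and the unnormalized target log-density; this is an illustration, not the authors' implementation):

```python
import numpy as np

def rejection_step(sample_pipeline, log_f, log_g, c, rng=np.random.default_rng()):
    """One importance-based rejection step as described in the summary above.

    sample_pipeline() -> one sample x from the current ensemble density f
                         (i.e., a re-simulation of all previous CNF/rejection layers)
    log_f(x)          -> log of the (unnormalized) current ensemble density at x
    log_g(x)          -> log of the unnormalized target density at x
    c                 -> positive tuning constant
    """
    x = sample_pipeline()
    # acceptance probability alpha(x) = min{1, g(x) / (c * f(x))}, computed in log space
    log_alpha = min(0.0, log_g(x) - np.log(c) - log_f(x))
    if np.log(rng.uniform()) < log_alpha:
        return x                      # particle is kept
    return sample_pipeline()          # otherwise: replace by an independent fresh draw
```

Because a rejected particle is replaced by an independent draw from the same pipeline, the output density stays tractable, which is what allows the subsequent JKO steps to continue.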

Numerically, the method is exercised on a variety of target distributions between two and 1600 dimensions and shown to generate samples which result in lower energy distance and higher estimates of the "log normalizing constant" in comparison to MALA, HMC, DDS, CRAFT, and Neural JKO. The experiments in the paper all, in the end, generate an ensemble of 50,000 samples, and the quality metrics for the samples are likewise computed over 50,000 samples.

Update after rebuttal

I thank the authors for their response. I have a positive view of this paper and will thus maintain my score.

Questions for Authors

  1. How does one choose the “schedule” of rejection layers and CNF layers? Would a heuristic like ESS, which is used in CRAFT/AFT, be useful here?

  2. How sensitive is the method to the initial choice of step-size $\tau_0$ and step-size schedule?

  3. Line 289 R: Is the curse of dimensionality in rejection sampling specific to use of a standard normal proposal distribution? Or would the issue occur with any fixed/non-tailored proposal? If it is the latter, you may want to emphasize this point to further highlight why the use of the tailored/CNF proposals is useful.
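
As a concrete illustration of this last point (a standard toy computation, added here; it is not taken from the paper): take the proposal $p=\mathcal N(0, I_d)$ and the target $q=\mathcal N(0,\sigma^2 I_d)$ with $\sigma<1$. Then the optimal rejection constant and acceptance rate are

$$ c \;=\; \sup_x \frac{q(x)}{p(x)} \;=\; \sigma^{-d}, \qquad \text{acceptance rate} \;=\; \frac{1}{c} \;=\; \sigma^{d}, $$

so already $\sigma=0.9$ and $d=100$ give an acceptance rate of about $2.7\cdot 10^{-5}$. A similar exponential degradation is typical for any fixed proposal whose per-coordinate mismatch with the target does not vanish, which is exactly what motivates the tailored CNF proposals.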

Claims and Evidence

All claims are sound and substantiated by theoretical and empirical results.

Methods and Evaluation Criteria

The proposed method on the whole is sensible in the context of similar, previously successful methods for sampling from unnormalized probability densities (e.g., continuous normalizing flows and OT-regularized variants, annealed flow transport Monte Carlo, etc.). In order to use the method, one must have access to the unnormalized density of the target measure and its score. One potential methodological issue, which is briefly mentioned in Remark 4.5, is that the rejection steps rely on the generation of new samples from the current particle distribution. In the context of the proposed method, this generation implies re-sampling from the reference distribution and re-simulating the previous sequence of CNF and rejection layers. The CNF layers shouldn't pose too much of an issue in themselves, as they involve integration of previously identified ODEs with relatively straight trajectories, but the previous rejection layers have the potential to dramatically increase the time required to regenerate another sample. That is, if a sample is rejected at any layer, we must start over at the reference distribution and re-do all of our previous CNF/rejection layers. If the acceptance probability is $1-r$, then the probability of successfully passing through $n$ previous rejection layers is $(1-r)^n$. This feature may result in long runtimes, as we see for the higher dimensional test distributions in Table 5.
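
As a rough order of magnitude for this concern (using the independent-layer approximation above, so purely illustrative): with $r=0.3$ and $n=5$ earlier rejection layers,

$$ (1-r)^n \;=\; 0.7^5 \;\approx\; 0.17, $$

i.e. only about one in six freshly regenerated samples passes all previous rejection layers without triggering yet another regeneration from the reference distribution.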

In the exposition of the paper it is not made explicitly clear how this resampling is performed in the rejection steps -- i.e., the rejection procedure presented in section 4 is general and never specified for the method at hand. The reader must, therefore, put the pieces together on their own. I think it would be helpful to explicitly outline how one resamples in the context of this neural JKO-IC method so as to give a clearer picture of what is at stake in the rejection steps. A diagram of an example sample trajectory might even be helpful for this purpose (perhaps with a background similar to that of the plots in Figure 10, but with arrows indicating the procession of one particle through the layers, with at least one rejection step depicted).

The quality of the samples generated is evaluated using the energy distance (MMD with the negative distance kernel) and by estimation of the log normalizing constant. These evaluation metrics are consistent with those used for similar sampling methods in the literature. In the appendix there are plots of 2D marginals of the samples generated by the various methods, which are also helpful. The method is primarily compared to the Metropolis-Adjusted Langevin Algorithm (MALA), Hamiltonian Monte Carlo (HMC), the Denoising Diffusion Sampler (DDS), Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), and Neural JKO (equivalent to the proposed method without the rejection steps). MALA and HMC are standard MCMC approaches, while DDS, CRAFT, and Neural JKO are all, to some extent, based on dynamic measure transport. This collection of comparison algorithms is reasonable, and in the appendix there are additional  comparisons to un-regularized continuous normalizing flows (CNFs).
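
For reference, a small self-contained sketch of the energy distance used here (MMD with the negative distance kernel), written as the standard plug-in two-sample estimator; it is not code from the submission:

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in (V-statistic) estimate of 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||.

    x : (n, d) array of samples from the first distribution
    y : (m, d) array of samples from the second distribution
    Note: the naive pairwise computation uses O(n*m) memory, which is fine
    for moderate sample sizes.
    """
    def mean_pdist(a, b):
        diff = a[:, None, :] - b[None, :, :]          # (n, m, d) pairwise differences
        return np.sqrt((diff ** 2).sum(-1)).mean()    # mean Euclidean distance

    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

# toy usage: two Gaussians with shifted means
rng = np.random.default_rng(0)
print(energy_distance(rng.normal(size=(500, 2)), rng.normal(loc=1.0, size=(500, 2))))
```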

Theoretical Claims

I checked the proofs of Corollary 3.2, Theorem 3.3, Theorem 4.2, and Corollary 4.3 at a high level. My only quibble is that in the proof of Theorem 4.2, the density $\tilde p$ is interpreted as a probability, i.e., $\tilde p(x) = \mathbb{P}(\tilde X = x)$, which isn't quite right. However, I suspect that if the proof were repeated working with $\mathbb{P}(\tilde X \in A)=\int_A\tilde p(x)\,\mathrm{d}x$ for some measurable set $A$, the result would still emerge.
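
To spell out the suggested fix (a sketch consistent with the rejection step as described in the summary above, writing $f$ for the normalized density of the current ensemble and $X, Y \sim f$ independent with $Y$ the replacement draw; the paper's exact statement may differ): for every measurable set $A$,

$$ \mathbb{P}(\tilde X \in A) \;=\; \mathbb{E}\big[\alpha(X)\,\mathbf{1}_A(X)\big] \;+\; \big(1-\mathbb{E}[\alpha(X)]\big)\,\mathbb{P}(Y\in A) \;=\; \int_A \big(\alpha(x) + 1 - \mathbb{E}[\alpha(X)]\big)\, f(x)\,\mathrm{d}x, $$

which identifies the density $\tilde p(x) = \big(\alpha(x) + 1 - \mathbb{E}[\alpha(X)]\big) f(x)$ without ever interpreting $\tilde p(x)$ as a point probability.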

Experimental Design and Analysis

The experimental design on the whole seems sound, but I think it would be interesting to compare the proposed Neural JKO-IC method to the competing methods for varying ensemble sizes. As far as I can tell, all experiments performed involved the generation of $N = 50{,}000$ samples, which is quite large and should render any Monte Carlo estimation error in the losses for the various methods small. Thus, the error metrics reported in the results are likely dominated by the biases of the various methods. What happens if the ensemble size is decreased, for example, to 50, 500, or 5,000 samples? Does Neural JKO-IC also have low variance relative to the competing methods at these smaller ensemble sizes? At the very least, the particle ensemble size should be listed in the main part of the paper (not just the appendix) to give proper context for the results.

There is also no comparison or discussion of the computational cost among the various methods. Training and runtimes for Neural JKO-IC are given in Table 5, but no training or runtime information is available for the competing methods. Knowing the relative computational costs of each of the methods is important for contextualizing numerical performance and informing the choice of method in applications.

Supplementary Material

In the supplement, I reviewed the proofs, algorithms, experimental details, and additional results.

Relation to Existing Literature

The connection between OT-Flow (Onken et al. 2021) and Wasserstein proximal mappings was highlighted in (Vidal et al. 2023), and the Neural JKO scheme (absent the rejection layers) which forms the basis for the proposed method was also proposed in Vidal et al. 2023. Annealed Flow Transport Monte Carlo (AFT, Arbel et al. 2021) and Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT, Matthews et al. 2022) are similar in spirit to the proposed method in that they involve the interleaving of normalizing flow layers with importance resampling and mutation steps. One big difference between AFT/CRAFT and the proposed method is that the former methods are designed to follow a specific annealing path of densities, while the latter seeks a sequence of densities corresponding to Wasserstein proximal mappings. Moreover, the importance resampling steps used in AFT/CRAFT fundamentally cannot alter the support of the empirical particle distributions, whereas the rejection sampling approach, while perhaps more costly, generally does modify the support, and evidently to good effect. Importance weights have also been combined with normalizing flows, e.g., in (Noé et al. 2019), where normalizing flows are used to define a proposal distribution for importance sampling, but this approach is not iterative like AFT/CRAFT and the proposed method.

References

  • Onken, D., Fung, S. W., Li, X., and Ruthotto, L. OTflow: Fast and accurate continuous normalizing flows via optimal transport. In AAAI Conference on Artificial Intelligence, volume 35, pp. 9223–9232, 2021.
  • Vidal, A., Wu Fung, S., Tenorio, L., Osher, S., and Nurbekyan, L. Taming hyperparameter tuning in continuous normalizing flows using the JKO scheme. Scientific Reports, 13(1):4501, 2023.
  • Arbel, M., Matthews, A., and Doucet, A. Annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 318–330. PMLR, 2021
  • Matthews, A., Arbel, M., Rezende, D. J., and Doucet, A. Continual repeated annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 15196–15219. PMLR, 2022.
  • Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.

Missing Important References

No, all the works that came to mind when I was reading this submission were appropriately cited.

Other Strengths and Weaknesses

Strengths

  • I was previously unaware of the connection between OT-regularized CNFs and Wasserstein proximal mappings, and I enjoyed learning about it by reading this paper. While this viewpoint was already advanced in (Vidal et al. 2023), the convergence results and the viewpoint of “piecewise geodesic interpolation” in this work are, to my knowledge, new.
  • While the neural JKO numerical approach already appeared in (Vidal et al. 2023), the addition of the importance-based rejection steps to the neural JKO scheme is new and does seem to improve sample quality significantly.
  • The rejection steps are relatively easy to add if the neural JKO steps have already been implemented.

Weaknesses

  • The numerical examples do not exercise the method across a range of sample sizes.
  • The rejection steps have the potential to cause long runtimes of this method.
  • The numerical results are difficult to quickly interpret due to the tabular format.

Other Comments or Suggestions

(In this section I append the line numbers with “L” to indicate that the text in question is in the left column, and “R” to indicate that it is in the right column.)

The preliminary material presented in section 2, while providing a firm mathematical background for the methods presented, is a little bit technical, and I wonder if the level of technicality could be lightened in order to make the material more approachable to a broad machine learning audience. For example, on line 146 (R), the minimal norm velocity field is described as belonging to a "so-called regular tangent space." While this statement is true, it would be more evocative and approachable to simply state that the velocity must be a gradient. Similarly, the definition of Wasserstein gradient flows could be explained in the main body of the paper without invoking reduced Fréchet subdifferentials (the velocity is the negative gradient of the first variation of $\mathcal F$…), with the more technical definition saved for the appendix. The level of technicality is of course a matter of taste, but I wonder if toning down some of the detail in the main body of this paper would help it achieve broader appeal.

Presenting the numerical results in tables makes it difficult to quickly assess the relative performances of the methods, especially given that all of the metrics are in scientific notation. Consider using bar charts, box plots, or another suitable plot instead of tables for the numerical results; these plots would make it easy for the reader to quickly compare the performances of each method without having to squint at nine rows of seven exponents.  

There are some typographical errors or notational inconsistencies which need to be corrected:

  • Lines 152-153 L: the integral in the definition of $\mathcal{P}_2(\mathbb{R}^d)$ should be $\mathrm{d}\mu(x)$, not $\mathrm{d}x$
  • Line 157 L and elsewhere: it is confusing to use $\boldsymbol{\pi}$ to denote the coupling between $\mu$ and $\nu$ and also to use $\pi_1$ and $\pi_2$ to denote the projections of $\boldsymbol{\pi}$ onto the first and second coordinates. Consider using $\gamma$ or a different letter for couplings instead ($\gamma$ matches $\Gamma(\mu, \nu)$).
  • Line 184 R: Here the normalized target density is written $q(x) = Z_g g(x)$, but it should be $q(x) = g(x)/Z_g$ to be consistent with the setup in Section 1 (Line 43 L).
  • Line 190 R: It seems like $\propto$ is used to indicate equality up to an additive constant, but I think this notation is more commonly used for equality up to a multiplicative constant. Perhaps consider alternate notation, or at least specify that the constant is additive.
  • Line 246 R: should the stopping time be $\tau$ and not $T$ in $\mu_\tau^{k+1} = z_\tau^k(\cdot, T)_\sharp \mu_\tau^k$?
  • Line 276/Equation (7): the top row of the matrix on the RHS should be $v_\theta(z_\theta(x, t), t)$, not $v_\theta(x, t)$. This error is also in Algorithm 6.
  • Line 298 L: Should this read "this leads to slow convergence speeds"?
  • Line 311 L: The base/acceptance rate should be $1-r$, not $1+r$

There are also spelling and grammatical errors throughout the paper. Below is a list of the ones I noticed, but I would suggest performing a thorough check for others which I may have missed.

  • Line 78 L: “the rejection steps readjust”
  • Line 83 L: subject-verb agreement needed here: either “our methods generate… and allow” or “our method generates… and allows”
  • Lines 91-92 L: “the velocity fields… converge
  • Lines 97-98 L: “we consider neural JKO schemes in more detail”
  • Line 132 L: “directly following”
  • Line 148 R: “absolutely continuous”
  • Line 322 R: “a constant ratio”
  • Line 364 L: “and which we can sample from”
  • Line 374 L: I’m not sure that “vice versa” is quite the correct term here. Maybe you are looking for “likewise” or “moreover” instead?
  • Line 380 R: “it is a kernel metric”
  • Line 422 L: "distance between two sets of…"
  • Line 1052: "Since it holds $\tilde X = X$ if $X$ is accepted and…"
  • Line 1256: “we evaluate the Wasserstein distance based on fewer samples”
  • Line 1372: “we run an independent chain for each generated sample”
  • Line 1452: “becomes computationally costly”
  • Line 1470: “Brenier’s theorem”
  • Line 1474: “We observed numerically that the expressiveness of discrete-time architectures scales… and that these architectures are less stable to train”
  • Line 1475: “The evaluation can be cheaper and the density evaluation” -- I’m not sure what this sentence is trying to say. Perhaps that “the sampling and density evaluation can be cheaper”?
  • Line 1476: “Residual architectures are”
  • Line 1477: “they are very expensive
  • Line 1505: “On the other side” -- perhaps you mean “on the other hand”?

Author Response

Thank you very much for your very detailed and thoughtful review. Please find the answers to your questions and comments below. Additionally, we will correct the typos and grammar errors.

Questions

  1. Our choice of the schedule is based on the following heuristic: Since the rejection layers are (in the long run) more expensive than the CNF layers, we start by using only CNF layers until the progress made by them becomes small. This can be detected without knowing the target distribution, because we have (approximate) access to the Wasserstein distance between the input and output of the JKO layer via the norm of the velocity field. Afterwards, we alternate one CNF layer with three rejection steps. Numerically, the CNF layer only makes minor local adjustments from this point on, while the main work is then done by the rejection steps (see the sketch after this list).

  2. If the step size is too large, we might miss some modes of the target distribution (which matches the theoretical consideration that we lose convexity (with respect to the Wasserstein metric) of the loss in this case). If the step size is small, then the mapping learned by the CNFs is close to the identity and does not have a large effect. While the model is sensitive with respect to the first case, the second case only increases the training and sampling time of the model (since more CNF steps are required), but does not much affect the quality of the result. A simple tuning heuristic for the step size can again be established based on the (approximated) Wasserstein distance: We start with a small step size and increase it as long as the Wasserstein distance between the input and output of the next CNF step is smaller than a certain threshold. While we have not automated this procedure so far, we are quite confident that such an adaptive choice of the step size can be established.

  3. The curse of dimensionality is indeed not specific to the normal distribution, but applies to any proposal which is not close enough to the target. So in fact the use of tailored/CNF proposals can help to reduce or even completely avoid the curse of dimensionality. We will highlight this further.
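
A schematic sketch of the heuristics in answers 1 and 2 (the helpers `train_cnf_layer`, `add_rejection_steps`, and `approx_w2` are placeholders, and the step-size adaptation is, as stated above, not automated in the actual implementation; this is an illustration, not the training code):

```python
def build_sampler(train_cnf_layer, add_rejection_steps, approx_w2,
                  tau, w2_threshold, max_layers=30):
    """Illustrative schedule: CNF/JKO layers while they still move mass,
    afterwards alternate one CNF layer with three rejection steps."""
    layers = []
    # Phase 1: pure CNF/JKO layers until the (approximate) Wasserstein movement
    # of a new layer, read off from the velocity norm, drops below a threshold.
    while len(layers) < max_layers:
        layer = train_cnf_layer(layers, tau)
        layers.append(layer)
        if approx_w2(layer) < w2_threshold:
            break
    # Phase 2: alternate one CNF layer with three importance-based rejection steps.
    while len(layers) < max_layers:
        layers.append(train_cnf_layer(layers, tau))
        layers.extend(add_rejection_steps(layers, n_steps=3))
    return layers
```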

Number of Samples

Our main intention behind the large number of samples is the following: Since the samples generated by our model are independent, our main quantity of interest is how much the distribution of the generated samples differs from the target distribution (in MMD, Wasserstein etc.). Of course we cannot measure this quantity directly, but only estimate it from a finite number of samples from both distributions. Using a small number of samples introduces an error in this estimator which is much more related to the properties (sample complexity) of the evaluation metric than to the approximation quality of the model. Therefore, from our viewpoint, the number of samples should be chosen as large as possible.

While the $\log(Z)$ estimation can be viewed in the same way (just with the KL divergence), we agree that it can also be viewed differently. If we are not interested in the KL divergence between the generated and ground truth distribution (which was our viewpoint so far) but in the $\log(Z)$ estimate itself, we agree with you that reporting the errors for different numbers of samples could be useful. We will include such a study in the final version. In our first experiments with 5000 samples the greater picture is about the same as for 50000 samples.

Proof of Theorem 4.2 (i)

Yes, you are right, the expression $P(X=x)$ is meant in a weak sense (where we integrate against all measurable sets $A$) and this part is written a bit sloppily. We will write it down more formally for the final version. However, neither the proof nor the result changes.

Visualization etc.

Thank you for the suggestions. We will add a flow chart-like diagram describing the sampling process to the paper. For the numerical part, we think that box plots often make it harder to assess the details, even though we agree that they give a better first impression. We are happy to add a box plot for visualizing the numbers, but we prefer to keep the tables as well (if the space constraints allow it, we will keep both in the main part of the paper).

Review
Rating: 3

This paper applies the Wasserstein Gradient Flow (WGF) framework to the sampling problem, i.e. sampling from a given target distribution. The proposed approach consists of two key stages:

Stage 1: JKO Steps with Continuous Normalizing Flows (CNFs)

  • Given a terminal density, the authors perform Jordan–Kinderlehrer–Otto (JKO) steps, each of which corresponds to an ODE-based control problem.
  • These steps are learned through Continuous Normalizing Flows (CNFs), which parameterize the velocity field.
  • However, a notable drawback of this approach is its slow and suboptimal convergence, primarily due to the complexity of solving high-dimensional JKO problems.

Stage 2: Importance Rejection Sampling Enhancement

  • To address the inefficiency of pure JKO-based updates, the authors incorporate a rejection sampling scheme that alternates with the JKO steps.
  • Due to the resampling, the method shows faster convergence, requiring only a few steps to converge. Moreover, it converges to a better solution.
  • The resulting framework iterates between JKO-based updates and importance sampling, aiming to improve convergence speed and sample quality.

The proposed method is evaluated on several sampling benchmarks including LGCP and funnel energy functions.

Questions for Authors

  • Did the authors observe any instability or divergence issues when alternating between JKO steps and importance sampling?

  • Ablation Studies: What happens if we reduce the batch size (which affects the importance rejection sampling)? Also, could the authors provide ablation studies on the number of flow steps (longer flow steps)?

Claims and Evidence

The idea of combining importance sampling with a Wasserstein Gradient Flow framework is novel and conceptually interesting. Theoretical claims are correct.

However, I see major computational challenges in the proposed method: A critical aspect of this method is that it heavily relies on accurate trace estimation of the gradient of the velocity field in CNFs, which is required for both the JKO update and importance sampling correction. I have strong concerns regarding the computational feasibility of the proposed approach:

  • To obtain a sample from $\mu_k$, an initial sample $x$ must be passed through $k$ different CNF models $v^i_\theta$ ($i = 1, 2, \dots, k$). Furthermore, it is necessary to estimate the trace of the Jacobian $\mathrm{Tr}(\nabla v^k_\theta(x))$ along the whole trajectory to perform importance sampling.

  • Updating $v^{k+1}_\theta$ requires drawing a fresh batch of samples from $\mu_k$ at every iteration, leading to excessive computational overhead.

  • The training of the CNFs itself is demanding since it involves the integration of the trace $\mathrm{Tr}(\nabla v^k_\theta(x))$, which further exacerbates the computational burden.

  • Even the evaluation procedure still requires computing the integral of the trace. Moreover, due to the importance sampling, samples have to be drawn in large batches.

Due to this computational burden (in both the training and evaluation schemes) and the heavy reliance on importance sampling, I suspect that JKO-IC (the proposed method) will have scalability issues.

Methods and Evaluation Criteria

My major concern is its computational efficiency, as discussed in the previous section "Claims and Evidence". Accordingly, I don't see the potential for application to more realistic datasets. To verify its applicability, I would like to further ask the authors to include the following:

  • A comparison of training time, evaluation time, and GPU memory with other benchmark approaches.

  • A discussion of the advantages of the proposed method over existing importance-weighting-based schemes like [1, 2, 3].

  • I believe there should be a comparison with [1] and [2] for several datasets (e.g. funnel, LGCP). These are also importance-sampling-based methods.

  • I feel that there are no sharp or realistic energy function benchmarks discussed in the paper. Could the authors run on the 40-mode GMM implemented in [3] or [4], or the more realistic datasets DW4 or LJ13 discussed in [5]? (The implementation of the GMM in this paper differs from the original 40-mode benchmark.)

Theoretical Claims

I have checked all the theorems and proofs. I believe the claims are all correct.

Experimental Design and Analysis

As mentioned above, I would encourage the authors to verify the following:

  • A computational efficiency study.

  • A discussion of related works including [1], [2], and [3].

  • A comparison with [1] and [2].

  • A comparison on additional benchmark datasets.

Supplementary Material

Yes, I checked the theory, the algorithms, the details of the benchmark data, the evaluation metrics, and the time/GPU consumption.

Relation to Existing Literature

The sampling problem can be reformulated in various ways. This paper suggests a new sampler by incorporating rejection sampling into the original JKO scheme. I believe this idea is very interesting and worth further investigation to improve its efficacy.

Missing Important References

I believe a discussion and direct comparison with [1] and [2] is necessary to further improve the paper.

References

[1] Phillips, Angus, et al. "Particle Denoising Diffusion Sampler." ICML, 2024.

[2] Chen, Junhua, et al. "Sequential Controlled Langevin Diffusions.", ICLR, 2025.

[3] Albergo, Michael S., and Eric Vanden-Eijnden. "Nets: A non-equilibrium transport sampler.", preprint, 2024.

[4] He, Jiajun, et al. "No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers.", preprint 2025.

[5] Akhound-Sadegh, Tara, et al. "Iterated Denoising Energy Matching for Sampling from Boltzmann Densities." ICML, 2024.

Other Strengths and Weaknesses

Strengths

  • The paper is well-organized and easy to follow.
  • The idea of integrating importance sampling with JKO updates is a fresh idea.
  • The paper is well-written and provides a rigorous theoretical foundation.

Other Comments or Suggestions

.

Author Response

We would like to thank the reviewer for the detailed evaluation of our paper. Please find our comments below.

Literature

Research in this field is very active and many interesting contributions are published frequently. However, we want to kindly point out that, in accordance with the ICML guidelines, comparisons to very recent preprints which are not published yet (like [3, 4]), were published a week before the submission deadline (like [2]), or were not even available as a preprint until the submission deadline (like [4]) cannot be expected. We do our best to provide timely comparisons and have now even added a comparison to [2] (details below), but we cannot compare to methods which appeared on arXiv shortly before or even after the ICML submission deadline. Of course, we are happy to include all papers mentioned by the reviewer in the "related work" section (where [1, 3] are already cited).

Computational Cost

We respectfully disagree with the reviewer's opinion that our method is computationally infeasible. Specifically, we would like to stress the following points:

  • The trace $\mathrm{trace}(\nabla v_t^k(x))$ is estimated by a Hutchinson trace estimator ($\mathrm{trace}(A)=\mathbb E[z^T A z]$) with probe vectors $z$ of zero mean and identity covariance (see the sketch after this list).
  • During the training of the CNF, a single sample of $z$ is enough in the Hutchinson estimator to approximate the trace (since we only need an unbiased estimate of the loss for training the CNF). For the evaluation we take 5 samples of $z$ for each time step. We find that this already leads to a small variance of the resulting densities, while the cost remains comparably low (see also the answer to Question 5 of Reviewer Eq2H).
  • When training $v_\theta^{k+1}$, we do not draw "a fresh batch from $\mu_k$" in each training step. Instead, we maintain during training a dataset of 50000 samples of $\mu_k$. During the training of $v_\theta^{k+1}$ we then draw batches from this dataset. After the training of $v_\theta^{k+1}$ is completed, we generate a dataset of $\mu_{k+1}$ by applying the CNF to the samples from $\mu_k$. We will add this detail to the numerical details.
  • During evaluation we do not need to draw a large batch. At test time the samples do not interact with each other (see also our reply to the part "Batch size" in the paragraph "Questions" below).
  • Please note that we already discussed the computational aspects of CNFs in Appendix F.2 and list the resulting training and evaluation times including the required GPU memory in Table 5 in the appendix. We can see that they remain moderate for all examples considered in the paper.
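
For concreteness, a minimal PyTorch-style sketch of such a Hutchinson divergence estimate (the velocity field `v` is a placeholder; this is an illustration of the estimator, not the code used in the paper):

```python
import torch

def hutchinson_divergence(v, x, t, n_probes=1):
    """Estimate div_x v(x, t) = trace(dv/dx) with Hutchinson probes.

    v        -- velocity field, v(x, t) -> tensor of shape (batch, d)
    x        -- (batch, d) points
    t        -- time (scalar tensor)
    n_probes -- 1 suffices for an unbiased training loss; e.g. 5 at evaluation time
    """
    x = x.detach().requires_grad_(True)
    out = v(x, t)
    est = torch.zeros(x.shape[0], device=x.device, dtype=x.dtype)
    for _ in range(n_probes):
        # Rademacher probe: zero mean, identity covariance
        z = torch.empty_like(x).bernoulli_(0.5).mul_(2).sub_(1)
        # z^T (dv/dx) z via a single vector-Jacobian product
        (vjp,) = torch.autograd.grad(out, x, grad_outputs=z,
                                     retain_graph=True, create_graph=True)
        est = est + (vjp * z).sum(dim=1)
    return est / n_probes
```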

Nevertheless, as already discussed in the limitations paragraph, lowering the computational cost is part of our current and future research. In particular the use of supervised trained generative models as intermediate surrogates for several steps in our model is future work and goes beyond the scope of the paper.

Other Benchmark Problems and Comparisons

We believe that our test problems are standard in the community and respectfully disagree with the reviewer's opinion that "no sharp benchmarks" were used. However, upon the reviewer's request, we have now run our method on the GMM40 problem in $d=50$. For the sake of comparability, we adopt the same setting as Table 3 in [2] and also evaluate the Sinkhorn distance. We can see in the table below that the neural JKO IC outperforms DDS, CRAFT and SCLD on this example.

                     DDS        CRAFT      SCLD [2]   Neural JKO IC (ours)
Sinkhorn distance    5435.18    28960.70   3787.73    3154.91

Questions

  1. Stability: In the case that hyperparameters are chosen inappropriately, the method might miss some modes of the target distribution, but we never observed divergence issues. Once the hyperparameters are chosen appropriately, it consistently produces the same results.
  2. Ablations: The paper already contains an ablation study with respect to the number of steps in Figure 10, where we plot the error measures over the number of steps. Since our model is trained iteratively, the plots show how the model would behave if we stop the training earlier and we can see that the error measures saturate.
  3. Batch size: We would like to clarify that at test time the sampling procedure is independent of the batch size. The only parameter in the rejection step that depends on more than one sample is $\mathbb E[\alpha(X_k)]$ (it is estimated within the training phase and fixed afterwards). For the evaluation, the samples do not interact with each other and are consequently independent.

Reviewer Comment

I thank the authors for the detailed response. Moreover, I appreciate the authors' corrections of some points that I had misunderstood. My concerns are adequately addressed. I will raise my score to 3.

Review
Rating: 3

This paper contributes a method called "Importance corrected neural JKO sampling", based on the well-established Jordan-Kinderlehrer-Otto (JKO) scheme. The method is constituted by a flow-based ordinary differential equation (ODE), which is parameterized by neural networks and learned using standard neural ODE optimization techniques. The authors approximate this solution by using a finite sequence of proximal maps; they show that, under some assumptions on the target distribution, the neural network that minimizes a given loss over the ODE will minimize the reverse KL w.r.t. the target, and produce samples via a flow mapping. In particular, the solution asymptotically approaches the Wasserstein gradient flow and approximates the flow discussed in the well-established theorem of Benamou and Brenier. However, the authors identify that this sampling procedure can be inefficient due to the behavior of the reverse KL divergence used in variational inference techniques like neural ODEs; to ameliorate these inefficiencies, they adapt this neural JKO scheme by contributing a method hybridizing importance and rejection sampling. This method samples from a proposal distribution and uses a weighted rejection step, whose corrected distribution is consistent with the target and is closer to the target than the proposal in the prescribed reverse KL. Further, due to its construction, the corrected distribution's density is known analytically, which allows the user to know the normalized density when retrieving the target samples from this procedure. This procedure gives typical Monte Carlo convergence rates that are nominally dimension-independent. Finally, the authors provide numerical results demonstrating competitive results with several typical methods on many benchmark problems spanning low to high dimensions.

Update after rebuttal

I thank the authors for their response. I maintain my score of 3.

Questions for Authors

  1. Does the (uncorrected) neural JKO (N-JKO) scheme described align exactly with prior N-JKO schemes? Is there a difference? What does this paper contribute over, e.g., [Xu et al., 2024]? A different method or just theory on an equivalent method?

    • The paper is unclear as to whether such an uncorrected scheme has been previously introduced. For example, it cites many papers using JKO-like schemes via W2 proximal mapping iterations, but does not delineate itself very well from them. Being clearer here would help the paper stand out and better define its contributions to the field, so that a reader could clearly see that this is not just adding importance/rejection sampling to a pre-existing neural ODE scheme.
  2. Do Cor. 3.2 and Thm 3.3 have anything to do with the rejection steps?

    • As suggested in previous comments, it seems like this submission attempts to contribute two different methods that work in tandem. The theoretical extent that the importance correction contributes to specifically the JKO scheme is not well described. It is certainly acceptable that the answer to this theoretical question is "they are two different items that work well together", but such a disclaimer would highlight exactly what the authors can say about their work.
  3. What are the practical and theoretical parallels with MCMC schemes? Why not just use a "Metropolis Adjusted Neural JKO" scheme? Should I regard the result as a CRAFT-like approach to unadjusted Langevin dynamics (ULA)?

    • While the guarantees of JKO are great in practice, the uncorrected N-JKO does not seem too different from Langevin. There is unfortunately no comparison to a Metropolis adjustment to neural JKO, nor is there a comparison of the uncorrected N-JKO scheme to ULA. Again, it is imperative that the methods are appropriately benchmarked and indeed that the fact this paper incorporates two different methods is justified.
  4. Practically, are there any remarkable features regarding N-JKO's sensitivity to hyperparameters? Time discretizations? Adjoint solves? etc.

    • In contrast to MALA or HMC/NUTS, there is a functional optimization loop inside the body of the methods. This is admittedly discussed in part within appendix F, i.e., that the required coupled forward/adjoint can be quite costly, that they use a classic adaptive explicit Runge-Kutta scheme, and that "choosing larger architectures [than what they have] does not bring significant advantages". However, in contrast to MALA, HMC, and (unmentioned) SVGD, the proposed methods do not have any asymptotic or mean-field guarantees for a nonasymptotic architecture/function class; therefore, it is certainly worth noting how much effort regarding heuristics and hyperparameter tuning it may take to achieve the results given.
  5. It is stated in F.2 that the evaluation of the neural IC only uses a total of five Rademacher vectors for estimating the Jacobian trace. If it's a Monte Carlo estimator, why would this be enough? Would increasing this make the performance better or worse? Could this poison the error metrics in any way?

    • While it is entirely acceptable to use just a few evaluations for training the neural networks, it is imperative that the authors are clear that any additional refinement of this discretization works in their favor and not against it. To be more explicit, a case where higher variance in the trace estimator coincidentally improves results over more exact trace estimators would raise questions of method validity.

Claims and Evidence

The claims in this submission are reasonably clear and well-supported, though their consequences are a little under-discussed. A mild criticism is that the provided theoretical results largely hinge on log-densities that are $\lambda$-concave, with discussion of the non-convexity of the reverse KL and its tendencies to seek modes; however, the results for the importance correction show improvement in this reverse-KL regime. While such a descent property is comforting, it seems undercut by the qualifications the authors lay out regarding the loss the authors choose to descend.

Methods and Evaluation Criteria

The methods employed and evaluation metrics seem well-founded. If these results are published with further details, one suggestion would be to report results using MMD with various kernels, rather than relying solely on the energy distance.

Theoretical Claims

The theoretical proofs were only briefly considered. The simpler proofs, i.e., Corollary 3.2, Theorem 4.2 (i), and Corollary 4.3 were checked, and Theorem 4.2 (ii) was also considered informally. The remaining proofs were not.

Experimental Design and Analysis

The provided experiments largely demonstrate a comprehensive approach to the problem at hand. A slight shortcoming is that, since the importance/rejection step is not tethered to the gradient flow regime, it would be worth seeing whether it performs any differently when used in conjunction with Langevin or Hamiltonian dynamics. This would allow the reader to see how much the error decreases due to a good importance correction versus the quality of the JKO. For instance, if importance-corrected Langevin dynamics outperformed MALA substantially due to sample independence, then this would be a notable result in and of itself.

Supplementary Material

The supplementary material included code to reproduce the experiments. It was not run by the reviewers, though it was briefly skimmed to ensure that the code looked feasible.

Relation to Existing Literature

This reiterates several results and ideas in neural ODEs (e.g., [Chen et al., 2018] in NeurIPS) and the JKO sampling literature (e.g., [Jordan et al., 1998] for its introduction, as well as, e.g., [Salim et al., 2020] in NeurIPS for the discussed connection to proximal algorithms). Further, its proposed methods augment typical variational inference techniques, e.g., [Marzouk et al. 2016] in Handbook of UQ, [Lambert, 2022] in NeurIPS. The discussion of functional geometry over the reverse KL divergence echoes prior literature, e.g., [Marzouk et al. 2016] and [Grenioux et al., 2023] in ICML, but the connection of the prior literature to the current methods is presented clearly and reasonably concisely. Regarding neural JKO details, the authors provide a convincingly comprehensive collection of works reflecting progress towards neural network-based JKO sampling schemes, e.g., [Altekrüger et al., 2023], [Mokrov et al., 2021], [Onken et al., 2021], and [Xu et al., 2024] (publication venues provided in the submission). Each of these seems to tackle varying problems of using JKO for sampling unnormalized densities. For instance, [Altekrüger et al., 2023] seems to tackle a flow using MMD-based discrepancies; [Mokrov et al., 2021] seems to minimize the reverse KL divergence (the expression of the functional is in (5), where the constant $\beta$ seems to be chosen according to (13)), but does so using an explicit construction of transport maps via a discrete sequence of input-convex neural networks; [Onken et al., 2021] seems to work very similarly to this paper using neural ODEs but omits the regularization term this submission's authors denote $w_\theta$ (which corresponds to the Wasserstein proximal mapping). The authors do not seem to articulate the differentiation of their submission's neural JKO step from [Xu et al., 2024] in terms of approaching the JKO from a neural ODE perspective; the reviewer could not distinguish between the two in the time allotted. In particular, (8) from [Xu et al., 2024] seems virtually identical to (7) in this submission.

Missing Important References

The importance-correction rejection steps closely mimic the accept-reject step in typical Metropolis-Hastings algorithms, where the density described in (4.2) seems to be akin to a density derived from the Markov transition kernel, e.g., [Andrieu et al., 2003] in Machine Learning. The divergence seems to be that MH accept/reject steps act locally, whereas this is a globalized accept/reject procedure; critically, the reject step still "accepts" a change of position according to the proposal distribution to make such steps global regardless. However, while this superficially seems like a "Metropolized" transport algorithm in the vein of, e.g., [Parno and Marzouk, 2018] in SIAM Journal on UQ and [Gabrié et al., 2022] in PNAS, the globalization differentiates it from such algorithms. It would behoove the authors to investigate parallels in, e.g., "global" MH transition kernels (see discussion in, e.g., section 3.3 in [Andrieu et al., 2003]) and differentiate themselves for the benefit of the reader who believes this to be such a Metropolization of Neural JKO. Moreover, the acceleration of Langevin sampling using birth-death processes, proposed in [Lu et al., 2023], was overlooked. Finally, there is little to no mention of Stein variational gradient descent (SVGD) [Liu and Wang, 2016] in NeurIPS, despite its popularity as a competing algorithm. SVGD approximates a gradient flow under a kernelized KL-loss ([Liu, 2019] in NeurIPS) or a $\chi^2$-gradient flow ([Chewi et al., 2020] in NeurIPS), and the recently introduced noisy SVGD variant ([Priser et al., 2024] in ICLR) addresses some drawbacks of the vanilla SVGD. However, these developments are notably absent from the relevant literature section.

Other Strengths and Weaknesses

The major weakness of this article is that it has a very inconsistent writing style and does not seem well-contained. In particular, much of the theoretical background and discussion seemed lengthy with little intent; Theorem 2.2 is included from prior literature but only invoked once and hardly discussed for the reader's benefit. Many terms stand undefined (e.g., lsc/lower semi-continuous, coercive, $\lambda$-convex, etc.). Further, it does lack the discussion of some vital assumptions. In particular, there is some missing context regarding the assumption of $\lambda$-convexity of the target distribution $q$. Of course, there will always be assumptions for theoretical results, but such an assumption is particularly strong seeing as it is not satisfied by most distributions considered in the numerical results. Perhaps such an assumption is "softened" by the rejection steps, or is just analytically convenient and stronger than necessary in practice. With careful consideration of the typical ICML reader, the analytical and theoretical discussion could be made significantly more accessible and parochial by focusing on the actual methods and ideas vital to the method. Finally, there is a connected issue that the paper is not well self-contained, where it's unclear what theory, exactly, the authors contribute to the field and what is rehashed from prior literature. In particular, the paper does not differentiate its core neural JKO scheme very well from similar neural JKO approaches.

With these shortcomings in mind, the paper does invariably provide strong theoretical guarantees, particularly for smooth and log-concave targets. Despite some issues with the presentation of guarantees and terminology, the compact presentation of Benamou & Brenier, its connection to the algorithm, and the discussion of neural ODEs are reasonably clear and helpful. To add to the theoretical insights, the rejection procedure seems method-agnostic and might be of great benefit to other inference techniques, particularly other variational inference schemes. If anything, the authors undersell the intrigue of such an idea. The numerical results demonstrate an undeniably strong algorithm, with a stark comparison against unadjusted neural-JKO. The independence of the corrected samples (which is not proclaimed until the conclusion) is actually quite remarkable, especially compared with the sequential nature of the correction of Metropolis algorithms. Additionally, Appendix B.1 is well written considering how formal the discussion of the theory assumptions is within the paper.

Other Comments or Suggestions

  • A discussion of the nominal dimension-free ideas in Corollary 4.3 could be nice:
    • The dimension-free guarantee applies to $\alpha$, so what makes the problem hard? Presumably the difficulty arises when the target is non-log-concave, which is exacerbated by dimension when performing the neural JKO scheme.
  • If there is sufficient space, it might be interesting to include more explicit insight on how the importance rejection step compares to typical SMC methods as well as, e.g., [Midgley et al., 2023] in ICLR 2023, or the CRAFT algorithm discussed.
    • This is partially elaborated in the related work section, but it would be more helpful after the explanation of the algorithm.
  • If space permits, it might be nice to remark in the text that Corollary 4.3 simply comes from an application of Hoeffding's inequality, as it is unclear at first glance why it is a corollary; a sketch of the standard bound is given after this list.
  • While annealed importance sampling was largely popularized by [Neal, 2001], the idea of importance sampling dates back significantly further.
  • Similarly, while (Rezende & Mohamed, 2015) certainly introduced the "CNF" term, it should be noted that the concept of coupling a complicated target with a simple reference distribution dates back further, even within generative modeling communities.
  • Visual accessibility:
    • In addition to the highlighting of tabular metrics, the numbers should be bolded for visual accessibility.
    • The coloring in Figures 2/3 should probably be adjusted for visual accessibility. Instead of red/green consider using a visually accessible color palette (a little over four percent of people are red/green colorblind).
  • Style/grammar comments:
    • The introduction would benefit from more careful attention to its language and style.
    • There should be appropriate capitalization in the bibliography.
    • The letter $\lambda$ is used both for Lebesgue measure and convexity.
    • The intention of the word "ration" on line 322, page 6 is unclear; perhaps the authors meant "ratio"?
    • In the left column of line 416 on page 8, the phrasing "We can see, that importance..." is quite awkward.
    • Remark 4.5 is clearly crucial, but difficult to parse. What is a "moderate base of $1+r$"? The entire remark could benefit from clearer writing.
    • The appendices, especially F, should be reviewed for style/grammar (e.g., line 662 "We give some more backgrounds..." in appendix A, repeated use of "Lebesque" instead of "Lebesgue" in appendix B, the sentence in 1303-1305, line 1442 "...this package does not rely backpropagation...", line 1476 "...the evaluation can be cheaper and the density evaluation since these...", line 1476 "Residual architectures is at least....", line 1477 "...they are very expansive to train..." etc.).
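
Regarding the Hoeffding remark above: for completeness, the generic bound for an empirical acceptance rate reads as follows. This is only the standard inequality for averages of $[0,1]$-valued variables, not the exact statement of Corollary 4.3.

```latex
% Generic Hoeffding bound for an empirical acceptance rate (illustration only,
% not the exact statement of Corollary 4.3): for i.i.d. A_1, ..., A_n in [0,1]
% with mean alpha and any t > 0,
\[
  \mathbb{P}\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} A_i - \alpha \right| \ge t \right)
  \;\le\; 2\,\exp\!\left(-2 n t^{2}\right),
\]
% which depends only on n and t, not on the dimension of the sampling space.
```
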
Author Response

Many thanks for the detailed and thoughtful review. Please find our answers below. For the final version, we will additionally correct the typos, extend the literature discussion (e.g., with SVGD), improve the visual accessibility based on your comments, and add the definitions of lsc/coercive/$\lambda$-convexity in the appendix.

On the $\lambda$-Convexity

We stress that for the theoretical results we only require that the negative log-density of the target is $\lambda$-convex for some $\lambda\in\mathbb{R}$, which explicitly includes negative $\lambda$. This assumption is very weak and automatically fulfilled whenever $q$ is sufficiently smooth (together with suitable behavior as $\|x\|\to\infty$). In particular, the theoretical results are also applicable to target densities which are not log-concave. We will add this explanation after Assumption 3.1 in the paper.
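
As an illustrative example of this point (not taken from the paper), a smooth bimodal target already satisfies the assumption with a negative $\lambda$:

```latex
% Illustrative example (not from the paper): a smooth, bimodal,
% non-log-concave target whose negative log-density is lambda-convex
% with a negative lambda.
\[
  q(x) \propto e^{-f(x)}, \qquad
  f(x) = \tfrac{1}{4}x^{2} + \cos(x), \qquad
  f''(x) = \tfrac{1}{2} - \cos(x) \;\ge\; -\tfrac{1}{2}.
\]
\[
  \text{Thus } f \text{ is } \lambda\text{-convex with } \lambda = -\tfrac{1}{2},
  \text{ while } f''(0) = -\tfrac{1}{2} < 0, \text{ so } q \text{ is not log-concave.}
\]
```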

Questions

  1. We stress that in Section 3 we consider the theoretical results (Cor 3.2, Thm 3.3) to be the main novelty of our paper. The actual neural JKO steps are indeed very similar to approaches in the literature, particularly to those of (Vidal et al., 2023; Xu et al., 2024), which also rely on the dynamic formulation of the Wasserstein distance. Since these papers consider the generative modeling setting, there are some small technical differences (e.g., we start with the latent distribution while Vidal et al. and Xu et al. start with the data distribution; moreover, they only consider a Gaussian as the target $g$, which makes the whole objective convex regardless of $\tau$, but is not useful for the sampling application). In the final version, we will clarify the relation to these approaches in more detail at the beginning of Section 3.2, where we derive the neural JKO step.
  2. The statements of Cor 3.2 and Thm 3.3 are not directly related to the rejection steps in Section 4. However, in order to apply the importance-based rejection step in an iterative manner, the sampling model has to fulfill the following requirements: we must be able to evaluate its density, sample from it independently, start at an arbitrary latent distribution, and avoid mode collapse. Since not many generative models (or sampling methods) fulfill these properties at the same time, we focused on the neural JKO scheme. We will clarify this in the introduction.
  3. It is not surprising that the neural JKO scheme and Langevin sampling produce very similar results, since in the limit both approximate the same Wasserstein gradient flow with respect to the KL divergence. The main difference is that we can evaluate the density of the distribution generated by the neural JKO scheme, while we cannot do so for the distribution generated by Langevin sampling. Thus, we can combine the neural JKO scheme with our importance-based rejection resampling steps, which is not possible with Langevin steps. We briefly outlined this relation in the introduction (l 47--50, right column) and will add a reminder in the numerical results. Regarding ULA vs. neural JKO: when running our experiments, we also ran plain ULA sampling (without Metropolis correction). The results are mostly similar to the MALA results, although the Metropolis correction helps a little to fill out the narrow tails of the funnel/mustache distribution. Therefore, we omitted ULA in the paper.
  4. The most important hyperparameters are the initial step size $\tau$ and the size of the network architecture, which require some tuning. If these parameters are chosen inappropriately, the model might miss some modes (which matches the theory, since we lose any convexity of the JKO steps in this case). We outline a possible tuning strategy in the answer to Reviewer BUrR (Question 2). Most hyperparameters of the neural JKO steps are quite standard choices (the velocity fields are dense 3-layer neural networks, we use the torchdiffeq library with rather standard parameters, and training parameters such as batch size and learning rate are also standard choices for CNFs).
  5. We think that we can resolve this confusion: when computing the Jacobian trace, we draw 5 new Rademacher vectors in each time-discretization step of the ODE. Since the ODE trajectories are almost straight, the Jacobian varies only slightly over the different time steps. Consequently, the effective number of considered Rademacher vectors is 5 times the number of time-discretization steps, which is (depending on the example) on the order of 20 to 100. Therefore, the errors mostly cancel out over the whole solution of the ODE. Indeed, we observed that the variance of the estimator over the whole solution of the ODE is already very small for 5 Rademacher vectors per time step, and taking more does not have much of an effect. Taking fewer Rademacher vectors causes a bias in the importance weights and lowers the quality of the results. We will add this explanation to the final version of the paper (see the sketch after this list).
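
The sketch referenced in point 5 (a generic Hutchinson divergence estimator in PyTorch, not the authors' code; `velocity` and `divergence_hutchinson` are hypothetical names, and the choice of 5 probes mirrors the number quoted in the reply) could look as follows:

```python
import torch

def divergence_hutchinson(velocity, t, x, n_probes=5):
    """Hutchinson estimate of div v(t, x) = tr(dv/dx) for a batch x,
    drawing fresh Rademacher probes at every call, i.e. at every
    time-discretization step of the ODE solver."""
    x = x.detach().requires_grad_(True)
    v = velocity(t, x)                        # velocity field, shape (batch, dim)
    div = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_probes):
        # Rademacher probe with entries +/-1
        eps = torch.empty_like(x).bernoulli_(0.5).mul_(2).sub_(1)
        (vjp,) = torch.autograd.grad(v, x, grad_outputs=eps, retain_graph=True)
        div = div + (vjp * eps).sum(dim=1)    # eps^T J eps per sample
    return div / n_probes
```

Because fresh probes are drawn at every solver step, the per-step estimates average out along an almost-straight trajectory, which is the effect described in the reply.
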
Final Decision

This paper proposes a novel sampling algorithm, termed Importance-Corrected Neural JKO, which interleaves continuous normalizing flows (CNFs) implementing neural JKO steps with importance-based rejection sampling to improve convergence and robustness when sampling from unnormalized target distributions.

Strengths:

  • Novel combination of methods: While it combines ingredients introduced in recent previous papers (Vidal et al., 2023; Xu et al., 2024), which target generative modeling instead of sampling, the neural JKO scheme with an importance-corrected rejection step is new and effective.
  • Empirical performance: The method outperforms classical MCMC (e.g., MALA, HMC) and flow-based baselines (e.g., Neural JKO, DDS, CRAFT) on a range of benchmark tasks, particularly in high dimensions and multimodal settings.

Concerns:

  • Computational cost and scalability: Several reviewers (notably JGqE and BUrR) expressed concern over the potential inefficiency of the method due to rejection steps requiring full regeneration of sample paths. However, the authors clarified that the costs are manageable and provided implementation details and runtime data in the appendix.
  • Positioning within literature: While the authors cite and discuss relevant prior work (e.g., Vidal et al., Xu et al., AFT, CRAFT), reviewers requested a clearer articulation of how this work differs from or builds upon those methods, especially Vidal et al. (2023) and Xu et al. (2024), and comparisons with recent importance-weighted or rejection-based sampling schemes which share many similarities with the introduced scheme.
  • Clarity: One reviewer raised the concern that the authors are not clear regarding the novelty of their theoretical results and their limitations.

Final Assessment: All reviewers were generally positive: 4/5, 3/5, 4/5, 3/5 (one reviewer went from 2 to 3 after the rebuttal). While there are some clarity and scope limitations, the authors' thorough rebuttal and additional experiments addressed key concerns.

Recommendation: Accept. This paper makes a strong theoretical and practical contribution to the literature on sampling with generative models and optimal transport, and will be of interest to the ICML community.