Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
We propose OLLA, a projection-free overdamped Langevin framework that enforces both equality and inequality constraints via a deterministic “landing” correction along the manifold normal.
Abstract
Reviews and Discussion
This paper introduces Overdamped Langevin with Landing (OLLA), a novel projection-free sampling algorithm for distributions defined by complex, potentially non-convex, equality and inequality constraints. By decomposing Langevin dynamics into tangential and normal components and enforcing constraint satisfaction via an exponential "landing" drift, OLLA provably converges exponentially fast in 2-Wasserstein distance under a log-Sobolev inequality (LSI) assumption. The authors also present a practical Euler–Maruyama discretization (OLLA-H) using a Hutchinson estimator for curvature corrections, with per-step cost dominated by gradient evaluations.
Strengths and Weaknesses
Strengths
- The authors provide a unified framework to handle arbitrary smooth equality and inequality constraints without explicit projections.
- The authors provide rigorous non-asymptotic exponential convergence guarantees in the 2-Wasserstein distance, leveraging LSI constants.
- Practical discretization via Hutchinson trace estimation reduces per-step cost to gradient evaluations.
- The paper is generally well-structured.
Weaknesses
- The theoretical convergence rate has a similar mathematical form to that of unconstrained Langevin algorithms under good conditions, depending on an LSI constant. Intuitively, one might expect that constraining the sampling domain would improve the LSI constant and lead to faster convergence, especially as the domain diameter shrinks [1-3]. The paper does not provide such an analysis or an estimate of the LSI constant in terms of the domain geometry. It is unclear how difficult such an analysis would be. Could authors comment on that, and if possible, offer a theoretical advantage in convergence rate?
- The experimental evaluation is confined to synthetic tasks. As it is a paper with a strong theoretical focus, this is understandable.
- Typos: line 210: "Metropolis-Hasting correction" -> "Metropolis-Hastings correction"
References
[1] Reflection Couplings and Contraction Rates for Diffusions, Probability Theory and Related Fields, 2016
[2] Sampling from a Log-Concave Distribution with Projected Langevin Monte Carlo, Discrete & Computational Geometry, 2018
[3] Projected Stochastic Gradient Langevin Algorithms for Constrained Sampling and Non-Convex Learning, COLT, 2021
Questions
Do the analysis and convergence guarantees extend naturally to high dimensions? If not, could the authors discuss the challenges in making this dependence explicit?
Limitations
There are no limitations or potential negative societal impacts of their work.
Final Justification
My concerns have been adequately addressed, and I have updated my score accordingly. The additional theoretical insights and experimental results provided in the rebuttal helped clarify my key concerns. I encourage the authors to include these additions in the revised version to strengthen the final paper.
Formatting Concerns
There are no paper formatting concerns.
We appreciate the reviewer for the constructive feedback. Below, we provide itemized responses to address each of the comments, and we hope they clarify your questions and concerns.
Weakness (1): The theoretical convergence rate has a similar mathematical form to that of unconstrained Langevin algorithms under good conditions, depending on an LSI constant. Intuitively, one might expect that constraining the sampling domain would improve the LSI constant and lead to faster convergence, especially as the domain diameter shrinks [1-3]. The paper does not provide such an analysis or an estimate of the LSI constant in terms of the domain geometry. It is unclear how difficult such an analysis would be. Could authors comment on that, and if possible, offer a theoretical advantage in convergence rate?
Answer. Thank you for the insightful feedback. We agree that this analysis was missing from our paper. Below we (1) state a geometric lower bound for the LSI constant on $\Sigma$, and (2) explain how a shrinking diameter improves the on-manifold mixing rate and how this interacts with the landing rate $\alpha$ in our bound.
Geometric control of $\lambda_{\mathrm{LSI}}$. First, we recall the following informal theorem (Theorem 7.3, [2]):
- Theorem (Informal) Let $\Sigma$ be a compact Riemannian manifold with diameter $D$ and non-negative Ricci curvature. Then $\lambda_{\mathrm{LSI}} \gtrsim \lambda_1$, and hence $\lambda_{\mathrm{LSI}} \gtrsim 1/D^2$ holds, where $\lambda_1$ is the first nonzero eigenvalue of the Laplace-Beltrami operator.
This implies that if the constraints shrink the diameter $D$ (or increase $\lambda_1$), the lower bound on $\lambda_{\mathrm{LSI}}$ increases, so the on-manifold decay of $\textsf{KL}^{\Sigma}(\tilde{\rho}_t \| \rho_\Sigma)$ accelerates (dominated by the exponential rate $e^{-2\lambda_{\mathrm{LSI}} t}$).
How this yields a theoretical advantage in our bound. Our non-asymptotic bound reads, for sufficiently large $t$,
$
W_2(\rho_t, \rho_\Sigma) \leq \frac{M_h}{\kappa} e^{-\alpha t} + \sqrt{\frac{2}{\lambda_{\mathrm{LSI}}} \textsf{KL}^{\Sigma} (\tilde{\rho}_t \,\|\, \rho_{\Sigma})}
$
Since $\textsf{KL}^{\Sigma}(\tilde{\rho}_t \| \rho_\Sigma)$ decays at rate $e^{-2\lambda_{\mathrm{LSI}} t}$ under the LSI, the second term decays at the scale $e^{-\lambda_{\mathrm{LSI}} t}$. Thus the effective rate is controlled by $\min(\alpha, \lambda_{\mathrm{LSI}})$. By this analysis, stronger constraints that reduce the diameter $D$ make $\lambda_{\mathrm{LSI}}$ larger, so, provided the landing rate $\alpha$ is not the bottleneck, the overall convergence rate in $W_2$ improves.
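As a concrete instance (our own illustration, not taken from the paper or rebuttal), this scaling can be checked on a sphere of shrinking radius:

```latex
% Worked example (ours): \Sigma = S^{d-1}(r), the sphere of radius r, d >= 2.
% It has non-negative Ricci curvature, intrinsic diameter D = \pi r, and
% first Laplace-Beltrami eigenvalue \lambda_1 = (d-1)/r^2, so
\lambda_{\mathrm{LSI}} \;\gtrsim\; \lambda_1 \;=\; \frac{d-1}{r^2} \;=\; \frac{\pi^2 (d-1)}{D^2},
% i.e., constraints that shrink the radius (hence D) improve the
% on-manifold rate at the 1/D^2 scale discussed above.
```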
Question (1): Do the analysis and convergence guarantees extend naturally to high dimensions? If not, could the authors discuss the challenges in making this dependence explicit?
Answer. Our convergence guarantees have some dependence on the dimension $d$. The rates themselves are dimension-free, but an explicit $d$-dependence enters through the additive constant, which we can make explicit and somewhat sharper than in the current version.
Where the dimension dependence enters. To simplify the setup, we illustrate this using Theorem 1 (the same reasoning applies to the mixed scenario). In Theorem 1, there is a constant $M_h$ (with an erratum there: a dimension-dependent factor was omitted), and the bound depends on the dimension through $M_h$, which in turn depends on $d$ via the proof of Lemma E.5 in the appendix. However, we revised the proof using sharper linear-algebra arguments and were able to reduce the dependence of $M_h$ on $d$.
Therefore, in the equality-only case, the dimension enters our convergence bound only through this constant (picking up a square root from the $W_2$ bound); the same analysis applies to the mixed case. We will incorporate this refinement and make the dependence explicit in the revised paper.
Weakness (2): The experimental evaluation is confined to synthetic tasks.
Answer. We appreciate this concern. For the related answers, please refer to our responses to tmLS and 74bE, where we added two real-world benchmarks with full setup and results.
Suggestion (1): Typo in Line 210: "Metropolis-Hasting correction" -> "Metropolis-Hastings correction".
Answer. Thank you for pointing this out. We will fix this.
Thank you again for your positive review and helpful feedback. We'd appreciate the opportunity to discuss further if you have any additional questions or suggestions.
References
[2] M. Ledoux (1999). Concentration of measure and logarithmic Sobolev inequalities. Séminaire de Probabilités 33.
Thank you for the detailed response. My concerns have been addressed, and I have updated my score accordingly. Please make sure to include the additional theoretical and experimental results in your revision.
We appreciate your helpful feedback, questions, and the time you have dedicated. We will integrate these discussions into the final version.
This paper proposes OLLA, a projection-free Langevin dynamics method for sampling under nonconvex equality/inequality constraints. By combining tangential noise with normal drift ("landing"), OLLA enforces constraints without projections or slack variables. Theoretical analysis shows exponential convergence in Wasserstein distance, while a practical variant (OLLA-H) uses trace estimation for scalability. Experiments demonstrate superior efficiency over baselines in high dimensions.
Strengths and Weaknesses
Strengths:
- This paper introduces a novel SDE design with a landing term for constraint enforcement, combined with rigorous non-asymptotic convergence analysis for three constraint scenarios.
- This paper addresses a critical gap in constrained sampling, especially for nonconvex sets where projections fail, which also has practical impact for Bayesian inference, generative modeling, and other ML applications with constraints.
- This is the first method to unify equality/inequality constraints in Langevin dynamics without projections.
Weaknesses:
- Critical assumptions are deferred to the appendix, making it harder to evaluate the method’s applicability upfront. This obscures potential limitations, such as sensitivity to constraint degeneracy or nonsmoothness, which should be highlighted in the main text.
- Hyperparameters are sensitive but lack principled selection guidelines.
- Limited to synthetic experiments; real-world benchmarks would strengthen the impact.
Questions
- How should practitioners choose these hyperparameters for real-world problems? Are there heuristics or adaptive schemes?
- How does OLLA behave when constraints violate LICQ (e.g., degenerate gradients)?
- The paper notes degraded performance for large $\alpha$. Is there a theoretical or empirical threshold beyond which OLLA becomes impractical?
Limitations
Yes
Final Justification
The authors answered my questions and addressed my concern, and thus I would like to keep my positive score.
Formatting Concerns
No
We appreciate the reviewer for the constructive feedback. Below, we provide itemized responses to address each of the comments, and we hope they clarify your questions and concerns.
Weakness (3): Limited to synthetic experiments; real-world benchmarks would strengthen the impact.
Answers. We added two real-world tests that demonstrate the advantages of OLLA-H under complicated potentials and domain constraints. For the experimental results, please refer to the results we provided to reviewer tmLS.
Experiment (A): Molecular/Polymer system. In this experiment, we have a chain of atoms in $\mathbb{R}^3$ and sample from a complicated potential combining a degree-12 piecewise-polynomial term (e.g., a nonbonded pair potential) and a highly nonlinear term involving periodic cosine functions (e.g., a periodic dihedral torsion potential).
The equality constraints fix the bonded geometry of the chain (bond lengths and bond angles), and the inequality constraints impose minimum-distance (excluded-volume) conditions for each non-bonded atom pair.
Experiment (B): UCI German Credit: Bayesian 2-layer logistic regression with fairness + monotonicity constraints. In this experiment, we use a two-layer ReLU neural network with hashed categorical features (to incorporate categorical data). The potential function is the negative log-posterior with an isotropic Gaussian prior.
The equality constraints enforce fairness by equalizing the average predicted probability across groups of the sensitive attribute (e.g., gender), where the binary outcome label encodes good/bad credit and the predicted probability is produced by the network.
The inequality constraints enforce monotonicity: the network's logit must be non-decreasing in the features indexed by the monotone-increasing set and non-increasing in those indexed by the monotone-decreasing set, checked at a collection of anchor points.
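To make the two constraint types concrete, here is a minimal sketch (our own hypothetical helpers `fairness_gap` and `monotonicity_violations`, not the paper's code); `predict_proba` and `logit` stand in for the two-layer network:

```python
import numpy as np

def fairness_gap(predict_proba, X, s):
    # demographic-parity-style gap: equalize the mean predicted probability
    # across the two sensitive-attribute groups (equality constraint: gap == 0)
    return predict_proba(X[s == 0]).mean() - predict_proba(X[s == 1]).mean()

def monotonicity_violations(logit, anchors, j, delta=1e-2, increasing=True):
    # finite-difference monotonicity check at anchor points: bump feature j
    # and require the logit not to move against the declared direction
    bumped = anchors.copy()
    bumped[:, j] += delta
    diff = logit(bumped) - logit(anchors)
    return -diff if increasing else diff  # entries <= 0 when satisfied
```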
Summarized Results. In scenarios where the set of active inequality constraints is sparse (i.e., few inequalities are active at any step) or the manifold geometry is so intricate that each projection requires many Newton iterations, OLLA-H proves more computationally efficient than the baseline algorithms.
Weakness (2) & Question (1): Hyperparameters are sensitive but lack principled selection guidelines. How should practitioners choose them for real-world problems? Are there heuristics or adaptive schemes? Question (3): The paper notes degraded performance for large $\alpha$. Is there a theoretical or empirical threshold beyond which OLLA becomes impractical?
Answers. Below we give a principled rule and a practical recipe.
Landing rate $\alpha$. First, we observe that a monotonically decreasing upper bound on $W_2(\rho_t, \rho_\Sigma)$ is achieved whenever $\alpha \gtrsim \lambda_{\mathrm{LSI}}$ (from Theorem 3). However, $\lambda_{\mathrm{LSI}}$ is unknown in most cases, so we rely on the following informal theorem (Theorem 7.3, [2]):
- Theorem (Informal) Let $\Sigma$ be a compact Riemannian manifold with diameter $D$ and non-negative Ricci curvature. Then $\lambda_{\mathrm{LSI}} \gtrsim \lambda_1$, and hence $\lambda_{\mathrm{LSI}} \gtrsim 1/D^2$ holds, where $\lambda_1$ is the first nonzero eigenvalue of the Laplace-Beltrami operator.
Therefore, we start searching for a reasonable $\alpha$ at the $1/D^2$ scale, which not only drives the equality-constraint violation down quickly but also maintains the numerical stability of the discretization.
Step size $\Delta t$. For choosing a reasonable step size, we recall the preceding analysis of the decay of the constraint violation $h$. Because $h(X_t)$ decays exponentially fast at rate $\alpha$, we choose a reasonable $\Delta t$ such that $\alpha \Delta t \lesssim 1$, refined by some step-size search.
Boundary repulsion rate. This term plays a crucial role in satisfying the inequality constraints. Too small a rate makes trajectories sticky near the boundary of the feasible set; too large a rate makes the repulsion at the boundary so aggressive that samples overshoot. In practice, we monitor the inequality violation during sampling and adjust the rate so that samples neither overshoot nor stick at the boundary. Fortunately, it turns out that this rate is less sensitive than $\alpha$ (Table 6 in the Appendix), and it is often not hard to choose a good value.
Adaptive method for choosing $\alpha$. In our high-dimensional stress test, we observed that keeping $\alpha$ fixed while increasing the number of equalities $m$ slows the convergence rate (Figure 12, Appendix H). This phenomenon was mitigated by adaptively choosing $\alpha$: increasing $\alpha$ with $m$ alleviates the problem (described in Section 5, Appendix H.2). A plausible explanation is that, as $m$ grows, the feasible manifold's diameter $D$ shrinks, so the requirement $\alpha \gtrsim \lambda_{\mathrm{LSI}} \gtrsim 1/D^2$ effectively calls for a larger $\alpha$.
Threshold for $\alpha$ beyond which OLLA-H becomes ineffective. When $\alpha$ is increased, we must also reduce the step size $\Delta t$ to keep $\alpha \Delta t \lesssim 1$ and ensure discretization stability. Combining the two criteria above for choosing $\alpha$ yields a two-sided bound, roughly $1/D^2 \lesssim \alpha \lesssim 1/\Delta t$.
This implies that if the lower bound grows with $m$ (as observed in our stress test), a fixed $\Delta t$ will eventually violate $\alpha \Delta t \lesssim 1$. Consequently, $\Delta t$ must be reduced as $m$ increases. At that point, covering the same physical time horizon requires more steps, which starts to make OLLA-H less effective than the baselines.
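As a summary of this recipe, here is a minimal heuristic sketch (our own illustration, with hypothetical names `diam_est` and `c_stab`, not the paper's procedure) of the two-sided selection rule:

```python
# A minimal heuristic sketch of the recipe above (assumptions: diam_est
# estimates the feasible manifold's diameter D; c_stab < 1 is a stability
# margin for the alpha * dt <~ 1 criterion).
def suggest_hyperparams(diam_est, dt_max=1e-2, c_stab=0.5):
    alpha = 1.0 / diam_est**2          # start alpha at the 1/D^2 scale
    dt = min(dt_max, c_stab / alpha)   # enforce alpha * dt <= c_stab < 1
    return alpha, dt

print(suggest_hyperparams(diam_est=0.5))  # (4.0, 0.01): alpha * dt = 0.04
```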
Question (2): How does OLLA behave when constraints violate LICQ (e.g., degenerate gradients)?
Answers. LICQ is used in the theory to guarantee a uniformly well-conditioned normal space (and invertibility of the Gram matrix of constraint gradients). In practice, OLLA/OLLA-H remains well-defined and typically stable under LICQ failures (e.g., degenerate/collinear constraint gradients) by using a Moore-Penrose pseudoinverse (Remark 4, Appendix B).
To understand why this mitigates the problem, recall that the orthogonal projector induced by the pseudoinverse is still idempotent and projects onto the orthogonal complement of the span of the constraint gradients. Hence, it always projects onto the tangent space of the constraint manifold.
- Example (redundant gradients): consider two redundant constraints cutting out the unit circle, e.g., $g_1(x) = \|x\|^2 - 1$ and $g_2(x) = 2\,g_1(x)$. Here $\nabla g_1$ and $\nabla g_2$ are collinear, making the Gram matrix singular. However, the pseudoinverse-based projector still projects onto the circle's tangent line, which is precisely the intended behavior of our algorithm.
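The following minimal NumPy sketch (our own illustration of the Remark 4 construction, not the paper's code) checks this behavior numerically:

```python
import numpy as np

# Two redundant constraints g1(x) = ||x||^2 - 1 and g2(x) = 2*g1(x)
# define the same unit circle; their gradients are collinear.
def constraint_grads(x):
    g1 = 2.0 * x          # grad of ||x||^2 - 1
    g2 = 4.0 * x          # grad of 2(||x||^2 - 1), collinear with g1
    return np.stack([g1, g2])   # shape (2, d)

x = np.array([1.0, 0.0])        # a point on the circle
G = constraint_grads(x)         # constraint Jacobian
gram = G @ G.T                  # 2x2 Gram matrix, singular here

# Tangent projector via the Moore-Penrose pseudoinverse:
# P = I - G^T (G G^T)^+ G
P = np.eye(2) - G.T @ np.linalg.pinv(gram) @ G

v = np.array([0.3, 0.7])
print(P @ v)                    # -> [0., 0.7]: only the tangential part survives
print(np.allclose(P @ P, P))    # idempotent: True
```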
Weakness (1): Critical assumptions are deferred to the appendix, making it harder to evaluate the method’s applicability upfront. This obscures potential limitations, such as sensitivity to constraint degeneracy or nonsmoothness, which should be highlighted in the main text.
Answers. Thank you for pointing this out. This point was also raised by reviewer cZ9Z. We will move the key assumptions to the main text with short intuitions and illustrative examples. The following response briefly highlights the major changes; for detailed assumptions and clarifications, please see our reply to reviewer cZ9Z.
Reason for introducing the new assumption (M1''). One of our original mixed-case assumptions, (M1), required the projected manifold to lie in the interior of the inequality-feasible region throughout a small tubular neighborhood, which can be strong in practice. To mitigate this, we propose the following assumption (M1''):
- Assumption (M1''): Define $\Sigma_{\mathrm{in}}$ as the portion of $\Sigma$ lying in the interior of the inequality-feasible region. We assume that after a sufficiently large time $t_0$, there exist constants $C_{\mathrm{bd}}, \alpha_{\mathrm{bd}} > 0$ such that the mass of $\tilde{\rho}_t$ outside $\Sigma_{\mathrm{in}}$ is at most $C_{\mathrm{bd}} e^{-\alpha_{\mathrm{bd}} t}$.
- Corollary (informal): Under assumptions (C1)-(C3) and (M1'')-(M3), for sufficiently large $t$, it holds that
$
W_2(\rho_t, \rho_\Sigma) \leq \frac{M_h}{\kappa} e^{-\alpha t} + \sqrt{\frac{2}{\lambda_{\mathrm{LSI}}} \textsf{KL}^{\Sigma} (\tilde{\rho}_{t}^{\Sigma_{\mathrm{in}}} \,\|\, \rho_{\Sigma}) + C_{\mathrm{bd}} e^{-\alpha_{\mathrm{bd}} t} \cdot \mathrm{diam}_{\Sigma}(\Sigma)^2}
$
where $M_h$, $\kappa$, $\alpha$, and $\lambda_{\mathrm{LSI}}$ are as in Theorem 3.
This implies that when the inequalities and initial distribution are nice (e.g., (M1) holds or the process contracts to $\Sigma_{\mathrm{in}}$ quickly), the last term (including $\mathrm{diam}_{\Sigma}(\Sigma)^2$) decays rapidly and the overall rate can be improved; otherwise, that term becomes the bottleneck.
Thank you again for your positive review and helpful feedback. We'd appreciate the opportunity to discuss further if you have any additional questions or suggestions.
References
[2] M. Ledoux (1999). Concentration of measure and logarithmic Sobolev inequalities. Séminaire de Probabilités 33.
Thanks for the detailed response. I acknowledge the rebuttal and would like to keep my current score.
Thank you for your constructive feedback, questions, and your time. We will reflect the outcomes of these discussions in the final version.
This paper studies the problem of sampling from a density satisfying a log-Sobolev inequality with nonlinear equality and inequality constraints, even when the feasible set is nonconvex. The authors propose a novel framework, OLLA, which modifies Langevin dynamics to enforce constraints via a landing mechanism. By operating on the tangent space of the constraint manifold, OLLA avoids costly projections. The method achieves exponential convergence under some assumptions and is efficiently discretized using Hutchinson trace estimation. Empirical results demonstrate that OLLA matches or exceeds the performance of existing constrained sampling algorithms in both accuracy and computational efficiency, particularly in high-dimensional settings.
Strengths and Weaknesses
Strengths:
This paper proposes a novel sampling framework that combines Langevin dynamics with the projection-free landing mechanism from optimization and shows exponential convergence guarantees in 2-Wasserstein distance under mild LICQ and log-Sobolev assumptions. To make the algorithm more efficient, the authors further modify it using Hutchinson trace estimation, yielding low per-step cost even in high dimensions.
Weakness:
The stated assumptions are unclear and may not hold in practical scenarios, and the algorithm is highly sensitive to its hyperparameters, yet the authors provide no guidance on how to select them.
Questions
How should one select the landing rate and step size?
Since there is no theoretical analysis for OLLA-H, how does the Hutchinson trace estimator affect the convergence rate?
Could you clarify the assumptions underlying the main theorems and illustrate concrete settings in which they hold?
Limitations
yes
Final Justification
The authors' rebuttal addressed my concerns. I will keep my positive score.
Formatting Concerns
NA
We appreciate the reviewer for the constructive feedback. Below, we provide itemized responses to address each of the comments, and we hope they clarify your questions and concerns.
Weakness (1): The stated assumptions are unclear and may not hold in practical scenarios. Question (3): Could you clarify the assumptions underlying the main theorems and illustrate concrete settings in which they hold?
Answer. Below we restate the assumptions concisely, provide settings where they hold, and, for the mixed case, propose a milder replacement (M1'') for the strong assumption (M1), together with a corollary.
Core assumptions (C1)-(C3). First, we recall the assumptions:
- Assumption (C1)-LICQ: For every feasible point, the set of active constraint gradients is linearly independent.
This assumption guarantees a uniformly well-conditioned normal space and underlies the projection/landing analysis.
- Assumption (C2)-Regularity of $\Sigma$: $\Sigma$ is assumed to be compact and connected. Also, the constraint-violation function $h$ is smooth on a neighborhood of $\Sigma$ and coercive ($h(x) \to \infty$ as $\|x\| \to \infty$). When there are only inequality constraints, we impose an analogous regularity condition on the feasible set.
These conditions are required for the existence of the recoverable tubular neighborhood (from C4 in the Appendix) and for linking constraint violation to normal displacement: a smaller $h$ implies a smaller normal displacement within the recoverable tubular neighborhood.
- Assumption (C3)-Bounded initialization: the initial law $\rho_0$ is assumed to have bounded support.
This ensures a finite landing time (almost surely) into the recoverable tubular neighborhood (for equality constraints) or the constrained domain (for inequality constraints).
Examples satisfying (C1)-(C3). Many classical manifolds satisfy assumptions (C1)-(C3): for example, the sphere $S^{d-1}$, the Stiefel manifold $\mathrm{St}(n,p)$ (with $p \leq n$), and tori $\mathbb{T}^d$.
Mixed-case-only assumptions (M1)-(M3). Our mixed-case analysis introduced assumption (M1) to control how the projected manifold intersects the interior of the inequality-feasible region.
- Assumption (M1)-Regularity: the projected manifold lies inside the interior of the inequality-feasible region throughout the recoverable tubular neighborhood of width $\tau$.
We agree that (M1) can be strong in many practical geometries. To understand why it is assumed, consider the following examples, in which we fix an equality-constraint manifold, take the equality-only projection to be the nearest-point projection onto it, and vary the inequality constraint.
- Example (1), satisfying (M1): pick an inequality whose boundary stays well away from the equality manifold. In this case, the projected manifold lies inside the interior of the feasible region, satisfying assumption (M1). We also note that the projected process (under the nearest-point projection) lies fully on the manifold and is not stuck at the boundary.
- Example (2), violating (M1): pick an inequality whose boundary touches the equality manifold. Here the equality manifold is identical to Example (1), yet (M1) fails, and the projected process can get stuck at the boundary, slowing down the convergence of the constrained dynamics.
- Example (3), a dramatic case: pick an inequality whose feasible set coincides with that of Example (1) or (2) on the manifold, but whose ambient feasible region stretches far along directions away from the manifold. Then boundary hits persist and the process routinely overshoots. This behavior degrades sampling efficiency while the inequality violation hovers relatively high, even though it vanishes in the end.
To incorporate this, we propose a milder assumption (M1'') below:
- Assumption (M1''): Define $\Sigma_{\mathrm{in}}$ as the portion of $\Sigma$ lying in the interior of the inequality-feasible region. We assume that after a sufficiently large time $t_0$, there exist constants $C_{\mathrm{bd}}, \alpha_{\mathrm{bd}} > 0$ such that the mass of $\tilde{\rho}_t$ outside $\Sigma_{\mathrm{in}}$ is at most $C_{\mathrm{bd}} e^{-\alpha_{\mathrm{bd}} t}$.
- Corollary (informal): Under assumptions (C1)-(C3) and (M1'')-(M3), for sufficiently large $t$, it holds that
$
W_2(\rho_t, \rho_\Sigma) \leq \frac{M_h}{\kappa} e^{-\alpha t} + \sqrt{\frac{2}{\lambda_{\mathrm{LSI}}} \textsf{KL}^{\Sigma} (\tilde{\rho}_{t}^{\Sigma_{\mathrm{in}}} \,\|\, \rho_{\Sigma}) + C_{\mathrm{bd}} e^{-\alpha_{\mathrm{bd}} t} \cdot \mathrm{diam}_{\Sigma}(\Sigma)^2}
$
where $M_h$, $\kappa$, $\alpha$, and $\lambda_{\mathrm{LSI}}$ are as in Theorem 3.
To comment on the additional term: if the inequality constraints behave nicely, so that (M1) holds or the process contracts to $\Sigma_{\mathrm{in}}$ very fast, this term decays quickly, leading to a better convergence rate. This aligns with our observations in the examples above.
The remaining assumptions (M2) and (M3) are milder than (M1). For (M2), we can pick the corresponding constant sufficiently large, because we only require a one-sided bound. We comment on assumption (M3) in Remark 5 (Appendix A); it is likely to be satisfied for large enough times due to the uniformly elliptic behavior of the associated Fokker-Planck equation.
Question (2): Since there is no theoretical analysis for OLLA-H, how does the Hutchinson trace estimator affect the convergence rate?
Answer. The stochastic trace is used only in the mean-curvature term. In our formulation, this term is multiplied by the normal projection operator, hence the Hutchinson noise acts only in the normal direction, and tangential mixing on $\Sigma$ is minimally affected. What changes is the second-order discretization bias in the landing dynamics, as we show below:
- (Decomposition) Let $X_k$ denote the discretized (OLLA-H) process and let $\bar{X}_t$ be the target continuous process. Then the distance between their laws decomposes, via the triangle inequality, into a discretization-error term plus a continuous-dynamics mixing term.
The first term primarily depends on the step size $\Delta t$ and the statistics of the Hutchinson estimator, while the second term depends heavily on the on-manifold geometry (e.g., the LSI constant) and should be minimally affected by the Hutchinson noise (by our continuous-dynamics analysis in Theorem 3). Therefore, we analyze how the noise affects the first term.
- Lemma (Informal, equality-only case). Suppose the constraint violation $h(X_k)$ remains bounded almost surely. Then we have
$
\mathbb{E}[h(X_K)] \leq (1 - \alpha \Delta t)^K \, \mathbb{E}[h(X_0)] + \mathcal{O}\left( \frac{\Delta t}{\alpha} \left( 1 + \frac{d^2}{N} \right) \right)
$
where $K$ is the total number of iterations and $N$ is the number of Hutchinson probes.
Therefore, the Hutchinson estimator does not affect the convergence rate itself, but rather inflates the discretization error, which in turn vanishes as $\Delta t \to 0$.
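For concreteness, here is a minimal sketch of the Hutchinson estimator with Rademacher probes (our own illustration; `hvp` is a hypothetical Hessian-vector-product callback, e.g., obtained via autodiff):

```python
import numpy as np

# Unbiased trace estimation: E[z^T H z] = tr(H) for Rademacher probes z.
def hutchinson_trace(hvp, d, N, rng):
    est = 0.0
    for _ in range(N):
        z = rng.choice([-1.0, 1.0], size=d)   # Rademacher probe
        est += z @ hvp(z)                     # z^T H z via one HVP
    return est / N

rng = np.random.default_rng(0)
H = np.diag(np.arange(1.0, 6.0))              # toy Hessian, tr(H) = 15
print(hutchinson_trace(lambda v: H @ v, d=5, N=1000, rng=rng))  # approx 15
```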
Weakness (2): The algorithm is highly sensitive to its hyperparameters, yet the authors provide no guidance on how to select them. Question (1): How should one select the landing rate and step size?
Answer. Below we give a principled rule and a practical recipe.
Landing rate $\alpha$. First, we observe that a monotonically decreasing upper bound on $W_2(\rho_t, \rho_\Sigma)$ is achieved whenever $\alpha \gtrsim \lambda_{\mathrm{LSI}}$ (from Theorem 3). However, $\lambda_{\mathrm{LSI}}$ is unknown in most cases, so we rely on the following informal theorem (Theorem 7.3, [2]):
- Theorem (Informal) Let $\Sigma$ be a compact Riemannian manifold with diameter $D$ and non-negative Ricci curvature. Then $\lambda_{\mathrm{LSI}} \gtrsim \lambda_1$, and hence $\lambda_{\mathrm{LSI}} \gtrsim 1/D^2$ holds, where $\lambda_1$ is the first nonzero eigenvalue of the Laplace-Beltrami operator.
Therefore, we start searching for a reasonable $\alpha$ at the $1/D^2$ scale, which not only drives the equality-constraint violation down quickly but also maintains the numerical stability of the discretization.
Step size $\Delta t$. For choosing a reasonable step size, we recall the preceding analysis of the decay of the constraint violation $h$. Because $h(X_t)$ decays exponentially fast at rate $\alpha$, we choose a reasonable $\Delta t$ such that $\alpha \Delta t \lesssim 1$, refined by some step-size search.
Boundary repulsion rate. This term plays a crucial role in satisfying the inequality constraints. Too small a rate makes trajectories sticky near the boundary of the feasible set; too large a rate makes the repulsion at the boundary so aggressive that samples overshoot. In practice, we monitor the inequality violation during sampling and adjust the rate so that samples neither overshoot nor stick at the boundary. Fortunately, it turns out that this rate is less sensitive than $\alpha$ (Table 6 in the Appendix), and it is often not hard to choose a good value.
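One simple way to operationalize this monitoring, as a hedged sketch (ours, with the stand-in name `beta` for the repulsion rate; thresholds are illustrative):

```python
# Toy monitoring rule: shrink the repulsion rate when samples overshoot
# the boundary, grow it when trajectories stick to the boundary.
def adjust_repulsion(beta, frac_overshoot, frac_sticky, factor=1.5):
    if frac_overshoot > 0.05:   # frequent overshoots -> repulsion too strong
        return beta / factor
    if frac_sticky > 0.05:      # boundary-hugging -> repulsion too weak
        return beta * factor
    return beta
```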
Thank you again for your positive review and helpful feedback. We'd appreciate the opportunity to discuss further if you have any additional questions or suggestions.
References
[2] M. Ledoux (1999). Concentration of measure and logarithmic Sobolev inequalities. Séminaire de Probabilités 33.
Thank you for the detailed response. I will maintain my current positive score. I hope the authors will revise the paper based on the discussions with all reviewers.
We appreciate your valuable feedback, questions, and also your time. We will incorporate the outcomes of these discussions into the final version.
This paper makes use of the landing technique -- incorporating constraints directly into the guiding stochastic differential equation (SDE) -- from non-convex optimization to handle both equality and inequality constraints in sampling. The proposed algorithm, Overdamped Langevin with LAnding (OLLA), avoids projection onto the feasible set, and hence runs efficiently and is proven to converge exponentially fast. With Hutchinson trace estimation to approximate the Hessian trace in the mean-curvature term, OLLA can scale to high dimensions.
Strengths and Weaknesses
Strengths:
- The paper integrates both equality and inequality constraints into Langevin dynamics and avoids projections.
- The proposed OLLA algorithm comes with theoretic convergence guarantee.
- The OLLA-H algorithm scales to high dimensions and achieves state-of-the-art performance compared with similar constrained samplers.
Weaknesses:
- It is a little disappointing that OLLA-H has only comparable performance with CGHMC in the included tests. Hence, it remains doubtful whether practitioners would choose one over the other.
- In all the simulation tests, the highest dimensionality is 700. The paper could be strengthened by including an example of dimension bigger than 1000.
- In all the simulation tests, the target distributions are relatively simple -- either uniform or Gaussian -- though the constraints are complicated and versatile. This may not be very realistic: in practice, both the distribution and the constraints could be complicated. It would be more convincing to test OLLA(-H) on more practical distributions that are genuinely difficult to sample.
Questions
Here are a few more questions and suggestions:
- In the submanifold definition, the inequality constraints require $f_j(x) \leq 0$. However, in the set of active inequality constraints, why do we have $f_j(x) \geq 0$?
- line 97: typo.
- In Section 3, the constants and the tubular neighborhood were introduced very abruptly, with little background.
- line 123: typo.
Limitations
Yes.
Final Justification
Thanks for the authors' rebuttal that resolved my concerns. I will keep my score.
Formatting Concerns
N/A
We thank the reviewer for the thoughtful and constructive feedback. Below, we (1) give a brief overview of what we changed, and then (2) leave itemized responses to address each of the comments, and we hope they can clarify your questions and concerns.
Overview of main changes/clarifications.
- We added two new experiments (Table A, B) to demonstrate the advantages of OLLA-H under practical scenarios: (A) a physics-motivated molecular system with a nontrivial potential and complex constraints, and (B) a high-dimensional Bayesian logistic regression with fairness and monotonicity constraints.
- We clarify computational trade-offs between OLLA-H and CGHMC (and other projection-based baselines), including per-iteration complexity.
- We clarify the notation for the active inequality set and the background for the constants and tubular neighborhood introduced in Section 3.
Weakness (1): It is a little disappointing that OLLA-H has only comparable performance with CGHMC in the included tests. Hence, it remains doubtful whether practitioners would choose one over the other.
Weakness (2): In all the simulation tests, the highest dimensionality is 700. The paper could be strengthened by including an example of a dimension bigger than 1000.
Weakness (3): In all the simulation tests, the target distributions are relatively simple -- either uniform or Gaussian -- though the constraints are complicated and versatile. This may not be very realistic: in practice, both the distribution and the constraints could be complicated. It would be more convincing to test OLLA(-H) on more practical distributions that are genuinely difficult to sample.
Answer. OLLA-H is preferable to CGHMC (and the other baselines) when (1) a large portion of the inequality constraints stays inactive (loose boundaries, i.e., a small active set), (2) the constrained domain and potential function are so complicated that CGHMC is forced to use a small step size to retain a reasonable acceptance rate in the Metropolis-Hastings correction, or (3) the Newton steps for projection become numerous under a geometrically complicated feasible set.
Computational picture (per step).
Following the paper's notation, let $m$ be the number of equality constraints, $n$ the number of inequalities, and $\mathcal{A}(x)$ the currently active inequality set. OLLA-H forms and inverts a Gram matrix built only from the equalities and active inequalities, so its per-step linear algebra scales cubically in $m + |\mathcal{A}(x)|$, plus $N$ extra gradient evaluations, where $N$ is the number of Hutchinson probes. The slack-variable + Newton's-method approaches (CLangevin / CHMC) inflate the cubic constraint block to all $m + n$ constraints, multiplied by $T_N$, the number of iterations Newton's method needs to converge. CGHMC (Metropolis-Hastings + Newton's method) avoids slack variables, but samples are accepted or rejected by the Metropolis-Hastings correction.
When the boundary of the feasible set is tight and complex (many inequalities intermittently active), $|\mathcal{A}(x)|$ can grow, and OLLA-H's cubic term reflects this. However, in the same regime, CGHMC's acceptance rate typically deteriorates due to frequent boundary interactions. Likewise, if the step size is large enough to drastically change the Hamiltonian energy of the system, CGHMC's acceptance rate again drops, yielding low ESS per CPU-time. Simultaneously, the projection-based methods (CLangevin, CHMC, CGHMC) often require a larger $T_N$ and can slow down dramatically if the Newton iteration stalls on an irregular or high-dimensional feasible manifold.
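To illustrate the per-step linear algebra, here is a minimal sketch (ours, with hypothetical function and argument names, not the paper's code) of forming the Gram matrix from only the equality and active-inequality gradients:

```python
import numpy as np

# Only equality gradients and *active* inequality gradients enter the Gram
# matrix, so the cubic solve involves m + |A(x)| rows rather than m + n.
def active_gram(grad_eq, grad_ineq, ineq_vals):
    active = ineq_vals >= 0.0                    # active set A(x): f_j(x) >= 0
    J = np.vstack([grad_eq, grad_ineq[active]])  # (m + |A|) x d Jacobian
    gram = J @ J.T                               # small (m + |A|) Gram matrix
    # pseudoinverse for robustness under LICQ failures (Remark 4, Appendix B)
    return np.linalg.pinv(gram), J
```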
Additional Empirical Results (Appendix Table A,B).
For the experiment setup (potential, constraints), please refer to our reply to the reviewer 74bE.
Table A - Molecular system with a realistic potential (complex geometry, small active set, many inequalities)
The table below shows the estimated value of the test function, with total CPU time in brackets. The constraint violations for equality and inequality constraints are below 0.005 and 0.02, respectively, across all algorithms. The configurations $(d, m, n)$ are (15, 7, 6), (30, 17, 36), (45, 27, 91), (60, 37, 171), (90, 57, 406).
| Method \ dim | 15 | 30 | 45 | 60 | 90 |
|---|---|---|---|---|---|
| OLLA-H | 23.02 (22.51s) | 236.6 (42.48s) | 852.9 (67.38s) | 2152 (107s) | 7561 (257.8s) |
| CLangevin | 36.44 (294.4s) | 228.5 (1611s) | 925.5 (3724s) | 2162 (6631s) | 7572 (15230s) |
| CHMC | 31.1 (79.34s) | 266.8 (280.1s) | 881.3 (614s) | 2235 (1090s) | 7640 (2514s) |
| CGHMC | 29.8 (49.96s) | 256.8 (107.9s) | 895.7 (174.8s) | 2174 (234.3s) | 7607 (415.1s) |
In this setting, only a small subset of inequalities is active at each step, which keeps OLLA-H's Gram matrix compact and economical to compute. Also, due to the complexity of the potential and the constrained domain, the number of Newton iterations required by the projection-based baselines remains relatively high. Across dimensions, OLLA-H attains comparable estimates of the test function while achieving CPU-time improvements ranging from roughly 2x to 60x relative to the baselines, widening the gap in effectiveness as the dimension grows. This highlights OLLA-H's advantage when the geometry is intricate but the inequalities are sparsely active.
Table B - Bayesian logistic regression with fairness + monotonicity constraints (high-dimensional)
The table below shows the test NLL on the UCI German Credit data, with total CPU time in brackets. All baselines kept equality-constraint violations at a small level, and each algorithm (OLLA-H, CLangevin, CHMC) achieved small inequality-constraint violations across all dimensions.
| Method \ dim | 706 | 1986 | 4994 | 9986 | 49986 |
|---|---|---|---|---|---|
| OLLA-H | 0.603 (8.25s) | 0.541 (8.65s) | 0.551 (8.30s) | 0.564 (9.85s) | 0.577 (15.85s) |
| CLangevin | 0.604 (16.45s) | 0.546 (16.55s) | 0.550 (19.75s) | 0.544 (20.55s) | 0.557 (28.50s) |
| CHMC | 0.605 (18.10s) | 0.602 (19.40s) | 0.599 (21.35s) | 0.602 (27.05s) | 0.601 (52.20s) |
Here, the target density is a non-Gaussian posterior, and the constraints are also nonlinear functionals, both implemented via a two-layer neural network. In this experiment, CGHMC suffers from a low acceptance rate due to the complicated potential (or large step size) and frequent inequality activations, so we dropped its results. While the projection-based algorithms (CLangevin, CHMC) attain smaller inequality-constraint violations than OLLA-H, OLLA-H's lower computational cost at moderate constraint violations stands out, which matches the intended purpose of our algorithm.
Question (1): In the submanifold definition, the inequality constraints require $f_j(x) \leq 0$. However, in the set of active inequality constraints, why do we have $f_j(x) \geq 0$?
Answer. Thank you for pointing this out. Our notation is as follows:
- Feasible set: $\{x : f_j(x) \leq 0 \text{ for all } j\}$
- Set of active inequality constraints: $\mathcal{A}(x) = \{j : f_j(x) \geq 0\}$
Under this notation, constraints with $f_j(x) \geq 0$ are flagged as "active", and we force them to decay exponentially fast via the OLLA dynamics. We understand our sign convention for inequalities may not align with other literature. For example, [1] uses feasibility $f_j(x) \geq 0$ and defines the active set as $\{j : f_j(x) \leq 0\}$; however, this is equivalent to ours under a sign change of the inequalities.
Suggestion (1): In Section 3, the constants and the tubular neighborhood were introduced very abruptly, with little background.
Answer. Thank you. We agree that these should be introduced earlier with some intuition. We will move the definitions to the Preliminaries and add short intuitions with cross-references in the revision. Here, we give concise definitions, their roles, and background.
- The proportionality constant (Lemma C.1/C.3): this constant relates the constraint violation to the normal displacement, so that reducing $h$ decreases the distance to $\Sigma$. Its existence is guaranteed by the LICQ condition, and its value is determined by the manifold structure of $\Sigma$.
- The reach $\tau$ (Theorem C.1/C.2): $\tau$ is the reach of the "recoverable tubular neighborhood" $\Sigma_\tau$, the set of points within distance $\tau$ of $\Sigma$.
For points in $\Sigma_\tau$, the nearest-point projection onto $\Sigma$ is well-defined, and there exists a recovery map that reconstructs a point from its projection and its normal displacement. This structure underlies our analysis of the projected process.
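A concrete instance (our own illustration, not from the paper) is the unit sphere:

```latex
% The unit sphere S^{d-1} has reach \tau = 1, so the recoverable
% tubular neighborhood is
\Sigma_\tau = \{\, x \in \mathbb{R}^d : \operatorname{dist}(x, S^{d-1}) < 1 \,\}
            = \{\, x : 0 < \|x\| < 2 \,\},
% on which the nearest-point projection \pi(x) = x/\|x\| is well defined,
% and x is recovered from its projection and normal displacement \|x\| - 1 via
x = \big(1 + (\|x\| - 1)\big)\, \pi(x).
```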
Suggestion (2): Typos in Line 97 and Line 123.
Answer. Thank you for pointing these out. We will fix both errata.
Thank you again for your positive review and constructive feedback regarding the paper. We'd appreciate the opportunity to discuss further if you have any additional questions or suggestions.
References
[1] M. Muehlebach and M. I. Jordan (2022). On constraints in first-order optimization: A view from non-smooth dynamical systems. JMLR 23(256).
Thanks to the authors for their efforts to answer all my questions and address my concerns.
Thank you for the helpful feedback, questions, and your time. We will incorporate these discussions into the final version.
The paper proposes a new and unified projection-free sampling method based on Langevin dynamics that can handle both equality and inequality constraints. The paper also provides a non-asymptotic convergence analysis. Given the strong results and potential applications, I recommend acceptance.