7.3

/10

Poster3 位审稿人

最低7最高8标准差0.5

2.3

置信度

正确性3.7

贡献度3.7

表达3.7

NeurIPS 2024

Log-concave Sampling from a Convex Body with a Barrier: a Robust and Unified Dikin Walk

Yuzhou Gu,Nikki Lijing Kuang,Yian Ma,Zhao Song,Lichen Zhang

OpenReview PDF

提交: 2024-05-05更新: 2024-11-06

TL;DR

We design a Dikin walk for log-concave sampling over polytopes and spectrahedra with the fastest mixing time and efficient per iteration cost.

摘要

We consider the problem of sampling from a $d$-dimensional log-concave distribution $\pi(\theta) \propto \exp(-f(\theta))$ for $L$-Lipschitz $f$, constrained to a convex body (described by $n$ hyperplanes) equipped with a barrier function, contained in a ball of radius $R$ with a $w$-warm start. We propose a robust sampling framework that computes spectral approximations to the Hessian of the barrier functions in each iteration. We prove that for the polytope constraints, sampling with the Lee-Sidford barrier function mixes within $\widetilde O((d^2+dL^2R^2)\log(w/\delta))$ steps with a per step cost of $\widetilde O(nd^{\omega-1})$, where $\omega\approx 2.37$ is the fast matrix multiplication exponent. Compared to the prior work of Mangoubi and Vishnoi, our approach gives faster mixing time as we are able to design a generalized soft-threshold Dikin walk beyond log-barrier. We further extend our result to show how to sample from a $d$-dimensional spectrahedron, the constrained set of a semidefinite program, specified by the set $\{x\in \mathbb{R}^d: \sum_{i=1}^d x_i A_i \succeq C \}$ where $A_1,\ldots,A_d, C$ are $n\times n$ real symmetric matrices. We design a walk that mixes in $\widetilde O((nd+dL^2R^2)\log(w/\delta))$ steps with a per iteration cost of $\widetilde O(n^\omega+n^2d^{3\omega-5})$. We improve the mixing time bound of prior best Dikin walk due to Narayanan and Rakhlin that mixes in $\widetilde O((n^2d^3+n^2dL^2R^2)\log(w/\delta))$ steps.

关键词

Log-concave samplingDikin walk

评审与讨论

审稿意见

评分: 7置信度: 32024-06-26

The authors propose a sampling framework from a convex body with a barrier.

优点

The main strength of the paper lies at the novelty and the significance of the results, especially that of Theorem 1.1.

缺点

See questions below.

问题

Page 1: Can you please motivate the problem a bit better? For a broader CS audience? :)

Page 2: Can you please give some intuition about the benefit of using the Hessian in the design of your walk?

Page 3: Can you provide further details about the hardness of constructing the Hessian of the universal barrier?

Page 4: I do not understand Lines 154 -- 156.

Page 5: Line 15: Why is this probability 0.5 \min{\tau, 1}?

Page 6: Lines 196, 202: Can you please give some intuition about the symmetry and convexity properties? :)

Page 7: Line 240: I do not understand this math display.

Can you please elaborate on Equation (1)?

Page 8: Lines 285 -- 288: I do not understand this sentence :)

Can you please give some details about TensorSRHT?

Page 9: Can you please provide a more extensive future work? Thanks!

局限性

None.

评论- Rebuttal (continued)

2024-08-01

Q: Can you provide some more details about TensorSRHT?

A: TensorSRHT is a type of sketching matrix for the tensor product of vectors. To start off, consider the SRHT matrix, this matrix can be decomposed into $PHD$ , where $P$ is a sampling matrix that samples $m$ coordinates, $H$ is a Hadamard matrix and $D$ is a diagonal matrix with diagonal entries being i.i.d. Rademacher random variables. The key here is multiplying $PHD$ with an $n$ -dimensional vector takes $O(n\log n+m)$ time because the Hadamard matrix could be applied to vectors with fast Fourier transform. TensorSRHT is a tensorization of this transform: the matrix is defined as $P (HD_1 \otimes HD_2)$ where $\otimes$ is the Kronecker product. Using the mixed product property, this transform could be applied to tensor product of vectors without forming the tensor product: $P (HD_1\otimes HD_2)(u\otimes v)=P(HD_1 u \otimes HD_2 v)$ , in time $O(n\log n+m)$ instead of $O(n^2)$ . See [AKK+20] for more details.

Q: Can you provide a more extensive future work?

A: For sure. To further elaborate, we believe there are several important future directions:

Removing or mitigating the dependence on $R$ : one significant advantage of Dikin walk for uniform sampling is that the mixing time does not depend on $R$ , in contrast to hit-and-run. It will be important to develop a walk for log-concave sampling that does not depend on $R$ . We note that [KV24] shows that if the function $f$ is in addition exhibiting a strongly-convex property, then it’s possible to remove the $R$ . It will be interesting to see whether without strong convexity, this is still possible.
Improving per iteration cost. Many recent breakthroughs in convex optimization based on the interior point method leverage the fact that the Hessian matrix changes relatively slowly, hence it’s possible to develop linear algebraic data structures to maintain the Hessian matrix and reduce the per iteration cost. Note that our Markov chain also takes a significant number of steps to mix, and the key results we proved in the work shows that the Hessian changes relatively slowly. It will be interesting to design a linear algebraic data structure for maintaining the Hessian matrix and further accelerate the algorithm.
Better mixing for spectrahedron. It is known for spectrahedron, the hybrid barrier has self-concordance parameter $\sqrt{nd}$ [Ans00] in contrast to the $n$ of log barrier. It would be interesting to verify the three conditions on the hybrid barrier and obtain a walk mix faster than ours in the current work.

References:

[BF86] Imre Barany, Zoltan Furedi. Computing the volume is difficult. STOC 1986.

[NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM 1994.

[JKLPS20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. FOCS 2020.

[HJSTZ22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving SDP faster: a robust IPM framework and efficient implementation. FOCS 2022.

[AKK+20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David Woodruff and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. SODA 2020.

[Ans00] Kurt M Anstreicher. The volumetric barrier for semidefinite programming. MOR 2000.

[KV24] Yunbum Kook and Santosh Vempala. Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling. COLT 2024.

作者回复

2024-08-01

We appreciate your thoughtful comments and questions, and we would be happy to answer your questions.

Q: Can you motivate the problem a bit better, for a broader CS audience?

A: Yes. Sampling is a fundamental problem in computer science and machine learning. Consider any constrained optimization problem where the goal is to optimize a function over a constrained set, to run any kind of first-order or second-order algorithm, one has to first construct a good initial point. This usually boils down to sample from the constrained set according to particular distributions, which is the central topic studied in this paper. In recent years due to the surge of differentially private machine learning, one could also implement the exponential mechanism using log-concave sampling from convex bodies.

Q: Can you give some intuition on the benefits of using Hessian in your walk?

A: In short, the walk we are using is a variant of ball walk, which could be succinctly described as at the current point $x$ , randomly pick a point $y$ in the ball centered at $x$ with radius $\delta$ , and move to $y$ with correct probability. This idea however, does not work so well if the convex body is, say, a long and flat ellipse: we are constrained to pick the ball radius $\delta$ small, as otherwise we could walk outside of the convex body, hence the walk takes many steps to mix. Computing the Hessian counters the issue by instead computing a maximal volume ellipse around $x$ : this enables us to sample points with different scales in different directions, and hence enabling our walk to mix faster.

Q: Can you provide further details of the hardness of constructing the Hessian of universal barrier?

A: Given a $d$ -dimensional convex body $K$ , the universal barrier is defined as $\log\mathrm{vol}_d (K^\circ(x))$ where $\mathrm{vol}_d$ is the $d$ -dimensional volume and $K^\circ(x)$ is the polar set of $K$ with respect to $x$ : $K^\circ(x)=\\{ y\in \mathbb{R^d}: y^\top (z-x)\leq 1, \forall z\in K\\}$ . The main computational obstacle comes from even implementing the zeroth order oracle for the barrier function: estimating the volume of a high dimensional convex body is hard, no deterministic algorithm could estimate the volume to a factor of $(\frac{d}{\log d})^d$ [BF86], and randomized algorithms are essentially MCMC types that require generating samples from the convex body. See also [NN94] for a more detailed exposition.

Q: I do not understand line 154 - 156.

A: Line 154 - 156 says that the cost of computing the Hessian of log-barrier for a spectrahedra is $O(dn^\omega+n^2 d^{\omega-1})$ , if we consider the regime $n\geq d$ , then the first term $dn^\omega$ dominates since it has a larger exponent on $n$ . If $n\geq d^{\frac{3\omega-4}{\omega-2}}$ which is roughly $n\geq d^{8.4}$ , then our algorithm is faster than the $dn^\omega$ time needed by exact computation.

Q: Line 15: why is this probability $0.5 \min\\{\tau, 1 \\}$ ?

A: Because we are employing a lazy walk that only moves to the next point with half probability.

Q: Can you give more intuition on the symmetry and convex properties?

A: The symmetry property is very helpful when one tries to relate the self-concordance parameter with cross ratio distance. Intuitively, we prove that for all barriers of concern, they are also $\sqrt{\nu}$ -symmetry, and we further prove that the cross ratio distance between any two interior points $u$ and $v$ is at least $\frac{1}{\sqrt{\nu d+dL^2R^2}}$ , this establishes our mixing time. The convexity of the regularized barrier is crucial when we try to bound the discrepancy between the log determinants of the current point and the point we are trying to move to. Recall that we define the point to move to is $z=x+\Phi(x)^{-1/2} \psi$ where $\psi$ is an i.i.d. Gaussian vector. Given the convexity, we could bound the log determinant as $\log \det \Phi(z) - \log \det \Phi(x) \geq (z-x)^\top \nabla \log \det \Phi(x)$ , and the RHS term is precisely the bounded local norm which we could control.

Q: Line 240: I do not understand the math display. Can you elaborate more on Equation (1)?

A: The point of the math display of line 240 is to show that the probability could be decomposed into a determinant term and a term depending on the local norm of $x$ and $z$ , and our subsequent analysis shows how to bound these two terms. For Equation (1), it is the convex program associated with $\ell_p$ Lewis weights. It involves finding a maximum volume ellipsoid subject to the (soft) constraint that the polytope is contained in the ellipsoid. Here, by ``soft’’, we mean that instead of forcing $a_i^\top M a_i\leq 1$ for all $i$ which corresponds to the $\ell_\infty$ norm, we instead consider an $\ell_p$ norm constraint, which could be written as $\sum_{i=1}^n (a_i^\top Ma_i)^{p/2}\leq d$ .

Q: Line 285 - 288: I do not understand this sentence.

A: This sentence compares our sampling algorithm with the IPM-based algorithms for semidefinite optimization [JKLPS20, HJSTZ22]. In those algorithms, the Hessian matrix could be carefully maintained and controlled, further improving the per iteration cost of the algorithm. For us however, while it might still be possible to develop a similar framework to reduce the per iteration cost, the picture is much less clear, and the strategy we employ is a simple method that does not require any sophisticated maintenance of the Hessian matrix.

2024-08-06

Thank you! :)

审稿意见

评分: 7置信度: 12024-07-15

The paper gives a Markov chain Monte Carlo algorithm for estimating the probability of a high dimensional polytope under a log-concave probability distribution. The algorithm improves the mixing time of previous results, while maintaining the best known per-iteration cost. More specifically, in conjunction with a known self-concordant barrier function, they use a cheap spectral approximation of the Hessian matrix required in an iteration, and they modify the Markov chain to converge more quickly despite the approximation. A second result applies similar ideas to spectrahedrons, and there the improvement over past results is both the asymptotic number of iterations and the asymptotic computational complexity of each iteration.

优点

This is an important and fundamental problem that has been researched extensively recently, and the result improves upon the best known algorithms.

缺点

The improvement is somewhat incremental, but nonetheless significant.

问题

None.

局限性

None.

作者回复

2024-08-01

We thank you for your review. We respectfully disagree with your point that the improvement we obtain is incremental, as the prior best algorithms for $d$ -dimensional polytopes with $n$ hyperplanes mixes in $nd$ steps, while ours mixes in $d^2$ steps. For the regime where $n\gg d$ , this is a huge improvement, and it is in fact the best known mixing time for uniform sampling over polytopes. Moreover, our framework extends to general convex bodies with well-behaved self-concordant barrier functions. Prior works [MV23] could only handle polytopes.

Reference

[MV23] Oren Mangoubi, Nisheeth K. Vishnoi. Sampling from Structured Log-Concave Distributions via a Soft-Threshold Dikin Walk. NeurIPS 2023.

评论- Reply to rebuttal

2024-08-09

Thank you for your response. I did say that the improvement is significant. I agree that the improvements in the mixing time, and in the scope of application are significant.

2024-08-09

We appreciate your clarification and thank you again for your valuable review!

审稿意见

评分: 8置信度: 32024-07-17

In this paper, the authors give improved mixing times for a Markov chain whose goal is to approximately sample from distributions whose densities are proportional to $\exp(-f(x))$ , where $f$ is $L$ -Lipschitz and convex, restricted to a convex set defined by the intersection of $n$ hyperplanes that lies in a Euclidean ball of radius $R$ . The Markov chain itself is an approximate variant of the Dikin walk that tolerates using spectral approximations to the Hessian instead of the exact Hessian for drawing the next sample.

The main result is as follows. Given a subroutine to $1/d$ -spectrally approximate the Hessian of the barrier that runs in time $C_g$ and given that the barrier itself is $\nu$ -self concordant, the authors give a Markov chain whose iteration complexity to converge to a $\delta$ -TV-approximate distribution to the target is $\widetilde{O}(\nu d + d L^2 R^2)\log(1/\delta)$ . Each step has runtime $C_g + d^{\omega}$ .

This general framework recovers some of the best known mixing time bounds for special cases of the studied problem, including sampling from the uniform distribution over a convex body. Moreover, for the setting considered, as far as I can tell, this is the first guarantee whose iteration complexity does not depend on the number of constraints defining the polytope (this seems to get replaced by the self-concordance parameter of the barrier, which can be made to be $\widetilde{O}(d)$ using more sophisticated barriers than the log-barrier).

The ideas also transfer to giving mixing times for the problem of sampling uniformly from the constraint set of a covering SDP. Again, a key part of the contribution is that the walk can tolerate using spectral approximations to the Hessian of the barrier (in this case, the log det barrier) instead of requiring it exactly.

The main technical insight is that there are a few properties that one needs the barrier to satisfy in order to give the $O(\nu d)$ dependence in the mixing time. They consist of a symmetry assumption, a bounded local norm condition, and that the barrier is convex under regularizing it with an identity matrix (see page 6). Under these assumptions, the authors prove that their variant of the Dikin walk satisfies the promised mixing time (the argument is executed in Appendices B, C, D).

优点

The problem addressed in this work is an important one to advance our understanding of the geometry of polytopes. The conditions under which mixing times can be obtained are pretty general, and as the authors show, only depend on certain natural properties of the barrier function that is being used. The results also imply significant quantitative improvements over prior work (in particular the result about uniformly sampling from polytopes with an iteration complexity that is independent of the number of constraints when $n$ is much larger than $d$ ).

缺点

There are a few questions I have (see below).

Besides that, a very minor comment -- in Section 2.3, the Lewis weight optimization program as written is not actually convex. But, there is a convex formulation for all $p > 0$ given in [LS19].

问题

Can one obtain better mixing times under more structural assumptions on the convex set (e.g. symmetry, some notion of uniform convexity, etc)? Or is this an open question?

Do you think it is possible to get runtime improvements by intelligently recycling Hessian computations? Then, the amortized runtime of finding each approximate Hessian could improve over the stated $C_g + d^{\omega}$ . In particular, self concordance implies some kind of Hessian stability (for any two nearby points, the Hessians at those points are spectrally close) -- so, could you hope that the Hessians of the barrier aren't changing too much between two consecutive steps?

局限性

Yes

作者回复

2024-08-01

We deeply appreciate your thoughtful and positive feedback for our work. We thank you for pointing out the convexity of the Lewis weights program for all $p>0$ as indicated in [LS19], and this is particularly the case for our application when $p$ is $\mathrm{poly}\log n$ . For your questions,

Q: Could one hope to obtain a better mixing time bound on more structural assumptions on the convex sets?

A: To the best of our knowledge, the only work that explores this frontier is due to Kook and Vempala [KV24], where they showed that if $f$ satisfies the so-called relatively strongly-convex property, then the mixing time could be independent of $R$ .

Q: Do you think it’s possible to get further runtime improvements by smartly reusing Hessian computations?

A: Yes, we do believe that it’s possible to further reduce per iteration cost by developing a linear algebraic data structure to maintain the inverse Hessian matrix. Similar to many IPM-based algorithms for optimization [CLS19, LSZ19, JSWZ21], since the algorithm takes many steps, the relative change between iterations is small, hence it’s possible to design data structures to amortize the cost. We believe this is an important next step for this line of research, as our work has essentially obtained a good (up to dependence on $L$ and $R$ ) mixing time.

References:

[LS19] Yin Tat Lee and Aaron Sidford. Solving Linear Programs with Sqrt(rank) Linear System Solves. 2019.

[KV24] Yunbum Kook and Santosh Vempala. Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling. COLT 2024.

[CLS19] Michael Cohen, Yin Tat Lee and Zhao Song. Solving Linear Programs in the Current Matrix Multiplication Time. STOC 2019.

[LSZ19] Yin Tat Lee, Zhao Song and Quiyi Zhang. Solving Empirical Risk Minimization in the Current Matrix Multiplication Time. COLT 2019.

[JSWZ21] Shunhua Jiang, Zhao Song, Omri Weinstein and Hengjie Zhang. Faster Dynamic Matrix Inverse for Faster LPs. STOC 2021.

评论- response to author response to review

2024-08-08

Thanks so much for the detailed response :)

最终决定Accept (poster)

2024-09-25

I would like to thank both the authors and reviewers for the good discussion. The reviewers broadly agree that the paper is interesting and should be accepted. I concur and recommend acceptance.