Asymptotics of Alpha-Divergence Variational Inference Algorithms with Exponential Families

审稿意见

评分: 7置信度: 32024-06-22

This paper focuses on the theoretical study of variational inference using alpha-divergences using exponential family. This is an important problem in the field of variational inference. Specifically, this work proves a geometric convergence rate for the algorithm proposed in [9]. Moreover, this paper also proposes an alternative optimization algorithm and its convergence guarantees are provided. The proposed algorithm finds applications in VAE.

优点

Asymptotic convergence rate of the algorithm proposed in [9] is derived in this work. Moreover, this paper proposes an alternative unbiased algorithm with convergence guarantees.
Assumptions (H1)-(H3) follow by reasonable explanations.
Applications of the proposed algorithm to VAE are provided.

缺点

It only considers the variational family to be exponential models.
The convergence is proved for the proposed algorithm in the asymptotic sense.

问题

Typo at line 81, ''negative'' $\to$ ''positive'' since you drop the negative $\frac{1}{\alpha(\alpha-1)}$ term.
What is the technical difficulty to derive a convergence result similar to Theorem 1 for the proposed algorithm?

局限性

Yes.

作者回复

2024-08-03

Thank you for your thorough review and honest feedback. Below, we provide detailed responses to your second question. The two weaknesses you pointed out are adressed in our global rebuttal. We hope that our responses will clarify the issues you raised.

What is the technical difficulty to derive a convergence result similar to Theorem 1 for the proposed algorithm?

It does not present any major difficulty and can be done by following the exact same steps as in the proof of Theorem 1. We would get a similar result, with a slightly slower rate however: the $\gamma$ in the convergence speed would be replaced by $\gamma V(\eta_*)$ , and we have $V(\eta_*) \leq 1$ by Jensen’s inequality. It is easy to gain some intuition for why this is the case: under the mean parameterization, the update is changed from $(1 - \gamma) \mu + \gamma \mathcal R$ to $(1 - \gamma V) \mu + \gamma V \mathcal R$ , so we are still performing a convex combination of $\mu$ and $\mathcal R$ but with a less aggressive coefficient, thus yielding a slower rate of convergence. This also provides a heuristic as to why UNB moves slower at the start of the experiments: when the variational parameter is poorly chosen, $V$ is closer to $0$ , which caps the progress that can be made at each step when the only values allowed for $(\gamma_t)$ are between $0$ and $1$ .

We are grateful for your careful review, we hope our responses have adequately addressed your questions and provided further insights.

评论- Response to the rebuttal

2024-08-08

Thank you for the response. I would like to keep my score.

审稿意见

评分: 8置信度: 22024-06-28

This procedure studies the asymptotic properties of a variational algorithm that ensures a monotonic decrease in the alpha-divergence. Specifically the authors investigate behavior in the setting where the variational distribution belongs to the exponential family of distributions. In this setting, and when a key integral can be analytically calculated, the the authors show monotonic convergence at a geometric rate. For scenarios where a key integral cannot be calculated explicitly, the authors show an unbiased empirical approach. Furthermore the authors show that this empirical approach enjoys almost sure convergence to a local minimizer of the alpha-divergence. Finally, the paper presents, both, simulated and real-data experiments to support the theoretical claims.

优点

This is an very well-written paper and clear despite its high level of mathematical rigor. The authors consider an important problem of alpha-divergence minimization, which has broad impacts in the ML community. The convergence analysis and unbiased empirical minimization algorithm appear novel and interesting.

缺点

The paper does not contain any major weaknesses that I could find, beyond limitations of the analysis. These limitations include the specific assumption of an exponential family variational distribution, as well as the stated assumptions H1-H4 and C0-C3.

Below are some detailed comments:

L25 : The inclusive KL divergence should be $D(p||q)$ .
L69 : Shouldn't the posterior be proportional to the joint $p(x,y)$ rather than the marginal $p(y)$ ?
L80 : "Assuming the argmax is uniquely defined at each iteration" it is unclear whether this is a reasonable assumption
Eq. (5) : Notation $(\partial A)^{-1}(\mu)$ undefined
L108 : Is it obvious that $\mathcal{R} = U/V$ ?
L196 : Notation $\circ$ undefined (is this a composition?)

Perhaps the biggest weakness is Sec. 5. This section falls a little flat as it concludes "we both obtain a biased and unbiased algorithm" without making those algorithms explicit or obvious (at least to me). Also, it is unclear how strong an assumption it is that the exponential family does not depend on $x$ as in $q(\cdot \mid x) = q$ . In what reasonable scenario would the encoder be independent of the input?

问题

See "Weaknesses" section.

局限性

The authors do not explicitly state limitations of the proposed methodology. Nevertheless the authors do make explicit assumptions of the analysis.

作者回复

2024-08-03

Thank you for your meticulous review and kind feedback. Below, we provide detailed responses to each of your questions, aiming to further clarify our work and address any remaining uncertainties.

Diverse remarks on notation + L196 : Notation $\circ$ is undefined (is this a composition?)

Yes, $\circ$ denotes a composition. We thank you for carefully reading our paper and will make sure to correct these mistakes.

L69 : Shouldn't the posterior be proportional to the joint $p(x, y)$ rather than the marginal $p(y)$ ?

Indeed, we can take $p(y) = p_{\theta}(x, y)$ . The dependency in $x$ is dropped for notational ease.

L80 : "Assuming the argmax is uniquely defined at each iteration" it is unclear whether this is a reasonable assumption.

As you point out, this assumption can be challenging in the general case, but when the variational family is an exponential family as in (H1), the argmax in (1) is uniquely defined by $(1 - \gamma) \partial A(\eta) + \gamma \mathcal R(\eta)$ if and only if this quantity is in $F$ . This is why the choice of $\gamma$ is so critical: if $\mathcal R(\eta)$ is not in $F$ , we must take $\gamma$ small enough to ensure that the next iterate will be valid. Under (H1), the set $F$ is open and convex, which implies two things. First, by convexity, if $\gamma_0 \in (0, 1]$ is a valid choice, then all $\gamma \in (0, \gamma_0]$ are also valid choices. Second, by openness, it is always possible to find such a $\gamma_0$ .

L108 : Is it obvious that $\mathcal R = U / V$ ?

Note that $V$ is the normalizing constant that makes $q_{\eta}^{\alpha} p^{1-\alpha} / V$ a probability density function. By linearity of the integral, $U / V$ is the expectation of the measurable function $S$ under a distribution with density $q_{\eta}^{\alpha} p^{1-\alpha} / V$ , i.e. $\check\varphi_{\eta}^{\alpha}$ . This explains why $\mathcal R = U / V$ .

[Sec. 5.] falls a little flat as it concludes "we both obtain a biased and unbiased algorithm" without making those algorithms explicit or obvious. Also, it is unclear how strong an assumption it is that the exponential family does not depend on $x$ as in $q = q(\cdot | x)$ .

The assumption $q_{\eta}(\cdot | x) = q_{\eta}$ in Section 5 was used only to illustrate a specific point: when $p_{\theta}$ is fixed and the decoder does not depend on $x$ (i.e., $p_{\theta} = p$ and we only update $\eta$ ), the exact gradient of the VR bound equals (up to a negative factor) $\mathcal R(\eta) - \partial A(\eta)$ . This was meant to show that, in this particular scenario, an iteration of the algorithm discussed in Section 3 corresponds to a gradient ascent step on the VR bound. This assumption is not representative of the VAEs' training process, which uses the gradient estimators (17) and (18) with an Adam optimizer. Specifically, estimator (18) is used for VR and estimator (17) for UB. We will reorganize this section to clarify these points.

Thank you once again for your detailed review and supportive comments. We hope that our responses have satisfactorily addressed your questions and concerns.

评论- Thanks for the rebuttal

2024-08-10

Thanks for the thorough rebuttal. I think this is a strong piece of work and will keep my score. Unfortunately I can't increase my confidence score as I am unfamiliar with the relevant work in this area and I have not had a chance to thoroughly validate the analysis beyond a standard detailed read through.

As a side note, I personally find $p(y) = p_\theta(x,y)$ notation to be very confusing / misleading. Especially when you also refer to the prior $p_\theta(y)$ (L57). You may want to consider revising this. At the very least you should explicitly state that x is assumed implicit for brevity.

评论- Response to Reviewer Comment

2024-08-10

Thank you again for your positive feedback. Regarding the notation, we understand your concern about potential confusion. We will emphasize the fact that the data $x$ is fixed throughout the optimization process, and will rename the prior $p_0$ to avoid any ambiguity and improve the accessibility of the paper.

审稿意见

评分: 7置信度: 32024-06-29

The paper studies the converge properties of an optimization algorithm, used to minimize the alpha divergence when the variational approximation belongs to the family of exponential distributions. The optimizer is a modification of an existing method and its performance is competitive with existing methods (without being stronger). However, the new approach enjoys theoretical guarantees.

优点

This paper is line with recent publications on proving the convergence of VI when minimizing the KL-divergence. Obtaining similar results for the alpha-divergence is a natural next step. The main contribution of the paper is theoretical, and I find the results convincing (although I did not read the appendix in detail).

The theory motivates a revision to existing methods, which seems to be competitive. Minimizing alpha-divergence is notoriously difficult and I appreciate the paper's discussion of algorithms to do so, and demonstration on some examples.

缺点

The paper lacks clarity and is difficult to read.

First, the notation is heavy. Two ways to address this might be to introduce a table of notation and to clearly delineate definitions as formal statements. I'm guessing this wasn't done because of space constraints, but this could be a good use of the additional page for accepted papers.

In Section 5, I was puzzled by the assumption that $q_\eta(\cdot \mid {\bf x}) = q_\eta$ , that is the factor does not depend on ${\bf x}$ . Does this mean that the encoder maps each image to the same latent representation? It seems like the VAE would then be useless, at least for the task of data compression. This also makes me question the results of Section 6. I'm guessing that while the VAE fails to do any meaningful compression of existing images, it may still generate convincing new images. I'd like to see such images (potentially in the appendix), as a supplement to Table 1, and a simple check that the model is trained in a useful manner.

Once the authors clarify this point, I can adjust my score.

问题

I'll use this for minor comments and clarification questions.

instead of (or in addition to) exclusive and include KL, write KL(q||p) and KL(p||q)
Figure 1 and Figure 2 are hard to read. In Figure 1, indicate the optimum. Could the colors for the objective function be more distinct than the colors of the trajectories?
Figure 2: font size should font size of text. Could the box plots be replaced with trajectories and shaded intervals?
line 314: fix the reference to eq:vrbound
Define the IWAE and VAE baselines. Does this simply correspond to minimizing KL(q||p) and KL(p||q)? Is this using the variational family as prescribed in line 275?

局限性

Yes.

作者回复

2024-08-03

We are grateful for your careful reading and constructive comments on our submission. Below, we respond to each of your questions, hoping to clarify any uncertainties.

The paper lacks clarity and is difficult to read.

We are sorry to hear that despite our best efforts to make the paper and the notation as clear and readable as possible, you encountered some difficulty in reading our work. We acknowledge that the notation in the paper is dense and will do our best to improve the overall readability by following your advice.

In Section 5, I was puzzled by the assumption that $q_{\eta}(\cdot | x) = q_{\eta}$ , that is the factor does not depend on $x$ . Does this mean that the encoder maps each image to the same latent representation?

The assumption $q_{\eta}(\cdot | x) = q_{\eta}$ in Section 5 was used only to illustrate a specific point: when $p_{\theta}$ is fixed and the decoder does not depend on $x$ (i.e., $p_{\theta} = p$ and we only update $\eta$ ), the exact gradient of the VR bound equals (up to a negative factor) $\mathcal R(\eta) - \partial A(\eta)$ . This was meant to show that, in this particular scenario, an iteration of the algorithm discussed in Section 3 corresponds to a gradient ascent step on the VR bound. This assumption is not representative of the VAEs' training process, which uses the gradient estimators (17) and (18) with an Adam optimizer. Specifically, estimator (18) is used for VR and estimator (17) for UB. We will reorganize this section to clarify these points.

Figure 2: font size should be font size of text. Could the box plots be replaced with trajectories and shaded intervals?

Like you suggest, we originally intended for Figure 2 to show full trajectories with shaded intervals, but the result was cluttered and difficult to interpret due to the overlapping of five lines and their respective intervals. Additionally, the trajectories were quite noisy, although Polyak-Ruppert averaging could mitigate this specific issue. We will make further efforts to improve the clarity of the plot.

Define the IWAE and VAE baselines. Does this simply correspond to minimizing KL(q||p) and KL(p||q)?

Yes, IWAE is obtained when $\alpha = 0$ and VAE corresponds to $\alpha = 1$ .

We extend our gratitude for your careful review and insightful feedback, and we hope our responses have effectively addressed your questions and concerns. We acknowledge the issues you noticed in the writing of our paper and will make sure to fix them.

评论- Response to rebuttal

2024-08-09

I've read the authors' rebuttals and I thank them for addressing my comments.

I'd like to ask for a clarification regarding section 5 and the special case $q_\eta(\cdot \mid {\bf x}) = q_\eta$ . The authors write "we can exploit this link to derive a VAE training procedure for the unbiased algorithm (10)." After re-reading the paragraph, I still do not understand how this link is exploited in order to generate a training algorithm for the case where $q_\eta(\cdot \mid {\bf x}) \neq q_\eta$ .

Regarding the figure, I understand that plotting trajectories can sometimes hurt readability. I hope the authors will consider adjusting the font size.

Yes, IWAE is obtained when $\alpha = 0$ and VAE corresponds to $\alpha = 1$ .

Thank you for the clarification.

评论- Clarification on Section 5

2024-08-10

We acknowledge that Section 5 lacks clarity and appreciate the opportunity to resolve this issue. In what follows, we expand on the reasoning behind this part of the paper.

Consider an algorithm whose updates are of the form $\mu_{t+1} = \mu_t +\gamma_t \mathrm M_th(\mu_t)$ , where $\mathrm M_t$ is a positive definite matrix and $h:F\to F$ . Usually, $h$ is the gradient of some function $H$ that we want to maximize.

In our case, it is equivalent to consider $\mathrm M_t = \mathrm{Id}$ and $h=\partial_{\eta} H$ , or $\mathrm M_t=\mathrm F(\mu_t)$ and $h=\partial_{\mu} H$ , where $\mathrm F(\mu)$ is the Fisher Information Matrix of the model at $\mu$ ,. This arises directly from the chain rule, along with the identities $\mu = \partial A(\eta)$ and $\partial_{\eta}\mu=\mathrm F(\mu)$ recalled in Appendix A.1. Indeed, they imply $\partial_{\eta} H(\mu)=\mathrm F(\mu)\partial_{\mu} H(\mu)$ .

Since the gradients of interest are easier to compute w.r.t. $\eta$ , we will choose the first option. Specifically, we have $\partial_{\eta}q_{\eta}(y)=\left(S(y) - \mu\right) q_{\eta}(y)$ , which leads to $\partial_{\eta} V(\mu)=\alpha \left(U(\mu) - \mu V(\mu)\right)$ .

For the unbiased algorithm $(5)$ , we set $h(\mu)=\frac{U(\mu)}{V(\mu)} - \mu$ , which we recognize as the gradient of $H:\mu\mapsto\frac{1}{\alpha}\log V(\mu)$ .

Multiplying $H$ by the positive constant $\frac{\alpha}{1-\alpha}$ and writing it in integral form, we find that $\frac{\alpha}{1 - \alpha} H(\mu)=\frac{1}{1-\alpha}\log \mathbb E_{y\sim q_{\eta}}\left[\left(\frac{p(y)}{q_{\eta}(y)}\right)^{1-\alpha}\right]$ (remember that $\mu$ and $\eta$ are related by the one-to-one mapping $\partial A$ ).

In the more general case where we have $p_{\theta}(\cdot, x)$ instead of $p$ and $q_{\eta}(\cdot|x)$ instead of $q_{\eta}$ , this is the Variational Rényi (VR) bound: $\mathcal L_{\alpha}^{\mathrm R}(p_{\theta}, q_{\eta}, x)=\frac{1}{1-\alpha}\log \mathbb E_{y\sim q_{\eta}(\cdot | x)}\left[\left(\frac{p_{\theta}(x, y)}{q_{\eta}(y| x)}\right)^{1-\alpha}\right]$ .

In other words, the exact biased algorithm is a particular case of gradient ascent on the VR bound.

By doing a similar work for the unbiased algorithm, where $h(\mu_t)=U(\mu_t) - \mu_tV(\mu_t)$ , we get $H(\mu)=\frac{1}{\alpha} V(\mu)=\frac{1}{\alpha} \mathbb E_{y\sim q_{\eta}}\left[\left(\frac{p(y)}{q_{\eta}(y)}\right)^{1-\alpha}\right]$ . Hence, the unbiased algorithm is a particular case of gradient ascent on a new bound given by $\mathcal L_{\alpha}^{\mathrm G}(p_{\theta}, q_{\eta}, x)=\frac{1}{1-\alpha} \mathbb E_{y\sim q_{\eta}(\cdot | x)}\left[\left(\frac{p_{\theta}(x, y)}{q_{\eta}(y| x)}\right)^{1-\alpha}\right]$ .

Using the REINFORCE gradient and backpropagation, we train VAEs to maximize these bounds. The respective methods are referred to as VR and UB in the paper.

We hope this reformulation of the proof will help clarify any ambiguities. Should you have any further questions or require additional explanations, please let us know. Thank you again for your time and constructive comments.

2024-08-11

Thank you for engaging. However these additional details do not provide the clarification I'm looking for. Namely, why do we need to study the case where $q_\eta (\cdot \mid {\bf x}) = q_\eta$ in order to derive a learning algorithm?

In the above derivation, I do not see where in the calculation of $\frac{\alpha}{1 - \alpha} H(\mu)$ we exploit the fact that $p(\cdot, x) = p$ and $q_\eta = q(x)$ . And once we derive $H$ in this particular case, why can we backtrack to the more general case?

I believe the writing should make it clear why this detour is necessary, how is the simpler case exploited, and why can we then go back to the more general case. Would it not be possible to directly work in the more general case?

Using the REINFORCE gradient and backpropagation, we train VAEs to maximize these bounds.

The authors have not defined "reinforce" gradient, could you define it?

2024-08-12

Thank you for your feedback. We appreciate the opportunity to clarify our approach.

In the paper, we distinguish between two contexts of variational inference: the traditional framework discussed in sections 3 and 4, and the more complex VAE training setup.

In the traditional setting, we want to approximate the true posterior $p_{\theta}(\cdot|x)$ by learning the parameters of a single distribution $q_{\eta}(\cdot|x)$ . Since the data $x$ remains fixed throughout the entire training process, we drop all dependencies in $x$ : we suppose that $p_{\theta}(\cdot|x)$ is known up to a constant (through the function $p$ ), and simply we denote $q_{\eta}$ . This does not mean that the learnt parameter isn't dependent on the data. Rather, we are optimizing a parametric distribution over the entire dataset, e.g., coefficients in Bayesian logistic regression.
In contrast, VAEs involve optimizing both an encoder $q_{\eta}(\cdot|x)$ and a decoder $p_{\theta}(x|\cdot)$ simultaneously. One of the main goals with VAEs is to learn a meaningful latent representation of the data, so each data point must be effectively mapped to the latent space, hence the particular conditioning on $x$ . Unlike the traditional setup, where we learn a single distribution, VAEs require joint optimization of two networks. This complexity necessitates adjusting the approach.

The derivation in our paper shows that the algorithms MAX and UNB perform gradient ascent on specific variational bounds, namely the VR bound and $\mathcal L_{\alpha}^{\mathrm{G}}$ .

The VR bound, for example, is defined as $\mathcal L_{\alpha}^{\mathrm R}(p_{\theta}, q_{\eta}, x)=\frac{1}{1-\alpha}\log \mathbb E_{y\sim q_{\eta}(\cdot|x)}\left[\left(\frac{p_{\theta}(x, y)}{q_{\eta}(y|x)}\right)^{1-\alpha}\right]$ .

If we are in setup 1, the only thing we can act on is the parameter $\eta$ . Differentiating the previous formula w.r.t. $\eta$ shows that MAX is a form of gradient ascent on $\mathcal L_{\alpha}^{\mathrm R}$ .

For VAEs, we need to optimize both the encoder and decoder networks, and we are not changing $\eta$ and $\theta$ but the weights of the networks, of which the parameters are functions (say $\eta=\eta(W)$ and $\theta=\theta(W)$ , where $W$ denotes the weights of both networks). To handle this new task, we use the reparameterization trick for differentiable optimization and the REINFORCE gradient, i.e., we write $\partial_{W}f(x, y, W)=f(x, y, W)\cdot \partial_{W}(\log f(x, y, W))$ with $f = \left(\frac{p_{\theta(W)}(x, y)}{q_{\eta(W)}(y|x)}\right)^{1-\alpha}$ and $\partial_{W}$ denotes the derivative with respect to the networks' weights. If further details are needed, we can provide a more comprehensive derivation.

What ties the algorithms from sections 3 and 4, and the VAE optimization process we propose together is the fact that they minimize the alpha-divergence by maximizing the same variational bounds. The procedures that lead to maximizing the bounds, however, are very different.

We hope this clarifies the fact that these two frameworks are different and thus need be treated adequately. Thank you for your consideration.

2024-08-12

Thank you for providing additional details.

I'm well familiar with how training a VAE works and this is not my point of confusion. Moreover, while everything that the authors state in their response is correct (distinction between Bayesian inference and learning for VAEs), this still does not answer my question. I'll restate it once last time: why do you need to first derive the case where $q_\eta$ does not take in an input ${\bf x}$ , i.e. you're not learning an encoder and doing amortized variational inference, in order to derive the training procedure for the VAE? The sentence that I find confusing is

We can exploit this link to derive a VAE training procedure for the unbiased algorithm.

I still don't see how the link is exploited.

Perhaps you can clarify the following: when you write " $q_\eta (\cdot \mid {\bf x}) = q_\eta$ " and " $q_\eta$ does not depend on ${\bf x}$ ", what setting are you examining here? As far as I can tell, this is not what you call "the traditional setting" where you do posterior learning, since you're still learning $\theta$ (or the weights $\theta$ depends on). It is simply an encoder which doesn't take in ${\bf x}$ or some specific input $x_n$ .

As is, I still believe this is a valuable contribution to the NeurIPS community and I'll maintain my score, which a weak accept.

2024-08-13

We appreciate your patience and apologize for misunderstanding your point of confusion, which is absolutely valid. This specific sentence in the paper is indeed wrong, due to a poor formulation of the underlying idea.

Throughout the entire paper, the function $q_{\eta}$ always depends on $x$ and can be written $q_{\eta}(\cdot|x)$ . For simplicity and to reduce notation clutter, we omit this dependency until Section 5.

In Section 5, the function denoted $q_{\eta}(\cdot|x)$ is not the density of a distribution in an exponential family parameterized by $\eta$ . Instead, it could be expressed $\tilde q_{f_{\eta}(x)}(\cdot)$ , where $f_{\eta}$ is a neural network. When we wrote "the parameter $\theta$ is not updated, $q_{\eta}(\cdot|x)$ belongs to an exponential family and $q_{\eta}(\cdot|x)=q_{\eta}$ and does not depend on $x$ ", we meant that we were temporarily reverting to the "traditional VI" setting. The goal there was to explain why MAX performs gradient ascent on the VR bound when we are in the traditional setting and $q_{\eta}(\cdot|x)$ is the density of an exponential distribution.

The exact reasoning that led to the VAE training algorithms is the one described in the comments above. To summarize, we start by noticing that in the setting of Sections 3 and 4, the algorithms MAX and UNB are gradient ascent procedures, respectively on the VR bound and $\mathcal L_{\alpha}^{\mathrm G}$ . Training VAEs to maximize these bounds leads to the methods referred to as VR and UB.

We hope this explanation will satisfactorily address your concerns about this part of the paper. We will delete the problematic sentence and reorganize section 5 so that it better reflects the train of thought behind the VAE training methods we use. Thank you for your thorough review and for helping us ensure the clarity of our paper.

2024-08-14

Thank you for getting back to me. My intention is not to be stubborn, and I'm quite keen to better understand your work.

The content of Section 5 is still not entirely clear to me, but at this point, I believe I need to have another close look at the paper in light of the comments provided by the author, and I'll discuss the matter with the other reviewers. Ideally, I would like to read a revised Section 5. Since this section is fairly short, I invite the authors to rewrite Section 5 to include their clarifying comments. That said, I recognize this is an unusual request and even if the authors did not do this, I'd still consider changing my score after re-reading the paper and our exchange during the rebuttal period.

评论- Section 5 revamp proposition

2024-08-14

Thank you for your interest in our work and for giving us the opportunity to improve it. Below is a draft of the revised Section 5 you asked for. We hope that this revision will clarify the points you raised.

In this section, we explain how to transpose the algorithms presented in Section 4.1 to the training of Variational Auto-Encoders (VAEs) [22]. We start by showing that the exact versions of the biased and unbiased algorithms correspond to gradient ascent procedures on two different variational bounds. Let us first address the case of the biased algorithm. Recall that it writes $\displaystyle{\mu_{t+1}=\mu_t+\gamma_t\left[\frac{U(\mu_t)}{V(\mu_t)} - \mu_t\right]}$ . For $\eta\in E$ and $\mu=\partial A(\eta)$ , we define the Variational Rényi (VR) bound [26] by $\mathcal L_{\alpha}^{\mathrm R}(\mu)=\frac{1}{1-\alpha}\log\mathbb E_{y\sim q_{\eta}}\left[\left(\frac{p(y)}{q_{\eta}(y)}\right)^{1-\alpha}\right]=\frac{1}{1-\alpha}\log V(\mu).$

Noticing that $\partial_{\eta}V(\mu)=\alpha \left(U(\mu) - \mu V(\mu)\right)$ , we can thus express an iteration of the biased algorithm as $\mu_{t+1}=\mu_t+\frac{\gamma_t\cdot\alpha}{1-\alpha}\partial_{\eta}\mathcal L_{\alpha}^{\mathrm R}(\mu_t)=\mu_t+\frac{\gamma_t\cdot\alpha}{1-\alpha}\mathrm{F}(\mu_t)\partial_{\mu}\mathcal L_{\alpha}^{\mathrm R}(\mu_t).$ Under (H1), the Fisher Information Matrix $\mathrm{F}(\mu_t)$ is positive definite, hence the biased algorithm is a gradient ascent procedure on the VR bound.

The unbiased algorithm, which writes $\mu_{t+1}=\mu_t+\gamma_t\left[U(\mu_t) - \mu V(\mu_t)\right]$ , similarly amounts to performing gradient ascent on the variational bound $\mathcal L_{\alpha}^{\mathrm G}$ defined by $\mathcal L_{\alpha}^{\mathrm G}(\mu)=\frac{1}{1-\alpha} \mathbb E_{y\sim q_{\eta}}\left[\left(\frac{p(y)}{q_{\eta}(y)}\right)^{1-\alpha}\right]=\frac{1}{1-\alpha} V(\mu).$

In the context of VAEs [22], we learn both a probabilistic encoder $y\mapsto \tilde q_{f(x;\eta)}(y)$ and a probabilistic decoder $x\mapsto\tilde p_{g(y;\theta)}(x)$ , where $\tilde q_{\eta}$ and $\tilde p_{\theta}$ are densities from families parameterized respectively by $\eta$ and $\theta$ , while $x\mapsto f(x;\eta)$ and $y\mapsto g(y;\theta)$ are neural networks. For simplicity and to align with the usual notation for VAEs, we will denote $\tilde q_{f(x;\eta)}(\cdot)=q_{\eta}(\cdot | x)$ and $\tilde p_{g(y;\theta)}(\cdot)=p_{\theta}(\cdot | y)$ . We will also use the shorthand $\phi=(\eta, \theta)$ .

Since the biased and unbiased algorithms studied in the previous sections minimize the alpha-divergence by maximizing the variational bounds $\mathcal L_{\alpha}^{\mathrm R}$ and $\mathcal L_{\alpha}^{\mathrm G}$ , we propose to train VAEs to maximize those same bounds. In this new setting, they write $\mathcal L_{\alpha}^{\mathrm R}(\eta, \theta, x)=\frac{1}{1-\alpha}\log\mathbb E_{y\sim q_{\eta}(\cdot|x)}\left[\left(\frac{p_{\theta}(x, y)}{q_{\eta}(y | x)}\right)^{1-\alpha}\right],$ $\mathcal L_{\alpha}^{\mathrm G}(\eta, \theta, x)=\frac{1}{1-\alpha}\mathbb E_{y\sim q_{\eta}(\cdot|x)}\left[\left(\frac{p_{\theta}(x, y)}{q_{\eta}(y | x)}\right)^{1-\alpha}\right].$

To update both the encoder and decoder simultaneously, we differentiate them with respect to $\phi$ using the reparameterization trick [22]. If $z\sim r(\cdot)$ and there exists a mapping $v$ such that $v(z; \eta, x)$ has the same distribution as $y$ when $y\sim q_{\eta}(\cdot | x)$ , then we have

$\partial_{\phi}\mathcal L_{\alpha}^{\mathrm R}(\eta, \theta, x)=\mathbb E_{z\sim r(\cdot)}\left[\overline w_{\alpha}(z, \eta, x)\partial_{\phi}\left(\log\frac{p_{\theta}(x, v(z; \eta, x))}{q_{\eta}(v(z; \eta, x) | x)}\right)\right],$

$\partial_{\phi}\mathcal L_{\alpha}^{\mathrm G}(\eta, \theta, x)=\mathbb E_{z\sim r(\cdot)}\left[w_{\alpha}(z, \eta, x)\partial_{\phi}\left(\log\frac{p_{\theta}(x, v(z; \eta, x))}{q_{\eta}(v(z; \eta, x) | x)}\right)\right].$

where $\displaystyle{w_{\alpha}(z, \eta, x)=\left(\frac{p_{\theta}(x, v(z; \eta, x))}{q_{\eta}(v(z; \eta, x)|x)}\right)^{1-\alpha}}$ and $\displaystyle{\overline w_{\alpha}(z, \eta, x)=\frac{w_{\alpha}(z, \eta, x)}{\int_{\mathsf Y} {w}_{\alpha}(z', \eta, x)r(z')\nu(\mathrm d z')}}$ .

To train VAEs, we simply plug batch estimates of these gradients into an optimizer like Adam. Notably, $\partial_{\phi}\mathcal L_{\alpha}^{\mathrm G}$ can be estimated unbiasedly, while estimators of $\partial_{\phi}\mathcal L_{\alpha}^{\mathrm R}$ are subject to bias. We will study the practical implications of this fact in Section 6.

审稿意见

评分: 6置信度: 42024-07-11

This paper proposes the asymptotic analysis for both exact and empirical alpha-divergence minimization algorithms, especially in the case of infinite number of iterations. The paper mainly focuses on the exponential family setting, and provides geometric convergence analysis of exact minimization algorithms using fixed point theories. To bypass the difficulty on studying the empirical minimization algorithms due to the bias of previous algorithms, a novel unbiased algorithm is proposed and analyzed, including almost sure convergence to a local minimizer and a law of the iterated logarithm. Finally, the paper experiments on toy gaussians and on variational auto-encoders to show the effectiveness of the proposed algorithm.

优点

The paper is clearly organized, including both exact and empirical analysis.
The paper evaluates the asymptotic properties as the number of iterations goes to infinity, which are seldom discussed in previous works.
Extensive discussions on the convergence of empirical alpha-divergence minimization algorithms are provided in the paper.

缺点

The novelty of the paper appears to be limited. The basis of the theoretical analysis is mostly covered by [1], and the main theoretical analysis focuses only on exponential families.
Some assumptions in the theoretical analysis could be further evaluated.
- For Assumption (H3) in section 3, it remains unclear when the mapping $\mathcal{M}_{\gamma}$ is a contraction.
- It would be beneficial to further verify this for $q_{\eta}$ and $p$ in the case of the exponential family.
- For Assumption (C2) in section 4, the paper states that for “ $\alpha \in (0, \frac{1}{2})$ suppose specific behaviors on the relative tails”. It would be illuminating to verify this assumption in experiments, especially in VAE experiments, where the choice of $\alpha$ can make a difference to the empirical results.
The experiments of this paper could be extended. It would be nice to add experiments like Bayesian logistic regression as in [1]. Furthermore, the empirical results lack significance. In the toy gaussian experiments, the proposed UNB method achieves more accurate but slower convergence results compared to the biased MAX method, and is surpassed by the NAT method with proper step size. In VAE experiments, the UB approach is not significantly better than the VR approach in some cases.
The relation between the unbiasedness and the computational intensiveness of the proposed algorithm could be further justified. The bias issue is not analyzed theoretically because “biased gradient estimators hinder any theoretical study”, and there seems to be a trade-off between unbiasedness and computational intensiveness regarding the comparison of MAX and UNB methods in toy experiments.

[1] K. Daudel, R. Douc, and F. Roueff. Monotonic Alpha-Divergence Minimisation for Variational Inference. Journal of Machine Learning Research, 24(62):1–76, 2023.

问题

The paper asserts that “biased gradient estimators hinder any theoretical study”. Is it possible to compare the biased and unbiased algorithms on a relatively simple example like gaussian distributions or two-mixture of gaussians?
If assumption (H3) is not satisfied, do we still have geometric convergence? If not, what would be the convergence rate?
In VAE experiments, different $\alpha s$ result in different empirical results for both VR and UB methods. Moreover, UB achieved relatively worse results for $\alpha \in (\frac{1}{2}, 1)$ , which is the interval where assumption C2 could be satisfied. Could the authors provide a heuristic understanding of this phenomenon?
In line 208 of section 4, the paper states “on top of that, we lose the monotonicity property”. The monotonicity property is essential for the geometric convergence in the exact alpha-divergence minimization algorithm analysis. Does it also contribute to the empirical algorithm analysis? Is it beneficial to solving the instability problems for large step sizes?

局限性

The paper states the limitation of only considering the exponential family. Still, there are additional limitations when focusing on this setting. Please refer to the second point of weaknesses.
Although the paper claims to focus mostly on asymptotic theory, the quality of experiments could still improve to better support the theoretical analysis. Please refer to the third point of weaknesses.

作者回复

2024-08-03

Thank you for your thorough and constructive review of our paper. We greatly appreciate the time and effort you have dedicated to providing feedback. We are committed to addressing the concerns raised and improving our work. Below, we respond to each of your points in detail.

The basis of the theoretical analysis is mostly covered by [1].

While our work indeed builds on the algorithm proposed in [1], which had already established the monotonicity property and explicit update form for the exponential family, our contributions lie in proving the algorithm’s convergence at an asymptotically geometric rate and towards a minimizer of the alpha-divergence. We further consider this algorithm in the empirical setting and propose an unbiased version, for which we provide two convergence theorems.

For Assumption (C2), the paper states that “ $\alpha \in (0, ½)$ supposes specific behaviors on the relative tails”. It would be illuminating to verify this assumption in experiments.

We have found that convergence can still occur even if assumption (C2) is not met. For example, when $p$ is the density of a Cauchy distribution and the variational family is Gaussian, we still observe convergence (Appendix C). So far, we have not encountered a simple case where convergence entirely fails. This suggests that assumption (C2) could be relaxed or even replaced in future studies.

It would be nice to add experiments like Bayesian logistic regression.

Due to space constraints, we had to make choices regarding which experiments to present in the paper. We opted for the toy Gaussian examples as we found them to be more insightful than Bayesian logistic regression or other classical applications of variational inference algorithms.

The empirical results lack significance.

We believe that the main takeaway from the toy gaussian experiments is that the MAX approach is highly robust against aggressive hyperparameter tuning strategies, and generally converges faster than other methods with theoretical guarantees.

Although the NAT method performs well when it converges, it has shown occasional instability and is costlier per-iteration.

In VAE experiments, the UB approach is not significantly better than the VR approach in some cases.

It is true that the UB approach does not always show a significant advantage over the VR approach in our VAE experiments. However, we believe that this study offers useful insights. As you pointed out, the choice of $\alpha$ makes a difference in the empirical results, which is quite exciting as the idea behind the use of alpha-divergences in variational inference is to overcome some limitations of the traditionally used KL divergence which can be problematic on certain datasets. We observe that a proper tuning of $\alpha$ can greatly improve the model’s performance, though the optimal value seems to depend on both the gradient estimator and the dataset. A thorough analysis of the phenomena at play is beyond the scope of this paper, but is certainly a compelling question for future research.

The bias issue is not analyzed theoretically because “biased gradient estimators hinder any theoretical study”.

Analyzing the MAX approach theoretically presents two main difficulties. First, if it converges, it converges to a minimizer of an approximation of the alpha-divergence, rather than the true alpha-divergence itself. Second, this approximation is hard to characterize, since it amounts to finding and analyzing a function whose gradient is $\widehat U / \widehat V$ .

Is it possible to compare the biased and unbiased algorithms on a relatively simple example like gaussian distributions or two-mixture of gaussians?

In the experiments section, both the biased and unbiased algorithms are compared in the case of a Gaussian variational family with Gaussian, mixture of Gaussian and Cauchy targets (see also Appendix C). The biased algorithm is referred to as MAX, while the unbiased algorithm is UNB.

If assumption (H3) is not satisfied, do we still have geometric convergence? If not, what would be the convergence rate?

We need assumption (H3) to obtain the geometric rate of convergence. Lemma 2 (found at the beginning of Appendix A.1) establishes that, as long as the sequence $(\eta_t)$ remains bounded, at least one fixed point of the mapping $\mathcal M_{\gamma}$ will be a limit point of this sequence. However, evaluating whether this limit point is the definitive limit and determining the convergence rate without (H3) is beyond the scope of this study.

UB achieved relatively worse results for $\alpha \in (½, 1)$ , which is the interval where assumption C2 could be satisfied. Could the authors provide a heuristic understanding of this phenomenon?

We believe that the phenomenon at play is quite complex and depends mostly on the dataset, since opposite behaviors are observed between CIFAR10 and CelebA. Moreover, as we already discussed, assumption (C2) might be subject to relaxation, thus it may not fully explain the deeper reasons behind the observed convergence behavior.

Does [the monotonicity property] contribute to the empirical algorithm analysis? Is it beneficial to solving the instability problems for large step sizes?

Empirically, we can not guarantee the monotonicity property unless an accept-reject step is integrated into the procedure. The main idea behind the analysis of the sample-based algorithm is to use the mean parameterization of the exponential family to write it as a Robbins-Monro procedure, hence we do not leverage the monotonicity property seen in the exact setting. However, the fact that the exact versions of MAX and UNB enjoy this property may explain why these two approaches appear to be quite stable and follow direct paths to minimizers. This behavior is illustrated in Figure 1, and similarly observed in the mixture of Gaussian and Cauchy cases. We can include additional figures in Appendix C to further exemplify this matter.

审稿意见

评分: 7置信度: 32024-07-13

This paper explores alpha-divergence Variational Inference (VI) from a theoretical perspective, and in particular the monotonic alpha-divergence minimization algorithm. It includes an asymptotic analysis of the algorithm applicable to exponential families, establishing conditions that ensure convergence to a local minimizer at a geometric rate. The theoretical examination of the sample-based counterpart of the algorithm leads to a modified unbiased version, for which both almost sure convergence and a law of the iterated logarithm are provided. Experimental validation using synthetic and real-world datasets illustrates and supports the theoretical findings of the paper.

优点

This is a relevant analysis of alpha-divergence VI algorithms, delivering an in-depth study of their behavior. It is well-written, with technically intricate proofs that are articulated with clarity. Overall, it is a very nice contribution to the field.

缺点

I mainly have minor comments:

Some assumptions would benefit from more detailed discussion. Specifically, I am unsure why conditions (H4) and (C0') are considered realistic and sensible.
The impact of the number of samples K used in the sample-based algorithm is underexplored: what influence does it have? Additionally, is it important to maintain the same number of samples as the number of iterations increases?

问题

局限性

作者回复

2024-08-03

Thank you for your thorough review and encouraging feedback on our submission. We provide detailed responses to each of your questions below, hoping to clarify any uncertainties you may have had reading our paper.

I am unsure why conditions (H4) and (C0') are considered realistic and sensible.

The rationale behind assumption (H4) is that as we approach the optimal solution, we can afford to be progressively less conservative in our choice of $\gamma$ , since $\mathcal R(\eta_t)$ should always remain sufficiently close to the set $F$ , or even belong to that set. More specifically, one can show that (H4) holds if the set $\mathsf K_{\mathcal L_{\alpha}(\eta_0)}$ is compact, meaning that the set of $\eta$ 's that verify $\mathcal L_{\alpha}(\eta) \leq L_{\alpha}(\eta_0)$ is bounded. Consequently, each decrease in the alpha-divergence allows for an increase in the highest permissible value of $\gamma$ . We are open to including a concise discussion on this matter in the Appendix.

Regarding (C0’), it is essentially a stricter version of (C0) and involves a design choice left to the practitioner's discretion. It's important to note that (C0) and (H4) are incompatible, which partly explains why we do not get a geometric convergence rate in empirical scenarios.

What influence does the number of samples $K$ have in the sample-based algorithm?

Increasing the number of samples $K$ reduces the bias of the estimator used in the first sample-based algorithm (MAX). Notably, the estimator $\widehat U / \widehat V$ becomes asymptotically unbiased, meaning that as $K \to +\infty$ , it converges to $\mathcal R$ . This is illustrated in the toy Gaussian experiment, where a small sample size ( $K = 10$ ) causes the MAX algorithm to converge to suboptimal parameters with higher alpha-divergence compared to unbiased algorithms. To highlight the impact of sample size on bias, we can include additional plots in Appendix C. Aside from this, a smaller sample size accelerates per-iteration computation but results in noisier estimates of the intractable integrals. This creates a tradeoff, a detailed exploration of which is beyond this paper's scope.

Is it important to maintain the same number of samples as the number of iterations increases?

Theoretically, maintaining the same number of samples across all iterations is useful to satisfy condition (iv) in the proof of Theorem 3, as it guarantees the continuity of the function $\mu \mapsto \mathrm{Cov}(G(\mu))$ . However, in practice, one might opt to use fewer samples in the early iterations and increase the number later on to improve the accuracy of the final outcome.

Thank you again for your attentive review and encouraging feedback on our submission, we hope our responses have helped resolve any ambiguities.

作者回复

2024-08-03

We deeply thank the reviewers for their careful and detailed reviews of our manuscript. We are grateful for their constructive feedback and for offering us an opportunity to improve our work. Below, we provide responses to some points that have been raised by multiple reviewers. We hope that this discussion will satisfactorily address any concerns the reviewers may have.

(Reviewer pDqJ) The main theoretical analysis focuses only on exponential families.

(Reviewer FQDG) It only considers the variational family to be exponential models.

We chose to focus on exponential families as they offer a convenient setting and enjoy interesting theoretical properties. First, they ensure the existence of a unique solution to the argmax problem in (3), given that $\gamma$ is chosen appropriately (this is not a restrictive assumption). They also allow us to state (H3) in an understandable way, rather than through obscure integral conditions. Finally, we believe that the principles established in this restricted setting offer insights and intuition about the problem at hand, potentially serving as a foundation for future analyses in more general contexts.

(Reviewer pDqJ) For Assumption (H3) in section 3, it remains unclear when the mapping $\mathcal M_{\gamma}$ is a contraction. It would be beneficial to further verify this for $q_{\eta}$ and $p$ in the case of the exponential family.

(Reviewer FQDG) The convergence is proved for the proposed algorithm in the asymptotic sense.

In Proposition 1, we explore the contraction property of the mapping $\mathcal M_{\gamma}$ . The analysis reveals that, under assumption (H3), this mapping is a contraction in a neighborhood of the parameter $\eta_*$ . It can be noticed that when $\mathcal Q$ is an exponential family as in (H1) and the target $p$ belongs to $\mathcal Q$ , i.e. $p=q_{\eta_*}$ , assumption (H3) is always verified. Indeed, we can show that in this setting, the only fixed point (c.f. Lemma 1) is the target parameter $\eta_*$ , and that we have $\rho_* = \alpha < 1$ .

Determining the size of the neighborhood in which $\mathcal M_{\gamma}$ is a contraction is challenging in the general case. However, in the simpler case where $Q$ is a real Gaussian family with fixed variance and $p$ is Gaussian (not necessarily with the same variance), this condition is verified for any mean in $\mathbb{R}$ .

Again, we thank all reviewers for their comprehensive and constructive feedback. We hope that our responses have adequately addressed your questions and concerns, and that our work has been significantly improved as a result.

最终决定Accept (poster)

2024-09-25

This paper provides convergence results for two algorithms for minimizing the alpha-divergence between an exponential family and a target distribution. While reviewers were overall positive and I think this paper has some worthwhile ideas, the paper also has some weaknesses.

The first algorithm analyzed (essentially that in Equation 5) is an algorithm that was previously proposed and proven to decrease monotonically. The algorithm is also in practice unimplementable, as it relies on expectations that are not available for practical target distributions. The contribution of this paper is to give a rate of convergence (theorem 1).

The second algorithm analyzed (essentially that in equation 10) seems to be novel. (And may be valuable). The analysis for this algorithm (theorems 2 and 3) give something similar to a \sqrt{\log(t)/t} rate. This algorithm is implementable. Convergence is proven to some unknown parameter vector μ_* which is in a set of critical points of the alpha divergence. This weakness is exhibited in the fact that the results in Theorem 2 and Theorem 3 make no reference to the minibatch size K, despite the fact that we hope that algorithm to become more accurate when K is larger. The experimental evidence for the algorithm is quite limited.

A weakness of the paper is the form of the assumptions being made. Ultimately, the only sources of variability in the problem are the exponential family chosen, the target distribution, and algorithmic choices (e.g. step sizes). So ultimately, any assumption must be about one of these things. But the form in which the assumptions are given makes it extremely difficult to understand for which target distributions the algorithms are claimed to work. For example, where does the assumption on the target distribution lie in Theorem 1? I asked some reviewers, but could not obtain an answer. I believe the answer is that H3 makes reference to φ^α_η which is defined in equation 2 to be a geometric mixture of the variational family and the target distribution. But what does this mean? For what distributions would this be satisfied? This is not discussed.

Another weakness of the paper is a rather "severe" use of notation, with various symbols seemingly either undefined or at least requiring the reader to hunt through the previous pages to find the definition. If the paper is technical fine, but it would be a great service to the reader in the main assumptions and results to make this clear in the statement, e.g. add a few lines of setup to the theorem clarifying what algorithm is being used, and the previous lines where symbols are defined. Sometimes the notation is quite obscure, e.g. in Eq. 13, Cov_η is covariance with respect to the variational family defined by parameters η, whereas Cov_{φ^α_η} is a covariance with respect the distribution φ^α_η.

In conclusion, while algorithms for alpha divergences are of fundamental interest and this paper contributes valuable new ideas, I believe that the relative lack of clarity of the paper made some of its weaknesses obscure to the readers. After consultation with reviewers and the SAC, I am recommending that the paper be accepted, but I urge that the paper be revised to increase clarity to maximize the potential impact.

Asymptotics of Alpha-Divergence Variational Inference Algorithms with Exponential Families

摘要

评审与讨论

优点

缺点

问题

局限性

优点

缺点

问题

局限性

优点

缺点

问题

局限性

优点

缺点

问题

局限性

优点

缺点

问题

局限性