PaperHub
5.6 / 10
Poster · 5 reviewers
Scores: 6, 7, 3, 6, 6 (min 3, max 7, std 1.4)
Confidence: 3.4
Correctness: 2.8 · Contribution: 2.6 · Presentation: 3.0
NeurIPS 2024

Training Binary Neural Networks via Gaussian Variational Inference and Low-Rank Semidefinite Programming

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-12

Abstract

Keywords
Binarized Neural Networks; Variational Inference; Semi-definite Programming;

Reviews and Discussion

Review
Rating: 6

The paper is concerned with variational inference in binary neural networks, i.e., neural networks whose weights are quantized to either -1 or +1, with the aim of improving the overall performance of binary neural networks by performing Bayesian model averaging. For this, the authors review prior work on the topic and discuss the relaxations that are typically applied to make tractable the optimization problem of finding a variational approximation to the posterior over the binary weights. To approach the problem, the authors show that under the assumption of Gaussian variational inference, i.e., when the continuous relaxation of the discrete weights is assumed to be Gaussian distributed, the objective is a non-linear semidefinite program and hence non-convex. Due to the non-convexity, the program is solved iteratively using SGD until a local optimum is reached. The proposed approach elegantly connects the setting to hyperplane rounding and uses a low-rank approximation of the covariance of the variational family. The authors performed an extensive comparison against existing works on VGG and ResNet architectures for common benchmark tasks (CIFAR-10/100 and (Tiny-)ImageNet) based on top-1/top-5 accuracy. The results look generally promising.

Strengths

I will provide a general statement of the paper's strengths below.

Uncertainty quantification in binary neural networks seems like a relevant and interesting direction and I appreciate the connection of non-convex optimization with the field. The technical details are well explained in most parts and the method seems interesting and promising. Moreover, I appreciate the thorough empirical evaluation, even though it is limited to a single metric, and the additional ablation on the rank of the deviation matrix.

Weaknesses

I will give a general review of the paper's weaknesses below and provide detailed questions in the questions section.

In parts, the paper is difficult to follow, as notation is sometimes not introduced or details are not explained enough. For example, on page 4 above line 151, capital L should be a curly capital L from my understanding, and it was not immediately clear to me why the joint constraint in line 162 is important. Moreover, the contributions of the paper seem rather limited in the context of related work, as the resulting method seems to resemble variational inference with a low-rank Gaussian approximate family using the reparameterization trick and natural gradient updates. It is possible that I misunderstood the updates in Alg. 1, as I could not find derivations of them in the appendix or main text. The results look promising but are limited to a single metric, accuracy, which might not accurately reflect the performance of each method. Additional common metrics, such as the negative log predictive density and the calibration error, are missing at the current stage.

Questions

  1. Could the authors elaborate on connections of their methods to natural gradient-based optimization in variational inference, as done in classical and recent literature such as [Honkela] or [Shen]?
  2. The authors model correlations between all units in the network (line 182) through a low-rank approximation and need to resort to a very low-rank approximation of rank smaller than or equal to ten. Often a block-diagonal matrix assumption is made (assuming layers are independent). Did the authors experiment with a block-diagonal structure in which blocks are low-rank? If so, how does such an approach compare? It seems that a higher rank for each block would be feasible in this setting.
  3. After reading up about hyperplane rounding, I can see the connection to using a sign function. However, the authors only mention this without elaborating on the connection. I would recommend making those connections more explicit and elaborating on them.
  4. A critical aspect of the method seems to be that it relies on a low-rank approximation; however, choosing the rank might be non-trivial. Moreover, as the ablation in Fig. 1 (note that this should be a table) indicates, it is not clear that increasing the rank necessarily improves the results. How is the rank chosen in the experiments, and how do the authors propose to choose the appropriate rank?

References: [Honkela] Honkela et al. (2008), Natural Conjugate Gradient in Variational Inference. [Shen] Shen et al. (2024), Variational Learning is Effective for Large Deep Networks.

Limitations

The authors highlight some open problems and future research directions. However, the limitations section could be further improved to detail the specific technical challenges associated with the approach.

Author Response

We thank the reviewer for their assessment and questions. We respond to each point inline below. We encourage the reviewer to ask any follow-up question and to comment on our responses. We hope that they will consider raising their score at the end of our exchange.

The results look promising but are limited to a single metric, accuracy, which might not accurately reflect the performance of each method. Additional common metrics, such as the negative log predictive density and the calibration error, are missing at the current stage.

We did not provide these metrics because they are not available for our competitors. Here we provide a selection of results for our method using the negative log predictive density (NLPD) and expected calibration error (ECE) metrics.

For the 1W1A setting on the CIFAR-10 dataset:

| Architecture | NLPD | ECE |
| --- | --- | --- |
| ResNet18 | 0.60 | 0.074 |
| VGG-Small | 0.71 | 0.086 |

For the 1W32A setting on the CIFAR-10, CIFAR-100 and Tiny-ImageNet datasets:

| | CIFAR-10 VGG16 | CIFAR-10 ResNet18 | CIFAR-100 VGG16 | CIFAR-100 ResNet18 | Tiny-ImageNet ResNet18 |
| --- | --- | --- | --- | --- | --- |
| NLPD | 0.49 | 0.25 | 1.59 | 1.12 | 2.13 |
| ECE | 0.19 | 0.035 | 0.13 | 0.078 | 0.13 |

For the ImageNet dataset:

| | AlexNet 1W1A | AlexNet 1W32A | ResNet18 1W1A | ResNet18 1W32A |
| --- | --- | --- | --- | --- |
| NLPD | 22.82 | 3.00 | 6.91 | 6.91 |
| ECE | 0.88 | 1.00 | 4.50 | 4.44 |

Could the authors elaborate on connections of their methods to natural gradient-based optimization in variational inference, as done in classical and recent literature such as [Honkela] or [Shen]?

We thank the reviewer for pointing out these references. Our approach is comparable to that of Shen in the setup of the variational learning problem and the usage of Gaussian variational distributions. However, in contrast with Shen, our method works over low-rank covariances, while Shen adopts diagonal covariance matrices. Finally, to the best of our understanding, the natural-gradient methods of Honkela and Shen are based on considering the Riemannian gradient with respect to the Fisher metric, while our method implicitly uses the Riemannian gradient over the space of low-rank Gaussian distributions defined via the embedding $\Sigma = Z Z^T$. This has the advantage of enforcing the low-rank constraint and is the basis for the Burer-Monteiro approach to semidefinite programming. We believe that this key ingredient, together with the connection with Grothendieck's inequality and hyperplane rounding, is completely absent from the previous work cited by the reviewer. This is to be expected as that work does not have to deal with the training of binary neural networks and is not concerned with rounding from real to binary weights. In conclusion, we cannot agree with the reviewer's assessment regarding the limited novelty of our method and reassert its significant departure from previous work.
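For concreteness, the low-rank parameterization underlying this discussion can be restated as follows (a restatement in the paper's notation; the sampling form $w = \mu + Zr$ is the standard reparameterization for a Gaussian with covariance $\Sigma = ZZ^T$):

$$w = \mu + Z r, \quad r \sim \mathcal{N}(0, I_K) \;\Longrightarrow\; w \sim \mathcal{N}(\mu, Z Z^\top), \qquad \mathbb{E}_{w \sim \mathcal{N}(\mu, ZZ^\top)}\big[L(f(x,w),y_x)\big] = \mathbb{E}_{r \sim \mathcal{N}(0,I_K)}\big[L(f(x,\mu+Zr),y_x)\big],$$

so gradients with respect to $(\mu, Z)$ can be taken directly through the sample $\mu + Zr$, which is what keeps the low-rank structure enforced throughout training.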

The authors model correlations between all units in the network (line 182) through a low-rank approximation and need to resort to a very low-rank approximation of rank smaller than or equal to ten. Often a block-diagonal matrix assumption is made (assuming layers are independent). Did the authors experiment with a block-diagonal structure in which blocks are low-rank? If so, how does such an approach compare? It seems that a higher rank for each block would be feasible in this setting.

We concur with the reviewer on the promise of using a block low-rank structure, a direction we have already been exploring. In our first presentation of the method we opted to model all correlations as we were interested in studying the maximum possible advantage we can obtain via a Gaussian approximation. Now that we have established the existence of this advantage, our future work will focus on how to reap it without incurring the computational costs, particularly in terms of memory, that VISPA suffers. We will include this discussion in Section 5.

After reading up about hyperplane rounding, I can see the connection to using a sign function. However, the authors only mention this without elaborating on the connection. I would recommend making those connections more explicit and elaborating on them.

While this connection is well-known in the theoretical computer science community, we agree that it deserves more elaboration for a more general audience. Unfortunately, we were strongly constrained by space limitations in the detail we could provide in Section 3.1, as the geometric view of hyperplane rounding requires introducing the vector embedding view of the underlying SDP. If the reviewer deems it satisfactory, we would be happy to include a separate section detailing the geometric derivation of hyperplane rounding in the Supplementary Material.

A critical aspect of the method seems to be that it relies on a low-rank approximation; however, choosing the rank might be non-trivial. Moreover, as the ablation in Fig. 1 (note that this should be a table) indicates, it is not clear that increasing the rank necessarily improves the results. How is the rank chosen in the experiments, and how do the authors propose to choose the appropriate rank?

See FAQ 2 in the author rebuttal. Most results presented use $K=4$, as this was experimentally shown to be a good compromise between performance and computational cost. More generally, for a robust result, we would recommend averaging solutions from different choices of rank.

Comment

I want to thank the authors for their rebuttal and addressing my questions and concerns. I have followed the discussion with other reviewers. In consideration of the other reviews and assessments I will keep my positive score and slightly increase to a weak accept.

Comment

Thank you for your thoughtful consideration of our rebuttal. We appreciate your decision to maintain a positive score and are grateful for the slight increase to a weak accept.

Review
Rating: 7

The paper proposes a Gaussian variational inference approach to training binary neural networks. Unlike traditional VI methods, the proposed VISPA algorithm is motivated by an SDP relaxation of the binary neural network objective. In experiments, VISPA gives state-of-the-art results for training binary neural networks on CIFAR-10, CIFAR-100 and TinyImageNet benchmarks.

Strengths

  • The paper is well-written and the overall problem formulation was clearly motivated.
  • The proposal to use Gaussian variational inference for directly training binary neural networks appears to be novel.
  • The experimental results are convincing, though larger-scale experiments (for example, a ResNet-50 instead of AlexNet on ImageNet, or transformers) would make the paper stronger.

Weaknesses

  • The non-diagonal low-rank representation of the covariance adds a significant overhead. It is unclear whether the algorithm can remain scalable. From Figure 1 (Impact of K on model accuracy) it remains unclear whether setting K > 1 is really worth the additional computational overhead.
  • The VISPA algorithm is in some sense "heuristically" derived, and it is unclear whether for example stationary points of the method are critical points of the original training objective.
  • I appreciate the connection to variational inference, but in the objective function there is no entropy or prior present, so it is unclear whether this algorithm actually has any connections to Bayesian inference.

Questions

  • Can lines 11-13 in VISPA algorithm be viewed as an orthogonal projection onto the constraint? It would be interesting to view VISPA as a projected stochastic gradient method on the problem P_corr, which would help in proving convergence and making the algorithm more theoretically sound.

  • Even for K=1, the method seems to perform better than other variational baselines like BayesBiNN. Is there any intuitive explanation as to why the algorithm works so much better in practice?

Limitations

All limitations are addressed.

Author Response

We thank the reviewer for their assessment and questions. We respond to each point inline below. We encourage the reviewer to ask any follow-up question and to comment on our responses.

The non-diagonal low-rank representation of the covariance adds a significant overhead. It is unclear whether the algorithm can remain scalable. From Figure 1 (Impact of K on model accuracy) it remains unclear whether setting K > 1 is really worth the additional computational overhead.

The computational overhead is mostly restricted to the usage of memory. If $n$ is the number of weights, our memory for storing relevant variables is at most $(K+1)n$, compared to $n$ for other methods. The total significance of this increase depends on the batch size and the size of the examples. For instance, in the case of ResNet18, there are roughly $9 \cdot 10^6$ weights while, for batch size 100, the total size of a data batch is $15 \cdot 10^6$, for a total memory usage of $24 \cdot 10^6$. Hence, choosing $K=1$ will lead to a usage of $33 \cdot 10^6$, a 37.5% increase. This increase will typically get smaller as we consider models trained on larger images. In practice, we often observe even smaller increases, as our estimate does not include memory usage due to backpropagation, which can be very large, e.g., in the case of residual connections.
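As a back-of-the-envelope check of the figures above, a hypothetical helper along these lines reproduces the 37.5% estimate (the weight and batch counts are the approximate values quoted, not exact model statistics):

```python
# Rough memory estimate, counting stored float values only
# (mu, the n-by-K deviation matrix Z, and one data batch).
def relative_memory(n_weights, batch_values, K):
    baseline = n_weights + batch_values            # mu + data batch
    vispa = (K + 1) * n_weights + batch_values     # mu and Z + data batch
    return vispa / baseline

# ResNet18-scale example: ~9e6 weights, ~15e6 values per batch of 100 images.
print(relative_memory(9e6, 15e6, K=1))  # ~1.375, i.e. a ~37.5% increase
```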

The effect on the runtime is negligible because most of the runtime is spent in forward and back-propagation anyway. In conclusion, our method is most suitable in situations where memory is not strictly constrained or can be augmented at low cost. We plan to include this discussion in the final version of the paper.

The VISPA algorithm is in some sense "heuristically" derived, and it is unclear whether for example stationary points of the method are critical points of the original training objective.

We agree, but note that the focus of this paper is to provide a mathematically motivated algorithm that performs competitively in practice. We believe it is eminently possible to obtain the suggested result for a sufficiently small step size, under the assumption that the loss is smooth, by adapting existing results in manifold optimization. Indeed, we can think of our algorithm in the Burer-Monteiro formulation as minimizing the loss function over the manifold given by the constraints $\mu_i^2 + (ZZ^T)_{ii} = \mu_i^2 + \sum_j Z_{ij}^2 = 1$ for all $i$, which is a product of spheres, a well-studied case [1]. As suggested by the reviewer, we can view step 11 of our algorithm as a projection step onto the manifold, which effectively implements a retraction from the tangent space back to the manifold. To make the convergence argument complete, it only remains to analyze the effect of minibatch sampling of the loss on the gradient estimates used by the algorithm. Again, by taking the step size sufficiently small, it should be possible to make the contribution of the variance negligible and guarantee that the algorithm obeys a descent lemma and finds a near-stationary point.

[1] Boumal, N. (2023). An introduction to optimization on smooth manifolds.
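For illustration, a minimal sketch of the projection/retraction step described above (our own variable names, not the exact code of Algorithm 1): each row $(\mu_i, Z_i)$ is simply rescaled to unit norm, which is the Euclidean projection onto the product of spheres.

```python
import numpy as np

def project_to_sphere_product(mu, Z, eps=1e-12):
    """Rescale each row (mu_i, Z_i) so that mu_i^2 + sum_j Z_ij^2 = 1,
    i.e., the Euclidean projection onto the product of unit spheres."""
    norms = np.sqrt(mu ** 2 + np.sum(Z ** 2, axis=1)) + eps
    return mu / norms, Z / norms[:, None]

# Example with n = 5 weights and rank K = 2.
mu = np.random.randn(5)
Z = 0.1 * np.random.randn(5, 2)
mu, Z = project_to_sphere_product(mu, Z)
print(mu ** 2 + np.sum(Z ** 2, axis=1))  # all entries ~1
```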

I appreciate the connection to variational inference, but in the objective function there is no entropy or prior present, so it is unclear whether this algorithm actually has any connections to Bayesian inference.

Please see FAQ 3 in the author rebuttal. We will add the entropy term in the final version and justify why we let it vanish for our final algorithm.

Can lines 11-13 in VISPA algorithm be viewed as an orthogonal projection onto the constraint? It would be interesting to view VISPA as a projected stochastic gradient method on the problem P_corr, which would help in proving convergence and making the algorithm more theoretically sound.

Yes. As explained above, Step 11 of our algorithm indeed performs a projection step over the manifold of interest. We would be happy to add this explanation in the final version.

Even for K=1, the method seems to perform better than other variational baselines like BayesBiNN. Is there any intuitive explanation as to why the algorithm works so much better in practice?

The improved behavior for $K=1$ is most likely a benefit of the randomization built into our rounding argument, even for $K=1$. While other methods deterministically round based on the current $\mu$-value, our algorithm for $K=1$ effectively adds noise to $\mu$ via the $Zr$ term before rounding, allowing us to better explore the space of possible assignments.
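For illustration, a minimal sketch of this randomized rounding (hypothetical names; not our exact inference code):

```python
import numpy as np

def sample_binary_weights(mu, Z, rng):
    """Hyperplane-style rounding: perturb mu by Z r with r ~ N(0, I_K),
    then take signs; for K = 1 this is sign(mu + z * r)."""
    r = rng.standard_normal(Z.shape[1])
    return np.where(mu + Z @ r >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
mu = np.array([0.3, -0.1, 0.8])
Z = 0.5 * rng.standard_normal((3, 1))   # K = 1
w = sample_binary_weights(mu, Z, rng)   # entries in {-1, +1}
```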

Review
Rating: 3

This paper aims to propose a new method for training binary neural networks. The method, according to the authors' claims, is based on variational inference and semidefinite programming. During training, the method maintains a low-rank Gaussian distribution from which the neural network parameters are drawn. The training objective is the expected loss over the Gaussian distribution. At test time, the method obtains a binary neural network by rounding a random sample drawn from the Gaussian distribution. Experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet demonstrate the competitiveness of the proposed method.

Strengths

  • The accuracies reported by the authors are competitive with the baselines.
  • The experiments are quite extensive, covering a wide range of datasets.

Weaknesses

  • Respectfully, this paper has little to do with either variational inference or semidefinite programming, contrary to what the title, the abstract, the introduction, and the related work suggest. Thus, a large fraction of the paper should be revised.

    1. Just because you minimize the objective in Eq. (2) over a Gaussian family does not imply that this is variational inference. If the authors disagree, please explain the posterior that the objective aims to approximate. In fact, minimizing Eq. (2) over a Gaussian family will collapse to a degenerate Gaussian with zero covariance. That is exactly why the authors have to add a moment constraint on the mean vector and the covariance matrix to prevent the collapse.
    2. Just because you have a positive semi-definite constraint on the covariance matrix does not imply that this is a semi-definite program. This formulation is not even remotely close to semi-definite programs. Moreover, the algorithm proposed by the authors is not even remotely close to any prominent techniques in semi-definite programming. The algorithm is basically stochastic gradient descent with Polyak's momentum and a projection step to an Euclidean sphere.
  • The memory consumption of this method is not reported, and I suspect that this method is way more memory-intensive than the baselines. Note that storing a rank-$K$ covariance matrix increases the memory by a factor of $K + 1$ (storing the mean $\mu$ and a low-rank matrix $Z$ with $K$ columns). The main results are reported using $K = 8, 4, 2$ depending on the datasets.

Questions

NA.

Limitations

NA.

Author Response

We thank the reviewer and regret to find that the connections to variational inference and semidefinite programming were not clear to them. We believe that, in the attempt to streamline the presentation of the mathematical component of our method, we may have omitted some steps that clarify these connections. A summary of our clarifications is given in FAQ 3 and FAQ 4 of the author rebuttal. Below, we present more details of our clarifications, together with their proposed location in the final version of the paper. Lastly, we note that the closely related works by Ajanthan et al. [1] and Meng et al. [2] (see Section 3.1) also rely on the same connection to variational inference. More recently, the paper by Shen et al. [3] (pointed out by Reviewer 6N2m) describes a similar framework to ours for the (non-binarized) training of LLMs under the moniker of variational learning, a variant of variational inference. We welcome further questions and suggestions that may improve the clarity of our work.

CONNECTION TO VARIATIONAL INFERENCE

ENTROPY REGULARIZATION OF PROBLEM 2 (Top of Section 3)

Motivated by stability and generalization concerns, we may consider an entropy regularized version of Problem 2 given by

$$OPT(\theta) := \min_{p \in \mathcal{P}} \; \theta \cdot \mathbb{E}_{w \sim p, x}[L(f(x, w), y_x)] - H(p)$$

where $H(p)$ is the entropy of the distribution $p$. This can be seen to be the log-partition function of a probabilistic graphical model (PGM), albeit one with a complicated $n$-way sufficient statistic given by the loss function $L$, and the corresponding optimizer is the Gibbs distribution $p_\theta$.

Following Wainwright and Jordan [4 -Chapter 3.7 and Chapter 5.1], the task of approximating the log-partition function of a model is exactly where we may deploy the toolkit of variational inference.
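For completeness, the standard identity behind this statement (the Gibbs variational principle, restated in the notation above with $\bar L(w) := \mathbb{E}_x[L(f(x,w),y_x)]$) is

$$-\log \sum_{w \in \{-1,+1\}^n} e^{-\theta \bar L(w)} \;=\; \min_{p \in \mathcal{P}} \; \theta \cdot \mathbb{E}_{w \sim p}\big[\bar L(w)\big] - H(p),$$

with the minimum attained at the Gibbs distribution $p_\theta(w) \propto e^{-\theta \bar L(w)}$.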

MEAN FIELD APPROXIMATION (Line 145 in Section 3)

The mean field technique [4, Chapters 5.2-5.3] is based on approximating $p_\theta$ by finding its closest approximation, in KL divergence, from a family of tractable models. Meng [2] and Ajanthan [1] consider the naive mean field approximation, which corresponds to selecting the class of graphical models with no edges, i.e., product distributions. The resulting optimization problem is:

$$\min_{p \in \mathcal{P}_{\mathrm{Bernoulli}}} D(p \,\|\, p_{\theta}) = \min_{p \in \mathcal{P}_{\mathrm{Bernoulli}}} \theta \cdot \mathbb{E}_{w \sim p, x}[L(f(x, w), y_x)] - H(p) - OPT(\theta),$$ where $\mathcal{P}_{\mathrm{Bernoulli}}$ is defined in the submission.

BNN Training via Gaussian Variational Inference (Section 3.1)

Gaussian variational inference ([5] and references therein) suggests using the family of multivariate Gaussian distributions as the tractable models. We then obtain:

$$\min_{p \in \mathcal{P}_{\mathrm{corr}}} D_{KL}(p \,\|\, p_{\theta}) = \min_{p \in \mathcal{P}_{\mathrm{corr}}} \theta \cdot \mathbb{E}_{w \sim p, x}[L(f(x, w), y_x)] - H(p) - OPT(\theta)$$

At first, this formulation may seem problematic, as we are fundamentally changing the reference measure from the discrete measure over the binary hypercube $\{-1,+1\}^n$ to the Lebesgue measure over $\mathbb{R}^n$. However, this change is justified, when $\theta$ goes to infinity, by the approximation result of Grothendieck's inequality, which formally shows that the loss minimization problem over $\mathcal{P}_{\mathrm{corr}}$ approximates that over $\mathcal{P}_{\mathrm{Bernoulli}}$ for certain classes of objective functions. See the discussion in Section 3.1, lines 165-178.

This reasoning leads us to consider the regime in which the entropy term vanishes ($\theta \to \infty$), obtaining the problem:

$$\min_{p \in \mathcal{P}_{\mathrm{corr}}} \mathbb{E}_{w \sim p, x}[L(f(x, w), y_x)]$$

This is exactly the problem studied in our paper. For the sake of completeness, at a preliminary stage of our investigation, we also had a version of our gradient-based SDP algorithm include the entropy term for different large values of $\theta$. In all cases, our final algorithm ($\theta \to \infty$) proved more stable, more accurate, and faster to run, so we did not pursue different settings of $\theta$ further.

[1] Thalaiyasingam Ajanthan, Puneet Dokania, Richard Hartley, and Philip Torr. Proximal Mean-Field for Neural Network Quantization. ICCV, 2019.
[2] Xiangming Meng, Roman Bachmann, and Mohammad Emtiyaz Khan. Training Binary Neural Networks using the Bayesian Learning Rule. ICML, 2020.
[3] Shen et al., Variational Learning is Effective for Large Deep Networks. ICML, 2024.
[4] Wainwright, M.J. and Jordan M.I., Graphical Models, Exponential Families, and Variational Inference.
[5] Diao, M.Z., Proximal Gradient Algorithms for Gaussian Variational Inference:Optimization in the Bures–Wasserstein Space, MIT Thesis.

CONNECTION TO SEMIDEFINITE PROGRAMMING

See FAQ 4 in the author's rebuttal.

QUESTION ABOUT MEMORY USAGE

The memory consumption of this method is not reported, and I suspect that this method is way more memory-intensive than baselines.

If $n$ is the number of weights, our memory for storing relevant variables is at most $(K+1)n$, compared to $n$ for other methods. The total significance of this increase depends on the batch size and the size of the examples. For instance, in the case of ResNet18, there are roughly $9 \cdot 10^6$ weights while, for batch size 100, the total size of a data batch is $15 \cdot 10^6$, for a total memory usage of $24 \cdot 10^6$. Hence, choosing $K=1$ will lead to a usage of $33 \cdot 10^6$, a 37.5% increase. This increase will typically get smaller as we consider models trained on larger images.

Comment

Hi Authors,

Thanks for the response.

Connection to Variational Inference

Could you clarify what the parameter $\theta$ is? You have used $\theta$ as a scalar to multiply the expectation term $\mathbb{E}[L(f(x, w), y)]$ and subsequently let $\theta$ go to infinity. On the other hand, you also use $\theta$ to index the weight distribution $w \sim p_{\theta}$, which in general has to be represented by a high-dimensional vector in the context of neural networks.

Also, can you clarify how you arrive at the first equation $\theta \cdot \mathbb{E}[L(f(x, w), y)] - H(p)$? I think you want to define a conditional distribution $p((x, y) \mid w) = \exp(- L(f(x, w), y))$ with a prior $p_0(w)$. Doing variational inference in $w$ gives the (negative) ELBO $-\mathbb{E}[\log p((x, y) \mid w) + \log p_0(w)] - H(p) = \mathbb{E}_{p} [L(f(x, w), y)] - \mathbb{E}_p [\log p_0(w)] - H(p)$, where $p$ is the variational distribution. It's quite close, but not the same as the equation you wrote.

Connection to SDP

I have read FAQ 4. But I have to say that I still find this connection very superficial.

Memory Usage

I am partially convinced. However, I also want to note that ResNet18 is not a huge model by today's standards. For instance, a ResNet50 model will have a higher (relative) memory overhead compared to ResNet18.

Comment

We thank the reviewer for their comments. We respond inline below.

Could you clarify what the parameter $\theta$ is? You have used $\theta$ as a scalar to multiply the expectation term and subsequently let $\theta$ go to infinity. On the other hand, you also use it to index the weight distribution $w \sim p_\theta$, which in general has to be represented by a high-dimensional vector in the context of neural networks.

The parameter $\theta$ is a scalar inverse temperature that determines the extent of the entropy regularization, i.e., lower $\theta$ -> higher temperature -> more entropy regularization. The distribution $p_{\theta}$ is the optimal distribution for the problem

$$OPT(\theta) := \min_{p \in \mathcal{P}} \theta \cdot \mathbb{E}_{w \sim p, x}[L(f(x, w), y_x)] - H(p),$$

i.e., the Gibbs distribution associated with parameter $\theta$. It is not a distribution over $\theta$, but rather a distribution over the high-dimensional weight vector, which is denoted by $w$ throughout the paper. If $w \sim p_\theta$, then $w$ follows the optimal distribution for the problem above. Alternatively, one can think of $\theta$ as a scaling of the loss function that balances its importance with respect to the entropy term.

Also, can you clarify how you arrive at the first equation?

Taking the prior $p_0$ to be the maximum entropy distribution over the weights, the term $\mathbb{E}_p[\log p_0(w)]$ sums up to a constant and does not affect the optimization. The loss function $L$ can be taken to be scaled by $\theta$. This allows us to balance the trade-off between loss and entropy.
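Concretely, with the uniform (maximum entropy) prior over $\{-1,+1\}^n$, the prior term in the reviewer's negative ELBO is a constant:

$$-\mathbb{E}_p[\log p_0(w)] = n \log 2 \quad \text{for every } p,$$

so up to this additive constant and the scaling of $L$ by $\theta$, the negative ELBO coincides with $\theta \cdot \mathbb{E}_{w \sim p, x}[L(f(x,w),y_x)] - H(p)$.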

I have read FAQ 4. But I have to say that I still find this connection very superficial.

It is possible that we might be biased because our design of the method was heavily influenced by the SDP viewpoint and in particular by the application of SDPs to relaxations for combinatorial optimization problems. At the same time, we feel that the hyperplane rounding in our method may be difficult to justify without knowledge of SDP methods in that area.

I am partially convinced. However, I also want to note that ResNet18 is not a huge model by today's standards. For instance, a ResNet50 model will have a higher (relative) memory overhead compared to ResNet18.

The reviewer is correct that the relative memory consumption will increase for larger models, keeping the image size constant. However, we note that in practice we often observe even smaller increases, as our estimate does not include memory usage due to backpropagation. More importantly, models deployed on edge devices, where binarization is most needed, tend to naturally be smaller than ResNet50, such as MobileNet or SqueezeNet. For instance, on MobileNet v2, the number of weights is $3.4 \cdot 10^6$, with a typical data batch still having size $15 \cdot 10^6$ (assuming batch size = 100), for a total memory consumption of $18.4 \cdot 10^6$. In this case, running our method with $K=1$ will lead to a memory usage of $21.8 \cdot 10^6$, just an 18% increase. Even choosing $K=6$ will still keep the memory consumption within a factor of 2 of the original.

Comment

Hi Authors,

Thanks for the clarification. Regarding the variational inference part, is it accurate to say that the parameterization $p_\theta$ has the form $\log p_\theta(x) \propto \theta \log p(x)$? If so, it is not clear to me why it makes sense to let $\theta$ go to infinity.

Comment

Thanks for the response and question. The first-order optimality condition is actually

$$\log p_\theta(w) \propto - \theta \cdot \mathbb{E}_{x}[L(f(x,w), y_x)].$$

As $\theta \to \infty$, the probability mass of $p_\theta$ concentrates on the weight assignments that minimize the loss, because the exponential penalty for higher loss grows larger and larger.
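For completeness, this condition follows from setting the derivative of the Lagrangian of the problem above to zero (over the discrete hypercube, with $\bar L(w) := \mathbb{E}_x[L(f(x,w),y_x)]$ and multiplier $\lambda$ for the normalization constraint):

$$\theta \bar L(w) + \log p(w) + 1 + \lambda = 0 \quad\Longrightarrow\quad \log p_\theta(w) = -\theta \bar L(w) + \text{const}.$$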

Review
Rating: 6

The paper provides a theoretical framework for BNN training (a longstanding model compression and computational speed-up technique) based on Gaussian variational inference. The framework succeeds in providing theoretical grounding for practical and well-established techniques for BNN training, specifically Straight-Through Estimators (STEs), weight clipping and usage of the sign function, as special cases of their formulation. Inspired by previous work by Ajanthan et al. (2019) and Meng et al. (2020), which interpret latent weights as mean-field parameters, their framework allows one to model pairwise correlations between latent weights by using low-rank Gaussian distributions over weights, leading to a more accurate optimization problem (expressible as an SDP according to the authors). They also use sensible relaxations for tractability (the Burer-Monteiro approach), leaving the door open to more sophisticated techniques (gradient flow over Bures-Wasserstein manifolds) for future work.

Strengths

  • [Clarity, Quality] The authors contextualize their work well with respect to the past work on the topic, and the writing is clear and succinct.
  • [Originality] The paper presents a novel modelling approach to training BNNs by combining and extending several well-established methods (use of Gaussian variational inference as a proxy for Bernoulli distributions, SDP relaxation of a combinatorial problem, etc.), allowing them to leverage well-studied and optimized methods and numerical routines for the intractable BNN training problem (Problem (1) in the paper).
  • Their relaxation of the sample space to real numbers and usage of Gaussian weights facilitate the use of gradient-descent-based approaches for optimization (a challenge in earlier works, where gradient-based approaches use various heuristics since the objective function doesn't admit a tractable/sensible gradient).
  • The authors perform ablations to elucidate the impact of the main hyperparameters ($Z$ and $K$), as well as empirical evaluation on notable configurations of datasets and architectures.
  • [Significance] The paper ties together several BNN optimization methods in a unifying framework and presents an interesting relaxation which yields concrete improvements over state-of-the-art results.

Weaknesses

  • The paper can benefit from empirical evaluation over a broader class of architectures since BNNs can have applicability beyond vision as a domain. I do acknowledge that the domain is typically restricted to the class of vision based transforms, but leveraging these techniques in other major model classes of interest can have significant impact which is left unexplored.
  • The authors claim that the problem involves an SDP, but the algorithm presented has very loose connections to most formalizations of semidefinite programming (and variational inference). The SDP claim seems to derive only from a semidefinite constraint, and the problem itself fails to be convex, let alone a semidefinite program (see lines 163-164).

Questions

  • In L242-243, the authors mention taking 40 samples and averaging the results for predictions. How much does this impact performance (i.e., inference time, storage, etc.), since one of the primary motivations of BNNs is to minimize computational storage and runtime requirements?
  • It is well established in quantization literature that activations (and weights) both admit low precision representations without significant performance loss. Did you try activation precision below 32 bits since low precision architectures are quite common (and perhaps even the norm)?
  • Do you have a hypothesis for why a lower K performs better for 1W1A while a higher K is better for 1W32A architectures?

Limitations

There are no major limitations

Author Response

We thank the reviewer for their assessment and questions. We respond to each point inline below. We encourage the reviewer to ask any follow-up question and to comment on our responses. We hope that they will consider raising their score at the end of our exchange.

The paper can benefit from empirical evaluation over a broader class of architectures since BNNs can have applicability beyond vision as a domain. I do acknowledge that the domain is typically restricted to the class of vision based transforms, but leveraging these techniques in other major model classes of interest can have significant impact which is left unexplored.

See FAQ 1 in the author rebuttal.

The authors claim that the problem involves an SDP, but the algorithm presented has very loose connections to most formalizations of semidefinite programming (and variational inference). The SDP claim seems to derive only from a semidefinite constraint, and the problem itself fails to be convex, let alone a semidefinite program (see lines 163-164).

In the author rebuttal, please see FAQ 3 for the connection to variational inference and FAQ 4 for the connection to semidefinite programming.

In L242-243, the authors mention taking 40 samples and averaging the results for predictions. How much does this impact performance (i.e., inference time, storage, etc.), since one of the primary motivations of BNNs is to minimize computational storage and runtime requirements?

This has limited impact on storage, as the algorithm can maintain the average dynamically. The impact on inference time may be as large as 40x, but it may be significantly reduced by parallelizing the processing of the 40 samples. Further reducing this impact is an active area of focus. In preliminary results, we find that the process may be sped up by using a smaller number of correlated samples via Gaussian quadrature. In this case, $2K+1$ samples would suffice, which is as small as 3 for the $K=1$ version of our method. In conclusion, we note that the prime runtime concern is to reduce the cost of training, which we achieve by binarizing both weights and activations while outperforming competitors.
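For illustration, a minimal sketch of the averaging we refer to (hypothetical names, e.g., `logits_fn` standing in for a forward pass with a given binary weight vector; this is not our exact code). The running sum keeps storage constant in the number of samples, and the samples are independent, so they can be processed in parallel.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def averaged_prediction(logits_fn, mu, Z, n_samples=40, seed=0):
    """Average class probabilities over n_samples binary networks,
    each obtained by rounding a sample mu + Z r with r ~ N(0, I_K)."""
    rng = np.random.default_rng(seed)
    probs = 0.0
    for _ in range(n_samples):
        r = rng.standard_normal(Z.shape[1])
        w = np.where(mu + Z @ r >= 0, 1.0, -1.0)
        probs = probs + softmax(logits_fn(w))  # running sum, O(1) storage
    return probs / n_samples
```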

It is well established in quantization literature that activations (and weights) both admit low precision representations without significant performance loss. Did you try activation precision below 32 bits since low precision architectures are quite common (and perhaps even the norm)?

We conducted experiments using binary activation precision. The results of these experiments are detailed in Tables 1, 3, and 4, which show the performance under the settings of binary activation and binary weights across different datasets and network architectures. For future work, we plan to explore other low precision configurations, including 2-bit, 3-bit, and 4-bit precision.

Do you have a hypothesis for why a lower K performs better for 1W1A while a higher K is better for 1W32A architectures?

See FAQ 2 in the author rebuttal.

Comment

Thank you for your response. Based on the FAQ and the responses to other reviewers, I have more confidence in the proposed method. I still respectfully disagree with the emphasis on the SDP part of the algorithm, which, as the authors explain in Part 4, stems from the semidefinite constraint and from sharing the optimization approach for solving large SDPs using a non-convex relaxation. Different classes of problems can admit similar relaxations/approximations, and I do not believe that is enough to fully justify the heavy usage of the term. However, this is a difference of semantics, and I would like to focus on the efficacy and functionality of the method itself.

Based on the increase in memory usage (as pointed out in the memory-constrained devices issue raised by Reviewer Bkmh), I believe it is vital to use mitigation strategies to reduce the impact of $K$ on memory. Similar concerns are also raised by Reviewer M9aW. Given that one of the primary motivations is deployment/training on edge devices, the increase in memory usage needs to be compensated by other factors so as not to contradict the primary objective. The authors have mentioned that they plan to explore this direction using low precision (and rely on the fact that larger images ameliorate the issue; cf. the rebuttals to Reviewers Bkmh and M9aW). However, the performance of the final model also tends to increase with $K$, as noted in the paper and FAQ 2. I would like the authors to explicitly include a brief set of guidelines and best practices for understanding the trade-offs in choosing an appropriate value of $K$ with respect to factors such as training time/memory and expected model performance, as in FAQ 2, as part of the paper/appendix.

Finally, based on the responses so far I will increase my score to 6.

Comment

We thank the reviewer for their response. We will add a paragraph in the main body of the paper covering the content of FAQ 2 and our response to Reviewer Bkmh, which contains estimates of how the memory consumption varies with $K$.

We will also note that models deployed on edge devices, where binarization is most needed, tend to naturally be small, such as MobileNet or SqueezeNet. For instance, on MobileNet v2, the number of weights is $3.4 \cdot 10^6$, with a typical data batch still having size $15 \cdot 10^6$ (assuming batch size = 100), for a total memory consumption of $18.4 \cdot 10^6$. In this case, running our method with $K=1$ will lead to a memory usage of $21.8 \cdot 10^6$, just an 18% increase. Even choosing $K=6$ will still keep the memory consumption within a factor of 2 of the original.

Review
Rating: 6

This paper presents an optimization framework for Binarized Neural Network (BNN) training using Gaussian variational inference, resulting in a low-rank semidefinite programming (SDP) formulation. The authors propose the Variational Inference Semidefinite Programming Algorithm (VISPA) to improve accuracy by modeling and learning pairwise correlations between weights. The empirical evaluation demonstrates that VISPA outperforms state-of-the-art algorithms on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets.

Strengths

The paper introduces a novel optimization framework based on Gaussian variational inference for BNN training, which goes beyond latent weights and leverages low-rank semidefinite programming relaxations.

The paper provides a theoretical motivation for the use of latent weights, STE, and weight clipping. It also presents a new interpretation of latent weights methods within the proposed optimization framework.

The empirical evaluation on multiple benchmark datasets shows consistent and significant performance improvements over existing state-of-the-art BNN training methods.

Weaknesses

While the method shows excellent results on the tested datasets, the paper could benefit from a discussion on the generalization capabilities of the proposed approach to other types of neural networks or applications.

The performance of VISPA may be sensitive to the choice of hyperparameters, such as the covariance rank 𝐾, which is not extensively explored in the paper.

Questions

Can you provide more details on the computational complexity and runtime of VISPA compared to other state-of-the-art BNN training methods?

How does VISPA perform on other types of neural networks, such as transformer architectures? Can you elaborate on the initialization strategy for the weight deviation matrix 𝑍 and its impact on the training stability and final performance?

What are the potential limitations of the proposed method when applied to resource-constrained devices, and how can these be mitigated?

Could you provide more insights into the impact of different hyperparameter settings, especially the covariance rank 𝐾, on the performance of VISPA?

Limitations

N.A.

Author Response

We thank the reviewer for their assessment and questions. We respond to each point inline below. We encourage the reviewer to ask any follow-up question and to comment on our responses. We hope that they will consider raising their score at the end of our exchange.

How does VISPA perform on other types of neural networks, such as transformer architectures?

See FAQ 1 in the author rebuttal. We will include this discussion in Section 5 (Limitations and Open Problems) in the final version.

We have performed preliminary experiments with VISPA on the transformer architecture Swin-T [3] using ImageNet. We binarized the weights of the linear layers except for the qkv layer before the softmax. The experiment was performed with a learning rate of 0.1, a weight decay of 1e-5, and 150 training epochs, without fine-grained hyper-parameter tuning. Results are shown in the table within the PDF attached to the author rebuttal.

We have identified three methods against which to compare VISPA's performance: BiViT [1], BiT [2] and BiBERT [4]. As summarized in the table, these methods differ in fundamental ways from our application of VISPA, making an immediate comparison somewhat tenuous. All of them use knowledge distillation to enhance the performance of binarized transformer architectures, putting our method at a disadvantage. Furthermore, our strongest competitor, BiViT, uses a special handling of the softmax operation, facilitating its task. On the other hand, our implementation of VISPA does not binarize QKV layers, gaining an advantage there. With these caveats, we find that our method outperforms all competitors in terms of top-1 test accuracy, sometimes decisively. We believe that this result strongly suggests that the performance improvements, shown in this paper for vision architectures, will translate successfully to transformers. A more detailed study would involve different architectural choices for the binarized transformer, which are beyond the scope of our submission.

[1] He, Y., et al. "BiViT: Extremely Compressed Binary Vision Transformers." ICCV, 2023.
[2] Liu, Z., et al. "BiT: Robustly Binarized Multi-distilled Transformer." NIPS, 2022.
[3] Liu, Z., et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV, 2021.
[4] Qin, H., et al. BiBERT: Accurate Fully Binarized BERT. ICLR, 2022.

Could you provide more insights into the impact of different hyperparameter settings, especially the covariance rank 𝐾, on the performance of VISPA?

See FAQ 2 in the author rebuttal.

Can you provide more details on the computational complexity and runtime of VISPA compared to other state-of-the-art BNN training methods?

As shown in Algorithm 1, let $M$ denote the batch size, $n$ the number of weight parameters, and $K$ the embedding dimension. The computational complexity of the VISPA approach is $O(Mn + nK)$. In comparison, other state-of-the-art BNN training approaches typically require $O(Mn)$. In most cases $M \gg K$, so the increased cost due to the $nK$ term is a negligible fraction of the total running time, as most time is spent performing forward- and back-propagation through the neural network.

Can you elaborate on the initialization strategy for the weight deviation matrix 𝑍 and its impact on the training stability and final performance?

In Section 4 (Experiments), we discussed that the weight deviation matrix $Z$ is initialized using the Xavier normal distribution, with a mean of 0 and a standard deviation of $s \cdot \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}$. Here, fan_in represents the number of input units, fan_out the number of output units, and $s$ is a scaling factor parameter. Xavier initialization has been empirically shown to work well in practice across a variety of neural network architectures (e.g., CNNs, ResNets, GANs, and Transformers). It is based on a theoretical analysis that ensures the variance of the input to each neuron is similar to the variance of the output. This choice balances the scale of weights and helps maintain stable gradient magnitudes [1]. For details on the impact of $Z$'s initialization on training stability and performance, see Appendix A.2. There we show that VISPA achieves consistent performance across a range of choices of the scaling factor $s$.

[1] Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS, 2010.
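For illustration, a minimal sketch of this initialization for a single linear layer (PyTorch-style; the shapes and the helper name are ours, not the paper's code):

```python
import torch

def init_deviation_matrix(fan_in, fan_out, K, s=1.0):
    """Initialize the K-column deviation matrix Z for a linear layer with
    mean 0 and std s * sqrt(2 / (fan_in + fan_out)) (scaled Xavier-normal)."""
    std = s * (2.0 / (fan_in + fan_out)) ** 0.5
    return torch.randn(fan_in * fan_out, K) * std

Z = init_deviation_matrix(fan_in=256, fan_out=512, K=4, s=1.0)
print(Z.shape, Z.std().item())  # torch.Size([131072, 4]), roughly s * 0.051
```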

What are the potential limitations of the proposed method when applied to resource-constrained devices, and how can these be mitigated?

The main potential limitation of VISPA on resource-constrained devices is the increase in memory usage due to having to maintain the covariance variable $Z$. If $n$ is the number of weights, our memory for storing relevant variables is at most $(K+1)n$, compared to $n$ for other methods. The total significance of this increase depends on the batch size and the size of the examples. For instance, in the case of ResNet18, there are roughly $9 \cdot 10^6$ weights while, for batch size 100, the total size of a data batch is $15 \cdot 10^6$, for a total memory usage of $24 \cdot 10^6$. Hence, choosing $K=1$ will lead to a usage of $33 \cdot 10^6$, a 37.5% increase. This increase will typically get smaller as we consider models trained on larger images. In practice, we often observe even smaller increases, as our estimate does not include memory usage due to backpropagation, which can be very large, e.g., in the case of residual connections.

A possible mitigation strategy is to store our variable $Z$ at lower precision. Given that this variable is only accessed via multiplication with Gaussian noise, we believe that this will not change the behavior of our algorithm.

Author Rebuttal

We thank the reviewers for their time. We are grateful that most reviewers agree on the novelty of our framework based on Gaussian variational approximation, the clarity of the submission and the significance of the performance improvements revealed by our experiments. In this rebuttal, we address frequently brought up questions.

FAQS

1. Generalization of our method to other neural network architectures

We chose to focus our first presentation of VISPA on vision-based applications because of the availability of well-studied binarized architectures and well-established baselines, which isolate the performance of our method more closely. Indeed, for transformers, there is not yet agreement on the best binarized architecture, due to the difficulty of binarizing activations in softmax layers. Only recently, researchers have made progress in bypassing this obstacle [1]. This also contributes to the scarcity of baselines and to the absence of a standard benchmark.
To demonstrate the applicability of VISPA to transformers, we include in the PDF a table of preliminary results. See the rebuttal to Reviewer Bkmh for a more detailed discussion.

[1] He, Y., et al. "BiViT: Extremely Compressed Binary Vision Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2023.

2. Setting of rank parameter K

The impact of $K$ on test accuracy is discussed in Section 4.1, Figure 1. Our expectation was that higher values of $K$ should yield better accuracy, as the algorithm can explore a larger space of distributions. As pointed out by a number of reviewers, the results do not fully support this conclusion. Results do tend to marginally improve as $K$ gets larger, with the exception of VGG-Small + CIFAR-10 (1W1A). The accuracy of ResNet18 + CIFAR-10 (1W32A) is essentially constant over $K$. We have two hypotheses for this lack of conclusive evidence:

  • Higher values of $K$ yield a larger feasible space and require a longer time to converge. In the case of VGG-Small + CIFAR-10 (1W1A), it is possible that the fixed number of epochs chosen for the experiments was sufficient for convergence at smaller $K$ values, but insufficient at higher values.
  • Higher values of $K$ yield a more complex distribution from which it is more expensive to sample good weights at the inference stage.

Based on preliminary experiments, we believe that a combination of the two may be at play for VGG-Small + CIFAR-10 (1W1A).

3. Connection to Gaussian Variational Inference

Following Wainwright and Jordan [1, Chapter 3.7], the variational inference technique can be applied to the variational formulation of the problem of computing the log-partition function of a probabilistic graphical model. By adding an entropy regularization term $\frac{1}{\theta} \cdot H(p)$, the optimization problem Problem 2 (Line 134) can be put in this form for a graphical model whose sufficient statistics are described by the loss function. The fundamental idea of variational inference is then to approximate the log-partition function and the corresponding Gibbs distribution $p_\theta$ by restricting the entropy-regularized optimization problem to a tractable class of probability distributions (see [1], Chapter 5.2). The Gaussian variational inference approach (see [2,3] and references therein) suggests using the class of multivariate Gaussian distributions as the tractable family of choice. Grothendieck's inequality, which shows that such a Gaussian distribution can be used to approximate the required function in the regime $\theta \to \infty$, leads us to let the entropy term vanish. Finally, this yields the same objective function as in our submission.

[1] Wainwright, M.J. and Jordan M.I., Graphical Models, Exponential Families, and Variational Inference
[2] Diao, M.Z., Proximal Gradient Algorithms for Gaussian Variational Inference:Optimization in the Bures–Wasserstein Space
[3] Lambert, M. et al, Variational inference via Wasserstein gradient flows

4. Connection To Semidefinite Programming

The reviewers correctly note that we use the term SDP to simply indicate the presence of a semidefinite constraint; however, this usage has become frequent in recent literature as researchers investigate the potential of semidefinite programs with non-convex objectives and constraints, in particular low-rank constraints [1]. For clarity, throughout the paper, we repeatedly reassert the non-convexity of our formulation. Despite this non-convexity, we argue that our techniques are typical of SDP programming. The Burer-Monteiro approach is one of the most promising practical techniques to reduce the computational cost of SDPs [2], with a significant amount of research devoted to understanding which reductions in rank preserve the global optima of the original solution [3,4,5]. Similarly, our use of moment-matching constraints in the definition of $\mathcal{P}_{\mathrm{corr}}$ is typical of the relaxation of combinatorial optimization problems to semidefinite programs, which is a pillar of the theory of approximation algorithms. Indeed, our work is originally motivated by Grothendieck's inequality and hyperplane rounding.

[1] Mei, S., Misiakiewicz, T., Montanari, A., & Oliveira, R. I. (2017, June). Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality.
[2] Yurtsever, A., Tropp, J. A., Fercoq, O., Udell, M., & Cevher, V. (2021). Scalable semidefinite programming.
[3] Boumal, N., Voroninski, V., & Bandeira, A. (2016). The non-convex Burer-Monteiro approach works on smooth semidefinite programs.
[4] Cifuentes, D., & Moitra, A. (2022). Polynomial time guarantees for the Burer-Monteiro method.
[5] Erdogdu, M. A., Ozdaglar, A., Parrilo, P. A., & Vanli, N. D. (2022). Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs.
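As a concrete textbook instance of this toolkit (MaxCut-style, not our exact program): the binary quadratic problem $\max_{w \in \{-1,+1\}^n} w^\top C w$ is relaxed to the SDP

$$\max_{X \succeq 0,\; X_{ii}=1 \;\forall i} \; \langle C, X \rangle,$$

and the Burer-Monteiro approach replaces $X$ by a factorization $X = V V^\top$ with $V \in \mathbb{R}^{n \times K}$, turning the semidefinite constraint into per-row constraints $\|V_i\|_2 = 1$; hyperplane rounding then recovers binary assignments as $w_i = \mathrm{sign}(V_i \cdot g)$ for a random Gaussian vector $g$. Our moment constraints and rounding step play the analogous roles in our formulation.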

Final Decision

Thank you for your valuable contribution to Neurips and the ML community. Your submitted paper has undergone a rigorous review process, and I have carefully read and considered the feedback provided by the reviewers.

The paper proposes a variational approach to training binary neural networks using an SDP relaxation. The authors obtain competitive numerical results compared to state-of-the-art. Overall, the paper received mostly positive response from the reviewers. Some issues raised by the reviewers were successfully addressed in the rebuttal. These issues include (i) connections to variational inference and SDPs, (ii) memory consumption, (iii) impact of hyperparameters.

Given this positive assessment, I am willing to recommend the acceptance of your paper for publication.

I would like to remind you to carefully review the reviewer feedback and the resulting discussion. While most reviews were positive, the reviewers have offered valuable suggestions that can further strengthen the quality of the paper. Please take another careful look at the 'weaknesses' section of each reviewer comment and the issues (i)-(iii) listed above. I encourage you to use this feedback to make any necessary improvements and refinements before submitting the final version of your paper.

Once again, thank you for submitting your work to Neurips.

Best, Area Chair