PaperHub
Rating: 7.1/10 · Poster · 5 reviewers
Scores: 4, 5, 4, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.2
Novelty: 3.2 · Quality: 2.8 · Clarity: 2.6 · Significance: 3.4
NeurIPS 2025

Low Precision Streaming PCA

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Low-precision quantized variants of Oja’s algorithm can provably converge in streaming PCA under suitably fine discretization while drastically reducing memory and compute requirements.

Abstract

Keywords
Streaming PCA, Low-Precision Quantization, Oja's Algorithm, Stochastic Quantization, Finite-Sample Convergence, Spectral Gap Assumption, Minimax Lower Bounds

Reviews and Discussion

Review (Rating: 4)

Partially motivated by the advances of using limited precision arithmetic in neural networks, the authors study the effect of quantization on streaming PCA, restricted to finding the eigenvector with the largest eigenvalue. In particular, they analyze Oja's algorithm with two quantization schemes (linear and nonlinear), giving bounds on the "difference" of their approximated eigenvector and the optimal one. Additionally, the authors support their theoretical findings with a suite of experiments.

Strengths and Weaknesses

Strengths

  • While PCA has been studied extensively in the literature, incorporating quantization and studying its effect is an original and novel twist.
  • The lower bounds can be of independent interest, e.g., to compare future algorithms against.
  • While the article is generally well written, I think that the presentation can be improved. See below.

Weaknesses

  • Some notations are not well-defined, rendering the equations challenging/impossible to parse/comprehend. See the questions for details.
  • Similarly, some abbreviations, e.g., BF16 in line 33, are not defined before they are used.
  • A few existing related works are enumerated (Line 35--) but their actual results are not put into perspective.
  • The practical relevancy, most prominently the impact on runtime, is not illustrated.
  • The contributions are not clearly articulated. E.g., what does the "general theorem for streaming PCA" tell? Which lower bound (the new one or an existing one) is being matched?

Questions

Key points:

  • Is the length of the stream $n$ fixed beforehand or can $n$ grow over time?
  • Why is there no $\beta$ dependence in Lemma 1 and Lemma 2?
  • What is meant by $\prod_{i=b}^{1}$ applied to a vector? E.g., line 143 and (7).
  • In line 164, isn't everything on the l.h.s. constant and the assumption always holds? Or is $n$ increasing?
  • Theorem 1/2: What is $\alpha$?

Further questions/remarks:

  1. Do the authors require a bounded input set?
  2. In line 59, is there no $n$ dependence for constructing and diagonalizing $\hat{\Sigma}$?
  3. Can the authors clarify in the beginning, what they mean by "nearly dimension-free"? From my understanding, the dimension enters the error only logarithmically; this should be made explicit.
  4. The norms in Assumption 1 are not made explicit: Is the expectation over the operator norm and the a.s. bound over the Frobenius norm?
  5. In line 99, is $\beta \in \mathbb{N}_{>0}$?
  6. In (3), what is the -1+1 for?
  7. In the logarithmic scheme, line 107: Is it the same $\beta$ introduced for linear quantization?
  8. In line 112, the authors state that they elaborate the stochastic rounding for linear quantization, but the equations concern $\mathcal{Q}_{NL}$, i.e., nonlinear quantization?
  9. In line 113, can the authors make "well within the range" explicit?
  10. In line 114, do $l$ and $u$ have index $i$? Or what do the $i$-s in (5) refer to?
  11. In line 193, how explicitly were $\beta_e$ and $\beta_m$ chosen?
  12. From which dataset was Figure 1 created? How was $b=10$ chosen? According to Theorem 2, $b$ depends on $n, \alpha$ and the eigengap $\lambda_1-\lambda_2$?
  13. Algorithm 2: Line 5 does not depend on $i$, can it be moved out of the loop?

A few notes on typos/grammar:

  • In line 46, is $\eta>0$ missing?
  • In line 68, space before the reference.
  • It could help to refer to the existence of the supplement in the outline (line 82-85).
  • In line 87, it seems that $0 \not\in [n]$; this should be made explicit, e.g., by using $\mathbb{N}_{>0}$ as $\mathbb{N}$ is ambiguous.
  • Euclidean: Capitalize.
  • In line 122, use "the norm of the estimated $\mathbf{u}$".
  • Sometimes the authors use "Eq.", sometimes "equation", e.g., line 143, 151, 165.
  • Line 156, $\mathbb{R}^{d\times d}$.
  • Can the authors write "i.i.d." consistently?
  • In line 167, "are bounded" => "is bounded".
  • In line 168, "in/see" duplicate.
  • In line 205, add comma to $\min$.
  • In line 220, "larger" => "smaller"?
  • The bibliography is very inconsistent: Sometimes a venue has a number in words, other times in digits. Sometimes abbreviations are added, sometimes not. Some references have editors listed, others not. Names such as "Bernstein" should be capitalized. Similarly, abbreviations like "LLM", "DCM", "PCA".

Limitations

Yes.

Final Justification

Through consideration of the other reviewer's comments and the comprehensive rebuttal regarding my own review, I am bumping my score.

Formatting Issues

None.

Author Response

Thank you for your kind words regarding the originality, novelty, and the writing of the manuscript. We will fix all typographical errors pointed out by the reviewer in the revised manuscript and address key questions below:

Regarding bfloat16: The bfloat16 (brain floating point) format is a 16-bit floating-point format designed for efficient deep learning training and inference. We will clarify this in the paper.

Regarding related work: We provide more details on related work in Appendix G. However, most of the known guarantees are for stochastic gradient descent (SGD) with convex objectives, and we are not aware of existing analyses that apply to the streaming PCA setting off the shelf.

Regarding runtime comparisons: When operating with $\beta$ bits, the overall complexity for streaming PCA (and that of the batched variant) is $O(nd\beta)$, and we performed a benchmarking experiment to illustrate this below.

| Experiment | Runtime (seconds) |
| --- | --- |
| 64 bits | 0.0274 ± 0.00136 |
| 32 bits | 0.0260 ± 0.00225 |
| 16 bits | 0.000398 ± 0.0000235 |

Regarding contributions: The general theorem refers to a result for streaming PCA with Oja's algorithm that can handle martingale noise in the iterates, and is a generalization of the known analysis for i.i.d. data. We are unaware of existing lower bounds for our problem setting and refer to the lower bounds proved in our work. We will clarify this.

Regarding the length of the stream: In the current manuscript, the length of the stream $n$ is an input and the learning rate is constant over time for ease of analysis. To handle variable learning rates using only constant-rate updates, we employ a standard doubling trick. Specifically, we divide the time horizon into blocks of exponentially increasing size, where each block $k$ has length $2^k$ and uses a constant learning rate $\eta_k$ tailored to that block's length. At the end of each block, the previous estimate is discarded and a new estimate is initialized. This scheme effectively simulates a decaying learning rate $\eta_t$ while retaining the simplicity and analytical tractability of constant-rate updates.
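As an illustration, here is a minimal Python sketch of the doubling trick described above (not the paper's Algorithm 1): `oja_constant_rate` is a hypothetical placeholder for a constant-rate Oja pass, and the per-block rate eta_k ~ alpha*log(n_k)/(n_k*gap) follows the description in this rebuttal rather than the paper's exact constants.

```python
import numpy as np

def oja_constant_rate(stream, n_block, eta, d, rng):
    """Placeholder constant-learning-rate Oja pass over the next n_block points;
    in the paper this would be the (quantized) batched update."""
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    for _ in range(n_block):
        x = next(stream)
        u = u + eta * x * (x @ u)   # (I + eta * x x^T) u
        u /= np.linalg.norm(u)
    return u

def oja_doubling(stream, n_total, d, gap, alpha=1.0, seed=0):
    """Doubling trick: blocks of length 2^k, each run with its own constant rate eta_k;
    the estimate is discarded and re-initialized at every block boundary."""
    rng = np.random.default_rng(seed)
    u, seen, k = None, 0, 0
    while seen < n_total:
        n_block = min(2 ** k, n_total - seen)
        eta_k = alpha * np.log(max(n_block, 2)) / (n_block * gap)  # rate tailored to this block
        u = oja_constant_rate(stream, n_block, eta_k, d, rng)
        seen += n_block
        k += 1
    return u

# Example usage on a synthetic stream whose leading direction is e_1:
rng = np.random.default_rng(1)
d = 20
e1 = np.eye(d)[0]
stream = (rng.standard_normal(d) + 3.0 * rng.standard_normal() * e1 for _ in range(4000))
u_hat = oja_doubling(stream, n_total=2000, d=d, gap=1.0)
print(abs(u_hat @ e1))   # should be close to 1
```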

Regarding $\beta$ dependence in Lemma 1 and 2: While the bounds in Lemma 1 and 2 are stated for a given $\delta$ and $d$, the optimal choice of these parameters given a fixed bit budget $\beta$ does make the bounds dependent on $\beta$ (see lines 133-137).

$\prod_{i=b}^{1}$ applied to a vector: The product is a product of matrices in order $b$ down to $1$, which is then multiplied by a vector. We will clarify this in the revised manuscript.

Regarding the l.h.s. of line 164: The number of batches $b$ depends on $n$. The $O(1)$ factor here represents a constant with respect to the asymptotic behaviour of $n$. We will clarify this in the revised manuscript.

Regarding $\alpha$ in Theorem 1 and 2: $\alpha$ is a constant which determines the learning rate $\eta$. Our results require $\alpha \geq 1$ (see Lemma A.2). We will refer to this lemma in the main paper.

Regarding the bounded input set assumption: We only require the variance and boundedness assumptions in Assumption 1, which are standard in the streaming PCA literature (see [1-7]). We note that while the current analysis assumes a bounded set, this is for ease of analysis and it can be generalized to unbounded subgaussian data (see [8, 9, 10]).

Due to space constraints and for ease of exposition, we will provide an easy argument that changes the algorithm by considering "truncated" datapoints $Y_i = X_i \mathbb{1}(\|X_i\|^{2} \leq \alpha_n)$, where the truncation parameter $\alpha_n$ is set to $c\log(n)$. Thus, we replace any datapoint whose squared norm exceeds the truncation parameter by zero. Standard truncation arguments can then show that the sin-squared error of the output of this algorithm with respect to the original principal eigenvector has the same theoretical guarantee. The crux of the argument is that the covariance matrix of the truncated data changes very little since the truncations happen with very small probability.
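A minimal sketch of this truncation step (the constant `c` and the batched interface are illustrative, not the paper's):

```python
import numpy as np

def truncate_stream(X, c=1.0):
    """Y_i = X_i * 1(||X_i||^2 <= alpha_n) with alpha_n = c * log(n):
    data points whose squared norm exceeds the threshold are replaced by zero."""
    n = X.shape[0]
    alpha_n = c * np.log(n)
    keep = np.sum(X ** 2, axis=1) <= alpha_n
    return X * keep[:, None]
```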

Regarding time and space complexity for diagonalizing $\hat{\Sigma}$: Thank you for pointing this out; our intention was to say that the time and space dependencies of offline diagonalization are $\Omega(d^2)$. We will clarify this.

Regarding “nearly dimension-free” bounds: Yes, our dimension dependence is logarithmic in $d$, which we refer to as nearly dimension-free, and we will make this explicit in the revised manuscript.

Regarding the $\|\cdot\|$ norm: Both norms are the operator norm $\|\cdot\|_2$. Throughout the paper, the norms $\|\cdot\|$, $\|\cdot\|_2$, and $\|\cdot\|_{op}$ all refer to the 2-norm $\|\cdot\|_2$ (which is the operator norm for matrices). We will clarify this in the notation section.

Regarding $\beta$ being a natural number: Yes, $\beta$ is the number of bits used by the model and is a natural number.

Regarding +1, -1 in (3): Yes, the second value in the quantization should be $-\delta (2^{\beta-1} - 1)$ instead of $-\delta ((2^{\beta-1}-1)+1)$. We will fix this typo.

Regarding stochastic rounding in Line 112: Thank you for pointing that out. We will fix this typographical error.

Regarding “well within the range”: Thank you for pointing it out. Here, we mean that the value being rounded is between the smallest and the largest numbers in the quantization grid.

Regarding $\ell$ and $u$ in Line 114: Thank you for pointing out the typographical error. The $i$'s in (5) will be removed in the revised paper.

Regarding the choice of $\beta_e$ and $\beta_m$: The choice of the parameters $\zeta$ and $\delta_0$ is inspired by floating point computation, where the number of bits in the mantissa, $\beta_m$, determines $\zeta$, and the number of bits in the exponent, $\beta_e$, determines the smallest representable number $\delta_0$. The values $\beta_e$ and $\beta_m$ were chosen by minimizing the error $\kappa_1 = \zeta^2 + \delta_0^2 d$ explicitly, that is, by taking a derivative with respect to $\beta_e$ and setting the expression to zero. Note that the quantization errors $\zeta$ and $\delta_0$ depend on the bits, and the choice is inspired by floating point computation.
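A minimal numerical sketch of this trade-off, under the assumed (illustrative) mappings zeta ~ 2^(-beta_m) and delta_0 ~ 2^(-2^(beta_e-1)) with a fixed budget beta = beta_e + beta_m; the derivative condition above is replaced here by a simple search over integer splits, and the paper's exact constants may differ.

```python
def kappa1(beta_e, beta_total, d):
    """kappa_1 = zeta^2 + delta_0^2 * d, with the assumed floating-point-style mappings
    zeta ~ 2^(-beta_m) (mantissa-driven relative error) and
    delta_0 ~ 2^(-2^(beta_e - 1)) (smallest representable magnitude)."""
    beta_m = beta_total - beta_e
    zeta = 2.0 ** (-beta_m)
    delta0 = 2.0 ** (-(2.0 ** (beta_e - 1)))
    return zeta ** 2 + (delta0 ** 2) * d

def best_split(beta_total, d):
    """Choose the exponent/mantissa split minimizing kappa_1 over integer choices."""
    return min(range(1, beta_total), key=lambda be: kappa1(be, beta_total, d))

for beta in (8, 12, 16):
    be = best_split(beta, d=100)
    print(f"{beta} bits -> beta_e = {be}, beta_m = {beta - be}")
```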

Regarding the dataset used in Figure 1: The dataset is the same as that used in Section 5, Experiments, and is described in detail in Appendix F. Since the dataset is synthetically generated, we are able to explicitly calculate the eigengap. However, we would like to point out that the analysis is robust to an upper bound on the eigengap that is off by a multiplicative constant, at the cost of a slight increase in the final sin-squared error.

Regarding Algorithm 2, line 5: Yes, we will move the definition of $\rho$ outside the loop, making the algorithm more readable.

References

[1] Moritz Hardt and Eric Price. The noisy power method: a meta algorithm with applications. NeurIPS 2014.

[2] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. ICML 2015.

[3] Ohad Shamir. Convergence of stochastic gradient descent for PCA. ICML 2016.

[4] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. ICML 2016.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global gap-free and near-optimal rate. FOCS, 2017.

[6] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. COLT, 2016.

[7] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. COLT, 2016

[8] Lunde, R., Sarkar, P. and Ward, R., 2021. Bootstrapping the error of Oja's algorithm. Advances in neural information processing systems, 34, pp.6240-6252.

[9] Liang, X., 2023. On the optimality of the Oja’s algorithm for online PCA. Statistics and Computing, 33(3), p.62.

[10] Kumar, S. and Sarkar, P., 2024. Oja's algorithm for streaming sparse PCA. Advances in Neural Information Processing Systems, 37, pp.74528-74578.

Comment

I thank the authors for considering and answering my questions to my full satisfaction. In particular, with the $\Pi_{i=b}^{1}$ issue (which apparently also confused another reviewer) settled and the choice of the $\beta$'s clarified, I am happy to bump my score and trust that the authors incorporate the content of their rebuttals into the revised version of the manuscript.

Comment

Thank you for your positive response. We will incorporate these points and clarify the choice of $\beta$ in the revised manuscript.

Review (Rating: 5)

The authors proposed a low-precision version of Oja's algorithm for streaming PCA, in which the iterates, the samples, and the gradients are all quantized, and showed that its batched version achieves optimal estimation error rates under either a linear quantizer (LQ) or a nonlinear quantizer (NLQ). The take-away lies in the advantage of NLQ in achieving a nearly dimension-free estimation error.

Strengths and Weaknesses

Strengths:

  1. The paper conveys rather clear messages: NLQ achieves a faster, nearly dimension-free rate, and the batched version (or sample splitting) leads to much better performance.

  2. The results provide a clear understanding of the problem: the authors established lower bounds and matching upper bounds (achieved by the quantized batched Oja's algorithm). In view of Sec. 4, the proof techniques are nontrivial with some novelty.

  3. The experiments nicely back up the theory.

Weaknesses:

  1. Some ambiguity in notation: P3 only defines $\|\cdot\|_2$ and $\|\cdot\|_{op}$, while later $\|\cdot\|$ is frequently used (e.g., Assumption 1, Eqs. 8-9).
  2. The motivation for such a study should be made clearer.

Questions

  1. Can the authors further clarify the motivations/applications of their problem setting? I am a bit confused about the "quantization constraint" -- specifically, what kind of quantization information one can use to estimate $v_1$ in the authors' setting.

In the algorithm, the authors quantize everything (samples, gradient, iterates) -- is this motivated by some practical regimes, or can we only quantize the samples? Even only quantizing the samples, the problem distinguishes with some problems I know of where the samples should be quantized before running the algorithms (e.g., see "Distributed gaussian mean estimation under communication constraints: Optimal rates and communication-efficient algorithms").

  2. The phenomenon that NLQ is statistically much better than LQ is new to me. I understand that this stems from the choice of the relevant parameters -- which nonetheless appears less straightforward to the readers.

Can the authors provide some intuition that leads to such difference? Are there similar results appearing in other low-precision estimation problems?

  3. The authors studied two specific quantizers. Can the authors' method work with other quantizers? Or can the authors explain why they specifically consider these two quantizers?

Limitations

The limitations have been listed as questions above.

Final Justification

I maintain my current rating. The authors' responses address my main concerns and improve my understanding of their work. The setting differs from a standard estimation problem under quantization, in that everything is quantized throughout the computational procedure. The results are novel and interesting.

Formatting Issues

N/A

Author Response

Thank you for your valuable feedback on the clarity of our paper and the novelty of our results. We will clarify the motivation of our work and make our notations consistent in a revision.

Motivation

Hardware accelerators—TPUs, for instance—derive much of their speed advantage from relying on low-precision arithmetic. In this paper, we analyze streaming PCA in a low-precision setting, where intermediate values stored in memory have a limited number of bits. This leads to quantization errors at every intermediate step [1-4]. We model this behavior as follows: (i) we quantize at points where any computed values are stored in memory, and (ii) fundamental arithmetic operations such as addition, multiplication, dot products, and matrix-vector products are quantized after being computed in high precision. Note that this is different from only quantizing the samples, as the latter allows computed quantities to be stored in high precision. When operating with $\beta$ bits, the overall complexity for streaming PCA (and that of the batched variant) is $O(nd\beta)$. We performed a benchmarking experiment to illustrate this below.

| Experiment | Runtime (seconds) |
| --- | --- |
| 64 bits | 0.0274 ± 0.00136 |
| 32 bits | 0.0260 ± 0.00225 |
| 16 bits | 0.000398 ± 0.0000235 |

NLQ is statistically much better than LQ

Non-linear quantization (NLQ) is a type of logarithmic scheme where adjacent points in the quantization grid are multiplicatively close, that is, the error due to quantization in NLQ is proportional to the value itself. On the other hand, the linear quantization (LQ) scheme has an additive error, which can be (i) lossy when the value being quantized is small (see [2]), and (ii) needlessly accurate if the values are large.

In Oja's algorithm with quantization, the learning rate is small, and the updates to the values are themselves small. If the error due to quantization is additive and of the same order as the value itself, the direction of the gradient can change significantly, whereas under multiplicative error the coordinates of the gradient direction remain correct up to multiplicative factors.
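To illustrate the difference, here is a minimal sketch of the two schemes with unbiased stochastic rounding between adjacent grid points; the parameters `delta`, `zeta`, and `delta0` are illustrative and not the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, lo, hi):
    """Round x to lo or hi with probabilities chosen so that E[result] = x."""
    p = (x - lo) / (hi - lo) if hi > lo else 0.0
    return hi if rng.random() < p else lo

def quantize_linear(x, delta):
    """LQ: uniformly spaced grid {k * delta}; the rounding error is additive, O(delta)."""
    lo = np.floor(x / delta) * delta
    return stochastic_round(x, lo, lo + delta)

def quantize_log(x, zeta, delta0):
    """NLQ: above delta0, adjacent grid points differ by a factor (1 + zeta),
    so the rounding error is proportional to |x|."""
    s, a = np.sign(x), abs(x)
    if a < delta0:
        return s * stochastic_round(a, 0.0, delta0)
    k = np.floor(np.log(a / delta0) / np.log(1.0 + zeta))
    lo = delta0 * (1.0 + zeta) ** k
    return s * stochastic_round(a, lo, lo * (1.0 + zeta))

# A small value (like an Oja update scaled by a tiny learning rate) is mostly
# rounded to 0 by LQ, but preserved to within ~5% relative error by NLQ.
x = 3e-4
print(quantize_linear(x, delta=1e-2), quantize_log(x, zeta=0.05, delta0=1e-6))
```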

Regarding Quantization Schemes

The quantization schemes we use have been widely used by other works [2-5]. As we discuss below, our analysis can be readily adapted to the following quantization schemes commonly used in low-precision computing.

The Floating Point Quantization (FPQ) is widely adopted and is a type of Logarithmic quantization scheme, where adjacent values in the quantization grid are multiplicatively close. Logarithmic schemes are of wide practical interest and are used in most modern programming languages such as C++, Python, and MATLAB, and are standardized for common use (such as the IEEE 754 floating point standard [6]). Another quantization scheme used for low-precision training is power-of-two quantization [1], where any value is rounded down to the nearest power of two. This is yet another type of logarithmic quantization. Logarithmic quantization is similar in principle to the non-linear quantization (NLQ) scheme we use in our work. In particular, Lemma A.9 in the appendix establishes a relationship between the distance of a quantized vector and the original one under NLQ. This Lemma applies to FPQ and more generally to most other logarithmic quantization schemes. We believe that our proofs can be modified to work with any logarithmic scheme, including FPQ.
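For concreteness, a one-function sketch (not from the paper) of the power-of-two scheme mentioned above, which rounds each magnitude down to the nearest power of two:

```python
import numpy as np

def round_down_pow2(x):
    """Round each nonzero entry's magnitude down to the nearest power of two, keeping the sign."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.floor(np.log2(np.abs(x[nz])))
    return out

print(round_down_pow2([0.3, -5.0, 0.0, 1.7]))   # [ 0.25 -4.    0.    1.  ]
```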

References

[1] Przewlocka-Rus, Dominika, et al. "Power-of-two quantization for low bitwidth and hardware compliant neural networks." arXiv preprint arXiv:2203.05025 (2022).

[2] Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "Training deep neural networks with low precision multiplications, 2014."

[3] Li, Zheng, and Christopher M. De Sa. "Dimension-free bounds for low-precision training." NeurIPS 2019.

[4] De Sa, Christopher, et al. "High-accuracy low-precision training."

[5] Das, Dipankar, et al. "Mixed precision training of convolutional neural networks using integer operations." ICLR 2018.

[6] Kahan, William. "IEEE standard 754 for binary floating-point arithmetic."

Comment

Thanks for the responses which address my main concern in a satisfactory manner. I agree with the authors that the motivation of their study is different from those quantizing the samples only before the statistical procedure. I will maintain my score.

Review (Rating: 4)

This paper presents a rigorous study of low-precision algorithms for streaming principal component analysis (PCA). The motivation stems from the increasing need to perform high-dimensional data processing under severe memory and bandwidth constraints. The authors propose variants of Oja’s algorithms that apply quantization to either gradients or data, and they analyze their convergence properties under both linear and logarithmic stochastic quantizations.

  1. Quantized Oja's Algorithm: The authors generalize Oja's method to the setting where the data or gradients are low-precision representations, and rigorously analyze its convergence to the leading eigenvector of the data covariance matrix.
  2. Theoretical Guarantees: They provide non-asymptotic convergence bounds under various quantization models (e.g., unbiased stochastic quantizers and one-bit compression), and establish trade-offs between quantization level, convergence rate, and variance.
  3. Lower Bound: A matching lower bound is provided for the streaming PCA problem under quantization constraints, characterizing the inherent difficulty of the task.
  4. Empirical Validation: Experiments on synthetic and real-world datasets validate the theoretical findings, demonstrating that low-precision algorithms can achieve nearly the same accuracy as full-precision ones at a fraction of the communication cost.

Strengths and Weaknesses

Strengths: This is a good algorithm for streaming PCA with low computational precision. The particular novelty lies in designing the algorithm (as in when to quantize). The probability boosting framework is standard and of course, it comes at the cost of additional space and memory. The proofs follow a standard decomposition of the projection error and matrix product concentration.

Weakness:

-- Some theoretical results depend heavily on simplifying assumptions about data distributions or quantization noise, and may not fully reflect practical scenarios (e.g., non-i.i.d. or adversarial data). See, for example, Robust Streaming PCA (NeurIPS 2022), which considers one such practical model for this problem.

-- The learning rate requires knowledge of spectral gap. This is unreasonable in practice.

Questions

-- Why do you need to quantize every step of the algorithm? Why can't you compute a final update and quantize that? In particular, what is the purpose of the quantization in Line 5 of Algorithm 1?

Limitations

-- Use of excessive hyperparameters whose values might not be known a priori.

Formatting Issues

NA

Author Response

Thank you for your helpful feedback and for recognizing the novelty in our work.

Assumptions about data distribution, quantization noise, and non-IID/adversarial data

Below, we explain that the assumptions we use are widely adopted in the streaming PCA and quantization literature. We also discuss how to adapt our techniques to non-IID settings.

Assumptions about data distributions

Assumption 1 states standard moment bounds used to analyze PCA in the stochastic setting. These assumptions are also used in citations [5-12] to derive near-optimal sample complexity bounds for Oja’s rule.

Quantization noise

The quantization schemes we use have been widely used by other works [1,2,3,4]. As we discuss below, our analysis can be readily adapted to the following quantization schemes commonly used in low-precision computing.

The Floating Point Quantization (FPQ) is widely adopted and is a type of Logarithmic quantization scheme, where adjacent values in the quantization grid are multiplicatively close. Logarithmic schemes are of wide practical interest and are used in most modern programming languages such as C++, Python, and MATLAB, and are standardized for common use (such as the IEEE 754 floating point standard [19]). Another quantization scheme used for low-precision training is power-of-two quantization [16], where any value x is rounded down to the nearest power of two. This is yet another type of logarithmic quantization. Logarithmic quantization is similar in principle to the non-linear quantization (NLQ) scheme we use in our work. In particular, Lemma A.9 in the appendix establishes a relationship between the distance of a quantized vector and the original one under NLQ. This Lemma applies to FPQ and more generally, to most other logarithmic quantization schemes. We believe that our proofs can be modified to work with any logarithmic scheme.

Non-IID or adversarial data

In the streaming PCA literature, we are aware of the following extensions to non-IID data.

Independent but non-identically distributed data: Robust Streaming PCA ([15]) handles data generated from covariance matrices that change over time (while remaining in the family of spiked models with $k$ spikes). Their analysis tracks the error accumulated from the mismatch of the varying covariance matrices with the last one in each batch. We believe that our rounding algorithm and analysis can be used to replace the part that considers matrix concentration.

Markovian datasets: [18] analyzes Oja’s algorithm for Markovian datasets. We believe that our analysis, which incorporates martingale noise resulting from quantization, can be easily combined with their analysis to obtain results for Markovian data.

Adversarial or contaminated data: The proof techniques here are vastly different ([13,14,17]) and would require substantially different tools. This presents an interesting direction for future work.

We believe that our work represents a first step in exploring quantized streaming PCA algorithms, which can be built upon to analyze various extensions.

Hyperparameters and learning rate

Only two hyperparameters: Parameters such as $\mathcal{V}$ and $\mathcal{M}$, and the quantization parameters $\delta, \alpha, \delta_0$, are required for the analysis only. Hence, these are not hyperparameters.

The only two hyperparameters required by Algorithm 1 are the batch size $b$ and the learning rate $\eta$. The optimal choice for both is determined by the spectral gap $\gamma = \lambda_1 - \lambda_2$.

Learning rate and eigengap: The expression for the learning rate $\eta = \frac{\alpha \log n}{n (\lambda_1 - \lambda_2)}$ is also present in other works on streaming PCA [5-12] to derive the statistically optimal sample complexity (up to logarithmic factors). In particular, Theorem 1 (lines 165-166) in our paper shows that the error in the output vector is

$$d e^{-n\eta (\lambda_1-\lambda_2)} + \frac{\eta \mathcal{V}}{\lambda_1-\lambda_2} + \max\left(\frac{b}{\alpha\log n}, 1\right) \kappa_1.$$

If a smaller learning rate $\eta$ is used (for example, by plugging in an upper bound $U$ on the eigengap $\lambda_1 - \lambda_2$ and using the crude choice $\eta = \frac{\alpha \log n}{n U}$ instead of $\eta = \frac{\alpha \log n}{n (\lambda_1 - \lambda_2)}$), the first term becomes larger, leading to a slightly larger sin-squared error. A similar argument shows that the batch size is also robust to the choice of an upper bound on the eigengap. We will clarify this further in the revised manuscript.

Quantization at every step of the algorithm

Our motivation was to study the theoretical convergence guarantees of PCA in low-precision settings, which improve runtimes in practice. In the low-precision setting, intermediate values stored in memory have a limited number of bits, leading to quantization errors at every intermediate step [1-4]. We model this behavior by (i) quantizing at points where computed values are stored in memory, and (ii) quantizing fundamental arithmetic operations such as addition, multiplication, dot products, and matrix-vector products after they are computed in high precision. In Line 5 of Algorithm 1, the intermediate summands are quantized because they are stored in memory. We quantize again after summing the values.
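A minimal sketch of this modeling convention, with a placeholder `quantize` routine standing in for LQ or NLQ; it is illustrative and not the paper's Algorithm 1.

```python
import numpy as np

def quantize(v, delta=2.0 ** -8):
    """Placeholder beta-bit quantizer (deterministic rounding for brevity);
    in the paper this would be LQ or NLQ with stochastic rounding."""
    return np.round(np.asarray(v, dtype=float) / delta) * delta

def quantized_oja_step(u, x, eta):
    """One low-precision Oja update: every intermediate written to memory is
    quantized after being computed in high precision."""
    inner = quantize(x @ u)                          # dot product, then quantize
    grad = quantize(eta * inner * x)                 # scaled gradient term, then quantize
    u_new = quantize(u + grad)                       # accumulated iterate, then quantize
    return quantize(u_new / np.linalg.norm(u_new))   # normalized iterate, then quantize

rng = np.random.default_rng(1)
u = quantize(rng.standard_normal(5))
u = quantize(u / np.linalg.norm(u))
u = quantized_oja_step(u, rng.standard_normal(5), eta=0.01)
```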

While in some settings it may be reasonable to quantize only at the final step, in this paper we analyze streaming PCA in a low-precision setting, where intermediate values stored in memory have a limited number of bits. When the quantization is applied only at the last step, the analysis is straightforward since the final error is equal to the sum of the error without quantization and the additional error $\delta$ introduced by quantizing the final result.

References

[1] Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "Training deep neural networks with low precision multiplications, 2014."

[2] Li, Zheng, and Christopher M. De Sa. "Dimension-free bounds for low-precision training." NeurIPS 2019.

[3] De Sa, Christopher, et al. "High-accuracy low-precision training."

[4] Das, Dipankar, et al. "Mixed precision training of convolutional neural networks using integer operations." ICLR 2018.

[5] Moritz Hardt and Eric Price. The noisy power method: a meta algorithm with applications. NeurIPS 2014.

[6] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. ICML 2015.

[7] Ohad Shamir. Convergence of stochastic gradient descent for PCA. ICML 2016.

[8] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. ICML 2016.

[9] Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global gap-free and near-optimal rate. FOCS, 2017.

[10] De Huang, Jonathan Niles-Weed, and Rachel Ward. Streaming k-PCA: Efficient guarantees for Oja’s algorithm, beyond rank-one updates. COLT, 2021.

[11] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. COLT, 2016.

[12] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. COLT, 2016

[13] Diakonikolas, Ilias, et al. "Nearly-linear time and streaming algorithms for outlier-robust PCA." ICML 2023.

[14] Cheng, Yu, Ilias Diakonikolas, and Rong Ge. "High-dimensional robust mean estimation in nearly-linear time." SODA 2019.

[15] Bienstock, Daniel, et al. "Robust streaming PCA." NeurIPS 2022.

[16] Przewlocka-Rus, D., Sarwar, S.S., Sumbul, H.E., Li, Y. and De Salvo, B., 2022. Power-of-two quantization for low bitwidth and hardware compliant neural networks. arXiv preprint arXiv:2203.05025.

[17] Price, E. and Xun, Z., 2024, October. Spectral guarantees for adversarial streaming PCA. FOCS 2024.

[18] Kumar, S. and Sarkar, P.. Streaming PCA for Markovian data. NeurIPS 2023.

[19] Kahan, William. "IEEE standard 754 for binary floating-point arithmetic."

Comment

Thank you for your reply. I will maintain my score.

Review (Rating: 4)

The paper gives a quantized version of Oja's algorithm for streaming PCA under both linear and logarithmic quantization schemes, along with randomized rounding. They give lower bounds on the quantization error for both quantization schemes. The batched versions of their algorithms almost match the lower bounds. The nonlinear (logarithmic) quantization scheme can achieve errors that are independent of the dimension and is hence very useful in high dimensions. The authors also give a way to boost the success probability of their algorithm. They validate their guarantees with experiments on synthetic data.

Strengths and Weaknesses

Strengths:

  1. PCA is a very important problem in machine learning. Addressing the problem of finding principal components in the low-precision scenario is an important and interesting problem.
  2. Lower bounds on quantization errors for linear and nonlinear quantization schemes, and algorithms that almost match the lower bounds.
  3. The algorithm is easy to understand and appears simple to implement too.
  4. I did not check the proofs in detail; however, the results appear correct and sound.

Weaknesses and Suggestions:

  1. Overall, I did not like the writing of the paper. Better notational conventions could be used. Also, there are typos: for example, I think there is a misplaced or additional bracket in Eq. (3). Please correct me if I am wrong. In all equations involving a product over terms, e.g. Eq. (7), why is the lower limit of the product $b$ and the upper limit $1$? Again, it appears weird to me unless I missed something.
  2. Experiments are only performed on a small set of synthetic data.

Questions

See Weaknesses

Limitations

Yes

Final Justification

I have modified my score based on the rebuttal. But I would urge the authors to improve the presentation and to explicitly explain the multiplications involved in order to remove confusion.

Formatting Issues

See weaknesses

Author Response

Thank you for your valuable feedback. We will fix all typographical errors in the revision; in particular, the second value in the set in equation (3) should be $-\delta \cdot (2^{\beta-1}-1)$ instead of $-\delta \cdot ((2^{\beta-1}-1)+1)$.

Lower and upper limits in matrix product

Each iteration of Oja's algorithm has a gradient update $u_{t+1} = (I + \eta X_{t+1} X_{t+1}^{\top}) u_t$ followed by the normalization $u_{t+1} = u_{t+1}/\lVert u_{t+1} \rVert$. As a result, the final vector $u_T$ can be unwound into the (normalized) product of $n$ matrices successively left-multiplied to the initial vector $u_0$. The crucial observation here is that the leftmost matrix in the product uses the last data point, while the rightmost matrix corresponds to the first data point. This is why the lower and upper limits go backward. We will clarify this notation in the revision.
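A small numerical check of this unwinding (normalization is omitted so the two forms match exactly; it only rescales the vector):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b, eta = 4, 6, 0.1
X = rng.standard_normal((b, d))        # data points X_1, ..., X_b (rows)
u0 = rng.standard_normal(d)

# Iterative form: u_t = (I + eta * X_t X_t^T) u_{t-1}
u = u0.copy()
for t in range(b):
    u = u + eta * X[t] * (X[t] @ u)

# Unwound form: each new factor multiplies from the left, so the product reads
# (I + eta X_b X_b^T) ... (I + eta X_1 X_1^T) u_0, i.e. the index runs from b down to 1.
P = np.eye(d)
for t in range(b):
    P = (np.eye(d) + eta * np.outer(X[t], X[t])) @ P

print(np.allclose(u, P @ u0))          # True
```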

Experiments

We conducted additional experiments on both synthetic and real-world datasets. For real-world datasets, we consider the top eigenvector of the sample covariance matrix of the data as the ground truth. All experiments below show the $\sin^2$ error with respect to the ground-truth eigenvector. For all three datasets, we observe that the trends of these experiments are similar to those presented in our paper. In particular, we observe that, in general,

  1. All quantized methods perform better as the bit budget increases.
  2. Batched methods outperform their analogues with per-iterate updates.
  3. NLQ outperforms LQ for a given bit budget.

Additional Synthetic Data

Our first experiment considers covariance matrices generated using a random orthonormal basis $Q$, chosen by performing a QR decomposition of $Z$ such that $Z_{i,j} \sim N(0,1)$, with eigenvalues decaying as $\lambda_{i} := i^{-2}$. We use $n=10,000$ and $d=100$.

| Method | 4 bits | 6 bits | 8 bits | 10 bits | 12 bits |
| --- | --- | --- | --- | --- | --- |
| Standard Oja | 0.004 ± 5.92e-17 | 0.004 ± 3.63e-17 | 0.004 ± 2.22e-17 | 0.004 ± 5.97e-17 | 0.004 ± 2.96e-17 |
| Standard Oja + LQ | 0.960 ± 0.017 | 0.454 ± 0.019 | 0.109 ± 0.005 | 0.016 ± 7.16e-04 | 0.005 ± 1.55e-04 |
| Standard Oja + NLQ | 0.297 ± 0.010 | 0.117 ± 0.005 | 0.030 ± 0.001 | 0.010 ± 4.98e-04 | 0.006 ± 1.36e-04 |
| Batched Oja | 0.006 ± 0.004 | 0.002 ± 0.002 | 0.001 ± 6.13e-04 | 0 ± 1.99e-06 | 0 ± 7.21e-08 |
| Batched Oja + LQ | 0.572 ± 0.013 | 0.068 ± 0.002 | 0.005 ± 1.55e-04 | 0 ± 2.03e-05 | 0 ± 3.14e-06 |
| Batched Oja + NLQ | 0.033 ± 0.001 | 0.005 ± 2.41e-04 | 0.001 ± 3.47e-05 | 0 ± 1.58e-05 | 0 ± 5.01e-06 |

Image data

We use the MNIST dataset, which comprises grayscale images of handwritten digits (0 through 9). Here, n = 60,000, d = 784, with each image normalized to a 28 × 28 pixel resolution.

| Method | 4 bits | 6 bits | 8 bits | 10 bits | 12 bits |
| --- | --- | --- | --- | --- | --- |
| Standard Oja | 0.063 ± 1.22e-10 | 0.063 ± 2.13e-10 | 0.063 ± 1.94e-10 | 0.063 ± 1.90e-10 | 0.063 ± 2.66e-10 |
| Standard Oja + LQ | 0.997 ± 0.002 | 0.93 ± 0.009 | 0.421 ± 0.008 | 0.143 ± 0.002 | 0.07 ± 0.001 |
| Standard Oja + NLQ | 0.511 ± 0.030 | 0.226 ± 0.008 | 0.110 ± 0.003 | 0.072 ± 9.98e-04 | 0.066 ± 5.31e-04 |
| Batched Oja | 0.372 ± 0.106 | 0.268 ± 0.081 | 0.270 ± 0.095 | 0.021 ± 0.026 | 0.020 ± 0.012 |
| Batched Oja + LQ | 0.887 ± 0.010 | 0.273 ± 0.007 | 0.050 ± 8.32e-04 | 0.030 ± 2.76e-04 | 0.030 ± 2.15e-04 |
| Batched Oja + NLQ | 0.067 ± 0.056 | 0.034 ± 5.57e-04 | 0.030 ± 2.62e-04 | 0.029 ± 1.49e-04 | 0.031 ± 2.11e-04 |

Time series data

The Human Activity Recognition (HAR) dataset contains smartphone sensor readings from 30 subjects performing daily activities (walking, sitting, standing, etc.). Each data instance is a 2.56-second window of inertial sensor signals represented as a feature vector. We have (n = 7,352, d = 561), and the data constitutes a time series.

| Method | 4 bits | 6 bits | 8 bits | 10 bits | 12 bits |
| --- | --- | --- | --- | --- | --- |
| Standard Oja | 0.013 ± 3.19e-10 | 0.013 ± 1.56e-09 | 0.013 ± 4.12e-10 | 0.013 ± 4.31e-10 | 0.013 ± 3.74e-10 |
| Standard Oja + LQ | 0.998 ± 2.33e-04 | 0.871 ± 0.016 | 0.305 ± 0.003 | 0.087 ± 0.002 | 0.028 ± 5.94e-04 |
| Standard Oja + NLQ | 0.455 ± 0.014 | 0.157 ± 0.004 | 0.068 ± 0.001 | 0.033 ± 9.53e-04 | 0.020 ± 3.96e-04 |
| Batched Oja | 0.822 ± 0.025 | 0.670 ± 0.075 | 0.053 ± 0.053 | 0.063 ± 0.005 | 0.005 ± 0.010 |
| Batched Oja + LQ | 0.943 ± 0.005 | 0.405 ± 0.006 | 0.041 ± 0.001 | 0.009 ± 1.05e-04 | 0.006 ± 3.60e-05 |
| Batched Oja + NLQ | 0.047 ± 5.58e-04 | 0.013 ± 1.86e-04 | 0.007 ± 8.97e-05 | 0.007 ± 2.64e-05 | 0.006 ± 1.33e-05 |

Comment

Thank you for the rebuttal. I have modified my score.

Review (Rating: 5)

The authors establish new lower bounds on linear and non-linear (logarithmic) quantized Oja's algorithm for streaming PCA. The bound is sharp (up to logarithmic factors) when batching is used. The error for log quantization is dimension-free, while the error for linear quantization scales with the dimension of the data points. The authors conduct an empirical analysis over varying sample size, data dimension, and quantization bits, along with guidance for selecting optimal quantization hyperparameters when given a bit budget.

Strengths and Weaknesses

Strengths:

  • A novel lower bound for a very classical algorithm (Oja + streaming PCA) with strong applications to machine learning. Indeed, it surprises me that such a result for quantized streaming PCA using Oja's rule had not already been established in literature. Bounds for both linear and logarithmic are provided, which enables flexibility on quantization procedure.

  • Very easy to read -- the review of streaming PCA, Oja's algorithm, and quantization + stochastic rounding will be comprehensible for those outside of numerical optimization.

  • The lower bound is sharp (up to logarithmic factors) when batching is used, demonstrating the utility of the bound.

  • Explicit determination of optimal quantization parameters under a fixed budget makes these bounds practically useful as well.

  • Reduction of linear dependence on dataset size to logarithmic dependence via batching is a nice result.

Weaknesses

  • Results do not apply to extraction of top-k components.

  • I find the proof techniques in the manuscript do not necessarily reveal enough intuition for the core results. Deeper proof sketches of the core theorems would be helpful.

Questions

  1. What are the core challenges for extension to top-k? Namely, what prevents analysis over the deflated covariance matrix (once the previous component is found)?

  2. Is assumption 1 unreasonable? I'm not asking because I believe it's unreasonable, I just do not know if it's used in other similar results besides [1]. For example, streaming convergence for the power method with momentum in [2] does not seem to rely on assumptions regarding the distributions of entries in the X_i.

[1] Kumar, S., & Sarkar, P. (2024). Oja's Algorithm for Streaming Sparse PCA. arXiv preprint arXiv:2402.07240.

[2] Xu, P., He, B., De Sa, C., Mitliagkas, I., & Re, C. (2018, March). Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics (pp. 58-67). PMLR.

Limitations

Yes.

Final Justification

I am maintaining my score. I find this bound interesting, as this result seemingly does not already exist in the literature for such a classical method (Oja's method). At least one reviewer has complained that some knowledge of the spectral gap is required, but observe that this is the case in similar literature [1,2]. Spectral-gap-free knowledge which results in accelerated convergence seems notoriously difficult to acquire.

[1] Xu, Peng, et al. "Accelerated stochastic power iteration." International Conference on Artificial Intelligence and Statistics. PMLR, 2018.

[2] Rabbani, Tahseen, et al. "Practical and fast momentum-based power methods." Mathematical and Scientific Machine Learning. PMLR, 2022.

Formatting Issues

None

Author Response

Thank you for your valuable feedback on the readability of our paper and the novelty of our lower bounds. We agree that providing more intuition makes the core ideas more accessible, and we will expand the proof sketches in a revised version of the paper.

Regarding the top $k$ components

We believe our techniques can be extended to the top $k$ components via deflation-based approaches (see [9,10]). Currently, our algorithm assumes the gapped setting where $\lambda_1 > \lambda_2$. Extending it to the top $k$ components may require the assumption and knowledge of all eigengaps $\lambda_1 - \lambda_2, \lambda_2 - \lambda_3, \dots, \lambda_{k}-\lambda_{k+1}$.

Below, we discuss what we believe is a barrier to extending our analysis to $k$-PCA with the deflation-based method, as requested by the reviewer. However, we want to point out that we believe that $k$-PCA methods that maintain $k$ vectors and perform a QR decomposition at every step (see e.g. [5, 14]) would provide sharper bounds with less stringent assumptions. This is part of ongoing work.

Deflation based methods

A standard deflation approach splits the data stream into $k$ chunks. For each chunk $C_i$, construct the projection matrix $P_{i-1}$ that removes the first $i-1$ estimated eigenvectors, and then estimate the leading eigenvector of $P_{i-1} \left(\sum_{j\in C_i} X_j X_j^{\top}\right) P_{i-1}$. Because Oja's updates use only matrix–vector products, the method remains fully streaming with $O(kd)$ memory. For quantized streaming PCA, one must also consider the accumulation of quantization errors from each eigenvector estimate. We expect that the martingale-based arguments should hold since the new datapoints become $P_{i-1}X_j$ for some fixed matrix $P_{i-1}$ (independent of $X_j$ in chunk $C_i$). However, the main barrier arises from the fact that one needs to ensure that the eigengap $\lambda_i-\lambda_{i+1}$ is larger than the errors accumulated so far, which are $O(i\Delta)$, where $\Delta$ is the maximum $\sin$ error over the $i$ chunks.
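A minimal sketch of this deflation scheme, assuming a hypothetical `oja_top1` routine that returns a leading-eigenvector estimate from one chunk; quantization and the error-accumulation issue described above are omitted.

```python
import numpy as np

def oja_top1(chunk, eta=0.01, seed=0):
    """Placeholder single-component Oja pass over one chunk (quantization omitted)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(chunk.shape[1])
    u /= np.linalg.norm(u)
    for x in chunk:
        u = u + eta * x * (x @ u)
        u /= np.linalg.norm(u)
    return u

def deflated_streaming_kpca(X, k):
    """Split the stream into k chunks; on chunk i, apply the projection P_{i-1}
    removing the first i-1 estimates (treated as approximately orthonormal),
    then estimate the leading eigenvector of the projected chunk."""
    V = []                                        # only k estimates are stored: O(kd) memory
    for chunk in np.array_split(X, k):
        if V:
            Vm = np.column_stack(V)
            chunk = chunk - (chunk @ Vm) @ Vm.T   # apply P_{i-1} without forming a d x d matrix
        V.append(oja_top1(chunk, seed=len(V)))
    return np.column_stack(V)
```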

Regarding assumption 1

Assumption 1 states standard moment bounds used to analyze PCA in the stochastic setting. These assumptions are also used in citations [1-7] to derive near-optimal sample complexity bounds for Oja’s rule. We note that while the current analysis assumes a bounded set, this is for ease of analysis, and it can be generalized to unbounded subgaussian data (see [11,12,13]).

For ease of exposition, we will provide an easy argument that changes the algorithm by considering "truncated" datapoints $Y_i := X_i \mathbb{1}(\|X_i\|^{2} \leq \alpha_n)$, where the truncation parameter $\alpha_n$ is set to $c\log(n)$. Thus we replace any datapoint whose squared norm exceeds the truncation parameter by zero. Standard truncation arguments can then show that the sin-squared error of the output of this algorithm with respect to the original principal eigenvector has the same theoretical guarantee.

Equation (1) in the section titled “Stochastic PCA” in Accelerated Stochastic Power Iteration (citation [8]) states their assumptions; in fact, [8] uses a stronger assumption than ours. The matrix $A$ in their paper is the covariance matrix $\Sigma$, the matrices $\tilde{A}_t = X_t X_t^{\top}$, and the almost sure bound $r = \mathcal{M}$. One difference is that [8] uses the bound $E[\lVert XX^{\top} - \Sigma \rVert^2] \le \sigma^2$, while we use the weaker assumption $\lVert E[(XX^{\top} - \Sigma)^2] \rVert \le \mathcal{V}$; note that $\lVert E[(XX^{\top} - \Sigma)^2] \rVert \le E[\lVert XX^{\top} - \Sigma \rVert^2]$ by Jensen's inequality.

References

[1] Moritz Hardt and Eric Price. The noisy power method: a meta algorithm with applications. NeurIPS 2014.

[2] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. ICML 2015.

[3] Ohad Shamir. Convergence of stochastic gradient descent for PCA. ICML 2016.

[4] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. ICML 2016.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global gap-free and near-optimal rate. FOCS, 2017.

[6] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. COLT, 2016.

[7] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. COLT, 2016

[8] Xu, P., He, B., De Sa, C., Mitliagkas, I., & Re, C. (2018, March). Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics (pp. 58-67). PMLR.

[9] Jambulapati, Arun, et al. "Black-box k-to-1-PCA reductions: Theory and applications." COLT 2024

[10] Mackey, Lester. "Deflation methods for sparse PCA." Advances in neural information processing systems 21 (2008).

[11] Lunde, Robert, et al "Bootstrapping the error of Oja's algorithm." NeurIPS 2022.

[12] Kumar, S., & Sarkar, P. (2024). Oja's Algorithm for Streaming Sparse PCA. arXiv preprint arXiv:2402.07240.

[13] Liang, Xin. "On the optimality of the Oja’s algorithm for online PCA." Statistics and Computing 33.3 (2023): 62.

[14] De Huang, Jonathan Niles-Weed, and Rachel Ward. Streaming k-PCA: Efficient guarantees for Oja’s algorithm, beyond rank-one updates. COLT, 2021.

Comment

Thanks to the authors for their thorough response, especially regarding Assumption 1. I shall maintain my score.

Final Decision

This work studies quantized Oja's algorithm for streaming PCA, showing upper and matching lower bounds for linear and logarithmic quantization.

All reviewers agree that the theoretical contributions of the paper are good. There were some concerns raised about the need to know the spectral gap for the algorithm. This has been somewhat addressed by the authors by pointing out that knowing the spectral gap is somewhat of a standard assumption in previous streaming PCA works. Further, some of the reviewers have expressed concerns about the quality of the empirical evaluation. The authors provide an extended empirical evaluation in the rebuttal which has addressed this concern. Several reviewers had issues with the presentation of the paper. I recommend that the authors include the extra experimental evaluation and implement the suggestions on improving the presentation of the paper.