On Sampling Strategies for Spectral Model Sharding
We propose new sampling strategies and techniques for local training of low-rank factorized neural networks for the scenario of federated learning on heterogeneous devices.
Abstract
Reviews and Discussion
This paper addresses the problem of heterogeneous clients in federated learning, where each client may have different constraints and capabilities. The authors propose two sampling strategies for spectral model sharding, which involves partitioning the model parameters into low-rank matrices based on singular value decomposition. The first strategy provides unbiased estimators of the original weights, while the second strategy aims to minimize the squared approximation error. The authors discuss how these estimators can be incorporated into the federated learning loop and provide practical considerations for local training. Empirical results show that both methods improve performance on various datasets. Overall, the paper presents novel sampling strategies for heterogeneous clients in federated learning and demonstrates their effectiveness in improving performance.
Strengths
Strengths of this paper include the innovative approach to develop sampling strategies for spectral model sharding in federated learning. The authors provide two novel strategies derived from specific optimization problems, which aim to minimize the expected squared approximation error. These strategies are presented in closed form, making them easily implementable in practical settings. Additionally, the proposed methods do not require additional calculations on potentially weak client devices, enhancing their efficiency for on-device training. The incorporation of unbiased estimators and consideration of all client-specific sub-models distinguish this work from previous heuristic sampling designs, offering potential improvements in performance on various datasets.
Weaknesses
The manuscript presents a natural and well-articulated approach; however, I have some concerns:
- The reliance on the singular value decomposition (SVD) as a core component of the proposed strategies could raise concerns, especially considering the substantial computational cost associated with SVD. Furthermore, the need for repeated SVD decompositions after each communication round could limit the practical application of the method, particularly on large-scale models. It would be valuable for the authors to explore ways to mitigate the computational cost, such as utilizing approximations or reducing the frequency of SVD decompositions, to enhance the feasibility of the approach, especially for larger and more complex models.
- The absence of a comparison with the FedHM method based on random sampling in the experimental section is noteworthy. While the authors state that their focus is on analyzing the relative performance of the proposed spectral sharding strategies, the lack of a comparison with FedHM limits a comprehensive understanding of the strengths and weaknesses of the proposed approaches. Including such a comparison would provide deeper insights into the performance of the proposed strategies and their positioning relative to existing methods.
- Typographical errors: the manuscript contains minor typographical errors, such as in line 304, "We leave this fo future work" -> "We leave this for future work".
Questions
- Considering that existing methods such as PriSM and FedHM have only been tested on models with up to tens of millions of parameters, there is a concern regarding the computational cost of singular value decomposition (SVD) and its applicability to larger-scale models. It would therefore be valuable to explore whether the proposed strategies in this paper can be extended to larger models, potentially utilizing approximations or other techniques to address the computational challenges associated with SVD.
- How does the accuracy of the proposed methods compare to FedHM? Given the absence of a direct comparison with FedHM among the baselines presented in the manuscript, such an evaluation would provide valuable insight into the effectiveness of the proposed strategies relative to an existing method.
Limitations
N/A
We thank the reviewer for the time and effort invested in our work. We are happy to take the feedback into account and improve our presentation accordingly. Below we address the questions raised in the review.
Comparison with FedHM. According to the paper [1], FedHM is a method also based on spectral sharding, as it uses the top-n singular values and singular vectors of the weight matrices of the factorized layers. This corresponds to the Top-n strategy in our paper. However, the FedHM method includes two additional details. First, the initial layers of the network are never decomposed, and the number of such layers is treated as a hyperparameter. Second, during the aggregation step on the server side, instead of averaging the updated singular vectors, the full weight matrices are re-materialized and then averaged.
In our work, we focused on exploring the sampling strategies and, for simplicity, followed other works [2, 3] in leaving only the very first and the very last (i.e., classification head) layers unfactorized. To test the strategies with this aggregation method, we conducted several experiments on CIFAR-100 with a keep ratio of . Their results are reported in Tab. R1 in the attached pdf. Interestingly, the strategies benefit from this kind of averaging, while the Collective strategy still outperforms Top-n. Arguably, the performance of the Unbiased strategy is explained by the excessively large auxiliary multipliers used during the weight matrix re-materialization.
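For concreteness, here is a minimal sketch (function name hypothetical, NumPy assumed, uniform client weighting) of the re-materialize-then-average aggregation described above; the actual FedHM implementation may differ in its details.

```python
import numpy as np

def aggregate_rematerialized(client_factors):
    """Sketch of FedHM-style server aggregation: each client's updated
    factors are re-materialized into a full weight matrix and the full
    matrices are then averaged (uniform weights assumed here).
    client_factors: list of (U_k, s_k, Vt_k) tuples from the round's clients."""
    mats = [(U * s) @ Vt for U, s, Vt in client_factors]  # W_k = U_k diag(s_k) V_k^T
    return sum(mats) / len(mats)
```

The alternative used in the main text (PriSM-like aggregation) would instead average the factors themselves.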
Larger models. We admit that demonstrating scalability is highly desirable for any federated learning algorithm. However, due to our limited computational resources, we focused on experiments using datasets and models of the same scale as our fairly recent baselines. We hope to apply our approach to large-scale tasks in the future.
Computational cost of SVD. Indeed, singular value decomposition may appear to be a limiting factor for methods based on spectral sharding. However, it is important to note that this decomposition takes place solely on the server side and therefore does not impose any computational burden on low-capacity clients. In addition, in our method the SVD for different layers can be performed in parallel, as the layers are independent of each other. For example, in the case of transformer models, which typically consist of homogeneous blocks, a batched SVD implementation can already provide a speed-up. Also, recent GPU solvers for the SVD such as QDWH-SVD [4, 5] potentially allow for an efficient implementation of our algorithm even for large models.
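To illustrate the parallelism argument, a hedged sketch (PyTorch assumed, shapes hypothetical) of batching the per-layer SVDs of homogeneous transformer blocks into a single call:

```python
import torch

# Hypothetical example: 12 homogeneous transformer blocks, each with a
# 768x768 projection matrix, stacked into one batch on the server.
weights = torch.randn(12, 768, 768)

# torch.linalg.svd is batched over the leading dimension, so all layers
# are decomposed in a single (GPU-parallelizable) call.
U, S, Vh = torch.linalg.svd(weights, full_matrices=False)
# U: (12, 768, 768), S: (12, 768), Vh: (12, 768, 768)
```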
As a proof of concept, we simulated the training of a larger model with the same architectures on CIFAR-100 by decreasing the keep ratio to  and applying the fast randomized SVD algorithm [6], which estimates only half of the singular values and vectors. In this experiment we also used the FedHM-like weight aggregation scheme on the server side, because otherwise the PriSM-like scheme (used in the main text) would lead to a low-rank server model. As reported in Tab. R2, in this setting our strategies are still able to learn. In addition, we experimented with an ODE-based SVD approximation [7] for the updated weight matrices, but the results for large weight matrices were unsatisfactory.
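A sketch of the kind of truncated randomized SVD [6] referred to above, assuming PyTorch's svd_lowrank (which implements the Halko et al. scheme); only the leading q components are estimated:

```python
import torch

# Hypothetical shapes; torch.svd_lowrank follows the randomized scheme of
# Halko et al. [6] and estimates only the leading q singular triplets.
W = torch.randn(1024, 1024)
q = W.shape[1] // 2                      # estimate only half of the spectrum
U, S, V = torch.svd_lowrank(W, q=q, niter=2)
W_approx = U @ torch.diag(S) @ V.T       # rank-q approximation of W
```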
References.
[1] Yao et al. FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization. 2021.
[2] Khodak et al. Initialization and regularization of factorized neural layers. In ICLR, 2021.
[3] Niu et al. Overcoming Resource Constraints in Federated Learning: Large Models Can Be Trained with only Weak Clients. TMLR, 2023.
[4] Nakatsukasa and Higham. Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD. SIAM Journal on Scientific Computing, 2013.
[5] Sukkari et al. A High Performance QDWH-SVD Solver Using Hardware Accelerators. TOMS, 2016.
[6] Halko et al. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. 2009.
[7] Dieci and Eirola. On Smooth Decompositions of Matrices. 1999.
I thank the authors for the responses. I have taken other reviewers' comments and the authors' rebuttals into account. I will update my rating.
To handle system heterogeneity in federated learning, spectral model sharding partitions model parameters into low-rank matrices using singular value decomposition (SVD). As in prior work, e.g., PriSM, the SVD of the weight matrices of affine layers is computed and only randomly sampled terms from this decomposition are locally updated at each client (different terms for each client, based on its system capabilities), boosting on-device training efficiency. This study introduces two sampling strategies for sharding: one produces unbiased estimators of the original weights, and the other minimizes the squared approximation error. Both strategies can be incorporated into the federated learning loop, and the authors show improvements on various FL datasets. The solutions have closed-form optima that can be computed on the server side, avoiding additional computations on client devices.
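As a rough illustration of this mechanism (all names hypothetical, NumPy assumed; the uniform sampler below is only a stand-in for the strategies studied in the paper):

```python
import numpy as np

def make_client_shard(W, keep_ratio, rng):
    """Illustrative sketch: decompose W = U diag(s) V^T and keep a sampled
    subset of rank-1 terms for one client.  The sampler here is plain
    uniform sampling without replacement -- the paper's contribution is
    precisely the choice of smarter sampling probabilities."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(keep_ratio * len(s)))
    idx = rng.choice(len(s), size=r, replace=False)
    return U[:, idx], s[idx], Vt[idx, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
U_c, s_c, Vt_c = make_client_shard(W, keep_ratio=0.25, rng=rng)
W_approx = (U_c * s_c) @ Vt_c            # the client's low-rank view of W
```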
Strengths
- Previously proposed sampling methods (e.g., sampling based on eigenvalues) make it hard to analyze the marginal distribution over the inclusion probabilities, and thus the approximation error of the resulting matrix, whereas the strategies explored in this work are more amenable to analysis.
- The preliminaries section presents a crisp and clear overview of spectral model sharding, though maybe some notational overload (e.g., definition of ) could have been avoided.
- The discussion on local training is particularly informative, and is useful for practitioners to understand how different choices of the sampling strategy and corresponding auxiliary weights can affect local training dynamics. The proposed solutions to control the effective learning rate also seem reasonable. This is also empirically verified in Table 3.
- The framework presents a way to propose and analyze new variants of the marginal inclusion probabilities, which can easily be plugged into Conditional Poisson schemes that allow for more interactions between different singular vectors.
- The empirical results in Table 1 suggest that when 60% of clients have keep ratio  and 40% of clients have , the unbiased and collective strategies significantly outperform PriSM, top-n (and variants). However, the gains are marginal for a fixed keep ratio. Nevertheless, it is nice that the authors report confidence intervals, which suggest that the gains are at least statistically significant.
- The proposed modification of PriSM (PriSM + Wallenius + ClipLR) reduces exploration during training, improves exploitation, and thus empirically improves the convergence of FedAvg training.
Weaknesses
- The paper only analyzes the pre-client update approximation errors, i.e., the error induced by spectral sharding locally at each client. It would be nice if the authors can comment on how the post-client update average deviates from the weight matrix obtained by running one round of FedAvg without spectral sharding (for the proposed sampling strategies).
- While the authors analyze the Frobenius norm of the discrepancy, a tighter norm to analyze would have been the spectral norm. It is unclear if and how the optimal solutions (and findings) in the Frobenius norm transfer.
- For the collective estimator, the paper only provides a closed form solution when all clients in the per-round average have the same constraint . Can the authors elaborate on the case where there is heterogeneity within each round of the update? This seems limiting in practical settings.
- It is unclear why the collective estimator is a natural class to analyze in terms of estimators that optimally trade off bias and variance.
Note: At the outset, this work seemed like a small modification of PriSM, but reading it more carefully, I feel the paper clearly demonstrates that the choice of sampling in spectral model sharding can significantly affect convergence and performance in FL. While the theoretical results on the approximations are nice, there is definitely more room for analyzing convergence, post-gradient-update performance, etc., for different types of sampling strategies. At the same time, the empirical results would also be more impactful if extended to larger models (> 1B parameters, e.g., in LLMs) and settings with both system and statistical heterogeneity. Thus, I am giving a score of 5 for now, but will definitely increase it if the authors can respond to some of my concerns/clarifications in the rebuttal period.
Questions
- For the unbiased estimator, the sum of inclusion RVs is a Poisson Binomial distribution with variance that scales with  for a matrix of size . Does this not violate the computational constraints on each user? What if the user can only handle a fixed rank of ? Or is the approach limited by the fact that the computational constraints for each user need to scale with the model size?
Limitations
The authors mention some limitations in Appendix A, though the limitations section could use better signposting.
We thank the reviewer for the time and effort invested in our work. We are happy to take the feedback into account and improve our presentation accordingly. Below we address the questions raised in the review.
Post-client update. We appreciate this valuable suggestion! To analyze the post-gradient updates, we followed the optimization path of the server model under each strategy with  communication rounds and calculated the cosine similarity between the server model update under that strategy and under the FedAvg method with a full-size model on each client. Subsequently, the model weights were updated according to the selected strategy. The results of these experiments are provided in Fig. R2 in the attached pdf. As shown, there is a notable connection between the relative performance of each strategy and its “alignment” with the FedAvg updates. The deterministic Top-n strategy lags behind its randomized counterparts in terms of accuracy in the considered setting (see Tab. 1 of the paper), and a similar observation is made for its update directions.
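For clarity, a minimal sketch (helper name hypothetical, PyTorch assumed) of the alignment measure used in this experiment:

```python
import torch

def update_alignment(params_before, params_strategy, params_fedavg):
    """Cosine similarity between the server update produced by a sharding
    strategy and the update produced by full-model FedAvg, both measured
    from the same starting point and flattened into a single vector."""
    flat = lambda ps: torch.cat([p.detach().reshape(-1) for p in ps])
    d_strategy = flat(params_strategy) - flat(params_before)
    d_fedavg = flat(params_fedavg) - flat(params_before)
    return torch.nn.functional.cosine_similarity(d_strategy, d_fedavg, dim=0)
```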
Tighter error bound. While the Frobenius norm is convenient for analysis and allows for closed-form solutions, it indeed provides a relatively loose bound for the spectral norm. For this reason, in Appendix A of the submitted paper, we present the results from experiments using Schatten norms which offer a tighter bound. However, as we demonstrated, the obtained results did not significantly differ from those described in the main text.
System heterogeneity. When heterogeneous clients participate in the same round, we follow prior approaches [1] and cluster the clients based on their capacity. After that, the sampling distribution of singular values is computed within each cluster independently. We used this methodology to conduct experiments for the third, “heterogeneous” block of Tab. 1 in the submitted paper. We agree that this approach is ad-hoc, and aim to develop a more grounded solution in future work.
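A minimal sketch (names hypothetical) of the per-round grouping step described above:

```python
from collections import defaultdict

def group_by_capacity(round_clients):
    """round_clients: iterable of (client_id, keep_ratio) pairs for one round.
    Returns a mapping keep_ratio -> client ids; the singular-value sampling
    distribution is then computed independently within each group."""
    groups = defaultdict(list)
    for client_id, keep_ratio in round_clients:
        groups[keep_ratio].append(client_id)
    return dict(groups)
```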
Motivation of the collective estimator. We agree with the reviewer that the collective estimator may not be the most natural choice from the perspective of the bias-variance trade-off. In the paper, we mentioned this trade-off (i) to motivate the unbiased strategy and (ii) to explain why the unbiased strategy might have limitations such as large variance. Our motivation for the collective estimator was different. We aimed to generalize the well-known property of truncated SVD (Eckart–Young–Mirsky theorem) by incorporating the knowledge about the number of participating clients. Therefore, we formulated the optimization problem to find the estimator with the least expected Frobenius discrepancy.
Larger models. We admit that demonstrating scalability is highly desirable for any federated learning algorithm. However, due to our limited computational resources, we focused on experiments using datasets and models of the same scale as our fairly recent baselines. We hope to apply our approach to large-scale tasks in the future.
Sum of inclusion RVs. For both the unbiased and collective estimators, the sum of inclusion random variables is not random but deterministic and always equal to . Note that in both Theorems 1 and 2 we derive the probabilities for sampling without replacement with a fixed sample size, see Eq. (19) and Eq. (25) respectively. Additionally, all the sampling methods we use guarantee that exactly  singular values are sampled for a client with keep ratio . For that reason, the variables are not mutually independent and do not lead to a Poisson Binomial distribution.
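To make the deterministic sample size concrete, here is a sketch of one standard fixed-size scheme with prescribed marginal inclusion probabilities (Madow's systematic sampling); it is purely illustrative and not necessarily the sampler used in the paper:

```python
import numpy as np

def systematic_sample(probs, rng):
    """Fixed-size sampling without replacement: `probs` are marginal
    inclusion probabilities in [0, 1] that sum to an integer n; exactly
    n distinct indices are returned on every draw (Madow's method)."""
    n = int(round(probs.sum()))
    cum = np.cumsum(probs)
    thresholds = rng.uniform() + np.arange(n)   # n unit-spaced points in [0, n)
    return np.searchsorted(cum, thresholds, side="right")

rng = np.random.default_rng(0)
probs = np.array([0.9, 0.7, 0.6, 0.5, 0.3])     # sums to 3
idx = systematic_sample(probs, rng)              # always exactly 3 indices
```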
References.
[1] Mei et al. Resource-adaptive federated learning with all-in-one neural composition. In NeurIPS, 2022.
I thank the authors for the detailed responses, and encourage them to include the post-client update experiments (though along with the cosine similarity I feel an accuracy plot would be useful) and the discussion on the collective estimator in the final version. I retain my scores for now and will certainly consider updating them after discussion with the other reviewers.
The study addresses the challenge of heterogeneous clients in federated learning by introducing two sampling strategies for spectral model sharding. These strategies, derived from specific optimization problems, either produce unbiased estimators of the original weights or minimize the squared approximation error. Empirical results demonstrate that these methods enhance performance on various datasets, despite sometimes slower convergence rates.
Strengths
The paper clearly introduces the problem and identifies a significant gap in the current research. The proposed method is well-presented, with experiments that support the main claims. Additionally, it includes a detailed discussion on the performance of the algorithms, impact of data heterogeneity, communication complexity, and the influence of hyperparameters.
Weaknesses
Practitioners can benefit greatly from reading the paper. Adding a convergence analysis of the proposed algorithm would provide valuable insights for technical readers in the field of federated learning.
Questions
Considering the experiments, could one suggest reducing the model size, based on the fact that the layers trained with your algorithm are low-rank? How do you justify using your method instead of considering a model with fewer parameters?
It would be beneficial to add the performance results of FedAvg, specifically examining the effect of local epochs, and local training (training on local data only). Additionally, comparing the communication costs of FedAvg and your algorithm can provide a fair measure of the reduced communication cost.
Have you tried factorizing different layers? Which layer's factorization can reduce the communication cost the most, and which one affects the performance the most?
Limitations
Adding a convergence analysis of the algorithm will make the paper stronger.
We thank the reviewer for their time and effort in evaluating our work. We are happy to take the feedback into account and improve our presentation accordingly. Below, we address the questions raised in the review.
Q1 and Q2. While using a model with fewer parameters is a straightforward solution for low-capacity clients, we believe that the capabilities of contemporary large vision and language models, as demonstrated in recent years, indicate that larger model size often significantly improves the performance. As a consequence, the challenge of training a large model on a set of low-end devices naturally emerges. One advantage of spectrally sharded models is the ability to train a server model that is significantly larger than the capacity of any individual participating client.
In the context of the aforementioned challenge, we compare the communication costs of the full-size ResNet model trained on CIFAR-100 against different spectral sharding strategies with a keep ratio of 0.1, see Fig. R1 in the attached pdf. These results correspond to those in Tab. 1 of the submitted paper (where FedAvg was referred to as “No Sharding”). In addition, we trained slim versions of various models each containing approximately the same number of parameters as the subsampled per-client models in the considered strategies.
As shown, the full model trained with FedAvg is the least communication-efficient. The slim version of the ResNet model demonstrates the best efficiency in this experiment. However, as mentioned above, the ultimate goal of sharded training is to enable the user to train a model whose size surpasses the individual capabilities of the clients, and slim models do not provide this ability.
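As a back-of-the-envelope illustration (layer sizes hypothetical) of where the per-round savings of sharding come from, ignoring non-factorized layers and any bookkeeping:

```python
def layer_params(m, n, keep_ratio=None):
    """Values communicated for one m x n affine layer: the full matrix,
    or a spectral shard keeping r = keep_ratio * min(m, n) rank-1 terms,
    each costing m + n + 1 values (u_i, v_i, s_i)."""
    if keep_ratio is None:
        return m * n
    r = max(1, int(keep_ratio * min(m, n)))
    return r * (m + n + 1)

full = layer_params(512, 512)                    # 262,144 values
shard = layer_params(512, 512, keep_ratio=0.1)   # 51 * 1025 = 52,275 values
```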
Q3. The low capacity of a participating client imposes a global constraint on the total computational budget. Distributing this budget among the different layers of the network is a complicated and valuable problem, as demonstrated in recent works, e.g., [1, 2]. Our study focuses on sampling strategies, which represent a subsequent step after the budget has been distributed. For simplicity, in our work we spend the budget in proportion to the original layer widths, using the same keep ratio for each subsampled layer.
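A minimal sketch (helper name hypothetical) of this simple budget rule, with the first and last layers left unfactorized as described in the rebuttal above:

```python
def per_layer_ranks(layer_shapes, keep_ratio):
    """layer_shapes: list of (m, n) for the network's affine layers, in order.
    The very first and very last layers are left unfactorized (rank None);
    every other layer uses the same keep ratio applied to its width."""
    last = len(layer_shapes) - 1
    return [None if i in (0, last) else max(1, int(keep_ratio * min(m, n)))
            for i, (m, n) in enumerate(layer_shapes)]
```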
The communication cost of different layers and their influence on performance are intrinsic properties of each specific architecture. For example, as mentioned in the submitted manuscript, various methods (including ours) do not decompose the very first and the very last layers, which are crucial for model performance, and we confirmed this finding in our early experiments. As for the communication cost, in the ResNet architecture the number of parameters in a residual block grows with depth and, as noted in [3], “last 3 bottleneck residual blocks of ResNet-50 contain ∼60% of the entire model parameters”. At the same time, modern transformers typically consist of blocks of the same size.
Convergence analysis. We agree that a formal proof of the convergence would strongly support not only our approach but the entire group of spectral sharding-based methods. However, due to the complexity of such an analysis, in our work, we opted for an empirical check of convergence which is provided in Appendix A of the submitted paper. We aim to provide rigorous proof in future work.
References.
[1] Wang et al. Pufferfish: Communication-efficient Models At No Extra Cost. In MLSys, 2021.
[2] Wang et al. Cuttlefish: Low-Rank Model Training without All the Tuning. In MLSys, 2023.
[3] Yao et al. FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization. 2021.
Thank you for your rebuttal. I would like to maintain my current score.
Dear reviewers, please find the results of the requested experiments attached. The comments on these results are provided in the individual replies.
Spectral model sharding is a technique used in federated learning to address system heterogeneity by breaking model parameters into low-rank matrices. This allows clients with varying computational capabilities to work with smaller, more manageable portions of the model. The authors build on this technique by introducing two new sampling strategies for spectral model sharding. The first strategy creates unbiased estimators of the original model weights. The second strategy focuses on minimizing the squared approximation error.
The paper is well-articulated and provides meaningful insights, with the proposed strategies being more amenable to analysis than previous approaches. Empirical results demonstrate improvements over existing methods. Although there are some concerns about the focus on pre-client update errors and the computational cost of SVD on the server side, the reviewers are unanimously positive, leaning toward acceptance. I agree with this decision, as the paper makes a valuable contribution to addressing system heterogeneity in federated learning. As recommended by the reviewers, I encourage the authors to include post-client update experiments and a more detailed discussion on the collective estimator in the final version.