Collaborative and Efficient Personalization with Mixtures of Adaptors
We propose an efficient formulation for personalized federated learning problems that utilizes a mixture of adaptors for personalization and weight sharing as an implicit regularizer. We demonstrate its benefits experimentally and theoretically.
Abstract
Reviews and Discussion
The paper introduces FLoRAL, a parameter-efficient framework for better personalization. To do this, each client learns to mix between parameter-efficient adaptors according to their task and performs a clustering on top of that. The results show that FLoRAL can outperform an ensemble of full models with optimal cluster assignment. Theoretically, the authors provide a convergence analysis of their framework.
Strengths
- Inspired by LoRA, aggregating partial parameters (adaptors) of federated learning models for better personalization is interesting. It is parameter-efficient and allows local models to benefit both from collaborations and the generalization of their data.
- The idea is well-motivated and presented clearly in the introduction. The results show that a mixture of adaptors sometimes can beat a mixture of models.
- A theoretical analysis is provided. Also, some insights into why aggregating only the adaptors could lead to good performance are discussed. It is an interesting phenomenon from my perspective and can be explored further.
Weaknesses
- Related work is not comprehensively compared and discussed: Firstly, there are many recent works on personalization for federated learning, such as [1, 2]; the authors could discuss these in the related work section and compare against them in the experimental section. Secondly, in related work, the authors can discuss how FLoRAL differs from LoRA-related federated learning methods (what the strengths and differences are, etc.).
- The experimental section is poorly presented: the experiment part could be greatly improved.
- Baselines are not introduced in the main paper and I don't know what they are (what are local adaptors, the ensemble?).
- The synthetic experiment is confusing. For Table 3, the compared baselines are not defined and I am not sure what message the authors are trying to convey.
- The experiments are conducted mostly on MNIST/CIFAR-10 and only simple architectures like CNN/MLP are used. More real-world datasets and more complicated model architectures can be used and tested.
- Generally, the writing of the experiment part is not well structured; results, comparisons, conclusions, and ablation studies are mixed together. The authors could consider presenting it in a more structured way.
- Minor: For line 153, what is this ∆^{C−1}? Authors may define it more explicitly.
Questions
- How is θ implemented in practice and what is it exactly? The authors mention it is a vector (Line 388). However, it remains unclear to me what exactly it is and how it relates to other parts of the parameters.
- What are the potential challenges and outcomes for applying FLoRAL to larger models such as language models?
Generally, the experiment part is not structured in writing.
The experiments section is 1.5 pages. Our structure is: overview of tasks -> task details -> discussion of results. Could the reviewer give some actionable suggestions on how they would like us to restructure this section?
what is this ∆^{C−1}?
It is defined explicitly in line 153. The Wikipedia page on the simplex is also good and explains very well what a simplex is.
How is θ implemented in practice and what is it exactly?
It is a trainable vector of real numbers. Think of it as logits (the final layer output before softmax), but here it is trainable for each client and acts as a gating mechanism for the LoRAs.
Instead of training the mixture π directly on the simplex ∆^{C−1}, we train θ in R^C by setting π = softmax(θ) and running gradient descent on θ. This is actually a naive training methodology because we can run mirror descent directly on ∆^{C−1}, which gives an exponentiated gradient descent. However, we maintained the simple update in line 7 of Algorithm 1 because it is a simple and good baseline despite its naivety.
The relationship between θ and the rest of the parameters is clearly shown in Algorithm 1, precisely lines 6 and 7. We believe that this is clear, and we are happy to answer any specific questions. We also invite the reviewer to see the exact implementation in the code to avoid any ambiguities (e.g., see floral/model/router.py).
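To make the gating concrete, here is a minimal NumPy sketch of the idea (all names and shapes are our illustrative assumptions; this is not the actual floral/model/router.py implementation): a client-local logit vector theta is mapped through softmax to mixture weights, which gate a sum of low-rank updates on a shared base weight.

```python
import numpy as np

def softmax(theta):
    # Map trainable logits theta in R^C to mixture weights pi on the simplex.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def mixed_forward(x, W, U, V, theta):
    """Forward pass of one layer with a mixture of C LoRA adaptors.

    x: input (d_in,); W: shared base weight (d_out, d_in);
    U: (C, d_out, r); V: (C, d_in, r); theta: client-local logits (C,).
    """
    pi = softmax(theta)
    # Effective weight: base plus the pi-weighted sum of low-rank updates U_c V_c^T.
    W_eff = W + sum(pi[c] * U[c] @ V[c].T for c in range(len(pi)))
    return W_eff @ x

# Illustrative shapes: C=3 clusters, rank r=2, d_in=4, d_out=5.
rng = np.random.default_rng(0)
C, r, d_in, d_out = 3, 2, 4, 5
W = rng.standard_normal((d_out, d_in))
U = rng.standard_normal((C, d_out, r))
V = rng.standard_normal((C, d_in, r))
theta = np.zeros(C)  # uniform mixture at initialization
y = mixed_forward(rng.standard_normal(d_in), W, U, V, theta)
```

Note that theta is the only per-client state in this sketch; W, U, and V are shared and aggregated by the server.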
What are the potential challenges and outcomes for applying FLoRAL to larger models such as language models?
This is an interesting question. Thanks for asking it.
The main challenge comes from running these models on the client's devices in the first place. However, assuming that the clients' devices are capable of running them + some LoRAs (which is, indeed, the case in the latest AI-enabled devices, such as the newer iPhones with "Apple Intelligence"), then we can fine-tune a FLoRAL model with LoRAs on objectives that are thought to have a multi-tasking structure. For example, Google uses a separate model for each language for their next-word prediction applications. This may not be necessary when using FLoRAL, which would save resources in terms of compute and memory (and thus cost). The same could be said about Apple Intelligence, where dedicated adaptors are used for each writing task, e.g., proofreading, rewriting, etc. We are very keen to try FLoRAL on such large-scale applications, but currently lack the resources to run simulations of this scale.
We hope that we have cleared all of the reviewer's concerns. We are always more than happy to clarify any further concerns.
We thank the reviewer for their time and their constructive comments.
Related work is not comprehensively compared and discussed
A comprehensive literature survey on all of the techniques covered in this paper (personalized federated learning, clustering, LoRAs, and their combinations) would be very difficult, so we tried to cover the works that we thought are most relevant.
First, we focused on multi-task learning (which is the same as clustering in our context) because this is the setting of interest in our experiments. We cast the personalization problem as a multi-task learning problem because we use the inductive bias that similar clients solve similar tasks, so they might also be able to personalize together instead of separately/locally for each client.
Second, we focused on formulations of the personalized FL problem, which, as far as we know, employ either a proximal regularizer (in weight space) or a feature/representation regularizer (in feature space, which is orthogonal to our work, see line 139). Weight-sharing / partial personalization can be seen as a special case of an implicit proximal regularizer. This virtually covers all kinds of regularization between clients, so any work must be a (perhaps sophisticated) special case of these. If the reviewer believes that some personalization techniques do not fall into any of those categories, we would be genuinely interested to know.
Third, we focused on works that apply LoRAs, and mixtures thereof, in the federated learning setting because this is a critical idea behind our work. In fact, any adaptor should work with FLoRAL, but we mostly use LoRAs since they are one of the best and most well-known parameter-efficient adaptors. Indeed, we can also use an adaptive bias term with the LoRAs (see paragraph starting at line 252 and Table 5 in Appendix G.3). We believe that we can also use multiplicative adaptors, but we have not tested this.
Regarding [1], the main point of relevance is that it is a personalized FL algorithm. The first difference is that it is instance-dependent, as evident from the title, whereas our method is not, which is an important distinction when it comes to efficiency. Second, we do not have "local parameters". Even though the router is a local parameter of C numbers, we can get rid of this assumption at the expense of extra computation steps per communication round. Furthermore, that algorithm is slightly complex, whereas ours is a "method" and not exactly an "algorithm". In other words, FLoRAL can use any algorithm for solving (MFL). Thus, there is an important difference in terms of efficiency, ease of use, and comparability. Our method can be easily adjusted to work on top of a FedAvg routine by letting clients relearn the router from scratch at the beginning of each round, which takes only a few extra forward passes, making it highly compatible and easy to use. We are not aware of any efficient personalization method that does not require local parameters or stateful clients.
Regarding [2], it is actually a special case of our framework: set the number of clusters accordingly, fix the router to a one-hot vector (i.e., assign a cluster to each client), turn off the LoRA adaptors, turn on the bias adaptors (on by default), and use SGD with momentum as the client's optimizer (to accommodate [2, Eq. 9]). We invite the reviewer to look up the FLoRAL module in the code and see how easy it is to use this setup.
Baselines are not introduced in the main paper
We apologize for not dedicating more space to explaining this further. The discussion at line 510 explains the baselines, but we will try to explain them in more detail in our revision.
Ensembles are a well-known class of models, and in our case, they are just mixtures of models (instead of mixtures of losses, which are equivalent in the optimal/one-hot routing case in our experiments). They can be seen as a "scaled-up" model, and thus an upper bound (which we were able to break in the reduced-size CIFAR-10-R tasks). Local adaptors are quite simply just that: adding local adaptors to the model for each client, where "local" means "per-client", or "personal" (consider the construction above connecting [2] to our method).
Synthetic experiment is confusing
Could the reviewer explain what is confusing about the synthetic experiments? Perhaps Sections G.1 and G.2 can help.
("Table 3" should be named "Figure" instead, we will fix this typo).
The experiments are conducted mostly on MNIST/CIFAR-10 and only simple architectures like CNN/MLP are used
The tasks in the experiments are actually difficult to solve because they are highly heterogeneous. Even the linear task is not trivial. They are not simply MNIST and CIFAR-10 (we also used CIFAR-100, by the way). Please see the experiments section for the details behind the construction.
We are planning on adding more experiments using transformers on translation tasks with different languages.
Please continue to part 2.
This paper proposes a lightweight personalized federated learning method, FLoRAL, which integrates multiple shared adapters with client-specific mixture vectors constrained to a simplex. The authors discuss the relationship between the proposed method and existing personalized federated learning approaches, discuss the probability selection strategy when aggregating weights, and present convergence properties of the proposed method as well as the impact of inaccurate estimation of the mixture vector.
Strengths
I commend the authors for their clear analysis of the weight selection strategy for aggregation and their thorough examination of how uncertainties in estimating the mixture vector impact convergence.
Weaknesses
- The experiments do not include all relevant baselines. Specifically, methods based on Mixture of Experts (MoE), such as [1], and shared LoRA, such as [2, 3], are also closely related to the proposed approach.
- The convergence analysis and the connection between FML and MFL problems are established under the assumption of convexity. However, the experiments involving neural networks generally do not meet this convexity assumption. A stronger alignment between the analysis and experiments would enhance the study’s validity.
- The contributions of the work are not clearly articulated. While the authors claim that FLoRAL is memory-efficient, similar advantages are also present in previous LoRA-based methods, such as [2, 3]. The authors should consider specifying the unique advantages that the proposed method offers over existing approaches.
[1] Yi, Liping, et al. pFedMoE: Data-Level Personalization with Mixture of Experts for Model-Heterogeneous Personalized Federated Learning. arXiv:2402.01350, arXiv, 11 Feb. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2402.01350.
[2] Yi, Liping, et al. pFedLoRA: Model-Heterogeneous Personalized Federated Learning with LoRA Tuning. arXiv:2310.13283, arXiv, 11 Feb. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2310.13283.
[3] Cho, Yae Jee, et al. "Heterogeneous lora for federated fine-tuning of on-device foundation models." International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023. 2023.
Questions
- In Theorem 4.5, the authors demonstrate that the convergence rate is influenced by a mismatch term, where the mismatch represents the degree of disagreement between the predicted distribution and the ground truth. Does this imply that, even if the predicted distribution differs significantly from the ground truth, it can still converge to the optimal solution? This seems counterintuitive; could the authors provide some intuition behind this result?
- There are several typographical errors, and some notations are unclear; for example, there is a notational typo in Eq. (2).
- In line 467, the authors state, “we do not aim to improve over the state of the art.” However, the experimental section does not clearly demonstrate the effectiveness of the proposed method. If relevant results are provided in the appendix, I would recommend placing the most significant results that highlight the advantages of the proposed method within the main text.
We thank the reviewer for taking the time to review our paper.
The experiments do not include all relevant baselines.
Note that:
- We have already cited pFedLoRA (line 131).
- We have cited an earlier paper on federated MoE, even though our method is not exactly similar to MoE as it does not depend on input (see line 536), which is an important distinction.
- Many of the related works we cite already consider federated fine-tuning of LoRAs.
Nonetheless, we would be happy to include the interesting papers you mentioned and discuss them in the related work section.
To elaborate more on the point regarding MoE, we note that for each client the mixture has only C parameters, so in total we have NC personalization parameters across the whole pool of N clients, which is not a lot (see the paragraph at line 391 for further discussion). However, when the mixture is a function of the input or some hidden features, the number of parameters will increase significantly, making our method less efficient. Extending these mixtures to input-dependent mixtures in an efficient way is not trivial.
We acknowledge that this is an interesting direction to explore (as mentioned in line 536), and we have already done some preliminary experiments but with mild results. However, we believe that further exploration would be interesting, which is why we mentioned it in the conclusion.
The convergence analysis [...] are established under the assumption of convexity. However, the experiments involving neural networks generally do not meet this convexity assumption.
This is a valid concern that is often ignored in some convergence results, but for good reasons. First of all, algorithms designed for the convex case often work surprisingly well for neural networks, so we like to believe that the analysis carries over as well. Second, convergence in the convex case demonstrates the theoretical soundness of our method (somewhat similar to how one trains a novel vision model on MNIST as a proof of concept). Third, not all models are neural networks. For example, the linear synthetic experiment is almost convex (the non-convexity is only because of the UV^T part, which is adaptor-dependent).
In general, this concern can be valid, but despite the fact that there is a theoretical disconnect between neural networks and analyses in the convex case, we believe that this disconnect should not undervalue the analyses' implications for neural networks in practice.
The contributions of the work are not clearly articulated. While the authors claim that FLoRAL is memory-efficient, similar advantages are also present in previous LoRA-based methods, such as [2, 3].
The main difference is two-fold: (1) we train the full model, and not only fine-tune the LoRAs, and (2) we train a mixture of LoRAs, and not only one LoRA. These two differences are not trivial. The reason for the former is that the base model might not necessarily be at a solution or be a pre-trained model, and we have made no such assumptions in the analysis. As for the latter, this is an inductive bias: the clients are solving implicit tasks (similar clients are solving similar tasks). This is true for clustering problems, so we ran our experiments in this case. Indeed, we found that FLoRAL can perform very well on these tasks while having a shared base model across clusters.
An interesting phenomenon is that FLoRAL can outperform a mixture of full models in the low-data regime, even when the ensemble starts training with the optimal cluster assignment, which is surprising as this case can be seen as an upper bound. We show in Appendix B.2 that this can be attributed to the reduced gradient variance in learning the (shared) base model.
Q1
This value is the mismatch at initialization. Note that the mismatch is always bounded above by 2. We assume in this theorem that it decreases (sub-)linearly, but we have also derived the rate for the general case without assumptions, which can be seen in its full glory in Theorem A.9.
Q2
Thank you for catching this typo! Could the reviewer elaborate on the unclear notations?
Q3
Kindly refer to the paragraph above that starts with "An interesting phenomenon ...". Our method is not directly comparable to state-of-the-art baselines. In fact, we can train FLoRAL models with such baselines, so it is enough to demonstrate its effectiveness against different parameterizations, namely ensembles and local adaptors, both of which have much higher complexities and parameter counts over the whole pool of clients.
We hope that we have addressed the reviewer's concerns, and we are happy to answer any further questions.
The paper introduces FLoRAL. It is claimed to be a memory-efficient federated learning framework focused on model personalization through the combination of LoRA adapters.
Strengths
- This paper has a nice figure illustration.
- FLoRAL uses low-rank adaptation (LoRA) to personalize the model for each client. This significantly reduces the number of parameters that need to be stored and transmitted compared to using full models.
- Adapters in FLoRAL are learned collaboratively between clients, leveraging information from multiple data sources.
Weaknesses
Although many efforts can be witnessed in this paper, we still find that the structure of this paper is hard to follow, and we can see a lack of explanation/motivation behind some techniques:
- The authors apply the aggregated gradient every H steps over the whole time horizon T and use the modulo operator to describe what happens at specific timesteps. However, this overcomplicates the problem rather than simply using 'local' and 'global' rounds as usual.
- More discussion is needed on the communication and computation cost. For example, which parameters (i.e., vectors/matrices) are communicated between the clients and server? Which operations need to be computed by clients/servers? How is the communication cost (regarding the parameter size) compared with the computation cost?
- Besides, the experimental results are not convincing compared to SoTA baselines. For example, the work did not compare the proposed framework with another popular federated framework, i.e., FFA-LoRA (https://openreview.net/forum?id=NLPzL6HWNl).
- The benefits of LoRA averaging in FL are not justified in the theoretical analysis. For example, what are the effects of averaging the LoRA adapters u and a (lines 11 and 12 of Algorithm 1) in the theoretical analysis?
- Some notations are not clearly explained, e.g., the loss function of the sampled client i_t at the weight of client k (w_k) in line 303.
Questions
- The aggregated gradient is applied every H steps. However, the authors did not mention how we can compute the loss function of i_t (the sampled client) over the client's weights (line 303).
- In line 316, what is the motivation behind the exponential formula for updating the learned client mixtures?
- The authors claimed that this is a parameter-efficient framework. However, we see that the base layers are synchronized in the Algorithm 1 (line 419). So, what parameter efficiency could this framework bring?
- How are clusters assigned to clients?
We thank the reviewer for taking the time to review our work.
The authors implement the aggregated gradient every H step [...]. However, this would overcomplicate the problem ...
This is equivalent to local rounds and global rounds, and we can set T to be a multiple of H without loss of generality. It does not overcomplicate the problem. In fact, it makes the analysis easier.
... which parameters (i.e., vectors/matrices) are communicated between the clients and server?
All of the parameters except the router for each client, as mentioned in line 391. See the discussion in that paragraph for the stateless-clients case, which is not expensive.
Which operations need to be computed by clients/servers?
Kindly refer to Algorithm 1 (line 406). The server aggregates the parameters as in lines 11 and 12 of Algorithm 1, and the clients run lines 6, 7, and 8 of Algorithm 1.
How is the communication cost (regarding the parameter size) compared with the computation cost?
The extra communication per client is only the size of the shared adaptors plus the router. For example, for FLoRAL(1%), the total communication cost is only about 1% more than the normal case (called FedAvg in the paper). For ensembles, the total communication cost is multiplied by the number of clusters. For local adaptors, the total communication cost is similar to ours, but we need to maintain adaptor parameters for each client, so the total parameter count of the personalized solutions scales with the number of clients. Thus, FLoRAL solutions are the most efficient system-wise, significantly so when the number of clients is large. For example, simulating a personalized FL problem with local adaptors, even with a moderate number of clients and a moderate model size, still requires a lot of memory, and this is not even near the scale of real cross-device settings. On top of that, FLoRAL is consistently better than local adaptors.
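As a rough worked example of these system-wide parameter counts (all numbers below are our own illustrative assumptions, not figures from the paper), consider a base model with P parameters, C = 4 clusters, N = 100 clients, and adaptors that are 1% of the base model's size:

```python
# Hypothetical system-wide parameter counts; P, C, N, and the 1% ratio
# are illustrative assumptions, not numbers from the paper.
P, C, N, rho = 1_000_000, 4, 100, 0.01

fedavg = P                          # one shared model, no personalization
floral = P + C * rho * P + N * C    # shared base + C shared adaptors + per-client router
ensemble = C * P                    # one full model per cluster
local_adaptors = P + N * rho * P    # shared base + one private adaptor per client

# Under these assumptions, FLoRAL adds only ~4% over FedAvg,
# while local adaptors double the count and ensembles quadruple it.
assert floral < local_adaptors < ensemble
```

The gap widens with N: local adaptors grow linearly in the number of clients, while FLoRAL only adds N*C router scalars.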
experimental results [...] are not convincing compared to SoTA baselines. [...] did not compare the proposed framework with another popular federated framework, i.e., FFA-LoRA
We apologize for not being aware of this popular federated framework. FFA-LoRA is concerned with an FL training technique for LoRAs, which is an interesting contribution. In fact, this technique can be used with FLoRAL, so it is not clear how we would compare against it.
Could the reviewer provide the SoTA baselines that make our experimental results seem not convincing?
The benefits of LORA averaging in FL are not justified in theoretical analysis.
LoRA averaging is not beneficial per se. It is the averaging of the base model that provides the variance-reduction benefits demonstrated in Appendix B.2. Please refer to that section for the full details. Also, recall that the paragraph starting at line 356 discusses this benefit and refers to Appendix B.
Some notations are not clearly explained, i.e. (loss function of sampled client i_t at the weight of client k (w_k) in line 303).
The notation is explained in that very line. Each client i has its own objective f_i. We sample clients i_t arbitrarily from the pool of clients as long as the assumptions are satisfied. As a practical example, the sampled variable i_t can denote a cohort of clients (think of this as putting clients in a bucket and renormalizing the sampling probabilities).
We would greatly appreciate it if the reviewer can help us improve the notations by making actionable suggestions.
... the author did not mention how we can compute the loss function of i_t (the sampled client) over the client's weights (line 303).
Kindly refer to lines 11 and 12 in Algorithm 1 (line 420 in the text). This is almost exactly how we do it in practice. The notations in the analysis are designed to work well for the analysis, but not for practice. Think of the function as quite simply the aggregated objectives of the sampled clients, which is reflected in the algorithm as the set of clients sampled at round t.
In line 316, what is the motivation behind the exponential formula for updating the learned client mixtures?
The motivation is discussed in detail in Appendix C. This update is optimal under entropy regularization (without regularization, the optimal mixture is a one-hot vector), but it does not have adaptivity (it forgets about the previous iterates/prior), which is not necessary for our analysis.
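To make the two router updates under discussion concrete, here is a short NumPy sketch (names and step size are our illustrative assumptions): plain gradient descent on the logits, as in line 7 of Algorithm 1, versus the exponentiated-gradient update, i.e., mirror descent on the simplex with the entropy mirror map.

```python
import numpy as np

def softmax(theta):
    # Map logits in R^C to mixture weights on the simplex.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def logit_gd_step(theta, grad_pi, lr=0.1):
    # Naive update: backprop through pi = softmax(theta) and step on theta.
    pi = softmax(theta)
    # Jacobian-vector product of softmax applied to the gradient w.r.t. pi.
    grad_theta = pi * (grad_pi - pi @ grad_pi)
    return theta - lr * grad_theta

def exp_gd_step(pi, grad_pi, lr=0.1):
    # Exponentiated gradient descent: multiplicative update, then renormalize,
    # so pi stays on the simplex by construction.
    pi_new = pi * np.exp(-lr * grad_pi)
    return pi_new / pi_new.sum()
```

Both updates keep the mixture on the simplex after the softmax/renormalization; the exponentiated version acts on the mixture directly rather than through the logits.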
... what parameter efficiency could this framework bring?
The efficiency is explained in our answer above to the question about the communication cost. It is particularly evident when compared to ensembles.
How are clusters assigned to clients?
Kindly refer to lines 496, 504, 1776, and 1802.
We are happy to engage further and answer any questions the reviewer has.
The paper proposes "FLoRAL," a parameter-efficient federated learning framework that uses mixtures of low-rank adaptors (LoRAs) to enable [perhaps] client personalization in heterogeneous settings.
Strengths
The strength of this paper is its exploration of parameter-efficient personalization in federated learning, which is an important topic.
Weaknesses
See my comments below:
- The paper lacks a clear presentation of the exact problem it aims to solve. In multiple sections—such as the abstract (lines 11-28) and the introduction (lines 51-57, 72-77)—the objectives and approach remain ambiguous. The methodology section does not clearly delineate the specific FL problem at hand or how FLoRAL is a unique solution to this problem. Overall, the paper needs clearer writing.
- Section 3 introduces five different FL setups, but it is unclear which of these the authors ultimately focus on. Are the authors targeting personalized FL, clustered FL, multi-task learning FL, or another variant? Furthermore, the notations "W" and "L" on line 151 are undefined, adding to the confusion.
- The related work section is weak and misses recent studies on personalized FL, multi-task FL, and representation learning. A thorough review of these fields is essential to position FLoRAL relative to state-of-the-art methods. It would help to articulate the novel contributions of FLoRAL more clearly in relation to existing methods, especially highlighting areas where it advances beyond current FL approaches.
- While the methodology is not clear, it appears overly simplistic, consisting mainly of mixing adaptors through weighted combinations. This is a straightforward extension of FedEM, as noted in line 378. Key methodological details are also missing—such as the initialization process, whether only the adaptors are learned and communicated, and whether the mixing of adaptors occurs at the client or the server. Providing a more detailed breakdown of the algorithm would clarify the novelty and rigor of the approach.
- The experimental section lacks critical information regarding the experimental setup and baseline comparisons. The baselines in Table 1 are not clearly explained; for instance, "Ensemble" is introduced without adequate description, and it is unclear how various settings were configured. The purpose of the "reduced 5% data" setting is also not explained. Furthermore, if the authors intend to address multi-task FL, a comparison with state-of-the-art MTL FL methods would be essential to substantiate the claims of FLoRAL’s effectiveness. Overall, the methodology section is not clear at all.
- The conclusion section does not summarize the key findings of the study but rather discusses potential future work.
Questions
See my comments above.
Key methodological details are also missing—such as the initialization process, whether only the adaptors are learned and communicated, and whether the mixing of adaptors occurs at the client or the server.
The initialization process is not mentioned because it is slightly orthogonal to our contribution. We can either use the traditional LoRA initialization or just the default one in PyTorch (e.g., Kaiming uniform). We use the latter, but we found that initialization does not significantly change the results or their implications.
The rest of your concerns can be answered from Algorithm 1. Namely, it is clearly demonstrated that both the base model and the adaptors are learned and communicated (lines 1, 8, 11, and 12 in Algorithm 1 and the comments therein). Also, it is clear that the mixing occurs at the client side (line 6 in Algorithm 1 and line 391 in the text). The algorithm is a simplification of the one we use in practice. Please refer to the code for the exact implementation. We would be happy to answer any questions you have about the code.
The experimental section lacks critical information regarding the experimental setup and baseline comparisons.
We agree with this comment and we shall provide more details in the appendix in our revision. Due to the complexity of our experimental setup, we believe that a proper and complete understanding can be achieved much more efficiently by looking at the configuration files in our code, which are quite readable.
The baselines in Table 1 are not clearly explained; for instance, "Ensemble" is introduced without adequate description, and it is unclear how various settings were configured.
An ensemble in this context is simply a mixture of models (not a mixture of predictions/losses; refer to line 373 for the relationship). We mix the model weights according to the router's mixture. The optimal ensemble uses the correct mixture/cluster assignment from the start, and this case is thus equivalent to the case of mixed losses because the optimal mixture in all of our experiments is one-hot.
The purpose of the "reduced 5% data" setting is also not explained.
It is to demonstrate a regime where FLoRAL outperforms ensembles, which is not obvious a priori.
Furthermore, if the authors intend to address multi-task FL, a comparison with state-of-the-art MTL FL methods would be essential to substantiate the claims of FLoRAL’s effectiveness.
Could the reviewer kindly elaborate on what is meant by "substantiate the claims of FLoRAL’s effectiveness"? We have shown that FLoRAL can outperform the ensemble approach, which translates to the fact that increasing a model's size does not necessarily lead to better performance. In other words, FLoRAL can consistently outperform ensembles in the low-data regime for CIFAR-10 rotate. We believe that this finding is interesting and demonstrates "FLoRAL's effectiveness" in practice against a strong baseline that a priori seemed like an upper bound. We believe that this finding can lead to insights into designing better parameter-efficient models (vs. scaling up).
We believe we have addressed all of the reviewer's concerns. Thus, we kindly ask the reviewer to reconsider their review. If further concerns arise, we will be more than happy to answer any questions the reviewer has.
We thank the reviewer for taking the time to review our paper. Most of the reviewer’s concerns can be addressed by a careful reading of our paper, and we shall direct the reviewer to the exact lines where the answers can be found.
The paper lacks a clear presentation […] approach remain ambiguous.
Line 204 onward explains the exact problem and the interpretation of the low-complexity differences mentioned in the introduction. Specifically, (MFL-WS) is clearly mentioned as the objective of interest.
The methodology section does not clearly delineate the specific FL problem at hand ...
Could the reviewer kindly elaborate on the "does not clearly delineate the specific FL problem at hand" part? The objective of interest is (MFL) with weight sharing. Section 5 is dedicated to a more practical implementation of solving this objective, which amounts to relaxing it to (FML) with weight sharing and learning the softmax-parameterized mixture directly with gradient descent.
... or how FLoRAL is a unique solution to this problem.
FLoRAL is unique in its construction of the personalized models, which allows for: (1) an efficient parameterization by weight sharing on the base model, and (2) adaptation to C tasks with collaboratively-learned low-rank adaptors.
Section 3 introduces five different FL setups, but it is unclear which of these the authors ultimately focus on.
Are the authors targeting personalized FL, clustered FL, multi-task learning FL, or another variant?
This has already been answered above. To reiterate, we are focusing on (MFL), a multi-task learning objective, which can also be cast as a clustering objective.
Furthermore, the notations "W" and "L" on line 151 are undefined, adding to the confusion.
This is a section about notations, and this matrix is intentionally undefined. It just demonstrates our stylistic choice for denoting matrices vs. vectors.
The related work section is weak […].
Does the reviewer have a particularly relevant work that they believe we have missed in our related work? We have also clearly mentioned that representation learning is orthogonal to our work. In other words, representation learning can be made to work with FLoRAL.
A thorough review of these fields is essential to position FLoRAL relative to state-of-the-art methods.
The paragraph starting at line 467 answers this concern. The previous state-of-the-art methods, specifically the ones in the clustering context, treat the ensemble approach with the optimal cluster assignment as an "oracle" or an upper bound. We demonstrated in our experiments that our method can actually beat that upper bound in the low-data regime (for CIFAR-10, specifically).
This phenomenon is interesting because it shows that the ensemble approach can overfit when clients have little data (a realistic scenario). The overfitting is clearly demonstrated in the test loss plots on the reduced-size datasets, which can be seen in the right column of Figures 3 to 7 in Appendix G.5.
The benefit of FLoRAL is that it can mitigate this overfitting. We tried to explain this theoretically by proving that the base model's gradient has reduced variance when averaged over all clients vs. only the clients in its cluster. This benefit manifests when the variance of the base model's gradient across clusters is small, which is exactly the motivation behind weight sharing in the first place.
While the methodology is not clear, it appears overly simplistic […]
We believe that simplicity is a positive thing. Perhaps the reviewer can judge the simplicity of our work more accurately by looking at the code that implements the method in practice, which is provided in the supplementary material and in the anonymous repo link in the paper.
This is a straightforward extension of FedEM, as noted in line 378.
It is not. In fact, we can use FedEM to train a FLoRAL model. In general, any algorithm that solves (MFL) can be used.
Please continue to part 2.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.