FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
Federated black-box prompt learning raises a previously unaddressed challenge: query efficiency with respect to LLM APIs.
Abstract

Reviews and Discussion
This paper presents a prompt tuning method for a federated learning system that works with black-box LLMs on the cloud. The proposed method, FedOne, reduces the cost of API usage. Theoretical analyses and experiments are conducted to support the main claims.
Strengths
- This work focuses on a new perspective of federated learning with LLMs: reducing API usage cost.
- The experimental results seem positive, and the proposed FedOne's overall generalization ability is validated on various datasets.
- The discussion of the weaknesses of white-box prompt learning is an interesting addition to the federated learning literature.
Weaknesses
- Considering that the computational cost over the entire LLM in white-box prompt learning is transferred to the cost of API usage in black-box prompt learning, the overall motivation is questionable. More detailed cost-benefit analyses and related facts would help justify it.
- The claim and the proof of Corollary 2, which support the main idea that only 1 client is needed, are naive and questionable. K∗ is related to aggregation and overall variance (e.g., λ in Assumption 3), due to the fact that a larger K reduces the aggregation noise in the federated learning setting. Even if a small stepsize can be used to minimize it, as claimed in Line 269, the training process is slower, leading to more global epochs. Thus, the bound, from Line 1519 to 1595, seems vacuous and the claim does not make sense. A more comprehensive analysis could help, which takes into account the trade-off between the reduced aggregation noise from a larger K and the increased number of global epochs.
- In the reported results, fine-tuning works better than FedOne. A more detailed analysis of this result and of the compatibility of the two methods would help to explain the underlying reasons.
Questions
- Can the gradient diversity and K be decoupled in the analyses?
- Are there any facts supporting that the cost of a local LLM is larger than the cost of API usage? Could you please provide a more detailed cost-benefit analysis comparing the computational costs of white-box prompt learning vs. the API usage costs of black-box prompt learning?
- Are there scenarios where FedOne might be preferable despite lower performance? Could you please discuss potential improvements to FedOne that could help close the performance gap with fine-tuning?
We sincerely appreciate your insightful comments and constructive feedback, which have been invaluable in improving the quality and clarity of our manuscript. Below, we provide detailed responses to the identified weaknesses and questions.
*W1: Considering that the computational cost over the entire LLM in white-box prompt learning is transferred to the cost of API usage in black-box prompt learning, the overall motivation is questionable. More detailed cost-benefit analyses and related facts would help justify it.
- The statement "the computational cost of the entire LLM in white-box prompt learning is transferred to the cost of API usage in black-box prompt learning" is not universally applicable. Access to advanced LLMs is typically governed by large organizations, which often restrict white-box access, making fine-tuning or internal modification infeasible. This leaves researchers and practitioners with no alternative but black-box prompt tuning to adapt these models to specific tasks. Black-box prompt tuning enables efficient model customization by modifying the inputs (prompts) rather than the model's internal structure, which is crucial in scenarios where internal access or fine-tuning is prohibited or unavailable.
- Even if the stated cost-transfer argument were valid, black-box prompt learning offers clear advantages in terms of cost-effectiveness and practicality. White-box fine-tuning of LLMs requires substantial computational resources, such as GPUs or other high-performance hardware, resulting in significant upfront investments and maintenance costs. In contrast, black-box prompt learning shifts this burden to API usage fees, which are often more manageable and scalable, particularly for smaller organizations or individual practitioners. Moreover, in our experiments, the cost of white-box prompt learning (e.g., training on cloud GPUs with RoBERTa-base) was comparable to the cost of black-box prompt learning (e.g., using the OpenAI API for GPT-3.5-turbo). However, black-box methods allowed us to leverage a more advanced model (GPT-3.5-turbo) than would have been feasible in the white-box setting, further underscoring its practicality and efficiency.
*W2: The claim and the proof of Corollary 2, which support the main idea that only 1 client is needed, are naive and questionable. K∗ is related to aggregation and overall variance (e.g., λ in Assumption 3), due to the fact that a larger K reduces the aggregation noise in the federated learning setting. Even if a small stepsize can be used to minimize it, as claimed in Line 269, the training process is slower, leading to more global epochs. Thus, the bound, from Line 1519 to 1595, seems vacuous and the claim does not make sense. A more comprehensive analysis could help, which considers the trade-off between the reduced aggregation noise from a larger K and the increased number of global epochs.
- The demonstration of Corollary 2 is rigorous. We analyze the optimal K∗ by balancing convergence complexity and query count, aiming to achieve the fastest convergence with the fewest queries. Specifically, we first fix the convergence accuracy, denoted ϵ, and determine the number of iterations, T, required to achieve this accuracy. We then analyze the total number of queries, given by K∗ · T. This analysis shows that the combined effect of convergence complexity and query count is directly proportional to K∗. Therefore, to optimize query efficiency, balancing both convergence complexity and query count, it is crucial to minimize K∗; setting K∗ = 1 achieves the most efficient balance, ensuring both theoretical and practical efficiency (a numerical sketch follows at the end of this response).
- You appear to be conflating N, K, and λ in the algorithmic context. To clarify: N represents the total number of clients in the federated learning system; K denotes the number of clients activated per round. Increasing K reduces the variance introduced by the aggregation of clients, as described in Equation (13); λ captures the bias introduced by data heterogeneity, reflecting inconsistencies in client data distributions. These two sources of error, bias from data heterogeneity (λ) and variance from random client selection (K), are distinct. Additionally, the total number of clients, N, remains constant throughout the algorithm's execution, ensuring a stable framework for analysis and implementation.
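To make the balancing argument concrete, the following hedged sketch assumes the iteration count to reach a fixed accuracy scales as T(K) = C · (1 + 1/K), mirroring the (1 + 1/K∗) factor that appears in the step-size bound discussed later in this thread; it is an illustration, not the paper's exact complexity:

```python
# Hedged illustration of the query-count trade-off, not the paper's exact bound.
# Assumption: iterations to reach a fixed accuracy scale as T(K) = C * (1 + 1/K),
# mirroring the (1 + 1/K*) factor that appears in the step-size bound.
C = 1000  # arbitrary problem-dependent constant

for K in [1, 2, 5, 10, 50]:
    T = C * (1 + 1 / K)   # rounds needed to reach the target accuracy
    queries = K * T       # each round, every activated client issues one LLM query
    print(f"K={K:>2}  rounds={T:7.1f}  total queries={queries:9.1f}")

# Total queries = C * (K + 1): increasing in K, minimized at K = 1.
```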
*W3: In the reported results, Fine-tuning works better than FedOne. A more detailed analysis of this result and the compatibility of the two methods would help to explain the underlying reasons.
Thank you for your valuable suggestion. It is important to note that fine-tuning requires access to the internal parameters of the model, which is often restricted, and involves significant computational costs due to the need to retrain the model on large task-specific datasets. We acknowledge that fine-tuning can achieve superior performance in certain cases; however, its feasibility is constrained by these factors, underscoring the importance of alternative approaches like FedOne for scalable and accessible model adaptation.
*Q1: Can the gradient diversity and K be decoupled in the analyses?
Certainly, decoupling them is possible. Calculating the derivative of Equation (12) reveals two key insights. First, the upper bound on the learning rate decreases as λ increases, indicating that greater data heterogeneity necessitates a smaller learning rate to effectively mitigate the bias it introduces. Second, the upper bound on the learning rate decreases as K increases. This can be explained by the fact that activating a larger number of clients per round amplifies the effects of data heterogeneity, thereby exacerbating the bias in the aggregation process.
*Q2: Are there any facts supporting that the cost of a local LLM is larger than the cost of API usage? Could you please provide a more detailed cost-benefit analysis comparing the computational costs of white-box prompt learning vs. the API usage costs of black-box prompt learning?
- A local LLM requires clients to utilize powerful computational resources, typically GPUs, which involve significant upfront costs. Even if we ignore the upfront cost and take the cost in our experiment as an example: renting a cloud GPU (e.g., an NVIDIA V100 at 1 USD/hour) for a single FedOne prompt-tuning trial costs approximately 0.2 USD, including data loading, training, and evaluation. By comparison, using black-box APIs such as GPT-3.5-turbo incurs token-based fees; in our experiments, processing 32 few-shot samples (5 repetitions, 100 epochs) costs about 0.4 USD for training. While the costs are comparable, black-box prompt learning leverages more advanced models (e.g., GPT-3.5-turbo), making it a more cost-effective and practical solution without requiring substantial upfront investment in local infrastructure (a back-of-envelope sketch follows at the end of this response).
- Moreover, we would like to point out that white-box methods are not suitable for local training in FL. In particular, general FL models often involve clients that are resource-constrained devices with limited computational power. These devices typically cannot support the high computational demands of white-box approaches, which require access to model parameters and may involve extensive retraining or fine-tuning. This makes white-box methods impractical for local training in many real-world applications where clients are constrained by resources.
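The back-of-envelope comparison below uses only the figures quoted in this response; all values are illustrative assumptions rather than measured benchmarks:

```python
# Back-of-envelope comparison using only the figures quoted in this rebuttal.
# All numbers are illustrative assumptions, not measured benchmarks.
gpu_hourly_usd = 1.00   # rented cloud GPU (e.g., NVIDIA V100)
trial_hours = 0.2       # one white-box FedOne prompt-tuning trial with RoBERTa-base,
                        # including data loading, training, and evaluation
whitebox_cost = gpu_hourly_usd * trial_hours

api_cost = 0.40         # GPT-3.5-turbo token fees for training on 32 few-shot
                        # samples, 5 repetitions, 100 epochs (as reported above)

print(f"white-box (cloud GPU, RoBERTa-base):   ~{whitebox_cost:.2f} USD")
print(f"black-box (OpenAI API, GPT-3.5-turbo): ~{api_cost:.2f} USD")
# Comparable totals, but the black-box run accesses a far more capable model.
```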
*Q3: Are there scenarios where FedOne might be preferable despite lower performance? Could you please discuss potential improvements to FedOne that could help close the performance gap with fine-tuning?
- FedOne is specifically designed to address the unique challenges of federated learning scenarios, particularly in light of the significant query costs associated with cloud-based LLM services. In environments where computational resources are abundant and cost is not a limiting factor, fine-tuning typically achieves superior performance by enabling extensive model adaptation and optimization. However, in more common FL settings, where clients are often resource-constrained and query costs for accessing cloud-based models are prohibitively high, FedOne provides a substantial cost reduction while maintaining competitive performance. By prioritizing query efficiency and minimizing the need for extensive model updates, FedOne strikes an effective balance between performance and resource utilization, making it a practical and cost-effective solution for real-world FL deployments.
- Potential enhancements to FedOne to narrow the performance gap with fine-tuning: Black-box prompt tuning focuses on optimizing the input prompt to guide the model in producing accurate task-specific outputs. This gap can be reduced through more sophisticated prompt design techniques, such as employing iterative interaction strategies, integrating richer contextual information, or leveraging refined prompt templates. These advancements improve the model's task adaptability without requiring modifications to its internal parameters. Moreover, the performance gap between black-box prompt tuning and fine-tuning can be further bridged by combining the strengths of both approaches. For instance, incorporating a limited degree of fine-tuning on top of black-box prompt tuning—such as lightweight fine-tuning by freezing certain layers—can enhance task adaptability while preserving the flexibility of black-box methods. This hybrid approach leverages the efficiency and scalability of black-box prompt tuning alongside the adaptability of fine-tuning, resulting in improved task-specific performance.
The response to W1 is reasonable, but the rest of the analysis of FedOne is confusing: the assumptions on the parameters inherently decouple connections that should exist. The noise from sampling fewer clients is not taken into account in the analysis, leading to biased conclusions; therefore, I do not change my rating.
We have uploaded a revised version addressing some of the issues identified during the discussion (highlighted in blue). Your review and feedback would be greatly appreciated.
| Section and line number | Description | Reviewer comments |
|---|---|---|
| eq(12) #line 1495 #line 1508 | Fixing Errors | PV8j(W4) |
| #line 312 | The value of n in this paper | PV8j(W2) |
| Corollary 2 #line 1589 | Re-derivation of Corollary 2 | QMks(W1) xxa2(W2) |
| Remark 4 #line 1611 | Principle of Corollary 2 | 4cVb(Q1) QMks(W1) |
This paper proposes a more cost-efficient federated learning framework, called FedOne, to optimize query efficiency when interacting with a cloud large language model (LLM). A convergence analysis for federated BDPL is also explored in this paper. Extensive experiments demonstrate the effectiveness of the proposed method.
Strengths
- Improving the query cost efficiency is important in federated learning scenarios. The proposed method is simple and reasonable, with a theoretical guarantee.
- The paper is well-written and easy to understand.
Weaknesses
- More details on optimizing K∗ from ">1" to "=1" for improving query efficiency in Section 3.2 should be provided, as it is the most important part of the whole paper. According to the current version, the proposed method is too simple and too straightforward. The effect of the value of K∗ on model convergence should also be analyzed in this section.
- The experiment does not contain related baselines to compare with, such as [1], [2]. As a result, it is very hard to evaluate whether FedOne reflects SOTA performance.
[1] Black-box Prompt Tuning for Vision-Language Model as a Service. IJCAI 2023.
[2] Fedbpt: Efficient federated black-box prompt tuning for large language models. ICML 2024.
- It is unclear whether the number of clients has an effect on the performance of FedOne.
Questions
Please refer to the Weakness section.
We sincerely appreciate your insightful comments and constructive feedback, which have been invaluable in improving the quality and clarity of our manuscript. Below, we provide detailed responses to the identified weaknesses.
*W1: More details on optimizing K∗ from ">1" to "=1" for improving query efficiency in Section 3.2 should be provided, as it is the most important part of the whole paper. According to the current version, the proposed method is too simple and too straightforward. The effect of the value of K∗ on model convergence should also be analyzed in this section.
We carefully analyze the optimal K∗ by balancing convergence complexity and query count, with the goal of achieving the fastest convergence with the fewest queries. Increasing K, the number of activated clients, can accelerate convergence by leveraging more data; however, it also increases query overhead, leading to higher communication and computational costs. We rigorously demonstrate that the number of queries required for Fed-BDPL to achieve an ϵ-solution grows with K∗. This result emphasizes that optimal query efficiency is achieved when only a single client is activated per round.
*W2: The experiment does not contain related baselines to compare with, such as [1],[2]. As a result, it is very hard to evaluate whether the FedOne reflects the SOTA performance.
[1] Black-box Prompt Tuning for Vision-Language Model as a Service. IJCAI 2023.
[2] Fedbpt: Efficient federated black-box prompt tuning for large language models. ICML 2024.
Thank you for your suggestion about the experimental baseline:
- Reference [1] focuses on visual and language model integration, making it unsuitable for our study. Specifically, the framework in [1] jointly optimizes prompts for different modalities by sharing the intrinsic parameter subspaces of both visual and language modalities, a process that necessitates interaction between these two modalities.
- Reference [2], cited in Section 4.2 (line #338, Sun et al., 2022), serves as a key baseline in our experimental evaluation. We have adapted this work as the "FedOne-BBT" baseline (line #332), allowing for a direct comparison between our proposed FedOne algorithm and existing state-of-the-art methods.
*W3: It is unclear whether the number of clients has an effect on the performance of FedOne.
In our theoretical analysis, the total number of clients N does not impact the performance of FedOne. This is because our analysis accounts for the inherent randomness introduced by client sampling. This randomness, stemming from the stochastic nature of client selection, must be addressed and eliminated, as the parameter α we aim to optimize depends on the contributions of all clients. Consequently, our theoretical framework does not consider the effect of the total number of clients, N, on the algorithm's performance. Instead, the key factor influencing convergence is the number of activated clients, K. To further investigate the impact of the number of clients, we will conduct experiments focusing on the parameter N.
Dear reviewer, thank you again for your time and effort put into reviewing our manuscript. Please let us know if our responses have addressed your concerns. If you have remaining concerns, please don't hesitate to let us know and we are happy to address them. Thank you!
Thank the authors for the explanation. Most of my concerns have been addressed. According to the current version of the paper, I prefer to keep my score.
We have uploaded a revised version addressing some of the issues identified during the discussion (highlighted in blue). Your review and feedback would be greatly appreciated.
| Section and line number | Description | Reviewer comments |
|---|---|---|
| eq(12) #line 1495 #line 1508 | Fixing Errors | PV8j(W4) |
| #line 312 | The value of n in this paper | PV8j(W2) |
| Corollary 2 #line 1589 | Re-derivation of Corollary 2 | QMks(W1) xxa2(W2) |
| Remark 4 #line 1611 | Principle of Corollary 2 | 4cVb(Q1) QMks(W1) |
This paper focuses on federated Black-Box Discrete Prompt Learning (BDPL) and offers the first theoretical analysis of this approach, uncovering valuable insights from the findings. Specifically, the authors demonstrate that utilizing FedAvg with a single client in each training round yields the highest query efficiency. The numerical results corroborate this theoretical conclusion, reinforcing the validity of the proposed method.
Strengths
- The paper presents a convergence error analysis of federated BDPL.
- Building on these theoretical results, the authors optimized federated BDPL by strategically determining the number of participating clients for each training round.
Weaknesses
- The paper builds upon an existing federated BDPL framework.
- The rationale behind the FedONE algorithm is not adequately explained.
Questions
- The rationale behind the FedONE algorithm lacks a proper explanation. Specifically, if I understand correctly, FedONE aggregates only one client in each round. Does this imply that, in each round, the clients will set their local model parameters α to match those of the selected client? Intuitively, this could lead to significant fluctuations in model training, particularly when the local data across clients are highly heterogeneous. Further clarification on how to interpret and address this phenomenon is needed.
- Additional experiments should be included to assess the impact of data heterogeneity across clients.
We sincerely appreciate your insightful comments and constructive feedback, which have been invaluable in improving the quality and clarity of our manuscript. Below, we provide detailed responses to the identified weaknesses and questions.
W1: The paper builds upon an existing federated BDPL framework.
This article shares similarities with an existing federated BDPL framework; however, we address a distinct problem that has not been explored in previous research:
- Prior research on federated black-box prompt tuning has largely overlooked the significant query costs associated with cloud-based LLM services and has not provided corresponding theoretical analysis. We are the first to highlight that previous studies have neglected the issue of query efficiency in this context.
- We conduct a rigorous theoretical analysis, with results showing that by restricting each round of activation to a single client, FedOne achieves optimal query efficiency. Additionally, the corresponding experimental results validate the effectiveness of FedOne, offering an effective paradigm for addressing the black-box problem in scenarios with limited computational resources.
- We utilize the Gumbel-Softmax technique to reparameterize the categorical distribution over the prompt vocabulary and apply the policy gradient method to approximate gradients in the black-box setting. This enables optimization in the parameter space rather than in label probabilities, thereby mitigating bias. This approach allows for more stable model training while facilitating theoretical analysis based on unbiased, rather than biased, gradients.
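To make the sampling step concrete, here is a minimal sketch of Gumbel-Softmax prompt sampling; the function name `sample_prompt`, the shapes, and the argmax discretization are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sample_prompt(alpha, tau=1.0, rng=None):
    """Gumbel-Softmax sampling of a discrete prompt (illustrative sketch).

    alpha: (n, N) logits, one categorical distribution over an N-token
    vocabulary for each of the n prompt positions. Names, shapes, and the
    argmax discretization are assumptions, not the paper's implementation.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=alpha.shape)
    gumbel = -np.log(-np.log(u))                   # Gumbel(0, 1) noise
    relaxed = np.exp((alpha + gumbel) / tau)
    relaxed /= relaxed.sum(axis=1, keepdims=True)  # relaxed (differentiable) one-hot
    tokens = relaxed.argmax(axis=1)                # hard tokens sent to the black-box API
    return tokens, relaxed

tokens, _ = sample_prompt(np.zeros((5, 100)))      # 5 positions, 100-token vocabulary
print(tokens)
```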
W2: The rationale behind the FedONE algorithm is not adequately explained.
We will provide a detailed explanation of the rationale behind the FedOne algorithm:
- An intuitive explanation for the effectiveness of FedOne lies in its ability to maximize the utility of each LLM query. By activating only one client per round, FedOne ensures that each query contributes significantly to the global model update. This selective client activation minimizes redundant computations and reduces communication overhead, optimizing the overall query efficiency within federated learning frameworks.
- We have rigorously demonstrated, through theoretical proof, that the number of queries required for Fed-BDPL to achieve an ϵ-solution grows with K∗, where K∗ represents the number of activated clients. This result highlights that optimal query efficiency is achieved when only a single client is activated per round, as is the case with FedOne.
- Experimental results presented in Figure 2 and Table 3 show that by utilizing the minimum number of activated clients—FedOne—the federated learning framework achieves the highest possible query efficiency. This approach maximizes efficiency between clients and the server, leading to a more efficient learning process.
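For concreteness, here is a minimal sketch of one FedOne round under these assumptions; the interface `fedone_round`, the list of per-client gradient estimators, and the local-step count are all hypothetical:

```python
import numpy as np

def fedone_round(alpha_global, client_grad_fns, local_steps=5, lr=0.1, rng=None):
    """One FedOne round (hypothetical sketch; names and interface are assumptions).

    Exactly one client is activated: it receives the global prompt-distribution
    parameters alpha, performs a few local updates (each gradient estimate
    costing one black-box LLM query), and its result becomes the new global
    alpha, i.e., the K* = 1 special case of FedAvg aggregation.
    """
    rng = rng or np.random.default_rng()
    grad_fn = client_grad_fns[rng.integers(len(client_grad_fns))]  # sample one client
    alpha = alpha_global.copy()
    for _ in range(local_steps):
        alpha -= lr * grad_fn(alpha)  # one LLM API query per gradient estimate
    return alpha                      # aggregating a single client = adopting its update

# Toy usage: three clients whose "gradients" pull alpha toward different targets.
clients = [lambda a, t=t: a - t for t in (0.0, 1.0, 2.0)]
print(fedone_round(np.zeros(4), clients))
```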
*Q1: The rationale behind the FedONE algorithm lacks a proper explanation. Specifically, if I understand correctly, FedONE aggregates only one client in each round. Does this imply that, in each round, the clients will set their local model parameters α to match those of the selected client? Intuitively, this could lead to significant fluctuations in model training, particularly when the local data across clients are highly heterogeneous. Further clarification on how to interpret and address this phenomenon is needed.
- We provide a detailed explanation of the rationale behind the FedONE algorithm: Previous research on federated black-box prompt tuning has largely neglected the significant costs associated with querying LLM cloud services. This omission is crucial, as querying external LLMs often entails substantial computational and financial expenses, limiting the scalability and practicality of such methods. Additionally, prior studies have not presented a comprehensive convergence analysis for federated black-box prompt tuning, particularly for the optimization of discrete prompts. In contrast, our work focuses on enhancing the query efficiency of the federated BDPL algorithm, addressing both cost and efficiency challenges. Through rigorous theoretical analysis and empirical validation, we demonstrate that the FedONE algorithm effectively reduces query overhead while maintaining competitive convergence performance. This emphasis on query efficiency provides a more cost-effective and scalable solution for real-world federated learning applications.
- The parameter α does not represent local model parameters. Instead, α parameterizes the categorical distribution over prompt tokens and follows a transmission-update-aggregation process in our algorithm. Specifically, α is transmitted from the server to the client, updated locally on the client, and then aggregated back to the server.
- Data heterogeneity presents a significant challenge in federated learning; however, it falls outside the primary scope of our current research. In Remark 1, we provide a preliminary analysis of the impact of gradient diversity—an important measure of data heterogeneity—on convergence. This analysis highlights how variations in client data can affect the overall model training process. While this issue is not the focus of the present work, we will explore the effects of data heterogeneity on convergence in greater depth.
*Q2: Additional experiments should be included to assess the impact of data heterogeneity across clients.
Data heterogeneity poses a significant research challenge. In Remark 1, we present an initial analysis of the impact of gradient diversity, a measure of data heterogeneity, on convergence, highlighting that greater gradient diversity results in slower algorithmic convergence. Our primary focus is on the query efficiency of the federated BDPL algorithm, and we demonstrate the effectiveness of the FedOne algorithm through rigorous theoretical analysis and experimental validation. The results show that the algorithm achieves optimal query efficiency when activating a single client per round. We will incorporate experiments related to data heterogeneity, such as class imbalance.
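As a concrete possibility for such experiments, a standard way to induce class imbalance is a Dirichlet partition of the training labels across clients; the sketch below is an assumed protocol, not the paper's stated setup:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta=0.5, seed=0):
    """Dirichlet(beta) label-skew split, a standard FL heterogeneity protocol.

    An assumed experimental design for a class-imbalance study, not the
    paper's stated setup. Smaller beta -> more heterogeneous clients.
    """
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet([beta] * num_clients)           # class share per client
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

parts = dirichlet_partition(np.repeat(np.arange(4), 100), num_clients=5, beta=0.1)
print([len(p) for p in parts])  # highly uneven sizes indicate strong label skew
```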
Dear reviewer, thank you again for your time and effort put into reviewing our manuscript. Please let us know if our responses have addressed your concerns. If you have remaining concerns, please don't hesitate to let us know and we are happy to address them. Thank you!
Thanks for the authors' response. They have provided a clear explanation of the rationale behind FedONE. However, data heterogeneity is a key aspect of federated learning that should not be overlooked. Given the theoretical contributions of the paper, I have decided to raise my score.
Thank you for your recognition and suggestion. We are honored to address your query and will further investigate data heterogeneity in our follow-up work.
We have uploaded a revised version addressing some of the issues identified during the discussion (highlighted in blue). Your review and feedback would be greatly appreciated.
| Section and line number | Description | Reviewer comments |
|---|---|---|
| eq(12) #line 1495 #line 1508 | Fixing Errors | PV8j(W4) |
| #line 312 | The value of n in this paper | PV8j(W2) |
| Corollary 2 #line 1589 | Re-derivation of Corollary 2 | QMks(W1) xxa2(W2) |
| Remark 4 #line 1611 | Principle of Corollary 2 | 4cVb(Q1) QMks(W1) |
This paper addresses Black-box Discrete Prompt Learning (BDPL) in Federated Learning and proposes FedOne. The proposed FedOne selects one client at each round, and the selected client updates the sample probability for each token at different positions. This work provides a theoretical analysis and shows the convergence rate of the proposed algorithm. Extensive experiments show the remarkable performance of the proposed work.
Strengths
- This work proposes an interesting solution to addressing BDPT in federated learning.
- The work reveals an interesting finding: sampling one client at every round can provide remarkable performance.
- The experiments are conducted on the GLUE benchmark and two pretrained models, i.e., RoBERTa and GPT-3.5, and the experimental results are promising.
Weaknesses
- The proposed method requires each client to store n×N αs. Nowadays, the pretrained embedding layer of an LLM gets larger, e.g., from 32k in LLaMA-2 to 128k in LLaMA-3. Therefore, it incurs a large amount of computation and communication overhead.
- This paper does not explicitly discuss the effect of n. This should be part of the ablation study.
- The review of FL is insufficient. I see there are a number of works addressing LLM fine-tuning under federated learning. The authors should concretely discuss why the existing works are infeasible.
- The upper bound on the learning rate, η∗, goes to an order of O(T). This sounds really weird. I think the authors should justify the reason.
Questions
See Weaknesses.
We sincerely appreciate your insightful comments and constructive feedback, which have been invaluable in improving the quality and clarity of our manuscript. Below, we provide detailed responses to the identified weaknesses.
W1: The proposed method requires each client to store a size of n×N αs. Nowadays, the pretrained embedding layer of an LLM gets larger, e.g., from 32k in LLaMA-2 to 128k in LLaMA-3. Therefore, it costs a large amount of computation and communication overhead.
The parameter α, used for generating the discrete prompt, is much smaller than the model parameters. As shown in Table 2, FedOne exhibits significantly low computational and communication overhead, which is a critical factor in its scalability and practical applicability in federated learning scenarios. By minimizing both the computational resources required for processing and the frequency of communication between clients and the server, FedOne ensures efficiency, even in resource-constrained environments.
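A rough size comparison illustrates the point; the prompt length n and vocabulary size N below are assumed values, not the paper's configuration:

```python
# Hedged size comparison; n and N below are illustrative assumptions.
n = 50                        # prompt length (assumed)
N = 128_000                   # vocabulary size, e.g., LLaMA-3 scale
alpha_params = n * N          # trainable prompt-distribution parameters
llm_params = 7_000_000_000    # a 7B-parameter LLM, for scale

print(f"alpha: {alpha_params:,} parameters (~{alpha_params * 4 / 2**20:.0f} MB fp32)")
print(f"fraction of a 7B LLM: {alpha_params / llm_params:.2e}")
# ~6.4M parameters (~24 MB) versus billions: modest client-side storage and
# per-round communication compared to hosting or training the model itself.
```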
W2: This paper does not explicitly discuss the effect of n. This should be part of the ablation study.
- Theoretically, as demonstrated in Theorem 1, a larger value of n leads to a slower convergence rate for Fed-BDPL. This can be intuitively understood, as increasing the prompt length requires more epochs to effectively identify and optimize the most relevant prompt tokens. A larger n adds complexity to the search space for the optimal token configuration, thus prolonging the convergence process.
- Experimentally, to further investigate the impact of prompt length, we will add an ablation study focusing on the parameter n.
W3: The review of FL is insufficient. I see there are a number of works addressing LLM fine-tuning under federated learning. The authors should concretely discuss why the existing works are infeasible.
Thank you for suggesting the inclusion of the LLM fine-tuning baseline. We have incorporated the most relevant work on Federated Prompt Tuning in Section 5.3:
- Existing research on fine-tuning LLMs in federated learning often assumes that the models are open-source, with full access to their parameters for direct modification. In contrast, our work focuses on black-box learning scenarios, such as those involving commercial large language models. These models offer only inference services, limiting access to their underlying parameters. Users can interact with them by providing inputs and receiving outputs via API calls but cannot modify model weights or apply traditional gradient-based fine-tuning. Consequently, fine-tuning methods are inapplicable in these closed-system settings, where internal parameters remain inaccessible.
- Additionally, fine-tuning in FL generally assumes that clients have significant computational resources to manage the extensive load required for retraining large models. However, this is often impractical in FL scenarios, where clients are typically resource-constrained devices with limited processing power. Requiring substantial computational capabilities for fine-tuning can lead to scalability issues and limit the applicability of such methods in large-scale, decentralized settings. In contrast, our approach focuses on methods that do not require direct access to the model or heavy client-side computation, making it more suitable for practical deployment in resource-limited environments.
W4: The upper bound on the learning rate, η∗, goes to an order of O(T). This sounds really weird. I think the authors should justify the reason.
Thank you for pointing out the issue. We identified an error in Equation (12), where an extra 1/T factor appeared in the denominator due to a substitution error on line #1495. This mistake has been corrected and does not affect the convergence results:
$$-1 + \lambda L \eta + 2\eta^{2} L^{2} E\left(1 + \frac{1}{K_{\ast}}\right) \leq 0$$

$$0 < \eta \leq \eta^{\ast} = \frac{-\lambda L + \sqrt{\lambda^{2} L^{2} + 8 L^{2} E\left(1 + \frac{1}{K_{\ast}}\right)}}{8 L^{2} E\left(1 + \frac{1}{K_{\ast}}\right)}$$
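As a quick numerical sanity check (an editorial sketch with arbitrary illustrative constants, not values from the paper), one can plug this η∗ into the condition above and confirm that it is satisfied, and that η∗ no longer depends on T:

```python
import numpy as np

# Plug the corrected eta* into the condition -1 + lam*L*eta + 2*eta^2*L^2*E*(1+1/K) <= 0.
# lam, L, E, K are arbitrary illustrative values, not constants from the paper.
lam, L, E, K = 1.0, 1.0, 1.0, 1.0
denom = 8 * L**2 * E * (1 + 1 / K)
eta_star = (-lam * L + np.sqrt(lam**2 * L**2 + denom)) / denom
lhs = -1 + lam * L * eta_star + 2 * eta_star**2 * L**2 * E * (1 + 1 / K)
print(f"eta* = {eta_star:.4f}, condition value = {lhs:.4f}  (<= 0, as required)")
# eta* is independent of T after the correction, so the O(T) issue disappears.
```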
Thanks for your response. I have a couple of follow-up comments regarding your response:

- The authors claim that a smaller n leads to a faster convergence rate. In the extreme case, the fastest convergence rate would be achieved at n = 1. From common sense, it is impossible for a prompt with a single vocabulary token to achieve outstanding performance. I failed to find the setting of n in the experimental design. Can the authors explicitly show the value of n? Based on your observation, what vocabulary should the prompt include?
- Table 2 claims that the proposed FedOne achieves a minimal trainable parameter size on clients compared to the white-box approaches. With a LoRA adapter enabled, the trainable size in those white-box approaches largely depends on the rank. I think the authors should consider other federated LLM fine-tuning approaches, such as [1, 2, 3].
- There is a closely related work [4] that the authors do not cite. This work also employs BP-free training methods. I understand the proposed approach differs from FwdLLM [4], but the authors should discuss this work in the paper.
- Up to now, ICLR still allows paper revision. However, I cannot see that the authors have revised their paper, which makes it hard to convince me that ICLR should accept this work.
References:
[1] Improving LoRA in Privacy-preserving Federated Learning.
[2] FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning.
[3] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model.
[4] FwdLLM: Efficient FedLLM using Forward Gradient.
We have uploaded a revised version addressing some of the issues identified during the discussion (highlighted in blue). Your review and feedback would be greatly appreciated.
| Section and line number | Description | Reviewer comments |
|---|---|---|
| eq(12) #line 1495 #line 1508 | Fixing Errors | PV8j(W4) |
| #line 312 | The value of n in this paper | PV8j(W2) |
| Corollary 2 #line 1589 | Re-derivation of Corollary 2 | QMks(W1) xxa2(W2) |
| Remark 4 #line 1611 | Principle of Corollary 2 | 4cVb(Q1) QMks(W1) |
The paper proposes a novel method for federated black-box discrete prompt learning. The paper analyses the convergence in terms of the number of LLM queries, and shows that sampling a single client in every iteration is the optimal choice to achieve the best sample efficiency in terms of the number of LLM API queries.
The reviewers commended that the paper provides a novel perspective of focusing on the LLM API query efficiency in federated prompt learning, and the proposed theory-inspired approach of sampling one client in every round is interesting.
However, the reviewers also expressed important concerns about the paper, which I agree with. For example, the theoretical analysis focusing on API query efficiency may have overlooked the noise from sampling a small number of clients in every round; as a result, the conclusion that sampling a single client in every round is optimal may be misleading. In addition, the resulting algorithm may be overly simple. The fact that the theoretical results do not depend on the number of clients also seems a little unconventional to me. Also, the important factor of heterogeneity is not yet considered in the analysis or the algorithm.
As a result, the paper still has room for improvement. I believe a careful revision of the analysis and the algorithm would greatly benefit the paper, and hence recommend rejection.
Additional Comments from the Reviewer Discussion
During rebuttal, the reviewers expressed important concerns regarding the reasonableness of the theoretical results.
Reject