PaperHub

Overall rating: 6.3/10 · Poster · NeurIPS 2024
4 reviewers — ratings: 5, 6, 7, 7 (min 5, max 7, std 0.8)
Confidence: 3.0 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 2.8

Unravelling in Collaborative Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Collaborative learning · unravelling · adverse selection

Reviews and Discussion

Review (Rating: 5)

The paper models the collaborative learning problem using a statistical model. In this model, since each agent has its own utility trading off risk against the cost of sampling, the Nash equilibrium the agents converge to may not achieve the optimal overall risk. More precisely, the process may undergo a phenomenon known as unravelling, wherein the coalition contracts to the point that it becomes empty or consists solely of the worst agent. The paper designs a mechanism to avoid this phenomenon and provides a practical implementation of the transfer-free mechanism.

Strengths

  • The paper models the collaborative learning problem in a manner that is both rigorous and easy to understand.
  • The paper identifies the "unravelling" problem during the process of collaborative learning and provides a solution. This seems like a very important issue for making collaborative learning work.
  • The paper clearly lists the hypotheses used in every theorem, making it easy to understand the conditions under which the theorems apply.

Weaknesses

  • The paper does not show empirically when the "unravelling" problem occurs or whether the mechanism in Section 4.2 makes a difference. It is not clear whether the problem is common in practice.
  • The empirical implementation of the mechanism in Section 4.2 requires Hypothesis 7, which may not be feasible in practice if the aggregator does not know $P_0$ exactly or the number of samples is relatively small, i.e., less than $q'$.
  • The notation is confusing, e.g., $\underline{n}$ is used to denote an expression of the form $\arg\max_{n}\cdots$.

Questions

  • Hypothesis 4 appears to be a bound on the number of agents. Could you provide an example of the number of agents required in practice?
  • As mentioned in the weaknesses section, could you demonstrate the "unravelling" problem and the mechanism to address this issue in practice? How common is this problem?
  • Is the selection of $\eta_{\delta}$ in Proposition 3 optimal?

Limitations

I do not foresee any negative societal impact from this paper. For other limitations, please see the questions and weaknesses sections.

Author Response

We thank the reviewer for their feedback, and answer their questions below.

"The paper does not show empirically when the "unraveling" problem occurs or whether the mechanism in section 4.2 makes a difference. It is not clear whether the problem is common in practice."

Unravelling is a common issue which arises in many real-world markets, as documented by several econometric studies. For instance, [18] and [19] study the phenomenon in the American insurance market. Our mechanism in Section 4.2 is largely inspired by verification methods, which are actually implemented by insurance companies to limit adverse selection. Regarding the specific case of collaborative learning, our study is still prospective since, to our knowledge, currently deployed federated learning systems do not explicitly give agents the possibility to opt out (think of the Google keyboard, which is trained while the phone is charging and the user is away). However, we expect collaborative learning systems involving strategic agents to develop, in particular among competing firms (see for instance [20] for the power market). Given that such markets will likely feature information asymmetry, we believe that our work provides guidance for ensuring the stability of such decentralized learning systems. In this sense, our main objective is to highlight to policy makers the risk posed by adverse selection in nascent data markets. We thank the reviewer for this remark; we will make the practical importance of unravelling, and the solution brought by our mechanism $\hat{\Gamma}$, clearer in the final version of the text.

"The empirical implementation of the mechanism in section 4.2 requires hypothesis 7, which may not be feasible in practice...$.

The aggregator does not need access to the full distribution $P_0$, but indeed needs to have at their disposal $q'$ i.i.d. samples from it. We agree that when $q'$ is very small, the quality of the type estimation would be low, with adverse effects on the stability of the coalition and on welfare under $\hat{\Gamma}$. This mechanism is typically suited to regimes where $q'$ is large enough to estimate the types, but not too large, so that we can still largely benefit from collaborative learning. However, we would like to recall that even when $q'$ is very small, it is still possible to re-establish the truthful, optimal collaboration by means of VCG transfers, as explained in our answer to reviewer K1qa. Indeed, the VCG mechanism ensures that participating in the coalition and reporting their true types is a dominant strategy for agents, by aligning their individual payoffs with social welfare. We thank the reviewer for this remark, which we will include in the final text.

"The notation is confused..."

The reason why we opted for this notation is that $\underline{u}$ (the utility under the outside option) is the minimum utility the aggregator must ensure within the coalition, hence the underline. $\underline{n}$ inherits this underline because it is the number of samples which achieves $\underline{u}$. We would however be glad to take into account any alternative suggestion from the reviewer.

"Hypothesis 4 appears to be a bound on the number of agents. Could you provide an example of the number of agents required in practice?"

Since the bound in H4 features many problem-dependent quantities (such as the amplitude of the types, the utility parameters $a$ and $c$, etc.), it is hard to provide a one-size-fits-all answer to this question. The paper has been written with the rationale that $J$ is large, as we expect federated learning systems to involve many agents. However, even in the case where $J$ is too small for H4 to hold, Theorem 1 remains valid. We would only be unable to characterize $L^{opt}$ as accurately, that is, with the optimal total number of samples $\bar{N}$ (instead, we can only guarantee that the total number of samples is smaller than $\bar{N}$). Hence, our results would remain true even with a small number of players, but would be slightly less precise. We thank the reviewer for pointing this out, and we will add this remark to our final text.

"As mentioned in the weaknesses section, could you demonstrate the "unraveling" problem and the mechanism to address this issue in practice? How common is this problem?"

We refer the reviewer to our reply to the mentioned weakness.

"Is the selection of ηδ\eta_{\delta} in proposition 3 optimal?"

We acknowledge that this is an excellent question if one is interested in obtaining sharp bounds in Theorem 2, and we thank the reviewer for pointing it out. After looking at the literature, we are not sure whether the proposed estimator achieves optimality, for instance in the minimax sense. We however suspect, from elementary considerations, that the parametric rate $1/q$ is close to optimal. Due to its complexity, and since it would not change our conclusions, the question of the optimality of $\eta_\delta$ is, in our opinion, beyond the scope of the paper, and we leave it for future research.

We hope that we replied to the reviewer's concerns, and would be happy to answer any additional question.


References
[18] Einav, L., Finkelstein, A., & Cullen, M. R. (2010). Estimating welfare in insurance markets using variation in prices. The Quarterly Journal of Economics, 125(3), 877-921.

[19] Hendren, N. (2013). Private information and insurance rejections. Econometrica, 81(5), 1713-1762.

[20] Pinson, P. (2023). What may future electricity markets look like? Journal of Modern Power Systems and Clean Energy, 11.

Comment

Thank you for your response and I will keep my rating.

Comment

Thanks for the clarifications. I have updated the review and rating.

Review (Rating: 6)

Learners can be strategic, so forming collaborations for collaborative learning is not trivial. When data qualities are private, coalitions may undergo unravelling, wherein only the worst agents are left in the coalition.

The authors propose a probabilistic verification-based mechanism that makes the optimal collaboration a Nash equilibrium with high probability.

Strengths

The authors consider the formation of coalitions with strategic agents whose distributions are of differing quality.

The authors also show that, under their mechanism, the optimal coalition arises at a Nash equilibrium with high probability, and they back this up with rigorous theoretical proofs.

Weaknesses

The authors do not show whether the agents report their type profile truthfully. Instead, they have the aggregator approximate it based on the agents' probability distributions. Also, experimental verification isn't provided.

Questions

As the aggregator is estimating the contributors' types, it needs to have access to the probability distribution from which they are sampling data. How practical is this scenario?

Limitations

As proposed by the authors, it would have been better if agents learned their types themselves in an online fashion.

Author Response

We thank the reviewer for their comments, although we are quite surprised by the provided summary, which does not correspond to our paper. We believe that there was a mistake on their side. We nevertheless reply to their questions below.

"Weaknesses: the authors don’t show that if the agents report the type profile of the agents truthfully. They rather consider that the aggregator to approximate it based on their probability distribution."

We appreciate the reviewer's feedback and would like to clarify the principles of our mechanisms, as there may have been a slight misunderstanding. To recap, we consider two mechanisms. The first is a direct-revelation mechanism which asks agents to report their types. In this naive approach, incentive-compatibility is not enforced, which leads the coalition to be either empty or reduced to a singleton at any equilibrium; this precisely corresponds to unravelling. We then propose a new mechanism which recovers the grand coalition as a pure Nash equilibrium. Contrary to the standard direct-revelation approach, it does not ask agents to truthfully report their profile of types, but only whether they wish to participate.

Contrary to what the reviewer suggests, we believe that this feature is one of its strengths. Indeed, the fact that the agents' action space is reduced to either opting in or out prevents them from strategically manipulating the mechanism. This allows us to circumvent the common difficulties of implementing incentive-compatibility. We would like to stress that our approach rests on a well-established line of literature in mechanism design, at the intersection of verification-based [11] and evidence-based [12] methods. Additionally, our mechanism does not suppose agents know their type, as their participation constraint holds with high probability by design.

An alternative procedure, which may be closer to what the reviewer has in mind, would be to:

  1. ask agents to declare a type,
  2. estimate actual types based on the provided samples,
  3. design penalties for agents with declared types which are not aligned with the estimation made in step (2) (for instance, through accuracy shaping).

Even though this lying-cost-like approach [13] could recover the optimal grand coalition as a pure Nash equilibrium by enforcing incentive compatibility, we were not able to find additional beneficial features or a clear improvement over our proposed mechanism. In addition, it seems to us that this alternative mechanism asking for types is more complicated to analyze than our proposed solution. Finally, such a procedure would require the agents to know their own type, in contrast to our mechanism.

"Also, experimental verification isn’t provided."

If the reviewers agree with this point, we will add experiments on toy examples in our revised version to empirically validate our results. In particular, we will:

  • provide a figure illustrating the socially optimal allocation,
  • compare the optimal allocation with the one induced by the mechanism $\hat{\Gamma}$,
  • empirically verify that the strategy profile $(1,\ldots,1)$ is a Nash equilibrium under the mechanism $\hat{\Gamma}$.
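To give a flavor of what such a toy illustration could look like, here is a minimal simulation of the unravelling dynamic under a stylized model of our own devising (an additive distribution-shift term $\theta_j$ plus a $\beta/\sqrt{N}$ statistical-error term; the utility form and parameters are assumptions for illustration, not the paper's exact model):

```python
import numpy as np

# Stylized unravelling dynamic (assumed model, not the paper's exact utility):
# training alone yields risk ~ theta_j + beta / sqrt(n), while joining a
# coalition S yields risk ~ mean_{k in S} theta_k + beta / sqrt(|S| * n).
# Agents sequentially opt out whenever going alone gives strictly lower risk.
theta = np.linspace(0.1, 1.0, 10)   # data-quality types (lower = better data)
beta, n = 0.2, 1                    # error-scale constant, samples per agent

coalition = list(range(len(theta)))
while True:
    pooled_risk = theta[coalition].mean() + beta / np.sqrt(len(coalition) * n)
    solo_risk = theta[coalition] + beta / np.sqrt(n)
    leavers = [j for j, s in zip(coalition, solo_risk) if s < pooled_risk]
    if not leavers:
        break
    print(f"agent with theta = {theta[leavers[0]]:.1f} opts out")
    coalition.remove(leavers[0])    # the best remaining agent leaves first

print("surviving types:", theta[coalition])   # only the worst agents remain
```

With these (arbitrary) parameters, the best agents opt out one by one and only the worst types survive, reproducing the unravelling pattern discussed above.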

"As the aggregator is estimating the types of contributor's types, they need to have access to the probability distribution from which they are sampling data. How practical is this scenario?"

Estimating types does not require access to the full sampling distribution of each agent (which would indeed be unrealistic); it only requires each contributor to send $q$ samples from $P_j$ to the aggregator. That being said, an issue the reviewer may have in mind is that agents might be tempted to twist their sampling process to fool the mechanism, that is, to provide samples from a distribution which is not $P_j$. This issue falls in the category of hidden action, and several methods have been devised in the literature to solve it [14, 15]. In our work, we are interested in hidden information rather than hidden action. Since concurrently addressing moral hazard and adverse selection is a notoriously hard problem in mechanism design, we leave aside the former by assuming that agents honestly sample from their distribution $P_j$. We would like to stress that this choice is customary in the strategic learning literature [16] and more generally in the mechanism design community [17].

"As authors proposed by the authors, it would have been better if agents learned their types themselves in an online fashion."

As argued in our previous reply, our verification-based mechanism does not require agents to know their own types for them to benefit from joining the coalition with high probability. We will change the text in the conclusion of the revised version accordingly.

We hope we have addressed the reviewer's concerns. If they are not satisfied with our response, we would be happy to answer any further questions.


References
[11] Green & Laffont (1986). Partially verifiable information and mechanism design. The Review of Economic Studies, 53(3), 447-456.

[12] Grossman (1981). The informational role of warranties and private disclosure about product quality. The Journal of Law and Economics, 24(3), 461-483.

[13] Lacker & Weinberg (1989). Optimal contracts under costly state falsification. Journal of Political Economy, 97(6).

[14] Karimireddy, Guo, & Jordan (2022). Mechanisms that incentivize data sharing in federated learning. arXiv preprint.

[15] Huang, Karimireddy, & Jordan (2023). Evaluating and Incentivizing Diverse Data Contributions in Collaborative Learning. arXiv preprint.

[16] Ananthakrishnan, N., Bates, S., Jordan, M., & Haghtalab, N. (2024, April). Delegating data collection in decentralized machine learning. PMLR.

[17] Laffont, J. J., & Martimort, D. (2009). The theory of incentives: The principal-agent model. Princeton University Press.

Review (Rating: 7)

Adverse selection is a phenomenon studied in economics in which information asymmetry may have a negative effect on the market equilibrium. The paper considers a federated learning setting where the agents are strategic, and the authors formalize the concept of "adverse selection" and analyze it in several directions. The question they ask is the following: assume that different agents are able to collect samples of different quality and share them with other agents. In this case, agents with high-quality samples may benefit little from sharing their information. This may have a spiral effect so that no one would be happy to share their samples. The authors formalize this problem and analyze it through a series of results:

(1) they model the above mechanism in which each agent $i$ can sample data points with quality $\theta_i$

(2) they show how a benevolent social planner can maximize the overall utility for all agents if she knows all the $\theta_i$'s

(3) they formalize the strategic setting in two cases and ask whether unravelling may occur:

  • They show that this may indeed occur if the social planner is unaware of the quality of the samples. Assuming that agents are asked to report the value of $\theta_i$, they show that in this case the Nash equilibrium is one where the lowest-quality agent, or no agent at all, shares their samples.

  • Motivated by this negative result, the authors consider a different scenario in which the agents do not report the quality of their samples, but instead share a fixed number of samples if they want to participate in the aggregation process. Using these samples, the planner estimates the quality of their data and asks for more samples if needed. In this case, all agents reveal the true quality of their samples, so the social planner can maximize the overall utility function and unravelling does not occur.

I find the results interesting and solid (although I didn't check the proofs). The presentation is a bit hard to follow, but I don't see an easier way of presenting all of these theoretical results.

Strengths

  • solid mathematical modeling and analysis
  • interesting result: extending an interesting concept called unravelling to federated learning

Weaknesses

  • hard to follow all details

Questions

I didn't understand the subsection about the VCG mechanism. What is it, and why is it not available in your framework?

Limitations

The authors have discussed the future directions in the conclusion.

Author Response

We thank the reviewer for their feedback, and are glad that they find our results interesting. We answer their questions and remarks below.

"Weaknesses: hard to follow all details."

We understand that our notations may appear somewhat intricate and our text dense. We tried to lighten them as much as possible, and to strike a balance between legibility and exactness despite the inherent complexity of the topics under study. In case the reviewers have suggestions to improve the notations, we would gladly take them into account.

"I didn't understand the subsection about VCG mechanism. What is it and why it is not available in your framework?"

The VCG mechanism [7, 8, 9] is a fundamental concept in mechanism design. It is a generic transfer-based mechanism which allows the principal to maximize social welfare in the presence of strategic agents. More precisely, it ensures that participating and bidding one's true type is a dominant strategy for all agents. Incentive-compatibility is achieved through monetary transfers, which align individual payoffs with total welfare by paying each agent the total value created for the other agents. An introduction to the topic can for instance be found in [10, Chapter 9]. Given the paramount importance of the VCG approach in the mechanism design literature, it seemed important to mention it. However, it is not available in our case because it relies on monetary transfers, which we do not allow in our study. Indeed, we only allow the principal to implement transfer-free mechanisms, since monetary payments to participants seem unlikely in many collaborative learning applications. We nonetheless agree that the subsection dedicated to VCG may be too allusive, and we will flesh it out in the final version of the text. In particular, we will (i) explicitly write down the VCG payment and provide an intuitive explanation, and (ii) prove in a lemma that at least one player must receive a non-zero monetary payment when using VCG.
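For reference, the textbook Clarke-pivot form of the VCG payment (the standard formulation from the mechanism design literature, stated generically here rather than in the paper's specific model) charges each agent the externality it imposes on the others:

$$
t_i(\hat{\theta}) \;=\; \max_{a}\, \sum_{j \neq i} v_j(a, \hat{\theta}_j) \;-\; \sum_{j \neq i} v_j\big(a^{*}(\hat{\theta}), \hat{\theta}_j\big),
$$

where $a^{*}(\hat{\theta})$ is the welfare-maximizing allocation under the reported types $\hat{\theta}$ and $v_j$ is agent $j$'s valuation. Truthful reporting becomes a dominant strategy, but the payments $t_i$ are in general non-zero, which is precisely what a transfer-free setting rules out.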


References
[7] Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. The Journal of finance, 16(1), 8-37.

[8] Clarke, E. H. (1971). Multipart pricing of public goods. Public choice, 17-33.

[9] Groves, T. (1973). Incentives in teams. Econometrica: Journal of the Econometric Society, 617-631.

[10] Nisan, N., Roughgarden, T., Tardos, É., & Vazirani, V. V. (Eds.) (2007). Algorithmic Game Theory. Cambridge University Press.

Comment

I agree that the complicated notation is somewhat necessary for presenting all the technical results. I think the VCG section should either be removed or something should be added there; in its current form it is not adding anything to the paper. Given that the authors make this change, I am updating my score to an accept.

Comment

We deeply thank the reviewer for having updated their evaluation, and we will make the necessary changes in the VCG section as explained in our response.

Review (Rating: 7)

The paper studies adverse selection in federated learning due to varying levels of data quality among the clients. In particular, a phenomenon of unravelling, in which the collaboration is essentially destroyed due to insufficient participation incentives for the agents with high-quality data, is identified and studied. Within a specific model for the incentives of the clients and the server objective, the authors first derive optimal protocols under full information. Then they prove the emergence of the unravelling effect under hidden information and suggest a mechanism that avoids this effect by estimating the clients' dataset qualities from data.

优点

  • The paper studies a very relevant topic, namely participation incentives in FL. It also provides a new model for such incentives in the presence of varying data qualities, and it demonstrates and studies the effect of unravelling, which is, to my awareness, also a novelty in the context of FL.

  • The paper is well-written and the model design choices are well-justified.

  • Related work is covered well.

  • The paper shows how to incentivize participation by estimating the clients' data qualities in an initial sampling phase. In this way, more data can be collected from the clients with high-quality distributions. I think this approach is valuable, since it does not require any payments/penalties, but instead incentivizes collaboration purely by changing the learning algorithm.

Weaknesses

  • In the proposed framework, the utilities of the clients are evaluated based on upper bounds on the loss, and therefore on the type of guarantees on the model that can be obtained. One might expect that clients would instead reason about the expected loss (and hence utility) that they will get from the FL protocol, under the decisions they take (and over the randomness of the sampled data). It will be nice to see a discussion about why clients may choose to reason via statistical guarantees as compared to expected rewards, as well as a discussion of whether the techniques in the paper can be extended to the case where the expected reward is considered.

  • If I interpret equation (8) correctly, the problem that the server currently solves requires that agents who are not included in the collaboration would also benefit from joining. Intuitively, this feels unfair/undesirable, since these agents will be "left out" of the collaboration. Of course, that won't be a problem if the grand coalition is chosen, but how about the remaining cases?

Questions

  • For H1, it will be nice if the authors can clarify what $\alpha$, $\beta$, and $\gamma$ can depend on in their framework. Are they absolute constants, or can they depend on $\mathcal{H}$, the distribution $P$, etc.?

  • In H3, does the lower bound on $\theta_j$ need to be larger than $0$?

  • Could the authors clarify how Lemma 1 and Proposition 3 relate to classic results from domain adaptation about the generalization of classifiers learned on multi-source data and about estimating the discrepancy distance from finite samples?

Limitations

It will be nice to see a discussion about possible notions of optimality of mechanisms that incentivize participation in this context.

Author Response

We thank the reviewer for their positive feedback, and try to address the points they raised below.

"In the proposed frameworks, the utilities of the clients are evaluated based on upper bounds on the loss..."

We thank the reviewer for their comment. Indeed, it may seem natural to assume that agents maximize their expected utility. We considered this alternative while elaborating our work, but came to the conclusion that this approach would hardly apply to our specific problem. Indeed, agents would need to know $P_0$ to be able to form an expectation about their reward. This would make the statistical inference problem meaningless, as they could simply consider the Bayes estimator rather than performing empirical risk minimization based on samples from $P_j$. Additionally, the risk does not generally admit an exact expression, making this approach hardly tractable mathematically. We would like to stress that our modeling choice of working with a high-probability upper bound is actually common in the literature; see for instance [1, 2, 3].

"If I interpret equation (8) correctly, currently the problem that the server solves..."

We deeply thank the reviewer for this remark, which unveils a typo in the formulation of problem (8). The correct formulation should be

$$
\begin{aligned}
&\text{maximize} \quad W:(\mathsf{B},\mathbf{n})\in\{0,1\}^J \times \mathbb{R}_{+}^J \mapsto \sum_{j\in [J]} u_{j}(\mathsf{B}, \mathbf{n})\\
&\text{subject to} \quad \min_{j\in [J]} \big(u_{j}(\mathsf{B}, \mathbf{n}) - \underline{u}_j\big) \geq 0.
\end{aligned}
$$

We will correct this typo in the revised text accordingly.

"It will be nice to see a discussion about possible notions of optimality of mechanisms that incentivize participation in this context."

We agree that discussing the optimality of mechanisms incentivizing participation under information asymmetry would be very interesting. However, there is no easy answer to this question at first glance. One option we have thought of is to first compute an upper bound on the welfare achievable by any incentive-compatible, transfer-free mechanism, and second, to see how close the welfare derived under our proposed mechanism is to this bound. However, we were not able to complete these ideas in the time available. Therefore, while this question is interesting in its own right, we leave it for future work.

"For H1, it will be nice if the authors can clarify what α\alpha, β\beta and γ\gamma can depend on in their framework. Are they absolute constants, can they depend of H\mathcal{H}, the distribution PP, etc?"

We thought that the parameters $\alpha$, $\beta$, and $\gamma$ were already motivated and exemplified. Indeed, a remark (lines 115 to 120) is made in our submission, where explicit values for these parameters are provided in the classification case, as well as in the regression case, using Rademacher complexity and a recent PAC-Bayes bound. To summarize this remark: typically, $\gamma$ is an absolute constant, $\alpha$ should be thought of as a parameter which depends on the desired level of confidence $\delta$, and finally $\beta$ is related to the complexity of the problem at hand (the Rademacher complexity of the considered function class in the classification case, and the sup-norm of the loss in the regression case). We would be happy to clarify this remark in the paper if the reviewers feel the need.
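As a concrete instance of such a bound (a standard uniform convergence result for a loss bounded in $[0,1]$; this is our illustrative example, not necessarily the exact form of H1): with probability at least $1-\delta$,

$$
\sup_{g \in \mathcal{G}} \left| \hat{\mathcal{R}}_n(g) - \mathcal{R}(g) \right| \;\le\; 2\,\mathfrak{R}_n(\mathcal{G}) + \sqrt{\frac{\log(2/\delta)}{2n}},
$$

where $\mathfrak{R}_n(\mathcal{G})$ denotes the Rademacher complexity of the class. The $\sqrt{\log(2/\delta)}$ factor plays the role of $\alpha$, the complexity term that of $\beta$, and the $n^{-1/2}$ rate corresponds to an absolute $\gamma = 1/2$.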

"In H3, does the lower bound on theta_j need to be larger than 0?"

For any $j \in [J]$, $\theta_j$ indeed needs to be non-negative, because by definition

$$
\theta_j = \sup_{g\in\mathcal{G}} \left|\mathcal{R}_{j}(g)-\mathcal{R}_0(g)\right| \geq 0.
$$

Note that nothing prevents $\underline{\theta} = 0$, so an agent can have access to the target distribution $P_0$ in our framework.
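For intuition, a naive plug-in estimate of this quantity over a finite hypothesis class might look as follows (a hypothetical helper of our own, not the estimator $\eta_\delta$ analyzed in Proposition 3):

```python
import numpy as np

def estimate_type(losses_j: np.ndarray, losses_0: np.ndarray) -> float:
    """Plug-in estimate of theta_j = sup_g |R_j(g) - R_0(g)| over a finite
    class G.  losses_j[g, i] is the loss of hypothesis g on the i-th of the
    q samples sent by agent j; losses_0[g, i] likewise for the q' reference
    samples from P_0 held by the aggregator.  (Hypothetical helper.)"""
    emp_risk_j = losses_j.mean(axis=1)   # empirical risk of each g under P_j
    emp_risk_0 = losses_0.mean(axis=1)   # empirical risk of each g under P_0
    return float(np.abs(emp_risk_j - emp_risk_0).max())
```

By uniform convergence, each empirical risk concentrates around its population counterpart, so this naive estimate deviates from $\theta_j$ at the usual $O(1/\sqrt{q} + 1/\sqrt{q'})$ rate; the faster $1/q$ rate mentioned above requires the more careful construction of $\eta_\delta$.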

"Could the authors clarify how Lemma 1 and Proposition 3 relate to classic results from domain adaptation about the generalization of classifiers learned on multi-source data and about estimating the discrepancy distance from finite samples?"

The domain adaptation results we use are well known in the literature; however, we acknowledge that further details should have been included. Lemma 1 is a straightforward application of a more general, well-known bound which can for instance be found in [4, Theorem 5.2] or [5, Theorem 1]. Contrary to these works, where the weights on source distributions can be any point in the simplex, we use weights that are proportional to the number of samples provided to the aggregator, hence the simpler expression. Likewise, Proposition 3 is known and has been used for estimating the $\mathcal{G}$-divergence from a finite sample; it has been applied to the classification case for instance in [6, Lemma 2]. We thank the reviewer for this comment, and we will provide further details about how our work relates to the domain adaptation literature. We would also like to stress that the aim of our work is not to bring novel results in domain adaptation, but rather to use the domain adaptation literature to motivate our (utility) model.
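For the reader's convenience, the basic step behind such multi-source bounds follows directly from the definition of $\theta_j$ recalled above: with sample-proportional weights $w_j = n_j / \sum_k n_k$,

$$
\mathcal{R}_0(g) \;=\; \sum_{j\in[J]} w_j\, \mathcal{R}_0(g) \;\le\; \sum_{j\in[J]} w_j\,\mathcal{R}_j(g) \;+\; \sum_{j\in[J]} w_j\,\theta_j \qquad \text{for all } g \in \mathcal{G},
$$

since $\mathcal{R}_0(g) \le \mathcal{R}_j(g) + \theta_j$ for each source $j$. Finite-sample statements such as Lemma 1 then add a uniform deviation term for replacing each $\mathcal{R}_j$ by its empirical counterpart (this is our reconstruction of the standard argument, not the lemma's exact statement).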

[1] Huang, Karimireddy, & Jordan (2023). Evaluating and Incentivizing Diverse Data Contributions in Collaborative Learning. arXiv preprint arXiv:2306.05592.

[2] Karimireddy, Guo, & Jordan (2022). Mechanisms that incentivize data sharing in federated learning. arXiv preprint.

[3] Ananthakrishnan, Bates, Jordan, & Haghtalab (2024, April). Delegating data collection in decentralized machine learning. In International Conference on Artificial Intelligence and Statistics. PMLR.

[4] Zhang, Zhang & Ye (2012). Generalization bounds for domain adaptation. Advances in neural information processing systems, 25.

[5] Konstantinov, & Lampert, (2019). Robust learning from untrusted sources. In International conference on machine learning. PMLR.

[6] Ben-David, Blitzer, Crammer, Kulesza, Pereira, & Vaughan, (2010). A theory of learning from different domains. Machine learning, 79.

Author Response (to all reviewers)

We deeply thank the reviewers for their evaluation of our work and their insightful feedback. We will take into account their remarks for the final version of the text. They seem to agree on the importance of adverse selection in collaborative learning and the fact that our research question is meaningful. In what follows, we reply individually to each reviewer and try to address their concerns. We will gladly answer additional questions, if any, during the author-reviewer discussion period.

Final Decision

The paper studies unravelling, a celebrated economic phenomenon, in the context of collaborative learning.

I agree with some of the reviewers about confusing notations, which the authors have addressed in their rebuttal.

After carefully reading the paper, the reviews, and the rebuttal, I believe this paper could interest the community and deserves acceptance.