PaperHub

Average rating: 6.0/10 · Rejected · 4 reviewers
Ratings: 8, 5, 5, 6 (min 5, max 8, std dev 1.2)
Confidence: 2.0

ICLR 2024

Incentivized Black-Box Model Sharing

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

This paper presents a novel incentivized black-box model sharing framework that fairly distributes rewards and monetary payoffs to each party, and satisfies individual rationality regarding model performance.

Keywords

incentives, ensemble distillation, collaborative learning, model sharing

Reviews and Discussion

Review

Rating: 8

In this paper, the authors introduced an incentivized black-box model sharing framework that equitably distributes ensemble predictions and rewards parties based on their contributions. The authors (1) introduced a Weighted Ensemble Game to quantify the contribution of black-box models towards predictions; (2) derived a closed-form solution for fair reward allocation based on Weighted Ensemble Game and Fair Replication Game; (3) theoretically proved that approximate individual rationality is satisfied. Finally, the authors also conduct numerical experiments on real-world data to confirm the efficacy of their method.

Strengths

Overall, this paper is well written and clearly addresses the three main questions that the authors set out to answer: (1) how to quantify the contributions made by each model, (2) how to ensure that each party receives a fair payment/reward, and (3) how to guarantee individual rationality. It also provides solid theoretical results for each of these questions, accompanied by empirical evaluations.

Nonetheless, I am not an expert in the field of Black-Box Model Sharing and hence have limited expertise in evaluating the merit/weakness of this work.

Weaknesses

See questions.

Questions

(1) Could you provide one specific example that motivates why individual rationality is chosen as one of your key metrics?

(2) Why do you consider Shapley fairness as your main fairness notion? Any other fairness notions that might fit into your framework?

(3) In Sec 5 you suggested that "We will later empirically show that the virtual regret $\epsilon$ is not needed and the strict IR is satisfied". Is this a purely empirical observation or do you believe stronger theoretical results can be established here?

Comment

We thank Reviewer jHme for reviewing our paper and for the positive and encouraging feedback!

We hope to answer your questions as follows.


Q1:

Could you provide one specific example that motivates why individual rationality is chosen as one of your key metrics?

Let's consider a scenario where three private hospitals (A, B, and C) want to share predictions based on their patient data to improve the predictive performance of their models. However, Hospital C has already invested significant resources into collecting a high-quality dataset (with high data diversity), while Hospitals A and B have datasets that are noisy or of lower quality in comparison. If the ensemble predictions are naively shared among all three hospitals, Hospitals A and B would benefit more because they can significantly improve their models, while it may not benefit Hospital C at all (i.e., no collaborative gain in the form of either money or an improved model), and its competitors are better off. In other words, Hospital C does not improve its utility by collaborating (i.e., individual rationality is not satisfied) and would be disincentivized to collaborate. Therefore, enforcing individual rationality secures Hospital C's participation by guaranteeing that its utility improves (i.e., a positive collaborative gain).

Additionally, a similar example is used to motivate individual rationality in the case of autonomous driving to achieve a win-win outcome (in the form of legislation that enforces individual rationality) for both consumers and car companies [1].

Therefore, individual rationality is one key incentive that we need to consider in collaboration.

[1] Karimireddy, S. P., Guo, W., & Jordan, M. I. (2022). Mechanisms that incentivize data sharing in federated learning. arXiv preprint arXiv:2207.04557.


Q2:

Why do you consider Shapley fairness as your main fairness notion?

Intuitively, the more one contributes, the more one should receive from the collaboration. There are many ways to formalize this intuition. In our problem, we adopt the Shapley value as a meaningful way to achieve it, because the Shapley value is the unique solution that satisfies the four fairness properties (efficiency, symmetry, dummy party, and linearity) as in App. B.1.

  • [symmetry] To ensure two equally contributing parties are equally recognized, symmetry ensures that two parties with equal marginal contributions to any coalition in the collaboration receive the same collaborative gain;

  • [dummy party] To prevent free-riders, dummy party ensures that parties with zero marginal contributions to any coalition receive no collaborative gain.

    Additionally, it is shown by Sim et al. (2020) that the Shapley value also satisfies other fairness properties, namely strict desirability and strict monotonicity, which we have newly added to App. B.1. Hence, the Shapley value is a careful design choice in our work.

As additional justification, the Shapley value is a widely used notion of fairness in previous data-sharing works (Ghorbani & Zou, 2019; Jia et al., 2019a; Sim et al., 2020; Tay et al., 2022; Karimireddy et al., 2022).
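
To make these properties concrete, here is a minimal brute-force sketch (ours, not from the paper; the toy value function `v` is an illustrative placeholder) that computes Shapley values by averaging marginal contributions over all party orderings, and exhibits both symmetry and the dummy-party property:

```python
from itertools import permutations
from math import factorial

def shapley(players, v):
    """Brute-force Shapley values: average each player's marginal
    contribution v(S + {i}) - v(S) over all join orders."""
    phi = {i: 0.0 for i in players}
    for order in permutations(players):
        coalition = frozenset()
        for i in order:
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    norm = factorial(len(players))
    return {i: phi[i] / norm for i in phi}

# Toy game: A and B are interchangeable (symmetry) and C adds nothing
# to any coalition (dummy party).
v = lambda S: 1.0 if {'A', 'B'} & S else 0.0
print(shapley(['A', 'B', 'C'], v))  # {'A': 0.5, 'B': 0.5, 'C': 0.0}
```

Note that the outputs also sum to $v(N)=1$ (efficiency).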

Any other fairness notions that might fit into your framework?

One other common fairness notion, egalitarianism, aims at achieving equitable outcomes for all parties (e.g., by rewarding all parties equally). This does not satisfy the dummy-party property and can cause the free-rider problem, since even a non-contributing party is rewarded equally with the high-contributing parties.

Another common fairness notion (in the allocation of goods) is envy-freeness, which is not applicable here because it is typically used for rivalrous goods, whereas the rewards here (i.e., ensemble predictions) are non-rivalrous (since they can be replicated); this is precisely the reason for our proposed definition of the fair replication game.

We will incorporate a discussion of these different fairness notions in the revision.


We answer the other question in the subsequent comment.

Comment

Q3:

In Sec 5 you suggested that "We will later empirically show that the virtual regret $\epsilon$ is not needed and the strict IR is satisfied". Is this a purely empirical observation or do you believe stronger theoretical results can be established here?

It is mainly an empirical observation, but we think stronger theoretical results may be possible. The reason is that the derivations of our theoretical result on the virtual regret $\epsilon$ involve several inequalities (e.g., Hoeffding's inequality, the triangle inequality, the bound on the growth function by the VC dimension, and the supremum inequality of the domain divergence measure, as in App. C.3). Tightening one or more of these inequalities can potentially lead to a stronger theoretical result. We will include this comment in our revision.


We hope our responses have addressed your concerns and helped you to understand our work better. We are happy to answer any additional questions.

Comment

Dear Reviewer jHme,

We sincerely appreciate the time and effort you've dedicated to reviewing our paper, as well as the insightful questions you've raised.

If you have any more questions, we are eager to provide prompt responses.

Thanks,

Authors

Review

Rating: 5

This paper introduces a framework for model sharing across parties. In relation to prior work, this paper considers incentives, as well as parties only sharing their model (rather than data which can be sensitive). The framework distributes rewards in proportion to the contribution of each party, and also allows for payments between parties.

Strengths

  • Tackles an important and practical problem of considering incentives in the context of model sharing
  • Model enforces desirable properties such as fairness and IR, and combines many practical considerations together
  • Analysis is thorough

Weaknesses

The main weakness is in the exposition - I was not able to understand the model. It seemed like the model and problem formulation were not comprehensively specified. The fact that there is an FAQ section on the model speaks to how the model is not completely clear. Here are my questions that I couldn’t find answers to:

  • How should we compare prediction error to monetary payments to "rewards" (samples of ensemble predictions)? (Do they use the same unit of measurement?)
  • Relatedly, what is the formula for the utility of party i?
  • Payments can be made from one party to another. Does each party decide on its own how much to pay to each other party, or is this transfer also specified as part of the mechanism? Does each party have a budget?

The model has two main parts, as described in Figure 5. Can we simply de-couple these two stages and study each part separately, or are there interactions that require studying them together? Just studying one aspect would make the paper simpler and more clear.

Questions

see above

Comment

We thank Reviewer 1Wgz for taking the time to review our paper and for recognizing that our studied problem is important and practical, our model achieves desirable properties, and our analysis is thorough. We wish to provide the following clarifications and have incorporated some changes in our revision (with highlighted text in the updated pdf).


The fact that there is an FAQ section on the model speaks to how the model is not completely clear.

Our provided Q&A section in App. A.2 is meant to further supplement the main paper with more nuanced and specific details. In our revision, we will improve the exposition of our work to specify the problem formulation more clearly.

W1:

How should we compare prediction error to monetary payments to "rewards" (samples of ensemble predictions)? (Do they use the same unit of measurement?)

The monetary payments (i.e., $p_i$) and the value $r_i$ of the rewards (of size $T_i$) can be viewed as using the same unit of measurement (i.e., lying in the same domain). This is by design, so that the allocation mechanism (Theorem 1) enables a translation between ensemble predictions (i.e., the realized rewards $T_i$) and monetary payments. We note that this has not been achieved by existing works. Please refer to Q4 in App. A, where we illustrate how the monetary currency is projected into the positive real domain $\mathbb{R}_+$ to align with the reward domain.

When you say "compare prediction error to monetary payments", we assume you mean the relationship between the two. To elaborate on the relationship: an increase in the monetary payment $p_i$ leads to a higher reward $r_i$, and thus a larger size of $T_i$, which can decrease the upper bound on the prediction error $L_{\mathcal{D}}(h_i')$:

  • From Proposition 2, the upper bound on the prediction error (i.e., $L_{\mathcal{D}}(h_i')$) depends on the payment $p_i$, because $L_{\mathcal{D}}(h_i') \leq L_{\mathcal{D}}(h_i)+\epsilon_i$ and $\epsilon_i$ depends on $T_i$. To relate the prediction error to the monetary payment approximately, we investigate the change in $\epsilon_i$ as a function of $p_i$ in a newly added empirical result (Fig. 9 in App. D.2.1), which provides an empirical quantification of the accuracy gain w.r.t. the increase in (monetary) payment.

W2:

Relatedly, what is the formula for the utility of party i?

By definition, $u_i$ is the Shapley value of the linearly combined game of $G$ and $G^p$. By derivation, the explicit formula for the utility $u_i$ of party $i$ is given in Theorem 1: $u_i = r_i + r_i^+ + p_i^+ - p_i$, where

  • $r_i=\phi_i$,
  • $r_i^+ = \frac{\mathcal{V}_N\times p_i}{\mathcal{V}_N - \phi_i}$,
  • $p_i^+=\sum_{j \in N \setminus i}\frac{\phi_i \times p_j}{\mathcal{V}_N - \phi_j}$, and
  • $p_i\in \mathbb{R}_+$ is determined by the party itself.

The utility $u_i$ represents the overall gain of party $i$ from the collaboration, as the sum of $i$'s total reward $(r_i + r_i^+)$ and payoff/net compensation $(p_i^+-p_i)$:

  • $(r_i + r_i^+)$ represents the value of the ensemble predictions that party $i$ receives from the collaboration,
  • $p_i^+$ represents the value of the monetary compensation received from all other parties, and
  • $p_i$ is the payment that party $i$ makes to the host in exchange for additional rewards of value $r_i^+$.

Our formula for the utility fully specifies all the terms. Note that these interpretations (of the utility) are provided in the main paper (above Theorem 1 and below Remark 1).
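
To make the allocation concrete, here is a minimal sketch of the Theorem 1 formulas (illustrative numbers only; we take $\mathcal{V}_N=\sum_i\phi_i$, which follows from the efficiency property, and the `phi` and `p` values are hypothetical):

```python
import numpy as np

phi = np.array([0.5, 0.3, 0.2])  # Shapley contributions phi_i (hypothetical)
p   = np.array([0.0, 0.1, 0.2])  # payments p_i chosen by each party (hypothetical)
V_N = phi.sum()                  # grand-coalition value, by efficiency

r      = phi                             # r_i = phi_i
r_plus = V_N * p / (V_N - phi)           # r_i^+ = V_N * p_i / (V_N - phi_i)
shares = p / (V_N - phi)                 # p_j / (V_N - phi_j), for each j
p_plus = phi * (shares.sum() - shares)   # p_i^+ = sum_{j != i} phi_i * p_j / (V_N - phi_j)

u = r + r_plus + p_plus - p              # u_i = r_i + r_i^+ + p_i^+ - p_i
print(u.round(4))
```

A party that pays nothing (the first entry) still keeps its fair reward $r_i=\phi_i$ plus its share of the other parties' payments.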


W3:

Payments can be made from one party to another. Does each party decide on its own how much to pay to each other party, or is this transfer also specified as part of the mechanism?

Each party decides on its own how much to pay in total (i.e., $p_i$) to the central host; the exact distribution of $p_i$ to each other party is specified by our mechanism (and managed by the central host) to achieve fairness: the proposed formula $p_i^+=\sum_{j\in N\setminus \{i\}}\frac{\phi_i \times p_j}{\mathcal{V}_N - \phi_j}$ in Theorem 1 specifies this payment distribution from every party $j\in N\setminus \{i\}$ to party $i$.

Does each party have a budget?

Our mechanism can achieve fairness whether or not a party has a budget constraint:

  • [With budget constraint.] If party $i$ has a budget $B_i$, it can freely choose its payment as long as $p_i \leq B_i$, and fairness is always achieved by the reward design and realization, as shown in App. C.6.

  • [Without budget constraint.] If party $i$ does not have any budget constraint, then it can make the maximal payment $p_i^\ast = (\phi^\ast -\phi_i)(\mathcal{V}_N-\phi_i)/\phi^\ast$ to receive the full reward of value $\mathcal{V}_N$, as mentioned in the paragraph below Remark 1. In this case, fairness is achieved by fairly distributing the payments.


We address the other concerns in the subsequent comment.

Comment

W4:

Can we simply de-couple these two stages and study each part separately, or are there interactions that require studying them together? Just studying one aspect would make the paper simpler and more clear.

We consider these two stages together (1) so that we can study the overall gain (i.e., $u_i$) of a party $i$ from the entire collaboration, and (2) because $G^p$ depends on the outcome of $G$:

  1. Considering the two stages together enables us to study the overall gain (i.e., $u_i$) that a party receives from the collaboration, including the reward (i.e., $r_i + r_i^+$) and the payoff (i.e., $p_i^+ - p_i$). Note that, in the paragraph above Theorem 1, we formally couple the two stages via two games, namely $G$ and $G^p$. The outcome of $G$ determines $r_i$, and the outcome of $G^p$ determines $r_i^+ - p_i + p_i^+$. With this coupling, we can directly determine the number of ensemble predictions $T_i=(r_i+r_i^+)\times T$ that party $i$ receives.

  2. The fair replication game $G^p$ in Definition 5.1 depends on the outcome (i.e., $\{\phi_i\}_{i=1}^n$) of $G$, because the fairness of the monetary gain (i.e., $r_i^+ - p_i + p_i^+$) relies on the contribution measure $\phi_i$. To satisfy the Shapley fairness incentive (i.e., fairly allocating rewards and payoffs), it is necessary to calculate the Shapley value from the first stage. Hence, the two stages need to be coupled.

Let's consider a concrete scenario of private hospitals collaborating by sharing predictions of unlabeled medical data. Small hospitals with lower contributions (i.e., a model of lower quality/performance) can make payments to purchase more rewards (i.e., ensemble predictions) to improve their models. Hence, it is important for the payment (in exchange for rewards) to be carefully based on the outcome of the first stage so that these parties (i.e., small hospitals) are able to obtain a high utility (i.e., improvement in their local models' performance).

We will make these reasons (for considering the two stages) more explicit in our revision.


We hope our clarifications have addressed your questions and helped improve your opinion of our work. We are happy to answer any additional questions.

Comment

Dear Reviewer 1Wgz,

Thank you for taking the time to review our paper and for your valuable questions.

If you have any more questions or need more details, we are happy to answer them promptly within the discussion period.

Best,

Authors

Comment

Thank you to the authors for their detailed clarifications. My original questions have been clarified.

I still think that the paper is very dense and difficult to follow. Specifically, section 3 should completely specify the model, which it currently does not do rigorously. I found reviewer gZai's summary of the paper to be a clearer description of the model. Here are a couple of examples of sources of confusion:

  • The reward is not defined - I believe it corresponds to a scalar value, but it is initially introduced as a set of predictions.
  • This section should also clearly delineate which aspects of the process will be specified as the main contributions of the paper (Sections 4+5). For example, after the sentence "parties are allowed to make monetary payments $p_i$ ...", the authors should write that this payment mechanism will be detailed in Section 5 (and perhaps the desirable properties of this mechanism should also be stated here). Essentially, it was unclear which parts are taken as given / as definitions, and which parts represent the main contribution.
  • There is a paragraph about the valuation function, but at this point it is completely unclear why this is relevant and how it relates to the model specified in the previous paragraph. It is written that the Shapley value represents the "fair contribution of party i", but the fair contribution was never defined.

I don't need the authors to respond individually to the above points since this is late (my apologies). But in summary, I found the paper difficult to follow because there was no roadmap and I felt that many descriptions and terms came out of nowhere. Looking at the paper, I do think much of my concerns can be alleviated if Section 3 was significantly clarified.

Comment

We thank Reviewer 1Wgz for acknowledging that our clarifications have clarified the questions. We wish to provide the following response:


The reward is not defined - I believe it corresponds to a scalar value, but it is initially introduced as a set of predictions.

We define the value of the reward in the second paragraph of Sec. 5 (i.e., "let $r_i\in \mathbb{R}_+$ denote the numerical value of party $i$'s reward ..."):

  • The realized reward for party $i$ is a set $U_i$ of data subsampled from the ensemble predictions, and its size is $T_i$, i.e., $|U_i|=T_i$ (see Q5 in App. A.2).
  • The reward value is $(r_i+r_i^+) \in \mathbb{R}$, and $T_i = (r_i+r_i^+) \times T$ (see Reward Realization in Sec. 5.1).
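
A minimal sketch of this realization step (hypothetical names; rewards are realized as i.i.d. samples from the ensemble-prediction set, matching the summary in Reviewer gZai's review):

```python
import random

def realize_reward(ensemble_preds, r_i, r_plus_i):
    """Party i receives T_i = (r_i + r_i^+) * T i.i.d. samples from the
    ensemble predictions, where T = len(ensemble_preds)."""
    T = len(ensemble_preds)
    T_i = round((r_i + r_plus_i) * T)
    return random.choices(ensemble_preds, k=T_i)  # with replacement, i.e., i.i.d.
```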

This section should also clearly delineate which aspects of the process will be specified as the main contributions of the paper.

We have incorporated your suggestion in our updated Sec. 3 in the revised paper (see pdf).

it was unclear which parts are taken as given / as definition, and which parts represent the main contribution.

In our updated pdf, Sec. 3 now presents an overview of the proposed mechanism with the illustrative diagram (Fig. 1) previously in App. A, in which we clearly describe the definitions and contributions. Also, we note that our specific contributions are summarized in the last paragraph in Sec. 1.

...the valuation function, but at this point it is completely unclear why this is relevant...

The valuation function $\mathcal{V}$ is designed so that we can utilize the Shapley value to fairly quantify the contribution of each party, which is made clear in our revised paper.

I found the paper difficult to follow because there was no roadmap ... I do think much of my concerns can be alleviated if Section 3 was significantly clarified.

We thank the reviewer for the feedback, and have included a summarized description of our mechanism in App. A to serve as the roadmap, due to the page limitations. We hope that our accordingly updated Sec. 3 provides better clarifications and has helped improve your opinion of our work.

Review

Rating: 5

The paper studies how to incentivize different agents to participate in black-box model sharing.

More specifically, given a set of points S, the host wants each agent to share their predictions on those points, and incentivizes them by giving back the final ensemble predictions over these points (each agent's predictions are weighted by some weights beta), which can be used to obtain a new and hopefully improved model h'. The number of additional points and ensemble predictions given to each agent is proportional to the agent's contribution. They show a principled manner of measuring the contribution of each agent. Also, they show how to incentivize each agent to actually participate: i.e., there is an incentive to report predictions because the new model h' trained with the additional points and ensemble predictions performs better than the previous model h.

Each agent can make a payment to collect more of those points and their ensemble predictions. And the paper shows how to set up these payment values and reward values so as to guarantee some form of fairness (T1 on pg 5).

They also evaluate their approach on some datasets.

Strengths

-The main problem that they study is well-motivated, and the guarantees that they seek seem reasonable as well. It's nice that they can verify the theoretical claims in their experiments.

Weaknesses

-My main complaint is that the overall presentation was pretty hard to follow, resulting in some confusion over a few details of the paper. For instance, I'm a little confused about how the weights beta_{i,x} are set if the true label for point x is unknown. See the more detailed question below. Also, it seems that there's an assumption about the uniqueness of the optimal ensemble weights. Anyway, I think it would be helpful to add more prose to improve the overall presentation of the paper; I think the valuation part in Section 6 is not too surprising but can serve as a sanity check and be moved to the appendix, which would allow more room for prose throughout the paper.

Questions

-The paper describes once how the ensemble weights are set in 4.1. However, here it's assumed the host actually knows the ground truth. So, is it just that in the very beginning the host has access to a held-out data set, asks the clients to participate, finds these weights once, and uses them going forward? But more realistically, the host would want to query each party for predictions on points whose true labels are unknown. In those cases, how would one find these weights? Note that the way things are written, the weight beta_{i,x} is set differently for each point x, meaning one can't estimate beta_{i,x} separately for each x if the true label for that x is not known; rather, one would set a weight beta_i that is the same across all points. This should still maintain Proposition 1, as all the arguments are averaged over the entire distribution D anyway.

-I think there's an inherent assumption that the optimal weights beta are unique. Consider the following example where every party has the same exact model h. Then the ensemble model will be the same no matter how the betas are set. In this case, because everyone has the same model, each should be rewarded the same, meaning the betas should be uniform across every client. However, setting the betas to place all weight on a single model is also an optimal solution, which results in only that client receiving all the rewards. I think this is not just an artifact of this toy example: if the data each client has is pretty homogeneous, resulting in similar overall loss, this can very well happen (assuming, as described above, that the weights are chosen not over (party i, point x) but rather over just the parties).

Comment

We are grateful to the reviewers for their constructive and detailed feedback. We will incorporate this feedback into our revised work. We respond below to their concerns:


W1:

I'm a little confused about how the weights $\beta_{i,x}$ are set if the true label for point x is unknown.

Please kindly refer to our response to Q1 below and to App. B.2 for how the weights $\beta_{i,x}$ are set when the true label for point x is unknown.

add more prose throughout the paper.

To avoid significant changes (listed below) during the rebuttal, as instructed by the ICLR 2024 guideline that "Area chairs and reviewers reserve the right to ignore changes that are significantly different from the original paper", we will make the following changes in the revision:

(1) move the valuation part in section 6 to Appendix

(2) move some remarks in Appendix to the main paper to add more elaborations

(3) include more experiments with non-optimal (practical) ensemble methods in the main paper


Q1:

So, is it just that in the very beginning ... the host asks the clients to participate and find these weights in the very beginning and use these weights going forward?

No. In our setting, no matter what ensemble method is used, the host will query each party for each data point and continually determine the ensemble weights.

But more realistically, the host would want to query each party to provide predictions for points for which the true label is unknown.

Your realistic consideration is correct. In fact, in some of our experiments, the host queries each party for each data point and continually determines the ensemble weights, where the true label is unknown.

...for which the true label is unknown. In those cases, how would the host find these weights?

We assume the host knows the ground truth only when we use the optimal ensemble method. As described in App. B.2, ensemble methods from previous works, such as average ensemble (AVG), majority vote (MV), knowledge vote (KV), and multiplicative weight update (MWU), do not need the ground truth. These ensemble methods can set the weight $\beta_{i,x}$ differently for each party and each data point $x$.

On the other hand, if the ground truth is available, we can observe a much stronger correlation (i.e., -0.72) between the average ensemble weight $\mathcal{V}_i$ and the generalization error $L_{\mathcal{D}}(h_i)$ on MNIST in Table 2, compared with the correlation of MV (i.e., -0.24) and MWU (i.e., -0.24) on MNIST in Table 1.

We conduct an additional experiment (updated in App. D.1) and highlight that the requirement on the ground truth is not very restrictive: a labeled dataset (i.e., ground truth) of size as small as 100 is sufficient to identify high-quality models, as shown in Fig. 6 in App. D.1. The negative correlation at size 100 is as good as that at size 4000. In practice, the weights estimated on the small labeled dataset can potentially be used for other unlabeled data.


We address the other concern in the subsequent comment.

Comment

Q2:

Consider the following example where every party has the same exact model h ... In this case, because everyone has the same model, each should be rewarded the same, meaning the betas should be uniform across every client.

In our experiment, the betas are indeed uniform across every client when the clients have the same exact model $h$, because we set the initial value $\beta_{i,x}^\ast=1/n$ when solving the linear optimization problem of the optimal ensemble. This implementation detail can be found in our provided code (line 75 in the file "server/collaboration.py").

I think this is not just an artifact of this toy example: if the data each client has is pretty homogeneous, resulting in similar overall loss, this can very well happen.

Your understanding is correct. Translating your intuition of "pretty homogeneous" data into a requirement: for $i\neq i'\in N$, if $h_i(x)$ and $h_{i'}(x)$ are "close", then we want $\beta_{i,x} \approx \beta_{i',x}$. To achieve this in implementation, we add the regularization term $\sum_{i,i'\in N,\, i\neq i'} |\beta_{i,x}-\beta_{i',x}|^{(2-\|h_i(x)-h_{i'}(x)\|_1)}$ to the original linear optimization problem, as follows:

$$\underset{\beta_{i,x}^\ast\in [0,1],\,\forall i \in N}{\text{minimize}}\ \Big|f(x)-\sum_{i=1}^n\beta_{i,x}^\ast h_i(x)\Big|+\lambda \sum_{\substack{i,i'\in N \\ i\neq i'}} |\beta_{i,x}-\beta_{i',x}|^{(2-\|h_i(x)-h_{i'}(x)\|_1)}$$

$$\text{subject to}\ \sum_{i=1}^n\beta_{i,x}^\ast=1.$$

The constant 2 in the exponent $(2-\|h_i(x)-h_{i'}(x)\|_1)$ is chosen because the L1 norm of the difference is upper bounded by 2. Please kindly refer to App. B.2 for the detailed updated formulation.

Let's illustrate this concept with an example: suppose we have a collaboration involving five parties, where each party shares a 3-class prediction vector $h_i(x)$ for the query data $x$. The ground truth for this data is $[1.0, 0.0, 0.0]$. The table below lists $h_i(x)$ for each party; in particular, we impose that $h_1(x)$ and $h_{i\neq 1}(x)$ are "close".

| Party | $h_i(x)$ | $\beta_{i,x}^\ast$ w/o the regularization | $\beta_{i,x}^\ast$ w/ the regularization |
| --- | --- | --- | --- |
| 1 | [1.0, 0.0, 0.0] | 1.0 | 0.2016 |
| 2 | [0.9, 0.1, 0.0] | 0.0 | 0.1996 |
| 3 | [0.9, 0.1, 0.0] | 0.0 | 0.1996 |
| 4 | [0.9, 0.1, 0.0] | 0.0 | 0.1996 |
| 5 | [0.9, 0.1, 0.0] | 0.0 | 0.1996 |

We observe that $\beta_{1,x}^\ast$ of party 1 is 1.0 and $\beta_{i,x}^\ast$ of the other parties is 0 when we exclude the regularization term, even though $h_i(x)$ and $h_{i'}(x)$ are "close". However, when we include the regularization term, we achieve $\beta_{i,x} \approx \beta_{i',x}$ for $i\neq i'\in N$, as desired.
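
For concreteness, here is a minimal sketch of this regularized per-point weight optimization (our illustration, assuming SciPy's SLSQP solver; `lam` is an illustrative regularization strength, so the output approximates rather than exactly reproduces the table values):

```python
import numpy as np
from scipy.optimize import minimize

def ensemble_weights(f_x, H, lam=0.1):
    """Solve for the per-point weights beta (one per party) that fit the
    ground truth f_x while keeping weights of similar predictions close.
    f_x: ground-truth vector, shape (C,); H: party predictions, shape (n, C)."""
    n = H.shape[0]

    def objective(beta):
        fit = np.abs(f_x - beta @ H).sum()  # |f(x) - sum_i beta_i h_i(x)| (L1)
        reg = 0.0
        for i in range(n):
            for j in range(n):
                if i != j:
                    expo = 2.0 - np.abs(H[i] - H[j]).sum()  # 2 - ||h_i - h_j||_1
                    reg += np.abs(beta[i] - beta[j]) ** expo
        return fit + lam * reg

    res = minimize(objective, np.full(n, 1.0 / n),  # uniform initialization
                   bounds=[(0.0, 1.0)] * n,
                   constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
                   method='SLSQP')
    return res.x

# Five-party example from the table above
f_x = np.array([1.0, 0.0, 0.0])
H = np.array([[1.0, 0.0, 0.0]] + [[0.9, 0.1, 0.0]] * 4)
print(ensemble_weights(f_x, H).round(4))  # near-uniform weights
```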


We hope our responses have addressed your concerns and improved your opinion about our work. We are happy to answer any additional questions.

Comment

Dear Reviewer hnks,

Thank you for taking the time to review our paper and for your valuable questions.

If you have any more questions or need more details, we are happy to answer them promptly within the discussion period.

Best,

Authors

Comment

Sorry about the late response to the rebuttal, and thanks for the clarifications!

Since there isn't that much time remaining for the rebuttal period, I'll be sure to make my questions brief. I'm still a little hung up on the weights when the ground truth is not known.

Don't a lot of the theoretical results hinge upon the weights being chosen according to the optimal weights when the ground truths are known? For instance, I think intuitively that if I just choose the weights arbitrarily (e.g., just always listen to a particular agent for whatever reason), I shouldn't expect to get a lot of the nice properties described in Section 5.

It seems to me that the proposed approaches when the ground truth is not known are mostly 'heuristics' to be close to the optimal weights; I do see that for multiplicative weights, you can in fact get a no-regret style closeness to the optimal weights.

It would be great to know whether my concern here is valid or not.

And adding a regularization term to enforce uniqueness of the optimal weights makes sense to me!

Comment

We thank Reviewer hnks for acknowledging our clarifications, and wish to address the questions as follows.


Don't a lot of the theoretical results hinge upon the weights being chosen according to the optimal weights when the ground truths are known?

We wish to clarify that this is not the case. Our theoretical results in Sec. 5 do not specifically require the optimal weights or knowing the ground truth. Indeed, "our allocation scheme does not depend on any particular ensemble method" (see the last three lines in the paragraph below Remark 1). We will make this clear in our revision.

For instance, I think intuitively that if I just choose the weights arbitrarily ... I shouldn't expect to get a lot of the nice properties described in Section 5.

The "nice properties" will still hold, but we believe that we should not extend our considerations to any arbitrary ensemble method.

  • The properties (i.e., Shapley fairness, IR, and Theorem 1) described in Sec. 5 still hold with arbitrary weights $\beta_{i,x}$. The measured fair contribution $\phi_i = \sum_{x\in U}\beta_{i,x}/T$ in Sec. 4 only specifies the contribution under a given ensemble method (not necessarily the optimal one), and our method in Sec. 5 is designed to satisfy these properties w.r.t. $\phi_i$, without explicit dependence on the ensemble: as shown in the proofs in Apps. C.2 and C.6, our theoretical results do not require specific weights $\beta_{i,x}$ (see the short sketch after this list).

  • Nevertheless, we believe that the choice of ensemble method should not be arbitrary, because (1) effective ensemble methods (that do not require knowing the ground truth) are available, and (2) a sub-optimal ensemble method leads to ineffective collaboration, as elaborated below.
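
A one-line sketch of the contribution measure referenced above (hypothetical names; we assume the per-point weights are stored as a $T\times n$ array):

```python
import numpy as np

def contributions(beta):
    """phi_i = (1/T) * sum_{x in U} beta_{i,x}; beta has shape (T, n),
    one row of ensemble weights per query point, each row summing to 1."""
    return beta.mean(axis=0)
```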

It seems to me that the proposed approaches when the ground truth is not known are mostly 'heuristics'...

We assume that by "proposed approaches", the reviewer refers to the specific ensemble methods (i.e., AVG, MV, and MWU), and we wish to clarify that these methods are existing methods and not our proposed solutions.

It would be great to know whether my concern here is valid or not.

Your concern can be addressed, by using an effective ensemble method even if the ground truth is not available.

In addition to our discussion above (our theoretical results are not conditioned on the optimality of the ensemble method), our empirical observations demonstrate that some existing ensemble methods perform reasonably well (compared to the optimal ensemble) in App. D.2.1, without knowing the ground truth. Hence, we primarily consider effective ensemble methods instead of any arbitrary method, because:

  1. Existing ensemble methods, such as KV and MWU, already perform reasonably well. We summarize the results from Fig. 7(b) and Fig. 8(b,c) in the following table, which demonstrates that the quality/optimality of the ensemble method affects the accuracy gains of the parties: the optimal ensemble leads to the highest accuracy gain (as expected), while the effective ensembles (i.e., MWU and KV) both perform reasonably well. We highlight that our method can be adapted to other (more effective) ensembles designed in the future.

| Party | Original Acc | Improved Acc (MWU) | Improved Acc (KV) | Improved Acc (Optimal) |
| --- | --- | --- | --- | --- |
| [0] | 0.098 | 0.389 | 0.392 | 0.801 |
| [1-2] | 0.215 | 0.240 | 0.433 | 0.842 |
| [0-3] | 0.407 | 0.422 | 0.528 | 0.890 |
| [3-5] | 0.280 | 0.457 | 0.556 | 0.912 |
| [6-9] | 0.382 | 0.632 | 0.760 | 0.925 |
  2. A poorly chosen sub-optimal ensemble method can lead to ineffective collaboration among the parties, as an arbitrary or adversarial ensemble (see the example in the last paragraph of Sec. 4) can produce meaningless ensemble predictions (i.e., with very high ensemble error $L_{\mathcal{D}}(h_N)$) that will not benefit any party.

And adding a regularization term to enforce uniqueness of the optimal weights makes sense to me!

Thank you for confirming our proposed regularization design!

We hope to have clarified your questions about when the ground truth is not known, and helped improve your opinion of our work.

Comment

I'm aware that the results still hold true regardless of what the chosen weights are, but my point is that the results hold only with respect to the given weights.

So, if I choose arbitrary weights, the contribution determined by these weights would still be arbitrary, and the theoretical guarantees would be arbitrary as well, since they are with respect to those arbitrary weights. Therefore, in order to faithfully capture the contribution of each agent, we had better choose "good" weights that truly capture each agent's contribution, because the good theoretical guarantees hinge on the weights faithfully measuring contribution. Simply said, the quality of the theoretical guarantees hinges on the quality of the weights.

And my concern is that coming up with good weights and verifying that they are good is hard when there is no ground truth; the paper only empirically shows that the previous ensemble methods seem to do well, but these are just practical heuristics, and there is no theoretical guarantee that these are good weights. Note that even in the experiments, a hold-out dataset is used to show that the ensemble methods perform well. But in the absence of any hold-out dataset with ground truths, how can one even tell whether a chosen ensemble method is actually obtaining good weights or not?

Or am I still misunderstanding the results?

Comment

We thank Reviewer hnks for the further questions and wish to provide the following clarifications.


coming up with good weights and verifying that they are good is hard when there is no ground truth

We acknowledge that finding and verifying good weights without any ground truth is hard.

the paper is only empirically showing the previous ensemble methods seem to do well, but these are just practical heuristics, ...

Note that our objective is to design an incentive framework for black-box model sharing for a given ensemble method, because previous works did not consider the problem of valuation and incentive-aware allocation. Our objective is not to propose a new ensemble method.

there's no theoretical guarantee that these are good weights ... But in the absence of any hold-out dataset with ground truths, how can one even tell whether a chosen ensemble method is actually obtaining good weights or not?

To directly answer the question, it is very challenging to provide theoretically guaranteed good weights "in the absence of any hold-out dataset with ground truths"; fortunately, our empirical observation shows that a relatively small dataset (with ground truths) is sufficient.

To elaborate, under the PAC-learning framework (as the theoretical guarantee) [1,2], there is a lower bound on the sample complexity of such a hold-out dataset:

  • Any $(\epsilon,\delta)$ PAC-learning algorithm needs a labeled dataset (i.e., ground truth) of size $\Omega(\max(n \log d, d \log n)/\epsilon)$ (Theorem 6 in [1]), where $n$ is the number of parties and $d$ is the VC dimension of the hypothesis class.

This lower bound demonstrates the necessity of the hold-out dataset with ground truth for theoretical guarantees.

As a verification, we perform an estimation [3] of the VC dimension for the ML model (a neural network) used for MNIST.

  • From [3], the VC dimension is estimated as $O(WL\log W)$, where $W=7840$ is the number of weights and $L=1$ is the number of layers of the neural network we used for MNIST.

We find that the lower bound above, using the estimated VC dimension, is around 100x the size of the dataset in Fig. 6 in App. D.1, which we attribute to the omitted constants in the big-O notations of both the sample-complexity lower bound and the VC-dimension bound in [3]. This empirical observation shows that, though a labeled dataset is necessary (for theoretical guarantees), a relatively small one can be sufficient to determine good weights.
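
A back-of-the-envelope check of this arithmetic (a sketch: the hidden constants in the big-O/Omega bounds are taken as 1, the logarithm as natural, and `n` and `eps` are illustrative placeholders):

```python
import math

W, L = 7840, 1           # weights and layers of the MNIST network
d = W * L * math.log(W)  # VC-dimension estimate O(W L log W), Harvey et al. (2017)

n, eps = 10, 0.1         # number of parties and PAC accuracy (illustrative)
lower_bound = max(n * math.log(d), d * math.log(n)) / eps

print(f"estimated VC dimension ~ {d:,.0f}")
print(f"sample-complexity lower bound ~ {lower_bound:,.0f}")
```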

We hope our response has helped clarify your questions.

References

[1] Chen, J., Zhang, Q., & Zhou, Y. (2018). Tight bounds for collaborative PAC learning via multiplicative weights. Proc. NeurIPS.

[2] Blum, A., Haghtalab, N., Procaccia, A. D., & Qiao, M. (2017). Collaborative PAC learning. Proc. NeurIPS.

[3] Harvey, N., Liaw, C., & Mehrabian, A. (2017). Nearly-tight VC-dimension bounds for piecewise linear neural networks. Proc. COLT.

Review

Rating: 6

  • This paper proposes a theoretical framework for incentivized black-box model sharing, based on cooperative games.
  • In the first stage of interaction, each party $i\in[n]$ trains a multiclass classifier $h_i(x)$ using distribution $\mathcal{D}_i$, but is interested in maximizing performance on a different distribution $\mathcal{D}$.
  • The trained classifiers are sent to a trusted party and combined into an ensemble model $h_N(x)=\sum_i \beta_{i,x} h_i(x)$. The trusted party evaluates $h_N$ on a dataset $U\sim\mathcal{D}^T$ from the target distribution, and performance is translated into fair rewards $r_i$ for each party by the weighted ensemble game (WEG) mechanism.
  • The WEG mechanism is based on Shapley values of a fully additive cooperative game. The contribution of the $i$-th party is assumed to be equal to the average ensemble weight of their predictor ($\sum_{x\in U} \beta_{i,x}/T$).
  • In the second stage, each party is allowed to add $p_i$ monetary funds to increase their reward, and additional rewards $r_i^+$ and payments $p_i^+$ are distributed fairly by the fair replication game (FRG) mechanism, relying on Theorem 1.
  • Once the final reward values are set, rewards ($r_i+r_i^+$) are realized as i.i.d. samples from the set $\{(x,h_N(x))\}_{x \in U}$, and offset payments $p_i-p_i^+$ are realized as monetary transfers.
  • Empirical evaluation is performed on MNIST, CIFAR-10, and SVHN, demonstrating accuracy gains in several settings.

Strengths

  • Problem is well-motivated. Two-stage collaborative game structure is an interesting design approach.
  • Makes effort to support key assumptions (e.g for valuation functions).
  • Empirical evaluation supports claims and provides confidence bounds. Documented code is provided.

Weaknesses

  • Limitations of the proposed method are not discussed clearly.
  • Unclear applicability for practical ensemble methods: Average ensemble weight is uncorrelated with the objectives of the parties (Table 1), experiments are performed with an "ideal method" (Section 4.1).
  • Presentation is dense, and was hard for me to follow. Many remarks which were very helpful to my understanding only appeared in Appendix A.

Questions

  • Motivation: Under which conditions is the model incentive structure realistic, and the valuation assumption applicable? In the hospital example mentioned in Appendix A (Q2), it is reasonable to assume that every hospital has access to a data source $\mathcal{D}_i$ based on their local population; however, it doesn't seem intuitive to me that a hospital would desire a classifier that has good performance on a population $\mathcal{D}$ which is different from their own and common to all other hospitals. Can you clarify this example, or give a different practical example where the assumptions intuitively hold?
  • How does the method perform under practical (non-ideal) ensemble methods?
  • Price of fairness: If I understand correctly, it seems that the overall welfare of the parties ($\sum_i L_{\mathcal{D}}(h_i)$) would be maximized by sharing all target-dataset data $\{(x_t,h_N(x_t))\}_{t=1}^T$ with all parties. What are the shortcomings of this approach? How does its welfare compare to the mechanism presented in the paper?
  • What is the relation between the objective $L_{\mathcal{D}}(h_i)$ and the utility $u_i$ presented in Theorem 1? Also, is it possible to quantify the relation between payment and accuracy increase for a given problem instance?
  • Technical questions: What is the meaning of the notation $\hat{L}_{\mathcal{D}}(h,h_N)$ in Section 5.2? Is there an upper bound on the size of the realized reward $T_i$?
Comment

We thank Reviewer gZai for taking the time to review our paper, for providing a very detailed summary and questions, and for the positive feedback that our problem is well-motivated and our design approach is interesting. We also appreciate the recognition of our efforts in supporting key assumptions and in the empirical evaluations.

We would like to address the comments as follows.


W1:

Limitations of the proposed method are not discussed clearly.

We have added a section (App. A.1) discussing the limitations and will make this clearer in our revision.

For example, one limitation of our method is that it is developed for ensemble methods that take the form of weighted sum formulation (in Section 4.1), which does appear in several common ensemble methods (mentioned in Section 4.1).


W2:

Unclear applicability for practical ensemble methods

We indeed have experimental results for non-ideal ensemble methods (which are more practical) and have also included additional experiments.

  • [Results for non-ideal ensemble methods.] The results are in Tables 4 and 5 (App. D.1). We observe a stronger correlation between the average ensemble weight $\mathcal{V}_i$ and the generalization error $L_{\mathcal{D}}(h_i)$ with the practical ensemble methods (i.e., MV, KV, and MWU), as shown in Table 5 under the non-i.i.d. data setting, which we believe is more important because the setting is more realistic. If the ensemble is bad (e.g., AVG in Table 5), our method of valuation and allocation still works, but parties with better models may not be identified as such in the collaboration.

  • [Additional experiments.] We included additional results in our paper, in Fig. 8 (in App. D.2.1), to demonstrate the accuracy gains w.r.t. different non-ideal (practical) ensemble methods used. In particular, we observe that strict IR is always satisfied.

    For the additional experiment on CIFAR-10, we will include it if it is complete before the end of the rebuttal; otherwise, we will include it in the revision.

As these different (non-ideal) ensemble methods produce predictions of varying qualities, we use the ideal (i.e., optimal ensemble) method to illustrate the effect of an ideal case of collaboration. The non-ideal ensemble methods might not be as effective in achieving fairness or IR as the ideal ensemble, but they can be applied nevertheless.


W3:

Many remarks which were very helpful to my understanding only appeared in Appendix A.

Thank you for taking the time to carefully read our prepared appendix. It would be very helpful if the reviewer would let us know the specific remarks (currently in appendix) that would aid the understanding of the reader, if moved to the main paper. We will improve the presentation accordingly.


We address the other concerns in the subsequent comment.

Comment

Q1:

Under which conditions is the model incentive structure realistic, and the valuation assumption applicable?

The model incentive structure is realistic when the parties are self-interested (i.e., interested in accuracy gain and monetary compensation) and desire to perform well on the same target domain $\mathcal{D}$ (please refer to the first paragraph of Sec. 3 for the formal problem formulation).

The valuation assumption is applicable when the ensemble method follows the weighted-sum formulation $h_N(x) = \sum_{i=1}^n\beta_{i,x}h_i(x)$ (Sec. 4.1).

In the hospital example mentioned in Appendix A (Q2), it is reasonable to assume that every hospital has access to a data source $\mathcal{D}_i$ based on their local population; however, it doesn't seem intuitive to me that a hospital would desire a classifier that has good performance on a population $\mathcal{D}$ which is different from their own and common to all other hospitals. Can you clarify this example, or give a different practical example where the assumptions intuitively hold?

Consider the condition of a pandemic like COVID-19, which has a global impact, affecting all hospitals and individuals. The data of different hospitals may have acquisition bias due to the different demographics of their patients, so the hospitals' source data $\mathcal{D}_i$ over COVID-19 variants would be heterogeneous. Individuals may be affected by other COVID-19 variants from other distributions in the future. In such a situation, each hospital is interested in developing generalizable classifiers that cater to all individuals by facilitating collaboration, and would be dealing with the same population $\mathcal{D}$ [1].

Moreover, existing works on collaborative learning often adopt this same setting of learning from multiple domains to perform well on the same target domain (Chang et al., 2019; Lin et al., 2020; Feng et al., 2021).

We can also extend to different target domains in future work (as mentioned in Sec. 7) to design a collaboration that enables customized improvements (w.r.t. each $\mathcal{D}_i$).

[1] Peiffer-Smadja, N., Maatoug, R., Lescure, F. X., D’ortenzio, E., Pineau, J., & King, J. R. (2020). Machine learning for COVID-19 needs global collaboration and data-sharing. Nature Machine Intelligence, 2(6), 293-294.


Q2:

How does the method perform under practical (non-ideal) ensemble methods?

Please see the response to W2 above.


Q3:

Price of fairness: If I understand correctly, it seems that the overall welfare of the parties ($\sum_{i\in N}L_{\mathcal{D}}(h_i)$) would be maximized by sharing all target-dataset data with all parties.

Yes, your understanding is correct.

What are the shortcomings of this approach?

It can cause the free-rider problem if we simply maximize the overall welfare regardless of contributions. To elaborate: a party that contributes nothing to the collaboration would still receive all the ensemble predictions (like every other party). This is unfair to the parties that make meaningful contributions and can thus disincentivize them (Sim et al., 2020).

How does its welfare compare to the mechanism presented in the paper?

In the case where all parties have no budget constraints and make the maximal payment $p_i^\ast$ (see the paragraph below Remark 1), our proposed mechanism also maximizes the overall welfare. In this case, fairness is still achieved by distributing the payments fairly, as the less-contributing parties compensate the top contributors via monetary payments.

If there is any budget constraint, the maximum overall welfare may not be achieved, so there is a price (of fairness) in terms of overall welfare. Which one to choose depends on the problem setting.

Comment

Q4:

What is the relation between the objective $L_{\mathcal{D}}(h_i)$ and the utility $u_i$ presented in Theorem 1?

The utility $u_i$ is the value that quantifies the received reward $(r_i + r_i^+)$ and the payoff $(p_i^+ - p_i)$. The generalization error $L_{\mathcal{D}}(h_i)$ can be reduced by incorporating the ensemble predictions (of size $T_i$) realized from $(r_i + r_i^+)$.

From Proposition 2, $L_{\mathcal{D}}(h_i') \leq L_{\mathcal{D}}(h_i)+\epsilon_i$ and $\epsilon_i$ depends on $T_i:=(r_i+r_i^+)\times T$. The total reward $(r_i + r_i^+)$ in the utility $u_i$ determines $T_i$ and affects the improved generalization error $L_{\mathcal{D}}(h_i')$. Specifically, if the payoff $p_i^+ - p_i$ is fixed, a larger $u_i$ could lead to a smaller $L_{\mathcal{D}}(h_i')$.

Is it possible to quantify the relation between payment and accuracy increase for a given problem instance?

Yes, and we present an informal idea: note that any party $i$ receives ensemble predictions of size $T_i:=(r_i+r_i^+)\times T$. The payment $p_i$ is made to purchase the additional reward of value $r_i^+ = \frac{\mathcal{V}_N\times p_i}{\mathcal{V}_N - \phi_i}$, which is realized as additional samples of ensemble predictions of size $(r_i^+ \times T)$. Therefore, the larger the payment $p_i$ that party $i$ makes, the more ensemble predictions it receives.

From Proposition 2, $L_{\mathcal{D}}(h_i') \leq L_{\mathcal{D}}(h_i)+\epsilon_i$ and $\epsilon_i$ depends on $T_i$, where the payment $p_i$ is implicitly embedded in $T_i$. Therefore, a higher payment means a larger $T_i$, and hence a smaller $\epsilon_i$. If we use $L_{\mathcal{D}}(h_i)+\epsilon_i$ as an approximation of $L_{\mathcal{D}}(h_i')$, then $\Delta\epsilon_i$ represents the reduction in error, which is empirically reflected as an increase in accuracy. To quantify the relation between payment and accuracy increase, we write $\Delta\epsilon_i$ as a function of $p_i$:

$$\Delta\epsilon_i(p_i) = M - 4 \sqrt{2d\log\Big(2\Big(r_iT+\frac{p_iT}{1 - \phi_i}+m_i+1\Big)\Big)+2\log\Big(\frac{8}{\delta}\Big)}\sqrt{\frac{\alpha_i^2}{r_iT+p_iT/(1 - \phi_i)}+\frac{(1-\alpha_i)^2}{m_i}}$$

where $M=4 \sqrt{2d\log(2(r_iT+m_i+1))+2\log(\frac{8}{\delta})}\sqrt{\frac{\alpha_i^2}{r_iT}+\frac{(1-\alpha_i)^2}{m_i}}$ is a constant. When $p_i=0$, $\Delta\epsilon_i(p_i)=0$. When $p_i$ is larger, $\Delta\epsilon_i(p_i)$ is larger, and thus the accuracy is higher.
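
A small sketch that evaluates this $\Delta\epsilon_i(p_i)$ expression (all parameter values below are illustrative placeholders, not taken from the paper):

```python
import numpy as np

def delta_eps(p_i, r_i, T, phi_i, m_i, alpha_i, d, delta=0.05):
    """Reduction of the Proposition 2 error bound when party i pays p_i
    instead of 0; implements the Delta-epsilon_i(p_i) expression above."""
    def bound(T_eff):
        return (4 * np.sqrt(2 * d * np.log(2 * (T_eff + m_i + 1)) + 2 * np.log(8 / delta))
                  * np.sqrt(alpha_i**2 / T_eff + (1 - alpha_i)**2 / m_i))
    M = bound(r_i * T)                                 # the p_i = 0 constant M
    return M - bound(r_i * T + p_i * T / (1 - phi_i))  # grows with p_i

# Zero at p_i = 0, then increasing in p_i (illustrative values):
for p in [0.0, 0.1, 0.5]:
    print(p, round(delta_eps(p, r_i=0.2, T=1000, phi_i=0.2,
                             m_i=500, alpha_i=0.5, d=50), 4))
```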

This idea is empirically verified in an additional empirical result in Fig. 9 in App. D.2.1.


Q5:

Technical questions: What is the meaning of the notation $\hat{L}_{\mathcal{D}}(h,h_N)$ in Section 5.2? Is there an upper bound on the size of the realized reward $T_i$?

$\hat{L}_{\mathcal{D}}(h,h_N)$ is the empirical error of training with the ensemble predictions. Denote by $U_i=\{x_t, h_N(x_t)\}_{t=1}^{T_i}$ the set of ensemble predictions that party $i$ receives. By definition, $\hat{L}_{\mathcal{D}}(h,h_N) = \frac{1}{T_i}\sum_{x \in U_i}\ell(h(x),h_N(x))$.

There is an upper bound on the size of the realized reward $T_i=(r_i+r_i^+)\times T$: $T_i$ is always at most the total number of ensemble predictions $T$. The size $T$ is specified by the size $|U|$ of the distillation dataset $U$.
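
A one-line sketch of this empirical error (hypothetical names; `loss` stands for the per-sample loss $\ell$):

```python
import numpy as np

def empirical_distillation_error(h, U_i, loss):
    """hat{L}_D(h, h_N) = (1/T_i) * sum over (x, y_ens) in U_i of loss(h(x), y_ens),
    where U_i holds party i's realized reward as (x, h_N(x)) pairs."""
    return float(np.mean([loss(h(x), y_ens) for x, y_ens in U_i]))
```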


We thank the reviewer again for the detailed questions and constructive feedback and hope that our response has answered your questions and helped raise your opinion of our work. We are happy to provide further clarifications.

Comment

Thank you for the detailed response, and for the clarifications made in the revision. I have no further questions.

For improving clarity (W3), I found Figure 5 and Appendix A.2 Q2 to be helpful, and I believe that a greater emphasis on concise intuitive examples will improve the overall clarity of the exposition.

Comment

Dear Reviewer gZai,

We thank you very much for reviewing our response. We are glad that our response has addressed all your concerns.

We will improve the clarity of the exposition accordingly in the revision.

Best Regards,

Authors

Comment

Post-discussion period summary (of responses and updates)

We sincerely appreciate the valuable feedback provided by the reviewers. We summarize the latest responses and updates to the manuscript:

  1. We thoroughly responded to all questions from the reviewers to address their concerns. Both Reviewers 1Wgz and gZai acknowledge that their questions have been addressed.

    We note that Reviewers 1Wgz and hnks primarily raised comments on the clarity of our presentation, which we have tried to address by providing clarifications in our responses and by updating the manuscript accordingly.

  2. The following updates are included:

    • We carefully revised Sec. 3 (Reviewer 1Wgz) to include Fig. 1 (Reviewer gZai), clearly demonstrating the components and steps of our proposed mechanism.
    • Further, we added a detailed but concise description of the precise steps of our mechanism in App. A (Reviewer 1Wgz).

In the revision, we will continue to refine our writing and presentation to improve its clarity, by carefully incorporating the suggestions from the reviewers.

We wish to extend our sincere thanks to all the reviewers for their constructive feedback and for acknowledging that our response has addressed their questions. We hope that our clarifications and updates have addressed all the questions and helped raise your opinions of our work.



We are extremely grateful to the reviewers for their insightful comments, which are key to refining and improving our paper.


Strengths

It is clear from the reviews that several aspects of our paper have been positively accepted:

  • Our study on incentivized black-box model sharing is an important, practical, and well-motivated topic (Reviewers 1Wgz, gZai, hnks).

  • We have clearly addressed the three main questions of incentivized black-box model sharing (Reviewer jHme).

  • We have provided empirical evaluations to verify the theoretical claims (Reviewers gZai, hnks).

  • Our proposed properties are desirable, and the analysis is thorough (Reviewers 1Wgz, hnks).


Revision

We have made the following modifications to our paper in the rebuttal:

  • We have conducted an additional experiment (update in App. D.1) and highlighted that the requirement on the ground truth is not very restrictive, where the labeled data (i.e., ground truth) of size as small as 100 is sufficient to identify high-quality models, as shown in Fig. 6 in App. D.1.

  • We have provided additional empirical results in Fig. 8 in App. D.2.1 to demonstrate the improved accuracy with different practical ensemble methods, where the strict IR is still achieved.

  • We have provided an empirical quantification between the accuracy gain w.r.t. the increase in (monetary) payment in Fig. 9 in App. D.2.1.

  • We have updated the optimization problem of the optimal ensemble method in App. B.2 to achieve similar weights $\beta_{i,x}$ for similar predictions $h_i(x)$.

  • We have added a section (App. A.1) discussing the limitations.

  • We have discussed two more properties of Shapley value (App. B.1).


We will carefully consider all the feedback given in our revised version. We thank the reviewers for their valuable input and hope our answers can increase your opinions of our work. We are happy to provide more clarifications.

AC Meta-Review

Reviewers appreciated the extensive effort by the authors to clarify their contributions, but they also maintained several concerns regarding the paper.

  • The presentation clarity of the paper remains a major weakness, making it difficult to understand the overall model and the applicability of the results. Expositional issues stem from the fact that the model/framework is very complex. The complexity is also an issue in itself, as it is hard to pinpoint exactly where the contribution of the paper lies, what assumptions are being made, and why they are justified.
  • On the technical side, the fact that getting a good set of weights is hard should at the very least be discussed.
  • Most importantly, however, as the quality of the guarantees depends on the weights, and the weights are purportedly learned from a small dataset, there should be some discussion of the distribution from which it is sampled (especially given that the introduction mentions agents covering different distributions), and how the bounds depend on sampling from this distribution. Some argument should also be made as to why predictions (and ensemble weights) generalize beyond this small dataset.

For all these reasons, the paper could use another iteration prior to publication.

Why Not a Higher Score

Reviewers were very explicit about their concerns during the discussion.

Why Not a Lower Score

N/A

Final Decision

Reject