6.3

/10

Poster4 位审稿人

最低6最高7标准差0.4

3.5

置信度

正确性3.0

贡献度2.8

表达3.3

NeurIPS 2024

Fair Wasserstein Coresets

Zikai Xiong,Niccolo Dalmasso,Shubham Sharma,Freddy Lecue,Daniele Magazzeni,Vamsi K. Potluru,Tucker Balch,Manuela Veloso

OpenReview PDF

提交: 2024-05-13更新: 2024-11-06

TL;DR

We present Fair Wasserstein Coreset (FWC), a novel coreset approach to generate fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks.

摘要

Data distillation and coresets have emerged as popular approaches to generate a smaller representative set of samples for downstream learning tasks to handle large-scale datasets. At the same time, machine learning is being increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. While current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples, their impact on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets ($FWC$), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. $FWC$ uses an efficient majority minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of $FWC$ is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that $FWC$: (i) achieves a competitive fairness-performance tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data and (iii) can be used to reduce biases in predictions from large language models (GPT-3.5 and GPT-4).

关键词

Algorithmic FairnessNonconvex OptimizationCoresets

评审与讨论

审稿意见

评分: 6置信度: 32024-07-06

This paper introduces a new data distillation technique called Fair Wasserstein Coresets. The general idea is to create a synthetic core set along with sample weights to represent a larger dataset, by minimizing the Wasserstein distance between core set and dataset, while ensuring a fairness constraint is satisfied. The paper develops a majority minimization algorithm for this Wasserstein problem and empirically validates it on several data sets demonstrating a competitive fairness utility trade-off.

优点

The Wasserstein problem is well-formulated with theoretical guarantees.
The connections with k-medoids is intuitive.

缺点

I suspect there is a potential error in Proposition 2.1, specifically pertaining to the inputs and outputs defined in these functions. Note that z consists of inputs ( $d, x$ ) and outputs ( $y$ ) of the NN, whereas $g_{\psi}$ is an MLP, i.e., it is a function that takes only $(d, x)$ as input. From [69 (original reference)], the MLP satisfies the Wasserstein inequality but only on the marginal distributions over p_{(x,d)} rather than over $p_{Z}$ . This may be resolved if we consider not the Wasserstein distance of the MLP output, but instead the Wasserstein distance of the function $h(z) = | g_{\psi}(x,d) - y |$ .
What do you mean in Lemma 3.1 that the corset is “no better than the best fair Wasserstein corset formed by $m |D||Y|$ data points”? I suspect you mean better with regard to achieving a lower Wasserstein distance, but please clarify.
The empirical analysis in Figure 1 is hard to parse. Can you measure the Pareto frontier from all of the observations and demonstrate that FWC is dominant? FWC seems Pareto efficient for Adult, Crime, and Drug, but not Credit potentially — but it is hard to see.
It is hard to understand the trade-offs between accuracy and disparity in the LLM experiments in Table 1, just by reporting these numbers. How important is it that the disparity dropped by 0.009 at a 2.97 point loss in accuracy? Again, it would be important to demonstrate some Pareto efficiency. Furthermore, the change in accuracy and disparity do not seem statistically significant based on the SD reported.

问题

Please clarify the potential error and comment about Proposition 2.1. If there is an error, what are the consequences with subsequent theoretical results?

局限性

Limitations are thoroughly discussed in the Appendix.

作者回复

2024-08-06

We thank the reviewer for the careful review and comments; we answer each questions below.

It is indeed true that the MLP $g_\psi$ satisfies the Wasserstein inequality for $p_{(x,d)}$ rather than $p_{z}=p_{(y,x,d)}$ (with the first inequality being ultimately what we are interested in), so thanks for pointing this out. We have changed the notation to define the downstream deviation as $d(p_{(\hat{X},\hat{D});\theta}, p_{(X,D);e})$ . Proposition 2.1 still holds as it can be updated as follows:

$d(p_{(\hat{X},\hat{D});\theta}, p_{(X,D);e}) \leq L_k W_1 (p_{(\hat{X},\hat{D});\theta}, p_{(X,D);e}) \leq L_k W_1(p_{\hat{Z};\theta}, p_{Z;e}),$

as the Wasserstein-1 distance of the joint $(X,Y,D)$ is larger than the marginal $(X,D)$ , as the $L_1$ distance is the sum of the absolute values of the distances across each variables (we have added a more formal proof of this in Appendix B). In other words, the Wasserstein distance minimized by FWC still constrains the downstream learning discrepancy we are interested in. We have updated Proposition 2.1 to reflect this. Finally, note that defining $h(z) = |g_\psi(x, d) - y|$ does not work as the reverse triangle inequality $||x| - |y|| \leq |x-y|$ only holds if the expectation is inside the absolute value, so the upper bound necessary for the proof of Proposition 2.1 would not follow.
Thanks for the suggestion and you are totally correct. It is in the sense of achieving a lower Wasserstein distance. We have clarified it by stating that the optimal objective value (Wasserstein distance) of problem (4) is no lower than that of problem (5) with the $m$ replaced by $m|\mathcal{D}||\mathcal{Y}|$ .
Thank you for your suggestion to improve the readability of our results. We have now updated Figure 1 to include Pareto frontiers, computed over all models and coreset sizes, which we have uploaded in the global rebuttal above. We include a table below to indicate which FWC models are included in the Pareto frontier in each dataset.

	Pareto Frontier
Dataset	FWC ( $\epsilon=0.01$ )	FWC ( $\epsilon=0.05$ )	FWC ( $\epsilon=0.1$ )
Adult	$\checkmark$	$\checkmark$
Drug	$\checkmark$	$\checkmark$	$\checkmark$
Crime	$\checkmark$	$\checkmark$	$\checkmark$
Credit	$\checkmark$

Thank you for the feedback. Determining the acceptable fairness-accuracy trade-off has been a discussion in the community and is eventually dependent on the application. Fairness-accuracy trade-offs have been highlighted in several works, eg., [1] and [2], and the acceptable trade-off is up to an end user and is application dependent (eg., finding lower discriminatory alternatives is mandatory in fair lending and acceptable accuracy reduction is dependent on "business necessity" [3]). Using LLM's is one of the downstream tasks considered in the paper and the experiment shows that coresets produced by FWC are able to reduce disparity in the outputs of LLM's for a classification task on average. When we observe the mean and standard deviations for demographic parity, we see that demographic parity is consistently reduced for outputs when compared to zero shot prompting for GPT-4 across all runs, showing that the coresets can help reduce disparity. For GPT-3.5, while the average disparity is reduced across runs, such consistency is indeed not observed, owing to a diverse set of outputs from the large language model. We will add this discussion to the final version of the draft.

References:

[1] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst., 33(1):1–33, October 2012.
[2] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 259–268, New York, NY, USA, 2015. ACM.
[3] Gillis, Talia B., Vitaly Meursault, and Berk Ustun. "Operationalizing the Search for Less Discriminatory Alternatives in Fair Lending." In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 377-387. 2024.

评论- Score adjusted

2024-08-08

Thanks for clarifying the theoretical result and updating your figures to show the Pareto frontier. The discussion in (#4) is useful and I recommend you include this discussion in your final version. I have updated my score accordingly.

审稿意见

评分: 6置信度: 32024-07-12

The paper gives an algorithm to generate smaller weighted synthetic dataset from real data set such that the synthetic data can enforce demographic parity when used for downstream tasks. This is achieved by solving an optimization problem of minimizing the Wasserstein distance between the two dataset distributions along with demographic parity-based fairness constraint. The authors describe how to efficiently solve this problem by reformulating it and subsequently using a majority minimization algorithm to solve the resultant nonconvex problem. They provide convergence guarantees for the algorithm and also generalization bounds for the solution. The theoretical results are supported by experiments on real and synthetic datasets.

优点

The paper for the most part is written clearly with some minor writing issues (see weaknesses). It is well structured and not too difficult to follow the high-level ideas. Both fairness and scalability are relevant issues so the paper will be of interest to the community.
I could not check all proofs, but the theoretical results appear sound. The connection between the unconstrained problem and Lloyd's algorithm for $k$ -means is neat.
The authors have performed experiments on both real and synthetic datasets and compared with a number of existing methods. As such the paper is a good mix of theory and practice.

缺点

The paper seems to borrow a lot of ideas and proof techniques from existing works like [56], [71] and others. for e.g. the reformulation, ideas to speed up the algorithm etc. As such I am not entirely sure about the novelty quotient of the work. It would be better if the authors can highlight why the modifications to techniques from existing works are non-trivial.
The explainability of the synthetic data will be very less. Specifically, as far as I understood, the authors are assigning the output label and sensitive attribute value to the generated data points just in same proportion as that in the original data. It is not clear to me what does this mean for the individual synthetic data points. Also do the features in the generated synthetic data correspond exactly to the features in original data?
I suggest the paper be proofread for minor corrections in writing: E.g.: In the contributions make the 'w' 's capitalized. On line 171 the authors say $P \geq 0$ (which I think means each entry in non-negative) while on line 258 it is $P \geq \mathbf{0}$ . Maintain the consistency.
Should not the weight of the synthetic data sum to $n$ and not $m$ (line 141 $\Delta_m$ )? Typically, we try to preserve the weight of the original data in expectation while reweighing the sampled points. Please Clarify.

问题

See weaknesses

局限性

See Weaknesses

作者回复

2024-08-06

We are grateful to the reviewer for their time and comments in the review of our work; we answer each point below.

While we acknowledge that the ideas for the reformulation and part of the complexities considerations are adapted from [1, 2] ([56, 71] in the original references) we would like to emphasize the significant contributions of this work:

Novel algorithm. Our novel algorithm FWC represents the first fair coreset approach explicitly focusing on downstream learning, in contrast to existing methods that emphasize local properties of generated samples (Section 4);
Theoretical claims. We support our novel algorithm FWC with theoretical claims regarding its convergence properties (Section 5.2) and generalization analysis (Section 5.3), as well as the effect on downstream learning tasks (Proposition 2.1 and Appendix B.1);
Connection to k-means. We establish a practical connection to the k-means algorithm, thereby extending FWC's applicability beyond fairness considerations (Section 6);
Empirical results. We demonstrate FWC's effectiveness through empirical results on synthetic data for scalability, as well as on real datasets and two LLMs for downstream classification tasks (Section 7 and Appendix C).

The features for the generated data are the same as the ones in the original training data, but the values of such features would be different. For instance, in Appendix C.2, in ``Using FWC to correct biases in LLM'' we include a textual example of how one of our generated coresets would look like. The value of each feature would depend on the chosen cost function $c$ ; in Section 4.2 we highlight the case of the $L_1$ and $L_2$ distance, as well as how to select feature values if only values existing among the training data features can be selected (akin to k-medoids).
We have corrected the inconsistencies, thanks for pointing them out.
In our work, the synthetic coresets $\hat{Z}$ lie in $\mathcal{Z}^m$ . Only when $\theta\subseteq \mathbb{R}^m_+$ and $\sum_j \theta_j = m$ , the empirical distribution $p_{\hat{Z};\theta}:= \frac{1}{m}\sum_{j=1}^m \theta_j \delta_{\hat{Z}_j}$ is a valid probability measure.

References:

[1] G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[2]Z. Xiong, N. Dalmasso, A. Mishler, V. K. Potluru, T. Balch, and M. Veloso. FairWASP: Fast and optimal fair wasserstein pre-processing. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14):16120–16128, Mar. 2024.

2024-08-09

I have read the reviews and the rebuttal and have accordingly updated my score.

审稿意见

评分: 7置信度: 42024-07-13

This paper proposed to extract coresets from a set of data samples using Wasserstein distance with fairness constraints. The authors formulates this problem as a minimization with linear constraints. The coreset selection is over the whole input space, not just from original data samples. The importance / weight of each coreset sample are also optimized. Extensive experiments show this method achieves better fairness-utility tradeoff, and can be applied in LLMs to reduce bias.

优点

The paper is nicely written and easy to follow. I appreciate the detailed steps and neat reformulations of the optimization problem. Theoretical guarantees are provided. The experiments supports the effectiveness of the proposed method very well.

缺点

In section 4.2 (line 203), how can problem Eq.(12) be separated into subproblems as in Eq.(13), are the optimal solutions of all subproblems the same and equal to the solution to (12)?
In section 6 (line 259), why are the minimizers of problem (17) always has only one non-zero entry in each row? As problem (17) can be seen as a relaxed version of discrete Kantorovich problem, where we can't say anything about the sparsity of the optimal plan. Please elaborate.

问题

See Weaknesses.

局限性

The authors included detailed limitations.

作者回复

2024-08-06

We are grateful to the reviewer for the effort spent reading our paper and for the encouraging comments. Below are our responses to the questions raised:

First of all, we apologize for a typo in (13). The minimum over $\hat{X}_i \in \mathcal{X}$ in (13) should be corrected to the minimum over $\hat{X}_j \in \mathcal{X}$ , replacing the subscript $i$ with $j$ . We appreciate the reviewer’s attention to detail in identifying this typo. With this typo corrected, then we can observe from (12) that, for each $j\in[m]$ , the variable $\hat{X}_j$ only affects the corresponding $\hat{Z}_j$ , because $\hat{Z}_j = (\hat{d}_j,\hat{X}_j,\hat{y}_j)$ . Different $\hat{Z}_j$ terms are independent of each other in the objective function of (12), so (12) can be decomposed into subproblems that optimize each $\hat{Z}_j$ (which is essentially $\hat{X}_j$ ). (12) and (13) are equivalent in the sense that they share the same optimal solutions.
The discrete Monge--Kantorovich (optimal transport) problem has a similar structure as (17), but is more complex. The constraint of (17) only requires that the sum of each $\underline{column}$ of $P$ equals $\frac{1}{n}$ , so (17) can be easily solved by computing the smallest component of each row of $C$ . An optimal solution $P$ can be a matrix where all entries of each row are zero except for the one corresponding to the smallest component of that row in $C$ . In contrast, a typical optimal transport problem that is written as an LP also requires the sum of each $\underline{row}$ to equal to some given values. For this reason, optimal transport problems do not have easily computable closed-form optimal solutions. In the discrete Monge--Kantorovich problem, this pair of constraints on the sum of rows and columns essentially state that the marginals of the probability measure (the decision variable) on the product space must equal the two given probability measures.

2024-08-13

Thanks for the clarification. I have no further questions.

审稿意见

评分: 6置信度: 42024-07-18

This paper talks about "fair Wasserstein coresets", weighted representative points generated to represent the original datasets. The goal is to meet two purposes: 1) the Wasserstein distance of the coreset and the input data set is minimized, 2) fairness in terms of demographic parity. Having a small Wasserstein distance can help to bound the downstream discrepancy for ReLu activated perceptrons.

The authors formulate the problem as an optimization problem (4). There are four steps: 1. Manually set the proportion of each combination of decision (Y) and feature (D). 2. Formulate linear constraints for the fairness constraint. This one borrows directly from [71]. 3. Formulate the Wasserstein distance optimization by using [56] 4. Simply further

After that, the problem is not convex. The authors use "majority minimization" [52, 38] to solve it. Specifically, one defines a convex surrogate function that upper bounds the non-convex function, and optimizes the convex function.

Section 5 reports theoretical guarantees: running time of the algorithm, convergence guarantee for the surrogate function, and last bound the generalization guarantees.

Experiments are reported in the last section: e.g., improving fairness in LLM.

On the positive side, the problem formulation is interesting and valid, Wasserstein coreset with fairness consideration. The use of this coreset for downstream applications make sense. Thus the problem and solution have merit. Experiment are thorough.

The weaknesses (or limitation in significance) is that both crucial steps (2) and (3) are basically using prior work. The theoretical results are standard.

Summarizing I feel that the paper is OK but would give a weak accept.

优点

缺点

The weaknesses (or limitation in significance) is that both crucial steps (2) and (3) are basically using prior work. The theoretical results are standard.

问题

I understand that there are many different notions of fairness and the authors focus on one of them, demographic parity. This is OK. One suggestion is that if the authors can provide some discussions and insights on how the results may improve other notions of fairness it will be valuable to have.

局限性

N/A

作者回复

2024-08-06

We are thankful to the reviewer for their time and thoughtful feedback. As the reviewer has pointed out, several notions of fairness exist in the literature. Focusing on the classification setting, chapter 3 in [2] classifies these notions into independence, separation, and sufficiency. Demographic parity falls in the class of the independence notion, and hence, other measures of fairness that are closely related, eg., disparate impact, would also improve when optimizing for demographic parity. However, as highlighted in [2], other notions of fairness, such as separation may not simultaneously be satisfied. Our work is focused on demographic parity and cannot guarantee an improvement in these other measures.

To test this, we evaluated our work, which optimizes for demographic parity, on the equalized odds measure, which falls under the notion of separation. When we consider equalized odds, FWC is not a part of the Pareto frontier for the Drug dataset, and more generally, FWC performance is not as competitive. This is in contrast to demographic parity: in Figure 1, FWC sits on the Pareto frontier across all datasets for fairness-performance trade-off in downstream classification (see the addendum in the global rebuttal above for the updated version of Figure 1). We have included a table below to indicate which FWC models are part of the Pareto frontier when considering demographic parity versus equalized odds, and will add this discussion to the paper.

	Pareto Frontier,	Demographic Parity		Pareto Frontier,	Equalized Odds
Dataset	FWC ( $\epsilon=0.01$ )	FWC ( $\epsilon=0.05$ )	FWC ( $\epsilon=0.1$ )	FWC ( $\epsilon=0.01$ )	FWC ( $\epsilon=0.05$ )	FWC ( $\epsilon=0.1$ )
Adult	$\checkmark$	$\checkmark$		$\checkmark$
Drug	$\checkmark$	$\checkmark$	$\checkmark$
Crime	$\checkmark$	$\checkmark$	$\checkmark$			$\checkmark$
Credit	$\checkmark$			$\checkmark$		$\checkmark$

Finally, while we acknowledge that steps (2) and (3) are built from [1], we would like to emphasize the significant contributions of this work and especially note that: (i) our novel algorithm FWC represents the first fair coreset approach explicitly focusing on downstream learning, in contrast to existing methods that emphasize local properties of generated samples; (ii) we support FWC with theoretical claims in Propositions 2.1, Theorems 5.3, 5.4, Proposition 5.5 (as well as additional insights in Appendix B.1); (iii) we establish a practical connection to the k-means algorithm, thereby extending FWC's applicability beyond fairness considerations (Section 6); and (iv) we demonstrate FWC's effectiveness through empirical results on synthetic data for scalability, as well as on real datasets and two LLMs for downstream classification tasks (Section 7 and Appendix C).

References:

[1] Z. Xiong, N. Dalmasso, A. Mishler, V. K. Potluru, T. Balch, and M. Veloso. FairWASP: Fast and optimal fair wasserstein pre-processing. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14):16120–16128, Mar. 2024.
[2] Barocas, Solon, Moritz Hardt, and Arvind Narayanan. Fairness and machine learning: Limitations and opportunities. MIT press, 2023.

2024-08-12

Thanks for the response. I believe including them in the revision can be helpful.

作者回复

2024-08-07

We would like to thank again all reviewers for their comments, detailed feedback and questions which have improved the quality of our paper.

While we have addressed each reviewer individually, we are using the global rebuttal to upload a .pdf with the new version of Figure 1 which includes Pareto frontiers; this addresses the comment made by reviewer 4tbZ and is relevant for the discussion of demographic parity following the question by reviewer xAdA.

最终决定Accept (poster)

2024-09-25

The reviewers noted that this paper formulates a relevant and interesting problem and clearly describes an approach that has sound theoretical guarantees and thorough experiments. Given the unanimous recommendation for acceptance from the four reviewers, I recommend acceptance.