Meta Optimality for Demographic Parity Constrained Regression via Post-Processing
Abstract
Reviews and Discussion
This paper considers fair regression with respect to statistical parity under the attribute-aware setting. The main focus/contribution is on obtaining a minimax rate for learning fair regressors:
- Taking the post-processing approach to achieving fairness, and leveraging the error decomposition result of prior work, this paper bounds the minimax rate for fair regression as (minimax rate of the unconstrained regression problem) + (sample complexity of learning the fair post-processor); the fair post-processor consists of the optimal transport maps to the barycenter of the output distributions of the unconstrained regressors.
- The authors cast the problem of learning the barycenter/optimal transport maps as optimization over the congruent potentials.
- They derive a sample complexity bound for learning the barycenter/optimal transport maps.
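As a concrete illustration of this post-processing view, here is a minimal one-dimensional sketch (ours, not the paper's congruent-potential estimator): each group's predictions are pushed through the group's empirical CDF and then through the barycenter quantile function, which for one-dimensional distributions under the Wasserstein-2 distance is simply the weighted average of the group quantile functions.

```python
import numpy as np

def fair_post_process(scores, groups, weights=None, n_grid=101):
    """Map each group's predictions onto the (empirical) Wasserstein-2
    barycenter of the group-wise prediction distributions.

    In one dimension the barycenter's quantile function is the weighted
    average of the group quantile functions, so the optimal transport
    reduces to quantile matching.  Illustrative sketch only -- the paper
    instead estimates the maps by optimizing congruent potentials.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    if weights is None:  # default: empirical group frequencies
        weights = {a: np.mean(groups == a) for a in labels}
    qs = np.linspace(0.0, 1.0, n_grid)
    # Group-wise empirical quantile functions on a common grid.
    Q = {a: np.quantile(scores[groups == a], qs) for a in labels}
    # Barycenter quantile function: weighted average of quantiles.
    Q_bar = sum(weights[a] * Q[a] for a in labels)
    out = np.empty_like(scores)
    for a in labels:
        s = scores[groups == a]
        # Empirical CDF rank of each score within its own group...
        ranks = np.searchsorted(np.sort(s), s, side="right") / len(s)
        # ...pushed through the barycenter quantile function.
        out[groups == a] = np.interp(ranks, qs, Q_bar)
    return out
```

After this step every group's predictions follow (approximately) the same barycenter distribution, which is exactly the demographic parity constraint on the output.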
Questions for Authors
See weaknesses.
Claims and Evidence
Yes
Methods and Evaluation Criteria
N/A
Theoretical Claims
I did not check the proofs in detail, but the presented results are reasonable/expected.
Experimental Design and Analyses
N/A
Supplementary Material
No
Relation to Prior Work
- The paper adopts the "post-processing" perspective of [Chzhen et al., 2020] and [Le Gouic et al., 2020] for fair regression, and leverages its error decomposition to derive the minimax rate (for which there is also prior work, though limited to specific data generation processes; see weakness 1).
- The barycenter/optimal transport estimator used in this paper is inspired by [Korotin et al., 2022], for which the authors derive a sample complexity bound.
Missing Important References
Le Gouic et al. Projection to Fairness in Statistical Learning. 2020.
Other Strengths and Weaknesses
Weaknesses:
- Theorem 3.4 states a minimax rate in terms of the minimax rate of the unconstrained regression problem plus the sample complexity of learning the barycenter/optimal transport maps. A heuristic interpretation of the result is given in Remark 3, which says that if the regression problem is harder than that of learning the transport maps ("Such a situation may frequently happen"), then the minimax rate is dominated by that of the regression problem. Unfortunately, no concrete example is provided to show when the minimax rate would be dominated by the first term.
In particular, the reviewer is curious about the complexity of learning the barycenter/optimal transport as conveyed by the second term. For example, on the synthetic problems constructed in prior work analyzing minimax optimality of fair regression, what do the constants instantiate to, and does the second term decay faster than the first?
- Because Theorem 3.4 is derived from the post-processing perspective, the second term could be replaced by any sample complexity bound for learning the barycenter/optimal transport maps. How does the derived bound compare (or why is it incomparable) to that in Section 4 of [Chzhen et al., 2020], which was derived for a nonparametric estimator?
- The result in Theorem 3.4 is obtained by taking the post-processing perspective rather than analyzing the fair regression problem in an end-to-end manner (i.e., in-processing), so the caveats of post-processing vs. in-processing apply. In particular, post-processing can be suboptimal if the class of regressors is constrained; see [Woodworth et al., 2017].
- The algorithm in Section 4 basically takes the same form as that in [Chzhen et al., 2020], except that here the barycenter/optimal transport maps are estimated by optimizing the congruent potentials on the empirical problem. It is unknown how this compares to the original nonparametric estimator, and there are no experimental results in the paper, so the contribution of this section is unclear. What are the benefits of the new formulation compared to the original?
Other Comments or Suggestions
- line 180: incomplete sentence
- line 301: Dodley -> Dudley
We appreciate the reviewer’s detailed feedback. Here, we address the main concerns:
W1:
A pertinent example is when the regression function is a composition of multiple functions within the Hölder class, as used by Schmidt-Hieber (2020), and the transport map belongs to a Sobolev space. In this scenario, the rate of the first term is governed by the cumulative smoothness of the component functions and the dimensionality of their inputs. For the Sobolev space, Definition 3.3 is satisfied by selecting a set of functions spanned by the first wavelet bases, and the overall rate holds up to logarithmic factors. If the regression function and the transport map share the same level of smoothness, the first term dominates the second once the dimensionality of the essential intermediate data representation is sufficiently large. Such an intermediate representation may indeed have multiple dimensions, causing the first term, from the conventional regression problem, to dominate the rate.
We will incorporate this example after discussing the implications of the main theorem in the revised version.
W2, W4:
Our results offer significant advantages over the error bounds presented by Chzhen et al. (2020). Firstly, our results are derived under weaker assumptions. Their results require the conventional regression algorithm to have a sub-Gaussian high-probability error bound, whereas our results only require a bound on the expected squared error. Additionally, while they demand constant upper and lower bounds on the density, we only assume a Poincaré-type inequality. This broadens the applicability of our meta-theorem compared to their findings.
Secondly, their results cannot achieve a faster rate, as their upper bound on the estimation error of the post-processor dominates the other terms. However, our results can achieve a faster rate by leveraging the smooth structure of the transport maps.
Thus, our results can demonstrate a faster rate under weaker assumptions compared to those provided by Chzhen et al. (2020). We will include this discussion as an implication of our main theorem in the revised version.
W3:
We wish to clarify that our results do not contradict those of Woodworth et al. (2017), as their results pertain to equalized odds, not demographic parity. Our findings show that, with respect to the sample size, post-processing can be optimal. However, it may be suboptimal in other parameters, such as the Lipschitz constant and the number of groups. Analyzing optimality with respect to these parameters is an important direction for our future work.
The reviewer appreciates the authors' response, and has raised their score!
This paper studies the fair regression problem with demographic parity as a constraint. It claims that existing minimax optimal regression algorithms are coupled with particular data generation models, and proposes meta-theorems to validate fair minimax optimality. The authors then demonstrate that optimal regression can be achieved through post-processing methods, which can therefore be applied efficiently and flexibly.
Questions for Authors
I would consider this a good paper if the authors can address my concerns above.
Claims and Evidence
The claim that existing analyses are coupled with data generation is not clear. It is mentioned in the introduction, but I cannot connect it to the main methodology.
Methods and Evaluation Criteria
This is a theoretical paper that provides analyses rather than new methods.
Theoretical Claims
I read through the theoretical analyses but only understood parts of them.
Experimental Design and Analyses
No experiments are presented in the paper.
Supplementary Material
The appendices contain the theoretical proofs; no additional supplementary materials are provided.
Relation to Prior Work
I checked the related papers on OT map estimation in Wasserstein barycenter problems and on minimax optimality in regression. However, I cannot immediately identify the improvements over these works, although the authors have highlighted the differences.
Missing Important References
I have related comments below.
Other Strengths and Weaknesses
Strengths: I think the three listed contributions are important if they are correct. The meta-optimality result connects the minimax optimal error for fair regression with that of traditional regression; the authors demonstrate that optimal regression models can be made fair through a post-processing method; and they provide a convergence rate analysis for the transport map estimation.
Weaknesses:
- The first thing I found confusing is that in fairness research, minimax optimality also refers to minimizing the risk of the worst group [1], under the Rawlsian fairness concept. I realised the authors are working in a different area after checking the work of Chzhen & Schreuder (2022) and Zeng (2024). The authors mention the worst-case error taken over a certain set of data generation models, but I did not understand how this limitation is addressed in the paper.
- I found that the regressor f is group-wise (line 75), whereas a conventional regressor may be group-unaware. How would post-processing be done in that case? Maybe I am posing the question incorrectly.
- I realised that this work and some earlier ones focus on regression analyses. I am curious why only regression is considered; will these analyses hold for classification as well? A recent paper [2] pointed out that loss/error is more sensitive than discrete predictions when considering risk distributions. Does this paper share a similar insight?
[1] Minimax Pareto Fairness: A Multi Objective Perspective. ICML 2020.
[2] Towards Harmless Rawlsian Fairness Regardless of Demographic Prior. NeurIPS 2024.
Other Comments or Suggestions
Is the proposed post-processing algorithm a new method compared with existing post-processing fairness work? If so, would experiments on it help justify the advances?
We appreciate the reviewer's comprehensive feedback and would like to address their main concerns:
W1: Rawlsian Fairness
We wish to clarify that Rawlsian fairness is fundamentally different from equality-based fairness concepts like demographic parity and equalized odds. Rawlsian fairness focuses on minimizing adverse effects on the most disadvantaged group, whereas equality-based fairness aims for equal treatment across groups. Due to these fundamental differences, results from Rawlsian and equality-based fairness approaches are not directly comparable.
Furthermore, we emphasize that minimax optimality is a well-established concept in statistical literature, used to define the best estimator for a statistical estimation problem. In our context, "minimax" involves maximizing over all possible underlying distributions and minimizing over all estimation algorithms. This concept differs from Rawlsian fairness, as we can define a minimax optimal Rawlsian fair learning algorithm that minimizes the worst-case error, considering both underlying distributions and groups.
Regarding data generation models, they represent potential underlying distributions in nature. Considering a specific set of data generation models is not a limitation of the learner's ability but rather reflects prior knowledge about the data. The sample complexity varies based on this prior knowledge. Extensive research in statistical literature has shown how prior knowledge fundamentally influences sample complexity. Our meta-theorem can accommodate various prior knowledge assumptions by leveraging these existing results.
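In symbols (our notation, not the paper's), the statistical notion of minimax risk referred to here is

```latex
\inf_{\hat{f}_n} \; \sup_{P \in \mathcal{P}} \;
\mathbb{E}_{P}\!\left[ \ell\bigl(\hat{f}_n, f^{*}_{P}\bigr) \right],
```

where the infimum ranges over learning algorithms, the supremum ranges over the set $\mathcal{P}$ of data generation models encoding the prior knowledge, and $f^{*}_{P}$ is the target under $P$. By contrast, the Rawlsian notion takes a supremum over demographic groups for a fixed distribution.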
W2: Group-wise Regressor
We use a group-wise predictor because it is commonly employed in fairness literature. Developing an optimal post-processing method for an unaware predictor is an important area for future research.
W3: Classification
In the context of minimax optimality, regression problems are often considered more fundamental than classification problems, as the minimax optimal error for classification is typically derived using regression techniques. Extending our work to classification problems is a significant direction for future research.
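The standard reduction behind this remark can be sketched as follows (textbook material, not the paper's method): a plug-in classifier thresholds a regression estimate of the conditional mean, so regression guarantees transfer to classification.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Plug-in rule: threshold a regression estimate of
    eta(x) = P(Y = 1 | X = x) at 1/2.  Classical fact: the excess
    classification risk is at most 2 * E|eta_hat(X) - eta(X)|, so
    minimax rates for classification follow from regression rates.
    """
    def classify(x):
        # Predict class 1 whenever the estimated conditional mean
        # of Y given X = x is at least one half.
        return (eta_hat(x) >= 0.5).astype(int)
    return classify

# Toy usage with a hypothetical regression estimate eta_hat(x) = x.
clf = plug_in_classifier(lambda x: np.asarray(x, dtype=float))
```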
This paper investigates the theoretical properties of fair regression problems by leveraging optimal transport techniques. It provides important theoretical bounds in the context of fair regression and designs a regression algorithm matching the upper bound.
Questions for Authors
N/A
Claims and Evidence
Yes. The overall structure of this paper is logical and sound, and the flow of the argument is clear.
Methods and Evaluation Criteria
Not applicable. This is a theoretical paper.
Theoretical Claims
Not checked in detail, but the results look reasonable.
Experimental Design and Analyses
Not applicable
Supplementary Material
N/A
Relation to Prior Work
This paper provides a new theoretical analysis for the problem of fair regression.
Missing Important References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We thank the reviewer for their positive feedback and for recognizing the importance of our theoretical contributions in fair regression.
This paper addresses the problem of fair regression under demographic parity, with a focus on providing minimax optimality guarantees in an attribute-aware setting. The core contribution is a theoretical framework that decomposes the minimax risk for fair regression into two components: the minimax rate of the unconstrained regression problem and the sample complexity of learning a fair post-processor, which is an optimal transport map to a Wasserstein barycenter. The authors frame this estimation problem through congruent potentials and provide convergence guarantees for the barycenter estimator.
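Schematically (our notation; constants and problem classes omitted), the decomposition reads

```latex
\inf_{\hat{f}} \sup_{P \in \mathcal{P}}
\mathbb{E}\bigl[\text{excess risk of } \hat{f}\bigr]
\;\asymp\;
\underbrace{\delta_{\mathrm{reg}}(n)}_{\text{unconstrained regression}}
\;+\;
\underbrace{\delta_{\mathrm{OT}}(n)}_{\text{barycenter/transport estimation}},
```

so the fair problem is no harder, up to the transport-estimation term, than its unconstrained counterpart.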
Strengths:
The paper contributes to the theoretical understanding of fair regression by clarifying the statistical cost of post-processing methods relative to unconstrained learning. The meta-theorem and decomposition offer a structured lens for analyzing fair learning under distributional constraints, and extend existing results by [Chzhen et al., 2020] and [Le Gouic et al., 2020]. The authors provide sample complexity bounds for the post-processing component, drawing from recent developments in optimal transport theory (e.g., [Korotin et al., 2022]).
Weaknesses:
Both reviewers noted ambiguity in the claim that existing analyses are tightly coupled with specific data generation processes. Some conceptual issues were raised around the applicability and limitations of post-processing, particularly when the function class of the unconstrained regressor is restricted. There is no direct comparison to related results in [Chzhen et al., 2020] on nonparametric estimation, nor examples illustrating when the regression component dominates the learning cost.
Overall, this is a conceptually valuable and mathematically interesting contribution to the theory of fair regression. It clarifies the statistical limits of fairness through post-processing and advances the literature on minimax optimality in constrained learning. However, the paper would be substantially strengthened by clearer exposition, more thorough justification of certain claims, and concrete examples or experiments that ground the theory.