The Disparate Benefits of Deep Ensembles
We uncover the disparate benefits effect of Deep Ensembles, analyze its cause and evaluate approaches to mitigate its negative fairness implications.
Abstract
Reviews and Discussion
The paper examines how deep neural network ensembles affect group fairness, demonstrating that these ensembles provide disproportionate benefits across protected groups (a phenomenon termed the "disparate benefits effect"). The authors establish that this effect is linked to the predictive diversity among ensemble members, and note that deep ensembles are more sensitive to the choice of prediction threshold than individual models because of their calibration. Additionally, they investigate post-processing techniques to mitigate the unfairness introduced by the disparate benefits effect while maintaining ensemble performance. Their findings indicate that Hardt post-processing is particularly effective, as it reduces unfairness while preserving the ensemble's performance.
Strengths
The main strengths are:
- The focus on how deep ensemble models affect fairness violations in group fairness scenarios, a relatively understudied area in current research.
- The authors provide reasonable arguments for why (group) unfairness emerges in deep ensembles. Their claims are thoroughly tested through comprehensive analyses: controlled experiments and large-scale benchmark evaluations examining multiple group fairness metrics under various distribution shifts.
- The argument for using post-hoc methods that edit the calibration of the predictive distribution, rather than focusing on non-homogeneous member weighting, is both practical and convincing.
Weaknesses
The paper presents valuable insights into the relationship between deep ensembles and group fairness. The authors' thorough analysis could be further enriched by expanding their investigation to include other fairness definitions discussed in the literature [1,2]. For instance, while the authors focus on group fairness, their findings could be contextualized alongside the results from [3], which examines min-max fairness [2] and shows that deep ensembles improve worst-group accuracy (their findings align with the "FalseFalseTrue" phenomenon described in [4]). To strengthen the paper's contribution to the broader fairness literature (which would help create a more comprehensive understanding of deep ensembles' impact on algorithmic fairness), I suggest either:
- Clarifying the strategic choice to focus on group fairness and its specific implications, or
- Preferably, extending the analysis to examine whether the authors' hypotheses hold across different fairness definitions, particularly in relation to the findings in [3].
[1] Verma, Sahil, and Julia Rubin. "Fairness definitions explained." Proceedings of the international workshop on software fairness. 2018.
[2] Zietlow, Dominik, et al. "Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Ko, Wei-Yin, et al. "Fair-ensemble: When fairness naturally emerges from deep ensembling." arXiv preprint arXiv:2303.00586 (2023).
[4] Lin, Yong, et al. "Spurious feature diversification improves out-of-distribution generalization." In ICLR (2024).
Questions
Continuing from the Weaknesses section of the review: given that the min-max fairness setting [1] addresses inefficiencies of statistical parity/equality measures when evaluating fairness, and given your findings on predictive diversity, it would also be interesting to understand:
- How does your analysis of predictive diversity relate to the worst-group accuracy (WGA) findings in [2]?
- In the context of min-max fairness, how would you assess the applicability of the Hardt post-processing method from a calibration standpoint? Are there any benefits in using it?
[1] Zietlow, Dominik, et al. "Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Ko, Wei-Yin, et al. "Fair-ensemble: When fairness naturally emerges from deep ensembling." arXiv preprint arXiv:2303.00586 (2023).
Thank you for the constructive feedback on our work. We appreciate the positive assessment, particularly regarding the thorough testing of our claims, the comprehensiveness of our analysis, and the scale of our evaluations. We respond below to the weaknesses and questions raised in your review:
To strengthen the paper's contribution to the broader fairness literature (...) I suggest either: 1) Clarifying the strategic choice to focus on group fairness and its specific implications, or 2) Preferably, extending the analysis to examine whether the authors' hypotheses hold across different fairness definitions, particularly in relation to the findings in [2].
Thank you, both suggestions are indeed valuable directions to improve the positioning of our work in the literature. Regarding the presentation within the main paper, we now elaborate more directly on direction 1) (see lines 180 - 182). We decided to focus on the presented group fairness definitions (Eq. (2) - (4)), as they are grounded in legal notions of fairness and are currently considered to be the major measures of interest for group fairness within the algorithmic fairness community (see e.g. [3, 4]). Mathematically, they describe the statistical notions of independence and separation. From a social perspective, these metrics correspond to well-established legal principles, such as disparate impact and disparate benefit, which are fundamental concepts in anti-discrimination law [4, 5, 6, 7]. Statistical Parity builds upon the idea of disparate impact by ensuring that outcomes are independent of sensitive attributes, reflecting a focus on preventing systemic bias in decision-making processes [4, 5]. Equalized Odds and Equal Opportunity, on the other hand, extend these principles by addressing disparate harms conditional on legitimate factors, and are thus closely linked to the notion of disparate benefit. Their statistical definitions provide an actionable framework to evaluate fairness desiderata while maintaining interpretability in contexts with legal and social implications.
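For the reader's reference, the signed standard forms of these measures (the paper's Eq. (2) - (4) may use absolute values or the opposite sign convention) are, for a binary predictor $\hat{Y}$, label $Y$ and protected attribute $A \in \{0,1\}$:
$$\mathrm{SPD} = P(\hat{Y}{=}1 \mid A{=}1) - P(\hat{Y}{=}1 \mid A{=}0), \qquad \mathrm{EOD} = \mathrm{TPR}_{A=1} - \mathrm{TPR}_{A=0},$$
$$\mathrm{AOD} = \tfrac{1}{2}\big[(\mathrm{TPR}_{A=1} - \mathrm{TPR}_{A=0}) + (\mathrm{FPR}_{A=1} - \mathrm{FPR}_{A=0})\big],$$
where $\mathrm{TPR}_{A=a} = P(\hat{Y}{=}1 \mid Y{=}1, A{=}a)$ and $\mathrm{FPR}_{A=a} = P(\hat{Y}{=}1 \mid Y{=}0, A{=}a)$.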
However, we agree with 2): investigating complementary group fairness definitions, particularly min-max fairness as suggested, provides a more comprehensive picture. Therefore we added an additional investigation in Sec.F.2 in the appendix. We elaborate more on this investigation and the relation to the findings in [2] within the next answer.
W: For instance, while the authors focus on group fairness, their findings could be contextualized alongside the results from [2], which examines min-max fairness [1] and shows that deep ensembles improve worst-group accuracy.
Q: How does your analysis of predictive diversity relate to the worst-group accuracy (WGA) findings in [2]?
The majority of our findings focus on the "error balancing" regime in [1], using exactly the same measures (their Eq. (1) is our EOD and their Eq. (2) is essentially our AOD), which have not been considered in [2]. Furthermore, [1] investigates min-max fairness (their Eq. (3)) on the worst-group accuracy, as was also done in [2], but additionally on the worst-group TPR, which is part of our analysis. These two notions can be seen as complementary: one focuses on the disparity between groups, while the other focuses on the worst group. Our experiments in Fig. 2 (for all tasks in Fig. 11-13) show the change in TPR for both groups, thus also for the worse group (A=0). Here we find that the TPR of both groups never decreases due to ensembling. We additionally added results on the change in accuracy due to ensembling in Fig. 19 - 21 in the appendix, with a discussion through the lens of min-max fairness in Sec. F.2. Similarly to [2], we find that the accuracy of both groups never decreases due to ensembling, showing that ensembling does not lead to the detrimental effect reported for in-processing methods in [1]. Yet we do not find that the disadvantaged group generally improves more than the advantaged group; accuracy improvements are generally of similar magnitude. Notably, [2] often reported relative changes in worst-group accuracy, while we report absolute changes for accuracy, TPR and FPR. This likely explains the strong benefit for the disadvantaged group in [2]: if both groups benefit similarly on an absolute scale, the relative gain is higher for the group with originally lower accuracy. In sum, we find that deep ensembles improve worst-group accuracy (and error rates). However, the main issue pointed out by our work remains, namely that ensembling often increases the gap in e.g. error rates between groups, impairing fairness as defined by the most common and well-established group fairness metrics.
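To make the absolute-versus-relative distinction concrete with purely hypothetical numbers (not taken from either work): if ensembling improves the disadvantaged group from 70% to 73% accuracy and the advantaged group from 90% to 93%, then
$$\Delta_{\text{abs}} = 3\ \text{percentage points for both groups}, \qquad \Delta_{\text{rel, disadv.}} = \tfrac{3}{70} \approx 4.3\% \;>\; \Delta_{\text{rel, adv.}} = \tfrac{3}{90} \approx 3.3\%,$$
so a relative metric reports a larger benefit for the originally weaker group even though the absolute accuracy gap is unchanged.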
In the context of min-max fairness, how would you assess the applicability of the Hardt post-processing method from a calibration standpoint? Are there any benefits in using it?
The improved calibration of Deep Ensembles will also help to select a better decision threshold for higher accuracy on the worst group. This is shown in Fig. 6b for FF (age/race) as well as in Fig. 31 in the appendix for all tasks. Hardt post-processing was explicitly developed to choose thresholds such that the differences in error rates (e.g. TPR) between groups are minimized (or set to a specified value); it is thus neither applicable nor needed to improve min-max fairness. Simply setting the optimal threshold for the worst group on a validation set should suffice.
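For concreteness, here is a minimal, simplified sketch of the two threshold-selection strategies discussed above. This is not the authors' implementation, and the group-specific variant only approximates the procedure of Hardt et al. (2016), which solves a linear program over group-wise ROC points and may randomize between thresholds; all function and variable names (group_rates, scores, a) are illustrative.

```python
import numpy as np

def group_rates(scores, y, a, thr_by_group):
    """Accuracy and TPR per protected group under group-specific thresholds."""
    out = {}
    for g in np.unique(a):
        m = a == g
        pred = (scores[m] >= thr_by_group[g]).astype(int)
        out[g] = {"acc": (pred == y[m]).mean(),
                  "tpr": pred[y[m] == 1].mean()}
    return out

def equal_opportunity_thresholds(scores, y, a, grid=np.linspace(0, 1, 101)):
    """Hardt-style post-processing (simplified): one threshold per group, chosen
    on validation data to make the group TPRs as equal as possible, breaking
    ties by overall accuracy."""
    best, best_key = None, None
    for t0 in grid:
        for t1 in grid:
            r = group_rates(scores, y, a, {0: t0, 1: t1})
            key = (abs(r[0]["tpr"] - r[1]["tpr"]), -(r[0]["acc"] + r[1]["acc"]))
            if best_key is None or key < best_key:
                best, best_key = {0: t0, 1: t1}, key
    return best

def worst_group_threshold(scores, y, a, grid=np.linspace(0, 1, 101)):
    """Min-max alternative: a single shared threshold that maximizes
    worst-group accuracy on validation data."""
    def worst_acc(t):
        r = group_rates(scores, y, a, {g: t for g in np.unique(a)})
        return min(v["acc"] for v in r.values())
    return max(grid, key=worst_acc)
```

In practice, the thresholds would be fitted on a held-out validation split and then evaluated on the test split.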
Thank you once again for your valuable feedback, your time and effort on this assessment. If you have any further questions or comments regarding the additional results, we would be glad to discuss them.
[1] Zietlow, Dominik, et al. "Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Ko, Wei-Yin, et al. "Fair-ensemble: When fairness naturally emerges from deep ensembling." arXiv preprint arXiv:2303.00586 (2023).
[3] Mehrabi, Ninareh, et al. “A Survey on Bias and Fairness in Machine Learning”, ACM computing surveys (2021)
[4] Barocas, Solon, et al. “Fairness and Machine Learning: Limitations and Opportunities”, MIT press (2023)
[5] Feldman, Michael, et al. "Certifying and removing disparate impact.", SIGKDD (2015)
[6] Hardt, Moritz, et al. "Equality of opportunity in supervised learning." NeurIPS (2016)
[7] Xiang, Alice, and Inioluwa Deborah Raji. "On the legal compatibility of fairness definitions." HCML workshop at NeurIPS (2019)
I sincerely thank the authors for their detailed and thoughtful responses to my feedback, which also addressed my additional curiosities on the topic. Overall, I find this to be a good contribution, and I have accordingly raised my score. I have no further questions or concerns at this time.
Thank you very much for your prompt reply and engagement with our rebuttal! We are glad to hear that you found the additional experiments insightful and are in favor of our contribution!
This paper empirically investigates fairness violations in Deep Ensemble models. It finds that, in most cases, Deep Ensemble models exhibit poorer fairness performance compared to single models, despite achieving significantly better overall performance, which echoes existing research. The paper aims to understand the reasons behind the amplification of unfairness and demonstrates that post-processing can mitigate unfairness issues in Deep Ensemble models.
Strengths
This work offers a perspective on fairness issues in Deep Ensembles.
The presentation style is clear and easy to follow.
Additionally, I find the experiments conducted to be convincing.
Weaknesses
(1) A key limitation of this work lies in the interpretation of results presented in Section 6. The author hypothesizes that disparities in the average predictive diversity among groups contribute to the observed disparate benefits effect. However, the benefit remains ambiguous. For instance, in Figure 2, while the subgroup with A=1 demonstrates a higher TPR, the subgroup with A=0 shows a lower False Positive Rate (FPR), making the ensemble effect difficult to assess.
(2) In Figure 13(a), the disadvantaged group shows a higher TPR and nearly the same FPR as the advantaged group, which is inconsistent with the analysis in Section 6.
(3) The fairness concern addressed is the disparity across sensitive groups. However, the author's approach to explaining unfairness through performance differences does not fully explore the underlying reasons for fairness issues in deep ensembles.
(4) The proposed solution is also relatively unremarkable. The use of model-agnostic post-processing methods, which are not specifically tailored for deep ensembles, is unsurprising. The key question is how the deep ensemble model differs in effectively addressing fairness concerns compared to other high-performing models.
Questions
(1) One reason for unfairness arises from skewed data distribution. Figure 3(c) shows a much more skewed distribution compared to Figure 3(a). However, I found that the fairness issues in CX (age) are much milder than those in FF (age/gender), and I am curious why this is the case.
(2) The trade-off between utility and fairness shows similar behavior in relation to fairness issues across classification tasks. This raises the question: what distinguishes the behavior of a deep ensemble model with strong utility performance in terms of fairness?
Details of Ethics Concerns
NA
The proposed solution is also relatively unremarkable. The use of model-agnostic post-processing methods, which are not specifically tailored for deep ensembles, is unsurprising. The key question is how the deep ensemble model differs in effectively addressing fairness concerns compared to other high-performing models.
The main contribution of our paper is not a solution to the disparate benefits effect of Deep Ensembles, but the discovery of this effect on a variety of datasets and tasks. Deep Ensembles are widely used to improve predictive performance, especially in computer vision tasks, and we investigate the fairness implications arising from that common practice. Having said this, the surprising finding about the Hardt post-processing method applied to Deep Ensembles is its increased effectiveness due to the better calibration of the ensemble compared to individual ensemble members. Furthermore, we discuss other post-processing methods in the paper such as non-uniform weighting of the deep ensemble members specifically tailored to them. Finally, we want to emphasize that we don’t see Deep Ensembles as a method to improve fairness. They are widely used to improve accuracy, but there hasn't been a rigorous study of their implications regarding group fairness.
One reason for unfairness arises from skewed data distribution. Figure 3(c) shows a much more skewed distribution compared to Figure 3(a). However, I found that the fairness issues in CX (age) are much milder than those in FF (age/gender), and I am curious why this is the case.
Just to make sure, are you referring to the percentage of samples with A=0 and A=1 provided in the legend of individual plots in Figure 3? We fully agree that unfairness often arises due to skewed data distributions on e.g. tabular data. However, this does not seem to be a strong factor in the case of large-scale image datasets with high-capacity CNN models. A very compelling characterization of this from a bias-variance decomposition perspective is provided in [1], which was brought to our attention by reviewer HhJa. Furthermore, the absolute unfairness of FF (age/gender) and CX (age) is roughly equivalent (see Tables 2 and 3), yet the disparate benefits effect, i.e. the change in fairness due to ensembling, is more prominent for FF (age/gender) than for CX (age), which is consistent with Figure 3, where we find stronger average predictive diversities for the former.
The trade-off between utility and fairness shows similar behavior in relation to fairness issues across classification tasks. This raises the question: what distinguishes the behavior of a deep ensemble model with strong utility performance in terms of fairness?
We are unsure if we understand this statement and question correctly. It is widely known that combining different deep neural networks into a Deep Ensemble improves performance. We empirically corroborate this in all our experiments, see e.g. Table 1 (Δ accuracy shows how much the accuracy improved in the Deep Ensemble compared to its individual members). Our work shows that the performance does not increase evenly across groups, leading to a disparate benefits effect, which in many cases means that the ensemble is more unfair than the individual members. Furthermore, we postulate that this disparate benefits effect can be explained by differences in the average predictive diversity per group and empirically investigate this relationship.
Thank you again for the constructive feedback, and the time and effort put into this assessment of our work. We would be happy to discuss any further questions or remarks you may have.
[1] Zietlow, Dominik, et al. "Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Dear reviewer, as the rebuttal is approaching its end, we kindly ask if our responses have adequately addressed your concerns or if there are any additional questions we can clarify.
Thank you for your time and effort in assessing our work. We appreciate the positive feedback regarding the clarity of our presentation and that the conducted experiments are convincing. We address below the stated weaknesses and questions:
A key limitation of this work lies in the interpretation of results presented in Section 6. The author hypothesizes that disparities in the average predictive diversity among groups contribute to the observed disparate benefits effect. However, the benefit remains ambiguous. For instance, in Figure 2, while the subgroup with A=1 demonstrates a higher TPR, the subgroup with A=0 shows a lower False Positive Rate (FPR), making the ensemble effect difficult to assess.
The goal of the experiments presented in Section 6 is to investigate possible reasons for the disparate benefits effect that was demonstrated in Section 5. The analysis in Figure 2 shows the changes (Δ) in PR, TPR and FPR due to ensembling, as these are the building blocks of the considered group fairness measures. Note that Figure 2 does not depict the absolute values but the change due to ensembling (ΔPR, ΔTPR and ΔFPR). For example, the positive delta in TPR for A=1 means that the TPR for A=1 increases due to ensembling, whereas the TPR for A=0 remains roughly the same after ensembling.
Our main argument is that if the average predictive diversity between groups (A=1 vs A=0) differs (large black arrows in Figure 3), then different groups will have different 'benefits' or deltas due to ensembling (ΔPR, ΔTPR and ΔFPR), leading to larger fairness violations according to the measures in Eq. (2)-(4).
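Written out with the signed forms of the measures (using Δ for the change from the average individual member to the ensemble), the connection is immediate, since the measures are linear in the per-group rates:
$$\Delta \mathrm{SPD} = \Delta \mathrm{PR}_{A=1} - \Delta \mathrm{PR}_{A=0}, \qquad \Delta \mathrm{EOD} = \Delta \mathrm{TPR}_{A=1} - \Delta \mathrm{TPR}_{A=0},$$
$$\Delta \mathrm{AOD} = \tfrac{1}{2}\big[(\Delta \mathrm{TPR}_{A=1} - \Delta \mathrm{TPR}_{A=0}) + (\Delta \mathrm{FPR}_{A=1} - \Delta \mathrm{FPR}_{A=0})\big].$$
Hence, whenever one group's rates change more than the other's due to ensembling, the fairness measures change by exactly that difference.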
In Figure 13(a), the disadvantaged group shows a higher TPR and nearly the same FPR as the advantaged group, which is inconsistent with the analysis in Section 6.
Is it possible that there is a typo in the figure number? In Figure 13(a), the advantaged group A=1 has a higher TPR (i.e. the TPR after ensembling is larger than before ensembling) and both groups have about the same FPR, thus we are confident the results are consistent with the analysis in Section 6 (i.e. Figure 2). Furthermore, we would like to emphasize that the disparate benefits effect does not necessarily imply that the advantaged group benefits more, but that one of the two groups benefits more than the other. Empirically we find that this is often the case for the already advantaged group, which is why the effect is problematic from an algorithmic fairness perspective.
The fairness concern addressed is the disparity across sensitive groups. However, the author's approach to explaining unfairness through performance differences does not fully explore the underlying reasons for fairness issues in deep ensembles.
We agree that our study is just a first step to fully explore the fairness implications of Deep Ensembles, as mentioned within the limitation statement in lines 533 - 539. There are indeed numerous aspects to algorithmic fairness, and no investigation however rigorous can ever claim completeness. Due to the widespread interest in group fairness and its application in real-world scenarios (performance differences between protected groups), we decided to start our exploration with an in-depth study of group fairness in deep ensembles. The types of analyses performed in our paper are in line with those reported in the group fairness literature. Would you have any suggestions for additional analyses within the scope of group fairness that our manuscript is currently missing?
The paper explores the impact of Deep Ensembles on group fairness through empirical studies across three datasets—FairFace, UTKFace, and CheXpert—using statistical parity difference (SPD), equal opportunity difference (EOD), and average odds difference (AOD). The results show that while ensembles improve performance globally, they generally tend to favor advantaged demographic groups according to fairness measures, which the authors call "the disparate benefits effect". They link this effect to how ensemble members' predictions vary across demographic groups. Finally, the study shows that because Deep Ensembles are better calibrated than individual models, post-processing techniques, such as the algorithm by Hardt et al. (2016), are effective in mitigating these fairness disparities while preserving the performance benefits of Deep Ensembles.
Strengths
The study addresses an important aspect of Deep Ensembles by examining the fairness implications of their performance improvements. This kind of evaluation is crucial to ensure that gains in predictive accuracy are achieved without unintended consequences for different demographic groups.
Weaknesses
- The reported differences in fairness metrics between models are often very small, making it difficult to assess the significance of the disparate benefits effect. Including the original results for individual models in Table 1 would help clarify these differences.
- The baseline selection process is unclear - while I assume the baseline represents the aggregated average across all individual models, the paper doesn't explicitly specify it (is it average results on all individual models / architectures and different seeds?). It should be clarified for proper comparison.
- The paper lacks intuitive explanations for the relationship between prediction diversity and fairness. While the average predictive diversity (DIV) shows strong differences between demographic groups, it's unclear why this specifically impacts fairness criteria.
- The study relies solely on Hardt's post-processing method for mitigation. Comparing with other approaches, like in-processing techniques, would strengthen the claims.
Questions
- See Point 3 above - Could you provide an intuition on how diversity among ensemble members contributes to fairness violation? Why could high predictive diversity (DIV) across demographic groups lead to increased disparate impact or affect equalized odds? I do not see the causation.
- Are the calibration benefits of Deep Ensembles evenly distributed across the two different demographic groups?
- a) The controlled experiment lacks testing of different levels of ensemble diversity. How would increasing the number of diverse training images for A=1 (two/three/four/five different images) affect the fairness metrics (DP and EO)? Does it decrease gradually? Such experimental variations might provide additional insights into the observed relationship.
b) What are the corresponding average metrics of individual members (at least on 10 members) in this controlled experiment? Do individual members also show better fairness metrics compared to the deep ensemble?
4) The baseline selection process requires more clarification. Is it the average metrics over the 10 individual models on the 5 architectures and the different seeds? On page 6, at the end of the caption, it states "...and the average ensemble member." Should this instead read "...and the average of individual members"?
- Why was the Deep Ensemble not tested with fairness constraints through in-processing methods? Testing fairness constraints during training of the members could help strengthen the claims about the suited post-processing mitigation approaches.
W: The study relies solely on Hardt's post-processing method for mitigation. Comparing with other approaches, like in-processing techniques, would strengthen the claims.
Q: Why was the Deep Ensemble not tested with fairness constraints through in-processing methods? Testing fairness constraints during training of the members could help strengthen the claims about the suited post-processing mitigation approaches.
We made the strategic choice to focus on Deep Ensembles where individual members have not been subject to any pre- or in-processing fairness techniques. We argue this is the most widely adopted scenario in practice. Furthermore, we clearly state that considering members that have been subject to pre- or in-processing methods is out of scope and thus a limitation of our study (line 538 - 539), yet an important direction for future work. The main focus of our study is to establish and understand the disparate benefits effect for Deep Ensembles. However, it would feel incomplete without an investigation of how to mitigate negative fairness implications of the disparate benefits effect, thus we evaluated post-processing methods that can be easily applied by practitioners who already have pre-trained ensemble members. We also analyzed why the Hardt post-processing method performs so well (due to the improved calibration). We firmly believe that the vast space of possible pre- and in-processing interventions on individual models warrants its own thorough investigation, building upon the presented findings.
The controlled experiment lacks testing of different levels of ensemble diversity. How would increasing the number of diverse training images for A=1 (two/three/four/five different images) affect the fairness metrics (DP and EO)? Does it decreases gradually? Such experimental variations might provide additional insights into the observed relationship.
Thank you for this very interesting suggestion. We implemented the experiment you suggested, yet already when we combine 3 images, the underlying task becomes too trivial for FashionMNIST: there will always be at least one image that can easily be assigned to the correct class by any individual model for A=1, meaning they reach around 98% accuracy. As a consequence, there is no real benefit of ensembling anymore, as all members (correctly) predict the same for A=1. However, we continued our endeavours along your idea and implemented a similar experiment that enables us to investigate this relationship. We based this experiment upon the original controlled experiment. In the new experiment, we defined A=0 as the original image plus uniform random noise in the bottom image, and A=1 as a linear interpolation between a second image from the same class and uniform random noise, controlled by an interpolation strength α. At one extreme of α, A=1 in the new controlled experiment is the same as A=1 in the original controlled experiment; at the other extreme, both groups are the same. We use different values of α to analyze the relationship between the average predictive diversity and the disparate benefits effect. The results and a more detailed description are provided in Section F.1 in the appendix. In sum, we find that for increasing α, both the average predictive diversity and the fairness violations due to ensembling increase, demonstrating a strong correlation between them (see Fig. 18).
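A minimal sketch of one plausible reading of this construction (the exact image layout and the direction in which α parameterizes the interpolation are assumptions on our part; Sec. F.1 of the paper is authoritative): the top half carries the class image for both groups, the bottom half is pure uniform noise for A=0 and, for A=1, a mix of a second same-class image and noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_input(class_img, second_img, group, alpha):
    """Sketch of the controlled-experiment inputs (layout and alpha direction
    are assumptions, see the accompanying text). Images are assumed to be
    2D arrays with values in [0, 1]."""
    noise = rng.uniform(0.0, 1.0, size=class_img.shape)
    if group == 0:
        bottom = noise                                      # A=0: pure noise
    else:
        bottom = alpha * second_img + (1 - alpha) * noise   # A=1: image/noise mix
    return np.concatenate([class_img, bottom], axis=0)      # stack top and bottom
```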
What are the corresponding average metrics of individual members (at least on 10 members) in this controlled experiment? Do individual members also show better fairness metrics compared to the deep ensemble?
Due to lack of space in the main paper, we provide the numbers in Table 19, together with the numbers from the additional controlled experiment, in Section F.1 in the appendix. Note that for the original experiment, those metrics are also shown in Figure 4a, where the left side of each plot shows the mean and std of the performance and fairness metrics of individual members (note that we used 1-SPD, 1-EOD and 1-AOD for visual clarity in those plots, so that we measure "fairness" rather than "unfairness" or "fairness violation"), and the right side of each plot shows the mean and std of the metrics for the full ensemble of 10 models. Individual models have lower performance but higher fairness, whereas the ensemble has higher performance but lower fairness.
Once again, thank you for the valuable feedback, the insightful questions, and the time and effort invested in this review. If you have any further questions or comments regarding the additional results, we would be glad to discuss them.
Thank you for your detailed response, the revisions have strengthened the paper. The alternative experimental design incorporating noise interpolation with the α parameter adds new empirical insights into predictive diversity and fairness violations.
However, the explanation connecting ensemble diversity to fairness violations still needs strengthening. The causal mechanism that "if there are more diverse members for one group, this group benefits more from ensembling, leading to an increased disparity in the obtained error rates between groups which increases fairness violations" assumes that higher diversity always increases error disparities. For instance, we could assume that in some cases, a diverse ensemble could generalize better at test time for a disadvantaged group, reducing its error rate and narrowing the performance gap with the advantaged group, thus potentially improving fairness. Without formal analysis on these different scenarios, it remains unclear why diversity inherently leads to demographic disparities rather than potentially reducing them.
Despite this gap, the paper makes valuable contributions with strong empirical evidence. I've revised my score to 6, as the remaining questions represent opportunities for future work rather than critical flaws.
Thank you for your thoughtful engagement and valuable feedback throughout the rebuttal process!
Yes we fully agree that higher diversity for the disadvantaged group reduces the error gap, which we e.g. observe for FF (gender/age) in Table 1. However, our primary concern lies with the opposite scenario - higher diversity in the advantaged group - where the error gap increases. This is why we emphasized this case in our presentation. We will ensure the camera-ready version of the paper explicitly addresses both cases more comprehensively. Thank you for highlighting the need for this clarification.
We are also glad to hear that you found the alternative experimental design insightful and appreciate your support for our overall contribution!
Thank you very much for your valuable and thorough assessment of our work. We address below the weaknesses and questions that you raised in your review:
The reported differences in fairness metrics between models are often very small, making it difficult to assess the significance of the disparate benefits effect. Including the original results for individual models in Table 1 would help clarify these differences.
We tested the statistical significance of our results in Table 1 (bold numbers). Significant changes in fairness are often of the same magnitude as the changes in performance, which is the extent of what we can expect as an effect due to ensembling. In computer vision tasks with performances of around 90% accuracy, an improvement of 1-2% is considered substantial, especially due to something as simple as ensembling. We tried including the absolute numbers in Table 1 during the writeup, which was a visual nightmare and hard to comprehend for friendly readers we asked to provide feedback, thus we decided to move Table 2 and 3 to the appendix and refer to them in the caption of Table 1. To convey some information about the absolute numbers, we included the gray shading in Table 1 if the fairness violation of individual models and the ensemble is > 5%. Do you have a suggestion to more effectively communicate both the absolute values and the changes due to ensembling in a single table instead of referring to a second table?
W: The baseline selection process is unclear - while I assume the baseline represents the aggregated average across all individual models, the paper doesn't explicitly specify it (is it average results on all individual models / architectures and different seeds?). It should be clarified for proper comparison.
Q: The baseline selection process requires more clarification. Is it the average metrics over the 10 individual models on the 5 architectures and the different seeds? On page 6, at the end of the caption, it states “...and the average ensemble member.” Should this instead read “...and the average of individual members”?
The results in the main paper are presented for the ResNet50 architecture, which we state in line 215; comparisons to the other architectures are in the appendix (Sec. F.3 & F.4). For one task, we thus have 50 independently trained models, which we divide into 10 models for each of 5 seeds. The performance of individual models is computed as the mean and std over all 50 models (Table 3). The performance of the Deep Ensemble (consisting of 10 members) is computed as the mean and std over the 5 seeds (Table 2). Thank you for spotting this imprecise statement; you are correct, it should read "...and the average of individual members". We corrected this in the updated version of the paper.
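As a minimal sketch of this aggregation (array names, shapes and the random placeholder values are illustrative, not taken from the paper): for a given task and the ResNet50 architecture, one metric value per individual model and one per seed-level ensemble would be summarized as follows.

```python
import numpy as np

# Illustrative placeholders: metric values for 5 seeds x 10 members per seed,
# and one value per 10-member Deep Ensemble (one ensemble per seed).
member_metric = np.random.rand(5, 10)     # e.g. accuracy of each individual model
ensemble_metric = np.random.rand(5)       # e.g. accuracy of each 10-member ensemble

# "Average of individual members": mean and std over all 5 * 10 = 50 models.
member_mean, member_std = member_metric.mean(), member_metric.std()

# Deep Ensemble: mean and std over the 5 seed-level ensembles.
ensemble_mean, ensemble_std = ensemble_metric.mean(), ensemble_metric.std()
```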
W: The paper lacks intuitive explanations for the relationship between prediction diversity and fairness. While the average predictive diversity (DIV) shows strong differences between demographic groups, it's unclear why this specifically impacts fairness criteria.
Q: Could you provide an intuition on how diversity among ensemble members contributes to fairness violation? Why does high predictive diversity (DIV) across demographic groups could lead to increased disparate impact or affect equalized odds? I do not see the causation.
Thank you for raising this point; we added a more intuitive explanation in lines 401 - 403. In short, if the members are more diverse for one group, this group benefits more from ensembling, leading to an increased disparity in the obtained error rates between groups, which increases fairness violations. As elaborated in lines 370 - 371, only when ensemble members disagree in their predictions (i.e., are diverse) can the ensemble predict differently from, and improve upon, its members.
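The following self-contained toy example (ours, not from the paper) illustrates this mechanism: each individual member is equally accurate on both groups, but members err identically on one group and independently on the other, so majority voting only helps the group with predictive diversity and the TPR gap between groups grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_members = 10_000, 10

# All samples are positives, so TPR equals accuracy in this toy setup.
# Group A=0: all members share the same errors (no predictive diversity).
# Group A=1: members err independently (high predictive diversity).
# Every individual member is correct on ~80% of samples in both groups.
shared_correct = rng.random(n_samples) < 0.8
preds_a0 = np.tile(shared_correct, (n_members, 1)).astype(int)
preds_a1 = (rng.random((n_members, n_samples)) < 0.8).astype(int)

def member_vs_ensemble_tpr(preds):
    avg_member_tpr = preds.mean()                       # average individual member
    ensemble_tpr = (preds.mean(axis=0) >= 0.5).mean()   # majority-vote ensemble
    return avg_member_tpr, ensemble_tpr

print("A=0:", member_vs_ensemble_tpr(preds_a0))  # ~ (0.80, 0.80): no ensembling benefit
print("A=1:", member_vs_ensemble_tpr(preds_a1))  # ~ (0.80, 0.99): large benefit
```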
Are the calibration benefits of Deep Ensembles evenly distributed across the two different demographic groups?
Thank you for suggesting this additional investigation. While the calibration of individual models and of the Deep Ensemble is not equal across groups, the improvement in calibration due to ensembling is roughly equivalent. We added this analysis to Section F.7 in the appendix (Figure 30).
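For readers who want to run this kind of per-group check themselves, a minimal sketch of computing the expected calibration error (ECE) separately per protected group (equal-width binning; the paper's exact calibration metric and binning may differ):

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected calibration error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total_err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight of the bin times |empirical accuracy - mean confidence|
            total_err += in_bin.mean() * abs(correct[in_bin].mean()
                                             - confidences[in_bin].mean())
    return total_err

def ece_per_group(confidences, correct, a):
    """ECE computed separately for each value of the protected attribute a."""
    return {g: ece(confidences[a == g], correct[a == g]) for g in np.unique(a)}
```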
We thank all reviewers again for their time and effort to provide their valuable feedback to our work. In response to individual reviewers’ requests, we conducted additional experiments and added results and discussions to our manuscript as follows:
- Additional controlled experiment (NjdB): We conducted an additional controlled experiment to examine the relationship between the average predictive diversity and changes in accuracy, SPD, EOD and AOD due to ensembling in more detail. A detailed description of the setup and results is presented in Sec. F.1 in the appendix.
- Investigation of min-max fairness (HhJa): Our main experiments are based on well-established group fairness metrics that capture the difference in e.g. error rates between groups. We provide an additional investigation of the recently proposed notion of min-max fairness, which is complementary to the established difference-based group fairness metrics. In a nutshell, min-max fairness is concerned with the worst group's e.g. accuracy or error rates. Our additional experiments show that Deep Ensembles generally improve accuracy and error rates for both groups, including the worst group. Thus, there is no min-max fairness impairment due to Deep Ensembles; both groups improve. However, as our main results show, the groups do not benefit equally, which we denote as the disparate benefits effect. This effect often amplifies discrepancies between groups, further exacerbating unfairness according to the well-established group fairness metrics we consider in our work.
- Calibration per protected group attribute (HhJa): We provide a more detailed analysis of the calibration of Deep Ensembles in Fig.6a, subdividing per group attribute. A discussion is added in Sec.F.7, with the results shown in Fig.30. We observe that while individual members and the Deep Ensemble sometimes have different levels of calibration for the two groups, Deep Ensembles are generally more calibrated than individual members for any given group.
For easier reference, we highlighted new or strongly altered sections in brown in the revised version of our manuscript.
The submission studies the disparate impact of ensemble models on different groups, focusing on statistical parity and equality of opportunity. The main conclusion is that ensemble methods disproportionately benefit the majority group, and thus exacerbate the disparity between the majority and the minority groups. The authors also show that post-processing techniques are effective in mitigating the unfairness of ensemble models.
The paper has received mixed reviews. The main concern is the lack of theoretical analysis, i.e., why do ensemble methods tend to favor the majority groups? Despite the positive-leaning ratings, during the AC-reviewer discussion, the majority of the reviewers expressed concerns about the theoretical grounding of the work. The finding of using post-processing to improve fairness is not new either, given the recent line of work in the literature on algorithmic fairness, both theoretically and empirically [1-2]. The authors should discuss and ideally adopt these methods into the empirical studies as well.
[1] Unprocessing Seven Years of Algorithmic Fairness
[2] Fair and Optimal Classification via Post-Processing
I strongly urge the authors to take the reviewers' suggestion and the above comments when preparing for the next submission.
Additional Comments from Reviewer Discussion
The ratings are mixed and the paper is borderline. Despite the seemingly positive-leaning ratings, the comments from reviewers during the AC-reviewer discussion period are slightly leaning towards the negative side. I took a read of the paper as well and provided some further comments to help the authors strengthen their work, in light of recent literature on algorithmic fairness.
Reject