PaperHub
7.0/10
Poster · 4 reviewers
Lowest: 6 · Highest: 8 · Std. dev.: 1.0
Ratings: 6, 8, 6, 8
Confidence: 3.8
ICLR 2024

FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-04-21
TL;DR

We curate and release the first large-scale fairness dataset for medical segmentation.

Abstract

Keywords

Medical Segmentation · Medical Imaging · Fairness Learning · Health Equity · Deep Learning · Trustworthy AI

Reviews and Discussion

Review

Rating: 6

This paper proposes a dataset for retinal disc/cup segmentation with several pre-defined attributes, which should be useful for studying fairness problems in the medical domain. Furthermore, the authors set a baseline for the problem and define evaluation metrics for this scenario. Overall, this work is sound and meaningful.

Strengths

[1] Providing a dataset for fairness-related research is meaningful for the current community, along with its baseline and evaluation setting.

[2] Good writing and clear motivation

Weaknesses

[1] ICLR might not be the best place for this paper. Other medical journals or conferences would be more suitable.

[2] There are many ways to evaluate the fairness problem, and the selected metrics might not be the most suitable ones. Please elaborate on the motivation behind the baseline setting and evaluation.

[3] Some current works should be included to make the experiments sufficient. See: FairAdaBN: Mitigating unfairness with adaptive batch normalization and its application to dermatological disease classification

[4] Since most of the attributes are only for the patient level, why use the pixel-wise weights?

Questions

See the above weaknesses

Details of Ethics Concerns

This work proposed a retinal dataset with several attributes, which should be further checked from the ethics view.

Comment

Thank you so much for your review and the insightful comments.

ICLR might not be the best place for this paper. Other medical journals or conferences would be more suitable.

Our primary contribution centers on fairness in machine learning, rather than on medical imaging itself. We regard medical imaging as a crucial area for studying fairness, given its human-centric nature and the potentially severe consequences of unfairness in medical deep learning systems in real-world scenarios.

There are many evaluation ways to assess the fairness problem. The selected metrics might not be the most suitable one. Please elaborate more on the motivation of baseline setting and evaluation.

We introduced a novel metric for assessing fairness in medical segmentation, called Equity-Scaled Segmentation Performance (ESSP). ESSP offers a more direct and clinician-friendly evaluation than traditional fairness metrics such as DPD and DEOdds, which can overlook overall performance: under them, a model with uniformly lower performance across all demographic groups could falsely appear fairer. This misalignment is particularly problematic in safety-critical medical applications, which demand high accuracy. In contrast, our proposed ESSP addresses this issue by evaluating not only the disparity across demographic groups but also the extent to which overall segmentation accuracy is sacrificed to achieve a fairer model. A higher ESSP score indicates a model that achieves both fairness and high accuracy.
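For intuition, here is a minimal sketch of one plausible equity-scaled formulation (the exact definition is in the paper; in this hypothetical form, written by us for illustration, the overall score is shrunk by the total absolute deviation of the group scores):

```python
def equity_scaled_score(overall, group_scores):
    """Shrink an overall performance score (e.g. Dice) by the total
    absolute deviation of each demographic group's score from it.
    Uniform performance across groups keeps the score intact; large
    disparities reduce it."""
    deviation = sum(abs(overall - s) for s in group_scores.values())
    return overall / (1.0 + deviation)

# Unlike disparity-only metrics, a uniformly weak model does not
# outrank an accurate model with small group gaps:
uniform_low = equity_scaled_score(0.70, {"Asian": 0.70, "Black": 0.70, "White": 0.70})
uneven_high = equity_scaled_score(0.90, {"Asian": 0.88, "Black": 0.85, "White": 0.93})
# uniform_low = 0.70, uneven_high ≈ 0.818
```

Under a pure disparity metric the first model would look perfectly fair; the equity-scaled form still ranks the accurate model higher while penalizing its group gaps.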

Alongside ESSP, following the reviewer's suggestion, we have also added DPD and DEOdds results in the supplementary material, providing a comprehensive fairness assessment of various segmentation approaches. As the table below illustrates, our FEBS (Fair Error-Bound Scaling) approach achieves comparable or better DPD and DEOdds performance than other segmentation methods, highlighting its effectiveness in achieving fairer outcomes in medical imaging.

| Target | Method | Overall DPD | Overall DEOdds |
|---|---|---|---|
| Cup | SAMed | 0.0085 | 0.0196 |
| Cup | SAMed+ADV | 0.0079 | 0.0071 |
| Cup | SAMed+GroupDRO | 0.0078 | 0.0154 |
| Cup | Ours (SAMed) | 0.0079 | 0.0020 |
| Cup | TransUNet | 0.0083 | 0.0430 |
| Cup | TransUNet+ADV | 0.0074 | 0.0389 |
| Cup | TransUNet+GroupDRO | 0.0081 | 0.0317 |
| Cup | Ours (TransUNet) | 0.0085 | 0.0492 |
| Rim | SAMed | 0.0005 | 0.0670 |
| Rim | SAMed+ADV | 0.0004 | 0.0657 |
| Rim | SAMed+GroupDRO | 0.0009 | 0.0792 |
| Rim | Ours (SAMed) | 0.0002 | 0.0715 |
| Rim | TransUNet | 0.0014 | 0.0877 |
| Rim | TransUNet+ADV | 0.0025 | 0.0708 |
| Rim | TransUNet+GroupDRO | 0.0018 | 0.0699 |
| Rim | Ours (TransUNet) | 0.0014 | 0.0822 |

Some current works should be included to make the experiments sufficient. See: FairAdaBN: Mitigating unfairness with adaptive batch normalization and its application to dermatological disease classification.

FairAdaBN was originally proposed for classification tasks, not segmentation, and it requires modifying the backbone architecture of classification models. With the advent of recent large-scale segmentation models such as Meta's SAM, however, optimal performance depends on fine-tuning from pre-trained weights. Altering the architecture, as FairAdaBN does, might prevent loading these pre-trained parameters, which are learned from extensive datasets, potentially leading to lower segmentation accuracy. Additionally, FairAdaBN was initially applied to a ResNet, and adapting it to Transformer-based segmentation models presents further challenges. We therefore suggest that further investigation is needed before applying FairAdaBN to segmentation tasks. We have cited and discussed FairAdaBN in our related work section, and we aim to integrate the FairAdaBN approach into our future research.

Since most of the attributes are only for the patient level, why use the pixel-wise weights?

Patients from various demographic groups may exhibit different anatomical characteristics. For example, Black people often have a larger cup-to-disc ratio and cup area compared to other races. These anatomical differences can influence segmentation accuracy. To address this, we employ pixel-wise weights to accommodate these underlying anatomical variations within the fundus images.
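To make the mechanism concrete, here is a rough, hypothetical sketch of loss rescaling by per-group error bounds. This is our illustration, not the paper's exact formulation: the upper error-bound of each identity group is estimated here simply as that group's worst per-sample Dice loss in the batch.

```python
import numpy as np

def group_scaled_dice_loss(pred, target, groups, eps=1e-6):
    """Sketch of fair error-bound scaling (illustrative, not the paper's
    exact loss): compute a per-sample soft Dice loss, estimate each
    identity group's upper error bound as its largest loss in the batch,
    and reweight samples so groups with larger bounds contribute more.
    pred, target: (N, H, W) arrays in [0, 1]; groups: length-N group ids."""
    inter = (pred * target).sum(axis=(1, 2))
    denom = pred.sum(axis=(1, 2)) + target.sum(axis=(1, 2))
    losses = 1.0 - (2.0 * inter + eps) / (denom + eps)   # per-sample Dice loss
    bounds = {g: losses[groups == g].max() for g in np.unique(groups)}
    weights = np.array([bounds[g] for g in groups])
    weights = weights / (weights.sum() + eps)            # normalize to sum to 1
    return float((weights * losses).sum())
```

In this toy form, a group whose worst-case loss is higher receives proportionally more weight, steering optimization toward the disadvantaged group.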

Review

Rating: 8

This paper proposed a fundus image dataset for benchmarking the fairness of medical image segmentation methods, which is the first dataset and benchmark in this field. The authors also proposed to rescale the loss function with the upper training error-bound of each identity group to tackle the fairness issue.

Strengths

  • Novel Dataset: The paper introduced FairSeg, a new dataset for medical image segmentation with a focus on fairness. The creation of such a dataset is valuable as it addresses a gap in the current availability of medical datasets with fairness considerations.

  • Fairness-Oriented Methodology: The authors proposed a fair error-bound scaling approach and an equity scaling metric. These methods represent an advanced effort to integrate fairness directly into the model training process, which could lead to more equitable healthcare outcomes.

  • Open Access: I like that the author released the dataset and code for reproducibility and further research, which is a strong aspect of this work.

Weaknesses

Questions

  • The dataset was released as npz format. Could you please also release the original format?

  • It would be great if you could release the trained models as well.

  • Where do you plan to host this benchmark? CodaLab could be a good platform.

Comment

We thank the reviewer for the positive and encouraging review.

Please add NSD, which is suggested by Metrics Reloaded.

Dice and IoU are commonly used in previous research [1,2,3]. For the sake of future comparison and completeness, we will keep Dice and IoU in the table and add NSD as an additional metric. Furthermore, we have included the NSD results for race in the table below. The table indicates that unfairness commonly exists in our proposed fair segmentation task, which further strengthens the contribution of our FairSeg dataset. We have added these results to the supplementary material.

| Target | Method | Overall NSD↑ | Asian NSD↑ | Black NSD↑ | White NSD↑ |
|---|---|---|---|---|---|
| Cup | SAMed | 0.7222 | 0.6825 | 0.6871 | 0.7338 |
| Cup | SAMed+ADV | 0.7405 | 0.7045 | 0.6954 | 0.7537 |
| Cup | SAMed+GroupDRO | 0.7415 | 0.7093 | 0.6942 | 0.7547 |
| Cup | Ours (SAMed) | 0.7399 | 0.7038 | 0.7015 | 0.7518 |
| Cup | TransUNet | 0.9314 | 0.8845 | 0.8908 | 0.9449 |
| Cup | TransUNet+ADV | 0.9208 | 0.8819 | 0.8817 | 0.9331 |
| Cup | TransUNet+GroupDRO | 0.9285 | 0.8768 | 0.8873 | 0.9426 |
| Cup | Ours (TransUNet) | 0.9262 | 0.8796 | 0.8850 | 0.9397 |
| Rim | SAMed | 0.7483 | 0.7313 | 0.7215 | 0.7556 |
| Rim | SAMed+ADV | 0.8063 | 0.7694 | 0.7686 | 0.8182 |
| Rim | SAMed+GroupDRO | 0.8078 | 0.7760 | 0.7696 | 0.8191 |
| Rim | Ours (SAMed) | 0.8043 | 0.7732 | 0.7704 | 0.8146 |
| Rim | TransUNet | 0.9601 | 0.9326 | 0.9326 | 0.9688 |
| Rim | TransUNet+ADV | 0.9554 | 0.9245 | 0.9221 | 0.9656 |
| Rim | TransUNet+GroupDRO | 0.9596 | 0.9322 | 0.9307 | 0.9685 |
| Rim | Ours (TransUNet) | 0.9581 | 0.9306 | 0.9315 | 0.9666 |
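As background on the metric itself, NSD at a pixel tolerance can be sketched as the fraction of boundary pixels of each mask lying within the tolerance of the other mask's boundary. The following is a simplified 2-D illustration (the function names are ours); vetted implementations such as those recommended by Metrics Reloaded should be preferred in practice:

```python
import numpy as np

def boundary_points(mask):
    """Coordinates of foreground pixels that have at least one background
    4-neighbour: a simple boundary definition for a 2-D binary mask."""
    m = mask.astype(bool)
    p = np.pad(m, 1)
    interior = (p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:])
    return np.argwhere(m & ~interior)

def nsd(pred, gt, tol=1.0):
    """Normalized Surface Distance sketch: the fraction of boundary pixels
    of each mask lying within `tol` pixels of the other mask's boundary."""
    bp, bg = boundary_points(pred), boundary_points(gt)
    if len(bp) == 0 or len(bg) == 0:
        return 0.0
    d = np.linalg.norm(bp[:, None, :] - bg[None, :, :], axis=-1)  # pairwise dists
    hits = (d.min(axis=1) <= tol).sum() + (d.min(axis=0) <= tol).sum()
    return hits / (len(bp) + len(bg))
```

A perfect prediction scores 1.0; masks whose boundaries drift apart by more than the tolerance score proportionally lower.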

Add nnUNet.

We have included nnUNet in our segmentation benchmarks, which are detailed in the supplementary material. The table provided below presents results focusing on racial disparities. From the table, it is evident that nnUNet demonstrates marginally better performance compared to SAMed and TransUNet. However, it still exhibits significant performance disparities across different racial groups. This suggests that algorithmic unfairness is a pervasive issue in our proposed cup-disc segmentation, regardless of the choice of segmentation architectures.

| Target | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Asian Dice↑ | Black Dice↑ | White Dice↑ | Asian IoU↑ | Black IoU↑ | White IoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cup | nnUNet | 0.8625 | 0.8710 | 0.7704 | 0.7865 | 0.8675 | 0.8855 | 0.8697 | 0.7639 | 0.8035 | 0.7725 |
| Rim | nnUNet | 0.8003 | 0.8335 | 0.6959 | 0.7231 | 0.7930 | 0.7682 | 0.8490 | 0.6854 | 0.6639 | 0.7397 |

The dataset was released as npz format. Could you please also release the original format?

We will release the dataset in its original format in addition to the existing npz format.

It would be great if you could release the trained models as well.

The trained checkpoints of our models have been released through our GitHub repository.

Where do you plan to host this benchmark? CodaLab could be a good platform.

We will co-host our benchmark/dataset using both Google Drive and CodaLab.

Reference:

[1] Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).

[2] Zhang, Kaidong, and Dong Liu. "Customized segment anything model for medical image segmentation." arXiv preprint arXiv:2304.13785 (2023).

[3] Chen, Jieneng, et al. "Transunet: Transformers make strong encoders for medical image segmentation." arXiv preprint arXiv:2102.04306 (2021).

Review

Rating: 6

In this work, the authors introduced the new FairSeg dataset, designed to address fairness concerns in the domain of medical segmentation. Their methodology centers on a fair error-bound scaling technique, which recalibrates the loss function by considering the upper error-bound within each identity group. They also designed a new equity-scaled segmentation performance metric to facilitate fair comparisons between different fairness learning models for medical segmentation. Extensive experimentation underscores the efficacy of the fair error-bound scaling approach, demonstrating superior or comparable fairness performance relative to state-of-the-art fairness learning models. The related dataset and code are both made publicly accessible by the authors.

Strengths

  • The paper is well-written and easy to follow.
  • The proposed framework is technically sound.
  • The experiments are comprehensive.

Weaknesses

There is no visualization comparison between different methods.

Questions

  1. In equation (1), a parenthesis is missing in the formula.
  2. The authors proposed a new Dice loss with a novel Fair Error-Bound Scaling mechanism; however, there are no experiment results showing the differences between the new Dice loss and the common one.

Comment

Thank you very much for your review and the insightful comments.

There is no visualization comparison between different methods.

We have added a visualization comparison of segmentation results between our method and other competing approaches.

In equation (1), a parenthesis is missing in the formula.

We have addressed the formatting issue in the paper.

The authors proposed a new Dice loss with a novel Fair Error-Bound Scaling mechanism; however, there are no experiment results showing the differences between the new Dice loss and the common one.

As mentioned in Section 6.2, “Training and Implementation Details”, the original SAMed and TransUNet are trained using cross entropy and the common Dice loss. Hence, in Tables 1-5, the comparisons between SAMed and Ours (SAMed), and between TransUNet and Ours (TransUNet), contrast the newly proposed Dice loss with the common one.

Review

Rating: 8

This paper proposes a publicly available medical fairness segmentation dataset (FairSeg) that contains 10,000 subject samples of 2D SLO Fundus images. The paper also proposes equity-scaled segmentation performance metrics to facilitate fair comparisons.

Strengths

  1. The fairness concern is an important topic, especially for medical images, and the lack of a segmentation dataset is a significant issue. The motivation for the proposed dataset is strong.

  2. The dataset contains a large number of segmentation ground truths (10,000) and is well evaluated by the authors with several SOTA learning algorithms.

  3. As described by the authors, the segmentation seems to undergo a rigorous process including a hand-graded annotation by a panel of five medical professionals after initial registration.

Weaknesses

  1. The accuracy of NiftyReg needs to be investigated, since it might not be the SOTA for image registration.

Questions

  1. Why is a validation set not constructed/used in selecting models during training?

  2. It would be helpful to report Hausdorff distance and average surface distance along with Dice to better evaluate the methods.

  3. The details of how the standard deviation is computed need to be elaborated. Is it computed across the mean for each group?

  4. How is the training/testing split performed? Is it just randomly sampled without considering sensitive attributes at the patient level?

  5. It would be helpful to discuss the importance of registration in preprocessing using NiftyReg.

Comment

Thank you very much for the supportive comments and valuable suggestions!

The accuracy of NiftyReg needs to be investigated, since it might not be the SOTA for image registration.

In Section 4, we mentioned that “It’s noteworthy that this registration operation demonstrates considerable precision in real-world scenarios, as evident from our empirical observations that highlight an approximate 80% success rate in registrations.” During experiments, we compared NiftyReg against a SOTA deep-learning-based approach [1] and a retinal-image-based registration method [2], and observed that NiftyReg is more robust than the other registration methods. Upon further analysis, we calculated that NiftyReg achieves an accuracy of roughly 82.4% in registration tasks. All failed registration cases were excluded by five professional clinician graders. Although there is a failure rate of about 20%, we are committed to ongoing exploration of state-of-the-art (SOTA) registration tools, including contemporary deep-learning-based methods. Our aim is to enhance registration accuracy and, consequently, to release more images in our datasets in future work.

Why is a validation set not constructed/used in selecting models during training?

Our models are selected based on the last epoch of training. Fine-tuning foundation models like SAM is computationally expensive for both training and inference, which makes computing validation accuracy every few epochs and selecting checkpoints based on it infeasible.

To better facilitate future research, we have released an extra 500 images as the validation set in our codebase and dataset.

Report Hausdorff distance and average surface distance along with Dice.

Please see the table below for Hausdorff distance (HD95) and average surface distance (ASD) for race. We have included these metrics across all sensitive attributes in the supplementary material. From the table below, we observe that the disparity between different demographic groups exists regardless of the evaluation metric, which further strengthens the significance of our proposed FairSeg dataset.

| Target | Method | Overall HD95↓ | Asian HD95↓ | Black HD95↓ | White HD95↓ | Overall ASD↓ | Asian ASD↓ | Black ASD↓ | White ASD↓ |
|---|---|---|---|---|---|---|---|---|---|
| Cup | SAMed | 9.6231 | 11.0005 | 11.0142 | 9.1866 | 3.9650 | 4.6765 | 4.7743 | 3.7209 |
| Cup | SAMed+ADV | 9.4594 | 11.0170 | 11.1193 | 8.9479 | 4.0123 | 4.8623 | 4.9236 | 3.7321 |
| Cup | SAMed+GroupDRO | 9.4633 | 11.0891 | 11.0449 | 8.9603 | 3.9494 | 4.6462 | 4.8634 | 3.6855 |
| Cup | Ours (SAMed) | 9.4494 | 10.8376 | 11.1637 | 8.9453 | 3.9288 | 4.7042 | 4.8210 | 3.6606 |
| Cup | TransUNet | 4.7577 | 5.6983 | 5.5411 | 4.4937 | 2.0552 | 2.3692 | 2.4573 | 1.9382 |
| Cup | TransUNet+ADV | 5.0157 | 5.9379 | 5.8292 | 4.7476 | 2.1319 | 2.4693 | 2.4978 | 2.0198 |
| Cup | TransUNet+GroupDRO | 4.8195 | 5.6424 | 5.6753 | 4.5535 | 2.0615 | 2.3637 | 2.4952 | 1.9393 |
| Cup | Ours (TransUNet) | 4.9603 | 5.8818 | 5.7398 | 4.6992 | 2.1441 | 2.4641 | 2.5445 | 2.0268 |
| Rim | SAMed | 9.9379 | 11.4859 | 11.5671 | 9.4337 | 3.9353 | 4.5013 | 4.5473 | 3.7476 |
| Rim | SAMed+ADV | 8.8316 | 10.5364 | 10.2364 | 8.3562 | 3.3523 | 3.9946 | 3.8975 | 3.1700 |
| Rim | SAMed+GroupDRO | 8.7609 | 10.4638 | 10.0592 | 8.3076 | 3.3473 | 3.8345 | 3.9499 | 3.1702 |
| Rim | Ours (SAMed) | 8.7930 | 10.3167 | 10.3407 | 8.3082 | 3.4174 | 4.0053 | 4.0608 | 3.2208 |
| Rim | TransUNet | 4.4404 | 5.4496 | 5.0893 | 4.1964 | 1.7534 | 1.9643 | 1.9941 | 1.6809 |
| Rim | TransUNet+ADV | 4.6301 | 5.6065 | 5.4390 | 4.3569 | 1.8110 | 2.1085 | 2.0885 | 1.7213 |
| Rim | TransUNet+GroupDRO | 4.5268 | 5.4243 | 5.2290 | 4.2842 | 1.7498 | 1.9683 | 2.0017 | 1.6742 |
| Rim | Ours (TransUNet) | 4.5311 | 5.5251 | 5.2808 | 4.2682 | 1.8321 | 2.0642 | 2.0771 | 1.7564 |
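For reference, HD95 and ASD can both be derived from the same symmetric boundary-distance computation. The following is a simplified 2-D sketch for illustration only (the function names are ours, not a library API):

```python
import numpy as np

def surface_distances(a, b):
    """Distances from every boundary pixel of mask `a` to the nearest
    boundary pixel of mask `b`, and vice versa (2-D binary masks)."""
    def boundary(m):
        m = m.astype(bool)
        p = np.pad(m, 1)
        interior = (p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:])
        return np.argwhere(m & ~interior)
    pa, pb = boundary(a), boundary(b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)  # pairwise dists
    return d.min(axis=1), d.min(axis=0)

def hd95_and_asd(a, b):
    """HD95 is the 95th percentile of the pooled symmetric surface
    distances (robust to a few outlier boundary pixels); ASD is their mean."""
    da, db = surface_distances(a, b)
    pooled = np.concatenate([da, db])
    return float(np.percentile(pooled, 95)), float(pooled.mean())
```

Taking the 95th percentile instead of the maximum is what makes HD95 less sensitive to a single stray boundary pixel than the plain Hausdorff distance.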

Comment

The details of how standard deviation is computed need to be elaborated. Is it computed across the mean of for each group?

The computational cost of multiple runs with large segmentation foundation models is high. Previous studies [3, 4] typically evaluate these models only once, without standard deviation, rather than conducting multiple runs of evaluations.

How is the training/testing split performed? Is it just randomly sampled without considering sensitive attributes at patient level?

The training and testing splits are randomly sampled, and the sample distribution of the different sensitive attributes reflects the real-world clinical patient distribution.

It would be helpful to discuss the importance of registration in preprocessing using NiftyReg.

As mentioned above and in Section 4/Figure 1, obtaining a large-scale fundus dataset with high-quality pixel-wise annotations requires using an OCT machine to generate OCT fundus images and their corresponding cup-disc masks. However, OCT machines are fairly expensive and less prevalent in primary care; we therefore propose to migrate these annotations from 3D OCT to 2D SLO fundus images, for potentially broader impact in early-stage glaucoma screening in primary care settings. To transfer the annotations from 3D OCT fundus images to 2D SLO fundus images, we register the two imaging modalities by matching characteristic features between the two fundus images of the same patient. The computed alignment matrix is then applied to the disc-cup masks of the OCT fundus images, aligning them to the SLO fundus images.
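Conceptually, applying the computed alignment matrix to a mask is an inverse affine warp with nearest-neighbour sampling, so the labels stay binary. The sketch below is hypothetical (NiftyReg's own resampling tools handle this in practice) and assumes the convention `target = A @ source + t` in (y, x) pixel coordinates:

```python
import numpy as np

def warp_mask(mask, affine):
    """Transfer a label mask into a registered image's frame by inverse
    warping: for each target pixel, map back through the inverse of the
    2x3 affine alignment matrix and take the nearest source label, which
    avoids interpolation artifacts on binary masks."""
    h, w = mask.shape
    A, t = affine[:, :2], affine[:, 2]
    Ainv = np.linalg.inv(A)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    src = (coords - t) @ Ainv.T          # inverse mapping to source coords
    src = np.rint(src).astype(int)       # nearest-neighbour sampling
    ok = ((src[:, 0] >= 0) & (src[:, 0] < h)
          & (src[:, 1] >= 0) & (src[:, 1] < w))
    out = np.zeros(h * w, dtype=mask.dtype)
    out[ok] = mask[src[ok, 0], src[ok, 1]]
    return out.reshape(h, w)
```

With the identity matrix the mask is returned unchanged; a translation in the matrix shifts the mask by the same amount, with out-of-range pixels filled as background.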

Reference:

[1] Hoopes, Andrew, et al. "Hypermorph: Amortized hyperparameter learning for image registration." Information Processing in Medical Imaging: 27th International Conference, IPMI 2021, Virtual Event, June 28–June 30, 2021, Proceedings 27. Springer International Publishing, 2021.

[2] https://github.com/tobiaselze/oct_fundus_registration/tree/main

[3] Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).

[4] Zhang, Kaidong, and Dong Liu. "Customized segment anything model for medical image segmentation." arXiv preprint arXiv:2304.13785 (2023).

Comment

Dear AC and Reviewers,

We are truly grateful for the time and effort you have invested in reviewing our paper. Your constructive feedback has been invaluable in enhancing the quality of our work. We are particularly appreciative of your recognition of the importance of the new task (FairSeg) we have proposed and our comprehensive experiments.

In response to your comments and suggestions, we have made several significant updates to our manuscript, which we summarize as follows:

  1. Inclusion of More Segmentation Backbones: In our revised version, we have incorporated nnUNet to further demonstrate that algorithmic unfairness is a prevalent issue in cup-disc segmentation tasks, regardless of the segmentation architecture employed.

  2. Expansion of Evaluation Metrics: To provide a more thorough assessment of segmentation accuracy, we have added Normalized Surface Distance (NSD), Hausdorff Distance (HD95), and Average Surface Distance (ASD). Additionally, to better evaluate fairness, we have included the DPD and DEodds metrics.

  3. Enhanced Clarity in Writing: We have made concerted efforts to clarify areas of the manuscript that reviewers found unclear or sought more information about. These amendments and additions are reflected in our revised submission.

We hope that these updates effectively address the concerns raised by the reviewers and clarify any ambiguities. We are eager to engage further and provide any additional information that may be required.

Sincerely,

The Authors

AC Meta-Review

This paper introduces the FairSeg dataset, a new benchmark to address fairness concerns in medical segmentation, garnering unanimous and consistent acceptance with scores of 6, 6, 8, and 8 across reviews. The common strengths highlighted in the four reviews center on the significant contribution of novel datasets tailored for medical image segmentation while integrating considerations for fairness. Particularly commendable is the technical soundness demonstrated through innovative methodologies, notably the introduction of a fair error-bound scaling technique that recalibrates the loss function by considering the upper error-bound within each identity group, complemented by the introduction of novel equity-scaled segmentation metrics facilitating fair model comparisons. The paper is appraised for its well-written presentation, technical rigor, and comprehensive experiments. Additionally, its commitment to open access for both dataset and code enhances reproducibility and fosters further research. Identified weaknesses include recurring criticisms related to the choice of evaluation metrics, with reviewers advocating for a more thorough exploration of alternatives and a clearer rationale behind baseline setting and evaluation choices. The absence of references to current works is also noted.

Why Not a Higher Score

As highlighted by the reviewers, although the paper addresses a critical medical benchmark with potential in the medical field, its broader scientific value, relevance, and significance to the diverse audience of ICLR may be somewhat constrained.

Why Not a Lower Score

This paper presents the FairSeg dataset, serving as a novel benchmark to tackle fairness concerns in medical segmentation. It achieves unanimous and consistent acceptance, earning scores of 6, 6, 8, and 8 across reviews. The merits are clearly evident: the paper is technically robust and garners praise from all reviewers.

Final Decision

Accept (poster)