PaperHub
Rating: 5.3 / 10
Decision: Rejected (4 reviewers)
Individual ratings: 5, 6, 5, 5 (min 5, max 6, std 0.4)
Confidence: 4.0
Correctness: 2.5
Contribution: 1.8
Presentation: 2.8
ICLR 2025

Say My Name: a Model's Bias Discovery Framework

Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

Unsupervisedly, we find biases in models, and we name them.

Abstract

Keywords
bias discovery, unsupervised debiasing

Reviews and Discussion

Review (Rating: 5)

The authors introduce the "Say My Name" pipeline, a method designed to identify and interpret biases learned by image classification models. The pipeline consists of five main steps:

  1. Selection of Representative Subset: Selecting a subset of images that are representative of the biases learned by the model.
  2. Captioning with Vision-Language Model: Using a vision-language model (VLM) to generate captions for the selected images.
  3. Keyword Extraction: Selecting keywords that are common across captions within the same class.
  4. Embedding Computation: Computing embeddings of the text descriptions for each class.
  5. Keyword Ranking: Comparing the results from steps 3 and 4 to rank the top keywords associated with each class.

The authors apply their method to several datasets, including Waterbirds, CelebA, BAR, and ImageNet-A. They demonstrate that their approach can effectively identify meaningful textual descriptions of biases present in the models.
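To make steps 3-5 concrete, below is a minimal sketch in Python of how such a keyword-ranking stage could look; the 15% frequency threshold, the `text_encoder` interface, and the mean-pooled class embedding are illustrative assumptions, not the authors' actual implementation (steps 1-2, subset selection and VLM captioning, are assumed to have already produced `captions` for one class).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_bias_keywords(captions, text_encoder, min_freq=0.15, top_k=10):
    """Illustrative version of steps 3-5 for the captions of a single class.
    `text_encoder` maps a string to a 1-D numpy embedding (e.g. a CLIP-style text tower)."""
    # Step 3: keep keywords that recur in a sizeable fraction of the class captions
    words = {w.lower().strip(".,") for c in captions for w in c.split()}
    keywords = [w for w in words
                if sum(w in c.lower() for c in captions) / len(captions) >= min_freq]
    # Step 4: class-level embedding as the mean of the caption embeddings
    class_emb = np.mean([text_encoder(c) for c in captions], axis=0)
    # Step 5: rank surviving keywords by cosine similarity to the class embedding
    scored = [(kw, cosine(text_encoder(kw), class_emb)) for kw in keywords]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]
```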

Strengths

  1. Clarity and Readability: The paper is well-written and easy to understand. The motivation behind the work is clearly explained.
  2. Methodological Breakdown: In Section 3.2, the authors provide a detailed breakdown of the five-step pipeline, with clear links to each subsection. This organization enhances the readability and comprehension of the methodology.
  3. Practical Utility: Interpreting biases in machine learning models is crucial. The proposed method appears straightforward to implement and could be readily applied to various image classification tasks.

Weaknesses

  1. Lack of Novelty in Bias Identification: The method for bias identification seems to be a straightforward application of existing techniques for extracting textual descriptors from images. Previous work has already explored the connection between classification errors and biases. The proposed pipeline essentially relies on using a vision-language model to caption images and then summarizing or ranking features. It's unclear how the authors' approach offers a significant technical contribution beyond existing methods or how it compares to simpler approaches, such as directly using a VLM followed by summarization with a large language model (LLM).
  2. Interpretability and Quantification of Results: The extracted rankings of keywords are difficult to quantify and interpret. For example, in Figure 4, the top features for different classes in the BAR dataset have scores ranging from 0.4 to 0.55, associating "climbing" with terms like "cliff," "rock," "rocks," and "steep." However, it's unclear whether these terms represent biases or causal features. There is a lack of empirical analysis to validate whether humans agree that these are indeed biases and how these scores should be interpreted or used in practice.
  3. Focus and Relevance of Bias Mitigation Study: The inclusion of bias mitigation using the identified descriptors seems somewhat tangential to the main focus of the paper. Previous work has shown that, for datasets like CelebA and Waterbirds, extracting spurious attributes can improve performance. As such, the contribution in this area appears limited and may distract from the primary contributions of the paper.

Questions

Please see the weaknesses section for the main questions.

One missing reference (https://arxiv.org/abs/2204.13749), where the authors learned the unbiased-biased split directly from training.

Ethics Concerns

There is a need to ensure that the bias identification methods used do not inadvertently mislabel causal features as biases, which could lead to misunderstandings or exacerbate existing biases.

Comment

We thank the reviewer for their feedback and the provided reference that will be integrated and discussed in the paper. We provide a response to each raised point here.

[W1 - Lack of novelty in bias identification] While the connection between classification errors and biases is indeed known, the goal of our paper is to provide a human-understandable description of the biases. To the best of our knowledge, few works have attempted to tackle this task, most notably B2T by Kim et al. (see general response for a comparison). Our proposed method is notably different from just employing a VLM followed by an LLM, as it entails a dedicated bias-mining strategy. Thus, in our work, we propose:

  • A refined bias-mining approach for relevant sample selection.
  • A state-of-the-art keyword extraction pipeline that, in contrast to previous methods, does not require extensive validation sets. Our method is also more robust than existing methods such as B2T in many cases (see general response).

[W2 - Interpretability and Quantification of Results] The Biased Action Recognition (BAR) dataset is a widely used benchmark for debiasing methods. It has been specifically constructed to introduce correlations between action and environment, and thus it represents an ideal benchmark for our method. Whether causal or not, BAR has been introduced in (Nam et al., 2020) so that 95% of training samples show these correlations and only a small portion (5%) does not. In this sense, the community treats these as biases. For example, climbing can be performed in a non-natural environment, diving can be done in a swimming pool, and so on.

Additionally, our method SaMyNa is a tool for end users, allowing them to inspect correlations between concepts in the data. Whether a correlation is causal or a bias is a choice that can, and should, be delegated to the final users themselves.

[W3 - Focus and Relevance of Bias Mitigation Study] The inclusion of bias mitigation methods is indeed not the central contribution of our work, but it serves as empirical validation that the labels extracted by SaMyNa represent actual biases in the data. In fact, we do not introduce any novel debiasing technique, but we rely on standard GroupDRO to verify our claim, achieving results competitive with SOTA methods. For a more detailed comparison between semantic bias extraction pipelines, we invite the reviewer to refer to the general comment.

Comment

After reading the authors' reply as well as the feedback from other reviewers, I am going to keep my original rating of 5.

Comment

Thank you for taking the time to review our work. We are happy to engage in discussion to clarify some points and to enrich and improve the paper following your suggestions. In this sense, we would be happy to hear more about the reasons behind your evaluation, as we believe we are addressing all the concerns raised about our work.

Review (Rating: 6)

The authors introduce a framework for detecting and captioning semantic biases of deep learning vision models. The authors propose a tool that identifies biases learned by models and assigns human-interpretable semantic labels to these biases for explainability and debiasing. The method operates by sample subset selection, sample captioning via an MLLM, keyword selection via a text encoder, extraction of learned class embeddings, and keyword ranking. The authors test the framework on popular benchmark datasets. The proposed method successfully identified biases. Also, this discovery can be used with bias mitigation methods, effectively debiasing models.

Strengths

  • The paper tackles an important problem in machine learning, namely bias and spurious correlations, and proposes an effective tool to analyse these biases from a human standpoint.

Weaknesses

  • Experimental analysis on bias discovery is lackluster. I think correlation analyses between the proposed method and human annotations are needed.

  • The efficacy of the method could depend heavily on the model type and alignment of the MLLM or text encoder. I believe there should be an experimental analysis to show the robustness of the method on this matter.

Questions

  • The proposed method does not use a validation set. How are the hyperparameters of its various components tuned?
Comment

We thank the reviewer for the overall positive feedback. Here is our response to the concerns above:

[W1 - Correlation between the proposed method and human annotations] We thank the reviewer for the great suggestion. We indeed think that this would be a valuable addition to our work, and we have started a comparison between human annotations and the keywords extracted by our method. We are now collecting responses from human participants using questionnaire forms, and we expect to have the results within the upcoming weeks for the final revision of the manuscript.

Meanwhile, we can compare our current results to known biases in the considered datasets. For Waterbirds, our method finds keywords related to forests, trees, and vegetation for the landbirds class, and ocean/sea for the waterbirds class. This shows that the context of the extracted keywords is in line with the dataset bias. The same can be said about CelebA (we find gender bias for the considered attributes). For BAR, in (Nam et al., 2020) different pairs of <action, bias> are suggested, namely: (Climbing, RockWall), (Diving, Underwater), (Fishing, WaterSurface), (Racing, PavedTrack), (Throwing, PlayingField), and (Vaulting, Sky). Our method finds relevant keywords for all classes: (Climbing, Rock/Cliff), (Diving, Scuba/Underwater), (Fishing, River/Sea/Ocean), (Racing, Track/Car), (Throwing, Pitch/Mound), and (Vaulting, Midair/High). More fine-grained comparisons will be added with the ongoing human study.

[W2 - Robustness to model alignment to text encoder] We completely agree with the reviewer's point, and we invite them to check the analysis performed in the supplementary material, in Sec. B.2.1, where we have employed a variety of text embedders on Waterbirds: despite the specific extracted words varying slightly, the semantics remain consistent, showing the robustness of SaMyNa to different text encoders.

[Q1 - Validation set and hyperparameter tuning] We thank the reviewer for this question. The absence of a validation set is one of the strengths of our method, which makes it more generally applicable to a broader range of datasets (more details can be found in the general response). We have three hyperparameters for SaMyNa, namely $f_{min}$, $t_{sim}$ and $K$. In the supplementary material, we already provided ablations on these hyperparameters separately in sections B.2.2, B.2.4, and Table 12 to show their limited impact. We do not perform tuning on these hyperparameters, as they are mainly for the convenience of usage of the method (e.g. a higher $K$ results in longer running times, but does not significantly alter the output of the method except for very small values of $K$ like 1). In the same fashion, $f_{min}$ and $t_{sim}$ are just used for filtering the keywords: for example, setting $t_{sim}$ = -1 would not change the relative ordering (nor the score) of the extracted keywords, as is also the case for $f_{min}$. Nevertheless, we have performed an ablation study on Waterbirds (in the same setup as Sec. 4.1) on the combination of all 3 hyperparameters with $f_{min}$ = 0 (no filtering), $t_{sim}$ = -1 (no filtering), and $K$ = 50 (our maximum), obtaining the following top keywords, aligned with the keyword predictions provided in the paper (we cannot attach the full output due to the character limit; we will add it in the revised version of the paper):

bias 0: forest (0.37326), foliage (0.37124), shrubs (0.36617), deciduous (0.35358), tree (0.35058), forested (0.34944), twigs (0.34178), stalks (0.33679), wooded (0.33528), plants (0.33146), leaf (0.33143), woodland (0.32835), plant (0.32778), trees (0.32101), branch (0.31242), leafy (0.30757), vegetation (0.30122), fern (0.29482), ferns (0.29417), pine (0.29382), branches (0.29285), jungle (0.28948), woodpecker (0.28427), evergreens (0.28349), lilies (0.26924), evergreen (0.26723), grasses (0.26579), flora (0.26248), garden (0.26248), brownishblack (0.25752), brownishgray (0.25395), coniferous (0.25326), twig (0.25266)

bias 1: sea (0.57176), ocean (0.56346), seas (0.54325), beach (0.50411), seaside (0.49274), watercraft (0.45897), waters (0.45848), seascape (0.45239), shoreline (0.44961), shore (0.44412), tide (0.44050), coast (0.43822), coastal (0.42281), maritime (0.41680), aquatic (0.41262), pier (0.40740), sailing (0.40685), beachfront (0.40225), harbor (0.39950), coastline (0.39708), ships (0.37315), marina (0.37144), water (0.36510), ship (0.35923), waves (0.34343), waterfront (0.34084), marine (0.32861), sails (0.31933), boats (0.31620), sailboat (0.31366), wave (0.31221), floating (0.30932), lagoon (0.30155), submerged (0.29920), swim (0.29753), bay (0.29548), wet (0.29291), pond (0.29052), whale (0.28699), swimming (0.28541), surfboard (0.28261), wake (0.27740), boat (0.27665), midflight (0.27397), cliffs (0.27336), fish (0.27117), dock (0.27024), reef (0.26725), glide (0.26585), lake (0.26331), vessel (0.25468), sand (0.25399), river (0.25284), flying (0.25184)

Comment

Thank you for the detailed response. My concern [W2] and question [Q1] have been resolved. For [W1], although I think that a human annotation study would benefit the paper, I recognize that it could be hard to conduct such a study during the limited time period of the discussion phase. I believe it would increase the quality of the paper if the study is completed and included in the camera-ready version.

Currently, although my concerns are resolved, given my current confidence score, I am refraining from increasing my score and will maintain my current score.

Comment

We thank the reviewer for the positive response. We completely agree, and we are working in that direction - we commit to complete and release the human annotation study for the camera-ready version of the paper.

Comment

Dear reviewer, you can find the preliminary results of our human annotation study in the first answer to the general comment.

Review (Rating: 5)

This paper tackles the identification of hidden dataset bias within training data, which prevents models from learning intrinsic features that generalize across distributions. Using existing text-based models, it extracts and ranks bias-related keywords from the data and leverages them as pseudo bias-labels for supervised learning to debias the model. It shows effectiveness on various dataset-bias benchmarks in both synthetic and real-world setups.

Strengths

This paper is well-written and addresses the critical research question of identifying unknown dataset bias (spurious correlation) within training data. This would essentially enhance the explainability and reliability of models in real-world applications, especially for safety-critical purposes.

Weaknesses

1. Lack of novelty and effectiveness: Several key ideas of this paper already exist in a previous paper [1]. These include 1) sampling keywords using a pretrained captioning model, and 2) identifying bias keywords. Despite subtle technical differences, e.g., detecting bias keywords by contrasting true and false positives (this work) versus true positives and false negatives (Kim et al. [1]), overall this paper does not provide scientific novelty for the same goal beyond the existing papers. Such resemblance in technical details is reflected in the highly limited improvements in debiasing compared to Kim et al., as shown in Table 1. Therefore, it would be helpful to further elaborate the novel contribution of this paper against existing baselines.

2. Potential risk in identifying biased models: Section 3.1 proposes to identify biased models by looking at how confidently they misclassify the training data. However, models might end up overfitting the relatively small number of bias-conflicting samples in the training data, resulting in the potential bias NOT being detected at iteration $t^*$. Therefore, it seems necessary to further validate the effectiveness of using training data for detecting bias against an oracle, i.e., using held-out validation data.

[1] Kim et al., Discovering and Mitigating Visual Biases through Keyword Explanation, CVPR 2024

Questions

See weaknesses.

Comment

We thank the reviewer for their feedback. We respond to each raised point here.

[W1 - Lack of novelty and effectiveness] We refer the reviewer to the general comment where novelties and comparisons about the effectiveness are shown. In addition, we would also like to explain here other important differences between SaMyNa’s and B2T’s keyword extraction algorithms.

First of all, an important novelty of our algorithm is that it can find an embedding vector that represents the bias of the model in a certain class. This embedding vector is found by doing arithmetic operations between the embeddings of the captions, as explained in the paper. This means that we solve the problem of synonyms. In B2T there is a heavy filtering step before the ranking of the keywords that uses the YAKE keyword extraction algorithm. YAKE does not take into account the semantics of words, which means that it may filter out synonyms if they do not reach a certain frequency threshold individually. Conversely, our method does a very lightweight filtering of keywords before ranking, removing only very rare keywords that may result from captioning mistakes. After this lightweight filtering, we work entirely in the embedding space of the text embedder, which can account for all the synonyms and construct an embedding vector that represents the bias itself semantically. Keyword embeddings are then compared to the bias embeddings and ranked according to cosine similarity; our heavier filtering is done after ranking by setting a similarity threshold. Evidence of B2T's filtering being too heavy is found in the full keyword rankings for the class "not blond" of CelebA in the supplementary material of B2T, where the keyword "woman" does not survive filtering and does not appear in the ranking. In particular, B2T selects the top 20 best keywords according to YAKE before ranking, while we discard keywords that do not appear in at least 15% of the captions for a given class and then aggregate the surviving keywords in a single pool of keywords that will be used for all classes.

Another important difference is the output of the two algorithms: as briefly explained in the general comment, B2T finds keywords that represent the "opposite of the bias" (as explained in footnote 3 of B2T's paper). For example, it finds "man" for the "blond" class when in reality the bias is "woman". While in this simple case it is possible to simply invert B2T's answer, this is less obvious in other contexts like BAR where, for example, bias keywords include "rock", "track", "water", "sky", etc.

Additionally, our algorithm has the interesting mathematical property that for binary classification datasets, the ranking for one bias class is symmetrical with respect to the other class (because the two bias embedding vectors point in opposite directions, so the cosine similarity will give opposite scores). B2T’s algorithm does not have this property and has a filtering step that removes keywords that appear at the top of the rankings for both classes.
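For completeness, the mirrored rankings follow directly from the antisymmetry of cosine similarity under negation of one argument (notation introduced here only for illustration): if the two class bias embeddings are $\mathbf{b}$ and $-\mathbf{b}$, then for every keyword embedding $\mathbf{k}$

$$\cos(\mathbf{k}, -\mathbf{b}) = \frac{\mathbf{k}\cdot(-\mathbf{b})}{\|\mathbf{k}\|\,\|\mathbf{b}\|} = -\cos(\mathbf{k}, \mathbf{b}),$$

so the ranking for one class is exactly the reverse of the ranking for the other, with opposite scores.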

Finally, in Appendix E of SaMyNa, we show the potential of our algorithm to work on other modalities without involving text. We use image embeddings instead of caption embeddings to produce the embedding vectors that represent the biases, and then we rank image patches instead of keywords according to the similarity of the patch to the bias embedding and we display this as a heatmap. This shows that the underlying mechanism is fundamentally different from B2T’s because their method can only work with CLIP-like models and needs both images and text.

[W2 - Potential risk in identifying biased models] We would like to point out that the goal of the strategy proposed in Sec. 3.1 is to identify a t* such that the model did not overfit the bias-conflicting samples.

To empirically support our approach, Sec. B.1 of the supplementary material reports the distribution of aligned and conflicting samples using the ground truth of the bias alignment (unavailable for SaMyNa). We observe that the chosen model at time t* has a good separation between bias-aligned and conflicting samples.

Related to the use of held-out sets, as remarked in the general comment section, it is hard to guarantee that bias-conflicting samples are in the held-out set, especially when the proportion of bias-aligned samples is very high: the non-need for a validation set is indeed one of the strengths of SaMyNa.

Comment

I appreciate the authors' detailed responses to the review, and my responses are as below:

  • During rebuttal, the authors claimed that this method is able to detect the bias only using the training data (and thus not requiring validation data). However, I think B2T can also be easily applied using the training data: detecting bias using true positive and false negative in training data, instead of validation set.

  • As mentioned by the authors, B2T's heavy filtering, especially of semantically identical but differently named keywords (synonyms), could be mitigated by the proposed method. In addition, identifying words related to the bias instead of its opposite (as in B2T) is more intuitive and has potential.

However, I think the overall contributions of this paper are not significantly novel or impactful against B2T, but rather consist of technical differences in each part of the pipeline (e.g., captioning, sampling bias). The proposed method, as shown in Table 1, fails to show clear empirical improvements against B2T, outperforming it by only 1% on all of the datasets.

Comment

We appreciate the reviewer’s answer and we would like to further argue the methodological gap existing between B2T and SaMyNa. Even if we agree that the two methods share the idea of identifying the bias through other foundational models, at the same time they show fundamental and significant differences that make SaMyNa applicable in more realistic scenarios than B2T.

[A1 - B2T can also use training data] B2T cannot use training data for two very practical reasons:

  • Complexity: As shown in the general answer, B2T needs to caption the whole validation set, requiring much more computation than SaMyNa. Already working at the validation set level on a dataset like CelebA is computationally very expensive (e.g., requiring 60 days with LLaVA 34B as captioner); working at the training set level is just unrealistic.

  • Overfitting: Working only at the training set level, there is no strategy in B2T to really know (and decide) when to stop the training for bias-mining, unless one uses SaMyNa's first part of the pipeline - but in that case, would it still be B2T? In this regard, it is important to highlight how, in a bias-naming method, the first part involving bias-mining is at least as important as the scheme used for assigning bias keywords. As such, SaMyNa's bias-mining step, which stops before complete memorization of the (by construction, few) bias-conflicting samples, is a fundamental methodological novelty with respect to B2T.

[On SaMyNa's contribution] The objective of SaMyNa is not to provide a new SOTA debiasing approach (although we do, as substantiated by the results remarked by the reviewer, and with lower complexity than the competitors), but rather, as the reviewer also highlights, to provide a more humanly-accessible pipeline (any human can read and interpret the outcomes of every stage of our pipeline) and to be scalable. As such, we have also proposed some experiments in the wild on partitions of ImageNet (ImageNet-A, Fig. 5), rarely tackled by other debiasing algorithms, finding some biases like hand for stick-insect (because in most of the pictures stick insects are held in a hand), or meal for crayfish (because some of the pictures display cooked crayfish).

Furthermore, we believe that B2T’s output being the opposite of the bias is not just less intuitive, but in the majority of real-world cases is not interpretable by humans (what is the opposite of a “rock” in the BAR dataset?).

Finally, regarding the effectiveness of B2T vs. SaMyNa: both in our testing provided in the general comment (using LLaVA 34B) and in B2T's supplementary material (using ClipCap), B2T completely misses one of the two genders on the CelebA dataset. If the task were not binary, debiasing on CelebA would not have worked as well for B2T. We also test debiasing on non-binary datasets like BAR, surpassing the state of the art.

Comment

I appreciate the authors' quick responses.

However, I'm still not convinced that B2T cannot use training data: could B2T instead leverage a subset of the training data, rather than going through the entire validation set, and caption it for bias detection? Also, in terms of memorization, this issue could easily be mitigated by simply sampling a held-out validation set from the training data.

Also, even if B2T provides the attributes that make models fail, as opposed to this paper (which provides those that make the model succeed), I don't think the former lacks interpretability; it could even be more informative in practice. It is still meaningful to understand under which circumstances the models fail, e.g., climbing on ice instead of rock. In addition, practically, we could actively leverage this information for augmenting data with such failure cases, making models generalize across such shifts.

Comment

We thank the reviewer for the prompt reply. Considering the typical bias-conflicting/aligned ratio in biased datasets, the percentage of bias-conflicting samples in the training set could be as low as 0.5% of the total training samples. Thus, with a held-out validation set one cannot have any guarantees on the percentage of bias-conflicting samples in the held-out subset, making it hard even to ensure that bias-conflicting samples can be found in such a subset at all (a fundamental requisite for B2T).
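To put rough numbers on this (the split size $n$ below is an assumption chosen only for illustration, not a figure from our experiments): under a simple binomial model, a random held-out split of $n$ samples with conflicting ratio $p$ contains

$$\mathbb{E}[\text{conflicting}] = np \qquad \text{and} \qquad \Pr[\text{no conflicting at all}] = (1-p)^{n},$$

so with $p = 0.5\%$ and $n = 200$ one expects a single conflicting sample and obtains none at all about $37\%$ of the time ($0.995^{200} \approx 0.37$).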

Comment

Dear Reviewer,
To provide further evidence for our argument that B2T cannot make use of the training set directly, we ran dedicated analyses, focusing on two well-established benchmark configurations: (i) a native validation set made only of bias-aligned samples (BFFHQ); (ii) no native validation set (BAR dataset).

  • Specifically, we performed two experiments: We sampled 10 different validation splits of 10% of the original training set data from BFFHQ, using 10 different seeds, and counted the total number of bias-conflicting samples ending up in the obtained split, showing in the following table that out of the 96 total conflicting samples only a few may end up in a random split, and thus it is unrealistic to think that they will be the only samples misclassified by a vanilla model. This hinders the applicability of B2T in similar scenarios, as the source of correctly and incorrectly classified samples must come from a validation set with certain characteristics.

| Seed | Conflicting in a 10% Val. Split |
|------|---------------------------------|
| 0 | 8 |
| 1 | 6 |
| 2 | 9 |
| 3 | 4 |
| 4 | 5 |
| 5 | 9 |
| 6 | 8 |
| 7 | 9 |
| 8 | 8 |
| 9 | 6 |

  • We ran B2T on the BAR dataset, extracting a held-out validation split ourselves, stratified with respect to the known class population distributions. The vanilla model is a ResNet-18 (typically used in the literature for this dataset), without any exit mechanism for Bias Mining (as is the case for B2T).

Results of this second experiment are reported in the following comment for space reasons.

Comment

Results from running B2T on BAR, with a held-out validation set

The following tables outline the results for each class:

Climbing did not provide any output, as there were no misclassified samples in the validation set.

Diving:

| Keyword | Score | Acc. |
|---|---|---|
| soldiers jump | 5.11 | 0 |
| trampoline | 4.72 | 0 |
| swimmers jump | 4.61 | 0 |
| young man jumping | 4.25 | 0 |
| man jumps | 3.969 | 0.5 |
| man jumping | 3.969 | 0 |
| pool | 3.344 | 0.6667 |
| jump | 3.328 | 0 |
| jumps | 3.047 | 0.6667 |
| water during sunset | 2.688 | 0 |
| sunset | 2.469 | 0 |
| swimmers | 1.953 | 0 |
| swimming | 1.75 | 0.6667 |
| woman is swimming | 1.234 | 0 |
| dog | 0.953 | 0 |
| water | 0.5312 | 0.8 |
| man | 0.2344 | 0.6 |
| lake | 0.1562 | 0 |
| young man | 0.04688 | 0 |
| woman | -0.04688 | 0 |

Fishing

| Keyword | Score | Acc. |
|---|---|---|
| boy fishing | 3.438 | 0.5 |
| beach | 3.11 | 0 |
| boy | 1.828 | 0.5 |
| fishing | 0.4375 | 0.7778 |
| man fishing | 0.375 | 0.5 |
| lake | 0.2812 | 0.6667 |
| man | -0.4219 | 0.8333 |

Racing

| Keyword | Score | Acc. |
|---|---|---|
| start | 1.656 | 0.6667 |
| track | 1.594 | 0 |
| race | 1.484 | 0.8 |
| motorcycle | 1.219 | 0.6667 |
| practice | 0.6406 | 0.6667 |
| leads the field | 0.625 | 0.5 |
| racecar driver leads | 0.2344 | 0 |
| driver leads | 0.1719 | 0 |
| driver drives | 0.1719 | 0 |
| drives | 0.09375 | 0.6667 |
| racecar driver drives | -0.2188 | 0 |
| leads | -0.2344 | 0.5 |
| motorcycle racer | -0.2656 | 0.6667 |
| car | -0.3281 | 0.8889 |
| racecar driver | -0.3906 | 0.3333 |
| drives his car | -0.4531 | 0 |
| car during practice | -0.5 | 0 |
| driver | -0.5 | 0.3333 |
| racecar | -0.703 | 0.3333 |
| racer | -1.3125 | 0.6667 |

Throwing

| Keyword | Score | Acc. |
|---|---|---|
| friends playing football | 7.67 | 0 |
| american football | 7.188 | 0.8 |
| american football team | 6.547 | 0.6667 |
| playing football | 6.094 | 0.5 |
| football | 5.17 | 0.75 |
| football team | 5.08 | 0.6667 |
| friends playing | 3.844 | 0 |
| young man throws | 3.188 | 0 |
| group of friends | 2.914 | 0 |
| pass | 2.64 | 0.8333 |
| game against american | 2.219 | 0.6667 |
| man throws | 1.8125 | 0 |
| ball | 1.703 | 0 |
| game | 1.625 | 0.9583 |
| throws a ball | 1.625 | 0 |
| team | 1.3125 | 0.9333 |
| young man | 0.2344 | 0.5 |
| young | -0.2031 | 0.5 |
| american | -0.25 | 0.8 |
| person | -0.2656 | 0.9167 |

Vaulting

| Keyword | Score | Acc. |
|---|---|---|
| hurdles | -0.3281 | 0 |
| high | -1.328 | 0.9524 |
| competes | -1.422 | 0.9545 |
| flies | -1.734 | 0 |
| jump | -1.797 | 0.9583 |
| high jump | -1.969 | 0.9524 |
| person competes | -2.188 | 0.9524 |
| person flies | -2.188 | 0 |
| person | -3 | 0.9167 |

NB: We keep the complete outputs of B2T to provide the most comprehensive comparison. Note that keywords associated with negative scores would actually be filtered out in B2T. This means that the Vaulting class would also have no output at all.

Keeping in mind that the authors of B2T specify that it works by signaling the opposite of a bias, whenever the top keyword refers to the known bias it means that B2T is not working (this is the case for the Vaulting and Racing classes). In the case of Fishing, the extracted keywords mainly refer to the class itself (but beach and lake are still mistakes), thus not giving a clear indication to the end user. Throwing and Diving seem to successfully find difficult subgroups (soldiers jumping, friends playing football), but the opposite of these concepts is not immediately clear, and thus not quite interpretable.
The most dramatic failure happens for the class Climbing: as no validation sample from this class was misclassified by the vanilla model, it is not possible to extract any useful information with respect to potential biases.

For a comparison of these outputs with SaMyNa, our results on this dataset can be found in the bar plots in Figure 4 of the main paper and in the raw output scores provided in Table 8 of the Supplementary Material.

From these analyses, we believe that the methodological differences between our work and B2T should appear evident, and we hope to finally have addressed the reviewer's concerns regarding the lack of novelty and scientific contribution of SaMyNa with respect to B2T.

Comment

I appreciate the authors' detailed reply, which addressed my concerns about the proposed method's technical contributions over the existing method, B2T, in a more realistic setup where no validation set (that includes bias-conflicting samples) is available. However, I believe the setup of having an extremely limited number of bias-conflicting samples is not common in realistic scenarios. For instance, as suggested by the authors, in BFFHQ or BAR we could easily sample the potential bias-conflicting samples by collecting samples on which models might fail. Therefore, this paper definitely overcomes some problems that were under-explored by the previous setups, but does not have fundamental novelty over them, as mentioned in the initial review. Therefore, I increase my rating to 5, which is under the bar.

Comment

We thank the reviewer for raising the score and partially recognizing the novelty of our contributions.

Regarding the comment "the setup of having an extremely limited number of bias-conflicting samples is not common in realistic scenarios", which is currently holding the reviewer back from raising the score, we respectfully disagree. The basic assumption of model debiasing is indeed that bias-conflicting samples are very rare (e.g., 0.5%), and such a setting is not only realistic but also typical, and the most challenging one within the debiasing literature.

Furthermore, the suggestion "in BFFHQ or BAR we could easily sample the potential bias-conflicting samples by collecting samples on which models might fail" is not feasible, due to the known bias-conflicting memorization issue.

Finally, we believe that discussing how to improve B2T further is not of interest for this work, as that method already exists, and anything that was not included in the original method in the first place is, for us, an element of distinction and a clear contribution, especially when it solves critical issues impacting real-world applicability.

Review (Rating: 5)

This paper outlines a five-step process for identifying spurious bias of a model in natural language keywords. The steps are as follows: (1) sample selection, (2) captioning, (3) keyword selection, (4) classification embedding, and (5) keyword ranking. The experimental results showcase several sample outputs from this process. Furthermore, the evaluation highlights the utility of the identified keywords as pseudo-labels for groups, which can be leveraged by debiasing methods.

Strengths

This paper addresses the critical issue of model bias discovery through an interesting approach that utilizes natural language keyword descriptions.

Weaknesses

The proposed method lacks novelty since it shares many components with existing literature. For example, the iteration selection based on misclassification confidence outlined in Section 3.1 is a variant of the approach described by Nahon et al. (2023), while the keyword extraction from natural language captions is similar to that found in Kim et al. (2024). Although these references are cited, the paper does not clearly delineate which aspects are novel, making it challenging to assess its originality.

The authors claim contributions related to a text-based pipeline, the disentanglement of domain relevance, and the usefulness of the extracted keywords in debiasing. However, only the last contribution is substantiated by experimental results. The validity and utility of the first two claimed contributions remain unclear.

Furthermore, there is room for improvement in the presentation of the paper. Here are some suggestions:

  1. Clarify Equations: The presentation of the equations can be enhanced. For instance, in Equation 2, the denominator simply represents the number of misclassifications but uses a complemented Dirac delta unnecessarily (see the worked illustration after this list). Equation 3 calculates the average of the mean embeddings for both correctly classified and misclassified instances, but is presented as a complicated double sum.
  2. Refine Citations: A more judicious selection of parenthetical and in-text citations would enhance clarity and reduce unnecessary repetition throughout the text.
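For example (the notation below is assumed for illustration and may not match the paper's symbols exactly): a denominator written with a complemented delta over predictions $\hat{y}_i$ and labels $y_i$ reduces to a plain count of misclassified samples,

$$\sum_{i=1}^{N}\left(1 - \delta_{\hat{y}_i,\, y_i}\right) = \left|\{\, i : \hat{y}_i \neq y_i \,\}\right|,$$

which could be stated directly.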

[1] Remi Nahon, Van-Tam Nguyen, and Enzo Tartaglione. Mining bias-target alignment from voronoi cells. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4946–4955, 2023.
[2] Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, and Jinwoo Shin. Discovering and mitigating visual biases through keyword explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11082–11092, 2024.

Questions

  1. Which portion is the most important contribution in the proposed pipeline?
  2. Why is the model with the most confident misclassification useful in bias discovery? If the final model and the selected model are different by a lot, how would this selected model be useful in the final model bias discovery or mitigation?
  3. The keywords are derived from the samples classified as the target class by the studied model. Although these keywords are correlated with the model's classifications, wouldn't this be insufficient to indicate causation?
  4. Is there any compounded bias effect, since many off-the-shelf models participate in the pipeline? For example, the captioning model may focus on specific aspects of an image, or the text embedding model may be sensitive to specific keywords.
Comment

[W1/Q1/Q2 Contribution/Novelty] Please refer to the general comment and to the answer [W1 - Lack of novelty and effectiveness] for Rev. Pg7a.

Related to the novelty compared to (Nahon et al., 2023), while in their method the authors work at the output of the feature extractor (a space with a typically large dimensionality), we work at the output of the classifier model (with much lower dimensionality), which saves computation; moreover, our result has a precise probabilistic interpretation. While Nahon et al. search for clusters in the embedding space that might be captured by the classifier, we work at the output of the final classifier; hence, we exploit the captured bias, observing probability distributions at the output of the softmax layer. Since we work directly on the training set (without a validation set), we treat the moment at which the model begins to misclassify some samples less confidently as an overfitting signal, and we stop the bias fitting at that point.
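As a minimal sketch of this stopping idea (an illustration of the criterion described above, not the exact procedure of Sec. 3.1), one could track the mean softmax confidence of the predicted class on misclassified training samples across saved checkpoints and pick the checkpoint where it peaks, i.e., before the model starts memorizing the few bias-conflicting samples:

```python
import numpy as np

def select_bias_mining_checkpoint(softmax_by_epoch, labels):
    """softmax_by_epoch: list of (N, C) arrays of training-set softmax outputs, one per checkpoint.
    labels: (N,) array of ground-truth class indices.
    Returns the index t* of the checkpoint with the most confident misclassifications."""
    scores = []
    for probs in softmax_by_epoch:
        preds = probs.argmax(axis=1)
        wrong = preds != labels
        if not wrong.any():                     # nothing misclassified at this checkpoint
            scores.append(-np.inf)
            continue
        # mean confidence assigned to the (wrong) predicted class on misclassified samples
        scores.append(probs[wrong, preds[wrong]].mean())
    return int(np.argmax(scores))
```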

[W2/Q4 - Novelty on text-based pipeline and disentanglement] Our claims are substantiated by experiments in the main paper and by extensive ablation studies and robustness tests on diverse text encoders in the supplementary material. We invite the reviewer to refer to the general answer and to the answer [W1 - Lack of novelty and effectiveness] of Rev. Pg7a for more details.

[W3 - Refinements] We thank the reviewer for remarking on this; these issues will be fixed in the final version of the paper.

[Q3 - Bias-target alignment] Similarly to works like (Nam et al., 2020), (Liu et al., 2021) and (Nahon et al., 2023), the underlying assumption for bias extraction is that information on bias-target misalignment can be mined from misclassified samples. This comes from the fundamental assumption that biased features are easier for target models to learn.

We are open to further discussion with the reviewer on the above points and, should all the points be adequately addressed, we hope they will adjust their evaluation.

Comment

Thank you for the response to my review. After reading through the responses, I can see the novelty of contributions that I was missing in my initial review. I update my score accordingly.

However, the major strength in interpretability of the pipeline could have been shown more convincingly. Here are some ideas to improve the presentation:

  1. It would be clearer to explicitly compare against similar baselines in the text to emphasize the novelty properly.
  2. The benefit of the proposed method is not convincing enough from the provided case study. Quantitative measures related to interpretability would have been more convincing (such as (1) user survey scores, (2) surrogate interpretability measures, or (3) component-wise evaluation).
Comment

We thank the reviewer for recognizing the novelty of our contributions and for the suggested improvements. We completely agree that implementing them would greatly enhance the paper's presentation and clarity.

In the revised version of our manuscript, we are adding a dedicated paragraph to better highlight how we differentiate from existing baselines (e.g. B2T) and to further clarify our method assumptions.

Regarding additional measures for interpretability, we do believe that our work would greatly benefit from adding a comparison between our extracted keywords and human users' annotations, as also suggested by reviewers Mn1G and mTJk. We are currently collecting responses from human participants using questionnaire forms; the results will be gathered in the upcoming weeks and included in the final revision of the manuscript.

Another source of interpretability provided by SaMyNa can be obtained by working with image embeddings, using a vision encoder instead of a text encoder (Supplementary Material, Section E). These embeddings are used exactly as we use caption embeddings, in order to generate the learned class embeddings (Equations 3 and 4 in the main paper), but applied to image embeddings. After this, we split the image under analysis into patches and use the same vision encoder to create embeddings for each patch. We can then use cosine similarity between the patch embeddings and the learned class embeddings to obtain a score for each patch. The resulting heat-map visualizations provide a visual cue of which parts of the input image are more aligned with the found biases and which are not. From the two examples on Waterbirds in Section E, we can see how the identified image regions do correspond to the dataset bias (i.e., the background environment).
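A minimal sketch of this patch-based heatmap idea (the patch size, the `vision_encoder` interface, and the grid aggregation are assumptions for illustration, not the exact implementation of Section E):

```python
import numpy as np

def bias_heatmap(image, bias_embedding, vision_encoder, patch=32):
    """image: (H, W, 3) array; bias_embedding: 1-D embedding built from the learned class
    embeddings applied to image embeddings; vision_encoder: maps an image crop to a 1-D
    embedding. Returns a coarse grid of cosine similarities (one cell per patch)."""
    H, W = image.shape[:2]
    grid = np.zeros((H // patch, W // patch))
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            crop = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            emb = vision_encoder(crop)
            grid[i, j] = float(emb @ bias_embedding /
                               (np.linalg.norm(emb) * np.linalg.norm(bias_embedding)))
    return grid  # upsample / overlay on the input image for visualization
```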

Our component-wise evaluation is carried out mainly in the supplementary material, where we provide detailed ablation studies involving each component of SaMyNa, including the raw outputs with explicit score values. Other ablation studies include the bias-mining step (Section B.1), different text embedders (Section B.2.1), a controlled unbiased case (Section C), and the resulting output for other backbones such as vision transformers (Section D). We believe that the proposed experimental analysis sufficiently outlines the range of potential applications and scenarios supported by SaMyNa, which provides both qualitative (keywords) and quantitative (ranking and scores) information for the end user's interpretation.

Comment

Dear reviewer, you can find the preliminary results of our human annotation study in the first answer to the general comment along with a summary of the improvements we made to the paper.

Comment

We thank the reviewers for their useful comments. We would like to give a general answer regarding the critical point shared by most of the reviewers, which is novelty in comparison with Bias2Text (B2T) (Kim et al., 2024). B2T pursues the same objective as our paper: extracting human-interpretable descriptions of potential biases affecting a deep image classification model.

B2T and our method are fundamentally different in the way they extract keywords. Our method provides a much more robust estimation of biases in a variety of contexts, in which B2T fails to produce useful information or is not applicable at all.

A key difference is that our method does not require a validation set for bias discovery, in contrast to B2T. This represents a key aspect in unsupervised bias discovery and mitigation, as in a realistic scenario a validation set comprising conflicting samples is rarely available (like in BAR or BFFHQ). Besides, B2T requires captioning the entire validation set in order to extract relevant biases from conflicting samples. On the other hand, our SaMyNa does not suffer from such limitations, as

  • we only need the training set;
  • we can extract candidate few exemplars thanks to Bias Mining.

With this, SaMyNa is much more efficient than captioning the entire validation set and allows for the use of larger and more accurate captioners such as LLaVa-34B.

Why B2T cannot leverage better captioners while SaMyNa can. The quality of extracted keywords, as noted by Rev. Mn1G, directly depends on the quality of the captioner. B2T is forced to employ smaller and quicker captioners such as ClipCap, which, however, provide less accurate captions when compared to models such as LLaVa-34B. The table below shows the time required for SaMyNa and B2T on the CelebA dataset using LLaVa-34B on an NVIDIA A40 equipped with 48 GB, tested with batch size 5:

| Method | Captioning time (LLaVA 34B) |
|---|---|
| SaMyNa (K=1) | 17 minutes |
| SaMyNa (K=5) | 86 minutes |
| SaMyNa (K=10) | 3 hours |
| SaMyNa (K=25) | 7 hours |
| SaMyNa (K=50) | 14 hours |
| B2T | 60 days |

As we can see from the results, even when employing K=50 (200 images) in SaMyNa, we still achieve considerably lower times than B2T, which requires captioning the entire validation set (19,867 images).

We also evaluated the quality of the ClipCap captioner for the sake of comparison with B2T: we found that ClipCap produces notably worse results, as it is not able to fully capture the appearance of the images. This prevents its usage on many relevant datasets. An example of this is found in Table 14 of B2T's supplementary material, in which the full ranking of the keywords for both CelebA classes is shown; as can be seen, the results for the "not blond" class are suboptimal.

Why B2T requires a full validation set and SaMyNa does not. To further demonstrate that B2T requires the full validation set, we compare the extracted keywords of B2T and SaMyNa with varying sample sizes. In the table below, we report the gender-related keywords extracted by both methods on CelebA:

| | K=1 | K=5 | K=10 | K=25 | K=50 |
|---|---|---|---|---|---|
| Blond (B2T). Expected: male. | woman (6th) | N/A | N/A | N/A | N/A |
| Blond (SaMyNa). Expected: woman. | N/A | woman (6th) | woman (1st) | woman (5th) | woman (1st) |
| Not Blond (B2T). Expected: woman. | N/A | woman (5th) | woman (3rd) | woman (5th) | woman (4th) |
| Not Blond (SaMyNa). Expected: male. | male (1st) | male (1st) | man (1st) | man (1st) | man (1st) |

We highlight cases in which the relevant keyword was ranked higher than in the other method. The results clearly show that our method, which leverages the bias mining step, is consistently more accurate than B2T in extracting the right keywords. Furthermore, keep in mind that B2T keywords need to be inverted as reported in the original paper (e.g. woman -> man); thus for K=1, B2T actually predicts the opposite bias. This is straightforward for a binary attribute such as CelebA's gender, but not obvious when more than two biases are present in the training set.

To summarize, the key differences between our methods SaMyNa and B2T are:

  • B2T requires a large enough validation set containing conflicting samples, while SaMyNa leverages bias mining to find candidate exemplars directly from the training set.
  • SaMyNa is less computationally heavy than B2T and for this reason, we can employ state-of-the-art captioners such as LLaVA-34B which greatly help in extracting meaningful and descriptive captions.
  • SaMyNa scales better and more consistently than B2T when employing fewer samples for keyword extraction.

Please, find more detailed responses below each review and in the revised text. We hope that this response has clarified the difference between SaMyNa and B2T and addressed the concerns about the novelty by Reviewers UdxG, Pg7a, and mTJk.

Comment

We thank the reviewers for the time dedicated to the discussion. We are happy to announce that we made the following improvements to our paper (all changes to the text are highlighted in blue):

  • We added an ablation study on $t_{sim}$, $f_{min}$, and $K$ simultaneously (in Sec. B.2.5); individual ablations for each hyperparameter were already included (in Sec. B.2.2, B.2.3, B.2.4).
  • We updated Fig. 10 in Sec. E to show more examples of SaMyNa’s heatmaps.
  • We added a comparison between ClipCap and LLaVA-34B captions (in Sec. F).
  • We added a comparison between B2T and SaMyNa (in Sec. G).
  • We changed some sentences to better highlight SaMyNa’s novelty.
  • In the supplementary ZIP file we added all the captions generated by ClipCap on CelebA (used for Tab. 7).
  • In the supplementary ZIP file we added the full results of the ablation in Sec. B.2.5.

Furthermore, today we are releasing the preliminary results of the human annotation study we made on CelebA, Waterbirds, and BAR. In the past days, we collected the answers of 20 participants, and we plan to include more for the camera-ready version. In our survey, we show participants a set of images from each dataset, divided by target class (for example, blond and non-blond people, waterbirds and landbirds, and so on). We asked participants to provide sets of keywords that, in their opinion, represent a bias in the different groups. Then, we analyzed the keywords found by participants and ranked them based on the number of occurrences. We report the results in the tables below, with a comparison with SaMyNa (BAR is split into two tables for space reasons).

We report the top-5 results of SaMyNa and human keywords that occurred at least 3 times.

Results for CelebA and Waterbirds:

| Method | CelebA (blond) | CelebA (not blond) | Waterbirds (landbird) | Waterbirds (waterbird) |
|---|---|---|---|---|
| SaMyNa | woman (0.49), makeup (0.26), lipstick (0.24), eyeshadow (0.21) | man (0.41), male (0.39) | tree (0.37), forest (0.36), trees (0.35), forested (0.35), foliage (0.34) | sea (0.55), ocean (0.54), beach (0.43), waters (0.42), shoreline (0.39) |
| Human | woman (20), long hair (5), white (4), white skin (4), smile (3) | man (14), hat (5), short hair (4) | forest (6), trees (6), green (3) | water (5), grey (4), red (3) |

Results for BAR:

| Method | BAR (climbing) | BAR (diving) | BAR (fishing) |
|---|---|---|---|
| SaMyNa | cliff (0.53), rock (0.45), rocks (0.41), steep (0.35), backpack (0.33) | scuba (0.57), underwater (0.54), submerged (0.38), coral (0.28), depths (0.28) | boat (0.45), river (0.45), sea (0.43), ocean (0.38), lake (0.36) |
| Human | rocks (9), mountain (6), helmet (6), rock (4), ice (4) | water (10), blue (5), sea (4), pool (4), man (3) | water (7), fishing rod (4), children (4), fish (4), lake (3) |

| Method | BAR (Racing) | BAR (Throwing) | BAR (Vaulting) |
|---|---|---|---|
| SaMyNa | cars (0.38), car (0.36), track (0.33), stadium (0.26), speeds (0.24) | pitch (0.56), baseball (0.53), pitcher (0.51), batter (0.48), player (0.47) | midair (0.42), jump (0.41), pole (0.34), high (0.32), suspended (0.32) |
| Human | cars (5), car (5), wheels (5), road (5) | man (11), baseball (8), ball (4), sport (4) | sky (5), pole (4), air (4), woman (4) |

As we can observe from the results, the output of SaMyNa is aligned with human annotations. Note that, as for SaMyNa, we did not provide any guidance to the survey participants (in terms of bias features), so they did not have any prior knowledge about the kind of biases in each dataset. We report below the instructions provided to the participants in the survey:

What represents a bias? In the examples we will show, you will encounter different kinds of images. Some of them portray people, others portray animals, and others show activities performed by humans. In this questionnaire, we refer to every possible recurring feature as bias. Everything that shows a high correlation with the group should be then included in your answer (e.g. do not limit yourself to well-known biases such as gender).

How to fill out the form: We will show you different groups of images. Each group has a common characteristic that we will highlight (hair color, type of bird, action name). For each group you must find up to 10 keywords; these keywords must refer to objects/features/characteristics or other elements of the images and must meet the following requirements:

  • The keyword must appear in MOST of the images of the group
  • The keyword must not be common to other groups in the same task (e.g. "person", "bird")
  • The keyword must not be the feature we highlighted about the group (if the group is about blond people, the keyword must not be "blond")

We hope that with these last results, we finally addressed any remaining concerns and that reviewers will adjust their feedback accordingly.

AC Meta-Review

Summary of the Paper:

The paper presents “Say My Name” (SaMyNa), a method to identify and name biases learned by deep models using a text-based pipeline. Unlike prior work that often requires a validation set with bias-conflicting samples, SaMyNa aims to operate solely on the training set. By extracting textual descriptions (via large-scale captioners), selecting representative samples, and ranking recurrent keywords, the method claims to offer human-interpretable insights into the model’s biased correlations. Additionally, the authors show that these discovered biases can aid in downstream debiasing, improving model robustness on known benchmarks.

Strengths:

  • Human-Interpretable Approach: The method aims to provide a more human-readable interpretation of biases by extracting textual descriptions and ranking keywords.

Weaknesses:

  • The approach does not offer contributions significantly distinct or differentiated from existing work, such as Bias2Text.

  • Modest practical gains: While flexibility and the ability to operate without a dedicated validation set are claimed, the empirical improvements in debiasing performance are minor.

Additional Comments on Reviewer Discussion

The paper introduces a tool that aims to semantically identify and label biases learned by deep models without requiring a dedicated validation set. The authors proposed a pipeline that integrates bias mining and text-based keyword extraction procedures to produce human-interpretable bias descriptors. While these contributions address an important problem and offer incremental improvements over existing bias-detection approaches, the reviewers remain concerned about the novelty of the framework relative to existing tools. In particular, the conceptual advances relative to prior methods (e.g., Bias-2-Text) were not convincingly demonstrated, and the experimental results did not establish a clear performance improvement. Given these considerations, I don't think the paper is ready for acceptance.

Final Decision

Reject