PaperHub

Average rating: 5.5/10 · Rejected · 4 reviewers
Individual ratings: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Average confidence: 2.8

ICLR 2024

It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation

Submitted: 2023-09-17 · Updated: 2024-02-11
TL;DR

This paper studies human annotator simulation (HAS) for automatic data labelling and model evaluation, which incorporates the variability of human evaluations.


Keywords

human annotator simulation, normalizing flows, speech processing, emotion recognition, meta-learning, zero-shot learning, fairness

Reviews and Discussion

Review (Rating: 5)

The paper proposes a meta-learning framework that frames human annotator simulation as a zero-shot density estimation problem, which allows for the generation of human-like annotations for unlabelled data. Moreover, conditional integer flows and conditional softmax flows can account for ordinal and categorical annotations.

Strengths

(1) The paper's writing and organization are clear. (2) The idea of adopting meta-learning for human annotation of zero-shot unlabelled data is sensible. (3) The experimental results on three real-world human evaluation tasks seem promising.

Weaknesses

(1) The main weakness is the limited set of compared methods in the experiments. In the tables in the paper, state-of-the-art meta-learning methods are missing and the latest human annotation simulators are also disregarded. Please consider adding more SOTA methods for comparison. (2) The other weakness is that the computational cost is not reported in the paper. Since the authors claim that the proposed method is effective, the training and test time should be listed and compared to other SOTA methods in the experiments. Please consider adding more details here.

Questions

Please see the Weaknesses part.

Details of Ethics Concerns

None.

Comment

We thank Reviewer Pupg for the detailed comments. We address some general concerns in the general response and respond to your specific comments below:

(1) The main weakness is the limited set of compared methods in the experiments. In the tables in the paper, state-of-the-art meta-learning methods are missing and the latest human annotation simulators are also disregarded. Please consider adding more SOTA methods for comparison.

Human annotation simulation (HAS) is a new task proposed in our paper, so, to our knowledge, there is no existing human annotation simulator available; all baselines in our paper are adapted from suitable existing methods from other domains. We could not find suitable existing meta-learning techniques to adapt because our method is the first that aims to meta-learn density estimators, whereas existing meta-learning methods focus on meta-learning (single-output) predictors. We would be grateful if the reviewer could give citations to suitable prior meta-learning methods for HAS.

(2) The other weakness is that the computational cost is not reported in the paper. Since the authors claim that the proposed method is effective, the training and test time should be listed and compared to other SOTA methods in the experiments. Please consider adding more details here.

The computational time cost of all compared methods for the four tasks studied is shown in Tables (i)-(iv) in general response (2). Please refer to the general response for more details and analysis.

We hope our response resolves your concerns.

Review (Rating: 6)

The paper focuses on human annotator simulation (HAS), which is the task of generating human-like annotations for unlabelled inputs.

The paper proposes a novel meta-learning framework that treats HAS as a zero-shot density estimation problem, which can capture the variability and subjectivity in human evaluation.

The paper also introduces two new model classes, conditional integer flows and conditional softmax flows, to handle ordinal and categorical annotations respectively.

The paper evaluates the proposed method on three real-world human evaluation tasks: emotion recognition, toxic speech detection, and speech quality assessment. The paper shows that the proposed method can better predict the aggregated behaviors of human annotators, match the distribution of human annotations, and simulate inter-annotator disagreements.

Strengths

The paper proposes a novel meta-learning framework that treats human annotator simulation (HAS) as a zero-shot density estimation problem, which can capture the variability and subjectivity in human evaluation. The proposed method is evaluated on three real-world human evaluation tasks: emotion recognition, toxic speech detection, and speech quality assessment. The paper shows that the proposed method can better predict the aggregated behaviors of human annotators, match the distribution of human annotations, and simulate inter-annotator disagreements. The paper also introduces two new model classes, conditional integer flows and conditional softmax flows, to handle ordinal and categorical annotations respectively. The proposed method is efficient and capable of generating human-like annotations for unlabelled test inputs.

The strengths of the paper are:

  • The proposed method is capable of generating human-like annotations for unlabelled test inputs with higher accuracy than the baseline methods.

  • The proposed method can better predict the aggregated behaviors of human annotators, match the distribution of human annotations, and simulate inter-annotator disagreements.

  • The paper introduces two new model classes, conditional integer flows and conditional softmax flows, to handle ordinal and categorical annotations respectively.

  • The paper discusses the ethical implications and potential applications of HAS.

  • Training code is available to reproduce the results in this paper.

Overall, this paper presents a novel approach to HAS that can capture the variability and subjectivity in human evaluation. The paper also provides insights into how to handle ordinal and categorical annotations.

Weaknesses

  • The proposed method is evaluated on only three human evaluation tasks, which may not be sufficient to generalize the effectiveness of the proposed method to other domains.

  • The paper does not compare the proposed method with the latest state-of-the-art methods for HAS. The methods compared in this paper include deep ensemble (Ensemble) (Lakshminarayanan et al., 2017), Monte-Carlo dropout (MCDP) (Gal & Ghahramani, 2016), Bayes-by-backprop (BBB) (Blundell et al., 2015), conditional variational autoencoder (CVAE) (Kingma & Welling, 2014), conditional argmax flow (A-CNF) (Hoogeboom et al., 2021), Dirichlet prior network (DPN) (Malinin & Gales, 2018), Gaussian process (GP) (Williams & Rasmussen, 2006), and evidential deep learning (EDL) (Amini et al., 2020). The most recent method A-CNF was proposed two years ago.

  • The paper does not provide a detailed analysis of the robustness of the proposed method.

Despite these limitations, the paper presents a novel approach to HAS that can capture the variability and subjectivity in human evaluation. The paper also provides insights into how to handle ordinal and categorical annotations. However, further research is needed to evaluate the proposed method on more diverse datasets and tasks.

Questions

I am a computer vision researcher and do not know the state of the art of HAS. However, the most recent method compared in this paper, A-CNF, was proposed two years ago. Is there any other recent work proposed in the past two years?

Ensemble achieves the best performance in Table 1 and Table 2. The proposed method is much more efficient than the ensemble method. Could you please provide a detailed complexity comparison?

Details of Ethics Concerns

NA, the proposed method is designed to alleviate ethical problems.

Comment

We are thankful for Reviewer 2YK4's feedback. We address some general concerns in the general response and respond to your specific comments below:

(1) The proposed method is evaluated on only three human evaluation tasks, which may not be sufficient to generalize the effectiveness of the proposed method to other domains.

The three tasks used to evaluate the proposed method in our paper are real-world tasks from three very representative domains (emotion class annotation, toxic speech detection, and speech quality assessment) that concern human evaluation. In addition, we've added an extra emotion attribute prediction task in Appendix K of the revised manuscript, which, instead of estimating discrete emotion labels such as happy, sad, and neutral, predicts emotion attributes (i.e. valence: whether the speaker is positive or negative; arousal: whether the speaker is excited or calm; dominance: whether the speaker is weak or dominant). Please also refer to the general response for more detail.

(2) The paper does not compare the proposed method with the latest state-of-the-art methods for HAS...... The most recent method A-CNF was proposed two years ago.

Human annotation simulation (HAS) is a new task setting proposed in our paper, for which no directly applicable prior method is available. All baselines considered in our paper are adapted from suitable existing methods from other domains. In addition, although we cite (Malinin & Gales, 2018) and (Amini et al., 2020) for DPN and EDL in the main paper, they are actually adapted from variants proposed in recent papers (Wu et al., 2022b) and (Wu et al., 2023) which are cited in Appendix E.

The paper does not provide a detailed analysis of the robustness of the proposed method.

The proposed method is evaluated on four different real-world tasks (three in the paper plus one newly added). Results reported in Tables 1-3 are the mean and standard error from three independent runs with different random seeds, which demonstrates the robustness of the proposed method.

Ensemble achieves the best performance in Table 1 and Table 2. The proposed method is much more efficient than the ensemble method. Could you please provide a detailed complexity comparison?

The complexity of all methods for the three tasks studied in the paper and one newly added task is shown in Tables (i)-(iv) in general response (2). Please refer to the general response for more detail and analysis.

In addition, we'd like to clarify that achieving the highest accuracy in Table 1 and Table 2 doesn't indicate that the ensemble has the best performance. As explained in Section 2.1, due to subjectivity in human interpretation, deriving a deterministic "ground truth" in subjective tasks is not feasible. Therefore, we advocate for modelling the label distribution to account for human perception variability instead of only predicting the majority opinion. Various metrics are adopted in the paper to evaluate the performance of distribution matching and human variability simulation, as described in Section 4. The proposed method outperforms all baselines by yielding the smallest $\text{NLL}^\text{all}$ (indicating its superior distribution matching capability) and the smallest $\text{RMSE}^s$, $\mathcal{E}(\bar{s})$, and $\mathcal{E}(\hat{\kappa})$ (indicating its superior simulation of inter-annotator variability).
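
To make these metrics concrete, a minimal sketch of how such distribution-level scores can be computed from a matrix of human and simulated annotations is given below (illustrative only; the precise definitions are in Section 4, and Fleiss' kappa for $\mathcal{E}(\hat{\kappa})$ is omitted for brevity):

```python
import numpy as np

def distribution_level_metrics(log_probs, human, simulated):
    """Illustrative distribution-level metrics (not the exact Section 4 code).

    log_probs: [N, M] log-likelihood the simulator assigns to each human annotation
    human:     [N, M] annotations from M human annotators per input
    simulated: [N, M] annotations sampled from the simulator
    """
    # NLL over every individual annotation, not just the majority label.
    nll_all = -np.mean(log_probs)

    # Per-input standard deviations capture inter-annotator variability.
    s_human, s_sim = human.std(axis=1), simulated.std(axis=1)

    # RMSE between per-input standard deviations (RMSE^s) and the
    # absolute error of the average standard deviation (E(s_bar)).
    rmse_s = np.sqrt(np.mean((s_human - s_sim) ** 2))
    err_s_bar = abs(s_human.mean() - s_sim.mean())
    return nll_all, rmse_s, err_s_bar
```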

We hope our response resolves your concerns.

Review (Rating: 6)

Supervised learning tasks require annotation that can be done with high certainty, such as annotating the presence of an object, drawing rough bounding boxes, or deciding scene attributes; however, many tasks require subjective labeling that is influenced by a variety of factors, cognitive biases, or personal preferences. This paper proposes human annotator simulation to incorporate the variability in this second group of labeling tasks.

The method is a meta-learning framework, a zero-shot density estimator that models the agreement and disagreement among human annotators using a latent variable model. It does not require any human effort and can exploit unlabelled samples efficiently.

Experiments are performed on different modalities and domains that demand various levels of subjective annotation, such as emotion category, toxic speech, and speech quality assessment.

Strengths

  • Various machine learning problems require subjective labeling, and it is not easy to use crowdsourcing due to privacy concerns. The proposed approach allows such highly sensitive data to be labeled and used in learning problems.

  • The proposed approach is a latent variable model. $p(z|x)$ encodes the information in the input $x$; however, the interesting part of the formulation is the introduction of another intermediate variable $v$, instead of directly taking $p(y|z)$. The conditional normalising flow (CNF) formulation gives more flexibility than committing to a particular distribution choice.

  • Conditional integer flows and conditional softmax flows are introduced to accommodate ordinal and categorical annotation tasks.

Weaknesses

  • In the experiments, test performances are reported; however, considering the problem as "simulating subjective human annotations", I would expect to see (i) training to learn the latent model on a labeled training subset, (ii) generating simulated labels on a held-out training subset, and (iii) training a classifier only with these simulated labels and evaluating it on the test set. Instead, the current evaluation is more like a supervised evaluation.

  • I found the problem addressed highly similar to the following papers [1,2], which aim to model the label space iteratively using Gibbs sampling (or the earlier line of work that used MCMC). These works may require the annotation process in a more dynamic setting, but they are still very relevant to the subjective labelling task and I think they need to be discussed.

    [1] Harrison, P., Marjieh, R., Adolfi, F., van Rijn, P., Anglada-Tort, M., Tchernichovski, O., ... & Jacoby, N. (2020). Gibbs sampling with people. Advances in neural information processing systems, 33, 10659-10671. https://proceedings.neurips.cc/paper_files/paper/2020/file/7880d7226e872b776d8b9f23975e2a3d-Paper.pdf.

    [2] Sanborn, A., & Griffiths, T. (2007). Markov chain Monte Carlo with people. Advances in neural information processing systems, 20. https://papers.nips.cc/paper_files/paper/2007/file/89d4402dc03d3b7318bbac10203034ab-Paper.pdf

  • How does the proposed approach tackle highly imbalanced data domains? For instance, in one of the tasks, the MSP-Podcast dataset contains angry, sad, happy, neutral, and other. When continuous labels (valence-arousal annotations of the same dataset) are used, the skewed label distribution will be more visible. I suspect the proposed method will show higher uncertainty (inter-rater disagreement) in less explored parts of the label parameter space.

  • Different performance metrics are used. In particular, Fleiss' kappa for the reliability of categorical labels is good. However, why did you not use the intraclass correlation coefficient (ICC) for continuous labels instead of RMSE and the absolute error of the average standard deviations? In continuous/ordinal subjective labelling tasks, ICC reliability is one of the gold standards that define the labelling quality or difficulty of the task.

Questions

Overall, I liked the human annotator simulation approach to tackle learning domains that necessitate subjective labelling. Please see my comments in the weaknesses section.

Details of Ethics Concerns

No ethics review needed. The proposed paper, in contrast, aims to model human annotation processes and mitigate existing biases.

Comment

We appreciate Reviewer js5b's positive feedback and insightful comments. We address some general concerns in the general response and respond to your specific comments below:

(1) In the experiments, test performances are reported; however, considering the problem as "simulating subjective human annotations", I would expect to see (i) training to learn the latent model on a labeled training subset, (ii) generating simulated labels on a held-out training subset, and (iii) training a classifier only with these simulated labels and evaluating it on the test set. Instead, the current evaluation is more like a supervised evaluation.

The purpose of the HAS system is to simulate annotations that match the distribution of human annotations. Since the underlying annotation distribution is unknown, it’s not easy to directly assess the distribution matching performance by a supervised-learning downstream task. Therefore, we adopted various metrics in the paper to evaluate the distribution estimation performance, as described in Section 4.

As a concrete example, evaluating the performance of HAS with a classifier, which usually requires the training data to have a single ground truth (i.e. majority vote), is not suitable for our modelling purpose, because there might not exist a single ground-truth label in human evaluation tasks.

(2) I found the problem addressed highly similar to the following papers [1,2], which aim to model the label space iteratively using Gibbs sampling (or the earlier line of work that used MCMC). These works may require the annotation process in a more dynamic setting, but they are still very relevant to the subjective labelling task and I think they need to be discussed.

Thanks for pointing out these related works. We agree that delving into the differences between these tasks and our method is enlightening.

Paper [2] developed an MCMC method for sampling from a particular subjective probability distribution, which allows a person to act as an element of an MCMC algorithm by choosing whether to accept or reject a proposed change to an object. Paper [1] generalises paper [2] to a continuous-sampling paradigm. Despite both modelling subjective distributions, our method differs from these methods in the following two aspects.

Firstly, these methods require an online annotation process where the annotators need to label data points based on the current state of the Markov chain, whereas in our setting the annotations can be obtained offline.

Secondly, in those traditional density estimation frameworks, distributions are estimated based on observations, while our meta-learning method does not require observations to be provided. Extending the notation in the paper, we denote an event as $d_i$, which consists of a descriptor (i.e. an utterance) $x_i$ and $M_i$ associated observations (i.e. human annotations) $D_i=\{\eta_i^{(m)}\}_{m=1}^{M_i}$. For a test event $d^*$, the test descriptor and observations are denoted as $x^*$ and $D^*$. The target to estimate is the distribution of $D^*$, denoted as $p^*$.

Given an event of interest $d^*$, MCMC-based methods present the descriptor $x^*$ to human participants who are asked to provide a sequence of decisions $D^*$ following a Markov chain Monte Carlo acceptance rule. The distribution $p^*$ is then estimated based on $D^*$. In other words, observations $D^*$ are necessary in order to estimate each subjective probability distribution. Each Markov chain only targets a specific subjective probability distribution and there is no obvious way to transfer information between different Markov chains. Therefore, these methods cannot be applied to simulate the annotation distribution for unlabelled data.

An advantage of our method is that only the descriptor $x^*$ is needed to simulate the distribution of event $d^*$; no $D^*$ is needed, and such annotations are often unavailable in real-world settings. As described at the beginning of Section 3.1, the proposed approach meta-learns a conditional density estimator across $D_{\text{meta}}=\{D_i\}_{i=1}^N$.

In other words, given $\{x_i, D_i\}_{i=1}^N$, the model is trained to learn how to learn the underlying distribution of $D_i$ given $x_i$. During testing, given the test descriptor $x^*$, the model directly estimates $p^*$ in a zero-shot fashion.
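
A minimal sketch of this training and testing loop is given below. It is illustrative only: `cond_density` is a hypothetical stand-in for the conditional flow in the paper, assumed to expose `log_prob` and `sample` methods.

```python
import torch

def meta_train_step(cond_density, optimizer, batch):
    """One meta-training step over events {(x_i, D_i)}_{i=1}^N.

    cond_density.log_prob(D_i, x_i) returns log-densities of the M_i human
    annotations D_i conditioned on the descriptor x_i (hypothetical API).
    """
    optimizer.zero_grad()
    # Maximise the likelihood of *all* annotations of each event; there is
    # no single ground-truth label to regress onto.
    loss = -torch.stack(
        [cond_density.log_prob(D_i, x_i).mean() for x_i, D_i in batch]
    ).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

def simulate(cond_density, x_star, num_annotators=10):
    # Zero-shot at test time: only the descriptor x* is needed, no D*.
    with torch.no_grad():
        return cond_density.sample(x_star, num_samples=num_annotators)
```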

We’ve added this discussion in Appendix M in the revised manuscript.

The rest of your comments will be addressed in Response (2).

Comment

Continued from Response (1).

(3) How does the proposed approach tackle highly imbalanced data domains? For instance, in one of the tasks, the MSP-Podcast dataset contains angry, sad, happy, neutral, and other. When continuous labels (valence-arousal annotations of the same dataset) are used, the skewed label distribution will be more visible. I suspect the proposed method will show higher uncertainty (inter-rater disagreement) in less explored parts of the label parameter space.

Valence-arousal annotation of the MSP-Podcast dataset uses a 7-point Likert scale, which is similar to the speech quality assessment experiments in Section 5.3 that use a 5-point Likert scale. I-CNF is proposed in the paper to handle tasks that involve ordinal annotations. We've now added further experiments on emotion attribute prediction on MSP-Podcast in Appendix K. Please also refer to the general response for more details.

(4) Different performance metrics are used. In particular, Fleiss' kappa for the reliability of categorical labels is good. However, why did you not use the intraclass correlation coefficient (ICC) for continuous labels instead of RMSE and the absolute error of the average standard deviations? In continuous/ordinal subjective labelling tasks, ICC reliability is one of the gold standards that define the labelling quality or difficulty of the task.

Thanks for suggesting the use of the intraclass correlation coefficient (ICC). We've now included the absolute error between the ICC(1,k) of human annotations and that of simulated annotations as an additional metric in Table 3 and Section 4 of the revised manuscript. Results are also listed in the table below. The ICC of human labels is 0.468 for the SOMOS dataset and 0.702 for the MSP-Podcast emotion attributes.

| $\mathcal{E}(\text{ICC})$ | Speech quality assessment on SOMOS | Emotion attribute annotation on MSP-Podcast |
|---|---|---|
| GP | 0.433 ± 0.000 | 0.169 ± 0.000 |
| EDL | 0.107 ± 0.029 | 0.172 ± 0.017 |
| MCDP | 0.495 ± 0.010 | 0.087 ± 0.014 |
| Ensemble | 0.136 ± 0.028 | 0.057 ± 0.003 |
| BBB | 0.480 ± 0.003 | 0.241 ± 0.003 |
| CVAE | 0.214 ± 0.028 | 0.192 ± 0.003 |
| I-CNF | 0.079 ± 0.015 | 0.032 ± 0.012 |
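
For reference, ICC(1,k) can be computed from a one-way ANOVA decomposition of the item-by-rater annotation matrix. The sketch below follows the standard Shrout-Fleiss definition and is illustrative rather than our exact evaluation code.

```python
import numpy as np

def icc_1k(ratings):
    """ICC(1,k): one-way random effects, average of k raters (Shrout & Fleiss).

    ratings: array of shape [n_items, k_raters].
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    item_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * np.sum((item_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - item_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between

# E(ICC) in the table above: |icc_1k(human_ratings) - icc_1k(simulated_ratings)|.
```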

We hope our response fully resolves your concern.

Review (Rating: 5)

The paper studies the problem of human annotator simulation (HAS) as a density estimation problem, where the marginal distribution of labels that a group of annotators would generate for a given sample is learnt.

Strengths

  • The authors argue that existing methods take a majority vote among multiple annotators. This is not representative in the case of majority bias, which is instead well captured by treating the problem as density estimation. Figures 1 and 2 clearly show that the predictions made by the proposed CNF algorithm lie on the sample means and have enough variability to capture the diversity in human annotations.

Weaknesses

[1] The whole motivation of the paper is that it should be possible to capture the data bias (when the majority of labelled samples are wrong), and correct for it in some way. However, in all of the tables, the ensemble (Tables 1, 2) / Gaussian process (Table 3) seems to be performing better than the proposed CNF method. While it is qualitatively visible that CNF is learning a better density distribution, the quantitative numbers don't justify the intuition. Perhaps the authors would be well served by explicitly identifying the samples which possess such labelling bias, correcting for it (by forcing the distribution to skew), and showing better performance than all other methods. Right now, I see such a form of analysis as lacking.

[2] What is the motivation behind the latent-diffusion model? In my understanding, the marginal p(y|z) already encodes information on different annotators z_1, z_2, ..., z_M sampled over z. What does the additional variable v inject into the model conceptually (apart from additional representational capacity)?

[3] What is the reason for the separate analysis of ordinal and categorical variables? I understand that ordinal categories are ordered, and that probability estimation then reduces to summing over the continuous space of the latent variable v. However, I haven't seen standard classification setups explicitly enforce such ordering constraints. Also, are there any cases where annotations are continuous (e.g., annotation of speech, etc.), which could be used to explore continuous cases? That could act as an interesting toy experiment.

[4] Possible extensions to cases where only a single annotation is available for each sample: how does this work compare to works which aim to filter noisy labels from the network? Perhaps this could be mentioned in the related work.

I like the fresh perspective of learning a distribution over labels, and how such a system could act as an auto-labeller. However, the motivation does not reflect better performance on a real-world metric (i.e. accuracy). Right now it feels like a fancy technique (i.e. density estimation) whose PRACTICAL real-world experiments I cannot see. Also, I don't see any experiments on zero-shot learning, which is the main title of the paper.

Finally, this paper seems to present a chicken-and-egg problem. Most of the datasets in the real world contain labels from only one annotator, which might be biased. However, this paper does not work on them. Instead, it requires datasets where each sample has been annotated by many people. But that might not be how labelling happens in the real world. So, perhaps the authors could discuss how to extend their method to single-label cases where distributions could not be learnt.

However, this work shows promise, and I would recommend that the authors improve the paper by addressing the above concerns and consider a future resubmission.


POST REBUTTAL

The authors were able to address most of the comments. Three main issues are still open:

[1] The seemingly unexplainable negative correlation between RMSE and accuracy.

-> Standard intuition suggests that lower RMSE should lead to higher accuracy in classification setups. If one accepts that there are certain annotators (i.e. groups of people) who are collectively biased, and that a sample might be incorrectly labelled, then it should be possible to 'correct' for it. One way could perhaps be re-estimating the correct label, retraining the said classifier, and SHOWING better RMSE. I very much appreciate the authors' efforts to add a new metric, ICC, to the setup. However, an actual increase in accuracy (which is a well-accepted metric) would have been more convincing and shown applicability in a real-world setup. Note that I am not asking for additional experiments on datasets which have only one label per sample, but only performance improvements in the context of the experiments the authors have already performed.

[2] Zero shot density estimation.

-> Standard learning assumes that the neural net fits a density (which does not change) after the machine has learnt. During inference, a sample (from the same or a different distribution) is fed to the model and evaluated. The machine can't adapt its learnt density, since that is encoded in the weights, which remain static.
-> The authors clarify zero-shot as the ability to predict the density of annotator responses p*, from a single phrase (x*).
-> This might work if the annotators (say A1) labelled a particular sample x1 (which was SEEN during training), and the network is asked to predict A1's beliefs for a new sample x* (which has not been seen). However, the problem then does not remain zero-shot since A1 was already seen by the network.
-> If we accept (free will), i.e. that one's personal beliefs are independent of statistical treatments of how other people respond, then merely sampling from the learnt distribution of annotator responses to 'simulate' how a prospective unseen annotator would respond might not work.

[3] Dynamic density adaptation

The idea of giving networks the ability to adapt their densities dynamically to whether a given sample is OOD seems promising. The only issue is that personal beliefs can't be given a density treatment. If so, then it doesn't remain zero-shot.

Overall, this is an interesting work; along the above three points, it is not clear in what context it will be useful if it cannot be used to improve performance.

Questions

  • Relevance to the title of the paper.
    • The paper is titled zero-shot density estimation; by zero-shot I understand that a sample which is different from the original dataset on which the estimator was fit could be used. Given a set of ground-truth classes, the network should dynamically predict the distribution of labels over annotators. However, I see no such experiments, with training/inference being done over the SAME dataset of speech, toxicity, and emotion annotations.

Comment

We appreciate Reviewer szvo's thoughtful comments. We address some general concerns in the general response and respond to your specific comments below:

First, we'd like to clarify that the motivation of the paper is to model the label distribution rather than to correct labelling bias. Human perception and interpretation are subjective. There may not always exist a single "ground truth" for tasks that involve human evaluation (e.g., how expressive is the synthesised speech? What is the correct score for an ICLR paper review?). In those tasks, multiple human annotators are commonly employed to label each sample (e.g., each ICLR submission is assigned at least three reviewers). The provided labels can differ. The difference ("bias") reflects different human perceptions of the same event. No particular perception is correct or wrong, and the difference is valuable. As stated in Section 2.1, we propose modelling annotators' subjective interpretations rather than seeking to reduce the variability in annotations by enforcing a single correct answer.

(1) The whole motivation of the paper is that it should be possible to capture the data bias (when the majority of labelled samples are wrong), and correct for it in some way. However, in all of the tables, the ensemble (Tables 1, 2) / Gaussian process (Table 3) seems to be performing better than the proposed CNF method. While it is qualitatively visible that CNF is learning a better density distribution, the quantitative numbers don't justify the intuition. Perhaps the authors would be well served by explicitly identifying the samples which possess such labelling bias, correcting for it (by forcing the distribution to skew), and showing better performance than all other methods.

As mentioned in the clarification above, there may not exist a single correct answer for such subjective tasks. We thus frame the task as label distribution estimation. Therefore, accuracy/RMSE is no longer the proper metric for the human annotation simulation (HAS) task, as these measures only assess the majority/mean prediction. In order to better assess the model's ability to model the label distribution, we adopted multiple metrics as discussed in Section 4. The quantitative numbers in Tables 1-3 indeed justify that the proposed method outperforms all baselines in label distribution estimation. It produces the smallest $\text{NLL}^\text{all}$, indicating better distribution matching, and the smallest $\text{RMSE}^s$, $\mathcal{E}(\bar{s})$, and $\mathcal{E}(\hat{\kappa})$, indicating a superior simulation of inter-annotator variability. Besides, we have added an additional metric, ICC, following Reviewer js5b's suggestion, for which our method also outperforms all baselines.

(2) What is the motivation behind the latent-diffusion model? In my understanding, the marginal p(y|z) already encodes information on different annotators z_1, z_2, ..., z_M sampled over z. What does the additional variable v inject into the model conceptually (apart from additional representational capacity)?

Regarding variable $v$: this is because the flow-based model is an invertible transformation, as mentioned in the paragraph above Eqn. (2), which cannot directly transform the continuous latent variable $z$ into the discrete output $y$. Therefore, $z$ is first transformed into $v$ (continuous) with a flow and then discretised into $y$ (discrete).
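
A toy sketch of this two-stage generative path for an ordinal label is shown below. The round-and-clip step stands in for $p(y|v)$ purely for illustration; the conditional integer/softmax flows in the paper define this step more carefully, and `flow` is a hypothetical invertible conditional transform.

```python
import torch

def sample_ordinal_annotation(flow, x, num_classes=5):
    """Illustrative generative path z -> v -> y for a Likert-scale label."""
    z = torch.randn(1)          # continuous latent capturing annotator variability
    v = flow(z, x)              # invertible conditional flow keeps v continuous
    # Discretisation p(y | v): simple round-and-clip onto a 1..num_classes scale.
    y = int(torch.clamp(torch.round(v), 1, num_classes).item())
    return y
```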

(3) What is the reason for the separate analysis of ordinal and categorical variables? I understand that ordinal categories are ordered, and that probability estimation then reduces to summing over the continuous space of the latent variable v. However, I haven't seen standard classification setups explicitly enforce such ordering constraints. Also, are there any cases where annotations are continuous (e.g., annotation of speech, etc.), which could be used to explore continuous cases? That could act as an interesting toy experiment.

Regarding the separate analysis of ordinal and categorical variables: you are correct that standard classification setups cannot explicitly enforce ordering constraints. That is why we conduct a separate analysis of ordinal and categorical variables.

Ordinal rating schemes (e.g., the 5-point Likert scale) are commonly employed in human evaluation tasks (e.g., the speech quality assessment task in the paper, or ICLR review). Directly applying standard classification setups to ordinal annotations would result in poor performance due to model mis-specification. Therefore, we propose I-CNF to handle such cases separately.

Continuous annotations can be handled by the identity transformation $p(y|v)=\delta(y-v)$, as discussed at the bottom of page 3. In addition, we'd like to point out that continuous annotations are less common in human evaluation tasks (e.g., ICLR reviewers would not rate papers with a score of 6.12).

The rest of your comments will be addressed in Response (2).

Comment

Continued from Response (1).

(4) Possible extensions to cases where only a single annotation is available for each sample: how does this work compare to works which aim to filter noisy labels from the network? Perhaps this could be mentioned in the related work.

Thanks for pointing out the task of filtering noisy labels, which is related but different. It is insightful to discuss the differences between these two tasks.

Both tasks involve inconsistent labels, but the source of the inconsistency is different. When filtering noisy labels, it is assumed that there is a ground truth and we want to remove misleading labels. For HAS, the inconsistency stems from the subjective perception of humans. No particular label is incorrect and there is no single "ground truth". As discussed in the clarification above, the difference reflects different human interpretations of the same event, which is valuable. Instead of enforcing a ground truth, we propose to model such variability. We've added this discussion in Appendix M.1 of the revised manuscript.

However, the motivation does not reflect better performance on a real-world metric (i.e. accuracy).

Please see our response to your comment (1).

Finally, this paper seems to present a chicken-and-egg problem. Most of the datasets in the real world contain labels from only one annotator, which might be biased. However, this paper does not work on them. Instead, it requires datasets where each sample has been annotated by many people. But that might not be how labelling happens in the real world. So, perhaps the authors could discuss how to extend their method to single-label cases where distributions could not be learnt.

It's common for real-world human evaluation tasks to recruit multiple annotators to evaluate each input (e.g., each ICLR submission is assigned to at least three reviewers). In our paper, the proposed method is evaluated on three representative real-world datasets. Since different people perceive emotion differently, most emotion datasets employ multiple annotators to label each utterance (e.g., [1-5]). In addition, the Mean Opinion Score (MOS) is the standard measure to assess the quality of synthesised speech, where each sample is assessed by a group of human annotators (e.g., [6-10]).

Your comments regarding zero-shot learning will be addressed in Response (3).

[1] Busso, Carlos, et al. "IEMOCAP: Interactive emotional dyadic motion capture database." Language resources and evaluation 42 (2008): 335-359.

[2] Cao, Houwei, et al. "Crema-d: Crowd-sourced emotional multimodal actors dataset." IEEE transactions on affective computing 5.4 (2014): 377-390.

[3] Busso, Carlos, et al. "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception." IEEE Transactions on Affective Computing 8.1 (2016): 67-80.

[4] Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471-483, October-December 2019

[5] Martinez-Lucas, Luz, Mohammed Abdelwahab, and Carlos Busso. "The MSP-conversation corpus." Interspeech 2020 (2020).

[6] Ren, Yi, et al. "Fastspeech 2: Fast and high-quality end-to-end text to speech." arXiv preprint arXiv:2006.04558 (2020).

[7] Oord, Aaron van den, et al. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).

[8] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).

[9] Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.

[10] Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "Waveglow: A flow-based generative network for speech synthesis." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.


Comment

Continued from Response (2).

The paper is titled zero-shot density estimation; by zero-shot I understand that a sample which is different from the original dataset on which the estimator was fit could be used. Given a set of ground-truth classes, the network should dynamically predict the distribution of labels over annotators. However, I see no such experiments, with training/inference being done over the SAME dataset of speech, toxicity, and emotion annotations.

First, we’d like to clarify that the proposed distribution estimation framework is different from standard supervised learning tasks since no “ground truth” is available for training. The objective is to learn the underlying distribution given observations (annotations) while the true distribution is unknown. By “zero-shot density estimation”, we mean that our meta-learned human annotation simulator can be used to estimate the distribution for a given event (e.g., input utterance) without requiring observations (i.e., annotations).

Extending the notation in the paper, we denote an event as $d_i$, which consists of a descriptor (i.e. an utterance) $x_i$ and $M_i$ associated observations (i.e. human annotations) $D_i=\{\eta_i^{(m)}\}_{m=1}^{M_i}$. For a test event $d^*$, the test descriptor and observations are denoted as $x^*$ and $D^*$. The target to estimate is the distribution of $D^*$, denoted as $p^*$.

The first type of approach mentioned in Section 2.2 hand-crafts proxy variables $h_i$ based on each $D_i$, treats the proxy variables as the ground truth, and learns the proxy in a supervised way with paired $\{x_i, h_i\}$. During testing, given a descriptor $x^*$, it outputs the prediction of the proxy $h^*$, which may not capture the underlying distribution $p^*$. That is why supervised learning is not suitable for such tasks.

The traditional methods to learn subjective distributions (i.e. Gibbs sampling with people [1] and MCMC with people [2], mentioned by Reviewer js5b) require human annotators to be involved in the process in a dynamic setting. Given an event of interest $d^*$, these methods present the descriptor $x^*$ to human participants who are asked to provide a sequence of decisions $D^*$ following a Markov chain Monte Carlo acceptance rule. The distribution $p^*$ is then estimated based on $D^*$. In other words, observations $D^*$ are necessary in order to estimate each subjective probability distribution, and there is no obvious way to transfer information between different Markov chains. Therefore, these methods cannot be applied to simulate the annotation distribution for unlabelled data.

An advantage of our method is that only the descriptor $x^*$ is needed to simulate the distribution of event $d^*$; no $D^*$ is needed, since $D^*$ is often unavailable in real-world settings. That is the meaning of "zero-shot" in this density estimation framework. Each event is framed as a dataset in the proposed meta-learning framework. The proposed approach meta-learns a conditional density estimator across all datasets $D_{\text{meta}}=\{D_i\}_{i=1}^N$.

It leverages knowledge about the agreements and disagreements among different human annotators across different examples to estimate the label distribution of each input, rather than designing the proxy solely based on $D_i$. In other words, given $\{x_i, D_i\}_{i=1}^N$, we train the model to learn how to learn the underlying distribution of $D_i$ given $x_i$. During testing, given the test descriptor $x^*$, the model estimates $p^*$, which can be easily sampled from.

Thanks for bringing attention to this potentially confusing point. We've added this clarification in Appendix M.3 of the revised manuscript.

[1] Harrison, P., Marjieh, R., Adolfi, F., van Rijn, P., Anglada-Tort, M., Tchernichovski, O., ... & Jacoby, N. (2020). Gibbs sampling with people. Advances in neural information processing systems, 33, 10659-10671.

[2] Sanborn, A., & Griffiths, T. (2007). Markov chain Monte Carlo with people. Advances in neural information processing systems, 20.

We hope our response fully resolves your concerns.

Comment

We appreciate the detailed comments and constructive feedback from the reviewers. We respond to the common questions below and provide individual responses to each reviewer for their specific comments. We hope our response fully resolves your concerns.

[Motivation]

The motivation of the human annotator simulation (HAS) system is to simulate annotations that match the distribution of human annotations. Due to the subjectivity of human perception and interpretation, a single correct answer may not exist for tasks that involve human evaluation. Taking the ICLR reviewing process as an example, each ICLR submission is assessed by multiple reviewers and reviewers' opinions can diverge. A definitive correct rating for an ICLR paper might not be ascertainable. We believe such annotator variability is valuable for the reviewing process, and our task is to model the rating distribution for each submission rather than just focusing on predicting the mean/majority rating or the "correct" rating. This paper evaluates the proposed method on three real-world human evaluation tasks with analogous task settings to the ICLR review example above, including emotion perception and toxic content interpretation for categorical annotations and speech quality assessment for ordinal annotations.

[An additional experiment on emotion attribute annotation]

We've added an additional experiment on emotion attribute prediction on the MSP-Podcast dataset. Apart from categorical labels such as "happy", "sad", and "angry", an emotional state can also be defined by dimensional emotion attributes. The commonly used emotion attributes include valence (negative vs positive), arousal (calm vs excited), and dominance (weak vs dominant). In the MSP-Podcast dataset, annotators label the attributes on a 7-point Likert scale. I-CNF is used to model the label distribution and outperforms the baselines in modelling the annotation distribution, as shown in the table below.

| Method | $\text{RMSE}^{\bar{y}}$ | $\text{NLL}^\text{all}$ | $\text{RMSE}^s$ | $\mathcal{E}(\bar{s})$ |
|---|---|---|---|---|
| GP | 0.667 ± 0.000 | 2.928 ± 0.000 | 0.408 ± 0.000 | 0.415 ± 0.000 |
| EDL | 0.755 ± 0.002 | 1.911 ± 0.005 | 0.465 ± 0.039 | 0.504 ± 0.037 |
| MCDP | 0.887 ± 0.007 | 5.545 ± 0.026 | 0.610 ± 0.005 | 0.474 ± 0.006 |
| Ensemble | 0.923 ± 0.017 | 6.280 ± 0.084 | 0.836 ± 0.017 | 0.718 ± 0.019 |
| BBB | 0.720 ± 0.014 | 5.332 ± 0.034 | 0.643 ± 0.001 | 0.516 ± 0.001 |
| CVAE | 0.704 ± 0.004 | 4.906 ± 0.005 | 0.502 ± 0.003 | 0.324 ± 0.003 |
| I-CNF | 0.665 ± 0.006 | 1.707 ± 0.030 | 0.296 ± 0.019 | 0.132 ± 0.002 |

We’ve added the experiment results in Appendix K of the revised manuscript.

[Regarding complexity and computational time cost]

The training and inference time of all methods compared in the paper for all four tasks studied are shown in the tables in general response (2). Denote $M$ as the number of annotations to be simulated. The ensemble model with $M$ members involves training and testing $M$ individual models, which costs $M\times$ training time and $M\times$ inference time. MCDP and BBB require $M$ forward passes during inference to generate $M$ samples and therefore cost $M\times$ inference time. All other methods require a single forward pass. In contrast to neural-network-based methods of complexity $O(n^2)$, the training and inference of the GP involve matrix inversion of complexity $O(n^3)$. We've added this discussion in Appendix L of the revised manuscript.
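
A simplified sketch of where the $M\times$ factors come from is given below (illustrative only, with hypothetical model objects; the measured timings are in the tables of general response (2)).

```python
def simulate_m_annotations(method, model, x, M=100):
    """Illustrative inference cost for simulating M annotations of one input."""
    if method == "ensemble":
        # M separately trained members: M x training cost and M x inference cost.
        return [member(x) for member in model[:M]]
    if method in ("mcdp", "bbb"):
        # One model, but M stochastic forward passes (dropout masks / weight samples).
        return [model(x) for _ in range(M)]
    # Flow-based simulators: one conditioning pass, then M cheap samples.
    return model.sample(x, num_samples=M)
```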

We’ve revised the manuscript with the following changes:

  1. Added the experiment on emotion attribute annotation in Appendix K, requested by Reviewer js5b.
  2. Added the computational time cost in Appendix L, requested by Reviewer 2YK4 and Reviewer Pupg.
  3. Added ICC to Table 3 and Section 4, requested by Reviewer js5b.
  4. Added detailed explanation of the zero-shot density estimation framework and comparison with related tasks in Appendix M, requested by Reviewer szvo and Reviewer js5b.

The revised manuscript has been uploaded and the changes are coloured in blue.

Comment

Tables for computational time cost

Due to training complexity, the number of annotations to be simulated, $M$, is set to 10 for the ensemble and 100 for all other methods in the following tables.

Table (i) Emotion category annotation

| Method | Training (sec) | Inference (sec) |
|---|---|---|
| MCDP | (7.20 ± 0.10)E+03 | (1.82 ± 0.01)E+04 |
| Ensemble | (1.46 ± 0.00)E+05 | (1.67 ± 0.01)E+03 |
| BBB | (7.55 ± 0.01)E+03 | (1.79 ± 0.01)E+04 |
| DPN | (6.80 ± 0.01)E+03 | (2.90 ± 0.01)E+02 |
| A-CNF | (7.04 ± 0.02)E+03 | (2.31 ± 0.07)E+02 |
| S-CNF | (6.99 ± 0.00)E+03 | (2.12 ± 0.02)E+02 |

Table (ii) Toxic speech detection

| Method | Training (sec) | Inference (sec) |
|---|---|---|
| MCDP | (2.42 ± 0.02)E+02 | (5.99 ± 0.02)E+02 |
| Ensemble | (2.39 ± 0.01)E+03 | (4.00 ± 0.04)E+01 |
| BBB | (3.22 ± 0.01)E+02 | (5.79 ± 0.01)E+02 |
| DPN | (1.92 ± 0.01)E+02 | (2.67 ± 0.02)E+01 |
| A-CNF | (3.14 ± 0.04)E+02 | (1.40 ± 0.11)E+01 |
| S-CNF | (2.63 ± 0.02)E+02 | (1.37 ± 0.09)E+01 |

Table (iii) Speech quality assessment

| Method | Training (sec) | Inference (sec) |
|---|---|---|
| GP | (3.88 ± 0.01)E+03 | (6.27 ± 0.07)E+01 |
| EDL | (2.92 ± 0.01)E+03 | (5.17 ± 0.19)E+01 |
| MCDP | (1.37 ± 0.32)E+03 | (3.64 ± 1.29)E+03 |
| Ensemble | (1.39 ± 0.00)E+04 | (5.10 ± 0.05)E+02 |
| BBB | (1.51 ± 0.00)E+03 | (5.33 ± 0.02)E+03 |
| CVAE | (1.41 ± 0.00)E+03 | (5.27 ± 0.06)E+01 |
| I-CNF | (1.34 ± 0.07)E+03 | (5.10 ± 0.02)E+01 |

Table (iv) Emotion attribute annotation

| Method | Training (sec) | Inference (sec) |
|---|---|---|
| GP | (1.00 ± 0.00)E+04 | (2.61 ± 0.03)E+02 |
| EDL | (7.69 ± 0.03)E+03 | (1.91 ± 0.00)E+02 |
| MCDP | (3.89 ± 0.01)E+03 | (1.76 ± 0.03)E+04 |
| Ensemble | (3.91 ± 0.01)E+04 | (1.67 ± 0.00)E+03 |
| BBB | (4.25 ± 0.01)E+03 | (1.81 ± 0.01)E+04 |
| CVAE | (4.13 ± 0.00)E+03 | (2.26 ± 0.05)E+02 |
| I-CNF | (3.98 ± 0.08)E+03 | (1.76 ± 0.00)E+02 |

Comment

We thank all reviewers again for their valuable comments on our paper. We appreciate that all reviewers highlighted our fresh perspective of modelling subjective annotation distributions via zero-shot density estimation under a novel meta-learning framework. Below, we summarise our responses to the main concerns raised by the reviewers.

  1. Regarding the potential misunderstanding about the motivation and problem setting of Human Annotation Simulation (HAS):
  • We clarified the difference between HAS and noisy label filtering.
  • We clarified the difference between HAS (which is a distribution estimation task since there may not be a single ground-truth label) and standard supervised learning methods.
  • We provided a detailed explanation of zero-shot density estimation.
  • We clarified the evaluation metrics for distribution estimation problems.
  2. Regarding the valence-arousal annotation of the MSP-Podcast dataset, we added an additional experiment on emotion attribute annotation, as requested by Reviewer js5b.
  3. Regarding the metric for ordinal label simulation, we added an additional metric (ICC), as requested by Reviewer js5b.
  4. Regarding computational complexity, we added detailed training and inference time costs for all methods on all four tasks, as requested by Reviewers 2YK4 and Pupg.
  5. Regarding extra baselines requested by Reviewers 2YK4 and Pupg, we explained in our detailed responses to them that HAS is a new task proposed in our paper for which no directly applicable prior method is available. All baselines in our paper are adapted from suitable existing methods from other domains, including recent works published in 2022 and 2023. We also explained that existing meta-learning methods focus on meta-learning (single-output) predictors, which is not suitable for HAS. We asked the reviewers to provide a specific baseline for comparison but have not heard from them by the end of the author response period.

We have responded to common questions in our general responses (1)-(2) and addressed each reviewer's concerns separately below their respective review. We have updated the manuscript with changes highlighted in blue. We hope that this has sufficiently addressed all reviewers’ concerns.

AC Meta-Review

This paper presents an approach to simulate the uncertainty in human annotations. Since there may be multiple correct answers for a single task, the authors propose a latent variable-based algorithm that models multiple annotations per sample, and then provides new annotations for new samples. The authors handle both ordinal and categorical annotation tasks. The authors show that their method outperforms baselines such as majority voting, ensembling, and conditional VAE. While the paper and follow-ups show a good experimental setup, it is unclear how this approach scales to large datasets or structured labeling tasks (segment annotation, visual segmentation). This raises concerns about the practical value of this work (shared by the reviewers). After careful deliberation, the AC and SAC concur that this work is not yet ready for publication. Revising it to include a broader range of tasks will strongly improve the submission quality.

Why Not a Higher Score

The proposed approach's scalability to large datasets or structured labeling isn't studied, which lowers its overall impact. Structured labeling tasks such as segmentation of audio/images/video display high amounts of variability across annotators.

Why Not a Lower Score

The proposed approach is theoretically and experimentally sound. It does solve an interesting problem of addressing a multitude of correct annotations for a single sample.

Final Decision

Reject