Does Collaborative Human–LM Dialogue Generation Help Information Extraction from Human–Human Dialogues?
The study introduces a human-in-the-loop dialogue generation framework that significantly improves task performance by synthesizing realistic call center dialogues, demonstrating the value of human-LM collaboration in data generation and annotation.
Abstract
Reviews and Discussion
The contributions on pg 2 are an accurate summary: (1) the paper describes DialGen, a collaborative human-LM framework for generating synthetic dialogues in domains where privacy constraints prevent release of data; (2) the DialGen-AIC dataset for real-world auto insurance call center data; (3) an entity-based scoring methodology; (4) results showing improvements from human collaboration over a fully automatic LM-based approach.
Reasons to Accept
An interesting approach to human-LM collaboration for dialogue generation. Including humans in the loop yields lengthier and more complex dialogues than LM-only approaches, suggesting reasonable ways to include humans in dialogue generation as well as possible opportunities for improvement in LM-only approaches.
Reasons to Reject
No strong reasons to reject, except that the paper may be of interest to a relatively narrow audience. The general objectives are likely to be broadly interesting ('a human-LM collaborative framework for dialogue generation (DIALGEN) that leverages the scalability and creativity of generative models, yet retains controllability through humans'), but the specifics ('extracting information from auto insurance call center dialogues, using public synthetic data to improve performance on a private dataset') are fairly narrow, although still interesting for dialogue researchers.
Questions for Authors
This is a minor point, but I'm curious about the motivation to adopt IE metrics in place of DST metrics. In Section 3.3, there is the assertion 'Directly adopting DST metrics for dialogue-based IE is not ideal, because they overemphasize earlier parts of a conversation and do not disentangle the effects of error propagation across turns.' Is a similar effect present in the F1 scores in Table 2? It seems that IE metrics are also higher early in the dialogue.
Results (e.g. Table 4) show that data produced by DialGen w/ human annotators can be used to improve T5 models. What about uses of the data to improve more powerful models or models that can't practically be fine-tuned? These are likely to be of interest.
We thank the reviewer for finding our human-LM collaborative framework interesting.
We proposed a new IE metric because the DST metrics, joint goal accuracy (JGA) and slot accuracy, cannot adequately reflect model performance on our task, for the reasons below.
First, JGA is used in DST because DST tasks issue database queries, so accurate prediction of the full state is needed; queries are issued after most turns, so evaluating average performance over all turns makes sense. In contrast, our IE task accumulates (and corrects) information for a final summary. Our goal in evaluating car accident claims is to assess the overall extraction quality, and minor errors (e.g., an additional word in a non-categorical slot) should not significantly impact the readability of the extracted information. Therefore, we chose the more flexible F1 metrics.
Second, both slot accuracy and our IE metric exhibit the same trend of lower scores at later turns. This is because later turns are harder to predict, as more information has accumulated in the dialogue. Slot accuracy is computed at each turn and defined as:
Slot Accuracy = (T - M - W) / T
where T is the number of all predefined slots, M is the number of missed slots that the model does not predict among the gold slots, and W is the number of wrongly predicted slots that do not exist in the gold slots.
Early turns usually contain only a few gold slots, resulting in high scores for every model and making the differences between models trivial. In contrast, our IE metric evaluates at specific dialogue points (quarter, half, three-quarters, end), mitigating the misevaluation of models at earlier turns.
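To make the contrast concrete, below is a minimal Python sketch (not the authors' released code) of the two evaluation styles: per-turn slot accuracy averaged over every turn versus F1 computed only at fixed dialogue checkpoints. The flat slot-value F1 here is a simplification of the paper's entity-centric CB/TLB metrics, and the per-turn data layout is an assumption made for illustration.

# Illustrative sketch: per-turn slot accuracy vs. checkpoint-based F1.
# Assumed data layout: one dict {slot: value} per turn holding the cumulative
# state annotated/predicted up to that turn (a simplification of CB/TLB).

def slot_accuracy(gold, pred, total_slots):
    """(T - M - W) / T for a single turn."""
    missed = sum(1 for s in gold if s not in pred)   # M: gold slots the model misses
    wrong = sum(1 for s in pred if s not in gold)    # W: predicted slots not in gold
    return (total_slots - missed - wrong) / total_slots

def pair_f1(gold, pred):
    """Micro F1 over slot-value pairs at one dialogue point (simplified)."""
    gold_pairs, pred_pairs = set(gold.items()), set(pred.items())
    if not gold_pairs or not pred_pairs:
        return 0.0
    tp = len(gold_pairs & pred_pairs)
    p, r = tp / len(pred_pairs), tp / len(gold_pairs)
    return 2 * p * r / (p + r) if p + r else 0.0

def dst_style_score(gold_turns, pred_turns, total_slots):
    # Averaging over every turn lets the many early, nearly empty turns inflate the score.
    scores = [slot_accuracy(g, p, total_slots) for g, p in zip(gold_turns, pred_turns)]
    return sum(scores) / len(scores)

def checkpoint_f1(gold_turns, pred_turns):
    # Evaluate only at the quarter, half, three-quarter, and final points of the dialogue.
    n = len(gold_turns)
    idx = {q: max(0, min(n - 1, round(n * q) - 1)) for q in (0.25, 0.5, 0.75, 1.0)}
    return {q: pair_f1(gold_turns[i], pred_turns[i]) for q, i in idx.items()}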
Regarding the use of more powerful models, we chose finetuned T5 models for two reasons: (1) previous work [1] shows that finetuned T5 outperforms large decoder-only models in the finetuned domain, and (2) our resources for finetuning on private data are limited, preventing the use of larger models.
Moreover, in this work we ran in-context learning experiments (IC-DST w/ ChatGPT) only on the synthesized data due to privacy concerns. We find that IC-DST performs up to 12% worse than a finetuned T5, as shown in Table 2. However, we acknowledge that exploring recent open models (e.g., LLaMA3) on the private data could be interesting future work if resources permit.
[1] E. Lehman, E. Hernandez, D. Mahajan, J. Wulff, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, and E. Alsentzer. 2023. Do We Still Need Clinical Language Models?
Thanks, very helpful.
Thank you for your valuable feedback. We will update our description to clarify the motivation for proposing new IE metrics. Regarding the reviewer's suggestion to use more powerful models, incorporating recent publicly available models like LLaMA3 is a potential option for our setting. However, we believe the trend that adding synthetic data helps will hold for these larger models as well.
Since we have addressed the primary questions raised in your review, we kindly request you to reconsider your review score in light of our rebuttal.
The paper presents a novel approach to task-oriented dialogue construction. The authors propose a human-in-the-loop scheme and show that it helps improve the final quality on the private dataset. A number of settings, including fully automatic dialogue generation, are tested, and a new dataset is created.
Reasons to Accept
Important research field combining solid practical and theoretical results.
Several important contributions, including novel dataset and model.
Detailed explanation of the approach, which makes it highly reproducible.
Fine-grained and impressive evaluation supporting major contributions of the paper.
Reasons to Reject
It would be nice to reproduce the results on an open dataset or at least provide some preliminary evaluation. The paper is also unclear in some minor respects. For instance, I did not understand the description of the ontology representation format: SQL tables are mentioned, but without further explanation.
We thank the reviewer for finding our work impactful in contributing to the research field.
Regarding reproducing results on a public dataset, we acknowledge that additional results on a public dataset would increase the reproducibility of this work. However, assessing impact on real, complex, goal-oriented conversations is a key objective of our work, and it is challenging to find a public dataset containing such realistic dialogues because of privacy issues.
We follow previous work (IC-DST) in representing the ontology in a SQL table format; Appendix D3 shows an example of the prompt. For example, the domain Adjuster (shown in the following text block) contains the domain slots Explain Coverages, Permission to Record, Set up Inspection, and Set up Rental. Slots with CHECK constraints are categorical; the others are non-categorical. We will update our description to clarify the ontology representation format.
CREATE TABLE Adjuster(
    'Explain Coverages' TEXT,
    'Permission to Record' TEXT CHECK ('Permission to Record' IN ('Yes', 'No')),
    'Set up Inspection' TEXT CHECK ('Set up Inspection' IN ('Quick Photo Claim', 'Field Assignment')),
    'Set up Rental' TEXT CHECK ('Set up Rental' IN ('Yes', 'No'))
)
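As an illustration of this design choice (an assumed sketch, not the authors' tooling), such a schema string can be generated mechanically from an ontology definition, with categorical slots mapped to CHECK ... IN constraints and non-categorical slots to plain TEXT columns:

# Illustrative sketch (assumed, not the authors' code): render one ontology domain
# in the IC-DST-style SQL table format shown above. Slots mapped to a value list
# become categorical (CHECK ... IN); slots mapped to None stay free-text.
ontology = {
    "Adjuster": {
        "Explain Coverages": None,
        "Permission to Record": ["Yes", "No"],
        "Set up Inspection": ["Quick Photo Claim", "Field Assignment"],
        "Set up Rental": ["Yes", "No"],
    }
}

def render_domain(domain, slots):
    entries = []
    for slot, values in slots.items():
        if values is None:
            entries.append(f"    '{slot}' TEXT")
        else:
            choices = ", ".join(f"'{v}'" for v in values)
            entries.append(f"    '{slot}' TEXT CHECK ('{slot}' IN ({choices}))")
    return f"CREATE TABLE {domain}(\n" + ",\n".join(entries) + "\n)"

print(render_domain("Adjuster", ontology["Adjuster"]))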
Thanks, you answered my question.
To palliate the lack of available training data for task-oriented dialogue, the paper investigates adding a human in the loop when performing data augmentation with Large Language Models for a task of extracting information from auto insurance call centre dialogues (AIC corpus), using public (anonymized) synthetic data to improve performance on the private dataset. The experiments were done on an Auto Insurance Claim information collection task, where the currently active slot-value set represents the dialogue state. The authors observed a 25% relative performance increase (F1 measure on information extraction from dialogue history) when augmenting a small set of conversations with synthetic data generated by prompting an LLM and then checked by a human. During generation, the human checks and, if needed, edits the generated subdialogue and/or the prompt before the next subdialogue is generated, in an iterative process taking into account the full dialogue history, until completion of the dialogue.

The authors propose a contribution in four parts: a collaborative human-LLM framework, a custom AIC dataset of conversations with annotations (slot values with entity linking and multiple slot-value tracking), a new entity-centered scoring methodology based on the previous contribution, and finally experiments with LLMs. Comparative experiments against a fully automatic approach have shown the benefit of human collaboration both for dialogue generation and annotation. The authors found that synthetic data has fewer turns, longer turns, and less variance in length; fully automatic data is the most extreme in this respect, and human intervention results in much longer and more varied dialogues, which are still far from reaching the complexity of human-human dialogues.

In the experiments, the authors first deployed in-context learning with ChatGPT (gpt-3.5-turbo-0301, the same model used for the human-in-the-loop data augmentation) and SBERT as a retriever to predict slot-value pairs (state changes) and the corresponding entity (referent) for each pair, with two different prompting scenarios. Second, they ran a series of experiments fine-tuning different versions of the T5 and long-T5 transformer models. Third, they investigated replacing the human in the loop with ChatGPT and GPT-4 (keeping the T5 model for finetuning on the entirely synthetic data) and showed that, despite GPT-4 performing better than ChatGPT, the synthetic data produced with human collaboration provided better-quality training data than the fully automatic approach.
The paper addresses a currently hot topic: the extent to which large language models can be used to generate realistic language data. It makes a significant contribution, based on state-of-the-art models, showing that while LLMs can help augment training data, they are far from making human intervention obsolete for reaching human-level performance as soon as the task becomes complex. Note that, IMHO, the fact that the authors used ChatGPT and GPT-4, which are not open models and thus limit their use in a scientific experiment (because of the resulting limitation on reproducibility), is mitigated by the fact that these models were used to produce a relatively limited amount of data and that the contribution is a methodology that could use other, open large language models. Nevertheless, it would have been interesting to have a test done with such models for comparison.
Reasons to Accept
The paper makes a significant contribution on a hot research topic using state-of-the-art technology and reports an interesting result, showing that Large Language Models are far from reaching human performance for data augmentation on a relatively complex language task.
The experiments are reported clearly and in detail, with language examples in the appendix, following best practices of the domain, and the coverage of the state of the art is up to date.
Reasons to Reject
The experiments rely partially on Large Language Models that are not open, raising questions about the reproducibility of the results, not forgetting that the original source corpus is private data.
Questions for Authors
Page 6, Section 4: "we apply DialGen with ChatGPT as the LM backbone to create DialGen-AIC". The version of ChatGPT that was used is only given later, at the top of page 7 (with additional information in a footnote); it would be better to have this information at the first occurrence of the term. Maybe a table or synoptic diagram in the appendix recapitulating which models (ChatGPT, SBERT, T5, GPT-4) were used in which experiments would be helpful to the reader.
Thank you for finding that our methodology contributes meaningfully to the discussion around language model (LM) performance and shows that current LMs are still limited without human correction in data synthesis and annotation. We acknowledge that using open-source LMs would increase reproducibility and strengthen our findings. We used these closed models because, as of the paper submission, they were the state-of-the-art models in this field. Recent open LMs, such as LLaMA3 (released after this work), could provide an alternative backbone for our human-in-the-loop framework. We leave this exploration to future work.
We acknowledge the reproducibility issues raised about using closed LMs. Per OpenAI's documentation, the temperature value ranges from 0 to 2, with lower values yielding more deterministic output and higher values more randomness. In our quantitative experiments, only the IC-DST experiment involves a closed LM. To enhance the reproducibility of the IC-DST experiment on the publicly available DialGen-AIC, we set the temperature to 0 and report the API version (gpt-3.5-turbo-0301).
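For concreteness, here is a minimal sketch of such a pinned API call, assuming the pre-1.0 openai Python SDK that was current at the time; the prompt content is a placeholder rather than our exact IC-DST prompt:

# Minimal sketch of pinning the model version and decoding parameters for reproducibility,
# assuming the pre-1.0 `openai` Python SDK. The prompt string is an illustrative placeholder.
import openai

prompt = "..."  # IC-DST prompt: SQL-style schema + retrieved examples + dialogue context

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",  # pinned API version reported in the paper
    temperature=0,               # most deterministic setting in the 0-2 range
    messages=[{"role": "user", "content": prompt}],
)
prediction = response["choices"][0]["message"]["content"]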
Thank you for your suggestion about the footnote. We will edit the paper so that this information appears at the first mention in the main content. Regarding the readability suggestion about the versions of the language model APIs, we added a table indicating the versions of the different LMs in Appendix C.2 (page 16) due to the page limit. We will add SBERT and T5 to the table and add a pointer informing readers that the table is in the appendix.
Thank you for your answers and for taking my remarks into account. I understand your postponing of the experiments with other new models like Llama3.
This paper evaluates the usefulness of using generated dialogues, with or without the help of human annotators to dynamically correct them, for learning to extract information from dialogues between humans. The paper is very clear, except for the description of the evaluation measures, CB and TLB, which could be improved. The experiments are well done and the results are very clear, even if a statistical analysis is missing.
Reasons to Accept
The paper is very clear, and the experiments are well executed with clear results. The code and generated data will be made available. It is very interesting to see the impact of human corrections to generated text used to enrich an existing corpus.
Reasons to Reject
Missing statistical analysis of the results.
The characteristics of the generated text, which differ markedly from those of the real data, could be explored more thoroughly, and methods to reduce these discrepancies could be studied.
Questions for Authors
What would be the effect of adding a certain amount of human data compared to adding the same amount of generated data? This could be studied by keeping part of the original data aside for later use, but as the original data is quite small, this could be difficult.
We thank the reviewer for finding our human-corrected synthetic data approach interesting.
Regarding more thorough methods to reduce the discrepancy between synthetic and real data: we showed that DialGen can synthesize dialogue data that is more similar to the real data (AIC) than fully automatic data. However, our analysis (Section 6.1) also shows there are still gaps, consistent with the reviewer's comment. We acknowledge that future work can explore additional methods to further reduce the difference between synthetic and natural data.
Regarding the effect of using the same amount of real data versus generated data, the result can be inferred from Figure 4. In the figure, we use 7 real dialogues as a training set (1.3K turns) and obtain a 34.8 TLB score. The same amount of generated data falls between the 10% (0.9K turns) and 20% (1.8K turns) points on the x-axis, suggesting that the TLB score for 1.3K turns of generated data would be between 27 and 31. We can add an experiment to confirm this estimate.
Regarding the missing statistical analysis, we assume the reviewer is suggesting an additional statistical significance test between the experiments. We conducted such a test on our best model (T5-SC), as shown in Table 3. We used the bootstrap test recommended in prior research [1][2] to evaluate the difference between two results under the proposed F1 metrics, comparing our best T5-SC result (AIC + DIALGEN-AIC) with the T5-SC result using only AIC. The resulting p-value < 0.01 indicates that incorporating both DIALGEN-AIC and AIC as training data is significantly better than using AIC alone (a sketch of the test is given after the references below).
[1] Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An Empirical Investigation of Statistical Significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005, Jeju Island, Korea. Association for Computational Linguistics.
[2] Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics.
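For reference, a minimal sketch of the paired bootstrap test in the spirit of [1]; the per-dialogue score granularity and variable names are illustrative assumptions rather than the exact setup behind Table 3:

# Illustrative sketch of the paired bootstrap significance test (Berg-Kirkpatrick et al., 2012).
# scores_a / scores_b: per-dialogue F1 of the two systems on the same test set
# (an assumed granularity; any paired per-example metric works), with system A
# observed to be better on the full test set.
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n   # observed mean gap, A minus B
    exceed = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample dialogues with replacement
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta > 2 * observed:                     # sample gap exceeds twice the observed gap
            exceed += 1
    return exceed / n_resamples                      # approximate p-value

# Hypothetical usage with per-dialogue F1 lists for T5-SC trained on
# AIC + DIALGEN-AIC vs. AIC alone:
# p = paired_bootstrap_p(f1_aic_plus_dialgen, f1_aic_only)
# print("significant at 0.01" if p < 0.01 else "not significant")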
Thanks for your answer. It confirms my positive opinion.
This paper introduces DialGen, a human-in-the-loop framework for generating synthetic dialogues to address the lack of training data for information extraction (IE) tasks in privacy-sensitive domains. The authors demonstrate a 25% relative improvement in IE performance by augmenting real call center conversations with synthetic data generated and refined through human-LM collaboration.
Reviewer Consensus:
- The paper addresses a significant and relevant problem, given the privacy constraints in real-world dialogue data.
- The paper is clear, well-structured, and adheres to best practices in the field.
- The inclusion of humans in the loop for generating and refining synthetic dialogues is valuable and leads to improved performance.
Reviewer Disagreements:
- One reviewer noted the absence of statistical analysis, which the authors addressed by providing additional tests.
- Concerns were raised about the generalizability of the findings due to the reliance on a single private dataset and non-open LMs. The authors acknowledged this limitation but highlighted the focus on real-world, complex dialogues.
- A suggestion was made to further analyze the discrepancies between synthetic and real data, which the authors agreed with.
Overall Assessment and Recommendation:
Despite minor disagreements, the paper receives positive reviews for its contribution to the field. The proposed framework and dataset are valuable resources for future research. The authors' detailed rebuttal effectively addresses concerns.
Based on the overall positive reviews, I recommend accepting this paper. It offers valuable insights into the potential of human-LM collaboration for data generation and annotation and its implications for improving IE performance in privacy-sensitive domains.