OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
Reviews and Discussion
This paper presents ShareChatX, a benchmark dataset suite for training and evaluating spoken dialogue/language systems. ShareChatX contains spoken dialogue annotations across different contexts/scenarios, including audio events and music backgrounds. ShareChatX is created by prompting GPT-4o, followed by speech synthesis with an off-the-shelf TTS model.
Based on ShareChatX, the authors train OmniChat, a spoken language model that achieves SOTA performance on the DailyTalk dataset and on ShareChatX itself. Ablation studies, such as the mixing ratio of synthetic to real data and data-volume scaling, are provided.
Strengths
- The annotated dataset ShareChatX could potentially be helpful for developing foundation models for spoken language tasks; the released OmniChat model may facilitate future development of spoken language models;
- OmniChat achieves SOTA performance on the DailyTalk dataset.
Weaknesses
- The presentation is sometimes confusing and needs to be improved. For example, Sec 5.3 investigates data scale for "spoken dialog models", but it is unclear which dialogue models these are -- are they OmniChat, or other audio models? The section is confusing and does not contribute much to the paper.
- Details about experiment settings are needed to verify OmniChat's actual performance. Tab 2 and Tab 3 compare OmniChat with baseline models whose experimental settings are not provided.
- Although ShareChatX targets spoken language tasks, at its core it relies on an LLM + TTS model, which is not end-to-end; there is still a gap between ShareChatX and more realistic and challenging spoken language processing tasks.
Questions
- In Table 2 and Table 3, are the baseline models trained, or are the numbers zero-shot performance?
- You mention randomly sampling some test set examples in line 314. Are the claimed SOTA performances computed on these randomly sampled data? Are there any standard splits?
Thank you for recognizing the value of our dataset, ShareChatX, in advancing spoken dialogue tasks. Please allow us to address your questions in detail:
Q1: Detailed Experimental Setup for Data Scaling in Section 5.3
A1: In Section 5.3, we focus on understanding how the performance of spoken dialogue models varies with different scales of synthetic data used during training. The experiments were conducted using the OmniChat model, which was trained directly with synthetic data based on LLaMA 3.1 (a text-only LLM) without leveraging any pre-trained MLLM.
Currently, many spoken dialogue models, particularly those using acoustic discrete units for training, also rely on text-only LLMs as their foundation and require large-scale synthetic data for training. Our experiments aim to provide a rough estimate of the data scale required to effectively train such models. These results offer guidance for future efforts in determining the amount of synthetic data needed to support spoken dialogue systems effectively.
Q2: MLLM+TTS vs. End-to-End Approaches
A2: While our model does not implement an end-to-end spoken dialogue system, this does not diminish its contributions to advancing the field. As you pointed out, there are multiple paradigms for training spoken dialogue models:
1. MLLM+TTS: Combining multimodal LLMs with TTS systems.
2. End-to-End: Jointly training multimodal LLMs and vocoders.
3. Acoustic Discrete Units: Training based on discrete acoustic representations.
To date, no consensus has been reached on which paradigm is optimal. Against this backdrop, our work focuses on the broader question of how synthetic data can be utilized to advance spoken dialogue systems. The conclusions we present are applicable across training paradigms.
For example, under the MLLM+TTS paradigm, we demonstrate that synthetic data can effectively support real-world spoken dialogue tasks and provide insights into optimizing the use of synthetic data. While this work does not present a fully complete system, it lays the groundwork for further development, bringing spoken dialogue systems closer to maturity.
Q3: Experimental Setup for Baseline Models
A3: The baseline models (FunAudioLLM, Qwen-Audio, Qwen2-Audio, AudioFlamingo, and Salmonn) have been pre-trained on large-scale datasets and claim to possess strong spoken dialogue capabilities. Therefore, we evaluated them in a zero-shot setting for fairness.
To ensure a fair comparison, we used OmniChat, a model trained directly on a text-only LLM without fine-tuning on real data (also in a zero-shot setting). The results show that, even without pretraining on large-scale audio data, OmniChat achieves performance comparable to state-of-the-art models like Qwen2-Audio, demonstrating the value of synthetic data.
The additional fine-tuning of OmniChat with real data (OmniChat+Real Data) was conducted to explore the combined use of real and synthetic data and assess the limits of this approach. We did not perform further fine-tuning on the baseline models to maintain the focus on this specific exploration.
Q4: Test Set Partitioning
A4: Yes, to prevent data leakage during training, we randomly partitioned the dataset into train and test subsets, ensuring no overlap between them. As described in Sections C.1 and C.2:
We randomly selected 220 samples from DailyTalk as the test set. We will open-source the test set partitions in this work to facilitate comparison in future studies.
The test sets contain 3,731 dialogues for ShareChat-Emotion, 1,555 for ShareChat-Audio, and 1,243 for ShareChat-Music.
These partitions ensure that the test data remains unseen during training, enabling fair and reliable evaluation.
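For concreteness, a disjoint split of this kind can be produced as in the minimal sketch below; the seed is a placeholder, and the released partition files (rather than a re-split) should be used for any comparison.

```python
# Illustrative seeded, dialogue-level split with no train/test overlap; the seed is a
# placeholder, and the released partition files should be used for actual comparisons.
import random


def split_dialogues(dialogue_ids, n_test=220, seed=42):
    rng = random.Random(seed)
    ids = list(dialogue_ids)
    rng.shuffle(ids)
    test_ids, train_ids = ids[:n_test], ids[n_test:]   # e.g. 220 DailyTalk dialogues
    assert not set(test_ids) & set(train_ids)           # splits are disjoint
    return train_ids, test_ids
```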
Thank you once again for your thoughtful questions and recognition of our work. We hope our responses address your concerns, and we remain open to further discussions and feedback.
Dear Reviewer KUjm,
We have provided detailed clarifications regarding the experimental setups and explanations for various configurations in response to your concerns. As the rebuttal phase is nearing its conclusion, we kindly request your feedback at your earliest convenience to help us further improve our work.
If you have any additional questions or require further clarification, please do not hesitate to let us know. We deeply appreciate the time and effort you have dedicated to reviewing our paper.
Best regards,
The paper introduces OmniChat, a novel spoken dialogue system enhanced by synthetic data for handling diverse scenarios. The key contributions include:
- ShareChatX - a large-scale synthetic spoken dialogue dataset covering various scenarios including emotional dialogues, audio events, and music contexts
- OmniChat - a multi-turn spoken dialogue system with a heterogeneous feature fusion module (Mix-Former) for optimizing feature selection across different dialogue contexts
- Comprehensive analysis of synthetic data usage in training spoken dialogue systems, including optimal ratios between synthetic and real data
- State-of-the-art performance achieved on the DailyTalk dataset and other complex dialogue scenarios
Strengths
- The paper addresses a critical challenge in spoken dialogue systems by leveraging synthetic data to overcome the scarcity of large-scale, high-quality spoken dialogue datasets.
- The proposed Mix-Former module effectively integrates multiple expert features (speech, emotion, beat) to handle diverse dialogue scenarios.
- Comprehensive experimental analysis provides valuable insights into optimal training strategies, including the ideal balance between synthetic and real data.
- The work demonstrates significant practical impact through state-of-the-art performance on real-world datasets.
Weaknesses
- The paper lacks detailed comparison with some recent baseline methods in spoken dialogue systems, particularly in terms of model size and computational requirements.
- The evaluation metrics could be more comprehensive, especially for measuring the quality of generated speech beyond just content and emotion accuracy.
- The methodology for ensuring quality control in synthetic data generation could be explained more thoroughly.
Questions
Could you please share the ablation studies on different components of the Mix-Former architecture?
Thank you for recognizing the value of our work. Please allow us to address your questions in detail:
Q1: Model Size and Training Requirements
A1:
| Method | Trainable Parameters | Total Parameters | Base Model |
|---|---|---|---|
| AudioFlamingo | 1B | 2.2B | - |
| FunAudioLLM | - | 7.9B | Qwen2-7B |
| Qwen2-Audio | 7B | 8.2B | Qwen2-7B |
| Salmonn | 33.3M | 13.8B | Vicuna-13B |
| OmniChat | 27.4M | 8.1B | Llama-3-8B |
Q2: Quality of Synthesized Speech
A2:
Thank you for your suggestion. In existing baselines such as Qwen-Audio and Qwen2-Audio, the models generate only textual responses, and a unified TTS model is then used for speech synthesis. In this context, the quality of the generated speech depends more on the TTS model than on the spoken dialogue system itself. Therefore, our evaluation metrics primarily assess the dialogue system's content understanding and response generation capabilities (i.e., content and emotion accuracy).
If you have further suggestions, we look forward to discussing them with you.
Q3: Process for Ensuring Data Synthesis Quality
A3:
Our process for ensuring data synthesis quality involves the following three steps (an illustrative sketch of the two automated checks follows the list):
- ASR Validation: Confirm that the generated text aligns semantically with the input audio.
- Speaker Verification: Ensure that the synthesized speech maintains consistent speaker characteristics (e.g., timbre).
- Human Evaluation: After completing data synthesis, we randomly sampled 2,000 examples from each dataset subset to verify whether the results adhered to human dialogue standards. We found that 97% of the synthesized data met the criteria for logical human dialogue.
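To make the two automated checks concrete, below is a minimal, hypothetical sketch of such a filter. The `asr_transcribe` and `speaker_embedding` callables, the WER/cosine-similarity criteria, and the thresholds are illustrative assumptions, not the exact pipeline used for ShareChatX.

```python
# Hypothetical sketch of the two automated checks; asr_transcribe() and
# speaker_embedding() stand in for an off-the-shelf ASR model and a speaker
# encoder, and the thresholds are illustrative, not the values used in the paper.
from dataclasses import dataclass

import numpy as np
from jiwer import wer  # word error rate between target text and ASR transcript


@dataclass
class SynthTurn:
    audio: np.ndarray        # synthesized waveform
    text: str                # target text fed to the TTS model
    ref_speaker: np.ndarray  # embedding of the reference speaker prompt


def passes_quality_checks(turn, asr_transcribe, speaker_embedding,
                          max_wer=0.1, min_speaker_sim=0.7):
    # 1) ASR validation: the synthesized audio should read back as the intended text.
    hypothesis = asr_transcribe(turn.audio)
    if wer(turn.text.lower(), hypothesis.lower()) > max_wer:
        return False

    # 2) Speaker verification: timbre should stay consistent with the reference prompt.
    emb = speaker_embedding(turn.audio)
    cos_sim = float(np.dot(emb, turn.ref_speaker) /
                    (np.linalg.norm(emb) * np.linalg.norm(turn.ref_speaker) + 1e-8))
    return cos_sim >= min_speaker_sim
```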
Q4: Mix-Former Ablation Study
A4:
We conducted ablation studies on Mix-Former from two perspectives: embedding modalities and core module design. The results are as follows:
Experiments on ShareChat-Emotion
| ID | speech embedding | emotion embedding | BLEU | ROUGE-L | METEOR | BERTScore | $F1_e$ |
|---|---|---|---|---|---|---|---|
| E1 | ✔ | ✔ | 6.20 | 20.0 | 18.9 | 88.1 | 57.2 |
| E2 | ✔ | | 6.12 | 19.5 | 18.1 | 87.4 | 54.5 |
Experiments on ShareChat-Music
| ID | speech embedding | emotion embedding | beats embedding | BLEU | ROUGE-L | METEOR | BERTScore | $F1_e$ |
|---|---|---|---|---|---|---|---|---|
| M1 | ✔ | ✔ | ✔ | 4.68 | 17.7 | 15.8 | 87.8 | 69.1 |
| M2 | ✔ | ✔ | | 4.64 | 17.5 | 15.5 | 86.8 | 68.5 |
| M3 | ✔ | | ✔ | 4.65 | 17.6 | 15.6 | 87.4 | 68.4 |
Key Module Ablation
To analyze the core module design, we replaced the three expert Q-Formers in Mix-Former with a single Q-Former. In this setup, the embeddings from speech, emotion, and BEATs were concatenated and fed into the single Q-Former for evaluation. This modification was tested to explore the impact of reducing model complexity. Further results and insights will be included in the full report.
| Method | BLEU | ROUGE-L | METEOR | BERTScore | $F1_e$ |
|---|---|---|---|---|---|
| Mix-Former | 4.68 | 17.7 | 15.8 | 87.8 | 69.1 |
| w/o expert Q-Former | 4.60 | 17.4 | 15.4 | 86.3 | 66.5 |
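For readers who want a concrete picture of the two settings, below is a heavily simplified, hedged PyTorch sketch of our understanding of the comparison; it is not the actual Mix-Former implementation, and the query count, hidden size, shared feature dimension across modalities, single-layer depth, and concatenation-based fusion are all illustrative assumptions.

```python
# Heavily simplified PyTorch sketch of the two configurations compared above.
# Query count, hidden size, a shared feature dimension across modalities, single-layer
# depth, and concatenation-based fusion are all illustrative assumptions.
import torch
import torch.nn as nn


class MiniQFormer(nn.Module):
    """Toy stand-in for a Q-Former: learnable queries cross-attend to input features."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats):                       # feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats)   # queries attend over the features
        return out + self.ffn(out)                  # (B, num_queries, dim)


class ExpertFusion(nn.Module):
    """'Mix-Former' row: one expert Q-Former per feature type, outputs concatenated."""

    def __init__(self, dim=768):
        super().__init__()
        self.experts = nn.ModuleDict({name: MiniQFormer(dim)
                                      for name in ("speech", "emotion", "beats")})

    def forward(self, speech, emotion, beats):
        outs = [self.experts["speech"](speech),
                self.experts["emotion"](emotion),
                self.experts["beats"](beats)]
        return torch.cat(outs, dim=1)               # (B, 3 * num_queries, dim)


class SingleQFormerBaseline(nn.Module):
    """'w/o expert Q-Former' row: features concatenated, then one shared Q-Former."""

    def __init__(self, dim=768):
        super().__init__()
        self.qformer = MiniQFormer(dim)

    def forward(self, speech, emotion, beats):
        return self.qformer(torch.cat([speech, emotion, beats], dim=1))
```

In this reading, the ablation asks whether giving each expert feature its own query set (ExpertFusion) helps relative to forcing one shared query set to attend over all concatenated features (SingleQFormerBaseline).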
Thank you once again for your thoughtful questions and recognition of our work. We hope our responses address your concerns, and we remain open to further discussions and feedback.
Dear Reviewer ns19,
Thank you for recognizing the value of our work. With your suggestions, we have further enhanced the completeness of our study. We look forward to your feedback on our responses. As the rebuttal phase is nearing its end, we kindly request your response at your earliest convenience.
If you have any additional questions or require further clarification, please do not hesitate to let us know. We deeply appreciate the time and effort you have dedicated to reviewing our paper.
Best regards,
This paper introduces OmniChat, a spoken dialogue system using ShareChatX, a synthetic dataset covering emotional, audio, and musical contexts to enable nuanced, multi-modal interactions. With Mix-Former, a fusion module, OmniChat dynamically integrates features like emotion and background sounds, achieving strong performance on complex dialogues. OmniChat outperforms existing methods on DailyTalk and ShareChatX.
Strengths
This paper introduces a new large-scale synthetic dataset, ShareChatX, covering different scenarios, including emotion, audio and music. Based on this dataset, this paper also presents a competitive spoken dialogue system, OmniChat, achieving SOTA performance on DailyTalk and ShareChatX. Overall, this paper is well-structured and could potentially contribute to the field of spoken dialogue systems.
Weaknesses
- The evaluation benchmark is not very comprehensive. Although the evaluation includes two datasets, one of the test sets is constructed similarly to the training set. This similarity may contribute to OmniChat’s superior performance, primarily due to in-domain training. Additionally, the evaluation datasets focus solely on chit-chat. It would be beneficial to include information-seeking instructions in the evaluation as well.
- The evaluation metrics primarily include reference-based methods (BLEU, ROUGE-L, BERTScore, F1). Previous studies have shown that these metrics may not fully capture conversational quality. Moreover, GPT and human evaluation results are only provided for DailyTalk.
- Although the paper emphasizes the importance of balancing synthetic and real data, it lacks a detailed analysis of how each synthetic data type (emotion, audio, music) individually impacts model performance.
- The paper does not provide sufficient details about the model and experimental setup to ensure reproducibility.
- Many existing voice assistants prioritize real-time settings, which are essential for practical applications. Given the model’s complexity, real-time latency and response times could present challenges.
Questions
- What are the model versions of the pre-trained audio encoders used, including Whisper, Emotion2Vec, and BEATs?
- Could you elaborate on the methods for manual verification (line 183) and manual evaluation (line 322)?
- How do you overlay the music or audio onto the spoken dialogue speech data? What Signal-to-Noise Ratio (SNR) do you use, or are they simply added without adjustment?
- What is the evaluation prompt used for GPT-Eval?
- What parameters are included in the style settings? Is it gender, pitch, emotion, energy (line 197), or gender, pitch, speed, emotion (line 182)?
- Have you evaluated the accuracy of style parameters other than emotion?
- How many trainable parameters does OmniChat have? Including a comparison of trainable parameters in other models would be beneficial.
- Missing references: Existing works in audio/emotion dialogue systems include:
  - SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
  - EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
- What are the GPT and human evaluation results for ShareChatX?
- What is the inference cost of the proposed model?
Thank you for your recognition of our dataset, ShareChatX, our writing, and our contributions to the field of spoken dialogue systems. Please allow us to address your questions in detail:
Q1: Evaluation Benchmark (Weakness 1)
A1:
Yes, we specifically included experiments on DailyTalk to evaluate the model’s performance in out-of-domain and real-world scenarios. Testing on out-of-domain datasets like DailyTalk demonstrates that OmniChat’s performance does not rely on in-domain data but instead reflects strong generalization capabilities.
Regarding information-seeking tasks, this is indeed an excellent idea, and we initially considered including such tasks. However, compared to ASR-based models, the most distinctive capability of spoken dialogue systems lies in understanding acoustic information that cannot be conveyed through text alone in complex conversational scenarios (e.g., emotion, audio, and music). We therefore decided to focus our research on this more challenging and open-ended problem (how to help spoken dialogue models understand acoustic information). That said, we believe that training with synthetic data can also effectively enhance the information-seeking capabilities of spoken dialogue systems.
Q2: Evaluation Metrics (Weakness 2)
A2:
Thank you for your discussion on evaluation metrics. For our metrics, we referred to established baselines in spoken dialogue tasks [1,2], which commonly use BLEU, ROUGE-L, and BERTScore. However, we were concerned about the ability of these metrics to fully capture spoken dialogue capabilities, so we supplemented them with GPT-based metrics and human evaluations.
Please note that the ShareChatX data is entirely AI-synthesized. We initially refrained from running GPT-eval on ShareChatX due to concerns that the evaluation results might align with the LLM’s inherent biases (i.e., “rating its own answers”), which could reduce the evaluation’s objectivity.
Q3: How does each type of synthetic data individually affect model performance? (Weakness 3)
A3:
Please refer to Section 5.6 and Figure 4, where we analyze the performance of models trained on subsets such as ShareChat-Emotion and ShareChat-Audio. These results demonstrate the importance of spoken dialogue data (e.g., ShareChat-Emotion) for specific scenarios and the value of special-purpose dialogue data (e.g., ShareChat-Audio and ShareChat-Music).
Q4: Model and Experimental Settings (Weakness 4)
A4:
We provided detailed experimental settings in Section 5.1 and Appendix Section C. Additionally, we will release all code after the paper’s acceptance. If you find any missing experimental details, please let us know, and we will provide further clarifications.
Q5: Real-Time Performance (Weakness 5)
A5:
This work focuses on how synthetic data can enhance spoken dialogue systems during training. Our data synthesis methods and conclusions can benefit the training of various systems, including those aiming for either high performance or fast response times.
The model proposed in this work is designed to validate the impact of synthetic data on performance and follows a simple architecture similar to SALMONN, without optimizing for complex real-time processing.
Q6: Pretrained Model Versions? (Question 1)
A6: Whisper: Whisper large-v3. Emotion2Vec: Emotion2Vec-large. BEATs: Fine-tuned BEATs_iter3+ (AS2M) (cpt2).
Q7: Human Validation and Evaluation Methods (Question 2)
A7:
Data Validation Process:
After generating all synthetic data, we randomly sampled 2,000 examples from each dataset subset to check if the generated results met human standards. We found that 97% of the data adhered to logical human dialogue standards.
Human Evaluation Process:
We invited 10 evaluators to rate all generated responses on a scale of 1 to 5 based on the following criteria:
1: Contextually irrelevant and emotionally inappropriate.
2: Partially relevant to the context but feels unnatural, with emotionally inappropriate expression.
3: Partially relevant to the context, with emotionally appropriate expression.
4: Contextually relevant and natural, mentioning the attribute, with emotionally appropriate expression.
5: Contextually relevant, smooth, natural, and with emotions that perfectly align with the scenario.
The final score for each response was obtained by averaging the scores from all evaluators.
Q8: How are audio and music added to spoken dialogue data? (Question 3)
A8:
To ensure the authenticity of the final dialogue, all audio and music components are overlaid with the speech following [3]. The processing workflow begins by setting a target loudness level for each class: Music (-24.0 LUFS), Speech (-17.0 LUFS), and Audio (-29.0 LUFS). To introduce variability while maintaining realistic consistency, an average LUFS value for each source in a mixture is first sampled uniformly from a range of ±2.0 around its class target. The gain of each individual sound file added to the mixture is then further adjusted by sampling uniformly from a range of ±1.0 around that sampled average LUFS. This process ensures that the resulting audio mixtures exhibit both natural variation and coherence.
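As a concrete illustration of this procedure (not the exact implementation used for ShareChatX), a sketch using the pyloudnorm package might look as follows; the class targets are the LUFS values quoted above, and the assumption that all stems share the same sample rate and length is ours.

```python
# Illustrative sketch of LUFS-targeted mixing as described above; pyloudnorm is an
# assumed tooling choice, and all stems are assumed to share sample rate and length.
import numpy as np
import pyloudnorm as pyln

TARGET_LUFS = {"speech": -17.0, "music": -24.0, "audio": -29.0}


def mix_stems(stems, sr=16000, seed=0):
    """stems: dict mapping class name ('speech'/'music'/'audio') to a waveform array."""
    rng = np.random.default_rng(seed)
    meter = pyln.Meter(sr)
    mixture = None
    for name, wav in stems.items():
        # Per-mixture average LUFS within +/-2.0 of the class target,
        # then a per-file gain within +/-1.0 of that sampled average.
        avg_lufs = TARGET_LUFS[name] + rng.uniform(-2.0, 2.0)
        file_lufs = avg_lufs + rng.uniform(-1.0, 1.0)
        measured = meter.integrated_loudness(wav)
        normalized = pyln.normalize.loudness(wav, measured, file_lufs)
        mixture = normalized if mixture is None else mixture + normalized
    return mixture
```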
Q9: Prompt Used for GPT-Eval (Question 4)
A9:
We have included the reference prompt template in Appendix C.3, Figure 5 in the latest version. The scoring standards align with the human evaluation standards described in Q7.
Q10: What parameters are included in the style settings? (Question 5)
A10: Apologies for the oversight in the previous version. The correct parameters are gender, pitch, speed, and emotion.
Q11: Accuracy of Styles Other Than Emotion (Question 6)
A11:
Previous work [1] reported good prediction results for these attributes, but upon inspection, most training and test samples in that work had “normal” speed, suggesting potential overfitting.
When testing other attributes (e.g., pitch, speed) on ShareChatX, we observed poor prediction performance across models. We found that speed and pitch lacked strong contextual relevance (e.g., the same emotion could be expressed with varying speeds or pitches), so we did not include them in the final evaluation.
Q12: Trainable Parameters in OmniChat (Question 7)
A12: Given that most FunAudioLLM-style methods rely heavily on pre-trained text-based LLMs, which makes them less informative for comparison in this context, we focus on comparing the trainable parameters of several key spoken dialogue systems. The results are as follows:
| Method | Trainable Parameters |
|---|---|
| AudioFlamingo | 1B |
| Qwen2-Audio | 7B |
| Salmonn (13B) | 33.3M |
| OmniChat (ours) | 27.4M |
Q13: Missing References (Question 8)
A13:
Thank you for pointing this out. We have added the relevant references in the latest version.
Q14: GPT and Human Evaluation Results for ShareChatX (Question 9)
A14:
| Method | GPT-eval (-emo) | Human (-emo) | GPT-eval (-music) | Human (-music) | GPT-eval (-audio) | Human (-audio) |
|---|---|---|---|---|---|---|
| FunAudioLLM | 4.09 | 3.87 | 3.47 | 3.54 | 3.59 | 3.72 |
| SALMONN | 1.84 | 2.32 | 2.15 | 2.49 | 2.23 | 2.69 |
| Qwen-Audio | 1.46 | 2.18 | 1.42 | 2.21 | 1.18 | 2.36 |
| Qwen2-Audio | 3.73 | 3.59 | 3.81 | 3.66 | 3.98 | 3.87 |
| OmniChat | 4.22 | 3.97 | 4.11 | 3.87 | 4.21 | 4.08 |
Q15: Inference Cost of the Proposed Model (Question 10)
A15:
Processing 1 second of audio takes approximately 0.65 seconds. Memory consumption depends on the length of the input audio; the full model has 8.1B parameters and can run on a single 3090 GPU.
Thank you again for your thoughtful questions and feedback. We look forward to further discussions and improvements based on your suggestions!
[1] Lin G T, Chiang C H, Lee H. Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations. ACL 2024.
[2] Ao J, Wang Y, Tian X, et al. SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words. NeurIPS 2024.
[3] Petermann D, Wichern G, Wang Z Q, et al. The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. ICASSP 2022.
Dear Reviewer Yjoc,
We have responded to each of the issues you raised in detail and made revisions based on the discussion points. As the rebuttal phase is nearing its end, we kindly request your feedback at your earliest convenience to ensure all concerns have been addressed satisfactorily.
If you have any additional questions or require further clarification, please do not hesitate to let us know. We deeply appreciate the time and effort you have dedicated to reviewing our paper.
Best regards,
The paper proposes a large-scale synthetic spoken dialogue dataset, ShareChatX, covering a wide range of scenarios, including emotion, audio, and music. It also introduces a dialogue system, OmniChat, for diverse scenarios. Then, it conducts extensive experiments to compare the proposed method and existing ones, on both a previous dataset, DailyTalk, and its proposed method, and demonstrates that OmniChat has the SOTA performance. It also provides analysis on the data scale impact, optimal synthetic data ratio, expert feature selection strategies, and complex scenarios.
Strengths
- The paper contributes a very large-scale spoken dialogue dataset that covers a wide range of scenarios, which is a substantial contribution to studies of related topics.
- The paper conducts extensive experiments and provides detailed and thorough analysis from many perspectives, offering lots of useful insights.
Weaknesses
- When trained without real data, OmniChat does not lead to a large improvement, and is worse than Qwen2-Audio on metrics like METEOR and GPT-eval.
- The paper provides some insights into finding the optimal synthetic data ratio. However, it may still be hard to tune the ratio in real-world scenarios, as the ratio-selection rules discussed in the paper may not generalize to other dataset choices and tasks.
Questions
- After listening to the demo audios, I observed that the synthetic data sounds relatively plain and does not deliver strong emotion compared to real-world conversations. How well do you think the synthetic data can capture emotional nuances? Is there any potential solution to make the synthetic data more realistic?
Thank you for recognizing the value of our dataset scale and experimental analysis. Please allow us to address your questions as follows:
Q1: Discussion on OmniChat’s Performance
A1: Please note that Qwen2-Audio is trained on a large-scale audio dataset (520K hours), whereas our model is fine-tuned on top of LLaMA (a text-only LLM) without pretraining on large-scale audio data. Despite this, our model achieves results comparable to Qwen2-Audio using only synthetic data. This effectively demonstrates the validity and significance of our proposed approach to leveraging synthetic data for spoken dialogue tasks, especially given the challenges of collecting real-world spoken dialogue data.
Q2: Optimal Data Proportion
A2: Yes, differences between datasets and tasks may exist, but several conclusions from our experiments can serve as general guidelines for other work:
- Training with generated data significantly enhances the model’s robustness in understanding audio.
- During training, synthetic data should participate at a relatively low frequency (see the sampler sketch after this list). A lower participation frequency improves robustness, while excessive use can lead to overfitting on synthetic data, reducing adaptability to real-world scenarios.
- Synthetic data exhibits a scaling threshold. Observing performance trends across different dataset sizes helps evaluate the cost-effectiveness of increasing synthetic data volume further.
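As an illustration of the second point, training can interleave real and synthetic examples with a fixed, low sampling probability; the sketch below is our own toy formulation, and the probability value is a placeholder rather than the ratio reported in the paper.

```python
# Toy sampler that keeps the participation frequency of synthetic data low during
# training; p_synthetic is a placeholder, not the ratio reported in the paper.
import random


def mixed_batches(real_examples, synthetic_examples, p_synthetic=0.3,
                  steps=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        pool = synthetic_examples if rng.random() < p_synthetic else real_examples
        yield rng.choice(pool)  # one training example per step, real or synthetic
```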
Q3: Can Synthetic Data Capture Subtle Emotional Differences?
A3: In real spoken dialogue datasets, human emotions often lack expressive diversity. For instance, 83.1% of the data in the DailyDialog dataset [1] is labeled Neutral. Synthetic data effectively enriches emotional expression in spoken dialogue, even if some expressions may appear relatively subdued compared to real-world data.
Additionally, to make synthetic data more authentic, we can integrate emotion recognition into the Dialogue Verification step outlined in Section 3 of the paper, improving the emotional expressiveness of synthesized speech. Furthermore, advancements in TTS technology provide a promising pathway to achieve higher fidelity and richer emotional nuances in synthetic data.
Thank you once again for recognizing the significance of our work. We hope these responses address your concerns. Should you have further questions, please do not hesitate to reach out.
[1] Li Y, Su H, Shen X, et al. DailyDialog: A Manually Labelled Multi-Turn Dialogue Dataset. IJCNLP 2017.
Dear Reviewer Jc5t,
We have discussed the issues you raised in detail, finding the topics both interesting and important. We look forward to your further feedback. As the rebuttal phase is coming to a close, we kindly request your response at your earliest convenience.
If you have any additional questions or require further clarification, please do not hesitate to let us know. We deeply appreciate the time and effort you have dedicated to reviewing our paper.
Best regards,
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.