PaperHub
Overall rating: 6.0 / 10
Poster · 3 reviewers
Min 5 · Max 7 · Std 0.8
Ratings: 5, 7, 6
Confidence: 4.0
Soundness: 3.0
Contribution: 2.7
Presentation: 2.7
NeurIPS 2024

StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

StreamingDialogue efficiently compresses dialogue history into conversational attention sinks with minimal losses, enhancing the model's long-term memory and facilitating prolonged streaming conversations.

Abstract

Standard Large Language Models (LLMs) struggle with handling dialogues with long contexts due to efficiency and consistency issues. According to our observation, dialogue contexts are highly structured, and the special token of End-of-Utterance (EoU) in dialogues has the potential to aggregate information. We refer to the EoU tokens as ``conversational attention sinks'' (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal losses, and thus reduces computational complexity quadratically with the number of sinks (i.e., the number of utterances). Current LLMs already demonstrate the ability to handle long context window, e.g., a window size of 200K or more. To this end, by compressing utterances into EoUs, our method has the potential to handle more than 200K of utterances, resulting in a prolonged dialogue learning. In order to minimize information losses from reconstruction after compression, we design two learning strategies of short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines in dialogue tasks and achieves a 4 $\times$ speedup while reducing memory usage by 18 $\times$ compared to dense attention recomputation.
Keywords

dialogue compression · conversational attention sinks · memory

Reviews and Discussion

Official Review (Rating: 5)

Standard Large Language Models (LLMs) struggle with handling dialogues with long contexts due to efficiency and consistency issues. This paper finds that dialogue contexts have a consistent structure and that special tokens may aggregate information. Therefore, this paper aims to use special tokens to encode dialogue history information to reduce inference costs and enhance the ability of LLMs to handle long dialogues. To achieve this, it proposes two information reconstruction functions to improve the information aggregation capability of special tokens. Experiments show that the paper achieves its intended purpose.

Strengths

  1. This paper finds that the separator tokens, namely conversational attention sinks, generally aggregate more attention than other words and tokens. Therefore, this paper proposes StreamingDialogue, which utilizes these special tokens to enhance the capability of LLMs to handle long-context dialogue.
  2. StreamingDialogue has achieved good results on multiple dialogue datasets. Analysis experiments also demonstrate that this method is capable of handling long-context dialogues and can reduce inference latency and memory usage.

Weaknesses

  1. The training method introduced by StreamingDialogue is overly complex, resulting in the model performing multiple forward passes on the same sample during training, leading to more than three times the training cost.
  2. The comparison of different methods is unfair. In the main experiment, StreamingDialogue is fine-tuned on specific datasets, making it evident that it can surpass the non-training method StreamingLLM [1]. Although the authors also explored the non-training setting of the proposed method in Section 4.6, they only conducted experiments on Llama2-7B-chat and evaluated it using only 1-gram and 2-gram metrics. Therefore, the effectiveness and generalizability of the method are not verified.

[1] Xiao et al. Efficient Streaming Language Models with Attention Sinks. ICLR 2024.

Questions

  1. In line 127-128, the authors mention that by caching only the corresponding conv-attn sinks, the time complexity of attention computation can be reduced from $O(T^2L^2)$ to $O(T^2)$. However, have the authors considered that if each utterance contains $L$ tokens, the time complexity should be $O(T^2L)$?

  2. For the same sample, it would be best for the authors to use the same notation on line 157 and line 175 to avoid confusion.

Limitations

The authors adequately address the limitations and, if applicable, potential negative societal impact of their work.

Author Response

Dear Reviewer wRZP,

We sincerely thank you for your constructive suggestions and valuable feedback! We hope our response can help resolve your concerns.

The training method is complex, resulting in higher training costs.

Our method outperforms baselines in both training and non-training settings, and the non-training setting incurs no additional cost. The training method we introduced is intended to better adapt the model to the conv-attn sink mode, which results in a trade-off between significantly improved performance and increased training cost. However, regardless of the choice, our method consistently performs better than baselines.

Additionally, during the inference stage, our method can significantly reduce space and time complexity by compressing historical dialogues into conv-attn sinks. Below is a comparison of our method with StreamingLLM in both training and non-training settings.

| Method | BLEU | BLEU-1 | BLEU-2 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| StreamingLLM (non-training) | 20.16 | 51.18 | 29.99 | 15.90 | 1.92 | 14.26 |
| Ours (non-training) | 20.19 | 51.55 | 30.03 | 16.46 | 2.11 | 15.00 |
| StreamingLLM (training) | 16.76 | 47.54 | 25.08 | 15.25 | 2.44 | 14.21 |
| Ours (training) | 19.33 | 51.49 | 28.12 | 17.18 | 2.77 | 15.86 |

(1) The comparison of different methods is unfair; (2) More base models and metrics are needed for the non-training setting.

(1) We have made every effort to ensure all comparisons are conducted fairly. There may be some misunderstandings: in the main experiment, all methods were maintained with the same training settings and were fine-tuned on specific datasets, including StreamingLLM. We will further clarify the experimental settings in the revision. Thank you for the suggestion.

(2) Thank you for the great advice! We have added tests on the Llama-3-8B-Instruct and Mistral-7B [1] under the non-training setting and included additional metrics: BLEU and ROUGE-L.

| Model | Method | BLEU | BLEU-1 | BLEU-2 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B-Chat | StreamingLLM | 20.16 | 51.18 | 29.99 | 15.90 | 1.92 | 14.26 |
| Llama-2-7B-Chat | Ours | 20.19 | 51.55 | 30.03 | 16.46 | 2.11 | 15.00 |
| Llama-3-8B-Instruct | StreamingLLM | 16.48 | 39.68 | 24.63 | 16.88 | 1.93 | 15.47 |
| Llama-3-8B-Instruct | Ours | 16.77 | 40.10 | 24.88 | 17.11 | 2.01 | 15.85 |
| Mistral-7B | StreamingLLM | 12.75 | 42.86 | 19.99 | 12.58 | 1.83 | 11.73 |
| Mistral-7B | Ours | 13.33 | 44.08 | 20.65 | 13.40 | 1.98 | 12.58 |

In line 127-128, the authors mention that by caching only the corresponding conv-attn sinks, the time complexity of attention computation can be reduced from $O(T^2L^2)$ to $O(T^2)$. However, have the authors considered that if each utterance contains $L$ tokens, the time complexity should be $O(T^2L)$?

After thorough verification and recalculation, the time complexity is indeed $O(T^2L)$. We sincerely appreciate your suggestion and will correct the complexity in the revision.
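For completeness, one way to arrive at this count, assuming $T$ utterances of $L$ tokens each (a back-of-the-envelope sketch, not text taken from the paper): dense attention over all $TL$ tokens costs $O\big((TL)^2\big) = O(T^2L^2)$, whereas with cached conv-attn sinks the $L$ tokens of utterance $t$ attend to roughly $t$ sinks plus the current utterance, giving

$$\sum_{t=1}^{T} L\,(t + L) \;=\; O\!\left(T^{2}L + TL^{2}\right) \;=\; O\!\left(T^{2}L\right) \quad \text{when } T \gg L .$$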

For the same sample, it would be best for the authors to use the same notation on line 157 and line 175 to avoid confusion.

Thank you for your suggestion. We will use the same notation in the revision for lines 157 and 175.

Once again, we appreciate your efforts and valuable suggestions to improve our paper. If you have further questions, please leave more comments in the OpenReview system. We would appreciate it very much if you could kindly raise your score if your concerns are addressed!

References
[1] Mistral 7B

Comment

We highly appreciate your valuable time spent in reviewing our work. The insights and contributions you have made to improve the quality of our submission are sincerely acknowledged. We would like to inquire whether our response adequately addressed your questions. Your feedback holds immense value to us, and we eagerly await your reply.

Comment

We are reaching out to follow up on our previous reply, as we have yet to receive your feedback. We are keen to know whether the information we shared has fully addressed your concerns or if there is more we can do to assist.

We truly value the time and effort you have dedicated to reviewing our work, especially considering your busy schedule. Your expertise and feedback are important to us, and we would deeply appreciate your reply.

Thank you very much for your time and consideration.

Official Review (Rating: 7)

This paper tackles the challenge of long context dependencies of LLMs in dialogue settings. The authors first posit that end of utterance tokens like "\n" and </s> could conceivably summarize the information in the utterance, and propose to attend to such sinks rather than entire utterances to allow LLMs to perform long-context tasks with lower computational costs. Two learning strategies, SMR and LMR are introduced to encourage EoU tokens to carry key information from preceding utterances, and remember key information from previous EoU tokens. The authors evaluate on multiple dialog datasets, and demonstrate that the proposed approach can significantly improve computational complexity and make it possible to operate on longer dialogs.

Strengths

  1. The problem that the paper is tackling is important for the community at large, and the proposed approach seems intuitive and logical.
  2. The paper is original, well structured and written, and the experiments appear apt to substantiate the authors' claims.

Weaknesses

  1. The authors do not compare against plausible alternative approaches like infinity former or the compressive transformer, which are also memory based approaches. Comparisons to position interpolation based approaches with RoPE would also be interesting to see.

Questions

  1. Line 111: Yes, position interpolation based approaches may not provide for infinitely long sequences, but is that required for dialogue tasks? Perhaps this could be rephrased to make the authors point.

  2. Line 119 makes a leap in logic that is perhaps not fully intuitive. The authors claim that higher attention on EoU tokens suggests that these aggregate information. Did the authors test whether the responses of the LLM to conversations implied that high attentions on EoU tokens could possibly capture information from the preceding utterances ?

  3. Line 160: Did the authors consider restricting the attention for u' tokens to just the sink token, rather than sink tokens and previous tokens? The current phrasing seems to indicate the former over the latter.

Limitations

The authors are requested to add notes on limitations and potential negative social impacts. The additional training methods proposed may incur computational cost, and I encourage the authors to comment on this.

Author Response

Dear Reviewer 4bTd,

We sincerely thank you for your constructive suggestions and valuable feedback! We hope our response can help resolve your concerns.

More baselines are needed (Weakness 1).

Your kind advice has inspired us to conduct more comprehensive experiments by incorporating the recommended baselines: infinity former and two position interpolation-based approaches with RoPE, which are YaRN [1] and Dynamic NTK-RoPE [2]. Our method achieves favorable results compared to infinity former across all metrics and outperforms YaRN and Dynamic NTK-RoPE on some metrics. Since the lengths of the MSC and PersonaChat datasets are within the training length of Llama, it is reasonable that our method only shows advantages in some metrics relative to the full-attention RoPE position interpolation methods.

| Data | Method | PPL | BLEU | R-1 | R-2 | R-L | D-2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSC | Dense | 7.58 | 19.47 | 16.93 | 2.92 | 15.48 | 37.75 |
| MSC | YaRN | 7.84 | 19.68 | 14.99 | 2.37 | 11.18 | 42.29 |
| MSC | Dynamic NTK | 7.58 | 19.95 | 15.61 | 2.62 | 11.66 | 39.39 |
| MSC | Infinity former | 16.61 | 6.90 | 8.21 | 0.28 | 7.97 | 7.92 |
| MSC | Ours | 7.99 | 19.33 | 17.18 | 2.77 | 15.86 | 32.58 |
| PersonaChat | Dense | 8.41 | 13.15 | 13.98 | 3.07 | 13.44 | 41.61 |
| PersonaChat | YaRN | 8.25 | 13.28 | 13.09 | 2.85 | 12.15 | 41.89 |
| PersonaChat | Dynamic NTK | 8.24 | 13.40 | 13.21 | 3.00 | 12.30 | 41.99 |
| PersonaChat | Infinity former | 15.83 | 9.27 | 10.06 | 0.73 | 9.76 | 26.30 |
| PersonaChat | Ours | 8.71 | 13.63 | 13.96 | 3.05 | 13.43 | 37.23 |

Line 111: Yes, position interpolation based approaches may not provide for infinitely long sequences, but is that required for dialogue tasks? Perhaps this could be rephrased to make the authors point.

Ideally, our objective is to develop a lifelong dialogue system capable of continuous conversation while retaining memory of all past dialogues. The term infinitely long refers to the cumulative length of all utterances, not the length of a single utterance. Position interpolation-based approaches lack this capability.

Line 119 makes a leap in logic that is perhaps not fully intuitive. The authors claim that higher attention on EoU tokens suggests that these aggregate information. Did the authors test whether the responses of the LLM to conversations implied that high attentions on EoU tokens could possibly capture information from the preceding utterances ?

We tested and confirmed that high attention to EoU tokens effectively captures information from previous dialogues. The answer is yes, and we illustrate this with the following case study.

Using an untrained Llama-2-7B-Chat model, we restrict it during inference to focus only on EoU tokens from previous utterances and the last complete utterance. Given the input "Did you have a caramel macchiato today?</s>Yes!</s>What kind of coffee did you have today?</s>," the model responds with "I'm glad you asked! I had a delicious caramel macchiato this morning." This shows that the EoU tokens successfully capture the key information "caramel macchiato" from the first utterance.
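To make the restriction concrete, below is a minimal sketch of how such an EoU-only attention mask could be constructed; `conv_sink_mask` and `eou_token_id` are our own illustrative names, not an interface from the paper or any library.

```python
import torch

def conv_sink_mask(token_ids: torch.Tensor, eou_token_id: int) -> torch.Tensor:
    """Boolean mask (True = may attend): each query sees past EoU positions
    (the conv-attn sinks) plus the tokens of its own utterance, causally."""
    seq_len = token_ids.shape[0]
    is_eou = token_ids == eou_token_id
    # Utterance index of each position; it increments right after every EoU.
    utt_id = torch.cumsum(is_eou.long(), dim=0) - is_eou.long()
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_utt = utt_id.unsqueeze(0) == utt_id.unsqueeze(1)  # current utterance
    sink_cols = is_eou.unsqueeze(0).expand(seq_len, -1)    # conv-attn sink keys
    return causal & (same_utt | sink_cols)

# Toy check: id 9 plays the role of the "</s>" separator.
mask = conv_sink_mask(torch.tensor([3, 4, 9, 5, 9, 6, 7]), eou_token_id=9)
print(mask.int())
```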

Line 160: Did the authors consider restricting the attention for u' tokens to just the sink token, rather than sink tokens and previous tokens? The current phrasing seems to indicate the former over the latter.

There may be some misunderstanding due to our unclear phrasing. To clarify, in short-memory reconstruction, $u'$ can indeed only see the conv-attn sink token of $u$ to reconstruct $u$, and the conv-attn sink of $u$ can attend to $u$ to compress $u$'s information onto itself.

The authors are requested to add notes on limitations and potential negative social impacts. The additional training methods proposed may incur computational cost, and I encourage the authors to comment on this.

Thank you very much for your valuable suggestion. StreamingDialogue significantly reduces space and time complexity during the inference stage. Additionally, we can outperform the baseline under the non-training setting without additional cost.

To optimize LLMs for the conv-attn sinks mode, we implement two learning strategies: short-memory reconstruction and long-memory reactivation. Consequently, this inevitably increases computational costs under the training setting, with the SMR and LMR phases requiring about two hours on two A100-40G GPUs. We will include additional details on computational costs in the limitations section of the revision.

Thank you for your instant feedback and valuable revision advice. The points you raised are well worth pondering, and we are glad to discuss them further. We hope our response addresses your concerns.

References
[1] YaRN: Efficient Context Window Extension of Large Language Models

[2] https://github.com/jquesnelle/yarn/pull/1

Comment

I thank the authors for their responses to questions and comments. Based on their response, I will retain my score. However, I encourage the authors to consider the following:

  1. The authors test on MSC and PersonaChat, both of which do not exceed the max length of LLama. Therefore, I question the assertion that they develop an approach for theoretically infinite sequences. Since their approach performs worse than PI-based approaches on some metrics, does this mean that the approach does not work well enough for these sequence lengths? Additional comments or analysis on this aspect would be helpful.

  2. Regarding the case study, thanks for including it. However, there may be a more structured approach using multiple prompts to test this and obtain aggregate conclusions. I recommend that the authors repeat this over multiple prompts to demonstrate convincing evidence in the final paper.

Comment

Thank you for your feedback. We sincerely hope that the subsequent responses could resolve your concerns.

The authors test on MSC and PersonaChat, both of which do not exceed the max length of LLama. Therefore, I question the assertion that they develop an approach for theoretically infinite sequences. Since their approach performs worse than PI-based approaches on some metrics, does this mean that the approach does not work well enough for these sequence lengths? Additional comments or analysis on this aspect would be helpful.

For infinite sequences, PI-based approaches do not reduce the KV caches during inference, resulting in time and space complexity the same as dense attention. This makes them prone to out-of-memory errors and unsuitable for infinite texts. In contrast, our method has demonstrated stable performance even with lengths exceeding 25K tokens.

For texts within the training length of LLaMA, there is no need to use PI-based approaches on MSC and PersonaChat since PI-based approaches are designed for length extrapolation, i.e., when the inference length exceeds the training length. Additionally, these PI-based approaches employ dense attention, allowing them to attend to the full context. However, our method, as a sparse attention approach, can only attend to a small portion of the tokens, which reasonably explains why it might underperform compared to PI-based approaches on some metrics.
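As a rough illustration of the cache sizes involved (our own count, assuming $T$ utterances of $L$ tokens each, not figures from the paper):

$$\underbrace{O(TL)}_{\text{dense / PI-based}} \quad \text{vs.} \quad \underbrace{O(T)}_{\text{conv-attn sinks}} \quad \text{cached key-value pairs per layer.}$$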

Regarding the case study, thanks for including it. However, there may be a more structured approach using multiple prompts to test this and obtain aggregate conclusions. I recommend that the authors repeat this over multiple prompts to demonstrate convincing evidence in the final paper.

Thank you for your suggestion. We designed 10 prompt formats, each with 20 specific samples, limiting the inference to only see the last utterance and the dialogue history's conv-attn sinks. We used an untrained Llama-2-7B-Chat model for inference and tested the proportion of responses that accurately include key information.

Examples of prompt formats are as follows:

  1. "template": "A and B went to PLACE today.</s>They had a great time.</s>Who did A go to PLACE with today?</s>",

    "keywords": {"A": "person", "B": "person", "PLACE": "place"},

    "answer_key": "B"

  2. "template": "B made A's favorite food, FOOD, today.</s>A was delighted.</s>What food did B make for A today?</s>",

    "keywords": {"A": "person", "B": "person", "FOOD": "food"},

    "answer_key": "FOOD"

  3. "template": "A was doing ACTIVITY when B called.</s>A had to stop and answer the call.</s>What was A doing when B called?</s>",

    "keywords": {"A": "person", "B": "person", "ACTIVITY": "activity"},

    "answer_key": "ACTIVITY"

  4. "template": "A bought a new ITEM today.</s>B was impressed by A's purchase.</s>What item did A buy today?</s>",

    "keywords": {"A": "person", "B": "person", "ITEM": "item"},

    "answer_key": "ITEM"

  5. "template": "A participated in an EVENT today.</s>B cheered them on.</s>What event did A participate in?</s>",

    "keywords": {"A": "person", "B": "person", "EVENT": "event"},

    "answer_key": "EVENT"

The "keywords" will be replaced with specific content.

The test results showed that the proportion of responses accurately including key information was 68.00%, indicating that the EoU tokens indeed have the ability to aggregate information by drawing more attention. We will include these results in the revision.

Comment

I thank the authors for their responses. All my questions are addressed satisfactorily.

Comment

Thank you for responding to our rebuttal and recommending acceptance.

Official Review (Rating: 6)

The paper introduces a novel approach for encoding long conversations. Motivated by the results of StreamingLLM, the authors observe that the end-of-utterance (EOU) or separator tokens aggregate more attention than other tokens in a dialogue generation task. The authors refer to the EOU tokens as conv-att sinks (conversational attention sinks). Based on this observation, the authors propose to attend and cache only the conv-att-sinks of the past utterances to represent the dialogue history, thereby making the space complexity of the attention mechanism linear to the number of turns in a conversation. To learn quality embeddings for conv-attn sinks, the authors propose two auxiliary tasks, SMR (a response reconstruction task) and LMR (a response recall task). The proposed StreamingDialogue encoding strategy achieves comparable performance to dense attention (attention on all previous tokens) and outperforms memory-efficient baselines on Persona-Chat and MSC datasets. The method also exhibits 4× speedup while reducing memory usage by 18× compared to the dense attention strategy.

Strengths

  1. The paper is well-written and easy to understand. The proposed method is well-motivated and addresses an important problem of encoding long dialogue contexts.
  2. The proposed attention strategy of utilizing only the conv-attn sinks is simple and effective. StreamingDialogue shows better performance than memory-efficient baselines on both automated and human evaluation. The method also shows performance comparable to that of the dense attention strategy.
  3. Results suggest that the SMR and LMR help to learn a rich representation of the conv-attn sinks. The authors also show evidence (Fig. 7) that the model can recollect/generate past information from the previous conv-attn sinks.
  4. The method is cost-effective and can achieve significant speed-up compared to the dense attention strategy.

Weaknesses

  1. Although the results shown in Table 1 are positive, the metrics are not well-suited for the open-domain dialogue generation task. The authors have shown their results on two additional metrics (USL-H and Dial-M) in Table 6 for the MSC dataset. USL-H and Dial-M have been shown to be better metrics than BLEU, ROUGE, Distinct, and perplexity, especially for persona-grounded datasets like Persona-Chat. However, Table 1 does not show the results with USL-H and Dial-M. Also, in the case of MSC, StreamingDialogue performs better than StreamingLLM on the USL-H metric but not on the Dial-M metric. This is why I think that although the method is appealing, the results are not strong enough to support it.
  2. There is no information about inter-annotator agreement for the human evaluation.
  3. The experimental setup with the Persona-Chat and MSC dataset is not clear. The authors have not mentioned whether they used the persona profiles to generate the responses for the result of Table 1.
  4. The use of the BLEU metric is not consistent. The authors use average BLEU in Table 1 and Table 2. However, BLEU-1 and BLEU-2 are shown in Table 3, whereas only BLEU-1 is shown in Table 4. It would be better to show all three variations of BLEU in all the result tables.
  5. The results are shown only on two dialogue datasets. There are other datasets with long dialogue contexts like Topical-chat. MultiWOZ is another dataset for task-oriented dialogue systems, which includes lots of conversations where the user utterance directly refers to past dialogue history.

Questions

  1. Why did the authors not use USL-H and Dial-M for Table 1?
  2. Did the authors use the persona profiles to generate the responses? If yes, how was it included in the context? If not, is it fair to compare BigBird and StreamingLLM with StreamingDialogue?
  3. Explain the inconsistent use of the BLEU metric.
  4. Why are Layer 0 and Layer 1 shown in Fig. 1a, whereas Layer 28 is shown in Fig. 1b?
  5. Did the authors analyze the conv-attn sinks for the example in Fig. 7?

Limitations

yes

Author Response

Dear Reviewer 7W1D,

We sincerely thank you for your constructive suggestions and valuable feedback! We hope our response can help resolve your concerns.

Table 1 does not show the results with USL-H and Dial-M. Also, in the case of MSC, StreamingDialogue performs better than StreamingLLM on the USL-H metric but not on the Dial-M metric.

Thank you very much for your valuable suggestion. The experimental results using USL-H and Dial-M for Table 1 are shown in the below table:

| Data | Metric | Dense | Local | BigBird | StreamingLLM | MemBART | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSC | USL-H ↑ | 90.11* | 76.68* | 85.30* | 86.91* | 85.13* | 90.48 |
| MSC | Dial-M ↓ | 1.94* | 2.15* | 1.72^ | 1.71^ | 1.97* | 1.76 |
| PersonaChat | USL-H ↑ | 14.21* | 17.35* | 16.95* | 17.63* | 12.23* | 17.96 |
| PersonaChat | Dial-M ↓ | 2.38* | 2.07^ | 2.37* | 2.30* | 2.49* | 2.10 |

↑ indicates the higher score is better, while ↓ indicates the lower score is better. * indicates significance and ^ indicates insignificance.

Our method significantly outperforms all baselines across several metrics, including USL-H, PPL, BLEU, ROUGE, and Distinct. It also exceeds most baselines on Dial-M, but BigBird and StreamingLLM show only non-significant improvements over our method in Dial-M on MSC. Additionally, our method demonstrates significant improvement over all baselines in Dial-M when evaluated on two newly added datasets: Topical-Chat and MultiWOZ. These results confirm that our method is indeed better than the baselines.

There is no information about inter-annotator agreement for the human evaluation.

We apply Fleiss' kappa [1] to measure the agreement among four annotators, yielding a result of 52.51%. This indicates that the inter-annotator agreement is moderate ($\kappa \in [0.4, 0.6]$).
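For reference, a toy computation of Fleiss' kappa could look like the following; the ratings below are made up for illustration and are not the authors' annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per evaluated response, one column per annotator (4 annotators);
# entries are the category each annotator assigned. Values are illustrative only.
ratings = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [1, 0, 1, 1],
])
table, _ = aggregate_raters(ratings)   # item-by-category count table
print(f"Fleiss' kappa = {fleiss_kappa(table):.4f}")
```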

Did the authors use the persona profiles to generate the responses? If yes, how was it included in the context? If not, is it fair to compare BigBird and StreamingLLM with StreamingDialogue?

All of the baselines and our method did not use persona profiles to generate the responses. Therefore, our experimental comparison is fair.

The use of the BLEU metric is not consistent. It would be better to show BLEU, BLEU-1 and BLEU-2 in all the result tables.

Thank you for your suggestion. We have reported BLEU, BLEU-1 and BLEU-2 in all the result tables as follows:

| Data | Method | BLEU | BLEU-1 | BLEU-2 |
| --- | --- | --- | --- | --- |
| PersonaChat | Dense | 13.15 | 49.30 | 20.05 |
| PersonaChat | Local | 13.01 | 50.78 | 20.13 |
| PersonaChat | Big Bird | 12.93 | 50.00 | 20.52 |
| PersonaChat | StreamingLLM | 13.16 | 50.15 | 20.68 |
| PersonaChat | MemBART | 11.18 | 46.63 | 17.65 |
| PersonaChat | Ours | 13.63 | 51.27 | 20.77 |
| MSC | Dense | 19.47 | 52.22 | 28.41 |
| MSC | Local | 13.34 | 41.14 | 20.44 |
| MSC | Big Bird | 16.54 | 46.63 | 24.77 |
| MSC | StreamingLLM | 16.76 | 47.54 | 25.08 |
| MSC | MemBART | 17.11 | 49.78 | 25.82 |
| MSC | Ours | 19.33 | 51.49 | 28.12 |

Table 1: Main results on the PersonaChat and MSC datasets.

| Model | BLEU | BLEU-1 | BLEU-2 |
| --- | --- | --- | --- |
| Ours | 19.33 | 51.49 | 28.12 |
| Base | 17.32 | 47.41 | 25.61 |
| LMR | 18.87 | 50.83 | 27.76 |
| SMR | 18.25 | 49.45 | 26.84 |

Table 2: Ablation results on MSC with different learning strategies.

| Method | BLEU | BLEU-1 | BLEU-2 |
| --- | --- | --- | --- |
| StreamingLLM | 20.16 | 51.18 | 29.99 |
| Ours | 20.19 | 51.55 | 30.03 |

Table 3: Results under the non-training setting on the MSC test set.

| BLEU | BLEU-1 | BLEU-2 |
| --- | --- | --- |
| 68.02 | 89.19 | 76.83 |

Table 4: Dialogue reconstruction performance.

We will include these results in the revision. Thank you for the great advice!

More datasets are needed (Weakness 5)

Thank you for the constructive feedback.

We have conducted experiments on Topical-Chat and MultiWOZ. The results are shown in the table below.

| Data | Method | PPL | ROUGE-1 | ROUGE-2 | ROUGE-L | Dial-M |
| --- | --- | --- | --- | --- | --- | --- |
| Topical-Chat | Dense | 9.49 | 15.70 | 3.65 | 14.88 | 3.09 |
| Topical-Chat | Local | 27.55 | 12.60 | 2.09 | 10.37 | 7.02 |
| Topical-Chat | Big Bird | 10.36 | 14.21 | 3.55 | 11.79 | 3.01 |
| Topical-Chat | StreamingLLM | 10.34 | 14.25 | 3.55 | 11.84 | 3.05 |
| Topical-Chat | MemBART | 12.54 | 13.86 | 2.98 | 13.18 | 2.83 |
| Topical-Chat | Ours | 9.80 | 15.46 | 3.99 | 14.37 | 2.66 |
| MultiWOZ | Dense | 4.51 | 24.79 | 13.93 | 24.67 | 2.27 |
| MultiWOZ | Local | 5.38 | 24.26 | 13.47 | 24.15 | 2.45 |
| MultiWOZ | Big Bird | 4.79 | 24.38 | 13.26 | 24.30 | 2.51 |
| MultiWOZ | StreamingLLM | 4.76 | 23.66 | 13.09 | 23.41 | 2.47 |
| MultiWOZ | MemBART | 5.36 | 20.05 | 12.41 | 19.94 | 2.37 |
| MultiWOZ | Ours | 4.34 | 25.26 | 14.27 | 25.20 | 2.25 |

Our method outperforms all strong baselines due to its ability to retain more complete historical information.

We will include these results in the revision.

Why are Layer 0 and Layer 1 shown in Fig. 1a, whereas Layer 28 is shown in Fig. 1b?

We have added the Layer 0 and Layer 1 attention maps for Fig. 1b, as detailed in Figs. (a)-(d) of the PDF in global response.

Did the authors analyze the conv-attn sinks for the example in Fig. 7?

In the global response, Figs. (e)-(f) of the PDF illustrate the model's attention to conv-attn sinks for the example shown in Fig. 7. During the inference stage, only the key-values of the conv-attn sinks are retained. In the case study of Fig. 7, the generated response requires the critical information "California" from the 12th and 14th utterances. We observed that more attention is allocated to the conv-attn sinks of the 12th and 14th utterances, indicating that during inference the model focuses more on the conv-attn sinks that are useful for generation.

Thank you once again for your valuable time and constructive feedback to improve our paper. We are eager to address any additional questions or concerns that you may have.

If your concerns have been addressed, we would be very grateful if you could raise the score.

References
[1] Measuring nominal scale agreement among many raters (Fleiss, J. L., Psychological Bulletin 1971)

Comment

Thank you for the detailed response. I appreciate the effort in sharing the results on Topical-Chat and MultiWOZ datasets. I have updated my scores. However, I still have the following questions and concerns.

  1. For Topical-Chat, did you use the grounding knowledge to generate the response?
  2. For MultiWOZ, did you use the belief states to generate the response?
  3. In the Persona-Chat dataset, the users pick one or more persona from their assigned profile to generate the responses. Now, if the generation of the response is not conditioned on the persona, then the model tends to produce responses that reduce perplexity. As a result, even though the response is not persona-grounded, it may achieve better BLEU scores. So, in my opinion, grounding knowledge should be included in the dialogue context for any kind of knowledge-grounded response generation task. Otherwise, it does not provide the complete picture. This is why I am still skeptical about the soundness of the result.

Comment

Thank you for your feedback. We sincerely hope that the subsequent responses could resolve your concerns.

For Topical-Chat, did you use the grounding knowledge to generate the response? For MultiWOZ, did you use the belief states to generate the response?

We did not use either the grounding knowledge from Topical-Chat or the belief states from MultiWOZ for generation.

Grounding knowledge should be included in the dialogue context for any kind of knowledge-grounded response generation task.

Thank you for your comment. We have conducted experiments under the setting that includes grounding knowledge.

For MultiWOZ, we added the belief states before each corresponding utterance. The results are shown in the table below.

| Method | PPL | BLEU | BLEU-1 | BLEU-2 | Distinct-1 | Distinct-2 | Distinct-3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dense | 1.92 | 25.56 | 48.33 | 29.14 | 3.74 | 6.86 | 8.89 |
| StreamingLLM | 2.19 | 25.70 | 47.53 | 29.21 | 4.48 | 9.09 | 12.60 |
| Ours | 1.98 | 25.77 | 48.58 | 29.38 | 5.30 | 10.03 | 13.60 |

Since our method retains historical information by compressing each utterance's information into conv-attn sinks, only the conv-attn sinks from the previous utterances will be attended to in subsequent utterances. Therefore, for Topical-Chat and Persona-Chat, we considered two settings:

  1. We treated each sentence of the grounding knowledge/persona profiles as an utterance, and the subsequent utterances could only attend to their conv-attn sinks. The results are shown in the table below.
| Data | Method | PPL | Distinct-2 | Distinct-3 | Dial-M |
| --- | --- | --- | --- | --- | --- |
| PersonaChat | Dense | 7.19 | 43.56 | 66.27 | 2.53 |
| PersonaChat | StreamingLLM | 8.36 | 33.17 | 53.58 | 2.47 |
| PersonaChat | Ours | 7.60 | 39.16 | 61.06 | 2.36 |
| Topical-Chat | Dense | 3.24 | 39.07 | 57.64 | 4.32 |
| Topical-Chat | StreamingLLM | 8.31 | 16.87 | 23.56 | 3.72 |
| Topical-Chat | Ours | 3.20 | 31.47 | 49.10 | 2.57 |
  2. We used the grounding knowledge/persona profiles as a prompt: "The conversation will be based on the following knowledge: <knowledge> {detailed knowledge} <conversation>" in Topical-Chat and "The conversation will be based on the following persona profile: <persona> {detailed persona profiles} <conversation>" in Persona-Chat, allowing the subsequent utterances to fully attend to it. The results are shown in the table below.
| Data | Method | PPL | Distinct-2 | Distinct-3 | Dial-M |
| --- | --- | --- | --- | --- | --- |
| PersonaChat | Dense | 7.93 | 44.26 | 66.63 | 2.48 |
| PersonaChat | StreamingLLM | 7.99 | 36.40 | 57.44 | 2.91 |
| PersonaChat | Ours | 7.67 | 37.82 | 58.93 | 2.57 |
| Topical-Chat | Dense | 11.64 | 36.98 | 54.96 | 4.60 |
| Topical-Chat | StreamingLLM | 30.37 | 26.07 | 34.26 | 3.61 |
| Topical-Chat | Ours | 10.21 | 32.16 | 50.41 | 2.97 |

In the setting that includes grounding knowledge, our method consistently retains memory of both grounding knowledge and historical dialogue. As a result, our method still outperforms the baseline, except for dense attention. As an efficient algorithm, our method can significantly improve speed compared to dense attention while maintaining the contextual and character consistency of long conversations.

Comment

Thanks for conducting the additional experiments. I have one final question. Could you please elaborate on the process of including the grounding knowledge in the context? Do you include the knowledge only once or update the knowledge after each turn?

Comment

Thank you for your feedback. In all datasets, the knowledge is included only once. Specifically, for Topical-Chat and Persona-Chat, we concatenate the grounding knowledge related to each conversation at the beginning of that conversation. For MultiWOZ, we prepend the belief state of each utterance to the beginning of that utterance.

Comment

Thank you for the response. I have updated my scores.

Comment

Thank you for responding to our rebuttal and raising your score.

Author Response

We greatly appreciate the time and effort all reviewers devoted to reviewing our paper and providing detailed, constructive feedback. The reviewers' insights and queries have played a crucial role in helping us refine our research. We have thoughtfully considered feedback from all reviewers and hope these responses address their concerns.

We have included a PDF in the global response to address the concern raised by Reviewer 7W1D.

Final Decision

This paper builds on an interesting observation regarding the model's attention to the end-of-utterance (EoU) units and proposes compressing dialogue turns into EoUs. Experiments demonstrate the benefit of the proposed method on two dialogue datasets, and reviewer-suggested extensions to two additional dialogue datasets show similar results.