PaperHub

Rating: 5.3/10 · Rejected (4 reviewers)
Individual ratings: 6, 5, 5, 5 (min 5, max 6, std 0.4)
Confidence: 3.0 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants

Submitted: 2024-09-18 · Updated: 2025-02-05


Keywords

LLM-based agent, memory, evaluation, personal assistant

Reviews and Discussion

Official Review
Rating: 6

This paper presents a novel method, MemSim, for constructing reliable QA datasets to evaluate the memory capability of LLM-based personal assistants. A Bayesian Relation Network and a Causal Generation Mechanism are introduced to ensure the diversity, reliability, and scalability of the generated datasets. Based on MemSim, a dataset named MemDaily is constructed. Extensive experiments are conducted to assess the quality of the dataset, as well as to evaluate different memory mechanisms of LLM-based agents.

Strengths

  • This paper is well-written. The challenges of constructing datasets while considering reliability, diversity, and scalability set up a good motivation for the paper.
  • Both theoretical and experimental proofs are included to validate the effectiveness of the proposed method.
  • When generating the QAs, different types of QAs are considered. This ensures the diversity and coverage of the QAs.
  • The experiments are comprehensive: 1) Variations of the MemDaily dataset are also considered. 2) Different memory mechanisms are evaluated on the proposed dataset.

Weaknesses

In the proposed method, the user messages are factual statements. And the constructed question-answers mainly focus on the entities/attributes. Each message in the trajectory seems to be independent. And there is no coreference between messages, no ambiguity of the user message. These greatly simplify the problem of evaluating personal assistants in a real-world scenario.

Questions

  1. In section 4.1 -- evaluation of user profiles, it will be good to mention the number of total generated user profiles. In addition, is each user profile evaluated by a single evaluator or multiple evaluators?
  2. When constructing the MemDaily dataset, it is not clear to me how the trajectory is constructed.
  3. It takes a while to understand the sentence in line 430 -- "Another baseline method that directly ... performs much lower reliability. We implement this method ... as OracleMem ...". The results of OracleMem are compared with the results in Table 5, rather than the other methods in Table 6. It will be better to make this clear.
  4. When evaluating the MemDaily dataset in section 4.3, how is the retrieval target obtained? According to section 4.3, the retrieval target seems to be part of the groundtruth when constructing the dataset.
  5. Related to question 4, in Table 6, are there any insights on why the performance of the OracleMem is much worse in some types of QAs? The OracleMem uses the targeted user message, which is not available when testing the memory mechanism of other LLM-based agents. Therefore, shall we say that the results of the OracleMem are the upper bound of the model performance in this dataset? For the Aggregative type of question, the accuracy is 0.376. Is there another way that can further improve this performance?
  6. In line 485, "LLM directly uses the LLM to ... ", are the candidate messages and the question provided to LLM and let LLM decide the top-k relevant message?
Comment

For Question 5: "Related to question 4, in Table 6, are there any insights on why the performance of the OracleMem is much worse in some types of QAs? The OracleMem uses the targeted user message, which is not available when testing the memory mechanism of other LLM-based agents. Therefore, shall we say that the results of the OracleMem are the upper bound of the model performance in this dataset? For the Aggregative type of question, the accuracy is 0.376. Is there another way that can further improve this performance?"

Response:

We are very grateful that you noticed this, and it is also an unexpected result we have discovered. Due to the page limitation, we do not discuss it in depth in our paper, only mentioning it on line 480. I'm very pleased to discuss the findings.

As you say, OracleMem should be the upper bound of the model performance on this dataset. However, this rests on an important assumption: that LLM performance increases monotonically with information quality. Here, information quality covers factors such as whether the provided context contains the answer and how much noise it contains. This assumption is difficult to guarantee due to many other factors, such as pre-training processes and sensitivity to prompt length.

Specifically, we suspect that LLMs have a preference regarding prompt length, meaning that prompts that are too short or too long may reduce the reasoning ability of LLMs. Comparing MemDaily-vanilla and MemDaily-100, we find that OracleMem obtains the best performance among all the QA types on MemDaily-100, but performs worse than other baselines on MemDaily-vanilla. This observation indirectly supports our suspicion: the noise in MemDaily-vanilla may increase the prompt length to a suitable level and thus improve LLM performance, while the noise in the other datasets may make the prompt too long and decrease performance. We are also conducting further studies on this phenomenon.

As for the Aggregative type of QA, I think introducing reasoning processes into the memory mechanism can be a possible solution, because we find the open-source foundation model (GLM-4-9B) cannot adequately address this problem.

For Question 6: "In line 485, "LLM directly uses the LLM to ... ", are the candidate messages and the question provided to LLM and let LLM decide the top-k relevant message?"

Response:

Thanks for your question. Yes, we provide the question and the candidate messages, integrating them into a single prompt for an LLM. Then we let the LLM output the top-k relevant messages. We will add a more detailed description of this part, and we thank you for this valuable suggestion.
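As an illustrative sketch only (the prompt wording and the `call_llm` helper below are assumptions, not the exact implementation in our repository), this LLM-as-retriever baseline can be pictured as follows:

```python
# Illustrative sketch of the LLM-based retriever baseline; the prompt wording
# and the `call_llm` helper are assumptions, not the exact implementation.
def llm_retrieve_topk(call_llm, question, messages, k=5):
    numbered = "\n".join(f"[{i}] {m}" for i, m in enumerate(messages))
    prompt = (
        f"Question: {question}\n"
        f"Candidate user messages:\n{numbered}\n"
        f"Return the indexes of the {k} messages most relevant to the question, "
        f"as a comma-separated list such as 3,7,12."
    )
    reply = call_llm(prompt)  # call_llm: any text-in, text-out LLM wrapper
    indexes = [int(tok) for tok in reply.replace(" ", "").split(",") if tok.isdigit()]
    return indexes[:k]
```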

We sincerely thank you for your time in reviewing our paper, and we also thank you for your insightful comments, which we believe are very important for improving our paper. We hope our responses can address your concerns. If you have further questions, we are very happy to discuss them.

Comment

For Question 3: "It takes a while to understand the sentence in line 430 -- "Another baseline method that directly ... performs much lower reliability. We implement this method ... as OracleMem ...". The results of OracleMem are compared with the results in Table 5, rather than the other methods in Table 6. It will be better to make this clear."

Response:

Thanks for your question. I agree that the descriptions here are not clear enough. In fact, there are mainly two approaches to generating user messages and corresponding QA questions.

The naive approach adopts the pipeline like "message --> question --> answer". First of all, it generates or collects some user messages. Then, it lets an LLM generate questions based on these messages. Finally, it makes the LLM generate correct answers based on the user messages and questions. Although this method is simple, the accuracy of the answers depends on the performance of the LLM, which makes the difficulty of constructing and solving the Q&A the same. Therefore, OracleMem is actually the process that generates answers given messages and questions, which is identical to this construction approach. That is why we consider it as a baseline to compare in Table 5.

We build MemDaily with the other approach: "prior knowledge --> question & answer --> message". First of all, we generate questions and answers based on constructed prior information (such as user attributes). Then, we create user messages by injecting the answers together with other information. This construction method makes it easier to construct QAs than to solve them, and MemDaily is an example of such an approach.

For Question 4: "When evaluating the MemDaily dataset in section 4.3, how is the retrieval target obtained? According to section 4.3, the retrieval target seems to be part of the ground truth when constructing the dataset."

Response:

Thanks for your question. Exactly, the retrieval target is part of the ground truth when constructing the dataset. Based on our causal generation mechanism, we can obtain the retrieval target during the dataset construction. As our approach can be described as a pipeline like "prior knowledge --> question & answer --> message", we are able to mark which message contains the answer information. Specifically, in Section 3.3 "Causal Generation Mechanism", we construct informative hints as a bridge between messages and detailed information, where the answer information is contained in some specific hints. Each hint will be transformed into a piece of user message. Therefore, we can obtain the retrieval target by using these hints to find the message indexes that contain answer information. More details are shown in Section 3.3, and we also provide the implementation in our anonymous repository for reference.
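As a minimal sketch of this bookkeeping (the field and function names are hypothetical, not those in our repository): since each informative hint becomes exactly one user message, the indexes of the hints that carry answer information are exactly the indexes of the retrieval-target messages.

```python
# Hypothetical sketch: each informative hint becomes one user message, so the
# hints that carry answer information directly give the retrieval target.
def build_messages_and_target(hints, render_message):
    """hints: list of dicts like {"text": ..., "contains_answer": bool};
    render_message: turns one hint into one user message string."""
    messages, retrieval_target = [], []
    for idx, hint in enumerate(hints):
        messages.append(render_message(hint))
        if hint["contains_answer"]:
            retrieval_target.append(idx)  # message index holding answer info
    return messages, retrieval_target
```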

Comment

Dear Reviewer Cyvc,

Thanks so much for your precious time in reading and reviewing our paper, and we are encouraged by your positive feedback. In the following, we try to alleviate your concerns one by one:

For Weakness: "In the proposed method, the user messages are factual statements. And the constructed question-answers mainly focus on the entities/attributes. Each message in the trajectory seems to be independent. And there is no coreference between messages, no ambiguity of the user message. These greatly simplify the problem of evaluating personal assistants in a real-world scenario."

Response:

Thanks for your concerns. We strongly agree that simple QAs that only rely on factual statements can greatly simplify the evaluation. Therefore, in order to address this problem, we have designed six different types of QAs to enhance the difficulty of evaluation:

  • Simple QAs: Rely on one factual message to answer the question directly.
  • Conditional QAs: Require multiple messages to answer the question jointly.
  • Comparative QAs: Compare two entities on a shared attribute with multiple messages.
  • Aggregative QAs: Aggregate messages about more than two entities on a common attribute.
  • Post-processing QAs: Involve extra reasoning steps to answer with multiple messages.
  • Noisy QAs: multi-hop QAs with additional irrelevant noisy texts inside questions.

We provide more details on how to construct these QA types in Section 3.3 and Section 3.4.

To further increase the difficulty and provide multiple difficulty levels for our dataset, we collect question-irrelevant posts from social media platforms and randomly incorporate them into user messages by controlling their proportions. Specifically, we denote MemDaily-vanilla as the vanilla and easiest variant without extra additions, and create a series of MemDaily-η variants, where η represents the inverse percentage of original user messages. Larger η indicates a higher level of difficulty in the benchmark. More details can be found in Section 5.
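As a rough sketch of how such a MemDaily-η variant can be assembled (the sampling details below are simplified assumptions rather than our actual script), original messages end up being roughly a 1/η fraction of the trajectory:

```python
import random

def build_memdaily_eta(original_messages, noise_posts, eta, seed=0):
    """Simplified assumption: interleave about (eta - 1) question-irrelevant
    posts per original user message, keeping the original message order."""
    rng = random.Random(seed)
    mixed = []
    for msg in original_messages:
        mixed.append(msg)
        mixed.extend(rng.choices(noise_posts, k=eta - 1))  # sample with replacement
    return mixed
```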

As for the characteristics of the dataset, our design of factual statements is based on the actual assessment needs of the industry department, which believes that straightforward statements are closer to the real data in their application (remembering factual information and answering questions later). Moreover, since users typically have long intervals between statements to a personal assistant in their scenario, most messages do not explicitly mention connections to earlier ones.

We thank you for the valuable suggestions again!

For Question 1: "In section 4.1 -- evaluation of user profiles, it will be good to mention the number of total generated user profiles. In addition, is each user profile evaluated by a single evaluator or multiple evaluators?"

Response:

Thanks for your question. In section 4.1, we evaluate fifty generated user profiles (line 985 of Appendix E.1), and we provide more details and case studies about the generated user profiles in Appendix E.1. Additionally, each user profile is evaluated by all six human evaluators, and we provide the standard deviation among all these human-based scores in Table 3 for evaluating user profiles.

We thank you for the valuable suggestions, and we will place them in a more prominent position in the main text.

For Question 2: "When constructing the MemDaily dataset, it is not clear to me how the trajectory is constructed."

Response:

Thanks for your question. We utilize our proposed MemSim framework to construct the MemDaily dataset, where we describe the pipeline in Section 3.1. First, we develop the BRNet (details in Section 3.2) to model the probability distribution of users’ relevant entities and attributes, enabling the sampling of diverse hierarchical user profiles. Then, we introduce a causal mechanism (details in Section 3.3) to generate user messages and construct reliable QAs based on these sampled profiles. We design various types of QAs for comprehensive memory evaluation, including single-hop, multi-hop, comparative, aggregative, and post-processing QAs, incorporating different noises to simulate real-world environments.

Specifically, for the trajectory ξ = (M, q, a, a′, h): M is the list of user messages described in the part "Construction of User Messages" of Section 3.3; q and a are the question and answer described in the part "Construction of Questions and Answers" of Section 3.3; a′ is the set of confusing choices for the multiple-choice question described in line 264; and h is the retrieval target described in the part "Construction of Questions and Answers" of Section 3.3. After obtaining the above materials, we get one trajectory. We have also provided detailed case trajectories for different QA types in Appendix E.3.
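For concreteness, one trajectory can be pictured as a record like the following sketch (the field names are illustrative; actual cases are given in Appendix E.3):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    """One trajectory ξ = (M, q, a, a', h); field names are illustrative only."""
    messages: List[str]              # M: the list of user messages
    question: str                    # q: the question about factual information
    answer: str                      # a: the ground-truth textual answer
    confusing_choices: List[str]     # a': distractor options of the multiple-choice question
    retrieval_target: List[int] = field(default_factory=list)  # h: indexes of messages containing the answer
```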

If you have any other questions, please feel free to comment, and we are honored to reply.

Official Review
Rating: 5

This paper proposes a method for automatically constructing reliable, diverse, and scalable QAs, MemSim. Based on a Bayesian simulator, MemSim first builds a Bayesian relation network and then designs a causal generation mechanism to produce various types of QAs, including single-hop, multi-hop, comparative, aggregative, and post-processing. Finally, this paper uses MemSim to generate a dataset named MemDaily to evaluate the memory mechanisms in LLM agents.

Strengths

  1. This paper is logically clear. It illustrates the BRNet with theoretical proof and uses precise notation to describe the causal generation mechanism.
  2. Using BRNet, this method eliminates the hallucination problem caused by LLM generation. Additionally, the causal generation mechanism guarantees the diversity of the datasets.
  3. Using this dataset, this paper evaluates different memory mechanisms of LLM agents and analyzes the results, which are insightful for future agent design.

Weaknesses

  1. There is a lack of comparison between MemSim and existing methods for QA generation. For example, generate personal questions through LLMs or build a personal KB and let the LLM generate messages based on the entities and relations. And then, evaluate the datasets using metrics in section 4.
  2. The difference between BRNet and a KB is not clear. In section 2, this paper claims that KBQA mainly focuses on common-sense questions. However, building a personal or anonymous KB and allowing LLMs to generate datasets based on triples in the KB is also feasible for personal questions.

Questions

  1. In section 3.2, how is the joint probability distribution of X and the conditional probability distribution of x determined? Do they need to be manually defined?
  2. In MemDaily, there are only 11 entities and 73 attributes. Are these hand-crafted? If it is hand-crafted, how does the dataset scale up?

Ethics Concerns

no

Comment

For Weakness 2: "The difference between BRNet and a KB is not clear. In section 2, this paper claims that KBQA mainly focuses on common-sense questions. However, building a personal or anonymous KB and allowing LLMs to generate datasets based on triples in the KB is also feasible for personal questions."

Responses:

Thanks for your comment. We propose the Bayesian Relation Network (BRNet) based on the Bayesian Network (a type of probabilistic graphical model that can be considered a KB) in order to meet the requirement of creating various user profiles. However, I think the major differences between MemSim and KBQA methods are as follows:

In conventional KBQA evaluations, a knowledge graph is typically provided as retrieval support [1]. As the reviewer says, building a KB to generate personal questions is feasible. However, for LLM-based personal assistants, users do not provide a knowledge graph to the personal assistant. Instead, they commonly provide factual information through messages, which do not have the structure of KBs. This makes it challenging to directly evaluate LLM-based agents using existing KBQA methods, as it requires reliably injecting structured information into unstructured user messages. That is also the problem that our causal generation mechanism aims to address.

For Question 1: "In section 3.2, how is the joint probability distribution of X and the conditional probability distribution of x determined? Do they need to be manually defined?"

Response:

Thanks for your question. In our paper, we only need the conditional probability distributions among variables and utilize ancestral sampling to obtain different user profiles, instead of calculating the complex, high-dimensional joint probability distribution. These conditional probability distributions should be introduced into BRNet as prior knowledge before sampling user profiles. They are determined by the specific scenario, for example, the daily-life scenario in our paper; different scenarios lead to different conditional probabilities of x. Specifically, they are introduced in two main ways. The first is analyzing real-world data collected from deployment platforms (such as the mobile personal assistants in our work). This approach is suitable for conditional probability distributions that are highly relevant to specific scenarios. For example, in the scenario of personal assistants, there might be a 50% chance of having a high income within the Ph.D. group. The second is manual definition, which can also involve the help of LLMs (although the results should be checked). This approach is suitable for conditional probability distributions related to common sense. For example, from a physiological perspective, Alice's aunt is 100% female and 0% male. In our work, we combine the two approaches mentioned above with support from the industry department, for which we are very grateful.
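As a minimal sketch of what ancestral sampling over these conditional distributions looks like (the toy network and variable names below are hypothetical, not the actual BRNet tables), each vertex is visited in topological order and sampled given its already-sampled parents:

```python
import random

# Hypothetical toy fragment of a BRNet: each vertex lists its parents and a
# conditional distribution P(x | parents) as a table keyed by parent values.
CPDS = {
    "education": {"parents": [], "table": {(): {"PhD": 0.2, "Bachelor": 0.8}}},
    "income": {"parents": ["education"],
               "table": {("PhD",): {"high": 0.5, "medium": 0.5},
                         ("Bachelor",): {"high": 0.2, "medium": 0.8}}},
}

def ancestral_sample(cpds, topo_order, rng=random):
    """Sample one user profile by visiting vertices in topological order."""
    profile = {}
    for var in topo_order:  # parents are always sampled before their children
        parent_values = tuple(profile[p] for p in cpds[var]["parents"])
        dist = cpds[var]["table"][parent_values]
        values, weights = zip(*dist.items())
        profile[var] = rng.choices(values, weights=weights, k=1)[0]
    return profile

print(ancestral_sample(CPDS, ["education", "income"]))
```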

For Question 2: "In MemDaily, there are only 11 entities and 73 attributes. Are these hand-crafted? If it is hand-crafted, how does the dataset scale up?"

Response:

Thanks for your question. MemDaily is used to evaluate the memory of LLM-based personal assistants, and all 11 entities and 73 attributes are derived from data analysis of real-world platforms. Actually, there are many ways to scale the dataset up. First of all, for user profiles, each attribute corresponds to a value space, which includes a large number of values. For example, the attribute "Hometown" includes tens of major cities as its discrete choice space, and the attribute "Item Description" corresponds to a sentence space for value generation. Therefore, 73 attributes can lead to a large variety of user attribute combinations, and these attributes cover key aspects of users' daily lives. Moreover, new attributes can be easily introduced into BRNet by adding new vertices, edges, and conditional probability distributions, which makes it scalable to new requirements of the scenario. The second way is creating more complex types of QAs. As we demonstrate in Section 3, we have designed five different types of QAs for data generation, including single-hop, multi-hop, comparative, aggregative, and post-processing. We combine different attributes to create more complex QAs, thus scaling up the dataset. Finally, extra noise has also been introduced to scale up our dataset, as described in Section 3.3 and Section 5.1.

We sincerely thank you for your time in reviewing our paper, and we also thank you for your insightful comments, which we believe are very important for improving our paper. We hope our responses can address your concerns. If you have further questions, we are very happy to discuss them.

Comment

Dear reviewer o28W,

Thanks so much for your precious time in reading and reviewing our paper. In the following, we try to alleviate your concerns one by one:

For Weakness 1: "There is a lack of comparison between MemSim and existing methods for QA generation. For example, generate personal questions through LLMs or build a personal KB and let the LLM generate messages based on the entities and relations. And then, evaluate the datasets using metrics in section 4."

Response:

Thanks for your comment. In fact, we have compared the performance of these two types of baselines in our paper, and I agree that the descriptions here are not clear enough. So we provide a more detailed description as follows.

(1) Generate Personal Questions Through LLMs

Generating personal questions through LLMs is a usual approach for QA construction. We have discussed this baseline in lines 430 to 431 of our paper as OracleMem and put the results in Table 6. This approach commonly adopts a pipeline like "message --> question --> answer". First of all, it generates or collects some user messages. Then, it lets an LLM generate questions based on these messages. Finally, it makes the LLM generate correct answers based on the user messages and questions. Although this method is simple, the accuracy of the answers depends on the performance of the LLM, which makes the difficulty of constructing and solving the QAs the same. Therefore, OracleMem is actually the process that generates answers given messages and questions, which is identical to this construction approach. That is why we make it one baseline to compare with our method. Using the metrics in Section 4, its results can be converted as follows:

Question Types | Answer | Retrieval Target
Simple | 0.966 | 0.888
Conditional | 0.988 | 0.851
Comparative | 0.910 | 0.947
Aggregative | 0.376 | 0.544
Post-processing | 0.888 | 0.800
Noisy | 0.984 | 0.846
Average | 0.852 | 0.813

We can see that the average scores of 0.852 (answer) and 0.813 (retrieval target) cannot ensure the correctness of automatic data generation, which would require further human annotation to refine.

(2) Build A Personal KB and Let the LLM Generate Messages based on the Entities and Relations

We conduct this comparison in Section 4.2. We provide more details about our baselines.

  • ZeroCons: No constraints on attributes when prompting LLMs. We just let the LLM generate user messages freely, without any constraints. From a probabilistic perspective, this is essentially an independent sampling process m_i ~ P(M).
  • PartCons: Partial attributes of user profiles are constrained in prompts for LLMs. We provide a user profile and let the LLM generate user messages that should refer to part of the attributes of the user profile. From a probabilistic perspective, this is essentially a partial conditional sampling process m_i ~ P(M|X_i).
  • SoftCons: Full attributes of user profiles are constrained in prompts, but they are not forcibly required in the generated messages. We provide a user profile and let the LLM generate user messages that should refer to all attributes of the user profile. From a probabilistic perspective, this is essentially a full conditional sampling process m_i ~ P(M|X_1, X_2, ..., X_n).

In fact, SoftCons is the method that builds a personal KB and lets the LLM generate messages based on the entities and relations. Generating user messages by incorporating full user profiles is a common method in most recent works. What we want to emphasize here is that while these baselines are capable of generating user messages fairly well, they are not subjected to strict constraints. In contrast, our method requires both the integration of specific attributes into user messages and the guarantee that questions are answerable with established ground truths based on the shared hints. It imposes the strictest constraints to ensure that the answer can be accurately injected into user messages. Generally, a higher constraint commonly means a sacrifice of fluency and naturalness, because it compulsively imposes certain information to benefit QA construction.

Comment

Thanks for your response. I have no further questions.

Comment

Dear reviewer o28W,

Thanks very much for your kind reply. We believe your comments are very important to improve our paper. If our responses have alleviated your concerns, is it possible to consider adjusting your score?

We sincerely thank you for your time in reviewing our paper and our responses.

Official Review
Rating: 5

The paper proposes a data generator based on a Bayesian Relation Network and a causal generation mechanism, aiming to use LLMs to simulate users (i.e., generate user profiles / attributes) and generate evaluation datasets (i.e., generate lots of user descriptions based on previously sampled profiles). Furthermore, the paper evaluates the quality of the collected dataset, MemDaily, and provides a performance analysis using GLM-4-9B.

Strengths

  1. The diversity, scalability, and reliability of generated datasets can be improved since the method mainly relies on LLMs to automatically construct the data, and the user descriptions (a.k.a. messages) related to the user profile are highly controllable.
  2. The dataset considers several practical usage cases, including single-hop QA, multi-hop QA, comparative QA, aggregative QA and post-processing QA.

Weaknesses

  1. Lots of statements are not convincing. There are several examples: 1) In line 64, the authors claim the work is the first to evaluate the memory of LLM-based personal assistants in an objective and automatic way; there are many studies evaluating memory usage in objective and automatic ways, such as [1]. 2) In line 99, the authors keep emphasizing the importance of "objective" evaluation, since they think human annotators introduce bias. However, my personal concern is that LLMs also have bias, just like humans. In this way, I would not say the objective evaluation is guaranteed.

  2. Baselines are too weak and experimental results are not convincing. The baselines used in both Sections 4.1 and 4.2 are too weak, and there are no implementation details. For example, it is not hard to design a multi-stage but simpler prompting strategy to generate user profiles instead of designing such complex sampling mechanisms. Even so, the performance gap in Table 4 is not significant; the simple baselines lead to better performance on most metrics, and no detailed analysis is provided.

  3. The value of the dataset is not significant, and the unique advantages compared with existing datasets are not clear. There are several observations: 1) According to Table 2, the TPM is around 15 and the total number of messages is around 4,000; then for each test instance, if we consider all user messages as the retrieval pool (note this is also not indicated in the paper), the max token number should be 15*4000 = 60,000 tokens, while most existing long-term memory benchmarks consider much longer contexts, not to mention that the pool becomes much smaller if it does not use all messages (so many detailed statistics are missing). 2) User messages are generated by prompting with detailed user attributes. Although this can reduce hallucinations of LLMs, it also poses many constraints on LLMs and makes the expression less natural than real-world interactions; for example, the user may not explicitly talk about these attributes. 3) According to Table 6, the simplest baselines (RetrMem or FullMem) can achieve almost 80% accuracy, and FullMem can achieve 95%, which further supports my claim that the dataset is relatively easy, not to mention that the authors do not use existing SOTA models such as GPT-4o.

[1] Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

Questions

  1. See above
  2. Any prompts or implementation details for the baselines and experiments?
  3. Notations are not clear.
Comment

For Weakness 3: "2) User messages are generated by prompting with detailed user attributes. Although this can reduce hallucinations of LLMs, it also poses many constraints on LLMs and makes the expression less natural than real-world interactions; for example, the user may not explicitly talk about these attributes."

Response:

Thanks for your comment. We agree with the reviewer that such additional constraints can reduce the fluency and naturalness of the generated user messages, which we have discussed in detail in Section 4.2. We never state that our method in Table 4 should surpass those three baselines in terms of fluency and rationality. On the contrary, being slightly below the baselines is expected, because our method is strictly constrained to ensure the generated messages reliably include the answer, thus sacrificing performance in the above linguistic aspects. This has been discussed in detail in lines 405 to 409:

"Our MemSim method imposes the most strict constraints, requiring both the integration of specific attributes into user messages and ensuring that questions are answerable with established ground truths based on the shared hints. Generally, higher constraint commonly means sacrifice of fluency and naturalness, because it compulsively imposes certain information to benefit QA constructions."

The key to our method is the accurate injection of the answer information into the user messages, which is why this method can significantly enhance the reliability of QA data generation, thus achieving "automatic" data construction for subsequent "objective" evaluation. To construct more reliable QAs for evaluation, we have indeed sacrificed some naturalness. However, the experimental results in Section 4 indicate that the decline in naturalness is not particularly severe. In addition, feedback from the industry department indicates that this style of expression is what they require, especially for evaluating memory of factual information in LLM-based personal assistants. We believe that for our evaluation scenario, this is an acceptable trade-off.

For Weakness 3: "3) According to Table 6, the simplest baselines (RetrMem or FullMem) can achieve almost 80% accuracy, and FullMem can achieve 95%, which further supports my claim that the dataset is relatively easy, not to mention that the authors do not use existing SOTA models such as GPT-4o."

Response:

Thanks for your comment. However, I think there are some misunderstandings about our work. In order to set different levels of difficulty, we collect question-irrelevant posts from social media platforms, and randomly incorporate them into user messages by controlling their proportions. Specifically, we denote MemDaily-vanilla as the vanilla and easiest one without extra additions, and create a series of MemDaily-η, where we use η to represent the inverse percentage of original user messages. Larger η indicates a higher level of difficulty in the benchmark. We primarily focus on MemDaily-vanilla and MemDaily-100 as representatives. We also conduct evaluations on MemDaily-10, MemDaily-50, and MemDaily-200, putting their experimental results in Appendix D.

As for the evaluation results, we can see that FullMem only achieves over 95% performance on the simple type of questions in MemDaily-vanilla and MemDaily-100. This is expected because, for LLMs, answering factual single-hop questions is essentially like searching for a needle in a haystack. However, for tasks such as aggregative QAs, the performance is less than 40% even on MemDaily-vanilla. On the more challenging MemDaily-200, the comprehensive performances for Comparative, Aggregative, and Post-processing tasks are below 80%. Additionally, we provide a Retrieval Target metric to evaluate the retrieval of memory information, and even on MemDaily-100, this metric shows performance below 70%.

It is precisely to increase the difficulty of the evaluation that we introduced question-irrelevant posts and various QA tasks, along with the stricter process metric of retrieval target, to provide a more diversified evaluation. I think reviewers should not dismiss the difficulty and contributions of the entire benchmark based solely on the performance results of the simplest types of QAs.

We sincerely thank you for your time in reviewing our paper and for your comments on it. We hope this rebuttal can clarify the misunderstandings and change your perspective on our work. If you have further questions, we are very happy to discuss them.

Comment

Thank you for the detailed response. Some of my concerns have been addressed. The score is updated.

Comment

Dear reviewer ZuQB,

Thanks very much for your feedback. In the rebuttal, we try our best to answer your questions one by one. If you have further questions, we are very happy to discuss more about them.

We sincerely thank you for your time in reviewing our paper and our responses.

Comment

(3) Data Card for MemDaily-50

Statistics | Simp. | Cond. | Comp. | Aggr. | Post. | Noisy | Total
Trajectories | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Messages | 210,750 | 209,750 | 157,200 | 276,800 | 221,900 | 223,750 | 1,300,150
Questions | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Tokens Per Trajectory | 4,834 | 4,813 | 3,665 | 6,867 | 5,107 | 5,139 | 30,426
Tokens Per Message | 11.47 | 11.47 | 11.47 | 11.46 | 11.51 | 11.48 | 68.87

(4) Data Card for MemDaily-100

Statistics | Simp. | Cond. | Comp. | Aggr. | Post. | Noisy | Total
Trajectories | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Messages | 421,500 | 419,500 | 314,400 | 553,600 | 443,800 | 447,500 | 2,600,300
Questions | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Tokens Per Trajectory | 9,402 | 9,360 | 7,123 | 13,357 | 9,919 | 9,992 | 59,154
Tokens Per Message | 11.15 | 11.16 | 11.15 | 11.15 | 11.18 | 11.16 | 66.94

(5) Data Card for MemDaily-200

Statistics | Simp. | Cond. | Comp. | Aggr. | Post. | Noisy | Total
Trajectories | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Messages | 843,000 | 839,000 | 628,800 | 1,107,200 | 887,600 | 895,000 | 5,200,600
Questions | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Tokens Per Trajectory | 18,536 | 18,454 | 14,048 | 26,355 | 19,544 | 19,685 | 116,622
Tokens Per Message | 10.99 | 11.00 | 10.99 | 11.00 | 11.01 | 11.00 | 65.99

From our data cards, we can see that MemDaily-200 contains more than 843k messages, with over 26k tokens per trajectory, which can be considered a long context. Moreover, we have also provided the script for creating longer trajectories by infusing question-irrelevant posts, in order to further extend the length of trajectories. We also thank you for the advice and will add these data cards to the Appendix.

Comment

For Weakness 2: "Even so, the performance gap in Table 4 is not significant; the simple baselines lead to better performance on most metrics, and no detailed analysis is provided."

Response:

Thanks for your comment. However, I think there are some misunderstandings about our work. Actually, we never state that our method in Table 4 should surpass those three baselines in terms of fluency, rationality, naturalness, and informativeness. On the contrary, being slightly below the baselines is expected, because our method is strictly constrained to ensure the generated messages reliably include the answer, thus sacrificing performance in the above linguistic aspects. This has been discussed in detail in lines 405 to 409:

"Our MemSim method imposes the most strict constraints, requiring both the integration of specific attributes into user messages and ensuring that questions are answerable with established ground truths based on the shared hints. Generally, higher constraint commonly means sacrifice of fluency and naturalness, because it compulsively imposes certain information to benefit QA constructions."

The key to our method is the accurate injection of the answer information into the user messages, which is why this method can significantly enhance the reliability of QA data generation, thus achieving "automatic" data construction for subsequent "objective" evaluation.

For Weakness 3: "The value of the dataset is not significant, and the unique advantages compared with existing datasets are not clear. There are several observations: 1) According to Table 2, the TPM is around 15 and the total number of messages is around 4,000; then for each test instance, if we consider all user messages as the retrieval pool (note this is also not indicated in the paper), the max token number should be 15*4000 = 60,000 tokens, while most existing long-term memory benchmarks consider much longer contexts, not to mention that the pool becomes much smaller if it does not use all messages (so many detailed statistics are missing)."

Response:

Thanks for your comment. However, I think there are some misunderstandings about our work. First of all, the MemDaily dataset is just the basic data for constructing the MemDaily benchmark, not all the data used for evaluation. We have provided a detailed description in Section 5.1. In order to set different levels of difficulty, we collect question-irrelevant posts from social media platforms and randomly incorporate them into user messages by controlling their proportions. Specifically, we denote MemDaily-vanilla as the vanilla and easiest one without extra additions (statistics in Table 2), and create a series of MemDaily-η, where we use η to represent the inverse percentage of original user messages. Larger η indicates a higher level of difficulty in the benchmark. We primarily focus on MemDaily-vanilla and MemDaily-100 as representatives. We also conduct evaluations on MemDaily-10, MemDaily-50, and MemDaily-200, putting their experimental results in Appendix D.

The full data statistics for MemDaily-vanilla, MemDaily-10, MemDaily-50, MemDaily-100, and MemDaily-200 can be found in the 'data_card.xlsx' in our anonymous repository. We also list them as follows:

(1) Data Card for MemDaily-vanilla

Statistics | Simp. | Cond. | Comp. | Aggr. | Post. | Noisy | Total
Trajectories | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Messages | 4,215 | 4,195 | 3,144 | 5,536 | 4,438 | 4,475 | 26,003
Questions | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Tokens Per Trajectory | 360 | 359 | 268 | 502 | 394 | 389 | 2,272
Tokens Per Message | 42.74 | 42.77 | 41.97 | 41.93 | 44.34 | 43.41 | 257.16

(2) Data Card for MemDaily-10

Statistics | Simp. | Cond. | Comp. | Aggr. | Post. | Noisy | Total
Trajectories | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Messages | 42,150 | 41,950 | 31,440 | 55,360 | 44,380 | 44,750 | 260,030
Questions | 500 | 500 | 492 | 462 | 500 | 500 | 2,954
Tokens Per Trajectory | 1,184 | 1,178 | 891 | 1,671 | 1,259 | 1,261 | 7,445
Tokens Per Message | 14.04 | 14.04 | 13.95 | 13.95 | 14.19 | 14.09 | 84.26
Comment

For Weakness 2: "Baselines are too weak and experimental results are not convincing. The baselines used in both Sections 4.1 and 4.2 are too weak, and there are no implementation details. For example, it is not hard to design a multi-stage but simpler prompting strategy to generate user profiles instead of designing such complex sampling mechanisms."

Response:

Thanks for your comment. However, I think there are some misunderstandings about our work. Our implementation of baselines can be found in our anonymous repository. For better demonstration, we provide a detailed description as follows.

For the baselines of generating user profiles:

Note: For the fairness of our evaluation, we predefine a common attribute domain for all the baselines, such as gender, occupation, and so on.

  • IndePL: Prompting an LLM to generate values of attributes independently. This is the most naive baseline: it generates each attribute value independently, without considering previously generated attribute values. From a probabilistic perspective, this is essentially an independent sampling process x_i ~ P(X_i).
  • SeqPL: Prompting an LLM to generate values of attributes sequentially, conditioned on previous attribute values in linear order. Compared with IndePL, SeqPL incorporates the previous attribute when generating the next one. From a probabilistic perspective, this is essentially a first-order Markov sampling process x_i ~ P(X_i|X_{i-1}).
  • JointPL: Prompting an LLM to generate attribute values jointly. Compared with the above two methods, JointPL incorporates all attribute domains into the prompt and generates all attribute values at once. From a probabilistic perspective, this is essentially a joint probability sampling process x_i ~ P(X_i|X_1, X_2, ..., X_{i-1}).

Actually, JointPL is not a weak baseline, because it incorporates all attribute domains into the prompt. Most previous works use this idea to generate user profiles [2, 3, 4, 5, 6].

We believe that the above three situations basically cover all methods of generating user profiles (without relying on other datasets). I'm not quite sure what the reviewer means by "it is not hard to design a multi-stage but simpler prompting strategy to generate user profiles instead of designing such complex sampling mechanisms." If you could provide a more detailed description, we would be happy to discuss it, which would greatly help us improve the paper.
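For clarity, the three profile-generation baselines can be contrasted with the following compressed sketch (the prompt wording and the `call_llm` helper are illustrative assumptions, not our exact prompts):

```python
# Illustrative contrast of the three prompting baselines; prompt wording and
# the `call_llm` helper (text in, text out) are assumptions for demonstration.
def indepl(call_llm, attributes):
    """IndePL: each attribute value is sampled independently, x_i ~ P(X_i)."""
    return {a: call_llm(f"Give a realistic value for the user attribute '{a}'.")
            for a in attributes}

def seqpl(call_llm, attributes):
    """SeqPL: first-order Markov sampling, x_i ~ P(X_i | X_{i-1})."""
    profile, prev = {}, None
    for a in attributes:
        context = f" The previously generated attribute was {prev}." if prev else ""
        profile[a] = call_llm(f"Give a value for the user attribute '{a}'.{context}")
        prev = (a, profile[a])
    return profile

def jointpl(call_llm, attributes):
    """JointPL: all attribute values are generated jointly in one prompt."""
    return call_llm("Generate a coherent user profile with values for: "
                    + ", ".join(attributes) + ".")
```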

For the baselines of generating user messages:

  • ZeroCons: No constraints on attributes when prompting LLMs. We just let the LLM generate user messages freely, without any constraints. From a probabilistic perspective, this is essentially an independent sampling process m_i ~ P(M).
  • PartCons: Partial attributes of user profiles are constrained in prompts for LLMs. We provide a user profile and let the LLM generate user messages that should refer to part of the attributes of the user profile. From a probabilistic perspective, this is essentially a partial conditional sampling process m_i ~ P(M|X_i).
  • SoftCons: Full attributes of user profiles are constrained in prompts, but they are not forcibly required in the generated messages. We provide a user profile and let the LLM generate user messages that should refer to all attributes of the user profile. From a probabilistic perspective, this is essentially a full conditional sampling process m_i ~ P(M|X_1, X_2, ..., X_n).

In fact, SoftCons is a common baseline for generating user messages, rather than a weak baseline. Generating user messages by incorporating full user profiles is a common method in most recent works. What we want to emphasize here is that while these baselines are capable of generating user messages fairly well, they are not subjected to strict constraints. In contrast, our method requires both the integration of specific attributes into user messages and the guarantee that questions are answerable with established ground truths based on the shared hints. It imposes the strictest constraints to ensure that the answer can be accurately injected into user messages. Generally, a higher constraint commonly means a sacrifice of fluency and naturalness, because it compulsively imposes certain information to benefit QA construction.

References:

[2] Zhong, Wanjun, et al. "Memorybank: Enhancing large language models with long-term memory." Proceedings of the AAAI Conference on Artificial Intelligence.

[3] Yukhymenko, Hanna, et al. "A Synthetic Dataset for Personal Attribute Inference." arXiv:2406.07217 (2024).

[4] Wang, Lei, et al. "User behavior simulation with large language model based agents." arXiv:2306.02552 (2023).

[5] Niu, Cheng, et al. "Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation." arXiv:2405.13037 (2024).

[6] Zhou, Xuhui, et al. "Sotopia: Interactive evaluation for social intelligence in language agents." arXiv:2310.11667 (2023).

Comment

Dear reviewer ZuQB,

Thanks so much for your precious time in reading and reviewing our paper. However, I believe there are significant misunderstandings about our paper, and I hope the following rebuttal can make a clarification and change your perspective on our work. In the following, we try to alleviate your concerns one by one:

For Weakness 1: "Lots of statements are not convincing. There are several examples: 1) In line 64, the authors claim the work is the first to evaluate the memory of LLM-based personal assistants in an objective and automatic way; there are many studies evaluating memory usage in objective and automatic ways, such as [1]."

Response:

Thanks for your question. However, I think there are some misunderstandings about our work. In line 64, we claim our work is the first to evaluate the memory of LLM-based personal assistants in an objective and automatic way. There are two significant points: (1) the memory of LLM-based personal assistants, and (2) an objective and automatic way. There are previous works that evaluate the performance of LLM-based personal assistants, but not directly their memory. For example, the reviewer mentions a great previous work [1], which focuses on utilizing a long/short-term memory to improve long-term conversation tasks. Such work can reflect "the effectiveness of memory mechanisms for long-term conversation tasks" through improved performance on these tasks, but it does not provide a common and direct evaluation of "how well memory mechanisms memorize certain critical information", which is the key point of our work. Task improvement from memory usage is not identical to how well the memory itself memorizes critical information. Actually, we have emphasized "factual information" many times in our paper, but we will also add the above detailed comparison to make it clearer.

References:

[1] Li, Hao, et al. "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue." arXiv preprint arXiv:2406.05925 (2024).

For Weakness 1: "2) In line 99, the authors keep emphasizing the importance of "objective" evaluation, since they think human annotators introduce bias. However, my personal concern is that LLMs also have bias, just like humans. In this way, I would not say the objective evaluation is guaranteed."

Response:

Thanks for your comment. However, I think there are some misunderstandings about our work. In our work, the "objective" evaluation of the memory of LLM-based personal assistants is conducted neither by human annotation nor by LLM annotation. We think that both human annotation and LLM annotation fall under the category of "subjective" evaluation, whereas "objective" evaluation should be a process of comparing predictions with ground-truth answers. We let agents answer factual questions and compare their answers with the ground truths to calculate accuracy. Specifically, we use the accuracy of multiple-choice questions related to factual information and the recall@5 of retrieval targets as metrics, as detailed in Section 5.1. We agree with the reviewer that both humans and LLMs can introduce bias in evaluations. That is exactly why we do not take that approach, which is an innovation point of our claim.
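To make the protocol concrete, the two metrics reduce to simple comparisons against ground truth (a sketch; function and variable names are illustrative):

```python
def multiple_choice_accuracy(predicted, gold):
    """Fraction of questions where the agent selects the ground-truth option."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def recall_at_5(retrieved_indexes, target_indexes, k=5):
    """Fraction of ground-truth target messages found among the top-k retrieved."""
    return len(set(retrieved_indexes[:k]) & set(target_indexes)) / len(target_indexes)
```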

Comment

As for the unique advantages compared with existing datasets, there are two critical advantages in our work:

(1) Automatic Data Generation without Human Annotation (Compared with Other Long-term QA Datasets):

Previous approaches usually adopt a pipeline like "message --> question --> answer". They generate or collect some user messages, and then let an LLM generate questions based on these messages. Finally, they make the LLM generate correct answers based on the user messages and questions. Although this method is simple, the accuracy of the answers depends on the performance of the LLM, which makes the difficulty of constructing and solving the QAs the same. Therefore, these approaches require further human annotation to check whether the answers are correct, as in PerLTQA, LOCOMO, and LeMon.

In contrast, our proposed approach takes the pipeline like "prior knowledge --> question & answer --> message". We generate questions and answers based on constructed prior information (such as user attributes). Then, we create user messages by injecting answers with other information. This construction method makes it easier to construct Q&A than to solve them. By this means, we can ensure the correct answer is contained and well-located in the user messages.

The feature of "automatic" makes the evaluation extendable to other specific scenarios without expensive human annotators.

(2) User Messages as Information Foundations for QAs (Compared with Other KBQA Datasets)

In conventional KBQA evaluations, a knowledge graph is typically provided as retrieval support [7], and LLMs can also be evaluated on general knowledge using common-sense questions, such as in HotpotQA. However, for LLM-based personal assistants, users do not provide a knowledge graph to the personal assistant. Instead, these scenarios convey factual information in the form of user messages. This makes it challenging to directly evaluate LLM-based agents using existing KBQA data, as it requires reliably injecting structured information into user messages. That is also the problem that our causal generation mechanism aims to address.

(3) Evaluation on memorizing certain critical information (Compared with Other Memory-based Conversation Tasks)

Some previous works like [1] focus on utilizing a long/short-term memory to improve long-term conversation tasks. These works can reflect "the effectiveness of memory mechanisms for long-term conversation tasks" through improved performance on these tasks, but they do not provide a common and direct evaluation of "how well memory mechanisms memorize certain critical information", which is the key point of our work. Task improvement from memory usage is not identical to how well the memory itself memorizes critical information.

References:

[1] Li, Hao, et al. "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue." arXiv preprint arXiv:2406.05925 (2024).

[7] Lan, Yunshi, et al. "Complex knowledge base question answering: A survey." IEEE Transactions on Knowledge and Data Engineering 35.11 (2022): 11196-11215.

Official Review
Rating: 5
  • The paper proposed MemSim, a simulator for automatically constructing diverse and rational QA pairs based on the Bayesian Relation Network.
  • The paper also constructs a MemDaily Dataset using MemSim to evaluate the memory mechanisms of LLMs.

Strengths

  • The paper adopts BRNet to sample user profile graphs and build question-answer pairs according to these graphs based on rules. It reduces the hallucinations compared to methods that construct samples directly.
  • Human evaluation seems to validate the effectiveness of rationality and diversity for generated results from the proposed MemSim.

Weaknesses

  • The human evaluation lacks a detailed description. How can an annotator give a diversity score to a sample without comparing all the samples in the produced dataset?
  • The tables are hard to understand; there is no detailed description of the metric abbreviations, for example in Tables 6 and 7.

Questions

  • It is kind of confusing that the author mentions the Bayesian Relation Network. It seems to be a Probabilistic Graphical Model. Why can the proof in Sec 3.2 give the conclusion: "we introduce prior knowledge of the specific scenario into the graphical structure and sampling process, which can improve the diversity and scalability of user profiles"?
  • Where does the BRNet come from? How are user profiles generated?

Ethics Concerns

There is a lot of sensitive information listed in the generated datasets, which seems to be generated by LLM.

Comment

For Question 2: "Where does the BRNet come from? How are user profiles generated?"

Response:

Thanks for your question. The vertices, edges, and conditional probability distributions of BRNet are derived from a real-world scenario (industry department) for personal assistants. All of them are defined in the meta_profile.csv and our code for data generation, which can be found in our anonymous code repository. Moreover, we have provided a detailed description for generating user profiles in Appendix E, and you may check it.

For Ethics Concerns: "There is a lot of sensitive information listed in the generated datasets, which seems to be generated by LLM."

Response:

Thanks for your concerns. We ensure the safety of all the generated data by automatically checking for ethically sensitive content.

We sincerely thank you for your time in reviewing our paper and for your comments on it. We hope this rebuttal can clarify the misunderstandings and change your perspective on our work. If you have further questions, we are very happy to discuss them.

Comment

For Weakness 1: "How can an annotator give a diversity score to a sample without comparing all the samples in the produced dataset?"

Response:

I believe there is a misunderstanding regarding the evaluation of sample diversity here. We evaluate diversity by calculating the Shannon-Wiener Index of the dataset, rather than by human annotators, as stated in lines 353~360 of our paper. I agree that we cannot score the diversity of the data without seeing all the samples; this is also why we do not use human annotators to evaluate diversity. In fact, diversity pertains to the dataset as a whole, rather than to individual samples. Therefore, we collect all the entities in the produced dataset, compute their frequencies, and calculate the Shannon-Wiener Index to reflect the overall diversity of the dataset.
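For reference, the Shannon-Wiener Index over entity frequencies is H = -Σ p_i ln p_i, where p_i is the relative frequency of entity i across the dataset; a minimal sketch:

```python
import math
from collections import Counter

def shannon_wiener_index(entity_mentions):
    """H = -sum_i p_i * ln(p_i); higher H means a more diverse entity distribution."""
    counts = Counter(entity_mentions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```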

For Weakness 2: "The tables are hard to understand; there is no detailed description of the metric abbreviations, for example in Tables 6 and 7."

Response:

I believe there is a misunderstanding regarding the table, and I'm sorry for making it a bit unclear. In fact, these abbreviations are not metrics, but different types of QAs in the MemDaily dataset, where we have explained these abbreviations in Section 3.4 line 323~335. As for the metrics, we have provided a detailed explanation in lines 460-469 of Section 5.1. These metrics are used throughout the whole experiment section. Specifically, Table 6 demonstrates the accuracy of factual question-answering on different QA types, and Table 7 shows the recall@5 of target message retrieval on different QA types.

For a better demonstration, we will add a description "The abbreviations indicate different QA types mentioned in Section 3.4", and we thank you for the advice.

For Question 1: "It is kind of confusing that the author mentions the Bayesian Relation Network. It seems to be a Probabilistic Graphical Model. Why can the proof in Sec 3.2 give the conclusion: "we introduce prior knowledge of the specific scenario into the graphical structure and sampling process, which can improve the diversity and scalability of user profiles"?"

Response:

Thanks for your question. We propose the Bayesian Relation Network (BRNet) based on the Bayesian Network (a type of probabilistic graphical model) in order to meet the requirement of creating various user profiles. It can be considered a variant of the Bayesian Network specifically for creating user profiles for personal assistants. Specifically, we define a two-level structure in BRNet, including the entity level and the attribute level. The entity level represents user-related entities, such as relevant persons, involved events, and the user itself. At the attribute level, each entity comprises several attributes, such as age, gender, and occupation. Here, BRNet actually serves as a predefined meta-user. Each vertex in BRNet corresponds to an attribute domain of a certain entity at the entity level, and we define a series of possible values in each attribute domain for probabilistic sampling.

There are three main aspects to introducing prior knowledge of the specific scenario into the graphical structure and sampling process. The first aspect is to introduce different attribute domains and their possible values, by extending more vertices. For example, the age range might be set from 18 to 30 for a youth social platform. The second aspect is to introduce different causal relations among attributes, by extending edges. For example, educational background might be a cause of occupation. The third aspect is to introduce probability distributions among attributes, by extending conditional probability distributions. For example, a person with a PhD is more likely to become a scientist. By introducing these three aspects of prior knowledge, we are able to make the created dataset more closely resemble our scenario.

For diversity, introducing more prior knowledge, whether added to LLM prompts during generation or used for random sampling under more conditions, can effectively prevent the repetition of generated user profiles, as verified in Section 4.1. For scalability, different prior knowledge can be easily integrated into BRNet by introducing new vertices, edges, and conditional probability distributions, thus allowing for effective expansion.

Comment

[Rationality] The rationality of a user message refers to whether: (1) it could exist in the real world; (2) it contains no internal conflicts or contradictions. Score 1 means the least rational, while score 5 means the most rational.

Here are some examples that lack rationality, for reference:

(1) [1 point] I am 24 years old, and my grandson is 2 years old. (It is impossible for a 24-year-old to have a grandson.)

(2) [2–3 points] Today is Monday, tomorrow is Wednesday. (Tomorrow cannot be Wednesday as the day after Monday is Tuesday.)

Tips: If there are no obvious unreasonable points, a score of 5 can be given; for serious errors, a score of 1–2 can be given; for other unreasonable elements, corresponding points can be deducted at discretion.

[Naturalness] The naturalness of a user message refers to whether the message closely resembles a real user message. Score 1 means the least natural, while score 5 means the most natural.

[Informativeness] The informativeness of user messages refers to whether these messages provide rich and valuable information points, i.e., pieces of information that can later be queried about. Score 1 means the least informative, while score 5 means the most informative.

The following are some examples:

(1) [Low Informativeness] How is the weather today?

(2) [Medium Informativeness] How is the weather today? I plan to go to the park this afternoon.

(3) [High Informativeness] Today's weather is overcast turning to cloudy, it won't rain, I plan to go to the park this afternoon.

Highlight: You should develop a general sense of the informativeness of user messages during the pre-evaluation phase.

Additional Requirement: You should indicate the reason for any of the above deductions. If there are no major deductions, this field can be left blank.

Guideline of Evaluation on Questions and Answers.

Guideline: In the left column of the questionnaire, you will see (1) a list of user messages, (2) a question, (3) the textual answer, (4) the multiple choices with the correct answer, and (5) the index list of retrieval targets. You should check three aspects of the QAs: the accuracy of the textual answer, the accuracy of the multiple-choice answer, and the accuracy of the retrieval targets.

[Accuracy of Textual Answers] You need to check whether the textual answer to the question is correct based on the user's message list. If it is correct, please select the [Correct] button; otherwise, please select the [Incorrect] button.

[Accuracy of Retrieval Targets] Please judge the correctness of the retrieval targets in the QA. Retrieval targets refer to the messages (given as a list of indices) from the user's message list that are needed to obtain the textual answer to the question. If the retrieval targets are uniquely correct, please select the [Correct] button; otherwise, please select the [Incorrect] button.

Additional Requirement: You should indicate the reason whenever you choose [Incorrect]. If all of the above are correct, this field can be left blank.

Third, we build a web page based on Flask to display the questionnaire containing the (shuffled) data to be evaluated. We deploy this web page on a cloud server with a public address so that human evaluators can access it, and each evaluator is assigned a unique account and password for data evaluation and progress management.

The evaluation itself is divided into two stages: a pre-evaluation phase and a formal evaluation phase. In the pre-evaluation phase, evaluators are assigned a small amount of data to familiarize themselves with the process and to provide feedback that helps us further clarify the evaluation criteria; they also build a general sense of the informativeness of user messages at this stage. Each evaluator is assigned 8/8/20 questions in the pre-evaluation phase, corresponding to the three types of evaluation above. In the formal evaluation phase, each evaluator is assigned 100/100/100 questions for the three types of evaluation. In total, we obtain (100 + 100 + 100) × 6 = 1800 human-evaluated data points to analyze the quality of MemDaily.
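
For illustration only, such a questionnaire page could be served by a minimal Flask application along the following lines; the routes, items, and template here are hypothetical and much simpler than our deployed system, which also handles per-evaluator accounts and progress tracking.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical shuffled questionnaire items assigned to one evaluator.
ITEMS = {
    1: "User profile: age 24, occupation teacher, ...",
    2: "User message: Today I went to the park with my daughter ...",
}

PAGE = """
<h3>Item {{ item_id }}</h3>
<p>{{ text }}</p>
<form method="post">
  <input name="score" placeholder="score 1-5">
  <button type="submit">Submit</button>
</form>
"""

@app.route("/item/<int:item_id>", methods=["GET", "POST"])
def evaluate(item_id):
    if request.method == "POST":
        # In a real system, the score would be stored per evaluator account.
        return f"Recorded score {request.form['score']} for item {item_id}."
    return render_template_string(PAGE, item_id=item_id, text=ITEMS[item_id])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```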

Comment

Dear Reviewer vwQN,

Thank you very much for the time you spent reading and reviewing our paper. However, we believe there may be some misunderstandings about our paper, and we hope the following rebuttal clarifies them and changes your perspective on our work. Below, we try to alleviate your concerns one by one:

For Weakness 1: "The human evaluation lacks a detailed description."

Response:

Thanks for your advice; we have added a detailed description of the human evaluation as follows:

Details of Human Evaluation

In order to evaluate the quality of the MemDaily dataset, we recruit six well-educated human experts to score multiple aspects of our dataset. We design a standard pipeline for conducting human evaluations, with clear instructions, fair scoring, and reasonable comparisons. Our human evaluations focus on user profiles, user messages, and QAs, as mentioned in Section 4.

Our evaluation pipeline includes five steps: (1) human evaluator recruitment, (2) guideline and questionnaire design, (3) web page construction and deployment, (4) pre-evaluation, and (5) formal evaluation. We provide more details as follows.

First, we recruit six human experts as evaluators. All of them are well-educated and hold at least a bachelor's degree, which ensures that they can correctly understand the evaluation questions and provide reasonable feedback. Second, we design a detailed guideline that tells the human evaluators how to conduct the evaluation. Specifically, the guideline includes three parts, corresponding to the three aspects of our evaluation in Section 4, as shown below.

Guideline of Evaluation on User Profiles.

Guideline: You will see some user profiles in the left column of the questionnaire. Please assess whether these user profiles are reasonable, and rate their rationality from 1 to 5. Score 1 means the least reasonable, while score 5 means the most reasonable. Here, reasonableness refers to: (1) likely to exist in the real world, resembling a real user (realistic); (2) no internal conflicts or contradictions (consistent).

Here are some examples of unreasonable cases for reference:

(1) [1 point] The user's age is 24, but the related person is his grandson. (Logical error: A 24-year-old cannot have a grandson.)

(2) [2 points] The user's height is "(1) 175cm (2) 168cm (3) 172cm". (Generation error: Multiple values are given for a single attribute that can only have one value, like height.)

(3) [2–4 points] The user's phone number is 01234567891. (Unrealistic: The phone number does not seem real.)

(4) [2–4 points] The user's company name is Shanghai XX Company. (Unrealistic: The company name XX does not seem real.)

Tips: If there are no obvious unreasonable aspects, a score of 5 can be given; if there are serious errors, a score of 1–2 can be given; for other unrealistic elements, points can be deducted accordingly.

Guideline of Evaluation on User Messages.

Guideline: You will see some messages in the left column of the questionnaire. These messages are what the user said to the personal assistant while using it, i.e., the recorded user messages. Please assess the fluency, rationality, naturalness, and informativeness of these user messages, and score them ranging from 1 to 5.

[Fluency] The fluency of a user message refers to whether the message text is correct in terms of words, sentences, and grammar, and whether it is coherent and conforms to natural language and expression habits (colloquial expressions are allowed). Score 1 means the least fluent, while score 5 means the most fluent.

Here are some examples that lack fluency, for reference:

(1) [1–2 points] Today day day day upwards to juggle night, I ate meat pork and or but rice. (Hard to understand due to the lack of fluency.)

(2) [2–3 points] This night, I pork and rice, delicious. (Requires effort to interpret due to the lack of fluency, but the meaning can be inferred.)

Tips: If there are no obvious fluency issues, a score of 5 can be given; serious errors can receive a score of 1–2; other elements affecting fluency can lead to deductions as appropriate.


Comment

Dear Reviewer vwQN,

Thanks again for your detailed comments, which we believe are very important for improving our paper.

We have tried our best to address the concerns one by one. As the discussion deadline approaches, we eagerly await your feedback on our responses.

If you have further questions, we are very happy to discuss them. We really hope our efforts can alleviate your concerns.

Sincerely,

Submission1435 Authors

Comment

Dear Reviewers,

We sincerely appreciate the time and effort dedicated by all reviewers in reviewing our paper.

We have addressed all concerns one by one and clarified the misunderstandings raised in the reviews. As the discussion deadline approaches, we eagerly await your feedback on our responses.

We would appreciate the opportunity to address any remaining concerns that you may still have.

Sincerely,

Submission1435 Authors

AC Meta-Review

The authors present a framework to generate new questions from user messages to evaluate model memory capability. The work seems technically sound, with questions about data generation and baselines addressed during rebuttal. The overall novelty and impact of the work, however, were not very obvious -- it seems like a fairly complex / bespoke approach that was not very well explained or fully justified in its complexity and broad applicability.

Additional Comments from Reviewer Discussion

No major issues were flagged as unaddressed, but reviewers remained lukewarm.

Final Decision

Reject