PaperHub

ICLR 2025 · Final Decision: Rejected
Average Rating: 5.3 / 10 (4 reviewers; min 3, max 8, std 1.8)
Individual Ratings: 3, 8, 5, 5
Confidence: 2.8 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 3.0

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

Keywords
LLM Forecasting · Continuous Evaluation · Temporal Generalization

Reviews and Discussion

Official Review
Rating: 3

This paper constructs a new benchmark to evaluate LLMs in real time by testing their ability to predict future events. Traditional benchmarks quickly become outdated as LLMs and the world itself continuously evolve, limiting their ability to reflect current model performance. To address this, the authors introduce Daily Oracle, a continuously updated dataset created from daily news articles. Each day, they generate question-answer pairs about real-world events across domains like politics, science, and business, allowing for an assessment of whether LLMs can forecast future events based on prior knowledge.

The findings show that LLMs experience a steady decline in performance over time as their training data becomes outdated, weakening their predictive abilities without regular updates. Although techniques like RAG offer some enhancement by incorporating newer information, the models still struggle to maintain accuracy as the distance grows between training data and real-world events. Overall, this paper presents Daily Oracle as a tool for ongoing evaluation of LLMs, focusing on their ability to generalize temporally through daily news-based question-answer pairs.

Strengths

  1. The writing in this paper is clear and easy to follow, with a well-organized structure.
  2. The authors' Daily Oracle benchmark covers more topics and more recent dates compared to previous benchmarks, and it provides continuous daily evaluation.
  3. In the experiments, the authors observe a notable performance drop across all LLMs in the closed-book setting after the knowledge cutoff date. They further analyze this degradation by testing with RAG and gold article settings, observing similar declines.

Weaknesses

Major Issues:

  1. Limited Innovations in Data Construction: Although the benchmark is one of the main contributions, the data construction approach is highly similar to existing work. Specifically, the authors also rely on the Common Crawl News Dataset as the data source, and their QA construction process and question formats closely resemble TCELongBench. Aside from slight differences in the prompting steps, the main distinction appears to be the inclusion of True/False question types. This suggests limited technical contributions in data construction. This paper also lacks comparisons with TCELongBench in terms of both the approach and the quality of the generated QA pairs.
  2. No Human Verification: The authors rely entirely on GPT-3.5 and GPT-4 to generate the benchmark’s QA pairs, yet the reliability of this approach is unverified. For instance, in the QA filtering step, GPT-3.5 scores based on seven principles, criteria that could be challenging even for humans to judge objectively. The authors should include a rationale for choosing these specific principles, explain why a score of 13 or above indicates a quality question, and provide inter-rater reliability among human evaluators. It would also be necessary to assess the correlation between GPT-3.5’s scores and human scores to gauge data quality accurately, even if only on a subset.
  3. No Cost Description or Comparison: Since continuous daily evaluation is highlighted as a major benefit of this benchmark, it would be helpful to provide specific cost estimates for using Daily Oracle to periodically (daily, weekly, monthly) evaluate LLMs. Additionally, a cost comparison with other data construction methods, such as TCELongBench, is needed to assess the feasibility of this approach.

Minor Issues:

  1. Relatively Trivial Conclusion: The conclusion that LLM performance declines significantly after the knowledge cutoff date is fairly predictable. Additional analyses and insights would be beneficial, such as examining a time span beyond the last four years to study how LLMs’ performance in memorization changes over decades. The rise and drop in Figure 5’s gold article setting post-knowledge cutoff could also be further analyzed (e.g., does it relate to inconsistencies in LLMs’ parametric knowledge). Such experiments would deepen community understanding of LLMs’ temporal generalization.
  2. No Estimate of Human Performance: Adding an estimated score for average human or domain expert performance would help contextualize the accuracy scores achieved by the LLMs.

Typos:

  • Lines 312, 321, 425, 453: Figure 7 → Figure 3.

Questions

See the above Weaknesses.

Comment

We thank the reviewer for the detailed comments. Below are our responses to your concerns:

Weakness 1 - major

Limited Innovations in Data Construction

We provide a comparison table in the general response. In short, our work builds on the prompts of TCELongBench while significantly expanding its size and scope and incorporating several innovations to improve the quality of forecasting questions. It's important to highlight that our contribution does not focus on the data construction process itself—using LLMs to generate and verify data is a common approach. Instead, our novelty lies in the dataset's expanded size, broader scope, updated frequency, and most importantly, novel analyses of temporal generalization patterns.


Weakness 2 - major

No Human Verification

During the QA filtering step, seven principles are identified based on common mistakes observed during manual reviews of question quality while testing various QA generation prompts. The threshold of 13 is chosen because we allow at most a one-point deduction to ensure quality. In total, we generate 37,775 QA pairs, of which 29,988 (79.39%) successfully pass the QA filtering step under this threshold. For detailed results of the QA filtering process, please refer to Appendix A.3.
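
For illustration, here is a minimal sketch of the filtering threshold described above. The per-principle 0–2 scoring is an assumption inferred from the threshold description (maximum 14, one point deducted), and the principle names are taken from the comparison table in the general response; this is not the paper's exact implementation.

```python
# Minimal sketch of the QA filtering threshold. Assumption: each of the 7
# principles is scored 0-2 by GPT-3.5, so the maximum total is 14 and a
# threshold of 13 allows at most a one-point deduction.
PRINCIPLES = [
    "answer_correctness",
    "not_answerable_before_publication",
    "no_information_leakage",
    "objectivity",
    "clear_temporal_element",
    "public_interest",
    "non_obvious_answer",
]
THRESHOLD = 13

def passes_filter(scores: dict) -> bool:
    """Keep a QA pair only if its total principle score meets the threshold."""
    return sum(scores[p] for p in PRINCIPLES) >= THRESHOLD

# Hypothetical example: one principle scored 1 instead of 2 still passes (total = 13).
example = {p: 2 for p in PRINCIPLES}
example["non_obvious_answer"] = 1
print(passes_filter(example))  # True
```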

We also acknowledge the importance of human evaluations and inter-annotator agreement. While we are planning to design the human evaluation framework and recruit evaluators, we would also like to include our dataset in the Supplementary Material for your reference.


Weakness 3 - major

No Cost Description or Comparison

Thank you for highlighting concerns regarding the feasibility and cost-effectiveness of our data construction and evaluation process. We have conducted a detailed cost analysis:

For data construction:

Token counts (Mean ± SD):

  • Article Summary: 1,472.94 ± 444.51 tokens
  • QA Generation: 2,350.58 ± 505.53 tokens
  • Misleading Choices Generation: 1,410.18 ± 502.58 tokens
  • QA Filtering: 2,601.51 ± 500.09 tokens

Cost breakdown:

  • GPT-3.5 ($0.5/1M tokens): Used for Article Summary and QA Filtering
  • GPT-4 ($10/1M tokens): Used for QA Generation and Misleading Choices Generation

Average per-article cost: (1,472.94 + 2,601.51) × 0.5/1M + (2,350.58 + 1,410.18) × 10/1M = $0.0039

Monthly cost: 30 days × 6 articles × 0.0039 = $0.702

For Evaluation:

  • Open-book setting: Primary cost driver (3,051.99 ± 336.41 tokens per prompt, including five reference articles)
  • Closed-book setting: Minimal cost (~17.3 short questions daily)

In summary, our process demonstrates both feasibility and cost-effectiveness for large-scale implementation, with manageable monthly expenses of less than $1 for data generation.

Comment

Weakness 1 - minor

Relatively Trivial Conclusion

We would like to emphasize several key contributions and findings from our study:

  • Novelty and scientific verification: While it may seem intuitive that LLM performance degrades after the knowledge cutoff date, our work is the first to scientifically validate this hypothesis through a new, daily updated, continuous evaluation benchmark. This benchmark not only captures but also quantifies this degradation pattern (e.g., a gradual decline is observed even before the knowledge cutoff date, and the amount of the decline can be clearly measured), establishing a foundation for future temporal generalization studies.
  • Additionally, we have expanded our analysis to include more model performance results in both the constrained open-book setting (updated in Appendix B.4, Figures 12-18) and the gold article setting (updated in Figure 5). In both cases, the degradation patterns remain evident.
  • Notably, the degradation observed in the gold article setting is particularly surprising. In this setting, where answers are directly accessible, LLMs are expected to achieve consistent accuracy regardless of the publication date of the article. However, our findings reveal that outdated parametric representations hinder their ability to generate correct answers consistently, even in a reading comprehension context.

We appreciate the suggestion of testing LLMs' memorization changes over decades; however, due to time constraints, we leave it as future work.


Weakness 2 - minor

No Estimate of Human Performance

Thank you for the valuable suggestion. We agree that incorporating a human performance baseline could enhance the study by providing a plausible upper bound. However, since humans already have knowledge of past answers, such a comparison may not be entirely fair. More importantly, we would like to emphasize that the primary objective of our benchmark is to underscore the need for continuous model pretraining rather than focusing on comparisons with human performance. The key insight of our paper is to encourage the development of more efficient methods for continual training. In other words, comparisons to human performance may be less relevant, as our primary goal is to call for bridging the gap between the model’s past performance and recent degraded performance.

Comment

Thank you for your detailed responses and the additional experiments. While they address my concern regarding the cost description or comparison, I find that my other concerns remain insufficiently addressed. After carefully reading your responses and considering the other reviews, I still believe that the contributions of this paper do not meet the threshold for acceptance, for the following reasons:

  1. As the authors acknowledge, this work's dataset construction is largely similar to prior studies. The claimed advantages (expanded size, broader scope, and updated frequency) could feasibly be achieved using existing methodologies, widely available corpora, and some engineering efforts, rather than through novel research contributions.
  2. Regarding the "novel analyses of temporal generalization patterns," there are two issues:
    • The lack of human evaluation to validate the reliability of the automatically created dataset raises concerns about the robustness of conclusions derived from this dataset.
    • Even after reviewing the rebuttal, I still find the conclusions relatively intuitive and lacking significant additional insights for the research community.

Thus, I decide to maintain my current score.

Comment

We appreciate your continued engagement with our work and would like to address the core concerns raised in your latest comments.


Following concern 1

While we adopt prompts from previous work, the application of such a dataset to assess LLMs' temporal generalization is both novel and unexplored. The data construction process just serves as a means to achieve our primary goal: analyzing LLMs' temporal generalization ability. Regarding contribution, our work is the first to offer a continuous, quantifiable measure of LLM performance degradation over time. We believe our analysis provides valuable insights for understanding how quickly LLMs become outdated and highlights the critical need for continual model training.


Following concern 2

The reviewer raises the concern that “lack of human evaluation would affect the robustness of conclusions derived from this dataset.” We respectfully disagree and defend our position as follows:

Human evaluation and the finding of temporal trends are unrelated

  • The absence of human evaluation does not undermine the observed performance degradation trends. Our findings demonstrate consistent declines in model performance across all settings (e.g., closed-book, constrained open-book, and gold article), which remain robust despite minor variations in QA quality.

Clear trends validate data quality

  • The ability of our dataset to consistently capture performance degradation over time is itself a testament to its quality. For comparison, a related work, "FreshBench," [1] investigates temporal generalization using 2,532 forecasting questions from the forecasting platform GoodJudgmentOpen. In Table 3, they report significant fluctuations in accuracy percentages across models rather than a clear degradation pattern across all models. In contrast, our dataset demonstrates a more consistent and measurable decline, underscoring its capacity to evaluate LLMs' temporal trends effectively.

In Summary:

The consistent trends observed in our dataset confirm its robustness as a benchmark for temporal generalization. While human evaluation could enhance QA quality, it does not affect the validity of the reported conclusions or the broader utility of our dataset.


[1] Zhu, C., Chen, N., Gao, Y., & Wang, B. (2024). Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization. arXiv preprint arXiv:2405.08460.

Comment

Hi Reviewer VuTG, thank you so much for your thoughtful reviews and suggestions. We have addressed your questions and conducted new experiments and analyses as suggested. As the discussion phase is close to the end, we aim to ensure our responses fully address your concerns. We kindly ask you to consider our responses and updates in your final evaluation of our work.

Comment

Thank you for your further clarifications and detailed responses. I would like to clarify that my previous comments were specifically addressing the authors’ claim in the rebuttal that "our novelty lies in the dataset's expanded size, broader scope, updated frequency, and most importantly, novel analyses of temporal generalization patterns." As I mentioned earlier, I personally do not consider "expanded size, broader scope, updated frequency" as research contributions, and the claim of "novel analyses" also remains debatable.

On Research Contributions and Novelty:

However, I find some aspects of the authors' latest response to be either incorrect or overclaimed. To better position this submission, I would like to compare its contributions to prior work:

  • Compared to [1]: This submission and [1] share very similar dataset construction methods and use the same corpus to build datasets. The primary difference lies in the following:

    • [1] includes human evaluation to validate dataset reliability but does not analyze temporal generalization
    • This submission analyzes temporal generalization and constructs a dataset with a larger temporal span and more data points.
  • Compared to [2]: Both this submission and [2] analyze temporal generalization using datasets.

    • [2] introduces and applies multiple measures to assess temporal generalization and biases (including accuracy) and uses diverse data sources, with ground truth obtained via collecting human data rather than automatic generation.
    • This submission primarily uses accuracy as the metric, does not analyze bias, includes two task types (TF and MC), and has a single-domain data source (DailyNews), with all QA pairs generated by GPT. Also, this submission has the advantage of a larger dataset and a broader temporal span, alongside analysis across different QA settings.

As I am not an expert in this specific domain, I welcome corrections if I have missed or misunderstood any points. However, from my perspective, the claims made in the latest rebuttal, such as "the application of such a dataset to assess LLMs’ temporal generalization is both novel and unexplored" and "our work is the first to offer a continuous, quantifiable measure of LLM performance degradation over time" are not accurate. Work such as [2] has already investigated temporal generalization with more diverse measures. The main advantages of this submission are its larger dataset and broader temporal span. Considering these two related works, I find it difficult to view this submission as sufficiently novel or contributive to meet the acceptance threshold.

On Human Evaluation:

As noted earlier, [1] includes human evaluations across six dimensions as well as human performance. The assertion that “our findings demonstrate consistent declines in model performance across all settings, which remain robust despite minor variations in QA quality” is, in my view, unsupported. There is no scientific basis to claim that QA quality does not impact the robustness of observed trends across settings.

On Experimental Conclusions:

The authors also suggest that their conclusions differ significantly from those in [2]. However, I struggle to agree with this claim. Within the same time span (Jan 2023–April 2024), Figure 3 in this submission and Table 3 in [2] show very similar fluctuations in performance. While this submission provides a larger temporal span, the overall conclusion that "LLM performance degrades over time" is consistent across both studies.


References:

[1] Zhang et al. Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. ACL 2024.

[2] Zhu et al. Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization.

Comment

We thank the reviewer for further considering our response, and we would like to make clarifications to avoid any possible misunderstandings.


Regarding comparison to [1]

  • Our dataset is built for a completely different purpose. TCELongBench [1] aims to analyze Temporal Complex Events; in contrast, we aim to build a daily updated dataset to continuously assess temporal generalization. With this aim, we have a larger data size and scope and a longer time span, and we carefully design the benchmark to avoid information leakage, which is not done by TCELongBench.
  • We kindly note to the reviewer that we do not use the same corpus to build the dataset. Zhang et al. [1] use the Mideast-TE [3] corpus from GDELT, and our dataset is built using daily scraped news from Common Crawl. For more comparisons, please check the comparison table in the general response.

Regarding comparison to [2]

  • We respectfully disagree with the reviewer’s observation: “Within the same time span (Jan 2023–April 2024), Figure 3 in this submission and Table 3 in [2] show very similar fluctuations in performance.” In [2], the trend is difficult to discern due to the limited number of data points (ranging from 2 to 8) for each model, as illustrated in Table 3. In contrast, while [2] reports bi-monthly accuracy, our dataset’s larger size and broader coverage enable us to present a clearer and more detailed trend of monthly accuracy. Additionally, the extended timeframe of our dataset allows us to capture performance degradation that occurs before the knowledge cutoff, an aspect not addressed in other studies.
  • We would like to highlight the advantages of our automatic QA generation approach. Zhu et al. [2] build their dataset from human-submitted forecasting questions, which may limit coverage and does not easily scale up. The dataset update frequency depends on whether people are still actively submitting forecasting questions to the platform. In contrast, our approach enables daily updates, scalability, and more comprehensive event coverage, addressing the limitations of human forecasting tournaments.
  • In [2], the authors discuss 3 observations: (i) Earlier models perform worse. (ii) Most models exhibit weak temporal generalization ability. (iii) More advanced models perform better in future predictions.
    • While we address these points as well, our experiments go further by exploring how access to news articles up to various time cutoffs impacts LLM performance using RAG. This approach examines whether incorporating information retrieval can help bridge the performance gap over time.
  • We also show that the degradation pattern still exists in the gold article setting, which suggests that model degradation comes not only from the lack of future information but also from outdated representations. We believe our experimental conclusions go beyond the obvious "degradation over time" and make a fair contribution.

Regarding potential misunderstandings of our statement

We hope to clarify our statements made in the last response.

“the application of such a dataset to assess LLMs’ temporal generalization is both novel and unexplored”

  • We mean that there is no existing work using automatically generated QA pairs from news that could be used to explore how LLM performance evolves over time.

“our work is the first to offer a continuous, quantifiable measure of LLM performance degradation over time”

  • We do not imply that there is no prior work investigating temporal generalization (please see the "Temporal Generalization of LLMs" section in related work). Prior work that uses perplexity [4, 5] and BPC [2] faces several constraints:

    • they cannot be applied to closed-source models lacking accessible logits,
    • increased perplexity does not necessarily indicate degraded performance on downstream tasks
  • In contrast, our work provides a continuous benchmark with a specific downstream task that can quantify temporal generalization.

  • While [2] introduces a forecasting dataset for temporal generalization, it lacks a mechanism for continuous updates, and the degradation trends shown in their work are not as pronounced in their much shorter observation period. In contrast, our benchmark is designed for daily automatic updates and reveals clear performance trends. We plan to maintain this benchmark, enabling researchers to assess and track LLM forecasting performance over time in a dynamic manner.

Comment

[1] Zhang, Z., Cao, Y., Ye, C., Ma, Y., Liao, L., & Chua, T. S. (2024). Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. arXiv preprint arXiv:2406.02472.

[2] Zhu, C., Chen, N., Gao, Y., & Wang, B. (2024). Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization. arXiv preprint arXiv:2405.08460.

[3] Ma, Y., Ye, C., Wu, Z., Wang, X., Cao, Y., Pang, L., & Chua, T. S. (2023). Structured, Complex and Time-complete Temporal Event Forecasting. arXiv preprint arXiv:2312.01052.

[4] Agarwal, O., & Nenkova, A. (2022). Temporal effects on pre-trained models for language processing tasks. Transactions of the Association for Computational Linguistics.

[5] Röttger, P., & Pierrehumbert, J. (2021). Temporal adaptation of BERT and performance on downstream document classification: Insights from social media. Findings of the Association for Computational Linguistics: EMNLP 2021.

Comment

Thank you for your detailed clarifications and for correcting certain details. I believe this will help the community better position this submission.

Overall, I still find that the data construction methodology remains very similar to [1] (though, as clarified, it uses a different corpus—DailyNews, which is commonly used in other event forecasting studies). While the purposes differ, both works focus on temporal event-based analysis. Similarly, while the dataset and corpus differ from [2], both works aim to analyze temporal generalization, and in my view, the technical contributions here remain relatively limited.

The clarifications provided do not alter my perspective, and I will maintain my current score.

Official Review
Rating: 8

The paper presents a continuous evaluation benchmark for LLMs that tests their ability to make predictions about real-world events, assesses whether they show temporal generalisation, and evaluates different LLMs using multiple evaluation configurations (closed-book setting, constrained open-book setting, etc.).

Strengths

The paper presents a very interesting idea as a benchmark for LLMs and describes the dataset construction and evaluation in detail. I think this work would be very relevant as a benchmark for LLMs at ICLR.

Weaknesses

While the paper is extremely interesting from a dataset construction point of view, I found it a bit hard to follow through the experiment section, especially in terms of the task performed at each stage and the knowledge that the model had. For me this is particularly true for the "Constrained open book setting" sub-section, but it applies throughout the evaluation overview. I would suggest the authors adopt a running example, referring to a specific model with a clear cutoff date and a question regarding a piece of news, in order to highlight how the model would perform differently in different situations.

Questions

It seems that the dataset is completely constructed automatically through the usage of LLMs - I was wondering if the authors have performed any manual check to assess the quality of the construction and if they could add details and evaluation metrics about that.

Are the authors planning to set up a platform where users could test LLMs against the created dataset?

Comment

Thank you for your recognition of our work. We appreciate the opportunity to address your concerns in detail.

Weakness

A running example, referring to a specific model with a clear cutoff date and a question regarding a piece of news, in order to highlight how the model would perform differently in different situations.

We are sorry for any confusion about the evaluation setup. As suggested, we provide an example: please see Appendix B.7 (Figure 23) for how Mixtral-8x7B responds to a question under different experimental settings.


Question 1

It seems that the dataset is completely constructed automatically through the usage of LLMs - I was wondering if the authors have performed any manual check to assess the quality of the construction and if they could add details and evaluation metrics about that.

Yes, we have manually reviewed the quality of the questions while testing various QA generation prompts. We also acknowledge the importance of human evaluations and inter-annotator agreement. While we are planning to design the human evaluation framework and recruit evaluators, we would also like to include our dataset in the Supplementary Material for your reference.


Question 2

Are the authors planning to set up a platform where users could test LLMs against the created dataset?

Thank you for your suggestion regarding a testing platform. Indeed, we plan to establish one, where users could track the performance of various LLMs daily or submit their own forecasting questions.

Comment

Thank you for your response and for addressing my comments, I am happy to confirm my score.

Comment

Hi Reviewer mAPh, thank you so much for your thoughtful reviews and suggestions. We have addressed your questions and conducted new experiments and analyses as suggested. As the discussion phase is close to the end, we aim to ensure our responses fully address your concerns. We kindly ask you to consider our responses and updates in your final evaluation of our work.

Official Review
Rating: 5

This paper focuses on evaluating large language models (LLMs) in a way that remains relevant over time, as traditional benchmarks fail to capture the dynamic, ever-evolving nature of real-world information. To tackle this issue, the paper proposes using daily news articles to continuously assess LLMs' forecasting abilities. By generating QA pairs from news in various categories (such as business, politics, and arts), the authors create an evaluation benchmark named Daily Oracle. This benchmark is designed to evaluate LLMs’ ability to predict near-future events and test their temporal generalization.

Strengths

The problem of performance degradation over time is relevant, and this resource, along with the framework, could be of great interest to the community.

Weaknesses

The work addresses a known problem and dives deep into its potential impacts. While the paper has its merits, I don't think it is at a maturity level to be published yet (see my questions below).

Questions

The work relies on several automatic procedures to build the dataset that require clarification:

  • Did you evaluate the clustering approach? How is your clustering approach different from BERTopic?
  • What's the overlap rate between question and answer?
  • Degradation after cutoff is expected, and RAG is commonly used to mitigate the problem. However, in this work the retrieval used to test RAG is quite weak. First, BM25 should be at least replaced with some hybrid or dense approach. Second, the top 5 articles may not be enough (how did you choose 5?), and truncating the article at 512 can potentially cut off answers (it is unclear if the provided information contains the answer). In other words, how do we know if the problem is the retrieval or the model's ability to handle such information?

Comment

We appreciate your thoughtful feedback, and we hope to address the questions below:

Question 1

Did you evaluate the clustering approach? How is your clustering approach different from BERTopic?

Thanks for the question. To clarify, the article selection methodology described in Appendix A.1 is not presented as a novel contribution but rather as a technical implementation detail. However, our article selection methodology does incorporate some key components from BERTopic [1], including TF-IDF for representations, PCA for dimensionality reduction, and DBSCAN for clustering. We implemented careful threshold selection to filter out noise clusters and conducted manual verification of the articles within each cluster.

While we acknowledge the existence of more sophisticated topic clustering algorithms like BERTopic, our primary goal in hot topic selection is to construct representative QA pairs that capture daily trending information. However, article selection itself is not the central focus of our work. Importantly, we update a plot in Figure 22 of Appendix B.6, where we show that even with randomly selected article subsets in the closed-book evaluation, clear performance degradation patterns are observed. This indicates that even a random sampling approach yields a dataset capable of effectively highlighting the temporal generalization challenges in LLMs.

[1] Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
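
For illustration, here is a minimal sketch of the TF-IDF → PCA → DBSCAN selection step mentioned above, using scikit-learn; all hyperparameters and the sample articles are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of the article-selection step: TF-IDF representations,
# PCA dimensionality reduction, DBSCAN clustering. Hyperparameters and the
# sample articles are illustrative assumptions, not the paper's settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_daily_articles(articles: list[str]):
    """Cluster one day's articles; DBSCAN labels articles in no cluster as -1 (noise)."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(articles).toarray()
    X_reduced = PCA(n_components=min(10, len(articles) - 1)).fit_transform(X)
    return DBSCAN(eps=0.5, min_samples=2).fit_predict(X_reduced)

labels = cluster_daily_articles([
    "Central bank raises interest rates amid persistent inflation concerns.",
    "Persistent inflation pushes the central bank toward another rate hike.",
    "Local team wins the national championship after a dramatic final.",
])
print(labels)  # articles sharing a non-negative label form a candidate hot-topic cluster
```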


Question 2

What's the overlap rate between question and answer?

The overlap rate across our questions is low. For each question type, a maximum of two questions are derived from any single article. To further analyze this, we used the Sentence Transformer model 'all-MiniLM-L6-v2' [2] to obtain sentence embeddings and then computed cosine similarity scores for the question pairs. Among the TF questions, the average cosine similarity score is 0.23, and for the MC questions, the average score is 0.20. The proportion of similarity scores greater than 0.5 is only 0.68% for TF questions and 0.56% for MC questions. Please also refer to the distribution plots in Appendix A.4, Figure 9.

[2] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
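
For reference, here is a minimal sketch of this overlap analysis using the sentence-transformers package; the model name and the 0.5 threshold follow the description above, while the example questions are hypothetical.

```python
# Minimal sketch of the question-overlap check: embed questions with
# all-MiniLM-L6-v2 and measure pairwise cosine similarity. The questions
# below are hypothetical; the 0.5 threshold follows the analysis above.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
questions = [
    "Will company X report higher Q3 revenue than last year by October 2024?",
    "Will candidate Y win the state primary before June 2024?",
    "Will company X's Q3 revenue exceed the prior year's figure by October 2024?",
]
emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)

scores = [float(util.cos_sim(emb[i], emb[j])) for i, j in combinations(range(len(questions)), 2)]
print(f"mean pairwise similarity: {sum(scores) / len(scores):.2f}")
print(f"fraction of pairs above 0.5: {sum(s > 0.5 for s in scores) / len(scores):.2%}")
```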


Question 3

Degradation after cutoff is expected, and RAG is commonly used to mitigate the problem. However, in this work the retrieval used to test RAG is quite weak. First, BM25 should be at least replaced with some hybrid or dense approach. Second, the top 5 articles may not be enough (how did you choose 5?), and truncating the article at 512 can potentially cut off answers (it is unclear if the provided information contains the answer). In other words, how do we know if the problem is the retrieval or the model's ability to handle such information?

Thank you for your thoughtful comments regarding the constrained open-book setting. We would like to address your concerns about BM25 and provide additional clarification:

  • We encourage reviewing our updated comprehensive RAG results for other models in Appendix B.4 (Figures 12-18), which demonstrate RAG's potential to enhance performance across some other models.
  • Choice of Retriever: Our use of BM25 aligns with established precedent from prior work [2, 3, 4], providing a valid baseline for our dataset's constrained open-book setting (a minimal sketch of this retrieval setup is included after the references below). While we acknowledge the existence of more sophisticated hybrid and dense retrievers, BM25 serves our primary goal of establishing a baseline rather than maximizing performance. Crucially, even with potential RAG improvements, the observed performance degradation pattern persists, emphasizing the need for continual model updating.
  • Number of Retrieved Articles: The limitation to 5 articles was primarily driven by computational cost considerations, particularly when implementing RAG with commercial models.
  • Impact of 512-word Truncation: We are confident this limit does not significantly impact answer quality for several reasons:
    • The gold article setting, which uses the same truncation, achieves ~90% performance, suggesting key information is retained
    • News articles typically front-load critical information in the opening paragraphs, aligning with our truncation approach

[2] Jin, W., Khanna, R., Kim, S., Lee, D. H., Morstatter, F., Galstyan, A., & Ren, X. (2020). Forecastqa: A question answering challenge for event forecasting with temporal text data. arXiv preprint arXiv:2005.00792.

[3] Zou, A., Xiao, T., Jia, R., Kwon, J., Mazeika, M., Li, R., ... & Hendrycks, D. (2022). Forecasting future world events with neural networks. Advances in Neural Information Processing Systems.

[4] Zhang, Z., Cao, Y., Ye, C., Ma, Y., Liao, L., & Chua, T. S. (2024). Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. arXiv preprint arXiv:2406.02472.
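
For reference, here is a minimal sketch of the constrained open-book retrieval discussed above (BM25 over articles published before the RAG cutoff, top-5 selection, 512-word truncation), using the rank_bm25 package; the article data, whitespace tokenization, and cutoff date are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the constrained open-book retrieval: BM25 over articles
# published before the RAG cutoff, top-5 selection, 512-word truncation.
# Article data and simple whitespace tokenization are illustrative assumptions.
from datetime import date
from rank_bm25 import BM25Okapi

def retrieve(question: str, articles: list[dict], rag_cutoff: date,
             top_k: int = 5, max_words: int = 512) -> list[str]:
    """Return up to top_k articles published before rag_cutoff, truncated to max_words."""
    pool = [a for a in articles if a["date"] < rag_cutoff]
    if not pool:
        return []
    bm25 = BM25Okapi([a["text"].lower().split() for a in pool])
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(scores, pool), key=lambda x: x[0], reverse=True)[:top_k]
    return [" ".join(a["text"].split()[:max_words]) for _, a in ranked]

# Hypothetical mini-corpus and question.
articles = [
    {"date": date(2023, 8, 30), "text": "The central bank signalled further rate hikes this autumn."},
    {"date": date(2023, 9, 15), "text": "Quarterly earnings at the retailer beat analyst expectations."},
]
context = retrieve("Will the central bank raise rates again before November 2023?",
                   articles, rag_cutoff=date(2023, 10, 1))
print(len(context), "articles retrieved")
```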

Comment

Hi Reviewer YiaM, thank you so much for your thoughtful reviews and suggestions. We have addressed your questions and conducted new experiments and analyses as suggested. As the discussion phase is close to the end, we aim to ensure our responses fully address your concerns. We kindly ask you to consider our responses and updates in your final evaluation of our work.

Comment

Thank you for your response. Due to the extensive use of automated methods, I still believe that the lack of human evaluation remains a weakness of this paper. After reading the other reviewers' concerns (which are in part similar), I believe that this paper would benefit from another round of review after properly addressing these concerns. Since my other concerns were clarified, I'll raise my score to 5.

Official Review
Rating: 5

This paper proposes Daily Oracle, a continuous evaluation benchmark for assessing LLMs' ability to predict future events using daily news. The authors automatically generate True/False and Multiple Choice question-answer pairs from daily news articles and evaluate various LLMs' temporal generalization capabilities, both with and without RAG. The experiments reveal performance degradation over time, particularly pronounced around models' knowledge cutoff dates.

Strengths

  1. Continuous evaluation: The daily updated benchmark effectively addresses data contamination concerns in LLM evaluation, providing a timely assessment mechanism.
  2. Comprehensive experiments: The work presents thorough analyses of both open-source and closed-source models, clearly demonstrating the effect of knowledge cutoff dates on forecasting performance through moving average visualizations. The study reveals some valuable findings about model performance dropping after the cutoff date and how RAG with more recent data doesn't always help forecasting.

Weaknesses

  1. Limited technical novelty: The benchmark question format, construction prompt, and main steps are similar to TCELongBench.
  • I suggest the authors provide a comparison table that clearly outlines the key differences between their pipeline and previous work like TCELongBench.

  • Additionally, consider including a flowchart or diagram of the dataset construction process with annotations explaining the rationale for each step.

  • Also, the authors could add their prompts in the appendix for a better understanding.


  2. Insufficient dataset quality evaluation: Though the construction pipeline has an LLM-based scoring and filtering step, there is no assessment of the final generated data quality.

I suggest the authors:

  • Provide a breakdown of how many questions passed each principle in their quality control process
  • Show or plot the distribution of the final data's score in each of the designed principle dimensions
  • Conduct a human evaluation on a randomly sampled subset (e.g., 100 questions) of data, assessing both the news and QA data quality on specific metrics (e.g., Evidence, Reasonable, Plausible in TCELongBench)
  • Also, conduct human forecasting performance on the sampled subset as a reference. Include inter-annotator agreement scores to demonstrate the reliability of their assessments.

  3. Incomplete analysis of RAG results: The authors observe that RAG does not uniformly enhance performance for Llama3, with some RAG cutoffs performing worse than the closed-book setting, and conclude that outdated information may negatively impact performance.

I suggest the authors conduct a more detailed inspection of the RAG process:

  • What articles does the model retrieve? Are they relevant to the question?
  • As the retriever is a simple BM25 model, does this process consider the temporal distance between the retrieved article and the target date? Will this influence the forecasting results? Could it be that, even when different RAG cutoffs are set, the model mostly retrieves the same, very old articles due to BM25's limitations, such that the experimental condition of different RAG cutoffs becomes meaningless?

To make this analysis more concrete, the authors may consider:

  • Compute and report the average relevance score of retrieved articles to the questions.
  • Plot a histogram of the temporal distribution of retrieved articles for different RAG cutoffs.
  • Analyze the correlation between article recency and model performance.
  • Provide a specific case study of a few example questions, showing the full chain of retrieved articles and how they influenced the model's prediction.

Questions

Will the code base be made public? Will the dataset be maintained and updated daily, and will it be publicly available?

Comment

We sincerely thank the reviewer for their thoughtful and detailed feedback. Below, we address the raised concerns and suggestions:

Weakness 1

A comparison table that clearly outlines the key differences between their pipeline and previous work like TCELongBench

Thank you for raising these concerns. As suggested, we provide a comparison table in the general response. In short, our work improves upon TCELongBench by expanding the size and scope and introducing several innovations to enhance the quality of forecasting questions. It's also important to highlight that our contribution does not focus on the data construction process itself—using LLMs to generate and verify data is a common approach. Instead, our novelty lies in the dataset's expanded size, broader scope, updated frequency, and novel analyses of temporal generalization patterns.

A flowchart or diagram of the dataset construction process with annotations explaining the rationale for each step

Please see the data construction flowchart in Figure 1(a).

Also, the authors could add their prompts in the appendix for a better understanding.

Please see Appendix C.


Weakness 2

Provide a breakdown of how many questions passed each principle in their quality control process. Show or plot the distribution of the final data's score in each of the designed principle dimensions

Thanks for the constructive feedback. We've plotted the distribution as suggested in Appendix A.3.

Conduct a human evaluation on a randomly sampled subset

We acknowledge the importance of human evaluations and inter-annotator agreement. While we are planning to design the human evaluation framework and recruit evaluators, we would also like to include our dataset in the Supplementary Material for your reference.

Also, conduct human forecasting performance on the sampled subset as a reference. Include inter-annotator agreement scores to demonstrate the reliability of their assessments.

We agree that incorporating a human performance baseline could enhance the study by providing a plausible upper bound. However, since humans already have knowledge of past answers, such a comparison may not be entirely fair. More importantly, we would like to emphasize that the primary objective of our benchmark is to underscore the need for continuous model pretraining rather than focusing on comparisons with human performance. The key insight of our paper is to encourage the development of more efficient methods for continual training. In other words, comparisons to human performance may be less relevant, as our primary goal is to call for bridging the gap between the model’s past performance and recent degraded performance.


Weakness 3

Thanks for the suggestion. Below are some additional results (which can also be found in Appendix B.5) as requested:

Compute and report the average relevance score of retrieved articles to the questions.

We randomly sample 1,000 questions along with their top 5 retrieved articles and use the Sentence Transformer model 'all-MiniLM-L6-v2' [1] to get the embeddings. We then compute the average cosine similarity score between each question and its corresponding articles, yielding an overall average score of 0.49. For the distribution plot of these scores, please see Appendix B.5, Figure 19.

[1] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Plot a histogram of the temporal distribution of retrieved articles for different RAG cutoffs.

We provide a plot in Figure 20, Appendix B.5, illustrating the average date difference between the question date and the publishing dates of the retrieved articles across different RAG cutoffs. The results demonstrate that as the RAG cutoff moves forward in time, the BM25 retriever on average retrieves articles closer to the question date. In other words, the RAG cutoff does impact retrieval behavior.

Analyze the correlation between article recency and model performance.

In Appendix B.5, we present a case study that visualizes the relationship between article recency and model performance. In this analysis, we select the Mixtral-8x7B model with an RAG cutoff of September 2023. We calculate the average date difference as the mean temporal gap between a question and its top 5 retrieved articles. These average date differences are divided into quantiles: Q1 = 176 days, Q2 = 364 days, and Q3 = 626 days. As shown in Figure 21, there is no evident correlation between article recency and model performance. (Those extreme values result from a smaller number of data points available for that specific date.)
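
For illustration, here is a minimal sketch of this recency analysis, assuming a pandas DataFrame with one row per question; the column names and values are illustrative, not the paper's actual results.

```python
# Minimal sketch of the recency analysis: bin questions by the average gap
# between the question date and its retrieved articles, then compare accuracy.
# Column names and values are illustrative, not the paper's results.
import pandas as pd

df = pd.DataFrame({
    "avg_date_diff": [120, 200, 380, 650, 90, 410, 700, 300],  # days
    "correct":       [1,   0,   1,   0,   1,  1,   0,   1],    # 1 = answered correctly
})

df["recency_bin"] = pd.qcut(df["avg_date_diff"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df.groupby("recency_bin", observed=True)["correct"].mean())  # accuracy per recency quartile
print(df["avg_date_diff"].corr(df["correct"]))                     # simple correlation check
```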

Provide a specific case study of a few example questions, showing the full chain of retrieved articles and how they influenced the model's prediction.

We’ve added an example of how Mixtral-8x7B responds to a question under different experimental settings in Appendix B.7 (Figure 23).

Comment

Hi Reviewer 2UhV, thank you so much for your thoughtful reviews and suggestions. We have addressed your questions and conducted new experiments and analyses as suggested. As the discussion phase is close to the end, we aim to ensure our responses fully address your concerns. We kindly ask you to consider our responses and updates in your final evaluation of our work.

Comment

Thanks to the authors for their responses and revisions. Though they have addressed my concerns regarding additional experimental analysis, my main concerns about the limited technical contribution of this work and the lack of human evaluation remain unresolved. I have also reviewed the comments and discussions with other reviewers. I will keep my original rating.

Comment

We thank all the reviewers for their thoughtful and constructive feedback, and we appreciate the opportunity to improve our work. We submitted an updated version of our paper, and provided our dataset in the Supplementary Material. The modified/added content is highlighted in red for clarity. Below, we offer additional details, and present a checklist for the manuscript updates:

  1. [2UhV, VuTG] Comparison table of Daily Oracle with TLB-forecast (the future event forecasting dataset in TCELongBench) below.
  2. [2UhV] Score distribution across 7 principles in the QA Filtering step in Appendix A.3 (Figure 7, 8).
  3. [YiaM] Question overlapping rate in Appendix A.4 (Figure 9)
  4. [2UhV] Analysis for the constrained open-book setting results in Appendix B.5 (Figure 19, 20, and 21).
  5. [2UhV, mAPh] An example of evaluating LLM under different settings in Appendix B.7 (Figure 23).
  6. [2UhV] Prompts in Appendix C.
  7. Reviewers [2UhV, YiaM, VuTG] all raised concerns about our RAG results; we also evaluate other models in the constrained open-book setting and observe similar patterns. Please see Appendix B.4 (Figures 12–18).
  8. Figure 5, Appendix B.3: We run all other models besides Mixtral-8x7B and Llama-3-8B in the gold article setting.
  9. We update our dataset and analysis from June 2024 to September 2024 in all experiment settings. (Figure 3, 4, and 5).

Comment

TCELongBench is a benchmark designed to evaluate LLMs in analyzing Temporal Complex Events (TCEs) - events composed of interconnected news articles spanning extended periods [1]. It includes TLB-forecast, a dataset for the future event forecasting task. While we adopt TCELongBench’s prompts in our QA construction process, our work improves upon it by expanding the size and scope and introducing several innovations to enhance the quality of forecasting questions:

| Aspect | TLB-forecast | Daily Oracle (Ours) |
| --- | --- | --- |
| Topic Coverage | Cooperation and conflict events in the Middle East | Global topics (e.g., politics, business, science, arts) from 2019 to present |
| Time Span | 2015–2022 ¹ | 2019–present |
| Dataset Size | 6,604 QA pairs | 29,988 QA pairs |
| Question Type | Multiple Choice | True/False + Multiple Choice |
| Update Frequency | Static (no updates after initial release) | Daily |
| Time Element in the Question | Questions often overly specific to exact dates (e.g., "Why will X happen after YYYY-MM-DD?") | Questions are designed with more realistic temporal constraints (e.g., "Will X happen before YYYY-MM?") |
| Information Leakage Prevention | At least 15.5% ² of their questions may have information leakage issues; questions like "Why will Rouhani's visit to Austria be cancelled in 2016-04-12?" presuppose knowledge of the cancellation occurring on that specific date, which would not be known in advance | We provide few-shot examples and apply QA filtering to avoid including future knowledge in the questions, ensuring they rely solely on information available at the time of generation |
| Quality Control Process | Evaluates evidence, plausibility of distractors, and guessability | Expands on this with 7 QA filtering principles, including correctness of answers, non-answerability before the publication date, absence of information leakage, objectivity, inclusion of a clear temporal element, public interest, and non-obviousness of the answer |
| Scoring Framework | Does not apply a systematic scoring mechanism for QA filtering | Implements a scoring framework, requiring questions to meet a threshold of 13 points for inclusion |
| Continuous Evaluation | No insights into how prediction ability changes over time | Assesses LLMs' temporal generalization and reveals the degradation pattern |
| Example QA Pairs (Randomly Selected) | Why will Egypt create an African lobby in 2017-11-29? - To defend its position with regards to Ethiopia's Renaissance Dam. | Will Geoffrey Berman, the U.S. attorney for the Southern District of New York, be fired by the Trump administration in June 2020? - Yes. |
|  | What will be the next step for the remaining parties of the JCPOA after 2021-04-10? - They will meet again in person on Wednesday. | Will Amazon's North American Retail be reported as generating less cash than AWS by February 2022? - No. |
|  | How will the European countries respond to the launch of INSTEX in 2019-07-11? - They will express their commitment to fulfill their promises and bring economic benefits to Iran. | Will the number of reported whooping cough cases in the US surpass last year's count by May 2024? - Yes. |
|  | How much money will Egypt allocate for Gaza's reconstruction after the meeting in Cairo in 2021-06-08? - $500 million. | How much will Tesla's net income amount to in the Q3 earnings report released in October 2022? - $3.3 billion. |
|  | Which country will be the next to join Israel's diplomatic alliance against Iran after 2021-08-20? - Morocco. | How many states in Mexico will have decriminalized abortion before the Supreme Court's federal ruling by September 2023? - Twelve. |

[1] Zhang, Z., Cao, Y., Ye, C., Ma, Y., Liao, L., & Chua, T. S. (2024). Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. arXiv preprint arXiv:2406.02472.

Footnotes

  1. Considering that most powerful LLMs have been developed after 2020, the portions of the dataset covering earlier years may contain answers already seen during training. This prior exposure compromises the dataset's effectiveness as a forecasting benchmark.

  2. 11.63% of "How will" questions and 3.88% of "Why will" questions, as reported in [1].

AC Meta-Review

Summary:

The paper introduces Daily Oracle, a benchmark for continuously evaluating LLMs' temporal generalization and forecasting abilities. The benchmark is constructed using automatically generated QA pairs derived from daily news and evaluates LLMs in both closed-book and retrieval-augmented settings. The authors find that as pre-training data becomes outdated, LLM performance degrades, even with retrieval augmentation, and highlight the necessity for continuous model updates.

Strengths:

  • Addresses the pressing issue of evaluating LLMs in dynamic, real-world contexts where knowledge evolves rapidly.
  • Includes multiple evaluation settings (closed-book, constrained open-book, and gold article) to analyze temporal generalization.
  • Larger and broader than previous benchmarks, capturing a wider temporal range and global topics.

Weaknesses:

  • The methodology for data construction and evaluation is heavily reliant on existing techniques.
  • The reliance on LLM-based QA filtering without validation via human evaluation raises concerns about dataset quality and robustness.

Additional Comments from Reviewer Discussion

The limitations outlined above were identified by multiple reviewers—specifically, the first limitation (limited technical novelty) was raised by Reviewers 2UhV and YiaM, while the second (lack of human evaluation) was highlighted by Reviewer 2UhV. I concur that these issues are both significant and critically affect the overall quality and contribution of the paper. While the authors actively engaged in the rebuttal process and provided additional clarifications and analyses, reviewers with negative perspectives ultimately maintained their original assessments. Therefore, I recommend rejecting this manuscript at this time and encourage the authors to strengthen the technical novelty of the work, particularly by introducing more innovative methodologies and incorporating robust human evaluation to validate the dataset's reliability.

Final Decision

Reject