PaperHub
6.5 / 10 (Poster; 4 reviewers; min 5, max 8, std 1.1)
Ratings: 5, 7, 6, 8
Confidence: 3.8 | Correctness: 3.5 | Contribution: 3.0 | Presentation: 3.8
NeurIPS 2024

Approaching Human-Level Forecasting with Language Models

Submitted: 2024-05-08; Updated: 2024-11-06
TL;DR

We present the first ML system that can forecast at near human levels.

Abstract

Keywords
language models, forecasting, information retrieval, retrieval augmentation

Reviews and Discussion

Review (Rating: 5)

This paper introduces a forecasting system based on Language Models that aims to achieve human-level forecasting capabilities. It presents a system that autonomously searches for relevant information, generates forecasts, and aggregates predictions. Through collecting a large dataset of questions from competitive forecasting platforms, the authors test the system's end-to-end performance. The results indicate that the system performs nearly on par with the crowd aggregate of competitive forecasters and surpasses it in certain scenarios.

Strengths


  1. The paper introduces a novel and well-conceived retrieval-augmented LM system, effectively combining information retrieval, reasoning, and prediction aggregation to enhance forecasting accuracy.
  2. The paper proposes a self-supervised fine-tuning method that leverages the model's own forecasts to generate training data, thereby improving the accuracy of predictions.
  3. The results are based on a comprehensive dataset of questions from multiple forecasting platforms, enhancing their breadth and reliability.

Weaknesses


  1. Compared to the baselines, the system requires significant computational resources due to its summarization and multi-sampling operations. Although the authors use some methods to reduce the cost, reporting the token statistics and cost incurred by the system and the baseline may be necessary.
  2. The system prompts the base model 3 times and the fine-tuned models 3 times, whereas the baseline is prompted only once. This may create an unfair comparison; an obvious baseline would be to sample the baseline 6 times and then vote or average.
  3. Some descriptions are confusing. For example, Section 6.1 states "our averaged Brier score is .179, while the crowd achieves .149", but I don't see the paper say which table these results come from.

Questions

N/A

Limitations

N/A

Author Response

Thank you for evaluating our work.

Compared to the baselines, the system requires significant computational resources due to its summarization and multi-sampling operations. Although the authors use some methods to reduce the cost, reporting the token statistics and cost incurred by the system and the baseline may be necessary.

We agree that our system is more costly than the baseline method of naively prompting the base model once. However, we find that answering a single question with our system, even with the most expensive model such as GPT-4, typically costs no more than 0.30 US dollars. This is a rather cheap method overall, especially in comparison with the alternative of hiring human expert forecasters, which usually requires significantly more resources.

In addition, since the first submission of our paper, AI labs have continued to make progress in inference optimization. The GPT-4o-mini model, for example, costs over 10x less than the particular versions of GPT-4 that we used in the paper, while being stronger on benchmarks. We expect future developments to drive costs down even further, making our system extremely cheap to run.

The system prompts the base model 3 times and the fine-tuned models 3 times, whereas the baseline is prompted only once. This may create an unfair comparison; an obvious baseline would be to sample the baseline 6 times and then vote or average.

We implemented the baseline you recommended, utilizing 6 "gpt-4-turbo-2024-04-09" model calls, each with a temperature setting of 0.5. The results were as follows:

  • Using the average for final votes: score .205, standard error .0104
  • Using the trimmed mean for final votes: score .205, standard error .0108

(For comparison, our system's score is .179 with standard error .003.)

This shows that the stronger multi-sample baseline still falls well short of our system's performance.
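
For concreteness, here is a minimal sketch of how such a multi-sample baseline with mean and trimmed-mean aggregation could be implemented (the prompt wording, probability parsing, and the 10% trim fraction are illustrative assumptions, not the exact choices behind the numbers above):

```python
# Hypothetical sketch of the 6-sample GPT-4 baseline with mean / trimmed-mean
# aggregation and Brier scoring; prompt wording and parsing are placeholders.
import re
import numpy as np
from scipy.stats import trim_mean
from openai import OpenAI

client = OpenAI()

def sample_probability(question: str, temperature: float = 0.5) -> float:
    """Ask the model for a probability in [0, 1] and parse it from the reply."""
    reply = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        temperature=temperature,
        messages=[{
            "role": "user",
            "content": f"{question}\nAnswer with a single probability between 0 and 1.",
        }],
    ).choices[0].message.content
    match = re.search(r"0?\.\d+|[01](?:\.0+)?", reply)
    return float(match.group()) if match else 0.5  # uninformative fallback

def baseline_forecast(question: str, n_samples: int = 6) -> dict:
    """Aggregate repeated samples into a single forecast."""
    probs = np.array([sample_probability(question) for _ in range(n_samples)])
    return {"mean": float(probs.mean()),
            "trimmed_mean": float(trim_mean(probs, 0.1))}  # trim 10% per tail (assumed)

def brier_score(forecasts: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    return float(np.mean((forecasts - outcomes) ** 2))
```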

Thank you for this suggestion, and we will make sure to include this baseline in the final paper.

Some descriptions are confusing. For example, Section 6.1 states "our averaged Brier score is .179, while the crowd achieves .149", but I don't see the paper say which table these results come from.

They are from Table 3, page 5 (this table is the main result of our paper).
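
For reference, the Brier scores in that table follow the standard definition for binary questions (lower is better): the mean squared error between the forecast probability and the realized outcome,

```latex
\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N} \left(f_i - o_i\right)^2,
\qquad f_i \in [0,1],\ o_i \in \{0,1\}.
```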

Please let us know if there are other concerns we can address! If not, we hope you can consider increasing your score. Thank you again for your review.

Review (Rating: 7)

The authors contribute a novel system that approaches human-level forecasting performance. The authors also contribute a dataset of forecasting questions submitted to various human forecasting websites. The authors show that their system generally approaches human crowds. In some settings, where the LLM can selectively submit forecasts, they find that their system even outperforms humans. The authors conduct a series of ablations across their system, and highlight how each component contributes to the overarching forecasting ability.

Strengths

This is a well written paper that tackles an interesting problem for LLM based systems. The dataset it contributes is also quite useful. The ablations are careful, and the Appendix does a good job detailing aspects of the dataset, ablations, prompts, and other design decisions. Providing a system that can forecast events at a level that can rival humans and serve as a complement across decision making also has important practical applications (that the authors also carefully discuss).

Weaknesses

I would cite some crowdworking papers from other fields (e.g. HCI) just to highlight the effectiveness of crowd work in the related work section.

Minor: I would’ve liked more qualitative examples at each step of the system in the text: e.g. what the retrieved articles looked like, other questions, etc. instead of hunting through the appendix.

Closed models: I wonder how far Llama 2 / 3 could go, given the same finetuning setup. Given that the OpenAI models are closed, it would've been nice to see if we could push open-source finetuning to achieve similar deltas.

How good are the humans on these markets, exactly? They seem public in nature. Is the average forecaster on Metaculus an uninformed person? I think that is worth emphasizing in the paper, or at least qualifying some of the findings with a crowd of human amateurs.

Many of the weaknesses are formulated as questions in the question section.

Questions

  • Will the authors release the dataset if/when the paper is published? I think the collected dataset, even if it becomes stale soon, is very useful!

  • A benefit of this system is that rationale---even if unfaithful---can be extracted from the LLM. Is there any kind of analysis on the underlying reasoning that might shed more light on the types of evidence an LLM finds "important" across forecasting tasks? In general, a more thorough error analysis of the final system would have been nice!

  • In the selected prediction setting, are there certain subareas or domains where the LLM is more likely to submit a forecast?

  • Also, is there an ablation with just finetuning and no IR? I’m curious to see how much the LLM system actually uses the retrieved docs.

  • Purely for curiosity reasons: would the authors expect this to work on stock market predictions? Why or why not?

  • I’m happy to raise my score a bit if some of these points are addressed!

Limitations

Yes

Author Response

Thank you for evaluating our work.

I would cite some crowdworking papers from other fields (e.g. HCI) just to highlight the effectiveness of crowd work in the related work section.

Thanks for pointing this out! We plan to cite the following review paper in the related work section: Ghezzi, Antonio, Donata Gabelloni, Antonella Martini, and Angelo Natalicchio. "Crowdsourcing: A review and suggestions for future research." International Journal of Management Reviews 20, no. 2 (2018): 343-363.

Minor: I would’ve liked more qualitative examples at each step of the system in the text: e.g. what the retrieved articles looked like, other questions, etc. instead of hunting through the appendix.

To provide a better sense: the retrieved articles are exactly a subset of what you might find on Google News by searching "candidate X winning chance". For questions on elections, the LLM typically suggests questions like "Who are candidate X's competitors?" and "What is the fundraising situation of candidate X?".

Closed models: I wonder how far Llama 2 / 3 could go, given the same finetuning setup. Given that the OpenAI models are closed, it would've been nice to see if we could push open-source finetuning to achieve similar deltas.

Llama 3 was unfortunately pre-trained on data up until mid-2024. Due to potential leakage, we cannot evaluate the Llama-3 series on our dataset, which contains questions on events beginning June 1, 2023.

We fine-tuned a Mixtral 8x7B model under the same setup; however, it only obtained a Brier score of ~0.21, far from the performance of GPT-3.5 or GPT-4. We conjecture, though, that more recent open models like Llama 3 should be able to achieve the same level of forecasting ability.

How good are the humans on these markets, exactly? They seem public in nature. Is the average forecaster on Metaculus an uninformed person?

While individual forecaster performance varies on Metaculus, we compare our system to the Metaculus crowd prediction, which consistently beats even the top forecasters, making it a strong target to compete with (see https://www.metaculus.com/notebooks/15760/wisdom-of-the-crowd-vs-the-best-of-the-best-of-the-best/).

We spent considerable effort sourcing individual forecasters' records across different platforms. However, most platforms do not release individuals' raw forecasts. In some cases, they release forecaster scores, but since the platforms use different and sometimes ambiguous scoring methods, we were not able to recover the individual probabilities. We hope future work can address this issue in a different way, perhaps by recruiting human forecasters directly to compete with LLMs.

Will the authors release the dataset if/when the paper is published? I think the collected dataset, even if it becomes stale soon, is very useful!

We have released the dataset on Hugging Face. We are not linking it here out of respect for the NeurIPS anonymization policy.

Also, is there an ablation with just finetuning and no IR?

No, since our fine-tuning dataset consists of retrieved articles; as a consequence, the fine-tuned model naturally requires retrieved articles as part of its inputs. Our qualitative examples, however, do show that our models rely significantly on the retrieved articles to reason and make predictions. See Appendix J for some cases.

Purely for curiosity reasons: would the authors expect this to work on stock market predictions? Why or why not?

Our system can beat expert human forecasters, especially in the selective prediction setting. However, we have not made any targeted optimizations for stock forecasts.

Please let us know if there are other concerns we can address! If not, we genuinely hope you can consider increasing the rating. Again, thank you for reviewing our paper!

Review (Rating: 6)

The authors benchmark LLMs' ability to perform the task of forecasting, or predicting the outcome of future events. They test several methods and find that ensembling pretrained and fine-tuned LLMs with access to news sources produces predictions similar in accuracy to humans.

Strengths

The authors collect a dataset and a human baseline to benchmark the task.

The authors investigate many model designs and ablations to understand which factors lead to high accuracy.

The authors present a model that is comparable to human accuracy on a challenging new task.

The authors provide an analysis of the differences between model and human prediction distributions.

The authors benchmark a number of different models.

Weaknesses

The dataset is relatively small, as there is limited data in existence for which there is a human baseline.

Questions

None

Limitations

Yes

Author Response

Thank you for evaluating our work.

The dataset is relatively small, as there is limited data in existence for which there is a human baseline.

We note that our dataset is the largest and most up-to-date available for automated forecasting. Compared to the latest work, which includes 3,833 binary questions (https://arxiv.org/abs/2206.15474), our dataset is 1.4 times larger, comprising a total of 5,516 binary questions. Additionally, for each of these 5,516 questions, there are crowd forecasts across multiple time stamps, resulting in a total of 1,118,154 forecasts.

In our work, we utilize up to 5 time stamps ("retrieval dates"), amounting to 22,064 forecasts, which is still significantly larger than any prior work. Moreover, we release this dataset of 5,516 forecasting questions and 1,118,154 forecasts, along with a larger dataset containing 33,664 questions and 4,044,325 forecasts (see Table 11 in Appendix C.2).

Please let us know if there are any other concerns we can address. If not, we hope you will consider increasing the rating. Thank you again for reviewing our paper.

Comment

I maintain my existing score. The work serves as a useful first benchmark for this area. It is technically solid and thorough, with moderate impact on new use cases of LLMs.

Review (Rating: 8)

The authors develop a forecasting system that uses news article retrieval and reasoning to predict future events. The system performs with near-human capability and is also complementary to humans. Thorough ablations and evaluations are done to identify that each component of the paper's method provides meaningful improvements to the prediction accuracies of the system.

Strengths

Originality: Tackles a highly impactful and important field of prediction. To my knowledge, not many efforts as well-organized as this work have been made towards this.

Quality: Methods, evaluations, etc. are done extremely carefully including hyperparameter search, testing multiple dates of prediction, and fine-tuning multiple models.

Clarity: Writing is clear and easy to understand. No issues.

Significance: Accurate predictions are highly applicable to almost any macro-level problem. This paper and its results are very significant, and the improvement over humans (though small) is already a big deal.

Weaknesses

Overall, it is already great work. Kudos to you. A few suggestions to improve the paper:

  1. More motivation in the introduction/general paper on why prediction is important: Other reviewers/readers may not understand the degree of importance prediction tasks have in fields such as social science.

  2. More generally, I feel that the paper could benefit from having more focus on a story of why this matters, such as a short discussion section. In particular, discussion on scalable oversight/broader impact and how to manage models that have these capabilities would be appreciated.

(not necessary): Would be interested in seeing comparisons against human experts if there are any such datasets out there.

Questions

See weaknesses.

Limitations

No Limitations.

Author Response

Thank you for evaluating our work.

More motivation in the introduction/general paper on why prediction is important: Other reviewers/readers may not understand the degree of importance prediction tasks have in fields such as social science.

We discuss the broader impact in more detail in Appendix I, since we were a bit constrained by the page limit of the NeurIPS submission. Nevertheless, the camera-ready version allows for extra space, and we plan to add additional detail on the importance of prediction tasks and the broader implications of LLM forecasters.

Thank you for highlighting the significance of our work.

Would be interested in seeing comparisons against human experts if there are any such datasets out there.

We spent considerable effort sourcing individual forecasters' records across different platforms. However, most platforms do not release individuals' raw forecasts. We hope prediction platforms will make this data accessible in the future, or that this issue can be addressed through other means, such as recruiting human forecasters to compete directly with LLMs.

On the other hand, it is noteworthy that the community aggregate typically outperforms aggregates of the top 5, 10, ..., 30 best forecasters (based on past scores), making it a very strong benchmark for comparison. For more details, see this analysis: https://www.metaculus.com/notebooks/15760/wisdom-of-the-crowd-vs-the-best-of-the-best-of-the-best/.

Comment

Thanks to the authors for the rebuttal. It would be great to see the contents of Appendix I incorporated into the main text in the final version.

As previously mentioned, a brief discussion on scalable oversight/broader impact and how to manage models that have these capabilities, maybe in the Appendix, would still be appreciated.

"it is noteworthy that the community aggregate typically outperforms aggregates of the top 5, 10, ..., 30 best forecasters (based on past scores)"

This is reassuring, and supports the paper well.

Happy to keep my current score.

Final Decision

This study introduces a novel forecasting system to predict future events based on retrieval-augmented LMs. The main contributions include:

A valuable benchmark: the authors curate a benchmark for event prediction and provide a human baseline.

Strong empirical experiments: the proposed system is comparable to human accuracy.

Detailed analysis and ablation studies.

All the reviewers agree that the paper makes a significant contribution and are positive about this submission. Their concerns were properly addressed during the discussion. It is solid work, and I recommend its acceptance.