PaperHub
Overall score: 6.5/10 · Poster · 4 reviewers
Individual scores: 6, 7, 6, 7 (min 6, max 7, std 0.5)
Average confidence: 3.3
TL;DR

We conducted the first systematic analysis to quantify the impact of Large Language Models (LLMs) on academic writing over time.

Abstract

Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent this tool might affect global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis of 950,965 papers published between January 2020 and February 2024 on arXiv, bioRxiv, and in the Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. The framework operates at the population level, without the need to perform inference on any individual instance. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and papers of shorter length. Our findings suggest that LLMs are being broadly used in scientific papers.
Keywords
Academic Writing, Computational Social Science, Science and Technology Studies, Societal Impact and LLM Adoption

Reviews and Discussion

Review
Rating: 6

Thanks to the authors for the hard work on this paper. The work is well-written and well-argued. It is a relatively small contribution: just applying an existing algorithm to existing data after validating/training with synthetic + real data. The analyses are nice, but can be combined into a single, more comprehensive one (I discuss this in more detail below). In summary, I like this work, but it should be expanded.

Reasons to Accept

The paper is clear. The conclusions make sense and are compelling. The existing analysis seems well-done. Nice charts.

Reasons to Reject

  • [Minor] Page 5: "This approach also simulates how scientists may be using LLMs..." Need a citation here or another piece of evidence that supports your two-stage approach.
  • [Minor] Figure 3 is hard to read. I recommend changing y-axis to "error" instead. That's the information I want to extract in any case.
  • [Major] Sections 5.2, 5.3, and 5.4 are all done independently and in somewhat arbitrary ways. E.g., for 5.2, papers are divided into two or fewer preprints vs. three or more, and paper length into below or above 5000. These cutoffs don't seem principally motivated, and the results may differ if they change. Additionally, all of the analyses are done independently. For these kinds of analyses, I recommend instead a fixed-effects model where the output (estimated alpha) is modeled as a function of multiple input factors simultaneously, e.g.: datetime (bucketed), # of preprints (bucketed or not), arXiv category (categorical), distance to nearest neighbor (bucketed or not), preprint posting (see the sketch after this list). Adding new factors is straightforward both conceptually and computationally, and you can then see how any individual variable correlates (via its coefficient size) with the output while taking all others into account. It may be the case, for example, that once you account for # of preprints, the effect size of distance to nearest neighbor goes to 0. What other factors could you include in your model that may be confounders?
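To make the suggestion concrete, here is a minimal sketch of such a joint model in Python with statsmodels, run on synthetic stand-in data. All column names and values are illustrative assumptions, and the sketch presumes per-paper alpha_hat estimates exist, which, as the authors note in their response below, the population-level framework does not actually provide:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500

    # Synthetic stand-in data; columns and values are illustrative only.
    df = pd.DataFrame({
        "alpha_hat": rng.uniform(0.0, 0.2, n),    # hypothetical per-paper LLM fraction
        "n_preprints": rng.poisson(3, n),         # first-author preprint count
        "length": rng.normal(5000, 1500, n),      # paper length in words
        "nn_distance": rng.uniform(0.0, 1.0, n),  # distance to nearest-neighbor paper
        "category": rng.choice(["cs.CV", "cs.CL", "cs.LG"], n),
        "quarter": rng.choice(["2023Q1", "2023Q2", "2023Q3"], n),
    })

    # One regression with category and time-bucket fixed effects; each coefficient
    # reflects a covariate's association with alpha_hat holding the others fixed.
    model = smf.ols(
        "alpha_hat ~ n_preprints + length + nn_distance + C(category) + C(quarter)",
        data=df,
    ).fit()
    print(model.summary())

With all covariates entered jointly, a near-zero coefficient on nn_distance after controlling for n_preprints would signal exactly the confounding scenario described above.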
Author Response

Thank you for your feedback and the time you invested in reviewing our manuscript. We are delighted that you found our paper to be well-written, clear, and compelling.

[Major] These cutoffs don't seem principally motivated, ... I recommend instead a fixed-effects model where the output (estimated alpha) is modeled as an output of multiple input factors simultaneously....

Thank you for the suggestions. We clarified that the cutoffs are not arbitrary. The cutoffs in Sections 5.3 and 5.4 are determined by the rounded median. We used the rounded-up average for Section 5.2. We have added these clarifications in each of the figure captions of Figures 4, 5, and 6.

We appreciate the reviewer's thoughtful suggestion of a fixed-effects analysis. We clarified that our algorithm provides only a population-level estimate, and conducting a fixed-effects analysis is challenging without individual-level outcomes/estimates. Nevertheless, we have included a robustness analysis to validate the stability of our findings. We agree that certain subfields can be more crowded, and authors in these subfields may publish more papers simultaneously. Therefore, to further validate the robustness of our findings, we stratified the papers into Computer Vision (cs.CV), Computation and Language (cs.CL), and Machine Learning (cs.LG). We found that our results are consistent when stratifying by subfield (Supp. Figures 12, 13, and 14). This consistency indicates the robustness of our findings.
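For context, here is a minimal sketch of the kind of population-level estimation at issue, using toy token distributions (the words and probabilities below are illustrative placeholders, not the paper's actual estimates). The corpus is modeled as a mixture p(x) = (1 - alpha) * p_human(x) + alpha * p_ai(x), and alpha is estimated by maximum likelihood over the whole corpus, with no per-document inference:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Toy occurrence probabilities for "human" vs. "LLM" text; illustrative only.
    p_human = {"delve": 0.001, "notable": 0.010, "shows": 0.989}
    p_ai = {"delve": 0.030, "notable": 0.040, "shows": 0.930}

    def estimate_alpha(tokens):
        """MLE of the corpus-level LLM-modified fraction alpha under
        p(x) = (1 - alpha) * p_human(x) + alpha * p_ai(x)."""
        counts = {t: tokens.count(t) for t in set(tokens)}

        def neg_log_likelihood(alpha):
            return -sum(
                c * np.log((1 - alpha) * p_human[t] + alpha * p_ai[t])
                for t, c in counts.items()
            )

        return minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0),
                               method="bounded").x

    corpus = ["delve"] * 5 + ["notable"] * 12 + ["shows"] * 983
    print(f"estimated alpha: {estimate_alpha(corpus):.3f}")

Because the likelihood is pooled over all tokens, the estimator recovers only the aggregate fraction alpha; it assigns no label or score to any individual paper, which is why per-paper covariate regressions are not directly available.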

[Minor] Page 5: "This approach also simulates how scientists may be using LLMs..." Need a citation here ... supports your two-stage approach.

Thank you for the suggestion. Following your guidance, we have added a citation to Lee et al. [1], which characterizes human writing with AI assistants as proceeding in planning and drafting stages, aligning with our two-stage process. We have added the citation and the accompanying discussion to the methods section of the revised manuscript.

[1] Lee, Mina, et al. "A Design Space for Intelligent and Interactive Writing Assistants." Proceedings of the CHI Conference on Human Factors in Computing Systems. 2024.

[Minor] Figure 3 ... changing y-axis to "error" instead.

Thank you for the suggestion. We have added a figure where the y-axis has been changed to "error" in the supplementary figures for clarity.

We again thank Reviewer i8UX for their review of our manuscript, and we hope that the above response adequately addresses their concern.

Review
Rating: 7

The paper analyses the use of large language models (LLMs), specifically ChatGPT, in writing scientific papers in specific fields, i.e., the sciences (Computer Science, Mathematics, Statistics, etc.). For this, the authors collect data from various publishers (including arXiv) and identify ChatGPT usage with the distributional LLM quantification framework. Their results show an increase in ChatGPT usage in scientific writing starting from 2020, with some disciplines using it more frequently (Computer Science) than others (Mathematics).

Empiricism, Data, and Evaluation: The paper offers a strong empirical foundation, as the study uses a large collection of scientific papers (and abstracts) and applies an established method to this data. In this way, it also offers reproducible results.

Ambition, Vision, Forward-outlook: The growth in LLM use in recent years has been very fast, and the challenges it poses are important to address in studies such as this. This work is timely and highly relevant.

Understanding Depth, Principled Approach: The authors show a good understanding of the approach used. They also provide a detailed analysis of various aspects, such as correlations with preprint posting, paper length, etc.

Clarity, Honesty, and Trust: The paper under review is clearly written. However, I have some comments on the structure. The introduction contains graphs, which is very unusual. It would be better to restructure the paper so that the introduction describes only the aims and motivation, along with an outline of what comes next. The other parts should come later; e.g., the graph illustrations should be part of the results.

Further issues are rather minor and concern formatting. For instance, quotation marks are not properly formatted and point in one direction. The authors should also indicate the date of access for the given URLs (e.g., in the references).

Reasons to Accept

The paper is highly relevant to the topic of the conference. The results are not only interesting but can also inspire further research.

Reasons to Reject

I do not see any reasons to reject. However, I would recommend that the authors restructure the paper for the final version for better readability.

Author Response

We thank Reviewer T5PN for their positive comments and for providing thoughtful feedback on our work.

Following your guidance, we have revised the introduction section to improve clarity and to discuss the scope and motivation of our study. Furthermore, we respectfully clarify that placing figures in the introduction is a common practice in the literature [1,2,3], which we followed.

References

[1] Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are Emergent Abilities of Large Language Models a Mirage?" Advances in Neural Information Processing Systems 36 (2024).

[2] Mitchell, Eric, et al. "DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature." ICML (2023).

[3] Wei, Jason, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Advances in Neural Information Processing Systems 35 (2022): 24824-24837.

Comment

Thank you for your response.

Review
Rating: 6

This paper highlights growing concerns and interest in the prevalence of AI-generated text in academic publishing, spurred by anecdotal examples and evolving editorial policies. It discusses the need for systematic analysis to understand the extent and implications of AI-modified content, introducing a framework for quantifying such modifications at scale. The study applies this framework to a large dataset of academic papers across various disciplines, revealing significant growth in AI-modified text, particularly in Computer Science. It also identifies associations between AI-modification, author behaviors like preprint posting frequency, and paper characteristics like length.

Reasons to Accept

  1. The study offers a novel contribution to the field by introducing a framework for quantifying AI-modified content in academic publishing and applying it at scale across multiple disciplines.
  2. With the increasing concerns and debates surrounding the use of AI-generated content in academic publishing, the study's findings are highly relevant and timely.
  3. The study opens avenues for further research into the impacts of AI-generated content on scholarly communication, knowledge dissemination, and academic discourse. Future studies could build upon the framework and findings presented, exploring additional factors influencing AI-modification trends and their broader implications for the academic community.

Reasons to Reject

  1. The study focuses primarily on quantifying the prevalence of AI-modified content in academic publishing without delving deeply into the potential positive aspects of AI technology in this context.
Author Response

We thank Reviewer Rcc2 for their positive comments and helpful feedback on our work.

We have added a discussion of the positive aspects of AI usage as follows: "Researchers who are not native speakers of English may find it helpful to have an AI model polish their writing. Additionally, LLMs offer the possibility of immediate feedback on initial drafts, compared to traditional peer review processes, which can be time-consuming." This discussion has been incorporated into the revised manuscript.

We again thank Reviewer Rcc2 for their review of our manuscript, and we hope that the above responses adequately address their concerns.

Review
Rating: 7

The paper investigates the increase in the use of LLM-modified content in academic publications. In particular, changes in the percentage of LLM-modified content before and after the release of ChatGPT are analyzed from the viewpoints of the fields of papers, submission rates to arXiv, similarity between papers, and length of papers. The main finding is that the proportion of LLM-modified content in computer science papers was 17.5%, followed by Electrical Engineering and Systems Science at 14.4%, which is higher than that of other fields. The authors also report that higher levels of LLM revision are associated with papers in which the first author submits preprints more frequently, papers in crowded fields, and papers that are shorter in length.

Reasons to Accept

  • The paper addresses the interesting topic of the use of LLMs in academic writing.
  • The analysis is based on relatively large datasets, and the analytical methods can be considered reasonably reliable.
  • While the findings hold no major surprises, the authors have identified a clear trend in LLM use.

Reasons to Reject

  • The related work section is a weakness of this paper, as it only mentions studies on determining whether a given sentence is LLM-modified. The authors should indicate whether there are studies that analyze the proportion of LLM use as this paper does; if such studies exist, what they are; and how this paper stands in comparison.
  • Although the "limitations" section includes related references, it cannot be denied that one reason for the significant change in trends before and after the appearance of ChatGPT in computer science is the field's many papers on LLMs and studies using LLMs, raising the possibility that the amount of content mistakenly identified as LLM-modified has increased.

Questions for the Authors

  • For journal papers, a certain amount of time elapses between submission and acceptance. Is the analysis based on the date of submission or of publication? If the latter, isn't some kind of correction necessary?
  • Whether the collected paper set is crowded depends on its coverage and the breadth of the field covered. As an extreme example, the field covered by Nature is very broad, yet Nature does not publish all papers in any related field, so its paper set is unlikely to be crowded. Is it reasonable to discuss whether an area is crowded when the sets of papers come from different sources?
Author Response

Thank you very much for your helpful feedback and support for the paper! We have carefully updated the paper to incorporate your suggestions.

The authors should indicate whether there are studies that analyze the proportion of LLM use as addressed in this paper... and how this paper stands compared to those studies.

Thank you for the comment. We clarified that we are among the first papers to measure the prevalence of LLM-modified content. Previous work, such as Liang et al. [1] and a follow-up study [2], focused on the prevalence of LLM-modified content in scientific peer review for AI conferences. In contrast, our paper investigates the prevalence of LLM-modified content in the writing of academic papers themselves.

[1] Liang et al. "Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews." ICML (2024).

[2] Latona et al. "The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates." arXiv preprint arXiv:2405.02150 (2024).

Study Limitations

Thank you for the comment. We have added a discussion acknowledging this limitation: "One potential confounder of our study is the increased prevalence of research on LLMs after the launch of ChatGPT. This shift in research focus could potentially affect the accuracy of our method in detecting LLM-modified content. However, our validation has shown that our framework is robust under temporal distribution shifts of research topics. Still, future studies could further validate and analyze the robustness of our method with more systematic control of the study content."

In the case of journal papers... Is the analysis based on the date of submission or publication?

We analyzed Nature portfolio journal papers using both submission and publication dates. The results were consistent, with Nature portfolio papers having among the lowest estimated alphas, even when plotted by submission date. This clarification has been added to the revised manuscript.

Whether the collected paper set is crowded or not depends on its coverage and the breadth of the field covered...

We clarified that we used only arXiv CS papers for our fine-grained analysis. To validate the robustness of our findings, we stratified the papers by arXiv CS subfields (cs.CV, cs.CL, cs.LG) and found consistent results (see Supplementary Figures 12-14).

Comment

Thank you for your response. I'm glad my comments will help you improve your paper.

Final Decision

In this paper the authors study the use of and increase in LLM-generated text in scientific research articles. They analyze ~1M papers published on arXiv, bioRxiv, and in the Nature portfolio using the (previously published) distributional LLM quantification framework, which estimates the corpus-level proportion of ChatGPT-generated text by generating LLM versions of text from known human-authored papers. They perform an interesting analysis of the results with respect to field, submission rates, and paper lengths, finding that LLM-generated content has indeed been increasing substantially since the launch of ChatGPT, and that publication rate (at the author and field level) correlates with higher rates of LLM-generated content.

The reviewers agree that the paper addresses an interesting and timely topic using reliable analytical methods, and that it is a good fit for the conference. While the paper was sufficiently clearly written, the reviewers recommended some minor edits for grammar and clarity and additions to related work.