Found 717 related papers

8.8
18

COLM 2025 Poster
Zhiyuan Zeng et al.
8.3
14

COLM 2024 Poster
Bo-Ru Lu et al.
8.3
13

COLM 2025 Poster
Guilherme Penedo et al.
8.3
12

COLM 2024 Poster
David Rein et al.
8.3
14

COLM 2024 Poster
Jaehun Jung et al.
8.2
16

COLM 2024 Poster
Devansh Jain et al.
8.0
12

COLM 2024 Poster
Zhaoyu Li et al.
8.0
20

COLM 2025 Poster
Naman Jain et al.
8.0
11

COLM 2025 Poster
Alisa Liu et al.
8.0
24

COLM 2025 Poster
Scott Geng et al.
8.0
9

COLM 2024 Poster
Pengda Wang et al.
8.0
29

COLM 2025 Poster
Huiqi Zou et al.
8.0
11

The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the data compliance gap (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.
COLM 2025 Poster
Dongyang Fan et al.
8.0
11

COLM 2025 Poster
Nathan Lambert et al.
8.0
13

COLM 2025 Poster
Zichen Liu et al.
7.8
16

COLM 2024 Poster
Jiageng Mao et al.
7.8
11

COLM 2024 Poster
Jeffrey Cheng et al.
7.8
14

COLM 2025 Poster
Ben Lipkin et al.
7.8
10

COLM 2024 Poster
Albert Gu et al.
7.8
12

COLM 2024 Poster
Chunqiu Steven Xia et al.
