PaperHub
Average Rating: 5.8 / 10
Decision: Rejected · 5 reviewers
Ratings: 6, 3, 6, 6, 8 (min 3, max 8, std 1.6)
Confidence: 3.0
Correctness: 2.8
Contribution: 2.6
Presentation: 2.8
ICLR 2025

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

Keywords
Synthetic Data Pre-training, Large Language Models

Reviews and Discussion

Review (Rating: 6)

The paper investigates the importance of diversity in synthetic data used for training large language models (LLMs). This study introduces a novel metric called LLM Cluster-agent, an LLM-based agent to evaluate the diversity of synthetic datasets. The authors analyze how diversity affects model performance during both the pre-training and fine-tuning phases. Through experiments involving models of 350M and 1.4B parameters, the authors demonstrate that greater diversity in synthetic data generally correlates positively with improved performance, especially during fine-tuning. They also explore methods for generating synthetic data using various Topic-* prompt templates to increase the diversity of the generated synthetic data. Finally, they also study the dynamics of synthetic tokens in pre-training corpora and the model performance metrics. The performance improvement also correlates with the value of the diversity score reported by the LLM agent.

Strengths

  1. The authors propose a novel approach, LLM Cluster-agent, to quantify the diversity of text datasets, specifically tailored for synthetic data, which is often more challenging to evaluate.
  2. The authors carry out systematic controlled experiments across various factors, such as topics, prompt styles, generation models, and real-to-synthetic token ratios, providing insights into optimal practices for synthetic data generation.
  3. The findings offer practical guidance on balancing real and synthetic data and selecting synthetic generation models to enhance performance, which could be valuable for researchers and developers of LLMs.
  4. From the ablation experiments, the clustering performance drops with low K and high N values, indicating potential scalability and applications to compute diversity in real-world scenarios.

Weaknesses

  1. The primary drawback I observe is the separation between the generation and diversity measurement processes. According to the pipeline, diversity is only measured after data generation is completed. For example, if 4.5B tokens are generated and the resulting LLM diversity score is low, this would make the generation effort ineffective. Did the authors consider iterative diversity checks during the generation process, for example, implementing intermediate diversity assessments at smaller generation intervals (e.g., every 100M tokens)? This would allow them to adjust or halt generation if diversity metrics fall below a desired threshold, optimizing the process early on and reducing waste in cases of low-diversity outcomes.
  2. Another significant limitation of this approach is its heavy reliance on large pre-trained LLMs to evaluate various criteria. How do the authors guarantee the stability of the metrics and metadata generated? Was any human evaluation performed to verify the overall reliability of the pipeline? The results could be more reliable if the authors conducted inter-rater reliability tests by comparing diversity metrics generated by multiple LLMs or by aligning LLM-generated metrics with human evaluations. A smaller human-annotated sample could be used to validate the model-generated diversity scores, as this would enhance the overall reliability of the pipeline.
  3. In the cluster generation step, the clustering process is localized to the number of samples that can be fit in the context window of the LLM. This can lead to random and unpredictable cluster formations leading to unpredictable diversity scores. Did the authors try any approaches that can mitigate the issue?

Questions

Questions:

  1. How is GPT-4o prompted to extract a hierarchy of topics and a set of keywords covered in the page's content?
  2. In the Synthetic Data Generations section, it is unclear how the downweighting of Cosmopedia v0.1 and Cosmopedia v0.2 was done.
  3. There is a brief explanation, but it is still unclear why topic-style persona almost always outperforms multi-topic-styles persona. Intuitively, the latter is supposed to provide more diversity.
  4. What would be the average price for carrying out such an experiment due to the presence of multiple proprietary LLM calls?
  5. From the ablation experiments, it looks like when K = 10 and N = 10K, the diversity score stabilizes. Do the authors report all their experiments with this configuration?
  6. Would be interested to know the comparison with the following method, say KMeansDiversityScore, where we identify C_i clusters using K-means on embeddings of samples and get cluster sizes S_i from K randomly selected samples. Repeat this experiment N times and then compute the diversity score using the same formula. This experiment can provide a good comparison of whether the metric and metadata generation followed by the metadata gathering step has a positive effect in computing good diversity scores.
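
A minimal sketch of the baseline described in this question, assuming an off-the-shelf sentence-embedding model and scikit-learn's K-means; the embedding model, the parameter defaults, and the use of "distinct clusters covered per K-sized draw" as the score are illustrative assumptions rather than the paper's exact formula:

```python
# Hypothetical KMeansDiversityScore baseline: cluster sample embeddings with
# K-means, then repeatedly draw K random samples and count how many distinct
# clusters they cover.
import random
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def kmeans_diversity_score(texts, n_clusters=100, K=10, N=5000, seed=0):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
    embeddings = embedder.encode(texts, batch_size=64)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)

    rng = random.Random(seed)
    per_draw = []
    for _ in range(N):
        idx = rng.sample(range(len(texts)), K)           # K randomly selected samples
        per_draw.append(len({labels[i] for i in idx}))   # distinct clusters in the draw
    return float(np.mean(per_draw)), float(np.std(per_draw))
```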

Suggestions:

  1. Typo at line 093: "per-training" should be "pre-training".
  2. Missing citation ?? at line 1055

Ethics Concerns

N/A

Comment

In the cluster generation step, the clustering process is localized to the number of samples that can be fit in the context window of the LLM. This can lead to random and unpredictable cluster formations leading to unpredictable diversity scores. Did the authors try any approaches that can mitigate the issue?

We acknowledge that the clustering process is limited by the context window of the LLM, which can constrain the number of samples processed at a time and potentially introduce variability in cluster formation. To mitigate this, our method employs multiple iterations of clustering using K-sized random samples, ensuring that the diversity score reflects a comprehensive representation of the dataset rather than being reliant on any single iteration. This iterative approach allows the model to capture shared patterns and characteristics across samples, leading to more stable and robust cluster formations.

Additionally, the self-verification module acts as a safeguard by identifying and filtering out invalid or inconsistent clusters, which further enhances the reliability of the clustering process. As demonstrated in our results, a significant proportion of clusters flagged by the LLM were corroborated as invalid through human evaluation, confirming the effectiveness of this step.

While the clustering process does not explicitly enforce uniformity across iterations, the robustness of the LLM Cluster score is supported by our ablation studies, which show consistent results across varying K values and repeated runs. This consistency indicates that the iterative pipeline compensates for context window limitations, producing reliable diversity measurements without compromising on computational efficiency.
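
For concreteness, a minimal sketch of the iterative procedure described above; `cluster_fn`, `verify_fn`, and `score_fn` are stand-ins for the actual clustering prompt, self-verification prompt, and per-batch scoring (none of which are reproduced here), so only the sampling and aggregation logic is shown:

```python
# Sketch of the iterative K-sample clustering loop described above; the LLM
# prompts themselves are abstracted behind the callables passed in.
import random
import statistics

def iterative_cluster_score(texts, cluster_fn, verify_fn, score_fn, K=10, N=5000, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(N):
        batch = rng.sample(texts, K)                    # K docs fit within one context window
        clusters = cluster_fn(batch)                    # LLM groups the batch into clusters
        valid = [c for c in clusters if verify_fn(c)]   # self-verification filters invalid ones
        scores.append(score_fn(valid, K))               # per-batch cluster score
    # Averaging over N independent draws is what lets the final score reflect
    # the whole dataset rather than any single context window.
    return statistics.mean(scores), statistics.stdev(scores)
```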

How is GPT-4o prompted to extract a hierarchy of topics and a set of keywords covered in the page's content?

We use an additional prompt for GPT-4o, feeding it the content of the webpage and extracting the relevant hierarchical topics and keywords. We also specify the expected format of the hierarchy in the prompt, with examples.

In the Synthetic Data Generations section, it is unclear how the downweighting of Cosmopedia v0.1 and Cosmopedia v0.2 was done.

We adjust the sampling weights of Cosmopedia v0.1 and v0.2 to make them effectively 20B tokens. This is achieved via the data sampler of our implemented data loader.

There is a brief explanation, but it is still unclear why topic-style persona almost always outperforms multi-topic-styles persona. Intuitively, the latter is supposed to provide more diversity.

The multi-topic-styles persona prompt may introduce redundancy in the synthetic data generation. The redundancy may exist not only within each generation but also across generations, since the multiple topic candidates can overlap.

What would be the average price for carrying out such an experiment due to the presence of multiple proprietary LLM calls?

The average price for each call of the pipeline to obtain the cluster score would be around $0.05.

From the ablation experiments, it looks like when K = 10 and N = 10K, the diversity score stabilizes. Do the authors report all their experiments with this configuration?

All main results are reported with K = 10 and N = 5K. Running for larger N would make the results more robust but also more expensive.

Would be interested to know the comparison between the following method say KMeansDiversityScore, where we identify...gathering step has a positive effect in computing good diversity scores.

Thanks for this great advice. We made some initial attempts at this method, i.e., using K-means to identify C clusters and getting the initial cluster size from K randomly sampled data before input to our LLM cluster pipeline (without metadata and metrics). The results are comparable to our pipeline. We believe diversity is indeed a multi-dimensional factor, and combining several metrics together would make the diversity measurement more robust.

Pipeline | Cluster Score
LLM Cluster | 3.99
K-Means + LLM Cluster (w/o metadata/metric) | 3.92

Typo at line 093: "per-training" should be "pre-training".

Thanks for pointing out this typo. We have fixed it in our revised paper.

Missing citation ?? at line 1055

Thanks for catching this error! We have fixed it in our revised paper.


If you find that our above response addresses your concerns well, please consider raising the score. If you have further questions, please do not hesitate to let us know.

Comment

We thank the reviewer for their time and effort in giving suggestions for revising our paper. We now address the concerns as follows.


The primary drawback I observe...

It is indeed important to incorporate iterative diversity checks during the generation process to ensure higher-quality outputs. While the current pipeline measures diversity post-generation, intermediate assessments, such as evaluating diversity metrics at smaller intervals (e.g., every 100M tokens), could help identify low-quality data early, allowing adjustments to the generation process accordingly. This approach was not implemented in the current work due to cost overhead and the need to maintain consistent experimental settings. However, it is a promising direction for future work. We plan to explore this refinement in follow-up studies to enhance the efficiency and reliability of synthetic data generation, and also the method to utilize this metric to create more diverse synthetic data on the fly.
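
As a purely illustrative sketch of this direction (not something implemented in the paper), a generation loop with intermediate checks might look like the following, where `generate_chunk` and `diversity_score` are placeholder callables and the 100M-token interval and acceptance threshold are assumptions:

```python
# Illustrative only: periodic diversity checks during synthetic data generation,
# halting early if the score drops below a chosen threshold.
def generate_with_checks(generate_chunk, diversity_score,
                         target_tokens=4_500_000_000,    # e.g. a 4.5B-token target
                         check_every=100_000_000,        # e.g. check every 100M tokens
                         min_score=4.0):                 # assumed acceptance threshold
    corpus, produced = [], 0
    while produced < target_tokens:
        corpus.extend(generate_chunk(check_every))       # generate the next chunk
        produced += check_every
        if diversity_score(corpus) < min_score:          # intermediate diversity check
            break                                        # halt (or adjust prompts) early
    return corpus, produced
```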

Another significant limitation of this approach is its heavy reliance on large pre-trained LLMs to evaluate various criteria.

To ensure the stability and reliability of the generated metrics and metadata, we have implemented several measures. The metadata and metrics generation step is repeated M times, with different J-sized samples randomly selected in each iteration. This iterative approach ensures that the results are robust and not biased toward specific subsets of the dataset. Additionally, all experimental results include standard deviations coming from multiple runs with different seeds, demonstrating the reproducibility and internal stability of the proposed pipeline.
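
A minimal sketch of this M-repetition scheme; `propose_fn` stands in for the metadata/metric generation prompt, and reducing the gathering step to frequency ranking of the proposed attributes is a simplifying assumption (the paper additionally lets the LLM refine the gathered results):

```python
# Sketch of repeated metadata/metric proposals over random J-sized samples;
# the gathering step is simplified here to keeping the most frequent attributes.
import random
from collections import Counter

def gather_metadata_and_metrics(texts, propose_fn, J=5, M=100, top_k=3, seed=0):
    rng = random.Random(seed)
    metadata_counts, metric_counts = Counter(), Counter()
    for _ in range(M):
        batch = rng.sample(texts, J)              # J randomly selected samples per round
        metadata, metrics = propose_fn(batch)     # LLM proposes candidate attributes
        metadata_counts.update(metadata)
        metric_counts.update(metrics)
    return ([name for name, _ in metadata_counts.most_common(top_k)],
            [name for name, _ in metric_counts.most_common(top_k)])
```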

In the cluster verification stage, filtered clusters identified by the self-verification module were compared with human evaluations. The results, shown below, demonstrate a strong agreement between LLM-based filtering and human-verified invalid clusters.

Topic/#Samples | Clusters | Self-verified Invalid Clusters (%) | Human-verified Invalid Clusters (%)
100/10 | 12943 | 248 (1.91) | 221 (1.70)
100/20 | 15216 | 350 (2.70) | 329 (2.54)

This agreement between our framework and human-evaluation confirms the robustness of the self-verification step.

In addition, we conducted a human study on synthetic data generated using different prompts. Ten evaluators rated the diversity of 50 samples from six synthetic data variants on a scale of 1–6. The results showed strong consistency between human diversity scores and the LLM-based diversity metrics. For instance:

Synthetic Data | LLM Cluster Score | Human Diversity Score
Cosmopedia v0.1 | 4.7 ± 0.2 | 3.6 ± 0.8
Cosmopedia v0.2 | 3.7 ± 0.2 | 2.4 ± 0.9
Topic | 4.2 ± 0.3 | 2.3 ± 1.0
Topic Styles | 5.3 ± 0.2 | 4.8 ± 0.7
Topic Styles Persona | 6.8 ± 0.3 | 5.2 ± 0.4
Multi-Topic Styles Persona | 6.2 ± 0.3 | 4.5 ± 0.7

The strong correlation (r = 0.91, p = 0.011) demonstrates a statistically significant positive relationship between the Human Diversity Score and the LLM Cluster Score, supporting the validity of our metric.
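
For reference, the reported statistic can be reproduced from the means in the table above with scipy (values copied from the table; a minimal check, not part of the authors' pipeline):

```python
# Recomputing the reported Pearson correlation from the table above.
from scipy.stats import pearsonr

llm_cluster  = [4.7, 3.7, 4.2, 5.3, 6.8, 6.2]   # LLM Cluster Score (means)
human_scores = [3.6, 2.4, 2.3, 4.8, 5.2, 4.5]   # Human Diversity Score (means)

r, p = pearsonr(llm_cluster, human_scores)
print(f"r = {r:.2f}, p = {p:.3f}")              # r = 0.91, p = 0.011
```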

These findings provide further confidence in the alignment between human judgments and the diversity captured by our pipeline.

To address the concern about inter-rater reliability, we conducted experiments using different LLMs for metadata and metric generation. The comparison of self-verification results with various models is shown below:

Self-Verification Model | Invalid Clusters | Cluster Score
GPT-4o | 248 | 3.99
GPT-4 | 254 | 4.03
GPT-3.5 | 218 | 3.81
Llama-3.1 | 192 | 3.65

The results demonstrate that more capable models, such as GPT-4, are better at identifying invalid clusters, and even smaller open-source models like Llama-3.1 show reasonable performance, albeit with a slight drop in reliability. These findings underscore that the pipeline is adaptable to different models based on cost and resource constraints.

We believe that the combination of human evaluation, iterative verification, and inter-rater reliability tests validates the robustness of the proposed pipeline. The agreement between human judgments and LLM-generated metrics provides additional reliability for our pipeline.

Comment

We sincerely appreciate the time and effort you have dedicated to reviewing our submission. We would be happy to provide further clarification if you have any additional questions or concerns.

-- Authors

Comment

Thanks for addressing my concerns. The responses have improved the soundness of the approach. I have increased the soundness score.

Comment

We sincerely appreciate the time and effort you have dedicated to reviewing our work and engaging in constructive discussion. Should you have any further questions or require additional clarification, we would be happy to provide it. Thank you once again.

-- Authors

Review (Rating: 3)

This work proposes a framework to understand the efficacy of synthetic data, with a focus on quantifying the extent and effect of diversity in synthetically-generated data. To this end, a novel model-based metric, LLM Cluster score, is proposed. Based on a subsample of the data, attributes are identified by an LLM that can uniquely characterize individual samples and cluster similar ones. A random sample is taken and the data points within it are clustered. Based on the results of these clustering iterations, the diversity metric is calculated. This diversity metric is shown to be positively correlated with model accuracy metrics, demonstrating that [1] this metric is a good measure of diversity, and [2] more diverse data results in better model performance.

Strengths

  • This work proposes a novel, model-based metric
  • The methodology allows for limitations of model context lengths
  • Several insightful analyses surrounding accuracy, diversity, and token balance are presented.

Weaknesses

  • The number of samples is a cause for concern. (J=5, K=10)

-- J=5 : does this imply that the model uses 5 samples to infer the metadata and metrics that characterize the entire distribution? Given the size of the data and the number of topics that may be present (100K/300K), it merits more discussion on whether J samples can represent the data sufficiently.

-- K=10 : Having obtained metadata and metrics, as well as a prompt that grades samples according to the rubric defined, lines 192-194 suggest that only K samples out of the dataset are then assigned to clusters by the LLM. Can a cluster score over 10 samples definitely yield an idea of the diversity of the dataset? This could benefit from more supporting evidence.

In light of this, conducting sensitivity analyses, and extending the ablation studies in B.2 to higher values of K and a range of values for J would mitigate this concern. Furthermore, including other clustering metrics in the ablation study beyond the novel LLM cluster metric would help corroborate these values more strongly.

  • The self-verification module could benefit from clarification along the following lines:

-- Was there any human evaluation performed in order to verify the outcomes of the self-verification module, and in order to find the optimal verification prompt?

-- Using the same model to verify its own reasoning chain and clustering judgements may lead to a positive bias; the verification module would benefit from utilizing an LLM that is not part of the earlier steps of cluster score generation pipeline.

  • There is a lack of clarity on the difference between the "metadata" and "metric" axes. How are these factors differentiated? Based on the sample outputs in Appendix D.2, there is some overlap between the nature of the "metadata" and "metric" keys, in terms of quantifiability and subjectivity. This lack of distinction is made stronger by the prompt templates in Appendix D.1, where the templates for metrics and metadata generation are nearly identical. It would be helpful to delineate [1] how these terms are different in theory, [2] in practice, how the outcomes of their respective prompts differ, and [3] whether this difference is essential to maintain, as opposed to a single prompt that generates factors that could be either "metrics" or "metadata".

  • Additionally, some of the "metadata" and "metrics" are ill-defined, with only qualitative judgements associated with them, and that too only for the values on the extreme ends of the 1-5 scoring scale (see Appendix D.2). This raises a concern: LLMs often show a lack of consistency in their outputs, even for the same input. This is an even greater concern when LLMs are used to confer subjective evaluations on the input. The diversity metric proposed in this work may be non-reproducible, especially over different samples of the same dataset. The inconsistency could be mitigated by including a more detailed, unambiguous rubric. The results would also be strengthened by running multiple trials with various seeds for random sampling of the data.

  • Figures 5 and 6 show that the inter-cluster variance achieved by using LLM Cluster score is higher than that achieved by other methods, thus demonstrating its efficacy in segmenting a dataset with 100K-300K topics. However, owing to the non-uniform scaling of the X-axis and the non-comparable nature of the range of values ([24,24.2] for Perplexity, [192,100] for K-Means, [3.5,5] for LLM Cluster score), it is not possible to definitively determine that a higher cluster score is more correlated with diversity than a higher measure of any other diversity metric. Contextualizing the comparative performance of these scores and their usefulness in 3.2 onwards, along with presenting standardized results in the visualizations would help better understand the benefits of LLM cluster score over other metrics.

Questions

  • Lines 185-187: across all M iterations, what is the nature of the sampled J instances, are these samples exclusive of one another? Or is the sampling entirely random and from the whole distribution, which means that there may be common elements in different J-sized samples across the M iterations?
  • The work also states that the clustering is performed N times. Does the K-sized sample change over each iteration out of N? If so, are the K-sized samples exclusive of one another, or are they randomly sampled each time from the entire dataset, thus allowing for common elements?
  • Instead of clustering only K samples, which may not yield a diversity score representative of the entire dataset, it may be more fruitful to cluster all elements in the dataset, taking K samples at a time, and following an iterative approach akin to K-Means to merge existing clusters and obtain a more comprehensive idea of the clusters produced by the LLM Cluster-agent over the dataset.
  • (Lines 186-187) Documenting the intuition behind using a metadata and metric gathering prompt (that summarizes the most frequent subset of metadata and metrics) instead of scoring the samples individually over every metadata and metric generated, would help convey the rationale of the approach better.
Comment

We thank the reviewer for their time and effort in reviewing our work, and for the constructive feedback on refining our manuscript. Please find our response below.

The number of samples is a cause for concern. (J=5, K=10)

The number of samples is mainly decided by our ablation study and earlier iterations of the prompt. We experiment with multiple values of J = (3, 5, 10, 15, 30, 50) and K = (5, 10, 15, 20, 50, 100). With different values of J in a proper range, we find that the metadata and metrics extracted from the data by our LLM agent overlap and remain specific, as shown in B.2. Setting J and K to extremely high values like 50 and 100 leads to performance degradation due to the long context length (when K = 50, the context length is roughly 50K): the metadata and metrics become more general and the clusters degenerate to certain values, as in the results updated here.

J | M | Top-3 Metadata | Top-3 Metric
10 | 100 | Disciplinary Focus, Conceptual Density, Terminology Density | Interdisciplinary Integration, Information Density, Lexical Diversity
15 | 100 | Disciplinary Focus, Text Complexity, Narrative Style | Interdisciplinary Integration, Conceptual Density, Lexical Diversity
30 | 100 | Discipline Focus, Text Complexity, Textual Cohesion | Interdisciplinary Integration, Novelty Index, Lexical Diversity
50 | 100 | Interdisciplinary Relevance, Domain Specificity, Sample Source Origin | Jargon Richness, Informativeness, Audience Breadth

K | Score
20 | 3.13 ± 0.46
50 | 2.05 ± 0.83
100 | 1.49 ± 1.02

We have updated the results with higher J and K values in the revised Appendix of our paper.

Can a cluster score over 10 samples definitely yield an idea of the diversity of the dataset? This could benefit from more supporting evidence.

We need to highlight that the model does not rely on only J and K samples to capture the distribution. For the metadata and metrics gathering, we run for M rounds and let the LLM refine the final metadata and metrics from the gathered and ranked results. For the clustering, we also run for sufficient rounds to make sure the cluster score robustly reflects the distribution (K = 10 and K = 15 give similar results). All results are obtained from multiple runs of the proposed LLM Cluster pipeline and the baseline metrics with different random seeds.

Other clustering metrics in Ablation tables

We also provide the ablation study of perplexity with different models and of the K-means cluster score with different numbers of clusters here.

Topics/#Samples | Perplexity (GPT-2) | Perplexity (GPT-2)
Topics 100/10 | 24.03 (0.01) | 15.15 (0.03)
Topics 100/20 | 24.09 (0.01) | 15.29 (0.04)
Topics 100/30 | 24.14 (0.01) | 15.31 (0.03)

Topics/#Samples | K-means (N=1000) | K-means (N=5000) | K-means (N=10000)
Topics 100/10 | 998.78 (0.12) | 453.21 (0.43) | 197.73 (0.64)
Topics 100/20 | 998.23 (0.20) | 453.19 (0.37) | 196.98 (0.19)
Topics 100/30 | 998.23 (0.19) | 453.15 (0.41) | 197.31 (0.05)

From the results, perplexity presents a similar trend as in the main paper regardless of the underlying distribution, and K-means is more sensitive to the hyperparameter setting of the number of clusters.

-- Was there any human evaluation performed in order to verify the outcomes of the self-verification module, and in order to find the optimal verification prompt?

The self-verification module and earlier iterations of the verification prompt are indeed human-verified. Here we provide a more formal human evaluation of the filtered invalid clusters.

Topic/#Samples | Clusters | Self-verified Invalid Clusters | Human-verified Invalid Clusters
100/10 | 12943 | 248 | 221
100/20 | 15216 | 350 | 329

From the results, one can observe that a large proportion of the filtered clusters are indeed invalid.

-- Using the same model to verify its own reasoning chain and clustering judgements may lead to a positive bias; the verification module would benefit from utilizing an LLM that is not part of the earlier steps of cluster score generation pipeline.

Thanks for this great advice. In Appendix B.2, we indeed provide the clustering results of using different LLMs for the entire pipeline. Here, we provide an additional ablation study of using different LLMs for the self-verification module.

Self-Verification Model | Invalid Clusters | Cluster Score
GPT-4o | 248 | 3.99
GPT-4 | 254 | 4.03
GPT-3.5 | 218 | 3.81
Llama-3.1 | 192 | 3.65

We can observe that GPT-4 presents better verification, and the smaller models are less capable of discriminating the invalid clusters. The trend of these results is similar to using different models for the entire pipeline.

These results are included in our updated Appendix.

Comment

There is a lack of clarity on the difference between the "metadata" and "metric" axes.

We define "metadata" as attributes or properties of the dataset (e.g., disciplinary focus or text style). It is primarily qualitative, serving as descriptive labels, while "metrics" quantifies aspects of the dataset, offering numerical or score-based evaluations (e.g., conceptual density or perplexity).

For generation, "metadata" and "metrics" prompts are designed to guide the clustering process. However, their objectives differ -- while "metadata" prompts emphasize defining the attributes that support the clustering criteria, "metrics" prompts prioritize scoring or quantifying specific attributes to enable reasoning about clusters. Outcomes differ as metadata outlines qualitative characteristics of the data, while metrics provide numerical diversity scores. We keep them separate to ensure clarity in clustering by decoupling descriptive and evaluative aspects. Also note that, while the prompts for generating metadata and metrics are similar, GPT-4o is further prompted to gather the metadata and metrics using a single prompt, where it indeed produces discriminative results from the qualitative and quantitative perspectives. This can be verified by the example of the final prompt in our appendix. From our earlier iterations, we found that either merging these two together or not using a gathering prompt for them could dilute the interpretability of clusters and thus hurt downstream analysis.

Here, we provide an additional ablation to support the above discussion. We compare separate prompts for metadata and metrics with a single prompt for both metadata and metrics.

Prompt | Metadata | Metric
Separate Metadata/Metric + Gathering | Disciplinary Focus, Text Complexity, Narrative Style | Interdisciplinary Integration, Conceptual Density, Lexical Diversity
Single Metadata/Metric + Gathering | Concept Density, Context Scope, Document Type | Terminology Usage, Conceptual Clarity, Clarity of Explanation

Additionally, some of the "metadata" and "metrics" are ill-defined

We recognize the concern regarding LLMs' lack of consistency in generating subjective evaluations for the same input. However, measuring the diversity of a large-scale corpus is a very challenging problem. To ensure consistency, our results already incorporate bootstrapping over multiple random seeds so that they are internally robust and reproducible across different samples (as reflected in the error bars of the results). In addition, our proposed diversity metric already includes M repetitions during metadata and metrics generation to address variability. The detailed prompt templates and examples are also provided to ensure reproducibility.

However, owing to the non-uniform scaling of the X-axis...

Thanks for acknowledging the efficacy of the LLM Cluster metric in segmenting the underlying distribution. We need to highlight that the absolute values of different metrics may not be directly comparable, and diversity is better discriminated by the relative values of each metric. The non-uniform scaling of the axes is mainly for visualization purposes; otherwise we would not observe any difference in the baseline metrics. But we recognize that the non-uniform scaling of axes in these figures may hinder interpretability. To address this:

  1. We will include Figures 5 and 6 with uniform X-axes across metrics in Appendix, ensuring better visualization.

  2. Section 3.2 has been updated to explicitly discuss the relative efficacy of LLM Cluster-agent versus other metrics like perplexity and K-means, in terms of the relative values of different metrics. Our results already show that traditional metrics often fail to capture nuanced diversity, as evident in Section 3.3.

Are the J-sized samples across M iterations exclusive, or is the sampling random, allowing for overlap between iterations?

The J-sized samples are randomly sampled from the entire dataset in each iteration, so they can overlap across different iterations. This random sampling ensures that the metadata and metrics generated are not biased towards any particular subset of the dataset. Random sampling from the full distribution allows for a more representative understanding of the dataset's diversity. Overlapping samples can capture shared patterns or characteristics that are critical for accurate metadata and metrics generation.

Do the K-sized samples...the N iterations of clustering? Are these samples exclusive, or are they randomly sampled each time, potentially overlapping?

Similar to the J-sized samples, the K-sized samples are also randomly sampled from the entire dataset during each of the N iterations. Note that this random sampling approach allows the model to capture the true underlying distribution with sufficiently large N. Making the samples exclusive would harm the clustering results with the K values we used.

Comment

Instead of clustering only K samples, would it be more effective to cluster all elements in the dataset, using an iterative approach akin to K-Means?

We find that clustering K samples at a time is computationally efficient and feasible in terms of cost for large datasets, especially given the LLM's context length limitations. Aggregating results over N iterations ensures that the diversity score reflects the dataset's overall characteristics. Clustering the entire dataset using an iterative approach (like K-means) may indeed provide a more comprehensive view, but it is computationally expensive and requires significant resources, especially for datasets with 100K–300K topics. Moreover, it may not leverage the semantic and contextual reasoning capabilities of LLMs effectively within their operational constraints, and it would make the metric very sensitive to the iterative process (as we observed from the K-means results). Another way to measure diversity from the cluster scores is to build a graph connecting all similar clusters, which may also be computationally expensive, but more feasible in terms of cost. Our current results show that the current pipeline can indeed capture diversity better than the baseline metrics, and improvements on it are left for future work.

What is the rationale for using a metadata and metric gathering prompt instead of scoring every sample individually for all metadata and metrics?

  1. Scoring every sample individually for all metadata and metrics would be cost-prohibitive, especially for large context lengths.
  2. The metadata and metric gathering prompt summarizes the most frequent and salient attributes, capturing the dataset's essential diversity without exhaustive computation.
  3. The LLM Cluster-agent categorizes each cluster based on the common metadata and metrics in the cluster and scores them, which also reflects individual scoring.
  4. This approach aligns with the objective of extracting a manageable set of attributes that are representative of the dataset while avoiding redundancy in scoring.

If you find that the above response resolves your concerns, please consider raising the score to better support our work.

Comment

We sincerely appreciate the time and effort you have dedicated to reviewing our submission. We would be happy to provide further clarification if you have any additional questions or concerns.

-- Authors

Comment

Thank you for your detailed response. 

  • The number of samples is still a concern - the sample size variations in the authors' response do not seem to reflect a sufficiently high range of values to determine that the parameter values are optimal. Additionally, the perplexity test as well is not entirely convincing, given that 10/20/30 are close within the same order of magnitude.
  • Thank you for providing insight into human verification, as well as including the ablation study over different LLMs for verification.
  • Despite the authors' clarification, in the opinion of this reviewer, metadata and metric are still confusingly similar - while the intention to separate quantitative and qualitative measures is clear, this does not seem to be manifesting in practice. For example: how are text complexity, a metadata, and conceptual density, a metric, qualitative and quantitative respectively? How is interdisciplinary integration quantifiable?
  • This reviewer's concern regarding lack of consistency in how an LLM categorizes each cluster based on the same metadata and metrics still stands. Given that this is a cornerstone of the approach, this may be a significant flaw. If this concern is addressed more convincingly, either through quantification of scores for a set of clusters over different runs, or some qualitative insights, it would go a long way towards justifying an improvement in the review score.
  • Thank you for updating the diagrams, this change would likely help interpretability and comparison significantly.

In light of improved clarity in communication re: the diagrams and ablation studies to better support some of the insights, the presentation and soundness scores have been updated.

Comment

We appreciate your engaging in the discussion, and we would like to provide further clarification regarding the remaining concerns.

The number of samples is still a concern - the sample size variations in the authors' response do not seem to reflect a sufficiently high range of values to determine that the parameter values are optimal. Additionally, the perplexity test as well is not entirely convincing, given that 10/20/30 are close within the same order of magnitude.

We appreciate the concern about sample size variations and their adequacy for determining optimal parameter values. As detailed in our response and supported by the findings in the paper, the choice of N = 5000 provides a reliable trade-off between cost efficiency and accurate approximation of diversity metrics. Larger values of N could reduce variance further; however, they would increase computational costs significantly, which is not always feasible for many practitioners.

Our experimental results demonstrate that N = 5000 is sufficient to capture the nuances of diversity without compromising accuracy. This choice aligns with our observations that larger sample sizes yield diminishing returns in terms of diversity approximation, as shown in the controlled experiments reported in the paper. While we agree that broader ranges could provide additional insights, our setup balances practicality with scientific rigor, making the approach accessible and scalable for diverse research needs.

Despite the authors' clarification, in the opinion of this reviewer, metadata and metric are still confusingly similar - while the intention to separate quantitative and qualitative measures is clear, this does not seem to be manifesting in practice. For example: how are text complexity, a metadata, and conceptual density, a metric, qualitative and quantitative respectively? How is interdisciplinary integration quantifiable?

We understand your concerns regarding the distinction between metadata and metrics, and we would like to clarify further. The separation between metadata and metrics is intended to reflect qualitative and quantitative dimensions, respectively, but we acknowledge that some overlap may appear due to the inherent complexity of certain attributes.

For example, text complexity, categorized as metadata, refers to descriptive characteristics of the dataset (e.g., readability, sentence structure, or linguistic style) that are primarily qualitative and serve as contextual labels. On the other hand, conceptual density, a metric, is a quantifiable measure that evaluates the density of concepts or ideas within a dataset, often using numeric scores derived from embeddings or lexical analysis. Similarly, interdisciplinary integration, while it may seem abstract, is quantifiable when using scoring systems that assess the breadth and diversity of domain-specific terminology or topics present within the data.

The quantifiability of certain metrics arises from the LLM's ability to aggregate and evaluate underlying patterns across samples based on predefined prompt criteria. While metadata provides descriptive context to support interpretability, metrics serve as evaluative scores to enable downstream analysis. We also note that these categorizations align with the goals of clustering, where metadata helps label clusters while metrics assess their significance and separation.

To further address concerns about overlap, we will include additional examples and detailed descriptions in the appendix to clarify how these distinctions manifest in practice. Additionally, we plan to expand on the prompts used for metadata and metrics generation to demonstrate their different focuses more explicitly. This should improve the interpretability of these terms and better align with the expectations of clarity and utility in our approach.

Comment

This reviewer's concern regarding lack of consistency in how an LLM categorizes each cluster based on the same metadata and metrics still stands. Given that this is a cornerstone of the approach, this may be a significant flaw. If this concern is addressed more convincingly, either through quantification of scores for a set of clusters over different runs, or some qualitative insights, it would go a long way towards justifying an improvement in the review score.

The consistency is critical for the reliability of our approach, as it forms a cornerstone of the proposed method. To ensure robustness, we have incorporated measures to mitigate variability in cluster categorization. Specifically, our pipeline involves multiple iterations for clustering and metadata/metric generation, using different random seeds and subsets of data in each run. This iterative approach reduces the influence of outliers and provides stability in cluster assignments. The reported results include standard deviations to quantify the variance across runs, and these deviations are consistently low, indicating a high degree of reproducibility.

To address concerns about consistency, we have also analyzed the overlap of scores and cluster memberships across repeated runs for the same dataset. Statistical measures, such as standard deviations of cluster scores, further confirm the stability of our method. Additionally, qualitative examples in the appendix illustrate how the same metadata and metrics guide consistent cluster formation across runs. These examples demonstrate how LLMs, leveraging the same prompts, align their categorizations with underlying data patterns despite inherent stochasticity.

We believe these measures and analyses sufficiently address concerns regarding the consistency of the LLM-cluster approach and reinforce its reliability. We remain committed to providing clear and robust evidence supporting the validity of our method.

We hope these steps will address your concerns more convincingly and reinforce the reliability of the LLM-cluster approach. We are committed to refining our explanations and analyses to ensure confidence in the consistency and applicability of our method.

--

Should you have any further questions, we would be happy to provide additional clarifications.

Comment

We sincerely appreciate your continued engagement and constructive feedback on our work. Below, we address the remaining concerns and provide additional evidence to showcase the robustness and clarity of our proposed method.


Consistency of the Diversity Metric

To further validate the consistency of our diversity metric, we conducted additional ablation studies with larger values of N, as outlined below:

N | Score (Mean ± Std)
1,000 | 3.71 ± 0.25
5,000 | 3.99 ± 0.05
10,000 | 4.02 ± 0.03
15,000 | 4.01 ± 0.02
20,000 | 4.01 ± 0.02

These results demonstrate that our method produces robust and consistent diversity measurements across varying sample sizes. Notably, from N = 5,000 onward, the results exhibit minimal variance, as indicated by the small standard deviations. This stability demonstrates that the chosen N = 5,000 achieves a reliable trade-off between computational feasibility and measurement robustness. We believe this extended analysis effectively addresses concerns about the method's consistency.


Clarification on Metadata and Metrics

We understand the remaining concern regarding the potential overlap between metadata and metrics. While there might be an intersection in their descriptive attributes, these components serve different purposes in our framework:

  • Metadata: Focused on categorizing and characterizing textual attributes (e.g., disciplinary focus, text style). This primarily aids in qualitative understanding and cluster formation.
  • Metrics: Designed to measure and score the quality or quantitative characteristics of the categorizations (e.g., conceptual density, perplexity). This evaluates the clusters and informs diversity scoring.

To further illustrate the necessity of maintaining both components, we conducted an ablation study comparing the use of only metadata, only metrics, and both:

Gathering Approach | Score (Mean ± Std)
Only Metadata | 3.41 ± 0.62
Only Metrics | 3.68 ± 0.44
Both (Our Method) | 3.99 ± 0.05

The results demonstrate that using either metadata or metrics alone yields less robust and less accurate diversity measurements. The combined use of both components, as employed in our pipeline, ensures the most reliable and consistent results. While metadata provides qualitative structure, metrics quantitatively validate these structures, leading to improved diversity measurement.

To further clarify the distinction between metadata and metrics, we will include detailed examples and additional explanations in the appendix. These will illustrate how metadata and metrics are generated, their distinct focuses, and why their combination is essential.


We hope these extended analyses and clarifications effectively address your remaining concerns. To enhance transparency and reinforce the reliability of our method, we will include the additional results and examples in the appendix of the revised manuscript.

We sincerely appreciate your valuable feedback and are happy to provide further clarifications if needed. As the deadline approaches, we kindly request you to review our response and let us know if you have any additional concerns. If our response has satisfactorily addressed your points, we would greatly appreciate an update to the scores to reflect the same. Thank you for your time and thoughtful review.

Review (Rating: 6)

The main idea presented in the document is studying the impact of diversity in synthetic data on the performance of large language models (LLMs) during pre-training and fine-tuning. They propose a new metric called "LLM Cluster-agent" to quantify the diversity of large-scale synthetic text datasets. This metric uses an LLM to cluster text samples based on generated metadata and scoring criteria. They generate synthetic datasets with varying levels of diversity by controlling factors like:

  • The underlying distribution of topics and generations per topic
  • Prompts/templates used for synthetic data generation (e.g. adding styles, personas)
  • Base LLM models used for generation (GPT-4, GPT-3.5, open-source models)
  • Ratio of real vs. synthetic data used for pre-training

They pre-train 350M and 1.4B parameter LLMs on combinations of real data and the synthetic datasets of varying diversity levels. They evaluate the pre-trained and also supervised fine-tuned LLMs on several benchmarks. Their experiments show that the proposed LLM Cluster-agent diversity metric correlates positively with the performance of both pre-trained and fine-tuned LLMs. They find that larger diversity in synthetic data benefits model performance, especially for fine-tuning. But too much synthetic data can deteriorate performance. They ablate the design choices of their LLM Cluster-agent pipeline.

Strengths

  1. The paper is well written and clearly organized
  2. The authors discuss an important issue of understanding diversity metrics.

Weaknesses

  1. Context Length Limitation:
  • The paper criticizes previous methods for context length limitations (lines 192-195), but the proposed method also faces similar constraints
  • How does this represent an improvement over existing algorithms?
  • Current LLMs have a maximum context length they can process. For extremely long documents, the clustering may need to be done in segments. What specific solutions does the method offer to address the context length challenge?
  • Can you please clarify how your method specifically addresses or improves upon the context length limitations of previous approaches. Please provide a more detailed comparison of how your method handles long documents compared to existing techniques.
  2. The authors state that their findings "present scalability and potential to be applied on a larger scale." However, their largest experiment is limited to 1.4B parameter models due to computational constraints. Extrapolating their diversity metric's performance and conclusions to much larger models (e.g., hundreds of billions of parameters) may be an overreach without empirical evidence at those scales. Please discuss any theoretical reasons or preliminary evidence that suggests their findings would scale to much larger models, or explicitly acknowledge the limitations of extrapolating beyond the scales tested.

  3. Belief in prompted clustering quality: Their LLM Cluster-agent metric relies heavily on the quality of prompted clustering done by LLMs. While LLMs can be very capable, assuming that their clustering perfectly captures true data diversity based on generated criteria is a strong belief. The metric's performance could degrade for highly complex data distributions.

  4. Diversity dimensionality: Their metric provides a single score capturing overall diversity. However, diversity is a multi-dimensional construct. Existing metrics focus on different facets that may still be relevant.

  5. Self-Verification Process Design:

  • Why wasn't a more capable "teacher" model used for the self-verification step?
  • Would using a more advanced model for verification improve the reliability of the results?
  • What was the rationale behind using the same model for both clustering and verification?
  6. This point is just good to have. I am not reducing scores for this: The evaluation is based solely on automatic metrics. Incorporating human evaluation of the generated synthetic data and model outputs could provide a more holistic understanding of the diversity and quality aspects.

Questions

The paper is well written, but I still have a lot of questions. It would be great if the authors can answer them.

  1. Sampling Parameters and Methodology (This is my biggest concern):
  • How are the parameters K (samples for clustering) and J (samples for metadata generation) selected?
  • For the M repetitions in metadata generation, are different J samples used each time?
  • What is the justification for these parameter choices?
  2. Self-Verification Process:
  • How is the effectiveness of the self-verification step evaluated?
  • Does the verification use the same LLM model or a different one?
  • How do different values of M and K affect the consistency of clustering results?
  3. Generalizability:
  • Has this diversity measurement approach been tested on different types of datasets beyond the described text data (e.g., mathematical content)? If not, can you do some additional experiments beyond the mathematical domain, like finance, biomedical, or logical reasoning, to show more generalizability of the proposed metric?
  4. Comparative Analysis:
  • How does the LLM Cluster-agent method compare quantitatively to existing diversity measurement approaches?
  • What specific advantages does it offer over traditional methods? A detailed comparison would be more helpful
Comment

We first thank the reviewer for the detailed suggestions and feedback, and we address the raised concerns and questions in the following.


The paper criticizes previous methods for context length limitations (lines 192-195), but the proposed method also faces similar constraints

We thank the reviewer for highlighting the context length limitation. But we want to clarify that we are not claiming that previous methods have context length limitations while ours does not. All model-based methods are subject to the context length of the model used. For traditional model-based methods, i.e., perplexity and K-means, the metric value is directly related to the context length of the models. Thus we have to segment the inputs to calculate these metrics.

Our LLM Cluster method is also subject to the context length, and overly long inputs also cause performance degradation, as shown in Appendix B.3. However, we show that, with small values of K, we are able to capture the clusters of the underlying data distribution by iterating the clustering process over a sufficient number of rounds N. Moreover, as shown in the paper and in our response to Reviewer qfg7, our metric demonstrates the best correlation with the performance of LLMs, showing its advantage over the baseline metrics. We hope this clarifies the confusion about context length.

The findings are extrapolated to larger scales, but the experiments are limited to 1.4B parameter models. Please discuss theoretical reasons or acknowledge the limitations of extrapolating to much larger models.

We acknowledge the limitation that we cannot verify the proposed method on larger-scale data and models due to computational limits. However, the results shown in Figure 1 indeed demonstrate that our metric correlates more strongly with the larger 1.4B model than with the 350M model. We have toned down the statement in the conclusion and hope this helps resolve the concern.

The metric assumes that LLMs' clustering perfectly captures true data diversity, which may degrade for highly complex distributions.

We agree that the reliance on LLMs' clustering capabilities assumes a level of semantic understanding that may degrade for highly complex or noisy data distributions. The self-verification step addresses some of this concern by removing clusters that do not align with the model's generated criteria. However, we will emphasize that the quality of clustering depends on the LLM's inherent capabilities and is not immune to limitations. Besides, the synthetic data explored in this work is highly practical, i.e., textbook-style synthetic data is widely used in training LLMs, and we show that the proposed metric is robust to complex distributions and domain-specific knowledge.

The metric provides a single diversity score, but diversity is inherently multi-dimensional.

Diversity is indeed a multi-dimensional factor. In the design of our method, we use metrics and metadata to capture the multi-dimensional diversity of synthetic data; thus our metric captures an aggregate view rather than individual facets. However, we will clarify in the paper that the LLM Cluster-agent metric is designed to provide a high-level, interpretable summary of diversity and does not replace metrics that focus on specific facets (e.g., n-gram diversity, self-repetition score). In future work, we will explore combining different metrics for a more comprehensive measurement of diversity.

Why not use a more capable "teacher" model for self-verification? Would this improve reliability? What was the rationale for using the same model for both clustering and verification?

Thanks for this suggestion. A more capable "teacher" model will indeed improve the self-verification step. We use the same GPT-4o model mainly for cost and consistency considerations. Here, we provide an ablation study on using different models for self-verification.

Self-Verification Model | Invalid Clusters | Cluster Score
GPT-4o | 248 | 3.99
GPT-4 | 254 | 4.03
GPT-3.5 | 218 | 3.81
Llama-3.1 | 192 | 3.65

The results show that the more capable GPT-4 indeed captures more invalid clusters and gives a slightly better cluster score. The eventual selection of the model may be subject to cost considerations in practice.

Comment

Incorporating human evaluation...

We acknowledge the importance of human evaluation as a valuable complement to automatic metrics, providing qualitative insights into the subjective quality and diversity of the generated synthetic data. Here, we provide a follow-up human study on the synthetic data generated using different prompts. We sample 50 data points from the 6 synthetic data variants, ask humans to score the diversity from 1 to 6, and report the average score across 10 human evaluators.

Synthetic Data | LLM Cluster Score | Human Diversity Score
Cosmopedia v0.1 | 4.7 ± 0.2 | 3.6 ± 0.8
Cosmopedia v0.2 | 3.7 ± 0.2 | 2.4 ± 0.9
Topic | 4.2 ± 0.3 | 2.3 ± 1.0
Topic Styles | 5.3 ± 0.2 | 4.8 ± 0.7
Topic Styles Persona | 6.8 ± 0.3 | 5.2 ± 0.4
Multi-Topic Styles Persona | 6.2 ± 0.3 | 4.5 ± 0.7

The strong correlation (r = 0.91, p = 0.011) highlights a statistically significant positive relationship between the Human Diversity Score and the LLM Cluster Score, reinforcing the validity of our metric. These results demonstrate consistency between human evaluations of diversity and the proposed LLM Cluster Score, bridging the gap in understanding the subjective aspects of synthetic data generation.

How are the parameters K...?

The parameters K and J are mainly determined by the ablation study shown in Appendix B.3. We show that the pipeline gives consistent results when K and J are within a proper range (5-20), and the cluster results deteriorate with larger K and J due to the longer context length.

For the M and N repetitions of the process, we use randomly sampled J and K data to obtain the metadata/metrics and the cluster results. There might be overlap in the selected samples across different repetitions, but we found that with sufficiently large M and N, random sampling actually captures the underlying data distribution more robustly.

How is the effectiveness of the self-verification step evaluated?

We provide an additional human evaluation on the filtered clusters from the self-verification step.

Topic/#Samples | Clusters | Self-verified Invalid Clusters | Human-verified Invalid Clusters
100/10 | 12943 | 248 | 221
100/20 | 15216 | 350 | 329

The results show that a large proportion of the filtered clusters are indeed treated as invalid by humans. As shown in our earlier response, we use the same GPT-4o in clustering and self-verification for consistency, but different models can be used, and more capable models give slightly better results.

How do different values of M and K affect the consistency of clustering results?

The ablation of different K, N, J, M are included in Appendix B.3.

Has the diversity...(e.g., finance, biomedical, or logical reasoning)?

Yes. Our synthetic data are generated across 620k topics, and these topics naturally include finance, biomedical, and logical reasoning, as shown in the word cloud in Figure 3. This indicates that the proposed metric is highly practical and can be used in reality.

How does the LLM Cluster-agent method compare quantitatively to existing diversity measurement approaches? What specific advantages does it offer over traditional methods?

Thanks for this question. We conducted a quantitative correlation analysis of our method compared to the baselines, as shown below.

Pearson coefficients (with p-value)

Metric | Pre-training (350M) | Downstream (350M) | Pre-training (1.4B) | Downstream (1.4B)
Self-Repetition Score | 0.5583 (0.0422) | 0.6185 (0.0320) | 0.7471 (0.0052) | 0.6523 (0.0147)
Compression Ratio | -0.4798 (0.1144) | -0.2751 (0.3868) | -0.2600 (0.4143) | -0.2941 (0.3533)
N-gram Diversity | 0.5878 (0.0444) | 0.4289 (0.1640) | 0.4382 (0.1541) | 0.4378 (0.1545)
Perplexity | 0.5066 (0.0101) | 0.5095 (0.0905) | 0.6587 (0.0198) | 0.6761 (0.0157)
Perplexity Gap | 0.6773 (0.0155) | 0.4799 (0.1142) | 0.6310 (0.0277) | 0.6203 (0.0313)
K-means | -0.8487 (0.0004) | -0.8312 (0.0008) | -0.7400 (0.0059) | -0.7321 (0.0067)
LLM-Cluster | 0.5930 (0.0421) | 0.7481 (0.0051) | 0.8457 (0.0005) | 0.7384 (0.0061)

Our approach demonstrates stronger correlations with downstream performance than traditional metrics such as perplexity or the K-means-based diversity score. A key advantage of our method is that its stronger correlation between diversity score and model performance allows it to serve as a performance predictor, whereas traditional methods fall short.
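The comparison above can be produced with a small script that, given one diversity value and one accuracy value per synthetic-data variant, ranks metrics by correlation strength. The arrays below are placeholders for illustration, not the paper's measurements:

```python
# Rank candidate diversity metrics by how strongly they correlate with downstream accuracy.
# All values below are placeholders for illustration only.
from scipy.stats import pearsonr

downstream_acc = [0.42, 0.44, 0.45, 0.47, 0.49, 0.48]        # one value per data variant (hypothetical)
metrics = {
    "LLM Cluster":      [4.7, 3.7, 4.2, 5.3, 6.8, 6.2],
    "N-gram Diversity": [0.61, 0.58, 0.60, 0.63, 0.66, 0.64],
    "K-means":          [0.35, 0.41, 0.38, 0.33, 0.29, 0.31],
}

ranked = sorted(metrics, key=lambda m: -abs(pearsonr(metrics[m], downstream_acc)[0]))
for name in ranked:
    r, p = pearsonr(metrics[name], downstream_acc)
    print(f"{name:>16}: r = {r:+.3f} (p = {p:.4f})")
```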


We hope the above response addresses your concerns. If there are any further questions, please let us know.

评论

Thank you for the response. I have raised my scores.

评论

We sincerely appreciate the time and effort you have dedicated to the review process. Should you have any further questions or concerns, we would be happy to provide additional clarification. Thank you once again.

-- Authors

审稿意见
6

This paper runs a study on the effect of diversity for synthetic data used to pretrain LLMs. To accomplish this, the authors develop an LLM-based diversity metric called LLM Cluster-agent that measures the diversity of datasets. This pipeline is an iterative clustering mechanism that uses metadata and metric generation prompts to extract features used to generate clusters. A clustering prompt is used to prompt an LLM to group randomly selected samples from the corpus into clusters. From here, the LLM Cluster score measures the diversity of the synthetic data based on the ratio between the number of clusters and the number of samples.

Using this metric, the authors pretrain multiple 350M and 1.4B parameter Llama models on a combination of real and synthetic data to measure the performance of these models on downstream tasks as a function of the diversity of the synthetic data, and find that model performance correlates with the LLM Cluster score. Furthermore, the authors run extensive ablations on different synthetic generation prompts.

优点

Content of the paper is fairly complete and extensive. Explanation of LLM Cluster-agent is clear and concise. Lots of details provided in the main manuscript and in the appendix to show exactly what they are doing. Ablations are elaborate and extensive. Paper gives convincing evidence that diversity in synthetic data generation can be controlled and is correlated with downstream performance. Generated metric may be a strong indicator for the efficacy of a synthetic data generation pipeline for pretraining.

缺点

The work does not sufficiently show that the LLM Cluster score is more correlated with performance than baseline metrics. Figure 1 is the main result showing that the LLM Cluster score correlates with downstream performance. However, it is difficult to see whether the other baseline metrics provided in the paper would not also be correlated with performance. The argument would be more convincing if there were also diagrams or concrete correlation metrics showing the other baseline metrics not aligning with performance in a similar way. The only evidence given that performance is more correlated is through visual inspection of Figures 4 and 5 in Section 3.3, and through visual inspection of Figures 6 and 7. Some of these correlations seem marginal at best on inspection. For instance, the LLM Cluster score for Cosmopedia v0.2 is the worst, but its performance is about on par with v0.1 and Topics in Figures 7a, 7c, and 7d.

问题

Main question would be whether the authors can show some quantitive correlation metrics or diagrams between LLM cluster score and other baseline metrics and downstream performance to show that LLM cluster score is a better indicator for performance than other baseline metrics.

In Figure 5, is the model trained on only real data trained on more real data to match the number of synthetic tokens? Judging from the results, it seems it is not. It would be interesting to quantify the efficacy of training with synthetic data. Similarly, in Figure 9, it seems like the number of tokens the model is trained on increases as you increase the synthetic-to-real data ratio. This diagram would be more convincing if the amount of data/tokens used for training were kept static.

I am also interested in the cost of evaluating these datasets using the LLM Cluster score. Previous approaches used either no LLMs or open-source LLMs like GPT-2. However, with this metric, all of these results are obtained using GPT-4o. Is this technique viable for researchers trying to quantify the diversity of their synthetic datasets?

LLM Cluster-agent pipeline seems to have some similarity with the field of explainable/unsupervised clustering of text data, such as Wang, Zihan, Jingbo Shang, and Ruiqi Zhong. "Goal-Driven Explainable Clustering via Language Descriptions." The 2023 Conference on Empirical Methods in Natural Language Processing and other related work, which should be cited in the related work.

There are also some typos in the manuscript. Other parts are a bit vague and unclear:

  • Line 098 has a typo "per-training".
  • On line 147, it is unclear what "due to similar patterns in part-of-speech tagging and syntactic often present in them"? Is this supposed to be "syntax"?
  • In section 2.1, it is unclear how capturing the underlying distribution of clusters and cluster sizes implies that this measure originates from the principle of entropy. What does this mean?
  • Seems to be an extra space on line 309 and line 422.
  • Extra "diversity" in the description of Figure 4, near line 289/290.
  • Is LLM Cluster-agent the name of the metric itself? Or is it the pipeline that you have developed for generating clusters? In the Abstract, you state that LLM Cluster-agent is "a new diversity metric", but in Section 2.1, you state that it is "pipeline that leverages LLM's abilities to interpret semantic meanings", and then later, on line 199-200, you state that "LLM Cluster score" is the actual diversity metric. There also is inconsistent capitalization of this name throughout the manuscript.
  • Line 409: Cosmopedia is spelled "Cosmppedia"
  • On line 464-465 in Section 3.7, the paper states that larger models "generally achieve higher accuracy, suggesting that more capable models benefit more from increased synthetic data diversity". However, the models achieving higher accuracy can be attributed just to larger model size. I think that the authors meant to say that the performance of the larger models are more correlated with diversity, not that they achieve higher accuracy. Furthermore, as stated before, this statement would be more convincing and concrete if the authors provided quantitative metric for this correlation.
  • In Section 2.1, line 197, the paper states that the verification step is important in "removing some unreasonable clusters." What does this mean? What are examples of unreasonable clusters?

Nit: Would add a citation for the footnote on page 3 regarding the degradation of performance for long-context situations, maybe Lost in the middle (Liu 2023) or something similar.

评论

We first thank the reviewer for the comprehensive and detailed feedback on our paper. We address the weaknesses and questions raised as follows.


The work does not sufficiently demonstrate...

We appreciate the reviewer’s concern and recognize the need to provide stronger evidence of the superiority of the LLM Cluster score over baseline metrics. To address this:

  1. We have included quantitative correlation metrics (e.g., Pearson correlation coefficients) between pre-training/downstream performance and all baseline diversity metrics, as well as the LLM Cluster score as shown below. We also included it in the revised Appendix.

Pearson coefficients (with p-value)

| Metric | Pre-training (350M) | Downstream (350M) | Pre-training (1.4B) | Downstream (1.4B) |
|---|---|---|---|---|
| Self-Repetition Score | 0.5583 (0.0422) | 0.6185 (0.0320) | 0.7471 (0.0052) | 0.6523 (0.0147) |
| Compression Ratio | -0.4798 (0.1144) | -0.2751 (0.3868) | -0.2600 (0.4143) | -0.2941 (0.3533) |
| N-gram Diversity | 0.5878 (0.0444) | 0.4289 (0.1640) | 0.4382 (0.1541) | 0.4378 (0.1545) |
| Perplexity | 0.5066 (0.0101) | 0.5095 (0.0905) | 0.6587 (0.0198) | 0.6761 (0.0157) |
| Perplexity Gap | 0.6773 (0.0155) | 0.4799 (0.1142) | 0.6310 (0.0277) | 0.6203 (0.0313) |
| K-means | -0.8487 (0.0004) | -0.8312 (0.0008) | -0.7400 (0.0059) | -0.7321 (0.0067) |
| LLM-Cluster | 0.5930 (0.0421) | 0.7481 (0.0051) | 0.8457 (0.0005) | 0.7384 (0.0061) |
  2. We also present diagrams similar to Figure 1 for baseline metrics, plotting their correlation with downstream performance. This allows for direct comparison with the LLM Cluster score, clarifying how traditional metrics such as perplexity or K-means clustering align less consistently with performance trends compared to the LLM Cluster score.

  3. While Cosmopedia v0.2 performs similarly to v0.1 and Topics in Figures 7a, 7c, and 7d, it is important to note that the observed performance parity holds despite Cosmopedia v0.2 having a significantly lower LLM Cluster score. This aligns with our hypothesis that lower diversity can still yield comparable results under specific conditions, but the LLM Cluster score generally provides better predictive power across varied datasets.

Can the authors show quantitative correlation metrics or diagrams to demonstrate that the LLM Cluster score is a better indicator of performance than other baseline metrics?

Please refer to the above response for the updated quantitative correlation metrics, and to the updated correlation diagrams in Appendix B.2.

Is the model trained on real data matched in size to the synthetic tokens? Would it be useful to quantify the efficacy of synthetic data training?

For our current results, the real-data-only model is trained on 34B tokens in total. The models trained with synthetic data are trained on real plus synthetic data, i.e., on more tokens. It is true that training the models on exactly the same number of tokens would further justify our results; however, due to the time and computation required for training, we have to leave this for future work.

Is this technique viable given the cost of using GPT-4o?

We acknowledge the valid concerns regarding the cost of using models like GPT-4o, particularly when applied to large datasets. We show in Appendix B.2 (now B.3) that open-source models can also be used in our LLM Cluster pipeline, while more capable models give more robust results. In this paper, we use GPT-4o for consistency, but it is fully viable to use other models, especially open-source ones, for the LLM Cluster metric.

Typos and missing citations: We thank you for pointing out the grammatical issues and the missing citations. We have thoroughly addressed and corrected these issues in the revised version of the paper.

In section 2.1, it is unclear how capturing the underlying distribution of clusters and cluster sizes implies that this measure originates from the principle of entropy. What does this mean?

This is because entropy, as a measure, also characterizes the underlying distribution through the number of clusters and the cluster sizes, which serves the same purpose as the proposed LLM Cluster score. We will make this clearer.

Seems to be an extra space on line 309 and line 422.

This is mainly caused by the wrapped table layout.

评论

Is LLM Cluster-agent the name of the metric itself? Or is it the pipeline that you have developed for generating clusters? In the Abstract, you state that LLM Cluster-agent is "a new diversity metric", but in Section 2.1, you state that it is "pipeline that leverages LLM's abilities to interpret semantic meanings", and then later, on line 199-200, you state that "LLM Cluster score" is the actual diversity metric. There also is inconsistent capitalization of this name throughout the manuscript.

We are sincerely sorry for the confusion caused. LLM Cluster-agent is the name of the proposed pipeline/method and LLM Cluster score is the metric produced by the pipeline used for diversity measurement. We have corrected the terms in the revised manuscript.

On line 464-465 in Section 3.7, the paper states that larger models "generally achieve higher accuracy, suggesting that more capable models benefit more from increased synthetic data diversity". However, the models achieving higher accuracy can be attributed just to larger model size. I think that the authors meant to say that the performance of the larger models are more correlated with diversity, not that they achieve higher accuracy. Furthermore, as stated before, this statement would be more convincing and concrete if the authors provided quantitative metric for this correlation.

Thank you for this detailed suggestion. We have corrected this statement and included the quantitative correlation measure in our revised paper.

In Section 2.1, line 197, the paper states that the verification step is important in "removing some unreasonable clusters." What does this mean? What are examples of unreasonable clusters?

Please refer to the updated Table 12 and Table 13 in Appendix B.3 for more ablation on the self-verification step. The unreasonable/invalid clusters are mostly clusters that have all samples in the same group, and the self-verification step aims to filter these invalid clusters out for more accurate and robust measurement of the diversity.
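As an illustration of the stated criterion, a minimal post-hoc filter might look as follows. This is a simplification: in the pipeline the self-verification is itself an LLM call, and the function name here is hypothetical.

```python
def is_valid_clustering(cluster_labels, min_groups=2):
    """Reject clustering results that place every sample into a single group."""
    return len(set(cluster_labels)) >= min_groups

# Kept: the K samples are split across several groups.
assert is_valid_clustering(["history", "history", "finance", "biology"])
# Filtered: all samples collapsed into one group, carrying no diversity signal.
assert not is_valid_clustering(["history"] * 4)
```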


Thanks for your comprehensive suggestions on revising our paper. If you find our above response helpful, please consider raising the score.

评论

We appreciate your detailed feedback and would like to provide further clarification regarding the remaining concerns.


analysis on why k-means.

The negative correlation trend of K-means can be attributed to several inherent limitations of the K-means clustering algorithm, particularly when applied to high-dimensional and complex text datasets like those used in our experiments. For example, the distance metric used by K-means on text embeddings is often inadequate for capturing the semantic and contextual nuances of large-scale datasets, and it also makes the results dependent on the embedding model. This geometric approach can lead to clusters that fail to reflect meaningful semantic distinctions, as it prioritizes proximity over contextual relationships.

In addition, K-means assumes that clusters are isotropic and of roughly equal size, a limitation that often misrepresents the underlying data distribution. In practice, dense regions of data with subtle variations may be split into multiple clusters, while sparse but diverse regions may be merged into a single cluster. This imbalance results in cluster configurations that are misaligned with the true diversity of the dataset, leading to poor correlation with performance.

In contrast to K-means, the LLM Cluster-agent leverages the semantic reasoning capabilities of large language models to generate clusters based on metadata and metrics that capture multi-dimensional diversity. This approach aligns more closely with the nuanced attributes that influence model performance, resulting in a stronger positive correlation. These findings underscore the importance of incorporating contextual and semantic understanding in clustering methods for evaluating data diversity in LLM pre-training.
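For reference, a generic K-means-over-embeddings diversity proxy of the kind discussed above might look like the sketch below. This is not necessarily the exact baseline used in the paper; the silhouette-based choice of k and the clusters-per-sample ratio are illustrative assumptions.

```python
# Generic K-means diversity proxy over text embeddings: choose k by silhouette score,
# then report the clusters-per-sample ratio. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_diversity(embeddings: np.ndarray, k_candidates=range(2, 11), seed=0) -> float:
    best_k, best_sil = 2, -1.0
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        sil = silhouette_score(embeddings, labels)  # assumes Euclidean geometry and isotropic clusters
        if sil > best_sil:
            best_k, best_sil = k, sil
    return best_k / len(embeddings)

# Random vectors stand in for sentence embeddings here.
rng = np.random.default_rng(0)
print(kmeans_diversity(rng.normal(size=(200, 64))))
```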

Regarding the costs, I think that it is important for the research community at large to see what the costs of generating new data with gpt-4o would be for this technique to work, as an overhead on normal synthetic data generation. In particular, if it is relatively cheap, some people may default to gpt-4o for ease of use. I would like to see some quantitative estimates of the cost overhead in the final paper.

We acknowledge the importance of providing quantitative estimates of the cost overhead associated with using GPT-4o in our pipeline. We use GPT-4o both for the LLM Cluster agent and for synthetic data generation. For the LLM Cluster agent, each clustering call costs roughly $0.005 for input tokens (K = 10, average input length ≈ 4,000 tokens) and $0.04 for output tokens (average output length ≈ 4,000 tokens). To obtain the LLM Cluster score, we run the clustering for N = 5,000 iterations, making the total cost of the diversity measurement about $225. In our main experiments, we repeat this process for 10 rounds, but in practice this can be achieved with fewer rounds.
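As a back-of-the-envelope check, the clustering-cost estimate above follows directly from the per-call figures; the numbers below are the ones quoted in this response, not official pricing.

```python
# Rough cost estimate for one diversity measurement with the LLM Cluster agent.
input_cost_per_call  = 0.005   # USD, ~4,000 input tokens per clustering call
output_cost_per_call = 0.04    # USD, ~4,000 output tokens per clustering call
n_iterations         = 5000    # N clustering iterations

total = (input_cost_per_call + output_cost_per_call) * n_iterations
print(f"~${total:,.2f} per diversity measurement")  # -> ~$225.00
```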

The synthetic data generation part is much more expensive, since we generate 10B tokens for 4 variants in our main experiments. The average input length for each generation is roughly 4,000 tokens and the output length roughly 1,000 tokens, making each generation about $0.015. To obtain 10B tokens, the total cost is approximately $15,000. Fortunately, most of the synthetic data generation was covered by API credits.

While this cost might not be feasible for everyone, GPT-4o-mini provides a cost-effective alternative with comparable performance, as reported in other works. For the same dataset, GPT-4o-mini would cost $75 for input tokens and $30 for output tokens, totaling $105. Using the Batch API, which offers a 50% reduction in cost, the input cost further reduces to $37.50 and the output cost to $15, bringing the total to $52.50. Although we did not test GPT-4o-mini in our experiments, it could serve as a viable low-cost alternative for researchers working with tighter budgets.

Will the code for LLM-cluster agent be open-sourced? This could be a valuable tool for the research community at large.

We greatly appreciate the interest in our LLM-cluster agent and recognize its potential value to the research community. To support further advancements in this area, we are committed to open-sourcing the LLM-cluster agent, providing the OpenAI API code and full prompt templates used in our paper to enable researchers to apply the pipeline to their own datasets. Additionally, we will release all the seed topics and configuration details used in our experiments, ensuring transparency and facilitating further exploration of our method. Furthermore, if possible, we plan to share the synthetic data generated through our pipeline, making it accessible for use in pre-training or fine-tuning large language models. We believe the synthetic data of 10B scale would indeed be valuable to the community. By releasing these resources, we aim to help the community, benefiting researchers working on diverse aspects of LLM training and evaluation.


We hope the above response could resolve the remaining concerns.

评论

I appreciate the authors' prompt response to my concerns.

  1. This is interesting commentary that I think should be distilled into the final paper in the appropriate location.
  2. Thank you for providing this information, this is very informative. It is good to see that the llm-cluster agent pipeline cost is relatively modest compared to the cost of generation of the full dataset. Please add this in your final manuscript.
  3. Thank you for releasing the source code, as I think that this can be a valuable artifact for the community.

I have raised my score accordingly.

评论

We sincerely appreciate the time and effort you have dedicated to reviewing our work and engaging in constructive discussion. Should you have any further questions or require additional clarification, we would be happy to provide it. Thank you once again.

-- Authors

评论

We sincerely appreciate the time and effort you have dedicated to reviewing our submission. We would be happy to provide further clarification if you have any additional questions or concerns.

-- Authors

评论

I appreciate the authors' thorough responses to my concerns. Overall, I am willing to raise my score to a 6, but I have some additional comments based on these responses.

  1. I would like to see some additional commentary on these correlation metrics; in particular, I am interested to see any analysis on why k-means seems to have such strong negative correlation with performance across the board.
  2. Regarding the costs, I think that it is important for the research community at large to see what the costs of generating new data with gpt-4o would be for this technique to work, as an overhead on normal synthetic data generation. In particular, if it is relatively cheap, some people may default to gpt-4o for ease of use. I would like to see some quantitative estimates of the cost overhead in the final paper.
  3. Will the code for LLM-cluster agent be open-sourced? This could be a valuable tool for the research community at large.
审稿意见
8

The paper aims to address an important gap in LLM research: the impact of a diverse dataset on LLM performance and task generalization. Moreover, it attempts to quantify "diversity" by introducing a clustering-based metric and proposes an agentic pipeline for the same. Elaborate comparisons have been done with other heuristic and model-based metrics. Multiple open-source and proprietary models have been used to generate synthetic data and to compare how the diversity metric varies across different model architectures and sizes.

优点

  1. Elaborate comparisons have been done with other heuristic and model-based metrics.
  2. Multiple open-source and proprietary models have been used to generate synthetic data and to compare how the diversity metric varies across different model architectures and sizes.
  3. The cluster generation and verification step is a much-needed safeguard for agentic pipelines and has been addressed appropriately.
  4. Conducting a large-scale study to systematically investigate the effectiveness of a diverse synthetic dataset in pre-training as well as SFT processes is a huge positive that strengthens the underlying hypothesis.

缺点

  1. Lack of a rigorous human evaluation study: Agentic pipelines generally need to be heavily intertwined with a human-in-the-loop aspect during initial experimentation. I am a bit unclear on how to test the reliability and quality of the metadata and metrics generation step in the pipeline. Errors in this initial step would be propagated to later steps and would potentially make the metric unreliable.
  2. Assessing cluster purity: an elaborate statistical analysis of the clusters formed is missing; how do we trust the final clusters formed, and how do we assess cluster purity?
  3. Lack of control on the number of clusters formed: there needs to be a systematic prompt injection to define the optimal number of clusters to be formed (and a process to find the optimal number of clusters in the first place). I understand that there have been ablation studies on the number of clusters, but how do I pick this optimal number, which might also vary across different styles and topics?
  4. Parameter Sensitivity: The performance of the clustering may depend on the choice of parameters like K (number of samples for clustering), N (number of clustering iterations), J and M (for metadata/metric generation). Optimal parameter settings may need to be determined.

问题

  1. How do you decide on the optimal number of clusters? Is there a cluster evaluation technique (similar to elbow curves, cluster homogeneity metrics for traditional clustering techniques) in place?
  2. Can we have a human eval study included to address the metric reliability issues?
评论

We thank the reviewer for the detailed feedback and address the raised weaknesses and questions as follows.


Lack of a rigorous human evaluation study...The error in this initial step would be propagated to later steps and would potentially make the metric unreliable.

Thanks for your advice on human verification of the proposed method. Here, we provide a human study on the synthetic data generated using the different prompts. We sample 50 examples from the 6 synthetic data variants, ask humans to score the diversity from 1 to 6, and report the average score across 10 human evaluators.

| Synthetic Data | LLM Cluster Score | Human Diversity Score |
|---|---|---|
| Cosmopedia v0.1 | 4.7 ± 0.2 | 3.6 ± 0.8 |
| Cosmopedia v0.2 | 3.7 ± 0.2 | 2.4 ± 0.9 |
| Topic | 4.2 ± 0.3 | 2.3 ± 1.0 |
| Topic Styles | 5.3 ± 0.2 | 4.8 ± 0.7 |
| Topic Styles Persona | 6.8 ± 0.3 | 5.2 ± 0.4 |
| Multi-Topic Styles Persona | 6.2 ± 0.3 | 4.5 ± 0.7 |

The strong correlation (r = 0.91, p = 0.011) demonstrates a statistically significant positive relationship between the Human Diversity Score and the LLM Cluster Score, supporting the validity of our metric. The results show consistency between the human diversity score and the LLM cluster score, which strongly correlates with the LLM performance.

In addition, the reliability of the proposed pipeline is ensured by multiple rounds of iteration in each module. For the metadata and metric generation module, we repeat the process M times, each time with J randomly sampled data points; the final metadata and metrics are gathered from this iterative process. The same holds for the clustering module. In our experiments, we also report error bars from multiple runs and show that the results are robust across different runs.

Assessing cluster purity...clusters formed and how do we assess cluster purity?

We recognize the importance of assessing cluster purity to evaluate the quality of the clusters formed. In our pipeline, the self-verification module plays a crucial role in ensuring the reliability of the clustering process by filtering out invalid clusters that do not meet the LLM-defined criteria. Additionally, the robustness of the clusters is evaluated through multiple iterations with varying K-sized samples, ensuring that the clustering results are consistent and representative of the underlying data distribution.

While we do not include a detailed statistical analysis of cluster purity in the current work, we conduct human evaluations that align well with the clusters identified by the LLM. These evaluations provide support for the validity of the formed clusters. Please refer to human evaluation results above.

Lack of control on the number of clusters formed

Instead of controlling the number of clusters in the prompt, we believe that letting the LLM fully decide the clusters, as in our method, is a better approach, since the optimal number of clusters varies significantly across scenarios in practice, and selecting it manually would not be ideal.

Moreover, our pipeline includes a self-verification module to validate the formed clusters. As shown in our responses to other reviewers, the clusters filtered by the self-verification module are confirmed as invalid by human evaluators.

Parameter Sensitivity: The performance of the clustering may depend on the choice of parameters like K (number of samples for clustering), N (number of clustering iterations), J and M (for metadata/metric generation). Optimal parameter settings may need to be determined.

It is true that the performance of the clustering depends on the choice of parameters. In practice, these parameters may need to be determined by running small ablations, and there are no optimal values that are universally applicable across scenarios. We show in our ablations that, as long as J and K are within a proper range, the metadata/metrics generation and clustering results are very consistent.

How do you decide on the optimal number of clusters? Is there a cluster evaluation technique (similar to elbow curves, cluster homogeneity metrics for traditional clustering techniques) in place?

The optimal number of clusters is decided by the LLM itself. As noted in our earlier response, a human-decided optimal number of clusters would be sensitive to the scenario and laborious to tune. The self-verification module in the proposed method ensures the validity of the formed clusters.

Can we have a human eval study included to address the metric reliability issues?

Please refer to the above response for the human evaluation study.


If you have further questions, please let us know.

评论

We sincerely appreciate the time and effort you have dedicated to reviewing our submission. We would be happy to provide further clarification if you have any additional questions or concerns.

-- Authors

评论

Thank you for your response; I am generally satisfied with the changes made. Thank you for providing the human eval study, that was really helpful. Make sure you include it in the paper as well. I'll be bumping the score up to 8.

评论

We truly appreciate the time and effort you have invested in reviewing our work and engaging in constructive discussions. Thanks again.

-- Authors

评论

Dear Reviewers and ACs,

As the discussion period comes to an end, we would like to express our gratitude for your time, effort, and constructive feedback during the review process. Below, we summarize the key changes and additional experiments conducted during the rebuttal period, which addressed key concerns raised by the reviewers.


1. Human Evaluation of Synthetic Data Diversity

In response to Reviewers a9T9 and ziQA, we conducted a human evaluation study to evaluate the diversity of the data. The results show a strong correlation (r = 0.91, p = 0.011) between human-assessed diversity and the diversity scores produced by our proposed LLM Cluster agent. This alignment validates the reliability of the LLM Cluster score in capturing nuanced diversity across synthetic data. These results will be included in the final version of the paper.


2. Robustness and Consistency of the LLM Cluster Agent

To address concerns about the pipeline’s reliability and design, we performed additional ablations and human validations, as detailed below:

  • Pipeline Design: In response to all reviewers, we conducted ablation studies on key components of the pipeline, including the hyper-parameters K, J, N, M and the separation of metadata and metrics. Results show that the method is robust across reasonable parameter ranges, with small standard deviations indicating consistent diversity measurements.
  • Self-Verification Module: In response to Reviewers ziQA and CDSG, we analyzed the self-verification module and conducted human validation on filtered clusters. Over 95% of clusters flagged as invalid by the module were corroborated by human evaluations, demonstrating its reliability.
  • Model Ablation: In response to Reviewers W47Q and CDSG, we examined the use of alternative models (e.g., GPT-3.5, Llama-3.1) for self-verification. Results indicate that more capable models (e.g., GPT-4o) improve reliability, though open-source models also provide reasonable performance.

3. Consistency of Diversity Measurements

In response to Reviewer W47Q, we extended ablation studies to further validate the consistency of diversity measurements. The results demonstrate minimal variance in diversity scores with sufficiently large values of N = 5,000 and K = 10. These findings reinforce the robustness of our approach, and the additional results are included in Appendix B.3.


4. Correlation Between Diversity and LLM Performance

In response to Reviewers qfg7 and CDSG, we performed a detailed correlation analysis between the LLM Cluster score and model performance. Pearson correlation coefficients demonstrate that our metric correlates more strongly with both pre-training (r = 0.85) and downstream performance than traditional diversity metrics (e.g., perplexity, K-means). Linear regression plots in Figure 11 (Appendix B.2) further support this finding.


5. Addressing Context Limitations

In response to Reviewers a9T9 and CDSG, we discussed the pipeline's iterative clustering design, which mitigates LLM context length limitations by aggregating clustering results over N iterations. This approach ensures reliable cluster formation without requiring complete dataset processing in a single pass.


6. Cost Analysis

In response to Reviewers qfg7 and CDSG, we provided a detailed cost analysis of the pipeline. For N = 5,000, the cost of computing the diversity score using GPT-4o is approximately USD 225. We also highlighted cost-effective alternatives, such as using GPT-4o-mini or open-source models, which reduce costs to as low as USD 52.50 while maintaining reasonable accuracy.


7. Additional Clarifications and Revisions

  • Comparison with Baselines: In response to Reviewer W47Q, we included additional quantitative results comparing the LLM Cluster score with traditional metrics (e.g., perplexity, perplexity gap, K-means). These are presented in Appendix B.2.
  • Hyper-Parameter Tuning: We clarified the impact of K, J, N, and M on performance and diversity measurement, with ablation results included in Appendix B.3.
  • Typographical Corrections: All reported typos and missing citations have been corrected in the revised manuscript.
  • Improved Figures: Figures have been updated with uniform axes to enhance interpretability.

We hope that our response has effectively addressed the concerns raised during the review and discussion period. We sincerely appreciate your thoughtful feedback and valuable suggestions, which have significantly enhanced the quality and clarity of our work. We will incorporate the remaining revisions into the final version of the paper. Thank you once again for your time and support.


Sincerely,

Authors

AC 元评审

The paper investigates the impact of diversity in synthetic datasets on the performance of large language models (LLMs) during pre-training and fine-tuning. It introduces a metric, LLM Cluster-agent, to quantify dataset diversity. This metric employs LLMs to generate metadata and cluster samples iteratively, producing a diversity score based on the ratio of clusters to samples. The work claims the following contributions: 1) a new metric, LLM Cluster-agent, to measure dataset diversity; 2) an empirical demonstration of the efficacy of the proposed metric through its strong correlation with model performance. Experiments are conducted on models with 350M and 1.4B parameters and evaluated on downstream tasks. The authors also explored various prompt templates, styles, personas, and base LLM architectures (e.g., GPT-4, GPT-3.5) to enhance synthetic dataset diversity.

Strength of this paper

  • The paper is well-written and clearly organized.
  • The work focuses on the recently popular topic of LLMs and tries to offer insight into the selection of synthetic data and generation models, and into balancing real and synthetic data.
  • Empirical results support the efficacy of the proposed method. The method shows scalability and real-world applicability.

Weakness of this paper

Several reviewers raised a few concerns and limitations of this paper. By addressing these limitations, the paper could strengthen its experiments and expand its impact.

  • Experiment scope: the work focuses on small models, and there is little evidence that the findings generalize to the larger models commonly used in the community. Clustering in the work is constrained by LLM context window sizes, potentially leading to unpredictable cluster formations, and no clear solutions or comparisons to existing methods addressing long-document challenges are provided. The dependency on the same model for clustering and self-verification also introduces potential bias; ideally, a more capable or independent model should be used for verification to obtain more reliable experimental results.
  • Various concerns about the settings of the experiments and method. Over-simplifying diversity into one metric/score and overlooking its multi-variable nature might yield suboptimal results. Diversity is subjective and ambiguous; a method built upon it suffers from inconsistency and reproducibility issues. Besides, the methodology does not address how to determine or control the optimal number of clusters in an algorithmic way, which also raises some concerns about the stability of the method.

审稿人讨论附加意见

In addition to the above weaknesses, reviewers also raised some other weaknesses and suggested improvements during the rebuttal (e.g., performance heavily depends on sensitive parameters and systematic sensitivity analyses are required to show the robustness of the proposed method; there is a lack of statistical analysis to assess the reliability of clusters). Some of these weaknesses were improved or somewhat addressed during the rebuttal session (e.g., further explanation of the questions raised by reviewers, more discussion, clarification, and additional experimental results). Although some partial ratings (e.g., soundness) and overall review ratings were raised, the rating averaged over all reviewers is still borderline. I think the session is too short and some weaknesses are hard to address in such a short period of time. There is also a general concern about the setting of this work and its generalizability/robustness. Given the high bar of ICLR, I think the paper is still of limited interest to the audience, and thus I recommend rejecting the paper and suggest that the authors rework these weaknesses and resubmit to future conferences.

最终决定

Reject