PANGEA: Projection-Based Augmentation with Non-Relevant General Data for Enhanced Domain Adaptation in LLMs
This paper introduces PANGEA, a method that leverages general-purpose data to generate diverse and high-quality synthetic data, improving LLM performance on domain-specific tasks.
Abstract
Reviews and Discussion
This paper introduces PANGEA (Projection-based Augmentation with Non-relevant General data for Enhanced domain Adaptation), a framework for generating domain-specific synthetic data from large-scale general datasets using only a small set of domain-specific data.
The PANGEA procedure consists of three main steps:
- A large language model (LLM) analyzes the domain-specific dataset to generate a profile that captures the key characteristics of the domain.
- This profile is applied to the general dataset, transforming it into a structured format that highlights information relevant for generating domain-specific data.
- The LLM uses this structured dataset to produce synthetic data, guided by the tone and style of the original domain-specific data.
PANGEA is evaluated across four domains—mathematics, medicine, finance, and CDSL—and demonstrates strong performance in enhancing domain adaptation.
Strengths and Weaknesses
Strengths
- The motivation of the paper is sound, and the writing is clear.
- The proposed PANGEA framework is interesting and novel.
- The experiments demonstrate that PANGEA is effective across multiple domains.
Weaknesses
- Domain relevance of synthetic data: Technically, PANGEA appears to lack a mechanism for ensuring that synthetic data generated from the general dataset is always suitable for the target domain. The authors used the CoT-Collection as the general dataset. However, it is unclear whether every data entry in CoT-Collection can be meaningfully converted into synthetic data for each domain. For example, while an entry related to food might be relatively easy to adapt to the medical domain, it is less obvious how it would translate to the finance domain. Although the paper evaluates four domains, this raises the question: "Is every general data entry suitable for transformation into any arbitrary domain?" In other words, some data may not be appropriate for use in specific domains. The authors should consider designing a procedure to filter out less desirable data for the target domain during both the generation and training stages.
- One potential solution is to assign weights to synthetic data, reducing the impact of less ideal examples. Related works include:
- Gao et al., Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning, ICLR 2023.
- Choi et al., UniGen: Universal Domain Generalization for Sentiment Classification via Zero-shot Dataset Generation, EMNLP 2024.
- Zou et al., FuseGen: PLM Fusion for Data-generation based Zero-shot Learning, EMNLP 2024.
- Another possible approach is to introduce a review process, where a second LLM evaluates generated synth-guide block τ for applicability to the target domain.
- Quality measurement lacks human annotation: While the use of o1 as a judge may provide a strong automatic evaluation, incorporating human annotation—at least at a small scale (e.g., 100 entries per domain across the four domains)—would greatly strengthen the validation of the synthetic data’s quality.
Questions
- What is the amount of the CoT-Collection after excluding math, medical, and finance-related entries?
- CoT-Collection is more than 1.8M, but the authors only generated 10k, 30k, or 120k synthetic data. Is this randomly selected?
Limitations
Yes
Justification for Final Rating
I think that there is some merit to this paper.
Formatting Issues
N/A
We sincerely thank the reviewer for their thoughtful feedback and for recognizing the motivation, novelty, and effectiveness of our PANGEA framework. The weaknesses and questions raised were highly valuable, and we are pleased to provide further clarification and internal results.
W1 (Domain Relevance and Filtering of Synthetic Data). Is every general data entry suitable for transformation into any arbitrary domain? In other words, some data may not be appropriate for use in specific domains. The authors should consider designing a procedure to filter out less desirable data for the target domain during both the generation and training stages.
Your insight into maintaining and precisely controlling the quality of domain-specific synthetic data through filtering is highly valuable. It closely reflects the challenges we faced during the early stages of our research, and we believe your comments accurately pinpoint the core issues we encountered at the time.
We agree that maintaining domain relevance and filtering out low-quality data are critical challenges in synthetic data generation, as synthetic data often suffers from significant quality degradation. As you suggested, using reweighting methods for synthetic data and filtering components within the PANGEA process are both promising directions. Among various filtering strategies, we explored an approach that directly filters the general data. Indeed, in the early phase of our research, we explored this direction by training a verifier model to pre-filter general data. Specifically, we trained a model to predict whether a given general data sample, when paired with a source domain sample, is suitable for generating high-quality synthetic data.
To supervise the model, we used the o1 metric from the PANGEA framework to label data pairs. Samples deemed high-quality were labeled as 1, and those that did not meet the quality threshold were labeled as 0. We manually annotated approximately 20,000 examples, maintaining a balanced 1:1 class ratio. The verifier was trained using linear probing on the Llama-3.1-8B model and produced a probability score between 0 and 1, estimating whether a given (source data, general data) pair, or Synth-Guide Block (τ), would yield high-quality synthetic outputs.
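For concreteness, a minimal sketch of this kind of linear-probing verifier is shown below. The class, variable names, and hyperparameters are illustrative assumptions rather than our actual implementation, and it presumes that a pooled hidden-state vector from the frozen Llama-3.1-8B has already been extracted for each (source data, general data) pair.

```python
import torch
import torch.nn as nn

class PairVerifier(nn.Module):
    """Linear probe over frozen LLM features (illustrative sketch).

    Assumes each (source data, general data) pair has already been encoded into
    a single pooled hidden-state vector by a frozen Llama-3.1-8B.
    """
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)  # only this layer is trained

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        # Probability (0-1) that the pair would yield high-quality synthetic data.
        return torch.sigmoid(self.head(pooled_features)).squeeze(-1)

# Training sketch on ~20k o1-labeled pairs with a balanced 1:1 class ratio.
verifier = PairVerifier()
optimizer = torch.optim.AdamW(verifier.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    scores = verifier(features)          # shape: (batch,)
    loss = loss_fn(scores, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```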
However, in practice, the verifier’s precision was limited. It achieved only 50 to 60 percent accuracy in correctly identifying positive cases, which resulted in a low yield when attempting to collect high-quality samples using this method. We also conducted an additional experiment in which we filtered generated synthetic data based on quality. The results are summarized below:
| Training Data | Acc on GSM8K |
|---|---|
| Naive | 26.91 |
| Evol-Instruct | 27.36 |
| 10k Low-quality Samples Only | 28.66 |
| 20k High-quality Samples Only | 35.47 |
| All 30k Samples (Whole) | 37.72 |
Interestingly, even when low-quality samples were included, the overall performance improved when the model was trained on the entire dataset of 30,000 samples. This suggests that the inclusion of diverse examples, including negative ones, contributes to generalization. In hindsight, we found that directly utilizing general data was ultimately more effective than filtering or relying only on high-quality subsets. Building on these observations, we shifted our focus from selecting only high-quality out-of-domain data to designing a framework that makes the best use of any available general data. In particular, our proposed three-stage framework, centered around the high-quality Synth-Profile constructed from source data, has proven more effective at producing useful synthetic data even when the general data is arbitrarily selected. While we recognize that filtering mechanisms can still be beneficial, especially when the proportion of ambiguous or low-quality synthetic samples (cf. Appendix Table 10) becomes too high, we believe that training a dedicated verifier and applying data filtering for every individual domain is neither scalable nor generalizable. Such an approach increases both complexity and cost, and it limits the practicality of applying synthetic data generation across diverse domains.
W2 (Lack of Human Annotation for Quality Measurement). Quality measurement lacks human annotation: While the use of o1 as a judge may provide a strong automatic evaluation...
We randomly sampled 100 synthetic data points from each benchmark (GSM8K, FinQA, MedQA, and CDSL) to compare three generation methods: A (Naive), B (Evol-Instruct), and C (PANGEA). Fifty-seven graduate-level annotators, after a brief pilot, evaluated the three generated synthetic samples in random order using the same criteria provided in the supplement, labeling each as ‘Best’, ‘Second’, or ‘Worst’.
| Domain | Best | Second | Worst |
|---|---|---|---|
| GSM8K | C 46.82% ≫ B 28.96% > A 24.22% | B 46.62% ≫ C 33.81% ≫ A 19.57% | A 56.21% ≫ B 24.42% > C 19.37% |
| FinQA | C 53.44% ≫ B 26.22% > A 20.34% | B 46.49% ≫ C 36.27% ≫ A 17.24% | A 62.42% ≫ B 27.29% ≫ C 10.29% |
| MedQA | C 51.29% ≫ A 24.46% > B 24.25% | B 37.87% > C 37.84% ≫ A 24.29% | A 51.25% ≫ B 37.88% ≫ C 10.87% |
| CDSL | C 50.66% ≫ A 28.93% > B 20.41% | A 34.15% > C 34.13% > B 31.72% | B 47.87% ≫ A 36.92% ≫ C 15.21% |
Based on quality assessments of 100 items per benchmark, PANGEA (C) consistently received the highest ratings across all four domains. In contrast, Naive (A) was most frequently rated lowest in quality, while Evol-Instruct (B) typically received intermediate evaluations. Inter-annotator agreement was moderate, with a Krippendorff’s α of 0.61.
To determine whether these observed differences in quality ratings were statistically significant, we applied χ² and Friedman tests to the annotation results for each domain.
| Domain | χ² (3 × 3) p-value | Friedman p-value |
|---|---|---|
| GSM8K | 4.0 × 10⁻⁵ | 5.96 × 10⁻³ |
| FinQA | 2.2 × 10⁻⁸ | 1.64 × 10⁻⁵ |
| MedQA | 6.3 × 10⁻⁵ | 7.65 × 10⁻⁴ |
| CDSL | 1.9 × 10⁻³ | 2.19 × 10⁻³ |
Across all domains (GSM8K, FinQA, MedQA, CDSL), PANGEA (C) received the highest proportion of “Best” ratings, indicating it consistently produced the highest-quality outputs. In contrast, Naive (A) was most frequently rated as “Worst,” while Evol-Instruct (B) generally ranked in the middle. To assess the statistical significance of these differences, we conducted both χ² (chi-squared) and Friedman tests, which confirmed significant performance differences across all domains (p < 0.01).
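For reference, both tests can be run with standard SciPy routines. The sketch below is illustrative only: the contingency counts and the rank matrix are placeholders that show the expected input shapes, not our actual annotation tallies.

```python
import numpy as np
from scipy.stats import chi2_contingency, friedmanchisquare

# Placeholder 3x3 contingency table (rows: methods A/B/C; cols: Best/Second/Worst).
# Replace with the actual per-domain counts from the annotation study.
counts = np.array([
    [24, 20, 56],  # A (Naive)          -- illustrative numbers only
    [29, 47, 24],  # B (Evol-Instruct)
    [47, 33, 20],  # C (PANGEA)
])
chi2_stat, p_chi2, dof, _ = chi2_contingency(counts)
print(f"chi-squared p-value: {p_chi2:.2e}")

# Friedman test over per-item ranks (1 = Best, 3 = Worst), one column per method.
# `ranks` is a placeholder (n_items x 3) matrix standing in for the real annotations.
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation([1, 2, 3]) for _ in range(100)])
stat, p_friedman = friedmanchisquare(ranks[:, 0], ranks[:, 1], ranks[:, 2])
print(f"Friedman p-value: {p_friedman:.2e}")
```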
Q1. What is the amount of the CoT-Collection after excluding math, medical, and finance-related entries?
After removing entries related to our target domains, the CoT-Collection contained approximately 900k samples. For storage and computational efficiency, we created a working subset of about 500k samples, from which general data instances were randomly drawn for augmentation.
Q2. CoT-Collection is more than 1.8M, but the authors only generated 10k, 30k, or 120k synthetic data. Is this randomly selected?
Arithmetically, PANGEA can generate up to (# general data) × (# source data) instances. However, our objective was not merely to produce a large volume of synthetic data for specific domains, but to develop a universal pipeline applicable to any downstream task as needed. Accordingly, we focused on validating performance trends and scalability across a range of domains, rather than simply increasing the amount of synthetic data.
To this end, we selected three representative scales (10k, 30k, and 120k) that clearly demonstrate how PANGEA (1) consistently outperforms baseline approaches and (2) continues to scale more effectively as additional synthetic data are introduced. All general data were randomly sampled from the CoT-Collection after filtering out domain-related content per task, as our goal is to demonstrate the feasibility of using domain-unrelated data for large-scale synthetic generation.
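To make the sampling budget concrete, the sketch below illustrates the combinatorial upper bound and the random pairing of general and source data; the constants and function names are illustrative assumptions, not taken from our codebase.

```python
import random

NUM_GENERAL = 500_000  # working subset of CoT-Collection after domain filtering
NUM_SOURCE = 100       # domain-specific seed examples

# Upper bound on distinct (general, source) pairings available for projection.
max_pairs = NUM_GENERAL * NUM_SOURCE  # 50,000,000

def sample_pairs(n_synthetic, seed=0):
    """Randomly draw (general_idx, source_idx) index pairs for synthesis (illustrative)."""
    rng = random.Random(seed)
    return [
        (rng.randrange(NUM_GENERAL), rng.randrange(NUM_SOURCE))
        for _ in range(n_synthetic)
    ]

pairs_10k = sample_pairs(10_000)
```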
We sincerely thank you for reconsidering your evaluation based on our response. Your constructive feedback throughout this process has been invaluable.
Your suggestions to include human annotation and to discuss the data filtering were particularly insightful. We believe addressing these points has significantly improved the quality and validation of our paper.
As promised, we will be sure to incorporate the new human evaluation results, along with a detailed discussion on our findings regarding data filtering, into the final camera-ready version.
Thank you again for helping us strengthen our work.
Thank you for your response; I modified my rating accordingly.
This paper proposes PANGEA, a method that leverages large-scale, domain-unrelated general data and a small set of domain-specific examples to generate high-quality, diverse synthetic data for domain adaptation. By extracting structural patterns from the target domain and projecting them onto general data, PANGEA significantly improves model performance under data-scarce conditions.
Strengths and Weaknesses
Strengths:
- The idea is interesting of using general data to synthesize domain-specific data.
Weaknesses:
- Results on larger models, such as 7B variants, are preferred to better assess scalability and real-world applicability.
- The LLMs used in Stage 2 and Stage 3 may possess prior knowledge of downstream benchmarks like GSM8K, raising concerns about potential knowledge leakage and limiting the ability to fairly evaluate the proposed method’s effectiveness.
- More downstream datasets are needed, preferably with fine-grained splits; the introduced CDSL dataset lacks representativeness.
- It remains unclear how the proposed method performs with strong reasoning models such as QWQ or DeepSeek-R1.
- An ablation study on the number of domain-specific samples in the source dataset would be valuable for understanding the method’s robustness under varying levels of data scarcity.
Questions
See weakness.
Limitations
Yes.
Justification for Final Rating
The dataset and backbone used in the experiments lack representativeness, and the authors have not convinced me of the robustness and generalizability of their method. Therefore, I decide to keep my original rating.
Formatting Issues
None.
We sincerely thank the reviewer for their thoughtful and constructive feedback. Your insightful suggestions have significantly helped us improve the clarity, rigor, and scope of our work. Below, we address each of your comments in detail.
W1 (Scalability to Larger Models). Results on larger models, such as 7B variants.
We are sincerely grateful to the reviewer for this excellent suggestion that significantly strengthens our paper. We conducted an experiment using the Llama-3.1-8B model in the 10k setup. Notably, our PANGEA framework is highly effective on the larger 8B model, even outperforming Llama-3.1-8B-Instruct:
| Method | GSM8K | MedQA | FinQA | CDSL |
|---|---|---|---|---|
| Pre-trained | 48.75 | 39.31 | 25.63 | 1.61 |
| Instruct-tuned | 85.62 | 64.10 | 64.95 | 2.32 |
| Naive | 81.35 | 55.93 | 48.78 | 15.65 |
| Evol-Instruct | 79.08 | 60.02 | 53.88 | 17.42 |
| PANGEA (Ours) | 86.47 | 64.51 | 65.31 | 25.91 |
W2. The LLMs used in Stage 2 and Stage 3 may possess prior knowledge of downstream benchmarks…
Thank you for your insightful comment. We address your concerns in two parts: (1) fairness in comparison with baselines and (2) the role of prior knowledge.
First, all methods, including baselines, use the same Llama-3.3-70B-Instruct model for synthetic data generation, ensuring a fair comparison. The performance gains from PANGEA are not due to differences in prior knowledge but rather stem from its systematic approach. Specifically, PANGEA performs stage-wise profiling to extract key components from seed data, and then leverages diverse general data to enhance variety while maintaining difficulty and thematic consistency.
Second, to assess the role of prior knowledge, we conducted a new experiment on the Korean CSAT exam, data not included in the LLM’s training data. We generated 10,000 synthetic CSAT problems using the 2023 and 2024 official exams as seed data, and evaluated performance on a held-out test set from the 2025 CSAT. As shown in our response to Reviewer Puqb, using a smaller 1B model, PANGEA outperformed the baselines across all difficulty levels (Easy, Medium, Hard) and in total score. This further supports that PANGEA’s effectiveness does not rely on prior knowledge.
These results clearly demonstrate that even for tasks unseen during pre-training of the LLM in Stages 2 and 3, PANGEA can still generate useful synthetic data to train the target model.
W3. More downstream datasets are needed, preferably with fine-grained splits; the introduced CDSL dataset lacks representativeness.
More downstream datasets with fine-grained splits. We believe this concern is effectively addressed by the additional experiments we conducted on the Korean CSAT task. Detailed settings and results can be found in our response to Reviewer Puqb.
Representativeness of CDSL. We are fully aware of concerns regarding potential knowledge leakage from widely used benchmarks such as GSM8K, as also noted in your W2. Since such leakage is difficult to eliminate entirely, we deliberately designed CDSL as a novel, out-of-distribution “acid test.” Its symbolic syntax, custom operators, and domain-specific content are entirely absent from any public dataset, and even the 70B model achieves only around 20% accuracy. In this sense, we understand your concern regarding the representativeness of CDSL.
However, CDSL is not meant to reflect typical downstream tasks. Rather, it is intentionally constructed to evaluate the robustness of data augmentation and domain adaptation in truly novel settings, where prior knowledge is unavailable and only a small number of seed examples (e.g., 100) are provided. While standard benchmarks such as GSM8K, MedQA, and FinQA are undoubtedly more representative of real-world tasks, they cannot be entirely free from prior knowledge leakage. CDSL was specifically designed with this issue in mind, providing a controlled setting where such leakage is virtually impossible.
W4 (Performance with Strong Reasoning Models). It remains unclear how the proposed method performs with strong reasoning models such as QWQ or DeepSeek-R1.
We would like to clarify that the primary goal of PANGEA is not merely to achieve state-of-the-art performance on reasoning tasks. Rather, our objective is to develop a universal and robust framework for data augmentation that can be effectively applied to any new, domain-specific task, even when starting from a small set of seed examples (e.g., 100). Owing to its broad applicability, PANGEA also enables strong publicly available reasoning models to be fine-tuned into domain-specific specialists using the same approach.
As an illustration, we conducted an additional experiment using DeepSeek-R1-Distill-Llama-8B; the only difference from our original setup is that we created new labels, including a “reasoning-path,” to align with the model’s intended prompt template:
| Method | GSM8K | MedQA | FinQA | CDSL |
|---|---|---|---|---|
| R1-Distill | 85.29 | 57.09 | 61.29 | 6.96 |
| R1-Distill + PANGEA | 88.91 | 66.53 | 69.75 | 28.70 |
These results clearly show that fine-tuning with data generated by PANGEA provides substantial gains even for a strong, specialized reasoning model. This validates that our framework is not only effective for general-purpose models but also serves as a powerful tool to further enhance the capabilities of expert models in their respective domains.
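For illustration, the sketch below shows one way such a reasoning-path label could be packaged into a chat-style training example; the <think>-tag convention and the field names are assumptions about the R1-distilled prompt template, not a description of our exact format.

```python
def format_r1_example(question: str, reasoning_path: str, answer: str) -> dict:
    """Package one synthetic example for an R1-distilled model (illustrative sketch).

    Assumes the chat template expects intermediate reasoning wrapped in <think>
    tags, the convention commonly used by DeepSeek-R1-style models.
    """
    return {
        "messages": [
            {"role": "user", "content": question},
            {
                "role": "assistant",
                "content": f"<think>\n{reasoning_path}\n</think>\n\n{answer}",
            },
        ]
    }
```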
W5 (Ablation Study on the Number of Source Samples). An ablation study on the number of domain-specific samples in the source dataset would be valuable for understanding the method’s robustness under varying levels of data scarcity.
PANGEA was originally designed with the assumption of scarce seed data in mind, so exploring its performance under such conditions is certainly a valuable direction. While the current main experiments use 100 seed examples, we additionally tested scenarios with fewer seed samples:
| Method | # Seed | GSM8K | MedQA | FinQA | CDSL | Avg. |
|---|---|---|---|---|---|---|
| Naive | 100 | 26.91 | 35.42 | 24.06 | 3.20 | 22.40 |
| Evol-Instruct | 100 | 27.36 | 36.29 | 26.68 | 5.22 | 23.89 |
| PANGEA (Ours) | 100 | 32.52 | 37.78 | 36.44 | 11.30 | 29.51 |
| | 80 | 32.91 | 37.34 | 36.37 | 11.01 | 29.41 |
| | 40 | 31.84 | 36.10 | 35.21 | 10.15 | 28.33 |
| | 20 | 31.21 | 35.79 | 34.83 | 8.70 | 27.63 |
| | 10 | 29.28 | 33.27 | 28.31 | 3.77 | 23.66 |
The results demonstrate that PANGEA’s performance remains remarkably robust across varying amounts of domain-specific seed data. Notably, as long as the number of seed examples is not extremely small (e.g., 10), PANGEA outperforms the baseline trained with 100 seed examples even when using significantly fewer seeds (e.g., 80, 40, 20). We further conducted an in-depth analysis to better understand what happens when the number of seed examples becomes very limited:
Root Cause: The Impact on Synth-Profiling. We identified that a limited number of seed examples negatively impacts the quality of the Synth-Profiling stage. With fewer examples, the generated profile tends to be less detailed and less specific. The following are Synth-Profiles for GSM8K with 10/100 seeds; the 10-seed profile asks for fewer equations and lacks details compared to the 100-seed profile (e.g., "kid-friendly," "multi-step solution," specific units):
Synth-Profile from 10 seeds:
- Pick one ordinary scene (errands, snacks, chores, mini-trip, etc.).
- List number with a symbol (A, B, C, D) → value + unit + brief label.
- Write 2–3 equations that use those symbols.
- End by stating which symbol the solver must find.
- Give a bullet-style plan showing the order the equations should be tackled.
- Do not write the full word problem or reveal any answers.
Synth-Profile from 100 seeds (Original):
### Choose a Kid-Friendly Scenario
- Restate one everyday clause students can picture (shopping, chores, snacks, simple travel).
- Remove unnecessary details.
### Select 3-5 Meaningful Numbers
- Prefer numbers in General Question.
- Invent realistic values if needed, supporting clear multi-step solution.
- Use everyday units (dollars, minutes, km, items, °C, simple percents).
### Define Symbols & Units
- Assign symbols A, B, C, (D, E) with brief label and explicit unit.
### Write 4-5 Simple Equations
- Use only +, -, ×, ÷, %, or single-step unit conversions.
- No new numbers; use chosen symbols or intermediate results.
### State the Target
- Specify final symbol to solve and its numeric meaning.
### Outline Step-by-Step Plan
- Provide one bullet per equation, in order (4-5 steps).
The following illustrative examples show that although the synthetic data generated from 10 seed examples remains mathematically valid, it displays qualitative differences. While the 10-seed example is more verbose and contains an unrealistic scenario (a 200°F appetizer), the 100-seed example is clear and contextually coherent.
Example from 10 seeds:
Using identical food thermometers, Sarah notes that her appetizer registers 200 °F, while Tim observes his at 150 °F. Sarah then turns up the heat, raising her appetizer by exactly 20 °F. Tim follows suit, but rather than copying the same degree change, he warms his snack (in the upward direction) by an amount equal to one-half of the percentage that Sarah’s 20 °F rise represents relative to her appetizer’s original temperature. After both of these adjustments have been completed, how many degrees Fahrenheit apart are the two snacks?
Example from 100 seeds (Original):
Sarahn and Timon discovered two planets, where Sarahn’s crust temperature is 48 °C, and Timon’s crust is half as hot due to lower geographic activity. If the temperature difference between their planets is twice Timon’s crust temperature, what is the difference in their crust temperatures?
In conclusion, this ablation study confirms the robustness of PANGEA and highlights that a moderately small set of ~40 high-quality examples is sufficient to generate a strong domain profile. We are grateful to the reviewer for encouraging this valuable investigation.
Thanks for the response. However, the authors have not convinced me of the robustness and generalizability of their method. Therefore, I decide to keep my original rating.
Thank you for your response to our rebuttal. We would be happy to provide further clarification or additional evidence if there are particular aspects that remain unclear or unconvincing. So, could you clarify which specific part of our response did not sufficiently address your concerns regarding the "robustness" and "generalizability" of our method? As we understand it, your concerns pertain to whether the proposed PANGEA methodology performs reliably across a broad range of settings.
Throughout the paper, we comprehensively evaluated PANGEA on a range of benchmark datasets widely recognized in the community, including GSM8K, MedQA, and FinQA. We were also fully aware of potential knowledge leakage issues and accordingly designed and conducted the CDSL experiments in a systematic manner. Furthermore, during the rebuttal period, we extended our evaluation to include the Korean CSAT. As a result, the effectiveness of the PANGEA methodology has been demonstrated not only on established benchmarks (GSM8K, MedQA, FinQA), but also on well-designed datasets specifically constructed to address knowledge leakage concerns (CDSL and Korean CSAT). We believe these results collectively demonstrate that PANGEA is both robust and generalizable across a diverse set of downstream tasks.
In response to the request "to better assess scalability and real-world applicability," we additionally conducted experiments using an 8B-scale model. As a result, the effectiveness of the PANGEA methodology has been demonstrated not only on 1B (Llama-3.2-1B), 1.5B (Qwen2.5-1.5B), and 2B (Gemma2-2B) models, which were sourced from different providers, but also on the 8B model (Llama-3.1-8B). Furthermore, we also included results from an 8B-scale reasoning-oriented model (DeepSeek-R1-Distill-Llama-8B). This highlights the robustness and generalizability of PANGEA across a diverse range of model sizes and types.
Dear Reviewer bLdh,
Detailed feedback from the authors has been posted.
Please comment on whether the reviewer's concerns have been addressed or if the reviewer still have concerns.
Regards,
Your AC
Some issues need to be addressed:
- Although the authors provided results for Llama3-8B in the rebuttal, the majority of experiments in the paper are based on models with 1B, 1.5B, and 2B variants. Consequently, a large portion of the experimental results and conclusions need to be updated accordingly.
- The core idea of the proposed method essentially relies on distillation from larger models. According to the Qwen3 technical report, data distillation is a rather straightforward approach and lacks significant novelty. Why not let the model being trained generate its own synthetic data? If self-generated data is used, does performance eventually hit a ceiling as the amount of synthetic data increases?
On the suggestion of self-generation
A foundational principle of LLM-based synthetic data generation lies in harnessing the capabilities of powerful language models as data generators; as noted in Nadas et al. (2025), "...makes LLMs extremely flexible data generators, effectively serving as universal data augmenters that can create labeled data for a wide range of problems on demand." At its core, the ability to synthesize novel data is intrinsically tied to the strength of the underlying model. From this perspective, expecting a small model to self-generate synthetic data falls outside the scope of mainstream research in LLM-based synthetic data generation.
We hope this clarification helps address the remaining concerns and better situates our contribution. We are confident in the robustness, novelty, and practical significance of our work as demonstrated.
Thank you for your continued engagement. As the discussion period is coming to a close, we would like to make a final check on whether our responses have sufficiently addressed your concerns.
If there are remaining concerns after our latest responses, we would appreciate it if you could indicate which parts you find unconvincing. We will do our best to provide further clarification on those points.
Conversely, if our rebuttals and additional results have addressed your initial concerns, we would be grateful for your confirmation.
Thank you for your time and consideration, and we look forward to your feedback.
Sincerely,
The Authors of Paper 21241
Dear Reviewer,
We sincerely thank you for your continued engagement and for sharing your concerns. We would like to address these points, first by contextualizing our work within the broader field of synthetic data generation, and then by discussing the specific points on generalization and self-generation.
On the novelty claim
First of all, we respectfully disagree with the perspective that views synthetic data generation via LLMs as merely a form of distillation. Interpreting synthetic data generation as sequence-level distillation from large-scale models is an overly broad characterization. In reality, synthetic data generation is a field that delves into how to design well-structured, systematic pipelines and recipes to carry out the process effectively and reliably.
For instance, prominent studies in synthetic data generation, such as MuMath (You et al., NAACL 2024), UltraMedical (Zhang et al., NeurIPS 2024), and WizardLM (Xu et al., ICLR 2024), focus on designing structured data generation frameworks to augment source data, leveraging large-scale language models as synthetic data generators. The learning from such synthetic data should not be viewed merely as sequence-level distillation; rather, the key contribution lies in how these works systematically propose novel and effective methodologies for generating and utilizing synthetic data.
Therefore, we respectfully disagree with your novelty claim, which appears to rely on an overly broad interpretation that subsumes synthetic data generation under distillation. As a contribution to the literature on LLM-based synthetic data generation, our work offers the following key innovations:
- (Universal and Robust Framework) We propose a universal and robust framework that uniquely addresses a critical real-world challenge: extreme data scarcity in scenarios where no similar data is available for retrieval. By effectively leveraging domain-unrelated general-purpose data, PANGEA generates high-quality synthetic data, enabling domain adaptation for virtually any target domain.
- (Comprehensive Validation Across Tasks and Models) We demonstrate the versatility of PANGEA across a broad spectrum of domains, including mathematics (GSM8K), medicine (MedQA), finance (FinQA), the Korean CSAT, and even entirely novel out-of-distribution tasks (CDSL). Moreover, we validate our approach across a diverse range of model sizes (1B–8B), providers (Meta's Llama, Alibaba's Qwen, and Google's Gemma), and model types (pre-trained and R1-distilled), thereby providing comprehensive evidence for its "generalizability" and "robustness."
- (Competitiveness Against Domain-Specialized Methods) We further show that PANGEA achieves competitive performance compared to domain-specialized methods such as MuMath and UltraMedical, reinforcing the strength of our PANGEA approach in both "generalizability" and "robustness."
On the suggestion of scaling-up the main experiments
With the above context in mind, we would like to address the concerns regarding "generalizability" and the suggestion to revise all the main experiments using 7B-scale results. First of all, the request for additional results at the 7B scale is entirely reasonable. When strong performance is demonstrated at the 1B scale, it is natural to question whether those results extend to larger models. Validating this scalability undoubtedly strengthens the contributions of our paper. We sincerely appreciate this constructive feedback once again.
However, we do not believe it is necessary (or even appropriate) to revise all the main experimental results exclusively using 7B-scale models. Demonstrating effectiveness at both the 1B and 7B scales is what truly substantiates the "generalizability" of the proposed method. Whether one presents 1B as the main setting and 7B as a supplementary result, or vice versa, is ultimately a matter of ordering; what matters is that both scales are properly validated (and we have done so during the rebuttal period). Given the significantly higher computational cost associated with 7B-scale experiments, we believe it is more feasible to present 1B results as the primary experiments, while including 7B results as complementary evidence of "scalability" and "robustness".
This paper introduces PANGEA, a data augmentation framework for fine-tuning LLM under limited target domain data. The key idea is to project large-scale unrelated general data into target domain tasks through prompt-based generation, thereby providing more diverse synthetic data. The method consists of three stages: Synth Profiling, Prompt Writer, and LLM Projection. Experiments on multiple benchmarks show improvements over existing methods in low-resource scenarios.
Strengths and Weaknesses
Strengths:
1. Using general data for domain-specific data augmentation via projection is creative and useful in practice.
2. The proposed three-stage framework is modular and easy to follow.
3. Extensive experiments across multiple domains demonstrate its effectiveness.
Weaknesses:
1. The three-stage framework relies heavily on manually crafted prompt templates, which require human expertise and may limit its scalability or flexibility.
2. The paper lacks an in-depth analysis of failure cases; especially when the gap between general and domain-specific data is significant, is there a risk of semantic drift?
3. This work is primarily supported by empirical results, with limited theoretical explanation or system-level analysis to justify why the projection mechanism improves data diversity and quality.
Questions
1. Is it feasible to design domain-agnostic prompt templates that could improve the scalability of this method?
2. During synthetic data generation, is there any mechanism to control or assess the difficulty distribution of the generated examples? Are there cases where the samples are too trivial or unnecessarily complex, potentially affecting training effectiveness?
3. How to ensure the quality of the automatically generated data? Is there any human verification or automated quality filtering?
4. Can the proposed framework be applied to multilingual or cross-lingual tasks? If so, what challenges might arise in adapting the projection and augmentation process to languages with different linguistic structures?
Limitations
See weakness.
Justification for Final Rating
I appreciate the authors' thorough responses to my comments and their efforts to address the concerns raised. While I am not increasing my score as I believe the current evaluation accurately reflects the borderline accept quality of the work, I acknowledge the improvements made in the rebuttal.
Formatting Issues
No or VERY MINOR ethics concerns only
We sincerely thank the reviewer for the positive feedback on the creativity, practicality, modularity, and effectiveness of our work. We also appreciate the insightful questions, which have prompted us to conduct further analysis that strengthens our paper. We provide our responses below.
W1, Q1 (Feasibility of Domain-Agnostic Prompt Templates). Is it feasible to design domain-agnostic prompt templates that could improve the flexibility of this method?
As noted (cf. Appendices A.2–A.4), different prompts were used for each downstream task. To assess domain-agnosticity, we conducted additional experiments using generalized prompts across all three stages: 1) Domain-agnostic Synth-Profiling: the profile was extracted without domain-specific cues. 2) Domain-agnostic Prompt-Writer: the Synth-Guide Block was written without referring to any domain. 3) Domain-agnostic Projection: data was synthesized without domain mentions. The table below summarizes accuracy (%) at the 10k-sample scale:
| Method | GSM8K | MedQA | FinQA | CDSL |
|---|---|---|---|---|
| Naive | 26.91 | 35.42 | 24.06 | 3.20 |
| Evol-Instruct | 27.36 | 36.29 | 26.68 | 5.22 |
| PANGEA (Domain-Agnostic) | 31.70 | 36.39 | 34.74 | 9.25 |
| PANGEA (Original) | 32.52 | 37.78 | 36.44 | 11.30 |
As shown, although there is a slight performance gap compared to our original domain-specific prompts (Original), the domain-agnostic version of PANGEA (Domain-Agnostic) still outperforms all other baselines by a wide margin. This indicates that the pipeline can operate effectively in a domain-agnostic setting, as the Synth-Profiling stage functions as a domain-adaptive component that guides the rest of the process.
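To give a concrete sense of what a unified template could look like, the sketch below offers a condensed, illustrative paraphrase of a domain-agnostic Synth-Profiling prompt; it is an assumption for illustration, not the exact prompt used in our experiments.

```python
# Condensed, illustrative paraphrase of a domain-agnostic Synth-Profiling prompt.
SYNTH_PROFILING_PROMPT = """\
You are given {num_seed} seed examples from an unspecified target task.

Seed examples:
{seed_examples}

Without naming or assuming any particular domain, write a step-by-step profile
that a writer could follow to create new examples of the same kind. Describe:
1. The structural components each example contains (inputs, constraints, outputs).
2. The typical reasoning depth and number of solution steps.
3. The tone, style, and level of detail of the seed examples.
Do not copy any seed example verbatim and do not reveal answers.
"""
```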
We will add this experiment and the unified template to the supplementary materials in the camera-ready version. Thank you for the constructive feedback.
Q2 (Controlling the Difficulty of Synthetic Data). Is there a way to control or assess the difficulty of generated samples? Can unnecessarily simple or complex examples harm training effectiveness?
This is an excellent point. In our framework, the difficulty distribution of the generated data is primarily controlled and guided by the provided source data. The Synth-Profiling stage does not just extract topics; it analyzes the structural complexity, reasoning depth, and stylistic nuances of the seed examples. For instance, the profile for GSM8K data is given by:
### Choose a Kid-Friendly Scenario
- Restate one everyday clause students can picture (shopping, chores, snacks, simple travel).
- ...
### Select 3-5 Meaningful Numbers
- Prefer numbers in General Question.
- ...
### Define Symbols & Units
- Assign symbols A, B, C, (D, E) with brief label and explicit unit.
- ...
### Write 4-5 Simple Equations
- Use only +, -, ×, ÷, %, or single-step unit conversions.
- ...
### State the Target
- Specify final symbol to solve and its numeric meaning.
- ...
### Outline Step-by-Step Plan
- Provide one bullet per equation, in order (4-5 steps).
- ...
Here, “Outline Step-by-Step Plan” and “Define Symbols & Units” control the difficulty of the synthetic problems. Thus, our framework adaptively generates synthetic data that aligns with the “difficulty distribution” of the original, expert-curated source data. This prevents the generation process from collapsing into overly “trivial” examples or drifting toward “unnecessarily complex” ones, as it remains anchored to the specifications of the original data. Our primary goal is to efficiently augment the existing source data distribution, and this mechanism ensures we remain faithful to that objective.
Q3 (Ensuring the Quality of Synthetic Data). How to ensure the quality of the automatically generated data? Is there any human verification or automated quality filtering?
Thank you for raising the insightful question regarding the quality of synthetic data. We also acknowledge the potential risk of generating low-quality data during the synthetic generation process. To address this concern, we conducted an experiment on the GSM8K dataset (10k examples) using an LLM-based self-evaluation to quantitatively measure the quality of each generated sample on a scale from 0 to 100 (cf. Figure 4). The resulting score distributions per method demonstrated that our approach consistently maintained superior data quality compared to baseline methods.
Notably, even our lower-scoring data retained meaningful mathematical contexts despite issues related to realism or situational ambiguity, as illustrated by the following example. This implies that such data, despite their lower scores, could still meaningfully contribute to the learning process.
Low-quality example: Using identical food thermometers, Sarah notes that her appetizer registers 200 °F, while Tim observes his at 150 °F (…) After both of these adjustments have been completed, how many degrees Fahrenheit apart are the two snacks?
To further validate this hypothesis, we compared models trained exclusively on low-quality data, exclusively on high-quality data, and on the complete dataset. The results clearly showed that even data rated as low-quality under our method were beneficial for model training.
| Training Data | Acc on GSM8K |
|---|---|
| Naive | 26.91 |
| Evol-Instruct | 27.36 |
| 10k Low-quality Samples Only | 28.66 |
| 20k High-quality Samples Only | 35.47 |
| All 30k Samples (Whole) | 37.72 |
The generation of this "low-quality yet helpful" data is attributable to our Synth-Profiling stage, which analyzes crucial elements from source data required for meaningful synthetic data generation. By extracting these essential components from general data, our approach leverages diverse and valuable information while effectively filtering out unnecessary details, thus ensuring the relevance and usefulness of generated synthetic data.
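As a reference point, the sketch below shows how an LLM-based 0–100 self-evaluation and a threshold split into high-/low-quality subsets could be implemented; `judge_llm`, the prompt wording, and the threshold value are placeholders rather than the exact settings used in the paper.

```python
JUDGE_PROMPT = (
    "Rate the following synthetic training example from 0 to 100 for clarity, "
    "correctness, and usefulness for the target task. Reply with a number only.\n\n"
    "{example}"
)

def split_by_quality(examples, judge_llm, threshold=70):
    """Split synthetic data into high-/low-quality subsets via LLM self-evaluation.

    `judge_llm` is a placeholder callable (prompt -> text); the threshold is
    illustrative, not the value used in the paper.
    """
    high, low = [], []
    for ex in examples:
        reply = judge_llm(JUDGE_PROMPT.format(example=ex))
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # treat unparsable replies as low quality
        (high if score >= threshold else low).append((ex, score))
    return high, low
```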
Q4 (Applicability to Multilingual Tasks). Can the proposed framework be applied on multilingual or cross-lingual tasks? If so, what challenges might arise in adapting the projection and augmentation process to languages with different linguistic structures?
Whether general data in a different language can be effectively projected to generate synthetic data in the target language is both an intriguing and highly practical question. The LLM we utilize for synthetic dataset generation (Llama-3.3-70B-Instruct) is inherently multilingual. Thus, even if the provided general data and seed data are composed in different languages, the model is fully capable of effectively generating synthetic data through appropriate projection across languages. To verify this, we conducted a new experiment on a Korean-language task: the Korean College Scholastic Ability Test (CSAT). This serves as a test case for whether a primarily English CoT-Collection database can be used to synthesize downstream data in Korean.
More precisely, we generated 10k synthetic Korean CSAT problems with our framework, leveraging official problems from the 2023 and 2024 CSAT exams as source data. The model was then evaluated on a held-out set of official problems from the 2025 CSAT, which were not included in the seed data. Evaluation was based on the accuracy (%) for Easy, Medium, and Hard questions, as well as the overall Total Score. The comparative results of synthetic data generation methods using the Llama-3.2-1B model were as follows:
| Method | Easy (%) | Medium (%) | Hard (%) | Total Score |
|---|---|---|---|---|
| Pre-trained | 40.00 | 13.64 | 0.00 | 13 |
| Instruction-tuned | 60.00 | 31.82 | 0.00 | 27 |
| Naive | 60.00 | 13.64 | 0.00 | 15 |
| Evol-Instruct | 40.00 | 31.82 | 0.00 | 22 |
| PANGEA (Ours) | 60.00 | 36.36 | 5.26 | 34 |
Notably, PANGEA not only achieves the highest total score but is also the only method that enables the model to solve any “Hard” problems. This demonstrates that the proposed PANGEA framework is both effective and capable of delivering meaningful performance gains, even for languages that are underrepresented in large-scale, domain-unrelated datasets. This result highlights the robustness and generalizability of our approach.
W2 (Risks Arising from the Gap between General and Source Data). This paper lacks analysis of failure cases—particularly when the gap between general and domain-specific data is significant. Is there a risk of semantic drift?
We agree that transferring general data to domain-specific contexts can risk semantic drift. To address this, our experiments include settings where the source data is intentionally selected to be domain-unrelated, creating a substantial gap between general and source data. Despite this, as discussed in Q2 and Q3, we observed no signs of semantic drift or degradation in the difficulty and quality distributions of the generated synthetic data.
This robustness stems from the Synth-Guide Block (τ), obtained via Stages 1–2, which serves as a semantic anchor that guides the projection process toward the structural and topical intent of the original domain. As shown in the example in Figure 3, performing projection without τ leads to failure cases, whereas the use of τ effectively resolves this issue.
W3 (Lack of Theoretical Justification). This work is primarily supported by empirical results, with limited theoretical explanation...
We acknowledge that the theoretical foundations of synthetic data generation are still underexplored, with only a few recent efforts addressing this gap [1]. Our work instead builds a strong empirical basis for the practical effectiveness of LLM-driven synthetic data generation. Through ablation studies, we isolate the impact of each component in our projection mechanism, showing consistent gains in diversity, difficulty alignment, and overall quality. While our focus is empirical, we believe these findings can inform future theoretical work.
References
[1] Gan and Liu, “Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective.”
Thanks for the response, and I find the rebuttal addressed my concerns properly.
Dear Reviewer Puqb,
Thank you so much for your positive feedback. We are delighted to hear that our rebuttal successfully addressed your concerns.
We truly appreciate your thoughtful engagement throughout this process. Given that your concerns have been resolved, we would be very grateful if you would consider reflecting this in your final rating.
Thank you again for your time and valuable review.
Sincerely, The Authors 21241
I appreciate the authors' thorough responses to my comments and their efforts to address the concerns raised. While I am not increasing my score as I believe the current evaluation accurately reflects the borderline accept quality of the work, I acknowledge the improvements made in the rebuttal.
This paper targets the challenging problem of domain adaptation without relying on large-scale target-domain data for fine-tuning. Instead, large-scale general-purpose high-quality data is utilized to boost performance on domain-specific benchmarks like GSM8K, MedQA, and FinQA. The proposed algorithm, PANGEA, is well motivated and obtains superior performance gains on the reported benchmarks.
Strengths and Weaknesses
Strengths:
- The proposed algorithm is an automated framework that relies on a large-scale general-purpose dataset, without additional annotation cost for domain-specific data. This can be easily generalized to different domains.
- The paper provides extensive evaluations. By comparing different backbone models, the proposed PANGEA algorithm obtains consistent performance gains over the baseline on different domain-specific datasets like GSM8K, MedQA, and FinQA. Also, another promising feature is that performance can be consistently improved if more synthetic data is utilized for training.
- The paper is well motivated and the presentation of the paper is clear.
Weaknesses:
- As stated in Algorithm 1, the proposed algorithm requires an initial source dataset of, e.g., 100 examples. I am wondering whether there are any requirements on this source dataset. For example, how about the diversity of the dataset? Also, does the number of samples in the initial dataset affect the final performance of the data synthesis? Similarly, does the setting of the general dataset influence the final domain adaptation performance?
- In Figure 1, the paper presents a promising curve as more data is synthesized. Currently, the largest number of training samples in the experiments is 120K. What if more training data is provided? Will performance saturate given more training data?
- The proposed algorithm PANGEA relies on a projection step to obtain synthetic data, which should have a similar distribution to the source-domain data. Is there any intuition or justification for why this projection can obtain a distribution similar to the source domain?
- Currently, the evaluations are mostly based on "small-size" LLMs with 1B-2B parameters. How about the performance if larger LLMs, like 7B/14B, are utilized in the algorithm?
Questions
Please mainly address the questions in the weaknesses section, in particular: how does the performance change if source data with different distributions and different numbers of samples is evaluated?
Limitations
The paper has provided a discussion of limitations.
Justification for Final Rating
The rebuttal addressed most of my concerns from the last round of review. More specifically, the authors provide sufficient evaluations on the size of the source dataset. Also, extensive evaluations on a larger backbone are provided, with consistent performance gains over the baseline. Thus, I lean toward accepting the paper.
Formatting Issues
No.
We thank the reviewer for their time and effort. However, we must respectfully point out that the content of this review appears to be for a different paper. Our submission, "PANGEA: Projection-Based Augmentation with Non-Relevant General Data for Enhanced Domain Adaptation in LLMs," is a study on synthetic data generation for Large Language Models (LLMs). The review discusses topics entirely unrelated to our work, such as "embodied navigation," "pre-extracted target images," and "CLIP embeddings for navigation." These concepts belong to the fields of robotics and computer vision, not LLM data augmentation. Due to this fundamental mismatch, we are unable to address the specific strengths and weaknesses raised. We believe there has been a mix-up with another submission and kindly request that this be taken into consideration.
Hi,
Sorry for the mistake. I have now updated the review comments. Please check the questions raised in the comments.
We sincerely thank the reviewer for the constructive and insightful feedback. We appreciate the positive evaluation of our work's motivation, extensive evaluations, and clarity. We are grateful for the opportunity to address the questions raised in the "Weaknesses" section, and we believe our responses and new experiments further strengthen our contributions.
W1 (Source data needs & effects). What are the requirements for the initial source dataset (e.g., size, diversity)? How do its characteristics affect the performance of data synthesis and domain adaptation?
Thank you for this important question regarding the sensitivity of PANGEA to the initial seed data. Our method was designed for data-scarce scenarios, and we agree that its performance under varying conditions is a crucial aspect to explore.
Regarding the number of initial seed samples: To address this, we conducted additional experiments during the rebuttal period by varying the number of seed examples from 100 down to 10. These experiments were run on our 10k synthetic data setup.
| Method | # Seed | GSM8K | MedQA | FinQA | CDSL | Avg. |
|---|---|---|---|---|---|---|
| Naive | 100 | 26.91 | 35.42 | 24.06 | 3.20 | 22.40 |
| Evol-Instruct | 100 | 27.36 | 36.29 | 26.68 | 5.22 | 23.89 |
| PANGEA (Ours) | 100 | 32.52 | 37.78 | 36.44 | 11.30 | 29.51 |
| | 80 | 32.91 | 37.34 | 36.37 | 11.01 | 29.41 |
| | 40 | 31.84 | 36.10 | 35.21 | 10.15 | 28.33 |
| | 20 | 31.21 | 35.79 | 34.83 | 8.70 | 27.63 |
| | 10 | 29.28 | 33.27 | 28.31 | 3.77 | 23.66 |
As the results show, PANGEA remains robust and outperforms baselines that use 100 seed samples, even when the number of seed samples is significantly reduced to as few as 20. The performance degrades gracefully as the seed count decreases, which is expected since the quality of the initial Synth-Profiling stage depends on having a representative sample of the target domain.
Regarding the diversity of the source data and the setting of the general dataset: These are excellent points. We hypothesize that higher diversity in the initial seed data would improve the quality of the generated profile, leading to better synthetic data. Similarly, the breadth and diversity of the general dataset likely set the upper bound for the diversity of the generated data. We will add a detailed discussion of these points to our final manuscript.
W2 (Saturation with Respect to the Number of Training Samples). What if more training data is provided? Will performance saturate given more training data?
This is a very insightful question. We anticipate that performance will eventually saturate. This saturation point would likely be determined by the combined diversity of the fixed source data and the large-scale general data. If the number of source data examples is fixed, the diversity of the general data ultimately bounds the variety of the synthetic data that can be generated. Once the diversity from the general data pool is exhausted, performance gains will likely plateau. While conducting experiments on a much larger scale (e.g., >120k) would be valuable, the time constraints of the discussion period make it challenging to complete them now. We are committed to including a thorough discussion of this expected saturation in the final manuscript. Thank you again for this excellent suggestion.
W3 (Projection Stage). Why does the projection stage lead to a distribution aligned with the source domain?
The key to ensuring that the projected data aligns with the source domain lies in Stage 1: Synth-Profiling. In this stage, we distill a structured domain-specific profile from the source data. This profile explicitly captures the core characteristics, formats, and structural patterns of the target domain. The subsequent stages use this profile as a strict guide:
Stage 2 (Prompt-Writer) extracts relevant information from the general data that maps directly to the elements defined in the profile, creating a Synth-Guide Block (τ).
Stage 3 (Projection) then uses this structured block (τ) to generate a new data point, ensuring it conforms to the format and style of a source data example.
In essence, the Synth-Profiling stage acts as a "compass," ensuring that even though we are drawing diversity from an unrelated general dataset, the final output is "projected" back into the specific structure and context of the target domain.
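To make the three stages concrete, the sketch below strings them together in pseudocode; `llm` is a placeholder text-generation callable and the prompt strings are condensed paraphrases for illustration, not the templates from our appendix.

```python
import random

def pangea_generate(source_examples, general_pool, llm, n_synthetic):
    """Illustrative sketch of the PANGEA pipeline: profiling -> prompt-writing -> projection."""
    # Stage 1: Synth-Profiling - distill a structured profile from the seed data.
    profile = llm(
        "Analyze these seed examples and describe, step by step, their structure, "
        f"difficulty, and style:\n{source_examples}"
    )

    synthetic = []
    for general_item in random.sample(general_pool, n_synthetic):
        # Stage 2: Prompt-Writer - map the general item onto the profile,
        # producing a Synth-Guide Block (tau).
        guide_block = llm(
            f"Following this profile:\n{profile}\n\n"
            f"Extract from the text below the elements needed to write a new "
            f"example in that format:\n{general_item}"
        )
        # Stage 3: Projection - generate the final example in the source format and style.
        example = llm(
            f"Using this guide:\n{guide_block}\n\n"
            f"Write one new example matching the format and style of:\n"
            f"{random.choice(source_examples)}"
        )
        synthetic.append(example)
    return synthetic
```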
W4 (Scalability to Larger LLMs). How would the performance change if larger models (e.g., 7B or 14B) were used?
We are sincerely grateful to the reviewer for this excellent suggestion, which significantly strengthens our paper. To test PANGEA's effectiveness on larger models, we conducted a new experiment using the Llama-3.1-8B model on our 10k data setup.
The results are highly encouraging. PANGEA not only delivers substantial improvements over the pre-trained model but also outperforms the heavily resourced Llama-3.1-8B-Instruct model on three of the four benchmarks. This demonstrates that our framework is not only applicable but highly effective for larger, more capable models, confirming its scalability and generalizability.
| Method | GSM8K | MedQA | FinQA | CDSL |
|---|---|---|---|---|
| Pre-trained | 48.75 | 39.31 | 25.63 | 1.61 |
| Instruct-tuned | 85.62 | 64.10 | 64.95 | 2.32 |
| Naive | 81.35 | 55.93 | 48.78 | 15.65 |
| Evol-Instruct | 79.08 | 60.02 | 53.88 | 17.42 |
| PANGEA (Ours) | 86.47 | 64.51 | 65.31 | 25.91 |
Once again, we sincerely thank the reviewer for their valuable time and insightful suggestions. The new experimental results you recommended will be included in the final camera-ready version, and the points discussed during the rebuttal will also be reflected in the final manuscript. We hope our responses have fully addressed your concerns.
Additionally, we were not able to view the score that was given. If possible, we would greatly appreciate it if you could kindly leave a brief comment indicating your initial score for our reference. Lastly, if the main concerns have been resolved, we would be grateful if the score could be updated to reflect that.
Thanks a lot for the comments and replies. The answers addressed most of my concerns well, especially regarding the distribution and size of the source dataset, as well as the generalization to a larger backbone model like Llama-3.1-8B. Thus, I would prefer a positive rating for the paper.
Dear Reviewer Nbaz,
Thank you very much for carefully reviewing our additional experiments (e.g., on scalability and seed-data distribution) and for providing such constructive feedback.
Your thoughtful suggestions have meaningfully strengthened our work. We will incorporate your comments and the new experimental results into the final version of the paper.
We sincerely appreciate your time and effort, and we wish you a wonderful day.
Sincerely, The Authors of Paper 21241
This study reveals the fundamental limitations of using LLMs to generate synthetic data when only a small number of domain-specific examples is available, from the perspectives of both data diversity and quality. Specifically, the paper addresses issues that arise when only a few examples are available. Next, the authors propose a method called PANGEA.
This paper's strength lies in its practical and creative idea of leveraging publicly available, general-purpose data to extend domain-specific data without additional annotation costs. The proposed three-stage PANGEA framework is modularly structured, clearly defined, and well-motivated, and experiments demonstrate consistent performance improvements across multiple domain-specific datasets, including GSM8K, MedQA, FinQA, and different backbones. Additionally, the ability to achieve further improvements by leveraging synthetic data is promising.
It appears that most of the concerns raised by the reviewers have been addressed in the authors' responses. Reviewers who gave negative scores noted concerns about the representativeness of the dataset and backbone used in the experiments, as well as the robustness and generalizability of the methods. However, after carefully reviewing the authors' responses, it was determined that these concerns had been adequately addressed.
Based on this, the AC recommends accepting the paper but strongly encourages the authors to incorporate the reviewers' comments into the final version.