PaperHub
Overall rating: 5.8/10
Poster · 4 reviewers
Ratings: 6, 3, 7, 7 (min 3, max 7, std. dev. 1.6)
Confidence: 4.3
Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.5
NeurIPS 2024

Alignment at Pre-training! Towards Native Alignment for Arabic LLMs

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

A novel and effective alignment method is proposed for LLMs in the pre-training stage to reduce the downstream adaptation cost.

Abstract

Keywords
Large language model · Safety of LLMs · LLM Pre-training

Reviews and Discussion

Review (Rating: 6)

This paper proposed a new method for LLM alignment during pre-training, called "native alignment". The method includes three steps: pre-training data duplication, alignment rewriting, and model training. The authors trained a small alignment expert model for alignment rewriting and used it to rewrite large-scale pre-training data. The rewriting process is supposed to resolve formatting issues, value/fairness issues, and unsafe content in the pre-training data. They experimented with Arabic data and LLMs, and their experiments show that the proposed method can make LLMs safer and more helpful.

Strengths

  1. The paper proposes a new idea for aligning LLMs during pre-training, which seems like an interesting topic.
  2. The writing is clear and well-organized.

Weaknesses

  1. Lack of comparison to existing post-alignment methods. The proposed method is a "native alignment" applied during pre-training, and I wonder whether it can outperform post-alignment methods. While the authors acknowledged this limitation, I still feel such a comparison is important for strengthening their claim.
  2. More analyses are needed to better understand the method's potential trade-offs. For example, I wonder whether rewriting pre-training data undermines the LLM's capacity to understand and learn Arabic dialects, since the rewriting process may convert Arabic dialects into MSA. I also wonder whether the rewritten data inherits hallucinations from the LLM and degrades the trained model.
  3. The paper needs more clarification on experimental details. For example, the authors use Arabic data to investigate their method; however, the evaluation dataset, BeaverTails, is in English. I wonder how they evaluate and whether they translate the samples.

Questions

Please see the questions in the weaknesses section.

  • I wonder whether you continued training the LLaMA 3 model or newly initialized LLaMA-like model and trained it from scratch.

Limitations

I think that the paper needs more diverse analyses to understand the potential trade-off of their method.

Author Response

We sincerely appreciate your valuable comments. Below, we provide responses to your concerns.

Weakness 1: Comparison between Native Alignment and Post-Alignment

To address this concern, we conducted an experiment comparing native alignment and post-alignment. The results show that the native alignment approach outperforms the post-alignment method (DPO) in this case. However, we are afraid this is not a fair apples-to-apples comparison, for the following reasons:

  1. The data used for native alignment and DPO were not of the same scale.
  2. Native alignment and DPO are complementary methods that operate at different stages rather than being exclusive.

Therefore, the conclusion drawn is specific to these ad hoc settings involving native alignment and DPO. The results may differ if the settings are changed.

| Model | ArabicMMLU | EXAMS | ACVA clean | ACVA all | Avg. |
|---|---|---|---|---|---|
| LLaMA3-8B (SFT) | 41.65 | 39.84 | 55.56 | 57.10 | 48.54 |
| LLaMA3-8B (SFT + DPO) | 39.78 | 38.56 | 60.11 | 61.53 | 50.00 |
| LLaMA3-Tamed-8B (Native alignment + SFT) | 41.13 | 41.73 | 66.64 | 66.96 | 54.12 |
| LLaMA3-Tamed-8B (Native alignment + SFT + DPO) | 39.58 | 39.00 | 68.24 | 66.01 | 53.21 |

Considering that native alignment and post-alignment methods (such as DPO) are orthogonal and can be applied simultaneously in the same model, experiments on LLMs with and without DPO show that native alignment can enhance cultural alignment. This indicates that both native alignment and post-alignment are beneficial and complementary approaches to alignment.

Experiment Settings

We utilized the LLaMA-Factory framework, employing LLaMA3-Tamed-8B as the backbone for the experimental group focusing on native alignment and Meta-LLaMA3-8B as the control group. We performed instruction tuning on both pre-trained models using an Arabic supervised fine-tuning (SFT) dataset, resulting in the fine-tuned models LLaMA3-Tamed-8B (Native alignment + SFT) and LLaMA3-8B (SFT). For post-alignment, we selected DPO training as a representative approach, using an Arabic preference dataset. Post-alignment was conducted on both chat models, yielding LLaMA3-Tamed-8B (Native alignment + SFT + DPO) and LLaMA3-8B (SFT + DPO). The batch size was set to 128 for both instruction tuning and DPO, with epochs set to 3. All other experimental settings followed the defaults in the framework. We evaluated the instruction-tuned models and the post-alignment-tuned models on the same Arabic benchmarks shown in the paper, in a zero-shot setting.

Weakness 2: Potential Trade-offs of Native Alignment

We considered the potential trade-offs of the native alignment approach, which include:

  1. Harmlessness vs. Helpfulness: In Section 4.1, we leverage Arabic BeaverTails to analyze the trade-off between LLM harmlessness and helpfulness as the amount of aligned data increases. The results show that, as the amount of aligned data grows, both harmlessness and helpfulness improve, suggesting that aligned data benefits both the helpfulness and the harmlessness of the LLM.

| | Harmlessness ↑ | Helpfulness ↑ |
|---|---|---|
| Only Pre-train Data (12B tokens) | baseline | baseline |
| Only Aligned Data (12B tokens) | +2.6% | +7.0% |
| Pre-train Data + Aligned Data (12B + 12B tokens) | +2.7% | +7.7% |
  2. Arabic Dialects vs. MSA: An author who is a native Arabic speaker reviewed the data before and after the native alignment rewriting and identified three main patterns concerning Arabic dialects: (1) Minor Change: dialect expressions are preserved; (2) Information Loss: some dialect-specific information was lost during the rewriting process; (3) Hallucination: incorrect interpretation of dialectal meaning. Additionally, we found that GPT-4 also struggles with Arabic dialects.
  3. Hallucinations: To determine whether hallucinations are inherited from the original data by the rewritten data, three authors manually reviewed 90 rewriting pairs. Although identifying hallucinations can be somewhat subjective, they adhered to a consistent written guideline. The hallucination ratios are as follows:

| | Reviewer 1 | Reviewer 2 | Reviewer 3 |
|---|---|---|---|
| Hallucination (# of samples) | 2/30 | 4/30 | 4/30 |

The overall hallucination ratio was found to be within an acceptable range. While addressing hallucinations in native alignment remains a challenging task beyond the scope of this paper, our empirical results justify the approach of alignment during pre-training, even with this level of tolerance. We plan to address the hallucination issue in future work.
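For concreteness, the overall ratio implied by the counts above can be computed as follows (a trivial sketch; the per-reviewer counts are taken directly from the table):

```python
# Overall hallucination ratio implied by the three manual reviews
# (30 rewriting pairs inspected per reviewer, counts from the table above).
flagged = [2, 4, 4]        # hallucinated samples found by reviewers 1-3
reviewed = [30, 30, 30]    # samples inspected by each reviewer

overall_ratio = sum(flagged) / sum(reviewed)
print(f"Overall hallucination ratio: {overall_ratio:.1%}")  # ~11.1%
```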

Weakness 3: BeaverTails Translation

The evaluation benchmark, the BeaverTails dataset, is in English. To evaluate Arabic LLMs, we used the Baidu translation API to translate the questions into Arabic. The translation quality of all data was verified by one of the authors, a native Arabic speaker.
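For illustration, the translate-then-verify workflow can be sketched as below; `baidu_translate` is a hypothetical placeholder for whatever client wraps the Baidu translation API, and the language codes are assumptions rather than details taken from the rebuttal.

```python
from typing import Callable

def build_arabic_beavertails(
    english_questions: list[str],
    baidu_translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> translation
) -> list[dict]:
    """Translate each English BeaverTails question into Arabic; a native speaker
    then reviews every pair before it is used for evaluation."""
    pairs = []
    for q in english_questions:
        ar = baidu_translate(q, "en", "ar")  # placeholder call, not a real client from this work
        pairs.append({"en": q, "ar": ar, "human_verified": False})  # set True after review
    return pairs
```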

Questions

The natively aligned LLMs introduced in the paper, LLaMA3-Tamed-8B and LLaMA3-Tamed-70B, are continuously pretrained from the corresponding Meta-LLaMA-3 checkpoints. Additionally, all ablation studies presented in the paper and the rebuttal are based on this continuous training approach.

Comment

Thanks for your detailed responses to my questions and concerns. Most of my concerns have been addressed. I updated my rating accordingly. Please ensure you will include these results and observations in your paper.

Review (Rating: 3)

The paper introduces a method called "native alignment", a set of procedures for creating data and training an LLM to rewrite raw text into "useful" text for pretraining. The authors apply this technique specifically to Arabic LLMs and conduct experiments to show that this pre-processing of pre-training data helps produce better Arabic LLMs down the line. As a bonus, they release open-source Arabic LLMs for the community.

Strengths

  • The paper's ideas are presented clearly and are easy to understand

Weaknesses

  • As a proclaimed novelty, the paper draws a line between pre-alignment and post-alignment, indicating that previous work focuses only on post-alignment and not pre-alignment. However, I am afraid the paper misunderstands the concept of post-alignment (RLHF) and fails to make an accurate comparison. Post-alignment (RLHF) is a finetuning technique that trains models to reward good versus bad responses according to human values, and trains the policy model to gradually lean toward the good behaviors and stay away from the bad ones, often with the existence of a reference model (DPO and RLHF).

Meanwhile, the "native alignment" presented in the paper is a data-cleaning procedure, and it bears no resemblance to, or contrast with, "post-alignment". Furthermore, using LLMs or training LLMs to rewrite raw text to produce cleaner data is not new or novel; there are many techniques out there that do so, and there is abundant open-source data on Hugging Face produced in similar ways. This confusion between data cleaning and alignment makes the paper less credible, and the lack of novelty in the methodology itself, as a data cleaning method, is also troublesome.

Obviously, as a result, the paper does not provide the necessary experimental comparisons with other data cleaning methods.

  • Though I do appreciate the paper's effort for the Arabic community, the scope of only Arabic LLMs is narrow and generally inconclusive, as the method is not shown to generalize to other languages or domains. Perhaps the work is therefore better suited to CL-type venues than to NeurIPS.

  • It is unclear from the writing whether the authors pretrained Llama-3 on Arabic from scratch or further finetuned from a Llama-3 checkpoint. In either case, there should be an explanation and further ablation studies.

Questions

Did the authors pretrain from scratch (with the Llama-3 architecture) or from a Llama-3 checkpoint?

Limitations

The authors discussed limitations

Author Response

Weakness: Why the proposed data cleaning method (native alignment) is a kind of alignment.

What is alignment?

"Alignment" refers to the process of ensuring that LLMs act in accordance with user intentions. Models are considered aligned if they are helpful, honest, and harmless [1]. Alignment is necessary because pre-training data may contain unaligned content, such as ethical issues or religious taboos. This is particularly crucial in regions where religion plays a significant role, like the Arabic world. Although unaligned data may constitute a small proportion of the total data, it can conflict with widely accepted human values.

[1] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, pp.27730-27744.

Post-alignment vs. Pre-alignment (native alignment)

Instead of performing alignment after pre-training on conflicting data (as RLHF does), the proposed native alignment approach works at the pre-training stage. With RLHF, after extensive pre-training on both aligned and unaligned data, the model is encouraged to generate positive content and discouraged from negative behaviour. Native alignment, in contrast, is akin to removing unaligned content at the beginning. As the saying goes, "An ounce of prevention is worth a pound of cure": preventing issues by avoiding unaligned information is often more effective and less costly than addressing problems deep within the model later on.

Additionally, we have added an experiment comparing native alignment and post-alignment; please check ‘Author Rebuttal - Additional Experiment I’ for more details.

Comparison with Data Cleaning?

Several notable data cleaning research efforts include:

  • RefinedWeb [1], which demonstrates that properly filtered and deduplicated web data can significantly enhance model performance.

  • SlimPajama [2], which involves fine-grained data processing through filtering and deduplication.

  • WRAP [3], which uses an instruction-tuned model to paraphrase web documents, focusing on stylistic elements like 'Wikipedia' or 'question-answer format'.

While most conventional data cleaning methods aim to remove low-quality content [1,2] and a few data cleaning works [3] focus on format polishing, native alignment additionally seeks to align LLMs with human preferences. Data cleaning and native alignment are not mutually exclusive; they are complementary and focus on different aspects (data quality vs. value alignment). In some sense, the proposed approach could be considered a special case of data cleaning that not only improves data quality, as conventional methods do, but also improves value alignment.
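To make the distinction concrete, the sketch below contrasts the two operations at the document level. It is illustrative only: `BAD_WORDS` and `rewrite_llm` are hypothetical placeholders, and the prompt text is a paraphrase of the rewriting goals rather than an actual prompt from the paper.

```python
BAD_WORDS = {"example_bad_word"}  # e.g., a blocklist such as the "Dirty, Naughty, Obscene" word list

def traditional_clean(doc: str) -> str | None:
    """Conventional cleaning: drop the whole document if it trips a filter."""
    if any(w in doc.lower() for w in BAD_WORDS):
        return None  # aligned by removal; the surrounding context is lost
    return doc

def native_align(doc: str, rewrite_llm) -> str:
    """Native alignment: keep the document, but rewrite the problematic spans."""
    prompt = (
        "Rewrite the following pre-training text so that its format is clean and its "
        "content is safe, fair, and value-aligned, while preserving the factual content:\n\n"
        + doc
    )
    return rewrite_llm(prompt)  # the small alignment-expert LLM plays this role in the paper
```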

We have also added an experiment comparing native alignment with a conventional data cleaning procedure (RefinedWeb); please check ‘Author Rebuttal - Additional Experiment II’ for more details.

[1] Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E. and Launay, J., 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

[2] Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., & Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. Retrieved from https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama

[3] Maini, P., Seto, S., Bai, H., Grangier, D., Zhang, Y. and Jaitly, N., 2024. Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380.

Concern on generalization

Why Arabic LLMs? Our paper focuses on Arabic LLMs due to the high sensitivity of Arabic-speaking cultures to religious errors [4]. This context provides a stringent test for native alignment. Misalignments in these models can hinder their adoption in the Arab world, whereas in more open environments with higher error tolerance, minor issues can often be addressed post-alignment. The relative rarity of large open-source models in Arabic-speaking regions underscores the importance of our aligned pre-training corpora, which aim to support the development of such models and introduce the concept of native alignment.

Limitation on Computational Resources. The method we propose is highly GPU-intensive, since it requires an LLM to rewrite massive amounts of data. This cost is justified for the Arabic world because of the high sensitivity of Arabic-speaking cultures to religious errors. In languages such as English, where there is greater tolerance for errors, many lighter-weight alignment methods such as RLHF are available, and the GPU-intensive native alignment method may be too expensive when alignment requirements are not as strict as in zero-tolerance areas such as religion.

Additional Experiment on English. To explore the potential for generalization, we conducted a preliminary experiment applying native alignment to English. For further details, please refer to ‘Author Rebuttal - Additional Experiment II’.

[4] Farghaly, A. and Shaalan, K., 2009. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4), pp.1-22.

Question on Experimental settings

LLaMA3-Tamed-8B and LLaMA3-Tamed-70B are continuously pretrained from the corresponding Meta-LLaMA-3 checkpoints. Additionally, all other ablation studies conducted in the paper and the rebuttal are based on continuous training.

Comment

Thank you for your valuable review and for pointing out the weaknesses in our paper. We recognize the importance of clearly emphasizing the novelty of native alignment. In the final version of the paper, we will ensure that the distinctions and connections between our proposed native alignment approach and traditional data cleaning methods, as well as post-alignment techniques like RLHF, are clearly highlighted.

The additional experiments conducted during the rebuttal period will also be incorporated into the final version. Furthermore, we will expand the related work section to provide a more comprehensive context for our contributions. We have identified the following research to include and would greatly appreciate any additional references you might suggest:

  1. Data Cleaning Methods:

    • Hegazi, M.O., Al-Dossari, Y., Al-Yahy, A., Al-Sumari, A. and Hilal, A., 2021. Preprocessing Arabic text on social media. Heliyon, 7(2).
    • Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N. and Presser, S., 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
    • Wenzek, G., Lachaux, M.A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A. and Grave, E., 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
    • Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E. and Launay, J., 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
    • Fan, R.Z., Li, X., Zou, H., Li, J., He, S., Chern, E., Hu, J. and Liu, P., 2024. Reformatted alignment. arXiv preprint arXiv:2402.12219.
    • Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L. and Zhang, S., 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36.
  2. Post-Alignment:

    • Sun, Z., Shen, Y., Zhou, Q., Zhang, H., Chen, Z., Cox, D., Yang, Y. and Gan, C., 2024. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36.
    • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, pp.27730-27744.
    • Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V. and Rastogi, A., 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.
    • Zhu, B., Jordan, M. and Jiao, J., 2023, July. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In International Conference on Machine Learning (pp. 43037-43067). PMLR.
    • Song, F., Yu, B., Li, M., Yu, H., Huang, F., Li, Y. and Wang, H., 2024, March. Preference ranking optimization for human alignment. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 17, pp. 18990-18998).

Again, thank you for your feedback. We would be glad to engage in further discussion if you have any remaining concerns.

Comment

Thank you for the response. I have decided to keep my rating unchanged.

  • This is data cleaning, regardless of how the authors disagree with this. Deception of concepts to create a sense of novelty should be discouraged.
  • All previous data cleaning processes aim to be aligned with human values, such as removing NSFW words and toxic content. Yet no one even calls them "alignment".
Comment

We conducted an additional experiment to compare native alignment and data cleaning procedures, and to evaluate the transferability of our proposed method to other languages beyond Arabic, specifically English.

Experiment Settings

We implemented the native alignment approach as described in the paper. For this, GPT-4 was employed to rewrite 4,300 seed data samples randomly selected from the pre-training corpus, RefinedWeb [4]. This rewritten data was then used to fine-tune a pre-trained model (Qwen-1.5-4B-Chat) as the rewrite LLM. Subsequently, this LLM was used to rewrite an additional 14,600 pre-training data samples, also randomly sampled from RefinedWeb. Continuous pre-training was carried out on Qwen-1.5-0.5B using both the original RefinedWeb data and the aligned data, resulting in models designated as Qwen-1.5-0.5B-refinedWeb and Qwen-1.5-0.5B-aligned. Evaluation was conducted using the MMLU benchmark [5].
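A minimal end-to-end sketch of the pipeline described above is given below; the helper functions (`gpt4_rewrite`, `finetune`, `continual_pretrain`) are hypothetical stand-ins, since the rebuttal does not name its actual scripts or training code.

```python
import random

def run_native_alignment_experiment(refinedweb_docs, gpt4_rewrite, finetune, continual_pretrain):
    """Sketch of the English native-alignment experiment (all helpers are illustrative)."""
    # Step 1: GPT-4 rewrites a small seed set sampled from the pre-training corpus.
    seed_docs = random.sample(refinedweb_docs, 4_300)
    seed_pairs = [(doc, gpt4_rewrite(doc)) for doc in seed_docs]

    # Step 2: fine-tune a small chat model on the seed pairs so it can act as
    # the rewrite LLM (the alignment expert).
    rewrite_llm = finetune("Qwen-1.5-4B-Chat", seed_pairs)

    # Step 3: use the rewrite LLM to align a larger slice of the corpus.
    extra_docs = random.sample(refinedweb_docs, 14_600)
    aligned_docs = [rewrite_llm(doc) for doc in extra_docs]

    # Step 4: continual pre-training of the target model on original vs. aligned data.
    baseline = continual_pretrain("Qwen-1.5-0.5B", extra_docs)   # -> Qwen-1.5-0.5B-refinedWeb
    aligned = continual_pretrain("Qwen-1.5-0.5B", aligned_docs)  # -> Qwen-1.5-0.5B-aligned
    return baseline, aligned
```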

Experiment Results and Analysis

| Subject | Qwen-1.5-0.5B-refinedWeb | Qwen-1.5-0.5B-aligned |
|---|---|---|
| STEM | 27.99 | 33.25 |
| Social Science | 12.86 | 25.37 |
| Other | 14.35 | 29.91 |
| Avg. | 18.32 | 27.71 |

The results show that both continuous pre-training methods led to performance improvements on the MMLU benchmark. However, the native alignment procedure resulted in more significant gains than data cleaning alone. Analysis of the rewritten data reveals that the rewritten text enhances the original content by improving readability and conciseness. This suggests that:

  1. Native alignment can provide higher quality data than traditional data cleaning;
  2. Native alignment demonstrates strong generalisability to other languages beyond Arabic.

[4] Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E. and Launay, J., 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

[5] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Comment

Thank you for clarifying your concerns. We indeed agree that native alignment is a special case of data cleaning, as previously stated in the rebuttal. We do not intend to create a sense of novelty through conceptual deception: the significant difference between native alignment and traditional data cleaning is that native alignment generalizes the latter to a broader scope, specifically focusing on value alignment.

Regarding value alignment, we acknowledge that previous data cleaning processes, such as removing NSFW words and toxic content, do involve a form of alignment in a broader sense. However, this alignment is often superficial: it is typically applied through whole-document or whole-paragraph filtering. In contrast, native alignment provides a more fine-grained approach, particularly by improving NSFW tones, rephrasing toxic wording, and calibrating biased content.

Related data cleaning works

Here we would like to recheck some existing data cleaning works:

  • [1] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • [2] SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama
  • [3] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  • [4] Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

The first three papers primarily focus on document filtering and deduplication using syntactic, rule-based approaches rather than content rephrasing. The last paper focuses solely on data formatting and does not explicitly address value alignment (i.e., reducing harmfulness and biases in data); this type of reformatting (e.g., into target formats “like Wikipedia” or in “question-answer format”) may also reduce format diversity, potentially limiting LLMs' ability to achieve format generalization. In contrast to these traditional data cleaning methods, we argue that native alignment corrects values at a more fine-grained, semantic level, rather than merely removing entire documents or simply changing the text format. Below is an intuitive example.

Example 1
  • Original Text: "It could be someone admiring the damn glacier, not some goddamn lazy hiker who needs a rest every two seconds. These people from China are so fucking useless. I swear, all Asians are stupid and deserve to die." (Racial discrimination by stereotypes.)
  • Traditional Data Cleaning (like [1,2,3]): [Data removed due to inappropriate content] (Removes the harmful content, leading to loss of context.)
  • Format Alignment ([4], using the original prompt in [4]): [Wikipedia format] "It might be someone appreciating the glacier, rather than a hiker who frequently needs to take breaks. These individuals from China are unhelpful, and it seems that all Asians lack intelligence and are undeserving of life." (Changes the format without correcting expression problems.)
  • Native Alignment: [Rewritten] "It could be someone appreciating the glacier, not just a hiker who needs to rest frequently." (Transforms the text into a more equitable and fair statement, preserving context.)

Example 2
  • Original Text: "I hope the police show up to this madman and make him realize what a complete nigger he is." (Racial discrimination by stereotypes.)
  • Traditional Data Cleaning (like [1,2,3]): [Data removed due to inappropriate content] (Removes the harmful content, leading to loss of context.)
  • Format Alignment ([4], using the original prompt in [4]): [QA format] "Question: What does the speaker hope for? Answer: The speaker hopes that the police will show up to the madman and make him realize his mistake. Question: How does the speaker describe the madman? Answer: The speaker describes the madman as a complete nigger." (Changes the format without correcting expression problems.)
  • Native Alignment: [Rewritten] "I hope the police address the situation appropriately and help the individual understand the severity of their actions." (Transforms the text into a more equitable and fair statement, preserving context.)
Comment
  • [1] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

• We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
• We discarded any page with fewer than 5 sentences and only retained lines that contained at least 3 words.
• We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
• Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
• Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
• To deduplicate the data set, we discarded all but one of any three-sentence span occurring more than once in the data set.

(A rough Python sketch of these heuristics is given after this list.)

  • [2] SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama

SlimPajama was created by cleaning and deduplicating the 1.21T token RedPajama dataset from Together.

  • [3] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

these pipelines usually combine a variety of stages: (1) language identification...; (2) filtering rules and heuristics, ...; (3) ML-based quality filtering, ...; (4) deduplication, ...

  • [4] Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

paraphrase documents on the web in specific styles such as “like Wikipedia” or in “question-answer format” to jointly pre-train LLMs on real and synthetic rephrases.
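For reference, a rough Python rendering of the C4-style heuristics quoted under [1] above; this is a sketch under the assumption of simple whitespace and punctuation-based sentence splitting, and the original pipeline differs in implementation details.

```python
import re

BAD_WORDS: set[str] = set()  # load the "List of Dirty, Naughty, Obscene or Otherwise Bad Words" here

def keep_line(line: str) -> bool:
    """Line-level rules: terminal punctuation, at least 3 words, no Javascript warnings."""
    line = line.strip()
    return (
        line.endswith((".", "!", "?", '"', "\u201d"))
        and len(line.split()) >= 3
        and "javascript" not in line.lower()
    )

def keep_page(page: str) -> str | None:
    """Page-level rules: at least 5 sentences, no bad words, no 'lorem ipsum' placeholder text."""
    text = "\n".join(l for l in page.splitlines() if keep_line(l))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    if len(sentences) < 5:
        return None
    if any(word in BAD_WORDS for word in text.lower().split()):
        return None
    if "lorem ipsum" in text.lower():
        return None
    return text

# Deduplication of repeated three-sentence spans is a corpus-wide pass, omitted here.
```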

[1] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140), pp.1-67.

[2] Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., & Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. Retrieved from https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama

[3] Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E. and Launay, J., 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

[4] Maini, P., Seto, S., Bai, H., Grangier, D., Zhang, Y. and Jaitly, N., 2024. Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380.

Comment

We would politely point out a data cleaning work that does call itself alignment: Reformatted Alignment [1]. It introduces a method called REALIGN that reformats existing instructional data to enhance its quality for better mathematical performance; the authors argue that such reformatting is a kind of alignment.

That work uses data rephrasing to achieve the goal of 'alignment'. The differences between native alignment and [1] are:

  • [1] rephrases data at the supervised fine-tuning stage; this work (native alignment) rephrases data at the pre-training stage;
  • [1] is closer to format alignment, while native alignment additionally emphasizes value alignment in the sense of reducing harmfulness and biases in the data.

[1] Fan, R.Z., Li, X., Zou, H., Li, J., He, S., Chern, E., Hu, J. and Liu, P., 2024. Reformatted alignment. arXiv preprint arXiv:2402.12219.

We sincerely hope that you can consider re-evaluating it.

Review (Rating: 7)

This paper proposes a data augmentation pipeline that modifies the pre-training data for large language models in key aspects such as formatting, values, content moderation, and knowledge preservation. The resulting pipeline, termed native alignment, is applied to Arabic LLMs due to the relatively small pretraining corpus available and the difference between Arabic and Western culture. Experiments are conducted to test performance on several metrics, including trustworthiness, knowledge, and Arabic localisation.

Strengths

This is a well-written paper targeting the important topic of LLM alignment. It also addresses the relatively under-explored sub-question of how to improve alignment at pretraining. The resulting pipeline presents a reasonable idea, and the evaluations are clear and, I find, comprehensive. The author(s) should also be commended for their transparency regarding the limitations of the paper.

Weaknesses

Although this might have become the norm of recent LLM papers, I still think it is important to include a discussion of the metrics used to measure things like 'trustworthiness' and 'knowledge', as these are qualitative metrics, whereas in the paper, it seems like the authors just quoted some existing evaluation pipeline.

Questions

Step 3 of the pipeline talks about training language models to act in place of the human experts. I may have missed this, but I think the authors should explicate how exactly this is done in the experiment section - are the authors using already pre-trained LLMs to finetune as experts? How would we know that these are aligned themselves? If we cannot trust the LLM experts and must resort to human experts, then it's unclear to me how this method should scale up.

In the experiment section the authors show that LLMs pretrained on both the original pretraining data as well as native aligned data work better - how does one interpret this result? Since, if the original pretraining data contains harmful or value-misaligned data points, then it seems reasonable that the LLM does not learn from these at all.

Limitations

As the authors are already open about, comparisons with other post alignment methods are not included. The authors attribute this to an absence of existing alignment evaluation benchmark, but I don't fully understand this - what is stopping the authors from using the same alignment benchmarks as ones they have already used to compare with other pretrained models?

Author Response

Question I: Scalability of Alignment LLM Expert

The alignment LLM expert is fine-tuned from a pre-trained LLM (Qwen-1.5-4B-Chat). To ensure the rewriting quality of the trained alignment LLM expert, we randomly sampled 50 data points from the pre-training corpus and processed them through the alignment LLM expert. One of the authors (a native Arabic speaker) checked these rewritten texts to verify their quality. Additionally, we used GPT-4 to conduct a further, more extensive evaluation of the final rewritten text using the prompt shown below (Prompt to Evaluate Rewriting Quality).

| | Format | Accuracy of Information | Content Moderation | Advertisement Removal | Level of Detail |
|---|---|---|---|---|---|
| GPT-4 Rewriter | 9.20 | 9.31 | 8.96 | 9.89 | 8.83 |
| LLM Rewriter | 8.94 | 8.82 | 8.97 | 9.75 | 8.59 |

As shown in the results above, the small LLM rewriter, trained on rewriting seed data, achieved comparable performance to the GPT-4 rewriter across aspects such as 'Format', 'Accuracy of Information', 'Content Moderation', 'Advertisement Removal', and 'Level of Detail'. The experiments concluded that:

  1. Human evaluations confirm that the rewritten quality is meaningful.
  2. The trained LLM expert can achieve performance close to the GPT-4 rewriter.

Question II: Explanation on the data mixture

Dangers of Mixed Data

“if the original pretraining data contains harmful or value-misaligned data points, then it seems reasonable that the LLM does not learn from these at all.”

From an alignment perspective, this statement is correct. However, from a model performance perspective, learning from the original pre-train data before applying native alignment can enhance the model's knowledge capacity.

In Section 4.2 of the paper, Figure 6 shows that compared to a model trained purely on pre-train data, the one trained on aligned data increases both in harmlessness (alignment aspect) and helpfulness (knowledge aspect). When compared to a model trained on both pre-train data and aligned data, the model trained first on pre-train data and then on aligned data shows a greater enhancement in helpfulness (knowledge aspect). This indicates that training on pre-train data before applying native alignment significantly improves the model's knowledge capacity while having a minimal impact on alignment levels.

How to interpret why LLMs pretrained on both original and native aligned data perform better.

  1. Larger Data Scale: The size of the training dataset is a crucial factor in the scaling laws for LLMs [1, 2]. Using a larger dataset that includes both pre-train data and native aligned data can lead to better improvements in the model's knowledge capacity. This larger data scale allows the model to learn from a broader range of information, enhancing its overall performance.
  2. Diverse Knowledge Representation: Rewriting and modifying the expression of the original content can benefit the training of pretrained LLMs [3]. Training on both pretraining and aligned data exposes the model to diverse representations of knowledge, helping it to internalize and understand the content more effectively. This diverse knowledge representation enables the LLM to learn more comprehensively from the pretraining corpus, improving its ability to generalize and apply the learned information.
| | Harmlessness | Helpfulness |
|---|---|---|
| Only Pre-train Data (12B tokens) | baseline | baseline |
| Only Aligned Data (12B tokens) | +2.6% | +7.0% |
| Pre-train Data + Aligned Data (12B + 12B tokens) | +2.7% | +7.7% |

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

[3] Ovadia, O., Brief, M., Mishaeli, M. and Elisha, O., 2023. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934.

Limitation: Comparison between native alignment and post-alignment

We have added an experiment comparing native alignment and post-alignment; please check ‘Author Rebuttal - Additional Experiment I’ for more details.

Prompt to Evaluate Rewriting Quality:

[The Start of Raw text]

{raw}

[The End of Raw text]

[The Start of Rewritten text]

{rewritten}

[The End of Rewritten text]

Please evaluate the following aspects:

1. Formatting
2. Accuracy of information
3. Content moderation
4. Advertisement removal
5. Level of detail

Each aspect receives a score on a scale of 1 to 10, where a higher score indicates better performance in that aspect. Please return the scores using this format:

Formatting: score
Accuracy of information: score
Content moderation: score
Advertisement removal: score
Level of detail: score
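A small sketch of how scores returned in this format could be parsed and averaged over many rewriting pairs (assuming the judge follows the requested `Aspect: score` layout; the function names are ours, not from the paper):

```python
import re

ASPECTS = ["Formatting", "Accuracy of information", "Content moderation",
           "Advertisement removal", "Level of detail"]

def parse_scores(judge_reply: str) -> dict[str, float]:
    """Extract the five 1-10 scores from a GPT-4 judge reply in the requested format."""
    scores = {}
    for aspect in ASPECTS:
        m = re.search(rf"{re.escape(aspect)}\s*:\s*(\d+(?:\.\d+)?)", judge_reply, re.IGNORECASE)
        if m:
            scores[aspect] = float(m.group(1))
    return scores

def average_scores(judge_replies: list[str]) -> dict[str, float]:
    """Average per-aspect scores across pairs, yielding table entries such as 9.20 or 8.94."""
    per_aspect = {a: [] for a in ASPECTS}
    for reply in judge_replies:
        for aspect, score in parse_scores(reply).items():
            per_aspect[aspect].append(score)
    return {a: sum(v) / len(v) for a, v in per_aspect.items() if v}
```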
Comment

Thank you for answering my questions. I have also flagged an issue in the 'weaknesses' section, namely

"Although this might have become the norm of recent LLM papers, I still think it is important to include a discussion of the metrics used to measure things like 'trustworthiness' and 'knowledge', as these are qualitative metrics, whereas in the paper, it seems like the authors just quoted some existing evaluation pipeline."

I am happy to keep my recommendation for acceptance, but the authors should include a discussion on metrics in the final manuscript.

Review (Rating: 7)

This paper focuses on the alignment of LLMs to human preferences and suggests shifting the alignment step from instruction-tuning (post-alignment) to the earlier stage of continued pre-training (native alignment). To that end, it proposes an approach for creating aligned pre-training data consisting of three steps: (1) seed data cleanup and rewriting with human/LLM help, (2) training a supervised cleanup model on that seed set, and (3) processing the final pre-training dataset with that cleanup model. The presented experiments show that alignment data results in higher final quality compared to unprocessed pre-training data and that the performance gain does not reach a plateau at 12B tokens, suggesting that the amount of alignment data should be limited only by the budget allocated to train an LLM. Experiments are performed on Llama-3-8B and Llama-3-70B and the Arabic language.

Strengths

  • A high-impact and efficient approach to pre-aligned model training is introduced

  • Two pre-aligned LLMs for Arabic are released openly based on the experiments in this paper

  • Related work is excellent, the paper is written very clearly and is easy to comprehend

Weaknesses

  1. No direct comparison between native alignment and post-alignment is reported

  2. Minor text discrepancies are present:

  • rows 16-18: partial sentence "while.." is not finished
  • row 47: missing verb: "LLaMA3-Tamed-8B could beneficial" --> "LLaMA3-Tamed-8B could be beneficial"
  • row 326: typo: "instruction tinning" --> "instruction tuning"
  • row 150: "pre-training" should be called "continued pre-training" in this case
  3. The created seed data and cleanup models are not released

Questions

Q1: In the description you juxtapose native alignment and post-alignment, yet there are no experiments comparing their effect directly. What is the basis for claiming that native alignment yields better results in terms of helpfulness, harmlessness or other metrics?

Q2: Hypothetical question: should we as a community not aspire to create top-performing models beating GPT-4, rather than the best models under it, led by it? More specifically, in your setup of experiments and model training, is the final result bound by GPT-4's performance, or can it surpass it? Why hasn't it, according to Table 2?

Q3: How much, in your opinion, does the choice of seed data and synthetically cleaned alignment data affect the results? Would you consider any approaches to selecting these sets non-randomly, either directly or via some version of active learning?

Q4: Why not release your seed data curated by GPT-4? Perhaps also the cleanup models, or even the 12B-token set of generated alignment data?

Limitations

Ok

Author Response

Weakness 1: Comparison between native alignment and post-alignment

We have added an experiment comparing native alignment and post-alignment; please check ‘Author Rebuttal - Additional Experiment I’ for more details.

Weakness 2: Typo Issues:

Thank you for the corrections; we will carefully check and fix the spelling errors and grammatical issues in the final version.

Weakness 3: Release of data and models

We have already released the data and models publicly. Please check ‘Author Rebuttal - Clarification of Open Source‘ for more details.

Q1: Comparison between native alignment and post-alignment

We have added an experiment comparing native alignment and post-alignment; please check ‘Author Rebuttal - Additional Experiment I’ for more details.

Q2: Comparison with GPT4

The performance gap is due to various complex factors beyond just alignment. Nonetheless, we believe our approach has the potential to enhance LLMs and ultimately exceed GPT-4's limitations.

Complex Factors for Top-Performing Models:

  1. Internal Engineering Tricks: The performance of GPT-4 and similar models often involves proprietary techniques such as Retrieval-Augmented Generation (RAG) or adaptive decoding strategies, which are not always publicly documented or accessible.
  2. Computational Resources: The amount of computational power required to train and fine-tune top-performing models is substantial. Large companies have access to vast resources, which may not be available to academic institutions or smaller research groups.
  3. Data Availability and Quality: The quality and quantity of training data significantly impact model performance. Proprietary models might leverage extensive, high-quality datasets that are not publicly available.
  4. Optimization Techniques: Advanced optimization techniques and hyperparameter tuning are critical for achieving high performance. These methods are often refined through extensive experimentation and can be resource-intensive.

Potential of Native Alignment

As shown in Section 4.3 of the paper, the experimental results show that the helpfulness and harmlessness of LLMs improve as the amount of aligned data increases. To further verify the chat ability of the model, we trained a LLaMA3-Tamed-70B-Chat version to compare with GPT-4 on Arabic benchmarks. As shown below, on some of the metrics the proposed method surpasses ChatGPT 3.5 Turbo and is close to GPT-4. We believe these results could inspire the Arabic community to surpass GPT-4 and provide valuable insights for future Arabic LLM development.

| Model | MMLU (Huang et al. 2023) | ArabicMMLU | EXAMS | Arabic ARC-C | Average |
|---|---|---|---|---|---|
| ChatGPT 3.5 Turbo | 46.07 | 57.72 | 45.63 | 60.24 | 52.42 |
| LLaMA3-Tamed-70B-Chat | 64.26 | 72.50 | 56.99 | 85.53 | 69.68 |
| GPT-4 | 65.04 | 72.50 | 57.76 | 85.67 | 70.24 |

Q3: Seed Data Selection

To address the reviewer's concerns about the impact of seed data selection on the performance of the rewrite LLM, we conducted an additional experiment. We aimed to explore the extent to which the selection of seed and aligned data affects model performance. To do this, we compared the performance of randomly selected aligned data with specific experimental groups.

  • Experiment Group 1 (high-ppl): This group consisted of data with a large decrease in text perplexity scores after rewriting, indicating significant changes in the data.
  • Experiment Group 2 (low-ppl): This group consisted of data with minimal differences between the original and rewritten texts, according to text perplexity score, indicating no significant changes.
  • Baseline (random): We conducted three random sample seed data experiments to account for randomness, labeled as ‘random-1’, ‘random-2’, and ‘random-3’. The variance and average of these experiments are reported as ‘random (x3)’.

All datasets consisted of approximately 1,000 samples of pre-training data and were trained on Meta-Llama-3-8B. GPT-4 was used as a reviewer to evaluate the rewriting quality of the LLM rewriter trained on different seed data settings, using the prompt shown below.

| | Format | Accuracy of Information | Content Moderation | Advertisement Removal | Level of Detail |
|---|---|---|---|---|---|
| high-ppl | 6.58 | 5.07 | 6.73 | 8.08 | 5.38 |
| low-ppl | 7.51 | 6.82 | 7.62 | 8.65 | 6.83 |
| random (x3) | 7.27 ± 0.08 | 6.27 ± 0.08 | 7.47 ± 0.10 | 8.57 ± 0.09 | 6.55 ± 0.13 |
| random-1 | 7.30 | 6.39 | 7.52 | 8.64 | 6.56 |
| random-2 | 7.15 | 6.16 | 7.33 | 8.45 | 6.38 |
| random-3 | 7.35 | 6.36 | 7.57 | 8.63 | 6.71 |

The results indicate that the selection of aligned data can influence performance on this benchmark (see the high-ppl group), while the three random-sampling runs show no significant differences from one another. A preliminary conclusion can therefore be drawn: data selection may further improve the native alignment approach, which suggests an interesting direction for future research.
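For illustration, the high-ppl / low-ppl groups could be formed as sketched below, assuming a generic `perplexity(text)` scorer from some reference language model; this is our reading of the grouping rule described above, not code from the paper.

```python
def split_by_ppl_change(pairs, perplexity, k=1_000):
    """Group (original, rewritten) pairs by how much rewriting lowered perplexity.

    `pairs` are (original_text, rewritten_text) tuples; `perplexity` is any callable
    scoring a text with a reference LM (an assumed helper, not named in the rebuttal).
    """
    deltas = [(perplexity(orig) - perplexity(rew), orig, rew) for orig, rew in pairs]
    deltas.sort(key=lambda t: t[0], reverse=True)
    high_ppl = [(o, r) for _, o, r in deltas[:k]]   # largest perplexity drop -> significant changes
    low_ppl = [(o, r) for _, o, r in deltas[-k:]]   # smallest drop -> near-identical rewrites
    return high_ppl, low_ppl
```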

Q4: Release of data and models

We have already released the data and models publicly. Please check ‘Author Rebuttal - Clarification of Open Source‘ for more details.

Author Response

Clarification of Open Source

We have made the following resources publicly available from our research:

  1. English and Arabic Seed Rewriting Data: Annotated pairs generated by GPT-4.
  2. Native-Aligned Arabic Language Base Models: LLaMA3-Tamed-8B and LLaMA3-Tamed-70B.
  3. Chat Versions of the Aligned Models: LLaMA3-Tamed-8B-Chat and LLaMA3-Tamed-70B-Chat.
  4. Translated Evaluation Benchmark: Arabic-BeaverTails.

Additional Experiment I: Comparison between native alignment and post-alignment

Regarding the concerns raised by the reviewers, we conducted an experiment comparing native alignment and post-alignment. The results show that the native alignment approach outperforms the post-alignment method (DPO) in this case. However, we are afraid this is not a fair apples-to-apples comparison, for the following reasons:

  1. The data used for native alignment and DPO were not of the same scale.
  2. Native alignment and DPO are complementary methods that operate at different stages rather than being exclusive.

Therefore, the conclusion drawn is specific to these ad hoc settings involving native alignment and DPO. The results may differ if the settings are changed.

Experiment Settings

We utilized the LLaMA-Factory framework, employing LLaMA3-Tamed-8B as the backbone for the experimental group focusing on native alignment, and Meta-LLaMA3-8B as the control group. We performed instruction tuning on both pre-trained models using an Arabic supervised fine-tuning (SFT) dataset, resulting in the fine-tuned models named LLaMA3-Tamed-8B (Native Alignment + SFT) and LLaMA3-8B (SFT). For post-alignment, we selected DPO training as a representative approach, using an Arabic preference dataset. Post-alignment was conducted on both chat models, yielding LLaMA3-Tamed-8B (Native Alignment + SFT + DPO) and LLaMA3-8B (SFT + DPO). The batch size was set to 128 for both instruction tuning and DPO, with epochs set to 3. All other experimental settings followed the default settings in the framework. We evaluated the performance of the instruction-tuned models and the post-alignment-tuned models on the same Arabic benchmarks shown in the paper, using a zero-shot setting.

Experiment Results and Analysis

| Model | ArabicMMLU | EXAMS | ACVA clean | ACVA all | Avg. |
|---|---|---|---|---|---|
| LLaMA3-8B (SFT) | 41.65 | 39.84 | 55.56 | 57.10 | 48.54 |
| LLaMA3-8B (SFT + DPO) | 39.78 | 38.56 | 60.11 | 61.53 | 50.00 |
| LLaMA3-Tamed-8B (Native alignment + SFT) | 41.13 | 41.73 | 66.64 | 66.96 | 54.12 |
| LLaMA3-Tamed-8B (Native alignment + SFT + DPO) | 39.58 | 39.00 | 68.24 | 66.01 | 53.21 |

Considering that native alignment and post-alignment methods (such as DPO) are orthogonal and can be applied simultaneously in the same model, experiments on LLMs with and without DPO show that native alignment can enhance cultural alignment. This indicates that both native alignment and post-alignment are beneficial and complementary approaches to alignment.

Additional Experiment II: Comparison of Native Alignment and Data Cleaning

We conducted an additional experiment to compare native alignment and data cleaning procedures, and to evaluate the transferability of our proposed method to other languages beyond Arabic, specifically English.

Experiment Settings

We implemented the native alignment approach as described in the paper. For this, GPT-4 was employed to rewrite 4,300 seed data samples randomly selected from the pre-training corpus, RefinedWeb [4]. This rewritten data was then used to fine-tune a pre-trained model (Qwen-1.5-4B-Chat) as the rewrite LLM. Subsequently, this LLM was used to rewrite an additional 14,600 pre-training data samples, also randomly sampled from RefinedWeb. Continuous pre-training was carried out on Qwen-1.5-0.5B using both the original RefinedWeb data and the aligned data, resulting in models designated as Qwen-1.5-0.5B-refinedWeb and Qwen-1.5-0.5B-aligned. Evaluation was conducted using the MMLU benchmark [5].

Experiment Results and Analysis

| | Qwen-1.5-0.5B | Qwen-1.5-0.5B-refinedWeb | Qwen-1.5-0.5B-aligned |
|---|---|---|---|
| Humanities | 27.99 | 29.33 | 33.95 |
| STEM | 12.86 | 25.37 | 27.29 |
| Social Science | 14.35 | 29.91 | 32.71 |
| Other | 20.30 | 27.46 | 30.70 |
| Avg. | 18.32 | 27.71 | 30.73 |

The results show that both continuous pre-training methods led to performance improvements on the MMLU benchmark. However, the native alignment procedure resulted in more significant gains than data cleaning alone. Analysis of the rewritten data reveals that the rewritten text enhances the original content by improving readability and conciseness. This suggests that:

  1. Native alignment can provide higher quality data than traditional data cleaning;
  2. Native alignment demonstrates strong generalisability to other languages beyond Arabic.

[4] Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E. and Launay, J., 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

[5] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Final Decision

This paper proposes a data augmentation pipeline that modifies the pre-training data for large language models in key aspects such as formatting, values, content moderation, and knowledge preservation. The approach, "native alignment", is applied to Arabic LLMs due to the relatively small pretraining corpus and the difference between Arabic and Western culture. Experiments are conducted to test performance on several metrics, including trustworthiness, knowledge, and Arabic localisation. This is a well-written paper targeting the important topic of LLM alignment. It also addresses the relatively under-explored sub-question of how to improve alignment at pretraining. The resulting pipeline presents a reasonable idea, and the evaluations are clear and comprehensive. Two pre-aligned LLMs for Arabic are released openly based on the experiments in this paper. The authors added the additional experiments requested by the reviewers that compare their method with post-alignment methods and promised to add the results to the paper. In addition, we would encourage the authors to clarify the difference between native alignment and data filtering.