6.5

/10

Poster4 位审稿人

最低5最高8标准差1.1

4.5

置信度

COLM 2024

Building a Large Japanese Web Corpus for Large Language Models

Naoaki Okazaki,Kakeru Hattori,Hirai Shota,Hiroki Iida,Masanari Ohi,Kazuki Fujii,Taishi Nakamura,Mengsay Loem,Rio Yokota,Sakae Mizuki

OpenReview PDF

提交: 2024-03-22更新: 2024-08-26

TL;DR

This study builds a large Japanese web corpus from the Common Crawl archive, and demonstrated its effectiveness by continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B.

摘要

关键词

corpuswebCommon CrawlJapanesecontinual pre-trainingLlama 2MistralMixtral

评审与讨论

审稿意见

评分: 5置信度: 52024-05-08

The paper constructed a high-quality Japanese web corpus which was extracted from Common Crawl dataset for the training of Japanese large language models. Compared with existing Japanese text corpus which was developed overseas as a part of multilingual corpora, the corpus in this paper was specially developed for Japanese and enjoys better quality. The paper also demonstrates a Japanese text collection and cleaning pipeline and empirically analyze the improvement that the newly collected Japanese web corpus can bring to Japanese large language models.

接收理由

The paper developed a high-quality Japanese text corpus which will be beneficial to Japanese language model training.
The paper also provides a text data collecting and cleaning pipeline which is not limited to Japanese but can also be applied to any other non-English language.
The improvement that the developed Japanese corpus brought on language models looks very impressive.

拒绝理由

The paper has limited innovations and technical depth. The paper proposed very little novel techniques or innovations and its data collecting and cleaning pipeline looks straight forward and only involves data engineering components such as rule-based quality filtering (in which the rules are not hard to design).
The motivation of this paper is not convincing. Although the paper claims that all existing Japanese text corpora are just part of some multilingual corpora and suffer from unsatisfactory quality, there are no concrete evidence to support this argument. In addition, the paper does not provide any concrete problems of the existing Japanese corpus and the reason why they have such problems. Even if the existing Japanese corpora have some problems, I am afraid that they can be easily resolved given that the data collecting and cleaning pipeline is so straight forward.
Although the paper claims that it is dedicated to develop a Japanese-only text corpus, I can hardly identify any techniques proposedin this paper which is specifically designed for Japanese.
The presentation should be further improved. For example, the term "Llama 2 7B" looks wired and looks like the name of some model at the first glance. However, the model's name is "Llama 2" and the number of parameters is "7B".

给作者的问题

Could you please demonstrate some cases in the existing Japanese corpus whose quality is not satisfactory? It seems that from Table 1, each existing Japanese corpus also employed deduplication and data cleaning and they are supposed to be good enough.
Please also demonstrate the exact problems of the existing Japanese corpus and the reason why they have such problems and the reason why the problems cannot be easily tackled.

作者回复

2024-05-31

We appreciate your time and effort in reviewing our paper. However, we received the reasons for rejection, probably due to misunderstanding and oversight of the review guideline and the paper. We hope that our response addresses your questions, misunderstandings, and concerns.

R1: limited innovations and technical depth

Processing 63B web pages and continual pre-training on at least two 7B, a 13B, an 8x7B, and a 70B models cannot be done without technical depth. Unlike other kinds of studies, the top priority of corpus-building is not presenting an innovative method but making the best effort to deliver high-quality and useful data.

Q1: existing Japanese corpora are good enough

This is not true. The filtering method in mC4 is insufficient; thus, other English corpora were proposed after mC4. Furthermore, mC4 considered no aspect of Japanese for quality judgment.

R2: The motivation is not convincing. They can be easily resolved.

Q2: What is the exact problems of the existing corpus

Among CC-100, mC4, and OSCAR in Table 1, only mC4 can be a candidate for building Japanese open LLMs because the other two are too small (the numbers of tokens are less than 100B tokens, the size we used for the continual pre-training). However, mC4 uses WET as the data source, which includes unrecoverable errors. We quote Penedo et al. (2023):

however, in line with previous works (Gao et al., 2020; Rae et al., 2021), we found WET files to include undesirable navigation menus, ads, and other irrelevant texts.

HTML markups provide useful hints for removing navigation menus and ads, but this information is lost in WET files. This is a critical issue of WET files and mC4 that cannot be resolved later. We will elaborate on this in the extra page allowed in the camera-ready version.

The size of our cleaned corpus (312 billion letters) is larger than that of uncleaned mC4 (240 billion letters) in Table 1, which is also an advantage of our study. We also want to emphasize that Table 2 demonstrates that our corpus was superior to mC4 cleaned for Japanese ("+ llm-jp").

R3: Any techniques specifically designed for Japanese

Sections 3.1, 3.2, 3.4, and 3.5 were specially designed for Japanese.

R4: The presentation should be further improved

Could you please suggest which part should be improved? For example, should “Llama 2 7B” be replaced with “Llama-2-7b” (as par with the model name)?

评论- Initiating the discussion

2024-06-04

We sincerely thank the reviewer for the valuable suggestions. We hope our response has thoroughly addressed all concerns and that the reviewer can consider improving their score based on our response. There is nothing we want to add to the response, but we are also happy to discuss with the reviewer if they still have a concern.

评论- Reminder

2024-06-07

Dear Reviewer ytbp,

As the discussion period is nearing its end, we kindly remind you to review our response if you haven't had the chance yet. We are keen to know if our response has addressed your concerns and would be grateful if you could reconsider the rating if appropriate. If there are any further questions or clarifications needed, we would be more than happy to provide additional information.

Thank you very much for your time and consideration.

Authors

审稿意见

评分: 7置信度: 42024-05-11

This paper presents an effort to create a very large corpus of trustworthy Japanese text that could be used to train LLMs. They describe the method used for obtaining candidate data, cleaning, and deduplicating it. In order to show the quality of the resource, the authors train different LLMs on this data and show that it actually improved the metrics on a benchmark of Japanese NLP tests.

接收理由

Although Japanese is not a low resource language itself, it is not one of the highest resources languages in NLP either, so the effort to create a very large corpus is very interesting. I particularly liked the thorough explanation of the data collection pipeline (including downloading, filtering, deduplicating, cleansing, etc.), which could be used as an inspiration to create corpora in other languages as well. The resource built seems very relevant, and the results shown over standard benchmarks seem to show it has high quality as well.

拒绝理由

Some parts of the document are in need of more clarifications, for example the quality and hostname filtering, and the explanation of section 4.1 and associated Table 2.

给作者的问题

By continual pretraining (or is it continued pretraining?) I understand you take the pretrained model weights and keep on training on your new data. Is this the case? Please make this clear, perhaps indicate some references where the technique is introduced or used.
Please explain what NG expressions are. Does is refer to js or other programming languages content?
When you refer to "blocklists" I think the correct term is "blacklists", or is it another concept?
I am intrigued by the Japanese Copyright Act mentioned in the ethics statement that says the purpose of the person is not to "enjoy" or cause another person to "enjoy" the work (quotes are from the text). Is this some translation issue, or what is being referred to as enjoy in this case? Does this mean the use for monetary gains of the data?

作者回复

2024-05-31

We would like to thank you for your positive review and useful suggestions for our paper.

R1: the quality of hostname filtering

This is a valid point. Unfortunately, there is no public data available for comprehensively detecting harmful Japanese web pages; we found that the coverage of the UT1 blocklist was insufficient for filtering out harmful websites in Japan. We could not use a commercial filtering service to recognize harmful URLs because we were unsure of whether the TOS would allow us to build an LLM that could be used for commercial purposes. This was the reason why we created our own list for hostname filtering with tight thresholds (0.001 and 0.005) and left the establishment of a more robust method as a future work (mentioned in Section 5).

Q1: continual pertaining

Yes, you are right. Given a base pre-trained model, continual pretraining performs another round of pre-training on new data (Japanese corpus in this study). We will make this clearer in the camera-ready version. Both “continual pretraining” and “continued pretraining” seem acceptable based on our literature survey, but we will also explain this in the camera-ready version.

Q2: NG expressions

Sorry, they have nothing to do with programming languages. NG expressions stand for harmful expressions (adult, violent, aggressive). The reason why we avoided the direct term “harmful” was because this filtering process may remove a large number of non-harmful web pages with tight thresholds.

Q3: blocklists

We used the word blocklist as “a list for blocking”, following the name “UT1 blocklist”. We were also reluctant to use the word “black” because some hosts in this list may not be “black”. For example, a hospital website explaining a woman’s maternity is prone to be included in the list.

Q4: “enjoy” in the Japanese Copyright Act

Thank you for pointing out the translation problem. The original term 享受 (kyo-ju) is also translated into "enjoy" by the Japanese government. The nuance is that a human consumes the idea of a work (for any purpose, including entertainment, learning, and communication, regardless of the monetary gains of the data). The intention of Article 30-4 is to define an exception for the use case where a computer program consumes works directly for machine learning and information analysis, but a human cannot do that. We will add an explanation in the camera-ready version.

2024-06-05

Thanks for your response. I was not aware of the "NG expression" label, and perhaps other readers will not be aware either, so please include a definition in the revised manuscript.

评论- Re: Official Comment by Reviewer QC9E

2024-06-05

Thank you for the reply to the response. We will include the definition in the revised manuscript.

审稿意见

评分: 6置信度: 42024-05-11

The paper presents a corpus building project to build a large Japanese web corpus for LLMs. Using Common Crawl WARC as a base is a good idea but a lot of other Japanese related corpora are driven by Japanese academics in their silo networks that has yet been tapped in modern LLM training, e.g. NICT LLM https://www.nii.ac.jp/en/news/release/2023/1020.html and https://www.aozora.gr.jp/#main and more on https://github.com/llm-jp/awesome-japanese-llm?tab=readme-ov-file. Also, references to those work is lacking in this paper (but it's understandable since a lot of the work done in those research labs are disseminated in Japanese only).

The steps taken in the data processing are clearly described, esp. the language ID and ngram filter process. As for the "quality filtering" process of using character sets (Section 3.2), it might be that the filtering by character sets have implicitly also restrict the domain that the model will be learning from, e.g. there's a lot of Japanese BBS (bulletin board sites) / forums where substrings like "kaomoji" might be used e.g. ٩(ˊᗜˋ*)و https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q11297327146 and https://komachi.yomiuri.co.jp/ Lastly, there are also many nuances in cleaning Japanese texts e.g. half-width / full-width characters and furigana mixed into the texts.

It would be nice to have some breakdown of how much each of the filtering process taken out, esp. how much of the final text is made up of the Wikipedia dump.

接收理由

Details on data cleaning and data creation is precious and these paper should be out in the public more to make LLM training more transparent and open.

拒绝理由

Personal preference to accept papers on data creation and processing (thus the 6 score) but I'll say it's a weak accept, so reasons to reject includes:

No reference to other Japanese compilation work not using common crawl and web crawled data (see summary comments above)
Lacking some breakdown on how each much data was filtered out in each step
In data compilation study, depth of ablation (how much data affects a handful model) is more preferable then breath (how much of the performance the data affects different many LLMs). Priorities of computation usage would be better if it's covering depth.

给作者的问题

Suggestion

Please add the breakdown of how much data is filtered out in each of the process and if possible sub-processes in the data creation.
Increase the spacing between lines in Table 2, esp when there are dotted or solid lines.
Cherry pick some examples where the model performs better after training with the data presented in this paper
If possible, please reference these other LLMs too https://huggingface.co/tokyotech-llm/Swallow-70b-instruct-hf and https://huggingface.co/Rakuten/RakutenAI-7B

Questions

Is there reason or particular examples for XL-Sum where the 7B models performs worse with the data in this paper? Why did it happen? Is it because the data after the cleaning process inherently (not on purpose) filtered out specific domain or specific length of the texts?

作者回复

2024-05-31

We would like to thank you for your positive review and useful suggestions for our paper.

R1: No reference to other Japanese compilation work

Thank you for the suggestion. We will comment on the references you raised one by one.

NICT LLM (actually, this was released by NII but not by NICT). Because this is a concurrent work, the lack of this citation cannot be a reason for rejection. After the submission of our paper, we heard that their paper was (probably) accepted at some refereed international conference. We will check and cite their work in the camera-ready version.
https://www.aozora.gr.jp/#main Aozora-bunko mostly includes old novels with expired copyrights (published 50+ years ago). Because these novels include many old-fashioned words and kanji usages, we did not consider Aozora-bunko for training Japanese LLMs. In addition, Aozora-bunko is not an effort to build a corpus for training Japanese LLMs. Thus, we don’t think the lack of this citation can be a reason for rejection.
https://github.com/llm-jp/awesome-japanese-llm?tab=readme-ov-file We've been aware of this nice website that lists Japanese LLMs. However, as far as we checked, this website does not include a list of Japanese corpora. So again, we do not think the lack of this citation can be a reason for rejection. In addition, we want to emphasize that some strong LLMs listed on the website were used for the performance comparison in Table 2.

R2: Lacking some breakdown

You can find the breakdown at the end of Sections 3.1 to 3.5. This cannot be a reason for rejection, either.

R3: depth of ablation (how much data affects a handful model)

We believe that an important value of pre-training corpora is to provide data that works for any model and purpose. This is why we showed the effectiveness on different strong models.

Suggestions

Thank you for the useful suggestions. We will reflect these in the camera-ready version.

Q1: Performance on XL-Sum

This is explained in the second paragraph of Section 4.3. Before continual pre-training, we added Japanese vocabulary to Llama 2 and Mistral 7B tokenizers (mentioned in Section 4.1). This treatment degrades the downstream performance because the embeddings of newly added tokens were not trained well. In contrast, we didn’t expand the vocabulary of the Mixtral 8x7B tokenizer and observed an improvement in XL-Sum. We will elaborate on this in the camera-ready version.

评论- Initiating the discussion

2024-06-04

2024-06-05

Additional suggestion on R2 and breakdown, summarize it in a nice table in the appendix, it'll help a lot. And breaking down 3.4 further by the sub-step would be very helpful esp. understanding how much of the 40B characters removed were from *wikipedia.org

评论- Re: Official Comment by Reviewer KvfY

2024-06-07

Dear Reviewer KvfY,

Thank you for the additional suggestion. We could not include this table in the initial submission due to the limitation of space, but we will include the table in the extra one page of the camera-ready version.

As the discussion period is nearing its end, we are also keen to know if our response has addressed your other concerns (references and depth of ablations). We would be grateful if you could reconsider the rating if appropriate.

Best,

Authors

审稿意见

评分: 8置信度: 52024-05-11

The authors created a novel Japanese corpus for training large language models from Common Crawl archives. It consists of 312B characters, which is substantially larger than previous training corpora for Japanese LLMs, such as CC-100 (25.8B characters), mC4 (239.7 B characters), and OSCAR (74B characters). Using their Japanese web corpus, the authors performed continual pre-training on various LLMs, such as Llama 2 and Mistral, and gained consistent improvements on Japanese benchmark datasets.

接收理由

The authors have created a novel open Japanese web corpus comparable in size and quality to publicly available English web corpora. It will significantly contribute to the advancement of Japanese LLMs compared to training LLMs using the Japanese portion of a multilingual corpus.

The authors proposed a lightweight language detection method to balance the speed and quality of Japanese text extraction.

Using the Japanese web corpus, the authors performed continual pre-training on more than ten publicly available LLMs, and they extensively compared their performance using representative Japanese evaluation benchmarks, including llm-jp-eval and lm-evaluation-harness.

拒绝理由

Many aspects of the proposed method, such as filtering based on character type and length, are only applicable to Japanese.

The evaluation was only when the authors performed continual pre-training on existing LLMs using the Japanese web corpus. They do not demonstrate whether their corpus would be effective if they train an LLM from scratch.

给作者的问题

I want your current estimates of whether the Japanese web corpus has the quantity and quality necessary to train an LLM from scratch.

This is not a question, but a comment. Using shorthands such as BL (billion letters) and BW (billion words) in Table 1 is okay. However, the authors should avoid their use in the body of the paper.

作者回复

2024-05-31

We would like to thank you for your positive review and useful suggestions for our paper. We appreciate the time and effort you have taken to provide feedback. Below, we answer your questions raised.

R1: only applicable to Japanese

Yes, we expected that our paper could receive such a comment. However, our motivation was to build a large corpus specialized for Japanese because the quality of multilingual corpora does not meet our expectation of spending a lot of money for continual pre-training of Japanese LLMs. We also explained our stance in Section 6 (Limitations) that some ideas in this paper could be helpful for other languages and that researchers in each country should put an effort into building a corpus for the country.

R2: whether their corpus would be effective if they train an LLM from scratch

Q1: the quantity and quality necessary to train an LLM from scratch

(We merge responses for the rejection reason and question)

We did not try to build an LLM from scratch for various reasons:

The ultimate goal of this project was to explore methods for building open LLMs that archive high performance in Japanese.
Based on the Chinchilla law, we need 205.1 billion tokens (for 10B models) and 1.5 trillion tokens (for 67B models) as the training data. The constructed corpus was sufficient for training 7B models from scratch but probably insufficient for 70B models.
Having said that, we need to mix an English corpus for building Japanese LLMs anyway because English corpora are about 10 times larger than Japanese ones and because we also need multilingual applications (e.g., English-Japanese translation and cross-lingual summarization).
If we mix the training data of Japanese and English, continual pre-training is a reasonable choice because we can omit the large computation for English data.
A lot of Japanese companies released a variety of Japanese LLMs trained from scratch when we started this project (last fall). We wanted to try an alternative approach for building high-performance Japanese LLMs.

As an academic paper, we agree with your opinion that we want to see the performance of the LLM trained on the corpus from scratch. However, we do not have the computation resources for this at this moment, although our corpus is also ready for use for training Japanese LLMs from scratch.

Q2: Abbreviations of BL and BW

Thank you for the valuable comment. We will avoid these abbreviations in the body of text in the camera-ready version.

最终决定Accept

2024-07-10

This paper describes a procedure to create a large scale Japanese corpus for training large language models.

Strengths:

All reviewers like the contribution of this resource.
The procedure is described with details.
Continual pre-training results show the valuable of this resource.

Weakness:

Some steps are specific to Japanese.
Lacking direction assessment of corpus quality comparing to other resources.

Overall, though the paper is not particularly novel, it will make a significant contribution with the creation of this resource.