PaperHub
Overall rating: 2.5 / 10 (Rejected; 4 reviewers)
Individual ratings: 3, 3, 3, 1 (min 1, max 3, std. dev. 0.9)
Confidence: 4.8
Correctness: 2.5 · Contribution: 1.0 · Presentation: 2.3
ICLR 2025

Narrow Transformer: Mono-lingual Code SLM for Desktop

Links: OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Narrow Transformer, Code SLMs, Desktop Deployment, Lightweight Code Language Models, Small Language Models, Language Specific Models, Monolingual Code Language Model

Reviews and Discussion

Review 1 (Rating: 3)

The authors propose to train a small code model called "NT-Java-1.1B" specialized for Java. The model achieves better performance on Java than its StarCoderBase-1.1B counterpart, which they use as the initial checkpoint. The small model size allows it to run efficiently on laptops and other edge hardware, thereby increasing the productivity of developers writing Java code.

Strengths

  1. The paper is well-written, easy to read, and easy to follow.
  2. The authors release their models under the OpenRAIL-M license which allows both research and commercial use.

Weaknesses

  1. There is no novelty in the paper.
  2. Evaluations are missing for the GGUF model.
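
To make the missing GGUF evaluation concrete, here is a minimal sketch of how a GGUF export of the model could be exercised locally with llama-cpp-python; the file name, context size, and sampling settings are illustrative placeholders rather than anything reported by the authors.

```python
# Minimal sketch (not the authors' setup): loading a hypothetical GGUF export of the
# model with llama-cpp-python and sampling a short Java completion from it.
from llama_cpp import Llama

llm = Llama(
    model_path="./NT-Java-1.1B-Q4_K_M.gguf",  # hypothetical local GGUF file name
    n_ctx=2048,                                # context window for this session
)

prompt = "public static boolean isPalindrome(String s) {"
result = llm(prompt, max_tokens=64, temperature=0.2)

# llama-cpp-python returns an OpenAI-style completion dict.
print(result["choices"][0]["text"])
```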

Questions

The paper is easy to read and well-written, although there are some gaps:

  1. There is no reason to discuss the binary .bin and .idx file format used by Megatron in Section 4.1.
  2. Line 130 says that the model uses FlashAttention; however, it should be noted that FlashAttention is not part of the model, it is just a training-time optimization.
  3. The paper mentions that the load_in_4bit and load_in_8bit arguments from the Hugging Face APIs are used for evaluation; however, as far as I am aware, load_in_8bit uses the LLM.int8() algorithm to quantize the model. It would be better to use a stronger algorithm such as GPTQ or AWQ and report that accuracy. The load_in_4bit argument uses FP4/NF4 quantization from the bitsandbytes library. While these are easy to use from a user's perspective, it would be nice to have evaluation numbers for the GGUF model as well (a minimal loading sketch follows this list).
  4. Tables 3 and 4 can be combined into a single table; there is no need for two separate tables.
  5. There is no comparison with the Llama-3.2 1B model. It would be better to add that comparison as a baseline.
  6. It might be better to train an MoE model with fewer activated parameters for efficiency: something like 3B total parameters with ~500-800M activated parameters could give better accuracy while still being efficient. However, I understand that this might be infeasible due to compute restrictions.
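
For point 3 above, the sketch below shows the bitsandbytes-backed loading path the review refers to (load_in_8bit maps to LLM.int8(), load_in_4bit to FP4/NF4), expressed through Hugging Face's BitsAndBytesConfig; the checkpoint name and generation settings are stand-ins, not the paper's evaluation harness.

```python
# Minimal sketch (assumed settings, not the paper's harness): loading a StarCoder-family
# checkpoint with bitsandbytes 4-bit quantization via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoderbase-1b"  # placeholder checkpoint for illustration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # FP4/NF4 path; load_in_8bit=True would use LLM.int8() instead
    bnb_4bit_quant_type="nf4",              # NF4 rather than the default FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "public static int factorial(int n) {"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

GPTQ or AWQ checkpoints, as the reviewer suggests, would be loaded through their own quantization configs rather than this bitsandbytes path.
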
Review 2 (Rating: 3)

This paper introduces NT-Java-1.1B, a specialized code language model designed for Java programming, built on StarCoderBase-1.1B. It aims to propose small, efficient code models that can be deployed on developer desktops. NT-Java-1.1B achieves state-of-the-art performance on the MultiPL-E Java code benchmark, surpassing its base model and other similar-sized models.

Strengths

  1. Investigating small and efficient code models is meaningful.

Weaknesses

  1. This paper merely fine-tunes an open-source base model (StarCoderBase-1.1B) on part of a public dataset (The Stack v1) using existing training methods. The result is evaluated on only part of the MultiPL-E benchmark. The paper offers no new technical contributions, lacks thorough experimental analysis, and provides no insights.
  2. The term "Narrow Transformers" is neither defined nor supported by related references. It seems to be a concept created by the authors. I am confused about how this concept differs from small LLMs or efficient LLMs. Could the authors explain this?
  3. The experimental results are poor. Fine-tuning the base model StarCoderBase-1.1B for Java yields results that are basically similar to StarCoderBase-1.1B itself, while StarCoderBase-1.1B is capable of handling multiple languages rather than just Java.
  4. The paper contains only 6 pages.

Questions

No questions.

Review 3 (Rating: 3)

This work introduces the NT-Java-1.1B model, detailing its development process and evaluation results. NT-Java-1.1B is developed from the StarCoderBase-1.1B model and trained on a subset of StarCoderData. Evaluation results indicate that NT-Java-1.1B outperforms StarCoderBase-3B in pass@1 performance on the MultiPL-E benchmark, while it scores lower than StarCoderBase-1.1B on HumanEval-FIM (Java).

Strengths

This work provides a detailed introduction to the training process of NT-Java-1.1B and examines the impact of FIM (Fill-in-the-Middle) training, which can serve as a useful reference for developing other SLMs.
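
For context on the FIM objective discussed here, the sketch below shows the prefix-suffix-middle prompt layout used by StarCoder-family models; the special-token names follow the StarCoderBase tokenizer that NT-Java inherits, and the checkpoint name is a placeholder rather than the released NT-Java weights.

```python
# Minimal sketch (placeholder checkpoint): fill-in-the-middle prompting in the
# prefix-suffix-middle (PSM) format used by StarCoder-family tokenizers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoderbase-1b"  # placeholder; substitute the NT-Java checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "public static int sum(int[] xs) {\n    int total = 0;\n    for (int x : xs) {\n"
suffix = "\n    }\n    return total;\n}"

# The model is asked to generate the missing middle that joins prefix and suffix.
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
middle = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(middle)  # expected to produce something like "        total += x;"
```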

Weaknesses

  • This work’s novelty is limited, as it builds on the StarCoderBase-1.1B model [1] using training data from StarCoderData [2] and applies established methods such as Next Token Prediction [3] and Fill-in-the-Middle [4]. While it provides an application of these methods, it does not introduce new improvements or substantial contributions.

  • The experiments in this work are insufficient: too few baselines are compared against NT-Java-1.1B; only StarCoderBase-1.1B and StarCoderBase-3B are included, with no comparisons to other models.

[1] StarCoder: may the source be with you!

[2] https://huggingface.co/datasets/bigcode/starcoderdata

[3] Improving Language Understanding by Generative Pre-Training

[4] Efficient training of language models to fill in the middle

Minor issues: In Line 154, a subsection number is missing.

Questions

At the end of the Evaluation section, the authors mentioned that they conducted qualitative evaluations through user studies, but details and results of this evaluation are not provided. Could you share more comprehensive results for this part of the evaluation?

The authors have trained a model and indicated that its performance has improved. Could you elaborate further on the novelty of your work and clarify whether there are any unique contributions?

Review 4 (Rating: 1)

The authors introduce NT-Java, a code language model narrowly fine-tuned from StarCoder's 1B-parameter base model and especially suited for edge applications on smaller devices. The model is fine-tuned on the Java portion of The Stack, and training is performed with NVIDIA's Megatron-LM framework. The model's performance is evaluated on a next-token prediction objective and a fill-in-the-middle objective. Quantized versions of the tuned model are made available for wider use.

Strengths

The paper is written with a coherent storyline that leads the reader through the entire paper, and it does a very good job of describing its experiments as well as the exact setup used. The reviewer is highly confident that one would be able to reproduce the results from the description in the paper.

Weaknesses

The paper's core weaknesses can be narrowed down to two key points in the reviewer's eyes: novelty and strength of results.

Novelty:

  • A number of large language model releases have evaluated their models at the smaller scale, and projects such as llamafile go to great lengths to test these smaller models on edge devices. What distinguishes the "Narrow Transformer" from these models aside from being fine-tuned on Java only?
  • The reviewer is left unconvinced that a Java-only fine-tuned transformer is a noteworthy result. While target applications for small language models are alluded to in the Related Work section, they are not reflected in the evaluation design.
  • There exist a great number of blog posts on the web showcasing the fine-tuning of pretrained models on distilled corpora. The reviewer is left unconvinced that this paper, in its present state, goes beyond this state of the art.

Strength of Results:

  • The presented results leave the reviewer unconvinced that a model fine-tuned on only a single programming language (after being pretrained on a multi-language corpus) is superior to a smaller model trained on only Java from the outset. When making such a claim, I would expect it to be backed up by experiments, e.g., training a StarCoder-style transformer from scratch on the distilled Java corpus.

Questions

While I understand the premise and aspiration of the authors, I am left to question what the actual research question is in this instance.

Fine-tuning of pretrained models is almost commoditized by now, and with a fine-tuning framework such as Axolotl one could produce similar results fairly quickly, given access to commensurate compute resources. How do the authors go beyond this and contribute to small language model development?

The (open-ended) questions I would furthermore like to leave the authors with are:

  • Are there specific small-model optimizations that could, e.g., improve latency or the ability to deploy the model on tiny devices or even embedded hardware?
  • Do existing evaluation metrics reflect the specific demands of small language models adequately? If not, which aspects are missing and should see custom evaluation tasks?
  • What applications do the authors envision for small language models specifically, and what evaluation design would be derived from these downstream demands?
AC Meta-Review

This work introduces the NT-Java-1.1B model, detailing its development process and evaluation results. NT-Java-1.1B is developed from the StarCoderBase-1.1B model and trained on a subset of StarCoderData. Evaluation results indicate that NT-Java-1.1B outperforms StarCoderBase-3B in pass@1 performance on the MultiPL-E benchmark, while it scores lower than StarCoderBase-1.1B on HumanEval-FIM (Java).

The reviewers are unanimous: this paper is not a sufficient contribution for publication at ICLR.

Additional Comments on the Reviewer Discussion

The authors did not write a rebuttal, so there was no discussion.

Final Decision

Reject