ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models
This paper presents ALOPE, an adaptive layer-optimization framework for LLM-based translation quality estimation, enhancing cross-lingual transfer learning through layer-wise adaptation, dynamic weighting, and multi-head regression.
Abstract
Reviews and Discussion
This paper proposes ALOPE, an adaptive layer-optimization framework tailored to improve LLM-based quality estimation for machine translation by facilitating optimal cross-lingual representations. More concretely, the authors propose layer-specific adaptation in Transformer models, along with two strategies, dynamic weighting and multi-head regression, to improve quality estimation prediction. Experimental results show that the proposed framework improves over LLM-based QE approaches.
The experimental setup is a little unclear, and the manuscript could be restructured. The paper lacks model comparisons, being assessed only with various LLMs, which is okay but not sufficient to claim its effectiveness. In Table 1, the Aya model tends to show smaller correlation scores. Do you have any comment on this? Instead of instruction-based fine-tuning, have you applied reinforcement learning from human feedback, which could be another baseline for comparison? The paper could be updated for better readability:
- The motivation behind the layer optimization is unclear in the first section. Could you explain more carefully why you decided to focus on layer optimization and what can be expected from the ALOPE framework?
- Figure 2 looks a bit confusing. Can you try to simplify the blocks as much as possible?
- Section 3.1 could move to Section 3.3. It is hard to follow the experimental settings in the current order.
Reasons to Accept
- The paper proposes a novel approach of adaptive layer optimization for translation quality estimation with LLMs, applicable to any LLM.
Reasons to Reject
- The paper could be revised for better readability.
- Lack of baselines or other models for system comparison, despite some being mentioned in Section 2.
Questions for Authors
See the summary.
Thank you for taking the time to review our paper. We sincerely appreciate your detailed and insightful feedback. Below, we have addressed each of your comments and questions in detail.
Comment 1: The paper lacks model comparisons, being assessed only with various LLMs, which is okay but not sufficient to claim its effectiveness.
In addition to comparing across various multilingual LLMs, as shown in Table 2, we have compared ALOPE against state-of-the-art quality estimation models based on pre-trained encoders (e.g., COMET, TransQuest) as well as standard instruction fine-tuned LLMs (baseline). The results demonstrate that ALOPE consistently outperforms the baselines for all 8 language pairs and achieves performance comparable to, and in some cases exceeding, that of established SOTA encoder-based QE methods.
We understand that this concern regarding the baseline and SOTA comparisons contributed to the initial recommendation for rejection, and we hope that our clarification has sufficiently addressed it. We also kindly ask you to consider an important use case of our approach: given a pre-hosted LLM, it allows attaching a modular QE expert to evaluate translation quality.
Comment 2: Instead of instruction-based fine-tuning, have you applied reinforcement learning from human feedback?
Quality estimation is a challenging cross-lingual task which relies on embeddings obtained directly from model parameters. While RLHF is relevant for preference alignment, there is no existing literature showing that alignment approaches work for predicting direct assessments such as numerical scores, keeping it well outside the scope of our current study. Further, given our focus on an efficient and modular approach, a computationally intensive RL-based approach did not align with the objectives of this work. We do acknowledge the potential of RL for a task like reasoning over errors in MT, a task with a more natural-language output, which is also part of our ongoing work.
Comment 3: The motivation behind the layer optimization is unclear in the first section. Could you explain more carefully why you decided to focus on layer optimization and what to be expected with the ALOPE framework?
Our motivation stems from the observation that cross-lingual alignment between languages differs across Transformer layers, and the final Transformer layer does not necessarily show the best cross-lingual alignment (Kargaran et al., 2024). However, existing LLM-based methods typically rely only on the final layer, ignoring potentially informative intermediate representations, especially for cross-lingual tasks such as quality estimation for machine translation. The goal of ALOPE is to explore whether selectively leveraging and optimizing over intermediate layer outputs (enabled by regression heads combined with LoRA) can improve performance for translation quality estimation while maintaining computational efficiency, making ALOPE both a flexible and modular evaluation framework. We will revise the Introduction section to make this motivation more explicit. A minimal sketch of this setup is shown after the reference below.
Reference: Kargaran, A.H., Modarressi, A., Nikeghbal, N., Diesner, J., Yvon, F. and Schütze, H. (2024). MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment. arXiv preprint arXiv:2410.05873. Available at: https://arxiv.org/abs/2410.05873.
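To illustrate the setup, here is a minimal sketch of attaching a regression head to an intermediate layer of a Hugging Face causal LM; the model name, layer index, mean pooling, and linear head below are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class IntermediateLayerQE(nn.Module):
    """Score translation quality from one intermediate layer's hidden states."""

    def __init__(self, model_name="meta-llama/Llama-3.2-3B", layer_index=7):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(
            model_name, output_hidden_states=True
        )
        self.layer_index = layer_index  # e.g. TL-7; -1 would be the final layer
        hidden_size = self.backbone.config.hidden_size
        self.regression_head = nn.Linear(hidden_size, 1)  # predicts a DA-style score

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states is a tuple: (embedding output, layer 1, ..., layer N)
        h = out.hidden_states[self.layer_index]  # (batch, seq_len, hidden)
        # Mask-aware mean pooling over tokens (an assumed pooling choice).
        mask = attention_mask.unsqueeze(-1).to(h.dtype)
        pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.regression_head(pooled).squeeze(-1)  # (batch,) scores
```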
Comment 4: In Table 1, Aya model tends to show smaller correlation score. Do you have any comment on this?
The Aya model demonstrates correlation scores comparable to those of other models in the intermediate Transformer layers. The lower correlation scores are obtained only when the regression head is attached to the final layer. We are not certain about the cause of the performance drop in the last layer; a possible reason is that the final layer of Aya is less aligned with quality estimation signals, whereas its intermediate representations still capture useful features. We will add a discussion on this in the revised version of the paper. Given the strong multilingual claims of the Aya series of models, and the success of multilingual encoders like XLM-R-XL [COMET] for the QE task, we believe Aya was a justified model choice.
Comment 5 & 6: Figure 2 looks a bit confusing. Can you try to simplify blocks as much as possible? Section 3.1 could move to Section 3.3. It is hard to follow the experimental settings in the current order.
We agree that Figure 2 can be simplified to improve clarity. We will also restructure the setup of the sections in a more intuitive and readable order in the revised version.
Thank you once again for your valuable feedback. We hope our responses have sufficiently addressed your concerns and that you will consider improving the scores.
We request the reviewer to kindly go through the rebuttal and respond.
Thank you for the detailed response. I read all the other reviews and each response. My questions are addressed, and I decided to increase the score accordingly.
Thank you for reading our responses and updating the score.
This paper presents a technique for performing LLM-based QE as a regression task, whereas the “path of least resistance” with an LLM would be to just format it as another sequence generation task. Their technical contribution is a handful of techniques for adding regression heads at intermediate layers. They present results with several LLMs with parameter counts up to 8B and evaluate on several low-mid resource language pairs.
Reasons to Accept
- The work presents an elegantly simple technique that seems to work.
- The authors promise to release their models and code.
Reasons to Reject
- There are few ablations. It would be useful to see more comparisons with different configurations on a single model (more choices of hidden layer, different pooling strategies for the hidden states, etc.), rather than comparisons using many models whose results are usually poor or unremarkable (such as the Aya model, for example).
Questions for Authors
- It seems to me that the adaptation of the models for regression is completely orthogonal to the use of LoRA. Is there some connection between these two factors that I'm missing, or would it be possible to do the regression without LoRA layers as well?
- It seems like the most obvious place to attach the regression head would be after the final layer, rather than after some intermediate layer. Was this tried and did it work?
- A lot of space in Section 3 could be saved by omitting equations that are well-known, or putting them inline. For example, including the softmax equation (L174) is unnecessary.
- Some of the comparisons are presented in an unclear way. For example, Table 1 does not include SIFT even though it’s the most important baseline and should not have been demoted to an appendix. On the other hand, the plots in Figure 3 compare dynamic-weighting to multi-head regression to SIFT but not to vanilla ALOPE, which is the most natural thing to compare to (unless I’m misunderstanding something).
- The formulation of dynamic weighting allocates one parameter per model layer, and this weight does not depend on the model input. Therefore, after training the weighting of layers is the same for all inputs (perhaps static weighting would have been a better name). Did you also try techniques for learning weights that do depend on the input in some way? This could be done, for example, by computing the weight for each layer as a linear projection of h_k.
Thank you for taking the time to provide your thoughtful and constructive feedback. Below, we have addressed each of your comments and questions in detail.
Comment 1: It seems to me that the adaptation of the models for regression is completely orthogonal to the use of LoRA. Is there some connection between these two factors that I'm missing, or would it be possible to do the regression without LoRA layers as well?
Thank you for highlighting this point. Yes, the regression-specific adaptation (through regression heads) and LoRA-based fine-tuning can be considered independently. However, the focus of our proposed method is efficiency in computationally resource-constrained scenarios. We specifically chose regression heads incorporated with LoRA for their parameter efficiency, reducing the computational overhead compared to full fine-tuning, and allowing modularity given a pre-hosted LLM. ALOPE is able to perform QE with LoRA-less regression heads too; a sketch of how the two pieces combine follows below.
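The sketch below, building on the earlier one, shows the combination using the peft library; the LoRA hyperparameters and target modules are assumptions typical for Llama-style models, not our exact configuration.

```python
from peft import LoraConfig, get_peft_model

# Assumed LoRA hyperparameters; our exact settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical choices for Llama-style attention
    task_type="FEATURE_EXTRACTION",
)

model = IntermediateLayerQE()  # from the earlier sketch
model.backbone = get_peft_model(model.backbone, lora_config)

# get_peft_model freezes the backbone, so only the LoRA adapters and the
# regression head (defined outside the backbone) receive gradients.
# Dropping the two peft lines gives the LoRA-less variant: a fully frozen
# backbone with only the regression head trained.
```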
Comment 2: It seems like the most obvious place to attach the regression head would be after the final layer, rather than after some intermediate layer. Was this tried and did it work?
Yes, we attached the regression head after the final layer, which corresponds to layer -1 in our experiments. As detailed in Section 4.1 and shown in Table 1, while the final layer does provide reasonable performance, it is consistently outperformed by intermediate layers, particularly TL-7 and TL-11, across most language pairs. Our findings indicate that these intermediate layers capture richer cross-lingual representations, leading to better quality estimation. This supports our choice to explore and utilize regression heads at intermediate layers rather than relying only on the final output layer.
Comment 3: A lot of space in Section 3 could be saved by omitting equations that are well-known, or putting them inline. For example, including the softmax equation (L174) is unnecessary.
We will revise Section 3 to improve conciseness.
Comment 4: Some of the comparisons are presented in an unclear way. For example, Table 1 does not include SIFT even though it’s the most important baseline and should not have been demoted to an appendix. On the other hand, the plots in Figure 3 compare dynamic-weighting to multi-head regression to SIFT but not to vanilla ALOPE, which is the most natural thing to compare to (unless I’m misunderstanding something).
Due to space constraints, we presented the detailed correlation scores of SIFT in the appendix. However, we highlighted instances in Table 1 where ALOPE outperforms SIFT using the ↑ symbol for clarity. We will include vanilla ALOPE in Figure 3 in the final draft and also revise the plots to include it alongside dynamic weighting, multi-head regression, and SIFT.
Comment 5: The formulation of dynamic weighting allocates one parameter per model layer, and this weight does not depend on the model input. Therefore, after training the weighting of layers is the same for all inputs (perhaps static weighting would have been a better name). Did you also try techniques for learning the weights that do depend on the input in some way? This could be done, for example, by computing the weight for each layer as a linear project of h_k.
We named the approach "dynamic weighting" because the model learns a set of trainable scalar weights that determine the contribution of each layer's embedding to the final combined representation. These weights are normalized, allowing the model to dynamically adjust the relative importance of each layer during training, which is why we call it "dynamic"; a minimal sketch of the scheme is shown below. We agree that incorporating input-dependent weighting could provide more flexibility and potentially improve performance, and we will consider this direction for future work.
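The following sketch illustrates the described weighting scheme: one trainable scalar per layer, softmax-normalized, combining pooled per-layer representations before a shared regression head. The pooling and head shapes are assumptions carried over from the earlier sketches.

```python
import torch
import torch.nn as nn

class DynamicLayerWeighting(nn.Module):
    """Combine pooled per-layer embeddings via learned softmax-normalized weights."""

    def __init__(self, num_layers, hidden_size):
        super().__init__()
        # One trainable scalar per layer; the learned weighting is the same
        # for all inputs after training. An input-dependent variant could
        # instead derive each weight from a linear projection of h_k.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.regression_head = nn.Linear(hidden_size, 1)

    def forward(self, pooled_layers):
        # pooled_layers: (batch, num_layers, hidden) of pooled per-layer states
        weights = torch.softmax(self.layer_logits, dim=0)  # normalized, sums to 1
        combined = (pooled_layers * weights.view(1, -1, 1)).sum(dim=1)
        return self.regression_head(combined).squeeze(-1)  # (batch,) scores
```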
Thanks again for your valuable feedback. We hope our responses address your concerns and that you will consider improving the scores.
Thank you for the response. After reading this response and the responses to the other reviews, I have updated the score.
Thank you for your feedback and for engaging with our response. However, it seems the score is the same as before.
It's fixed now.
Thank you for updating the score.
ALOPE introduces an adaptive layer-optimization framework that enhances Large Language Models' capabilities for machine translation quality estimation through layer-wise adaptation with regression task heads. It does so in a parameter-efficient way, leveraging quantization plus LoRA adapters (QLoRA).
The work demonstrates that intermediate Transformer layers provide superior cross-lingual representations for QE tasks compared to final layers, and introduces dynamic weighting and multi-head regression strategies that further improve performance across eight low-resource language pairs.
ALOPE represents a novel contribution that bridges the gap between generative LLMs and regression-based QE tasks, showing clear improvements over existing LLM-based approaches. The paper is well-structured and comprehensive, with thorough empirical evaluation across multiple models and language pairs.
Reasons to Accept
- The paper systematically identifies which Transformer layers in LLMs are most effective for cross-lingual QE, finding that intermediate layers (particularly TL-7/TL-11) consistently outperform the final layer, a counter-intuitive but valuable insight.
- ALOPE demonstrates practical improvements over standard instruction fine-tuned LLMs across all evaluated low-resource language pairs, achieving results comparable to (and in some cases exceeding) state-of-the-art encoder-based QE models.
- The approach is model-agnostic and computationally efficient through LoRA adaptation, making it highly applicable to existing LLM deployments without extensive reconfiguration.
- The proposed framework allows for scaling existing LLM-based MT systems with QE capabilities and will be released as open-source, enhancing reproducibility and future research.
Reasons to Reject
- While the paper demonstrates improvements over other LLM-based approaches, ALOPE still doesn't consistently outperform state-of-the-art encoder-based QE models like COMET across all language pairs.
- The ablation study exploring generalizability to monolingual regression tasks is somewhat limited, making it difficult to fully assess whether the findings about layer-specific adaptation transfer to other regression scenarios.
- The paper could more thoroughly explore why the En-Ta language pair consistently shows different patterns compared to other language pairs, potentially missing deeper insights about linguistic factors affecting cross-lingual QE.
- While the dynamic weighting and multi-head regression strategies show promise, the improvements over layer-specific adaptation are relatively modest, suggesting these approaches might need further refinement.
Questions for Authors
Questions
- Have you explored whether the optimal layer findings (TL-7) generalize to larger LLMs beyond the 8B parameter models tested? Would you expect similar patterns in models with significantly more layers? Do you anticipate these findings to generalize to other model families?
- The paper demonstrates that ALOPE achieves comparable results to encoder-based QE models despite using causal LLMs. What are the computational and memory trade-offs compared to these encoder-based approaches?
Suggestions/Comments
- Can you add a discussion on why predicting a scalar for MT is necessary with LLMs? Does it matter anymore? QE was framed as a way to determine which translations required human inspection. Doesn't it make sense to move towards error extraction (a la MQM), which seems more useful for detecting errors and surfacing them to humans?
- The comparison with encoder-based models in Table 2 could be expanded to include more details about model sizes and training data requirements.
- The plots in Figure 3 would benefit from more distinct colors for better readability, especially when printed in grayscale.
- Consider adding more discussion about the practical applications and deployments of ALOPE in real-world MT systems.
Missing Citations
- The Llama 3 family ("The Llama 3 Herd of Models")
- Recent work on layer-wise knowledge probing in multilingual LMs such as Dufter and Schütze (2020) "Identifying Elements Essential for BERT's Multilinguality"
- Fan et al. (2023) "Layer-selective Rank Reduction for Parameter-efficient LLM Adaptation"
Thank you for taking the time to share your insightful comments. We have provided detailed responses to each of your questions and remarks below.
Comment 1: Have you explored whether the optimal layer findings (TL-7) generalize to larger LLMs beyond the 8B parameter models tested? Would you expect similar patterns in models with significantly more layers? Do you anticipate these findings to generalize to other model families?
We were unable to explore models larger than 8B parameters due to shared computational infrastructure constraints. Further, the goal was to develop a QE framework based on an offline LLM and deploy it for the QE task within a resource-constrained scenario. As for generalization to larger models and other model families, while we cannot empirically confirm this yet, we believe similar patterns are likely to emerge, given that we observe consistent patterns across 3-8B parameter models from different families (Llama vs. Aya), pre-trained using different data and approaches (Llama 2 vs. Llama 3.x vs. Aya). However, we also understand that the specific optimal layers may differ depending on model architecture and depth. Investigating these aspects in larger model families is certainly a valuable direction for future work.
Comment 2: The paper demonstrates that ALOPE achieves comparable results to encoder-based QE models despite using causal LLMs. What are the computational and memory trade-offs compared to these encoder-based approaches?
While causal LLMs are generally perceived as more resource-intensive, our ALOPE implementation demonstrates competitive memory usage compared to encoder-based QE models.
The memory consumption of our ALOPE-based models is approximately:
- LLaMA3.2-3B: ~12.8 GB
- LLaMA3.1-8B: ~12.7 GB
- LLaMA2-7B: ~14.4 GB
- Aya-expanse-8B: ~11.9 GB
In comparison, encoder-based SOTA models consume:
- TransQuest (InfoXLM): ~11.9 GB
- COMET (XLM-R XL): ~15 GB
These numbers show that with LoRA-adapted lightweight regression heads, ALOPE offers a memory-efficient adaptation of LLMs, with overheads comparable to, or even lower than, some pre-trained encoder-based QE systems. We will include this memory comparison and discussion in the revised version.
Comment 3: Can you add a discussion on why predicting a scalar for MT is necessary with LLMs? Does it matter anymore? QE was framed as a way to determine which translations required human inspection. Doesn't it make sense to move towards error extraction (a la MQM), which seems more useful for detecting errors and surfacing them to humans?
Thank you for this question. Existing literature shows that sentence-level QE can be a way to determine which translations require human inspection, and it can also act as a quality feedback signal for improving MT (both within and outside the scope of automatic post-editing). Current practical applications of MT rely heavily on reference-based metrics, which are not practical in all scenarios (e.g., when references are unavailable). While MQM is detailed, it carries significant data annotation overheads in terms of cost, time, and cognitive load on annotators. Given an LLM, we also envision this framework being extended to reason over MT errors informed by a regression-head-predicted DA score, providing more reliability to error reasoning and MT assessment. We will include a discussion on this distinction and the relevance of scalar prediction in the revised version.
Comment 4: The comparison with encoder-based models in Table 2 could be expanded to include more details about model sizes and training data requirements.
Thank you for the suggestion. We will revise Table 2 to include additional details on model sizes and training data to improve clarity and completeness. For training ALOPE, SIFT-LLMs, and TransQuest, we used the data sizes (75K samples) specified in Appendix A. For COMET, we evaluated using a publicly available pre-trained model from Hugging Face, which was trained on a large dataset comprising WMT17-19 DA annotations, the MLQE-PE corpus, and WMT23 human-annotated data (totalling approximately 940K samples spanning 38 language pairs).
Comment 5 & 6:
5 - The plots in Figure 3 would benefit from more distinct colors for better readability, especially when printed in grayscale.
6 - Consider adding more discussion about the practical applications and deployments of ALOPE in real-world MT systems.
We will revise Figure 3 with more distinguishable grayscale-friendly colours. Further, we will add a discussion on ALOPE’s practical applications in real-world MT deployments and the relevance of scalar prediction.
Thank you again for your valuable feedback. In light of our response, we request that you kindly consider improving the scores.
We request the reviewer to kindly go through the rebuttal and respond.
Thanks for answering the questions. I have updated my scores.
Thank you for reading our responses and updating the score.
This paper looks at Quality Estimation (QE) using an LLM for Machine Translation. The method is described by one reviewer as “elegantly simple that seems to work”. The method is an adaptive layer-optimization framework which is interesting algorithmically and novel. Reviewers made positive comments such as “The paper systematically identifies which Transformer layers in LLMs are most effective for cross-lingual QE,” and “approach is model-agnostic and computationally efficient through LoRA adaptation, making it highly applicable to existing LLM deployments without extensive reconfiguration”.
Multiple reviewers noted that it was applicable to any LLM.
The authors addressed multiple reviewer concerns, and all reviewers updated their scores positively afterwards. This work appears to have the potential for a lot of interest in the multilingual and machine translation community at COLM.
Overall, all of the reviewers had positive views of the paper and it appears to be technically sound and interesting to the community.