Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Abstract
Reviews and Discussion
The paper introduces an approach that utilizes the internal hidden states, specifically, cross-layer logits, of a judge LLM to produce evaluation scores, moving beyond the conventional reliance on final-layer token probabilities. The method involves tuning L + 1 scalar weights on a held-out validation set, keeping the LLM backbone entirely frozen. The authors assert that these learned weights generalize across different benchmark datasets. The core experiments focus on comparing the model’s correlation with human judgments across three evaluation benchmarks.
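For reference, a minimal sketch of the scoring rule this summary describes; the notation is assumed for illustration rather than taken from the paper (z^(ℓ) denotes the score-token logits read out at layer ℓ, w_0…w_L are the L + 1 tuned weights, and k ranges over the K candidate scores):

```latex
% Hedged sketch (notation assumed): fuse per-layer score-token logits with the
% learned weights, normalize, then take the expected score.
p = \operatorname{softmax}\!\Big(\sum_{\ell=0}^{L} w_\ell \, z^{(\ell)}\Big),
\qquad
\hat{s} = \sum_{k=1}^{K} k \, p_k .
```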
Strengths and Weaknesses
Strengths
- The writing of the paper is clear and easy to follow.
- The main goal of the paper itself is interesting and potentially useful in building a more effective LLM-as-a-judge system.
Weaknesses
- The method lacks novelty. As the authors themselves acknowledge, the idea of using internal hidden states has been explored extensively since the introduction of BERT.
- My key concern is whether the lightweight training of layer weights can truly generalize across diverse evaluation benchmarks without additional fine-tuning. Fig. 2 illustrates that the correlation with human scores fluctuates significantly across layers, suggesting potential dataset dependency. This argument would be stronger if tested on a broader and more diverse set of benchmarks. The current evaluation, limited to three datasets, does not convincingly establish generalizability.
- Increasingly, modern evaluation benchmarks are label-scarce, more complex, or dynamically generated without human-annotated scores. It is unclear whether the proposed method is applicable or effective in such settings where scalar labels are unavailable or ill-defined.
- The work focuses exclusively on evaluations that assign discrete integer scores within a predefined range. It does not explore more general evaluation scenarios, such as those involving natural language feedback or descriptive labels. It remains uncertain whether the proposed method would be robust and adaptable in these more open-ended evaluation settings.
Questions
- As noted in the weaknesses, have you explored the method’s performance in more general or open-ended evaluation settings beyond predefined integer scores?
- Do you have additional experimental results on recent benchmarks that are very different from the 1,000 samples of the HelpSteer dataset used for training?
Limitations
The stated limitations should also acknowledge the narrow scope of benchmarks used, as well as the restricted evaluation scenario that assumes predefined, discrete integer score ranges.
Justification for Final Rating
The authors’ response has addressed some of my concerns regarding the diversity of benchmarks. I have adjusted my score accordingly.
Formatting Issues
- Missing space, Line #36: "2023a)to"
- Missing space, Line #95: ")over"
We sincerely appreciate the time and effort you have devoted to offering insightful and constructive comments on the paper. Below, we present our main responses to the key concerns raised:
Q1: The method lacks novelty. As the authors themselves acknowledge, the idea of using internal hidden states has been explored extensively since the introduction of BERT.
A1:
- We would like to clarify that there might be a misunderstanding, as we did not make such a claim in our paper. In lines 77–79, we specifically pointed out that "hidden representations across different layers have been widely observed to exhibit distinct characteristics". We will revise this description in the revised version to prevent potential misunderstandings.
- This work introduces a new perspective to enhance point-wise evaluation by leveraging intermediate layer representations of the model. Unlike traditional methods that rely solely on final layer outputs, we build a distinct evaluation framework based on an in-depth analysis of internal layers, laying the groundwork for further improving evaluation capabilities through deeper exploration of internal representations.
Q2: My key concern is whether the lightweight training of layer weights can truly generalize across diverse evaluation benchmarks without additional fine-tuning. Fig. 2 illustrates that the correlation with human scores fluctuates significantly across layers, suggesting potential dataset dependency. This argument would be stronger if tested on a broader and more diverse set of benchmarks. The current evaluation, limited to three datasets, does not convincingly establish generalizability.
A2:
- In Section 6, we conduct experiments across diverse scenarios, including sentiment understanding, data selection, and large model knowledge boundary recognition (without any additional training), and find that LAGER effectively generalizes to these scenarios while outperforming the baselines.
- Regarding the fluctuations in layer-wise correlations shown in Figure 2, the datasets (FLASK, BiGGen Bench, HelpSteer) indeed differ significantly in their design and objectives, yet we observe similar fluctuation trends across them. This suggests that while dataset-specific characteristics may have some influence, the inherent properties and representational differences of the model's intermediate layers exert a more dominant effect on evaluation performance.
- Our experiments cover multiple widely used benchmark datasets spanning diverse domains and evaluation criteria [1–6,10]. Broader validation could further strengthen the case for generalizability, but our current results already demonstrate strong performance.
Q3: Increasingly, modern evaluation benchmarks are label-scarce, more complex, or dynamically generated without human-annotated scores. It is unclear whether the proposed method is applicable or effective in such settings where scalar labels are unavailable or ill-defined.
A3:
- In this paper, we focus on the point-wise evaluation setting. Our method requires only a small amount of annotated data, demonstrates strong transferability, and incurs an acceptable cost for obtaining a limited set of high-quality annotations. While we appreciate your suggestion, it is beyond the scope of this paper, and we will address this issue in future work.
- In fact, this relates closely to recent studies [7][8][9], which have leveraged LLMs to automatically generate scoring rubrics and annotations, effectively addressing challenges related to scarce, unavailable, or ill-defined scalar labels. Meanwhile, our method focuses on enhancing the generalization and efficiency of structured scalar scoring tasks, ensuring stability and reliability. These directions are complementary: the former mitigates label scarcity, while the latter strengthens the robustness of evaluation methodology.
Q4: The work focuses exclusively on evaluations that assign discrete integer scores within a predefined range. It does not explore more general evaluation scenarios, such as those involving natural language feedback or descriptive labels. It remains uncertain whether the proposed method would be robust and adaptable in these more open-ended evaluation settings.
A4:
- Unstructured scoring performs worse than structured scalar scoring. In our early experiments, we also explored unstructured scoring methods (i.e., not confined to predefined discrete integer ranges), but found that the model's performance degraded significantly compared to structured scalar scoring.
- Point-wise evaluation suits open-ended tasks. Previous studies [1][10][11] have shown that point-wise scoring is widely adopted in LLM-as-a-judge research, being well-suited for open-ended tasks without unified standards and demonstrating strong generality and scalability. Notably, models can first perform explicit reasoning before providing a score, thereby enhancing the transparency and rationality of the evaluation.
- This work leverages the core idea of utilizing intermediate layer representations and lightweight weight training, aiming to capture a better mapping between the model's internal states and the evaluation criteria. While the current focus is on structured scalar scoring to establish a solid foundation, this core idea is not confined to integer scoring tasks. This lays the groundwork for future extensions to handle more complex and open-ended feedback forms (e.g., natural language descriptions).
Q5: As noted in the weaknesses, have you explored the method’s performance in more general or open-ended evaluation settings beyond predefined integer scores?
A5:
- We have also explored more general evaluation settings, as presented in Section 6, including sentiment understanding, data selection, and large model knowledge boundary recognition (all without any additional training). These tasks typically involve greater task variability and weaker supervision signals, yet our method consistently demonstrates strong robustness in such scenarios. While we have not yet extended the method to fully open-ended evaluation formats, its strong generalization across diverse tasks suggests promising potential for application in more open-ended evaluation settings, which we plan to pursue in future work.
Q6: Do you have additional experimental results on recent benchmarks that are very different from the 1,000 samples of the HelpSteer dataset used for training?
A6:
- We emphasize that our experimental benchmarks cover diverse evaluation scenarios and exhibit significant heterogeneity in task design, evaluation dimensions, and target scenarios, rather than being highly similar. Specifically, the FLASK dataset includes 12 fine-grained scoring dimensions, covering tasks from multiple different domains (e.g., Culture, Health, History); BiGGen Bench encompasses 77 tasks (e.g., Semantic Constraints and Code Revision) that evaluate various capabilities such as reasoning and instruction following; HelpSteer contains data annotated with multiple evaluation criteria, including helpfulness, correctness, coherence, complexity, and verbosity. The results on downstream tasks (presented in Section 6) are also sufficient to comprehensively evaluate the generalization ability of our method.
Thank you again for your time and effort in the review process. We will carefully revise the manuscript based on your suggestions and hope that our responses adequately address your concerns. We remain open to any further suggestions or requests for clarification you may have.
Reference
[1] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (ICLR 2024)
[2] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (EMNLP 2024)
[3] Aligning to Thousands of Preferences via System Message Generalization (NeurIPS 2024)
[4] Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions (AAAI 2024)
[5] Evaluating Scoring Bias in LLM-as-a-Judge
[6] RewardAnything: Generalizable Principle-Following Reward Models
[7] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (EMNLP 2023)
[8] MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers (WSDM 2025)
[9] Rationale Behind Essay Scores: Enhancing S-LLM’s Multi-Trait Essay Scoring with Rationale Generated by LLMs (NAACL 2025)
[10] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets (ICLR 2024 Spotlight)
[11] CodeJudge: Evaluating Code Generation with Large Language Models (EMNLP 2024)
Thank you for your response. It has addressed some of my concerns regarding the diversity of benchmarks. I have adjusted my score accordingly.
Thank you for your updated feedback and score adjustment!
We are delighted to learn that our responses have helped address some of your concerns regarding the diversity of benchmarks. Since your review also raised several other important points, we would sincerely appreciate it if you could let us know whether there are any remaining questions or issues that we should further clarify, so that we can ensure all your valuable feedback is fully addressed and further improve the paper.
We look forward to your further guidance.
Dear Reviewer 8jsL,
We have noted that you have re-evaluated and updated the ratings for our paper regarding quality, clarity, significance, and originality. The adjustment of these scores may reflect that there are still some issues in the paper that require further clarification. As the discussion period is coming to a close, although you mentioned in your initial feedback that we have partially addressed your concerns, we would still like to take this opportunity to engage in in-depth discussion and provide comprehensive responses to any remaining questions you may have. Your further feedback will be of significant value to the refinement and improvement of our paper.
We look forward to your response or acknowledgment once you have read our message, as your support is very important to us.
Best regards,
The Authors
The paper proposes LAGER (Layer Aggregation for Enhanced Representation), a lightweight framework to improve the alignment of LLM-based evaluators ("LLM-as-a-judge") with human judgments. Instead of relying solely on the final output layer (which may overlook nuanced signals), LAGER aggregates internal representations from multiple intermediate layers of frozen LLMs to generate finer-grained, probabilistic evaluation scores. On standard alignment benchmarks, this approach is reported to enhance judgment quality by up to 7.5% over the best baseline, and also to support strong downstream applications.
Strengths and Weaknesses
Strengths:
- The paper studies LLM-as-a-judge, a popular topic within the NeurIPS community.
- The presented approach is highly efficient in that it works with frozen backbone LLMs and requires minimal computational cost during training.
- Three human-annotated datasets are chosen for evaluation to avoid preference leakage, making the results robust and reliable.
Weaknesses:
- Despite the discussion in Section 3.4, the approach is mainly restricted to open-source models with accessible intermediate layers for ideal performance, so its applicability is still limited.
- The evaluation is limited to point-wise benchmarks, while pairwise/listwise evaluation is not involved.
Questions
Apart from the weaknesses, I want to see a bit more insight into why reasoning leads to weaker results for LAGER (with support from concrete examples if possible).
Limitations
Yes
Justification for Final Rating
The paper proposes LAGER (Layer Aggregation for Enhanced Representation), a lightweight framework to improve the alignment of LLM-based evaluators ("LLM-as-a-judge") with human judgments. It is technically solid and sufficiently validated in a comprehensive evaluation. My only concern is that it may not be applicable to proprietary LLMs. Overall, I recommend accepting this paper to the main research track.
Formatting Issues
N/A
We are truly grateful for your insightful feedback on our manuscript. Our key responses are summarized below:
Q1: Despite the discussion in Section 3.4, the approach is mainly restricted to open-source models with accessible intermediate layers for ideal performance, so its applicability is still limited.
A1:
- We acknowledge that our method relies on access to intermediate layer logits. However, if closed-source models were to provide access to internal representations (e.g., via enterprise APIs or special access programs), our method could be readily applied without modification and would be expected to perform similarly.
Q2: The evaluation is limited to point-wise benchmarks, while pairwise/listwise evaluation is not involved.
A2:
- In our study, we focus on point-wise evaluation, because pairwise and listwise methods have high computational complexity and limited scalability, which restricts their use in large-scale data synthesis and filtering. In contrast, point-wise evaluation is computationally simpler, adaptable to various standards, and thus better suited for practical applications.
- Following your suggestion, we evaluated our method on the RewardBench dataset, which is designed for pairwise comparisons. The results in the table indicate that although our approach is primarily designed to enhance performance in point-wise evaluation, it also leads to certain improvements in pairwise judgment. We will include these experiments and provide further analysis in the next version of the paper.
| Model | Method | Score |
|---|---|---|
| Meta-LLaMA-3.1-8B-Instruct | Vscore | 0.671 |
| | E-score | 0.736 |
| | LAGER | 0.749 |
| Mistral-7B-Instruct-v0.3 | Vscore | 0.610 |
| | E-score | 0.701 |
| | LAGER | 0.711 |
| InternLM3-8B-Instruct | Vscore | 0.714 |
| | E-score | 0.772 |
| | LAGER | 0.773 |
| Qwen2.5-14B-Instruct | Vscore | 0.776 |
| | E-score | 0.821 |
| | LAGER | 0.823 |
Q3: Apart from the weaknesses, I want to see a bit more insight into why reasoning leads to weaker results for LAGER (with support from concrete examples if possible).
A3:
- Through experimental analysis of the model's internal attention scores, we found that the weaker performance of LAGER under the reasoning setting is closely related to drift in the attention distribution. This is consistent with the findings of previous work [1][2][3]: as the reasoning steps progress, the model's focus on the original input and the text to be evaluated gradually diminishes, shifting more toward its self-generated reasoning trajectory. This attention shift weakens the model's direct perception of the source text, causing it to rely on its own reasoning chain when information is insufficient, thereby exhibiting overconfidence. As a result, the model tends to exaggerate supporting evidence while overlooking potential errors or shortcomings, leading to inflated scores. Below, we present a real evaluation example, followed by a sketch of how such an attention share can be measured.
A Real Evaluation Example (assessed by the LLaMA3.1-8B-Instruct model):
- Instruction: Solve this equation. Input: x^3 - 4x^2 + 6x - 24 = 0
- Response: Sure, I can help you with that. The equation x^3 - 4x^2 + 6x - 24 = 0 can be solved using the following steps: Factor the equation. Set each factor equal to 0 and solve for x. The factors of the equation are (x - 4)(x - 3)(x + 2) = 0. Setting each factor equal to 0 and solving for x, we get the following solutions: x = 4, x = 3, x = -2. Therefore, the solutions to the equation x^3 - 4x^2 + 6x - 24 = 0 are x = 4, x = 3, and x = -2.
- Evaluation Feedback: The response provides a clear step-by-step approach and attempts to factor the cubic equation. However, the factorization presented, (x - 4)(x - 3)(x + 2), is incorrect as it does not satisfy the original equation upon substitution. This is a significant flaw since correct factorization is critical for solving the equation properly. Despite this, the explanation is well-structured and easy to follow, which demonstrates solid reasoning skills. The response presents the solutions clearly and in an organized manner. Based on the rubric, considering the clarity and structure of the explanation alongside the attempt at solving, I would rate this response as highly accurate and logical. Score: 5
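A minimal, hedged sketch of how such an attention-share measurement could be carried out with Hugging Face Transformers is given below; the checkpoint name, the segment boundary, and the choice of inspecting only the last decoder layer are illustrative assumptions, not the exact analysis used above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used to build input_ids / locate source_end
model = AutoModelForCausalLM.from_pretrained(model_name)

def attention_share(input_ids: torch.Tensor, source_end: int):
    """Split the last-layer attention of the final token between the source segment
    (instruction + response under evaluation, tokens [0, source_end)) and the
    self-generated reasoning tokens ([source_end, end))."""
    with torch.no_grad():
        out = model(input_ids.unsqueeze(0), output_attentions=True)
    attn = out.attentions[-1][0]         # [heads, seq, seq], last decoder layer
    last_token = attn[:, -1, :].mean(0)  # average over heads -> [seq]
    return last_token[:source_end].sum().item(), last_token[source_end:].sum().item()
```

Tracking how the first component shrinks as more reasoning tokens are appended is one way to quantify the drift described above.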
Thank you again for your time and effort in the review process. We will carefully revise the manuscript based on your suggestions and hope that our responses adequately address your concerns. We remain open to any further suggestions or requests for clarification you may have.
Reference:
[1] When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations (ACL 2025)
[2] Mechanistically Demystifying Extreme-Token Phenomena in LLMs (NeurIPS 2024)
[3] Measuring and Controlling Instruction (In)Stability in Language Model Dialogs (COLM)
Thanks for clarifying my concerns and sorry for my delayed reply. I think the majority of my concerns have been addressed. Still, I would like to see more evidence that this approach works for closed-source models in the next version of the manuscript.
I have decided to raise my score to 5.
Thank you for your reply and score adjustment!
We are glad to have addressed your concerns. Your suggestions are truly valuable to us, and we will work on further enhancing the judging performance of closed-source models going forward.
Dear Reviewer Yh62,
We hope this message finds you well. As the discussion period is nearing its end with less than two days remaining, we would like to confirm whether we have adequately addressed all your concerns. If these clarifications help to resolve your doubts and improve your understanding of our work, we would be very grateful if you could consider giving our work a higher evaluation and support.
If you have any further comments or feedback, please feel free to let us know. Your insights are very important to us, and we are more than willing to address any remaining issues to further improve our work.
Best regards,
The Authors
This paper proposes LAGER, a framework for improving LLM-as-a-judge performance by aggregating score-token logits from all decoder layers of an LLM. Specifically, LAGER computes a weighted combination of layer-wise distributions and takes the expected score. Experimental results demonstrate improvements over several LLM-judge baseline methods.
Strengths and Weaknesses
Strengths
- The multi-layer aggregation approach is very simple and straightforward, using frozen backbones without modifying core LLM parameters.
- Empirical results demonstrate improvements across several selected benchmarks and model scales.
Weaknesses
- The proposed method relies on accessing the internal representations of the LLM, thus this limits applicability to open-source models; closed-source or API-only LLMs cannot benefit.
- The supervised weight-tuning step still requires human-annotated validation data.
- The experimental results are not thorough, missing important baselines, for example RewardBench and DPO rewarding. Also, LLM-as-a-Judge relies heavily on prompt templates, so it would be better to test on different popular prompts, which can be found in recent papers such as Self-Rewarding LLM.
Questions
- Will LAGER become better if it scales to more training data and parameters?
- Please also see the weaknesses.
Limitations
NA
Justification for Final Rating
I appreciate the authors’ effort in addressing individual points, but the core issues remain unmitigated, and the manuscript requires major revision. I therefore stand by my original negative assessment.
- My major concern is about the contribution and realistic applicability of the proposed method. The proposed aggregation mechanism fundamentally depends on accessing hidden states inside the base LLM. As the authors acknowledge, this effectively confines the method to open-source backbones. SOTA closed-source models (e.g., the GPT series, Claude, Gemini) do not expose these internals even for enterprise access, yet they dominate the usage of LLM-as-a-Judge workflows.
- The authors' defense that "supervised weight tuning is common" sidesteps the critique: the paper claims to require no additional annotations, yet it does. Also, the authors overclaim non-modification of the backbone, while they do introduce additional parameters for fine-tuning.
- Despite the rebuttal, the evaluation still omits critical baselines that could further strengthen the paper: models from the RewardBench leaderboard, including DPO rewarding.
Formatting Issues
No, I don't see major violations of the paper format, though I did not check the formatting carefully.
We appreciate the reviewers’ time and effort in evaluating our work. Below, we present our main responses to the key concerns raised:
Q1: The proposed method relies on accessing the internal representations of the LLM; thus, this limits applicability to open-source models. Closed-source or API-only LLMs cannot benefit.
A1:
- We acknowledge that our method relies on access to intermediate layer logits. However, if closed-source models were to provide access to internal representations (e.g., via enterprise APIs or special access programs), our method could be readily applied without modification and would be expected to perform similarly.
Q2: The supervised weight-tuning step still requires human-annotated validation data.
A2:
- This is a common requirement for most supervised training methods and should not be viewed as a weakness, especially for approaches focused on human alignment, as evidenced by the paper [1] cited by Reviewer HC1G.
- Minimal human-annotated data and lightweight training parameters. Our method trains only dozens of lightweight parameters using a few hundred annotated data points (about 5% of the data required by traditional methods), avoiding the need to fine-tune billions of backbone parameters.
- Strong generalization ability. Furthermore, as shown in Section 6, our method generalizes effectively, transferring to other domains while requiring substantially fewer annotations and incurring no significant additional costs, thus offering clear advantages over traditional approaches.
Q3: The experimental results are not thorough, missing important baselines, for example RewardBench and DPO rewarding. Also, LLM-as-a-Judge heavily rely on the prompt templates, so it would be better to test on different popular prompts which you can find in recent papers such as self-rewarding LLM.
A3:
- It is our view that RewardBench is fundamentally an evaluation benchmark intended to measure the performance of reward models, rather than a specific evaluation method. Moreover, we are unable to find a directly comparable baseline method reported in the RewardBench study.
- Similarly, DPO rewards are typically obtained by fine-tuning the base language model itself, which differs from our approach that only introduces lightweight aggregation parameters without altering the backbone model. We therefore believe that DPO rewarding cannot be used as a directly comparable baseline in our setting. If our understanding is incorrect, we would appreciate clarification on the role of RewardBench and DPO rewarding in the comparisons.
- Our primary prompt is adapted from templates widely adopted in prior evaluation studies [2][3][4], which have demonstrated strong and reliable baseline performance. While prompt design can influence performance, it is not the primary focus of our work. Nevertheless, we also tested two popular prompt variants (T1 refers to the template used in Self-Rewarding LLM, and T2 refers to the template used in G-Eval), and the conclusions were consistent with those obtained using our primary prompt, further confirming the robustness of our approach.
| Model | Method | Flask (T1) | Flask (T2) | HelpSteer (T1) | HelpSteer (T2) | BiGGen Bench (T1) | BiGGen Bench (T2) |
|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B-Instruct | Vscore | 0.339 | 0.325 | 0.380 | 0.346 | 0.264 | 0.245 |
| | E-Score | 0.389 | 0.370 | 0.454 | 0.445 | 0.361 | 0.354 |
| | LAGER (w. tuning) | 0.470 | 0.466 | 0.521 | 0.502 | 0.472 | 0.477 |
| Mistral-7B-Instruct-v0.3 | Vscore | 0.266 | 0.270 | 0.260 | 0.256 | 0.139 | 0.141 |
| | E-Score | 0.240 | 0.245 | 0.299 | 0.289 | 0.191 | 0.192 |
| | LAGER (w. tuning) | 0.344 | 0.351 | 0.402 | 0.396 | 0.358 | 0.347 |
| InternLM3-8B-Instruct | Vscore | 0.427 | 0.417 | 0.381 | 0.381 | 0.366 | 0.362 |
| | E-Score | 0.504 | 0.523 | 0.447 | 0.456 | 0.465 | 0.464 |
| | LAGER (w. tuning) | 0.521 | 0.534 | 0.503 | 0.481 | 0.476 | 0.493 |
Q4: Will LAGER become better if it scales to more training data and parameters?
A4:
- Increasing the model parameters with an MLP and enlarging the dataset did not result in significantly better performance. In preliminary experiments, replacing simple weighting with an MLP led to unstable training, likely due to increased complexity and limited data. Our current L+1 parameter approach is simpler and achieves strong performance. We also observed that performance saturates as data scales, with limited gains from additional data. In future work, we will explore larger datasets and more complex aggregations to evaluate scalability.
Thank you again for your time and effort in the review process. We will carefully revise the manuscript based on your suggestions and hope that our responses adequately address your concerns. We remain open to any further suggestions or requests for clarification you may have.
Reference
[1] HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition (ACL 2024)
[2] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (EMNLP 2023)
[3] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (ICLR 2024)
[4] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (EMNLP 2024)
Thank you for your response.
Q1
As far as I know, there is no way to access the internal representations of state-of-the-art closed-source models such as GPT, Claude, or Gemini; however, these models are used by the majority of LLM-as-a-Judge applications. This is the major concern for this paper.
Q2
If you claimed that "This is a common requirement for most supervised training methods", then you had better not overclaim in the paper that no annotation data is needed. I kindly suggest the authors revise this accordingly.
Q3
Sorry for the confusion. I was referring to the reward models on the RewardBench leaderboard; there are many strong generative and traditional reward models for comparison, including DPO rewarding.
DPO rewards are typically obtained by fine-tuning the base language model itself
Given that RLHF, SFT, or other post-training is common practice in LLM pipelines, one can easily obtain a base model (as the reference model) and an aligned model (as the policy model) for DPO rewarding. That means DPO rewarding does not require any further fine-tuning, while your method, trained on aligned models, still requires effort to train additional layers.
without altering the backbone model
Please do not overclaim: you add additional layers for tuning. Someone could use LoRA for fine-tuning; do you think that would not be modifying the backbone model? There is no real difference from "fine-tuning a model", since all of these approaches introduce additional parameters for tuning.
While prompt design can influence performance, it is not the primary focus of our work.
The proposed method relies heavily on prompt design; in my opinion, it would be better to analyze how prompt design impacts the proposed method. From the provided results, different prompts do make obvious differences in performance, which should be discussed in depth.
Increasing the model parameters with an MLP and enlarging the dataset did not result in significantly better performance.
Given the importance of scaling, it is necessary to analyze the proposed method in greater depth. I kindly suggest the authors supplement these results and enhance the paper in the future.
We sincerely thank you for your responses. Regarding the additional concerns you raised, our reply is as follows:
- As far as I know, there is no way to access the internal representations of state-of-the-art closed-source models such as GPT, Claude, or Gemini; however, these models are used by the majority of LLM-as-a-Judge applications. This is the major concern for this paper.
A1:
- Potential Availability of Closed-Source Models: At present, commercial models do not provide APIs that allow access to internal hidden states or logits. However, it is possible that such interfaces may become available in the future. In that case, given the superior effectiveness of our method compared to existing baselines, its application prospects would be even broader.
- Superior Performance of Open-Source Models: On Hugging Face's Judge-Arena and RewardBench leaderboards, open-source models such as LMUnit-Qwen2.5-72B have achieved better performance than commercial models like Gemini-2.5-Pro, demonstrating the strong potential of open-source models in LLM-as-a-Judge scenarios, with the dual advantages of lower cost and higher effectiveness.
- The aim of this paper is to focus on optimizing the performance of open-source models in LLM-as-a-Judge scenarios, while the optimization of commercial models falls beyond the scope of our work. Without access to model internals, it remains difficult to fine-tune or apply other modifications to commercial models beyond prompt design. In future work, we plan to explore performance improvements based on commercial models.
- If you claimed that "This is a common requirement for most supervised training methods", then you had better not overclaim in the paper that no annotation data is needed. I kindly suggest the authors revise this accordingly.
A2:
- We would like to clarify a possible misunderstanding. First, we did not make such a statement in the paper. Second, the related statement was made in response to your review comments. We would like to emphasize that for tasks involving human alignment, the use of human-aligned data is generally necessary.
- Please do not overclaim: you add additional layers for tuning. Someone could use LoRA for fine-tuning; do you think that would not be modifying the backbone model? There is no real difference from "fine-tuning a model", since all of these approaches introduce additional parameters for tuning.
A3:
- It seems there might be a slight misunderstanding regarding our method. We would like to emphasize:
- Our method introduces L+1 additional parameters to aggregate logits from different layers at the scoring token position, without affecting next-token prediction.
- In our method, the model still follows the vanilla architecture to predict the next token. Our approach does not alter the model’s hidden states, output logits, or predicted tokens.
- In contrast, methods such as LoRA and other PEFT approaches introduce hundreds of millions of additional parameters, modify the model’s hidden states during inference, thereby altering the logits and predicted tokens. Moreover, they require substantially more data and computational resources for parameter updates.
- Our approach should be more accurately described as post-hoc aggregation rather than fine-tuning. We will revise the manuscript accordingly to clarify this distinction and avoid any potential misunderstanding.
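For concreteness, a minimal sketch of such a post-hoc aggregation step is given below; the function and variable names are illustrative assumptions, not the authors' exact implementation:

```python
import torch

def aggregate_score(per_layer_logits: torch.Tensor,
                    layer_weights: torch.Tensor,
                    score_values: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of post-hoc cross-layer aggregation (names and shapes assumed).

    per_layer_logits: [L+1, K] logits each layer assigns to the K candidate score
                      tokens at the scoring position (the backbone stays frozen).
    layer_weights:    [L+1] the only tuned parameters.
    score_values:     [K], e.g. torch.arange(1., 6.) for a 1-5 rating scale.
    """
    w = torch.softmax(layer_weights, dim=0)                 # normalized layer weights
    fused = (w[:, None] * per_layer_logits).sum(dim=0)      # [K] fused score-token logits
    probs = torch.softmax(fused, dim=0)                     # distribution over scores
    return (probs * score_values).sum()                     # expected, fine-grained score
```

Under this reading, only `layer_weights` is tuned on the small validation set, and next-token generation itself is untouched, which is why the step reads as aggregation rather than fine-tuning.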
- 4.1 Sorry for the confusion. I was referring to the reward models on the RewardBench leaderboard; there are many strong generative and traditional reward models for comparison, including DPO rewarding.
- 4.2 Given that RLHF, SFT, or other post-training is common practice in LLM pipelines, one can easily obtain a base model (as the reference model) and an aligned model (as the policy model) for DPO rewarding. That means DPO rewarding does not require any further fine-tuning, while your method, trained on aligned models, still requires effort to train additional layers.
A4:
- Thank you for your further clarification. We provide responses and comparisons below for the three cases you mentioned.
- (1) Since these generative reward models are of the same type as those used in our paper, our method can likewise adopt them as backbone models to enhance performance. As a result, they do not constitute independent or meaningful baselines for comparison.
- (2) Although DPO rewarding does not require additional training, it necessitates the deployment of two different models, which is impractical in scenarios with limited GPU memory. Moreover, since it involves a larger number of model parameters, it is not directly comparable to our approach.
- (3) For traditional reward models, we selected the SOTA reward model Skywork-Reward-V2-Llama-3.1-8B on RewardBench2 for comparison to verify the effectiveness of our method. As shown in the table, the RM exhibits significant performance differences across benchmarks (e.g., only 0.207 on HelpSteer), demonstrating the instability of traditional RMs. In contrast, our method applied to the same backbone model achieves significant improvements across all benchmarks, with an average score of 0.491, not only surpassing the specially trained reward model (0.386) but also showing more stable performance, fully demonstrating the advantages of our method in terms of evaluation capability and stability.
| Model | Flask | HelpSteer | BiGGen Bench | Average |
|---|---|---|---|---|
| Skywork-Reward-V2-Llama-3.1-8B | 0.495 | 0.207 | 0.457 | 0.386 |
| Llama-3.1-8B-Instruct | 0.334 | 0.374 | 0.273 | 0.327 |
| Llama-3.1-8B-Instruct + LAGER (w. tuning) | 0.477 | 0.515 | 0.482 | 0.491 |
- The proposed method relies heavily on prompt design; in my opinion, it would be better to analyze how prompt design impacts the proposed method. From the provided results, different prompts do make obvious differences in performance, which should be discussed in depth.
A5:
- We would like to clarify that the prompt template used in our work is the standard configuration widely adopted in the LLM-as-a-Judge community for point-wise evaluation, such as those used in G-Eval [1], Prometheus [2], and JudgeBench [3], as well as others [4–7]. We acknowledge that different prompts may influence the specific outcomes; however, we did not deliberately design the prompt template to enhance our method's performance.
- Rather, our method itself does not rely on a specific prompt template. Our method employs the same prompt template as all baselines and achieves superior performance. Moreover, as mentioned in our preliminary response to your review, experimental results have verified that our method outperforms baselines across different prompt templates, further demonstrating the robustness of our approach.
- Given the importance of scaling, it is necessary to analyze the proposed method in greater depth. I kindly suggest the authors supplement these results and enhance the paper in the future.
A6:
- Our method uses only about 1,000 data samples and performs a 250-step search to determine L+1 parameters (fewer than 100, where L denotes the number of transformer layers in the model). Preliminary experiments show that increasing the number of parameters or extending the search steps does not yield significant performance improvements. Although weighting based on layer-wise hidden states using these parameters is possible, it requires storing multi-layer data in advance, resulting in additional computational and storage overhead without achieving the expected benefits. Given the simplicity, efficiency, and superior performance of our method, we recommend adopting this solution.
We remain open to any further suggestions or requests for clarification you may have.
Reference:
[1] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (EMNLP 2023)
[2] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (ICLR 2024)
[3] JudgeBench: A Benchmark for Evaluating LLM-based Judges (ICLR 2025)
[4] A Survey on LLM-as-a-Judge
[5] BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge (ICLR 2025)
[6] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (EMNLP 2024)
[7] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023)
Dear Reviewer DQpf,
We hope this message finds you well. As the discussion period is nearing its end with less than two days remaining, we would like to confirm whether we have adequately addressed all your concerns. If these clarifications help to resolve your doubts and improve your understanding of our work, we would be very grateful if you could consider giving our work a higher evaluation and support.
If you have any further comments or feedback, please feel free to let us know. Your insights are very important to us, and we are more than willing to address any remaining issues to further improve our work.
Best regards,
The Authors
LAGER introduces a training-free (optionally tunable) way to boost LLM-as-a-judge quality by looking inside the model: it linearly combines the logits that every decoder layer assigns to each candidate score and outputs the expectation of that cross-layer distribution. Because middle-to-upper layers encode semantics that correlate more closely with human opinions, this fusion yields finer-grained, better-calibrated judgements than the usual "take the top token" heuristic. Tested on three alignment benchmarks, namely Flask, HelpSteer, and BiGGen Bench, LAGER raises Spearman agreement with human scores by up to 7.5% over the strongest existing baseline, while requiring no chain-of-thought prompting. It further enhances tasks such as instruction-data selection and emotional understanding, all while maintaining the backbone's frozen state for speed and portability.
Strengths and Weaknesses
Pros:
- LAGER (esp. the one with fine-tuning) significantly improves alignment with human preferences. Its innovative cross-layer scoring yields up to a 7.5 percentage-point Spearman correlation boost over GPTScore, VScore, and E-Score across three benchmarks.
- The method keeps the backbone model frozen, training only a small set of fusion weights (or none in the zero-tuning variant). This means inference is as efficient as a single forward pass, and it avoids the expensive reasoning prompts used by many judges.
Cons:
- The paper does not compare LAGER against several relevant evaluators that also target human alignment, such as G-Eval (reasoning-based) or HD-Eval (which also leverages multi-layer information). These comparisons are important to contextualise LAGER’s gains.
- GPT-4o-mini achieved very strong results, raising questions about the impact of model scale. However, the current experiments focus only on cross-family comparisons. It would strengthen the work to test LAGER across different sizes within the same family (e.g., Qwen 2.5B, 7B, 14B).
- While LAGER with tuning performs best overall, the tuning appears to succeed on some backbones (e.g., Qwen) but fails or regresses on others (e.g., Mistral, LLaMA). Yet the discussion (p. 6) focuses only on LAGER w/ tuning (without comparing to the variants w/o tuning), making the discussion a bit insufficient.
- Compared to E-Score, LAGER without tuning performs worse (or on par) on several backbones (e.g., InternLM, Qwen), suggesting that the observed gains largely stem from fine-tuning. This makes the comparisons not quite fair, comparing a tuned model to untuned baselines.
References:
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
- Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. 2024. HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7641–7660, Bangkok, Thailand. Association for Computational Linguistics.
Questions
See my comments above.
Limitations
See my comments above.
Justification for Final Rating
The response has addressed most of my concerns. Though I still think the discussion should be further improved, this is not a major reason for rejecting the paper.
Formatting Issues
None.
Thank you for your thoughtful feedback and constructive suggestions. Our key responses are summarized below:
Q1: The paper does not compare LAGER against several relevant evaluators that also target human alignment, such as G-Eval (reasoning-based) or HD-Eval (which also leverages multi-layer information). These comparisons are important to contextualise LAGER’s gains.
A1:
- G-Eval and E-Score: Essentially the Same. We would like to clarify that G-Eval is essentially the same as the E-score baseline included in our experiments. Moreover, the G-Eval (reasoning-based) evaluation results you mentioned are also incorporated within our experimental comparisons (see the performance of E-score under the reasoning setting on each dataset in Tables 1 and 6).
- LAGER outperforms HD-Eval. Regarding the HD-Eval baseline, we would like to clarify: HD-Eval enhances the alignment between LLM evaluators and human judgments by decomposing evaluation into fine-grained criteria ("layers" refer to fine-grained decomposition stages, not model layers) and employing human-guided aggregation. This is fundamentally different from our approach, which utilizes intermediate layer representations of the model. We compared HD-Eval (chat task setting) with LAGER in evaluation performance. As shown in the table below, HD-Eval's performance is lower than our method's, primarily because it requires manually customizing fine-grained criteria for different tasks, making it difficult to generalize effectively across the diverse evaluation datasets used in our paper.
| Model | Method | Flask | HelpSteer | BiGGen Bench |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct | HD-Eval | 0.29 | 0.243 | 0.214 |
| | LAGER (w. tuning) | 0.46 | 0.524 | 0.444 |
| Mistral-7B-Instruct-v0.3 | HD-Eval | 0.256 | 0.271 | 0.248 |
| | LAGER (w. tuning) | 0.298 | 0.376 | 0.333 |
| InternLM3-8B-Instruct | HD-Eval | 0.321 | 0.254 | 0.351 |
| | LAGER (w. tuning) | 0.489 | 0.474 | 0.49 |
| Qwen2.5-14B-Instruct | HD-Eval | 0.443 | 0.231 | 0.372 |
| | LAGER (w. tuning) | 0.572 | 0.472 | 0.524 |
Q2: GPT-4o-mini achieved very strong results, raising questions about the impact of model scale. However, the current experiments focus only on cross-family comparisons. It would strengthen the work to test LAGER across different sizes within the same family (e.g., Qwen 2.5B, 7B, 14B).
A2:
- LAGER brings improvements across model scales. In fact, we have already conducted such an analysis in Section 5.3. This section includes extensive experiments across multiple sizes within the same Qwen2.5 family, ranging from 0.5B to 72B. These results demonstrate that LAGER consistently improves evaluation performance across a wide range of model scales. Moreover, larger models generally tend to achieve better performance.
Q3: While LAGER with tuning performs best overall, the tuning appears to succeed on some backbones (e.g., Qwen) but fails or regresses on others (e.g., Mistral, LLaMA). Yet the discussion (p. 6) focuses only on LAGER w/ tuning (without comparing to the ones w/ tuning), making the discussion a bit insufficient.
A3:
- We would like to clarify that across all evaluation datasets, LAGER (w. tuning) outperforms E-score by an average of over 6% on both backbones. The number of instances in which LAGER with tuning underperforms E-Score is 3 out of 36. It is worth noting that in these cases, the performance of LAGER with tuning is comparable to that of E-Score, with differences of less than 0.5%.
- We would like to emphasize that our tuning-based method demonstrates strong transferability, as validated by the multi-task experiments in Section 6. Even without tuning, our method outperforms E-Score in 5 out of 6 backbone models, with an average improvement of 4.5%. In the few cases where it underperforms, the performance gap is within 3%. Moreover, it consistently outperforms V-Score. Therefore, we primarily recommend the LAGER with tuning approach and conduct subsequent analyses based on it.
Q4: Compared to E-Score, LAGER without tuning performs worse (or on-par) on several backbones (e.g., InternLM, Qwen), suggesting that the observed gains largely stem from fine-tuning. This makes the comparisons not quite fair, comparing a tuned model to unturned baselines.
A4:
- Our method does not involve fine-tuning the backbone language model. LAGER only introduces L+1 lightweight parameters (where L is the number of transformer layers) to aggregate multi-layer logits for more accurate scoring. This approach is fundamentally different from traditional fine-tuning: we do not update any of the original backbone model's parameters.
- Except for InternLM, our method outperforms E-Score by an average of 4.5 points across all other model backbones, even without introducing any additional weight parameters. As a result, LAGER introduces minimal computational overhead and is highly scalable. The performance gains of LAGER stem not from large-scale fine-tuning, but from a more efficient and structured utilization of the model's internal representations.
Thank you again for your time and effort in the review process. We will carefully revise the manuscript based on your suggestions and hope that our responses adequately address your concerns. We remain open to any further suggestions or requests for clarification you may have.
Thank you for your detailed response. It has addressed most of my previous concerns. Although I still believe the discussion could be improved, particularly by including more comparisons between the tuning and non-tuning settings, it is no longer a major reason for rejection. I have therefore updated my score accordingly.
Thank you for your thoughtful feedback and for updating your score.
We highly appreciate your suggestion to include more comparisons between the tuning and non-tuning settings. As discussed in Section 5.1, due to the differences in the information encoded at different layers, treating all layers equally in the non-tuning setting is not reasonable. By applying tuning, we can more effectively aggregate information from various layers. Specifically, our experimental results demonstrate that under the tuning setting, LAGER achieves better average performance across three benchmarks compared to the non-tuning setting.
We will further revise the discussion section to incorporate additional analyses, ensuring a more comprehensive comparison in the final version.
This work proposes a fine-grained judgement framework to enhance the alignment between humans and LLM-as-a-judge using internal representations across different layers. The method mainly aggregates cross-layer logits into a softmax distribution to provide a score-based judgement. It is evaluated on different alignment benchmarks and shown to improve performance across them. The authors also evaluate their method on downstream emotional understanding and data selection tasks to show its effectiveness.
The main strengths of this work include: (1) the proposed method is lightweight (tuning only a small set of weights, or none in the zero-tuning variant) while delivering large performance gains across different benchmarks; (2) the evaluation is comprehensive, with additional analysis of the internal layers of different LLMs; (3) this work requires only a few annotated data samples to tune the model.
The main weaknesses of this work are: (1) it needs more discussion of different evaluators for this task; (2) the explanation of using L+1 lightweight parameters without updating the original backbone model should be much clearer. Although the major concern from reviewer DQpf is that this work is limited to open-source LLMs, the performance of open-source LLMs-as-a-judge is not significantly behind that of closed-source ones, and since this work relies on the internal layers of LLMs, closed-source LLMs are inherently out of reach for it.