PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 4, 5, 5, 3 (min 3, max 5, std 0.8)
Confidence: 3.3
Novelty: 3.0 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Activation-Informed Merging of Large Language Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

AIM introduces a new scheme to improve merging performance in LLMs by combining continual learning principles and activation space-based model compression.

Abstract

Keywords
Model Merging · Activation-Informed Merging · Large Language Models

Reviews and Discussion

Review (Rating: 4)

This paper proposes Activation-Informed Merging (AIM), a simple yet effective technique to enhance model merging for fine-tuned LLMs by leveraging activation information from the base model. By computing input activation magnitudes from a small calibration set, AIM identifies salient channels and constrains their updates during merging, thereby preserving critical base capabilities.
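To make the mechanism concrete, the following is a minimal PyTorch sketch of the idea as summarized above, applied to a single linear layer. The per-channel scoring, the normalization, and the blending rule are illustrative assumptions; the paper's own formulation (Equation 4) is not reproduced here.

```python
import torch

def channel_saliency(calib_acts: torch.Tensor) -> torch.Tensor:
    # calib_acts: (num_tokens, in_features) inputs to a linear layer, collected
    # by running the *base* model on a small calibration set.
    scale = calib_acts.abs().mean(dim=0)      # per-input-channel magnitude
    return scale / scale.max()                # normalize to [0, 1]

def aim_adjust(w_base: torch.Tensor, w_merged: torch.Tensor,
               saliency: torch.Tensor, omega: float = 0.5) -> torch.Tensor:
    # Hypothetical blending rule: the more salient an input channel is for the
    # base model, the more the merged weights on that channel are pulled back
    # toward the base weights.
    keep = omega * saliency                   # (in_features,)
    return keep * w_base + (1.0 - keep) * w_merged

# Toy usage with random tensors standing in for one layer.
torch.manual_seed(0)
acts = torch.randn(1024, 64)                                 # calibration activations
w_b, w_m = torch.randn(32, 64), torch.randn(32, 64)          # base vs. merged weights
print(aim_adjust(w_b, w_m, channel_saliency(acts)).shape)    # torch.Size([32, 64])
```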

Strengths and Weaknesses

Strengths

  1. The paper is clearly written and easy to follow.

  2. The method can be plugged into many merging methods with minimal overhead.

  3. The method is simple and efficient, and it consistently improves performance across benchmarks.

Weaknesses

  1. All experiments are conducted on a single model (LLaMA-2-13B). It remains unclear whether AIM is robust across different model families and scales.

  2. Most evaluations are limited to merging two or three models. It is unclear how AIM scales when merging larger pools of fine-tuned experts in diverse domains.

Questions

If a particular channel exhibits high activation or gradient magnitude, it may be critical for both the base and fine-tuned models. Would overly constraining its update risk suppressing important adaptations from the fine-tuned model and thereby hurt performance?

Limitations

The paper does not include a Limitations section. The authors should discuss potential weaknesses of the method, particularly its generalization ability across models, beyond the tested settings.

Justification for Final Rating

Thank you for the detailed response and new experiments. The added results on Qwen 2.5 VL help address some concerns about generalization. I will keep my score, as I believe it fairly reflects the current scope.

Formatting Issues

NA

Author Response

We thank Reviewer j7VA for the positive feedback and for highlighting the clarity, simplicity, and efficiency of our proposed method. We appreciate the opportunity to address the weaknesses and questions raised.

Validating Generalizability Across Architectures and Modalities

To address the reviewer's valid point on generalization, we conducted new experiments on the Qwen 2.5 VL family, which differs from the models in our original paper in architecture (Qwen), parameter count (7B), and modality (vision-language). To assess AIM in this setup, we merged the Video-R1 model, a video reasoning model [1] fine-tuned from Qwen2.5-VL-7B-Instruct, and the CAD-Coder model, trained on the same architecture for image-to-code CAD geometry generation [2]. The merge therefore involves two experts, one for CAD and one for video reasoning, on top of an instruction-tuned base model. To assess performance, we evaluate IFEval and MMLU (base-model knowledge and instruction following), the video reasoning benchmarks Video-MMMU [3] and VSI-Bench [4], and the CAD generation benchmarks CAD-Coder (on rendered CAD images) [2] and CAD-Coder Real (on real 3D-printed images) [2]. As shown in the table below, which we will include in the final paper, AIM consistently improves the performance of the underlying merging method across diverse tasks, and the overall hypervolume gain is raised in all merging scenarios. This confirms that AIM's benefits are not limited to the Llama-2 architecture and that AIM truly provides a path towards higher-quality merging with little computational and data overhead.

| Method | AIM | IFEval | MMLU | VSI-Bench (Mean Relative Accuracy) | Video MMMU | CAD-Coder | CAD-Coder Real | HV Gain |
|---|---|---|---|---|---|---|---|---|
| Instruct | - | 0.685 | 0.667 | 0.210 | 0.474 | 0.040 | 0.049 | - |
| Video-R1 | - | 0.643 | 0.610 | 0.383 | 0.492 | 0.000 | 0.025 | - |
| CAD-Coder | - | 0.268 | 0.657 | 0.000 | 0.002 | 0.614 | 0.324 | - |
| TIES | No | 0.571 | 0.677 | 0.230 | 0.406 | 0.610 | 0.324 | 0.438 |
| TIES | Yes | 0.577 (+0.93%) | 0.675 (-0.19%) | 0.225 (-2.24%) | 0.454 (+12.05%) | 0.647 (+6.01%) | 0.317 (-2.28%) | 0.448 (+2.26%) |
| DARE TIES | No | 0.540 | 0.669 | 0.246 | 0.419 | 0.537 | 0.318 | 0.430 |
| DARE TIES | Yes | 0.559 (+3.49%) | 0.670 (+0.23%) | 0.264 (+7.21%) | 0.447 (+6.63%) | 0.551 (+2.64%) | 0.321 (+0.73%) | 0.445 (+3.47%) |
| DARE Task Arithmetic | No | 0.246 | 0.645 | 0.252 | 0.429 | 0.542 | 0.326 | 0.380 |
| DARE Task Arithmetic | Yes | 0.559 (+127.07%) | 0.670 (+3.96%) | 0.258 (+2.37%) | 0.452 (+5.44%) | 0.542 (-0.13%) | 0.336 (+2.94%) | 0.446 (+17.42%) |
| WIDEN | No | 0.349 | 0.661 | 0.084 | 0.261 | 0.611 | 0.333 | 0.318 |
| WIDEN | Yes | 0.349 (+0.00%) | 0.661 (+0.00%) | 0.138 (+63.57%) | 0.320 (+22.55%) | 0.642 (+4.92%) | 0.334 (+0.23%) | 0.360 (+13.24%) |

On Scaling with Larger Pools of Models

Regarding the question of how AIM scales when merging larger pools of experts, this is a valid point about the scope of our experiments. AIM is designed as a complementary plug-in that improves existing merging methods. As such, its ability to scale is inherently tied to the scalability of the underlying algorithm (e.g., DARE, Ties-Merging) it is paired with. Since AIM consistently improves the 2- and 3-model merging scenarios tested, we believe it will provide similar benefits in larger-scale merging settings, provided the base merging algorithm is effective in that context.

Balancing Base Model Preservation and Expert Adaptation

The reviewer raises a crucial point about the risk of suppressing important fine-tuned adaptations. This is the central trade-off our method is designed to manage. The 'ω' hyperparameter in Equation 4 was introduced for this exact purpose: to control the strength of the regularization. The role of 'ω' is to allow a practitioner to find the right balance between preserving the base model's general capabilities and integrating the new, fine-tuned adaptations. Our extensive experiments, which show significant performance gains across multiple benchmarks, demonstrate that we can find a 'ω' value that strikes an effective balance, successfully mitigating catastrophic forgetting without suppressing essential knowledge from the expert models.
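As a toy illustration of this trade-off (reusing the hypothetical blending rule from the earlier sketch, not the paper's Equation 4): ω = 0 leaves the merged, expert-adapted weights untouched, while larger ω pulls salient channels progressively back toward the base model.

```python
import torch

def aim_adjust(w_base, w_merged, saliency, omega):
    keep = omega * saliency                       # per-channel pull-back strength
    return keep * w_base + (1.0 - keep) * w_merged

torch.manual_seed(0)
saliency = torch.rand(64)                         # toy per-channel saliency in [0, 1]
w_base, w_merged = torch.randn(32, 64), torch.randn(32, 64)

for omega in (0.0, 0.25, 0.5, 1.0):
    w = aim_adjust(w_base, w_merged, saliency, omega)
    # Fraction of the expert update that survives, relative to the raw merge.
    retained = (w - w_base).norm() / (w_merged - w_base).norm()
    print(f"omega={omega:.2f}  retained expert update: {retained:.3f}")
```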

We believe these clarifications and new experimental results directly address the points raised and hope they will convince the reviewer to raise their score. We thank the reviewer again for their constructive engagement with our work.

[1] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., … Yue, X. (2025). Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2503.21776.

[2] Doris, A. C., Alam, M. F., Nobari, A. H., & Ahmed, F. (2025). CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2505.14646.

[3] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., … Liu, Z. (2025). Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2501.13826.

[4] Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., & Xie, S. (2025). Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2412.14171

Comment

Thank you for the detailed response and new experiments. The added results on Qwen 2.5 VL help address some concerns about generalization. I will keep my score, as I believe it fairly reflects the current scope.

Comment

Thank you for your timely response and for acknowledging the rebuttal. We are happy to see that the added experiments have addressed your concerns about generalizability.

Review (Rating: 5)

This paper introduces Activation-Informed Merging (AIM), a complementary technique for combining multiple fine-tuned Large Language Models (LLMs) that originate from the same base model. Drawing from continual learning principles to avoid catastrophic forgetting, AIM uses a task-agnostic calibration dataset to identify and protect the most critical weights of the base model by analyzing its activation space. The authors demonstrate that applying AIM to existing merging methods significantly enhances performance across various benchmarks by up to 40%. The work also proposes a new evaluation metric, hypervolume gain, to assess the multi-objective improvements of merged models.
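For readers unfamiliar with hypervolume, the following is a generic two-objective sketch (maximization, reference point at the origin); the paper's exact hypervolume gain definition, objectives, and reference point are not reproduced here.

```python
from typing import List, Tuple

def hypervolume_2d(points: List[Tuple[float, float]],
                   ref: Tuple[float, float] = (0.0, 0.0)) -> float:
    # Area dominated by a set of points (both objectives maximized) above `ref`.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:                        # point extends the dominated region
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

# Toy example: scores of candidate models on two benchmarks (e.g., math, code).
baseline = [(0.40, 0.35)]
merged = [(0.48, 0.33), (0.42, 0.44)]
gain = hypervolume_2d(baseline + merged) - hypervolume_2d(baseline)
print(f"hypervolume gain: {gain:.4f}")
```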

Strengths and Weaknesses

Strengths

  • The paper combines ideas from model compression (activation-aware quantization) and continual learning, and applies them to model merging. This offers a fresh perspective in a field dominated by weight-space-only methods.
  • The proposed AIM method demonstrates performance gains, with improvements of up to 40% reported on standard benchmarks.
  • Introduces an evaluation metric, hypervolume gain, which captures the multi-objective nature of merging.

Weaknesses

  • All experiments are conducted on a single model family (Llama-2-13B). The paper does not show whether the findings generalize to other model architectures or scales.
  • The method relies on a calibration dataset, adding an extra step not required by pure weight-space methods. While claimed to be robust, this is still an added dependency.

Questions

  1. Your experiments provide evidence for AIM's effectiveness, but they are exclusively focused on models fine-tuned from the Llama-2-13b architecture. Could you comment on why you expect these results to generalize to other model architectures (e.g., Mistral, Gemma) and scales?
  2. Have you considered how activation information from the fine-tuned expert models themselves could be used to inform the merge? For instance, how would you handle a scenario where the same neurons are salient for different tasks in conflicting ways?
  3. How sensitive is the resulting performance of a merged model to the domain and quality of the calibration data? For example, how would AIM perform if the calibration set was highly domain-specific (e.g., only legal or medical texts) and misaligned with the general pre-training distribution?

Limitations

yes

Justification for Final Rating

After the rebuttal, I am happy to see new results showing that AIM works for other model families. I have therefore decided to raise my score.

Formatting Issues

n/a

Author Response

We thank Reviewer nmzP for their constructive feedback. To directly address the valid points raised regarding generalizability and data dependency, we have conducted new experiments and are providing further clarification on our methodology below.

Validating Generalizability Across Architectures and Modalities

To address the reviewer's valid point on generalization, we conducted new experiments on the Qwen 2.5 VL family, which differs from the models in our original paper in architecture (Qwen), parameter count (7B), and modality (vision-language). To assess AIM in this setup, we merged the Video-R1 model, a video reasoning model [2] fine-tuned from Qwen2.5-VL-7B-Instruct, and the CAD-Coder model, trained on the same architecture for image-to-code CAD geometry generation [3]. The merge therefore involves two experts, one for CAD and one for video reasoning, on top of an instruction-tuned base model. To assess performance, we evaluate IFEval and MMLU (base-model knowledge and instruction following), the video reasoning benchmarks Video-MMMU [4] and VSI-Bench [5], and the CAD generation benchmarks CAD-Coder (on rendered CAD images) [3] and CAD-Coder Real (on real 3D-printed images) [3]. As shown in the table below, which we will include in the final paper, AIM consistently improves the performance of the underlying merging method across diverse tasks, and the overall hypervolume gain is raised in all merging scenarios. This confirms that AIM's benefits are not limited to the Llama-2 architecture and that AIM truly provides a path towards higher-quality merging with little computational and data overhead.

| Method | AIM | IFEval | MMLU | VSI-Bench (Mean Relative Accuracy) | Video MMMU | CAD-Coder | CAD-Coder Real | HV Gain |
|---|---|---|---|---|---|---|---|---|
| Instruct | - | 0.685 | 0.667 | 0.210 | 0.474 | 0.040 | 0.049 | - |
| Video-R1 | - | 0.643 | 0.610 | 0.383 | 0.492 | 0.000 | 0.025 | - |
| CAD-Coder | - | 0.268 | 0.657 | 0.000 | 0.002 | 0.614 | 0.324 | - |
| TIES | No | 0.571 | 0.677 | 0.230 | 0.406 | 0.610 | 0.324 | 0.438 |
| TIES | Yes | 0.577 (+0.93%) | 0.675 (-0.19%) | 0.225 (-2.24%) | 0.454 (+12.05%) | 0.647 (+6.01%) | 0.317 (-2.28%) | 0.448 (+2.26%) |
| DARE TIES | No | 0.540 | 0.669 | 0.246 | 0.419 | 0.537 | 0.318 | 0.430 |
| DARE TIES | Yes | 0.559 (+3.49%) | 0.670 (+0.23%) | 0.264 (+7.21%) | 0.447 (+6.63%) | 0.551 (+2.64%) | 0.321 (+0.73%) | 0.445 (+3.47%) |
| DARE Task Arithmetic | No | 0.246 | 0.645 | 0.252 | 0.429 | 0.542 | 0.326 | 0.380 |
| DARE Task Arithmetic | Yes | 0.559 (+127.07%) | 0.670 (+3.96%) | 0.258 (+2.37%) | 0.452 (+5.44%) | 0.542 (-0.13%) | 0.336 (+2.94%) | 0.446 (+17.42%) |
| WIDEN | No | 0.349 | 0.661 | 0.084 | 0.261 | 0.611 | 0.333 | 0.318 |
| WIDEN | Yes | 0.349 (+0.00%) | 0.661 (+0.00%) | 0.138 (+63.57%) | 0.320 (+22.55%) | 0.642 (+4.92%) | 0.334 (+0.23%) | 0.360 (+13.24%) |

Validating the Robustness of the Calibration Method

We thank the reviewer for raising the question about the dependency on and sensitivity to calibration data. To quantify the impact of this "extra step," we performed an ablation study on the calibration set size. Our findings confirm our initial claim and show that AIM is highly robust to the size of the calibration set: significant hypervolume gains are observed even with a very small, task-agnostic dataset (as few as 8 blocks, 512 tokens each, from the Pile [1] corpus). This low data requirement makes AIM easily applicable and a versatile approach for improved model merging, and we plan to include these results in the final manuscript. Specifically, we apply AIM to the merging of all three experts from our main experiment using the DARE TIES method, vary the calibration dataset size, and present the results below.

Table: Hypervolume Gain vs Calibration Set Size (DARE TIES Merging of All Three Experts)

| Calibration Set Size (Blocks) | Hypervolume Gain |
|---|---|
| 0 (Without AIM) | 0.1717 |
| 1 | 0.2265 |
| 2 | 0.2373 |
| 4 | 0.2348 |
| 8 | 0.2427 |
| 16 | 0.2375 |
| 32 | 0.2424 |
| 64 | 0.2385 |
| 128 | 0.2370 |
| 256 | 0.2433 |

These results clearly show that with as few as 8 blocks (512 tokens each), the performance boost of AIM stabilizes, which validates our claims of robustness to calibration data size.

Regarding Activations from Expert Models

Our method, AIM, was designed to use the base model's activations to preserve foundational knowledge during a merge. The positive results confirm our core thesis: there are unrealized gains to be had by incorporating activation data into the merging process. Using the expert models' activations would be an interesting follow-up. That approach would focus on retaining specialized skills rather than general ones and would need to address the key challenge of how to merge when different tasks mark the same neurons as important, which is itself an interesting direction for future research.

In summary, our new experiments validate AIM's generalizability across architectures and demonstrate the minimal, principled nature of its data requirements. We believe these additions substantially strengthen the paper and address the primary concerns raised, and we hope they will convince the reviewer to raise their score. We thank the reviewer again for their constructive engagement with our work.

[1] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, S. (2020). The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[2] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., … Yue, X. (2025). Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2503.21776.

[3] Doris, A. C., Alam, M. F., Nobari, A. H., & Ahmed, F. (2025). CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2505.14646.

[4] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., … Liu, Z. (2025). Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2501.13826.

[5] Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., & Xie, S. (2025). Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2412.14171

Comment

Thank you very much for the rebuttal and the additional results. It is great to see that AIM's benefits are not just limited to the Llama-2 architecture.

Comment

Thank you for responding in a timely manner. We are pleased to see that you are satisfied with the response and hope that you are willing to raise your score in light of the new experiments.

Review (Rating: 5)

This paper introduces a model merging approach to enhance large language model (LLM) merging by integrating activation-space information. Unlike traditional methods focused solely on weight space, the proposed method leverages a task-agnostic calibration set and principles from continual learning to selectively preserve critical base-model weights, mitigating catastrophic forgetting while integrating fine-tuned knowledge. Empirical results demonstrate that AIM consistently improves merged model performance across benchmarks, achieving up to a 40% increase in task accuracy.

Strengths and Weaknesses

Strengths

  • The paper is well-written and logically structured, making it easy to follow the research flow.
  • Unlike previous merging methods that relied on simple weighted averaging in weight space, this work draws on quantization and pruning methodologies and demonstrates their significance.
  • While calibration datasets are required, the authors used only a small set of 256 samples, showing that the method is easily applicable.
  • Extensive experiments across multiple merging scenarios were conducted to analyze the gains from applying AIM, and most cases showed positive performance improvements. Notably, Table 1 (MMLU benchmark) demonstrates superior performance across all scenarios, and Figure 2 shows a Pareto-efficient upward trend, both validating the proposed method's effectiveness.

Weaknesses

  • Although Section 3.3 claims that 256 calibration samples are sufficient, it remains unclear how the method would perform if the calibration data is biased. Experimental validation of robustness with varying calibration dataset sizes would strengthen the claims.
  • The merging experiments were limited to models based on Llama-2-13B. It would be beneficial to test the approach on smaller models or those fine-tuned from different base models to verify generalizability.

Questions

In Figure 3, the experiments on varying ω show HV gains even when ω=0 in most cases, which is intriguing. However, the Ties Merging scenario shows a decline. What could explain this discrepancy?

Limitations

yes

Justification for Final Rating

After reviewing the rebuttal, my previous concerns and issues have been resolved, leading me to adjust my score to accept. This decision is based on two key factors: (1) the robustness of the method, which is applicable even with very small calibration datasets, and (2) the generality of the method, as it can be applied regardless of the model architecture.

Formatting Issues

Nothing

Author Response

We thank Reviewer ue55 for the positive review, and we are pleased that the reviewer found our paper well-written and our method well-validated with significant contributions. We hope to address the weaknesses pointed out by the reviewer in detail and with new experiments, and hope that, in light of this response, the reviewer will consider raising their score.

Experimentally Validating Robustness to Calibration Data Size

We thank the reviewer for raising this important question. To address it, we have performed a new ablation study on the calibration set size. Our findings confirm our initial claim and show that AIM is highly robust to the size of the calibration set: significant hypervolume gains are observed even with a very small, task-agnostic dataset (as few as 8 blocks, 512 tokens each, from the Pile [1] corpus). This low data requirement makes AIM easily applicable and a versatile approach for improved model merging, and we plan to include these results in the final manuscript. Specifically, we apply AIM to the merging of all three experts from our main experiment using the DARE TIES method, vary the calibration dataset size, and present the results below.

Table: Hypervolume Gain vs Calibration Set Size (DARE TIES Merging of All Three Experts)

| Calibration Set Size (Blocks) | Hypervolume Gain |
|---|---|
| 0 (Without AIM) | 0.1717 |
| 1 | 0.2265 |
| 2 | 0.2373 |
| 4 | 0.2348 |
| 8 | 0.2427 |
| 16 | 0.2375 |
| 32 | 0.2424 |
| 64 | 0.2385 |
| 128 | 0.2370 |
| 256 | 0.2433 |

These results clearly show that with as few as 8 blocks (512 tokens each), the performance boost of AIM stabilizes, which validates our claims of robustness to calibration data size.

Going Beyond a Single Family of Models (New Experiments With Different Family/Type of Architectures)

We thank the reviewer for pointing out the lack of experiments on other architectures, and we address this with a new set of experiments that demonstrate AIM's generalizability. We evaluate a different model family, Qwen, and, to further highlight the effectiveness of AIM, we consider not only a different language model but a vision-language model, the Qwen 2.5 VL family. To assess AIM in this setup, we merged the Video-R1 model, a video reasoning model [2] fine-tuned from Qwen2.5-VL-7B-Instruct, and the CAD-Coder model, trained on the same architecture for image-to-code CAD geometry generation [3]. The merge therefore involves two experts, one for CAD and one for video reasoning, on top of an instruction-tuned base model. To assess performance, we evaluate IFEval and MMLU (base-model knowledge and instruction following), the video reasoning benchmarks Video-MMMU [4] and VSI-Bench [5], and the CAD generation benchmarks CAD-Coder (on rendered CAD images) [3] and CAD-Coder Real (on real 3D-printed images) [3]. As shown in the table below, which we will include in the final paper, AIM consistently improves the performance of the underlying merging method across diverse tasks, and the overall hypervolume gain is raised in all merging scenarios. This confirms that AIM's benefits are not limited to the Llama-2 architecture and that AIM truly provides a path towards higher-quality merging with little computational and data overhead. We hope the addition of this experiment addresses this major point on generalizability and will convince the reviewer to raise their score.

| Method | AIM | IFEval | MMLU | VSI-Bench (Mean Relative Accuracy) | Video MMMU | CAD-Coder | CAD-Coder Real | HV Gain |
|---|---|---|---|---|---|---|---|---|
| Instruct | - | 0.685 | 0.667 | 0.210 | 0.474 | 0.040 | 0.049 | - |
| Video-R1 | - | 0.643 | 0.610 | 0.383 | 0.492 | 0.000 | 0.025 | - |
| CAD-Coder | - | 0.268 | 0.657 | 0.000 | 0.002 | 0.614 | 0.324 | - |
| TIES | No | 0.571 | 0.677 | 0.230 | 0.406 | 0.610 | 0.324 | 0.438 |
| TIES | Yes | 0.577 (+0.93%) | 0.675 (-0.19%) | 0.225 (-2.24%) | 0.454 (+12.05%) | 0.647 (+6.01%) | 0.317 (-2.28%) | 0.448 (+2.26%) |
| DARE TIES | No | 0.540 | 0.669 | 0.246 | 0.419 | 0.537 | 0.318 | 0.430 |
| DARE TIES | Yes | 0.559 (+3.49%) | 0.670 (+0.23%) | 0.264 (+7.21%) | 0.447 (+6.63%) | 0.551 (+2.64%) | 0.321 (+0.73%) | 0.445 (+3.47%) |
| DARE Task Arithmetic | No | 0.246 | 0.645 | 0.252 | 0.429 | 0.542 | 0.326 | 0.380 |
| DARE Task Arithmetic | Yes | 0.559 (+127.07%) | 0.670 (+3.96%) | 0.258 (+2.37%) | 0.452 (+5.44%) | 0.542 (-0.13%) | 0.336 (+2.94%) | 0.446 (+17.42%) |
| WIDEN | No | 0.349 | 0.661 | 0.084 | 0.261 | 0.611 | 0.333 | 0.318 |
| WIDEN | Yes | 0.349 (+0.00%) | 0.661 (+0.00%) | 0.138 (+63.57%) | 0.320 (+22.55%) | 0.642 (+4.92%) | 0.334 (+0.23%) | 0.360 (+13.24%) |

On the Ablation Study and the Effects of ω

The reviewer's observation is an interesting one. Although we do not have a precise explanation for why this difference in behaviour is observed with ω across merging methods, we speculate that it is due to an interaction with the highly non-linear Ties-Merging algorithm. Ties-Merging resolves sign conflicts between parameters, and our method's approximations, when applied without the protective ω > 0 regularization, may interfere with this sensitive mechanism and thus lead to reduced performance. We hope this brief discussion helps the reviewer grasp our understanding of this observation.

[1] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, S. (2020). The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[2] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., … Yue, X. (2025). Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2503.21776.

[3] Doris, A. C., Alam, M. F., Nobari, A. H., & Ahmed, F. (2025). CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2505.14646.

[4] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., … Liu, Z. (2025). Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2501.13826.

[5] Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., & Xie, S. (2025). Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2412.14171

Comment

Experiments With Different Family/Type of Architectures: The experiments conducted across various model architectures seem to demonstrate the reliability of the proposed methodology.

Experimentally Validating Robustness to Calibration Data Size: It is intriguing that high performance is achieved with fewer than four calibration samples, and even a single sample appears to make a significant difference. I am curious about the authors’ interpretation of this phenomenon. Adding insights into this analysis would be beneficial.

Comment

We thank you for your timely response and for giving us the opportunity to address your further questions.

Experiments With Different Family/Type of Architectures: The experiments conducted across various model architectures seem to demonstrate the reliability of the proposed methodology.

Thank you for acknowledging the added value of this experiment, and we are happy to see this has addressed your concerns on this matter.

Experimentally Validating Robustness to Calibration Data Size: It is intriguing that high performance is achieved with fewer than four calibration samples, and even a single sample appears to make a significant difference. I am curious about the authors’ interpretation of this phenomenon. Adding insights into this analysis would be beneficial.

Thank you for your insightful question and keen observation. The finding that even a single sample, or a very small number of samples, achieves a similar benefit has been observed in prior works using a saliency mapping similar to ours; see Figure 8 in [1] and Figure 2 in [2]. In particular, the experiments in [2] show a similar pattern in which even one sample provides a notable performance boost, which we observe here as well. Our experiment confirms the observations of [1] and [2] from model compression: the activation scales (rather than each individual token's activation) quickly settle to relative values that are stable and insensitive to further samples. Even one small block of 512 tokens appears to be nearly enough to capture the relative activation scales in each layer or attention block. We postulate that, once a model has been trained on large amounts of data, its activation scales are not very sensitive to noisy inputs, so the relative scales in each layer settle quickly. Moreover, unlike model-wide saliency mapping, our approach isolates each layer, which likely further reduces sensitivity to input noise; a model-wide saliency map may not be as stable. This remains an open question for future research and is not central to the core contributions of our work.
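A small synthetic illustration of this point (a NumPy toy under the stated assumption that each channel has a fixed underlying scale; this is not the paper's pipeline): per-channel mean absolute activations estimated from a single 512-token block already correlate almost perfectly with estimates from 256 blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, block_tokens = 512, 512

# Toy stand-in for a trained layer: channels have fixed, heterogeneous scales.
true_scale = rng.lognormal(mean=0.0, sigma=1.0, size=hidden)

def estimate_scales(num_blocks: int) -> np.ndarray:
    # Per-channel mean absolute activation over num_blocks * block_tokens tokens.
    acts = rng.normal(size=(num_blocks * block_tokens, hidden)) * true_scale
    return np.abs(acts).mean(axis=0)

reference = estimate_scales(256)
for n in (1, 2, 8, 64):
    corr = np.corrcoef(estimate_scales(n), reference)[0, 1]
    print(f"{n:>3} block(s): correlation with 256-block estimate = {corr:.4f}")
```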

We would like to thank the reviewer for their timely response and hope that the above discussion is insightful.

[1]Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., … Han, S. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2306.00978

[2]Sun, M., Liu, Z., Bair, A., & Kolter, J. Z. (2024). A Simple and Effective Pruning Approach for Large Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2306.11695

Review (Rating: 3)

This paper introduces Activation-Informed Merging (AIM) for merging fine-tuned LLMs derived from the same pre-trained base. AIM enhances model merging by leveraging the activation space to guide which weights should be preserved during merging. The method aims to prevent performance degradation, particularly the "catastrophic forgetting" of the base model’s general abilities, by constraining updates to the most salient (activation-sensitive) weights. The authors conduct empirical evaluations across multiple benchmarks (math, code, instruction following) and demonstrate that AIM consistently boosts the performance of merged models, sometimes by as much as 40% in benchmark metrics.

Strengths and Weaknesses

Strengths:

  1. The paper is well-grounded in the literature, thoughtfully connecting AIM to both continual learning (notably, weight regularization to prevent catastrophic forgetting) and recent advances in model compression (activation-aware quantization).
  2. Experiments cover five recent merging methods, three types of fine-tuned LLM experts (math, code, instruction), and six task benchmarks. The performance is measured both in standard metrics and a new hypervolume-based multi-task evaluation.

Weaknesses:

  1. The need for access to activations and a representative calibration set may limit the method’s immediate applicability in proprietary or privacy-sensitive domains.
  2. The novelty is somewhat incremental: AIM is an adaptation of activation-based ideas from compression to the merging context. This is clearly articulated, though.
  3. Only tested on one model family and size (Llama-2-13B).

Questions

  1. How does AIM perform on other model families (GPT, Mistral) or different model sizes? The evaluation is limited to Llama-2-13B.
  2. While you cite robustness from other work, have you empirically validated sensitivity to calibration dataset choice in the merging context?
  3. Have you considered other importance metrics beyond activation magnitude (e.g., gradient-based measures, Fisher information)?

Limitations

yes

Formatting Issues

No formatting concern.

Author Response

We sincerely thank Reviewer Ldwv for the thoughtful review and for recognizing that our work is well-grounded in the literature. We hope to address all of the points raised by the reviewer and hope that this discussion and the addition of two new experiments will convince the reviewer to increase their score.

Applicability of the Publicly Sourced Calibration Set

The calibration data is sourced from a small, random sample of the Pile [1], a large-scale, diverse, and publicly available text dataset created specifically for training large language models. The base models we use have already been trained on public data like this, so we are not introducing any new, private information. While the data used to fine-tune expert models may be proprietary, the pre-training data for most well-known model families (Llama, Qwen, etc.) is public, and specifically, the calibration set we propose to use is publicly available. In addition, as our experiments in response to the reviewer’s Q2 demonstrate, our method still offers performance boosts even with a very small amount of calibration data and is therefore remarkably robust.

Experimentally Validating Robustness to Calibration Dataset

To empirically validate the sensitivity to the calibration set, we conducted a new experiment analyzing the impact of its size on AIM's performance. Specifically, we apply AIM to the merging of all three experts from our main experiment using the DARE TIES method, vary the calibration dataset size, and present the results below. Our results show that AIM is robust, delivering significant performance gains even with a very small number of calibration samples (as few as 8 blocks of 512 tokens each from the Pile [1] corpus). This demonstrates that the overhead of sourcing a calibration set is minimal and that the method is highly practical. We will add these findings to the final manuscript.

Table: Hypervolume Gain vs Calibration Set Size (DARE TIES Merging of All Three Experts)

| Calibration Set Size (Blocks) | Hypervolume Gain |
|---|---|
| 0 (Without AIM) | 0.1717 |
| 1 | 0.2265 |
| 2 | 0.2373 |
| 4 | 0.2348 |
| 8 | 0.2427 |
| 16 | 0.2375 |
| 32 | 0.2424 |
| 64 | 0.2385 |
| 128 | 0.2370 |
| 256 | 0.2433 |

These results clearly show that with as few as 8 blocks, the performance boost of AIM stabilizes and more data is not necessary, demonstrating that AIM is a practical framework for improving merging quality that is robust to calibration data size. As we discuss in the paper, this robustness is expected and has been observed in prior activation-based compression frameworks such as AWQ and WANDA.

AIM as a Novel Framework for Activation-Aware Merging

The reviewer's observation that our work takes inspiration from recent advances in model compression is valid; however, we do not believe AIM lacks novelty because of this, and we argue that AIM goes beyond an incremental contribution. The core contribution of our paper is successfully bridging these activation-based techniques to the model merging domain, which is novel and previously unexplored in merging frameworks. Demonstrating the importance and effectiveness of including activations in the merging process, showing its effectiveness at addressing catastrophic forgetting (a key challenge in merging that prior methods have not adequately addressed), and developing an overall algorithm that bridges continual learning principles with lessons learned from model compression together go beyond a simple increment: they highlight both a new finding regarding the importance of the activation space in merging and a method for performing such an integration. We hope this convinces the reviewer that AIM offers more than a simple plug-and-play adaptation and provides significant insight into model merging and a path for more effective merging.

Going Beyond A Single Family of Models (New Experiments With Different Family/Type Of Architectures)

We thank the reviewer for bringing this limitation of our experiments to our attention and for helping us improve the paper by showcasing AIM's effectiveness as a general framework across different architectures and families of models. To demonstrate AIM's broader applicability, we conducted a new set of experiments on a different model family, Qwen. Moreover, to further highlight the effectiveness of AIM, we consider not only a different language model but a vision-language model, the Qwen 2.5 VL family. To assess AIM in this setup, we merged the Video-R1 model, a video reasoning model [2] fine-tuned from Qwen2.5-VL-7B-Instruct, and the CAD-Coder model, trained on the same architecture for image-to-code CAD geometry generation [3]. The merge therefore involves two experts, one for CAD and one for video reasoning, on top of an instruction-tuned base model. To assess performance, we evaluate IFEval and MMLU (base-model knowledge and instruction following), the video reasoning benchmarks Video-MMMU [4] and VSI-Bench [5], and the CAD generation benchmarks CAD-Coder (on rendered CAD images) [3] and CAD-Coder Real (on real 3D-printed images) [3]. As shown in the table below, which we will include in the final paper, AIM consistently improves the performance of the underlying merging method across diverse tasks, and the overall hypervolume gain is raised in all merging scenarios. This confirms that AIM's benefits are not limited to the Llama-2 architecture and that AIM truly provides a path towards higher-quality merging with little computational and data overhead. We hope the addition of this experiment addresses this major point and will convince the reviewer to raise their score.

| Method | AIM | IFEval | MMLU | VSI-Bench (Mean Relative Accuracy) | Video MMMU | CAD-Coder | CAD-Coder Real | HV Gain |
|---|---|---|---|---|---|---|---|---|
| Instruct | - | 0.685 | 0.667 | 0.210 | 0.474 | 0.040 | 0.049 | - |
| Video-R1 | - | 0.643 | 0.610 | 0.383 | 0.492 | 0.000 | 0.025 | - |
| CAD-Coder | - | 0.268 | 0.657 | 0.000 | 0.002 | 0.614 | 0.324 | - |
| TIES | No | 0.571 | 0.677 | 0.230 | 0.406 | 0.610 | 0.324 | 0.438 |
| TIES | Yes | 0.577 (+0.93%) | 0.675 (-0.19%) | 0.225 (-2.24%) | 0.454 (+12.05%) | 0.647 (+6.01%) | 0.317 (-2.28%) | 0.448 (+2.26%) |
| DARE TIES | No | 0.540 | 0.669 | 0.246 | 0.419 | 0.537 | 0.318 | 0.430 |
| DARE TIES | Yes | 0.559 (+3.49%) | 0.670 (+0.23%) | 0.264 (+7.21%) | 0.447 (+6.63%) | 0.551 (+2.64%) | 0.321 (+0.73%) | 0.445 (+3.47%) |
| DARE Task Arithmetic | No | 0.246 | 0.645 | 0.252 | 0.429 | 0.542 | 0.326 | 0.380 |
| DARE Task Arithmetic | Yes | 0.559 (+127.07%) | 0.670 (+3.96%) | 0.258 (+2.37%) | 0.452 (+5.44%) | 0.542 (-0.13%) | 0.336 (+2.94%) | 0.446 (+17.42%) |
| WIDEN | No | 0.349 | 0.661 | 0.084 | 0.261 | 0.611 | 0.333 | 0.318 |
| WIDEN | Yes | 0.349 (+0.00%) | 0.661 (+0.00%) | 0.138 (+63.57%) | 0.320 (+22.55%) | 0.642 (+4.92%) | 0.334 (+0.23%) | 0.360 (+13.24%) |

Gradient-Based Importance Metrics Yield Similar Performance Gains to the Lower-Cost AIM

The point on the use of a different importance metric, specifically gradients, is an excellent one. We agree with the reviewer that examining such a metric provides insight into AIM's effectiveness and its ties to continual learning principles. In fact, the paper already discusses this nuance (starting at Line 219), and we have already conducted experiments on the use of gradients. We investigate gradient-based importance metrics as an alternative to activation magnitudes in Section 3.4, where we include a sensitivity-based formulation as an alternative relaxation. In Appendix C, we detail the experimental evaluation of this approach; as discussed in the paper, our findings show that while gradient-based methods are a valid alternative, they do not yield a noticeably better performance boost than the computationally cheaper and simpler activation-scale-based approach in AIM. We observed that the use of gradients yields results nearly identical to AIM, and we thus concluded that the much cheaper and simpler activations offer a superior balance overall. We thank the reviewer for their insightful comment, which confirms the importance of this comparison already included in the paper, and hope that, in light of this, the reviewer will consider raising their score.
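For illustration only, a toy contrast between the two saliency signals on a single hypothetical linear layer (the proxy loss and shapes are assumptions; this is not the paper's sensitivity-based formulation): the activation score needs only forward statistics over calibration data, while the gradient score additionally requires a backward pass.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 32)
calib = torch.randn(1024, 64)                            # calibration inputs

# Activation-based importance: forward statistics only.
act_importance = calib.abs().mean(dim=0)                 # (in_features,)

# Gradient-based importance: backward pass through an arbitrary proxy loss.
loss = layer(calib).pow(2).mean()
loss.backward()
grad_importance = layer.weight.grad.abs().mean(dim=0)    # (in_features,)

corr = torch.corrcoef(torch.stack([act_importance, grad_importance]))[0, 1]
print(f"correlation between the two channel rankings: {corr:.3f}")
```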

[1] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, S. (2020). The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[2] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., … Yue, X. (2025). Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2503.21776.

[3] Doris, A. C., Alam, M. F., Nobari, A. H., & Ahmed, F. (2025). CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2505.14646.

[4] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., … Liu, Z. (2025). Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2501.13826

[5] Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., & Xie, S. (2025). Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2412.14171

Comment

We thank the reviewer for acknowledging the rebuttals and hope their concerns have been addressed. We are happy to make further clarifications for any remaining concerns and hope that, in light of the added experiments, the reviewer has reconsidered their score.

Final Decision

The paper initially received mixed ratings, with a few minor concerns from the reviewers. The rebuttal addressed most of them, and the reviewers raised their scores. The final decision is acceptance.