PaperHub
Average rating: 6.3 / 10 (Poster; 3 reviewers; min 6, max 7, std 0.5)
Individual ratings: 6, 7, 6
Average confidence: 3.7
COLM 2025

Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

We propose Test-Time Model Merging (TTMM), which approaches the performance of Test-Time Training (TTT) with almost no test-time overhead.

Abstract

Mixture of expert (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference cost. We propose _**T**est-**T**ime **M**odel **M**erging_ (TTMM) which scales the MoE paradigm to orders of magnitude more experts and uses model merging to avoid almost all test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that the performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, _TTMM is more than $100\times$ faster than TTT_ at test-time by amortizing the cost of TTT at train-time. Thus, TTMM offers a promising cost-effective approach to scale test-time training.
Keywords
test-time training · model merging · mixture of experts · language modeling · local learning · transductive learning

Reviews and Discussion

Review
Rating: 6

This work proposes Test-Time Model Merging (TTMM) as an approximation of test-time training with better test-time efficiency. Specifically, TTMM first clusters the training dataset at train time. For each cluster, the authors train a small LoRA adapter, yielding k experts in total. Centroid embeddings are computed to represent cluster-specific features and provide the cross-attention weights used during test-time model merging. TTMM is evaluated mainly on Wikipedia and GitHub (Python) via perplexity.
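
For readers skimming the thread, below is a minimal sketch of the train-time phase summarized above. It is illustrative only: the embedding model and clustering algorithm follow what the paper reports (all-mpnet-base-v2, bisecting k-means), but the function names are placeholders and the per-cluster fine-tuning step is only indicated in a comment, not reproduced from the authors' code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import BisectingKMeans

def cluster_corpus(texts, num_experts=100, embed_name="all-mpnet-base-v2"):
    """Embed the training corpus and split it into num_experts clusters."""
    embedder = SentenceTransformer(embed_name)
    # Document embeddings are normalized so that dot products are cosine similarities.
    embs = embedder.encode(texts, normalize_embeddings=True)
    km = BisectingKMeans(n_clusters=num_experts, random_state=0).fit(embs)
    # Normalized centroids later serve as the "keys" for routing a prompt to experts.
    centroids = km.cluster_centers_.copy()
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return km.labels_, centroids

# Each cluster's documents are then used to fine-tune one LoRA expert on the base
# model (e.g. with PEFT), which can be done embarrassingly parallel across clusters.
```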

Reasons To Accept

  1. The idea of TTMM is sound, and the experiments also verify the effectiveness of TTMM on Wikipedia and GitHub (python) in perplexity. It is likely to be a practical paradigm for test-time approaches.
  2. This work is well-written and easy to follow.
  3. The model designs and motivations are well clarified.

Reasons To Reject

  1. The authors merely evaluate TTMM on two datasets, Wikipedia and GitHub (Python), via perplexity. More tasks are needed to comprehensively verify the effectiveness of TTMM. Moreover, perplexity is the only metric adopted for evaluation; other metrics on practical downstream benchmarks should be considered to determine whether the proposed TTMM functions well in diverse real-world tasks.
  2. Some recent MoE efforts do adopt more experts (e.g., DeepSeek-V3 adopts 256 routed experts). Would this setting subsume some of the advantages of TTMM proposed in this work? It would be interesting to verify its effectiveness on MoE-based architectures. Moreover, evaluations on larger LLMs would make the conclusions more solid.
  3. From the results, it seems that the selection of (a) the number of experts and (b) the number of activated experts is non-trivial and varies across datasets and tasks. The authors are encouraged to discuss how to choose these parameters more wisely.

Questions To Authors

  1. The real overall runtime (including train-time and test-time costs) for TTT and TTMM under the current evaluation setting should be given.
  2. In Figure 1, it seems that TTMM performs consistently worse than TTT, while in Table 1, TTMM outperforms TTT on two datasets. The authors could give more discussions on these results.

Ethics Concerns

NA

Comment

Thank you for your review and detailed comments! You can find below our detailed response to your questions and concerns. Please let us know if you have any further questions or suggestions.

"The idea of TTMM is sound, and the experiments also verify the effectiveness of TTMM on Wikipedia and GitHub (python) in perplexity. It is likely to be a practical paradigm for test-time approaches."

Thank you!

Evaluation on non-perplexity tasks

Thank you for this feedback! Based on this feedback and the feedback of reviewer Ltqr, we evaluated TTMM on MMLU for the rebuttal. We have included the results in the main comment above. In summary, we find that TTMM slightly improves MMLU accuracy.

MoEs with larger numbers of experts

Thank you for mentioning larger MoEs such as DeepSeek-V3. For perspective, we find improvement in using 100 experts for comparatively small individual domains, such as English Wikipedia or Python code from GitHub. In contrast, DeepSeek-V3 is trained across many broad domains. We therefore believe that our results indicate that further increasing the number of experts in LLMs is a promising direction. Doing this in a standard MoE-based architecture, however, is challenging since all experts need to be in GPU memory simultaneously, which requires substantial infrastructure for an increased number of experts. We propose and study the alternative of training all individual experts in an embarrassingly parallel way, and then linking them post-hoc via a router. This provides the potential advantage of requiring substantially fewer resources. We very much appreciate your suggestion of scaling to larger LLMs, which we also believe to be an exciting direction for future work.

Selection of number of experts and number of active experts

We agree that the selection of these is non-trivial and therefore we will add an expanded section discussing this choice to the final paper.

Choosing the number of experts (K): The total number of experts has almost no direct effect on latency. However, more experts require a large amount of static CPU memory. Further, increasing the number of experts also increases the separation of knowledge/skills between experts. This then requires more active experts to achieve the same performance, which also increases latency. For this reason, we agree that selecting K for optimal placement on the Pareto frontier of latency and performance is non-trivial. We observe that choosing 100 experts is also indicated by the "elbow method", common in clustering (cf. Figure 6 (middle)).
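
As a concrete illustration of the elbow heuristic mentioned above (not the authors' exact procedure), one can fit clusterings for several candidate values of K and look for where the inertia curve flattens:

```python
from sklearn.cluster import BisectingKMeans

def inertia_curve(embs, ks=(10, 30, 100, 300, 1000)):
    """Fit one clustering per candidate K and record its inertia (within-cluster SSE)."""
    return {k: BisectingKMeans(n_clusters=k, random_state=0).fit(embs).inertia_
            for k in ks}

# Plotting inertia against K and picking a value near the bend of the curve
# (around K = 100 in the paper's Figure 6, middle) is the usual elbow rule.
```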

Choosing the number of active experts (N): The number of active experts is slightly more straightforward to select. Note that TTMM chooses the number of active experts adaptively depending on the prompt. The average number of active experts is determined by the temperature parameter β, which we ablate in Figure 5. We suggest selecting this parameter on a holdout dataset (which one can interpret as a form of "training" the router), and find that performance is relatively robust to the exact value of β.
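
A minimal sketch of this prompt-dependent weighting (illustrative only; the default temperature β and the sparsification rule below are assumptions, while the exact gating is defined in the paper's Section 3.1):

```python
import numpy as np

def expert_weights(prompt_emb, centroids, beta=10.0, rel_floor=0.1):
    """Weight experts by a temperature-scaled softmax over prompt-centroid similarity."""
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    sims = centroids @ prompt_emb                 # cosine similarities (centroids normalized)
    w = np.exp(beta * (sims - sims.max()))        # softmax with temperature beta, stable form
    w /= w.sum()
    # Keep experts whose weight is within a factor of the largest one; the number of
    # active experts is therefore prompt-dependent, as described in the response above.
    active = np.flatnonzero(w >= rel_floor * w.max())
    return active, w[active] / w[active].sum()

# The merged adapter is the convex combination sum_i w_i * delta_W_i of the active
# experts' LoRA updates, loaded into the base model once before decoding.
```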

Discussion of improvement of TTMM over TTT on Wikipedia

We agree with you. We were also surprised to see TTMM outperform TTT on the Wikipedia dataset. We expect that this may be because TTMM can incorporate knowledge from a larger set of experts, each trained on a few thousands of data points. In contrast, TTT is only training on the 100 most related neighbors. If there are more than 100 related data points, this may miss some information. We added a mention of this to the discussion of our results.


We hope to have addressed your remaining concerns. We would greatly appreciate it if you could reconsider the score based on our response and extended results.

Comment

Thanks for the authors' rebuttal.

My main concerns with this work lie in the perplexity-based evaluation and in further exploration of other MoE settings. The authors have provided results on MMLU, which partially addresses the evaluation issue (however, such evaluations should be more diversified and included in the original submission).

Moreover, the authors agree that larger numbers of experts are worth considering, which could be addressed in a future version. The real overall runtime (including train-time and test-time costs) for TTT and TTMM under the current evaluation setting should also be given.

In conclusion, I will maintain my original voting of 6: Marginally above acceptance threshold.

Comment

Thank you very much for your kind response.

Real overall runtime

We have updated our previous comment to include the train-time as well as the runtime of the perplexity evaluation, alongside the test-time latency. We would like to note, however, that evaluating perplexity is substantially faster than generating text of an equivalent length. Based on your feedback, we have added these runtimes to the paper.

Non-perplexity evaluation

Similar to previous work on TTT with LLMs [1, 2], we conducted our initial evaluation on perplexity tasks. We fully agree with you that further evaluation on a diverse set of non-perplexity tasks is an exciting direction for future work, and a natural follow-up step to make our findings directly applicable to downstream tasks.

References:

  1. Hardt & Sun. Test-Time Training on Nearest Neighbors for Large Language Models. ICLR, 2024.
  2. Hübotter et al. Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs. ICLR, 2025.
Comment

Dear reviewer,

We wanted to kindly remind you that the discussion period on OpenReview will close in a few days. We hope to have addressed your remaining concerns with our response to your review and would be happy to discuss if you have any further questions.

Latency of TTMM on our hardware

Thank you for your question. We have measured the average latency of TTMM on our hardware (Wikipedia dataset, Llama 3.2-1B). The results are summarized below. We will include these numbers in the appendix. Please note that latency is expected to be lower on newer clusters with higher CPU-GPU bandwidth.

| TTMM | Test-time latency (avg) |
| --- | --- |
| 1 active expert | 56.2 ms |
| 3 active experts | 97.2 ms |
| 10 active experts | 235.1 ms |

Train-time: 125h

Embedding generation-time: 2h

Test-time (avg): 15.7ms (Note that this is not generating tokens, but only evaluating their perplexity)

Thank you again for your time and effort in reviewing our work!

Review
Rating: 7

This paper proposes Test-Time Model Merging (TTMM), a test-time optimization method for large language models. It works by creating a large number of adapter models from the training data and dynamically selecting and composing these models based on the input during inference to achieve task-specific tuning. The theoretical analysis shows that TTMM approximates test-time training (TTT) under appropriate conditions, and it demonstrates the ability to construct specialized models with significantly less cost (100 times faster) compared to TTT.

EDIT after rebuttal: I raised my recommendation to 7.

Reasons To Accept

  • This method significantly reduces the real-time overhead during test-time while improving the performance (perplexity) of language models. This is a desirable characteristic for applications, and the method could be particularly useful for applications that require real-time performance.
    • However, the cost is not entirely eliminated; it requires separate storage to keep the LoRAs. Therefore, the title "Essentially Free" seems to be an overstatement.

Reasons To Reject

  • The paper evaluates the proposed method based on perplexity but does not mention downstream tasks at all. While perplexity might suffice for purely theoretical results, discussing the impact on actual task performance is essential when considering real-world applications of language models. Although the authors mention this in Future Work (final section), it is unclear whether there was a specific reason the results were not included in this paper. I think it is a very good study, but this point is quite disappointing.
  • As mentioned in the experiments, the method's scaling seems to have limitations with the current approach, as increasing the number of experts to 1000 did not yield further improvement.
  • The performance of this method is likely significantly affected by the training data clustering method, but the analysis of the clustering method itself is very limited. Technically, this is an essential hyperparameter for applying this method and requires deeper analysis.

Questions To Authors

  • The theoretical analysis in Section 3.2 includes unrealistic approximations such as single-step optimization. While understandable for analysis efficiency, is it possible to replace these with assumptions closer to actual training settings? Also, it seems that the effect of the reduction due to applying LoRA in this method is not sufficiently discussed. Is there any consideration of whether synthesizing a large number of trained LoRAs can maintain the same model expressiveness as tuning the original parameter space directly?
  • As mentioned in "Reasons To Reject," the discussion about clustering itself seems limited. How sensitive is the proposed method to other clustering algorithms or changes in hyperparameters within the algorithm? If experimental results are unavailable, is it possible to provide some qualitative observations?
  • The embedding used for selecting LoRAs is thought to have a direct impact on model performance, but the paper only mentions results using all-mpnet-base-v2. Please provide justification for this choice. Also, can you discuss how the behavior changes specifically with different embeddings? Relatedly, can you mention any guidelines for selecting embeddings?
  • The conclusion in the paper states that "TTMM with 10 active experts achieves close to the accuracy of TTT," and while this is broadly true, Table 1 suggests a different relationship between TTMM and TTT for Wikipedia and GitHub. Specifically, the order of perplexity is reversed. This could be seen as reflecting the methodological characteristics of TTMM, but is there any discussion on this point?
  • Regarding the point that performance did not improve with an increased number of experts, you only state that "additional experts do not capture any meaningful new information". Is there a more detailed analysis or consideration of methods to avoid this issue?
  • You mention that model merging interference is an important issue, but you do not seem to deeply consider it in this study. Is there anything you can mention regarding this?
  • The process of determining the prefix length for GitHub is not clear. Is it possible to provide any information on this?
Comment

Thank you for your review and detailed comments! You can find below our detailed response to your questions and concerns. Please let us know if you have any further questions or suggestions.

"This method significantly reduces the real-time overhead during test-time while improving the performance (perplexity) of language models. This is a desirable characteristic for applications, and the method could be particularly useful for applications that require real-time performance."

Thank you!

Title: "Essentially Free Test-Time Training via Model Merging"

TTMM indeed requires storing the LoRAs on CPU memory. We agree that this is a limitation of our proposed method, and we added the following to the first paragraph of the discussion section: "A limitation of TTMM is that it requires additional CPU memory for storing all experts." Nevertheless, we would like to emphasize here that TTMM does not incur significant latency beyond standard inference, while improving performance.

Evaluation on non-perplexity tasks

Thank you for this feedback! Based on this feedback and the feedback of reviewer tuLa, we evaluated TTMM on MMLU for the rebuttal. We have included the results in the main comment above. In summary, we find that TTMM slightly improves MMLU accuracy.

Scaling the number of experts

We expect that scaling the total number of experts (for a fixed dataset) increases the separation of knowledge/skills between experts, which then requires more experts to be active during inference to achieve the same performance. Therefore, in terms of the Pareto frontier of latency and performance, fewer total experts might be beneficial since they may require fewer active experts to match performance.

We will clarify this in our discussion.

Model merging interference

We find that as the number of active experts is increased (significantly larger than 10), the performance gain of TTMM plateaus. We think that this is most likely due to overlapping signals propagated through the individual expert models during inference, which interfere in the merged model, as also studied in a large body of work in model merging [e.g., 1, 2]. This is further supported by our observation that ensembling the logits of experts outperforms model merging (albeit at a much higher latency). While our study focuses on how to best weight the active experts based on their relevance to the test task, we fully agree that studying interference is an important direction for future work. One interesting approach for further study might be to explicitly guide the training of the individual experts towards solutions that have small interference.
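
For intuition on the latency difference mentioned above, here is a schematic comparison of the two strategies (illustrative only; `base_forward` and `lora_deltas` are placeholders, not the authors' implementation):

```python
def merged_logits(base_forward, lora_deltas, weights, x):
    """Weight merging: combine the LoRA deltas once, then run a single forward pass."""
    delta = sum(w * d for w, d in zip(weights, lora_deltas))
    return base_forward(x, delta)

def ensembled_logits(base_forward, lora_deltas, weights, x):
    """Logit ensembling: one forward pass per active expert, then average the logits."""
    return sum(w * base_forward(x, d) for w, d in zip(weights, lora_deltas))

# Ensembling runs N forward passes, i.e., roughly N times the per-token cost of a
# single merged model, which is why it is only attractive when latency is not a concern.
```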

Analysis of clustering method

In addition to bisecting k-means, we also clustered the Wikipedia dataset using regular k-means. K-means resulted in an unbalanced clustering, with one cluster capturing more than 1/4 of the data points and many clusters capturing fewer than 100 data points. This observation motivated our decision to use bisecting k-means.
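
A quick balance check of the kind described above could look as follows (illustrative; `embs` is assumed to be the matrix of document embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans, BisectingKMeans

def cluster_balance(embs, k=100):
    """Compare how evenly k-means and bisecting k-means distribute documents."""
    for algo in (KMeans(n_clusters=k, random_state=0),
                 BisectingKMeans(n_clusters=k, random_state=0)):
        sizes = np.bincount(algo.fit_predict(embs), minlength=k)
        print(type(algo).__name__, "largest:", sizes.max(), "smallest:", sizes.min())
```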

Selection of embeddings

Thank you for this question! To answer it, we evaluated different embedding models for our new evaluation on MMLU. We find that TTMM outperforms the base model regardless of the choice of embedding model. In particular, the largest Qwen2-1.5B embedding model performs worse than the 100M MPNet model. We suspect that this might be because finding a decent clustering is significantly easier than answering the questions, since the former only requires a broad understanding of some keywords while the latter requires some reasoning ability.

Single-step assumption in theoretical analysis

Thank you for this question! We adopted this assumption purely to simplify notation. Indeed, with slightly more complex notation to keep track of the individual steps of gradient descent, our proof of Proposition C.1 also proves the recursion $\|\theta_{x^\star}^{(t+1)} - \theta'^{(t+1)}\| \leq \|\theta_{x^\star}^{(t)} - \theta'^{(t)}\| + \eta G \left(\text{diam}(\mathcal{D}_{x^\star}) + \text{diam}(\mathcal{D}')\right)$ (cf. lines 535-536). Unrolling this recursion proves the same bound as in Proposition C.1 for $T$ steps of gradient descent, with an additional factor $T$ in the bound.

We agree with you that this was an oversimplification, and we have changed the proposition in the paper to the variant with $T$-step gradient descent. Thank you for this feedback!
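
For concreteness, unrolling the recursion over $T$ steps gives (a sketch in the notation above; the simplification at the end assumes both parameter trajectories start from the same base model, so the step-0 distance is zero):

$$\|\theta_{x^\star}^{(T)} - \theta'^{(T)}\| \;\leq\; \|\theta_{x^\star}^{(0)} - \theta'^{(0)}\| + T\,\eta\,G\left(\text{diam}(\mathcal{D}_{x^\star}) + \text{diam}(\mathcal{D}')\right) \;=\; T\,\eta\,G\left(\text{diam}(\mathcal{D}_{x^\star}) + \text{diam}(\mathcal{D}')\right).$$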


References:

  1. Yadav et al. Ties-merging: Resolving interference when merging models. NeurIPS, 2023.
  2. Yang et al. AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR, 2024.
Comment

Effect of LoRA

Thank you for this question! We adopt LoRA primarily because CPU memory and loading from CPU memory is a central bottleneck for TTMM. With Llama-3.2-1B as base and 100 experts, LoRAs take around 17 GB while full-parameter experts would require 247 GB (2.47 GB per expert). Moreover, at test-time, active experts have to be loaded into GPU memory. Loading 10 LoRAs is only around 1.7 GB, while loading 10 full-parameter experts is 24.7 GB. This would substantially increase latency.
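
A back-of-envelope check of these memory figures (assuming bf16, i.e., 2 bytes per parameter; the parameter count below is inferred from the 2.47 GB/expert figure above rather than taken from a configuration file):

```python
GB = 1e9
BYTES_PER_PARAM = 2  # bf16

full_expert_params = 2.47 * GB / BYTES_PER_PARAM  # ~1.24e9 params, consistent with Llama-3.2-1B
lora_expert_gb = 17 / 100                         # 100 LoRA experts reportedly take ~17 GB in total

print(f"one full-parameter expert: {full_expert_params * BYTES_PER_PARAM / GB:.2f} GB")        # ~2.47
print(f"100 full-parameter experts: {100 * full_expert_params * BYTES_PER_PARAM / GB:.0f} GB")  # ~247
print(f"10 active LoRA experts: {10 * lora_expert_gb:.1f} GB")                                  # ~1.7
```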

Discussion of improvement of TTMM over TTT on Wikipedia

We agree with you. We were also surprised to see TTMM outperform TTT on the Wikipedia dataset. We expect that this may be because TTMM can incorporate knowledge from a larger set of experts, each trained on a few thousands of data points. In contrast, TTT is only training on the 100 most related neighbors. If there are more than 100 related data points, this may miss some information. We added a mention of this to the discussion of our results.

Prefix length for GitHub

Below is an ablation of the prefix length for Python code, showing the perplexity of TTMM and the base model, respectively.

| Prefix length | TTMM (1 active expert) | Base model | Δ |
| --- | --- | --- | --- |
| 50 | 2.717 | 2.678 | -0.039 |
| 100 | 2.649 | 2.652 | +0.003 |
| 150 | 2.616 | 2.635 | +0.018 |
| 200 | 2.561 | 2.611 | +0.049 |
| 250 | 2.530 | 2.585 | +0.055 |
| 300 | 2.509 | 2.562 | +0.053 |

Since code does not typically summarize the contents of a file in the first few tokens, the prefix must be longer than for Wikipedia to provide sufficient information on the content of the suffix. Code typically begins with a series of import statements or copyright comments, which alone are not sufficiently informative for the selection of experts. To address this limitation of our evaluation, we added an evaluation on MMLU (see above), which follows a natural structure of prompt & response.


We hope to have addressed your remaining concerns. We would greatly appreciate it if you could reconsider the score based on our response and extended results.

Comment

Dear reviewer,

We wanted to kindly remind you that the discussion period on OpenReview will close in a few days. We hope to have addressed your remaining concerns with our response to your review and would be happy to discuss if you have any further questions.

Thank you again for your time and effort in reviewing our work!

Comment

Thank you for the detailed responses to my review! They were helpful for my understanding of the proposed method, and some of my questions were resolved appropriately by the authors' comments.

The comment "Effect of LoRA" doesn't resolve one of my question: "Is there any consideration of whether synthesizing a large number of trained LoRAs can maintain the same model expressiveness as tuning the original parameter space directly?", because the answer only mentioned about the efficiency. This is unfortunate, but is relatively trivial considering that the comments were overall appropriate.

In light of the rebuttal, I would like to give a stronger recommendation for this paper.

Review
Rating: 6

This paper introduces a new method for test-time adaptation of language models (TTMM), in which LoRA adapters are trained on the training data grouped into clusters. At test time, based on the prompt, the method selects the relevant adapters via the prompt embedding and then weight-merges them. This final adapter-merged model is used for inference and evaluation.

Reasons To Accept

The paper correctly identifies that test-time adaptation (TTT) at the granularity of individual prompts is prohibitively expensive. That fine-tuning on each prompt is not pragmatic is a valid problem to solve.

To that end:

  1. The introduced method is comparatively simple to understand and implement, which is an advantage.
  2. The notion of clustering is relevant, as it overcomes the issue of extremely fine-grained task fine-tuning.
  3. The results show that the performance degradation is relatively small compared to the TTT method.

Reasons To Reject

The main concern I have with the conclusions is that there is no discussion on the Pareto frontier of the latency and the performance degradation compared to TTT.

If the advantage claimed for the paper's approach is its latency for "test time" adaptation, then we need an apples-to-apples comparison of inference times.

For example, Figure 4 shows the inference times for TTMM. But looking at the figure 4b, it is not clear to me why this method is faster than TTT. For example, at 100 experts, the method takes 1 sec (100ms) to generate 20 tokens. If I do not consider distributed inference, for a 3B model, that is pretty high, almost 10x for the 10 experts case.

So we need to see the tradeoff between how performance degrades with fewer experts versus the gain in time. That would also allow the method to select the sweet spot.

A second question on the method, on which I see no discussion, concerns the concept of "experts". It seems that there is a disconnect between how the experts are selected based on an embedding model and how the weights are merged. Since there is no training of the gating or router mechanism here, the weights for merging come from a dot product with the embedding model. Why is that a good weighting mechanism when the model has never been adapted to those weights before?

Questions To Authors

  1. Can you clarify how the number of clusters was selected? It seems that perplexity keeps improving with more experts/clusters, up to 1000 experts. However, as mentioned above, a single dimension is unlikely to determine the number of clusters; test-time latency should also be a factor in selecting K, but I do not see a discussion of that.
Comment

Thank you for your review and detailed comments! You can find below our detailed response to your questions and concerns. Please let us know if you have any further questions or suggestions.

Comparison of latency vs performance between TTMM and TTT

Thank you for highlighting the importance of discussing the Pareto frontier of latency and performance. Figure 1 shows that TTMM has only marginally higher latency than standard inference, while significantly improving performance. In contrast, Figure 1 shows that TTT has significantly higher latency. We further discuss latency in Section 3.1 and find that TTMM has a constant overhead of around 115ms (with 10 active experts), while TTT has a constant overhead of around 15s, over 100x larger.

The exact Pareto frontier varies depending on the generation length. Figure 1 plots this for 100 generated tokens and Figure 4 (left) shows how the latency changes for other generation lengths. We do not show TTT in Figure 4 (left) since the constant overhead of TTT before generating any token is larger than the time to generate 1k tokens with TTMM (and 10 active experts).

We summarize this in the following table, which we will also add to the paper. Here, T denotes the time for generation. On the hardware we use to measure latency (NVIDIA RTX 4090), generation takes around 6 ms/token.

| Method | Latency (ms) |
| --- | --- |
| TTMM (1 active expert) | 8 + T |
| TTMM (3 active experts) | 75 + T |
| TTMM (10 active experts) | 115 + T |
| TTT | 15,000 + T + data loading |
| Ensembling of N experts | N·T + expert loading |
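
A toy latency model built from the constants in this table (6 ms/token as reported above; TTT's data-loading time and ensembling's expert-loading time are omitted, so this is illustrative only):

```python
def latency_ms(method, n_tokens, ms_per_token=6.0):
    """Constant overhead from the table above plus per-token generation time."""
    overhead = {"base": 0.0, "ttmm_1": 8.0, "ttmm_3": 75.0,
                "ttmm_10": 115.0, "ttt": 15_000.0}[method]
    return overhead + n_tokens * ms_per_token

# For 100 generated tokens: TTMM with 10 active experts ~ 715 ms vs. TTT ~ 15.6 s;
# the constant overheads alone (115 ms vs. 15 s) differ by more than 100x.
```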

Figure 4b

Thank you for this question! Figure 4 (middle) shows the latency of TTMM at varying numbers of active experts. At 100 active experts, the loading of experts from CPU memory dominates latency. For this reason, we focus our performance evaluation on fewer active experts (1, 3, and 10), where the latency of TTMM is small. Nevertheless, even at 100 active experts, the 1s latency of TTMM is still substantially faster than the roughly 15s latency of TTT.

Training of the router

Your suggestion to train the router is great! We believe that training the router on holdout training data is likely to further improve performance. While training the router is common practice with MoEs, we did not train the router since this would significantly increase the cost and complexity of training when the experts do not all fit into GPU memory simultaneously. We find that even when we train each expert in an embarrassingly parallel way, TTMM achieves good performance. We therefore consider your suggestion an exciting direction for future work.

Discussion of the number of experts

Thank you for this question!

Considerations for selecting the number of clusters (K): The total number of experts has almost no impact on latency. It primarily affects the required CPU memory for storing all experts. Provided sufficient memory for storing the experts, K can be scaled to billions of experts without significantly increasing latency, similar to document retrieval systems. We select K using the elbow method based on the bisecting k-means clustering.

Considerations for selecting the number of active experts (N): The number of active experts has the strongest effect on latency, since all active experts have to be loaded from CPU into GPU memory before inference. We find that this latency is minimal for up to 10 active experts.

Comment

Thanks for addressing my questions.

I understood your point previously. This is also the problem with many architectures that have additional storage as associative memory. And this is not a trivial consideration.

At 100 active experts, the loading of experts from CPU memory dominates latency. For this reason, we focus our performance evaluation on fewer active experts (1, 3, and 10), where the latency of TTMM is small.

I hope you realize that the architecture bottleneck by way of this data to memory overhead can actually make your method look weak. This is not something to overlook when it is clear that more experts can make up for lost accuracy but you are not able to do that due to weight movement bottlenecks.

However, I am ok with this as long as you include this assumption and this drawback in your paper, else the claims are somewhat misleading that your method has all advantages of latency without incurring any overheads. That claim is false.

Comment

Thank you for your time and effort in reviewing our work, and your kind response.

I hope you realize that the architecture bottleneck by way of this data to memory overhead can actually make your method look weak. This is not something to overlook when it is clear that more experts can make up for lost accuracy but you are not able to do that due to weight movement bottlenecks.

Thank you for highlighting this! We agree that TTMM does not scale to many active experts. In fact, in our view, one of the key results of this paper is that leveraging only very few active experts compared to the total number of experts (i.e., using only very local information) can improve language models. Relying on only a few sparse active experts is also common in other MoE architectures that do not use CPU memory (where latency is not a concern). Furthermore, our analysis shows that activating even only a single expert per query is sufficient to approximate TTT (cf. Section 3.2); we only activate more than one expert to compensate for a “small” total number of experts (cf. line 230).

However, I am ok with this as long as you include this assumption and this drawback in your paper, else the claims are somewhat misleading that your method has all advantages of latency without incurring any overheads.

Thank you for this comment. We will clarify in the final paper that our method is only proposed for few active experts, in which case latency is only marginally larger than standard inference. In Figure 1 (left), we also show the tradeoff of latency vs performance, where one can see that few active experts have small latency while improving the performance over the base model.

Thank you once again for your time and effort in reviewing our work!

Comment

Dear reviewer,

We wanted to kindly remind you that the discussion period on OpenReview will close in a few days. We hope to have addressed your remaining concerns with our response to your review. We would greatly appreciate it if you reconsider the score based on our response and extended results.

Thank you again for your time and effort in reviewing our work!

Comment

We thank all reviewers for their feedback! A common request was to evaluate TTMM on a non-perplexity task. Below, we summarize our results for the evaluation of TTMM on MMLU, which we will include in the final paper. In summary, we find that TTMM slightly improves MMLU accuracy. The results are detailed below, including a breakdown into the categories of the MMLU dataset.

| Model | Overall | Humanities | Social Sciences | STEM | Other |
| --- | --- | --- | --- | --- | --- |
| Fine-tuned | 48.1 | 44.8 | 54.24 | 40.79 | 54.43 |
| TTMM (1 expert) | 48.41 | 45.25 | 54.92 | 40.98 | 54.30 |
| TTMM (3 experts) | 48.48 | 45.33 | 55.02 | 40.98 | 54.39 |
| TTMM (10 experts) | 48.74 | 45.63 | 55.15 | 41.39 | 54.55 |
| TTMM (15 experts) | 48.96 | 45.46 | 55.31 | 41.77 | 55.26 |

Base model and embedding model: We use the Llama-3.2-1B model (non-instruct) as base model, which we then fine-tune on the MMLU training set. This model achieves approximately the same performance as the instruction-tuned model. We use the non-instruct model to avoid any confounding effect of RL-training or instruction-tuning on privileged data from the instruct model. As embedding model, we use Stella (stella_en_400M_v5) but obtain similar results in our experiments with MPNet (all-mpnet-base-v2) which we describe below. We train 100 experts on the MMLU training set.

Ablating the embedding model: We evaluate three different embedding models: MPNet (around 100M parameters, used in our original experiments), Stella (around 400M parameters), and Qwen2 (around 1.5B parameters). Performance is similar, with Qwen2 performing slightly worse than MPNet and Stella.

| Model | Overall | Humanities | Social Sciences | STEM | Other |
| --- | --- | --- | --- | --- | --- |
| Fine-tuned | 48.1 | 44.8 | 54.24 | 40.79 | 54.43 |
| TTMM (all-mpnet-base-v2) | 48.61 | 45.95 | 54.76 | 40.56 | 54.72 |
| TTMM (stella_en_400M_v5) | 48.96 | 45.46 | 55.31 | 41.77 | 55.26 |
| TTMM (gte-Qwen2-1.5B-instruct) | 48.45 | 45.27 | 54.76 | 40.91 | 54.65 |

Final Decision

This work presents a new method for efficient test-time adaptation. Instead of training one adapter per task, the proposed method scales up the number of experts and then efficiently merges them for less overhead during inference.

The reviewers agree that the proposed method is simple to understand and implement. It provides a practical solution for test-time adaptation, which is otherwise impractical in many real-world settings.

The reviewers note some limitations of the current work, namely the lack of evaluations on downstream tasks and the need for a better discussion of the tradeoff between the number of experts and latency. The authors provided additional evidence on both counts during the rebuttal, so I am providing a positive recommendation for this work.

Pros

  • Method is simple to understand and implement (7eGv,tuLa)

  • Practical version of test-time adaptation with minimal performance loss compared to per prompt tuning (7eGv,Ltqr,tuLa)

  • the writing motivates the proposed approach well, and it is easy to follow (tuLa)

Cons

  • performance trade-off between number of experts and increase in latency is not properly explained (7eGv)

    • clarified in rebuttal
  • unclear how experts are selected (7eGv,tuLa)

    • clarified in rebuttal
  • the paper slightly over-claims the benefits of TTMM by calling it "essentially free"; compared to TTT, TTMM requires more memory to keep adapters (Ltqr, 7eGv)

  • no downstream metrics (Ltqr, tuLa)

    • fixed in rebuttal
  • limitations in how much TTMM can scale (Ltqr)

  • unclear how it compares with recent models with more experts (tuLa)