Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models
We theoretically and empirically analyze the effect of the pooling operation on downstream performance.
Abstract
Reviews and Discussion
This paper conducts theoretical analysis on the pooling functions of Transformer and derives expressivity bounds for various pooling strategies, including sum, max, last, and average. Through experiments, it verifies that different tasks require distinct pooling functions to capture global and local contextual understanding, respectively. The main contribution lies in establishing an expressivity bound for pooling functions. However, this bound aligns closely with human intuition, which may limit the significance of the proposed method.
As a researcher in the field of machine learning with a greater focus on applications, I place more emphasis on the practical significance of theoretical research.
Strengths and Weaknesses
Strengths:
1. This paper conducts a novel theoretical analysis of the pooling functions in transformers. By delving into the mathematical properties and underlying mechanisms of pooling operations, it offers a fresh perspective on how these functions interact with the overall architecture, thereby enriching the theoretical foundation of transformer-based models.
2. The paper clearly illustrates that different tasks require distinct strategies for capturing global and local contextual understanding. Through in-depth analysis and examples, it effectively demonstrates the importance of tailoring pooling functions to task-specific requirements, providing valuable guidance for researchers aiming to optimize model performance for various natural language processing and computer vision tasks.
3. The experimental results effectively verify the proposed expressivity bound. Rigorous testing across a range of datasets and scenarios provides substantial empirical evidence supporting the validity of the theoretical framework, showcasing the practical utility of the research findings.
Weaknesses:
1. No new model is proposed to construct a more expressive pooling layer. Merely analyzing the existing pooling functions without introducing innovative architectures limits the potential for significant practical improvements. A novel model could have better demonstrated the practical implications of the theoretical expressivity bound and potentially led to breakthroughs in model performance.
2. The results regarding the expressivity bound of the pooling layer may be intuitive, and it is challenging to draw inspiration for related research areas. The lack of deep exploration into the broader implications and potential extensions of the findings restricts the impact of the research. Without clear connections to emerging trends or open problems in the field, it becomes difficult for other researchers to build upon this work, reducing its long-term value.
Questions
- For the multi-token prediction task, which pooling function is more suitable?
- How does the proposed theoretical framework demonstrate its guiding significance for practical tasks?
Limitations
1. Evaluation is limited to frozen backbones, leaving the effect of jointly adapting pooling and the backbone under end-to-end training less explored.
2. "In future work, we aim to develop hybrid pooling methods that dynamically balance global smoothing with token-level sensitivity while adapting to the TBM's inherent smoothing behavior. We also plan to investigate how pooling impacts robustness under perturbations, derive scaling laws that govern pooling performance as datasets grow, and refine our theoretical framework with tighter bounds that identify when specific strategies are provably optimal." As the authors discuss, I believe that this further work will enhance the research significance of the article and improve its chances of publication.
Final Justification
My concern is that no new model is proposed to construct a more expressive pooling layer, and this concern has not been resolved, so I do not think the paper is strong enough for acceptance.
Formatting Issues
N/A
We thank the reviewer for their thoughtful comments and constructive feedback. We are additionally grateful to the reviewer for acknowledging the value of our theoretical insights in guiding practical aspects of machine learning applications across various domains.
W1: No new model is proposed to construct a more expressive pooling layer
Thank you for the observation. While it is true that we do not propose a new pooling mechanism in this work, our primary goal is to provide a theoretical and empirical foundation for understanding existing pooling strategies -- a space that has so far lacked rigorous analysis.
As discussed in the Conclusion (lines 326-330) and Limitations (lines 342-343), our expressivity framework is intended to guide the development of future pooling methods, including hybrid or adaptive strategies. By formally quantifying how pooling affects model expressivity (via the bound), and validating these findings across tasks and domains, we aim to lay the groundwork for principled design of new pooling operators.
Additionally, the paper offers practical guidance for selecting pooling methods based on task requirements (global vs. local context), data regime, and resource constraints. We believe this makes our contribution actionable, even without introducing a new architecture, and positions pooling as a meaningful design choice rather than a fixed component.
W2 & Q2: On the seemingly intuitive expressivity results, broader research impact, and significance for practical tasks
We appreciate the reviewer's perspective. While the behavior of pooling functions may appear intuitive, our contribution lies in formalizing these observations into a principled framework that explains how pooling affects model expressivity and sensitivity. Whereas prior work [1-4] has lacked theoretical grounding, our aim is to provide a unified explanation of these patterns. We see this as a necessary step toward more informed architectural choices and the development of new pooling strategies guided by theory.
This framework enables a principled understanding of when and why different pooling strategies succeed, depending on task requirements. A key takeaway of our work is that no pooling method is universally optimal. Rather, the best choice depends on whether the task requires global or local context.
We also highlight a practical implication: if resources permit, adaptive pooling strategies (e.g., attention-based or weighted average) can learn to fit task-specific needs, but they come at a higher computational cost. When training time or data is limited, our framework helps guide the selection of fixed pooling strategies that match task characteristics.
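For concreteness, a minimal sketch (in PyTorch-style pseudocode with our own naming, not the exact implementation used in the paper) of the fixed and learnable pooling operators discussed above, applied to a batch of token embeddings:

```python
import torch
import torch.nn as nn

# tokens: (batch, seq_len, dim) token embeddings from a frozen backbone
# mask:   (batch, seq_len) with 1 for real tokens and 0 for padding

def last_pool(tokens, mask):
    # embedding of the last non-padding token in each sequence
    idx = mask.sum(dim=1).long() - 1
    return tokens[torch.arange(tokens.size(0)), idx]

def avg_pool(tokens, mask):
    m = mask.unsqueeze(-1).float()
    return (tokens * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)

def sum_pool(tokens, mask):
    return (tokens * mask.unsqueeze(-1).float()).sum(dim=1)

def max_pool(tokens, mask):
    neg_inf = torch.finfo(tokens.dtype).min
    masked = tokens.masked_fill(mask.unsqueeze(-1) == 0, neg_inf)
    return masked.max(dim=1).values

class WeightedAvgPool(nn.Module):
    """One plausible parameterization: a learned per-token score, softmax-normalized over real tokens."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens, mask):
        logits = self.score(tokens).squeeze(-1)              # (batch, seq_len)
        logits = logits.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(logits, dim=1).unsqueeze(-1)
        return (tokens * weights).sum(dim=1)
```

The fixed operators add no parameters, while the learnable variant introduces a small scoring module, which is the source of the additional cost mentioned above.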
We will clarify these points in the revision and thank the reviewer for prompting us to better frame the broader value of our work.
[1] Tang & Yang. Pooling and attention: What are effective designs for llm-based embedding models?
[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[3] Lee et al. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.
[4] Xing et al. Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective.
Q1: For the multi-token prediction task, which pooling function is more suitable?
Multi-token prediction, where several future tokens (e.g., the next 10) are predicted simultaneously via separate heads, can be viewed as a natural extension of the classical next-token prediction task. In such settings, we would expect Last-token pooling to perform well among the fixed pooling strategies, as it retains recent contextual information, which is often most relevant for autoregressive forecasting.
That said, when long-range dependencies play a more significant role, for example when predicting tokens farther ahead (e.g., the 8th or 10th future token), fixed strategies may be less effective. In these cases, learnable pooling methods such as Weighted Average or Attention-based pooling are better suited, as they can dynamically attend to the most informative parts of the input sequence for each prediction target.
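As a rough sketch of the setting we have in mind (our own illustrative module, not taken from the paper), last-token pooling feeding several separate prediction heads could look like:

```python
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predicts the next k tokens from a single pooled sequence representation."""
    def __init__(self, dim, vocab_size, k=10):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab_size) for _ in range(k))

    def forward(self, pooled):            # pooled: (batch, dim), e.g., the last-token embedding
        # one logit tensor per future position
        return [head(pooled) for head in self.heads]
```

Swapping the pooled input from Last-token to a learnable pooling module is a one-line change in such a setup, which is why we view the pooling choice as a tunable design decision for this task.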
We will clarify this distinction in the revised manuscript and thank the reviewer for prompting this useful elaboration.
L1: Evaluation is limited to frozen backbones
We agree that jointly adapting the pooling mechanism and the backbone through end-to-end training is an important direction. In this work, we focused on the frozen-backbone setting to isolate the effect of pooling strategies and analyze their theoretical properties independently of backbone optimization. We will clarify the scope in the revised manuscript and highlight joint adaptation as a potential avenue for future work.
L2: On the potential for future work and extensions
We thank the reviewer for their thoughtful support. We agree that extending this work could enhance its broader impact. We will ensure that these directions are more clearly positioned in the revised manuscript as concrete next steps.
This paper presents a comprehensive theoretical and empirical analysis of pooling strategies in Transformer-based models, establishing pooling as a critical architectural component that significantly influences model expressivity and downstream task performance. This paper introduces a formal expressivity framework to quantify how different pooling methods—Average, Sum, Max, Last-token, and learnable variants (Weighted Average and Attention-based)—affect a model’s ability to distinguish semantically similar versus dissimilar inputs.
Strengths and Weaknesses
Strengths:
Beyond theoretical contributions, the work provides clear, empirically supported recommendations for pooling selection in real-world applications. For instance, it highlights that learnable pooling (e.g., Weighted Average) can adaptively approximate optimal strategies for tasks with mixed global-local requirements, offering a practical solution for scenarios where fixed pooling methods may be suboptimal.
Weaknesses:
- While the paper observes that pooling strategy differences diminish in larger models (e.g., Mistral-7B, Llama3-8B), it lacks a detailed investigation into why this occurs. The absence of experiments on even larger models (e.g., 14B, 32B parameters) leaves a gap in understanding how pooling choices scale with model size.
- The experiments primarily focus on short sequences and do not explore how varying input lengths affect pooling performance, a critical consideration for real-world NLP applications. Additionally, the paper neglects to analyze the computational trade-offs between simple and learnable pooling strategies, including their impact on inference efficiency and cost.
- While the study provides valuable insights into shallow models, it does not systematically investigate how pooling strategies interact with deeper Transformer architectures, which are increasingly prevalent in modern applications. Moreover, although the paper offers general recommendations, it lacks concrete, task-specific guidelines for selecting among particular pooling methods (e.g., Average, Sum, Max, Last), limiting its practical applicability. Additionally, the study overlooks the exploration of hybrid or ensemble pooling strategies (e.g., combining Average and Max pooling), which could better balance global and local feature retention. Such strategies may yield superior performance in multi-objective tasks but remain unexplored in this work.
- This paper notes that Average pooling may discard local features while Max pooling can overlook global context, but it fails to propose a quantitative framework for evaluating semantic preservation in tasks requiring both fine-grained detail and efficient compression (e.g., long-document understanding).
Questions
This paper does not assess the robustness of different pooling strategies under adversarial or noisy input conditions. For example, it remains unclear how various pooling methods perform when input embeddings are perturbed by adversarial attacks or corrupted by noise. This omission is particularly important for real-world scenarios, where input reliability cannot always be ensured.
Please refer to the Weaknesses section for additional questions.
Limitations
yes
Final Justification
This work is interesting and can be accepted. My concerns can be addressed as stated in the authors' response.
Formatting Issues
n/a
We thank the reviewer for their thoughtful comments, constructive feedback, and detailed suggestions for improving the manuscript. Below, we address each point and outline corresponding revisions.
Q1: On the effect of pooling on adversarial robustness
We appreciate the reviewer raising this important point. As noted in Line 344, we briefly mention robustness under perturbations as a direction for future work, but we agree that this aspect was not explored in detail in the current version. We appreciate the connection the reviewer draws between adversarial robustness and our expressivity framework (Equation 4). The measure captures the probability that small input changes lead to disproportionately large changes in the output, a concept closely related to Lipschitz continuity, which underlies many robustness analyses.
Importantly, our expressivity formulation is defined over the full $\epsilon$-neighborhood of an input and can be readily adapted to reflect worst-case deviation within that neighborhood. By considering the maximum rather than the expected output difference, the upper bounds in Theorem 4.2 can be directly extended to the worst-case setting commonly used in adversarial robustness. Since our proofs already assume bounded input spaces and operate over the entire neighborhood, they naturally include both semantically similar perturbations and adversarial examples.
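Schematically, with $f$ denoting the pooled model and $\mathcal{B}_\epsilon(x)$ the $\epsilon$-neighborhood of an input $x$ (notation chosen here for illustration rather than taken verbatim from the paper), the adaptation amounts to replacing an expected deviation with a worst-case one:

$$
\mathbb{E}_{x' \sim \mathcal{B}_\epsilon(x)}\big[\, \|f(x) - f(x')\| \,\big]
\quad\longrightarrow\quad
\sup_{x' \in \mathcal{B}_\epsilon(x)} \|f(x) - f(x')\|.
$$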
To complement this theoretical link with empirical evidence, we conducted a preliminary experiment using the Fast Gradient Sign Method (FGSM) attack on CIFAR-10 and CIFAR-100, using a ViT backbone. We report the attack success rate for two commonly used perturbation budgets (denoted $\epsilon_1$ and $\epsilon_2$ in the table below). The results indicate that pooling choice meaningfully impacts adversarial vulnerability, in addition to clean accuracy:
| Dataset | Attack Budget | Last (CLS) | Avg | Sum | Max | W-Avg | Attn |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | $\epsilon_1$ | 18.56 | 18.94 | 10.92 | 20.32 | 16.93 | 13.9 |
| CIFAR-10 | $\epsilon_2$ | 29.38 | 28.16 | 10.92 | 25.97 | 25.59 | 14.44 |
| CIFAR-100 | $\epsilon_1$ | 34.29 | 32.44 | 20.67 | 31.84 | 31.57 | 29.54 |
| CIFAR-100 | $\epsilon_2$ | 43.99 | 41.42 | 20.68 | 38.07 | 40.49 | 32.94 |
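For reference, a minimal sketch of how such an FGSM evaluation could be set up (PyTorch-style, with a frozen backbone plus trained pooling and classification head wrapped in `model`; all names and the attack-success definition here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def fgsm_attack_success_rate(model, loader, epsilon, device="cuda"):
    """Fraction of originally correct predictions flipped by a single FGSM step of size epsilon."""
    model.eval()
    flipped, correct = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        images.requires_grad_(True)
        logits = model(images)                      # backbone -> pooling -> classifier
        preds = logits.argmax(dim=1)
        F.cross_entropy(logits, labels).backward()
        adv = (images + epsilon * images.grad.sign()).clamp(0, 1)
        with torch.no_grad():
            adv_preds = model(adv).argmax(dim=1)
        hit = preds == labels                       # count only originally correct samples
        correct += hit.sum().item()
        flipped += (hit & (adv_preds != labels)).sum().item()
    return 100.0 * flipped / max(correct, 1)
```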
We plan to expand this analysis further in the final version, both theoretically and empirically, including experiments with stronger attacks (e.g., Projected Gradient Descent) and additional datasets. We thank the reviewer for motivating this valuable extension.
W1 & W3: On the effect of model depth
While our theoretical analysis focuses on a single-layer Transformer-based model, the framework naturally extends to deeper models. Specifically, a model with $L$ layers can be decomposed as a composition of single-layer functions $f_1, \dots, f_L$, such that $f(x) = (f_L \circ f_{L-1} \circ \cdots \circ f_1)(x)$.
Following standard results on Lipschitz continuity, the overall expressivity bound of the model becomes the product of the expressivity bounds of the individual single-layer functions. Since each individual layer operates on a bounded input (as we assume a bounded input space), this composition holds, and the theoretical structure remains intact.
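In symbols (our illustrative notation for this response, with $\mathrm{Exp}(\cdot)$ standing in for the expressivity-type constant of Theorem 4.2):

$$
\mathrm{Exp}\big(\mathrm{pool} \circ f_L \circ \cdots \circ f_1\big) \;\le\; \mathrm{Exp}(\mathrm{pool}) \cdot \prod_{i=1}^{L} \mathrm{Exp}(f_i).
$$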
This formulation provides insight into the diminishing influence of pooling in deeper models. As the number of layers increases, the overall bound grows, making the relative effect of the final pooling operation smaller, especially when the pooling factor is itself contractive, as for Average pooling, whose contribution shrinks with sequence length. This aligns with our empirical observation in Lines 282-283, where the performance gap between pooling strategies narrows in larger models. To further validate this effect in practice, and in response to the reviewer's suggestion, we extended our experiments to larger-scale models:
- Qwen2.5 14B and Qwen2.5 32B
- Mistral3.1 24B
The tables below report results across tasks for each model using the same experimental setup as in Section 5.2. As shown, while pooling trends are generally preserved, the absolute differences between pooling methods decrease as model size increases, consistent with the theoretical interpretation.
Table 2: Test metrics for Qwen2.5 14B
| Pooling | STSB | Hellaswag | Banking | Tweet | Next Token |
|---|---|---|---|---|---|
| Last | 0.288±0.003 | 0.692±0.011 | 34.497±0.178 | 57.008±0.004 | 91.039±0.002 |
| Avg | 0.581±0.000 | 0.773±0.000 | 85.390±0.001 | 69.930±0.008 | 53.893±0.250 |
| Sum | 0.579±0.002 | 0.776±0.002 | 82.711±0.008 | 61.218±1.194 | 47.053±0.341 |
| Max | 0.488±0.005 | 0.727±0.001 | 77.825±0.006 | 55.390±0.728 | 25.578±0.332 |
| W-Avg | 0.589±0.001 | 0.798±0.002 | 86.867±0.002 | 69.770±0.007 | 90.653±0.020 |
| Attn | 0.259±0.011 | 0.712±0.013 | 66.387±1.939 | 56.410±3.001 | 40.573±0.892 |
Table 3: Test metrics for Qwen2.5 32B
| Pooling | STSB | Hellaswag | Banking | Tweet | Next Token |
|---|---|---|---|---|---|
| Last | 0.303±0.001 | 0.723±0.009 | 34.383±0.268 | 55.886±0.006 | 89.886±0.005 |
| Avg | 0.603±0.000 | 0.781±0.001 | 87.760±0.002 | 68.328±0.012 | 48.536±0.334 |
| Sum | 0.604±0.001 | 0.782±0.005 | 85.227±0.013 | 63.665±0.865 | 39.204±0.627 |
| Max | 0.488±0.005 | 0.729±0.008 | 83.929±0.008 | 65.268±0.596 | 31.273±0.586 |
| W-Avg | 0.627±0.003 | 0.812±0.003 | 89.903±0.004 | 69.464±0.009 | 89.022±0.019 |
| Attn | 0.355±0.016 | 0.730±0.011 | 68.929±1.054 | 58.828±2.695 | 29.750±1.028 |
Table 4: Test metrics for Mistral3.1 24B model.
| Pooling | STSB | Hellaswag | Banking | Tweet | Next Token |
|---|---|---|---|---|---|
| Last | 0.503±0.002 | 0.745±0.008 | 75.487±0.087 | 53.701±0.005 | 88.972±0.007 |
| Avg | 0.631±0.001 | 0.784±0.001 | 87.403±0.006 | 66.871±0.009 | 51.723±0.812 |
| Sum | 0.622±0.004 | 0.783±0.004 | 87.597±0.015 | 62.791±0.976 | 42.306±0.732 |
| Max | 0.488±0.003 | 0.733±0.007 | 79.675±0.010 | 61.655±0.473 | 27.804±0.923 |
| W-Avg | 0.682±0.002 | 0.816±0.003 | 88.711±0.017 | 66.200±0.005 | 87.849±0.024 |
| Attn | 0.392±0.043 | 0.697±0.009 | 72.922±1.012 | 31.294±4.452 | 19.911±2.023 |
We will integrate this theoretical extension and the new empirical results into the revised version of the manuscript, and we thank the reviewer again for prompting this valuable addition.
W2: On the effect of sequence length in pooling operators
To investigate this, we conducted an additional experiment using the Mistral-7B model on the HellaSwag dataset, varying the maximum input sequence length from 16 to 128 tokens. Inputs were either truncated or padded as needed, and padding tokens were excluded from pooling operations, consistent with our prior setup. The results, shown in Table 6, compare the performance of different pooling strategies across sequence lengths. Naturally, shorter inputs can lead to performance degradation due to the truncation of semantically important content. To isolate the effect of pooling itself, we recommend comparing methods column-wise (i.e., at fixed sequence lengths).
Table 6: Mean and standard deviation of metrics for Mistral 7B on HellaSwag for different sequence lengths.
| Sequence length | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| Last | 0.454±0.003 | 0.621±0.002 | 0.770±0.002 | 0.781±0.001 |
| Avg | 0.503±0.001 | 0.702±0.000 | 0.769±0.000 | 0.771±0.001 |
| Sum | 0.504±0.001 | 0.702±0.001 | 0.769±0.000 | 0.771±0.001 |
| Max | 0.435±0.003 | 0.616±0.003 | 0.709±0.001 | 0.700±0.008 |
| W-Avg | 0.523±0.002 | 0.724±0.000 | 0.801±0.000 | 0.802±0.003 |
| Attn | 0.220±0.169 | 0.278±0.251 | 0.737±0.018 | 0.764±0.025 |
We observe that while pooling sensitivity is more pronounced at shorter input lengths, particularly for methods like Last-token or Attention-based pooling, the relative trends between pooling strategies remain consistent. For longer sequences (64 or 128 tokens), performance stabilizes across pooling types, aligning with our theoretical bounds, which become more predictive in this regime.
Additionally, in a revised version of the manuscript, we will include a dedicated analysis of inference-time costs. As a preview, inference latency (in ms) for pooling on Mistral-7B with Banking77 is: Last: 0.14 ± 0.01; Avg: 0.16 ± 0.01; Sum: 0.12 ± 0.01; Max: 0.23 ± 0.01; W-Avg: 0.33 ± 0.02; Attn: 0.53 ± 0.03. These numbers show that while learnable methods add modest overhead, this cost may be justified in tasks requiring higher expressivity. In contrast, flat poolings are cheaper and often competitive, highlighting a classic trade-off between inductive bias and learnable capacity.
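For transparency, a sketch of how such per-operator latencies could be measured, timing only the pooling step over pre-computed token embeddings (tensor sizes are assumptions; absolute numbers will vary with hardware and batch size):

```python
import time
import torch

def time_pooling(pool_fn, tokens, n_warmup=10, n_iters=100):
    """Average wall-clock time (ms) of a pooling function over token embeddings."""
    for _ in range(n_warmup):
        pool_fn(tokens)
    if tokens.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        pool_fn(tokens)
    if tokens.is_cuda:
        torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / n_iters

tokens = torch.randn(32, 128, 4096)   # (batch, seq_len, dim); sizes chosen for illustration
print("avg :", time_pooling(lambda t: t.mean(dim=1), tokens))
print("sum :", time_pooling(lambda t: t.sum(dim=1), tokens))
print("max :", time_pooling(lambda t: t.max(dim=1).values, tokens))
print("last:", time_pooling(lambda t: t[:, -1], tokens))
```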
We will include these experiments and discussion in the revised manuscript and thank the reviewer for suggesting this addition.
W3: Regarding hybrid pooling strategies
We agree with the reviewer that hybrid pooling, which combines multiple strategies to capture both global and local context, is a promising direction. This idea is reflected in Figure 3 (and Figure 5), where Weighted Average pooling adaptively attends to both prominent and supporting tokens, thus effectively blending pooling behaviors. Importantly, any hybrid pooling strategy is ultimately built from flat pooling operations which are the exact components analyzed in our expressivity framework. Understanding their individual properties, as we do in this work, is a necessary step toward designing principled hybrid or adaptive approaches.
We will clarify this connection in the revised manuscript and thank the reviewer for highlighting this valuable perspective.
W4: Considering the tasks requiring both local and global context
We agree with the reviewer that certain downstream tasks may require preserving both local detail and global structure. As noted in our response to W3, such mixed-context tasks present a challenge for flat pooling strategies, which are typically biased toward one or the other. In these cases, we believe that either learned pooling methods, such as Weighted Average or Attention-based, or hybrid approaches that combine multiple fixed strategies offer a more suitable solution. This is consistent with our findings in Sections 5.2 and 5.3, where learned methods adaptively emphasize tokens based on task demands and outperform flat pooling on tasks with diverse contextual requirements.
We will clarify this point in the revised manuscript and explicitly highlight use cases where hybrid or learned pooling strategies are likely preferable. We appreciate the reviewer's suggestion in guiding this improvement.
Thank you for your reply. These new results can be added to a revised version of the paper. I will be maintaining my positive score.
We thank the reviewer for the time dedicated to reviewing the paper and for the positive assessment of the elements provided in our rebuttal. We also appreciate the positive score. We will carefully incorporate the above discussions into the revised paper, with explicit adjustments to reflect the proposed suggestions and clarifications.
The authors studied the impact of various token-embedding pooling approaches both theoretically and experimentally. They first analyzed the impact of pooling on the expressivity of Transformer-based models, deriving closed-form bounds on the model's ability to distinguish similar inputs. Then, they experimentally evaluated pooling strategies across computer vision, natural language processing, and time-series tasks. Their findings suggest that tasks requiring global contextual understanding, such as text classification, benefit from contractive methods like average pooling. Conversely, tasks emphasizing local details, such as next-token prediction, benefit from expansive methods like last-token and sum pooling.
Strengths and Weaknesses
This paper makes a timely contribution as pooling in Transformer-based models is an essential design choice that is underexplored. The experimental setup is modern, leveraging state-of-the-art LLMs such as Llama3, Mistral 7B, and Qwen 2.5. Note that I believe it would be beneficial to explicitly highlight the choice of these models in the main text.
The main weakness lies in the shift between the theoretical and empirical analyses. The theoretical section focuses on contractive against expansive pooling, whereas the empirical results discuss global against final pooling. For instance, average pooling is contractive and sum pooling is expansive, and the authors suggest that contractive methods benefit global tasks while expansive benefit local tasks. Yet, experiments indicate that tasks requiring global context perform better with global pooling methods like average or sum.
- Lines 228-230: “Therefore, we posit that pooling should be selected based on task requirements: tasks emphasizing global context (e.g., image inpainting or text classification) benefit from contractive pooling, while those relying on local detail (e.g., next-token prediction) may perform better with expansive alternatives.”
- Lines 278-281: “For tasks requiring global context, such as classification and semantic similarity, global pooling methods like Average or Sum significantly outperform Last-token pooling. Conversely, Last-token pooling yields superior performance in next-token prediction tasks.” To improve the paper's narrative, both the theoretical analysis and experimental results should consistently address both contractive vs. expansive and global vs. final aspects.
Note that I did not thoroughly review the supplemental material, particularly the proofs.
Questions
- Can the authors elaborate on the two aspects of contractive vs. expansive and global vs. final? How would you revise your manuscript to address my comment?
- The statement in lines 293-295 that "Sum pooling outperforms Average pooling in several cases, that can be attributed to the larger norm of summed representations, which results in stronger gradients and faster learning under fixed training hyperparameters", while logical, lacks supporting evidence. If that's the case, why does average pooling outperform sum pooling in most cases?
Limitations
Yes.
Final Justification
The authors addressed my main concern, which was the connection between the theoretical analysis (expansive/contractive) and experimental (global/local). I have updated my score from 3 to 4.
Formatting Issues
I have no concerns.
We thank the reviewer for their thoughtful comments and acknowledging the value and timely contribution of our study. Below, we respond point by point to the reviewer's questions and comments, and outline the corresponding revisions we will incorporate to improve the manuscript.
W1: On the choice of LLMs used
We appreciate highlighting the importance of model selection when evaluating pooling strategies. Our aim was to ensure that the observed trends generalize across a diverse range of model capacities and architectures. To this end, we selected a representative set of widely used state-of-the-art language models, including Qwen2.5, Mistral-7B, and LLaMA3.1-8B. These models span different size classes and training pipelines, and are broadly recognized in the community. Our results across these models consistently support the theoretical insights introduced in the paper.
To further strengthen our findings, we extended our study to include larger-scale models, such as Mistral 3.1 24B, and report the results in the table below. As seen, pooling behavior is consistent even at higher parameter scales, in line with additional observations on Qwen2.5 14B and 32B (see our response to Reviewer duRA W1 & W3). These additions reinforce our claim that the conclusions drawn in this work are not model-specific, but rather reflect generalizable properties of pooling strategies.
Table 2: Mean and Standard deviation of test metrics for NLP tasks for the Mistral3.1 24B model.
| Pooling | STSB | Hellaswag | Banking | Tweet | Next Token |
|---|---|---|---|---|---|
| Last | 0.503 ± 0.002 | 0.745 ± 0.008 | 75.487 ± 0.087 | 53.701 ± 0.005 | 88.972 ± 0.007 |
| Avg | 0.631 ± 0.001 | 0.784 ± 0.001 | 87.403 ± 0.006 | 66.871 ± 0.009 | 51.723 ± 0.812 |
| Sum | 0.622 ± 0.004 | 0.783 ± 0.004 | 87.597 ± 0.015 | 62.791 ± 0.976 | 42.306 ± 0.732 |
| Max | 0.488 ± 0.003 | 0.733 ± 0.007 | 79.675 ± 0.010 | 61.655 ± 0.473 | 27.804 ± 0.923 |
| W-Avg | 0.682 ± 0.002 | 0.816 ± 0.003 | 88.711 ± 0.017 | 66.200 ± 0.005 | 87.849 ± 0.024 |
| Attn | 0.392 ± 0.043 | 0.697 ± 0.009 | 72.922 ± 1.012 | 31.294 ± 4.452 | 19.911 ± 2.023 |
We will include this rationale and the extended results in the revised manuscript.
Q1: On the local/global and expansive/contractive aspects
We thank the reviewer for this thoughtful question. We believe the distinction raised is important, and we appreciate the opportunity to clarify the relationship between these axes.
In our framework, contractive vs. expansive refers to the mathematical behavior of pooling functions, specifically how they scale the model's sensitivity to input perturbations (Section 4.2, Theorem 4.2). For example, Average pooling is contractive (its factor shrinks with sequence length), while Sum and Last-token pooling are expansive or neutral. These differences affect how much variation in the input is retained in the final pooled representation.
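As a back-of-the-envelope illustration in our own notation: if each of the $n$ token embeddings is perturbed by $\delta_i$, the pooled representations shift by at most

$$
\Big\|\tfrac{1}{n}\sum_{i=1}^{n}\delta_i\Big\| \;\le\; \tfrac{1}{n}\sum_{i=1}^{n}\|\delta_i\|
\quad\text{(Average)}
\qquad\text{vs.}\qquad
\Big\|\sum_{i=1}^{n}\delta_i\Big\| \;\le\; \sum_{i=1}^{n}\|\delta_i\|
\quad\text{(Sum)},
$$

and the Sum-pooled output in fact moves exactly $n$ times as far as the Average-pooled one for the same token-level perturbations, which is what we mean by expansive versus contractive behavior.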
On the other hand, the global vs. local distinction (we believe this may be what was meant by "final") refers to task requirements, i.e., whether the output depends on the overall input (global context) or specific positions (local context). For example:
- Global-context tasks (e.g., classification, semantic similarity) benefit from pooling methods that integrate information across all tokens.
- Local-context tasks (e.g., next-token prediction, forecasting) benefit from methods that emphasize recent or specific tokens, such as Last-token pooling.
These two perspectives are connected: expansive pooling tends to preserve token-level variation, making it better suited to local-context tasks; contractive pooling smooths across the input, aligning with global-context needs.
Figure 2 empirically validates this theoretical framing across modalities. It shows that expansive pooling methods (e.g., Sum, Last) are more sensitive to perturbations, while Average pooling smooths input variation, closely aligning with our derived bounds. This supports our claim that expressivity behavior maps onto different task needs. To illustrate the semantic effects more concretely, Figure 4 presents examples where minor edits to input text (e.g., replacing "fluctuations" with "swings" or "volatility") yield small changes under Average pooling but larger differences under Last-token pooling. This demonstrates that expansive pooling captures fine-grained changes that might be critical in local-context tasks.
We finally note that while related empirical trends have been observed in prior work [1-4], to our knowledge, this is the first work to formally characterize and prove these relationships from a theoretical standpoint.
We will revise the manuscript to:
- Clarify the relationship between the theoretical (contractive/expansive) and empirical (global/local) axes in the transition between Sections 4 and 5.
- Add a brief explanation that "final-context" tasks, as described by the reviewer, are more precisely described as local-context tasks, and adjust our terminology accordingly for clarity.
- Highlight Figure 2, which empirically shows how pooling methods behave under perturbations, consistent with the scaling predicted by our theoretical framework.
- Expand on Figure 4, which illustrates how pooling methods respond to subtle semantic differences, depending on whether token-level or global meaning is more relevant.
We appreciate the reviewer’s feedback and believe these revisions will strengthen the conceptual clarity of the manuscript.
[1] Tang & Yang. Pooling and attention: What are effective designs for llm-based embedding models?
[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[3] Lee et al. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.
[4] Xing et al. Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective.
Q2: On the performance of Sum pooling
We thank the reviewer for the question and the opportunity to clarify this point. The observation that Sum pooling occasionally outperforms Average pooling was made specifically in the time-series modality, for the majority of the considered MOMENT model sizes and classification datasets.
In this setting, we noted that Sum pooling results in larger gradient norms compared to Average pooling, which may lead to faster early-stage optimization. However, this effect is not observed in NLP or vision tasks, suggesting that it may be modality-specific.
We hypothesize that this behavior could be influenced by normalization schemes or architectural details unique to time-series models, such as the absence of token-level layer norm or differences in input scaling. While this offers a plausible explanation, we currently do not have direct evidence to confirm the cause. We will clarify this distinction in the revised manuscript and flag this as a modality-specific observation that warrants further study. We thank the reviewer for encouraging us to articulate this more precisely.
Thank you for the explanation of the connection between global/local and expansive/contractive. I am pleased with the proposed revisions and will increase my score.
We thank the reviewer for increasing the score; the recognition means a lot, and we appreciate the positive assessment.
We will elaborate and incorporate the discussion about the global/local and expansive/contractive aspect in the revised manuscript, ensuring that all relevant details are presented with sufficient clarity as proposed by the reviewer.
The paper studies the pooling operation used to combine the output token embeddings into a single embedding. The authors provide a theoretical framework to analyze how pooling affects performance and offer guidelines on how to select the pooling function. The theoretical analysis shows that the pooling strategy affects the classifier's expressivity bound: Average pooling enhances stability by smoothing variations, while Last-token and Sum pooling increase expressivity and are more sensitive to minor perturbations. The paper suggests that tasks requiring global context benefit from Average pooling, while tasks biased toward certain local spans (such as next-token prediction) can benefit from Last-token or Sum pooling strategies. The hypotheses are verified by experiments measuring expressiveness and downstream performance.
Strengths and Weaknesses
Strengths:
- A theoretical framework to bound the expressiveness of various pooling techniques.
- Empirical results to back the hypothesis derived from the theoretical works.
- Clear guidelines of how to choose the pooling strategy.
Questions
- (Figure 4) The experiments replacing words with semantically similar or dissimilar words are interesting. I'm wondering if there are similar experiments on removing stop words or removing negation words (like "not"). Are there any proposals for a pooling mask that masks out certain tokens?
Limitations
Yes
Formatting Issues
None
We thank the reviewer for their thoughtful comments and constructive feedback. We especially appreciate their recognition of the novelty and value of our theoretical and empirical insights concerning the pooling operation in Transformer-based models.
Q1: What happens when stopwords or negation are removed? Could a pooling mask be used to discard such tokens?
We appreciate this suggestion and agree that extending the analysis in Figure 4 to include manipulations such as stopword removal and negation edits provides a useful perspective on how pooling functions respond to subtle semantic changes, especially in the context of token-level relevance.
To explore this, we conducted an additional analysis using the same base sentence as in Figure 4:
- Original: "The stock market saw significant fluctuations last week."
- Stopword removed: "Stock market saw significant fluctuations last week."
- Negation introduced: "The stock market saw insignificant fluctuations last week."
We then measured the L2 distance between the pooled output of the original and perturbed sentences for each pooling method. The results are shown below:
| Perturbation | Avg | Last | Max | Sum |
|---|---|---|---|---|
| Stopword removed | 8.37 | 18.93 | 19.18 | 265.00 |
| Negation introduced | 2.38 | 16.46 | 9.78 | 21.41 |
These results suggest that Average pooling is relatively stable under small textual changes, such as those affecting global structure (e.g., stopword removal), while Last, Max, and Sum pooling are more sensitive to local token edits, such as negation. This supports the broader insight from the paper: contractive pooling methods smooth token-level effects, whereas expansive methods emphasize localized changes, making them more responsive in tasks where fine-grained distinctions (e.g., presence of negation) are crucial.
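A minimal sketch of how such a pooled-distance comparison could be reproduced (the model name and helper below are assumptions, not the exact setup used for Figure 4; any encoder exposing hidden states works, and a smaller model keeps this cheap):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"   # placeholder; swap in any model that returns hidden states
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def pooled(text, how):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).last_hidden_state[0]        # (seq_len, dim)
    if how == "avg":
        return h.mean(dim=0)
    if how == "sum":
        return h.sum(dim=0)
    if how == "max":
        return h.max(dim=0).values
    if how == "last":
        return h[-1]
    raise ValueError(how)

original = "The stock market saw significant fluctuations last week."
negated = "The stock market saw insignificant fluctuations last week."
for how in ["avg", "last", "max", "sum"]:
    # L2 distance between pooled representations of the two sentences
    print(how, torch.dist(pooled(original, how), pooled(negated, how)).item())
```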
We will include this extended analysis and visualization in the updated version of the manuscript, and we thank the reviewer for motivating this additional experiment. The notion of a learnable pooling mask that can selectively attend to or discard tokens is indeed a promising direction, and we will briefly note it as future work.
Q2: Are there any proposals for a pooling mask that masks out certain tokens?
Incorporating such a mechanism could indeed be beneficial, especially when certain tokens carry outsized semantic weight or when low-information tokens may dilute global context. However, a key challenge lies in determining which tokens to mask, especially in a generalizable way. While heuristic approaches (e.g., stop word lists) are available in NLP, such strategies do not naturally extend to other modalities like vision or time series, where "irrelevant tokens" are less clearly defined.
In this regard, we view weighted average pooling as a flexible alternative. Rather than applying hard masking, it learns to dynamically attenuate or amplify token contributions based on context, achieving similar outcomes without requiring explicit prior knowledge or handcrafted rules.
We will mention this connection in the revised manuscript and thank the reviewer for prompting this valuable direction for future work.
The paper provides a comprehensive theoretical and empirical analysis of various pooling strategies in transformer models.
Following the rebuttal, three reviewers indicate a positive opinion of the paper (one accept, two borderline accepts) and note that all of their concerns have been addressed and that the paper makes a valuable contribution. The remaining reviewer rates the paper a borderline reject, saying that the paper only analyzes existing pooling methods rather than proposing a new one. However, the other reviewers have highlighted the importance of the theoretical and empirical analysis and its utility in making practical decisions for real-life applications.
Accordingly, AC concurs with the majority opinion and recommends acceptance.