PaperHub
Average rating: 3.5/10 · Rejected · 4 reviewers (min 3, max 5, std 0.9)
Individual ratings: 5, 3, 3, 3
Average confidence: 4.3 · Correctness: 2.3 · Contribution: 1.8 · Presentation: 2.0
ICLR 2025

Masked Cross-attention Adapters Enable the Characterization of Dense Features

Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Learning meaningful representations is a core topic of deep learning. Throughout the last decade, many strategies for learning image representations have been proposed involving supervision and self-supervision and various data sources. In most current work, evaluation is focused on classification tasks while neglecting dense prediction tasks, possibly because linear probing is more challenging in the latter case. Furthermore, dense prediction heads are often large and come with specific inductive biases that distort performance measurement further. In this work we propose masked cross-attention adapters (MAXA), a minimal adapter method that is capable of dense prediction independent of the size and resolution of the encoder output. This allows us to make dense predictions using a small number of additional parameters ($<0.3\%$) while allowing for fast training using frozen backbones. Using this adapter, we run a comprehensive evaluation assessing instance awareness, local semantics and spatial representation of a diverse set of backbones. We find that DINOv2 outperforms all other backbones tested - including those supervised with masks and language - across all three task categories. Code is available at https://to.be.released.
Keywords
image features · image backbones · ViT · instance segmentation

Reviews and Discussion

Review (Reviewer 6ajN)
Rating: 5

This paper presents the Masked Cross-Attention Adapter (MAXA), a lightweight solution for making dense predictions on frozen vision backbones. By utilizing cross-attention, MAXA decouples the encoder output's resolution from the final prediction, filling a gap in the evaluation of feature extractors for dense tasks such as segmentation and depth estimation. The authors assess multiple vision backbones along three dimensions: instance awareness, local semantics, and spatial understanding, with DINOv2 emerging as the top performer across all tasks. The study emphasizes MAXA’s effectiveness in characterizing dense features while requiring minimal parameters and training effort.

Overall, the concept is straightforward and well-presented. The authors explore various aspects of the capabilities of frozen features, providing a solid basis for evaluating the adapter's effectiveness. However, the paper falls short due to a limited literature review and insufficient experimental depth, which may impact its chances of acceptance.

Strengths

  • (1) The idea is simple and the paper is easy to read.

  • (2) This research has the potential to become a standard for evaluating the dense awareness of frozen features.

  • (3) The authors explore various aspects of the capabilities of frozen features, providing a solid basis for evaluating the adapter's effectiveness.

Weaknesses

Major Weaknesses

  • (1) The literature cited is outdated. For instance, the authors state, “At the other end of the spectrum, using complex dense task heads, for example, Faster R-CNN (Ren et al., 2015) for object detection, adds a large number of parameters and introduces its own inductive biases.” However, Faster R-CNN is nearly a decade old. The authors should clearly differentiate their approach from more recent works like ViTDet [1], ViT-Adapter [2], and Segmenter [3] in both the introduction and related work sections, as these studies also focus on developing lightweight dense task heads. Although ViTDet is briefly mentioned in the “Experiment Design” section, this reference is insufficient for establishing the distinction.

  • (2) Lacks a comparison with zero-shot dense prediction using frozen features. Zero-shot segmentation using frozen features from foundation models [4, 5] has been extensively studied. These models [4, 5] demonstrate strong segmentation performance with training-free dense heads.

  • (3) The experimental comparison is insufficient. In the CLIP setting, the authors focus solely on ViT-based architectures (SigLIP, for example), whereas ConvNet-based or hybrid architectures might be more appropriate for dense tasks. It is highly recommended that the authors include experiments with hybrid CNN-Transformer architectures like ViTamin [6] and ConvNet architectures like CLIP-ConvNeXt [7]. These additional experiments are crucial, and I would consider raising the score if they are incorporated during the rebuttal phase.

Minor Weaknesses

  • (1) In Figure 1, it is unclear how the M(q, \sigma) is generated.

  • (2) Section 3 mentions that “spatial queries Q of size (H_Q W_Q, 16)”. It is unclear why the query dimension is chosen to be 16.

  • (3) Concerns regarding reproducibility arise due to the unclear descriptions. Section 3 mentions that “This is realized through a small CNN operating on the output of all queries using transposed convolutions to increase spatial size.” Please specify the details of the "small CNN". CNN usually refers to a convolutional neural network consisting of convolutional layers and non-linear activation layers; please specify the type and number of layers and the kernel size of each convolution operator.

[1] Li, Y., Mao, H., Girshick, R., et al. "Exploring Plain Vision Transformer Backbones for Object Detection." ECCV 2022.

[2] Chen, Z., Duan, Y., Wang, W., et al. "Vision Transformer Adapter for Dense Predictions." ICLR 2023.

[3] Strudel, R., Garcia, R., Laptev, I., et al. "Segmenter: Transformer for Semantic Segmentation." ICCV 2021.

[4] Sun, S., Li, R., Torr, P., et al. "CLIP as RNN: Segment Countless Visual Concepts Without Training Endeavor." CVPR 2024.

[5] Wang, F., Mei, J., Yuille, A. "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference." ECCV 2024.

[6] Chen, J., Yu, Q., Shen, X., et al. "ViTamin: Designing Scalable Vision Models in the Vision-Language Era." CVPR 2024. https://huggingface.co/jienengchen/ViTamin-XL-384px

[7] https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

Questions

As raised in the weaknesses section, it is highly recommended to include experiments with hybrid CNN-Transformer architectures like ViTamin [6] and ConvNet architectures like CLIP-ConvNeXt [7]. These experiments are essential, and I would consider increasing the score if they are added during the rebuttal phase.

The authors should also address the major concerns and minor concerns detailed in the weakness section.

Comment (Authors)

We thank the reviewer for the extensive comments.

Major

The authors should clearly differentiate their approach from more recent works.

We agree that the related work needs more discussion and have revised the respective sections (also see general comment 2).

Lacks a comparison with zero-shot dense prediction using frozen features

As clarified in general comment 1, our primary goal is not to obtain competitive performance. We report the performance of our method and the suggested methods on Pascal VOC2012 with background (VOC-21) to provide a perspective on the attained performance of MAXA. Please note that this is an apples-to-oranges comparison, as our method was trained on Pascal VOC2012 (even though the number of parameters is small).

The experimental comparison is insufficient. [...] the authors focus solely on ViT-based architectures [...]

We focus on vision transformers (1) to control for model architecture and (2) because ViT-B/16 is a popular choice, which facilitates relating to other approaches. We agree that a comparison with CNNs, specifically ConvNeXt, and ViTamin strengthens the paper, and we have added the suggested comparison; please see general comment 2.

Minor

In Figure 1, it is unclear how the M(q, \sigma) is generated.

A Gaussian function is placed at the equivalent of the query position in the feature volume. This Gaussian has a learnable standard deviation for each attention head (i.e., the $\sigma_i$ are model parameters). While this is hard to incorporate into the figure, it will be added to the caption in future iterations of the paper.
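For concreteness, one plausible reading of this description (the exact formulation is the paper's; the symbol $p$ for feature positions and the way the mask enters the attention weights below are our assumptions) is a per-head Gaussian mask

$$M_i(p, q) = \exp\left(-\frac{\lVert p - q\rVert^2}{2\sigma_i^2}\right),$$

where $q$ is the query position, $p$ ranges over positions in the feature volume, and the learnable $\sigma_i$ sets the spatial extent of attention head $i$, e.g. by multiplying the head's attention weights with $M_i$ (equivalently, adding $\log M_i$ to its logits).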

It is unclear why the query dimension is chosen to be 16.

The query dimension is a hyperparameter we set to 16 to ensure that small position changes can be resolved. We did not test other values but do not expect this to be performance-critical as long as it is sufficiently large. This will be clarified in the paper.

Concerns regarding reproducibility arise due to the unclear descriptions.

Agreed, we report the exact parameterization of the CNN in the appendix of the updated paper.
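Since the exact parameterization is deferred to the appendix of the updated paper, the following is a minimal PyTorch sketch of what such a small transposed-convolution upsampling head could look like. The layer count, channel widths, kernel sizes, and activation are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Illustrative upsampling head: query outputs reshaped to a coarse (H_Q, W_Q)
# grid are upsampled with transposed convolutions. All layer choices here are
# assumptions for illustration only.
class UpsamplingHead(nn.Module):
    def __init__(self, in_channels: int = 16, hidden: int = 32, out_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, hidden, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(hidden, hidden, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each transposed conv (kernel 4, stride 2, padding 1) doubles H and W.
        return self.net(x)

head = UpsamplingHead()
print(head(torch.randn(1, 16, 28, 28)).shape)  # torch.Size([1, 1, 112, 112])
```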

Review (Reviewer 62gX)
Rating: 3

The article introduces a new method called Masked Cross-Attention Adapters (MAXA) for evaluating and characterizing the performance of different visual feature extractors (backbone networks) in dense prediction tasks. MAXA employs a cross-attention mechanism that enables effective feature extraction and dense prediction without relying on the size and resolution of the backbone network's output. This method introduces a learnable masking radius, allowing the model to adapt to the locality of various features, thereby achieving fast training while maintaining a low number of parameters.

Strengths

  • MAXA can adapt to downstream tasks with only a small number of additional parameters (less than 0.3%).
  • Since the backbone network is frozen during training, MAXA achieves faster training speeds.

Weaknesses

  • Although MAXA has been evaluated across three main task categories, these categories may not fully cover all possible visual tasks.
  • There is a lack of comparisons with other fine-tuning methods, such as Adapters Strike Back.
  • The novelty is relatively weak.

Questions

Is there more discussion on the reasons behind DINOv2's superior performance?

Comment (Authors)

Thanks for your feedback. To clarify: We compare against fine-tuning methods [1] (and object-centric methods) in Fig. 5. Would comparisons with additional fine-tuning methods improve your judgment of our work? Regarding the set of visual tasks: Which additional tasks specifically would strengthen the paper?

[1] Goldblum et al.: Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, NeurIPS 2024.

Comment (Reviewer 62gX)

Thank you for your response. I have decided to maintain my rating score.

Review (Reviewer SK5i)
Rating: 3

This work introduces the Masked Cross-Attention Adapter (MAXA), a method designed to provide a cost-effective approach for probing transformer backbones for their dense prediction capabilities. The method employs a masked cross-attention readout layer that uses positional encodings as fixed query vectors. This is followed by a second unmasked cross-attention layer and a deconvolution network. Additionally, each attention head incorporates a learnable locality bias term during cross-attention. The authors evaluate MAXA's performance using multiple pretrained backbones (such as MAE, DINO, and DINOv2) across tasks including instance awareness, semantic segmentation, and monocular depth estimation.
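To make the described readout concrete, below is a minimal PyTorch sketch of a masked cross-attention layer as summarized above (fixed positional encodings as queries, a per-head Gaussian locality bias with learnable standard deviation, applied to frozen backbone tokens). All dimensions, the additive log-space mask, and the tensor layout are assumptions for illustration rather than the paper's exact implementation; the second unmasked cross-attention layer and the deconvolution network are omitted.

```python
import torch
import torch.nn as nn


class MaskedCrossAttentionReadout(nn.Module):
    """Sketch of a masked cross-attention readout over frozen ViT tokens.

    Fixed positional encodings act as queries; each attention head applies a
    Gaussian locality bias around the query position with a learnable
    standard deviation. Dimensions and the additive log-mask are assumptions.
    """

    def __init__(self, feat_dim: int = 768, query_dim: int = 16, heads: int = 4):
        super().__init__()
        assert feat_dim % heads == 0
        self.heads, self.head_dim = heads, feat_dim // heads
        self.q_proj = nn.Linear(query_dim, feat_dim)
        self.k_proj = nn.Linear(feat_dim, feat_dim)
        self.v_proj = nn.Linear(feat_dim, feat_dim)
        self.log_sigma = nn.Parameter(torch.zeros(heads))  # one sigma per head

    def forward(self, feats, feat_pos, query_pos, query_enc):
        # feats:     (B, N, C)  frozen backbone tokens
        # feat_pos:  (N, 2)     token positions, normalized to [0, 1]^2
        # query_pos: (M, 2)     query positions, normalized to [0, 1]^2
        # query_enc: (M, Dq)    fixed positional encodings used as queries
        B, N, C = feats.shape
        M = query_pos.shape[0]
        q = self.q_proj(query_enc).view(M, self.heads, self.head_dim)
        k = self.k_proj(feats).view(B, N, self.heads, self.head_dim)
        v = self.v_proj(feats).view(B, N, self.heads, self.head_dim)

        logits = torch.einsum("mhd,bnhd->bhmn", q, k) / self.head_dim ** 0.5
        # Gaussian locality bias: squared distance between query and token positions.
        dist2 = ((query_pos[:, None, :] - feat_pos[None, :, :]) ** 2).sum(-1)  # (M, N)
        sigma = self.log_sigma.exp().view(1, self.heads, 1, 1)
        logits = logits - dist2[None, None] / (2 * sigma ** 2)  # log of the Gaussian mask

        attn = logits.softmax(dim=-1)                                    # (B, h, M, N)
        out = torch.einsum("bhmn,bnhd->bmhd", attn, v).reshape(B, M, C)
        return out  # one aggregated feature vector per spatial query


# Toy usage: 14x14 ViT tokens read out on a 28x28 query grid.
def grid(n):
    ys, xs = torch.meshgrid(torch.linspace(0, 1, n), torch.linspace(0, 1, n), indexing="ij")
    return torch.stack([ys, xs], dim=-1).reshape(-1, 2)

readout = MaskedCrossAttentionReadout()
tokens = torch.randn(2, 14 * 14, 768)
queries = torch.randn(28 * 28, 16)  # random stand-in for fixed positional encodings
print(readout(tokens, grid(14), grid(28), queries).shape)  # torch.Size([2, 784, 768])
```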

Strengths

The ability to perform a meaningful cost-effective evaluation of backbones for their dense prediction capabilities is currently lacking, and as a result, only a limited number of studies offer meaningful evaluations. The proposed method is well-motivated and clearly described, making it easy to follow. The authors provide a solid introduction to the problem and include an informative related work section. Additionally, the ablation studies demonstrate that the advantages of certain design choices are consistent across multiple tasks and different pretrained backbones.

Weaknesses

The relevance of a cost-effective dense prediction evaluation largely depends on its high correlation with currently optimal but more resource-intensive evaluation techniques. However, the experiments provided are insufficient to establish confidence in this correlation. While the authors evaluate MAXA across multiple tasks and backbones, they do not present a statistical analysis of its correlation with fine-tuned results, nor do they quantify the trade-off between training cost and performance compared to state-of-the-art techniques. Such context is necessary to make the presented evaluations meaningful.

The choice to follow simple FPNs and rely solely on information from the last layer appears questionable. Although Vision Transformers (ViTs) maintain the same resolution across all layers, it is doubtful that the final layer alone contains all the necessary information without fine-tuning the backbone. For instance, MAE demonstrated that linear probing performance of ViTs is not always a reliable indicator of fine-tuning performance. Furthermore, [1] showed that employing cross-attention readouts from every layer leads to significant performance improvements compared to using simple FPNs.

minor: Lines 260-270 contain some incomplete sentences, such as "SAM2 also good.".

[1] Chen, Zhe, et al. "Vision Transformer Adapter for Dense Predictions." The Eleventh International Conference on Learning Representations.

Questions

The ablations show that a feature dimension of 8 performs better than a dimension of 16. Why only report a change in one direction? How did a feature dimension of 4 perform? Similarly, without the second cross-attention layer MAXA performs worse. Does a third cross-attention layer improve performance?

Comment (Authors)

We appreciate your feedback and will implement the ablations you suggested. To clarify: Regarding correlation with fine-tuned methods we relate to object-centric [1,2] and fine-tuned methods [3] in Fig. 5. If you were aware of this, which additional direction or experiment would convince you specifically? In terms of the FPN, findings from the ViTDet paper [4] suggest that relying on the last ViT layer is favourable. Furthermore, the goal of our work is primarily to evaluate backbone representations (this should be stated more clearly!), thus a simple method is preferable. Does this additional context change your assessment?

[1] Aydemir et al.: Self-supervised object-centric learning for videos, NeurIPS 2023.
[2] Seitzer et al.: Bridging the gap to real-world object-centric learning, ICLR 2023.
[3] Goldblum et al.: Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, NeurIPS 2024.
[4] Li et al.: Exploring plain vision transformer backbones for object detection, ECCV 2022.

Review (Reviewer Phnn)
Rating: 3

This paper explores a vision adapter for pretrained ViTs on dense visual tasks such as segmentation and depth estimation. The paper benchmarks performance on these tasks in detail and also ablates some adapter design choices.

Strengths

  • Comprehensive Benchmark for Dense Vision Tasks: The paper makes a valuable contribution by introducing a comprehensive benchmark specifically designed for evaluating the dense prediction capabilities of pre-trained vision encoders, addressing a notable gap in the current landscape of benchmarks that primarily focus on classification tasks.  
  • Methodological Rigor: The authors employ a rigorous methodology, including the use of masked cross-attention adapters (MAXA) to enable fair comparisons across different encoders. The choice of MAXA is well-justified, as it allows for fast training and evaluation.  
  • Clarity and Insightful Presentation: The paper presents a clear and well-organized set of experiments, covering a diverse range of pre-trained models and dense tasks. The results are presented in a readily understandable and comparable manner, providing valuable insights into the relative strengths and weaknesses of different encoders for dense prediction tasks.

Weaknesses

  • Limited Insight into Learned Representations: While the benchmark effectively compares the performance of different encoders, it lacks deep analysis regarding the specific representations learned by each encoder. Simply stating that "DINOv2" achieves the highest numbers isn't sufficient; the paper would benefit from a more in-depth investigation into the characteristics of the learned representations that contribute to performance differences.  
  • Overlooking Architectural Biases: The paper does not explicitly address how architectural biases in different encoders might contribute to their performance on dense tasks. A discussion on this aspect would be valuable, as it could help disentangle the effects of pre-training from those inherent to the encoder architectures.  
  • Potential Bias from MAXA: Although the authors justify the use of MAXA, the paper could be strengthened by exploring whether the findings hold consistent with other adapter methods or lightweight dense heads. This would provide further validation of the results and address potential concerns about biases introduced by the specific choice of MAXA.  
  • Missing Comparisons with Key Adapter Methods: The paper lacks a direct comparison with other relevant adapter methods, such as ViT-Adapter and FeatUp. Including these methods in the evaluation would offer a more complete picture of the adapter landscape for dense prediction tasks.

Questions

Please see the weaknesses section.

Comment (Authors)

Thank you for your comments. To avoid confusion: We agree that architectural biases should be controlled for. Therefore, we used the ViT-B/16 architecture (when possible) in our experiments. Does your suggestion on different encoders refer to using CNN backbones? In terms of potential biases from MAXA: In Fig. 5 we assess consistency with fine-tuned non-dense adapters [1] and object-centric [2,3] methods. Do you miss a comparison with a specific method?

[1] Goldblum et al.: Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, NeurIPS 2024.
[2] Aydemir et al.: Self-supervised object-centric learning for videos, NeurIPS 2023.
[3] Seitzer et al.: Bridging the gap to real-world object-centric learning, ICLR 2023.

Comment (Authors)

Limited Insight into Learned Representations

We agree that an investigation of individual backbones would be interesting. However, in this paper we decided to compare a large number of backbones, and a detailed investigation of each one of them is not feasible within the scope of a single conference paper. Nonetheless, our evaluation can inform choices for an in-depth analysis of backbones.

Overlooking Architectural Biases

When possible (in eight cases), we report scores of the ViT-B/16 architecture. In many other cases, we use other forms of ViT, which have similar inductive biases. Therefore, we don't expect interesting insights from a deeper analysis.

Potential Bias from MAXA

We agree that there are inductive biases introduced through MAXA which might affect the results. This is a limitation of our work. However, MAXA is conceptually simple and has a small number of parameters, so we don't expect this to play a major role. As stated in our earlier comment, the requested comparison can be found in Fig. 3 and Fig. 5.

Missing Comparisons with Key Adapter Methods

Many adapter methods address whole-image classification problems instead of dense prediction. We decided to incorporate FeatUp in Fig. 3. For more details, please see general comment 2.

Comment (Authors)

General Comments and Updates

We thank the reviewers for their comments. For our revised version we incorporated the following changes:

  1. Goal of our work.
    The primary goal is not to outperform other adapter methods in terms of performance. Instead, we propose a fast and minimal adapter to directly assess feature quality in dense tasks. Our method can be seen as a dense equivalent to attentive probing (as in CoCa, DINOv2 and Aim2). To the best of our knowledge, no such method exists. We clarified the goal of our work in the introduction, as this was not well communicated before.

  2. Extended comparison.
    The main motivation behind our work is to address the lack of low-parameter, dense adapter methods for frozen backbones. Several related methods were brought up by the reviewers; below we clarify the differences from these methods:

    • ViT-Adapter [1]: Requires more parameters (for ViT-B: 14M vs. ~0.3M in our case). Uses common task heads with specific inductive biases (HTC++ for detection, Mask2Former for segmentation).
    • Adapters Strike Back [2]: Addresses whole-image classification. It compares different adapter types, but all are based on an ImageNet-trained ViT-B. In this paper we focus on dense tasks and compare different backbones.
    • Segmenter [3]: A transformer for semantic segmentation. Uses many more parameters as the backbone is not frozen.
    • ViTDet [4]: A transformer-based object detection model. Uses more parameters as the backbone is not frozen.
    • FeatUp [5]: We agree that FeatUp is a meaningful baseline and added this to our comparison. While FeatUp allows for linear probing with a very small parameter budget, it has larger memory requirements and around 4 times longer runtimes than our method.

    We agree that our literature section did not explain these differences well and updated it accordingly.

  3. Correlation with fine-tuning techniques.
    We concur that relating our findings on dense tasks to established results is important. We already compare to work in object-centric learning (relevant specifically for instance discrimination) and whole-image classification (Fig. 5). In the revised version, we added a comparison with matching timm backbones and zero-shot dense prediction models (as requested by reviewer 6ajN). Both experiments are reported in the appendix.

Minor changes

  • For reproducibility, we report hyperparameters and dataset splits in the appendix.
  • Furthermore, we added a description of the CNN baseline used in Fig. 3 to the appendix.
  • We added a visualization of feature outputs to the appendix.

[1] Chen et al.: Vision Transformer Adapter for Dense Predictions. ICLR 2023.
[2] Steitz and Roth: Adapters Strike Back, CVPR 2024.
[3] Strudel et al.: Segmenter: Transformer for semantic segmentation, ICCV 2021.
[4] Li et al.: Exploring plain vision transformer backbones for object detection, ECCV 2022.
[5] Fu et al.: Featup: A model-agnostic framework for features at any resolution, ICLR 2024.

AC Meta-Review

This work introduces a masked cross-attention adapter method for dense predictions that operates independently of the encoder output size, enabling efficient training with frozen backbones.

The reviewers expressed concerns about the manuscript's statistical correlation between MAXA and state-of-the-art methods, the reliance on simple FPN, and the need for comprehensive ablation studies. They called for deeper analysis of learned representations, comparisons with key adapter methods, and consideration of architectural biases. Additionally, they noted the limited scope of evaluated tasks and suggested including comparisons with other fine-tuning methods and updating the literature review. In response, the authors defended their choices, agreed to conduct ablation studies, and committed to updating the literature and expanding comparisons, though some gaps remain regarding the evaluation's breadth.

All the reviewers maintained their original scores. Given the unanimous opinion of all reviewers, the AC decided to reject the paper.

Additional Comments from the Reviewer Discussion

  • Reviewer SK5i: The reviewer raised concerns about the insufficient statistical correlation between MAXA and state-of-the-art fine-tuning methods, questioned the reliance on simple FPNs for predictions, and requested more comprehensive ablation studies to evaluate feature dimensions and cross-attention layers.

  • Author Response: The authors defended their architectural choices based on relevant literature and agreed to conduct the requested ablation studies for more thorough evaluation.

  • Reviewer Phnn: The reviewer highlighted the need for deeper analysis of learned representations, criticized the absence of comparisons with key adapter methods, and called for a discussion on potential architectural biases affecting performance.

  • Author Response: The authors recognized the interest in analyzing representations but noted scope limitations. They committed to including comparisons with FeatUp and indicated that their use of ViT-B/16 aimed to control for architectural biases while agreeing to expand the discussion.

  • Reviewer 62gX: The reviewer pointed out that the evaluated tasks might not comprehensively cover all visual tasks and requested comparisons with other fine-tuning methods, such as Adapters Strike Back.

  • Author Response: The authors sought clarification on which additional tasks would strengthen their evaluation, but didn’t provide comparisons with Adapters Strike Back.

  • Reviewer 6ajN: The reviewer criticized the outdated literature review, suggested including zero-shot comparisons with dense prediction methods, and recommended experiments with hybrid CNN-transformer architectures.

  • Author Response: The authors acknowledged the need to update the literature review and committed to revising those sections. They clarified their focus was not on competitive performance but agreed to include comparisons with suggested architectures and zero-shot methods in response to the reviewer's feedback.

Final Decision

Reject