PaperHub
Average rating: 6.2 / 10 · Decision: Rejected · 5 reviewers
Ratings: 6, 5, 8, 6, 6 (min 5, max 8, std 1.0)
Confidence: 4.4 · Correctness: 3.0 · Contribution: 2.6 · Presentation: 3.0
ICLR 2025

ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

Extended parameter efficient pre-training enables ViTs such as DinoV2 and MAE to effectively and efficiently transfer to different visual domains (i.e. satellite, medical imagery)

Abstract

Keywords
lora, PEFT, parameter-efficient finetuning, parameter-efficient pre-training, vision transformer, ViT, domain adaptation, domain generalization, satellite images, foundation models

Reviews and Discussion

Official Review
Rating: 6

This paper presents an approach for parameter-efficient continual pretraining and fine-tuning of visual foundation models addressing domain shift of the underlying data distribution. To accomplish this, the authors propose to use LoRA during continual pretraining and subsequent fine-tuning.

The domain shifts covered in this submission put a major focus on remote sensing imagery, as reflected by a large part of the experimental section. Nevertheless, the submission also covers some minor experiments including domain shifts on datasets of cell tissue images, wheat images, and animal images.

For all datasets, the authors were able to demonstrate the efficiency of their approach to continual pretraining and fine-tuning, with results that outperform fine-tuned visual foundation models.

I very much appreciate work in this area - especially because not everybody has the resources to train visual foundation models, so approaches like the one presented are valuable. However, overall I see this submission providing only a marginal contribution, as it combines known methods such as continual pretraining ([mendieta2023towards] proposed it for remote sensing) and low-rank adaptation ([scheibenreif2024parameter] proposed it for remote sensing).

Also, since the majority of the experimental section focuses on remote sensing, and in times of geo-spatial foundation models such as ScaleMAE, SatMAE, or GFM, I have trouble seeing why the first step of training such models should be to take a visual foundation model trained on natural images only.


@inproceedings{mendieta2023towards,
  title={Towards geospatial foundation models via continual pretraining},
  author={Mendieta, Mat{\'\i}as and Han, Boran and Shi, Xingjian and Zhu, Yi and Chen, Chen},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={16806--16816},
  year={2023}
}
@inproceedings{scheibenreif2024parameter,
    title     = {Parameter Efficient Self-Supervised Geospatial Domain Adaptation},
    author    = {Scheibenreif, Linus and Mommert, Michael and Borth, Damian},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27841-27851}
}

Strengths

(S1): This work covers an important aspect of foundation model training.

(S2): This work provides a vast amount of experimental results showing the capabilities of the proposed approach.

(S3): The presented ablation study provides valuable insights into parameter-efficient domain adaptation of remote sensing imagery.

Weaknesses

(W1): This submission combines known approaches rather than presenting methodological or algorithmic novelty. This can be of great value for the research community. However, I think the insights provided in this work might be of limited value to the ICLR research community.

(W2): Given previous work in this area, I miss baseline evaluations against non-LoRA continual pretraining (-> mendieta2023towards) and against pure LoRA (-> scheibenreif2024parameter). Such experiments would provide the opportunity to highlight the capabilities of the proposed approach much more clearly and to compare it against previous work.

Questions

(Q1): In section 4 Problem Setup, the target domain data comes from p_{D_T}(x) where D_T is a set of domains, being a subset of all domains. What is then the difference between p_{D_T}(x) and p_{d_T}(x, y) coming from d_T ∈ D_T? Is this just the formulation that some of the domains of p_{D_T}(x) might provide labels (i.e., p_{d_T}(x, y)) and some not? And if yes, is there a significant change in distributions between these? Can we assume that these distributions (p_{D_T}(x) and p_{d_T}(x, y)) are similar with respect to the distribution of x?

(Q2): Is there a reason why you compare against ScaleMAE for some experiments and not for others? For Table 1, ScaleMAE in its LoRA-r8 version is missing (while SatMAE is provided). It would be interesting to see ScaleMAE performance as a 0.8M fine-tuned parameter model. In Table 4 and Table 5, ScaleMAE is entirely missing. Table 6 lists ScaleMAE as a baseline. Is there a rationale behind this?

(Q3): In Table 3, the ablation study, there is an experiment showing [All], which adapts not only the attention matrices but also the MLP matrices. This is great, have you also run experiments showing how MLP adaptation without attention adaptation would perform?

(Q4): In Fig. 2, where can I find U in the figure, which is described in the image caption?

Comment

Thank you for your insightful feedback and recognition of ExPLoRA's value in efficient foundation model creation, as well as our comprehensive experimental results. We also appreciate that you found our ablation study to be insightful for parameter-efficient pre-training. Incorporating your feedback has significantly improved our paper. You may find responses to your concerns below:

Q: Is starting with natural-image foundation models necessary given existing geo-spatial models?
This is a fair question. While domain-specific foundation models exist, new domains and datasets continually emerge. ExPLoRA demonstrates that adapting natural-image foundation models from frontier labs can outperform fully pre-trained domain-specific models (e.g., SatMAE, ScaleMAE) while using significantly fewer resources. This is valuable because it enables researchers and practitioners to create effective foundation models for new domains without expensive from-scratch pre-training.

Q: Comparisons against GFM and GDA should be included
Thank you for this valuable suggestion. We agree that GFM [1] is a relevant prior work, and have included results from GFM in our revised Table 1.

GDA [2] was published in June 2024, which is just short of the July 1 2024 cutoff that ICLR considers for concurrent work and after we posted a pre-print of this work in June. Even so, we have worked to include results from GDA in Table 1.

ExPLoRA outperforms these works by ~6%, with several key advantages:

  • Parameter Efficiency: Unlike GFM, ExPLoRA doesn't require training the full ViT backbone
  • Model Flexibility: ExPLoRA works with and evaluates non-MAE methods (e.g., DinoV2), achieving SoTA on remote sensing benchmarks
  • Architectural Preservation: Unlike GDA's non-mergeable adapters, which modify the architecture (due to the scaling vector) and can increase inference latency at higher ranks, ExPLoRA's LoRA weights merge into the Q,V matrices (see the merge sketch after this list)
  • Fine-tuning Freedom: ExPLoRA allows varying LoRA ranks between pre-training and fine-tuning, and supports any PEFT method. GDA requires using pre-trained adapters during fine-tuning
  • Broader Applicability: We handle larger datasets (fMoW-RGB, fMoW-Sentinel) and diverse domains beyond remote sensing (i.e. WiLDS)
  • Systematic Block Selection: Our analysis in Section 6.3 provides clear insights into which transformer blocks encode local vs. global information, offering a principled approach to block selection
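
To make the mergeability point concrete, here is a minimal sketch (our illustration, not the authors' released code) of folding a LoRA update back into a frozen weight matrix; the shapes follow the standard LoRA convention and the function name is ours:

```python
# Minimal sketch: merging the low-rank update W' = W + (alpha / r) * B @ A back
# into the frozen weight, so inference uses one dense matmul with no added latency.
import torch

@torch.no_grad()
def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    # W: (out, in), A: (rank, in), B: (out, rank)
    return W + (alpha / rank) * (B @ A)
```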

We have also summarized and discussed these differences in an expanded related work section in the revised appendix A.1.

Q: This submission combines known approaches with potentially limited ICLR value
We understand this concern. While LoRA and unfreezing blocks are individually existing strategies, ExPLoRA's novelty lies in demonstrating the substantial value of combining them for extended pre-training:

  • We demonstrate that selectively combining full-rank tuning of ViT blocks with LoRA is both more parameter-efficient and effective than prior continual pre-training approaches
  • We achieve SoTA on fMoW (a key foundation model benchmark) and show significant improvements in linear probing, indicating strong feature extraction capabilities
  • Unlike GFM/GDA, our approach generalizes beyond masked-image modeling - our strongest results use DinoV2, challenging the MAE-based paradigm for remote sensing
  • We demonstrate generality across multiple domains via the WiLDS benchmark

Q: Distribution questions about p_{D_T}(x) and p_{d_T}(x, y)
Good question. Yes, our formulation indicates that a subset of D_T datasets are labeled, optionally allowing unsupervised pre-training on all unlabeled domain images. The distributions p_{D_T}(x) and p_{d_T}(x, y) are indeed similar with respect to x as they share domain T.

Q: Inconsistent ScaleMAE comparisons across tables
Thank you for pointing this out. We've added ScaleMAE (0.8M parameters) results to Table 1, showing it underperforms ExPLoRA-DinoV2. ScaleMAE is absent from Tables 4-5 as no pre-trained model exists for fMoW-temporal/Sentinel. We include ScaleMAE baselines in Table 6 as they evaluated on SpaceNet/Resisc-45.

Q: Have you also run experiments showing how MLP adaptation without attention adaptation would perform, for Table 3 (ablation study)?
Thank you for the suggestion. We have included this experiment in our latest revision in Table 3. We find that attention layers are more receptive to low-rank tuning, with MLP-only adaptation showing reduced representation learning capacity.

Q: In Fig. 2, where can I find U in the figure, which is described in the image caption.
Thank you for pointing this out. U corresponds to the unfrozen blocks. We have updated the figure to be clearer.


References:
[1] Towards geospatial foundation models via continual pretraining. ICCV 2023.
[2] Parameter Efficient Self-Supervised Geospatial Domain Adaptation. CVPR 2024.

Comment

I appreciate the authors' rebuttal and additional experiments, as it really helps to put the proposed method into the context of recent work published in this area. I moved my rating one step up to reflect the authors' rebuttal.

If I could ask for one more thing, it would be to have results on multispectral remote sensing data. As already mentioned, I appreciate the additional experiments and results included in Table 1 (which are evaluated on RGB remote sensing data). Multispectral Sentinel results can be found in Table 4 and it would be great to evaluate GFM and GDA for this data, too.

Comment

Thank you again for your careful consideration of our paper and your helpful suggestions that have made our work stronger. We appreciate that you have increased your score.

We will include evaluations of GFM and GDA on fMoW-Sentinel by the end of the rebuttal period. If there are any remaining suggestions please let us know.

Comment

As promised, we are following up with results comparing ExPLoRA against GFM and GDA on fMoW-Sentinel (multi-spectral data). Below is the updated Table 4 that will be included in the final draft of the paper:

| Model | Backbone | PEFT | Pre-train #Params | Fine-tune #Params | Top 1 Acc. |
|---|---|---|---|---|---|
| MAE | ViT-L | Full | - | 303.3M | 51.61 |
| SatMAE | ViT-L | Full | 303.3M | 303.3M | 61.48 |
| MAE | ViT-L | LoRA-r8 | - | 0.8M | 46.97 |
| SatMAE | ViT-L | LoRA-r8 | 303.3M | 0.8M | 59.48 |
| GFM | ViT-L | LoRA-r8 | 303.3M | 0.8M | 57.55 |
| GDA | ViT-L | GDA-r16 | 7.3M | 7.3M | 55.23 |
| MAE-[1,2,L-1,L] | ViT-L | LoRA-r8 | 51.5M | 0.8M | 54.12 |
| M-ExPLoRA-[L]-r32 | ViT-L | LoRA-r8 | 16.2M | 0.8M | 51.84 |
| M-ExPLoRA-[1,L]-r32 | ViT-L | LoRA-r8 | 29.7M | 0.8M | 60.15 |

Table 4: Results on the fMoW-Sentinel validation set. The "Pre-train #Params" and "Fine-tune #Params" columns indicate the number of trainable parameters required for adaptation to the new domain (multi-spectral satellite images).

The results show that ExPLoRA's continual pre-training approach outperforms both GFM and GDA by a significant margin (2-5% in accuracy) when evaluated using PEFT fine-tuning. This superior performance is achieved despite GFM pre-training all parameters of the ViT. For GDA, we used rank 16 as it yielded optimal results, which aligns with the default configuration recommended by the authors.

Thank you again for your time and consideration, for engaging with us during the rebuttal, and for your vote of confidence in our work. Please let us know if you have any additional questions or concerns that need to be resolved for further increasing your evaluation of our paper.

Official Review
Rating: 5

In this work, the authors introduce ExPLoRA to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. It initializes a ViT with pre-trained weights and continues unsupervised pre-training on a new domain with some blocks unfrozen and LoRA for the other layers. Then it fine-tunes with LoRA for supervised learning. Experiments show state-of-the-art results on satellite imagery. It improves linear probing top-1 accuracy, and ablation studies confirm its efficacy over other baselines.

Strengths

  1. The paper is well organized, the figures are readable and understandable.

  2. The proposed method looks logical and technically sound.

  3. The experimental results are strong. Consistent improvements have been shown over different baselines.

Weaknesses

  1. Concerns about the novelty: The proposed ExPLoRA does not seem to have essential differences from conventional LoRA. What are the differences from LoRA? This paper reads more like a technical report than a research paper.

  2. Many important experimental comparison results are missing. 1) The authors should compare their proposed method with recent SOTA ViT-based domain adaptation methods [1-6] in the same setting. 2) Many PEFT methods besides LoRA [7-8] should be compared; the reviewer wonders whether ExPLoRA outperforms these types of methods. 3) Results on the widely-used UDA benchmarks like Office-Home, Office-31, and VisDA are missing.

[1]. TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, WACV 2023;

[2]. Safe Self-Refinement for Transformer-based Domain Adaptation, CVPR 2022;

[3]. CDTRANS: CROSS-DOMAIN TRANSFORMER FOR UNSUPERVISED DOMAIN ADAPTATION, ICLR 2022;

[4]. Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective, CVPR 2023;

[5]. Towards Unsupervised Domain Adaptation via Domain-Transformer, IJCV 2024;

[6]. Making The Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation, ACM MM 2022;

[7]. Low-Rank Few-Shot Adaptation of Vision-Language Models, CVPR 2024;

[8]. Quantized Prompt for Efficient Generalization of Vision-Language Models, ECCV 2024;

  3. Many important references on domain adaptation and PEFT [1-8] are missing. These works should be briefly reviewed in the related work section.

--------------------------------------------------------------After Rebuttal--------------------------------------------------------------

Sorry for the late reply.

After carefully reading the rebuttal and other reviews, I'd like to thank the authors' efforts in response to my concerns. Though some of my concerns have been addressed, I still have concerns in terms of novelty and experimental comparisons:

  1. About the novelty. The authors acknowledge that ExPLoRA is a combination of existing strategies, e.g., LoRA and unfreezing blocks, and the authors' explanations in the rebuttal do not convince me. Thus the reviewer thinks it has limited technical contributions. I agree with Reviewer ZQZA's opinion that the technical novelty is of limited value to the ICLR research community, and I also agree with Reviewer 8d6C's opinion that the whole paper seems like a technical report rather than a research paper.

  2. About the experimental comparisons. Thanks to the authors for providing the results on the VisDA-2017 benchmark. Since this paper targets domain adaptation, the authors should conduct experimental comparisons on more widely-used domain adaptation benchmarks such as Office-31 and DomainNet. The results on the VisDA-2017 benchmark alone are insufficient to reveal the effectiveness.

To sum up, I will keep my rating unchanged and suggest the authors revise the paper according to my advice carefully.

Questions

  1. What are the differences between ExPLoRA and LoRA?
  2. Does the presented method outperform the state-of-the-art DA and PEFT methods in the same setting?
  3. What are the results on the widely-used UDA benchmarks?
Comment

Thank you for your thoughtful insights and recommendations. We appreciate your acknowledgement of our extensive SoTA empirical results as well as the method’s soundness. As for your concerns, which we have worked to address, please see our responses below:

Q: What are the differences between ExPLoRA and LoRA?
This is a fair question. ExPLoRA builds on LoRA but differs in both purpose and approach. While LoRA is a fine-tuning method for downstream tasks, ExPLoRA introduces parameter-efficient extended unsupervised pre-training. Though we use LoRA-style adapters for the Q,V matrices, we combine this with selective block unfreezing - a combination that proves significantly more effective and parameter-efficient than either using LoRA alone or using LoRA with higher ranks (Table 3, ablation study).
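
To make this concrete, below is a minimal sketch (our illustration, not the authors' released code) of the ExPLoRA-style configuration: freeze the backbone, fully unfreeze one or two chosen blocks, and wrap the Q and V projections of the remaining blocks with LoRA before continuing unsupervised pre-training. The attribute names (`blocks`, `attn.q_proj`, `attn.v_proj`) are hypothetical stand-ins for whatever the actual ViT implementation exposes.

```python
# Sketch only: LoRA on Q,V projections plus selective full-rank unfreezing.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: y = Wx + b + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.t()) @ self.B.t())

def configure_explora(vit: nn.Module, unfreeze=(-1,), rank: int = 8) -> nn.Module:
    """Freeze all weights, fully unfreeze the selected blocks, add LoRA to Q,V elsewhere."""
    for p in vit.parameters():
        p.requires_grad = False
    blocks = list(vit.blocks)
    unfrozen = {i % len(blocks) for i in unfreeze}
    for i, blk in enumerate(blocks):
        if i in unfrozen:
            for p in blk.parameters():          # full-rank tuning of the 1-2 chosen blocks
                p.requires_grad = True
        else:
            blk.attn.q_proj = LoRALinear(blk.attn.q_proj, rank)
            blk.attn.v_proj = LoRALinear(blk.attn.v_proj, rank)
    return vit  # then continue the original unsupervised objective (e.g., DinoV2 or MAE loss)
```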

Q: How does ExPLoRA compare to UDA methods and perform on UDA benchmarks?
Great question! ExPLoRA and traditional UDA methods address different stages of domain adaptation. UDA methods require labeled source domain data while adapting to unlabeled target data, focusing on the downstream supervised task. In contrast, ExPLoRA creates domain-adapted backbones without requiring any labels.

To clarify using notation from Section 4 of our paper: UDA methods assume access to a labeled distribution for the source domain D_S given by p_{D_S}(x, y) and an unlabeled distribution for the target domain D_T given by p_{D_T}(x). Datasets such as VisDA2017 or OfficeHome assume shared label sets Y between D_S and D_T, i.e., Y_{D_S} = Y_{D_T}. ExPLoRA's setting is different. We only assume access to weights W_{D_S} from a model pre-trained via unsupervised learning on p_{D_S}(x), without requiring direct access to the source distribution p_{D_S}(x) (which may not have labels, unlike in UDA). Further, we don't place any restrictions on the label set Y for the different domains. Thus, ExPLoRA differs from traditional UDA considered in the works you have cited, and so we don't label our method as "UDA". Thank you for prompting this contextualization - we will add a discussion in our paper (appendix A.2) to clarify this.
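
As a compact restatement of the two settings (our own summary, using the same symbols as above):

```latex
% UDA: labeled source data and unlabeled target data, with a shared label set
\text{UDA:}\quad \{(x_i, y_i)\} \sim p_{D_S}(x, y), \qquad \{x_j\} \sim p_{D_T}(x), \qquad Y_{D_S} = Y_{D_T}
% ExPLoRA: only weights pre-trained (unsupervised) on the source, plus unlabeled target data
\text{ExPLoRA:}\quad W_{D_S} \text{ from unsupervised pre-training on } p_{D_S}(x), \qquad \{x_j\} \sim p_{D_T}(x)
```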

Rather than competitors, UDA methods can be viewed as complementary to ExPLoRA - they can benefit from initialization with ExPLoRA's pre-trained weights instead of standard natural-image pre-training.

We will add some additional experiments using UDA approaches on top of ExPLoRA backbones to demonstrate their compatibility.

Q: How does ExPLoRA compare to SoTA PEFT Methods?
Thank you for suggesting these additional works. ExPLoRA is designed to complement rather than replace PEFT methods. Since ExPLoRA preserves the ViT architecture, any PEFT method can be used for subsequent fine-tuning. Nonetheless, we are still interested in how ExPLoRA and PEFT methods perform in tandem. CLIP-LoRA applies low-rank matrices on key, value, and query matrices of the text and vision encoders with ranks of 2 [7]. We have already evaluated the analogue of this for vision where we do not have access to text. That is, we fine-tuned with LoRA, using different ranks and applying low-rank matrices to different subsets of key, value, and query matrices of the vision encoder.

Please also see our overall response for a discussion of all recent PEFT methods. While there are many PEFT techniques and comparing with all of them would be infeasible given time constraints, we do our best to select SoTA baselines from a variety of PEFT families (e.g., SAVP and Gated-VPT for visual-prompt tuning, Adapter+ for adapters, AdaLoRA for LoRA-based methods, BOFT for multiplicative methods, GDA for scaled-low-rank adapters/side-tuning). If there are crucial works missing in our comparison, please let us know.

Comment

Dear reviewer 47Sp,

Thank you very much for your time and consideration in providing us with your review that has improved our work.

As today is the final day to make revisions to the pdf of the paper, please let us know if you have remaining concerns that need to be addressed. If your concerns are resolved, we kindly request that you reconsider your score to reflect that.

-Authors

Comment

Thank you again for your time and valuable feedback. While ExPLoRA is not a traditional unsupervised domain adaptation (UDA) method, it can serve as an initialization for ViT-based UDA approaches. As promised in our initial rebuttal response, we demonstrate this compatibility below.

Classification accuracy (%) on VisDA-2017 (validation). Results marked with * use our reproduced results. Using ExPLoRA initialization improves UDA performance compared to standard ImageNet initialization.

| Method | Arch. | Init | plane | bcycl | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSRT | ViT-B | IN-21k | 98.9 | 87.6 | 89.1 | 84.8 | 98.3 | 98.7 | 96.3 | 81.1 | 94.9 | 97.9 | 94.5 | 43.1 | 88.8 |
| CDTrans | DEiT | IN | 97.1 | 90.5 | 82.4 | 77.5 | 96.6 | 96.1 | 93.6 | 88.6 | 97.9 | 86.9 | 90.3 | 62.8 | 88.4 |
| PMTrans | ViT-B | IN-21k | 98.9 | 93.7 | 84.5 | 73.3 | 99.0 | 98.0 | 96.2 | 67.8 | 94.2 | 98.4 | 96.6 | 49.0 | 87.5 |
| TVT | ViT-B | IN-21k | 97.1 | 92.9 | 85.3 | 66.4 | 97.1 | 97.1 | 89.3 | 75.5 | 95.0 | 94.7 | 94.5 | 55.1 | 86.7 |
| TVT* | ViT-B | IN-21k | 95.8 | 85.8 | 81.9 | 68.4 | 95.9 | 96.2 | 91.9 | 70.3 | 93.8 | 93.7 | 92.9 | 48.5 | 84.6 |
| TVT | ViT-B | DinoV2 | 98.4 | 87.3 | 87.4 | 69.5 | 99.0 | 68.3 | 94.3 | 53.5 | 80.9 | 87.3 | 97.5 | 60.0 | 82.0 |
| TVT | ViT-B | ExPLoRA | 94.6 | 92.1 | 90.9 | 76.6 | 97.1 | 90.0 | 94.4 | 86.4 | 93.6 | 94.7 | 98.4 | 53.5 | 88.5 |

For context, the VisDA2017 dataset contains 152,297 training and 55,388 validation images across 12 object classes, representing a synthetic-to-real domain shift: training images are synthetically rendered 3D models under various lighting conditions, while validation images come from MS-COCO.

The table above shows ExPLoRA's effectiveness when combined with TVT [1], a state-of-the-art UDA method. Using ExPLoRA D-[12]-r64 (DinoV2-initialized ViT-B with last layer unfrozen and LoRA-r64 elsewhere) pre-trained on both synthetic and real domains, we outperform traditional ImageNet-21k initialization by 1.5-3% while achieving more balanced per-class accuracy. With ExPLoRA initialization, TVT's performance rises to match recent SoTA methods, a significant improvement over its original results. Most notably, we surpass DinoV2 initialization by ↑6%, demonstrating that ExPLoRA's unsupervised initialization matches state-of-the-art UDA methods that rely on supervised ImageNet-21k pre-training.

These results demonstrate the benefits of ExPLoRA as an unsupervised pre-training method for new domains, as well as its wide compatibility not only with PEFT (see Table 1 of our main paper), but also with UDA. We will be including these results in our paper, thanks to your suggestions.

Please let us know if you have any further questions or suggestions. If these are resolved, we kindly request that you reconsider your score.


References:
[1] TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, WACV 2023.
[2] Safe Self-Refinement for Transformer-based Domain Adaptation, CVPR 2022.
[3] CDTRANS: Cross-Domain Transformer For Unsupervised Domain Adaptation, ICLR 2022.
[4] Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective, CVPR 2023.

Comment

Dear Reviewer 47Sp,

Thank you for your thoughtful review of our paper. As we approach the end of the discussion period, we would greatly appreciate if you could review our rebuttal responses and let us know if we have adequately addressed your concerns.

If you feel that our clarifications and additional experimental results have resolved your initial concerns, we kindly request that you consider updating your evaluation accordingly. Your final assessment is valuable to us and to the broader review process.

We understand you are likely managing many responsibilities, and we truly appreciate your time and attention throughout this process.

Best regards,
Authors

Comment

Thank you for your thoughtful feedback. We appreciate your consideration of our rebuttal and would like to clarify several key points.

Value of ExPLoRA's Technical Contribution
While ExPLoRA combines existing strategies (LoRA and selective unfreezing), its novelty lies in demonstrating an effective approach for parameter-efficient pre-training of vision transformers for new domains. This represents a significant finding with immediate practical impact.

Our contributions, with further detail in the full rebuttal response, include:

  1. A novel approach combining LoRA with selectively unfreezing ViT blocks for continual pre-training on new visual domains. ExPLoRA works with popular self-supervised learning methods (DinoV2, MAE) and preserves the ViT architecture, further enabling compatibility with any downstream method (PEFT, UDA etc.)
  2. Extensively verified state-of-the-art performance across multiple large datasets and challenging domain shifts (eg: satellite, medical, wildlife, agricultural, synthetic imagery) while using <10% trainable parameters.
  3. Providing systematic analysis of information encoding in ViT layers, offering clear guidelines for block selection during pre-training

As noted in the guidelines of NeurIPS, a peer conference, demonstrating the effectiveness of combining existing techniques can provide substantial research value. Our extensive experiments confirm this - ExPLoRA outperforms both from-scratch pre-training and recent PEFT methods across multiple domains while using significantly fewer parameters. This is particularly valuable given the increasing costs of pre-training foundation models for new domains.

Reviewers 8d6C and ZQZA specifically highlighted these strengths, noting ExPLoRA's value for "parameter-efficient unsupervised pre-training" and its "strong results on multiple domains and benchmark datasets."


Experimental Comparisons
We appreciate your suggestion regarding UDA benchmarks. However, as detailed in our previous response and appendix A.2, ExPLoRA addresses a fundamentally different problem than traditional UDA. While UDA methods require labeled source domain data, ExPLoRA enables unsupervised domain adaptation using only pre-trained weights. Thus, ExPLoRA is not a UDA method and comparison with UDA methods is not the main focus of this paper.

Our experiments focus on demonstrating ExPLoRA's superior performance against pre-training from scratch, continual pre-training and PEFT across challenging real-world scenarios, such as:

  • Multiple satellite image modalities: high-res RGB, low-res multi-spectral, and temporal sequences
  • Various downstream tasks: classification, segmentation, detection
  • Different application domains: medical, wildlife, and agricultural via WiLDS benchmark
  • Synthetic domain transfer through VisDA2017

Upon your valuable recommendation and given the tight timeline of the ICLR rebuttal, we also included results on VisDA2017 to further demonstrate ExPLoRA’s compatibility with UDA methods and on synthetic domain data. The VisDA2017 results are noteworthy - ExPLoRA initialization elevates TVT's performance to match SOTA UDA methods, demonstrating its value even in traditional domain adaptation settings. This is a novel finding that further validates ExPLoRA's effectiveness.

These comprehensive experiments across diverse, large-scale datasets provide strong evidence of both ExPLoRA's soundness and its practical utility. The breadth and depth of our experimental validation offers practitioners a high degree of confidence in applying our method to real-world domain adaptation challenges.

Official Review
Rating: 8

This paper introduces ExPLoRA, a novel parameter-efficient method for adapting pre-trained vision transformers (ViTs) to new domains through extended unsupervised pre-training. ExPLoRA initializes ViTs with weights from natural-image datasets and continues pre-training on new domains with LoRA. The model is then fine-tuned on the new domain for supervised learning, achieving impressive results on satellite imagery and generalizing to various domains like wildlife, medical, and agricultural imagery. ExPLoRA outperforms fully pre-trained and fine-tuned techniques while using significantly fewer parameters.

Strengths

  1. The writing of the article is quite good and the article is logical.
  2. Supplementary materials are substantial.
  3. The use of PEFT for post-pretraining in the visual domain is innovative.
  4. The experiments in the domain migration section are ample and sensible.

Weaknesses

  1. Figure 1 could express the difference between full fine-tuning and PEFT in a more accurate form.

  2. The idea of using PEFT for post-pretraining is a good one. LoRA was proposed a few years ago, and there are more efficient PEFT methods in the visual field. Maybe the authors can try to compare LoRA with newer methods [1-3]. [1] 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. [2] Pro-tuning: Unified prompt tuning for vision tasks. [3] Adapter is all you need for tuning visual tasks.

  3. The method section is more like a solution derived through experience. It is recommended to analyse whether LoRA has limitations in coping with visual post-pretraining tasks, and optimise based on the analysis to propose your own PEFT method.

  4. The whole article seems to be like a technical report, and it is recommended to add some in-depth theoretical analyses and technical innovations for visual features.

----------------------------------------After Rebuttal------------------------------------------------

The rebuttal resolves part of my confusion and I have improved my score. Still, I think the authors could have cited related work that I mentioned or didn't mention. Necessary citations both give the reader a broader view of developments in the field, and are a way of recognising and respecting those who work in the field.

Questions

See above.

Comment

Thank you for your valuable feedback, recognition of ExPLoRA’s novelty in parameter-efficient post pre-training for visual data, and for our extensive and logical experimental results. We appreciate your vote of confidence in our method, and have worked to incorporate your suggestions, found below:

Q: Figure 1 seems to express the difference between full fine-tuning and PEFT in a more accurate form.
Could you please clarify which specific aspects of Figure 1 or 2 need improvement? We welcome concrete suggestions to enhance their clarity.

Q: Comparison with more recent PEFT methods.
We include several recent SoTA PEFT methods in Table 1 (BOFT, GVPT, SA^2VP, AdaLoRA). We've expanded our experimental comparisons to include:

  • Parameter Efficient Self-Supervised Geospatial Domain Adaptation, CVPR 2024.
  • Towards geospatial foundation models via continual pretraining. CVPR 2023
  • Adapters Strike Back (CVPR 2024)
  • Mona [11], follow-on work to LoRand [10] in CVPR 2023, (to be added to Table 1).

The other cited works are either already outperformed or superseded by prior works we have included experiments from. E.g.,

  • SA²VP outperforms Pro-Tuning [8] on CIFAR-100, Oxford flowers.
  • BOFT [4] and AdaLoRA [7] outperform SSF [9] on VTAB-1K.
  • “Adapters Strike Back” [3] includes comparisons with the most recent adapter-based methods, outperforming [9, 12, 13] on VTAB.
  • Mona [11] outperforms LoRand [10] and other adapter methods [12] on COCO and other benchmarks

Q: Analysis of LoRA's limitations and proposing a novel PEFT method
Thank you for your suggestion. We want to clarify that our primary contribution is demonstrating the effectiveness of parameter-efficient unsupervised pre-training for domain adaptation to visual data. We are the first to show that combining LoRA with selective ViT block unfreezing creates strong foundation models for new domains at a fraction of the computational cost. This addresses a significant challenge in foundation model development - while frontier labs and organizations invest substantial resources in developing natural-image foundation models like DinoV2, most practitioners cannot afford to pre-train new models for each domain. ExPLoRA enables direct adaptation of these pre-trained models to new domains without expensive from-scratch pre-training.

Moreover, our analysis in Section 6.3 provides insights into ExPLoRA's effectiveness:

  • We evaluate patch embeddings for both local information (patch position prediction) and global information (image classification). We demonstrate ExPLoRA enhances both types of information in patch representations
  • Through spectral analysis, we identify a strong correlation between patch feature map eigenvalues and position accuracy, providing a systematic approach for selecting blocks to unfreeze. E.g., for classification tasks, target layers that have low eigenvalues and high global information (i.e. class accuracy).
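
For readers who want to reproduce this kind of probe, a toy sketch is below (ours, not the paper's exact analysis); `patch_feats` is assumed to be the patch-token output of one ViT block with shape (num_patches, dim):

```python
# Toy sketch: spectral summary of patch embeddings from one ViT block. The leading
# eigenvalues of the patch-feature covariance give the per-block statistic that can
# then be correlated with patch-position and class-probe accuracies.
import torch

def patch_spectrum(patch_feats: torch.Tensor, top_k: int = 10) -> torch.Tensor:
    feats = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    cov = feats.t() @ feats / max(feats.shape[0] - 1, 1)
    eigvals = torch.linalg.eigvalsh(cov)   # eigenvalues in ascending order
    return eigvals.flip(0)[:top_k]         # largest first
```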

While a theoretical investigation of LoRA and full-rank tuning interactions would be valuable, our current focus is on empirical validation of ExPLoRA's effectiveness across multiple realistic and challenging domain shifts. We demonstrate SoTA results on several benchmarks while maintaining computational efficiency.

Please let us know if you have any recommendations for experimental analysis that would further strengthen our work, and we will be happy to incorporate them.


References:
[1] Towards geospatial foundation models via continual pretraining. ICCV 2023.
[2] Parameter Efficient Self-Supervised Geospatial Domain Adaptation. CVPR 2024.
[3] Adapters Strike Back. CVPR 2024.
[4] Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv:2311.06243 (2023).
[5] Improving visual prompt tuning for self-supervised vision transformers. ICML 2023.
[6] SA²VP: Spatially Aligned-and-Adapted Visual Prompt. AAAI 2024.
[7] AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv:2303.10512 (2023).
[8] Pro-tuning: Unified prompt tuning for vision tasks. NeurIPS 2022.
[9] Scaling & shifting Your Features: A New Baseline for Efficient Model Tuning. NeurIPS 2022.
[10] 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. CVPR 2023.
[11] 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks. arXiv:2408.08345 (2024).
[12] AdaptFormer: Adapting vision transformers for scalable visual recognition. NeurIPS 2022.
[13] Sensitivity-aware visual parameter-efficient tuning. ICCV 2023.

Comment

Thank you very much for considering our rebuttal and for increasing your score. We have incorporated your suggestions for relevant citations into our revised draft. We value your vote of confidence in our work -- your review has been very helpful in improving our paper.

Official Review
Rating: 6

The paper presents ExPLoRA, a method for efficiently adapting pre-trained vision transformers (ViTs) to new domains using parameter-efficient fine-tuning (PEFT) techniques, utilizing LoRA (Low-Rank Adaptation). By continuing unsupervised pre-training on the target domain and only unfreezing select model layers, ExPLoRA enables adaptation with minimal computational overhead. This approach leverages pre-trained models on natural image datasets like DinoV2, achieving notable performance gains, particularly in challenging domains like satellite imagery. For instance, ExPLoRA outperforms fully pre-trained models on satellite classification tasks while using fewer parameters, highlighting its efficiency. Beyond satellite data, ExPLoRA generalizes well to other domains, including medical and wildlife imagery, as tested on the WILDS benchmark.

Strengths

1: ExPLoRA excels at adapting large vision transformers to new domains without requiring full re-training, instead leveraging low-rank adaptation. This parameter-efficient approach significantly reduces computational costs, making it suitable for resource-constrained environments.

2: The proposed method is simple yet effective. The ExPLoRA can directly use the well-trained vision foundation models. This compatibility also simplifies its integration with established models.

3: Extensive experiments on RGB satellite images show the proposed methods' effectiveness compared with previous state-of-the-art works.

Weaknesses

1: Analysis of the effect of total training cost. From my understanding, this work's main contribution (claim) is to adopt continued unsupervised pre-training before supervised training on downstream domains (I don't regard using LoRA fine-tuning and unfreezing the last ViT blocks as this work's contribution; feel free to point out my misunderstanding if it is). It is a two-stage approach. Therefore, it's important to consider the computational cost of the two stages simultaneously. I have seen Figure 7, which analyzes the effect of pre-training iterations. How about simply putting the equivalent computational cost into the supervised fine-tuning stage? Will unsupervised pre-training speed up the convergence of supervised fine-tuning? If the computational budget is fixed, how should we allocate it across the two different stages?

2: The effect of the "extended unsupervised pre-training" stage. Current ablation studies mainly focus on the settings of learnable parameters (unfrozen blocks and LoRA). I hold the perspective that the main claim of this paper is the importance of "extended pre-training" for downstream domain transfer. Therefore, the authors are encouraged to verify the necessity of this stage further. For example, supervised fine-tuning could be performed directly with the current optimal setting of LoRA and unfrozen blocks for transferring.

3: Extending the model to multi-modal models like CLIP and zero-shot settings like using natural language for zero-shot understanding in downstream tasks. The authors could follow the setting of the CLIP paper (the downstream validation dataset).

4: (An open question). How about changing the unsupervised training objectives in the extended pre-training stage? Large-scale pre-trained DinoV2 and MAE both have their advantages. The unsupervised training stage is not limited to the corresponding original training objective. Can the model combine different advantages by using different training objectives in this stage?

Overall, this paper presents a simple yet effective method for model adaptation. My main concern exists in the importance of the claimed "extend unsupervised pre-training" stage.

------------------------------------------------- After Rebuttal ------------------------------------------------

Most of my concerns have been addressed. My main remaining concern is the importance of the claim that the extended pre-training stage plays a necessary role in overcoming domain shifts.

Although I still doubt whether the extended pre-training stage is really as important as claimed, this does not prevent this paper from being a very good paper with detailed and comprehensive experimental analysis. Therefore, I decide to increase my score to 6 at this time.

Questions

Potential limitations of the proposed extended pre-training. Can this stage only benefit domain shifts, or can it also work in general scenarios (like general classification and general object detection)? I also did some experiments before and I failed. I would appreciate it if the authors' pipeline works in such scenarios, but I also understand this part is out of the scope of this paper's claim.

Ethics Concerns

N/A.

Comment

Thank you for your very insightful suggestions and feedback. We appreciate your recognition of ExPLoRA’s value in significantly reducing computational costs via unsupervised extended pre-training, its compatibility with established and future ViT models, and our extensive experiments that demonstrate its SoTA performance. Please find responses to your concerns below:

Q: What is the computational cost trade-off between supervised fine-tuning and extended pre-training?
Thank you for prompting this analysis - this is a great question. We analyze this in detail in the new appendix section B.2 (Figure 6), examining two key aspects:

  • Fixed parameter budget: Does equivalent supervised fine-tuning match ExPLoRA + fine-tuning?
  • Fixed compute budget: Can supervised fine-tuning alone achieve similar performance with equal GPU-hours?

For both scenarios, the answer is no - ExPLoRA's extended pre-training provides gains that supervised fine-tuning alone cannot match. Even with the same configuration (unfrozen blocks + high-rank LoRA), direct fine-tuning falls short by ≥0.9% in top-1 accuracy. Increasing the parameter budget by unfreezing more blocks doesn't close this gap. Here, the computational budget is measured via total GPU hours allocated to pre-training + fine-tuning.

Moreover, ExPLoRA provides unique benefits beyond methods that only support fine-tuning:

  • Can leverage large unlabeled domain-specific datasets (eg: unlabeled satellite, medical etc. imagery)
  • Creates strong ViT feature extractors (7%+ improvement in linear probing, over prior SoTA such as SatMAE [1], ScaleMAE [2] etc., Table 2) which can be used for unsupervised image retrieval or compression
  • Serves as a “foundation model”. i.e. can be used as an initialization for other downstream tasks (Tables 6, 11 show SoTA on Resisc-45 [3] and EuroSAT [4] without additional pre-training)

Q: How necessary is extended pre-training with the ExPLoRA configuration? Can we use the same configuration (LoRA + unfreezing blocks) directly for fine-tuning?
Thank you for this suggestion. Our experiments in section B2 show that using the same configuration (LoRA + unfrozen blocks) directly for fine-tuning hits a lower accuracy ceiling compared to fine-tuning with ExPLoRA weights. This demonstrates the value of our extended pre-training phase.

Q: Extending the model to multi-modal models like CLIP and zero-shot settings like using natural language for zero-shot understanding in downstream tasks
This is a valuable suggestion. While ExPLoRA is applicable to CLIP, it would require paired image-caption data for new domains, making it a supervised setting. However, in this paper we choose to focus on unsupervised pre-training on image data (eg: DinoV2, MAE etc.). We leave valuable exploration of multi-modal extensions to future work.

Q: Can we change the unsupervised training objectives in the extended pre-training stage?
This is a very interesting question. While it's possible to mix objectives during extended pre-training, it requires careful consideration. Our experiments with using MAE weights for DinoV2 pre-training showed suboptimal results compared to continuing the original objective. While there may be better objective combinations, our current goal is to demonstrate the value of unsupervised extended pre-training for existing visual foundation models (e.g., MAE, DinoV2).


References:
[1] SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery, NeurIPS 2022.
[2] ScaleMAE: A scale-aware masked autoencoder for multiscale geospatial representation learning, ICCV 2023.
[3] Remote sensing image scene classification: Benchmark and state of the art, CVPR 2017.
[4] Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification, IEEE 2019.

Comment

Dear reviewer 7P1n,

Thank you very much for your time and consideration in providing us with valuable suggestions that have improved our work.

As today is the final day to make revisions to the pdf of the paper, please let us know if you have remaining concerns that need to be addressed. If your concerns are resolved, we kindly request that you reconsider your score to reflect that.

-Authors

Comment

Dear Reviewer 7P1n,

Thank you for your thoughtful review of our paper. As we approach the end of the discussion period, we would greatly appreciate if you could review our rebuttal responses and let us know if we have adequately addressed your concerns.

If you feel that our clarifications and additional experimental results have resolved your initial concerns, we kindly request that you consider updating your evaluation accordingly. Your final assessment is valuable to us and to the broader review process.

We understand you are likely managing many responsibilities, and we truly appreciate your time and attention throughout this process.

Best regards,
Authors

Official Review
Rating: 6

This work presents ExPLoRA, which initializes a ViT with pre-trained weights, selectively unfreezes 1-2 blocks, tunes the remaining weights with LoRA, and continues unsupervised pre-training on a new domain. It then fine-tunes the model on the new domain for supervised learning. This work demonstrates state-of-the-art results on satellite imagery and generalizes to different domains like wildlife, medical, and agricultural imagery.

Strengths

  • The introduction of ExPLoRA, a new parameter-efficient method to extend unsupervised pretraining on target domains.

  • Conducting comprehensive experiments on various datasets, and showcasing improvements in linear probing top-1 accuracy and outperforming existing techniques on datasets like fMoW.

  • The authors show the effectiveness of ExPLoRA and analyze the differences in local and global information encoded in the patch representations output by each ViT block.

Weaknesses

  • The authors argue that they selectively unfreeze 1-2 blocks and tune the remaining weights with LoRA. Specifically, however, it is very difficult to determine how many layers to tune and which layers to tune. Although the authors have shown that tuning 1-2 layers is feasible, it is not known whether the number of layers and the specific layers selected are sensitive to different domains. In other words, the authors' method might not be optimal for different domains or datasets.
  • The authors may have missed two baselines. For example, in Table 1, the authors only give a result of using their own method for finetuning. However, one baseline is to use the LoRA method in the pre-training phase, and the authors should compare the effect of their method with this baseline. The second baseline is to use the pre-trained MAE for full finetuning directly on the downstream task. Intuitively, it is still unknown whether further pre-training on the downstream task is necessary. This baseline would complete the integrity of the experiments.
  • In addition to LoRA, the authors lack a comparison with some related PEFT methods [1,2,3,4,5]. It would be better to discuss these related methods. Furthermore, as mentioned in the paper, it would also be better for the authors to give some results on detection and segmentation tasks.

[1] LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning, in NeurIPS2022.

[2] Scaling & shifting Your Features: A New Baseline for Efficient Model Tuning, in NeurIPS2022.

[3] Vision Transformer Adapter for Dense Predictions, in ICLR2023.

[4] Adapters Strike Back, in CVPR2024.

[5] HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning, in NeurIPS2024.

Questions

See weaknesses.

Comment

References:

[1] SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery, NeurIPS 2022.
[2] Parameter Efficient Self-Supervised Geospatial Domain Adaptation. CVPR 2024.
[3] Adapters Strike Back. CVPR 2024.
[4] Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv:2311.06243 (2023).
[5] Improving visual prompt tuning for self-supervised vision transformers. ICML 2023.
[6] SA²VP: Spatially Aligned-and-Adapted Visual Prompt. AAAI 2024.
[7] AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv:2303.10512 (2023).
[8] Pro-tuning: Unified prompt tuning for vision tasks. NeurIPS 2022.
[9] Scaling & shifting Your Features: A New Baseline for Efficient Model Tuning. NeurIPS2022.
[10] 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. CVPR 2023.
[11] 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks. arXiv:2408.08345 (2024).
[12] AdaptFormer: Adapting vision transformers for scalable visual recognition. NeurIPS 2022.
[13] Sensitivity-aware visual parameter-efficient tuning. ICCV 2023.

Comment

Dear reviewer ANFL,

Thank you very much for your time and consideration in providing us with helpful feedback that has improved our work.

As today is the final day to make revisions to the pdf of the paper, please let us know if you have remaining concerns that need to be addressed. If your concerns are resolved, we kindly request that you reconsider your score to reflect that.

-Authors

Comment

Thank you for your feedback, recognition of ExPLoRA's novelty as a parameter-efficient pre-training method, and appreciation of our extensive empirical results. You may find responses to your concerns below:

Q: Are the number and index of unfrozen layers sensitive to different datasets and domains?
This is a good question. In section 6.3, we analyze each block's sensitivity to ExPLoRA's extended pre-training through spectral analysis and linear-probing for local vs global information. We found consistent results across datasets - block 23 consistently showed the highest propensity to improve global information in output feature vectors. To emphasize parameter efficiency, we limited unfrozen blocks to 1-2 across all experiments. For most datasets except fMoW-Sentinel, unfreezing 1 block was sufficient to achieve ExPLoRA's benefits while maintaining low parameter cost.

Q: Was LoRA used as a pre-training baseline?
Yes - we already include these results in Table 3, rows 3 and 4. Using only LoRA, even with high ranks, performs worse than ExPLoRA by at least 2% while using 7M more parameters.

Q: Was the pre-trained MAE used for full fine-tuning on the new domain dataset? Is extended pre-training necessary?
The SatMAE paper [1] shows this experiment in their Table 1: direct fine-tuning of pre-trained MAE performs 0.93% worse than pre-training SatMAE. As shown in our Table 1, LoRA fine-tuning with ExPLoRA outperforms LoRA fine-tuning with either pre-trained MAE or SatMAE while using the same objective.

ExPLoRA's impact is even more pronounced with DinoV2, achieving SoTA at 79.2% and outperforming fully fine-tuned models. Beyond fine-tuning performance, Table 2 shows ExPLoRA's strong feature extraction capabilities through significantly improved linear probing accuracy compared to SatMAE, DinoV2, and other SoTA methods (>7%). This enables use in label-free tasks (e.g., image embedding retrieval) and provides strong initialization for downstream tasks (Tables 6, 10, 11).

Please also see our new section B.2 (figure 6) in the appendix which analyzes the impact of extended pre-training over simply fine-tuning in more detail. To summarize, we find that ExPLoRA reaches a higher max top 1 accuracy than simply fine-tuning for longer.

Q: Comparison with related PEFT methods
Thank you for suggesting these works. We included several state-of-the-art PEFT methods in our results in Table 1, including BOFT [4], GVPT [5], SA^2VP [6], and AdaLoRA [7], which were all published within the last 1-2 years and have shown strong performance for ViT backbones. We have cited all references you have provided, and have included further results from specifically the following:

  • Parameter Efficient Self-Supervised Geospatial Domain Adaptation [2], CVPR 2024 (added to Table 1).
  • Adapters Strike Back [3], in CVPR2024 (added to Table 1).
  • Mona [11], follow-on work to LoRand [10] in CVPR 2023, (added to Table 1).

The other cited works are either already outperformed or superseded by prior works we have included experiments from. E.g.,

  • SA²VP outperforms Pro-Tuning [8] on CIFAR-100, Oxford flowers.
  • BOFT [4] and AdaLoRA [7] outperform SSF [9] on VTAB-1K.
  • “Adapters Strike Back” [3] includes comparisons with the most recent adapter-based methods, outperforming [9, 12, 13] on VTAB.
  • Mona [11] outperforms LoRand [10] and other adapter methods [12] on COCO and other benchmarks
  • HydraLoRA was published past the July 1 2024 date that ICLR considers concurrent work, and so we omit this.

Our expanded results in Table 1 demonstrate that ExPLoRA unsupervised pre-training (before fine-tuning) outperforms all modern PEFT methods that directly adapt MAE/DinoV2 pre-trained weights.

Further, we would like to emphasize that our work is fully compatible with any SoTA PEFT method, post extended-pretraining. One of ExPLoRA’s advantages is that it doesn’t change the architecture of the ViT, which allows us to plug the final unsupervised weights into any modern or future PEFT method that operates on ViTs. We demonstrate value in the extended unsupervised pre-training phase, which creates new foundation models cheaply to bridge difficult domain gaps.

Q: Results on detection and segmentation tasks
We include results for remote sensing image segmentation (Table 6) and agricultural image detection (Table 9), demonstrating parity or SoTA performance. Please let us know if there are specific datasets representing significant domain shifts that you'd like us to evaluate.

Comment

Dear Reviewer ANFL,

Thank you for your thoughtful review of our paper. As we approach the end of the discussion period, we would greatly appreciate if you could review our rebuttal responses and let us know if we have adequately addressed your concerns.

If you feel that our clarifications and additional experimental results have resolved your initial concerns, we kindly request that you consider updating your evaluation accordingly. Your final assessment is valuable to us and to the broader review process.

We understand you are likely managing many responsibilities, and we truly appreciate your time and attention throughout this process.

Best regards,
Authors

Comment

We thank the reviewers for their constructive feedback. We're pleased our work is recognized for introducing an innovative perspective on parameter-efficient unsupervised pre-training (ANFL, 7P1n, 8d6C), demonstrating strong and comprehensive results across multiple domains while using fewer parameters (ANFL, 7P1n, 8d6C, 47Sp, ZQZA), and conducting insightful ablations (8d6C, ZQZA). We appreciate ANFL's and ZQZA's acknowledgments of our analysis of encoded information and contributions toward efficient foundation model training.

Our main contribution is demonstrating parameter-efficient unsupervised pre-training for domain adaptation, challenging the paradigm of from-scratch pre-training. Key strengths include:

  • Outperforming domain-specific full pre-training with <5-10% of ViT parameters (Tables 1, 2, 4, 5, 6, 11, 12)
  • Successfully adapting to diverse domains including satellite, medical, and wildlife images (section 6)
  • Providing systematic ablations (section 6.1.2) and interpretability analyses (section 6.3)

Our revised pdf has important changes marked in red. We summarize and address reviewer concerns in three key areas:

Technical Contributions of ExPLoRA

  • Novel parameter-efficient unsupervised pre-training technique that creates specialized foundation models for new domains, supporting both MAE, DinoV2 objectives. Unlike traditional PEFT, these models enable linear probing, feature extraction, and generalization to downstream tasks beyond supervised fine-tuning
  • Demonstration that combining selective block unfreezing with LoRA significantly improves efficiency and performance over either approach alone
  • Extensive validation of SoTA performance on challenging benchmarks (fMoW-{RGB, temporal, Sentinel}, EuroSAT, Resisc-45, SpaceNet), including successful adaptation to multi-spectral and temporal satellite imagery despite significant domain shifts from RGB. Importantly, ExPLoRA outperforms prior SoTA methods that were fully pre-trained from scratch (SatMAE, ScaleMAE etc.)
  • Detailed analysis of intermediate representations through spectral analysis and linear probing of patch embeddings across ViT layers, providing clear guidelines for block selection during pre-training

Comparisons with Recent Methods

We've expanded comparisons with:

  • Recent continual pre-training methods (GFM [1], GDA [2]) for remote sensing, outperforming them by >6% in Table 1 and 3% in Table 4. We describe key differences between ExPLoRA and these works in section appendix A.1 and in our reply to reviewer ZQZA.
  • State-of-the-art PEFT techniques including BOFT [3], Gated VPT [4], AdaLoRA [5], SAVP[6], and newly added Adapters Strike Back [7] and Mona [8,9]. These PEFT techniques do not surpass ExPLoRA's extended pre-training.
  • Please also see our reply to reviewer 47Sp for ExPLoRA's compatibility as an initialization for UDA methods.

We have expanded our references to include other cited works suggested by reviewers. We note that many are either already outperformed or superseded by prior works we have included experiments from, e.g.:

  • SA²VP [6] outperforms Pro-Tuning [10] on CIFAR-100 and Oxford Flowers.
  • BOFT [3] and AdaLoRA [5] outperform SSF [11] on VTAB-1K.
  • “Adapters Strike Back” [7] includes comparisons with the most recent adapter-based methods, outperforming [11, 12, 13] on VTAB.
  • Mona [9] outperforms LoRand [8] and other adapter methods [12] on COCO and other benchmarks.

Importantly, ExPLoRA remains compatible with any PEFT method during fine-tuning, since it preserves the ViT architecture. Rather than competing with these methods, ExPLoRA outperforms pre-training from scratch on new domains, which, as we note in Appendix D, is far more expensive and environmentally costly.
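
One way to see this compatibility: the low-rank updates learned during extended pre-training can be folded back into the frozen weights, yielding a standard ViT state dict on which any PEFT method can then be applied. A minimal sketch of such a merge (the function name, shapes, and rank below are our own hypothetical choices, not code from the paper):

```python
import torch


@torch.no_grad()
def merge_lora_weight(w: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
                      alpha: float, r: int) -> torch.Tensor:
    """Fold a low-rank update into a frozen weight: W' = W + (alpha / r) * B @ A.
    Shapes: w is (out, in), lora_b is (out, r), lora_a is (r, in)."""
    return w + (alpha / r) * (lora_b @ lora_a)


# Hypothetical example: a 1024-dim projection with a rank-8 update.
w = torch.randn(1024, 1024)
a, b = torch.randn(8, 1024), torch.zeros(1024, 8)
w_merged = merge_lora_weight(w, a, b, alpha=16.0, r=8)
assert w_merged.shape == w.shape   # weight shapes, and hence the architecture, are unchanged
```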

Value of Extended Pre-training

Our new analysis in Appendix B.2 demonstrates that extended pre-training is crucial:

  • For a fixed parameter budget, extended fine-tuning converges to accuracy ~1% lower than ExPLoRA's, despite training for longer
  • For a fixed compute budget (measured in GPU-hours), increasing the number of fine-tuning parameters still does not reach ExPLoRA's accuracy ceiling (falling short by 0.8-1%)
  • Pre-training time creates a natural performance tradeoff: more GPU-hours of ExPLoRA pre-training improve both convergence speed and final accuracy
  • ExPLoRA, due to unsupervised pre-training, provides unique benefits beyond fine-tuning methods:
    • Works with unlabeled domain data (e.g., unlabeled satellite or medical images)
    • Creates strong feature extractors (7%+ improvement in linear probing SoTA, Table 2)
    • Serves as foundation model initialization for downstream tasks (demonstrated in Tables 6, 11)

We thank the reviewers for their time and hope that they take our response into consideration.

In the following comments, we address reviewer-specific concerns in further detail.

Comment

References:

[1] Towards geospatial foundation models via continual pretraining. ICCV 2023.
[2] Parameter Efficient Self-Supervised Geospatial Domain Adaptation. CVPR 2024.
[3] Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv:2311.06243 (2023).
[4] Improving visual prompt tuning for self-supervised vision transformers. ICML 2023.
[5] AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv:2303.10512 (2023).
[6] SA²VP: Spatially Aligned-and-Adapted Visual Prompt. AAAI 2024.
[7] Adapters Strike Back. CVPR 2024.
[8] 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. CVPR 2023.
[9] 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks. arXiv:2408.08345 (2024).
[10] Pro-tuning: Unified prompt tuning for vision tasks. NeurIPS 2022.
[11] Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning. NeurIPS 2022.
[12] AdaptFormer: Adapting vision transformers for scalable visual recognition. NeurIPS 2022.
[13] Sensitivity-aware visual parameter-efficient tuning. ICCV 2023.

Comment

Dear reviewers,

Thank you very much for your helpful reviews and suggestions which have improved our work considerably. As the rebuttal period comes to an end, please let us know if there are any remaining questions or concerns that we may address.

Comment

Dear Reviewers,

Thank you again for your time and consideration. As we are just a few hours away from the end of the discussion phase, we wanted to offer one final opportunity for feedback on our rebuttal. We are grateful to those who have already responded and raised their evaluations, and we welcome any remaining thoughts or questions that we could address during tomorrow's author-only response period.

If you haven't had a chance to review our responses yet, we would greatly appreciate your feedback. For those whose concerns have been addressed by our rebuttal, we kindly request that you consider updating your evaluation accordingly.

Kind regards,
The Authors

Comment

We thank all reviewers for their constructive feedback, which has significantly improved our paper. Below we summarize key changes and experimental additions made during the discussion phase that address reviewer concerns:

Additional Experimental Results

  1. Comprehensive comparisons with recent methods in Table 1:

    • Outperform recent continual pre-training (GFM) [1] and parameter-efficient adaptation (GDA) [2] by 6% and 7% respectively
    • Surpass modern PEFT methods: Adapter+ [3] by 1.7% and Mona [4,5] by 6.5%
    • Similar improvements on multi-spectral data (link), outperforming GFM and GDA by 2-3% on fMoW-Sentinel
  2. New VisDA2017 experiments (link; to be included in the final revision) demonstrating ExPLoRA's effectiveness on synthetic domain data:

    • ExPLoRA initialization elevates TVT [6] to SOTA performance (88.5% mean accuracy)
    • Improves TVT's original results by 2-4% and DinoV2 initialization by 6%
    • Makes TVT competitive with SOTA methods like SSRT [7] and CD-Trans [8]

Analysis of Extended Pre-training
In new section B.2, we analyze two key questions about extended pre-training: (1) given a fixed parameter budget, does equivalent supervised fine-tuning match ExPLoRA + fine-tuning? and (2) given a fixed compute budget (measured in GPU-hours), can supervised fine-tuning alone achieve similar performance? Our experiments show that extended pre-training is crucial: ExPLoRA followed by LoRA-r8 fine-tuning outperforms direct fine-tuning with an unfrozen block + LoRA by ≥0.9% in top-1 accuracy while using fewer parameters. Increasing the parameter budget by unfreezing more blocks during fine-tuning does not close this gap. Moreover, we observe that increasing the number of pre-training iterations improves initial fine-tuning accuracy, though beyond 100k-150k iterations the gains in final accuracy plateau, demonstrating ExPLoRA's computational efficiency.
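
As a rough sanity check on the parameter budgets discussed above, the back-of-the-envelope estimate below computes the trainable fraction for a ViT-L backbone with one unfrozen block plus rank-64 LoRA on the query and value projections. The rank, the single unfrozen block, and the rounded ViT-L sizes are illustrative assumptions, so the numbers are indicative only; under these assumptions the trainable fraction comes out to roughly 6%, in line with the <5-10% of ViT parameters quoted earlier in this thread.

```python
# Back-of-the-envelope trainable-parameter estimate (illustrative assumptions only).
hidden, mlp_ratio, depth = 1024, 4, 24            # ViT-L dimensions
total = 304e6                                     # ~304M parameters in ViT-L (rounded)

# One fully unfrozen transformer block: qkv + attention output proj + 2-layer MLP (weights only).
block = hidden * (3 * hidden) + hidden * hidden + 2 * hidden * (mlp_ratio * hidden)

# Rank-r LoRA on the q and v projections of the remaining frozen blocks.
r = 64
lora = (depth - 1) * 2 * (2 * hidden * r)         # 2 projections, each with A (r x d) and B (d x r)

trainable = block + lora
print(f"unfrozen block ≈ {block / 1e6:.1f}M, LoRA ≈ {lora / 1e6:.1f}M, "
      f"total ≈ {trainable / 1e6:.1f}M ({100 * trainable / total:.1f}% of ViT-L)")
# => roughly 12.6M + 6.0M ≈ 18.6M trainable parameters, about 6% of the backbone.
```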

Beyond supervised performance, extended pre-training provides unique benefits: it enables learning from large unlabeled domain datasets, creates strong feature extractors (demonstrated by 7%+ improvement in linear probing, Table 2), and produces weights that serve as effective initializations for other downstream tasks within the domain (as shown by SOTA results on Resisc-45 and EuroSAT using the same pre-trained weights).
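
The linear-probing protocol referenced here is simple enough to spell out: a single linear classifier is trained on frozen features from the extended-pre-trained backbone. A minimal sketch, where `backbone`, the feature dimension, and `num_classes` are hypothetical placeholders rather than the paper's exact setup:

```python
import torch
import torch.nn as nn


def linear_probe_step(backbone: nn.Module, probe: nn.Linear,
                      images: torch.Tensor, labels: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One linear-probing step: the backbone stays frozen; only the probe is updated."""
    backbone.eval()
    with torch.no_grad():                  # no gradients through the frozen backbone
        feats = backbone(images)           # e.g., a (batch, dim) [CLS] or pooled embedding
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical usage: probe = nn.Linear(1024, num_classes)
#                     optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
```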

Clarified Related Work
The new Appendix A.1 contextualizes ExPLoRA against GFM and GDA. Unlike GFM, ExPLoRA is parameter-efficient, using <10% of ViT parameters. Unlike GDA, it preserves the ViT architecture, allowing flexible use of PEFT methods during fine-tuning. ExPLoRA also supports non-MAE objectives (e.g., DinoV2) and provides a principled analysis for block selection.

In appendix A.2, we clarify how ExPLoRA differs from UDA: while UDA requires labeled source domain data, ExPLoRA enables unsupervised adaptation using only pre-trained weights, without label set restrictions between domains. This positions ExPLoRA as complementary to UDA methods rather than competitive.

We have also updated the related work section 2 with recent continual pre-training methods, added comparisons with modern PEFT techniques, and included relevant UDA literature.


We are very grateful for all of your feedback and your time in reviewing our work. The additions we have made to address your concerns demonstrate ExPLoRA's effectiveness across diverse domains while clarifying its positioning relative to existing methods. The new experiments particularly highlight ExPLoRA's strong performance on challenging domain shifts and its complementarity with existing PEFT/adaptation techniques.


References:
[1] Towards geospatial foundation models via continual pretraining. ICCV 2023.
[2] Parameter Efficient Self-Supervised Geospatial Domain Adaptation. CVPR 2024.
[3] Adapters Strike Back. CVPR 2024.
[4] 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks. arXiv:2408.08345 (2024).
[5] 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. CVPR 2023.
[6] TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation. WACV 2023.
[7] Safe Self-Refinement for Transformer-based Domain Adaptation. CVPR 2022.
[8] CDTRANS: Cross-Domain Transformer For Unsupervised Domain Adaptation. ICLR 2022.

AC Meta-Review

This work introduces ExPLoRA, a method that initializes a Vision Transformer (ViT) with pre-trained weights, selectively unfreezes one to two blocks, fine-tunes the remaining weights using LoRA, and continues unsupervised pre-training on a new domain. Subsequently, the model undergoes supervised fine-tuning for the target domain.

Five experienced reviewers provided a mixed assessment of this submission. Before the rebuttal, four reviewers gave negative reviews. After the rebuttal, three reviewers raised their scores and indicated that their concerns had been resolved. The main concerns were limited novelty and unconvincing experimental results. However, one reviewer still considered the novelty of this work limited and kept a negative score with high confidence.

Thus, the Area Chair (AC) has carefully reviewed the process, including the initial reviews, rebuttal, and discussions between reviewers and authors, as well as the revised submission. The AC agrees with the concerns raised by Reviewer ZQZA and Reviewer 47Sp regarding the limited novelty and narrow application scope. Although the proposed method is simple yet effective, the current draft does not offer sufficient insight, which makes the submission read like a technical report.

For a future submission, the authors are encouraged to either narrow the focus of ExPLoRA or explore a broader range of experimental settings. Additionally, the authors should provide a clearer differentiation between ExPLoRA and LoRA beyond merely incorporating LoRA.

Additional Comments on Reviewer Discussion

No

Final Decision

Reject