Learning De-Biased Representations for Remote-Sensing Imagery
We propose debLoRA, which adapts foundation models to data-scarce remote sensing domains with long-tailed distributions by efficiently augmenting minor-class features with major-class features in an unsupervised manner, mitigating representation bias.
Reviews and Discussion
In remote sensing, many datasets have a long-tail problem. As a result, models trained on these datasets often perform much worse on the tail classes than on the head classes. The authors propose a fine-tuning strategy called debiased LoRA (debLoRA) that addresses long-tail distribution problems. Their proposed approach has three primary steps. First, they cluster the "biased" features from a pre-trained model; each cluster represents some visual attribute. Second, they compute a de-biased cluster center for each tail class and move the respective class embeddings closer to the de-biased center. Finally, they train a LoRA module to predict the de-biased features from the biased features. These de-biased features can be used in downstream tasks to achieve better performance on the tail classes.
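For concreteness, the three steps can be sketched as follows. This is a hypothetical reconstruction: the function name, the similarity-based weighting of cluster centers, and the fixed calibration strength are our own illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the three debLoRA steps (illustrative assumptions only).
import numpy as np
from sklearn.cluster import KMeans

def debias_features(feats, labels, tail_classes, n_clusters=32, alpha=0.5):
    # Step 1: cluster the biased features of all classes; each cluster
    # is assumed to capture a shared visual attribute.
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(feats).cluster_centers_

    calibrated = feats.copy()
    for c in tail_classes:
        mask = labels == c
        class_mean = feats[mask].mean(axis=0)
        # Step 2: form a de-biased center as a weighted average of all
        # cluster centers (weights from similarity to the tail class),
        # then pull the tail-class features toward it.
        sim = centers @ class_mean
        w = np.exp(sim - sim.max())
        w /= w.sum()
        debiased_center = w @ centers
        calibrated[mask] = (1 - alpha) * feats[mask] + alpha * debiased_center
    # Step 3 (not shown): train a LoRA module to map the biased features
    # to these calibrated, de-biased features for downstream use.
    return calibrated
```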
Strengths
- The paper is very well written.
- The visualizations are very helpful in understanding the intuition behind the proposed approach.
- Long-tail distribution problems occur in many other fields outside of RS such as species distribution modeling. While the approach is proposed for RS imagery, it seems to be a general approach that can be used for any dataset with this problem.
- The results show that the approach does not penalize the head and middle classes, and even improves their performance in some scenarios.
Weaknesses
- There were no ablation studies to show how sensitive the model is to some hyperparameters. For example, an ablation on the choice of K would be helpful to understand how robust the model is to the selected number of clusters.
- The experiments are conducted on a limited number of datasets. This makes it hard to be confident that debLoRA will generalize to other datasets with long-tail problems.
Questions
Please address the points that have been raised in the weakness section. My concerns are mostly related to the lack of experiments and ablations that would have made it easier to understand the efficacy and robustness of the proposed approach.
Limitations
The authors have done a good job of presenting the limitations of their work.
We sincerely appreciate your positive feedback on our paper's clarity, effectiveness, and novelty! Thank you for these encouraging comments. We are also grateful for your insightful suggestions regarding ablation studies and the generalization of our method. We have addressed each point in detail below and will incorporate these additional experimental results in our revised paper. Please let us know if you believe any further investigation is necessary!
Weakness 1: Sensitivity to cluster number
Can't agree more! We conducted an ablation study on the number of clusters (K) in Table R6 in the attached PDF. We start from K=16, slightly above DOTA's 15 categories, to allow for potential subclusters. We can observe:
- performance generally improves as K increases, with the most significant gains observed for tail classes. For instance, when K increases from 16 to 32, the F1 score for tail classes improves by 4.7%; and
- the performance peaks around K=32, suggesting a good default value for our method.
Weakness 2: Generalization to more datasets
We appreciate your feedback! To address this generalization concern, we conducted experiments on two natural image domain datasets (Places365-LT and iNaturalist 2018) and one multi-spectral RS dataset (fMoW-S2). We chose these datasets based on their unique properties:
- Places365-LT exhibits a substantial domain gap from SD's pre-training data, making it suitable for evaluating a domain adaptation model.
- iNaturalist 2018 has a high imbalance ratio of 500 and can evaluate the models under severe imbalance.
- fMoW-S2 contains multi-spectral data, including visible, near-infrared, and shortwave infrared bands, which complements our existing experiments on optical (i.e., DOTA dataset) and SAR (i.e., FUSRS dataset) imagery.
We conducted adaptation experiments following the same setup used for Table 2 (in the main paper). On Places365-LT and iNaturalist 2018, debLoRA consistently outperforms LoRA, especially for tail classes (gains of 4.3% and 7.2%). On fMoW-S2, debLoRA achieves the best overall (46.8%) and tail class (41.2%) accuracies, surpassing ResLT by 0.3% and 2.6%. Kindly refer to our response to reviewer bDAb (weakness 2 & 8 & 10) for more detailed analysis.
I thank the authors for addressing my concerns.
- The ablation of K is helpful. However, I think it would have been very informative to do this on multiple datasets. The results would tell us whether there exists some value of K that is good for any/all tasks. Currently, it is not clear whether K=32, which is optimal for DOTA, would serve as a good number for fMoW or Places365-LT. I think such information is useful because it tells us how much work a user needs to do to find the correct hyperparameters if they want to use the model.
- The results provided in the rebuttal suggest that the model generalizes to tasks outside of RS, increasing its utility.
The authors have, to a large extent, addressed my concerns. After reading the reviews from other reviewers and the respective rebuttals, I have decided to stick with my initial review score.
Thank you for your insightful feedback! We agree that ablations of K across multiple datasets would be valuable. We commit to including these results in our updated version. We appreciate your recognition of our model's broader applicability and remain open to further questions!
The authors propose a framework called debLoRA for adapting remote sensing foundation models. This approach aims to learn a de-biased feature representation which improves classification/detection performance on rare classes. Performance is assessed on transfer to two different RS datasets.
Strengths
Remote sensing is a unique domain that is often overlooked. Developing models that can be used broadly to solve a number of tasks, while making use of the huge amount of available data, is important for the community.
Weaknesses
The writing and sentence structure could be substantially improved for clarity and readability. In general, reducing complex clauses would improve readability. The opening sentence of the abstract, for example, is cumbersome to read. It is also advised to avoid beginning sentences (or opening clauses) with "it" as that can lead to ambiguity. Other clauses within the text are modifying the wrong nouns.
The abstract does not clearly set the scope for the work. It's clear there is a focus on remote sensing, but parts of the abstract suggest the method would apply broadly, beyond remote sensing. Making it clear whether this approach is truly general or specifically geared toward remote sensing is important.
The "why it works" section should be much more rigorous in nature with the mathematics to justify the approach, not just the textual description.
The authors cite several contrastive methods, then exclude them because of a "lack of high-resolution samples". However, these methods make no requirement on the resolution needed. Additionally, there are newer methods than those cited which address some of these concerns.
Section 4 does not read well. I would suggest rewriting this so the mathematical formalism is clearer and more succinct.
Results would be stronger if they were run multiple times with different initializations to generate error bars. The results aren't significantly better (in absolute terms) in many cases, so it's not obvious that they are statistically significantly better.
This approach should be compared to other fundamentally different (but commonly used) approaches, such as contrastive learning methods.
There are a huge number of remote sensing datasets out there (both general as well as task/domain specific), so I think exploring more than just two is needed.
Additional analysis of the learned features should be performed. A big focus of the introduction was on biased features. The authors have tried to show that measuring classification/detection accuracy on the head/mid/tail of the distribution captures this, but the actual separation of the features also needs to be explored and quantified.
Since this was stated as a broadly general method, not just related to remote sensing, it should be tested on non-remote sensing datasets as well.
How performance changes as a function of dataset size should be explored.
Questions
- Line 45: the "which" clause is modifying the wrong noun.
- Line 46: "what" should be "which", and the sentence should end with a question mark.
- Line 170 is awkward.
The explanation around biased features at the bottom of p2 could be clearer.
Limitations
Limitations are not discussed. There are no significant societal concerns.
Weakness 1 & Question 1: Lack of clarity
We sincerely appreciate the comment! In the revision, we will perform careful proofreading, e.g., 1) simplifying complex clauses; 2) avoiding sentences that begin with an ambiguous "It"; and 3) fixing the issues noted by the reviewer: correcting modifier placement, replacing "what" with "which", and rewriting unclear sentences.
Weakness 2 & 10: "not clearly set the scope for the work" and "tested on non-remote sensing datasets"
Thank you! Our original experiments focus on Remote Sensing (RS) because: 1) RS images present significant domain gaps from natural images, offering a more challenging test case (as detailed in Lines 22-26); and 2) the RS domain suffers more severely from data scarcity and class imbalance (as clarified in Lines 36-43).
We supplement the experiments with two natural image datasets, Places365-LT and iNaturalist 2018, both widely adopted in long-tailed learning [66]. 1) Places365-LT exhibits a substantial domain gap from SD's pre-training data, making it suitable for evaluating a domain adaptation model. 2) iNaturalist 2018 has a high imbalance ratio of 500 and can evaluate the models under severe imbalance. Results are shown in Table R2. Our debLoRA consistently outperforms the second-best method, with tail-class improvements of 4.3% and 7.2% on Places365-LT and iNaturalist 2018, respectively. We will clarify our research scope in the revision.
Weakness 3: "why it works"
We apologize for this misleading sentence. The paragraph led by "why it works" describes our experimental observations rather than a rigorous justification of why the method works. We will remove this sentence to avoid confusion.
Weakness 5: Rewrite Section 4
We'll make the following improvements to Section 4:
- Emphasize the input/output of our 5-step method:
- Feature extraction: raw images → biased features
- Clustering: biased features → cluster centers
- De-biased center calculation: cluster centers → de-biased centers
- Feature calibration: biased features, de-biased centers → calibrated features
- debLoRA training: original & calibrated features → debLoRA module (see the sketch after this list)
- Simplify the formulas in Equations 1 and 3.
- Clarify key sentences:
- Line 208: "We calibrate each tail class representation by moving it closer to the de-biased center."
- Line 217: "For tail classes with a larger imbalance ratio, a higher calibration weight moves features closer to the de-biased center."
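To make step 5 of the pipeline concrete, here is a minimal, hypothetical sketch of training a low-rank adapter to map original features to the calibrated ones; the rank, feature dimension (768), regression loss, and optimizer below are our illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # Low-rank residual update: W x + B A x
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(768, 768))
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-4)
biased = torch.randn(64, 768)      # stand-in for original (biased) features
calibrated = torch.randn(64, 768)  # stand-in for calibrated targets
loss = nn.functional.mse_loss(layer(biased), calibrated)
loss.backward()
opt.step()
```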
Question 2: Clarification of "biased feature space"
We provide a more precise definition of "biased feature space" here. Let F_h and F_t be the feature spaces of head and tail classes, respectively. We define the feature spaces as biased if Vol(F_h) ≫ Vol(F_t) and P(head) ≫ P(tail), where Vol(·) denotes feature space volume and P(·) denotes the probability predicted by the model.
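The "feature space volume" Vol(·) is not operationalized in the thread; one hypothetical proxy is the total variance of a class's features (the trace of its covariance matrix). The function below is our assumption for illustration, not the authors' definition:

```python
import numpy as np

def class_volume(feats, labels, c):
    # Hypothetical proxy for feature-space volume: total variance
    # (trace of the covariance matrix) of class c's features.
    f = feats[labels == c]  # (N, D), assumes N >= 2
    return float(np.trace(np.cov(f, rowvar=False)))

# Under this proxy, a biased space would satisfy
# class_volume(feats, labels, head_class) >> class_volume(feats, labels, tail_class).
```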
Weakness 4 & 7: Compare with contrastive methods
Thank you! Our work focuses on adapting pre-trained foundation models to solve RS domain problems. These foundation models have consistently outperformed conventional contrastive methods. For instance, SatMAE [7] achieves 63.84% accuracy on fMoW-S2, outperforming SeCo (51.65%) and GASSL (50.69%).
While these results demonstrate the superiority of adapting foundation models, we agree that a direct comparison would provide a more comprehensive view. However, due to private data [7,43] and GPU constraints, we cannot train contrastive models at a data scale comparable to that of the foundation models. We will nevertheless try our best to conduct more experiments during the discussion period if required.
Weakness 6: Statistical error bars
Agree! We conducted three runs of the SD → DOTA experiment with random initializations. The results are shown in Table R3.
Regarding "the results aren't significantly better (in absolute terms)", we gracefully disagree. Our debLoRA consistently outperforms the second-best methods by notable margins on tail classes, e.g., 2.5% over SADE in Table 2 and 2.4% over ECM in Table 4. These improvements are regarded as substantial. For reference, Table 3 in [R1] considers an improvement of 1.59% on DOTA as significant.
Weakness 8: More Remote Sensing datasets
Glad to do so! We conducted experiments on the fMoW-S2 dataset [7] because: 1) it contains 13 bands of multi-spectral data, including near-infrared bands, which complement our existing experiments; 2) it exhibits severe class imbalance (imbalance ratio of 130.8); and 3) it provides a substantial validation set (84,966 samples). We performed experiments on SD → fMoW-S2. Results are given in Table R4. Our debLoRA achieves the highest overall accuracy (46.8%) and tail class accuracy (41.2%), surpassing the second-best method ResLT by 0.3% and 2.6%, respectively.
Weakness 9: Quantified feature analysis
We appreciate your feedback! We provided an actual feature distribution in Figure 3. To further address your concern, we present quantitative experiments on inter-class and intra-class distances in Table R5. Two key findings: 1) debLoRA achieves higher inter-class distances for both head and tail classes, indicating improved feature separability. 2) debLoRA maintains lower and more consistent intra-class distances among tail classes, suggesting more compact tail features.
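For reference, one straightforward way to compute such metrics is sketched below. This is our reading of Table R5's setup, not the authors' exact protocol; it assumes each class has at least two samples:

```python
import numpy as np

def cosine_dist(a, b):
    # Pairwise cosine distance between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

def separability(feats, labels):
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Inter-class: mean cosine distance between distinct class means.
    d = cosine_dist(means, means)
    inter = d[np.triu_indices(len(classes), k=1)].mean()
    # Intra-class: mean pairwise distance within each class, averaged.
    intra = []
    for c in classes:
        f = feats[labels == c]
        d = cosine_dist(f, f)
        intra.append(d[np.triu_indices(len(f), k=1)].mean())
    return inter, float(np.mean(intra))
```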
Weakness 11: Ablation of dataset size
Thank you! The ablation study using 5% and 50% of the DOTA dataset is ongoing. We will report the results during the discussion period.
Reference
[R1] Y. Pu et al., "Adaptive rotated convolution for rotated object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
This work highlights the long-tail problem in transferring existing foundation models to RS domains, and provides an interesting pipeline consisting of clustering, calibration, and training.
- A comprehensive summary of transfer from natural images and between RS domains is provided.
- Representation de-biasing is proposed to search for de-biased feature centers for tail classes.
- Impressive results on a set of downstream tasks.
Strengths
- This work is well-written. The highlighted issues are well presented with support from experimental results and analysis on RS datasets, and are attributed to the tail problem, enhancing the motivation of this work.
- A historical summary of related work on the long-tail problem and transfer learning in remote sensing, and brief preliminaries on the LoRA series.
- The ideas of de-biased cluster centers and feature calibration seem effective from the experimental results.
Weaknesses
- My major concern with this work is the weighted averaging over all cluster centers for each tail class. In such a case, the de-biased centers all lie in the linear span of the same cluster centers; will such de-biasing be sufficient to distinguish different classes, or will the center for a tail class end up near those of other classes?
Questions
- Fig. 2(a) is a bit confusing to me. The region in blue is for tail samples; are the dashed blue triangles wrongly placed tail samples? And why is the center in the blue region when there are no supporting head samples there?
- Please provide a fully supervised FCOS baseline for comparison.
- May the results from [7, 15] be provided?
Limitations
Detailed discussion of limitations is provided in the Appendix.
We greatly appreciate your thorough review and insightful comments! Your positive remarks on our paper's structure, historical summary, and method effectiveness are encouraging. We value your critical questions and have addressed them in detail below. Thank you for helping us improve our research.
Weakness 1: Regarding linear space
Thanks for the insightful comments. We address this question from three perspectives:
- Image features extracted from large-scale pre-trained foundation models already exhibit good linear separability. Recent literature [R1,R2,43] shows that applying simple linear classifiers (e.g., a linear probing head or a k-NN classifier head) to foundation models (DINO, CLIP, and SD) achieves impressive performance on discrimination tasks.
- Our de-biasing method has indeed improved the feature discrimination between classes. To validate this, we report a quantitative analysis of inter-class and intra-class distances in Table R5 in the attached PDF. Results show our debLoRA:
  - enlarged the inter-class distance between tail and head classes: the average cosine distance increased from 0.702 to 0.719, indicating improved separation;
  - reduced the intra-class distance for tail classes: the average cosine distance decreased from 0.182 to 0.146, suggesting tighter clustering of tail samples;
  - increased the inter-class distance among tail classes: the average cosine distance rose from 0.607 to 0.632, demonstrating better separation among different tail classes.
- We acknowledge the potential benefits of exploring non-linear de-biased centers. A simple way to achieve a non-linear transformation of the class centers is to apply a few non-linear layers (e.g., an MLP) to them, i.e., taking the centers as input to a learnable MLP (see the sketch below). These layers can be trained either by meta-learning or jointly with our debLoRA parameters; the former requires building an additional validation set. Both approaches cost time in model design and training. We are still working on them and will report any results on OpenReview during the discussion period. We are also happy to try other methods of introducing non-linearity to the de-biased centers if we receive suggestions from the reviewers during the discussion.
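A sketch of this idea follows; it is hypothetical, and the dimensions and joint-training setup are our assumptions rather than a finalized design:

```python
import torch.nn as nn

# A small MLP that non-linearly refines each linear de-biased center.
# It would be optimized jointly with the debLoRA parameters, avoiding the
# extra validation set that a meta-learning formulation would require.
center_mlp = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 768),
)

# debiased_center = center_mlp(linear_center)  # replaces the purely linear
#                                              # weighted average of centers
```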
Question 1: Clarity of Figure 2(a) caption
We apologize for any confusion in Figure 2(a). We want to highlight that:
- The blue star represents the center of the tail training samples (shown in blue), not the center of all samples.
- The dashed blue triangles represent tail validation samples misclassified as head classes; they demonstrate the model's bias towards head classes.
We will refine the caption of Figure 2 to avoid any confusion. Please refer to the caption of Figure R1 in the attached PDF for the updated caption.
Question 2: Fully supervised FCOS baseline
Thank you for the suggestion! Please note that all methods (except Zero-Shot) are trained in a fully supervised manner on the DOTA dataset with bounding boxes as labels. If we understand the question correctly, the reviewer wants to check the results without any transfer learning, i.e., training a model from scratch as in the original FCOS paper [50]. Table R1 in the attached PDF provides such results with two backbone networks: ResNet-101 (used in the original paper) and SD U-Net (used in our submission). SD U-Net is included for a fair comparison in terms of the number of network parameters.
We can observe from this table that From-Scratch underperforms the transfer learning approaches (Fine-Tune, LoRA, and debLoRA), especially on tail classes, and shows a larger performance gap between head and tail classes. For example, comparing rows 2 and 4, From-Scratch is lower than the basic Fine-Tune method by 0.2, 0.7, and 1.5 percentage points of mAP for head, middle, and tail classes, respectively. Moreover, the head-tail gap for From-Scratch (row 2) is as high as 13.3 percentage points, while for Fine-Tune, LoRA, and debLoRA the gaps are 12, 11, and 6.2 percentage points, respectively (lower is better). This occurs because transfer learning can leverage the robust representations of a pre-trained model (learned on large-scale datasets), while the From-Scratch model is trained only on the long-tailed training data of a small dataset ("small" being relative to large-scale pre-training datasets such as LAION-5B).
Question 3: Results from [7, 15]
First, for SatMAE [7], we have reported its results in Table 4 of the main paper, in the "SatMAE → FUSRS" columns.
Second, SkySense [15] results are not directly comparable to ours because they did not benchmark their method in long-tailed settings. Besides, we cannot implement their method in our setting because their pre-trained model is not open-sourced yet. If their model becomes available during the revision period of NeurIPS'24, we will gladly include it in our final paper.
As requested by other reviewers, we evaluated our method on more general long-tailed datasets (see Tables R2 and R4). On Places365-LT and iNaturalist 2018, debLoRA consistently outperforms LoRA, especially for tail classes (gains of 4.3% and 7.2%). On fMoW-S2, debLoRA achieves the best overall (46.8%) and tail class (41.2%) accuracy, surpassing ResLT by 0.3% and 2.6%. Kindly refer to our response to reviewer bDAb (weakness 2 & 8 & 10) for detailed analysis.
References
[R1] M. Raghu et al., "Do vision transformers see like convolutional neural networks?," Advances in neural information processing systems, vol. 34, pp. 12116–12128, 2021.
[R2] M. Caron et al., "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
The authors managed to address my concerns in Q1-3. Considering the concerns from other reviewers and the presented rebuttals, I am willing to maintain my original judgment, i.e., Weak Accept.
We appreciate your constructive feedback and commit to exploring non-linear approaches in our future work! Thank you for your thorough review and for maintaining a 'weak accept' score. If you need any further clarification, we're happy to provide additional details!
We sincerely thank all reviewers for their valuable feedback and constructive comments. We are pleased that the reviewers have recognized our work as:
- Well-written (Reviewer r61t) and very well written (Reviewer meTz)
- Technically sound (Reviewer r61t and meTz)
- Impressive results on multiple tasks (Reviewer r61t) and good performance (Reviewer meTz)
- Practical solution in remote sensing (Reviewers r61t and meTz)
- Generalizable beyond remote sensing (Reviewer meTz)
We have carefully addressed each reviewer's comments and questions in their respective response sections below. For reference, we have included additional tables and figures in the attached PDF to support our responses.
This paper received divergent reviews. Post rebuttal, the reviewer who voted strong reject did not respond. The AC checked the paper, the rebuttal and the reviews, and felt that all the concerns raised had been addressed, with the exception of the comparison to contrastive learning/self-supervised learning approaches. The AC recommends acceptance, and encourages authors to include these comparisons as well as the results in the rebuttal pdf in the final camera ready/appendix.