DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection
We propose a novel Domain-Aware Adapter (DA-Ada) to exploit domain-invariant and domain-specific knowledge with the vision-language models for domain adaptive object detection.
Abstract
Reviews and Discussion
The paper presents a Domain-Aware Adapter (DA-Ada) for DAOD based on VLMs, which aims to improve model performance when applied to an unlabeled target domain. DA-Ada incorporates two types of adapters: a Domain-Invariant Adapter (DIA) to learn domain-invariant knowledge and a Domain-Specific Adapter (DSA) for injecting domain-specific knowledge. It also includes a Visual-guided Textual Adapter (VTA) to encode the cross-domain visual feature into the textual encoder to enhance detection. Experiments across various tasks show that DA-Ada significantly outperforms existing methods, demonstrating its effectiveness in eliminating the domain gap.
Strengths
- The proposed method can significantly improve DAOD performance, surpassing state-of-the-art works by a large margin.
- The general idea of using an adapter with VLMs to assist DAOD is moderately interesting.
- The method is simple yet effective.
Weaknesses
- Some claims are confusing and lack sufficient justifications.
- Some designs are similar to existing works and lack sufficient technical contributions.
- The method is only deployed on Faster RCNN. The adaptability to more advanced detectors is unknown.
- The paper writing and organization need to improve.
Details are in Questions.
Questions
- In Line 139, are there any theoretical proofs or experimental justifications to support the claim that low-dimensional features have less information redundancy and are more suitable for domain adaptation? How can you ensure that the low-dimensional representation is domain-invariant? I don't think that a feature with less information redundancy is necessarily domain-invariant.
- For the visual-guided textual adapter, this design seems to follow CoCoOp [1], a generic technique for improving model learning. The difference needs to be clarified.
- I am confused about the domain-specific adapter. The authors use the residual parts of the visual features to represent the domain-specific information. Is there any proof or literature that can justify the correctness of this assumption?
- Can this method be applied to more advanced detectors, such as DETR and CenterNet v2?
- The paper writing and organization need to improve. The authors propose DITA and DSTA but don't provide any information about them in the abstract. Additionally, the method descriptions and Fig. 2/3 are not well matched, leading to reading difficulties. For example, in Line 184, where are DITA and DSTA in Fig. 2(b)?
[1] Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16816-16825).
Considering the significant performance gain of the proposed method, I'm happy to turn to a positive score if concerns are addressed.
Limitations
No potential negative societal impact.
Comments:
We sincerely thank you for the valuable comments. We are encouraged to see that our work is recognized as moderately interesting and effective. We address your concerns point by point below.
Q1: In Line 139, ...low-dimensional features have less information redundancy and are more suitable for domain adaptation?
A1: Thanks for raising an important point.
In fact, it is the combination of the dimension-reduction and dimension-increase processes with the constraint of the task loss that reduces information redundancy and is more suitable for domain adaptation, rather than the low-dimensional features themselves. In Line 139, the low-dimensional features refer to the output of the down-projection layer, which is an intermediate feature of the adapter's bottleneck structure. The bottleneck structure was first proposed in [2]. It reduces the computational cost and the number of parameters through dimension reduction and increase, efficiently learning feature representations. In this structure, low-dimensional features function as intermediate vectors: when down-projecting the input into low-dimensional features, some redundant information is discarded; when mapping the low-dimensional features back to the original dimension, the task-related features are retained under the constraint of the task loss. In DIA, we first down-project the input features into low-dimensional features h^L and then up-project them to high-dimensional features h^I, optimizing with the adversarial loss and the detection loss. Therefore, by combining dimension reduction and increase with the constraints of the detection and adversarial losses, we enable DIA to extract domain-invariant features while reducing redundant features. Experiments show that the performance peaks at 57.1% when the bottleneck dimension is 1/2 of the input (Line 2 of Table 7 in the paper), indicating that appropriate dimension reduction can filter redundant features while extracting domain-invariant knowledge.
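For illustration, here is a minimal PyTorch-style sketch of such a bottleneck adapter with an adversarial branch; the layer names, dimensions, and the gradient-reversal discriminator are illustrative assumptions rather than the exact implementation in the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer commonly used for adversarial alignment."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class BottleneckAdapter(nn.Module):
    """Down-project, up-project, and expose a domain-adversarial branch."""
    def __init__(self, in_dim=256, ratio=0.5):
        super().__init__()
        mid = int(in_dim * ratio)              # e.g. 1/2 of the input dimension
        self.down = nn.Conv2d(in_dim, mid, 1)  # discard redundant channels
        self.act = nn.ReLU(inplace=True)
        self.up = nn.Conv2d(mid, in_dim, 1)    # recover task-related channels
        self.domain_head = nn.Conv2d(in_dim, 1, 1)  # toy domain discriminator

    def forward(self, x, lam=1.0):
        h_low = self.act(self.down(x))   # low channel-dimensional feature h^L
        h_inv = self.up(h_low)           # domain-invariant feature h^I
        # gradient reversal pushes h^I to fool the domain discriminator
        domain_logits = self.domain_head(GradReverse.apply(h_inv, lam))
        return h_inv, domain_logits

feat = torch.randn(2, 256, 32, 32)
h_inv, d_logits = BottleneckAdapter()(feat)
print(h_inv.shape, d_logits.shape)
```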
Q2: The difference between the Visual-guided textual adapter (VTA) and CoCoOp [1] needs to be clarified.
A2: Thanks for your nice suggestion.
CoCoOp [1] projects image features into input-conditional tokens to lessen sensitivity to class shift.
However, it generates tokens with a meta-net shared across domains, i.e., a domain-agnostic projector.
By ignoring the differences between images from different domains, [1] shows limited ability to distinguish inter-domain commonalities and characteristics, which are essential for DAOD.
In contrast, VTA explicitly injects domain-invariant and domain-specific knowledge into tokens.
It consists of a domain-invariant textual adapter (DITA) and two domain-specific textual adapters (DSTAs).
DITA is shared between domains to encode visual domain-invariant knowledge into prompt tokens, and is specially optimized by an adversarial loss.
The two DSTAs are independent across domains, learning domain-specific knowledge with the detection loss on each domain, respectively.
Moreover, a decoupling loss is also designed to boost the DIA and DSA to learn cross-domain information.
VTA increases the textual encoder's discriminability with cross-domain information and outperforms CoCoOp by 2.3%, 2.2% on C->F, K->C adaptation tasks (Table 8 in the paper).
It shows the superiority of our adapter over CoCoOp’s image-conditional prompt.
Q3: I am confused about...the residual parts... to represent the domain-specific information
A3: Thanks for raising an important point.
[3] has studied utilizing residual parts to disentangle domain-invariant and domain-specific representations. It considers the difference between the input and the domain-invariant features as the domain-specific parts. However, on the one hand, [3] uses the input and output of the whole backbone to compute the difference. Since the output passes through far more convolutions than the input, the two differ greatly in semantic level, extracting inaccurate domain-specific knowledge. On the other hand, [3] only uses domain-invariant features for detection, ignoring the capability of domain-specific knowledge to improve the discrimination of the detector. To solve this, we introduce the novel domain-specific adapter (DSA). First, to ensure semantic-level consistency, the two features used for calculating the difference differ by only 3 convolutions. Second, to ensure that DSA learns domain-specific knowledge, we maximize the distribution discrepancy between DIA and DSA. Moreover, the DSA output is adaptively fused with the domain-invariant features to further enhance discriminability. Experiments show that using residual parts can effectively extract domain-specific knowledge, improving mAP by 1.5% over not using them (Lines 3 and 5 of Table 6 in the paper). And DSA achieves an absolute gain of 4.5%, outperforming the 2.6% of [3]. These results show the effectiveness of the proposed domain-specific adapter.
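To make the residual idea concrete, the sketch below takes the domain-specific feature as the residual left after removing the domain-invariant component, with the two features kept only a few convolutions apart; the module sizes, the 3-convolution stack, and the fusion form are assumptions for illustration, not the exact design in the paper.

```python
import torch
import torch.nn as nn

class DomainSpecificAdapter(nn.Module):
    """Illustrative DSA: residual extraction plus adaptive fusion."""
    def __init__(self, dim=256):
        super().__init__()
        # shallow stack so the two features used for the difference differ
        # by only a few convolutions (semantic-level consistency)
        self.refine = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, block_input, h_inv):
        # residual part: what remains after removing the domain-invariant component
        h_spec = self.refine(block_input) - h_inv
        # adaptive fusion: h^S acts as pixel-level attention on h^I
        fused = h_inv + h_inv * h_spec
        return h_spec, fused

x, h_inv = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
h_spec, fused = DomainSpecificAdapter()(x, h_inv)
print(h_spec.shape, fused.shape)
```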
Q4: Can this method be applied to more advanced detectors?
A4: Thanks for your concern.
We apply the proposed method to DETR. The vanilla DETR baseline achieves 41.6% mAP on C->F with domain discriminators. Equipping it with the domain-aware adapter brings a 5.2% gain, reaching 46.8% mAP. To further verify the effectiveness, we also extend DA-Ada to DAPL [4] for the DA classification task. It increases accuracy from 74.5% to 77.1% on Office-Home, demonstrating its generalization ability.
Q5: DITA and DSTA lack information in the abstract...in Line 184, where are DITA and DSTA in Fig. 2(b)?
A5: Thanks for the suggestion. We will supplement descriptions of DITA and DSTA in the abstract, and revise the descriptions of the figures in the manuscript. For example, "Fig.2(b)" in Line 184 should be revised to "Fig.2(c)".
[1] Conditional prompt learning for vision-language models
[2] Deep residual learning for image recognition
[3] Vector-decomposed disentanglement for domain-invariant object detection
[4] Domain Adaptation via Prompt Learning
Thanks for the clarification. Most of my concerns have been addressed. Therefore, I would like to raise my score.
Thanks for your positive feedback. We really appreciate your precious time and valuable comments.
This paper presents a method to tackle the domain adaptive object detection (DAOD) task within the framework of visual-language models (VLM). The authors propose a Domain-Aware Adapter (DA-Ada) to enhance the visual encoder's ability to learn both domain-invariant and domain-specific features. DA-Ada consists of two components: the Domain-Invariant Adapter (DIA) and the Domain-Specific Adapter (DSA). The DIA learns domain-invariant features by aligning the feature distribution between the source and target domains, while the DSA recovers domain-specific knowledge from the differences between the input and output of the visual encoder blocks. Additionally, the Visual-guided Textual Adapter (VTA) embeds cross-domain information into the textual encoder to improve detection head discriminability. Experiments on various DAOD benchmarks indicate that DA-Ada improves performance compared to state-of-the-art methods.
Strengths
This paper is well-written and easy to follow. The motivation for explicitly learning additional domain-specific features to enhance cross-domain performance is clear and straightforward. The disentanglement of domain-invariant and domain-specific features proposed in the Domain-Specific Adapter, along with the regularization loss in Eq. 15, appears to be effective. The ablation study is comprehensive and detailed, and the experimental results indicate that the proposed model achieves significant performance improvements compared to other state-of-the-art methods across different domain shift scenarios.
Weaknesses
Despite the paper’s strengths, certain aspects of the discussion could be further refined for precision and clarity:
- The performance of the source-only baseline in Table 3 outperforms (or achieves similar performance to) most of the non-VLM-based models in Table 1. For example, on the Cross-Weather domain shift, the source-only model achieves an mAP of 50.4, whereas methods like AT [35], which use ImageNet-pretrained backbones, reach 50.9, and their source-only variants perform roughly >10 mAP lower. While using a VLM vision encoder to provide essential general knowledge is reasonable, it raises the question of whether this task should still be categorized as cross-domain (domain adaptive) learning or as transfer learning (or VLM-based domain adaptation). Additionally, it would be helpful to add a column in Tables 1 and 2 to indicate the pretraining.
- It would also be beneficial if the authors could provide comparisons of computational speed.
Questions
- What are the regression losses used in DA-Ada? Are they the same as those used in [28]?
- The symbols of down-projectors in Eq.6 and Section 6.7 are different. Could you clarify this discrepancy? Additionally, for the down-up projections, which operate on both the channel dimension (Eq.5) and the spatial dimensions, a clearer explanation would be beneficial.
- Since DA-Ada modules are added to different visual encoder blocks, the claim in line 139 that “Low-dimensional features have less information redundancy…” seems inappropriate. In my opinion, “low-dimensional” features typically refer to features from the earlier layers. In Section 2, Eq. 5 seems more related to “channel dimension” condensation. Additionally, in “Bottleneck Dimension” in Section 4.3 (also Table 7), it would be helpful if the authors could directly indicate the input channel dimensions.
- It would be beneficial to add some visualizations for failure cases.
Some minor problems:
- Typos in lines 183-184?
- Legend missing in Figure 3 (the red symbols)
Limitations
Yes, the limitations are discussed in Sec 6.11.
Comment:
We sincerely thank you for your comprehensive comments and constructive advice. We are pleased to see our work being regarded as effective, and the motivation as clear and straightforward. We address your concerns point by point below.
Q1: The performance of the source-only baseline in Table 3 outperforms (or achieves similar performance to) most of the non-VLM-based models in Table 1..., it raises the question of whether this task should still be categorized as cross-domain ...to indicate the pretraining.
A1: Thanks for the interesting question.
Our research focuses on injecting new domain knowledge into pre-trained VLMs while protecting the generalization ability of the pre-trained knowledge. Despite using a strong VLM backbone, our method still achieves significant absolute performance gains. As shown in Table 10 in the paper, our method achieves an 8.0% improvement on the Cross-Weather adaptation task, surpassing the 7.9% of the SOTA method AT [35]. In addition, to properly evaluate the method, we introduce DA-Ada into the weaker non-VLM baseline DSS [59] in Table 17. With DA-Ada, DSS achieves 48.1% mAP, competitive with SOTA methods, and attains a 7.2% improvement, indicating that the proposed DA-Ada performs well even on a weak non-VLM baseline. For ease of understanding, we will also add a description of the pre-training to Tables 1 and 2.
Domain adaptation (DA) aims to transfer source domain knowledge to the target domain. Compared with the backbones used in non-VLM DAOD methods, the VLM used in our method differs only in the pre-training data. Essentially, we design the domain-aware adapter and the visual-guided textual adapter to learn knowledge from the source domain and transfer it to the target domain, which follows the general definition of domain adaptation. Specifically, DA-Ada analyses the source domain data to transfer the general knowledge of the VLM to the downstream task, that is, the target domain. In this case, the presence of the VLM makes the problem resemble a combination of traditional DA and source-free DA. Therefore, it is also appropriate to list our method as a sub-problem of DA, such as VLM-based domain adaptation.
Q2: It would be beneficial if the authors could also provide the comparisons about the computational speed
A2: Thanks for your advice.
We have compared the computational speed of global fine-tuning, the SOTA method DA-Pro [28], and our proposed DA-Ada in Table 18 in the Appendix. We initialize the three methods with the same VLM backbone. By attaching only lightweight adapters, DA-Ada introduces just 0.02s of extra inference time, about 5% of the total, while significantly improving performance. Global fine-tuning has the largest training time overhead yet achieves the lowest performance, indicating the limitations of traditional DAOD methods in optimizing VLMs. Compared with global fine-tuning, DA-Pro significantly reduces the training time overhead while improving performance. Furthermore, DA-Ada improves mAP by 4.9% while using only 6% of the time and 47% of the memory, showing great efficiency in adapting cross-domain information to the VLM. We will also add computational time comparisons for non-VLM methods.
| Method | mAP | Inference time(s)/iter | Training time(s)/iter | Total iter |
|---|---|---|---|---|
| Global Fine-tune | 53.6 | 0.40 | 2.67 | 25000 |
| DA-Pro[28] | 54.6 | 0.40 | 1.47 | 1000 |
| DA-Ada | 58.5 | 0.42 | 1.61 | 2500 |
Q3: What are the regression losses used in DA-Ada? Are they the same as those used in [28]?
A3: Thanks for your question.
The regression loss is the smooth L1 loss, which is the same regression loss used in the DA-Pro[28].
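For reference, a plain form of the smooth L1 loss is sketched below; beta = 1.0 is the common Faster R-CNN default and is an assumption here rather than a value stated in the paper.

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    # 0.5 * x^2 / beta for |x| < beta, |x| - 0.5 * beta otherwise
    diff = (pred - target).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

print(smooth_l1(torch.tensor([0.2, 2.0]), torch.tensor([0.0, 0.0])))  # tensor(0.7600)
```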
Q4: The symbols of down-projectors in Eq.6 and Section 6.7 are different...a clearer explanation would be beneficial.
A4: Thanks for pointing this out.
The symbols in Eq.6 and Section 6.7 refer to the same down-projector. Therefore, we will unify their descriptions and use the same font.
For the down projectors, Eq.5 operates only on the channel dimension. As a multi-scale version, Eq.6 operates on both channel and spatial dimensions. For the up projectors, Eq.7 operates only on the channel dimension. We will provide these details in the manuscript.
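An illustrative contrast between the two cases is sketched below: a channel-only down-projector (an Eq. 5-style 1x1 convolution) versus a multi-scale variant whose branches also act on the spatial dimensions (Eq. 6 style). The kernel sizes and the equal channel split across branches are assumptions for the sketch, not the exact configuration in the paper.

```python
import torch
import torch.nn as nn

in_dim, mid = 256, 128
x = torch.randn(2, in_dim, 32, 32)

# Eq. 5 style: condensation over the channel dimension only
down_channel = nn.Conv2d(in_dim, mid, kernel_size=1)

# Eq. 6 style: the reduced channels are split across branches with
# different receptive fields, so spatial context is also involved
branches = nn.ModuleList([
    nn.Conv2d(in_dim, mid // 4, kernel_size=k, padding=k // 2) for k in (1, 3, 5, 7)
])

def down_multiscale(t):
    return torch.cat([b(t) for b in branches], dim=1)

print(down_channel(x).shape)     # torch.Size([2, 128, 32, 32])
print(down_multiscale(x).shape)  # torch.Size([2, 128, 32, 32])
```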
Q5: “Low-dimensional features have less information redundancy…”...seems more related to “channel dimension” condensation...it would be helpful if the authors could directly indicate the input channel dimensions.
A5: Thanks for the valuable suggestion.
The "low-dimensional features" in Line 139 refers to the output of down-projection layer in the proposed adapter, which is an intermediate feature of the bottleneck structure. Since it represents a reduction in channel dimension, we will revise the "low-dimensional" into "low channel-dimensional" for better clarity. As for the “Bottleneck Dimension,” the input channel dimension is equal to Line 4 of Table 7, \ie 64, 256, 512, 1024 for the four DA-Ada blocks, respectively. We will supplement the explaination the number of input channels in Section 4.3 and highlight them in Table 7.
Q6: It would be beneficial to add some visualizations for failure cases.
A6: Thanks for your suggestion.
We provide some examples of failure cases on the Cross-Weather adaptation scenario in Figure 6 in the Global Rebuttal part. We visualize the ground truth (a)(b) and the detection boxes of DA-Ada (c)(d). In (c.1), DA-Ada misses the car with its headlights on in the fog. Since the source data Cityscapes is collected on sunny days, few cars have their lights on in the training set. Therefore, DA-Ada misses such out-of-distribution data. In (d.1), DA-Ada misses the bicycle and person occluded by other foreground objects. Since occlusion causes great damage to semantics, this type of missed detection is widely seen in object detection methods.
Q7: minor problems
A7: Thanks for pointing this out. We will correct the typos and supplement the legend in Figure 3.
Thank you for addressing my concerns. While your explanations clarified most points, I still believe that the extensive pre-training on massive data gives the VLM backbone a strong ability to capture domain-invariant features. This shifts the problem away from pure (or conventional) domain adaptation to something more akin to "vision-language data pre-trained domain alignment," which might better reflect the nature of the task, even though the terminology itself may seem a bit clumsy.
Thank you for your positive and insightful feedback. As you suggested, in the era of pre-trained VLM, we do need to rethink the role of VLM in DAOD and explore new paradigms for the task itself. In future work, we will continue to study this type of VLM-based domain adaptation tasks and investigate how the domain alignment of visual-language pre-training can help detectors adapt across different domains.
This work focuses on domain adaptive object detection (DAOD) with the vision-language models. The core idea behind this paper is the frozen visual encoder with a domain-agnostic adapter only captures domain-invariant knowledge for DAOD. To this end, this paper proposes a novel Domain-Aware Adapter (DA-Ada) to capture the domain-invariant knowledge and domain-specific knowledge. The experimental results on multiple DAOD tasks show the proposed method clearly outperforms existing DAOD methods.
Strengths
- DAOD is an important problem; especially in the large vision-language era, we need to rethink and explore new paradigms for DAOD.
- The proposed method is reasonable: effective transfer should maintain both transferability (domain-invariant) and discriminability (domain-specific) between the source and target domains.
- The experimental results over multiple benchmarks show the effectiveness of the proposed method. Extensive ablation studies have been conducted to investigate the proposed method.
Weaknesses
- Although the proposed method is reasonable, it still has limited novelty for the domain adaptation community. The domain-invariant and domain-specific knowledge have been exploited by many previous works.
- The function of the dec loss is to maximize the distribution discrepancy between DIA and DSA. Why is the cosine similarity calculated between h^I and h^I*h^S instead of between h^I and h^S?
- Lack of details for VTA: what are the structures of DITA and the two DSTAs? Besides, the text description c should be added in Figure 2(c), since the CLIP Textual Encoder adopts both the projected visual feature and the textual feature; showing this would improve clarity.
- The author should add results of source-only baseline, i.e., RegionCLIP, for a fair comparison.
- The Injection Operation has many variants in Table 6. Can the authors provide some explanation of the difference between h^I + h^S and h^I + h^I*h^S?
- In line 184, the figure reference should be Fig.2(c) instead of Fig.3(b).
Questions
- Do we really need a source domain for adaptation when we already have a powerful vision-language detector that contains much general knowledge from large-scale data?
Limitations
Nothing.
Comment:
We sincerely thank you for your comprehensive comments and constructive advice. We are pleased to see our work being regarded as reasonable and effective. We address your concerns point by point below.
Q1: Although the proposed method is reasonable...the domain-invariant and domain-specific knowledge have been exploited.
A1: Thanks for your concern.
Recent works [50, 2, 1, 59, 60] propose multiple extractors [38, 36, 57, 61] and discriminators [63, 76] to decouple the domain-invariant and domain-specific knowledge, aiming to disentangle the knowledge unique to each domain. However, on the one hand, they only use domain-invariant features for detection, ignoring the improvement in discriminability brought by the characteristics of each domain, i.e., domain-specific knowledge. On the other hand, applying existing DAOD methods to a VLM would overfit the model to the training data, compromising the generalization of the pre-trained model. In contrast, we take advantage of the generalization of the VLM to assist domain adaptation and propose a novel decoupling-refusion strategy. While preserving the pre-trained essential general knowledge, it adaptively modifies domain-invariant features with domain-specific features to enhance discriminability on the target domain. Experiments show that DA-Ada surpasses the SOTA disentangling method [76] (Line 6 of Table 1 in the paper) by 6.4~9.2% on three benchmarks, indicating that our method explores the relationship between domain-invariant and domain-specific knowledge from a novel and effective perspective.
Q2: The function of the dec loss...Why is the cosine similarity calculated between h^I and h^I*h^S instead of between h^I and h^S?
A2: Thanks for your nice suggestion.
h^S is the output feature of the DSA block, expected to extract domain-specific knowledge. To adaptively fuse the domain-specific knowledge with the domain-invariant knowledge, we explore two kinds of injection operation: direct addition and pixel-level attention. We find that domain-specific knowledge describes intra-domain properties and is more suitable for refining the extracted domain-invariant features. As shown in Table 6 in the paper, h^I + h^I*h^S achieves better performance than directly adding h^I + h^S. Therefore, h^S functions as pixel-level attention for h^I. In this case, h^I*h^S represents the features refined by the domain-specific knowledge, and computing the decoupling loss on the cosine similarity between h^I and h^I*h^S helps further decouple the domain-invariant and domain-specific knowledge. In addition, if we computed the cosine similarity between h^I and h^S instead, the term h^S would eventually converge to 0, which is meaningless for adaptation.
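A minimal sketch of this term is given below, assuming the decoupling loss is built from the cosine similarity between h^I and the refined feature h^I*h^S, flattened per image; the exact weighting and sign of the paper's decoupling loss are not reproduced here.

```python
import torch
import torch.nn.functional as F

def decoupling_term(h_inv, h_spec):
    refined = h_inv * h_spec                     # pixel-level attention h^I * h^S
    a = h_inv.flatten(1)                         # (B, C*H*W)
    b = refined.flatten(1)
    return F.cosine_similarity(a, b, dim=1).mean()

h_inv = torch.randn(2, 256, 32, 32)
h_spec = torch.randn(2, 256, 32, 32)
print(decoupling_term(h_inv, h_spec))
```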
Q3: Lack of details for VTA.... Besides, it should be added the text description c in Figure 2 (c)....for better clarity.
A3: Thanks for raising this question. The structure of DITA and DSTA is a 3-layer MLP with a hidden dimension of 512. The DITA and DSTA project visual embeddings into 8 tokens for the textual encoder. We will supplement the details of VTA in the manuscript. We will also add the textual description c as an input to the textual encoder in Fig.2(c).
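To illustrate the shape described above, the sketch below uses a 3-layer MLP with a 512-d hidden layer to map a pooled visual embedding into 8 prompt tokens for the textual encoder; the visual and token dimensions (1024 and 512) and the way the DITA and DSTA tokens are concatenated are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextualAdapter(nn.Module):
    """3-layer MLP mapping a visual embedding to a fixed number of prompt tokens."""
    def __init__(self, visual_dim=1024, token_dim=512, num_tokens=8, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_tokens * token_dim),
        )
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, visual_embed):                     # (B, visual_dim)
        tokens = self.mlp(visual_embed)
        return tokens.view(-1, self.num_tokens, self.token_dim)

dita = TextualAdapter()                                  # shared across domains
dsta_src, dsta_tgt = TextualAdapter(), TextualAdapter()  # one per domain
visual_embed = torch.randn(2, 1024)
prompt_src = torch.cat([dita(visual_embed), dsta_src(visual_embed)], dim=1)
prompt_tgt = torch.cat([dita(visual_embed), dsta_tgt(visual_embed)], dim=1)
print(prompt_src.shape, prompt_tgt.shape)  # torch.Size([2, 16, 512]) each
```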
Q4: The author should add results of source-only baseline, i.e., RegionCLIP.
A4: Thanks for this point.
The zero-shot result of the baseline, i.e., RegionCLIP, is 52.6% mAP in Line 1 of Tables 4, 6, and 8 in the paper. Fine-tuned with only the source domain data, the source-only baseline achieves 50.5% mAP, suffering performance degradation on the target domain. Equipped with the domain-aware adapter and the visual-guided textual adapter, our proposed method achieves the highest absolute gain of 8.0% (Line 5 of Table 10 in the paper) over the source-only baseline, demonstrating remarkable efficiency. We will modify the corresponding description in the manuscript to make it clearer.
Q5: The Injection Operation...can the authors provide some explanation between h^I+h^S and h^I+h^I*h^S
A5: Thanks for the question.
h^I + h^S denotes directly adding the outputs of DIA and DSA and taking the sum as the output feature of the domain-aware adapter. h^I + h^I*h^S denotes that h^S is applied as pixel-level attention: it is multiplied with h^I element-wise and added to the original h^I. Experiments show that h^I + h^I*h^S outperforms h^I + h^S (Lines 5 and 7 of Table 6 in the paper), revealing that h^S describes intra-domain properties and is more suitable for refining h^I than for direct addition.
Q6: In line 184, the figure reference should be Fig.2(c) instead of Fig.3(b).
A6: Thanks for this point. We will fix this typo in the manuscript.
Q7: Do we really need a source domain for adaptation when we already have a powerful vision-language detector that contains much general knowledge from large-scale data?
A7: Thanks for the intriguing question.
Pre-trained VLMs learn powerful generalization from large-scale data. Although VLMs' zero-shot capabilities are strong, they stem from the general knowledge the models learn, which acts like common sense over a wide range of data. When deployed to downstream tasks, VLMs can show better performance when combined with task-specific knowledge. Therefore, if credible labels for a downstream task are lacking, a manually annotated dataset that is highly relevant to it (that is, source domain data) can help VLMs transfer to the downstream task properly. That is to say, the source domain exists to help VLMs better adapt to the target domain. In this sense, the source domain is still needed.
Thank you for your detailed answers to the other reviewers and me, which solved all my concerns.
Thanks for your timely responses. We appreciate the valuable comment and insightful question!
This article proposes a Domain-Aware Adapter (DA-Ada) tailored for the DAOD task. The key point is exploiting domain-specific knowledge between the essential general knowledge and domain-invariant knowledge. The DA-Ada framework consists of the Domain-Invariant Adapter (DIA) for learning domain-invariant knowledge and the Domain-Specific Adapter (DSA) for injecting the domain-specific knowledge from the information discarded by the visual encoder. Comprehensive experiments over multiple DAOD tasks show the effectiveness of DA-Ada.
Strengths
- The paper is well-written.
- The proposed method achieves significant performance improvement compared to several baselines on commonly used benchmarks. The proposed method not only improves the detection performance on the target domain, but also achieves improvement on the source domain.
Weaknesses
- The DIA module is a sequence of operations involving mapping, dimensionality reduction, slicing, and dimensionality-raising. It is challenging to understand why such operations can extract domain-invariant knowledge. The author should provide a stronger explanation to clarify the rationale behind this design.
- In line 251 of the text, could the author elaborate on how the source-only adapter and the domain-agnostic adapter each function? This would help readers better understand.
- Could the author design a quantitative experiment to further visualize which features represent domain-specific knowledge? Additionally, could they compare the traditional adapter with the method proposed in this paper to further demonstrate the effectiveness of the proposed method? The AP metric alone is insufficient to fully explain the motivation behind this method.
- Since this paper focuses on adapter work, could the author further review the related work on VLM (Vision-Language Models) adapters in more detail? This would help readers better understand the contribution of this paper.
- The paper mentions that the domain-agnostic adapter easily causes the model to bias towards the source domain. The latest work [1] also addresses the issue of source bias. In the related work and experiments sections, the author should include a discussion and comparison with the latest work.
- During this training process, are the original CLIP network components (including the visual encoder and the text encoder) completely frozen, with only the adapter and VTA parts being trained?
- minor error: There are several places between lines 91 and 102 where spaces are missing between sentences.
Reference: [1] DSD-DA: Distillation-based Source Debiasing for Domain Adaptive Object Detection[C]//Forty-first International Conference on Machine Learning.
Questions
See weaknesses
Limitations
Adequately addressed
Comment:
We sincerely thank you for your comprehensive comments and constructive advice. We are pleased to see our work being regarded as achieving significant improvement. We address your concerns point by point below.
Q1: The DIA module is a sequence of operations...explanation behind this design.
A1: Thanks for your valuable concern.
In fact, it is the combination of the dimension-reduction and dimension-increase processes with the constraints of the detection and adversarial losses that extracts domain-invariant features while reducing redundant features. The structure of DIA is motivated by the bottleneck structure [2]. The bottleneck reduces the computational cost through dimension reduction and increase, efficiently learning feature representations. In this structure, when down-projecting the input into low-dimensional features, some redundant features are discarded; when mapping the low-dimensional features back to the original dimension, the task-related features are retained under the constraint of the task loss. In DIA, we first down-project the input features into low-dimensional features h^L and then up-project them to high-dimensional features h^I, optimizing with the adversarial and detection losses. Besides, we introduce a multi-scale scheme by slicing channels into different receptive fields, enabling DIA to capture domain-invariant knowledge at various scales. Experiments show that the mAP peaks at 57.1% when the bottleneck dimension is 1/2 of the input (Line 2 of Table 7 in the paper), indicating that appropriate dimension reduction can filter redundant information while extracting domain-invariant knowledge. Applying the scaling ratios of {1, 1/2, 1/4, 1/8} achieves 58.5%, demonstrating that multi-scale convolution can help deal with the domain bias in object scales.
Q2: How the source-only adapter and domain-agnostic adapter each function?
A2: Thanks for the suggestion.
The source-only adapter denotes the traditional adapter fine-tuned on the source domain. The domain-agnostic adapter is tuned with the detection loss on the source domain and the adversarial loss on both domains, as shown in Fig.1(b). We will supplement this description in the manuscript.
Q3: Quantitative experiment to visualize which features represent domain-specific knowledge?
A3: Thanks for your constructive suggestion.
We visualize the output features of the traditional adapter, the domain-invariant adapter (DIA), the domain-specific adapter (DSA) and the domain-aware adapter (DA-Ada) in Figure 5 in the Global Rebuttal part. We sample an image (a) from Foggy Cityscapes containing a car and a person in the fog. The traditional adapter (b) roughly extracts the outline of the car. However, affected by target-domain attributes such as fog, background areas are also highlighted in (b), and the person is not salient. DIA (c) mainly focuses on the object area and extracts domain-shared task information. DSA (d) mainly focuses on factors related to domain attributes besides the objects, such as foggy areas. By combining DIA with DSA, DA-Ada (e) extracts the car and person while reducing the interference of fog in the background. Compared with (b), objects are more salient in (e), indicating the effectiveness of DA-Ada.
Q4: Further review the related work on VLM adapters
A4: Thanks for your suggestion.
Recent works explore adapters to transfer pre-trained VLMs to few-shot visual tasks. [3] and [4] propose adapters to introduce image-related inductive biases into Transformer and CNN networks. [6] first integrates an adapter into the CLIP model, and [5] further analyzes which components should be frozen or learnable. [7] combines self-supervised learning to enhance the ability to extract low-level features. The recent [8] explores injecting task-related knowledge into the high-resolution segmentation model SAM. However, these adapters also face the source-bias problem when applied to the DAOD task. To handle this, DA-Ada explicitly learns both domain-invariant and domain-specific knowledge. We will provide a more detailed description in the related work.
Q5: Comparison with the latest work [1].
A5: As a semi-supervised method, [1] transfers source domain images into the target domain, aiming to train a style-unbiased classifier. However, it requires a large amount of image pre-processing and three stages of training, and [1] can only handle domain bias in image style. In contrast, DA-Ada freezes the backbone and introduces lightweight learnable adapters. It does not require data generation and tunes only a very small number (1.794M) of parameters to achieve a significant adaptation gain of +8.0% (Line 5 of Table 10 in the paper). Meanwhile, since DA-Ada adaptively learns cross-domain information from visual features, it is robust across various adaptation scenarios. Experiments show that DA-Ada outperforms [1] on three benchmarks by 6.3~17.4% mAP and by 5.1~5.6% in absolute performance gain, indicating that DA-Ada is more effective and can handle a broader range of adaptation scenarios.
| Benchmark | DSD-DA [1] | Abs. Gain | DA-Ada | Abs. Gain |
|---|---|---|---|---|
| C→F | 52.2 | +2.9 | 58.5 | +8.0 |
| K→C | 49.3 | +1.6 | 66.7 | +7.2 |
| S→C | 52.5 | +1.1 | 67.3 | +6.5 |
Q6: During training process, are the original network completely frozen?
A6: Yes. Only the adapter parts are trained, and all other CLIP components are frozen.
Q7: minor error in lines 91 and 102
A7: Thanks for pointing that out. We will correct these typos.
[1] DSD-DA: Distillation-based Source Debiasing for Domain Adaptive Object Detection
[2] Deep residual learning for image recognition
[3] AdapterFusion: Non-destructive task composition for transfer learning
[4] Conv-Adapter: Exploring parameter efficient transfer learning for ConvNets
[5] VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks
[6] CLIP-Adapter: Better vision-language models with feature adapters
[7] SVL-Adapter: Self-supervised adapter for vision-language pretrained models
[8] Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation
Thanks for your detailed response. Most of my concerns are addressed. I hope that all the experiments and discussions in the rebuttal can be added in the revised paper. After reading the author's response and considering comments from other reviewers, I would like to keep my initial rating.
We really appreciate your precious time. As you nicely point out, we will carefully include the additional experiments and discussions in the manuscripts. Thanks for your insightful suggestion!
We thank all the reviewers for their insightful and valuable comments! Overall, we are encouraged that they find that:
- The idea of learning domain-aware adapter is moderately interesting (Reviewer N5kf), reasonable (Reviewer nqrX) and the motivation is clear and straightforward (Reviewer xAKd).
- This paper addresses an important problem in the large vision-language era and explores a new paradigm for DAOD (Reviewer nqrX).
- The proposed method is simple yet effective (Reviewer N5kf, Reviewer xAKd, Reviewer nqrX), significantly improves DAOD performance (Reviewer N5kf, Reviewer DyVN, Reviewer xAKd), and surpasses state-of-the-art works by a large margin (Reviewer N5kf).
- The paper is well-written (Reviewer DyVN, Reviewer xAKd), and the experiments are extensive (Reviewer nqrX), comprehensive and detailed (Reviewer xAKd).
We will revise the manuscript according to the reviewers' comments. The main changes we made include:
- We add more details and explanations about the low-dimensional features in DIA, the injection operation in DSA, and the structure of VTA.
- We add discussions and comparisons with the latest research on prompt tuning, adapters, and representation disentangling.
- We add quantitative experiments to explore the effectiveness of the proposed method. We provide feature visualization of the traditional adapter, the proposed domain-invariant adapter, domain-specific adapter and domain-aware adapter in Figure 5 in the attached PDF. We also analyze some failure cases in Figure 6.
- We revise the details in Figures and Tables and fix some typos in the manuscript. We add the textual description c as an input to the textual encoder in Fig.2(c), and add legends in Fig.3. We indicate the pre-training scheme for the model in Table 1 and Table 2.
Next, we address each reviewer's detailed concerns point by point. We hope we have addressed all of your concerns. Thank you!
This paper presents a novel technique for domain adaptive object detection (DAOD) with the vision-language models. The assumption is that the visual-language models (VLMs) can provide essential general knowledge on unseen images. Therefore, the visual encoder is frozen, and a domain-agnostic adapter is inserted to acquire domain invariant knowledge for DAOD. To keep the domain-specific knowledge of the target domain, a novel technique called Domain-Aware Adapter (DA-Ada) is proposed in this paper.
Experimental results show that the proposed method achieves significant performance improvement compared to previous methods. The performance improvements of 1-2 mAP points are significant in object detection. In addition, a detailed ablation study is conducted to facilitate better understanding of each of the modules. Therefore, the proposed algorithm is worth sharing in the community.
The reviewers’ scores are consistently positive.
Here is just a comment to improve this paper:
- The authors are advised to make the texts in the figures much larger, e.g., as large as those in the main texts.