PaperHub

Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

NeurIPS 2024 | Submitted: 2024-05-03 | Updated: 2024-11-06
Decision: Rejected (4 reviewers)
Overall rating: 5.5/10 (individual ratings 5, 5, 5, 7; min 5, max 7, std 0.9)
Confidence: 3.5 | Correctness: 3.3 | Contribution: 3.3 | Presentation: 3.0

Keywords: Open-vocabulary Semantic Segmentation, Hyperspherical Energy, Partial Orthogonal Fine-tuning, Dual Cross Relation Communication

Reviews and Discussion

Review (Rating: 5)

The paper presents H-CLIP, a novel framework for open-vocabulary semantic segmentation using the CLIP model. The framework addresses three key challenges: high computational cost, misalignment between CLIP's image and text modalities, and degraded generalization ability on unseen categories when fine-tuning for pixel-level predictions. H-CLIP employs a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. This strategy uses efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module to mitigate misalignment issues. Additionally, an orthogonality constraint based on the hyperspherical energy principle is applied to the text encoder to preserve the generalization ability of the pre-trained model.

Strengths

The introduction of the H-CLIP framework for open-vocabulary semantic segmentation represents a significant innovation. The use of a symmetrical parameter-efficient fine-tuning (PEFT) strategy in hyperspherical space is a unique approach to addressing the challenges associated with fine-tuning vision-language models.

The paper provides extensive experimental results across multiple benchmarks, including ADE20K, PASCAL VOC, and PASCAL-Context. These experiments validate the effectiveness of H-CLIP, showing its superior performance compared to state-of-the-art methods.

Weaknesses

  1. The expression in Formula 5 does not specify how it interacts with the $\boldsymbol{R}$ matrix.

  2. The paper states that current fine-tuning strategies are usually asymmetrical, but it does not provide enough evidence or references to support this claim. The authors should provide empirical evidence or references to support the claim of asymmetry.

  3. While the paper extensively discusses the orthogonality constraint in the CLIP image encoder, it lacks an in-depth analysis of how the misalignment problem impacts segmentation performance. The authors should discuss the specific effects of misalignment on segmentation.

  4. The paper should mention SAM (Segment Anything) and discuss why the current work is still significant in light of it.

Questions

  1. This problem also exists in other areas, such as object detection and other dense prediction tasks. Is the proposed method generalizable enough for those tasks as well?
  2. Some relevant recent works are not discussed and compared, such as [1] [2].

[1] Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning. CVPR 2023.
[2] Controlling Text-to-Image Diffusion by Orthogonal Finetuning. NeurIPS 2023.

Limitations

The work is on semantic segmentation, yet no qualitative comparison is shown in the main paper. There are some visuals in the supplementary, but most of them are from the test set and there is no comparison with the baselines and existing methods, so it is not clear where the improvement comes from. It would be good to see how the results improve with and without alignment.

Author Response

We sincerely thank you for your valuable feedback. We will include all new results and clarifications in the revised version.

Weaknesses

W1: Formula 5 does not specify how to interact with the $\boldsymbol{R}$ matrix.

A1: The interaction is introduced in Section 4.3. According to Formula 6, we first treat all the matrices $\boldsymbol{R}$ in the $l^{th}$ layer as a 3-order tensor $\mathcal{T}_l$.

Then, according to Formula 7, we treat all the tensors $\mathcal{T}_l$ in parameter space as a 4-order tensor $\mathcal{T}$. After that, following Formula 16, the interaction with the $\boldsymbol{R}$ matrix is achieved in $\mathcal{T}_w$ via two reversible relation matrices, i.e., $\mathbf{S}_3$ and $\mathbf{S}_4$.

Finally, in line with the manuscript, the tensor $\mathcal{T}_w$ is added to $\mathcal{T}$ by the updating rule $\mathcal{T} = \mathcal{T} + \alpha\mathcal{T}_w$. To sum up, Formula 5 can be rewritten at the end of Section 4 (Methodology) as follows:

$$\tilde{\mathbf{M}}_l = \mathcal{F}_l(\mathbf{M}_l; \mathcal{T}_l \mathbf{W}_l)$$

We will clarify this in the revised version.
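For concreteness, below is a minimal PyTorch-style sketch of how this stacking and update can be read. The shapes, the einsum layout, and the use of plain linear relation matrices are illustrative assumptions for this response, not our released implementation.

```python
import torch

d, n3, n4 = 64, 2, 12        # hypothetical sizes: matrix dim, #modalities, #layers
alpha = 0.1                  # hypothetical scaling factor

# 4-order tensor T collecting all learnable R matrices (our reading of Formulas 6-7):
# axes 0/1 = matrix dims, axis 2 = modality (image/text), axis 3 = layer index
T = torch.stack(
    [torch.stack([torch.eye(d) for _ in range(n3)], dim=-1) for _ in range(n4)],
    dim=-1,
)                            # shape (d, d, n3, n4)

# two reversible relation matrices acting on the modality axis and the layer axis
S3 = torch.eye(n3) + 0.01 * torch.randn(n3, n3)
S4 = torch.eye(n4) + 0.01 * torch.randn(n4, n4)

# cross-modal / cross-layer communication (our reading of Formula 16)
T_w = torch.einsum('ijmn,mp,nq->ijpq', T, S3, S4)

# updating rule T <- T + alpha * T_w; afterwards T[:, :, m, l] plays the role of R in Formula 5
T = T + alpha * T_w
```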

W2: ... current fine-tuning strategies are usually asymmetrical, but it does not provide enough evidence or references ... .

A2: Many previous works [a-d] in the field of open-vocabulary semantic segmentation propose various types of asymmetric fine-tuning frameworks, where CLIP's text encoder is simply frozen, and the image branch is fine-tuned. We will provide these references in the revised version.

W3: In-depth analysis of how the misalignment problem impacts segmentation performance.

A3: Current fine-tuning methods for open-vocabulary segmentation are usually asymmetrical, i.e., typically freezing CLIP's text encoder and fine-tuning its image encoder. This strategy inevitably causes a potential obstacle: misalignment. More specifically, the misalignment arises from different alignment granularities. The text encoder maintains image-to-text alignment, while the image encoder shifts from image-to-text to pixel-to-text alignment. Due to these different alignment goals, the optimization process is largely impeded, leading to sub-optimal performance. We also provide some visualizations in Fig. 1 and Fig. 2 in the global author response PDF. One can observe that fine-tuning without alignment tends to separate the entire object region into a series of discrete regions due to the coarse granularity in understanding semantics.

W4: Mention SAM (Segment Anything) and how the current work is still significant.

A4: We have cited SAM as a large-scale foundation model in Section 2.2 ([23] in the main manuscript). Although SAM is an influential foundation model for image segmentation, adopting it for open-vocabulary semantic segmentation is non-trivial. The reason is that the masks produced by SAM are class-agnostic, carrying no semantics. Assigning semantics from an open-vocabulary set to these masks also faces the challenge of misalignment due to the different granularities of the two modalities. Therefore, our work is still significant.

Questions

Q1: Generalizable enough for other dense prediction tasks?

A5: Yes, our method is generalizable for object detection. To show this, we validate our method on an open-vocabulary object detection task. Please see Table 2 in the global response.

Q2: Discussion and comparison with [1] and [2]?

A6: The objective of [1] and our method is the same: to preserve both modality-shared and modality-specific information, but the proposed strategies are different. [1] achieves this goal by improving contrastive learning with several regularizations. However, its efficacy depends on delicately designed objectives and might cause optimization conflicts among them. In contrast, we propose a parameter-efficient fine-tuning strategy to preserve modality-specific information, while modality-shared information is captured by DCRC. The official code of [1] is not available, and unfortunately, we have not yet received a response after emailing the first author of [1].

OFT [2] aims to adopt orthogonal constraints on both CLIP's image encoder and text encoder to strictly maintain the original semantic structures. In contrast, we first apply orthogonal constraints only to the text encoder. This is crucial for open-vocabulary semantic segmentation, as it can provide more flexibility in fine-tuning the image encoder, facilitating the transfer of CLIP's initial alignment from image-level to pixel-level. We then introduce DCRC to encourage interactions between the encoders of the two modalities, further mitigating the misalignment issue. We compare our method with OFT [2] and demonstrate better performance, as shown below. We will include the above discussion in the revised version.

| Method | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | PAS-20$^b$ |
|---|---|---|---|---|---|---|
| OFT [2] | 10.9 | 18.0 | 30.2 | 53.7 | 93.7 | 74.3 |
| H-CLIP | 12.4 | 19.3 | 32.4 | 57.9 | 95.2 | 78.2 |

Limitations

L1: No qualitative comparison with the baselines and existing methods ... . It will be good to see how the results improve with and without alignment.

A7: Thanks for your question. We have provided some qualitative comparisons between our method and the existing SOTA method, i.e., CAT-Seg, on different datasets, as shown in Fig. 2, 3 and 4 in the main manuscript. To further validate the effectiveness of alignment, we present additional visual comparisons. Please see Fig.1 and Fig.2 in the global author response PDF.

[a] Language-driven Semantic Segmentation. ICLR22.
[b] Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. CVPR22.
[c] Side Adapter Network for Open-Vocabulary Semantic Segmentation. CVPR23.
[d] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation. CVPR24.

Review (Rating: 5)

The paper presents H-CLIP, a novel approach for parameter-efficient fine-tuning of the CLIP model in hyperspherical space, specifically for open-vocabulary semantic segmentation. H-CLIP introduces a symmetrical parameter-efficient fine-tuning strategy that leverages hyperspherical energy principles, and a dual cross-relation communication module is used to enhance cross-modal and cross-layer alignment.

Strengths

  • This paper is well-motivated. The proposed H-CLIP effectively addresses common issues in fine-tuning CLIP.
  • The paper effectively argues that maintaining the hyperspherical energy helps preserve the model's generalization ability, a critical factor in multi-modal tasks.
  • The ablation experiments are thorough and effectively support the arguments.

Weaknesses

  • The writing needs improvement. The introduction lacks transitions from existing problems to the approach of this paper, such as introducing the advantages of Hyperspherical Space.
  • Some formula descriptions can be optimized, for example, explaining the meaning of * in Formula 9.
  • Details about comparison methods are needed. In Table 1, the compared method SAN includes an additional backbone.

Questions

This work focuses more on cross-modal alignment and the issues related to hyperspherical space, rather than on custom designs for pixel-level predictions. It appears to be more of a generalizable cross-modal fine-tuning paradigm. Have the authors attempted to validate it on tasks beyond open-vocabulary semantic segmentation?

Limitations

The authors provide no analysis of the limitations and broader impact. They could analyze the limitations of this fine-tuning strategy in the field of OVS.

Author Response

We sincerely thank you for your valuable feedback and hope the following clarifications and responses address your concerns.

Weaknesses

W1: The introduction lacks transitions from existing problems to the approach of this paper, such as introducing the advantages of Hyperspherical Space.

A1: Introducing hyperspherical space into our method has two advantages. First, the hyperspherical space helps capture a model's intrinsic semantic structure. Specifically, by adhering to the hyperspherical energy principle when updating CLIP's text encoder, we preserve its intrinsic semantic knowledge, thus reducing the risk of over-fitting and improving performance on unseen classes. Second, the hyperspherical space provides a symmetric and robust parameter space for adapting CLIP, allowing its two encoders to mitigate the misalignment between the two modalities. We will add this content in the revised version.
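For reference, the hyperspherical energy we rely on follows the commonly used definition over the normalized neurons (rows) $\hat{\mathbf{w}}_i = \mathbf{w}_i / \lVert \mathbf{w}_i \rVert$ of a weight matrix; the exponent $s > 0$ below is a design choice rather than something fixed by our method:

$$\mathrm{HE}_s(\hat{\mathbf{w}}_1, \dots, \hat{\mathbf{w}}_N) = \sum_{i \neq j} \big\lVert \hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j \big\rVert^{-s}.$$

Keeping this energy approximately unchanged while fine-tuning preserves the relative angular layout of the pre-trained neurons on the hypersphere, which is what we mean by preserving the intrinsic semantic structure.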

W2: Some formula descriptions can be optimized, for example, explaining the meaning of * in Formula 9.

A2: The "*" represents a tensor product. We will add it to the revised version. Thanks!

W3: Details about comparison methods are needed. In Table 1, the compared method SAN includes an additional backbone.

A3: Thanks for your suggestion. We will correct this by filling in "side adapter" in the "Additional Backbone" column for SAN in the revised version.

Questions

Q1: This work focuses more on cross-modal alignment and the issues related to hyperspherical space, rather than on custom designs for pixel-level predictions. It appears to be more of a generalizable cross-modal fine-tuning paradigm. Have the authors attempted to validate it on tasks beyond open-vocabulary semantic segmentation?

A4: Thank you for your suggestion. We further validate our method on other tasks and observe consistent improvements. Please refer to the global response.

Limitations

L1: The authors provide no analysis of the limitations and broader impact. The author can analyze the limitations of this fine-tuning strategy in the field of OVS.

A5: Thank you for your comment. Our main contribution lies in proposing a parameter-efficient fine-tuning strategy for open-vocabulary semantic segmentation. However, we have not taken memory efficiency into account yet. Given the rapid evolution of vision foundation models for OVS, it is important to pursue low-cost deployment of fine-tuning, which could be improved in future work.

Review (Rating: 5)

This paper proposes H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. The PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.

Strengths

This paper achieves SOTA performance. The parameter-efficient fine-tuning is explained through tensor computation.

Weaknesses

1. The novelty is limited. Partial orthogonal fine-tuning (POF) does not directly address the challenges of OVSS but rather offers a generic PEFT approach, so what is the difference between POF and OFT [1]? In Eq. 5, which module's weights serve as the pre-trained weight matrix, the Q (K or V) projection layer or the FFN? The method details need a more detailed explanation.

2. Some concerns about DCRC. In this section, the authors discuss the use of two k-layer deep neural networks to update the fourth-order tensor in Eq. 7, and provide some mathematical proof. However, these proofs only show that the reversible transformations S(·) can be replaced by reversible matrices S (as shown in Eqs. 11, 12, 14, 15), and the authors then use k-layer deep neural networks to replace such reversible matrices S, which does not explain the meaning of the reversible transformations. In other words, why adopt reversible transformations to update the fourth-order tensor in Eq. 7, and what is their role? Does this approach also work in fields other than semantic segmentation? In addition, if the block-diagonal structure is not adopted, Eq. 16 seems to require only one reversible matrix S4 for the mapping. Would this reduce the number of parameters?

3. Insufficient experimental analysis. 1) The decoder of H-CLIP seems to be learnable as well. Does the parameter count in Table 2 include the decoder part? And is the proposed PEFT method applicable to various decoders? If the decoder is replaced with a linear probe, is the proposed method still effective? This needs further exploration. 2) If a different VFM is adopted (not CLIP), is the proposed method still valid? 3) The proposed method should be compared with more PEFT methods such as VPT, Adapter, LST, and SSF [1-5].

[1] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320–79362, 2023.
[2] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[3] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[4] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. LST: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35:12991–13005, 2022.
[5] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35:109–123, 2022.

Questions

See Weaknesses.

Limitations

The limitation of the proposed method should be discussed.

Author Response

Thanks for your instructive comments. We will include all new results and clarifications in the revised version.

Weaknesses

W1: Limited novelty.

A1: We would like to emphasize that the novelty of our proposed POF mainly lies in its task-oriented design. Most previous OVSS methods opt for an asymmetric fine-tuning framework, which may easily lead to misalignment between the two modalities, thus impeding optimization speed[a]. In contrast, POF provides a symmetrical PEFT framework that unlocks a small number of parameters in hyperspherical space for encoders of the two modalities, largely mitigating this issue. We also visualize the training accuracy curve in Fig.3 of the global author response PDF, further demonstrating the advantage of the symmetric fine-tuning solution.

W2: Difference between POF and OFT[1].

A2: OFT [1] adopts orthogonal constraints on both CLIP's image and text encoders to strictly maintain the original semantic structures. In contrast, POF applies orthogonal constraints only to the text encoder. This is significant for OVSS, as POF provides more flexibility in fine-tuning the image encoder, facilitating the transfer of CLIP's initial alignment from text-to-image to text-to-pixel. Besides, we compare OFT with a variant of our method that uses POF alone, demonstrating a clear improvement of POF over OFT. See the table below.

| Method | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | PAS-20$^b$ |
|---|---|---|---|---|---|---|
| OFT [1] | 10.9 | 18.0 | 30.2 | 53.7 | 93.7 | 74.3 |
| POF | 12.3 | 19.0 | 31.8 | 56.4 | 94.6 | 76.3 |
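As a rough illustration of this asymmetry, the sketch below regularizes only the text-side learnable matrices; it uses a generic soft orthogonality penalty for simplicity, not the exact hyperspherical-energy-based constraint of the paper, and all names and sizes are hypothetical.

```python
import torch

def orthogonality_penalty(R: torch.Tensor) -> torch.Tensor:
    """Soft penalty ||R^T R - I||_F^2, one simple way to keep R close to orthogonal."""
    I = torch.eye(R.shape[0], device=R.device)
    return ((R.T @ R - I) ** 2).sum()

d = 32                                                # hypothetical matrix size
R_text = torch.eye(d) + 0.01 * torch.randn(d, d)      # text-side matrix: constrained
R_image = torch.eye(d) + 0.01 * torch.randn(d, d)     # image-side matrix: left unconstrained

task_loss = torch.tensor(0.0)                         # placeholder for the segmentation loss
lam = 1e-2                                            # hypothetical penalty weight
total_loss = task_loss + lam * orthogonality_penalty(R_text)  # constraint on the text side only
```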

W3: In Eq.5, which module's weights are used?

A3: The pre-trained weights adjusted in Eq. 5 are those of the attention layers, in line with most PEFT methods, e.g., LoRA.

W4: Why adopt reversible transformations to update the 4-order tensor in Eq. 7, and what is their role ...?

A4: Through the derivation in the manuscript and suppl., we conclude that the tensor product between a pair of $p$-order tensors ($p \geq 3$) can be effectively converted into the matrix product of their internal 2-order matrices via $p-2$ reversible transformations $S_i(\cdot)$, $i = 3, \cdots, p$. This indicates that $S_i(\cdot)$ can capture the correlations among the different matrices. In practice, we use $S_i(\cdot)$, $i = 3, 4$, to update the 4-order tensor in Eq. 7, as this efficiently achieves communication across modalities ($n_3$) and layers ($n_4$) in the parameter space. Besides, to capture potential non-linear relations among the matrices, we realize the reversible transformations $S_3(\cdot)$ and $S_4(\cdot)$ using two $k$-layer DNNs.
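A minimal sketch of this last step is shown below; the widths, depth, and activation are illustrative, and invertibility of the networks is not explicitly enforced in this toy version.

```python
import torch
import torch.nn as nn

k, n3, n4, d = 2, 2, 12, 64   # hypothetical: MLP depth, #modalities, #layers, matrix dim

def relation_net(n: int, depth: int) -> nn.Sequential:
    """A small MLP with `depth` linear layers, standing in for a reversible transformation S_i(.)."""
    layers = []
    for _ in range(depth - 1):
        layers += [nn.Linear(n, n), nn.GELU()]
    layers.append(nn.Linear(n, n))
    return nn.Sequential(*layers)

S3_net, S4_net = relation_net(n3, k), relation_net(n4, k)

T = torch.randn(d, d, n3, n4)                         # the 4-order tensor of Eq. 7 (illustrative)
# apply S3 along the modality axis (n3), then S4 along the layer axis (n4);
# nn.Linear acts on the last dimension, hence the permutes
T_w = S3_net(T.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
T_w = S4_net(T_w)
```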

W5: ... also work in other fields?

A5: Yes. Please see Tables 1 and 2 in the global response.

W6: If the block diagonal structure is not adopted, ... reduce ... parameters?

A6: No, the block-diagonal structure is necessary. For notational simplicity, we set $d_v = d_e = d$ in Sec. 4. However, in practice, the dimensions of the tunable matrices $R_{vi} \in \mathbb{R}^{d_v \times d_v}$ and $R_{ei} \in \mathbb{R}^{d_e \times d_e}$ are often not equal, e.g., $d_v = 768$ and $d_e = 512$ for the ViT-B/16 version of CLIP. Given that the dimension of each matrix in a higher-order tensor must be consistent, we use a block-diagonal structure to align the matrix dimensions between the two modalities, and thus it cannot be discarded.
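One way to picture the role of the block-diagonal structure is sketched below; the shared block size and the identity initialization are illustrative assumptions.

```python
import torch

d_v, d_e, b = 768, 512, 64     # ViT-B/16 CLIP widths; b = hypothetical shared block size
n_v, n_e = d_v // b, d_e // b  # 12 image-side blocks, 8 text-side blocks

# each modality's R is block-diagonal, built from equally sized b x b blocks, so blocks
# from both modalities have matching dimensions and can be collected into one
# higher-order tensor even though d_v != d_e
R_v = torch.block_diag(*[torch.eye(b) for _ in range(n_v)])   # (768, 768)
R_e = torch.block_diag(*[torch.eye(b) for _ in range(n_e)])   # (512, 512)
```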

W7: Does the param ... calculate the decoder? Applicable to various decoders?

A7: Following the protocol used in previous PEFT works [b, c], we do not count the decoder parameters for any of the methods. We will indicate this in the revised version. To further evaluate the effectiveness of our method under different decoders, we replace our decoder with three classical decoders: a linear probe, a CNN-based decoder [d], and a transformer-based decoder [e]. The results show our method is effective with different decoders. See the table below.

| Decoder | Method | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | PAS-20$^b$ |
|---|---|---|---|---|---|---|---|
| Linear probe | LoRA | 9.1 | 13.7 | 24.9 | 50.2 | 93.9 | 72.6 |
| Linear probe | Ours | 10.2 | 15.4 | 26.6 | 51.1 | 94.2 | 73.7 |
| [d] | LoRA | 9.4 | 16.4 | 26.3 | 54.1 | 94.1 | 74.2 |
| [d] | Ours | 10.9 | 18.2 | 29.3 | 55.2 | 94.9 | 75.8 |
| [e] | LoRA | 9.9 | 15.1 | 27.7 | 53.9 | 94.1 | 74.3 |
| [e] | Ours | 11.2 | 17.8 | 30.8 | 56.4 | 95.1 | 77.3 |

W8: Different VFM ... still valid?

A8: We apply our method to fine-tuning another well-known VFM, i.e., SAM (Segment Anything Model), adapting it to various downstream tasks. Note that SAM only has an image encoder, so we remove the cross-modal interactions in DCRC. Besides, we incorporate orthogonal constraints in the learnable matrices of the lower layers of the encoder, as they contain generalizable representations for segmentation that should be preserved [f]. We follow the experimental settings provided in [g]. Our method shows competitive performance compared with other PEFT methods. See the table below.

| Method | ADOMEN | WPU | TRCAN |
|---|---|---|---|
| VPT [2] | 87.7 | 81.8 | 71.5 |
| SSF [5] | 88.5 | 81.9 | 73.0 |
| Ours | 90.9 | 84.2 | 74.1 |

W9: Comparing with more PEFT methods.

A9: We provide comparisons with more PEFT methods. The results are shown below. Our method achieves the best performance over all datasets.

| Method | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | PAS-20$^b$ |
|---|---|---|---|---|---|---|
| OFT [1] | 10.9 | 18.0 | 30.2 | 53.7 | 93.7 | 74.3 |
| VPT [2] | 5.7 | 10.2 | 23.7 | 54.3 | 93.8 | 75.1 |
| Adapter [3] | 10.4 | 16.5 | 28.8 | 54.9 | 94.2 | 75.2 |
| LST [4] | 7.2 | 12.7 | 27.0 | 56.8 | 95.4 | 76.3 |
| SSF [5] | 6.9 | 15.2 | 28.6 | 52.1 | 93.2 | 72.8 |
| Ours | 12.4 | 19.3 | 32.4 | 57.9 | 95.2 | 78.2 |

[a] Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining. ICCV23.
[b] Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation. CVPR23.
[c] Time-, Memory- and Parameter-Efficient Visual Adaptation. CVPR24.
[d] Pyramid Scene Parsing Network. CVPR17.
[e] Segmenter: Transformer for Semantic Segmentation. ICCV21.
[f] MMA: Multi-Modal Adapter for Vision-Language Models. CVPR24.
[g] Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything. CVPR24.

Comment

I would like to keep the original score after reading through the rebuttal.

Comment

We greatly appreciate your response and once again extend our sincere gratitude for the valuable time and effort you spent on the review.

Review (Rating: 7)

This paper proposes a novel method called Parameter-Efficient Fine-Tuning in Hyperspherical Space for efficiently solving the open-vocabulary semantic segmentation problem. The method introduces a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. To maintain the generalization ability offered by the CLIP text encoder, the authors designed a constraint to PEFT based on Hyperspherical Energy. Comprehensive results on open-vocabulary semantic segmentation benchmarks demonstrate the strong performance of this PEFT method by training only 4% of the total parameters of CLIP.

Strengths

The idea of introducing hyperspherical space to achieve parameter-efficient training is interesting. This approach attains state-of-the-art performance on current open-vocabulary semantic segmentation benchmarks with fewer learnable parameters. Additionally, it demonstrates better parameter efficiency than LoRA on open-vocabulary semantic segmentation tasks, as shown in Table 3.

Weaknesses

I do not see any clear weaknesses. However, I acknowledge that I am not familiar with hyperspherical theorems.

Questions

  • Although I understand this is a parameter-efficient training strategy, is it possible to provide the training time for this method? I am interested in the training-time efficiency perspective. It would be better to also provide a comparison of the time with other methods. (No need for complete training during the rebuttal; just ensure the number of iterations is the same and compare the time and results.)

  • From the segmentation results in Table 3, this method outperforms LoRA. However, since LoRA is widely validated on other tasks, demonstrating the effectiveness of this method on other CLIP-based tasks would provide strong evidence to support its effectiveness and potential broad impact.

Limitations

See the questions. No other clear limitations.

Author Response

We truly thank you for the insightful comments and suggestions. We hope our responses can address your concerns.

Questions

Q1: Training time efficiency.

A1: We compare the training time of our method with other representative PEFT methods based on the ViT-B/16 backbone, which shows comparable time costs. All results are obtained on 4 NVIDIA RTX 3090 GPUs; see the table below.

| Method | LoRA | VPT [a] | Ours |
|---|---|---|---|
| Training Time (h) | 11.8 | 14.1 | 12.2 |

Q2: The effectiveness of this method on other CLIP-based tasks would provide strong evidence to support its effectiveness and potential broad impact.

A2: Thank you for your suggestion. In response to your comment, we compare our method with LoRA on other CLIP-based tasks and consistently outperform it, demonstrating the generalization of our method. Please refer to Table 1 and Table 2 in the global response.

[a] Visual prompt tuning. ECCV22.

Comment

Thanks for your rebuttal. The results in Tables 1 and 2 are strong, and your response addresses my concerns. Also, your work only slightly increases the training time compared to LoRA. I keep my score.

Comment

We sincerely thank you for your response! Your help in reviewing our paper has been very valuable in making it better.

Author Response

Common Response for fine-tuning CLIP on other tasks

We thank all reviewers for their insightful comments. We will include all new results in the revised version.

Since all reviewers (Q2 of Reviewer YGpA, W5 of Reviewer WXoe, Q1 of Reviewer zhUR, and Q1 of Reviewer MnaP) are curious about the performance of our method on other CLIP-based tasks, we provide more detailed experimental comparisons in the tables below.

In Table 1, we present the results of a few-shot classification task (16 shots), following the experimental settings provided in CoOp [a]. We also validate our method on an open-vocabulary object detection task, conducting an experiment on the COCO dataset following [b] (Table 2). These results demonstrate the generalization ability of our method and its potential impact on the multi-modal community.


Table 1. Comparisons on fine-tuning CLIP for a few-shot classification task

| Dataset | CLIP Base (%) | CLIP Novel (%) | CLIP HM (%) | LoRA [c] Base (%) | LoRA [c] Novel (%) | LoRA [c] HM (%) | H-CLIP (ours) Base (%) | H-CLIP (ours) Novel (%) | H-CLIP (ours) HM (%) |
|---|---|---|---|---|---|---|---|---|---|
| ImageNet | 72.43 | 68.14 | 70.22 | 76.53 | 69.88 | 73.05 | 76.92 | 70.98 | 73.83 |
| Caltech101 | 96.84 | 94.00 | 97.20 | 98.00 | 94.11 | 96.02 | 97.98 | 93.43 | 95.65 |
| OxfordPets | 91.17 | 97.26 | 94.12 | 95.34 | 97.69 | 96.50 | 95.67 | 98.03 | 96.84 |
| StanfordCars | 63.37 | 74.89 | 69.45 | 69.87 | 73.72 | 71.74 | 74.45 | 76.34 | 75.73 |
| Flowers102 | 72.08 | 77.80 | 74.83 | 92.80 | 75.02 | 84.97 | 96.01 | 74.13 | 86.66 |
| Food101 | 90.10 | 91.22 | 90.66 | 90.57 | 91.14 | 90.85 | 90.66 | 91.48 | 91.07 |
| FGVCAircraft | 27.19 | 36.29 | 31.09 | 25.94 | 17.23 | 21.71 | 33.03 | 34.45 | 33.73 |
| SUN397 | 69.36 | 75.35 | 72.23 | 78.91 | 77.76 | 78.33 | 79.87 | 78.56 | 79.21 |
| DTD | 53.24 | 59.90 | 56.37 | 75.84 | 50.18 | 63.40 | 78.23 | 57.76 | 67.95 |
| EuroSAT | 56.48 | 64.05 | 60.03 | 86.79 | 64.12 | 73.75 | 87.89 | 64.57 | 76.45 |
| UCF101 | 70.53 | 77.50 | 73.85 | 79.22 | 76.09 | 77.62 | 82.78 | 78.35 | 80.50 |
| Average | 69.34 | 74.22 | 71.70 | 79.07 | 71.64 | 74.97 | 81.23 | 74.57 | 78.03 |

Our method shows much better average performance compared with LoRA [c] over 11 datasets on all evaluation metrics, i.e., base and novel accuracy, as well as their harmonic mean.
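For clarity, the harmonic mean (HM) reported per dataset is the standard harmonic mean of the base and novel accuracies, e.g., for H-CLIP on ImageNet:

$$\mathrm{HM} = \frac{2 \cdot \mathrm{Base} \cdot \mathrm{Novel}}{\mathrm{Base} + \mathrm{Novel}} = \frac{2 \times 76.92 \times 70.98}{76.92 + 70.98} \approx 73.83.$$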


Table 2. Comparisons on fine-tuning CLIP for an open-vocabulary object detection task

| Method | AP$_{50}^{\text{Base}}$ | AP$_{50}^{\text{Novel}}$ |
|---|---|---|
| CLIP | 21.6 | 36.4 |
| CLIM (full fine-tuning) | 25.7 | 42.5 |
| LoRA [c] | 24.4 | 41.5 |
| H-CLIP | 25.1 | 42.9 |

The results demonstrate that our method generalizes to an open-vocabulary object detection task, even performing comparably to the full fine-tuning method (CLIM).


[a] Learning to Prompt for Vision-Language Models. IJCV22.

[b] CLIM: Contrastive Language-Image Mosaic for Region Representation. AAAI24.

[c] LoRA: Low-Rank Adaptation of Large Language Models. ICLR22.


Comment

Dear Reviewers

This is another reminder to engage with the authors in this phase of the rebuttal. The deadline to respond to the authors is EOD today, Anywhere on Earth time.

Final Decision

The authors propose a new, efficient finetuning method for open-vocabulary semantic segmentation: As such models leverage both CLIP image- and text-encoders, the authors perform PEFT (parameter-efficient finetuning) in hyperspherical space to preserve the alignment of the two modalities. Using this method, the authors show improvements over conventional PEFT methods.

A common concern among reviewers was the novelty of the method: Reviewers remarked that the approach is similar to Orthogonal Finetuning (OFT). OFT provides orthogonality constraints on both image- and text-encoders, and the authors replied in the rebuttal that the proposed POF applies orthogonality constraints to only the text encoder, which is demonstrated to be beneficial for open-vocabulary semantic segmentation. This suggests that the method is too specific to a single task (open-vocabulary segmentation).

Separately, other reviewers remarked that the method appeared generic and could be applied to other tasks. The authors showed experiments on few-shot image classification tasks in the rebuttal. However, crucially, this missed a comparison with OFT. It is also not clear what the advantages of the proposed method would be over OFT in this scenario.

Reviewers were also lukewarm about the paper, with none of them willing to argue for acceptance.

Overall, the authors are encouraged to revise the paper with the various experiments conducted in the rebuttal. To improve the paper further, the authors should also study carefully the cases where orthogonality constraints on only the text-encoder are advantageous to orthogonality constraints on both encoders (OFT) to no orthogonality constraints at all (standard PEFT methods).