6.0

/10

Spotlight4 位审稿人

最低5最高7标准差1.0

3.8

置信度

正确性3.0

贡献度2.5

表达3.3

NeurIPS 2024

PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization

Yao Ni,Shan Zhang,Piotr Koniusz

OpenReview PDF

提交: 2024-04-24更新: 2025-01-16

摘要

关键词

GeneralizationRegularizationTransfer LearningParameter-Efficient Fine-Tuning

评审与讨论

审稿意见

评分: 5置信度: 42024-07-12

The paper proposes a consistency regularization for improving the generalization performance of PEFT methods. Specifically, the regularization constructs two predictions perturbed by different noises and penalizes the L2 distance between them. The paper theoretically studies the benefits and shows that the regularization reduces gradient norms. Empirically, the paper benchmarks the method on multiple tasks and datasets and achieves state-of-the-art performance.

优点

Theoretical results are clearly presented and easy to follow.
The paper presents an extensive evaluation of multiple datasets and tasks.

缺点

Missing related works and comparisons. The major weakness of the paper is the lack of discussion and comparisons with related works. For example, L2-SP [1] directly penalizes deviation in the weight space and ensures alignment. DELTA [2] penalizes feature differences similar to this paper. FTP [3] shows that feature deviation can be reduced to weight deviation in linear layers. While the experiment settings are diverse, the paper only compares to vanilla PEFT methods. The method should be compared to other regularization methods for a fair evaluation.
Increased computation. To construct the two predictions. The method requires two forward passes through the model. This can significantly increase the computation and memory requirement of the algorithm.
How does $D^{PACE}$ ensure alignment? It is clear that $D^{fp}$ (eq.5) encourages alignment. However, the proposed PACE regularization does not seem to encourage alignment explicitly.

[1] Xuhong, L. I., Yves Grandvalet, and Franck Davoine. "Explicit inductive bias for transfer learning with convolutional networks." International Conference on Machine Learning. PMLR, 2018.

[2] Li, Xingjian, et al. "Delta: Deep learning transfer using feature map with attention for convolutional networks." arXiv preprint arXiv:1901.09229 (2019).

[3] Tian, Junjiao, et al. "Fast trainable projection for robust fine-tuning." Advances in Neural Information Processing Systems 36 (2024).

问题

Could the authors provide discussion and comparisons to other robust regularization techniques?
Could the authors comment on how $D^{PACE}$ ensures alignment.

局限性

The paper does not have a potential negative societal impact.

作者回复

2024-08-07

We thank the reviewer for insightful questions that help refine our work further.

1. Comparison with L2SP, DELTA & FTP.

L2SP, DELTA, and FTP aim to retain pre-trained knowledge by aligning finetuned models with pre-trained ones, reducing distance in weight space, feature space and using projected gradient descent, respectively. They provide no guarantees for generalization error nor study it. But we will of course cite/compare with these interesting works in paper.

Our PACE is very different. We leverage the fact that small gradient norms contribute to flat loss landscapes, which enhances generalization (as per our Th. 2 & 3). PACE introduces a novel consistency regularization on output space over different adapter perturbations, implicitly reducing gradient norms (which improves generalization as per Th. 1) and aligning models.

Thus, Our alignment is very different than the above methods.

PACE offers a key advantage over other alignment methods:

guaranteed gradient reduction, which is crucial for model convergence,
robustness and generalization (Th. 1-2).

To validate PACE's effectiveness, we compared it with L2SP, DELTA, and FTP on CIFAR-100 (VTAB) and ImageNet (Domain Adaptation) datasets.

The results, presented in the table below, clearly demonstrate PACE's superior performance.

Method	CIFAR-100 (VTAB)	ImageNet	(Domain	Adapt)
	Source	-Sketch	-V2	-A	-R	Avg
$LoRA_{mul}+VPT_{add}$	74.9	78.3	30.6	68.5	14.1	32.5	44.8
$\$ +L2SP	75.9	78.5	30.4	68.7	14.9	33.5	45.2
$\$ +DELTA	76.4	78.4	30.8	68.7	14.6	33.7	45.2
$\$ +FTP	76.2	78.6	30.8	68.6	15.8	33.6	45.4
$\$ +PACE	79.0	79.0	31.8	69.4	16.3	35.2	46.3

2. Increased computation.

To achieve computation and memory efficiency on par with compared baselines, we have explored two variants to maintain similar computational and memory requirements:

1), PACE $\_{fast}$ : We store model outputs for each sample from the previous epoch ( ${\bf o}\_{e-1}=f_{e-1}({\bf x})$ ) in the CPU memory (or disk if preferred). During training, we feed these into the consistency regularization w.r.t. the current outputs ( $||f\_{e}({\bf x})-{\bf o}_{e-1}||\_2^2$ ) as two versions used different i.i.d. noises (meeting our theoretical need).
2), PACE $_{lazy}^{half}$ : At every $N^{th}$ iteration during training, we apply consistency regularization and halve the batch size.

The table below compares the maximum GPU memory usage, total training time, and accuracy for each task, demonstrating that PACE $\_{fast}$ and PACE $\_{lazy}^{half}$ significantly improve over the baseline while maintaining similar computational demands as baselines.

Method	CIFAR-100	VTAB	(ViT/16-B)	Camelyon	VTAB	(Swin-B)	ImageNet	DomAda	(ViT/16-B)
	Mem	Time	Acc	Mem	Time	Acc	Mem	Time	MeanAcc
Baseline	8.9g	29m	74.6	15.7g	33m	86.7	8.9g	161m	44.8
PACE	17.7g	53m	79.0	29.4g	60m	89.3	17.7g	278m	46.3
PACE $_{fast}$	9.0g	29m	78.3	15.7g	34m	88.8	9.0g	162m	46.1
PACE $_{lazy}^{half}$ (N=2)	9.3g	29m	78.7	15.7g	36m	89.2	9.0g	165m	46.0
PACE $_{lazy}^{half}$ (N=4)	9.3g	29m	78.4	15.7g	35m	88.9	9.0g	163m	45.6
PACE $_{lazy}^{half}$ (N=6)	9.3g	29m	78.4	15.7g	35m	89.0	9.0g	163m	45.7
PACE $_{lazy}^{half}$ (N=10)	9.3g	29m	78.2	15.7g	35m	88.9	9.0g	162m	45.6

3. How $D^{pace}$ align models.

Recall that $D^{pace}$ aims to reduce $||f({\bf \theta}_0+{\bf z}_1\odot \Delta{\bf \theta}) - f({\bf \theta}_0+{\bf z}_2\odot \Delta{\bf \theta})||_2^2$ where ${\bf \theta}_0, \Delta{\bf \theta}$ are pretrained and adapter weights and $f$ is the network, and ensure it is small for all noise ${\bf z}_1, {\bf z}_2\sim\mathcal{N}({\bf 1}, \sigma^2)$ .

Intuitively, imagine we drew $({\bf z}_1={\bf 0}, {\bf z}_2={\bf 1})$ or $({\bf z}_1={\bf 1}, {\bf z}_2={\bf 0})$ . Then it becomes clear we indeed reduce the distance between finetuned and pretrained model $D^{fp}=||f({\bf \theta}_0+\Delta{\bf \theta})-f({\bf \theta}_0)||_2^2$ .

To understand general case with any $({\bf z}_1,{\bf z}_2)$ , follow these steps:

Theoretically, with Prop. 1 and Theorem 3, the distance between finetuned and pretrained model can be approximated and upper bounded:

D^{fp}=[f({\bf \theta}_0+\Delta{\bf \theta}) - f({\bf \theta}_0)]^2\approx[\Delta{\theta}^T{\bf \nabla}-\frac{1}{2}\Delta{\theta}^T{\bf H}\Delta{\bf \theta}]^2 \leq 2d{\color{red}{||\Delta{\bf \theta}\odot{\bf \nabla}||_2^2}}+d^2{\color{red}{||(\Delta{\bf \theta}\Delta{\bf \theta}^T)\odot{\bf H}||_F^2}}

where symbols ${\bf \nabla, H}, d$ are the gradient, hessian matrix and dimension of weights.

Through Theorem 2, $D^{pace}$ can be approximated as:

D^{pace}\approx2\sigma^2{\color{red}||\Delta{\bf \theta}\odot{\bf \nabla}||_2^2}+\sigma^4{\color{red}||(\Delta{\bf \theta}\Delta{\bf \theta}^T)\odot{\bf H}||_F^2}

where $\sigma^2$ is the noise variance.

Since $\sigma$ and $d$ are constants during training, minimizing $D^{pace}$ leads to small ${\color{red}||\Delta{\bf \theta}\odot{\bf \nabla}||_2^2}$ and ${\color{red}||(\Delta{\bf \theta}\Delta{\bf \theta}^T)\odot{\bf H}||_F^2}$ . This results in a small upper bound for $D^{fp}$ , effectively reducing $D^{fp}$ and ensuring alignment of the finetuned model with the pretrained model.

Thus, PACE effectively aligns the finetuned model with the pretrained model even though PACE does not explicitly do that.

Figures 2(b) & 3(b) illustrate that distance from finetuned model to pretrained model become small after applying PACE, verifying that PACE ensures alignment.

Importantly:

Beside alignment, $D^{pace}$ ensures gradient reduction, as proven in Theorem 2 & shown empirically in Fig. 2(a) and 3(a). As per our motivation, this reduction is crucial for improving generalization, as established in Theorem 1.

2024-08-10

I appreciate the authors' response. The rebuttal addressed most of my questions, so I will raise my score. However, I am still concerned about the methods' computation costs. In the rebuttal, the authors proposed two more efficient processes. They either increased memory requirements or introduced additional hyper-parameters. These will hinder the practicality of the proposed method.

2024-08-11

We sincerely appreciate your quick reply and are delighted that our rebuttal has addressed most of your questions. We are grateful for the time and effort you have invested in reviewing our work and providing valuable feedback that has strengthened the background and clarity of our work.

Regarding your concern about the additional memory requirements of the introduced variants, we would like to clarify that PACE $\_{lazy}^{half}$ ideally introduces no additional memory requirement. Although it includes a hyperparameter $N$ , simply setting $N=2$ without any additional hyperparameter search yields much better results than the baseline.

For PACE $\_{fast}$ , the introduced memory to store the output from the model's classification head is marginal compared to the baseline GPU memory requirement. Below table compares the memory required by PACE $\_{fast}$ and the GPU memory required to train the baseline:

Dataset	Mem. PACE $\_{fast}$	Baseline GPU Mem	ratio
CIFAR-100 VTAB (ViT/16-B)	390KB	8.9GB	0.0042%
Camelyon VTAB (Swin-B)	7.81KB	15.7GB	0.000047%
ImageNet DomAda (ViT/16-B)	61MB	8.9GB	0.67%

As shown, memory required by PACE $_{fast}$ is trivial compared to the Baseline GPU memory requirement.

Despite the variants potentially increasing memory or training time in practice, we demonstrate that by simply reducing the batch size and training epochs, PACE $\_{fast}$ still outperforms the baseline while requiring much less GPU memory and training time.

Method	CIFAR-100	VTAB	ViT/16-B	Camelyon	VTAB	Swin-B	ImageNet	DomAda	ViT/16-B	Average
	Mem	Time	Acc	Mem	Time	Acc	Mem	Time	MeanAcc	Mem	Time	Acc
Baseline	8.9g	29m	74.6	15.7g	33m	86.7	8.9g	161m	44.8	11.1g	74m	68.7
PACE $_{fast}$ ( $\frac{1}{2}$ batch size, $\frac{1}{2}$ epochs)	5.4g	17m	78.1	8.6g	21m	88.9	5.4g	85m	45.8	6.5g	41m	70.9
PACE $_{fast}$ ( $\frac{1}{4}$ batch size, $\frac{1}{4}$ epochs)	3.5g	10m	77.8	6.0g	14m	88.7	3.5g	50m	45.6	4.3g	25m	70.7
PACE $_{fast}$ ( $\frac{1}{8}$ batch size, $\frac{1}{8}$ epochs)	2.9g	6m	77.2	5.2g	10m	88.6	2.9g	32m	45.5	3.7g	16m	70.4

The table shows that with 1/8 batch size and epochs, PACE $\_{fast}$ still outperforms the baseline by 1.7% while only using ~1/3 GPU memory and ~1/4 training time. This demonstrates the robustness and generalization benefits that PACE brings to models, allowing them to excel under constrained training configurations.

However, we want to emphasize that the most valuable part of our work lies in the theoretical contributions and analysis, especially the link between small gradient norm and better generalization, and the implicit effect of gradient reduction and mode alignment in PACE.

These insights provide a deeper understanding of neural networks and their behavior, which we believe will benefit the research community and inspire more effective and robust methods next.

We appreciate all the concerns raised by the reviewer and the opportunity for discussion. Kindly do let us know if there is anything else we can further clarify or if you have further questions.

审稿意见

评分: 7置信度: 32024-07-13

This paper introduces PACE, an extension to PEFT methods for ViTs that includes consistency regularization. The paper shows that consistency regularization encourages smaller gradient norms and better alignment between pre-trained and fine-tuned models, resulting in better fine-tuning performance than existing PEFT techniques.

优点

(S1): The technique introduced is well-presented and easy to implement, borrowing ideas from parameter-efficient fine-tuning (PEFT) and consistency regularization (CR) to present a combined technique that is empirically effective.

(S2): The paper rigorously grounds their approach via theoretical analysis, demonstrating how reduced gradient norms lead to better generalization. This is well presented and the analysis itself will be relevant to future work and related research in this field, as other techniques that emulate the same properties (eg: reduced gradient norms) can benefit from PACE.

(S3): The ablation study is well conducted and systematically demonstrates the usefulness of the authors’ design. Throughout the experimental section, the authors also include experiments that probe properties of the model relevant to their method (eg: Fig 2, 3, 4), demonstrating that their theoretical hypotheses are well justified in empirical results.

Overall, this paper is well-presented, technically sound, and is a pleasure to read. I think it will have a good impact on future research in this area.

缺点

(W1): The experimental baseline is not really clarified. Only a ViT-B/16 pre-trained on ImageNet is used. Was the pre-training supervised or unsupervised? Do the analysis and results differ for supervised vs self-supervised backbones? Most common and performant ViTs nowadays (eg: MAE [1], Dino [2]) are self-supervised, so it would be good to clarify if there are differences in the analysis that depend on the method of pre-training.

(W2): It would be interesting to see this technique extended to larger domain shifts (eg: WILDS [3]). Also, it is not clear how the amount of fine-tuning data affects the performance (beyond few-shot learning). I.e. can PACE make better use of large datasets from new domains more effectively than prior methods.

(W3): Minor quibble. It would be useful to include comparisons with orthogonal fine-tuning techniques (eg: BOFT [4], OFT [5]).

References:

[1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[2] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

[3] Koh, Pang Wei, et al. "Wilds: A benchmark of in-the-wild distribution shifts." International conference on machine learning. PMLR, 2021.

[4] Liu, Weiyang, et al. "Parameter-efficient orthogonal finetuning via butterfly factorization." arXiv preprint arXiv:2311.06243 (2023).

[5] Qiu, Zeju, et al. "Controlling text-to-image diffusion by orthogonal finetuning." Advances in Neural Information Processing Systems 36 (2023): 79320-79362.

问题

L220-221: Which pre-training method was used for these backbones? Was it supervised or self-supervised?

I don’t see PACE_h in table 6.

It would be great to see this extended to language applications!

局限性

The authors adequately discuss this section.

作者回复

2024-08-07

We thank the reviewer for insightful questions that help refine our work further.

1. Experiment settings.

ViT/16-B and Swin-B were pre-trained on ImageNet-21K using supervised learning. Our analysis extends to self-supervised pre-trained models as well, since our goal is to retain knowledge from large pre-training datasets, regardless of the specific supervision/pre-training method used.

To demonstrate the effectiveness of our approach across different pre-training methods, we conducted experiments on three VTAB datasets (CIFAR-100, Camelyon and Clevr-Count) using ViT/16-B models pre-trained by Masked Autoencoder (MAE) and DINO, both self-supervised methods trained on ImageNet-1K. The results are shown in the table below.

Method	MAE			Dino
	CIFAR-100	Camelyon	Clevr-Count	CIFAR-100	Camelyon	Clevr-Count
Full	10.3	74.6	52.5	37.2	73.1	34.5
Linear	18.7	79.9	57.1	49.1	82.5	44.2
$LoRA_{mul}+VPT_{add}$	25.1	82.7	82.1	58.1	85.4	55.7
$\quad$ +PACE	44.8	85.8	86.4	60.8	88.1	61.0

2. Large domain shift.

Indeed, our experiments on the VTAB benchmark already demonstrate PACE's effectiveness in handling large domain shifts. VTAB consists of 19 diverse tasks drawn from various domains, including several that exhibit significant domain shifts from the ImageNet pre-training data, especially the specialized and structured datasets.

Specifically, we have tested PACE on Camelyon (which is also part of WILDS) and Diabetic Retinopathy. These medical imaging tasks represent a substantial domain shift from natural images in ImageNet.

Our results show consistent improvements over the baseline across these diverse task with large domain shifts. This performance demonstrates PACE's ability to effectively adapt to new domains while leveraging pre-trained knowledge.

3. Vary number of training samples on FGVC.

As per request, we varied the data percent (10% to 100%) of training samples on FGVC. Table below shows that PACE gain more improvement when data size is small, which is motivated by our generalization error analysis of PACE (Theorem 1-3).

Method	CUB				NAB				Flowers				StanDogs				StanCars
	100%	50%	20%	10%	100%	50%	20%	10%	100%	50%	20%	10%	100%	50%	20%	10%	100%	50%	20%	10%
$LoRA_{mul}+VPT_{add}$	88.9	87.1	83.9	79.1	87.5	80.7	75.0	70.2	99.4	98.5	96.5	93.1	91.2	90.6	88.7	86.9	87.5	78.7	54.9	30.1
$\$ +PACE	89.8	88.4	85.5	81.4	88.8	82.9	77.5	73.8	99.5	99.2	97.9	96.1	92.2	91.8	90.9	89.8	88.8	80.5	57.3	33.2

4. Experiments on large domain shift and large dataset.

In line with current trends, our focus is on finetuning pre-trained models for downstream tasks, where small/medium size datasets are common. In these scenarios, PACE shows significant improvements. For larger datasets, our improvements are smaller, which aligns with Lemma 1 and Theorem 1 - large datasets already allow models to achieve good generalization, diminishing the need for additional generalization improvement techniques.

However, the theoretical findings from PACE, particularly the implicit gradient reduction and alignment, could motivate future work in areas dealing with large domain shifts in big datasets. While we have not extensively tested PACE in such scenarios, these insights could inspire new approaches for handling significant domain shifts, even when data is abundant.

5. Combine with OFT and BOFT.

We re-implemented OFT and BOFT following the code and experimental settings provided by the authors. For OFT, we found that the constrained version (COFT) yields better results than the unconstrained version.

The table below compares results with/without PACE on CIFAR-100 (VTAB) and ImageNet (Domain Adaptation), demonstrating that incorporating PACE leads to improved performance.

Method	CIFAR-100 (VTAB)	ImageNet	(Dom	Ada)
		Source	-Sketch	-V2	-A	-R	Avg
COFT	71.8	76.9	26.4	66.7	13.1	30.7	42.7
$\$ +PACE	75.3	77.8	27.9	68.2	14.9	32.9	44.3
BOFT	72.3	77.1	27.0	66.8	12.8	31.1	42.9
$\$ +PACE	75.7	77.9	28.3	68.2	14.7	33.4	44.5

6. PACE $_h$ .

PACE $_\text{merge}$ refers to PACE $_h$ , where perturbation is applied after merging the adapter feature $\Delta h(\cdot)$ with the pretrained layer feature $h_0(\cdot)$ , namely, PACE $_h$ perturbs $h(\cdot)=\Delta h(\cdot)+h_0(\cdot)$ . This typo has been corrected.

7. Experiments on language tasks.

Following VeRA (Kopiczko et al., ICLR 2024), we conducted GLUE benchmark experiments using RoBERTa-base. We report Matthew's correlation for CoLA, Pearson correlation for STS-B, and accuracy for other tasks. The table below compares PACE+LoRA with other methods, demonstrating PACE's effectiveness in language tasks.

Method	COLA	STSB	MRPC	RTE	QNLI	SST2	Avg.
Full	63.6	91.2	90.2	78.7	92.8	94.8	85.2
BitFit	62.0	90.8	92.7	81.5	91.8	93.7	85.4
Adapt $^D$	62.6	90.3	88.4	75.9	93.0	94.7	84.2
VeRA	65.6	90.7	89.5	78.7	91.8	94.6	85.2
LoRA	63.4	91.5	89.7	86.6	93.3	95.1	86.6
+PACE	65.0	92.3	92.0	86.9	93.6	95.8	87.6

2024-08-10

Thanks for your great response!

I think the paper is good and so I will keep my score.

I'm a little surprised by the CIFAR-100 column in the first table though-- it looks much lower than in the pdf. Am I misreading something?

2024-08-11

Esteem Reviewer,

Thank you for your prompt reply and valuable feedback, which has helped strengthen and clarify our paper.

Regarding the lower CIFAR-100 results on self-supervised MAE/DINO in comparison to our paper, there are two key differences:

1. Pretraining stage: Here ViT was pretrained using self-supervised learning on ImageNet 1K, whereas in our paper, it was pretrained with supervised learning on ImageNet 21K.
1. Fine-tuning stage: For the VTAB benchmark, we fine-tuned the network using only 1,000 images without data augmentation. This means CIFAR-100 on VTAB has only 10 images per class, and the downstream task was supervised learning.

These differences result in less knowledge overlap between pretraining and fine-tuning compared to our paper's setup.

However, applying data augmentation techniques in the finetuning stage (in agreement with pretraining augmentation types) to increase this overlap improved PACE results:

MAE increased from 44.8 to 47.2
DINO improved from 60.8 to 62.9

Despite the improvement, these results remain lower than those in our paper due to the smaller number of images per class in CIFAR-100 VTAB and reduced knowledge overlap with MAE/DINO.

Thank you again for your insightful review and for recognizing the potential impact of our work. We sincerely appreciate your time and expertise. We hope this explanation clarifies the discrepancy in results. If you have any further concerns or questions, kindly do let us know and we will do our best to clarify further details.

审稿意见

评分: 5置信度: 52024-07-15

This paper proposes to regularize the model consistency by optimizing the fine-tuned model to remain consistent for the same sample under different perturbations.

优点

The paper is well-written.
Several experiments are conducted on four visual adaptation tasks: VTAB-1k, FGVC, few-shot learning, and domain adaptation.

缺点

Novelty Concerns: Incorporating consistency regularization with PEFT approaches is already commonplace in multimodal parameter-efficient fine-tuning. For example, VioLET [1], PromptSRC [2], and CoPrompt [3] propose to use additional encoders with shared weights to keep the consistency between the updated model and the original one. Considering that these papers focus on the consistency constraints on each modality itself, rather than cross-modal constraints. Therefore, the approach proposed in this paper to copy the original model parameters to enforce consistency lacks innovation and is simply a repetition of existing ideas.
Efficiency Concerns: The proposed method, as referenced in Eq. 12, necessitates multiple forward passes during fine-tuning, whereas most compared methods require only a single pass. This discrepancy significantly impacts efficiency. The authors should provide detailed efficiency metrics alongside performance results to enable fair comparisons.
Experimental Settings: Several experimental settings appear unconventional. The proposed method is based on a new baseline, LoRAmul+VPTadd. However, PACE does not seem to depend on specific designs of the base PET methods. Therefore, it would be more compelling to incorporate PACE into more general baseline methods such as AdapterFormer or existing SOTA methods like GLoRA. Demonstrating PACE's ability to enhance various PET methods would strengthen its contribution. Additionally, PACE requires 300 epochs for fine-tuning on the VTAB-1K dataset, whereas other methods typically require only 100 epochs. The necessity of such prolonged fine-tuning should be justified. Furthermore, in some benchmarks (e.g., Tables 2 and 3), PACE shows only limited improvement, which undermines its robustness.
Typos and Notational Errors: For instance, L505 misses a square in the second row. Additionally, the equal sign should be replaced with an approximate equal sign.

[1] Wang Y, Liu Y, Zhang X, et al. VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 4595-4605.

[2] Khattak M U, Wasim S T, Naseer M, et al. Self-regulating prompts: Foundational model adaptation without forgetting[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 15190-15200.

[3] Roy, Shuvendu, and Ali Etemad. "Consistency-guided prompt learning for vision-language models." arXiv preprint arXiv:2306.01195 (2023).

问题

Please see the weaknesses.

局限性

Yes

作者回复

2024-08-07

We thank the reviewer for helpful comments.

1. Copy the original model for consistency regularization not novel.

We believe there is misunderstanding.

Existing "consistency" models (we will cite/compare these interesting works in paper) align features of fine-tuned model with the pretrained model. They provide no guarantees for generalization error. Our PACE guarantees control of generalization by interplay of noise modulator in network branches. Our "consistency" is thus completely different.

We introduce 3 key innovations:

1) Novel consistency mechanism: Unlike previous methods that apply consistency regularization between finetuned and pretrained models, PACE applies consistency between differently perturbed versions of the same finetuned model while learning adapter parameters. This crucial difference addresses gradient limitations of traditional alignment approaches.

2) Gradient reduction: We prove that naive finetuned-pretrained alignment can not guarantee gradient reduction and even cause gradient explosion (Prop. 1, Figs 4, 7, 8). PACE overcomes this issue, ensuring gradient reduction and better generalization.

3) Theoretical guarantees: We provide rigorous analysis (Thm. 2 & 3) showing how PACE implicitly achieves gradient regularization and model alignment even though PACE does not explicitly do so, a key advancement over previous works. Thm. 1 explains how controlling gradient helps generalization.

4) Superior performance: Our experiments show PACE outperforms both FPA (explicitly align finetuned and pretrained model in output space) and DELTA (Li et al, ICLR 2019) (explicitly align finetuned and pretrained model in feature space with supervised attention) across various datasets (table below).

Method	CIFAR-100 (VTAB)	ImageNet	(Dom	Ada)
	Source	-Sketch	-V2	-A	-R	Avg
LoRA $\_{mul}$ +VPT $\_{add}$	74.9	78.3	30.6	68.5	14.1	32.5	44.8
$\$ +DELTA	76.4	78.4	30.8	68.7	14.6	33.7	45.2
$\$ +FPA	76.6	78.8	31.2	68.6	14.7	33.5	45.3
$\$ +PACE	79.0	79.0	31.8	69.4	16.3	35.2	46.3

PACE is a significant step forward in consistency regularization, with a theoretically grounded/empirically superior approach.

2. Efficiency concern.

We have explored two variants with similar computational/memory needs as baseline:

1) PACE $\_{fast}$ : We store model outputs for each sample from the previous epoch ( ${\bf o}\_{e-1}=f_{e-1}({\bf x})$ ) in CPU memory (or disk). During training, we feed these into the consistency regularization w.r.t. the current outputs ( $||f\_{e}({\bf x})-{\bf o}_{e-1}||\_2^2$ ) as two versions used different i.i.d. noises (our theoretical need).

2), PACE $_{lazy}^{half}$ : At every $N^{th}$ iteration during training, we apply consistency regularization and halve the batch size.

Table below compares the max GPU memory usage, total training time, and accuracy for each task, showing that PACE $\_{fast}$ and PACE $\_{lazy}^{half}$ significantly improve over the baseline under similar compute demands.

Method	CIFAR-100	VTAB	(ViT/16-B)	Camelyon	VTAB	(Swin-B)	ImageNet	DomAda	(ViT/16-B)
	Mem	Time	Acc	Mem	Time	Acc	Mem	Time	MeanAcc
Baseline	8.9g	29m	74.6	15.7g	33m	86.7	8.9g	161m	44.8
PACE	17.7g	53m	79.0	29.4g	60m	89.3	17.7g	278m	46.3
PACE $_{fast}$	9.0g	29m	78.3	15.7g	34m	88.8	9.0g	162m	46.1
PACE $_{lazy}^{half}$ (N=2)	9.3g	29m	78.7	15.7g	36m	89.2	9.0g	165m	46.0
PACE $_{lazy}^{half}$ (N=4)	9.3g	29m	78.4	15.7g	35m	88.9	9.0g	163m	45.6
PACE $_{lazy}^{half}$ (N=6)	9.3g	29m	78.4	15.7g	35m	89.0	9.0g	163m	45.7
PACE $_{lazy}^{half}$ (N=10)	9.3g	29m	78.2	15.7g	35m	88.9	9.0g	162m	45.6

3. PACE+AdaptFormer & GLoRA.

Method	CIFAR-100 (VTAB)	ImageNet	(Dom	Ada)
		Source	-Sketch	-V2	-A	-R	Avg
AdaptFormer	70.6	77.4	26.5	67.4	12.4	28.7	42.4
+PACE	74.8	78.2	27.4	67.9	13.9	31.7	43.8
GLoRA	75.9	78.2	30.3	68.1	13.5	31.6	44.3
+PACE	78.6	78.8	31.7	69.0	15.9	34.4	45.9

4. PACE with 100 epochs on VTAB.

We use 300 epochs for VTAB tasks (small gain over 100 ep.) but PACE does not needs more time than others to converge. As the optimizer uses cosine learning rate decay, reducing epochs to 100 has minimal impact.

To ensure fair memory/compute budgets, we tested PACE with 1/2 batch size, 50 epochs where PACE still improves baseline accuracy by 2.1% and outperforms the previous SOTA GLoRA (500 train epochs and 30 for parameter search). Thus, PACE is efficient/effective.

#Epoch	Method	Natural	Specialized	Structured	Avg
530	GLoRA	83.61	87.02	63.27	77.97
100	Baseline	81.94	85.40	61.40	76.24
100	+PACE	83.94	87.44	64.62	78.67
50	+PACE (1/2 batch size)	83.77	87.32	63.92	78.34
200	Baseline	82.28	85.30	61.64	76.40
200	+PACE	84.13	87.57	64.85	78.85
300	Baseline	82.41	85.00	61.80	76.40
300	+PACE	84.32	87.55	65.13	79.00

5. Tables 2/3: results not that good.

1) In Table 2, PACE shows gains on all shot settings, especially in low-data scenarios crucial for fine-tuning large models on downstream tasks with limited data, and consistently with our motivation to improve the generalization error (Thm. 1-3).

Table below shows average results across datasets and shots. PACE consistently outperforms baseline by 0.9 to 2.9.

Method	AvgAcc	Improvement
LoRA	61.1
$\$ +PACE	63.0	+2.9
VPT	61.0
$\$ +PACE	61.9	+0.9
LoRA $\_{mul}$ +VPT $\_{add}$	62.7
$\$ +PACE	64.2	+1.5

2) In Table 3, we observe a 0.7 gain (expected on large data size). When reducing the data size, we see more gains.

Below is averaged accuracy over different data percentages on FGVC tasks (average gain: 1.8 points).

Method	100% data	50% data	20% data	10% data	Avg
LoRA $\_{mul}$ +VPT $\_{add}$	90.8	87.1	79.8	71.8	82.3
$\$ +PACE	91.5	88.5	81.8	74.8	84.1

6. Typos.

Duly noted. We will fix.

2024-08-12

Thank you for the detailed response. Most of my concerns have been resolved, and I have decided to raise my rating to 5. By the way, I would like the authors to discuss in detail the similarities and differences with the multimodal fine-tuning approach in maintaining modal consistency in the revised manuscript. Additionally, while the authors propose two new variants, these designs are relatively minor tricks rather than novel technical contributions. I hope the authors will further investigate ways to avoid the higher computational complexity associated with multiple forward passes.

评论- Thank you

2024-08-14

Esteem Reviewer,

We thank you for your prompt response. We greatly appreciate your time, effort, and the concerns you have shared with us, which significantly improve the completeness, strength, and clarity of our paper.

1. We will be sure to discuss in detail the similarities and differences with the multi-modal fine-tuning approaches maintaining consistency.

Previous works such as VioLET, PromptSRC, and CoPrompt identified the forgetting problem during multi-modal fine-tuning. They propose aligning the fine-tuned model with the pre-trained model, and maintain consistency to prevent catastrophic forgetting.

Specifically:

VioLET prevents forgetting through collaborative multi-modal gradients
PromptSRC uses L1 distances between pre-trained and fine-tuned outputs
CoPrompt employs cosine similarity between pre-trained and fine-tuned outputs.

PACE establishes the link between smaller gradient norm and better generalization, identifying the gradient issues in typical alignment, and proposes a new consistency regularization between fine-tuned models with different noise perturbations, rather than between fine-tuned and pre-trained models.

PACE aligns models and regularizes gradients, thus preventing catastrophic forgetting and improving generalization.

2. Regarding novel ways to avoid higher computational costs.

We value the reviewer's suggestions and will explore this further.

While seemingly simple, these variants are firmly grounded in our theoretical analysis of gradient behavior during fine-tuning. They serve as practical demonstrations of our core insights while avoid higher computational costs.

The simplicity of these variants allows for easy extension and adaptation to various scenarios and model architectures. We want to emphasize that the most valuable contributions of our work are establishing an important link between small gradient norms and better generalization and introducing a new type of consistency regularization.

We hope our ideas will benefit a deeper understanding of fine-tuning neural networks and their behavior, and will motivate further works.

We truly thank the reviewer again for the constructive suggestions that help us enhance the completeness and clarity or our work. If there is anything we can clarify/improve further, kindly do let us know.

审稿意见

评分: 7置信度: 32024-07-16

The paper proposes a consistency regularization that minimizes the squared L2 distance between two outputs of a model obtained using the same parameters, but multiplying the activations by 2 different noise samples. It proves that the population loss is bounded by the gradient norm, indicating that smaller gradients are better for generalization. It proposes the consistency regularization for aligning the fine-tuned model with the pretrained model for better generalization, and proves that the consistency regularization penalizes the first and second order gradients of the parameters of the model. This results in the training process preserving the knowledge learned by the pretrained model. It also proposes an efficient method for implementing the consistency regularization where the regularization is applied every N iterations.

优点

The paper provieds a theoretical analysis and intuition for the consistency regularization and links generalization to small gradient norm.
The empiracal results show that using the consistency regularization along with existing finetuning methods improves their perfromance.
The consistency regularization can be made somewhat efficient by adding the noise to the features instead of parameters, and by using the consistency regularization in intervals rather than at every iteration.

缺点

Gap between theoretical analysis and practical implementation

The theoretical analysis is done for functions that map from R^d R. However, the in practice, the outputs of the heads of the classification models are multi-dimensional. It's not clear from the proofs that the theoretical results hold for multi-dimensional outputs. While the experimental results do suggest this, a theoretical proof for this would strengthen the paper.

Increased compute and memory requirements

Since the consistency regularization needs to run the model twice (with shared parameters but with two different noise samples), and also backpropagate through both, the compute and the memory required is doubled.
The authors use H100 GPUs (96GB) for their experiments, hence the experiments are feasible. However, with lower end GPUs, this might not be feasible. Moreover, the requirements for large models (in the scale of billion parameters) might be prohibitive.
With this increased compute and memory requirements, the method's use as a PEFT method is limited.
While the paper proposes a method for efficient implementation by applying the regularization every N iterations, it is only tested on one dataset from VTAB-1k (CIFAR100). This should be tested on more datasets for the claim to hold.

Lack of evaluation beyond vision tasks

The paper does not study natural language tasks (language understanding and generation). This would be necessary to show the generality of the method to other domains besides vision tasks.

Large training time

The method is trained on VTAB for a large number of epochs (300). This means that the method can take a long time to converge. Are the comparisons with the other methods (other than the chosen baselines) done with a similar compute budget?
The method is trained for 100 epochs for the FGVC tasks. Why does the method require fewer epochs for FGVC tasks compared to VTAB?

问题

Line 155-156 "as even smaller weight changes can lead to significant divergence in the output space": Can the authors provide references/experimental results to backup this statement?
In Figure 2 (a) and Figure 3 (b), how is the plot shown for the gradient norm of parameters in multiple layers?
See Point 4 in Weaknesses.

局限性

Authors have addressed the limitations.

作者回复

2024-08-07

We thank the reviewer for insightful questions that help refine our work further.

1. Theoretical analysis done for functions from $R^d\rightarrow R$ .

Thank you. In practice, we use the squared L2 distance for multi-dimensional outputs for $D^{fp}$ and $D^{pace}$ , which allows our one-dimensional analysis to naturally generalize to multiple dimensions. For example, for a vector-valued function in the naive alignment, $f({\bf \theta}) = [f_1({\bf \theta}), ..., f_m({\bf \theta})]$ , where $m$ is the output dimension, we have:

$||f({\bf \theta}_0) - f({\bf \theta}_0 + \Delta {\bf \theta})||\_2^2 = \sum\_{i=1}^m [f_i({\bf \theta}_0) - f_i({\bf \theta}_0 + \Delta {\bf \theta})]^2.$

This equality shows that the squared L2 distance in multiple dimensions is simply the sum of non-negative squared differences in each dimension. Consequently, this additive nature enables our one-dimensional analysis to extend seamlessly to multiple dimensions in practice, aligning with our empirical observations.

2. Increased computation and memory requirements.

Thank you. Motivated by your questions, we have explored two variants to maintain similar computational and memory requirements as the baseline:

1), PACE $\_{fast}$ : We store model outputs for each sample from the previous epoch ( ${\bf o}\_{e-1}=f_{e-1}({\bf x})$ ) in the CPU memory (or disk if preferred). During training, we feed these into the consistency regularization w.r.t. the current outputs ( $||f\_{e}({\bf x})-{\bf o}_{e-1}||\_2^2$ ) as two versions used different i.i.d. noises (meeting our theoretical need).
2), PACE $_{lazy}^{half}$ : At every $N^{th}$ iteration during training, we apply consistency regularization and halve the batch size.

Method	CIFAR-100	VTAB	(ViT/16-B)	Camelyon	VTAB	(Swin-B)	ImageNet	DomAda	(ViT/16-B)
	Mem	Time	Acc	Mem	Time	Acc	Mem	Time	MeanAcc
Baseline	8.9g	29m	74.6	15.7g	33m	86.7	8.9g	161m	44.8
PACE	17.7g	53m	79.0	29.4g	60m	89.3	17.7g	278m	46.3
PACE $_{fast}$	9.0g	29m	78.3	15.7g	34m	88.8	9.0g	162m	46.1
PACE $_{lazy}^{half}$ (N=2)	9.3g	29m	78.7	15.7g	36m	89.2	9.0g	165m	46.0
PACE $_{lazy}^{half}$ (N=4)	9.3g	29m	78.4	15.7g	35m	88.9	9.0g	163m	45.6
PACE $_{lazy}^{half}$ (N=6)	9.3g	29m	78.4	15.7g	35m	89.0	9.0g	163m	45.7
PACE $_{lazy}^{half}$ (N=10)	9.3g	29m	78.2	15.7g	35m	88.9	9.0g	162m	45.6

3. Experiments on language tasks.

Following VeRA (Kopiczko et al., ICLR 2024), we conducted GLUE benchmark experiments using RoBERTa-base. We report Matthew's correlation for COLA, Pearson correlation for STSB, and accuracy for other tasks. The table below compares PACE+LoRA with other methods, demonstrating PACE's effectiveness in language tasks.

Method	COLA	STSB	MRPC	RTE	QNLI	SST2	Avg.
Full	63.6	91.2	90.2	78.7	92.8	94.8	85.2
BitFit	62.0	90.8	92.7	81.5	91.8	93.7	85.4
Adapt $^D$	62.6	90.3	88.4	75.9	93.0	94.7	84.2
VeRA	65.6	90.7	89.5	78.7	91.8	94.6	85.2
LoRA	63.4	91.5	89.7	86.6	93.3	95.1	86.6
+PACE	65.0	92.4	92.0	86.9	93.8	95.8	87.6

4. Results of 100 epochs on VTAB.

We use 300 epochs for VTAB tasks as we observed slight extra improvements over 100 epochs. However, it does not mean PACE needs more training time to converge. Since the optimizer uses cosine learning rate decay, reducing the number of training epochs to 100 has minimal impact on performance. For FGVC tasks, we maintained 100 epochs as their larger datasets (1k-20k samples vs. VTAB's 1k) ensure convergence with fewer epochs.

The table below shows results for different training epochs on VTAB.

To ensure fair memory and computational budgets, we tested PACE with half batch size and 50 epochs. Under these conditions, PACE still improves baseline accuracy by 2.10% and outperforms the previous SOTA GLoRA, which uses 500 epochs for training and 30 for parameter search. This demonstrates PACE's efficiency and effectiveness across various training configurations.

#Epoch	Method	Natural	Specialized	Structured	Avg
530	GLoRA	83.61	87.02	63.27	77.97
100	Baseline	81.94	85.40	61.40	76.24
100	+PACE	83.94	87.44	64.62	78.67
50	+PACE (half batch size)	83.77	87.32	63.92	78.34
200	Baseline	82.28	85.30	61.64	76.40
200	+PACE	84.13	87.57	64.85	78.85
300	Baseline	82.41	85.00	61.80	76.40
300	+PACE	84.32	87.55	65.13	79.00

5. Why smaller weight changes lead to significant divergence in the output space.

Smaller weight changes can lead to significant divergence in output space, particularly when a network is sensitive to weight changes or has entered a sharp local minimum. In such cases, even minor weight adjustments can cause substantial output changes or even incorrect predictions. This behavior contrasts with robust networks or those in flatter minima. Consequently, constraining changes in weight space alone is suboptimal, as it disregards the loss landscape, which is also influenced by gradients. For further details on this phenomenon, kindly refer to the papers listed below.

Wu et al, Adversarial Weight Perturbation Helps Robust Generalization, NeurIPS 2020.

Foret et al, Sharpness-aware minimization for efficiently improving generalization, ICLR 2021.

He et al, Defending and Harnessing the Bit-Flip based Adversarial Weight Attack, CVPR 2020.

6. How is the plot shown for the gradient norm of parameters in multiple layers?

We plotted the sum of gradient norms from all layers.

2024-08-09

Thank you for the detiled response and for clarifying my questions. I have read the authors' response as well as other reviews. Overall, the method is promising as an add on to standard finetuning methods, since it has demonstrated applicability to LoRA, OFT and BOFT. Furthermore, it shows theoretical links between gradient norm and generalization which is valuable to the community. I would encourage the authors to add the case of $R^d \rightarrow R^d$ this to the manuscript for completeness. I also thank the authors for providing references in point 5 of the rebuttal. However, I will maintain my original score. Following is a summary of my reasoning:

The newly proposed efficient variants do seem effective, but the come at a cost of increased CPU memory/disk space. In case where an offload to the disk is required, it increases the read/write time (which can be very slow). Thus, the method will have additional overheads no matter the variant.
Due to the additional overhead, I would hesitate to call it a PEFT method, although it leads to a small improvement in accuracy.
While the authors provided experiments on GLUE, the method is still not tested on generative tasks, which is an important application area.

评论- Minor clarifications.

2024-08-09

Esteem Reviewer,

Thank you for your prompt and detailed reply.

We are humbled that our rebuttal has addressed most of your concerns and that you recognize our method is promising and has theoretical value, which we believe will motivate more effective PEFT methods in the future.

We want to thank again for your constructive suggestions, which have truly strengthened our work.

Regarding the memory cost. We appreciate the concern about efficiency. Our method is designed for fine-tuning models on downstream tasks, which typically involve limited data. We want to clarify that PACE $_{fast}$ only needs to store the output of the model's classification head for the training samples. This means that the additional memory required by PACE $\_{fast}$ is trivial (and can be stored even on the GPU).

Below we compare the memory required by PACE $\_{fast}$ and the GPU memory required to train the baseline.

Dataset Mem. PACE $\_{fast}$ Baseline GPU Mem ratio
CIFAR-100 VTAB (ViT/16-B) 390KB 8.9GB 0.0042%
Camelyon VTAB (Swin-B) 7.81KB 15.7GB 0.000047%
ImageNet DomAda (ViT/16-B) 61MB 8.9GB 0.67%

As shown, memory required by PACE $_{fast}$ is trivial compared to the Baseline GPU memory requirement and can even stored into the GPUs.

Even in the rare scenario of fine-tuning on the full ImageNet 1K dataset (1.2 million samples), PACE $_{fast}$ requires only 4.8G of additional memory for temp. storage of the output of the model's classification head. This can be easily done on GPU alone. This is significantly smaller than the dataset itself (>100G) and can be easily accommodated in CPU/GPU memory without needing disk storage.

We should have not mentioned disk storage as it is not needed. Even for datasets with 10 million images or more, with typical multi node processing, we cannot see any need for a disk usage. We just mean extreme situations, e.g., a user with a single 3090GPU and 4GB RAM trying to use on full ImageNet. Even then, half-precision floats would fit output into 2.4GB.

While PACE $\_{fast}$ requires trivial small amount of memory, PACE $\_{lazy}^{half}$ does not require additional memory and also enjoy competitive performance.
We appreciate your question regarding parameter efficiency. Our understanding of parameter efficiency is that when adapting to downstream tasks, especially during inference, the required additional parameters should be minimal. We can confidently state that PACE adheres to this principle:
- PACE introduces no additional learnable parameters for model adaptation. The small set of model classification head outputs we keep for PACE $_{fast}$ are not learnable and will be discarded after training.
- During inference, all PACE variants require no extra parameters beyond the baseline PEFT method.
  
  Therefore, PACE maintains the parameter efficiency of the baseline PEFT method while enhancing performance.
  
  We will add all detailed clarifications into the paper and we are keen to see reviewer's further suggestions that we will accommodate accordingly.
Regarding generative tasks, we kindly ask for reviewer's understanding that PACE is built upon the generalization theory for discriminative tasks.

Currently, the generalization theory for generative models remains underdeveloped and requires collective effort from the entire research community.

Nonetheless, we are looking at this problem and we are trying to figure out the potential implementation details. We will try and see if we are able to get something implemented and processed within remaining discussion time but we feel this is perhaps outside of the scope of the paper given the theoretical framework.

Meantime, kindly do let us know if you have any other questions that we can answer.

2024-08-10

Thank you for providing further clarifications. I will revise some of my previous comments:

I acknowledge the point made about the additional memory requirements. I would ask the authors to include these statistics in the revised manuscript to give the readers a clear idea about the overheads associated with the method.
For point 3, I see that Section 3.2 mentions the structure of the models the theoretical analysis is done for (network with a classification head), but later sections do not mention that the method applies only to discriminative tasks. I would also ask the authors to make this clear in the manuscript, since generative models are an important class of models which have shown a significant impact and interest, and the baseline methods like LoRA work well for those. This is required to give the readers the correct context.

Given that these comments are incorporated, I do not mind raising my score from 6 to 7.

2024-08-10

Thank you for your thoughtful feedback.

We are very glad that our response has resolved your concerns.

We appreciate your active involvement and advice, which has provided valuable opportunities to improve our work and make it clearer for the readers.

We will incorporate all your suggestions, along with those from other reviewers, into the final version.

Below are the changes (we will doublecheck if we did not miss any other details/requests) we will implement in the final version of our paper based on your suggestions:

Include $\mathbb{R}^d \rightarrow \mathbb{R}^m$ before Prop. 1.

Introduce two efficient variants in Sec 3.5. Add Sec. 4.4 to report experiments on these variants and include overhead statistics in the supplementary material.

Add experiments on language tasks and 100-epoch results on VTAB in Sec. 4.1.

Include citations and explain why smaller weight changes lead to divergence in Sec. 3.3.

Clarify how to calculate the gradient norm in Sec. 4.2.

Clarify that PACE requires no additional parameters beyond those required by baseline PEFT methods in Sec. 3.5.

Clarify our context focuses on discriminative tasks in Sec 3.3 and 3.5.

Additionally, we will implement other changes based on suggestions from other reviewers.

Compare PACE with VioLET, PromptSRC, CoPrompt, L2-SP, DELTA, and FTP in Sec 2, with experimental comparisons in Sec. 4.3.

Add experimental comparisons with AdaptFormer, GLoRA, OFT, and BOFT in Sec. 4.3.

Include MAE and Dino experiments in Sec 4.1

Refine explanation for Theorem 3 to clarify why $D^{pace}$ aligns models despite not using explicitly alignment like exisiting methods.

Present experiments with varying training sample sizes on FGVC and provide average improvements in Tables 2 and 3.

Address reviewer-identified typos and conduct a comprehensive proofreading of the entire paper.

We truly appreciate your time and effort.

If there is anything important we missed/oberlooked, kindly do let us know.

评论- Generative Experiment

2024-08-14

As per reviewer's request, we have finalized conducting experiments on language generation tasks by fine-tuning Phi-3-mini-4k-instruct on the GSM8K dataset (Cobbe et al., "Training Verifiers to Solve Math Word Problems," arXiv 2021) using causal language modeling.

We used learning rate of 2e-6, batch size of 4, LoRA rank of 16, prompt "Answer below question. First think step-by-step and then answer the final number:\n\n<Question>" as instruction and fine-tune models on the training set and evaluated the performance on the test set.

The results are as follows:

Method	Accuracy
Pre-trained	62.01
Full	73.16
LoRA	75.66
$\$ +PACE	78.77

As shown, PACE enhances LoRA's performance on this generative language task, despite its roots are in discriminative theory.

Although generalization theory for generative modeling lags behind, the principles of aligning fine-tuned and pre-trained models to retain knowledge and achieving smaller gradient norms for better generalization remain effective in this setting too.

We thank the reviewer again for challenging us to go deeper and evaluate our model further.

作者回复

2024-08-07

We thank all the reviewers for constructive feedback and questions shaping our revised paper.

We have addressed all comments in individual responses to each reviewer.

Below we just provide few highlights:

1. Increased computation and memory requirements: we have provided now PACE $\_{fast}$ and PACE $\_{lazy}^{half}$ variants which enjoy almost identical memory and compute time footprint as baselines while still bringing 2-3% gains.
1. Experiments on language tasks. Following VeRA (Kopiczko et al., ICLR 2024), we conducted GLUE benchmark experiments using RoBERTa-base, and showed gains.
1. We provided results on 50, 100 epochs on VTAB.
1. Consistency novelty. We have explained that our "consistency" model is very different to existing works.
- We do not explicitly align per se with a frozen pretrained model.
- We use noise modulators to align two branches. These noise modulators implicitly regularize gradient (Theorems 2 & 3) and improve generalization error (Theorem 1). Standard works on "consistency" with frozen pretrained model for alignment do not enjoy these properties.
1. We provided results on PACE+AdaptFormer & GLoRA.
1. We have provided experiments on finetuning MAE and DINO (self-supervised models).
1. We provided experiments on OFT+PACE and BOFT+PACE.
1. We provided comparisons with L2SP, DELTA and FTP. We will of course cite these works and discuss accordingly.
1. We detailed how $D^{pace}$ align models while improving generalization error.

We truly hope this rebuttal, theoretical arguments and empirical arguments will convince reviewers about novelty of our work:

Current alignment methods do not study or lower generalisation error (Theorems 1-3)
Our "consistency" is not what other works do. We use noise modulation to achieve regularized gradients and hence improved generalization error.
Our experiments as predicted by theory consistently outperform other 'consistency" approaches.
Our compute and memory cost remains in line with baselines thanks to PACE $\_{fast}$ and PACE $\_{lazy}^{half}$ while enjoying 2-3% gains.

Kind regards,
Authors

评论- Reviewer-Author Discussion

2024-08-07

Dear Reviewers,

Thank you very much for your big efforts.

Now, the authors' rebuttal are available, please check them, as well as other reviewers' comments, to consolidate your comments if needed, and to interact with authors and/or other reviewers and ACs for any further clarifications as you see fit.

We appreciate you for the continuous support.

Thank you.

评论- Let's wrap up the reviewer-author discussion

2024-08-13

Dear Reviewers and Authors,

Thank you very much for all the informative discussions. We appreciate you for the big efforts.

To reviewers: if you need any further clarifications from authors, please make sure they can be addressed by text-only response without additional experimental results.

To authors: If you have any clarification and summary you would like to share with reviewers, or request their comments on some of your rebuttal and/or discussion points, please add those and/or kindly remind the reviewer(s).

Thank you.

评论- thank you

2024-08-14

Esteem AC and SAC,

We thank for your diligence and time dedicated to our work. We have had in-depth discussions with reviewers. Their constructive feedback, expertise and observations have helped us resolve key issues and improve the manuscript accordingly.

We will include all suggestions in the revised paper accordingly.

We thank again for the opportunity to discuss our work, which has led to invaluable improvements.

Best regards,
Authors

最终决定Accept (spotlight)

2024-09-25

This paper presents a consistency-regularization method for the knowledge/generalizability retention of pretrained (large) backbones in parameter-efficient fine-tuning (PEFT). It was reviewed by several knowledgable reviewers. After the rebuttal, the main consensus from all the reviewers was that the proposed method provides a new consistency regularization analysis with effective practical tricks in implementation. This meta-review concurs and recommends acceptance.

The authors are encouraged to carefully revise the paper due to the significant updates provided in the rebuttal and discussions. In addition to the experimental results, the authors should also add more in-depth analyses for the two new implementation tricks provided in the rebuttal to address the reviewers' concerns remained after the reviewer-author discussion.

PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization

摘要

评审与讨论

优点

缺点

问题

局限性

We thank the reviewer for insightful questions that help refine our work further.

1. Comparison with L2SP, DELTA & FTP.

2. Increased computation.

3. How DpaceD^{pace}Dpace align models.

Importantly:

优点

缺点

问题

局限性

We thank the reviewer for insightful questions that help refine our work further.

1. Experiment settings.

2. Large domain shift.

3. Vary number of training samples on FGVC.

4. Experiments on large domain shift and large dataset.

5. Combine with OFT and BOFT.

6. PACEh_hh​.

7. Experiments on language tasks.

优点

缺点

问题

局限性

We thank the reviewer for helpful comments.

1. Copy the original model for consistency regularization not novel.

2. Efficiency concern.

3. PACE+AdaptFormer & GLoRA.

4. PACE with 100 epochs on VTAB.

5. Tables 2/3: results not that good.

6. Typos.

1. We will be sure to discuss in detail the similarities and differences with the multi-modal fine-tuning approaches maintaining consistency.

2. Regarding novel ways to avoid higher computational costs.

优点

缺点

问题

局限性

We thank the reviewer for insightful questions that help refine our work further.

1. Theoretical analysis done for functions from Rd→RR^d\rightarrow RRd→R.

2. Increased computation and memory requirements.

3. Experiments on language tasks.

4. Results of 100 epochs on VTAB.

5. Why smaller weight changes lead to significant divergence in the output space.

6. How is the plot shown for the gradient norm of parameters in multiple layers?

We thank all the reviewers for constructive feedback and questions shaping our revised paper.

3. How $D^{pace}$ align models.

6. PACE $_h$ .

1. Theoretical analysis done for functions from $R^d\rightarrow R$ .