PaperHub
5.5 / 10
Rejected · 3 reviewers
Ratings: 1, 3, 5 (min 1, max 5, std dev 1.6)
ICML 2025

ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization

Submitted: 2025-01-23 · Updated: 2025-06-18

Abstract

Keywords

Parameter-efficient fine-tuning · Prompt tuning · Low-dimensional embeddings

Reviews and Discussion

Review (Rating: 1)

The paper proposes a more efficient prompt tuning method in that they need to optimize over fewer variables. They achieve this efficiency through a kind of sketching with the Johnson-Lindenstrauss Lemma. They experiment on NLP tasks.

Questions For Authors

See above.

Claims And Evidence

The most problematic claim is w.r.t. efficiency. In fact, you still need to reconstruct the large matrix $\tilde{P}$ at inference time, so you do not have a memory advantage in this respect. Also, you need to consider the additional computational requirements of such computations, which I didn't see discussed or evaluated empirically. For instance, LoRA weights can be simply merged with the model weights and no overhead at inference time is induced. So it seems to me the only advantage could be that this method requires even fewer trainable parameters at finetuning stage, but the limitations should be made much clearer in the text. Moreover, many LoRA-like methods that scale better than LoRA have been proposed but not compared in the experiments.

Methods And Evaluation Criteria

GLUE is fine, but more reasoning tasks should be provided. I would suggest using more standard benchmarks and methods, such as Llama for autoregressive tasks. Very little evidence on the computational overhead introduced by the method in terms of time/memory.

Theoretical Claims

OK, even if the theorems are not very informative.

Experimental Design And Analysis

See above

Supplementary Material

OK

Relation To Prior Work

Not all references to new LoRA-based methods are discussed; in fact, many methods that scale better than LoRA have been proposed. One of them is VeRA, which is discussed but not compared empirically. Other references include LISA and ReFT.

Pan, Rui, et al. "LISA: layerwise importance sampling for memory-efficient large language model fine-tuning." Advances in Neural Information Processing Systems 37 (2024): 57018-57049.

Wu, Zhengxuan, et al. "Reft: Representation finetuning for language models." Advances in Neural Information Processing Systems 37 (2024): 63908-63962.

Essential References Not Discussed

See above.

Other Strengths And Weaknesses

Strengths

  • The writing is very clear
  • Good use of sketching
  • GLUE experiments are useful

Other Comments Or Suggestions

Many equations lack a comma at the end

Author Response

We thank the reviewer for their thoughtful comments, especially for recognizing our clear writing, effective parameter reduction through sketching, and useful GLUE experiments. We now provide detailed responses to each of the concerns.

“You still need to reconstruct the large matrix $\tilde{P}$ at inference time, so you do not have a memory advantage in this respect.”

We thank the reviewer for the comment. At inference, ULPT does reconstruct the full prompt embeddings, using the same memory as vanilla prompt tuning. However, ULPT targets LLM customization, where an enormous number of customized LLMs are stored but few are active at a time. ULPT significantly reduces the storage for customizations of foundation models, which is a novel use case.

“You need to consider the additional computational requirements of such computations” and “Very little evidence on the computational overhead introduced by the method in terms of time/memory.”

We appreciate this suggestion. To clarify, the computational overhead introduced by ULPT at inference time is minimal compared with the rest of the network, as the reconstruction of the prompt embeddings occurs only once per model load. We empirically compare the run time of ULPT’s up-projection against vanilla PT (results averaged over 100 runs).

Table 2: Runtime comparison (averaged over 100 runs)

| Runtime setting | Llama 1B | Llama 3B |
|---|---|---|
| Vanilla PT (loading high-dim embeddings) | 0.64 ± 0.04 ms | 0.91 ± 0.04 ms |
| ULPT up-projection (r=2) | 0.56 ± 0.06 ms | 0.59 ± 0.04 ms |
| ULPT up-projection (r=64) | 1.43 ± 0.09 ms | 1.87 ± 0.06 ms |
| ULPT up-projection (r=256) | 4.09 ± 0.10 ms | 5.80 ± 0.16 ms |
| Decoding | 1481.15 ± 64.26 ms | 2536.67 ± 42.14 ms |

As seen in Table 2, the embedding up-projection is negligible relative to the decoding time. We will include the runtime analysis in the revised manuscript.
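A minimal sketch of how such a timing could be measured (illustrative only; the prompt length, rank, and hidden size below are assumptions, not the exact settings behind Table 2):

```python
import time
import torch

n_tokens, r, d = 100, 64, 2048            # assumed: prompt length, low dim, hidden size
z = torch.randn(n_tokens, r)              # low-dimensional prompt embeddings
proj = torch.randn(r, d) / d ** 0.5       # frozen random up-projection matrix

def up_project():
    # reconstruct the full-dimensional prompt embeddings (done once per model load)
    return z @ proj

times_ms = []
for _ in range(100):                      # average over 100 runs, as in the rebuttal
    start = time.perf_counter()
    _ = up_project()
    times_ms.append((time.perf_counter() - start) * 1e3)

print(f"up-projection: {sum(times_ms) / len(times_ms):.3f} ms (mean of 100 runs)")
```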

“LoRA weights can be simply merged with the model weights and no overhead at inference time is induced. So it seems to me the only advantage could be that this method requires even fewer trainable parameters at finetuning stage, but the limitations should be made much clearer in the text. ”

We acknowledge that our ULPT is different from LoRA, as we do not keep the original weight matrix. As mentioned in the previous point, the overhead caused by up-projection is negligible compared with the rest of the network (at most 0.3%). The advantages of our work include:

  • As recognized by the reviewer, our method has much fewer trainable parameters than LoRA and other methods, which is crucial to the storage of massive customized LLMs.
  • In addition to storage saving, our ULPT also combats the overfitting problem and achieves higher performance than full-dimensional prompt tuning and LoRA (Table 1 in our paper).

In the revision, we’ll clarify that we don’t merge the low-rank embeddings.

“many LoRA-like methods that scale better than LoRA have been proposed but not compared in the experiments.”, “VeRA, which is discussed but not compared empirically.” and “more reasoning tasks should be provided … such as Llama for autoregressive tasks”

We thank the reviewer for highlighting additional LoRA variants and for suggesting additional experiments on Llama for autoregressive tasks. During the rebuttal period, we included additional baselines, VeRA and FourierFT, and added two generation benchmarks: GSM8K (math reasoning) and MBPP (code generation). We compare ULPT with these baselines using the Llama 3.2 models (1B and 3B).

Results are presented in Table 1 (in our rebuttal to Reviewer Ycju) due to the rebuttal space limit. We see that ULPT remains highly competitive, outperforming LoRA, VeRA, and FourierFT when the number of parameters is controlled. Importantly, LoRA and VeRA cannot match ULPT’s ultra-low parameter usage, which is at the level of a few thousand parameters. We will include these comparisons in the revised manuscript.

“Other references include LISA and ReFT.”

Thanks for suggesting additional parameter-efficient fine-tuning methods beyond prompt tuning and LoRA. We will include discussions of LISA and ReFT in the related work section of our revised manuscript.

“Many equations lack a comma at the end”

Thanks for the suggestion. We’ll adopt a better style (including punctuation for equations) in our revision.


We believe these clarifications and additional results have addressed the reviewer’s concerns. We are grateful for the reviewer’s feedback, and look forward to your support of our work!

Reviewer Comment

Thanks to the authors for their reply. I have the following remaining important concerns and suggestions:

  • The core contribution of this paper is to down-project the prompt tuning embedding matrix with a random matrix inspired by sketching methods. The down-projection saves some number of trainable parameters for finetuning. I feel that this contribution is not very original and not much significant to the literature.
  • The usefulness of the method is very limited. There doesn't seem to be a significant accuracy improvement and the main benefit would be lower number of trainable parameters. For example, in Table 1, taking the highest ranks of DPT and ULPT, accuracy is basically the same and ULPT requires 27.1K parameters while DPT requires 55.6K, a saving of 28,500 parameters. This means that, if using float32, you save 114 kilobytes of storage (see the worked arithmetic after this list). This saving is negligible.
  • The authors say that their method is useful when "an enormous number of customized LLMs are stored but few are active at a time". Storage cost is very low so you would need tens of millions of customizations before seeing any significant saving, which seems like a very hypothetical scenario.
  • Prompt tuning already adds tokens to the input, leading to increased inference time and KV cache memory requirements. Even though the authors show that their method's overhead is small, when "an enormous number of customized LLMs are stored but few are active at a time", this loading operation needs to be performed every time a new customization is loaded, resulting in compounded time overheads (which is more costly than storage).
  • Regarding reasoning benchmarks, GSM8K is an older benchmark. I suggest the authors take a look at Tables 1 and 2 of ReFT. This is a suggestion for future versions of their paper, I understand that running these experiments now is computationally expensive.
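For concreteness, the storage arithmetic referenced in the second point above (assuming float32, i.e. 4 bytes per parameter) works out as:

```latex
(55.6\text{K} - 27.1\text{K})\ \text{params} \times 4\ \text{bytes/param}
  = 28{,}500 \times 4\ \text{bytes}
  = 114{,}000\ \text{bytes}
  \approx 114\ \text{KB}.
```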
Author Comment

Thank you for your additional comments. We address your concerns as follows:

“ I feel that this contribution is not very original and not much significant to the literature.”

We respectfully disagree with these comments. The effective integration of random projections with prompt tuning has not been previously studied in the literature. Moreover, our analysis (Figure 3 in our paper) shows that naively down-projecting the prompt embedding, specifically in the ultra-low-dimensional setting, introduces significant difficulty in learning, and our proposed learnable scaling and shifting embeddings resolve this problem while keeping the parameter efficiency. The reviewer fails to mention any specific literature but simply “feel(s)” our contribution is not very original. This is a major concern about the review, not our paper.

“There doesn't seem to be a significant accuracy improvement and the main benefit would be lower number of trainable parameters.”

We again disagree with the comments. In terms of performance, for example, ULPT with r=64 (7.9k parameters) achieves the best performance on both GLUE and SuperGLUE compared with all other methods (Table 1). When controlling for the same rank for DPT (55.6k parameters), ULPT significantly outperforms DPT on SuperGLUE (76.8 vs. 73.9).

Beyond the improved task performance, we would like to point out that improving efficiency is itself a major contribution to the deep learning literature. For example, the ICML’25 review guidelines highlight time efficiency as a valid contribution. Similarly, our parameter-efficiency improvements can be a key contribution to the machine learning community.

If saving parameters is not a significant contribution, the reviewer essentially asserts that most LoRA-like papers fall below the ICML bar, which is absurd.

We urge the reviewer to read and follow the ICML’25 review guidelines when judging the merit of our paper.

“This saving is negligible.” and “Storage cost is very low so you would need tens of millions of customizations before seeing any significant saving, which seems like a very hypothetical scenario.”

We thank the reviewer for recognizing significant parameter savings in massive LLM customizations. This is exactly how LLMs are used today. For example, this news report mentions that OpenAI has more than 400M weekly active users. Even if each user keeps one customized LLM, we have 400 million customized LLMs. It is hard to see how improving the efficiency of LLM customization could be a hypothetical scenario.

“Prompt tuning already adds tokens to the input, leading to increased inference time and KV cache memory requirements… this loading operation needs to be performed every time a new customization is loaded, resulting in compounded time overheads (which is more costly than storage)”

For the KV cache, we confirm that our approach does not add any overhead compared with prompt tuning (which is a lightweight and useful way of tuning LLMs). Our work is built on top of prompt tuning, and saves a large number of parameters while further improving task performance.

We further measure the decoding speed to alleviate the reviewer’s concern. We follow the setup in Table 1 and use a rank of 2 with 100 prompt tokens.

Table 3: Decoding speed (tokens/second)

| Model | No customization | ULPT |
|---|---|---|
| Llama 1B | 82.76 ± 0.33 | 82.71 ± 0.33 |
| Llama 3B | 48.74 ± 0.25 | 48.70 ± 0.22 |

We found no meaningful difference in decoding speed with or without the additional prompt tokens.

“I suggest the authors take a look at Tables 1 and 2 of ReFT. This is a suggestion for future versions of their paper”

As we mainly follow the previous prompt tuning literature for our experimental settings, we thank the reviewer for their suggestions on future work. Since ReFT was only published in December last year, we did not have enough time to adopt its setups for our ICML submission. We will discuss the paper in our revision and adopt the settings in future work.

Review (Rating: 3)

This paper proposes a new low-dimensional parameterization for prompt tuning that could achieve better performance than the original prompt tuning with only 2% of the parameters.

Questions For Authors

I would refer to the "Essential References Not Discussed" and "Other Strengths And Weaknesses" sections. I would be happy to reevaluate this work if the authors could give more discussion on the introduction of the two new embeddings and also discuss and compare with the missing literature I mentioned.

Claims And Evidence

The claims are in general clear and convincing.

One issue regarding the claims is the introduction of shift embedding and the scale embedding. It is unclear why the introduction of these two could result in better performance and whether there are better parameterizations.

Methods And Evaluation Criteria

The proposed method is evaluated on the GLUE fine-tuning tasks, compared to other fine-tuning methods. I believe these are standard criteria and do make sense.

Theoretical Claims

Theorem 3 imposes a pretty strong assumption, namely the Polyak-Lojasiewicz inequality. This inequality essentially bounds the function-value gap to the optimal value by the norm of the gradient, and serves as a substitute for the strong convexity assumption. Under such an assumption, every local optimum is a global optimum, so I believe that the theoretical claim should be correct but not very significant.
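For reference, the Polyak-Lojasiewicz (PL) inequality mentioned here, for a differentiable objective $f$ with minimum value $f^*$ and some constant $\mu > 0$, reads:

```latex
\frac{1}{2}\,\bigl\|\nabla f(x)\bigr\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr) \quad \text{for all } x,
```

so the suboptimality gap is bounded by the squared gradient norm, and every stationary point is a global minimizer.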

I didn't check the proofs carefully but would believe that it's correct.

Experimental Design And Analysis

The experiment design involves multiple fine-tuning tasks in GLUE and SuperGLUE, which is good. The not-so-good part is that the experiment is only conducted on T5-base model, and it's interesting to see the result on newer models such as Llama3.2 or Qwen2.

Supplementary Material

I didn't check the supplementary material carefully.

Relation To Prior Work

I think the work is clear about its relations to the previous works in this research direction.

Essential References Not Discussed

There is one previous work on an extreme parameter-efficient fine-tuning method, namely fine-tuning in the Fourier domain; see [1]. This is also a fine-tuning idea with a non-traditional parameterization that reduces memory to an extreme degree. I think this method is worth comparing with both in theory and in experiments.

References:

[1] Gao, Ziqi, et al. "Parameter-Efficient Fine-Tuning with Discrete Fourier Transform." Forty-first International Conference on Machine Learning.

Other Strengths And Weaknesses

The paper is clearly written and easy to follow. For the weaknesses, I think the biggest one is again the lack of discussion on the introduction of the shift embedding and the scale embedding. I think it would be helpful if the authors could discuss the necessity of these two new variables theoretically. In particular, the Fourier domain parameterization in [1] seems to require fewer parameters than the method proposed in this paper.

References:

[1] Gao, Ziqi, et al. "Parameter-Efficient Fine-Tuning with Discrete Fourier Transform." Forty-first International Conference on Machine Learning.

Other Comments Or Suggestions

I think there are some typos but I didn't check all of them carefully. For example, in the statement of Theorem 3, "Polyak–Lojasiewic" seems missing a "z" at the end.

Author Response

We thank the reviewer for their detailed feedback. We appreciate that the reviewer says “The claims are in general clear and convincing” and that “The experiment design involves multiple fine-tuning tasks”. Below we address each of the comments in detail.

“One issue regarding the claims is the introduction of shift embedding and the scale embedding.” and “it would be helpful if the authors could discuss the necessity of these two new variables theoretically”

Thanks for raising this point. Our empirical analysis (Section 4.3) demonstrates that without scale and shift embeddings, the optimization process becomes significantly more difficult, particularly in the ultra-low-dimensional setting (e.g., 2-dimensional prompts, Figure 3). Additionally, Figure 4 reveals that the learned shift embeddings exhibit high similarities across different r configurations, further justifying our heuristic design of the shift and scale embeddings. Theoretically, our Theorem 3 ensures that these additional embeddings do not negatively impact the optimization process. We’ll provide further explanation in the revision.

“Theorem 3 imposes a pretty strong assumption, namely the Polyak-Lojasiewicz inequality… Under such an assumption, every local optimum is a global optimum, so I believe that the theoretical claim should be correct but not very significant.”

We appreciate the reviewer’s insightful comments on our theoretical assumptions. While the Polyak-Lojasiewicz (PL) inequality seems to be a strong condition, recent studies such as [1] demonstrated that over-parameterization often induces optimization landscapes to approximate PL-like conditions. In real-world applications, modern language models are heavily overparameterized, and tend to satisfy the PL* condition (a variant of PL condition) as shown in [1].

We acknowledge the reviewer’s point that the condition may not always hold in practice, but this is the case for almost every theoretical analysis. That being said, our theorem provides meaningful insight into ULPT in practice (namely, that random projection for embeddings does not add to the optimization difficulty), which is novel and has not been stated before.

[1] Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Liu et al. 2020.

“it's interesting to see the result on newer models such as Llama3.2 or Qwen2” and “the Fourier domain parameterization seems to require fewer parameters than the method proposed in this paper…I think this method is worth comparing with both in theory and in experiments.”

We thank the reviewer for suggesting comparisons with Fourier-based methods and evaluations on newer models. We conducted additional experiments using the Llama3.2 (1B and 3B) models on two generation datasets: GSM8K (math reasoning) and MBPP (code generation). These generation tasks also complement the 21 tasks in our main paper.

Due to the space limit, we kindly refer the reviewer to Table 1 of our rebuttal to Reviewer Ycju. As shown, ULPT consistently outperforms FourierFT under controlled parameter budgets. For instance, when controlling parameters at 4.1K (ULPT r=2 vs. FourierFT n=128), our ULPT achieves higher performance on both GSM8K (39.7 vs. 35.8) and MBPP (26.1 vs. 21.5) for Llama 1B, and similar advantages are observed in the 3B setting (66.3 vs. 63.1 on GSM8K and 33.9 vs. 21.9 on MBPP). Moreover, ULPT remains competitive or superior to other baselines such as LoRA, VeRA, and vanilla prompt tuning, which require significantly more trainable parameters.

Theoretically, both FourierFT and ULPT leverage random matrices to reduce the number of learnable parameters for fine-tuning, but they achieve this through fundamentally different mechanisms. FourierFT compresses weight updates by leveraging random spectral entries in the frequency domain, while ULPT operates in prompt space by parameterizing the embeddings with a random up-projection matrix. The projection approximately preserves the embedding distances essential for transformer attention (our Theorem 2). We will discuss these differences in more detail in our revision.

“I think there are some typos but I didn't check all of them carefully. For example, in the statement of Theorem 3, "Polyak–Lojasiewic" seems missing a "z" at the end.”

Thanks for the catch! We’ll fix them in the revision.


We hope that our clarifications and additional results address all the concerns. We greatly appreciate the reviewer’s willingness to reevaluate our manuscript! Please let us know if there are any further questions. Thanks!

Reviewer Comment

I thank the authors for the rebuttal, especially the comparison with Fourier fine-tuning. I'd like to increase my evaluation since it addresses most of my concerns.

Review (Rating: 5)

This work proposes a change to prompt tuning: the standard n x d prompt parameters are decomposed into two matrices multiplied together, n x r and r x d, where the second matrix is random and frozen, thus vastly reducing the number of learnable parameters.

Additionally, they add new learnable shift and scale vectors of size d, which they find helps optimization, and provide theoretical results showing that their learned low-rank embedding vectors maintain the same distance relations amongst themselves as the original vectors do.
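A minimal sketch of this parameterization as described above (our reading of the summary, not the authors' released code; dimensions and initialization scales are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ULPTPrompt(nn.Module):
    """n x r trainable low-dim prompt, r x d frozen random up-projection,
    plus learnable scale and shift vectors of size d."""

    def __init__(self, n_tokens: int, d_model: int, r: int):
        super().__init__()
        self.z = nn.Parameter(0.02 * torch.randn(n_tokens, r))        # trainable low-dim prompt
        # frozen random projection: saved with the module but never updated
        self.register_buffer("proj", torch.randn(r, d_model) / d_model ** 0.5)
        self.scale = nn.Parameter(torch.ones(d_model))                 # learnable scale
        self.shift = nn.Parameter(torch.zeros(d_model))                # learnable shift

    def forward(self) -> torch.Tensor:
        # reconstruct the full-dimensional prompt embeddings, then scale and shift
        return (self.z @ self.proj) * self.scale + self.shift

prompt = ULPTPrompt(n_tokens=100, d_model=2048, r=2)
print(sum(p.numel() for p in prompt.parameters() if p.requires_grad))  # 100*2 + 2*2048 = 4296
```

With these assumed settings, the trainable-parameter count lands in the same few-thousand range as the ULPT rows reported in the rebuttal tables.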

They test their approach on both GLUE and SuperGLUE tasks and find their method achieves stronger average results.

Questions For Authors

N/A

Claims And Evidence

Yes, their method outperforms others on a wide variety of benchmarks like GLUE and SuperGLUE and is competitive in other more difficult settings like MRQA and other datasets.

Additionally they have a number of ablation studies that show each part of their system seems important and contributes to the final performance.

Methods And Evaluation Criteria

Yes, they make sense. While benchmarks like GLUE and SuperGLUE are oversaturated, they include other more difficult benchmarks. Additionally, experiments with large-scale BLOOMZ models (up to 3B parameters) show that their method generalizes w.r.t. model type (decoder-only) and scale.

Theoretical Claims

I did not check the correctness of proofs

Experimental Design And Analysis

Their experimental design makes sense.

They also did extensive ablation studies on things like the rank of the learned parameters, the scale + shift parameters, and which parts of the decomposition are trainable.

Some work like https://arxiv.org/abs/2205.12647 seems to suggest that prompt tuning methods tend to be weaker on tasks that require long generation. It would have been nice to see how their approach fared in this more challenging setting.

Supplementary Material

I did not review the supplementary material

Relation To Prior Work

Their work appropriately references other work in the field, including things like citations to works that first proposed the decomposition of prompt parameters.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

The paper and their method are both clear and straightforward.

They could do a better job of explaining what Theorem 2 means practically. I assume that by maintaining distance relations it means that the downstream transformer's attention will be unaffected by the low-rank representation, but something like that could be more clearly stated.

Other Comments Or Suggestions

N/A

Author Response

We appreciate the reviewer for their thorough evaluation and the “strong accept” recommendation! The reviewer fully recognizes the contributions of our work, as well as the comprehensive analysis and clear writing.

“It would have been nice to see how their approach fared in this more challenging setting.”

We thank the reviewer for highlighting the importance of evaluating our approach on tasks involving long generation. We conducted additional experiments on the GSM8K and MBPP datasets for math reasoning and code generation, with maximum generation lengths of a few hundred tokens. We used one of the newest Llama models (3.2) and, due to time and resource constraints, considered the 1B and 3B variants.

Table 1: Results from additional experiments. We report accuracy on GSM8K and pass@1 on MBPP. The updated code is available in our anonymous GitHub repo (see footnote 1 in our manuscript for the link).

| Method | Param (1B) ↓ | GSM8K (1B) ↑ | MBPP (1B) ↑ | Param (3B) ↓ | GSM8K (3B) ↑ | MBPP (3B) ↑ |
|---|---|---|---|---|---|---|
| ICL (4-shot) | - | 34.3 | 21.1 | - | 62.5 | 23.9 |
| LoRA (r=1) | 106.5k | 38.5 | 26.7 | 286.7k | 62.9 | 32.1 |
| LoRA (r=4) | 426.0k | 40.1 | 27.2 | 1.15M | 63.4 | 34.3 |
| LoRA (r=8) | 852.0k | 40.2 | 24.7 | 2.29M | 62.2 | 37.8 |
| VeRA (r=1) | 41.0k | 39.3 | 24.4 | 114.7k | 65.5 | 35.5 |
| VeRA (r=4) | 41.1k | 39.6 | 27.8 | 114.9k | 65.0 | 34.4 |
| VeRA (r=8) | 41.2k | 40.9 | 29.5 | 115.1k | 65.7 | 33.9 |
| FourierFT (n=128) | 4.1k | 35.8 | 21.5 | 7.2k | 63.1 | 21.9 |
| FourierFT (n=512) | 16.4k | 34.9 | 27.3 | 28.7k | 66.6 | 35.3 |
| FourierFT (n=1024) | 32.8k | 36.6 | 25.9 | 57.3k | 65.5 | 35.4 |
| PT | 20.5k | 40.2 | 24.7 | 30.7k | 65.3 | 33.1 |
| ULPT (r=2) | 4.1k | 39.7 | 26.1 | 6.2k | 66.3 | 33.9 |
| ULPT (r=64) | 4.7k | 42.4 | 28.7 | 6.8k | 65.6 | 34.3 |
| ULPT (r=256) | 6.7k | 41.4 | 26.3 | 8.7k | 66.4 | 32.9 |

These results in Table 1 show that ULPT remains competitive or superior compared with other baselines, including LoRA and its recent variants (VeRA and FourierFT), as well as vanilla prompt tuning. In particular, LoRA and VeRA fail to work in the ultra-low parameter setting, as they require orders of magnitude more parameters than ours. FourierFT uses fewer parameters, but its performance is much worse than ours (e.g., 35.8 vs. 39.7 on GSM8K at the 1B scale with 4.1k parameters).

“They could do a better job of explaining what Theorem 2 means practically”

Thanks for the suggestion. Practically, since transformers heavily rely on embedding distances during the forward pass to compute attention patterns, Theorem 2 shows that our randomly up-projected low-dimensional prompt embeddings approximately preserve these pairwise distances. This suggests that the model’s attention mechanism operates on embeddings that reflect the same relational structure as the full-dimensional prompts. We will clarify in the revision.
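A small numerical illustration of this distance-preservation property (our own sketch with assumed dimensions and a Gaussian projection; not the paper's exact construction or a proof of Theorem 2):

```python
import torch

torch.manual_seed(0)
n, r, d = 100, 64, 2048                    # assumed: prompt length, low dim, hidden size
z = torch.randn(n, r)                      # low-dimensional prompt embeddings
proj = torch.randn(r, d) / d ** 0.5        # random up-projection, scaled to preserve norms

low = torch.cdist(z, z)                    # pairwise distances before up-projection
high = torch.cdist(z @ proj, z @ proj)     # pairwise distances after up-projection

mask = low > 0
ratio = high[mask] / low[mask]
print(ratio.mean().item(), ratio.std().item())   # mean close to 1 with a small spread
```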


Once again, we thank the reviewer for their strong support and valuable feedback!

Final Decision

This paper received mixed reviews. After discussion, it was agreed that this paper may not meet the ICML acceptance bar. The authors are encouraged to address the comments of the reviewers to improve this work. In particular, the authors are encouraged to address the following major concerns.

  1. Experiments should be conducted on more LLMs.
  2. Some important references are missing.
  3. Experiments should be conducted on more tasks and benchmarks.

The authors’ rebuttal and subsequent messages have been carefully read, discussed, and considered.