PaperHub

Average rating: 4.0 / 10 (withdrawn; 4 reviewers; min 3, max 5, std 1.0)
Individual ratings: 3, 5, 5, 3
Confidence: 3.5 · Correctness: 2.8 · Contribution: 1.5 · Presentation: 2.8
ICLR 2025

Memory-Efficient Block Coordinate Descent for Hessian-Informed Zeroth-Order Optimizer

Submitted: 2024-09-28 · Updated: 2024-11-13

Abstract

Keywords

zeroth-order optimization, memory-efficient fine-tuning

Reviews and Discussion

Review (Rating: 3)

The paper proposes to run zeroth-order optimization with a diagonal Hessian preconditioner (HiZOO) in a block-wise fashion. This reduces the memory requirement, since Hessian information only needs to be stored for one block at a time, in addition to the model parameters.

Strengths

Results are better than HiZOO's while using much less memory.

Weaknesses

Results are compared against weakly engineered baselines, and activation storage needs are overstated. The main missing ingredients are gradient checkpointing, which is easily enabled in most models via model.gradient_checkpointing_enable(), and the option of performing the weight update during the backward pass (as LOMO and GaLore do).

For example, one can easily fine-tune Llama-2-7B on a 24GB GPU, even on 4096-token sequences, with gradient checkpointing, LOMO-like updates, and FlashAttention-2. The paper claims this gets an OOM error on the SST-2 dataset (which has short sequences) on a 48GB GPU!
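A minimal sketch of the kind of baseline setup this comment has in mind (my assumption, not the paper's or the reviewers' code): gradient checkpointing enabled via the standard transformers call, plus a LOMO-style hook that applies a plain SGD update during the backward pass so full gradients never coexist in memory. The model name, learning rate, and hook-based update are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
).cuda()
model.config.use_cache = False
model.gradient_checkpointing_enable()   # trade recompute for activation memory

lr = 1e-5

def fused_sgd(param: torch.Tensor) -> None:
    # Fires right after this parameter's gradient has been accumulated:
    # update the weight in place, then free the gradient immediately.
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)
    param.grad = None

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(fused_sgd)  # PyTorch >= 2.1

# In the training loop, loss.backward() now updates weights on the fly,
# so at no point are all gradients stored simultaneously.
```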

The proposed algorithm (B-PDF) has two main contenders: HiZOO and BCD (block coordinate descent) with a first-order optimizer. B-PDF does seem better than HiZOO, but I believe the authors are using a suboptimal version of BCD without gradient checkpointing, which is why it runs into OOM errors. Whenever BCD does not run out of memory, it is clearly better (Table 3).

And if one can run BCD with a first-order method, why bother with B-PDF?

The problem of overstating activation storage needs becomes even more apparent when one thinks about scaling. For a model with D blocks and embedding size N, the number of parameters is O(DN^2). With batch size B and sequence length L, the activation storage need is O(BLDN) (assuming FlashAttention here, which removes the O(L^2) factor). The bigger the network, the smaller the activation storage becomes relative to the parameter storage.
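A quick back-of-the-envelope check of this scaling argument, using rough Llama-2-7B-like numbers (D, N, B, L and the 12·D·N^2 parameter count are my assumptions, not figures from the paper):

```python
# Rough comparison of parameter memory O(D N^2) vs activation memory
# O(B L D N), with Llama-2-7B-like dimensions and bf16 storage (assumed).
D, N = 32, 4096          # number of blocks, embedding size
B, L = 8, 4096           # batch size, sequence length
bytes_per_elem = 2       # bf16

params = 12 * D * N**2                       # ~O(D N^2), ignoring embeddings
param_mem = params * bytes_per_elem          # ~12 GiB

# With gradient checkpointing, roughly one N-dim activation per token per
# block is kept between recomputations: O(B L D N).
act_mem = B * L * D * N * bytes_per_elem     # ~8 GiB

print(f"parameters : {param_mem / 2**30:.1f} GiB")
print(f"activations: {act_mem / 2**30:.1f} GiB")
# Doubling N quadruples parameter memory but only doubles activation memory,
# so activations shrink relative to weights as the model grows.
```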

Questions

How would B-PDF compare to a properly engineered BCD or GaLore?

Review (Rating: 5)

The paper aims to reduce the memory cost of Hessian-informed zeroth-order (ZO) optimization. The main idea is to incorporate block coordinate descent (BCD) into the Hessian-informed ZO optimizer (HiZOO), selecting a subset of layers to tune at each fine-tuning step. Experiments show that the proposed method achieves a significant memory reduction compared to an existing Hessian-informed ZO method when fine-tuning LLMs, while retaining comparable performance.
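To make the mechanism concrete, the following is a minimal sketch of my reading of one such step (not the authors' code): a two-point MeZO-style probe applied to a single block of parameters, with the probe scaled by a diagonal Hessian estimate kept only for the blocks being tuned. The cyclic selection rule, the scaling, and the omission of the Hessian-estimate update are simplifying assumptions.

```python
import torch

def zo_bcd_step(model, blocks, hessians, loss_fn, step, lr=1e-6, eps=1e-3):
    # blocks[i]: list of parameter tensors in block i
    # hessians[i]: diagonal Hessian estimates, stored only for tuned blocks
    params = blocks[step % len(blocks)]       # cyclic block selection (assumed)
    hess = hessians[step % len(blocks)]

    seed = torch.seed()                       # same z's regenerated on each pass
    def perturb(scale):
        torch.manual_seed(seed)
        for p, h in zip(params, hess):
            z = torch.randn_like(p) / h.sqrt()   # Hessian-scaled probe direction
            p.data.add_(scale * eps * z)

    perturb(+1)
    loss_plus = loss_fn(model)                # forward passes only, no backprop
    perturb(-2)
    loss_minus = loss_fn(model)
    perturb(+1)                               # restore original weights

    proj_grad = (loss_plus - loss_minus) / (2 * eps)
    torch.manual_seed(seed)
    for p, h in zip(params, hess):
        z = torch.randn_like(p) / h.sqrt()
        p.data.add_(-lr * proj_grad * z)      # update only the active block
    # (The update of the diagonal Hessian estimate itself is omitted here.)
```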

Strengths

S1: The background and problem are well motivated. The paper extends HiZOO with several BCD schemes that reduce the heavy memory consumption introduced by Hessian-informed ZO optimization.

Weaknesses

W1: Some clarity issues in the methodology. For instance, the concept of a block is not made clear: some parts of the paper give the impression that each block corresponds to a layer, while other sections imply that a block can correspond to a subset of layers. The BCD partitioning granularity used in the experiments is not explicitly stated, which causes confusion and apparent discrepancies.

W2: The authors emphasize practical analysis in Section 4.2, but the experiments section is lacking. Only two models, OPT-1.3B and LLaMA2-7B, were evaluated, making it difficult to judge how well the approach scales to different and larger models. In particular, HiZOO, the main baseline, was evaluated on much larger models ranging from 13B to 66B parameters. Experiments on the effect of different block sizes would also have helped clarify whether there is a tradeoff between performance and memory consumption.

Questions

Q1: There appear to be several hyperparameters associated with BCD block selection, such as the update interval and the block granularity. How are these determined?

Q2: How would the approach perform on larger models such as LLaMA2-70B? Would performance remain competitive with HiZOO?

Review (Rating: 5)

In this paper, the authors propose an optimisation algorithm for LLM fine-tuning that combines the block coordinate descent method with a Hessian-informed zeroth-order optimiser. The combination takes advantage of the Hessian information in HiZOO and the memory efficiency of MeZO. The authors demonstrate the benefits of the proposed approach mainly on OPT-1.3B.

Strengths

  1. The paper is well written and well structured, and thus very easy to follow, even for readers from other domains.

  2. The proposed approach is well motivated and combines the benefits of both HiZOO and BCD.

Weaknesses

  1. From my understanding, the proposed approach is a combination of the existing HiZOO and BCD techniques, with limited novelty. Are there any further insights to be gained from this combination?

  2. The evaluation of the optimisation seems limited. Most experiments focus on OPT-1.3B with a limited set of datasets. Please provide more evidence that B-PDF offers a better tradeoff than both HiZOO and MeZO, especially on LLaMA-2-7B or larger models: for instance, runtime, memory and accuracy, or the convergence of the various optimisation algorithms. From Table 2, it seems that the authors only compare the optimisation algorithms in a toy setup, given that the runtime for SGD is only 4 minutes.

  3. In Equation 7, the authors present multiple types of BCD algorithms, but how the different variants affect memory, runtime, and accuracy is not explained.

  4. Please correct me if I am wrong: the LoRA rank also affects memory and runtime. Can we reduce the LoRA rank to bring memory down to the same level that the zeroth-order optimiser achieves? (See the rough calculation after this list.)

  5. What is the performance of MeZO with BCD?
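Regarding point 4, a rough calculation of adapter size versus rank (assumed Llama-2-7B-like dimensions with LoRA on the four attention projections of every block; not numbers from the paper):

```python
# LoRA adds two low-rank factors of shape (r x N) and (N x r) per adapted
# weight matrix, so adapter memory scales linearly with the rank r.
N, D = 4096, 32                         # hidden size, number of blocks (assumed)
for r in (64, 16, 4, 1):
    lora_params = D * 4 * 2 * r * N     # 4 attention projections per block
    mem_mib = lora_params * 2 / 2**20   # bf16 weights
    print(f"r={r:3d}: {lora_params / 1e6:5.1f}M adapter params, ~{mem_mib:.0f} MiB")
# Even at r=1 the adapters are tiny; the remaining memory gap to a
# zeroth-order method comes mainly from activations stored for backprop
# and optimizer states, not from the rank itself.
```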

Questions

Please see the above.

I have a minor question: can we achieve a similar level of memory by reducing the batch size in a first-order approach? From Table 2, the zeroth-order method sacrifices much more runtime for memory, which makes me concerned about the applicability of zeroth-order algorithms in real-world settings. With a much larger model or a larger fine-tuning dataset, can we afford such a runtime sacrifice (more than 50 times slower than LoRA)?

Review (Rating: 3)

This study proposes a memory-efficient fine-tuning approach for large language models by integrating block coordinate descent (BCD) with a Hessian-informed zeroth-order optimizer.

Strengths

Their method is more memory-efficient than the previous method HiZOO and has a memory footprint comparable to that of MeZO.

Weaknesses

The authors’ claim that their method is a practical, convergence-enhanced alternative to MeZO is not substantiated by the evidence. I summarize this from three perspectives.

  1. In terms of memory usage, MeZO is more memory-efficient than the proposed method, as demonstrated in Tables 1, 2, and 4.
  2. In terms of performance, the improvement of their method is minimal, with an average score of 70.2 compared to 70.0 for MeZO.
  3. In terms of convergence rate, Figure 3 indicates that the convergence rate of their method is similar to that of MeZO, showing no clear advantage.

Questions

See the above weaknesses for questions.

Ethics Concerns

N/A

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.