Low Rank Quantization Adaptation for Large Language Model
Abstract
Reviews and Discussion
This paper introduces Low Rank Quantization Adaptation (LoQA) for LLMs, a novel parameter-efficient fine-tuning method for quantized models. LoQA aims to reduce memory consumption across model weights, gradients, and optimizer parameters. It includes two core modules: HQ-LoRA (Holistic Quantization Low-Rank Adaptation) and QBAS (Quantized Bit-Aware Scaling), aligning quantization parameter granularity with LoRA parameters in group quantization.
Strengths
- The proposed method is well-motivated and sound.
- This approach is straightforward yet effective, especially in ultra-low bit-width scenarios.
- Code is provided for reproducibility.
Weaknesses
- Limited novelty: The method’s innovation is incremental, building on QA-LoRA and closely resembling group quantization applied to LoRA parameters. In Sec 3.2, LoQA uses two LoRA components, whereas QA-LoRA requires only one, which may lead to a higher parameter count.
- Experimental limitations:
- LoQA does not demonstrate clear improvements on LLaMA 2, and for LLaMA 3, only training loss rather than performance results is presented. Providing promising results on more advanced models would strengthen the paper’s contribution.
- Important baselines, such as IR-QLoRA and QLoRA, are absent in Tables 5, 7, and 9, affecting comparison fairness.
- While the authors claim broad applicability of LoQA across post-training quantization techniques, only GPTQ settings are evaluated.
- Poor presentation and writing:
- Figure 2 lacks clarity in distinguishing between QA-LoRA and LoQA.
- In Sec 3.3, the rationale for setting maxq is not sufficiently analyzed.
- In some tables (e.g., Tables 4, 7, 8), entries are missing bold formatting to indicate the best results.
- The LoRA rank used in the compared methods is not clearly specified.
- The "4+16 bit" setting is not formally defined.
- "Llama-7b" and "Llama-7B" are inconsistently referenced throughout the paper.
- "Post-fine-tuning" in line 269 is incomplete.
Questions
Please refer to the Weaknesses above for questions.
LoQA does not demonstrate clear improvements on LLaMA 2, and for LLaMA 3, only training loss rather than performance results is presented. Providing promising results on more advanced models would strengthen the paper’s contribution.
There appears to be a misunderstanding regarding our experimental results:
First, regarding LLaMA 2 results (Tables 4 and 9), it's crucial to note that LoQA achieves its performance with true 4-bit models, while IR-QLoRA and QLoRA operate at 4+16 bits since their LoRA weights cannot be merged into quantized weights. Our method achieves comparable performance to these 4+16-bit models while maintaining quantized format. Compared to previous 4-bit methods, LoQA shows clear improvements across most tasks.
Second, for LLaMA 3, we specifically present validation loss, not training loss, demonstrating our method's strong generalization capability on downstream tasks. This validation performance indicates that LoQA effectively adapts to newer model architectures while maintaining good fitting characteristics, as we've also detailed in our response to Reviewer n8i8.
Important baselines, such as IR-QLoRA and QLoRA, are absent in Tables 5, 7, and 9, affecting comparison fairness.
For Table 5, we have added comprehensive comparison results on the CommonsenseQA dataset using Alpaca data:
| Method | #Bit | Avg. |
|---|---|---|
| LLaMA-7B | 16 | 58.3 |
| NormalFloat | 4 | 61.6 |
| QLoRA w/ GPTQ | 4 | 59.8 |
| QA-LoRA | 4 | 61.8 |
| QLoRA | 4+16 | 62 |
| IR-QLoRA | 4+16 | 63.7 |
| LoQA | 4 | 63.8 |
Regarding Table 7, while it doesn't directly show comparisons with other methods, we provide detailed comparative analysis and discussion in Section 4.3.
Table 9 specifically focuses on ablation studies across different group sizes and datasets with QA-LoRA, as its primary purpose is to demonstrate our method's effectiveness under varying group size configurations, rather than comprehensive baseline comparisons.
While the authors claim broad applicability of LoQA across post-training quantization techniques, only GPTQ settings are evaluated.
The consistent use of GPTQ across all methods in Tables 1, 2, and 3 was chosen to ensure fair comparison. QLoRA uses GPTQ to recover 4-bit format, and QA-LoRA uses GPTQ as its fine-tuning starting point. Therefore, LoQA maintains this consistency for meaningful comparative evaluation.
From a theoretical perspective, LoQA's applicability extends well beyond GPTQ. Our method naturally supports various PTQ algorithms, particularly those with integer zero points or zero point-free approaches, as LoRA can be reparameterized into the scale component. This broader compatibility can be formally demonstrated through theoretical analysis, though we focused our experimental evaluation on GPTQ to maintain consistent comparison conditions.
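To make the compatibility argument concrete, the reparameterization can be sketched in generic notation (an illustrative formulation, not the paper's equations):

```latex
% Zero-point-free weight-only PTQ keeps an integer matrix Q and a group-wise scale s.
% Restricting the fine-tuning update to the scale preserves the quantized format:
\hat{W} = s \odot Q
\;\xrightarrow{\ \text{fine-tune}\ }\;
\hat{W}' = (s + \Delta s) \odot Q,
\qquad \Delta s \ \text{parameterized by a low-rank product and merged into } s \ \text{after training.}
```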
In Sec 3.3, the rationale for setting maxq is not sufficiently analyzed.
We have enhanced the explanation of maxq in Section 3.3 and provided comprehensive visual analysis in Appendix D. The appendix now includes visualization that clearly demonstrates the effect of QBAS and helps readers intuitively understand the rationale behind our maxq setting. These visualizations effectively illustrate how QBAS addresses scale parameter learning challenges.
Figure 2 lacks clarity in distinguishing between QA-LoRA and LoQA.
"Post-fine-tuning" in line 269 is incomplete.
"Llama-7b" and "Llama-7B" are inconsistently referenced throughout the paper.
Thank you for these detailed observations. We have addressed each point:
- We have enhanced Figure 2 to more clearly differentiate between QA-LoRA and LoQA, with improved labeling and visual distinction between their respective architectures.
- We have expanded the "post-fine-tuning" discussion in line 269 to provide complete context and implementation details.
- We have standardized all model references to "LLaMA-7B" throughout the paper for consistency.
The "4+16 bit" setting is not formally defined.
As described in Section 3.1, the "4+16 bit" notation represents models where the backbone network is quantized to 4 bits while LoRA parameters remain in 16-bit precision, as used in QLoRA and IR-QLoRA. In contrast, QA-LoRA and LoQA achieve true 4-bit models by merging the LoRA parameters into the quantized weights. We have added explicit clarification of this notation in all relevant table captions to prevent any confusion.
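For readers skimming the tables, the two settings can be sketched as follows (schematic notation, not the paper's equations):

```latex
% "4+16 bit" (QLoRA / IR-QLoRA): the 4-bit backbone and the 16-bit LoRA branch stay separate at inference:
y = \underbrace{\big(s \odot (Q - z)\big)\,x}_{\text{4-bit backbone}}
  + \underbrace{B A\,x}_{\text{16-bit LoRA branch}}
% "4 bit" (QA-LoRA / LoQA): the low-rank corrections are merged into the quantization parameters beforehand:
y = \big(s' \odot (Q - z')\big)\,x,
\qquad s', z' \ \text{absorbing the learned corrections.}
```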
Dear Authors,
Thank you for your detailed response. Most of the unclear parts in the paper have been clarified, but I still have a few concerns:
- The LoRA-Merge section in Figure 2 is still not clear. I suggest that the authors enhance the caption to describe the meaning of each block in the figure.
- The paper has undergone significant modifications. I recommend that the authors highlight the changes in a different color to better showcase these updates or provide a detailed explanation of the revisions in the general response.
- I am still concerned about the novelty: Regarding the significance of maxq, Appendix D provides experiments for the 4-bit case. Are there any analyses for different bit-widths? Additionally, could the distribution of LoRA magnitude also be influenced by initialization or learning rate? Please provide some analysis to further clarify the importance of QBAS.
If these issues can be addressed, I would be happy to consider raising my score.
Thank you for your thorough and constructive feedback.
• Limited novelty: The method’s innovation is incremental, building on QA-LoRA and closely resembling group quantization applied to LoRA parameters. In Sec 3.2, LoQA uses two LoRA components, whereas QA-LoRA requires only one, which may lead to a higher parameter count.
We respectfully disagree with the characterization of LoQA as merely incremental. Like QA-LoRA, our method LoQA employs LoRA and supports reparameterization of LoRA into quantized weights. However, LoQA makes two fundamental contributions beyond QA-LoRA: (1) it pioneers the novel concept of merging LoRA into scale quantization parameters, and (2) introduces QBAS to address the abnormal magnitude issues in learning scale parameters. These advances represent significant improvements in both versatility and accuracy, supported by the following evidence:
- Firstly, LLM inference is primarily memory-bounded, where weight-only quantization methods accelerate computation by reducing memory access. Most widely-used PTQ methods compress zero points to integers (e.g., Omniquant[1], AffineQuant[2], FlatQuant[3]) or eliminate them entirely (e.g., Smooth Quant[4], AWQ[5]) for further acceleration. Additionally, some acceleration libraries like Marlin do not support floating-point zero points. Therefore, our innovation of introducing LoRA to scale parameters has substantial practical implications for deploying these methods.
[1] Shao W, Chen M, Zhang Z, et al. Omniquant: Omnidirectionally calibrated quantization for large language models[J]. arXiv preprint arXiv:2308.13137, 2023.
[2]Ma Y, Li H, Zheng X, et al. Affinequant: Affine transformation quantization for large language models[J]. arXiv preprint arXiv:2403.12544, 2024.
[3]Sun Y, Liu R, Bai H, et al. FlatQuant: Flatness Matters for LLM Quantization[J]. arXiv preprint arXiv:2410.09426, 2024.
[4]Xiao G, Lin J, Seznec M, et al. Smoothquant: Accurate and efficient post-training quantization for large language models[C]//International Conference on Machine Learning. PMLR, 2023: 38087-38099.
[5]Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration[J]. Proceedings of Machine Learning and Systems, 2024, 6: 87-100.
- Next, as demonstrated in our ablation studies (Section 4.2), when both zero points and scales are learnable, learning scale parameters achieves significantly better accuracy than increasing LoRA rank, despite using only half the number of parameters. This substantial improvement in accuracy with fewer parameters clearly demonstrates the effectiveness of our approach to scale parameter learning. Extensive experiments demonstrate that LoQA consistently outperforms previous fine-tuning methods that maintain quantized formats, and in many cases, matches the performance of state-of-the-art 4+16 bit methods. Notably, in ultra-low bit-width scenarios, LoQA's effectiveness is even more pronounced, with its 2-bit version surpassing the current 2+16-bit state-of-the-art method by 4.7% and even outperforming the original 16-bit model.
Dear Reviewer,
Thank you for your detailed review comments. Here are our responses:
• The LoRA-Merge section in Figure 2 is still not clear. I suggest that the authors enhance the caption to describe the meaning of each block in the figure.
Regarding the clarity of Figure 2, we fully agree with your suggestion. We have enhanced the caption by adding detailed descriptions of each block in the "LoRA-Merge" section and included a reference to the appendix in the main text to help readers better understand the implementation details.
• The paper has undergone significant modifications. I recommend that the authors highlight the changes in a different color to better showcase these updates or provide a detailed explanation of the revisions in the general response.
Concerning the visibility of our modifications, we have followed your constructive suggestion and highlighted all changes in blue to better showcase our updates.
• I am still concerned about the novelty: Regarding the significance of maxq, Appendix D provides experiments for the 4-bit case. Are there any analyses for different bit-widths? Additionally, could the distribution of LoRA magnitude also be influenced by initialization or learning rate? Please provide some analysis to further clarify the importance of QBAS.
We appreciate the reviewer’s concern about the novelty and the significance of QBAS across different bit-widths. To address this, we have added experimental results for the 3-bit and 2-bit cases, providing a more comprehensive analysis of QBAS's effectiveness across various bit-widths.
Regarding the potential influence of initialization and learning rate on the distribution of LoRA magnitude, we have conducted a detailed analysis. Among various factors, we identified that adjusting the learning rate for different bit-widths is theoretically the closest method to QBAS. This relationship is particularly apparent under the SGD optimizer, where the two approaches are theoretically equivalent, but differences emerge under the Adam optimizer:
Analysis: SGD vs. Adam

SGD directly updates the parameter using the gradient scaled by the learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta\, g_t$$

In contrast, Adam dynamically adjusts parameter updates by combining momentum ($\hat{m}_t$) and adaptive scaling ($\hat{v}_t$):

$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

With:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \quad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

Difference Between Dividing by maxq and Adjusting the Learning Rate to 1/maxq

In SGD

Case 1: Divide the gradient by maxq:

$$\theta_{t+1} = \theta_t - \eta\, \frac{g_t}{\mathrm{maxq}}$$

Case 2: Adjust the learning rate to $\eta/\mathrm{maxq}$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\mathrm{maxq}}\, g_t$$

For SGD, these two approaches are mathematically equivalent.

In Adam

Case 1: Divide the gradient by maxq. The calculations for both the first and second moments are affected:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \frac{g_t}{\mathrm{maxq}}, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, \frac{g_t^2}{\mathrm{maxq}^2}$$

Case 2: Adjust the learning rate to $\eta/\mathrm{maxq}$: the moments $m_t$ and $v_t$ are computed from the unscaled gradient, and only the step size changes.

For Adam, these two approaches are not equivalent due to the influence of maxq on the computation of $m_t$ and $v_t$. Dividing the gradient by maxq alters both moment estimates (and the scaling largely cancels in the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$), whereas adjusting the learning rate leaves the moments unchanged and simply shrinks every step.
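This non-equivalence can also be checked with a minimal numerical sketch (plain Python with a single scalar parameter and an arbitrary gradient sequence; an illustration only, not our training code):

```python
# Compare (a) dividing each gradient by maxq with (b) scaling the learning rate by 1/maxq,
# under SGD and under Adam, for a single scalar parameter.
beta1, beta2, eps = 0.9, 0.999, 1e-8
maxq = 2 ** 4 - 1  # e.g. 4-bit quantization

def sgd(grads, lr, grad_scale=1.0):
    theta = 0.0
    for g in grads:
        theta -= lr * (g * grad_scale)
    return theta

def adam(grads, lr, grad_scale=1.0):
    theta, m, v = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        g *= grad_scale
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta

grads, lr = [0.3, -0.1, 0.7, 0.2], 2e-4
print(sgd(grads, lr, grad_scale=1 / maxq), sgd(grads, lr / maxq))    # identical values
print(adam(grads, lr, grad_scale=1 / maxq), adam(grads, lr / maxq))  # clearly different values
```

Under SGD the two printed values match exactly; under Adam the moment normalization cancels most of the gradient scaling, so the two settings take visibly different steps.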
Further Experiments
To further explore this, we conducted additional experiments on 4-bit LLaMA-7B with Flan v2 using a constant learning rate. The results demonstrate that QBAS outperforms direct learning-rate adjustment. These findings emphasize QBAS's superior performance:
| Learning Rate | QBAS | humanities | STEM | social sciences | other | AVG |
|---|---|---|---|---|---|---|
| 1.33e-5 | ❌ | 43.1 | 35.9 | 53.2 | 52.4 | 45.9 |
| 2e-4 | ✅ | 43.4 | 37.5 | 56.5 | 53.7 | 47.4 |
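To illustrate how a bit-aware factor of this kind could be wired into a LoRA branch, here is a minimal PyTorch sketch. The class name and the exact scaling formula (alpha / rank / maxq with maxq = 2^bits - 1) are illustrative assumptions, not the QBAS implementation from the paper:

```python
import torch
import torch.nn as nn

class BitAwareLoRABranch(nn.Module):
    """Hypothetical LoRA branch whose effective scaling depends on the quantization bit-width.

    Assumes a bit-aware factor of 1 / maxq with maxq = 2**bits - 1, mirroring the
    gradient-scaling discussion above; this is a sketch, not the paper's QBAS code.
    """
    def __init__(self, in_features: int, out_features: int, rank: int = 64,
                 alpha: float = 16.0, bits: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        maxq = 2 ** bits - 1
        # Dividing by maxq keeps the effective perturbation (a quantization-parameter update
        # multiplied by integer weights up to maxq) at a comparable magnitude across bit-widths.
        self.scaling = alpha / rank / maxq

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A.T @ self.B.T) * self.scaling

branch = BitAwareLoRABranch(in_features=4096, out_features=4096, bits=4)
out = branch(torch.randn(2, 4096))  # shape: (2, 4096)
```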
Dear Authors,
Thank you for your thoughtful and detailed rebuttal. Most of my concerns have been addressed. I therefore have raised my score from 3 to 5.
However, as noted by reviewer FBJK, the method shows limited performance on datasets beyond MMLU. While the paper provides valuable insights, it would benefit from including additional datasets and models to further strengthen its contribution and demonstrate its broader applicability.
Best regards
This paper proposes a novel quantization-aware, parameter-efficient tuning method for large language models (LLMs) based on QA-LoRA. The approach reparameterizes both scaling and zero factors to mitigate quantization errors. Experimental results demonstrate its effectiveness.
Strengths
- Deployment is a crucial consideration for large language models (LLMs), with quantization-aware training being one of the key challenges in their deployment.
- The proposed method is simple yet promising, as indicated by the experimental results.
Weaknesses
- While I understand the core idea the author aims to convey, I believe the writing could be improved by providing a more general analysis of approaches to quantization-aware (QA) training or reparameterization of quantization parameters. This would be more valuable to the community and would make the contributions more significant.
- The contributions appear to be incremental.
- Figure 2 could be enhanced: it currently labels the LoRA parameters 𝐴 and 𝐵 for QLoRA, but lacks annotations for QA-LoRA and the proposed method. Additionally, the concept of Holistic LoRA is not clearly conveyed.
- This method seems to have a limitation in that it only supports group-wise quantization.
Questions
- Any inference time results?
- For other questions, please refer to the weaknesses.
While I understand the core idea the author aims to convey, I believe the writing could be improved by providing a more general analysis of approaches to quantization-aware (QA) training or reparameterization of quantization parameters. This would be more valuable to the community and would make the contributions more significant.
Thank you for your insightful suggestion about providing a more general perspective. While we appreciate the value of a broader analytical framework, our current focus on the intersection of LoRA and quantization is deliberate. In our Related Work section, we already cover Fine-Tuning of Quantized Parameters, which represents the main developments in QA training. We chose to emphasize the LoRA+quantization approach specifically because it helps readers intuitively understand the efficiency of our method. While a more general analysis could be valuable, we believe our current presentation effectively communicates our technical contributions while maintaining accessibility for readers.
The contributions appear to be incremental.
We respectfully disagree with the assessment of LoQA as incremental. LoQA introduces two fundamental innovations that advance the field of LLM quantization: pioneering the concept of merging LoRA into scale quantization parameters and developing QBAS to address abnormal magnitude issues in scale parameter learning. These advances are particularly significant for practical deployment, as most widely-used PTQ methods (such as Omniquant, AffineQuant, GPTQ, QuaRot) compress zero points to integers or eliminate them entirely (like Smooth Quant, AWQ) for acceleration. Additionally, many acceleration libraries like Marlin do not support floating-point zero points, making our scale parameter innovation especially valuable.
The effectiveness of our approach is clearly demonstrated in our ablation studies (Section 4.2), where learning scale parameters achieves significantly better accuracy than increasing LoRA rank, despite using only half the number of parameters.
Figure 2 could be enhanced: it currently labels the LoRA parameters 𝐴 and 𝐵 for QLoRA, but lacks annotations for QA-LoRA and the proposed method. Additionally, the concept of Holistic LoRA is not clearly conveyed.
Thank you for this observation. We have revised Figure 2 to provide a more comprehensive and clear visualization of all methods. The updated figure now includes complete parameter annotations for QLoRA, QA-LoRA, and our proposed method. We have explicitly labeled all LoRA parameters (A, B, A', B') across different approaches and added clear illustrations of the Holistic LoRA concept, showing how the two LoRA variants interact with scale and zero point components. These improvements should make the architectural differences and innovations much clearer to readers.
This method seems to have a limitation in that it only supports group-wise quantization.
While our method primarily focuses on group-wise quantization, this design choice offers significant advantages. Group-wise quantization provides excellent inference acceleration for large models while minimizing accuracy loss at low bit-widths, making it widely adopted in practice. Most state-of-the-art uniform quantization PTQ methods use group-wise quantization as their primary benchmark, and our method naturally adapts to this paradigm.
Any inference time results?
LoQA's key advantage lies in its compatibility with established weight-only quantization acceleration frameworks. We have implemented benchmarks using both MARLIN and BitBLAS on an A100 GPU (batch size=16), demonstrating significant speedup ratios. The detailed computational costs and acceleration metrics have been incorporated into the paper, showing how LoQA effectively leverages existing optimization techniques for efficient inference.
| Model | TFLOP/s | Speedup |
|---|---|---|
| Llama7B | 63.788 | 2.71 |
| Llama13B | 76.907 | 3.31 |
| Llama33B | 87.907 | 3.5 |
| Llama65B | 92.807 | 3.61 |
| Falcon180B | 104.5 | 3.81 |
Dear reviewer Jkth,
As the deadline for discussion is ending soon. Please respond to the authors to indicate you have read their rebuttal. If you have more questions, now is the time to ask.
AC
Thanks for the authors' efforts and responses. I still have some concerns about the contributions of this paper. I am not convinced by the response of the "incremental" weakness. I choose to maintain my score.
This paper aims to enable efficient LLM fine-tuning by combining quantization with LoRA. Rather than keeping LoRA inference in FP/BF16 during the forward pass, it proposes a holistic quantized LoRA approach where LoRA is fused into the quantization set to enable fully quantized inference. Additionally, the paper introduces a method to determine the LoRA scaling factor based on the target bit precision.
The proposed method achieves higher fine-tuning performance and training efficiency compared to QLoRA and QA-LoRA at 4/3/2-bit precision.
Strengths
- The related work and motivation are clearly presented, making the problem the paper addresses easy to understand.
- The holistic quantized LoRA approach follows a similar direction to QA-LoRA, aiming to achieve fully quantized inference by fusing LoRA. The paper introduces a way to learn both the scaling and zero-point parameters, which adds flexibility to the method.
Weaknesses
- The holistic quantization method formulation is somewhat difficult to follow. Figure 2 is overly simplified, and would benefit from revision to visually clarify the equations in Section 3.2.
- Experiments are limited to the LLaMA family, and there is a lack of results on LLaMA-3. If the dataset is an issue, it may be preferable to use more recent open-source datasets (e.g., SlimPajama) instead of the relatively old and small Alpaca dataset. Alternatively, comparing performance on domain-specific fine-tuning tasks (e.g., GSM8K, OASST1), as done in LoftQ/LQ-LoRA, could provide additional insights.
- The main contribution of this paper appears to be preserving the quantization set during inference. However, the proposed method lacks significant innovation compared to QA-LoRA, and there is no report on the actual efficiency gains achieved during fully quantized inference. Section 4.3 only briefly mentions the potential for kernel optimizations, which seems insufficient given that fully quantized inference is the main focus of this paper.
In summary, while the proposed method consistently outperforms QA-LoRA and addresses an important problem, the limited experimental results and unclear explanations of key methods indicate room for improvement.
Questions
- In Section 3.2, the proposed method is still not entirely clear. On Line 249, what exactly do you mean by “two LoRA variants”?
The holistic quantization method formulation is somewhat difficult to follow. Figure 2 is overly simplified, and would benefit from revision to visually clarify the equations in Section 3.2.
Thank you for this valuable feedback regarding the clarity of our presentation. We have significantly enhanced Figure 2 to provide a more comprehensive visual explanation of our method. The key improvements include:
- Illustrating HQ-LoRA using two distinct LoRA components for better visualization
- Explicitly showing the mergeable components in the LoRA-Merge section, corresponding to equations (8) and (9) in the original text
- Clearly demonstrating LoQA's novel contributions compared to previous methods
To make the holistic quantization method formulation more accessible, we have also included a special case analysis in Section 3.2 with group size=1. We believe this simplified example significantly aids in understanding the general formulation.
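One way to read that special case (our own reconstruction in generic notation, not a quotation of Section 3.2):

```latex
% With group size 1, every weight carries its own scale and zero point:
\hat{W}_{ij} = s_{ij}\,(Q_{ij} - z_{ij}).
% Any low-rank update \Delta W can then be folded directly into the quantization parameters, e.g.
z_{ij} \;\mapsto\; z_{ij} - \frac{\Delta W_{ij}}{s_{ij}},
% and the general group-wise case only adds the constraint that the LoRA granularity
% match the shared per-group scales and zero points.
```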
Experiments are limited to the LLaMA family, and there is a lack of results on LLaMA-3. If the dataset is an issue, it may be preferable to use more recent open-source datasets (e.g., SlimPajama) instead of the relatively old and small Alpaca dataset. Alternatively, comparing performance on domain-specific fine-tuning tasks (e.g., GSM8K, OASST1), as done in LoftQ/LQ-LoRA, could provide additional insights.
Thank you for this detailed feedback. We have expanded our experiments and analysis in several important ways:
- Model Generalization: We have extended our evaluation beyond the LLaMA family by testing on OPT-6.7B with Flan-v2:

| Method | Dataset | #Bit | MMLU (5-shot) |
|---|---|---|---|
| OPT 6.7B | - | 16 | 24.57 |
| QA-LoRA | FLAN-v2 | 4 | 29.85 |
| LoQA | FLAN-v2 | 4 | 33.78 |

- Domain-Specific Tasks: We've added experiments on LLaMA3 using GPTQ quantization for GSM8K:

| Method | Dataset | #Bit | GSM8K (0-shot) |
|---|---|---|---|
| LLaMA3-8B-GPTQ | - | 4 | 5.99% |
| QA-LoRA | GSM8K | 4 | 52.62% |
| LoQA | GSM8K | 4 | 56.41% |
The main contribution of this paper appears to be preserving the quantization set during inference. However, the proposed method lacks significant innovation compared to QA-LoRA, and there is no report on the actual efficiency gains achieved during fully quantized inference. Section 4.3 only briefly mentions the potential for kernel optimizations, which seems insufficient given that fully quantized inference is the main focus of this paper.
Regarding inference speed, we want to clarify that LoQA's primary advantage isn't direct acceleration compared to QA-LoRA, as both methods aim to integrate LoRA into low-bit models. However, LoQA is compatible with existing weight-only quantization acceleration frameworks like MARLIN[1] and BitBLAS[2]. We have conducted acceleration tests on an A100 GPU with batch size 16, and the detailed computational costs and speedup ratios have been added to the paper.
| Model | TFLOP/s | Speedup |
|---|---|---|
| Llama7B | 63.788 | 2.71 |
| Llama13B | 76.907 | 3.31 |
| Llama33B | 87.907 | 3.5 |
| Llama65B | 92.807 | 3.61 |
| Falcon180B | 104.5 | 3.81 |
[1]MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
[2]Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
In Section 3.2, the proposed method is still not entirely clear. On Line 249, what exactly do you mean by “two LoRA variants”?
In Section 3.2, "two LoRA variants" refers to:
- The A'B' matrices that fine-tune the scale component
- The AB matrices that fine-tune the zero point component
We have visually illustrated this dual-LoRA structure in Figure 2 to aid understanding. The diagram shows how these two distinct LoRA components work together.
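Schematically, the two variants can be pictured together as follows (generic notation; the exact merged form is given by equations (8) and (9) in the paper):

```latex
% Group-wise quantized weight with low-rank corrections to both quantization parameters:
\hat{W} \;=\; \big(s + \underbrace{A'B'}_{\text{scale LoRA}}\big)
        \odot \big(Q - z - \underbrace{AB}_{\text{zero-point LoRA}}\big)
% Q holds the frozen integer weights; s and z are the group-wise scale and zero point.
% Both corrections merge back into s and z after fine-tuning, so inference stays fully quantized.
```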
Thank you for the detailed response. The experimental results you presented partially addressed my concerns. However, I still find that the contribution of LoQA does not significantly deviate from that of the existing QA-LoRA, which remains a concern. Additionally, I noticed that the accuracy drop for LLaMA-3-8B-GPTQ 4-bit is excessively large, which raises some questions about the experimental setup. Therefore, I will maintain my score.
Dear Reviewer,
Thank you for your feedback. Here are our responses:
Thank you for the detailed response. The experimental results you presented partially addressed my concerns. However, I still find that the contribution of LoQA does not significantly deviate from that of the existing QA-LoRA, which remains a concern.
As we discussed with other reviewers, we respectfully disagree with the characterization of LoQA as merely incremental. Like QA-LoRA, our method LoQA employs LoRA and supports reparameterization of LoRA into quantized weights. However, LoQA makes two fundamental contributions beyond QA-LoRA: (1) it pioneers the novel concept of merging LoRA into scale quantization parameters, and (2) introduces QBAS to address the abnormal magnitude issues in learning scale parameters. These advances represent significant improvements in both versatility and accuracy, supported by the following evidence:
- Firstly, LLM inference is primarily memory-bounded, where weight-only quantization methods accelerate computation by reducing memory access. Most widely-used PTQ methods compress zero points to integers (e.g., Omniquant[1], AffineQuant[2], FlatQuant[3]) or eliminate them entirely (e.g., Smooth Quant[4], AWQ[5]) for further acceleration. Additionally, some acceleration libraries like Marlin do not support floating-point zero points. Therefore, our innovation of introducing LoRA to scale parameters has substantial practical implications for deploying these methods.
[1] Shao W, Chen M, Zhang Z, et al. Omniquant: Omnidirectionally calibrated quantization for large language models[J]. arXiv preprint arXiv:2308.13137, 2023.
[2]Ma Y, Li H, Zheng X, et al. Affinequant: Affine transformation quantization for large language models[J]. arXiv preprint arXiv:2403.12544, 2024.
[3]Sun Y, Liu R, Bai H, et al. FlatQuant: Flatness Matters for LLM Quantization[J]. arXiv preprint arXiv:2410.09426, 2024.
[4]Xiao G, Lin J, Seznec M, et al. Smoothquant: Accurate and efficient post-training quantization for large language models[C]//International Conference on Machine Learning. PMLR, 2023: 38087-38099.
[5]Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration[J]. Proceedings of Machine Learning and Systems, 2024, 6: 87-100.
- Next, as demonstrated in our ablation studies (Section 4.2), when both zero points and scales are learnable, learning scale parameters achieves significantly better accuracy than increasing LoRA rank, despite using only half the number of parameters. This substantial improvement in accuracy with fewer parameters clearly demonstrates the effectiveness of our approach to scale parameter learning. Extensive experiments demonstrate that LoQA consistently outperforms previous fine-tuning methods that maintain quantized formats, and in many cases, matches the performance of state-of-the-art 4+16 bit methods. Notably, in ultra-low bit-width scenarios, LoQA's effectiveness is even more pronounced, with its 2-bit version surpassing the current 2+16-bit state-of-the-art method by 4.7% and even outperforming the original 16-bit model.
We have also added additional analyses of QBAS in the revised manuscript, further demonstrating its effectiveness. We encourage you to review these updates and would be happy to discuss them further.
Additionally, I noticed that the accuracy drop for LLaMA-3-8B-GPTQ 4-bit is excessively large, which raises some questions about the experimental setup. Therefore, I will maintain my score.
Regarding the reported performance drop for LLaMA-3-8B-GPTQ 4-bit, we acknowledge the concern and have investigated this issue further. The training and testing scripts for GSM8K primarily follow the settings from LoftQ [1]. A potential reason for the poor performance is the use of 0-shot testing with specific formatting. To provide additional context, we have included the results of the 16-bit baseline models as a reference:
| Method | Dataset | #Bit | GSM8K(0-shot) |
|---|---|---|---|
| Meta-Llama-3-8B | - | 16 | 5.00% |
| Meta-Llama-3-8B-Instruct | - | 16 | 37.68% |
Additionally, we have updated our supplementary materials to include the exact training and testing scripts used for this experiment. If you have further questions or suggestions about the experimental setup, we would be more than happy to discuss them in detail.
[1]Li Y, Yu Y, Liang C, et al. Loftq: Lora-fine-tuning-aware quantization for large language models[J]. arXiv preprint arXiv:2310.08659, 2023.
The paper introduces Low-Rank Quantization Adaptation, called LoQA, a new approach for fine-tuning large language models (LLMs) while preserving quantization. LoQA consists of two techniques: (1) Holistic Quantization Low-Rank Adaptation (HQ-LoRA): This approach fine-tunes all quantized parameters, enabling broader optimization without losing the quantized model structure. (2) Quantized Bit-Aware Scaling (QBAS): This technique dynamically adjusts scaling factors based on bit-widths, enhancing performance stability across various quantization levels. LoQA is tested against state-of-the-art quantization methods and is effective across different model sizes and bit configurations. It yields significant improvements, particularly under ultra-low bit-width scenarios.
Strengths
- LoQA introduces a novel quantization-aware fine-tuning method that can support quantized LLMs to effectively balance model compression with performance.
- Experiments show that LoQA consistently outperforms previous methods in accuracy and efficiency, even in ultra-low bit-width configurations.
- The method supports multiple bit-widths and integrates smoothly with existing post-training quantization methods, making it adaptable to a wide range of applications.
- LoQA is evaluated using the MMLU and common sense QA tasks, with consistent results.
Weaknesses
- LoQA requires more training time and memory compared to some baseline methods (e.g., QA-LoRA), which may hinder its practicality in resource-constrained environments.
- The study did not use the latest datasets or the newest training paradigms, which might limit the generalizability of the results to other, more current LLMs (as also mentioned in Appendix).
Questions
Would you please describe how LoQA could be implemented in production?
Thank you for your detailed review. We would like to address each of your points:
LoQA requires more training time and memory compared to some baseline methods, which may hinder its practicality in resource-constrained environments.
Regarding training costs, we provide a comprehensive analysis in Section 4.3 (Training Cost), which demonstrates that LoQA requires only approximately 1.3 times the training time of QA-LoRA, while LoQA-S demands even less than this 1.3-fold increase. For context, according to the QA-LoRA study, QLoRA necessitates approximately twice the training time of QA-LoRA [1]. These results indicate that, under equivalent optimization conditions, LoQA achieves optimal results with a balanced training cost.

[1] Xu Y, Xie L, Gu X, et al. Qa-lora: Quantization-aware low-rank adaptation of large language models[J]. arXiv preprint arXiv:2309.14717, 2023.
The study did not use the latest datasets or the newest training paradigms, which might limit the generalizability of the results to other, more current LLMs.
Concerning the use of datasets and training paradigms, we acknowledge this limitation in Appendix A. While using newer training paradigms or advanced PTQ/LoRA methods could potentially improve results, our current experimental setup effectively demonstrates our method's core advantages. A complete reproduction across all newer models and methods would require substantial computational resources without necessarily providing additional insights into our method's fundamental contributions.
Implementing LoQA in production may involve significant effort due to the complexity of fine-tuning across various bit-widths, which could add deployment challenges.
We respectfully disagree with the concern about deployment challenges. LoQA does not introduce significant overhead for fine-tuning across different bit-widths. In fact, our method is more deployment-friendly than previous approaches as it doesn't rely on floating-point zero points, making it widely compatible with various PTQ methods and acceleration kernels. We have illustrated this clearly in Figure 2 and elaborated in our response to reviewer FBJK.
Would you please describe how LoQA could be implemented in production?
As for production implementation, LoQA is particularly well-suited for resource-constrained environments. It enables efficient fine-tuning while maintaining quantization properties, making it ideal for edge devices or scenarios with limited resources. For instance, models deployed locally can be fine-tuned on user-specific data with limited memory while directly producing quantized models for serving. The method's compatibility with various PTQ methods and acceleration kernels, due to its independence from floating-point zero points, ensures broad applicability in production settings.
Dear reviewer 45yq,
As the deadline for discussion is ending soon. Please respond to the authors to indicate you have read their rebuttal. If you have more questions, now is the time to ask.
AC
This paper extends the QA-LoRA method by introducing Low-Rank Quantization Adaptation (LoQA) to enhance the fine-tuning of large language models (LLMs) within a quantized framework. LoQA addresses the integration challenges between quantization and Low-Rank Adaptation (LoRA), proposing Holistic Quantization Low-Rank Adaptation (HQ-LoRA), a quantization-compatible adaptation technique that allows finetuning of all quantization parameters (including scale and zero points) resulting in improved model performance. Additionally, the paper introduces Quantized Bit-aware scaling (QBAS), which adjusts LoRA scaling to account for the impact of integer weights at varying bit-widths. Overall, LoQA extends the QA-LoRA paper by modifying not only the quantization zero points (as in QA-LoRA) but also the quantization scale.
Strengths
Strengths:
- Considering the quantization scaling as well during fine-tuning gives much more representational power compared to earlier approaches like QA-LoRA, which focus only on quantization zero points. Also, as demonstrated by the empirical results, it usually results in better generalization across a diverse set of models and bit budgets.
- LoQA especially outperforms previous methods at low bit budgets, which makes it useful for ultra-resource-efficient settings.
- In the 4-bit regime, LoQA performs competitively with algorithms like IR-QLoRA that use a mix of 4/16 bits in their inference deployment.
- Given that it does not carry 16-bit LoRA parameters into the inference phase (as opposed to methods like QLoRA), it brings considerable inference efficiency compared to "4+16"-type methods.
Weaknesses
In general, this method is interesting - however, in terms of novelty, this work is indeed an incremental work on top of the QA-LoRA work. Moreover, this paper could benefit from improvements in the presentation and description of the experiments.
Weaknesses:
- The paper's presentation could be improved, especially in the experiments section. Implementation details are underspecified in many places. Given that this research effort is based on empirical results, this lack of experimental detail would make it hard for the community to reproduce the results.
- For experiments comparing LoQA with QA-LoRA (as well as other similar methods), the corresponding ranks of the methods are not reported in most cases, which makes the reader question whether the improvement could be due to LoQA using more parameters.
- There is a lot of focus on MMLU evaluation after fine-tuning on the Alpaca/Flan datasets, which might not be representative of how this algorithm would actually work for other types of tasks/datasets. The only exception, Table 5, focuses on commonsense reasoning, where the performance is very close to QA-LoRA, which uses half of your training complexity by having a single set of LoRA parameters for zero points.
- Table captions in many cases are not self-contained and are often very general, which makes it hard to go through the different tables and understand the results.
Questions
- Regarding QBAS: the scaling that you propose is then multiplied by a hyperparameter alpha. Are you using the same alpha across different bits?
Thank you for your thorough and constructive feedback. Your comments have helped us significantly improve the clarity and rigor of our manuscript. We have carefully addressed each point to enhance reproducibility, clarify our contributions, and better communicate our experimental results. We believe these revisions have strengthened the paper and made our technical contributions more accessible to the research community.
In general, this method is interesting - however, in terms of novelty, this work is indeed an incremental work on top of the QA-LoRA work.
We respectfully disagree with the characterization of LoQA as merely incremental. Like QA-LoRA, our method LoQA employs LoRA and supports reparameterization of LoRA into quantized weights. However, LoQA makes two fundamental contributions beyond QA-LoRA: (1) it pioneers the novel concept of merging LoRA into scale quantization parameters, and (2) introduces QBAS to address the abnormal magnitude issues in learning scale parameters. These advances represent significant improvements in both versatility and accuracy, supported by the following evidence:
-
Firstly, LLM inference is primarily memory-bounded, where weight-only quantization methods accelerate computation by reducing memory access. Most widely-used PTQ methods compress zero points to integers (e.g., Omniquant[1], AffineQuant[2], FlatQuant[3]) or eliminate them entirely (e.g., Smooth Quant[4], AWQ[5]) for further acceleration. Additionally, some acceleration libraries like Marlin do not support floating-point zero points. Therefore, our innovation of introducing LoRA to scale parameters has substantial practical implications for deploying these methods.
[1] Shao W, Chen M, Zhang Z, et al. Omniquant: Omnidirectionally calibrated quantization for large language models[J]. arXiv preprint arXiv:2308.13137, 2023.
[2]Ma Y, Li H, Zheng X, et al. Affinequant: Affine transformation quantization for large language models[J]. arXiv preprint arXiv:2403.12544, 2024.
[3]Sun Y, Liu R, Bai H, et al. FlatQuant: Flatness Matters for LLM Quantization[J]. arXiv preprint arXiv:2410.09426, 2024.
[4]Xiao G, Lin J, Seznec M, et al. Smoothquant: Accurate and efficient post-training quantization for large language models[C]//International Conference on Machine Learning. PMLR, 2023: 38087-38099.
[5]Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration[J]. Proceedings of Machine Learning and Systems, 2024, 6: 87-100.
-
Next, as demonstrated in our ablation studies (Section 4.2), when both zero points and scales are learnable, learning scale parameters achieves significantly better accuracy than increasing LoRA rank, despite using only half the number of parameters. This substantial improvement in accuracy with fewer parameters clearly demonstrates the effectiveness of our approach to scale parameter learning. Extensive experiments demonstrate that LoQA consistently outperforms previous fine-tuning methods that maintain quantized formats, and in many cases, matches the performance of state-of-the-art 4+16 bit methods. Notably, in ultra-low bit-width scenarios, LoQA's effectiveness is even more pronounced, with its 2-bit version surpassing the current 2+16-bit state-of-the-art method by 4.7% and even outperforming the original 16-bit model.
- The paper's presentation could be improved, especially in the experiments section. Implementation details are underspecified in many places. Given that this research effort is based on empirical results, this lack of experimental detail would make it hard for the community to reproduce the results.
We appreciate this concern about reproducibility. We would like to point out that comprehensive implementation details are provided in Appendix B under the SETTINGS section, where we document most critical experimental parameters. To further enhance reproducibility, we have added Table 9 which explicitly lists key training hyperparameters. Additionally, our released code repository includes complete reproduction procedures and detailed implementation specifics.
- For experiments comparing LoQA with QA-LoRA (as well as other similar methods), the corresponding ranks of the methods are not reported in most cases, which makes the reader question whether the improvement could be due to LoQA using more parameters.
We thank the reviewer for this important question about parameter efficiency. We would like to clarify that all comparative experiments with previous methods (QA-LoRA, QLoRA, and IR-QLoRA) consistently use rank=64. We have now made this explicit in the experimental section for better clarity.
Furthermore, our ablation analysis in Section 4.2 (Table 8) specifically investigates the impact of parameter count. The results demonstrate that the performance improvements of LoQA are not simply due to increased parameter count. While increasing parameters through higher rank yields only marginal improvements, our method achieves significant performance gains through more effective parameter utilization.
- There is a lot of focus on MMLU evaluation after fine-tuning on the Alpaca/Flan datasets, which might not be representative of how this algorithm would actually work for other types of tasks/datasets. The only exception, Table 5, focuses on commonsense reasoning, where the performance is very close to QA-LoRA, which uses half of your training complexity by having a single set of LoRA parameters for zero points.
We appreciate this concern regarding evaluation diversity. Our choice to evaluate on MMLU after Alpaca/Flan fine-tuning aligns with previous works (QA-LoRA, IR-QLoRA) as this setup effectively assesses both instruction-following capabilities and knowledge retention.
Regarding the commonsense reasoning results, we would like to emphasize that improving upon QA-LoRA's strong baseline of 65.0 on commonsense QA (achieved after Flan-v2 fine-tuning) is particularly challenging. In this context, LoQA's improvement to 65.6 represents a meaningful advance. To further validate our method's effectiveness across different scenarios, we have expanded Table 5 to include additional comparison results between LoQA and QA-LoRA on the Alpaca dataset, which further demonstrates the consistent advantages of our approach.
- Table captions in many cases are not self-contained and are often very general, which makes it hard to go through the different tables and understand the results.
We thank the reviewer for this feedback regarding table clarity. We have revised all table captions to be more comprehensive and self-contained.
Regarding QBAS: the scaling that you propose is then multiplied by a hyperparameter alpha. Are you using the same alpha across different bits?
Yes, we maintain consistent hyperparameters across different bit-widths, including the same alpha value, except for those parameters explicitly specified in the tables. This consistency ensures fair comparisons and reliable experimental results.
Thank you for your response, as well as the improvements in the paper presentation. While I appreciate the clarifications you've provided, I remain unconvinced that the novelty of your approach significantly advances beyond the existing QA-LoRA method. The proposed changes still seem to represent a relatively marginal improvement over QA-LoRA, both in terms of methodology and results. Additionally, the gains observed on non-MMLU benchmarks appear to be quite small compared to QA-LoRA, which raises concerns about the overall impact of the approach in the community. Therefore, I would keep my current score.
Dear reviewer FBJK,
As the deadline for discussion is ending soon. Please respond to the authors to indicate you have read their rebuttal. If you have more questions, now is the time to ask.
AC
We sincerely thank all the reviewers and area chairs for their time and efforts in reviewing our paper. We appreciate the insightful and constructive comments provided by the reviewers. We have addressed each of the points raised and made revisions accordingly. Below is a summary of the main contributions:
Novelty
We propose a novel and efficient fine-tuning method, LoQA, for quantized large language models. This method seamlessly integrates Low-Rank Adaptation (LoRA) with group-wise quantization into a unified framework. Compared to previous work, LoQA makes two fundamental contributions: (1) it introduces the novel concept of merging LoRA with scale quantization parameters, and (2) it introduces QBAS to address the abnormal magnitude issues in learning scale parameters. These advances represent significant improvements in the versatility and accuracy of such methods.
Strong and Comprehensive Empirical Results
The experimental results demonstrate consistent improvements across different LLM architectures and quantization bit-widths. Our experiments cover a variety of architectures (e.g., LLaMA1, LLaMA2, LLaMA3, OPT) and model sizes, and showcase results across different datasets (Alpaca, Flan v2, GSM8K) and bit-widths. LoQA exhibits its adaptability across models and sizes, with significant improvements in both Commonsense QA and MMLU accuracy. Despite using only half the number of parameters, LoQA achieves a substantial accuracy boost. Extensive experiments show that LoQA consistently outperforms previous fine-tuning methods that retain quantized formats and, in many cases, matches the performance of state-of-the-art 4+16 bit methods. Notably, in ultra-low bit-width scenarios, LoQA's effectiveness is even more pronounced, with its 2-bit version outperforming the current 2+16 bit state-of-the-art method by 4.7%, and even surpassing the original 16-bit model.
Inference Efficiency
LoQA preserves the inference speed advantages of quantization by merging the low-rank adaptation weights, while also being compatible with the latest inference engines such as Marlin and BitBLAS. This results in significant improvements in inference speed, which is crucial for deploying LLMs in resource-constrained environments.
We also appreciate the constructive suggestions and questions raised by the reviewers, which have led to fruitful discussions and additional experiments. We have made appropriate revisions based on these suggestions, with all changes marked in blue for easy reference. We believe these revisions further strengthen the quality of the paper. Below are the major revisions:
1. We have re-drawn Figure 2 to improve clarity and added further explanations for the "LoRA-Merge" section, with different color blocks to aid reader comprehension.
2. We revised the captions of each table to make them more self-contained and reader-friendly, and added explanations of parameters such as "rank" and "4+16-bit" to prevent any misunderstandings.
3. We added comprehensive implementation details in Appendix A.
4. We included a comparison of Commonsense QA results on the Alpaca dataset to further highlight the accuracy advantages of our method.
5. We included evaluation results on LLaMA3 using the domain-specific dataset GSM8K.
6. We added Appendix D, which provides a visual analysis of the QBAS method, comparing magnitude changes before and after using QBAS to clearly demonstrate its effect.
7. We included Appendix F, where we tested our method on Flan-v2 with OPT-6.7B, expanding our evaluation beyond the LLaMA series.
8. We added Appendices G and F, which include detailed information on training and inference speed experiments.
Finally, we have addressed all raised questions and concerns, and detailed responses to individual reviewers are provided below. We look forward to any additional comments.
This paper proposes Low-Rank Quantization Adaptation (LoQA), an approach to fine-tuning LLMs within a quantized framework that introduces two primary techniques: Holistic Quantization LoRA (HQ-LoRA) and Quantized Bit-Aware Scaling (QBAS), which aim to enhance model performance and efficiency during quantization. The research demonstrates consistent improvements, particularly in ultra-low bit-width scenarios, with most reviewers acknowledging its potential to balance model compression and performance (45yq, jKth, 2qNk). Reviewers noted strengths including the method's ability to fine-tune quantization parameters comprehensively, outperform previous approaches like QA-LoRA, and show promising results across different model sizes and bit configurations (FBJK, 45yq, n8i8).
However, the paper exhibits several significant weaknesses that led to its negative rating (all reviewers gave a 5). Reviewers consistently pointed out limited novelty, with the work being largely incremental to previous research (2qNk, jKth, FBJK). Experimental limitations were prominent, including underspecified implementation details, restricted dataset usage (primarily Alpaca/Flan), and a lack of comprehensive comparisons with other SOTA methods (FBJK, n8i8, 2qNk). Presentation issues were also critical, with reviewers highlighting unclear figures, inconsistent terminology, missing context in tables, and insufficient explanation of key methodological components (n8i8, 2qNk, jKth). Additionally, concerns were raised about the method's training complexity, limited generalizability, and the absence of clear inference time results (45yq, jKth, 2qNk).
Additional Comments from the Reviewer Discussion
All reviewers acknowledged the experiments section, but they also share the novelty concern, which is the major reason for rejection.
Reject