Analog Foundation Models
We train analog foundation models that are robust to noise present in analog in-memory computing hardware and demonstrate accuracy comparable to models trained with 4-bit weight and 8-bit static input quantization.
Abstract
Reviews and Discussion
The paper addresses the problem of deploying LLMs on Analog In-Memory Computing (AIMC) hardware. The authors propose a methodology to make a pre-trained LLM robust to the non-idealities encountered in AIMC hardware (quantization, static ranges, noise). Inspired by LLM-QAT, the method is based on fine-tuning a pre-trained model using knowledge distillation from the full-precision baseline, on a large number of samples (20B tokens) generated by the LLM itself, using existing analog hardware-aware training and evaluation methods and tools developed in previous work. The approach is extensively evaluated on numerous benchmarks, using two standard pre-trained LLMs (1B and 3.8B parameters), showing better accuracy than LLM-QAT on the simulated noisy hardware.
Strengths and Weaknesses
Strengths :
- Quality : The work appears in a very mature form. The approach is evaluated on numerous benchmarks. The claims are in general well supported.
- Clarity : The paper is very clear, easy-to-read, and gives sufficient information to reproduce the experiments.
- Significance : The work is significant as it is the first work to investigate the scaling of LLM (of moderate size) for AIMC accelerators. It is an important step towards making the implementation of LLMs on AIMC possible.
- Originality : The originality lies in the combination of existing methods and the demonstration at large scale.
Weaknesses :
- Quality :
- In my opinion, the authors should spend more time on how the strengths and weaknesses of their method depend on various factors (see questions below).
- I have doubts about several assumptions of the hardware model that should be clarified (see question 1).
- One of the claims is that the method shows better scaling behavior under test-time compute scaling than models trained with 4-bit per-channel weight and 8-bit static input quantization. Although this is supported by an experiment, I do not find the difference in accuracy between the two very large; moreover, the authors do not provide at least an intuition for why this should be the case, so I would moderate this claim.
- Significance : The re-usability of the work could be improved if the paper studied various model sizes as well as various levels of hardware noise, to better understand the extent of the method. However, I understand that this could be left for future work.
- Originality :
- This work builds upon [52], which developed the analog hardware-aware training method and toolkit. The link with this work should be clearly stated in the background and related works section.
- The paper demonstrates many properties of the method empirically; however, there is rather limited explanation / intuition as to why they hold. The paper could be improved with more analysis (see my questions).
Questions
- Assumptions on the hardware model. I would consider increasing my rating if the authors could clarify / justify these points.
- The model based on [52] uses weights in 16 bits: is this precision really achievable with the AIMC chip of [52] using 4 devices per weight? In my understanding of AIMC, since the precision is represented by the number of encoded levels, I have difficulty imagining 2^16 distinct levels with only 4 devices.
- The authors assume that zero-valued weights have no noise, while the model in Le Gallo et al. shows that they do have noise (2% according to Fig. 8). What is the justification? I believe this can strongly impact the results, since a large proportion of weights have zero value and zero is very sensitive to noise (low SNR).
- The authors acknowledge that there exist various types of noise in AIMC, such as temporal variability or device-to-device variability. However, the type of noise modelled in the paper is not mentioned. I believe the noise is dynamic (sampled independently at each inference), and the authors do not model static device-to-device noise. Could the authors clarify this point?
- Analysis of the method. The authors show that weight clipping is more important than noise injection to increase robustness to noise. What is the explanation for that? Moreover, the authors suggest that smaller weights are less robust than bigger weights (in the experiments on additive Gaussian noise). Why is that, knowing that bigger weights have a lower SNR (according to Fig. 8)? Isn't that in contradiction with the fact that their method is made to favor a small range of weights? I would consider increasing my rating if the authors could answer these questions.
- How do the results depend on factors such as LLM model size, quantization precision, and noise level? Indeed, the whole paper assumes these parameters are fixed, while I believe they play an important role. For instance, I expect bigger LLMs to be more robust to noise and aggressive quantization. Moreover, the noise model used seems to have a low noise level. What would happen on AIMC with a higher noise level, or with lower weight precision? I would consider increasing my rating if the authors could provide hints on the sensitivity of their method to these parameters.
- How does the method impact training speed and resources? I would expect noisy training to slow down learning. It seems so according to Tables 7 and 8: while the Analog FM plateaus after 20B training tokens, LLM-QAT already plateaus after 10B tokens. I think this should be highlighted in the limitations section.
- Comparison with LLM-QAT for quantization on digital hardware. The fact that the method yields better results without noise on digital quantization is surprising and deserves more explanation. Is it the noise injection, the clipping in the weight quantization, or the output static quantization that makes the model more robust to quantization error? Does this mean that QAT methods would improve by introducing these features?
Limitations
I think the authors could spend more time on the limitations of their work (e.g. see my questions 3 and 4), in particular explaining the sensitivity of their method to parameters such as model size, quantization precision and noise level, and highlighting potential limitations in terms of training resources.
Final Justification
I have read the other reviewers' comments and the authors' responses. Overall, I believe the authors have done impactful work with this paper and correctly addressed the reviewers' concerns; therefore, I would like to recommend acceptance. The main limitations concerned the evaluation and sensitivity of the method with regard to other / more realistic hardware noise models, and the addition of the work's limitations to the paper. I believe the authors have addressed those concerns by providing more experiments and explanations, and they agreed to add them to the final version of the paper. I thank the authors very much for their work and responses!
Formatting Concerns
n/a
General Response
We thank Reviewer bCEm for noting that our work "appears in a very mature form" and is "very clear, easy-to-read, and gives sufficient information to reproduce the experiments." We appreciate their recognition that our work is "significant as it is the first work to investigate the scaling of LLM (of moderate size) for AIMC accelerators". We are grateful for their acknowledgment that our "claims are in general well supported" through evaluation on "numerous benchmarks." We address the comments of the reviewer below.
Limitations Section
Reviewer Comment: The authors could spend more time on the limitations of their work (e.g. see my questions 3 and 4)
We will add these points to the limitations section.
Question 1: Weight Precision
Comment 1a
AIMC devices are typically programmed using a read-write-verify scheme, in which voltage pulses nudge a device in the direction that minimizes the error between the target conductance and the current state of the device. If obtaining the current conductance of the device were a noise-free process, so that the conductance we read were the actual conductance of the device, we would be able to program these devices to arbitrary precision. However, the reading process is noisy: the conductance value we read is not equal to the true conductance value of the device. Because of this, we stop the programming algorithm when the read-out conductance is within a margin of the target conductance. From a noise-modeling perspective, one can therefore assume that the weights are in FP16 precision, but with an additive noise term following the noise profile we assume in the paper.
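For concreteness, a minimal sketch of such a read-write-verify loop is shown below; the constants (`read_noise_std`, `write_gain`, `margin`) are illustrative placeholders and not the parameters of any specific chip or programming algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def program_device(g_target, read_noise_std=0.02, write_gain=0.5,
                   margin=0.01, max_iters=100):
    """Iterative read-write-verify programming of a single device (sketch).

    The controller only sees noisy read-outs, so it stops once the *read*
    conductance is within `margin` of the target; the true conductance then
    retains a residual error -- the programming noise.
    """
    g_true = 0.0                                           # hidden device state
    for _ in range(max_iters):
        g_read = g_true + rng.normal(0.0, read_noise_std)  # noisy read
        if abs(g_target - g_read) < margin:
            break
        g_true += write_gain * (g_target - g_read)         # write pulse toward target
    return g_true

residual = np.array([program_device(0.5) for _ in range(2000)]) - 0.5
print(residual.std())  # on the order of read_noise_std: the effective programming noise
```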
Comment 1b
The model we originally used assumed zero noise on the zero states. Note that while Figure 8 shows a non-zero offset for the zero weight, this is because the weight error is averaged over a bin around zero rather than over a distribution of zero states. Assuming zero noise on zero weights can impact the results – especially those for the models trained with QAT, as many weights are rounded to zero. We therefore redid all experiments with the same noise model, but this time without zeroing out the noise for zero states. The table below shows that the noisy average benchmark performance dropped slightly but is still very close to the results when zero states receive zero noise.
| Model | Phi-3-mini-4k-instruct () | Analog FM () | LLM-QAT () | SpinQuant () |
|---|---|---|---|---|
| Avg. | 62.02 | 66.22 | 60.58 | 43.68 |
Comment 1c
We are modeling programming noise, which is the main source of noise for the chip described in [1]. Programming noise is the conductance error from the target weight that remains after a device has been programmed with an iterative read-write-verify programming scheme. Compared to programming noise, which is static (it does not get resampled during inference), read noise – due to analog 1/f noise of devices and circuits – is smaller in magnitude, but is dynamic and its magnitude varies as a function of time. Device-to-device variability due to process variations across the chip is taken care of by the iterative programming routine, which adjusts the programming conditions for each device until its conductance is within a margin of the target weight.
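To illustrate the distinction in code, the sketch below contrasts static programming noise (sampled once when the weights are programmed, then frozen) with dynamic read noise (re-sampled at every MVM); the constant standard deviations and the `noise_on_zeros` switch are simplifications of the conductance-dependent model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def programming_noise(w, sigma=0.02, noise_on_zeros=True):
    """Static noise: sampled once after programming and then frozen."""
    noise = rng.normal(0.0, sigma, size=w.shape)
    if not noise_on_zeros:                     # variant discussed under Comment 1b
        noise = np.where(w == 0.0, 0.0, noise)
    return w + noise

def with_read_noise(w_programmed, sigma_read=0.005):
    """Dynamic noise: re-sampled independently for every inference/MVM."""
    return w_programmed + rng.normal(0.0, sigma_read, size=w_programmed.shape)

w = rng.standard_normal((256, 256)) * 0.05
w_chip = programming_noise(w)                  # fixed for the lifetime of the weights
y1 = with_read_noise(w_chip) @ rng.standard_normal(256)  # fresh read noise per MVM
y2 = with_read_noise(w_chip) @ rng.standard_normal(256)
```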
Question 2: Method Analysis
Comparing the average weight SNR for the different models gave us an important insight. Clipping applied during training iteratively tightens the weight distribution, causing smaller weights to be mapped to larger conductance values. As a result, the average per-weight SNR increases, resulting in higher robustness. The table below shows that for the PCM-based noise model considered in our paper, larger states have higher SNR compared to smaller states. When comparing the average per-weight SNR (see table below) between the different models, one can observe that the weights of the analog foundation model have the highest average SNR. Notably, the models trained with QAT also show increased levels of average SNR, explaining the observation that QAT trained models are generally more robust compared to off-the-shelf models. Note that in a revised paper, we would include plots instead of tables.
| Normalized Conductance | -1.0 | -0.7 | -0.5 | -0.3 | -0.1 | -0.05 | -0.01 | 0.01 | 0.05 | 0.1 | 0.3 | 0.5 | 0.7 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR (dB) | 19.44 | 22.74 | 20.56 | 16.17 | 10.17 | 6.38 | -4.35 | -4.35 | 6.38 | 10.17 | 16.17 | 20.56 | 22.74 | 19.44 |
| Model Type | Phi-3-mini-4k-instruct | | | Llama-3.2-1B-Instruct | | |
|---|---|---|---|---|---|---|
| | Phi-3-mini-4k-instruct (W16) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) | Llama-3.2-1B-Instruct (W16) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) |
| Average per-layer mean SNR (dB) | 12.02 | 13.66 | 12.18 | 11.66 | 14.46 | 12.02 |
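For reference, the average per-weight SNR reported above can be computed along the following lines; `pcm_sigma` is a placeholder for the state-dependent programming-noise standard deviation, not the calibrated PCM model from [1].

```python
import numpy as np

def pcm_sigma(g):
    """Placeholder for the state-dependent noise std sigma(g); the real curve
    comes from the PCM model of [1] (cf. Fig. 8), not from this formula."""
    return 0.01 + 0.02 * np.abs(g)

def mean_weight_snr_db(weights):
    """Average per-weight SNR (dB) after per-layer mapping to [-1, 1]."""
    g = weights / np.max(np.abs(weights))       # conductance mapping
    snr = np.abs(g) / pcm_sigma(g)
    return np.mean(20.0 * np.log10(snr + 1e-12))

# Tighter (clipped) weight distributions map more weights to large |g|,
# which raises the average SNR -- the effect discussed above.
rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
print(mean_weight_snr_db(w), mean_weight_snr_db(np.clip(w, -2.0, 2.0)))
```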
Question 3: Model Size, Quantization, and Noise Level Dependencies
Comment 3a: Dependence on LLM model size and quantization precision.
Previous work has shown that larger LLMs are more robust to weight quantization noise. We investigate whether this robustness extends to PCM noise. To test this, we conducted experiments on Qwen3-8B, a state-of-the-art LLM with significantly more parameters. We evaluated Qwen3-8B with and without thinking capability enabled (see table below). The results reveal a clear trend: as model size increases, the performance gap between full-precision and PCM-noise evaluation decreases. While we observed this trend for the analog foundation models of Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct in table 1 of the paper, unfortunately, due to time limitations, we could not conduct further experiments for an analog Qwen3-8B model.
| Model | Avg. |
|---|---|
| Llama-3.2-1B-Instruct () | 45.49 |
| Llama-3.2-1B-Instruct () | 35.94 |
| Phi-3-mini-4k-instruct () | 70.03 |
| Phi-3-mini-4k-instruct () | 64.68 |
| Qwen3-8B-thinking () | 81.34 |
| Qwen3-8B-thinking () | 75.92 |
| Qwen3-8B () | 75.61 |
| Qwen3-8B () | 68.75 |
We also re-trained the analog foundation model with 7-bit input quantization on 1B tokens. We observe a performance drop of ~1% when training on 1B tokens:
| Model | Avg. |
|---|---|
| Analog FM-1B-Tok. () | 66.74 |
| Analog FM-1B-Tok. () | 63.67 |
| Analog FM-1B-Tok. () | 65.25 |
| Analog FM-1B-Tok. () | 62.57 |
Comment 3b: Sensitivity to noise level.
To evaluate the robustness of our method under higher noise conditions, we scaled our original PCM noise model by factors of ×1.5 and ×2. Our results demonstrate that while performance degrades as expected with increased noise levels, our approach maintains reasonable accuracy even under these more severe conditions. Additionally, we evaluated our method using a ReRAM noise model [2] to assess generalizability across different programming noise models (see response to Jwnv Comment 4).
| Noise Condition | Phi-3-mini-4k-instruct (W16) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) |
|---|---|---|---|
| No Noise | 70.03 | 68.66 | 65.63 |
| PCM | 62.92 ± 2.63 | 66.33 ± 0.86 | 60.70 ± 2.46 |
| PCM (×1.5) | 50.12 ± 7.01 | 63.19 ± 1.25 | 47.27 ± 3.94 |
| PCM (×2) | 33.63 ± 5.03 | 58.05 ± 1.53 | 37.43 ± 5.91 |
| ReRAM | 59.73 ± 1.91 | 65.57 ± 0.91 | 58.52 ± 2.70 |
Question 4: Training Speed
Regarding training speed, each feature in HW-aware training adds latency to the forward and backward pass. When weight noise injection, weight clipping, and input quantization are used, the forward pass of a linear layer is roughly 3 times slower compared to a standard linear layer in PyTorch (extracted from [3]). Although this gets amortized by other operations, it still results in a training slowdown. Regarding convergence, our experiments confirm that models trained with noise injection and output quantization converge slower than those without these techniques. Supporting figures will be included in the final appendix. We will add this to the limitations of our paper.
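To make the overhead concrete, below is a simplified sketch of one HW-aware forward pass of a linear layer with the three features mentioned above; the static ranges `in_scale`/`out_scale` and all constants are illustrative, and the straight-through estimators needed for the backward pass are omitted.

```python
import torch

def hw_aware_linear(x, weight, in_scale, out_scale,
                    clip=2.5, noise_std=0.02, in_bits=8, out_bits=8):
    """Sketch: weight clipping + static input quantization + weight noise
    injection + static output quantization around a single matmul."""
    # 1) clip weights to a multiple of their per-channel std (weight clipping)
    bound = clip * weight.std(dim=1, keepdim=True)
    w = weight.clamp(-bound, bound)

    # 2) static input quantization with a fixed (learned) range `in_scale`
    q_in = 2 ** (in_bits - 1) - 1
    x_q = torch.clamp(torch.round(x / in_scale * q_in), -q_in, q_in) * in_scale / q_in

    # 3) additive per-channel Gaussian weight noise (re-sampled every pass)
    w_noisy = w + torch.randn_like(w) * noise_std * w.abs().max(dim=1, keepdim=True).values

    # 4) matmul followed by static output quantization with range `out_scale`
    y = torch.nn.functional.linear(x_q, w_noisy)
    q_out = 2 ** (out_bits - 1) - 1
    return torch.clamp(torch.round(y / out_scale * q_out), -q_out, q_out) * out_scale / q_out
```

Each of these steps adds work relative to a plain `torch.nn.functional.linear`, which is where the reported ~3× forward-pass slowdown comes from.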
Question 5: Benefit for QAT
While noise injection encourages weights to converge to flatter regions of the loss landscape [4], weight clipping yields most of the robustness to weight quantization. In contrast to LLM-QAT, weight clipping removes outliers in the model's weight distribution explicitly. As demonstrated in the table below, the mean absolute quantization error for LLM-QAT is significantly higher than for Analog FMs. We speculate that QAT methods might benefit from incorporating iterative weight clipping.
| Model Type | Phi-3-mini-4k-instruct | | Llama-3.2-1B-Instruct | |
|---|---|---|---|---|
| | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) |
| Average per-layer mean absolute quantization error | 0.0034 ± 0.0005 | 0.0062 ± 0.0013 | 0.0019 ± 0.0006 | 0.0045 ± 0.0015 |
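For clarity, the quantization error reported in the table corresponds to a computation along these lines (a sketch of symmetric per-channel round-to-nearest quantization; not the exact quantizer configuration of LLM-QAT).

```python
import torch

def per_channel_rtn_abs_error(weight, bits=4):
    """Mean absolute error of symmetric per-channel round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                             # 7 positive levels for 4-bit
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax  # per-output-channel scale
    w_q = torch.clamp(torch.round(weight / scale), -qmax, qmax) * scale
    return (weight - w_q).abs().mean()

# Removing outliers first (iterative weight clipping) shrinks `scale` and
# therefore the rounding error of all remaining weights:
w = torch.randn(4096, 4096)
print(per_channel_rtn_abs_error(w), per_channel_rtn_abs_error(w.clamp(-2.5, 2.5)))
```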
References
[1] "A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference", Le Gallo et al., Nature Electronics 2023
[3] "A compute-in-memory chip based on resistive random-access memory", Wan et. al, Nature 2022
[3] "AIHWKIT-Lightning: A Scalable HW-Aware Training Toolkit for Analog In-Memory Computing", Büchel et al., MLNCP at NeurIPS 2024
[4] "Adversarial Weight Perturbation Helps Robust Generalization", Wu et al., NeurIPS 2020
Thank you for all your efforts in addressing my questions. Below are my answers:
- Thank you for the explanation. However, if the conductance programming could be of arbitrary precision were it noise-free, can you clarify why 4 devices per weight are required (I assume 2 are required for positive and negative weights, as conductance values are unsigned)?
- Thank you for the clarification. If I understood correctly, from Fig. 8, the error is higher on higher-magnitude weights, but as their magnitude is also higher (and proportionally even more so?), their SNR is globally higher. Tighter weight clipping means that more weights are close to (or above) the quantization range, meaning that more weights are mapped to higher-magnitude conductance values, which increases the average SNR of all layers. I think this point is not easy to understand and should be made clearer in the paper.
- Thank you for the added experiments, which enhance the comprehensiveness of the paper. In particular, I think the noise experiments are valuable, as we see that the noise level has a high impact on accuracy. This means that, although the proposed method compensates for the accuracy degradation, this compensation is only partial, and it is also crucial to have reliable enough devices and programming methods.
- OK.
- This is an interesting contribution too. Thank you for the explanation.
After reading the comments and responses to other reviewers, I agree that it is valuable to add experiments with a more realistic noise model including other sources of noise (in particular conductance drift). Moreover, it is important to add other hardware metrics (efficiency, throughput) to substantiate the motivation for AIMC hardware. Therefore, I recommend that you include these additional experiments in the paper (main body or appendices), and reinforce the limitations section of the work, highlighting what was confirmed in these additional experiments (in particular the sensitivity to the noise level and to other sources of noise that are not accounted for in the method, e.g. the issue with conductance drift) and the perspectives for improvement regarding this matter.
We thank the reviewer for the detailed response.
- Your assumption that two devices are the minimum number of devices needed to encode one weight is correct, because we need at least one device per sign. The four devices stem from our noise model assumption. By exploiting device characteristics [1] or simply averaging the conductance of multiple devices, using two devices per sign instead of one reduces the programming noise (see the sketch after this list). The noise model we used from [2] assumes two devices per sign (so a total of four devices per weight). Note that the chip in [2] also allows using two devices per weight, and in fact a noise model with just two devices per weight also exists (which has higher programming noise).
- Your understanding is correct. We will make this clearer in the paper.
- Yes, having reliable devices and programming methods is crucial for achieving higher accuracy, and we will mention this in the discussion section. Achieving both high accuracy and high energy efficiency requires a combination of better devices/circuits/calibrations and hardware-aware algorithms such as those presented in this study.
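Sketch referenced in the first bullet above: one signed weight is mapped onto positive and negative conductances, and averaging several devices per sign reduces the effective programming noise by roughly the square root of the number of devices. The constant noise level is illustrative and ignores the state dependency exploited in [1].

```python
import numpy as np

rng = np.random.default_rng(0)

def program_weight(w, devices_per_sign=2, prog_noise_std=0.02):
    """Encode a signed weight as G_plus - G_minus, each sign realized by
    `devices_per_sign` devices whose conductances are averaged."""
    g_plus_target, g_minus_target = max(w, 0.0), max(-w, 0.0)
    g_plus = np.mean(g_plus_target + rng.normal(0.0, prog_noise_std, devices_per_sign))
    g_minus = np.mean(g_minus_target + rng.normal(0.0, prog_noise_std, devices_per_sign))
    return g_plus - g_minus                    # effective signed weight read back

err_two = np.std([program_weight(0.3, devices_per_sign=1) - 0.3 for _ in range(5000)])
err_four = np.std([program_weight(0.3, devices_per_sign=2) - 0.3 for _ in range(5000)])
print(err_two, err_four)   # four devices per weight: ~1/sqrt(2) the noise of two
```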
In the camera-ready version of the paper, we would include the additional experiments conducted and further results that motivate the use of this hardware. We will also expand the discussion section to include the above limitations regarding device noise.
[1] "Exploiting the State Dependency of Conductance Variations in Memristive Devices for Accurate In-Memory Computing", Vasilopoulos et al., IEEE Transactions on Electron Devices 2023
[2] "A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference", Le Gallo et al., Nature Electronics 2023
This paper aims to develop a general and scalable adaptation methodology, called hardware-aware training, for successfully running pre-trained LLMs on analog in-memory computing hardware with severe analog noise and quantization constraints. In doing so, it aims to minimize the performance degradation due to physical hardware limitations and achieve performance equivalent to that of a digitally quantized model with 4-bit weights and 8-bit activations.
Strengths and Weaknesses
Strengths: This paper describes a specific and precise process for hardware-aware training, applying AIMC to LLMs, and demonstrates that training results based on Gaussian noise show excellent performance under a given quantization environment on various benchmarks. This could be a very interesting topic, considering the possibility of analog in-memory computing in inference engines. In the field of AIMC applications, I believe this paper offers novelty in advancing low-power LLMs.
Weaknesses: Although the method claims to train with the AIMC environment in mind, it is mostly a combination of existing algorithms and does not pose new hypotheses. The paper does not propose a technique for fine-tuning on AIMC, yet actual analog in-memory computing is strongly affected by various semiconductor processes and operating environments, so a fine-tuning approach seems more realistic and should at least be mentioned. In AIMC, errors arise from digital-to-analog and analog-to-digital converters, as well as higher-order errors resulting from semiconductor PVT characteristics. Therefore, a discussion of how to address these issues is needed.
Questions
Please refer to the weaknesses above. There are various references in the AIMC literature describing the error characteristics of AIMC compared to FP32 models, and I think more of these should be added. Can you add more references on this?
Limitations
There is no negative societal impact.
Final Justification
The authors discussed the limitations of the method in this manuscript. While these limitations exist and the method may incur significant time overhead during training, I believe it contributes to the training of practical PIM-based AI models, and I believe it deserves a higher score.
Formatting Concerns
There are no paper formatting concerns.
General Response
We thank Reviewer FiYr for recognizing that our paper "describes a specific and precise process for hardware-aware training" and demonstrates "excellent performance under a given quantization environment on various benchmarks." We appreciate their acknowledgment that this work "offers novelty in advancing low-power LLMs" and represents "a very interesting topic, considering the possibility of Analog in-memory computing in inference engines." We address points raised by the reviewer below.
Comment 1: Novelty and Algorithmic Innovation
Reviewer Comment: Although the method claims to train with the AIMC environment in mind, it is mostly a combination of existing algorithms and does not pose new hypotheses.
Our Response: We thank the reviewer for their thoughtful feedback. While it is true that our method builds on a set of already existing algorithms, the novelty lies in how these components are integrated into a practical solution for a previously unsolved and highly relevant problem: enabling robust deployment of LLMs on AIMC hardware. Unlike prior works that target small CNNs or encoder-only transformers, this paper is the first to comprehensively and systematically tackle HW-aware training for pre-trained LLMs in the analog domain. By enabling AIMC-based inference for LLMs, AIMC could play a key role in addressing the global issue of increasing energy consumption resulting from AI, especially because model inference dominates energy consumption. Our approach also offers a foundation for future research on AIMC-compatible training strategies for frontier models.
Comment 2: Fine-tuning and Hardware-Specific Calibration
Reviewer Comment: The paper does not propose a technique for fine-tuning on AIMC, yet actual analog in-memory computing is strongly affected by various semiconductor processes and operating environments, so a fine-tuning approach seems more realistic and should at least be mentioned. In AIMC, errors arise from digital-to-analog and analog-to-digital converters, as well as higher-order errors resulting from semiconductor PVT characteristics. Therefore, a discussion of how to address these issues is needed.
Our Response: Earlier works [1] in AIMC have resorted to chip-in-the-loop finetuning to adapt the weights of the neural network to various noise sources related to the DACs, ADCs, and semiconductor PVT characteristics. While effective, this approach is prohibitively slow and – if done on-chip – very resource intensive due to the need to implement the whole backpropagation algorithm. For these reasons, researchers and practitioners have started to employ more efficient calibration mechanisms. For ADCs, this could involve tuning various gains and read voltages (see [2] for more details). For other variations not captured by the ADC calibration, per-ADC local digital processing units are typically used. These units apply an affine correction to each dot product output, $y \mapsto \alpha y + \beta$. The values $\alpha$ and $\beta$ are typically calibrated using regression between expected and observed MVM outputs on synthetic or real data. Besides ADC variations, these affine scales and offsets can also correct other sources of noise, for example temperature variations or IR-drop. We would be happy to include part of this discussion in the paper. However, we would like to clarify that on-chip calibration methods are complementary to the algorithmic research done in this paper. As our results show, off-the-shelf models do not perform well on AIMC hardware models that already use all the above calibration methods. It is very unlikely that on-chip calibration of AIMC can achieve a precision equivalent to high-precision digital accelerators without prohibitively complex and area-consuming circuits. Because of this tradeoff, on-chip calibration must be used in conjunction with hardware-aware algorithmic methods such as those presented in this paper, in order to achieve both high accuracy AND energy efficiency.
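A minimal sketch of this calibration step, assuming a simple per-column least-squares fit (the synthetic gain/offset values are purely illustrative):

```python
import numpy as np

def calibrate_affine(expected, observed):
    """Fit y_corr = alpha * y_obs + beta between expected (digital) and
    observed (on-chip) MVM outputs of one column -- an illustrative sketch,
    not a specific chip's calibration routine."""
    A = np.stack([observed, np.ones_like(observed)], axis=1)
    alpha, beta = np.linalg.lstsq(A, expected, rcond=None)[0]
    return alpha, beta

rng = np.random.default_rng(0)
y_true = rng.standard_normal(512)                               # expected outputs
y_obs = 0.9 * y_true + 0.05 + 0.01 * rng.standard_normal(512)   # gain/offset error + noise
alpha, beta = calibrate_affine(y_true, y_obs)
y_corr = alpha * y_obs + beta        # correction applied by the local digital unit
```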
Comment 3: Additional References for AIMC Error Characteristics
Reviewer Comment: There are various references in the AIMC literature describing the error characteristics of AIMC compared to FP32 models, and I think more of these should be added. Can you add more references on this?
Our Response: We will expand the background section with additional descriptions and references for the different sources of noise in AIMC, as well as how on-chip calibration methods can be used to mitigate them to some extent, based on the above response. Additionally, we conducted further experiments where we evaluated the models from table 1 on a more detailed publicly available noise model calibrated on real hardware [3]. In addition to the programming noise used in the previous benchmarks, this model also includes read noise and conductance drift. See the response to Reviewer sHcb Comment 2 and bCEm Question 1 for more details on the noise models. We find that our models perform best on this more elaborate noise model (see results in the table below).
MedQA Results
| Model | FP-32 | t = 1 min | t = 1 h | t = 1 day | t = 1 month | t = 1 year |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 52.91 | 43.54 ± 0.81 | 43.65 ± 0.62 | 40.75 ± 0.86 | 35.82 ± 1.11 | 29.25 ± 1.12 |
| Analog FM () | 49.69 | 45.85 ± 0.78 | 45.03 ± 0.75 | 42.09 ± 0.97 | 38.02 ± 1.23 | 33.74 ± 1.12 |
| LLM-QAT () | 46.15 | 39.87 ± 1.00 | 36.59 ± 1.21 | 31.98 ± 0.69 | 26.71 ± 0.35 | 23.30 ± 0.56 |
Arc-C Results
| Model | FP-32 | t = 1 min | t = 1 h | t = 1 day | t = 1 month | t = 1 year |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 84.56 | 82.37 ± 0.61 | 81.59 ± 0.37 | 79.01 ± 0.41 | 72.71 ± 0.77 | 58.89 ± 3.75 |
| Analog FM () | 84.47 | 83.43 ± 0.14 | 83.05 ± 0.25 | 81.50 ± 0.47 | 76.79 ± 1.15 | 68.57 ± 1.23 |
| LLM-QAT () | 81.31 | 78.34 ± 1.66 | 75.72 ± 1.20 | 69.71 ± 0.80 | 58.98 ± 0.93 | 49.44 ± 0.89 |
References
[1] "A compute-in-memory chip based on resistive random-access memory", Wan et. al, Nature 2022
[2] "Demonstration of 4-quadrant analog in-memory matrix multiplication in a single modulation", Le Gallo et. al, Nature npj Unconventional Computing 2024
[3] "A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference", Le Gallo et al., Nature Electronics 2023
I appreciate your detailed response given in such a short timeframe. I also believe my questions have been addressed to some extent. As mentioned in your response, I believe this method has value as a supplementary tool to on-chip calibration. This should be reflected in the original manuscript, and I believe it contributes to the training of AI models using processing-in-memory computing, which is still very limited compared to the learning methods in the paper. While these limitations exist and may incur significant time overhead during training, I think it contributes to the training of practical PIM-based AI models, and I believe it deserves a higher score. I hope the limitations of the overall system will be addressed in the paper in the future.
We thank the reviewer for the comment and for raising the score in response to our rebuttal. We will include the fact that our method is complementary to calibration in the camera-ready version of the paper.
The authors present a method for adapting pretrained LLMs to AIMC constraints without requiring the original training data. Their approach maintains performance similar to 4-bit weight / 8-bit activation quantization approaches on digital hardware, despite the presence of noise. They show that the approach is robust under realistic noise/quantization scenarios and across a range of benchmarks, and in fact outperforms quantized models with compute scaling at test time. This is a significant innovation, as it clearly demonstrates the potential of AIMC in SOTA learning settings, enabling major energy savings in settings of great interest.
Strengths and Weaknesses
The paper and results are strong and well motivated. This has clear impact and utility for future AIMC applications in a realm that is of great and direct interest. Experiments extend the SOTA for Analog models considerably.
Strengths: the paper takes an important step toward extending the scope of AIMC hardware to billion parameter models for the first time, showcasing performance in a language setting. This bridges a major gap between extant literature on AIMC housed models (mostly CNNs/RNNs with under 50M parameters) and models approaching a scale of interest in modern ML literature. The authors also perform a fairly comprehensive evaluation on 12 different benchmarks, so the reported results are clearly robust. The results show that n-best sampling can be leveraged to obtain superior performance from these models compared to standard quantized models, suggesting good tradeoffs for inference-heavy use cases, which are increasingly an area of great interest at scale.
Weaknesses: Despite these points, some limitations persist, and the researchers only address a few of these, namely the gap in performance compared to off-the-shelf models. There are certainly some areas where a paper such as this could benefit from further experiments. For example, it is fairly commonplace in research on foundation models in general to investigate scalability and robustness. While the training cost of further bridging the gap between these and actual foundation models is very likely prohibitively expensive, one can still assess metrics of scalability across a range of smaller model sizes. This would help to support the claim (as yet unproven) that such an approach can extend to the 100B parameter regime. Additionally, the researchers could have further explored non-Gaussian or drift-based non-idealities. In my opinion, insufficient attention was given to the various additional realistic sources of potential error one might encounter on chip. Furthermore, given the understandable lack of on-chip deployment, it would be interesting to provide more estimates of hardware metrics (throughput, on-chip latency, energy use); ADC/DAC overhead, for example, is an important consideration alongside the noise derived therein.
In addition to further discussion of the complications encountered in real-world on-chip deployment of such models, further ablations and theoretical investigation of, for example, which aspects of the HWA training scheme (clipping vs. noise injection or quantization) contribute most to robustness would be interesting and would provide a basis for further investigations in this area to build upon.
Nevertheless, this paper is very strong empirically and provides a significant step forward in bridging the gap between extant AIMC results and SOTA ML.
Questions
While the paper is strong, many of the contributions are engineering rather than theoretical; stronger theoretical contributions would result in a higher score. For example, no theoretical justification is given for the synergy between clipping and noise injection. The authors could develop this insight by drawing links to other literature, e.g. quantile-based regularization or stochastic smoothing. Likewise, no robustness guarantees or analysis of error propagation (via, e.g., perturbation theory) are provided. Having guarantees for regimes where models are provably robust would make this paper significantly stronger. As mentioned before, some account of how the number of training tokens, model size, or noise variance influences convergence would also boost the paper's overall quality.
Limitations
yes
Final Justification
I have increased my confidence in light of the authors' response, as I had already recommended this work for acceptance.
Formatting Concerns
None
General Response
We thank this reviewer for recognizing that our "paper and results are strong and well motivated" with "clear impact and utility for future AIMC applications" and that our "experiments extend the SOTA for Analog models considerably." We appreciate their acknowledgment that our paper "takes an important step toward extending the scope of AIMC hardware to billion parameter models for the first time" and "bridges a major gap between extant literature on AIMC housed models and models approaching a scale of interest in modern ML literature." We are grateful for their recognition of our "comprehensive evaluation on 12 different benchmarks" showing "clearly robust" results. We address the comments of the reviewer below.
Comment 1: Scalability Assessment
Reviewer Comment: For example, it is fairly commonplace in research on foundation models in general to investigate scalability and robustness. While the training cost of further bridging the gap between these and actual foundation models is very likely prohibitively expensive, one can still assess metrics of scalability across a range of smaller model sizes. This would help to support the claim (as yet unproven) that such an approach can extend to the 100B parameter regime.
Our Response: We are aware of this limitation of our current study and are actively working on extending this method to larger model sizes. Unfortunately, due to time and resource limitations, we were not able to obtain results within the time-frame of the rebuttal.
Comment 2: Non-Gaussian and Drift-Based Non-Idealities
Reviewer Comment: Additionally, the researchers could have further explored non-gaussian or drift-based non-idealities. In my opinion insufficient attention was given to the various additional realistic sources of potential error one might encounter on chip.
Our Response: We evaluated the models from table 1 on a more detailed publicly available noise model calibrated on real hardware [1]. In addition to the programming noise used in the previous benchmarks, this model also includes read noise and conductance drift. Read noise is the temporal conductance fluctuation resulting from 1/f noise in AIMC devices. Conductance drift refers to the decrease in device conductance over time that is observed in PCM devices. Both read noise and conductance drift are modeled in a statistical manner with conductance- and time-dependent standard deviations. We perform the evaluation for the Phi-3-mini-4k-instruct-based models on the Arc-C and MedQA benchmark. We simulate drift for a time-span of one minute, hour, day, month and year. Results show that the analog foundation model is significantly more robust to drift compared to the off-the-shelf model, and the model trained with LLM-QAT.
MedQA Results
| Model | FP-32 | t = 1 min | t = 1 h | t = 1 day | t = 1 month | t = 1 year |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 52.91 | 43.54 ± 0.81 | 43.65 ± 0.62 | 40.75 ± 0.86 | 35.82 ± 1.11 | 29.25 ± 1.12 |
| Analog FM () | 49.69 | 45.85 ± 0.78 | 45.03 ± 0.75 | 42.09 ± 0.97 | 38.02 ± 1.23 | 33.74 ± 1.12 |
| LLM-QAT () | 46.15 | 39.87 ± 1.00 | 36.59 ± 1.21 | 31.98 ± 0.69 | 26.71 ± 0.35 | 23.30 ± 0.56 |
Arc-C Results
| Model | FP-32 | t = 1 min | t = 1 h | t = 1 day | t = 1 month | t = 1 year |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 84.56 | 82.37 ± 0.61 | 81.59 ± 0.37 | 79.01 ± 0.41 | 72.71 ± 0.77 | 58.89 ± 3.75 |
| Analog FM () | 84.47 | 83.43 ± 0.14 | 83.05 ± 0.25 | 81.50 ± 0.47 | 76.79 ± 1.15 | 68.57 ± 1.23 |
| LLM-QAT () | 81.31 | 78.34 ± 1.66 | 75.72 ± 1.20 | 69.71 ± 0.80 | 58.98 ± 0.93 | 49.44 ± 0.89 |
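For context, conductance drift in PCM is commonly described by a power-law decay; the sketch below uses this standard form with typical literature values for the reference time and drift exponent, which is a simplification of the statistical, state-dependent drift model in [1].

```python
import numpy as np

def drift_conductance(g0, t_seconds, t0=20.0, nu=0.05):
    """Power-law drift G(t) = G(t0) * (t / t0)^(-nu); `t0` and `nu` are
    typical values, not the exact parameters of the model in [1]."""
    return g0 * (t_seconds / t0) ** (-nu)

for label, t in [("1 min", 60.0), ("1 h", 3.6e3), ("1 day", 8.64e4),
                 ("1 month", 2.6e6), ("1 year", 3.15e7)]:
    print(label, drift_conductance(1.0, t))   # conductance decays over time
```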
Comment 3: Hardware Metrics Estimation
Reviewer Comment: Furthermore, given the understandable lack of on-chip deployment, it would be interesting to provide more estimates of hardware metrics (throughput, on-chip latency, energy use); ADC/DAC overhead, for example, is an important consideration alongside the noise derived therein.
Our Response: Using the open-source simulator presented in [2], we simulated the pipelined inference of the Phi-3-mini-4k-instruct model for a pre-fill of 256 tokens, 64 generated tokens, and a batch size of 4. Note that because the tool is written in Python, simulating longer sequences takes prohibitively long. On an abstract AIMC-based architecture, Phi-3-mini-4k-instruct achieves a throughput of 2554.43 tokens/s and an energy efficiency of 199.78 tokens/s/W. Although this energy efficiency is already 3.6× higher compared to a scenario where a GPU could perform all necessary INT8 OPs involved in the model inference in one go at 100% utilization, it still does not show the full potential of AIMC. Because of the high density of non-volatile memory, AIMC-based systems can host billions of parameters on a single chip. Besides opening up completely new use-cases, this dramatically cuts energy consumption, especially for models (preferably MoEs) with 10s to 100s of billions of parameters [2].
Comment 4: Ablation Studies and Theoretical Investigation
Reviewer Comment: In addition to further discussion of the complications encountered in real-world on-chip deployment of such models, further ablations and theoretical investigation of, for example, which aspects of the HWA training scheme (clipping vs. noise injection or quantization) contribute most to robustness would be interesting and provide a basis for further investigations in this area to build upon.
Our Response: Deploying a model on-chip involves more than just HW-aware training of the model. Different algorithms at different levels of the stack make sure that the impact of non-idealities is minimized. For example, special programming algorithms ensure precise programming of the weights into conductances, calibrations of the ADCs ensure matching transfer curves between ADCs, and per-dot-product affine scales and offsets are calibrated to correct other non-idealities not captured by the ADC calibration or programming. The effect of clipping and noise injection was studied in table 13. Weight clipping accounts for most of the robustness. When more tokens are used, noise injection only yields an improvement of 0.52%. For an explanation on why weight clipping is more important, see the following answer.
Question 1: Theoretical Contributions and Robustness Guarantees
Reviewer Comment: While the paper is strong, many of the contributions are engineering rather than theoretical; stronger theoretical contributions would result in a higher score. For example, no theoretical justification is given for the synergy between clipping and noise injection. The authors could develop this insight by drawing links to other literature, e.g. quantile-based regularization or stochastic smoothing. Likewise, no robustness guarantees or analysis of error propagation (via, e.g., perturbation theory) are provided. Having guarantees for regimes where models are provably robust would make this paper significantly stronger.
Our Response: We now link the strong gains observed in robustness through weight clipping to explicitly increasing the average SNR in the weights. Clipping applied during training iteratively tightens the weight distribution, causing smaller weights to be mapped to larger conductance values. The table below shows that for the PCM-based noise model considered in our paper, larger states have higher signal to noise ratio (SNR) compared to smaller states. When comparing the average per-weight SNR (see table below) between the different models, one can observe that the weights of the analog foundation model have the highest average SNR. Notably, the models trained with QAT also show increased levels of average SNR, explaining the observation that QAT trained models are generally more robust. While weight clipping explicitly increases the SNR, we believe that noise injection encourages the weights to end up in a flatter region of the weight loss landscape [3]. We agree that extending analog foundation models with frameworks such as provable robustness would be interesting – especially for safety critical domains – and leave it for further work.
| Normalized Conductance | -1.0 | -0.7 | -0.5 | -0.3 | -0.1 | -0.05 | -0.01 | 0.01 | 0.05 | 0.1 | 0.3 | 0.5 | 0.7 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR (dB) | 19.44 | 22.74 | 20.56 | 16.17 | 10.17 | 6.38 | -4.35 | -4.35 | 6.38 | 10.17 | 16.17 | 20.56 | 22.74 | 19.44 |
| Model Type | Phi-3-mini-4k-instruct | | | Llama-3.2-1B-Instruct | | |
|---|---|---|---|---|---|---|
| | Phi-3-mini-4k-instruct (W16) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) | Llama-3.2-1B-Instruct (W16) | Analog FM (SI8-W16-O8) | LLM-QAT (SI8-W4) |
| Average per-layer mean SNR (dB) | 12.02 | 13.66 | 12.18 | 11.66 | 14.46 | 12.02 |
Question 2: Training Convergence Analysis
Reviewer Comment: As mentioned before, some account of how the number of training tokens, model size, or noise variance influences convergence would also boost the paper's overall quality.
Our Response: Interestingly, the number of tokens itself did not have an impact on convergence. However, the amount of noise injection and the presence of output quantization mattered. When training with output quantization, convergence is slower. Noise injection also leads to slower convergence, and too much noise reduces the final performance. We will add supporting figures in the final version of the paper.
References
[1] "A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference", Le Gallo et al., Nature Electronics 2023
[2] "Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing", Büchel et al., Nature Computational Science 2025
[3] "Adversarial Weight Perturbation Helps Robust Generalization", Wu et al., NeurIPS 2020
Dear Reviewers, Thank you for reviewing our NeurIPS submission. We encourage you to engage in the discussion phase. We are committed to addressing any questions or concerns promptly and constructively.
I thank the authors for their thoughtful response and clarification. While I have already recommended this work for acceptance, and though it is clear that some of my previous criticisms were addressed, I will leave my evaluation as it is. Increasing individual scores could be a possibility if the authors were to indicate where specifically they might provide clarifications / add detail within the paper. However, this is clearly a direction of great interest to many, and my belief in the merits of this work remains rather firm.
Analog in-memory computing (AIMC) can deliver reduced energy consumption for LLMs but imposes 4- to 8-bit quantization and device-level noise that reduce LLM accuracy. The authors present a hardware-aware training pipeline combining synthetic data distillation, learned static 8-bit input ranges, globally static 8-bit output quantization, iterative weight clipping, and small additive per-channel Gaussian noise. Their method adapts pre-trained models (Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct) using <1% of the original pre-training tokens. After retraining on 20B synthetic tokens, the analog foundation models stay within ~4% of their FP16 baselines and match state-of-the-art 4-bit-weight/8-bit-activation digital quantization when evaluated under realistic PCM AIMC noise across several reasoning, knowledge and safety benchmarks. The models can also be quantized via round-to-nearest to run on conventional 4-bit digital accelerators, and show superior test-time compute scaling on MATH-500 compared with QAT-quantized models.
Strengths and Weaknesses
Strengths:
- In contrast to several prior AIMC works that focus on retaining accuracy in CNNs on AIMC hardware, this paper tackles the much larger LLMs. The problem of reducing energy consumption in LLM workloads while retaining accuracy is especially relevant today as the energy costs of LLMs continue to skyrocket.
- While prior works only focused on accuracy impact of ADCs (output quantization), this paper models the accuracy impact under weight noise, input DAC quantization, and output ADC quantization.
- The authors evaluate many benchmarks, confirming that their method works for the given LLMs and PCM AIMC noise model.
- The method achieves high accuracy for AIMC LLMs while retraining on only 1% of the training data
- The retraining method yields LLM models that retain accuracy and are deployable on both analog and digital hardware.
Weaknesses:
- The paper assumes that the entire LLM dot product is performed in one-shot on an AIMC crossbar. LLM dot products may require as many as 8192 MAC operations. However, typical AIMC architectures usually support only ~128 MACs in one crossbar column. Hence, the MVM must be tiled and intermediate partial results are often accumulated in the digital domain using wide adders without any noise. For this reason, the dot product noise in the tiled setting may not be as severe as accumulating the entire dot product in the analog domain in one shot.
- While the study demonstrates that billion-parameter LLMs can be made AIMC-ready, it still requires substantial retraining compute and leaves a noticeable accuracy gap on the hardest reasoning tasks, highlighting the need for lighter post-training schemes and further robustness research. The authors acknowledge this limitation.
- Results are reported only for two relatively small LLMs (~1B parameters). It is unclear whether the method will scale to cutting-edge model sizes.
- The method is not validated for popular ReRAM and MRAM AIMC architectures and noise models.
Questions
- What is the method's impact on accuracy in the tiled AIMC setting?
- Can the method be adapted to modeling and reducing the accuracy impact of other AIMC non-idealities such as sneak paths (undesired paths for current, parallel to the intended path)
Limitations
yes
Formatting Concerns
N/A
General Response
We thank the reviewer for recognizing that our paper "tackles the much larger LLMs" compared to prior AIMC works and acknowledging that our "method achieves high accuracy for AIMC LLMs while retraining on only 1% of the training data." We also appreciate the constructive feedback which we address below.
Comment 1: Tiled AIMC Architecture
Reviewer Comment: The paper assumes that the entire LLM dot product is performed in one-shot on an AIMC crossbar. LLM dot products may require as many as 8192 MAC operations. However, typical AIMC architectures usually support only ~128 MACs in one crossbar column. Hence, the MVM must be tiled and intermediate partial results are often accumulated in the digital domain using wide adders without any noise. For this reason, the dot product noise in the tiled setting may not be as severe as accumulating the entire dot product in the analog domain in one shot. Related question: "What is the method's impact on accuracy in the tiled AIMC setting?"
Our Response: It is correct that modern AIMC chips have much smaller tiles compared to the weight matrices of modern LLMs. For heterogeneous AIMC-based accelerators targeting LLM inference, tile sizes from 256 by 256 to 1024 by 1024 are realistic [1,2], while larger tile sizes are favored because of higher efficiency. Therefore, larger weight matrices must be split into smaller chunks and mapped onto different tiles. The MVM is then performed by aggregating partial results of the individual tiles using high(er)-precision digital circuitry. It is correct that using tiled matrices has a positive impact on accuracy, primarily because weights are now re-scaled per tile instead of per layer. As a result, outliers from one tile do not impact the conductance mapping on another tile. This causes more weights to get mapped to higher conductances, increasing the overall SNR. To show the effect on accuracy, we evaluated the Phi-3-mini-4k-instruct models in table 1 with a tile size of 512 by 512. Due to the higher SNR, noisy accuracy generally increases. Because the analog foundation model already has tight weight distributions, tiling does not have an effect on the noisy performance of this model. For details, see the table below.
| Model | MMLU (5-shot) | GSM8K (CoT 8-shot) | BoolQ (0-shot) | Hellaswag (5-shot) | MedQA (2-shot) | AGIEval (0-shot) | Arc-C (10-shot) | Arc-E (10-shot) | ANLI (7-shot) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 69.36 | 79.91 | 78.75 | 83.51 | 52.83 | 38.06 | 84.56 | 91.08 | 52.25 | 70.03 |
| Phi-3-mini-4k-instruct () | 65.28 ± 0.50 | 65.11 ± 4.59 | 74.84 ± 1.40 | 78.22 ± 2.00 | 47.10 ± 0.98 | 34.40 ± 0.92 | 82.33 ± 0.97 | 89.97 ± 0.46 | 44.83 ± 2.07 | 64.68 |
| Analog FM () | 67.24 | 76.95 | 77.22 | 83.24 | 49.61 | 37.23 | 83.62 | 90.57 | 51.94 | 68.62 |
| Analog FM () | 65.16 ± 0.24 | 70.72 ± 1.25 | 75.58 ± 1.50 | 80.53 ± 0.75 | 46.65 ± 1.18 | 35.68 ± 0.59 | 82.95 ± 0.45 | 89.87 ± 0.27 | 48.95 ± 2.33 | 66.23 |
| LLM-QAT () | 64.12 | 68.92 | 75.11 | 79.22 | 45.20 | 36.30 | 81.48 | 89.18 | 51.16 | 65.63 |
| LLM-QAT () | 61.52 ± 0.46 | 61.63 ± 0.75 | 69.08 ± 6.02 | 76.42 ± 1.01 | 42.22 ± 0.97 | 34.10 ± 0.75 | 79.15 ± 0.83 | 87.85 ± 0.65 | 44.22 ± 4.27 | 61.80 |
| SpinQuant () | 67.28 | 74.83 | 76.27 | 0.0 | 49.06 | 36.45 | 83.62 | 90.45 | 48.47 | 67.55 |
| SpinQuant () | 49.46 ± 1.87 | 20.49 ± 6.69 | 64.42 ± 2.71 | 36.01 ± 3.78 | 32.16 ± 1.21 | 28.16 ± 1.28 | 62.20 ± 4.30 | 77.05 ± 3.26 | 34.60 ± 1.29 | 44.95 |
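For illustration, the per-tile re-scaling and digital accumulation described above can be sketched as follows; noise and ADC quantization are omitted, and the tile size and helper names are illustrative.

```python
import numpy as np

def tiled_mvm(x, W, tile=512):
    """MVM with W (out x in) split into `tile` x `tile` crossbar tiles.
    Each tile maps its own weights to [-1, 1] (per-tile scaling), and the
    partial results are rescaled and accumulated digitally."""
    out_dim, in_dim = W.shape
    y = np.zeros(out_dim)
    for r in range(0, out_dim, tile):
        for c in range(0, in_dim, tile):
            W_t = W[r:r + tile, c:c + tile]
            scale = np.abs(W_t).max()                                 # per-tile conductance scale
            y[r:r + tile] += scale * ((W_t / scale) @ x[c:c + tile])  # digital accumulation
    return y

rng = np.random.default_rng(0)
W, x = 0.02 * rng.standard_normal((1024, 2048)), rng.standard_normal(2048)
assert np.allclose(tiled_mvm(x, W), W @ x)   # exact without noise; per-tile scaling only
```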
Comment 2: Training Requirements and Accuracy Gap
Reviewer Comment: While the study demonstrates that billion-parameter LLMs can be made AIMC-ready, it still requires substantial retraining compute and leaves a noticeable accuracy gap on the hardest reasoning tasks, highlighting the need for lighter post-training schemes and further robustness research. The authors acknowledge this limitation.
Our Response: Similar to QAT in the digital domain, our method sets a strong baseline in terms of accuracy in the analog domain. Although we show that it is sufficient to train on less than 1% of the pre-training data, we agree that training large models is resource intensive. We share the opinion of the reviewer that more research into methods enhancing the robustness post-training needs to be carried out.
Comment 3: Model Scale Limitations
Reviewer Comment: Results are reported only for two relatively small LLMs (~1B parameters). It is unclear whether the method will scale to cutting-edge model sizes.
Our Response: We are actively working on extending this method to larger model sizes, but due to time and resource limitations we were not able to obtain results in the given time-frame of the rebuttal.
Comment 4: Further Noise Model Validation
Reviewer Comment: The method is not validated for popular ReRAM and MRAM AIMC architectures and noise models. Related question: "Can the method be adapted to modeling and reducing the accuracy impact of other AIMC non-idealities such as sneak paths (undesired paths for current, parallel to the intended path)?"
Our Response: Following your suggestions, we evaluated the Phi-3-mini-4k-instruct models in table 1 using a ReRAM noise model extracted from [3]. Under this overall stronger noise, our analog foundation model shows 7.05% better performance compared to the QAT model (see table below for details). Sneak path currents are typically mitigated at the circuit level. For example, by connecting word lines to the gates of transistors in the unit cells, sneak path currents resulting from exact-zero inputs are avoided. We will include results on ReRAM in the final version of the paper and include discussions on other noise sources in the discussion section.
| Model | MMLU (5-shot) | GSM8K (CoT 8-shot) | BoolQ (0-shot) | Hellaswag (5-shot) | MedQA (2-shot) | AGIEval (0-shot) | Arc-C (10-shot) | Arc-E (10-shot) | ANLI (7-shot) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct () | 69.36 | 79.91 | 78.75 | 83.51 | 52.83 | 38.06 | 84.56 | 91.08 | 52.25 | 70.03 |
| Phi-3-mini-4k-instruct () | 61.80 ± 0.66 | 49.65 ± 4.91 | 71.79 ± 1.93 | 72.06 ± 1.42 | 43.33 ± 1.35 | 32.31 ± 0.88 | 79.19 ± 1.24 | 88.13 ± 0.31 | 39.35 ± 4.45 | 59.73 |
| Analog FM () | 67.24 | 76.95 | 77.22 | 83.24 | 49.61 | 37.23 | 83.62 | 90.57 | 51.94 | 68.62 |
| Analog FM () | 64.61 ± 0.29 | 69.96 ± 0.88 | 76.33 ± 1.20 | 79.29 ± 0.97 | 45.73 ± 0.73 | 35.71 ± 0.47 | 82.23 ± 0.51 | 89.57 ± 0.23 | 46.73 ± 2.88 | 65.57 |
| LLM-QAT () | 64.12 | 68.92 | 75.11 | 79.22 | 45.20 | 36.30 | 81.48 | 89.18 | 51.16 | 65.63 |
| LLM-QAT () | 59.00 ± 1.14 | 53.40 ± 2.25 | 70.60 ± 4.16 | 72.01 ± 1.76 | 38.31 ± 2.62 | 32.91 ± 1.18 | 76.75 ± 1.51 | 86.26 ± 0.82 | 37.46 ± 8.89 | 58.52 |
References
[1] "An analog-AI chip for energy-efficient speech recognition and transcription", Ambrogio et al., Nature 2023
[2] "A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference", Le Gallo et al., Nature Electronics 2023
[3] "A compute-in-memory chip based on resistive random-access memory", Wan et. al, Nature 2022
Dear Reviewers, Thank you for reviewing our NeurIPS submission. We encourage you to engage in the discussion phase. We are committed to addressing any questions or concerns promptly and constructively.
This paper demonstrates for the first time analog in-memory computing (AIMC) for LLMs with billions of parameters, overcoming a significant limitation of prior work, which considered only small AIMC models (e.g. CNNs) with < 50M params. The authors conducted a comprehensive evaluation on 12 different benchmarks where they report strong results. All reviewers are positive about this paper and its impact in the area of hardware-aware training of foundation models, noting that the rebuttal has addressed many (if not all) of the concerns raised. There is a clear positive consensus among the reviewers that has been further strengthened after the rebuttal (even Jwnv acknowledges that their concerns have been fully addressed). Hence, clear accept.