Can LLMs Enhance Performance Prediction for Deep Learning Models?
Abstract
Reviews and Discussion
This paper proposes a new method that combines Graph Neural Networks (GNNs) and Large Language Models (LLMs) to predict the performance of deep learning models. The GNN captures the structural information in the model architecture, while the LLM improves the generalization and adaptation ability of the prediction model. To combine the two seamlessly, the proposed method first pre-trains the GNN and then fine-tunes the combined model in two stages. The graph-text adaptation stage uses the text-adaptation dataset to tune the LLM and the projection layer between the GNN and the LLM, and the performance prediction fine-tuning stage uses a multi-platform dataset to tune the whole model, including the GNN and the LLM. Both stages use LoRA and soft prompting for efficient training. As a result, the proposed method achieves better prediction accuracy than previous GNN-based methods, especially when adapting to new hardware with few samples.
Strengths
- This paper is well-written and easy to follow.
- This paper introduces a novel method that combines a GNN and an LLM for performance prediction tasks. It could inspire further GNN+LLM research on structured data.
- The evaluation clearly shows the effectiveness of the proposed method.
Weaknesses
- Although the total training time can be greatly reduced, the inference time will increase because of the LLM's computation. This introduces a new inference performance issue when applying the method to the DL system stack, yet the paper does not discuss inference time.
- The proposed method heavily depends on multi-stage training, which must pre-train GNN and LLM first and then use two separate fine-tuning stages to finish the training. Each step needs a careful design for the training.
Questions
Question about the results in Table 8: it seems that the GNN baseline performs better on all of the fp32 platforms, which is contrary to the conclusion in the experiments section. However, there is no discussion or result about fp32 in the main body. An explanation would be helpful.
We thank the reviewer for their excellent comments!
Weakness:
- Our approach is particularly well-suited for use cases such as recommending optimal hardware configurations for AI deployments, where the primary concern isn't immediate inference speed, but the ability to efficiently adapt to new hardware and architectures. This adaptability is crucial in environments where selecting the right hardware significantly influences both the performance and cost-effectiveness of AI applications. As shown in Section 4.4, our method adapts to new hardware and model architectures more effectively than traditional methods, showcasing its practical benefits. Additionally, our model achieves a prediction latency of approximately 0.28 seconds per sample on an A100 GPU. This performance allows for the ahead-of-time tuning of compiler and program configurations, ensuring systems are precisely adjusted to meet the specific requirements of anticipated workloads. This capability is especially useful when the target hardware is not available ahead of time, allowing for effective system configuration and optimization without direct hardware access.
- We would like to emphasize that the GNN pre-training and graph-to-text adaptation stages are one-time processes. Once these stages are completed, they provide a stable foundation that does not require retraining for additional performance metrics or new hardware configurations. Subsequently, our method allows for the fine-tuning of specific performance metrics such as latency prediction and energy consumption. This fine-tuning stage is highly adaptable and can be applied to additional hardware configurations as discussed in Section 4.4. This flexibility ensures that our method, once initially set up, can be efficiently extended to new applications without the need for complete retraining, thereby reducing the overall computational cost and time involved in adapting to new tasks.
Question
- We appreciate the reviewer’s feedback regarding the performance of the GNN baseline on fp32 platforms in Table 8. To address this, we performed a variability analysis across platforms using the Std-to-Mean Ratio, which represents the standard deviation as a proportion of the mean CostTime. This metric reveals that GPU-based fp32 platforms (gpu-T4-trt7.1-fp32, gpu-P4-trt7.1-fp32) exhibit low normalized variability (0.75 and 0.69, respectively), indicating stable and predictable performance trends. The CPU-based fp32 platform (cpu-openppl-fp32), while having a higher absolute standard deviation (801.45 ms), also demonstrates relatively low normalized variability (0.76), suggesting that its variability is proportional to its high mean CostTime. These characteristics align with the strengths of GNN models, which use MSE loss to effectively capture continuous and structured performance mappings. In contrast, non-fp32 platforms show significantly higher normalized variability compared to their average performance, with Std-to-Mean Ratios ranging from 0.84 to 3.44. Platforms such as atlas300-acl-fp16 exhibit disproportionately large variability relative to their mean CostTime. This reflects intricate and hardware-specific performance trends. Our method excels in such scenarios due to its ability to model complex and diverse performance patterns, particularly for platforms with high normalized variability.
| Platform | Std-to-Mean Ratio | Mean CostTime (ms) | Std. (ms) |
|---|---|---|---|
| gpu-P4-trt7.1-fp32 | 0.69 | 33.67 | 23.34 |
| gpu-T4-trt7.1-fp32 | 0.75 | 23.64 | 17.69 |
| cpu-openppl-fp32 | 0.76 | 1051.99 | 801.45 |
| gpu-P4-trt7.1-int8 | 0.84 | 16.44 | 13.85 |
| gpu-T4-trt7.1-int8 | 0.88 | 18.1 | 15.89 |
| hi3519A-nnie12-int8 | 1.05 | 127.05 | 133.39 |
| hi3559A-nnie11-int8 | 1.68 | 154.23 | 258.5 |
| mul270-neuware-int8 | 3.38 | 60.59 | 204.76 |
| atlas300-acl-fp16 | 3.44 | 56.9 | 195.53 |
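The Std-to-Mean Ratio in the table is the standard deviation of CostTime divided by its mean (the coefficient of variation). As a minimal sketch, the ratios can be recomputed from the reported mean and standard deviation columns. The dictionary and function names below are illustrative and are not taken from the authors' code.

```python
# Reported (mean CostTime, std) per platform in ms, copied from the table above.
platform_stats = {
    "gpu-P4-trt7.1-fp32": (33.67, 23.34),
    "gpu-T4-trt7.1-fp32": (23.64, 17.69),
    "cpu-openppl-fp32": (1051.99, 801.45),
    "gpu-P4-trt7.1-int8": (16.44, 13.85),
    "gpu-T4-trt7.1-int8": (18.1, 15.89),
    "hi3519A-nnie12-int8": (127.05, 133.39),
    "hi3559A-nnie11-int8": (154.23, 258.5),
    "mul270-neuware-int8": (60.59, 204.76),
    "atlas300-acl-fp16": (56.9, 195.53),
}


def std_to_mean_ratio(mean, std):
    """Standard deviation expressed as a proportion of the mean."""
    return std / mean


# Round to two decimals, matching the precision used in the table.
ratios = {
    name: round(std_to_mean_ratio(mean, std), 2)
    for name, (mean, std) in platform_stats.items()
}

# Print platforms from least to most variable.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:24s} {ratio:.2f}")
```

Sorting by the ratio places the three fp32 platforms first, which is the grouping the response relies on.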
Thank you again for your constructive comments. Based on these updates and your highlighted strengths, we respectfully request reconsideration for an increased score.
The paper proposes to use both graph neural networks (GNNs) and Large Language Models (LLMs) for a more accurate and generalizable performance prediction framework for deep learning models. The proposed method leverages a GNN-based encoder to generate graph tokens as inputs to the language model and achieves notable improvements against GNN-only or LLM-only baselines.
Strengths
The proposed method, compared to baselines, takes network structure into consideration to achieve a better prediction accuracy while also maintaining a manageable computational cost.
Weaknesses
- The LLM-based baselines provided in Table 1 all have very obvious drawbacks and therefore are not strong enough to serve as main baselines to illustrate the novelty of the proposed methods. The 'text' method does not include a network structure at all, which naturally leads to bad performance. The 'JSON' method uses a huge input in the LLM with possibly redundant information that significantly increases the training runtime and harms the accuracy.
- The breakdown results for different model architectures are missing.
- There seem to be different results reported in Table 6 of the original NNLQP paper, which reports a 79.51% accuracy.
Questions
- Can you provide a stronger LLM-only baseline? For example, instead of direct JSON input, which can contain useless information, more compact and structured text-based information could be passed to the LLM.
- What are the 'hardware details' in the prompt?
Details of Ethics Concerns
No ethics review needed.
Dear Reviewer,
Thank you for your valuable feedback. I would like to clarify that the paper you referenced (https://openreview.net/forum?id=bpS4vaOg7q#discussion) is a non-archival version presented at the ICML 2024 WANT workshop. According to ICLR guidelines, this does not constitute plagiarism. I kindly request the removal of the ethical concern raised in your review.
I will address the other points you've mentioned as soon as possible. Thank you once again!
Dear Reviewer, thank you for your constructive feedback.
We appreciate the opportunity to clarify the use of JSON in our study. The JSON format, as illustrated in the provided sample [1], contains only necessary information. For datasets like TPU graphs [2], which can involve thousands of nodes, both JSON and text formats become inefficient due to the resulting huge context length. These methods, while structurally informative, lead to substantial increases in input size, which directly impacts processing time and computational resource demands. Our approach, which condenses the entire graph into a single graph token, directly addresses these inefficiencies. By simplifying the representation while retaining essential structural details, we significantly reduce the input length required for the LLM. This not only enhances processing speed but also minimizes the computational load, making our method particularly effective for handling large-scale computational graphs.
Acc(10)% - Performance Comparison Breakdown
| Platform | NNLQP-MiniLM | NNLQP-mpnet | NNLQP-distilroberta | NNLQP | GNN-NF | Ours-Mistral7B |
|---|---|---|---|---|---|---|
| cpu-openppl-fp32 | 24.69 | 29.22 | 33.25 | 58.94 | 62.00 | 56.26 |
| hi3559A-nnie11-int8 | 42.17 | 45.18 | 54.22 | 73.19 | 72.59 | 88.15 |
| gpu-T4-trt7.1-fp32 | 29.86 | 43.19 | 39.71 | 60.87 | 55.94 | 58.74 |
| gpu-T4-trt7.1-int8 | 39.27 | 36.05 | 40.99 | 27.90 | 49.36 | 46.78 |
| gpu-P4-trt7.1-fp32 | 30.49 | 46.65 | 39.26 | 60.97 | 56.58 | 58.89 |
| gpu-P4-trt7.1-int8 | 46.48 | 42.51 | 48.93 | 36.68 | 59.02 | 48.83 |
| hi3519A-nnie12-int8 | 47.79 | 48.10 | 39.24 | 77.53 | 58.23 | 85.02 |
| atlas300-acl-fp16 | 26.28 | 46.80 | 44.23 | 47.76 | 46.79 | 68.05 |
| mul270-neuware-int8 | 27.11 | 25.64 | 28.94 | 21.61 | 24.91 | 33.70 |
| Average | 34.90 | 40.37 | 40.97 | 51.72 | 53.93 | 60.49 |
MAPE% - Performance Comparison Breakdown
| Platform | NNLQP-MiniLM | NNLQP-mpnet | NNLQP-distilroberta | NNLQP | GNN-NF | Ours-Mistral7B |
|---|---|---|---|---|---|---|
| cpu-openppl-fp32 | 22.60 | 18.75 | 18.91 | 10.48 | 11.01 | 12.22 |
| hi3559A-nnie11-int8 | 14.37 | 14.55 | 11.60 | 7.55 | 8.29 | 5.38 |
| gpu-T4-trt7.1-fp32 | 19.17 | 14.57 | 14.97 | 9.32 | 11.80 | 9.69 |
| gpu-T4-trt7.1-int8 | 16.22 | 16.69 | 15.43 | 18.10 | 12.65 | 14.05 |
| gpu-P4-trt7.1-fp32 | 18.37 | 13.66 | 15.41 | 9.75 | 10.00 | 9.91 |
| gpu-P4-trt7.1-int8 | 14.12 | 13.41 | 13.14 | 13.75 | 9.62 | 12.05 |
| hi3519A-nnie12-int8 | 13.80 | 13.14 | 13.65 | 7.13 | 10.04 | 5.96 |
| atlas300-acl-fp16 | 21.54 | 16.20 | 16.12 | 14.41 | 16.02 | 9.47 |
| mul270-neuware-int8 | 25.20 | 24.14 | 24.91 | 26.18 | 25.34 | 28.31 |
| Average | 18.38 | 16.12 | 16.02 | 12.96 | 12.75 | 11.89 |
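For readers unfamiliar with the two metrics in the tables above, the following is a minimal sketch of how they are commonly defined in latency prediction work: MAPE is the mean absolute percentage error, and Acc(10%) is taken here to be the percentage of samples whose relative error is within 10%. These definitions are an assumption based on common usage, not a transcription of the authors' evaluation code, and the sample latencies are made up for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (lower is better)."""
    errors = [abs(p - t) / t for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def acc_within(y_true, y_pred, tol=0.10):
    """Percentage of predictions whose relative error is at most `tol`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) / t <= tol)
    return 100.0 * hits / len(y_true)


# Hypothetical measured and predicted latencies in ms.
true_latency_ms = [10.0, 20.0, 40.0, 80.0]
pred_latency_ms = [10.5, 26.0, 38.0, 81.0]

print(f"MAPE     = {mape(true_latency_ms, pred_latency_ms):.2f}%")
print(f"Acc(10%) = {acc_within(true_latency_ms, pred_latency_ms):.1f}%")
```

The two metrics can disagree: a single large miss raises MAPE substantially but costs Acc(10%) only one sample, which is one reason per-platform rankings differ between the two tables.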
[1] https://anonymous.4open.science/r/llm-dl-perf-AD62/data/sample.json
[2] Phothilimthana, Phitchaya Mangpo, et al. "TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs." arXiv:2308.13490, 2023. https://doi.org/10.48550/arXiv.2308.13490
In our experiments, the 'hardware details' in the prompt refer primarily to the name of the hardware, such as "NVIDIA GPU P4 Trt7.1 fp32," as mentioned in the NNLQP dataset. We initially considered including more detailed hardware specifications, but to ensure a fair comparison with the NNLQP model, we opted to maintain a consistent level of detail in the hardware descriptions. Our approach offers a significant advantage in terms of adaptability. Unlike the NNLQP model, which requires retraining to integrate additional hardware features, our method allows for the flexible modification of prompts to include varying levels of hardware detail. This can be done without altering the underlying model architecture, thereby enhancing accuracy and adaptability without additional training overhead.
The difference in baseline accuracy is due to the number of training epochs. The NNLQP paper used 500 epochs for multi-platform performance prediction. However, training LLM-based models for such a large number of epochs requires substantial computational resources. Our approach focuses on demonstrating how LLM-based methods can quickly adapt to new hardware environments with fewer training steps. In our experiments, we used 10 epochs, as specified in the paper.
Thank you again for your constructive feedback, and we respectfully request reconsideration for an increased score.
I would like to sincerely thank the reviewer for their careful response and the extra breakdown comparison. I also appreciate their clarification on the JSON raw data and the hardware info. However, I'm not convinced by the provided justification for using only 10 epochs during training:
- The difference in baseline accuracy is too huge to ignore (NNLQP 10 vs. 500 epochs)
- Even if there is concern about computational cost, I do not fully agree with the authors about reducing from 500 to merely 10 epochs in training, especially for comparison purposes.
- The proposed method undergoes graph pre-training, which may unintentionally give it a benefit over the baseline NNLQP. This may not be fair to the baseline NNLQP as its GNN is only trained for 10 epochs.
- I understand that the GNN and LLM pre-training is one of the proposed features; however, fully trained accuracy is what ultimately matters in practical use cases.
I sincerely suggest the authors consider providing experiment results that show that the proposed method achieves higher accuracy than the ones provided in the NNLQP paper with full training. This will significantly benefit any readers who want to implement this method in practice.
Dear Reviewer,
Thank you for your feedback and acknowledgment of our efforts. To address your concerns about the training epochs, we conducted an additional experiment in which the NNLQP model was trained for a full 500 epochs, and our proposed model with the Llama3-8B base LLM was trained for just 10 epochs, both on the NNLQP multi-platform dataset (which contains eight different hardware platforms).
Subsequently, we performed transfer learning on a new hardware platform, hi3519A-nnie12-int8, leveraging the pre-trained weights of both models to evaluate how quickly they adapt. For this phase, we used only 32 training samples. After transferring the weights, the NNLQP model was trained for 500 epochs and achieved an accuracy of 57% (Acc(10%)). In contrast, our model reached an accuracy of 70% (Acc(10%)) in just 3 epochs. These results highlight our model's adaptability to new hardware platforms with fewer training epochs, showcasing its practical utility in scenarios where training data and time are limited.
Given these findings and the strengths you've highlighted, we respectfully request a reconsideration of the score.
We also believe that our positive results in this area will be of interest to many researchers in the field.
Thanks for the extra information. I would like to sincerely thank the authors for their constructive discussion.
I noticed that there is still a significant gap between Figure 3 (transfer learning) in the manuscript and the baseline transfer learning results (Figure 7 in the NNLQP paper).
- Figure 3 shows that NNLQP achieves around 60% accuracy with 700+ samples; however, Figure 7 in the original NNLQP paper shows around 80% accuracy.
- With the atlas300-acl-fp16 dataset, the proposed method provides a less significant benefit. If compared to the results reported in NNLQP, the accuracy of GNN is still 10% lower.
- The transfer learning experiment is only carried out on two platforms, compared to Figure 7.
- The paper lacks a comparison to Figure 6 (transfer learning to unseen structure) of NNLQP.
In summary, the reviewer decides to maintain the score.
Thank you for your constructive feedback. We would like to clarify again that the NNLQP paper trained the model for 500 epochs on the multi-platform dataset, which includes 8 hardware platforms, and used a further 500 epochs for transfer learning on a new hardware platform. In contrast, as stated in our paper, all models used 10 epochs for initial training on the multi-platform dataset and just 3 epochs for transfer learning. This configuration assesses quick adaptability, which is crucial for real-world scenarios.
Our new experimental results indicate that our method outperforms NNLQP even when NNLQP is trained for 500 epochs. We will include the results for the remaining hardware platforms in the revised manuscript.
Additionally, regarding transfer learning to unseen DL architectures, we have already addressed this aspect in our experiments. While NNLQP focuses on CNN-based architectures, we tested our method on the transformer-based ViT model. As shown in Table 5, our model achieved a higher accuracy of 95% (Acc(10%)).
The paper introduces a novel approach that combines Graph Neural Networks (GNNs) and Large Language Models (LLMs) to predict the inference latency of deep learning (DL) models. DL models are first converted into graphs with feature matrices via the Open Neural Network Exchange (ONNX) format. These graphs are then processed by a GNN to generate graph embeddings, which are concatenated with token embeddings from the LLM. The authors propose a three-stage training pipeline: (1) GNN pre-training, where the GNN is trained using Graph Masked Auto Encoder techniques; (2) Graph-Text Adaptation, which aligns the GNN's graph embeddings with the LLM’s text embeddings using a projection layer and LoRA (Low-Rank Adaptation); and (3) Performance Prediction Fine-Tuning, which simultaneously trains both models. The results indicate that this multi-modal approach achieves lower Mean Absolute Percentage Error (MAPE) and higher accuracy compared to baseline methods and adapts well to new DL architectures and hardware configurations with limited training data.
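The three-stage pipeline summarized above can be sketched as a schedule of which components receive gradient updates in each stage. This is an illustrative reading of the thread, not the authors' implementation: the component names are hypothetical, and the sketch assumes that the LLM is only ever updated through its LoRA adapters (base weights frozen) and that the projection layer stays trainable in the final stage.

```python
from dataclasses import dataclass

# Hypothetical component names for the combined GNN + LLM predictor.
COMPONENTS = ("gnn", "projection", "llm_lora", "llm_base")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components updated during this stage


# Assumed schedule: the base LLM weights are never updated directly.
PIPELINE = [
    Stage("gnn_pretraining", frozenset({"gnn"})),
    Stage("graph_text_adaptation", frozenset({"projection", "llm_lora"})),
    Stage("performance_finetuning", frozenset({"gnn", "projection", "llm_lora"})),
]


def frozen_components(stage):
    """Components that receive no gradient updates in the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


for stage in PIPELINE:
    print(f"{stage.name:24s} train={sorted(stage.trainable)} "
          f"frozen={frozen_components(stage)}")
```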
Strengths
- The paper's method, which utilizes GNN embeddings as inputs for LLMs, is novel for performance prediction tasks. Unlike previous works, this approach combines GNN and LLM in a structured, multi-stage training pipeline to optimize both models for this task, showing high adaptability to new architectures and hardware.
- The proposed method shows considerable improvement in performance prediction metrics (e.g., MAPE, accuracy) compared to baselines, demonstrating strong potential for adaptability to new hardware configurations and DL architectures with limited training samples.
- The presentation of the paper and the explanation of the method are very clear. Although the method consists of several stages and different modules, the writing makes it easy to understand how each module is integrated into the training pipeline.
- The authors propose a new training pipeline that can effectively optimize two different neural networks, GNN and LLM, for the task of performance prediction. Empirical results show that the resulting model has high adaptability to new hardware configurations or DL graphs given only a small set of training data.
Weaknesses
- In Table 3, the proposed method is compared against a single GNN baseline, which is insufficient to assess the enhancement from the method. The authors should add existing works such as [1] and [2] to their experiments for fair comparison with SOTA methods.
- While this work demonstrates the feasibility of using GNN and LLM together to predict the performance of DL models, it lacks a thorough discussion on the specific challenges and characteristics that make this problem difficult to address using LLMs. In many fields, the effectiveness of LLMs is well established, and multi-modal language models are widely used. To strengthen the contribution of this paper, an analysis of why this problem requires an LLM-based approach and how it can be effectively addressed should be included.
- The baseline accuracy of 51.72% reported in Table 3 is questionable, as it is significantly lower than the accuracy reported in the original paper (average Acc(10%) of 59.73%). Additional discussion of why the baseline shows lower accuracy than in the original paper is needed.
- The “Justification for the Proposed Architecture” section (Section 4.2) is confusing and difficult to understand. Are the authors justifying the use of LLMs by showing the suboptimality of their early designs that use smaller LMs? Or are these presented as enhanced baselines which simply integrate language models into the GNN baseline?
[1] Yu, G. X., Gao, Y., Golikov, P., & Pekhimenko, G. (2021). "Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training." Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21).
[2] Zhang, Y., Li, Y., & Wang, Y. (2023). "Runtime Performance Prediction for Deep Learning Models with Graph Neural Network." Proceedings of the 2023 IEEE International Conference on Big Data (Big Data).
Questions
- How do other baselines (e.g., [1], [2]) perform with the evaluation suite you used?
- What are the specific challenges of using LLMs to predict the performance of DL models? There are already several works that use GNN and LLM together to encode the graph and make it understandable for LLMs [3], [4].
- Why is the Acc(10%) of the GNN baseline much lower than the figure reported in the original paper?
[3] Bahare Fatemi, Jonathan Halcrow, Bryan Perozzi (2024), “Talk like a Graph: Encoding Graphs for Large Language Models”, The Twelfth International Conference on Learning Representations (ICLR 2024)
[4] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, Jonathan Halcrow (2024), “Let Your Graph Do the Talking: Encoding Structured Data for LLMs”, https://arxiv.org/abs/2402.05862
We sincerely thank the reviewer for their thoughtful feedback and constructive suggestions.
Regarding Habitat [1], their method relies on the GPU resources the user already owns to run the deep learning model on their hardware. It uses a wave-scaling technique to predict performance on new hardware and performs layer-by-layer performance predictions. Additionally, their approach is tailored specifically to GPUs. Our approach treats the entire model as a single graph and predicts performance without requiring access to target hardware. These distinctions will be discussed in detail in the revised manuscript.
DNNPerf [2] shares similarities with the NNLQP GNN architecture, but the authors did not provide the necessary code or dataset to reproduce or validate their results. DNNPerf focused on memory consumption and training time estimation but did not extend their evaluation to multi-platform performance prediction, a key focus area for our work. In contrast, NNLQP not only targets inference performance but also includes comprehensive multi-platform evaluations and offers open-source code and datasets, ensuring it serves as a more suitable and robust baseline for comparison and further development.
The difference in baseline accuracy is due to the number of training epochs. The NNLQP paper used 500 epochs for multi-platform performance prediction. However, training LLM-based models for such a large number of epochs requires substantial computational resources. Our approach focuses on demonstrating how LLM-based methods can quickly adapt to new hardware environments with fewer training steps. In our experiments, we used 10 epochs, as specified in the paper.
Fatemi et al. (2024) [3] introduced a GNN + LLM approach for graphQA, where graphs are treated as text. Perozzi et al. (2024) [4] demonstrated improved performance over this by treating graphs as graph tokens. Our approach was built upon Perozzi et al. (2024), and we introduced a structured pre-training strategy that significantly enhances adaptability and performance. A key feature is the graph-to-text adaptation stage, which updates only the projection layer and the LoRA matrix of the LLM during training. This design allows the LLM to effectively interpret GNN encoder embeddings while retaining its extensive pre-trained knowledge. Our ablation studies (Table 4) show a 7.83 percentage point improvement in Acc(10)%, and our adaptation experiments (Figure 3) demonstrate up to a 50 percentage point increase in accuracy, showcasing the effectiveness of this structured pre-training.
Section 4.2 aims to show why we moved from smaller language models to a larger LLM system. Our initial experiments with models like MiniLM, mpnet-base, and distilroberta revealed their inability to effectively distinguish between hardware specifics and model details, as shown in Table 3. These results highlighted the need for a more capable model to handle our complex tasks. This underperformance led us to adopt the larger, more sophisticated Mistral-7B model enhanced with LoRA. The transition wasn't merely enhancing baselines but was necessary to address the shortcomings of the smaller models.
Thank you again for your constructive feedback, and we would like to request a revised score.
[1] Yu, G. X., Gao, Y., Golikov, P., & Pekhimenko, G. (2021). "Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training." Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21).
[2] Zhang, Y., Li, Y., & Wang, Y. (2023). "Runtime Performance Prediction for Deep Learning Models with Graph Neural Network." Proceedings of the 2023 IEEE International Conference on Big Data (Big Data).
[3] Bahare Fatemi, Jonathan Halcrow, Bryan Perozzi (2024), “Talk like a Graph: Encoding Graphs for Large Language Models”, The Twelfth International Conference on Learning Representations (ICLR 2024)
[4] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, Jonathan Halcrow (2024), “Let Your Graph Do the Talking: Encoding Structured Data for LLMs”, https://arxiv.org/abs/2402.05862
The paper proposes a deep learning-based system for predicting the performance of deep learning models. It combines a GNN encoder, which produces a representation of the model's computational graph, with an LLM, which incorporates hardware details and the like. To adapt to novel hardware architectures and/or models, fine-tuning is employed. A comprehensive series of experiments demonstrates that this approach outperforms competing approaches and can transfer well with limited data.
Strengths
- The paper presents a novel approach to performance modeling for deep learning which outperforms existing approaches.
- Improved performance modeling is highly impactful for many parties trying to decide whether to invest in new hardware, or budget for model deployment.
- A comprehensive set of experiments demonstrates the benefit of the approach.
- The paper is clear and well-written.
Weaknesses
- The dataset considered for evaluation (NNLQP) focuses exclusively on vision tasks, and almost all models considered are CNNs (there is one study which shows good results for low-data transfer to ViTs). The paper would be stronger if it considered models from additional modalities, most notably transformers on text tasks.
- Likewise, the hardware considered does not include recent GPU architectures, e.g., H100 GPUs, and is mostly focused on dedicated inference devices. While this is valuable to include, it neglects server-class GPUs which are probably more widely deployed, and which have very different hardware characteristics.
- Many models these days require multiple GPUs even for inference. It is not clear how the approach handles this situation.
Minor: Typo: Figure 2, stage 2: "Prompt Summaries the graph": "Summaries" should be "Summarizes".
Questions
Please see comments/questions under weaknesses.
Dear Reviewer,
Thank you for your thoughtful feedback and for recognizing the strengths of our work.
We acknowledge the current dataset’s focus on vision tasks and CNNs and agree that expanding to other modalities, such as transformers for text tasks, would strengthen the work. Preliminary results with ViTs show promise, and we plan to include NLP benchmarks and server-class GPUs (e.g., H100) in future evaluations. This study lays the groundwork for further advancements in performance modeling across diverse hardware and model architectures.
Our approach can be extended to handle multi-GPU inference by treating the computational graph as a unified structure that accounts for inter-GPU communication. Future work will explicitly evaluate such scenarios, ensuring applicability to modern workloads.
We will correct the typo in the final version of the figure. Thank you again for your constructive feedback and for supporting our work.
This work combines a GNN and an LLM to predict the performance of a DL model in a given hardware environment. It first verifies that graph embeddings serve as effective input tokens for an LLM for performance prediction. A GNN foundation model is further proposed to avoid retraining. A specialized dataset verifies that GNN + LLM achieves higher accuracy and better efficiency than pure text-based methods.
Strengths
- The paper is well-organized and easy to follow.
- The proposed method does not require retraining for a new model or a new hardware environment.
- A new dataset is introduced.
- Experimental results support the claim that the proposed method performs better than the baselines.
Weaknesses
This work follows the findings of Perozzi et al. (2024). What are the technical differences between the two works? Without such a comparison, it is hard to judge the technical novelty of the second contribution (lines 108-109), which is the key contribution in my opinion.
Questions
Will the code and dataset be released?
Dear Reviewer,
Thank you for your review and constructive feedback. We appreciate your thoughtful comments.
We would like to highlight the key differences between our approach and the findings of Perozzi et al. (2024). While their work provided a foundational framework for combining GNNs and LLMs, our approach extends this by introducing a structured pre-training strategy that significantly enhances model adaptability and performance.
A central aspect of our work is the graph-to-text adaptation stage, which updates only the projection layer and the LoRA matrix of the LLM during training. This design allows the LLM to interpret GNN encoder embeddings effectively while retaining its extensive pre-trained knowledge. Our ablation studies (Table 4) show that this strategy improves Acc(10)% by 7.83 percentage points. Furthermore, as shown in Figure 3, this approach enhances accuracy by up to 50 percentage points during adaptation experiments.
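The idea of updating only low-rank LoRA matrices while keeping pre-trained weights intact can be illustrated with a small sketch. In standard LoRA, a frozen weight W is augmented with a trainable low-rank product BA scaled by alpha / r, and B starts at zero so the adapted layer initially behaves exactly like the pre-trained one. The matrices and values below are toy numbers chosen for illustration and are not taken from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A; W itself is never modified."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Toy frozen weight with d_out = 2, d_in = 3, and a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]          # r x d_in, trainable
B_init = [[0.0], [0.0]]         # d_out x r, trainable, zero at initialization
B_trained = [[1.0], [0.0]]      # hypothetical values after some training

# At initialization the adapter contributes nothing.
assert lora_weight(W, A, B_init, alpha=2, r=1) == W

W_eff = lora_weight(W, A, B_trained, alpha=2, r=1)
print(W_eff)
```

Here the adapter has r * (d_in + d_out) = 5 trainable values against 6 in the full matrix. The saving is negligible at this toy size but grows quickly for the large weight matrices in a 7B-parameter LLM.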
We confirm that the code and dataset will be publicly available soon at this link: https://anonymous.4open.science/r/llm-dl-perf-AD62
Thank you for recognizing the strengths of our work. We respectfully request reconsideration for an increased score.
The paper presents a novel approach to enhance the accuracy and adaptability of deep learning model performance prediction by combining GNN and LLM. The reviewers raised concerns about the technical novelty, ablation studies and baseline selection. The authors have provided detailed responses to the reviewers' concerns and questions during the rebuttal and discussion period. However, there are several significant issues that require substantial revisions to the paper. Addressing these concerns within the limited rebuttal period is unlikely, and the quality of the paper may not be adequately ensured.
A deeper concern is why an LLM would be able to predict the performance of neural networks at all. The basic motivation is unclear and confusing. Given the intense competition at ICLR, the AC suggests that the authors revise the paper based on the reviewers' feedback and consider submitting it to the next conference, with best wishes for success.
Additional Comments on Reviewer Discussion
The reviewers raised concerns about the technical novelty, ablation studies and baseline selection. The authors have provided detailed responses to the reviewers' concerns and questions during the rebuttal and discussion period. However, there are several significant issues that require substantial revisions to the paper. Addressing these concerns within the limited rebuttal period is unlikely, and the quality of the paper may not be adequately ensured.
Reject