Multi-objective Differentiable Neural Architecture Search
We introduce a novel hardware-aware differentiable multi-objective architecture search method that efficiently profiles the whole Pareto front.
Abstract
Reviews and Discussion
The paper introduces a novel framework for differentiable neural architecture search (NAS) that integrates a generalizable hardware embedding with multi-objective embeddings during supernet training. This approach enables the NAS framework to generalize to unseen hardware platforms and efficiently sample Pareto-optimal subnets based on user preferences. To achieve this, the authors design a MetaHypernetwork that uses a sampled preference vector and representative hardware embeddings to guide architecture sampling within the supernet. During training, instead of randomly sampling subnets, the framework selects and trains subnets based on the hardware and objective embeddings, resulting in a better Pareto front. Extensive experiments demonstrate that this framework consistently produces superior Pareto-optimal subnets across various devices, outperforming several state-of-the-art NAS frameworks.
Strengths
The differentiable NAS framework proposed in the paper introduces a novel hypernetwork that enables zero-shot transferability to new devices, which, to the best of my knowledge, is an original contribution. I appreciate how this framework addresses key challenges faced by the NAS community. Previous approaches directly optimize network structures and hardware metrics using differentiable operators for one specific platform, while two-stage NAS frameworks separate supernet training from subnet searching. However, the search process in two-stage frameworks is often costly, as it relies on hardware-specific predictors and estimators for each objective on each platform. Although some works, such as [1, 2], have attempted to reduce the search cost in two-stage frameworks, I believe that a unified solution that provides zero-shot transferability and multi-objective optimization across new platforms has been lacking. This paper fills that gap by proposing a unified framework that combines hardware and objective embeddings with differentiable NAS through the use of a hypernetwork. I believe this is exciting progress for neural architecture search.
The paper is well-presented and well-organized. The extensive experiments that compare it to several state-of-the-art NAS frameworks and the in-depth analysis of the framework fully support the claims made in the paper.
[1] HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning
[2] BRP-NAS: Prediction-based NAS using GCNs
Weaknesses
My only concern is that training differentiable NAS with multi-objective hardware metrics can often be unstable and may struggle to converge, leading to issues with reproducibility. While the authors have shared the hyperparameters and plan to release the implementation, it would be beneficial to include a discussion of the training techniques in the Appendix to address these challenges.
Questions
- Could you provide the training and validation loss curves?
- In Figure 13, do you quantize the probability vector to obtain the input to the embedding layer? As I understand it, embedding layers typically take indices as inputs.
- In Figure 13, if I understand correctly, the layer that maps the device feature vector to a k-dimensional vector is a linear layer. If this is the case, referring to it as an 'embedding' layer may be somewhat confusing, given the use of that term elsewhere for actual embedding layers.
- Suppose an application requires a network with a latency of under 30 ms on a given platform. How does the user preference vector, where each element is a probability within [0, 1], map to specific latency (and energy consumption) in real-world scenarios?
We sincerely thank you for your review of our submission and the positive feedback on several aspects of MODNAS, such as its zero-shot transferability and its unified multi-objective differentiable framework. We are also glad to hear that you found our paper well written, well organized, and our experiments thorough. We address your concerns and questions in the following.
My only concern is that training differentiable NAS with multi-objective hardware metrics can often be unstable and may struggle to converge, leading to issues with reproducibility. While the authors have shared the hyperparameters and plan to release the implementation, it would be beneficial to include a discussion of the training techniques in the Appendix to address these challenges.
Thank you for highlighting this critical point. We concur that differentiable NAS and hypernetworks often demand careful tuning of hyperparameters (discussed briefly in the limitations paragraph of Section 5 as well). For hyperparameters specific to the supernetwork, we align them exactly with those used in the original search spaces and benchmarks. Overall, we find the hypernetwork to be reasonably robust to the hyperparameter choices. While there is always room for improvement, in our experiments there were 3 components that made MODNAS robust and able to work reliably across benchmarks:
- The choice of MetaHypernetwork update scheme: this played a pivotal role in the performance of MODNAS. While other gradient update strategies (see lines 406-412) underperformed or started diverging (Figure 6), MGD converged relatively quickly to a hypervolume close to that of the global Pareto front (a minimal sketch of the MGD direction computation is included below). The convergence of MGD to a Pareto-stationary point is discussed in Desideri [1] and Zhang et al. [2]. The convergence of MGD in bilevel optimization is an open research topic (see recent results from Ye et al. [3] and Yang et al. [4]). One potential scenario where MGD could fail is when the gradients of the objectives it is optimizing point in opposing directions; however, this becomes practically unlikely, especially as the number of objectives grows (in our case we use it to find the common gradient across devices, which is, for instance, 13 devices on NB201).
- The choice of gradient estimation method in the Architect: In Section 3, lines 270-278, we discuss our choice for the method that enables gradient estimation through discrete variables (as architectures are discrete variables). We noticed that the ReinMax [5] estimator always outperformed previous estimators such as the one in GDAS [6] (Figure 15 in the appendix), so we believe this choice is crucial.
- Weight entanglement vs. weight sharing in the Supernetwork: In early experiments on NB201 we noticed that weight sharing in the Supernetwork was not only more expensive, but also much more unstable compared to weight entanglement [7, 8], even yielding diverging solutions quite often (a common pattern in differentiable NAS with shared weights, as you mention; see Zela et al. [9] for instance).
We hypothesize that all design choices mentioned above have an implicit regularization effect on the upper-level optimization in the bi-level problem, leading to faster convergence and robustness [9, 10, 11]. We have updated the paper with Appendix J containing the points we discussed above.
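To make the MGD update concrete (as referenced in the first bullet above), here is a minimal sketch of how a common descent direction across per-device gradients can be computed with a Frank-Wolfe solver, in the spirit of Désidéri's MGDA. The function name, shapes, and iteration count are our illustrative choices, not the released MODNAS code.

```python
import torch

def mgd_direction(grads, iters=20):
    """grads: list of flattened per-device gradient tensors of equal length."""
    G = torch.stack(grads)                 # (n_devices, n_params)
    n = G.shape[0]
    w = torch.full((n,), 1.0 / n)          # start from uniform simplex weights
    for k in range(iters):
        g = w @ G                          # current convex combination of gradients
        t = torch.argmin(G @ g)            # Frank-Wolfe: vertex minimizing the linearization
        gamma = 2.0 / (k + 2.0)            # standard step-size schedule
        e = torch.zeros_like(w)
        e[t] = 1.0
        w = (1 - gamma) * w + gamma * e
    return w @ G                           # min-norm convex combination of the gradients
```

The resulting direction (or rather its negative) would then be used to update the MetaHypernetwork parameters in place of a single-objective gradient.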
Could you provide the training and validation loss curves?
We have added the training and validation loss curves for NAS-Bench-201 in Appendix L. At each mini-batch iteration we plot the average cross-entropy loss across all devices. As expected, both the training and validation cross-entropy decrease, and we do not notice any overfitting. The high noise is common for sample-based NAS optimizers, since a different sampled architecture is active at each mini-batch iteration.
In Figure 13, do you quantize the probability of to obtain the vector in the embedding layer ? As I understand it, embedding layers typically take indices as inputs.
Yes, this observation is correct. We quantize the continuous sampled preference vector to discrete values (see line 14 in https://anonymous.4open.science/r/MODNAS-1CB7/hypernetworks/models/hpn_nb201.py). We have added a sentence (highlighted in blue) in Appendix E.2 to clarify this in the updated version of the paper.
In Figure 13, if I understand correctly, is a linear layer that maps the device feature vector to a k-dimensional vector. If this is the case, referring to it as an 'embedding' layer may be somewhat confusing, given the use of for embeddings.
Yes, that is correct. We apologize for the confusion and we have already fixed this in the text by referring to it as a linear layer instead. Thank you.
Suppose an application requires a network with a latency of under 30 ms on a given platform. How does the user preference vector, where each element is a probability within [0, 1], map to specific latency (and energy consumption) in real-world scenarios?
We agree with you that this is a very practical and relevant scenario. Therefore, in the paper we have conducted an experiment with a slightly different version of MODNAS that allows us to incorporate user constraints without the need to map them to the [0, 1] probability simplex. You can find the description of the method under the “MODNAS vs. constrained single-objective optimization” paragraph of Section 4 (lines 420-436). You can find the empirical results in Figure 7. Note that in the legend we write e.g. “MODNAS (0.2)”; however, the algorithm utilizes the actual hardware constraint (which we map to the preference vector value 0.2 for the plotting). In this particular example, since we knew the entire NB201 space exhaustively, we made sure to select hardware constraints that would map to equidistant preference vectors; however, this should not matter in a more practical case. To recap, running MODNAS using hardware constraints results in only one change in the algorithm: if the predicted hardware metric value from the MetaPredictor is smaller than the constraint (e.g. 30 ms), the gradients in lines 6 and 14 of Algorithm 1 will not be computed (constraint satisfied); otherwise they will be (constraint violated) and the algorithm will try to optimize this metric.
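As a small illustration of this constraint-gated behavior (with hypothetical names, not the exact implementation from the paper), a hardware-metric term contributes a gradient only when its predicted value violates the user constraint:

```python
def constrained_objective(task_loss, metric_preds, constraints):
    """task_loss: differentiable accuracy loss (tensor); metric_preds: MetaPredictor outputs
    (scalar tensors); constraints: user-specified limits, e.g. 30 ms for latency."""
    total = task_loss
    for pred, limit in zip(metric_preds, constraints):
        if pred.item() <= limit:   # constraint satisfied: skip the gradient for this metric
            continue
        total = total + pred       # constraint violated: this metric is optimized this step
    return total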
We hope that our response addresses your concerns and strengthens the paper. If any points remain unclear, we are open to providing further explanation. Thank you again for your time and valuable feedback. We appreciate your consideration of these responses in your evaluation.
–References–
[1] Jean-Antoine Désidéri. Multiple-Gradient Descent Algorithm (MGDA). [Research Report] RR-6953, 2009. inria-00389811v2
[2] Zhang, Q., Xiao, P., Ji, K. and Zou, S., 2024. On the Convergence of Multi-objective Optimization under Generalized Smoothness. arXiv preprint arXiv:2405.19440.
[3] Ye, F., Lin, B., Cao, X., Zhang, Y. and Tsang, I.W., 2024. A first-order multi-gradient algorithm for multi-objective bi-level optimization. In ECAI 2024 (pp. 2621-2628). IOS Press.
[4] Yang, X., Yao, W., Yin, H., Zeng, S. and Zhang, J., 2024. Gradient-based algorithms for multi-objective bi-level optimization. Science China Mathematics, pp.1-20.
[5] Liu, L., Dong, C., Liu, X., Yu, B. and Gao, J., 2024. Bridging discrete and backpropagation: Straight-through and beyond. Advances in Neural Information Processing Systems, 36.
[6] Dong, X. and Yang, Y., 2019. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1761-1770).
[7] Sukthanker, R.S., Krishnakumar, A., Safari, M. and Hutter, F., 2023. Weight-Entanglement Meets Gradient-Based Neural Architecture Search. International Conference in Automated Machine Learning 2024
[8] Cai, H., Gan, C., Wang, T., Zhang, Z. and Han, S., 2019. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791.
[9] Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T. and Hutter, F., Understanding and Robustifying Differentiable Architecture Search. In International Conference on Learning Representations.
[10] Chen, X. and Hsieh, C.J., 2020, November. Stabilizing differentiable architecture search via perturbation-based regularization. In International conference on machine learning (pp. 1554-1565). PMLR.
[11] Smith, S.L., Dherin, B., Barrett, D. and De, S., On the Origin of Implicit Regularization in Stochastic Gradient Descent. In International Conference on Learning Representations.
Thank you for your response. The authors have addressed all my concerns. I look forward to the release of your implementation.
Thank you again for your great feedback and positive score.
The authors propose a multi-objective HW-aware NAS algorithm that can emit optimized DNN architectures across multiple different target devices using a single search run. To overcome the challenge of having to solve multiple different optimization runs for each considered HW platform, the authors propose using a hypernetwork to generate an architectural distribution across multiple devices based on a "preference vector" and a "hardware device feature vector", from which discrete architectures can then be sampled. For objective function evaluators, the authors use a Supernetwork as a stand-in for accuracy and "MetaPredictors", one for each target device, to estimate hardware metrics.
Strengths
- The extension of existing zero-shot NAS techniques with hypernetworks for HW-aware NAS is motivated well and contextualized nicely with respect to already existing techniques.
- The authors provide an extensive evaluation for different applications (language modeling, vision, translation), multiple well known NAS searchspaces, and different target spaces (2-3 dimensional).
Weaknesses
- The authors do not show how the proposed DNN architectures would actually perform on the different target systems. As far as I understand it, the HV results shown are calculated using the estimated results from the "MetaPredictors". While this still allows for a relative comparison with the other techniques and algorithms evaluated in the paper, it makes it hard to evaluate the actual usefulness and effectiveness of the approach.
Questions
- What assumptions do the MetaPredictors make about the underlying devices when estimating latency and energy requirements? For example, do they assume a particular choice of operating system and process scheduler used, other parallel and system processes running, core utilisation of the DNN inference, runtime library, or HW design used to execute the DNN on the target device?
- Especially on smaller systems, memory requirements are often a major bottleneck of DNN inference: could the approach proposed by the authors be extended to include memory? So far, the evaluation seems to focus only on accuracy, latency and energy.
- How exactly are the "preference vector" and the "hardware device feature vector" defined? The authors cover many different types of HW in their evaluation: GPUs (e.g. 1080 ti), smartphones (e.g. pixel 2, 3), edge devices (e.g. raspi 4) and dedicated HW accelerators (eyeriss, FPGA), which can be characterised by a number of different metrics, often exclusive to the type of HW considered (e.g. #cores, processor speed, RAM, ... for the raspi 4 vs. #streaming cores, VRAM, bus interface, ... for the 1080 ti vs. #LUTs, #BRAM, ... for the FPGA). So it would be really interesting to know how the authors put these metrics into relation and unified them into one vector.
- Which FPGA did the authors use? Since an FPGA is just freely programmable hardware, it would also be interesting to know which DNN accelerator design the authors implemented on the FPGA to perform their evaluation.
- Fig. 5: Why is MODNAS shown as an upper threshold line (+/- variance)? Based on the caption, I would have expected to see an HVI curve that stops after 24 evaluations, similar to the other curves in the plot.
Thank you for the detailed review of our work. We appreciate that you highlight different positive aspects of our work including our motivation to extend zero-shot NAS to the HW-aware settings, as well as appreciating our thorough experimental evaluation. We respond to each of your questions below.
...results shown are calculated using the estimated results from the "MetaPredictors"
We would like to provide a clarification here. While we indeed use the MetaPredictor to guide the differentiable search, the final hypervolume and Pareto-front results presented throughout the paper are actually computed using the true accuracy, latency and energy values corresponding to the different benchmarks [1,2,3,4] that we use for evaluation.
What assumptions do the MetaPredictors make about the underlying devices when estimating latency and energy requirements?
We use precomputed latency and energy benchmarks from other papers [1,2,3,4,5] that have already been peer-reviewed. The authors of these benchmarks carefully control for the number of allocated CPU cores to ensure consistent profiling across hardware devices (see HW-GPT-Bench [4] for example). Our MetaPredictor architecture itself makes no assumptions about the procedure followed in profiling these hardware metrics. However, appending such features to the hardware embedding could be an interesting way to improve the predictive performance of the MetaPredictor.
Could the approach proposed by the authors be extended to include memory?
We thank you for raising this important point. The short answer is: yes. MODNAS is agnostic to the type of hardware metric objective as long as we use a predictor to estimate the metric. Supporting a new objective like memory consumption amounts to replacing the existing latency/energy predictor with a GPU memory predictor. Following your suggestion, we are currently running experiments using MODNAS for memory-perplexity optimization on the GPT-L space from HW-GPT-Bench [4]. We will report back when this experiment finishes.
How exactly are "preference vector" and a "hardware device feature vector" defined?
- Preference vectors: During the training stage of MODNAS we sample a scalarization uniformly at random from the probability simplex (e.g. for 2 objectives a scalarization can be [0.24, 0.76])-- scalarizations are in [0,1] and their sum is 1. During inference, we sample 24 fixed points which are uniformly spaced on a circle (for 2 objectives) or a hypersphere (for >2 dimensions). Preference vectors are quantized ([0.24, 0.76] →[24, 76]) before being passed as input to the MetaHypernetwork.
- Hardware device embeddings: We use a simple scheme similar to HELP [5] to define the hardware embedding. We sample 10 fixed architectures from a search space, compute their hardware metric value (latency/energy) on a particular device and use this to compute the hardware embedding vector of size 10.
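For illustration, both inputs can be constructed roughly as follows (a simplified sketch; `reference_archs` and `measure_metric` are hypothetical placeholders for the 10 fixed architectures and the profiling routine):

```python
import numpy as np

def sample_preference(n_objectives=2):
    """Uniform sample from the probability simplex, quantized before the embedding layer."""
    r = np.random.dirichlet(np.ones(n_objectives))  # e.g. [0.24, 0.76]
    return np.round(r * 100).astype(int)            # e.g. [24, 76]

def hardware_embedding(reference_archs, measure_metric):
    """HELP-style device embedding: metric values of 10 fixed architectures on one device."""
    return np.array([measure_metric(arch) for arch in reference_archs])  # shape (10,)
```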
Which FPGA did the authors use?
For the experiments using latency and energy usage in FPGA, we utilize the precomputed values from HW-NAS-Bench [1], which describes the FPGA data collection procedure in their Appendix D.6. To quote the authors: “We then compile all the architectures using the standard Vivado HLS toolflow (Xilinx Inc., a) and obtain the bottleneck latency, the maximum latency across all sub-accelerators (chunks) of the architectures on a Xilinx ZC706 development board with Zynq XC7045 SoC (Xilinx Inc., b).”
Fig. 5: Why is MODNAS shown as an upper threshold line (+/- variance)?
Figure 5 depicts MODNAS as a single line because we run the search only once, and at the end of it we evaluate 24 architecture samples using the same MetaHypernetwork conditioned on hardware devices. In contrast, the other black-box methods based on accuracy and latency predictors sample a single new architecture at every optimization step, and they need to be run on every device independently since they do not incorporate hardware information during the search.
We would like to thank you again for reading our paper and your feedback. We hope that we were able to address all your concerns and that you will consider increasing your score after reading our responses. Otherwise, if you have additional questions, we are happy to follow up the discussion.
–References–
[1] Li, C., Yu, Z., Fu, Y., Zhang, Y., Zhao, Y., You, H., Yu, Q., Wang, Y., Hao, C. and Lin, Y., HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark. In International Conference on Learning Representations.
[2] Cai, H., Gan, C., Wang, T., Zhang, Z. and Han, S., Once-for-All: Train One Network and Specialize it for Efficient Deployment. In International Conference on Learning Representations.
[3] Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C. and Han, S., 2020, July. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7675-7688).
[4] Sukthanker, R.S., Zela, A., Staffler, B., Klein, A., Purucker, L., Franke, J.K. and Hutter, F., 2024. HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models. In 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Datasets and Benchmarks Track
[5] Lee, H., Lee, S., Chong, S. and Hwang, S.J., 2021. HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning. In 35th Conference on Neural Information Processing Systems, NeurIPS 2021 (pp. 27016-27028).
Thank you for answering my questions and for clarifying. So far, it has addressed my concerns. I am looking forward to the memory-perplexity optimization results to adjust my scoring.
Thank you very much for your response. We're pleased to know that our response addressed your concerns. As promised, we’ve updated the PDF by adding Appendix O and Figure 35 at the end, where we showcase the application of MODNAS for optimizing memory usage (using Bfloat16 precision and a context size of 1024) and perplexity on OpenWebText within the HW-GPT-Bench GPT-L search space, featuring models up to 774M parameters. Since memory usage does not depend on the device type, our approach does not utilize the MGD updates in Algorithm 1 for computing the common gradient descent direction, instead leveraging only preference vectors to calculate the scalarized objective. This highlights once again the flexibility of MODNAS across diverse settings, even ones it was not designed for. Despite this adjustment, MODNAS remains competitive, delivering a Pareto front comparable to leading black-box MOO baselines.
Given the additional experimentation with MODNAS's ability to also optimise memory and remain competitive with other MOO baselines, as well as the adjustments made in response to my and the other reviewer's concerns, the paper is now in an acceptable state from my perspective.
Thank you very much again for increasing your score and your detailed feedback on our work, which ultimately helped us enhance the quality and thoroughness of our paper and experiments.
This paper proposes a Neural Architecture Search (NAS) algorithm that optimizes several performance metrics on various hardware. The paper adopts Multi Objective Optimization (MOO) to optimize multiple objectives simultaneously. The key idea is to train a MetaHypernetwork which takes hardware features and user preferences about performance metrics as inputs and returns optimized network architectures as outputs. To train the MetaHypernetwork, the paper exploits a Supernetwork for NAS and MetaPredictors for estimating the efficiency objectives.
The MetaHypernetwork takes user preferences and device features, and returns network architecture designs. Then, the Architect turns those architecture designs into differentiable architecture parameters. With these architecture parameters, the Supernetwork computes the accuracy-related loss of the given network design, and the MetaPredictor computes the efficiency-related loss of the network architecture under the given hardware features. Further, the paper proposes to update the MetaHypernetwork using Multiple Gradient Descent (MGD) to get optimized network architectures that satisfy multiple objectives concurrently.
For NAS, the proposed method is adaptable to diverse pretrained Supernetworks. For efficiency, the MetaHypernetwork can be updated with many hardware-related loss functions at once. For optimization, the paper achieves Pareto-frontier solutions that are hard to reach with sequential or averaged gradient updates.
Strengths
- The paper proposes to execute one-shot NAS while searching for networks that satisfy multiple objectives regarding accuracy and hardware efficiency on target hardware.
- The paper analyzes the proposed method from various aspects, such as efficacy and robustness of the training process.
- The paper provides extensive experiments and visualizations to support the proposal and its analysis.
Weaknesses
- It may be hard to regulate the trade-off among user preferences with scalarization. Figure 4 can be a support, but it is just an abstract depiction, not experimental results.
- The proposed method can help search for optimized network architectures quickly and at low cost. However, network architectures identical or close to ground-truth solutions may be hard to reach with the proposed method, whereas other works can reach them at huge search costs.
Questions
- How are user preference and hardware device embeddings designed?
- Please explain how the Architect makes architecture parameters differentiable and propagates them to the MetaHypernetwork.
- In Figure 4, solutions on the Pareto frontier can be achieved by modulating user preferences. However, HDX [1], a one-shot NAS with hardware constraints, claims that modulating hyperparameters linearly doesn’t lead to linearly distributed results. Is the MetaHypernetwork free from this problem? Can real experimental results be plotted like Figure 4 to substantiate the integrity of the MetaHypernetwork?
- To search network architectures with other works, the paper sets the search time budget up to 2.5 times that of MODNAS (or a fixed time budget, e.g., 192 hours). That is, MODNAS outperforms prior works in terms of search time, and the quality of solutions is better than that of others under the given time budgets. What the reviewer wonders is whether other works can find near-GT solutions with a larger time budget. If a target network is distributed extensively and used frequently, huge search costs can be tolerable if there are better solutions.
[1] Deokki Hong, Kanghyun Choi, Hye Yoon Lee, Joonsang Yu, Noseong Park, Youngsok Kim, and Jinho Lee. Enabling Hard Constraints in Differentiable Neural Network and Accelerator Co-Exploration. In Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022.
Thank you for carefully reading our paper, your detailed feedback and the positive score. We are encouraged to see that you identify several positive aspects of the work. We address your concerns and respond to each of your questions below.
Regulating the trade-off among user preferences with scalarization: In Figure 4, solutions on the Pareto frontier can be achieved by modulating user preferences. However, in HDX [1], which is a one-shot NAS with hardware constraints, claims that modulating hyperparameters linearly doesn’t lead to linearly distributed results. Is MetaHypernetwork free from this problem? Can real experimental results be plotted like Figure 4 to substantiate the integrity of MetaHypernetwork?
Following your suggestion, we have now plotted the Pareto front with the respective preference vectors in the same plot (https://anonymous.4open.science/r/MODNAS-1CB7/pareto_rays.png). We utilize one of our runs on the NAS-Bench-201 test devices, namely Eyeriss. As seen in the plots, the preference vectors and the points on the Pareto-Front are very aligned with each other, hence substantiating the integrity of the MetaHyperNetwork. Moreover, we have added a new Appendix K in the updated PDF describing this experiment. Thank you very much for this suggestion. Ultimately, this experiment helped us make our case stronger by validating the integrity of the generated solutions.
”Network architectures identical or close to ground-truth solutions may be hard to reach with the proposed method, whereas other works can reach them at huge search costs.” and “What the reviewer wonders is whether other works can find near-GT solutions with a larger time budget?”
We agree with you that blackbox multi-objective optimizers can potentially reach the global Pareto front if compute resources are not a concern and given enough time; however, it is not practical to train or even evaluate these architectures, especially for larger model sizes (e.g. the Transformer spaces from HW-GPT-Bench). Sometimes in practice the user wants to get a quick estimation of the Pareto front, and this is the use-case where MODNAS shines. Given enough budget, even random search will find a near-optimal solution. For instance, in NB201 the size of the search space is K=15625 architectures. The theoretical number of random search steps needed to reach a success probability $p$ is approximately $n \approx K \ln\frac{1}{1-p}$; therefore, for random search to have a success probability of more than 0.5 it requires roughly $0.69K \approx 10{,}800$ iterations in theory. For the other guided search methods, this number is even smaller, though, similar to MODNAS, they have the same limitation that they can converge to a local minimum. Nevertheless, we conducted the same experiment as the one in Figure 3 in the paper, but this time with the baselines given 4x more budget than MODNAS. You can find the results in this link: https://anonymous.4open.science/r/MODNAS-1CB7/radar_hypervolume.pdf. We updated the paper with Appendix M containing this new result and the above discussion.
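For completeness, the iteration count quoted above can be checked with a short calculation, assuming the standard model of uniformly sampling until the single best of the K architectures is hit:

```python
import math

K = 15625                                   # NB201 search space size
p = 0.5                                     # target success probability
n = math.log(1 - p) / math.log(1 - 1 / K)   # smallest n with 1 - (1 - 1/K)^n >= p
print(round(n))                             # roughly 10,800 uniform random samples
```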
How are user preference and hardware device embeddings designed?
- Preference vectors: During the training stage of MODNAS we sample a scalarization uniformly at random from the probability simplex (e.g. for 2 objectives a scalarization can be [0.24, 0.76])-- scalarizations are in [0,1] and their sum is 1. During inference, we sample 24 fixed points which are uniformly spaced on a circle (for 2 objectives) or a hypersphere (for >2 dimensions). Preference vectors are quantized ([0.24, 0.76] →[24, 76]) before being passed as input to the MetaHypernetwork.
- Hardware device embeddings: We use a simple scheme similar to HELP [4] to define the hardware embedding. We sample 10 fixed architectures from a search space, compute their hardware metric value (latency/energy) on a particular device and use this to compute the hardware embedding vector of size 10.
Please explain how Architect makes architecture parameters differentiable and propagate them to the MetaHypernetwork.
We apologize for not making this clearer. We have updated the paper by adding Appendix N, where this is explained in detail. The idea is as follows:
Forward pass
1. The MetaHypernetwork parameterizes the unnormalized architectural distribution $\tilde{\alpha} = h_\phi(r, d)$, where $\phi$ are the MetaHypernetwork parameters, $r$ is the preference vector and $d$ the hardware embedding.
2. $\tilde{\alpha}$ is passed to the Architect, which does the following steps:
 a) Normalizes $\tilde{\alpha}$ and samples a one-hot (discrete) $\alpha$: $\alpha \sim \mathrm{Cat}(\mathrm{softmax}(\tilde{\alpha}))$.
 b) Sets the Supernetwork architectural parameters to the one-hot $\alpha$, resulting in a single subnetwork by masking the Supernetwork.
 c) Passes $\alpha$ as input to the MetaPredictor.
3. The Supernetwork and MetaPredictor do a forward pass using the training data (e.g., images) and the hardware embedding, respectively.
4. Compute the scalarized loss function $\mathcal{L}$.

The main problem now is that we cannot directly backpropagate the gradient computation through the Architect to update the MetaHypernetwork parameters, because the sampling from the Categorical distribution in step 2a above is non-differentiable. The Straight-Through Estimator (STE) [1,2] approximates the gradient for the discrete architectural parameters by ignoring this non-differentiable sampling operation as follows:

Backward pass
1. Compute the gradient of the loss with respect to the discrete architectural parameters $\alpha$: $\nabla_{\alpha}\mathcal{L}$.
2. Propagate this gradient back to $\phi$ (the MetaHypernetwork parameters) via the probability distribution $p = \mathrm{softmax}(\tilde{\alpha})$: STE backpropagates "through" a proxy that treats the non-differentiable sampling of $\alpha$ as an identity function (so that $\nabla_{p}\mathcal{L} \approx \nabla_{\alpha}\mathcal{L}$) and computes the gradient w.r.t. the MetaHypernetwork parameters as $\nabla_{\phi}\mathcal{L} \approx \nabla_{\alpha}\mathcal{L}\,\frac{\partial p}{\partial \tilde{\alpha}}\,\frac{\partial \tilde{\alpha}}{\partial \phi}$.
To recap, during the forward pass the Architect samples a discrete architecture from an architecture distribution parameterized by the MetaHypernetwork, and during backpropagation the STE is utilized to propagate back through the sampling operation to update the MetaHypernetwork parameters, hence the distribution where the discrete architectures in the next iteration will be sampled from. We hope that this makes the update procedure for the architectural parameters clearer. If you have additional questions we are happy to follow up on the discussion.
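Below is a minimal PyTorch-style sketch of this forward/backward flow. For readability it uses the plain straight-through trick; MODNAS uses ReinMax, a more accurate estimator, and the names and shapes here are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def sample_architecture(alpha_logits):
    """alpha_logits: unnormalized architecture distribution from the MetaHypernetwork,
    e.g. shape (6 edges, 5 ops) on NB201. Returns one-hot samples per edge whose backward
    pass flows through the softmax probabilities (straight-through estimator)."""
    probs = F.softmax(alpha_logits, dim=-1)                   # step 2a: normalize
    index = torch.multinomial(probs, num_samples=1)           # categorical sample per edge
    hard = torch.zeros_like(probs).scatter_(-1, index, 1.0)   # step 2b: one-hot alpha
    return hard + probs - probs.detach()                      # forward: hard; backward: probs
```

The returned one-hot tensor masks the Supernetwork and is fed to the MetaPredictor, while the `hard + probs - probs.detach()` trick makes the discrete sample behave like the differentiable probabilities during backpropagation.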
We would like to thank you again for the very detailed review. We hope that we were able to address all of your concerns and that you would consider increasing your score.
–References–
[1] Jang, E., Gu, S. and Poole, B., 2017, April. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017).
[2] Liu, L., Dong, C., Liu, X., Yu, B. and Gao, J., 2024. Bridging discrete and backpropagation: Straight-through and beyond. Advances in Neural Information Processing Systems, 36.
[3] Sukthanker, R.S., Zela, A., Staffler, B., Klein, A., Purucker, L., Franke, J.K. and Hutter, F., 2024. HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models. In 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Datasets and Benchmarks Track
[4] Lee, H., Lee, S., Chong, S. and Hwang, S.J., 2021. HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning. In 35th Conference on Neural Information Processing Systems, NeurIPS 2021
I thank the authors for their kind answers and additional experiments.
I believe that the paper satisfies my acceptance threshold.
Therefore, I decide to maintain the rating of acceptance.
Thank you very much again for a positive score and your detailed review, which helped us greatly improve our work.
The paper (MODNAS) presents an approach to Neural Architecture Search (NAS) that balances competing objectives—like performance, latency, and energy efficiency—across multiple hardware devices. By encoding user preferences as a scalarization vector, MODNAS efficiently searches for Pareto-optimal solutions across diverse devices in a single run. The method employs a hypernetwork to generate architectures for specific hardware configurations, leveraging multiple gradient descent (MGD) for optimization. The method has been evaluated on MobileNetV3 on ImageNet-1k, an encoder-decoder transformer space for machine translation, and a decoder-only space for language modeling.
Strengths
1-The method is adaptable to a range of devices by conditioning the hypernetwork on device embeddings, making it highly versatile for deployment on diverse hardware.
2-MODNAS is tested on various hardware devices and tasks, including image classification, machine translation, and language modeling, showcasing its applicability across multiple domains.
Weaknesses
1-Using Hypernetworks for NAS is well known but doesn’t seem a promising solution. It is like a heuristic solution.
2- I think energy and latency are not necessarily conflicting metrics.
Questions
1- Using hypernetworks for NAS and Pareto-frontier learning is well known.
That limits the novelty of the proposed method.
[A]Brock, A., Lim, T., Ritchie, J., and Weston, N. (2018). SMASH: One-shot model architecture search through hypernetworks. In the International Conference on Learning Representations
[B] Lorraine, Jonathan, and David Duvenaud. "Stochastic hyperparameter optimization through hypernetworks." arXiv preprint arXiv:1802.09419 (2018).
[C] Pan, Z., Liang, Y., Zhang, J., Yi, X., Yu, Y., and Zheng, Y. (2018). Hyperst-net: Hypernetworks for spatio-temporal forecasting. arXiv preprint arXiv:1809.10889
[D] Zhang, C., Ren, M., and Urtasun, R. (2019). Graph hypernetworks for neural architecture search. In International Conference on Learning Representations.
However, there are major concerns about the performance, initialization, and scalability as stated here (Section 6 in the following paper): Chauhan, Vinod Kumar, Jiandong Zhou, Ping Lu, Soheila Molaei, and David A. Clifton. "A brief review of hypernetworks in deep learning." Artificial Intelligence Review 57, no. 9 (2024): 250.
2- Hypernetworks do not perform well on unseen networks (https://arxiv.org/abs/2110.13100). With that, could you elaborate on any specific constraints or limitations when transferring MODNAS to devices significantly different from the ones used in the training phase?
3- MiLeNAS paper (CVPR’20) shows that the gradient errors caused by approximations of second-order methods in bi-level optimization results in suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. However, it seems that the authors still use a bi-level optimization technique. It seems using advanced techniques like MiLeNAS might be more useful.
4- Can MODNAS be applied to object detection tasks as well?
5- The MobileNet search space is a pretty small DNN space. How does it work on more complex DNN search spaces?
6- The paper mentions scalability but doesn’t mention potential bottlenecks. For example, are there search space complexities that MODNAS struggles with?
7- The paper doesn't provide the code. That limits reproducibility of the proposed method.
8- I am not sure how accurate is the energy model that the paper is using considering even Eyeriss is a pretty old paper (2016). The authors need to provide more details about energy modeling.
9- I think it is better to use power rather than energy. Considering energy is power x latency, when you minimize the latency, if the power is fixed, energy is minimized automatically. Considering power, accuracy, and latency can be a better metric.
10- The paper employs the Frank-Wolfe solver for optimizing scalarizations. Could you discuss any trade-offs in this choice, and if other optimizers were considered?
11- Given that preference vectors affect the optimization landscape, an analysis of how different preference configurations influence the final architectures would clarify MODNAS’s versatility.
12- MODNAS is tested for three objectives—accuracy, latency, and energy. Could this method scale effectively if additional objectives were introduced, or would this necessitate modifications?
13- The approach appears tailored to devices with GPU-like architectures. It would be valuable to understand how MODNAS performs on other architectures, like mobile hardware.
14- Minor:
A) The paper is pretty dense. I can hardly read the legends in Figures 7, 8.
B) Minor typo: "Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging" – A comma after "i.e." would improve readability: "i.e., finding a diverse set..."
"To search across devices, we frame the problem as a multi-task multi-objective optimization problem" – Add a comma, “multi-task, multi-objective optimization problem.”
We thank you for carefully reading our paper and your detailed feedback on our work. We appreciate your recognition of the ability of MODNAS to adapt to a wide variety of hardware deployment scenarios. Below, we address each of your questions:
“Using Hypernetworks for NAS is well known but doesn’t seem a promising solution. It is like a heuristic solution.” and “Using hypernetwork for the NAS and pareto-frontier learning is well known.”
Thank you for referring to the different works which adopt hypernetworks for NAS. However, we would like to point out the following key differences in the way [A], [B] and [D] use hypernetworks compared to MODNAS: [A], [B] and [D] propose amortizing the cost of architecture training in NAS by learning a hypernetwork to directly generate the weights of a given architecture [A, D] or hyperparameter configuration [B]. MODNAS instead generates only the parameters defining the distribution over architectures, i.e., the architecture parameters, which are only a handful, e.g., 30 for the NB201 supernetwork. Given the much lower dimensionality of the architecture distribution space compared to the network parameter space, the issues with scalability are greatly reduced. Furthermore, the hypernetwork itself is a simple embedding layer with very few parameters (e.g., 0.03MB for NB201), which makes optimizing and initializing it easier. Similarly, HyperST-Net [C] derives the parameter weights in a cascading manner from temporal and spatial characteristics. Again, since it generates network parameters, which are usually much larger in number, it faces the same issues as mentioned above.
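To make the size argument concrete, a hypernetwork of this kind can be as small as the following sketch; the layer sizes and the way the two inputs are combined are our simplification, not the exact MODNAS architecture.

```python
import torch.nn as nn

class TinyMetaHypernetwork(nn.Module):
    """Emits only the ~30 architecture-distribution parameters (6 edges x 5 ops on NB201),
    not the network weights, which is what keeps it tiny."""
    def __init__(self, n_edges=6, n_ops=5, hw_dim=10, pref_levels=101, hidden=32):
        super().__init__()
        self.pref_emb = nn.Embedding(pref_levels, hidden)  # quantized preference -> embedding
        self.hw_proj = nn.Linear(hw_dim, hidden)           # hardware feature vector -> hidden
        self.out = nn.Linear(hidden, n_edges * n_ops)
        self.n_edges, self.n_ops = n_edges, n_ops

    def forward(self, pref_idx, hw_embedding):
        h = self.pref_emb(pref_idx).mean(dim=0) + self.hw_proj(hw_embedding)
        return self.out(h).view(self.n_edges, self.n_ops)  # unnormalized architecture logits
```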
Hypernetwork is not performing well on unseen networks. https://arxiv.org/abs/2110.13100 With that, Could you elaborate on any specific constraints or limitations when transferring MODNAS to devices significantly different from the ones used in the training phase?
We refer you to the t-SNE plots (Figure 12 in the appendix) of computed hardware bank similarity vectors for the 19 devices (13 train and 6 test) on NB201. We observe that the embeddings indeed reflect that devices with similar properties are co-located. We study MODNAS on qualitatively different hardware devices (see Table 4 in the appendix for a complete list) during both the training and test (unseen) stages, and the transfer across different hardware devices does hold in practice. In addition, MODNAS shows good generalization performance even with a smaller subset of training devices (see Figure 25 in the appendix). We attribute this to the conditioning of the hypernetwork on the hardware embedding and the attentive device pool in the architecture of the MetaHypernetwork.
MiLeNAS paper (CVPR’20) shows that the gradient errors caused by approximations of second-order methods in bi-level optimization results in suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. However, it seems that the authors still use a bi-level optimization technique. It seems using advanced techniques like MiLeNAS might be more useful.
We use the Reinmax [1] method, which is a straight-through estimator achieving second-order accuracy by integrating second-order numerical methods. Reinmax does not require Hessian computation, thus having negligible computation overheads. As it is impractical to employ methods that do not use discrete architectural samples during the search, especially on larger search spaces such as the Transformer ones, we chose Reinmax (NeurIPS’23), a state-of-the-art gradient estimation method for discrete variables. Nevertheless, since the type of architecture optimizer we use is a modular component, Reinmax can be replaced with the mixed-level formulation of MileNAS in our configurable experimental framework. We conducted such an experiment and added the result of MODNAS + MiLeNAS in the plots of Figure 15 (Appendix), where we compare it to Reinmax and GDAS as well. Reinmax is still the best performing optimizer.
Can the MODNAS be applied to object detection tasks as well?
Yes, the MODNAS algorithm is versatile given an architecture search space and a set of hardware devices. In our paper we have selected benchmarks from previous works such as HELP [2], namely NB201, MobileNetV3 [3] for ImageNet classification, and a Transformer space for machine translation [4], where the ground truth latencies had already been profiled, a process that took significant effort from other researchers. We have also included the recent HW-GPT-Bench [5] for language modeling. If you are aware of any object detection benchmark which contains latency or energy measures on various hardware devices, we would be happy to include it in our experiment suite.
MobileNet search space is pretty small DNN. How does it work on more complex DNN search space?
In Figure 11 in the main paper, we show the performance of MODNAS on the GPT-S Transformer space (largest model containing 124M parameters) from the recent HW-GPT-Bench paper [5].
The paper mentions scalability but doesn’t mention potential bottlenecks. For example, are there search space complexities that MODNAS struggles with?
We discuss the limitations of our work in Section 5.
The paper doesn't provide the code. That limits reproducibility of the proposed method.
We apologize for the inconvenience. We had provided our complete codebase in the introduction (line 91-92), however it seems that the anonymous link had expired. We have updated it and will make the code public upon acceptance. We also provide the link here for completeness: https://anonymous.4open.science/r/MODNAS-1CB7/README.md.
I am not sure how accurate is the energy model that the paper is using considering even Eyeriss is a pretty old paper (2016). The authors need to provide more details about energy modeling.
For the experiments using energy as an objective we use the precomputed energy values from the respective papers, namely HW-NAS-Bench [6], HW-GPT-Bench [5] and HELP [2]. We only use our energy predictors to predict the ground truth values from these benchmarks.
I think it is better to use power rather than energy. Considering energy is power x latency, when you minimize the latency, if the power is fixed, energy is minimized automatically. Considering power, accuracy, and latency can be a better metric.
Thank you for the suggestion. Indeed, energy and latency are highly correlated in the benchmarks where they were precomputed. However, this high correlation might not necessarily hold across devices and model scales; e.g., in Figure 64 of HW-GPT-Bench [5], the authors observe that energy usage and latency on the H100 GPU have a Kendall-τ correlation coefficient of only 0.67. In a practical scenario, it makes more sense to use power as you suggested. We conducted the experiment with 3 objectives with the main motive of showcasing that MODNAS can be scaled to more than 2 objectives without any additional search costs.
The paper employs the Frank-Wolfe solver for optimizing scalarizations. Could you discuss any trade-offs in this choice, and if other optimizers were considered?
In Figure 6 of the main paper, we already demonstrate empirically how MGD compares to other gradient update strategies. Please see the “robustness of MGD” paragraph (lines 406-412) and Figure 6. As can be seen, MGD (red curve) performs the best.
Given that preference vectors affect the optimization landscape, an analysis of how different preference configurations influence the final architectures would clarify MODNAS’s versatility.
Thank you for this careful observation. In earlier experiments on NB201, we noticed that using random uniform samples of the preference vectors from the probability simplex, resulted in the highest hypervolume. Using different concentration coefficients in the Dirichlet distribution resulted in a worse final performance. In lines 236-238, we also briefly mention that it is possible to optimize/adapt the parameters of the Dirichlet distribution, however this would require differentiating through discrete samples from the distribution and would add another update step in the algorithm. Finally, in Appendix K of our updated paper we have added a section on “Alignment of preference vectors with pareto front”, where we plot the generated solutions from the MetaHypernetwork together with their respective preference vectors.
MODNAS is tested for three objectives—accuracy, latency, and energy. Could this method scale effectively if additional objectives were introduced, or would this necessitate modifications?
Yes, the search phase would only require additional forward passes through the predictors estimating the new objectives, which consist of small neural networks and have negligible inference costs. Since we use a scalarized objective in the algorithm (line 6), we only require a single backward pass regardless of the number of objectives. The only modification that needs to be done is to train a new MetaPredictor to approximate the new objective. This is done only once before the MODNAS search and is still cheap; for instance, it took 3h on a single RTX2080Ti GPU to train the latency and energy MetaPredictors.
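A rough sketch of what adding an objective means in code (the predictor interface is a placeholder, not the authors' exact API): each new objective is one extra forward pass through its MetaPredictor, and the scalarization keeps a single backward pass.

```python
def scalarized_loss(task_loss, arch, metric_predictors, preference):
    """preference: point on the simplex with one weight per objective (task loss first);
    metric_predictors: small networks mapping an architecture encoding to a predicted metric."""
    terms = [task_loss] + [predictor(arch) for predictor in metric_predictors]
    return sum(w * t for w, t in zip(preference, terms))  # one backward pass covers everything
```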
The approach appears tailored to devices with GPU-like architectures. It would be valuable to understand how MODNAS performs on other architectures, like mobile hardware.
We study MODNAS across a wide variety of devices. Our benchmarks contain other hardware platforms, including mobile hardware (Pixel3, Samsung devices), embedded devices (FPGA, Raspberry Pi) and CPU devices. See Table 4 in the appendix for a list.
Regarding minor points and the paper being dense.
We thank you for carefully reading our paper. We will increase the size of figures 7-8 and move some parts of the paper to appendix to make the paper more legible and clear for readers. We have also fixed the typos you pointed out in the updated version of the paper.
We would like to thank you again for your very detailed review. We hope that we were able to address all your concerns and that you will consider increasing your score after reading our responses. We are also happy to engage in further discussion if you have more concerns.
–References–
[1] Liu, L., Dong, C., Liu, X., Yu, B. and Gao, J., 2024. Bridging discrete and backpropagation: Straight-through and beyond. Advances in Neural Information Processing Systems, 36.
[2] Lee, H., Lee, S., Chong, S. and Hwang, S.J., 2021. HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning. In 35th Conference on Neural Information Processing Systems, NeurIPS 2021 (pp. 27016-27028).
[3] Cai, H., Gan, C., Wang, T., Zhang, Z. and Han, S., Once-for-All: Train One Network and Specialize it for Efficient Deployment. In International Conference on Learning Representations 2020.
[4] Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C. and Han, S., 2020, July. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7675-7688).
[5] Sukthanker, R.S., Zela, A., Staffler, B., Klein, A., Purucker, L., Franke, J.K. and Hutter, F., 2024. HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models. In 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Datasets and Benchmarks Track
[6] Li, C., Yu, Z., Fu, Y., Zhang, Y., Zhao, Y., You, H., Yu, Q., Wang, Y., Hao, C. and Lin, Y., HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark. In International Conference on Learning Representations.
Thank you for your responses and for providing more content.
A few more comments:
1-Regarding utilizing a hypernetwork: if the search space is small (i.e., 30 alpha parameters), why do we need to use a hypernetwork? Can’t we use simple heuristic, Bayesian, or evolutionary algorithms? What benefits do hypernetworks bring to the table?
2-I am not expecting the authors to provide results for object detection, but it would be great to have a discussion in the appendix on how to apply the proposed method to object detection and what challenges need to be solved. The following work might be useful: [X] Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices [Y] YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems [Z] Virtuoso: Energy- and Latency-aware Streamlining of Streaming Videos on Systems-on-Chips
3- I still believe that Eyeriss is an old work and the authors need to use more recent work/numbers with 2-4 nm technology size y to get a more accurate estimation. However, I do understand this is out of the scope of this paper. So, no need to take action on this.
4- I was referring to a different DNN NAS search space compared to the MobileNet search space, for example, the DARTS search space and not the GPT ones.
5- Section 5 doesn’t discuss possible search space complexities that MODNAS struggles with.
6- Regarding power, I think you need to use power which is independent of latency. When you are using latency, if the latency is reduced, the energy also will be reduced if the power is fixed. So, if you optimize for latency, energy also is optimized. Using power makes your problem more interesting to solve and as you mentioned has more practical application.
Thank you very much for your followup questions. We respond to each of your questions inline:
1-Regarding utilizing hypernetwork, if the search space is small ( i.e., 30 alpha parameters) why do we need to use hypernetwork? Can’t we use simple heuristic, bayesian, or evolutionary algorithms? What benefits do hypernetworks bring to the table?
We would like to clarify this point here. Let’s consider NB201 (the smallest space we consider) for simplicity. This space has 15,625 possible architectures, which corresponds to 5^6. Thus, while the output space of the hypernetwork is 6x5 (30 architecture parameters are predicted), corresponding to 6 edges and 5 operation choices, we can sample 5 possible operation choices for every edge using the discrete sampler ReinMax, resulting in 15,625 (5^6) architecture choices.
Furthermore, we want to refer you to the discussion on the search complexity in Section 4.4. One major benefit that our MODNAS pipeline has (including the hypernetwork) is the ability to generate a Pareto front on multiple devices and objectives in just a single search run. In the case of other blackbox heuristics such as BO or ES, the search phase needs to be conducted multiple times on individual devices since there is no hardware-specific information being given to the algorithm during search. Furthermore, these algorithms normally rely on ground truth architecture evaluations and can be extremely inefficient even in small search spaces, something that is not the case for MODNAS.
2-I am not expecting the authors to provide results for object detection, but it would be great to have a discussion in the appendix on how to apply the proposed method in object detention and what challenges need to be solved. The following work might be useful: [X] Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices [Y] YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems [Z] Virtuoso: Energy- and Latency-aware Streamlining of Streaming Videos on Systems-on-Chips
Thank you for pointing us to these relevant papers benchmarking object detection on different hardware devices. We agree that object detection is a very important application since it is probably one of the most relevant use cases of neural networks on embedded devices (e.g. in self-driving cars). As per your suggestion, we have now included a discussion on object detection as a potential application in Appendix P of the paper, where we also refer to the papers you pointed us to.
3- I still believe that Eyeriss is an old work and the authors need to use more recent work/numbers with 2-4 nm technology size y to get a more accurate estimation. However, I do understand this is out of the scope of this paper. So, no need to take action on this.
We share the same opinion as you here. Given the rapid advancements in hardware devices for deep learning, benchmarks do get outdated quickly. In this work, we simply rely on previous hardware benchmarks, some of which (e.g. HW-NAS-Bench) can be older and some newer, such as HW-GPT-Bench. Our main goal while choosing these benchmarks was to showcase the ability of MODNAS to work on 1) different search spaces, e.g., convolutional, transformer-based; 2) different tasks, e.g., machine translation, image classification, language modeling; 3) different objectives, e.g., latency, energy usage, perplexity, accuracy, and now memory usage as well. We believe that what you mention is an important call for developing “evolving hardware benchmarks”, which are continuously adapted as new hardware becomes available.
4- I was referring to different DNN NAS compared to the MobileNet search space. For example, DARTS search space and not GPT ones.
Thank you for clarifying this. We are not aware of a benchmark that includes hardware metrics on the DARTS search space and would be happy to include an experiment on this search space if there exists one including architecture hardware metrics profiled across a variety of devices. We do not foresee any issues with MODNAS struggling on such cell-based search spaces since the NB201 is also cell-based (though with a fixed cell topology) and MODNAS works very reliably there. Moreover, the Supernetwork that MODNAS uses was originally developed on such cell-based search spaces.
5- Section 5 doesn’t discuss possible search space complexities that MODNAS struggles with.
Thank you for raising this important point. We will update Section 5 in the paper with the following (after the rebuttal, since we need to restructure the paper to incorporate all the feedback from this discussion too): “We also want to mention some search space complexities that MODNAS can potentially struggle with. One instance can be very large search spaces such as Einspace [1], wherein weight sharing or entanglement cannot be exploited directly since one cannot fit a Supernetwork that contains all the architectures on a single GPU. This might require either supernetwork model parallelism across GPUs or the usage of other performance proxies instead of the supernetwork. We leave such avenues for future work.”
6- Regarding power, I think you need to use power which is independent of latency. When you are using latency, if the latency is reduced, the energy also will be reduced if the power is fixed. So, if you optimize for latency, energy also is optimized. Using power makes your problem more interesting to solve and as you mentioned has more practical application.
We agree with your point here, however, in the benchmarks we utilized in our experiments, we rely on previously evaluated benchmarks where the respective authors have measured energy and latency given a fixed power. We cannot use power in these benchmarks unfortunately. You raise an important point here though, suggesting a shift of focus from energy to power when developing a hardware-aware benchmark in the future.
We thank you again for your very detailed review and your followup questions. We hope that we were able to sufficiently address your followup questions and that you will consider increasing your score. We are also happy to engage in further discussion if you have more questions.
[1] Ericsson, L., Espinosa, M., Yang, C., Antoniou, A., Storkey, A., Cohen, S.B., McDonagh, S. and Crowley, E.J., 2024. einspace: Searching for Neural Architectures from Fundamental Operations. arXiv preprint arXiv:2405.20838.
Thank you for providing answers to my questions and doubts. I will raise my score.
We are really happy that we have addressed your concerns and that you will increase your score (which you might have forgotten to do since it is not updated in OpenReview). Ultimately, your feedback was very helpful and made our submission stronger. Thank you very much.
We would like to thank all the reviewers for reading our paper and their insightful feedback. We also appreciate the positive average score. As a general response, we want to emphasize the following main changes in the submission PDF (highlighted in blue):
- Appendix J: additional discussion on the robustness of MODNAS (addressing Reviewer KeBz)
- Appendix K: alignment of preference vectors with the Pareto front (addressing Reviewers ugEA and ei5k)
- Appendix L: training and validation loss curves (addressing Reviewer KeBz)
- Appendix M: Multi-objective optimization baselines with more budget (addressing Reviewer ugEA)
- Appendix N: Additional Details on the Architect (addressing Reviewer ugEA)
- Updated Figure 15 with the MODNAS + MiLeNAS baseline (addressing Reviewer ei5k)
We hope that after reading our responses your concerns will be addressed and you will consider increasing your scores. Thank you very much for your time.
As the discussion period closes soon, we would like to again thank all the reviewers for their thorough evaluation of our work and their active engagement during the discussion phase. Their valuable feedback has significantly enhanced both the clarity of our paper and the depth of our experimental analysis. We also appreciate the overall positive reception reflected in the scores of our paper.
We would like to highlight the following additional updates, which are incorporated into the revised version of our paper (marked in blue):
- Appendix O: New experiments analyzing perplexity and memory usage objectives.
- Appendix P: Discussion on the applicability of our approach to the object detection task.
Moreover, as suggested by reviewer ei5k, we will refine Section 5 of the paper to include a discussion on potential search space complexities.
If there are any further questions, we remain available for discussion. Thank you once again for your thoughtful contributions.
This paper presents a hypernetwork-based method for hardware-aware neural architecture search and demonstrates zero-shot generalization to new devices. The strengths of this paper include the detailed experiments that tested the proposed MODNAS on 19 hardware devices and showed good performance. The main weaknesses of this paper include that the details of the approach were not initially very clear and that reviewers generally do not favor hypernetwork solutions to NAS. After the rebuttal, the missing details were addressed, as can be seen in the edits of the appendix. All the reviewers ended up with a positive score for this paper.
Additional Comments from the Reviewer Discussion
During the discussion period, the authors adequately addressed the reviewers' comments. Specifically, they added five sections to the appendix. The reviewer acknowledged the changes and adjusted their score.
Accept (Poster)