PaperHub
Overall score: 7.8 / 10
Poster · 4 reviewers
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4)
Confidence: 2.8
Novelty: 2.8 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We propose a graph-based unsupervised pre-training method for neural architecture encoding.

Abstract

Keywords
Graph representation learning · Neural architecture encoding · Self-supervised learning

Reviews and Discussion

Review (Rating: 5)

The authors propose FGP, a pretraining strategy for neural architecture encoders used in NAS. The proposed strategy captures flow-based information for predicting neural network performance while being computationally cheaper. The authors use a proxy flow-based objective to achieve this and show improved downstream performance with their method on predicting neural network performance as well as finding better neural architectures.

Strengths and Weaknesses

Strengths

  1. The proposed FGP method predicts neural network performance via a flow-based objective that mimics the forward and backward passes with graph-based steps.
  2. The experiments are well structured and highlight the effectiveness of FGP in predicting neural network performance as well as finding better architectures.

Weaknesses

  1. While a ResGatedGCN with FGP outperforms FlowerFormer in Fig 8a, training with FGP increases the encoding time of the ResGatedGCN significantly (from $10^2$ to $10^3$), presumably due to the flow-based objective, which suggests that the flow-based objective still introduces significant pretraining overhead compared to other non-flow-based methods. However, in Fig 8b, FGP is faster than other pretraining methods. It is unclear which pretraining method has been compared for the ResGatedGCN in Fig 8a and how that differs from the baselines in Fig 8b. Overall, the time taken for pretraining with FGP seems to increase significantly over other pretraining methods. It would be nice to have a Pareto curve of performance vs. time taken for each pretraining method to highlight the efficiency of FGP.
  2. While FGP is claimed to capture the flow-based information implicitly, it is unclear how combining FGP with FlowerFormer is able to increase performance. It would suggest that the FGP objective adds information other than the flow to training; can the authors ablate this effect?

The empirical evaluations of the proposed method show that the flow-based objective can improve encoding performance, but it seems to come at additional cost. The exact tradeoffs between performance and efficiency, specifically due to the flow-based objective, could be made clearer by the authors.

Questions

See above

Limitations

The flow training objective seems to be limited to GNN-based encoders, which can be computationally expensive. Perhaps, as a more general direction, is there an extension to transformer-based encoders, which could be more efficient?

Justification for Final Rating

The authors provide additional experiments distinguishing the effects of FGP pre-training on FlowerFormer, which was one of my main concerns. They have also explained their computation-time analysis in their response.

Formatting Issues

None

Author Response

Dear Reviewer zBpp,

We deeply appreciate your invaluable feedback on the further investigations of computation speed and backbone neural architecture encoders.

Below, we provide detailed responses to each of your comments.

Best regards,

The Authors.


Weakness 1.1

While a ResGatedGCN with FGP outperforms FlowerFormer in Fig 8a, training with FGP increases the encoding time of the ResGatedGCN significantly (from $10^2$ to $10^3$), presumably due to the flow-based objective, which suggests that the flow-based objective still introduces significant pretraining overhead compared to other non-flow-based methods.

However, in Fig 8b, FGP is faster than other pretraining methods. It is unclear which pretraining method has been compared for the ResGatedGCN in Fig 8a and how that differs from the baselines in Fig 8b.

Response 1.1

We apologize for the confusion. Our response is three-fold, and we will revise the manuscript by adding these details.

  • [Clarification on Figure 8a] The X-axis values in Figure 8a represent the total time required for both architecture processing (encoding) and the pre-training phase using our method (FGP). Therefore, (1) the additional time observed should not be interpreted solely as increased architecture processing time, and (2) other pre-training methods are not used in Figure 8a.
    • [Detail 1] For ‘ResGatedGCN (non-flow-based) w/o FGP’ and ‘FlowerFormer (flow-based) w/o FGP’, the X-axis values reflect only the architecture processing time for ResGatedGCN [1] and FlowerFormer [2], respectively.
    • [Detail 2] In contrast, for ‘ResGatedGCN (non-flow-based) w/ FGP’ and ‘FlowerFormer (flow-based) w/ FGP’, the X-axis values include both the pre-training time and the architecture processing time. Thus, the time gap between the w/ FGP and w/o FGP settings reflects the cost of pre-training, not additional architecture processing time.
  • [Clarification on Figure 8b] Figure 8b presents the pre-training time required for the ResGatedGCN architecture under each pre-training method. It shows that FGP achieves a pre-training time comparable to that of the fastest baseline method.
    • [Detail] Compared to the fastest pre-training method (ZC-Proxy [3]), our method (FGP) requires only 10 additional seconds to train 14K architectures over 200 epochs—a relatively minor overhead in practice. Despite this marginal overhead, FGP outperforms ZC-Proxy in 26 out of 27 settings.
    • [Note] The sum of the pre-training time for FGP in Figure 8b and the architecture processing time for ResGatedGCN w/o FGP in Figure 8a is equivalent to the time for ResGatedGCN w/ FGP in Figure 8a.
  • [Speed clarification] Pre-training with FGP does not introduce additional overhead in architecture processing. Moreover, while it requires some extra pre-training time, this cost amounts to only a few minutes and results in substantial performance gains, as detailed in our response to Weakness 1.2.
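In symbols, the note above amounts to the following identity (the $t$ notation is ours, not from the manuscript):

$$
t^{\text{(Fig. 8a)}}_{\text{ResGatedGCN w/ FGP}} \;=\; t^{\text{(Fig. 8b)}}_{\text{FGP pre-train}} \;+\; t^{\text{(Fig. 8a)}}_{\text{ResGatedGCN w/o FGP}} .
$$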

Weakness 1.2

Overall, the time taken for pretraining with FGP seems to increase significantly over other pretraining methods. It would be nice to have a Pareto curve of performance vs. time taken for each pretraining method to highlight the efficiency of FGP.

The empirical evaluations of the proposed method show that the flow-based objective can improve encoding performance, but it seems to come at additional cost. The exact tradeoffs between performance and efficiency, specifically due to the flow-based objective, could be made clearer by the authors.

Response 1.2

Our response is three-fold.

  • [Comparison with baselines] Our method (FGP) achieves a pre-training time comparable to that of the fastest baseline method.
    • [Detail] Compared to the fastest pre-training method (ZC-Proxy [3]), our method (FGP) requires only 10 additional seconds to train 14K architectures over 200 epochs—a relatively minor overhead in practice. Despite this marginal overhead, FGP outperforms ZC-Proxy in 26 out of 27 settings.
  • [Experiments] As detailed in Appendix B.5 of our manuscript, our proposed method (FGP) outperforms the strongest baseline methods (ZC-Proxy and GMAE [4]) under a fixed pre-training time constraint.
    • [Setting] We fix the pre-training time for each method and use ResGatedGCN [1] as the backbone neural architecture encoder. All other experimental settings follow those described in Section 4.2.
    • [Results] As shown in Tables 1 and 2, FGP consistently outperforms all baseline methods across all the settings.
  • [Implication] These results indicate that substantial performance gains can be achieved with just a few minutes of pre-training.

Table 1. Performance prediction results on the NAS-Bench-101 dataset under diverse fixed pre-training times.

| Pre-training time (secs.) | 0 | 100 | 200 | 300 | 400 | 500 |
| --- | --- | --- | --- | --- | --- | --- |
| GMAE | 65.0 (7.8) | 65.7 (6.1) | 66.6 (6.8) | 67.2 (7.4) | 67.8 (4.6) | 68.0 (4.9) |
| ZC-Proxy | 65.0 (7.8) | 67.9 (5.3) | 68.8 (5.3) | 69.2 (4.7) | 69.0 (4.4) | 68.9 (4.8) |
| FGP (Ours) | 65.0 (7.8) | 71.8 (4.8) | 73.2 (3.9) | 74.0 (4.0) | 74.7 (4.4) | 74.8 (4.3) |

Table 2. Performance prediction results on the NAS-Bench-201 dataset under diverse fixed pre-training times.

| Pre-training time (secs.) | 0 | 100 | 200 | 300 | 400 | 500 |
| --- | --- | --- | --- | --- | --- | --- |
| GMAE | 73.4 (1.5) | 74.0 (1.6) | 74.4 (1.6) | 74.8 (1.2) | 74.6 (1.5) | 74.8 (1.9) |
| ZC-Proxy | 73.4 (1.5) | 76.3 (1.7) | 79.5 (1.2) | 79.3 (1.4) | 79.6 (1.2) | 79.8 (1.3) |
| FGP (Ours) | 73.4 (1.5) | 79.8 (1.1) | 81.4 (1.1) | 82.0 (1.0) | 82.2 (0.4) | 82.2 (0.7) |

Weakness 2

While FGP is claimed to capture the flow-based information implicitly, it is unclear how combining FGP with FlowerFormer is able to increase performance. It would suggest that the FGP objective adds information other than the flow to training; can the authors ablate this effect?

Response 2

Our response is two-fold.

  • [Hypothesis] We hypothesize that FGP enables FlowerFormer to learn a broader range of information flow beyond what is present in labeled architectures by leveraging a large number of unlabeled architectures, thereby enhancing its ability to capture more diverse flow patterns.
    • [Detail] When trained solely on labeled samples, FlowerFormer tends to learn information flow patterns limited to that specific set. In contrast, FGP leverages a much larger pool of neural architectures whose ground-truth performance is unknown, exposing FlowerFormer to a wider range of information flow patterns. This exposure enables FlowerFormer to learn more diverse and generalizable information flow representations.
  • [New experiments: Empirical evidence] Our hypothesis is empirically supported by our new experiments, which reveal that the performance of FlowerFormer tends to be positively correlated with the size of the pre-training dataset.
    • [Intuition] By varying the number of neural architectures used during FGP pre-training, we aim—though not precisely—to control the diversity of information flow patterns to which FlowerFormer is exposed. A positive correlation between dataset size and FlowerFormer performance suggests that FGP helps FlowerFormer encounter a broader range of information flow patterns, leading to improved accuracy in performance prediction.
    • [Setting] We vary the proportion of the pre-training dataset used for FGP training—0% (no pre-training), 20%, 40%, 60%, 80%, and 100%—and evaluate FlowerFormer’s performance on the performance prediction task. Kendall’s Tau is used as the evaluation metric, and all other settings follow those described in Section 4.1 of our manuscript.
    • [Results] As shown in Table 3, FlowerFormer’s performance tends to improve with larger pre-training dataset sizes, providing empirical support for our hypothesis.

Table 3. Performance prediction results of FlowerFormer under varying sizes of the pre-training dataset.

| % of the dataset used for pre-training | 0% (no pre-training) | 20% | 40% | 60% | 80% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| NAS-Bench-101 | 74.0 (3.6) | 74.6 (3.8) | 75.9 (1.9) | 75.8 (4.0) | 76.2 (5.1) | 76.3 (3.6) |
| NAS-Bench-201 | 77.3 (1.5) | 81.1 (0.8) | 82.3 (1.2) | 82.4 (1.3) | 82.7 (1.4) | 83.5 (1.7) |

References

  • [1] Residual Gated Graph ConvNets
  • [2] FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
  • [3] Dynamic Ensemble of Low-Fidelity Experts: Mitigating NAS “Cold-Start”
  • [4] Graph Masked Autoencoder Enhanced Predictor for Neural Architecture Search
Comment

I thank the authors for their detailed response.

The additional time analysis and the experiments showing that FGP allows FlowerFormer to learn a broader range of information flow answer my questions about FGP. I have raised my score accordingly.

Comment

Dear Reviewer zBpp,

We sincerely appreciate your thoughtful acknowledgement of our additional analysis, as well as your consideration in adjusting the review score accordingly.

Thank you also for your constructive feedback on our work. We will carefully incorporate your comments, along with the additional analysis, into our revised manuscript.

Warm regards,

The Authors of Submission 11597

Review (Rating: 4)

This paper aims to predict the performance of neural architectures effectively and efficiently. Specifically, the flow surrogate is proposed to simulate the ground-truth information flow, which helps to train the architecture encoder effectively. Then, the trained encoder can be used for the performance prediction of most of the neural architectures.

Strengths and Weaknesses

Weaknesses:

  1. There are some typos in this paper. Line 23: 'extensive' computational resources.

  2. The insight of this paper is not solid enough. Proof that the 'flow surrogate' can effectively represent the information flow is lacking. Moreover, after reviewing the details of the generation of the 'flow surrogate', there is no constraint that ensures different information surrogates for two architectures with largely different performance, especially when the inputs are randomly initialized.

  3. There is no clear explanation of why the current flow-based methods are computationally heavy. Given the inference time of GNNs, it is hard to understand why they take such a long computation time.

  4. Figure 3 is confusing. In the middle sub-figure of Process 1, why is the 'In' information directly used for the computation of the 'Out', even though they are not directly connected? Does this mean that all historical information will be used when computing the high-level information? Will this cause a requirement of large memory?

  5. The sum pooling over multiple incoming messages is not reasonable, since the summation cannot distinguish the incoming sources, which is too rough for simulating the real information flow.

Questions

Please refer to the Weaknesses.

Limitations

Yes.

Justification for Final Rating

Please ensure that the clarifications from the rebuttal are added in the final version.

Formatting Issues

None.

Author Response

Dear Reviewer VhSE,

We sincerely appreciate your valuable feedback on the flow surrogate analysis and presentation.

Below, we provide detailed responses to each of your comments.

Best regards,

The Authors


Weakness 1

There are some typos in this paper. Line 23: 'extensive' computational resources.

Response 1

Thank you for pointing out the typos. We will revise the manuscript to address all identified typos, including the case you mentioned (replacing ‘extensive’ with ‘expensive’).


Weakness 2.1

The insight of this paper is not solid enough. Proof that the 'flow surrogate' can effectively represent the information flow is lacking.

Response 2.1

  • [New preliminary analysis] While a rigorous proof is challenging, we present a new analysis showing that our flow surrogate can distinguish neural architectures with distinct operations, resulting in different patterns of information flow.
    • [Setup] We conduct three binary classification tasks to determine whether a given architecture includes: (1) a 1×1 conv. operation, (2) a 3×3 conv. operation, and (3) a pooling operation. We use flow surrogate representations as input features and train an MLP as the classifier, performing 10 trials. The architectures are split into 80%/20% for training/testing.
    • [Result] As shown in Table 1, our flow surrogate achieves over 92% accuracy across all cases, indicating its ability to distinguish architectures with different operations that give rise to distinct patterns of information flow (a minimal sketch of this probing setup follows Table 1).

Table 1. Operation classification results.

| | 1×1 Conv | 3×3 Conv | Pooling |
| --- | --- | --- | --- |
| NB-101 | 93.2 (0.3) | 99.9 (0.2) | 92.1 (0.7) |
| NB-201 | 98.9 (0.3) | 99.9 (0.1) | 100.0 (0.0) |
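As referenced above, here is a minimal sketch of the probing setup. It assumes the flow surrogates are already computed; the feature shapes, labels, and classifier settings are illustrative placeholders, not the exact experimental configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder inputs: one flow-surrogate vector per architecture, plus a binary
# label marking whether the architecture contains the target operation
# (e.g., a 3x3 convolution). Real features come from the flow-surrogate computation.
surrogates = np.random.randn(1000, 64)
has_op = np.random.randint(0, 2, size=1000)

accuracies = []
for seed in range(10):  # 10 trials, as in the setup above
    X_tr, X_te, y_tr, y_te = train_test_split(
        surrogates, has_op, test_size=0.2, random_state=seed)  # 80%/20% split
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500,
                        random_state=seed).fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

print(f"{100 * np.mean(accuracies):.1f} ({100 * np.std(accuracies):.1f})")
```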

Weakness 2.2

Moreover, after reviewing the details of the generation of the 'flow surrogate', there is no constraint that ensures different information surrogates for two architectures with largely different performance, especially when the inputs are randomly initialized.

Response 2.2

We acknowledge the reviewer’s concern that the flow surrogate may lack a strong constraint to effectively distinguish architectures with distinct performance. However, we offer evidence suggesting that our flow surrogate can make such distinctions. Our response is two-fold.

  • [Effectiveness of flow] We claim that the ability to distinguish architectures based on their information flow allows us to differentiate between high- and low-performing architectures.
    • [Evidence 1] As shown in Fig. 6 of our paper, our flow surrogates effectively separate high- and low-performing architectures in the embedding space, even without any knowledge of the architecture performance.
    • [Evidence 2] As detailed in Sec. 1 and 2 of our paper, SOTA architecture encoders designed to capture information flow [1, 2] have shown remarkable success in performance prediction tasks, suggesting that the flow is a key feature for accurately estimating architecture performance.
  • [Effectiveness of random vectors] We also claim that random vectors can still effectively capture information flow.
    • [Evidence 1] Prior work [3] shows that random vectors can yield meaningful representations when combined with adequate computational structures (e.g., convolutional layers). Similarly, our propagation mechanism enables random vectors to capture the flow.
    • [Evidence 2] Compared to learning-based alternatives, our flow surrogate outperforms architecture embeddings derived from the supervised-trained flow-aware encoder TA-GATES [2] (see Table 2). See Appendix B.9.2 for details.
    • [Evidence 3] Note that our method is robust to the choice of random vectors, with different initializations consistently yielding similar performance. See Appendix B.9.1 for details.

Table 2. Comparison with a learning-based alternative.

| Metric: Kendall Tau | Learning-based | Random-vector-based (ours) |
| --- | --- | --- |
| NB-101 | 72.2 (5.6) | 74.8 (4.8) |
| NB-201 | 74.7 (1.4) | 82.2 (0.7) |

Weakness 3

There is no clear explanation of why the current flow-based methods are computationally heavy. Given the inference time of GNNs, it is hard to understand why they take such a long computation time.

Response 3

  • [Clarification] Sorry for the missing detail. The computational burden of flow-based encoders comes from their asynchronous message passing, which is difficult to parallelize. We will elaborate on these details in the revised manuscript.
    • [Mechanism] To capture information flow, they sequentially pass messages according to the topological order of nodes in a directed acyclic graph (DAG), instead of passing messages between all linked node pairs at once.
    • [Detail 1] As detailed in Sec. 3.2.1, each node is assigned a topological order based on its position in the DAG. Nodes with no incoming edges (typically the input node) are assigned order 1. Nodes with incoming edges only from order-1 nodes are assigned order 2, and nodes with incoming edges only from order-1 and -2 nodes are assigned order 3, and so on.
    • [Detail 2] Message passing begins at the order-1 nodes, which send messages to order-2 nodes, where the messages are updated. These updated messages are then moved to order-3 nodes. The process continues through the DAG, passing messages to later-order nodes until reaching the highest-order node (typically the output node). The process is then reversed: messages are propagated backward through the graph until they reach the input node again. This process constitutes a single message-passing layer in the flow-based encoder. Note that this process is hard to parallelize, as each sub-step depends on the output of the preceding one.
    • [Summary] Thus, unlike typical GNNs that perform message passing in a single step per layer, flow-based encoders need multiple steps per layer, proportional to twice the length of the longest path from the input node to the output node, which is hard to parallelize. A schematic sketch of this scheme follows.
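The following minimal sketch illustrates the asynchronous scheme described above. It is schematic Python: the `update` function and the graph containers are our own placeholders, not the actual encoder implementation.

```python
from collections import defaultdict

def topological_orders(nodes, preds):
    """Order 1 for nodes with no incoming edges, then 1 + max predecessor order.
    Assumes `nodes` is already topologically sorted (valid for a DAG)."""
    order = {}
    for v in nodes:
        order[v] = 1 + max((order[u] for u in preds[v]), default=0)
    return order

def flow_message_passing(nodes, preds, succs, h, update):
    """One 'layer' of a flow-based encoder: a forward sweep over increasing
    topological orders, then a backward sweep. Each sub-step consumes the
    output of the previous one, which is why it is hard to parallelize."""
    order = topological_orders(nodes, preds)
    by_order = defaultdict(list)
    for v in nodes:
        by_order[order[v]].append(v)
    for k in sorted(by_order):                 # forward: input -> output
        for v in by_order[k]:
            h[v] = update(h[v], [h[u] for u in preds[v]])
    for k in sorted(by_order, reverse=True):   # backward: output -> input
        for v in by_order[k]:
            h[v] = update(h[v], [h[u] for u in succs[v]])
    return h
```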

Weakness 4

Figure 3 is confusing. In the middle sub-figure of Process 1, why is the 'In' information directly used for the computation of the 'Out', even though they are not directly connected? Does this mean that all historical information will be used when computing the high-level information? Will this cause a requirement of large memory?

Response 4

Sorry for the confusion. Our response is three-fold, which we will elaborate on in the revised manuscript.

  • [Figure clarification] We represent deactivated computations using grey-colored arrows. Thus, when updating the embedding of a given node, only the embeddings of nodes directly connected via incoming edges are used; others are not used.
    • [Detail] The middle sub-figure of Process 1 shows the computation for updating the embeddings of ‘1×1’ and ‘3×3’ operations. Here, the embeddings of the respective operations and the ‘in’ node are used, while the ‘out’ node is not involved.
  • [Embedding clarification] We do not store all historical embeddings. In the forward pass, order-k nodes’ embeddings are updated using only order-(k-1) and current order-k nodes’ embeddings; in the backward pass, only order-(k+1) and order-k nodes’ embeddings are used.
    • [Detail] Node ordering is detailed in our Response 3 and Sec. 3.2.1 of our paper.
  • [New experiments: Memory & time] Our flow-surrogate computation requires less than 10 MB of memory and takes under 1 minute to process 14K architectures.
    • [Setup] We measure the difference in memory usage before and after the computation, along with the execution time.
    • [Result] As shown in Table 3, our method processes 14K architectures (# of architectures in NB-101 and 201) in under 1 minute while using less than 10 MB of memory.

Table 3. Memory and time consumption in obtaining flow surrogates.

| | Memory usage (MB) | Time consumption (sec.) |
| --- | --- | --- |
| NB-101 | 9.2 | 31 |
| NB-201 | 9.4 | 32 |

Weakness 5

The sum pooling over multiple incoming messages is not reasonable, since the summation cannot distinguish the incoming sources, which is too rough for simulating the real information flow.

Response 5

Our response is two-fold.

  • [Clarification] While sum aggregation (pooling) is permutation-invariant, our flow surrogate computation mechanism allows the distinction between different sources of incoming information.
    • [Goal] We need to distinguish information from different sources when each source involves distinct operations (since sources with identical operations yield indistinguishable information, we do not attempt to differentiate them).
    • [Detail 1] As detailed in Sec. 3.2, messages (i.e., vector representations of incoming information) are transformed based on the operations they pass through, with each operation type having its own projection function. Thus, messages that have undergone different operations are likely to be distinct.
    • [Detail 2] Therefore, since messages are sufficiently distinct when originating from different operations, the aggregation function does not need to explicitly account for the specific operation each message has undergone (see the sketch after Table 4).
  • [New experiments: Alternatives] We show that sum aggregation outperforms other commonly used aggregation functions in GNNs.
    • [Setup] We use two variants of the flow surrogate in which the sum aggregation is replaced with other common aggregations in GNNs: mean and max. Other settings are the same as in Sec. 4.2.
    • [Result] As shown in Table 4, our method that uses sum aggregation outperforms its alternatives.

Table 4. Comparison with other aggregations.

| Metric: Kendall Tau | Sum (ours) | Mean (variant 1) | Max (variant 2) |
| --- | --- | --- | --- |
| NB-101 | 74.8 (4.8) | 71.6 (5.8) | 71.1 (6.4) |
| NB-201 | 82.2 (0.7) | 81.7 (1.0) | 81.5 (0.9) |
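As referenced in the clarification above, here is a minimal sketch of per-operation projection followed by sum aggregation. The matrices, dimension, and operation names are illustrative placeholders, not the actual projection functions of Sec. 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# One projection per operation type: messages that pass through different
# operations are transformed differently, so they remain distinguishable
# even after a permutation-invariant sum.
W = {op: rng.standard_normal((DIM, DIM)) for op in ("conv1x1", "conv3x3", "pool")}

def project(op, msg):
    # Placeholder per-operation projection with a nonlinearity.
    return np.tanh(W[op] @ msg)

def aggregate(incoming):
    """Sum-pool messages from incoming edges; `incoming` holds (source_op,
    message) pairs, each projected by its source operation before summation."""
    return sum(project(op, msg) for op, msg in incoming)

# Toy check: identical raw messages from different operations aggregate
# differently from two identical-operation sources.
m = rng.standard_normal(DIM)
print(np.allclose(aggregate([("conv1x1", m), ("conv3x3", m)]),
                  aggregate([("conv1x1", m), ("conv1x1", m)])))  # False
```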

References

  • [1] FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
  • [2] TA-GATES: An Encoding Scheme for Neural Network Architectures
  • [3] Deep Image Prior
Comment

I am happy with the clarification from the author and have increased my score.

Comment

Dear Reviewer VhSE,

We are grateful to hear that you are happy with our clarification, and we deeply appreciate that you have increased the score accordingly.

We also thank you for your constructive feedback. We will thoughtfully incorporate your comments, along with the additional analyses, into the revised manuscript.

Warm regards,

The Authors of Submission 11597

Review (Rating: 5)

This work addresses the problem of training neural architecture encoders, or models that can predict downstream task performance given a proposed architecture description. Specifically, this work proposes a representation that captures the "flow" of a target model, where the flow refers to information flow w.r.t. the forward and backward passes. The proposed pre-training objective in this work allows an encoder to learn to predict the flow, which previously required specific architecture components to capture. This is accomplished by first assigning a topological ordering over elements in the architecture, then tracing the forward and backward pass of the candidate architecture. This information is aggregated into an embedding that an encoder-decoder pair is trained on with a reconstruction objective.

The proposed method is evaluated for downstream performance prediction, as well as utility in predictor-aware-NAS. An ablation study over the different constituent components is used to validate the efficacy of each.

Strengths and Weaknesses

Strengths:

  • The proposed method seems novel with respect to related work. Framing the pretraining objective around the flow-based surrogate itself seems to open a more flexible design space compared to modeling the flow directly with the encoder
  • The paper is well-written and easy to follow
  • Experiments show a consistent (sometimes sizeable) increase in performance

Weaknesses:

  • The values of $\lambda_1$ and $\lambda_2$ are not mentioned, only their search space. I'm concerned that the performance of the method might be tuned separately for each test setting, or that performance is not robust to the choice of $\lambda$
  • It seems to me that a major part of this work is that no "special components" are needed to predict the flow, as that information is now captured in the embedding. That unlocks a large design space of less-constrained encoders. However, I don't see any mention of what other encoder designs / architectures were considered.

Other comments: line 111: variation -> variational

Questions

See weaknesses section

Limitations

Yes

Justification for Final Rating

After a detailed rebuttal regarding issues with hyperparameter specification and encoder design experiments, I have raised my score.

Formatting Issues

N/A

Author Response

Dear Reviewer df83,

We sincerely appreciate your invaluable feedback on our hyperparameter and encoder analyses.

Below, we provide detailed responses to each of your comments.

Best regards,

The Authors


Weakness 1

The values of $\lambda_1$ and $\lambda_2$ are not mentioned, only their search space. I'm concerned that the performance of the method might be tuned separately for each test setting, or that performance is not robust to the choice of $\lambda$.

Response 1

Our response is two-fold.

  • [Clarification] We apologize for the missing detail. All hyperparameters were tuned using the validation set, which is described in Section 4.1 of our manuscript.
    • [Configuration upload] We will report the validation-best hyperparameter configurations in our revised manuscript and upload them to Anonymous GitHub after the rebuttal phase, since the rebuttal policy prohibits uploading additional content during this period.
  • [New Experiments: Robustness Analysis] Our method is robust to the choice of the $\lambda_1$ and $\lambda_2$ hyperparameters.
    • [Setting] Recall that our search space for $\lambda_1$ and $\lambda_2$ includes three configurations: $(\lambda_1, \lambda_2) \in \{(\frac{1}{3}, \frac{2}{3}), (\frac{2}{3}, \frac{1}{3}), (\frac{1}{2}, \frac{1}{2})\}$, as detailed in Appendix A.4 of our manuscript. In this additional experiment, we report the performance prediction task results under each configuration. We use ResGatedGCN [1] as our backbone architecture encoder, and other settings are the same as those in Section 4.2 of our manuscript.
    • [Results] As shown in Table 1, the performance variations across these settings remain marginal, and our method consistently outperforms the strongest baselines (GMAE [2] and ZC-Proxy [3]). These results highlight the robustness of our method to hyperparameter choices.

Table 1. Results of performance prediction under each hyperparameter configuration.

| | $(\lambda_1, \lambda_2) = (\frac{1}{2}, \frac{1}{2})$ | $(\lambda_1, \lambda_2) = (\frac{1}{3}, \frac{2}{3})$ | $(\lambda_1, \lambda_2) = (\frac{2}{3}, \frac{1}{3})$ | ZC-Proxy | GMAE |
| --- | --- | --- | --- | --- | --- |
| NAS-Bench-101 | 74.8 (4.8) | 74.8 (4.4) | 74.3 (4.5) | 68.3 (6.7) | 68.1 (4.7) |
| NAS-Bench-201 | 82.2 (0.7) | 82.3 (0.8) | 81.7 (0.6) | 79.9 (0.8) | 74.8 (1.2) |

Weakness 2

It seems to me that a major part of this work is that no "special components" are needed to predict the flow, as that information is now captured in the embedding. That unlocks a large design space of less-constrained encoders. However, I don't see any mention of what other encoder designs / architectures were considered.

Response 2

Our response is two-fold.

  • [Clarification] Sorry for the missing detail. Since a neural architecture can be represented as a graph (see Section 2.2 of our manuscript), we adopt graph neural networks (ResGatedGCN and GIN) and a graph transformer (FlowerFormer) as backbone architecture encoders. We will incorporate these details in our revised manuscript.
    • [Detail] This choice aligns with common practice in neural architecture encoding work [2, 3, 4], where graph neural networks and graph transformers are widely used to encode neural architectures.
  • [New Experiments: Additional Encoders] The superiority of our method over baseline pre-training methods holds across different types of neural architecture encoders.
    • [Clarification] As the reviewer noted, our method is compatible with a wide range of neural architecture encoders.
    • [Setting] To validate this, we additionally evaluate our method using two alternative encoders: an MLP-based encoder [5] and a transformer-based encoder [6]. All other experimental settings follow those described in Section 4.2 of our manuscript. We compare against the two strongest baselines, GMAE and ZC-Proxy.
    • [Results] As shown in Tables 2 and 3, our method (FGP) outperforms both baselines, demonstrating that its effectiveness holds under other choices of backbone neural architecture encoders.

Table 2. Results of performance prediction using the MLP-based architecture encoder. Since this encoder does not produce node embeddings, the node-feature-generative method (GMAE) is not applicable.

| | w/o pre-training | GMAE | ZC-Proxy | FGP (ours) |
| --- | --- | --- | --- | --- |
| NAS-Bench-101 | 42.5 (5.4) | - | 49.8 (7.3) | 65.2 (4.1) |
| NAS-Bench-201 | 59.6 (2.2) | - | 70.2 (2.8) | 72.3 (1.3) |

Table 3. Results of performance prediction using the Transformer-based architecture encoder.

| | w/o pre-training | GMAE | ZC-Proxy | FGP (ours) |
| --- | --- | --- | --- | --- |
| NAS-Bench-101 | 62.7 (3.6) | 63.2 (4.5) | 64.1 (8.5) | 69.7 (6.3) |
| NAS-Bench-201 | 63.2 (2.0) | 65.7 (2.4) | 69.2 (3.2) | 74.3 (1.5) |

Weakness 3

Other comments: line 111: variation -> variational

Response 3

Thank you for pointing out the typos. We will revise all identified typos, including the one mentioned by the reviewer.


References

  • [1] Residual Gated Graph ConvNets
  • [2] Graph Masked Autoencoder Enhanced Predictor for Neural Architecture Search
  • [3] Dynamic Ensemble of Low-Fidelity Experts: Mitigating NAS “Cold-Start”
  • [4] Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?
  • [5] A study of neural architecture encoding
  • [6] TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework
Comment

Dear Reviewer df83,

We would first like to express our sincere gratitude for your constructive feedback on our work.

As the discussion period is drawing to a close, we kindly ask you to consider our rebuttal.

If you find that our response adequately addresses your concerns, we would greatly appreciate it if you could consider adjusting your score accordingly.

Please feel free to reach out with any further questions or concerns. We are happy to engage in additional discussion.

Best regards,

The Authors

Comment

I appreciate the detailed response from the authors. I really had a single main criticism, which was that more attention should have been paid to the search / design space of the proposed method. I am satisfied with the response given here. In particular, I think tables 2 and 3 are quite important to the novelty of the method. If there is indeed a wider applicable encoder-design-space, then tables 2 and 3 show that this work can be built upon in the future.

I'll raise my score.

Comment

Dear Reviewer df83,

We are grateful that our response has addressed your concerns and deeply appreciate your decision to increase the score.

We also thank you for your constructive feedback and will thoughtfully incorporate your comments, together with the additional analyses, into the revised manuscript.

Warm regards,

The Authors of Submission 11597

Review (Rating: 5)

This paper proposes FGP (Flow-based Generative Pre-training), a novel pre-training method for neural architecture encoding. The key idea is to train a neural architecture encoder to capture information flow within a neural architecture without requiring specialized and computationally expensive model structures. Specifically, FGP trains an encoder to reconstruct a flow surrogate, enabling it to learn information flow without complex model structures.

Strengths and Weaknesses

Strengths:

  • The paper argues that FGP lets simpler encoders become flow-aware and provides clear training guidance. The claim is backed by experimental results showing that FGP significantly boosts encoder performance, with gains of up to 106% when compared to encoders trained solely with supervised learning.
  • FGP is substantially faster and consistently outperforms baseline pre-training methods in a majority of settings.

Overall, the paper presents a well-argued and experimentally supported novel pre-training method for neural architecture encoding that addresses critical challenges of efficiency and effectiveness in the field.

Weaknesses:

  • The paper acknowledges that FGP demonstrates empirical effectiveness across various benchmarks; further theoretical investigation into the flow surrogates would make this work much stronger.
  • Even though the authors have touched on the issue that they have limited understanding of the types of information flows that FGP could capture, it would be good to see a thorough discussion around it.
  • Without the above two points, the claims are purely empirical and not strong.

Questions

Address the points raised above under weaknesses

Limitations

yes

Justification for Final Rating

Thanks for addressing the issues raised in my feedback. I've raised my score.

Formatting Issues

The paper is well-structured.

Author Response

Dear Reviewer RQZD,

We deeply appreciate your invaluable feedback on our theoretical analysis and the discussion of our flow surrogate.

Below, we provide detailed responses to each of your comments.

Best regards,

The Authors


Weakness 1

The paper acknowledges that FGP demonstrates empirical effectiveness across various benchmarks; further theoretical investigation into the flow surrogates would make this work much stronger.

Response 1

  • [Permutation invariance] In response to the reviewer’s suggestion, we provide a preliminary theoretical analysis demonstrating that our flow-surrogate computation is invariant to node permutations (i.e., node indexing), a desirable property when representing the information flow of a neural architecture to ensure effective downstream task performance.
    • [Intuition] Prior works [1, 2] have shown that accurately capturing the information flow within a neural architecture is critical for downstream tasks, such as performance prediction. Therefore, for our flow surrogate to be effective in such tasks, it should accurately represent this information flow.
    • [Crucial aspect of flow] In the NAS-Bench datasets used in our experiments, the information flow within a neural architecture is independent of the indexing of individual operations. Accordingly, it is desirable for the computation of flow surrogates to be permutation-invariant with respect to these operations.
    • [Proof sketch] Thost and Chen [3] theoretically demonstrate that asynchronous message passing on a directed acyclic graph—also used in our flow-surrogate computation—is invariant to node permutations when a permutation-invariant pooling function is employed for neighbor aggregation. Their theoretical result extends to our setting, indicating that our method likewise ensures permutation invariance. We will add this analysis to our revised manuscript.
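Stated compactly (the notation below is ours, paraphrasing the property carried over from [3]):

$$
\phi(\pi \cdot G) = \phi(G) \quad \text{for every node permutation } \pi,
$$

where $\phi$ denotes the flow-surrogate computation and $\pi \cdot G$ denotes the same architecture DAG with its nodes re-indexed by $\pi$.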

Weakness 2

Even though the authors have touched on the issue that they have limited understanding of the types of information flows that FGP could capture, it would be good to see a thorough discussion around it.

Response 2

Our response is two-fold: (1) empirical evidence supporting the effectiveness of the flow surrogate in capturing information flow, and (2) a discussion of the challenges in providing more rigorous theoretical justification.

  • [New experiments: Preliminary analysis] We present a new preliminary analysis suggesting that our flow surrogate can differentiate neural architectures with distinct operations, resulting in different patterns of information flow.
    • [Setting] We conduct three binary classification tasks to determine whether a given neural architecture includes: (1) a 1×1 convolution operation, (2) a 3×3 convolution operation, and (3) a pooling operation. We use flow surrogate representations as input features and train an MLP as the classifier, performing 10 trials. The architectures are split into 80% for training and 20% for testing.
    • [Results] As shown in Table 1, our flow surrogate achieves over 92% accuracy across all cases, indicating its ability to distinguish architectures with different operations that give distinct patterns of information flow.

Table 1. Operation classification results. The mean and standard deviation over 10 trials are reported.

| | 1×1 Conv | 3×3 Conv | Pooling |
| --- | --- | --- | --- |
| NAS-Bench-101 | 93.2 (0.3) | 99.9 (0.2) | 92.1 (0.7) |
| NAS-Bench-201 | 98.9 (0.3) | 99.9 (0.1) | 100.0 (0.0) |
  • [Challenges] Unfortunately, a more rigorous theoretical analysis is non-trivial, primarily due to two key challenges:
    • [Definition of flow] It is challenging to formally define information flow within a neural architecture with a single mathematical formulation. This is because the actual information flow is highly dependent on the specific parameter values, which can vary significantly across models.
    • [Non-linearity] Both the actual neural architecture and our flow-surrogate computation involve multiple non-linear functions, which complicates the derivation of rigorous theoretical conclusions.

Weakness 3

Without the above two points, the claims are purely empirical and not strong.

Response 3

Given our responses 1 and 2, while our additional results may not fully meet the reviewer’s expectations, we believe they adequately address the two concerns raised.

  • [Regarding point 1] In response to the concern about the lack of theoretical analysis, we provide a theoretical justification showing that our flow surrogate computation is permutation invariant with respect to node ordering, which is an essential property for effectively capturing information flow.
  • [Regarding point 2] In response to the concern regarding the lack of thorough analysis on the information flow captured by the surrogate, we provide a preliminary analysis indicating that our flow surrogate can distinguish information flows involving distinct operations.

References

  • [1] FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
  • [2] TA-GATES: An Encoding Scheme for Neural Network Architectures
  • [3] Directed Acyclic Graph Neural Networks
Comment

Dear Reviewer RQZD,

We are pleased to hear that our response has addressed the concerns raised in your feedback, and we sincerely appreciate your decision to update the score accordingly.

It appears that your previous comment is no longer visible in the system. If you have any additional concerns or would like to continue the discussion, we would be happy to engage further.

Thank you once again for your thoughtful and constructive feedback.

Best regards,

The Authors of Submission 11597.

Comment

Dear Reviewers and (Senior) Area Chair,

We sincerely appreciate your dedicated academic service.

Below, we provide a concise summary of our rebuttal.


Strengths

The reviewers noted the following key strengths of our work:

Strength 1 [Novelty] The reviewers appreciated the novelty of our proposed generative pre-training method, which enables an architecture encoder to capture information flow.

  • Reviewer RQZD: “…a novel pre-training method for neural architecture encoding…”.
  • Reviewer df83: “…novel with respect to related work…”.

Strength 2 [Sufficient experiments] The reviewers appreciated our extensive experiments, which systematically demonstrate the effectiveness of our method in downstream tasks.

  • Reviewer RQZD: “The claim is backed by the experimental results…”.
  • Reviewer zBpp: “The experiments are well structured…”.

Strength 3 [Strong performance] The reviewers appreciated the performance gains of our method in the architecture performance prediction task compared with existing baselines.

  • Reviewer RQZD: “…faster and consistently outperforms baseline…”.
  • Reviewer df83: “…consistent (sometimes sizeable) increase in performance…”.
  • Reviewer zBpp: “…show improved downstream performance…”.

Rebuttal summary

The primary concerns raised by the reviewers are twofold: (1) limited understanding of the information flow captured by our method, and (2) limited analysis of backbone architecture encoders.

  • Weakness 1 [Flow analysis] Reviewers RQZD and VhSE requested an analysis of the type of information flow captured by our proposed method.
    • [Our responses] Through additional analysis, we empirically demonstrate that our flow surrogate can distinguish architectures with different operations, resulting in distinct information flows.
    • [Details] Further details can be found in our response 2 to Reviewer RQZD and our response 2.1 to Reviewer VhSE.
  • Weakness 2 [Backbone architecture encoders] Reviewers df83 and zBpp requested an analysis of backbone encoders, focusing on (1) which types of encoders are effective when coupled with our method, and (2) why flow-aware encoders are effective with our proposed approach.
    • [Our responses] Through additional analysis, we empirically demonstrate that (1) our method improves the performance not only of graph neural networks but also of MLP-based encoders and Transformers, and (2) it enables flow-aware encoders to learn more diverse flow patterns by leveraging unlabeled neural architectures.
    • [Details] Further details can be found in our response 2 to Reviewer df83 and our response 2 to Reviewer zBpp.

Other clarifications

In addition to the above major comments, we will incorporate the following clarifications into our revised manuscript.

  • Clarification 1 [Figure clarification] We will provide more detailed explanations of Figures 3 and 8 to enhance readers’ understanding.
    • [Details] Further details can be found in our response 4 to Reviewer VhSE and our response 1.1 to Reviewer zBpp.
  • Clarification 2 [Design justification] We will include justification for the use of summation pooling, supported by additional ablation studies.
    • [Detail] Further details can be found in our response 5 to Reviewer VhSE.
  • Clarification 3 [Theoretical Analysis] We will include a theoretical analysis of the permutation invariance of our method with respect to node indexing.
    • [Detail] Further details can be found in our response 1 to Reviewer RQZD.
  • Clarification 4 [Hyperparameter analysis] We will further describe our hyperparameter tuning process and the robustness of our method to hyperparameter choices.
    • [Detail] Further details can be found in our response 1 to Reviewer df83.

Reviewer responses

We believe that our rebuttal was generally well received. Accordingly, most of the reviewers raised their evaluations.

  • Reviewer df83: “…I am satisfied with the response … I'll raise my score”.
  • Reviewer VhSE: “I am happy with the clarification from the author and have increased my score”.
  • Reviewer zBpp: “…answers my questions… I have raised my score accordingly”.

Once again, thank you for your dedicated academic service.

Warm regards,

The authors of submission 11597

Comment

Dear authors,

I know it's late, but for some reason none of the reviewers have asked this, so here it goes: why are all your Kendall Taus much higher than 1? I'm guessing you multiplied them by 100, but this should be clearly explained and not subject to guessing...

Best,

AC

Comment

Dear Area Chair wYwe,

We apologize for the confusion, and thank you for your detailed question regarding our results. As you noted, all experimental results reported using the Kendall’s Tau metric are multiplied by 100.
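For instance, a reported value of 82.2 corresponds to a raw Kendall's Tau of 0.822. A minimal illustration of the scaling (the score lists below are placeholders):

```python
from scipy.stats import kendalltau

predicted = [0.91, 0.85, 0.77, 0.93, 0.60]  # placeholder predicted performances
actual = [0.90, 0.80, 0.79, 0.95, 0.58]     # placeholder ground-truth performances

tau, _ = kendalltau(predicted, actual)      # raw tau lies in [-1, 1]
print(f"reported value: {100 * tau:.1f}")   # scaled by 100, as in the tables
```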

We will clarify this in the revised manuscript and will also review the entire manuscript to address any points that may potentially cause confusion for readers.

Once again, thank you for your question, which helps improve readers’ understanding of our experimental results.

Warm regards,

The authors of submission 11597

Final Decision

The initial reviews mentioned good structure, novelty, good argumentation, and well-thought-out experiments as the main strengths of the paper. At the same time, the criticism centred on the results being purely empirical, a lack of explanations for certain statements, and the generalisability/robustness of the results.

During the rebuttal, the authors provided additional explanations, including a preliminary analysis of certain (theoretical) properties of their method (in particular, invariance to node permutations and the ability to distinguish architectures with distinct operations), as well as ablation studies and comparisons. In the end, all reviewers were convinced to recommend acceptance. Although confidence is low in certain cases, after reading the paper myself I concur that it constitutes an interesting contribution and would like to recommend acceptance.