PaperHub
Rating: 5.0 / 10 (Rejected, 4 reviewers)
Individual ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Average confidence: 3.8
ICLR 2024

TAFS: Task-aware Activation Function Search for Graph Neural Networks

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-02-11

Abstract

Keywords
Activation Function, Graph Neural Networks, AutoML, Neural Architecture Search

Reviews and Discussion

Official Review (Rating: 5)

The study presents a novel approach to Graph Neural Network (GNN) activation function search using bi-level optimization. An efficient algorithm is introduced that explores a search space defined by universal approximators with smoothness constraints, allowing for quick optimal function discovery. By using stochastic relaxation, the algorithm bypasses the challenges of non-differentiable objectives, outperforming existing activation functions across various GNN models and datasets, leading to a tailored GNN activation function design.

Strengths

  1. The idea of designing a task-aware activation function for GNNs is excellent, since it could universally enhance the performance of GNNs across different datasets and tasks.
  2. The writing and expression of this paper are good.

Weaknesses

  1. Graph classification is crucial in graph mining due to its emphasis on global topological structures. Given its distinct objectives from node classification, it's imperative for the author to include experiments on graph classification.
  2. As a universal method, TAFS should be tested on as many GNN models as possible. GCN and GraphSAGE are quite similar, while GAT and GIN are different and should also be used as backbones to evaluate the performance of TAFS.
  3. The authors should conduct a time complexity analysis and compare it with GReLU.

Questions

See weaknesses.

Comment

GIN experiments

Results are mean ± std.

| GNN | Activation | Cora | DBLP | Cornell | Texas | Chameleon |
|---|---|---|---|---|---|---|
| GIN-Stack | ReLU | 84.7 ± 0.18 | 84.48 ± 0.27 | 52.97 ± 6.3 | 69.19 ± 7.37 | 31.49 ± 1.76 |
| | Tanh | 85.4 ± 0.39 | 86.22 ± 0.14 | 54.59 ± 3.59 | 70.27 ± 8.2 | 56.97 ± 3.57 |
| | Swish | 84.66 ± 0.35 | 85.08 ± 0.63 | 55.14 ± 4.39 | 65.95 ± 9.76 | 31.1 ± 4.52 |
| | TAFS | 85.1 ± 0.46 | 86.98 ± 0.22 | 56.59 ± 4.65 | 69.19 ± 5.01 | 59.97 ± 3.16 |
| GIN-Residual | ReLU | 84.77 ± 0.48 | 84.18 ± 0.21 | 55.14 ± 4.39 | 72.43 ± 4.65 | 30.53 ± 3.07 |
| | Tanh | 85.21 ± 0.7 | 86.14 ± 0.23 | 56.76 ± 8.02 | 72.97 ± 7.05 | 59.91 ± 2.87 |
| | Swish | 84.84 ± 0.5 | 84.51 ± 0.32 | 55.68 ± 4.39 | 74.59 ± 6.53 | 31.36 ± 1.47 |
| | TAFS | 85.06 ± 0.51 | 88.27 ± 0.23 | 59.59 ± 3.97 | 77.69 ± 6.07 | 60.66 ± 3.97 |
| GIN-JKNet | ReLU | 85.95 ± 0.78 | 84.96 ± 0.23 | 71.89 ± 6.53 | 82.7 ± 6.07 | 31.45 ± 1.8 |
| | Tanh | 85.4 ± 0.65 | 85.68 ± 0.27 | 72.97 ± 4.83 | 82.16 ± 3.67 | 64.47 ± 1.71 |
| | Swish | 84.99 ± 0.44 | 85.04 ± 0.24 | 70.27 ± 6.16 | 82.16 ± 6.07 | 33.16 ± 3.25 |
| | TAFS | 85.8 ± 0.32 | 86.61 ± 0.12 | 74.59 ± 5.57 | 83.62 ± 6.26 | 66.71 ± 3.44 |
| GIN-Mixhop | ReLU | 83.66 ± 0.92 | 81.69 ± 0.13 | 65.41 ± 3.59 | 77.84 ± 7.13 | 56.14 ± 3.07 |
| | Tanh | 83.59 ± 0.84 | 81.78 ± 0.2 | 65.41 ± 3.59 | 72.97 ± 9.67 | 56.18 ± 2.19 |
| | Swish | 83.22 ± 1.3 | 81.59 ± 0.54 | 66.49 ± 3.24 | 76.76 ± 7.76 | 56.4 ± 1.75 |
| | TAFS | 83.96 ± 0.46 | 81.83 ± 0.2 | 71.35 ± 6.96 | 79.84 ± 6.71 | 64.95 ± 2.43 |

Dataset: ogbg-molhiv

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GIN | ReLU | 0.7707 | 3.3M | 53 min |
| | Tanh | 0.778 | 3.3M | 53 min |
| | Swish | 0.7669 | 3.3M | 53 min |
| | TAFS | 0.7883 | +300 | 67 min |

Q3. The authors should conduct a time complexity analysis and compare it with GReLU.

Answer: The setting of GReLU is actually very different from ours. GReLU designs a hyperfunction that includes additional graph convolutional layers; as a result, GReLU is not a univariate function, as is commonly required for activation functions (whereas ReLU, Swish, Tanh, and our TAFS are univariate functions). This can also be seen from GReLU's parameters: GReLU has N * C * K extra parameters, where N is the number of nodes, C is the node feature dimension, and K is the number of segments, so GReLU depends on the graph dataset size. TAFS does not: its extra parameters depend only on the number of GNN layers, roughly M * L, where M is the size of the MLP (usually dozens of units) and L is the number of layers it is applied to.
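To make the scaling difference concrete, here is a small illustrative calculation (the concrete numbers are hypothetical placeholders, not taken from any experiment):

```python
# Illustrative parameter-count comparison; all concrete numbers are hypothetical.
def grelu_extra_params(num_nodes: int, feat_dim: int, num_segments: int) -> int:
    # GReLU's hyperfunction adds roughly N * C * K parameters, so its overhead
    # grows with the size of the graph dataset.
    return num_nodes * feat_dim * num_segments

def tafs_extra_params(mlp_size: int, num_layers: int) -> int:
    # TAFS adds one small MLP per GNN layer (roughly M * L parameters),
    # independent of the number of nodes in the graph.
    return mlp_size * num_layers

print(grelu_extra_params(num_nodes=20_000, feat_dim=128, num_segments=4))  # 10240000
print(tafs_extra_params(mlp_size=100, num_layers=4))                       # 400
```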

Comment

Thank you for your appreciation of our work and for your insightful questions. We answer each below.

Q1: Graph classification is crucial in graph mining due to its emphasis on global topological structures. Given its distinct objectives from node classification, it's imperative for the author to include experiments on graph classification.

Answer: We report below the performance on graph classification on the ogbg-molhiv dataset. We include classical GNN baselines as well as the top-1 ranking solution PAS+FPS. In all cases, TAFS achieves a performance gain.

Graph-level task. Dataset: ogbg-molhiv (#Graphs: 41K, #Nodes per graph: 25.5)

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.7606 | 0.53M | 40 min |
| | Tanh | 0.7883 | 0.53M | 40 min |
| | Swish | 0.7577 | 0.53M | 40 min |
| | TAFS | 0.7934 | +300 | 55 min |
| GIN | ReLU | 0.7707 | 3.3M | 53 min |
| | Tanh | 0.778 | 3.3M | 53 min |
| | Swish | 0.7669 | 3.3M | 53 min |
| | TAFS | 0.7883 | +300 | 67 min |
| PAS+FPs | ReLU | 0.8169 | 27M | 17 hours |
| | Tanh | 0.8201 | 27M | 17 hours |
| | Swish | 0.8152 | 27M | 17 hours |
| | TAFS | 0.8203 | +0.01M | 23 hours |

Q2. As a universal method, TAFS should be tested on as many GNN models as possible. GCN and GraphSAGE are quite similar, while GAT and GIN are different, which should also be used as backbone to evaluate the performance of TAFS.

Answer: We agree. Below are our detailed experiments with GAT/GIN backbones across small and large datasets. As can be seen, TAFS achieves the best results in most cases, showing the effectiveness of our method.

GAT experiments

Results are mean ± std.

| GNN | Activation | Cora | DBLP | Cornell | Texas | Chameleon |
|---|---|---|---|---|---|---|
| GAT-Stack | ReLU | 84.73 ± 0.65 | 85.36 ± 0.13 | 55.14 ± 7.57 | 55.68 ± 6.3 | 28.42 ± 9.75 |
| | Tanh | 84.51 ± 1.25 | 84.91 ± 0.2 | 56.76 ± 2.96 | 56.76 ± 5.92 | 55.57 ± 3.58 |
| | Swish | 84.44 ± 1.4 | 85.34 ± 0.28 | 56.76 ± 2.42 | 60 ± 8.09 | 22.8 ± 11.04 |
| | TAFS | 86.14 ± 1.68 | 85.91 ± 0.13 | 56.59 ± 3.15 | 58.92 ± 7.91 | 55.97 ± 3.16 |
| GAT-Residual | ReLU | 83.96 ± 0.6 | 85.16 ± 0.09 | 55.14 ± 6.07 | 54.59 ± 6.92 | 61.18 ± 4.19 |
| | Tanh | 85.62 ± 0.18 | 85.25 ± 0.3 | 54.59 ± 5.24 | 54.05 ± 2.42 | 64.65 ± 2.25 |
| | Swish | 84.44 ± 0.97 | 84.49 ± 0.31 | 52.43 ± 5.57 | 56.76 ± 8.88 | 59.96 ± 1.29 |
| | TAFS | 86.47 ± 1.25 | 85.99 ± 0.22 | 55.11 ± 7.33 | 56.81 ± 8.44 | 65.66 ± 3.78 |
| GAT-JKNet | ReLU | 86.43 ± 0.25 | 84.77 ± 0.23 | 78.92 ± 8.61 | 79.46 ± 5.01 | 57.68 ± 0.46 |
| | Tanh | 86.73 ± 0.92 | 85.07 ± 0.13 | 75.14 ± 5.24 | 78.38 ± 2.96 | 58.25 ± 1.01 |
| | Swish | 85.47 ± 0.8 | 84.68 ± 0.2 | 82.7 ± 6.07 | 77.84 ± 5.24 | 54.69 ± 2.36 |
| | TAFS | 87.21 ± 0.47 | 85.22 ± 0.18 | 82.84 ± 6.26 | 79.38 ± 4.52 | 64.82 ± 2.01 |
| GAT-Mixhop | ReLU | 84.25 ± 1.67 | 84.1 ± 0.12 | 63.78 ± 7.76 | 62.7 ± 7.91 | 51.32 ± 2.27 |
| | Tanh | 84.62 ± 0.51 | 84.34 ± 0.28 | 67.57 ± 3.82 | 70.27 ± 7.05 | 53.82 ± 1.27 |
| | Swish | 84.4 ± 0.38 | 84.39 ± 0.35 | 64.86 ± 3.82 | 65.95 ± 6.07 | 50.26 ± 2.93 |
| | TAFS | 85.29 ± 0.45 | 85.04 ± 0.34 | 67.86 ± 5.92 | 69.03 ± 7.53 | 54.78 ± 3.62 |

Dataset: ogbn-proteins

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GAT | ReLU | 0.8176 | 2.48M | 33.6 min |
| | Tanh | 0.8003 | 2.48M | 33.6 min |
| | Swish | 0.8299 | 2.48M | 33.6 min |
| | TAFS | 0.8682 | +300 | 41.5 min |
Official Review (Rating: 5)

This paper proposes a Task-Aware Activation Function Search method, abbreviated as TAFS. TAFS is capable of efficiently searching for and discovering new, effective activation functions within GNN applications. Firstly, for the search space of activation functions, TAFS introduces a continuous latent space equipped with a general approximator that includes an additional smoothness constraint. Specifically, TAFS utilizes an MLP to approximate the optimal activation function, where the parameters of the MLP are optimized as part of the search space, incorporating a Jacobian regularization term. Secondly, employing stochastic relaxation techniques, the search space is reparameterized with a Gaussian distribution, shifting the optimization target from the parameters of the activation function to those of the Gaussian distribution. Comprehensive evaluations on node- and link-level tasks demonstrate that this method achieves excellent performance.

Strengths

  1. Presents an innovative and intriguing approach to activation function search within the context of Graph Neural Networks, marking a novel area of research.
  2. Introduces a probabilistic search algorithm capable of effectively exploring a regularized function space, leading to the discovery of novel activation functions.
  3. Experimental results demonstrate that, compared to baseline methods, TAFS achieves excellent performance in node and link prediction tasks, significantly enhancing the efficiency of the search process.

Weaknesses

  1. The search strategy presented in this paper does not support a larger GNN search space. For instance, if there is a need to search for aggregation functions or message passing functions, the optimal form of the activation function is likely to change.

  2. If TAFS employs an MLP to approximate the optimal activation function, the performance of the activation function would also depend on the number of layers in the MLP and the non-linear transformation functions. This has not been discussed in this paper.

  3. In experiments, although the paper introduces a Jacobian regularization term, the impact of this regularization has not been empirically tested.

  4. In TAFS, a stochastic relaxation is used, involving the reparameterization of the activation function parameters using a Gaussian distribution. However, this paper does not discuss the advantage of this strategy.

Questions

  1. In a previous work [1], the best-performing activation function is referred to as Swish, which has a specific functional form. Similarly, can the best-performing activation function identified by TAFS be represented using a generic function? [1] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In International Conference on Learning Representations, ICLR, 2018.

  2. The sentence in the main text, "… how can we design GNN activation functions to adapt effectively to various graph-based tasks, creating task-aware activation functions??" seems to have an issue with the use of symbols.

  3. In the text, "... and demote by ¯w_δ all the parameters of GNN, …", the word "demote" appears to be misspelled.

  4. The experimental section lacks clear explanations of the metrics used in the tables. For example, the description of Table 2 does not specify whether the results represent accuracy, AUC, or something else.

  5. In Figure 4, subfigure (a) shows that larger values of K yield better results but at a slower pace, demonstrating a "trade-off between accuracy and computation time." Is the Y-axis in (a) representing accuracy? And it's not clear how the relationship between K value and computation time is depicted.

Comment

Q5. In a previous work [1], the best-performing activation function is referred to as Swish, which has a specific functional form. Similarly, can the best-performing activation function identified by TAFS be represented using a generic function? [1] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In International Conference on Learning Representations, ICLR, 2018.

Answer: This is an interesting question. We answer in two ways. Firstly, we can obtain a closed-form equation via symbolic regression: since we have the learned univariate function, we can simulate many (x, y) value pairs from it and feed them to a symbolic regression procedure to distill an explicit formula. Swish can do this directly because it uses a template-based space that naturally yields a symbolic form; however, this has a big negative impact on the possible function space, i.e., it is quite possible that the suitable activation function is not in this manually designed template space. Secondly, in terms of usage, we do not need to distill the explicit form: we open source the learned weights of the univariate function, so practitioners just need to load this small function and replace their ReLU. The usage is thus very easy.
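As an illustration of the first point, below is a minimal sketch of the distillation step. Here `learned_act` is only a stand-in for the trained TAFS activation, and `a * tanh(b * x)` is just one candidate template rather than the actual template space:

```python
import numpy as np
from scipy.optimize import curve_fit

def learned_act(x):
    # Stand-in for the searched univariate activation; in practice this would be
    # the trained TAFS MLP evaluated elementwise.
    return 0.6 * np.tanh(-x)

def template(x, a, b):
    # One candidate symbolic template to fit; other templates could be tried.
    return a * np.tanh(b * x)

x = np.linspace(-5.0, 5.0, 1000)
y = learned_act(x)                                   # simulate (x, y) value pairs
(a, b), _ = curve_fit(template, x, y, p0=(1.0, 1.0))
print(f"distilled form: {a:.2f} * tanh({b:.2f} * x)")
```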

We show below a case study of the symbolized version of a TAFS search result. On the Cora dataset with the GCN-JK network baseline, the searched result is the blue line. We distill an explicit formula by symbolic regression, y = 0.6 Tanh(-x), and plug this activation function back into the model. The performance is shown below, and we provide a visualization in Appendix Sec. H. There is a performance gap in the table below because the explicit symbolic space is not accurate enough, which further shows that our implicit functional space is expressive.

| Dataset | Baseline | Activation | Accuracy |
|---|---|---|---|
| Cora | GCN-JK | TAFS | 89.08 |
| | | 0.6 * Tanh(-x) | 87.89 |

Q6-Q7: The sentence in the main text, "… how can we design GNN activation functions to adapt effectively to various graph-based tasks, creating task-aware activation functions??" seems to have an issue with the use of symbols.

In the text, "... and demote by ¯w_δ all the parameters of GNN, …", the word "demote" appears to be misspelled.

Answer: Thanks for your careful review. We have corrected these typos.

Q8: The experimental section lacks clear explanations of the metrics used in the tables. For example, the description of Table 2 does not specify whether the results represent accuracy, AUC, or something else.

Answer: In Table 2, standard accuracy is used. In Table 3, both ROC-AUC and PR-AUC are used. In our extended OGB experiments, we use ROC-AUC for node classification and graph classification and Hits@20 for link prediction, following the OGB standard. Thus, we have experimented on various tasks with different objective requirements.

Q9: In Figure 4, subfigure (a) shows that larger values of K yield better results but at a slower pace, demonstrating a "trade-off between accuracy and computation time." Is the Y-axis in (a) representing accuracy? And it's not clear how the relationship between K value and computation time is depicted.

Answer: Yes, the Y-axis is the downstream performance, here classification accuracy. For example, moving from K=32 to K=64 adds about 20% more computation time. As a result, it is truly a trade-off that should depend on the user's own preference. We will add this discussion to the paper.

Comment

We thank the reviewer for the insightful questions. We answer each below.

Q1. The search strategy presented in this paper does not support a larger GNN search space. For instance, if there is a need to search for aggregation functions or message passing functions, the optimal form of the activation function is likely to change.

Answer: Actually, this is fully compatible with a larger search space. Our method TAFS can be applied after another GNN search method finds the backbone model (for example, GraphNAS; Gao et al., 2019). We tested this strategy on PAS+FPS, the top-1 winner on the OGB graph classification leaderboard; it is a neural architecture search baseline that does exactly this kind of GNN space search. In the searched 14-layer GNN checkpoint, we replaced its ReLU functions with TAFS and fine-tuned further. As can be seen, within an affordable time the performance improves further, showing that TAFS is easily integrable with GNN NAS methods.

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| PAS+FPs | ReLU | 0.8169 | 27M | 17 hours |
| | TAFS | 0.8203 | +0.01M | 23 hours |

Reference: https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-molhiv

Q2. If TAFS employs an MLP to approximate the optimal activation function, the performance of activation function would also depend on the number of layers in the MLP and the non-linear transformation functions. This has not been discussed in this paper.

Answer: We highlight that the MLP is a universal approximator, which we chose on purpose as an expressive functional space. As a result, a normal (not too large) MLP is sufficient for finding suitable functions.
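For concreteness, below is a minimal PyTorch-style sketch of such an MLP-parameterized activation (simplified and illustrative; the exact configuration follows the paper):

```python
import torch
import torch.nn as nn

class MLPActivation(nn.Module):
    """Learnable univariate activation: a small scalar-to-scalar MLP applied elementwise."""
    def __init__(self, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.Tanh(),              # inner non-linearity of the approximator
            nn.Linear(hidden, 1),   # no activation on the output neuron
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.reshape(-1, 1)).reshape(x.shape)

# Drop-in replacement for a fixed activation such as nn.ReLU() inside a GNN layer.
act = MLPActivation()
h = act(torch.randn(32, 64))  # same shape in, same shape out
```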

We include the discussion on MLP hyperparameters in Figure 4(b)(c). It is true that these choices influence the performance. In order not to add too many extra parameters, we choose an MLP with one hidden layer (100 neurons) and a Tanh activation function; no activation function is applied to the output neuron. Below we compare ReLU and Tanh as the inner activation of the MLP; Tanh is thus our final choice.

Dataset: DBLP

| Model | Activation | Accuracy |
|---|---|---|
| GCN-Stack | ReLU | 83.06 |
| | TAFS (relu) | 88.91 |
| | TAFS (tanh) | 89.08 |
| GAT-residual | ReLU | 83.96 |
| | TAFS (relu) | 85.23 |
| | TAFS (tanh) | 85.47 |

Q3. In experiments, although the paper introduces a Jacobian regularization term, the impact of this regularization has not been empirically tested.

Answer: Thanks for pointing this out. We report below the influence of TAFS with and without the regularization. We will add this part to the paper.

Dataset: DBLP

| Model | Activation | Accuracy |
|---|---|---|
| GCN-Stack | ReLU | 83.06 |
| | TAFS (w/o reg) | 87.51 |
| | TAFS (w/ reg) | 89.08 |
| GAT-residual | ReLU | 83.96 |
| | TAFS (w/o reg) | 83.23 |
| | TAFS (w/ reg) | 85.47 |
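For reference, below is a simplified sketch of how such a smoothness (Jacobian-norm) penalty on the learned activation can be computed; it is illustrative only, and the exact regularizer and weighting follow the paper:

```python
import torch

def smoothness_penalty(act, lo: float = -5.0, hi: float = 5.0, n: int = 256) -> torch.Tensor:
    """Penalize large derivatives of a univariate activation on a sampling grid."""
    x = torch.linspace(lo, hi, n, requires_grad=True)
    y = act(x)
    (dy_dx,) = torch.autograd.grad(y.sum(), x, create_graph=True)
    return (dy_dx ** 2).mean()

# Training objective: task loss plus the weighted regularizer, e.g.
# loss = task_loss + lam * smoothness_penalty(act)
```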

Q4. In TAFS, a stochastic relaxation is used, involving the reparameterization of the activation function parameters using a Gaussian distribution. However, this paper does not discuss the advantage of this strategy.

Answer: The biggest advantage of the stochastic relaxation is that it removes the differentiability requirement on the task evaluation metric (e.g., ROC-AUC, Hits@20), which is very common in graph tasks. By sampling the weights of the activation functions from a probability distribution, we optimize the distribution parameters via Proposition 1 in the paper. Proposition 1 gives the gradient of the loss w.r.t. the distribution parameters, and it contains no terms such as ∇M, avoiding the calculation of the gradient of M. This makes the method applicable to more general tasks. The choice of a Gaussian distribution is without loss of generality, because we have no prior knowledge favoring a particular distribution; in cases where we know a certain probability distribution is better, we could switch to another one, such as a Poisson distribution.
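In simplified notation, the gradient estimator behind this relaxation takes the standard log-derivative (score-function) form (see Proposition 1 in the paper for the precise statement):

```latex
\nabla_{\theta}\,\mathbb{E}_{w \sim p_{\theta}}\!\left[\mathcal{M}(w)\right]
  = \mathbb{E}_{w \sim p_{\theta}}\!\left[\mathcal{M}(w)\,\nabla_{\theta}\log p_{\theta}(w)\right]
  \approx \frac{1}{S}\sum_{s=1}^{S}\mathcal{M}(w_{s})\,\nabla_{\theta}\log p_{\theta}(w_{s}),
  \qquad w_{s}\sim p_{\theta},
```

where M is the (possibly non-differentiable) evaluation metric and p_θ is the Gaussian over activation-function weights; no ∇M term appears.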

We report below the influence of TAFS with and without the stochastic relaxation. The full table has been updated in Table 3 in the paper.

| Dataset | Model | Activation | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| DTI | SkipGNN | ReLU | 0.922 | 0.928 |
| | | TAFS w/o relaxation | 0.933 | 0.934 |
| | | TAFS | 0.952 | 0.954 |
| | HOGCN | ReLU | 0.927 | 0.929 |
| | | TAFS w/o relaxation | 0.923 | 0.922 |
| | | TAFS | 0.943 | 0.94 |
Official Review (Rating: 5)

The authors introduce a framework for designing activation functions for graph neural networks (GNNs) based on the downstream task. The framework, called TAFS, uses a bi-level stochastic optimization problem with Lipschitz regularization to search for the optimal activation patterns. The authors claim that TAFS can automate the discovery of task-aware activation functions without significant computational or memory overhead and show that TAFS can achieve substantial improvements over existing methods in various tasks.

Strengths

S1. The authors review the existing work, explain the necessity of achieving task-aware activation functions in the context of GNNs, and identify two challenges for this goal. To address them, the authors design a new framework consisting of a compact search space and an efficient search algorithm, which enables automated activation function search.

S2. The paper writing is relatively good, and the authors provide a detailed explanation of the key parts of their method, namely the implicit functional search space and the stochastic relaxation.

Weaknesses

W1. Some of the experimental settings and results explanations in the paper are vague, such as Figure 1 in the Introduction section and Figure 3 in the Experiments section. I cannot understand how the experimental results in the figures were obtained, and what the data in the figures mean.

W2. From the experimental results, the performance of TAFS needs to be improved. Table 2 shows that in some experimental scenarios on the DBLP, Cornell, Texas and Chameleon datasets, the results obtained by TAFS are only marginally better than directly using a specific activation function, or even worse. The authors did not explain this phenomenon.

W3. I wonder why in the drug and protein interaction prediction experiments, only the results of TAFS and ReLU were provided, instead of comparing with multiple activation functions as in the node classification experiments.

W4. Since the authors compared the search efficiency with Swish and APL, why didn’t they involve the comparison with these two search-based methods in the effectiveness validation experiments (Table 2 and Table 3)?

Questions

See the weakness part

Comment

Thanks for the questions. We answer each below.

Q1. Some of the experimental settings and results explanations in the paper are vague, such as Figure 1 in the Introduction section and Figure 3 in the Experiments section. I cannot understand how the experimental results in the figures were obtained, and what the data in the figures mean.

Answer: In a word, Figure 1 illustrates the impact of the activation function in GNNs and motivates our work to search for activation functions. Figure 3 shows the searched candidates from the baselines and from our proposed TAFS.

Figure 1 shows experiments with two GNN baselines (GCN and GraphSage) on two typical graph datasets (Cora and DBLP), with different choices of activation functions (ReLU, Tanh, Leaky-ReLU, Swish). The evaluation metric is node classification accuracy. Figure 1 shows that activation functions in GNNs, which have previously received little attention, can have a huge impact on performance. This motivates our study of efficient activation function search algorithms in this paper.

Figure 3 demonstrates several things. Fig. 3(a) gives an idea of the search space of our compared baselines Swish and APL, showing what the functions in their search space look like. Fig. 3(b)(c) shows our searched activation functions tailored to the two datasets Cora and DBLP. Specifically, we show that (1) our method finds adaptive activation functions for different datasets; (2) due to the efficiency of our method, we can adapt layer-wise activation functions without exploding the search space, so different layers find different activation functions; (3) the activation functions we find are distinct from already-published manually designed ones; and (4) our activation functions are smooth, which reflects our smoothness regularization.

Q2. From the experimental results, the performance of TAFS needs to be improved. Table 2 shows that in some experimental scenarios on the DBLP, Cornell, Texas and Chameleon datasets, the results obtained by TAFS are only marginally better than directly using a specific activation function, or even worse. The authors did not explain this phenomenon.

Answer: We thank the reviewer for pointing this out. In the previous version of the table, we bolded the best results considering only the mean over multiple runs. To be more precise, we now take into account both the mean and the standard deviation and bold the best results at the same statistical significance level.

In the revised PDF, TAFS is now best or tied for best in all cases. On some small datasets such as Texas and Chameleon, the variance is very high; thus, TAFS reaches the same performance level as the best fixed activation function.

This phenomenon is much alleviated in large scale datasets such as OGB ones. In our experiments with OGB, we show that TAFS is the best as below.

Node-level task. Dataset: ogbn-proteins (#Nodes: 132,534, #Edges: 39,561,252)

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.7219 | 100K | 15.3 min |
| | Tanh | 0.6521 | 100K | 15.3 min |
| | Swish | 0.7255 | 100K | 15.3 min |
| | TAFS | 0.7483 | +300 | 16.8 min |
| GAT | ReLU | 0.8176 | 2.48M | 33.6 min |
| | Tanh | 0.8003 | 2.48M | 33.6 min |
| | Swish | 0.8299 | 2.48M | 33.6 min |
| | TAFS | 0.8682 | +300 | 41.5 min |
| GIPA | ReLU | 0.8917 | 17M | 10.7 hours |
| | Tanh | 0.8809 | 17M | 10.7 hours |
| | Swish | 0.8899 | 17M | 10.7 hours |
| | TAFS | 0.8991 | +1500 | 14.9 hours |

Link-level task. Dataset: ogbl-ddi (#Nodes: 4,267, #Edges: 1,334,889)

| Method | Activation | Test Hits@20 | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.3707 | 1.29M | 7 min |
| | Tanh | 0.3318 | 1.29M | 7 min |
| | Swish | 0.3772 | 1.29M | 7 min |
| | TAFS | 0.3991 | +300 | 10 min |
| GraphSage | ReLU | 0.5391 | 1.42M | 8 min |
| | Tanh | 0.5228 | 1.42M | 8 min |
| | Swish | 0.5377 | 1.42M | 8 min |
| | TAFS | 0.5512 | +300 | 13 min |
| E2N | ReLU | 0.9606 | 0.6M | 4 hours |
| | Tanh | 0.9554 | 0.6M | 4 hours |
| | Swish | 0.9515 | 0.6M | 4 hours |
| | TAFS | 0.9681 | +300 | 4.7 hours |
Comment

Q3. I wonder why in the drug and protein interaction prediction experiments, only the results of TAFS and ReLU were provided, instead of comparing with multiple activation functions as in the node classification experiments.

Answer: We add the performance of the other activation choices below. We originally omitted them to emphasize the impact of adopting TAFS. The table will be merged into the main paper later.

| Dataset | Model | Activation | ROC-AUC | PR-AUC |
|---|---|---|---|---|
| DTI | SkipGNN | ReLU | 0.922 | 0.928 |
| | | Tanh | 0.915 | 0.911 |
| | | Swish | 0.931 | 0.935 |
| | | TAFS | 0.952 | 0.954 |
| | HOGCN | ReLU | 0.927 | 0.929 |
| | | Tanh | 0.903 | 0.911 |
| | | Swish | 0.925 | 0.921 |
| | | TAFS | 0.943 | 0.94 |
| DDI | SkipGNN | ReLU | 0.886 | 0.866 |
| | | Tanh | 0.903 | 0.872 |
| | | Swish | 0.881 | 0.865 |
| | | TAFS | 0.911 | 0.898 |
| | HOGCN | ReLU | 0.898 | 0.881 |
| | | Tanh | 0.887 | 0.876 |
| | | Swish | 0.881 | 0.872 |
| | | TAFS | 0.917 | 0.901 |
| PPI | SkipGNN | ReLU | 0.917 | 0.921 |
| | | Tanh | 0.912 | 0.919 |
| | | Swish | 0.92 | 0.921 |
| | | TAFS | 0.927 | 0.937 |
| | HOGCN | ReLU | 0.919 | 0.922 |
| | | Tanh | 0.917 | 0.913 |
| | | Swish | 0.915 | 0.914 |
| | | TAFS | 0.923 | 0.929 |
| DGA | SkipGNN | ReLU | 0.912 | 0.915 |
| | | Tanh | 0.912 | 0.914 |
| | | Swish | 0.924 | 0.917 |
| | | TAFS | 0.93 | 0.94 |
| | HOGCN | ReLU | 0.927 | 0.934 |
| | | Tanh | 0.918 | 0.929 |
| | | Swish | 0.923 | 0.924 |
| | | TAFS | 0.933 | 0.942 |

Q4. Since the authors compared the search efficiency with Swish and APL, why didn’t they involve the comparison with these two search-based methods in the effectiveness validation experiments (Table 2 and Table 3)?

Answer: Due to the inefficiency of searchable Swish and APL, we could not practically obtain reasonable results with them on every benchmark. For example, in Table 4 of the paper, searchable Swish takes 50 hours to run on Chameleon, and APL costs 6 times more memory than TAFS. Table 2 already covers 5 datasets and 10 graph baselines, i.e., 50 sets of experiments, so we did not find it necessary to run searchable Swish and APL on every single task. Table 2 and Table 3 already include the fixed activation found by the Swish paper (Swish), which is worse than TAFS. We also give below the searchable Swish and APL results on two tasks to give a general idea of their ineffectiveness. Note that neither baseline can be applied to the OGB datasets due to their inefficiency in time and memory, while TAFS can.

Dataset: DBLP

| Model | Activation | Accuracy |
|---|---|---|
| GCN-Stack | ReLU | 83.06 |
| | Swish search | 82.79 |
| | APL search | 81.03 |
| | TAFS | 89.08 |
| GAT-residual | ReLU | 83.96 |
| | Swish search | 83.01 |
| | APL search | 81.22 |
| | TAFS | 85.47 |
Official Review (Rating: 5)

The paper introduces TAFS, a framework for designing task-specific activation functions in Graph Neural Networks (GNNs). TAFS uses a search algorithm to optimize activation functions for specific tasks, resulting in improved performance compared to traditional activation functions. The design of TAFS is more efficient than baseline methods at optimizing such bi-level optimization problems. The experimental results show that using TAFS usually achieves better performance than using fixed activation functions.

Strengths

  1. The paper is overall clearly written and easy to follow.
  2. This paper studies an interesting research problem that is rarely studied in the GML domain.

Weaknesses

  1. The experimental results are not very convincing. The experiments are all conducted on very small datasets, and I don't think the datasets for link prediction experiments are the commonly used ones in the literature. I'd recommend the authors use more standardized and commonly accepted benchmarks such as the OGB ones.
  2. Although the proposed method is already much more efficient than the baselines Swish and APL, it still takes about 10x the runtime compared with fixed activation functions. It's hard to tell whether the performance improvements are worth such a huge overhead in time cost.

Questions

  1. I'd appreciate if the authors can elaborate more on the difference of the proposed method versus the activation function search methods for CNNs and RNNs (as referenced in Sec. 1).
Comment

Graph-level task. Dataset: ogbg-molhiv (#Graphs: 41K, #Nodes per graph: 25.5)

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.7606 | 0.53M | 40 min |
| | Tanh | 0.7883 | 0.53M | 40 min |
| | Swish | 0.7577 | 0.53M | 40 min |
| | TAFS | 0.7934 | +300 | 55 min |
| GIN | ReLU | 0.7707 | 3.3M | 53 min |
| | Tanh | 0.778 | 3.3M | 53 min |
| | Swish | 0.7669 | 3.3M | 53 min |
| | TAFS | 0.7883 | +300 | 67 min |
| PAS+FPs | ReLU | 0.8169 | 27M | 17 hours |
| | Tanh | 0.8201 | 27M | 17 hours |
| | Swish | 0.8152 | 27M | 17 hours |
| | TAFS | 0.8203 | +0.01M | 23 hours |

Q3. I'd appreciate if the authors can elaborate more on the difference of the proposed method versus the activation function search methods for CNNs and RNNs (as referenced in Sec. 1).

Answer: The two most important activation function search methods are Swish and APL, which are explained and compared in Sec. 2 and especially in Table 1; we give a detailed introduction in Appendix B. Swish search is inefficient in running time because every inner optimization trains the network until convergence. APL is inefficient in parameters and memory because it parameterizes each neuron's activation with multiple hinges. In the case of the ogbg-molhiv dataset (above) and the 1st-ranking solution PAS for graph classification, the total parameter count is 27M and the running time is over 17 hours: APL would add over 150M parameters, which leads to OOM, and Swish search would cost over 70 days even if we train only 100 epochs. Our method TAFS takes only 23 hours and adds only 0.01M parameters, which is significantly more efficient than the baselines and, as a result, paves the way for further study of activation function search.
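Schematically, the difference between the two search loops can be sketched as follows (simplified, illustrative pseudocode; the function names are placeholders, not the actual implementations):

```python
def swish_style_search(candidates, train_to_convergence, evaluate):
    # Template-based search: every candidate activation requires a full training
    # run before it can be scored, which makes the search very expensive.
    return max(candidates, key=lambda act: evaluate(train_to_convergence(act)))

def tafs_style_search(theta, gnn_weights, sample_act, inner_step, outer_step, iters):
    # Iterated bi-level optimization: alternate single steps on the GNN weights and
    # on the distribution parameters, without waiting for inner convergence.
    for _ in range(iters):
        act = sample_act(theta)                      # draw activation weights from the Gaussian
        gnn_weights = inner_step(gnn_weights, act)   # one training step on the GNN
        theta = outer_step(theta, gnn_weights, act)  # one update of the distribution parameters
    return theta, gnn_weights
```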

| Dataset | Model | Param | Time |
|---|---|---|---|
| ogbg-molhiv | PAS+FPS | 27M | 17 hours |
| | Swish | N/A | >70 days |
| | APL | OOM (>150M) | N/A |
| | TAFS | +0.01M | 23 hours |
Comment

Thanks for the rebuttal. I raised my score to 5.

Comment

Thanks for the important suggestions on scalability. We answer each question below.

Q1: The experimental results are not very convincing. The experiments are all conducted on very small datasets, and I don't think the datasets for link prediction experiments are the commonly used ones in the literature. I'd recommend the authors use more standardized and commonly accepted benchmarks such as the OGB ones.

Q2. Although the proposed method is already much more efficient than the baselines Swish and APL, it still takes about 10x the runtime compared with fixed activation functions. It's hard to tell whether the performance improvements are worth such a huge overhead in time cost.

Answer: We answer Q1 and Q2 together. The unfavorable comparison in the manuscript was due to the fact that the baselines were too small, so the computational overhead appeared to be 10x, even though our method is already much more efficient than the literature. In fact, our method's parameter count has no dependency on the size of the graph, making it easily scalable. We therefore experimented on OGB and on much larger baselines, as in the tables below: three OGB datasets covering node/link/graph-level tasks, with both classical GNN baselines and top-ranking baselines. As can be seen, on larger datasets (ogbg-molhiv) and larger baselines (PAS), which take over 17 hours, our method only adds about 35% more computation time, which is much more acceptable. Neither Swish search nor APL search can be run on such large datasets due to their inefficiency in running time and over-parameterization, which leads to months of GPU hours or to out-of-memory errors.


Node-level task. Dataset: ogbn-proteins (#Nodes: 132,534, #Edges: 39,561,252)

| Method | Activation | Test ROC-AUC | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.7219 | 100K | 15.3 min |
| | Tanh | 0.6521 | 100K | 15.3 min |
| | Swish | 0.7255 | 100K | 15.3 min |
| | TAFS | 0.7483 | +300 | 16.8 min |
| GraphSage | ReLU | 0.7567 | 193K | 17.7 min |
| | Tanh | 0.7442 | 193K | 17.7 min |
| | Swish | 0.7319 | 193K | 17.7 min |
| | TAFS | 0.7768 | +300 | 19.9 min |
| GAT | ReLU | 0.8176 | 2.48M | 33.6 min |
| | Tanh | 0.8003 | 2.48M | 33.6 min |
| | Swish | 0.8299 | 2.48M | 33.6 min |
| | TAFS | 0.8682 | +300 | 41.5 min |
| GIPA | ReLU | 0.8917 | 17M | 10.7 hours |
| | Tanh | 0.8809 | 17M | 10.7 hours |
| | Swish | 0.8899 | 17M | 10.7 hours |
| | TAFS | 0.8991 | +1500 | 14.9 hours |

Link-level task. Dataset: ogbl-ddi (#Nodes: 4,267, #Edges: 1,334,889)

| Method | Activation | Test Hits@20 | Param | Time |
|---|---|---|---|---|
| GCN | ReLU | 0.3707 | 1.29M | 7 min |
| | Tanh | 0.3318 | 1.29M | 7 min |
| | Swish | 0.3772 | 1.29M | 7 min |
| | TAFS | 0.3991 | +300 | 10 min |
| GraphSage | ReLU | 0.5391 | 1.42M | 8 min |
| | Tanh | 0.5228 | 1.42M | 8 min |
| | Swish | 0.5377 | 1.42M | 8 min |
| | TAFS | 0.5512 | +300 | 13 min |
| E2N | ReLU | 0.9606 | 0.6M | 4 hours |
| | Tanh | 0.9554 | 0.6M | 4 hours |
| | Swish | 0.9515 | 0.6M | 4 hours |
| | TAFS | 0.9681 | +300 | 4.7 hours |
Comment

We thank all the reviewers for the insightful comments. Our work has been strengthened with significantly more experiments and analysis. We highlight our motivation and contributions here and reply to each question separately in the individual responses.

Motivation: Activation functions are important for Graph Neural Networks (GNNs). GNNs usually have shallow layers, and without activation functions a GNN is just a simple transformation of node features; activation functions give the GNN its nonlinear modeling capacity [1][2]. Compared to CNNs, which are usually deep, GNNs can depend more heavily on the appropriate choice of activation functions.

This motivates us to explore activation function search, which has received little attention in the literature.

Our contributions are multi-fold:

  1. Novelty: We achieve task-awareness through stochastic relaxation, and the proposed MLP gives an expressive and compact parameterization of the implicit functional space.
    • Compared to Swish: Swish takes much more time because it requires the inner optimization to converge, while TAFS uses iterated bi-level optimization that does not require inner convergence at each iteration.
    • Compared to APL: APL uses 6 times more parameters and cannot be applied to large graphs, while TAFS is scalable.
    • Compared to GReLU: GReLU is not univariate and has a number of parameters that depends on the graph size, while the number of TAFS parameters is independent of the graph size. Also, GReLU is not a search algorithm.
  2. Effectiveness: We experiment on both small and large datasets. We achieve consistent improvements over the ReLU choice, and across diverse node/link/graph-level tasks our method is the state of the art in activation function design.
  3. Efficiency: We experiment on large OGB datasets and on top-ranking solutions with up to 27M parameters. We show that TAFS finds appropriate solutions with about 35% extra computation time and negligible extra memory cost, which is not possible for the Swish or APL search methods.

As a consequence, our solution provides significant value to practitioners and to the community. Our proposed TAFS is an off-the-shelf tool for better activation function design, which provides a more efficient and effective baseline for pushing further research and emphasizes a previously neglected design aspect of GNNs.

  1. Prajit Ramachandran, Barret Zoph, Quoc V. Le: Searching for Activation Functions. ICLR 2018
  2. Bianca Iancu, Luana Ruiz, Alejandro Ribeiro, Elvin Isufi: Graph-Adaptive Activation Functions for Graph Neural Networks. MLSP 2020
AC Meta-Review

TAFS is an interesting method which puts activation functions in the spotlight of GNN design, offering a novel method for automated optimisation of activation functions in a bi-level setting.

I find the approach to be interesting, the topic timely, and the results have potential for a strong paper -- especially taking into account the rebuttal results, which the authors sadly haven't incorporated into the main paper during the discussion phase.

In my opinion, there are still several areas of improvement for the work as currently presented (including the rebuttal), but to keep this meta-review concise, I highlight the most actionable one -- the presentation and discussion of the OGB results.

The authors added a lot of OGB results during the rebuttal, however they provided no error bars / standard deviations around their estimates -- which, in absence of any other data, forces me to conclude that these are single-run results. Further, there are no details provided about the models considered in the OGB tables; most importantly: whether the gains in performance could have also been obtained through careful hyperparameter tuning outside of the activation function. And I am doubtful that ReLU is the only useful baseline to use for these tasks---there are other "fixed" activation functions that could have been compared against.

I would have much rather preferred if the authors had focussed only on one OGB task/base architecture, but invested more effort into robust error estimates and actually convincing us that the stated gains came from TAFS and could not have been easily obtained in other ways. In the present form, especially given unanimous rejection ratings by all reviewers, I think a rejection is the most sensible outcome for this work. I do hope the authors will revise, refine and contextualise their OGB results, and come back stronger in the next revision cycle.

Why Not a Higher Score

Reviewers unanimously recommend rejection after rebuttal, and more work is needed to robustify the results presented within it.

Why Not a Lower Score

N/A

Final Decision

Reject