PaperHub
Rating: 5.8 / 10
Poster · 4 reviewers
Scores: 5, 6, 6, 6 (min 5, max 6, std dev 0.4)
Confidence: 3.3
ICLR 2024

Structural Fairness-aware Active Learning for Graph Neural Networks

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-04-11

Abstract

Keywords

Active Learning, Graph Neural Networks, Structural Fairness

Reviews and Discussion

Official Review
Rating: 5

The study focuses on enhancing the performance of GNNs for semi-supervised node classification, even when high-quality labeled samples are scarce. Traditional active learning methods may not work optimally in graph data, given their unique structures and the bias introduced by the positioning of labeled nodes. To address this, the researchers introduce a unified optimization framework called SCARCE, which can be combined with node features. Their experiments confirm that this method not only enhances GNN performance but also helps mitigate structural bias and improve fairness in the results.

Strengths

  1. The comparison with many baselines in this paper is very good.

Weaknesses

  1. The paper is not easy to follow. I suggest that existing work on fairness in GNNs be discussed. Otherwise, it is very hard to estimate the significance of this work.

  2. My major concern is that the fairness definition in this paper is not very clear. We usually use demographic parity (DP) or equalized odds (EO) to measure fairness. However, this paper only uses the Standard Deviation (SD) and the Coefficient of Variation. Why do the authors consider them instead of DP and EO?

  3. Figure 6 and Figure 7 are also not very clear to me. The authors discussed that 'SCARCE, which combines both SIS and LPS variance, SCARCE can not only elevate overall performance but also attain commendable fairness'. However, it is very hard to get this result from these two figures. I suggest the authors provide more details for examination.

  4. This paper should focus on fairness instead of classification accuracy. However, Tables 1 and 2 provide more details about the classification accuracy. There should be a trade-off between accuracy and fairness. Only showing accuracy does not make any sense. In addition, how do the authors balance the trade-off between accuracy and fairness in this paper? I do not find any implementation details related to this.

Questions

See Weaknesses.

Comment

Q2: My major concern is that the fairness definition in this paper is not very clear. We usually use demographic parity (DP) or equalized odds (EO) to measure fairness. However, this paper only uses the Standard Deviation (SD) and the Coefficient of Variation. Why do the authors consider them instead of DP and EO?

R2: The performance fairness issue that we focused on in this paper is caused by the label position bias. We would like to first provide the definitions:

Label Position Bias & Structural Fairness: In this paper, we use the notion of "Label Position Bias", a critical aspect of performance bias in GNNs. This bias arises when the performance of a GNN model varies for nodes based on their proximity to labeled nodes in the graph [1]. Formally, in a graph $G$ with the set of labeled nodes $V_L$ and unlabeled nodes $V_U$, a GNN model trained on $V_L$ may exhibit varying performance for two nodes $i, j \in V_U$ based on their "structural distance" to $V_L$. A key metric we use to measure this distance is the Label Proximity Score $LPS_i = \sum_{j \in V_L} P_{ij}$, where $P$ is the personalized PageRank matrix. The LPS quantifies the proximity of unlabeled nodes to labeled nodes, and a higher LPS score typically correlates with better model performance for that node [1]. [1] also found that a low variance of the LPS scores, which indicates that all nodes have a similar "distance" to the labeled nodes, can mitigate the label position bias and improve the performance fairness of GNNs. As a result, we use the variance of the per-node LPS scores as the measure of structural fairness.
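For concreteness, here is a minimal NumPy sketch of how the LPS and its variance could be computed, assuming a dense symmetrically normalized personalized PageRank matrix with teleport probability `alpha` (the paper's exact propagation and parameters may differ; this is an illustration only):

```python
import numpy as np

def personalized_pagerank_matrix(A, alpha=0.15):
    """Dense PPR matrix P = alpha * (I - (1 - alpha) * D^{-1/2} A D^{-1/2})^{-1}.
    Illustrative sketch; the paper's normalization may differ."""
    n = A.shape[0]
    d = A.sum(axis=1)
    d[d == 0] = 1.0  # guard isolated nodes against division by zero
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)

def lps_and_variance(A, labeled_idx, alpha=0.15):
    """LPS_i = sum over labeled j of P_ij; returns per-node scores and their variance."""
    P = personalized_pagerank_matrix(A, alpha)
    lps = P[:, labeled_idx].sum(axis=1)
    return lps, lps.var()
```

Nodes structurally closer to the labeled set receive a higher LPS, and a labeling with low `lps.var()` corresponds to the structural fairness criterion described above.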

Measuring Performance Fairness: To quantify performance fairness, we categorize nodes into different sensitive groups based on their LPS scores. For each group, we calculate the average accuracy of the GNNs and use metrics such as Standard Deviation to measure the performance disparity across these groups.

In fact, when considering multiple sensitive groups, the concepts of DP and SD are closely related. If there are only two sensitive groups $i$ and $j$ with accuracies $a_i$ and $a_j$, then $DP = |a_i - a_j|$. However, if there are multiple sensitive groups, $DP = \sum_i |a_i - \bar{a}|$, where $\bar{a}$ is the average accuracy. Therefore, we use the Standard Deviation $SD = \sqrt{\sum_i (a_i - \bar{a})^2}$ to measure fairness. In our paper, we use the Label Proximity Score to split the nodes into multiple sensitive groups. This realization aligns with your suggestion, and we will update our terminology about SD to reflect this more accurately in our revised paper. Additionally, we use the Coefficient of Variation (CV) because it normalizes the standard deviation by the mean, $CV = SD / \bar{a}$. This normalization is particularly useful for comparing the relative disparity in performance across models with different overall accuracy levels, allowing us to assess fairness in a way that accounts for differences in model performance.
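As a small illustration of these two metrics, implemented exactly as the formulas are written in the response (note the SD here uses an unnormalized sum over groups):

```python
import numpy as np

def group_fairness_metrics(group_accs):
    """Given per-group accuracies a_i, compute
    SD = sqrt(sum_i (a_i - a_bar)^2) and CV = SD / a_bar."""
    a = np.asarray(group_accs, dtype=float)
    a_bar = a.mean()
    sd = np.sqrt(((a - a_bar) ** 2).sum())
    return sd, sd / a_bar
```

Equal per-group accuracies give SD = CV = 0 (perfect fairness under this measure), and dividing by the mean lets models with different overall accuracy be compared on relative disparity.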

Q3: Figure 6 and Figure 7 are also not very clear to me. The authors discussed that "SCARCE, which combines both SIS and LPS variance, SCARCE can not only elevate overall performance but also attain commendable fairness". However, it is very hard to get this result from these two figures. I suggest the authors provide more details for examination.

R3: Thank you for your feedback on Figures 6 and 7, which may indeed be hard to read. The proposed SCARCE combines two metrics, i.e., the LPS variance and SIS, to strategically select nodes for annotation. Figures 6 and 7 showcase the impact of these metrics on the Cora and CiteSeer datasets, respectively. In each figure, the left subfigure reports the accuracy, and the right subfigure reports the fairness (Standard Deviation) of using different metrics. From Figures 6 and 7, we observe:

  • For accuracy, the proposed SCARCE, which combines both LPS variance and SIS, consistently achieves better accuracy compared to using either metric alone.
  • For fairness, using the LPS variance alone tends to cause a low SD (better fairness). This finding is consistent with our motivation for leveraging the LPS variance to address the fairness issue. The SIS alone may lead to a higher SD. However, when combining these two metrics in SCARCE, we observe that the model maintains commendable fairness (lower SD).
Comment

Q4: This paper should focus on fairness instead of classification accuracy. However, Tables 1 and 2 provide more details about the classification accuracy. There should be a trade-off between accuracy and fairness. Only showing accuracy does not make any sense. In addition, how to balance the trade-off between accuracy and fairness in this paper. I do not find any implementation details related to this.

R4: We would like to clarify that the main focus of this paper is active learning for graph neural networks, which aims to maximize the model's performance within a labeling budget. We propose a unified framework that effectively combines different metrics in active learning. As discussed in R1, both active learning and mitigating label position bias are fundamentally related to the labeling of nodes in a graph. Besides, structural fairness (low LPS variance) is also related to the overall performance of GNNs, as demonstrated in the preliminary study. Therefore, we leverage structural fairness as one metric in active learning to maximize GNNs' performance. Meanwhile, we found this metric can also improve performance fairness, as shown in Figures 4 & 5 in our paper. Therefore, there is no inherent trade-off between accuracy and fairness in our setting: as demonstrated in Figures 6 & 7, incorporating the LPS variance improves both the model's overall performance and its performance fairness. As a result, the proposed method bridges the gap between active learning and mitigating the label position bias in GNNs, showing that addressing fairness can be synergistic with achieving high model performance.

We hope that we have addressed the concerns in your comments, and please kindly let us know if there is any further concern, and we are happy to clarify.

Comment

Dear Reviewer RKZS,

We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.

Q1: I suggest the existing work on fairness in GNN should be discussed. Otherwise, it is very hard to estimate the significance of this work.

R1: Thank you for your suggestions. It is important to clarify that the type of fairness our paper addresses differs significantly from the commonly studied attribute bias, which pertains to model performance disparity across different sensitive attribute groups, such as gender and race.

Our work concentrates on structural fairness, specifically tackling the issue of "Label Position Bias" in GNNs. This bias emerges due to the varying performance of a GNN model for nodes based on their "structural distance" to labeled nodes in the graph. Research [1] uses the Label Proximity Score (LPS) to quantify this proximity, with a higher LPS indicating a closer "distance" to labeled nodes and typically correlating with better model performance. They found that a low variance of the LPS scores, which means all nodes have a similar "distance" to the labeled nodes, can mitigate the label position bias and improve the performance fairness of GNNs. Thus, we use the LPS variance to measure structural fairness. One natural way of reducing the LPS variance is to strategically select the labeled nodes based on their position in the graph.

The goal of active learning is to strategically select some nodes to label that can maximize the model's performance. Therefore, both active learning and mitigating the label position bias are fundamentally linked to the labeling of nodes in a graph. Our preliminary studies illustrate that structure fairness (Low LPS variance) is also related to the overall performance of GNNs, which is the central focus of active learning. Therefore, in this work, we use the LPS variance as one metric for active learning to select nodes to label. The proposed method not only improves performance but also addresses the fairness issue.

As for the existing works on fairness in GNNs, most of them focus on attribute fairness, which is not related to the selection of labeled nodes. Therefore, they are not related to labeling, which is the target of active learning. The pioneering work [1] identifies label position bias and proposes solving it by learning an unbiased graph structure. Our contribution differs as we approach the issue from a labeling perspective, offering an alternative solution to the problem.

Additionally, our method has the advantage of performing well even when node features are noisy or missing, as shown in the experiments in Appendix Section B.3 of our revised paper. For your convenience, we also present partial results here:

  1. Noise feature setting: we add different levels of Gaussian noise to the node features.
| GCN | Cora | | | | | CiteSeer | | | | |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Noise | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| Random | 64.52±4.50 | 51.76±4.19 | 44.76±4.12 | 40.54±4.12 | 38.65±3.33 | 49.16±8.22 | 36.36±5.71 | 29.53±4.38 | 27.05±2.07 | 26.34±1.85 |
| Degree | 61.58±0.44 | 54.84±0.66 | 51.25±0.60 | 49.82±0.62 | 47.37±0.79 | 44.37±0.88 | 37.27±0.47 | 34.28±0.56 | 34.53±0.46 | 34.12±0.56 |
| Pagerank | 60.77±1.16 | 43.22±2.59 | 34.96±2.23 | 32.86±0.94 | 32.76±1.27 | 49.34±2.77 | 33.40±2.49 | 27.99±1.22 | 25.41±1.95 | 25.08±1.74 |
| FeatProp | 77.51±0.61 | 45.93±2.18 | 38.50±1.63 | 32.94±0.48 | 26.86±1.45 | 52.30±2.47 | 33.16±1.62 | 27.59±1.84 | 26.10±1.25 | 24.88±0.98 |
| GraphPart | 79.22±0.48 | 62.17±1.24 | 51.80±0.85 | 45.01±1.05 | 40.83±1.56 | 52.37±1.13 | 44.63±1.05 | 38.91±0.66 | 35.59±0.92 | 34.42±1.51 |
| SCARCE | 75.13±0.81 | 67.23±0.70 | 59.42±0.47 | 55.90±0.63 | 53.45±0.96 | 59.50±2.53 | 50.35±0.56 | 41.61±0.83 | 37.23±0.63 | 35.23±1.11 |
  2. Missing feature setting: we use the one-hot ID as node features.

| GCN | Cora | | | CiteSeer | | |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Budget | 5C | 10C | 20C | 5C | 10C | 20C |
| Random | 44.76±5.83 | 56.47±2.30 | 64.03±1.74 | 28.41±2.69 | 34.78±3.06 | 40.58±2.61 |
| Degree | 44.19±7.98 | 51.69±2.82 | 63.26±1.95 | 27.33±1.97 | 31.74±2.79 | 40.99±2.02 |
| Pagerank | 39.19±3.55 | 47.95±3.94 | 63.93±3.85 | 24.03±1.96 | 32.83±2.21 | 40.35±2.02 |
| FeatProp | 47.75±3.16 | 58.84±3.22 | 65.66±1.38 | 29.79±2.32 | 34.79±2.00 | 41.89±2.71 |
| GraphPart | 55.37±4.06 | 57.40±4.24 | 67.60±2.61 | 38.27±2.08 | 37.40±2.56 | 43.61±3.15 |
| SCARCE | 61.43±2.31 | 68.64±1.94 | 69.59±1.22 | 37.89±1.94 | 43.51±1.44 | 48.20±1.15 |

From the above results, we find that feature-based active learning methods, such as FeatProp, do not perform well when the noise level is high or the features are missing entirely. In contrast, our proposed SCARCE leverages only the graph structure and performs well in both scenarios.

[1] Towards Label Position Bias in Graph Neural Networks, NeurIPS'23

Comment

Dear Reviewer RKZS,

We would like to express our sincere gratitude to you for reviewing our paper and providing valuable feedback. Could we kindly know if the responses have addressed your concerns? If there are any further questions, we are happy to clarify. Thank you.

Best,

All authors

Comment

Dear Authors,

Thank you for your response. While I appreciate your elaboration on the contributions and objectives of the paper, I believe the emphasis on fairness may not be entirely appropriate in this paper. It appears that active learning is the primary focus, yet the definitions and applications of fairness within the paper lack clarity.

Most importantly, it's worth noting that this concern is not unique to me. I checked the other reviewers' comments; Reviewer UJDy shares the same concern.

Given this, my recommendation would be to revise the paper with a reduced emphasis on fairness, possibly even removing the term from the title. This is merely a suggestion, and I understand it's ultimately at your discretion.

After reviewing your response, which addressed some of my questions, I am inclined to adjust my evaluation. I will raise my score from 3 to 5.

Thanks.

Reviewer RKZS.

Comment

Dear Reviewer RKZS,

Thank you for your prompt reply and great suggestion. We are glad to know that our previous responses have addressed most of your concerns, and we are happy to address your remaining question: the term "fairness".

We agree that the current wording on "fairness" may lead to ambiguity, and we appreciate your guidance in enhancing the clarity and impact of our work. The fairness discussed in this paper concerns a recently proposed concept, "label position bias" (or "bias" for short once the context is clear), which is unique to graphs and quite different from the commonly discussed fairness; this has confused you and another reviewer. Therefore, to fully address this ambiguity, we would like to change this wording to "label position bias" for better clarity.

Furthermore, we propose the following revision plan to remove the ambiguity you are concerned about:

  1. Terminology: Throughout this paper, we will change the "fairness" to "bias", specifically for the label position bias, which is caused by the labeling of nodes in graphs. Mitigating the label position bias is well aligned with the goal of active learning - labeling.

  2. Title: We will change the title to "Structural Bias-Aware Active Learning for Graph Neural Networks".

  3. Introduction: In paragraph 3, we will introduce more about label position bias, such as the problems caused by it. For example, in real-world applications like fraud detection, where labeled data might be scarce, users far away from these labeled nodes are at a higher risk of being misclassified due to the label position bias. Then, we will point out that unbiased labeling can mitigate the label position bias and improve overall performance as demonstrated in the preliminary study. This builds the connection between active learning and mitigating the label position bias.

  4. Preliminary: In section 2.2, we will define unbiased labeling as node labeling with small LPS variance. Through the preliminary studies, we demonstrate that unbiased labeling can improve the overall performance while mitigating the label position bias. As a result, we can use it as one of the structural criteria in active learning.

  5. Experiments: Section 4.3 will be renamed from "Fairness Comparison" to "Evaluating Bias Mitigation Performance" to reduce ambiguity; In Section 4.4, we will introduce experiments under noise and missing feature scenarios to demonstrate the effectiveness of our proposed structural criteria.

These revisions will significantly diminish any ambiguity associated with the term "fairness" and establish a clearer relationship between active learning and the mitigation of label position bias in graph neural networks. Moreover, the proposed revisions are minor and should be easy to accomplish while mitigating confusion.

We hope these changes meet your expectations and contribute to a more precise and impactful paper. If you are satisfied with the revision plan, we will upload the revised version. If you have any further questions or suggestions, we are happy to discuss them with you.

Best Regards,

All Authors

Official Review
Rating: 6

To leverage graph structure and mitigate structural bias in active learning, the authors present a unified optimization framework.

Strengths

Originality: The investigation of structure fairness using active learning is something new.

Quality: The technical quality is below average. Many details are missing. For example, label position bias is a new term that was recently proposed, and the authors should elaborate more on it with a more intuitive explanation instead of just some formula. Also, at the beginning of Sec. 3.1, $t$ is a binary vector, and thus it should be $t \in \{0,1\}^n$. The relaxation in the paper does not make sense to the reviewer.

Clarity: In general, it is ok. The reviewer understands how the proposed method works but sometimes fails to see why.

Significance: The fairness and active learning problems for graphs are important.

Weaknesses

  1. The paper tries to solve the structure fairness problem using active learning. However, the connection between these two is weak, and the reviewer does not find any strong motivations to do so. The authors claim that "in active learning, strategically choosing labeling nodes, represented by t, can potentially reduce the LPS variance, promoting fairness in GNNs" in the second para. of Sec. 2.2, but the reviewer does not find any theoretical guarantees to motivate this finding.

  2. The goal of active learning is different from mitigating the bias in graphs, and the ultimate goal of AL is to use as few labeled nodes as possible to achieve the best prediction performance. Therefore, the motivation of this work is totally unclear.

  3. The paper does not provide any theoretical proof to support the findings or the motivations. The relaxation used in the unified framework is also misleading.

Questions

  1. Why can we use the relaxation of the binary vector t to its convex hull?
  2. How does the proposed method solve the fairness issue from the theoretical aspect?
  3. What is the formal definition of the structure bias in graphs, and how can we quantify it?
  4. Why do we use active learning to solve the fairness issue in graphs? What if we do not have access to the oracle?
Comment

Q6: How does the proposed method solve the fairness issue from the theoretical aspect?

R6: In this work, we focus on the performance fairness issue caused by the labeling of nodes, which is different from the traditional fairness issue caused by sensitive attributes, such as gender and race. To solve this performance fairness issue, we directly optimize the structural fairness (low LPS variance) by selecting the nodes for annotation (active learning). We empirically demonstrate that the proposed method can mitigate the performance fairness issue, as shown in Figures 4 and 5 of our paper. Besides, the LPS score of node $y$ is related to the influence score from labeled node $x$, defined as the absolute values of entries of the Jacobian matrix $\left[\frac{\partial h_x^{(k)}}{\partial h_y^{(0)}}\right]$ [7], where $h_x^{(k)}$ is the representation of node $x$ in a $k$-layer GNN. [1] found that a lower LPS variance suggests a more uniform influence of all labeled nodes on each unlabeled node across the graph, hinting at greater fairness.

[7] Representation Learning on Graphs with Jumping Knowledge Networks, ICML'18
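To build intuition for the influence score above: under a simplifying linear (SGC-style) propagation $H^{(k)} = S^k X$ (real GNNs are non-linear, so this is only an illustrative assumption), the Jacobian influence of one node on another reduces to an entry of the $k$-step propagation matrix:

```python
import numpy as np

def linear_influence(A, k=2):
    """Influence matrix |(S^k)_{xy}| for a linear SGC-style model H^(k) = S^k X,
    with S the symmetrically normalized adjacency. Illustrative sketch only."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0  # guard isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    return np.abs(np.linalg.matrix_power(S, k))
```

In this linear view, a low LPS variance means the rows of the labeled-column sums of this influence matrix are roughly uniform, matching the "uniform influence" reading above.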

Q7: Why do we use active learning to solve the fairness issue in graphs? What if we do not have access to the oracle?

R7: As mentioned in R1 and R3, both active learning and structure fairness issues are related to the labeling of nodes in a graph. Thus, we leverage the principles of active learning - strategic node selection for labeling - to improve structure fairness and thus solve the performance fairness issue of GNNs.

Regarding the concern about reliance on an oracle for annotations: in active learning, the oracle is typically a necessary component for providing labels for the selected samples. If we do not have access to an oracle, we might need a pre-trained model to do the annotation. For example, we can use a Large Language Model as an annotator, as in [8].

[8] Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs. Arxiv'23

We hope that we have addressed the concerns in your comments, and please kindly let us know if there is any further concern, and we are happy to clarify.

Comment

Q4: The paper does not provide any theoretical proof to support the findings or the motivations.

R4: We appreciate the concern regarding the lack of theoretical proof in our paper. Our research, focusing on the effectiveness of our approach in the context of active learning for GNNs, is primarily grounded in empirical evaluations. This decision is partly informed by the nature of active learning itself, which, as suggested by the No Free Lunch Theorem, does not subscribe to a one-size-fits-all strategy. Active learning metrics often vary significantly based on the specific datasets and contexts they are applied to. As a result, most active learning approaches use some heuristic metrics to select samples [2][3].

To illustrate this, let's consider the example of the FeatProp method, which is based on the assumption of linearity in GNNs. This method presupposes that the original features can represent the final node representations in GNNs. However, this assumption often falls short due to the inherent non-linearity of neural networks. As such, in scenarios where the original features are compromised – for instance, in the presence of noise or feature missing – the FeatProp method's performance can be significantly impacted, as shown in the experiments in Appendix Section B.3 of our revised paper. For your convenience, we also present partial results here:

  1. Noise feature setting: we add different levels of Gaussian noise to the original features.

| GCN | Cora | | | | | CiteSeer | | | | |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Noise | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| Random | 64.52±4.50 | 51.76±4.19 | 44.76±4.12 | 40.54±4.12 | 38.65±3.33 | 49.16±8.22 | 36.36±5.71 | 29.53±4.38 | 27.05±2.07 | 26.34±1.85 |
| Degree | 61.58±0.44 | 54.84±0.66 | 51.25±0.60 | 49.82±0.62 | 47.37±0.79 | 44.37±0.88 | 37.27±0.47 | 34.28±0.56 | 34.53±0.46 | 34.12±0.56 |
| Pagerank | 60.77±1.16 | 43.22±2.59 | 34.96±2.23 | 32.86±0.94 | 32.76±1.27 | 49.34±2.77 | 33.40±2.49 | 27.99±1.22 | 25.41±1.95 | 25.08±1.74 |
| FeatProp | 77.51±0.61 | 45.93±2.18 | 38.50±1.63 | 32.94±0.48 | 26.86±1.45 | 52.30±2.47 | 33.16±1.62 | 27.59±1.84 | 26.10±1.25 | 24.88±0.98 |
| GraphPart | 79.22±0.48 | 62.17±1.24 | 51.80±0.85 | 45.01±1.05 | 40.83±1.56 | 52.37±1.13 | 44.63±1.05 | 38.91±0.66 | 35.59±0.92 | 34.42±1.51 |
| SCARCE | 75.13±0.81 | 67.23±0.70 | 59.42±0.47 | 55.90±0.63 | 53.45±0.96 | 59.50±2.53 | 50.35±0.56 | 41.61±0.83 | 37.23±0.63 | 35.23±1.11 |
  2. Missing feature setting: we use the one-hot ID as node features.

| GCN | Cora | | | CiteSeer | | |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Budget | 5C | 10C | 20C | 5C | 10C | 20C |
| Random | 44.76±5.83 | 56.47±2.30 | 64.03±1.74 | 28.41±2.69 | 34.78±3.06 | 40.58±2.61 |
| Degree | 44.19±7.98 | 51.69±2.82 | 63.26±1.95 | 27.33±1.97 | 31.74±2.79 | 40.99±2.02 |
| Pagerank | 39.19±3.55 | 47.95±3.94 | 63.93±3.85 | 24.03±1.96 | 32.83±2.21 | 40.35±2.02 |
| FeatProp | 47.75±3.16 | 58.84±3.22 | 65.66±1.38 | 29.79±2.32 | 34.79±2.00 | 41.89±2.71 |
| GraphPart | 55.37±4.06 | 57.40±4.24 | 67.60±2.61 | 38.27±2.08 | 37.40±2.56 | 43.61±3.15 |
| SCARCE | 61.43±2.31 | 68.64±1.94 | 69.59±1.22 | 37.89±1.94 | 43.51±1.44 | 48.20±1.15 |

While we acknowledge the importance of theoretical backing in scientific research, our current focus has been on demonstrating the practical efficacy and adaptability of our approach in diverse and realistic settings. Moving forward, we aim to complement our empirical findings with theoretical analysis to provide a more holistic understanding of our approach's mechanics and its broader applicability.

[2] A Survey of Deep Active Learning, CSUR'21

[3] A Comparative Survey of Deep Active Learning, Arxiv'22

Q5: The relaxation used in the unified framework is also misleading. Why can we use the relaxation of the binary vector t to its convex hull?

R5: The relaxation of binary vectors to their convex hull is a well-established technique in combinatorial optimization [4], particularly when dealing with complex optimization problems that are otherwise computationally intractable. This approach allows for the application of more efficient, continuous optimization methods. It is commonly used in gradient-based attacks [5][6]; for example, [5] leverages this relaxation to perform topology attacks on graphs.

[4] LP Relaxations for Combinatorial Relaxation

[5] Topology attack and defense for graph neural networks: An optimization perspective, Arxiv'19

[6] Probabilistic Categorical Adversarial Attack and Adversarial Training, ICML'23
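As a generic illustration of this standard technique (not the paper's actual objective; `grad_fn`, `project_box_budget`, and `relaxed_selection` are hypothetical names introduced here): relax $t \in \{0,1\}^n$ to the box $[0,1]^n$ with a budget constraint $\sum_i t_i = b$, run projected gradient descent, then round back to a discrete selection:

```python
import numpy as np

def project_box_budget(t, budget, iters=50):
    """Project onto {t in [0,1]^n : sum(t) = budget} by bisecting on the
    dual variable mu of the sum constraint."""
    lo, hi = t.min() - 1.0, t.max() + 1.0
    for _ in range(iters):
        mu = (lo + hi) / 2
        if np.clip(t - mu, 0.0, 1.0).sum() > budget:
            lo = mu  # too much mass: shift threshold up
        else:
            hi = mu
    return np.clip(t - (lo + hi) / 2, 0.0, 1.0)

def relaxed_selection(grad_fn, n, budget, steps=100, lr=0.1, seed=0):
    """Projected gradient descent on the relaxed t, then keep the top-`budget`
    entries as the discrete node selection."""
    rng = np.random.default_rng(seed)
    t = project_box_budget(rng.random(n), budget)
    for _ in range(steps):
        t = project_box_budget(t - lr * grad_fn(t), budget)
    return np.argsort(-t)[:budget]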

Comment

Dear Reviewer tqC2,

We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.

Q1: The paper tries to solve the structure fairness problem using active learning. However, the connection between these two is weak, and the reviewer does not find any strong motivations to do so.

R1: Thank you for your feedback. We would like to clarify that the main focus of this paper is active learning for graph neural networks, which aims to maximize the model performance within a labeling budget. There are two main reasons why we connect the structural fairness problem with active learning.

First, the structural fairness discussed in our paper is based on the labeling. We explore "Label Position Bias" in GNNs, a critical performance fairness issue arising from varying performance based on nodes' "structural distance" to labeled nodes. This distance can be measured by the Label Proximity Score (LPS). Research [1] indicates that nodes closer to labeled nodes (higher LPS) typically perform better. They found that a low variance of the LPS scores, which means all nodes have a similar "distance" to the labeled nodes, can mitigate the label position bias and improve the performance fairness of GNNs. Thus, we use the LPS variance to measure structural fairness. One natural way of reducing the LPS variance is to strategically select the labeled nodes based on their position in the graph. Meanwhile, the goal of active learning is to select proper nodes for annotation that maximize the model performance. Thus, both problems are related to labeling.

Second, our preliminary studies found that structural fairness (low LPS variance) can improve the overall model performance. As a result, we use the LPS variance as one metric to guide active learning. Meanwhile, since structural fairness (low LPS variance) also improves performance fairness, the proposed method improves both the model performance and the performance fairness of GNNs.

[1] Towards Label Position Bias in Graph Neural Networks, NeurIPS'23

Q2: What is the formal definition of the structure bias in graphs, and how can we quantify it?

R2: Label Position Bias & Structural Fairness: Label Position Bias arises when the performance of a GNN model varies for nodes based on their proximity to labeled nodes in the graph [1]. Formally, in a graph $G$ with the set of labeled nodes $V_L$ and unlabeled nodes $V_U$, a GNN model trained on $V_L$ may exhibit varying performance for two nodes $i, j \in V_U$ based on their "structural distance" to $V_L$. A key metric we use to measure this distance is the Label Proximity Score $LPS_i = \sum_{j \in V_L} P_{ij}$, where $P$ is the personalized PageRank matrix. The LPS quantifies the proximity of unlabeled nodes to labeled nodes, and a higher LPS score typically correlates with better model performance for that node [1]. We use the variance of the per-node LPS scores as the measure of structural fairness.

Measuring Performance Fairness: To quantify performance fairness, we categorize nodes into different sensitive groups based on their LPS scores. For each group, we calculate the average accuracy of the GNNs and use metrics such as Standard Deviation to measure the performance disparity across these groups.

Q3: The goal of active learning is different from mitigating the bias in graphs, and the ultimate goal of AL is to use as few labeled nodes as possible to achieve the best prediction performance. Therefore, the motivation of this work is totally unclear.

R3: The label position bias studied in this paper is different from the traditional attribute bias issue. It is unique to graphs and is related to the labeled nodes' positions in the graphs. A natural solution to mitigate the label position bias is to select nodes with proper positions in the graph to label. Meanwhile, as you mentioned, the ultimate goal of AL is to select the proper nodes to label to achieve the best prediction performance. As a result, both active learning and mitigating the bias are fundamentally related to the labeling of nodes in a graph.

In the preliminary studies, we found that structural fairness (low LPS variance), which reflects the label position bias, is also related to the overall performance of GNNs. Therefore, we leverage the LPS variance as one metric in active learning to improve the overall model performance. This bridges the gap between active learning and mitigating the label position bias in GNNs, showing that addressing fairness can be synergistic with achieving high model performance.

Comment

Dear Reviewer tqC2,

We would like to express our sincere gratitude to you for reviewing our paper and providing valuable feedback. Could we kindly know if the responses have addressed your concerns? If there are any further questions, we are happy to clarify. Thank you.

Best,

All authors

Comment

Thanks for the detailed rebuttal. Most of my concerns have been addressed, and I have raised my score.

Comment

Dear Reviewer tqC2,

Thanks for your response and support. We are glad to know that our rebuttal has addressed your concerns. Please let us know in case there remain outstanding concerns, and if so, we will be happy to respond.

Best Regards,

All Authors

Official Review
Rating: 6

This paper proposes a unified optimization framework for active learning on graph neural networks (GNNs) that can flexibly incorporate different selection criteria such as structure inertia score (SIS) and label proximity score (LPS) variance. It is empirically demonstrated that SCARCE outperforms existing baselines on node classification tasks across multiple benchmark datasets. In particular, SCARCE achieves higher accuracy than methods like FeatProp and GraphPart while also enhancing fairness by reducing variance in LPS across nodes.

Strengths

(1) This paper is generally well-organized and easy to follow.

(2) The proposed unified optimization framework is flexible and does not require extensive hyperparameter tuning, which is especially useful for active learning. In addition, the scalability seems promising as well.

(3) The superiority in utility and fairness seems significant given the results presented in Sections 4.3 and 4.4.

Weaknesses

(1) The paper lacks a formal introduction of the fairness notion at the beginning.

(2) Despite the discussion on scalability, this paper does not perform any experiments on large-scale network datasets.

(3) Only performing experiments on two GNN backbones undermines the claimed superiority of the proposed framework. In addition, one advantage of this paper lies in its applicability to featureless networks, which is not tested either.

Questions

(1) I would suggest adding a formal introduction of the fairness notion studied in this paper in Section 2, and adding a corresponding descriptive discussion in the Introduction.

(2) If the proposed framework can be easily generalized onto large network data, will the performance superiority still be maintained?

(3) If the proposed framework can be easily generalized onto featureless network data, will the performance superiority still be maintained? Note that in such cases, the feature input of GNNs can be generated following traditional ways.

(4) Can the proposed framework achieve generally good performance across different state-of-the-art GNN backbones? It would be better to adopt more backbones for experiments.

Ethics Concern Details

N/A

Comment

R4 Continued: The results of the noise feature setting are shown below:

GCN backbone:

| Method | Cora 0.1 | Cora 0.3 | Cora 0.5 | Cora 0.7 | Cora 0.9 | CiteSeer 0.1 | CiteSeer 0.3 | CiteSeer 0.5 | CiteSeer 0.7 | CiteSeer 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 64.52±4.50 | 51.76±4.19 | 44.76±4.12 | 40.54±4.12 | 38.65±3.33 | 49.16±8.22 | 36.36±5.71 | 29.53±4.38 | 27.05±2.07 | 26.34±1.85 |
| Degree | 61.58±0.44 | 54.84±0.66 | 51.25±0.60 | 49.82±0.62 | 47.37±0.79 | 44.37±0.88 | 37.27±0.47 | 34.28±0.56 | 34.53±0.46 | 34.12±0.56 |
| Pagerank | 60.77±1.16 | 43.22±2.59 | 34.96±2.23 | 32.86±0.94 | 32.76±1.27 | 49.34±2.77 | 33.40±2.49 | 27.99±1.22 | 25.41±1.95 | 25.08±1.74 |
| FeatProp | 77.51±0.61 | 45.93±2.18 | 38.50±1.63 | 32.94±0.48 | 26.86±1.45 | 52.30±2.47 | 33.16±1.62 | 27.59±1.84 | 26.10±1.25 | 24.88±0.98 |
| GraphPart | 79.22±0.48 | 62.17±1.24 | 51.80±0.85 | 45.01±1.05 | 40.83±1.56 | 52.37±1.13 | 44.63±1.05 | 38.91±0.66 | 35.59±0.92 | 34.42±1.51 |
| SCARCE | 75.13±0.81 | 67.23±0.70 | 59.42±0.47 | 55.90±0.63 | 53.45±0.96 | 59.50±2.53 | 50.35±0.56 | 41.61±0.83 | 37.23±0.63 | 35.23±1.11 |

APPNP backbone:

| Method | Cora 0.1 | Cora 0.3 | Cora 0.5 | Cora 0.7 | Cora 0.9 | CiteSeer 0.1 | CiteSeer 0.3 | CiteSeer 0.5 | CiteSeer 0.7 | CiteSeer 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 69.52±5.14 | 60.06±4.43 | 55.21±4.19 | 53.16±3.60 | 51.51±4.42 | 50.91±7.05 | 39.35±4.37 | 35.08±5.96 | 32.97±3.49 | 31.77±4.15 |
| Degree | 67.59±0.92 | 61.86±0.55 | 58.98±0.67 | 56.29±0.82 | 55.81±0.88 | 50.83±1.58 | 41.98±1.10 | 39.77±0.62 | 36.92±0.38 | 36.84±0.51 |
| Pagerank | 70.87±1.70 | 54.90±2.85 | 50.41±1.87 | 44.75±1.93 | 43.62±2.49 | 52.08±2.55 | 41.20±2.01 | 34.63±2.43 | 30.68±2.27 | 31.78±1.99 |
| FeatProp | 80.17±0.77 | 48.23±1.72 | 44.64±1.36 | 35.49±0.77 | 35.76±0.86 | 54.49±1.66 | 36.12±1.34 | 27.69±1.57 | 25.43±0.81 | 24.31±1.53 |
| GraphPart | 80.68±0.41 | 68.76±1.80 | 66.11±1.05 | 59.14±2.04 | 54.07±2.22 | 60.38±1.40 | 45.39±1.36 | 40.97±1.14 | 39.19±1.78 | 38.79±1.11 |
| SCARCE | 79.39±0.67 | 75.89±0.66 | 72.46±0.20 | 70.10±1.21 | 69.19±0.78 | 62.58±1.24 | 54.76±0.82 | 48.60±0.63 | 45.67±0.55 | 44.28±0.60 |

From the above results, we make the following observations:

  • As the level of noise in the features increases, the performance of the feature-based methods drops substantially. For example, FeatProp performs even worse than random selection.
  • The proposed SCARCE, which leverages only the graph structure in active learning, outperforms the baselines by a large margin.
  2. Missing Feature Setting: To further test the resilience of SCARCE under challenging feature conditions, we conducted experiments in a setting where all features are missing. In this scenario, we replaced the original node features with one-hot ID features. We conducted experiments on both the Cora and CiteSeer datasets with labeling budgets of 5C, 10C, and 20C using both GCN and APPNP models. The results are shown below:
GCN backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 44.76±5.83 | 56.47±2.30 | 64.03±1.74 | 28.41±2.69 | 34.78±3.06 | 40.58±2.61 |
| Degree | 44.19±7.98 | 51.69±2.82 | 63.26±1.95 | 27.33±1.97 | 31.74±2.79 | 40.99±2.02 |
| Pagerank | 39.19±3.55 | 47.95±3.94 | 63.93±3.85 | 24.03±1.96 | 32.83±2.21 | 40.35±2.02 |
| FeatProp | 47.75±3.16 | 58.84±3.22 | 65.66±1.38 | 29.79±2.32 | 34.79±2.00 | 41.89±2.71 |
| GraphPart | 55.37±4.06 | 57.40±4.24 | 67.60±2.61 | 38.27±2.08 | 37.40±2.56 | 43.61±3.15 |
| SCARCE | 61.43±2.31 | 68.64±1.94 | 69.59±1.22 | 37.89±1.94 | 43.51±1.44 | 48.20±1.15 |

APPNP backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 62.27±3.86 | 68.71±3.22 | 75.52±1.65 | 38.24±3.21 | 46.10±2.85 | 49.98±1.52 |
| Degree | 64.48±2.97 | 65.29±1.63 | 73.21±1.17 | 38.92±4.79 | 46.93±1.60 | 49.10±2.16 |
| Pagerank | 61.79±1.61 | 68.89±1.11 | 74.15±0.58 | 31.94±1.20 | 42.58±1.39 | 50.70±1.05 |
| FeatProp | 35.22±0.99 | 47.42±3.34 | 72.47±0.42 | 31.89±0.46 | 31.98±0.57 | 29.01±0.56 |
| GraphPart | 66.40±0.60 | 71.98±0.59 | 74.15±1.38 | 46.10±0.62 | 49.93±0.40 | 51.34±0.78 |
| SCARCE | 72.39±1.54 | 76.33±0.99 | 76.77±0.71 | 48.06±3.66 | 52.20±2.02 | 56.21±1.92 |

From the above results, we have similar findings: while feature-based methods such as FeatProp perform poorly under the missing-feature setting, our proposed SCARCE still achieves good performance.
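The one-hot ID replacement used in this missing-feature setting can be sketched minimally as follows (illustrative numpy code; the function name is ours, not from the paper):

```python
import numpy as np

def one_hot_id_features(num_nodes, dtype=np.float32):
    """Replace the original node features with one-hot node-ID features.

    Each row is a node's one-hot ID, so the features carry no semantic
    signal and feature-based selection criteria have nothing to exploit.
    """
    return np.eye(num_nodes, dtype=dtype)
```

For large graphs such as OGB Products, a sparse identity matrix (e.g., `scipy.sparse.identity`) would be used instead, since a dense identity matrix grows quadratically with the number of nodes.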

In summary, the proposed SCARCE performs as expected when feature quality is low or features are missing entirely, whereas feature-based active learning methods may perform poorly.

We hope that we have addressed the concerns in your comments, and please kindly let us know if there is any further concern, and we are happy to clarify.

Comment

W3 & Q4: Only performing experiments on two GNN backbones undermines the superiority of the proposed framework.

R3: We appreciate your suggestions. We further add two representative GNNs: GAT [2], which leverages the attention mechanism, and GCNII [3], which is a deep GNN. For these two methods, we follow the hyperparameter settings in their original papers. Specifically, for GCNII, we adopt 64 layers. The results on the Cora and CiteSeer datasets with budgets of 5C, 10C, and 20C are shown below:

GAT backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 65.66±3.70 | 74.06±4.19 | 80.00±2.24 | 51.36±7.53 | 61.85±2.34 | 64.82±2.15 |
| Uncertainty | 55.31±5.78 | 62.02±6.56 | 74.72±3.69 | 44.41±6.35 | 49.95±5.64 | 60.49±4.00 |
| Density | 64.10±4.28 | 70.72±3.73 | 78.12±2.97 | 49.44±7.58 | 60.10±3.73 | 64.17±2.02 |
| CoreSet | 60.72±7.03 | 72.54±3.40 | 79.48±1.52 | 47.29±6.76 | 58.00±4.21 | 64.54±2.11 |
| Degree | 62.34±2.20 | 73.44±1.39 | 78.31±0.98 | 42.18±1.89 | 44.41±1.29 | 49.38±2.20 |
| Pagerank | 63.78±2.05 | 71.84±1.77 | 80.00±1.03 | 42.98±2.40 | 58.62±1.87 | 66.51±0.77 |
| AGE | 66.45±4.71 | 73.93±3.89 | 80.14±1.55 | 49.53±5.40 | 58.97±5.94 | 65.18±2.26 |
| FeatProp | 73.40±1.39 | 77.62±1.36 | 82.36±0.53 | 58.03±2.99 | 62.14±2.29 | 68.08±0.54 |
| GraphPart | 76.91±1.41 | 78.29±0.88 | 80.94±1.52 | 58.24±2.41 | 65.10±1.32 | 66.41±1.33 |
| SCARCE-Structure | 76.36±2.00 | 81.78±0.46 | 82.58±0.41 | 64.23±1.92 | 69.32±0.60 | 72.36±0.49 |
| SCARCE-Feature | 80.56±0.77 | 83.01±0.63 | 85.04±0.26 | 60.91±3.10 | 70.47±0.61 | 72.60±0.38 |
| SCARCE-ALL | 80.62±0.50 | 83.27±0.39 | 84.96±0.24 | 61.20±2.77 | 70.29±0.61 | 72.34±0.40 |

GCNII backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 69.76±6.47 | 78.29±3.99 | 83.24±1.08 | 55.25±7.68 | 65.59±1.60 | 68.25±1.28 |
| Uncertainty | 68.78±5.64 | 77.15±3.36 | 82.33±1.30 | 53.59±8.70 | 62.91±5.47 | 67.97±1.69 |
| Density | 72.31±3.10 | 78.85±1.61 | 83.22±1.08 | 55.49±7.04 | 65.41±2.80 | 68.07±1.51 |
| CoreSet | 66.07±3.34 | 76.11±3.37 | 80.43±1.97 | 54.22±6.17 | 61.45±3.05 | 67.41±1.02 |
| Degree | 68.02±1.92 | 77.92±0.45 | 82.85±0.41 | 58.99±1.61 | 60.62±1.95 | 67.41±0.50 |
| Pagerank | 69.82±2.74 | 76.64±1.27 | 83.52±0.51 | 53.41±2.70 | 62.61±1.00 | 69.53±0.66 |
| AGE | 71.31±3.89 | 78.81±2.14 | 83.29±1.23 | 56.65±4.73 | 64.46±2.58 | 67.59±2.60 |
| FeatProp | 74.63±2.59 | 83.85±0.58 | 84.49±0.60 | 44.99±2.25 | 52.97±2.08 | 63.01±0.93 |
| GraphPart | 80.45±1.07 | 82.47±0.46 | 83.4±0.42 | 47.89±2.25 | 58.99±1.34 | 65.99±1.08 |
| SCARCE-Structure | 79.96±1.10 | 83.43±0.54 | 83.98±0.48 | 64.55±0.72 | 69.99±0.74 | 72.05±0.35 |
| SCARCE-Feature | 82.05±1.32 | 83.25±0.65 | 85.96±0.53 | 67.63±1.98 | 70.98±0.37 | 72.25±0.74 |
| SCARCE-ALL | 82.61±1.01 | 83.15±0.47 | 85.94±0.51 | 66.37±4.07 | 70.91±0.42 | 72.89±0.44 |

From the above results, we observe that the proposed SCARCE also works well with both GAT and GCNII backbones. These results demonstrate that SCARCE achieves generally good performance across different state-of-the-art GNN backbones.

[2] Graph attention networks. ICLR'18

[3] Simple and deep graph convolutional networks. ICML'20

Q3: In addition, one advantage of this paper lies in the applicability on featureless networks, which is not tested in this paper either.

R4: Thanks for your suggestion. To verify the effectiveness of the proposed method on low-quality feature cases, we conducted additional experiments in two distinct settings, i.e., Noise feature and Missing feature setting.

  1. Noise Feature Setting: We introduced varying levels of standard Gaussian noise to the original features. Specifically, we modified the features as $X' = X + k\epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$ and $k$ takes values in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$. This experiment was conducted on the Cora and CiteSeer datasets with budget settings of 5C and 10C, where C represents the number of classes in the dataset. The results are shown in Appendix Section B.3 of our revised paper. For your convenience, we present the results for 5C budgets with GCN and APPNP backbones in Part 3.
Comment

Dear Reviewer UJDy,

Thank you for your insightful suggestions. We are pleased to provide detailed responses to address your concerns.

Q1 & W1: There lacks a formal introduction of the notion for fairness at the beginning of this paper. (I would suggest to add a formal introduction about the fairness notion studied in this paper in Section 2, and add a descriptive discussion in the Introduction accordingly.)

R1: In response to your valuable feedback, we have incorporated a comprehensive introduction of the structure fairness issue in both the Introduction and Section 2 of our revised paper. For your convenience, we present the formal introduction to the fairness issue discussed in our paper here:

Label Position Bias & Structural Fairness: In this paper, we use the notion of "Label Position Bias," a critical aspect of performance bias in GNNs. This bias arises when the performance of a GNN model varies across nodes based on their proximity to labeled nodes in the graph [1]. Formally, in a graph $G$ with a set of labeled nodes $V_L$ and unlabeled nodes $V_U$, a GNN model trained on $V_L$ may exhibit varying performance for two nodes $i, j \in V_U$ depending on their "structural distance" to $V_L$. A key metric we use to measure this distance is the Label Proximity Score $LPS_i = \sum_{j \in V_L} P_{ij}$, where $P$ is the personalized PageRank matrix. The LPS quantifies the proximity of an unlabeled node to the labeled nodes, and a higher LPS typically correlates with better model performance for that node [1]. [1] also found that a low variance of the LPS scores, which indicates that all nodes have a similar "distance" to the labeled nodes, can mitigate label position bias and improve the performance fairness of GNNs. As a result, we use the variance of the per-node LPS scores as the measure of structural fairness.

Measuring Performance Fairness: To quantify performance fairness, we categorize nodes into different sensitive groups based on their LPS scores. For each group, we calculate the average accuracy of the GNNs and use metrics such as the standard deviation to measure the performance disparity across these groups.

[1] Towards Label Position Bias in Graph Neural Networks, NeurIPS'23
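A minimal sketch (ours, not the authors' released code) of how the LPS, its variance, and the group-wise accuracy standard deviation described above can be computed; the dense power iteration is only practical for small graphs:

```python
import numpy as np

def personalized_pagerank(A, alpha=0.1, n_iter=100):
    """Dense PPR matrix P ~ alpha * (I - (1 - alpha) * A_hat)^(-1), where
    A_hat is the symmetrically normalized adjacency (power iteration)."""
    n = A.shape[0]
    d = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    A_hat = d[:, None] * A * d[None, :]
    P = alpha * np.eye(n)
    for _ in range(n_iter):
        P = (1 - alpha) * A_hat @ P + alpha * np.eye(n)
    return P

def lps(P, labeled_idx):
    """Label Proximity Score: LPS_i = sum over j in V_L of P_ij."""
    return P[:, labeled_idx].sum(axis=1)

def structural_fairness(P, labeled_idx, acc_per_node=None, n_groups=5):
    """LPS variance (structural fairness) and, optionally, the standard
    deviation of group accuracies when nodes are grouped by LPS rank."""
    scores = lps(P, labeled_idx)
    out = {"lps_variance": float(scores.var())}
    if acc_per_node is not None:
        groups = np.array_split(np.argsort(scores), n_groups)
        out["group_acc_sd"] = float(np.std([acc_per_node[g].mean() for g in groups]))
    return out
```

On a small path graph with one labeled endpoint, the LPS decays with distance from the labeled node, and the group accuracy SD directly mirrors the "sensitive groups by LPS" evaluation described above.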

Q2 & W2: Despite the discussion on scalability, this paper does not perform any experiments on large-scale network datasets.

R2: To verify the scalability of the proposed method, in addition to the OGB Arxiv dataset used in our paper, we conducted extensive experiments on the OGB Products dataset, which contains over 2.4 million nodes. We used labeling budgets of 5C, 10C, and 20C, where C is the number of classes. The results on the OGB Products dataset with GCN and APPNP backbones are shown below:

| Method | GCN 5C | GCN 10C | GCN 20C | APPNP 5C | APPNP 10C | APPNP 20C |
|---|---|---|---|---|---|---|
| Random | 60.78±0.60 | 67.38±0.74 | 70.45±0.89 | 66.15±1.43 | 70.23±0.42 | 74.82±0.10 |
| Degree | 37.13±0.46 | 45.82±0.38 | 50.19±0.99 | 49.98±0.85 | 55.43±0.74 | 60.81±0.97 |
| Pagerank | 56.33±0.60 | 63.20±0.20 | 69.00±0.45 | 61.08±0.32 | 67.61±0.70 | 72.69±0.18 |
| FeatProp | 63.34±2.58 | 67.48±0.97 | 71.65±0.57 | 68.64±0.65 | 69.80±1.20 | 71.8±0.61 |
| GraphPart | OOT | OOT | OOT | OOT | OOT | OOT |
| SCARCE | 62.25±0.37 | 70.27±0.22 | 74.43±0.22 | 66.66±0.09 | 72.63±0.20 | 76.57±0.14 |

Please note that the GraphPart method takes more than 24 hours, so we treat it as out of time (OOT).

The results of our experiments on the large-scale OGB Products dataset clearly demonstrate the efficacy of the proposed SCARCE method. Notably, SCARCE achieves high accuracy with significantly fewer labeled nodes than the standard training splits. For instance, while the standard split typically uses around 200,000 nodes for training, SCARCE requires only 940 labeled nodes to attain accuracies of 74.43 (vs. 75.64 with the standard split) and 76.57 (vs. 76.62) for the GCN and APPNP models, respectively. This underscores SCARCE's effectiveness in active learning, particularly for large-scale graph neural network applications.

Comment

Dear Reviewer UJDy,

We would like to express our sincere gratitude to you for reviewing our paper and providing valuable feedback. Could we kindly know if the responses have addressed your concerns? If there are any further questions, we are happy to clarify. Thank you.

Best,

All authors

Comment

Thanks for the detailed rebuttal. Most of my concerns have been addressed and I have raised my score.

Comment

Dear Reviewer UJDy,

Thanks for your response and support. We are glad to know that our rebuttal has addressed your concerns. Please let us know in case there remain outstanding concerns, and if so, we will be happy to respond.

Best Regards,

All Authors

Review
6

Existing active learning models for GNNs heavily rely on the quality of initial node features and ignore the impact of label position bias in the selection of representative nodes. To address these limitations, this paper proposes a novel framework called SCARCE.

Strengths

  • They identify the limitations in current active learning methods, specifically the oversight regarding feature quality and position bias.
  • They propose a novel framework to tackle the aforementioned limitations.
  • Extensive experiments validate the effectiveness of the proposed framework.

Weaknesses

  • There are concerns regarding the fundamental motivation behind active learning. While the primary motivation for active learning lies in the difficulty of obtaining high-quality labels in real-world scenarios, the iterative addition of labels for learned target nodes during the optimization process raises doubts about the original motivation. This creates some contradiction as it suggests that labels for target nodes might be easy to obtain.
  • The improvement compared to baselines seems not statistically significant.
  • They argue that existing methods heavily rely on the quality of initial node features while the proposed framework can mitigate this problem. However, there is a lack of experimental support: the features of the datasets seem typical, without characteristics such as unavailability or noise. A quantifiable evaluation is needed to support this point.

Questions

Please refer to the weaknesses.


Ethics Concern Details

No

Comment

R3 Continued: The results of the noise feature setting are shown below:

GCN backbone:

| Method | Cora 0.1 | Cora 0.3 | Cora 0.5 | Cora 0.7 | Cora 0.9 | CiteSeer 0.1 | CiteSeer 0.3 | CiteSeer 0.5 | CiteSeer 0.7 | CiteSeer 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 64.52±4.50 | 51.76±4.19 | 44.76±4.12 | 40.54±4.12 | 38.65±3.33 | 49.16±8.22 | 36.36±5.71 | 29.53±4.38 | 27.05±2.07 | 26.34±1.85 |
| Degree | 61.58±0.44 | 54.84±0.66 | 51.25±0.60 | 49.82±0.62 | 47.37±0.79 | 44.37±0.88 | 37.27±0.47 | 34.28±0.56 | 34.53±0.46 | 34.12±0.56 |
| Pagerank | 60.77±1.16 | 43.22±2.59 | 34.96±2.23 | 32.86±0.94 | 32.76±1.27 | 49.34±2.77 | 33.40±2.49 | 27.99±1.22 | 25.41±1.95 | 25.08±1.74 |
| FeatProp | 77.51±0.61 | 45.93±2.18 | 38.50±1.63 | 32.94±0.48 | 26.86±1.45 | 52.30±2.47 | 33.16±1.62 | 27.59±1.84 | 26.10±1.25 | 24.88±0.98 |
| GraphPart | 79.22±0.48 | 62.17±1.24 | 51.80±0.85 | 45.01±1.05 | 40.83±1.56 | 52.37±1.13 | 44.63±1.05 | 38.91±0.66 | 35.59±0.92 | 34.42±1.51 |
| SCARCE | 75.13±0.81 | 67.23±0.70 | 59.42±0.47 | 55.90±0.63 | 53.45±0.96 | 59.50±2.53 | 50.35±0.56 | 41.61±0.83 | 37.23±0.63 | 35.23±1.11 |

APPNP backbone:

| Method | Cora 0.1 | Cora 0.3 | Cora 0.5 | Cora 0.7 | Cora 0.9 | CiteSeer 0.1 | CiteSeer 0.3 | CiteSeer 0.5 | CiteSeer 0.7 | CiteSeer 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 69.52±5.14 | 60.06±4.43 | 55.21±4.19 | 53.16±3.60 | 51.51±4.42 | 50.91±7.05 | 39.35±4.37 | 35.08±5.96 | 32.97±3.49 | 31.77±4.15 |
| Degree | 67.59±0.92 | 61.86±0.55 | 58.98±0.67 | 56.29±0.82 | 55.81±0.88 | 50.83±1.58 | 41.98±1.10 | 39.77±0.62 | 36.92±0.38 | 36.84±0.51 |
| Pagerank | 70.87±1.70 | 54.90±2.85 | 50.41±1.87 | 44.75±1.93 | 43.62±2.49 | 52.08±2.55 | 41.20±2.01 | 34.63±2.43 | 30.68±2.27 | 31.78±1.99 |
| FeatProp | 80.17±0.77 | 48.23±1.72 | 44.64±1.36 | 35.49±0.77 | 35.76±0.86 | 54.49±1.66 | 36.12±1.34 | 27.69±1.57 | 25.43±0.81 | 24.31±1.53 |
| GraphPart | 80.68±0.41 | 68.76±1.80 | 66.11±1.05 | 59.14±2.04 | 54.07±2.22 | 60.38±1.40 | 45.39±1.36 | 40.97±1.14 | 39.19±1.78 | 38.79±1.11 |
| SCARCE | 79.39±0.67 | 75.89±0.66 | 72.46±0.20 | 70.10±1.21 | 69.19±0.78 | 62.58±1.24 | 54.76±0.82 | 48.60±0.63 | 45.67±0.55 | 44.28±0.60 |

From the above results, we make the following observations:

  • As the level of noise in the features increases, the performance of the feature-based methods drops substantially. For example, FeatProp performs even worse than random selection.
  • The proposed SCARCE, which leverages only the graph structure in active learning, outperforms the baselines by a large margin.
  2. Missing Feature Setting: To further test the resilience of SCARCE under challenging feature conditions, we conducted experiments in a setting where all features are missing. In this scenario, we replaced the original node features with one-hot ID features. We conducted experiments on both the Cora and CiteSeer datasets with labeling budgets of 5C, 10C, and 20C using both GCN and APPNP models. The results are shown below:
GCN backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 44.76±5.83 | 56.47±2.30 | 64.03±1.74 | 28.41±2.69 | 34.78±3.06 | 40.58±2.61 |
| Degree | 44.19±7.98 | 51.69±2.82 | 63.26±1.95 | 27.33±1.97 | 31.74±2.79 | 40.99±2.02 |
| Pagerank | 39.19±3.55 | 47.95±3.94 | 63.93±3.85 | 24.03±1.96 | 32.83±2.21 | 40.35±2.02 |
| FeatProp | 47.75±3.16 | 58.84±3.22 | 65.66±1.38 | 29.79±2.32 | 34.79±2.00 | 41.89±2.71 |
| GraphPart | 55.37±4.06 | 57.40±4.24 | 67.60±2.61 | 38.27±2.08 | 37.40±2.56 | 43.61±3.15 |
| SCARCE | 61.43±2.31 | 68.64±1.94 | 69.59±1.22 | 37.89±1.94 | 43.51±1.44 | 48.20±1.15 |

APPNP backbone:

| Method | Cora 5C | Cora 10C | Cora 20C | CiteSeer 5C | CiteSeer 10C | CiteSeer 20C |
|---|---|---|---|---|---|---|
| Random | 62.27±3.86 | 68.71±3.22 | 75.52±1.65 | 38.24±3.21 | 46.10±2.85 | 49.98±1.52 |
| Degree | 64.48±2.97 | 65.29±1.63 | 73.21±1.17 | 38.92±4.79 | 46.93±1.60 | 49.10±2.16 |
| Pagerank | 61.79±1.61 | 68.89±1.11 | 74.15±0.58 | 31.94±1.20 | 42.58±1.39 | 50.70±1.05 |
| FeatProp | 35.22±0.99 | 47.42±3.34 | 72.47±0.42 | 31.89±0.46 | 31.98±0.57 | 29.01±0.56 |
| GraphPart | 66.40±0.60 | 71.98±0.59 | 74.15±1.38 | 46.10±0.62 | 49.93±0.40 | 51.34±0.78 |
| SCARCE | 72.39±1.54 | 76.33±0.99 | 76.77±0.71 | 48.06±3.66 | 52.20±2.02 | 56.21±1.92 |

From the above results, we have similar findings: while feature-based methods such as FeatProp perform poorly under the missing-feature setting, our proposed SCARCE still achieves good performance.

In summary, the proposed SCARCE performs as expected when feature quality is low or features are missing entirely, whereas feature-based active learning methods may perform poorly.

We hope that we have addressed the concerns in your comments, and please kindly let us know if there is any further concern, and we are happy to clarify.

Comment

Dear Reviewer BDeb,

We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.

Q1: There are concerns regarding the fundamental motivation behind active learning. While the primary motivation for active learning lies in the difficulty of obtaining high-quality labels in real-world scenarios, the iterative addition of labels for learned target nodes during the optimization process raises doubts about the original motivation. This creates some contradiction as it suggests that labels for target nodes might be easy to obtain.

R1: Thank you for pointing out this concern. We would like to clarify that the fundamental motivation behind active learning is to maximize the model's performance within a limited labeling budget. Due to the difficulty of obtaining high-quality labels, the samples are usually assumed to be labeled by human experts (often called an "oracle"), which can be expensive and time-consuming. Therefore, a labeling budget is typically set to constrain the number of samples that can be labeled by human experts.

In active learning, there are primarily two settings: iterative [1][2] and one-step [3][4]. In the iterative setting, as you mentioned, human experts label a small number of samples, and a model is trained on these samples in each iteration. Afterward, the learned model is used to select new samples (for example, samples on which the model has low confidence), which are then labeled by human experts. This process repeats until the budget is exhausted. Conversely, in the one-step active learning setting, which is the focus of our work, all labels are obtained at once within the set budget. This approach avoids the resource-intensive process of training the model multiple times. In both settings, however, the total human labeling effort is the same, since the number of labeled samples is capped by the budget. Therefore, neither setting implies that obtaining labels is easy.
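The contrast between the two settings can be sketched schematically as follows. All helper names (`label_by_oracle`, `train`, `select`) are toy stand-ins of ours, purely for illustrating the control flow, not APIs from the paper:

```python
import random

# Toy stand-ins: the oracle simply returns the queried nodes, "training"
# records the labeled set, and selection is random. Only the control flow
# of the two settings matters here.
def label_by_oracle(nodes):
    return set(nodes)

def train(labeled):
    return {"labeled": set(labeled)}

def select(model, candidates, k):
    return random.sample(sorted(candidates), min(k, len(candidates)))

def iterative_al(pool, budget, batch_size):
    """Iterative setting: query the oracle one small batch per round,
    retraining the model between rounds, until the budget is spent."""
    labeled, model = set(), None
    while len(labeled) < budget:
        k = min(batch_size, budget - len(labeled))
        labeled |= label_by_oracle(select(model, pool - labeled, k))
        model = train(labeled)
    return model

def one_step_al(pool, budget):
    """One-step setting (the focus of this paper): all labels are
    requested at once within the budget, and training happens once."""
    labeled = label_by_oracle(select(None, pool, budget))
    return train(labeled)
```

Both routines consult the oracle for exactly `budget` labels in total; they differ only in how many times the model is trained.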

[1] An analysis of active learning strategies for sequence labeling tasks.

[2] Active learning for graph embedding.

[3] Active learning for graph neural networks via node feature propagation.

[4] Partition-based active learning for graph neural networks.

Q2: The improvement compared to baselines seems not statistically significant.

R2: We have conducted a t-test between our proposed SCARCE and the best baseline model for each dataset and budget setting. A p-value less than 0.05 usually indicates that the improvement is statistically significant. In our paper, we have marked statistically significant results with a red asterisk (*) in Tables 1 and 2 for clear reference. For your convenience, we also present partial results for the GCN model here:

| Budget | Cora | CiteSeer | Computer | Arxiv |
|---|---|---|---|---|
| 5C | 2.9E-06 | 8.0E-10 | 3.0E-06 | 1.2E-06 |
| 10C | 0.4 | 9.7E-19 | 0.01 | 1.5E-10 |
| 20C | 1.1E-04 | 4.4E-20 | 4.7E-05 | 2.3E-21 |

From the results, the majority of our improvements over the baseline models are indeed statistically significant.
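Such a significance check can be reproduced with a standard two-sample t-test. The sketch below uses Welch's t statistic on synthetic per-seed accuracies (the numbers are illustrative, not the paper's raw runs):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom.

    The p-value is obtained from the t distribution with `df` degrees of
    freedom (e.g., via scipy.stats.t.sf); for df near 18, |t| > 2.10
    roughly corresponds to p < 0.05 for a two-sided test.
    """
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se2 = va / len(a) + vb / len(b)
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / len(a)) ** 2 / (len(a) - 1)
                     + (vb / len(b)) ** 2 / (len(b) - 1))
    return t, df

# Illustrative per-seed accuracies for one dataset/budget cell.
scarce_runs = [81.2, 80.9, 81.5, 81.1, 80.8, 81.4, 81.0, 81.3, 80.7, 81.6]
baseline_runs = [79.8, 80.1, 79.5, 80.0, 79.7, 80.2, 79.6, 79.9, 80.3, 79.4]
```

With library support, `scipy.stats.ttest_ind(scarce_runs, baseline_runs, equal_var=False)` yields the same statistic together with the exact p-value.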

Q3: They argue that existing methods heavily rely on the quality of initial node features while the proposed framework can mitigate this problem. However, there is a lack of experimental support. The features of the datasets seem typical, without characteristics such as unavailability or noise. A quantifiable evaluation is needed to support this point.

R3: Thank you for this great suggestion. Accordingly, we conducted additional experiments in two distinct settings to demonstrate the superiority of the proposed SCARCE.

  1. Noise Feature Setting: We introduced varying levels of standard Gaussian noise to the original features. Specifically, we modified the features as $X' = X + k\epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$ and $k$ takes values in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$. This experiment was conducted on the Cora and CiteSeer datasets with budget settings of 5C and 10C, where C represents the number of classes in the dataset. The results are shown in Appendix B.3 of our revised paper. For your convenience, we present the partial results for 5C budgets with GCN and APPNP backbones in Part 2.
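A minimal sketch of this perturbation (numpy; the function name is ours, not from the paper):

```python
import numpy as np

def add_feature_noise(X, k, seed=None):
    """Noise-feature setting: return X + k * eps, with eps ~ N(0, 1)
    drawn i.i.d. per entry, for k in {0.1, 0.3, 0.5, 0.7, 0.9}."""
    rng = np.random.default_rng(seed)
    return X + k * rng.standard_normal(X.shape)
```

Larger `k` drowns out the original signal, which is why feature-based selection criteria degrade as the noise level grows.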
Comment

Dear Reviewer BDeb,

We would like to express our sincere gratitude to you for reviewing our paper and providing valuable feedback. Could we kindly know if the responses have addressed your concerns? If there are any further questions, we are happy to clarify. Thank you.

Best,

All authors

Comment

Dear Reviewers,

We wish to express our gratitude for your thoughtful comments and concerns. Following your valuable feedback, we have submitted responses and revisions to address the issues raised. As the discussion period is about to end, we kindly ask you to confirm receipt of our responses. We also welcome any further concerns or suggestions regarding our response. Your timely reply is greatly appreciated and will be immensely helpful for improving our work.

Thank you for your time and consideration.

Sincerely,

The Authors

AC Meta-Review

The submission proposes an active learning algorithm in the graph-based setting, where access to node labels is limited, with the goal of training a graph neural network. The authors argue that standard active learning methods do not consider, and may suffer from, label position bias (the observation that a model can achieve better prediction performance on nodes that are closer to already-labeled nodes).

The proposed algorithm is designed with specific attention to this graph structural bias and is shown to empirically outperform several existing active learning methods (general and GNN specific). As a side effect, the authors also find that the proposed algorithm provides improved fairness (in terms of group performance across groups).

Several clarifications, new experiments, and changes to the presentation were made in response to concerns raised by reviewers, which I believe were almost entirely addressed.

Why Not a Higher Score

Based on the reviewer feedback and my own reading, the contributions do not appear to be so groundbreaking as to warrant a spotlight or oral.

Why Not a Lower Score

After clarifications and revision, the submission addresses most of the reviewer concerns and provides an effective algorithm for a setting of interest.

Final Decision

Accept (poster)