PaperHub
Score: 5.5 / 10
Poster · 3 reviewers
Ratings: 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Test-time Adaptation on Graphs via Adaptive Subgraph-based Selection and Regularized Prototypes

Submitted: 2025-01-14 · Updated: 2025-07-24
TL;DR

This paper explores a novel yet under-explored problem of test-time adaptation on graphs and proposes a novel method named Adaptive Subgraph-based Selection and Regularized Prototype Supervision (ASSESS) to solve the problem.

Abstract

Keywords

test-time adaptation, graph neural networks

Reviews and Discussion

Review 1
Rating: 3

This paper finds that existing graph neural network methods suffer performance degradation under test-time domain shifts, and that current test-time adaptation methods mainly target Euclidean data and face challenges with label scarcity and knowledge utilization. The authors propose the ASSESS framework, which enables fine-grained, structure-aware selection of reliable test graphs, effectively balances prior knowledge from the inaccessible training graphs with posterior information from unlabeled test graphs, and achieves significant improvements in test-time adaptation on graphs.

Update after rebuttal:

The authors' rebuttal addresses most of my concerns, so I keep my score at "weak accept".

Questions for Authors

I would like to know how many subgraphs the authors use for the mutual-information estimation, and how the adaptation time of ASSESS compares to that of the baseline methods.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes, the proof of Theorem 3.1 is correct and well constructed. However, I notice that the authors make many assumptions in the paper. A brief discussion of how these assumptions affect the generality of the result would be helpful.

Experimental Design and Analysis

Yes, the paper evaluates ASSESS on five datasets spanning different domains and compares ASSESS with a range of baselines. As far as I know, there are many different settings for TTA, such as online versus offline and different batch sizes, and these differ from source-free methods such as SHOT and RNA. The authors should provide a detailed explanation of their TTA setting and compare more TTA methods under the same setting.

Supplementary Material

Yes, the proof of Theorem 3.1 and details of the datasets and baseline methods.

Relation to Prior Literature

The key contributions of this paper build on graph representation learning, domain adaptation, self-supervised learning, and test-time adaptation.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. The paper addresses an under-explored problem: test-time adaptation on graph-structured data.
  2. The proposed method, ASSESS, combines two innovative components: Adaptive Subgraph-based Selection (ASBS) and Regularized Prototype Supervision (RPS). The use of subgraph mutual information for adaptive thresholding and the construction of semantic prototypes are well-motivated ideas.
  3. The paper is well-written and logically structured.

Weaknesses:

  1. The authors only compare methods from the source-free setting in the experiments. The most advanced TTA methods should also be compared under the graph setting.
  2. The paper lacks specific details of the TTA experimental setting, such as whether adaptation is online or offline and whether the samples arrive as a stream.
  3. The datasets (e.g., FRANKENSTEIN, Mutagenicity) focus on chemical and social graphs. Including other graph types (e.g., knowledge graphs, citation networks) would better demonstrate generalizability.

Other Comments or Suggestions

The authors should compare the entropy-based and subgraph-based methods at the selection stage, highlighting the advantages of subgraph-based selection.

Author Response

We are truly grateful for the time you have taken to review our paper, your insightful comments and support. Your positive feedback is incredibly encouraging for us! In the following response, we would like to address your concerns and provide additional clarification.

Q1. I notice that the authors make many assumptions in the paper. A brief discussion of how these assumptions affect the generality of the result would be helpful.

A1. Thank you for your suggestion. Addition, multiplication, and the ReLU function can be shown to satisfy these assumptions. Our full neural network is a composition of these operations and therefore satisfies the assumptions as well. Similar assumptions can be found in recent works [1], [2], [3], and [4].

[1] A convergence theory for deep learning via over-parameterization, ICML'19

[2] Stagewise training accelerates convergence of testing error over SGD, NeurIPS'19

[3] Gradient-free optimization of highly smooth functions: improved analysis and a new algorithm, JMLR'24

[4] Efficient active learning halfspaces with Tsybakov noise: A non-convex optimization approach, AISTATS'24

We will discuss this in the revised version of the manuscript.

Q2. As far as I know, there are many different settings for TTA, such as online or offline, as well as batch size settings, and they should be different from source-free methods such as SHOT and RNA. I think the authors should provide a detailed explanation of their TTA settings and compare more TTA methods under the same settings.

A2. Thank you for your suggestion. We follow the existing work [5] to adopt the offline setting of TTA. We compare additional TTA methods under the offline setting, and the results show that our method achieves better accuracy.

| Methods | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| ASSESS (ours) | 78.9 | 67.8 |
| T3A [6] | 69.3 | 64.3 |
| GAPGC [7] | 71.9 | 65.1 |
| MATCHA [8] | 73.2 | 65.8 |

[5] Test-Time Training for Graph Neural Networks, arXiv'22

[6] Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization, NeurIPS'21

[7] GraphTTA: Test Time Adaptation on Graph Neural Networks, ICML'22

[8] Matcha: Mitigating Graph Structure Shifts with Test-Time Adaptation, ICLR'25

Q3. The authors only compared methods related to the source-free setting in the experimental comparison. The most advanced TTA methods should be compared under the graph setting.

A3. Thank you for your comment. We evaluate additional TTA methods under the offline graph setting, and the results are shown in A2. We will add these in the revised manuscript.

Q4. The datasets (e.g., FRANKENSTEIN, Mutagenicity) focus on chemical and social graphs. Including other graph types (e.g., knowledge graphs, citation networks) would better demonstrate generalizability.

A4. Thank you for your comment. We provide an additional citation network dataset (DBLP_v1), and the results are shown as follows:

| Methods | ASSESS (ours) | T3A | GAPGC | MATCHA |
| --- | --- | --- | --- | --- |
| DBLP_v1 | 89.4 | 83.1 | 84.6 | 86.7 |

Q5. I think the author should use the entropy-based method and the subgraph-based method to compare the results during the selection stage, highlighting the advantages of subgraph-based selection.

A5. Thank you for your suggestion. We compare our ASBS with entropy-based selection, and the results below show that our method outperforms entropy-based selection.

| Selection Strategies | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| ASBS | 78.9 | 67.8 |
| Entropy | 71.8 | 64.7 |

Q6. I would like to know how many subgraphs the author will choose for estimation calculation, as well as how the adaptation time of ASSESS compares to other baseline methods.

A6. Thank you for your comment. We use one subgraph for the estimation and rely on temporal ensembling (as mentioned in Lines 192-197, around Eq. 5) to obtain a stable estimate of the mutual information. As for the adaptation time, we would like to emphasize that GNNs are relatively lightweight and computation is fast. We compare the adaptation time of ASSESS with the baselines below; the results show that our computation time is comparable to that of other methods.

| Datasets | ASSESS (ours) | RNA | MATCHA |
| --- | --- | --- | --- |
| FRANKENSTEIN | 2.1 s | 2.5 s | 2.0 s |
| Mutagenicity | 1.7 s | 2.0 s | 1.5 s |
Review 2
Rating: 3

This paper investigates the problem of test-time adaptation on graphs, addressing the challenge of adapting a pre-trained graph neural network (GNN) to unseen test data without access to the original training set. The authors propose ASSESS (Adaptive Subgraph-based Selection and Regularized Prototype Supervision), a novel method that combines graph selection and prototype regularization to enhance adaptation performance under distribution shifts. The paper also presents a rigorous theoretical analysis of ASSESS. Empirical evaluations on five diverse graph datasets confirm the superior performance of ASSESS over state-of-the-art baselines, with ablation studies further highlighting the contributions of its components.

Questions for Authors

Please refer to the weakness section.

I may change my score based on the authors' feedback regarding the weaknesses.

Claims and Evidence

All the claims are supported by theoretical or experimental evidence.

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes. I have checked the theoretical claims in section 3, including the proof in the appendix.

Experimental Design and Analysis

Yes. I have checked the experimental designs, the use of datasets, ablation studies, etc.

Supplementary Material

Yes. I have reviewed all parts in the supplementary material.

Relation to Prior Literature

As graph neural networks are widely used for processing non-Euclidean data, such as proteins or social networks, their performance under test-time distribution shifts is an important aspect of the robustness of many algorithms.

Missing Essential References

This paper could benefit from a discussion of other works on graph transfer learning, e.g., [1][2].

[1] Structural Re-weighting Improves Graph Domain Adaptation (ICML'23)
[2] Learning Invariant Graph Representations for Out-of-Distribution Generalization (NeurIPS'22)

Other Strengths and Weaknesses

Strengths:

  1. The paper provides a well-structured and comprehensive theoretical analysis of the proposed ASSESS algorithm, demonstrating its convergence properties under test-time adaptation settings. The analysis is grounded in established optimization assumptions, such as the Polyak-Łojasiewicz condition and the Tsybakov noisy condition, providing a strong mathematical foundation for the method’s effectiveness.
  2. The proposed method is evaluated on five real-world graph datasets, each of which represents different domains. The authors compare their approach against a diverse set of baselines, including graph neural networks, unsupervised/semi-supervised training methods, and test-time adaptation algorithms. The results consistently show improvements in classification accuracy.
  3. The ablation results provide strong empirical evidence that both components are crucial for achieving state-of-the-art performance in test-time adaptation on graphs.

Weaknesses:

  1. While the paper focuses on test-time adaptation, it does not explicitly discuss connections to related topics such as graph domain adaptation and out-of-distribution (OOD) generalization. Since ASSESS deals with distribution shifts at test time, its relationship to existing methods in transfer learning should be discussed.
  2. The authors introduce a temporal ensembling strategy to stabilize the estimation of mutual information in the adaptive subgraph selection process (Eq. 5). However, the motivation for why this specific technique was chosen is not clearly explained.
  3. The paper does not specify whether the selected datasets cover both homophilic and heterophilic graphs. Homophilic graphs exhibit strong intra-class connectivity, while heterophilic graphs contain nodes that connect across different classes. Since GNN performance varies significantly based on graph structure, it is important to explicitly discuss the structural diversity of the datasets.
  4. The paper presents inconsistent descriptions of how the ASBS and RPS components interact within the overall adaptation framework. Figure 1 suggests that ASBS and RPS operate in parallel, implying that reliable test graphs are selected and then immediately used for self-training. However, Section 3.4 and the provided loss formulation (Eq. 17) indicate an iterative execution, where ASBS first refines the test graph selection over multiple steps before RPS is applied.

Other Comments or Suggestions

  1. The format of some equations could be improved (e.g. Eq 12-15).
  2. In Line 223, "large" should be "larger".
  3. In Line 253, "closed form" should be "closed-form".
Author Response

We are truly grateful for the time you have taken to review our paper, your insightful comments and support. Your positive feedback is incredibly encouraging for us! In the following response, we would like to address your concerns and provide additional clarification.

Q1. While the paper focuses on test-time adaptation, it does not explicitly discuss connections to related topics such as graph domain adaptation and out-of-distribution (OOD) generalization. Since ASSESS deals with distribution shifts at test time, its relationship to existing methods in transfer learning should be discussed.

A1. Thank you for your suggestion. We will add discussions of related topics in the revised version. Here we provide a brief comparison. TTA aims to adapt a well-trained model without accessing the source data. Domain adaptation allows access to the source data, which differs from TTA. OOD generalization focuses on building generalizable models during training, whereas TTA focuses on the test phase.

Q2. The authors introduce a temporal ensembling strategy to stabilize the estimation of mutual information in the adaptive subgraph selection process (Eq. 5). However, the motivation for why this specific technique was chosen is not clearly explained.

A2. Thanks for your comment. Temporal ensembling is used to stabilize the estimation of mutual information. Sampling many subgraphs to compute the MI estimate is costly, while using too few samples leads to unstable estimates. Therefore, we apply temporal ensembling to smooth the estimate across iterations.
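For illustration, a minimal sketch of a temporal-ensembling (exponential moving average) update that smooths a noisy per-graph MI estimate; the class name, interface, and momentum value are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: EMA (temporal ensembling) of a noisy per-graph MI estimate.
from collections import defaultdict

class MIEnsembler:
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.ema = defaultdict(lambda: None)   # graph_id -> smoothed MI estimate

    def update(self, graph_id: int, mi_estimate: float) -> float:
        prev = self.ema[graph_id]
        if prev is None:                       # first observation: no history yet
            self.ema[graph_id] = mi_estimate
        else:                                  # blend the new noisy estimate with history
            self.ema[graph_id] = self.momentum * prev + (1 - self.momentum) * mi_estimate
        return self.ema[graph_id]

# Usage: smoothed = ensembler.update(g_id, noisy_mi_from_one_subgraph)
```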

Q3. The paper does not specify whether the selected datasets cover both homophilic and heterophilic graphs. Homophilic graphs exhibit strong intra-class connectivity, while heterophilic graphs contain nodes that connect across different classes. Since GNN performance varies significantly based on graph structure, it is important to explicitly discuss the structural diversity of the datasets.

A3. Thanks for your comment. The adopted datasets contain both homophilic and heterophilic graphs; their homophily ratios are shown below:

| Datasets | FRAN | MUTA | PROTEINS | NCI1 | IMDB-BINARY |
| --- | --- | --- | --- | --- | --- |
| Homophily Ratio | 0.04 | 0.37 | 0.92 | 0.62 | 0.67 |
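For reference, a minimal sketch of one common definition of edge homophily (the fraction of edges whose endpoints share a node label); the exact definition behind the table above is not specified in this thread, so this is an assumption.

```python
import numpy as np

def edge_homophily(edge_index: np.ndarray, node_labels: np.ndarray) -> float:
    """Fraction of edges whose two endpoints carry the same node label.

    edge_index: array of shape (2, num_edges); node_labels: shape (num_nodes,).
    This is one common definition; the paper may use a different variant.
    """
    src, dst = edge_index
    same = node_labels[src] == node_labels[dst]
    return float(same.mean())

# Toy example: a 4-node graph with two classes, every edge within a class.
edges = np.array([[0, 1, 2, 3], [1, 0, 3, 2]])
labels = np.array([0, 0, 1, 1])
print(edge_homophily(edges, labels))  # 1.0
```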

Q4. The paper presents inconsistent descriptions of how the ASBS and RPS components interact within the overall adaptation framework. Figure 1 suggests that ASBS and RPS operate in parallel, implying that reliable test graphs are selected and then immediately used for self-training. However, Section 3.4 and the provided loss formulation (Eq. 17) indicate an iterative execution, where ASBS first refines the test graph selection over multiple steps before RPS is applied.

A4. Thanks for the comment. Selection and adaptation are performed iteratively. In each iteration, we select reliable test graphs using ASBS and then adapt the model using RPS. ASBS is integrated into the overall optimization process: the optimization computes the losses (including the MI loss), and the MI loss guides the reliable-graph selection in ASBS.
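To make the iteration concrete, a hedged sketch of one possible select-then-adapt loop; the callables `select_reliable` and `adaptation_loss` are placeholders standing in for ASBS and the RPS/MI objective, not the authors' actual code.

```python
# Hedged sketch of the iterative "select, then adapt" loop described above.
from typing import Any, Callable, List
import torch

def adapt(model: torch.nn.Module,
          test_graphs: List[Any],
          select_reliable: Callable,   # placeholder for ASBS: returns the reliable subset
          adaptation_loss: Callable,   # placeholder for the RPS + MI objective
          optimizer: torch.optim.Optimizer,
          num_iters: int = 10) -> torch.nn.Module:
    for _ in range(num_iters):
        reliable = select_reliable(model, test_graphs)  # selection step (ASBS)
        optimizer.zero_grad()
        loss = adaptation_loss(model, reliable)         # adaptation step (RPS)
        loss.backward()
        optimizer.step()
    return model
```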

Other typos:

  1. The format of some equations could be improved (e.g. Eq 12-15).
  2. In Line 223, "large" should be "larger".
  3. In Line 253, "closed form" should be "closed-form".

Thank you for your careful review! We will correct these in the revised version.

Review 3
Rating: 3

This paper proposes an algorithm, ASSESS, that handles test-time domain adaptation for graph-level tasks. The algorithm mainly contains two components: adaptive subgraph-based selection and regularized prototype supervision. The subgraph-based selection selects reliable test graphs by setting an individual threshold for each graph using the mutual information between the graph and its subgraphs. Then, with the selected set of reliable test graphs, prior prototypes are first constructed from the weight matrix of the pretrained model as prior knowledge. The prototypes and the model are then updated with a self-supervised objective that matches the prototypes to the corresponding class embeddings of the test graphs, under a regularization that keeps the updated prototypes close to the prior prototypes. The paper also provides a convergence analysis of the algorithm and demonstrates superior performance compared to GNN baselines and test-time adaptation baselines.

Questions for Authors

Please refer to the above sections.

Claims and Evidence

Strengths:

  • Test-time adaptation on graphs, specifically for graph-level tasks, is an under-explored direction that warrants more investigation and understanding.
  • The selected datasets span a wide range of real-world applications.

Weaknesses:

  • Lack of illustration and more concrete motivation behind the unique challenges that arise for graph test-time adaptation in the introduction. For instance, what are the challenges of test-time adaptation under structural shifts within these datasets, and why can they not be addressed by previous algorithms? The second challenge seems not to be unique to graph data.

Methods and Evaluation Criteria

Strengths:

  • The proposed method is clear in its objectives for the two modules.
  • The methodology is written clearly and is easy to read.

Weaknesses:

  • It would be better to motivate the algorithm design in a more theoretical manner. For instance, why, in principle, is the mutual information between subgraphs and the whole graph a good indicator for selecting reliable test graphs? Also, is randomly selecting a subgraph to calculate the mutual information the best approach?
  • The design of the regularized prototype supervision is largely independent of the graph structure and could also be applied to Euclidean data. Moreover, the idea of prototype or template adjustment [1] has been used in previous literature, which limits the novelty.

[1]. Iwasawa, Yusuke, and Yutaka Matsuo. "Test-time classifier adjustment module for model-agnostic domain generalization." Advances in Neural Information Processing Systems 34 (2021): 2427-2440.

Theoretical Claims

  • The theoretical analysis mainly targets convergence, while I expected more analysis of the rationale of the algorithm, as mentioned above.
  • Some assumptions of the theorem might be a bit oversimplified, e.g., modeling the distribution of the unlabeled test graphs as a mixture of two distributions.

Experimental Design and Analysis

Strengths:

  • The datasets span a wide range of fields
  • There is an ablation study investigating the impact of the different components.

Weaknesses/Questions:

  • Lack of TTA and GTTA baselines: over half of the baselines are not designed for test-time adaptation, and there are only two non-graph TTA baselines and one graph TTA baseline.
  • The performance of the different variants of the algorithm is very close, even within one standard deviation. Does this imply that there is no clear and significant difference in the contributions of the different components of ASSESS?

Supplementary Material

Briefly went through the algorithm and baseline sections

Relation to Prior Literature

This paper can draw more attention to the graph test-time adaptation problem.

Missing Essential References

More discussion of graph test-time adaptation is needed in the related work section, in addition to test-time adaptation for Euclidean data.

Other Strengths and Weaknesses

Please refer to the above sections.

Other Comments or Suggestions

It might be better to illustrate more details of the pipeline design in Figure 1.

Author Response

We are truly grateful for the time you have taken to review our paper and for your insightful review. We address your comments below.

Q1. Lack of illustration and more concrete motivation behind the unique challenges that arise for graph TTA in the introduction. For instance, what are the challenges of TTA under structural shifts within these datasets, and why can they not be addressed by previous algorithms? The second challenge seems to be non-unique to graph data.

A1. Thanks for your comment. According to [A], shifts in both node attributes and graph structure deteriorate the performance of GNNs. Previous works often select reliable samples for self-training using a fixed threshold shared across all graphs, which is not flexible enough to handle such complex distribution shifts (in both attributes and structure).

As for the second part, our paper focuses on graph TTA. Compared with traditional TTA, the complexity of graph-level data makes the problem challenging. We will try to extend our method to other types of data and leave this to future work.

[A] Matcha: Mitigating Graph Structure Shifts with Test-Time Adaptation. ICLR'25

Q2. It could be better if we motivate the algorithm design in a more theoretical manner. For instance, why in principle checking the mutual information of subgraphs ... Also, is the random selection of subgraph to calculate mutual information the best way?

A2. Thanks for your comment. A high MI between a graph and its subgraphs indicates that the graph representation encodes information shared across its subgraphs; in other words, the encoder is able to handle the inherent structure of this graph. To validate this, we compare our method with other variants and show that MI achieves the best performance.

| Indicators | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| MI | 78.9 | 67.8 |
| Entropy | 71.8 | 64.7 |
| Confidence | 69.5 | 64.5 |

Random selection is simple yet effective here. To validate this, we provide empirical results below, comparing against edge perturbation (removing and adding edges with probability p).

| Strategies | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| Random Subgraph | 78.9 | 67.8 |
| Edge perturbation, p=0.1 | 76.3 | 66.7 |
| Edge perturbation, p=0.2 | 75.7 | 66.1 |
| Edge perturbation, p=0.3 | 74.2 | 65.4 |
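To make the idea above concrete, a hedged sketch of one way a graph-subgraph MI score could be estimated: an InfoNCE-style lower bound over paired graph and subgraph embeddings. This illustrates the principle only; it is not the estimator used in the paper, and the function name and temperature are assumptions.

```python
# Hedged illustration: InfoNCE-style lower bound on the MI between graph
# embeddings and embeddings of their (randomly sampled) subgraphs.
import torch
import torch.nn.functional as F

def infonce_mi(graph_emb: torch.Tensor, subgraph_emb: torch.Tensor,
               temperature: float = 0.1) -> torch.Tensor:
    """graph_emb, subgraph_emb: (batch, dim); row i of each comes from the same graph.

    Each graph's own subgraph is the positive; all other subgraphs in the
    batch act as negatives. Higher value = higher MI lower bound.
    """
    g = F.normalize(graph_emb, dim=-1)
    s = F.normalize(subgraph_emb, dim=-1)
    logits = g @ s.t() / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)    # positives sit on the diagonal
    return -F.cross_entropy(logits, targets)

# Per-graph scores (each row's diagonal logit versus the rest of the row) could
# then be compared against an adaptive threshold to keep reliable test graphs.
```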

Q3. The design of regularized prototype supervision is kind of independent to the graph structure and can also be applied to euclidean data. Also, the idea of similar prototype or template adjustment [1] has been used ...

A3. Thanks for your comment. This work focuses on graph data, but our regularized prototype supervision is a general design that we incorporate into our graph TTA method. We are committed to extending the design to other types of data and leave this to future work.

Our work is different from T3A (Iwasawa et al.) in the following aspects:

  • Different Motivation: T3A is motivated by the idea of a support set, whereas our ASSESS is motivated by Bayesian theory.
  • Different Methodology: T3A maintains a support set for each class using the test samples and uses the support set for classification. By comparison, we use optimal transport to obtain the prototypes, which are used as supervision signals with regularization from the prior (see the sketch below).
  • Different Scenario: T3A focuses on domain generalization for images, whereas we focus on test-time adaptation for graphs.

Moreover, empirical results in A5 show that our method outperforms T3A.
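As an illustration of the optimal-transport step referenced above, a minimal Sinkhorn-style sketch (in the spirit of SwAV-type balanced assignment) for obtaining soft assignments between test-graph embeddings and class prototypes; the paper's exact formulation may differ, and the hyperparameters here are illustrative.

```python
# Hedged sketch: Sinkhorn-style balanced soft assignment between embeddings and prototypes.
import torch

@torch.no_grad()
def sinkhorn_assign(embeddings: torch.Tensor, prototypes: torch.Tensor,
                    eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Return a (num_samples, num_classes) assignment matrix whose rows sum to 1
    and whose columns are approximately balanced."""
    scores = embeddings @ prototypes.t()             # similarity between samples and prototypes
    Q = torch.exp(scores / eps)
    Q /= Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=0, keepdim=True); Q /= k      # normalize columns (balance classes)
        Q /= Q.sum(dim=1, keepdim=True); Q /= n      # normalize rows (one distribution per sample)
    return Q * n                                      # rows now sum to 1
```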

Q4. I expect some more analysis of the rationale of the algorithm, as mentioned above.

A4. Thanks for your comment. The overall rationale is to first select (ASBS, as discussed in A2) and then supervise (RPS, A3). Due to the limited space, please refer to the previous answers.

Q5. Some assumptions of the theorem might be a bit oversimplified, e.g., modeling the distribution of the unlabeled test graphs as a mixture of two distributions.

A5. Thanks for your comment. Without loss of generality, we consider a mixture of two distributions; the analysis can easily be generalized to mixtures of more distributions.

Q6. Lack of TTA and GTTA baselines ...

A6. Thanks for your comment. We add additional TTA (T3A) and GTTA (GAPGC, MATCHA) baselines as follows.

| Methods | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| ASSESS (ours) | 78.9 | 67.8 |
| T3A | 69.3 | 64.3 |
| GAPGC | 71.9 | 65.1 |
| MATCHA | 73.2 | 65.8 |

Q7. The performance of the different variants of the algorithm is very close, even within one std. ...

A7. Thanks for your comment. We provide additional ablation studies on other datasets below, and we observe a clear overall improvement on these datasets.

| Methods | PROTEINS | IMDB-BINARY |
| --- | --- | --- |
| ASSESS | 78.9 ± 3.8 | 67.8 ± 2.9 |
| w/o ASBS | 66.7 ± 3.5 | 63.9 ± 2.6 |
| w/o RPS-a | 69.2 ± 3.9 | 64.1 ± 2.9 |
| w/o RPS-b | 70.9 ± 4.0 | 64.2 ± 2.8 |

Others: related work and Figure 1.

Thanks for your comment. We will add more related works (e.g., GAPGC, MATCHA) and more details to Figure 1.

Reviewer Comment

Thank you for the rebuttal; it addressed some of my concerns, so I raise my score to 3.

Final Decision

This paper introduces ASSESS, an innovative framework for test-time adaptation (TTA) on graphs, addressing the under-explored challenge of domain shifts in non-Euclidean data. The method combines adaptive subgraph selection via mutual information and prototype regularization to balance prior knowledge from pretrained models with test-time data.

Reviewers highlight several strengths: rigorous theoretical analysis of convergence properties, comprehensive empirical validation across diverse graph datasets (including homophilic/heterophilic structures), and clear performance gains over baselines like T3A and GAPGC. While concerns were raised about the theoretical motivation for subgraph mutual information and potential overlaps with prototype-based methods, the authors’ rebuttals address these via ablation studies and comparisons, demonstrating ASSESS’s unique value in graph-structured TTA. The authors are encouraged to carefully revise the paper according to the reviewers' concerns.