PaperHub
Rating: 4.9/10
Poster · 4 reviewers (scores: 3, 2, 3, 3; min 2, max 3, std 0.4)
ICML 2025

Does Graph Prompt Work? A Data Operation Perspective with Theoretical Analysis

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We offer theoretical proof of why and how much graph prompting works.

Abstract

Keywords
graph prompting, graph neural networks

Reviews & Discussion

Review (Rating: 3)

The paper addresses the challenge that the theoretical underpinnings of graph prompting remain underexplored, particularly highlighting the lack of rigorous theoretical proof regarding why and to what extent it works. This has often been seen as a "dark cloud" over the field, hindering further progress. In response, the paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective.

Questions for Authors

Please refer to the Weaknesses and Suggestions.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for the problem at hand. However, the paper lacks results on cross-domain datasets from real-world scenarios.

Theoretical Claims

I have roughly checked the proof process for all the theorems presented in the paper.

Experimental Design and Analysis

I have checked the datasets, experimental setup, and results presented in the paper.

Supplementary Material

I reviewed the proofs of the theorems provided in the appendix. The proofs are well-structured and logically sound, offering a clear explanation of the theoretical aspects of the paper.

Relation to Prior Literature

The paper provides a comprehensive analysis and proof of the existing literature on graph prompting. It offers valuable insights by approaching the topic from a macro-level theoretical perspective.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths:

  1. The motivation of the paper is clear, effectively addressing the gap in existing heuristic graph prompting methods that lack theoretical grounding.

  2. The topic of the paper is valuable, with a well-organized structure and clear writing, presenting the ideas in a step-by-step manner.

  3. The paper provides thorough theoretical proofs for the claims and approaches introduced, offering solid justification for the proposed framework.

Weaknesses:

  1. The paper does not offer theoretical guidance for designing new graph prompting methods. While it rigorously analyzes existing methods, it does not extend the theoretical framework to propose novel techniques.

  2. The paper lacks experiments on real-world cross-domain datasets, which are crucial for assessing how well graph prompting adapts to different downstream tasks.

Other Comments or Suggestions

  1. Testing on real-world cross-domain datasets would provide insights into the bridge graph method's generalizability and effectiveness across diverse applications.

  2. It would be best to extend the theoretical framework to propose novel graph prompting methods, as this would add more practical value.

Author Response

W1. The paper does not offer theoretical guidance for designing new graph prompting methods. While it rigorously analyzes existing methods, it does not extend the theoretical framework to propose novel techniques.

Thank you for pointing this out. In fact, our theoretical framework can provide fundamental theoretical guarantees for designing novel graph prompting methods. By considering Theorems 6 and 7, one can derive guidance for selecting the structure and number of prompts. These results can serve as useful directions for designing new graph prompting methods in future research. On our code project homepage, we have also added some discussion to inspire the community to design better graph prompts according to our theoretical analysis; please see the open discussion of our work at the code link.

W2. The paper lacks experiments on real-world cross-domain datasets, which are crucial for assessing how well graph prompting adapts to different downstream tasks.

Regarding experiments on real-world data, please refer to Appendix C, where we provide detailed experiments on real-world datasets. These results are consistent with those obtained on synthetic datasets in the main text, further supporting the validity of our theoretical findings.

Regarding the cross-domain datasets, our response is the same as our reply to Reviewer SySL's C1.

Other Comments or Suggestions

We thank the reviewer for these insightful suggestions. We will open a new discussion online to keep updating our latest research findings on more real-world applications and more graph prompt designs. We have in fact already finished such explorations, which is precisely why we wished to establish pure theoretical support, as motivated in this paper. Due to the anonymity policy, we will release these explorations once the paper is accepted.

Reviewer Comment

Thank you for your reply. I will keep my rating.

Review (Rating: 2)

This study aims to provide solid theoretical analysis of graph prompts. The theoretical findings include the capabilities of graph prompts on GCN models with and without non-linear layers, the error bound of the data operations by graph prompts for both a single graph and batch of graphs, and the error distributions of the data operations by graph prompts. This work also provides empirical studies to confirm these theoretical findings.

update after rebuttal

This study analyzes the capacity of graph prompting methods for pre-trained GNN models. The analysis primarily focuses on GPF(-Plus) and All-in-One(-Plus) as the graph prompting methods and GCNs and GATs as the GNN models. I think correct formulations of the studied graph prompting methods and GNN models are the most basic requirement of a theoretical paper. However, when checking their corresponding descriptions in the paper, we find that most of them are questionable, confusing, or even incorrect, e.g., GPF-Plus (lines 630 to 638), GAT (Equations 7 to 9), GCN (line 642), All-in-One-Plus (no description found).

Considering this, I have to keep my evaluation as negative. I hope the authors can check their paper from beginning to end rigorously to avoid such obvious mistakes.

Questions for Authors

Please mainly address the weaknesses above.

Claims and Evidence

Not fully supported. For example, using the distance between $F_{\theta^*}(G_p)$ and $C(G)$ as the error is questionable. In addition, the formulation of GAT is weird. More details are provided in the weakness list.

方法与评估标准

It does not fully make sense. The theoretical analysis in this paper has some issues in terms of the assumptions, formulations, and proofs. More details are provided in the weakness list.

Theoretical Claims

I checked the proofs for theoretical claims, including most parts from page 11 to 17. The issues can be found in the weakness list.

Experimental Design and Analysis

I checked the experiments in the main paper and the supplementary material. The issues can be found in the weakness list.

Supplementary Material

I checked the proofs for the theoretical claims, including most parts on pages 11-17 and pages 25-26.

Relation to Prior Literature

The key contribution of this paper is the theoretical analysis of the existing graph prompt learning methods, such as GPF and All-in-one. So it is very related to the literature.

Missing Important References

N/A

Other Strengths and Weaknesses

  • Weakness1: Using the distance between $F_{\theta^*}(G_p)$ and $C(G)$ as the error is questionable. In many cases, there are multiple optimal graph embeddings given different task predictors. There may exist graph embeddings that are far from the optimal graph embeddings but can achieve almost-optimal performance. Hence, using the distance as the error may be improper for determining whether a prompted graph is close to the optimal ones.
  • Weakness2: The definition of linear and non-linear graph aggregations is vague. The meaning of the so-called "non-linear graph aggregations" like GAT is vague in this paper. Basically, GAT and GCN both aggregate the embeddings from a node's neighbors linearly and pass the aggregated embedding through a non-linear activation function to obtain the updated embedding. Using linear and non-linear graph aggregations to distinguish them is quite confusing.
  • Weakness3: The formulation of GAT is weird. In Equations (7)-(9), the authors use "the simplest form" of the attention mechanism in GAT. In "the simplest form", the attention coefficient $\alpha_{jk}$ is determined by node features and is irrelevant to hidden embeddings. As a result, $\alpha_{jk}$ will be constant across different GAT layers, not affected by GAT parameters at all. Therefore, it is quite strange to formulate GAT in this way.
  • Weakness4: The formulation of GPF-plus is unclear. The authors should specify what $Q$ represents in Equation (14) and the meaning of $M$.
  • Weakness5: The meaning of $\epsilon$ when formulating Equation (17) should be specified. According to Table 3, $\epsilon$ represents an error. But it seems that the $\epsilon$ before Equation (17) does not represent an error.
  • Weakness6: The authors should introduce the method All-in-one-Plus in the experiments. It is not introduced anywhere but only appears in some experimental results.
  • Weakness7: The design for obtaining $C(G)$ in the experiments is questionable. $C(G)$ is obtained from a modified graph produced by randomly removing nodes/edges. These modifications do not include other graph manipulations, such as changing node features and adding extra nodes.

Other Comments or Suggestions

N/A

Author Response

W1

  • The motivation for the distance we used is as follows: we assume we are given an anchor/target graph embedding, and our theory proves that a graph prompt can approximate that embedding. The given embedding is not necessarily the unique optimal graph embedding; it can be any one. The purpose of this paper is to demonstrate the powerful data-operation approximation capability of graph prompts. Of course, from a more practical view, we usually hope the anchor graph embedding is an optimal one for a specific task, since that links graph prompts to various downstream tasks. Please note, however, that this is not necessary for the theoretical basis of this paper; we mention it only to indicate that this technique can find potentially optimal solutions for downstream tasks, and to help readers understand that graph prompting and traditional fine-tuning are two routes to the same target.
  • Although the distance between $F_{\theta^*}(G_p)$ and $C(G)$ may overestimate the actual performance deviation, this further supports the fact that the distance serves as an upper bound for the deviation. Our theoretical results demonstrate that this upper bound on the error is either zero or tightly controlled, ensuring that the performance deviation is also controlled. This, in turn, guarantees the effectiveness of the prompt. Thus, the strategy of using this distance as the error remains valid.
  • Concerning different task predictors (decoders): while optimal embeddings vary across decoders, in practice each downstream task has a fixed decoder from the pre-trained model. Our analysis focuses solely on scenarios with fixed downstream tasks, ensuring consistent optimal embeddings.
  • Additionally, our analysis does not require uniqueness of the optimal embeddings, only their existence. As long as an optimal embedding exists and the distance to it remains small, performance quality is assured, maintaining the validity of our approach regardless of embedding uniqueness.
  • We appreciate your insightful comments and agree that exploring additional measurement approaches is promising, though somewhat beyond this paper's current scope. We intend to include such discussions in the camera-ready version upon acceptance.

W2

Thank you for your question. Please note that we clearly discuss non-linear graph aggregations in Section 4.4. The definitions of linear and non-linear graph aggregations are clearly stated in the paper (see also the sketch after this list). Specifically:

  • Linear graph aggregations: the aggregation coefficients do not depend on the node feature vectors. For example, in GCN, the coefficients for combining neighboring nodes' embeddings are constant (e.g., $H = AXW$ with $A$ a constant matrix and $W$ the parameter, analogous to $y = ax + b$ in linear algebra).
  • Non-linear graph aggregations: the aggregation coefficients depend on the node feature vectors. For example, in GAT, the coefficients are determined by attention scores, which are computed from the feature vectors of neighboring nodes.
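
To make the distinction concrete, here is a minimal PyTorch sketch of the two aggregation types; the shapes, names, and the simple additive attention form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

N, d_in, d_out = 5, 8, 4
A = (torch.rand(N, N) > 0.5).float()   # fixed adjacency: constant coefficients
X = torch.randn(N, d_in)               # node features
W = torch.randn(d_in, d_out)           # learnable weight

# Linear aggregation (GCN-style): the coefficients in A do not depend on X.
H_gcn = A @ X @ W                      # H = A X W

# Non-linear aggregation (GAT-style): coefficients are attention scores
# computed from the node features themselves, i.e., A_att = f(X).
a = torch.randn(2 * d_in)              # additive attention vector
scores = torch.stack([torch.cat([X[j], X[k]]) @ a
                      for j in range(N) for k in range(N)]).view(N, N)
A_att = torch.softmax(F.leaky_relu(scores), dim=1)
H_gat = A_att @ X @ W                  # A depends on X, hence non-linear in X
```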

You are correct that GAT and GCN both use non-linear activation functions for processing aggregated information. However, we emphasize the distinction in the aggregation process, not the subsequent information processing. Almost all GNN models use non-linear layers (e.g., MLPs) to capture complex patterns during information processing, but this is not the focus of our discussion.

W3

Sorry for this typo. We will modify Equation (7) by replacing $X_j$, $X_k$ with $H_j$, $H_k$ (hidden embeddings), which is consistent with the GAT reference. Please note that this typo does not affect our downstream theoretical analysis.

W4

In the mentioned section, we state that GPF-plus adds a combination of multiple prompt vectors to each node's features. Specifically, $Q \in \mathbb{R}^{M \times k}$ is a matrix whose $i$-th row $Q_i$ contains the combination coefficients of the $k$ prompt vectors that are added to the node features. The update rule is $[X_\omega]_i = X_i + Q_i P$, where $M$ is the number of graphs in the dataset $\Omega$; each graph is processed with its own combination coefficients $Q_i$. We will add this explanation in the final version.
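
A hypothetical sketch of this update rule, $[X_\omega]_i = X_i + Q_i P$: one coefficient row per graph combines the $k$ shared prompt vectors, and the resulting vector is added to every node of that graph. All names and shapes here are illustrative assumptions.

```python
import torch

M, k, F = 3, 4, 16                          # graphs, prompt vectors, feature dim
P = torch.randn(k, F, requires_grad=True)   # k learnable prompt vectors
Q = torch.randn(M, k, requires_grad=True)   # per-graph combination coefficients

# Toy dataset: M graphs with varying node counts and F-dim node features.
graphs = [torch.randn(n, F) for n in (5, 8, 6)]

prompted = []
for i, X_i in enumerate(graphs):
    p_i = Q[i] @ P                          # graph-specific prompt vector, shape (F,)
    prompted.append(X_i + p_i)              # broadcast-add to all nodes of graph i
```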

W5

We apologize for this inconsistency, which may cause confusion. We will replace this notation in Equation (17) with a different symbol and update the notation table accordingly in the final version.

W6

All-in-one-Plus: we set the inserting pattern of All-in-One as independent parameters (All-in-One, by contrast, relies on prompt tokens). We will clarify this in the final version.

W7

Our implementation of the modified graph includes all the operations you mentioned: "Adding/deleting nodes, adding/deleting/changing edges, and transforming features of a given graph G." This is explicitly stated in the task settings described in Appendix C. We will revise this section further in the final version.

Reviewer Comment

Thank you for providing the detailed reply. I still have the following unaddressed concerns and decide to maintain the rating.

  • Following W1: Thanks for the explanation. I would say using the loss for analyzing the quality of graph embeddings produced by graph prompts is more intuitive and reasonable than the distance to the optimal graph embedding with $\epsilon$. Even if a graph embedding is close to the optimal one (i.e., their distance is smaller than $\epsilon$), its real quality is also affected by other factors, such as the landscape at the embedding. Although the upper bound of the distance error is either zero or tightly controlled as demonstrated in this paper, I think loss-based metrics make more sense for analysis.
  • Following W2: I agree that GCN and GAT use different coefficients for aggregation. However, they both use $H = AXW$ to update embeddings. The difference here is that GCN uses a consistent $A$ while GAT uses a layer-specific $A$. If they exist, it would be great to provide existing studies that use linear/non-linear graph aggregation to distinguish them. Otherwise, the authors should provide formal definitions before the analysis.
  • Following W3: If the authors modify Equation (7) to the form consistent with GAT, what does "the simplest form" of GAT refer to?
  • Following W4: The explanation is still confusing. If $Q$'s $i$-th row $Q_i$ represents the combination coefficients of prompt vectors, $Q_i P$ will be an $F$-dimensional row vector added to the feature matrix $X_i$ of graph $G_i$. In this scenario, the nodes in graph $G_i$ will share the same prompt vector $Q_i P$. But I think the nodes in GPF-plus should have diverse prompt vectors, which is inconsistent with the authors' formulation.
  • Following W5: Let's say $\tau$ replaces $\epsilon$, i.e., $S = A + \tau I$ in GCN when formulating Equation (17). What does $\tau$ mean? I am just wondering why we need the additional term $\tau$ here.
  • Following W7: The authors should specify the operations used for obtaining the resulting figures in the paper. As it stands, one may think the results only involve randomly removing nodes/edges.
Author Comment

Following W1:

We think there may be some misunderstanding, which we clarify as follows. Please note that the target of this paper is to theoretically analyze how well a graph prompt approximates graph data manipulation; this is the only target and task of this paper. To this end, what you call "the loss for analyzing the quality of graph embeddings by graph prompts" is exactly what we are doing: using the current distance to see how well a graph prompt can approximate a manipulated graph. As we replied before, the given embedding is not necessarily the unique optimal graph embedding; it can be any one. The purpose of this paper is to demonstrate the powerful data-operation approximation capability of graph prompts.

For other task losses, the performance of a graph embedding can be seen as a decoded result of the embedding, which depends on the decoder's overall properties (landscape). This dependency adds complexity to the analysis, because we cannot say that a powerful data-manipulation capability strictly corresponds to better performance across various downstream tasks. Our paper focuses only on analyzing why and how well a graph prompt manipulates graph data.

Regarding your statement, “its real quality is also affected by other factors, such as the landscape at the embedding”, please note:

  • For cases where the distance is 0 (as proven in our work), the performance is guaranteed to be optimal, and hence the "real quality" is not affected.
  • For cases with small distances, the Lipschitz continuity of neural networks ensures that the performance differences remain bounded and small (see the sketch after this list). As our work is an early-stage theoretical exploration of graph prompts, we acknowledge that future work can aim to extend such analysis to performance-level differences. However, by analyzing the rigorous upper bound through the distance metric, we have already obtained meaningful and significant results, which do not diminish the contributions of this work.
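
For concreteness, a one-line version of this Lipschitz argument (our illustration, assuming the downstream decoder $D$ is $L$-Lipschitz and $\epsilon$ bounds the embedding distance; this exact inequality is not quoted from the paper):

```latex
% If D is L-Lipschitz and the embedding error is at most epsilon,
% the downstream output deviation is bounded accordingly:
\left\| D\big(F_{\theta^*}(G_p)\big) - D\big(C(G)\big) \right\|
  \;\le\; L \,\left\| F_{\theta^*}(G_p) - C(G) \right\|
  \;\le\; L\,\epsilon .
```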

Following W2:

Linear vs. non-linear aggregation is not a peculiar term in the GNN area; it is a natural mathematical concept. For GAT, the matrix $A$ in $H = AXW$ contains attention scores that are themselves computed from $X$; therefore $H = AXW$ is a non-linear function of $X$. Consider the simple equation $y = ax + b$: if $a$ is constant, the map is linear; if $a = f(x)$ (i.e., $a$ depends on $x$ and is then multiplied by $x$ again), it is non-linear.

It is a standard mathematical concept, not a peculiar term. Nevertheless, to further address your concern, we promise to provide explicit formal definitions in the final version.

Following W3:

It refers to the most classical form of the attention mechanism.

Following W4:

In our analysis of GPF (e.g., Theorem 3), we demonstrated that a single prompt per node is already sufficient to achieve zero error for fitting. When extending to multiple graphs, we aim to study the performance of prompts across the entire dataset. The core issue lies in how to combine multiple prompts for multiple graph embeddings. Unlike Universal Prompt Tuning for Graph Neural Networks, which assigns diverse prompt vectors to different nodes to ensure faster prompt tuning, our approach focuses on studying the theoretical properties of prompts. From this perspective, for a single graph, it is not necessary to assign diverse prompt vectors to different nodes. This simplification makes the analysis clearer and more intuitive.

Following W5:

In GCN implementations, using only the adjacency matrix $A$ for message passing propagates information only from neighboring nodes, without considering the node itself. To address this, a virtual "self-loop" is often added, resulting in the form $A + \tau I$ (see the sketch after this list).

  • In most implementations, $\tau$ is set to 1 by default.
  • However, to allow finer control over how much self-information is propagated, the additional term $\tau$ can be introduced. This provides flexibility in controlling the contribution of self-loops during information processing.
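
An illustrative sketch (assumed shapes and names) of this self-loop term, $S = A + \tau I$, where $\tau$ controls how much self-information propagates and $\tau = 1$ recovers the standard GCN self-loop:

```python
import torch

N = 4
A = (torch.rand(N, N) > 0.5).float()
A = torch.triu(A, 1)
A = A + A.T                                  # symmetric adjacency, no self-loops
tau = 1.0                                    # default self-loop weight

S = A + tau * torch.eye(N)                   # S = A + tau * I
d_inv_sqrt = S.sum(dim=1).clamp(min=1e-12).rsqrt()
S_hat = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]  # symmetric normalization
```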

Following W7:

Thank you for pointing this out again. In the final version, we promise we will clearly specify that.

Final Fight for our Paper:

We believe we do not need to further declare the meaningful and strong contribution of this paper, because the theoretical contributions of our work remain significant despite minor typos or ambiguities. We thank you for your engaged comments. However, it seems that the issues you raised primarily concern basic expressions and elementary mathematical background, which we believe will be fully addressed in our final version once accepted. We would appreciate it if you would consider raising your score to reflect our paper's quality objectively.

Thanks.

Review (Rating: 3)

The paper presents a comprehensive theoretical analysis of graph prompting, a novel technique aimed at adapting pre-trained GNN models to downstream tasks without retraining. It introduces the concepts of "bridge sets" and "ϵ-extended bridge sets" to explain the capacity of graph prompts to simulate graph transformations and align upstream pre-trained models with downstream tasks. The authors provide theoretical proofs demonstrating the conditions under which graph prompts effectively approximate various graph data operations, establish error bounds on these approximations, and examine error distributions. Empirical results validate these theoretical findings, supporting the practicality and effectiveness of graph prompting.


(+) The paper fills a significant theoretical gap in the graph prompting literature, rigorously establishing conditions under which graph prompts can successfully approximate data operations.

(+) It systematically addresses various scenarios including single and batch graphs, linear and non-linear graph aggregation models (GCN and GAT), and extends its theoretical guarantees to practical GNN architectures.

(+) Introducing bridge sets as a theoretical tool is innovative and clarifies how prompting impacts model behavior from a data manipulation perspective.


(-) Despite rigorous theoretical insights, practical application and generalization to various real-world graph tasks might still face challenges in prompt design and optimization.

(-) Some theoretical guarantees depend on assumptions like "full-rank" model parameters, which, although justified by practical model initialization and training strategies, might limit generalization or require specific model conditions.

(-) While synthetic datasets convincingly demonstrate theoretical validity, broader validation across diverse real-world datasets and complex scenarios would enhance practical relevance.


update after rebuttal

Thank you to the authors for the thoughtful and detailed rebuttal. I appreciate the clarifications regarding the full-rank assumption and the inclusion of supporting theorems (e.g., Theorems 5, 8, and 9) for scenarios where the assumption does not hold. The explanation of practical conditions under which full-rank matrices are likely to appear is helpful.

That said, I still believe the practical implications and generalizability of prompt design remain open challenges, especially in more complex real-world settings. While I acknowledge the experiments in the appendix, I would encourage the authors to highlight these results more explicitly in the main text to improve accessibility and clarity.

Overall, the rebuttal strengthens my confidence in the theoretical contributions, but my initial position remains unchanged. I still lean toward a weak accept, primarily due to the paper’s rigorous theoretical insights and the novelty of the proposed framework.

Questions for Authors

  1. Could you clarify under which practical conditions the assumption of a full-rank matrix in Theorems 3 and 4 typically holds?

  2. Can you discuss the scalability of your proposed graph prompting methods to extremely large graph datasets, such as those encountered in industrial applications?

Claims and Evidence

The claims in the paper are clearly stated and largely supported by convincing theoretical proofs and controlled empirical evidence. However, claims about generalizability to complex real-world scenarios are not thoroughly supported by empirical experiments.

Methods and Evaluation Criteria

The methods and evaluation criteria, including synthetic benchmarks and the use of GCN and GAT as representative models, are appropriate and relevant to the theoretical focus of the study. However, additional real-world datasets could provide a more comprehensive evaluation.

Theoretical Claims

I checked the correctness of key theoretical claims such as Theorems 3, 4, and 5. The proofs appear mathematically sound and logically consistent. Minor concerns may exist around the conditions for full-rank assumptions, but they are clearly justified.

Experimental Design and Analysis

The experimental designs using synthetic datasets effectively validate theoretical claims. The choice of metrics and analysis approaches are sound. However, the extension to real-world graph data could be more comprehensive.

Supplementary Material

I did review the supplementary material.

Relation to Prior Literature

The paper clearly positions itself in relation to recent literature, especially in the area of prompting techniques and graph neural networks. It effectively references significant prior work such as Sun et al. (2023a, b), Fang et al. (2022, 2024), and Liu et al. (2023).

Missing Important References

The paper adequately references essential and relevant prior work. No critical omission is evident concerning the core contributions. However (not strictly necessary), I recommend including more recent papers such as GSPF, SUPT, GSmFP, and HGPROMPT in the final version, since graph prompting has rapidly emerged as a promising research direction, as the authors mention.

Other Strengths and Weaknesses

The paper's originality lies significantly in its rigorous theoretical grounding of prompting techniques for GNNs, contributing valuable insights and clarity to an otherwise empirically driven field. The clarity and structure of the presentation further enhance its impact.

Other Comments or Suggestions

Minor editing for typographical errors could further enhance readability. Overall, the manuscript is clearly written. The citeauthor format is inconsistent. For instance, on page 2:

(line 071) ... as described in the review by Sun et al. (2023b), can be ...

(line 107) Initially, (Fang et al., 2022) have proved that ...

Author Response

C1: Despite rigorous theoretical insights, practical application and generalization to various real-world graph tasks might still face challenges in prompt design and optimization.

Thank you for your insightful comment. This paper primarily focuses on providing theoretical insights and rigorous proofs for graph prompting. We believe that exploring practical applications and generalizations should be left to the broader research and industrial communities. To trigger this exploration, we added some open discussion to our code project homepage, where you can see some practical impact of our theory.

C2: Some theoretical guarantees depend on assumptions like "full-rank" model parameters, which, although justified by practical model initialization and training strategies, might limit generalization or require specific model conditions.

Thank you for your comment. In Theorems 3 and 4, the "full-rank" assumption is critical for ensuring the existence of lossless graph prompting methods in general settings. For cases where the full-rank assumption does not hold, we have dedicated considerable effort in our work to analyzing such scenarios (e.g., Theorems 5, 8, and 9). These theorems provide corresponding bounds that demonstrate only controllable errors and ensure the robustness of prompt design. Therefore, the "full-rank" assumption is a key condition discovered during our exploration and does not detract from the contributions of this paper.

C3: While synthetic datasets convincingly demonstrate theoretical validity, broader validation across diverse real-world datasets and complex scenarios would enhance practical relevance.

In Appendix B of our paper, we conducted extensive analyses and experiments on various real-world datasets. The results obtained are very similar to those from synthetic datasets in the main text. Therefore, we believe that both synthetic and real-world datasets yield consistent results, fully supporting the validity of our theoretical findings.

Regarding the experimental setup, we carefully aligned each experiment with the theoretical discoveries, focusing on the core theoretical insights. From a practical experimental perspective, this is sufficient to validate the problem. Broader and more complex settings could be left to the wider research community. Our goal is to emphasize the solid theoretical results.

C4: typos and references:

We have noted the minor editing errors and the suggestions for updating references. These will all be incorporated in the final version of the paper.

Q1. Could you clarify under which practical conditions the assumption of a full-rank matrix in Theorems 3 and 4 typically holds?

Thank you for your question. As discussed in the section following Theorem 4, when parameter matrices are initialized using methods such as orthogonal initialization, He initialization, or Xavier initialization, they are almost always full-rank throughout the training process.

Here, "almost always" can be understood as follows: the determinant of a matrix is 0 if and only if the matrix is rank-deficient. However, during training, the determinant can take any value along the real number axis and its behavior can be considered as random fluctuations, making it extremely unlikely to be exactly 0. Since He initialization is the standard initialization method in modern practices, the direct answer is that with proper initialization, trained models can be considered full-rank matrices in practice. We added some open discussion to our work in the code project homepage, from which you could see further practical discussion on full-rank condition.

Q2. Can you discuss the scalability of your proposed graph prompting methods to extremely large graph datasets, such as those encountered in industrial applications?

Our work primarily analyzes classical prompting methods, such as GPF and All-In-One. In Theorem 7, we provide a detailed discussion on the performance of multi-token GPF and multi-node subgraph All-In-One methods on large-scale graph datasets. Specifically, part (4) of Theorem 7 gives an upper bound on the final error, theoretically demonstrating the scalability of graph prompting methods to extremely large graph datasets.

Some studies have already achieved promising results on larger datasets, and our work provides theoretical support for these findings. For extremely large graph datasets, following our theorems and subsequent discussions, effective results can be achieved using prompts that are much smaller than the dataset itself. This theoretical work strongly supports the scalability of graph prompting methods to extremely large graph datasets and provides a solid foundation for future industrial applications.

We will add more scalability discussion in the final version.

Review (Rating: 3)

The paper theoretically analyzes graph prompting. First, it shows that the main reason graph prompting works is that it can simulate graph operations, and why this matters when encountering new tasks. Second, it presents upper bounds on the error of graph prompts when simulating graph operations. Third, the results are extended to non-linear graph aggregations.

update after rebuttal

The paper presents an interesting contribution. However, I am keeping my score unchanged, as stronger experimental validation would be needed for a higher score. Specifically:

  1. The evaluation focuses on embedding reconstruction rather than performance on the actual downstream task, despite having access to labels of the downstream task.

  2. The training task consists of approximating graph data operations such as random corruption, but it would be more insightful to train on real-world graph tasks. More importantly, to properly assess the potential of graph prompting, it is necessary to change the dataset between training and test.

Questions for Authors

Why does the loss in Equation 1 only take the graph-level embedding? Why aren't we taking the output of the downstream classifier (which is applied to the graph-level embedding) and the ground-truth labels?

Claims and Evidence

The theoretical claims are supported by clear and convincing evidence

Methods and Evaluation Criteria

I am not sure I understand the evaluation criteria. I was expecting a defined training task, a training dataset, a downstream task, and a downstream dataset. However, from Appendix C, it seems that the training task is to approximate graph data operations such as random corruption, and that the downstream task is the one of interest (and the dataset does not change).

Why aren't we considering a completely different training task defined on a different dataset than the downstream task? I understand that this setting is significantly more challenging, but otherwise I believe we are not really testing the generalization ability of graph prompting.

Theoretical Claims

The proofs seem correct to me.

Experimental Design and Analysis

Unfortunately I do not think I fully understand the settings of the experimental design. If the goal of graph prompting is to adjust the downstream data to make the downstream task compatible with the pretraining task, why don't we directly test the performance on the downstream task (since we have the labels), instead of reporting the error in reconstructing the embeddings?

Supplementary Material

I reviewed the supplementary material.

Relation to Prior Literature

The key contributions of the paper relate to the broader scientific literature by extending the concept of graph prompting, which has been explored in empirical studies but lacks theoretical understanding. The paper demonstrates that graph prompts can simulate graph data operations and provides theoretical guarantees for their effectiveness.

Missing Important References

None

Other Strengths and Weaknesses

The topic is interesting but the paper is hard to read. This is partially due to the presence of many theoretical results, but I think the authors should include intuitive explanations throughout the paper.

The experimental settings are unclear, as I discussed above. Appendix C partially helped me understand the loss and the training procedure (which aims at obtaining the same graph embeddings under manipulations of the graph), and should be partially moved to the main paper; otherwise it is unclear what the training task and dataset are.

Other Comments or Suggestions

\rightarrow in Equation 2 should be \approx

$A_{in}$ should be $\mathbf{A}_{in}$ in line 97 (right)

please use \citet in Section 2 motivation.

Author Response

C1: Why aren't we considering a completely different training task defined on a different dataset than the downstream task? I understand that this setting is significantly more challenging, but otherwise, I believe we are not really testing the generalization ability of graph prompting.

Thank you for your question. Transferring graph models across different domains (graph datasets) is a very promising potential application of graph prompting, because we believe the core challenge in this topic is how to learn an effective data-manipulation strategy. Before graph prompting, one could rely almost only on hand-crafted designs, which are far from sufficient to achieve this goal. Transferring across graph datasets/domains, or across different tasks, are two hard problems in the graph learning area. Transfer across domains/datasets in particular remains not well solved; some recent works have attempted it, but most are far from sufficient for industrial applications.

Our paper finds that learning an effective data-manipulation strategy might be helpful for both domain transfer and task transfer in graph learning. Recently, some graph prompt-based work has shown promising results for graph domain transfer in recommendation (Zhang et al., Adaptive Coordinators and Prompts on Heterogeneous Graphs for Cross-Domain Recommendations, arXiv:2410.11719). The potential behind this lies in the powerful graph data-manipulation capability of graph prompts. However, diving into so many varied applications is far outside the scope of our motivation. This paper aims to provide the background theoretical support for our community to explore further. To this end, we focus on the core problem: how well does a graph prompt manipulate graph data? Accordingly, our experiments are designed to justify the correctness of our theoretical analysis. We will explore more task settings in the future to expand the impact of this paper.

C2: I think the authors should include intuitive explanations throughout the paper. The experimental settings are unclear, as I discussed above. Appendix C partially helped me understand the loss and the training procedure ..., and should be partially moved to the main paper..."

Thank you for the suggestion. We will carefully consider better formatting and move some key explanations from Appendix C to the main paper in the final version to improve clarity.

C3: typos

Thanks for pointing these out; we will carefully update them in our final version.

C4: Why does the loss in Equation 1 only take the graph-level embedding? Why aren't we taking the output of the downstream classifier (which is applied to the graph-level embedding) and the ground-truth labels?

This concern might stem from an expression issue on our side; we are sorry for the misunderstanding it caused. Take graph classification as an example: the classifier (e.g., an MLP) can usually be treated as additional layers of the graph model, so the output of the classifier can be treated as a special graph embedding. We adopt this expression for conciseness, without loss of generality. In a standard graph prompt workflow, the task head is usually fixed and pre-defined, so it is just a constant mapping and usually won't affect the graph prompt design. We thank you for this suggestion and will do our best to refine this section in our final version.

Reviewer Comment

Thank you for the reply. Can you please expand on why we don't directly test the performance on the downstream task (since we have the labels and it is what we are ultimately interested in), instead of reporting the error in reconstructing the embeddings?

Author Comment

Thank you for the reply.

There is already a significant body of work in the literature on graph prompt tuning that directly tests downstream task performance, such as All-in-One Prompt. These works have extensively demonstrated that prompt tuning performs well in downstream tasks when evaluated in an end-to-end manner.

Building upon their findings, which have already validated the practical effectiveness of prompt tuning, our work focuses on a different, yet equally important, aspect: providing theoretical explanations for the effectiveness of prompt tuning. Specifically, rather than testing downstream task performance directly, our goal is to identify the core mechanism behind prompt tuning. This mechanism is centered on tuning the graph embeddings to effectively fit the transformations required by downstream tasks.

Kindly note that this paper does not propose any new prompt model. The only motivation of our paper is to answer how well a graph prompt can manipulate graph data; this is the only target. Testing graph prompts on other tasks would be more like benchmarking or proposing a new model, which is not the content of our paper.

In the end, we would like to further present the main contribution of our work:

  1. Core Contribution:

    • Our study is fundamentally aimed at understanding the theoretical underpinnings of prompt tuning. By analyzing the ability to tune graph embeddings to fit transformations (e.g., via reconstruction error), we offer a rigorous explanation for why prompt tuning works.
    • This focus distinguishes our work from purely engineering-driven approaches that prioritize end-to-end performance without necessarily addressing underlying principles.
  2. Challenges in Evaluating Downstream Tasks:

    • Direct downstream evaluation can be heavily influenced by task-specific factors, such as the choice of decoders, discrete optimization steps, and specific task requirements. These factors often obscure the generalizable insights into prompt tuning mechanisms.
    • Our embedding reconstruction approach avoids these complications, providing a more controlled and interpretable evaluation of prompt tuning effectiveness.
  3. Completing the Theory-to-Practice Chain:

    • Our work forms a theory-to-practice bridge when combined with the aforementioned practical studies ([X], [Y]). Those works already establish the practical effectiveness of prompt tuning in downstream tasks, while our work provides the missing theoretical guarantees. Together, they establish a complete pipeline for understanding and applying graph prompt tuning effectively.

In our follow-up work, we plan to incorporate additional settings and auxiliary end-to-end experiments to further validate our theoretical insights in practical scenarios. However, we emphasize that the primary contribution of this paper lies in its theoretical guarantees, which provide essential guidance for this field.

We believe that our current approach, which focuses on the distance between embeddings and their optimal counterparts, along with extensive experimental validation, is logically consistent and sufficient for supporting our claims. Combined with practical works mentioned in our paper, our study contributes to a robust foundation for graph prompt tuning research.

Lastly, we sincerely thank you for your thoughtful comments and encourage you to acknowledge the significant theoretical contributions of this work. We appreciate your review and hope you will consider further supporting our paper.

Thanks

Final Decision

The paper presents a theoretical analysis of graph prompting (inserting new subgraphs or features to the target graph and applying a pretrained GNN to adapt to downstream tasks), addressing a critical gap in the literature by providing formal guarantees on its approximation capabilities, error bounds, and applicability to both linear and non-linear graph models. The work is well-received by most reviewers for its rigorous theoretical contributions. Though some concerns remain regarding practical applicability, experimental validation, and clarity of certain assumptions, the rebuttals adequately address most critiques.