PaperHub
6.6/10
Poster · 4 reviewers
Ratings: 4, 3, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

Vision Graph Prompting via Semantic Low-Rank Decomposition

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24
TL;DR

In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures.

Abstract

Keywords
Graph Neural Networks · Visual Prompting

Reviews and Discussion

Review
Rating: 4

The paper introduces Vision Graph Prompting (VGP), a novel parameter-efficient fine-tuning method tailored for Vision Graph Neural Networks (ViG). The authors propose that semantic information in vision graphs resides primarily in low-rank components of the latent feature space, a key insight derived from PCA-based analysis of graph structures. Building on this, VGP incorporates three prompt types—SeLo-Graph Prompt, SeLo-Edge Prompt, and SeLo-Node Prompt—each leveraging semantic low-rank decomposition to capture global and local semantic dependencies within ViG topologies. The method freezes the pre-trained ViG backbone and fine-tunes only the prompts and a downstream head, achieving performance comparable to full fine-tuning with significantly fewer trainable parameters. Extensive experiments on ten vision datasets (e.g., CUB, Flowers, GTSRB) and nine graph datasets (e.g., BBBP, Tox21, PPI) demonstrate that VGP outperforms existing visual and graph prompting methods, achieving an average accuracy of 89.6% on vision tasks and 76.39% on graph tasks, surpassing full fine-tuning in several cases. The main contributions include the VGP framework, the low-rank prompting insight, and its superior transfer performance across diverse downstream tasks.

Questions for Authors

Generalizability to Graph Tasks: Your hypothesis suggests low-rank patterns exist in chemistry/biology graph data (Section 5.3). Could you provide PCA or similar analysis on these datasets (like Figure 2 for vision) to confirm this? A positive response with evidence would strengthen my confidence in VGP’s broader applicability, potentially raising my rating from "accept" to "strong accept."

Choice of Rank $r=32$: Table 4 shows peak performance at $r=32$, but Appendix A.3 estimates $r=50$ (CUB) and $r=60$ (Flowers). Why was $r=32$ chosen over these values? Clarification could resolve this apparent discrepancy, impacting my view on the method’s optimization rigor.

Claims and Evidence

The claims in the paper are generally well-supported by clear and convincing evidence. The primary claim—that VGP achieves performance comparable to full fine-tuning while being parameter-efficient—is substantiated by quantitative results in Tables 1 and 2, showing VGP’s accuracy surpassing or matching baselines across diverse datasets. The assertion of semantic information residing in low-rank components is convincingly supported by PCA visualizations (Figures 2 and 5) and theoretical discussion in Appendix A.3, linking shared PCA components to low-rank properties. Ablation studies (Table 4, Figure 4) further validate the effectiveness of individual components (SeLo-Graph, SeLo-Edge, SeLo-Node) and hyperparameter choices (e.g., rank $r$, blending factors $\alpha$ and $\beta$). However, the claim of generalizability to traditional graph tasks (Section 5.3) is slightly weaker due to the lack of detailed analysis on why low-rank properties extend to chemistry/biology domains beyond a hypothesis. While plausible, this claim could benefit from additional evidence, such as a similar PCA analysis on graph datasets, to strengthen its foundation.

Methods and Evaluation Criteria

The proposed VGP method and its evaluation criteria are well-suited to the problem of adapting ViG models for downstream vision tasks. The method’s design—introducing low-rank prompts at graph, edge, and node levels—aligns logically with the topological nature of ViG, addressing the limitations of Transformer-centric prompting methods. Using ten vision datasets (e.g., CUB, GTSRB, SVHN) with diverse categories and distributions is a robust choice for evaluating transfer performance, as is the extension to nine chemistry/biology graph datasets to test generalizability. The evaluation metric (classification accuracy) is standard and appropriate for these tasks. The experimental setup, including freezing the backbone and training for 100 epochs with AdamW optimization, is reasonable and consistent with prior work (e.g., DAM-VP, InsVP). However, the choice of a single ViG-M backbone (pre-trained on ImageNet-21k) could be expanded to other ViG variants (e.g., MobileViG) to validate robustness further, though this is a minor concern given the focus on prompting efficiency.

Theoretical Claims

The paper includes a theoretical claim in Appendix A.3 linking the low-rank property of semantic information to PCA and eigenvalue decomposition (Equations 13-15). I reviewed the correctness of this analysis, which builds on standard PCA principles to argue that semantically connected nodes share dominant components, implying a low-rank structure. The formulation appears mathematically sound: the covariance matrix decomposition and rank estimation based on eigenvalue thresholds are consistent with PCA theory. The error term estimation ($O(\lambda_{r+1})$) is a reasonable approximation, though it assumes a clear eigenvalue drop-off, which is visually supported by Figure 5’s long-tail distribution. No significant issues were found, but the analysis could be strengthened by quantifying the variance captured by the chosen rank $r$ (e.g., 50 for CUB) to directly tie it to the experimental choice of $r=32$.
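As a concrete illustration of the kind of analysis discussed here, below is a minimal sketch (not the authors' code) of estimating an effective rank from the eigenvalue spectrum of a patch-feature covariance matrix; the 95% explained-variance threshold and the toy data are assumptions for illustration only.

```python
import numpy as np

def estimate_effective_rank(features: np.ndarray, var_threshold: float = 0.95) -> int:
    """Smallest rank whose leading eigenvalues capture `var_threshold` of the variance.

    `features` is an (N, D) matrix of patch or node features; the 0.95 threshold
    is an illustrative choice, not a value taken from the paper.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(centered) - 1)   # D x D covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]             # eigenvalues, descending
    ratio = np.cumsum(eigvals) / eigvals.sum()          # cumulative explained variance
    return int(np.searchsorted(ratio, var_threshold) + 1)

# Toy check: features drawn from a rank-20 subspace plus small noise.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 192))
feats += 0.01 * rng.normal(size=(1000, 192))
print(estimate_effective_rank(feats))  # close to 20
```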

Experimental Designs and Analyses

I examined the experimental designs and analyses in Sections 5 and 5.4, including the quantitative results (Tables 1, 2, 5) and ablation studies (Table 4, Figure 4). The design is sound: comparing VGP against full fine-tuning and state-of-the-art prompting methods (e.g., InsVP, GraphPrompt) on diverse datasets ensures a fair and comprehensive evaluation. The ablation studies systematically test core components, rank $r$, and blending factors $\alpha$ and $\beta$, with results consistently showing performance improvements (e.g., 5.7% gain from SeLo-Graph on CUB). The statistical validity is supported by the use of standard splits (e.g., scaffold split for chemistry datasets) and consistent augmentation strategies. One minor issue is the lack of statistical significance testing (e.g., confidence intervals) for accuracy differences, which could bolster claims of superiority (e.g., VGP’s 89.6% vs. InsVP’s 84.6% on vision tasks). Additionally, the computational efficiency claim (3.1% FLOPs overhead, Table 5) is plausible but could be clarified by detailing how FLOPs were calculated for prompt operations.

Supplementary Material

I reviewed the supplementary material in the Appendix (Section A). Specifically, I examined A.1 (Efficiency Analysis), A.2 (Implementation Details), A.3 (Semantic Low-Rank Property), and A.4 (Vision Dataset Details). These sections provide valuable context: A.1 quantifies parameter reduction (94.6%) and FLOPs overhead (3.1%), A.2 details training settings (e.g., AdamW, 100 epochs), A.3 supports the low-rank claim with PCA theory and visualizations, and A.4 lists dataset statistics (e.g., CUB: 200 classes, 5,794 test samples). The material is well-organized and enhances the main paper’s credibility. I did not review additional figures (e.g., Figure 5) beyond their mention, as they were adequately described.

Relation to Prior Work

The paper’s key contributions align well with trends in parameter-efficient fine-tuning (PEFT) and graph neural networks (GNNs). The use of prompting for vision tasks builds on prior work like VPT (Jia et al., 2022) and InsVP (Liu et al., 2024), extending it to ViG, a graph-based vision backbone introduced by Han et al. (2022). The low-rank decomposition idea echoes techniques in efficient Transformer adaptation (e.g., LoRA, Hu et al., 2021, not cited) but is novel in its application to graph structures. The extension to traditional graph tasks (e.g., MoleculeNet) ties into GNN prompting literature (e.g., GraphPrompt, Liu et al., 2023; GPF-Plus, Fang et al., 2023), offering a bridge between vision and graph domains. The insight into low-rank semantic properties also resonates with dimensionality reduction studies in GNNs (e.g., Kipf & Welling, 2016b), though applied uniquely to vision graphs.

Missing Important References

While the paper cites relevant prior work, two areas could benefit from additional references:

Low-Rank Adaptation: The low-rank decomposition approach shares conceptual similarities with LoRA (Hu et al., 2021, "LoRA: Low-Rank Adaptation of Large Language Models," ICLR 2022), a PEFT method for Transformers. Discussing LoRA could contextualize VGP’s novelty in adapting low-rank ideas to graph structures.

Graph Compression: The low-rank insight might relate to graph compression techniques like "GraphSAGE" (Hamilton et al., 2017, NIPS), which aggregates neighborhood features efficiently. Citing this could clarify how VGP differs from prior graph feature reduction methods.

These omissions do not undermine the work but could enhance its positioning within the broader PEFT and GNN literature.

Other Strengths and Weaknesses

Strengths:

Originality: VGP is a pioneering effort in prompting ViG models, creatively adapting low-rank concepts to graph structures.

Significance: The method’s parameter efficiency (94.6% reduction) and strong performance (e.g., 89.6% average accuracy) make it highly practical for resource-constrained settings.

Clarity: The paper is well-written, with clear explanations of the method (Section 4) and insightful visualizations (Figures 2, 3).

Weaknesses:

Clarity of Generalizability: The extension to graph tasks (Table 2) is compelling but lacks depth in explaining why low-rank properties hold beyond vision, limiting interpretability.

Limited Backbone Variety: Testing only on ViG-M restricts insights into broader applicability across ViG variants.

Minor Presentation Issue: The abstract could better highlight the low-rank insight as a core novelty, as it currently focuses more on the framework.

Other Comments or Suggestions

I have no other comments or suggestions.

Author Response

Thank you for your valuable feedback.

Q1,W1. Generalizability to Graph Tasks

Based on the efficacy of our method on chemistry/biology graph datasets, we hypothesize that similar latent semantic low-rank patterns also exist in these graph data. In particular, chemical bonds and protein interaction structures exhibit structured low-rank properties similar to semantic regions in images.

For example, in the PPI (Protein-Protein Interaction) dataset, each node in the graph represents a type of protein, while edges denote interaction relationships. These interactions are primarily driven by specific functional groups such as hydroxyl and carboxyl groups, which are crucial to biochemical reactions, analogous to low-rank semantic features in vision images. Conversely, other chemical groups that do not significantly contribute to interactions correspond to high-frequency local details in images, which tend to be redundant.

So when extracting features from these protein graph data for tasks such as protein function prediction, it is essential to identify the key functional groups that drive interactions. Since our VGP model is designed to capture low-rank semantic structures, it effectively generalizes to chemistry and biology graph datasets, surpassing prior graph prompting methods.

Due to the inherently abstract nature of protein interaction graphs, it is challenging to visualize this semantic pattern in the same way as for 2D images in Figure 2. To illustrate the underlying patterns, we visualize the graph structures and color the nodes according to their first three PCA components. We compare node features obtained from trained and untrained GNN models on the PPI dataset. Interestingly, we find that node features from trained GNN models exhibit significantly better low-rank feature consistency. The visualization is provided at an anonymous link (https://anonymous.4open.science/r/ICML25-anonymous-DF1B/PPI-PCA.pdf).
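As a rough sketch of the comparison described above (our own illustration, not the authors' script), the following maps node features to RGB colors via their first three PCA components; the feature matrices standing in for trained and untrained GNN embeddings are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_node_colors(node_feats: np.ndarray) -> np.ndarray:
    """Map node features of shape (N, D) to (N, 3) RGB colors via top-3 PCA components."""
    comps = PCA(n_components=3).fit_transform(node_feats)
    comps -= comps.min(axis=0)          # rescale each component to [0, 1]
    comps /= comps.max(axis=0) + 1e-8
    return comps

# Placeholder features; in practice these would be embeddings of the same PPI graph
# produced by a trained and a randomly initialized GNN encoder.
trained_feats = np.random.rand(500, 128)
untrained_feats = np.random.rand(500, 128)
colors_trained = pca_node_colors(trained_feats)
colors_untrained = pca_node_colors(untrained_feats)
# The color arrays can then be passed to a graph drawing routine (e.g., networkx's
# draw with node_color=...) to compare low-rank feature consistency visually.
```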

Q2. Choice of Rank $r=32$

Our choice of $r=32$ is motivated by two key considerations:

  1. Trade-off between performance and parameter efficiency. As shown in Table 4, CUB achieves near-optimal results of 87.4% at $r=32$ and 87.2% at $r=64$, and its estimated rank is about 50, falling between these two values. Besides, CIFAR achieves its peak result at $r=64$ and second-best performance at $r=32$. In terms of performance, choosing either $r=32$ or $r=64$ seems plausible. However, in terms of parameter efficiency, increasing $r$ from 32 to 64 nearly doubles the trainable parameters. We therefore choose the smaller value, $r=32$, for an optimal balance between performance and parameter efficiency.
  2. Consistent hyperparameters across datasets. To maintain a unified hyperparameter setting and avoid dataset-specific tuning, we adopt a consistent $r=32$ across all datasets. Even though the estimated ranks of CUB and Flowers are 50 and 60 respectively, other datasets like SVHN and CIFAR10 exhibit lower ranks of 18 and 20, as shown in the table below. To reach a reasonable compromise between these datasets, we set $r=32$ to satisfy the majority of datasets.
Dataset       | DTD | CUB | NABirds | Dogs | Flowers | Food | CIFAR | CIFAR10 | GTSRB | SVHN
Estimated $r$ | 36  | 50  | 90      | 30   | 60      | 46   | 55    | 20      | 26    | 18

W2. Backbone Variety

We supplement additional experiments based on other representative graph-based vision models, including MobileViG and GreedyViG. The experiments are conducted on six vision datasets, with backbones pre-trained on ImageNet-1k. As shown in the table below, our VGP consistently outperforms other SOTA vision prompting and graph prompting methods, demonstrating robustness across backbones due to our adaptability to diverse graph structures.

Method   | DTD  | CUB  | Flowers | Food | CIFAR10 | SVHN | Average
MobileViG
GPF-Plus | 68.5 | 81.4 | 94.3    | 82.0 | 94.1    | 82.2 | 83.7
InsVP    | 68.1 | 84.0 | 95.2    | 83.9 | 95.2    | 88.9 | 85.9
VGP      | 71.6 | 84.9 | 97.6    | 88.0 | 96.8    | 94.7 | 88.9
GreedyViG
GPF-Plus | 69.3 | 81.7 | 94.9    | 82.2 | 94.5    | 82.7 | 84.2
InsVP    | 68.8 | 84.1 | 95.5    | 84.0 | 95.5    | 89.1 | 86.2
VGP      | 72.1 | 85.4 | 98.0    | 87.3 | 97.2    | 94.5 | 89.1
Review
Rating: 3

This paper introduces a novel parameter-efficient method called Vision Graph Prompting (VGP) with semantic low-rank decomposition for Vision GNNs. Empirical results demonstrate that the proposed approach achieves impressive performance on both image classification and traditional graph classification tasks. In addition, the paper provides supporting visualization evidence via PCA, which underscores the motivation behind the low-rank decomposition design in the prompts.

Questions for Authors

I am curious about the cluster located in the bottom-right corner of Figure 1. It appears to differ significantly from the other clusters. While the patches in the top figure seem to represent the background of the overall image, it is unclear what this particular cluster corresponds to. Could you clarify its meaning or significance?

Claims and Evidence

The authors assert that semantically connected components in the graph exhibit low-rank properties, as evidenced by visualizations produced using PCA and t-SNE. However, I believe this visualization approach may have a critical limitation due to the limited capacity of PCA to effectively extract the target object in complex images. Specifically, PCA may struggle to isolate the target object when selecting the top components, particularly in scenarios where images contain multiple objects or intricate backgrounds. Therefore, I suggest providing additional visualization results of PCA components, especially for images with multiple objects and complex backgrounds, to better evaluate the method’s effectiveness under such conditions.

Methods and Evaluation Criteria

Overall, the proposed method, VGP, provides an effective solution for leveraging semantic graph information within a low-rank space. Specifically, the semantic low-rank decomposition framework of VGP, including the SeLo-Graph Prompt, SeLo-Edge Prompt, and SeLo-Node Prompt, facilitates both structural adaptation and feature enhancement.

Theoretical Claims

I have carefully reviewed the theoretical aspects of this paper and did not identify any obvious errors. However, I noticed that the authors did not provide an equation summarizing the proposed three modules for parameter updating. Including such an equation is recommended to enhance the clarity and understanding of the overall method.

Experimental Designs and Analyses

The experimental results of the proposed method, as presented in Tables 1 and 2, are promising. However, the authors do not appear to provide sufficient analysis regarding the differences between ViT-based methods and ViG-based ones.

Firstly, the parameter sizes of the selected backbones remain unclear and should be explicitly stated for better comparison. Secondly, it is recommended that the authors analyze why the basic visual prompting method for ViTs (i.e., VPT) outperforms ViG-based prompting methods on certain datasets, such as CUB, NABirds, Dogs, and Flowers. Specifically, the authors should offer a more detailed discussion on the advantages and disadvantages of ViT-based and ViG-based prompting methods. This would help readers better understand the critical discrepancies between these two approaches.

Supplementary Material

I have reviewed the appendix, including the implementation details and dataset statistics.

Relation to Prior Work

The authors validate the effectiveness of the proposed method solely on image classification and graph classification tasks in this paper. However, it remains unclear whether the method can be extended to other vision tasks, such as object detection and segmentation.

Missing Important References

I am not an expert in this field, so I am unsure if there are any related references that have not been cited in this paper.

Other Strengths and Weaknesses

Strengths:

  1. The originality of the proposed method stems from the innovative combination of existing ideas, including low-rank adaptation and Vision GNN, which demonstrates a creative and thoughtful approach.

  2. The writing in this paper is clear and well-structured, making it easy for readers to understand both the motivation behind the work and the effectiveness of the proposed method.

Weaknesses:

My major concerns can be found in previous parts of Claims And Evidence, Theoretical Claims, and Experimental Designs Or Analyses.

Other Comments or Suggestions

I have no further comments.

Author Response

Thank you for your valuable feedback.

Q1. Cluster Located in Bottom-Right Corner of Figure 1

We further checked the correspondence between the t-SNE clusters and image patches, finding that the bottom-right cluster corresponds to the bird's reflection on the water. This observation aligns with the PCA visualizations in Figures 1 and 2, where the bird’s reflection is also highlighted.

Interestingly, the ViG model appears to learn semantic information about reflections as a byproduct of supervision in the bird classification task.

However, since the model lacks explicit supervision on background elements, the background features exhibit a sparse distribution in the upper region of the t-SNE figure.

W1. PCA with Multiple Objects and Complex Backgrounds

In Figure 2(b) of our paper, the samples from the Flowers dataset already contain complex backgrounds with cluttered grass and leaves, as well as instances with multiple objects (e.g., the top-right sample with two flowers).

The results demonstrate that PCA effectively extracts target objects using the trained ViG model’s features, attributed to the semantic low-rank property of the ViG's latent feature space. In the final version, we will provide additional visualizations specifically focusing on multiple objects and complex backgrounds (https://anonymous.4open.science/r/ICML25-anonymous-DF1B/multi-objects-w-complex-backgrounds.pdf).

W2. Summarizing Proposed Three Modules for Training

Combining Equations 11 and 12 in the paper, we further summarize the prompted ViG model in the equation below. The updated modules during training are underlined. Only the three low-rank prompt matrices, one semantic feature extraction MLP, and the low-rank virtual nodes are trained, while all other modules in the ViG backbone remain frozen:

$$\hat{f}(\mathbf{x}_i)= (1-\beta)\cdot\mathbf{x}_i+\hat{g}(\mathbf{x}_i, \underline{\mathbf{P}_n}) \cdot \mathbf{W}_{update} +\sum_{\mathbf{x}_j \in \hat{\mathcal{N}}(\mathbf{x}_i)} \beta\cdot \underline{\mathrm{MLP}_s}(\mathbf{x}_j)\cdot\underline{\mathbf{P}_e}, \quad \hat{\mathcal{N}}(\mathbf{x}_i) \subseteq [\mathbf{X},\ [\underline{\mathbf{n}_1, \dots, \mathbf{n}_M}]\cdot \underline{\mathbf{P}_g}]$$
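To make the notation concrete, here is a rough PyTorch-style sketch of a prompted aggregation step with this structure. It is not the authors' implementation: the aggregation function $\hat{g}$ is simplified to a mean over prompted neighbors, the tensor shapes, similarity-based neighbor selection, and initialization are our own assumptions, and $\mathbf{P}_e$ is kept as a full matrix rather than an explicit low-rank factorization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedAggregationSketch(nn.Module):
    """Illustrative sketch of the prompted node update written out above."""

    def __init__(self, dim: int = 192, rank: int = 32, num_virtual: int = 14,
                 k: int = 9, beta: float = 0.2):
        super().__init__()
        self.beta, self.k = beta, k
        self.virtual_nodes = nn.Parameter(torch.randn(num_virtual, rank))  # trainable n_1..n_M
        self.P_g = nn.Parameter(torch.randn(rank, dim) * 0.02)             # graph-level prompt
        self.P_e = nn.Parameter(torch.randn(dim, dim) * 0.02)              # edge-level prompt
        self.P_n = nn.Parameter(torch.randn(dim) * 0.02)                   # node-level prompt
        self.mlp_s = nn.Sequential(nn.Linear(dim, dim), nn.GELU())         # semantic MLP
        self.W_update = nn.Linear(dim, dim, bias=False)                    # frozen backbone weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, dim) patch features
        # Extend the graph with prompted virtual nodes, then pick k neighbors by cosine similarity.
        nodes = torch.cat([x, self.virtual_nodes @ self.P_g], dim=0)
        sim = F.normalize(x, dim=-1) @ F.normalize(nodes, dim=-1).T
        neigh = nodes[sim.topk(self.k, dim=-1).indices]                    # (N, k, dim)
        g = (neigh + self.P_n).mean(dim=1)                                 # stand-in for g(x_i, P_n)
        edge_term = (self.mlp_s(neigh) @ self.P_e).sum(dim=1)              # prompted edge aggregation
        return (1 - self.beta) * x + self.W_update(g) + self.beta * edge_term

# Toy forward pass over 196 patch tokens of width 192.
out = PromptedAggregationSketch()(torch.randn(196, 192))
print(out.shape)  # torch.Size([196, 192])
```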

W3. Differences between ViT-based and ViG-based Prompts

ViT-based prompting methods either prompt on image pixels, like VP, explicitly adjusting the RGB channel space, or prompt on image tokens, like VPT, functioning via a feature-similarity-based attention mechanism.

However, these methods lack awareness of graph structures, such as edge connections between patches. In contrast, ViG-based graph prompting methods like GraphPrompt and GPF-Plus explicitly alter graph structures, including modifying node features, inserting new nodes, and constructing new edges, thus better leveraging the graph representation.

As for why ViT-based prompting methods outperform ViG-based ones on certain datasets, this is likely because vision datasets contain both visual features and latent graph structures. ViT-based vision prompting methods excel at processing raw vision data, whereas ViG-based methods are more effective at graph-based reasoning. Consequently, each approach has its own advantages, leading to instances where ViT-based prompting achieves superior results.

W4. Parameter Sizes of the Backbone

We have provided details of both the parameter sizes and computation costs of the ViG backbone and our VGP in Appendix A.1 and Table 5. The ViG model has 48.68M parameters on average, while our VGP has only 2.61M trainable parameters, a 94.6% reduction from full fine-tuning.

W5. Extending to Other Vision Tasks

We supplement additional semantic segmentation experiments on the ADE20K dataset. As shown in the table below, our method consistently outperforms other vision and graph prompting methods on semantic segmentation, demonstrating its effectiveness across different vision tasks.

Method   | ViG-M | Adapter | VPT  | InsVP | GraphPrompt | VGP
mIoU (%) | 47.9  | 44.2    | 41.6 | 42.3  | 44.4        | 47.6
Review
Rating: 4

This paper proposes a novel approach called Vision Graph Prompting (VGP), which enables parameter-efficient fine-tuning of the Vision GNN (ViG) model. Additionally, the paper observes that essential semantic information in Vision Graph structures is concentrated in low-rank components and leverages this insight to introduce a Semantic Low-Rank Decomposition-based prompting method. To capture both global and local semantic features within the graph structure, three key components—SeLo-Graph, SeLo-Edge, and SeLo-Node Prompt—are introduced. Experimental results demonstrate that this approach significantly enhances the transfer learning performance of the ViG model while requiring far fewer parameters compared to full fine-tuning.

Questions for Authors

  • Will the proposed method show the same effect in other graph-based vision models (e.g., MobileViG, GreedyViG)?
  • How will performance change if the structure of the graph is altered?
  • Has the impact of prompting on the model's explainability been analyzed?
  • Is it possible to achieve the same performance improvements in other domains, such as autonomous driving, medical imaging, and remote sensing?

Claims and Evidence

  • This paper visually demonstrates, through Figure 2 and Figure 5, that the primary semantic information of the Vision Graph is concentrated in the lower-dimensional components via PCA analysis. Additionally, the Ablation Study in Table 3 experimentally proves that SeLo-Graph, SeLo-Edge, and SeLo-Node Prompt each contribute to performance improvement.

  • The experimental results in Table 1 and Table 2 further indicate that the proposed method outperforms existing approaches across various benchmark experiments.

  • However, there is a lack of experiments comparing the cases with and without the application of low-dimensional decomposition. Therefore, it remains unclear how significant the performance improvement is compared to the basic ViG.

  • It is necessary to provide a more detailed explanation of why the graph prompting technique is structurally optimized for the ViG model.

  • In this paper, the impact of blending factors α and β on performance is discussed, indicating that the optimal values are found within a specific range (0.1 to 0.3). Although this claim is supported by quantitative analysis, more detailed insights should be provided regarding why deviations from this range lead to performance degradation.

Methods and Evaluation Criteria

  • This paper conducts experiments using various benchmarks (CIFAR, CUB, GTSRB, and chemical/biological graph data) and appropriately analyzes the contribution of each technique through an Ablation Study. Additionally, the evaluation considering the balance between parameter efficiency and performance appears to be well-justified.

  • While the paper claims that the proposed method achieves results comparable to full fine-tuning, it does not specify the exact performance evaluation metrics used for comparison. Including metrics such as accuracy, F1 score, or AUC would provide clearer insights into the effectiveness of the proposed method.

  • Additional explanations on the graph datasets should be provided. While it appears that datasets from GPF-PLUS and MoleculeNet were used, there is a lack of detailed descriptions regarding their characteristics (even the appendix does not provide an explanation).

  • It is necessary to verify whether the proposed method demonstrates the same effectiveness in other graph-based vision models (e.g., MobileViG, GreedyViG). Further analysis should be conducted to determine whether the performance of the proposed low-dimensional decomposition method varies depending on the dataset.

  • A clearer analysis of how changes in graph structure affect performance during the prompting process would be beneficial.

Theoretical Claims

  • This paper demonstrates through PCA-based analysis that the semantic information of the Vision Graph is primarily contained in low-dimensional components. By utilizing Eigenvalue Decomposition (EVD) of the Covariance Matrix, it shows that semantic information is concentrated in a few principal components. Furthermore, based on this mathematical foundation, it logically validates the effectiveness of Semantic Low-Rank Decomposition.

  • This paper claims that the proposed method effectively captures critical semantic information in Vision GNNs, thereby enhancing feature extraction. While this claim is supported by experimental results, the theoretical justification for how this improvement is achieved through low-rank decomposition and graph adaptation needs to be clearly articulated.

  • The paper explains that extensive experiments demonstrate significant improvements in transfer performance across various downstream tasks. However, the logical connection between the theoretical claims and the experimental results should be further strengthened. In particular, providing a more detailed explanation of how the proposed theoretical framework translates into practical performance gains would enhance the coherence of the paper and establish a clearer pathway from theory to application.

Experimental Designs and Analyses

  • This paper has appropriately set up comparison groups for the experiments and conducted a comprehensive comparative analysis, including existing prompting techniques (VPT, InsVP) and the graph prompting (GPF-Plus) technique. The Ablation Study verifies the contribution of each proposed technique to performance improvement, ensuring logical validity.

  • However, providing more explicit information on the number of repetitions (epochs) would enhance the transparency of the experimental design.

  • This paper claims significant improvements in transfer performance, but it does not include detailed statistical analyses or significance testing results. Incorporating such analyses would help assess the robustness of the findings and is essential for demonstrating that the observed improvements are not merely due to random chance.

Supplementary Material

  • The paper enhances the reliability of the research by including additional experimental results, implementation details, and mathematical proofs in the Appendix.

Relation to Prior Work

  • The paper effectively summarizes the relationship with existing Vision GNN and Vision Prompting research, clearly explaining the differences from Transformer-based prompting techniques.

  • It is necessary to verify whether the proposed method demonstrates the same effectiveness in other graph-based vision models (e.g., MobileViG, GreedyViG).

Missing Important References

  • The paper effectively summarizes existing research, particularly related to Vision GNN and Vision Prompting, and does not appear to have omitted any essential studies that should have been mentioned.

Other Strengths and Weaknesses

  • The paper proposes the first Vision Graph Prompting technique for the ViG model, demonstrating high originality. It achieves superior performance compared to existing methods while maintaining parameter efficiency. The inclusion of an Ablation Study and various benchmark experiments enhances the reliability of the research.

  • It would be beneficial to include additional statistical significance for the experimental results. There are changes in model performance based on hyperparameters (r, α, β), and it may be necessary to include a process for finding the optimal values.

Other Comments or Suggestions

  • Grammar errors and inaccurate expressions: "textbf3.1%" → "3.1%" (Typo correction needed) "demonstrated in Table reftab-datasets" → "demonstrated in Table 6" (Reference error correction)
  • It is necessary to more clearly articulate the limitations of the study in the conclusion. For example: analysis of the causes of poor performance on certain datasets, suggesting additional research directions, etc.
  • Providing a clearer explanation of the PCA visualization in Figure 5 would be beneficial. The current explanation is somewhat brief, which may make it difficult for readers to easily understand the meaning of the graph.
Author Response

Thank you for your valuable feedback.

Q1. Experiments on Other Graph-based Vision Models

We supplement additional experiments on other graph-based vision models, including MobileViG and GreedyViG, across six vision datasets with ImageNet-1k pre-trained backbones. As shown in the table below, our VGP consistently outperforms other SOTA vision and graph prompting methods, demonstrating robustness across backbones due to our adaptability to diverse graph structures.

Method   | DTD  | CUB  | Flowers | Food | CIFAR10 | SVHN | Average
MobileViG
GPF-Plus | 68.5 | 81.4 | 94.3    | 82.0 | 94.1    | 82.2 | 83.7
InsVP    | 68.1 | 84.0 | 95.2    | 83.9 | 95.2    | 88.9 | 85.9
VGP      | 71.6 | 84.9 | 97.6    | 88.0 | 96.8    | 94.7 | 88.9
GreedyViG
GPF-Plus | 69.3 | 81.7 | 94.9    | 82.2 | 94.5    | 82.7 | 84.2
InsVP    | 68.8 | 84.1 | 95.5    | 84.0 | 95.5    | 89.1 | 86.2
VGP      | 72.1 | 85.4 | 98.0    | 87.3 | 97.2    | 94.5 | 89.1

Q2. Ablation on Altered Graph Structures

We conduct additional ablation studies to analyze the impact of structural modifications in the SeLo-Graph Prompt. Our method inserts virtual nodes and dynamically constructs edges based on feature similarity, thereby altering the graph structure.

As shown in the table below, both virtual node insertion and edge construction enhance feature extraction. While static edge allocation improves performance, dynamic edge construction based on feature similarity achieves the best results, as it better captures complex semantic relationships.

Ablation                                         | CUB  | GTSRB
w/o SeLo-Graph Prompt                            | 85.8 | 93.4
Only insert virtual nodes                        | 86.2 | 94.1
Insert virtual nodes + static edge allocation    | 86.6 | 96.9
Insert virtual nodes + dynamic edge construction | 87.4 | 98.1

Q3. Analysis of Explainability of Prompting Impact

Figures 2 and 5 in the paper show that fully fine-tuned vision graph models can recognize semantically related patches and connect them via edges, exhibiting a semantic low-rank property.

Our VGP reinforces this effect through three prompting modules, ensuring low-rank feature consistency across connected patches. This mimics the behavior of fully fine-tuned models, effectively linking semantically related regions.

As shown in Table 1, our method effectively extracts discriminative semantic information, leading to significant performance gains.

Q4. Experiments in Other Domains

We supplement additional experiments on remote sensing tasks using the EuroSAT dataset, which consists of satellite images from Sentinel-2. As shown in the table below, our method still outperforms other SOTA prompting methods, even achieving results comparable to full fine-tuning with only 4% of the trainable parameters.

Method           | ViG-M | Adapter | VPT   | InsVP | GraphPrompt | GPF-Plus | VGP
EuroSAT Acc. (%) | 92.37 | 85.24   | 83.55 | 87.14 | 85.50       | 86.97    | 91.98

W1. Statistical Significance and Hyperparameter Selection

We run all experiments three times with different seeds and report the highest results. The average standard deviation is 0.3%, which is significantly lower than our 4% performance gain, confirming the robustness of our results. Detailed statistical significance for each dataset will be provided in the final version.

For hyperparameter selection (r, α, β), we evaluate multiple candidates and select the optimal ones, using a fixed set of hyperparameters across all datasets.

S1,S2,S3. Comments and Suggestions

We appreciate your feedback and will address the following in the final version:

  1. We have corrected typos in the supplementary materials.
  2. We will discuss limitations and failure cases to guide future research.
  3. Additional PCA visualization details are included. Specifically, we compute PCA components for all patch tokens encoded by the trained model. The latent feature space is decomposed into PCA components, where those with large coefficients capture the major variance. We map the top three PCA components into RGB channels for visualization, ensuring that patches with similar colors share similar PCA component distributions, thus indicating low-rank properties. This approach is similar to the visualization used in DINOv2; see the sketch after this list.
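Below is a minimal sketch of this kind of PCA-to-RGB patch visualization (our own illustration, not the authors' code); the 14×14 token grid and the feature width are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb_map(patch_tokens: np.ndarray, grid: int = 14) -> np.ndarray:
    """Map (grid*grid, D) patch tokens to a (grid, grid, 3) pseudo-color image."""
    comps = PCA(n_components=3).fit_transform(patch_tokens)
    comps -= comps.min(axis=0)            # rescale each component to [0, 1]
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(grid, grid, 3)   # top-3 PCA components become RGB channels

# Stand-in for real patch features; in practice these would come from the trained
# backbone's last layer for a single image.
rgb = pca_rgb_map(np.random.rand(196, 192))
print(rgb.shape)  # (14, 14, 3)
```

Patches whose colors match in the resulting image share dominant PCA components, which is the low-rank consistency the visualization is meant to expose.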

Graph Prompt Structurally Optimized for ViG

Standard vision prompting methods (e.g., VP, VPT) operate on pixel-level or token-level representations without explicit graph structures.

In contrast, graph prompting methods (e.g., GraphPrompt, GPF) directly modify node features, insert virtual nodes, or establish new edges, enabling structured graph-based prompting while overlooking semantic features in vision data.

Our VGP builds upon this principle, explicitly optimizing prompts for graph-structured vision models, incorporating semantic low-rank decomposition strategy.

Number of Repetitions

Following DAM-VP (Appendix A.2), we train each dataset for 100 epochs.

Reviewer Comment

The authors have provided sincere and well-reasoned responses to all of the reviewer’s questions. In particular, they effectively demonstrated the generalizability and extensibility of the proposed method through additional experiments on alternative backbones (MobileViG, GreedyViG) and the remote sensing domain (EuroSAT). They also convincingly explained the effects of graph structure modifications and the improvement in explainability brought by the prompting technique, supported by both quantitative and qualitative evidence. Plans to supplement statistical significance analysis and hyperparameter selection were clearly stated as well.

However, there are a few shortcomings. First, regarding statistical validation, conducting only three runs and reporting only the best performance may be somewhat insufficient in terms of consistency and reliability. In addition, among the various domain experiments, real-world applications such as autonomous driving or medical imaging—which are more complex—were not tested. Furthermore, the paper focuses more on overall system performance improvement rather than providing in-depth analysis on the causes of individual performance gains, which weakens the connection between theory and experiments. Addressing these issues in the future would further enhance the completeness of the paper.

Taking all these points into account, I will adjust my previous score slightly upward. I wish the authors the best with their final results.

Author Comment

Thank you very much for your thorough and constructive feedback. We sincerely appreciate your recognition of our efforts to address your concerns, especially regarding the generalizability and explainability of our method.

We acknowledge the limitations you raised. Regarding statistical validation, we agree that more comprehensive experimentation (e.g., more runs with mean and standard deviation) would enhance the reliability of our results, and we plan to incorporate this in future work. We also appreciate your suggestion on exploring more complex real-world domains such as autonomous driving and medical imaging—this is a valuable direction that we are actively considering for follow-up research.

Lastly, we agree that a deeper analysis into the individual contributions of each module would strengthen the theoretical-experimental connection, and we will work toward expanding this aspect in a future extended version of the paper.

Thank you again for your thoughtful comments and for adjusting your score.

Review
Rating: 3

In this work, the authors present Vision Graph Prompting (VGP), a parameter-efficient fine-tuning method for Vision Graph Neural Networks. The core insight is that semantic information in vision graphs primarily resides in the low-rank components of the latent feature space. The authors propose three semantic low-rank prompting methods: SeLo-Graph, SeLo-Edge, and SeLo-Node prompts, which capture global structural patterns and fine-grained semantic dependencies.

Questions for Authors

I'm curious, given the authors' finding that the original graph structure's features have a low-rank decomposition property, could they consider adding LoRA to ViG for fine-tuning as a prompt alternative?

Claims and Evidence

Overall, the claims are well-supported and clear.

Methods and Evaluation Criteria

The evaluation criteria, including accuracy and parameter efficiency across diverse datasets, are appropriate and comprehensive.

Theoretical Claims

I roughly checked the theoretical claims in the article. They're mostly based on existing theories and seem reasonable.

Experimental Designs and Analyses

The ablation study in the paper does not include experiments where only SeLo-Edge or only SeLo-Node is used, nor does it show the results of combining SeLo-Graph with SeLo-Node. This limits the thoroughness of the analysis of each component's individual contribution.

Supplementary Material

I read the supplementary material in its entirety.

Relation to Prior Work

This work bridges the gap between Transformer-focused prompting techniques and graph-based vision models, contributing to the development of parameter-efficient fine-tuning methods for ViG and potentially other graph neural network applications.

Missing Important References

N/A.

Other Strengths and Weaknesses

Strengths:

  1. The authors present a parameter-efficient fine-tuning method specifically designed for Vision Graph Neural Networks (ViG), addressing a previously under-explored area.

  2. The core insight regarding the low-rank properties of semantic information in vision graphs is well-supported and motivated.

  3. Extensive experiments across diverse datasets demonstrate the effectiveness of the proposed method.

Weaknesses:

  1. Missing detailed information on the implementation of virtual nodes, such as their initialization method and how the number of virtual nodes is determined.

  2. The author does not discuss potential issues that may arise when dealing with very large or complex graph structures. Additionally, the author does not clarify the number of prompts M in the SeLo-Graph Prompt, which may also be crucial to the effectiveness of the method.

  3. The proposed method appears to be instance-level, which may result in a significantly larger number of parameters compared to more generic prompt methods. Though a comparison with the full fine-tuning method is given in the appendix, the paper does not provide a detailed comparison of parameter quantities, which is important for assessing the method's efficiency.

  4. It is unclear whether SeLo-Node acts on all nodes in the graph or if it includes the virtual nodes from SeLo-Graph. This ambiguity affects the understanding of the method's application scope and its impact on different parts of the graph.

Other Comments or Suggestions

N/A.

Ethics Review Issues

N/A.

Author Response

Thank you for your valuable feedback.

Q1. Adding LoRA as a Prompt Alternative

We supplement additional experiments comparing with LoRA across ten vision datasets. As shown in the table below, LoRA surpasses the traditional visual prompting method (VPT) due to its low-rank adaptation property and matches GPF-Plus’s performance.

While LoRA focuses on low-rank adaptation within the model’s parameter space, it does not leverage graph topology and thus cannot refine structural relationships, limiting further gains. Our VGP achieves SOTA performance by jointly optimizing visual semantics and graph structures via semantic low-rank prompting.

Method   | DTD  | CUB  | NABirds | Dogs | Flowers | Food | CIFAR | CIFAR10 | GTSRB | SVHN | Average
VPT      | 71.4 | 77.3 | 76.4    | 73.1 | 95.3    | 81.9 | 76.3  | 93.2    | 79.7  | 82.4 | 80.7
GPF-Plus | 71.0 | 82.0 | 77.2    | 78.2 | 95.7    | 82.6 | 80.9  | 94.5    | 90.5  | 83.1 | 83.6
LoRA     | 69.7 | 79.2 | 77.3    | 74.0 | 94.6    | 83.5 | 81.4  | 94.9    | 90.2  | 90.8 | 83.6
VGP      | 74.8 | 87.4 | 80.9    | 81.7 | 98.2    | 89.5 | 89.7  | 98.3    | 98.1  | 96.9 | 89.6
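For reference, a minimal sketch of the LoRA baseline used in this comparison follows. It is a generic formulation of LoRA applied to a single linear layer, not tied to the ViG codebase; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the backbone weight stays frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(192, 192))
print(layer(torch.randn(4, 192)).shape)  # torch.Size([4, 192])
```

Because the update acts only on the weight matrix, no nodes or edges are added or re-weighted, which is the contrast the response draws against graph-structured prompting.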

W1. Implementation Details of Virtual Nodes

Thanks for your reminder. We provide additional implementation details of virtual nodes:

  1. The virtual nodes in the SeLo-Graph Prompt are initialized using the Kaiming normal distribution.

  2. The number of virtual nodes $M$ is set to 14. Ablation experiments are shown in the table below. A smaller number leads to suboptimal prompting effects due to insufficient guidance, while an excessive number does not yield further improvements but incurs additional parameter cost.

Virtual Node Number $M$ | 0    | 3    | 7    | 14   | 28   | 42
CUB                     | 85.8 | 86.3 | 86.9 | 87.4 | 87.2 | 86.9
GTSRB                   | 93.4 | 95.5 | 97.3 | 98.1 | 98.0 | 97.6
SVHN                    | 95.1 | 96.0 | 96.2 | 96.9 | 96.7 | 96.8

W2. Dealing with Large or Complex Graph

Our VGP is capable of handling large and complex graph data. In our experiments on the chemistry/biology graph datasets in Table 2, the number of nodes can reach 5,000 with non-uniform edge distributions, far more complex than the 196-node image graphs. The table below presents the average graph sizes for different chemistry/biology datasets. Even so, our VGP still achieves seven SOTA results across nine benchmarks with only 0.15M parameters, verifying its robustness and generalizability.

As for the number of prompts $M$ in the SeLo-Graph Prompt, we follow the same setting as for the vision datasets ($M=14$), not specifically tuned for the chemistry/biology datasets. Even though these graphs are much larger and more complex, our method still outperforms other graph prompting methods under a general hyperparameter setting, verifying its robustness.

DatasetsBBBPTox21ToxCastSIDERClinToxMUVHIVBACEPPI
Graph Size77651658374198982889310745139

W3. Parameter Quantities Comparison

We provide a parameter comparison of SOTA methods on the CUB dataset, as shown in the table below. Our method achieves high efficiency, requiring only 2.63M trainable parameters (5% of ViG-M’s full fine-tuning at 48.71M). This is due to our lightweight low-rank design, which avoids large parameter matrices while maintaining strong performance.

Method     | ViG-M | VPT  | Ins-VP | GPF-Plus | Adapter | DAM-VP | VGP
Param. (M) | 48.71 | 1.77 | 1.83   | 2.29     | 3.48    | 6.24   | 2.63

W4. Whether SeLo-Node Prompt Acts on Virtual Nodes

Yes, the SeLo-Node Prompt acts on all nodes within the graph, including the virtual nodes inserted by the SeLo-Graph Prompt. We will provide a more explicit description of the prompting process in the final version for better clarity, as below.

  1. The SeLo-Graph Prompt inserts virtual nodes and builds virtual edges, updating the graph structure.
  2. The SeLo-Edge Prompt refines edge-level semantic interactions via the edges within the updated graph.
  3. The SeLo-Node Prompt intensifies node-level semantic information on each node in the updated graph.

Each Component's Individual Contribution

We supplement additional ablation experiments on different component combinations, as shown in the table below. While the SeLo-Graph Prompt refines graph structures, the SeLo-Edge and SeLo-Node Prompts enhance low-rank semantics between and within nodes. Each component contributes to performance gains.

SeLo-Graph | SeLo-Edge | SeLo-Node | CUB  | GTSRB
-          | -         | -         | 76.2 | 77.4
✓          | -         | -         | 81.9 | 86.9
-          | ✓         | -         | 82.3 | 87.5
-          | -         | ✓         | 81.0 | 86.5
✓          | ✓         | -         | 85.3 | 93.0
✓          | -         | ✓         | 85.5 | 93.3
-          | ✓         | ✓         | 85.8 | 93.4
✓          | ✓         | ✓         | 87.4 | 98.1
Reviewer Comment

Thanks for the authors' response. Since most of my concerns have been addressed, I am inclined to increase my score.

Author Comment

Thanks for your positive feedback and for considering increasing your score. We truly appreciate your thoughtful review and are glad that our rebuttal addressed your concerns. We are committed to improving our work and are grateful for your constructive comments, which helped us strengthen the paper.

Final Decision

This paper received four positive ratings, with all reviewers generally inclined to accept it. The paper presents Vision Graph Prompting (VGP), a novel parameter-efficient fine-tuning method tailored for Vision Graph Neural Networks (ViG). Unlike prompting methods designed for Transformers, the authors specifically consider the rich topological relationships between nodes and edges in the visual graph structure, which improves the modeling of complex semantics. According to the reviews, this paper is well written, with a well-explained methodology, clear motivation, and insightful visualizations that aid understanding. All reviewers recognized the novelty of the proposed method, which innovatively explores parameter-efficient fine-tuning for the ViG model. In addition, the authors conduct extensive experiments across different datasets and with different backbone networks, effectively demonstrating the generality and scalability of the proposed method. The authors have addressed the concerns raised, resolving most of the doubts. Therefore, the Area Chair (AC) recommends accepting the paper.