6.7

/10

Poster3 位审稿人

最低6最高7标准差0.5

4.0

置信度

COLM 2025

VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Ziang Ye,Yang Zhang,Wentao Shi,Xiaoyu You,Fuli Feng,Tat-Seng Chua

OpenReview PDF

提交: 2025-03-21更新: 2025-08-26

TL;DR

VisualTrap is the first backdoor attack targeting GUI agents’ visual grounding, using subtle triggers to manipulate interactions across environments, posing significant security risks for edge device deployments.

摘要

关键词

GUI AgentBackdoor Attack

评审与讨论

审稿意见

评分: 7置信度: 42025-05-12

This work proposes a backdoor attack on GUI agent grounding, where stealthy triggers (20 * 20 Gaussian noise) can mislead the agent to locate textual plans to triggers.

The authors show that as little as 5% poisoned data can lead to a high attack success rate of around 90%, on various domains like mobile, web, and desktop, for both end-to-end and modular agent architectures.

接收理由

Using backdoor attack on the GUI agent grounding is novel and potentially impactful.
The authors conduct comprehensive experiments on various datasets, domains, and agent architectures. The proposed attack seems to be effective in all settings.

拒绝理由

The authors don't report the exact number of poisoned data. Since SeeClick pertaining data is 1M, even 5% or 10% of them is non-trivial. This poses questions about whether the proposed attack is realistic.
The authors provide a limited explanation of the relatively poor Desktop ASR results in Table 3. If higher resolution is an important factor, I think it should be systematically studied in section 4.2?

评论- Response to Reviewer X5Ti:

2025-06-03

W1: The authors don't report the exact number of poisoned data. Since SeeClick pertaining data is 1M, even 5% or 10% of them is non-trivial. This poses questions about whether the proposed attack is realistic.

Thank you for this important clarification request. We would like to clarify that we first sampled 10% of the SeeClick dataset for our experiments, which corresponds to 101,040 samples in total. Among these, approximately 65k are grounding data samples. Specifically, 10% equals 6,551 grounding samples, and 5% equals 3,276 grounding samples. We will include these exact figures clearly in the revised manuscript.

W2: The authors provide a limited explanation of the relatively poor Desktop ASR results in Table 3. If higher resolution is an important factor, I think it should be systematically studied in section 4.2?

Thank you for the valueable suggestion, In response, we conducted additional experiments to systematically study the impact of image resolution on attack performance. Specifically, we resized images size by various scale factors before adding the trigger, allowing us to observe how changes in resolution affect the attack's effectiveness. The results of these experiments are presented in the following figure:

https://anonymous.4open.science/api/repo/2i8k0kl2r1B38/file/scale_factor_study_results.pdf

Our findings indicate that significantly increasing the image resolution can indeed lead to a noticeable decline in attack performance, compared to simply reducing the trigger size. We believe this is because VLMs resize image inputs to fit within their maximum pixels. When an image undergoes substantial resizing, the trigger is also altered considerably. Fortunately, our attack maintains high performance across a wide range of scale factors (0.5-2), demonstrating its robustness under varying image resolutions. We will incorporate these new results and analyses into our revised version.

Thank you once again for your valuable feedback, which has helped us enhance the depth of our analysis

2025-06-03

Thanks for the response.

I increased my score to 7.

审稿意见

评分: 7置信度: 42025-05-12

This paper introduces VisualTrap, a backdoor attack method targeting the visual grounding component of GUI agents powered by LVLMs. The attack works by injecting poisoned data—specifically, Gaussian noise patch as the trigger—into screenshots during pre-training, causing the agent to mismap textual plans to the trigger location instead of the intended GUI element. Experiments show that with just 5% poisoned data, the attack can effectively hijack the agent’s behavior and generalize across downstream tasks and different GUI environments.

接收理由

(1) This work is the first study to propose a backdoor attack targeting visual grounding component of GUI agents.

(2) The experiments show both the effectiveness and stealthiness of the proposed attack, leading to high attack success rates with a small portion of poisoned data.

(3) Extended experiments and analysis show that the attack can generalize to downstream tasks and across different GUI environments.

(4) The paper is well-written and easy to follow.

拒绝理由

(1) The paper states that trigger placement is done at a random location, but it is unclear whether the gold location is excluded during this process. If not, this could affect the validity of the results. The statistics on any overlap between trigger and gold locations would be needed.

(2) The experiments are limited to a single model family—Qwen2-VL 2B/7B as the LVLM backbone. To show the broader effectiveness of the proposed backdoor attack, more recent versions like Qwen2.5-VL [1] and other model families like LLaVA [2] are needed.

[1] Qwen2.5-VL Technical Report

https://arxiv.org/pdf/2502.13923

[2] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

https://llava-vl.github.io/blog/2024-01-30-llava-next/

给作者的问题

(1) Figure 4 seems to be not consistent with the description about LLM component in Line 363-365?

(2) What are the authors’ thoughts on other potential defense mechanisms? Also, could you elaborate on the input-side filtering methods mentioned in Line 367?

评论- Response to Reviewer D55U: Part 2(Questions)

2025-06-03

Q1: Figure 4 seems to be not consistent with the description about LLM component in Line 363-365?

Thank you for pointing out this inconsistency. After reviewing our experimental result, It appears that the positions of "LLM Poison" and "Vision Poison" were inadvertently switched in Figure 4. We will revise the figure to correct these labels and ensure alignment with the description in Lines 363-365

Q2: What are the authors’ thoughts on other potential defense mechanisms? Also, could you elaborate on the input-side filtering methods mentioned in Line 367?

Thank you for your insightful question regarding defense mechanisms, which are indeed vital for the practical deployment of GUI agents. We appreciate the opportunity to elaborate on potential strategies, including input-side filtering methods mentioned in Line 367.

Regarding input-side filtering, we refer to preprocessing techniques applied to GUI screenshots before they are processed by the agent. Specifically, these techniques could include:

Visual Anomaly Detection: This involves implementing statistical analyses to identify unusual pixel patterns or patches that deviate from standard GUI elements. For instance, techniques such as frequency domain analysis or local variance measurements could be used to detect Gaussian noise patches.
Patch-based Analysis: This method involves segmenting input images into smaller patches and analyzing each for potential trigger signatures, akin to approaches utilized in computer vision security.

Beyond input-side filtering, we believe additional defense strategies merit exploration:

Tracking Agent Actions for Suspicious Behavior: Developing mechanisms to monitor agent actions for anomalies, such as verifying the legitimacy of click locations, could enhance security.

In summary, our VisionTrap backdoor introduces two significant challenges that require further defense exploration:

Identifying Trigger Patterns: While our paper uses Gaussian noise as a trigger, real-world attackers might employ specific shapes (e.g., "+", ".") or text embedded within images as triggers. These elements can appear normal within a UI, making it difficult to detect unusual pixel patterns and ascertain the cause of an attack.
Detecting Attack Steps: In practical scenarios, GUI agents operate in multiple steps to achieve their objectives. An attacker might inject a trigger at any stage, leading to biased choices or dangerous operations. Detecting threats at each step is inefficient and costly, underscoring the need for new methods or mechanisms that can accurately and efficiently identify attack points.

We appreciate your feedback and look forward to further discussions on enhancing the security of GUI agents.

评论- Response to Reviewer D55U: Part 1(Weaknesses)

2025-06-03

W1: The paper states that trigger placement is done at a random location, but it is unclear whether the gold location is excluded during this process. If not, this could affect the validity of the results. The statistics on any overlap between trigger and gold locations would be needed.

Thank you for your valuable feedback. We would like to clarify the trigger placement procedure in our experiments:

Downstream Tasks: Since our goal is to demonstrate real-world threats by injecting triggers only on interactable elements, we explicitly exclude the gold element locations from trigger injection. This ensures that the trigger does not overlap with the ground truth locations, preserving the validity of the evaluation.

Pretraining Poisoning: In this setting, trigger placement is randomized, and there is a small percentage of overlap between trigger and gold locations. Specifically, the overlap ratios are as follows:

Platform	Overlap Ratio
Mobile	2.4%
Desktop	3.6%
Web	2.6%

We will include this detailed information in the revised version of the paper to improve clarity and address this concern.

W2: The experiments are limited to a single model family—Qwen2-VL 2B/7B as the LVLM backbone. To show the broader effectiveness of the proposed backdoor attack, more recent versions like Qwen2.5-VL [1] and other model families like LLaVA [2] are needed.

Thank you for the valueable suggestion, We have conducted following addtional experiments on Qwen2-VL-3B and LLaVA-NeXT-Mistral-7B:

LVLM Backbone	Attacked Module		CI-ACC ( $\uparrow$ )				ASR ( $\uparrow$ )
		Mobile	Desktop	Web	Avg	Mobile	Desktop	Web	Avg
	Clean	0.835	0.853	0.817	0.835	0.002	0.018	0.002	0.007
Qwen2.5-vl-3B	Full Poison	0.838	0.826	0.795	0.820	0.968	0.985	0.906	0.953
	Poison LLM	0.847	0.812	0.803	0.821	0.879	0.906	0.834	0.873
	Poison Vision	0.831	0.845	0.823	0.833	0.962	0.947	0.912	0.940

	Clean	0.372	0.354	0.523	0.416	0.033	0.025	0.018	0.025
LLaVA-NeXT-Mistral-7B	Full Poison	0.359	0.368	0.516	0.414	0.552	0.521	0.731	0.601
	Poison LLM	0.355	0.351	0.513	0.406	0.526	0.518	0.683	0.576
	Poison Vision	0.363	0.359	0.541	0.421	0.574	0.531	0.748	0.618

The results for Qwen2.5-VL-3B are consistent with those observed in the Qwen2-VL series. Even with just 5% poisoned data, we achieve a high attack success rate of around 90%.

For LLaVA-NeXT, which did not undergo grounding-specific pretraining and has a relatively smaller vision tower compared to Qwen, Due to our resource constraint, training on approximately 65k grounding data for 1 epoch is insufficient for it to accurately determine the target position. Nevertheless, our method still achieves an average attack success rate of around 60%, demonstrating the effectiveness of VisionTrap.

2025-06-05

Thanks for the response and conducting additional experiments. It addresses my concerns, and I have increased my score.

审稿意见

评分: 6置信度: 42025-05-13

This paper introduces VisualTrap, a novel backdoor attack targeting the visual grounding capabilities of GUI agents powered by LVLMs. The method injects a small proportion of poisoned training data during the grounding pretraining phase, effectively hijacking the agent's visual grounding. This allows the attacker to manipulate the agent’s behavior by directing it to misinterpret interface elements based on the presence of a trigger. The paper demonstrates the effectiveness of VisualTrap through empirical results, showing that it can hijack visual grounding with high success rates and maintain its stealthiness, even when fine-tuned on clean data.

接收理由

This paper addresses an important and underexplored area in the security of GUI agents, backdoor attacks targeting visual grounding. The method and results are timely, as GUI agents become more integrated into everyday technologies. The research highlights the vulnerability of LVLMs and emphasizes the need for security measures in the development of GUI agents.
VisualTrap's ability to transfer across different environments (mobile, desktop, and web) and tasks adds significant practical value, making it a noteworthy contribution to security in AI systems.

拒绝理由

While the paper demonstrates the feasibility of the attack, it lacks a deeper exploration of how these vulnerabilities might manifest in real-world applications of GUI agents. For example, how would this attack perform in more complex multi-agent or multi-round systems where the agent’s task could span several steps or agents? The authors do not address such multi-agent or multi-round complexities that are common in real-world GUI systems.
The authors evaluate the attack primarily on standard datasets such as Aitw and Mind2Web, but the generalization of the attack to more complex, real-world tasks remains unclear. The paper could benefit from more detailed discussions of how the attack might work in broader settings.

给作者的问题

The paper demonstrates the attack in various GUI environments (mobile, desktop, web). How do you envision this attack's performance in systems that involve multiple rounds or agents working together to complete a task? Does the attack still hold in such multi-agent/multi-round contexts?
While the attack is demonstrated on several tasks, could you provide examples of more complex, multi-step tasks that this attack might affect? How does it generalize to tasks involving more intricate user interactions or higher levels of abstraction?
Would it be possible to evaluate this attack in a real-world deployment of GUI agents, for example, on a mobile app or desktop automation tool, to provide more concrete evidence of its impact in practical scenarios?

伦理问题详情

N/A

评论- Response to Reviewer tzq5:

2025-06-03

W1&Q1: Multi-Agent and Multi-Round Complexities

We appreciate the insightful question regarding the manifestation of our attack in real-world GUI systems involving multiple steps and multiple agents. we want to first clarify that our experiments already encompass multi-step task executions and a modular multi-agent architectures.

● Multi-Step Tasks: the environments used in our evaluations—AITW, Mind2Web, and OmniAct—each involve sequential decision-making processes spanning multiple steps(as detailed in appendix B.2)

●Multi-Agent Architectures: Our experiments on OmniAct also investigated a recent widely used modular multi-agent framework(SeeClick-V [1] [2]), where a planning agent interprets task instructions to generate descriptive text actions, and a grounding agent maps these descriptions to precise UI coordinates.

We believe our Attack Remains Effective in More Complex Multi-Step and Multi-Agent Contexts.

● Persistence Across Steps and Agents: Once the backdoor is implanted into the visual grounding component, it consistently influences every grounding operation throughout the entire multi-step task. This means that whenever the trigger appears at any stage, regardless of which agent or module is currently active, the system’s behavior is redirected toward the trigger location.

● Stealthiness in Multi-Step Tasks: In practical multi-step GUI tasks, agents perform sequences of operations to achieve high-level goals. An attacker can strategically inject the trigger at any step, causing subtle biases or incorrect actions that may propagate unnoticed through subsequent steps. This stealthy manipulation makes the attack difficult to detect and mitigate in real-world multi-round interactions

Our attack specifically targets the visual grounding capabilities of GUI agents powered by large vision-language models (LVLMs). Therefore, it provides an effective attack vector in any scenario where an LVLM is required to directly provide precise UI locations based on textual plans.

W2&Q2: Complex Real-World Task Generalization

We thank the reviewer for highlighting the importance of discussing our attack’s applicability to more complex, real-world scenarios. In fact, the effectiveness of our attack tends to increase as task complexity grows because:

Vulnerability at Higher Levels of Abstraction: Complex tasks often involve longer sequences of actions and interaction with a wider variety of interface elements. This expanded scope provides attackers with more opportunities to strategically embed triggers at critical points.

Consider a "book a business trip" task involving:

● Steps 1–3: Flight Booking — A trigger embedded in the airline selection interface could redirect choices toward a competitor, causing unfair market manipulation.

● Steps 4–6: Hotel Reservation — A trigger placed in the hotel selection could steer the user toward overpriced options, resulting in economic loss.

● Steps 7–9: Ground Transportation — A trigger in the pickup location input could cause misdirection to an incorrect address, potentially creating security risks.

Our attack operates at the visual grounding level,enabling it to compromise any or all of these steps regardless of the high-level task complexity. This capability provides attackers with a potent means to exert direct and covert control over user interactions across diverse, intricate workflows.

Q3: Real-world deployment of GUI agents

We recognize the value of evaluating our attack in real-world settings, such as mobile apps or desktop automation tools, to provide concrete evidence of its practical impact. While full real-world deployment requires extensive ethical approval, human label realworld tasks and controlled environments, our experiments on realistic datasets (Aitw for mobile, Mind2Web for web, OmniACT for desktop) provide strong evidence of practical threat potential.We are actively exploring opportunities to conduct such evaluations and will include these results in future work to further substantiate our findings.

Reference:

[1] Gou, Boyu, et al. Navigating the Digital World as Humans Do: ICLR2025.

[2] Wu, Zhiyong, et al. Os-Atlas: A Foundation Action Model for Generalist Gui Agents. ICLR2025.

2025-06-06

Thank you for the clarification. Regarding the Multi-Agent and Multi-Round Complexities, my initial intent was to inquire about how your method would handle scenarios involving multiple LVLMs. Specifically, if there are multiple LVLMs in the system, how would the attack remain effective across different agents? Additionally, could you clarify the assumptions made about the LVLMs' interactions in these multi-agent, multi-round settings to ensure the attack's effectiveness across various VLLMs behind agents and rounds? This would help further solidify the scalability of your approach in more complex real-world environments.

评论- Follow-up on Multi-Agent and Multi-Round Complexities for Reviewer tzq5

2025-06-07

We appreciate your follow-up on heterogeneous, multi-LVLM systems. Below, we provide further clarification on our threat model and explain how the implanted backdoor remains effective even when distinct LVLMs collaborate within multi-round pipelines.

Assumptions in Multi-LVLM Systems:

Single Grounding LVLM: In practical agent architectures, only one component is typically responsible for translating (screenshot, referring expression) → (UI coordinates). We assume that at least one of the LVLMs in the pipeline fulfills this grounding role.
Text-to-Action Interface: Upstream agents, such as planners or memory modules, convey textual action descriptions—such as “click the ‘Pay’ button”—to the grounding LVLM, which then produces the specific, atomic command, click [x, y].

Regardless of the number of LVLMs in the system, the backdoor needs to be implanted only in the module that performs the grounding task. Other agents, such as planners, memory modules, or reflection agents, can operate on clean or different backbones without neutralizing the backdoor. For instance, the planner could be based on GPT-4V, while the grounding agent uses the Qwen-2VL backbone, as demonstrated in our modular architecture experiments.

For example, In a scenario where the planner(GPT-4V) completes the planning phase and generates an action plan, it passes the textual action description to the grounding LVLM (Qwen2-VL) to produce actionable commands. If the grounding LVLM is compromised, the actions executed on the GUI can be manipulated. As you noted, this process may involve multiple rounds to complete a task, and an attacker can strategically inject triggers at any step or across multiple steps to achieve their objectives, as illustrated in the example provided in our response to W2&Q2. It is not necessary for all LVLMs to be compromised; controlling the actual action location on the user interface is sufficient to achieve the attacker's goals.

We hope this clarifies that VisualTrap scales to complex, multi-LVLM pipelines as long as at least the grounding LVLM responsible for generating atomic commands to be executed on the GUI have been compromised. We will include a discussion on how our attack can be implemented in heterogeneous, multi-LVLM systems and look forward to further discussions if you have additional questions or if there are any misunderstandings regarding your inquiry.

评论- Anticipating Your Response

2025-06-11

Dear Reviewer,

We are eagerly awaiting your feedback. Have our responses addressed your concerns?

Best, Authors

最终决定Accept

2025-07-08

The submission introduces VisualTrap, a back-door attack that embeds a small, invisible trigger into roughly 5 % of a GUI agent’s visual-grounding pre-training data. Once implanted, the trigger reliably misdirects clicks on mobile, web, and desktop interfaces and survives subsequent clean fine-tuning. Reviewers found the study novel and well executed, highlighting its new threat model and thorough experiments across interface types and both modular and end-to-end agent pipelines.

Three issues dominated the discussion. First, realism of the attack vector: reviewers asked whether poisoning 5 % of a large-scale corpus is plausible. The authors clarified that their benchmark uses 65k images (about 3k-6k poisoned) and argued that a malicious actor could release a pretrained grounding model with the trigger already embedded; downstream developers who fine-tune or chain agents would inherit the vulnerability at no cost to the attacker. Second, robustness to resolution changes: additional tests at 0.5-2x desktop resolutions showed only modest drops in success rate. Third, dependence on a particular LVLM backbone: new experiments with Qwen2-VL-3B and LLaVA-NeXT yielded comparable attack success, indicating the issue is not model-specific. Reviewers agreed these updates address the substantive concerns, leaving mainly editorial refinements.

If the work moves forward, it would benefit from folding the poisoned-data statistics, resolution study, and cross-backbone results into the main text, tightening the threat-model description around the visual-grounding stage, correcting the mislabeled figure, standardizing terminology, and adding a brief note on input-filtering and action-auditing defenses mentioned in the rebuttal.