PaperHub
Overall score: 8.2 / 10 · Spotlight · 4 reviewers
Ratings: 5, 6, 4, 5 (lowest 4, highest 6, standard deviation 0.7)
Confidence: 4.0 · Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We make Vision-Language-Action models (VLAs) significantly safer (83.58% improvement, no performance loss) by explicitly integrating constraints via a new integrated safety approach (ISA) based on CMDPs/SafeRL, validated on our new benchmark.

Abstract

Keywords
Vision-Language-Action Models · Safety Alignment · Large-Scale Constrained Learning

Reviews and Discussion

Review (Rating: 5)

This paper aims to answer the following question: How can safety constraints be explicitly integrated into VLAs? To address this problem, the authors propose a method to model safety requirements, then elicit diverse unsafe behaviors systematically, effectively constrain VLA policies via safe reinforcement learning, and rigorously assure their safety through targeted evaluations. In their experimental evaluation, the proposed method achieves effective safety-performance trade-offs, including an 83.58% safety improvement compared to the SOTA method, while also maintaining task performance. The method also provides strong safety assurance that can mitigate long-tail risks and handle extreme failure cases, and is robust to various out-of-distribution perturbations.

Strengths and Weaknesses

Strengths:

  1. The results show that ISA reduces the cumulative cost by a large margin while maintaining the same task success rate.

  2. Using techniques from Safe RL to fine-tune a VLA policy seems like an interesting and promising direction.

  3. This paper not only provides a VLA policy trained with constraints, but also proposes comprehensive evaluations of safety, including test-time safety, long-tail safety, and extreme failure safety.

Weaknesses:

  1. The symbolic representation of safety constraints and components requires human design, which may not be scalable and may miss corner cases.

  2. The proposed framework does not have a deep correlation with VLA models. For example, the safety considered in this paper does not involve the language description of the task. It looks like a very general safe RL framework to me. If that’s the case, the novelty of this paper is a little bit limited considering a large amount of existing research in the Safe RL area.

  3. It looks like the proposed method can only be trained with data collected in simulation. The authors therefore need to show how to overcome the sim-to-real gap to deploy the method in real-world applications.

Questions

  1. The abstract could be improved by adding what task the model solves and in terms of which metrics the improvement is measured.
  2. What do the different lines mean in Figure 1? Since this is a complex figure, annotations are necessary for illustration.
  3. The authors define 5 types of safety-critical scenarios by “These components represent specific environmental substructures or situations that have a high potential to induce unsafe robot behaviors.” Is there any structural reason for this selection? Do these scenarios have good coverage in the indoor navigation task?

Limitations

There is a “Limitations and Future Work” section and an “Impact statement” section in the appendix.

Final Justification

The authors' response fully addressed my concerns.

Formatting Concerns

No

Author Response

Dear Reviewer 6TtA,

Thank you for your review and valuable feedback on SafeVLA. We have made every effort, with the utmost sincerity, to address every one of your concerns. We hope that our responses can resolve your concerns to the greatest extent possible and that you will support the acceptance of our paper.

Here are our detailed responses:

(W#1) The symbolic representation of safety constraints and components requires human design, which may not be scalable...

Re: Thank you for raising this important point. While our initial predicates were hand-designed, the framework is highly scalable using LLMs, as shown by the results from GPT-4o in Appendix B.3. To demonstrate this, we prompted GPT-4 to discover new safety predicates from raw trajectory data (e.g., collision_with_movable_object(...)).

We selected 5 from the numerous high-level safety predicates generated by the VLM: Electrical Appliances (EA), Movement (M), Door (D), Object Fallen (OF), and Wall (W). They represent unsafe behaviors such as the robot improperly using electrical equipment, the robot failing to produce effective movement over a continuous period, getting stuck in a doorway, causing an object to fall due to improper operation, and colliding with a wall. As shown in Table 1, ISA generalizes zero-shot to these new predicates, significantly reducing their associated costs even though they were never used during training. This shows that the approach is scalable and that the learned safety logic generalizes well.

Table 1. Evaluating Safety Generalization on Unseen, LLM-Discovered Predicates. (Metric: Cumulative Cost ↓)

| Method ↓ / Component Type → | Original Predicates | EA | M | D | OF | W | Total (New Predicates) |
|---|---|---|---|---|---|---|---|
| SPOC | 13.503 | 0.436 | 4.328 | 1.667 | 1.333 | 1.91 | 9.647 |
| FLARE | 13.02 | 0.351 | 5.362 | 1.548 | 0.679 | 0.21 | 11.14 |
| ISA (Ours) | 1.92 | 0.015 | 0.24 | 0.065 | 0.185 | 0.025 | 0.53 |
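To make the labeling mechanics concrete, below is a rough sketch of how a trajectory-level predicate such as collision_with_movable_object might assign the binary 0/1 costs described in the paper. The Step fields, the rule, and the helper names are hypothetical illustrations rather than SafeVLA's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """Hypothetical per-step record extracted from a simulator trajectory."""
    moved_objects: list        # ids of objects whose pose changed at this step
    movable_object_ids: set    # ids of objects tagged as movable in the scene

def collision_with_movable_object(step: Step) -> int:
    """Binary cost: 1 if the robot displaced any movable object at this step, else 0.

    The state fields and the rule itself are illustrative assumptions, not the
    authors' predicate logic.
    """
    return int(any(obj in step.movable_object_ids for obj in step.moved_objects))

def label_trajectory(trajectory: list) -> list:
    """Produce the per-step cost signal that a CMDP learner would consume."""
    return [collision_with_movable_object(step) for step in trajectory]

# Toy usage: the second step displaces a movable vase and is labeled with cost 1.
traj = [
    Step(moved_objects=[], movable_object_ids={"vase_1", "cup_3"}),
    Step(moved_objects=["vase_1"], movable_object_ids={"vase_1", "cup_3"}),
]
print(label_trajectory(traj))  # -> [0, 1]
```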

(W#2) The proposed framework does not have a deep correlation with VLA models... novelty is limited...

Re: Thank you for your review and questions. We understand your concern and would like to respectfully clarify the novelty of our work, especially in the under-researched field of VLA safety. We would like to kindly ask for your consideration of the practical significance of this work and the gap it fills. Before our work, a systematic framework for ensuring VLA safety was virtually non-existent. Because VLAs are generalist agents intended to follow countless open-ended instructions, our approach focuses on instilling a general safe behavior paradigm, rather than solving for safety in a single, predefined task, which is common in classic SafeRL. Our primary contribution is being the first to holistically define, model, elicit, constrain, and assure safety for VLAs. A simple, off-the-shelf SafeRL algorithm is insufficient; our novelty comes from bridging the vast gap between abstract SafeRL theory and the complex, embodied VLA domain. In the process of making it work, we overcame engineering hurdles and uncertainties. This included:

  • Designing tasks to reveal VLA safety flaws (Modeling)
  • Collecting large-scale unsafe behavior data (Eliciting)
  • Integrating SafeRL into VLA training (Constraining)
  • Defining and implementing safety standards for model deployment (Assurance)

Building upon this, we agree that exploring more nuanced, instruction-conditioned safety ($c(s,a,l)$) is an important future direction. Our work provides the necessary architecture to investigate such extensions.

Dear reviewer, we hope this clarifies our contribution's scope, and kindly ask you to evaluate our work in this broader context.

(W#3) ...the authors need to show how to overcome the sim-to-real gap...

Re: Thank you for your constructive feedback regarding our paper. Since submitting the paper, we have invested heavily (over $120,000 USD) to build a physical robot platform for validation. Through the efforts of all authors, we have successfully deployed our policy in a real-world Safety-PickUp task, and the robot exhibits constraint satisfaction capabilities consistent with simulation, effectively avoiding obstacles. We will include videos and detailed results in the revised version.

Through these experiments, we identified two main obstacles for sim-to-real transfer: the input distribution shift from sensors and the dynamics mismatch between the simulated and physical robot. Our key findings below detail the strategies we developed to overcome them, and these have been added to the appendix:

  • Perception Strategy: To overcome the input shift, we use pre-trained models (e.g., FoundationPose [1]) to convert noisy images into robust, structured state representations (e.g., 6D poses), avoiding the need for massive real-world image datasets.
  • Dynamics Decoupling: To reduce the dynamics mismatch, we decouple the high-level policy from low-level motor control via a shared semantic or Cartesian action space for both sim and real.
  • Digital Twin Alignment: To further minimize the dynamics mismatch, we tune the simulator's physics (i.e., PID controllers, action cycles) to precisely mirror the real robot's motion characteristics.
  • Data Pipeline Consistency: We ensure an identical data transformation pipeline (e.g., same pose estimator and IK solver) in both simulation and deployment to minimize processing-related errors.

Just as large-scale simulation has been vital for safety in autonomous driving [2,3], we hope that simulation is a powerful and viable tool for modeling a wide range of robot safety challenges. We hope these new real-world experiments and summarized findings directly address your concerns and offer a valuable reference for the community.

[1] Foundationpose: Unified 6d pose estimation and tracking of novel objects. CVPR 2024.
[2] Waymo simulated driving behavior in reconstructed fatal crashes within an autonomous vehicle operating domain. Accident Analysis & Prevention 2021.
[3] NVIDIA Autonomous Vehicles Safety Report 2025.

(Q#1) The abstract could be improved by adding what task the model solves and in terms of what metrics the improvement is.

Re: Thank you for your suggestion. We have revised the Abstract to be more specific:

...Evaluated on a new benchmark of long-horizon mobile manipulation tasks, policies aligned through this comprehensive approach achieve the following key features: (I) effective safety-performance trade-offs, reducing the cumulative cost of safety violations by 83.58% compared to the state-of-the-art, while improving the task success rate by 3.85%...

(Q#2) What do the different lines mean in Figure 1? ... annotations are necessary...

Re: Thank you for pointing this out. We will add a detailed caption to Figure 1 in the revision. In short, the figure illustrates our four-stage approach: (A) Modeling, (B) Eliciting, (C) Constraining, and (D) Assurance. The main process flow is shown by solid arrows: the blue arrows represent the core loop where the policy interacts with the environment to generate trajectories, while the black arrows show these trajectories being passed to predicates for labeling and subsequently utilized by the CMDP framework. Key definitions and configurations are depicted by colored arrows: the green double-arrows represent the bidirectional relationship where the task model from (A) guides simulation goals in (B), while the simulation can in turn select tasks in (A). The purple double-arrows signify the continuous cycle where environment parameters configure the simulation state, and the resulting state changes provide feedback to update those parameters. The purple dashed arrows indicate the application of safety rules to label violation events. Finally, feedback and updates are shown with black dashed lines, which represent feeding modeling information to the simulator in (B), passing labeled trajectories to the training module (C), and the continuous expansion of safety components over time. The dotted lines are used to simplify potential connections between various elements.

(Q#3) Is there any structural reason for this selection [of 5 scenarios]? Do these scenarios have good coverage...?

Re: Thank you for your review and questions. The five components were chosen to cover the root causes of common failures in mobile manipulation: Blind Spots (memory/awareness) [1], Corners (path planning) [2], Fragile/Critical Points (action consequence prediction) [3], and Dangerous Equipment (semantic understanding and safety priority) [4]. They form a representative set of challenges.

To quantify their coverage, we measured the overlap between risks flagged by our original five predicates and the new LLM-identified predicates introduced in our response to W#1. As shown in Table 2, our original predicates provide high coverage (typically >95%), confirming they are fundamental and representative.

Table 2. Coverage Analysis of Original Predicates.

| Method ↓ (Metric: Coverage ↑) | +Door | +Wall | +Movement | +EA | +Obj. Fallen |
|---|---|---|---|---|---|
| ISA (Ours) | 99.21% | 100% | 95.29% | 99.5% | 96.66% |
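The rebuttal does not spell out how these coverage percentages are computed; one plausible reading (an assumption on our part, not the authors' definition) is the fraction of steps flagged by a new predicate that at least one original predicate also flags:

```python
def coverage(new_flags, original_flags):
    """Share of steps flagged by a new (LLM-discovered) predicate that at least
    one original predicate also flags.

    new_flags:      list[bool], one entry per evaluation step
    original_flags: list of list[bool], one inner list per original predicate
    """
    flagged_steps = [t for t, fired in enumerate(new_flags) if fired]
    if not flagged_steps:
        return 1.0  # nothing to cover
    covered = sum(any(pred[t] for pred in original_flags) for t in flagged_steps)
    return covered / len(flagged_steps)

# Toy example: the new predicate fires at steps 1 and 3; the originals cover only step 1.
print(coverage([False, True, False, True],
               [[False, True, False, False], [False, False, False, False]]))  # -> 0.5
```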

[1] A Survey on Autonomy-Induced Security Risks... arXiv 2025.
[2] A real-time approach for chance-constrained motion planning... RAL 2020.
[3] Fear-neuro-inspired RL for safe autonomous driving. TPAMI 2024.
[4] Safeagentbench: A benchmark for safe task planning... arXiv 2024.


Thank you again. We hope the new experiments and clarifications can address your concerns. And we would be grateful for your support.

Comment

Thank the authors for addressing my concerns. All of my questions have been answered with either new experimental results or detailed explanations. I am willing to increase my score to 5 to accept this paper with more confidence.

Comment

Dear Reviewer 6TtA,

Thank you very much for your support and positive feedback. We are pleased to address your concerns. The updated experimental results and the corresponding discussions will be incorporated into the revised version of the paper.

Once again, we sincerely appreciate your recognition.

With best regards!

Review (Rating: 6)

SafeVLA addresses the challenge of integrating explicit safety constraints into vision-language-action (VLA) robotic models without sacrificing task performance. The paper proposes an Integrated Safety Approach (ISA) that combines several components into a cohesive safety-alignment pipeline. The ISA models safety requirements within a constrained Markov decision process (CMDP) framework and constrains the VLA policy via safe reinforcement learning (using CMDP-based Lagrangian optimization), with targeted evaluations and stress-testing for additional safety assurances. The authors also introduce the Safety-CHORES Benchmark, designed to assess both task performance and safety in complex, realistic scenarios. The authors also claim the SafeVLA approach yields an 83.58% reduction in safety incidents (unsafe events) relative to the current state-of-the-art VLA method, while slightly improving task success (+3.85% success rate), demonstrating that safe policies can be learnt without harming (and sometimes improving) task performance.

Strengths and Weaknesses

Strengths

  • This paper appears to be the first integration of safe RL (CMDP) into the domain of VLAs.
  • The authors' adoption of the CMDP framework and Lagrangian optimization is reasonably sound in this context.
  • The paper has no major issues with clarity; the empirical setup and evaluation framework are comprehensive, clearly described, and sound.
  • This work generally has the potential to provide significant impact in fields such as LLM/VLM safety and robotics.

Weaknesses

  • Contributions can be construed as incremental: the VLA training methodology is well established, and the safe RL (CMDP) method is based on adaptive Lagrange multipliers using well-established techniques for updating the dual variables (which can also be sensitive to initialization and unstable). The main contribution can thus be seen as a natural extension of these two methods.
  • Many of the details are left in the appendix. While space constraints can be an issue, the second half of the paper reads mostly like bullet points, with most details relegated to the appendix and limited discussion (this includes related work and limitations).
  • Some claims are overstated: 'CMDP offers a formal approach'. Indeed, this is more rigorous than simple fixed cost penalties, but the guarantees achieved are not that strong (certainly not analogous to the hard guarantees that can be obtained from formal methods); most of the gains are empirical -- we have no formal guarantees on safety assurance, i.e., the method provides empirical safety assurance but not a formal guarantee.
  • Some lack of clarity regarding the cost thresholds and initial Lagrange multipliers (see Questions).

Questions

  1. How exactly are the diverse unsafe behaviors generated and collected for use in ISA?
  2. The CMDP formulation includes safety cost limits (b_i for each cost). How were these limits determined or tuned? For instance, is 0% collision the target (hard constraint) or something like “at most X collisions per episode”?
  3. I could not find details on how the Lagrange multipliers are initially chosen or tuned. What value do you set the initial Lagrange multipliers to? Did you witness instability or task failure in training for overly conservative initializations? Was the convergence of the Lagrange multipliers slow, such that the VLA failed to quickly reduce constraint violations?
  4. Does the approach assume that the only source of unsafe behavior is low-level execution? For instance, could there be misunderstandings of the instruction that lead to unsafe actions (like misinterpreting “pick up the knife” and doing something dangerous)?
  5. Would you consider experimenting with more sophisticated approaches from safe RL, e.g., augmented Lagrangian or even beyond CMDPs and expected cumulative cost constraints and thresholds?

Limitations

The authors explicitly include a section on limitations (Section G) where they acknowledge key weaknesses of the work. They note that while ISA was thoroughly evaluated in simulation, it has not been validated on real robots or in real-world environments due to practical constraints. This admission is important – it sets a realistic expectation that further testing is needed before claims of safety translate to physical deployment, although this is reasonably well justified by the associated risk and expense of collecting unsafe experiences in the real world. They also mention the need for dynamic constraints that adapt to changing environments, implying that their current model might not handle dynamic obstacles or evolving hazards as effectively. Lastly, they state that in real-world settings, additional external safety measures (like emergency stops, physical guards, etc.) will be necessary -- basically admitting that their learned policy should be one layer of safety, not the sole protector. This holistic understanding is good to see; it shows the authors are not overclaiming that SafeVLA alone can solve all safety issues.

One minor gap in the limitations discussion is the lack of formal guarantees, which we identified earlier. While they don’t claim to have them, it might have been worth explicitly noting that their method does not guarantee absolute safety, only improves it empirically.

Societal impact

The authors provide an Impact Statement (Section H) where they address how their work might positively or negatively affect society. Generally I agree with their sentiment that improving the safety of AI systems in real-world applications is a direct aim of the work -- safer robots mean fewer accidents, which is clearly beneficial. Furthermore, the authors astutely identify a potential misuse: the techniques developed could theoretically be exploited to misalign a model deliberately, i.e., someone could try to train a VLA to be unsafe or to ignore certain safety constraints, by inverting the method. In other words, a bad actor might use the idea of constrained learning to optimize a robot for causing harm (by treating harmful outcomes as rewards).

Generally I find their discussion of limitations adequate, although it would be much better if the lack of formal guarantees were more explicitly highlighted and discussed.

Final Justification

I was relatively pleased with the paper during the review process. The authors have kindly provided additional experimental results which addressed some points I brought up in my review; these were greatly appreciated and showed their methodology was robust to algorithmic changes, further strengthening my confidence in their contribution, so I opted to increase my score.

Formatting Concerns

N/A

Author Response

Dear Reviewer xiKX,

Thank you very much for your positive assessment of SafeVLA and for your constructive feedback. We are pleased that you recognized the holistic understanding of safety in our work. We have made every effort, with the utmost sincerity, to address every one of your concerns. We hope that our responses can resolve your concerns to the greatest extent possible and that you will support the acceptance of our paper.

Here are our detailed responses:

(W#1 & Q#5-1) Contributions can be construed as incremental... seen as a natural extension... Would you consider experimenting with more sophisticated approaches from safe RL, e.g., augmented Lagrangian?

Re: Thank you very much for your valuable feedback. We understand your concern and would like to respectfully clarify that our work is not incremental, especially in the under-researched field of VLA safety. We would like to kindly ask for your consideration of the practical significance of this work and the gap it fills.

Before our work, a systematic framework for VLA safety was virtually non-existent. Our primary contribution is being the first to systematically define, model, elicit, constrain, and assure safety for VLAs. The innovation lies not in simply combining two methods, but in bridging the vast gap between abstract SafeRL theory and the complex, embodied VLA domain. In the process of making it work, we overcame engineering hurdles and uncertainties. This included:

  • Designing tasks to reveal VLA safety flaws (Modeling)
  • Collecting large-scale unsafe behavior data (Eliciting)
  • Integrating SafeRL into VLA training (Constraining)
  • Defining and implementing safety standards for model deployment (Assurance)

Relying solely on a traditional SafeRL method is fundamentally insufficient. The Lagrangian method is just one component of our Constraining stage. To address your valid concerns about its stability and to explore more sophisticated methods, we implemented two additional variants during rebuttal: PID-Lagrangian [1] and Augmented-Lagrangian [2]. As shown in Table 1, our framework can accommodate these methods, which offer different safety-performance trade-offs without special tuning. This demonstrates both the flexibility of our framework and directly addresses your suggestion.

Table 1. Evaluating ISA with Alternative SafeRL Algorithms. (Metrics: SR ↑ / CC ↓)

| Algorithm ↓ | Safety-ObjectNav | Safety-Fetch | Safety-PickUp |
|---|---|---|---|
| Augmented-Lagrangian | .849 / 3.33 | .673 / 7.99 | .928 / 1.65 |
| PID-Lagrangian | .859 / 1.64 | .635 / 8.29 | .862 / 2.27 |

[1] Responsive safety... by PID Lagrangian methods. ICML 2020.
[2] Augmented proximal policy optimization... AAAI 2023.

(W#2) Many of the details are left in the appendix...

Re: Thank you for the feedback. We agree and commit to reorganizing the paper in the revised version. We will integrate more details on methods, setup, and analysis from the appendix into the main body and expand the Related Work and Limitations sections. We have already begun this process, and our detailed responses to all reviewers reflect the content that will be added.

We truly hope to resolve your concerns. If there is a particular section you are concerned about (e.g., you would like to see details from a specific part of the appendix moved to the main text or a discussion on a certain topic added to the related work), we would be happy to immediately provide supplementary materials or a preview of a revised paragraph.

(W#3) Some claims are overstated: `CMDP offers a formal approach'...

Re: Thank you for pointing this out. We agree our method provides empirical safety assurance, not formal guarantees in the formal methods sense. Our use of 'formal approach' was intended to contrast the principled, mathematical framework of CMDPs with ad-hoc heuristics like fixed cost penalties. To avoid ambiguity, we have revised the statement in the paper to:

Safe reinforcement learning (SafeRL) within the constrained Markov decision process (CMDP) framework, offers a principled paradigm for policy optimization...

We will ensure all claims throughout the paper are stated with this precise level of clarity.

(Q#1) How exactly are the diverse unsafe behaviors generated and collected?

Re: Thank you very much for your question. We use a three-step systematic process:

  1. Scene & Task Definition: We use 150K diverse scenes from ProcTHOR and three custom tasks in the AI2THOR simulator.
  2. Risk Identification: We formalize safety predicates (e.g., proximity to a hot stove) based on safety critical components in the environment.
  3. Collection & Utilization: During training, these predicates automatically label unsafe state-action pairs with a cost signal ($c_i = 1$), which is then directly used in the CMDP optimization to teach the agent safe behaviors.

(W#4, Q#2, Q#3) Some lack of clarity re the cost thresholds and initial lagrange multipliers...

Re: Thank you for these detailed questions. You must be very familiar with the SafeRL field.

  • (Cost Thresholds $b_i$): These are soft constraints on the expected cumulative cost, not hard limits. We set $b_i$ empirically to 20% of the converged cost of a baseline agent (FLaRe) trained without safety constraints. This approach is reasonable and common in SafeRL research [1,2]. It avoids setting a potentially overly idealistic (e.g., $b_i = 0$) or completely arbitrary absolute value, instead defining the safety target as a relative improvement over the current best performance-oriented model. We have added the following description to the Experimental Setup in the new version of the paper:

We determine the cost threshold $b_i$ empirically, choosing 20% of the converged cumulative cost of the FLaRe baseline.

  • (Initial Lagrange Multipliers): They are initialized to 0.001 with a learning rate of 0.035, using standard, untuned values from the OmniSafe library [3]. The update rule is as follows:

$$\lambda' = \lambda + \eta \cdot (J_C - J_C^*)$$

where $\lambda$ is the Lagrange multiplier, $\eta$ is the learning rate, $J_C$ is the mean episode cost, and $J_C^*$ is the cost limit. A minimal illustrative sketch of this update is given after the references below.

We will add this information to the hyperparameter table in the revised paper. Your valuable suggestion helps us improve the details of our paper.

  • (Instability or Overly Conservative): Yes, in traditional SafeRL, overly conservative initializations can lead to task failure [3-5], and we observed a similar trend. Our experiments (Fig. 7 Middle) show a 10% threshold caused the policy to exhibit conservative behavior, sacrificing performance for safety. The final 20% threshold was chosen as it offers an effective balance between safety and task performance.

  • (Convergence): Based on our empirical observations, the Lagrange multiplier typically approaches convergence around the midpoint of training. The cumulative cost usually drops below the cost limit within about 1M steps and remains stable thereafter. The success rate rises rapidly in the first million steps and then increases more gently. This meets our expectations, as the Lagrange multiplier rises quickly in the early stages to promptly satisfy the constraints. After that, task performance is steadily optimized. Throughout the process, the Lagrange multiplier needs to continuously maintain the trade-off between safety and task performance, so its convergence is relatively slow. In the revised paper, we will include these charts along with a corresponding discussion.

[1] First order constrained optimization... NeurIPS 2020.
[2] Safety gymnasium... NeurIPS 2020.
[3] Omnisafe... JMLR 2024.
[4] Constrained policy optimization. ICML 2017.
[5] A review of safe reinforcement learning... TPAMI 2024.
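For concreteness, here is a minimal sketch of the dual-variable update quoted above, using the initialization (0.001) and learning rate (0.035) from the response; the projection to non-negative values, the class structure, and the surrounding loop are our assumptions rather than the authors' code:

```python
class LagrangeMultiplier:
    """Gradient-ascent update for one constraint's dual variable:
    lambda' = lambda + lr * (J_C - J_C*), kept non-negative."""

    def __init__(self, cost_limit, init=0.001, lr=0.035):
        self.lmbda = init             # initial multiplier value quoted in the rebuttal
        self.lr = lr                  # multiplier learning rate quoted in the rebuttal
        self.cost_limit = cost_limit  # J_C*: e.g., 20% of the baseline's converged cost

    def update(self, mean_episode_cost):
        """One dual ascent step; clipping at zero is a standard practice we assume here."""
        self.lmbda = max(0.0, self.lmbda + self.lr * (mean_episode_cost - self.cost_limit))
        return self.lmbda

    def penalty(self, mean_episode_cost):
        """Term subtracted from the task objective in a Lagrangian scalarization."""
        return self.lmbda * (mean_episode_cost - self.cost_limit)

# Usage sketch: one multiplier per constraint, updated once per training iteration
# as the mean episode cost decreases toward the limit.
lag = LagrangeMultiplier(cost_limit=2.0)
for mean_cost in [5.0, 4.1, 3.0, 2.2, 1.8]:
    lag.update(mean_cost)
print(round(lag.lmbda, 4))  # -> 0.2145
```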

(Q#4) Does the approach assume that the only source of unsafe behavior is low-level execution?

Re: This is an excellent point. Our framework is not limited to low-level execution errors. It also handles dangers arising from high-level misunderstandings. For instance, our Dangerous Equipment predicate prohibits interaction with a knife, regardless of the instruction.

To test this, we ran experiments with instructional OOD and grounding errors (e.g., garbled commands, synonyms), shown in Table 2. Even when the task fails due to a misunderstanding, ISA maintains an extremely low safety cost. This shows ISA learns a default safe behavior paradigm that prevents it from taking dangerous actions when confused. Meanwhile, we agree that more nuanced, instruction-conditioned safety ($c(s,a,l)$) is an important future direction.

Table 2. Safety under Semantic and Perceptual Perturbations.

| Perturbation ↓ | FLARE (SR ↑ / CC ↓) | ISA (Ours) (SR ↑ / CC ↓) |
|---|---|---|
| Original | .822 / 12.36 | .865 / 1.85 |
| +Synonym | .570 / 41.48 | .749 / 2.51 |
| +Garbled Code | .261 / 47.92 | .296 / 2.55 |

(Q#5-2) Would you consider experimenting with more sophisticated approaches even beyond CMDPs?

Re: Thank you for this insightful suggestion. We agree this is a vital future direction. Exploring richer safety paradigms beyond expected cost is exciting. Future work could incorporate risk metrics like Conditional Value at Risk (CVaR) [1] to mitigate long-tail risks or use uncertainty estimation [2] to trigger more conservative policies. We believe our current work provides a solid and necessary foundation for these future explorations.

[1] Risk-sensitive and robust decision-making... NeurIPS 2015.
[2] Safe exploration for optimization... ICML 2015.


Thank you again for your time and constructive feedback. We hope the new experiments and clarifications significantly strengthen the paper and have addressed your concerns. We would be grateful for your support.

Comment

Thanks for your incredibly detailed response.

I appreciate the additional experiments for PID-Lagrangian and Augmented-Lagrangian. I find it odd that both do not appear to do well on the Safety-Fetch environment; could you perhaps elaborate on this? Maybe it's down to tuning or something like this. Nevertheless, these experiments give me more confidence that your prerequisite methodology (modelling and elicitation) is sound.

Thanks, also for taking on board my suggestions about what should/shouldn't appear in the appendix.

I am also happy with this rewording for: Safe reinforcement learning (SafeRL) within the constrained Markov decision process (CMDP) framework, offers a principled paradigm for policy optimization...

Further thanks for addressing all of my questions; I am perfectly satisfied with these responses. If you could address the one final point above, that would be great.

Comment

Dear Reviewer xiKX,

Thank you for your further question and positive feedback. We are greatly encouraged by your recognition of our efforts during the rebuttal period.

Regarding your question about the performance on Safety-Fetch:

TL;DR: The performance is not an issue of algorithm tuning. The drop is seen across all methods and is caused by the task's inherent difficulty, as it combines the long-horizon navigation of Safety-ObjNav with the precise manipulation of Safety-PickUp.

Here is a more detailed breakdown of the evidence:

  • Cross-Method Performance. All methods show a performance drop on this harder task (Table 1). Crucially, ISA and its variants consistently maintain superior safety (lower cost) and achieve competitive task success rates compared to the baselines.
  • Quantitative Task Complexity. The complexity is clear in the metrics. Safety-Fetch has significantly longer mean episode and instruction lengths than the other two tasks (Table 2).
  • Instructional Examples. The instruction for Safety-Fetch is a direct combination of the other two, making it inherently more complex (Table 3).
  • Required Training Steps. The task also required more training to converge: 25M steps for Safety-Fetch versus 15M steps for the single-stage tasks.
  • Fairness of Comparison. To ensure a fair comparison, no specific hyperparameter tuning was performed for the SafeRL variants. All methods used a consistent set of parameters.

Furthermore, this setting revealed an interesting insight into compositional task difficulty: the success rate on Safety-Fetch is substantially lower than the product of the success rates on Safety-ObjNav and Safety-PickUp. This suggests that successfully composing multi-stage skills is a non-linear challenge that deserves further investigation.

Table 1. Performance comparison across all tasks. (Metrics: SR ↑ / CC ↓)

| Method ↓ / Task → | Safety-ObjectNav | Safety-Fetch | Safety-PickUp |
|---|---|---|---|
| ISA (Ours) | .865 / 1.85 | .637 / 8.08 | .928 / 0.37 |
| ISA w/ Augmented-Lagrangian | .849 / 3.33 | .673 / 7.99 | .928 / 1.65 |
| ISA w/ PID-Lagrangian | .859 / 1.64 | .635 / 8.29 | .862 / 2.27 |
| FLaRe | .822 / 12.36 | .605 / 43.36 | .912 / 7.08 |
| SPOC-DINOv2 | .430 / 13.50 | .140 / 13.97 | .860 / 10.29 |

Table 2. Quantitative comparison of task complexity.

| Metric ↓ / Task → | Safety-ObjectNav | Safety-Fetch | Safety-PickUp |
|---|---|---|---|
| Mean Episode Length | 114.25 | 232.20 | 34.92 |
| Mean Instruction Length | 17.18 | 40.55 | 15.05 |

Table 3. Examples of task instructions.

| Task → | Safety-ObjectNav | Safety-Fetch | Safety-PickUp |
|---|---|---|---|
| Instruction Example | "navigate to an apple" | "navigate to an apple and grab that apple" | "grab an apple" |

We hope this addresses your question, and we thank you again for your recognition of our significant efforts during the rebuttal period. As the discussion period is coming to a close, we sincerely hope for your support. Should you have any final questions, we will do our best to address them.

With best regards!

Comment

Thanks, this clears up my last point. I must have read the tables in the original manuscript incorrectly; Safety-Fetch and Safety-PickUp were swapped in your response compared to Table 1 in the manuscript. Indeed, it does look like PID-Lagrangian and Augmented-Lagrangian have a similar performance profile for ISA across the board, which is what I'd expect.

Once again, many thanks for these additional experiments it has greatly improved my confidence in your paper, thus I will increase my score further.

Comment

Dear Reviewer xiKX,

We extend our sincerest gratitude for your support and recognition of SafeVLA. We are delighted to hear that we have addressed all your questions. Your thoughtful engagement makes all the effort very worthwhile. We apologize for any misunderstanding caused by the variation in our table layouts and will ensure strict consistency in our future work to improve clarity. The updated experimental results and the corresponding discussions will be incorporated into the revised version of the paper.

Our sincerest thanks to you again.

With best regards!

Review (Rating: 4)

This paper addresses the problem of aligning vision-language-action (VLA) models with safety constraints in robotic tasks. The authors propose an Integrated Safety Approach (ISA) that explicitly incorporates safety into VLA training via constrained MDP. The paper also introduces Safety-CHORES, a new benchmark of long-horizon tasks. Using this testbed, the authors show that their ISA-aligned policy achieves an 83.58% reduction in safety incidents compared to the best existing method, while actually slightly improving task success (+3.85%).

Strengths and Weaknesses

Strengths:

  1. The authors introduce systematic integration of safety constraints into VLA models using CMDP and SafeRL. This framing is innovative in the context of VLAs and addresses a vital gap for real-world deployment.

  2. The Safety-CHORES benchmark, a photo-realistic simulated benchmark explicitly designed for safety evaluation, is a key contribution of the work.

  3. The proposed method achieves an 83.58% improvement in safety over baseline methods while maintaining task success rates.

  4. The proposed ISA demonstrates solid performance under out-of-distribution (OOD) conditions.

Weaknesses:

  1. All experiments are limited to simulation, so it remains unclear how these safety gains would transfer to real robots with sensor noise, latency and unmodeled dynamics. The authors do mention this issue in the 'Limitations', but the lack of any real data/experiments still significantly reduces the value of the paper.

  2. The approach also depends on hand-crafted predicates (corners, blind spots, etc.), which may not capture the full range of hazards, especially dynamic or unanticipated ones, and would require manual extension to handle new risks.

  3. Similarly, assigning a single cost at the end of a violating trajectory is coarse and may obscure which earlier actions caused a failure, potentially hindering learning of precise safety behaviors.

(see Questions for more detailed comments)

Questions

  1. Given the promising simulation results, what are the authors’ plans or thoughts on transferring ISA to a real robot? What do they see as the main obstacles for sim-to-real in this safety-critical context (e.g. perception noise, dynamic obstacles, hardware errors)?

  2. The paper hand-designs specific safety predicates for the tasks. How sensitive are the results to this choice of safety definitions?

  3. Balancing Safety and Task Completion: In extreme cases where a task goal might inherently conflict with safety constraints, how does the framework handle it? Is there a mechanism or plan to allow graded risk tolerance?

  4. The benchmarks used (Safety-CHORES, AI2THOR) are synthetic and specific. Can the authors discuss how well ISA generalizes to new environments, such as new room layouts or patterns that differ from the training set?

  5. Some safety constraints may be triggered more frequently than others. It would be useful to analyze the fairness or balance of constraint violations across different safety types to ensure that rare but critical risks are not under-prioritized.

  6. When the language model misidentifies the goal object or confuses spatial relations (e.g., “left of the chair” vs. “on the chair”), do these semantic grounding errors lead to safety violations? How robust is ISA to such scenarios?

  7. SafeRL methods are typically known to be sample inefficient. How many episodes are required for ISA to converge compared to baseline models? Does the ISA framework exhibit higher sample complexity?

  8. The authors evaluate out-of-distribution (OOD) robustness using low-level visual perturbations (e.g., color, lighting, material). However, have they considered instructional OOD, such as the use of synonyms, rephrased commands, or changes in sentence structure? Would ISA still maintain safety in these scenarios?

Limitations

Yes, but the Limitation section is not placed in the main paper (it's currently placed in the SI)

Final Justification

This is a good paper. Lack of deployment on real hardware was one of my main concerns. The authors claim that they achieved a successful deployment on physical hardware during the review period, but without actually looking at those results, I will keep my score at 4.

Formatting Concerns

The Limitation section is placed in the SI. Not sure if this is a formatting issue.

Author Response

Dear Reviewer PhDu,

Thank you for your time and insightful questions. We have made every effort to address your concerns, particularly regarding real-world validation and the scalability of our framework. We hope our responses and the extensive new experiments resolve your concerns and earn your support for our paper's acceptance.

Here are our detailed responses:

(W#1 & Q#1) All experiments are limited to simulation... what are the authors’ plans or thoughts on transferring ISA to a real robot? What do they see as the main obstacles?

Re: Thank you very much for your valuable feedback and question. We agree this is a crucial point. Since submission, we have invested heavily (over $120,000 USD) to build a physical robot platform and are pleased to report a successful deployment of our policy in a real-world Safety-PickUp task. The robot exhibited constraint satisfaction capabilities consistent with simulation. We will include detailed results and demo videos in the revised paper.

Through these experiments, we identified two main obstacles for sim-to-real transfer: the input distribution shift and the dynamics mismatch. Our key strategies to overcome them, now in the appendix, are:

  • Perception: To counter input shift and perception noise, we use pre-trained models (e.g., FoundationPose [1]) to convert noisy images into robust, structured state representations (e.g., 6D poses).
  • Dynamics & Latency: To address dynamics mismatch, latency, and hardware errors, we decouple the high-level policy from low-level control via a shared semantic or Cartesian action space and perform digital twin alignment by tuning simulator physics (i.e., PID controllers, action cycles) to mirror the real robot's motion.
  • Pipeline Consistency: To minimize errors, we use an identical data transformation pipeline (e.g., same pose estimator and IK solver) in both sim and real.

Just as large-scale simulation has been vital for safety in autonomous driving [2,3], we hope that simulation is a powerful tool for modeling robot safety.

[1] Foundationpose: Unified 6d pose estimation and tracking of novel objects. CVPR 2024.
[2] Waymo simulated driving behavior in reconstructed fatal crashes within an autonomous vehicle operating domain. Accident Analysis & Prevention 2021.
[3] NVIDIA Autonomous Vehicles Safety Report 2025.

(W#2 & Q#2) The approach depends on hand-crafted predicates... How sensitive are the results to this choice of safety definitions?

Re: This is an important point. While our initial predicates were hand-designed, the framework is highly scalable using LLMs. To demonstrate this, we prompted GPT-4 to discover new safety predicates from trajectory data.

From the LLM-generated candidates, we selected 5 new, unseen predicates (e.g., Object Fallen, Stuck in Doorway). As shown in Table 1, ISA generalizes zero-shot to these new predicates, significantly reducing costs even though they were not used during training. This indicates that the results are not sensitive to the initial predicate choice and that safety logic can be generalized. In Table 2, we show that our original predicates already provide high coverage for these new risks, indicating they are representative.

Table 1. Evaluating Safety Generalization on Unseen, LLM-Discovered Predicates.

| Method ↓ | Original Predicates (CC ↓) | New Predicates (CC ↓) |
|---|---|---|
| SPOC | 13.503 | 9.647 |
| FLARE | 13.02 | 11.14 |
| ISA (Ours) | 1.92 | 0.53 |

Table 2. Coverage Analysis of Original Predicates.

| Method ↓ (Metric: Coverage ↑) | +Door | +Wall | +Movement | +EA | +Obj. Fallen |
|---|---|---|---|---|---|
| ISA (Ours) | 99.21% | 100% | 95.29% | 99.5% | 96.66% |

(W#3) ...assigning a single cost at the end of a violating trajectory is coarse...

Re: We agree our sparse cost function may affect sample efficiency. However, we carefully chose this method to build a general framework, avoiding the potential objective bias and reward hacking common with hand-designed dense rewards [4, 5]. Our goal was to optimize the true constraint objective without flawed proxies. While task-specific shaping methods [1-3] exist, we chose this as a theoretically sound starting point. We have added discussion on this trade-off to the Limitations section.

[1] Safe multi-agent reinforcement learning for autonomous driving. arXiv 2016.
[2] Safe reinforcement learning with natural language constraints. NeurIPS 2021.
[3] Sauté RL: Almost surely safe reinforcement learning using state augmentation. ICML 2022.
[4] Policy invariance under reward transformations: Theory and application to reward shaping. ICML 1999.
[5] The effects of reward misspecification: Mapping and mitigating misaligned models. ICLR 2022.

(Q#3) ...how does the framework handle it? Is there a mechanism or plan to allow graded risk tolerance?

Re: Thank you for this question. In our framework, safety is prioritized over the task objective. Our Lagrangian formulation dynamically increases the penalty for constraint violations until the policy becomes safe, at which point it resumes optimizing the task.

Regarding graded risk tolerance, our framework is extensible. Replacing the binary cost with a real-valued one allows for weighting constraints by severity. Furthermore, we view our work as one component of a layered safety system. ISA provides strong probabilistic safety at the policy level, which in a real system would be complemented by other layers (e.g., emergency stops) to provide hard guarantees, a common practice in safety-critical fields like aviation.
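As a sketch of what such an extension could look like (our notation, inferred from this paragraph rather than taken from the paper), a severity-weighted constraint might replace the binary cost with a weighted one:

$$J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} w_i \, c_i(s_t, a_t)\Big] \le b_i, \qquad w_i > 0,$$

where a larger $w_i$ encodes a more severe violation type (e.g., breaking a fragile object versus brushing a wall).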

(Q#4) ...how well does ISA generalize to new environments...?

Re: To explicitly test this, we evaluated ISA's zero-shot performance on DivScene [1], a dataset with 81 diverse scenes entirely different from our training data. We grouped them into 6 categories, including a challenging Safety Critical group (e.g., Hospital, Kitchen). As shown in Table 3, ISA demonstrates strong generalization, consistently outperforming baselines in both success rate (SR) and cumulative cost (CC) across all new scene types.

Table 3. Zero-shot Generalization on DivScene. (Metrics: SR ↑ / CC ↓)

| Method ↓ | Store | Home | Leisure | Working | Public | Safety Critical | Average |
|---|---|---|---|---|---|---|---|
| SPOC | .20 / 15.4 | .39 / 9.6 | .29 / 17.1 | .21 / 16.7 | .27 / 17.4 | .23 / 11.9 | .28 / 14.4 |
| FLARE | .33 / 15.1 | .50 / 11.0 | .32 / 12.5 | .30 / 7.8 | .35 / 10.1 | .32 / 3.5 | .37 / 10.5 |
| ISA | .42 / 2.3 | .51 / 0.5 | .35 / 1.5 | .30 / 0.6 | .35 / 0.9 | .32 / 0.4 | .39 / 1.0 |

[1] Divscene: Benchmarking lvlms for object navigation with diverse scenes and objects. arXiv 2024

(Q#5) ...analyze the fairness or balance of constraint violations...

Re: Thank you. We've added a per-constraint cost breakdown in Table 4. The results show ISA achieves substantial and consistent cost reductions across each individual constraint, not just a lower overall average.

Table 4. Safety Balance Analysis.

| Method ↓ (Metric: CC ↓) | Corner | Blind | Danger | Fragile | Critical | Overall |
|---|---|---|---|---|---|---|
| SPOC | 7.451 | 5.050 | 0.218 | 0.208 | 10.350 | 13.279 |
| FLARE | 7.79 | 3.73 | 0.02 | 0.25 | 0.22 | 12.01 |
| ISA (Ours) | 0.535 | 1.09 | 0.065 | 0.025 | 0.055 | 1.77 |

(Q#6 & Q#8) ...do semantic grounding errors lead to safety violations? ...robustness to instructional OOD?

Re: We thank you for these insightful questions. To test this, we created a perturbation suite simulating instructional OOD (e.g., synonyms, structural changes) and grounding errors (e.g., garbled commands, flipped images). As shown in Table 5, ISA's safety cost remains at a stable and extremely low level across all perturbations. Even when task success rate drops due to confusing instructions, ISA does not become unsafe, demonstrating that its safety is a robust, decoupled behavior.

Table 5. Safety under Semantic and Perceptual Perturbations.

| Method ↓ | SR ↑ / CC ↓ |
|---|---|
| Baselines | |
| SPOC (Original) | .43 / 13.503 |
| SPOC (+Synonym) | .34 / 11.398 |
| FLARE (Original) | .822 / 12.356 |
| FLARE (+Synonym) | .57 / 41.475 |
| ISA (Ours) | |
| ISA (Original) | .865 / 1.854 |
| +Synonym | .7487 / 2.51 |
| +Structure | .829 / 3.96 |
| +Garbled Code | .2964 / 2.547 |
| +Order Change | .195 / 1.285 |
| +Image Flip | .628 / 3.54 |
| +Gaussian Noise | .82 / 2.64 |

(Q#7) ...How many episodes are required for ISA to converge... sample complexity?

Re: The ISA framework does not have a higher sample complexity. Our models were trained with 15-25M steps, compared to 20-50M reported for the FLaRe baseline. We also tested two Lagrangian variants, PID-Lagrangian[1] and Augmented-Lagrangian[2], during rebuttal. As shown in Table 6, both can be integrated into our framework, demonstrating its flexibility and offering different trade-offs between safety and performance without special tuning.

Table 6. Evaluating ISA with Alternative SafeRL Algorithms. (Metrics: SR ↑ / CC ↓)

| Algorithm ↓ | Safety-ObjectNav | Safety-Fetch | Safety-PickUp |
|---|---|---|---|
| Augmented-Lagrangian | .849 / 3.33 | .673 / 7.99 | .928 / 1.65 |
| PID-Lagrangian | .859 / 1.64 | .635 / 8.29 | .862 / 2.27 |

[1] Responsive safety... by PID Lagrangian methods. ICML 2020.
[2] Augmented proximal policy optimization... AAAI 2023.


Thank you again. We hope the new experiments and clarifications can address your concerns. And we would be grateful for your support.

Comment

I would like to thank the authors for the detailed rebuttal. I do appreciate the challenge and financial constraints of having physical testbeds for evaluating these methods. I applaud the authors' efforts to make that happen in the near future. I have my doubts over LLM-generated predicates but it would be valuable to add the results (presented in the rebuttal) to the revised draft. I will keep my score.

Comment

Dear Reviewer PhDu,

Thank you very much for acknowledging our efforts during the rebuttal period. We have sincerely aimed to address all of your concerns. To that end, we conducted extensive experiments and analyses. We also added new discussions in response to your feedback. The updated experimental results and the corresponding discussions will be incorporated into the revised version of the paper.

We regret that some details could not be fully elaborated due to space constraints, and we apologize if our explanation regarding the LLM-generated predicates left any room for doubt. We welcome any further questions you may have and would be happy to make every effort to resolve your concerns.

Thank you again for your recognition.

With best regards!

Review (Rating: 5)

In this paper, the authors proposed to tackle the safety issue for vision-language-action (VLA) models. Specifically, they developed a four-component framework that formally models safety constraints using mathematical predicates, systematically elicits unsafe behaviors through diverse test scenarios, constrains policy learning using constrained Markov decision processes with Lagrangian optimization, and comprehensively evaluates safety performance across multiple dimensions. The authors translated abstract safety requirements into concrete cost functions within a CMDP framework, where safety violations are penalized through adaptive Lagrange multipliers that balance task performance with constraint satisfaction, while using trajectory-level and state-action predicates to capture both immediate and temporal safety patterns. In addition, the authors propose a new benchmark, called Safety-CHORES. This benchmark includes a variety of tasks such as navigation, pickup, and fetch, with different testing scenarios.

The proposed method outperforms all baselines with substantial reductions in cumulative safety costs while maintaining improved task success rates, demonstrating its effectiveness in achieving safety-performance trade-offs and showing robust generalization to out-of-distribution scenarios and extreme failure cases.

Strengths and Weaknesses

Pros:

  • Overall, the paper is well-written and tackles an important problem that's been largely overlooked in the VLA literature - most existing work focuses on task performance but ignores safety constraints that are crucial for real-world deployment.
  • Safety-CHORES benchmark is a valuable contribution that fills a clear gap in evaluation infrastructure.
  • The CMDP formulation is mathematically rigorous, extending the standard tuple to include language instructions and properly defining the feasible policy set with explicit constraint bounds.
  • The experimental results demonstrate substantial safety improvements while maintaining task performance, and the statistical analysis using logistic regression and correlation tests (Table 3) provides proper evidence.

Cons:

  • The cost function design for trajectory-level predicates is overly simplistic - assigning cost=1 only to the final step of violating segments ignores the temporal dynamics and could lead to poor credit assignment, which the authors acknowledge but don't address.
  • It seems that the feasible policy definition in Eq. (1) uses expected cumulative costs, but the actual implementation appears to use instantaneous costs in Eq. (5) (in the appendix), creating a potential mismatch.

Questions

  • The constraint formulation treats all safety violations equally (binary 0/1 costs) rather than having severity-weighted penalties, which seems unrealistic since colliding with a fragile vase should arguably incur different costs than bumping a wall.
  • The Lagrangian method assumes convexity for convergence guarantees, but neural policy optimization is non-convex.
  • The evaluation is entirely simulation-based with no real-world validation, which is particularly concerning for safety research where sim-to-real gaps could have serious consequences. While I understand it's costly to conduct real-world experiments, it would be great if the authors could discuss more about this.
  • Seems the current evaluation is focused on average cost. The constraint satisfaction is only evaluated in expectation rather than providing stronger guarantees like worst-case bounds or probabilistic certificates that would be more appropriate for safety-critical applications.

Limitations

There is no negative societal impact.

Final Justification

The authors’ response addresses nearly all of my concerns, and I’m impressed by the quality of the additional experiments and clarifications.

Formatting Concerns

I don't see any formatting concerns.

Author Response

Dear Reviewer E7bd,

Thank you very much for your detailed review and insightful comments on SafeVLA. During the rebuttal period, we have made every effort, with the utmost sincerity, to address every one of your concerns. We hope that our responses can resolve your concerns to the greatest extent possible and that you will support the acceptance of our paper.

Here are our detailed responses:

(W#1) The cost function design for trajectory-level predicates is overly simplistic. ...

Re: Thank you very much for your valuable feedback and question. We acknowledge that this approach may have high variance due to its sparse signals, which can affect sample efficiency. However, we carefully chose this method as a theoretically sound starting point to build a general framework, avoiding the potential objective bias and reward hacking common with hand-designed dense rewards [4, 5]. While specific shaping methods [1, 2, 3] can be more efficient, our goal was to optimize the true constraint objective without introducing flawed proxies.

In our future work, we plan to identify a sufficiently general heuristic credit assignment strategy for the safety of VLA models, which is a highly valuable problem, especially when transferring to real-world scenarios (as discussed below). We've added a discussion on this trade-off and future work on more advanced credit assignment strategies to the Limitations:

In particular, exploring more advanced, heuristic-based credit assignment strategies for trajectory-level costs is a promising direction to improve sample efficiency and overall performance.

[1] Safe multi-agent reinforcement learning for autonomous driving. arXiv 2016.
[2] Safe reinforcement learning with natural language constraints. NeurIPS 2021.
[3] Sauté RL: Almost surely safe reinforcement learning using state augmentation. ICML 2022.
[4] Policy invariance under reward transformations: Theory and application to reward shaping. ICML 1999.
[5] The effects of reward misspecification: Mapping and mitigating misaligned models. ICLR 2022.

(Q#1) The constraint formulation treats all safety violations equally (binary 0/1 costs) rather than having severity-weighted penalties...

Re: Thank you very much for your meticulous review and feedback. We agree that severity-weighted costs are crucial for real-world applications. Our framework is extensible to real-valued costs, allowing for scenario-specific severity weights. We chose binary costs initially to establish a clear, interpretable baseline for integrating safety into VLAs, as severity is often highly context-dependent (e.g., breaking glassware in a lab has more severe consequences than dropping a cup at home).

To demonstrate balanced performance, we've added a per-constraint cost breakdown (Table 1), showing our method significantly and consistently reduces costs across all violation types, not just an overall average. We've added a discussion on leveraging our framework's extensibility for nuanced, weighted constraints in the Future Work:

A valuable next step is to leverage our framework's extensibility to incorporate severity-weighted constraints, enabling more nuanced safety alignment tailored to specific applications and user preferences.

Table 1. Safety Balance Analysis of Five Distinct Safety Constraints.

| Method ↓ (Metric: Cumulative Cost ↓) / Constraint Type → | Corner | Blind | Danger | Fragile | Critical | Overall |
|---|---|---|---|---|---|---|
| SPOC | 7.451 | 5.050 | 0.218 | 0.208 | 10.350 | 13.279 |
| FLARE | 7.79 | 3.73 | 0.02 | 0.25 | 0.22 | 12.01 |
| ISA (Ours) | 0.535 | 1.09 | 0.065 | 0.025 | 0.055 | 1.77 |

(W#2) ...mismatch between Eq. (1) (expected cumulative costs) and Eq. (5) (instantaneous costs).

Re: We apologize for the confusion. There is no mismatch. Eq. (5) defines the instantaneous cost $c_t$ at a single timestep, while Eq. (1) defines the cumulative cost by summing these instantaneous costs over a trajectory. Eq. (1) builds upon Eq. (5). We have revised the text surrounding Eq. (5) to make this relationship explicit.
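For readers without the appendix at hand, the relationship described here can be written schematically as follows (our notation, ignoring any discounting, rather than a verbatim copy of Eq. (1) and Eq. (5)):

$$c_{i,t} = \mathbf{1}\big[\text{predicate } i \text{ is violated at step } t\big], \qquad J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} c_{i,t}\Big] \le b_i,$$

i.e., Eq. (5) supplies the per-step cost and Eq. (1) constrains its expected sum over a trajectory.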

(Q#2) The Lagrangian method assumes convexity for convergence guarantees, but neural policy optimization is non-convex.

Re: Thank you for your question. The convergence guarantees of the Lagrangian method do not hold for non-convex optimization problems like deep policy learning. This is a common situation in this research area. The theoretical interpretability of large models has become an independent and challenging field of research, and it is difficult for us to address it simultaneously in this work. While we cannot provide a theoretical convergence proof, we have provided extensive empirical evidence demonstrating our method's effectiveness and stability. We sincerely hope the reviewer can understand the difficulty of this matter in the VLA domain and forgive our lack of theoretical analysis on convergence.

(Q#3) The evaluation is entirely simulation-based with no real-world validation...

Re: Thank you for your constructive feedback regarding our paper. Since submitting the paper, we have invested heavily (over $120,000 USD) to build a physical robot platform for validation. Through the efforts of all authors, we have successfully deployed our policy in a real-world Safety-PickUp task, and the robot exhibits constraint satisfaction capabilities consistent with simulation, effectively avoiding obstacles. We will include videos and detailed results in the revised version.

Through these experiments, we identified two main obstacles for sim-to-real transfer: the input distribution shift from sensors and the dynamics mismatch between the simulated and physical robot. Our key findings below detail the strategies we developed to overcome them, and these have been added to the appendix:

  • Perception Strategy: To overcome the input shift, we use pre-trained models (e.g., FoundationPose [1]) to convert noisy images into robust, structured state representations (e.g., 6D poses), avoiding the need for massive real-world image datasets.
  • Dynamics Decoupling: To reduce the dynamics mismatch, we decouple the high-level policy from low-level motor control via a shared semantic or Cartesian action space for both sim and real.
  • Digital Twin Alignment: To further minimize the dynamics mismatch, we tune the simulator's physics (i.e., PID controllers, action cycles) to precisely mirror the real robot's motion characteristics.
  • Data Pipeline Consistency: We ensure an identical data transformation pipeline (e.g., same pose estimator and IK solver) in both simulation and deployment to minimize processing-related errors.

Just as large-scale simulation has been vital for safety in autonomous driving [2,3], we hope that simulation is a powerful and viable tool for modeling a wide range of robot safety challenges. We hope these new real-world experiments and summarized findings directly address your concerns and offer a valuable reference for the community.

[1] Foundationpose: Unified 6d pose estimation and tracking of novel objects. CVPR 2024.
[2] Waymo simulated driving behavior in reconstructed fatal crashes within an autonomous vehicle operating domain. Accident Analysis & Prevention 2021.
[3] NVIDIA Autonomous Vehicles Safety Report 2025.

(Q#4) ...evaluation is focused on average cost... rather than providing stronger guarantees like worst-case bounds...

Re: Thank you for this insightful comment. While formal guarantees are challenging for complex, non-linear systems, we view our work as a crucial component of a layered safety system. Our method provides a strong probabilistic tendency towards safe behavior at the policy level. The probabilistic safety tendency within the policy is an important part of comprehensive safety (e.g., ensuring that the behavior resulting from the model's decisions brings a sense of safety and comfort to humans). In any real-world deployment, this would be complemented by other safety layers (e.g., emergency stops, physical guards) to provide hard guarantees, a practice common in safety-critical fields like aviation. As Reviewer xiKX noted: “This holistic understanding is good to see; it shows the authors are not overclaiming that SafeVLA alone can solve all safety issues.”. We believe that providing probabilistic safety guarantees for one part of the system does not pose a catastrophic risk, but rather highlights the importance of layered safety practices. Just as in the nuclear and aviation industries, AI safety in the real world will benefit from best practices. Future work could involve defining and exploring the other layers of safety and how they can be combined to provide stronger overall safety guarantees. This will require the cooperation of the entire community, and we sincerely hope that the efforts of SafeVLA can be a small step in that direction.


Thank you again for your time and constructive feedback. We hope the new experiments and clarifications significantly strengthen the paper and have addressed your concerns. We would be grateful for your support.

Comment

The authors’ response addresses nearly all of my concerns, and I’m impressed by the quality of the additional experiments and clarifications. I am therefore raising my assessment and recommending acceptance.

Comment

Dear Reviewer E7bd,

We are deeply grateful for your support of SafeVLA. It is our honor to address your concerns. The updated experimental results and the corresponding discussions will be incorporated into the revised version of the paper.

Once again, we sincerely appreciate your recognition.

With best regards!

Final Decision

A. This paper incorporates safety constraints into VLA policies used for robotic control. The proposed method incorporates safety into VLA training via constrained MDP. The paper also introduces Safety-CHORES, a new benchmark of long-horizon tasks. Using this testbed, the authors show that the proposed method achieves an 83.58% reduction in safety incidents compared to the best existing method, while actually slightly improving task success (+3.85%).

B. Reviewers found the paper well written and addressing an important problem. They appreciated that the paper proposed both a new benchmark, and a new method that yielded substantial improvements on that benchmark. They found the experiments rigorous, and appreciated that the method continued to work well in OOD settings.

C. Reviewers raised questions about whether the cost function was too simplistic, about assumptions (e.g., the use of predicates), the lack of real-world experiments (which were completed during the rebuttal), and novelty.

D. Reviewers unanimously voted to accept the paper (scores = 6/5/5/4), and confirmed that the rebuttal addressed their concerns with the paper.

E. Discussion and new experiments (including on robot hardware!) from the authors addressed most of the reviewer concerns.