PaperHub
NeurIPS 2025 · Poster · 4 reviewers
Overall score: 6.8/10 (individual ratings: 4, 4, 4, 5; min 4, max 5, std 0.4)
Confidence: 3.8 · Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

This work integrates the robot's perception of external forces as a novel modality into Vision-Language-Action models by leveraging a multimodal Mixture-of-Experts (MoE) architecture to capture subtle dynamic changes during interaction processes.

Abstract

Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong $\pi_0$-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025/.
Keywords
VLA · Robotics · Contact-rich Manipulation · Robotics and Control

Reviews and Discussion

Review (Rating: 4)

The paper presents ForceVLA, a vision-language-action (VLA) model enhanced with force sensing and feedback. Force signals are perceived online and integrated into the model to guide action generation. To enable force awareness, the authors introduce a force signal processing layer and a mixture-of-experts (MoE) architecture for effective force integration. A teleoperation system is developed to collect data with force annotations, resulting in a curated dataset spanning five diverse tasks. Extensive experiments on these contact-rich tasks demonstrate the effectiveness of ForceVLA and its ability to generalize across different scenarios.

Strengths and Weaknesses

Strengths

  • Strong motivation. The paper is well-motivated. Contact-rich tasks pose significant challenges for standard VLA policies, making them ideal candidates for evaluating the integration of tactile and force feedback.
  • Well-designed method. The proposed approach is thoughtfully designed. The force signal processing and fusion mechanisms are clearly described in the method section. The fusion strategy, along with the MoE architecture, is a reasonable and effective choice for handling heterogeneous information.
  • Effective and generalizable. ForceVLA demonstrates strong performance on challenging contact-rich tasks such as plug insertion, USB insertion, and cucumber peeling. The authors further validate the model’s generalization capability across variations in object type, insertion height, and external disturbances through a comprehensive set of experiments.

Weaknesses

  • Questionable task difficulty and quality. Some task setups raise concerns about difficulty and realism. For example, in the plug insertion task, the plug is initially half-inserted into the socket, significantly reducing the complexity—requiring the robot to simply reach and make a small movement to complete the task. In the demonstration video, the robot’s end-effector trajectory appears quite unsmooth, and it relies on trial-and-error to locate the correct contact point. The USB insertion task shows more impressive results; however, the trajectory near the socket is still jerky, and the robot even drops the USB at one point and needs to re-grasp it. In contrast, tasks like cucumber peeling and board wiping show better performance, though they do not require high precision in position or force control.
  • Task design. As noted, the initial configuration of the plug being half-inserted undermines the challenge of the task. In the board wiping task, the robot can succeed simply by applying a large force, without needing fine control over the force applied. A more demanding task—such as wiping a bristle bottle—would better justify the need for precise force sensing and execution. Additionally, many of the generalization experiments focus on variations in visual appearance or visual disturbances, which are typical in standard VLA setups and do not highlight the unique benefits of force modeling.
  • The website could be more organized.

Questions

  • Does the teleoperation setup provide force feedback to the human operator? What specific challenges do the authors face when teleoperating these contact-rich tasks? Additionally, what are the teleoperation success rates for each task?
  • Is the final ForceVLA model trained as a single multi-task policy, or are separate task-specific policies used?
  • While the method is framed as a VLA model, the language component appears limited to providing task instructions for a small number of predefined tasks. Since the model is not aimed at open-world generalization, developing the approach as a vision-based, force-aware policy may be more accurate and appropriately modest than purposely integrating the "L" (language) component to make it align with popular VLAs.
  • How is the base policy π₀ used in the training pipeline? Is ForceVLA trained from scratch, or is it fine-tuned from π₀?
  • The model is trained on only a few hundred trajectories. Are these data sufficient to avoid overfitting, especially given the complexity of the tasks? What specific mechanisms or design choices encourage generalization, particularly given the relatively small training dataset?
  • Have the authors considered comparing against Diffusion Policy or 3D Diffusion Policy? While these are not VLAs, they represent competitive approaches for visuomotor control.

Limitations

  • The overall quality of the results is somewhat restricted. Some task executions exhibit suboptimal behaviors such as unsmooth trajectories, retries, and dropped objects, which raise concerns about robustness and precision.
  • The originality of the method is somewhat limited. Both feature fusion and MoE have already been proposed in previous works.

Justification for Final Rating

ForceVLA presents a framework for collecting vision-tactile data and a VLA architecture for policy training. Leveraging vision and tactile signals, the authors have demonstrated impressive capabilities in completing fine-grained tasks with heavy occlusions, where tactile information would play a crucial role. In summary, the paper explores a promising direction for building stronger robot policies, with a specific focus on integrating force information, and has demonstrated its superiority via thorough experiments. It contributes valuably to the community and deserves acceptance.

Formatting Concerns

N/A

Author Response

We sincerely thank Reviewer 7Vfu for their extremely thorough and detailed review. We appreciate the positive comments on our motivation and method design. Your critical feedback on task design, evaluation, and the framing of our contributions is invaluable, and it has pushed us to significantly clarify and strengthen our paper. We address each of your concerns below.


1. On Task Design, Difficulty, and Generalization

This section addresses the reviewer's primary concerns about the quality and design of our tasks and generalization experiments.

Regarding Task Difficulty and Realism (Weakness 1 & 2): We appreciate the reviewer's concerns about task design and would like to provide more context.

  • On the "Plug Insertion" setup: We understand the concern that starting with the plug half-inserted reduces complexity. We initially attempted a full "pick-move-insert" sequence. However, due to hardware limitations, our gripper's frictional force was insufficient to hold the plug firmly against the large insertion forces required. We therefore opted for the "push-to-insert" task design. This setup introduces its own unique challenges that are highly relevant to force control: without a firm grasp, the plug is prone to wobbling and shifting, requiring the policy to be highly sensitive to instantaneous force changes to maintain a stable push. We will add a discussion of this hardware limitation and the resulting design choice to the paper. When hardware permits (e.g., with a stronger gripper), we plan to explore the full sequence in future work.
  • On "Unsmooth Trajectories": We apologize for this impression. The videos on our website were sped up 2-3x for brevity. The original trajectories are smoother, and we will clarify this on the website. Regarding "trial-and-error," we acknowledge this reflects the current capabilities of the model and that achieving smoother, more "ballistic" motions is a key challenge for future work.
  • On "Board Wiping": We agree that if the task could be solved by simply applying a large force, its value would be diminished. However, this is not feasible in our setup. Applying excessive force pushes the whiteboard out of its fixed position and beyond the robot's workspace, leading to task failure. Therefore, controlled force application is necessary to wipe the board effectively while keeping it in place.
  • On a "More Demanding Task": The suggestion of wiping a bristle bottle is excellent. This is a great idea for a more challenging benchmark, and we will consider it for future work.

Regarding the Benefits of Force Modeling in Generalization (Weakness 2): Thank you for pointing out the need to better highlight how our generalization experiments test force modeling. While some experiments involve visual changes, many are designed to create novel physical dynamics:

  • For Pump Bottle, generalization to bottles of different heights and pumps with different travel distances requires the policy to use force feedback. For shorter bottles or longer pumps, the robot must press lower, using the upward reaction force to determine when the press is complete or if it has exceeded a safe limit.
  • For Plug Insertion, using plugs of different sizes and shapes introduces complex new force dynamics. Smaller plugs demand more precise directional control during contact, while taller plugs require fine control over the timing and magnitude of the force. The "unstable socket" generalization specifically introduces random, unpredictable dynamic forces (sliding, wobbling), forcing the model to rely on precise force sensing and immediate adjustments to counteract the instability. The visual occlusion task for this setup similarly forces a greater reliance on force modeling over vision.

We will revise our experimental section to make these force-specific challenges much clearer.


2. On Methodological & Training Details

This section addresses the reviewer's specific questions about our implementation.

Regarding Teleoperation Details (Question 1): Although the human operator does not receive direct haptic feedback, they are provided with real-time visual feedback of the forces within their Quest3 VR headset. The 6D wrench is visualized as a dynamic force vector at the TCP, allowing the operator to see and react to contact forces. The main challenge during teleoperation (e.g., for USB insertion) is the low resolution of the headset's display, which can make fine details hard to see from a distance. Operators sometimes mitigate this by looking at the real setup with their own eyes. We have not formally tracked the teleoperation success rate, but anecdotally, we perform about 60 teleoperation trials to collect 50 successful demonstrations, suggesting a success rate of approximately 83%.

Regarding Single vs. Multi-task Training (Question 2): Thank you for this question. In our main experiments (Figure 5), we trained separate, task-specific policies. However, we also conducted multi-task joint training, and these results are reported in Section D (Table 5) of our supplementary material. The results show that our model performs well both as a single multi-task policy and as separate task-specific policies. We will add a pointer to these results in the main paper.

Regarding Base Policy (π₀) Usage (Question 4): We initialize our model with the pre-trained weights of π₀. During training, we use the LoRA method to fine-tune the original π₀ parameters (such as the Gemma model and action expert), while we train the new parameters of our proposed FVLMoE module from scratch; a concrete sketch of this setup follows the list below. We will clarify this training procedure in the methodology section.

Regarding Dataset Size and Generalization (Question 5): We agree that a few hundred trajectories is a relatively small dataset. We believe our strong generalization results demonstrate that the model is not merely overfitting. We encourage this generalization through two main mechanisms:

  1. Implicit Regularization from Pre-training: Fine-tuning a powerful VLA like π₀ provides a strong prior that helps prevent overfitting.
  2. Data Diversity: During data collection, we consciously introduce variations in object starting positions, interaction angles, and approach directions to enrich the training data. We also randomize initial object positions during testing.
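To make the fine-tuning recipe above concrete, here is a minimal PyTorch/PEFT-style sketch of the parameter setup: LoRA adapters on a frozen pretrained backbone, with a new fusion module trained from scratch. All module names, shapes, and hyperparameters are illustrative stand-ins, not the authors' actual code.

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class ToyBackbone(nn.Module):
    """Stand-in for the pretrained pi0 backbone (names illustrative)."""
    def __init__(self, d=64):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, x):
        return self.q_proj(x) + self.v_proj(x)

# LoRA-wrap the backbone: base weights are frozen, only the low-rank
# adapter matrices are trainable.
backbone = get_peft_model(
    ToyBackbone(),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)

# Stand-in for the new FVLMoE fusion module, trained from scratch.
fvlmoe = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

optimizer = torch.optim.AdamW(
    [
        {"params": [p for p in backbone.parameters() if p.requires_grad]},  # LoRA only
        {"params": fvlmoe.parameters()},                                    # from scratch
    ],
    lr=1e-4,
)
```

In such a setup only the low-rank adapters and the new module receive gradients, which is consistent with the regularization argument above: the pretrained weights are preserved as a prior.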

3. On Positioning, Originality, and Broader Comparisons

This section addresses the higher-level concerns about our work's framing and its relation to other methods.

Regarding the VLA Framing and Language Component (Question 3 & Limitations): This is a fair point. While the language instructions in our main experiments are for a predefined set of tasks, our goal is to build towards general-purpose robots that can handle diverse tasks at a macro level while also executing fine-grained, contact-rich motions at a micro level. Our multi-task training results (in Appendix D), which require language understanding to differentiate tasks, show that our model has potential for open-world generalization. We chose the VLA framework to align with this long-term vision, rather than developing a more narrowly-focused, vision-force policy. While we acknowledge we have not demonstrated large-scale language generalization due to time constraints, we believe our work is a critical step in that direction.

Regarding Originality (Limitations): While the core components we use (feature fusion, MoE) have been explored in other contexts, we believe our primary novelty lies in the specific architectural design and first-of-its-kind demonstration of integrating these techniques for online, 6D force-based VLAs operating in complex, real-world contact tasks. This specific problem domain and our successful application remain underexplored.

Regarding Comparison to Diffusion Policy (Question 6): Thank you for the suggestion. We agree that Diffusion Policy and 3D Diffusion Policy are strong baselines for visuomotor control. For this initial work, we prioritized a direct comparison with the π₀ VLA family, as our method is a direct extension of the VLA paradigm. Given the time constraints of the rebuttal period, we were unable to implement and evaluate these new baselines, but we recognize their importance and will include them in future iterations of this work.

We also thank the reviewer for the feedback on our project website. We will update it with a more organized and polished design to better present our methodology and results.


We hope these detailed responses have addressed all your concerns. We are grateful for the thorough review, which will undoubtedly help us produce a much stronger final paper.

Comment

Thanks for the detailed rebuttal. My questions regarding task design, methodology, and higher-level concerns have been adequately addressed. I believe this is a solid paper that merits acceptance. However, I remain unconvinced that it makes a fundamental contribution to the field. So I will keep my original rating.

Review (Rating: 4)

This paper proposes ForceVLA, a Vision-Language-Action model enhanced with 6D force sensing. The core module, FVLMoE, is a force-aware Mixture-of-Experts that fuses force, vision, and language modalities during action generation. A new dataset, ForceVLA-Data, is collected for contact-rich tasks. Experiments on 5 real-world tasks show improved performance over π0 baselines.

Strengths and Weaknesses

Strengths

  • Novel use of force sensing as a first-class modality in VLA models.
  • The FVLMoE fusion module is a technically sound idea.
  • Solid improvements in task success rates (+23.2%) across five contact-rich tasks.

Weaknesses

  • No simulation experiments, which limits reproducibility and accessibility.
  • In the "Unstable Socket" task, π₀-fast w/ force performs best, suggesting that a simple connection can be effective if the pre-training data contains enough torque sensor perception information.

Questions

  • Why are there no simulation experiments? Could that help with reproducibility or scalability?
  • π₀-fast w/ force performs best in the Unstable Socket task. Does this mean that simple fusion is sufficient if the pre-training data contains enough torque sensor data? Can the authors share insights on π₀-fast's competitiveness with sufficient torque sensor training data?
  • Can ForceVLA generalize to low-cost robots without precise F/T sensors?

Limitations

Yes.

Justification for Final Rating

The rebuttal addressed my main concerns, and with no major unresolved issues, my overall assessment and recommended score remain unchanged.

Formatting Concerns

None.

Author Response

We thank Reviewer 4bXA for the constructive feedback and for raising important questions about the interpretation of our results and the broader applicability of our work. We address each point below.


1. On Simulation Experiments, Reproducibility, and Scalability (Weakness 1, Question 1)

We appreciate the reviewer's point regarding simulation experiments. Our primary focus for this work was to tackle the challenges of force-aware manipulation directly in the real world, as physical interaction and multi-body contact dynamics are central to our problem. We believe that using real-world data ensures our findings are directly applicable to physical deployment scenarios. While we agree that simulation offers significant benefits for scalability and reproducibility, developing a high-fidelity simulation for these diverse, contact-rich tasks was beyond the scope of our current resources for this submission. However, we concur that large-scale training with generated data in simulation is a very promising direction for expanding the range of tasks our model can address. We see this as a key area for future work and will add this discussion to the paper.

2. On the Performance of π₀-fast in the "Unstable Socket" Task (Weakness 2, Question 2)

This is a very keen observation, and we thank the reviewer for asking for a deeper insight into this interesting result. Our analysis reveals a nuanced reason for this behavior. We have consistently observed that π₀-fast models, due to their architecture, often exhibit stuttering, pausing mid-air, or significant delays between actions. Paradoxically, this frequent pausing becomes an advantage for the specific "Unstable Socket" task. When the end-effector pauses while in contact with the plug, the likelihood of the unstable socket sliding or wobbling decreases significantly. This effectively reduces the difficulty of this particular generalization task to that of a stable socket, creating the impression of superior performance for the π₀-fast model. However, for general contact-rich, fine-manipulation tasks, we argue that π₀-fast is not as competitive. Its autoregressive decoding nature limits its inference speed, which in turn hinders its ability to react promptly to real-time force changes and adjust its trajectory flexibly. We will add this important nuance to our analysis section to prevent misinterpretation of this result.

3. On Generalization to Low-Cost Robots without Precise F/T Sensors (Question 3)

Thank you for this forward-looking question about practical deployment. Due to time and hardware constraints, our current experiments are limited to a single robot platform. However, we strongly agree that demonstrating generalization to more common, lower-cost platforms is crucial. To promote wider practical deployment and help democratize force-aware manipulation research, we are actively assessing ForceVLA’s adaptability and performance on lower-cost platforms equipped with external or retrofitted force sensors.


We hope our responses have clarified these points and addressed your concerns. Thank you again for your valuable feedback.

Comment

Thank you to the authors for the clear and thoughtful clarifications. The explanations on the lack of simulation experiments, the π₀-fast behavior in the “Unstable Socket” task, and potential deployment on low-cost platforms were helpful and addressed my questions. The clarifications are appreciated, and my overall assessment remains consistent with the original review.

Review (Rating: 4)

This paper presents ForceVLA, a method for enhancing vision-language-action (VLA) models with force information using a mixture-of-experts (MoE) policy architecture. The proposed approach is evaluated on a real-world robot across several manipulation tasks. While the idea of integrating force signals into VLA frameworks is appealing and timely, the paper leaves several architectural and conceptual questions unresolved.

Strengths and Weaknesses

Strengths:

  • Exploring force-enhanced VLAs is a timely and relevant problem.
  • Real-robot experiments are conducted on diverse tasks with promising results.
  • The mixture-of-experts formulation shows potential to leverage multimodal signals.

Weaknesses:

  • The related work omits several recent efforts combining vision, language, and tactile/force signals (e.g., VTLA, FuSe). A broader and more up-to-date literature review would better situate the paper.
  • From the introduction, one would expect experts in the MoE to specialize in different modalities (e.g., one for vision, one for force). However, the architecture does not enforce this: modalities are simply concatenated, and all experts receive the same fused input.
  • It is unclear what exactly comprises a "force observation"—e.g., Fig. 3 suggests a snippet of force trajectories rather than a single reading, but this is not explained clearly.
  • There is no force feedback during teleoperation, which raises doubts about whether the demonstrations are truly reactive to force or just to delayed visual cues.
  • Results are presented as tables of numbers with little analysis. Success rates remain low. Why? What are the main failure modes?
  • A comparison with MoE models trained without force input is missing, which would be important to isolate the value of the force modality.
  • In Table 2, performance of pi0 degrades when force is added (without MoE). Why does force hurt performance in this case?

Questions

  1. Can you clarify whether each expert in the MoE is intended to specialize in different modalities? If so, how is this encouraged during training?
  2. What exactly is included in a "force observation"? Is it a single reading, or a sequence?
  3. Given that the teleoperator receives no force feedback, how are successful force-sensitive demonstrations collected?
  4. Why does pi0 perform worse with added force information in Table 2?
  5. What are the typical failure modes that explain the relatively low success rates?
  6. Can you include a comparison to a MoE model trained without access to force inputs?

Limitations

Yes.

Justification for Final Rating

The authors have addressed most of my concerns. The justification about low success rates in some tasks is not fully convincing (e.g., humans can perform force-driven USB insertion with minimal visual feedback), but overall, this can still be a helpful contribution to the community.

Formatting Concerns

None.

Author Response

We sincerely thank Reviewer 1t4m for the thoughtful and detailed feedback. Your comments on architectural clarity, experimental analysis, and missing comparisons are invaluable. They have helped us identify key areas for improvement, and we have addressed each point below.


1. On Architectural & Conceptual Clarity

Regarding MoE Specialization (Weakness 2, Question 1): While it is challenging to definitively assign a single modality to each expert, our analysis reveals clear specialization patterns based on task and execution phase. As detailed in Section C of our supplementary material, we observe that the router learns to dynamically allocate different experts for different stages of a task. For example, Figure 9 shows that the expert load varies across different task completion percentages. This indicates that the router has learned a strategy to leverage different experts based on the evolving context of the task, rather than a static modality assignment.
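For readers unfamiliar with this kind of dynamic routing, the sketch below shows a generic token-level top-k MoE layer in PyTorch. It illustrates the mechanism of context-dependent expert allocation only; it is not the authors' FVLMoE implementation, and all sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Generic top-k mixture-of-experts layer over fused tokens.
    Illustrative only; not the authors' FVLMoE module."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                           # x: (B, T, d_model)
        logits = self.router(x)                     # (B, T, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)        # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e          # tokens sent to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Fused vision-language-force tokens in, routed features out.
y = TinyMoE()(torch.randn(2, 10, 64))
```

Logging the routing indices (`idx`) over the course of a rollout is one way to visualize how expert load shifts across task phases, in the spirit of the Figure 9 analysis mentioned above.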

Regarding the Definition of "Force Observation" (Weakness 3, Question 2): Thank you for this question highlighting a point of ambiguity. To clarify, a "force observation" is a reading from a single frame/timestep, not a sequence over time. Specifically, at each observation step, this single reading is a 6-dimensional vector representing the estimated external wrench applied on the TCP, expressed in the world frame, ${}^0F_{ext} \in \mathbb{R}^{6\times 1}$. This vector consists of a force component in $\mathbb{R}^{3\times 1}$ and a moment component in $\mathbb{R}^{3\times 1}$: $[f_x, f_y, f_z, m_x, m_y, m_z]^T$. The units are Newtons [N] for force and Newton-meters [Nm] for moments. We will revise the paper to make this distinction between a single-step reading and a time-sequence explicit to avoid any confusion.
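As a concrete illustration of this definition (the numeric values below are invented for the example):

```python
import numpy as np

# One "force observation" = one timestep's estimated external wrench at
# the TCP in the world frame: 6 numbers, no temporal window.
wrench = np.array([1.8, -0.4, 12.6,     # f_x, f_y, f_z  [N]
                   0.02, 0.11, -0.05])  # m_x, m_y, m_z  [Nm]
assert wrench.shape == (6,)
```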

Regarding Force-Sensitive Data Collection (Weakness 4, Question 3): Although the human operator does not receive direct haptic feedback, they are provided with real-time visual feedback of the forces within their Quest3 VR headset. The 6D external wrench readings from the robot's sensor are streamed to the headset and visualized as a dynamic force vector originating from the robot's TCP. This vector continuously changes its length and orientation based on the measured force, allowing the operator to see and react to contact forces during teleoperation, thus enabling the collection of force-reactive demonstrations.

We acknowledge that teleoperation systems with direct haptic force feedback are now available. Due to hardware constraints in our current setup, we did not integrate this capability. We agree that this is a valuable direction and will certainly consider incorporating direct haptic feedback in our future work to potentially enhance the quality and reactivity of the demonstrations.


2. On Analysis of Experimental Results

Regarding Typical Failure Modes and Low Success Rates (Weakness 5, Question 5): We have compiled a detailed breakdown of the failure modes for each task and model:

  • Plug Insertion: w/o force models generally exhibit cruder behavior during insertion, unable to adjust their strategy based on interaction state changes. w/ force models can generally perceive changes in external forces but are not flexible enough in adjusting their trajectory. ForceVLA can better maintain continuous contact, follow the plug's surface, make fine adjustments to the force angle and duration within a more precise range, and terminate the action more promptly upon full insertion.
  • USB Insertion: The overall success rate for this task is low. A common problem across all five models is the difficulty in aligning with the USB port, which we attribute to insufficient visual clarity and the backbone model's inability to process fine-grained visual information. However, ForceVLA's success rate is slightly higher because it exhibits clear autonomous adjustment or re-attempt behaviors upon feeling external force from contact with the socket.
  • Bottle Pumping: Simply pressing a bottle seen during training was completed with 100% success by the π₀-base models and ForceVLA. Therefore, during testing for this task, we introduced additional visual occlusions and background changes. Most of the failure modes occurred under these specific variations. w/o force models often missed the pump or did not press it fully. ForceVLA was more robust but sometimes pressed off-center.
  • Cucumber Peeling: w/o force models were significantly more prone to breaking the peel mid-way, unable to peel continuously from end to end, and sometimes peeled too deeply, indicating a lack of control over the peeling force. w/ force models peeled more stably with a wider peel, but still had issues with breaking the peel and not following the cucumber's curvature well. ForceVLA could overcome these issues.
  • Board Wiping: A common problem for all models was the inability to pick up the eraser, possibly because its placement was far from the base camera, resulting in low resolution, and its color was similar to the black table, making precise localization difficult. In the remaining trials, if a small part of the writing wasn't erased, we classified it as a failure. Additionally, if a model didn't stop after 5 minutes, we counted it as a failure. The w/o force models started wiping in the air or pressed too hard, leaving scratches, due to their inability to perceive force. ForceVLA made closer contact with the board and applied a more appropriate wiping force than other models.

We will integrate this detailed failure analysis into the appendix to provide a much richer context for our quantitative results.

Regarding π₀ Performance with Force Input (Weakness 7, Question 4): The performance degradation of π₀ with force in generalization tasks (Table 2) stems from an out-of-distribution data problem. The base π₀ model was pre-trained on state information that did not include force data. When we concatenate force information to the state input for fine-tuning, the model performs well on in-distribution tasks. However, during generalization to unseen objects and scenarios, the model must rely more heavily on its pre-trained knowledge. Since the force data is an unfamiliar and out-of-distribution signal for the pre-trained π₀, it disrupts its existing priors, leading to a drop in performance. Our ForceVLA architecture is designed to mitigate this by handling multimodal signals more robustly.
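The contrast between the two force-injection strategies can be sketched as follows; the dimensions and layer names are illustrative assumptions, not the paper's actual interfaces.

```python
import torch
import torch.nn as nn

state = torch.randn(1, 32)   # proprioceptive state (illustrative size)
wrench = torch.randn(1, 6)   # single-step 6D force-torque reading

# (a) "pi0 w/ force" baseline: append force to the state vector. The
#     pretrained backbone never saw this channel, so under distribution
#     shift the unfamiliar signal can disrupt its priors.
baseline_input = torch.cat([state, wrench], dim=-1)  # (1, 38)

# (b) ForceVLA-style: embed force separately, leaving the backbone's
#     original inputs intact; a dedicated fusion module (the MoE)
#     decides how the force features are used.
force_proj = nn.Linear(6, 64)
force_tokens = force_proj(wrench)                    # (1, 64)
```

In (a) the unfamiliar channel perturbs inputs the pretrained backbone relies on; in (b) the backbone's inputs are untouched and a dedicated module decides how much weight force receives.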


3. On Additional Comparisons and Ablations

Regarding Comparison to MoE without Force (Weakness 6, Question 6): To directly assess the value of the force modality within our MoE architecture, we conducted the suggested ablation study where the force input to the ForceVLA model was masked.

The results are as follows (success rate, with the drop relative to the full ForceVLA model in parentheses):

  • Plug Insertion: 20% (a 60% drop)
  • USB Insertion: 0% (a 25% drop)
  • Pump Bottle: 30% (a 37% drop)
  • Wipe Board: 26.7% (a 0.3% drop)

The Peel Cucumber task was not reported due to time constraints.

In this ablation study, the common failure modes for ForceVLA w/o force inputs included:

  • During Plug Insertion, the end-effector could only push a short distance in the insertion direction, and the trajectory's curvature was insufficient to properly insert the plug into the socket; alternatively, the end-effector would only move to the plug's surface and perform an arcing push motion near the contact point without making actual contact, similar to overfitting the insertion trajectory.
  • In the USB Insertion task, the gripper's closing width was too large, making it unable to grasp the USB drive firmly, possibly due to overfitting to the gripper widths seen in training.
  • When pumping the bottle, the model repeatedly applied excessive force, causing the gripper to bend, or pressed off-center, causing the bottle to spring away.
  • When wiping the board, the number of attempts to erase the writing was reduced, leading to most cases where the board was not wiped clean.

We summarize the likely cause of these failure modes as follows: Without force input, using the FVLMoE module increases the model's total number of parameters, making it more prone to overfitting. During inference, it exhibits many stereotypical behaviors that are misguided by vision, which represents an overfit to the action distribution of the training data. Conversely, the fact that the full ForceVLA model with the additional force modality does not suffer from this overfitting strongly indicates that our architectural design is effective at leveraging force signals to regularize the policy and achieve robust, reactive behaviors instead of merely memorizing training trajectories. We will add this crucial ablation study and analysis to the paper.

Regarding Related Work on VTLA and FuSe (Weakness 1): We sincerely thank you for pointing out these highly relevant works. We have studied both papers and agree they are essential for situating our work. We will add a detailed comparison to our Related Work section.

  • Comparison with VTLA: VTLA focuses on sim-to-real transfer using a new simulation dataset and fingertip tactile sensors. Our work is fundamentally different as it is a real-world-first approach, using a new real-world dataset with 6D force-torque data, and our novelty lies in the MoE architecture for online force modulation, not sim-to-real.
  • Comparison with FuSe: FuSe introduces a finetuning recipe to adapt existing generalist models to new modalities. Our contribution is a complete, end-to-end policy architecture. Our focus is on investigating a new architectural paradigm (the MoE) for deep integration.

We hope these clarifications, analyses, and new experimental results have addressed your concerns. Incorporating these changes will significantly improve the quality and clarity of our paper. Thank you again for your valuable guidance.

Comment

Thank you for addressing most of my concerns. I still find the justification about low success rates in some tasks not fully convincing (e.g., humans can perform force-driven USB insertion with minimal visual feedback), but overall, the paper would greatly benefit of the remaining clarifications mentioned in the rebuttal. Therefore, I am increasing my score.

Review (Rating: 5)

This paper proposes ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback. The paper also introduces ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong π0-based baselines, achieving up to 80% success in tasks such as plug insertion.

Strengths and Weaknesses

Strengths:

  1. This paper explores a timely and important direction—integrating force signals into vision-language-action (VLA) models for contact-rich manipulation tasks. The proposed method, a force-aware Mixture-of-Experts (MoE) fusion module, enables dynamic processing and deep integration of force, visual, and language modalities during action generation. This represents a meaningful advancement in the development of multimodal large action models, particularly for tasks requiring nuanced physical interactions. I believe the work would be of interest to the community who is working on multimodal learning, multimodal large action model, robot learning, and contact-rich manipulation.
  2. The authors benchmark their model against state-of-the-art approaches such as pi-0 and demonstrate superior performance in tasks like “Insert USB,” “Insert Plug,” and “Peel Cucumber.” Furthermore, under five challenging settings, ForceVLA consistently outperforms π0, highlighting its robustness and effectiveness across diverse contact-rich scenarios.
  3. The authors conduct ablation studies on various architectural designs of ForceVLA, with a particular focus on the integration of force feedback. This analysis is valuable for researchers interested in incorporating temporal signals into vision-language models.

Weaknesses:

  1. One of the key contributions of this manuscript is the introduction of a comprehensive data collection pipeline tailored for contact-rich manipulation tasks. This includes the development of teleoperation tools, data converters, and the release of a new dataset. Such an infrastructure is highly valuable for advancing learning in physically grounded manipulation tasks where force feedback plays a crucial role.

However, the paper does not sufficiently discuss the limitations of existing contact-rich manipulation datasets or clearly position its contributions in relation to prior work. Several recent efforts have made significant strides in this space:

  • Fan et al., RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot (RSS 2023 Workshop), introduced a broad dataset for generalizable robotic skill learning.
  • Narang et al., Factory: Fast Contact for Robotic Assembly (RSS 2022), focused on high-speed, precise contact interactions in assembly tasks.
  • Tang et al., IndustReal: Transferring Contact-Rich Assembly Tasks from Simulation to Reality (RSS 2023), tackled the sim-to-real gap for contact-rich assembly by leveraging simulation and domain adaptation.

A more thorough discussion contrasting these works with the proposed dataset—in terms of task diversity, force sensing resolution, data volume, teleoperation fidelity, and real-world applicability—would strengthen the contribution and help the audience better appreciate the novelty and utility of this new resource.

  2. Limited discussion on failures: While the proposed method demonstrates favorable performance against existing state-of-the-art, it is unclear why the proposed method fails on "Pump Bottle" and "Wipe Board-1 and 2." The description of these failures is limited. Could different fusion strategies work better for these categories?

Questions

  1. Please provide a more thorough discussion contrasting these works with the proposed dataset.

  2. Limited discussion on failures: While the proposed method demonstrates favorable performance against existing state-of-the-art, it is unclear why the proposed method fails on "Pump Bottle" and "Wipe Board-1 and 2." The description of these failures is limited. Could different fusion strategies work better for these categories?

Limitations

Yes.

Justification for Final Rating

I appreciate the authors’ efforts in addressing the comments raised in my original review. The rebuttal addressed them. Please ensure to incorporate these comments in the final manuscript.

Overall, the work is important to the community. Specifically,

  • This paper explores a timely and important direction—integrating force signals into vision-language-action (VLA) models for contact-rich manipulation tasks.
  • The authors benchmark their model against state-of-the-art approaches such as pi-0 and demonstrate superior performance in tasks like “Insert USB,” “Insert Plug,” and “Peel Cucumber.”
  • The authors conduct ablation studies on various architectural designs of ForceVLA, with a particular focus on the integration of force feedback. This analysis is valuable for researchers interested in incorporating temporal signals into vision-language models.

Thus, I recommend accepting the paper.

Formatting Concerns

N/A

Author Response

We sincerely thank Reviewer hqeq for their positive assessment and very insightful feedback. We are encouraged that the reviewer found our work to be a "timely and important direction" and a "meaningful advancement." Your constructive comments have been instrumental in helping us clarify and strengthen our contributions. We address your points below.


1. Regarding the Discussion and Positioning of the Dataset (Weakness 1, Question 1)

We thank the reviewer for this detailed and extremely helpful suggestion. We agree completely that a more thorough comparison with existing datasets will better situate our work and help the community appreciate its unique utility. As suggested, we have performed a detailed analysis and will incorporate the following discussion into the final version of our paper.

  • Comparison with RH20T (Fan et al.): While RH20T provides a valuable, large-scale dataset, its primary offering is raw, unformatted data with basic reading tools. Our work significantly advances usability and integration by providing not only the raw data but also a pre-processed dataset in the SOTA Lerobot format and the conversion tools themselves. Datasets in the Lerobot format are "plug-and-play," meaning they can be directly loaded to train most models within the powerful Lerobot library. This out-of-the-box compatibility removes significant data-wrangling overhead, greatly accelerating research in this area.
  • Comparison with Factory (Narang et al.): Factory is a powerful toolkit focused on simulating contact-rich interactions at scale, providing virtual assets and controllers to train policies in a simulated environment. Our approach is fundamentally different, as our data is collected entirely in the real world. By doing so, our pipeline and dataset bypass the well-known sim-to-real gap. Our work thus offers a complementary, reality-first methodology that is more directly applicable to physical robot deployment.
  • Comparison with IndustReal (Tang et al.): The approach of IndustReal is notable as it explicitly avoids F/T sensors, instead relying on other cues to manage contact. Our work is built on a contrasting philosophy. We posit that explicit, high-resolution force feedback is a critical signal for dexterous manipulation. Therefore, our ForceVLA-Data prioritizes the deep integration of 6D F/T sensor data. This positions our dataset as an essential resource for developing and benchmarking a distinct class of policies that explicitly leverage force.

In summary, while these pioneering datasets have pushed boundaries in task diversity (RH20T), simulation (Factory), and sim-to-real (IndustReal), our ForceVLA-Data offers unique contributions in three key areas: out-of-the-box usability with the Lerobot framework, a reality-first methodology, and a focus on explicit 6D force sensing.

We will add this detailed discussion to a new subsection in our Related Work section. We thank the reviewer again for prompting this valuable clarification.
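As a small illustration of the "plug-and-play" claim above, loading a Lerobot-format dataset typically takes only a few lines. The repo id below is hypothetical (the dataset's actual identifier is not given in the text), and the sketch assumes the `LeRobotDataset` entry point of the lerobot library.

```python
# Hedged sketch of the Lerobot workflow described above; not an official
# ForceVLA-Data loading recipe.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("forcevla/forcevla-data")  # hypothetical repo id
frame = ds[0]   # one synchronized frame: images, proprioception, force-torque, action
print(frame.keys())
```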

2. Regarding the Discussion of Failure Modes(Weakness 2, Question 2)

We thank the reviewer for pointing out the need for a more detailed failure analysis. Below, we describe the typical failure modes for our ForceVLA model on the mentioned tasks and address the question about alternative fusion strategies.

  • For "Pump Bottle": The most common failure mode for ForceVLA was making contact with the pump head off-center. It is important to note that for the standard task with a bottle seen during training, our model achieved a 100% success rate. The observed failures occurred primarily when we introduced challenging visual variations (occlusions and background changes) during testing to probe the model's robustness.
  • For "Wipe Board": The failures for ForceVLA fell into two categories:
  1. Grasping Failure: In some cases, the model failed to pick up the eraser. We attribute this to visual perception challenges, as the eraser was placed far from the base camera, resulting in a low-resolution image, and its color was similar to the black table, making precise localization difficult.
  2. Timeout due to Meticulousness: Interestingly, some failures were due to timeouts. The model would successfully erase ~90% of the writing but would then spend a prolonged period making repeated, fine-grained attempts to erase a tiny remaining smudge. While this is technically a failure by our timeout metric (5 minutes), this behavior suggests the policy is highly sensitive to the task goal, perhaps overly so.
  • On Different Fusion Strategies: This is an excellent question. For the failures we observed (e.g., grasping the low-resolution eraser, off-center pressing under visual distraction), we believe the root cause lies more in the limitations of the visual perception backbone rather than the FVLMoE fusion module itself. Our MoE is designed to dynamically weigh inputs, but it cannot overcome fundamental ambiguity or a lack of fine-grained detail in the upstream visual stream. We hypothesize that an alternative fusion strategy would likely face similar challenges if the input perception is noisy. However, we agree that exploring fusion mechanisms that can explicitly model visual uncertainty is a very promising direction for future work. We will add this discussion to the limitations section of our paper.

Once again, we sincerely thank you for your thoughtful and constructive review. Your feedback has been invaluable in helping us strengthen the paper, and we hope our responses and planned revisions have fully addressed your concerns. Thank you for your time and guidance.

Comment

I appreciate the authors’ efforts in addressing the comments raised in my original review. The rebuttal addressed them. Please ensure to incorporate these comments in the final manuscript.

Overall, the work is important to the community. Specifically,

  1. This paper explores a timely and important direction—integrating force signals into vision-language-action (VLA) models for contact-rich manipulation tasks.
  2. The authors benchmark their model against state-of-the-art approaches such as pi-0 and demonstrate superior performance in tasks like “Insert USB,” “Insert Plug,” and “Peel Cucumber.”
  3. The authors conduct ablation studies on various architectural designs of ForceVLA, with a particular focus on the integration of force feedback. This analysis is valuable for researchers interested in incorporating temporal signals into vision-language models.

Thus, I recommend accepting the paper.

Comment

We sincerely thank all reviewers (1t4m, hqeq, 4bXA, 7Vfu) for their valuable time and insightful feedback throughout the entire review and rebuttal process. Every comment has been crucial in improving the quality of our paper.

We are especially grateful to Reviewer 1t4m for increasing their score after the rebuttal. We understand and appreciate your remaining reservation regarding the success rates on some tasks, and this motivates us to continue exploring and improving the model's robustness in our future work.

We are very grateful to Reviewer hqeq for their strong support and final recommendation for acceptance. Your affirmation of the importance of our work's direction, experiments, and analysis is a great encouragement to our team.

We thank Reviewer 4bXA for acknowledging the clarity and thoughtfulness of our response, and we are glad that our explanations and clarifications addressed your questions.

We also thank Reviewer 7Vfu for their extremely thorough review and for ultimately recommending acceptance. Your in-depth feedback has helped us improve the rigor of our paper from multiple dimensions, including task design and methodology.

In summary, we once again thank all reviewers for their constructive comments. We promise that all clarifications, analyses, and additions agreed upon in the rebuttal will be diligently integrated into the final camera-ready version. We look forward to sharing our work with the broader academic community.

Final Decision

The paper proposes ForceVLA, a force-aware vision-language-action model for contact-rich manipulation using a Mixture-of-Experts module routed by estimated force levels. The approach supports both vision-only and vision+force settings, and is validated on peg-in-hole, insertion, and rotation tasks.

Reviewers are generally positive (one Accept, three Borderline Accepts), highlighting the motivation, real-robot validation, and modular design. Concerns on novelty and experimental depth are addressed in the rebuttal and final remarks with added ablations, failure analyses, and task justifications.

Overall, this is a timely and practical contribution to multimodal policy learning. The AC recommends acceptance, with the expectation that the camera-ready version further strengthens experimental analysis, related work, and presentation clarity.