PaperHub
Overall rating: 7.0/10 (Poster; 3 reviewers, scores 4, 4, 3; min 3, max 4, std. dev. 0.5)
ICML 2025

SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

Submitted: 2025-01-09 · Updated: 2025-07-24
TL;DR

Memory-based Multi View Robotics Transformer

Abstract

Keywords
Robot Learning · Behavior Cloning · Imitation Learning · Memory-based Robotics Transformer

Reviews and Discussion

Review (Rating: 4)

This paper addresses the problem of robotic manipulation with memory. To that end, the authors combine the RVT-2 manipulation framework with the SAM2 memory-enabled segmentation model. The authors only adjust the "coarse" module, which predicts the rough spatial region of interest for manipulation; the memory is only incorporated into the coarse module, not the "fine" grip-prediction module. The high-level architectural combination works as follows:

  1. RVT-2 takes a bunch of random camera views and synthesizes a point cloud, from which three standard orthogonal camera angles are re-rendered (XY / YZ / XZ planes). There is then a learned component which takes in these RGB images and a language instruction and outputs three action heatmaps from the same viewpoints. The spatial intersection of the extrusions is the region of interest that is then passed to the downstream "fine" module.

  2. The authors want to ultimately enable the integration of SAM2's memory / object tracking capability into this architecture. So they introduce a "SAM2Act" module which produces the action heatmaps. There is some fancy upsampling going on, but the key idea is that they pass the instruction, original RGB images, and the SAM2 embeddings of those images all into a multi-view transformer ("MVT"), which then outputs a latent vector that is upsampled into the heatmaps. This new architecture alone shows some decent performance improvements over RVT-2 without even considering the memory.

  3. Now memory is incorporated as follows, independently for each view. The basic idea is that past input image embeddings and output heatmap embeddings are stored pairwise in a FIFO queue. They take the heatmap image embedding output of the MVT and do a few layers of cross-attention with the memory, and then output a new heatmap image. This allows the module to condition on the observation history.
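
To make the memory mechanism in item 3 concrete, here is a minimal PyTorch sketch for a single view. It is an illustrative reading of the description above, not the authors' implementation: the class name, tensor shapes, queue length, and the residual update are all assumptions. Past (image embedding, heatmap embedding) pairs sit in a FIFO queue, and the current heatmap embedding cross-attends to them before being upsampled.

```python
import collections
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Cross-attends the current heatmap embedding to a FIFO memory of past
    (image embedding, heatmap embedding) pairs for one camera view.
    Shapes and hyperparameters are illustrative, not the paper's."""

    def __init__(self, dim=256, num_layers=2, num_heads=8, max_len=8):
        super().__init__()
        self.memory = collections.deque(maxlen=max_len)  # FIFO queue of past pairs
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, heatmap_emb, image_emb):
        # heatmap_emb, image_emb: (1, tokens, dim) for the current timestep.
        if self.memory:
            # Concatenate the stored pairs into one key/value token sequence.
            mem = torch.cat([torch.cat(pair, dim=1) for pair in self.memory], dim=1)
            x = heatmap_emb
            for attn in self.layers:
                out, _ = attn(query=x, key=mem, value=mem)
                x = x + out  # residual update conditioned on the memory
            heatmap_emb = x
        # Store the current pair for future timesteps (oldest entry is evicted).
        self.memory.append((image_emb.detach(), heatmap_emb.detach()))
        return heatmap_emb  # upsampled downstream into a refreshed heatmap
```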

The authors show that the SAM2Act module alone results in a ~5% success rate bump for existing benchmarks, and with memory enabled reduces the failure rate by ~80% on some memory-requiring benchmarks.

Questions for the Authors

  1. What are the cases where the SAM2Act+ still fails on MemoryBench? Is it due to failures in memory (opening the wrong drawer), or failure in execution (trying and failing to open the drawer)?
  2. I'm trying to understand the implications of the SAM2Act-fine module not having a memory component. Does that mean that a memory-enabled instruction like "move the doll by grabbing the limb closest to the robot, then grab the teddy bear by the same limb" would not be possible? I don't think this is a major issue if true but I want to check my understanding of the limitations here.
  3. Why make the views independent in the memory architecture? It seems that the memory attention module should have the capacity to use or ignore other views as appropriate.
  4. Can you speculate about why the robustness is so improved in Section 5.3? Is it the pre-trained knowledge from the SAM2 image encoder that helps distinguish semantic meaning despite visual perturbations?

Claims and Evidence

I think all the claims made are well-supported by the experiments. Namely:

  1. SAM2Act+ enables memory-dependent manipulation where baselines fail.
  2. SAM2Act is more robust across different environmental perturbations.
  3. SAM2Act outperforms existing policies on established baselines which don't involve memory.

Methods and Evaluation Criteria

The authors compare on both existing benchmarks and propose their own benchmark which relies on memory. The included baseline methods are appropriate (if somewhat limited) and the results are compelling.

Theoretical Claims

N/A

Experimental Design and Analysis

Overall, I found the experiments to be quite thorough, especially the ablations between SAM2Act and SAM2Act+. One additional ablation I would be curious about is the performance of incorporating memory without the SAM2 image encoder step. My intuition here is that these are actually orthogonal, and we could simply have the MVT take in an instruction + RGB images, output an embedding, perform cross-attention with old images & heatmaps (any embedding module could be used here), and then upsample the resulting final embedding. But I don't think this experiment is necessary for the paper to be valuable.

Supplementary Material

I did not review the supplementary material.

Relation to Prior Literature

Vision-Language-Action models are a promising research direction right now in robotics (Octo, π0, OpenVLA). These models have generally made Markovian assumptions; i.e., the optimal action is independent of past observations given the current observation. This is of course not realistic for real-world autonomy. If a robot turns around, it shouldn't forget what's behind it. To the best of my knowledge, this is the first serious attempt to incorporate memory into these VLA models and comprises an important step for the field.

Missing Essential References

N/A

Other Strengths and Weaknesses

Weaknesses:

  1. It's a bit hard to understand the paper as-is without referencing the RVT-2 paper. The figures especially could maybe use some more labels and clarity. For instance, the three action heatmaps in Figure 3 are not labelled and I actually only saw a solid blue square until I looked closely.
  2. The baselines are a little lacking (e.g. Octo and related follow-up work).

Other Comments or Suggestions

  1. I think some of the result summaries are a little editorialized. For example, line 436 states that SAM2Act+ outperforms SAM2Act by a "huge" margin of 37.6%. I find this to be stylistically a little off-putting and would prefer to let the (impressive) numbers speak for themselves. Or perhaps add a sentence emphasizing that the failure rate was "~81% lower than that of SAM2Act and ~83% lower than the RVT-2 baseline," or "less than one-fifth that of SAM2Act and the RVT-2 baseline." This is just my preference, and I think it's also fine if the authors want to keep the language as-is.
  2. Conversely, the abstract is dramatically understated: "SAM2Act+ achieves competitive performance on MemoryBench." I'd suggest giving some concrete numbers here to spark reader interest.
  3. There's a point of confusion for me about how the higher-resolution embeddings from the SAM2 encoder are passed to the multi-resolution upsampling method in Stage 2. Namely, are the original embeddings of the RGB image passed through, ignoring the memory module? Or is the memory somehow involved here? Specifically, in Figure 3 I'm expecting an arrow from the SAM2 Image Encoder directly to the Multi-Resolution Upsampling block containing the higher-resolution embeddings, bypassing the memory attention and the MVT. I guess these were just left out for simplicity of the diagram?
  4. I'd suggest using bold font a little more judiciously in Section 5.
Author Response

We sincerely thank Reviewer auqZ for recognizing the contributions of our work. We are pleased the reviewer found novelty in our method, noting its ~5% improvement in success rate and state-of-the-art performance on benchmarks, as well as our memory components reducing failures by ~80% on memory-intensive tasks. We also appreciate that the reviewer acknowledged our comprehensive experiments and thorough evaluation.

Despite this, the reviewer raised several concerns and suggestions, which we address below:

The baselines are a little lacking.

We appreciate the reviewer's feedback on the baselines. Our evaluation spans three benchmarks: RLBench (18 tasks), The Colosseum, and MemoryBench. We achieve state-of-the-art results on RLBench and The Colosseum. For MemoryBench, we tested OpenVLA, which obtained a 0% success rate across all tasks (All baselines on our anonymous website). OpenVLA demonstrated basic interactions (e.g., approaching objects) but failed key actions like closing drawers or grasping, due to visual ambiguities—identical scenes requiring different actions (e.g., opening vs. closing). Conventional VLAs struggle as they rely solely on visual and language cues without temporal context. Our method integrates timestep information into proprioceptive inputs, resolving these ambiguities effectively. Consequently, even RVT-2 and SAM2Act (without explicit memory) succeed in such tasks, though occasionally erring on subtasks requiring actual memory. Extensive baseline comparisons and ablations confirm our method’s robustness on MemoryBench, highlighting areas for VLA improvement. Full results tables will appear in the final manuscript.
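
To illustrate the timestep-conditioning mentioned above, the following minimal sketch appends a normalized step index to the proprioceptive vector before it enters the policy. The function name and the normalization scheme are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def add_timestep_to_proprio(proprio: torch.Tensor, step: int, max_steps: int) -> torch.Tensor:
    """Append a normalized timestep feature to the proprioceptive input.

    proprio: (batch, d_proprio) gripper pose / joint state features.
    Returns: (batch, d_proprio + 1).
    """
    t = torch.full((proprio.shape[0], 1), step / max_steps,
                   dtype=proprio.dtype, device=proprio.device)
    return torch.cat([proprio, t], dim=-1)

# Example: two visually identical observations (e.g., before opening vs. before
# closing a drawer) now differ in their policy input because of the timestep feature.
obs = torch.randn(1, 4)
early = add_timestep_to_proprio(obs, step=2, max_steps=25)
late = add_timestep_to_proprio(obs, step=20, max_steps=25)
```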

How are the higher-resolution embeddings from the SAM2 encoder passed to the multi-resolution upsampling method in Stage 2?

We apologize for the confusion and clarify briefly: multi-resolution embeddings from the SAM2 encoder are integrated into Stage 2 in the same way as in Stage 1 (see the simplified Figure 4). The lower-resolution (res 16) embeddings pass through the MVT and Memory Attention (conditioned on Memory Bank embeddings) before multi-resolution upsampling. The higher-resolution embeddings skip the MVT and Memory Attention, entering the upsampling directly. The resulting action heatmap is encoded into memory, so all resolutions are indirectly involved in memory processing.
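
The routing described above could look roughly like the sketch below. This is an illustrative reading of the clarification (the function name, resolution keys, and module interfaces are assumptions): the res-16 embedding goes through the MVT and memory attention, while the higher-resolution embeddings feed the upsampler directly.

```python
def coarse_forward(embeds, mvt, memory_attention, upsampler, instruction):
    """Illustrative routing of SAM2 multi-resolution embeddings in Stage 2.

    embeds: dict of SAM2 encoder outputs, e.g. {16: low_res, 32: mid, 64: high}.
    Only the lowest-resolution embedding is processed by the MVT and the
    memory attention; the higher-resolution embeddings feed the upsampler
    directly and bypass the memory path.
    """
    low = mvt(embeds[16], instruction)          # multi-view transformer output
    low = memory_attention(low, embeds[16])     # conditioned on the memory bank
    heatmap = upsampler(low, skip_feats=[embeds[32], embeds[64]])
    return heatmap                              # later encoded back into memory
```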

What are the cases where SAM2Act+ still fails on MemoryBench?

Our analysis of the reopen_drawer task indicates that failures primarily stem from memory issues rather than execution errors. Specifically, SAM2Act+ often reopens the wrong drawer, showing reliable command execution but difficulty recalling the correct drawer. This aligns with our benchmark's focus on isolating memory-related errors. We found that shadows in the simulation occasionally caused ambiguous visual cues, complicating memory recall. To test this, we regenerated the dataset without shadows and, after retraining, SAM2Act+ achieved 100% success. This confirms that visual ambiguity from shadows was the primary cause of memory failures and demonstrates our approach effectively resolves these memory challenges.

Can you speculate about why the robustness is so improved in Section 5.3? Is it the pre-trained knowledge from the SAM2 image encoder that helps distinguish semantic meaning despite visual perturbations?

Similar response to Question 2 from Reviewer 24ak; please kindly refer to that for more details.

For SAM2Act+, why does the fine branch not have the memory component?

Our design separates coarse and fine branches based on information density. The coarse branch uses scene-level point clouds to generate virtual images and employs a memory module for spatially consistent heatmaps, critical for memory-dependent tasks. The fine branch processes localized, rapidly changing views focused on immediate interactions, where incorporating memory would disrupt spatial consistency.

Suggestion to refine our figures, clarify our writing, and highlight key results and impact

We thank the reviewer for their detailed suggestions and fully agree that their proposed improvements to the figures and manuscript will strengthen our paper. Due to ICML’s no-revision policy during rebuttal, we will incorporate these enhancements in future versions.

Why make the views independent in the memory architecture?

We treat views independently in the memory architecture for two reasons. First, the MVT effectively integrates multi-view cues by processing each view both independently and jointly through its attention layers. Second, consistent with SAM2's design, avoiding cross-view dependencies prevents memory complications due to significant camera angle changes. This approach simplifies memory integration and enhances robustness and reliability.
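
As a minimal sketch of this per-view separation (the class name, view names, and queue length are assumptions), each re-rendered view keeps its own FIFO memory bank, so memory attention for one view never sees another view's history:

```python
import collections

class PerViewMemory:
    """One independent FIFO memory bank per virtual camera view."""

    def __init__(self, views=("top", "front", "side"), max_len=8):
        self.banks = {v: collections.deque(maxlen=max_len) for v in views}

    def update(self, view, image_emb, heatmap_emb):
        self.banks[view].append((image_emb, heatmap_emb))

    def read(self, view):
        # Memory attention for `view` only ever sees this view's history,
        # so large viewpoint differences never mix into its key/value set.
        return list(self.banks[view])
```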

We hope our responses have adequately addressed all of the reviewer's concerns. If so, we would greatly appreciate your consideration in raising your rating.

Review (Rating: 4)

This submission introduces SAM2Act, a transformer-based coarse-to-fine behavior cloning policy for language-conditioned robot manipulation. SAM2Act exploits the multi-resolution features of the SAM2 vision foundation model through a novel up-sampling scheme, which enables high-precision manipulation behavior. Additionally, SAM2Act+ employs a spatial memory module, enabling it to solve more complex memory-based (non-Markovian) tasks. To evaluate the latter, a novel robot manipulation benchmark (an extension to RLBench) is proposed, featuring 3 carefully designed tasks that specifically evaluate the model's spatial memory capabilities.

Questions for the Authors

  1. How does the proposed architecture differ from the architectures that are used to build world models (vision model + memory component), and how could the proposed architecture (given its similarity) be adapted as a backbone architecture for training world models? It may be beneficial to add this discussion to the manuscript.
  2. Is SAM2 the best choice of vision foundation model to use as a backbone for the integration of the proposed robot manipulation components (and why)? How do other vision foundation models compare to SAM2 as visual encoders for the task at hand?
  3. Is there any evidence or insights that the proposed benchmark tasks of reopen_drawer and put_block_back evaluate different aspects of spatial memory?

Claims and Evidence

In my opinion, the claims made in this submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

In my opinion, the proposed methods and evaluation criteria are meaningful for the task at hand.

Theoretical Claims

Not applicable.

Experimental Design and Analysis

In my opinion, the experimental design and analysis is convincing.

Supplementary Material

I have fully read the supplementary material and considered it in my review, as if it was part of the main manuscript.

Relation to Prior Literature

This submission integrates an existing vision foundation model (SAM2) with novel components (multi-view policy and memory architecture) to enhance it with language-guided robot manipulation capabilities. The proposed model formulation is novel and pushes the limits of SOTA in high-precision and memory-dependent manipulation tasks. Furthermore, the proposed spatial memory-based benchmark tasks can be used by the community to provide a more targeted evaluation of robot manipulation policies on memory-aware tasks.

Missing Essential References

In my opinion, the related works that are essential to understand the key contributions of the paper have been cited. A more thorough discussion of how the proposed methodology is positioned relative to approaches from related manipulation literature may enhance the reader's understanding (please see the questions for more specific suggestions).

Other Strengths and Weaknesses

Strengths:

  • The manuscript studies a very interesting and timely problem.
  • The proposed solution is novel and insightful and successfully takes advantage of the latest advances in vision foundation models (offering tremendous generalisation capabilities), integrating them with novel elements that enhance their capabilities on high-precision and memory-aware robot manipulation.
  • The proposed benchmark (extension to RLBench) can also facilitate the targeted evaluation of manipulation policies specifically on memory-aware tasks.

Comments (related to questions below):

  • World models (e.g. the latest DINO-WM), although serving a different purpose, also typically feature a combination of a visual encoder, with a memory component and policy (although many works tend to adhere to the Markovian assumption). It is unclear how the proposed approach relates to such approaches, or if the proposed model architecture can potentially be adopted for world modelling.
  • The choice to build on top of SAM2, although proven effective due to the multi-resolution features provided and ablated against previous versions of SAM, could have been further ablated by the use of different visual encoders (e.g. DINOv2 or DepthAnything features) to indicate the generality of the proposed approach, and offer more insights on which Vision Foundation Models are most suitable for robot manipulation.
  • It is unclear whether the reopen_drawer and put_block_back tasks of the proposed benchmark effectively evaluate different aspects of spatial memory (namely 3D vs 2D spatial information) as claimed. This is because the z-axis information of which drawer to reopen can potentially be encoded in the 2D pixel space, unless the position of the chest of drawers as a whole is randomized between the two stages of the experiment.

Other Comments or Suggestions

Not Applicable

Author Response

We sincerely appreciate Reviewer 24ak for recognizing the novelty of our approach, specifically in leveraging multi-resolution features from SAM2, and our innovative up-sampling scheme designed to facilitate high-precision manipulation policy learning. Additionally, we are grateful that the reviewer acknowledged our effective use of SAM2's memory mechanism to address complex, memory-based tasks, as well as our carefully designed MemoryBench, explicitly created for evaluating spatial memory capabilities. We are pleased that the reviewer found our claims well-supported by clear and convincing evidence (as agreed by Reviewers yA68 and auqZ). Moreover, we thank the reviewer for emphasizing the timeliness and relevance of our research within robotic manipulation (similar to Reviewers yA68 and auqZ), and for highlighting the novel insights our study provides towards training generalist robotics agents (agreed by Reviewer auqZ).

Despite this, the reviewer raised several concerns and suggestions, which we address below:

How does the proposed architecture differ from world models? Can SAM2Act+ be adapted to serve as a backbone for training world models?

World models (e.g., DINO-WM, TD-MPC) differ significantly from SAM2Act in both objectives and training paradigms. While world models are trained to predict future observations and latent dynamics using various objective functions, SAM2Act is trained via behavior cloning. Although both incorporate memory and visual features, SAM2Act targets more realistic scenarios that violate the Markov assumption, unlike the world models mentioned by the reviewer. Moreover, world models typically decouple visual encoding and memory during training, whereas SAM2Act+ tightly integrates visual embeddings with spatial memory in an end-to-end fashion. While SAM2Act+ could potentially serve as a backbone for world modeling—by replacing the behavior cloning head with dynamics prediction and incorporating self-supervised objectives—that direction is beyond the scope of this work. Due to ICML's no-revision policy during rebuttal, we will discuss this in a future version.
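
To make the contrast in training objectives concrete, here is a schematic comparison under our own simplifying assumptions (the loss choices and tensor shapes are illustrative and not taken from either line of work):

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(pred_heatmap, expert_heatmap):
    # Behavior-cloning-style objective: match the expert's action target
    # (here rendered as a per-view translation heatmap) for the current observation.
    return F.cross_entropy(pred_heatmap.flatten(1),
                           expert_heatmap.flatten(1).argmax(dim=1))

def world_model_loss(pred_next_latent, next_latent):
    # World-model-style objective: predict the next latent state given the
    # current latent and action; no expert action supervision is required.
    return F.mse_loss(pred_next_latent, next_latent)
```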

Is SAM2 the best choice of vision foundation model to use as a backbone for robotic manipulation, compared to other visual encoders?

Our choice of SAM2 as the vision backbone is strongly justified by both prior work and our own ablation studies. In the SAM-E paper, a thorough evaluation compared various visual encoders—including CLIP, DINO, and the robotics-specific R3M—and found that the SAM image encoder outperformed the others when paired with an RVT-based action sequence prediction module. Since SAM2 builds on SAM by introducing multi-resolution features, it is expected—and indeed confirmed—to deliver superior performance, which is also supported by in-depth studies in the original SAM2 paper and SAM2-Adapter. To further validate this, we replaced SAM2 with the alternative encoders suggested by the reviewer. Our experiments show that when using the original upsampling (a fair comparison, as only SAM2 provides multi-resolution embeddings), DINOv2 achieves 82.2 ± 0.5 and DepthAnythingV2 achieves 81.1 ± 1.2 on the RLBench 18 tasks. Even the ablated version of SAM2Act, which uses the SAM2 encoder without multi-resolution output, already outperforms both alternatives with a success rate of 84.2 ± 0.9. Moreover, when leveraging SAM2's multi-resolution outputs, we anticipate an even larger performance gap in its favor. Thus, the evidence strongly supports SAM2 as the best choice for our application, as its multi-resolution capabilities are uniquely advantageous for the robot manipulation tasks addressed in our work.

Do the reopen_drawer and put_block_back tasks effectively evaluate different aspects of spatial memory (namely, 3D versus 2D spatial information) as claimed?

We realize our description of spatial memory evaluation may have caused confusion. Our intention with the two tasks is not to distinguish 3D from 2D spatial memory, but rather to isolate distinct components of spatial memory. Specifically:

  • Reopen_Drawer: This task is designed to assess whether the agent can retain and recall information along the vertical (z-axis) dimension. It tests the agent’s ability to remember which specific drawer was interacted with previously.
  • Put_Block_Back: This task evaluates the agent’s memory of the horizontal layout (x and y axes), ensuring it can accurately reposition objects within the overall spatial configuration.

Both tasks involve processing 2D pixel input but focus on different spatial dimensions within a 3D environment. By decoupling these aspects, our benchmark evaluates how effectively an agent integrates spatial cues across different axes. We hope this clarifies the confusion regarding the spatial memory tasks.

We hope our responses have addressed all of the reviewer's concerns. If so, we would appreciate your consideration in raising your rating.

Reviewer Comment

I appreciate the authors' responses which clarify my raised comments. I believe that the proposed methodology makes a notable contribution to the community and maintain my Accept score.

Author Comment

We sincerely thank the reviewer for acknowledging and recognizing the notable contribution through this paper.

Review (Rating: 3)

The paper introduces SAM2Act, a multi-view, language-conditioned behavior cloning policy for 6-DoF 3D robotic manipulation, which integrates a visual foundation model (SAM2) with a memory architecture to enhance feature representation and task-specific reasoning. SAM2Act leverages multi-resolution upsampling and visual embeddings from SAM2 to achieve high-precision manipulations and robust generalization across environmental and object-level variations. Additionally, the authors propose SAM2Act+, an extension of SAM2Act that incorporates memory-based components (Memory Bank, Memory Encoder, and Memory Attention) to enable episodic recall and solve spatial memory-dependent tasks. The paper also introduces MemoryBench, a benchmark for assessing spatial memory in behavior cloning models. Empirical results demonstrate that SAM2Act and SAM2Act+ achieve state-of-the-art performance across multiple benchmarks, highlighting their effectiveness in complex manipulation tasks and their ability to generalize to unseen perturbations.

Post Rebuttal:

Thanks for the additional results provided in the rebuttal; most of my concerns are addressed. Overall, the technical novelty of this paper is somewhat limited. However, its unique contributions (such as MemoryBench and the analysis of memory-related tasks) could be of interest to part of the community. I've decided to raise my score to Weak Accept.

Questions for the Authors

  1. As shown in Table 2 and Table 7, replacing SAM2 with SAM results in a significant performance decline. The generalizability evaluated in Colosseum and the average success rate in RLBench for the SAM-based variant are even lower than those of the RVT-2 baseline. Does this outcome suggest that the performance of the proposed method is largely attributable to the SAM2 encoder? Further clarification would help readers better understand the technical contribution and novelty of the paper.
  2. Could the authors clarify the distinction between long-horizon tasks in general and the tasks in MemoryBench, as well as the unique challenges posed by tasks that necessitate the use of a memory module? Additionally, would the memory module also bring improvements to long-horizon tasks more broadly, such as those in the LIBERO-Long benchmark?

Claims and Evidence

All the claims made in the paper are supported by corresponding evidence.

Methods and Evaluation Criteria

The proposed pipeline, including the utilization of SAM2 for improved generalizability and the proposed memory module for tackling memory-specific tasks, is intuitive and makes sense.

The evaluation benchmarks are comprehensive, and supported by further real-world evaluations. The new benchmark proposed, MemoryBench, also addresses important research questions.

Theoretical Claims

No theoretical claims are made in the paper.

Experimental Design and Analysis

It would be better for the authors to highlight the difference between long-horizon manipulation tasks and tasks that require the trained policy to have both semantic and spatial memory, as introduced in MemoryBench. A more detailed discussion would be helpful in highlighting the contributions of this paper.

Supplementary Material

I've watched all videos provided in the supplementary material.

Relation to Prior Literature

The utilization of SAM2 features for various downstream tasks has been explored in prior literature [1].

[1] Chen, Tianrun, et al. "Sam2-adapter: Evaluating & adapting segment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more." arXiv preprint arXiv:2408.04579 (2024).

Missing Essential References

Many designs proposed in the paper are largely inherited from RVT-2 [1], except for the memory module. I believe it would be better for the authors to highlight the unique designs that distinguish this paper from RVT-2. Otherwise, the overall technical novelty seems to be limited.

[1] Goyal, Ankit, et al. "RVT-2: Learning precise manipulation from few demonstrations." RSS 2024.

Other Strengths and Weaknesses

Strength: The proposed MemoryBench addresses an important aspect of robot policies: their spatial memory capabilities. The evaluation suite is also thoroughly discussed in the paper.

Weaknesses: The originality of the model architecture design is constrained, given that the SAM2Act module and the coarse-to-fine pipeline are largely borrowed from RVT-2, and the memory module in SAM2Act+ is similar to the tracking module in SAM2.

Other Comments or Suggestions

None

Author Response

We sincerely thank Reviewer yA68 for the detailed and insightful feedback. We appreciate the recognition of our innovative use of SAM2 for generalization, the intuitive memory module design, and our comprehensive benchmarking. We're especially grateful for the positive remarks on our real-world evaluations, also echoed by Reviewers 24ak and auqZ. We also value the reviewer's recognition of MemoryBench as a meaningful contribution to addressing critical research questions, as well as the acknowledgment that our claims are well-supported by the evidence—an assessment shared across reviewers.

Despite this, the reviewer raised several concerns and suggestions, which we address below:

The model architecture appears to have limited originality, as the SAM2Act module and coarse-to-fine pipeline closely follow RVT-2, and the memory module in SAM2Act+ strongly resembles SAM2’s tracking module.

We acknowledge that our work naturally builds upon prior research, embracing the principle of 'standing on the shoulders of giants,' much like RVT-2 evolved from RVT through targeted architectural changes, or SAM-E integrated elements from RVT with SAM's image encoder and action-chunking module. Our approach similarly leverages existing architectures but introduces critical, innovative adaptations that significantly enhance performance. In particular, our novel contribution is in effectively addressing the challenging problem of memory, representing one of the first systematic attempts in this area. Our key adaptations that make SAM2Act and SAM2Act+ effective are: (1) Multi-Resolution Upsampling, which leverages SAM2’s multi-scale image embeddings to boost RLBench performance and generalization (Tables 2 & 7), and (2) Memory Task Adaptation, where we extend SAM2’s tracking module to multi-view, multi-step settings by treating action heatmaps as object masks and integrating MVT embeddings into memory. This extension, which required careful design and extensive experimentation, is novel and essential—naively using the base models without these changes fails, as shown in our ablations, Table 7.
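
As a rough sketch of what such multi-resolution upsampling can look like (channel counts, resolutions, and the fusion strategy are assumptions, not the paper's exact design), coarse features are progressively upsampled and fused with SAM2's higher-resolution embeddings before producing the final per-view heatmap:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResUpsampler(nn.Module):
    """Fuse coarse features with higher-resolution SAM2 embeddings
    to produce a translation heatmap. Channel sizes are illustrative."""

    def __init__(self, c_low=256, c_mid=64, c_high=32):
        super().__init__()
        self.fuse_mid = nn.Conv2d(c_low + c_mid, c_mid, kernel_size=3, padding=1)
        self.fuse_high = nn.Conv2d(c_mid + c_high, c_high, kernel_size=3, padding=1)
        self.head = nn.Conv2d(c_high, 1, kernel_size=1)

    def forward(self, low, mid, high):
        # low: (B, 256, 16, 16), mid: (B, 64, 32, 32), high: (B, 32, 64, 64)
        x = F.interpolate(low, size=mid.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.fuse_mid(torch.cat([x, mid], dim=1)))
        x = F.interpolate(x, size=high.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.fuse_high(torch.cat([x, high], dim=1)))
        return self.head(x)  # (B, 1, 64, 64) per-view action heatmap logits
```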

Given the performance drop when replacing SAM2 with SAM—falling below even the RVT-2 baseline on Colosseum and RLBench (Tables 2 & 7)—to what extent does the method's effectiveness rely on SAM2 rather than the proposed contributions?

We appreciate the reviewer’s observation and agree that the SAM2 encoder contributes significantly to performance. The drop observed when replacing SAM2 with SAM aligns with findings from the SAM2 and SAM2-Adapter papers (to be cited in future version), which show that SAM2’s lightweight Hiera encoder yields stronger embeddings for segmentation and downstream tasks. However, our ablations also demonstrate that the strong performance of SAM2Act arises from the combination of SAM2 with the other novel contributions from our proposed method and not solely from the encoder. Our primary contributions include innovations like multi-resolution upsampling, which adeptly leverages multi-resolution embeddings. As shown in Table 2, the improved generalization on Colosseum is primarily driven by multi-resolution upsampling. Without multi-resolution embeddings, performance matches the SAM-based variant, highlighting our architectural contributions as the main driver of generalization. Moreover, when comparing the SAM-based variant to RVT-2, their overlapping performance intervals (80.8 ± 1.9 vs. 81.4 ± 3.1) indicate no statistically significant differences. In summary, our approach combines several novel contributions that together enhance both performance and practicality.

Could the authors clarify the distinction between general long-horizon tasks and the specific tasks in MemoryBench, particularly highlighting the unique challenges that make a dedicated memory module necessary?

Robotic manipulation tasks typically follow the Markov assumption, where the optimal action depends solely on the current observation. Even in long-horizon tasks, key information is often directly observable and may not require memory. In contrast, MemoryBench is explicitly designed to violate this assumption—tasks are ambiguous, and visually identical states may require different actions based on prior interactions. We appreciate the reviewer's suggestion of LIBERO; however, it emphasizes action-based models, which differ from the keyframe-based approach in SAM2Act. To investigate the relation between long-horizon and memory-based tasks, we curated four cube-stacking tasks with increasing complexity and keyframe length. We observed consistent performance degradation for both SAM2Act and SAM2Act+ as horizon length increased, suggesting that memory-based challenges extend beyond task length alone (results on our anonymous website).

We sincerely hope our responses have adequately addressed all of the reviewer's concerns. If so, we would greatly appreciate your consideration in raising your rating.

Final Decision

The paper proposes SAM2Act, a multi-view transformer-based policy that leverages better visual representations from a large-scale foundation model. The paper is well motivated, with comprehensive experiments. Although there were some concerns from reviewers, the authors did a good job addressing most of them during the rebuttal. Please incorporate all the comments (clarification, novelty, baselines, failure case analysis, etc.) in the revised version.