DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
DynaMo is a new self-supervised method for pretraining visual encoders for downstream visuomotor control by explicitly modeling dynamics in the demonstration observations.
Abstract
Reviews and Discussion
This paper presents a way of pre-training a vision encoder for robot control. Specifically, instead of using vanilla contrastive or masked-autoencoder approaches, the method trains two models: 1) an inverse dynamics model that estimates a transition latent (the action), and 2) a forward dynamics model that takes in the current encoded visual latent and the transition latent and predicts the next latent observation. The results suggest that the method improves upon existing visual pre-training methods for robotics.
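To make the two-model setup concrete, here is a minimal sketch of the objective as described above. The module shapes, MLP architectures, optimizer, and stop-gradient target are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the described pretraining loop; modules and dimensions
# are illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # visual encoder
inv = nn.Linear(2 * 256, 32)    # inverse dynamics: (z_t, z_t1) -> latent action a_t
fwd = nn.Linear(256 + 32, 256)  # forward dynamics: (z_t, a_t) -> predicted z_t1

params = [*enc.parameters(), *inv.parameters(), *fwd.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)

def dynamics_loss(obs_t, obs_t1):
    z_t, z_t1 = enc(obs_t), enc(obs_t1)
    a_t = inv(torch.cat([z_t, z_t1], dim=-1))       # latent "action", no action labels used
    z_t1_hat = fwd(torch.cat([z_t, a_t], dim=-1))   # one-step prediction in latent space
    # Stop-gradient on the target is one common guard against collapse (our assumption).
    return ((z_t1_hat - z_t1.detach()) ** 2).mean()

obs_t, obs_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
opt.zero_grad()
loss = dynamics_loss(obs_t, obs_t1)
loss.backward()
opt.step()
```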
Strengths
- Few past works on visual pretraining for robotics consider time or actions; most focus only on the visual observation aspect. This work presents a method that attempts to improve visual pretraining by modeling the dynamics present in the dataset.
- The results suggest that the method improves upon existing visual pre-training baselines.
Weaknesses
- Prior visual pre-training for robotics operates under the premise that we have a large image or video dataset: we pre-train on that dataset and then fine-tune for a particular task. However, this method performs pre-training on the task-specific dataset, which is better aligned with the downstream task. The objective of visual pretraining (exemplified by MAE, MoCo, etc.) is that training on a mass amount of data enables fine-tuning to a specific task (e.g., ImageNet pre-training followed by COCO segmentation fine-tuning), not pre-training and fine-tuning on the same dataset to solve the same task.
- A few prior works [1,2] have tried to model forward and inverse dynamics concurrently. [1] also uses forward and inverse dynamics to train a visual encoder. The key difference from these works is that here the action is modeled as a latent variable. Why ground-truth action values are not used in pre-training (especially when pre-training and fine-tuning happen on the same task) is not justified in the manuscript. It would be quite convincing if pre-training were done on natural videos, or on large-scale robot datasets where action spaces cannot be standardized, and then shown to improve fine-tuning performance.
[1] Agrawal, Pulkit, Ashvin V. Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." Advances in neural information processing systems 29 (2016).
[2] Fragkiadaki, Katerina, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. "Learning visual predictive models of physics for playing billiards." arXiv preprint arXiv:1511.07404 (2015).
Questions
The reviewer wants to ask for two sets of experiments to address weakness 1:
- How do masked pre-training methods compare to DynaMo when they are trained on the same data? I.e. train two networks analogously with the method provided in VC-1 and MVP on the task-specific dataset, and evaluate their task performance. This experiment would demonstrate that even for specific tasks, pre-training with DynaMo outperforms existing visual pre-training methods on task-specific datasets.
- How does DynaMo generalize to unseen tasks (in the sense that it can generalize to tasks outside of the dynamics seen in training)? I.e. pre-train DynaMo on (1) Put Yogurt (2) Get yogurt (3) Get Tea and evaluate on (1) Put ketchup (2) Get Water.
Limitations
A limitations section is present.
Thank you for your thorough review and constructive feedback. We are glad that you found our approach to visual pretraining novel. We will address each of your concerns below.
“Instead of having pre-training and fine-tuning using the same dataset… train on a mass amount of data… finetune to a specific task”: We discuss the motivation for in-domain vs. large-scale SSL pretraining in detail in the global comment. In particular, we show that DynaMo is compatible as an in-domain SSL fine-tuning step for weights pretrained on large datasets like ImageNet (paper Table 5). To clarify, our work specifically tackles the problem of learning efficiently from small-scale decision-making data, often with only hundreds of demonstration trajectories. In this low-data regime, DynaMo improves downstream performance and outperforms pretrained weights and other SSL methods, as shown in Table 1 in the paper. In principle, DynaMo can be trained on much larger datasets, but unfortunately we have academic compute constraints.
"Why ground truth action values are not used in pretraining… if pre-training is done on natural videos, or large-scale robot datasets": This is indeed a very exciting direction. In fact, the motivation for modeling the action as a latent is to make DynaMo applicable to a wide range of datasets including natural videos, and show that simply modeling dynamics on videos, without any augmentations or actions, is a feasible visual pretraining objective for visuomotor control. We do not have the compute to run on massive datasets in an academic setting, but we hope that this opens up a direction for industry labs to explore how dynamics pretraining can scale to Internet-scale datasets. We will make this clearer in our next revision.
"How do masked pre-training methods compare to DynaMo… on the same data": Thank you for pointing out the missing baseline. We have added MAE as an in-domain SSL baseline, discussed in global comment (1). In summary, MAE underperforms DynaMo by an average of 33% across sim environments, and completely fails to solve Block Pushing.
“How does DynaMo generalize to unseen tasks”: We have added a new kitchen task (picking up bread) to test encoder generalization, detailed in global comment (4). In summary, we use the old DynaMo encoder pretrained on existing kitchen tasks to train a new policy on the unseen task, and find that the policy still manages to complete the task, although pretraining a fresh encoder can improve performance in this low-data regime. We hypothesize that encoder generalization could improve if pretrained on significantly larger datasets, which is an exciting direction that we hope industry labs could explore.
We hope this addresses your concerns and questions. We would be keen to discuss any remaining points that stand between us and a higher score.
Thank you for the response. With the added experiments, I decided to raise the score.
This paper presents a self-supervised model, DynaMo, for pretraining visual encoders adopted for visuo-motor control. The targeted downstream task is imitation learning for robotic manipulation. Instead of using an out-of-domain dataset for pretraining and then transferring to a new domain using alternative techniques, the authors propose exploiting sequences of observations from in-domain demonstrations to pretrain three different models: a visual encoder, a forward model, and an inverse model. Once this is done, a policy can be learned with observations encoded using the pre-trained visual encoder.
Strengths
The most important benefit of DynaMo is that a visual encoder can be trained with limited risk of suppressing data dimensions necessary for visuomotor control, an otherwise frequently occurring problem.
Even if similar models that combine training of forward and inverse models have existed in the literature before, the action representation is assumed unobserved in the proposed model, which has rarely been the case. The literature on imitation learning from observed state sequences is vast, with little of it cited in the paper. However, the way this is done for pretraining in the proposed model is innovative and easily applicable to a practical scenario.
The experiments are rather exhaustive with five different settings and embodiments tested, two of which are real-world scenarios. In experiments that compare to alternative self-supervised methods and pretrained representations, the proposed visual embeddings are shown to be very competitive. It is also shown that DynaMo can be used to finetune an encoder pre-trained on ImageNet for even better results while being relatively insensitive to the choice of policy class.
Weaknesses
The paper is written as if there were no research in the area before the deep learning boom. Only one citation out of 70 citations is older than 10 years. The paper suggests that training exclusively on in-domain data is new, even if this used to be the way it was typically done before the arrival of data-hungry deep-learning-based models, models that forced people to a greater extent to rely on offline training on out-of-domain data with data augmentation, contrastive learning, etc.
The idea to train pairs of inverse and forward models online has existed in psychology and robotics for at least 25 years, such as in the works of Wolpert et al. [1]. Using similar models, imitation learning has been a common theme over the years, with [2] being just an example. Without this connection back to earlier research, this paper gives the impression of trying to reinvent the wheel, and it becomes unclear what the contributions really are.
Even if the experiments suggest that DynaMo can be beneficial also in real-world settings, the presented experiments are too few to be conclusive. The real world is way more diverse with more than just a small selected set of objects that can be manipulated. However, this weakness is pointed out in the conclusions, which makes it less problematic.
[1] Wolpert and Kawato, “Multiple paired forward and inverse models for motor control”, Neural Networks, 11, 1998.
[2] Demiris and Hayes, “Imitation as a dual-route process featuring predictive and learning components: a biologically plausible computational model”, in Imitation in Animals and Artifacts, MIT Press, 2002.
Questions
- Are the visual encoder, inverse and forward models only trained on the demonstrations from the respective datasets? Even if this is only for pretraining, the demonstrations, at least the real-world ones, are very few compared to the complexity of the tasks learned. Why not exploit all possible sequences available on the same embodiment, even for tasks that will eventually not be of interest?
- How restrictive is the assumption that forward models are unimodal? Has this become a weakness during the experiments?
- Since both the inverse and forward models seem to be ignored after pretraining, what is the motivation for a separation between the two? Why not train a network to predict the next encoded observation from earlier ones, essentially with the inverse and forward models merged into one? Why is the latent action representation needed at all?
Limitations
Yes, some limitations related to the real-world experiments and the unimodality of the models are brought up in the conclusions.
Thank you for your insightful review and for suggesting these papers. We are glad that you found our action-free assumption innovative. After reading these papers, it is clear that our work, and indeed many others in representation learning and imitation learning, have been inspired by these seminal works from before the deep learning boom. We apologize for the omission and will include them and further related works in our manuscript. The idea of dynamics modeling and predictive coding is well-established in earlier literature. For imitation learning in robotics, early works such as [1][2][3][4], among many others, pioneered the currently prevalent setup of learning from human demonstration data for robotic manipulation. We will add an additional subsection in the Introduction and Related Works sections to highlight this inspiration.
"It becomes unclear what the contributions really are": Our paper focuses on training a good visual encoder for visual imitation learning. A major problem is that in-domain demonstration data is expensive to collect, and underutilized by common training approaches: the visual encoder is usually either trained end-to-end jointly with the policy from scratch, or pretrained on massive out-of-domain datasets that may not generalize to the task at hand. Our contribution is twofold. One, we show that simply modeling dynamics on videos, without any augmentations or actions, is a feasible visual pretraining objective. We empirically show that it improves downstream imitation learning performance on simulated and real robot environments, outperforming prior methods. Two, we show that our method is compatible as an in-domain fine-tuning step. Starting with a strong visual encoder trained on large out-of-domain datasets, our method can further improve its performance by fine-tuning on a small task-specific dataset.
"Experiments are too few to be conclusive… the real world is way more diverse": Thank you for raising this important point. We hope that our additional experiments showcase the applicability of our method in various settings. We fully acknowledge that lab settings are rather limited compared to real-world environments. As a community, we need to conduct more experiments outside of labs. Benchmarking the robustness of these encoders in more diverse and realistic environments would be an important direction for future work.
"Why not exploit all possible sequences available on the same embodiment": Please see global comment (4) for a detailed discussion, as well as an additional kitchen experiment exploring encoder generalization to an unseen task. To summarize, DynaMo can be trained on much larger datasets, but unfortunately we have limited compute. Nevertheless, we show that an encoder trained with DynaMo on existing tasks can be used to train a policy on a completely new task, although at this small scale, directly pretraining on the new task improves performance. We hypothesize that generalization would improve if we trained on much larger datasets. We also note that DynaMo can be used to fine-tune other models pretrained on Internet-scale datasets like ImageNet for improved performance.
“How restrictive is the assumption that forward models are unimodal? Has this become a weakness during the experiments?”: This is equivalent to assuming deterministic environment transitions. Given the previous state and an action, we assume that there is a well-determined next state. We believe this is a reasonable assumption for most real-world manipulation tasks like opening a door, putting away toys, etc. This assumption can fail when the environment is stochastic or otherwise hard to predict (e.g. making a sand dune, playing adversarial sports). In that case, we can easily relax this assumption by predicting a distribution of next states instead. To the best of our knowledge, the simulated environments are deterministic, and the real environments are essentially deterministic (approximately rigid-body Newtonian dynamics).
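For illustration, the relaxation mentioned above could look like the following sketch, where the forward model outputs a Gaussian over next latents instead of a point estimate. This head and loss are our hypothetical illustration, not part of DynaMo as published.

```python
# Hypothetical distributional forward model: predicts a diagonal Gaussian
# over the next latent rather than a single deterministic point.
import torch
import torch.nn as nn

class GaussianForward(nn.Module):
    def __init__(self, z_dim=256, a_dim=32):
        super().__init__()
        self.net = nn.Linear(z_dim + a_dim, 2 * z_dim)  # predicts mean and log-variance

    def forward(self, z_t, a_t):
        mu, log_var = self.net(torch.cat([z_t, a_t], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

def nll(mu, log_var, z_t1):
    # Negative log-likelihood of the next latent under the predicted
    # Gaussian, up to an additive constant.
    return 0.5 * (log_var + (z_t1 - mu) ** 2 / log_var.exp()).mean()
```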
"Why is the latent action representation needed at all?": Given the same previous state, an agent can act in multiple ways, leading to a distribution of next states, even in a deterministic environment (e.g. picking up a mug by the handle, or by the body). So we first observe both the previous and next states to “fix” the latent action, then predict the deterministic next state given the previous state, and the latent action. We show that this is crucial in the ablation section (§4.6). We will make this clearer in our next revision.
We hope this addresses your concerns and questions. We would be keen to discuss any remaining points that stand between us and a higher score.
[1] N. Delson and H. West. Robot programming by human demonstration: Adaptation and inconsistency in constrained motion. In Proceedings of IEEE International conference on Robotics and Automation, volume 1, pages 30–36. IEEE, 1996.
[2] M. Kaiser and R. Dillmann. Building elementary robot skills from human demonstration. In Proceedings of IEEE International Conference on Robotics and Automation, volume 3, pages 2700–2705. IEEE, 1996.
[3] S. Liu and H. Asada. Teaching and learning of deburring robots using neural networks. In Proceedings of IEEE International Conference on Robotics and Automation, volume 3, pages 339–345. IEEE, 1993.
[4] H. Asada and B.-H. Yang. Skill acquisition from human experts through pattern processing of teaching data. In Proceedings of IEEE International Conference on Robotics and Automation, volume 3, pages 1302–1307. IEEE, 1989.
This reviewer wants to thank the authors for an informative rebuttal and is looking forward to reading the paper with the promised changes included. Despite an unfortunate lack of references back to earlier research, there clearly is a connection that ought to be highlighted, since it partially explains why SSL makes sense at all for an embodied agent constantly adapting to an ever-changing environment.
This paper presents a self-supervised learning method for robot learning that learns representations from demonstration data. The objective is based on learning latent actions from inverse dynamics, and learning a forward dynamics model that uses such latent actions as inputs. Several techniques are utilized to prevent the model from finding trivial solutions and thus collapsing. Experiments are conducted in both real-world and simulation environments.
Strengths
- Clear writing with good figures
- Real-robot experiments!
- Focuses on the important problem of pre-training representations from demonstrations, as utilizing such limited but in-domain data can be crucial in the context of robot learning, where in-domain data is scarce but important, especially for fine-grained control tasks.
Weaknesses
- As the other self-supervised learning models are trained on non-standard robotic datasets, it is not clear whether they are trained well with good hyperparameters -- for instance, without collapse. Is there a way to ensure that baseline methods are well-tuned?
- I understand that the main focus of this paper is to introduce a self-supervised learning method and compare its performance to other baselines. But what would the performance look like if you consider the full fine-tuning setup that uses gradients from behavior cloning for updating the encoder? Can we squeeze more information and maybe performance boost from fully fine-tuning the encoder? How would all the methods perform in this setup? This could further strengthen the claims of this paper that we should focus on extracting more information from demonstrations.
- One important missing baseline is [1], which pre-trains an (optionally causal) transformer with a masked-modelling objective. Even though it uses a pre-trained visual encoder, using features from the causal transformer can still be a baseline. Moreover, it is a bit awkward that MAE trained on demonstrations is missing from the baselines even though MVP is selected as a pre-trained representation baseline. Including MAE, and optionally its multi-view variant [2], would make the results more convincing.
[1] Radosavovic, Ilija, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. "Robot learning with sensorimotor pre-training." In Conference on Robot Learning, pp. 683-693. PMLR, 2023.
[2] Seo, Younggyo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, and Pieter Abbeel. "Multi-view masked world models for visual robotic manipulation." In International Conference on Machine Learning, pp. 30613-30632. PMLR, 2023.
Questions
See Weaknesses
Limitations
N/A
Thank you for your thoughtful review, and pointers to missing baselines. We are glad that you find our in-domain visual pretraining setting important. We will address each of your concerns below.
"Is there a way to ensure that baseline methods are well-tuned?": We monitor the observation embeddings for representation collapse during training. For all SSL baselines, we start with the recommended hyperparameters in the paper or official repo, and tune hyperparameters when there seems to be representation collapse. We will release the hyperparameter details for reproduction with the public release of this work.
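As an illustration of such monitoring, one simple diagnostic is to track the per-dimension spread and pairwise similarity of a batch of embeddings. This sketch is our own illustration; the exact monitoring procedure is not specified in the thread.

```python
# Simple collapse diagnostics on a batch of embeddings: per-dimension
# standard deviation shrinking toward zero, or mean pairwise cosine
# similarity approaching one, both signal representation collapse.
import torch
import torch.nn.functional as F

def collapse_stats(z: torch.Tensor):
    """z: (batch, dim) observation embeddings."""
    per_dim_std = z.std(dim=0).mean().item()          # -> 0 under collapse
    z_norm = F.normalize(z, dim=-1)
    cos = z_norm @ z_norm.t()
    off_diag = cos[~torch.eye(len(z), dtype=torch.bool)].mean().item()  # -> 1 under collapse
    return per_dim_std, off_diag
```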
“What would the performance look like if you consider the full fine-tuning setup”: We have added end-to-end fine-tuning results on the Push-T environment. Please see global comment (6) and PDF Table 3 for detailed experiment results. We find that fine-tuning further improves the performance of pretrained encoders, and that pretrained representations significantly outperform training the encoder and policy jointly from scratch. We would like to note, though, that fine-tuning the full model takes 4x longer to train.
"Missing baseline… transformer with masked modelling objective": Thank you for suggesting RPT as a baseline. We added an observation-only variant of RPT with a ResNet18 backbone. We found RPT to be a strong baseline, outperforming most other SSL baselines, but falling short of DynaMo by 7% across the sim environments. Please see global comment (2) and Table 1 in the PDF for detailed results.
"MAE trained on demonstrations is missing from the baseline": Thank you for pointing this out. We have added MAE as an in-domain SSL baseline, discussed in global comment (1). In summary, MAE underperforms DynaMo by an average of 33% across sim environments, and completely fails to solve Block Pushing.
We hope this addresses your concerns and questions. We would be keen to discuss any remaining points that stand between us and a higher score.
Thank you for the response. I have read other reviews and the response, and currently decided to maintain the score.
The reason for not increasing the score is that I agree with the other reviewers that the submission can be improved by (i) improving the experimental design, especially with respect to architecture, and (ii) making the reasoning for not using actions when they are available clearer.
But I'd also like to note that I disagree with the other reviewers' view that SSL should be designed for large-scale datasets, as representation learning on scarce demonstration data can be very useful for robotics, which is why I'm not decreasing the score.
This paper presents DynaMo, which uses in-domain data for self-supervision. It jointly learns a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings.
Strengths
This paper is easy to follow.
Weaknesses
- Simplified Real-World Setup: The real-robot experiments appear overly simplistic. Objects seem to be fixed in place, indicated by the red marker on the table, suggesting a lack of randomization in object placement. This setup makes the task easier for conventional imitation learning methods like Diffusion Policy and ACT, potentially allowing them to achieve a 100% success rate.
Suggestion: Introduce spatial randomization to the scene. Conduct additional experiments under these conditions to demonstrate DynaMo's superiority in more complex and varied scenarios.
- Unfair Comparisons in Simulation: In Table 1, DynaMo is compared with several baselines that use different backbones, which makes the comparison potentially unfair. Impact: The difference in backbones could skew the performance results, making it difficult to accurately assess DynaMo's relative performance.
Suggestion: Include more experiments of DynaMo with various backbones such as ViT and ResNet-50. Compare these results against the baselines to provide a fairer and more comprehensive evaluation.
- The motivation for using SSL in this context is unclear. Typically, SSL is advantageous due to its ability to learn from massive datasets without human labels. However, in the field of robotics, in-domain data are often scarce. This could make the application of SSL less persuasive and potentially less effective.
Questions
Currently, I believe this paper has significant flaws in its experimental design, both in simulation and real-robot settings. As such, my initial score is 4, with the real-robot experiments being a notable strength. However, the existing experiments do not sufficiently support the claims made in the paper. If the authors can provide additional experiments based on my suggestions above, and if the results substantiate their claims, I would be willing to raise my rating.
Limitations
No.
Thank you for your detailed review and constructive comments. We are glad that you consider our robot experiments a notable strength. We will address each of your concerns below.
Simplified real-world setup: We would like to clarify that the red marker on the table is for setup reference only. At test time, the object is randomly placed within the convex hull of the demonstration starting positions. We have updated the paper website with all 10 rollout videos for each Allegro task, as well as visualizations of all rollout starting positions in Figure 1 in the PDF. We have also added a kitchen task with more variations in the object starting position in global comment (5).
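For concreteness, one simple way to draw a test start position inside the convex hull of demonstration start positions is a random convex combination of those positions (note this is not uniform over the hull). The sampling scheme below is our illustration; the authors' exact placement procedure is not specified in the thread.

```python
# Sample a point guaranteed to lie inside the convex hull of demo starts,
# via Dirichlet-weighted convex combination (illustrative, not uniform).
import numpy as np

rng = np.random.default_rng(0)
demo_starts = rng.uniform(0.0, 0.25, size=(50, 2))  # (x, y) demo start positions, meters

weights = rng.dirichlet(np.ones(len(demo_starts)))  # non-negative, sums to 1
test_start = weights @ demo_starts                  # inside the hull by convexity
```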
Unfair comparisons in simulation: We would like to kindly note that only MAE-based baselines (VC-1, MVP, MAE) use ViTs as they are incompatible with ResNets. All other baselines use a ResNet18 backbone.
- We have added MAE as an in-domain SSL baseline with the ViT-B backbone. Please see global comment (1), and Table 1 in the PDF for more details.
- We have also run DynaMo (ViT-B) on Push-T. It outperforms MAE (ViT-B) by 28%, but underperforms DynaMo (ResNet18) by 22% while taking 4 days (vs. 1 hour) to train. Overall, we find that ViTs trained on small in-domain datasets perform significantly worse than small ResNets, as vision transformers require much more data to perform well.
- For consistency, we have updated MVP to use the ViT-B backbone, such that all MAE-based baselines use the same backbone. We find that MVP (ViT-S) and MVP (ViT-B) have similar performance across environments.
Motivation for SSL in the low-data regime: As you have mentioned, scarce in-domain data is a real problem in robotics. We elaborate in detail our motivation for SSL in the low-data regime in the global comment. To clarify, our work is designed to tackle the problem of learning efficiently from small-scale decision-making data, often with only a few hundred or fewer trajectories. DynaMo improves downstream performance, is compatible as an in-domain fine-tuning step for other pretrained encoders, and can be trained on much larger datasets, but we have academic compute constraints.
We hope this addresses your concerns and questions. We would be keen to discuss any remaining points that stand between us and a higher score.
Thank you for your reply.
- I appreciate the explanation and checked the new video uploaded by the author. However, I am unsure if updating the webpage violates the rebuttal rules. Therefore, even though this video addresses my concern, my evaluation will not take it into account.
- Whether ViT performs worse than small ResNets depends on the experimental conditions (e.g., freeze, LoRA, or full parameter fine-tuning) and hyper-parameter settings (e.g., same or lower initial learning rate compared to the pretrain stage). The experimental setup should be clearly elaborated and compared. Current experiments are insufficient and not convincing.
Therefore, I have decided to keep my initial rating after the rebuttal.
Thank you for your reply. We are confident that our rebuttals are by the rules (“should not contain any links to external pages”). To facilitate the discussion, we have updated the existing paper website for transparency and in good faith. If you would like to disregard that anyway, we invite you to look at Figure 1 in the uploaded PDF as an alternative visualization for your point above, as well as the additional kitchen experiment in global comment (5) with starting position variations explicitly addressing this concern.
As for your second point, we believe there is a misunderstanding. Allow us to elaborate below.
Whether ViT performs worse than small ResNets depends on the experimental conditions: We would like to kindly note that in the low-data regime, larger models ≠ better performance: the original ViT paper [1] has observed that ViTs require larger datasets (“14M-300M images”) to reach performance parity, whereas for robotics and decision-making, most datasets have 10k-100k frames. It is not surprising that ViTs underperform ResNets at this scale. The main focus of this paper is an SSL method that works in the low-data regime for control tasks, rather than a pretrain-then-fine-tune setup for visual encoders in general.
Current experiments are insufficient and not convincing: Could you give us specific setups in existing published work that you think we should follow? We are happy to take a look.
The experimental setup should be clearly elaborated and compared: To enable us to address your concerns effectively, would you clarify which parts of the experimental setup are unclear? For all our main results, we follow the same evaluation procedure: pretrain an encoder from random initialization, then train a downstream policy on the frozen embeddings and use environment rollout metrics to evaluate the encoder performance. The encoder is not fine-tuned during policy training. This setup is used for the ViT training as well.
[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
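For concreteness, a minimal sketch of the frozen-encoder evaluation protocol described above follows. The module shapes and the simple behavior-cloning loss are placeholders, not the exact policy classes used in the paper.

```python
# Frozen-encoder protocol sketch: pretrained encoder stays fixed; only the
# policy head is trained on its embeddings. Shapes are placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # pretrained
policy = nn.Linear(256, 7)  # e.g., a 7-DoF action head (illustrative)

for p in encoder.parameters():
    p.requires_grad = False   # encoder is not fine-tuned during policy training
encoder.eval()

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs, act = torch.randn(8, 3, 64, 64), torch.randn(8, 7)
with torch.no_grad():
    z = encoder(obs)                      # frozen embeddings
opt.zero_grad()
loss = ((policy(z) - act) ** 2).mean()    # simple behavior-cloning loss
loss.backward()
opt.step()
```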
We thank the reviewers for your insightful and constructive comments, and for finding our robot results strong (96in, kzSJ) and the approach novel (B6gm, MZTu). To address your concerns, we have run all requested experiments within our compute budget. You can find our detailed response in individual replies to your review, as well as a summary of shared concerns, additional results and revisions here:
SSL on in-domain data vs. massive out-of-domain data (reviewers 96in, MZTu, B6gm): We would like to clarify that this work is tackling efficient learning with small-scale decision-making data. Several prominent works [1][2][3] deal with demonstration datasets of a few hundred or fewer trajectories. Our SSL method, DynaMo, is designed to accelerate policy learning in the low-data regime. DynaMo improves downstream policy performance with in-domain SSL pretraining, and significantly outperforms learning the encoder and policy from scratch (see (6) below). We also show that DynaMo can be used as an in-domain fine-tuning step for encoders pretrained on large out-of-domain datasets like ImageNet. Finally, DynaMo is in principle compatible with Internet-scale datasets, but unfortunately we do not have the requisite compute in an academic lab. Running DynaMo on large datasets is an exciting direction that we hope industry labs can follow up on and explore.
Comparisons with more baselines and additional experiments:
- (1) Comparisons with MAE (reviewers kzSJ, MZTu): We train MAE with a ViT-B backbone on all environments with the official implementation. We find that MAE underperforms DynaMo by 24% on Franka Kitchen, 51% on Push-T, and completely fails to solve the task on Block Pushing. (PDF Table 1)
- (2) Comparison with RPT (reviewer kzSJ): We train an observation-only variant of RPT on all environments with a ResNet18 backbone. We find RPT to be a strong baseline, outperforming most other SSL baselines, but falling short of DynaMo by 7% across the sim environments. (PDF Table 1)
- (3) Baseline backbones (reviewer 96in): We would like to clarify that only MAE-based baselines (VC-1, MVP, MAE) use ViTs, as they are incompatible with ResNets. All other baselines use a ResNet18 backbone, which we will make clearer in our next revision. We have unified all ViTs to use the ViT-B backbone. We have also run DynaMo (ViT-B) on Push-T. It outperforms MAE (ViT-B) by 28%, but underperforms DynaMo (ResNet18) by 22% while taking 4 days (vs. 1 hour) to train. Overall, we find that ViTs trained on small in-domain datasets perform significantly worse than small ResNets, as larger models require much more data to perform well. (PDF Table 1)
- (4) Generalization (reviewers B6gm, MZTu): We have added a new real kitchen task (picking up bread) to test whether the old encoder trained on existing kitchen tasks can be used to train a policy on the new task. We train a policy head with the old DynaMo encoder, and also with the best baseline (old MoCo encoder). We find that policies trained with the existing encoders still manage to complete the task (DynaMo 4/10, MoCo 3/10; successes/total rollouts). We have also trained DynaMo and MoCo encoders from scratch on the new task only, and evaluated likewise. We find that policies trained with the new in-domain encoders exhibit improved performance (DynaMo 7/10, MoCo 5/10). We hypothesize that encoder generalization will improve if trained on much larger datasets, but we have limited academic compute. (PDF Table 2)
- (5) Variations in starting positions (reviewer 96in): In the task above in (4), the starting positions of the task object are varied across the workspace (~20x15cm in size). We find that our encoder completes the task 7/10 times, outperforming MoCo (5/10). We would also like to clarify that for the Allegro hand environment, the task objects do have significant variations in their starting positions (~25x15cm for the sponge, and ~20x10cm for the teabag). We have included a visualization (PDF Figure 1) of the hand task starting positions, and updated the rollout videos on the website. (PDF Table 2)
- (6) End-to-end fine-tuning after encoder pretraining (reviewer kzSJ): We fine-tune DynaMo-pretrained, MoCo-pretrained, ImageNet-pretrained, and randomly initialized ResNet18 encoders. We find that fine-tuning can further improve the performance of pretrained encoders by up to 12%, and that pretrained representations significantly outperform training the encoder and policy jointly from scratch. (PDF Table 3)
We hope that these updates inspire further confidence in our work. At the same time, we invite any further questions or feedback that you may have on our work.
[1] S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024.
[2] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
[3] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
DynaMo is a self-supervised learning approach for robotics that jointly learns inverse and forward dynamics models using in-domain data. This approach avoids suppressing data dimensions necessary for visuomotor control. The model assumes unobserved action representations, which is novel. The paper could benefit from a better discussion of prior art and the context of this technical advance. Empirically, the paper demonstrates its application in both simulated and real-world robot experiments. While the approach shows promise, the reviews highlight concerns about the experimental setup, including simplified real-world scenarios and potentially unfair comparisons in simulation. Furthermore, reviewers requested well-tuned baselines and full fine-tuning setups, as well as additional experiments with more complex scenarios and varied backbones, to better demonstrate DynaMo's effectiveness and provide a fairer evaluation.
However, the rebuttal did a good job of providing clarifications, with more than one reviewer updating their score to lean more positively. All reviewers lean accept after the rebuttal phase, and the meta-reviewer concurs with the overall agreement. The authors are advised to review the final comments and update the manuscript accordingly. In addition to including clarifications in the main paper, the authors are also advised to describe the possible limitations of this work in more detail.