PaperHub
6.0 / 10
Rejected · 6 reviewers
Lowest: 5 · Highest: 8 · Std dev: 1.0
Ratings: 6, 6, 8, 5, 5, 6
Confidence: 3.7
Correctness: 2.7
Contribution: 2.2
Presentation: 3.2
ICLR 2025

Objects matter: object-centric world models improve reinforcement learning in visually complex environments

OpenReview · PDF
Submitted: 2024-09-17 · Updated: 2025-02-05
TL;DR

We propose an object-centric model-based RL pipeline, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements.

Abstract

Deep reinforcement learning has achieved remarkable success in learning control policies from pixels across a wide range of tasks, yet its application remains hindered by low sample efficiency, requiring significantly more environment interactions than humans to reach comparable performance. Model-based reinforcement learning (MBRL) offers a solution by leveraging learnt world models to generate simulated experience, thereby improving sample efficiency. However, in visually complex environments, small or dynamic elements can be critical for decision-making. Yet, traditional MBRL methods in pixel-based environments typically rely on auto-encoding with an $L_2$ loss, which is dominated by large areas and often fails to capture decision-relevant details. To address these limitations, we propose an **object-centric MBRL pipeline**, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements. Our approach consists of four main steps: (1) annotating key objects related to rewards and goals with segmentation masks, (2) extracting object features using a pre-trained, frozen foundation vision model, (3) incorporating these object features with the raw observations to predict environmental dynamics, and (4) training the policy using imagined trajectories generated by this object-centric world model. Building on the efficient MBRL algorithm STORM, we call this pipeline **OC-STORM**. We demonstrate OC-STORM's practical value in overcoming the limitations of conventional MBRL approaches on both Atari games and the visually complex game Hollow Knight. Code and videos are available in the supplementary materials.
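To illustrate why a per-pixel $L_2$ auto-encoding loss is dominated by large areas, consider the following toy example (illustrative only, not taken from the paper): on a 64×64 frame, a reconstruction that drops a small 4×4 object entirely incurs a lower loss than one that keeps the object but adds mild noise to the background.

```python
import torch

# Toy illustration: a per-pixel L2 loss barely penalizes missing a small
# decision-relevant object, but does penalize mild background noise.

frame = torch.zeros(64, 64)
frame[30:34, 30:34] = 1.0                           # small decision-relevant object

recon_no_object = torch.zeros(64, 64)               # background perfect, object gone
recon_noisy_bg = frame + 0.1 * torch.randn(64, 64)  # object kept, background noisy

loss_no_object = ((frame - recon_no_object) ** 2).mean()
loss_noisy_bg = ((frame - recon_noisy_bg) ** 2).mean()
print(f"object dropped entirely:   {loss_no_object.item():.4f}")  # ~0.0039
print(f"background slightly noisy: {loss_noisy_bg.item():.4f}")   # ~0.0100
```

An auto-encoder minimizing this loss is thus pushed toward reconstructing the large background faithfully rather than the small object, which is exactly the failure mode the object-centric pipeline targets.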
Keywords

reinforcement learning, model-based RL, object-centric RL, video object segmentation, Atari, Hollow Knight

Reviews and Discussion

Official Review
6

The authors present an object-centric MBRL pipeline that enables agents to focus on key, decision-relevant elements. They leverage recent advances in computer vision by extracting object features using a pre-trained segmentation foundation model with few annotations (six annotations in their experiments). These extracted features are incorporated into the world model as an additional modality along with raw observations. The authors demonstrate that their method (OC-STORM) is more effective than the baseline (STORM) for object-centric tasks in Atari games and Hollow Knight.

Strengths

  1. The authors present an effective approach to overcoming challenges in learning control policies for visually complex, object-centric tasks by leveraging recent advances in computer vision. While foundation models for RL/robotics are still emerging and there is no established standard for learning useful representations for control agents, this paper shows that inheriting knowledge from vision foundation models can be a promising direction. The authors seamlessly integrate the segmentation model with RL, requiring only a small amount of task-specific prior knowledge and a few annotations.
  2. The authors conduct comprehensive experiments across a range of video game tasks in two different domains. While their method does not consistently outperform others, they demonstrate that OC-STORM performs effectively in object-centric games by categorizing games into two groups, as shown in Table 2.
  3. The paper is well-written and easy to follow, with prior work presented thoroughly and kept up to date.

Weaknesses

  1. Although prior works are well presented, comparisons with them are missing in the experiments. While Table 1 compares the method with other general-purpose MBRL approaches, it would be valuable to see how OC-STORM performs against methods specifically designed for object-centric policy learning, such as [1], [2], [3], and [4]. Without these comparisons, it is challenging to fully assess the method’s effectiveness relative to similar approaches.
  2. Although the motivation and design of the architecture and pipeline are sound, the experimental results are not particularly impressive or convincing. Given the advantages of prior knowledge on decision-related objects, few annotations, and the use of foundation models, one would expect a more significant performance boost over methods that lack these benefits, particularly in terms of final performance and sample efficiency (see Table 1 and Fig. 4).

[1] Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. FOCUS: Object-Centric World Models for Robotic Manipulation. In Intrinsically-Motivated and Open-Ended Learning Workshop @ NeurIPS 2023, November 2023.

[2] Jaesik Yoon, Yi-Fu Wu, Heechul Bae, and Sungjin Ahn. An Investigation into Pre-training Object-centric Representations for Reinforcement Learning. In International Conference on Machine Learning, ICML 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 40147–40174. PMLR, 2023. URL https://proceedings.mlr.press/v202/yoon23c.html.

[3] Akihiro Nakano, Masahiro Suzuki, and Yutaka Matsuo. Learning Compositional Latents and Behaviors from Object-Centric Latent Imagination. In The 38th Annual Conference of the Japanese Society for Artificial Intelligence, JSAI 2024, 2024.

[4] Younggyo Seo, et al. Masked World Models for Visual Control. Conference on Robot Learning. PMLR, 2023.

Questions

  1. Additional Overhead for Inference Time: In OC-STORM, the segmentation model is used as a pre-processing step to extract object features. This likely introduces extra overhead during both training and inference, particularly given the upscaled observations and the large backbone of the foundation model. Can you clarify how significant this overhead is, especially during test time?

  2. Analysis of Segmentation Model Quality and Errors: Since segmentation foundation models are used, some segmentation errors are inevitable. The paper addresses a few aspects of this: failure cases of segmentation models are presented qualitatively (Fig. 12), and the effect of annotation count is discussed (Fig. 11). However, a more detailed analysis of segmentation model quality in relation to downstream control performance would strengthen the work. For example, examining the correlation between segmentation quality and control performance could provide insights into whether advanced segmentation would lead to better results. Additionally, analyzing segmentation model failures during test time could help evaluate the method’s robustness to segmentation errors, which is crucial for safe deployment.

  3. Other Segmentation Models: Related to the second aspect, it would be beneficial to explore how OC-STORM performs when combined with other segmentation models, such as SAM2.

  4. Other MBRL Models: Since the main contribution is the incorporation of additional input into the world model, I believe this approach could be applied to other model-based reinforcement learning methods, such as Dreamer. While some modifications may be necessary due to the non-symmetric nature of non-Transformer latent dynamics models, demonstrating that the inclusion of object features can consistently enhance performance across various base MBRL models would strengthen the argument.

  5. Other Domains: While the proposed method performs well in video game environments, it may prove even more advantageous in other, more object-centric domains, such as object manipulation, where decision-relevant objects are more unique.

  6. Missing Information in Table 1 and Table 2: Without standard deviation or error, it’s hard to assess the reliability and consistency of the results presented.

Comment

Weakness 1

We focus on object-centric learning with pre-trained vision models in this work. We are not aiming to reach SOTA performance relative to other types of object-centric models, so we believe that comparing with baselines without object representations is sufficient.

Section 5.2 is a rough comparison with FOCUS (which uses masks to indicate objects). The experiments show that the vector-based representation (OC-STORM) generally results in stronger performance than the mask-based representation.

As for the other methods [2,3,4], they use unsupervised slot-attention-based object discovery mechanisms. These methods can’t distinguish task-related from task-unrelated objects. Consider a thought experiment: if an environment contains 100 moving balls and only 2 of them, with certain colours, are related to the task goal, then the unsupervised methods can’t track those 2 particular balls.

In the case of Hollow Knight, where many moving objects with large areas appear in the background, we believe these methods can’t always extract what we need for decision-making. Thus, we believe that debugging these methods for Hollow Knight would be difficult, and if we report poor performance for these algorithms, we might also face criticism questioning the credibility of our implementation.

Weakness 2

  • Atari: The improvements from STORM to OC-STORM on the Atari 100k benchmark are still significant to us, given that not all games are well suited to object representation with Cutie.

    By observing the training process, we believe that current MBRL algorithms have reached their limit under the 100k-sample constraint. Without addressing issues like adaptive exploration, achieving substantial improvements will be challenging. Moreover, since Atari environments are visually simple, the impact of improving perception is limited. Therefore, on paper, the improvements may not appear to yield multiplicative gains.

  • Hollow Knight: As seen in Table 3 and Figure 9, OC-STORM learns much faster than STORM, which supports that our method can behave well in visually complex environments.

Questions 1

Thank you for pointing this out.

Please refer to the Global Response R3.

Questions 2

This is indeed an important question. Two-stage object-centric MBRL methods should be more controllable and analyzable.

But we think this should be investigated in future work since it would be pretty hard to evaluate the robustness at this stage:

  • This paper is an early exploration combining vision foundation models with RL. A detailed analysis of segmentation robustness would expand the scope beyond the core contributions.
  • Such a robustness analysis is more meaningful at the deployment stage, where models, data, and parameters are finalized. Even if we conduct such analysis, the conclusions may not generalize well to real-world scenarios.

Questions 3

As justified in Section 2.1 and Appendix A, Cutie and SAM2 are the most suitable object extractors at the current stage.

SAM2 is a concurrent work and has been submitted to this year’s ICLR. It wouldn’t have been possible for us to include it in this paper, but this would be interesting for follow-up work.

Questions 4

We believe our experiments are sufficient to demonstrate that the incorporation of object features is helpful for MBRL algorithms. STORM and Dreamer differ only in the structure of their sequential models, and we believe that “OC-Dreamer” is unlikely to exhibit counterintuitive inefficacy.

Questions 5

Please refer to the Global Response R1 & R2.

Questions 6

We will include JSON files with results for all seeds in the supplementary material. However, it is not standard practice to include standard deviations in such tables, as it would make them less readable.

Comment

It’s a good motivation to leverage VFMs to address the challenges of RL, and I appreciate the effort in conducting additional experiments on another domain and answering my questions. However, the benefit of OC-STORM on the additional domain does not appear very significant, and my concerns remain.

Comparisons with Baselines:

  • If the goal is to solve a specific problem, it is essential to compare the proposed approach with at least 2 methods specifically tailored to that problem. Comparing only with general methods (e.g., Dreamer, IRIS, etc.) is insufficient and unfair, as these methods are not explicitly designed for object-centric problems.

  • 'Mask + Visual' in Section 5.2 differs from FOCUS, as FOCUS uses masks not as input but as an auxiliary task.

  • Regarding the statement, “we might also face criticism questioning the credibility of our implementation”: I disagree with this. If the results are supported by reasonable conjectures and justifications, there is no reason for the credibility of the implementation to be questioned.

Impact of Segmentation Model Bias:

  • Understanding how segmentation model bias affects control task performance is a critical analysis that should be conducted at this stage. I disagree with the statement “such a robustness analysis is more meaningful at the deployment stage.” Without understanding the effect of segmentation bias at various stages of development, it is unclear why practitioners would choose this method until deployment. Robustness analysis during the research phase is essential to build confidence in the approach.

I will maintain my current score for now and will revisit my decision after further discussions with the other reviewers.

Comment

We're pleased that you appreciate our efforts following your feedback to conduct additional experiments in an alternative domain, and our answers to your questions.

  • We apologise for the lack of precision in our explanation of FOCUS. But since we also use mask reconstruction as an auxiliary task, we believe this is a sufficient comparison, as one could expect the performance with the mask as input to be better than without it. We will clarify the distinction between FOCUS and our method in Section 5.2.

  • For "comparing methods specifically tailored to that problem", we respect the reviewer's opinion, but we would like to clarify our point:

    • As mentioned in the paper and the comment above, those unsupervised OCRL methods have limited detection capabilities and lack sufficient controllability, making them theoretically unsuitable for this task.
    • Neither the original papers nor third-party sources have tested those methods on relatively general benchmarks like Atari. We still believe that, as the stakeholders of this work, our implementation for testing those methods on our task would not be convincing enough, whereas we can ensure that STORM and OC-STORM share an identical configuration except for the object module.
  • About segmentation model bias:

    • The video on Hollow Knight in the supplementary materials is a qualitative example. The segmentation mask there is only relatively accurate, with many rough and incorrect areas. However, this does not affect OC-STORM's ability to learn a good policy. What we mean is that a quantitative analysis would be task-specific and would be better conducted at the deployment stage.
    • Nonetheless, we've just conducted experiments that mimic segmentation failure, and the results are presented in Appendix K. The results show that as the detection accuracy of the vision model increases, the agent's performance improves accordingly. These results also demonstrate the robustness of OC-STORM in handling unstable detection results.

In summary, we hope you can appreciate the value of our current contribution as it stands, and will reconsider your rating of our work.

Comment
  • I appreciate that the authors have evaluated their method on Atari, a benchmark often overlooked in the object-centric learning literature for RL. However, I still recommend adding evaluations with object-centric baselines on this benchmark, as it would significantly strengthen the paper.

  • Thank you for conducting this analysis. It offers valuable insights into how segmentation model performance affects downstream task performance and identifies the framework’s breakpoint. This information will help readers understand when to exercise caution in selecting segmentation models.

I have increased the score to weak accept.

Comment

We sincerely appreciate your professional feedback, the time you've spent discussing this with us, and your decision to increase the score.

Official Review
6

The paper proposes OC-STORM, an object-centric model-based reinforcement learning (MBRL) framework designed to improve sample efficiency and performance in visually complex environments. Unlike conventional MBRL methods that rely on pixel-based auto-encoding, OC-STORM leverages recent advances in object detection to focus on key, decision-relevant elements within scenes. The approach uses an object-centric pipeline that includes object annotation, feature extraction, and training with simulated trajectories, enhancing reinforcement learning in environments such as Atari games and Hollow Knight. The paper demonstrates OC-STORM's performance improvement over existing MBRL algorithms, particularly in settings where object information is essential for decision-making.

Strengths

  1. The paper introduces a unique method of integrating object-based features with reinforcement learning, allowing the agent to prioritize decision-relevant elements within complex visual environments. This approach addresses a significant gap in traditional pixel-based MBRL methods by enabling more focused training.

  2. The experiments cover both Atari games and Hollow Knight, demonstrating the versatility and practical relevance of OC-STORM. This shows that the method is not limited to simplified or controlled environments but can extend to visually and structurally complex settings.

Weaknesses

See the questions below.

Questions

  1. The selection of Cutie as the object detector is well-justified, given its robustness in generalizing across diverse environments. However, evaluating additional object detection models or conducting an ablation study would provide valuable insights into the flexibility and performance of different detectors within OC-STORM. Could the authors share experimental results that compare the effectiveness of various object detection models in this context?

  2. Integrating object detection and feature extraction may introduce computational overhead. A comprehensive analysis of the trade-offs between computational cost and performance gains in complex environments would enable potential users to better understand OC-STORM's scalability and efficiency. Could the authors provide experimental results detailing this trade-off?

Comment

Questions 1

As justified in Section 2.1 and Appendix A, Cutie and SAM2 are the most suitable object extractors at the current stage.

SAM2 is a concurrent work and has been submitted to this year’s ICLR. It wouldn’t have been possible for us to include it in this paper, but this would be interesting for follow-up work.

Questions 2

Thank you for pointing this out.

Please refer to the Global Response R3.

Comment

Thank you for your response. I will maintain my original score.

Comment

Thank you for acknowledging the strengths of our work and for providing feedback to further enhance it.

Official Review
8

This paper proposes an object-centric world model to handle environments that contain small, dynamic objects that are likely important for decision-making. Their approach involves annotating key objects using segmentation masks, extracting the object features using a pretrained model, training a world model with these features and finally training a policy using MBRL. They claim to be the first to adopt object-centric learning on Atari and the visually complex game of Hollow Knight. They outperform the prior MBRL work on 18 of 26 tasks in the Atari 100k benchmark.

Strengths

  • Good discussion on the thought process in selecting the model for extracting object representations and other hyperparameter/design choices
  • Paper is well-motivated, reasoning for using an object-centric WM is justified
  • Good related work section discussing literature in MBRL and object-centric representations for RL
  • Results are convincing, good comparisons to relevant baseline methods
  • Nice qualitative analysis highlighting what objects are being segmented by the object model in the evaluation environments
  • Good discussion of potential limitations of the work

Weaknesses

  • Is the primary novelty in the paper the use of the pretrained features from Cutie in a MBRL framework? It is not clear apart from this what other contributions are presented in this work.
  • Environments are carefully selected such that an object-centric approach has an advantage. I wonder whether this approach could be applicable in continuous control tasks and potentially real-world robot tasks. It would be nice to see such results in a future study.
  • It would be nice to see a model-free baseline there for comparison as well. It is not clear to me what the benefit is for MBRL in these environments where the simulator is readily available. If it is about sample-efficiency, it would be good to see how this stacks against your standard SAC or PPO.

Questions

  • As mentioned in the weaknesses, how scalable is this to visual domains with continuous action spaces and possibly to real-world tasks?
Comment

Weaknesses 1

Indeed, the major novel point is simply introducing object features as part of the input. Yet we believe this point and this paper are still quite significant within the RL community:

  1. To our knowledge, few works have successfully adopted general-purpose pre-trained vision models for RL agents. And we are the first to successfully deploy this on Atari games and Hollow Knight.
  2. Naturally, one would expect to collect and fine-tune on thousands of images to enable key-object perception for an agent. We provide a pipeline that enables key-object perception while requiring only few-shot annotations.

Weaknesses 2

Hollow Knight is more object-centric, but not all Atari games are. OC-STORM learns significantly more efficiently when key information can be represented as objects, and with visual inputs its performance is on par with STORM on other types of games (Table 2).

Also please refer to the Global Response R1 & R2.

Weaknesses 3

MBRL is typically more sample-efficient than model-free methods. SAC/PPO would not learn much that is useful within a 100k budget on either Atari or Hollow Knight.

As for “the simulator is readily available”: Hollow Knight is a game, but unlike Atari there is no fast simulator, and all samples are collected in real time. Applying SAC/PPO here would take an unbearably long time.

Questions 1

Please refer to the Global Response R1 & R2.

Comment

I appreciate the authors' reply and additional experimental results in the MetaWorld environment, although the results there are not as impressive as the ones for the Atari game / Hollow Knight environments. I acknowledge the authors' efforts and maintain my current score.

Comment

We greatly appreciate you for taking the time to review our work and for providing a high evaluation of our submission.

Official Review
5

This paper introduces OC-STORM, which incorporates object-centric state-based representations obtained from a pre-trained visual model into the MBRL method STORM. The OC-STORM extractor requires a few hand-annotated segmentation masks to identify key objects before extracting their features. Experimental results indicate that OC-STORM outperforms STORM on both Atari 100K and Hollow Knight benchmarks.

优点

  • The method is a straightforward extension of an existing model-based RL algorithm and should be easily replicable.
  • Experimental results across two video game benchmarks show that OC-STORM effectively improves the performance of the base method STORM.
  • Modifications to the Hollow Knight benchmark can help the agent learn a more meaningful policy and should be regarded as a valuable contribution to the research community.

Weaknesses

  • The core idea is relatively simple. Only subsection 3.1 in the method section contains novel contributions (i.e., introducing object features as part of the input), while the remaining sections elaborate on DreamerV3's (and STORM's) model architecture and training methodology.
  • The generality of the object-centric MBRL pipeline is unclear. It is uncertain how the effectiveness of OC-STORM extends beyond the specific STORM architecture. Further experiments combining this pipeline with other MBRL methods, such as DreamerV3, should be conducted.
  • The experimental results are insufficient to support the paper's claim. While the paper mentions "visually complex environments" in its title, experiments are limited to two video game benchmarks and exclude robotics control benchmarks, where scenarios with visual distractions are prevalent, as seen in Natural Background DMC [1] and DMC Remastered [2]. Atari, one of the benchmarks utilized, may not be regarded as a visually complex environment. Within the video game domain, Minecraft benchmarks such as MineDojo [3] feature visually complex elements closer to real-world scenarios and should be more suitable for inclusion in this paper. Additional experiments on diverse benchmarks are necessary to support the paper's argument.
  • The scope of applicability is unclear. The extractor OC-STORM uses relies on hand-annotated segmentation masks. While these masks are easy to obtain in Atari and Hollow Knight, they may pose challenges in open-world tasks due to complex object interactions. Beyond acquisition difficulties, the impact of mask quality on the extractor's accuracy remains an unresolved issue. The authors should address these challenges or acknowledge them as part of the limitations.

[1] Zhang et al. Natural environment benchmarks for reinforcement learning.

[2] Grigsby et al. Measuring visual generalization in continuous control from pixels.

[3] Fan et al. Minedojo: Building open-ended embodied agents with internet-scale knowledge.

Questions

  • Why do you base your method on STORM rather than DreamerV3, given that DreamerV3 has better overall performance across multiple domains? Would OC-DreamerV3 perform better than DreamerV3?
  • How applicable is OC-STORM when faced with a moving camera, or where not all objects appear in a single frame? This situation is common in open-world environments like Minecraft.
  • Can you demonstrate results in domains other than video games? For example on the robotics domain?
  • Why are the carrots in Gopher not specified as key objects? Is it because they are stationary?
Comment

Weaknesses 1

Indeed, the major novel point is simply introducing object features as part of the input. Yet we believe this point and this paper are still quite significant within the RL community:

  1. To our knowledge, few works have successfully adopted general-purpose pre-trained vision models for RL agents. And we are the first to successfully deploy this on Atari games and Hollow Knight.
  2. Naturally, one would expect to collect and fine-tune on thousands of images to enable key-object perception for an agent. We provide a pipeline that enables key-object perception while requiring only few-shot annotations.

Weaknesses 2

We believe our experiments are sufficient to support the point that “introducing object features as part of the input” is useful. And the difference between DreamerV3 and STORM is not that big.

Please also refer to the response to Question 1.

Weaknesses 3

Please refer to the Global Response R1 & R2.

Weaknesses 4

The impact of mask quality and detection quality are indeed unresolved issues. Thank you for pointing this out and we will include them in the Limitation section.

As long as we can get screenshots of the environment, there is no significant difficulty in annotation. We can label a frame in about 2 minutes in our study, and an environment requires only 6 or 12 frames. With the help of modern segmentation tools like SAM [1] and RITM [2], this can be achieved quickly and easily. We use RITM in our case.

[1] Kirillov, Alexander, et al. "Segment anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Sofiiuk, Konstantin, Ilya A. Petrov, and Anton Konushin. "Reviving iterative training with mask guidance for interactive segmentation." 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022.


Questions 1

Here are several reasons:

  1. STORM is an adaptation of DreamerV3 (changing only the world model) and reaches similar performance on the Atari 100k benchmark. Anyone who is interested in using DreamerV3 can modify it with our codebase.
  2. We use spatial attention to handle the relationships among the objects. When implemented, this operation is more straightforward in a transformer codebase than in an RNN codebase: we can simply transpose the tensor to switch between spatial and temporal attention (see the sketch after this list).
  3. Cutie and STORM are based on PyTorch, but DreamerV3 is based on Jax. We try not to use two different deep learning frameworks at the same time to avoid unnecessary development difficulties.
  4. DreamerV3 has a PyTorch version implemented by the community, yet under the PyTorch framework, STORM is faster to train since transformers allow parallel optimization while RNNs require backpropagation through time (BPTT). Jax may have some compelling techniques that can make BPTT faster.
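To make the transpose trick in point 2 concrete, here is a minimal PyTorch sketch of the idea (illustrative only, not the actual OC-STORM code): with features shaped (batch, time, objects, dim), the same attention module performs spatial or temporal attention depending on which axis is folded into the batch dimension.

```python
import torch
import torch.nn as nn

B, T, K, D = 8, 16, 4, 64                 # batch, timesteps, objects, feature dim
x = torch.randn(B, T, K, D)
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

# Spatial attention: tokens are the K objects within each timestep,
# so the time axis is folded into the batch.
xs = x.reshape(B * T, K, D)
spatial_out, _ = attn(xs, xs, xs)
spatial_out = spatial_out.reshape(B, T, K, D)

# Temporal attention: tokens are the T timesteps of each object,
# with a causal mask so each step attends only to past and present.
xt = x.transpose(1, 2).reshape(B * K, T, D)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
temporal_out, _ = attn(xt, xt, xt, attn_mask=causal_mask)
temporal_out = temporal_out.reshape(B, K, T, D).transpose(1, 2)
```

With an RNN, the temporal direction is hard-wired into the recurrence, so the two attention patterns cannot be swapped by a simple transpose.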

Questions 2

Please refer to the supplementary materials. For example in segmentation_masks/Atari/ChopperCommand (see imgs and vis_masks folder), not all the objects (or different appearances of an object) involved would appear in a frame at the same time. Similar examples include Atari/Jamesbond, Atari/RoadRunner, HollowKnight/MantisLords, etc.

The video in the supplementary materials also contains moving cameras, disappearing and re-appearing of several objects.

Questions 3

Please refer to the Global Response R1 & R2.

Questions 4

Based on our experience playing the game, once the black mole (or another animal) reaches the carrots’ level, losing one of them is inevitable. The carrots are only life indicators, and the key policy revolves around blocking the channel.

Comment

Thank you for addressing my comments.

  • I have read the additional experiment results on Meta-world. The results are positive. Could the authors explain the reasons for choosing these four tasks, as well as the reasons behind the underperformance on the pick-out-of-hole task?
  • I admit that the experiment results on Hollow Knight are important. However, as other reviewers have also pointed out, experiments on at least one of the more established benchmarks are vital if the authors would like to emphasize successful applications in visually complex environments.
  • The authors should carry out experiments to demonstrate the impact of mask quality and detection quality, for example using pure SAM versus hand-annotated masks.
  • I still see no point in not conducting OC-DreamerV3 experiments. From the results reproduced by the authors, the performance of STORM on Atari100k can hardly be regarded as similar compared to DreamerV3. The community cares more about performance over training speed. If the authors need more time to conduct these experiments, then perhaps the paper is not ready and should be more suitable for publication in subsequent venues.
  • The problems on moving cameras and disappearing/reappearing of objects have been addressed. I have no further questions on this.
  • Based on my experience of playing (and training on) Gopher, it is possible to bonk a gopher when it's above ground, and it's part of an agent's policy on how to bonk a gopher. The carrots serve as task targets rather than merely life indicators.

The questionable generalizability of this method, as well as a lack of experiment results have prevented me from increasing my rating. I would like to keep my score for now.

Comment

Thank you for your reply.

  • Meta-world: We choose 1 easy, 2 medium, and 1 hard task according to the MWM paper's Appendix F Experiments Details. These tasks are randomly picked and may cover some different objects. We are not very familiar with the Meta-world benchmark and can't give a specific reason for the underperformance on the pick-out-of-hole task.

  • Of course, we would like to explore OC on more benchmarks, especially MineDojo, and that would make this work much more impactful. However, the current state of computer vision technology is not sufficient to support this, as we mentioned in the Limitation section and the global response. As for Natural Background DMC and DMC Remastered, we need a joint & link binding algorithm rather than simply object detection. Still, we believe our experiments are sufficient for this work, if our methods can't handle highly dynamic and visually complex environments, then there won't be any improvements on Hollow Knight.

  • Our masks are interactively hand-annotated with the help of SAM/RITM, so their quality can be assumed to be "ground-truth". Since the amount of labelling data required is relatively small, we believe it is fairly straightforward to ensure the quality of these labels, which may limit the significance of studying this aspect further. If by "pure SAM" you mean fully automatic ground-truth mask generation, we believe this investigation itself is worth another paper.

  • We don't agree with you on this point:

    1. Despite all, OC-STORM still performs significantly better than DreamerV3. If we can enable an algorithm that is less effective than DreamerV3 to achieve better performance through OC, it would precisely demonstrate the effectiveness of our approach.

    2. STORM and Dreamer differ only in the structure of their sequential models, and we believe that “OC-Dreamer” is unlikely to exhibit counterintuitive inefficacy.

  • We may not have observed the game carefully enough, but our focus is on demonstrating the improvements brought by OC methods, and the current annotations still convey the point. However, it’s not because the objects are static that we choose not to label them.

Comment

Dear Reviewer v36h,

We were wondering if our above responses had addressed your remaining concerns?

Additionally, you mentioned that "The questionable generalizability of this method, as well as a lack of experiment results, have prevented me from increasing my rating." However, we feel that is perhaps unwarranted given we have now shown positive experimental results across both video game (classic and modern) and robotics environments, including both discrete and continuous control, demonstrating that the method does in fact appear to generalize well. In addition, we believe that a method which performs strongly on object-centric video game environments can be a valuable contribution irrespective of performance on other more general benchmarks. We hope that you can appreciate the potential benefits of our novel object-centric approach to the wider gaming/agents community, and may reconsider your rating of our work.

Official Review
5

In this work, the authors propose an object-centric model-based reinforcement learning pipeline that can capture small or dynamic elements critical for decision-making. The authors point out the limitations of previous MBRL methods, which often rely on auto-encoding with an $L_2$ loss and can fail to capture moving objects or small parts of the scene. The proposed method, OC-STORM, provides an object-centric MBRL pipeline in four steps: (1) annotate key objects related to rewards and goals, (2) extract object features using a pre-trained foundation model, (3) incorporate the object features with the raw observations and predict dynamics, and (4) train the policy. As a result, the proposed method shows improved performance on diverse Atari benchmarks.

Strengths

  • The manuscript is well-written and easy to read, making it straightforward to understand the four key steps of the proposed method.

  • By training STORM using both object features obtained through the object extractor and resized observations, it shows improved performance over the original MBRL algorithm, STORM.

  • In addition to quantitative evaluations, appropriate qualitative results are also provided.

Weaknesses

  • Rather than comparing with Dreamer/STORM/DIAMOND, the proposed method needs to be compared with algorithms that apply additional representation techniques. For instance, MWM [1] is a very similar work, which added a masked autoencoder to Dreamer and showed that it could also capture small moving blocks in scenes. Algorithms like MWM would actually be more suitable baselines.

  • Additionally, although not MBRL, there are numerous studies that consider the dynamics of the environment in the representation learning part [2]. Rather than comparing with these representation learning methods, this paper mainly focuses on the aspects in which it improves over standard MBRL. As a result, it lacks persuasive arguments for why the proposed method is better than other environment-dynamics-aware representation learning methods.

  • For these reasons, the description in the related works section is somewhat lacking.

  • Appendix E.2 includes an ablation study on the number of annotation masks, where the authors claim that "a greater number of masks contributes to more robust performance." However, even in static environments like "Atari Boxing," where at least two masks (for the player and the opponent) seem necessary, using only one mask ultimately achieves a similar convergence value to using six masks. Therefore, I would like to ask the authors to share results with a wider range of mask counts and in more diverse environments.

[1] Seo, Younggyo, et al. "Masked world models for visual control." Conference on Robot Learning. PMLR, 2023.

[2] Cui, Zichen Jeff, et al. "DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control." arXiv preprint arXiv:2409.12192 (2024).

Questions

  • What do the masks for the 6-shot and 1-shot settings look like in the experiments of Figure 11?

  • What is the additional computational cost compared to the base MBRL model for calculating the object extractor and the added features?

Comment

Weakness 1~3

We focus on object-centric learning in this work. We are not aiming to reach SOTA performance relative to other types of dynamics-aware world models, so we believe that comparing with baselines without object representations is sufficient.

The biggest difference between our work and theirs is that we integrated a pre-trained vision model with a world model, which has never been achieved before.

Nevertheless, we added some additional experiments on Meta-world and compared them with MWM. Please refer to the Global Response R2.

Weakness 4

In our context, a mask refers to the annotation result of a single frame (may include multiple objects). We will add a clarification for this. See segmentation_masks/Atari/Boxing/vis_masks.

The experiments in Appendix E.2 are designed to test Cutie’s few-shot capability: Do we really need many annotations for an environment? This can provide an intuitive guideline for choosing the number of annotations when adapting to new environments:

  • Only one annotation is still workable in some cases.
  • When the computational resource is available, increasing the number of annotations may have benefits in performance.

Questions 1

Please refer to the response to Weakness 4. More specifically, we use 0.png for Pong and 3.png for Boxing.

Questions 2

Thank you for pointing this out; we will include this in a new revision.

Please refer to the Global Response R3.

Comment

Thank you for addressing my comments. Even though the authors provided some additional results (MWM), I do not agree that integrating a pre-trained vision model with a world model adds much novelty to this work, and I am keeping my original rating.

Comment

Dear Reviewer pRve,

We hope you can appreciate the value of our current contribution as it stands. Additionally, we hope that the extra experiments we conducted to enable a comparison with MWM, as requested, justify at least a minor increase in the rating.

Comment

Thank you for your response. We respect the reviewer's opinion.

However, we would like to clarify our point: To our knowledge, there has been no prior work successfully applying OC to Atari (and Hollow Knight) without accessing game memory. We have achieved this by combining MBRL with pre-trained vision models, which we believe still constitutes a novel contribution.

Official Review
6

The paper introduces OC-STORM, an object-centric MBRL pipeline designed to improve sample efficiency in environments with complex visuals. While traditional MBRL methods with auto-encoding losses can miss small but crucial details in dynamic scenes, OC-STORM addresses this by focusing on key decision-related objects. The pipeline involves annotating key objects, extracting their features using Cutie, a pre-trained vision model, and incorporating these features into the world model for policy training. OC-STORM outperforms its predecessor STORM on Hollow Knight and surpasses other SOTA MBRL baselines on many Atari games.

优点

S1. The paper is well-written, comprehensible, and offers a profound review of related work within object extraction and MBRL, effectively situating the proposed method within this broader context. The authors adequately explain their motivations for selecting Cutie as the object feature extractor, while outlining certain shortcomings of prior approaches.

S2. The use of a categorical VAE helps ensure more stable and interpretable latent representations, particularly suited for environments where distinguishing between discrete elements is crucial. It enhances the agent’s ability to model structured features effectively (see the sketch at the end of this section).

S3. The spatial-temporal transformer can model both inter-object relationships within a single timestep and the evolution of these objects across time. The spatial block enables understanding interactions between objects at a given moment, while the causal temporal block ensures robust prediction of an object’s trajectory over time, using only past and present information. By combining these, the model can capture both immediate spatial context and long-term temporal dynamics to predict future states. Further incorporating agent actions as a control signal allows the transformer to factor in how an agent’s behavior influences object states. Varying the number of key objects (K∗) adds to the adaptability, enabling effective processing whether using object features, visual input, or both.

S4. The illustrations in the paper are informative. Figure 2 gives a detailed overview of the model's structure, showing how the object module and visual module interact within the architecture. Figure 3 demonstrates the model's ability to capture the state and position of objects, reinforcing the quantitative evidence of its performance. The illustrations in Appendices H and I provide a comparison between the ground truth sample segmentation masks and the ones produced by Cutie.

S5. The method outperforms regular STORM in the Hollow Knight boss fights, and SOTA MBRL baselines in many Atari environments.

S6. The ablation between the module combinations motivates the usage of both vector and visual model components.
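To make the categorical VAE mentioned in S2 concrete, here is a minimal sketch of a DreamerV3/STORM-style straight-through categorical latent (an illustration under that assumption, not the paper's actual code):

```python
import torch
import torch.nn.functional as F

# Straight-through categorical latent: the forward pass emits a discrete
# one-hot sample per group, while gradients flow through the softmax probs.

def sample_categorical_latent(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, groups, classes), e.g. 32 groups of 32 classes each
    probs = F.softmax(logits, dim=-1)
    index = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(index, num_classes=logits.shape[-1]).float()
    # Forward value is the one-hot sample; the gradient path is through probs.
    return one_hot + probs - probs.detach()

logits = torch.randn(8, 32, 32, requires_grad=True)
z = sample_categorical_latent(logits)        # (8, 32, 32), one-hot per group
(z * torch.randn_like(z)).sum().backward()   # gradients reach the logits
```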

Weaknesses

W1. Motivation. The authors highlight the theoretical benefits and performance enhancements of their method, such as improved sample efficiency and superior benchmark results. However, there is no discussion on the practical applications of the object-centric MBRL pipeline in potential realistic settings.

W2. Determining key object annotations.

  1. While using manual annotations to inform agents about key objects is reasonable, determining the optimal number of these objects—represented by K—remains unclear. In simple Atari games like Pong, setting K is heuristically straightforward. However, this task becomes challenging in more complex environments. Users might need to tune K for optimal performance, which can be computationally expensive. Given that K is integral to the method's effectiveness, it would be beneficial for the authors to assess the method's sensitivity to different K values across various settings. This limitation is particularly pronounced in environments where the number of relevant objects is uncertain or fluctuates. An incorrect choice of K can lead to under-representation—missing key features—or over-representation—wasting capacity on less important elements.

  2. In all tasks from Atari and Hollow Knight, the number of annotated objects ranges between 2-4. However, there is no detailed discussion on how these values were determined beyond noting that they are based on heuristics, and it is unclear whether other values were explored. More complex environments may have a greater number of essential objects requiring annotation, raising the question of whether the method can effectively handle such cases. Reliance on a low number of annotated objects (low K) may limit the method's applicability, as it could struggle to incorporate and process a higher number of annotated objects crucial for success in more complex settings.

  3. The method appears to be limited to scenarios where all annotated objects appear in the same frame—a property common to all Atari games. Similarly, most boss fights in Hollow Knight feature a fixed background apart from some minor visual disparities. Only with Pure Vessel and Mage Lord does the screen pan slightly left or right with the player's horizontal movement. Regardless of this panning, the key objects necessary for the agent's success can always appear in the frame. As shown in Figure 15, the annotations typically include the agent, the boss, and optionally any projectiles the boss might fire. It is straightforward to cherry-pick and annotate the most useful frames where both the boss and harmful projectiles are visible.
    However, this approach becomes unfeasible in environments that are truly partially observable, such as navigating 3D landscapes from an egocentric perspective. Not only does the agent not appear in all sections of the environment, but it also generally has a limited field of view—roughly 90 degrees—occluding most information. This means it is common not to obtain even a single frame where all key objects are visible. This limitation reduces access to informative annotation masks. Moreover, since Cutie attempts to mask the specified number of key objects, it will be incapable of doing so when not all objects are present, resulting in incorrect masks. This raises the question of the applicability of OC-Storm in broader contexts.

W3. Choice of Hollow Knight. The authors' choice of Hollow Knight boss fights as benchmarking tasks is commendable due to the visual and dynamic complexity compared to Atari, along with the game's challenge for both AI and human players. Understandably, since Hollow Knight is not an established benchmark, adapting prior MBRL methods to this setting is non-trivial because of significant differences that the authors outline, like sample step limits, resolution, environment wrapping, and reward functions. The authors acknowledge that these differences make direct comparisons with existing methods impractical, which then raises the question of whether Hollow Knight is the most appropriate choice for demonstrating the superiority of their approach.

While the authors argue that Hollow Knight's visual complexity, dynamic elements, rare duplicates, and less critical backgrounds make it a suitable testbed for object-centric learning, more commonly used 2D platformer environments like CoinRun [1] or benchmarks like ProcGen [2] could serve the same purpose. Equally, in the 3D realm, game-based platforms such as ViZDoom [3], DMLab [4] or more realistic environment suites like Habitat [5], and AI2Thor [6] offer environments with egocentric perception. These environments feature objects and entities that agents need to avoid, interact with, or navigate toward, making them suitable for object-centric RL approaches. Moreover, they have been widely studied in the MBRL domain and come with existing implementations and baseline evaluations, facilitating direct comparisons with prior work.

The Mage Lord and Pure Vessel boss fights show the lowest performance and are the only ones where the screen pans as the player moves left and right. The other boss fights maintain a fixed perspective and consistent background, similar to most Atari games. The lower performance in these panning scenarios suggests that the method may require static backgrounds with minimal variations for effective object-centric learning. To demonstrate the robustness of OC-Storm, it would be valuable to assess the method in environments with more dynamic backgrounds and highly disentangled consecutive frames, such as those found in embodied perception environments.

[1] Cobbe, Karl, et al. "Quantifying generalization in reinforcement learning." International conference on machine learning. PMLR, 2019.

[2] Cobbe, Karl, et al. "Leveraging procedural generation to benchmark reinforcement learning." International conference on machine learning. PMLR, 2020.

[3] Kempka, Michał, et al. "Vizdoom: A doom-based ai research platform for visual reinforcement learning." 2016 IEEE conference on computational intelligence and games (CIG). IEEE, 2016.

[4] Beattie, Charles, et al. "Deepmind lab." arXiv preprint arXiv:1612.03801 (2016).

[5] Szot, Andrew, et al. "Habitat 2.0: Training home assistants to rearrange their habitat." Advances in neural information processing systems 34 (2021): 251-266.

[6] Kolve, Eric, et al. "Ai2-thor: An interactive 3d environment for visual ai." arXiv preprint arXiv:1712.05474 (2017).

Questions

Q1. What is the MBRL training objective? It seems to not be clearly defined. The Preliminaries are melded together with the Related Work, lacking some essential concepts, terminologies, and notations used throughout the paper.

Q2. How to specify K in more complex environments with a higher number of potential key objects?

Q3. How does the method perform under true partial observability when not all key objects are simultaneously present on the screen?

Q4. Table 2 categorizes Atari games based on whether all key information for the decision can be represented as objects. How is this determined?

Q5. What are potential realistic applicable RL scenarios where OC-STORM method can be useful?

Q6. If evaluating other existing methods is not feasible, why choose Hollow Knight as a benchmark? Wouldn't selecting a more standardized environment allow for clearer comparisons and strengthen the study's validity?

Q7. Does OC-STORM outperform STORM on Hollow Knight only in terms of sample efficiency, or can it achieve comparable overall performance once fully converged? According to the training curves in Appendix D.6, STORM is unable to converge with 100K samples. It would be insightful to compare the methods with an increased number of samples to assess their true performance potential.

Comment

W1. Motivation

Though we haven’t conducted an experiment in a real-world scenario, this pipeline has no barrier to being used in that case:

  1. DayDreamer[1] has proved that MBRL performs well on a real robot.
  2. Cutie is trained on real-world videos (DAVIS, MOSE, YouTubeVOS), and has the capability to be used for object tracking under a visually complex setting.
  3. Sampling on Hollow Knight is conducted in real time, so our method requires no fast simulator and could be used in the real world.

[1] Wu, Philipp, et al. "Daydreamer: World models for physical robot learning." Conference on robot learning. PMLR, 2023.

W2. Determining key object annotations

  1. Since we can combine both low-resolution visual input and object vectors, we can always choose to “under-represent” and only keep the essential objects. For example, we achieved a performance boost on Atari Qbert, Krull, and some other environments even while missing many reward-related objects. Here we can provide a guideline for determining the number of objects that works across all the tested environments:
    1. Only keep the minimum number of objects sufficient to calculate the environment's reward.
    2. Remove objects that can’t be identified well with the object detection algorithm (which is Cutie in this work). Please see the limitation section.
  2. Similar to the first point. The limitation of the number of objects is affected by Cutie, which is designed to track a single object. We believe in the near future there will be a method that can support a massive number of objects without losing detection accuracy.
  3. Cutie is a retrieval-based VOS method and can handle the situation that you mentioned. We only need to cover the desired objects in the 6 or 12 annotations. Please refer to the supplementary materials. For example in Code/segmentation_masks/Atari/ChopperCommand (see imgs and vis_masks folder), not all the objects (or different appearances of an object) involved would appear in a frame at the same time. Similar examples include Atari/Jamesbond, Atari/RoadRunner, HollowKnight/MantisLords, etc.

W3. Choice of Hollow Knight

Please refer to the Global Response R1.

The Mage Lord and Pure Vessel are naturally more difficult bosses even for human players. Meanwhile, the agent performs well on God Tamer, which also has a moving camera.

There exist some bosses that are more difficult and have a fixed camera. We haven’t conducted experiments on those to save time and resources, as we thought our method wouldn’t learn many useful things in a 100k budget.

Q1.

We will include a brief MBRL training objective in the Preliminaries section.

Q2.

Please refer to the response to weakness 2, points 1 and 2.

Q3.

Please refer to the response to weakness 2, point 3.

Q4.

The standard roughly follows the response to weakness 2, point 1. We can still play the game perfectly with only the cropped-out objects for the games listed in the first row, but not for those in the second row. The “roughly” means we don’t consider under-representation caused by Cutie's failure to detect certain objects.

Q5.

Please refer to the response to weakness 1. Examples include object manipulation, visual goal navigation, etc. However, we don’t plan to conduct real-world experiments in this work.

Q6.

Please refer to the response to weakness 3.

Q7.

STORM can also converge with a larger number of samples, and yes, the improvement is mainly in sample efficiency. OC-STORM significantly reduces the sampling time needed to reach performance similar to STORM's. This is important when the environment can only run at real-time speed. For example, we may expect OC-STORM to defeat the boss for the first time in one hour, whereas for STORM it could be 2 hours.

For these two algorithms, the specific convergence time varies and is closely related to the difficulty of the bosses. We standardize the experiments in this study using 100k steps for simplicity.

Comment

W1. To clarify, I don't expect the authors to run experiments in a real-world scenario. Although the building blocks of your approach have laid their own grounds for motivation, such as DayDreamer functioning with robots or Cutie with realistic videos, I wish to see the authors discuss how uniquely their object-centric MBRL pipeline can be used in broader contexts or potential practical scenarios. What are the benefits of OC-Storm as an object-centric world-model approach in RL? Currently, all I can find in the paper is better performance and sample efficiency in simulation environments.

W2. The supplementary material, that the authors pointed to, indeed demonstrates that Cutie can handle cases where not all objects are visible in a single frame, which addresses one of my concerns. However, the issue of selecting K—the number of annotated objects—remains unresolved. Without a principled approach, the reliance on heuristics or trial-and-error for determining K is still a minor weakness.

OC-Storm's scalability beyond 4 annotations has not been tested, which limits its applicability in settings that may require handling more complex or numerous annotations. While this limitation may stem from Cutie’s design, it is integral to the overall method. Since no alternative approaches to Cutie have been explored, it is difficult to foresee whether the MBRL pipeline will be compatible or effective with more advanced object-tracking solutions.

W3. I remain unconvinced that OC-Storm would perform effectively in environments with truly dynamic screen movement. Minor changes like lighting variations or slight distractions are not the issue. While horizontal panning occurs in three of the boss fights, it remains unclear how OC-Storm would handle settings with constantly shifting perspectives, such as navigating or looking around in 3D environments with egocentric perception.

I think Hollow Knight is an exemplary choice of environment for this study. To strengthen the evaluation, I recommend the authors incorporate settings where the screen moves in all degrees of freedom. While fixed-camera scenarios with limited panning (left to right in the 3 boss fights) may have practical applications, more realistic use cases involve cameras that can move freely in all degrees of freedom. Environments like DMLab/ViZDoom/Minecraft replicate this setting. They also offer simple scenarios with barely any memory requirements and denser rewards that necessitate little exploration. Could OC-STORM still function with the background changing rapidly as the agent observes its surroundings?

I commend the authors for addressing the concerns raised by all six reviewers and have also considered those responses. I have raised the score.

Comment

Thank you for appreciating our work and raising your score.

  1. We now understand your concern and thank you for your suggestion. We will try to highlight this point.

  2. We agree with you. For the first point, we expect that a future foundation model (or some other method) could directly derive the target objects from the task rules (natural language, or how the reward is calculated). For now, though, such human trial-and-error seems inevitable. The second point likewise requires a better vision model first, before the RL side can be designed around it.

  3. Perhaps this could be left to future work that focuses on adopting object-centric methods in continuous and 3D environments. That would be even more interesting, as one might be able to embed physics rules into the object representations. Our work primarily focuses on establishing the pipeline for using general vision models to perform OCRL.

Comment

Global Response

We would like to thank all the reviewers for their time, effort, and insightful comments. We are grateful for their recognition of the value of this work.

  • All six reviewers acknowledged the object-centric method as being well-motivated, novel, unique, gap-filling, effective, or easy to adopt.
  • Four reviewers highlighted the clarity of the writing in the paper.
  • Two reviewers agreed that the exploration of Hollow Knight is valuable.
  • Two reviewers commended the thoroughness of the related work section.

In the following, we address some of the common questions and suggestions raised by the reviewers.

R1: “Why Hollow Knight” and “Why not other environments”

  • This work was motivated by applying MBRL to Hollow Knight, where we found STORM's learning efficiency to be unsatisfactory. After investigation, we found that the dynamic background may consume too much of the model's perception capacity. Please refer to Appendix B, Figure 5, and lines 912 to 917.
  • Why not DMLab/CoinRun/Minecraft/…:
    • These environments have other challenging aspects for RL, such as hard exploration, long-term memory, extremely sparse rewards, etc. Simply adding this OC module may not solve the major problem.
    • Meanwhile, Hollow Knight's abstract MDP is more like Atari Boxing's, which current RL algorithms can solve easily, while its background is more dynamic and distracting than in any of the mentioned environments. Using Hollow Knight therefore better demonstrates the advantages of our method in enhancing visual perception capabilities.
    • As described in the Limitation section, Cutie may not be able to handle these environments. We may need to wait for a method, such as a few-shot YOLO that is consistent across frames, to conduct this research more efficiently.
  • Why not continuous control: Just like the previous point, there are also many structural issues in these environments. Even if the visual perception is handled perfectly, we might still fail to achieve performance improvements. We don’t want these reasons to mislead anyone into thinking that object-centric methods are not useful at all.

R2: Continuous control: additional experiments on Meta-world

We conduct 4 experiments on Meta-world [1] and compare against the results from MWM [2]. Due to the limited time of the discussion phase, we cannot manage a full evaluation here.

The results are presented at the end of the revised paper (Appendix J). OC-STORM generally exhibits higher sample efficiency than STORM. In some tasks, it also outperforms MWM in terms of efficiency and performance.

This is only an additional result showcasing that our method can be used in continuous-control tasks and 3D scenes. We still believe our original experiments are sufficient for this work.

[1] Yu, Tianhe, et al. "Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning." Conference on robot learning. PMLR, 2020.

[2] Seo, Younggyo, et al. "Masked world models for visual control." Conference on Robot Learning. PMLR, 2023.

R3: Computational overhead of OC-STORM

The computation speed on Atari games with an RTX 3090 is shown in the following table. Each entry is formatted as sampling/evaluation speed & training speed, both in iterations/second. The total training time can be estimated as 100k / sample speed + 100k / training speed. The input resolution for Cutie is 320×420 (double the original 160×210).

| Algorithm | 0 obj | 2 obj | 3 obj | 4 obj |
| --- | --- | --- | --- | --- |
| STORM* | 114 & 8.1 | - | - | - |
| OC-STORM (obj module only) | - | 32 & 8.8 | 32 & 8.5 | 31 & 7.8 |
| OC-STORM (both modules) | - | 28 & 5.9 | 27 & 5.5 | 27 & 5.3 |

We believe the computational cost of introducing Cutie is acceptable in many cases.
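To make the estimate concrete, here is a small self-contained Python sketch (ours, not from the paper) that plugs the table's numbers into the formula above; the configuration labels are illustrative.

```python
# Plug the table's throughput numbers into the authors' estimate:
#   total time = 100k / sample_speed + 100k / training_speed
# Speeds are in iterations/second, copied from the table above.
STEPS = 100_000

configs = {
    "STORM* (0 obj)": (114, 8.1),
    "OC-STORM, obj module only (4 obj)": (31, 7.8),
    "OC-STORM, both modules (4 obj)": (27, 5.3),
}

for name, (sample_speed, train_speed) in configs.items():
    total_seconds = STEPS / sample_speed + STEPS / train_speed
    print(f"{name}: ~{total_seconds / 3600:.1f} hours")
```

Under these numbers the full OC-STORM pipeline adds roughly 2 to 3 hours over the STORM baseline per 100k-step run, which is consistent with the authors' claim that the overhead is acceptable.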

Comment

Dear Reviewers,

Thank you again for your time and effort in reviewing our work. With less than 48 hours left in the discussion period, we would be grateful if the remaining reviewers could reply to our responses soon.

Please let us know whether we have addressed your concerns, or whether there is anything else we can address, so that you might consider increasing your rating of our paper.

Kind regards,

The Authors

AC Meta-Review

After careful consideration of the six expert reviews, the subsequent author-reviewer discussions, and the area chair’s own reading of the paper, I recommend rejecting this submission. While the paper presents an interesting approach to incorporating object-centric information into model-based reinforcement learning, several fundamental limitations prevent it from meeting ICLR's acceptance standards.

The paper's primary technical contribution centers on augmenting STORM with object features extracted from a pre-trained vision model. While this approach demonstrates improved performance on certain benchmarks, the methodology raises several concerns. First, the authors' choice to implement their approach solely on STORM, rather than demonstrating its applicability across multiple MBRL frameworks, limits the generalizability of their contribution. Given that the proposed object-centric enhancement could theoretically be applied to any MBRL method, this narrow focus undermines the broader impact of the work.

A second major concern relates to the evaluation environments. While Atari and Hollow Knight (which also raised concerns that it is not a common benchmark) provide interesting test cases, they share a common limitation: both feature relatively static camera perspectives with full scene observations. The effectiveness of the object detection approach in more dynamic, three-dimensional environments with partial observability remains unaddressed. This limitation is particularly relevant given that many real-world applications involve complex, changing viewpoints and partial observations.

The comparison methodology also raises concerns. The authors compare their method, which leverages a pre-trained vision model, against baselines that learn representations from scratch. While the use of pre-trained models is valid and increasingly common in the field, the experimental design should account for this advantage, perhaps by including baselines that also utilize pre-trained components or by providing more detailed ablation studies.

During the discussion period, the authors made commendable efforts to address reviewer concerns, including providing additional experiments on Meta-world environments. However, these additions, while valuable, do not fully address the limitations of the work. The positive results on Meta-world suggest potential broader applicability, but without more comprehensive evaluation in diverse, dynamic environments, the paper's claims about effectiveness in "visually complex environments" remain insufficiently supported.

While one reviewer provided a highly positive assessment (rating of 8), the review lacked technical depth and failed to engage with these limitations. The remaining reviewers provided more detailed technical analyses that highlighted these important concerns.

Additional Comments on Reviewer Discussion

During the rebuttal period, there was extensive discussion between the authors and reviewers regarding several fundamental aspects of the paper. The discussions centered primarily on the method's generalizability, benchmark selection, and experimental validation.

Multiple reviewers, including MVaR and v36h, raised concerns about the limited scope of evaluation, particularly questioning why the authors chose to implement their approach only on STORM rather than demonstrating its effectiveness across multiple MBRL frameworks like DreamerV3. The authors responded that STORM and DreamerV3 differ only in their sequential model structure, suggesting similar improvements would likely be observed. However, this response did not fully address the fundamental concern about demonstrating broader applicability.

Another significant point of discussion focused on the choice of benchmarks. Reviewer gCb5 and v36h questioned whether Atari and Hollow Knight provided sufficient evidence of the method's effectiveness in truly complex visual environments. While the authors added Meta-world experiments during the rebuttal period, these results were not as impressive as those from the original environments, leaving questions about generalizability unresolved.

The authors engaged constructively with reviewer pRve's request for computational overhead analysis, providing detailed timing comparisons for different configurations. This addition helped clarify the practical implications of implementing their approach.

Reviewer MVaR's concerns about segmentation model quality and its impact on downstream performance led to meaningful additions to the paper, including new experiments demonstrating the relationship between detection accuracy and agent performance. This addition, along with the Meta-world results, showed the authors' commitment to improving the work's empirical validation.

Final Decision

Reject