PaperHub
Overall: 6.1/10 · Poster · 4 reviewers (ratings 4, 3, 3, 3; min 3, max 4, std 0.4)
ICML 2025

Zero Shot Generalization of Vision-Based RL Without Data Augmentation

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We demonstrate how disentangled representation learning and associative memory can be used to enable zero-shot generalization of vision-based RL agents without data augmentation or fine-tuning.

Abstract

Keywords
representation learning, reinforcement learning, disentangled representation learning, associative memory, zero-shot generalization

Reviews and Discussion

Review (Rating: 4)

The authors propose a method for generalizing vision-based RL policies in the presence of image distribution changes or distractors, such as modifying the color scheme of the image. Their method aims to learn a representation that disentangles the different components of an image, specifically into two types: task-relevant and task-irrelevant features. Ideally, if a given feature captures either only task-relevant or only task-irrelevant information, then generalization is possible: for a downstream image with distractors, those distractors should be filtered into the irrelevant features while the relevant features still capture only the needed information.

Questions For Authors

  1. What happens if you use too large of a latent space |z_d|? Does this break things?
  2. Since each dimension is treated independently, the number of codebooks scales exponentially in |z_d|. Is this correct? Is there a way around this somehow? Please discuss.

Claims And Evidence

Their claims, that disentangled visual representations are necessary for generalization, are well-motivated both intuitively and by arguments from biology. Their experiments show that their proposed approach outperforms other approaches without using data augmentation, except for a baseline that uses a large pre-training dataset.

Methods And Evaluation Criteria

To my knowledge, the benchmark environments they use are standard for this type of RL.

Theoretical Claims

I checked the proofs in the appendix and they seem correct to me.

Experimental Design And Analysis

The described experiments are reasonable given the problem formulation, though I did not check the code.

Supplementary Material

I did not review supplementary material.

Relation To Broader Literature

This work relates to broader literature on image processing, RL, image based RL, and representation learning. This work seems to be mostly connecting some prior work in disentangled representation learning with image-based RL.

Essential References Not Discussed

I am not sure if this is strictly necessary, but bi-simulation (e.g. work from Amy Zhang et al.) may be worth looking into.

Other Strengths And Weaknesses

Strengths:

  • Paper is well written
  • The introduction and motivation are very clear

Weaknesses:

  • Some figures providing qualitative examples would be helpful. Specifically, the following two:
    • A figure showing the architecture and inference procedure for the representation, especially relating to the codebook. This part of the paper was harder to parse, and a figure may clarify things.
    • A figure illustrating the learned representation. For example, show the same trajectory but with two different coloring schemes, and plot the representations as the trajectory progresses. If the algorithm is working correctly, all relevant features should closely match, while irrelevant features may be wildly different. Additionally, show whether you can use this approach to modify the images. For example, modify the observation to be a different color, compute its representation, set its irrelevant features to match the training data, then regenerate the image. You should be able to effectively change the image visuals using this approach without losing the relevant information, such as joint position.

Other Comments Or Suggestions

Minor:

  • Notation in the background section is a bit unclear. For example, you use R(s_t, a_t) for reward, but you also use r_t; r_t is not defined anywhere, and I assume it is the reward at time t. Also, the equation for the Bellman residual J(Q) has the reward listed twice and a missing parenthesis in the subscript notation. Lastly, the policy's objective is to maximize value, so this should either be indicated or a negative sign should be included.
  • Quotation marks are all double end quotes.
Author Response

Thank you for your in-depth feedback and for providing references to work on bi-simulation for visual generalization. We will add bi-simulation to the related works section and fix the reward / RL objective notation issues for the final iteration of the manuscript. We respond to individual comments and questions below.

Bi-simulation may be worth looking into.

Thank you for bringing this to our attention; it is indeed relevant. It aligns with data augmentation (DA) and latent regularization methods like our RePo baseline. From Appendix A.1 Equation 8, DA is defined as attempting to learn an "optimality invariant" Q-function such that Q*(z, a) = Q*(z', a) for two states z and z' that are semantically the same but perhaps differ visually. Similarly, bi-simulation defines an optimal representation where states that are associated with similar return are closer together in some metric space, and one where task-irrelevant features are removed from the representation. However, our core claim is that for generalist agents, what is "task-irrelevant" can vary depending on the task, so removing such information may be detrimental. For example, a flying bird is irrelevant to autonomous driving until it becomes a collision risk. Rather than removing "task-irrelevant" features, we argue that generalist agents should retain all information about the environment, where the onus is on the policy network to decide what variables are relevant for taking optimal actions.

Since our method uses a reconstruction loss, we retain all information about the environment but specifically in a factorized representation, making our method equivariant to visual distribution shifts. Thus, our method makes no distinction between task-relevant and task-irrelevant features – it simply attempts to factorize the latent structure into separate dimensions without bias. The policy and critic networks decide what is task-relevant and what isn’t. Though this can make our method sensitive to “task-irrelevant” features for any given task, the associative memory component mitigates much of the problem by mapping OOD values (e.g., a new robot color at test time) to known values seen during training before passing the representation to the actor/critic networks.
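To make that mapping step concrete, here is a minimal sketch of per-dimension associative retrieval in the style of modern Hopfield attention. This is an illustration, not the paper's exact implementation: the squared-distance similarity, the codebook sizes, beta value, and all names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def associative_retrieve(z, codebooks, beta=8.0):
    """Map each latent dimension of z toward known training values.

    z         : (batch, d) continuous latents from the encoder
    codebooks : list of d tensors, each (n_i,), the stored scalar values
                for that dimension (one codebook per dimension)
    beta      : inverse temperature; large beta approaches hard
                nearest-value recall
    """
    out = []
    for i, cb in enumerate(codebooks):
        # similarity of the i-th latent to every stored value in its codebook
        attn = F.softmax(-beta * (z[:, i:i+1] - cb[None, :]) ** 2, dim=-1)
        # attention-weighted recall of stored values (soft associative memory)
        out.append(attn @ cb)
    return torch.stack(out, dim=-1)   # (batch, d), snapped toward training values

# Illustrative use: 12 latent dims, 16 stored values per dimension
codebooks = [torch.randn(16) for _ in range(12)]
z_ood = torch.randn(4, 12)            # latents from an OOD observation
z_known = associative_retrieve(z_ood, codebooks)
```

With a large beta, an OOD value along one dimension (e.g., an unseen robot color) collapses onto the closest value seen during training, while the other dimensions are untouched.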

A figure showing the architecture and...

Thank you for your recommendation. We will update Figure 1 to include a diagram showing the inference procedure of the associative latent model.

A figure illustrating the learned representation...

Here is the first figure requested by the reviewer: https://drive.google.com/file/d/1P0zkA4s4bRJoNWCNhLWmqKLDOUI-GmbW/view?usp=sharing. We take our model (z_d = 12) trained on the standard, unmodified DMControl environment and perform rollouts on two instances of the color-hard evaluation environment with the same initial conditions. The two plots in the top row show the values that each of the 12 latent dimensions takes on over the course of the trajectory for both environments. The bottom row of images shows the initial state corresponding to each latent trajectory plot in the top row. We find that the latent dimensions that vary wildly are the same in the left and right plots, specifically the ones colored red, pink, green, and brown. As the reviewer predicted, some of the dimensions vary significantly while others are more similar throughout the rollout, likely representing the divide between task-relevant and task-irrelevant features.
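For readers who want to reproduce this kind of diagnostic, a minimal sketch of the rollout-and-plot procedure follows. The names `make_color_hard_env`, `policy`, and `encode` are hypothetical stand-ins for the environment wrapper, trained agent, and latent encoder; the gym-style step API is also an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

def latent_trajectory(env, policy, encode, horizon=200):
    """Roll out `policy` in `env`, recording the 12-dim disentangled
    latent at every step; `encode` maps an observation to a (12,) array."""
    obs, latents = env.reset(), []
    for _ in range(horizon):
        z = encode(obs)
        latents.append(z)
        obs, _, done, _ = env.step(policy(z))
        if done:
            break
    return np.stack(latents)          # (T, 12)

# Two color-hard instances with the same initial conditions (hypothetical API)
trajs = [latent_trajectory(make_color_hard_env(variant=v, seed=0), policy, encode)
         for v in (0, 1)]

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
for ax, traj in zip(axes, trajs):
    ax.plot(traj)                     # one line per latent dimension
    ax.set_xlabel("timestep")
axes[0].set_ylabel("latent value")
plt.show()
```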

Additionally, show if you can use this approach to modify the images...

Regarding the reviewer's second point, we do precisely this in Figure 6 (bottom) and Figure 8 in the appendix (A.3). In this experiment, we take our model trained on the color-hard environment, sample a batch of images with randomized colors, and encode them into our disentangled latent representation. We then hold all but one latent dimension fixed, interpolate that latent dimension from min to max value (x-axis), and visualize the resulting image using the decoder. Each row on the y-axis corresponds to a different interpolated latent dimension. From Figure 6, we can see that interpolating the first two latent dimensions (first two rows) corresponds to changing the color of the scene, while the latent dimension corresponding to the bottom row changes the joint angle of the left knee.
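A minimal sketch of this traversal procedure, assuming a `decoder` that maps latents back to images; the [-1, 1] sweep range is an assumption (in practice the min/max of the learned codebook values would be used):

```python
import torch

@torch.no_grad()
def latent_traversal(decoder, z, dim, n_steps=8, lo=-1.0, hi=1.0):
    """Hold every latent dimension of `z` fixed except `dim`, sweep that
    dimension from `lo` to `hi`, and decode each point into an image.

    decoder : maps a (1, d) latent to an image tensor
    z       : (d,) latent of a single encoded observation
    """
    images = []
    for v in torch.linspace(lo, hi, n_steps):
        z_mod = z.clone()
        z_mod[dim] = v                       # interpolate only this dimension
        images.append(decoder(z_mod[None]))  # all other dimensions stay fixed
    return images

# e.g. sweep dimension 0, reported above to control scene color in Figure 6
frames = latent_traversal(decoder, z, dim=0)
```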

What happens if you use to large of a latent space |z_d|? Does this break things?

This doesn't break the algorithm, but it can cause performance degradation if |z_d| strays too far away from the true number of sources of variation. Several of the latent dimensions will end up corresponding to the same physical attributes. We provide a study on the effects of |z_d| in Appendix Section A.8.

Since each dimension is treated independently...

The number of codebooks corresponds 1:1 to the number of dimensions in the latent space, so the scaling is linear; i.e., if |z_d| is 20, then we have 20 codebooks.
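A minimal sketch of this layout, with one scalar codebook per latent dimension (the codebook size of 16 and the hard nearest-value quantization are illustrative assumptions); the parameter count is d × n_values, i.e., linear in |z_d|:

```python
import torch
import torch.nn as nn

class PerDimensionCodebooks(nn.Module):
    """One small codebook per latent dimension, so the parameter count is
    d * n_values, i.e. linear in |z_d| (20 dims -> 20 codebooks)."""

    def __init__(self, d=20, n_values=16):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(d, n_values))  # (d, n)

    def forward(self, z):                                     # z: (B, d)
        # squared distance from each latent to every stored value in its codebook
        dist = (z.unsqueeze(-1) - self.codes.unsqueeze(0)) ** 2    # (B, d, n)
        idx = dist.argmin(dim=-1, keepdim=True)                    # (B, d, 1)
        codes = self.codes.unsqueeze(0).expand(z.size(0), -1, -1)  # (B, d, n)
        return codes.gather(-1, idx).squeeze(-1)                   # (B, d)
```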

Reviewer Comment
  1. Thank you for clarifying how your approach relates to bisimulation.

  2. Interesting. This looks promising. Suggestions: plot each latent dimension separately to compare them directly, so you will have 12 plots with 2 lines each rather than 2 plots with 12 lines; this will make the figure easier to interpret. Also, it isn't clear from the figure whether you already do this, but take the same sequence of actions in each environment so that the states are identical and one-to-one comparable. Otherwise, the latent representations may diverge simply because the trajectories diverge.

  3. Oh, awesome. Perhaps change the caption and labeling of the figure to make this more clear as I did not realize this on my first pass.

  4. Devising some method for determining this hyperparameter, other than a brute-force search, would be a good direction for future work.

Author Comment

Thank you for responding to our rebuttal and for the suggestions on the plot. Here is the updated figure: https://drive.google.com/file/d/1btjbLQbaX1VaM3i8iLaFhVbj_PiSshoF/view?usp=sharing. We decided to create two rows of 12 plots because the trajectories of some of the latent dimensions match to the point where they may be indistinguishable. The left Walker image corresponds to the initial state of the top row of latent trajectories, while the right Walker image corresponds to the bottom row. Both environments are rolled out using the same actions as the reviewer suggested. Again, we find that many of the latent dimensions are very similar (if not the same), while some vary significantly. Regarding the reviewer's other comments:

  1. We appreciate the feedback on the captions and will improve them in the final version of the paper.
  2. We agree!
Review (Rating: 3)

This paper proposes Associative Latent DisentAnglement (ALDA), which builds on standard off-policy RL toward zero-shot generalization. It learns a disentangled representation from the training data and then uses an associative memory model to recover data points in the original training distribution given OOD data. The authors also prove that data augmentation methods can be considered a weak form of disentanglement. Experiments show that the proposed method outperforms most baselines.

Questions For Authors

The authors can consider replying to my concerns mentioned in previous parts. I am not an expert in the area of RL, so feel free to point them out if I have any misunderstandings.

Claims And Evidence

I think all the claims are well supported.

Methods And Evaluation Criteria

  1. The proposed algorithm is simple and well-motivated. The Association strategy is also elegant and effective.

  2. However, I have some concerns about its novelty. Although disentanglement is highlighted throughout the paper, the method looks like a straightforward application of the existing QLAE algorithm. There are no specific adaptations for the vision-based RL setting. In particular, the authors do not perform any disentanglement in the temporal domain, which makes it impossible to analyze dynamic cues in the visual inputs despite their importance in many RL tasks.

Theoretical Claims

The authors provide a proof that data augmentation is a weak disentanglement of the latent space. This provides good insight by connecting previous literature to this work, and also gives a good motivation for learning a disentangled representation.

I do not see any mistakes in the proof.

Experimental Design And Analysis

  1. It is great that the authors conduct extensive experiments to compare with multiple baselines on different tasks.

  2. I am not sure whether these experiment setups are common knowledge in this field, but I would encourage the authors to explain some settings at least in the appendix. I am not sure whether each experiment is in-distribution or out-of-distribution. If they are OOD, how are they OOD? Do they have OOD visual appearances or dynamics? Is the OOD interpolation or extrapolation?

  3. Besides, it would also be helpful to provide some failure cases to show the limitation of the OOD generalization.

  4. Since disentanglement is an emphasis of this paper, it would be great to have some analysis of how the representation is disentangled and, if possible, the physical meaning of each component.

Supplementary Material

I have read the supplementary material.

Relation To Broader Literature

There is a concern about novelty mentioned in "Methods And Evaluation Criteria".

Essential References Not Discussed

[1] shares a very similar insight with this paper about disentangled representations for generalization. Since this is a key contribution of the proposed algorithm, I believe the authors should include some discussion of [1].

[1] Wu, Zheng, et al. "Zero-shot policy transfer with disentangled task representation of meta-reinforcement learning." 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023.

Other Strengths And Weaknesses

Please refer to the previous parts.

Other Comments Or Suggestions

Please refer to the previous parts.

Author Response

Thank you for your in-depth feedback and for providing the additional reference. We will add it to the related works and discussion sections for the camera-ready version of our manuscript. We respond to individual comments and concerns below.

However, I have some concerns in its novelty...

As the reviewer correctly pointed out, self-supervised disentanglement methods currently do not provide a way to extract or disentangle temporal information. To address this, we take a fixed window of consecutive image observations, encode each observation separately into disentangled latent embeddings, and then extract temporal information via a 1D convolutional network (see Figure 2), which is one of our novel contributions.
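A minimal sketch of this scheme, assuming an `encoder` that maps a single image to a d-dimensional disentangled latent; the window length and feature sizes are illustrative, not the paper's exact values:

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """Encode each frame in a window independently into a disentangled
    latent, then extract temporal information with a 1-D convolution."""

    def __init__(self, encoder, d=12, window=3, out_dim=64):
        super().__init__()
        self.encoder = encoder                     # image -> (d,) latent
        self.conv = nn.Conv1d(d, out_dim, kernel_size=window)

    def forward(self, frames):                     # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1))     # (B*T, d): per-frame encoding
        z = z.view(B, T, -1).transpose(1, 2)       # (B, d, T) for Conv1d
        return self.conv(z).flatten(1)             # temporal features for actor/critic
```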

Our other novel contributions are as follows:

  • We show that the latent model in QLAE is equivalent to a Hopfield network with fixed, predetermined memories when the quantization loss is removed. We replace this with the attention mechanism used in modern Hopfield networks, boosting performance over standard QLAE.

  • Prior RL disentanglement methods, such as DARLA, use a two-stage approach: disentangling latents with random policy data and then training an optimal policy on the fixed representation. This approach is suboptimal because (a) the random agent may not explore the full state space, and (b) critic gradients cannot backpropagate to the latent space or encoder, which is critical for good performance. Our framework jointly disentangles the latent space and trains the policy while allowing critic gradients to update the latent model (see the sketch below), leading to significantly better performance than DARLA.
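A minimal sketch of the joint update described in the second bullet; the loss names, weighting, and optimizer setup are illustrative assumptions, not the paper's exact objective:

```python
import torch

def joint_update(encoder, decoder, critic, batch, model_opt, critic_opt, lam=1.0):
    """One joint step: reconstruction keeps all information in the latents,
    while the critic loss also backpropagates into the encoder.
    `model_opt` is assumed to cover the encoder and decoder parameters."""
    obs, action, target_q = batch
    z = encoder(obs)

    recon_loss = ((decoder(z) - obs) ** 2).mean()              # retain environment info
    critic_loss = ((critic(z, action) - target_q) ** 2).mean()

    # Unlike a two-stage scheme, critic gradients reach the encoder here.
    loss = recon_loss + lam * critic_loss
    model_opt.zero_grad(); critic_opt.zero_grad()
    loss.backward()
    model_opt.step(); critic_opt.step()
    return float(loss)
```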

I am not sure whether these experiment setups are common knowledge in this field, but I would encourage the authors to explain some settings at least in the appendix...

We will add a summary of the DMControl Generalization Benchmark (DMCGB) to the appendix in the final, camera-ready version. For the reviewer’s reference, DMCGB is a wrapper over the standard DMControl (DMC) benchmark, which focuses on optimal control. DMCGB introduces visual distribution shifts to assess zero-shot generalization of agents trained on high-dimensional image observations. The “in-distribution” (or “training” environment as we refer to it in the paper) is the standard DMC benchmark, which emits unmodified image observations. The “color-hard” environment is an OOD setting that perturbs colors randomly on reset, while the DistractingCS environment applies camera jitter and plays random videos from a pre-recorded dataset. Both environments modify only visuals, leaving task dynamics unchanged. Our method trains solely on the unmodified DMC environment and is periodically evaluated on DMCGB environments, testing extrapolative generalization. In contrast, methods like SVEA apply random transformations such as overlaying images from the Places Dataset containing 10 million real world images, or applying random convolutions which change the colors in the scene, making their generalization results more representative of interpolative generalization.
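Schematically, the protocol looks like the following sketch; `make_dmc`, `make_dmcgb`, `train_step`, `evaluate`, and `agent` are hypothetical names, not DMCGB's actual API:

```python
# Train only on the unmodified DMC environment; periodically evaluate
# zero-shot on the DMCGB visual-shift environments.
train_env = make_dmc("walker", "walk")                      # in-distribution
eval_envs = {mode: make_dmcgb("walker", "walk", mode=mode)
             for mode in ("color_hard", "distracting_cs")}  # visual shifts only

total_steps, eval_every = 500_000, 10_000
for step in range(1, total_steps + 1):
    train_step(agent, train_env)              # training never sees augmented frames
    if step % eval_every == 0:
        for mode, env in eval_envs.items():
            print(step, mode, evaluate(agent, env))   # zero-shot episode reward
```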

Besides, it would also be helpful to provide some failure cases to show the limitation of the OOD generalization.

The results on the DistractingCS environment highlight the limitations of our method’s OOD generalization. While our performance is on par with SVEA, an ideal model would distinguish between the distracting background video and the agent in the foreground, resulting in minimal performance loss. The observed performance drop suggests room for improvement, making this a promising direction for future research.

Since disentanglement is an emphasis of this paper, it is great to have some analysis...

The representation is disentangled such that each latent dimension corresponds to one unique aspect of the image. As an example, in the Walker2D task, a given dimension could be the color of the robot, one of the robot's joint angles, the floor, the sky, etc. We qualitatively show the physical meanings of some of the latent dimensions in the "latent traversal plots" in Figure 6 in the main paper and in Appendix Section A.3. In these experiments, we sample a batch of images, hold all but one latent dimension fixed, interpolate the remaining dimension from min to max value (x-axis), and generate the resulting images using the decoder. The y-axis (rows) shows interpolations of different dimensions of the latent space. This allows us to see the physical meaning of each latent dimension, and indeed we find that the latent variables correspond to unique attributes such as a joint angle or the color of the robot/background, as in Figure 6 (bottom) where the colors are randomized.

Reviewer Comment

I appreciate the authors' efforts in writing this rebuttal. It resolves my concerns, so I will keep my score.

Author Comment

We appreciate the reviewer for taking time to read our rebuttal and finalize their score. Given that the discussion deadline is approaching, we look forward to hearing back from all reviewers on their final scores and are happy to answer any remaining questions / concerns.

Review (Rating: 3)

The paper introduces Associative Latent Disentanglement (ALDA), an approach to zero-shot generalization in vision-based reinforcement learning (RL) without relying on data augmentation. ALDA leverages disentangled representations and associative memory mechanisms to enable RL agents to generalize to novel environments by factorizing latent spaces into modular components, allowing for independent adaptation of task-relevant and task-irrelevant features.

Questions For Authors

  1. How does ALDA perform if the associative memory component is removed or replaced with a simpler alternative?
  2. Have you tested ALDA on more complex or real-world RL tasks beyond DeepMind Control Suite? How well does it scale?

Claims And Evidence

The claims in the paper are generally supported by clear and convincing evidence.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria are well-aligned with the problem of zero-shot generalization in vision-based RL.

Theoretical Claims

The theoretical claims, particularly the connection between data augmentation and weak disentanglement, appear well-reasoned.

Experimental Design And Analysis

The experimental design is generally sound, with well-chosen benchmarks, multiple baselines, and ablation studies. The inclusion of latent traversals and the β-study strengthens the analysis.

Supplementary Material

The supplementary material includes additional latent traversal visualizations, β-study ablations, framestack comparisons, and proof details.

Relation To Broader Literature

The paper builds on existing work in vision-based RL, disentangled representation learning, and associative memory. It extends prior research on disentanglement in RL (e.g., DARLA) by integrating modern Hopfield networks for associative memory.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

Strengths:

  • The integration of disentangled representation learning with associative memory is novel in the context of vision-based RL. The theoretical perspective connecting data augmentation to weak disentanglement is insightful.
  • The work addresses a crucial challenge in RL generalization, providing a potential alternative to data augmentation that could improve scalability and efficiency.
  • Strong baseline comparisons, ablations, and visualization techniques support the claims.

Weaknesses:

  • Some theoretical discussions, particularly regarding disentanglement and associative memory mechanisms, could be more clearly explained for a broader audience.
  • While ablations exist, a more controlled comparison of ALDA with and without associative memory would further clarify its unique benefits.

Other Comments Or Suggestions

N/A

Author Response

Thank you for your feedback and suggestions on how we could improve the paper's clarity. We respond to individual comments and questions below.

Some theoretical discussions, particularly regarding disentanglement and associative memory mechanisms, could be more clearly explained for a broader audience. While ablations exist, a more controlled comparison of ALDA with and without associative memory would further clarify its unique benefits.

Because association is implicit in the dynamics of the latent model that QLAE uses to perform disentanglement, there is no way to remove just the associative part of ALDA and perform an experiment. The closest comparison we can do is the comparison between QLAE and BioAE presented in Figure 3. Like BioAE, QLAE uses activation energy minimization as an auxiliary objective in order to disentangle the latent representation. The only other differences between the two are that BioAE does not have an associative latent model and that BioAE also enforces nonnegative activations, so it essentially functions as ALDA without associative memory. As per reviewer N37F's suggestion, we plan to update Figure 1 to include a diagram of the inference procedure of the associative latent model so that the role of association in our method is more clearly conveyed.

How does ALDA perform if the associative memory component is removed or replaced with a simpler alternative?

ALDA without associative memory essentially functions as DARLA (see Figure 5 for comparison) or BioAE (see Figure 3 for comparison), which are purely disentanglement methods, both of which perform worse at visual generalization compared to ALDA. For more details, please see our response above.

Have you tested ALDA on more complex or real-world RL tasks beyond DeepMind Control Suite? How well does it scale?

Please see our response to Reviewer Tie6 under "The environments the authors use are toy control environments...".

Reviewer Comment

Thank you for the clarification. I will maintain my rating.

Review (Rating: 3)

The authors present ALDA - an approach for training disentangled representations along with off-policy learning for OOD generalization.

They build upon the existing QLAE-based latent model, a SOTA disentanglement method in which each latent space dimension has its own codebook.

They prove that data augmentation is weak disentanglement and derive a novel loss function for simultaneous representation and policy training. For temporal sequences, they feed batches of consecutive observations to the model, pass the resulting latents through a 1-D convolution for the actor-critic networks, and perform batch reconstruction directly on the latent space. They compare DARLA, SAC+AE, RePo, and SVEA with ALDA on the DeepMind Control Generalization Benchmark for the "color hard" and "distracting cs" cases.

ALDA beats all baselines except SVEA which uses data augmentation. The authors argue that the test set visuals are also covered with external data augmentation in SVEA, thereby not showing true generalization performance. The authors also present interesting visuals to showcase disentanglement of their model using the produced latents.

Update After Rebuttal

Thanks to the authors for the rebuttal. I enjoyed reading the rebuttal, the additional experiments and the arguments made in the rebuttal. The paper has its merits. I think the idea that the authors present is interesting.

Questions For Authors

  • How easy or hard is it to transfer the approach to on-policy algorithms?

Claims And Evidence

  • Data augmentation is weak disentanglement: I think this is a fair idea and the authors discuss a theoretical proof in detail.
  • Data augmentation requires larger models, more training data, longer training times, and has greater training instability: While the authors discuss this in the introduction, it would be nice to see this discussed for SVEA vs ALDA, since that would be an interesting statistic.
  • If a data-driven model can generalize better with less data, then it will scale better with more data: This is a fair point, and the authors do provide some reasoning behind it based on their results with ALDA. However, if this is actually true, why did the authors not try SVEA + ALDA (i.e. data augmentation and disentanglement together)? Is it hard to implement such an approach? If not, this would directly prove the proposition. In a world where data is cheap to generate/obtain, there is no reason not to use data augmentation.

Methods And Evaluation Criteria

Methods

  • The authors train the ALDA approach on four tasks from the DeepMind Control Suite and evaluate on the "color hard" and "distracting cs" environments. This is a fair way to evaluate the method. However, the authors should look into potentially harder tasks (e.g. navigation/rearrangement/manipulation). The environments used are toy control environments, far from real-world settings. There might be other benchmarks the authors could explore.
  • The baselines are of varied kinds including data augmentation, disentanglement, etc. This provides a nice overview/comparison of their approach with other approaches.

Evaluation Criteria: They use episode reward for comparison, which is standard in control tasks. The authors mention that it is not possible to directly evaluate disentanglement, and therefore show qualitative examples in Fig 6, which makes sense.

Theoretical Claims

  • The authors prove a theorem showing that data augmentation is weak disentanglement of the latent space. The discussion is sound, and shows that in order to achieve a latent representation relevant only to the task, we would have to gather data from all task-irrelevant sources, which would be unrealistic in the real world.
  • There is also discussion in Section 4 on how the loss function for ALDA is created based on theory on attention-based Hopfield networks and QLAE dynamics. I have a high level intuition of this idea, but I am not entirely certain on the correctness of the entire discussion.
  • There is a set of proofs in appendix too, which I did not read in detail.

Experimental Design And Analysis

  • They show that QLAE achieves better performance over the course of training compared to BioAE (another disentanglement method), showing better OOD generalization.
  • They analyze the results from various training and representation strategies in Figure 5, discussing the performance of ALDA against other approaches.
  • The analysis of different latents in Figure 6 is also interesting and showcases disentanglement.

Supplementary Material

I skimmed over it, but did not read it in detail. The authors discuss additional proofs, ablations such as hyperparameters and frame-stacking instead of batching, and other implementation details.

Relation To Broader Literature

I think the approach is overall interesting. However, the authors could present results on more environments, tasks for a comprehensive study of the approach. Otherwise, if this approach is only applicable/useful in toy control problems, within a limited number of scenarios, then it might not have a lot of impact in the field.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

A well-written paper overall. I like the motivation, the buildup of proofs, and the choice of baselines. The authors could have tried more benchmarks and sets of problems, however.

Other Comments Or Suggestions

  • Line 96 - incorrect quotes around "weak"
  • Line 145 - incorrect quotes around "random convolution"
  • Line 142, Col2 - incorrect quotes around "irrelevant"
  • Line 157 - don't -> do not
  • Line 295 - we've -> we have
  • Line 290, Col 2 : incorrect quotes around "color hard"
  • Inconsistent formatting of color hard and distracting cs throughout the paper.
  • Line 410 - Incorrect quotes
  • Line 418 - Incorrect quotes
  • Line 428 - That's, isn't - informal usage.
Author Response

Thank you for your comments and feedback on the paper. We will fix the grammar errors for the camera-ready version of the manuscript. Regarding your questions and concerns, we respond to each one individually below.

I think it would be nice to discuss this for SVEA vs ALDA since that would be an interesting statistic to see.

The type of data augmentation that’s applied can affect the performance, stability, and sample complexity of the underlying RL algorithm. As noted in the SVEA paper, “More recently, extensive studies on data augmentation have been conducted with RL, and conclude that, while small random crops and translations can improve sample efficiency, most data augmentations decrease sample efficiency and cause divergence”. To explore this, we trained SVEA directly on the color-easy evaluation environment in DMControl. The color-easy environment randomizes the agent, sky, and background colors on reset, but not to the extreme RGB values that the color-hard environment does. This form of augmentation, often used with lighting randomizations for Sim2Real RL deployment [1], also helps assess generalization to OOD visual shifts, since in this experiment the DistractingCS and to an extent color-hard environments will be OOD with respect to the training data.

We compared ALDA, standard SVEA, and SVEA (color-easy) on the "Walker Walk" task here: https://drive.google.com/file/d/1hwFX6glI8-IW6i4vGqDSqo-MuCzDo5F2/view?usp=sharing. SVEA (color-easy) underperformed compared to both vanilla SVEA and ALDA, particularly in the training and color-hard environments. We suspect that the diversity of the 10 million real-world images from the Places Dataset is crucial for SVEA’s generalization and training stability, ensuring that evaluation environments remain in-distribution, or at least well within the support of the training distribution.

[1] "Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai."

Why did the authors not try SVEA + ALDA?...

ALDA aims to factorize the image distribution into latent variables, and random overlays could disrupt this by obfuscating the underlying structure. What we wish to convey is that, rather than solely relying on data augmentations or brute-forcing the generalization problem by collecting massive datasets, models that learn the underlying structure from fewer examples can allocate remaining compute/data budgets to other tasks. For instance, if a robot agent can learn SO(2) invariance from a single object rotated in various ways and tease out the notion of rotational invariance, then it should be able to generalize rotational invariance to other objects, removing the need for exhaustive data collection on all possible orientations of all possible objects. Unlike computer vision or language data, robot data is more difficult to collect and not as widely available. As of now, the field is allocating a significant amount of data collection effort to viewpoint, color, lighting, background, [...] randomizations, but if we can alleviate this on the model/architecture side, then those efforts can be spent elsewhere.

In instances where data is cheap or large datasets are readily available, we completely agree with the reviewer that the data should be leveraged. However, solely relying on data may not be sufficient to achieve true generalization, and may instead be obfuscating deeper issues within current robot learning methods. In the SO(2) invariance example, random rotations in CV or viewpoint variations in robotics may not truly be capturing the SO(n) group given current architectures [1, 2], indicating that data alone is insufficient for solving generalization.

[1] "Progress and limitations of deep networks to recognize objects in unusual poses."

[2] "On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory."

The environments the authors use are toy control environments...

We agree that exploring this method in more complex environments would be beneficial and are working on extending our approach to benchmarks like Sim2Real transfer of manipulation policies trained via behavior cloning as part of a separate investigation. However, the focus of this work is to explore whether combining association with latent disentanglement enables zero-shot generalization—an idea that has not been studied before. To that end, we chose RL as the driving optimizer and the DMControl benchmark so that we can study visual generalization in isolation without worrying about the complexities of Sim2Real, harder tasks, real hardware, etc. We also maintain that visual distribution shifts remain a challenging problem to solve, regardless of the difficulty of the underlying task. Finally, recent works addressing visual generalization in RL, many of which we included as baselines, primarily evaluate on DMControl. Extending our method and all baselines to other benchmarks would be beyond the scope of this study.

审稿人评论

Thanks to the authors for the rebuttal. I enjoyed reading the rebuttal, the additional experiments and the arguments made in the rebuttal. I will retain my score.

Final Decision

This paper addresses zero-shot generalization in RL, leveraging disentangled / composable representations in a latent space learned using the QLAE algorithm. The subject of generalization (zero-shot or otherwise) is timely in RL, and it's nice to see this being directly addressed through representation learning. The comparison to and analysis of data augmentation is good, as this is a common technique for improving generalization, and the conclusions are convincing.

There were some concerns about clarity, but these seem to have been sufficiently addressed. Some additional concerns were raised with respect to some of the claims about disentanglement in the model (and in the data augmentation experiments), but I'm satisfied with the answers, and the reviewers did not raise any additional concerns.

I therefore recommend the paper is accepted to the conference.