PaperHub
Overall rating: 5.8 / 10
Decision: Rejected · 4 reviewers
Ratings: 6, 6, 5, 6 (min 5, max 6, std 0.4)
Confidence: 3.5 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We propose a practical two-step approach that combines data augmentation and synthetic data generation to address generalization challenges in vision-based offline reinforcement learning.

Abstract

Offline reinforcement learning (RL) offers a promising framework for training agents using pre-collected datasets without the need for further environment interaction. However, policies trained on offline data often struggle to generalise due to limited exposure to diverse states. The complexity of visual data introduces additional challenges such as noise, distractions, and spurious correlations, which can misguide the policy and increase the risk of overfitting if the training data is not sufficiently diverse. Indeed, this makes it challenging to leverage vision-based offline data in training robust agents that can generalize to unseen environments. To solve this problem, we propose a simple approach—generating additional synthetic data. We propose a two-step process, first $augmenting$ the originally collected offline data to improve zero-shot generalization by introducing diversity, then using a diffusion model to $generate$ additional data in latent space. We test our method across both continuous action spaces (Visual D4RL) and discrete action spaces (Procgen), demonstrating that it significantly improves generalization without requiring any algorithmic changes to existing model-free offline RL methods. We show that our method not only increases the diversity of the training data but also significantly reduces the generalization gap at test time while maintaining computational efficiency. We believe this approach could fuel additional progress in generating synthetic data to train more general agents in the future.
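To make the two-step recipe concrete, here is a minimal sketch of the pipeline described in the abstract. The function and parameter names (`augment_batch`, `encode`, `diffusion`, `upsample_factor`) are hypothetical placeholders for illustration, not the authors' released code:

```python
import numpy as np

def augment_batch(obs: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Step 1 (placeholder): apply randomly sampled visual augmentations
    (e.g. rotation, color jitter, cutout) to a batch of image observations."""
    return obs  # a real implementation would transform the pixels here

def build_training_latents(offline_obs, encode, diffusion, rng, upsample_factor=2):
    # Step 1: augment the original offline observations to add visual diversity.
    augmented = augment_batch(offline_obs, rng)
    # Work in a compact latent space so generation stays computationally cheap.
    latents = encode(augmented)
    # Step 2: fit a diffusion model on the augmented latents, then sample
    # additional synthetic latents to upsample the dataset.
    diffusion.fit(latents)
    synthetic = diffusion.sample(num_samples=upsample_factor * len(latents))
    # Any model-free offline RL algorithm is then trained on real + synthetic data.
    return np.concatenate([latents, synthetic], axis=0)
```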
Keywords
Offline Reinforcement Learning · Generalization · Data Augmentation · Synthetic Data Generation

Reviews and Discussion

Review
Rating: 6

The authors propose a novel data augmentation method to enhance generalization in an offline RL setting. By training a diffusion model on a set of augmented latent features, the model can subsequently generate additional latent data for offline RL training. The paper demonstrates that the distribution of the augmented data more closely aligns with the evaluation data distribution, resulting in improved generalization performance.

Strengths

  1. The paper proposes a novel approach to using diffusion models to generate new data for image-based RL. Augmenting in the latent space helps avoid the overwhelming computational costs associated with both training and inference when using a diffusion model directly on image level.
  2. A comprehensive experimental analysis is provided.
  3. The paper is clearly written and easy to follow.

Weaknesses

  1. On line 246, I noticed that the Augmented dataset has the same size as the Baseline dataset, whereas the Augmented Upsampled dataset is larger than the Baseline dataset. This raises the question of whether the performance gain of the Augmented Upsampled dataset over the Augmented dataset is primarily due to the increased amount of augmented data used in training. Given that the diffusion model is trained on the augmented data, the data it generates may follow a similar distribution to the augmented data. So, is there a significant difference between using data augmentation to increase the dataset size versus using a diffusion model to generate additional data? Conducting an additional experiment where the size of the Augmented dataset is increased to match the Augmented Upsampled dataset would help isolate the potential benefits of the diffusion model's data generation.
  2. Following the discussion above, the performance of this method might highly depend on the data augmentation used in the first phase. The types of the data augmentation actually decide the data distribution generated by the trained diffusion model. Have you run any ablation study on this?
  3. Minor: typos on line 137 ("invrease") and line 361 ("tecnic").

Questions

  1. On line 150, the types of data augmentation used are listed. Could you please explain why random shift, a common data augmentation used in image-based RL, is not included here?
  2. On line 169, both the state s and state s' are augmented by a function called Augment. Are they augmented by the same image transformation or a randomly sampled image transformation?
  3. If I understand correctly, FDD in section 5.5.1 (line 361) refers to including 5% of data that is close to the evaluation distracting dataset. Is it possible that, by incorporating a small amount of data from the Fixed Distraction Dataset along with the diffusion upsampling process, we could achieve relatively good performance even without the initial data augmentation phase? This small amount of data could be enlarged by the diffusion model. It would be interesting to test this hypothesis by comparing the performance of models trained with different percentages of FDD and diffusion upsampling against the full method.
Comment

Augmented dataset has the same size as the Baseline dataset

We appreciate the reviewer’s positive feedback on the novelty of our approach, the clarity of our writing, and the comprehensive experimental analysis. Thank you for raising this important question about dataset size and the distribution of upsampled augmented data compared to the original augmented data. Our upsampled dataset extends the augmented data distribution to better align with the testing data, leveraging the broader diversity introduced by upsampling. While the diffusion model generates data based on the augmented data distribution, it also expands diversity beyond the original distribution, as shown in Figure 5 of the SynthER paper. Our JS divergence analysis supports this observation, demonstrating that the upsampled data moves closer to the testing data distribution while increasing overall diversity. At its core, our method integrates visual augmentation with upsampling into a unified approach to improve generalization when additional data is unavailable. Investigating the impact of dataset size by comparing pure augmentation with our combined method using equal-sized datasets would indeed be a valuable direction for future work. However, our current focus is on presenting a simple and effective solution for scenarios with limited data, where enhancing diversity through augmentation and upsampling is crucial.
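For reference, an alignment check of this kind can be approximated by comparing per-dimension histograms of the latent features. The sketch below illustrates the idea under that assumption; it is not the authors' exact analysis code:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_js_divergence(latents_a: np.ndarray, latents_b: np.ndarray, bins: int = 50) -> float:
    """Average per-dimension Jensen-Shannon divergence between two latent datasets."""
    divs = []
    for d in range(latents_a.shape[1]):
        lo = min(latents_a[:, d].min(), latents_b[:, d].min())
        hi = max(latents_a[:, d].max(), latents_b[:, d].max())
        p, _ = np.histogram(latents_a[:, d], bins=bins, range=(lo, hi))
        q, _ = np.histogram(latents_b[:, d], bins=bins, range=(lo, hi))
        # jensenshannon returns the JS distance; square it to get the divergence.
        divs.append(jensenshannon(p + 1e-12, q + 1e-12) ** 2)
    return float(np.mean(divs))
```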

the performance of this method might highly depend on the data augmentation used in the first phase

Thank you for pointing out this complementary aspect to the earlier discussion. To improve generalization, we began with all augmentations proposed by RAD [1] and systematically refined them. We eliminated augmentations that caused training instability and iteratively narrowed down to those most impactful on generalization performance. Through this process, we observed that both environments favored similar augmentations, and we fine-tuned their parameters to maximize stability and effectiveness (as detailed in Supplementary Section B.0.3). Our focus was on achieving stable training outcomes while balancing computational constraints, as this work was conducted on a single GPU academic setting. While we performed ablations during the selection process, we opted not to include an exhaustive set of results in the paper to maintain focus on the simplicity and effectiveness of combining visual augmentation with diffusion-based upsampling for generalization. We believe this strikes a balance between demonstrating the method’s impact and avoiding overemphasis on augmentation-specific studies in offline RL. We agree that a deeper exploration of the interplay between augmentations and diffusion-generated data is a valuable direction for future work.

[1] Laskin et al. Reinforcement Learning with Augmented Data. NeurIPS 2020

typos

Thank you for pointing out the minor typos. We corrected "invrease" (line 137) and "tecnic" (line 361) in the revised version of the paper.

Are they augmented by the same image transformation or a randomly sampled image transformation?

Thank you for asking for clarification on the transformations applied to states. We realized that we had not included details about this, and we have updated Section B.0.3 in the supplementary material. To clarify, the augmentation function applied to states s and s′ uses independently sampled transformations from the same set of augmentations, each selected with equal probability. Within each state, the same transformation is consistently applied across all images in the stack to preserve temporal and spatial relationships, as is also done in RAD. This ensures diversity across different states while maintaining structural integrity within each state.
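A minimal sketch of this sampling scheme follows; the two transforms are stand-in placeholders for the actual augmentation set listed in Supplementary Section B.0.3:

```python
import random
import numpy as np

def rotate(frame: np.ndarray) -> np.ndarray:
    return np.rot90(frame)   # stand-in for the real rotation augmentation

def color_jitter(frame: np.ndarray) -> np.ndarray:
    return frame * 0.9       # stand-in for the real color augmentation

AUGMENTATIONS = [rotate, color_jitter]

def augment_state(frame_stack: np.ndarray, rng: random.Random) -> np.ndarray:
    """Draw one transform per state (equal probability) and apply it to every
    frame in the stack, preserving temporal/spatial consistency within the state."""
    transform = rng.choice(AUGMENTATIONS)
    return np.stack([transform(f) for f in frame_stack])

def augment_transition(s: np.ndarray, s_next: np.ndarray, rng: random.Random):
    # s and s' each receive an independently sampled transform.
    return augment_state(s, rng), augment_state(s_next, rng)
```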

Is it possible that, by incorporating a small amount of data from the Fixed Distraction Dataset along with the diffusion upsampling process?

We thank the reviewer for highlighting this intriguing aspect of our findings in section 5.1.1. Incorporating a small amount of data from the Fixed Distraction Dataset (FDD) alongside the diffusion upsampling process—and evaluating its performance without the initial data augmentation phase—is indeed an intriguing hypothesis. Our primary focus was to demonstrate the combination of augmentation and upsampling as a cohesive method, which is why we kept our approach consistent with the augmented and upsampled (ours) pipeline to ensure clarity and alignment with our key message. That said, we acknowledge the potential value of exploring this direction, which we have identified as an open problem in Section 5.1.1, encouraging further investigation by the research community. If the reviewer believes this analysis would provide significant additional insights and help improve the paper’s impact, we are happy to include an ablation study in the revised version to examine this hypothesis further.

Comment

Thanks for your effort on the response. I believe that comparing pure image augmentation with diffusion-augmentation under the same-size setting, and analyzing the choice of image augmentation in the first stage could enhance the paper. I will maintain my current score.

Comment

We appreciate your suggestion and review of our responses. As mentioned in our initial response, our focus is on synthetic data generation rather than a thorough investigation of augmentation techniques. However, we emphasize that our method combines visual augmentation with upsampling to efficiently increase diversity and generalization in data-limited settings.

As discussed earlier, we methodically refined the augmentation decisions for stability and efficacy. Our JS divergence study demonstrates how the upsampled data better aligns with the testing distribution while improving diversity. We will consider your insightful recommendation to compare pure image augmentation with diffusion-based augmentation under the same-size setting and to analyze the choice of image augmentation in future extension work.

Review
Rating: 6

This paper proposes a two-step approach to improve generalization in offline reinforcement learning with visual inputs. By combining data augmentation with diffusion model-based synthetic data generation in latent space, the method enhances training data diversity without modifying existing algorithms. The authors evaluate their approach on both continuous (Visual D4RL) and discrete (Procgen) action spaces, demonstrating significant reduction in generalization gaps while maintaining computational efficiency. Notably, their method is the first to effectively address visual generalization challenges across both continuous and discrete control tasks in offline RL.

Strengths

  1. The paper presents an innovative approach by combining two complementary data augmentation strategies: classic transformations and generative model-based data synthesis. This integration effectively leverages both the reliability of traditional augmentation methods and the diversity potential of generative modeling, providing a more comprehensive solution to the data diversity challenge in offline RL.

  2. The implementation of diffusion model-based data synthesis in latent space, rather than in high-dimensional observation space, demonstrates significant computational efficiency. This design choice makes the approach more practical and scalable while maintaining effectiveness in generating diverse synthetic data.

  3. The paper includes insightful analysis using metrics like Jensen-Shannon divergence to quantify the alignment between training and testing distributions.

Weaknesses


  1. The discussion and analysis of chosen data augmentation techniques in Section 3.2 lacks sufficient depth. The authors should provide empirical evidence for their augmentation choices and properly reference established techniques from online RL literature, such as DrAC[1], SVEA[2], and the comprehensive survey[3]. The current treatment of augmentation strategies is superficial and fails to leverage valuable insights from prior work.

  2. The proposed Generalization Performance metric $G_{\text{perf}} = \frac{T_{\text{test}} - B_{\text{test}}}{B_{\text{train}} - B_{\text{test}}}$ needs better justification. A more straightforward approach would be using $T_{\text{train}}/B_{\text{train}}$ to evaluate training effectiveness, while comparing $B_{\text{test}}/B_{\text{train}}$ with $T_{\text{test}}/T_{\text{train}}$ would provide a more natural measure of generalization capabilities.

  3. The experimental results reveal a critical misalignment with the paper's claimed contribution to "Zero-Shot Visual Generalization." The performance improvements predominantly stem from enhanced training performance rather than improved generalization ability, as evidenced by the persistent generalization gap between training and testing environments. This fundamental disconnect between the empirical results and the paper's main thesis requires substantial clarification and resolution for the work to be considered acceptable.

  4. The paper provides insufficient exploration of diffusion model design choices and their impact on performance, lacking crucial ablation studies on model architecture, hyperparameters, and the relationship between latent space dimensionality and generation effectiveness.

[1] Automatic Data Augmentation for Generalization in Reinforcement Learning, NeurIPS 2021

[2] Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation, NeurIPS 2021

[3] A Comprehensive Survey of Data Augmentation in Visual Reinforcement Learning, Arxiv, 2022

Questions

See weaknesses.

Comment

The discussion and analysis of chosen data augmentation techniques in Section 3.2 lacks sufficient depth

We thank the reviewer for recognizing the innovation in our approach, its computational efficiency, and the value of our JS divergence analysis. However, we respectfully disagree with the reviewer’s concern regarding the lack of sufficient depth in the data augmentation discussion. While we greatly value the contributions of methods like DrAC and SVEA, these methods involve algorithmic changes specific to online RL, which differ from our objective of providing a simple, algorithm-agnostic solution for offline RL with visual observations. Our focus is on achieving non-algorithmic changes to offline RL using simple, effective visual augmentations, as demonstrated by RAD [1], with four specific techniques detailed in Supplementary Section B.0.3. These augmentations, combined with diffusion-based upsampling, are central to our goal of enhancing data diversity and generalization without altering the underlying RL algorithm. To address this further, we expanded the “Related Work” section to clarify these distinctions in more detail and included the missing references among the three suggested by the reviewer.

[1] Laskin et al. Reinforcement Learning with Augmented Data. NeurIPS 2020

The proposed Generalization Performance metric $G_{\text{perf}} = \frac{T_{\text{test}} - B_{\text{test}}}{B_{\text{train}} - B_{\text{test}}}$ needs better justification.

Regarding the generalization metric, its design is grounded in established principles commonly used in reinforcement learning (RL) generalization studies; see Section 2.2 of the original Procgen paper [1]. They use $R_{\text{norm}} = \frac{R - R_{\min}}{R_{\max} - R_{\min}}$, the idea being to normalize with respect to what is possible given the setup. Here $R_{\min}$ is the lowest possible score, i.e., the test performance of the baseline (with poor generalization), and $R_{\max}$ is the highest possible score, i.e., the train performance on the baseline training data, which is the least diverse and easiest to overfit. We added this discussion to the paper (Section 4.3.1), as we agree it appears plucked from thin air in the current draft, so thank you for flagging this.

[1] Cobbe et al. Leveraging Procedural Generation to Benchmark Reinforcement Learning. ICML 2020
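For concreteness, a small worked example of the normalized metric defined above (the numbers are hypothetical):

```python
def generalization_performance(t_test: float, b_test: float, b_train: float) -> float:
    """G_perf = (T_test - B_test) / (B_train - B_test): 0 means no gain over the
    baseline's test score, 1 means the train/test generalization gap is fully closed."""
    return (t_test - b_test) / (b_train - b_test)

# Hypothetical example: the baseline trains to 80 but only scores 20 at test time;
# a method scoring 50 at test time closes half of that gap, so G_perf = 0.5.
print(generalization_performance(t_test=50.0, b_test=20.0, b_train=80.0))  # 0.5
```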

The experimental results reveal a critical misalignment with the paper's claimed contribution to "Zero-Shot Visual Generalization."

Thank you for highlighting this critical aspect of our work on zero-shot generalization. We respectfully disagree with the reviewer’s assertion that the paper does not demonstrate improved zero-shot generalization, as we show this in Procgen (see the aggregate performance added to Table 3). Additionally, we present the FDD approach (Table 2), where we observe an improvement in the generalization gap for the DMC environments. That said, we understand that the improved performance in the original environment in Table 1 (not necessarily a bad thing!) could lead to confusion. We are happy to rephrase the title if you have a recommendation. One proposal could be “Synthetic Data Enables Training Robust Agents from Offline Data,” as our agents perform well across a wide range of settings. We also updated Tables 2 and 3 to include Test/Train and Train−Test results for both environments, aligning with the metrics suggested by the reviewer. Please let us know if this makes more sense now.

insufficient exploration of diffusion model design choices

Thank you for highlighting the need for deeper exploration of diffusion model design choices. Our diffusion model builds directly on SynthER’s design, which allows us to make consistent comparisons with the V-D4RL benchmark and ensures the reproducibility of our results. SynthER’s supplementary material (Section B) already provides extensive ablations on model architecture and denoiser parameters, and our findings closely align with these results. As such, duplicating those ablations in our work would have added redundancy without offering additional insights. Instead, we focused on demonstrating the effectiveness of our combined augmentation and diffusion-based upsampling method in the context of offline RL. To provide clarity, we have added references to SynthER’s ablations in the updated Supplementary Section C.3.1. Additionally, we conducted and included a table on latent space size ablations in the updated supplementary material, addressing a gap not explored in SynthER’s ablation work (Sections C and F). While this analysis was initially omitted for brevity, we now provide it to offer further insights into the relationship between latent space dimensionality and performance, complementing SynthER's findings.

Comment

I appreciate the authors' thorough response and the improvements made to the paper. The clarifications on data augmentation techniques, generalization metrics, and ablation studies have enhanced the technical presentation. The additional results in Tables 2 and 3 provide better evidence for the method's effectiveness.

However, while I acknowledge these improvements and will adjust my score upward, I maintain some reservation about the impact-to-complexity ratio. The performance gains, though positive, appear modest given the complexity of implementing and tuning both the data augmentation pipeline and the diffusion model-based synthesis. Therefore, while the paper makes a valuable contribution, I believe it remains at a borderline level for acceptance.

Thank you for your efforts in addressing the review concerns.

Comment

Dear Reviewer cYiQ,

We have carefully addressed all the points you raised in our rebuttal, and we believe the additions and explanations provided will help in evaluating the paper further. There is not much time left for us to address your additional comments; hence, we kindly ask you to evaluate our rebuttal at your earliest convenience, as your insights are important in ensuring a complete assessment of the work. We value the time you spent in this process and would be pleased to offer any more explanations should they be necessary.

Thank you for your time and consideration.

Authors

Comment

We appreciate your careful review, your acknowledgment of our improvements, and your decision to raise your score. Your comments have been quite helpful in directing important clarifications and additions, including those seen in Tables 2 and 3.

For impact-to-complexity ratio, to the best of our knowledge, this is the first work to effectively implement this practical method in two different kinds of offline RL environments—one with continuous action spaces and one with discrete action spaces. By demonstrating its effectiveness across these settings, we believe our method provides a strong starting point for the research community, aiming to reduce the complexity of selecting augmentation strategies and tuning hyperparameters and making it easier for others to adopt and build upon this approach.

Review
Rating: 5

The paper discusses a novel approach to enhance zero-shot visual generalization in offline reinforcement learning (RL) by integrating data augmentation and diffusion models. The proposed two-step method first augments the original dataset to increase diversity, then employs a diffusion model to generate additional synthetic data in latent space. This approach significantly reduces the generalization gap in both continuous (V-D4RL) and discrete (Procgen) control tasks without altering existing model-free RL algorithms. The results demonstrate improved performance in unseen environments, suggesting that this method could advance the training of more robust agents in offline RL settings.

Strengths

  1. The two-step approach effectively combines data augmentation and diffusion model-based upsampling, significantly reducing the generalization gap in both continuous (V-D4RL) and discrete (Procgen) control tasks. This leads to improved performance in unseen environments without requiring modifications to existing model-free offline RL algorithms.

  2. By augmenting the original dataset and generating synthetic data in the latent space, the method broadens the distribution of training data. This increased diversity helps mitigate overfitting to spurious correlations in visual inputs, making the trained agents more robust to variations in unseen scenarios.

  3. The approach maintains computational efficiency by operating in the latent space rather than the pixel space, allowing for the generation of diverse synthetic data without incurring significant computational costs. This scalability makes it practical for real-world applications in various domains, such as healthcare and robotics.

Weaknesses

Although the method shows promising results in benchmarks like V-D4RL and Procgen, these are controlled environments. It’s unclear how the method would perform in more complex, real-world scenarios where the variety of unseen situations is vastly greater than in benchmark tests.

The effectiveness of the approach depends heavily on specific augmentation techniques like rotation, color jittering, and color cutout. The results may vary significantly if the distribution of unseen environments does not align well with these augmentations.

While the authors claim the diffusion-based data generation is computationally efficient, training and running diffusion models can be resource-intensive. This could be a bottleneck for scaling up the approach to larger datasets or high-resolution visual inputs.

The paper focuses on a two-step process (data augmentation and diffusion model-based upsampling) but does not explore or compare with other generative models (e.g., GANs, VAEs) that could also potentially increase diversity and improve generalization.

While augmentation and synthetic data help generalization, there is a risk that the model may overfit to artificially generated diversity, especially if this data diverges from real-world test distributions.

Questions

How does the proposed approach handle significantly different visual distributions in real-world applications (e.g., new lighting conditions or object appearances)?

What are the specific computational costs associated with diffusion model-based upsampling, especially when scaled to larger datasets or higher-resolution visual inputs?

Has the performance of the approach been tested against other generative methods, such as GANs or VAEs, to assess if they could offer similar improvements with potentially lower computational overhead?

How does the choice of augmentation techniques affect generalization across different types of environments? Would different augmentations be needed for different application domains?

Could there be an overfitting risk associated with heavy reliance on synthetic data? How does the method mitigate this, if at all?

Comment

Thank you for your review, we are pleased to see you appreciate that our method is a simple approach to achieve improved generalization in two distinct domains. It appears your primary concerns relate to scaling beyond the domains shown, which we believe will be challenging for us to answer concretely in this rebuttal. We note that the domains used have been popular for existing works on data augmentation [1, 2] and are still being used by industry labs in recent publications [3].

See specific responses below:

Cost of diffusion modeling: in this paper we are focused on the offline RL setting, where there is a bottleneck on the amount of available data but not necessarily on the amount of time or compute to maximize performance with it. Further, when scaling this method it may be possible to use an open-source foundation model that already has world knowledge. Finally, for what it is worth, this work was done with the extremely constrained computational resources of a single GPU in an academic lab, yet we were able to achieve state-of-the-art performance for visual generalization. We think that is a good sign for scalability!

The GAN and VAE comparison is a great point; however, this comparison was already made in the SynthER paper, and we have no reason to believe it would not hold in a more challenging setting. Please check out Table 1 of https://arxiv.org/pdf/2303.06614: the performance is drastically better for diffusion, which makes sense given that it is now the dominant paradigm in generative modeling. What we are showing here is that it may also make it possible to achieve additional visual generalization benefits if set up correctly with augmented data, which was not obvious to us initially and thus, we believe, is a valuable contribution to the community.

We agree that it is definitely important to make sure the data does not deviate too far from the ground truth distribution. This was shown in the SynthER paper in Figure 5 and note there has been additional work in this space, such as Policy Guided Diffusion [4], which likely improves our synthetic data generation pipeline. The main contribution in this paper is showing these methods can aid visual generalization, which had not been shown before.

[1] Yarats et al. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. ICLR 2021

[2] Laskin et al. Reinforcement Learning with Augmented Data. NeurIPS 2020

[3] Ortiz et al. DMC-VB: A Benchmark for Representation Learning for Control with Visual Distractors. NeurIPS 2024 Datasets and Benchmarks Track

[4] Jackson et al. Policy-Guided Diffusion. NeurIPS 2023 Workshop on Robot Learning

Comment

Dear Reviewer uuZk,

As the December 2nd midnight AoE deadline for questions approaches, we kindly request your review of our rebuttal. We believe we have addressed all your points and hope the clarifications meet your expectations. If so, we would greatly appreciate it if you could reflect this in your evaluation score. Thank you for your time and consideration, and please let us know if further clarification is needed.

Best regards, Authors

Comment

Dear Reviewer uuZk,

We have carefully addressed all the points you raised in our rebuttal, and we believe the additions and explanations provided will help in evaluating the paper further. There is not much time left for us to address your additional comments; hence, we kindly ask you to evaluate our rebuttal at your earliest convenience, as your insights are important in ensuring a complete assessment of the work. We value the time you spent in this process and would be pleased to offer any more explanations should they be necessary.

Thank you for your time and consideration.

Authors

Review
Rating: 6

This paper proposes a novel two-stage method to improve generalization in offline visual RL. The method first introduces data augmentations, then trains a latent-space diffusion model to generate new transitions.

The method is tested on V-D4RL and Offline ProcGen. The experiments show that augmentation and upsampling together greatly improve generalization performance.

Strengths

  • The method is simple and easy to understand;
  • The proposed method improves generalization performance of offline visual RL methods;
  • The method also yields an additional improvement when given a small subset of data with distractions;
  • The method only changes the dataset and therefore does not depend on the particular RL algorithm, so in theory it can be used to improve any offline RL algorithm;

Weaknesses

  • As the authors listed, the method requires tuning of the data augmentation parameters, which limits its applicability.
  • The experiments only include DrQ and CQL. Since this paper deals with extending the data and can be applied to many different methods, it would make the method more compelling if there were more methods like in SynthER, e.g. IQL, TD3+BC, EDAC.
Writing
  • Figures 4, 5, 6 are too large
  • JS divergence heatmaps in the figures throughout the paper are not very informative. In Figures 1b and 6b, I think it should just be a bar plot; heatmaps with just 4 values seem unnecessary. In Figures 4 and 5, to make them more informative, I'd put the exact values on top of the squares;
  • Line 361: "tecnic" -- technique? Although it's incorrect, I like this spelling.

Questions

  • Do I understand correctly that your 'Upsampled' method is akin to SynthER? If that's so, I would put that in parenthesis. If not, could you provide that comparison?
Comment

the method requires tuning of the data augmentation parameters, which limits its applicability.

We sincerely thank you for your valuable feedback and are glad you found our method "simple and easy to understand." We acknowledge that tuning data augmentation techniques and diffusion model parameters can be challenging, as noted in our limitations discussion in Section 7. To address this, we started from all augmentations in RAD [1] and systematically narrowed down the options based on effectiveness and training stability, which also reduced computational cost. For the diffusion model parameters, we started with the hyperparameters proposed in the SynthER paper, which led to a proper balance between generalization and overfitting. To mitigate the time-consuming nature of tuning, we employed JS divergence analysis and distribution visualizations to assess alignment between the upsampled and baseline datasets. Specifically, this was achieved by systematically varying the size of the denoiser network and the number of training steps, allowing us to identify configurations that produced the best alignment. This approach allowed us to reduce the need for repeated RL training while efficiently improving data diversity and generalization. We have updated Supplementary Section B.0.3 to include a detailed explanation.

[1] Laskin et al. Reinforcement Learning with Augmented Data. NeurIPS 2020
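A sketch of the selection loop described above, under the assumption that the denoiser width and training-step budget are the swept parameters; `fit_and_sample` and `js_divergence` are hypothetical callables standing in for the diffusion training/sampling and analysis code:

```python
from itertools import product

def select_diffusion_config(train_latents, fit_and_sample, js_divergence,
                            widths=(128, 256, 512), step_budgets=(50_000, 100_000)):
    """Pick the denoiser width / training-step pair whose synthetic latents best
    align with the reference latents, without rerunning full RL training."""
    best_config, best_div = None, float("inf")
    for width, steps in product(widths, step_budgets):
        synthetic = fit_and_sample(train_latents, width=width, steps=steps)
        div = js_divergence(train_latents, synthetic)
        if div < best_div:
            best_config, best_div = (width, steps), div
    return best_config, best_div
```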

The experiments only include DrQ and CQL. Since this paper deals with extending the data and can be applied to many different methods, it would make the method more compelling if there were more methods like in SynthER, e.g. IQL, TD3+BC, EDAC.

We selected DrQ+BC and CQL to align with the benchmark datasets from V-D4RL and Offline Procgen (Lu et al., 2023a; Mediratta et al., 2024), respectively, to ensure fair comparisons with existing results. DrQ+BC was chosen because its authors highlighted generalization challenges for model-free algorithms on distracting datasets, providing an opportunity to demonstrate how our method effectively addresses these issues. CQL was selected due to its underperformance in offline generalization tasks compared to other algorithms, making it an ideal case to showcase the potential of our method. Note that DrQ+BC is largely the same as TD3+BC but for visual observations. While additional algorithms like IQL and EDAC could be explored in future work, our focus was on demonstrating that our method is algorithm-agnostic and applicable across diverse environments, including both continuous and discrete action spaces, rather than comparing the relative performance of various algorithms.

Do I understand correctly that your 'Upsampled' method is akin to SynthER? If that's so, I would put that in parenthesis. If not, could you provide that comparison?

Thank you for the suggestion. In the Method Section (Subsection 3.1), we have updated the "Upsampling with Diffusion Models" item to explicitly reference SynthER as the method we employed for upsampling.

Response to "Writing"

Thank you for your valuable feedback on the writing and figure presentation. Regarding Figures 4, 5, and 6, we aimed to balance readability and organization by combining multiple charts into single figures to effectively summarize the data. While we recognize the figures are somewhat large, this approach minimizes disruption to the paper's structure. For the heatmaps, we chose this format for concise and quick visual interpretation. Although adding exact values on the squares could provide additional information, it reduced visual clarity due to interference with the color scheme. We opted for heatmaps instead of bar plots to maintain readability but appreciate the suggestion and will explore alternative formats, such as annotated heatmaps, in future work. Finally, we corrected the spelling of "tecnic" to "technique" in the revised version and are glad you liked the original phrasing.

Comment

Dear Reviewer zkst,

We have carefully addressed all the points you raised in our rebuttal, and we believe the additions and explanations provided will help in evaluating the paper further. There is not much time left for us to address your additional comments; hence, we kindly ask you to evaluate our rebuttal at your earliest convenience, as your insights are important in ensuring a complete assessment of the work. We value the time you spent in this process and would be pleased to offer any more explanations should they be necessary.

Thank you for your time and consideration.

Authors

Comment

Thank you for your response! I choose to keep my score unchanged at this time.

Comment

We truly appreciate the reviewers' careful comments and their appreciation of the simplicity, efficiency, and creativity of our proposed approach. We value the recognition of our efforts to enhance generalization in offline RL by combining data augmentation with diffusion model-based upsampling.

We hope we have addressed all your questions and concerns in the updated paper. If there is anything else we can clarify, please let us know. Otherwise, we would be grateful if you could consider increasing your support for our work with a higher score.

AC Meta-Review

Summary: This paper investigates a novel approach to improving generalization in offline visual reinforcement learning (RL), focusing on the integration of data augmentation and latent-space diffusion models. Unlike existing methods, which rely solely on augmentation or specific model modifications, the proposed two-stage method first enhances the diversity of the training data through data augmentation and then employs a diffusion model to generate synthetic transitions in latent space. This strategy addresses visual generalization challenges without modifying existing model-free RL algorithms. The approach is evaluated on two distinct benchmarks: V-D4RL (a continuous control task) and Offline ProcGen (a discrete control task). Empirical results demonstrate that combining data augmentation with latent-space upsampling significantly reduces the generalization gap, leading to improved performance in previously unseen environments.

Strengths and Weaknesses: The reviewers generally recognize that this paper addresses a crucial problem in offline visual RL – the generalization to unseen environments. They also appreciate the simplicity and strong performance of the proposed approach, as well as its independence from specific RL algorithms, making it widely applicable to various offline RL models.

However, they express reservations about the significance of the findings and their practical usefulness. Specifically, the method combines classic data augmentation with diffusion-based data synthesis from SynthER in a straightforward two-stage pipeline. The method's reliance on specific augmentations, along with the need for tuning data augmentation and diffusion model parameters, limits its applicability across diverse domains. Despite promising results in benchmark environments like V-D4RL and Procgen, questions remain about the method's performance in more complex real-world scenarios, where the variety of unseen situations is much greater. There is also a risk of overfitting to synthetic data, especially if the generated data does not align with real-world distributions. Additionally, despite claims of computational efficiency, training and running diffusion models can be resource-intensive, which could hinder scaling the approach to larger datasets or high-resolution inputs. Furthermore, the paper lacks a thorough analysis of diffusion model design choices, such as architecture, hyperparameters, and latent space dimensionality, and their impact on performance, instead largely referencing prior work SynthER. The experimental scope is also seen as too narrow, as the paper only includes DrQ and CQL, and expanding the experiments to include additional methods like IQL, TD3+BC, and EDAC (similar to SynthER) would provide a more comprehensive evaluation.

The authors addressed some of these points during the discussion phase. However, the reviewers remained unconvinced and were not championing the paper. While the study's findings are compelling and highlight the effectiveness of data augmentation and synthesis, there is insufficient evaluation to demonstrate the method's strengths in real-world scenarios. This limits the support for the paper's strong claim that "Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data."

Therefore, the paper is not ready for this ICLR. I encourage the authors to continue this line of work for future submission.

Additional Comments from the Reviewer Discussion

The current recommendation is based on the identified weaknesses, particularly the lack of convincing evidence for the significance and practical usefulness of the proposed method, as well as the absence of a thorough analysis of the design choices behind it.

Final Decision

Reject