PaperHub
Overall: 4.5/10 — Rejected (4 reviewers; min 3, max 6, std 1.5)
Ratings: 6, 6, 3, 3
Confidence: 3.8 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3
ICLR 2025

State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

zero-shot generalization in decision-making tasks to unseen states that are a different (but known) combination of basic objects encountered during training

Abstract

Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.
Keywords
RL generalization · decision making · combinatorial generalization · diffusion model

Reviews and Discussion

Review (Rating: 6)

This paper studies the problem of out-of-combination (OOC) generalization in RL. The authors first define OOC generalization mathematically and show that classical behavioral cloning methods struggle when tested on OOC examples. They then propose a Diffusion model to better tackle OOC generalization, and show that implementing the policy through a Diffusion model conditioned on ground truth information about the environment's objects/entities improves generalization in three environments.

Strengths

  • Clearly defines an interesting problem and identifies a possible solution.
  • Exposition is easy to follow and conceptually simple
  • Proposed method yields considerable improvements in several environments
  • Good ablations to show what parts of the diffusion approach are important.

Weaknesses

While the proposed approach gives important performance improvements in these OOC domains, two potential weaknesses come to mind:

  • The diffusion model requires ground-truth information about the entities/object types that compose the environment's state space to work effectively. This type of information is not always available, and it can be a considerable challenge to learn such representations in an unsupervised manner. Is it possible to show how important having reliable information about the compositional latent variable is in, for instance, the Roundabout environment?
  • A second concern is the complexity and run-time of the algorithm. In the SMACv2 5v5 environment the Diffusion model has a ~10x longer train time than PPO. Maybe there are methods that lie in-between in terms of complexity, and potentially also performance. Are there ways in which the diffusion model could be sped up for faster training?

Some minor comments:

  • In the Roundabout environment, it would be interesting to compare the model to an off-policy algorithm that collects its own training data, like SAC.
  • Limitations can be discussed more extensively. For example, the definition of OOC generalization/support supposes a well-defined set of entities (e.g. cars and bikes). In the real world, object type distinctions are not always clear. Should motorized bikes count as a bike with slightly different dynamics or as a whole new entity? Not a major concern, but could be good to discuss these types of limitations.

Overall I think the paper is good and I'd be happy to increase my score if these concerns are addressed appropriately.

Questions

  • Not sure why corollary 5.1 adds to the exposition? I think simply stating that Diffusion models have been successful at combinatorial generalization in the past is sufficient for motivating this.
Comment

Weakness

Is it possible to show how important having reliable information about the compositional latent variable is in, for instance, the Roundabout environment?

We thank the reviewer for raising this interesting question. We would like to demonstrate the importance of reliable compositional latent variables by comparing the performance of the conditioned diffusion model given random and ground truth latent variables. Results are attached below and we have also included them in Section D.8.3:

|                   | Ground Truth Diffusion | Random Diffusion | PPO         | BC           | CQL          |
| ----------------- | ---------------------- | ---------------- | ----------- | ------------ | ------------ |
| Number of crashes | 19.67 (2.49)           | 24.33 (0.47)     | 36.0 (6.93) | 37.67 (0.47) | 47.33 (8.22) |
| Reward            | 8.81 (0.2)             | 8.1 (0.09)       | 8.19 (0.19) | 7.50 (0.03)  | 7.54 (0.49)  |

Are there ways in which the diffusion model could be sped up for faster training?

[1] and [2] are recent works on faster training of diffusion models for image generation. [1] proposes patch-wise diffusion training: instead of learning the score function on the full-size image, a conditional score function is learned on smaller image patches. [2] identifies that some timesteps are more important (with a larger rate of change) than others during the forward noising process, and speeds up training by making the model focus on these timesteps. We thank the reviewer for raising this concern; incorporating these ideas to make diffusion models for RL more efficient would be interesting future work.

[1] Wang, Zhendong, et al. "Patch diffusion: Faster and more data-efficient training of diffusion models." Advances in neural information processing systems 36 (2024).

[2] Zheng, Tianyi, et al. "Non-uniform timestep sampling: Towards faster diffusion model training." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.
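To make the second idea concrete, here is a minimal sketch of non-uniform timestep sampling for diffusion training. This is our own illustration, not the implementation from [2]: the weighting function (a Gaussian bump centred at `mid * T`) and its parameters are assumptions chosen purely for demonstration.

```python
import numpy as np

def timestep_weights(T, mid=0.6, sharpness=4.0):
    """Hypothetical importance weights over diffusion timesteps.

    Assumption (not from the cited paper): timesteps where the forward
    noising process changes fastest get higher weight; here a smooth
    bump centred at `mid * T` stands in for that importance profile.
    """
    t = np.arange(T)
    w = np.exp(-sharpness * ((t / (T - 1)) - mid) ** 2)
    return w / w.sum()

def sample_timesteps(T, batch_size, rng=None):
    """Draw training timesteps non-uniformly instead of uniformly."""
    rng = rng or np.random.default_rng(0)
    return rng.choice(T, size=batch_size, p=timestep_weights(T))

# Each training batch would then use these timesteps when adding noise,
# concentrating gradient updates on the "important" part of the schedule.
ts = sample_timesteps(T=1000, batch_size=4096)
```

The training loop itself is unchanged; only the `randint` draw of timesteps is replaced by the weighted draw.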

In the Roundabout environment, it would be interesting to compare the model to an off-policy algorithm that collects its own training data, like SAC.

We thank the reviewer for suggesting off-policy algorithms. The results are attached below:

|                   | Diffusion    | SAC          | PPO         | BC           | CQL          |
| ----------------- | ------------ | ------------ | ----------- | ------------ | ------------ |
| Number of crashes | 19.67 (2.49) | 22.33 (0.94) | 36.0 (6.93) | 37.67 (0.47) | 47.33 (8.22) |
| Reward            | 8.81 (0.2)   | 8.61 (0.06)  | 8.19 (0.19) | 7.50 (0.03)  | 7.54 (0.49)  |

Limitations can be discussed more extensively. For example, the definition of OOC generalization/support supposes a well-defined set of entities (e.g. cars and bikes). In the real world, object type distinctions are not always clear. Should motorized bikes count as a bike with slightly different dynamics or as a whole new entity? Not a major concern, but could be good to discuss these types of limitations.

We thank the reviewer for this valuable suggestion. We have added this discussion to the limitations paragraph in the conclusion.

Question

Not sure why corollary 5.1 adds to the exposition? I think simply stating that Diffusion models have been successful at combinatorial generalization in the past is sufficient for motivating this.

We thank the reviewer for suggesting that previous CV work is sufficient motivation. The main purpose of Corollary 5.1 is to provide some intuition for why the diffusion model can sample OOC states. Since the future states of an OOC state are likely still OOC states (the next state of {bike, car} will still be {bike, car}, but with different attributes like position/velocity), being able to sample OOC states suggests that the conditioned diffusion model can generate an accurate future trajectory (Figure 6).

Comment

Thanks for the detailed response and for running additional experiments. The first experiment shows that the method is relatively robust to noise in the compositional latent variable. However, the results of the second experiment show that a simple off-policy SAC agent is nearly as good as the diffusion agent. This is a bit worrying since the authors claim on line 233 that

We demonstrate the estimation of value functions is problematic given unsupported states and this can not be solved by more exploration or more training data

It seems that off-the-shelf RL algorithms are indeed viable algorithms if they are allowed to explore. Moreover, these algorithms are a lot cheaper to run than diffusion. I've therefore chosen to maintain my score.

Comment

We appreciate the reviewer’s follow-up! We would like to clarify the experimental results further. All the other models (BC, Diffusion, CQL) reported in Figure 4 use the same dataset collected by PPO, so we can conclude the performance gain is due to better modeling (Large PPO and BC are included below to rule out parameter count as an explanation). To rigorously test whether this OOC state problem can be solved by exploration, it makes more sense to compare the performance of SAC with different amounts of exploration and see whether more exploration improves performance. Since SAC and PPO are both online methods that explore and collect their own datasets but update their policies differently, it is hard to tell whether SAC's gain over PPO comes from exploration or from the policy update. Attached below is the performance of SAC with different exploration steps.

SAC

| Exploration timesteps | 10000        | 30000       |
| --------------------- | ------------ | ----------- |
| Number of crashes     | 22.33 (0.94) | 22.0 (2.83) |
| Reward                | 8.61 (0.06)  | 8.58 (0.14) |

Since SAC’s performance does not improve with more exploration, this supports our claim that the problem cannot be solved with more exploration using online methods.

Moreover, SAC is still not as good as the diffusion model trained on the lower-performing PPO data, and SAC's performance is still far from the in-distribution performance. This indicates that the diffusion model remains a strong model architecture for this OOC state problem. Please let us know if there are any additional concerns!

(Attached below is the performance of a Large PPO and BC for easier reference to Table 5.)

|                   | Large BC   | Large PPO                               | Diffusion  |
| ----------------- | ---------- | --------------------------------------- | ---------- |
| Model size        | 55.65 MB   | 111.29 MB (Policy: 55.64 + Value: 55.64) | 54.19 MB   |
| Parameters        | 13912325   | 27823622 (Policy: 13911040 + Value: 13911040) | 13546370 |
| Number of crashes | 37.3 (4.0) | 31.67 (1.89)                            | 20.0 (2.5) |
| Reward            | 7.71 (0.3) | 8.43 (0.19)                             | 8.81 (0.2) |
Review (Rating: 6)

This paper describes the zero-shot generalization problem in reinforcement learning for novel states using the combinatorial property of states, introducing the concept of out-of-combination (OOC) generalization. Based on this formulation, it explains why traditional RL methods tend to fail in such scenarios and proposes that diffusion models can effectively sample OOC states, enhancing zero-shot generalization in OOC cases.

Strengths

  1. This paper proposes a novel perspective in introducing the diffusion model as a planner for RL, leveraging its combinatorial generalization capabilities. It proposes the concept of combinatorial states to illustrate the effectiveness of the diffusion model planner in this specific scenario.
  2. The authors conduct experiments in several environments and test the effects of different conditioning on the diffusion model. The formulation of the experiments is easy to follow.

Weaknesses

  1. Presentation. (1) The paper’s organization is overly segmented, creating a somewhat confusing structure. For instance, Section 4 contains only one subsection (4.1), which could be merged with Section 5, as Section 4 is fairly brief. Additionally, the introduction to diffusion models would be more appropriately placed in the “Preliminaries” rather than within the methodology description. The main methodology appears to be in Section 6, yet it occupies only a single paragraph, offering minimal explanation of the approach. (2) The content is not properly distributed. In the related work section, too many subcategories dilute the focus. This section should concentrate on three main categories that are closely related to this work: state generalization, state decomposition, and diffusion models for RL generalization. Other less related studies could be briefly summarized. Notably, the role of diffusion models in decision-making, a central theme of this paper, lacks sufficient discussion both in the related work and in the main content. I recommend the authors consider these points to improve the manuscript’s readability.

  2. Technical Contribution. (1) While this paper emphasizes the combinatorial property of states in RL, it lacks explanation on how to determine the basic elements for each environment and how to extract the compositional information behind the observation, which is important for readers to understand the combinatorial property. (2) The paper does not effectively convey the importance of its setting, nor does it fully address the first question posed in the experimental section, "Does the state space of different existing RL environments exhibit a compositional nature?" The notion of "wide applicability" for this setting needs clearer articulation. For example, beyond the environments discussed in the experiment section, if the observation is a 2D image, can this approach still reliably identify the fundamental elements? In a game like CoinRun, where level differences lie in map layouts, how would one define the basic elements across varied terrains? (3) The paper mentions the point that "other work addresses only the situation where states have the same support but different probability density", I would ask the authors to provide more evidence supporting this claim.

Questions

  • While this may extend beyond the scope of the paper, I would still like to discuss with the authors: if novel states are introduced during testing, the environment dynamics are likely to vary as well. Would your method remain applicable under these varying dynamics?
  • In Algorithm 1, it appears that after the diffusion model generates an action, the state transition is carried out by the environment simulator. Given this, why does the trajectory produced by the unconditional diffusion model violate the maze constraints, as shown in Figure 7(c)? My understanding is that the state transition constraints should be enforced by the environment simulator.
  • Regarding the conditioning of the diffusion model, how is an appropriate conditioning determined for a specific environment? For instance, in Figure 7(d), the prior with 3 waypoints appears too strong, as these waypoints essentially outline the optimal path.
Comment

Weakness

While this paper emphasizes the combinatorial property of states in RL, it lacks explanation on how to determine the basic elements for each environment and how to extract the compositional information behind the observation.

We thank the reviewer for highlighting this concern. Since we define a base element as the most elementary, identifiable element relevant to the decision-making task of interest, the granularity of the base element depends heavily on the specific task. For example, we can consider each individual vehicle in a self-driving task as a base element if we would like the vehicle to navigate its surrounding environment, or we can consider a flow of traffic as a base element if the task is optimizing traffic flow efficiency. Despite the flexibility of the base element, our problem formulation fundamentally addresses the complexity arising from the exponential growth of combinations of these base elements.

The paper does not effectively convey the importance of its setting, nor does it fully address the question "Does the state space of different existing RL environments exhibit a compositional nature?" The notion of "wide applicability" for this setting needs clearer articulation. For example, beyond the environments discussed in the experiment section, if the observation is a 2D image, can this approach still reliably identify the fundamental elements? In a game like CoinRun, where level differences lie in map layouts, how would one define the basic elements across varied terrains?

We thank the reviewer for raising this concern, and we have acknowledged in our limitations section that this formulation may not be the best for every RL problem (lines 534-536). However, our motivation is to offer a better perspective for environments whose complexity comes from the exponentially many combinations of different objects. This complexity appears in many distinctly different real-world tasks, such as self-driving and multiagent decision-making, which is why we note wide applicability.

The format of the state (image or vector) does not affect whether it can be considered compositional; the formulation is about seeing the environment as a composition of different base elements. We could consider different obstacles like blocks and coins in CoinRun as base elements: once the agent learns to jump over obstacles and reach the coin, it can solve different combinations of obstacles. However, since this environment is mainly used to test different map layouts, we would not consider this compositional perspective the most intuitive approach there.

The paper mentions the point that "other work addresses only the situation where states have the same support but different probability density", I would ask the authors to provide more evidence supporting this claim.

We thank the reviewer for pointing out this mistake. The main message we are trying to convey is that the type of distribution shift that has received the most attention in RL is where actions in each transition in the datasets were collected with any policy other than the current policy (off-policy RL). This assumption typically posits that the dataset distribution is governed by the data collecting policy, given the entire state space of the environment is accessible. Consequently, different data-collecting policies will result in datasets with different densities but the same underlying support. Most papers addressing RL distribution shift with the same training and testing environment fall under this category (most offline and some off-policy RL) [1, 2, 3]. However, we study a different problem where the testing environment has different support compared to the training environment. We recognized in the introduction and related work that there are some other works studying RL generalization to unsupported states. Compared to these problems, we study a more structured generalization where this support shift can be modeled as new combinations of existing base elements (lines 43-48). Therefore, we rephrased this sentence into “other work either addresses only probability density shift or allows unseen base elements but also introduces other potentially unrealistic assumptions.”

[1] Peng, Xue Bin, et al. "Advantage-weighted regression: Simple and scalable off-policy reinforcement learning." arXiv preprint arXiv:1910.00177 (2019).

[2] Kumar, Aviral, et al. "Conservative q-learning for offline reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 1179-1191.

[3] Kostrikov, Ilya, Ashvin Nair, and Sergey Levine. "Offline reinforcement learning with implicit q-learning." arXiv preprint arXiv:2110.06169 (2021).

Presentation

We thank the reviewer for this valuable feedback on the paper structure! We have shortened the related work and added more content on how the diffusion model has been applied under this setup.

Comment

I thank the authors for their detailed feedback to my concerns and questions. I also agree with some of the concerns raised by other reviewers, especially the one about the manifold assumption. Also, I still believe the compositional property the authors proposed lacks wide applicability and is a major limitation of the proposed approach. Due to the above reasons, I decide to maintain my score.

Comment

Questions:

While this may extend beyond the scope of the paper, I would still like to discuss with the authors: if novel states are introduced during testing, the environment dynamics are likely to vary as well. Would your method remain applicable under these varying dynamics?

We thank the reviewer for raising the question about environment dynamics; Figure 6 illustrates why conditioned diffusion models stand out for exactly this reason. Yes, the dynamics of the environment often differ across combinations of base elements. For example, in environments with all cars every agent moves fast, whereas in environments with all bikes every agent moves slowly. When zero-shot generalizing to environments with a mixture of cars and bikes, the underlying MDP changes, but the individual dynamics of every car and bike remain the same. This highlights that our problem formulation represents a realistic yet challenging generalization task, focusing on novel combinations of existing base elements rather than entirely new MDPs. Successfully addressing this requires the model to identify the behavior of each base element and effectively compose their dynamics. As demonstrated in Section 7.3 (Figure 6), the diffusion model produces different and reasonable future trajectories when conditioned on different base element combinations. This indicates that the model not only captured the correlation between each base element and its dynamics but was also capable of composing them to form an accurate prediction of the unseen environment dynamics (takeaway 3).

In Algorithm 1, it appears that after the diffusion model generates an action, the state transition is carried out by the environment simulator. Given this, why does the trajectory produced by the unconditional diffusion model violate the maze constraints, as shown in Figure 7(c)? My understanding is that the state transition constraints should be enforced by the environment simulator.

Figure 7(c) shows the trajectory generated by the diffusion model, not the rollout in the environment. The purpose of Section 8.1 is to demonstrate the difference between samples generated by the conditioned and unconditioned diffusion models, so we use the trajectory directly generated by the diffusion model to visualize the impact of the conditioning. We thank the reviewer for pointing out this confusion and have emphasized it in the caption.

Regarding the conditioning of the diffusion model, how is an appropriate conditioning determined for a specific environment? For instance, in Figure 7(d), the prior with 3 waypoints appears too strong, as these waypoints essentially outline the optimal path.

The purpose of Figure 7 (Section 8.1) is an ablation showing the difference between samples generated by the conditioned and unconditioned diffusion models, to demonstrate the effect of conditioning. We therefore chose 3 waypoints to verify that the conditioning has a non-negligible effect on guiding the model to generate unseen samples.

What is an appropriate conditioning?

Our answer to this question is that conditioning is always flexible. For example, 3 waypoints can be used as conditioning, but they essentially outline the entire trajectory, restricting the diversity of generated samples. We could instead use a single midpoint as conditioning, in which case more trajectories satisfy the constraint. Appropriate conditioning therefore depends on the specific task, the available information, and how much we want to restrict the generation.
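As an illustration of how waypoint conditioning can be enforced mechanically, here is a toy sketch in the style of inpainting-based diffusion planners (e.g., Diffuser). This is our own hypothetical example, not the paper's implementation: the denoiser below is a placeholder, and the clamping scheme is the assumed conditioning mechanism.

```python
import numpy as np

def denoise_step(traj, k):
    """Stand-in for one reverse-diffusion step; a real model predicts noise."""
    return traj * 0.95 + np.random.default_rng(k).normal(0, 0.01, traj.shape)

def sample_with_waypoints(horizon, state_dim, waypoints, n_steps=50):
    """Inpainting-style conditioning: `waypoints` maps timestep -> state.

    After every denoising step the constrained entries are clamped back
    to their known values, so only the unconstrained portion of the
    trajectory remains free. With 3 waypoints most of the path is pinned
    down; with 1 midpoint many trajectories satisfy the constraint.
    """
    traj = np.random.default_rng(0).normal(size=(horizon, state_dim))
    for k in range(n_steps):
        traj = denoise_step(traj, k)
        for t, s in waypoints.items():  # clamp conditioned timesteps
            traj[t] = s
    return traj

# Condition on start, midpoint, and goal (3 waypoints).
wps = {0: np.zeros(2), 16: np.array([0.5, 0.5]), 31: np.ones(2)}
plan = sample_with_waypoints(horizon=32, state_dim=2, waypoints=wps)
```

The choice of how many entries to clamp is exactly the "strength of the prior" trade-off discussed above.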

Comment

We thank the reviewer for getting back to us! We would like to clarify why it is reasonable to make the manifold hypothesis and comment a little more on wide applicability.

Why data constitute a manifold:

Discrete case:

Since our base element set is a finite discrete set, it is a 0-dimensional manifold. Since the latent vector lies in the n-fold Cartesian product of this base element set, and a Cartesian product of manifolds is still a manifold, the latent vector lies on a manifold.

Continuous case:

Since we define the σ-algebra as the Borel sets, the space forms a topological space. Most RL environments and real-world scenarios simulate physical spaces, where the basic units of information can often be interpreted as continuous real numbers isomorphic to ℝ (e.g., the agent's x-location, reward values, or a single color channel). These spaces locally resemble Euclidean space. Consequently, the space spanned by these n bases locally resembles ℝⁿ, and the corresponding latent space naturally lies on a manifold. Moreover, if elements naturally form vector spaces, as in the examples above, then the states naturally lie on lower-dimensional manifolds, and these underlying manifolds are linear.

Therefore high-dimensional states/observations of the environment in the ambient space lie close to a low dimensional latent manifold.
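The product-of-manifolds fact underlying both cases can be written out explicitly (a standard result, stated here for concreteness):

```latex
\dim\bigl(M_1 \times \cdots \times M_n\bigr) = \sum_{i=1}^{n} \dim M_i .
% Discrete case: a finite base-element set B with the discrete topology is a
% 0-manifold, so the latent space Z = B^n satisfies \dim Z = 0.
% Continuous case: each attribute space is locally homeomorphic to \mathbb{R},
% so Z = M_1 \times \cdots \times M_n locally resembles \mathbb{R}^n.
```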

However, as mentioned in our paper (lines 301-303) and the previous rebuttal, this is meant only to provide intuition for why this can be a reasonable assumption to simplify the analysis, as opposed to claiming that all states in our setting form manifolds. That is, in many cases we also do not think the states necessarily reside strictly on a (linear) manifold; however, because of the compositional nature of our setting, it is reasonable to make this simplifying assumption when making modeling decisions. The same simplifying assumption appears in the analysis of many manifold-aware models [1, 2, 3]. Our experimental results with a manifold-aware diffusion model also demonstrate superior performance, further supporting the validity of this assumption.

Wide Applicability

We thank the reviewer for raising this concern. While there are limited existing RL environments that explicitly exhibit compositional characteristics, such complexity is prevalent in many real-world decision-making problems. For example:

Self-driving: Traffic elements can be viewed as compositional base elements.

Web navigation agents: a webpage is composed of many reusable subcomponents, such as widgets and windows.

Music generation: music is composed of different layers of tracks and different instruments.

Despite the ubiquity of compositional structures in such domains, this challenge remains underexplored in prior work. Through our study, we aim to bring greater attention to the problem of compositional complexity in decision-making settings.

Please let us know if there is any other clarification we can provide.

[1] Donoho, David L., and Carrie Grimes. "Image manifolds which are isometric to Euclidean space." Journal of Mathematical Imaging and Vision 23.1 (2005): 5-24.

[2] Brown, Bradley CA, et al. "Verifying the union of manifolds hypothesis for image data." arXiv preprint arXiv:2207.02862 (2022).

[3] Gong, Sixue, Vishnu Naresh Boddeti, and Anil K. Jain. "On the intrinsic dimensionality of image representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Review (Rating: 3)

This paper studies the problem of RL generalization from a state distribution covering only certain combinations of objects to a state distribution covering new combinations of objects. The key contribution is to use a conditional diffusion model, where the diffusion model is given information about the objects in a state. Using this model in RL amounts to planning, where the model is used to generate a trajectory and select the first action in the trajectory. Evaluation is performed on an autonomous driving simulator, the starcraft multi-agent challenge and various maze environments. Relative to a behavioral cloning baseline, constrained Q-learning and vanilla PPO, the diffusion model seems to improve generalization to the test environment that contains unseen combinations.

Strengths

  • The problem of out-of-combination generalization is motivated clearly, and its relevance to RL is demonstrated across a few problems.
  • Relative to the baseline considered, the proposed conditional diffusion model is clearly an improvement across an autonomous driving simulator, starcraft multi-agent challenge and various maze problems.

Weaknesses

  • The proposed method of using a conditional diffusion model for planning is not detailed very clearly.
  • The experiments lack any strong baseline to accurately judge the proposed method. The authors cite a few prior works on combinatorial generalization in RL, but the experiments do not include any representative baselines. It is not clear whether the benefit of the proposed approach is due to the larger model used by the diffusion model, or the specific combination of offline RL with expert data and the proposed model. While some ablations are performed over the type of conditioning, these results do not seem entirely conclusive, except that no conditioning is always worst.

Questions

  • Figure 2 and discussion of subtasks: I am not sure I understand the difference between the proposed definition of combinatorial generalization and trajectory stitching. With trajectory stitching, there can also be a combinatorial generalization problem if the stitched trajectories include some trajectories with some subtask and another trajectory with another subtask. The policy on the stitched trajectory would have seen each subtask separately, but not together.
  • Problem Formulation and scope of experiments: By construction the problem setting requires some notion of "object identity" for the model to be conditioned on. I consider this a serious limitation of the wide applicability of this approach, as very few RL environments come with predefined objects. One way to demonstrate wider applicability is if the proposed approach could also be combined with a method that performs image segmentation to identify objects.
  • Problem formulation and test mismatch: How can a practitioner know whether the test distribution differs from the train distribution in the combination sense? Is it possible that the proposed approach can improve over the baseline on problems where there is no out-of-combination distribution? (e.g., without any objects, or when all the combinations are covered)
  • Wouldn't section 4 be better titled "Why Offline RL Fails"? Even this requires further qualification. It is not clear how CQL is trained here and what information is provided to it in terms of conditioning.
  • Section 4 (Offline RL focus): Is there necessarily a case for studying this problem exclusively with offline RL? For example, better OOC generalization can contribute to faster learning with PPO on the previously unseen environment.
  • Section 4 conclusion: Again, while it is true that offline RL methods are not well-suited to the problem where the training environment is different from the testing environment, this is to be expected. The problem of RL is solving an MDP, not transferring from one MDP to another.
  • Section 5.2 (linear manifold): this section is confusing; why is the latent space a linear subspace in maze navigation?
  • Section 5.2 (manifold hypothesis) The manifold hypothesis, that the latent space is much smaller, has only really been demonstrated in the case of supervised or self-supervised learning. Is there evidence that suggests a similar hypothesis holds true for problems involving sequential decision making? (Even if a low dimensional representation exists for one-step prediction, the representation for long-term value estimation needs to account for expected future state occupancy.)
  • Section 6: this section is critical to understanding the proposed method, but many of the details (Figure and Algorithm Pseudocode) are in the appendix. Why is the specific architecture chosen (Diffusion UNet), and what contributed to the use of only expert trajectories? What does it mean to model the action (Algorithm 1 does not seem to model the action)?
  • Section 7.1 and Figure 4: I think a stronger and more representative baseline must be included to make meaningful conclusions here. Additional details on how PPO/CQL was trained and evaluated are needed. For example, did PPO and CQL have the conditioning information? Was PPO trained online (appendix mentions 2 different instances of PPO were used to gather data in the all-car or all-bike environment, unclear if this was used for PPO performance in Figure 2)?
  • Section 7.2 and Figure 5: The use of even fewer baselines here makes any conclusions difficult to make. Additionally, error bars seem to be missing.
  • Section 7.3 and Figure 6: I am not sure that these qualitative results demonstrate "how" conditional diffusion models generalize, but rather that they do generalize. I have a hard time understanding what Figure 6 is meant to convey. Would other models not identify the dynamics of specific units? A quantitative analysis, showing correlations between trajectories for specific conditionings, may be more clear.
  • Section 8.1 and Figure 7: Can you comment on how this evaluation differs from other work on trajectory stitching?
  • Section 8.2 and Figure 8: It is not clear that cross attention is large improvement over concatenation without error bars. The performance difference can also be explained by additional parameters introduced by attention.

** Minor Comments

  • I believe the pseudocode in the appendix has a typo: there is an "x_{t-1}" variable on line 995 that should probably be "s_{t-1}"
  • Measure theoretic definition: because you work with a finite, countable set of latents, the measure theoretic definitions do not seem necessary. I think Section 3 can be made more clear and concise by focusing on the discrete setting.
Comment

Weakness

Method not detailed very clearly

We thank the reviewer for pointing this issue out. We have added details regarding how the diffusion model is used to generate future trajectories in Section 6.

Lack of combinatorial generalization baseline

We thank the reviewer for raising this concern. Prior works listed in the related work make different problem assumptions than ours (e.g., they need to define a distance function, or need the underlying MDPs to be isomorphic, in order to map unseen states back to training states). Thus their methods do not apply under our problem formulation, where we test generalization without mapping test states back to the training set (lines 144-146, 148-152).

Unclear whether the benefit is due to the larger model size

We thank the reviewer for highlighting the concern regarding different model sizes. We included the results with a large behavior cloning model that has roughly the same number of parameters as the diffusion model in Section D.8.2 to eliminate this concern (lines 1244-1246). Additionally, we trained a large PPO model that has roughly the same number of parameters as the conditioned diffusion model and included this performance in Table 5 as well.

|  | Regular PPO | Large PPO | Diffusion |
| --- | --- | --- | --- |
| Model size | 0.59 MB (Policy: 0.29 + Value: 0.29) | 111.29 MB (Policy: 55.64 + Value: 55.64) | 54.19 MB |
| Parameters | 147,456 (Policy: 73,728 + Value: 73,728) | 27,823,622 (Policy: 13,911,040 + Value: 13,911,040) | 13,546,370 |
| Number of crashes | 36.0 (6.93) | 31.67 (1.89) | 20.0 (2.5) |
| Reward | 8.19 (0.19) | 8.43 (0.19) | 8.81 (0.2) |
Comment

Questions

Difference between combinatorial generalization and trajectory stitching

We would like to elaborate on this with an example of value-based RL methods. The value function takes in the current state and predicts the expected rewards. In trajectory stitching, the state input has been seen in some subtask of some training trajectory and thus still lies in the region supported by the training data for the value network. In state combinatorial generalization, the state is a new combination of existing objects and therefore an unseen input to the value network (Figures 1 & 2). Thus state combinatorial generalization is distinct and challenging because the test-time states are not supported by the training data. Please let us know if our explanation addresses your question, and we would be happy to provide further clarification if needed!

limitation of the wide applicability of this approach, as very few RL environments come with predefined objects

Our motivation stems from the observation that real-world environments involve a vast diversity of objects, making it almost impossible to enumerate all possible combinations. This work provides a more intuitive framework for environments where complexity arises from the exponential number of such combinations. This compositional complexity is prevalent in a variety of real-world tasks, such as self-driving and multi-agent decision-making, as illustrated in the paper. Consequently, our approach demonstrates broad applicability across these distinctly different domains.

Despite the significance of this compositional complexity, most existing environments are not designed with this complexity in mind. Instead, they often make the idealized assumption that training and testing states have the same support. To address this gap, we propose this novel formulation that emphasizes generalization to new compositions and we hope to encourage the community to explore this important but overlooked problem.

combined with a method that performs image segmentation to identify objects

We thank the reviewer for this suggestion. As the primary focus of this paper is on improving decision-making in the presence of OOC states, we emphasize making reasonable decisions given base element combinations. However, our proposed formulation can be easily combined with any detection method for real-world applications.

How can a practitioner know whether the test distribution differs from the train distribution in the combination sense?

We thank the reviewer for bringing up this concern, and we would like to explain that this is exactly why we want to study the problem of composition. Accounting for all possible combinations in the real world is exponentially challenging and computationally infeasible due to the vast number of distinct objects it comprises. Consequently, a trained model deployed in such environments will inevitably encounter states involving unseen combinations of familiar objects. To systematically explore this challenge of compositional complexity, we abstract the concept into our problem formulation, providing a structured framework for both theoretical analysis and empirical evaluation of model performance.

Is it possible that the proposed approach can improve over the baseline on problems where there is no out-of-combination distribution?

We thank the reviewer for raising this concern. For in distribution performance of the conditioned diffusion model, please refer to Section D.9.3 (Tables 10 & 11), specifically rows labeled “ID”. The results indicate that the performance of the diffusion model is nearly identical to that of PPO when there are no OOC scenarios.

Comment

Questions (continued)

Section 6: many of the details (Figure and Algorithm Pseudocode) are in the appendix. Why is the specific architecture chosen (Diffusion UNet), and what contributed to the use of only expert trajectories? What does it mean to model the action (Algorithm 1 does not seem to model the action)?

We thank the reviewer for pointing out this ambiguity and have updated the manuscript with more detail. We use the same UNet architecture as prior work [4], and since we use the diffusion model for behavior cloning, we train it on the expert dataset. Modeling the action simply means the diffusion model generates future trajectories, i.e., both states and actions (Sections 6.1 & 6.2). We have updated $x$ to $\tau$ in Algorithm 1.

[4]Janner, Michael, et al. "Planning with diffusion for flexible behavior synthesis." arXiv preprint arXiv:2205.09991 (2022).
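To make the trajectory-generation step concrete, here is a minimal numpy sketch of conditioned reverse diffusion over a (state, action) trajectory, with the current state inpainted at every step in the style of Diffuser [4]. All shapes, the noise schedule, and the dummy denoiser are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

H, D_s, D_a = 8, 4, 2          # horizon, state dim, action dim (hypothetical)
D = D_s + D_a                  # each trajectory row holds (state, action)
T = 50                         # number of diffusion steps

# Standard DDPM quantities under a linear beta schedule.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x, t, cond):
    """Stand-in for the trained noise-prediction UNet. `cond` is the
    base-element composition vector; the real model injects it via
    concatenation or cross-attention."""
    return 0.1 * np.tanh(x + cond.sum())  # dummy epsilon prediction

def sample_trajectory(s0, cond):
    """Reverse diffusion over a full (state, action) trajectory,
    clamping the observed current state s0 at every step."""
    x = rng.standard_normal((H, D))
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / np.sqrt(1 - ab) * eps) / np.sqrt(a)
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
        x[0, :D_s] = s0  # condition on the observed current state
    return x[:, :D_s], x[:, D_s:]  # predicted states and actions

states, actions = sample_trajectory(np.zeros(D_s), np.array([1.0, 0.0]))
```

The key point is that the composition vector enters only through the denoiser, while the current state is enforced by inpainting, so the whole future trajectory (states and actions) is sampled jointly.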

Section 7.1 and Figure 4: I think a stronger and more representative baseline must be included to make meaningful conclusions here. Additional details on how PPO/CQL was trained and evaluated are needed. For example, did PPO and CQL have the conditioning information? Was PPO trained online (appendix mentions 2 different instances of PPO were used to gather data in the all-car or all-bike environment, unclear if this was used for PPO performance in Figure 2)?

We thank the reviewer for raising these points. PPO is trained online in the training environment (with a 50% probability of seeing all cars and a 50% probability of seeing all bikes), and CQL is trained offline with the same dataset as the conditioned diffusion model. All models (including PPO and CQL) receive the same information (lines 327-328); state-composition information is passed as part of the observation. All models are evaluated in the testing environment (the same unseen base-element combination).

Since this is a new problem formulation, we compare the diffusion model with the widely acknowledged competitive RL baselines like PPO/CQL but as discussed in the related work and our answers to previous questions, no baseline has been proposed to address this compositional complexity without extra assumptions to our best knowledge.

Section 7.2 and Figure 5: The use of even fewer baselines here makes any conclusions difficult to make. Additionally, error bars seem to be missing.

Figure 5 demonstrates the relative performance of BC and Diffusion compared to MAPPO, a SOTA online policy-gradient baseline for multiagent decision-making problems [5]. We thank the reviewer for pointing out the missing error bars. They can be found in Section D.9.3: Detailed results on SMACv2 (Tables 10-14), and we attach them here for easier reference.

| 3v3 | MAPPO | BC | Diffusion |
| --- | --- | --- | --- |
| Simple | 0.52 ± 0.03 | 0.56 ± 0.02 | 0.59 ± 0.02 |
| Hard | 0.42 ± 0.02 | 0.40 ± 0.06 | 0.49 ± 0.02 |

| 5v5 | MAPPO | BC | Diffusion |
| --- | --- | --- | --- |
| Simple | 0.54 ± 0.04 | 0.52 ± 0.05 | 0.56 ± 0.02 |
| Hard | 0.26 ± 0.05 | 0.28 ± 0.04 | 0.32 ± 0.04 |

[5] Yu, Chao, et al. "The surprising effectiveness of ppo in cooperative multi-agent games." Advances in Neural Information Processing Systems 35 (2022): 24611-24624.

Section 7.3 and Figure 6: I am not sure that these qualitative results demonstrate "how" conditional diffusion models generalize, but rather that they do generalize. I have a hard time understanding what Figure 6 is meant to convey. Would other models not identify the dynamics of specific units? A quantitative analysis, showing correlations between trajectories for specific conditionings, may be more clear.

Figure 6 shows the “imaginary” trajectory that the diffusion model predicts for different specific conditionings (conditioned on all circles or conditioned on all triangles) but not the actual rollout. It demonstrates that the prediction of the diffusion model accurately reflects the world dynamics. It would be very hard to understand whether other value-based RL algorithms identify the dynamics or if they can do stitching as they do not directly generate future state predictions. We try to prove that the conditioned diffusion model does accurately “imagine” the world dynamics and this contributes to its generalizability. A quantitative analysis of diffusion model performance is plotted in Figure 5 and Tables 10, 11, 12, 13, 14. Please let us know if we addressed this concern. Additionally, we would be happy to engage in further discussion if the reviewer could provide more specific guidance regarding the type of quantitative analysis in mind.

Comment

Question (continued)

Section 4 and Offline RL focus

We thank the reviewer for pointing out this confusion. We would like to clarify that section 4 is focused on the problem with the learned value function (lines 232-234, lines 239-245). This problem also exists for online methods since the states formed by new combinations of base elements are not supported in the training environment (Online methods will never be able to visit them as well) (lines 259-261). We use CQL’s Q value prediction as one example to illustrate this problem. Thus we avoid models that learn a value function like PPO and use behavior cloning instead. We appreciate the reviewer’s suggestion and have added further clarification on how online methods are similarly affected. Additionally, we have included a visualization of PPO’s value prediction in Section C (Figure 9) to support this reasoning.

not clear how CQL is trained here and what information is provided to it in terms of conditioning

We thank the reviewer for pointing out this confusion and are happy to provide more details. CQL is trained on the same offline dataset that is used to train the conditioned diffusion model. The observation space in Roundabout includes observability, the locations and speeds of the ego and surrounding agents, and whether each agent is a car or a bike (the conditioning for diffusion). Since CQL is not a conditioned model, the observation is passed to CQL as a whole. So both models have access to the same information but have different model architectures.

Section 5.2 (linear manifold): why is the latent space a linear subspace in maze navigation?

In the maze environment, a state can be represented by base elements such as the x- and y-coordinates. These coordinates form a linear subspace, so the latent space of the maze satisfies the linear manifold hypothesis.
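As a sketch of this claim (notation ours, not the paper's), a maze state can be written as

```latex
s = x\,e_x + y\,e_y \in \operatorname{span}\{e_x, e_y\},
```

so any unseen $(x, y)$ combination still lies in the same two-dimensional linear subspace spanned by the coordinate basis vectors.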

Section 5.2 (manifold hypothesis) Manifold hypothesis for state space:

The manifold hypothesis states that high-dimensional datasets occurring in the real world lie on a low-dimensional latent manifold inside the high-dimensional space; essentially, data that initially requires many variables to describe can be concisely described with a small number of variables [2][3]. This holds true for states/observations, as the information in a state/observation can often be described concisely (e.g., by the base-element combination in this case).

Future state occupancy:

We thank the reviewer for raising this insightful question and would like to explain the connection between the state probability space and the MDP. Which future states can be visited from the current state is determined by the underlying MDP; from the state-space perspective, the MDP determines the probability density of future states (the MDP's information is included in the state probability space). We therefore model the state space directly. The conditioned diffusion model learns a world model and predicts future states and actions instead of learning a value function. Corollary 5.1 is thus essentially saying that, since the diffusion model assigns nonzero probability to these unseen future OOC states, it is capable of generating reasonable future trajectories that help it make better decisions.

Additionally, we would like to elaborate that the manifold hypothesis also holds in the context of MDPs: one can follow the definition of our probability state space and obtain another linear space with an additional basis dimension representing time. In our experiments we explicitly model the time dimension with diffusion planning, but a trajectory is still a linear concatenation of states that satisfies the manifold hypothesis. We have also demonstrated experimentally that this manifold-hypothesis-aware model achieves better OOC generalization; one contribution of our paper is thus to validate the manifold hypothesis in sequential decision making. We appreciate the reviewer's question and have added these clarifications to the manuscript.

[2]Cayton, L. (2008). Algorithms for manifold learning (p. 1). eScholarship, University of California.

[3]Loaiza-Ganem, G., Ross, B. L., Hosseinzadeh, R., Caterini, A. L., & Cresswell, J. C. (2024). Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections. arXiv preprint arXiv:2404.02954.

Comment

Questions (continued)

Section 8.1 and Figure 7: how this evaluation differs from other work on trajectory stitching?

We formulate the navigation problem as a one-step trajectory generation process where the diffusion model generates an entire valid trajectory instead of generating one action at every timestep. This is also a rendering of an imaginary trajectory predicted by the diffusion model. We use this imaginary trajectory to demonstrate how conditioning changes the sampled trajectory (lines 466-470).

Section 8.2 and Figure 8: It is not clear that cross attention is large improvement over concatenation without error bars.

We thank the reviewer for pointing out the missing error bars in the plot. The error bars can be found in Tables 12 and 13 and we attach it here for easier reference.

| 3v3 | No Cond | Concat | Attention |
| --- | --- | --- | --- |
| ABC → ABC (ID) | 0.55 ± 0.03 | 0.59 ± 0.04 | 0.59 ± 0.05 |
| ABC → AAA (Simple) | 0.50 ± 0.06 | 0.59 ± 0.02 | 0.59 ± 0.02 |
| AAA → AAA (ID) | 0.40 ± 0.03 | 0.64 ± 0.03 | 0.61 ± 0.05 |
| AAA → ABC (Hard) | 0.28 ± 0.03 | 0.44 ± 0.05 | 0.49 ± 0.02 |

| 5v5 | No Cond | Concat | Attention |
| --- | --- | --- | --- |
| ABC → ABC (ID) | 0.53 ± 0.04 | 0.59 ± 0.03 | 0.66 ± 0.01 |
| ABC → AAA (Simple) | 0.50 ± 0.03 | 0.50 ± 0.01 | 0.56 ± 0.02 |
| AAA → AAA (ID) | 0.47 ± 0.08 | 0.55 ± 0.03 | 0.58 ± 0.04 |
| AAA → ABC (Hard) | 0.27 ± 0.03 | 0.32 ± 0.04 | 0.30 ± 0.04 |

The performance difference can also be explained by additional parameters introduced by attention.

We also thank the reviewer for raising the concern about model parameter size. The number of parameters with attention or concatenation is roughly the same: the attention module consists of convolutional layers, while concatenation increases the parameters of the conditioning layers. We have added these results to Section D.11.

|  | Attention | Concat |
| --- | --- | --- |
| Model size | 617.06 MB | 619.64 MB |
| Parameters | 154,264,085 | 154,911,187 |
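To make the comparison of the two conditioning schemes concrete, here is a back-of-the-envelope parameter count. The widths `d_model` and `d_cond` are hypothetical, not the paper's; actual totals depend on layer widths and counts, which is why the two schemes can come out roughly equal in a full model:

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one dense layer."""
    return d_in * d_out + (d_out if bias else 0)

d_model, d_cond = 256, 32  # hypothetical feature and conditioning widths

# Concatenation: the conditioning vector is appended to the layer input,
# widening the first dense layer of the conditioning block.
concat_params = linear_params(d_model + d_cond, d_model)

# Cross-attention: Q from trajectory features, K and V from the
# conditioning tokens, plus an output projection.
attn_params = (linear_params(d_model, d_model, bias=False)       # Q
               + 2 * linear_params(d_cond, d_model, bias=False)  # K, V
               + linear_params(d_model, d_model, bias=False))    # out
```

For a single block at these widths attention happens to cost more; whether the totals match, as in the table above, depends on how many layers each scheme touches.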
Comment

Dear reviewer iS1b,

We deeply appreciate the time and effort you’ve dedicated to reviewing our work. We understand that this may be a particularly busy time, but we would like to send this gentle reminder that this is approaching the end of the reviewing process. If there’s anything we can clarify or answer to facilitate the process, please don’t hesitate to let us know! Thank you again for your valuable feedback to improving our work, and we look forward to hearing from you soon.

Comment

I thank the authors for their submission and I appreciate the detailed reply, but I plan to keep my score.

My reasons for doing so are as follows:

  • Lack of existing combinatorial generalization baseline: If this is truly the first applicable baseline in this line of work then the bar for experimentation is high. There should be a thorough evaluation on the effect of different RL algorithms with different diffusion models. (The updated results with a large PPO is a good step in this direction, but I am afraid it is not enough.)

  • Combinatorial generalization vs trajectory stitching: Your argument at the state-level is correct, but notice that there is another way to think about these two problems that reveals the relationship. In particular, in the trajectory stitching literature the "primitive" is not the state but the trajectory. At the trajectory level, each state in the trajectory can be considered one of the objects of the trajectory. The problem can thus be phrased equivalently as: given trajectory 1 and trajectory 2 with some non-empty intersection, can a reinforcement learner combine these trajectories to learn a better policy. This is out-of-combination because trajectory 3 must combine information about trajectory 1 and 2, beyond that which a value prediction provides.

  • On applicability: the point I was making is that objects are rarely labelled as part of the state, and thus there are very few environments for which this approach can even be evaluated. One way to test this idea on more problems is to train a network on the raw sensory stream to detect objects, and see if your approach to OOC generalization can help in those problems too.

  • Manifold hypothesis: the manifold hypothesis is primarily for supervised learning, I have not seen any good evidence that it holds true for sequential decision making. Notice that in the linear case, the optimal (successor) representation is policy-dependent as it depends on future state occupancy. Thus, even if the individual states admit low dimensional embeddings for mapping one-step predictions of state and reward, this does not necessarily have any bearing on learning a policy or value function.

As it stands, the primary contribution seems to be taking existing findings regarding diffusion models capability for compositionality and applying it to RL. But the paper involves too much unfounded speculation which needs to be backed up by stronger theoretical or empirical evidence. There is definitely potential here, but I do not think a rebuttal period provides sufficient opportunity for these improvements. My suggestion is to focus on building a stronger empirical case, perhaps including a smaller-scale synthetic experiment to build intuition (rather than roundabout) and to highlight where current methods fail (both at the RL level and at the diffusion model level). I would also consider including either more environments (e.g., with a learned segmentation network), or a more thorough analysis into the choice of diffusion model + RL algorithm.

Comment

We thank the reviewer for providing us with suggestions on how to improve the writing. We would appreciate more details on how the existing analysis differs from what the reviewer has in mind.

smaller-scale synthetic experiment to build intuition (rather than roundabout)

Section 4 ("Why traditional RL fails") is precisely an analysis of the problem with RL value prediction and provides intuition for why these methods fail. It would help our discussion if the reviewer could elaborate on what makes Roundabout unsuitable for building intuition and which environment would be better suited for this purpose.

highlight where current methods fail (both at the RL level and at the diffusion model level).

Section 4 ("Why traditional RL fails") analyzes where current methods fail at the RL level, while Section 7.3 ("How do conditional diffusion models generalize to OOC states?"), Figure 6, and Takeaway 3 highlight why the diffusion model works.

including either more environments (e.g., with a learned segmentation network)

We have already demonstrated our approach in three drastically different environments: maze, driving, and a multi-agent game. If the reviewer thinks this does not cover enough environments, we kindly ask the reviewer to be more specific and suggest the environments they have in mind. Moreover, we propose a decision-making problem and focus on analyzing what makes a good decision-making algorithm; adding segmentation on top would dilute the focus and introduce more varying factors into the results.

more thorough analysis into the choice of diffusion model + RL algorithm.

We have already included a detailed analysis of the design choices of the diffusion model in the ablations: whether to use conditioned or unconditioned diffusion (Section 8.1: necessity of combinatorial inductive bias) and whether to condition via concatenation or attention (Section 8.2: model architecture: attention vs concatenation). Our RL baselines include online (PPO/MAPPO), offline (CQL), and BC methods (along with their large versions). Again, we would ask the reviewer to provide more details: for example, what would a more thorough analysis look like?

Comment

Lack of existing baseline

As mentioned in our last rebuttal, we have compared diffusion models with widely acknowledged competitive RL baselines covering online (PPO/MAPPO), offline (CQL), and behavior-cloning methods. If the reviewer thinks this is not enough, could you suggest which baselines you have in mind so we can add them during the rebuttal?

Combinatorial generalization vs trajectory stitching

TLDR: Trajectory stitching is a consequence of the dynamic programming nature of certain RL algorithms (e.g. Q learning). Combinatorial generalization focuses on the generalization of value/policy/BC network on unsupported inputs.

Trajectory stitching: Trajectory stitching arises from the nature of dynamic programming, which does not care which trajectory was followed to reach the current state; the next state is determined by selecting the state with the maximum value at the next timestep (again regardless of whether it lies on the same training trajectory). The value function does not capture which trajectory the current state is on, and stitching emerges from taking the maximum over all possible values. In this case, since the state has been visited, we call it data-supported; it does not require the value network to generalize to unseen inputs.

State combinatorial generalization: We focus on the situation where the current state is not data-supported. The problem here is more on network generalization: how can a learned model make good predictions given unsupported input. It doesn’t necessarily have to be the value prediction for the state.

In all, state combinatorial generalization requires the network (value, policy, or otherwise) to generalize to unseen inputs, whereas trajectory stitching is an outcome of dynamic programming. These are two fundamentally different problems.
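The dynamic-programming view of stitching can be illustrated with a toy value-iteration example (environment and numbers are ours, purely illustrative): two trajectories that share a state are combined by Bellman backups, with no generalization to unseen inputs required.

```python
import numpy as np

# Tiny deterministic chain with states 0..3; state 3 is terminal.
# Trajectory A visits 0 -> 1 (no reward observed); trajectory B visits
# 1 -> 2 -> 3 with reward 1 on reaching state 3. Neither trajectory alone
# reaches the goal from state 0, but Bellman backups over the pooled
# transitions "stitch" them: value flows from B's reward into A's start.
transitions = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 1.0)]  # (s, s', r)

V = np.zeros(4)
gamma = 0.9
for _ in range(10):  # value-iteration sweeps over the pooled dataset
    for s, s_next, r in transitions:
        V[s] = r + gamma * V[s_next]

# V[0] > 0: the value at trajectory A's start reflects trajectory B's
# reward, even though no single trajectory connects state 0 to the goal.
```

Every state here was visited in the data, so the backup never queries the value function outside its training support; that is exactly what breaks in the combinatorial-generalization setting.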

Applicability

Our applicability argument is that the problem we tackle, the exponentially exploding complexity arising from different combinations of base elements, can be found in many decision-making problems (maze, self-driving, and multiagent games) (lines 339-341: "(Wide applicability) Does the state-space of different existing RL environments exhibit a compositional nature?"). We did not claim that having this composition information pre-labelled is itself the wide applicability. Using segmentation to detect base elements is out of scope for the decision-making problem we focus on.

Manifold hypothesis

We would like to first point out that the manifold hypothesis is not only popular in supervised learning but also in generative modeling [1, 2, 3, 4]. As mentioned in our paper, the diffusion model is trained with BC and used to generate future state-action pairs so there is no reward involved. Does “linear case” mean cases where the linear manifold hypothesis holds (like maze)? Also, future state density is captured by the diffusion model as it learns the distribution of the offline dataset. We showed in Figure 6 that the diffusion model is able to generate reasonable future OOC states given that it is trained on data collected by a behavior policy that never visited this OOC state (and so all the future states it incurs). This is also strong empirical evidence that the manifold hypothesis holds for decision-making tasks.

[1]Loaiza-Ganem, G., Ross, B. L., Hosseinzadeh, R., Caterini, A. L., & Cresswell, J. C. (2024). Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections. arXiv preprint.

[2]De Bortoli, Valentin. "Convergence of denoising diffusion models under the manifold hypothesis." arXiv preprint (2022).

[3]Loaiza-Ganem, Gabriel, et al. "Diagnosing and fixing manifold overfitting in deep generative models." arXiv preprint (2022).

[4]Pidstrigach, Jakiw. "Score-based generative models detect manifolds." Advances in Neural Information Processing Systems 35 (2022): 35852-35865.

unfounded speculation

We draw all our conclusions/takeaways from experimental results. Key observation about inaccurate value function in lines 252-254 is supported by Figure 3: expected Q value of CQL. Takeaway 1 on better performance of diffusion models is supported by Figure 4: reward achieved in OOC state for different models. Takeaway 2 on how this combinatorial complexity can be observed in different environments is supported by Section 7.1: single agent environment and Section 7.2: multi-agent environment. Takeaway 3 on how the diffusion model can accurately predict future trajectory is supported by Figure 6: a rendering of the future trajectory predicted by the diffusion model. Ablation takeaway is supported by the ablation over conditioned/unconditioned diffusion and concatenation/attention for conditioning (Figure 7 and Figure 8). If the reviewer thinks we have unfounded speculation, please support this claim with evidence, and thank you for helping us improve the paper.

Comment

Dear reviewer iS1b,

We deeply appreciate the time and effort you’ve dedicated to reviewing our work. We understand that this may be a particularly busy time, but we would like to send this gentle reminder that this is approaching the end of the reviewing process. We have provided detailed response to your previous questions/suggestions and believe that more details on some concerns would help facilitate a more productive discussion. Thank you again for reviewing our work and we look forward to hearing from you soon.

Review
3

The authors propose to describe RL tasks using a framework that generalizes in a combinatorial way, meaning they construct states as latent vectors, each constructed themselves from fixed elements. Such vectors lend themselves to generalization to differently sized sets. The authors argue that RL models are inherently ill suited to deal with such a framework, whereas diffusion models are inherently better suited for such a task. They demonstrate their claims through a series of experiments on RL environments.

Strengths

  • Combinatorial complexity is a problem that seems often overlooked in machine learning topics. I am happy to see the authors address the subject.
  • I think the understanding of why diffusion models seem to generalize better is of great importance to the progress of many fields.

Weaknesses

While the paper addresses an important and timely topic, I have some questions and concerns about the technical content of the paper.

  • The authors argue that a state can be composed of base elements, and attributes to those base elements. While they do not explicitly specify what constitutes those base elements beyond "car" and "bike", they do mention that there are also attributes, which are not relevant to the state as it pertains to the decision making, but only to the rendering function. I disagree with that statement. Certainly the velocity of a car would be important to one's decision making. It is not clear where "position" falls in this framework. Both these variables are important to the state, and would thereby invalidate the authors' assumption in most realistic scenarios, since the "finite base set" requirement can not be satisfied.
  • Central to the authors' claims seems to be that the sampled diffusion variables can be represented by a combinatorial set of finite elements. However, the approach in Janner et al. uses diffusion to sample future states and actions, conditioned upon the current state. In this case, the combinatorial state is not sampled, but rather it is an input to the model.
  • I fail to see why the states defined by Z lie on a manifold. While there is indeed some prior work about manifolds in the dataspace, this specifically concerns an assumption for images. The dataspace that the authors present does not fall into that category. In particular, lines 304 to 309 say that because there is a countable number of elements, their data lies on a manifold. If I understand the setup correctly, Z is a Cartesian product of the base set E. This would not constitute a manifold. The manifold assumption seems like a central assumption to the paper, but it is not defended. It also seems that Z has a varying dimension, which seems incompatible with the assumption they lie on a manifold. As a matter of fact it is unclear even if n is varying in the setup. The abstract and title seem to suggest that, but I do not see any information in the paper (one way or another) on whether it varies or not.
  • In the absence of strong theoretical results, I would hope to see very compelling experimental results, but strong experimental results are lacking. PPO, the most competitive of the baseline methods, seems to have been trained with 100x less parameters than the diffusion models.

Minor points that would be good to address:

  • An "ODE" trajectory, as the authors use in their proof, precludes noise, which the authors use later on. This would be an SDE trajectory.
  • In the proof, σ_t is not specified.
  • In the corollary, it is not clear what the purpose of s is. It does not seem needed for the argument made there. The symbol is used in the proof, but it seems unrelated to the statement in the corollary.

问题

  • I would hope the authors could address and clarify all the points brought up under weaknesses. Of particular importance are: can they clarify the setup (does n vary or not)? Can they give a more rigorous defense of why Cartesian products of sets lie on a manifold?
  • In the proof, it seems like the bottom line is, diffusion models put mass everywhere, therefore they generalize better. This is not at all a unique property to diffusion models, does this mean the authors' proof holds for other probabilistic models with that property too?
  • The authors themselves point out in the proof under appendix B that non-zero probability does not prevent poor behavior, and propose to use data augmentation during training. Is there a reason that this data augmentation would not suffer from the same problems the authors point out in the case of RL?
评论

Questions

In the proof, it seems like the bottom line is, diffusion models put mass everywhere, therefore they generalize better. This is not at all a unique property to diffusion models, does this mean the authors' proof holds for other probabilistic models with that property too?

We thank the reviewer for raising this interesting question. However, we would like to note that our proof only demonstrates that diffusion models place nonzero probability on unseen states on the MANIFOLD, not literally everywhere. Since it is important to place density on unseen on-manifold (OOC) states but not everywhere else, we are uncertain whether other probabilistic models satisfy this property; studying them would be an interesting orthogonal direction. Moreover, being able to sample these OOC states helps diffusion models that generate future trajectories make better world-dynamics predictions, which contributes to their performance.

The authors themselves point out in the proof under appendix B that non-zero probability does not prevent poor behavior, and propose to use data augmentation during training. Is there a reason that this data augmentation would not suffer from the same problems the authors point out in the case of RL?

We apologize for the confusion and would like to clarify that the potential failure mode we mentioned in Appendix B corresponds to out-of-distribution samples with unseen base elements (the gray area in Figure 1), which is out of scope for our paper. In this case, methods like data augmentation can improve performance if the process increases the coverage of base elements. Applying the same strategy to traditional RL methods can also mitigate the problem of generalization to unseen base elements (again, out of scope for this paper), but it still cannot solve the problem of zero-shot generalization to OOC states, which is the focus of our paper. We thank the reviewer for pointing out this confusion and have clarified the writing in our updated manuscript.

Minor points that would be good to address

We thank the reviewer for pointing out these typos, and we have fixed them in the manuscript.

评论

Weakness

The authors argue that a state can be composed of base elements, and attributes to those base elements. While they do not explicitly specify what constitutes those base elements beyond "car" and "bike", they do mention that there are also attributes, which are not relevant to the state as it pertains to the decision making, but only to the rendering function. I disagree with that statement. Certainly the velocity of a car would be important to one's decision making. It is not clear where "position" falls in this framework. Both these variables are important to the state, and would thereby invalidate the authors' assumption in most realistic scenarios, since the "finite base set" requirement cannot be satisfied.

We thank the reviewer for pointing out this confusion, and we would like to explain this setup more clearly. In Section 3.1 (state formed by element combinations), we define a base element to be the most elementary and identifiable element that is relevant to the decision-making task of interest. Each element can also be associated with a collection of attributes, such as the color of a vehicle or the velocity of an agent (lines 170-171, 178-179). So for the car/bike example, velocity and position are included in the attributes r. This definition takes an object-oriented perspective: after we have identified the base elements in the current state, the state/observation can be seen as the collection of attributes attached to each of these base elements (which is why we refer to it as rendering). So base elements are finite, but their attributes can be infinite under this setting. Thus we define n to be a finite, fixed number for each environment. We acknowledge that the wording "irrelevant to decision-making" is confusing and unnecessary for our combinatorial state definition, and we have removed this sentence from the definition.

We would like to further clarify how those attributes fall under the decision-making framework in the answer to the next question.

Central to the authors' claims seems to be that the sampled diffusion variables can be represented by a combinatorial set of finite elements. However, the approach in Janner et al. uses diffusion to sample future states and actions, conditioned upon the current state. In this case, the combinatorial state is not sampled, but rather is an input to the model.

We thank the reviewer for raising this insightful question and would like to clarify how being able to sample OOC states contributes to the decision-making process. The current state is indeed given to the diffusion model, and it includes all the information about the attributes (velocity, position, etc.) of every base element currently visible in the environment. Central to the approach in Janner et al. is that the diffusion model learns a world model and predicts future states and actions to help it make better decisions. Suppose the training states always start with {car, car} or {bike, bike}. In sequential decision-making tasks, the current state affects the distribution of future states according to the environment dynamics, so the future states of {car, car} or {bike, bike} will still be {car, car} or {bike, bike} (but with a different set of attributes like velocity/position). Future states of an OOC state ({car, bike}) are almost always going to be another OOC state (still {car, bike}). We proved that the diffusion model has a nonzero probability of sampling these OOC states, which means it has the ability to generate reasonable future predictions for these OOC current states. This is also demonstrated by the renderings of future states predicted by the diffusion model in Figure 6. We have updated Section 6.1 in the manuscript to better clarify this explanation.

评论

Weakness (continued)

I fail to see why the states defined by Z lie on a manifold. While there is indeed some prior work about manifolds in the dataspace, this specifically concerns an assumption for images. The dataspace that the authors present does not fall into that category. In particular, lines 304 to 309 say that because there is a countable number of elements, their data lies on a manifold. If I understand the setup correctly, Z is a Cartesian product of the base set E. This would not constitute a manifold. The manifold assumption seems central to the paper, but it is not defended. It also seems that Z has a varying dimension, which seems incompatible with the assumption that the data lie on a manifold. As a matter of fact, it is unclear even whether n varies in the setup. The abstract and title seem to suggest that it does, but I do not see any information in the paper (one way or another) on whether it varies or not.

We apologize for the confusion and would like to elaborate on why it is reasonable to use the manifold hypothesis for our analysis. Central to the manifold hypothesis is that high-dimensional datasets occurring in the real world lie on a low-dimensional latent manifold inside a high-dimensional space, essentially saying that data that initially requires many variables to describe can be concisely described with a small number of variables [1][2]. Lines 291-294 are meant to provide intuition on why it is reasonable to adopt the manifold hypothesis in our later analysis, rather than claiming or proving that states always form a manifold. The intuition is that, since in our setting each state can be represented by a (smaller) collection of base elements (rather than any arbitrary configuration of the high-dimensional world), it is reasonable for us to assume that the states lie on a lower-dimensional manifold. In other words, since these (smaller) collections of base elements contain sufficient information for our decision-making task, it is reasonable to represent states in a lower-dimensional space. We also acknowledge that the validity of this hypothesis is sometimes not obvious (lines 301-303). Nevertheless, the algorithm and modeling decisions we make based on this assumption still provide superior performance in OOC generalization, as demonstrated in our experiments.

As for the dimensionality of Z and n, we thank the reviewer for pointing out this disconnect. It is common to have a varying number of base elements in the environment. Therefore, we specify that Z is finite-dimensional (addressed in the previous answer) and fix n to be the maximum number of base elements. Any case with fewer than the maximum number of base elements can then be represented by padding the base-element vector to the maximum length. We thank the reviewer again for pointing out this confusion and have clarified the writing in our manuscript.
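As a concrete illustration, this fixed-length padding can be sketched as follows (a hypothetical sketch; the names and element encoding here are our own illustration, not the paper's actual implementation):

```python
# Hypothetical sketch: states with fewer base elements are padded to a fixed
# maximum length n, so every latent vector has the same dimension.
PAD = "padding"

def pad_elements(elements, n):
    """Pad a list of base elements (e.g. ["car", "bike"]) to length n."""
    if len(elements) > n:
        raise ValueError("more base elements than the fixed maximum n")
    return elements + [PAD] * (n - len(elements))

# A scene with two cars in an environment allowing up to four base elements:
print(pad_elements(["car", "car"], 4))  # ['car', 'car', 'padding', 'padding']
```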

[1]Cayton, L. (2008). Algorithms for manifold learning (p. 1). eScholarship, University of California.

[2]Loaiza-Ganem, G., Ross, B. L., Hosseinzadeh, R., Caterini, A. L., & Cresswell, J. C. (2024). Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections. arXiv preprint arXiv:2404.02954.

In the absence of strong theoretical results, I would hope to see very compelling experimental results, but strong experimental results are lacking. PPO, the most competitive of the baseline methods, seems to have been trained with 100x fewer parameters than the diffusion models.

We thank the reviewer for highlighting the concern regarding different model sizes. We included results with a large behavior cloning model that has roughly the same number of parameters as the diffusion model in Section D.8.2 to address this concern (lines 1244-1246). Additionally, we trained a large PPO model with roughly the same number of parameters as the conditioned diffusion model and included its performance in Table 5 as well.

| | Regular PPO | Large PPO | Diffusion |
| --- | --- | --- | --- |
| Model size | 0.59 MB (Policy: 0.29 + Value: 0.29) | 111.29 MB (Policy: 55.64 + Value: 55.64) | 54.19 MB |
| Parameters | 147,456 (Policy: 73,728 + Value: 73,728) | 27,823,622 (Policy: 13,911,040 + Value: 13,911,040) | 13,546,370 |
| Number of crashes | 36.0 (6.93) | 31.67 (1.89) | 20.0 (2.5) |
| Reward | 8.19 (0.19) | 8.43 (0.19) | 8.81 (0.2) |
评论

Thank you for clarifying that vectors are fixed length and padded for smaller sets. I understand the manifold hypothesis, but I still fail to see why the situation described in this work would constitute a manifold. Can the authors confirm that the elements of their vectors are indeed discrete elements encoded in some way (i.e. "car", "bike", "padding")?

Could the authors provide a mathematically rigorous defense of why their data constitute a manifold? And a linear manifold in the maze case?

It appears that reviewer iS1b03 had similar questions.

评论

elements of latent vector

The elements of the latent vector are not necessarily discrete. In Section 3.2, we define the probability space for both discrete (Definition 3.2, lines 199-202) and continuous latents (Definition 3.3, lines 203-208). In the discrete case (Roundabout), the latents are discrete elements like (car, bike, padding), but in the continuous case (Maze), the latents are the x and y coordinates, which span R^2.

Why data constitute a manifold:

Discrete case:

Since our base element set is a finite discrete set, it is a 0-dimensional manifold. Since the latent vector is the n-fold Cartesian product of this base element set, and the Cartesian product of manifolds is still a manifold, the latent vector lies on a manifold.
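The two standard facts used in this step can be stated explicitly (a sketch, assuming the base set E carries the discrete topology):

```latex
% A finite set E with the discrete topology is a 0-dimensional manifold:
% each singleton is an open neighborhood homeomorphic to \mathbb{R}^0.
% For manifolds M and N, the product M \times N is a manifold with
%   \dim(M \times N) = \dim M + \dim N.
% Applying this inductively to the n-fold product gives
\dim(E^n) \;=\; \dim(\underbrace{E \times \cdots \times E}_{n}) \;=\; n \cdot \dim E \;=\; 0,
% so the latent vectors lie on a (0-dimensional) manifold.
```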

Continuous case:

Since we define the σ-algebra as the Borel σ-algebra, the latent space forms a topological space. Most RL environments and real-world scenarios simulate physical spaces, where the basis of information can often be interpreted as continuous real numbers isomorphic to R^1 (e.g., the agent's x-location, reward values, or a single color channel). These spaces locally resemble Euclidean spaces. Consequently, the space spanned by these n bases locally resembles R^n, and the corresponding latent space naturally lies on a manifold. Also, if elements naturally form vector spaces, as in the examples above, then the states naturally lie on lower-dimensional manifolds, and these underlying manifolds are linear.

Therefore high-dimensional states/observations of the environment in the ambient space lie close to a low dimensional latent manifold.

Why maze is linear Manifold:

Since the basis of the maze is just the x and y axes, the latent space is the Euclidean plane R^2. R^2 is itself a linear manifold, since it can be viewed as an affine (shifted) subspace of itself.

However, as mentioned in our paper (lines 301-303) and in our previous rebuttal, this is meant only to provide intuition for why this can be a reasonable assumption to simplify the analysis, as opposed to claiming that all states in our setting form manifolds. That is, in many cases we also do not think the states necessarily reside strictly on a (linear) manifold; however, because of the compositional nature of our setting, it is reasonable to make this simplified assumption when making modeling decisions. This simplified assumption appears in the analysis of many manifold-aware models [1, 2, 3]. Our experimental results with a manifold-aware diffusion model also demonstrate superior performance, further supporting the validity of this assumption. We thank the reviewer for getting back to us, and we hope this has addressed the concern. Please let us know if there is anything else we can clarify.

[1] Donoho, David L., and Carrie Grimes. "Image manifolds which are isometric to Euclidean space." Journal of mathematical imaging and vision 23.1 (2005): 5-24.

[2]Brown, Bradley CA, et al. "Verifying the union of manifolds hypothesis for image data." arXiv preprint arXiv:2207.02862 (2022).

[3]Gong, Sixue, Vishnu Naresh Boddeti, and Anil K. Jain. "On the intrinsic dimensionality of image representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

评论

I thank the authors for their response, and I have to apologize for some confusion I may have caused. The manifold hypothesis is specifically about the data having lower dimension than the space it lives in, and while the proofs above do indeed show the data lie on a manifold, they do not say anything about a lower-dimensional manifold; it is of course the latter that is important to prove.

I have given the rest of the content some more thought, and given the other responses I do not take all that much issue with it. It seems like a relevant topic, and diffusion models seem to work well.

However, section 5.2 and appendix B take too much of a prominent role for me to ignore. The authors themselves state on many occasions that the sequence of ideas they propose has flaws (i.e. not all data lies on a manifold, non-zero probability does not mean good performance, etc.). Ultimately, I believe section 5.2 is too hypothetical, assumptions are not well defended, and the math is not rigorous enough to appear in a peer reviewed publication. I have therefore chosen to maintain my score.

评论

We thank the reviewer for getting back to us and giving our reasoning more thought! We apologize for a possible misunderstanding about state space versus latent space: our state space is indeed high-dimensional, and the latent space is the low-dimensional space (formed by composition) that lies on the manifold. State spaces in RL environments are often high-dimensional images or matrices that include redundant information. Thus we propose that this high-dimensional state space can be concisely described as a combination of latent features (agent types, reward, coordinates, etc.). For example, in the SMACv2 5v5 case, the state/observation space is composed of 130-dimensional global information and 5 agents' observations, each 93-dimensional. We propose that this high-dimensional information can be concisely described by a latent vector formed by the combination of agent types (10-dimensional for a total of 10 players), which aligns well with the manifold hypothesis. We would like to clarify that the high-dimensional state/observation is what the diffusion model predicts, but we give the low-dimensional latent vector as conditioning for better representation learning. We hope this fully addresses the misunderstanding about the manifold hypothesis, and we will add dimension of state space >> dimension of latent space to our problem definition.
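Using the dimensions stated above, the gap between the ambient state/observation space and the latent conditioning vector in the SMACv2 5v5 example can be checked with simple arithmetic:

```python
# SMACv2 5v5 dimensions as stated in the rebuttal above.
global_dim = 130      # global state information
agent_obs_dim = 93    # each agent's observation
num_agents = 5

# Ambient state/observation dimension.
state_dim = global_dim + num_agents * agent_obs_dim

# Latent conditioning vector: agent-type combination over 10 players.
latent_dim = 10

print(state_dim, latent_dim)  # 595 10
assert state_dim > latent_dim  # dim(state space) >> dim(latent space)
```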

We thank the reviewer again for giving our content some thought, and we would like to clarify that one of our major contributions is to identify this important generalization problem, which captures the complexity arising from combinations of different base elements, and to propose a viable solution with good empirical results. We propose to move Section 5.2 to the appendix if the reviewer thinks it is not rigorous enough, but we believe that doing so would preserve the major contribution and structure of the paper.

We deeply appreciate the reviewer again for your time and valuable feedback! Hope our clarification will address all the misunderstandings and please let us know if there is anything else we can modify.

评论

Dear reviewer 5rpB,

We deeply appreciate the time and effort you’ve dedicated to reviewing our work! We understand that this may be a particularly busy time, but we would like to send this gentle reminder as we approach the end of the reviewing process. Regarding the questions on the manifold hypothesis, we have realized that there may have been a potential misunderstanding regarding the distinction between state space and latent space. Our state space is indeed high-dimensional, and the latent space is the low-dimensional space (formed by composition) that lies on the manifold. State spaces in RL environments are often high-dimensional images or matrices that include redundant information. Thus we propose that this high-dimensional state space can be concisely described as a combination of latent features (agent types, reward, coordinates, etc.). For example, in the SMACv2 5v5 case, the state/observation space is composed of 130-dimensional global information and 5 agents’ observations, each 93-dimensional. We propose that this high-dimensional information can be concisely described by a latent vector formed by the combination of agent types (10-dimensional for a total of 10 players), which aligns well with the manifold hypothesis. We would like to clarify that the high-dimensional state/observation is what the diffusion model predicts, but we give the low-dimensional latent vector for better representation learning. We hope this fully addresses the misunderstanding about the manifold hypothesis.

We thank the reviewer again for giving our content some thought, and we would like to clarify that one of our major contributions is to identify this important generalization problem, which captures the complexity arising from combinations of different base elements, and to propose a viable solution with good empirical results. We would propose to move Section 5.2 to the appendix if the reviewer thinks it is not rigorous enough, but we believe that doing so would preserve the major contribution and structure of the paper. Also, as mentioned in our paper (lines 301-303) and in our previous rebuttal, this is meant only to provide intuition for why it can be a reasonable simplifying assumption when making modeling decisions. This simplified assumption appears in the analysis of many manifold-aware models [1, 2, 3]. Our experimental results with a manifold-aware diffusion model also demonstrate superior performance, further supporting the validity of this assumption.

[1] Donoho, David L., and Carrie Grimes. "Image manifolds which are isometric to Euclidean space." Journal of mathematical imaging and vision 23.1 (2005): 5-24.

[2]Brown, Bradley CA, et al. "Verifying the union of manifolds hypothesis for image data." arXiv preprint arXiv:2207.02862 (2022).

[3]Gong, Sixue, Vishnu Naresh Boddeti, and Anil K. Jain. "On the intrinsic dimensionality of image representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Thank you again for your valuable feedback on improving our work, and we look forward to hearing from you soon.

评论

Quoting the authors' earlier response

We would like to clarify that the high-dimensional state/observation is what the diffusion model predicts but we give the low-dimensional latent vector as conditioning for better representation learning. Hope this will fully address the misunderstanding about the manifold hypothesis and we will add the dimension of state space >> dimension of latent space in our problem definition.

This comes back to one of my earlier points (see my original review), which is that the generalization the authors talk about seems to be about the input to the model. The proof in Appendix B is about the output of the diffusion model. These are fundamentally different things. On top of this, there are the various other assumptions, discussed previously, that I think need reassessment. Altogether, this means I will keep my score as it is.

评论

Thank you so much for getting back to us! We believe we have already discussed this point in our previous rebuttal, and we would like to reiterate that this generalization property is used for future-state prediction. Central to the approach in Janner et al. is that the diffusion model learns a world model and predicts future states and actions to help it make better decisions. Suppose the training states always start with {car, car} or {bike, bike}. In sequential decision-making tasks, the current state affects the distribution of future states according to the environment dynamics, so the future states of {car, car} or {bike, bike} will still be {car, car} or {bike, bike} (but with a different set of attributes like velocity/position). Future states of an OOC state ({car, bike}) are almost always going to be another OOC state (still {car, bike}). We proved that the diffusion model has a nonzero probability of sampling these OOC states, which means it has the ability to generate reasonable future predictions for these OOC current states. This is also demonstrated by the renderings of future states predicted by the diffusion model in Figure 6. We appreciate the reviewer for getting back to us again, and please let us know if there are any other concerns we can address.

评论

Dear reviewer 5rpB,

We deeply appreciate the time and effort you’ve dedicated to reviewing our work! We understand that this may be a particularly busy time, but we would like to send this gentle reminder as we approach the end of the reviewing process. Regarding the questions on the manifold hypothesis, we have realized that there may have been a misunderstanding regarding the distinction between state space and latent space. We have addressed this in detail in our previous rebuttal and apologize for not identifying this potential source of confusion earlier. Thank you again for your valuable feedback on improving our work, and we look forward to hearing from you soon.

AC 元评审

This paper studies the problem of RL generalization from a state distribution covering only certain combinations of objects to a state distribution covering new combinations of objects. The key contribution is to use a conditional diffusion model, where the diffusion model is given information about the objects in a state. Using this model in RL amounts to planning, where the model is used to generate a trajectory and select the first action in the trajectory. Evaluation is performed on an autonomous driving simulator, the starcraft multi-agent challenge and various maze environments. Relative to a behavioral cloning baseline, constrained Q-learning and vanilla PPO, the diffusion model seems to improve generalization to the test environment that contains unseen combinations.

While reviewers appreciated the settings studied by the paper, they had concerns about the relevance of the proofs of the paper as well as requests for additional baselines. The AC, after reading the paper, found it interesting and encourages the authors to revise and resubmit the paper to a later venue.

审稿人讨论附加意见

While the authors and reviewers had an extended discussion, ultimately two reviewers remained unconvinced about the paper (though some concerns such as the manifold hypothesis did not make sense to the AC)

最终决定

Reject