PaperHub
Average rating: 4.3/10 · withdrawn · 3 reviewers
Ratings: 5, 5, 3 (min 3, max 5, std. dev. 0.9)
Confidence: 3.3
Correctness: 2.0 · Contribution: 2.0 · Presentation: 2.0
ICLR 2025

Evolution guided generative flow networks

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-01-11
TL;DR

A powerful method for training GFlowNets using evolutionary algorithms that enhances their robustness to long trajectories and sparse rewards

Abstract

Keywords
GFlowNets, Evolutionary Algorithms, Optimization

Reviews and Discussion

Official Review
Rating: 5

This paper introduces a variant of GFlowNets which is trained via evolutionary optimization. The key argument is that GFlowNets are a generative process learned via a reward signal; however, propagating this reward throughout the time horizon is tricky. The paper proposes the use of classical evolutionary techniques to alleviate these issues. A population of networks is maintained; selection, crossover, and mutation are then performed over the parameters during the evolutionary step. The evolutionary step is interleaved with traditional gradient descent.
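For concreteness, here is a minimal, hypothetical sketch of such a loop (the helper callables `fitness`, `crossover`, `mutate`, `rollout`, and `gradient_step`, and all default sizes, are placeholders rather than the authors' implementation):

```python
import random
from typing import Any, Callable, List

def egfn_training_loop(
    make_agent: Callable[[], Any],
    fitness: Callable[[Any], float],          # e.g. mean reward of sampled trajectories
    crossover: Callable[[Any, Any], Any],
    mutate: Callable[[Any], Any],
    rollout: Callable[[Any], list],           # sample trajectories from an agent
    gradient_step: Callable[[Any, list], None],  # one TB/DB-style update of the star agent
    population_size: int = 20,
    n_elite: int = 5,
    n_iters: int = 1000,
) -> Any:
    """Sketch of the loop described above: an evolutionary step (selection,
    crossover, mutation) over a population of GFlowNets, interleaved with
    gradient-based training of a separate 'star' agent that also learns
    off-policy from a replay buffer filled by the population."""
    population = [make_agent() for _ in range(population_size)]
    star_agent = make_agent()
    replay_buffer: List[Any] = []  # stands in for the prioritized replay buffer (PRB)

    for _ in range(n_iters):
        # Evolutionary step over network parameters.
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:n_elite]                                   # selection
        children = [
            mutate(crossover(random.choice(elites), random.choice(elites)))
            for _ in range(population_size - n_elite)
        ]                                                           # crossover + mutation
        population = elites + children

        # Trajectories from the population go into the replay buffer.
        for agent in population:
            replay_buffer.extend(rollout(agent))

        # Interleaved gradient descent for the star agent: on-policy rollouts
        # plus off-policy samples drawn from the buffer.
        batch = rollout(star_agent) + random.sample(
            replay_buffer, k=min(16, len(replay_buffer))
        )
        gradient_step(star_agent, batch)

    return star_agent
```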

Strengths

The idea to apply evolutionary algorithms to GFlowNets is a new direction. Bootstrap-based methods such as GFlowNets often suffer from collapse or poor gradient flow, thus motivating evolutionary algorithms as a potential solution. The proposed method is simple and outperforms previous methods on synthetic tasks. The proposed method consistently outperforms other GFlowNet settings, and discovers more modes of a solution.

Weaknesses

The method introduces additional complexity in the form of the evolutionary optimization, but does not analyze why such a decision would improve performance. The evolutionary algorithms applied are well-known, and the combined algorithm boils down to an ad-hoc fitting of EA and gradient descent sequentially. The added complexity results in slower training speed, as mentioned in the paper. This paper would strongly benefit from a more principled look at the training dynamics of GFlowNets, and a stronger opinion on why evolutionary algorithms help learning. Given the smaller-scale nature of the tasks considered, this is a reasonable desire.

Questions

  • Can GFlowNets be applied to more traditional generative modelling tasks? (e.g. images, etc).
  • In Figure 4, it would help to clear up which of the labelled methods are RL, MCMC, or GFlowNet variants.
  • In page 2 paragraph 2, it would be good to re-clarify what TB stands for, and introduce these prior objectives together.
  • How are neural network weights mixed in the crossover step? Is this an important detail, or is a naive strategy good enough?
Comment

Dear reviewer kQAn,

Thank you for your comments! The insights you provided in the reviews are invaluable, and we would like to thank you for your time crafting this review.

In what follows, we address your comments one by one.


W1. The method introduces additional complexity in the form of the evolutionary optimization, but does not analyze why such a decision would improve performance. The evolutionary algorithms applied are well-known, and the combined algorithm boils down to an ad-hoc fitting of EA and gradient descent sequentially. The added complexity results in slower training speed, as mentioned in the paper. This paper would strongly benefit from a more principled look at the training dynamics of GFlowNets, and a stronger opinion on why evolutionary algorithms help learning.

Thank you for your observation. To address your concern, we have performed an additional experiment with increasing task sizes. Figure 22 of Appendix H reports the results. The figure supports our claim that EA is more resistant to trajectory length: the mean reward of the population still rises quickly as the tasks become more difficult, compared to the star agent.

Now, why do high-reward samples matter? EAs are naturally diverse and explorative, so the high-reward samples are also diverse. These samples go to the PRB, where they are drawn for off-policy training of the star agent. This is important because the added diversity acts as a trajectory-length regularizer for the PRB (we show this in Figure 9 of Section 5), making sure the star agent is trained sufficiently across all trajectory lengths, while the high-reward samples ensure better training signals.


Q1. Can GFlowNets be applied to more traditional generative modeling tasks? (e.g. images, etc).

While the current state of GFlowNets does not scale to such a large number of variables, GFlowNets have been applied to many auxiliary tasks such as pretraining and fine-tuning [1], learning discrete latent variables [2], and text-to-image diffusion alignment [3].


Q2. In Figure 4, it would help to clear up which of the labelled methods are RL, MCMC, or GFlowNet variants.

Thank you for your suggestion! Our current change should reflect your suggestion.


Q3. In page 2 paragraph 2, it would be good to re-clarify what TB stands for, and introduce these prior objectives together.

Thank you for your suggestions. We have revised our draft to address your comments.


Q4. How are neural network weights mixed in the crossover step? Is this an important detail, or is a naive strategy good enough?

We use a naive strategy. For clarity, we have included the algorithm for our strategy in Appendix I.
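For illustration, one common "naive" scheme is per-parameter random mixing of two parents followed by Gaussian mutation; the sketch below is an assumption about what such a strategy can look like, not the algorithm from Appendix I:

```python
import numpy as np

def naive_crossover(parent_a: dict, parent_b: dict, rng: np.random.Generator) -> dict:
    """Per-parameter random mix of two parents' weights."""
    child = {}
    for name in parent_a:
        mask = rng.random(parent_a[name].shape) < 0.5   # pick each weight from A or B
        child[name] = np.where(mask, parent_a[name], parent_b[name])
    return child

def gaussian_mutation(weights: dict, rng: np.random.Generator, sigma: float = 0.01) -> dict:
    """Add small Gaussian noise to every parameter."""
    return {name: w + sigma * rng.standard_normal(w.shape) for name, w in weights.items()}

# Toy usage with parameter dictionaries standing in for network state_dicts.
rng = np.random.default_rng(0)
p1 = {"w": rng.standard_normal((4, 3)), "b": rng.standard_normal(4)}
p2 = {"w": rng.standard_normal((4, 3)), "b": rng.standard_normal(4)}
child = gaussian_mutation(naive_crossover(p1, p2, rng), rng)
```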


References

  1. Pan, Ling, et al. "Pre-Training and Fine-Tuning Generative Flow Networks." The Twelfth International Conference on Learning Representations.
  2. Hu, Edward J., et al. "GFlowNet-EM for learning compositional latent variable models." International Conference on Machine Learning. PMLR, 2023.
  3. Zhang, Dinghuai, et al. "Improving GFlowNets for Text-to-Image Diffusion Alignment." arXiv preprint arXiv:2406.00633(2024).
Comment

Thank you for the detailed response, the changes are acknowledged and the new version is indeed clearer to read. I've updated my score slightly. I would recommend to the authors that an interesting direction of study is in comparing the relationship between reward-optimization (GFlowNets, RL algs) and evolutionary buffers in general, rather than this specific case only, as the specificity of the study makes it hard to derive insights that may apply elsewhere.

Official Review
Rating: 5

The paper proposes using an Evolutionary Algorithm to fill a Prioritized Replay Buffer with (more) diverse trajectories to enhance the training process of Generative Flow Networks.

Strengths

Overall, the proposed approach is well-explained and illustrated. The paper provides various empirical results in a simple exemplary task, as well as molecule generation tasks, to demonstrate real-world applicability. For a fair comparison, it provides various baselines and ablations. The empirical results show improved performance, particularly in sparse reward scenarios and large state spaces. Furthermore, the authors provide a discussion on potential limitations and provide reasoning for the advantages of the proposed approach based on an intuitive empirical analysis.

Weaknesses

Regarding the proposed method, the EA does not seem to influence the actual training process beyond providing diverse experiences to be sampled. In that regard, an evaluation comparing the performance of the population to the star agent to validate the assumptions would have been helpful. Also, in addition to the provided baseline comparison, I am missing a comparison based on the number of evaluated trajectories to assess the sample complexity advantages of the proposed approach. While improving in sparse scenarios, the proposed approach seems to perform slightly worse in the more generic tasks despite being computationally more intense. Regarding the baselines used, especially GAFN and MARS, I am missing a short introduction, explanation, or comparison. Also, the sparsity levels shown in Fig. 4 should be elaborated more concretely. Regarding the presentation, the overall writing might be slightly improved, e.g., regarding grammar.

Minor comments:

  • p.3 l.109f.: abbreviations FM, DB, and TB should be introduced first.
  • Alg. 1, l.181: P^*_F should be P_F? Alternatively, the reason the star agent is used to evaluate the population should be elaborated on.
  • Alg. 1 l.182f.: vars for online and offline trajectories should differ

Questions

What is the computational overhead of maintaining a whole population of GFN agents that must be evaluated in addition?

Why not train all agents in the population or train the best agent(s) in the population instead of maintaining a separate star agent?

How do the authors ensure a fair comparison to the provided baseline regarding the number of evaluated trajectories? (And how is the PRB filled for the baselines?)

Regarding the ablation EGFN-PRB-mutation, how is the EA connected with the training of the star agent?

Comment

Dear reviewer yNGR,

Thank you for your comments! The insights you provided in the reviews are invaluable, and we would like to thank you for your time crafting this review.

In what follows, we address your comments one by one.


W1. EA only provides diverse experiences to be sampled. Comparing the performance of the population to the star agent to validate the assumptions.

While it is true that EA provides diverse experiences to be sampled, EA also provides high-reward, diverse samples for the PRB to sample from. In Figure 22 of Appendix H, we compare the rewards of the population and the star agent, showing that the population achieves a high mean reward more quickly than the star agent.


W2. Comparison based on the number of evaluated trajectories to assess the sample complexity.

Thanks for bringing up this point. For a fair comparison, we keep the number of evaluations similar: if $S$ and $S'$ are the numbers of on-policy samples evaluated by the other algorithms and by EGFN, respectively, it suffices to ensure that

$$S \approx S' + \mathcal{E}k.$$

Given your feedback, we have also added an experiment comparing GFlowNets and EGFN with the number of trajectory evaluations on the x-axis (Figure 21 in Appendix H), using the same hyperparameters as our main experiments. We still see a significant improvement of EGFN compared to GFlowNets.
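As an illustrative check of this budget matching, using the per-step reward-call counts quoted elsewhere in this discussion (16 on-policy star-agent samples plus 20 population evaluations for EGFN, versus 32 on-policy samples for the other baselines):

$$S' + \mathcal{E}k = 16 + 20 = 36 \approx 32 = S.$$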


W3. EGFN performs slightly worse in the more generic tasks despite being computationally more intense

It is not clear on which generic tasks EGFN performs slightly worse. Besides, given that GFlowNets is a sampling algorithm, it is necessary to point out that it is also susceptible to mode collapse; thus not just the speed, but also the mode-finding capability is important. Finally, we believe the conversation about using EAs for sampling algorithms such as GFlowNets is valuable to the community, despite the computational intensity.


W4. GAFN and MARS missing a short introduction, explanation, or comparison

Thank you for your feedback. We have added a short introduction to all baselines in Appendix G.


W5. Elaborate sparsity levels shown in Fig. 4

Thank you for pointing it out. We have fixed it now in our updated manuscript.


W6. p.3 l.109f.: abbreviations FM, DB, and TB should be introduced first.

We introduced the abbreviations in lines 42 and 81 of Section 1. Please let us know if this is not sufficient.


W7. Alg. 1, l.181: P^*_F should be P_F? Alternatively, the reason the star agent is used to evaluate the population should be elaborated on.

Thank you. We have edited our algorithm accordingly. It was supposed to be a mechanism to store the online trajectories of the star agent in the replay buffer, which our current edit reflects.


W8. Alg. 1 l.182f.: vars for online and offline trajectories should differ

We agree. Our current change reflects your observation.


Comment

Q1. What is the computational overhead of maintaining a whole population of GFN agents that must be evaluated in addition?

Since the population of agents only needs to be kept in memory for sampling trajectories, the memory overhead is the equivalent of another agent. For runtime, we provide an analysis in Table 2. Our method takes around 35% longer than GFlowNets.


Q2. Why not train all agents in the population or train the best agent(s) in the population instead of maintaining a separate star agent?

While that could be an option, we did not explore it, in order to keep the distinction between the two methods (EA and GFlowNets) clear. One of our key objectives is to explore whether trajectories from EA-trained agents are useful for training a GFlowNets agent to improve its mode-finding capabilities in the difficult cases.


Q3. How do the authors ensure a fair comparison to the provided baseline regarding the number of evaluated trajectories? (And how is the PRB filled for the baselines?)

We keep the number of evaluated trajectories similar for all baselines by increasing the number of on-policy samples (as only those are evaluated) for the baselines to account for the additional evaluations of the EGFN population ($\mathcal{E}k$; see also W2). The only exception is PPO, where we double the amount of evaluations since it does not use off-policy samples. The number of off-policy (PRB) samples is similar for all baselines too, and we keep a fixed PRB size for all baselines.


Q4. Regarding the ablation EGFN-PRB-mutation, how is the EA connected with the training of the star agent?

The population of agents is trained using an EA with a reward-maximization criterion. The trajectories obtained by the population are then stored in the PRB. When off-policy samples are drawn to train the star agent, the PRB supplies high-reward yet diverse samples for its training.

Comment

Thank you for your response and for addressing the weaknesses pointed out. However, the revision seems to mostly contain changes to the appendix. Also, regarding the computational complexity, I would have wished for a more thorough analysis. Overall, while pursuing an interesting direction, the proposed approach still provides a comparably low contribution and is, in its current form, missing significant results or improvements. Therefore, I will maintain my previous rating.

Official Review
Rating: 3

This paper introduces Evolution guided generative flow networks (EGFN), a new algorithm equipped with an evolutionary algorithm (EA) for better generative flow network (GFN) training. EGFN collects diverse and high-reward samples using a population of GFNs that evolves throughout the training procedure. The collected samples are then utilized to train a target star GFN agent in an off-policy manner. EGFN showed faster learning capability in one synthetic and three biochemical tasks, especially when the reward signal is sparse.

Strengths

  1. The idea of using an evolutionary algorithm that evolves a population of GFlowNets is new, though similar approaches have already been introduced in reinforcement learning to enhance exploration [1, 2].
  2. The proposed algorithm is validated through various experiments, including real-world biochemical tasks. I also enjoyed their analysis of why the proposed algorithm works (section 5).

[1] Salimans, Tim, et al. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv:1703.03864 (2017).
[2] Khadka, Shauharda, and Kagan Tumer. "Evolution-guided policy gradient in reinforcement learning." NeurIPS (2018).

Weaknesses

  1. In its current form, the paper contains several ambiguous or incorrect claims in Section 2.1. Here are some key issues I noticed:
    1-1. Lines 93 - 102: The DAG structure should be defined first to clearly specify the action space according to the DAG’s edges. Additionally, the phrase “sample proportionally to different peaks of reward” in line 101 (and line 34) is misleading.
    1-2. Line 112: The notation $F(\tau)$ has never been defined but is used to define $F(s)$.
    1-3. Line 118: There’s an incorrect use of the prime ( ` ) symbol. Moreover, the equation $P_F(s' \mid s, \theta) = F(s \to s')$ is inaccurate. The RHS should be divided by $F(s)$. A similar issue appears in lines 124-125.
    1-4. Line 133: The expression $\sum_x R(x) = \sum_{s : s_0 \to s \in \tau\ \forall \tau \in \mathcal{T}} P_F(s \mid s_0; \theta)$ needs more explanation. At first glance, it doesn’t seem to hold generally.
    I believe these points could be clarified easily with careful revisions.

  2. I’m unclear on why EGFN improves credit assignments. The star agent in the EGFN framework uses conventional learning objectives like DB or TB, and I couldn’t find any specific design element that enhances credit assignment. From what I understand, EGFN’s main advantage is its evolving population of GFNs, which provides more diverse experiences for the star agent to learn from. This should enhance exploration, which is especially beneficial in sparse environments.

  3. I have some concerns about the experiments:
    3-1. Experiment Setup (Reward Calls): Were all algorithms given the same number of reward calls? All learning progress figures use training steps as the x-axis, but I suspect EGFN might use additional reward calls per training step due to the rewards needed for fitness calculation (line 173). However, in real-world applications where reward evaluation is costly (e.g., in vitro experiments), sample efficiency is often more critical than learning efficiency [3, 4]. Therefore, I recommend including results with a fixed number of reward calls, especially for biochemical sequence generation tasks.
    3-2. (minor) Line 304 and 898: The paper states the number of modes for the hypergrid task is $2^D$, but this doesn’t seem correct. There are indeed $2^D$ reward “regions” if a region is defined as a collection of adjacent modes. However, the actual number of modes could be $2^D \times M$, where $M$ represents the number of modes in each region, potentially increasing with $H$.

  4. (minor) The references are outdated and not well organized. Some of them, but not limited to, are: in line 663, Pan et al. 2023a was accepted by ICML 2023, and in line 728, Zhang et al. 2023b was accepted by TMLR. Also, there are two references for "Generative augmented flow networks."

[3] Gao, Wenhao, et al. "Sample efficiency matters: a benchmark for practical molecular optimization." NeurIPS (2022).
[4] Kim, Hyeonah, et al. "Genetic-guided GFlowNets: Advancing in Practical Molecular Optimization Benchmark." arXiv:2402.05961 (2024).

Questions

  1. How many reward calls are used per training step for EGFN and each baseline?
  2. The biochemical tasks appear to share many similarities. Is there a specific reason for dividing them into three sections (4.2, 4.3, and 4.4)?
  3. In lines 254-259, two prioritization methods are introduced: proportional sampling and percentile-based heuristics. Which one is actually used in the experiments?
  4. I suspect that memory consumption increases linearly in $K$ (the population size). Is this true?
Comment

Dear reviewer 2fJP,

Thank you for your comments! The insights you provided in the reviews are invaluable, and we would like to thank you for your time crafting this review.

In what follows, we address your comments one by one.


W1. In its current form, the paper contains several ambiguous or incorrect claims in Section 2.1.

1-1. Lines 93 - 102: The DAG structure should be defined first to clearly specify the action space according to the DAG’s edges. Additionally, the phrase “sample proportionally to different peaks of reward” in line 101 (and line 34) is misleading.

To incorporate your feedback, we have rewritten the beginning paragraph of Section 2.1.

1-2. Line 112: The notation $F(\tau)$ has never been defined but is used to define $F(s)$.

To address your feedback, we have added the definition.

1-3. Line 118: There’s an incorrect use of the prime ( ` ) symbol. Moreover, the equation $P_F(s' \mid s, \theta) = F(s \rightarrow s')$ is inaccurate. The RHS should be divided by $F(s)$. A similar issue appears in lines 124-125.

Thank you for pointing them out. Our revision should now reflect your observation.

1-4. Line 133: The expression $\sum_x R(x) = \sum_{s : s_0 \rightarrow s \in \tau\ \forall \tau \in \mathcal{T}} P_F(s \mid s_0; \theta)$ needs more explanation. At first glance, it doesn’t seem to hold generally.

The total flow is the sum of the forward flows from the starting state, $P_F(s \mid s_0; \theta)$, over all states $s$ that share an edge with $s_0$. Alternatively, it can be thought of as the sum of all terminal rewards upon convergence.
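For reference, the standard flow relations from the GFlowNet literature that this exchange revolves around (the paper's exact notation may differ):

$$F(s) = \sum_{\tau \ni s} F(\tau), \qquad P_F(s' \mid s; \theta) = \frac{F(s \to s')}{F(s)}, \qquad Z = F(s_0) = \sum_{x \in \mathcal{X}} R(x),$$

where the last equality holds once the flows are consistent with the rewards at terminal states, which is the sense of "upon convergence" above.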


W2. 1. I’m unclear on why EGFN improves credit assignments. The star agent in the EGFN framework uses conventional learning objectives like DB or TB, and I couldn’t find any specific design element that enhances credit assignment. From what I understand, EGFN’s main advantage is its evolving population of GFNs, which provides more diverse experiences for the star agent to learn from. This should enhance exploration, which is especially beneficial in sparse environments.

The EA also provides high-reward, diverse samples for the PRB to draw from. In Figure 22 of Appendix H, we compare the rewards of the population and the star agent, showing that the population achieves a high mean reward more quickly than the star agent. It can also be seen that, with increasing difficulty, the star agent's top-10% reward starts to fall while the EA population, being a black-box optimization method, remains more consistent at finding high-reward samples for the PRB.


W3.1. Experiment Setup (Reward Calls): Were all algorithms given the same number of reward calls? All learning progress figures use training steps as the x-axis, but I suspect EGFN might use additional reward calls per training step due to the rewards needed for fitness calculation (line 173). However, in real-world applications where reward evaluation is costly (e.g., in vitro experiments), sample efficiency is often more critical than learning efficiency [3, 4]. Therefore, I recommend including results with a fixed number of reward calls, especially for biochemical sequence generation tasks.

EGFN uses 36 calls. Other baselines use 32, except for PPO which doubles the reward calls since we double the amount of on-policy samples due to not having off-policy samples.

However, in real-world applications where reward evaluation is costly (e.g., in vitro experiments), sample efficiency is often more critical than learning efficiency [3, 4] Therefore, I recommend including results with a fixed number of reward calls, especially for biochemical sequence generation tasks.

Indeed, sample efficiency is important for in vitro experiments. However, in many active-learning settings where a surrogate reward is used, learning efficiency is more important. To address your feedback, we have included preliminary results comparing GFlowNets and EGFN with the number of trajectory evaluations on the x-axis (Figure 21 in Appendix H), using the same hyperparameters as our main experiments. We still see a significant improvement of EGFN compared to GFlowNets. We will include similar results for the biological experiments in the final revision.


W3.2. Line 304 and 898: The paper states the number of modes for the hypergrid task is $2^D$, but this doesn’t seem correct. There are indeed $2^D$ reward “regions” if a region is defined as a collection of adjacent modes. However, the actual number of modes could be $M \times 2^D$, where $M$ represents the number of modes in each region, potentially increasing with $H$.

Thank you for your careful observation. This is indeed the correct general number of modes. For our experiments, we found $M$ to be equal to 1. Our revised draft reflects your observation.

Comment

W4. The reference is outdated and not well organized. Some of them, but not limited to, are: in line 663, Pan et al. 2023a was accepted by ICML 2023, and in line 728, Zhang et al. 2023b was accepted by TMLR. Also, there are two references for "Generative augmented flow networks."

Thank you for pointing them out! Our current revision reflects the changes.


Q1. How many reward calls are used per training step for EGFN and each baseline?

EGFN uses 36 calls. Other baselines use 32, except for PPO which doubles the reward calls since we double the amount of on-policy samples due to not having off-policy samples.


Q2. The biochemical tasks appear to share many similarities. Is there a specific reason for dividing them into three sections (4.2, 4.3, and 4.4)?

While they are all biochemical tasks, the purpose and the state space (and therefore the difficulty) are different. For example, the sEH binder generation task is a standard task in the previous GFlowNets literature [1,2], and the antibody sequence generation task is inherited from the discrete walk-jump sampling works [3,4].


Q3. In lines 254-259, two prioritization methods are introduced: proportional sampling and percentile-based heuristics. Which one is actually used in the experiments?

We used the percentile-based approach after conducting ablations on both (see Appendices F5 and F6).
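As an illustration of the two options being compared (a generic sketch under assumed conventions, not the paper's implementation; the 10% cut-off is an arbitrary placeholder), reward-proportional sampling weights buffer entries by their reward, while the percentile-based heuristic samples uniformly from the top slice of the buffer:

```python
import random
from typing import List, Tuple

Trajectory = Tuple[list, float]  # (trajectory, terminal reward)

def sample_proportional(buffer: List[Trajectory], batch_size: int) -> List[Trajectory]:
    """Draw entries with probability proportional to their (non-negative) reward."""
    rewards = [reward for _, reward in buffer]
    return random.choices(buffer, weights=rewards, k=batch_size)

def sample_percentile(buffer: List[Trajectory], batch_size: int, top_frac: float = 0.1) -> List[Trajectory]:
    """Draw uniformly (with replacement) from the top `top_frac` fraction of the buffer by reward."""
    ranked = sorted(buffer, key=lambda entry: entry[1], reverse=True)
    top = ranked[: max(1, int(top_frac * len(ranked)))]
    return random.choices(top, k=batch_size)
```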


Q4. I suspect that memory consumption increases linearly in $K$ (the population size). Is this true?

It is true and our current draft reflects it. However, the increased memory requirement has not been an issue in practice, as all our experiments ran on a mid-range daily laptop.


References

  1. Zhang, Dinghuai, et al. "Distributional GFlowNets with Quantile Flows." Transactions on Machine Learning Research.
  2. Pan, Ling, et al. "Generative Augmented Flow Networks." The Eleventh International Conference on Learning Representations.
  3. Frey, Nathan C., et al. "Protein Discovery with Discrete Walk-Jump Sampling." The Twelfth International Conference on Learning Representations.
  4. Ikram, Zarif, Dianbo Liu, and M. Saifur Rahman. "Antibody sequence optimization with gradient-guided discrete walk-jump sampling." ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design.
Comment

I apologize for the late response and appreciate your efforts on the rebuttal.

Here are some additional comments on the updated manuscript:

  • I cannot see the revision history. Is there any way to make the previous version visible to reviewers?

  • Remaining errors

    • line 102: A Directed Acyclic Graph (DAG) includes a tree as a special case. In a tree, there is only one path leading to each state.
    • line 119: I suggest using $\mathbb{R}$ for the set of real numbers instead of $R_{\geq 0}$, as it could cause confusion with the notation for reward.
    • line 127 (Eq.3): The flow-matching GFlowNet requires parameterization of the edge flows. The forward policy $P_F(\cdot \mid \cdot, \theta)$ is derived from these edge flows, as mentioned in my previous review. In its current form, the equation still seems incorrect. Typically, the flow-matching loss is defined using the edge flows.
  • Regarding Point W2: To my understanding, finding high-reward samples is not directly related to the credit-assignment problem. Credit assignment involves "distributing the credit of success among the multitude of decisions involved" [1]. For more on credit assignment from a GFlowNet perspective, please refer to [2, 3].

  • Regarding Point W3.1:

    • From Figure 21, it appears that EGFN and GFN do not show significant differences in terms of the number of modes discovered.
    • If learning efficiency is important, why not compare the algorithms based on execution time?
    • I believe that comparing algorithms in terms of sample efficiency offers a fairer assessment, as is common in many GFlowNet studies [3, 4].

[1] Minsky, Marvin. "Steps toward artificial intelligence." Proceedings of the IRE 49.1 (1961): 8-30.
[2] Pan, Ling, et al. "Better training of gflownets with local credit and incomplete trajectories." International Conference on Machine Learning. PMLR, 2023.
[3] Jang, Hyosoon, Minsu Kim, and Sungsoo Ahn. "Learning energy decompositions for partial inference of gflownets." arXiv preprint arXiv:2310.03301 (2023).
[4] Lau, Elaine, et al. "Qgfn: Controllable greediness with action values." arXiv preprint arXiv:2402.05234 (2024).

Comment

F6. From Figure 21, it appears that EGFN and GFN do not show significant differences in terms of the number of modes discovered. If learning efficiency is important, why not compare the algorithms based on execution time?

A6. We thank the reviewer for such detail-oriented feedback. We would like to point the reviewer to the end of the Figure where EGFN's mode count is pointing upwards, while GFlowNets' mode count seems to converge.

Indeed, this is also evident from the L1 error in the figure, where EGFN clearly dominates GFlowNets. The results show that, with a similar number of evaluated samples, EGFN models the reward distribution better, which makes it more likely to find more modes.

On a similar note, we would also like to make it clear that by learning efficiency we specifically mean how well the algorithm can model the reward distribution.

The authors in [5] state "The challenge of mode collapse manifests in GFlowNets as well. A GFlowNet may become fixated on a particular mode during its training phase, motivating various off-policy exploration techniques to enhance discovery of modes during training."

Evidently, as a model learns the distribution better, it should be more likely to discover more modes, as we see from the upward trend of EGFN compared to the convergence of GFlowNets. This is especially useful when the reward model we use is a surrogate.

F7. I believe that comparing algorithms in terms of sample efficiency offers a fairer assessment, as is common in many GFlowNet studies [3, 4].

A7. We thank the reviewer for their feedback. Our Figure 21 is an effort to offer that assessment, which we plan to extend to the other experiments; this is something we could not do during the author-reviewer discussion period.

We also believe the results in Figure 21 should convince the reviewer that the results in our other experiments would also hold under that requirement. Indeed, our method only exceeds the GFlowNets baseline by 4 reward calls per step, which is mainly due to our design choice to keep the number of online samples a multiple of 16, e.g., 16 for EGFN (plus 20 for the population) versus 32 for the other baselines.

  5. Krichel, Anas, et al. "On Generalization for Generative Flow Networks." arXiv preprint arXiv:2407.03105 (2024).
Comment

F1. I cannot see the revision history. Is there any way to make the previous version visible to reviewers?

A1. As authors, we cannot edit the visibility of the revisions. That said, please let us know which part you'd like to know the before and after of and we can provide a prompt description.

Now that we are aware of that, our responses below will attempt to explicitly mention the changes we made. We thank the reviewer for letting us know.

F2. line 102: A Directed Acyclic Graph (DAG) includes a tree as a special case. In a tree, there is only one path leading to each state.

A2. Thank you for pointing it out. Indeed, we missed this special case: trees are a special case of DAGs, and we do not consider them, since "tree-structured DAGs (autoregressive generation) are equivalent to RL with appropriate entropy regularization or soft Q-learning and control as inference", as mentioned by [1].

To address this, we have changed "There exist different paths leading to the same state in the DAG, except for the root, which has no parent." to " We specifically consider DAGs that are not tree-structured, thus there exist different paths leading to the same state in the DAG, except for the root, which has no parent."
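As a concrete illustration of the non-tree case (using the paper's hypergrid task, assuming the standard setup where each action increments one coordinate), the state $(1,1)$ can be reached along two distinct paths,

$$(0,0) \to (1,0) \to (1,1) \quad \text{and} \quad (0,0) \to (0,1) \to (1,1),$$

so the underlying state graph is a DAG but not a tree.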

F3. line 119: I suggest using $\mathbb{R}$ for the set of real numbers instead of $R_{\ge 0}$, as it could cause confusion with the notation for reward.

A3. We agree with the reviewer and added this change from $R$ to $\mathbb{R}$ in our manuscript.

F4. line 127 (Eq.3): The flow-matching GFlowNet requires parameterization of the edge flows. The forward policy is derived from these edge flows, as mentioned in my previous review. In its current form, the equation still seems incorrect. Typically, the flow-matching loss is defined using the edge flows.

A4. To address the reviewer, we have now defined our loss using estimated edge flows.

Particularly we have changed "where $F(s' \rightarrow s'') = R(s)$ if $s \in \mathcal{X}$. Using a estimated distribution over children $P_F(s' \mid s, \theta)$ and an estimated distribution over parents $P_F(s' \mid s'', \theta)$, ..." to "To achieve the criterion, using an estimated edge flow $F_\theta : \mathcal{E} \rightarrow \mathbb{R}^+$, ..."
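For context, a generic form of the flow-matching objective written directly in terms of an estimated edge flow $F_\theta$ (this is the standard formulation from the GFlowNet literature, stated here as a reference point; the revised manuscript's exact equation may differ):

$$\mathcal{L}_{\mathrm{FM}}(s') = \left(\log \sum_{(s \to s') \in \mathcal{E}} F_\theta(s \to s') \;-\; \log \Big[ R(s') + \sum_{(s' \to s'') \in \mathcal{E}} F_\theta(s' \to s'') \Big] \right)^2,$$

with $R(s') = 0$ for non-terminal states and the usual $\epsilon$ smoothing inside the logarithms omitted.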

References

  1. Malkin, Nikolay, et al. "Trajectory balance: Improved credit assignment in gflownets." Advances in Neural Information Processing Systems 35 (2022): 5955-5967.
Comment

F5. To my understanding, finding high-reward samples is not directly related to the credit-assignment problem. Credit assignment involves "distributing the credit of success among the multitude of decisions involved" [1]. For more on credit assignment from a GFlowNet perspective, please refer to [2, 3].

A5. We now understand the source of confusion and will try to answer as clearly as possible.

The reviewer claims that "finding high-reward samples is not directly related to the credit-assignment problem". However, our work is concerned with credit assignment in the setting of long trajectories and sparse rewards.

Indeed, credit assignment in long trajectories and sparse rewards is difficult with the current advances, as in [6] the authors conclude, "TB trades off the advantage of immediately providing credit to early states with the disadvantage of relying on sampling of long trajectories...".

This problem is evident in other methods too, as the authors in [7] discuss, "... methods in RL use bootstrapping to address this issue but often struggle when the time horizons are long and the reward is sparse".

This being the case, we do not tackle the credit-assignment problem itself; rather, we address it in the difficult cases and utilize current credit-assignment methods within our approach. Our evaluations in Figure 9 show that the mentioned cases make the training trajectory lengths skewed, and utilizing EA helps it to be more balanced by improving diversity.

Next, our evaluations in Figure 22 show that our assumption that the EA is immune to this problem is valid, and that it can indeed bring better samples for improved training in the difficult cases.

  6. Malkin, Nikolay, et al. "Trajectory balance: Improved credit assignment in gflownets." Advances in Neural Information Processing Systems 35 (2022): 5955-5967.
  7. Khadka, Shauharda, and Kagan Tumer. "Evolution-guided policy gradient in reinforcement learning." Advances in Neural Information Processing Systems 31 (2018).
Comment

Thank you for your response.

" We specifically consider DAGs that are not tree-structured, thus there exist different paths leading to the same state in the DAG, except for the root, which has no parent."

This seems a bit misleading. If your method isn't specifically designed for non-tree DAGs, it might not make sense to limit your work to "DAGs except trees." It might be better to remove this sentence, if it is not necessary.

Our evaluations in Figure 9 show that the mentioned cases make the training trajectory lengths skewed, and utilizing EA helps it to be more balanced by improving diversity.

In this sense, I still believe this work is more related to exploration, and thus some of the claims about better credit assignment seem overstated.

We would like to point the reviewer to the end of the Figure where EGFN's mode count is pointing upwards, while GFlowNets' mode count seems to converge.

This feels like a somewhat naive and careless evaluation. Also, given the increased complexity of EGFN, I don’t think the performance gain is significant enough to justify the extra complexity.


(Additional comment) In Figure 5, the curves for GFN and EGFN are both blue-colored, and it is hard to distinguish them.


Overall, I think this work still has many aspects that need improvement. I’m maintaining my score.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.