Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation
We propose Hephaestus, a generative self-reinforcing framework for solving the QoSD problem on large networks with nonlinear edge-weight functions, outperforming prior combinatorial and ML-based methods
Abstract
Reviews and Discussion
This paper proposes a self-reinforcing generative framework called Hephaestus to address the QoSD problem under both linear and nonlinear edge-weight functions. Specifically, the proposed method uses graph learning to produce feasible solutions with a performance guarantee, then trains a Mixture of Conditional VAEs guided by an energy-based model to capture solution feature distributions, and finally uses a reinforcement learning agent in the latent space to fine-tune the solutions. Experiments show the superior performance of the proposed method, especially in nonlinear settings.
Strengths and Weaknesses
Strengths:
- The proposed Hephaestus framework is a novel approach to QoSD, filling the gap left by previous methods, which cannot tackle nonlinear edge-weight functions.
- Experiments show the superior performance of Hephaestus in nonlinear settings and across various datasets.
- Each step in this framework has a theoretical guarantee.
Weaknesses: It seems that using a Mixture of Conditional VAEs guided by an energy-based model is not reasonable. For VAEs, the objective minimizes KL(p|q) over the q distribution, which is known as mode coverage and is always applied to alleviate mode collapse [1]. Conversely, for energy-based models, the minimax training will cause the VAEs to optimize KL(q|p) w.r.t. q, which can lead to mode collapse [2]. Therefore, the motivation for designing a Mixture of Conditional VAEs guided by an energy-based model is not reasonable. At least, it seems this design cannot alleviate mode collapse with the paper's explanation.
[1] Divergence Triangle for Joint Training of Generator Model, Energy-based Model, and Inferential Model. CVPR 2019
[2] Exploring bidirectional bounds for minimax training of Energy-based models. IJCV 2025
Questions
I suggest the authors provide a more reasonable explanation of why using the Mixture of Conditional VAEs guided by an energy-based model can alleviate mode collapse.
Limitations
yes
Final Justification
I maintain my positive score.
Formatting Issues
None
We thank the reviewer for pointing out the potential tension between VAEs and EBMs in generative modeling. First, we clarify that our approach does not train VAEs adversarially via a minimax objective. Instead, the minimax formulation is used solely to derive a principled training objective for the EBM, allowing us to train it independently without requiring explicit sampling from the model distribution or estimating the partition function, which is typically expensive (e.g., via MCMC or Langevin dynamics). Specifically, as shown in Equation (4), our approach starts from the following minimax objective:
This formulation seeks a latent density $q$ that is simultaneously close to the true data distribution $p$ and exposes gaps in the current generator $\Omega$. From this equation (see Appendix 2.3: Proof of Theorem 3), the inner maximization over $\Omega$ yields a variational bound, which leads to the following energy-based training objective for the EBM (also defined in the main paper):
This objective encourages the EBM to assign low energy to real (feasible) samples (drawn from the data distribution) and higher energy to generated samples (drawn from the current mixture), effectively learning a density-discrepancy signal. On the other hand, each VAE expert $\Omega_i$ is then trained independently using an energy-guided ELBO (also defined in the main paper):
In the above, samples generated by the VAE that are assigned high energy by the EBM are penalized. This encourages each expert to specialize in generating samples within low-energy regions, i.e., those aligned with the real distribution. Whenever an expert can no longer effectively minimize its energy penalty (i.e., it fails to cover a high-density region of the data distribution where the energy penalty remains large), a new VAE expert is added to capture that region. This process is repeated until the overall mixture model covers the high-quality regions of the data distribution, driving the total EBM energy loss toward zero. As a consequence, adding more experts in the Mix-CVAEs progressively reduces the density discrepancy, ensuring better coverage of the data manifold and effectively avoiding mode collapse, without requiring the VAE to be inherently mode-covering or mode-seeking.
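For intuition, the following is a minimal, self-contained sketch of the training logic described above (an EBM learning to separate real from generated samples, an energy-penalized ELBO per expert, and expert addition whenever the residual energy gap stays large). All module names, architectures, and thresholds are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the energy-guided Mix-CVAE training loop described above.
# Module names, architectures, and the discrepancy threshold are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class SimpleVAE(nn.Module):
    def __init__(self, dim=32, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)          # outputs mean and log-variance
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(z), mu, logvar


class EnergyNet(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)                  # scalar energy per sample


def energy_guided_elbo(vae, ebm, x, energy_weight=0.1):
    recon, mu, logvar = vae(x)
    rec = ((recon - x) ** 2).sum(-1)                    # reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    penalty = ebm(recon)                                # high-energy samples are penalized
    return (rec + kl + energy_weight * penalty).mean()


def ebm_loss(ebm, real, fake):
    # Push energy down on real (feasible) samples and up on generated ones.
    return ebm(real).mean() - ebm(fake).mean()


def train_mixture(data, dim=32, max_experts=9, gap_threshold=0.5, steps=200):
    experts, ebm = [SimpleVAE(dim)], EnergyNet(dim)
    ebm_opt = torch.optim.Adam(ebm.parameters(), lr=1e-3)
    while True:
        vae = experts[-1]
        vae_opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
        for _ in range(steps):
            x = data[torch.randint(len(data), (64,))]
            fake = vae(x)[0].detach()
            ebm_opt.zero_grad()
            ebm_loss(ebm, x, fake).backward()           # train the EBM independently
            ebm_opt.step()
            vae_opt.zero_grad()
            energy_guided_elbo(vae, ebm, x).backward()  # energy-guided ELBO for this expert
            vae_opt.step()
        with torch.no_grad():                           # residual density discrepancy
            gap = (ebm(vae(data)[0]) - ebm(data)).mean().item()
        if gap < gap_threshold or len(experts) >= max_experts:
            break                                       # mixture covers the data well enough
        experts.append(SimpleVAE(dim))                  # add an expert for uncovered modes
    return experts, ebm


if __name__ == "__main__":
    toy_data = torch.randn(512, 32)
    experts, ebm = train_mixture(toy_data)
    print(f"trained {len(experts)} expert(s)")
```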
Second, although this does not affect how the VAE is trained (nor does it require the VAE to be explicitly mode-covering or mode-seeking), for completeness we note that the VAE training objective is derived from the reversed KL divergence $\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right)$, not $\mathrm{KL}(p \,\|\, q)$. We briefly recall the derivation from the original VAE paper [1]:
$$\log p_\theta(\mathbf{x}) = \mathrm{ELBO} + \mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right).$$
As $\log p_\theta(\mathbf{x})$ is constant, maximizing $\mathrm{ELBO}=\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right]-\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)$ indirectly minimizes the reverse $\mathrm{KL}(q \,\|\, p)$. Recent work [2] instead studies the "forward" KL divergence for variational inference by leveraging MCMC.
[1] Original VAE paper: Auto-Encoding Variational Bayes. Conference: ICLR 2014
[2] Transport Score Climbing: Variational Inference Using Forward KL and Adaptive Neural Transport. Journal: Transactions on Machine Learning Research 2023
Thanks for your efforts and response. Since my concerns have been addressed, I maintain my positive score.
Thank you very much for your kind response and for taking the time to revisit our rebuttal. We truly appreciate your positive evaluation and are glad to hear that our clarifications addressed your concerns.
If there are any remaining points you'd like us to elaborate on, we would be more than happy to provide further clarification.
Warm regards,
Dear Reviewer,
We’re very glad that our clarifications have addressed your concern regarding the combination of Conditional VAEs and Energy-Based Models. Given your initial assessment that our paper introduces a novel generative modeling + reinforcement learning approach for the important QoSD problem—along with the fact that each component in our framework is supported by theoretical guarantees—we hope the overall potential of our work is clearer now.
In addition to the theoretical motivation, our experiments show that Hephaestus not only outperforms prior SOTA methods under linear edge-weight functions, but is also the first method that successfully handles nonlinear QoSD, which prior methods cannot solve.
Since your remaining concern has now been resolved and you found the paper to be of good quality and originality, we were wondering if you would consider revisiting your initial evaluation to see whether our work might warrant a stronger rating based on its theoretical contributions and experimental results.
Thank you very much again for your time and engagement.
Best regards, Authors
This paper investigates the Quality-of-Service Degradation (QoSD) problem, where an adversary perturbs edge weights in a network to degrade end-to-end performance while adhering to budget constraints. It introduces Hephaestus, a novel three-phase generative framework comprising: (1) Forge, which uses a Predictive Path-Stressing (PPS) algorithm guided by a shortest-path attention model (SPAGAN) to generate feasible perturbations; (2) Morph, an energy-based, expert-augmented generative model (Mix-CVAE) designed to capture multimodal solution distributions; and (3) Refine, a reinforcement learning agent that explores the latent space to produce optimized solutions using a smooth, differentiable reward.
Strengths and Weaknesses
Strengths:
- The paper offers a well-integrated approach to the QoSD problem by combining graph learning, generative modeling, and reinforcement learning, and demonstrates strong empirical performance on large-scale real-world networks.
- The paper provides solid theoretical guarantees (e.g., for PPS, latent RL convergence, and normalization-free energy modeling).
- Its ability to handle nonlinear edge cost functions (e.g., quadratic and log-concave) distinguishes it from prior work.
- The modular structure of Hephaestus (Forge-Morph-Refine) is clearly presented with detailed ablations.
Weaknesses:
- The framework relies on SPAGAN's accuracy—robustness to prediction errors in large, noisy graphs is not extensively discussed.
- The expert expansion mechanism in Mix-CVAE adds computational overhead; resource usage and efficiency trade-offs are not evaluated.
- The reinforcement learning component (Refine) is theoretically elegant, but lacks empirical discussion on convergence behavior and sample efficiency.
- The paper does not compare with recent powerful generative optimization methods (e.g., diffusion-based solvers or large-scale neural MIP solvers).
Questions
- How robust is the PPS algorithm when SPAGAN is trained on distributions that differ from the test graph topology?
- Has the RL refinement phase been compared with alternative latent optimization strategies like Bayesian optimization or ES?
- How is the number of CVAE experts in Morph determined? Is there a risk of redundancy or overfitting?
Limitations
YES
Final Justification
The additional clarifications and experiments are appreciated. They address most of my earlier concerns and strengthen the case for the proposed pipeline's robustness. I look forward to seeing these additions and updates incorporated in the final version.
Formatting Issues
A formatting inconsistency exists between the appendix references in the main text and the supplementary material. While the main text refers to sections such as "Appendix B.1" and "Appendix B.2," the supplementary material is labeled numerically (e.g., 1.2, 1.2) and does not contain an "Appendix B.1."
Q1 and W1. How robust is the PPS algorithm when SPAGAN is trained on distributions that differ from the test graph topology?
We thank the reviewer for this insightful question. While SPAGAN exhibits strong generalization on graphs with similar topology (as illustrated in Appendix 3.3), we also account for the out-of-distribution (OOD) case. In particular, we include a robust safeguard, PPS-I, as described in Section 3.3 (Inference Process) of the main paper. PPS-I can take any intermediate solution (including those produced by the RL refinement phase) and further refine it using exact shortest-path computation (via Dijkstra's algorithm), thereby guaranteeing 100% feasibility of the final perturbation vector, regardless of SPAGAN's prediction quality.
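To make this safeguard concrete, below is a minimal sketch of the kind of exact, Dijkstra-based feasibility check that PPS-I-style correction relies on, using networkx; the toy graph, perturbation dictionary, and threshold are illustrative assumptions, not our implementation.

```python
# Minimal sketch of an exact, Dijkstra-based feasibility check in the spirit of PPS-I.
# The graph, perturbation dictionary, and threshold below are illustrative assumptions,
# not the paper's code.
import networkx as nx


def feasibility_rate(G, perturbation, pairs, threshold):
    """Fraction of (source, target) pairs whose exact shortest-path length exceeds
    `threshold` after adding the perturbation to the edge weights."""
    H = G.copy()
    for (u, v), delta in perturbation.items():
        H[u][v]["weight"] += delta
    satisfied = 0
    for s, t in pairs:
        try:
            d = nx.dijkstra_path_length(H, s, t, weight="weight")
        except nx.NetworkXNoPath:
            d = float("inf")                  # fully disconnected also counts as satisfied
        satisfied += d > threshold
    return satisfied / len(pairs)


# Toy usage: a 4-node graph, one perturbed edge, one source-target pair.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (0, 3, 1.0), (3, 2, 1.0)])
print(feasibility_rate(G, {(0, 1): 5.0}, [(0, 2)], threshold=1.5))  # 1.0 (path 0-3-2 = 2.0 > 1.5)
```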
To empirically evaluate robustness under distributional shift, we conducted additional experiments in which we injected controlled levels of random noise into SPAGAN's predictions, simulating degraded generalization. With the number of source–target pairs set to 50, we report both feasibility (percentage of source–target pairs successfully disconnected, i.e., shortest path > T) and total cost across varying noise levels (30%, 10%, 5%, and 0% of the true value) on three real-world datasets used in the main paper (excluding Skitter due to time constraints during the rebuttal). We also assess the ability of both RL and PPS-I to recover from noisy predictions. Results are summarized below:
| Dataset | Noise Rate (%) | PPS Feas. (%) | PPS Cost | RL Feas. (%) | RL Cost | PPS-I Feas. (%) | PPS-I Cost | PPS-I Fix Time (s) |
|---|---|---|---|---|---|---|---|---|
| Email | 30 | 90.0 | 3284 | 94.0 | 2911 | 100.0 | 3005 | 20.2 |
| Email | 10 | 94.0 | 3185 | 96.0 | 2759 | 100.0 | 2802 | 15.7 |
| Email | 5 | 96.0 | 3157 | 98.0 | 2640 | 100.0 | 2712 | 6.1 |
| Email | 0 | 100.0 | 3091 | 100.0 | 2691 | 100.0 | 2691 | 0.0 |
| Gnutella | 30 | 86.0 | 4392 | 92.0 | 3775 | 100.0 | 3909 | 40.6 |
| Gnutella | 10 | 92.0 | 4318 | 96.0 | 3694 | 100.0 | 3741 | 21.3 |
| Gnutella | 5 | 96.0 | 4221 | 98.0 | 3550 | 100.0 | 3568 | 14.9 |
| Gnutella | 0 | 100.0 | 4118 | 100.0 | 3419 | 100.0 | 3499 | 0.0 |
| RoadCA | 30 | 78.0 | 13730 | 86.0 | 10985 | 100.0 | 11061 | 214.8 |
| RoadCA | 10 | 88.0 | 12556 | 92.0 | 10148 | 100.0 | 10795 | 137.2 |
| RoadCA | 5 | 92.0 | 12123 | 94.0 | 9892 | 100.0 | 9907 | 48.6 |
| RoadCA | 0 | 100.0 | 11283 | 100.0 | 9184 | 100.0 | 9277 | 0.0 |
From the table above, we observe the following:
- PPS performance degrades as SPAGAN's prediction noise increases, leading to reduced feasibility and inflated cost due to budget being allocated to non-critical edges.
- RL refinement helps recover both feasibility and cost, but does not fully guarantee constraint satisfaction.
- PPS-I consistently restores 100% feasibility, correcting violations through exact computation regardless of prior noise levels.
- Even in the worst-case scenario (RoadCA with 30% noise), PPS-I incurs only a modest cost increase of approximately 19% compared to the ideal, noise-free case (11,061 vs. 9,277), while still maintaining full feasibility.
These findings highlight the robustness of our three-phase pipeline: even when the learned SPAGAN model fails to generalize, PPS-I ensures feasibility recovery, while RL contributes to cost efficiency along the way. We will include these findings and discussion in the final version of the paper.
Q2. Has the RL refinement phase been compared with alternative latent optimization strategies like Bayesian Optimization or ES?
Based on the reviewer’s suggestion, we conducted a comprehensive comparison between our RL-based refinement phase and alternative latent-space optimization strategies, including Bayesian Optimization (BO) and Evolutionary Strategies (ES). As summarized in the table below, we evaluate each method on the Gnutella dataset under varying latent dimensionalities, reporting the final cost (lower is better) and the feasibility rate. We note that BO does not scale well to high-dimensional latent spaces (e.g., 64) due to its sample inefficiency and surrogate-modeling limitations. Therefore, we restrict BO to low-dimensional latent settings (16 and 28), while allowing higher dimensions for RL and ES. Furthermore, all evaluations are conducted under clean SPAGAN predictions (i.e., without additional noise), to ensure a consistent comparison of refinement strategies.
| Method | Latent | Final Cost | Feas. Rate |
|---|---|---|---|
| PPS | - | 4118 | 100% |
| RL (Ours) | 16 | 3435 | 100% |
| RL (Ours) | 64 | 3419 | 100% |
| Bayesian Opt. (BO) | 16 | 3590 | 98% |
| Bayesian Opt. (BO) | 28 | 4055 | 52% |
| Evo. Strat. (ES) | 16 | 3612 | 94% |
| Evo. Strat. (ES) | 64 | 3598 | 98% |
From these results, we can see that our RL-based refinement consistently achieves the best final cost and full constraint satisfaction (100% feasibility) across both latent dimensions. We will also incorporate this discussion in the final version.
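For readers unfamiliar with the ES baseline above, the following is a minimal sketch of a simple evolution-strategy search over a latent vector; the objective is a toy stand-in for "decode latent, build the perturbation, evaluate cost plus a feasibility penalty", and the dimensions and hyperparameters are illustrative assumptions rather than our experimental setup.

```python
# Minimal sketch of a simple evolution-strategy search over a latent vector, of the
# kind compared against the RL refinement above. The objective is a toy stand-in;
# dimensions and hyperparameters are illustrative assumptions, not our setup.
import numpy as np


def toy_objective(z):
    return np.sum(z ** 2)          # stand-in for the (to-be-minimized) refined cost


def simple_es(dim=16, pop=32, sigma=0.1, lr=0.05, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=dim)                                    # initial latent code
    for _ in range(iters):
        noise = rng.normal(size=(pop, dim))
        scores = np.array([toy_objective(z + sigma * n) for n in noise])
        ranks = scores.argsort().argsort() / (pop - 1) - 0.5    # rank-normalized fitness
        z -= lr * (ranks[:, None] * noise).mean(axis=0)         # move toward better samples
    return z, toy_objective(z)


best_z, best_cost = simple_es()
print(f"final toy objective: {best_cost:.4f}")
```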
Q3 and W2. How is the number of CVAE experts in Morph determined? Is there a risk of redundancy or overfitting?
Thank you for raising this important question. The number of CVAE experts is dynamically determined during training based on a density discrepancy criterion, which quantifies how well the current set of experts captures the latent distribution (we cap the number of experts at 9 in our paper). Specifically, we compute a score that identifies regions where the mixture model fails to match the energy-guided latent distribution. New experts are only added when this discrepancy exceeds a threshold, ensuring that expansion is data-driven and targeted toward unexplained modes.
Regarding the potential risks of redundancy or overfitting, we would like to highlight several theoretical justifications and empirical safeguards that guide our expert expansion strategy as demonstrated in prior works, including:
[1] Towards Understanding Mixture of Experts in Deep Learning, NeurIPS 2020
[2] A Statistical Perspective of Sparse Top-k Softmax Gating in MoE, ICLR 2023
[3] SwiftHydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models, ICLR 2025
From these works, there are two key theoretical insights:
(i) MoE models are more expressive than single-expert models [1,3], especially in modeling multi-modal or structured solution spaces like those in the QoSD setting. While increased model capacity could raise overfitting concerns, in practice top-$k$ sparse gating acts as an implicit regularizer, activating only the most relevant subset of experts per input. This input-dependent sparsity limits the effective model capacity seen by any given sample, thereby reducing the chance of overfitting despite the large total number of experts [2].
(ii) Even when the number of experts exceeds what is strictly necessary (the overspecified setting), our use of top-$k$ sparse softmax gating ensures that: only $k$ experts are active per input, keeping inference cost bounded and efficient; and the convergence rates of parameter and density estimation remain unchanged compared to dense gating, as proven in [2,3]. This decouples model capacity from inference cost and training stability, eliminating the typical trade-off between expressiveness and generalization. Thus, adding more experts improves coverage of the target distribution without harming generalization or causing significant computational overhead.
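For reference, a minimal sketch of the top-$k$ sparse softmax gating mechanism referenced above; the dimensions and the value of $k$ are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of top-k sparse softmax gating over a pool of experts. Dimensions and
# the value of k are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn.functional as F


def topk_gate(x, gate_weights, k=2):
    """Return a (batch, n_experts) mixing matrix with only k nonzero entries per row."""
    logits = x @ gate_weights                       # (batch, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    sparse = torch.full_like(logits, float("-inf"))
    sparse.scatter_(-1, topk_idx, topk_vals)        # keep only the top-k logits
    return F.softmax(sparse, dim=-1)                # non-selected experts get exactly zero weight


# Toy usage: 4 inputs, 3-dim conditions, 9 experts, only 2 active per input.
x = torch.randn(4, 3)
W_gate = torch.randn(3, 9)
weights = topk_gate(x, W_gate, k=2)
print((weights > 0).sum(dim=-1))                    # tensor([2, 2, 2, 2])
```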
W4. The paper does not compare with recent powerful generative optimization methods (e.g., diffusion-based solvers or large-scale neural MIP solvers)
We would like to clarify that our baseline set already includes two recent and powerful optimization methods targeting different scales: (1) DiffiLO (ICLR 2025 Spotlight), a differentiable ILP solver that performs well on small to medium-sized graphs, and (2) Light-MILPopt (ICLR 2024), a neural MILP solver designed for large-scale graphs.
We believe this combination already represents the state of the art in combinatorial optimization across graph sizes, and our method consistently outperforms both on multiple datasets, especially in handling nonlinear cost structures and enforcing feasibility via latent-space control.
Q1 / W1 The authors added noise-injection experiments in the rebuttal, showing PPS, RL refinement, and PPS-I performance under varying noise levels, with emphasis on PPS-I's ability to consistently restore 100% feasibility. This partially addresses my concerns about robustness when SPAGAN's predictions are inaccurate.
That said, I still have reservations:
While PPS-I indeed guarantees feasibility, the worst-case cost increase approaches 20%, which may be unacceptable in some application scenarios. A deeper analysis of the practical impact would be valuable.
On Q2 The authors added comparisons with BO and ES, and explained BO's weaknesses in high-dimensional latent spaces.
However, in the final version, it would be advisable to:
- Test RL vs. ES not only under clean SPAGAN predictions, but also under noisy predictions, thereby closing the loop with the robustness discussion in Q1.
- Provide runtime or resource consumption data to clarify RL's training cost and convergence characteristics, since other reviewers also noted that latent-space RL rollouts can be expensive and were not analyzed in detail.
Q3 / W2 The provided explanation of "dynamic expansion + top-k sparse gating" with relevant theoretical references seems convincing.
Still, including the expert-count growth curve or related statistics during training in the main text would more clearly demonstrate that this strategy is both controlled and effective in this task, rather than remaining a purely theoretical argument.
W4 The authors note that DifiLO and Light-MILPopt, two recent strong baselines, are already included, which partially addresses the concern about missing strong baselines.
Nonetheless, it is recommended that the final version explicitly explain why diffusion-based methods were not included—whether due to computational cost or lack of applicability—so that readers can better understand the rationale behind the choice.
We appreciate the reviewer’s suggestion and will include additional curve analysis on inference time as the number of experts grows. Regarding expert statistics during training, we already analyzed the dynamics of expert addition in Mix-CVAE across different real datasets (Appendix 3.6). For instance, Figures 6 and 8 illustrate how adding more experts improves coverage of the target distribution guided by the energy-based model. Since the EBM aligns Mix-CVAE toward the true data distribution, increasing the number of experts (up to a cap of 9 in our setup) improves generative quality and benefits the downstream RL refinement. Furthermore, Figure 9 in Appendix 3.6 shows that as more experts are added, the total cost consistently decreases, indicating that a better latent distribution leads to more effective RL refinement.
The reason we did not compare against diffusion-based generative solvers (e.g., the Gurobi-guided diffusion model [1]) is their limited scalability to large-scale constrained combinatorial optimization problems, especially over graphs with millions of nodes. In particular, these diffusion-based approaches require costly iterative sampling on top of a series of VAE models: 100 denoising steps for their DDIM variant and 1000 denoising steps for their DDPM variant. Therefore, it is computationally impractical to evaluate them in large-scale setups such as RoadCA or Skitter. We will include this discussion in the final version and thank you for your suggestion.
[1] Effective Generation of Feasible Solutions for Integer Programming via Guided Diffusion. Conference: KDD 2024
Dear Reviewer AUQg,
We’ve provided a detailed rebuttal that we believe directly addresses the concerns raised in your review. If you’ve had a chance to review it, we would be grateful to know whether our clarifications resolved your concerns, and if they might warrant an update to your initial assessment.
Thank you again for your thoughtful review and time.
Best regards,
We appreciate your suggestion. Indeed, we would like to clarify that, to stress-test the generalizability of SPAGAN, we added up to 30% noise (which is very unlikely in practice), under which the worst-case cost increase approaches 20% (on a large network like RoadCA). However, even under this stress-testing scenario, our worst-case cost is still lower than that of the SOTA methods; please see Table 1 in the main manuscript, RoadCA, threshold T=140% (the threshold setting used in our rebuttal). Note that in Table 1, no noise was added to stress-test SPAGAN. We will include this discussion in the final version of the paper.
Thank you for your suggestion to test RL vs. ES under noisy predictions. We conducted additional experiments comparing our RL-based refinement with Evolution Strategies (ES) on the Gnutella dataset under varying levels of SPAGAN noise (5%, 10%, and 30%). Results are summarized in the table below:
| Method | Latent | SPAGAN Noise | Final Cost | Feas. Rate |
|---|---|---|---|---|
| PPS | - | 0% | 4118 | 100% |
| PPS | - | 5% | 4221 | 96% |
| PPS | - | 10% | 4318 | 92% |
| PPS | - | 30% | 4392 | 86% |
| RL (Ours) | 16 | 0% | 3435 | 100% |
| RL (Ours) | 16 | 5% | 3596 | 98% |
| RL (Ours) | 16 | 10% | 3714 | 92% |
| RL (Ours) | 16 | 30% | 3832 | 86% |
| RL (Ours) | 64 | 0% | 3419 | 100% |
| RL (Ours) | 64 | 5% | 3550 | 98% |
| RL (Ours) | 64 | 10% | 3668 | 96% |
| RL (Ours) | 64 | 30% | 3775 | 92% |
| Evo. Strat. (ES) | 16 | 0% | 3612 | 94% |
| Evo. Strat. (ES) | 16 | 5% | 3798 | 91% |
| Evo. Strat. (ES) | 16 | 10% | 3945 | 87% |
| Evo. Strat. (ES) | 16 | 30% | 4201 | 81% |
| Evo. Strat. (ES) | 64 | 0% | 3598 | 98% |
| Evo. Strat. (ES) | 64 | 5% | 3756 | 95% |
| Evo. Strat. (ES) | 64 | 10% | 3912 | 91% |
| Evo. Strat. (ES) | 64 | 30% | 4187 | 85% |
From these experiments, we found that Evolution Strategies are more sensitive to SPAGAN prediction noise, especially due to their strong reliance on the quality of the pretrained solution set generated during the Forge phase. In contrast, our RL-based method maintains better feasibility and lower cost under noise, demonstrating stronger robustness. We will include this experiment and analysis in the Appendix of the final version.
We provided an answer to a similar suggestion in our rebuttal to reviewer GFJA. For your convenience, we repeat it here:
We analyzed the learning dynamics of the RL policy across training episodes (to standardize comparisons across graph sizes). The training process follows three distinct stages:
| Training Phase | Email | RoadCA |
|---|---|---|
| Phase 1: Feasibility Search Behavior | Episodes 0 – 5K | Episodes 0 – 10K |
| Phase 2: Cost Optimization Behavior | Episodes 5K – 15K | Episodes 10K – 35K |
| Phase 3: Convergence Behavior | Episodes 15K – 50K | Episodes 35K – 50K |
- Stage 1 (Episodes 0–5K for Email, 0–10K for RoadCA): The policy quickly learns to satisfy constraints, increasing feasibility from 80% to 100% on Email and from 40% to 100% on RoadCA. This demonstrates the policy’s capacity to recover feasibility from scratch by optimizing the first term of the reward function.
- Stage 2 (Episodes 5K–15K for Email, 10K–35K for RoadCA): The focus shifts to minimizing cost (the second reward term). During this phase, total cost drops significantly, from 15,147.21 to 2,697.84 on Email, and from 64,453.49 to 9,265.72 on RoadCA, while feasibility remains above ~98%.
- Stage 3 (the remaining episodes): The training stabilizes, with both reward and cost showing low variance. For example, on RoadCA, reward converges to 3975.03 ± 37 and cost to 9265.72 ± 49, confirming efficient and stable convergence even on large-scale graphs.
These results highlight both the efficiency and scalability of the proposed RL refinement loop. We will include this new experiment, along with runtime analysis, in the final version of the paper.
Dear Reviewer AUQg,
We’re reaching out to kindly follow up on our rebuttal for submission #17634. We’ve done our best to address your comments carefully, and we truly appreciate the valuable feedback you provided. If you’ve had a chance to review our responses, we’d be grateful to know whether they have addressed your concerns, or if there’s anything we can further clarify.
Warm regards,
The Authors
The additional clarifications and experiments are appreciated. They address most of my earlier concerns and strengthen the case for the proposed pipeline's robustness. I look forward to seeing these additions and updates incorporated in the final version.
Thank you very much for your follow-up. We’re very glad to hear that our clarifications and experiments have addressed most of your concerns. We will incorporate all the additions and updates you mentioned—including the new analyses and ablations—into the final version in greater detail. Please feel free to let us know if there’s anything else you'd like us to clarify further.
Best regards,
Authors
Dear Reviewer,
As a quick follow-up, we’ve done our best to address all the points raised and are glad that most of your concerns have been resolved. Given these clarifications and enhancements, and that the core contribution (a modular, theoretically grounded, and scalable generative framework for QoSD) is now more clearly supported both empirically and theoretically, we would kindly ask whether you might be open to re-evaluating your initial assessment to see if our work may warrant a stronger rating.
We truly appreciate your time and engagement in this review process.
Warm regards, Authors
Hephaestus is a generative framework for Quality of Service Degradation that aims to identify minimal, stealthy perturbations to network edge weights to degrade system performance. It combines predictive path approximation, energy-guided mixture generative modeling, and latent-space reinforcement learning to efficiently generate high-quality solutions under nonlinear cost functions and large-scale constraints.
Strengths and Weaknesses
Strengths:
1. Unlike prior methods that treat feasibility search, modeling, and optimization as separate tasks, Hephaestus introduces a unified, generative pipeline that handles them jointly. This end-to-end design addresses the inefficiency and fragmentation of previous methods, especially under large-scale nonlinear constraints where classical ILP or heuristic-based solvers fail to generalize.
2. Earlier approximation algorithms like Adaptive Trading or Iterative Greedy handle each source-target pair independently, leading to budget inefficiencies. Hephaestus’s PPS algorithm uses shortest-path estimation and a global potential function to coordinate updates across all pairs, providing better resource allocation with theoretical approximation guarantees.
Weaknesses:
1. The Forge phase depends on a well-trained SPAGAN model to estimate shortest paths, but the paper lacks a thorough sensitivity analysis on how inaccuracies in SPAGAN affect overall performance. If SPAGAN generalizes poorly to unseen graph structures, the entire pipeline could suffer significantly.
2. Although the RL phase improves solution quality, training policies in latent space still requires iterative rollouts and reward evaluation, which can be computationally intensive. The paper does not provide detailed analysis of RL convergence behavior or resource cost under different problem scales.
Questions
1. Since PPS relies on estimated shortest path lengths from SPAGAN rather than exact values, it’s unclear how errors in this approximation affect the feasibility or optimality of generated solutions. Has the robustness of the pipeline been evaluated under noisy or miscalibrated SPAGAN outputs?
2. The paper claims generalization to unseen graphs with similar structure, but it is not clear how well the RL-based refinement policy performs when topology or distribution shifts occur. Would retraining be needed for significantly different domains?
Limitations
1. The quality and diversity of the pretrained solution dataset (Dsol) is critical for the Morph phase, yet the paper does not examine the failure modes when Dsol is sparse or biased. If the initial PPS-based solutions are suboptimal, the entire pipeline may converge to poor regions of the solution space.
2. The current setup assumes static graph topologies and fixed critical pairs. In practical scenarios (e.g., networks with dynamic routing), it's unclear whether Hephaestus can adapt online or if it requires full retraining from scratch.
Formatting Issues
N/A
Q1, also W1. It’s unclear how errors in SPAGAN approximation affect the feasibility or optimality of generated solutions. Has the robustness of the pipeline been evaluated under noisy SPAGAN outputs?
Thank you for raising this important question. While SPAGAN shows strong generalization on graphs with similar topology (as illustrated in Appendix 3.3), we also address the out-of-distribution (OOD) setting through a robust safeguard—PPS-I, introduced in Section 3.3 (Inference Process) of the main paper. PPS-I is designed to refine any intermediate solution—including those from the RL refinement phase or PPS—using exact shortest-path computation via Dijkstra’s algorithm, thereby guaranteeing 100% feasibility of the final perturbation vector regardless of SPAGAN’s prediction accuracy.
To evaluate the pipeline’s robustness under degraded SPAGAN predictions, we conducted a controlled noise injection study. Specifically, we perturb SPAGAN’s predicted shortest-path values with varying levels of random noise (30%, 10%, 5%) and assess the downstream effects on PPS, RL, and PPS-I across two real-world datasets.
| Dataset | Noise Rate (%) | PPS Feas. (%) | PPS Cost | RL Feas. (%) | RL Cost | PPS-I Feas. (%) | PPS-I Cost | PPS-I Fix Time (s) |
|---|---|---|---|---|---|---|---|---|
| Email | 30 | 90.0 | 3284 | 94.0 | 2911 | 100.0 | 3005 | 20.2 |
| Email | 10 | 94.0 | 3185 | 96.0 | 2759 | 100.0 | 2802 | 15.7 |
| Email | 5 | 96.0 | 3157 | 98.0 | 2640 | 100.0 | 2712 | 6.1 |
| Email | 0 | 100.0 | 3091 | 100.0 | 2691 | 100.0 | 2691 | 0.0 |
| RoadCA | 30 | 78.0 | 13730 | 86.0 | 10985 | 100.0 | 11061 | 214.8 |
| RoadCA | 10 | 88.0 | 12556 | 92.0 | 10148 | 100.0 | 10795 | 137.2 |
| RoadCA | 5 | 92.0 | 12123 | 94.0 | 9892 | 100.0 | 9907 | 48.6 |
| RoadCA | 0 | 100.0 | 11283 | 100.0 | 9184 | 100.0 | 9277 | 0.0 |
From the table (with the number of source–target pairs set to 50), we observe the following four key findings:
- PPS degrades under noisy predictions: feasibility drops by up to 22% (e.g., RoadCA with 30% noise), and cost increases (~20%) as perturbation budgets are allocated to irrelevant edges.
- RL refinement recovers performance: RL improves feasibility and reduces cost over PPS by leveraging gradient signals in the latent space. For instance, on the RoadCA dataset at 30% noise, RL improves feasibility from 78% to 86% and reduces cost from 13,730 to 10,985.
- PPS-I ensures 100% feasibility in all cases: regardless of SPAGAN noise level, PPS-I consistently restores feasibility via exact Dijkstra-based correction. This confirms the pipeline’s correctness guarantee.
- The cost penalty of PPS-I is modest: even in the worst case (RoadCA, 30% noise), the PPS-I cost only increases by ~19% relative to the ideal (0% noise) case, i.e., 11,061 vs. 9,277, while feasibility remains perfect.
We will include this analysis in the main manuscript.
Q2, also Limitation 2. It is not clear how well the RL-based refinement policy performs when topology or distribution shifts occur. Would retraining be needed for significantly different domains?
Since our RL policy operates in the latent space, its generalization ability depends on three main factors:
- Robustness of SPAGAN’s predictions: As SPAGAN provides shortest-path approximations that guide the construction of initial solutions, any degradation here may propagate downstream. However, we have shown in Question 1 that even under significant prediction noise, the RL policy still improves feasibility and cost-effectiveness, demonstrating strong tolerance to upstream approximation error.
- Generalization of Mix-CVAE: Since the latent space is shaped by a Mix-CVAE trained over solution trajectories, its ability to encode meaningful structures for unseen graphs is critical. Fortunately, generalization of VAE-style models across structurally similar data distributions is well established in the literature [1,2], especially when trained on diverse samples.
- Generalization of the policy network: The learned RL policy receives a latent vector as input and refines it toward better solution quality. Its ability to generalize to out-of-distribution (OOD) domains depends on whether the latent embeddings of unseen graphs lie within the training support. Prior works [3,4] in reinforcement learning (e.g., in offline RL and model-based control) show that policies can generalize across tasks when the representation space is sufficiently expressive, which applies here due to the modularity of our pipeline.
However, we also believe that if the target domain exhibits significantly different structural characteristics, retraining or fine-tuning the RL policy may indeed be beneficial to fully capture domain-specific latent dynamics. To the best of our knowledge, few-shot learning for domain adaptation of generative models is a rapidly advancing field [5,6,7], and such techniques can theoretically be incorporated into our framework to enhance transferability with minimal overhead.
[1] Abstract representations emerge naturally in neural networks trained to perform multiple tasks. Nature Communication 2023
[2] Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017
[3] Planning to Explore. ICML 2020
[4] Reinforcement Learning with Prototypical Representations. ICML 2021
[5] One-Shot Generative Domain Adaptation. ICCV 2023
[6] Make Your One-step Diffusion Model Better Than Its Teacher. ECCV 2024
[7] Superposition of Diffusion Models Using the Itô Density Estimator. ICLR 2025
W2. Analysis of RL convergence behavior under different problem scales.
To address this, we provide detailed training statistics on real datasets of different scales: Email (1,005 nodes, 25,571 edges) and RoadCA (1.96M nodes, 2.77M edges), both with clean SPAGAN predictions.
| Training Phase | Email | RoadCA |
|---|---|---|
| Stage 1: Feasibility Search Behavior | Episodes 0 – 5K | Episodes 0 – 10K |
| Stage 2: Cost Optimization Behavior | Episodes 5K – 15K | Episodes 10K – 35K |
| Stage 3: Convergence | Episodes 15K – 50K | Episodes 35K – 50K |
As shown in the table, training proceeds in three stages based on episode count. In Stage 1 (Episodes 0–5K / 0–10K), the policy rapidly improves feasibility (maximizing the first term of the reward function) from 80% to 100% (Email) and from 40% to 100% (RoadCA), demonstrating its ability to learn constraint satisfaction from scratch. In Stage 2, the focus shifts to cost optimization (the second term of the reward function): total budget drops significantly from 15,147.21 to 2,697.84 on Email, and from 64,453.49 to 9,265.72 on RoadCA, while feasibility remains at approximately 98%. Finally, in Stage 3, the reward and cost metrics converge with low variance, e.g., on RoadCA, reward stabilizes at 3975.03 ± 37 and budget at 9265.72 ± 49, confirming stable convergence even in large-scale settings.
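For concreteness, below is a hedged sketch of a two-term reward of the kind referenced above (a feasibility term optimized first, then a cost term). The smooth feasibility surrogate, weights, and function name are illustrative assumptions; the paper's exact differentiable reward may differ.

```python
# Hypothetical sketch of a two-term QoSD reward: a feasibility term optimized first,
# then a cost term. The smooth feasibility surrogate and the weights are illustrative
# assumptions, not the paper's exact differentiable reward.
import numpy as np


def reward(path_lengths, threshold, total_cost, alpha=1.0, beta=1e-3):
    # Term 1: smooth feasibility signal -- how close each source-target pair's
    # (approximate) shortest path is to exceeding the degradation threshold.
    feasibility_term = np.mean(np.minimum(path_lengths / threshold, 1.0))
    # Term 2: negative total perturbation cost (budget spent).
    cost_term = -total_cost
    return alpha * feasibility_term + beta * cost_term


# Toy usage: 3 pairs, threshold 10, total perturbation cost 250.
print(reward(np.array([12.0, 9.0, 15.0]), threshold=10.0, total_cost=250.0))
```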
Limitation 1. If the initial PPS-based solutions are suboptimal, the entire pipeline may converge to poor regions of the solution space
Our framework is specifically designed to reduce the impact of low-quality initial data by employing a latent-space refinement loop. Importantly, instead of discarding the highest-reward solutions generated by the RL phase, we incorporate them back into the pretrained solution set (as illustrated in Figure 1). This feedback loop serves two main purposes: (1) it enhances the training distribution for the Mix-CVAE in the next iteration, enabling the model to better capture informative solution patterns; and (2) it provides the RL policy with a richer and more structured latent space, allowing it to generate higher-quality solutions in subsequent cycles.
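Schematically, the add-back feedback loop described above can be read as the following runnable toy, where the stand-in functions are placeholders for the Morph/Refine phases rather than our actual components; only the data flow of the loop is illustrated.

```python
# Schematic, runnable toy of the self-reinforcing add-back loop described above. The
# stand-in functions are placeholders for the Morph/Refine phases, not the paper's
# actual components; only the data flow of the loop is illustrated.
import random


def train_generator(pool):
    # Stand-in for Morph: "learn" the pool by keeping it sorted by cost.
    return sorted(pool)


def refine(generator, n_samples=20):
    # Stand-in for Refine: propose slightly improved variants of the best solutions.
    return [cost * random.uniform(0.9, 1.0) for cost in generator[:n_samples]]


def self_reinforcing_loop(initial_costs, n_cycles=5, top_k=10):
    pool = list(initial_costs)                      # costs of Forge-phase solutions
    for cycle in range(n_cycles):
        generator = train_generator(pool)
        candidates = refine(generator)
        best = sorted(candidates)[:top_k]           # highest-reward = lowest-cost refinements
        pool.extend(best)                           # add-back into the training pool
        print(f"cycle {cycle}: best cost so far = {min(pool):.1f}")
    return pool


if __name__ == "__main__":
    random.seed(0)
    self_reinforcing_loop([random.uniform(4000.0, 5000.0) for _ in range(50)])
```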
Dear Reviewer GFJA,
We’re writing to follow up on our NeurIPS 2025 submission (#17634), "Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation"
We’ve submitted a detailed rebuttal addressing your comments. If you’ve had a chance to review it and feel your initial assessment could be updated based on our clarifications, we’d greatly appreciate it.
Thank you again for your time and consideration.
Best regards,
Dear reviewer GFJA,
We’re writing to kindly follow up regarding our rebuttal for our submission #17634. We’ve carefully addressed your comments and would be truly grateful to know whether our responses have helped resolve your concerns, or if there are any remaining questions we could further clarify.
Sincerely,
This paper proposes Hephaestus, a novel generative modelling framework for solving the Quality of Service Degradation (QoSD) problem, which is formulated as perturbing network edge weights minimally yet sufficiently to degrade service quality. The paper proposes a three-stage sequential solution for the problem:
- Forge, employing a Predictive Path-Stressing (PPS) algorithm to generate feasible initial solutions with theoretical guarantees.
- Morph, a generative modelling approach using Mixture of Conditional VAEs (Mix-CVAEs) guided by an Energy-Based Model (EBM) to model multimodal solution distributions using the output from step 1. as input.
- Refine, a reinforcement learning method for fine-tuning latent-space solutions to achieve near-optimality using a differentiable reward function.
The paper provides rigorous experimental evaluations across synthetic and real-world networks, demonstrating Hephaestus's effectiveness over classical optimization and ML-based baselines, especially under nonlinear conditions. The effectiveness of the algorithm is demonstrated and theoretical guarantees are provided for the algorithm stages. Notably, it can handle both linear and non-linear edge cost functions and a large number of path constraints (which prevents the problem from being stated as an ILP), which past methods could not.
Strengths and Weaknesses
Strengths:
- Tackles a highly relevant real-world challenge: QoS degradation via stealthy perturbations in networked systems (e.g., blockchain, traffic control, distributed ML). Although the problem is relevant, it is not broadly relevant; it is more relevant to the networking community.
- The paper is well executed and offers strong technical depth, including theoretical guarantees: (i) approximation bounds for PPS, i.e., they prove an approximation ratio for PPS in Forge, ensuring the generated initial solutions are within a bounded factor of optimal (Forge part); (ii) convergence analysis and proofs of Mix-CVAEs with EBM guidance (Morph part); (iii) the design of the RL reward and the gradient-based latent refinement (Theorem 4) demonstrates deep technical insight into stabilizing and accelerating training (Refine part).
- Comprehensive Evaluation and Analysis. The experimental section is extensive. The authors evaluate different graph types and sizes (networks beyond a million nodes) in the main part, and in more detail in the appendix. They also investigate why Hephaestus performs well: for instance, the discussion notes that the learned model tends to concentrate budget on edges that benefit multiple source–target pairs, whereas baselines waste budget on pair-specific edges.
- Strong Empirical Performance: Hephaestus shows clear performance gains over a wide array of baselines. On synthetic graphs, it achieved the lowest costs across all tested densities, essentially matching the optimal in linear cases. In more complex cost settings (quadratic, log-concave) where learning-based ILP solvers cannot be applied, it still outperformed the best classical approximations.
- The paper is generally well-written and flows well, considering the somewhat complex orchestration of the whole proposal.
Weaknesses:
- High Complexity, Implementation Overhead & feasibility to be deployed in practice: The Hephaestus framework is a quite complex orchestration, involving multiple models and training stages (a GNN predictor, a heuristic search, an EBM, several VAE “experts”, and an RL agent). Implementing and tuning this entire pipeline might be challenging. The approach trades off runtime per instance for a heavy upfront training cost on a distribution of graphs. In scenarios where such extensive training is not feasible, or when one needs a quick solution for a single new network, this could be a drawback. The practical impact in real deployment scenarios remains partly speculative without a deeper discussion of implementation challenges (e.g., noise, model drift in real environments), which are not discussed as limitations. Due to the orchestrated nature of the approach (3 stages), a bad outcome of the first phase (Forge), i.e., bad data, will likely influence the other phases.
- Clarity. Though generally well written, given the complexity and number of innovations (PPS, Mix-CVAE, EBM, latent RL), the paper’s readability occasionally suffers from density.
- Focus on Static Scenario: The QoSD problem is treated as a one-shot perturbation planning problem on a static weighted graph. In practice, network conditions and attack strategies might evolve over time. The paper does not address dynamic or adaptive adversarial scenarios (which is beyond its scope, but still a limitation of the current model).
Suitability:
- The paper is relevant to the machine learning and representation learning community, but only to a limited extent; there are more suitable venues for this work, such as SIGCOMM, ICNP, INFOCOM.
Minor typos:
- In the supplemental material, the section is named: DETAILS EXPERIMENTS AND ABLATION STUDIES. Should it be named: DETAILS OF THE EXPERIMENTS AND ABLATION STUDIES or DETAILS, EXPERIMENTS AND ABLATION STUDIES
Questions
- The authors should discuss the analysis of failure modes. It would improve the paper to discuss how errors in the learned models (SPAGAN or the generative model) affect the outcome. A suggestion is to include a brief sensitivity analysis or worst-case check: e.g., verify a few final solutions with an exact shortest-path computation to confirm no constraints are violated. Reporting the predictor’s accuracy on shortest path estimates for test graphs would also reassure that the learned model is reliable. If certain graph topologies or weight patterns cause predictor mistakes, acknowledging that and perhaps integrating a safeguard (such as occasional exact path re-checks for critical pairs) could be beneficial.
- The authors should consider simplifying the explanations for critical, complex components (particularly Morph phase and latent RL refinement) to improve the presentation.
- Dynamic/Adaptive Scenario as Future Work: While not necessary for this paper’s scope, mentioning possible extensions to handle dynamic attacks or repeated games could strengthen the impact. For example, one could imagine an online version of Hephaestus that updates its strategy as network conditions or defences change.
Limitations
yes
Formatting Issues
No serious formatting concerns.
Minor: In the supplemental material there are two sections numbered 3.10: "3.10. Hephaestus with different feasibility refinement" and "3.10. DISCUSSION".
This should be fixed.
Q1. How do SPAGAN errors impact final outcomes? Have you conducted sensitivity analysis or exact-path verification to ensure constraint satisfaction? Perhaps integrating a safeguard?
We thank the reviewer for pointing this out. Although SPAGAN demonstrates strong generalization on graphs with similar topology (Appendix 3.3), we still explicitly address the failure scenario where SPAGAN's predictions degrade under out-of-distribution (OOD) graphs by incorporating a robustness safeguard: PPS-I, described in Section 3.3. PPS-I can refine any intermediate solution (including those from PPS or RL) using exact shortest-path computations via Dijkstra, thereby guaranteeing 100% feasibility of the final perturbation.
In response to the suggestion for a worst-case check using exact computation to verify that no path constraints are violated, we note that our framework already includes such a mechanism. Specifically, we define a feasibility check function that applies Dijkstra to compute the true shortest-path distances after perturbation. We then report the feasibility rate (%), measured as the percentage of source–target pairs whose shortest-path lengths exceed a given threshold (i.e., successfully disconnected). To further quantify robustness, we also conducted a sensitivity analysis by injecting controlled levels of random noise (30%, 10%, 5%) into SPAGAN’s predicted shortest-path values. This simulates various degrees of degraded model performance. We then evaluated the impact of this noise on feasibility (out of 50 pairs) and cost across all stages of our pipeline—PPS, RL, and PPS-I—on two benchmark datasets.
| Dataset | Noise Rate (%) | PPS Feas. (%) | PPS Cost | RL Feas. (%) | RL Cost | PPS-I Feas. (%) | PPS-I Cost | PPS-I Fix Time (s) |
|---|---|---|---|---|---|---|---|---|
| Email | 30 | 90.0 | 3284 | 94.0 | 2911 | 100.0 | 3005 | 20.2 |
| Email | 10 | 94.0 | 3185 | 96.0 | 2759 | 100.0 | 2802 | 15.7 |
| Email | 5 | 96.0 | 3157 | 98.0 | 2640 | 100.0 | 2712 | 6.1 |
| Email | 0 | 100.0 | 3091 | 100.0 | 2691 | 100.0 | 2691 | 0.0 |
| RoadCA | 30 | 78.0 | 13730 | 86.0 | 10985 | 100.0 | 11061 | 214.8 |
| RoadCA | 10 | 88.0 | 12556 | 92.0 | 10148 | 100.0 | 10795 | 137.2 |
| RoadCA | 5 | 92.0 | 12123 | 94.0 | 9892 | 100.0 | 9907 | 48.6 |
| RoadCA | 0 | 100.0 | 11283 | 100.0 | 9184 | 100.0 | 9277 | 0.0 |
From the table, we observe:
- PPS degrades under noise: higher SPAGAN noise reduces feasibility (by up to 22%) and increases cost due to misallocated budget (e.g., RoadCA, 30% noise).
- RL improves robustness: RL boosts both feasibility and cost-efficiency over PPS by optimizing in the latent space (e.g., RoadCA: 78% → 86%, cost drops from 13,730 → 10,985).
- PPS-I ensures 100% feasibility: it always restores feasibility via exact Dijkstra correction, regardless of noise level.
- The PPS-I cost overhead is modest: the worst-case cost increase is ~19% (RoadCA: 11,061 vs. 9,277) while preserving perfect feasibility.
These results show that Hephaestus remains robust to significant errors in SPAGAN’s predictions. We will include this analysis in the main paper.
W1. Having bad data in the first phase will likely influence the other phases
This is an insightful point. Our framework is explicitly designed to mitigate the influence of poor-quality initial data through a latent-space refinement loop. In particular, our RL agent operates entirely in the latent space, where it learns to generate high-quality perturbation vectors that improve feasibility and cost-efficiency, as directly reflected by the differentiable reward function. Crucially, the best-performing solutions (i.e., those with the highest rewards) are not discarded after refinement. Instead, they are added back into the pretrained solution dataset (as shown in Figure 1) to enrich the diversity and quality of training data. This feedback mechanism has two key effects:
- Mix-CVAE retraining in the next cycle benefits from improved latent samples, allowing it to better capture useful solution modes.
- The RL policy, in turn, receives a more informative and structured latent space, enabling it to produce even better solutions in future cycles.
We empirically validate this self-reinforcing effect through the following table, which shows the cost reduction (%) achieved across successive refinement cycles. Specifically, we measure the relative improvement in cost (budget savings) after each cycle in which the top-$k$ highest-reward samples are added back to the solution set. Results on benchmark datasets under four different feasibility thresholds clearly demonstrate monotonic improvement with more cycles:
| Dataset | Cycle | T=140% | T=180% | T=220% | T=260% |
|---|---|---|---|---|---|
| Email | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Email | 1 | 4.01 | 5.25 | 6.82 | 7.95 |
| Email | 2 | 8.67 | 10.15 | 12.44 | 14.21 |
| Email | 3 | 10.86 | 12.50 | 15.03 | 17.88 |
| Email | 4 | 11.07 | 12.97 | 15.58 | 18.04 |
| Email | 5 | 11.23 | 13.11 | 15.98 | 18.52 |
| RoadCA | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| RoadCA | 1 | 7.01 | 8.55 | 10.11 | 11.58 |
| RoadCA | 2 | 10.91 | 12.98 | 15.03 | 17.34 |
| RoadCA | 3 | 12.83 | 15.21 | 17.85 | 20.19 |
| RoadCA | 4 | 15.67 | 17.89 | 20.45 | 22.87 |
| RoadCA | 5 | 17.42 | 19.53 | 22.10 | 24.53 |
| RoadCA | 6 | 18.27 | 20.33 | 22.98 | 25.66 |
| RoadCA | 7 | 18.51 | 20.76 | 23.42 | 26.09 |
| RoadCA | 8 | 18.61 | 20.89 | 23.54 | 26.21 |
- On Email, improvement saturates after 5 cycles, yielding up to ~18.5% cost reduction.
- On the large-scale RoadCA dataset, cost reductions continue to rise even after 8 cycles, peaking at over ~26% under high thresholds.
These results show that our pipeline is resilient to weak initial data.
W2, also Q2. Though generally well written, the paper’s readability occasionally suffers from density.
Thank you for the valuable suggestion. We agree that the Morph phase and latent-space RL refinement are technically dense components, and we appreciate the need for improved clarity. We will revise these sections to enhance readability by summarizing key math intuitively, e.g., describing how Mix-CVAE captures multimodal solution structures from PPS, and how the RL policy operates in latent space to improve feasibility and cost via a differentiable reward function. We will also highlight modularity in the pipeline.
Q3, also W3: Dynamic/Adaptive Scenario as Future Work for Hephaestus
Thank you for the suggestion. Although Hephaestus currently focuses on the static QoSD setting, its modular design naturally supports extensions to dynamic scenarios. For example, SPAGAN can be retrained incrementally as the network topology evolves, Mix-CVAE and RL can adapt via continual or meta-learning, and our add-back mechanism already forms a feedback loop for iterative refinement. As future work, we plan to explore online variants of Hephaestus that adapt strategies in response to changing network conditions or defenses, potentially framing the problem as a repeated game or adaptive attack.
Suitability. The reviewer questions whether the paper is better suited for networking venues (e.g., SIGCOMM, INFOCOM) rather than ML conferences.
While our work intersects with networking, its core contributions also lie in generative mixture modeling, latent-space RL, and structured optimization—central topics in the ML community. Furthermore, our framework is generalizable and can be applied to other challenging CO problems, as long as a reasonably good approximation method exists for the Forge phase. We believe this brings meaningful value to the ML community beyond the original QoSD problem.
Dear Reviewer Tyqx,
Thank you again for your constructive review of our submission (#17634). We’ve carefully addressed your comments in our rebuttal and would greatly appreciate any further feedback or clarification you might be willing to share. If our responses have resolved your concerns, we’d be grateful to know whether they may inform a revised evaluation.
We truly appreciate your time and engagement in the review process.
Warm regards,
Dear reviewer Tyqx,
We wanted to kindly follow up on our rebuttal for submission #17634. We’ve done our best to address your comments in detail, and we truly appreciate the time you took to raise them. If you’ve had a chance to review our responses, we would be grateful to know whether our clarifications have resolved your concerns, or if there are any remaining questions or points you’d like us to elaborate on.
Warm regards,
Dear Area Chair and Reviewers,
We sincerely thank you for your thoughtful reviews and constructive suggestions. In our paper, we introduce Hephaestus, a generative optimization framework for the Quality-of-Service Degradation (QoSD) problem in large-scale networks. Our method scales to graphs with over a million nodes, supports both linear and nonlinear cost functions (to our knowledge, we are the first to address both in the context of QoSD), and is modularly structured to facilitate reuse across optimization tasks.
Our framework comprises three modular phases:
- Forge: We leverage existing SPAGAN to build a novel Predictive Path-Stressing (PPS) algorithm, which generates diverse and feasible initial solutions with theoretical guarantees.
- Morph: We propose a theoretically grounded training method for a Mixture of Conditional VAEs, guided by an Energy-Based Model (EBM), to capture multimodal solution distributions.
- Refine: We train a reinforcement learning (RL) agent to optimize solutions in the latent space. The best-performing solutions (those with the highest rewards) are recycled into the pretraining dataset, enabling the Mix-CVAE to progressively learn better distributions in subsequent iterations. Additionally, we demonstrate that performing gradient ascent on the reward function enables the RL agent to directly discover feasible actions, which facilitates a more effective exploration process during the early stages of training.
We are grateful that all four reviewers acknowledged the strong technical depth of our work (Tyqx), the solid theoretical rigor (GFJA, AUQg, 4YUB), its comprehensive (Tyqx) and strong empirical results (AUQg, 4YUB), as well as the real-world relevance of the QoSD problem in the contexts of networking, system security, and machine learning.
Most concerns centered around experimental robustness, particularly in worst-case scenarios where SPAGAN may fail to generalize due to significant distribution shifts. Below, we summarize the key clarifications and additions made in response to reviewer feedback:
- Robustness (in terms of cost and feasibility) of Hephaestus to Noisy SPAGAN Predictions (Tyqx, GFJA, AUQg): We clarified that our framework also includes PPS-I, a robust safeguard that applies Dijkstra-based correction to ensure 100% feasibility, regardless of SPAGAN’s prediction quality. Additionally, we conducted sensitivity analyses under varying levels of SPAGAN noise. The experiments show that even with prediction noise causing up to ~20% cost inflation, our RL refinement phase consistently improved solution quality, and PPS-I maintained constraint satisfaction. As a result, even in the worst-case scenario, Hephaestus still outperforms state-of-the-art baselines such as Light-MILPopt (ICLR 2024) and Predict-&-Search (ICLR 2023), and achieves comparable performance to DiffiLO (ICLR 2025 Spotlight), particularly when these baselines are evaluated under clean, noise-free conditions.
- Quality of Generated Data and the Self-Reinforcing Loop (Tyqx, GFJA): Our RL loop is self-reinforcing: it recycles high-reward solutions into the pretraining dataset, allowing the generative model to improve over time. As shown in the table in our answer to W1 (reviewer Tyqx), this leads to consistent cost reductions of up to 26% on large networks, demonstrating robustness even when the initial training data is weak or noisy.
- RL Convergence Behavior Across Graph Scales (GFJA, AUQg): We conducted additional experiments to analyze the convergence dynamics on datasets of significantly different scales (Email vs. RoadCA), and observed a consistent three-stage learning behavior: Stage 1 (early episodes): rapid feasibility improvement (toward 100%); Stage 2 (mid episodes): sharp reduction in budget cost while maintaining approximately 98% feasibility; Stage 3 (late episodes): stable convergence of both reward and cost. Importantly, despite the large difference in graph sizes, all three stages consistently emerged within just 50,000 episodes on both datasets. These findings confirm the scalability and stability of our RL module across small- and large-scale graphs.
- Clarification on VAE–EBM Integration (4YUB): We clarified that the minimax formulation is used only to train the EBM, not the VAEs. The VAEs are trained independently using a standard ELBO-based objective. To ensure full mode coverage, we dynamically introduce new CVAE experts whenever the density gap is large. These experts are selected by the Mixture-of-Experts (MoE) gating network, ensuring broad distribution support and effectively mitigating mode collapse.
We will incorporate all minor revisions and clarifications directly into the final version of the paper. Once again, we sincerely appreciate your thoughtful feedback and the opportunity to improve our manuscript.
Best regards,
The Authors
This paper investigates the Quality-of-Service Degradation (QoSD) problem, where an adversary perturbs edge weights in a network to degrade end-to-end performance while adhering to budget constraints. It introduces Hephaestus, a novel three-phase generative framework for solving the QoSD problem. The reviewers highlighted the strong technical depth, solid theoretical rigor, and comprehensive and strong empirical results. Nonetheless, the reviewers also pointed out the high complexity, implementation overhead & feasibility to be deployed in practice, the primary focus on static scenario, the sensitivity relying on SPAGAN for the overall performance, the lack of empirical discussion on convergence behavior and sample efficiency, and potential issues using a mixture of Conditional VAEs guided by an energy-based model.