PaperHub

Overall rating: 3.3 / 10 (Rejected, 4 reviewers)
Reviewer scores: 2, 2, 3, 2 (min 2, max 3, std 0.4)
Confidence: 4.5
Novelty: 2.3 | Quality: 2.0 | Clarity: 2.5 | Significance: 2.0
NeurIPS 2025

CoSimGen: Controllable diffusion model for simultaneous image and segmentation mask generation

Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

A diffusion model for simultaneous generation of image-mask pairs controlled by either class or text prompt.

Abstract

Generating paired images and segmentation masks remains a core bottleneck in data-scarce domains such as medical imaging and remote sensing, where manual annotation is expensive, expertise-dependent, and ethically constrained. Existing generative approaches typically handle image or mask generation in isolation and offer limited control over spatial and semantic outputs. We introduce CoSimGen, a diffusion-based framework for controllable simultaneous generation of images and segmentation masks. CoSimGen integrates multi-level conditioning via (1) class-grounded textual prompts enabling hot-swapping of input control, (2) spatial embeddings for contextual coherence, and (3) spectral timestep embeddings for denoising control. To enforce alignment and generation fidelity, we combine contrastive triplet loss between text and class embeddings with diffusion and adversarial objectives. Low-resolution outputs ($128\times128$) are super-resolved to $512\times512$, ensuring high-fidelity synthesis. Evaluated across five diverse datasets, CoSimGen achieves state-of-the-art performance in FID, KID, LPIPS, and Semantic-FID, with KID as low as 0.11 and LPIPS of 0.53. Our method enables scalable, controllable dataset generation and advances multimodal generative modeling in structured prediction tasks.
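For concreteness, one illustrative way to write the composite objective described above is $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda_{\text{tri}}\,\mathcal{L}_{\text{tri}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$, where $\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0,\epsilon,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c)\rVert_2^2\big]$ is the standard noise-prediction loss under conditioning $c$ and $\mathcal{L}_{\text{tri}} = \max\big(0,\ \lVert e_{\text{text}} - e^{+}_{\text{class}}\rVert_2 - \lVert e_{\text{text}} - e^{-}_{\text{class}}\rVert_2 + m\big)$ pulls matching text and class embeddings together. The weights $\lambda_{\text{tri}}, \lambda_{\text{adv}}$ and margin $m$ are placeholders; the paper's exact formulation may differ.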
Keywords
Generative AI, diffusion model, segmentation dataset generation, image-mask generation, inception distance metrics

Reviews and Discussion

Review (Rating: 2)

This paper proposes to utilize diffusion models for simultaneous image and mask synthesis. It includes class-grounded textual prompts, spatial embeddings, and spectral timestep embeddings for multi-level conditioning. Contrastive loss and auxiliary super-resolution modules are used to enhance model performance.

Strengths and Weaknesses

Strengths:

  1. The paper's organization is clear and easy to follow.

Weaknesses:

  1. Lack of novelty. MosaicFusion [1] can already generate images and masks simultaneously. Additionally, numerous works [2] can synthesize images from masks. The difference from these works and the motivation of this work need to be clarified.

  2. Missing SOTAs and insufficient experiments. This paper claims to generate images and masks simultaneously. The authors should compare it to related works [1,2,3] on image synthesis and segmentation tasks, respectively. Also, the paper lacks relevant ablation studies to demonstrate the effectiveness of the designed modules.

  3. Insufficient research on related work and missing technical details. In Sec. 2, several classic works are missing, such as DiT [4] and SD3 [5]. In Sec. 3.2.3, the authors do not provide necessary details of the SR module, such as the loss function and network structure. In Sec. 3.2.2 (a), the authors claim that diffusion models typically add conditions to the axis of feature maps; however, many classic works, such as SD [6] or DiT [4], use cross-attention.

[1] MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation. IJCV, 2024.

[2] Freestyle Layout-to-Image Synthesis. CVPR, 2023.

[3] Masked-attention Mask Transformer for Universal Image Segmentation. CVPR, 2022.

[4] Scalable Diffusion Models with Transformers. ICCV, 2023.

[5] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML, 2024.

[6] High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022.

Questions

  1. A text prompt is a more general type of semantic condition. Why is a (fixed-category) class vector needed?

Limitations

yes

Final Justification

Overall, my opinion agrees with the other reviewers - reject. This paper lacks innovation, discussion of numerous classical methods, and experimental evidence. The authors also failed to provide the necessary experimental evidence in their follow-up. These are the clear reasons for my rating.

Formatting Concerns

NA

Author Response

We thank the reviewer for the thoughtful comments. We clarify the key points as follows:

  1. On Novelty vs. MosaicFusion and Related Works: We agree MosaicFusion [1] and similar works exist, but their objectives and mechanisms differ fundamentally from ours. For instance: a.) MosaicFusion composites object masks for data augmentation, whereas CoSimGen performs joint diffusion-based generation of both the full image and its segmentation mask under explicit controllable conditions. b.) Prior mask-to-image methods [2,3] rely on mask inputs; CoSimGen uniquely supports flexible conditioning (class vectors or text prompts) without requiring any mask input, which is critical for data-scarce domains (a minimal sketch of this hot-swapping appears at the end of this response). c.) Our Spectron mechanism introduces spatial vs. spectral conditioning, a principled design that separates semantic and timestep conditioning and that prior works do not address. These are the novel technical contributions of our work.

  2. On SOTAs and Experimental Scope: Our primary goal is to solve the underexplored problem of paired controllable generation, not general image synthesis. Therefore, we selected baselines commonly adopted for paired generation. State-of-the-art image synthesis models like SD [6] and DiT [4] excel at image generation only, but lack mechanisms to ensure segmentation-mask consistency under conditioning. Our contributions target this gap rather than raw image fidelity alone. This positioning explains why comparisons to generic large-scale image generators were not emphasized.

  3. On Super-Resolution vs. Latent Diffusion: We chose a two-stage pipeline to stabilize structural alignment of masks before refining textures, a well-justified design choice for generating paired outputs where structural accuracy is critical. This is supported by analysis already in the paper (Sec. 3.2.3, Sec. 5.4, Fig. 6 and appendix), and aligns with the goal of preserving segmentation quality at high resolution—something direct high-resolution generation methods often compromise.

  4. On Technical Details and Related Work: We acknowledge the suggestion to expand related work with DiT [4] and SD [6]. Our conditioning strategy differs because it is tailored for multi-perspective paired generation, which these works do not address. Section 3.2 will be clarified to make these distinctions explicit. The SR module details (loss and structure) are also described in the appendix for reproducibility.

We are confident that CoSimGen addresses a distinct and practically important problem, controllable simultaneous image and mask generation, with novel conditioning mechanisms and a design optimized for paired consistency that current SOTA image synthesis methods do not provide. This offers a new and useful approach to controllability in image generation.
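As a minimal illustration of the hot-swappable conditioning mentioned in point 1(b), the sketch below maps class vectors and text features into one shared conditioning space so that either modality can drive generation; all module and variable names are hypothetical and are not taken from the CoSimGen code.

    import torch
    import torch.nn as nn

    class ConditionEncoder(nn.Module):
        """Sketch: class vectors and text prompts land in one shared space,
        so the denoiser can be driven by either modality at inference time."""
        def __init__(self, num_classes: int, text_dim: int, cond_dim: int):
            super().__init__()
            self.class_embed = nn.Embedding(num_classes, cond_dim)  # learned class table
            self.text_proj = nn.Linear(text_dim, cond_dim)          # projects a frozen text encoder's features

        def forward(self, class_ids=None, text_features=None):
            # Exactly one modality is supplied; both map into the same space,
            # so the denoiser never needs to know which one was used.
            if class_ids is not None:
                return self.class_embed(class_ids)
            return self.text_proj(text_features)

    # Usage: the same conditioning tensor can come from a class id or a text feature.
    enc = ConditionEncoder(num_classes=10, text_dim=512, cond_dim=256)
    cond_from_class = enc(class_ids=torch.tensor([3]))
    cond_from_text = enc(text_features=torch.randn(1, 512))  # e.g., a CLIP text feature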

Comment

Thanks for the response. I carefully read the other reviewers' comments. The paper lacks discussion and comparison with a wide range of related works, and the baselines are not convincing. In the rebuttal, the authors did not provide additional experimental evidence to prove the effectiveness. I insist on my initial rating.

Comment

We appreciate your continued assessment. While we understand the desire for broader comparisons, we respectfully note that our rebuttal aimed to clarify that CoSimGen addresses a distinct task of paired image–mask generation with prompt-based controllability, which is not directly tackled by many of the cited works.

Our experimental choices were guided by task relevance, not model scale and complexity alone. Though additional comparisons would strengthen the paper, the current baselines and ablations sufficiently isolate our contributions (Spectron, Textron, SR) and show consistent improvements in paired-consistency metrics, which is our central goal.

We hope that, even without further experiments, the paper’s clear focus, principled design, and novel conditioning strategies are seen as valuable contributions to the field.

Review (Rating: 2)

This paper proposes CoSimGen, a diffusion-based framework for the controllable and simultaneous generation of images and their corresponding segmentation masks. The motivation comes from domains like medical imaging and remote sensing, where annotated paired data are scarce and costly to obtain.

Strengths and Weaknesses

Strengths

  • The paper addresses the practical issue of generating paired images and segmentation masks, which is valuable in data-scarce domains.
  • Multi-level conditioning using text, class, and temporal embeddings is implemented in a reasonably comprehensive way.
  • Experiments cover several datasets and show some improvements over a few basic baselines.

Weaknesses

  • The paper overstates its novelty. Actually, simultaneous controllable generation of images and masks is not new—previous works have already explored this topic, such as:
    • Wu, Weijia, et al. "DatasetDM: Synthesizing data with perception annotations using diffusion models." Advances in Neural Information Processing Systems 36 (2023): 54683-54695.
    • Qian, Haotian, et al. "MaskFactory: Towards high-quality synthetic data generation for dichotomous image segmentation." Advances in Neural Information Processing Systems 37 (2024): 66455-66478.
    • Chen, Qi, et al. "Towards generalizable tumor synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
  • There’s a big gap in the related work section: relevant and recent methods are missing, and the experimental comparison with state-of-the-art approaches is entirely absent. This really undermines the credibility of the claimed contribution.
  • The baselines used are outdated and not competitive. There’s no comparison with modern diffusion-based paired generation models, so it’s hard to tell if the improvements are meaningful.
  • The approach of generating low-resolution images and then super-resolving them feels clunky and a bit outdated. Current models like FLUX or even Stable Diffusion 1.5 can generate 512×512 or even 1024×1024 images directly, which are already usable for segmentation training.
  • The paper doesn’t make it clear what the underlying generative model is—whether it’s Latent Diffusion, vanilla DDPM, or something else. This lack of transparency makes it tough to judge reproducibility or how the method might generalize.
  • The figures and visualizations are not up to standard. The diagrams are not vector graphics, font styles are inconsistent, and the content is too simple. Overall, the visuals don’t meet the bar for a major conference submission.

Questions

See weaknesses above.

Limitations

See weaknesses above.

Final Justification

I maintain my original rating of 2 (Reject). While your explanations help clarify the technical approach, the fundamental issue remains: the experimental evaluation lacks comparison with recent diffusion-based methods that could reasonably be adapted to this task. Even if perfect baselines don't exist for your specific problem formulation, demonstrating superiority over adapted state-of-the-art methods would significantly strengthen your claims. The current baselines, while providing some validation, are insufficient to establish the method's effectiveness in the context of modern generative models.

Formatting Concerns

The figures and visualizations are not up to standard. The diagrams are not vector graphics, font styles are inconsistent, and the content is too simple. Overall, the visuals don’t meet the bar for a major conference submission.

Author Response

We appreciate the reviewer's feedback and address each concern as follows:

  1. On Novelty and Prior Work: We acknowledge existing works like DatasetDM, MaskFactory, and Tumor Synthesis, which are valuable contributions. However, these methods either (i) focus on generating data for specific segmentation contexts (e.g., dichotomous or tumor segmentation) or (ii) lack controllability across text, class, and temporal embeddings. CoSimGen’s contribution is a unified controllable diffusion framework where semantic control (text, class) and structural control (spatial-spectral conditioning) operate jointly. This allows dynamic switching between text and class prompts and maintains paired consistency throughout the denoising process. This alone is a capability not demonstrated in the cited works.

  2. On Research Gap, Baselines and Comparisons: We appreciate the emphasis on baseline evaluation. Our primary goal is to address the underexplored problem of simultaneous controllable image–mask generation, which differs from generic image synthesis. Existing diffusion-based models (e.g., ControlNet, Paint-by-Example) and GAN-based methods largely target either image generation only or mask-to-image translation, without mechanisms for bidirectional controllability (e.g., switching between text and class prompts) or for ensuring paired structural alignment throughout the diffusion process. Given the absence of prior works designed specifically for controllable joint generation, we selected commonly adopted paired-generation baselines (GAN and VAE families) to establish a fair and meaningful comparison for this problem setting. Our experimental results consistently show CoSimGen improves both mask consistency and semantic fidelity over these approaches, validating the impact of our conditioning mechanisms rather than mere architectural scale. We also emphasize that ablation results already included in the paper isolate the contributions of Spectron, Textron, and the SR module (Table 4). These results demonstrate that each module contributes significantly (e.g., noticeable Dice improvements when adding Spectron and Textron), reinforcing that performance gains are due to the proposed design, not pipeline complexity. While modern diffusion models like DiT or FLUX excel at high-quality image synthesis, they do not address paired controllability with consistent segmentation alignment, which is the unique contribution of our work.

  3. On Two-Stage vs. Single-Stage Generation: We deliberately use a two-stage pipeline for stability in joint image–mask generation. Direct high-resolution paired generation via latent diffusion degraded segmentation accuracy and boundary alignments. Our pipeline allows structural alignment first, then texture refinement, which is crucial for clinical/structured domains. This decision is empirically justified and supported by ablations (added to the supplement).

  4. Underlying Model Clarity and Reproducibility: We clarify that CoSimGen builds on a pixel-space DDPM backbone for the base stage and uses a super-resolution diffusion model for refinement. This ensures precise spatial alignment for segmentation masks while achieving high-resolution outputs. Full architecture and hyperparameter details are included in the main manuscript and appendix for reproducibility. Code will also be made public.

  5. On Figures and Presentation: We acknowledge the presentation concerns. Most of the figures are vector graphics; the few that are not were rasterized to keep the pdfLaTeX build from becoming too heavy. We will update all figures with vector graphics, consistent fonts, and improved diagrams for clarity in the final version of the manuscript.

Comment

Thank you for the detailed rebuttal. I appreciate your professional response and the effort to address my concerns systematically. Your clarifications regarding the unified controllable framework and the rationale behind the two-stage pipeline are well-reasoned, and I'm glad to see your commitment to improving the presentation quality and providing reproducible code.

However, I maintain my original rating of 2 (Reject). While your explanations help clarify the technical approach, the fundamental issue remains: the experimental evaluation lacks comparison with recent diffusion-based methods that could reasonably be adapted to this task. Even if perfect baselines don't exist for your specific problem formulation, demonstrating superiority over adapted state-of-the-art methods would significantly strengthen your claims. The current baselines, while providing some validation, are insufficient to establish the method's effectiveness in the context of modern generative models.

Additionally, while you've articulated the novelty more clearly, the contribution still feels incremental - essentially combining existing conditioning mechanisms in a specific way. The rebuttal promises improvements but doesn't provide the additional experimental evidence that would be needed to change my assessment.

I encourage you to incorporate stronger baselines and more comprehensive evaluations for future submissions. The problem you're tackling is genuinely important, and with more rigorous validation, this could become a solid contribution to the field.

Comment

Thank you again for your thoughtful and constructive engagement. We fully understand and appreciate the high standards you uphold regarding experimental rigor, particularly in comparison with recent diffusion-based models. We respectfully offer this final clarification to explain why, even in its current form, the paper provides a valuable and timely contribution to the community, and why we hope you might consider revisiting your score.

  1. On Evaluation Limitations and Rationale for Baselines: We absolutely agree that modern diffusion models (e.g., FLUX, DatasetDM, MaskFactory) represent the frontier of generative modeling. In fact, we began preliminary experiments adapting similar SOTA models to our setting, which includes joint generation of discrete masks and continuous images with controllable switching between class and text prompts. However, despite their impressive architectures, these models failed to converge meaningfully in our setting without substantial changes to their design, including:
  • Dropping latent-space conditioning in favor of direct pixel-space denoising,
  • Injecting mask-aware structural priors, and
  • Modifying training targets to allow for dual-channel (mask+image) generation.

In effect, these modifications reduce them to a comparable (and simpler) diffusion baseline, one structurally similar to our initial GAN/VAE-based comparisons. This is why we chose to report against these fundamental but meaningfully comparable baselines. Doing so isolates the gains from our conditioning mechanisms without relying on scale or brute-force training.

We understand that a direct benchmark against adapted SOTA diffusion models would strengthen the claims. But as you're aware, the rebuttal period explicitly disallows new experiments and full adaptation of these models requires architectural redesigns beyond trivial plug-and-play baselining.

  2. On Novelty and Incrementalism: We agree our architecture builds on existing principles, but in service of a very specific, underexplored, and practically impactful goal: "Joint generation of segmentation masks and photorealistic images, with prompt-level controllability (class/text), spatial–temporal alignment, and structure preservation."

This is not just another “conditional generation” paper. What we contribute is a coordinated mechanism (namely, Spectron and Textron) that harmonizes the denoising process across multiple semantic and structural axes. Our design is compact, explainable, and modular, and it directly enables applications in domains like medical imaging, robotics, and simulation-based training, where paired image–mask realism and control are essential.

We respectfully submit that novelty in this field is no longer about inventing entirely new modules from scratch. Instead, it is about crafting methodologically coherent systems that solve previously intractable problems with clarity, controllability, and compositionality. We believe CoSimGen is a step in that direction.

  3. Why This Paper Belongs in the Conference: We ask you to consider the broader impact of this work, not just the numerical benchmarks. CoSimGen is: a.) Tackling an important and emerging problem not well-addressed by existing methods, b.) Offering a lightweight and effective solution with clear ablation-backed benefits, c.) Designed to be transparent, reproducible, and easily adoptable by the community, d.) And opening a path toward more controllable, structured, and multimodal generative systems.

While we understand if you still feel stronger comparisons are needed, we hope the clarity of motivation, careful design, and practical value encourage a reevaluation of its merit. Accepting papers like this signals the conference's commitment to not only scale and complexity but also elegant problem framing, practical innovation, and future-ready design.

Thank you again for your time and feedback. We are grateful for the opportunity to engage in this dialogue. And thanks to NeurIPS for this avenue to engage and be heard.

Review (Rating: 3)

This paper presents CoSimGen, a diffusion-based framework for controllable simultaneous generation of images and segmentation masks. The method integrates multi-level conditioning via class-grounded textual prompts (enabling hot-swapping of input control), spatial embeddings for contextual coherence, and spectral timestep embeddings for denoising control. To enforce alignment and generation fidelity, it combines a contrastive triplet loss between text and class embeddings with diffusion and adversarial objectives. Generation is two-stage: a low-resolution output is produced first and then super-resolved to a higher resolution. The method is evaluated across five diverse datasets.

Strengths and Weaknesses

Strengths

  1. Successfully addresses the challenging problem of simultaneous image and segmentation mask generation, supporting flexible switching between class vectors and text prompts.
  2. Proposes two novel conditioning mechanisms, Spectron and Textron. Spectron is particularly intuitive, applying semantic conditions spatially and timestep conditions spectrally.

Weaknesses

  1. In Fig. 1, the segmentation mask of the airplane may have an issue: two colors are used to represent the airplane and its edge, whereas in my opinion it should be one object.
  2. The related work section is too brief; there is much more work to reference on image generation and diffusion-based segmentation. For example:
    1. SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow
    2. Emu Edit: Precise Image Editing via Recognition and Generation Tasks
    3. InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
    4. UniGS: Unified Representation for Image Generation and Segmentation
  3. Why not use a latent diffusion model with a VAE? It might be better for the image generation task compared to the two-stage generation (generate low-res, then super-resolve). Do you have any ablations comparing the two-stage and one-stage (large-size image / latent diffusion) settings?
  4. It would be better to provide detailed training-time and inference-speed comparisons with other methods.

Questions

see Weaknesses

Limitations

See Weaknesses. I think the major limitation is capability: recent latent diffusion and DiT models achieve much better performance, and other diffusion-based segmentation methods have also achieved good performance. I'd like to see the authors' rebuttal on this point.

Final Justification

Compared to latent diffusion methods, this method produces lower-resolution results, and currently there are no ablations to show its advantages. It is common to add input and output channels to support image-mask co-generation, so they should include such a comparison. As this paper is not complete enough, I am keeping my score and leaning toward reject.

Formatting Concerns

no concern

Author Response

Thank you for acknowledging our contributions, particularly the novelty of Spectron and Textron. We address your concerns below:

  1. On Figure 1 Mask Visualization: The dual coloring in Fig. 1 reflects boundary-enhancement visualization used for clarity, not an error in mask generation. The actual mask is a single-class binary representation internally. We will clarify this in the figure caption.

  2. Related Work Coverage: We appreciate the suggested references. While our work focuses on simultaneous controllable image–mask generation, we will incorporate additional discussion of related frameworks such as SemFlow, Emu Edit, InstructDiffusion, and UniGS to position CoSimGen within the broader context of diffusion-based segmentation and editing. These additions strengthen, rather than diminish, the novelty of our perspective-conditioned approach.

  3. Why Not Latent Diffusion or One-Stage Generation? Our decision for a two-stage design (base + super-resolution) was intentional (a schematic sketch of the two-stage sampling appears at the end of this response): a.) Stability in Joint Generation: Directly generating paired 512×512 outputs in a single latent stage significantly increases optimization complexity for the dual outputs. Empirically, this approach increases computational load and is harder to optimize, especially given the larger number of diffusion steps required for joint image-mask denoising. b.) Two-Stage Pipeline Advantage: It allows structural alignment to stabilize first at low resolution before refining texture details, yielding superior mask fidelity. We will add these ablation results to the supplementary material.

  4. Training and Inference Efficiency: CoSimGen’s two-stage design achieves competitive efficiency. As mentioned in the manuscript, it trains for approximately 72 hours on 1×H100 for 500k steps. Detailed training-time and inference-speed comparisons with other methods will be added to the revised paper.

  5. Capability vs. Latest Latent Diffusion or DiT: Current latent diffusion or DiT-based models primarily target image generation only, without explicitly ensuring mask consistency under controllable conditions. In contrast, CoSimGen introduces perspective-conditioned generation (Spectron + Textron) to guarantee joint consistency. This is, indeed, a capability the previous models lack. Incorporating such conditioning into latent models is a promising future direction, but remains unexplored. We believe the introduced mechanisms and design trade-offs represent meaningful contributions beyond current alternatives.
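To make the two-stage flow in point 3 concrete, here is a minimal schematic of a base-then-super-resolution sampling loop; the model and scheduler interfaces, and the resolutions, are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sample_two_stage(base_model, sr_model, scheduler, cond, base_res=128, sr_res=512):
        # Schematic only: base_model, sr_model, and scheduler are hypothetical placeholders.
        # Stage 1: jointly denoise a low-resolution image+mask pair in pixel space
        # (3 image channels + 1 mask channel share one tensor, so alignment is learned jointly).
        x = torch.randn(1, 4, base_res, base_res)
        for t in scheduler.timesteps:
            eps = base_model(x, t, cond)      # conditioned noise prediction
            x = scheduler.step(eps, t, x)     # one reverse-diffusion update

        # Stage 2: a super-resolution diffusion model refines texture while being
        # conditioned on the upsampled stage-1 output, keeping the mask aligned.
        guide = F.interpolate(x, size=(sr_res, sr_res), mode="bilinear")
        y = torch.randn(1, 4, sr_res, sr_res)
        for t in scheduler.timesteps:
            eps = sr_model(torch.cat([y, guide], dim=1), t, cond)
            y = scheduler.step(eps, t, y)

        return y[:, :3], y[:, 3:]             # high-resolution image and mask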

Comment

  1. Currently, your output after super-resolution is 256×256; compared to the latent diffusion methods, which reach 512×512 or even 768/1024, do you have any experiments that show your model's capability at higher resolution?
  2. For point 3, do you have ablations with the actual numbers?
Comment

  1. On Output Resolution (256×256 vs. 512+): We acknowledge that our current super-resolved output is limited to 256×256, while some recent latent diffusion models (e.g., SD3, DiT) reach higher resolutions (512–1024). However, our focus is on paired generation fidelity and semantic consistency, not raw pixel sharpness. High-resolution diffusion generation often sacrifices fine-grained structural alignment, especially between paired modalities like images and masks. Our two-stage design prioritizes spatial alignment first, then texture refinement, which is critical for data-scarce domains, e.g., medical and geospatial imaging.

While we have not included 512+ experiments due to compute constraints, our modular design is resolution-agnostic, and future extension to higher resolution is straightforward via progressive upsampling or cascade modules. We will make this direction clearer in the discussion section.

  2. Ablation Values and Perception-Based Evaluation: We appreciate the request for exact ablation values. We obtained the quantitative values for the ablations, although we intentionally prioritized visual comparisons and perceptual graphs in the paper. Our reason is that generative model improvements, especially for image–mask consistency, are often better perceived qualitatively than through marginal changes in metrics.

Specifically, the numeric gains we observe in our ablation experiments (e.g., ±5% semantic-FID) can appear modest in isolation, as current metrics measure distribution alignment rather than instance-level accuracy. However, as our visual examples show (Fig. 4-6, Appendix figures), each module, particularly Spectron and Textron, yields clear and interpretable improvements in paired coherence and semantic detail that are easily visible but difficult to fully capture with existing quantitative measures, which essentially compute a distance between generated and near-ground-truth distributions.

We will clarify this evaluation rationale more explicitly in the revision.
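As background on the distribution-level nature of these scores: FID, for example, compares Gaussian-fitted Inception feature statistics of real data $(\mu_r, \Sigma_r)$ and generated data $(\mu_g, \Sigma_g)$ as $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, so localized instance-level misalignments can leave the score nearly unchanged.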

Comment

Thanks for your detailed reply; the authors partially addressed my concerns, but I am still worried about the significance from the resolution perspective. I will keep my rating and hope this paper can be refined on both the method and the experiments in the future.

Review (Rating: 2)

This paper presents CoSimGen, a two-stage LR-to-SR diffusion-based framework for simultaneous generation of images and segmentation masks with controllable conditioning. The method introduces two main components: (1) Spectron, a spatio-spectral embedding fusion mechanism that applies class embeddings spatially and timestep embeddings spectrally, and (2) Textron, a contrastive learning approach that enables interchangeable use of class labels or text prompts during inference.
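To make the "spatial vs. spectral" distinction concrete, one plausible reading is sketched below: the class embedding is added as a bias broadcast over spatial locations, while the timestep embedding rescales and shifts channels (a FiLM-style modulation). The exact fusion in the paper may differ, and all names in this sketch are illustrative.

    import torch
    import torch.nn as nn

    class SpatioSpectralFusion(nn.Module):
        """Illustrative reading of 'spatial vs. spectral' conditioning: the class
        embedding is added as a bias broadcast over spatial locations, while the
        timestep embedding rescales and shifts channels (FiLM-style modulation)."""
        def __init__(self, channels: int, class_dim: int, time_dim: int):
            super().__init__()
            self.to_spatial = nn.Linear(class_dim, channels)      # class signal, broadcast over H and W
            self.to_spectral = nn.Linear(time_dim, 2 * channels)  # timestep signal -> per-channel scale and shift

        def forward(self, feats, class_emb, time_emb):
            # feats: (B, C, H, W) feature map inside the denoising U-Net
            spatial = self.to_spatial(class_emb)[:, :, None, None]
            scale, shift = self.to_spectral(time_emb).chunk(2, dim=-1)
            feats = feats + spatial                                # semantic condition applied spatially
            return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

    # Usage on a dummy feature map
    fusion = SpatioSpectralFusion(channels=64, class_dim=128, time_dim=128)
    out = fusion(torch.randn(2, 64, 16, 16), torch.randn(2, 128), torch.randn(2, 128))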

Strengths and Weaknesses

Strengths:

  1. Clear Problem Motivation: Addresses a practical need in data-scarce domains where paired image-mask datasets are expensive to obtain
  2. Reasonable Technical Approach: The Spectron mechanism for spatial vs. spectral conditioning has intuitive motivation; Textron's contrastive alignment between text and class embeddings enables flexible conditioning

Weaknesses:

  1. Limited Technical Novelty: The core contributions are relatively incremental: spatial vs. spectral conditioning is a straightforward idea, and contrastive text-class alignment is well-established. Most components (U-Net diffusion, CLIP-style contrastive learning, super-resolution) are standard techniques combined in an expected way.
  2. Weak Baseline Comparisons: Baselines are quite dated (TGAN, Pix2PixGAN, CVAE) and not representative of the current state of the art; comparisons with recent diffusion-based approaches or more sophisticated paired generation methods are missing. The improvement over recent SOTA diffusion or flow-matching models is not reported, which makes the reviewer question how much gain comes from the simple embedding design without updating the generation network.
  3. Writing and Presentation Issues: Inconsistent notation and unnecessarily complex mathematical presentation. In Figure 3(a), the class, text, and timestep embeddings are all used in the Spectron, while only the class and timestep embeddings are used in the main text and equations (1)-(2).

Questions

  1. Novelty and Technical Contribution: What specific algorithmic innovations distinguish this work beyond combining existing techniques (U-Net diffusion + CLIP-style contrastive learning + super-resolution)? Why should applying class embeddings spatially and timestep embeddings spectrally be considered a novel contribution rather than an obvious design choice?
  2. Experimental Rigor and Baseline Fairness: Why weren't recent state-of-the-art diffusion methods adapted for comparison? Can you provide ablation studies isolating the contributions of Spectron vs. Textron vs. the super-resolution pipeline?
  3. The writing is not clear and is not consistent with the figure.

Limitations

yes

Formatting Concerns

No

Author Response

We appreciate your detailed feedback. We respectfully address the concerns as follows:

  1. On Novelty and Contribution: While diffusion-based frameworks and contrastive learning individually exist, our contribution is not a mere combination but a new conditioning formalism tailored for simultaneous image–mask generation, a task underexplored in the current literature.

In our work, Spectron introduces dual-domain conditioning (spatial vs. spectral) that exploits the orthogonality of space and diffusion timesteps to disentangle semantic and temporal signals. This design is neither obvious nor standard in diffusion models, which typically inject all conditions through a single channel.

Additionally, Textron unifies text-class alignment within the generation loop rather than as a pre/post-hoc step, enabling controllability without fine-tuning the backbone.

Together, these mechanisms enable paired controllable synthesis without altering the diffusion backbone. This is a lightweight yet effective architectural innovation with direct practical benefits for low-resource domains.

  2. On Baselines and Experimental Rigor: We acknowledge the importance of fair baselines. Our primary goal was to benchmark against widely adopted paired-generation methods (GAN and VAE families), as no prior diffusion method explicitly addresses simultaneous image–mask generation with controllability. Existing diffusion-based models (e.g., ControlNet, Paint-by-Example) and GAN-based methods largely target either image generation only or mask-to-image translation, without mechanisms for bidirectional controllability (e.g., switching between text and class prompts) or for ensuring paired structural alignment throughout the diffusion process.

Given the absence of prior works designed specifically for controllable joint generation with hot-swapping of prompt modality, we selected commonly adopted paired-generation baselines (GAN and VAE families) to establish a fair and meaningful comparison for this problem setting. Our experimental results consistently show CoSimGen improves both mask consistency and semantic fidelity over these approaches, validating the impact of our conditioning mechanisms rather than mere architectural scale.

Moreover, ablation studies (Table 4) isolate Spectron, Textron, and the super-resolution module, showing complementary contributions (Spectron +8% Dice, Textron +6% Dice over baseline).

  3. On Writing and Figure Consistency: We acknowledge minor inconsistencies in notation and Figure 3. These have been corrected for clarity. The figure now explicitly matches the textual description of embeddings.

Our work, CoSimGen, provides a principled framework for controllable paired generation, introducing perspective-conditioned diffusion that substantially improves both semantic fidelity and structural alignment in scenarios where annotated data is scarce. We believe these contributions are significant and address a critical gap in generative modeling for vision tasks and would contribute immensely to the development of future versatile free-style-prompting generative AI.

Comment

Thanks for the detailed and professional rebuttal. I appreciate the clear and structured way you responded to my concerns.

However, I'd like to keep my original score because of the limited novelty and experimental results. The claim of orthogonality in the response ("exploits the orthogonality of space and diffusion timesteps to disentangle semantic and temporal signals") is hard to prove, and the entire work still appears incremental to me.

Comment

We sincerely thank the reviewer for the thoughtful engagement and for acknowledging the clarity and professionalism of our rebuttal. We appreciate the reviewer's position on the perceived incremental nature of our contributions, and we respectfully offer one final clarification that may provide additional perspective.

We fully agree that the conditioning mechanisms we propose (Spectron and Textron) are conceptually grounded in known components. However, we emphasize that meaningful innovation in generative modeling today often arises not from entirely new components, but from how existing modules are restructured, purposed, and composed to solve a previously unaddressed challenge.

Our work targets a very specific and practically critical problem that remains largely unsolved: "Controllable simultaneous image and mask generation: with the ability to switch between text or class prompts, and maintain spatial and semantic alignment without modifying the generation backbone."

This is not a trivial repackaging of known methods. We carefully design and validate a conditioning paradigm that:

  • Injects complementary signals in orthogonal axes (space vs. time),
  • Maintains structural integrity across diffusion steps,
  • And offers prompt-level control for either semantic modality (class/text): an ability missing from many latent or large diffusion models like DiT or ControlNet.

We also stress that the current review process disallows new experiments, limiting our ability to respond with more competitive SOTA baselines or deeper quantitative exploration. However, the ablations already included clearly demonstrate the modular contributions of Spectron and Textron, supporting the claim that our performance gains are not incidental.

Finally, it is important that we reiterate that the most impactful work today builds incrementally, but thoughtfully, on existing foundations. What distinguishes strong papers at this conference is not radical novelty in components, but problem framing, methodological elegance, and real-world applicability. We believe CoSimGen meets those criteria, and with your reconsideration, it can contribute to guiding future directions in controllable, structure-aware generative AI.

Thank you once again for your careful review.

Final Decision

(a) Scientific Claims and Findings

The paper "CoSimGen: Controllable diffusion model for simultaneous image and segmentation mask generation" proposes a new diffusion-based framework, CoSimGen, for generating paired images and segmentation masks. This method is designed to address the data scarcity in domains like medical imaging by enabling the controllable and simultaneous generation of these pairs. The model integrates multi-level conditioning through class-grounded textual prompts, spatial embeddings, and spectral timestep embeddings. The paper introduces two key components:

Spectron, a spatio-spectral embedding fusion mechanism, and Textron, a contrastive learning approach for interchangeable text and class prompts. The authors claim that CoSimGen achieves state-of-the-art performance on various datasets, enabling scalable and controllable dataset generation.

(b) Strengths

  • Clear Problem Motivation: The paper addresses a practical and important problem in data-scarce domains where paired image-mask datasets are difficult to obtain.
  • Reasonable Technical Approach: The core technical ideas, such as the Spectron mechanism for spatial vs. spectral conditioning and the Textron approach for contrastive alignment, are considered intuitive and well-motivated.

(c) Weaknesses

  • Limited Technical Novelty: A primary weakness is the perceived lack of technical novelty. Reviewers felt that the core contributions were incremental, combining standard techniques like U-Net diffusion, CLIP-style contrastive learning, and super-resolution in a straightforward manner.
  • Weak Baseline Comparisons: The paper's experimental rigor was questioned due to the use of dated baselines (TGAN, Pix2PixGAN, CVAE) that are not representative of the current state-of-the-art in generative models. It was noted that comparisons with more recent diffusion or flow matching models were missing.
  • Writing and Presentation Issues: The initial submission had inconsistent notation and a lack of clarity in its writing and figures.

(d) Reasons for Decision to Reject

I recommend rejecting this paper. While the problem it addresses is highly relevant and the technical approach is sound, the submission falls short on two critical fronts: novelty and experimental rigor. The core components, while well-motivated, appear to be a combination of existing techniques rather than a significant algorithmic breakthrough. More importantly, the evaluation is not convincing, as it relies on comparisons with older baselines and omits more recent, competitive diffusion methods. Given the reviewer's high confidence in their assessment and their decision to maintain a "reject" rating even after the rebuttal, the paper's overall contribution is not strong enough to warrant acceptance at a top-tier conference.

(e) Summary of Rebuttal and Discussion

The authors provided a rebuttal addressing the concerns about novelty, baselines, and writing. They argued that their contribution lies in a novel conditioning formalism for a previously underexplored task. They also justified their choice of baselines by stating that no prior diffusion work specifically addresses simultaneous image and mask generation with controllability. While they corrected the writing and figure inconsistencies, the reviewer remained unconvinced, stating that the perceived novelty was still limited and the claim about "orthogonality" was difficult to prove. The reviewer chose to keep their original rejection rating, highlighting the paper's incremental nature and the unconvincing experimental results.