MaskFactory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation
Our method uses diffusion models to generate fine-grained data for dichotomous image segmentation, addressing annotation challenges and boosting model performance.
Abstract
Reviews and Discussion
This paper proposes a generative method for dichotomous image segmentation. In detail, non-rigid and rigid editing techniques are used to generate high-quality synthetic masks. Those masks are leveraged for segmentation model training, which typically requires expensive dichotomous image labeling.
Strengths
This paper introduces an approach for generating high-quality datasets for DIS. In my opinion, the methodology is carefully designed and experiments show the effectiveness. Overall I think this is an interesting and insightful paper.
Weaknesses
- There seem to be some typos/errors in the paper.
  a. In line 140, (V, E) should be (V, E_s); E_s does not appear in Section 3.2.2 except in line 139.
  b. The font should be consistent, e.g., lines 139-140 and Appendix A, Algorithm 1, lines 7-11.
  c. Figure 2 does not show what is discussed in Section 3.3. In Figure 2 the prompt pool is used for generation, but in Section 3.3 the authors say that the segmentation masks and the edge conditions are used. If the text prompt is not used, it is meaningless to introduce the text in the last row of Figure 2 Stage 1 and in Section 3.2.2. Also, a detailed description of {p^m_i} is missing from the paper, though its meaning is fairly clear; only P_i appears in Section 3.2.2, and at least {p^m_i} should appear in the following sections for consistency.
  d. "Topylogy" should be "Topology" in Figure 2.
  e. Remove the full stops in the titles of Sections 4, 4.1, and 4.2.
- The authors argue that, inspired by ControlNet, a Canny condition is also used. However, since no pretrained ControlNet is used and some layers are trained according to lines 173-174, it is unclear whether the Canny condition is useful. Although more conditions intuitively carry more information, the Canny edges are already contained in the informative mask condition, so an ablation study should be presented to show that the additional condition actually helps. Also, according to Section 5.2, is the pretrained Canny ControlNet used? This needs to be clarified in the main paper.
- The method shows only a limited improvement over SAM-HQ, which fine-tunes SAM with an additional branch on some prepared data.
- The impact of the method itself seems limited, especially since DIS is not an especially important task; perhaps the authors could consider extending the method to a broader scope. That said, I consider this a minor issue.
Questions
Some suggestions
- The authors use Zero123 for rigid mask editing. I am curious whether ImageDream [1] could be used here. Both can generate novel-view images, and [1] can additionally take text as input; I am interested in whether it could directly change the shape of the mask, as this paper shows "A round table" changing the shape in Figure 2.
- For the topology mentioned in Section 3.2.2, is it easy to show the topology structures of some samples? It would be better to include topology results for better illustration.
- In Table 1, the numbers (2500, 5000, ...) are too close to the line.
- The authors could include more discussion of related works [2, 3]. It would be interesting to discuss whether the data filtering strategy in [2] can be combined with this work. Also, [3], which likewise takes a dataset as input and outputs a new one, includes edited masks for generation with specially tuned generative models, sharing the core idea of this paper.
As I discussed in the Strengths, overall this is an interesting paper, and I also love the area of synthetic data generation. However, it seems this paper was finished in too much of a rush. There are some improper points in the paper, discussed in the Weaknesses. I believe the authors should revise the paper thoroughly for precise presentation, and it is very important to describe the method as it is actually implemented. It seems the text, masks, and Canny edges are all used in the generation according to line 388; if so, Figure 2, Section 3.3 including Formula 8, and the algorithm in the appendix should be modified. Considering the number of issues that need to be resolved, I prefer rating this paper as "Borderline reject", and I am sure to consider raising my rating to "Borderline accept" after the rebuttal, given my possible misunderstanding and the authors' proper clarification.
[1] Wang, Peng, and Yichun Shi. "Imagedream: Image-prompt multi-view diffusion for 3d generation." arXiv preprint arXiv:2312.02201 (2023).
[2] Yang, Lihe, et al. "Freemask: Synthetic images with dense annotations make stronger segmentation models." Advances in Neural Information Processing Systems 36 (2024).
[3] Zhu, Lingting, et al. "Generative Enhancement for 3D Medical Images." arXiv preprint arXiv:2403.12852 (2024).
Limitations
The limitations are discussed.
We sincerely thank the reviewer for their meticulous review and helpful suggestions. We address each point as follows:
Q1: Typos and Errors
We apologize for these oversights and will correct all identified issues in the revised version:
a. We will change (V, E) to (V, E_s) in line 140 and ensure consistency throughout Section 3.2.2.
b. We will unify the font style across the paper, including lines 139-140 and Algorithm 1.
c. We will revise Figure 2 and Section 3.3 to accurately reflect our method's implementation, clarifying the use of text prompts, segmentation masks, and edge conditions. We will also provide a detailed description of {p^m_i} for consistency.
d. We will correct "Topylogy" to "Topology" in Figure 2.
e. We will remove the full stops in the titles of Sections 4, 4.1, and 4.2.
Q2: Canny Condition Usage
We appreciate this insightful observation. While the canny condition is indeed implicitly included in the mask, our experiments show that explicitly including it improves results. We have added an ablation study to demonstrate this:
| Conditions | maxF1 ↑ | M ↓ | E_φ^M ↑ |
|---|---|---|---|
| Mask only | 0.776 | 0.075 | 0.869 |
| Mask + Canny | 0.784 | 0.073 | 0.875 |
These results show that including the Canny condition leads to better performance across all metrics. In Figure 3 of our global PDF, we demonstrate the effect of adding the Canny condition. We found that incorporating Canny edges provides better constraints on the boundaries of generated images, resulting in more precise and detailed outputs.
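For illustration, below is a minimal sketch (not our exact pipeline, which trains its own conditioning layers as described in Section 3.3) of generating an image from both a mask condition and its Canny edges, here using off-the-shelf pretrained ControlNets from the diffusers library; the model identifiers, file paths, prompt, and conditioning weights are assumptions made only for this example.

```python
# Sketch: dual-condition generation (mask + Canny edges) with two pretrained
# ControlNets. Illustrative only; the paper trains its own conditioning layers.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# One ControlNet per condition (segmentation-style map and Canny edges).
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# Binary mask (H, W) in {0, 255}; the Canny map is derived from the same mask,
# which keeps the two conditions geometrically consistent.
mask = cv2.imread("edited_mask.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(mask, 100, 200)
cond_mask = Image.fromarray(np.stack([mask] * 3, axis=-1))
cond_edge = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="a round table, studio lighting, high detail",
    image=[cond_mask, cond_edge],              # one conditioning image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.6],  # illustrative weights per condition
    num_inference_steps=30,
).images[0]
image.save("generated.png")
```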
Q3: Comparison with SAM-HQ
While our improvement over SAM-HQ is modest, it's important to note that SAM-HQ has already incorporated a large amount of additional data during its fine-tuning process. This likely means that SAM-HQ's performance is approaching a plateau, making significant improvements more challenging. In contrast, our method shows more substantial improvements over other baselines, demonstrating its effectiveness, especially in scenarios where performance hasn't yet reached saturation.
Q4: Suggestions
- ImageDream: We have explored ImageDream and downloaded the weights as suggested. However, we found that it may not be directly suitable for binary image editing without fine-tuning; the results are shown in Figure 1 of our global PDF. Despite this, the method is indeed very inspiring for our work, and we will include a reference to it in our related work section.
- Topology visualization: We agree this would enhance the paper. We have included topology structure visualizations for selected samples in Figure 2 of our global PDF. These visualizations clearly demonstrate how our method preserves and manipulates topological structures during the editing process. We will incorporate these more intuitive visualizations in the next version of our paper.
- Table 1 formatting: We will adjust the spacing in Table 1 for better readability.
- Related work discussion: We appreciate these suggestions and will expand our discussion of [2] and [3]. We will explore incorporating the data filtering strategy from [2] and the dataset augmentation techniques from [3] into our work, and report the results in the revised version.
We acknowledge that some aspects of the paper need refinement, and we are committed to addressing all these issues in our revision. We will ensure that our method description accurately reflects its implementation, including the use of text prompts, masks, and canny edges in the generation process. We will revise Figure 2, Section 3.3, Formula 8, and the appendix algorithm accordingly.
We appreciate the opportunity to clarify these points and look forward to submitting a significantly improved version of the paper.
References:
[1] Wang, P., & Shi, Y. (2023). Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201.
[2] Yang, L., et al. (2024). Freemask: Synthetic images with dense annotations make stronger segmentation models. NeurIPS 36.
[3] Zhu, L., et al. (2024). Generative Enhancement for 3D Medical Images. arXiv preprint arXiv:2403.12852.
Thank you for your response. The rebuttal has addressed my concerns, and I have accordingly raised my rating to 5.
While the weaknesses of the manuscript are evident, it appears that the draft was produced in haste and requires significant revisions.
Overall, I find the topic of generated data for downstream tasks to be interesting, and the proposed technical solution is reasonable.
Thank you very much for your valuable feedback, time, and attention. We sincerely appreciate your encouragement. We will continue to refine the descriptions in our document to make them even clearer and more accurate. If you have any further questions or need additional clarification on any point, please don't hesitate to let us know.
This paper introduces a new approach for generating diverse and precise datasets. The authors first introduce a mask editing method that combines rigid and non-rigid editing techniques to generate high-quality synthetic masks, leveraging geometric priors from diffusion models for rigid editing, and adversarial training with self-attention mechanisms for complex, topologically consistent modifications. The authors then generate image-mask pairs with a multi-conditional control generation method. Finally, experiments on the DIS5K benchmark demonstrate superior performance compared to existing methods.
Strengths
- The paper provides a comprehensive and innovative solution to the challenges of DIS dataset creation. The methodology is well-detailed, and the experimental results robustly support the claims of superior performance.
- The paper is well-written and easy to read.
Weaknesses
- My major concern is the novelty of the method. Although the paper works on a new task of generating simulated data, the method seems to be a combination of unrelated approaches, including Zero-123, GPT, and ControlNet.
- Whether the generated data can boost the dichotomous image segmentation task is not validated. For example, ControlNet-generated images can be unrealistic and have a large domain gap from real-world images, and the generated images do not seem very diverse. Can generated images really help this task?
- The approach is evaluated on only one dataset, which is limited.
Questions
- The proposed method is evaluated on only one task, which is limited to a single type of object. How about applying the approach to general semantic segmentation datasets, such as VOC, ADE20K, and Cityscapes?
Limitations
See Weakness
We greatly appreciate the reviewer's comprehensive examination of our work. Our responses to each point are as follows:
Q1: Novelty of the method and combination of existing approaches
While our approach does integrate existing methods, its novelty lies in:
- Task-specific adaptations for DIS tasks: Our method is specifically tailored for Dichotomous Image Segmentation, addressing unique challenges such as high-precision requirements and the need for diverse training samples.
- Synergistic integration of rigid and non-rigid editing: We combine these techniques in a novel way to provide comprehensive mask editing capabilities, enhancing both geometric precision and topological consistency.
- Topology-preserving adversarial training: This innovative approach ensures structural integrity during non-rigid editing, which is crucial for maintaining the quality of binary masks in DIS tasks.
These innovations collectively contribute to a method that is greater than the sum of its parts, specifically designed to meet the demands of DIS tasks.
Q2: Validation of generated data boosting DIS performance
We have rigorously validated the effectiveness of our generated data in improving DIS performance:
- Our method achieves a 5.6% improvement in maxF1 score over the baseline, demonstrating significant performance gains.
- We employ multi-conditional control generation to enhance consistency between generated images and masks, addressing the concern of unrealistic images.
- The combination of rigid and non-rigid editing increases data diversity, mitigating the issue of limited variability.
To further illustrate the impact of our generated data, we present additional experimental results:
| Training Data | maxF1 ↑ | M ↓ | E_φ^M ↑ |
|---|---|---|---|
| Real only | 0.742 | 0.081 | 0.848 |
| Real + Generated | 0.784 | 0.073 | 0.875 |
These results clearly demonstrate that our generated data significantly improves performance across multiple metrics.
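As a concrete illustration of how the "Real + Generated" setting can be assembled, here is a generic PyTorch sketch (not our actual training code) that concatenates real DIS5K pairs with generated image-mask pairs; the folder layout, matching file names between image and mask directories, and the training resolution are assumptions for this example.

```python
# Sketch: mixing real and synthetic image-mask pairs into one training set.
import os
from PIL import Image
from torch.utils.data import Dataset, ConcatDataset, DataLoader
from torchvision import transforms

class ImageMaskFolder(Dataset):
    """Assumes image and mask share the same file name in parallel folders."""
    def __init__(self, image_dir, mask_dir, size=512):
        self.items = sorted(os.listdir(image_dir))
        self.image_dir, self.mask_dir = image_dir, mask_dir
        self.to_tensor = transforms.Compose(
            [transforms.Resize((size, size)), transforms.ToTensor()])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        name = self.items[i]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        mask = Image.open(os.path.join(self.mask_dir, name)).convert("L")
        return self.to_tensor(image), self.to_tensor(mask)

# "Real + Generated": simply concatenate both sources and shuffle together.
train_set = ConcatDataset([
    ImageMaskFolder("DIS5K/train/images", "DIS5K/train/masks"),
    ImageMaskFolder("generated/images", "generated/masks"),
])
loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
```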
Q3: Limited evaluation on one dataset
We acknowledge this limitation in our initial submission. To address this, we have conducted additional experiments on various datasets to demonstrate the generalizability of our approach. These results will be included in the revised manuscript.
Q4: Applicability to general semantic segmentation datasets
We have explored the applicability of our method to other semantic segmentation tasks using InSPyReNet [1] as a baseline. Here are some results:
- ECSSD (salient object detection):

| Method | S_α ↑ | F_max ↑ | MAE ↓ |
|---|---|---|---|
| Baseline | 0.942 | 0.959 | 0.023 |
| Ours (Mix1000) | 0.950 | 0.962 | 0.022 |

- HKU-IS (salient object detection):

| Method | S_α ↑ | F_max ↑ | MAE ↓ |
|---|---|---|---|
| Baseline | 0.931 | 0.952 | 0.022 |
| Ours (Mix1000) | 0.943 | 0.954 | 0.021 |

- UOD (underwater object detection):

| Method | S_α ↑ | F_max ↑ | MAE ↓ |
|---|---|---|---|
| Baseline | 0.868 | 0.878 | 0.067 |
| Ours (Gen1000) | 0.883 | 0.901 | 0.056 |
These results demonstrate the effectiveness of our approach across various semantic segmentation tasks. We are currently extending our experiments to more challenging scenarios such as NI-seg and FI, and will include these results in the revised manuscript.
Reference
[1] Yu, Q., et al. "Multi-view Aggregation Network for Dichotomous Image Segmentation." CVPR, 2024.
Thanks for the authors' feedback. The feedback resolves my concerns, and I would like to raise the score to 5.
Thank you for your kind feedback and support. If you have any further questions, please don't hesitate to ask.
The paper introduces MaskFactory, a method for producing high-quality synthetic datasets for Dichotomous Image Segmentation (DIS) tasks. The method includes a two-stage process: mask editing (combining rigid and non-rigid transformations) and image generation using multi-conditional control methods. The suggested method improves model performance in DIS tasks by providing variation in synthetic data.
Strengths
- The manuscript is well-written and constructed.
- The authors conduct comprehensive ablation studies to highlight the contribution of each component of their method, further validating their approach.
- The proposed MaskFactory framework shows improvements in the quality and diversity of synthetic datasets, which is critical for advancing DIS applications.
Weaknesses
The authors acknowledge in the limitations section that the method can produce unnatural images in complex scenarios, but the paper could delve deeper into how these cases might impact practical applications.
Questions
Please see the weaknesses.
Limitations
Please see the weaknesses.
We are grateful for the reviewer's perceptive observation. We provide a more in-depth analysis below:
Q: The paper acknowledges issues with unnatural images in complex scenarios, but could delve deeper into how these might impact practical applications.
A: We address this concern by focusing on two key points:
- DIS data scarcity for VAE training: We opted for a more general VAE to ensure better generalization. Additional experiments show:

| VAE Training Data | maxF1 ↑ | M ↓ | E_φ^M ↑ |
|---|---|---|---|
| General dataset | 0.784 | 0.073 | 0.875 |
| Small DIS dataset | 0.762 | 0.079 | 0.853 |

A DIS-specific VAE produces more visually coherent images but leads to a performance decrease due to overfitting.

- Impact on performance: Despite occasional unnatural images, our method significantly improves downstream segmentation performance. We achieve a 5.6% improvement in maxF1 score compared to the baseline, suggesting that the increased data diversity outweighs the drawbacks of some unnatural samples.
This paper introduces MaskFactory, a novel approach aimed at addressing the challenges of generating high-quality synthetic datasets for Dichotomous Image Segmentation (DIS) tasks. The authors tackle the challenges by leveraging a blend of rigid and non-rigid editing techniques to generate accurate synthetic masks. Rigid editing utilizes geometric priors from diffusion models for precise viewpoint transformations, while non-rigid editing incorporates adversarial training and mutual self-attention mechanisms for complex modifications. With the accurate generated segmentation masks, a multi-conditional control generation method is then employed to create high-resolution images. The efficacy of MaskFactory is demonstrated through experiments on the widely-used DIS5K dataset, showing superior quality and efficiency in dataset generation compared to existing methods, thus significantly reducing preparation time and costs.
Strengths
- The scientific problem discussed here is worth researching, and there is still much room for exploration in this area. If the synthetic technology can be leveraged to substantially expand the simulated image-mask pairs based on existing data, it would benefit model training, especially with the current large multimodal models.
- This paper takes into consideration both rigid mask editing and non-rigid mask editing, combining them to jointly contribute to mask generation. Additionally, during the non-rigid mask editing process, a topology preserving loss is introduced to maintain the topological structure of the original mask.
- The experimental results are very encouraging. Compared to previous methods, the data distribution generated in this paper is closer to the real data. As the amount of synthetic data involved in training increases, the model's segmentation capability also increases proportionally.
Weaknesses
- The writing of this paper is too brief. It does not clearly show how the authors specifically carry out the mask editing part, or what unique designs they have implemented in the editing process.
- The paper lacks novelty in certain aspects. For the rigid mask editing part, using viewpoint changes to generate masks is a very common practice. The network architecture and synthesis method in the non-rigid mask editing part are completely identical to those used in MaskFactory. Although an additional topology preserving loss has been added, this loss is hard to regard as an independent innovation. Furthermore, the ablation study does not present results using only L_GAN and L_content, making it difficult to intuitively understand the significant impact of L_structure.
- Some of the table headers are too narrow.
Questions
- Can rigid mask editing and non-rigid mask editing be integrated into one framework, rather than just being simply combined?
- Can the improvement brought by the L_structure loss be intuitively reflected in the ablation study?
Limitations
The method faces challenges including the occasional production of unnatural images with stark foreground-background distinctions and inaccuracies in complex scenarios. Additionally, it relies on pre-annotated image-mask pairs, limiting autonomous data generation and requiring high-quality initial annotations. However, these issues are acceptable as there is extensive research aimed at resolving the inherent problems with ControlNet, which does not detract from the effectiveness of the method presented in this paper. Furthermore, the authors could explore using pseudo-labels for mask correction and generation to reduce dependency on pre-annotated image-mask pairs.
We sincerely appreciate the reviewers' thorough evaluation and constructive feedback. We are grateful for the opportunity to clarify and expand on our work.
Q1: Can you elaborate on the methodology, especially regarding mask editing and unique designs?
We will expand the methodology section in the revised manuscript to include:
- Rigid Editing: We will provide a comprehensive explanation of our rigid editing process using Zero123. As stated in Section 3.2.1: "We leverage the Zero123 method, which employs a viewpoint-conditioned diffusion model ψθ to manipulate masks' perspectives." We will elaborate on how this allows for precise geometric adjustments, focusing on viewpoint and scale transformations (a simple geometric illustration of a viewpoint-style mask transformation is sketched after this list).
- Non-rigid Editing: We will offer a more detailed description of our non-rigid editing process, including:
  - Topology-preserving adversarial training
  - Mutual attention mechanism
  - Foreground-background focus guidance
  As mentioned in Section 3.2.2: "We introduce a topology-preserving adversarial training mechanism to mitigate artifacts and structural degradation in binary mask editing." We will expand on how this mechanism works and its importance in maintaining structural integrity.
- Multi-conditional Control Generation: We will provide a more in-depth explanation of our image generation stage, as described in Section 3.3: "We introduce a multi-condition control generation method to achieve precise RGB image generation."

We will also include architectural diagrams and pseudocode to enhance clarity and understanding of our methodology.
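As a loose illustration only, the snippet below applies a purely geometric perspective warp to a binary mask with OpenCV. This is not the rigid editing used in the paper, which relies on the viewpoint-conditioned diffusion model (Zero123) to synthesize genuinely novel views rather than merely warping pixels, but it conveys the kind of viewpoint/scale change being discussed; the file names and warp corners are placeholders.

```python
# Sketch: a geometric stand-in for viewpoint-style rigid mask editing.
import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
h, w = mask.shape

# Map the image corners to slightly shifted positions to mimic a small
# change of viewpoint and scale.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[0.05 * w, 0.03 * h], [0.97 * w, 0.0],
                  [w, h], [0.02 * w, 0.95 * h]])
H = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(mask, H, (w, h), flags=cv2.INTER_NEAREST)

# Re-binarize so the result remains a clean binary mask.
_, warped = cv2.threshold(warped, 127, 255, cv2.THRESH_BINARY)
cv2.imwrite("mask_warped.png", warped)
```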
Q2: Can you clarify your contributions beyond existing methods?
Our work presents several key innovations in the field of mask editing and dataset generation for DIS tasks:
- Specific Improvements for DIS Tasks: Our method is designed for Dichotomous Image Segmentation (DIS) tasks, which require extremely high fidelity and precision in mask editing. Our approach generates diverse samples while maintaining high accuracy.
- Methodological Innovations:
  - Topology-preserving adversarial training for non-rigid editing, ensuring structural integrity of edited masks
  - Combination of rigid and non-rigid editing techniques, providing comprehensive mask editing capabilities
  - Multi-conditional control generation method for precise image-mask pair creation
- Performance Improvements: As demonstrated in Table 2 of our paper, our method consistently enhances the performance of various state-of-the-art segmentation networks across multiple datasets, proving its effectiveness and generality.
- Addressing Unique Challenges in DIS Tasks: Our method addresses the need for high-precision annotations and diverse training samples in DIS tasks, which has rarely been specifically targeted in previous methods.
Q3: Can you provide more comprehensive ablation study results?
We have conducted additional experiments to provide a complete picture of the impact of each loss component:
| L_GAN | L_content | L_structure | maxF1 ↑ | M ↓ |
|---|---|---|---|---|
| ✔ | | | 0.778 | 0.073 |
| | ✔ | | 0.745 | 0.075 |
| | | ✔ | 0.751 | 0.074 |
| ✔ | ✔ | | 0.780 | 0.074 |
| ✔ | | ✔ | 0.782 | 0.073 |
| ✔ | ✔ | ✔ | 0.784 | 0.073 |
These results demonstrate that while L_GAN alone provides a strong baseline, the combination of all three loss components yields the best performance. This underscores the importance of our multi-faceted approach to mask editing and image generation.
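To make the roles of the three terms concrete, here is a hedged PyTorch sketch of one way such a combined objective could be assembled. The structure term below is only a stand-in (a soft-skeleton comparison in the spirit of clDice), not the exact L_structure used in the paper; the content term and the loss weights are likewise illustrative assumptions.

```python
# Sketch: combining adversarial, content, and structure-preserving terms for
# mask editing. Illustrative only; not the paper's exact loss definitions.
import torch
import torch.nn.functional as F

def soft_skeleton(mask: torch.Tensor, iterations: int = 5) -> torch.Tensor:
    """Differentiable thinning proxy via repeated soft erosion (min-pooling).
    mask: (B, 1, H, W) with values in [0, 1]."""
    skel = torch.zeros_like(mask)
    current = mask
    for _ in range(iterations):
        eroded = -F.max_pool2d(-current, kernel_size=3, stride=1, padding=1)
        opened = F.max_pool2d(eroded, kernel_size=3, stride=1, padding=1)
        skel = skel + F.relu(current - opened) * (1 - skel)
        current = eroded
    return skel.clamp(0, 1)

def total_loss(edited, source, disc_logits_fake,
               w_gan=1.0, w_content=10.0, w_structure=5.0):
    # Non-saturating GAN loss for the generator (discriminator not shown).
    l_gan = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    # Content term: here a plain L1 toward the source mask; in practice this
    # would typically be restricted to regions meant to stay unchanged.
    l_content = F.l1_loss(edited, source)
    # Structure term: compare skeleton-like maps as a crude topology proxy.
    l_structure = F.l1_loss(soft_skeleton(edited), soft_skeleton(source))
    return w_gan * l_gan + w_content * l_content + w_structure * l_structure
```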
Q4: Can rigid and non-rigid mask editing be integrated into one framework?
Currently, we have not integrated these two frameworks, primarily because the rigid editing part cannot use topology preservation constraints. After viewpoint transformation, the detected key points undergo significant shifts, making topology preservation challenging.
In future work, we plan to explore the following directions to attempt integration:
- Develop adaptive loss functions that can handle both rigid and non-rigid transformations
- Design multi-stage editing processes that handle rigid and non-rigid edits separately while maintaining overall consistency
- Research novel architectures that can automatically select appropriate editing techniques based on the input and desired transformation
Q5: How do you plan to address the limitations mentioned in the paper?
We are actively addressing these issues, particularly focusing on the problem of unnatural images:
- Unnatural Images: We recognize that this issue primarily stems from VAE training. We attempted to train a specialized VAE using the DIS dataset, but due to limited data, the results were not as effective as using a larger general dataset. Here are our preliminary results:

| VAE Training Data | maxF1 ↑ | M ↓ | E_φ^M ↑ |
|---|---|---|---|
| General dataset | 0.784 | 0.073 | 0.875 |
| Small DIS dataset | 0.762 | 0.079 | 0.853 |

In the future, we plan to explore:
  - Data augmentation techniques to effectively increase the size of the DIS dataset
  - Transfer learning methods, using VAEs pre-trained on large-scale datasets and then fine-tuned on DIS data

- Inaccuracies in Complex Scenarios: We are developing hierarchical generation approaches and improved attention mechanisms to better capture long-range dependencies in complex scenes.
- Reliance on Pre-annotated Pairs: We are researching self-supervised learning methods and weak supervision strategies to reduce the need for extensive pre-annotated data.
We will include preliminary results and discussions of these ongoing efforts in the revised manuscript.
Thank you for your thorough review and insightful questions. In our rebuttal, we have clarified several points of potential confusion and provided extensive additional experimental results to address your inquiries.
As the discussion phase is drawing to a close, we are eager to receive your response. After reviewing our rebuttal, do you have any remaining concerns or questions? Please don't hesitate to raise any issues or seek further clarification.
We appreciate the reviewers' thoughtful comments and suggestions. To address some of the concerns raised and provide additional support for our claims, we have prepared a global PDF with supplementary visual evidence. This document contains three key figures that demonstrate the effectiveness and novelty of our approach.
Figure 1: Comparison with ImageDream
Figure 1 presents a comparison between our method and ImageDream, showcasing the effectiveness of our approach in both rigid and non-rigid editing of binary masks. The figure consists of four columns:
- Original image
- Result edited by ImageDream
- Result from our rigid editing method
- Result from our non-rigid editing method
This comparison clearly demonstrates that while ImageDream is a powerful tool for general image editing, it struggles with the precise requirements of binary mask editing for DIS tasks. Our method, both in its rigid and non-rigid editing capabilities, produces results that are more suitable for DIS applications, maintaining the binary nature of the masks while allowing for meaningful edits.
Figure 2: Topology Modeling Visualization
Figure 2 provides a visualization of our topology modeling structure, which is crucial for our non-rigid editing process. This figure illustrates how topological constraints are applied during non-rigid editing, ensuring that the structural integrity of the mask is preserved even as complex deformations are applied.
The visualization demonstrates the before and after states of the mask, along with their corresponding topological structures. This clearly shows how our method maintains critical topological features while allowing for significant shape changes, a key innovation in our approach to DIS tasks.
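For readers who want to inspect such structures themselves, below is a small, generic sketch (not the exact procedure behind Figure 2) that summarizes a binary mask's topology through its connected components, holes, and skeleton; the file names are placeholders and the hole count assumes the background outside the object is connected.

```python
# Sketch: summarizing the topology of a binary mask before and after editing.
import numpy as np
from PIL import Image
from scipy import ndimage
from skimage.morphology import skeletonize

def topology_summary(mask: np.ndarray) -> dict:
    """mask: 2D boolean array, True = foreground."""
    fg = mask.astype(bool)
    _, n_fg = ndimage.label(fg)        # b0: number of foreground components
    _, n_bg = ndimage.label(~fg)       # background components (outer region + holes)
    return {
        "components": n_fg,
        "holes": max(n_bg - 1, 0),     # b1: background components minus the outer one
        "skeleton": skeletonize(fg),   # 1-pixel-wide structure, useful as an overlay
    }

before = np.array(Image.open("mask_before.png").convert("L")) > 127
after = np.array(Image.open("mask_after.png").convert("L")) > 127
print("before:", {k: v for k, v in topology_summary(before).items() if k != "skeleton"})
print("after: ", {k: v for k, v in topology_summary(after).items() if k != "skeleton"})
```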
Figure 3: Impact of Canny Edge Constraints in ControlNet Generation
Figure 3 showcases the effect of incorporating Canny edge constraints in our ControlNet-based image generation process. The figure presents a side-by-side comparison of generation results:
- Without Canny edge constraints
- With Canny edge constraints
This comparison vividly illustrates that the addition of Canny edge constraints results in generated images with more defined and accurate boundaries. This improvement is particularly crucial for DIS tasks, where precise edge delineation is essential for accurate segmentation.
We sincerely thank all reviewers and ACs for their valuable time and constructive comments. While all reviewers acknowledged the value of our work, we noticed that Reviewer nBLG (R1) raised some concerns about certain aspects of the paper. We hope to take this opportunity to clarify some possible misunderstandings.
Firstly, regarding the writing, R1 found the paper "too brief", which honestly puzzles us. Reviewer h8fH (R3) noted our paper as "well-written and easy to read", and Reviewer AsEU (R2) stated it was "well-written and constructed." This discrepancy in opinions is truly perplexing.
Secondly, concerning the novelty of our method, R1 expressed some concerns, which we find somewhat disheartening. In fact, R2 believed our "MaskFactory framework ... critical for advancing DIS applications," and Reviewer 49D1 (R4) even called it an "interesting and insightful paper." We sincerely hope R1 might reconsider the innovative value of our work.
Lastly, on the experimental evaluation, R1 pointed out that our approach was "evaluated on ... limited." We understand this concern, but we want to emphasize that, as R2 observed, we "conduct comprehensive ablation studies ... further validating our approach." We have put tremendous effort into ensuring the comprehensiveness and persuasiveness of our experiments.
Given the positive evaluations from other reviewers, we earnestly hope that the AC will weigh these different perspectives and encourage R1 to revisit our work.
We deeply appreciate your time and consideration, and look forward to your response.
All reviewers agree that the paper is well written and provides extensive experiments to validate the approach. Reviewers nBLG and h8fH were concerned about the novelty of the paper, but after the rebuttal they recognized that the approach has some novel concepts, such as structure-aware training.
nBLG borderline reject
- originality
- generalization to rigid and non-rigid
- encouraging experimental results
- brief writing
- limited novelty (elaborated, but not rejecting; other papers did similar parts)
- unclear contribution due to lack of ablation study on L_structure (added this)
AsEU accept
- well written
- comprehensive ablation studies
- quality of results
h8fH borderline accept
- innovative solution
- well-written
- novelty; combination of other approaches
- main claim: DIS is not validated; can generated images really help DIS? (yes, have shown)
- evaluation on one dataset (yes, added three more)
49D1 borderline accept
- methodology is carefully designed
- experiments show the effectiveness
- effect of canny condition unclear (show it works)
- does not improve over SAM-HQ (SAM was already fine-tuned)