Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image
Abstract
Reviews and Discussion
Given that previous works for standalone SI2T (Spatial Image-to-Text) or ST2I (Spatial Text-to-Image) perform imperfectly in spatial understanding due to the difficulty of 3D spatial feature modeling, this paper proposes to model SI2T and ST2I together under a dual learning framework. Within this dual framework, a novel 3D scene graph (3DSG) representation is introduced to capture 3D spatial scene features. Moreover, a Spatial Dual Discrete Diffusion framework is proposed to utilize the intermediate features of the easy 3D→X processes to guide the hard X→3D processes. Extensive experiments show the proposed method outperforms mainstream T2I and I2T methods significantly.
Strengths
- The proposed dual learning framework with 3D scene graph (3DSG) representation to enhance the 3D spatial feature modeling is novel.
- The proposed Spatial Dual Discrete Diffusion framework is simple but effective; it utilizes the intermediate features of the 3D→X processes to guide the hard X→3D processes.
- Extensive experiments on the VSD dataset have validated the effectiveness of this method. Compared with previous works (e.g., DALLE, CogView), the proposed method shows more competitive performance.
Weaknesses
- Although the proposed method in this paper has achieved good results, the methods compared in Table 2 are somewhat outdated and not the latest SOTA methods (e.g., DALLE-3, CogView-2). Could the authors compare with some of the latest text-to-image methods, such as the Stable Diffusion series, SD-1.5 [1], SDXL [2], etc.?
- The authors should present some failure cases to analyze the shortcomings of the proposed method. For example, can the 3D scene graph always generate perfect outputs? If there are issues in the generated outputs, what impact will it have on the final results? What are the subsequent strategies to address these issues?
[1] https://huggingface.co/runwayml/stable-diffusion-v1-5
[2] Dustin Podell, Zion English, et al. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv, 2023.
Questions
Please refer to the "weaknesses" section.
Limitations
Please refer to the "weaknesses" section.
We are grateful that you acknowledge the strengths of our work. We have conducted additional experiments and made every effort to address your concerns. We hope you will reconsider your evaluation if you find our responses effective. Our responses follow.
Q1: Although the proposed method in this paper has achieved good results, the methods compared in Table 2 are somewhat outdated and not the latest SOTA methods (e.g., DALLE-3, CogView-2). Could the authors compare with some of the latest text-to-image methods, such as the Stable Diffusion series, SD-1.5, SDXL, etc.?
A1: Thanks, we agree, and we provide the following results comparing the continuous SD-1.5 and SDXL on the VSD dataset:
| Method | FID | IS | CLIP |
|---|---|---|---|
| SD-1.5 | 20.12 | 26.35 | 64.79 |
| SDXL | 12.60 | 27.95 | 66.41 |
| CogView 3 | 13.28 | 28.63 | 67.29 |
| Ours | 11.04 | 29.20 | 68.31 |
| Ours-XL | TBC | TBC | TBC |
Please note that both CogView 3 and SDXL have undergone a significant upgrade in model scale (the UNet backbone), while we did not find released discrete models that match this scale of parameters. Also, we are sorry, but it is hard for us to train an XL-scale model from scratch within the limited time. Thus, the margins in this table do not fully reflect the superiority of our method. The results in our manuscript are sufficient to support our main claim, i.e., that the dual learning framework and the 3D scene graph modeling indeed improve SI2T and ST2I. Of course, for the sake of rigor, we will scale up the model and add the new results if we can finish before the camera-ready deadline.
Q2: The authors should present some failure cases to analyze the shortcomings of the proposed method. For example, can the 3D scene graph always generate perfect outputs? If there are issues in the generated outputs, what impact will it have on the final results? What are the subsequent strategies to address these issues?
A2: Thanks. We carefully checked the outputs and found two specific cases that need to be discussed. 1) First, for some simple scenes, the 3DSG may degenerate into the VSG/TSG and lose some surrounding objects and attribute nodes. In this case, the model still outputs the correct final results most of the time, because the main spatial information in the 3DSG is correct. 2) Second, the graph diffuser fails to generate the right 3DSG, leading to unacceptable final results. We believe the main reason for both 1) and 2) is that the graph generation module is not strong enough. One possible solution for this issue is to scale up the graph generator and train it with a better dataset. We will add these failure cases and discussions to the experiment section.
Dear reviewer,
Thank you for the comments on our paper.
We have submitted the response to your comments. Please let us know if you have additional questions so that we can address them during the discussion period. We hope that you can consider raising the score after we address all the issues.
Thank you
This paper presents a novel model for the SI2T and ST2I tasks. The proposed model combines the two dual tasks and lets them learn from each other via intermediate feature sharing. Through this framework, both SI2T and ST2I are enhanced. The authors also provide analysis of how this method works.
Strengths
- The paper proposes a novel dual learning idea and the implemented model is reasonable.
- Generally comprehensive analysis of the designed model.
Weaknesses
There is no major issue in my view. Some questions and suggestions are listed below:
About the motivation: The authors claim that the 3D feature is important for spatial understanding, and I agree with that. I wonder whether 3D modeling is necessary for these two tasks and what special information the 3D modeling could provide. In this framework, it seems the model acquires the capability of 3D understanding via the pretraining of the 3DSG generator. Is the scene graph a reasonable way to model 3D features, and how does it match your task? The authors should discuss the above points.
The DGAE is pretrained on the gold 3DSG dataset. Intuitively, the quality of this 3DSG dataset significantly affects the DGAE results and, further, the final results. Figure 7 analyzes this problem via an ablation study on manually noised data. But I still have a concern that the performance of the DGAE may be the bottleneck of the whole model.
About the discrete modeling: To my knowledge, for the respective I2T or T2I task, continuous diffusion models can achieve better performance, while Table 6 presents the better final performance of the proposed dual model. Ignoring the efficiency problems, what superiority does discrete modeling have?
This work provides a novel way to solve dual tasks. From your perspective, what kind of tasks can be solved with this dual learning method?
Typos: Line 767 “alignede” should be “aligned”
Line 810 “eh” should be “the”
Questions
Please refer to the weaknesses.
Limitations
No Limitations
We sincerely thank you for your time and for the rich and constructive feedback on our paper, which we believe will surely improve our work. Below we extract your concerns into points and address them one by one.
Q1: I wonder whether 3D modeling is necessary for these two tasks and what special information the 3D modeling could provide. In this framework, it seems the model acquires the capability of 3D understanding via the pretraining of the 3DSG generator. Is the scene graph a reasonable way to model 3D features, and how does it match your task? The authors should discuss the above points.
A1: 3D modeling is very critical for spatial understanding. Layout overlap and perspective illusion are the most common problems. With 3D modeling and pretraining, the model gains the prior knowledge to “imagine” the right spatial relationships between a pair of common objects, just like human beings do (for example, books should be “on” the shelf, and food should be “in” the bowl). We use the 3DSG to model the 3D features for two reasons. First, the graph structure can capture fine-grained object-level relationships, containing more informative features. Second, the 3DSG specifically marks high-level concepts (room, etc.), which are usually presented as easily confused, overlapping objects in the 2D image.
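To make this concrete, below is a minimal sketch (hypothetical names and tags, not our actual implementation) of how a 3DSG with object nodes, high-level concept nodes, and spatial/hierarchy edges can be represented:

```python
# Hypothetical sketch of a 3DSG: object nodes, attribute nodes, and a
# high-level concept node ("room"), linked by spatial and hierarchy edges.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    tag: str              # e.g., "book", "shelf", "room"
    kind: str = "object"  # "object" | "attribute" | "concept"

@dataclass
class Edge:
    src: int              # index of the source node
    dst: int              # index of the target node
    relation: str         # spatial relation ("on", "in") or hierarchy ("part_of")

@dataclass
class SceneGraph3D:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

g = SceneGraph3D(
    nodes=[Node("room", kind="concept"), Node("shelf"), Node("book")],
    edges=[
        Edge(1, 0, "part_of"),  # the shelf belongs to the room (hierarchy edge)
        Edge(2, 1, "on"),       # the book is on the shelf (3D spatial relation)
    ],
)
```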
Q2: The DGAE is pretrained by the gold 3DSG dataset. Intuitively, the quality of this 3DSG dataset significantly affects the DGAE results and further the final results. Figure 7 analyzes this problem by an ablation study on the manual noise data. But I still have a concern that the performance of DGAE may be the bottleneck of the whole model.
A2: We agree. The bottleneck does seem to be the DGAE, and it is also the key module. As in our response to reviewer Nx71, the graph module may generate failed 3DSGs, and then the final results become incorrect. But there is a viable solution: the straightforward way is to upgrade this module by scaling it up and training it with better data. This effort is worthwhile, and the graph module could be reused in many other tasks.
Q3: About the discrete modeling: To my knowledge, for the respective I2T or T2I task, continuous diffusion models can achieve better performance, while Table 6 presents the better final performance of the proposed dual model. Ignoring the efficiency problems, what superiority does discrete modeling have?
A3: We think the reason comes from two aspects. First, the VSD dataset mainly contains spatial descriptions, while continuous diffusion models are skilled at handling general descriptions. Second, our model gives stronger guidance to the diffusion process, thus leading to outputs more similar to the reference results. The outputs of continuous diffusion models are more random.
Q4: This work provides a novel way to solve dual tasks. From your perspective, what kind of tasks can be solved with this dual learning method?
A4: Dual learning was first proposed for the machine translation task and has since been applied in many areas. The effectiveness of dual learning primarily stems from the asymmetric information transfer between the two dual tasks, where learning in a single direction tends to miss some key features. In such cases, bidirectional dual learning enables both tasks to acquire certain necessary prior information, thereby enhancing performance. Overall, the dual system essentially forms a self-supervised system. From our perspective, all dual tasks involving conversion between two modalities with asymmetric information could find a similar solution. Some examples: ASR and TTS, I2T and T2I, dual QA, etc.
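To illustrate the idea, here is a schematic sketch of one training step for a generic dual pair X→Y / Y→X with intermediate feature sharing. The model interfaces and loss choices below are hypothetical illustrations, not our exact SD3 implementation:

```python
# Hypothetical sketch of one dual-learning step for a generic X->Y / Y->X pair;
# intermediate features from one direction guide the other, analogous to how
# the easy 3D->X processes guide the hard X->3D processes in our framework.
import torch
import torch.nn.functional as F

def dual_step(model_xy, model_yx, x, y, optimizer, alpha=0.5):
    # Forward direction X -> Y; also return its intermediate features.
    y_logits, feat_xy = model_xy(x, return_features=True)
    loss_xy = F.cross_entropy(y_logits, y)

    # Reverse direction Y -> X, guided by the shared (detached) features.
    x_logits = model_yx(y, guidance=feat_xy.detach())
    loss_yx = F.cross_entropy(x_logits, x)

    loss = loss_xy + alpha * loss_yx
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```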
Thanks for responding. I have carefully read the rebuttal and the comments of other reviewers, and my concerns have been well resolved. However, in consideration of enhancing the solidity of the work, as other reviewers have mentioned, I suggest that the authors consider scaling up the current diffusion model in future attempts, to further verify the effectiveness of this method in larger-scale models. Moreover, if possible, it is also recommended that the authors apply this method to more general tasks. Of course, considering the length and time constraints, these tasks can be fully addressed in future work. I thank the authors again for their responses.
This paper presents a novel dual learning framework for spatial image-to-text and spatial text-to-image generation. The main model is a combination of three discrete diffusion models, where an intermediate 3D representation is first generated and then the image and text outputs are generated based on the 3D features. The proposed method creatively divides SI2T and ST2I into two pairs of dual stages, i.e., Text→3D with 3D→Text and Image→3D with 3D→Image, and then takes a dual training strategy to enhance the hard X→3D processes with the easy 3D→X processes. The experimental results show that the proposed method outperforms current I2T and T2I models significantly.
Strengths
- The novel and effective dual learning solution for the paired ST2I and SI2T tasks.
- The proposed methods provide an interesting perspective for the similar dual tasks, which is of value to the community.
- The paper is well-organized and easy to read. Overall, the idea is novel and the presentation is good. I tend to believe the quality of the paper meets the standard of the conference if the authors can address the concerns I raised in the weaknesses.
Weaknesses
Some confusion about the technical details:
- How many node categories does the 3DSG have, and how are the high-level spatial concepts defined? Does this follow the definition of previous works? Related discussion and references are lacking.
- Do the VSG/TSG and 3DSG have compatible node sets? Do you map the nodes among the three types of SG with a rule?
- About the codebooks of the discrete modeling: how are the codebooks (graph, image, and text) initialized and updated during the whole training process?
The training of diffusion models is time-consuming. I am a little worried about the efficiency problem. Can the authors provide an efficiency analysis for each training stage?
The authors take GPT-2 as the text decoder. Could it be replaced by other PLMs, and how would this influence the performance?
Minor Issues:
- The font size in Figure 2 is too small.
- Line 248, the subscripts “T23D” and “I23D” are ambiguous.
- Bars overlap in Figure 5.
Questions
Please see my weakness comments.
Limitations
Limitations have been discussed and I do not foresee any other negative impact from this work.
Thank you for going through our paper so deeply and carefully. We appreciate that you acknowledge the novelty of our proposed task and the comprehensive experiments. We address all your concerns as follows.
Q1: How many node categories does the 3DSG have, and how are the high-level spatial concepts defined? Does this follow the definition of previous works? Related discussion and references are lacking.
A1: In our dataset there can be more than 200 object tags, which could serve as the node categories of the 3DSG. However, limited by the VSG/TSG generator, the input node category number is about 150. Actually, in practical applications, the node tags of the 3DSG could be open-ended, because the graph diffusion model handles the graph only in latent dense representations. However, to facilitate evaluation, we use the predefined 150 categories.
About the high-level spatial concepts: they are usually the locations or places (such as room, living room) that contain normal objects (such as table, chair). Their characteristic is that they have a hierarchical relationship with general nodes, which is marked via special edges. In our datasets, there are only two levels of spatial concepts, which we divided manually.
We will add these discussions to the third section.
Q2: Do the VSG/TSG and 3DSG have compatible node sets? Do you map the nodes among the three types of SG with a rule?
A2: As in the response to Q1, the node tags are limited to 150. The tags between the VSG and TSG are compatible. For the 3DSG, the graph diffusion may generate unseen tags, but this is not a problem: the subsequent image generator and text generator only take dense graph representations. The only issue occurs in evaluation, when we want to decode the graph. In this case, we handle it through manually established mapping rules.
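For illustration only (the entries below are hypothetical, not the actual rule table), such a mapping rule can be as simple as a lookup that aligns unseen 3DSG tags to the predefined categories at evaluation time:

```python
# Hypothetical examples of the manual mapping rules: unseen 3DSG tags are
# aligned to the 150 predefined VSG/TSG categories, used only when the graph
# needs to be decoded for evaluation.
TAG_MAP = {
    "armchair": "chair",
    "bookcase": "shelf",
    "living room": "room",
}

def normalize_tag(tag: str) -> str:
    # Fall back to the original tag if it is already an in-vocabulary category.
    return TAG_MAP.get(tag, tag)
```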
Q3: About the codebooks of the discrete modeling: how are the codebooks (graph, image, and text) initialized and updated during the whole training process?
A3: The graph codebook is randomly initialized and trained during the training of the DGAE (Step-1 in Fig. 3). The image and text codebooks are initialized from off-the-shelf models and remain frozen throughout. Of course, we could also enhance them with self-supervised image/text generation between the DGAE pretraining and the spatial alignment (Step-1 and Step-2 in Fig. 3).
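A schematic PyTorch-style sketch of this setup (codebook sizes, dimensions, and file names are placeholders, not our actual configuration):

```python
# Schematic sketch: the graph codebook is learned from scratch with the DGAE,
# while the image/text codebooks are loaded from off-the-shelf models and frozen.
import torch
import torch.nn as nn

graph_codebook = nn.Embedding(1024, 256)   # random init, updated in Step-1
image_codebook = nn.Embedding(8192, 256)   # e.g., from a pretrained VQ image tokenizer
text_codebook = nn.Embedding(50257, 256)   # e.g., from a pretrained text tokenizer/PLM

# image_codebook.load_state_dict(torch.load("image_codebook.pt"))  # placeholder path
# text_codebook.load_state_dict(torch.load("text_codebook.pt"))    # placeholder path
for cb in (image_codebook, text_codebook):
    cb.weight.requires_grad_(False)        # frozen throughout training
```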
Q4: The training of diffusion models is time-consuming. I am a little worried about the efficiency problem. Can the authors provide an efficiency analysis for each training stage?
A4: The fine-tuning focuses on only part of the parameters. We give the computational complexity analysis as well as our training settings:
| Stage | Training Params | Training Time |
|---|---|---|
| Step-1 DGAE | 110M | 1 hour |
| Step-2 Spatial Alignment | 127M | 1.5 hours |
| Step-3 2DSG→3DSG Diffusion Training | 350M | 12 hours |
| Step-4 Overall Training | 1.1B | 20 hours |
Q5: The authors take GPT-2 as the text decoder. Could it be replaced by other PLMs, and how would this influence the performance?
A5: Yes, it can be replaced by other PLMs. We take the T5 decoder as an example. With the same training strategy, the results are not much different:
| Model | BLEU-4 | SPICE |
|---|---|---|
| GPT2 | 27.63 | 48.03 |
| T5-Decoder | 27.58 | 48.10 |
We did not consider larger models, i.e., the LLMs, which may improve the final ST2I performance but not the core textual diffusion performance.
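For reference, swapping the text decoder is essentially a drop-in replacement; a hypothetical sketch with the HuggingFace Transformers library (the checkpoints named are public ones, not necessarily our exact configuration):

```python
# Hypothetical sketch: the text decoder is a pluggable component, so GPT-2 can
# be swapped for a T5 decoder behind the same interface.
from transformers import GPT2LMHeadModel, T5ForConditionalGeneration

def build_text_decoder(name: str):
    if name == "gpt2":
        return GPT2LMHeadModel.from_pretrained("gpt2")
    if name == "t5":
        return T5ForConditionalGeneration.from_pretrained("t5-base")
    raise ValueError(f"unknown decoder: {name}")
```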
Thanks for the authors' careful response. Your answers have basically addressed my concerns.
I have reconsidered this model. Overall, its key lies in the performance of the graph generation module. The purpose of its design is to use various methods to enable it to learn sufficient 3D information, and of course the quality of the 3DSG data involved is also very important. As the authors mentioned, the model supports open object tags, but the currently used data does not achieve this. I think that in future explorations, better methods, including higher-quality data, could be used to further enhance the performance of the graph model.
The paper introduces a dual learning framework and a 3D scene graph representation for enhancing spatial image-to-text and text-to-image tasks in visual-spatial understanding. The proposed Spatial Dual Discrete Diffusion (SD3) system outperforms existing methods on the VSD dataset, demonstrating the effectiveness of the dual learning strategy in improving spatial feature modeling and task performance.
Strengths
- The authors are commended for providing the code, which facilitates further investigation and reproducibility of the research findings.
- This manuscript is the first to achieve synergy between two spatial-aware cross-modal dual generations.
- The manuscript provides a comprehensive summary of the differences between the 2DSG and the proposed 3DSG.
Weaknesses
- Could the authors consider prioritizing the most relevant or impactful references and possibly discuss the contribution of each cited work in more detail?
- The introduction would benefit from a concise summary of the manuscript's contributions at its conclusion.
- Could the authors briefly discuss how the treatment of spatial features in existing methods compares to the approach taken in this paper, and highlight the distinct contributions of this work?
- The complexity of the SD3 framework may impact training and inference efficiency. The authors should provide a computational complexity analysis to assess its practicality.
- Please provide complete results for Table 2 and consider additional datasets to validate the findings.
- The current reliance on human evaluation for spatial assessment could be complemented with automated evaluation techniques to provide a more comprehensive assessment approach.
Questions
See the weaknesses.
Limitations
The authors have adequately discussed the limitations and potential negative societal impact.
We sincerely thank you for your time and your careful review of our paper. Your suggestions will definitely help improve it. We address your questions point by point as follows.
Q1: Could the authors consider prioritizing the most relevant or impactful references and possibly discuss the contribution of each cited work in more detail?
A1: Thanks for the advice. We will consider reorganizing the content of the related work. For the most relevant works, e.g., discrete SD models and PLMs, we will discuss their contributions in detail. However, due to space limitations, we may cover the others with a brief summary.
Q2: The introduction would benefit from a concise summary of the manuscript's contributions at its conclusion.
A2: We will consider giving a concise summary of contributions in the last paragraph of the introduction in the revised version.
Q3: Could the authors briefly discuss how the treatment of spatial features in existing methods compares to the approach taken in this paper, and highlight the distinct contributions of this work?
A3: Thanks. The critical distinct contribution of our work is that we use the 3D scene graph to model spatial features, which is a novel approach in both I2T and T2I generation. Existing methods take a general visual embedding or the 2D scene graph to model appearance and spatial features, while our 3D scene graph can capture 3D spatial relationships via the proposed graph generation pretraining. We will highlight the definition of the 3D scene graph and the contributions in the method section.
Q4: The complexity of the SD3 framework may impact training and inference efficiency. The authors should provide a computational complexity analysis to assess its practicality.
A4: Thanks. As other reviewers have mentioned the same issue, we will give the computational complexity analysis as well as our training settings in the revised version.
| Stage | Training Params | Training Time |
|---|---|---|
| Step-1 DGAE | 110M | 1 hour |
| Step-2 Spatial Alignment | 127M | 1.5 hours |
| Step-3 2DSG→3DSG Diffusion Training | 350M | 12 hours |
| Step-4 Overall Training | 1.1B | 20 hours |
| All the training | | |
Q5: Please provide complete results for Table 2 and consider additional datasets to validate the findings.
A5: Table 2 has listed the full results on the VSD dataset. We will provide additional results on other datasets. For larger datasets, the training takes time, and we are sorry we cannot provide the results immediately. We will provide them in the final version or the arXiv version.
Q6: The current reliance on human evaluation for spatial assessment could be complemented with automated evaluation techniques to provide a more comprehensive assessment approach.
A6: For SI2T, the SPICE metric can reflect the spatial quality to some extent. For ST2I, it is hard to evaluate spatial correctness directly from the final image with automatic metrics; thus we analyze and evaluate the intermediate spatial scene graph in Section 5.4 (Tables 4 and 5 and Figure 5).
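As one possible automated complement (a sketch of the idea, not a metric reported in the paper), the generated intermediate scene graph can be scored against the reference graph by matching spatial triples:

```python
# Hypothetical sketch of an automatic spatial check: score the generated
# intermediate scene graph against the reference by matching
# (subject, relation, object) triples.
def triple_f1(pred_triples, gold_triples):
    pred, gold = set(pred_triples), set(gold_triples)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one of two reference relations is recovered.
print(triple_f1({("book", "on", "shelf")},
                {("book", "on", "shelf"), ("lamp", "above", "desk")}))
```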
General Response to All Reviewers
Dear Reviewers,
Thanks for all of your insightful and constructive comments on our manuscript. Your feedback will greatly assist us in enhancing the quality of our paper, and we are committed to incorporating your suggestions in our revision. Meanwhile, we feel greatly encouraged that reviewers find our work a “novel and effective dual learning” solution (Rv#Urkw, Rv#WpM7) and “simple but effective” (Rv#Nx71), appreciate our 3DSG modeling (Rv#Nx71) and “comprehensive analysis” (Rv#cK3d), and note other aspects such as “well-organized and easy to read” (Rv#WpM7).
In response to reviewers' comments, we have thoroughly reviewed our paper, performed additional necessary experiments, and prepared a comprehensive response. We will fix all the typos and improve the manuscript according to reviewers' comments. We hope that our response adequately addresses your concerns and provides further clarification on the contributions presented in our manuscript. We kindly request a re-evaluation of our work based on the updated information, and look forward to your recognition.
Thanks and regards,
Authors of Submission#1614
The reviewers consistently praise the paper for its novel dual learning framework that effectively enhances spatial feature modeling through the introduction of a 3D scene graph (3DSG) representation. The proposed Spatial Dual Discrete Diffusion (SD3) system is recognized for outperforming existing methods, demonstrating the effectiveness of the approach. Meanwhile, concerns are raised about the complexity and efficiency of the framework, the lack of comparison with the latest state-of-the-art methods, and the need for further clarification on some technical details.
Based on the strengths of the paper, particularly its innovative approach and demonstrated effectiveness, and despite the identified weaknesses, the overall contribution to the field is significant. The paper should be accepted, with the expectation that the authors address the concerns regarding efficiency, comparison with recent methods, and provide clearer technical explanations in the final revision.