Event-Customized Image Generation
Abstract
Reviews and Discussion
This paper presents FreeEvent, a novel approach to customized image generation that targets complex event-specific scenes rather than just entity appearances or basic interactions. Here, an "event" covers the detailed actions, poses, and relationships between entities in the reference image. FreeEvent introduces two innovative pathways: the Entity Switching Path for entity-specific guidance and the Event Transferring Path for spatial feature and attention transfer.
Strengths
- Clarity and Readability: The writing quality of the manuscript is commendable, making the intent and message of the paper easy to comprehend.
- Resource-Efficient Methodology: The proposed method is training-free, which is notably advantageous in terms of computational resource requirements, making it accessible and feasible even in resource-constrained environments.
Weaknesses
- Overclaim: The manuscript introduces the 'event-customized image generation task' as a novel contribution. However, this task appears to have been previously addressed in the work titled "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation" presented at CVPR 2024.
- Lack of Methodological Novelty: The two pathways proposed in the paper appear to primarily combine existing methods. The Entity Switching Path employs a strategy similar to Attend-and-Excite for controlling content generation in specific locations, while the Event Transferring Path largely follows approaches resembling MasaCtrl. These methods are widely established and have been extensively discussed across numerous publications, which may limit the perceived innovation in the paper’s methodology.
- Suboptimal Qualitative Results: The qualitative results presented show room for improvement. Specifically, I noticed that the retention of person identifiers is weak in Figures 1c and 6, which raises questions about the underlying cause; the authors should clarify this aspect further. Additionally, the prompt “P: skeleton, statue, monkey, book” appears four times throughout the paper. Reducing the repetition of this sample would likely enhance the diversity and impact of the results shown.
Questions
See weaknesses.
Thank you for the detailed comments. We are willing to address all the mentioned weaknesses and questions.
Q1: Overclaim of the task.
The manuscript introduces the 'event-customized image generation task' as a novel contribution. However, this task appears to have been previously addressed in the work titled "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation" presented at CVPR 2024.
A1: Thanks for your concerns. We have already discussed the mentioned action customization work in our Introduction section, noting its significant limitations in addressing the "event-customized image generation task". Below, we further emphasize and analyze its primary limitations.
-
Simplified Customization and Unconvincing Evaluations. This action customization work focuses only on basic actions of a single person, and it provides only 8 actions for evaluation, which is far from demonstrating that it can cover the wide range of actions in the real world. Besides, it does not explore or report results for more complex and diverse actions involving multiple humans, let alone interactive actions between humans, animals, and objects. Given this narrow focus on action customization, the limited evaluation results, and the absence of publicly available code for further validation, we believe it faces significant limitations in addressing complex actions or interactions among multiple humans, animals, and objects.
-
Insufficient Data. Considering its proposed method further, it learns identifier tokens to represent specific actions. However, for each action, its training-based process requires a set of reference images (e.g., 10 images) paired with corresponding textual descriptions across different entities. Unfortunately, each action is highly unique and distinctive, i.e., gathering images that depict the exact same action is challenging. As shown in Figure 1(b), there are still significant differences in the same action (e.g., handstand) between different reference images, which compromises the accuracy of the learned tokens and leads to inconsistent actions in the generated images: the generated handstand poses of "Spiderman" and "panda" differ. Meanwhile, it is even more difficult to gather example images for more complex events, e.g., the reference images shown in Figure 1(c). This data requirement for identical actions severely limits the practicality and generalizability of the method.
Therefore, considering the setting, evaluation, and methodology of this work, it still faces significant limitations in addressing complex actions or interactions, both in terms of effectiveness and practicality.
In contrast, our "event-customized image generation task" focuses on the customization of the "event", including diverse actions, poses, relations, and interactions over different entities (e.g., humans, animals, and objects). It requires only a single reference image, which also eliminates the need to collect "exactly the same" example images. Thus, our proposed task addresses the limitations of existing action customization by broadening both the scope and the setting of customization. This advancement extends customized image generation to more complex real-world scenes, making it a novel and well-motivated contribution.
Q2: Lack of Methodological Novelty.
The two pathways proposed in the paper appear to primarily combine existing methods. The Entity Switching Path employs a strategy similar to Attend-and-Excite for controlling content generation in specific locations, while the Event Transferring Path largely follows approaches resembling MasaCtrl. These methods are widely established and have been extensively discussed across numerous publications, which may limit the perceived innovation in the paper’s methodology.
A2: Thanks for your concerns. We want to first emphasize that we make three contributions in this paper: 1) the new and meaningful event-customized image generation task; 2) the first training-free method for event customization; 3) two evaluation benchmarks for event-customized image generation. Specifically, for our training-free method FreeEvent, we provide more discussion below.
-
Motivation. Based on the two main components of the reference image, i.e., entity and event, we proposed to decompose the event customization into two parts: 1) Switching the entities in the reference image to target entities. 2) Transferring the event from the reference image to the target image. Inspired by the observation that the spatial features and attention maps have been utilized to control the layout, structure, and appearance in text-to-image generation, we further designed the two corresponding paths to address the two parts. While these observations have been widely recognized in previous works, we are the first to integrate them to address this new task in a training-free manner. This approach demonstrates a thoughtful analysis of the task and a strategic application of existing technologies.
-
Improvements. We also made several specific improvements to better address the event customization task. 1) For entity switching, besides the cross-attention guidance, we further regulate the cross-attention map of each entity to avoid the appearance leakage between each target entity. 2) For event transferring, in contrast to previous works [A, B] that perform DDIM inversion on reference images, we directly use forward diffusion. This further reduces the appearance leakage from the reference image and saves the inversion cost and additional model inference time.
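As an illustrative sketch of improvement (1) above (this is not our exact implementation; the tensor shapes, token indices, and the renormalization choice are placeholder assumptions), the cross-attention regulation can be viewed as masking each target entity token's spatial attention map with that entity's region mask, so that one entity's appearance cannot leak into another entity's region:

```python
import numpy as np

def regulate_cross_attention(attn, entity_token_ids, entity_masks, eps=1e-8):
    """attn: (H*W, n_tokens) cross-attention weights of one layer.
    For each entity token, zero the attention outside that entity's region mask
    and renormalize its spatial map so it still sums to one."""
    out = attn.copy()
    for tok, mask in zip(entity_token_ids, entity_masks):
        masked = out[:, tok] * mask
        out[:, tok] = masked / (masked.sum() + eps)
    return out

# Toy example: a 4x4 latent (16 spatial positions) and a 6-token target prompt.
rng = np.random.default_rng(0)
attn = rng.random((16, 6))
attn /= attn.sum(axis=1, keepdims=True)   # each position's weights over tokens sum to one
mask_a = np.zeros(16); mask_a[:8] = 1     # entity A occupies the top half
mask_b = np.zeros(16); mask_b[8:] = 1     # entity B occupies the bottom half
regulated = regulate_cross_attention(attn, entity_token_ids=[1, 3], entity_masks=[mask_a, mask_b])
print(regulated[:8, 1].sum(), regulated[8:, 1].sum())   # all of token 1's attention stays in A's region
```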
While FreeEvent does incorporate some existing methods, its design is rooted in a thoughtful analysis of the new task and a strategic application of existing insights. Furthermore, we also introduced specific improvements, enabling it to address this new task more effectively and efficiently. FreeEvent has demonstrated its effectiveness and efficiency in a wide range of experiments, outperforming existing controllable generation, image editing, and customization works. As the first work in this direction, we hope our method can unveil new possibilities for more complex customization while serving as a challenging baseline for future works.
[A] N Tumanyan, et al. Plug-and-play diffusion features for text-driven image-to-image translation. CVPR, 2023.
[B] M Cao, et al. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. ICCV, 2023.
Q3: Suboptimal Qualitative Results.
The qualitative results presented show room for improvement. Specifically, I noticed that the retention of person identifiers is weak in Figures 1c and 6, which raises questions about the underlying cause; the authors should clarify this aspect further. Additionally, the prompt “P: skeleton, statue, monkey, book” appears four times throughout the paper. Reducing the repetition of this sample would likely enhance the diversity and impact of the results shown.
A3: Thanks for your concerns. We have updated the results in Figure 1(c) and Figure 6. For event-subject customization, we combine our framework with subject customization methods to generate target entities with user-specified subjects, i.e., subjects represented by identifier tokens. Generally, two main elements contribute to promising subject customization: 1) enough example images (e.g., 5 images) for the given subject, and 2) an effective training process for learning the corresponding identifiers. Since we used only one example image for each subject and adopted the early customization method DreamBooth as an initial exploration, the learned identifier tokens were not "strong" enough to represent the characteristics of each subject, leading to the suboptimal results shown in the previous Figure 1(c) and Figure 6.
Naturally, this can be further improved by employing more advanced subject customization methods. We adopted the Break-A-Scene [A] model to make this improvement. Specifically, Break-A-Scene introduces an enhanced training process for learning better identifier tokens, and it can extract multiple concepts from a single image. As shown in the updated Figure 1(c) and Figure 6, we effectively achieve event-subject customization with better subject customization results: 1) for person subjects, we can now better preserve their characteristics (e.g., facial features, hairstyles, clothing textures, and colors); 2) we also enable more flexible customization of diverse regular concepts (e.g., the cup, shell, and panda) and of background-image concepts (e.g., the beach). Meanwhile, we can combine these different concepts to generate more diverse and creative images. Notably, we do not modify any part of our event customization framework. In summary, the quality of the subject customization depends on the subject customization method itself; employing more advanced methods naturally yields better results, which further demonstrates the strong practicality of our proposed framework. It enables seamless plug-and-play integration with state-of-the-art subject customization techniques, facilitating more diverse and personalized generation.
Additionally, we have modified Figure 4 to reduce the repeated samples and show more diverse samples. We also provided a wide range of events and results in Appendix E.
Dear Reviewer,
Thank you again for your valuable feedback and thoughtful comments. We would like to kindly remind you that the deadline of the discussion period is approaching. If you have any additional questions, concerns, or clarifications you would like us to address, we would be more than happy to provide prompt responses.
Thank you for your attention, and we look forward to hearing from you!
Dear Reviewer,
As the deadline for the author-reviewer discussion period approaches, we would like to confirm whether our response has adequately addressed your concerns. If there are any remaining issues or if you require further clarification, please do not hesitate to let us know.
Thank you!
This paper introduces a new task, event-customized image generation, which aims at accurately capturing the complex reference event and generating customized images with various target entities. Meanwhile, a training-free event customization method, FreeEvent, is proposed to solve the event-customized image generation task. FreeEvent consists of two paths alongside the general diffusion denoising process, i.e., the entity switching path and the event transferring path. The entity switching path applies cross-attention guidance and regulation for target entity generation, while the event transferring path injects the spatial features and self-attention maps from the reference image into the target image for event generation.
Strengths
- Clear limitation analysis of existing works, i.e., simplified customization and insufficient data.
- Good motivations for addressing the proposed task.
Weaknesses
-
Unclear Definition of "Event-Customized Image Generation": The paper’s definition of "event-customized image generation" lacks clarity, especially regarding the complexity and scope of events. Although the paper explains entity interactions, it does not address attributes adequately. Additionally, there is no quantitative measure for defining the complexity level that qualifies as an event, leaving ambiguity in how "event" is operationalized.
-
Lack of Comparison with Related Work: The training-free framework and emphasis on entity and attribute handling appear to align closely with prior work, specifically the ImageAnything framework [A]. The paper fails to compare itself with ImageAnything in terms of motivation, methodology, and structural framework, missing an opportunity to clarify its novelty and improvements over similar approaches.
-
Limited Information on Similarity Metrics: The similarity metric used for assessments in Table 1 is not specified, leaving readers uncertain about the criteria for evaluation. Without this information, the results may be hard to interpret, limiting the reproducibility and transparency of the evaluation.
-
Insufficient Performance Metrics: The paper could enhance its assessment by including standard image generation metrics, such as FID (Fréchet Inception Distance) and CLIP scores, for a more comprehensive comparison. Relying on a limited set of metrics may not provide a well-rounded evaluation, which could affect the perceived robustness of the proposed method.
Questions
- The definition of "event-customized image generation" is somewhat unclear. "Given a single reference image, we define the event as all actions and poses of each single entity, and their relations and interaction between different entities." The entity part is fully addressed; however, what about the attribute part? Is there any quantitative definition of how complex a scene must be to qualify as an event?
- The training-free framework and the focus on entity and attribute are somewhat similar to a prior work, ImageAnything [A]. Please give a clear discussion and comparison with this work regarding motivation, methodology, and frameworks. [A] Lyu Y, Zheng X, Wang L. Image Anything: Towards reasoning-coherent and training-free multi-modal image generation. arXiv preprint arXiv:2401.17664, 2024.
- What kind of similarity metric is used to assess the methods in Tab.1?
- Could the author include more metrics to assess the proposed FreeEvent and the existing methods, such as FID, CLIP score?
Thank you for the detailed comments. We are willing to address all the mentioned weaknesses and questions.
Q1: Unclear Definition of "Event-Customized Image Generation"
The paper’s definition of "event-customized image generation" lacks clarity, especially regarding the complexity and scope of events. Although the paper explains entity interactions, it does not address attributes adequately. Additionally, there is no quantitative measure for defining the complexity level that qualifies as an event, leaving ambiguity in how "event" is operationalized.
The definition of "event-customized image generation" is somewhat unclear. "Given a single reference image, we define the event as all actions and poses of each single entity, and their relations and interaction between different entities." The entity part is fully addressed; however, what about the attribute part? Is there any quantitative definition of how complex a scene must be to qualify as an event?
A1: Thanks for your concerns. The key to "event-customized image generation" lies in capturing the actions, poses, relations, and interactions among the reference entities to generate new entities, so we mainly focus on transferring the event and switching the entities (i.e., given as entity nouns). Thus, in this paper, we did not explicitly model attributes. However, since we can generate extra content for the background and style by providing corresponding text descriptions (as shown in Figure 5(b)), we explored attribute modeling by adding extra adjectives to the target prompt as a simple and natural extension. Meanwhile, to ensure the accurate generation of the attributes, we applied the cross-attention guidance and regulation on each attribute using the mask of the entity it describes. As shown in Appendix C and Figure 8, our method successfully handles the attributes of the corresponding entities (e.g., colors, materials, and ages). Overall, while attributes are not the primary focus of this work, our approach shows potential and effectiveness in addressing them, and we would be happy to conduct further research in future work.
For the quantitative definition of the event complexity level, we have the following discoveries:
- The complexity of actions and interactions in an image increases as the number of entities increases. Taking the SWiG and HICO-DET datasets we use as examples, these datasets have been widely used in verb and relation detection tasks, and detection accuracy decreases significantly as the number of entities in the image increases.
- The event complexity also differs among different types of entities, e.g., the interaction "three humans fighting" is generally more complex than "a man holding two apples". Thus, we can first quantitatively measure the complexity level of an event by the total number of entities. For images with the same total number of entities, we further count the distribution of different entity categories, i.e., the respective numbers of humans, animals, and objects. We can then further quantify their complexity based on the rule "human > animal > object", e.g., an event with "2 humans and 1 animal" is more complex than one with "1 human and 2 objects". However, such rules or settings may not be applicable to all images in the real world.
Therefore, in this paper, we primarily measure the event complexity using the total number of entities.
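As a concrete illustration of this heuristic (the category weights below are arbitrary placeholders, not values used in the paper), events can be ordered first by total entity count and then by the "human > animal > object" mix:

```python
# Toy complexity ordering: primary key = total entity count,
# secondary key = weighted category mix following "human > animal > object".
CATEGORY_WEIGHT = {"human": 3, "animal": 2, "object": 1}   # illustrative weights only

def event_complexity(entities):
    """entities: list of category strings, e.g. ["human", "human", "object"]."""
    return (len(entities), sum(CATEGORY_WEIGHT[e] for e in entities))

# Same entity count, but "2 humans + 1 animal" ranks as more complex than "1 human + 2 objects".
print(event_complexity(["human", "human", "animal"]) > event_complexity(["human", "object", "object"]))  # True
```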
We have made the corresponding revision in the Introduction (Line 107) to clarify the measurement of event complexity, and provided the exploration of attribute generation in Appendix C.
Q2: Lack of Comparison with Related Work.
The training-free framework and emphasis on entity and attribute handling appear to align closely with prior work, specifically the ImageAnything framework [A]. The paper fails to compare itself with ImageAnything in terms of motivation, methodology, and structural framework, missing an opportunity to clarify its novelty and improvements over similar approaches.
The training-free framework and the focus on entity and attribute are somewhat similar to a prior work, ImageAnything [A]. Please give a clear discussion and comparison with this work regarding motivation, methodology, and frameworks.
A2: Thanks for your suggestions. While Image Anything (ImgAny) also focuses on diffusion-based training-free image generation, there are key differences in motivation, methodology, and framework compared with our FreeEvent.
-
Motivation. ImgAny aims at taking different input modalities (e.g., language, audio, and vision) for multi-modal image generation. Thus, the target of ImgAny is to generate reasonable content corresponding to the input modalities. For example, given the input audio "meow", the input text "green eye", and the input image "bed", ImgAny aims to generate an image that contains "a cat" with "green eyes" on a "bed"; it does not specify the specific pose or layout of the "cat" and "bed". In contrast, FreeEvent takes fixed input modalities (the reference image and the target prompt), and its target is to capture the event from the reference image and generate new images with entities given by the target prompt. For example, given the reference image "a cat is lying on a bed", FreeEvent aims to capture the specific pose and spatial layout of the cat and the bed, and the interaction between them. We can then customize this event with new entities to generate "a tiger is lying on a desk" with the same pose and interaction by giving the target prompt "tiger, desk", or "a dinosaur is lying on a mountain" with "dinosaur, mountain". To summarize, ImgAny focuses on modeling complex combinations of different input modalities, while our FreeEvent focuses on modeling the complex input event and combinations of target entities.
-
Methodology. ImgAny aims to extract a fused multi-modal feature of the input modalities as the condition. Specifically, it extracts the fused text feature of entity nouns and attribute words as the multi-modal feature to condition Stable Diffusion (SD); that is, based on the pre-trained text-to-image SD model, it replaces the general text embedding with the fused multi-modal feature for denoising. Instead of representing different input modalities with text features, we utilize spatial features and attention maps to achieve event customization during the denoising process. Specifically, we transfer the event by injecting the spatial features and self-attention maps, and guide the generation of the target entities by modifying the cross-attention maps and the latent. Besides, we take the general text embedding of the target prompt as the input condition for SD. To summarize, ImgAny and FreeEvent have distinct methodologies for both feature extraction and the generation process.
-
Framework. Notably, both ImgAny and FreeEvent propose two branches to address their targets from two aspects, i.e., entity and attribute for ImgAny, and event and entity for FreeEvent. However, their frameworks have distinct differences. 1) Although both methods have a dedicated branch for handling the entity, for ImgAny the entity nouns are retrieved from a vocabulary to represent the multi-modal input, whereas for FreeEvent the entity nouns are directly given by the target prompt as the generation target. Besides, ImgAny's entity branch focuses on extracting the text features of the entity nouns, while FreeEvent's entity path focuses on utilizing the cross-attention maps of each entity noun. 2) The attribute branch of ImgAny focuses on extracting the text features of the attribute words to represent the multi-modal input. As mentioned in Q1, we do not explicitly model the attribute part, and the event path of FreeEvent focuses on injecting spatial features and self-attention maps from the reference image. These two branches are also completely different. 3) ImgAny operates before the denoising process of SD, i.e., it replaces the general text embedding with the fused multi-modal feature before each denoising step. In contrast, FreeEvent operates during the general denoising process of SD, i.e., it performs feature injection and attention guidance alongside each denoising step. To summarize, the frameworks of ImgAny and FreeEvent differ markedly in their detailed design, purpose, and operation.
While these differences make it inappropriate to use ImgAny as a baseline, ImgAny is undoubtedly a pioneering work in diffusion-based training-free image generation. We have included a discussion and comparison of ImgAny in the Related Work section.
Q3: Limited Information on Similarity Metrics.
The similarity metric used for assessments in Table 1 is not specified, leaving readers uncertain about the criteria for evaluation. Without this information, the results may be hard to interpret, limiting the reproducibility and transparency of the evaluation.
What kind of similarity metric is used to assess the methods in Tab.1?
A3: Thanks for your concerns. As mentioned in Sec 4.2, we used the CLIP score. Specifically, we extracted the image feature of each image through the CLIP visual encoder and calculated the cosine similarities for image retrieval. We have specified this more clearly in the new manuscript (Line 353 - 355).
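For reference, a minimal sketch of this retrieval protocol using the Hugging Face transformers CLIP wrapper (the specific CLIP checkpoint and the file names below are illustrative assumptions, not the exact setup behind Table 1):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_features(paths):
    """Encode images with the CLIP visual encoder and unit-normalize the features."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Rank candidate reference images for one generated image by cosine similarity.
gen = clip_image_features(["generated.png"])                    # hypothetical file names
refs = clip_image_features(["ref_0.png", "ref_1.png", "ref_2.png"])
similarity = gen @ refs.T                                        # (1, n_refs) cosine similarities
print(similarity.argsort(dim=-1, descending=True))               # retrieval ranking
```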
Q4: Insufficient Performance Metrics.
The paper could enhance its assessment by including standard image generation metrics, such as FID (Fréchet Inception Distance) and CLIP scores, for a more comprehensive comparison. Relying on a limited set of metrics may not provide a well-rounded evaluation, which could affect the perceived robustness of the proposed method.
Could the author include more metrics to assess the proposed FreeEvent and the existing methods, such as FID, CLIP score?
A4: Thanks for your suggestions. We have provided more metrics to validate the effectiveness of our methods.
| Model | Top-1 | Top-5 | Top-10 | CLIP-I | CLIP-T | FID |
|---|---|---|---|---|---|---|
| ControlNet | 10.66 | 23.98 | 31.28 | 0.6009 | 0.2198 | 70.45 |
| BoxDiff | 5.58 | 14.52 | 19.42 | 0.5838 | 0.2153 | 68.49 |
| FreeEvent | 34.10 | 62.04 | 71.82 | 0.7044 | 0.2238 | 29.05 |
-
First, for standard image generation metrics, we report the FID (Fréchet Inception Distance), CLIP-I, and CLIP-T scores. The CLIP-I score evaluates the image alignment of the generated images with their reference images, and the CLIP-T score evaluates the text alignment of the generated images with the text prompts. As shown in the table above, FreeEvent achieves superior performance over the baselines across all metrics, indicating that our method generates images of better quality and with better alignment to both the reference images and the texts.
-
Furthermore, we also reported the verb detection performance to validate the interaction semantics of the generated images (the Top-K represents the top-k detection accuracy). Specifically, we utilized the verb detection model GSRTR [B] which was trained on the SWIG dataset to detect the verb class of each generated image, and then calculated the detection accuracy based on the annotations of the reference images (i.e., whether the generated images and their reference images have the same verb class). As shown in the above table, our FreeEvent achieves superior performance over baselines, which indicates our method can better preserve the interaction semantics of the generated images.
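For clarity, the Top-K accuracy above can be computed as follows (a generic sketch with placeholder arrays; GSRTR's actual output format and the number of verb classes are assumptions here):

```python
import numpy as np

def top_k_accuracy(verb_logits, gt_verbs, k):
    """verb_logits: (n_images, n_verbs) detector scores for each generated image.
    gt_verbs:     (n_images,) verb class annotated for the corresponding reference image."""
    topk = np.argsort(-verb_logits, axis=1)[:, :k]        # indices of the k highest-scoring verbs
    hits = (topk == gt_verbs[:, None]).any(axis=1)        # correct if the reference verb is among them
    return hits.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 504))                       # e.g. 504 verb classes, placeholder scores
gt = rng.integers(0, 504, size=100)
for k in (1, 5, 10):
    print(f"Top-{k}: {top_k_accuracy(logits, gt, k):.2%}")
```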
We hope these metrics can provide a thorough evaluation of our method. We have revised our quantitative evaluation section in Sec 4.2 to add the above results.
[B] Junhyeong Cho, et al. Grounded Situation Recognition with Transformers. BMVC, 2021.
Thanks for the extra information and experimental results provided by the authors! The responses address some of my concerns, however, there are still some questions:
-
"Therefore, in this paper, we primarily measure the event complexity using the total number of entities." Could the authors further give some instances or results to prove that different numbers of entities influence the event complexity?
-
"1) Although both methods have a unique branch for handling the entity, for ImgAny, the entity nouns are retrieved from the vocabulary to represent the multi-modal input. For FreeEvent, the entity nouns are directly given by the target prompt as the generation target. " Which one should be the better choice? Can the authors give more discussion on this problem?
Thanks for your concern. We are willing to address all the mentioned questions.
Q1: About event complexity
A1: We want to emphasize that there is currently no standard definition or quantification of the complexity of an "event". When an "event" is conceptualized as a graph, it comprises nodes (entities) and edges (the various roles and relationships between entities). In this context, acknowledging that both the number of nodes and the number of edges influence the overall complexity of the graph, we used the number of entities to measure event complexity as a preliminary exploration.
However, the absence of ground truth data at the graph level for each image — along with the lack of annotations detailing entity roles, relationships, and interactions — makes it challenging to prove or evaluate the absolute correlation between the number of entities and event complexity.
Moreover, compared with existing action or interaction customization works that only focus on one or two entities, the number of entities serves as an intuitive starting point for exploring the event complexity, instead of a definitive measure.
We have provided more qualitative comparison results in Appendix E. Specifically, all samples are sorted by the number of entities. As the number of entities increases, we observe more pronounced appearance leakage and failures in generating relationships or interactions within baseline models. In contrast, our FreeEvent method continues to maintain the quality of customization, further demonstrating its effectiveness. We hope these examples and results offer valuable insights.
Q2: About the handling of entity.
A2: We need to first emphasize that the task settings of the two methods are totally different.
-
For event customization, the entity nouns are directly given by the users as the input prompt. For example, as shown in Figure 1(c), given the reference image with "a woman and a man are boxing", if users want to customize it into "a Spiderman and a Batman are boxing", they can give the target prompt "Spiderman, Batman". This is also similar to the setting of other action or relation customization works, as shown in Figure 1(b): the users directly give the target entities they want to generate as nouns, e.g., panda, monkey.
-
For multi-modal image generation, ImgAny takes different input modalities (e.g., language, audio, and vision). Specifically, to better model each modality, ImgAny proposes to represent each modality as entity nouns for further feature extraction. For example, given the input audio "meow", ImgAny matches the audio feature against the text features of all the entity nouns in the vocabulary to obtain the most pertinent entity word, "cat", and then uses the text feature of "cat" for further feature extraction and fusion.
In summary, for FreeEvent, the entity nouns are provided as input by the users. In contrast, for ImgAny, the entity nouns are obtained as intermediate output during a specific step of the process. Therefore, the two branches for entity nouns serve completely different purposes and contexts, which is inappropriate for comparison.
We hope this answers your questions. Thank you again for your valuable feedback, and please don’t hesitate to let us know if there are follow-up questions.
Dear Reviewer,
As the deadline for the author-reviewer discussion period approaches, we would like to confirm whether our response has adequately addressed your concerns. If there are any remaining issues or if you require further clarification, please do not hesitate to let us know.
Thank you!
This paper proposes a new task called Event-Customized Image Generation, which aims to not only control subjects but also customize all specific actions, poses, relations, or interactions between different entities in the scene. It then designs a training-free approach that alters the cross-attention within the U-Net to enable target subject generation, and utilizes the original spatial features and self-attention to enable event transfer. Experiments on two datasets validated the effectiveness.
Strengths
- The proposed training-free approach is easy to adopt and can generate satisfying results.
- The paper is clear and easy to follow.
Weaknesses
- The task setting of event-customized image generation raises the following concerns: the poses of each entity, as well as the overall spatial configuration of the generated image, are restricted to remain identical to the reference image, which hinders the diversity of the generated images. Further, even if the poses are successfully maintained, the interaction semantics may be affected. In the example of "skeleton, statue" in Figure 1, when the laptop is changed to a book, the interaction semantics between human and object change. Is this against the proposed definition of the task setting? Further, can this approach be integrated with other components to enable generation on a specific background image?
- For the method design: (1) The spatial features and self-attention maps of the reference image are adopted to inject event information; how can such direct injection prevent the leakage of the reference image's subject information? (2) The authors claim that, equipped with subject-customized image generation approaches, the method can generate entity-subject customized images by injecting target concept identifier tokens; can this approach be integrated with more advanced customization approaches to enable more flexible customization of regular concepts rather than just the celebrities visualized in Figure 6?
- In the experiments: (1) The task setting for quantitative evaluation is confusing. How can the reference image be reproduced if all the entities within the image keep the same pose as the input condition? Providing one visualized example would be helpful. (2) The quantitative evaluation only adopts a retrieval-based experiment, which is not convincing enough, and the user study for qualitative evaluation only uses 30 samples. (3) Further, the interaction semantics of the generated images are not verified in any of the experiments. Considering that HICO-DET and SWiG both contain interaction annotations, the generated images should also be evaluated on interaction detection performance.
Questions
- Can the author list some specific real-world applications of the newly proposed task setting to convince me of its value? In some cases, the interaction-semantics and pose-preservation requirements conflict, as mentioned in the weaknesses. Can the author explain the priority of such requirements?
- Can the proposed approach enable more flexible content generation, such as specifying the background image? Can the entity-subject customization ability be expanded to more diverse subject customization beyond only some celebrities?
- Can the evaluation setting of retrieval-based experiment be more clearly explained with some visualized examples?
- The author should provide more experiments to validate the effectiveness of the approach, such as reporting performance on the interaction detection task.
Details of Ethics Concerns
The generated image may arise security and safety concern like abusing celebrity information.
Thank you for the detailed comments. We are willing to address all the mentioned weaknesses and questions.
Q1: The task setting hinders the diversity of generated images.
The task setting of event-customized image generation raises the following concerns: the poses of each entity, as well as the overall spatial configuration of the generated image, are restricted to remain identical to the reference image, which hinders the diversity of the generated images.
A1: Thanks for your concerns. While the task setting of event-customized image generation aims to capture the identical poses and spatial configuration from the reference image, the diversity of the generated images can be ensured from different aspects:
- Various target entities. The target images can be generated with various combinations of diverse target entities (e.g., animals, objects, and characters). We provided more visualization results in Appendix E, and reorganized the sample order to show the diversity of generated images based on the same reference image.
- Different backgrounds and styles. As shown in the ablation results in Figure 5(b), the target images can be generated with extra content for the background and style by changing the target prompt.
- Combination of subject customization. The event customization can be further combined with subject customization to generate target entities with user-specified concepts. We updated and provided more visualization results in Figure 6, which includes the subject customization of diverse concepts (e.g., celebrities, regular objects, and background images).
Q2: Interaction semantic and poses preservation are influenced in some cases.
Further, even if the poses are successfully maintained, the interaction semantics may be affected. In the example of "skeleton, statue" in Figure 1, when the laptop is changed to a book, the interaction semantics between human and object change. Is this against the proposed definition of the task setting?
In some cases, the interaction-semantics and pose-preservation requirements conflict, as mentioned in the weaknesses. Can the author explain the priority of such requirements?
A2: Thanks for your concerns. During event customization, the reference entities and their corresponding target entities may sometimes exhibit semantic differences (e.g., changing "laptop" to "book" in Figure 1). In such scenarios, in order to generate satisfying target entities, the preservation of interaction semantics and poses may sometimes be affected. We want to emphasize that this is not a conflict with, or a violation of, the proposed task setting. Instead, it reflects the balance between transferring the reference event and generating reasonable entities when dealing with more complex and diverse target entities. Here we give more examples for clarification:
-
For the reference image with "a cat painted on a rock" shown in Appendix E, Figure 11, when changing the "rock" to a "book" (row 1), the interaction semantics between "book" and "sheep" are preserved, while the structure is affected since books are usually rectangular in shape.
-
For the reference image with "a woman holding two apples" shown in Appendix E, Figure 12, when changing the "woman" into a "robot" (row 8), the pose is preserved, while the interaction semantics between "robot" and "cake" are affected since the robot is equipped with a mechanical claw instead of human fingers.
Although the interaction semantics or structures in both cases are affected, the overall customization effectiveness remains intact. This also ensures that the generated target entities are more reasonable and align better with common sense. Therefore, we do not explicitly restrict the priority of preserving interaction semantics versus poses. Furthermore, these cases also highlight the robustness of our method, which balances event preservation and entity generation to produce more diverse and interesting target images.
Q3: Applications of the new-proposed task.
Can the author list some specific real-world applications of the newly proposed task setting to convince me of its value?
A3: Thanks for pointing this out. Event customization can facilitate many valuable applications like artistic creation and advertisement production. Specifically:
-
Customized movie and animation making. For example, based on the reference comic (e.g., the story of Romeo and Juliet), create attractive target comics with various combinations of new characters (e.g., Spiderman and Batman).
-
Personalized photo production. For example, replace King Kong and Godzilla in the "King Kong vs. Godzilla" movie poster with your own dog and cat, or create a photo with your best friend in the same pose as your childhood group photo, even if you can't see each other now.
Q4: How to prevent the leakage of subject information of reference image.
For the method design: (1) The spatial features and self-attention maps of the reference image are adopted to inject event information; how can such direct injection prevent the leakage of the reference image's subject information?
A4: Thanks for your concerns. We prevent the leakage of subject information of reference images from two aspects.
-
Firstly, we only perform the spatial feature injection in the first decoder of the U-Net and perform the self-attention map injection only in the early time steps. These configurations help to obtain rich image layout and structure information while mitigating the leakage of subject information.
-
Secondly, in contrast to previous works that perform DDIM inversion to obtain the noised latent of the reference image, we directly use forward diffusion. This further reduces the appearance leakage from the reference image. As shown in Figure 4 and Appendix E, the inversion-based methods PnP and MAG-Edit both struggled with appearance leakage, while our method successfully prevented it.
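A minimal sketch of this forward-diffusion alternative (standard closed-form DDPM noising; the noise schedule, timestep, and latent shape below are illustrative, not our exact configuration):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I).
    A single call replaces the step-by-step DDIM inversion of the reference latent."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # common linear schedule (illustrative)
x0 = rng.standard_normal((4, 64, 64))        # stand-in for the reference image's latent
xt = forward_diffuse(x0, t=600, betas=betas, rng=rng)
print(xt.shape)
```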
Q5: More flexible customization of regular concepts.
For the method design: (2) The authors claim that, equipped with subject-customized image generation approaches, the method can generate entity-subject customized images by injecting target concept identifier tokens; can this approach be integrated with more advanced customization approaches to enable more flexible customization of regular concepts rather than just the celebrities visualized in Figure 6?
Can the entity-subject customization ability be expanded to more diverse subject customization beyond only some celebrities?
A5: Thanks for your suggestion. We have provided more results in Figure 6. Our method can certainly be combined with subject customization for more regular concepts, and we adopted the subject customization model Break-A-Scene [A] as an exploration. Specifically, Break-A-Scene can extract multiple concepts from a single image, also denoted by concept identifier tokens. As shown in the updated Figure 6, our method enables entity-subject customization of diverse regular concepts (e.g., the cup, shell, and panda), and these regular concepts can also be combined with the celebrity concepts to generate creative images.
[A] Avrahami O, et al. Break-a-scene: Extracting multiple concepts from a single image. SIGGRAPH Asia 2023.
Q6: Generation on specific background image.
Further, can this approach be integrated with other components to enable generation on a specific background image?
Can the proposed approach enable more flexible content generation, such as specifying the background image?
A6: Thanks for your suggestion. We have provided more results in Figure 6. We explored specifying background images by treating the background as a special concept in subject customization, again using Break-A-Scene [A] to learn identifier tokens for the background images. As shown in the updated Figure 6, our method successfully enables generation on specific background images, and such entity-subject customization can further combine the background-image concept (e.g., the beach) with other regular concepts (e.g., the panda) to generate more flexible and diverse content.
[A] Avrahami O, et al. Break-a-scene: Extracting multiple concepts from a single image. SIGGRAPH Asia 2023.
Q7: The task setting for quantitative evaluation is confusing.
In the experiments: (1) The task setting for quantitative evaluation is confusing. How can the reference image be reproduced if all the entities within the image keep the same pose as the input condition? Providing one visualized example would be helpful.
Can the evaluation setting of retrieval-based experiment be more clearly explained with some visualized examples?
A7: Thanks for your concerns. We provided the details of the retrieval-based experiment in Appendix B, including the visualized example of the SWiG-Event sample in Figure 7(a) and the evaluation process of image generation and image retrieval in Figure 7(b).
Q8: More experiments to validate the effectiveness of the approach.
In the experiments: (2) The quantitative evaluation only adopts a retrieval-based experiment, which is not convincing enough, and the user study for qualitative evaluation only uses 30 samples. (3) Further, the interaction semantics of the generated images are not verified in any of the experiments. Considering that HICO-DET and SWiG both contain interaction annotations, the generated images should also be evaluated on interaction detection performance.
The author should provide more experiments to validate the effectiveness of the approach, such as reporting performance on the interaction detection task.
A8: Thanks for your suggestions. We have provided more experiments to validate the effectiveness of our methods.
- More Quantitative Experiments
| Model | Top-1 | Top-5 | Top-10 | CLIP-I | CLIP-T | FID |
|---|---|---|---|---|---|---|
| ControlNet | 10.66 | 23.98 | 31.28 | 0.6009 | 0.2198 | 70.45 |
| BoxDiff | 5.58 | 14.52 | 19.42 | 0.5838 | 0.2153 | 68.49 |
| FreeEvent | 34.10 | 62.04 | 71.82 | 0.7044 | 0.2238 | 29.05 |
-
For quantitative evaluation, as shown in the table above, we report the verb detection performance to validate the interaction semantics of the generated images (Top-K denotes the top-k detection accuracy). Specifically, we utilize the verb detection model GSRTR [B], trained on the SWiG dataset, to detect the verb class of each generated image, and then calculate the detection accuracy based on the annotations of the reference images (i.e., whether a generated image and its reference image have the same verb class). Our FreeEvent achieves superior performance over the baselines, indicating that our method better preserves the interaction semantics of the generated images.
-
We further report more standard image generation metrics for a more comprehensive comparison, including the FID (Fréchet Inception Distance), CLIP-I, and CLIP-T scores. The CLIP-I score evaluates the image alignment of the generated images with their reference images, and the CLIP-T score evaluates the text alignment of the generated images with the text prompts. As shown in the table above, FreeEvent achieves superior performance over the baselines across all metrics, indicating that our method generates images of better quality and with better alignment to both the reference images and the texts.
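For completeness, a sketch of the CLIP-T computation (cosine similarity between the CLIP text and image embeddings; the CLIP checkpoint and file name are assumptions, not necessarily the setup behind the table):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t_score(image_path, prompt):
    """Cosine similarity between one generated image and its text prompt."""
    inputs = processor(text=[prompt], images=Image.open(image_path).convert("RGB"),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

print(clip_t_score("generated.png", "skeleton, statue, monkey, book"))   # hypothetical inputs
```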
[B] Junhyeong Cho, et al. Grounded Situation Recognition with Transformers. BMVC, 2021.
- More User Study
| Model | Ours | ControlNet | BoxDiff | PnP | MAG-Edit | DreamBooth | ReVersion |
|---|---|---|---|---|---|---|---|
| Human Judgement | 48 | 19 | 2 | 31 | 13 | 1 | 0 |
- For the user study, we prepared 20 more trials and invited the same 10 experts as before. Together with the 30 samples we had already collected, this results in a total of 50 samples. As shown in the table above, FreeEvent achieves better performance on human judgments (HJ) than all the baseline models.
We have revised our quantitative evaluation section in Sec 4.2 and the user study section in Sec 4.5 to add the above results.
Dear Reviewer,
Thank you again for your valuable feedback and thoughtful comments. We would like to kindly remind you that the deadline of the discussion period is approaching. If you have any additional questions, concerns, or clarifications you would like us to address, we would be more than happy to provide prompt responses.
Thank you for your attention, and we look forward to hearing from you!
Thank you for providing the additional information and experimental results. The responses address some of my concerns; however, there are still a few questions that remain:
-
While both the subject and background can be replaced, the spatial configuration is fixed. It seems crucial to identify a reference image that fully satisfies the specific spatial configuration requirements. Alternatively, could the authors provide a concrete example to illustrate the practical applications of generating fixed spatial configurations? While I understand that this approach has the potential to generate diverse and interesting content, I would appreciate further clarification on the practical value of maintaining a fixed spatial configuration.
-
The explanation regarding the conflict between pose and action is not convincing. In the introduction, the authors highlight the limitations of the action-customized method, yet the final approach does not seem to consistently preserve interactive semantics, which feels somewhat inconsistent. I would suggest revisiting the definition of the event-customized generation task to resolve the conflict between pose and action.
Thanks for your concerns. We are willing to address all the mentioned questions.
Q1: Practical applications of generating fixed spatial configurations.
A1: Here we present two specific examples regarding customized comic making and personalized photo production.
-
Using reference comic books, such as the story of Romeo and Juliet, we can customize it to feature characters like Spiderman and Batman. Specifically, the spatial configurations of each comic page remain fixed (e.g., the two characters are talking, dancing, or walking), while users can switch the characters to create interesting combinations.
-
When designing a poster for a school singing competition, we can take a reference poster, such as the one for the movie "The Avengers," where all spatial configurations are fixed (e.g., the locations and poses of each hero). In this case, we can replace each hero with the participating singers to create an eye-catching poster.
Q2: The conflict between interaction semantic and poses preservation.
A2: It cannot be denied that the ideal result of event customization is to consistently preserve all interactive semantics while maintaining the same actions and poses, and our approach does face limitations in certain cases, resulting in imperfect results.
We would like to emphasize that these "conflict results" often occur when there are too many entities involved, or when the reference entities and their corresponding target entities exhibit semantic differences. In such scenarios, the preservation of interaction semantics and poses may be compromised in order to generate satisfactory target entities.
Despite these challenges, FreeEvent has demonstrated its effectiveness and efficiency across a wide range of experiments, outperforming existing methods in controllable generation, image editing, and customization. This includes quantitative comparisons based on both interaction semantic evaluations and standard image quality assessments. Additionally, as shown in the qualitative results in Appendix E, our approach can effectively preserve actions and poses with various target entities.
As the first work in this direction, our method is not perfect; however, we hope it can unveil new possibilities for achieving better results while serving as a challenging baseline for future works.
This paper proposes a new task that extends customized image generation to more complex scenes for general real-world applications. The event incorporates specific actions, poses, relations, and interactions between different entities in the scene. FreeEvent, the proposed method, solves the task by adding two extra paths alongside the diffusion denoising process: the entity switching path and the event transferring path. In addition, the paper proposes two more benchmarks for evaluation.
Strengths
- The method seems correct
- The writing and organization seem clear
- The efforts to formalize a new benchmark and task, although from existing datasets
Weaknesses
- The major issue with this paper is its reliance on existing techniques for the new task: there are no theoretical justifications and no further big novel ideas, so whether the technical contribution is limited is worth discussing. The experimental results support the evaluations, and the new task and datasets are also worth reporting.
Questions
See weaknesses.
Details of Ethics Concerns
n/a
Thank you for the detailed comments. We are willing to address all the mentioned weaknesses and questions.
Q1: Whether the technical contributions are limited.
The major issue to argue in this paper is the usage of existing technical contributions for this new task. No theoretical justifications and no further big novel ideas. Whether the technical contributions are limited is worth discussion. The experimental results proved the evaluations. The new task and datasets are also worth reporting.
A1: Thanks for your concerns. We make three contributions in this paper: 1) the new and meaningful event-customized image generation task; 2) the first training-free method for event customization; 3) two evaluation benchmarks for event-customized image generation. We appreciate your affirmation of our contribution on the new task and benchmarks. Specifically, for our training-free method FreeEvent, we provide more discussion below.
-
Motivation. Based on the two main components of the reference image, i.e., entity and event, we proposed to decompose the event customization into two parts: 1) Switching the entities in the reference image to target entities. 2) Transferring the event from the reference image to the target image. Inspired by the observation that the spatial features and attention maps have been utilized to control the layout, structure, and appearance in text-to-image generation, we further designed the two corresponding paths to address the two parts. While these observations have been widely recognized in previous works, we are the first to integrate them to address this new task in a training-free manner. This approach demonstrates a thoughtful analysis of the task and a strategic application of existing technologies.
-
Improvements. We also made several specific improvements to better address the event customization task. 1) For entity switching, besides the cross-attention guidance, we further regulate the cross-attention map of each entity to avoid the appearance leakage between each target entity. 2) For event transferring, in contrast to previous works [A, B] that perform DDIM inversion on reference images, we directly use forward diffusion. This further reduces the appearance leakage from the reference image and saves the inversion cost and additional model inference time.
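At the tensor level, the event transferring path can be sketched as a simple replacement rule (illustrative only; the layer indices, timestep threshold, and shapes are assumptions rather than our exact configuration): reference spatial features are injected in selected decoder layers, and reference self-attention maps are injected at early (large-t) denoising steps.

```python
import numpy as np

def inject_reference(target_feat, ref_feat, target_attn, ref_attn,
                     t, layer, feat_layers=(0,), attn_t_min=600):
    """Replace the target branch's decoder feature and self-attention map with
    the reference ones, but only in chosen layers / early denoising steps."""
    feat = ref_feat if layer in feat_layers else target_feat
    attn = ref_attn if t >= attn_t_min else target_attn
    return feat, attn

# Toy tensors standing in for one decoder layer at one denoising step.
rng = np.random.default_rng(0)
ref_feat, tgt_feat = rng.standard_normal((2, 320, 16, 16))   # (channels, h, w) features
ref_attn, tgt_attn = rng.random((2, 256, 256))                # (h*w, h*w) self-attention maps
feat, attn = inject_reference(tgt_feat, ref_feat, tgt_attn, ref_attn, t=800, layer=0)
assert feat is ref_feat and attn is ref_attn                  # early step, selected layer: both injected
```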
While FreeEvent does incorporate some existing methods, its design is rooted in a thoughtful analysis of the new task and a strategic application of existing insights. Furthermore, we also introduced specific improvements, enabling it to address this new task more effectively and efficiently. FreeEvent has demonstrated its effectiveness and efficiency in a wide range of experiments, outperforming existing controllable generation, image editing, and customization works. As the first work in this direction, we hope our method can unveil new possibilities for more complex customization while serving as a challenging baseline for future works.
[A] N Tumanyan, et al. Plug-and-play diffusion features for text-driven image-to-image translation. CVPR, 2023.
[B] M Cao, et al. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. ICCV, 2023.
Dear Reviewer,
As the deadline for the author-reviewer discussion period approaches, we would like to confirm whether our response has adequately addressed your concerns. If there are any remaining issues or if you require further clarification, please do not hesitate to let us know.
Thank you!
We thank all reviewers for recognizing that the presentation of our paper is clear (Reviewers 2kAh, Gzkr, Pbo8), easy to follow (Reviewer 2kAh), and commendable for its writing quality and readability (Reviewer 1gh9). Meanwhile, they have acknowledged our contributions in proposing a new but meaningful task and new benchmarks (Reviewer Gzkr). Besides, our proposed method is easy to adopt (Reviewer 2kAh), correct (Reviewer Gzkr), and resource-efficient (Reviewer 1gh9), with good motivations (Reviewer Pbo8), and it has demonstrated its effectiveness with satisfying results (Reviewer 2kAh).
We appreciate these suggestions and comments and have carefully revised our paper accordingly. Our major revisions include the following four aspects:
-
In the Introduction section, we added the clarification of the measurement of event complexity (Line 107), and updated the last sample in Figure 1(c).
-
In the Related Work section, we included the discussion and comparison with the prior work (Lines 161 - 163).
-
In the Experiments section, we provided more quantitative comparisons with diverse evaluation metrics (Lines 351 - 370, and Table 1). We also updated the qualitative comparisons with more diverse visualization samples (Figure 4). We updated the new results for the event-subject combination (Line 484, Lines 514 - 517, and Figure 6). We further updated the user study with 20 more samples for a comprehensive evaluation (Lines 525 - 526, and Table 2).
-
In the Appendix:
-
We moved the limitation and potential negative societal impact into Sec. D.
-
We added the exploration on attribute generation in Sec. C.
-
We moved more qualitative comparison results to Sec. E.
-
Please note that the revisions are highlighted in blue in the new version of the paper.
This work received two positive and two negative scores. After checking the paper, the AC is still concerned about the limited technical novelty of the proposed application.
As the author said, "While these observations have been widely recognized in previous works, we are the first to integrate them to address this new task in a training-free manner. This approach demonstrates a thoughtful analysis of the task and a strategic application of existing technologies." This indicates that this work seems to be an extension of the existing techniques for the new proposed task.
Moreover, the definition of event-customized generation remains unclear.
Even though the AC acknowledges that this paper is well written and has sufficient experiments, the work does not meet the bar for a top conference.
Additional Comments from the Reviewer Discussion
Points Raised by Reviewers
-
Unclear Definition of "Event-Customized Image Generation":
- Ambiguity in defining "event complexity" and lack of clarity in scope.
- Missing quantitative metrics for measuring complexity levels.
-
Methodological Novelty:
- Claims of novelty in the proposed task overlap with prior works on action customization (e.g., CVPR 2024 paper).
- The Entity Switching Path and Event Transferring Path are perceived as adaptations of established methods like Attend-and-Excite and MasaCtrl.
-
Comparative Analysis and Metrics:
- Limited comparisons with related works (e.g., ImageAnything framework).
- Missing standard metrics like FID, CLIP scores, and interaction detection performance.
-
Qualitative and Quantitative Validation:
- Concerns over insufficient diversity in qualitative examples.
- Limited evaluation metrics and small-scale user studies.
-
Real-world Applications:
- Unclear practical utility of maintaining fixed spatial configurations.
- Questions about the flexibility of event customization for diverse scenarios.
Author Responses and Revisions
-
Clarification of "Event Complexity":
- Defined complexity in terms of entity count and their interactions.
- Revised the introduction (Line 107) and provided examples in Appendix C.
-
Methodological Justifications:
- Highlighted the novelty in integrating spatial features and attention maps for event customization.
- Improved cross-attention regulation to reduce appearance leakage.
-
Enhanced Metrics and Comparisons:
- Added benchmarks with metrics like FID, CLIP-I, and CLIP-T scores.
- Conducted verb detection experiments using GSRTR, showing superior interaction semantics preservation.
-
Expanded Validation:
- Increased user study samples to 50 and added more qualitative results in Appendix E.
- Demonstrated robustness in handling diverse target entities and complex scenarios.
-
Practical Applications:
- Illustrated use cases like customized comic creation and personalized poster design.
- Showcased results with specified background and subject customization.
Final Decision Rationale
-
Strengths:
- Clear task definition with reasonable scope for further exploration.
- Training-free approach demonstrates efficiency and practicality.
- Extensive revisions addressed most concerns.
-
Weaknesses:
- Methodological novelty remains incremental, relying heavily on existing techniques.
- Persistent ambiguity in task definition and real-world utility.
- Limited diversity in results and small-scale experiments impact generalizability.
Despite the revisions, concerns about originality and practical contributions outweighed the improvements. The decision was to reject, as the work requires further innovation and validation for acceptance.
Reject