PaperHub
Overall rating: 5.3/10 · Decision: Rejected · 4 reviewers
Review scores: 5, 6, 5, 5 (min 5, max 6, std 0.4)
Confidence: 3.5
ICLR 2024

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11
TL;DR

We propose a diffusion-based composed image retrieval model that offers versatility and controllability. We also propose a massive synthetic CIR triplet dataset named SynthTriplets18M.

Abstract

Keywords
Composed Image Retrieval, Diffusion Models

Reviews and Discussion

Review (Rating: 5)

This paper proposes a novel diffusion-based model, named CompoDiff, which could merge the multimodal conditional information, for solving composed image retrieval (CIR) task. It also proposes a newly created synthetic dataset, named SynthTriplet18M, of 18 million training triplets (reference image, conditions, and target image). The proposed model and dataset address the poor generalizability of existing CIR methods, due to the small training dataset scale and limited types of conditions. The experimental results show the proposed method achieves better results on four public CIR benchmarks.

Strengths

  1. The idea of leveraging synthetic data for training the CIR task is pertinent, since the triplet-labeled training data required for CIR is laborious to collect.
  2. Borrowing the idea of diffusion from generative tasks, the authors also explore the possibility of adopting the diffusion mechanism for latent feature extraction in a discriminative task, and show it has the potential to achieve good retrieval accuracy.
  3. It is interesting that the adoption of diffusion enables negative text for the CIR task.

Weaknesses

  1. Section 3.1 needs to be written more clearly; it would be preferable to annotate the letters and variables that appear in Eqs. (1), (2), and (3) in Figure 3, for example, $z_{i,masked}$. What is the relationship between $e_t$ and $z_i^{t+1}$? It is difficult to understand from Figure 3 and Eqs. (1), (2), and (3).
  2. The diffusion part for feature extraction is very opaque. In Figure 3, what is the intuition behind the forward diffusion (adding noise T times) and the denoising diffusion? In the generation task, the output of the diffusion process consists of pixel values, which have a clear and explicit meaning, while in the retrieval task, the output of the diffusion process is latent variables that do not have a clear and explicit meaning.
  3. I think the authors should consider the time and resource consumption carefully. The training stage is very complex, since the framework involves two stages that both require massive data, and stage 2 requires some tricks such as the alternative strategy, which is unstable and resource-consuming. For the inference stage, it requires 5 diffusion processes for each query sample, while each diffusion process still needs multiple steps; this is very time-consuming since the diffusion process is very slow. It is necessary to compare the inference time with previous methods. In Section 4, when collecting the synthesized captions, fine-tuning the OPT-6.7B model is very time-consuming and resource-consuming.
  4. I think there is some mistake on the left side of Eq.(4). Besides, in section 4, x_c is used in the fourth and sixth rows, while x_{c_T} is used in the fifth row.

Questions

In the keyword-based diverse caption generation of Figure 5, to my knowledge, it is not quite reliable to collect the alternative keywords using CLIP feature similarity. Firstly, some alternative keywords may share a similar concept with the target keyword, but the synthesized caption may not be reasonable. For example, “plants” and “flora” share a similar concept with “strawberry”, but “plants tart” and “flora tart” are ridiculous. Even if frequency filtering is used and the CLIP similarity is restricted to 0.5~0.7, this phenomenon still exists. Moreover, keywords such as “portrait, figure, image” have consistently high similarity with most keywords but do not have specific meanings; keywords such as “painting, drawing, walk, hiking” may have different parts of speech (verb and noun); and some keywords such as “light, chair, season” may carry different meanings in different contexts. Note that what I am referring to is not limited to the examples mentioned above; it is a general issue, and all these problems can lead to the generation of very strange modified captions. I am curious how these problems are considered and solved.

Comment

We thank the reviewer for their positive feedback and constructive comments. We will address all the concerns raised by the reviewer and will revise our paper as soon as possible (hopefully before Tuesday).

[W1] Meaning of diffusion for feature extraction

We would like to emphasize that the diffusion process itself is invariant to the input domain: whatever the input domain is, a diffusion model simply learns to map the underlying distribution of the given data to a Gaussian distribution (and back). From this point of view, CompoDiff learns an underlying distribution of image latent embeddings with proper guidance. Note that our work is not the only method to learn a diffusion model on a latent space. For example, LatentDiffusion (or StableDiffusion) learns a diffusion model on the 64 x 64 latent space of the VQ-GAN encoder, and the Dalle-2 prior model learns a diffusion model on the CLIP latent feature space, as ours does. If the reviewer wonders about the difference between CompoDiff and these methods, please check our response to Reviewer AJF4, “CompoDiff vs. StableDiffusion”.
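For reference, the forward (noising) process behind this argument is the standard one from the diffusion literature (a generic textbook form, not a formula taken from the paper); the only change for CompoDiff is that $z_0$ is a CLIP image embedding rather than a pixel grid or a VQ-GAN latent:

$$q(z_t \mid z_0) = \mathcal{N}\!\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s),$$

and the model learns the reverse transitions $p_\theta(z_{t-1} \mid z_t, c)$ under the given conditions $c$. Nothing in this formulation depends on whether $z_0$ lives in pixel space or in an embedding space.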

[W2] [W4] Clarity and typos in section 3.1 and Eq (4)

Thanks for your comment! We will revise our paper as soon as possible.

[Q1] Reliability of keyword-based diverse caption generation

Thanks for the question. Our process can handle the cases raised by the reviewer through CLIP-based filtering. On page 14, we describe the details of our filtering process:

We apply a filtering process following Brooks et al. (2022) to remove the low-quality ⟨x_{i_R}, x_c, x_i⟩. We filter the generated images for an image-image CLIP threshold of 0.70 to ensure that the images are not too different, an image-caption CLIP threshold of 0.2 to ensure that the images correspond to their captions, and a directional CLIP similarity of 0.2 to ensure that the change in before/after captions correspond with the change in before/after images. Additionally, for keyword-based data generation, we filter out for a keyword-image CLIP threshold of 0.20 to ensure that images contain the context of the keyword, and for instruction-based data generation, we filter out for an instruction modified image CLIP threshold of 0.20 to ensure consistency with the given instructions.

Now, we explain how the examples raised by the reviewer are handled by our filtering process. First, if the caption itself is really ridiculous, then StableDiffusion and Prompt2Prompt will not be able to generate proper images. For example, they could produce noisy pixels, plain white images, or “broken” images for a weird caption. These images will be filtered out by CLIP similarity, because we require the similarity between the generated images to be sufficiently high, and broken images will have low similarity with clean images.

Second, we define keywords as nouns; hence, parts of speech other than nouns will not be extracted. Similar to the first case, if the given caption is too weird or ridiculous to produce an image, the generative model will produce a broken image, which will be filtered out.
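As a concrete illustration, below is a minimal sketch of the filtering logic quoted above. The thresholds come from the quoted passage; the helper names (`clip_image_embed`, `clip_text_embed`) and all other implementation details are our assumptions, not the authors' code.

```python
import torch.nn.functional as F

def cos(a, b):
    # cosine similarity between two (already computed) CLIP embeddings
    return F.cosine_similarity(a, b, dim=-1)

def keep_triplet(clip_image_embed, clip_text_embed,
                 img_ref, img_tgt, cap_ref, cap_tgt, keyword=None):
    """Return True if a generated triplet passes the CLIP-based filters."""
    zi_r, zi_t = clip_image_embed(img_ref), clip_image_embed(img_tgt)
    zc_r, zc_t = clip_text_embed(cap_ref), clip_text_embed(cap_tgt)

    if cos(zi_r, zi_t) < 0.70:                  # image-image: images not too different
        return False
    if cos(zi_r, zc_r) < 0.20 or cos(zi_t, zc_t) < 0.20:  # image-caption agreement
        return False
    if cos(zi_t - zi_r, zc_t - zc_r) < 0.20:    # directional similarity of the edit
        return False
    if keyword is not None and cos(zi_t, clip_text_embed(keyword)) < 0.20:
        return False                            # keyword-based generation only
    return True
```

Broken or off-prompt generations (noise, blank images) fail these checks and are discarded.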

Comment

[W3] Resource consumption

Thanks for your constructive feedback. We agree that the paper would benefit from a more careful discussion of resource consumption. We will include each item in the revised paper.

[W3-1] Complex training stage

We would like to emphasize that the two-stage training procedure is not an overly complex strategy. For example, SEARLE employs two-stage training, where the first stage learns pseudo token embeddings and the second stage learns the projection module by distillation using the learned pseudo token embeddings. Combiner also employs two-stage training, where the first stage pre-trains the text encoder and the second stage trains the combiner module using the pre-trained model. Our two-stage training strategy is not a mandatory procedure; we employ it for better performance. Conceptually, stage 1 is a pre-training stage that trains a text-to-image diffusion model with pairwise relationships, and stage 2 is a fine-tuning stage using triplet relationships. Lastly, the alternative optimization strategy for stage 2 is for helping optimization, not for resolving instability. As shown in Table 3, CompoDiff can be trained with Eq. (3) alone, but adding more objective functions makes the final performance stronger. In terms of diffusion model fine-tuning, we argue that our method is not specifically complex compared to other fine-tuning methods, such as ControlNet.

We partially agree with the reviewer that both stages are trained with massive numbers of data points. However, as shown in Table 4, a massive dataset is not a necessary condition for training CompoDiff. In fact, Table 4 shows that CompoDiff follows a scaling law; namely, CompoDiff's performance consistently improves as the number of data points is scaled up. As far as we know, this is the first study that shows the impact of dataset scale on zero-shot CIR performance.

[W3-2] Inference stage

We would like to emphasize that CompoDiff retrieval does not take a very long inference time, contrary to the reviewer's concern. Below, we add a table comparing the inference speed of the compared methods. The table will be added to the revised paper soon.

| Method | Inference time for a single batch (secs) |
| --- | --- |
| Pic2Word (ViT-L) | 0.02 |
| SEARLE (ViT-L) | 0.02 |
| CompoDiff (ViT-L) | ~~0.23~~ 0.12 |
| ARTEMIS (RN50) | 0.005 |
| Combiner (RN50) | 0.006 |

In the table, we can confirm that CompoDiff is practically useful with high throughput (about ~~230ms~~ 120ms). Note that these numbers highly depend on the hardware. For example, in Table 5 in our main paper, we report the average inference time as 120ms with a better machine; in the table above, we measured the numbers with a slightly worse machine than the one used for Table 5. The previous numbers were based on batch size 32, while the updated numbers are based on batch size 1. We have fixed the table to avoid confusion.

One of the advantages of our method is that we can control the trade-off between retrieval performance and inference time, which is impossible for the other methods. If we need a faster inference time, even at the cost of worse retrieval performance, we can reduce the number of diffusion steps. More detailed experiments on this trade-off are shown in Table 5.

Also, we would like to emphasize that one of our main contributions is a novel and efficient conditioning scheme for the diffusion model. Instead of using a concatenated vector of all conditions and inputs as the input of the diffusion model, we use the cross-attention mechanism for the conditions and leave the input size the same as the original size. As shown in Table C.5, our design choice is three times faster than the naive implementation.
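To illustrate the design choice, here is a minimal sketch (not the authors' code; dimensions and layer choices are assumptions) of a denoising Transformer block in which the noisy CLIP image embedding remains the only input token while the conditions enter through cross-attention:

```python
import torch.nn as nn

class CrossAttnDenoiserBlock(nn.Module):
    def __init__(self, dim=1024, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, z, cond):
        # z:    (B, 1, dim)  noisy CLIP image embedding -- input length stays fixed
        # cond: (B, L, dim)  projected condition tokens (text, mask, reference image, ...)
        h = self.n1(z)
        z = z + self.self_attn(h, h, h)[0]
        h = self.n2(z)
        z = z + self.cross_attn(h, cond, cond)[0]  # conditions injected via cross-attention
        z = z + self.mlp(self.n3(z))
        return z
```

Because the conditions never extend the input sequence, the per-step cost stays close to that of the unconditional model, which is consistent with the speed-up over input concatenation reported in Table C.5.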

[W3-3] OPT-6.7B model fine-tuning

We fine-tune the OPT model using LoRA (Low-Rank Adaptation of Large Language Models) with 8-bit quantization. We would like to emphasize that quantized LoRA fine-tuning is very lightweight: a single GPU can handle the model, and the fine-tuning usually takes less than one day. Note that InstructPix2Pix uses the GPT-3.5 turbo API, which incurs additional API cost, while our LoRA fine-tuning is easier, cheaper, and faster.
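For concreteness, a rough sketch of 8-bit LoRA fine-tuning of OPT-6.7B with the Hugging Face transformers / peft / bitsandbytes stack is shown below. The stack choice, target modules, and hyperparameters are our assumptions for illustration, not details confirmed by the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", load_in_8bit=True, device_map="auto")  # 8-bit weights fit on a single GPU

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)   # only the low-rank adapters receive gradients
model.print_trainable_parameters()        # typically well under 1% of the 6.7B parameters
```

The frozen 8-bit base weights plus the small adapter matrices are what make single-GPU fine-tuning feasible.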

Comment

Dear Reviewer e76K,

Thanks for your constructive and valuable comments on our paper. We would like to notify the reviewer that the revised paper has been uploaded. The revised contents are highlighted in magenta. We have revised our paper to address the reviewer's concerns.

  • [W1] Meaning of diffusion for feature extraction: We add the background section of diffusion models in Section 2.1. Due to the page limitation, the detailed discussion is in Appendix A.2, including our response to the reviewer's question.
  • [W2] [W4] Clarity and typos in Section 3.1 and Eq. (4): We fixed the typo. We also omit the time embedding in Figure 3, since the time embedding is too fine a detail compared to the other concepts; instead, we specify that the time embedding is a common choice in existing diffusion models in Appendix A.2.
  • [Q1] Reliability of keyword-based diverse caption generation: We clarify that there exists an additional filtering process in Section 4, and split previous B.3 ("Triplet generation from caption triplets") to B.3 ("Triplet generation from caption triplets") and B.4 ("CLIP-based filtering") for emphasizing the filtering process. We also added our response to the reviewer's concern in Section B.4.
  • [W3-1] Complex training stage: We clarify that the training of our method is not especially complex compared to the others in Section C.1. Also, to avoid confusion, we revise the sentence "Due to the training stability, CompoDiff uses a two-stage training strategy" to "CompoDiff uses a two-stage training strategy" and clarify the meaning of each stage instead. Note that this is not to hide a weakness, but to avoid misreadings such as "our method shows unstable training (hence, does not converge) without two-stage training", which is not true.
  • [W3-2] Inference stage: We add Appendix D.1 Inference time comparison. Note that our first response has an error: CompoDiff takes 0.12s for forwarding a single image, while 0.23s is for batch size 32.
  • [W3-3] OPT-6.7B model fine-tuning: We clarify that the LoRA fine-tuning is lightweight in Section 4. We also would like to emphasize that OPT is only for generating the synthetic training dataset, not for the retrieval model.

Please feel free to ask us anything if the reviewer finds the revised paper insufficient. We are open to discussion and will address all the concerns of the reviewers.

Comment

Reviewer e76K,

We sincerely appreciate your efforts and time for the community. As we approach the close of the author-reviewer discussion period in 3 hours, we wonder whether the reviewer is satisfied with our response. We would be grateful if the reviewer could share their thoughts on the current revision, giving us a valuable extra chance to improve our paper.

Review (Rating: 6)

The authors introduce CompoDiff, a novel diffusion-based model, for the task of Composed Image Retrieval (CIR) with latent diffusion. They also propose a new dataset named SynthTriplets18M. Importantly, the model supports diverse conditions like negative text and image masks, offers control over query importance, and allows trade-offs between inference speed and performance, improving the overall CIR process.

Strengths

  • The concept introduced here is really interesting, although the components used here carry less novelty.

  • The writing of the introduction, and of the overall paper, is quite fluid and easy to understand.

  • The synopsis of every topic is provided in a self-contained manner.

  • Qualitative Figures are well portrayed.

Weaknesses

  • Although the experiments are extensive, little reasoning is provided as to why the methods perform (low/high) in the way they do. More analytical reasoning would be encouraged.

  • The paper could've been written in a more self-contained manner. A basic background of unCLIP, segCLIP and other components could have been provided instead of simply citing the paper, even 2-3 lines would enhance the readability of the paper.

  • The training paradigm seems a bit convoluted. Rephrasing of certain sentences could bring about clarity in the understanding, for instance, discussing a small background on diffusion models first, then bringing in text-image composite part.

  • Although not intuitive in this respect, it makes me wonder what the effect would be if a learnable text prompt were used in the CLIP-text branch.

  • Despite having a few competitors, it would have been better to provide a few baselines focusing on variations of the design components used for the proposed method.

Questions

  • Does this retrieval include images containing multiple target objects for retrieval as well?
  • Although not intuitive in this respect, it makes me wonder what the effect would be if a learnable text prompt were used in the CLIP-text branch.
Comment

We thank the reviewer for their positive feedback and constructive comments. We will address all the concerns raised by the reviewer and will revise our paper as soon as possible (hopefully before Tuesday).

[W1] More analytical reasoning of the CIR benchmark performance gaps is encouraged

Thanks for the comment! First, we would like to emphasize that our experimental results are “zero-shot”, which means the models are not trained on the target dataset. This can cause a significant domain gap between the target dataset domain and the training dataset. CIR datasets, such as FashionIQ and CIRR, have specific domain characteristics that cannot be handled without training on those datasets. For example, the real-world queries that we examine are mainly focused on small edits, additions, deletions, or replacements. However, because the datasets are constructed by letting human annotators write a modification caption for two given images, the text conditions are somewhat different from real-world CIR queries. For example, the CIRR dev set has text conditions like: “show three bottles of soft drink” (different from common real-world CIR text conditions), “same environment different species” (an ambiguous condition), and “Change the type of dog and have it walking to the left in dirt with a leash.” (multiple conditions at the same time). These types of text conditions are extremely difficult to solve in a zero-shot manner; instead, access to the CIRR training dataset is needed.

When we perform a qualitative study on the LAION-2B image index, we observe that the retrieval quality becomes better when increasing the dataset scale from 1M to 18.8M. We could not build a quantitative evaluation benchmark on the LAION-2B index set because, as noted in our previous comment, the existing datasets need human verification, which is infeasible to perform on billion-scale images. Example retrieval results from LAION-2B are shown in Figs. 6 and C.1.

[Q1] Does this retrieval include images containing multiple target objects for retrieval as well?

CIRR partially contains such tasks, for example: “Change the type of dog and have it walking to the left in dirt with a leash.” (multiple conditions at the same time). However, because the CIRR benchmarks are built upon captions written by human annotators, there are no specific subsets or benchmarks targeting multiple target objects. Conceptually, CompoDiff can handle multiple target objects if the target is given in a single sentence (the same as the other methods). If target conditions are given in multiple sentences, we can iteratively edit the latent feature, i.e., orig_feature -> (CompoDiff) -> edited_feature_by_sent_1 -> (CompoDiff) -> edited_feature_by_sent_2, as sketched below. However, we do not have a proper quantitative benchmark to compare the methods in such a scenario.
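A tiny sketch of this iterative editing is given below; `compodiff_edit` is a placeholder for one full CompoDiff denoising pass, not an actual API.

```python
def compose_multiple_instructions(clip_image_feature, instructions, compodiff_edit):
    # e.g. instructions = ["change the type of dog", "have it walking to the left"]
    feature = clip_image_feature
    for text in instructions:
        feature = compodiff_edit(feature, text)   # each pass edits the CLIP latent once
    return feature                                # final feature is used for retrieval
```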

[W2] [W3] Enhance writing quality: backgrounds and training paradigm

Thanks for your suggestion. We agree that the current manuscript can be a bit difficult to understand without enough background. We will revise the paper in the rebuttal period. We will inform the reviewer when we upload the revised paper.

[W4] [Q2] Learnable text prompts

Thanks for the suggestion. We believe the intuition behind this comment is: “Will the quality of the text embedding affect the final retrieval performance?”, and the answer is “yes”. This is shown by our experiments (Table 6) on changing the text encoder used for the condition of the CompoDiff diffusion model. In the table, we can observe that using a better text encoder boosts the CIR performance by a large margin (e.g., 38.20 -> 44.11 for FashionIQ recall and 29.19 -> 39.25 for CIRR recall). This means that if we can use better text information, the overall performance will improve.

However, unfortunately, it is not straightforward to apply learnable text prompts to our method. This is because our text encoder is not used for feature matching, as in other learnable-text-prompt literature; instead, the text embedding is used for the cross-attention of the diffusion model and as the supervision for the classifier-free guidance of our diffusion model. It is not trivial to apply learnable prompts in such a scenario, especially for classifier-free guidance. We have searched related works, but we could not find any sound method applicable to CompoDiff. If the intuition behind this question is the relationship between the quality of the text information and the retrieval performance, as we assumed, we hope our experimental result answers the question.
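For context, classifier-free guidance combines conditional and unconditional predictions; a generic two-condition form is

$$\hat{\epsilon}_\theta(z_t) = \epsilon_\theta(z_t, \varnothing) + w_T\big(\epsilon_\theta(z_t, c_T) - \epsilon_\theta(z_t, \varnothing)\big) + w_I\big(\epsilon_\theta(z_t, c_I) - \epsilon_\theta(z_t, \varnothing)\big),$$

where $c_T$ is the text condition, $c_I$ the image condition, and the weights $w_T$, $w_I$ are illustrative rather than the paper's exact formulation. This form illustrates why a learnable prompt would interact with the guidance terms rather than with a simple feature-matching loss.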

Comment

[W5] Other variants of CompoDiff

Thanks for the suggestion. We examined some variants of CompoDiff in Table C.5. Note that some variants cannot handle triplet relationships but only pairwise relationships; therefore, we measure text-to-image and image-to-text retrieval performance for comparison. In the table, we observe that our design choice of conditioning via the cross-attention mechanism shows remarkably more efficient inference than naive input concatenation (3x faster). We have also conducted an additional experiment with an RN50 backbone and observe that our design choice still works well with other backbone architectures.

| Method | FIQ R@10 | FIQ R@50 | CIRR R@1 | CIRR R_s@1 | CIRCO mAP@5 | CIRCO mAP@10 | CIRCO mAP@25 | GeneCIS R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CompoDiff (RN50) | 35.62 | 48.45 | 18.02 | 57.16 | 12.01 | 13.28 | 15.41 | 14.65 |
| CompoDiff (ViT-L) | 37.36 | 50.85 | 19.37 | 59.13 | 12.31 | 13.51 | 15.67 | 15.11 |
| CompoDiff (ViT-G) | 39.02 | 51.71 | 26.71 | 64.54 | 15.33 | 17.71 | 19.45 | 15.48 |
Comment

Dear Reviewer wkZo,

Thanks for your constructive and valuable comments on our paper. We would like to notify the reviewer that the revised paper has been uploaded. The revised contents are highlighted in magenta. We have revised our paper to address the reviewer's concerns.

  • [W1] More analytical reasoning of the CIR benchmark performance gaps is encouraged: We added the discussion in Appendix C.3.
  • [Q1] Does this retrieval include images containing multiple target objects for retrieval as well?: We clarify that CIRR contains such case in Appendix C.3.
  • [W2] [W3] Enhance writing quality: backgrounds and training paradigm: We have updated Section 3.1 (diffusion model background) and 3.2 (training details). Due to the page limitation, the details of the diffusion model are in Appendix A.2.
  • [W4] [Q2] Learnable text prompts: We mention the idea in Section 5.3 "Condition text encoder".
  • [W5] Other variants of CompoDiff: We added RN50 results in Table 2.

Please feel free to ask us anything if the reviewer finds the revised paper insufficient. We are open to discussion and will address all the concerns of the reviewers.

Comment

Reviewer wkZo,

We sincerely appreciate your efforts and time for the community. As we approach the close of the author-reviewer discussion period in 3 hours, we wonder whether the reviewer is satisfied with our response. We would be grateful if the reviewer could share their thoughts on the current revision, giving us a valuable extra chance to improve our paper.

Review (Rating: 5)

This paper proposes a composed image retrieval method, called CompoDiff, based on a diffusion model, and proposes a large-scale dataset, called SynthTriplets18M, for the composed image retrieval task. In the experiments, the qualitative and quantitative results demonstrate that the composed image retrieval performance of the proposed CompoDiff exceeds that of the comparison methods. Comparing the results of the model trained with different scales of data shows that a large amount of training data can improve the model's performance.

Strengths

  1. This paper proposes a novel CIR method that can additionally limit the scope of the search image based on the input mask and other conditions.
  2. This method can also control the balance between retrieval accuracy and retrieval efficiency without training, as well as control the impact of each condition on the retrieval results.
  3. This paper proposes a dataset that promotes the development of CIR-related research and illustrates, to a certain extent, the impact of dataset size on methods.

Weaknesses

  1. The authors raise the problem of requiring triplets for training, but it is not solved well; the authors just propose a larger dataset.
  2. The experimental results show that the proposed dataset and other datasets perform similarly at the same scale, and once the data volume reaches a certain level, continuing to increase the dataset size may not significantly improve model performance. This negates the value of the dataset to a certain extent.
  3. The dataset used by the comparison methods is inconsistent with the dataset used by the proposed method, which is not quite fair.

Questions

  1. Can additional comparison experiments be conducted, for example, where both the proposed method and the comparison methods are trained on SynthTriplets18M, to illustrate the effectiveness of the proposed method?
  2. The paper mentions that the size of the dataset is very important. Can experiments on dataset size be conducted with other methods to illustrate the effectiveness of the dataset?
Comment

We thank the reviewer for their positive comments and constructive feedback. We will address all the concerns raised by the reviewer and will revise our paper as soon as possible (hopefully before Tuesday).

[W1] The authors just proposed a larger dataset

Our main claim concerns the problem of requiring pre-collected and human-verified triplets, not triplets themselves. We will tone down our argument in the introduction section as: “obtaining human-verified high-quality triplets can be costly”. We will revise our paper as soon as possible. Meanwhile, we would like to emphasize that the dataset scale-up is one of our main contributions: it is challenging to build a synthetic dataset that shows a scaling law, i.e., more data points lead to better performance. Our dataset construction process achieves this property by employing keyword-based caption generation that ensures the diversity of the captions. As shown in Table 4, the previous data generation process by IP2P yields inferior model performance compared to our generation process at the same scale (27.2 vs. 31.9 in FashionIQ average recall). We discuss more details in the next bullet.

[W2] The value of the dataset is limited to a certain extent

We would like to emphasize that the scales shown in Table 4 are already very large. Note that 1M is the scale of ImageNet, and 10M is ten times larger than ImageNet; it is even larger than ImageNet-21K (11.8M images). Scaling up the dataset from 10M to 18.8M shows somewhat saturated performance improvements on the benchmarks, but we would argue that the benchmark numbers are somewhat noisy in CIR. When we perform a qualitative study on the LAION-2B image index, we observe that the retrieval quality becomes better when increasing the dataset scale from 10M to 18.8M. The reason the given benchmark scores look saturated is that the FashionIQ and CIRR datasets have specific domain characteristics that cannot be handled without training on those datasets. For example, the real-world queries that we examine are mainly focused on small edits, additions, deletions, or replacements. However, because the datasets are constructed by letting human annotators write a modification caption for two given images, the text conditions are somewhat different from real-world CIR queries. For example, the CIRR dev set has text conditions like: “show three bottles of soft drink” (different from common real-world CIR text conditions), “same environment different species” (an ambiguous condition), and “Change the type of dog and have it walking to the left in dirt with a leash.” (multiple conditions at the same time). These types of text conditions are extremely difficult to solve in a zero-shot manner; instead, access to the CIRR training dataset is needed.

We are not certain what “other data sets” refers to in the comment “The experimental results show that the effect of the proposed data set and other data sets are similar at the same level”, but if it denotes the CIR datasets, such as the CIRR or FashionIQ training sets, we would like to emphasize that this is not a fair comparison with ours. We aim to solve “zero-shot” CIR, i.e., without accessing the expensive human-collected triplets. Directly comparing our zero-shot CIR results with supervised CIR results on the target dataset is not fair. Note that the CLIP ViT-B/16 ImageNet zero-shot classification performance is 80.9 (with a training dataset scale of 400M), worse than that of DeiT-B/16 (a supervised method on the 1.2M ImageNet training set), 84.2, yet we do not argue that the value of CLIP's zero-shot performance is negated.

[W3] Inconsistent training dataset for comparison methods / [Q1] Additional comparison experiments for the comparison method are trained on SynthTriplets18M

We already use the same training dataset for the other triplet-based fusion methods, such as ARTEMIS and Combiner. For a fair comparison in terms of backbone size, we have conducted additional experiments with CompoDiff using the same backbone:

| Method | FIQ R@10 | FIQ R@50 | CIRR R@1 | CIRR R_s@1 | CIRCO mAP@5 | CIRCO mAP@10 | CIRCO mAP@25 | GeneCIS R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARTEMIS (RN50) | 33.24 | 47.99 | 12.75 | 21.95 | 9.35 | 11.41 | 13.01 | 13.52 |
| Combiner (RN50) | 34.30 | 49.38 | 12.82 | 24.12 | 9.77 | 12.08 | 13.58 | 14.93 |
| CompoDiff (RN50) | 35.62 | 48.45 | 18.02 | 57.16 | 12.01 | 13.28 | 15.41 | 14.65 |
| CompoDiff (ViT-L) | 37.36 | 50.85 | 19.37 | 59.13 | 12.31 | 13.51 | 15.67 | 15.11 |
| CompoDiff (ViT-G) | 39.02 | 51.71 | 26.71 | 64.54 | 15.33 | 17.71 | 19.45 | 15.48 |

(cont.)

Comment

The table shows that in most of the metrics, CompoDiff still outperforms the comparison methods with large gaps. For example, CompoDiff RN50 shows 18.02 CIRR R@1, while ARTEMIS and Combiner show 12.75 and 12.82, respectively. If we scale up the backbone to ViT-L and ViT-G, the scores become 19.37 and 26.71, respectively. We will include the full results in the paper.

Note that Pic2Word and SEARLE cannot be trained on triplet datasets, because they are trained only on image datasets by design. For example, Pic2Word trains a projection module from the output of the image encoder to a token embedding of the text encoder, minimizing the loss between <the image embedding, and the text embedding of “a photo of [projected image embedding]”>. Therefore, it is not straightforward to train Pic2Word on SynthTriplets18M (SEARLE employs almost the same strategy as Pic2Word, hence it is also impossible). Thus, we evaluate the other methods as fairly as possible: if a method cannot be trained on the same dataset due to its design, we use the same architecture (e.g., ViT-L for Pic2Word and SEARLE); if a method can be trained on the same dataset, we train the model on SynthTriplets18M (e.g., ARTEMIS and Combiner) with the same backbone (RN50).

[Q2] Impact of the dataset scale on the other methods

Thanks for the suggestion. We have conducted the same experiment as Table 4 for ARTEMIS and Combiner on FashionIQ avg(R@10, R@50) and CIRR avg(R@1, R_s@1). Note that it is impossible to train Pic2Word and SEARLE on our dataset as discussed above.

| FashionIQ Avg Rs | IP2P (1M) | 1M | 5M | 10M | 18.8M (proposed) |
| --- | --- | --- | --- | --- | --- |
| CompoDiff | 27.24 | 31.91 | 38.11 | 42.41 | 42.33 |
| ARTEMIS | 26.03 | 27.44 | 36.17 | 41.35 | 40.62 |
| Combiner | 29.83 | 29.64 | 35.23 | 41.81 | 41.84 |

| CIRR Avg Rs | IP2P (1M) | 1M | 5M | 10M | 18.8M (proposed) |
| --- | --- | --- | --- | --- | --- |
| CompoDiff | 27.42 | 28.32 | 31.50 | 37.25 | 37.83 |
| ARTEMIS | 14.91 | 15.12 | 15.84 | 17.56 | 17.35 |
| Combiner | 16.50 | 16.88 | 17.21 | 18.77 | 18.47 |

In all methods, we observe a similar phenomenon. Using more data points significantly increases the performance (e.g., CompoDiff FIQ shows 38.11 -> 42.33 for 1M -> 18.8M). Also, using better-quality data points is remarkably helpful for the performance. For example, the same amount of data points (1M) generated by IP2P and by our process yields 27.24 and 31.91 FashionIQ average recall, respectively. We will add this discussion in the revised paper.

Comment

Dear Reviewer FhVu,

Thanks for your constructive and valuable comments on our paper. We would like to notify the reviewer that the revised paper has been uploaded. The revised contents are highlighted in magenta. We have revised our paper to address the reviewer's concerns.

  • [W1] The authors just proposed a larger dataset: We clarify in the Introduction that previous CIR methods are based on pre-collected, human-verified triplet datasets. We also emphasize that human verification is not scalable, while our dataset can be scaled automatically to a size infeasible for manual collection (e.g., more than 1M). The related discussions are in Section 5.3 and Appendix C.3.
  • [W2] The value of the dataset is limited to a certain extent: We add a related discussion in Section 5.3, Appendix C.3. Particularly, we add a discussion of why the existing benchmarks are somewhat unreliable benchmarks for evaluating authentic zero-shot CIR performances in Appendix C.3. We also clarify that our main evaluation is zero-shot, while FashionIQ-trained counterpart is fully supervised in the Introduction.
  • [W3] Inconsistent training dataset for comparison methods / [Q1] Additional comparison experiments for the comparison method are trained on SynthTriplets18M: We clarify that it is impossible to train Pic2Word and SEARLE in Section 5.1 and Appendix C.3. We also added CompoDiff RN50 results for a fair comparison with other RN50-based methods on SynthTriplets18M.
  • [Q2] Impact of the dataset scale to the other methods: We add the experiments and discussions in Appendix D.4 (Table D.8)

Please feel free to ask us anything if the reviewer finds the revised paper insufficient. We are open to discussion and will address all the concerns of the reviewers.

Comment

Reviewer FhVu,

We sincerely appreciate your efforts and time for the community. As we approach the close of the author-reviewer discussion period in 3 hours, we wonder whether the reviewer is satisfied with our response. We would be grateful if the reviewer could share their thoughts on the current revision, giving us a valuable extra chance to improve our paper.

Comment

Thanks to the authors for providing the additional experimental results. From the added results, the performance of the models is clearly improved when the scale of the dataset is expanded from 1M to 10M. However, there is only a slight improvement in performance when the scale of the dataset is expanded from 10M to 18M, which implies that it may be unnecessary to expand the scale of the dataset to 18M. In addition, the proposed method requires the mask of the object in the reference image as a condition for training, which makes it hard to train the model on other datasets. Therefore, I would keep my rating for the post-rebuttal stage.

Review (Rating: 5)

This paper addresses the limitations of small datasets and a small set of condition types in the Composed Image Retrieval (CIR) task through a new diffusion-based model (CompoDiff) and a large-scale CIR dataset (SynthTriplets18M). The authors show the generalizability of their dataset by training CompoDiff and evaluating it on existing CIR benchmarks.

Strengths

  • Clear presentation of their approach
    This paper tackles the lack of a dataset for this task, which raises the generalizability of this task. It is a clear objective to present a large-scale dataset. To construct on a large scale, they utilize a large generative model such as Stable Diffusion to produce a synthetic dataset. This reviewer can agree with the direction of their approach, and it contributes to this community.

  • Flexibility of their model
    Beyond improving performance on the existing CIR benchmarks, they also suggest a more flexible manner of CIR, including negative texts or masks, which gives large potential for application.

Weaknesses

  • A small improvement using a better backbone architecture
    For a fair comparison, it is hard to agree that their model (ViT-L) performs better than the previous arts, whose backbones (RN50) are much lighter (Table 2). Therefore, this reviewer recommends comparing them with backbones of similar capacity (the same backbone is the best option) as much as possible.
  • Efficiency comparison with the previous arts
    Even though they show inference time varying the diffusion steps, this reviewer suggests the comparison of latency or flops with the previous arts. Their intra-model analysis is also w
  • An insufficient contribution of CompoDiff
    To this reviewer's understanding, CompoDiff is a minor modified version of Stable Diffusion. This reviewer considers that their flexibility with conditions also stems from the power of Stable Diffusion. Also, this reviewer wonders what the performance of Stable Diffusion trained on SynthTriplets18M would be on the CIR benchmark.

Questions

The questions are naturally raised in the weaknesses section.

Comment

We thank the reviewer for their positive feedback and constructive comments. We will address all the concerns raised by the reviewer, and will revise our paper as soon as possible (Hopefully before Tuesday).

[W1] Comparison with the same backbone

Thanks for the comment. We would like to emphasize that CompoDiff has the same backbone architecture (e.g., ViT-L) as Pic2Word and SEARLE, which are designed for zero-shot CIR. We have trained CompoDiff with RN50 to compare with ARTEMIS and Combiner using the same backbone, as the reviewer suggested. The table below shows that on most of the metrics, CompoDiff still outperforms the comparison methods by large margins. For example, CompoDiff RN50 shows 18.02 CIRR R@1, while ARTEMIS and Combiner show 12.75 and 12.82, respectively. If we scale up the backbone to ViT-L and ViT-G, the scores become 19.37 and 26.71, respectively. This shows that the improvement by CompoDiff comes not from a larger backbone but from our novel diffusion-based retrieval strategy. We will include the full results in the paper.

| Method | FIQ R@10 | FIQ R@50 | CIRR R@1 | CIRR R_s@1 | CIRCO mAP@5 | CIRCO mAP@10 | CIRCO mAP@25 | GeneCIS R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARTEMIS (RN50) | 33.24 | 47.99 | 12.75 | 21.95 | 9.35 | 11.41 | 13.01 | 13.52 |
| Combiner (RN50) | 34.30 | 49.38 | 12.82 | 24.12 | 9.77 | 12.08 | 13.58 | 14.93 |
| CompoDiff (RN50) | 35.62 | 48.45 | 18.02 | 57.16 | 12.01 | 13.28 | 15.41 | 14.65 |
| CompoDiff (ViT-L) | 37.36 | 50.85 | 19.37 | 59.13 | 12.31 | 13.51 | 15.67 | 15.11 |
| CompoDiff (ViT-G) | 39.02 | 51.71 | 26.71 | 64.54 | 15.33 | 17.71 | 19.45 | 15.48 |

In addition, we would like to emphasize that our main contribution is two-fold: (1) the diffusion-based CIR model and (2) a new synthetic triplet dataset. We believe that a weakness in the first contribution does not harm the second one.

[W2] Efficiency comparison

Thanks for the suggestion. We agree with the comment. Below, we add a table comparing the inference speed of the compared methods. The table will be added to the revised paper soon.

| Method | Inference time for a single batch (secs) |
| --- | --- |
| Pic2Word (ViT-L) | 0.02 |
| SEARLE (ViT-L) | 0.02 |
| CompoDiff (ViT-L) | ~~0.23~~ 0.12 |
| ARTEMIS (RN50) | 0.005 |
| Combiner (RN50) | 0.006 |

In the table, we can confirm that CompoDiff is practically useful with high throughput (about ~~230ms~~ 120ms). Note that these numbers highly depend on the hardware. For example, in Table 5 in our main paper, we report the average inference time as 120ms with a better machine; in the table above, we measured the numbers with a slightly worse machine than the one used for Table 5. The previous numbers were based on batch size 32, while the updated numbers are based on batch size 1. We have fixed the table to avoid confusion.

One of the advantages of our method is that we can control the trade-off between retrieval performance and inference time, which is impossible for the other methods. If we need a faster inference time, even at the cost of worse retrieval performance, we can reduce the number of diffusion steps. More detailed experiments on this trade-off are shown in Table 5.
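As a rough sketch of this knob (placeholder names, not the actual CompoDiff API): the number of denoising steps is simply an argument of the query-side edit, and retrieval afterwards is a plain nearest-neighbour search in the frozen CLIP space.

```python
import torch.nn.functional as F

def retrieve(query_image_feat, conditions, gallery_features, compodiff_denoise,
             num_steps=10, topk=10):
    # fewer steps -> faster query processing, possibly lower retrieval quality
    edited = compodiff_denoise(query_image_feat, conditions, steps=num_steps)
    sims = F.cosine_similarity(edited.unsqueeze(0), gallery_features, dim=-1)
    return sims.topk(k=topk).indices
```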

Also, we would like to emphasize that one of our main contributions is a novel and efficient conditioning scheme for the diffusion model. Instead of using a concatenated vector of all conditions and inputs as the input of the diffusion model, we use the cross-attention mechanism for the conditions and leave the input size the same as the original size. As shown in Table C.5, our design choice is three times faster than the naive implementation.

Comment

[W3] CompoDiff vs. StableDiffusion

We would like to clarify that CompoDiff and StableDiffusion are not the same. CompoDiff was partially inspired by LatentDiffusion (the original paper of StableDiffusion), and therefore CompoDiff and StableDiffusion share some similarities; for example, their diffusion processes both operate on feature spaces. However, CompoDiff has distinct differences from StableDiffusion, summarized by the following key points:

  1. Latent space is different
    • StableDiffusion performs the diffusion process on a 64 x 64 “latent space”, where the latent is the embedding space of VQ-GAN. On the other hand, the diffusion process of CompoDiff is performed on the CLIP latent embedding space (a 1024-dim feature for ViT-L). Therefore, the features edited by CompoDiff can be directly used for retrieval in the CLIP space, while StableDiffusion's cannot.
    • Note that the prior model in Dalle-2 (Ramesh et al., 2022) also performs the diffusion process on the CLIP embedding space, but CompoDiff takes different inputs and conditions from the Dalle-2 prior (please see point 3, “The method for handling conditions is different”, for more details).
  2. Architecture is different
    • CompoDiff’s diffusion model is based on a Transformer module (while StableDiffusion and other conventional models are based on a U-Net structure).
  3. The method for handling conditions is different
    • Moreover, CompoDiff is designed to handle multiple conditions, such as masked conditions. StableDiffusion cannot handle such localized conditions, while CompoDiff can.
    • CompoDiff also handles conditions differently from the Dalle-2 prior. The Dalle-2 prior takes conditions as the input of the diffusion model, while our CompoDiff diffusion Transformer takes the conditions via the cross-attention mechanism.
    • This design choice makes CompoDiff's inference faster. If the conditions are concatenated to the input tokens, the inference speed is significantly degraded. Table C.5 shows that a structure taking all conditions as a concatenated input (e.g., Dalle-2 prior-like) is three times slower than our cross-attention approach.
  4. The targeting scenario is different
    • CompoDiff is designed to deal with a triplet relationship, while the other diffusion models are typically designed for a paired relationship (e.g., text-to-image generation). Our method is related to InstructPix2Pix from this viewpoint; however, since InstructPix2Pix exactly follows the StableDiffusion model structure, CompoDiff still has the above advantages over InstructPix2Pix as well.
Comment

Dear Reviewer AJF4,

Thanks for your constructive and valuable comments on our paper. We would like to notify the reviewer that the revised paper has been uploaded. The revised contents are highlighted in magenta. We have revised our paper to address the reviewer's concerns.

  • [W1] Comparison with the same backbone: We add CompoDiff RN50 results in Table 2.
  • [W2] Efficiency comparison: We add Appendix D.1 Inference time comparison. Note that our first response has an error: CompoDiff takes 0.12s for forwarding a single image, while 0.23s is for batch size 32.
  • [W3] CompoDiff vs. StableDiffusion: We add the related discussion in Appendix A.2. We clarify the details in the response in the section.

Please feel free to ask us anything if the reviewer finds the revised paper insufficient. We are open to discussion and will address all the concerns of the reviewers.

Comment

Reviewer AJF4,

We sincerely appreciate your efforts and time for the community. As we approach the close of the author-reviewer discussion period in 3 hours, we wonder whether the reviewer is satisfied with our response. We would be grateful if the reviewer could share their thoughts on the current revision, giving us a valuable extra chance to improve our paper.

Comment

Thanks for your effort in addressing my questions on this rebuttal.

I appreciate their consistent performance improvement on the other backbone and the detailed explanation of the difference between CompoDiff and the other Diffusion variants. Also, from the first review, I agree that their proposed dataset is valuable.

However, I am still concerned about the application of diffusion to retrieval tasks. Their results also demonstrate that CompoDiff is inefficient at inference (6x slower than the model with the same backbone). I am not confident that this direction of study gives the right message to this community.

Therefore, I would keep my rating for the post-rebuttal stage, but am ready to follow the other reviewers' perspectives on this paper.

Comment

We truly appreciate all the reviewers and the chairs for their commitment to the community and thoughtful reviews. We are highly encouraged that the reviewers found that the motivation and direction of our method are interesting and can contribute to the community (Reviewers AJF4, wkZo), that our work bridges the gap between diffusion models and image retrieval (Reviewer e76K), that our model is flexible and controllable (Reviewers AJF4, FhVu), that our dataset can promote CIR-related research (Reviewers AJF4, FhVu, e76K), and that our paper is fluid and easy to understand (Reviewers AJF4, wkZo). We are fortunate to have the chance to make our work stronger with the constructive and valuable comments from the reviewers. We have addressed all the concerns raised by the reviewers in the revised paper. The revised contents are highlighted in magenta.

  • Introduction: We clarify that the existing CIR datasets are not scalable due to expensive human verification, while our dataset is fully automated and highly scalable. Also, we emphasize that by scaling up the dataset size, our approach consistently improves the overall performance, which is somewhat non-trivial, especially for triplet relationships. We also clarify that our main evaluation is zero-shot, while the FashionIQ-trained counterpart is fully supervised (Reviewer FhVu).
  • Preliminary: Following Reviewer wkZo's suggestion, we add Section 3.1 and Appendix A.2 for a more detailed background on diffusion models. We clarify that the notion of a diffusion model is still sound in a latent embedding space (Reviewer e76K). We also clarify the difference between CompoDiff and StableDiffusion in the same section (Reviewer AJF4).
  • Method: Following Reviewer wkZo's suggestion, we revised the text to make it clearer. We also omit the time embedding in Figure 3, since the time embedding is too fine a detail compared to the other concepts; instead, we specify that the time embedding is a common choice in existing diffusion models in Appendix A.2 (Reviewer e76K).
  • Dataset construction: We clarify that LoRA fine-tuning is lightweight, and that there is an additional CLIP-based filtering step (Reviewer e76K).
  • Additional results in Table 2: We add CompoDiff RN50 in Table 2 (Reviewers AJF4, FhVu).
  • Discussions related to CIR benchmarks: In Section 5.1, we clarify that the current benchmarks are somewhat insufficient. More details are in Appendix C.3 (Reviewer wkZo, FhVu)
  • Dataset scale analysis: We revise Section 5.3 (Impact of dataset scale) to clarify our contribution (as well as some texts in Section 5.1). We also add the same experiments with ARTEMIS and Combiner in Appendix D.4 (Table D.8) (Reviewer FhVu)
  • Training complexity: We clarify the complexity of our training strategy in Section C.3 (Reviewer e76K)
  • Inference time comparison: We clarify that CompoDiff is not much slower than the others in Appendix D.1 (Reviewers AJF4, e76K).

Please feel free to ask us anything if the reviewers find the revised paper insufficient. We are open to discussion and will address all the concerns of the reviewers.

Comment

Dear all the reviewers,

Thanks for your time in reviewing our paper. We have addressed all the concerns raised by the reviewers with revisions of the text and additional experiments (CompoDiff RN50, ARTEMIS and Combiner experiments varying the dataset scale, and the inference time comparison). It would be a pleasure if the reviewers could re-evaluate our paper based on the revised version.

As the revision period closes in 15 minutes, we will not be able to update our manuscript in time, but we will address all the concerns from the reviewers in the final revision.

AC Meta-Review

This paper proposes a novel diffusion-based method for composed image retrieval (CIR), i.e., retrieval based on a query formulated as an image with additional text, introducing a novel synthetic training dataset of 18M samples. The method is evaluated in zero-shot mode on several existing CIR datasets. The author rebuttal addressed part of the reviewers' concerns. One reviewer upgraded their rating, while two others retained their ratings in light of the rebuttal (and one reviewer did not respond to the rebuttal). None of the reviewers strongly argues for acceptance of the paper.

Why not a higher score

Only one reviewer makes a (marginal) accept recommendation.

Why not a lower score

N/A

Final Decision

Reject