PaperHub
Average rating: 5.3 / 10 (Rejected; 4 reviewers; min 3, max 8, std 1.8)
Individual ratings: 3, 5, 8, 5
Confidence: 4.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025

LRVS-Fashion: Extending Visual Search with Referring Instructions

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

This paper introduces a large dataset of paired fashion images dedicated to Referred Visual Search, a conditional retrieval task.

Abstract

Keywords
Visual Search · Image Embedding · Retrieval · Dataset

Reviews and Discussion

Review (Rating: 3)

This paper introduces a new task, Referred Visual Search, which retrieves specified products within a given image based on user requirements, and constructs a new dataset for this task. Additionally, a CLIP-based model architecture is proposed and its performance is validated on the newly created dataset.

Strengths

  1. The proposed new task has practical application value in the e-commerce domain.
  2. The illustrations are clear and detailed, facilitating understanding.

Weaknesses

  1. The significance of the proposed task needs further clarification. What distinguishes Referred Visual Search from text-guided visual localization or composed image retrieval? Can existing methods in these areas be adapted to this task?
  2. The paper lacks novelty, as applying the CLIP architecture in the e-commerce domain has been previously explored [a-c]. The authors need to further explain the innovation of their proposed method compared to existing methods.
  3. The experiments require further refinement. Given that the authors utilized methods from composed image retrieval for performance comparison, it is essential to validate the proposed method on the FashionIQ and Shoes datasets within this context to effectively demonstrate its efficacy.
  4. The proposed method incorporates the Grounding DINO and SAM models; thus, a discussion on the fairness of performance comparisons with other methods is warranted. Additionally, the time and space complexity of the proposed method should be compared against other representative methods to provide a comprehensive evaluation.

[a] Han Y, Zhang L, Chen Q, et al. Fashionsap: Symbols and attributes prompt for fine-grained fashion vision-language pre-training[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 15028-15038.

[b] Zhou J, Zheng X, Lyu Y, et al. E-clip: Towards label-efficient event-based open-world understanding by clip[J]. arXiv preprint arXiv:2308.03135, 2023.

[c] Han X, Zhu X, Yu L, et al. Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 2669-2680.

Questions

Please refer to the Weaknesses section and address my concerns regarding the task setup, methodological innovation, and performance of the proposed method.

Comment

We thank the reviewer for their feedback and for recognizing the value of our task in the e-commerce domain. We address their concerns below.

W1.1 The significance of the proposed task needs further clarification. What distinguishes Referred Visual Search from text-guided visual localization or composed image retrieval?

Text-Guided Visual Localization is a localization task aimed at explicitly locating an object in an image based on textual instructions; it is not a retrieval task.

Composed Image Retrieval (CIR) is a specific case of Conditional Image Similarity (CIS), where the conditioning text provides editing instructions to retrieve a different item from a gallery, as explained in [1].

Referred Visual Search (RVS) is also a CIS task, but its goal is to extract the embedding of an existing item from a complex input image. It is similar to the unnamed “Focus Object” task in GeneCIS [1], but our benchmark is significantly more demanding, with a gallery size of over 2M images compared to only 15 in GeneCIS.

We have revised our work to better clarify this distinction between our task, CIR, and CIS.

W1.2 Can existing methods in these areas be adapted to this task?

Adapting existing CIR methods is an interesting direction. We used some of these methods as baselines and added more in our revised version; none of them is competitive with our proposed Conditional-ViT (see Table 2 of the revised version). Our results demonstrate that dedicated research efforts are needed to bridge the gap between these two tasks.

W2. The paper lacks novelty, as applying the CLIP architecture in the e-commerce domain has been previously explored [a-c]. The authors need to further explain the innovation of their proposed method compared to existing methods.

We only use CLIP as a smart initialisation. Our key difference is that we perform early multimodal fusion of the image and text modalities, which allows us to obtain multimodal representations (text+image or category+image) without complex post-processing, whereas CLIP can only compute single-modality representations that are comparable across modalities.
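For illustration, a minimal PyTorch-style sketch of this kind of early fusion is given below (the module names, dimensions, and the use of a generic transformer encoder are simplified choices for exposition, not our exact implementation):

```python
import torch
import torch.nn as nn

class ConditionalViT(nn.Module):
    """Sketch of early multimodal fusion: a conditioning token derived from a
    text/category embedding is prepended to the image patch tokens, so every
    transformer layer attends jointly to both modalities (positional embeddings
    omitted for brevity)."""

    def __init__(self, patch_dim=768, cond_dim=512, depth=12, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, patch_dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        self.cond_proj = nn.Linear(cond_dim, patch_dim)  # maps the condition into token space
        layer = nn.TransformerEncoderLayer(patch_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(patch_dim, cond_dim)

    def forward(self, images, cond_emb=None):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = [self.cls_token.expand(b, -1, -1), patches]
        if cond_emb is not None:                                       # optional conditioning token
            tokens.insert(1, self.cond_proj(cond_emb).unsqueeze(1))
        x = self.encoder(torch.cat(tokens, dim=1))
        return nn.functional.normalize(self.head(x[:, 0]), dim=-1)    # CLS embedding
```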

Regarding previous uses of CLIP in the e-commerce domain:

  • FashionSAP is a fashion retrieval model, but it does not address CIS.
  • FAME-ViL is a multi-task fashion model that includes CIR, but not RVS. We have added it to our baselines.
  • e-CLIP is an adaptation of CLIP for event-based recognition, and we find it difficult to draw a direct connection with RVS, CIS, or fashion retrieval in general.

W3. The experiments require further refinement. Given that the authors utilized methods from composed image retrieval for performance comparison, it is essential to validate the proposed method on the FashionIQ and Shoes datasets within this context to effectively demonstrate its efficacy.

As mentioned earlier, we propose a benchmark for RVS that differs from CIR. We adapt some methods from the CIR literature to our setup, fully aware that they were not specifically tailored for our task. As expected, these methods do not yield strong results on our benchmark. Reciprocally, we do not expect our method, which is specifically designed for RVS, to perform well on CIR. In such tasks, specialization is key to attain good performances.

W4. The proposed method incorporates the Grounding DINO and SAM models; thus, a discussion on the fairness of performance comparisons with other methods is warranted. Additionally, the time and space complexity of the proposed method should be compared against other representative methods to provide a comprehensive evaluation.

Our method incorporates neither Grounding DINO nor SAM. They are only used to create a strong baseline (Table 1), which we are able to beat with our proposed architecture (Table 2). We agree that Grounding DINO and SAM are computationally heavy, but they underperform our more frugal approach anyway.

References:

[1] Vaze, S., Carion, N., & Misra, I. (2023). Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Comment

Thank you to the authors for their response, which addressed some of my concerns. However, for the following reasons, this paper is not yet ready for publication:

  1. Task level: Although the authors state that the task is Referred Visual Search (RVS), I believe that, in essence, it is not fundamentally different from existing tasks such as Composed Image Retrieval (CIR). Both involve learning a mapping to the target visual modality by combining visual and textual modalities. Therefore, like other reviewers, I also consider this task to be a subset of CIR. Additionally, the authors have not provided concrete examples for this task, leaving its practical application scenarios unclear.

  2. Method level: The authors have not adequately explained the novelty of the proposed approach. Early multimodal fusion of image/text modalities is a common practice, especially in e-commerce.

  3. The statement: “Reciprocally, we do not expect our method, which is specifically designed for RVS, to perform well on CIR. In such tasks, specialization is key to attain good performances.” is still not solid enough. For example, in the e-commerce domain, a model like FAME-ViL that can effectively address multiple tasks (or incorporate adapters) would save significant storage and maintenance costs.

Comment

Thank you for engaging in the discussion:

  1. We do not base the difference between CIR and RVS on our own opinion but on [1], a paper published at CVPR, the top conference in the domain, which clearly explains why both are distinct subtasks of Conditional Image Similarity (CIS). An e-commerce use case for RVS is providing an image of a celebrity and requesting the exact same hat, as mentioned in the introduction of the paper.
  2. We could not find a reference to prior work using our Conditional-ViT architecture trained with an asymmetric contrastive loss; maybe you could help us in that endeavor (a simplified sketch of this training objective is given after this list). Besides, the novelty also lies in our proposed large dataset and benchmark, and the extensive experiments that come with them.
  3. We tested FAME-ViL and, like other recent CIR methods, it does not perform well enough for RVS, obtaining only 9.3% R@1 compared to 69.3% for our proposed Conditional-ViT. See Table 2.
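For illustration, one simplified form that such an asymmetric contrastive loss can take (not our exact implementation; the temperature value is arbitrary) is:

```python
import torch
import torch.nn.functional as F

def asymmetric_info_nce(query_emb, target_emb, temperature=0.07):
    """Illustrative asymmetric (one-directional) InfoNCE: conditioned query
    embeddings are pulled towards their paired target (isolated product)
    embeddings, with the other targets in the batch acting as negatives.
    Unlike CLIP's symmetric loss, only the query->target direction is used."""
    query_emb = F.normalize(query_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = query_emb @ target_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)              # no target->query term
```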

Please consider this factual evidence when grading our work.

Comment

As today marks the final day of the discussion phase, we note that the reviewer has not provided any counter-arguments. We tested all the suggested methods, and the evidence strongly supports the conclusion that RVS and CIR are two distinct tasks.

Review (Rating: 5)

This paper proposes a new visual search task with referring instructions as the hints. It also creates a new large public dataset, LRVS-Fashion, as the benchmark. The authors also introduce a CLIP-like simple baseline.

Strengths

  1. The effort put into collecting the benchmark is worthy of applause. This dataset could potentially be used for other fashion-related tasks.
  2. The writing of this paper is good and easy to follow.

Weaknesses

  1. I am not convinced of the motivation for the proposed task. I cannot fully imagine a real application scenario for the proposed Referred Visual Search. According to my experience and understanding of fashion/ads, the query image (i.e., a dressed human) and the target image (i.e., garment, product, and so on) shown in Figure 1 are normally from the same product set, which is provided by the advertiser (at least they share some common information, e.g., brand name). In that case, we don't need to do large-scale image-based retrieval, which is implementation-wise expensive in the industry. The only case I understand is letting users themselves upload their own query images and do cross-domain image retrieval. However, the target and query images used in this paper are from the same domain and thus cannot model the real-world problem, although I understand collecting a cross-domain benchmark is even harder.
  2. I believe the proposed RVS is just a sub-task of Composed Image Retrieval (CIR), while their relationship is not clearly discussed (between L147 and L153) and the latest CIR related works were not cited. Regarding "rather than modifying the image" in L151, I think getting the feature of the ROI of the query image (Figure 4) is also kind of modifying the query image feature.
  3. The proposed model arch is okayish but not that innovative (L081) to me. I agree this can be regarded as a strong baseline but don't think this widely used arch can be claimed as innovative.
  4. I think some zero-shot methods (no training needed) should be considered as baselines too, e.g., a multimodal LLM that can fuse the query image features and the instructions.

Questions

Please refer to the weakness part.

Comment

We thank the reviewer for their feedback and their recognition of our efforts, and we address their concerns below.

W1. I am not convinced of the motivation for the proposed task. I cannot fully imagine a real application scenario for the proposed Referred Visual Search. According to my experience and understanding of fashion/ads, the query image (i.e., a dressed human) and the target image (i.e., garment, product, and so on) shown in Figure 1 are normally from the same product set, which is provided by the advertiser (at least they share some common information, e.g., brand name). In that case, we don't need to do large-scale image-based retrieval, which is implementation-wise expensive in the industry. The only case I understand is letting users themselves upload their own query images and do cross-domain image retrieval. However, the target and query images used in this paper are from the same domain and thus cannot model the real-world problem, although I understand collecting a cross-domain benchmark is even harder.

The intended use case for Referred Visual Search aligns with the second scenario: allowing users to upload a query image to retrieve an item from a large gallery.

However, collecting and utilizing natural user queries introduces significant ethical and privacy challenges. Datasets like LAION, for instance, have been taken down in the past for that reason. By focusing on professional images, we propose a safer alternative. While this approach does not perfectly replicate real-life scenarios, it serves as the closest and most durable proxy for establishing a reliable benchmark.

W2. I believe the proposed RVS is just a sub-task of Composed Image Retrieval (CIR), while their relationship is not clearly discussed (between L147 and L153) and the latest CIR related works were not cited. Regarding "rather than modifying the image" in L151, I think getting the feature of the ROI of the query image (Figure 4) is also kind of modifying the query image feature.

Our task belongs to the broader family of Conditional Image Similarity (CIS) tasks. While RVS is adjacent to CIR — as both fall under the CIS umbrella — it is not a special case of CIR.

As stated in GeneCIS [1]: “CIR [...] tackles the problem of composing an image and text prompt to retrieve relevant images from a gallery [...] with the text prompt acting as an image edit instruction (e.g. ‘the same dress but in white’). As such, CIR tackles a subset of the conditional similarity problem, by presenting models with a ‘negative’ similarity condition.”

Our RVS task is more closely aligned with the unnamed “Focus Object” task described in the GeneCIS benchmark, where retrieval is evaluated with a gallery size of 15 images. In contrast, our benchmark scales to a much larger gallery of 2M images, providing a much more realistic setup.

We have clarified the relationship between RVS and CIR in section 2 of the revised version of the paper.

W3. The proposed model arch is okayish but not that innovative (L081) to me. I agree this can be regarded as a strong baseline but don't think this widely used arch can be claimed as innovative.

Indeed, one key outcome of our paper is that our straightforward proposed method outperforms more complex approaches that rely on multiple dedicated detection modules. This result is both surprising and exciting as it shows that performance gains can be made by simplifying recommendation pipelines. Notably, our proposition — a soft conditioning of a vision transformer on a textual embedding — has not been explored in this field before.

We have clarified this claim in our revised version.

W4. I think some zero-shot methods (no training needed) should be considered as baselines too, e.g., a multimodal LLM that can fuse the query image features and the instructions.

We added zero-shot baselines, including MiniCPM-V 2.6 [2], an 8B SOTA VLM. Since the VLM architecture is not inherently designed for image embedding, we experimented with different approaches, including global average pooling of the tokens or using only the last token, taken from various layers.
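A simplified sketch of the kind of embedding extraction we experimented with is shown below; it assumes a HuggingFace-style forward pass that exposes hidden states, and the actual MiniCPM-V 2.6 interface may differ:

```python
import torch

@torch.no_grad()
def vlm_embedding(model, inputs, layer=-1, pooling="mean"):
    """Hypothetical helper for turning a VLM into an embedding model: run the
    model on the (image + instruction) inputs, pick one hidden-state layer,
    and pool its tokens into a single vector."""
    out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer]        # (B, seq_len, hidden_dim)
    if pooling == "mean":
        emb = hidden.mean(dim=1)             # global average pooling over tokens
    else:
        emb = hidden[:, -1]                  # last-token embedding
    return torch.nn.functional.normalize(emb, dim=-1)
```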

We would like to emphasize that this baseline introduces a very heavy pipeline, with a significant computational cost for embedding the entire gallery (2M samples). Additionally, the higher embedding dimension increases both index size and retrieval latency.

The resulting method achieves 9.2% R@1 with 2M distractors, underscoring the necessity of developing more efficient, general-purpose conditional image retrieval models. These results are added in Table 2 of the revised version.

References:

[1] Vaze, S., Carion, N., & Misra, I. (2023). Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[2] Yao, Y., Yu, T., Zhang, A., et al. (2024). MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint

Comment

Dear reviewer,

Today is the last day for reviewers to ask questions to authors. Did the authors' rebuttal address your concern? Do you have any additional questions?

Comment

Since no additional questions were raised, we assume that our rebuttal fully addresses the reviewer's concerns.

Review (Rating: 8)

This paper introduces the Referred Visual Search (RVS) task, addressing the challenges of image similarity search in the fashion domain, particularly in scenarios where product images need to be retrieved from query images containing multiple products or from model pictures. The authors present a large dataset, LRVS-Fashion, containing 272k fashion products and 842k images, designed for this task. They propose a novel method based on weakly-supervised conditional contrastive learning, achieving superior performance without relying on explicit cropping or segmentation. The study demonstrates the effectiveness of their approach in comparison to a variety of baselines. The paper is overall well written.

Strengths

The introduction of RVS as a task for fashion images addresses a gap in the current literature and reveals significant value for practical applications such as e-commerce systems.

The LRVS-Fashion dataset is extensive in terms of its scale, organization, and cleaning process, providing valuable resources for future research and enhancing the field's accessibility.

The proposed approach, based on conditional contrastive learning, is intuitive yet elegant. By avoiding explicit object detection or segmentation, it is efficient and easier to implement in various applications, particularly by reducing the extra stages of data collection or model design. In addition, it demonstrates impressive robustness against a large number of distractors, which is critical in real-world applications.

The paper provides a thorough literature review and extensive comparisons with existing methods. For example, it compares with composed image retrieval methods that share a conditional learning requirement but target different application scenarios. The experimental setup is clear and the overall results are convincing in terms of accuracy and efficiency.

Weaknesses

While the experimental results are solid, the analysis could be more comprehensive if it went deeper into the impact of objects' visibility in query images, considering object sizes, occlusions, or viewpoints. Alternatively, a simpler referring aspect could be the segmentation mask area or its ratio to the detected object's bounding box. One example is how the retrieval accuracy changes among shirts, because they are often only partially visible due to being overlaid with outerwear.

Questions

The study focuses primarily on fashion images and leverages LAION-5B to create the benchmark dataset. While it serves as a valuable testbed for future research, it would be interesting to discuss how to construct larger datasets in real-world scenarios and the challenges associated with generalizability to close domains such as street outfit photos that have noisy backgrounds, or even beyond fashion, such as furniture, landmarks, etc. This would further strengthen the impact of this work.

Comment

We thank the reviewer for their encouraging feedback and address their concerns below.

W1. While the experimental results are solid, the analysis could be more comprehensive if it went deeper into the impact of objects' visibility in query images, considering object sizes, occlusions, or viewpoints. Alternatively, a simpler referring aspect could be the segmentation mask area or its ratio to the detected object's bounding box. One example is how the retrieval accuracy changes among shirts, because they are often only partially visible due to being overlaid with outerwear.

We can indeed provide a more detailed analysis of our results by examining two additional metrics:

  • R@1 aggregated by category. This uses the annotated categories from the test set.
  • R@1 aggregated by mask area. For each test query, we compute a mask for the target object using Grounded SAM and manually simplify complex captions to ensure more accurate masks.

These metrics reveal that objects with low visibility — due to size, occlusion, or viewpoint — are more challenging to retrieve. We also observe a correlation with certain categories that predominantly contain small objects, such as gloves, shoes, and jewelry.
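For illustration, the per-mask-area aggregation can be sketched as follows (the helper name and bucket edges are arbitrary choices for exposition, not our exact analysis code):

```python
import numpy as np

def r_at_1_by_mask_area(ranks, mask_ratios, edges=(0.0, 0.05, 0.15, 0.35, 1.01)):
    """ranks[i] is the rank (1 = best) of the ground-truth product for query i;
    mask_ratios[i] is the Grounded-SAM mask area of the target object divided
    by the query image area. Returns R@1 per mask-area bucket."""
    ranks, mask_ratios = np.asarray(ranks), np.asarray(mask_ratios)
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (mask_ratios >= lo) & (mask_ratios < hi)
        if in_bucket.any():
            results[f"[{lo:.2f}, {hi:.2f})"] = float((ranks[in_bucket] == 1).mean())
    return results
```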

We have included this analysis in the revised version (see appendix B.4), as it provides valuable insights into the weaknesses of models with respect to specific aspects of the benchmark.

Q1. It would be interesting to discuss how to construct larger datasets in real-world scenarios and the challenges associated with generalizability to close domains such as street outfit photos that have noisy backgrounds, or even beyond fashion such as furniture, landmarks, etc.

We would like to emphasize potential ethical and privacy concerns when building large-scale datasets for real-world applications. In our study, we chose to focus on professional fashion images to avoid using personal data. While this approach may be slightly misaligned with certain real-world use cases, it represents the most durable and ethically sound method for creating a benchmark. This strategy could also be extended to other commerce categories, such as furniture.

With these considerations addressed, we identify two approaches for obtaining more realistic data for real-world applications:

  1. Simulation: Leveraging generative models to create realistic contexts. This method avoids reliance on personal data but may have limitations in the realism of its outputs.
  2. User-Generated Reviews: Utilizing user-taken photos associated with product reviews as noisy queries. This approach is feasible for large fashion retailers, provided it aligns with the Terms and Conditions of their services and regional legal settings (e.g., GDPR in the EU).

Comment

Thanks for the authors' comments, which address my concerns to some extent. The newly added analysis shows that models perform poorly on small-size objects, which is interesting for future improvement. Regarding the discussion about whether this work belongs to CIR, it seems to be subjective among reviewers and authors; they are close domains within multimodality applications. My main focus in reviewing this work has been on the contribution of the simple yet efficient e2e model, which enables retrieval of exact objects/products from a composed query image. The baselines shown in the paper look convincing to me. I will keep my current rating.

Comment

Dear reviewer,

Today is the last day for reviewers to ask questions to authors. Did the authors' rebuttal address your concern? Do you have any additional questions?

Review (Rating: 5)

This paper introduces Referred Visual Search (RVS) in the fashion domain. The paper makes two main contributions: (1) A new large-scale dataset called LRVS-Fashion and (2) A weakly-supervised method that outperforms detection-based approaches.

Strengths

  1. The paper proposes a high-quality dataset that will be beneficial to the field.
  2. The paper is well-written and easy to read.

Weaknesses

  1. The proposed task is like a special case of the task of composed image retrieval (CIR). However, the experiments are not compared with recent CIR methods, which makes the results unconvincing.
  2. The proposed method shows high performance in LRVS-F, as shown in Table 2, nearly 99% in Cat@1. These superior performances raise the question of the proposed dataset's motivation. Is there any specific problem the dataset wants to diagnose?

Questions

See the Weaknesses section.

Details of Ethics Concerns

no

Comment

We thank the reviewer for their feedback and for acknowledging the quality of our dataset, and we address their concerns below.

W1.1. The proposed task is like a special case of the task of composed image retrieval (CIR).

Our task belongs to the broader family of Conditional Image Similarity (CIS) tasks. While RVS is adjacent to CIR — as both fall under the CIS umbrella — it is not a special case of CIR. For further details on CIS tasks, we refer the reviewer to Vaze et al. [1]. We have clarified the relationship between our RVS task, CIS, and CIR in the revised version in section 2.

W1.2. However, the experiments are not compared with recent CIR methods, which makes the results unconvincing.

We updated the paper to include results from other pre-trained CIR and CIS models.

  • MagicLens [2]: A SOTA generalist CIS model trained on 36.7M instruction triplets.
  • FAME-ViL [3]: A recent fashion-specific CIR model.
  • MiniCPM-V 2.6 [4]: A SOTA Vision-Language Model with 8B parameters.

Both MagicLens and FAME-ViL achieve close to 50% R@1 with 10K distractors, showing some degree of task transferability. However, their performance drops sharply with 2M distractors, reaching only 7.5% and 9.3% R@1, respectively. This emphasizes the specificity and difficulty of our fine-grained retrieval benchmark. These results are added in Table 2 of the revised version.

W2. The proposed method shows high performance in LRVS-F, as shown in Table 2, nearly 99% in Cat@1. These superior performances raise the question of the proposed dataset's motivation. Is there any specific problem the dataset wants to diagnose?

We report the Cat@1 as a sanity check. This metric does not adequately reflect a model's ability to retrieve the correct image, as an item could belong to the correct category without matching the object of interest in attributes such as color, pattern, or shape.

The most informative metric is R@1 with millions of distractors, as it measures the model's ability to retrieve an exact item from a large gallery. This metric highlights opportunities for improvement in future work. We have ensured these points are clearly explained in the revised version.
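For reference, R@1 over such a large gallery can be computed with a simple chunked brute-force search (in practice an approximate nearest-neighbour index would be used); this sketch assumes L2-normalised embeddings and is not our exact evaluation code:

```python
import numpy as np

def recall_at_1(query_embs, gallery_embs, gt_indices, chunk=4096):
    """Brute-force R@1 over a gallery of targets plus millions of distractors.
    With L2-normalised embeddings, the dot product equals cosine similarity."""
    hits = 0
    for start in range(0, len(query_embs), chunk):
        q = query_embs[start:start + chunk]           # (c, d) chunk of queries
        best = (q @ gallery_embs.T).argmax(axis=1)    # nearest gallery item per query
        hits += int((best == gt_indices[start:start + chunk]).sum())
    return hits / len(query_embs)
```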

References:

[1] Vaze, S., Carion, N., & Misra, I. (2023). Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[2] Zhang, K., Luan, Y., Hu, H., et al. (2024). MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions. Proceedings of the 41st International Conference on Machine Learning

[3] Han, X., Zhu, X., Zhang, L., et al. (2023) Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[4] Yao, Y., Yu, T., Zhang, A., et al. (2024). MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint

Comment

Dear reviewer,

Today is the last day for reviewers to ask questions to authors. Did the authors' rebuttal address your concern? Do you have any additional questions?

Comment

Since no additional questions were raised, we assume that our rebuttal fully addresses the reviewer's concerns.

Comment

Dear Reviewers,

We would like to thank you for your thoughtful feedback. Addressing your questions and comments, particularly regarding the positioning with respect to Composed Image Retrieval (CIR), has significantly improved our work.

Below, we outline our conclusions based on the reviews, addressing the main strengths and weaknesses highlighted by the reviewers. We also provide a summary of the additional evaluations conducted during the rebuttal period.

Main Strengths

We are pleased to see that all the reviewers agreed on the value of our task and dataset for the e-commerce and fashion fields.

“The LRVS-Fashion dataset is extensive in terms of its scale, organization, and cleaning process, providing valuable resources for future research and enhancing the field's accessibility.” - Reviewer DWgA

“The paper proposes a high-quality dataset that will be beneficial to the field.” - Reviewer MVgE

“The effort put into collecting the benchmark is worthy of applause. This dataset could potentially be used for other fashion-related tasks.” - Reviewer risT

“The proposed new task has practical application value in the e-commerce domain.” - Reviewer jqz1

Additionally, all reviewers highlighted the quality of the writing and illustrations, which facilitate a better understanding of our work.

Main Weaknesses

The main concern regarding our task is its positioning relative to CIR. Reviewers MVgE and risT suggest that Referred Visual Search (RVS) can be seen as a special case of CIR, while Reviewer jqz1 requested additional clarification.

Recent work on Conditional Image Similarity (CIS) [1] enumerates CIR and RVS as two sub-tasks of CIS. Both share the same type of input (image + conditioning text) and output (embeddings), and are retrieval tasks, but their goals differ. Notably, CIR introduces a “negative similarity condition” used to edit the image, whereas RVS uses “referring instructions” to focus on a specific part of the image.

Our RVS task aligns with the unnamed “Focus Object” task described in the GeneCIS benchmark [1]. However, while they propose evaluating general-purpose models on small galleries (15 images), we use a specific fashion setting with a large gallery (2 million images) that better aligns with real world use cases.

Our work contributes to the existing CIS benchmarks, and results from CIR models confirm that the tasks are distinct.

[1] Vaze, S., Carion, N., & Misra, I. (2023). Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Summary of new experimental results

Following the reviewers' questions and comments, we conducted several new evaluations during the rebuttal period, which we have incorporated into the revised version of the manuscript.

In-depth analysis - Reviewer DWgA’s suggestion regarding the size of the items and their occlusion led us to compute per-category and per-size metrics. The results confirm a general weakness across all methods for small objects, which is particularly detrimental to certain categories (near 20% R@1 for the smallest object categories “Waist”, “Feet” and “Hands” compared to above 80% for the largest object categories “Upper Body” and “Whole Body”).

New baselines - We added three models to our baselines:

  • MiniCPM-V 2.6 (suggested by Reviewer risT) - A SOTA VLM method, which highlights an interesting research direction in adapting VLM to CIS tasks, reaching 9.2% R@1.
  • MagicLens - A SOTA zero-shot CIS method trained on a large scale dataset, which shows that our benchmark is much more challenging than existing CIS benchmarks, reaching 7.5% R@1.
  • FAME-ViL (suggested by Reviewer jqz1) - A fashion multi-task model trained on CIR, which shows that fashion RVS differs from fashion CIR, reaching 9.3% R@1.

These new results confirm that dedicated methods, like our simple conditional ViT approach (reaching 69.3% R@1), are still necessary for RVS (see Table 2 of the revised version).

Modifications to the paper made for the revision can be accessed with the OpenReview tool (Revisions → Compare Revisions).

AC Meta-Review

This paper was reviewed by four experts in the field. The paper received mixed review ratings of 8, 5, 5, and 3.

Reviewer DWgA is the most positive reviewer, and they liked the proposed e2e method. The major concerns from other reviewers are:

  1. The positioning of the work, being a special case of composed image retrieval (CIR)
  2. The novelty and innovation of the method.

The AC read the paper, reviews, and rebuttal carefully, and believes that the concerns around the positioning of the method and around the technical advancement this paper brings to the research community outweigh the paper's strengths. While the paper clearly has merit, we regret that it cannot be recommended for acceptance at this time. The authors are encouraged to consider the reviewers' comments when revising and resubmitting the paper.

Additional Comments on Reviewer Discussion

Two reviewers didn't participate in the discussion stage despite multiple reminders. One reviewer is very positive about the method, while the other three reviewers were concerned about the positioning of the work and the novelty of the method.

Final Decision

Reject