PaperHub
Rating: 7.8/10 · Poster · 4 reviewers (min 4, max 5, std 0.4)
Individual ratings: 5, 4, 5, 5
Confidence: 3.3
Novelty: 3.0 · Quality: 2.5 · Clarity: 2.5 · Significance: 3.0
NeurIPS 2025

QuARI: Query Adaptive Retrieval Improvement

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Hypernetwork-based framework for dynamic adaptation of precomputed database features.

Abstract

Keywords

image retrieval · text image retrieval · hypernetworks · multimodal models · instance retrieval

Reviews and Discussion

Review (Rating: 5)

Methods summary

The paper tackles the problem of missing fine-grained capabilities when using global embeddings with its proposed approach, QuARI (Query Adaptive Retrieval Improvement). The key idea in QuARI is to adjust the embedding dynamically per query. This is achieved using a hypernetwork H: the model is trained to take the query embedding and produce a modified embedding q' and a transformation T that is applied to all the gallery embeddings, after which the similarity function is applied to retrieve the closest matches. The transformation T is a matrix learned via low-rank decomposition, and the overall objective is realized through contrastive learning. The CL formulation uses a semi-positive scheme in which the top-2 nearest neighbours are treated as semi-positive.
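
To make the retrieval step concrete, here is a minimal NumPy sketch of the query-adaptive retrieval flow summarized above; the `hypernetwork` interface, the low-rank parameterization, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quari_retrieve(q, gallery, hypernetwork, k=10):
    """Hedged sketch of QuARI-style retrieval (names and shapes assumed).

    hypernetwork: maps a query embedding to (q_prime, U, V), where the
    query-specific transform is the low-rank matrix T = U @ V.T.
    gallery: (N, d) array of precomputed, L2-normalized embeddings.
    """
    q_prime, U, V = hypernetwork(q)
    g = (gallery @ V) @ U.T                        # apply T to every row
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # re-L2-normalize
    q_prime = q_prime / np.linalg.norm(q_prime)
    sims = g @ q_prime                             # cosine similarity
    return np.argsort(-sims)[:k]                   # top-k gallery indices
```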

Experiment setup

The paper evaluates on two benchmarks: ILIAS and INQUIRE. The ILIAS evaluations include images from YFCC100M as distractors. For training, the paper claims any text-image dataset can be used, and uses MSCOCO, Conceptual Captions, and BioTrove (images captioned by Qwen2.5-VL-7B-Instruct).

Results

QuARI is shown to clearly improve upon the baselines in retrieval and re-ranking on both evaluation sets. The ablations neatly show the importance of each of QuARI's components. Fig. 4 also shows the promise of the method with respect to time versus performance.

Strengths and Weaknesses

Strengths

  1. The paper is well written, with clearly presented details on contribution and implementation, and ablations highlighting how the different components contribute to the performance.
  2. The authors show strong improvement in numbers and closely match closed-source VLMs.

Weaknesses

  1. The authors could have provided analysis on the gap/regret relative to the best possible re-rank and to the closed-source VLMs, indicating where improvements could be gained. While the numbers for QuARI + SigLIP2 are close to closed-source, there is still a relatively large gap to the ideal re-rank. Some analysis of why that still happens could be really helpful to readers.

Comments

  1. I hope the authors diligently release the code if accepted, as promised.
  2. Some papers like [1] should probably be cited if the authors find them relevant in the background section regarding adaptive retrieval and [2] for the semi-positive samples.
  3. The authors should also probably clarify who this is potentially useful for, and the limitations as to how "fine-grained" it can really be.

[1] Vaibhav Balloli, Sara Beery, and Elizabeth Bondi-Kelly. 2024. Are they the same picture? Adapting concept bottleneck models for human-AI collaboration in image retrieval. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI '24). Article 866, 7824–7832. https://doi.org/10.24963/ijcai.2024/866

[2] Dwibedi, Debidatta, et al. "With a little help from my friends: Nearest-neighbor contrastive learning of visual representations." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Questions

  1. Can the authors add more details on the impact of the number of training data samples? The training data looks like a lot, and I wonder what the impact on performance is of dropping 1 or 2 of them? (and potentially adding these results)

Limitations

Yes, the authors have addressed limitations.

Final Rating Justification

My queries were addressed and I am maintaining my score.

Formatting Issues

The authors seem to have written the broader impact statement in the appendix (and mentioned this in the checklist), even though the guidelines clearly state that it should be part of the main paper.

Author Response

Thank you for your review! We appreciate your recognition of our strong improvements and clear presentation and evaluation. We address your comments below:

The authors could have provided analysis on the gap/regret relative to the best possible re-rank and to the closed-source VLMs, indicating where improvements could be gained. While the numbers for QuARI + SigLIP2 are close to closed-source, there is still a relatively large gap to the ideal re-rank. Some analysis of why that still happens could be really helpful to readers.

Understanding the gap between QuARI and the possible performance of an oracle is certainly interesting. We reviewed example queries with the largest gaps in performance between our results and the optimal results, and observe that they are simply quite difficult queries: images that include occluded subjects, poor lighting/angles, etc., and query texts that require domain knowledge, have complicated adjective lists, etc. We will share representative images that remain unretrieved by QuARI in either the final paper or the supplemental materials.

I hope the authors diligently release the code if accepted, as promised.

We will absolutely share our code and pre-trained models!

Some papers like [1] should probably be cited if the authors find them relevant in the background section regarding adaptive retrieval and [2] for the semi-positive samples.

These are excellent references and we will include them in our related work—thanks!

The authors should also probably clarify who this is potentially useful for, and the limitations as to how "fine-grained" it can really be.

We think this model is useful for anyone performing image-to-image or text-to-image retrieval. We believe the model generalizes across many domains because we demonstrate strong performance in our paper, and are excited to present results on additional benchmarks in response to concerns raised by reviewers. We evaluate on the following benchmarks:

  • Flickr30k
  • MS COCO — COCO was part of the training data for QuARI, so we remade the training dataset without COCO and show evaluation results on that data (we leave the COCO data in the training set for all other benchmarks).
  • Flat Object Retrieval Benchmark (FORB)
  • TextCaps

Across all of these new evaluation benchmarks, QuARI, built on SigLIP2 (ViT-L) features, yields significant improvements:

| Model | COCO T2I R@1 | Flickr30k T2I R@1 | FORB mAP@5 | FORB t-mAP@5 | TextCaps T2I R@1 |
|---|---|---|---|---|---|
| SigLIP2 | 55.2 | 85.3 | 93.74 | 69.24 | 44.6 |
| FT SigLIP2 | 55.0 | 85.5 | 92.89 | 70.03 | 45.2 |
| SigLIP2+QuARI | 77.4 | 92.9 | 95.67 | 78.53 | 55.8 |

Notably, there is no retraining for new domains—we use exactly the same model to translate the query into a projection matrix that refines the retrieval results (except for the model trained without COCO data in order to evaluate on COCO).

While we have not yet conducted a fine-grained analysis of how “specific” the learned projections are, our experiments demonstrate consistent performance improvements across diverse datasets and query types. This suggests that the model learns broadly useful transformations that are self-adapting to varying retrieval goals. We agree that a deeper analysis of projection specificity would provide valuable insights, and we view this as an exciting direction for future work.

Can the authors add more details on the impact of the number of training data samples? The training data looks like a lot, and I wonder what the impact on performance is of dropping 1 or 2 of them? (and potentially adding these results)

Thank you for pointing out a very interesting dimension of evaluation! We explore this by investigating the performance of QuARI built on SigLIP2 (ViT-L) across random samples of 25%, 50%, 75%, and 100% of the training data. We summarize these results in the table below:

| data sample | ILIAS I2I@100M (mAP@1k) | ILIAS T2I@5M (mAP@1k) | INQUIRE mAP@50 | INQUIRE nDCG@50 |
|---|---|---|---|---|
| 25% | 31.7 | 36.6 | 47.4 | 56.7 |
| 50% | 32.4 | 37.5 | 48.2 | 57.1 |
| 75% | 34.1 | 39.2 | 49.5 | 57.8 |
| 100% | 35.3 | 40.6 | 50.7 | 58.3 |

Comment

I also wanted to point out this paper [3], which the authors can consider citing; it is similar in spirit and might be worth highlighting to readers.

[3] Weller, Orion, et al. "Promptriever: Instruction-trained retrievers can be prompted like language models." arXiv preprint arXiv:2409.11136 (2024).

Review (Rating: 4)

The paper proposes QuARI, a query-specific retrieval framework that introduces query-adaptive linear transformations to enhance visual-language model (VLM) performance on retrieval tasks. While the empirical results show improvements over baselines, several critical issues undermine the novelty and practical significance of the work.

Strengths and Weaknesses

Strengths

  1. Comprehensive Evaluation: The paper conducts thorough evaluations of QuARI across multiple challenging benchmarks (ILIAS and INQUIRE) and compares it against a diverse set of baselines, including recent vision-language models (VLMs) and task-adapted models. This breadth of assessment helps contextualize the method’s performance across different retrieval scenarios.
  2. Insightful Ablation Studies: The ablation experiments effectively validate key design choices, such as iterative refinement, semi-positive sample mining, and noise injection. By quantifying the impact of each component, the studies strengthen the rationale for QuARI’s architecture.

Weaknesses

  1. Unclear Motivation for Query-Specific Adaptation: The paper does not sufficiently articulate why query-specific adaptation is necessary beyond domain-level adaptation. It fails to clarify scenarios where per-query adjustments outperform static domain-specific transformations, leaving the core motivation underdeveloped.
  2. Limited Conceptual Novelty: The core idea of using query-specific linear projections to adapt embedding spaces lacks originality. While the paper frames QuARI as an "extreme version" of prior linear transformation methods, this distinction is superficial. The addition of a transformer network to leverage in-domain data aligns more with engineering-level optimizations than impactful scientific innovation.
  3. Trivial Design Choices: The proposed transformation matrix for document embeddings is under-motivated, particularly in the context of cosine similarity-based retrieval. Linear transformations in this space do not address fundamental limitations and are equivalent to linear transformations on the query embedding.
  4. Unfair Baseline Comparisons: The comparisons to baselines often omit critical controls. For example, while Table 1 contrasts QuARI with "static task adaptation (TA)", the performance gains reported in Table 1(b) do not align with the trends in Table 1(a), raising questions about consistency. Additionally, the baselines do not adequately incorporate in-domain data, making it difficult to isolate whether improvements stem from QuARI’s design or superior use of training data.

Questions

see weaknesses above

Limitations

see weaknesses above

Final Rating Justification

Given the clarifications, particularly regarding the hypernetwork motivation, I revise my rating to positive.

Formatting Issues

n/a

Author Response

Thank you for your comments! We appreciate your acknowledgement of our comprehensive evaluation and effective ablation studies. We respond to your concerns below:

  1. Unclear Motivation for Query-Specific Adaptation: The paper does not sufficiently articulate why query-specific adaptation is necessary beyond domain-level adaptation. It fails to clarify scenarios where per-query adjustments outperform static domain-specific transformations, leaving the core motivation underdeveloped.

The motivation for query-specific adaptation is that even within a well-defined domain, individual queries often highlight very different semantic aspects of the data. A single domain-level projection must generalize across all of them, often diluting features that may be crucial for a particular query. In contrast, a lightweight query-specific transformation allows the retrieval system to adapt its representation space dynamically to emphasize what matters most for that query.

As an example that might help understand our intuition in crafting QuARI: in instance-level image retrieval like in ILIAS—say, trying to find a specific red backpack—the query might depict it lying on a white floor, worn by a person, or placed in a cluttered room. While all are “backpacks,” the relevance of background features will vary by query. A query-specific transformation can downweight irrelevant elements (like clutter), or upweight contextual ones (like co-occurring objects or textures), depending on what distinguishes the instance in that scene.

Similarly, in the INQUIRE benchmark, queries span a wide range of fine-grained distinctions in the natural world. One user might ask for “a frog with bright blue legs,” while another requests “a frog camouflaged in brown leaves.” Although both refer to frogs, the first emphasizes color and appearance, while the second emphasizes environment and texture. A domain-level projection trained across many such examples may settle on an average representation that captures neither distinction well. A query-specific projection, by contrast, can prioritize the particular attributes mentioned in the query, whether that means focusing on visual details of the object itself or cues from the surrounding context.

Setting aside the intuitive motivation, the results speak for themselves. Query-specific adaptations outperform both fine-tuning and learning of task-level projections by significant margins (as you can see in the original paper's Table 4 for fine-tuning, and in going from the baseline results in Table 1A to the Task Adaptation results in Table 1B for task adaptation). The two approaches are also not mutually exclusive. While we don't explore it in this paper, the potential combination of both domain-level adaptation and query-level adaptation is an interesting direction for follow-up work.

  1. Limited Conceptual Novelty: The core idea of using query-specific linear projections to adapt embedding spaces lacks originality. While the paper frames QuARI as an "extreme version" of prior linear transformation methods, this distinction is superficial. The addition of a transformer network to leverage in-domain data aligns more with engineering-level optimizations than impactful scientific innovation.

We acknowledge that learning linear transformations to adapt pre-trained embedding spaces is not a novel concept, and cite related work in that space in our paper. The novel aspect of QuARI, however, is the learning of query-specific database transformations. We respectfully disagree that this distinction is superficial and are unaware of any prior work that negates the novelty of QuARI. We show that it is both a highly performant and computationally efficient approach to improving retrieval performance, especially for extremely challenging retrieval domains. The core contribution of our paper is the thorough demonstration that learning of query-specific transformations is feasible with a transformer hypernetwork, and that this setup results in strong practical performance improvements without excessive computational cost. We are also excited to have expanded on these results (in response to the concerns of other reviewers), showing that the query-specific adaptation of QuARI yields significant retrieval improvement on a number of additional benchmarks.

  1. Trivial Design Choices: The proposed transformation matrix for document embeddings is under-motivated, particularly in the context of cosine similarity-based retrieval. Linear transformations in this space do not address fundamental limitations and are equivalent to linear transformations on the query embedding.

We aren't performing a single linear transformation of the entire query space, but exploring the hypothesis that we can effectively learn a linear transformation of the embeddings of the gallery images (and the query) based on the query itself. In the context of the entire image retrieval problem, this is a highly non-linear transformation, but it can be implemented very efficiently at run time because it is linear for each query. Our results suggest that this approach is very effective. Therefore, we argue that query-specific linear transformations do address the fundamental limitation that a pre-trained embedding space may not align with the semantic properties of a downstream retrieval task. General-domain pre-training results in strong learned features that are useful for a diverse set of downstream domains, but the learning of features relevant to one task may result in the inclusion of features spurious to another task [1,2]. QuARI is effective because it learns to suppress the features that are irrelevant to the goal of the query.

  1. Unfair Baseline Comparisons: The comparisons to baselines often omit critical controls. For example, while Table 1 contrasts QuARI with "static task adaptation (TA)", the performance gains reported in Table 1(b) do not align with the trends in Table 1(a), raising questions about consistency. Additionally, the baselines do not adequately incorporate in-domain data, making it difficult to isolate whether improvements stem from QuARI’s design or superior use of training data.

Thank you for pointing out a source of confusion in the current version of the paper. “Backbone + QuARI” rows in Table 1(b) actually do not refer to the same model as the corresponding row in Table 1(a). Table 1(b) aims to compare the gap between linear adaptation of features on the task-level and the linear adaptation of features on the query-level (QuARI). So, QuARI in Table 1(b) refers to QuARI fine-tuned on a 1M subset of Universal Embeddings, while QuARI in Table 1(a) is trained without “in-domain” instance-level examples. We will clarify this distinction more carefully in the final version of our manuscript. 


[1] Li, Weiwei, et al. "Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025.

[2] An, Bang, et al. "More context, less distraction: Improving zero-shot inference of CLIP by inferring and describing spurious features." Workshop on Efficient Systems for Foundation Models, ICML. 2023.

Comment

Thank you for your detailed response and efforts to address the concerns. Your explanations of the motivation and novelty behind QuARI are thorough and clear. I believe my initial points (W1, W2, W3) are interconnected, centering on the potential triviality in QuARI’s design.

First, as illustrated in Figure 2 and confirmed in your reply, QuARI outputs two components: a refined query embedding $f(\text{query}_i)$ and a transformation matrix $T_i$. Applying $T_i$ to document embeddings is mathematically equivalent to $\text{QuARI}(\text{query}_i) = f(\text{query}_i) \cdot T_i^\top$ in the context of similarity computation. While learning instance-specific transformations is conceptually sound, this design risks triviality because query encoders already perform query-specific transformations. For example, if the query-specific transformation were implemented via a transformer encoder (instead of a hypernetwork), it would likely be viewed as trivial, functioning merely as a second-stage query encoder.

Your clarification helps contextualize QuARI within the framework of hypernetworks—addressing a key limitation of deep neural networks (learning shared parameters across all examples) by adapting to sample-specific gradient directions. However, query-specific transformations demand careful scrutiny precisely because query encoders already generate query-dependent representations. Hypernetworks typically enhance flexibility by adjusting weights learned from batch-level patterns (e.g., [29] learns a weight delta for diffusion model layers). In QuARI’s case, the adaptation is a linear transformation, effectively refining the weights of a final feed-forward layer. This narrow scope raises questions about whether the hypernetwork’s contribution is as impactful as framed.

To rigorously validate the hypernetwork's role in performance gains, an ablation study isolating its effect—e.g., applying the hypernetwork without updating the query embedding $f(\text{query}_i)$—is critical but missing. The one-step generation ablation (Line 229) suggests the hypernetwork contributes at least 50% of the improvement, but this is partial evidence.

Given your clarifications, particularly regarding the hypernetwork motivation, I am open to revising my rating to positive, as long as the concerns about the potential triviality of the linear transformation design (noted in Line 257) are addressed. To that end, the work would be significantly strengthened by three key refinements:

  • Figure 1 (Column 3) is misleading. QuARI should be explicitly framed as a hypernetwork rather than a "query transforming network."
  • An ablation removing the updated query embedding $f(\text{query}_i)$, or making it a hypernetwork $\text{QuARI}(\text{query}_i) = f_i(\text{query}_i) \cdot T_i^\top$, is necessary to isolate the hypernetwork's impact.
  • The ablation removing query noise requires further justification. If the hypernetwork is critical, why does a noisy query embedding outperform a standard one as a hypernetwork prior?

These adjustments would better clarify QuARI’s novelty and address lingering concerns about design triviality.

Comment

An ablation removing the updated query embedding, or making it a hypernetwork, is necessary to isolate the hypernetwork’s impact.

Applying T to document embeddings is mathematically equivalent to QuARI(query) = f(query) * T^T in the context of similarity computation

Thank you for clarifying this concern! We are excited to present additional results isolating the contributions of the query update and the database projection. We also clarify that it is not mathematically identical to transform the database embeddings and the query embedding in our retrieval setup.

First, we remove the hypernetwork projection of the database embeddings, and only update the query embedding with the learned hypernetwork (denoted as hypernetwork query update). Then, we remove the updated query embedding, transforming only the document embeddings (without query update). All experiments are conducted using a ViT-L SigLIP2 backbone encoder trained on identical data:

| version | ILIAS I2I@100M (mAP@1k) | ILIAS T2I@5M (mAP@1k) | INQUIRE mAP@50 | INQUIRE nDCG@50 |
|---|---|---|---|---|
| SigLIP2 | 20.8 | 24.7 | 37.2 | 52.3 |
| hypernetwork query update | 23.2 | 26.8 | 40.1 | 53.9 |
| without query update | 33.4 | 38.2 | 48.4 | 57.4 |
| QuARI | 35.3 | 40.6 | 50.7 | 58.3 |

Removing the projection of the database embeddings degrades retrieval performance significantly, validating the distinction between transforming the database embeddings and transforming the query embedding. We see that while removing the update to the query results in performance degradation, the hypernetwork-based approach still captures strong performance gains over baseline retrieval methods. This validates the importance and soundness of our proposed approach. We will include these results in the final manuscript.

Clarification of non-equivalence

In general, for a transform $T$, database/document embedding $d$, and query $q$, it is not exactly the same to apply the transform $T$ to the database vectors as to the query vector. In our case, we store L2-normalized database vectors, which are re-L2-normalized after applying the transformation. In our scheme (applying the transform to the database), the similarity is:

$$s(q,d)=\frac{f(q)^\top T d}{\|f(q)\|\,\|T d\|},$$

while if applying the transformation directly to the query and renormalizing, the similarity is:

$$s(q,d)=\frac{f(q)^\top T d}{\|T^\top f(q)\|\,\|d\|}.$$

Note that the terms $\|T^\top f(q)\|$, $\|d\|$, and $\|f(q)\|$ are constants for any specific query. The crucial difference is the term $\|T d\|$, which depends on the specific database embedding $d$, creating a document-dependent scaling that cannot be absorbed into the query. Therefore, the two similarities are not exactly the same (unless $T$ is orthogonal up to scale).
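
As a quick numerical illustration (our own hedged sketch, not from the paper), a random non-orthogonal $T$ makes the two renormalization schemes produce different similarity values, so they can induce different rankings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
T = rng.standard_normal((d, d))   # generic, non-orthogonal transform
q = rng.standard_normal(d)
docs = rng.standard_normal((2, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalized database

def sim_db(q, x):
    """Transform and renormalize the database vector (the paper's scheme)."""
    y = T @ x
    return (q @ y) / (np.linalg.norm(q) * np.linalg.norm(y))

def sim_query(q, x):
    """Transform and renormalize the query vector instead."""
    p = T.T @ q
    return (p @ x) / (np.linalg.norm(p) * np.linalg.norm(x))

# Same numerator, but the document-dependent ||T x|| denominator differs.
print([sim_db(q, x) for x in docs])
print([sim_query(q, x) for x in docs])
```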

The ablation removing query noise requires further justification. If the hypernetwork is critical, why does a noisy query embedding outperform a standard one as a hypernetwork prior?

We clarify that the noise injected into the query embedding serves as a training-time regularization technique used to bridge the gap between text-image pretraining and downstream tasks, including image-image retrieval. The noise is removed at inference time, but results in strong performance improvements in image-image retrieval even when training only on text-image data. This approach was introduced and experimentally validated by previous work [1].
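
For concreteness, a minimal sketch of this kind of train-time-only noise regularization; the Gaussian form and the `sigma` value are assumptions in the spirit of [1], not the paper's exact recipe.

```python
import numpy as np

def encode_query(text_embedding, training, sigma=0.1, rng=None):
    # Train-time-only noise on the text query embedding; at inference the
    # embedding passes through unchanged. sigma is an assumed scale.
    rng = rng or np.random.default_rng()
    if training:
        noise = sigma * rng.standard_normal(text_embedding.shape)
        text_embedding = text_embedding + noise
    return text_embedding / np.linalg.norm(text_embedding)
```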

Figure 1 (Column 3) is misleading. QuARI should be explicitly framed as a hypernetwork rather than a "query transforming network."

Thank you for pointing out a potential point of confusion in Figure 1! We will revise the figure to clearly indicate that QuARI adapts not only the query embedding, but also the database embeddings.

Thank you for your valuable suggestions! We are confident that incorporating the requested results into the final manuscript will help highlight the contributions of our work.

[1] Gu, Geonmo, et al. "Language-only Training of Zero-shot Composed Image Retrieval." Conference on Computer Vision and Pattern Recognition (CVPR). 2024.

Review (Rating: 5)

This paper introduces QuARI (Query Adaptive Retrieval Improvement), a query-specific retrieval framework that enhances image-to-image and text-to-image retrieval performance. The key idea is to use a transformer-based hypernetwork to predict query-specific linear transformations that adapt pre-computed global embeddings from frozen vision-language models (e.g., CLIP, SigLIP) for each individual query. The method employs iterative refinement with low-rank matrix factorization (rank 64) to generate transformation matrices efficiently. The authors evaluate their approach on two retrieval benchmarks (ILIAS and INQUIRE) and demonstrate significant improvements over baselines while maintaining computational efficiency compared to traditional re-ranking methods.

Strengths and Weaknesses

Strengths:

  • Novel approach to retrieval: The paper presents an interesting idea of using query-specific transformations rather than static domain adaptation, which is conceptually appealing and shows strong empirical results.
  • Computational efficiency: The method is significantly more efficient than VLM-based re-ranking approaches while achieving competitive or better performance, making it practical for large-scale deployment.
  • Comprehensive evaluation: The paper includes comparisons with various baselines including domain-specific adaptations, local feature-based re-ranking methods, and VLM re-rankers.
  • Good ablation study: The authors provide ablation studies showing the importance of iterative generation, semi-positive samples, and noise injection.

Weaknesses:

  • Questionable benchmark selection: The paper relies heavily on ILIAS and INQUIRE datasets, which appear to be very recent (2024-2025) with minimal citations (~1 citation each). This raises concerns about the reliability and representativeness of the evaluation.
  • Missing critical ablations:
    • No justification or ablation for the choice of rank 64 for low-rank factorization
    • No experiments with different numbers of semi-positive samples (fixed at 2)
    • Limited analysis on why iterative refinement is necessary beyond showing performance drops
  • Insufficient implementation details: Critical details like MLP dimensions are missing, hindering reproducibility. The authors mention this will be released upon acceptance, but key architectural details should be in the paper.
  • Limited theoretical justification: The paper lacks theoretical analysis for design choices:
    • Why is low-rank factorization necessary/beneficial?
    • What is the theoretical basis for the iterative refinement process?
  • Narrow evaluation scope: Evaluation is limited to two non-standard benchmarks. Results on widely-used retrieval benchmarks (e.g., MS-COCO retrieval, Flickr30K) would strengthen the claims.

Questions

  • Rank selection: Why was rank 64 chosen for the low-rank factorization? What is the performance-efficiency trade-off for different rank values (e.g., 32, 128, 256)?
  • Semi-positive samples: How sensitive is the method to the number of semi-positive samples? What happens with 1, 5, or 10 semi-positives?
  • Standard benchmarks: Can you provide results on standard retrieval benchmarks like MS-COCO or Flickr30K to validate the generalizability?
  • Theoretical analysis: Can you provide theoretical justification for why low-rank transformations are sufficient? How does this relate to the manifold structure of embeddings?

Limitations

The authors adequately discuss computational limitations (linear transformations only) and the top-k retrieval constraint. However, they should also acknowledge the limitations of evaluating on non-standard benchmarks.

Final Rating Justification

The rebuttal addresses all my major concerns. The additional experiments on standard benchmarks show strong generalization, the ablation studies justify the design choices, and the theoretical explanations are satisfactory. The consistent improvements across diverse retrieval tasks demonstrate the practical value of QuARI.

Given these substantial improvements and clarifications, I am updating my rating from 3 (Borderline reject) to 5 (Accept). The paper now presents a well-validated approach with clear practical benefits and sufficient experimental rigor.

I strongly encourage the authors to include all these additional results and explanations in the final paper, as they significantly strengthen the contribution.

Formatting Issues

As far as I can see, there is none.

Author Response

Thank you for your review! We appreciate your recognition of our novel approach, strong results, and computational efficiency. We address your comments below:

  • Questionable benchmark selection: The paper relies heavily on ILIAS and INQUIRE datasets, which appear to be very recent (2024-2025) with minimal citations (~1 citation each). This raises concerns about the reliability and representativeness of the evaluation.
  • Narrow evaluation scope: Evaluation is limited to two non-standard benchmarks. Results on widely-used retrieval benchmarks (e.g., MS-COCO retrieval, Flickr30K) would strengthen the claims.

We appreciate the concern raised regarding the emphasis on the ILIAS and INQUIRE benchmarks. We selected ILIAS and INQUIRE explicitly because they are extremely hard. Pre-trained embedding models produce strong performance on general-domain retrieval problems, with state-of-the-art models like SigLIP2 achieving 55.2 R@1 on COCO and 84.5 R@1 on Flickr30k text-to-image retrieval with just a ViT-B backbone.

By comparison, ILIAS is so challenging that the retrieval results are reported as mean average precision at 1000, with a pre-trained SigLIP2 model achieving only 19.8 mAP@1K on the text-to-image retrieval task. mAP@1 results are not reported, because they would be essentially 0. On INQUIRE, a pre-trained SigLIP model achieves just 34.2 mAP@50.

Such challenging real-world retrieval tasks are extremely compelling compared to benchmarks where the retrieval results are already quite good with pre-trained models, or with simple fine-tuning or task-specific learned adaptations. This motivated our design of QuARI. Dataset-level finetuning or task-specific adaptation on their own only provide a small boost (as you can see in the original paper’s Table 4 for fine-tuning, and in going from the baseline results in Table 1A to the Task Adaptation results in Table 1B for task adaptation). We designed QuARI’s query-level retrieval adaptation specifically to improve retrieval on such incredibly challenging problem domains.

We acknowledge, though, that reporting results on the best-known datasets helps the broader research community understand the contribution, so we have added evaluations on the following datasets:

  1. Requested Benchmarks:
  • Flickr30k
  • MS COCO — COCO was part of the training data for QuARI, so we remade the training dataset without COCO and show evaluation results on that data (we leave the COCO data in the training set for all other benchmarks).
  2. Additional Benchmarks:
  • Flat Object Retrieval Benchmark (FORB)
  • TextCaps

Across all of these new evaluation benchmarks, QuARI, built on SigLIP2 (ViT-L) features, yields significant improvements. We also report the performance of a fine-tuned ViT-L SigLIP2 model on the same training data as QuARI for fairness:

| Model | COCO T2I R@1 | Flickr30k T2I R@1 | FORB mAP@5 | FORB t-mAP@5 | TextCaps T2I R@1 |
|---|---|---|---|---|---|
| SigLIP2 | 55.2 | 85.3 | 93.74 | 69.24 | 44.6 |
| FT SigLIP2 | 55.0 | 85.5 | 92.89 | 70.03 | 45.2 |
| SigLIP2+QuARI | 77.4 | 92.9 | 95.67 | 78.53 | 55.8 |

As a reminder, all of the evaluations use the exact same QuARI model (except for the COCO evaluation, which was re-trained to remove the COCO dataset from the training data—no additional training data was added, and it was trained in the exact same way as the model that included COCO data). This further demonstrates that our approach generalizes across a wide variety of domains without retraining. We really appreciate the suggestion for evaluation on COCO and Flickr30k and are excited to include these additional results in the final paper.

  • No justification or ablation for the choice of rank 64 for low-rank factorization
  • No experiments with different numbers of semi-positive samples (fixed at 2)
  • Rank selection: Why was rank 64 chosen for the low-rank factorization? What is the performance-efficiency trade-off for different rank values (e.g., 32, 128, 256)?

The reviewer is correct that experimental evidence supporting the selection of these hyperparameters would strengthen the paper—thank you for pointing this out! We note that we constrain all experiments to fit on a single 80GB NVIDIA H100 GPU. For some experiments, changing hyperparameters necessitated a decrease in batch size—we note these cases with an asterisk. We report all results using a SigLIP2 (ViT-L) backbone.

For the choice of rank of the projection, we compare ranks of 16, 32, 64 (the rank included in the original paper), 128 and 256. The rank-64 projections yield the best performance:

| rank | I2I @ 100M | T2I @ 5M |
|---|---|---|
| 16 | 23.5 | 26.4 |
| 32 | 30.2 | 32.8 |
| 64 | 35.3 | 40.6 |
| 128* | 33.6 | 38.9 |
| 256* | 29.4 | 32.6 |

For the semi-positives, we compare the results without any semi-positives, and with one, two, and three semi-positive examples, showing that two semi-positives are significantly better than zero or one, and a tiny bit better than three:

| number semi-positives | I2I @ 100M | T2I @ 5M |
|---|---|---|
| 0 | 30.3 | 35.6 |
| 1 | 33.1 | 37.2 |
| 2 (ours) | 35.3 | 40.6 |
| 3* | 35.1 | 39.8 |

We will incorporate these ablations into the final paper.
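
For readers, a hedged sketch of one plausible way semi-positives could enter a contrastive objective: here the top-k nearest neighbours are simply masked out of the negative set. The paper's exact weighting may differ; everything below is an illustrative assumption.

```python
import numpy as np

def info_nce_with_semi_positives(q, pos, gallery, tau=0.07, k=2):
    # q, pos: L2-normalized query/positive embeddings; gallery: (N, d),
    # L2-normalized. The top-k gallery neighbours of the query are treated
    # as semi-positives and excluded from the negatives (an assumption,
    # not the authors' exact loss).
    sims = gallery @ q
    semi = np.argsort(-sims)[:k]           # semi-positive candidate indices
    keep = np.ones(len(gallery), dtype=bool)
    keep[semi] = False                     # do not penalize semi-positives
    logits = np.concatenate(([q @ pos], sims[keep])) / tau
    return -logits[0] + np.log(np.exp(logits).sum())  # InfoNCE over the rest
```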

  • Limited analysis on why iterative refinement is necessary beyond showing performance drops
  • What is the theoretical basis for the iterative refinement process?

Our use of iterative refinement is motivated both by practical gains and by prior work showing that complex structured prediction tasks can benefit from a progressive optimization process [1,2].

In our setting, the model predicts a 128-dimensional transformation basis from a query. This prediction is inherently difficult, as it must adapt a global embedding space to reflect fine-grained, query-specific semantics. As shown in prior hypernetwork-based work like HyperStyle [1] and HyperDreamBooth [2], iterative refinement allows the model to correct coarse or partially aligned initial predictions by re-evaluating and adjusting the predicted transformation in the context of its own outputs.

Theoretically, iterative refinement aligns with the concept of fixed-point iteration or recurrent function application, where successive outputs of a function (or network) approach a stable solution. In practice, this leads to more accurate predictions in tasks where the output lies in a complex, high-dimensional space that cannot be captured effectively in a single forward pass. As seen in [1,2], a small number of refinement steps (<5 iterations) is sufficient to yield large improvements, balancing accuracy and efficiency.

We follow this paradigm and find that QuARI similarly benefits from refinement in both performance and stability.
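
A minimal sketch of what such a refinement loop could look like, assuming a hypernetwork interface with hypothetical `init_tokens` and `refine` methods; these names and the additive-update scheme are our assumptions, not the paper's architecture.

```python
def predict_low_rank_transform(hnet, query_emb, num_iters=3):
    # Coarse initial prediction of the low-rank factors (T = U @ V.T),
    # then a few refinement steps conditioned on the current prediction,
    # in the spirit of HyperStyle/HyperDreamBooth-style iterative updates.
    U, V = hnet.init_tokens(query_emb)
    for _ in range(num_iters):
        dU, dV = hnet.refine(query_emb, U, V)  # correction from own output
        U, V = U + dU, V + dV
    return U, V
```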

  • Insufficient implementation details: Critical details like MLP dimensions are missing, hindering reproducibility. The authors mention this will be released upon acceptance, but key architectural details should be in the paper.

We use a two-layer MLP with a hidden dimension of 512 and a GeLU activation function. We will add these details, along with sharing all code upon acceptance.

  • Why is low-rank factorization necessary/beneficial?
  • Theoretical analysis: Can you provide theoretical justification for why low-rank transformations are sufficient? How does this relate to the manifold structure of embeddings?

Thank you for pointing out the limitations of the discussion in our paper. Low-rank factorization serves as both a regularizer and practical design choice for computational efficiency. As each u-v token pair defines a rank-1 component of the output transformation, learning a full-rank transformation would require prohibitively large amounts of GPU memory and inference time.

Low-rank transformations are often sufficient in practice because learned vision-language embedding spaces tend to lie on low-dimensional manifolds within the high-dimensional ambient space. This observation has been made in general [3], in language modeling [4], and in visual representation learning [5, 6], where most of the task-relevant variation can be captured by a small number of dominant directions.

By constraining QuARI's transformation to be low-rank, we effectively bias the model to operate on these dominant directions relevant to the query while avoiding overfitting to spurious or noisy components of the embedding space.
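
To illustrate the efficiency side of this argument (a sketch under an assumed embedding dimension, not the paper's exact sizes): a rank-$r$ factorization $T = UV^\top$ applies in $O(dr)$ per embedding and stores $2dr$ parameters instead of $d^2$.

```python
import numpy as np

d, r = 1024, 64                       # assumed embedding dim and rank
rng = np.random.default_rng(0)
U = rng.standard_normal((d, r)) / np.sqrt(d)
V = rng.standard_normal((d, r)) / np.sqrt(d)
x = rng.standard_normal(d)

y_full = (U @ V.T) @ x                # materializes the d x d matrix: O(d^2)
y_fact = U @ (V.T @ x)                # factored form: O(d*r), no d x d matrix
assert np.allclose(y_full, y_fact)    # identical results either way
```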


[1] Ruiz, Nataniel, et al. "Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Alaluf, Yuval, et al. "Hyperstyle: Stylegan inversion with hypernetworks for real image editing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Feng, Ruili, et al. "Rank diminishing in deep neural networks." Advances in Neural Information Processing Systems. 2022.

[4] Aghajanyan, Armen, et al. “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.” Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021.

[5] Jaderberg, Max, et al. "Speeding up Convolutional Neural Networks with Low Rank Expansions." British Machine Vision Conference. 2014.

[6] Dong, Wei, et al. "Low-rank rescaled vision transformer fine-tuning: A residual design approach." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

Thank you for the comprehensive rebuttal addressing my concerns. I appreciate the additional experiments and clarifications provided.

  • Standard benchmark evaluation: The addition of results on Flickr30k, MS COCO, FORB, and TextCaps significantly strengthens the paper. The consistent improvements across these diverse benchmarks (e.g., 77.4 vs 55.2 R@1 on COCO, 92.9 vs 85.3 on Flickr30k) demonstrate the generalizability of QuARI beyond the initially presented benchmarks.
  • Hyperparameter ablations: The rank ablation study clearly shows that rank 64 is optimal, with lower ranks (16, 32) underperforming and higher ranks (128, 256) showing diminishing returns or degradation. Similarly, the semi-positive ablation validates the choice of 2 semi-positives.
  • Theoretical justification: The explanation connecting low-rank transformations to the low-dimensional manifold structure of embeddings is reasonable and well-supported by citations. The iterative refinement justification based on prior hypernetwork work is also convincing.
  • Implementation details: Thank you for providing the MLP dimensions (hidden dimension 512) and promising to release code upon acceptance.

Review (Rating: 5)

This paper introduces QuARI, a novel approach for improving retrieval performance by dynamically adapting embedding spaces per query. QuARI learns to map each query to a query-specific feature space transformation that emphasizes relevant features for that particular query. The authors demonstrate that this linear transformation can be applied efficiently to large-scale image collections, achieving state-of-the-art performance on challenging retrieval benchmarks while requiring significantly less computational overhead compared to existing re-ranking methods.

Strengths and Weaknesses

Strengths

  1. Novel and Intuitive Approach: The paper presents an innovative approach for addressing limitations in vision-language models for retrieval tasks, offering a compelling middle ground between general-purpose and domain-specific retrieval methods.

  2. Good Performance: The proposed method demonstrates consistent and substantial improvements across various backbone models on different retrieval tasks.

  3. Computational Efficiency: The method achieves superior performance with minimal computational overhead compared to alternative re-ranking approaches, making it practical for real-world applications with large-scale image collections.

  4. Clear Presentation: The paper is well-written with clear mathematical formulations and is easy to understand.

Weaknesses

  1. Limitations of Linear Transformations: While the paper acknowledges this limitation, there could be more discussion about scenarios where linear transformations might be insufficient for capturing complex relationships in information-rich images.

  2. Ablation Study for Hyperparameter Selection: The choice of certain hyperparameters seems arbitrary and lacks empirical justification. For instance, the selection of top-2 image embeddings as semi-positive samples and the number of iterations (L) for the transformer refinement process.

Questions

An interesting alternative approach worth exploring would be a training-free method using multimodal large language models (MLLMs) to encode query+image pairs for all images in the pool, highlighting query-relevant features. While this would be computationally intensive during inference (requiring MLLM application to all images for each query), it would eliminate training costs. A comparative analysis between this approach and QuARI could provide valuable insights into the trade-offs between training efficiency and inference efficiency in adaptive retrieval systems.

Limitations

Yes.

Final Rating Justification

I thank the authors for the rebuttal and for making the effort to conduct additional experiments. It addresses my concern about the hyperparameter selection. I will maintain my positive rating.

Formatting Issues

N/A

Author Response

Thank you for your comments! We are glad you found our approach innovative, strong in performance/efficiency, and clearly presented. We address your concerns as follows:

  1. Limitations of Linear Transformations: While the paper acknowledges this limitation, there could be more discussion about scenarios where linear transformations might be insufficient for capturing complex relationships in information-rich images.

Learning nonlinear transformations would certainly allow for greater expressive potential in adapting features from pre-trained embedding spaces that have significant misalignment with a downstream retrieval task. Our work focuses on showing that even within the constraints of lightweight, linear database transformations, it is possible to achieve substantial performance gains through query-specific adaptation, making QuARI practical for large-scale deployment and low-latency retrieval. We see this efficiency as a key strength of our approach. That said, exploring more expressive nonlinear database transformation strategies, building on the nonlinear transform for the query already implemented, is an exciting direction for future work and one that may offer additional flexibility in more semantically complex settings.

  1. Ablation Study for Hyperparameter Selection: The choice of certain hyperparameters seems arbitrary and lacks empirical justification. For instance, the selection of top-2 image embeddings as semi-positive samples and the number of iterations (L) for the transformer refinement process.

Thank you for pointing out a limitation of our evaluation. For the semi-positives, we compare the results without any semi-positives, and with one, two, and three semi-positive examples, showing that two semi-positives are significantly better than zero or one, and a little bit better than three. For some experiments, changing hyperparameters necessitated a decrease in batch size—we note those cases with an asterisk. We report all results using a SigLIP2 (ViT-L) backbone.

| number semi-positives | I2I @ 100M | T2I @ 5M |
|---|---|---|
| 0 | 30.3 | 35.6 |
| 1 | 33.1 | 37.2 |
| 2 (ours) | 35.3 | 40.6 |
| 3* | 35.1 | 39.8 |

We will incorporate this ablation into the final paper. Regarding the refinement, we adopted an iterative refinement strategy based on recommendations from previous work [1,2].

  • An interesting alternative approach worth exploring would be a training-free method using multimodal large language models (MLLMs) to encode query+image pairs for all images in the pool, highlighting query-relevant features. While this would be computationally intensive during inference (requiring MLLM application to all images for each query), it would eliminate training costs. A comparative analysis between this approach and QuARI could provide valuable insights into the trade-offs between training efficiency and inference efficiency in adaptive retrieval systems.

Thank you for the suggestion! The suggested approach would certainly allow the usage of MLLMs trained on larger datasets, but conducting inference over very large galleries (5M from INQUIRE, and 100M from ILIAS) becomes computationally infeasible. While this would serve as a potentially strong baseline as a reranking method, we leave the exploration of using MLLMs as retrieval rerankers to future work.


[1] Ruiz, Nataniel, et al. "Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Alaluf, Yuval, et al. "Hyperstyle: Stylegan inversion with hypernetworks for real image editing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Comment

I thank the authors for the rebuttal and for making the effort to conduct additional experiments. It addresses my concern about the hyperparameter selection. I will maintain my positive rating.

Comment

Dear reviewers, please go over and respond to the authors' rebuttal. Best wishes, AC

Final Decision

This paper proposes QuARI, a framework that learns query-specific linear transformations of embedding spaces for image/text retrieval. The approach is lightweight, efficient, and shows strong improvements across both challenging new benchmarks and standard datasets. Reviewers initially raised concerns about novelty, reliance on new benchmarks, and missing ablations. After rebuttal, all reviewers acknowledged the clarifications, with one upgrading to positive and others maintaining acceptance.

Overall, QuARI offers a practical and well-validated contribution to adaptive retrieval. I recommend acceptance.