PaperHub
Rating: 6.0/10 · Poster · 4 reviewers (lowest 3, highest 4, std 0.4)
Individual ratings: 4, 4, 4, 3
Confidence: 3.0
Novelty: 2.8 · Quality: 3.3 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We introduce GARE, a gap-aware retrieval framework that learns pair-specific increments to alleviate optimization tension and false-negative noise in cross-modal alignment, achieving better uniformity and semantic structure.

Abstract

Recent progress in text–video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation.
Keywords

cross-modal retrieval · text-video retrieval · video understanding

Reviews and Discussion

Review (Rating: 4)

This paper presents GARE, a video-text retrieval framework that adjusts each pair's similarity by a learnable increment $\Delta_{ij}$. The increment can be imposed on the embedding of either modality to reduce the modality gap and mitigate the impact of false negatives. Experiments on four popular text-video retrieval datasets show the effectiveness of GARE.

Strengths and Weaknesses

Strengths:

  • The studied problem, i.e., modality gap and false negatives, is practical in cross-modal retrieval.
  • The proposed method that refines the embedding by an increment is novel and instructive.
  • The paper presents good qualitative analysis.

Weaknesses:

  • This paper is not easy to follow, and many technical details are unclear. (1) Is the final training objective simply the sum of the various loss functions, or does it need to be balanced through trade-off parameters? (2) What is the model structure of $\psi$, and how does it handle input from both the modality gap and the context vector?
  • In general, the modality gap is a global metric measuring inherent differences, i.e., $\|\bar{v} - \bar{t}\|$. But the proposed method tackles it in a pair-specific way, which requires some reasonable explanation.
  • Missing related works in contrastive learning/cross-modal retrieval with false negatives.
  • Compared methods are not strong, and many of them are worse than the basic CLIP model. The proposed method should be compared with noise-robust (designed for false negatives) methods.
  • Does the learned network $\psi$ also generalize to inference time to refine the representation?

Questions

Please see weaknesses.

Limitations

Yes.

Final Justification

I think this is a technically good paper, with its main weakness being clarity. Since the authors have committed to adding the missing technical details in the revised manuscript, I have decided to increase my rating.

Formatting Issues

Not found.

Author Response

Thank you very much for your positive feedback on our method. We also apologize for the unclear explanations in the manuscript that caused difficulties in reading. We will address your questions one by one.


Q1: Clarification on Technical Details

  • On the training objective: We use weight parameters $\beta$, $\lambda_{\varepsilon}$, and $\lambda_{\text{dir}}$ to balance the overall training objective, which can be divided into two main parts: the VIB optimization objective $\mathcal{L}_{\text{info}} + \beta \cdot \mathcal{L}_{\text{IB}}$, and the delta structural regularization $\lambda_{\varepsilon} \cdot \mathcal{L}_{\varepsilon} + \lambda_{\text{dir}} \cdot \mathcal{L}_{\text{dir}}$. The full training objective is:

    $$\mathcal{L}_{\text{total}} = \underbrace{\mathcal{L}_{\text{Info}} + \beta \cdot \mathcal{L}_{\text{IB}}}_{\text{VIB objective}} + \lambda_{\varepsilon} \cdot \mathcal{L}_{\varepsilon} + \lambda_{\text{dir}} \cdot \mathcal{L}_{\text{dir}}.$$

    For MSR-VTT, we set $\beta = 0.07$ and $\lambda_{\varepsilon} = \lambda_{\text{dir}} = 0.01$. Since the adjustments are based on the geometric properties of the non-normalized delta embeddings, the two $\lambda$ parameters need to be kept small (around 1e-2) to ensure training stability.

  • On the $\psi$ structure: We apologize for the unclear explanation of $\psi$ in the manuscript. $\psi$ is implemented as a single-layer cross-attention Transformer whose FFN does not use the usual 4× hidden-dimension expansion. We use $v_j - t_i$ as the query input, where $t_i$ is the [CLS] embedding output by the CLIP text encoder (shape (a, dim), with a the text batch size), and $v_j$ is the mean of the f frame embeddings of the j-th video sample (the frame embeddings, of shape (b, f, dim), are produced by a 4-layer video temporal Transformer on top of the CLIP vision encoder, with b the video batch size). The input to the pair-wise computation is therefore of shape (a, b, dim).

    • For the cross-attention part of the $\psi$ module, since we perform pair-wise delta generation, the corresponding attention-score calculation is also parallelized in a pair-wise manner. Specifically, given a query of shape $(a, b, \text{dim})$, we first permute it to $(b, \text{dim}, a)$ and then perform a batched matrix multiplication with the key: $\text{logits} = \text{key} \times \text{query}$, i.e., $(b, f, \text{dim}) \times (b, \text{dim}, a) \to (b, f, a)$. Next, we apply softmax to the logits and permute them to $(b, a, f)$. Finally, the attention output is computed as:

      $$\text{attn-out} = \text{attn-scores} \times \text{value}, \quad \text{i.e.,} \quad (b, a, f) \times (b, f, \text{dim}) \to (b, a, \text{dim}).$$

      This achieves pair-wise parallel computation, with virtually no delay introduced in practice.
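To make the tensor shapes concrete, here is a minimal, self-contained sketch of how the pair-wise gap queries and the delta-adjusted similarities could be assembled; the call to the $\psi$ module is replaced by a zero-tensor placeholder, and all sizes are illustrative rather than our actual configuration:

import torch
import torch.nn.functional as F

a, b, f, dim = 64, 128, 12, 512
t = torch.randn(a, dim)            # [CLS] text embeddings, shape (a, dim)
frames = torch.randn(b, f, dim)    # frame embeddings, shape (b, f, dim)
v = frames.mean(dim=1)             # mean over the f frames, shape (b, dim)

# Pair-wise semantic gaps v_j - t_i via broadcasting -> (a, b, dim)
gap = v.unsqueeze(0) - t.unsqueeze(1)

# delta = psi(query=gap, key=frames, value=frames)  # (a, b, dim) in the real model
delta = torch.zeros(a, b, dim)     # placeholder so the sketch runs end-to-end

# Adjusted similarity s(t_i + delta_ij, v_j) for every (text, video) pair
sim = F.cosine_similarity(t.unsqueeze(1) + delta, v.unsqueeze(0), dim=-1)  # (a, b)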


Q2: Explanation Regarding the Modality Gap

Thank you for raising the question about the modality gap. In fact, the motivation of our work is not to directly address the modality gap itself (i.e., our method does not aim to reduce $\|\overline{v} - \overline{t}\|$), but rather to resolve the optimization tension caused by the modality gap. By releasing this tension, we achieve larger $\text{Var}(t)$ or $\text{Var}(v)$, allowing us to optimize in a broader space. At the same time, in retrieval tasks, reducing the modality gap is a sufficient but not necessary condition for improving retrieval performance: even with a large gap, we can still achieve high retrieval performance. By optimization tension, we refer to the issue that arises when optimizing an anchor $t_i$ with respect to its InfoNCE sub-loss:

$$\mathcal{L}_i = -\log \frac{e^{\cos(t_i, v_i)/\tau}}{\sum_{j=1}^B e^{\cos(t_i, v_j)/\tau}}.$$

The gradient of this loss can be decomposed into the form:

$$\nabla_{t_i} \mathcal{L}_i = \sum_{j=1}^B \underbrace{(p_{ij} - y_{ij})}_{\nabla_{\cos(t_i,v_j)}\mathcal{L}_i} \cdot \underbrace{\left[ \frac{v_j}{\|t_i\|\|v_j\|} - \cos(t_i, v_j) \cdot \frac{t_i}{\|t_i\|^2} \right]}_{\nabla_{t_i}\cos(t_i, v_j)},$$

where $y_{ij}$ is the matching label.

In this optimization process, $(p_{ii}-1) \cdot \nabla_{t_i} \cos(t_i, v_i)$ pulls $t_i$ towards $v_i$, while $p_{ij} \cdot \nabla_{t_i} \cos(t_i, v_j)$ pushes $t_i$ away from $v_j$. However, the modality gap forces the optimization of $t_i$ in two conflicting directions: 1) towards $\overline{v}$, and 2) away from $\overline{v}$. We collected the gradients of $t_i$ during training of the CLIP4Clip model, including the positive-sample gradient $(p_{ii}-1) \cdot \nabla_{t_i} \cos(t_i, v_i)$ and the sum of the negative-sample gradients $\sum_{j=1, j \neq i}^B p_{ij} \cdot \nabla_{t_i} \cos(t_i, v_j)$. Summing these gradients and computing the mean and variance across the 512 dimensions, we found that for the dimensions mainly involved in optimization, the positive gradient is around 40-60 while the negative gradient sum is between -40 and -60; when the two are summed, the resulting value for these dimensions is only around -3 to -4. This strongly suggests that the modality gap induces optimization tension: the net gradient is concentrated near zero, causing stagnation in the optimization of $t_i$ and severely limiting the model's alignment capacity.

(As the rebuttal guidelines prevent the inclusion of plots, we are unable to provide the statistical analysis visuals. If the reviewer is interested, we can provide a detailed explanation of how we observe this phenomenon in CLIP4Clip.)
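In sketch form, the gradient statistics above can be gathered analytically from a batch of embeddings as follows; the temperature value and the randomly initialized embeddings would be replaced by the training configuration, and are only illustrative assumptions here:

import torch
import torch.nn.functional as F

def tension_stats(t, v, tau=0.01):
    """Per-anchor positive pull vs. summed negative push, following the
    gradient decomposition of the InfoNCE sub-loss given above."""
    B, dim = t.shape
    cos = F.normalize(t, dim=-1) @ F.normalize(v, dim=-1).T     # (B, B) cosines
    p = F.softmax(cos / tau, dim=-1)                            # p_ij per anchor

    t_norm = t.norm(dim=-1)                                     # (B,)
    v_norm = v.norm(dim=-1)                                     # (B,)

    # grad of cos(t_i, v_j) w.r.t. t_i: v_j/(|t_i||v_j|) - cos_ij * t_i/|t_i|^2
    g = v.unsqueeze(0) / (t_norm.view(B, 1, 1) * v_norm.view(1, B, 1)) \
        - cos.unsqueeze(-1) * (t / t_norm.unsqueeze(-1) ** 2).unsqueeze(1)

    # (p_ij - y_ij) weighting, giving the per-pair gradient on t_i
    grads = (p - torch.eye(B, device=t.device)).unsqueeze(-1) * g   # (B, B, dim)

    idx = torch.arange(B, device=t.device)
    pos = grads[idx, idx]            # positive-pair gradient, shape (B, dim)
    neg = grads.sum(dim=1) - pos     # sum of negative-pair gradients, (B, dim)
    return pos, neg

Averaging pos + neg over anchors and inspecting the per-dimension means reproduces the near-cancellation described above.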

To address the optimization tension on $t_i$, we reallocate the gradient to $\Delta$, where each $\Delta$ corresponds to a specific pair and only receives gradients from that pair. Essentially, this approach lifts the ceiling on cosine-similarity computation: in the presence of optimization tension, the optimization of $t_i$ is confined to a small region near it, but the introduction of $\Delta$ expands the optimization space. As shown in Figure 2 in the paper, qualitative analysis demonstrates that the introduction of $\Delta$ alleviates the tension (enabling matching in a larger-scale space) while also improving uniformity (a lower similarity distribution for positive pairs).


Q3&Q4: Cross-modal Retrieval and False Negatives

Thank you for your suggestion regarding the missing related works on false negatives and the comparison with methods designed to handle false negatives. The methods compared in the current paper are all based on CLIP, typically involving structural modifications and fine-tuning on the well-known CLIP4Clip model used in TVR tasks. In the revised version, we will include additional methods related to false negatives for comparison, including the following cross-modal approaches:

  • CUSA (AAAI 2024) addresses false negatives by using soft-label alignment from pre-trained models, which reduces the impact of hard labels and enhances sample recognition.
  • VSE++ (BMVC 2018) improves performance by focusing on the hardest negatives within each batch via Max-Hinge Loss, making the model more robust to false negatives.
  • L2RM (CVPR 2024) utilizes an Optimal Transport framework to realign mismatched pairs, improving robustness against hard negatives by refining sample alignment.

These works address false negatives through various techniques, such as soft-label alignment, hard negative mining, and improved robustness metrics.

Comparison with Stronger Methods Designed for False Negatives

We now compare GARE with two recent methods designed to handle hard negatives in text-video retrieval:

  • DMAE (ACM MM 2023, R@1: 46.9, +1.6% over base, our reproduction: 47.1) improves fine-grained alignment by mining hard positives, which implicitly helps push away hard negatives. This is conceptually similar to our IB loss, which uses a variational bottleneck to compress $\Delta$, retaining critical alignment signals for robust optimization.
  • NeighborRetr (CVPR 2025, R@1: 49.5, +2.3% over base, our reproduction: 49.2) uses a memory bank to compute k-neighbor co-occurrence and identify "good hubs," promoting local consistency and reducing over-penalization of hard negatives. However, GARE does not require explicit hard-negative mining. Instead, when encountering a hard negative $v_j$, it shifts part of the loss-gradient responsibility to the corresponding $\Delta_{ij}$, allowing $\Delta_{ij}$ to absorb the noisy gradient acting on $t_i$. This helps alleviate the erroneous optimization of $t_i$, thereby mitigating the noise introduced by hard negatives.

Efficiency Comparison

  • GARE uses only one cross-attention layer, while NeighborRetr includes 8 MLPs and multiple transformer/conv layers.
  • NeighborRetr requires 10240-sample memory banks and about 4.5 hours of training, whereas GARE achieves similar performance with 1 hour and 34 minutes of training and minimal memory.

Empirical Observations

We observe that $t_i$ in GARE becomes semantically more stable: for semantically similar $v_j$, the similarities $s(t_i, v_i)$ and $s(t_i, v_j)$ are smoother than in the baseline. This supports more accurate $\Delta$-based alignment and aligns with Fig. 2, where $t_i$ functions as a stable prototype. Ultimately, the enhanced semantic stability of $t_i$ allows $\Delta$ to carry out more precise fine-grained alignment across similar samples.


Q5: $\psi$ in the Inference Stage

Sorry for not making this clear in the paper. The $\psi$ module also participates in forward propagation during inference, refining the representations so that matching responsibility remains consistent between training and inference.

Comment

Thanks for your rebuttal. I think this is a technically good paper, with its main weakness being clarity. Since the authors have committed to adding the missing technical details in the revised manuscript, I have decided to increase my rating.

Comment

Dear Reviewer,

I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.

Thank you for your time and effort in reviewing our paper.

Comment

Thank you for your thoughtful response and for the constructive and insightful comments.

We sincerely appreciate your patience in reviewing our work and providing valuable suggestions.

We will carefully incorporate the missing technical details and clarifications into the revised manuscript and appendix. Thank you again for your support and for helping us improve the quality of our paper.

Review (Rating: 4)

This paper proposes GARE, a method for improving text-video retrieval by addressing issues in contrastive learning. It introduces a learnable pair-specific adjustment that helps reduce gradient conflicts caused by modality gaps and false negatives. The approach shows consistent performance gains across multiple benchmarks.

Strengths and Weaknesses

Pros:

  1. Introduces a novel mechanism ($\Delta_{ij}$) to locally adjust representation gaps between text and video, rather than forcing global alignment.
  2. Provides a solid theoretical foundation using first-order Taylor expansion under trust-region constraints.
  3. Effectively addresses both modality gap and false negatives, which are common challenges in contrastive learning.

Cons:

  1. While the authors derive $\Delta_{ij}$ updates from a multivariate first-order Taylor approximation under a trust-region constraint, the actual implementation relies on a learned approximation using a neural module. The paper doesn't sufficiently quantify how closely this learned update matches the ideal gradient direction. Including some error analysis between the true and learned update directions would strengthen the claim of "structure-aware" correction.
  2. The method introduces a pairwise increment $\Delta_{ij}$ for each text-video pair, which potentially scales quadratically with batch size. However, the paper does not address the memory or computational overhead introduced by this component, nor how it compares to other methods like hard negative mining or momentum encoders.
  3. From the ablation results, norm and direction regularization only help when combined with the IB loss. This suggests some over-regularization may occur if these components are applied individually. The paper might benefit from a discussion of when to apply these losses and how to tune their trade-off hyperparameters.

Questions

Please see the Weaknesses section.

Limitations

Yes.

Final Justification

The authors' clarifications and empirical evidence have largely addressed my concerns. I will keep my original score unchanged.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for the insightful and constructive comments. We apologize for the potential ambiguity in our original manuscript regarding the interpretation of the delta update mechanism.


Q1: Error Analysis

We clarify that the update rule for each $\Delta_{ij}^{(t+1)}$, derived from a multivariate first-order Taylor expansion under a trust-region constraint, is not meant as a unique or ideal direction, but rather as a principled descent direction compatible with InfoNCE optimization. The Taylor expansion offers a general update framework, not a fixed solution.

To address the batch-local limitation, we introduce a neural module $\psi$ to generate $\Delta_{i*}^{(t)}$ and embed the update into end-to-end training. Backpropagation allows deltas to evolve implicitly, capturing transferable and structure-aware patterns across batches while remaining consistent with the theoretical formulation.

Since the descent direction is not unique, a strict vector-wise comparison is infeasible. Instead, we assess consistency via step size, focusing on the analytic step length $\alpha_{ij}^{(t)}$ that enforces the trust-region radius $\varepsilon$:

$$\left\| \Delta_{ij}^{(t)} - \alpha_{ij}^{(t)} \hat{g}_{ij}^{(t)} \right\|^2 \le \varepsilon^2, \quad \text{where} \quad \hat{g}_{ij}^{(t)} = \frac{\nabla_{\Delta_{ij}} \mathcal{L}_i^{(t)}}{\left\| \nabla_{\Delta_{ij}} \mathcal{L}_i^{(t)} \right\|}, \quad \varepsilon = \|\Delta_{ij}^{(t+1)}\|.$$

We take the equality case to project the update exactly onto the boundary of the feasible region; expanding the squared norm leads to a quadratic equation in $\alpha_{ij}^{(t)}$:

$$\alpha_{ij}^{(t)\,2} - 2\alpha_{ij}^{(t)} \, \Delta_{ij}^{(t)\top} \hat{g}_{ij}^{(t)} + \left\| \Delta_{ij}^{(t)} \right\|^2 - \varepsilon^2 = 0,$$

whose positive root gives the closed-form solution:

$$\alpha_{ij}^{(t)} = \Delta_{ij}^{(t)\top} \hat{g}_{ij}^{(t)} + \sqrt{ \left( \Delta_{ij}^{(t)\top} \hat{g}_{ij}^{(t)} \right)^2 - \left\| \Delta_{ij}^{(t)} \right\|^2 + \varepsilon^2 }.$$

In contrast, the neural module $\psi$ is updated through AdamW, and after one optimization step the actual update can be written as:

$$\Delta_{ij}^{(t+1)} = \Delta_{ij}^{(t)} - \eta_{ij}^{(t)} \cdot \frac{\nabla_{\Theta^{(t)}} \mathcal{L}_i^{(t)}}{\left\| \nabla_{\Theta^{(t)}} \mathcal{L}_i^{(t)} \right\|},$$

where $\eta_{ij}^{(t)}$ denotes the effective step size resulting from the AdamW update. To quantify the closeness between the learned and analytic updates, we compute the scalar error:

$$\left| \eta_{ij}^{(t)} - \alpha_{ij}^{(t)} \right|.$$
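In sketch form, the analytic step size can be computed per pair as follows (the clamp on the discriminant is our own numerical guard, not part of the derivation):

import torch

def analytic_step_size(delta_t, grad, delta_next):
    """Closed-form trust-region step size alpha_ij^(t) for one pair.

    delta_t:    current increment Delta_ij^(t), shape (dim,)
    grad:       gradient of L_i w.r.t. Delta_ij at step t, shape (dim,)
    delta_next: Delta_ij^(t+1); its norm defines the radius epsilon
    """
    g_hat = grad / grad.norm()
    eps = delta_next.norm()
    proj = torch.dot(delta_t, g_hat)                     # Delta^T g_hat
    disc = proj ** 2 - delta_t.norm() ** 2 + eps ** 2    # discriminant
    return proj + torch.sqrt(torch.clamp(disc, min=0.0))

The scalar error is then (eta - alpha).abs() for the effective AdamW step size eta.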


Empirical results:

  • For positive pairs ($i = j$): We observe that the error $|\eta_{ii} - \alpha_{ii}|$ remains within a bounded range of [1.0, 4.5] during training and shows a clear trend of convergence. Considering the 512-dimensional embedding space, this deviation is moderate and acceptable.

  • For negative pairs ($i \ne j$): the error is larger and non-convergent, reflecting their inherently repulsive role without consistent alignment supervision. These fluctuations are meaningful: by expanding negative $\Delta_{ij}$ norms and lowering their cosine similarity, they enhance embedding uniformity, a key property in contrastive learning.


In summary, the scalar step-size analysis shows that $\psi$ produces updates consistent with the analytic trust-region rule for positive pairs, validating its structure-aware behavior, while the divergence on negative pairs naturally serves to promote embedding uniformity, in line with contrastive learning objectives.

(We will include the error plots in the appendix and public codebase if the paper is accepted, as current rebuttal guidelines do not permit figures.)


Q2: Efficiency and Hard Negatives

We thank the reviewer for the observation. Although our method introduces pairwise $\Delta_{ij}$ (quadratic in batch size), we never store all deltas. Instead, a batch-streaming strategy keeps memory practical:

  • For each 128×128 text–video block, $\Delta_{ij}$ (≈32MB in float32) is computed on-the-fly by $\psi$, used to form the similarity matrix, and immediately discarded;
  • Only one delta tensor exists at a time, and only the similarity block is retained for later concatenation.

This design keeps memory low and preserves dual-tower efficiency. Runtime and memory (4×RTX 4090) on MSR-VTT are:

| Metric | CLIP4Clip | GARE |
| --- | --- | --- |
| Training Time | 1h 30min | 1h 34min |
| Inference Time | 7.6s | 6.9s |
| Training Memory | 4 × 12175MB | 4 × 12561MB |
| Inference Memory | 4136MB | 4216MB |
| Training FLOPs (batch-wise) | 39,167.58 GFLOPs | 39,287.55 GFLOPs |
| Inference FLOPs (batch-wise) | 11,868.70 GFLOPs | 11,905.05 GFLOPs |

Takeaway:

  • $\Delta_{ij}$ is transient, only the similarity matrix is stored, and overhead remains minimal.
  • Training memory increases by ~3%, and inference memory increases by ~1%, which is modest.
  • $\Delta$ is computed via pair-wise parallelized cross-attention (details in our response to Reviewer ztES).

(Due to space constraints, the detailed logic of the inference stage can be found in our response to Reviewer ztES. We apologize again for this.)


Hard Negative Comparison

We compare GARE with two recent methods related to hard negative handling:

DMAE (ACM MM 2023, R@1: 46.9, +1.6% over base, our reproduction: 47.1): DMAE improves fine-grained alignment by mining hard positives (e.g., queries tied to specific frames), which implicitly enhances the model's ability to push away hard negatives. This is conceptually similar to our IB loss, which compresses $\Delta$ via a variational bottleneck to retain critical alignment signals.

NeighborRetr (CVPR 2025, R@1: 49.5, +2.3% over base, our reproduction: 49.2): Uses a memory bank to compute k-neighbor co-occurrence and identify “good hubs.” This promotes local consistency and reduces over-penalization of hard negatives. While not explicitly framed as hard negative mining, selecting top-k co-occurring samples achieves similar effects.

GARE (R@1: 49.1, +2.5% over base): GARE does not require explicit hard-negative mining. Each $\Delta_{ij}$ absorbs loss gradients locally from the corresponding pair, reducing reliance on $t_i$ when encountering hard negatives. This softens noisy updates and improves generalization.

Efficiency Comparison:

  • GARE uses only one cross-attention layer; NeighborRetr includes 8 MLPs and multiple transformer/conv layers.
  • NeighborRetr requires 10240-sample memory banks per modality and ~4.5h training time; GARE achieves similar performance with 1h34min training and minimal memory.

Empirically, we observe that $t_i$ in GARE becomes semantically more stable: for semantically similar $v_j$, the similarities $s(t_i, v_i)$ and $s(t_i, v_j)$ are smoother than in the baseline. This supports more accurate delta-based alignment and aligns with Fig. 2, where $t_i$ functions as a stable prototype. Ultimately, the enhanced semantic stability of $t_i$ allows $\Delta$ to carry out more precise fine-grained alignment across similar samples. (Due to rebuttal guidelines, we regret that we cannot include figures here.)

Q3: Role of the IB Loss

Our method follows the variational information bottleneck (VIB) principle. The standard VIB objective is

$$\mathcal{L}_{\text{VIB}} = -\mathbb{E}_{z\sim p(z|x)}[\log q(m|z)] + \beta \cdot \mathrm{KL}\big(p(z|x) \,\|\, r(z)\big),$$

where $x$ is the input, $z$ is the bottleneck variable, and $m$ is the match label. In our cross-modal retrieval setting, we set $x = (t_i, v_j)$ and $z = \Delta_{ij}$. The first term naturally corresponds to the InfoNCE loss on positive/negative text–video pairs, and the second term corresponds to our IB loss (KL divergence) on $\Delta_{ij}$.

In principle, the compression term should be pair-wise:

$$\mathrm{KL}\big(p(\Delta_{ij}\mid t_i,v_j) \,\|\, r(\Delta)\big),$$

which would require modeling each $\Delta_{ij}$ as a distribution and sampling $k$ times, leading to $(k, B_t, B_v, \text{dim})$ complexity. To make this tractable, we adopt a video-conditioned approximation:

$$p(\Delta_{ij}\mid t_i,v_j) \approx p(\Delta_{ij}\mid v_j),$$

and compute the KL term by taking the mean and standard deviation along the text dimension:

$$\mu_j = \mathrm{mean}_i(\Delta_{ij}), \quad \sigma_j = \mathrm{std}_i(\Delta_{ij}),$$

approximating $p(\Delta_{ij}\mid v_j) \approx \mathcal{N}(\mu_j, \sigma_j^2)$. This is consistent with the many-to-many nature of video–text datasets (e.g., MSR-VTT has ~20 captions per video) and still satisfies the Monte Carlo property by treating text samples as stochastic draws. Empirically, video-conditioning outperforms text-conditioning because videos provide more stable, modality-level common information that forms an effective bottleneck.
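A minimal sketch of this compression term, assuming a standard-normal prior $r(\Delta) = \mathcal{N}(0, I)$ and a sum-over-dimensions, mean-over-videos reduction (both reduction choices are assumptions of the sketch, not specified above):

import torch

def ib_kl_loss(delta, eps=1e-8):
    """Video-conditioned VIB compression term.

    delta: pair-wise increments of shape (a, b, dim); Gaussian statistics
    are taken over the text axis, giving one distribution per video j.
    """
    mu = delta.mean(dim=0)             # (b, dim)
    sigma = delta.std(dim=0) + eps     # (b, dim), eps for numerical stability
    # KL( N(mu, sigma^2) || N(0, I) ), per video, then averaged over videos
    kl = 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - torch.log(sigma ** 2))
    return kl.sum(dim=-1).mean()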

Our overall training objective is therefore:

$$\mathcal{L}_{\text{total}} = \underbrace{\mathcal{L}_{\text{Info}} + \beta \cdot \mathcal{L}_{\text{IB}}}_{\text{VIB objective}} + \lambda_{\varepsilon} \cdot \mathcal{L}_{\varepsilon} + \lambda_{\text{dir}} \cdot \mathcal{L}_{\text{dir}},$$

where we set $\beta = 0.07$ and $\lambda_{\varepsilon} = \lambda_{\text{dir}} = 0.01$. The small weights for norm/direction regularization are due to the naturally large vector magnitudes in a 512-D embedding space, where a small coefficient is sufficient for stable adjustment.
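In sketch form, with the four component losses assumed to be computed elsewhere in the training loop:

beta, lam_eps, lam_dir = 0.07, 0.01, 0.01   # MSR-VTT settings above

def total_loss(L_info, L_ib, L_eps, L_dir):
    # VIB objective + delta structural regularization
    return (L_info + beta * L_ib) + lam_eps * L_eps + lam_dir * L_dir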

Finally, this also explains why norm and direction regularization only take effect when combined with the IB loss. Without the IB loss, $\Delta_{ij}$ has excessive freedom under InfoNCE and can trivially vary in magnitude or orientation, making these penalties ineffective. The IB loss compresses $\Delta_{ij}$ into a low-entropy, structured bottleneck, which stabilizes its distribution; under this constraint, the norm and direction terms provide meaningful geometric guidance to improve alignment and uniformity.

Comment

Dear Reviewer,

I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.

Thank you for your time and effort in reviewing our paper.

Comment

Dear Reviewer,

I hope this brief follow-up is not an inconvenience, and I apologize for reaching out once more. As the discussion period is nearing its close (approximately one day remaining), I wished to gently check whether there are any additional comments or clarifications I might address. I sincerely value your perspective. If time allows, any guidance you can share would be greatly appreciated and will inform our final revisions.

I also understand that reviewer participation during the discussion phase is expected and encouraged under the NeurIPS process, as it helps ensure a balanced and thorough assessment. I would be grateful for any feedback at your convenience and look forward to hearing from you.

Thank you again for your time and thoughtful consideration.

Review (Rating: 4)

The paper proposes GARE, a new gap-aware retrieval framework that introduces learnable pair-specific increments (Δij) to alleviate optimization tension in text-video contrastive learning. Through first-order Taylor approximation of InfoNCE loss, the work derives Δij as an optimal adjustment direction that resolves gradient conflicts caused by modality gaps and false negatives. The framework employs a lightweight neural module to predict Δij based on semantic gaps between pairs, enabling structure-aware corrections while maintaining training stability. Three key regularizations (trust-region constraint, directional diversity, and information bottleneck) are introduced to enhance the interpretability and geometric properties of the learned increments. Extensive experiments across four benchmarks demonstrate consistent improvements in alignment accuracy and robustness, with qualitative analysis showing Δij effectively redistributes gradient tension while preserving semantic structure.

Strengths and Weaknesses

[Strengths]

+ Novel solution. The paper introduces a novel approach to address the inherent optimization tension in contrastive learning by proposing pair-specific increments (Δij). This mechanism dynamically redistributes gradient conflicts caused by modality gaps and false negatives, offering a more stable and interpretable optimization landscape.

+ Theoretical support. The work derives the ideal form of Δij through a first-order Taylor expansion under trust-region constraints, providing a solid theoretical foundation.

+ Extensive experiments. The paper demonstrates consistent improvements across four diverse benchmarks (MSR-VTT, DiDeMo, ActivityNet, MSVD), outperforming recent state-of-the-art methods. The ablation studies (e.g., modality choice, regularization strength) provide actionable insights for adapting the framework to different data characteristics.

+ Qualitative analysis. The visualization reveals that Δij learns geometrically structured corrections—pushing embeddings outward while maintaining angular separation. This aligns with the uniformity principle of contrastive learning.

[Weaknesses]

- Lack of efficiency analysis. The application of learnable per-pair increments inevitably introduces additional training and inference overhead. However, the article only provides qualitative and quantitative analysis of this method on the Text-Video Retrieval task. Does the performance improvement solely come from adding more learnable parameters?

- The method lacks adaptability. According to the left side of Table 3 in the ablation experiments section, the authors extensively explored the Context Modality Choice and concluded that the decision on which modality to inject Δij into should be based on the characteristics of the dataset. Moreover, the reported results for each dataset in the article require selecting the best-performing setting through trial and error. This raises concerns about the method's poor generalizability and universality, as it necessitates dataset-specific configurations. Can this process be made adaptive?

- More tasks are needed to verify generalizability. Due to the presence of annotation noise, the issue of optimization tension is not limited to contrastive learning in text-video retrieval tasks but also exists in various cross-modal contrastive learning tasks. It is recommended that the authors conduct additional experiments on tasks such as text-image retrieval to verify whether the proposed method possesses sufficiently strong generalizability.

Writing standards and typos:

  • Equations 2 and 3 exhibit obvious subscript and superscript errors. For example, "i" should be a subscript, and "B" should not be a superscript of "i=1"; the sum should rather be written as $\sum_{j=1}^{B}$.
  • Line 268: "MST-VTT" should be corrected to "MSR-VTT."
  • The authors are advised to carefully review the article to correct these typos and formula formatting issues.

Questions

See the Weaknesses section. I look forward to the authors' response to my concerns during the rebuttal stage.

Limitations

Yes.

Final Justification

The rebuttal effectively addressed my concerns regarding efficiency and generalizability:

  • Due to the use of a batch-wise streaming strategy, GARE is comparable in efficiency to CLIP4Clip.
  • GARE also demonstrated performance improvements over the baseline in the Image-Text Retrieval task.

My primary concern remains with the method's adaptivity. The author provided experimental results for bidirectional delta generation along with a detailed analysis, concluding that the two-path module will weaken its alignment behavior, and it is indeed necessary to decide in which modality to inject Δij based on the characteristics of the dataset.

In summary, this paper proposes an innovative method to address the inherent optimization tension in contrastive learning, supported by thorough theoretical analysis and extensive experiments that validate the method's effectiveness. From my perspective, it is a work worthy of acceptance. Since my initial score was positive, I have decided not to change the score.

Formatting Issues

Writing standards and typos:

  • Equations 2 and 3 exhibit obvious subscript and superscript errors. For example, "i" should be a subscript, and "B" should not be a superscript of "i=1"; the sum should rather be written as $\sum_{j=1}^{B}$.
  • Line 268: "MST-VTT" should be corrected to "MSR-VTT."
  • The authors are advised to carefully review the article to correct these typos and formula formatting issues.
Author Response

Thank you for raising these questions and for carefully pointing out the writing errors! We will correct them and re-check the spelling. Below are our responses to the questions.


Q1: Impact of Efficiency and Parameter Quantity

We thank the reviewer for the insightful observation and concerns about the efficiency of the method. While our method introduces pairwise $\Delta_{ij}$ with quadratic complexity, we do not store all deltas. Instead, we use a batch-wise streaming strategy that keeps memory usage practical.

At inference time, we precompute all CLIP encoder outputs and evaluate cosine similarities between text–video batches (e.g., 128×128 pairs) using the lightweight delta module $\psi$. For each batch pair:

  • $\Delta_{ij}$ (shape: 128×128×512, ≈32MB in float32) is computed on-the-fly;
  • It is immediately discarded after computing the 128×128 similarity matrix;
  • Only the similarity block is retained and later concatenated into the full similarity matrix.

Thus, only one delta tensor is active at any time, and the computation is pair-wise parallelized.

Below is a runtime and memory comparison (4×RTX 4090 GPUs) on MSR-VTT:

| Metric | CLIP4Clip | GARE |
| --- | --- | --- |
| Training Time | 1h 30min | 1h 34min |
| Inference Time | 7.6s | 6.9s |
| Training Memory | 4 × 12175MB | 4 × 12561MB |
| Inference Memory | 4136MB | 4216MB |
| Training FLOPs (batch-wise) | 39,167.58 GFLOPs | 39,287.55 GFLOPs |
| Inference FLOPs (batch-wise) | 11,868.70 GFLOPs | 11,905.05 GFLOPs |

Key takeaway:

  • $\Delta_{ij}$ is a transient variable, not stored;
  • Only the similarity matrix is retained;
  • $\Delta$ is computed via pair-wise parallelized cross-attention (details in our response to Reviewer ztES).

(Due to space constraints, more details about pair-wise parallelized computation can be found in our response to Reviewer ztES. We apologize again for this.)


Impact of parameter quantity

Thank you for raising this concern. To clarify that the improvements in GARE are not merely due to the increase in parameters, we conducted several experiments comparing GARE with the baseline (GARE itself achieves R@1 = 49.1, R@5 = 74.7, R@10 = 83.6):

  1. Baseline comparison with additional layers: We increased the video temporal Transformer in the baseline from 4 to 5 layers, aligning its parameter count with GARE. The performance was R@1 = 47.3, R@5 = 73.2, R@10 = 82.5, showing that the increase in parameters alone yields only a modest improvement, and that GARE performs better under the same parameter budget.
  2. Parameter alignment with the baseline: Reducing the video temporal Transformer in GARE to 3 layers, we aligned its parameter count with the baseline. The results were R@1 = 48.1, R@5 = 73.8, R@10 = 83.6, confirming that GARE has a structural advantage.
  3. Interaction modification: We modified GARE's interaction by setting the query for $\psi$ to be $t_i$ and the key-value to be the frames of $v_j$, outputting $t_i'$. To match the scale of the delta in GARE, we computed $\Delta = v_j - t_i'$ (or $t_i' - v_j$), effectively removing the relative structure information between $t_i$ and $v_j$. This modification resulted in R@1 = 47.6, R@5 = 73.1, R@10 = 82.0, MdR = 2.0, and MnR = 12.6, showing that although this variant also uses pair-wise interaction, GARE's interaction logic is superior. The results demonstrate that $\psi$'s awareness of the relative structure between $v_j$ and $t_i$ is indeed crucial for improving performance.

In conclusion, GARE's performance gains come from structural advantages and improved interaction logic, not merely from added parameters (only 1.5M, far fewer than CLIP's 151M).

Q2: Adaptivity of the Method

Thank you for raising the question about adaptivity. To explore this, we intuitively used two $\psi$ modules for bidirectional delta generation, resulting in $(t_i + \Delta^{t}_{ij},\, v_j + \Delta^{v}_{ij})$. This led to performance degradation, with R@1 dropping from 42.6 to 41.2 on ActivityNet Captions and from 49.1 to 48.4 on MSR-VTT, though still outperforming the baseline, indicating that the delta-based approach maintains adaptivity. Additionally, we observed performance improvement in an image-text retrieval task on MS-COCO, further supporting GARE's adaptivity across different domains (details in the final answer).

We hypothesize that this decline results from a progressive weakening of $\Delta^{t}_{ii}$'s alignment role due to semantic ambiguity introduced by $\Delta^{v}_{ii}$. As training progresses, $v_i$ evolves into a prototype-style anchor, similar to Fig. 2(b). When $v_i$ is used as the context in $\psi$ to compute $\Delta^t_{ii}$, its over-generalized nature hinders accurate computation of $\Delta^t_{ii}$, reducing alignment effectiveness and compromising training.

To validate that $\Delta_{ii}$ performs the alignment task, we compare the learned delta update magnitude with the theoretically derived update magnitude, measuring the error to assess whether $\Delta_{ii}$'s behavior aligns with the theoretical expectation. By expanding $\mathcal{L}_i$ w.r.t. $\Delta_{ij}^{(t)}$, we obtain the analytic update step size $\alpha_{ij}^{(t)}$, ensuring $\Delta_{ij}^{(t+1)}$ stays within a trust region:

$$\left\| \Delta_{ij}^{(t)} - \alpha_{ij}^{(t)} \hat{g}_{ij}^{(t)} \right\|^2 \le \varepsilon^2, \quad \hat{g}_{ij}^{(t)} = \frac{\nabla_{\Delta_{ij}} \mathcal{L}_i^{(t)}}{\|\nabla_{\Delta_{ij}} \mathcal{L}_i^{(t)}\|}, \quad \varepsilon = \|\Delta_{ij}^{(t+1)}\|.$$

Simplifying for the boundary case gives the analytic step size:

$$\alpha_{ij}^{(t)} = \Delta_{ij}^{(t)\top} \hat{g}_{ij}^{(t)} + \sqrt{ \left( \Delta_{ij}^{(t)\top} \hat{g}_{ij}^{(t)} \right)^2 - \|\Delta_{ij}^{(t)}\|^2 + \varepsilon^2 }.$$

In contrast, the actual update induced by $\psi$ after one optimizer step is:

$$\Delta_{ij}^{(t+1)} = \Delta_{ij}^{(t)} - \eta_{ij}^{(t)} \cdot \frac{\nabla_{\Theta^{(t)}} \mathcal{L}_i^{(t)}}{\|\nabla_{\Theta^{(t)}} \mathcal{L}_i^{(t)}\|},$$

where $\eta_{ij}^{(t)}$ is the implicit learned step size. To evaluate their consistency, we compute the error between the two step sizes:

$$\left| \eta_{ij}^{(t)} - \alpha_{ij}^{(t)} \right|.$$


  • Empirical results show that for positive pairs ($i = j$), the error remains within a moderate range ([1.0, 4.5] in 512-D space) and decreases steadily as training progresses, indicating that $\psi$ produces updates consistent with the theoretical rule. This validates that $\Delta_{ii}$ plays a precise alignment role.

  • For negative pairs ($i \neq j$), the error is larger and does not converge, as $\Delta_{ij}$ receives only push gradients from $v_j$ without alignment supervision. This optimization occurs in a $(D-1)$-dimensional space, leading to more variable updates compared to the 1-dimensional alignment of $\Delta_{ii}$. Consequently, $\Delta_{ij}$ primarily enhances uniformity by increasing scale and dispersing representations.

  • Comparison with dual $\psi$ modules: When using dual $\psi$ modules (i.e., $(t_i + \Delta^t_{ij},\, v_j + \Delta^v_{ij})$), both $\Delta^t_{ii}$ and $\Delta^v_{ii}$ show slower convergence to the theoretical update step size $\eta^{(t)}$. Surprisingly, after some initial convergence, the update-magnitude error for $\Delta^t_{ii}$ and $\Delta^v_{ii}$ starts to diverge, which suggests that their alignment capabilities are weakened, preventing accurate update directions from being learned.


This validates our analysis: when $\Delta_{ii}$ performs the alignment function, using a two-path $\psi$ module weakens its alignment behavior. In future work, we will focus on alleviating the optimization tension while seeking a more adaptive dual-path method.

(We will include the error plots in the appendix and public codebase if the paper is accepted, as current rebuttal guidelines do not permit figures.)

Q3: GARE's Performance on Image-Text Retrieval

Thank you for your suggestion to evaluate GARE's generalization on image-text retrieval. We used CLIP as the baseline and conducted ITM tasks on MS-COCO 1K and 5K, comparing against methods such as VSE∞, PCME++, and PAU. Due to time constraints, we did not fine-tune GARE on MS-COCO, resulting in limited improvement on the ITM task. Currently, we use CLIP's image patch tokens as the context input for the module, but these often introduce noise and carry weaker semantic information than the image's [CLS] embedding. In the future, we plan to aggregate the patches to extract cleaner features, enabling better validation of GARE's effectiveness on the ITM task. Below are the results of GARE on MS-COCO with ViT-B/32 as the backbone:

| Method | 1k i2t R@1 | R@5 | R@10 | 1k t2i R@1 | R@5 | R@10 | 5k i2t R@1 | R@5 | R@10 | 5k t2i R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VSE∞ | 82.0 | 97.2 | 98.9 | 69.0 | 92.6 | 96.8 | 62.3 | 87.1 | 93.3 | 48.2 | 76.7 | 85.5 |
| PCME++ | 81.6 | 97.2 | 99.0 | 69.2 | 92.8 | 97.1 | 62.1 | 86.8 | 93.3 | 48.1 | 76.7 | 85.5 |
| PAU | 80.4 | 96.2 | 98.5 | 67.7 | 91.8 | 96.6 | 63.6 | 85.2 | 92.2 | 46.8 | 74.4 | 83.7 |
| Baseline | 80.1 | 95.7 | 98.2 | 67.1 | 91.4 | 96.6 | 62.9 | 84.9 | 91.6 | 46.5 | 73.8 | 82.9 |
| Ours | 80.7 | 96.8 | 99.1 | 68.2 | 91.9 | 96.6 | 63.9 | 85.5 | 92.4 | 47.2 | 74.8 | 83.5 |

At the same time, we also performed training with noisy correspondences on MS-COCO, where GARE demonstrates significantly better robustness than the baseline:

| Noise ratio | Method | 1k i2t R@1 | R@5 | R@10 | 1k t2i R@1 | R@5 | R@10 | 5k i2t R@1 | R@5 | R@10 | 5k t2i R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20% | Baseline | 76.0 | 94.3 | 97.5 | 63.4 | 89.0 | 94.8 | 55.3 | 79.1 | 86.9 | 41.0 | 68.8 | 79.3 |
| 20% | Ours | 79.4 | 96.0 | 98.8 | 65.2 | 90.7 | 96.4 | 58.9 | 83.7 | 91.2 | 43.7 | 70.7 | 80.8 |
| 50% | Baseline | 73.9 | 93.0 | 97.2 | 60.1 | 87.3 | 94.0 | 54.1 | 78.5 | 86.6 | 39.7 | 67.2 | 77.5 |
| 50% | Ours | 77.2 | 93.6 | 97.2 | 62.5 | 89.1 | 94.2 | 56.1 | 80.8 | 87.9 | 41.2 | 68.6 | 79.2 |
Comment

Thank you very much for detailed reply. The rebuttal effectively addressed my concerns regarding efficiency and generalizability:

  • Due to the use of a batch-wise streaming strategy, GARE is comparable in efficiency to CLIP4Clip.
  • GARE also demonstrated performance improvements over the baseline in the Image-Text Retrieval task.

My primary concern remains with the method's adaptivity. The author provided experimental results for bidirectional delta generation along with a detailed analysis, concluding that the two-path module will weaken its alignment behavior, and it is indeed necessary to decide in which modality to inject Δij based on the characteristics of the dataset.

In summary, this paper proposes an innovative method to address the inherent optimization tension in contrastive learning, supported by thorough theoretical analysis and extensive experiments that validate the method's effectiveness. From my perspective, it is a work worthy of acceptance. Since my initial score was positive, I have decided not to change the score.

Comment

Thank you for your thoughtful comments and for highlighting the importance of evaluating generalization and adaptivity in image-text retrieval tasks.

We truly appreciate your insightful suggestions, which helped broaden the scope of our evaluation and strengthen the practical significance of our method. We have taken your advice seriously and will include corresponding experiments and discussions in the revised manuscript and appendix.

Thank you again for your valuable feedback and support throughout the review process.

Comment

Dear Reviewer,

I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.

Thank you for your time and effort in reviewing our paper.

Review (Rating: 3)

The paper proposes GARE, a framework addressing optimization challenges in text-video retrieval by introducing learnable pair-specific increments $\Delta_{i,j}$ to mitigate gradient tension from modality gaps and false negatives. The paper presents a theoretically grounded approach to addressing gradient tension in text-video retrieval, with strong experimental support for its effectiveness. However, its departure from the dual-tower paradigm and potential inefficiencies raise critical questions about its practical utility.

Strengths and Weaknesses

Strengths

  1. Valid Core Motivation: The work effectively targets a critical and well-recognized issue in cross-modal learning: the domain (modality) gap between text and video representations. By framing this gap as a source of optimization tension under contrastive learning, the paper identifies a meaningful problem with clear practical relevance, laying a solid foundation for the proposed solution.
  2. Rigorous Experimental Design: The experiments are comprehensive and well-structured. The evaluation spans four standard benchmarks (MSR-VTT, DiDeMo, ActivityNet Captions, MSVD), with consistent comparisons to state-of-the-art methods. Ablation studies systematically validate the contributions of key components ($\Delta_{i,j}$, regularization terms, modality-specific application), ensuring the observed improvements are attributable to the proposed mechanisms rather than confounding factors.
  3. Sound Theoretical and Design Contributions: The derivation of $\Delta_{i,j}$ via multivariate Taylor expansion of the InfoNCE loss under a trust-region constraint is mathematically rigorous, providing a clear rationale for pair-specific increments. The analysis of $\Delta_{i,j}$'s geometric properties (e.g., angular distributions) and the design of complementary regularizers (norm constraints, directional diversity, information bottleneck) demonstrate careful consideration of how to stabilize and interpret the learned increments. This integration of theory and design strengthens the framework's validity.

Weaknesses

  1. Violation of Dual-Tower Architecture Principles: A central drawback is that the introduction of pair-specific $\Delta_{i,j}$ disrupts the independence of the text and video towers in cross-modal retrieval. Traditional dual-tower models enable efficient retrieval via precomputed embeddings and dot-product similarity, but GARE's $\Delta_{i,j}$ introduces dynamic, pair-dependent interactions $t_i + \Delta_{i,j}$ that cannot be precomputed. This undermines the core advantage of dual-tower designs—scalable, offline retrieval—and limits practical deployment in real-world systems where low latency is critical. The paper does not address this trade-off or propose mitigations (e.g., approximation strategies for faster inference).
  2. Efficiency Concerns: The training and inference efficiency of GARE is inadequately discussed. The learnable $\Delta_{i,j}$ is pair-specific, meaning its computational cost scales with the number of text-video pairs in a batch, which could become prohibitive for large datasets or high-throughput applications. If $\Delta_{i,j}$ requires real-time computation for each query-target pair during inference, this would drastically slow down retrieval compared to static embedding-based methods. Without details on latency or scalability, the practical feasibility of GARE remains uncertain. Additionally, the paper does not clarify how $\Delta_{i,j}$ is computed for unseen test pairs.

Questions

Please see the weaknesses above.

Limitations

Please see the weaknesses above.

Formatting Issues

No.

Author Response

We sincerely thank the reviewer for raising two critical concerns: (1) whether the introduction of pair-specific deltas violates the principles of dual-tower architectures; (2) whether the proposed ψ module could impact training and inference efficiency.

We address both points in detail below.


Q1: Dual-Tower Compatibility

GARE augments a standard dual-tower CLIP architecture with a lightweight cross-modal adjustment module ψ.

  • ψ is implemented as a single-layer cross-attention Transformer whose FFN hidden dimension is not expanded, adding only 1.58M parameters on top of the 354M parameters of the CLIP encoders. During inference, ψ introduces 36.35 GFLOPs per batch, far below the baseline encoder's 11,868.70 GFLOPs per batch.
  • It operates after embeddings are precomputed by the dual towers, applying pair-specific deltas to adjust text embeddings before computing cosine similarity.

Importantly, this design does not compromise dual-tower precomputability:

  1. All text and video embeddings are computed and cached offline by the CLIP encoders (with a 4-layer video temporal Transformer after the CLIP vision encoder).
  2. ψ only introduces a transient delta computation for each text–video batch pair.
  3. No delta embedding is stored, and only the resulting cosine similarity block is retained.

Thus, GARE preserves the key advantages of dual-tower architectures—offline caching and scalable batch-parallel retrieval—while enabling fine-grained pairwise alignment through a minimal adjustment step.


Q2: Training and Inference Efficiency

Thank you for raising concerns about computational efficiency. We address them by comparing FLOPs and memory usage, showing how efficient batch-pair similarity calculation is implemented during inference, and explaining how the $\psi$ module ensures negligible latency by computing pair-wise parallelizable cross-attention over (text, video) pairs.

We provide a detailed comparison of GARE and CLIP4Clip on 4×RTX 4090 GPUs (batch size 128) and MSR-VTT:

| Metric | CLIP4Clip | GARE |
| --- | --- | --- |
| Training Time | 1h 30min | 1h 34min |
| Inference Time | 7.6s | 6.9s |
| Training Memory (reserved) | 4 × 12175MB | 4 × 12561MB |
| Inference Memory (reserved) | 4136MB | 4216MB |
| Training FLOPs (batch-wise) | 39,167.58 GFLOPs | 39,287.55 GFLOPs (+0.3%) |
| Inference FLOPs (batch-wise) | 11,868.70 GFLOPs | 11,905.05 GFLOPs (+0.3%) |

Module-wise GFLOPs Breakdown

Here is a summary of the GFLOPs per module during forward propagation:

| Module | GFLOPs | Percentage | Main Calculation Source |
| --- | --- | --- | --- |
| CLIP Visual Encoder | 11,673.60 | 98.05% | ViT image processing |
| Video Temporal Transformer | 38.81 | 0.33% | FFN (66.4%) + Self-Attention (33.3%) |
| ψ (Cross-Attn Transformer) | 36.35 | 0.31% | Linear projection (73.6%) |
| CLIP Text Encoder | 156.29 | 1.31% | Text sequence processing |
| Total | 11,905.05 | 100% | - |

Key observations:

  • Training memory increases by ~3%, and inference memory increases by ~1%, which is modest.
  • FLOPs increase in inference is ~0.3% per batch pair and remains dominated by the encoder.
  • End-to-end inference is slightly faster (6.9s vs 7.6s) due to efficient batch-wise similarity computation. Delta computation is identical for training and inference and remains transient (e.g., 128×128×512, ~32MB) during the forward pass. Only the cosine similarity block is retained for further computation.

Crucially, deltas are computed on-the-fly and immediately discarded, with a typical tensor (128×128×512, float32) occupying only ≈32MB transiently. This streaming computation ensures that memory usage remains minimal even for large-scale datasets.


Inference Procedure (Pseudocode)

The inference pipeline preserves dual-tower efficiency via batch-parallel streaming computation:

# Precompute all text and video embeddings offline using the CLIP encoders

sim_rows = []
for T in text_batches:          # cached text embeddings for one batch
    row = []
    for V in video_batches:     # cached video embeddings for one batch
        # 1. Compute pair-specific deltas via psi, adjust the text
        #    embeddings, and return the cosine-similarity block
        logits = model.get_similarity_logits(T, V)

        # 2. Keep only the similarity block for later concatenation;
        #    the delta tensor is transient and freed immediately
        row.append(logits)
    sim_rows.append(concat(row, axis=1))

# Concatenate all similarity blocks into the full matrix for retrieval
sim_matrix = concat(sim_rows, axis=0)

Key points:

  • Deltas are never stored, only used to produce logits;
  • Memory footprint remains low, and computation is fully GPU-parallel;
  • The design is deployable for real-world retrieval with minimal latency.

Cross-Attention Pair-Wise Parallelization

To further clarify how the cross-attention mechanism in $\psi$ operates efficiently, here is the pseudocode for the pair-wise parallelized cross-attention computation:

import math
import torch.nn.functional as F

def cross_attention_layer(query, key, value):
    """
    Compute the cross-attention of psi (multi-head omitted).

    Args:
        query: gap tensor v_j - t_i with shape (a, b, dim)
        key & value: video context sequence with shape (b, f, dim),
            where f is the number of frames

    Returns:
        Attention output with shape (a, b, dim)
    """

    a, b, dim = query.shape
    _, f, _ = key.shape

    # Project over the feature dimension first, then permute the query for
    # the batched matmul; Q_proj/K_proj/V_proj/Out_proj are learned linear
    # layers over dim
    q_proj = Q_proj(query).permute(1, 2, 0)  # (a, b, dim) -> (b, dim, a)
    k_proj = K_proj(key)                     # (b, f, dim)
    v_proj = V_proj(value)                   # (b, f, dim)

    # This operation is pair-wise parallelized across the video batch
    attn_logits = k_proj @ q_proj            # (b, f, dim) x (b, dim, a) -> (b, f, a)
    attn_scores = F.softmax(attn_logits / math.sqrt(dim), dim=1)  # softmax over f
    attn_scores = attn_scores.permute(0, 2, 1)                    # -> (b, a, f)

    attn_out = attn_scores @ v_proj          # (b, a, f) x (b, f, dim) -> (b, a, dim)
    attn_out = Out_proj(attn_out).permute(1, 0, 2)                # -> (a, b, dim)

    return attn_out
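A hypothetical invocation, with nn.Linear layers standing in for the module's learned projections and all sizes illustrative:

import torch
import torch.nn as nn

a, b, f, dim = 64, 128, 12, 512
Q_proj = nn.Linear(dim, dim)
K_proj = nn.Linear(dim, dim)
V_proj = nn.Linear(dim, dim)
Out_proj = nn.Linear(dim, dim)

query = torch.randn(a, b, dim)     # pair-wise gaps v_j - t_i
frames = torch.randn(b, f, dim)    # video frame sequence (key and value)

out = cross_attention_layer(query, frames, frames)
print(out.shape)                   # torch.Size([64, 128, 512])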

Key points:

  • This computation is pair-wise parallelizable, meaning it processes text-video pairs within a batch efficiently, leveraging parallelization across the batch dimension.
  • This design ensures that the cross-attention operation remains scalable and efficient, preserving the advantages of dual-tower retrieval while introducing fine-grained alignment.

Conclusion

By embedding ψ in a streaming, transient, and batch-parallel workflow, GARE:

  1. Preserves dual-tower scalability and offline precomputability;
  2. Adds negligible memory and FLOPs overhead;
  3. Maintains low-latency retrieval, compatible with large-scale deployments;
  4. Provides expressive pair-specific alignment that significantly improves retrieval performance.

We appreciate the reviewer’s feedback, which helped us clarify GARE’s practical efficiency and deployability.

Comment

I appreciate the authors' detailed response and thorough analysis of the training and inference time costs, including specific metrics like FLOPs, runtime, and memory usage. These breakdowns help contextualize the work’s practical efficiency, and I acknowledge the effort put into addressing these technical details. However, my core concern—namely, that this work violates the fundamental principles of Dual-Tower Architectures—remains unaddressed, and my broader efficiency worries are deeply intertwined with this issue.

To clarify, violating Dual-Tower principles does not inherently lead to slower performance when measured against a naive $O(NM)$ matching baseline (where $N$ represents the number of queries and $M$ the number of documents). In fact, in some cases, adding interactive modules might even yield marginal speedups in controlled scenarios. The critical issue, though, lies in how Dual-Tower Architectures are designed to enable scalable, real-world retrieval systems. At their core, these architectures rely on the separation of query and document processing to facilitate approximate nearest neighbor (ANN) search—a cornerstone of efficient large-scale retrieval. ANN systems thrive on static, precomputable distance metrics: for instance, linear products that translate to cosine similarity or Euclidean distance. These metrics are optimized to reduce retrieval complexity from $O(NM)$ to $O(N \log M)$, making them feasible for datasets with millions or billions of documents.

A key reason these metrics work is their adherence to mathematical properties that underpin ANN’s efficiency: nonnegativity, identity, symmetry, and, crucially, the triangle inequality. These properties allow ANN algorithms to prune the search space effectively, avoiding exhaustive comparisons.

By introducing a cross-modal adjustment module, however, this work disrupts this foundational framework. First, the module breaks the triangle inequality, a property that ANN systems depend on to approximate nearest neighbors efficiently. Second, it transforms the distance metric into a query-dependent measure—one that shifts based on the specific query being processed. This dynamism undermines the ability to precompute document embeddings or distance scores, which is essential for scaling retrieval to large document collections. Instead of leveraging precomputed indices, the system would need to recalculate interactions for each query, effectively reverting to a more computationally heavy paradigm that sacrifices the Dual-Tower Architecture’s primary advantage: scalable, low-latency retrieval.
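
As a schematic illustration of this point (all code and names here are hypothetical, not drawn from the paper):

import numpy as np

# Schematic only: contrast a precomputable score with a query-dependent one.
M, dim = 1_000_000, 512
D = np.random.randn(M, dim).astype("float32")  # document embeddings, indexed offline

def static_scores(q):
    # score(q, d) = <q, d>: a fixed metric over precomputed D,
    # so ANN indices can prune the search to roughly O(log M) per query.
    return D @ q

def dynamic_scores(q, refine):
    # score(q, d) = <q, refine(q, d)>: 'refine' (hypothetical) must run per
    # (q, d) pair at query time, so no index over D alone captures the score,
    # and exhaustive O(M) evaluation per query returns.
    return np.array([q @ refine(q, d) for d in D])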

Thus, while the analysis of FLOPs and runtime provides useful insights, the deeper issue of violating Dual-Tower principles—and the consequent impact on scalability—remains a significant concern for the work’s practical applicability in large-scale retrieval scenarios.

Comment

Dear Reviewer,

I hope this brief follow-up is not an inconvenience, and I apologize for reaching out once more. As the discussion period is nearing its close (less than one day remaining), I wished to gently check whether there are any additional comments or clarifications I might address. I sincerely value your perspective. If time allows, any guidance you can share would be greatly appreciated and will inform our final revisions.

I also understand that reviewer participation during the discussion phase is expected and encouraged under the NeurIPS process, as it helps ensure a balanced and thorough assessment. I would be grateful for any feedback at your convenience and look forward to hearing from you.

Thank you again for your time and thoughtful consideration.

Comment

Dear Reviewer,

I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.

Thank you for your time and effort in reviewing our paper.

Comment

Thank you again for your thoughtful response and continued attention to our work. We would also like to clarify that pair-wise interactions after dual-tower encoders constitute a relatively common framework in text-video retrieval. Below, we provide a detailed explanation.

In the domain of text-video retrieval, it is actually quite common for recent methods to go beyond the strict dual-tower architecture by introducing cross-modal interaction modules after the independent encoders. We list several representative examples below, all of which introduce pair-wise interaction after the dual-tower stage:

  • EMCL [5] (NeurIPS 2022): R@1 = 46.8, R@5 = 73.1, R@10 = 83.1
  • X-Pool [4] (CVPR 2022): R@1 = 46.9, R@5 = 72.8, R@10 = 82.2, MnR = 14.3
  • DiCoSA [2] (IJCAI 2023): R@1 = 47.5, R@5 = 74.7, R@10 = 83.8, MnR = 13.2
  • DiffusionRet [3] (ICCV 2023): R@1 = 49.0, R@5 = 75.2, R@10 = 82.7, MnR = 12.1

All these methods are built upon the CLIP4Clip [1] dual-tower backbone and introduce cross-modal interactions after feature encoding. This reflects a broader trend in the field, where many methods have moved beyond strict dual-tower architectures to incorporate pair-wise alignment strategies that better capture co-occurrence patterns and fine-grained correspondences between text and video. Notably, these classical works have inspired a range of subsequent advancements, further refining such interaction modules and demonstrating their effectiveness in improving retrieval accuracy. This evolution highlights that, in text-video retrieval, enhancing cross-modal alignment remains a key research focus.

In fact, within the academic context of text-video (or image-text) retrieval, the primary challenge lies not in scalable hierarchical retrieval with ANN techniques (e.g., $O(N_q \log N_d)$ complexity via clustering), but rather in improving semantic alignment across modalities.

This motivates a large body of research focused on alignment modeling rather than large-scale indexing, such as:

  • Fine-grained alignment at the frame-word level, action-word level, prototype level, or multi-level fusion.
  • Cross-modal generation or fusion, e.g., DiffusionRet [3] reframes the retrieval objective from discriminative $p(v_j \mid t_i)$ to generative $p(t_i, v_j)$, which also requires pair-wise interaction after the dual-tower stage.
  • Many-to-many alignment strategies that reduce the impact of false negatives and the entropy imbalance between text and video.

These efforts primarily aim to address the difficulty of aligning semantically sparse text with visually redundant video content. While we acknowledge that adherence to dual-tower principles is foundational for large-scale retrieval efficiency, we note that in the current landscape of text-video retrieval, the primary research focus remains on improving cross-modal alignment and discriminative performance, rather than optimizing for large-scale ANN-friendly indexing.


References

[1] Luo, Huaishao, et al. "Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.

[2] Jin, Peng, et al. "Text-video retrieval with disentangled conceptualization and set-to-set alignment." arXiv preprint arXiv:2305.12218 (2023).

[3] Jin, Peng, et al. "Diffusionret: Generative text-video retrieval with diffusion model." Proceedings of the IEEE/CVF international conference on computer vision. 2023.

[4] Gorti, Satya Krishna, et al. "X-pool: Cross-modal language-video attention for text-video retrieval." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[5] Jin, Peng, et al. "Expectation-maximization contrastive learning for compact video-and-language representations." Advances in neural information processing systems 35 (2022): 30291-30306.

Comment

Thank you very much for your detailed comments and for highlighting the dual-tower efficiency concerns. We apologize for not providing a thorough discussion on ANN-based retrieval in the initial submission. Below, we present our refined analysis and solution.


1. Two-Stage ANN Retrieval with GARE

Although GARE introduces pair-dependent refinements $\Delta_{ij}$ via $\cos(t_i+\Delta_{ij}, v_j)$ during training, the dual-tower encoders still produce static embeddings $t_i$ and $v_j$, which are suitable for efficient ANN indexing. In fact, when we evaluate the dual-tower part (which is exactly CLIP4Clip [1], a classical dual-tower method in text-video retrieval) on the MSR-VTT dataset after training with GARE, we obtain R@1 = 44.2, R@5 = 71.7, R@10 = 81.2, MnR = 15.2, which is almost identical to the original CLIP4Clip results (44.5 / 71.4 / 81.6 / 15.3). This demonstrates that our dual-tower branch maintains retrieval stability, ensuring that it can still be reliably used for large-scale indexing via ANN. Based on this, we can apply GARE in large-scale scenarios using a two-stage retrieval pipeline: in the first stage, we build the ANN index with the dual-tower outputs of GARE and retrieve the Top-K candidates online; in the second stage, we apply GARE's ψ-module to these Top-K candidates for refined re-ranking (a code sketch of this pipeline follows the stage breakdown below).

  • Stage 1 (Candidate Retrieval)

    1. Normalize $(t_i, v_j)$ to construct the ANN index.
    2. Store both normalized vectors and their L2 norms, because cosine similarity depends on direction, while GARE refinement also involves magnitude.
    3. Perform ANN search using $\cos(t_i,v_j)$ to efficiently obtain Top-K candidates.
  • Stage 2 (Δ-based Reranking)

    1. Retrieve the original (non-normalized) embeddings for the Top-K candidates.
    2. Compute $\Delta_{ij}$ with the ψ-module and refine similarities as $\cos(t_i+\Delta_{ij}, v_j)$.
    3. Rerank the Top-K to obtain the final retrieval results.
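
A minimal sketch of this two-stage pipeline, assuming FAISS for the ANN stage; `psi_module` stands in for GARE's ψ, and all shapes and names are illustrative:

import faiss
import numpy as np

def two_stage_retrieve(t, video_embs, video_ctx, psi_module, K=256, top=10):
    """Stage 1: ANN over static dual-tower embeddings; Stage 2: Δ-based rerank."""
    d = video_embs.shape[1]
    # Stage 1: cosine similarity as inner product over L2-normalized vectors.
    # (The index is built here for brevity; in practice it is built offline.)
    normed = (video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)).astype("float32")
    index = faiss.IndexFlatIP(d)
    index.add(normed)
    q = (t / np.linalg.norm(t)).astype("float32")[None, :]
    _, cand = index.search(q, K)               # Top-K candidate video ids
    cand = cand[0]

    # Stage 2: retrieve original (non-normalized) embeddings and rerank with Δ.
    scores = []
    for j in cand:
        delta = psi_module(t, video_embs[j], video_ctx[j])  # pair-specific Δ_ij
        r = t + delta
        scores.append(r @ video_embs[j] /
                      (np.linalg.norm(r) * np.linalg.norm(video_embs[j])))
    order = np.argsort(scores)[::-1][:top]
    return cand[order]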

This design maintains scalable ANN efficiency while leveraging Δ for fine-grained alignment.


2. Deviation Analysis of Δ

Because GARE modifies similarity to $\cos(t_i+\Delta_{ij}, v_j)$, we analyze how Δ affects static dual-tower similarity $\cos(t_i,v_j)$.

Theoretical upper bound: Let $a=t_i$, $b=v_j$, $\delta=\Delta_{ij}$; on MSR-VTT-1k, $\|t_i\| \approx 11.0$ and $\|\Delta\| \approx 3.31$.

$$\delta_{ij} = \big|\cos(a+\delta,\, b) - \cos(a, b)\big| \;\lesssim\; \frac{\|\delta\|}{\|a\|} \approx 0.30$$

This first-order approximation shows that Δ induces at most about 0.3 deviation in cosine similarity in the worst case, and the deviation is smaller for most pairs.
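
For completeness, the bound follows from the gradient of cosine similarity with respect to the anchor, writing $\hat a = a/\|a\|$, $\hat b = b/\|b\|$, and $\theta$ for the angle between $a$ and $b$:

$$\nabla_a \cos(a,b) = \frac{1}{\|a\|}\big(\hat b - \cos(a,b)\,\hat a\big), \qquad \big\|\nabla_a \cos(a,b)\big\| = \frac{\sin\theta}{\|a\|} \le \frac{1}{\|a\|},$$

so to first order $|\cos(a+\delta,b)-\cos(a,b)| \approx |\delta^\top \nabla_a \cos(a,b)| \le \|\delta\|/\|a\| \approx 3.31/11.0 \approx 0.30$.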

Empirical deviation: We evaluate $|\cos(t_i+\Delta_{ij},v_j)-\cos(t_i,v_j)|$ across MSR-VTT-1k (a measurement sketch follows the list below):

  1. Mean Deviation ($\mathbb{E}[\delta_{ij}]$)

    • Average deviation across all pairs: 0.084
    • Meaning: Δ shifts the cosine similarity by only 0.084 on average, indicating minimal perturbation.
  2. 95th Percentile Deviation ($P_{95}$)

    • 95% of all deviations < 0.17
    • Meaning: Δ refinement is locally smooth, and extreme deviations are rare.
  3. Top-K Candidate Coverage ($C_K$)

    • Overlap between Top-10 by GARE and Top-10 by static dual-tower within Top-256 ANN candidates ≥ 99%.
    • Meaning: A K=256 candidate pool almost always recovers the final GARE Top-10, confirming reliable ANN pre-filtering.
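
A sketch of how these statistics can be computed, assuming precomputed text/video embeddings; `compute_delta` is a hypothetical stand-in for the ψ-module:

import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def deviation_stats(T, V, compute_delta):
    """Mean and 95th-percentile of |cos(t+Δ, v) - cos(t, v)| over all pairs."""
    devs = np.array([abs(cosine(t + compute_delta(t, v), v) - cosine(t, v))
                     for t in T for v in V])
    return devs.mean(), np.percentile(devs, 95)

def topk_coverage(static_scores, gare_scores, K=256, top=10):
    """Fraction of the GARE Top-10 recovered inside the static Top-K pool."""
    cand = np.argsort(static_scores)[::-1][:K]
    gare_top = np.argsort(gare_scores)[::-1][:top]
    return float(np.isin(gare_top, cand).mean())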

These results show that static dual-tower embeddings remain highly reliable for ANN retrieval even with Δ refinement.


3. Stability of Static Embeddings

GARE also stabilizes the dual-tower outputs:

  • Hard negatives often produce noisy gradients on $t_i$ in standard InfoNCE.
  • GARE absorbs part of this tension into Δ, reducing direct noise on $t_i$.
  • Empirically, the variance of $t_i$ decreases after GARE training, and static dual-tower retrieval remains competitive.

This means dual-tower embeddings remain both stable and ANN-compatible.


4. Conclusion

  • Static embeddings remain ANN-compatible with negligible retrieval loss.
  • Δ perturbations serve as lightweight, second-stage refinements with bounded deviation (≤0.3 theoretical, 0.084 empirical).
  • Two-stage retrieval with Top-K reranking preserves scalability while exploiting Δ for fine-grained alignment.

These results confirm that GARE maintains the dual-tower paradigm for large-scale retrieval and that Δ is a practical, noise-resilient enhancement.


References

[1] Luo, Huaishao, et al. "Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.

Final Decision

After most of the reviewers' concerns were addressed in the post-rebuttal discussion, the paper received three Borderline Accept (BA) ratings and one Borderline Reject (BR). The paper offers a novel mechanism and solution with a solid theoretical foundation, and features rigorous experimental design alongside extensive qualitative analysis. Acceptance is recommended.