PaperHub
Overall rating: 6.8/10 (Poster; 5 reviewers, min 4, max 5, std 0.4)
Individual ratings: 4, 5, 4, 4, 4
Confidence: 3.0
Novelty: 2.6 · Quality: 3.2 · Clarity: 2.8 · Significance: 3.2
NeurIPS 2025

Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Video Temporal Grounding

Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We introduce an Uncertainty-quantified Rollout Policy Adaptation method, which uses rollout variance to estimate pseudo-label uncertainty and enables test-time adaptation for cross-domain temporal grounding without target labels.

Abstract

Keywords
Knowledge Transfer; Video Understanding; Temporal Grounding

Reviews and Discussion

Review (Rating: 4)

Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding introduces URPA, a data-efficient and annotation-free approach for adapting video temporal grounding models across domains. The method begins with training on a labeled source domain using GRPO, a reinforcement learning framework, and then adapts the model to an unlabeled target domain using only 100–200 videos. By averaging multiple rollout predictions, URPA generates pseudo labels and estimates their uncertainty through variance. These uncertainty scores are used to weight the adaptation reward, enabling the model to focus on more reliable supervision. Experiments across six cross-domain benchmarks demonstrate that URPA enhances generalization and supports efficient, real-time adaptation without requiring ground-truth annotations.

Strengths and Weaknesses

Strengths:

  1. URPA achieves high data efficiency, requiring only 100–200 unlabelled target-domain videos for adaptation, making it practical for real-world scenarios where annotated data is limited or costly to obtain.
  2. Theoretical soundness is well-established, with a rigorous analysis demonstrating that rollout variance approximates Bayesian predictive variance, thereby validating the use of uncertainty for reward weighting during adaptation.
  3. The method enables fully unsupervised adaptation, eliminating the need for ground-truth labels in the target domain by leveraging pseudo labels combined with uncertainty estimation for reliable self-supervision.
  4. The framework supports lightweight, test-time adaptation without the need for full retraining, making it highly suitable for real-time and on-device applications in resource-constrained environments.

Weaknesses:

  1. The adaptation performance does not consistently improve with more unlabelled target samples. As shown in the paper, excessive adaptation can lead to overfitting on noisy pseudo labels, which limits the method's scalability in large-scale unsupervised settings.
  2. The method focuses on temporal grounding optimization but does not explicitly address semantic misalignment between language queries and video content, which may limit its effectiveness in scenarios with ambiguous language or complex video semantics.

Questions

How can models handle ambiguous or vague language descriptions?

Limitations

yes

Final Justification

Thank you to the authors for addressing my concerns in the rebuttal. The clarifications on the adaptation performance with unlabelled target samples and the use of uncertainty-aware pseudo-labeling are satisfactory, especially the insight that URPA excels in data-efficient scenarios. The explanation on how temporal IoU indirectly aligns language queries with video content, as well as the strategies to handle ambiguous queries through GRPO rollouts and uncertainty-based weighting, are well-justified. Based on these responses, I am satisfied with the revisions and will maintain my original score.

Formatting Issues

No major formatting issues were observed.

Author Response

We thank the reviewer for the valuable feedback and for appreciating the idea behind our method. Our responses to your concerns follow.

W1: The adaptation performance does not consistently improve with more unlabelled target samples.

As demonstrated in Figure 4 in the manuscript, our method employs uncertainty-aware pseudo-labels to effectively reduce the influence of noisy supervision in low-resource scenarios, outperforming naive self-learning baselines. When adapting with 100–200 unlabelled target videos, URPA achieves strong performance by filtering unreliable predictions and focusing on confident supervision. However, as the number of target samples increases, the absolute number of noisy pseudo-labels also grows; their accumulation can lead to overfitting on incorrect guidance, thereby degrading performance. This highlights a key insight: URPA is best suited for data-efficient adaptation, where pseudo-label quality can be more reliably controlled via uncertainty estimation. Designing more robust noise-aware mechanisms for scaling to larger datasets remains an important direction for future research.

W2: How to address semantic misalignment between language queries and video content?

While our method does not introduce an explicit semantic alignment loss, the use of temporal IoU as the task reward indirectly encourages the model to align the query semantics with relevant video segments. Accurate grounding requires the model to match language cues with visual content, and the reward function provides learning signals that reinforce such alignment during adaptation.

Q1: How can models handle ambiguous or vague language descriptions?

Ambiguous or vague queries pose a major challenge for temporal grounding, as they lack explicit visual cues. Our method addresses this in two ways. First, by leveraging multiple GRPO rollouts, the model is encouraged to explore diverse temporal hypotheses rather than committing to a single point estimate. This helps uncover plausible alignments even under semantic uncertainty. Second, our use of a variance-based uncertainty metric allows the model to down-weight pseudo-labels generated from ambiguous queries, reducing their influence during adaptation. Together, these strategies help the model remain robust in the presence of vague or underspecified language.

Comment

The authors have adequately addressed the concerns I raised, and I will maintain my original score.


Review (Rating: 5)

This paper presents URPA, a method for temporally localizing natural language queries in videos from diverse domains. URPA is first trained on a source domain with temporal annotations, using two losses: one to make the model answer following a specific format, and another to improve temporal grounding quality using extended ground truth intervals. To adapt to a target domain without annotations, URPA selects a small set of target videos, samples timestamps from the model's predictive distribution, and uses the mean as a pseudo-label. These pseudo-labels are weighted by a confidence, which is estimated from the standard deviation of the sampled timestamps. URPA achieves good results on three cross-domain temporal grounding benchmarks, outperforming off-the-shelf and source-only models, and even some UDA methods that use the full target set.
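
The pseudo-labelling mechanism the review describes (mean of sampled timestamps as the label, confidence from their spread) can be sketched as follows. This is an illustrative toy implementation, not the authors' code; `beta` is a hypothetical decay rate for mapping spread to confidence.

```python
import numpy as np

def pseudo_label_with_confidence(rollouts, beta=1.0):
    """Toy sketch: average G rollout predictions into a pseudo-label and
    derive a confidence weight from their standard deviation.

    rollouts: sequence of (start, end) predictions, shape (G, 2)
    beta: hypothetical decay rate (not from the paper)
    """
    rollouts = np.asarray(rollouts, dtype=float)
    pseudo = rollouts.mean(axis=0)        # mean start/end = pseudo-label
    spread = rollouts.std(axis=0).mean()  # std across rollouts = uncertainty
    confidence = np.exp(-beta * spread)   # low spread -> weight near 1
    return pseudo, confidence

# Tight rollouts yield high confidence; scattered ones are down-weighted.
tight = [[10.0, 20.1], [10.2, 19.9], [9.9, 20.0], [10.1, 20.2]]
loose = [[5.0, 30.0], [12.0, 18.0], [2.0, 40.0], [15.0, 25.0]]
_, c_tight = pseudo_label_with_confidence(tight)
_, c_loose = pseudo_label_with_confidence(loose)
assert c_tight > c_loose
```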

Strengths and Weaknesses

Strengths

  1. Despite its simplicity, URPA is a principled approach that boosts temporal grounding performance on the target domain by leveraging pseudo-labels while accounting for their uncertainty.
  2. Unlike other test-time adaptation methods that adapt to each test input, URPA performs a one-time adaptation before inference using only a small subset of videos, which is a more practical scenario.
  3. The paper is generally well-written (with some typos, though), is easy to follow, clearly motivates the problem, and supports its claims with experiments.

Weaknesses

  1. Although URPA, implemented on top of Qwen2.5-7B, consistently outperforms its ablated versions on Charades → ActivityNet, it is unclear whether these improvements would generalize to other LMMs.
  2. Figure 4 shows that URPA's performance on the target domain starts to decrease after using 300 target samples. It would have been interesting to see how unsupervised domain adaptation baselines (e.g., UDA-TSL) perform when limited to the same randomly sampled target data as URPA.
  3. It would be interesting to evaluate the model on the source dataset after adaptation, to understand whether it retains its ability to perform temporal grounding in the source domain.

Questions

  1. Have the authors evaluated URPA with other LMMs than Qwen2.5-7B to understand whether the method's effectiveness generalizes across different models?
  2. Do other unsupervised domain adaptation baselines also benefit from training on a limited subset of the target dataset (possibly the same subset used by URPA) compared to using the full target set?
  3. How does URPA perform on the source domain after adapting to the target domain?
  4. According to Table 2, extending the start and end timestamps of the ground truth boundaries leads to a significant performance improvement. How was the value of α = 0.1 selected?

Limitations

yes

Final Justification

The authors addressed my concerns about the generalizability of URPA to other large multimodal models by implementing the method on LLaVA-ST and Qwen2.5-3B, in addition to Qwen2.5-7B, and demonstrating improved performance across models. They also clarified the differences between URPA and the best-performing cross-domain temporal grounding method, UDA-TSL, and explained why UDA-TSL does not show the same performance trend as URPA when using a limited number of target domain samples. I agree that the performance gains on the target domain justify the slight performance drop on the source domain after adaptation. Given these clarifications, I am increasing my score to Accept.

Formatting Issues

No formatting issues

Author Response

We are grateful for the positive response and the valuable suggestions. Our responses to your concerns follow.

W1/Q1: Whether URPA can generalize to other VLM-based temporal grounding models?

To demonstrate the generality of URPA, we conducted additional experiments using two different temporal grounding backbones: LLaVA-ST and Qwen2.5-3B. As shown in the table below, applying URPA to LLaVA-ST yields substantial performance improvements, indicating that our adaptation strategy is effective across different architectures. Similarly, URPA also improves the performance of Qwen2.5-3B, though the overall results are lower than those of the 7B model due to its more limited reasoning capacity. These results confirm that URPA generalizes well across various vision-language models, regardless of model size or backbone design.

| Model | Charades→ActivityNet (R@0.5) | ActivityNet→TACoS (R@0.5) |
|---|---|---|
| Qwen2.5-7B | 15.2 | 7.7 |
| URPA + Qwen2.5-7B | 42.6 | 22.0 |
| Qwen2.5-3B | 9.6 | 4.5 |
| URPA + Qwen2.5-3B | 33.9 | 14.3 |
| LLaVA-ST-7B | 13.6 | 6.5 |
| URPA + LLaVA-ST-7B | 36.2 | 16.9 |

W2/Q2: Do other unsupervised domain adaptation baselines also benefit from training on a limited subset of the target dataset?

To address the reviewer’s concern, we compare URPA with UDA-TSL, the current SOTA method for cross-domain temporal grounding, using the same limited number of unlabelled target samples. We evaluate three settings:

  • UDA-TSL (NoUDA): trained only on the source domain;
  • UDA-TSL (200-shot): adapted using 200 unlabelled target videos;
  • UDA-TSL (2000-shot): adapted using 2000 unlabelled target videos.

As the original code is unavailable, we reimplemented UDA-TSL following the paper. Results show that UDA-TSL achieves notable improvements only with ≥2000 samples. In contrast, its 200-shot variant barely outperforms the source-only baseline due to insufficient target domain coverage.

URPA, designed for data-efficient adaptation, yields substantial gains with just 200 samples and outperforms UDA-TSL in both transfer directions. This highlights URPA’s robustness and practicality for real-world low-resource scenarios where collecting large-scale unlabelled target data is infeasible. These results confirm that URPA is more suitable for few-shot unsupervised cross-domain adaptation, offering better performance without relying on extensive target data.

| Method | Charades→ActivityNet (R@0.5 / R@0.7) | TACoS→Charades (R@0.5 / R@0.7) |
|---|---|---|
| UDA-TSL (NoUDA) | 35.83 / 19.52 | 27.25 / 16.42 |
| UDA-TSL (200-shot) | 36.02 / 19.68 | 27.44 / 16.71 |
| UDA-TSL (2000-shot) | 39.15 / 21.83 | 30.76 / 18.87 |
| URPA (source-only) | 36.46 / 18.78 | 54.40 / 29.81 |
| URPA (200-shot) | 42.57 / 21.25 | 55.54 / 32.04 |

W3/Q3: How does URPA perform on the source domain after adapting to the target domain?

We use mIoU to evaluate source-domain performance before and after adaptation with 200-shot target data. As shown in the table below, URPA achieves better alignment with the target domain after adaptation, while source-domain performance decreases slightly as source-specific knowledge is partially replaced by target-domain features. This trade-off is typical in domain adaptation, yet URPA continues to maintain strong temporal grounding ability on the source domain.

| Model \ Task | Charades→ActivityNet (mIoU) | ActivityNet→Charades (mIoU) |
|---|---|---|
| URPA (before adaptation), on source | 56.3 | 38.6 |
| URPA (after adaptation), on source | 54.5 | 37.2 |

Q4: How was the value of α selected?

We determine α through cross-validation. With small values, the model retains accurate temporal grounding while modestly correcting annotator subjectivity. Larger values further reduce annotation bias but simultaneously lower label precision. We observe that α = 0.1 provides the best balance between mitigating annotation bias and preserving temporal grounding accuracy.
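
One plausible reading of the boundary relaxation discussed here is to pad each annotated boundary by a fraction α of the interval duration. This is a hypothetical sketch for illustration only; the paper's exact formulation may differ.

```python
def relax_interval(start: float, end: float, alpha: float = 0.1):
    """Hypothetical sketch: extend both ground-truth boundaries by a
    fraction alpha of the interval duration, softening annotator
    subjectivity at the edges. Not the paper's verbatim formulation."""
    pad = alpha * (end - start)
    return start - pad, end + pad

# With alpha = 0.1, a 10-second interval gains 1 second on each side.
print(relax_interval(10.0, 20.0))  # -> (9.0, 21.0)
```

Larger α widens the tolerated region (less annotation bias, lower label precision), which matches the trade-off the authors describe.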

Comment

I appreciate the detailed response from the authors. I have read both the other reviewers’ comments and the authors’ replies. I have just one question: in other reviews, the authors mention that using a large number of pseudo-labels (e.g., 2000) in URPA can harm performance because of noise in the supervision signal. In contrast, UDA-TSL appears to perform better with a large number of shots. Could the authors explain why these two unsupervised adaptation methods show such different trends?

Comment

UDA-TSL is a distribution alignment method, where more joint sampling of source and target data enables better distribution characterisation and thus more effective alignment. In contrast, URPA is a source-free pseudo-labelling method, where fast adaptation can be achieved with only a few target videos, but larger amounts of target data may introduce noisy supervision that hurts performance.

UDA-TSL works well only when both source and target distributions are sufficiently represented through large-scale joint sampling. With limited target samples, the target distribution cannot be reliably captured, making alignment ineffective. Our rebuttal experiments show that substantial performance gains occur only when thousands of target videos are available, which necessitates simultaneous access to both domains and is often impractical in deployment.

URPA, by contrast, never requires access to source data during adaptation. The model first learns task-related knowledge on the labelled source domain and then adapts to the target domain using only a few unlabelled videos. While pseudo-label noise may accumulate and lead to a rise-then-drop trend, we are already mitigating this issue with our uncertainty estimation mechanism (Fig. 4 in Appendix) and will further improve robustness against noisy supervision in future work.

Comment

Thank you to the authors for addressing all my concerns. I have no further questions.

Comment

Dear Reviewer tuX4,

Thank you very much for confirming that all concerns have been addressed. This is very encouraging for us!

Best,

Review (Rating: 4)

This paper proposes a novel approach for Unlabelled Cross-domain Temporal Grounding (UCTG), termed Uncertainty-quantified Rollout Policy Adaptation (URPA). Specifically: (1) The two tailored reward functions, combined with the introduction of GRPO, ensure robust grounding capability in the source domain. (2) A pseudo-labeling mechanism is designed based on GRPO, where label quality is assessed via uncertainty quantification, enabling effective target adaptation. Extensive experiments on multiple benchmarks demonstrate the superiority of URPA over existing methods.

Strengths and Weaknesses

Strengths:

  1. The effectiveness of URPA in achieving cross-domain temporal grounding with limited samples is commendable. The judicious application of GRPO to this task provides an economical yet impactful research direction for the field.
  2. The quality assessment design for pseudo-labels is well-justified. Through rigorous theoretical analysis and comprehensive experiments, the authors convincingly demonstrate the validity of their evaluation strategy.

Weaknesses:

  1. Motivation: The necessity of efficiency for UCTG is inadequately justified in lines 44-46. Additionally, the authors lack task-specific analysis - what are the key challenges constraining UCTG development?
  2. Table 1 Results: URPA achieves better performance than full-dataset methods in certain cases (e.g., TACoS→Charades), yet shows significant gaps in most settings. What causes this phenomenon?
  3. Performance at Scale: While URPA's low sample dependency is impressive, how does its performance evolve with increased samples (e.g., 1000-shot, 10000-shot)? A superior method should demonstrate balanced effectiveness and efficiency.
  4. There appears to be an error in the expression/formulation of Equation 3.

Questions

Although the method introduced in this paper is reasonable and has been shown effective through experiments, my doubts about the motivation and the insufficient experiments keep my current rating at 4.

Limitations

Yes.

Final Justification

This rebuttal addresses my main concerns.

Formatting Issues

None.

Author Response

We thank the reviewer for the positive evaluation of our work; our detailed responses to the concerns are as follows.

W1(1): The necessity of efficiency for UCTG is inadequately justified.

The efficiency of UCTG is particularly critical not only because collecting dense temporal annotations is costly, but also because of the inherent storage and computational demands of video data. Compared to images, videos require significantly more space and time to annotate and store; a single video can exceed the size of an entire image dataset, making full-set target adaptation impractical in online or real-time settings. Efficient, data-light adaptation is therefore a necessity for scalable deployment of temporal grounding models in real-world applications such as surveillance or instructional content, where labelled data is limited and fast domain adaptation is required.

W1(2) What are the task-specific key challenges constraining UCTG development?

The development of UCTG faces three task-specific challenges. First, the lack of ground-truth supervision in the target domain makes it difficult for the model to correct domain-specific errors under domain shift. Second, temporal annotations (i.e., start and end timestamps) are often manually labeled and inherently subjective, introducing annotator bias. Finally, the duration and temporal location of events vary significantly across datasets. Without a deeper understanding of the causal or semantic structure of events, models tend to rely on learning dataset-specific temporal occurrence patterns. As a result, their predictions often fail to transfer across domains with different event distributions, leading to substantial performance drops.

To address these challenges, URPA first introduces a confidence-based pseudo-labeling strategy to mitigate the absence of target supervision. Second, it uses a relaxed tIoU reward to reduce sensitivity to annotation bias. Third, by leveraging GRPO-style reinforcement learning, the model learns to reason about event structure and temporal logic rather than memorizing dataset-specific temporal distributions, enabling better generalization across domains.

W2: What explains the performance gaps between URPA and full-dataset methods in most settings, despite URPA outperforming them in a few cases?

The gap is explained by two factors. First, the richness of source-domain supervision provides different levels of grounding priors, which makes adaptation easier in some transfers such as TACoS→Charades. Second, URPA relies on GRPO-based adaptation with uncertainty filtering, which uses limited target samples more effectively but cannot fully exploit the entire target distribution. As a result, URPA can sometimes surpass full-set methods when source priors are strong, but shows gaps in transfers with larger domain differences.

W3: How does its performance evolve with increased samples?

We include an experiment in Appendix A.1 (Fig. 4) to evaluate how performance evolves with increasing adaptation samples. URPA consistently outperforms the baseline under all shot settings, confirming the effectiveness of our uncertainty-guided strategy. Interestingly, performance peaks around 300 samples and then declines. This is due to the error accumulation of noisy pseudo labels. Although our method mitigates their effect via uncertainty filtering, fully unsupervised adaptation still suffers when label quality degrades at scale. This highlights a key challenge: more unlabelled data does not necessarily yield better performance unless pseudo-label reliability is also improved. We discuss this in our limitations and plan to address it in future work.

W4: An error in Equation 3.

Thank you for pointing this out. We will correct it to reflect the intended behaviour: a reward of 1 when the output matches the required format, and 0 otherwise.
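
The corrected behaviour can be sketched as a binary check. The `<answer>` template below is a hypothetical stand-in for the paper's actual output format, used only to make the reward's all-or-nothing nature concrete.

```python
import re

def format_reward(output: str) -> int:
    """Toy sketch of the corrected Eq. 3 behaviour: reward 1 if the
    model's answer follows the required output format, 0 otherwise.
    The regex template is hypothetical, not the paper's exact format."""
    pattern = r"<answer>\s*\d+(\.\d+)?\s*,\s*\d+(\.\d+)?\s*</answer>"
    return 1 if re.fullmatch(pattern, output.strip()) else 0

assert format_reward("<answer> 12.5, 30.0 </answer>") == 1
assert format_reward("the event occurs at 12.5s") == 0
```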

Comment

Dear Reviewer VEkt,

We hope our rebuttal has clarified your concerns. Please kindly let us know if everything looks good on your side.

Best,

Review (Rating: 4)

This paper proposes a method for cross-domain temporal grounding, addressing the challenges of domain shifts and the lack of annotations in unlabelled target domains. By leveraging Group Relative Policy Optimization rollouts and uncertainty estimation, the method enables efficient adaptation using only a small number of target domain videos. The authors introduce a pseudo-labelling mechanism, where predictions are averaged across multiple rollouts, and confidence scores are computed based on variance. Experiments across six cross-domain settings on three benchmark datasets demonstrate the effectiveness of URPA, achieving competitive or superior performance compared to full-dataset adaptation methods while maintaining resource efficiency.

Strengths and Weaknesses

Strengths:

  1. The introduction of URPA as a data-efficient unsupervised domain adaptation method is a novel contribution to the temporal grounding task, particularly for real-world applications where computational and storage resources are limited.

  2. A rigorous theoretical analysis connects GRPO rollouts to Bayesian predictive variance, providing strong justification for the use of rollout variance as a measure of epistemic uncertainty.

  3. The method performs well with only a small number of unlabelled target videos, which is significantly more practical than full-dataset adaptation methods.

  4. Results are provided on six cross-domain settings across three datasets, demonstrating the generalizability and robustness of URPA. The ablation studies clearly highlight the contributions of uncertainty quantification and soft labeling.

Weaknesses:

  1. While the paper demonstrates the importance of URPA components, it does not compare GRPO rollouts with other reinforcement learning approaches or variants. This could provide further insight into the unique strengths of GRPO in the temporal grounding context.

  2. Although uncertainty quantification is used to mitigate noisy pseudo-labels, the method still assumes that pseudo-labels with low variance are reliable. This assumption may not hold in cases where low variance predictions are consistently incorrect due to systematic bias in the source model.

  3. The experiments suggest diminishing returns when increasing the number of rollouts (G>8). However, the computational trade-offs of using fewer rollouts versus the impact on uncertainty estimation are not fully explored.

Questions

  1. How does the choice of the exponential decay function in Eq. (9) affect the confidence score? Have alternative formulations for scaling uncertainty been tested?

  2. In scenarios where the target domain exhibits significant visual or semantic differences from the source domain, how robust is the method? Does the variance metric still reliably estimate epistemic uncertainty?

  3. The results show that performance peaks at G = 8 rollouts. Could you provide more insight into why increasing rollouts beyond this threshold does not yield better results? Is it due to computational inefficiency, or does it introduce noise?

  4. Although the method is unsupervised in the target domain, how does it perform when compared to fully supervised target training (beyond the 200-shot scenario)?

Limitations

Yes

Final Justification

Thank you for the detailed response from the author, which has answered most of my questions. I have decided to maintain my rating.

Formatting Issues

None

Author Response

We thank the reviewer for the positive evaluation of our work and for acknowledging the value and insight of the proposed method.

W1: Compare GRPO rollouts with other reinforcement learning approaches.

In our setting, such a comparison is not directly applicable. URPA is designed for unsupervised adaptation in the target domain, where no target temporal annotations are available. In contrast, most RL methods like PPO, A2C, or SAC operate in supervised or reward-accessible settings and generate rollouts across different samples, not multiple diverse predictions for the same input, which is essential for estimating epistemic uncertainty in our framework.

We compare URPA with existing reinforcement learning-based temporal grounding methods in the table below. Prior approaches such as BAR [1], TripNet [2], MBAN [3], URL [4], and the original GRPO are all trained in a fully supervised manner on the Charades dataset. In contrast, URPA is designed for unsupervised cross-domain adaptation: it is pretrained on source datasets such as TACoS or ActivityNet and adapted to the target domain using only 200 unlabelled videos without timestamp annotations.

Despite operating without ground-truth supervision in the target domain, URPA achieves comparable or even superior performance on the Charades dataset in terms of R@0.5 and R@0.7. These results demonstrate the effectiveness of our GRPO-based uncertainty-guided adaptation strategy and its strong generalisation ability under low-resource conditions.

| Model | R@0.5 | R@0.7 |
|---|---|---|
| URPA (TACoS→Charades) | 55.5 | 32.0 |
| URPA (ActivityNet→Charades) | 65.1 | 39.6 |
| *RL-based supervised learning approaches* | | |
| BAR [1] | 46.5 | 22.7 |
| TripNet [2] | 38.3 | 16.1 |
| MBAN [3] | 56.3 | 32.3 |
| URL [4] | 55.7 | – |
| Target fully supervised GRPO | 72.5 | 47.9 |

[1] Wu, Jie, et al. "Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos." Proceedings of the 28th ACM International Conference on Multimedia. 2020.

[2] M. Hahn, A. Kadav, J. M. Rehg, and H. P. Graf, “Tripping through time: Efficient localization of activities in videos,” in BMVC, 2020.

[3] X. Sun, H. Wang, and B. He, “Maban: Multi-agent boundary-aware network for natural language moment retrieval,” IEEE TIP, vol. 30, 2021.

[4] Y. Zeng, D. Cao, S. Lu, H. Zhang, J. Xu, and Z. Qin, “Moment is important: Language-based video moment retrieval via adversarial learning,” ACM TMCCA, vol. 18, 2022.

Q1: How does the choice of the exponential decay function in Eq. (9) affect the confidence score? Have alternative formulations for scaling uncertainty been tested?

The exponential decay function in Eq. (9) is designed to more aggressively suppress high-uncertainty pseudo-labels, offering sharper penalisation than linear or reciprocal alternatives and leading to more stable training. We experimented with softmax-based, sigmoid-based, and MLP-based alternatives. However, softmax requires batch-relative normalization and sigmoid compresses the variance range too aggressively, both leading to unstable training. The MLP variant adds extra computational cost while yielding inferior performance, making exponential decay a more effective and efficient choice.
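
The trade-off between these weighting functions can be illustrated numerically. The scale parameters below (`beta`, `vmax`, `centre`) are hypothetical and chosen only to make the shapes of the curves visible; they are not values from the paper.

```python
import math

def exp_decay(var, beta=1.0):
    # sharply suppresses high-variance pseudo-labels; confidence in (0, 1]
    return math.exp(-beta * var)

def linear_decay(var, vmax=10.0):
    # linear fall-off; penalises large variance much more gently
    return max(0.0, 1.0 - var / vmax)

def sigmoid_weight(var, centre=5.0):
    # saturates on both sides of its centre, compressing the variance range
    return 1.0 / (1.0 + math.exp(var - centre))

for v in (0.5, 2.0, 8.0):
    print(f"var={v}: exp={exp_decay(v):.3f}  "
          f"linear={linear_decay(v):.3f}  sigmoid={sigmoid_weight(v):.3f}")
```

Exponential decay is the only one of the three that both spans the full (0, 1] range and penalises uncertainty increasingly sharply, which matches the rebuttal's rationale.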

W2/Q2: Does the variance metric still reliably estimate epistemic uncertainty under large domain gap?

In scenarios with significant domain shift, we acknowledge that low variance does not always guarantee correctness due to potential systematic bias from the source model. However, our method mitigates this risk in two ways. First, GRPO's structured rollouts capture epistemic uncertainty more robustly than neuron-level dropout by exposing variability in the model's temporal predictions. Second, our adaptation framework avoids relying solely on absolute variance values. Instead, we apply a decaying weighting function to softly down-weight uncertain pseudo-labels, preventing overfitting to potentially biased predictions. Empirically, as shown in the Charades→ActivityNet setting (with large visual-semantic shift), our method still achieves substantial improvements over the source-trained model (R@0.5: 36.5 → 42.6), confirming the robustness of the uncertainty signal.

W3/Q3: More insight into why increasing rollouts beyond 8 does not yield better results?

We observe that increasing the number of rollouts initially improves performance by stabilizing pseudo-label estimation and better capturing uncertainty (e.g., from G=4 to G=8). However, as G continues to grow, performance degrades due to limitations of the variance-based confidence metric. When most predictions are concentrated, the impact of a few divergent samples becomes diluted, causing the variance to shrink and underestimating uncertainty. This leads the model to overestimate pseudo-label reliability, weakening its ability to distinguish between reliable and unreliable supervision and ultimately hindering adaptation.
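
The dilution effect described above can be reproduced with toy numbers: a single divergent rollout among otherwise concentrated predictions contributes less to the standard deviation as G grows, so the variance-based signal underestimates the remaining uncertainty.

```python
import numpy as np

def rollout_std(g, outlier=30.0, centre=10.0):
    """Toy numbers only: g rollout predictions, all at `centre` except one
    divergent prediction at `outlier`. Returns the standard deviation used
    as the uncertainty signal in this illustration."""
    preds = np.full(g, centre)
    preds[0] = outlier
    return preds.std()

small_g = rollout_std(8)    # the outlier carries weight 1/8
large_g = rollout_std(32)   # the same outlier carries weight 1/32
assert large_g < small_g    # std shrinks, overstating pseudo-label reliability
```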

Q4: Comparison with full-set fully supervised target training results.

Although our method is few-shot and unsupervised in the target domain, it achieves strong performance compared to fully supervised target-domain training. We further compare URPA with recent state-of-the-art supervised grounding models including HawkEye [1], TimeChat [2], TRACE [3], VideoChat-T [4] and target fully supervised training GRPO, all of which are fully trained on the target Charades dataset. In contrast, URPA is pretrained on source datasets such as TACoS or ActivityNet and adapted using only 200 unlabelled Charades videos. Despite this data-efficient and unsupervised setting, URPA achieves comparable or even superior performance, highlighting its practical value and effectiveness in low-resource cross-domain scenarios.

| Model | R@0.5 | R@0.7 |
|---|---|---|
| URPA (TACoS→Charades) | 55.5 | 32.0 |
| URPA (ActivityNet→Charades) | 65.1 | 39.6 |
| *Supervised learning approaches* | | |
| HawkEye [1] | 58.3 | 28.8 |
| TimeChat [2] | 46.7 | 23.7 |
| TRACE [3] | 61.7 | 41.4 |
| VideoChat-T [4] | 67.1 | 43.0 |
| Target fully supervised GRPO | 72.5 | 47.9 |

[1] Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos, 2024.

[2] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In CVPR, pages 14313–14323, 2024.

[3] Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024.

[4] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024.

Comment

Dear Reviewer noQ9,

We hope our rebuttal has clarified your concerns. Please kindly let us know if everything looks good on your side.

Best,

Review (Rating: 4)

This paper proposes URPA, a data-efficient method for unlabelled cross-domain temporal grounding. It adapts a GRPO-trained model using a small number of unlabelled target videos via rollout-based pseudo-labels and confidence-weighted rewards. The method is theoretically grounded and achieves strong results on six benchmark transfer settings.

Strengths and Weaknesses

Strengths:

  • Clear problem motivation: The paper addresses a practical and under-explored setting, data-efficient unsupervised cross-domain temporal grounding, where only a small number (100–200) of unlabeled target videos are available.
  • Rich experimental results: Extensive evaluations are conducted on three datasets across six cross-domain settings, including detailed ablation studies and visualizations.
  • Theoretical support: The authors prove that rollout variance converges to Bayesian predictive uncertainty under mild assumptions, offering a principled justification for using rollout standard deviation as confidence.

Weaknesses:

  • Comparative methods are dated: Most baseline methods are from 2020 or earlier, with only a few recent ones, which limits the completeness of the comparisons.
  • Limited methodological novelty: The approach combines existing techniques, i.e. GRPO, MC Dropout-style uncertainty estimation [11], and pseudo-labeling with rollout averaging. While the combination is elegant and practical, it lacks fundamentally new learning formulations.
  • No adaptation to strong modern VLMs beyond Qwen2.5, and it is unclear whether it generalizes to other foundation models.

Questions

  1. Comparative scope: Most baselines are from 2020 or earlier. Could the authors include comparisons with more recent methods from 2024-2025?
  2. Model generality: The current experiments are based solely on Qwen2.5-7B. Can the authors clarify whether URPA generalizes to other VLMs or LLM-based temporal grounding models?
  3. Novelty clarifications: Since URPA builds on existing techniques (GRPO, MC Dropout, pseudo-labeling), can the authors better articulate what constitutes the core methodological novelty beyond this integration?

Limitations

Yes

Final Justification

I have carefully read all the reviewers' comments as well as the authors' response. The authors have provided a comprehensive and thoughtful reply, which addresses all of my concerns.

Formatting Issues

N/A

Author Response

We thank the reviewer for the careful reading and insightful comments. Our responses to each concern follow.

W1/Q1: Could the authors include comparisons with more recent methods?

Our compared UDA-TSL (2024) is the current state-of-the-art method for cross-domain temporal grounding. To further validate URPA, we compare it with recent supervised grounding models, including HawkEye [1], TimeChat [2], TRACE [3], and VideoChat-T [4], all of which are fully trained on the Charades dataset. In contrast, URPA is pretrained on source datasets such as TACoS or ActivityNet and adapted using only 200 unlabelled videos from Charades. Despite this data-efficient and unsupervised setting, URPA achieves comparable or even better performance, highlighting its strong effectiveness in low-resource scenarios.

| Model | R@0.5 | R@0.7 |
| --- | --- | --- |
| Qwen2.5-7B | 7.7 | 2.8 |
| URPA (TACoS→Charades) | 55.5 | 32.0 |
| URPA (ActivityNet→Charades) | 65.1 | 39.6 |
| *Supervised learning approaches* | | |
| HawkEye [1] | 58.3 | 28.8 |
| TimeChat [2] | 46.7 | 23.7 |
| TRACE [3] | 61.7 | 41.4 |
| VideoChat-T [4] | 67.1 | 43.0 |

[1] Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos, 2024.

[2] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In CVPR, pages 14313–14323, 2024.

[3] Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024.

[4] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024.

W2/Q3: What constitutes the core methodological novelty beyond integration?

Our core methodological novelty lies in reinterpreting the rollout variance from GRPO as a principled uncertainty signal for test-time adaptation. Unlike MC Dropout, which injects randomness at the neuron level, our approach leverages policy-level stochasticity to estimate epistemic uncertainty without requiring architectural changes. This insight enables the first unsupervised reinforcement learning–based temporal grounding framework, where uncertainty guides adaptation without any temporal annotations. Furthermore, we formalise and address a new and practical setting of data-efficient unsupervised cross-domain temporal grounding, which remains underexplored in prior work.
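To make the idea concrete, below is a minimal sketch of this pseudo-labelling and confidence-weighting step. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the confidence is assumed to decay exponentially with the rollout standard deviation, and the reward is taken to be a confidence-weighted IoU against the mean-rollout pseudo label.

```python
import numpy as np

def urpa_pseudo_label_and_confidence(rollouts, beta=1.0):
    """Hypothetical sketch of URPA's pseudo-labelling step.

    rollouts: (G, 2) array of G rollout predictions (start, end) in seconds.
    Returns the mean rollout as the pseudo label and a confidence weight
    that decays with the rollout standard deviation, which the paper
    treats as a proxy for epistemic (Bayesian predictive) uncertainty.
    """
    rollouts = np.asarray(rollouts, dtype=float)
    pseudo_label = rollouts.mean(axis=0)      # mean of rollouts as pseudo label
    sigma = rollouts.std(axis=0).mean()       # rollout spread = uncertainty
    confidence = np.exp(-beta * sigma)        # assumed form: high variance -> low weight
    return pseudo_label, confidence

def weighted_reward(pred, pseudo_label, confidence):
    """Confidence-weighted temporal IoU reward against the pseudo label."""
    inter = max(0.0, min(pred[1], pseudo_label[1]) - max(pred[0], pseudo_label[0]))
    union = max(pred[1], pseudo_label[1]) - min(pred[0], pseudo_label[0])
    iou = inter / union if union > 0 else 0.0
    return confidence * iou
```

In this sketch, consistent rollouts yield a confidence near 1 (the pseudo label is trusted), while divergent rollouts shrink the reward toward 0, so unreliable pseudo labels contribute little to the policy update.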

W3/Q2: Whether URPA can generalize to other VLM-based temporal grounding models?

To demonstrate the generality of URPA, we conducted additional experiments using two different temporal grounding backbones: LLaVA-ST and Qwen2.5-3B. As shown in the table below, applying URPA to LLaVA-ST yields substantial performance improvements, indicating that our adaptation strategy is effective across different architectures. Similarly, URPA also improves the performance of Qwen2.5-3B, though the overall results are lower than those of the 7B model due to its more limited reasoning capacity. These results confirm that URPA generalizes well across various vision-language models, regardless of model size or backbone design.

| Model | Charades→ActivityNet R@0.5 | ActivityNet→TACoS R@0.5 |
| --- | --- | --- |
| Qwen2.5-7B | 15.2 | 7.7 |
| URPA+Qwen2.5-7B | 42.6 | 22.0 |
| Qwen2.5-3B | 9.6 | 4.5 |
| URPA+Qwen2.5-3B | 33.9 | 14.3 |
| LLaVA-ST-7B | 13.6 | 6.5 |
| URPA+LLaVA-ST-7B | 36.2 | 16.9 |
Comment

Thank you again for your thoughtful review. Could you kindly let us know if our response addressed your concern? Your feedback would be very helpful for us to better understand your evaluation and improve our paper.

Comment

Dear Reviewer vGBE,

As the author–reviewer discussion phase is ending soon, and we cannot view your final score or justification, we’d greatly appreciate it if you could briefly let us know whether our rebuttal addressed your concerns.

Best, The authors

Final Decision

The paper addresses the task of Cross-domain Temporal Grounding (i.e. localising natural language queries in videos), where the key challenges are domain shift and the lack of annotations in the target domain. Concretely, the method is first trained on a source domain with temporal annotations. Second, to adapt to an unlabelled target domain, the authors select a small set of target videos, sample multiple predictions from the model via GRPO rollouts, and use their mean as the pseudo-label.

The reviewers commented positively about the strong results achieved by the method, that the method is both principled and simple, and that unlike other test-time adaptation methods which must be run for each test-input, this only needs to be done once on a small subset of videos. The method is therefore more suited for practical applications.

The concerns that reviewers raised were addressed well during the rebuttal phase. As a result, the final decision is acceptance. Please update the camera ready according to the reviewers' comments and the rebuttal.