PaperHub
Overall: 7.3/10 · Poster · 4 reviewers
Ratings: 5, 4, 4, 5 (min 4, max 5, std 0.5) · Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

CausalVTG: Towards Robust Video Temporal Grounding via Causal Inference

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose a causal framework for video temporal grounding that mitigates confounding biases and improves robustness to linguistic variations and irrelevant queries.

Abstract

Keywords
Video Temporal Grounding · Causal Inference · Vision-Language Understanding

Reviews and Discussion

Review (Rating: 5)

This paper proposes CausalVTG, a novel framework for Video Temporal Grounding that explicitly incorporates causal inference into the model design. By constructing a Structural Causal Model and employing front-door adjustment via a Causality-Aware Disentangled Encoder, the method aims to eliminate dataset-induced confounding biases. Additionally, a Multi-Scale Temporal Perception Module is used to enhance temporal reasoning at different granularities, and a Counterfactual Contrastive Learning objective is introduced to distinguish relevant from irrelevant queries. Extensive experiments on five widely-used VTG benchmarks demonstrate that CausalVTG achieves state-of-the-art performance in both localization precision and query relevance detection tasks. The paper is well-structured and the proposed ideas are clear and well-motivated.

Strengths and Weaknesses

Strengths: The paper addresses two fundamental but underexplored challenges in VTG — spurious correlations and grounding absence — through a principled causal framework. The introduction of causal inference, particularly front-door adjustment, into the VTG task is novel and fills a critical gap in the literature. The proposed components, including the SCM-based encoder, multi-scale temporal module, and counterfactual learning objective, are coherently designed and theoretically justified. The experimental evaluation is comprehensive, covering multiple benchmarks and metrics, and effectively validates the proposed contributions. The paper is clearly written, well-organized, and presents figures and tables that are informative and supportive of the narrative.

Weaknesses: The ablation study shows that removing the causal module (CADE) causes smaller performance drops than removing other modules, raising concerns about the direct impact of the causal design. The counterfactual learning strategy relies on negative sampling rather than generating challenging counterfactuals, potentially underutilizing the full potential of causal reasoning. The method depends heavily on large-scale pre-trained models, but there is no detailed discussion of computational cost or efficiency compared to existing baselines.

Questions

  1. While CADE is conceptually important, its removal causes relatively minor degradation in performance. Could the authors provide more targeted experiments (e.g., under domain shifts or style variations) to better demonstrate its effect?
  2. Have the authors considered generating more challenging or realistic counterfactual samples instead of random negative sampling? Would this further improve the robustness?
  3. The method depends heavily on large-scale pre-trained models, but there is no detailed discussion on computational cost or efficiency compared to existing baselines.

Limitations

YES

Justification for Final Rating

With all queries resolved through the authors' cogent responses and the study's inherent strengths maintained, I retain my positive rating as originally assigned.

Formatting Issues

The paper uses bolded phrases at the beginning of paragraphs without any punctuation, and the reference formatting is inconsistent.

Author Response

We sincerely thank the reviewer for the thoughtful and detailed feedback. In response to the concerns raised, we have made the following clarifications and additions:

  • We evaluated CADE on the Charades-CG dataset to show its benefit under compositional generalization and domain shift scenarios (Q1).
  • We introduced two more challenging negative sampling strategies — GT-excluded and Semantic Hard negatives — and demonstrated their effectiveness on Charades-RF (Q2).
  • We reported training and inference cost comparisons with major baselines, confirming CausalVTG’s computational efficiency (Q3).

We believe these clarifications and additional results address the reviewer’s concerns. If any issues remain unclear, we would be happy to provide further discussion.

Response to Q1

While CADE is conceptually important, its removal causes relatively minor degradation in performance. Could the authors provide more targeted experiments (e.g., under domain shifts or style variations) to better demonstrate its effect?

Yes, we further evaluated CADE separately on the Charades-CG dataset [1], which specifically tests novel phrase compositions and unseen words in temporal grounding tasks. These experiments demonstrate CADE's strong capability in addressing superficial co-occurrence patterns by enabling the model to generalize beyond previously observed linguistic combinations, an advantage particularly evident in compositional generalization scenarios.

| Method | Novel-Composition R1@0.5 / R1@0.7 / mIoU | Novel-Word R1@0.5 / R1@0.7 / mIoU |
| --- | --- | --- |
| CausalVTG | 56.68 / 32.71 / 49.59 | 59.28 / 34.96 / 51.12 |
| w/o CADE | 52.15 / 29.05 / 46.15 | 54.68 / 30.36 / 47.03 |
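For readers less familiar with these metrics, the sketch below shows how R1@IoU and mIoU are conventionally computed from a single top-1 prediction per query; the function names and toy segments are our own illustration, not code from the paper.

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])  # span hull; equals the union when segments overlap
    return inter / union if union > 0 else 0.0

def evaluate(preds: List[Tuple[float, float]],
             gts: List[Tuple[float, float]],
             thresholds=(0.5, 0.7)) -> dict:
    """R1@t: share of queries whose top-1 prediction reaches IoU >= t; mIoU: mean top-1 IoU."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    scores = {f"R1@{t}": 100.0 * sum(i >= t for i in ious) / len(ious) for t in thresholds}
    scores["mIoU"] = 100.0 * sum(ious) / len(ious)
    return scores

# Toy usage with made-up segments
print(evaluate([(2.0, 7.5), (10.0, 14.0)], [(2.5, 8.0), (9.0, 16.0)]))
```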

Response to Q2

Have the authors considered generating more challenging or realistic counterfactual samples instead of random negative sampling? Would this further improve the robustness?

We thank the reviewer for the constructive suggestion. In addition to the original random negative sampling, we investigated two more challenging strategies to improve robustness: (1) GT-excluded negatives, where ground-truth segments are masked to force the model to reason over visually similar but irrelevant content; and (2) Semantic Hard negatives, where each query is paired with the most semantically similar but non-matching video in the batch.

Experiments on Charades-RF, which contains naturally ungrounded queries, confirm the benefit of both strategies: GT-excluded negatives improve performance under stricter IoU settings, while Semantic Hard negatives further strengthen overall discriminative ability by compelling the model to distinguish fine-grained semantic differences.

| Model | Acc | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- |
| CausalVTG | 84.78 | 76.22 | 71.07 | 61.03 | 67.86 |
| + GT-excluded | 83.17 | 76.29 | 72.35 | 62.36 | 67.88 |
| + Semantic Hard | 86.24 | 77.05 | 71.90 | 60.58 | 67.11 |
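As a rough illustration of how the two negative types described above could be constructed and used in a counterfactual contrastive objective, here is a minimal PyTorch sketch; the tensor shapes, names, and InfoNCE-style loss are assumptions on our part rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_negatives(video_feats, query_feats, gt_masks):
    """Construct the two negative types described above (illustrative only).

    video_feats: (B, T, D) clip features; query_feats: (B, D) pooled query features;
    gt_masks:    (B, T) booleans marking the ground-truth clips of each query.
    """
    # (1) GT-excluded negatives: zero out ground-truth clips so only visually
    #     similar but irrelevant content remains paired with the query.
    gt_excluded = video_feats * (~gt_masks).unsqueeze(-1).float()

    # (2) Semantic hard negatives: for each query, pick the most similar
    #     *other* query's video in the batch as a non-matching pairing.
    q = F.normalize(query_feats, dim=-1)
    sim = q @ q.t()
    sim.fill_diagonal_(float("-inf"))            # exclude the query itself
    hard_idx = sim.argmax(dim=-1)                # (B,)
    semantic_hard = video_feats[hard_idx]        # (B, T, D)
    return gt_excluded, semantic_hard

def counterfactual_contrastive_loss(pos_score, neg_scores, tau=0.07):
    """InfoNCE-style loss: the grounded (positive) video-query pair should score
    higher than its counterfactual negatives. pos_score: (B,), neg_scores: (B, N)."""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1) / tau
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```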

Response to Q3

The method depends heavily on large-scale pre-trained models, but there is no detailed discussion on computational cost or efficiency compared to existing baselines.

To evaluate the computational overhead, we trained and evaluated all models on the QVHighlights dataset using an NVIDIA A800 GPU (80GB memory) with a batch size of 64 over 50 epochs. As summarized below, CausalVTG's inference runtime is slightly longer than simpler baselines, yet remains highly competitive. Notably, prior methods such as Moment-DETR and QD-DETR originally required up to 200 epochs, and R2-Tuning uses extensive GPU memory during training due to its reversed recurrent blocks.

| Method | Training GPU Memory | Training #Parameters | Training Time | Inference Time |
| --- | --- | --- | --- | --- |
| Moment-DETR [2] | 1.41 GB | 4.82 M | 9.71 min | 31 s |
| QD-DETR [3] | 1.89 GB | 7.58 M | 13.15 min | 37 s |
| CG-DETR [4] | 3.09 GB | 12.61 M | 40.05 min | 45 s |
| R2-Tuning [5] | 37.24 GB | 2.7 M | 544.33 min | 69 s |
| CausalVTG | 2.31 GB | 7.86 M | 43.21 min | 53 s |

Response to Paper Formatting Concerns

The paper uses bolded phrases at the beginning of paragraphs without any punctuation, and the reference formatting is inconsistent.

We thank the reviewer for carefully pointing out these formatting issues, and we will correct the bolded paragraph headers and ensure consistent reference formatting in the final version.

[1] "Compositional temporal grounding with structured variational cross-graph correspondence learning." CVPR 2022

[2] "Detecting moments and highlights in videos via natural language queries." NeurIPS 2021.

[3] "Query-dependent video representation for moment retrieval and highlight detection." CVPR 2023.

[4] "Correlation-guided query-dependency calibration for video temporal grounding." arXiv preprint arXiv:2311.08835 (2023).

[5] "R2-tuning: Efficient image-to-video transfer learning for video temporal grounding." ECCV 2024.

Comment

I appreciate the authors' thorough and thoughtful rebuttal. The additional experiments and clarifications directly addressed the concerns I raised. The results added in the revision further support the paper’s claims and strengthen its empirical validation. I am satisfied with the response and maintain a positive overall assessment of the paper.

Comment

Thank you again for your insightful and constructive feedback. We’re excited to hear that our revisions have effectively addressed your concerns.

Review (Rating: 4)

In this paper, the authors first formulate a structural causal model for the video temporal grounding task. They define stylistic variations in the visual and textual modalities as confounders. Building on the proposed structural causal graph, the authors implement CausalVTG and verify its effectiveness on five benchmarks. As a technical summary, the authors integrate a causality-aware disentangled encoder, a multi-scale temporal perception module, and a counterfactual contrastive learning objective in the proposed CausalVTG framework, which constitutes a solid contribution to the community.

Strengths and Weaknesses

Strengths

  1. The clarity of this paper is good and the paper is easy to follow. It is well motivated by two challenges: 1) superficial co-occurrence patterns and 2) the assumption that relevant segments always exist.

  2. The authors also consider the scenario in which the given query is not relevant to the video content, where the model should reject grounding any segment. This makes the proposed method more thorough and better suited to real-world applications.

Weaknesses

  1. No thorough ablation study is provided in the main paper. Since the proposed CausalVTG is a combination of CADE, MSTP, QRM, and QGR, as shown in Table 4 of the supplementary material, a detailed ablation study is needed in the main paper. A base version of CausalVTG can be the base model + CADE; then the integration of any one or two of MSTP, QRM, and QGR should be included.

  2. The performance gain of CausalVTG comes mainly from MSTP, which is not related to causal reasoning.

Questions

  1. The first claimed contribution of this paper is a causal graph for the video temporal grounding task. It would be great to show straightforwardly in the main paper what the confounder is exactly. From the common understanding of our community, a confounder can be something like video background information or some template or pattern in the language query. Showing it explicitly in the paper would enhance its quality.

  2. Referring to the weaknesses part, can you include a detailed and thorough ablation study in the main paper? You could save much space by making Figure 2 single-column.

Limitations

The authors discuss the limitations of the proposed method on pages 14 and 15. For other limitations, please refer to the weaknesses part.

Justification for Final Rating

Thank you for the detailed reply. Additional results on Charades-CG make the method more convincing. I've changed my rating from borderline reject to borderline accept.

Formatting Issues

No major formatting issues.

Author Response

We appreciate the reviewer’s critical feedback and agree that additional targeted analysis is essential for strengthening the paper. In response, we conducted new experiments and provided clarifications.

  • To address the need for a thorough ablation study, we added detailed experiments showing the individual and combined contributions of CADE, MSTP, QRM, and QGR, confirming that each module is complementary (W1, Q2).
  • To clarify the reviewer’s concern that performance gains mainly come from MSTP, we demonstrate that MSTP improves proposal generation, while CADE uniquely mitigates spurious correlations via causal adjustment, playing an irreplaceable role (W2).
  • To respond to the reviewer’s question about confounders, we explicitly define and illustrate them in the VTG context (e.g., spurious query–visual co-occurrences) and commit to clarifying these points in the main paper (Q1).

We believe these additions resolve the reviewer’s key concerns and strengthen the overall contribution of our work. If any issues remain unclear, we would be happy to provide further discussion.

Response to W1

No thorough ablation study is provided in the main paper. Since the proposed CausalVTG is a combination of CADE, MSTP, QRM, and QGR, as shown in Table 4 of the supplementary material, a detailed ablation study is needed in the main paper. A base version of CausalVTG can be the base model + CADE; then the integration of any one or two of MSTP, QRM, and QGR should be included.

We thank the reviewer for highlighting this important point. In response, we have conducted a comprehensive ablation study following the reviewer’s suggestion. The results demonstrate the individual and joint contributions of CADE, MSTP, QRM, and QGR, and show that each module is complementary and that their integration is crucial to the overall effectiveness of CausalVTG. We will incorporate a detailed version of this analysis into the main paper.

| Combination of CADE / QGR / MSTP / QRM | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 | Avg. mAP |
| --- | --- | --- | --- | --- | --- |
| (a) | 57.74 | 36.52 | 58.96 | 35.36 | 35.19 |
| (b) | 59.94 | 39.55 | 60.22 | 37.43 | 36.47 |
| (c) | 60.39 | 38.52 | 60.39 | 37.05 | 36.61 |
| (d) | 67.68 | 51.68 | 69.48 | 51.33 | 47.85 |
| (e) | 60.77 | 39.16 | 61.32 | 36.98 | 36.87 |
| (f) | 61.32 | 38.58 | 61.65 | 37.22 | 37.11 |
| (g) | 68.58 | 52.71 | 69.69 | 50.89 | 48.99 |
| (h) | 62.19 | 40.05 | 61.66 | 37.86 | 37.42 |
| (i) | 68.13 | 52.9 | 69.95 | 52 | 49.54 |
| (j) | 70.26 | 54.32 | 71.34 | 52.67 | 50.15 |
| (k) | 70.84 | 56.00 | 72.17 | 53.79 | 50.98 |

Response to W2

The performance gain of CausalVTG comes mainly from MSTP, which is not related to causal reasoning.

We thank the reviewer for this important point. While MSTP indeed contributes the largest raw performance gain by improving multi-scale proposal generation, CADE plays a distinct and critical role in addressing superficial co-occurrence patterns through front-door adjustment, which MSTP alone cannot resolve. To isolate CADE’s effect, we evaluated it on the Charades-CG dataset [1], which stresses compositional generalization (novel phrase compositions and unseen words). The results show that removing CADE leads to clear performance drops in these challenging settings, confirming that CADE is essential for enabling the model to generalize beyond spurious correlations and that it complements MSTP rather than being redundant.

| Method | Novel-Composition R1@0.5 / R1@0.7 / mIoU | Novel-Word R1@0.5 / R1@0.7 / mIoU |
| --- | --- | --- |
| CausalVTG | 56.68 / 32.71 / 49.59 | 59.28 / 34.96 / 51.12 |
| w/o CADE | 52.15 / 29.05 / 46.15 | 54.68 / 30.36 / 47.03 |

Response to Q1

The first claimed contribution of this paper is a causal graph for the video temporal grounding task. It would be great to show straightforwardly in the main paper what the confounder is exactly. From the common understanding of our community, a confounder can be something like video background information or some template or pattern in the language query. Showing it explicitly in the paper would enhance its quality.

Thank you for your valuable feedback. In our work, the primary confounder is the presence of superficial co-occurrence patterns between query phrases and visual contexts that spuriously influence grounding outcomes.

For example, in the Charades-STA training set, queries containing the word “sink” are commonly paired with actions like “wash” and “put”. However, when the query also includes “kitchen”, the dominant associated actions shift to “put” and “run”. This distributional shift—caused by an irrelevant contextual word—demonstrates a classic confounding effect: the added word jointly affects both the input representation and the prediction target, while being independent of the actual intended action.

This explains the failure case in Figure 1 of our paper, where R2-Tuning [2] succeeds on Query 1 but fails on Query 2. The presence of “kitchen” introduces a confounder that changes the model’s grounding behavior despite both queries referring to the same event.

In addition, we also consider unobservable confounders, such as visual decoration styles or recurring sentence templates, which may similarly bias grounding decisions. Our method addresses both types via a front-door adjustment mechanism, as supported by prior causal grounding work [3].

To address the reviewer’s suggestion, we will revise the main paper to more directly and concretely present these confounders.
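To make the kind of distributional shift described in this response concrete, the following toy sketch computes the conditional action distribution with and without the extra context word; the corpus and counts are invented purely for illustration and are not Charades-STA statistics.

```python
from collections import Counter

# Hypothetical (query-words, action) pairs; counts are made up for illustration only.
corpus = ([({"sink"}, "wash")] * 6 + [({"sink"}, "put")] * 4 +
          [({"sink", "kitchen"}, "put")] * 5 + [({"sink", "kitchen"}, "run")] * 4)

def action_dist(required_words):
    """Estimate P(action | query contains required_words) from the toy corpus."""
    actions = Counter(action for words, action in corpus if required_words <= words)
    total = sum(actions.values())
    return {a: round(c / total, 2) for a, c in actions.items()}

print(action_dist({"sink"}))             # {'wash': 0.32, 'put': 0.47, 'run': 0.21}
print(action_dist({"sink", "kitchen"}))  # {'put': 0.56, 'run': 0.44} -- the shift
```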

Response to Q2

Referring to the weaknesses part, can you include a detailed and thorough ablation study in the main paper? You could save much space by making Figure 2 single-column.

Following the reviewer's suggestion, we will resize Figure 2 to a single-column format in the main paper and incorporate the detailed ablation study from the supplemental material into the updated version.

[1] "Compositional temporal grounding with structured variational cross-graph correspondence learning." CVPR 2022.

[2] "R2-tuning: Efficient image-to-video transfer learning for video temporal grounding." ECCV 2024.

[3] "Cross-modal Causal Relation Alignment for Video Question Grounding." CVPR 2025.

Comment

Thank you for the detailed reply. Additional results on Charades-CG make the work more convincing. I've changed my rating from borderline reject to borderline accept.

Comment

We sincerely appreciate the time and effort you dedicated to reviewing our paper. Your invaluable comments and insights have been helpful in improving our work. We are also grateful for your decision to raise the score.

Review (Rating: 4)

This paper targets the task of Video Temporal Grounding (VTG) and proposes the causal-inference framework CausalVTG. The method has three key components: 1) a Causality-Aware Disentangled Encoder (CADE) based on front-door adjustment to eliminate confounding bias between the visual and textual modalities; 2) Multi-Scale Temporal Perception (MSTP) to adaptively capture actions of different temporal lengths; and 3) Counterfactual Contrastive Learning to judge whether a query can truly be grounded. The unified model simultaneously addresses Moment Retrieval, Highlight Detection, and Query Relevance. It achieves state-of-the-art results on five benchmarks—QVHighlights, Charades-STA, ActivityNet-Caption, Charades-RF, and ActivityNet-RF—showing especially strong performance under strict IoU thresholds and in “false-query” scenarios.

Strengths and Weaknesses

Strengths

  1. First to systematically introduce structured causal modeling, front-door adjustment, and counterfactual contrastive learning into VTG, effectively mitigating vision/text confounding and improving interpretability and generalization.
  2. Builds a single framework that performs Moment Retrieval, Highlight Detection, and Query Relevance in one pass, avoiding error accumulation in pipeline systems and keeping the overall design concise.
  3. Achieves top accuracy of 84.78 / 89.20 Acc on the Charades-RF and ActivityNet-RF “false-query” tests, outperforming existing methods.

Weaknesses

  1. The experimental section does not include comparisons with other causal-inference-based VTG methods, making it difficult to judge whether the proposed front-door adjustment combined with counterfactual contrast truly offers an advantage over alternative causal strategies.
  2. The model assumes that a K-means–derived mediator satisfies the front-door criterion, yet provides no evidence on how the choice of cluster number K or clustering stability affects performance. Without such sensitivity analysis, the claimed causal robustness remains speculative.
  3. The paper does not report mean ± std or statistical significance tests (e.g., p-values) for the main metrics, so it is difficult to gauge the reliability of the reported performance gains.

Questions

  1. Could you include at least one representative causal baseline in the comparison tables (or explain why this is not feasible) to clarify the benefit of your causal strategy?

  2. How sensitive is CADE to the K-means cluster number K and clustering randomness? An ablation or plot would help justify the causal robustness claim.

Limitations

Yes

Justification for Final Rating

During the rebuttal process, most of my concerns were adequately addressed. After reviewing the feedback from other reviewers, I have decided to maintain my positive score and increase the confidence score to 3.

Formatting Issues

None

Author Response

We thank the reviewer for the thoughtful and constructive feedback, which helped us strengthen the paper with deeper analyses and clarifications. In response, we:

  • Added a causal baseline (DCM) and other state-of-the-art methods (DORi, CG-DETR) to our comparison tables, clearly demonstrating that our front-door adjustment with counterfactual contrast achieves consistent and substantial gains over alternative causal strategies (Q1).
  • Provided a sensitivity analysis of the K-means cluster number K in CADE, reporting mean ± std metrics across multiple runs, which confirms the stability of our mediator design and supports the claimed causal robustness (Q2).

We believe these additions address the reviewer’s concerns and further highlight the robustness and contribution of our approach.

Response to Q1

Could you include at least one representative causal baseline in the comparison tables (or explain why this is not feasible) to clarify the benefit of your causal strategy?

Yes. In our revision, we have added DCM (Deconfounded Cross-modal Matching) [1], a causal method that addresses dataset temporal-annotation bias via back-door adjustment. To further strengthen our empirical comparison, we also included DORi [2] and CG-DETR [3], which are state-of-the-art methods that mitigate superficial co-occurrence patterns by reducing correlations between background and foreground.

| Method | Charades-STA R1@0.5 / R1@0.7 / mIoU | ActivityNet-Caption R1@0.5 / R1@0.7 / mIoU | QVHighlights R1@0.5 / R1@0.7 / mAP Avg. |
| --- | --- | --- | --- |
| TCN+DCM [1] | 55.8 / 34.4 / 48.7 | 44.9 / 27.7 / 43.3 | - / - / - |
| DORi [2] | 59.65 / 40.56 / 53.28 | 41.49 / 26.41 / 42.78 | - / - / - |
| CG-DETR [3] | 58.4 / 36.3 / 50.1 | - / - / - | 65.43 / 48.38 / 42.86 |
| CausalVTG | 70.89 / 49.25 / 59.96 | 45.62 / 26.28 / 45.74 | 68.87 / 52.53 / 49.63 |

The updated comparison demonstrates that our method significantly outperforms all baselines, particularly on the Charades-STA dataset, where the presence of multiple stylistically diverse annotations per video segment renders models more susceptible to superficial co-occurrence patterns.

Response to Q2

How sensitive is CADE to the K-means cluster number K and clustering randomness? An ablation or plot would help justify the causal robustness claim.

We performed a comprehensive sensitivity analysis on the Charades‑RF dataset by varying the number of K‑means clusters in {16, 32, 64, 128, 256, 512, 1024, 2048}. For each value, we ran multiple random seeds to compute mean ± standard deviation of metrics like R1@0.7, mIoU, and classification accuracy. Due to rebuttal constraints, only tabular results are presented here, but we commit to including detailed plots in the final version of the paper.

| Metric \ n_cluster | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1@0.7 | 58.93±1.71 | 60.39±0.66 | 61.25±0.31 | 60.29±0.40 | 58.20±0.41 | 59.12±0.89 | 59.25±0.53 | 59.06±0.45 |
| mIoU | 65.58±1.30 | 66.18±0.68 | 67.16±0.07 | 66.44±0.37 | 64.36±0.54 | 65.64±0.33 | 65.56±0.16 | 65.15±0.43 |
| Acc | 84.25±0.57 | 85.11±0.14 | 85.76±0.69 | 84.96±0.51 | 83.50±0.15 | 84.37±0.28 | 84.00±0.51 | 84.11±0.11 |

The experimental results indicate that a cluster number of 64 yields optimal performance. This choice provides comprehensive coverage of underlying semantic categories, effectively capturing confounders. In contrast, using too few clusters risks missing critical semantic distinctions, while employing too many introduces computational overhead and noisy representations, negatively impacting model performance.
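For concreteness, here is a minimal sketch of how a K-means mediator dictionary of the kind discussed above could be built and applied (soft assignment to centroids); all names, shapes, and the softmax-based assignment are our assumptions, not the paper's actual CADE implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_mediators(train_feats: np.ndarray, k: int = 64, seed: int = 0) -> np.ndarray:
    """Cluster pooled training features (N, D) into K semantic prototypes (K, D)."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(train_feats)
    return km.cluster_centers_

def mediate(feats: np.ndarray, prototypes: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Soft-assign each feature to the prototypes and return the mediated
    representation (a convex combination of centroids)."""
    sims = feats @ prototypes.T / tau                       # (N, K) similarities
    weights = np.exp(sims - sims.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)           # softmax over the K prototypes
    return weights @ prototypes                             # (N, D)

# Usage with random stand-in features
feats = np.random.randn(1000, 256).astype(np.float32)
protos = build_mediators(feats, k=64)
print(mediate(feats[:8], protos).shape)  # (8, 256)
```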

[1] "Deconfounded video moment retrieval with causal intervention." SIGIR 2021.

[2] "DORi: Discovering object relationships for moment localization of a natural language query in a video." WACV 2021.

[3] "Correlation-guided query-dependency calibration for video temporal grounding." arXiv preprint arXiv:2311.08835 (2023).

Comment

Thank you for the detailed response. Most of my concerns were adequately addressed. After reviewing the feedback from other reviewers, I have decided to maintain my positive score and increase the confidence score to 3.

Comment

We sincerely thank you for your thoughtful comments and valuable time, and we’re glad to hear that our revisions have addressed your concerns.

Review (Rating: 5)

This paper proposes CausalVTG, a framework designed to address two major limitations in Video Temporal Grounding (VTG): (1) reliance on superficial co-occurrence patterns due to dataset biases, and (2) inability to handle scenarios where the query content might not be present in the video. The authors employ causal inference, specifically front-door adjustment, to mitigate confounding biases. They introduce a Causality-Aware Disentangled Encoder (CADE) to obtain unbiased modality-specific representations, and a Multi-Scale Temporal Perception (MSTP) module to capture temporal dynamics at multiple granularities. Additionally, the authors incorporate counterfactual contrastive learning to improve the model’s capability to distinguish between grounded and ungrounded queries. Experimental evaluations on five established benchmarks demonstrate state-of-the-art results across various settings.

Strengths and Weaknesses

Strengths

  • The paper clearly identifies important, practical limitations in existing VTG methods, and convincingly motivates the application of causal inference.

  • The introduction of the causal inference framework, specifically using the front-door adjustment through the proposed CADE, is novel and technically sound. The MSTP module enhances temporal modeling effectively, complementing the causal design.

  • Extensive experiments across five widely-used benchmarks (QVHighlights, Charades-STA, ActivityNet Caption, Charades-RF, ActivityNet-RF) clearly demonstrate improvements, especially under strict IoU thresholds and challenging ungrounded-query scenarios. I would recommend adding YouCookII and the partitions for Charades-STA and ActivityNet from https://arxiv.org/abs/2101.09028

  • The paper is very clearly written, methodically structured, and provides comprehensive implementation details and hyperparameters, aiding reproducibility.

Weaknesses

  • The combination of multiple intricate modules (CADE, MSTP, counterfactual training) introduces considerable complexity. This complexity may hinder interpretation of performance gains, particularly the exact contribution of each individual component.

  • The complexity of multi-scale temporal processing and causal disentanglement is computationally demanding. While runtime details are briefly mentioned, a deeper analysis of inference efficiency or scalability considerations would be valuable.

Questions

Questions

  • Could you elaborate further on the assumptions underpinning the structural causal model and clarify their appropriateness and limitations specifically for VTG tasks?
  • How sensitive is your framework to the design of mediators? Have alternative mediator designs been explored, and how might different mediator definitions affect grounding performance?
  • Given multiple complex components (CADE, MSTP, counterfactual contrastive learning), could you further clarify which components are most critical? Additional ablations or sensitivity analyses would strengthen the paper significantly.
  • Could you provide more details on computational overhead during inference (e.g., runtime and memory consumption), especially compared to simpler baselines?
  • Can you provide analyses of your predictions similar to https://openaccess.thecvf.com/content/ICCV2023W/CLVL/papers/De_la_Jara_An_Empirical_Study_of_the_Effect_of_Video_Encoders_on_ICCVW_2023_paper.pdf Figure 1? such that we can evaluate potential bias of the predictions of the model.

Suggestions

I have some suggestions and comments that might help further strengthen your paper:

  1. Related Work & Citations (Lines 42-50):
    It's great to see that you've cited Escorcia et al. (2019). For completeness and richer context, the paper would strongly benefit from citing two additional relevant works:

    Including these references would enhance your motivation and clarify the novelty and contributions of your approach relative to existing work.

  2. Multiple Occurrences and Temporal Causality:
    An interesting scenario that your method might help analyze further relates to queries that describe events occurring multiple times within the same video.

  • Also, in Figure 8 of DORi’s supplemental material, the Charades-STA query "person walks over to the refrigerator open it up" occurs twice. This raises the important practical scenario where a query might occur multiple times, once, or not at all in a video.

    Your causal grounding method could show excellent performance on such challenging queries and better showcase its strengths and contributions. An explicit analysis or discussion would significantly strengthen the practical motivation of your work.

  3. Generalization and Causal Dependencies (YouCook2 dataset):
    It would be highly valuable and insightful to see your causal inference-based grounding applied to a dataset like YouCook2, where the causality of an action depends directly on previously executed actions.

    • For instance, Figures 13 and 14 of the supplemental material from DORi indicate specific cases where models incorrectly learn spurious correlations (e.g., associating certain ingredients or actions like "pouring dressing" or "adding oil" with incorrect moments due to biases in object presence or appearance).

    Demonstrating your method’s robustness against such spurious correlations and temporal dependencies using YouCook2 would substantially strengthen the empirical validation and show clearer evidence of the claimed causal benefits.

I encourage you to address these suggestions, as they would greatly enhance the quality, depth, and impact of your contribution.

Limitations

Yes, the paper discusses limitations clearly. However, additional reflections on the complexity, computational efficiency, and assumptions in causal modeling would further strengthen this section.

Justification for Final Rating

Thank you to the authors for their thoughtful and well-articulated response. I remain positive about the paper and will maintain my original score

Formatting Issues

None. The paper adheres to NeurIPS formatting guidelines.

Author Response

We appreciate the reviewer’s constructive feedback, which helped us strengthen the paper. In response, we:

  • Clarified the core causal assumptions and their appropriateness and limitations for VTG (Q1).
  • Analyzed mediator design sensitivity and discussed alternative definitions (Q2).
  • Expanded ablations to show the distinct and complementary roles of CADE, MSTP, and counterfactual learning (Q3).
  • Detailed computational overhead with runtime and memory comparisons to simpler baselines (Q4).
  • Provided prediction distribution analyses and committed to adding more visualizations to assess bias (Q5).
  • Incorporated suggested references and scenarios, including temporal causality, multiple occurrences, and new YouCook2 results (S1,S2,S3).

We believe these additions meaningfully address the reviewer’s concerns and further highlight the novelty, rigor, and practical impact of our work.

Response to Q1

Key Causal Assumptions:

  • Assumption 1 (Existence of Confounders): There exist latent confounders Z, such as visual stylistic cues (e.g., background contexts) and linguistic variations (e.g., phrasing habits), which simultaneously influence both the inputs X = {V, Q} and the grounding outcome Y.
  • Assumption 2 (Front-door Identifiability): There exists a set of mediator variables M (semantic representations) that fully mediate the causal relationship X → Y. These mediators are assumed to be independent of direct influences from the latent confounders Z, thus satisfying the conditions required for the front-door adjustment (written out below).
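Under these two assumptions, the interventional quantity is identified by the standard front-door formula, stated here in textbook form for reference (this is the general identity, not an excerpt from the paper):

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid x', m\bigr)\, P(x')
```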

Appropriateness for VTG tasks:

  • Prior studies [1,2] clearly document the presence of stylistic biases within existing VTG datasets, highlighting the necessity of explicitly modeling such confounders.
  • Employing semantic mediators aligns with recent causal inference methods that effectively disentangle spurious correlations in related multimodal grounding tasks [3,4].

Limitations and Constraints:

Our current SCM explicitly addresses stylistic variations but does not directly account for temporal location biases, which are also prevalent in VTG tasks [5,6]. In contrast, the DCM framework [5] explicitly employs back-door adjustments to mitigate temporal biases.

Response to Q2

The sensitivity of our framework to mediator design primarily stems from the choice of cluster number K used in the K-means clustering of semantic mediators. To evaluate this sensitivity, we conducted experiments varying K, summarized in the table below.

| Metric \ n_cluster | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1@0.7 | 58.93±1.71 | 60.39±0.66 | 61.25±0.31 | 60.29±0.40 | 58.20±0.41 | 59.12±0.89 | 59.25±0.53 | 59.06±0.45 |
| mIoU | 65.58±1.30 | 66.18±0.68 | 67.16±0.07 | 66.44±0.37 | 64.36±0.54 | 65.64±0.33 | 65.56±0.16 | 65.15±0.43 |
| Acc | 84.25±0.57 | 85.11±0.14 | 85.76±0.69 | 84.96±0.51 | 83.50±0.15 | 84.37±0.28 | 84.00±0.51 | 84.11±0.11 |

Experimental results show that setting the number of clusters to 64 achieves the best performance, effectively balancing semantic granularity and noise. Fewer clusters may miss key distinctions, while too many introduce redundancy and degrade model stability.

Recent work, such as FDVAE (Front-Door Variational Autoencoder) [7], explores an alternative approach for front-door adjustment by learning latent mediators through variational inference instead of explicit clustering. Due to time constraints, we have not explored alternative mediator designs, but we will investigate them in future work to further improve performance and robustness.

Response to Q3

To clarify the contribution of each component, we provide detailed ablation results in Table 4 of our paper, which show that MSTP plays a crucial role in standard VTG tasks by generating proposals across multiple scales and granularities. To further support this, we include a new sensitivity analysis on the choice of temporal strides used in MSTP. This demonstrates that multi-scale temporal modeling is key to capturing varied action durations and enhances the model’s ability to localize events more precisely.

| Temporal strides | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 | Avg. mAP |
| --- | --- | --- | --- | --- | --- |
| w/o MSTP | 61.35 | 38.00 | 61.40 | 36.75 | 36.29 |
| {1} | 60.84 | 40.58 | 62.15 | 37.81 | 37.56 |
| {1,2} | 67.10 | 48.45 | 67.69 | 45.16 | 43.35 |
| {1,2,4} | 70.39 | 53.10 | 70.91 | 50.00 | 48.63 |
| {1,2,4,8} | 70.84 | 56.00 | 72.17 | 53.79 | 50.98 |
| {1,2,4,8,16} | 70.19 | 54.97 | 72.33 | 53.76 | 51.88 |
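To illustrate what multi-scale temporal perception over strides such as {1, 2, 4, 8} can look like, here is a minimal PyTorch sketch of strided 1D convolution branches producing a temporal feature pyramid; this is our own toy illustration, not the authors' MSTP module.

```python
import torch
import torch.nn as nn

class MultiScaleTemporal(nn.Module):
    """Toy multi-scale temporal branch: one strided 1D conv per scale, producing a
    feature pyramid over the clip sequence so that short and long actions are
    covered at different granularities."""
    def __init__(self, dim: int = 256, strides=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, stride=s, padding=1) for s in strides
        ])

    def forward(self, x: torch.Tensor):
        # x: (B, T, D) clip features -> list of (B, ~T/s, D) multi-scale maps
        x = x.transpose(1, 2)                      # (B, D, T) for Conv1d
        return [branch(x).transpose(1, 2) for branch in self.branches]

feats = torch.randn(2, 128, 256)                   # 2 videos, 128 clips, 256-d features
pyramid = MultiScaleTemporal()(feats)
print([p.shape[1] for p in pyramid])               # [128, 64, 32, 16]
```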

Furthermore, we evaluated the impact of Counterfactual Contrastive Learning (QRM) on the Charades-RF dataset, which includes queries not grounded in videos. Results confirm that QRM substantially improves the model's ability to discern and reject ungrounded queries.

| Model | Acc | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- |
| CausalVTG | 84.78 | 76.22 | 71.07 | 61.03 | 67.86 |
| w/o QRM | 47.82 | 38.84 | 34.14 | 25.11 | 30.61 |

Finally, we validated the effectiveness of CADE on the Charades-CG dataset [8], which includes novel phrase compositions and unseen words in the test split. The observed performance gains highlight CADE’s crucial role in mitigating stylistic biases and enhancing the model’s generalization to unseen linguistic patterns. While MSTP indeed contributes the largest raw performance gain, CADE plays a distinct and critical role in addressing superficial co-occurrence patterns through front-door adjustment, which MSTP alone cannot resolve.

| Method | Novel-Composition R1@0.5 / R1@0.7 / mIoU | Novel-Word R1@0.5 / R1@0.7 / mIoU |
| --- | --- | --- |
| CausalVTG | 56.68 / 32.71 / 49.59 | 59.28 / 34.96 / 51.12 |
| w/o CADE | 52.15 / 29.05 / 46.15 | 54.68 / 30.36 / 47.03 |

Response to Q4

To evaluate the computational overhead, we trained and evaluated all models on the QVHighlights dataset using an NVIDIA A800 GPU (80GB memory) with a batch size of 64 over 50 epochs. As summarized below, CausalVTG's inference runtime is slightly longer than simpler baselines, yet remains highly competitive.

| Method | Training GPU Memory | Training #Parameters | Training Time | Inference Time |
| --- | --- | --- | --- | --- |
| Moment-DETR [1] | 1.41 GB | 4.82 M | 9.71 min | 31 s |
| QD-DETR [11] | 1.89 GB | 7.58 M | 13.15 min | 37 s |
| CG-DETR [12] | 3.09 GB | 12.61 M | 40.05 min | 45 s |
| R2-Tuning [13] | 37.24 GB | 2.7 M | 544.33 min | 69 s |
| CausalVTG | 2.31 GB | 7.86 M | 43.21 min | 53 s |

Response to Q5

Due to rebuttal constraints, we cannot include full visualizations. As a substitute, we provide below the normalized distributions of predicted temporal intervals on the Charades-STA test set. The first table corresponds to successful cases (IoU > 0.7) and the second table corresponds to failed cases (IoU < 0.7). We commit to including full visualizations and a deeper analysis of temporal prediction distributions and feature biases in the final version.

Successful cases (IoU > 0.7):

| Start \ End | 0-0.2 | 0.2-0.4 | 0.4-0.6 | 0.6-0.8 | 0.8-1.0 |
| --- | --- | --- | --- | --- | --- |
| 0-0.2 | 0.072 | 0.317 | 0.135 | 0.007 | 0 |
| 0.2-0.4 | 0 | 0.004 | 0.094 | 0.035 | 0.003 |
| 0.4-0.6 | 0 | 0 | 0.004 | 0.066 | 0.078 |
| 0.6-0.8 | 0 | 0 | 0 | 0.001 | 0.151 |
| 0.8-1.0 | 0 | 0 | 0 | 0 | 0.027 |

Failed cases (IoU < 0.7):

| Start \ End | 0-0.2 | 0.2-0.4 | 0.4-0.6 | 0.6-0.8 | 0.8-1.0 |
| --- | --- | --- | --- | --- | --- |
| 0-0.2 | 0.067 | 0.286 | 0.111 | 0.023 | 0.002 |
| 0.2-0.4 | 0 | 0.003 | 0.085 | 0.073 | 0.018 |
| 0.4-0.6 | 0 | 0 | 0.003 | 0.065 | 0.092 |
| 0.6-0.8 | 0 | 0 | 0 | 0.003 | 0.145 |
| 0.8-1.0 | 0 | 0 | 0 | 0 | 0.023 |
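Tables of this kind can be reproduced for any model's predictions with a small histogram routine like the following sketch; names and binning choices are illustrative, and we assume spans are normalized by video duration and binned into five equal intervals.

```python
import numpy as np

def start_end_histogram(pred_spans, video_lengths, n_bins: int = 5) -> np.ndarray:
    """Normalized joint distribution of predicted (start, end) positions: each span is
    scaled by its video length and binned into an n_bins x n_bins grid
    (rows = start bin, columns = end bin).

    pred_spans:    list of (start_sec, end_sec) predictions
    video_lengths: list of matching video durations in seconds
    """
    hist = np.zeros((n_bins, n_bins))
    for (s, e), dur in zip(pred_spans, video_lengths):
        s_bin = min(int(n_bins * s / dur), n_bins - 1)
        e_bin = min(int(n_bins * e / dur), n_bins - 1)
        hist[s_bin, e_bin] += 1
    return hist / hist.sum()

# Toy usage with made-up predictions
spans = [(1.0, 9.0), (12.0, 20.0), (3.0, 25.0)]
print(start_end_histogram(spans, [30.0, 30.0, 30.0]).round(3))
```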

Response to S1

We thank the reviewer for pointing out these two key references, both of which are highly relevant to the challenges addressed in our work. We will include and discuss them in the final version.

Response to S2

We thank the reviewer for highlighting these important aspects. Regarding temporal causality, the referenced works introduce queries requiring reasoning over event dependencies rather than isolated segments. We agree this is a challenging and meaningful scenario, and we plan to explore temporal causal reasoning in future work.

For multiple occurrences, the QVHighlights dataset [1] includes annotations where queries may appear multiple times within a video. Our framework can handle such cases, as illustrated by the example visualization in the third subgraph of Figure 4 in our paper, which shows correct localization of repeated events. We will clarify this capability more explicitly in the final version.

Response to S3

We appreciate the reviewer’s suggestion to evaluate causal grounding on YouCook2. Our results demonstrate that CausalVTG outperforms prior methods by a clear margin, indicating improved modeling of causal relations.

| Method | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- |
| DORi [9] | 43.36 | 30.47 | 18.24 | 30.46 |
| LOCFORMER [10] | 46.76 | 31.33 | 15.81 | 30.92 |
| CausalVTG | 52.38 | 38.80 | 23.54 | 37.09 |

[1] "Detecting moments and highlights in videos via natural language queries." NeurIPS 2021.

[2] "Compositional temporal grounding with structured variational cross-graph correspondence learning." CVPR 2022.

[3] "Cross-modal causal relation alignment for video question grounding." CVPR 2025.

[4] "Vision-and-language navigation via causal learning." CVPR 2024.

[5] "Deconfounded video moment retrieval with causal intervention."SIGIR 2021.

[6] "A closer look at temporal sentence grounding in videos: Dataset and metric." ACM MM 2021.

[7] "Causal inference with conditional front-door adjustment and identifiable variational autoencoder." ICLR 2024.

[8] "Compositional temporal grounding with structured variational cross-graph correspondence learning." CVPR 2022.

[9] "DORi: Discovering object relationships for moment localization of a natural language query in a video." WACV 2021.

[10] "Memory-efficient temporal moment localization in long videos." EACL 2023.

[11] "Query-dependent video representation for moment retrieval and highlight detection." CVPR 2023.

[12] "Correlation-guided query-dependency calibration for video temporal grounding." arXiv preprint arXiv:2311.08835 (2023).

[13] "R2-tuning: Efficient image-to-video transfer learning for video temporal grounding." ECCV 2024.

Comment

Thank you to the authors for their thoughtful and well-articulated response. I remain positive about the paper and will maintain my original score.

Comment

Thank you again for your valuable feedback and suggestions, we sincerely appreciate your positive assessment and support.

Final Decision

This paper proposes a novel framework for video temporal grounding that explicitly incorporates causal inference into the model design, consisting of a structural causal model, modality-specific front-door adjustment, and counterfactual reasoning. Extensive experiments on five widely-used benchmarks demonstrate that it achieves state-of-the-art performance in both localization precision and query relevance detection tasks. This paper received positive ratings from all reviewers; all reviewers agree that the paper tackles an important problem, proposes an interesting approach, and shows significant improvement over previous methods. The rebuttal also addresses most of the reviewers’ concerns, e.g., missing comparisons and analyses. The AC also appreciates the scientific contribution of the novel causal inference framework for video temporal grounding, and thus recommends accepting this work.

Public Comment

I'm delighted to see your interesting work and would like to follow it. Could you please tell me how to obtain the video and text features extracted by CLIP in InternVideo2 that are used in the paper? Also, I have another question: does the paper report experimental results using video features (I3D, C3D) and GloVe text features, compared to the baselines?