HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for History-Driven Video Highlighting
We introduce a large-scale dataset for personalized video highlighting by simulating user watch history and generating segment-wise saliency scores, enabling more user-centric video summarization.
Abstract
Reviews and Discussion
This paper proposes HIPPO-VIDEO, a synthetic dataset for personalized video highlight detection, generated using a multi-stage LLM-based pipeline that simulates user watch histories. The authors introduce HiPHer, a personalized highlight prediction method that leverages these simulated histories. The dataset consists of over 20,000 videos, and the method is evaluated on several downstream tasks.
Reasons to Accept
The paper addresses a novel and underexplored task—personalized video highlighting—while proposing a scalable, privacy-preserving approach to building such datasets.
The staged LLM-based simulation (involving history generation, engagement prediction, and preference updating) is well-motivated and reflects a creative use of generative models to bootstrap training data.
The paper evaluates the proposed method (HiPHer) across multiple tasks (highlight detection, moment retrieval, video summarization), and includes both quantitative benchmarks and human studies.
Reasons to Reject
While the simulation framework is novel, the entire process—from user history generation to highlight extraction—is carried out using LLMs without grounding in real user data. This raises concerns about the reliability and ecological validity of the synthetic histories. The lack of comparison with actual user histories—perhaps even on a small scale—leaves the realism of the generated behaviors unverified.
Human verification is only conducted on 5 examples per condition with 3 annotators per example. This limited scale reduces confidence in the generalizability of the reported findings.
In Table 2, 17.95% of the simulated histories received at least one “Disagree” label from human annotators. However, the authors still categorize this subset under “Agree” in their summary analysis. This aggregation likely overstates the success of the simulation pipeline and underrepresents failure cases.
Questions to Authors
This paper presents a creative and promising direction for data generation in personalized video understanding. However, several issues—especially around over-reliance on LLMs and limited human validation—undermine confidence in the robustness and generalizability of the proposed dataset.
Thank you for your thoughtful and constructive feedback. Your comments helped us clarify and strengthen several key aspects of our work.
The realism of simulated user histories
- We understand the concern regarding the ecological validity of simulated user histories. To address this, we have conducted additional experiments using real user watch histories and included the results in the official comment. (Will be updated soon)
Human verification scale
- We would like to clarify that human verification was conducted on the entire set of simulated watch histories, not just a small sample. Each instance was independently reviewed by three annotators, ensuring comprehensive validation across all 2040 sessions.
Clarification for Results in Table 2
- Consistent with Sul et al. (2023), we treat both “Agree” and “Neutral” judgments as acceptable, reflecting the inherent subjectivity of personalized highlight evaluation. Following prior work, we report outcomes based on the majority vote per case. Importantly, in 97.44% of cases, the majority of annotators did not select “Disagree,” which supports the plausibility of the simulation. We acknowledge that this table is hard to parse, so we will provide more details in the final version (a small illustrative sketch of the aggregation follows the reference below).
Sul, Jinhwan, Jihoon Han, and Joonseok Lee. "Mr. hisum: A large-scale dataset for video highlight detection and summarization." Advances in Neural Information Processing Systems 36 (2023): 40542-40555.
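For clarity, here is a minimal illustrative sketch (not our actual evaluation script) of how per-case majority judgments can be computed from three annotator labels, treating “Agree” and “Neutral” as acceptable; the sample labels below are hypothetical.

```python
from collections import Counter

# Illustrative per-case labels from three annotators; the real study covers
# all 2,040 simulated sessions.
cases = [
    ["Agree", "Agree", "Neutral"],
    ["Agree", "Disagree", "Agree"],
    ["Disagree", "Disagree", "Neutral"],
]

def acceptable_by_majority(labels):
    # A case counts as acceptable when "Disagree" is not the majority label,
    # i.e., fewer than two of the three annotators chose "Disagree".
    return Counter(labels)["Disagree"] < 2

rate = sum(acceptable_by_majority(c) for c in cases) / len(cases)
print(f"Acceptable by majority vote: {rate:.2%}")
```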
Thank you for dedicating your time and effort to reviewing our work. Your insightful comments and suggestions have been crucial in enhancing the quality and clarity of our paper. We hope our explanations and newly added results sufficiently address your concerns. If there are any remaining questions or points that need further clarification, we would be happy to address them.
I appreciate the authors’ efforts and find the core idea of the paper interesting. However, I still have some concerns regarding the simulated trajectories. Overall, I find the paper borderline acceptable, but I would not be strongly opposed to its rejection.
This paper introduces HIPPO-VIDEO, a novel dataset for personalized video highlighting, generated using an LLM-based user simulator to create realistic watch histories reflecting diverse preferences. The dataset comprises 2,040 (watch history, saliency score) pairs across 170 categories. In addition, the authors propose HiPHer, which leverages watch histories to predict segment-wise saliency scores. Experiments show HiPHer outperforms generic and query-based baselines on the collected dataset.
Reasons to Accept
- This paper is well-written, and the research task is very interesting. The idea of using LLMs to efficiently simulate human behavior for automated dataset construction is also great.
- The authors checked the authenticity of their dataset through human evaluations. Additionally, they conducted some visual analyses and distribution comparisons.
- The proposed HiPHer successfully leverages multimodal features to assign scores to video segments.
Reasons to Reject
- While the authors have conducted human evaluations, I believe a machine-simulated test set alone cannot fully assess the model’s performance. I suggest the authors at least incorporate real-world user data or established public benchmarks in their testing phase to validate the model’s generalizability beyond simulated scenarios.
- The performance evaluation section is relatively weak. Firstly, the authors only compare their method with QD-DETR, a 2023 approach. They should conduct a broader literature review to identify, implement, and compare against additional state-of-the-art methods (e.g., recent 2024 approaches) to provide a more comprehensive benchmark. Additionally, evaluating HiPHer solely on their proprietary test set may overstate its effectiveness. To rigorously validate the proposed method, I recommend testing it on the diverse datasets enumerated in Table 1 (e.g., YouTubeHighlights) to demonstrate cross-dataset robustness. Furthermore, critical performance aspects such as time efficiency and computational complexity, which are essential for real-world deployment, remain unaddressed. The paper lacks analysis of how the model scales with larger inputs or handles real-time inference, which is vital for practical applications.
- Key ablation studies are missing, such as the impact of the hyperparameter γ (gamma) in Equation 2. Varying γ could significantly affect the contrastive loss and model behavior, but the authors do not explore its influence on performance.
Questions to Authors
See the reasons to reject.
Thank you for your thoughtful and constructive feedback. Your comments offered a useful perspective in assessing the robustness of our approach.
The lack of recent baselines
- We conducted additional experiments with two recent approaches: TR-DETR (Sun et al., 2024) and UVCOM (Xiao et al., 2024). TR-DETR introduces a task-cooperation module that jointly enhances Moment Retrieval (MR) and Highlight Detection (HD) via mutual guidance, while UVCOM proposes a unified architecture that integrates local relation modeling (for MR) and global context understanding (for HD) using a multi-granularity module and contrastive learning. Although both models outperform earlier baselines, HiPHer still achieves superior results across key metrics, underscoring the benefit of incorporating user watch histories for personalized video highlighting. We will include these results and analysis in the final version to provide a more comprehensive benchmark.
| Method | Training Data | Hit@1 (≥7) | Hit@1 (≥9) | Recall@1 (0.5) | Recall@1 (0.7) |
|---|---|---|---|---|---|
| UVCOM | QVHighlights | 0.489 | 0.149 | 0.413 | 0.183 |
| UVCOM | HIPPO-VIDEO | 0.441 | 0.146 | 0.357 | 0.154 |
| TR-DETR | QVHighlights | 0.352 | 0.105 | 0.359 | 0.195 |
| TR-DETR | HIPPO-VIDEO | 0.435 | 0.149 | 0.243 | 0.127 |
| HiPHer | HIPPO-VIDEO | 0.507 | 0.166 | 0.452 | 0.245 |
Sun, Hao, et al. "Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 5. 2024.
Xiao, Yicheng, et al. "Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Time efficiency and complexity
- To address the reviewer’s concern about scalability and efficiency, we provide a breakdown of HiPHer’s computational complexity, using the following notation:
- n: number of segments in each video (we assume all videos have the same number of segments for computational convenience),
- m: number of history videos (10 in our setup),
- n̄: average number of segments per history video,
- C: cost of processing one segment with CLIP (vision + text),
- d: embedding dimension,
- L: transformer depth,
- g: number of GPUs (we use 4× NVIDIA RTX A6000).
- The time complexity breaks down as follows: target video encoding requires n × C, and the transformer-based scoring introduces L × n² × d + n × d² (from self-attention and projection). History video encoding costs (m × n̄ × C) / g when parallelized across g GPUs. The total cost is therefore n × C + L × n² × d + n × d² + (m × n̄ × C) / g.
- In practice, approximately 96% of the inference time is spent on CLIP-based segment encoding, which is common across many video-language pipelines, while HiPHer’s transformer module accounts for only 4% of the total latency. Moreover, if the history embeddings are pre-computed and cached, the history encoding cost is eliminated entirely.
- HiPHer can support near real-time inference with minimal delay, because segment encoding is frame-local and fully parallelizable. With history embeddings pre-computed, it can incrementally process incoming segments, making it suitable for real-world applications.
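- As an illustration only, the following sketch instantiates the cost breakdown above and shows how caching pre-computed history embeddings removes the history-encoding term; all numeric values are placeholder assumptions, not measured figures.

```python
def estimate_inference_cost(n, m, n_bar, C, d, L, g, history_cached=False):
    """Rough per-video cost model following the breakdown above.

    n: segments in the target video, m: history videos, n_bar: avg. segments
    per history video, C: cost of one CLIP segment encoding, d: embedding dim,
    L: transformer depth, g: number of GPUs. All values are illustrative.
    """
    target_encoding = n * C                      # CLIP encoding of target segments
    scoring = L * n**2 * d + n * d**2            # self-attention + projection
    history_encoding = 0 if history_cached else (m * n_bar * C) / g
    return target_encoding + scoring + history_encoding

# Placeholder numbers purely for illustration.
cost_cold = estimate_inference_cost(n=40, m=10, n_bar=40, C=1e9, d=512, L=4, g=4)
cost_cached = estimate_inference_cost(n=40, m=10, n_bar=40, C=1e9, d=512, L=4, g=4,
                                      history_cached=True)
print(f"Caching history embeddings cuts cost by {1 - cost_cached / cost_cold:.0%}")
```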
Ablation on saliency margin γ
- While we initially set γ = 1 following prior work (Lei et al., 2021; Moon et al., 2023), we conducted an ablation study to assess its impact on HiPHer’s performance. We observed that smaller margins (γ = 0.1–0.2) consistently yield better results, as they enable finer-grained preference modeling. In contrast, larger margins tend to encourage overconfident separation between segments, which reduces generalization. We will include these ablation results in the final version to clarify the impact of γ (a minimal illustrative sketch of such a margin loss is given after the table below).
Lei, Jie, Tamara L. Berg, and Mohit Bansal. "Detecting moments and highlights in videos via natural language queries." Advances in Neural Information Processing Systems 34 (2021): 11846-11858.
Moon, WonJun, et al. "Query-dependent video representation for moment retrieval and highlight detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
| γ | Hit@1 (≥9) | Recall@1 (0.5) | Recall@1 (0.7) |
|---|---|---|---|
| 0 | 0.175 | 0.428 | 0.217 |
| 0.1 | 0.157 | 0.455 | 0.239 |
| 0.2 | 0.197 | 0.445 | 0.23 |
| 0.5 | 0.153 | 0.453 | 0.217 |
| 1 | 0.162 | 0.357 | 0.172 |
| 2 | 0.177 | 0.42 | 0.204 |
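As referenced above, here is a minimal PyTorch-style sketch of a pairwise margin (ranking) loss with margin γ; the exact form of Equation 2 in the paper may differ, so the formulation and scores below are illustrative assumptions rather than our implementation.

```python
import torch

def margin_ranking_loss(pos_scores, neg_scores, gamma=0.1):
    """Pairwise margin loss: preferred segments should score higher than
    non-preferred ones by at least `gamma` (hinge on the score difference)."""
    return torch.clamp(gamma - (pos_scores - neg_scores), min=0).mean()

# Illustrative saliency scores for preferred vs. non-preferred segments.
pos = torch.tensor([0.62, 0.71, 0.55])
neg = torch.tensor([0.48, 0.69, 0.20])
for g in (0.1, 0.5, 1.0):
    print(f"gamma={g}: loss={margin_ranking_loss(pos, neg, gamma=g).item():.3f}")
```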
Limitations of Using Simulated Test Sets
- We appreciate the reviewer’s concern regarding the limitations of using a machine-simulated test set. To address this, we have conducted additional experiments using real user watch histories and included the results in the official comment. (Will be updated soon)
Cross-dataset generalization
- HiPHer is specifically designed to model user preferences based on watch history, which requires a multi-video, history-conditioned setup. As current datasets listed in Table 1 lack this structure, evaluation on HIPPO-VIDEO was necessary to assess our model for history-driven video highlighting.
- However, we acknowledge that evaluating HiPHer on diverse datasets is needed. We therefore plan to split HIPPO-VIDEO into two non-overlapping subsets and perform cross-set evaluation to examine generalization. We will include these results within the remainder of the rebuttal period.
About cross-dataset generalization
- To demonstrate that HiPHer’s performance is not limited to a single proprietary split, we conducted cross‐subset evaluations on HIPPO-VIDEO by dividing it into three non-overlapping subsets (A, B, C). This setup emulates a cross-dataset evaluation scenario while retaining the history-conditioned structure required by HiPHer.
- The results show that HiPHer achieves comparable performance across subsets it was not trained on, suggesting that the model does not rely on subset-specific patterns. Instead, it captures generalizable behaviors that transfer across different portions of the dataset.
- This internal cross-subset evaluation thus serves as a practical proxy for assessing the generalization ability of HiPHer in the absence of publicly available history-conditioned datasets.
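- For reference, the cross-subset protocol can be sketched as follows; the `train_hipher` and `evaluate` helpers are placeholders rather than our released code, and the actual results follow in the tables below.

```python
import random

def train_hipher(split):
    """Stand-in for training HiPHer on one subset (not the released code)."""
    return {"trained_on": split}

def evaluate(model, split):
    """Stand-in evaluation returning a dummy Recall@1(0.5) value."""
    return {"Recall@1_0.5": round(random.uniform(0.30, 0.41), 3)}

subsets = ["A", "B", "C"]
recall_matrix = {}
for tr in subsets:                                   # train on one subset only
    model = train_hipher(split=f"train_{tr}")
    for te in subsets:                               # evaluate on every subset
        recall_matrix[(tr, te)] = evaluate(model, split=f"test_{te}")["Recall@1_0.5"]

print(recall_matrix)  # mirrors the 3x3 layout of the tables below
```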
| Recall@1 (0.5) | test_A | test_B | test_C |
|---|---|---|---|
| train_A | 0.358 | 0.381 | 0.379 |
| train_B | 0.304 | 0.343 | 0.336 |
| train_C | 0.357 | 0.380 | 0.404 |
| Recall@1 (0.7) | test_A | test_B | test_C |
|---|---|---|---|
| train_A | 0.174 | 0.187 | 0.214 |
| train_B | 0.149 | 0.139 | 0.174 |
| train_C | 0.188 | 0.170 | 0.186 |
| Hit@1 (≥7) | test_A | test_B | test_C |
|---|---|---|---|
| train_A | 0.411 | 0.525 | 0.566 |
| train_B | 0.467 | 0.559 | 0.496 |
| train_C | 0.458 | 0.568 | 0.442 |
| Hit@1 (≥9) | test_A | test_B | test_C |
|---|---|---|---|
| train_A | 0.150 | 0.195 | 0.256 |
| train_B | 0.140 | 0.169 | 0.155 |
| train_C | 0.150 | 0.212 | 0.147 |
Thank you for dedicating your time and effort to reviewing our work. Your insightful comments and suggestions have been crucial in enhancing the quality and clarity of our paper. We hope our explanations and newly added results sufficiently address your concerns. If there are any remaining questions or points that need further clarification, we would be happy to address them.
Thank you again for your valuable feedback. As the discussion period comes to an end, please feel free to let us know if any further questions or concerns arise. We would greatly appreciate the opportunity to address them and further improve our work.
Considering that the authors have tried their best to answer my concerns, and added experiments missed in the original submission, I would like to raise my score.
This paper presents a dataset of personalized video highlights constructed via simulations powered by LLMs. The dataset is useful in tasks such as query-based video and segment retrieval/recommendation, and summarization. The simulation essentially consists of 4 steps:
(1) Seeds are initialized as written "profiles" of preferences, following a set of 170 "personalities" fully described in text, extracted from previous work. (2) A video is selected by asking an LLM to match a given profile against a representation of the video context, in text form, which emulates what is available to YouTube viewers. (3) The video is consumed by the LLM: the video is split into segments, then frames are extracted from these and fed to a captioner (essentially an LVLM), to finally obtain a purely text-based description of the video. (4) The profile is updated by asking an LLM to combine the viewing experience with the current profile. This is fundamentally based on LLM-generated explanations of the choices of videos made before, and a review of the video that was "watched" (based on its text-based representation).
Once collected, the data is used to create a dataset of video saliency, i.e., per-profile scene preferences. This is done by asking an LLM to assign relevance scores to each video segment for a given profile description and LLM-based reviews of each video. The resulting dataset essentially introduces the novel task of "personalized video recommendation".
A model is then proposed to tackle this new task, with results including relevant comparisons of the model on the related tasks of highlight recommendation and video summarization. The proposed model outperforms previous models (adapted, where necessary) on this new dataset.
Overall, I like this paper and I think it shows a promising and seemingly effective mechanism to generate what is essentially multimodal data for training LLMs without any (or little) human supervision. Although I think there are some bias issues in allowing such data to be purely generated by LLMs, this paper provides evidence showing a reasonable amount of agreement with human judgment for both watch history simulation and saliency annotation.
On the model side, I also think the approach proposed by this paper, including the empirical framework, is pretty solid and includes relevant baselines and datasets (there are some omissions here, like ActivityNet and YouCookII, but they probably will not make a large difference in the overall story).
Reasons to Accept
- I'm excited by the possibility of having LLMs interact with videos as a viable mechanism to generate data valuable for training multimodal models. The proposed approach, which essentially decomposes videos into chunks and describes them purely via text as a means to simulate "video watching", shows that even at this arguably rudimentary level, we can obtain reliable signals that can lead to better models down the road. These are valuable findings, so I think this paper can have an important impact on our community.
- The proposed dataset is relevant and advances research in query-based video and scene recommendation in a meaningful direction.
- The proposed empirical study is sound, containing relevant datasets and ablation studies. Critical data creation points are studied in terms of their agreement with human behavior, showing positive results.
- The paper is generally clear and well written (with some exceptions; please see below).
Reasons to Reject
- One concern I have is the lack of more practical details regarding the creation of the dataset. For starters, it's unclear which specific LLM was involved in the creation of the dataset (please see my corresponding question below). Moreover, we know that working with LLMs carries some uncertainty due to their stochastic nature, which should lead to hiccups in the proposed loop of data creation/collection. Of course, I may be wrong here, and it's possible that LLMs excel at the tasks, especially if the prompts are adequately tuned. However, I would like this paper to be a bit more upfront about the reality of how the data collection worked, to offer a more nuanced perspective on the applicability of this technique in other video-watching simulation scenarios.
- A minor concern, and not a strong reason for rejection, but this point does take away from the paper: I found several figures and tables difficult to parse and read. Please see my questions/comments below.
Questions to Authors
- One thing that was not clear throughout the paper is which model(s) was/were used to generate data. Could the authors provide more details on this?
- Related to my question above, I'm also wondering if it would be possible to provide the reader with an estimate of the effort spent to create the dataset, both in terms of approximate total number of queries (and/or budget in terms of compute/funds) and total time (e.g., in man-hours).
- Along the same lines, I expect models to not always follow the provided instructions, for example by not complying with the output format stipulated in the prompt, which should introduce noise. Even at a 99% success rate, I think this means that some output will have to be re-processed. In this context, I would like to get a sense of how streamlined the data creation process was. Was the process smooth, or did it often require revisiting steps and/or performing data cleaning? Any insights will not only help me better understand the contribution, but also allow potential readers of the paper to fully grasp what this data generation pipeline involves.
- Figure 5 shows that performance of the proposed approach grows steadily as more videos in the history are allowed to be used by the model. In particular, R1@0.5 shows a dramatic increase from n=7 to n=10. Would the authors be able to comment on what this may indicate? Also, what do you think the gains will be for your model if the watch history context continues to grow?
- I find Table 4 a bit deceiving. I think the table at least partially intends to show how performance on the summarization task changes as the amount and type of textual information grows. In order to do that, models that use different textual contexts are compared. However, it is not clear why these baselines are comparable, how they differ, and how they relate to the proposed approach (line 256 mentions some points, but this is not sufficient for me). If the purpose is to show the impact of different textual contexts, it would make sense to me to present a proper ablation study with variations of the proposed approach. Would it be possible to test such approaches, for example, by asking an LLM to generate non-personalized sentence- and word-level queries?
Also, some comments (pasting them here for lack of a more adequate place):
- I think Figure 1 is far from great; it really didn't help me better understand the overall proposed approach, so I would encourage a revision. Perhaps a more workflow-based chart would help here?
- Figure 3 also requires a rethink, in my opinion. Perhaps a more informative caption could help the reader parse what is shown in context?
- Table 2 was super hard for me to parse. Even after reading the part of the paper where the table is mentioned, I am not sure I understand the contents clearly. I would recommend a revision of the inter-annotation agreement study here, for clarity.
- In Table 3, could you clarify how the "Improv." score is computed? Concretely, with respect to what value is this score computed?
- If you use the `booktabs` package, make sure you respect its design patterns and remove vertical lines in tables.
Thank you for your thoughtful and constructive feedback. We found your comments very helpful in improving and clarifying our work.
More practical details regarding the creation of the dataset
- We used gpt-4o-mini as the backbone model for simulating user behavior, including watch history generation, preference modeling, and saliency annotation.
- Regarding dataset creation cost and effort, we provide below the average per-session token usage and associated LLM costs, aggregated across all 2,040 simulation sessions. The total LLM cost was modest, amounting to less than $0.08 per session on average.

| Stage | Input Tokens | Output Tokens | Input Cost ($) | Output Cost ($) | Total Cost ($) |
| :---- | ----- | ----- | ----- | ----- | ----- |
| Video understanding | 21088 | 1234 | 0.01265 | 0.00296 | 0.01561 |
| Video selection | 45552 | 1448 | 0.02733 | 0.00348 | 0.03081 |
| Preference update | 18683 | 2393 | 0.01121 | 0.00574 | 0.01695 |
| Action decision | 23141 | 185 | 0.01388 | 0.00044 | 0.01432 |
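- As a rough illustration of how these per-session figures translate into cost, the snippet below multiplies token counts by per-1M-token prices; the prices are assumptions chosen only to roughly reproduce the table above, not necessarily the exact billed rates.

```python
# Assumed per-1M-token prices (illustrative; substitute the actual rates).
PRICE_IN, PRICE_OUT = 0.60, 2.40  # USD per 1M input / output tokens

stage_tokens = {  # avg. (input, output) tokens per session, from the table above
    "Video understanding": (21088, 1234),
    "Video selection": (45552, 1448),
    "Preference update": (18683, 2393),
    "Action decision": (23141, 185),
}

per_session = sum(i * PRICE_IN / 1e6 + o * PRICE_OUT / 1e6
                  for i, o in stage_tokens.values())
print(f"~${per_session:.3f} per session, ~${per_session * 2040:.0f} for 2,040 sessions")
```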
Figure 5: Effect of watch history length on preference modeling
- In Figure 5, the sharp increase in R1@0.5 at m = 10 (where m is the number of history videos) suggests that the model begins to capture a more stable representation of user preferences. This aligns with our simulator’s design, where a history length of 10 constitutes the full context for modeling preference.
- HiPHer’s preference modeling module relies on identifying recurring patterns across the watch history. Longer histories help capture consistent user behaviors while shorter histories may include noisy preferences.
Clarifying the intent of Table 4
- Our intent with Table 4 was to compare HiPHer against prior methods that claim to achieve personalization using sentence- or word-level textual inputs (CLIP-It and VSL). These were not intended as ablations, but as external baselines to highlight that such query forms lack the depth of user preference modeling provided by watch history.
Clarifications for Figures and Tables
- Thank you for the valuable suggestions. We fully agree that clearer visualizations and more precise table formatting are essential. We will revise the following elements accordingly in the final version to improve clarity and readability:
- Figure 1 was designed to illustrate the motivation behind our task: video highlights should adapt to user preferences inferred from watch history.
- Figure 3 is intended to illustrate the diversity and saliency score characteristics of HIPPO-VIDEO, but we agree the caption can be improved to help readers interpret it more easily, for example:
- (a) exploration ratio within individual watch histories,
- (b) inter-history preference diversity visualized via t-SNE, and
- (c) distribution of saliency scores across target videos.
- Table 2: We agree that Table 2 could benefit from clearer presentation. In the final version, we will revise the table format and accompanying explanation to make the inter-annotator agreement results easier to interpret. Specifically, we will clarify the meaning of each row, the labeling scheme (A: Agree, U: Unclear, D: Disagree), and how majority judgments were used in the summary analysis.
- Table 3: “Improv.” refers to the relative improvement of HiPHer over each baseline in Hit@1 (≥ 9), as noted in the table caption.
- We sincerely appreciate the reviewer’s constructive comments and will reflect these improvements in the final version to enhance the clarity of the paper.
Thank you for taking the time to address the concerns I raised in my review. Below, I'm sharing a few quick comments regarding some points made.
More practical details regarding the creation of the dataset
Please make sure to incorporate these details in the next version of the paper. I think talking about the actual effort that went into creating the dataset is crucial for enabling reproducibility and letting other researchers know what really goes into this kind of experiment.
Clarifying the intent of Table 4
I understand this point, and would request the authors to please clarify this in future versions of the paper.
Finally, my overall impression and assessment of this paper, which leans clearly towards acceptance, did not change significantly after reading the rebuttal. I think this is a neat and well-executed idea, and hence I will be keeping my score.
Realism and Reliability of LLM-Simulated Watch Histories in Hippo-Video
- While real user data is ideal, it is limited by privacy and platform constraints. Our simulation framework offers a scalable, privacy-preserving alternative by modeling user behavior through multi-step interactions and dynamic preferences. We validated its reliability via human evaluations on video selection and query generation (Section 3.3), with strong alignment to human choices at the single-turn level. To further address concerns, we collected 40 real user histories (10 videos each) with user consent, using the profile seed setup from Section 3.1, and conducted several experiments to compare them with our simulated histories.
- Following recent LLM-as-judge approaches (Chiang et al., 2024; Luo et al., 2024), we used GPT-4 to distinguish between simulated and real histories generated from the same initial profile seed. Across the real–simulated pairs, GPT-4 achieved an accuracy of 0.4, which is below the 0.5 baseline expected from random guessing, indicating that it often failed to effectively differentiate simulated histories from real ones.
- Furthermore, we employed Fast-DetectGPT (Bao et al., 2023), which flags machine generation by comparing the log-probability of the original input against its perturbed variants. It reports an AUROC score quantifying the separability of the two distributions: an AUROC of 1.0 indicates perfect separability, while 0.5 suggests no meaningful difference. Using real histories as the positive class, Fast-DetectGPT achieved an AUROC of 0.768 (95% CI: 0.66–0.86), suggesting that the simulated histories are not easily distinguishable from real ones.
- Additionally, we evaluated Hit@1, where Fast-DetectGPT was tasked with identifying a simulated history among k real ones. A lower Hit@1 indicates higher confusion between real and simulated data. As shown in the table below, Hit@1 remained below 0.350, further supporting the similarity between simulated and real histories.
| k | Hit@1 |
|---|---|
| 9 | 0.350 |
| 19 | 0.186 |
| 29 | 0.126 |
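To clarify this protocol, the sketch below runs the Hit@1 evaluation with a placeholder scoring function standing in for Fast-DetectGPT; the histories and scores are illustrative, not our actual data.

```python
import random

def detector_score(text):
    # Stand-in for Fast-DetectGPT's machine-generation score; a random value
    # carries no signal, so Hit@1 tends toward the chance level 1 / (k + 1).
    return random.random()

def hit_at_1(simulated, real_pool, k):
    """One trial: mix 1 simulated history with k real ones and check whether
    the detector ranks the simulated one as the most machine-like."""
    candidates = [("sim", simulated)] + [("real", r) for r in random.sample(real_pool, k)]
    label, _ = max(candidates, key=lambda c: detector_score(c[1]))
    return label == "sim"

real_histories = [f"real history {i}" for i in range(40)]
for k in (9, 19, 29):
    hits = sum(hit_at_1("simulated history", real_histories, k) for _ in range(2000))
    print(f"k={k}: Hit@1 ≈ {hits / 2000:.3f}")
```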
Chiang, Wei-Lin, et al. "Chatbot arena: An open platform for evaluating llms by human preference." Forty-first International Conference on Machine Learning. 2024.
Luo, Ziyang, et al. "VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation." arXiv preprint arXiv:2411.13281 (2024).
Bao, Guangsheng, et al. "Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature." The Twelfth International Conference on Learning Representations, 2023.
Mitchell, Eric, et al. "Detectgpt: Zero-shot machine-generated text detection using probability curvature." International Conference on Machine Learning. PMLR, 2023.
We further evaluated all methods on a newly collected set of real user watch histories. As in the simulation process, the last video in each history was set as the target, and real users assigned 1-to-10 preference scores to every clip of the target video. All models were trained solely on the simulated HIPPO-VIDEO data, and we used the same experimental setup described in Section 5.1. Despite its small scale, this evaluation still provides an initial view of generalization: HiPHer outperforms the baselines on most metrics, indicating encouraging robustness to real user data.
| Method | RMSE | Hit@1 (≥7) | Hit@1 (≥9) | F1@0.5 | F1@0.7 |
|---|---|---|---|---|---|
| Moment-DETR | 0.419 | 0.472 | 0.389 | 0.417 | 0.273 |
| QD-DETR | 0.446 | 0.444 | 0.361 | 0.385 | 0.266 |
| TR-DETR | 0.443 | 0.306 | 0.250 | 0.429 | 0.275 |
| HiPHer | 0.427 | 0.486 | 0.400 | 0.624 | 0.319 |
We sincerely thank all reviewers for their valuable insights and helpful suggestions. Your comments motivated us to conduct additional analyses and experiments, which we found to be very insightful. We will incorporate further clarifications, discussions, and these new results into the final version of the paper.
Summary
This paper presents HIPPO-VIDEO, a large and diverse dataset generated through LLM-based user simulation. HIPPO-VIDEO supports various tasks such as video summarization, moment retrieval, highlight detection, and personalized video highlighting. The authors also propose HiPHer, a preference-driven approach that shows potential in enhancing user-centric video highlighting by incorporating detailed user histories. The experimental results emphasize the importance of personalized preferences in video content delivery, outperforming existing generic and query-based approaches.
Pros
- Using LLMs with sequential viewing history to simulate user preference is novel. The overall idea is new and has not been explored in previous research.
- This research highlights the importance of integrating multimodal information and user-specific preferences for personalized video content delivery.
Cons
- The main concern is the gap between the real-world user data and LLM-based synthetic data (xrxV, 6S7Q). In the rebuttal phase, the authors provide additional results about the realism and reliability of LLM-simulated watch histories.
- Since the proposed personalized video highlighting task is new, some technical definitions, implementation details, and evaluation details are not clearly described (jWne, xrxV, 6S7Q). The authors provide additional results in their rebuttal response and promise to revise the manuscript accordingly.