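PaperHub
Rating: 6.3/10 (withdrawn, 4 reviewers)
Lowest: 5, highest: 8, standard deviation: 1.1
Individual ratings: 6, 5, 8, 6
Confidence: 4.3
Correctness: 2.5
Contribution: 3.3
Presentation: 3.0
ICLR 2025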

Calibrating Video Watch-time Predictions with Credible Prototype Alignment

Submitted: 2024-09-26, updated: 2025-01-23

Abstract

Keywords
Prototype learning, optimal transport, recommendation

Reviews and Discussion

Review
Rating: 6

This paper introduces a novel two-stage method named ProWTP for predicting user watch-time in video recommendation systems. The method enhances prediction accuracy by aligning label distributions with instance representation distributions through a combination of prototype learning and optimal transport, specifically using a Hierarchical Vector Quantised Variational AutoEncoder and Semi-relaxed Unbalanced Optimal Transport. The innovation lies in calibrating the instance space by addressing the natural distribution properties of labels and the instance representation confusion, which are often overlooked in existing methods. Experiments on two datasets illustrate the effectiveness of the proposed ProWTP.

Strengths

S1. The research question is quite interesting and meaningful for video recommendation.

S2. The paper accurately identifies the issues of ignoring label distribution properties and instance representation confusion in watch-time prediction methods.

S3. ProWTP offers an innovative approach to calibrating watch-time predictions by combining prototype learning and optimal transport, addressing key issues in existing methods.

Weaknesses

W1. The research question is quite good. However, the proposed method, which combines a VAE with OT, seems too heavy to deploy in the real world. The authors should discuss the necessity of these complex designs in detail. Furthermore, a complexity analysis is lacking, although it is necessary here.

W2. While performing well on two datasets, the paper does not discuss the method's generalizability to different types of recommendation systems or varying lengths of video content.

W3. The compared baselines are a bit out of date. The authors should compare with more recent SOTA recommendation models and VAE-based models.

W4. The paper lacks discussion on how to interpret prototypes and their relationship with user behavior, which may limit understanding of model predictions.

Questions

Q1: What is the inference efficiency of the proposed model, compared with baselines?

Q2: When user behavior patterns change, how can the model be effectively updated and adjusted? Are there strategies for online learning or incremental learning?

Q3: The discussion of the impact of the number of prototypes is insufficient.

Details of Ethics Concerns

N/A

Review
Rating: 5

The paper introduces ProWTP, a two-stage method that combines prototype learning and optimal transport (OT) to improve watch-time prediction in video recommender systems. ProWTP uses a hierarchical vector quantized variational autoencoder (HVQ-VAE) to convert continuous labels into discrete prototypes, aligning the label distribution with the instance representation distribution via OT to enhance prediction accuracy. Offline experiments on two industrial datasets demonstrate ProWTP's superior performance.
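The quantization step mentioned in this summary can be illustrated with a minimal sketch. It shows a single-level nearest-codebook lookup, which is the basic operation of a VQ-VAE; it is not the hierarchical HVQ-VAE from the paper, and the function name `quantize`, the squared Euclidean distance, and the codebook values are assumptions chosen purely for illustration.

```python
def quantize(z, codebook):
    """Map a continuous vector z to its nearest codebook entry (prototype).

    Single-level vector quantization only; the hierarchical variant
    described in the paper is not reproduced here.
    """
    def sq_dist(u, v):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three prototypes (values are placeholders).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
index, prototype = quantize([0.9, 1.2], codebook)
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; here it is fixed by hand so the lookup itself is easy to follow.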

Strengths

Watch time prediction is a critical problem in video recommender systems. This paper presents a novel approach to solving the watch time prediction problem through prototype learning and optimal transport. In summary, the paper has the following main strengths:

  • The paper is the first to leverage optimal transport techniques to enhance watch time prediction.
  • The paper investigates the multimodal distribution properties of watch-ratio, thereby generating credible prototype vectors.
  • The paper identifies the representation confusion issue and proposes the use of prototype learning to mitigate it.

Weaknesses

I think the paper has the following major limitations that could be addressed to further improve its quality:

  1. The proposed approach is novel and has not previously been used for watch-time prediction. However, the paper lacks a clear motivation for why such an approach is necessary. For example, in existing work such as D2Q, the authors highlight the duration bias problem, which motivates their approach. In contrast, this paper states, 'However, those methods struggle to consistently maintain high predictive accuracy across different models,' but the proposed approach seems to address a different problem (duration bias versus distribution alignment). The authors should strengthen the motivation for their approach.
  2. The paper lacks a comprehensive review of existing research on watch-time prediction and misses the important citations listed below. Specifically, the authors claim, 'We investigate the multimodal distribution properties of watch-ratio across different video duration buckets for the first time,' but the bimodal nature of the distribution in KuaiRand was already studied in Zhao et al., 2023.
    • Zhao et al., 2023, Uncovering User Interest from Biased and Noised Watch Time in Video Recommendation
    • Zheng et al., 2022. DVR: Micro-Video Recommendation Optimizing Watch-Time-Gain under Duration Bias
    • Zhao et al., 2024, Counteracting Duration Bias in Video Recommendation via Counterfactual Watch Time
  3. It is not clear why aligning the feature space with the obtained prototype vectors via optimal transport helps with watch-time prediction. Are there any simpler methods that could achieve similar results? The authors should clarify why optimal transport was chosen and how it specifically benefits the prediction task.
  4. The proposed method includes three separate loss functions in Equations 9, 11, and 12. However, it is not clear how these losses are combined for final training. Additionally, it is unclear why the final task loss uses prototype vectors as predictions instead of instance representations h. The authors should provide more clarity on these aspects.
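As background for weakness 3, the following is a minimal sketch of how a transport plan between instance representations and prototypes can be computed. It uses plain balanced entropic OT solved with Sinkhorn iterations, not the semi-relaxed unbalanced formulation (SUOT) that the paper is described as using. The function name `sinkhorn_plan`, the squared Euclidean cost, the uniform marginals, the regularization strength `eps`, and the toy points are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(instances, prototypes, eps=0.5, n_iters=200):
    """Balanced entropic OT between two point sets with uniform marginals.

    Illustrative only: SUOT relaxes a marginal constraint, which this
    sketch does not do.
    """
    n, m = len(instances), len(prototypes)
    # Squared Euclidean cost between every instance and every prototype.
    cost = [[sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
            for x in instances]
    # Gibbs kernel K = exp(-C / eps).
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on instances (assumption)
    b = [1.0 / m] * m  # uniform mass on prototypes (assumption)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 2-D points: two instances near each of two hypothetical prototypes.
instances = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
plan = sinkhorn_plan(instances, prototypes)
# Assign each instance to the prototype that receives most of its mass.
assignment = [max(range(len(prototypes)), key=lambda j: row[j]) for row in plan]
```

A simpler alternative of the kind the reviewer asks about would be a nearest-prototype assignment, which ignores the marginal constraints entirely. The difference between the two is that the plan above forces every prototype to receive a fixed share of the total mass.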

Questions

  1. The following sentence repeats twice in the related work. "D2Q alleviates the duration bias by conducting backdoor adjustments and models watch time with direct watch-time quantile regression. "
  2. The authors highlight, "However, those methods struggle to consistently maintain high predictive accuracy across different models." Could you provide more detailed explanations for why existing methods cannot maintain high predictive accuracy across different models?
  3. Regarding the sampling in the pre-processing step, the process of transforming the original one-dimensional multimodal distributions into D ∗ L one-dimensional near-Gaussian distributions w of length needs further clarification. Please provide a clear description of how the data is sampled and transformed.
  4. The computation for optimal transport is expensive. Although the authors reduce the computational cost by randomly sampling 20% of the instances in a batch, could you please provide the trade-off numbers between effectiveness and efficiency? Specifically, how much does this reduction in computational cost impact the model's performance?
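The efficiency side of question 4 can be made concrete with a small sketch. It subsamples a fraction of a batch before building the instance-to-prototype cost matrix and counts how many cost entries remain. The helper names `subsample_for_ot` and `cost_matrix_entries`, the batch size of 512, and the 32 prototypes are hypothetical; only the 20% fraction comes from the review text. The sketch says nothing about the effect on accuracy, which is what the question asks the authors to report.

```python
import random


def subsample_for_ot(batch, fraction=0.2, seed=0):
    """Randomly keep a fraction of the batch for the OT computation."""
    rng = random.Random(seed)
    k = max(1, round(len(batch) * fraction))
    # Sample without replacement and keep the original order.
    idx = sorted(rng.sample(range(len(batch)), k))
    return [batch[i] for i in idx], idx


def cost_matrix_entries(n_instances, n_prototypes):
    """Number of pairwise costs an OT solver has to handle."""
    return n_instances * n_prototypes


batch = list(range(512))  # stand-in for 512 instance representations
sampled, kept_idx = subsample_for_ot(batch, fraction=0.2)
full_entries = cost_matrix_entries(len(batch), 32)
reduced_entries = cost_matrix_entries(len(sampled), 32)
```

Because each Sinkhorn-style iteration touches every entry of the cost matrix, the per-iteration cost scales with this entry count, so the reduction is roughly proportional to the sampling fraction.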
Review
Rating: 8

This paper presents a novel two-stage method, i.e., ProWTP, for accurately predicting user watch-time in video recommendation systems. The authors address the limitations of existing approaches by combining prototype learning and optimal transport, aligning label and instance representation distributions to improve prediction accuracy. The use of HVQ-VAE for converting continuous labels into high-dimensional discrete prototypes and formulating the alignment problem as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem are innovative contributions. The introduction of assignment and compactness losses further enhances the method's effectiveness. Extensive offline experiments demonstrate ProWTP's superiority in real-world applications, making it a valuable contribution to the field of video recommendation systems. Overall, this paper is well-written, thorough, and makes significant advancements in the area of watch-time prediction.

Strengths

(1) The paper introduces ProWTP, a method that significantly enhances model prediction performance in the watch-time prediction (WTP) task by addressing the instance representation confusion problem. By aligning label distributions with instance representation distributions through optimal transport, ProWTP ensures that the model's predictions are more accurate and reliable. This innovation is crucial for improving user stickiness and retention in video recommendation systems.

(2) The paper is the first to investigate the multimodal distribution properties of watch-ratio across different video duration buckets. By utilizing a hierarchical vector quantized variational autoencoder (HVQ-VAE) to transform these properties into credible high-dimensional prototype vectors, the paper provides a more precise reference for calibrating recommendation models. This innovation allows the model to better understand user preferences and behaviors, leading to more personalized and effective recommendations.

(3) The paper conducts extensive offline experiments on two industrial datasets to validate the effectiveness of the proposed approach. The experimental results consistently demonstrate the superiority of ProWTP over existing methods, providing strong evidence for its practical applications in real-world scenarios. This innovation ensures that the proposed method is not only theoretically sound but also feasible and effective in practical use.

Weaknesses

(1) The boundaries and logic between different modules in Figure 2 are not clearly delineated. To enhance readability and comprehension, it is recommended to add borders around each module. This will help readers better understand the structure and relationships within the figure.

(2) To ensure robustness and reproducibility, it is advisable to present the mean and variance of multiple experimental runs in Table 1. This will provide a more comprehensive evaluation of the proposed method's performance.
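The reporting suggested in point (2) can be sketched as follows. The helper names `summarize_runs` and `format_cell` are hypothetical, and the metric values are placeholders, not results from the paper; the sample standard deviation is used here, which is one common choice.

```python
import statistics


def summarize_runs(metric_per_run):
    """Return the mean and sample standard deviation over repeated runs."""
    mean = statistics.mean(metric_per_run)
    # stdev needs at least two runs; report 0.0 for a single run.
    std = statistics.stdev(metric_per_run) if len(metric_per_run) > 1 else 0.0
    return mean, std


def format_cell(metric_per_run, digits=4):
    """Format one table cell as 'mean ± std'."""
    mean, std = summarize_runs(metric_per_run)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Placeholder values for five seeds (NOT taken from the paper).
placeholder_mae = [0.50, 0.52, 0.48, 0.51, 0.49]
cell = format_cell(placeholder_mae)
```

Reporting the spread across seeds alongside the mean makes it possible to judge whether a gap between two methods exceeds run-to-run noise.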

Questions

(1) The observed multimodal phenomenon is intriguing. Could you elaborate on the potential user behavior or habits that may be underlying this phenomenon? Understanding the root causes could provide deeper insights into the data and its implications.

(2) Did the real data used in your study undergo any preprocessing steps? If so, could you outline the specific preprocessing techniques employed and highlight any important considerations or challenges encountered during this process? This information would help replicate your results and understand the data quality.

Review
Rating: 6

This paper introduces ProWTP, a two-stage framework for watch-time prediction. In the first stage, ProWTP employs HVQ-VAE to capture clustering structures among instances based on watch ratios. In the second stage, ProWTP uses SUOT to refine instance representations to align with the learned clusters. ProWTP is validated in the offline evaluation. ProWTP shows improvement over baseline methods on two benchmark datasets. The ablation study also shows the effectiveness of various design choices within the framework.

The reviewer primarily has the following concerns regarding the presentation and experiments of this work.

  1. For Figure 1a, to the reviewer, it is unclear whether the watch-ratio distribution for videos of short and medium durations is truly multimodal, as the density of the first peak (around 1.5 for medium-duration videos) does not appear significant.
  2. As presented in Section 4.2, ProWTP samples a one-dimensional distribution w from the multimodal distribution and claims this distribution should be near-Gaussian. However, to the reviewer, with random sampling alone, w may not exhibit a near-Gaussian form. The authors are encouraged to elaborate further on any specific sampling strategies employed to achieve this distribution.
  3. The prototype generation process remains unclear. Specifically, does the encoder use individual watch ratios or the entire distribution as input? If it uses the distribution, does this imply that the HVQ-VAE is trained on a single sample, as the reviewer assumes there is one distribution for the entire dataset?
  4. The instance representation confusion issue is not intuitive and well-defined.
  5. In Section 4.3, further elaboration is needed on why both the instance representation distribution and the prototype representation distribution follow a uniform distribution.
  6. To the reviewer, the improvement over baselines on the two datasets appears marginal, suggesting the need to test the significance of these improvements. Additionally, while the authors adopt the parameter settings from the original papers for baselines, these models may not be optimized for the MLP architecture used here, potentially making the settings sub-optimal. The authors are thus encouraged to search for optimal baseline parameters to ensure a fair comparison.
  7. The authors are encouraged to include an online evaluation to better validate ProWTP’s effectiveness with real users.
  8. No empirical results demonstrate whether the representation confusion issue has been addressed.
  9. The trend in Table 3 raises concerns for the reviewer. If there is a clustering structure in the data that benefits watch-time prediction, as suggested in the other tables, why does k-means underperform random?
  10. The literature review overlooks a recent work for watch-time prediction [1], which is closely related to this work.

[1] SWaT: Statistical Modeling of Video Watch Time through User Behavior Analysis
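The significance testing requested in concern 6 could take several forms. One option, sketched below under stated assumptions, is a paired bootstrap over per-sample errors: it reports the fraction of resamples in which the candidate model fails to beat the baseline, which is often read as an approximate one-sided p-value. The function name `paired_bootstrap_not_better`, the number of resamples, and the synthetic error values are all hypothetical and not taken from the paper.

```python
import random


def paired_bootstrap_not_better(errors_a, errors_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples where model B does not beat model A.

    errors_a and errors_b are per-sample errors on the same test examples
    (lower is better). A small return value suggests B's gain is stable.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error lists must be paired and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]  # positive: B better
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs together.
        resample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if resample_mean <= 0:
            not_better += 1
    return not_better / n_resamples


# Synthetic per-sample absolute errors (illustration only).
data_rng = random.Random(1)
baseline_err = [abs(data_rng.gauss(0.0, 1.0)) for _ in range(200)]
candidate_err = [e * 0.9 for e in baseline_err]  # uniformly 10% smaller
fraction = paired_bootstrap_not_better(baseline_err, candidate_err)
```

Pairing matters here because both models are evaluated on the same test examples, so resampling examples rather than the two error lists independently keeps the per-example correlation intact.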

Strengths

Overall, the framework presented in ProWTP is novel and interesting, and the experiments demonstrate certain advantages. However, further validation is necessary to fully conclude its effectiveness.

Weaknesses

Please refer to the summaries for detailed weaknesses.

Questions

Please refer to the summaries for detailed questions.

Comment

I greatly appreciate the authors' time and effort in addressing all my comments. I have updated my rating. Good luck.

Comment

Thank you for taking the time to review our responses and for updating your rating. We are truly honored to receive your thoughtful feedback, which has greatly helped us improve the paper. Your support means a lot to us. Thank you again, and we wish you all the best!

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.