Towards Relaxing the Unbiasedness Condition of Doubly Robust Estimators for Debiased Recommendation
Abstract
Reviews and Discussion
The paper introduces a novel doubly robust estimator to address observational bias in collaborative filtering.
It focuses on the effectiveness of the method, particularly in handling inaccurate pseudo-labels.
The authors propose debiasing techniques with constraints on propensity scores, aiming to ensure unbiasedness when the pseudo labels deviate from the true labels with an arbitrary inductive bias.
The paper combines theoretical analysis and experimental evaluations to demonstrate the method's effectiveness in comparison to benchmark approaches.
Strengths
- Comprehensive Background: The authors provide a well-explained introduction and extensive coverage of prior related work, which gives readers context and a clear picture of the research landscape.
- Novel and Effective Method: The proposed method effectively addresses the issue of pseudo-labels deviating from true values with specific inductive bias.
- Theoretical and Empirical Validation: The paper combines theoretical analysis with empirical results, covering both semi-synthetic and real-world datasets, demonstrating the method's technical soundness and its superior performance compared to existing approaches.
Weaknesses
- Limited Motivation: The paper may lack novelty in the central idea. Is there any empirical evidence that existing pseudo-labeling suffers from user- or item-specific inductive bias? If a user has an inductive bias in the training set, the same user should have exactly the same inductive bias in the test set.
- Strong Assumptions: The assumptions regarding inductive biases for users and items might not be realistic. The paper assumes constant inductive biases for users across all items, which may not hold true in real-life scenarios. Users typically exhibit varying preferences for different items.
- Unclear Propensity Score Calculation: The paper lacks a clear description of how propensity scores were calculated. A more detailed explanation is needed.
Questions
- Personally, I cannot understand the intuitive concept of the experiment with the synthetic dataset. Can you explain the intuition?
- Is there any empirical or theoretical evidence for the motivating inductive bias problem?
Details of Ethics Concerns
NA
The paper presents a new doubly robust estimator for the missing-not-at-random bias. The authors propose constraints on propensity scores to handle pseudo-labels deviating from the true value with the user-specific bias. They provide theoretical analysis and practical experiments, showing how their approach outperforms benchmark methods.
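The idea summarized above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: all data are synthetic, the names `dr_estimate`, `p_con`, and `b` are hypothetical, and the constraint is taken to be that the inverse-propensity weights of the observed entries sum to the number of items for every user, which is one reading of "constraints on propensity scores" and may differ from the authors' exact formulation. Under that reading, a constant per-user offset between imputed and true errors cancels out of the DR estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 50, 40

# Synthetic true prediction errors e_ui and true propensities p_ui.
e = rng.uniform(0.0, 1.0, size=(n_users, n_items))
p = rng.uniform(0.1, 0.9, size=(n_users, n_items))

# Observation indicators o_ui; force one observed entry per user so the
# per-user rescaling below is always well defined.
o = (rng.uniform(size=(n_users, n_items)) < p).astype(float)
o[:, 0] = 1.0

# Imputed errors that miss the truth by a constant per-user offset b_u
# (the "user-specific inductive bias" scenario).
b = rng.normal(0.0, 0.5, size=(n_users, 1))
e_hat = e - b


def dr_estimate(e, e_hat, o, p_hat):
    """Standard doubly robust estimate of the mean error over all (u, i)."""
    return float(np.mean(e_hat + o * (e - e_hat) / p_hat))


ideal = float(e.mean())  # ideal loss over the full user-item matrix

# Misspecified learned propensities (hypothetical noise model).
p_hat = np.clip(p * rng.uniform(0.5, 1.5, size=p.shape), 0.05, 1.0)
dr_plain = dr_estimate(e, e_hat, o, p_hat)

# Rescale per user so that sum_i o_ui / p_con_ui == n_items.
# Note: the rescaled values are not guaranteed to stay within (0, 1].
scale = (o / p_hat).sum(axis=1, keepdims=True) / n_items
p_con = p_hat * scale
dr_con = dr_estimate(e, e_hat, o, p_con)

print("ideal loss       :", ideal)
print("DR, unconstrained:", dr_plain)
print("DR, constrained  :", dr_con)
```

In this sketch the constrained estimate matches the ideal loss up to floating-point error, because each user's offset is multiplied by a weight sum that equals the number of items. The rescaling is a post-hoc stand-in for whatever constrained learning procedure the paper uses, and it says nothing about how accurate the resulting propensities are as probabilities.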
Strengths
- the solution (a constrained propensity model) is simple yet effective.
- the authors provide theoretical analyses on the unbiasedness and the variance of the proposed estimator.
Weaknesses
- the target problem is too specific and minor. The authors note that UIDR can effectively alleviate the "inaccurate pseudo-labeling problem" in previous DR estimators. However, they only treat the situation where the pseudo-labels deviate from the true labels with an arbitrary user-specific inductive bias. This assumption looks quite unrealistic, as the user-specific inductive bias is assumed to be identical for every item.
- the proposed procedure is not well-motivated. The authors put constraints on the propensity model, not the imputation model, in order to tackle an inaccurate imputation model. If the imputation model is inaccurate, the straightforward remedy would be either adjusting the imputation model itself or designing a new loss function that is robust to inaccurate imputed errors. If we adjust the propensity model to compensate for the imputation model, the accuracy of the propensity model can be harmed.
Questions
- please refer to weaknesses.
- e.g., if we adjust the propensity model to compensate for the imputation model, how does the accuracy of the propensity model change?
The paper studies the challenges of debiasing recommender system models in the presence of sample selection bias introduced during data collection, and proposes potential solutions. The authors propose several novel doubly robust estimators that remain unbiased under arbitrary user-specific inductive bias, item-specific inductive bias, or both, and they prove this unbiasedness theoretically. In addition, they propose a propensity reconstruction learning approach that adaptively updates the constraint weights to meet the constraints of the proposed UIDR estimator.
Strengths
(1) The authors introduce a series of innovative doubly robust (DR) estimators through a rigorous theoretical framework. These estimators remain unbiased even when pseudo-labels diverge from the true labels, accommodating arbitrary and unknown biases specific to users, items, or a combination thereof. This represents a significant stride in addressing user-specific and item-specific inductive biases, showcasing the adaptability and robustness of the proposed methods.
(2) In a further extension of their work, the authors present a principled propensity reconstruction learning strategy, which adeptly utilizes an attention mechanism to adaptively update the constraint weights. This approach not only enhances the adaptability of the model but also ensures that the variance of the DR estimators remains within controllable and manageable bounds. This aspect of the work underscores the authors’ commitment to developing robust and reliable estimators, contributing to the stability and efficacy of the proposed methods.
(3) The paper’s empirical validation is robust, encompassing semi-synthetic experiments that attest to the effectiveness of the proposed methods in scenarios involving arbitrary user-specific and item-specific inductive biases. This is a notable achievement, as previous methodologies have fallen short in providing unbiased estimates of the ideal loss under these conditions. Additionally, the authors extend their validation to real-world contexts, conducting comprehensive experiments across three real-world datasets. These experiments serve to highlight the tangible advantages and superior performance of the proposed methods, solidifying the paper’s contributions to the field.
Weaknesses
(1) While the theoretical foundation appears robust, there is a potential concern regarding the complexity and practicality of implementing such estimators in real-world scenarios when we have large U and large I. The paper could benefit from a more detailed discussion on the potential challenges and limitations associated with these novel DR estimators, providing a more balanced and critical perspective.
(2) The introduction of a principled propensity reconstruction learning approach, utilizing an attention mechanism to adaptively update constraint weights, is indeed a novel contribution. However, the claim that the variance of the DR estimators is highly controllable and manageable warrants more rigorous scrutiny. Is there a tradeoff between bias and variance here? In the ML community the goal is often to minimize the MSE, so it would help to clarify what "controllable and manageable" means in that light.
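For reference, the tradeoff raised in point (2) is the standard decomposition of an estimator's mean squared error. The notation below is an assumption introduced here for illustration and does not come from the paper: $\hat{\mathcal{E}}$ denotes any estimator of the ideal loss $\mathcal{E}^{*}$, which is treated as a fixed quantity, and the expectation is taken over the observation process.

```latex
\mathrm{MSE}(\hat{\mathcal{E}})
  = \mathbb{E}\!\left[ \bigl( \hat{\mathcal{E}} - \mathcal{E}^{*} \bigr)^{2} \right]
  = \underbrace{\bigl( \mathbb{E}[\hat{\mathcal{E}}] - \mathcal{E}^{*} \bigr)^{2}}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\bigl( \hat{\mathcal{E}} \bigr)}_{\text{variance}}
```

Under this identity, an estimator with zero bias can still have a larger MSE than a slightly biased one if its variance is high enough, which is why a bound on the variance alone does not settle the question.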
Questions
- In Section 2, it seems that in [0,1] rather than in R, which is the rating/
- After Corollary 1: "the biases of the proposed DR estimators are still strictly smaller than the previous DR". Is this toy example realistic? Could you please show a real example here?
- For the optimization problem, do we have a constraint that < 1?
- What is A? What is b? What is and ? How do we learn it? Why has been applied tanh but is not been applied?
This paper proposes User-DR, Item-DR, and User-Item-DR for unbiased recommendation. The proposed methods come with strong theoretical guarantees and show improvements over the baselines.
Strengths
- Strong theoretical guarantee on the proposed method
- The experimental studies are solid. I would like to point out that a ~10% improvement in NDCG@5 is very significant for unbiased recommendation. Note that the machine learning model is not changed in this paper. To be more precise, only the debiasing method differs from the baselines, if I understand correctly.
Weaknesses
- Some baselines are not adequately introduced. For example, the best baseline, ESCM-DR, is not described in detail.
- The paper lacks an analysis of running time (or time complexity). I think the paper would be strengthened if it showed that ESCM-DR is slower than the proposed algorithms.
- The evaluation metrics, AUC, NDCG, and F1, are missing clear definitions (these can go in the appendix).
Questions
- Can the authors provide some detailed explanations for ESCM-DR and Multi-DR?
- Can any record of the running time (or time complexity) be provided?
- Can the authors provide a rigorous definition of the F1 measure?
- Can the authors also provide experimental results similar to Table 3 for NDCG and F1 at other cutoff values in the appendix? At least for the Coat and Music datasets, Top-1 should also be an important measure.
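To make the request for metric definitions concrete, the following is a minimal sketch of one common way to compute NDCG@K and F1@K for a single user with binary relevance labels. It is not necessarily the definition used in the paper, which is exactly what the question above asks the authors to state; the function names and the toy inputs are hypothetical.

```python
import numpy as np


def dcg_at_k(relevance, k):
    """Discounted cumulative gain of the first k entries of a ranked list."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel * discounts))


def ndcg_at_k(scores, labels, k):
    """NDCG@k: DCG of the predicted ranking divided by the best possible DCG."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    ranked = labels[np.argsort(-scores)]   # labels in predicted order
    ideal = np.sort(labels)[::-1]          # labels in the best possible order
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0


def f1_at_k(scores, labels, k):
    """F1@k: harmonic mean of precision@k and recall@k for binary labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    hits = labels[np.argsort(-scores)[:k]].sum()
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / labels.sum()
    return float(2 * precision * recall / (precision + recall))


# Toy example: four items, two of them relevant, evaluated at k = 2.
scores = [0.9, 0.8, 0.1, 0.4]
labels = [1, 0, 1, 0]
print("NDCG@2:", ndcg_at_k(scores, labels, k=2))
print("F1@2  :", f1_at_k(scores, labels, k=2))  # one hit in the top 2 -> 0.5
```

Variants differ in details such as whether precision is divided by k or by the number of items actually returned, and whether per-user values are averaged over all users or only users with at least one relevant item, so reported numbers are only comparable once these choices are fixed.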