Rethinking Causal Ranking: A Balanced Perspective on Uplift Model Evaluation
Abstract
Reviews and Discussion
This work focuses on building and evaluating models for uplift modeling. It identifies a critical limitation in existing evaluation metrics: many of them do not weigh negative outcomes enough. The work finds that this leads to biased evaluations due to incorrect orderings between persuadables and sleeping dogs with negative outcomes, potentially resulting in biased models receiving higher curve values. The authors show this through both empirical and theoretical results. Given this limitation of existing evaluation metrics, the work proposes the Principled Uplift Curve (PUC) and shows that it properly weighs individuals in both the positive and negative outcome groups. The authors then propose PTONet, which integrates the PUC into the objective function to reduce bias during uplift model training. Experimentally, the efficacy of PUC is shown by its alignment with ground-truth evaluation in synthetic settings, and the efficacy of PTONet is established on both synthetic and real data.
Update after Rebuttal
I appreciate the authors' work in answering the questions. Overall, I am satisfied with the response and have updated my score. I do hope the authors consider the limitations brought up by other reviewers and discuss them to ensure the paper does not overclaim the contributions of the proposed metric (which I do not believe it currently does).
Questions for Authors
- When proving Proposition 4.1, how does the difference between the two bounds prove the first half of Proposition 4.1? I would appreciate seeing the steps more clearly.
- What is the exact intuition for why the loss (11) is useful? How does identifying g(t_i, y_i) from the estimated treatment effect help guide models to assign higher CATEs to persuadable individuals?
Claims and Evidence
The primary claims of this work are:
- Existing uplift model evaluation curves can result in suboptimal ordering
- The proposed PUC provides a more balanced and unbiased evaluation compared to regular uplift and Qini curves
- The proposed PTONet enhances a model's ability to rank CATEs effectively.
The authors provide sufficient evidence for all of these claims.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense for the problem at hand, and the inclusion of both synthetic data and real data is appreciated.
Theoretical Claims
I examined the correctness of Proposition 4.1. Truthfully, it is a bit difficult to follow. The rough steps make sense, though it is not clear why the value function is defined as the difference between the two bounds -- this is stated without justification in the proof. The proof of the second half of 4.1 is clearer.
Experimental Design and Analysis
The experimental design and analyses seem sound overall.
Supplementary Material
I reviewed the proofs and derivations in the supplementary material as well as the experimental details.
Relation to Existing Literature
The ability to accurately evaluate uplift models is critical across many fields, including marketing and advertising where personalized decision-making is key. The contributions in finding the weaknesses of existing evaluation metrics as well as the new proposed evaluation metric are hence quite important. The proposed PTONet also shows a new way to improve uplift modeling, which may be used in these downstream applications as well.
Essential References Not Discussed
I would appreciate more discussion of [1]. In the evaluation metrics considered in [1], every example contributes to the TOC/AUTOC curve, which may mitigate some of the issues found with most uplift modeling metrics.
[1] Yadlowsky, Steve, et al. "Evaluating treatment prioritization rules via rank-weighted average treatment effects." Journal of the American Statistical Association (2024): 1-14.
Other Strengths and Weaknesses
Strengths:
- The authors identify a subtle yet interesting flaw in how most uplift models are evaluated
- The clear examples in Tables 2/3 of an unbiased model being rated worse than a biased model are very compelling
- The correlation between the ground-truth AUTGC and PUC in the experimental results is very convincing, and seems to support the theoretical findings for PUC
- PTONet has strong performance in terms of PUC. The fact that the next-best-performing model is the PU S-Learner lends more credence to the authors' findings regarding how to optimize the PUC.
- The ablations of the proposed method are convincing in terms of showing the importance of every part of the objective
Weaknesses and Concerns:
- The work does not mention the relationship to [1] (i.e., does TOC overcome many of the issues of past work?)
- The organization of the proposed method is quite difficult to follow. Specifically, it is not clear how the loss in (11) is formulated. A clearer description would be useful, as this is the crux of the proposed method.
- The proof of Proposition 4.1 is not well formulated and very difficult to follow. Once again, as this is a major part of the proposed contribution, a clearer proof of this proposition would be appreciated.
Other Comments or Suggestions
N/A
References & Weaknesses 1: I would appreciate more discussion of [1].
Response 1: Thank you for your concern. As far as we know, TOC/AUTOC can be understood as introducing a threshold and a logarithmic function into the conventional uplift and Qini curves (see Equation (2.5) in [1]).
This indicates that the metric places particular emphasis on the contribution of the top few individuals (as shown in Figure 2 of [1], where, if only the top 10% of the population is considered, the overall gain from AUTOC is higher than that from Qini). In other words, this metric amplifies the imbalance issue inherent in the uplift and Qini curves. In contrast, the goal of our metric is the opposite: we aim to address and mitigate this imbalance problem.
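For concreteness, here is a minimal sketch of the weighting contrast on toy data; the `np.log(1 / frac)` weight is only meant to illustrate the point that top-weighted averaging emphasizes the highest-ranked fraction, and is not Equation (2.5) of [1] (function names and data are illustrative):

```python
import numpy as np

def cumulative_uplift(scores, t, y):
    # For each prefix of the descending-score ranking, compute the
    # treated-minus-control mean outcome (a simple uplift-curve value).
    order = np.argsort(-scores)
    t, y = t[order], y[order]
    gains = np.empty(len(y))
    for k in range(1, len(y) + 1):
        tk, yk = t[:k], y[:k]
        mean_t = yk[tk == 1].mean() if (tk == 1).any() else 0.0
        mean_c = yk[tk == 0].mean() if (tk == 0).any() else 0.0
        gains[k - 1] = mean_t - mean_c
    return gains

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
t = rng.integers(0, 2, size=1000)
y = rng.integers(0, 2, size=1000)

gains = cumulative_uplift(scores, t, y)
frac = np.arange(1, len(gains) + 1) / len(gains)
uniform_avg = gains.mean()                                   # uplift/Qini-style weighting
top_weighted = np.average(gains, weights=np.log(1 / frac))   # emphasizes small prefixes
```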
Weaknesses 2 & Questions 2: What is the exact intuition for why the loss (11) is useful? How does identifying g(t_i, y_i) from the estimated treatment effect help guide models to assign higher CATEs to persuadable individuals?
Response 2: Thank you for your question. The intuition behind the loss function in equation (11) is derived from Table 1. Specifically, the TP ($t=1, y=1$) and CN ($t=0, y=0$) samples should be ranked ahead of the TN ($t=1, y=0$) and CP ($t=0, y=1$) samples.
To incorporate this constraint during training, we introduce $g(t_i, y_i)$ as a binary classification label, where the labels for TP and CN samples are 1 and the labels for TN and CP samples are 0. Then, we use the estimated causal effect $\hat{\tau}_i$ as the input to train this binary classification task. This effectively constrains the model so that the estimated causal effects for TP and CN samples are as large as possible, while those for TN and CP samples are as small as possible. In this way, when the trained model is tested, or during model selection, the model will rank TP and CN ahead of TN and CP when sorting by $\hat{\tau}_i$ in descending order. The persuadable group corresponds to the TP and CN samples, while the sleeping dog group corresponds to the TN and CP samples.
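A minimal PyTorch sketch of this idea, assuming the estimated effect $\hat{\tau}_i$ is fed directly to the classifier as a logit (the function name and this choice are our assumptions; the paper's equation (11) may differ):

```python
import torch
import torch.nn.functional as F

def principled_uplift_style_loss(tau_hat, t, y):
    # g(t, y) = 1 for TP (t=1, y=1) and CN (t=0, y=0) samples, 0 for TN and CP.
    g = (t == y).float()
    # Treat tau_hat as the logit: effects are pushed up for label-1 cells, down otherwise.
    return F.binary_cross_entropy_with_logits(tau_hat, g)

tau_hat = torch.randn(8, requires_grad=True)   # estimated CATEs from any uplift head
t = torch.randint(0, 2, (8,)).float()
y = torch.randint(0, 2, (8,)).float()
loss = principled_uplift_style_loss(tau_hat, t, y)
loss.backward()
```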
Claims & Weaknesses 3 & Questions 1: The proof of Proposition 4.1 is not well formulated and very difficult to follow. When proving Proposition 4.1, how does the difference between the two bounds prove the first half of Proposition 4.1?
Response 3: Thank you for your concern. Below, we will focus on explaining the meaning of these two bounds and why the difference between them is able to distinguish between the persuadable group and the sleeping dog group.
In Appendix E, we define the number of persuadable individuals in the total sample as $N_P$ and the number of sleeping dog individuals as $N_{SD}$; we then have the two bounds $N_P \le N_{TP} + N_{CN}$ and $N_{SD} \le N_{TN} + N_{CP}$.
The first bound holds because persuadable individuals can only be observed in the TP ($t=1, y=1$) and CN ($t=0, y=0$) cells. Therefore, the number of individuals in the persuadable group, $N_P$, will always be less than or equal to the sum of the number of TP individuals, $N_{TP}$, and CN individuals, $N_{CN}$. Similarly, sleeping dog individuals can only be observed in the TN ($t=1, y=0$) and CP ($t=0, y=1$) cells; thus, $N_{SD} \le N_{TN} + N_{CP}$.
We aim for the evaluation metric PUC to correctly distinguish between the persuadable group and the sleeping dog group. This means we want PUC to increase as the persuadable group grows and decrease as the sleeping dog group grows. Therefore, we define the value function as the difference between the two bounds: $(N_{TP} + N_{CN}) - (N_{TN} + N_{CP})$.
As we can see, the first two terms cover all individuals in the persuadable group, while the last two terms cover all individuals in the sleeping dog group. When a persuadable individual is included in PUC, the value of PUC increases; conversely, when a sleeping dog individual is included, the value of PUC decreases. Thus, our metric can effectively distinguish between the persuadable group and the sleeping dog group.
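A minimal sketch of this counting logic, assuming unit contributions of +1 and -1 and ignoring the paper's exact PUC normalization (the function name is hypothetical):

```python
import numpy as np

def puc_value_curve(scores, t, y):
    # +1 for each TP (t=1, y=1) or CN (t=0, y=0) unit encountered in the
    # descending-score ranking, -1 for each TN (t=1, y=0) or CP (t=0, y=1) unit.
    order = np.argsort(-scores)
    contribution = np.where(t[order] == y[order], 1.0, -1.0)
    return np.cumsum(contribution)

t = np.array([1, 0, 1, 0])
y = np.array([1, 0, 0, 1])                                    # cells: TP, CN, TN, CP
good = puc_value_curve(np.array([4.0, 3.0, 2.0, 1.0]), t, y)  # [1, 2, 1, 0]
bad = puc_value_curve(np.array([1.0, 2.0, 3.0, 4.0]), t, y)   # [-1, -2, -1, 0]
```

A ranking that places all TP/CN units ahead of all TN/CP units dominates the curve at every prefix, which is the distinguishing behavior described above.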
We will include this explanation in Appendix E of the final version of the paper.
Thank you once again for your valuable feedback. If you have any further concerns or questions, we are always happy to address them. If you feel that our responses have addressed your concerns, we would appreciate it if you could consider raising your recommendation score.
This paper proposes PTONet, a new uplift model that integrates the Principled Uplift Loss (PUL) to improve CATE ranking accuracy, outperforming existing models in experiments on simulated and real-world datasets.
Questions for Authors
Please refer to the above.
Claims and Evidence
The paper effectively presents its claims and supports them with clear evidence.
Methods and Evaluation Criteria
Yes, the paper proposes a method to improve CATE ranking accuracy.
Theoretical Claims
The theoretical claims and their proof are correct.
Experimental Design and Analysis
The paper's experimental design appears methodologically sound.
Supplementary Material
I've reviewed the Appendix.
Relation to Existing Literature
The key contributions of this paper are positioned within the broader context of uplift modeling, which has been a subject of significant interest in domains like marketing, customer retention, and personalized treatment recommendations.
Essential References Not Discussed
None
Other Strengths and Weaknesses
None
Other Comments or Suggestions
None
Ethics Review Concerns
None
Thank you for your feedback. If you have any additional concerns or questions, we would be happy to answer them. If you have no additional concerns, we would appreciate you considering increasing your recommendation score.
I confirm that I have read the author's response to my review and will update my review in light of this response as necessary.
In an RCT with two groups, treatment and control, an uplift model is supposed to rank four types of units (treatment positive, treatment negative, control positive, and control negative) in alignment with their CATE (which is unobservable). This paper claims that existing evaluation metrics such as uplift curves and Qini curves are biased toward treating all negatives the same way, i.e., they ignore the possibility that a control negative could be a treatment positive and should be ranked on par with treatment positives. It then proposes a simple fix to the evaluation metrics, yielding the proposed 'Principled Uplift Curve'. The same idea is also used to add an additional loss function for uplift modeling. Experimental results show that the proposed fix correlates best with true CATE rankings.
Questions for Authors
I have mentioned most questions in the previous sections. Regarding the results, even the S-Learner with the loss-function fix seems to be competitive with PTONet on the fixed evaluation metrics. It would be interesting to see whether this holds for other uplift models too.
### Update after Rebuttal ###
I am satisfied with the responses that the authors provided. On the one hand, the experiments they performed underscore an important point: existing learners with the loss-function fix seem to be competitive with their proposed learner. On the other hand, I feel this takes away from the utility of PTONet. So, I'll keep my score unchanged.
Claims and Evidence
I don't see a clear enough explanation for the claim that equation 12 handles treatment assignment bias. While this is not the main point of the paper, given that it is included and important for PTONet (the proposed uplift model), either it should be explained more, or, if it is from previous work, clear citations should be added.
Methods and Evaluation Criteria
The paper compares its fix against multiple uplift models and evaluation criteria present in the literature. A few specific issues: a) Why does the architecture in Figure 4 have the input from h(X,T) being added to g(T,Y)? It seems that in equation (11) the BCE does not take h(X,T) as an input. b) The paper needs to address the text on treatment bias with some more explanation.
Theoretical Claims
The main theoretical claim that the paper makes is that the proposed uplift curve is sound in its ranking. This follows immediately from the proposed fix.
Experimental Design and Analysis
The experimental design is sound as per my understanding. A few issues: a) I couldn't understand why the outcome in the synthetic data in Appendix I is real-valued when the rest of the analysis is for a binary outcome. b) Also, the form of the outcome functions doesn't seem to have a rationale stated in the text.
Supplementary Material
Yes, the supplementary material contains the related work section, proof of the theoretical claim and experimental details.
Relation to Existing Literature
This work challenges existing evaluation metrics in the uplift modeling literature. It draws specifically on Devriendt et al. (2020), which uses helper functions to derive loss functions for training uplift models.
Essential References Not Discussed
None that I know of.
Other Strengths and Weaknesses
Overall, the main contribution is a fix to existing evaluation metrics in the uplift modeling literature. This is certainly important and impactful. The paper is written clearly overall, though some parts need more work, which I specify later. My overall impression is that the proposed fix is 'obvious' and what anyone should do in the first place. On one hand, presenting the problem such that the solution is obvious is a strength of the paper. On the other hand, I find that the paper lacks any further insight apart from this fix.
Other Comments or Suggestions
- Section 2.2 has some notation that is not right. I(k) is assumed to be an ordered index, but it is not made clear what the order is when the index i ranges from 1 to I(k). Is I_diff the same as I?
- SUC in line 124 occurs before it is defined below.
- The plot in Figure 5 needs some more explanation about what the shaded region is and what the lines are.
Thank you for your positive feedback. We will address each of your concerns one by one.
Claims & Methods: ... explanation for the claim that equation 12 handles treatment assignment bias. ... clear citations should be added.
Response 1: Thank you for your suggestion. We will revise the citation in line 315 of the original text to "(please refer to Section 3 in Shi et al. (2019))" for better readability.
Treatment Assignment Bias occurs when treatment assignment is influenced by systematic factors rather than being entirely random, potentially leading to biased results. Since this paper focuses on RCT data, where treatment is assumed to be random, Treatment Assignment Bias is not a concern. The Targeted Regularizer in PTONet improves scalability to non-RCT data, enhancing its applicability in industry and future research.
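For reference, here is a rough sketch of a targeted regularizer in the style of Section 3 of Shi et al. (2019); whether PTONet's Targeted Regularizer takes exactly this form is not confirmed by this thread:

```python
import torch

def targeted_regularizer(y, t, q, g, eps):
    # y: observed outcome; t: binary treatment; q: predicted outcome q(x, t)
    # for the observed arm; g: predicted propensity P(T=1 | x); eps: learnable scalar.
    g = g.clamp(0.01, 0.99)          # keep inverse-propensity weights bounded
    h = t / g - (1 - t) / (1 - g)    # inverse-propensity "clever covariate"
    y_pert = q + eps * h             # epsilon-perturbed outcome prediction
    return ((y - y_pert) ** 2).mean()
```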
Methods: a) Why does the architecture in Figure 4 have the input from h(X,T) being added to g(T,Y)?
Response 2: Thank you for your question. The function feeding the g(T,Y) module can be derived from h(X,T); this derived function is what the arrow in Figure 4 represents. To avoid this ambiguity, we will relabel the arrow in Figure 4 accordingly.
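One plausible reading, sketched below under our own assumptions, is that h(X, T) yields two outcome heads and the relabeled arrow carries the derived effect $\hat{\tau} = h(X, 1) - h(X, 0)$ into the $g$ classifier; this is not a confirmed description of PTONet:

```python
import torch
import torch.nn as nn

class TwoHeadUplift(nn.Module):
    # h(X, T) realized as two outcome heads over a shared representation;
    # the derived effect tau_hat is what feeds the g(T, Y) classifier.
    def __init__(self, d, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.h0 = nn.Linear(hidden, 1)   # outcome head for T = 0
        self.h1 = nn.Linear(hidden, 1)   # outcome head for T = 1

    def forward(self, x):
        z = self.shared(x)
        y0, y1 = self.h0(z), self.h1(z)
        tau_hat = y1 - y0                # derived effect passed onward
        return y0, y1, tau_hat
```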
Analysis: a) Why is the outcome in the synthetic data real-valued? b) The form of the functions doesn't seem to have a rationale.
Response 3: Apologies for any confusion. We neglected to emphasize in Appendix I that the final observed outcome is generated as the sum of three terms: the first term represents the observed outcome for the persuadable group, the second corresponds to the sleeping dogs, and the third accounts for the observed outcome of the sure things and lost causes.
The rationale behind this data generation process is as follows:
The treatment assignment probability is designed to simulate the real-world scenario in our business data, where the number of treated samples is significantly smaller than the number of control samples.
In the outcome functions, the sine and cosine terms are introduced to incorporate nonlinearity, while the different coefficients are used to adjust the proportion of samples with $y = 1$. This adjustment helps simulate our real-world business scenario, where positive outcomes are relatively rare.
We will include these details in Appendix I of the final version of the paper.
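Purely as an illustration, here is a hypothetical generator reflecting the two stated design choices: an imbalanced treatment rate, and rare positives via small sine/cosine coefficients. None of the coefficients or functional forms below reproduce the actual Appendix I specification:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))

# Imbalanced random assignment: far fewer treated than control units.
T = rng.binomial(1, 0.1, size=n)            # 0.1 is a placeholder rate

# Nonlinear outcome probabilities; small coefficients keep positives rare.
p0 = np.clip(0.2 * np.sin(X[:, 0]) + 0.1 * np.cos(X[:, 1]), 0.0, 1.0)
p1 = np.clip(p0 + 0.1 * np.sin(X[:, 2]), 0.0, 1.0)
Y0 = rng.binomial(1, p0)
Y1 = rng.binomial(1, p1)
Y = np.where(T == 1, Y1, Y0)                # observed (binary) outcome
```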
Comments 1: what the order is when the index i ranges from 1 to I(k).
Response 4: Thank you very much for your comments. The ordered index $I(k)$ is based on the descending order of the model's score function (the estimated uplift). Due to the character limit of the response, please refer to Appendix C of the paper, where we provide a detailed explanation of the calculation process for $I(k)$.
Comments 1 & 2: Is I_diff the same as I? SUC in line 124 occurs before it is defined below.
Response 5: Thank you for your correction. The term $I_{\mathrm{diff}}$ is a typo and will be revised to $I$ in the final version. Similarly, "SUC" in line 124 will be corrected to "uplift and Qini curve."
Questions: It would be interesting to see if this holds for other uplift models too.
Response 6: Thank you for your question. We have additionally included experiments with the T-Learner, TARNet, and EUEN models. The results are as follows:
| Synthetic | PEHE (↓) | SUC (↑) | SQC (↑) | JUC (↑) | JQC (↑) | PUC (↑) | AUTGC (↑) |
|---|---|---|---|---|---|---|---|
| T-Learner (PU) | 0.867 ± 0.14 | 0.763 ± 0.17 | 0.536 ± 0.12 | 0.748 ± 0.15 | 0.537 ± 0.11 | 0.937 ± 0.20 | 0.952 ± 0.15 |
| TARNet (PU) | 0.893 ± 0.08 | 0.759 ± 0.12 | 0.533 ± 0.09 | 0.754 ± 0.11 | 0.534 ± 0.08 | 0.944 ± 0.14 | 0.957 ± 0.11 |
| EUEN (PU) | 0.781 ± 0.15 | 0.767 ± 0.16 | 0.538 ± 0.11 | 0.742 ± 0.15 | 0.538 ± 0.11 | 0.932 ± 0.19 | 0.948 ± 0.15 |
| PTONet | 0.883 ± 0.13 | 0.780 ± 0.14 | 0.547 ± 0.10 | 0.746 ± 0.13 | 0.546 ± 0.10 | 0.948 ± 0.15 | 0.961 ± 0.11 |
The performance of these three models after incorporating the Principled Uplift loss function is comparable to that of PTONet, with significant improvement observed across all metrics. We will add these experiments to the final version of the paper.
Thank you once again for your valuable feedback. If you have any further concerns or questions, we are always happy to address them. If you feel that our responses have addressed your concerns, we would appreciate it if you could consider raising your recommendation score.
This paper reveals the limitations of previous uplift and Qini curves in evaluating uplift models, demonstrating their susceptibility to manipulation by suboptimal ranking strategies that can artificially enhance the performance of biased models. To address this, the authors introduce the Principled Uplift Curve (PUC), a metric that accounts for both positive and negative outcomes, offering a new assessment of uplift models. Additionally, they propose PTONet, a PUC-guided uplift model that optimizes uplift predictions by directly maximizing the PUC value.
Questions for Authors
Please see the weaknesses.
Claims and Evidence
Yes, the claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria make sense for the problem or application at hand.
Theoretical Claims
I did not thoroughly check each proof. However, they do not contradict my prior understanding.
Experimental Design and Analysis
The paper evaluates the performance of uplift models and their corresponding evaluation metrics using both synthetic and real-world datasets. It conducts experiments on a synthetic dataset and the Criteo dataset (Diemert Eustache et al., 2018; Diemert et al., 2021) to assess model effectiveness in practical scenarios. Additionally, to examine the scalability of the proposed method in high-dimensional settings, the paper presents further experimental results on the Lazada dataset (Zhong et al., 2022).
Supplementary Material
All.
Relation to Existing Literature
The paper contributes to the broader literature on uplift modeling methods and evaluation metrics for uplift models. It examines limitations in existing evaluation approaches, particularly the susceptibility of uplift and Qini curves to biased rankings. By introducing the Principled Uplift Curve (PUC) and the PTONet model, the paper offers a refined evaluation metric and an optimization-based modeling approach, adding to the ongoing research on uplift modeling and causal inference.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
The paper should provide a stronger justification for the proposed methods and explain why they outperform existing metrics. A key concern is that the approach does not improve the worst-case scenario, where ST^{TP} and ST^{CP} are ranked first. In many real-world applications, such as advertising and recommendation systems with budget constraints, this ranking can lead to significant opportunity costs. In contrast, conventional metrics at least ensure that PE^{TP} is ranked no lower than second place. While the proposed method distinguishes persuadable individuals from sleeping dogs, it also introduces a tradeoff in opportunity cost and does not guarantee improved decision-making performance in practical applications.
It is not surprising that the proposed PTONet achieves the highest PUC and AUTGC, as it is specifically designed based on PUC. However, it does not outperform other methods on alternative evaluation metrics.
Other Comments or Suggestions
No.
Thank you for your feedback. We will address each of your concerns one by one.
Weak 1: A key concern is that the approach does not improve the worst-case scenario, ... conventional metrics at least ensure that PE^{TP} is ranked no lower than second place. ..., it also introduces a tradeoff in opportunity cost and does not guarantee improved decision-making performance in practical applications.
Response 1: Thank you for your concern. As stated in the original manuscript, although PUC is not a perfect metric—it cannot fully identify the four groups—it is more suitable than conventional metrics for evaluating uplift models. When the budget is limited, in the worst-case scenario, the PUC metric at least ensures less harmful decisions compared to conventional metrics. In the best-case scenario, PUC guarantees both less harmful decisions and the highest possible gains. Therefore, our metric outperforms conventional metrics.
Specifically, regarding 'conventional metrics ensuring that PE^{TP} is ranked no lower than second place': this overlooks the severe issue that persuadable individuals with negative observed outcomes (the CN cell) can be ranked behind sleeping dogs. This means that using regular metrics for causal ranking in an uplift model forces decision-makers to target all potential customers (persuadables in both the TP and CN cells) at the expense of customers who would have originally clicked or purchased (the sleeping dogs). Such a strategy should be avoided in practice, as it harms a large group of customers.
Even with a budget that covers only the persuadables with positive observed outcomes and not those with negative observed outcomes, the regular-metrics-guided model loses a large number of potential customers. This misstep can mislead decision-makers into believing there are no further growth opportunities when, in fact, potential customers remain untapped. (Please refer to Figure 2 in the original paper; in the worst-case scenario, samples from 3. to 6. are considered neither beneficial nor harmful under conventional metrics.)
In contrast, the PUC metric does not exhibit this issue in the worst-ranking case. If decision-makers realize that a small-budget promotion won't yield incremental returns, they can continue to expand the budget until the PUC slows down or the promotion is scaled back, without harming customer interests or missing potential customers. (Refer to Figure 3 in the original paper; decision-makers can clearly identify that only the groups from 1. to 4. are yielding benefits.)
Most importantly, in the best ranking case, PUC guided models can achieve the highest-gain decisions with the minimal budget, while conventional metrics cannot.
Therefore, the PUC metric should be used over regular metrics to select an uplift model that accurately targets potential customers without alienating existing customers who are willing to purchase. Future work can improve upon the limitation of PUC's inability to identify the four groups, but regular curves should no longer be used for uplift model evaluation.
Finally, we thank you for suggesting the application scenarios in advertising and recommendation.
Weak 2: It is not surprising that the proposed PTONet achieves the highest PUC and AUTGC, as it is specifically designed based on PUC. However, it does not outperform other methods on alternative evaluation metrics.
Response 2: Thank you for your concern. Our simulated data is not specifically designed for PUC; our data generation process is simple and easy to follow:
The treatment assignment probability is designed to simulate the real-world scenario in our business data, where the number of treated samples is significantly smaller than the number of control samples. This also ensures differentiation from the Criteo and Lazada datasets.
In the outcome functions, the sine and cosine terms are introduced to incorporate nonlinearity, while the different coefficients are used to adjust the proportion of samples with $y = 1$. This adjustment helps simulate our real-world business scenario, where positive outcomes are relatively rare. For details on the proportion of the treated group and the positive outcome rate, please refer to Table 10 in the original paper.
PTONet performs suboptimally with regular metrics but achieves the best performance with the PUC metric, which further validates the issue of bias in regular metrics highlighted in this paper, and confirms that the PU loss function can directly improve the model's performance on the PUC metric.
Thank you once again for your valuable feedback. If you have any further concerns or questions, we are always happy to address them. If you feel that our responses have addressed your concerns, we would appreciate it if you could consider raising your recommendation score.
This paper proposes a new evaluation metric, the Principled Uplift Curve (PUC), which assigns equal importance to individuals with positive and negative outcomes and offers an unbiased evaluation of uplift models. The authors derive a new loss function with a new model architecture to reduce bias during uplift model training.
Questions for Authors
- Could you give some intuition about the discussion in Section 3? What does the max curve indicate in Figure 3?
- Is the experiment for the correlation between AUUQC and AUTGC as described in Appendix I? Do the results still hold for other data distributions?
Claims and Evidence
The paper claims that the traditional uplift and Qini curves can lead to biased evaluations; it proposes the Principled Uplift Curve (PUC) and compares it with other evaluation metrics via their correlation with AUTGC.
The proposed method is demonstrated to outperform the existing method with both synthetic and real-world datasets.
Methods and Evaluation Criteria
The proposed architecture is evaluated on multiple real-world benchmark datasets and evaluation metrics.
Theoretical Claims
I checked the derivation of individual contributions in Appendix D and did not find any issue with it.
Experimental Design and Analysis
I checked the experiment for the proposed model and evaluation metrics.
The model is shown to outperform the existing models in the proposed evaluation metric.
The proposed metric is shown to be more reliable with synthetic data. I think there could be more discussion and careful experiments on this.
Supplementary Material
I did not review the supplementary material.
Relation to Existing Literature
Evaluation is a challenging problem in this area. This paper proposes a new evaluation metric and shows it is more reliable than the existing ones, which could be a nice contribution to uplift modeling.
Essential References Not Discussed
I'm not aware of any essential references that are not discussed.
Other Strengths and Weaknesses
Pros:
- The paper tackles a challenging problem in uplift modeling and proposes a method to improve the modeling and evaluation.
- The proposed method is evaluated on many real-world benchmarks.
Cons: The proposed evaluation metric could be discussed in more detail since it is an essential part of the proposed method.
Other Comments or Suggestions
NA
Thank you for your positive feedback. We will address your concerns and questions one by one.
Cons: The proposed evaluation metric could be discussed in more detail since it is an essential part of the proposed method.
Response 1: Thank you for your concern. Based on your suggestion, we provide additional discussion of the evaluation metric as follows:
- Providing intuition behind the maximum PUC value in Equation (8): it is achieved when all samples with $t=1$ and $y=1$, as well as those with $t=0$ and $y=0$, are ranked ahead of samples with $t=1$ and $y=0$, as well as those with $t=0$ and $y=1$.
- Repositioning the intuition behind the PUC metric (Proposition 4.1): we will move the first paragraph of explanation after Proposition 4.1 to the paragraph after formula (9) to improve readability.
- Clarifying the intuition behind $g(t, y)$: we will clarify that $g(t, y)$ assigns a value of 1 to samples with $t=1$ and $y=1$, as well as those with $t=0$ and $y=0$, while samples with $t=1$ and $y=0$, as well as those with $t=0$ and $y=1$, are assigned a value of 0. Using $g(t_i, y_i)$ as the label, we train a binary classifier to constrain $\hat{\tau}_i$, ensuring that samples with $t=1$ and $y=1$, as well as those with $t=0$ and $y=0$, have a larger $\hat{\tau}_i$, whereas samples with $t=1$ and $y=0$, as well as those with $t=0$ and $y=1$, have a smaller $\hat{\tau}_i$.
- Providing intuition behind the Principled Uplift loss: this loss function encourages the CATE of samples with $t=1$ and $y=1$, as well as those with $t=0$ and $y=0$, to be greater than the CATE of samples with $t=1$ and $y=0$, as well as those with $t=0$ and $y=1$ (see the sketch after this list).
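One standard way to encode this last pairwise constraint is a hinge-style ranking loss over (TP/CN, TN/CP) pairs; this is a sketch of the idea under our assumptions, not necessarily the paper's actual loss:

```python
import torch

def pairwise_cate_ranking_loss(tau_hat, t, y, margin=0.1):
    # Encourage tau_hat of TP/CN units (t == y) to exceed tau_hat of
    # TN/CP units (t != y) by at least `margin`.
    pos = tau_hat[t == y]
    neg = tau_hat[t != y]
    if pos.numel() == 0 or neg.numel() == 0:
        return tau_hat.new_zeros(())
    diff = neg.unsqueeze(0) - pos.unsqueeze(1) + margin  # all (pos, neg) pairs
    return torch.clamp(diff, min=0).mean()
```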
We appreciate your feedback and will incorporate these intuitions in the final version of our paper.
Question 1: Could you give some intuition about the discussion in section 3? What does the max curve indicate here in Fig 3.
Response 2: Thank you for your question. The intuition behind the discussion in Section 3 is that we aimed to verify whether SUC and other regular metrics reach their maximum values only when the causal effect ranking is completely accurate. If this were the case, then SUC would be a reliable metric. However, we found that even when the causal effect ranking is entirely correct, SUC does not always attain its maximum value. On the contrary, certain incorrect causal effect rankings can lead to SUC achieving its highest score (refer to Tables 2 and 3). This observation led us to further investigate SUC and related formulas, ultimately inspiring this paper.
In Figure 3, the max curve corresponds to the maximum PUC value in Equation (8). We have explained its underlying intuition in Response 1.
Question 2: Is the experiment for the correlation between AUUQC and AUTGC as described in Appendix I? Do the results still hold for other data distribution?
Response 3: Thank you for your question. Yes, the experimental setup in this paper is described in Appendix I. The results still hold for other data distributions, as long as the data is RCT data.
Thank you once again for your valuable feedback. If you have any further concerns or questions, we are always happy to address them. If you feel that our responses have addressed your concerns, we would appreciate it if you could consider raising your recommendation score.
The paper discusses the limitations of Qini curves/uplift models, focusing on bias in evaluation and offering a solution through PUC. Reviewers agreed that evaluation is challenging in this area. The authors adequately addressed concerns raised during the discussion, including that the approach does not improve the worst-case scenario, and the relation to Yadlowsky et al.'s AUTOC evaluation measure.