PaperHub
Score: 7.2 / 10 · Spotlight · 4 reviewers
Ratings: 3, 4, 4, 4 (lowest 3, highest 4, standard deviation 0.4)
ICML 2025

Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

TL;DR

Inspired by the log-sum-exponential operator, we propose a novel estimator for off-policy learning and evaluation under a heavy-tailed assumption on the weighted reward.

Abstract

Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming bounded $(1+\epsilon)$-th moment of weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon\in[0,1]$ and $n$ is the size of logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning .
Keywords
off-policy learning, off-policy evaluation, log sum exponential, regret bound, estimation bound, concentration, bias and variance, robustness, heavy-tailed reward

Reviews and Discussion

Review (Rating: 3)

The paper introduces a Log-Sum-Exponential (LSE) estimator to address off-policy evaluation (OPE) and learning (OPL) in contextual bandit settings where rewards or propensity scores may be noisy or heavy-tailed. By applying a log-sum-exp transformation over importance-weighted rewards, this estimator improves robustness and variance control. The main theoretical contributions include a regret bound of $\mathcal{O}\left(n^{-\frac{\varepsilon}{1+\varepsilon}}\right)$ under a $(1+\varepsilon)$-th moment assumption on the weighted reward, and bias–variance analyses showing LSE can reduce variance compared to IPS. Experiments on synthetic data and supervised-to-bandit tasks (e.g., EMNIST) demonstrate lower MSE and higher accuracy than standard baselines (IPS variants, ES, PM, IX, etc.).

Questions for Authors

How does one best pick $\lambda$ in practical scenarios? Is there a recommended cross-validation strategy you find most reliable?

Could the LSE concept apply similarly in off-policy RL with trajectories? Are there any theoretical or practical obstacles?

Have you tried combining LSE with direct reward modeling or a doubly robust approach? If so, does the non-linear structure complicate it?

Claims and Evidence

Major Claims

  • Lower variance and bias–variance trade-off: By taking a log-sum-exp over the weighted rewards, the estimator becomes less sensitive to large outlier samples.

  • Robustness under heavy-tailed reward distributions: They provide a theoretical analysis under a $(1+\varepsilon)$-th moment assumption, which accommodates unbounded or heavy-tailed random variables.

  • Favorable regret/convergence bounds in off-policy learning: They prove an $\mathcal{O}\left(n^{-\frac{\varepsilon}{1+\varepsilon}}\right)$ convergence rate, which interpolates between $\mathcal{O}(n^{-1/2})$ and $\mathcal{O}(n^{0})$ depending on the bounding of higher moments.

  • Empirical gains: The authors demonstrate in synthetic and real-ish tasks (EMNIST or KUAIREC-like data) that the LSE estimator can outperform baselines in terms of MSE (in OPE) or final policy accuracy (in OPL).

Evidence

  • Theory: Theorem 5.3 (regret bounds) and Propositions 5.5/5.7 (bounds on bias and variance) offer analytical proof that LSE yields guaranteed convergence rates and improved variance control, provided certain moment assumptions.

  • Experiments: On synthetic data, the LSE exhibits lower MSE and variance than standard IPS or other model-free estimators (PM, ES, IX, OS). On EMNIST-based bandit tasks, it achieves good accuracy even with noisy reward signals or estimated propensity scores.

These results collectively support the authors’ main claims. Some details (like data-driven choice of $\lambda$) are shown in appendices, along with further ablations and real-data results.

Methods and Evaluation Criteria

Core method:

  • The LSE estimator, $V_{\text{LSE}}^\lambda$, applies $\frac{1}{\lambda} \log \left(\frac{1}{n} \sum_{i=1}^{n} e^{\lambda r_{i} w_{\theta}\left(a_{i}, x_{i}\right)}\right)$ with a tunable parameter $\lambda<0$.
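
A minimal NumPy sketch of this estimator, with plain IPS for comparison (the function names and argument layout are illustrative, not taken from the authors' repository; the importance weights are assumed to be $\pi_\theta/\pi_0$):

```python
import numpy as np
from scipy.special import logsumexp

def lse_estimate(rewards, target_probs, logging_probs, lam=-1.0):
    """Log-sum-exponential estimate of the target policy's value.

    rewards       : logged rewards r_i
    target_probs  : pi_theta(a_i | x_i) under the target policy
    logging_probs : logged propensity scores pi_0(a_i | x_i)
    lam           : LSE parameter lambda (negative for variance reduction)
    """
    w = target_probs / logging_probs            # importance weights w_theta(a_i, x_i)
    z = lam * rewards * w
    # (1/lambda) * log( (1/n) * sum_i exp(lambda * r_i * w_i) ), computed stably
    return (logsumexp(z) - np.log(len(rewards))) / lam

def ips_estimate(rewards, target_probs, logging_probs):
    """Standard inverse propensity score estimate, for comparison."""
    return np.mean(rewards * target_probs / logging_probs)
```

As $\lambda \to 0^-$ the LSE estimate approaches the IPS estimate, while more negative values of $\lambda$ damp large weighted rewards more aggressively.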

Evaluation:

  • In OPE, they empirically compare MSE, variance, and bias across multiple estimators.

  • In OPL, a learned policy $\pi_{\theta}$ is optimized by maximizing the LSE-based objective. The main metric is regret or final classification accuracy after learning.

  • Baselines: The paper carefully compares against standard IPS variants (e.g., truncated IPS), exponent smoothing (ES), power-mean (PM), self-normalized IPS (SNIPS), and so on.

All these benchmarks and metrics (variance, MSE, accuracy, regret) align with accepted practice in bandit OPE/OPL research.

Theoretical Claims

The authors present:

  • Regret Analysis (Theorem 5.3 & Proposition 5.4): Shows the LSE-based OPL converges at $\mathcal{O}(n^{-\tfrac{\varepsilon}{1+\varepsilon}})$ under a heavy-tailed assumption on weighted rewards.

  • Bias and Variance Bounds (Propositions 5.5 & 5.7): Provide an upper bound on LSE’s bias in terms of $|\lambda|^{\varepsilon}$ and a straightforward variance bound that is no greater than that of IPS under second-moment assumptions.

  • Robustness: Theorem 5.9 addresses noisy reward distributions, bounding the regret to show that total variation distance from the clean distribution plus the LSE’s own hyperparameter $\lambda$ determine the final bound.

These proofs are given at length in the appendices, citing standard concentration inequalities (Bernstein, etc.) and carefully handling non-linear transformations. The logic in the statements is consistent, and no major errors stand out in the derivations.

Experimental Design and Analysis

  • Synthetic OPE: They design heavy-tailed reward distributions (Pareto-like or exponential) and random logging vs. target policies, measuring bias/variance/MSE across 10K replications.

  • Supervised-to-bandit OPL: On EMNIST, the authors transform classification data into bandit feedback with partial logging. They vary logging policy quality (temperature $\tau$) and introduce both noisy propensity scores (modeled via inverse Gamma) and flipping reward noise (a schematic code sketch of this conversion follows after this list).

  • Baselines: They compare with up to 7–8 standard baseline methods.

  • Metrics: MSE in OPE; classification accuracy for learned policies in OPL.

  • Findings: LSE typically has lower MSE and variance than baselines in OPE, and yields better policy performance in OPL, especially under heavier reward tails or noisier propensity scores.
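
As a rough illustration of this supervised-to-bandit conversion, here is a minimal sketch assuming a softmax-with-temperature logging policy over pre-computed logits; the helper name, the noise-injection details, and the defaults are simplifications, not the paper's exact setup:

```python
import numpy as np

def to_bandit_feedback(logits, labels, tau=1.0, flip_prob=0.0, gamma_shape=None, seed=0):
    """Convert supervised data (logits, labels) into logged bandit feedback.

    A softmax-with-temperature logging policy picks one action per context;
    the reward is 1 if the action matches the true label, 0 otherwise.
    Rewards can be flipped with probability flip_prob, and propensities can be
    perturbed with multiplicative inverse-Gamma noise (one possible noise model).
    """
    rng = np.random.default_rng(seed)
    scaled = logits / tau
    probs = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # logging policy pi_0(.|x)

    n, k = probs.shape
    actions = np.array([rng.choice(k, p=p) for p in probs])
    rewards = (actions == labels).astype(float)
    propensities = probs[np.arange(n), actions]

    if flip_prob > 0:                                     # reward noise
        flips = rng.random(n) < flip_prob
        rewards = np.where(flips, 1.0 - rewards, rewards)
    if gamma_shape is not None:                           # noisy propensity scores
        propensities = propensities / rng.gamma(gamma_shape, 1.0 / gamma_shape, n)

    return actions, rewards, propensities
```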

These experiment designs match the paper’s stated goals, although the authors might highlight more real-world-scale tasks in future work. The demonstration is nonetheless credible for a conference submission.

Supplementary Material

The authors present:

  • Extended proofs of the theorems (in Appendix D), including details on how the heavy-tail assumption is invoked.

  • Additional experiments: e.g., using the Lomax distribution for logging and target policies, exploring real-ish data from KUAIREC, sensitivity to $\lambda$, ablations on sample size, comparisons with older baselines, etc.

  • PAC-Bayes viewpoint, data-driven selection of $\lambda$, and additional details about the gamma noise for estimated propensity scores.

I have skimmed through the relevant appendices. The supplementary materials add clarity on both theoretical derivations and experiment details.

Relation to Prior Work

  • The paper extends model-free OPE approaches (IPS, truncated IPS, ES, PM, IX, etc.) by introducing a non-linear transformation that can handle outliers.

  • The approach is reminiscent of “robust mean estimation” under heavy tails (median-of-means, trimmed mean), but specialized to importance-weighted bandit feedback.

  • The results connect to well-known bandit or offline RL settings under “pessimistic” or distribution-shift assumptions, though here it focuses specifically on the LBF dataset with (potentially) heavy-tailed or noisy rewards.

Overall, the authors do cite standard references (IPS, ES, PM, etc.) and situate their work among current OPE/OPL techniques. Additional connections to sub-Gaussian bounding and robust M-estimators are also discussed.

Missing Essential References

The paper covers the most relevant prior model-free approaches to mitigate variance in off-policy evaluation (IPS, truncated, ES, PM, etc.). It also references works on heavy-tailed or robust bandit algorithms. No obviously critical references are missing. I believe the submission includes enough prior work for the standard context.

Other Strengths and Weaknesses

Strengths

  • This log-sum-exp transformation is conceptually simple yet yields robust performance for unbounded or noisy data.

  • Theoretical thoroughness: They provide formal bias–variance bounds and regret guarantees that unify heavy-tailed analysis with a single hyperparameter $\lambda$.

  • Robustness: The authors show how the LSE estimator can remain stable even with substantial reward noise or propensity mis-specification.

  • Empirical validations: They compare thoroughly to an array of baselines.

Weaknesses

  • The paper discusses data-driven $\lambda$-selection methods in the appendices, but real deployments might need more straightforward or adaptive selection heuristics.

  • The approach is a single-pass aggregator, so computational complexity is no worse than other OPE methods, but potential memory or tuning overhead for extremely large datasets is not deeply addressed.

  • They rely mostly on small- to medium-scale supervised-to-bandit transformations (EMNIST or partial KUAIREC). The method’s performance on large, production-scale logs (with tens of millions of rows) is not tested.

Other Comments or Suggestions

See the above weaknesses section.

Author Response

We thank the reviewer for their careful reading and comments on the paper. We will address their concerns as detailed below.

single-pass aggregator

R1: Thanks for raising this point. It is true that, theoretically, LSE is not separable and should be applied to the whole dataset. However, due to the nice gradient form of LSE, fully stochastic optimization of LSE is also possible for large-scale datasets. Suppose that $x_1, \ldots, x_{kl}$ are the data of $k$ batches of size $l$. LSE is a quasi-arithmetic mean function, so we have

$$\mathrm{LSE}\left(x_1, \ldots, x_{kl}\right) = \mathrm{LSE}\left(A_1, A_2, \ldots, A_k\right),$$

where $A_j$ denotes the LSE of the $j$-th batch.

Now we have

$$\nabla_\theta\, \mathrm{LSE}\left(A_1, \ldots, A_k\right) = \sum_{j=1}^{k} \frac{e^{\lambda A_j}}{\sum_{i=1}^{k} e^{\lambda A_i}}\, \nabla_\theta A_j .$$

So the gradient of the LSE on the entire data is a weighted average of the gradients of the individual batches, but these weights are not uniform as in the case of the Monte Carlo mean. We use the following procedure for SGD optimization.

We store a coefficient $c_i$ for each $A_i$. Our final objective is to find $c_i$ such that $c_1, \ldots, c_k$ are proportional to $\mathrm{softmax}(\lambda A_i)$. For this to happen, suppose we are at step $t+1$ and we have $c_1, \ldots, c_t$ as the coefficients of $A_1, \ldots, A_t$. We now want to find $c_{t+1}$ and apply GD on $c_{t+1} A_{t+1}$. It is sufficient to have the following equality:

$$\frac{e^{\lambda A_{t+1}}}{e^{\lambda A_{t}}} = \frac{c_{t+1}}{c_t}.$$

Hence, at step $t$ we set $c_{t+1} = c_t\, e^{\lambda \left(A_{t+1} - A_t\right)}$ and apply gradient descent as

$$\theta \leftarrow \theta - \eta\, c_{t+1}\, \nabla_\theta A_{t+1},$$

where $\eta$ is the learning rate and $c_1 = 1$. We will add this procedure for batch optimization as a discussion in the paper.
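
To make the bookkeeping concrete, a minimal PyTorch sketch of this procedure (the linear softmax policy class, the tensor layout, and the function names are assumptions for illustration, not the authors' released implementation):

```python
import torch

def lse_batch(theta, contexts, actions, rewards, logging_probs, lam):
    """A_t = (1/lam) * log( mean_i exp(lam * r_i * w_theta(a_i, x_i)) ) on one batch.

    theta parameterizes a linear softmax policy pi_theta(a | x); this policy
    class is only an illustrative choice.
    """
    logits = contexts @ theta                                     # (batch, n_actions)
    pi = torch.softmax(logits, dim=1)
    w = pi[torch.arange(len(actions)), actions] / logging_probs   # importance weights
    z = lam * rewards * w
    n = torch.tensor(float(len(rewards)))
    return (torch.logsumexp(z, dim=0) - torch.log(n)) / lam

def lse_streaming_sgd(theta, batches, lam=-1.0, eta=1e-2):
    """Mini-batch optimization of the LSE objective with running coefficients c_t.

    Keeps c_t proportional to softmax(lam * A_t) across batches via
    c_{t+1} = c_t * exp(lam * (A_{t+1} - A_t)), then takes a gradient step
    on c_{t+1} * A_{t+1}, as described above.
    """
    opt = torch.optim.SGD([theta], lr=eta)
    c, prev_A = 1.0, None
    for contexts, actions, rewards, logging_probs in batches:
        A = lse_batch(theta, contexts, actions, rewards, logging_probs, lam)
        if prev_A is not None:
            c *= float(torch.exp(lam * (A.detach() - prev_A)))
        opt.zero_grad()
        (-c * A).backward()        # SGD minimizes, so negate to ascend on c_t * A_t
        opt.step()
        prev_A = A.detach()
    return theta
```

Here `theta` would be a parameter tensor such as `torch.zeros(n_features, n_actions, requires_grad=True)`, and `batches` an iterable of `(contexts, actions, rewards, logging_probs)` tensors.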

KUAIREC and real-world dataset

R2: Thank you for the insightful comment. We agree that evaluating performance on production-scale datasets is crucial. KuaiRec’s 12M-interaction ("big") dataset (7K users, 10K items) was used for training and validation, while the denser ("small") subset was used solely for evaluation, given its density. We will clarify this in the final version to emphasize that our method has indeed been evaluated on a dataset of production-level scale. Furthermore, we conducted additional experiments on further datasets, including the PubMed 200K RCT dataset.

Choosing $\lambda$ in practical scenarios

R3: We use three types of $\lambda$ selection throughout the paper: one based on validation-data performance (grid search), one that is data-independent (App. G.7), and one that is data-driven (App. G.8). The first method gives the best performance and is recommended when it is computationally feasible, but for large-scale datasets the data-driven approach can work especially well, because it does not require any prior or heavy computation and can find a suitable $\lambda$ from the data.
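
As an illustration of the grid-search option, a short sketch with hypothetical `train_fn`/`eval_fn` helpers (placeholders, not functions from the paper's codebase):

```python
def select_lambda(train_fn, eval_fn, lambda_grid=(-1e-3, -1e-2, -1e-1, -1.0, -10.0)):
    """Pick the LSE parameter lambda by validation performance.

    train_fn(lam) -> policy trained with the LSE objective at that lambda
    eval_fn(policy) -> validation score (e.g. accuracy in supervised-to-bandit OPL)
    """
    scores = {lam: eval_fn(train_fn(lam)) for lam in lambda_grid}
    best = max(scores, key=scores.get)
    return best, scores
```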

LSE in off-policy RL with trajectories

R4: Thank you for the insightful suggestion. In principle, the LSE operator could be applied in off-policy reinforcement learning, potentially serving as an alternative to traditional importance sampling. However, it brings theoretical (e.g., requiring i.i.d. assumptions) and practical challenges (e.g., computational overhead in algorithms like PPO [1]). We consider this a promising direction and plan to explore it in future work.

LSE with direct reward modeling or a doubly robust approach

R5: Thank you for the suggestion. Estimators that incorporate reward modeling fall under model-based approaches, while those that do not are considered model-free. In this work, we primarily focus on model-free estimators. However, we did explore combining our estimator with the doubly robust (DR) approach, as discussed in Appendix G.3, and found that the resulting DR-LSE variant outperforms other baselines. Importantly, the non-linear structure of LSE does not introduce significant complications in this integration. For a fair comparison, we did not explore reward modeling based directly on the LSE framework, as our definition of LSE is grounded in weighted rewards rather than direct reward estimation. Nonetheless, we find this direction promising and plan to consider it in future work.


References:

[1] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

Review (Rating: 4)

The paper introduces a novel Log-Sum-Exponential (LSE) estimator for off-policy evaluation (OPE) and off-policy learning (OPL) in reinforcement learning, especially when dealing with logged bandit feedback datasets that may contain unbounded or heavy-tailed rewards. The paper analyzes the LSE estimator's regret bounds, bias, and variance, and it also explores its robustness to noisy rewards and propensity scores. The LSE estimator's performance is empirically compared against several baseline estimators, including truncated IPS, PM, ES, IX, Bandit-Net, LS-LIN, and OS, using both synthetic and real-world datasets. The paper also offers theoretical insights into why the LSE estimator is well-suited for scenarios with heavy-tailed reward distributions and provides guidelines for selecting the estimator's parameter $\lambda$. The experimental code and data are also provided to support claims about its effectiveness.

Questions for Authors

Can the authors use RCT data to validate the proposed method?

Claims and Evidence

The claims of this paper are generally supported by its arguments.

Methods and Evaluation Criteria

All evaluations are based on simulations. The results will be more convincing if data from randomized controlled trials can be used to validate the proposed method.

Theoretical Claims

The theoretical claims and proofs of the paper are rigorous and valid.

Experimental Design and Analysis

The simulation experiments are comprehensive and convincing. That said, I think the experiments and validation could be stronger if data from RCTs can be used to evaluate the proposed LSE-based method.

Supplementary Material

The supplementary materials are comprehensive.

Relation to Prior Work

The main contributions of this paper are three-fold. First, the authors develop a novel non-linear estimator based on the LSE operator, which substantially reduces the variance. Second, comprehensive theoretical performance guarantees of LSE-based OPE and OPL are provided. Third, simulated experiments show that the proposed estimator performs well compared with other SOTA algorithms.

Missing Essential References

Not aware of.

Other Strengths and Weaknesses

I think the paper can be strengthened with evaluations based on experimental data.

Other Comments or Suggestions

N.A.

Author Response

We thank the reviewer for their comments, and generally positive assessment of the paper. We will address their concerns as detailed below.

The simulation experiments are comprehensive and convincing. That said, I think the experiments and validation could be stronger if data from RCTs can be used to evaluate the proposed LSE-based method. Can the authors use RCT data to validate the proposed method?

R1: We appreciate the reviewer's suggestion to run experiments on an RCT dataset. The results can be found at the following link.

Review (Rating: 4)

The paper proposes an estimator based on the log-sum-exponential (LSE) operator designed for off-policy evaluation (OPE) and off-policy learning (OPL) in contextual bandit settings. The LSE estimator addresses the issue of high variance in inverse propensity score (IPS) estimators by introducing robustness to noisy propensity scores and heavy-tailed reward distributions. The authors provide theoretical guarantees on the bias, variance, and regret for this estimator, with particular focus on its performance under heavy-tailed assumptions on weighted rewards. Empirical results from synthetic experiments and real-world datasets validate the practical effectiveness of the proposed method compared to existing estimators like IPS and others.

Questions for Authors

  • Could you clarify how the smoothing parameter $\lambda$ affects the performance of the LSE estimator?
  • Have you considered experimenting with a broader range of real-world datasets to assess the robustness of the LSE estimator in different domains?

Claims and Evidence

The paper makes different claims, which seem to be supported by theoretical results and empirical evidence.

First, it claims that the LSE estimator reduces variance and handles heavy-tailed reward distributions more effectively than the IPS estimator and its variants. This claim is supported by a detailed theoretical analysis, which includes bounds on bias, variance, and regret.

Empirically, the paper shows that the LSE estimator performs better in terms of mean squared error (MSE) and variance compared to competing methods in both synthetic and real-world experiments. This part, in my opinion, can be strengthened with an additional experiment.

Methods and Evaluation Criteria

The proposed LSE estimator is evaluated both theoretically and empirically. This is the usual way of evaluating OPE/OPL methods.

Theoretical Claims

From what I could check, the theoretical claims in the paper seem well-supported.

Experimental Design and Analysis

The experimental design appears sound, but there are some limitations in the scope of the experiments. The experimental setup could benefit from more diverse datasets and a wider range of experimental conditions, for instance experiments involving different types of reward distributions and more real-world applications.

Supplementary Material

The supplementary material, including proof details, additional experiments, and discussion on related work, is thorough. However, there could be further clarification regarding the role of the smoothing parameter $\lambda$ and its effect on the performance across different types of reward distributions.

Relation to Prior Work

The paper is related to prior work in the area of off-policy evaluation and learning. Also, it is related to heavy-tailed bandits, which is not typical in the OPE/OPL literature. I think that the authors did a good job in their related work sections, which position the paper with respect to specific related prior papers.

Missing Essential References

I think that the authors did a good job in their related work sections and all the essential references are more or less present in the paper.

Other Strengths and Weaknesses

In my opinion, the main weakness of the paper (apart from the experimental section, which could be expanded, a point already discussed) is the apparent lack of novelty. Essentially, the paper applies the well-known log-sum-exponential (LSE) technique to OPE/OPL.

However, despite this perceived lack of novelty, I believe the paper still meets the high standards of ICML due to its interesting theoretical analysis and the strong performance of the LSE estimator in the OPE/OPL domains. To the best of my knowledge, LSE has not previously been applied to OPE/OPL, and this work may represent the first contribution demonstrating that LSE can be a valuable addition to the OPE/OPL toolkit.

Other Comments or Suggestions

Very minor issue: there seem to be some inconsistencies in the References section.

Author Response

We thank the reviewer for their comments, and generally positive assessment of the paper. We will address their concerns as detailed below.

Novelty

R1: We appreciate the reviewer’s feedback and the opportunity to clarify the novelty of our work. While it is true that we employ the log-sum-exponential (LSE) technique, one of our key contributions lies in the novel theoretical results we derive in the context of OPE/OPL. Specifically, our analysis establishes new insights that were not previously explored in the literature. These results go beyond a straightforward application of LSE, as we introduce a novel formulation and provide rigorous proofs that reveal deeper theoretical properties of the method in this setting under heavy-tailed assumptions.

Additionally, while LSE is a well-known technique/operator, its application to OPE/OPL presents unique challenges, which we address through our theoretical analysis. We believe these contributions offer a meaningful advancement in understanding LSE for heavy-tailed applications.

We hope this clarification helps address the reviewer’s concern, and we would be happy to further elaborate on the specific theoretical novelties if needed.

For instance, experiments involving different types of reward distributions and more real-world applications. Have you considered experimenting with a broader range of real-world datasets to assess the robustness of the LSE estimator in different domains?

R2: We conducted more experiments on an RCT dataset; the results can be found at the following link. In addition, we conducted further experiments with multiple reward distributions. We fixed the distribution of the policies to be Gaussians with different locations. To test a variety of heavy-tailed distributions, we assume that we have a single state $s = s_0$ and that when $a \mid s_0 \sim \pi_0$, the distribution of $r(s_0, a)$ belongs to a particular family. We considered Lomax, Generalized Extreme Value (GEV), Student-t, and Fréchet distributions. We take the absolute value of the samples to ensure the reward is positive. A table comparing the performance of our method with the other methods is available here.
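
For reference, a small SciPy sketch of how such positive heavy-tailed rewards could be sampled (the shape parameters are illustrative assumptions, not the exact settings used in these experiments):

```python
import numpy as np
from scipy import stats

def heavy_tailed_rewards(family, size, seed=0):
    """Sample positive heavy-tailed rewards by taking absolute values."""
    dists = {
        "lomax":   stats.lomax(c=2.0),
        "gev":     stats.genextreme(c=-0.3),   # heavy upper tail
        "t":       stats.t(df=2.5),
        "frechet": stats.invweibull(c=2.0),    # Frechet distribution
    }
    return np.abs(dists[family].rvs(size=size, random_state=seed))
```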

Could you clarify how the smoothing parameter $\lambda$ affects the performance of the LSE estimator?

R3: The effect of $\lambda$ for the supervised-to-bandit experiments, where the reward is binary, is investigated in App. G.6. For continuous heavy-tailed reward distributions, in the same setting mentioned in R2, we vary $|\lambda| \in \{10^i \mid -3 \leq i \leq 2,\ i \in \mathbb{Z}\}$ and observe the change in the MSE of the LSE estimator. The graphs are available at this link.

inconsistencies in the References section

R4: Thanks for pointing this out. It is now fixed.

Reviewer Comment

Thank you very much for the additional experiments and the clarifications! They addressed many of my concerns. I raised my score accordingly.

Author Comment

Dear Reviewer Ve41,

We just wanted to sincerely thank you for taking the time to carefully read our rebuttal and for your thoughtful consideration of our responses. We truly appreciate the constructive feedback you provided throughout the review process, and we are grateful for your support and for the updated evaluation of our work.

Your detailed comments and suggestions were very helpful to us in improving our paper, and we are glad that our clarifications could address your concerns.

Thank you again for your time, effort, and support.

Best regards,

Authors

Review (Rating: 4)

The paper proposes to use the log-sum-exponential operation as an off-policy estimator, proving bounds on the mean and variance of the estimates, as well as the performance gap between the optimal and learned policies in off-policy learning, and convergence rates. They follow this up with empirical evaluations.

The estimator is chosen specifically to deal with the heavy-tailed distributions resulting from inverse propensity weights, so their analysis holds under heavy-tailed assumptions, which additionally cover unbounded rewards, making the analysis applicable in a variety of domains.

Questions for Authors

It seems that these results could potentially be related to a Bayesian setting with exponential family distributions. Have the authors thought about this? I'm curious about the results, and it would increase my (already positive) opinion of the paper, even though I don't think it needs to be included here.

Claims and Evidence

There are two sets of claims to the paper.

The first is that the proposed estimator reduces variance and has provable bounds on bias and variance under heavy-tailed and noisy reward observations. Further, they provide regret bounds in the off-policy evaluation and learning setups, as well as regret convergence rates under heavy-tailed settings that include unbounded rewards. These claims are primarily and adequately supported by theoretical results.

The second set of claims is that the proposed estimator has competitive performance.

Methods and Evaluation Criteria

Yes, they do a standard setup (running various estimators over 10k trials of 1k samples each, and calculating mean squared error and variance) for OPE on a synthetic data distribution and OPL on the EMNIST dataset.

Theoretical Claims

Not carefully

Experimental Design and Analysis

Yes, the experimental designs for OPE and OPL were checked

Supplementary Material

I poked through the proofs but not in detail

Relation to Prior Work

They compare against several similar off-policy estimators

Missing Essential References

None that I am aware of

Other Strengths and Weaknesses

The work is original as far as I know.

The work is significant, being applicable to a common setting and addressing a key problem of heavy-tailed rewards.

The paper is consistently clearly written.

Other Comments or Suggestions

none

Author Response

We thank the reviewer for their comments, and generally positive assessment of the paper. We will address their concerns as detailed below.

It seems that these results could potentially be related to a Bayesian setting with exponential family distributions. Have the authors thought about this? I'm curious about the results, and it would increase my (already positive) opinion of the paper, even though I don't think it needs to be included here.

R1: We thank reviewer 2WnN for their insightful suggestion. An interesting direction for future work is to explore the potential application of our theoretical results in Bayesian inference frameworks, particularly through the lens of exponential family distributions and variational approximations. As the LSE can be interpreted as the log-partition function of an exponential family distribution, our methods may be applicable in this area.

Final Decision

This paper proposes to use the log-sum-exp (LSE) operation as an off-policy estimator for both OPE and OPL and demonstrates its effectiveness both theoretically and empirically. The paper received positive scores across the board, where all the reviewers praise the novelty of using LSE in the context of OPE/OPL, as well as the solid technical and theoretical contribution. There were some minor concerns/clarifications which were largely resolved during the author-reviewer discussion phase. Given the overall very positive feedback, I am voting for accept.