PaperHub
Overall rating: 6.0/10 · Poster · 4 reviewers (min 3, max 4, std 0.4)
Ratings: 4, 4, 4, 3 · Confidence: 2.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.3 · Significance: 2.0
NeurIPS 2025

Leveraging semantic similarity for experimentation with AI-generated treatments

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Experimentation · Kernel methods · Causal inference · Embeddings · LLM-generated treatments

Reviews and Discussion

Review (Rating: 4)

This work proposes a kernel-based representation learning method that models treatment effects as an inner product of unknown latent representations of treatment embeddings and user covariates. This lets practitioners evaluate content (treatment) variations simply by changing their embeddings, enabling rapid hypothesis testing.

Strengths and Weaknesses

Strengths

  • The problem of testing different treatment (content) variations is very relevant and well motivated.
  • As far as I could check, the theoretical statements are correct.
  • The preliminaries and warmup section are quite nice and important to understand the result.

Weaknesses

  • My main issue with this paper is connecting it with related work and seeing the contributions more clearly with respect to the setting being presented. Can the authors be clearer about what their estimator offers that other works do not, and how that is connected to using treatment embeddings? A table of existing literature and estimator features, including yours, could help visualize this. Right now the work seems disconnected, despite my understanding of the theoretical contribution and estimator.
  • The experimental setup seems very limited; can the authors justify not running the solution in more real-world settings?
  • A minor point: increase the resolution of the plots; consider using PDF figures.

Questions

See the weaknesses above; I'm willing to raise my score if the authors can address them (which I think are reasonable).

Limitations

Yes.

Formatting Concerns

NA

Author Response

Thank you very much for your careful engagement with our paper and the great questions.

Re weakness 1:

  • The novelty of the estimator: the key part of our estimator is that we introduce a kernel-based factorization approach to learn the CATE function. This estimator enjoys several advantages: (i) it allows us to learn the representations of treatments and user covariates that contribute to the treatment effect, with treatment embeddings fed into the treatment kernel to learn a low-dimensional representation; (ii) it is computationally efficient compared with deep learning baselines, and it naturally extends to an efficient adaptive algorithm that assigns treatments to users in an online manner; (iii) it is backed by strong theoretical guarantees, including convergence rates for treatment effect estimation and a sublinear regret bound for an algorithm in the adaptive experimentation setting.
  • Below is a table that compares several estimators:

| Estimator | Method | Computational efficiency | Incorporates both user and treatment info | Learns a low-dimensional representation |
|---|---|---|---|---|
| DKRL | Kernel | High | Yes | Yes |
| SIN | Deep neural network | Low | Yes | Yes, but less interpretable |
| Product kernel | Kernel | High | Yes | No |
| Treatment only | General ML model | High | No | Not directly |
| Covariate only | General ML model | High | No | Not directly |

The table demonstrates the advantages of our proposal. We will include the comparison in the updated draft.
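To make the factorization concrete, here is a minimal numerical sketch of the alternating-minimization idea (illustrative dimensions, with plain random features standing in for the kernel representations; this is a schematic, not the paper's implementation). The bilinear model y ≈ φ(z)^⊤ A B^⊤ ψ(x) is fit by alternating two ridge regressions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, r = 500, 20, 10, 3  # illustrative sizes: samples, feature dims, latent rank

Phi = rng.normal(size=(n, d1))  # stand-in kernel features of treatment embeddings
Psi = rng.normal(size=(n, d2))  # stand-in kernel features of user covariates

# Ground-truth low-rank interaction and noisy outcomes
Gamma_true = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
y = np.einsum("ij,jk,ik->i", Phi, Gamma_true, Psi) + 0.1 * rng.normal(size=n)

def ridge(X, t, lam=1e-3):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

A = 0.1 * rng.normal(size=(d1, r))  # the interaction matrix is factored as A @ B.T
B = 0.1 * rng.normal(size=(d2, r))

def train_mse():
    pred = np.einsum("ij,jk,ik->i", Phi, A @ B.T, Psi)
    return float(np.mean((y - pred) ** 2))

mse0 = train_mse()
for _ in range(30):
    # A-step: with B fixed, y_i = vec(A) . vec(phi_i (B^T psi_i)^T) is linear in A
    C = Psi @ B
    XA = (Phi[:, :, None] * C[:, None, :]).reshape(n, d1 * r)
    A = ridge(XA, y).reshape(d1, r)
    # B-step: with A fixed, y_i = vec(B) . vec(psi_i (A^T phi_i)^T) is linear in B
    D = Phi @ A
    XB = (Psi[:, :, None] * D[:, None, :]).reshape(n, d2 * r)
    B = ridge(XB, y).reshape(d2, r)

print(mse0, train_mse())  # the training MSE drops sharply over the iterations
```

Each step is a convex ridge problem, which is what makes the alternating scheme cheap relative to training a deep network for the same task.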

Re weakness 2: While we are eager to demonstrate our proposed framework on additional real-world problems, finding suitable, publicly available data that satisfies all the necessary desiderata (fully randomized, containing A/B tests on text or images, with user features) has proved challenging. To address this, we consider a range of relaxations of the semi-synthetic simulation setup, which we believe makes it more representative of real-world settings.

  • First, we use a different real-world dataset to collect both text-based treatment and user-level covariates to avoid simulating user covariates.
  • Second, we relax the outcome generation process to allow more complex user-treatment interaction.
  • Third, we test a broader set of baseline methods for a more comprehensive understanding.

Specifically, we analyze the MIND dataset, a benchmark dataset containing traffic from Microsoft for click-rate prediction and news recommendation. It is an observational dataset in which the probabilities of news being recommended to users are unknown. We consider a semi-synthetic scenario: suppose our goal is to train this recommendation system from scratch by adaptively collecting recommendation-click data to improve recommendation quality. We take X to be user features such as news-category preferences and historical news-click embeddings, and Z to be the embeddings of a set of candidate news items to be recommended to users. All these embeddings are taken from the knowledge-graph embeddings originally included in the dataset. We consider a synthetic outcome model for the click-through rate, y = z^⊤ Γ x + ε, where Γ is a matrix that encodes the interaction mechanism between users and treatments; it can be interpreted as a match between user preferences and news information. We consider several setups of Γ = UΛV^⊤ with Λ_i = i^{-q}, so that the interaction model varies from a low-rank (high eigenvalue decay rate q) to a high-rank (low eigenvalue decay rate q) setting. We compare our method with four baselines: LASSO, XGBoost, a feed-forward neural network, and kernel regression. The methods are combined with different estimation strategies: "Z" means incorporating only treatment information, "X" means incorporating only covariate information, and "ZX" means incorporating the concatenated (Z, X) vector. The following two tables compare the training and testing MSEs:
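The outcome-generation process above can be sketched as follows (random Gaussian stand-ins replace the MIND knowledge-graph embeddings; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 1000, 32, 32   # illustrative sizes, not those of the MIND data
q = 2.0                    # eigenvalue decay rate; q = 0 gives a full-rank Gamma

Z = rng.normal(size=(n, d1))   # stand-in news (treatment) embeddings
X = rng.normal(size=(n, d2))   # stand-in user covariates

# Gamma = U diag(Lambda) V^T with Lambda_i = i^{-q}
U, _ = np.linalg.qr(rng.normal(size=(d1, d1)))
V, _ = np.linalg.qr(rng.normal(size=(d2, d2)))
Lam = np.arange(1, min(d1, d2) + 1, dtype=float) ** (-q)
Gamma = U @ np.diag(Lam) @ V.T

# Synthetic click-through outcome: y = z^T Gamma x + noise
y = np.einsum("ij,jk,ik->i", Z, Gamma, X) + 0.1 * rng.normal(size=n)
```

Sweeping q interpolates between an effectively low-rank interaction (large q, fast decay) and a full-rank one (q = 0, flat spectrum), which is the axis the tables below vary.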

Training MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006 (0.002) | 0.034 (0.002) | 0.027 (0.002) | 0.030 (0.002) | 0.031 (0.002) | 0.055 (0.005) | 0.053 (0.005) | 0.055 (0.005) | 0.055 (0.005) | 0.031 (0.002) | 0.002 (0.000) | 0.012 (0.002) | 0.035 (0.001) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.013 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.001 (0.000) | 0.008 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.010 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.007 (0.002) | 0.006 (0.001) |
| 6 | 0.002 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.008 (0.002) | 0.007 (0.001) |

Testing MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007 (0.003) | 0.034 (0.003) | 0.036 (0.003) | 0.034 (0.003) | 0.031 (0.003) | 0.055 (0.005) | 0.056 (0.006) | 0.055 (0.005) | 0.055 (0.005) | 0.032 (0.003) | 0.022 (0.001) | 0.016 (0.002) | 0.035 (0.002) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.014 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.004 (0.001) | 0.010 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.009 (0.002) | 0.007 (0.001) |
| 6 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.010 (0.002) | 0.007 (0.002) |

From the tables, we can see that low-rank signals (larger q) benefit all methods, but DKRL exploits this structure most effectively, yielding the best generalization. Even when the signal is high-rank (for example, q = 0, which is full rank), DKRL can effectively exploit the similarity structure in the embeddings themselves to enhance accuracy, achieving performance comparable to XGBoost with combined features.

Re weakness 3: Thank you for the suggestion. We will take the advice and switch to PDF figures.

Thank you very much for the constructive points. We sincerely hope these explanations address your concerns!

Comment

I'd like to both: a) Thank the authors for addressing my concerns, especially the comparison with previous literature. I also apologize for my late reply; I understand it can sometimes hurt the discussion, but since the authors addressed my questions, I don't think this will be an issue. b) Say that the authors addressed my questions. I initially didn't have major concerns about the paper, and after reading all the discussion my assessment remains the same. This work seems correct to me and, to the best of my knowledge, new. The only comment I can add is that the empirical results could be stronger, e.g., showing results on tasks of larger public interest. For instance, see [1] (not the same task, but it shows some flavor of what I mean).

[1] Dhawan, Nikita, et al. "End-to-end causal effect estimation from unstructured natural language data." Advances in Neural Information Processing Systems 37 (2024): 77165-77199.

Comment

Thank you for your feedback and for acknowledging our revisions. We’re glad that our updates have fully addressed your questions. We also agree on the value of strengthening our empirical evaluation with larger public benchmarks. While we have not yet identified a dataset that precisely matches our task, we will monitor upcoming releases—such as the benchmarks you suggested—and incorporate any suitable datasets into our evaluation at the earliest opportunity.

Review (Rating: 4)

The paper proposes a kernel representation learning method for multiple continuous treatment variables and high-dimensional covariates, and applies it to adaptive treatment assignment in online experiments. The proposed method comes with theoretical guarantees and is validated through experiments on semi-synthetic datasets.

Strengths and Weaknesses

Strengths:

  1. The theoretical results of the proposed method are extensive.

  2. The review of related work is thorough.

Weaknesses:

  1. The paper is difficult to follow and gives a sense of exaggeration. The authors spend a lot of space in the title, abstract, introduction, and conclusion discussing the use of AI generative models, such as LLMs, to generate treatments. This gives the impression that the focus of the paper is on how to incorporate semantic information into generative models to guide the generation of treatments that better align with real-world scenarios. However, the subsequent sections of the paper hardly involve generative models, LLMs, or related topics. Instead, the focus is on using kernel representation learning to learn embeddings of complex treatments, with only a brief mention (mainly on page 7) of how this representation can guide treatment assignment in online experiments. Based on the content in the later sections, I believe the main motivation of the paper should be to learn embeddings of complex treatments, with guiding generative models to generate treatments and adaptive treatment assignment in online experiments being potential applications, rather than the primary focus of the work. I strongly recommend that the authors reconsider the positioning of the paper, clearly stating in the title, abstract, introduction, and conclusion that guiding generative models to generate treatments and adaptive treatment assignment in online experiments are potential applications of the proposed method, and that the real motivation behind the proposed method—learning embeddings for complex treatments—should be clearly highlighted.

  2. The paper does not provide a description of the assumptions required for the proposed method.

  3. It would be beneficial to include experiments in real online environments (rather than semi-synthetic data) and experiments where the proposed method guides LLMs to generate treatments, which can better validate the effectiveness of the proposed method in downstream applications.

  4. Many symbols are used before being defined, such as z and x in line 43, and d in line 235, etc.

Questions

  1. What does d in line 235 mean? Is it d_1 + d_2?

  2. What assumptions are required for the proposed method? For example, is the assumption of an additive-noise outcome model required? Are the SUTVA, overlap, and unconfoundedness assumptions required? Is there a need to assume a specific distribution for the noise term? (As Theorem 4.2 needs to assume that the noises are sub-exponential.)

  3. Is generating treatments by LLMs or other generative models only a downstream task? If I am mistaken, could the authors please clarify which step of the proposed method involves these related techniques?

Limitations

Yes.

Final Justification

During the rebuttal process, the authors addressed most of my concerns and committed to revising the paper’s motivation, positioning, and core contributions. Additionally, they have provided more experimental results in richer scenarios. As a result, I am now inclined toward acceptance, provided these revisions are adequately incorporated in the final version.

Formatting Concerns

No paper formatting concerns.

Author Response

We sincerely thank you for your careful review of our paper and insightful questions.

Re weakness 1: Thank you for this suggestion. We have carefully considered the positioning of our paper and will take your advice to reframe it. Since we cannot revise the manuscript during the rebuttal period, we propose a new abstract following your suggestion:

The growing popularity of Large Language Models (LLMs) introduces exciting opportunities for digital experimentation, as marketers increasingly combine human-generated and model-generated content to design treatments. A central challenge in this setting lies in representing these complex, high-dimensional treatments in a way that preserves their semantic meaning and enables scalable analysis. In this paper, our primary focus is to learn low-dimensional representations of complex treatments that capture this semantic structure. These representations can then serve as a foundation for downstream applications such as guiding generative models to produce meaningful treatment variants and enabling adaptive treatment assignment in online experiments. To this end, we propose a method called double kernel representation learning, which models the causal effect through the inner product of kernel-based representations of treatments and user covariates. We develop an efficient alternating-minimization algorithm to learn these representations from data. Additionally, we propose an adaptive design strategy for online experimentation as an application. We provide convergence guarantees under a low-rank factor model and demonstrate the effectiveness of our approach through numerical experiments.

We will incorporate your suggestions in the next iteration of our paper.

Re weakness 2: We apologize for not clearly stating the assumptions. We listed several assumptions for estimation in Theorems 4.2 and 5.1, including the noise distribution and the low-rank interaction structure. Moreover, since we study an experimental setup with no interference, we need SUTVA, unconfoundedness, and a weak overlap assumption (e_z(x) ∈ (0, 1)). We will clarify these in the updated draft.

Re weakness 3, part 1: While we want to demonstrate our proposal on additional real-world problems, finding suitable, publicly available data that satisfies all the necessary desiderata (fully randomized, containing A/B tests on text or images, with user features) has proved challenging. To address this, we consider a range of relaxations of the semi-synthetic simulation setup, which we believe makes it more representative of real-world settings.

  • First, we use a different real-world dataset to collect both text-based treatment and user-level covariates to avoid simulating user covariates.
  • Second, we relax the outcome generation process to allow more complex user-treatment interaction.
  • Third, we test a broader set of baseline methods for a more comprehensive understanding.

Specifically, we analyze the MIND dataset, a benchmark dataset containing traffic from Microsoft for click-rate prediction and news recommendation. It is an observational dataset in which the probabilities of item recommendation are unknown. We consider a semi-synthetic setup: suppose our goal is to train this recommendation system from scratch by adaptively collecting recommendation-click data to improve recommendation quality. We take X to be user features such as news-category preferences and historical news-click embeddings, and Z to be the embeddings of a set of candidate news items to be recommended. All these embeddings are taken from the knowledge-graph embeddings included in the dataset. We consider a synthetic outcome model for the click-through rate, y = z^⊤ Γ x + ε, where Γ is a matrix that encodes the interaction mechanism between users and treatments; it can be interpreted as a match between user preferences and news information. We consider several setups of Γ = UΛV^⊤ with Λ_i = i^{-q}, so that the interaction model varies from a low-rank (high eigenvalue decay rate q) to a high-rank (low eigenvalue decay rate q) setting. We compare our method with four baselines: LASSO, XGBoost, a feed-forward neural network, and kernel regression. The methods are combined with different estimation strategies: "Z" means incorporating only treatment information, "X" means incorporating only covariate information, and "ZX" means incorporating the concatenated (Z, X) vector. The following two tables compare the training and testing MSEs:

Training MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006 (0.002) | 0.034 (0.002) | 0.027 (0.002) | 0.030 (0.002) | 0.031 (0.002) | 0.055 (0.005) | 0.053 (0.005) | 0.055 (0.005) | 0.055 (0.005) | 0.031 (0.002) | 0.002 (0.000) | 0.012 (0.002) | 0.035 (0.001) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.013 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.001 (0.000) | 0.008 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.010 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.007 (0.002) | 0.006 (0.001) |
| 6 | 0.002 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.008 (0.002) | 0.007 (0.001) |

Testing MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007 (0.003) | 0.034 (0.003) | 0.036 (0.003) | 0.034 (0.003) | 0.031 (0.003) | 0.055 (0.005) | 0.056 (0.006) | 0.055 (0.005) | 0.055 (0.005) | 0.032 (0.003) | 0.022 (0.001) | 0.016 (0.002) | 0.035 (0.002) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.014 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.004 (0.001) | 0.010 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.009 (0.002) | 0.007 (0.001) |
| 6 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.010 (0.002) | 0.007 (0.002) |

From the tables, we can see that low-rank signals (larger q) benefit all methods, but DKRL exploits this structure most effectively, yielding the best generalization. Even when the signal is high-rank (for example, q = 0, which is full rank), DKRL can effectively exploit the similarity structure in the embeddings themselves to enhance accuracy, achieving performance comparable to XGBoost with combined features.

Re weakness 3, part 2: On your second point about using the proposal to guide LLMs to generate treatments, our online experimentation discussion in Section 5 can serve as an example. We use an explore-then-commit strategy to demonstrate how the proposed method can guide treatment optimization: in the exploration stage, our method is used to learn a good user-system interaction model, and in the exploitation stage, this model is used to assign the best personalized treatment, among a pool of LLM-generated treatments, to maximize final rewards. Furthermore, we believe the downstream applications are extensive, for example, fine-tuning a language model to generate personalized treatments based on the learned representation model. We leave these for future work.
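The explore-then-commit loop described above can be sketched as follows (a plain bilinear least-squares fit stands in for DKRL, and `Z_pool` plays the role of a pool of LLM-generated treatment embeddings; all names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n_explore = 8, 40, 400             # hypothetical sizes
Z_pool = rng.normal(size=(K, d))         # candidate (e.g. LLM-generated) treatment embeddings
Gamma_env = rng.normal(size=(d, d)) / d  # unknown user-treatment interaction

def outcome(z, x):
    # toy environment: bilinear reward plus noise (not the paper's data)
    return float(z @ Gamma_env @ x + 0.1 * rng.normal())

# Exploration phase: assign treatments uniformly at random and log outcomes
logs = []
for _ in range(n_explore):
    x = rng.normal(size=d)
    z = Z_pool[rng.integers(K)]
    logs.append((z, x, outcome(z, x)))

# Fit a bilinear interaction model by least squares (a stand-in for DKRL)
D = np.array([np.outer(z, x).ravel() for z, x, _ in logs])
y = np.array([r for _, _, r in logs])
G_hat = np.linalg.lstsq(D, y, rcond=None)[0].reshape(d, d)

# Commit phase: for each new user, assign the pool treatment with the
# highest predicted personalized effect
x_new = rng.normal(size=d)
best = int(np.argmax(Z_pool @ G_hat @ x_new))
```

The commit step is just an argmax over the candidate pool under the fitted interaction model, which is what makes the strategy cheap to deploy once exploration ends.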

Re Q1: Yes, d = d_1 + d_2 is the sum of the dimensions of the user covariates and treatment embeddings. We apologize for the missing notation.

Re Q2: For the theory, we need the additive-noise model and the low-rank assumption. For identification, we need SUTVA, weak overlap, and unconfoundedness. We apologize that we did not formally state these and will add them in the updated draft.

Re Q3: Generating treatments with LLMs is used in two aspects: (i) to learn how semantic information and user covariates interact and contribute to the outcome in A/B tests, we can build models from experiments with LLM-generated treatments; (ii) the learned model can then guide the optimization of personalized treatment generation. So it is not just a downstream task; rather, it is a full pipeline that borrows semantic information in LLM embeddings to understand and optimize treatment generation for maximizing rewards.

Thank you very much for the constructive points. We truly hope these explanations mitigate your concerns.

Comment

Thank you for the detailed response. The rebuttal has addressed most of my concerns. My remaining concern is whether such a substantial amount of revisions will be reflected in the final version of the paper. Regardless of whether the paper is accepted or published in another journal or conference, I hope the authors will incorporate these revisions carefully and present the paper’s motivation and contributions more clearly. Based on the authors' revision outline and additional experiments, I will raise my score to 4. Once again, I hope the authors will revise the paper thoroughly to prepare it for publication.

Comment

Thank you very much for reviewing our responses, and we are glad that our updates help with addressing your concerns.

We promise to incorporate all the updates into the next iteration of the paper, especially: (1) a renewed positioning of the whole draft, emphasizing the use of the learned representation in downstream tasks such as better variant generation and adaptive experimentation; (2) more clarification of the assumptions, with explicit Assumption environments; (3) an update with the more comprehensive numerical experiments.

Thanks for all the comments again! We believe these make significant improvements to the paper.

Review (Rating: 4)

This paper addresses the challenge of designing and analyzing experiments with a large number of treatments generated by Large Language Models (LLMs). The authors propose a novel method, Double Kernel Representation Learning (DKRL), to improve statistical efficiency by leveraging the semantic similarity between treatments, encoded in their embeddings. The core idea is to model the Conditional Average Treatment Effect (CATE) as a low-rank factorization of kernel-based representations of treatments and user covariates. The paper introduces an alternating minimization algorithm to learn these representations and provides theoretical guarantees, including convergence rates for estimation and a sublinear regret bound for an associated adaptive "explore-then-commit" online experimentation strategy. The method's effectiveness is demonstrated on semi-synthetic data derived from the Upworthy and ASOS datasets, where it shows improved performance over several baselines.

Strengths and Weaknesses

Strengths:

  1. The paper tackles a timely and significant problem. As LLMs make it easy to generate a vast number of content variations (e.g., ad creatives, email subject lines), the challenge of efficiently testing them becomes imminent. The proposed framework, which explicitly uses the semantic structure of these treatments via kernel methods, is a well-motivated approach to this problem.

  2. The proposed DKRL method is technically sound. The development of an alternating minimization algorithm for estimation is a standard but appropriate and computationally efficient solution. The theoretical contributions are a major strength, providing both estimation error bounds (Theorem 4.2) and a sublinear regret guarantee for the online algorithm (Theorem 5.1).

Weaknesses:

  1. The primary weakness is the reliance on semi-synthetic data for experimental validation. While the use of real treatment embeddings (from Upworthy headlines) is a good step, the user covariates are simulated Gaussian vectors, and the outcome is generated from a linear model with a low-rank ground truth. This setup is perfectly aligned with the model's assumptions, which almost guarantees that the proposed method will outperform others. The true challenge lies in real-world scenarios where the relationship between treatments, users, and outcomes is far more complex and does not necessarily adhere to a clean, low-rank structure. The lack of experiments on a fully real-world A/B testing dataset where user features are available makes it difficult to assess the method's practical utility and robustness.

  2. The paper's core assumption is that the CATE function has a low-rank structure. While this is a common and powerful assumption in the matrix factorization literature, it may be too restrictive for many real-world applications. The paper briefly mentions that this can be relaxed to an approximately low-rank setting, but the theoretical analysis and experiments are based on the strict low-rank case. It is unclear how the method would perform if the true CATE function has a more complex, high-rank structure.

  3. The choice of baselines could be strengthened. For instance, the main competitor in the estimation task is SIN, a deep learning method. While DKRL outperforms it in the given setting, this is not surprising given that the data was generated from a model that favors DKRL.

Questions

I'd appreciate the authors' response to the weakness section.

Limitations

Please see the weakness section.

Final Justification

The authors have provided new empirical results and resolved my concern about their low-rank assumption. Although the empirical results are still preliminary, I find the motivation interesting and the theory rather solid.

Formatting Concerns

NA

Author Response

We sincerely thank you for your careful review of our paper and insightful questions.

Re weakness 1: Thank you very much for the critical comments. While we are eager to demonstrate our proposed framework on additional real-world problems, finding suitable, publicly available data that satisfies all the necessary desiderata (fully randomized, containing A/B tests on text or images, with user features) has proved challenging. To address this, we consider a range of relaxations of the semi-synthetic simulation setup, which we believe makes it more representative of real-world settings.

  • First, we use a different real-world dataset to collect both text-based treatment and user-level covariates to avoid simulating user covariates.
  • Second, we relax the outcome generation process to allow more complex user-treatment interaction.
  • Third, we test a broader set of baseline methods for a more comprehensive understanding.

Specifically, we analyze the MIND dataset, a benchmark dataset containing traffic from Microsoft for click-rate prediction and news recommendation. It is an observational dataset in which the probabilities of news being recommended to users are unknown. We consider a semi-synthetic scenario: suppose our goal is to train this recommendation system from scratch by adaptively collecting recommendation-click data to improve recommendation quality. We take X to be user features such as news-category preferences and historical news-click embeddings, and Z to be the embeddings of a set of candidate news items to be recommended to users. All these embeddings are taken from the knowledge-graph embeddings originally included in the dataset. We consider a synthetic outcome model for the click-through rate, y = z^⊤ Γ x + ε, where Γ is a matrix that encodes the interaction mechanism between users and treatments; it can be interpreted as a match between user preferences and news information. We consider several setups of Γ = UΛV^⊤ with Λ_i = i^{-q}, so that the interaction model varies from a low-rank (high eigenvalue decay rate q) to a high-rank (low eigenvalue decay rate q) setting. We compare our method with four baselines: LASSO, XGBoost, a feed-forward neural network, and a kernel regression. The methods are combined with different estimation strategies: "Z" means incorporating only treatment information, "X" means incorporating only covariate information, and "ZX" means incorporating the concatenated (Z, X) vector. The following two tables compare the training and testing MSEs:

Training MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006 (0.002) | 0.034 (0.002) | 0.027 (0.002) | 0.030 (0.002) | 0.031 (0.002) | 0.055 (0.005) | 0.053 (0.005) | 0.055 (0.005) | 0.055 (0.005) | 0.031 (0.002) | 0.002 (0.000) | 0.012 (0.002) | 0.035 (0.001) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.013 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.001 (0.000) | 0.008 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.010 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.007 (0.002) | 0.006 (0.001) |
| 6 | 0.002 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.008 (0.002) | 0.007 (0.001) |

Testing MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007 (0.003) | 0.034 (0.003) | 0.036 (0.003) | 0.034 (0.003) | 0.031 (0.003) | 0.055 (0.005) | 0.056 (0.006) | 0.055 (0.005) | 0.055 (0.005) | 0.032 (0.003) | 0.022 (0.001) | 0.016 (0.002) | 0.035 (0.002) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.014 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.004 (0.001) | 0.010 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.009 (0.002) | 0.007 (0.001) |
| 6 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.010 (0.002) | 0.007 (0.002) |

From the tables, we can see that low-rank signals (larger q) benefit all methods, but DKRL exploits this structure most effectively, yielding the best generalization. Even when the interaction matrix is high-rank (for example, q = 0, which is full rank), DKRL can effectively exploit the similarity structure in the embeddings themselves to enhance accuracy, achieving performance comparable to XGBoost with combined features.

Re weakness 2: We appreciate the reviewer's questions regarding the low-rank assumption for user-treatment interactions.

Our low-rank assumption states:

rank(Γ*) ≤ r.

First, we clarify that we can relax this assumption from strict low rank to a so-called approximately low-rank structure by characterizing the spectral decay (i.e., the sorted eigenvalue distribution) of the matrix Γ*. This characterization uses the following inequality, which bounds the ℓ_q norm of the eigenvalues:

∑_i λ_i^q ≤ R_q, for some 0 ≤ q < 1.

This is a much weaker version of the strict low-rank assumption and has been well studied in the literature:

  • Negahban, Sahand, and Martin J. Wainwright. "Restricted strong convexity and weighted matrix completion: Optimal bounds with noise." Journal of Machine Learning Research 13.1 (2012): 1665-1697.
  • Rohde, Angelika, and Alexandre B. Tsybakov. "Estimation of high-dimensional low-rank matrices." The Annals of Statistics 39.2 (2011): 887-930.

Under this assumption, we can establish the following guarantee in our setting:

\frac{1}{d_1 d_2}\|\hat{\Gamma} - \Gamma^\star\|_F^2 \le C R_q \lambda_N^{2 - q}.

The strict low-rank assumption is the special case $q = 0$, which aligns with the theoretical results in our paper. More generally, the relaxed condition allows continuously decaying signals, and there is a trade-off between the decay speed and the estimation accuracy: the closer $q$ is to 1, the harder the signals are to separate and the wider the error bound. Thus, our results continue to hold under this relaxed restriction.
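To make the approximate low-rank condition concrete, here is a small numpy sketch; the dimensions and decay rate are our own illustrative assumptions, not the paper's exact setup. It builds a full-rank matrix whose spectrum decays polynomially, evaluates the $\ell_q$ functional of the spectrum, and shows how quickly the best rank-$r$ approximation error shrinks.

```python
import numpy as np

# Illustrative sketch of the approximate low-rank condition
# sum_i lambda_i^q <= R_q for some 0 <= q < 1 (hypothetical dimensions).
rng = np.random.default_rng(0)
d1, d2 = 50, 40
k = min(d1, d2)

# Singular values decaying polynomially: lambda_i = i^(-alpha).
alpha = 1.5
lam = np.arange(1, k + 1, dtype=float) ** (-alpha)

# Gamma = U diag(lam) V^T with random orthonormal factors: full rank,
# but "approximately" low rank because the spectrum decays fast.
U, _ = np.linalg.qr(rng.standard_normal((d1, k)))
V, _ = np.linalg.qr(rng.standard_normal((d2, k)))
Gamma = U @ np.diag(lam) @ V.T

# The l_q functional of the spectrum stays small even though rank(Gamma) = k.
q = 0.8
R_q = np.sum(lam ** q)
print(f"rank(Gamma) = {np.linalg.matrix_rank(Gamma)}, sum lambda_i^q = {R_q:.3f}")

# By Eckart-Young, the best rank-r Frobenius error is the spectral tail,
# so it shrinks quickly under fast decay.
for r in (1, 5, 10):
    tail = np.sqrt(np.sum(lam[r:] ** 2))
    print(f"r = {r:2d}: best rank-r approximation error = {tail:.4f}")
```

This mirrors the trade-off described above: the faster the decay (smaller effective $q$ in the $\ell_q$ bound), the smaller the tail that any low-rank estimator has to give up.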

From a numerical perspective, we also added experiments evaluating the method under the approximate low-rank setting with the MIND experiment, in response to your point 1. As we can see there, when there is low-rank structure, DKRL exploits it most effectively, yielding the best generalization. When the interaction matrix is high rank (for example, $q = 0$, which is full rank), DKRL can still effectively exploit the similarity structure in the embeddings themselves and enhance accuracy, achieving performance close to XGBoost with combined features.

Re weakness 3: Thank you for this suggestion. In response to your point 1, we have performed a comparison with multiple baselines and estimation strategies under different rank specifications. DKRL proves to have the advantage of adapting to the low-dimensional information in both the embeddings and the interaction matrix.

Thank you very much for the constructive points. We truly hope that these explanations will mitigate your concerns regarding our work.

Comment

I thank the authors for the detailed response and the new experiment, which have addressed my major concerns. I've raised my rating accordingly.

Comment

We thank you for the efforts in reviewing our paper and for raising the score. We will incorporate all the updates in future iterations of the paper!

Review
3

This paper looks at estimating treatment effects in a randomized assignment setting where the interesting twist is that the set of possible treatments is high cardinality and there may be an interaction effect between user covariates and treatment. The basic idea is to model both the user covariates and treatments as embeddings in the same space. The key assumption is equation 3 at line 160, which (essentially) says y(x,z) = g(z)^T h(x) + noise, where g(z) is an embedding of treatment z and h(x) is an embedding of unit covariates x. The paper's main contribution is a kernel-based learning approach for this model.

优缺点分析

The paper is overall clear, the modeling assumption seems fine, and the theory seems fine (though I did not check it carefully).

My main concerns here are:

  1. (Why) is it important to view the treatments as LLM generated, and to consider this an experimentation setup? Like, what distinguishes the methodology here from any other problem using the form of (3) (e.g., most traditional approaches to recommender systems)
  2. Are the kernel methods themselves really important? It looks like equation 3 could be solved by parameterizing g and h as neural nets in the standard contrastive learning manner. Indeed, for the particular application in the experiments, taking g to be the original LLM that produced the embedding seems much more natural than manipulating that embedding with kernel learning. I'd at least like to see a baseline that just naively implements LoRA on the text-embedder.
  3. The only experiment included in the main paper has the user covariates entirely simulated! That seems like a substantial limitation of the analysis, above and beyond the missing LLM baseline noted above.

问题

See strengths and weaknesses

局限性

yes

Final Justification

I do not find my main concerns fully addressed by the author response, so I keep my score the same. But I am also not opposed to the paper being published as is.

Formatting Issues

n/a

Author Response

We sincerely thank you for your careful review of our paper and insightful questions.

Re concern 1: we agree with you that our setup shares many similarities with related problems, such as those in recommender systems. We formulate the problem this way because digital experimentation increasingly incorporates LLM-generated treatments into A/B testing problems, such as generated advertisements or email campaigns. The differences from traditional methods are two-fold:

  • Traditional recommendation methods typically choose among a fixed catalog of a few thousand items. Yet here, the "treatment" might be any of millions of possible sentences, with various styles or tones the LLM can produce. Designing an experiment to explore that space is qualitatively different from merely scoring existing items. Moreover, LLMs can be prompted to generate new variants on the fly (e.g. refine a headline after mid-experiment analysis), which turns the pipeline into an adaptive experiment rather than a static recommendation.
  • Here we are focusing on experimental design settings, where the treatments are "Interventions", not just features. This allows us to ask causal questions such as “What causal impact does wording X have on click-through, comprehension, or satisfaction?” rather than merely correlation questions like “What patterns correlate with clicks in historical logs?” This is guaranteed by unconfoundedness via randomization. By randomizing which LLM‐generated treatment each user sees, we break any link between unobserved user traits and the chosen treatment. That means we can estimate unbiased treatment effects (CATEs), whereas in a pure recommender system, we need to correct for selection bias and confounding issues on observational data.
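The unconfoundedness-via-randomization point above can be checked with a toy simulation; the effect size, noise level, and sample size below are hypothetical. Even though an unobserved user trait drives outcomes, randomized assignment makes a simple difference in means an unbiased estimate of the average treatment effect.

```python
import numpy as np

# Toy check: randomized binary assignment breaks the link between an
# unobserved trait u and the treatment, so difference-in-means recovers
# the true effect.  All numbers here are illustrative assumptions.
rng = np.random.default_rng(3)
n, true_ate = 200_000, 0.3

u = rng.standard_normal(n)            # unobserved user trait
z = rng.integers(0, 2, n)             # randomized assignment, independent of u
y = true_ate * z + u + 0.5 * rng.standard_normal(n)

ate_hat = y[z == 1].mean() - y[z == 0].mean()
print(f"estimated ATE = {ate_hat:.3f} (truth = {true_ate})")
```

In an observational recommender log, by contrast, z would correlate with u through the logging policy, and the same estimator would be biased without an explicit correction.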

Re your concern 2: While one could indeed parameterize $g$ and $h$ as neural nets or fine-tune the original LLM (e.g., via LoRA or contrastive learning), our kernel-based formulation brings three key advantages:

  1. Nonparametric flexibility, avoiding the need to choose and tune a specific network architecture.

  2. Theoretical guarantees on convergence and generalization.

  3. Low-dimensional feature extraction, which enhances interpretability of the learned treatment effects.

We agree that a LoRA-based baseline would be a valuable comparison—this is a natural next step, and we intend to explore it in future work. In the present paper, we focus on CATE estimation with a fixed embedder and a downstream online experimentation strategy. We will leave treatment optimization and generation via fine-tuning as exciting directions for follow-up studies.

Regarding concern 3: While we are eager to demonstrate our proposed framework on additional real-world problems, finding suitable, publicly available data that satisfies all the necessary desiderata (fully randomized, containing A/B tests on text or images, with user features) has proved challenging. To address this, we consider a range of relaxations of the semi-synthetic simulation setup, which we believe is more representative of real-world settings.

  • First, we use a different real-world dataset to collect both text-based treatments and user-level covariates, avoiding simulated user covariates.
  • Second, we relax the outcome generation process to allow more complex user-treatment interaction.
  • Third, we test a broader set of baseline methods for a more comprehensive understanding.

Specifically, we analyze the MIND dataset, a benchmark dataset containing traffic from Microsoft for click-rate prediction and news recommendation. It is an observational dataset where the probabilities of news being recommended to users are unknown. We consider a semi-synthetic scenario: suppose our goal is to train this recommendation system from scratch by adaptively collecting recommendation-clicking data and improving recommendation quality. We take $X$ to be user features such as news-category preferences and historical news-click embeddings, and $Z$ to be the embeddings of a set of candidate news items to be recommended to users. All these embeddings are taken from the knowledge-graph embeddings originally included in the dataset. We consider a synthetic outcome model for the click-through rate, $y = z^\top \Gamma x + \epsilon$, where $\Gamma$ is a matrix that encodes the interaction mechanism between users and treatments; it can be interpreted as a match between user preferences and news information. We consider several setups of $\Gamma = U \Lambda V^\top$, where $\Lambda_i = i^{-q}$ lets the interaction model vary from low rank (high eigenvalue decay rate $q$) to high rank (low eigenvalue decay rate $q$). We compare our method with four baselines: LASSO, XGBoost, a feed-forward neural network, and a kernel regression. The methods are combined with different estimation strategies: "Z" means incorporating only treatment information, "X" means including only covariate information, and "ZX" means including a concatenated $(Z, X)$ vector. The following two tables compare the training and testing RMSEs:

Training MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.006 (0.002) | 0.034 (0.002) | 0.027 (0.002) | 0.030 (0.002) | 0.031 (0.002) | 0.055 (0.005) | 0.053 (0.005) | 0.055 (0.005) | 0.055 (0.005) | 0.031 (0.002) | 0.002 (0.000) | 0.012 (0.002) | 0.035 (0.001) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.013 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.001 (0.000) | 0.008 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.010 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.007 (0.002) | 0.006 (0.001) |
| 6 | 0.002 (0.001) | 0.008 (0.002) | 0.005 (0.001) | 0.009 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.001 (0.000) | 0.008 (0.002) | 0.007 (0.001) |

Testing MSE

| Decay (q) | DKRL | Lasso_Z | XGB_Z | FNN_Z | Kernel_Z | Lasso_X | XGB_X | FNN_X | Kernel_X | Lasso_ZX | XGB_ZX | FNN_ZX | Kernel_ZX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.007 (0.003) | 0.034 (0.003) | 0.036 (0.003) | 0.034 (0.003) | 0.031 (0.003) | 0.055 (0.005) | 0.056 (0.006) | 0.055 (0.005) | 0.055 (0.005) | 0.032 (0.003) | 0.022 (0.001) | 0.016 (0.002) | 0.035 (0.002) |
| 2 | 0.003 (0.001) | 0.009 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.013 (0.004) | 0.014 (0.004) | 0.014 (0.004) | 0.013 (0.004) | 0.009 (0.001) | 0.004 (0.001) | 0.010 (0.002) | 0.008 (0.002) |
| 4 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.001) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.009 (0.002) | 0.007 (0.001) |
| 6 | 0.003 (0.001) | 0.008 (0.002) | 0.007 (0.002) | 0.011 (0.002) | 0.006 (0.002) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.011 (0.003) | 0.008 (0.002) | 0.004 (0.001) | 0.010 (0.002) | 0.007 (0.002) |

From the table, we can tell that low-rank signals (larger $q$) do bring benefits to all methods, but DKRL exploits this structure most effectively, yielding the best generalization. Even when the signal is high rank (for example, $q = 0$, which is full rank), DKRL can effectively exploit the similarity structure in the embeddings themselves and enhance accuracy, achieving performance close to XGBoost with combined features.
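The semi-synthetic outcome model used above can be sketched in a few lines; the dimensions, sample size, noise level, and the 1%-of-top-singular-value cutoff for effective rank below are our own illustrative assumptions, not the exact experimental configuration.

```python
import numpy as np

# Minimal sketch of the semi-synthetic outcome model y = z^T Gamma x + eps,
# with Gamma = U diag(i^-q) V^T controlling the user-treatment interaction.
rng = np.random.default_rng(1)
d_z, d_x, n = 30, 20, 1000
k = min(d_z, d_x)

def make_gamma(q, rng):
    """q = 0 gives a flat, full-rank spectrum; larger q gives faster
    eigenvalue decay, i.e., an effectively low-rank interaction."""
    U, _ = np.linalg.qr(rng.standard_normal((d_z, k)))
    V, _ = np.linalg.qr(rng.standard_normal((d_x, k)))
    lam = np.arange(1, k + 1, dtype=float) ** (-float(q))
    return U @ np.diag(lam) @ V.T

# Treatment embeddings Z, user covariates X, and noisy bilinear outcomes.
Z = rng.standard_normal((n, d_z))
X = rng.standard_normal((n, d_x))

eff_ranks = []
for q in (0, 2, 4, 6):
    Gamma = make_gamma(q, rng)
    y = np.einsum("ij,jk,ik->i", Z, Gamma, X) + 0.1 * rng.standard_normal(n)
    s = np.linalg.svd(Gamma, compute_uv=False)
    eff_ranks.append(int(np.sum(s > 0.01 * s[0])))  # values above 1% of top

print("effective ranks for q = 0, 2, 4, 6:", eff_ranks)
```

The effective rank shrinks as the decay rate grows, which is exactly the spectrum of regimes the tables sweep over.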

Thank you very much for the constructive points. We hope these explanations mitigate your concerns regarding our work.

Comment

Thank you for the reply. Unfortunately, I don't think my questions have been fully answered.

  1. Recommender systems and contrastive learning routinely handle very large selection sets (easily exceeding millions). You emphasize that a key point is that LLMs are generative rather than a fixed corpus of responses, but it's not clear to me why this is meaningfully different than, e.g., recommendation where new items can come in. Generally, I think there just needs to be much more serious engagement with the apparently related work---either to explain crisply why it's not actually related, or to understand if it in fact solves the problem.

  2. It seems to me that none of these advantages are actually advantages per se. Kernel modeling is an alternative to deep learning, not a magic spell that frees us from design. Empirically, it in fact usually seems easier to get a deep learning system to work than a kernel based one. That's particularly true given that you've got an LLM in the pipeline anyways!

  3. I appreciate the improved experiment, and think it will make future versions of the paper stronger.

Comment

Thank you for these responses. We want to share a few more ideas regarding these insightful points:

Response to Comment 1

We thank you again for pointing out the close connection between our work and large‐scale recommendation or contrastive‐learning systems. Indeed, both settings share the goal of learning and optimizing a user–item (or user–treatment) interaction function. To clarify how our approach differs—and to motivate a deeper engagement with recommender‐system methods—we will expand the discussion in the revised manuscript along two dimensions:

  • Causal-experimental emphasis on treatment effects. Standard recommenders focus on predicting outcomes such as click-through rates; in our causal framework, we model the treatment effect $\tau(z, x) = E[Y \mid X = x, Z = z] - E[Y \mid X = x, Z = z_0]$. By imposing a low-rank structure on the contrast $\tau(z, x)$ rather than on raw outcomes, we exploit the belief that treatment effects are often more structured (and lower-dimensional) than the outcomes themselves. Crucially, under randomization, the treatment-assignment mechanism is known, obviating the need for a fully correct outcome model. We will spell out these points more crisply and situate our work vis-à-vis both kernel-based and deep-learning recommender approaches.

  • Generative treatment design versus fixed-item ranking. Traditional recommender or ranking systems assume a (virtually) static catalog of items and can only select or re-rank from that fixed set—they do not optimize how new treatments are generated. In contrast, our LLM-based pipeline has the potential to actively generate candidate treatments and then optimize them for user response. This “design-and-optimize” loop enables us to propose entirely novel interventions—something a classic ranker alone cannot do—and to co-optimize generation and selection in one unified framework.
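As a toy illustration of the contrast modeled in the first point above: under a bilinear outcome model the CATE collapses to a single bilinear form in the treatment difference. All quantities below are hypothetical.

```python
import numpy as np

# Toy illustration of tau(z, x) = E[Y | X=x, Z=z] - E[Y | X=x, Z=z0].
# Under the bilinear model E[Y | x, z] = z^T Gamma x, this reduces to
# tau(z, x) = (z - z0)^T Gamma x.  Dimensions here are arbitrary.
rng = np.random.default_rng(2)
d_z, d_x = 8, 5
Gamma = rng.standard_normal((d_z, d_x))

z, z0 = rng.standard_normal(d_z), rng.standard_normal(d_z)  # treatment vs. baseline
x = rng.standard_normal(d_x)                                # user covariates

mu = lambda z_, x_: z_ @ Gamma @ x_        # conditional mean outcome
tau = mu(z, x) - mu(z0, x)                 # CATE as a contrast of means
tau_direct = (z - z0) @ Gamma @ x          # same quantity, one bilinear form
assert np.isclose(tau, tau_direct)
print(f"tau(z, x) = {tau:.4f}")
```

This is why low-rank structure on the contrast can be weaker than low-rank structure on the raw outcome surface: any component of the outcome that does not vary with z cancels in the difference.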

Response to Comment 2

We agree that kernel methods are not a magic spell and that, in practice, deep learning—especially fine-tuning pretrained LLMs in this setting—can achieve strong empirical performance with less manual kernel design. Our primary motivation for exploring a kernel‐based pipeline is the balance it affords between:

  • Practical efficacy: we demonstrate through simulations that these nonparametric kernel regression estimators attain competitive performance out-of-sample, and

  • Theoretical insight: the RKHS framework permits finite-sample error bounds and explicit analysis of the exploration–exploitation trade-off.

By contrast, a deep‐learning formulation—while potentially effective—is much more challenging to analyze theoretically. That said, we fully acknowledge the value of end-to-end LLM fine-tuning and will discuss ongoing work on hybrid schemes that combine our low‐rank causal kernels with LLM adaptation. We believe this direction holds promise for unifying theoretical rigor with state-of-the-art generative models.

Response to Comment 3

We are very glad that the new simulation addresses your concern. We will add these results along with all these discussions in the updated version of our paper.

Final Decision

This paper introduces a new method, double kernel representation learning, to enhance the statistical efficiency of designing and analysing LLM-generated treatments in large-scale experiments. The method improves statistical efficiency by explicitly incorporating semantic similarity from treatment embeddings. It works by factoring the causal effect through the inner product of low-dimensional, kernel-based representations of treatments and user covariates. The paper proposes an alternating-minimisation algorithm to efficiently learn these representations from data, and it also proposes an adaptive design strategy for online experimentation.

After the author-reviewer discussion, the reviews for this paper are generally positive, and it is leaning toward borderline acceptance. The concerns raised by the expert reviewers are the positioning of the paper, the choice of methodology (i.e., kernel methods), and the limited empirical evaluation. In the rebuttal, the authors provided additional empirical evaluation and promised to improve the positioning of the paper. The authors also justified their choice of kernel methods. In general, the reviewers found that the authors' responses addressed some of their original concerns.

My own assessment is that the paper studies a timely topic and provides comprehensive theoretical and experimental results to support its claims. Provided the authors keep their promises to improve the aforementioned aspects of the paper, it should be deemed acceptable for publication at NeurIPS.