Linear Bandits with Partially Observable Features
This work proposes an algorithm for linear bandits with latent features, achieving sublinear regret by augmenting the observed features with orthogonal basis vectors and using a doubly robust reward estimator, without requiring prior knowledge of the unobserved feature space.
Abstract
Reviews and Discussion
The paper proposes a method to solve the linear bandit problem when only a subset of each arm's features is visible to the learner, without assuming any structural property beforehand. The authors do this by transforming each context vector into an augmented space and learning the respective augmented feature vector to minimize the regret. They apply a doubly robust estimator (robust with respect to both the estimated model and the rewards). The authors provide a sublinear regret upper bound and a lower bound, and prove the consistency of their estimator. Experiments showcase the effectiveness of their algorithm against other baselines.
Questions for the Authors
Do you actually use a compatibility condition, or do you rely on a full-rank Gram matrix?
If you use a full-rank Gram matrix in the augmented space, how does your exploration strategy ensure full rank?
How do you select the set of orthogonal basis vectors?
Claims and Evidence
I did not check all of the proofs, but from what I saw they look mostly clear and convincing. I list some concerns in the weaknesses/questions section.
Methods and Evaluation Criteria
Comparing regret bounds for different compositions of the number of arms and the dimension makes sense in this setting.
Theoretical Claims
I only skimmed through the proofs, with more focus on Theorem 3.
Experimental Design and Analysis
I read through the experimental section and the scenarios explained for each plot. I have no issues there, though I did not check any code.
Supplementary Material
See above.
Relation to Prior Literature
The key contribution is a new model for the partially observable features setting without any additional assumptions on the structure of the problem through feature augmentation and the respective regret bound.
Missing Important References
Not to my knowledge.
Other Strengths and Weaknesses
Strength: I like the idea of using augmentation to mitigate the missing information and exploit the observable features. The fact that no further assumptions are made makes the model quite powerful.
Weakness: I find Section 3.2 of the paper hard to follow. For example, it is not quite clear to me whether a compatibility condition is used for the oracle inequality or not. Also, the exploration aspect of the algorithm is not quite clear to me; what is the benefit over random exploration, for instance? Finally, I do not see how the basis vectors for the augmented context vectors are determined; I assume they have to be chosen to minimize some objective?
Other Comments or Suggestions
I think there is a small error in the regret definition in equation (2). Shouldn't it be just and instead of ?
At the end of page 6 you wrote "To the best of our knowledge, Theorem 3 is the first regret bound sublinear in T for the latent features without any structural assumption." Technically this is not correct, since the regular UCB algorithm for stochastic bandits would also achieve a sublinear bound.
In lines 228 to 231 you wrote that for the compatibility condition to hold, the minimum eigenvalue has to be positive, which does not have to be true in general. In some instances the compatibility condition is a more relaxed condition than having a positive minimum eigenvalue.
We are thankful for your careful review and for acknowledging the impact of our contributions. We will address the following questions one by one, and we believe that the answers will collectively provide a comprehensive response.
- On Use of Compatibility Condition
- It is well known that a full-rank Gram matrix implies the compatibility condition. Without loss of generality, we assume that the observed features are of full rank (lines 187-188, left column) and that the augmented vectors are orthogonal to each other and to the observed feature space. Please refer to Lemma 4 for the detailed proof, starting from line 1486 on the last page of the appendix. Because the Gram matrix of the augmented features has a positive minimum eigenvalue, it satisfies the compatibility condition required for the convergence of the lasso estimator.
- Ensuring a Full-Rank Gram Matrix under Our Exploration Strategies
- Our exploration method ensures full rank in two ways: (i) by randomly sampling an augmented feature vector for a predetermined number of rounds, and (ii) by performing resampling and coupling based on the multinomial distribution defined in (9). These two strategies explore all arms efficiently. Additionally, as indicated in footnote 3 of the manuscript (lines 217-219, left column), even if the observed feature Gram matrix is not of full rank, we can apply singular value decomposition to the observed features to reduce the feature dimension to the rank of the observed feature matrix.
- Selection of Orthonormal Basis
- Theoretically, our regret bound holds for any choice of orthonormal basis of the orthogonal complement. In practice, we perform singular value decomposition (SVD) of the observed feature matrix and select the right singular vectors corresponding to zero singular values, as in our experiments. These vectors form an orthonormal basis that is orthogonal to the row space of the observed feature matrix; a schematic sketch is given below.
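As a concrete illustration, here is a minimal NumPy sketch of this SVD-based construction. It is our own sketch with hypothetical shapes and variable names, assuming the observed features are stored as columns of `X`, one column per arm; it is not the authors' exact implementation.

```python
import numpy as np

def augment_features(X, tol=1e-10):
    """Augment observed arm features with an orthonormal basis of the
    orthogonal complement of the row space of X.

    Assumes X is a (d, K) matrix whose a-th column is the observed
    feature of arm a (illustrative assumption, not from the paper).
    """
    d, K = X.shape
    # Full SVD: the rows of Vt form an orthonormal basis of R^K.
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    # Right singular vectors with (numerically) zero singular values
    # span the orthogonal complement of the row space of X.
    s_full = np.concatenate([s, np.zeros(K - len(s))])
    B = Vt[s_full <= tol]          # shape (m, K), m = K - rank(X)
    # Augmented feature of arm a: observed x_a stacked with the a-th
    # coordinates of the orthogonal basis vectors.
    return np.vstack([X, B])        # shape (d + m, K)

# Toy usage: 5 arms, 2 observed feature dimensions.
X = np.random.randn(2, 5)
X_aug = augment_features(X)
print(X_aug.shape)                  # (5, 5): 2 observed dims + 3 basis coordinates
```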
- Other Corrections
- Regret definition in Eq. (2)
- Thank you for pointing this out. We will revise the notation in the revision.
- At the end of page 6 you wrote "To the best of our knowledge, Theorem 3 is the first regret bound sublinear in T for the latent features without any structural assumption." Technically this is not correct since the regular UCB algorithm for stochastic bandits would also achieve a sublinear bound.
- We appreciate your clarification. We intended to state that our result is the first of its kind for linear parametric bandit problems. If by "regular UCB algorithms" you are referring to algorithms for non-feature-based multi-armed bandit (MAB) settings, we agree that the original sentence may be misleading. We will revise it for clarity as follows: "To the best of our knowledge, Theorem 3 presents the first regret bound with a faster convergence rate, particularly for algorithms that account for unobserved features, without relying on any structural assumptions."
- In lines 228 to 231 you wrote that for the compatibility condition to hold, the minimum eigenvalue has to be positive, which does not have to be true in general. In some instances the compatibility condition is a more relaxed condition than having a positive minimum eigenvalue.
- Thank you for pointing this out. We agree with your observation and will revise the sentence, for accuracy, as follows: "For the estimator in Eq. (8) to correctly identify the zero entries of the parameter, the compatibility condition must hold (van de Geer & Bühlmann, 2009). While the compatibility condition does not necessarily require a positive minimum eigenvalue in general, in our setting the Gram matrix has a positive minimum eigenvalue. Thus, the compatibility condition is implied without any additional assumption."
In this work, we study linear bandits with partially observable features, where rewards depend on both observed and unobserved features, a common situation in real-world applications. Our problem setting is more general than those of prior works, and it remains underexplored. We directly tackle the challenges arising from unobserved features. Specifically, we propose a novel algorithm equipped with a new estimation technique and a novel analytical framework distinct from existing approaches. Our method achieves regret that is not only sublinear but also converges faster than the rates reported in previous studies, and it shows superior empirical performance in numerical experiments. We believe these contributions are significant.
Reference: van de Geer, S. A. and Bühlmann, P. (2009), On the conditions used to prove oracle results for the lasso.
Problem Setting
The paper studies the linear bandit problem with partially observable features, where rewards depend on a full set of features, but the learner only observes a subset of them. This setting models real-world scenarios (e.g., recommendation systems) where unobserved latent features (e.g., user preferences) influence outcomes. Existing linear bandit algorithms fail here because they either assume full feature observability or impose restrictive structural assumptions (e.g., linear mappings) between observed and latent features. The paper relaxes these assumptions, allowing latent features to have arbitrary relationships with observed ones, and formalizes the problem through geometric relationships between the subspaces spanned by observed and latent features.
Main Algorithmic Ideas
The proposed RoLF (Robust to Latent Features) algorithm addresses partial observability via two key innovations:
- Feature Space Decomposition: RoLF explicitly decomposes the reward into components spanned by observed features and their orthogonal complement, enabling estimation of both observable and latent effects without prior knowledge of latent dimensions.
- Doubly Robust Estimation: A novel estimator combines ridge regression on observed features with importance weighting to correct for bias from unobserved factors, ensuring robustness to arbitrary latent feature interference.
Critically, RoLF requires no prior knowledge of latent feature properties (dimensions, alignment with observed features) and dynamically adapts to the geometric relationship between feature subspaces.
Main Results
- Regret Bounds:
- If the observed features span the latent space, RoLF achieves regret matching the optimal rate for standard linear bandits.
- If the latent features dominate (the observed features lie within the latent span), the regret aligns with multi-armed bandit limits.
- In the general case, the regret scales with the effective latent dimension orthogonal to the observed features.
- Empirical Validation: Experiments confirm RoLF outperforms baseline algorithms when latent features are present.
Questions for the Authors
I have some questions about the regret bound results. Regarding the results in Table 1, I understand the first scenario: it suggests that even though there are some unknown parts of the features, these features are either unimportant or can be linearly expressed by the known features, thus reducing the problem to standard linear bandits. The second scenario indicates that the unknown features are so numerous that even with some side information it is insufficient, causing the problem to degenerate into a multi-armed bandit setting, where the features are essentially useless. Both scenarios make sense to me.
However, I find the third result less clear. Based on my understanding, the bound includes the dimension of the known features and what I read as the dimension of the unknown features. Intuitively, as the proportion of unknown features increases relative to known features, the problem should become more challenging. That is, if the known dimension remains constant, increasing the unknown dimension should make the problem harder. Yet, according to the bound, as long as the latter quantity remains unchanged, the final performance is the same. This seems confusing to me, and I believe the authors may need to discuss this issue.
Claims and Evidence
All claims in the paper are supported by theoretical proofs.
Methods and Evaluation Criteria
The method design is sensible and supported by experiments. The evaluation tests the algorithm's handling of unknown features by varying latent feature dimensions and their impact, which appears reasonable.
Theoretical Claims
The theory appears to be sound, but I have not had the time to thoroughly check the proofs.
Experimental Design and Analysis
I’m a bit confused about the experimental results. In each graph, the cumulative regret of the proposed RoLF-Lasso algorithm suddenly flattens at a certain point and stays flat afterward. This doesn’t align with my understanding of how regret typically grows in bandit experiments. I think the authors might need to explain why the regret suddenly becomes flat.
Supplementary Material
I have not closely reviewed the supplementary material.
Relation to Prior Literature
This paper is the first to study partially observable features in linear bandits. It shows that when too few features are observable, the linear bandit reduces to a multi-armed bandit (MAB), linking the two frameworks and identifying the point where the advantages of linear modeling and additional context become weaker.
Missing Important References
To my knowledge, I haven’t found any references that were left undiscussed.
Other Strengths and Weaknesses
I believe the strength of this work lies in introducing a meaningful new topic within the linear bandit setting and providing what appears to be a solid theoretical analysis. However, the weakness lies in the presentation of the paper, as many parts lack clear explanations. Below are my specific comments and questions for further clarification.
Other Comments or Suggestions
The paper can be further improved in terms of writing and presentation. For instance, there are excessive blank lines between lines 326-329, and page 8 is not fully utilized. While fixing these issues is not strictly necessary, the paper does not convincingly show that its content is fully explained while leaving this space unused.
Specifically, several key concepts and definitions are left unexplained:
1. The content of Table 1 is not well explained, and the accompanying text does not align with the table.
2. In line 147, the regret formula includes two feature vectors, neither of which is defined. Based on the notation in the regret formula, these appear to be lower-dimensional vectors rather than full feature vectors, but this is not clearly explained.
Therefore, I believe the work could be further improved in terms of clarity and writing.
We appreciate your feedback on page usage. We will make full use of the page limit and avoid leaving unused space in the revision.
- On the Explanation and Placement of Table 1
- We appreciate the opportunity to clarify Table 1. We will relocate the table to a more suitable position (e.g., below Section 3, line 159, right column) and add a clarifying explanation.
- Table 1 summarizes how the regret bound of our algorithm varies with the relationship between the row space of the observed features (span(observed), i.e., R(X)) and that of the unobserved features (span(latent), i.e., R(U)).
- Specifically, if span(latent) is fully included in span(observed), the quantity defined in Eq. (6) becomes 0, since the unobserved reward component can be expressed by the observed features; in the opposite case it is maximal. A detailed discussion can be found from line 206 (left column) to line 177 (right column).
- We also provide experimental results for Cases 1 and 2 in Table 1, i.e., (i) R(U) ⊆ R(X) and (ii) R(X) ⊆ R(U), respectively. For (i), we sample the observed features and a coefficient matrix, and compute the latent features as the product of the two (line 91, right column). In case (ii), we sample the latent features first, then construct the observed features via multiplication with a coefficient matrix, as in (i). The full feature matrix is formed by concatenating the observed and latent parts.
- The experiments are conducted with the following setups: (i) fixed feature dimensions with the number of arms varying over 20, 30, and 40, and (ii)-(iii) two further configurations with different dimension settings. Results are available at: https://tinyurl.com/RoLFFigs1
- On the Notation in Eq. (2)
- We appreciate the opportunity to clarify the definition of the given notation. In Eq. (2), the first quantity denotes the true feature vector associated with the optimal action, as defined in lines 139–140 on the left column, and the second denotes the true feature vector corresponding to the action selected in round t. As defined in Eq. (1), both vectors consist of a combination of observed and unobserved components. We will include this clarification in the revision.
- Clarification of Theoretical Results
- We appreciate the opportunity to clarify the results regarding the quantity in our regret bound.
- Before providing clarification, we would like to emphasize that the first scenario in Table 1 does not imply that the unobserved features are unimportant; rather, as you have correctly pointed out, it indicates that they can be linearly expressed by the observed features.
- The formal definition of this quantity is provided in Eq. (6) (line 216, left column). To clarify, it is not the dimension of the unknown features; it is the number of basis vectors required to express the unknown portion of the reward.
- Even if the dimension of the unobservable features increases, the problem does not become more difficult, as long as the unobserved component of the reward can still be expressed using the same number of basis vectors. However, if adding unobservable features increases this quantity, then the problem becomes harder, as it requires additional parameters to express the unknown portion of the reward. Thus, this quantity characterizes the intrinsic difficulty of the problem, which supports our regret bound. We will include the details in the manuscript; a toy illustration is sketched below.
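To make this concrete, here is a toy illustration with made-up numbers of our own (not taken from the paper), written as a short LaTeX snippet:

```latex
% Toy illustration with hypothetical numbers (not from the paper).
\[
K=3,\quad d=1,\qquad x_1=x_2=x_3=1
\;\Longrightarrow\;
\mathrm{span}(\text{observed})=\mathrm{span}\{(1,1,1)^{\top}\}\subset\mathbb{R}^{3}.
\]
\[
\text{Latent reward contribution across arms: }\;
u=(1,-1,0)^{\top}=\sqrt{2}\,b,
\qquad
b=\tfrac{1}{\sqrt{2}}(1,-1,0)^{\top}\ \perp\ (1,1,1)^{\top}.
\]
```

In this toy case a single augmented basis vector b suffices to express the unknown portion of the reward, even if the raw latent feature dimension is very large; under our reading of the definition, the quantity in Eq. (6) would equal 1 here, and it is this count, not the latent dimension itself, that enters the bound.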
- Clarification of Experimental Results on Curve Flattening
- We appreciate your careful observation. While the cumulative regret curve of our algorithm may appear to flatten suddenly, it is not completely flat -- this impression comes from the steep growth of the regret curves of baseline algorithms. In fact, it continues to grow slowly over time. The apparent flattening is due to the forced exploration phase (lines 5-6 in Algorithm 1), which ensures sufficient coverage of the action space early on. After this phase, resampling and coupling strategies both enhance the efficiency of the doubly robust (DR) estimation, resulting in slower regret growth. A shorter exploration phase would yield a more gradual and adaptive flattening.
We would like to emphasize the significance of the problem: linear bandits with partially observable features, where rewards depend on both observable and unobservable features -- common in practice, yet underexplored. Our work addresses core challenges with a new algorithm for partial observability, equipped with a novel estimation strategy. Our method achieves sublinear regret bounds with a faster convergence rate than prior work, and shows superior empirical performance. We believe these contributions are substantial and hope they are recognized.
Thank you for your effort on the rebuttal. While the concern regarding Table 1 has been addressed, I still have a few remaining concerns.
- You emphasize the significance of the problem by stating that “it is common in practice yet underexplored”. However, you don’t provide a convincing example to support this claim. Even the example at the beginning of the paper—about product recommendation—is rather vague. In the linear bandit setting you consider, it’s unclear how the roles of users and items are defined. Based on the standard interpretation in linear bandits, the context typically corresponds to the items (which the recommender system can observe and choose), while the user is modeled as the unknown parameter. So it’s confusing that in your setup, the user is treated as the observed context and the item as the unknown parameter. I’m not claiming my understanding is definitely correct, but the fact that such confusion arises suggests that your explanation of the problem setting is unclear. This does not appear to be a well-established problem in the bandit community (at least for now), and for that reason, I find your attempt to emphasize the significance of the problem unconvincing, as it seems to imply that this is already a widely recognized issue in the community — which, to my knowledge, is not the case.
- Regarding the experiment, I’m not convinced by your explanation. You explained that the figures are influenced by the impression created by the linear regret in the overall plot, but based on the scale of the y-axis, the regret of your two methods shows very little change within the 1200 rounds. Even without the impression of linear regret, the variation is still minimal. This unusually small increase in regret over time seems quite odd. This pattern is not observed in prior works mentioned in this paper, such as Park & Faradonbeh (2022), Kim et al. (2023a), and Park & Faradonbeh (2024).
[New] Thank you for your further reply. I believe the problem is meaningful, and your code appears to be correct. However, my remaining concern is the abnormal phenomenon, which I believe should be clearly explained in the revision.
It’s unclear whether this is caused by a plotting artifact, an issue with the experimental setup (e.g., the task might be too easy, allowing the algorithm to immediately find the optimal arm after the exploration phase), or some underlying assumption. The initial explanation related to “impression” doesn’t seem very convincing to me. If possible, could the authors provide a version of the plot with the linear regret removed — showing only the currently flat-looking part? This might help clarify the explanation without the impression issue and allow for a clearer observation of the regret growth in this region.
Since the discussion deadline is approaching, the authors may consider directly updating the new plot in the anonymous link.
Significance of the problem
We would like to offer a more concrete case to better illustrate the significance of linear bandits with partially observable features. Consider online advertising (without personalization, i.e., ads served to general users):
Each ad (arm) is associated with a true but partially observable feature vector, which concatenates an observable component and a latent component.
For example:
- Ad category (e.g., travel, fashion, tech)
- Ad format (e.g., banner, video, carousel)
- Time of display
These are observable to the platform. (Note that some of these observable features may be categorical, which can be turned into dummy variables.) However, latent factors such as emotional appeal, creative design quality, or brand familiarity also influence the click-through rate but are not directly quantifiable or observed. The reward (click/no click) depends on the entire feature vector, but the learner (advertiser/platform) only sees the observable part, which leads to a realistic partial-observability challenge. This example illustrates why we believe our model is practically meaningful.
Similar situations arise in many other domains:
- Clinical trials, where treatments (arms) have observable features (e.g., dosage, chemical formulation) and unobserved ones (e.g., manufacturing variability, side effect risks) that influence outcomes.
- Online retail, where products have not only observed metadata (e.g., price, brand, category) but also latent factors (e.g., freshness, trendiness) that can impact purchase rates.
In all of these cases, outcomes depend on both observed and unobserved features, and our model provides a principled and practical framework for learning in such settings—with strong theoretical guarantees.
We invite the reviewer to the following reflection:
Question: Which is more realistic? (i) All features influencing the reward are always observed, or (ii) Some relevant features might be unobservable in practice?
We believe many practitioners would agree that (ii) better reflects real-world conditions.
Moreover, and just as importantly, in practice one typically does not know whether all relevant features are observed or whether some influential factors are unobserved. Our work directly addresses this aspect, offering a solution that requires no prior knowledge of the unobserved feature space.
We also respectfully disagree with the claim that our setting is not well-established within the bandit research community. There is a growing body of work on bandits with partially observable features, including Tennenholtz et al. (2021), Kim et al. (2023), and Park & Faradonbeh (2022; 2024). Our work generalizes this line of research by removing the structural assumptions on the latent features commonly imposed in prior studies.
Putting these together, we strongly believe that our setting is both theoretically well-motivated and practically relevant. More importantly, we propose a provably efficient algorithm that also shows superior performance in experiments.
Experiments
We take such concerns very seriously. Our code was already submitted for full transparency, and we have re-validated the implementation following your comment. We found no issues at all. It is 100% correct. For even more transparency, we provide the notebook file that reproduces our results:
Link: Jupyter notebook with results.
So there should NOT be any more concerns.
Regarding the prior works mentioned, we respectfully note that those studies assume stochastic features that are resampled at each round. In contrast, our setting considers fixed feature vectors as clearly stated in the paper—making the settings quite different.
Finally, we believe the seemingly flatter regret curve arises naturally from basis-augmented features, which reduce information loss; doubly robust estimation, which improves statistical efficiency; and forced exploration, which is known to yield fast convergence after the initial rounds (e.g., Goldenshluger & Zeevi, 2013; Hao et al., 2020; Chakraborty et al., 2023).
We kindly and respectfully ask the reviewer to re-evaluate our important work.
Additional references
- Goldenshluger & Zeevi (2013), "A linear response bandit problem."
- Tennenholtz et al. (2021), "Bandits with partially observable confounded data."
- Hao et al. (2020), "High-dimensional sparse linear bandits."
- Chakraborty et al. (2023), "Thompson sampling for high-dimensional sparse linear contextual bandits."
This paper studies linear bandits with partially observable features. The authors suggest an epsilon greedy type algorithm based on a doubly robust estimator. A regret bound of the algorithm is provided with supporting numerical experiments.
Questions for the Authors
Could you provide more details about the experimental setup, especially the spaces R(X) and R(U)? Also, could you please provide the regression equations for the rewards?
How is (10) unbiased even though lasso/ridge is used?
Claims and Evidence
The authors claim that the basis of the space orthogonal to the observed feature space can be used for the estimation of rewards, to compensate for the absence of information due to partial observability. The reviewer believes that this is convincing.
Methods and Evaluation Criteria
The regret is a widely used evaluation criterion. It makes sense.
Theoretical Claims
I have not checked all proofs. But, the claims make sense and fit my intuition.
Experimental Design and Analysis
The experiment should be more neutral toward the other benchmark methods. The space of (significant) unobserved vectors should be a small subset of R(X)^t for fair comparison. If no unobserved vectors are associated with the rewards, the regret could be worse than that of other methods due to overfitting, even though the lasso/ridge estimator provides some shrinkage. This part is not fully described, so it is hard to figure out.
Supplementary Material
I have not.
Relation to Prior Literature
As this work addresses linear bandits (not contextual bandits), could the literature review please compare it with other works on linear bandits rather than with works on contextual bandits? If no literature on linear bandits is available, the authors should clarify this.
Missing Important References
The following work is similar to this work.
Tennenholtz, Guy, et al. "Bandits with partially observable confounded data." Uncertainty in Artificial Intelligence. PMLR, 2021.
Other Strengths and Weaknesses
This work's idea and method are great. Also, the quality of English is good.
Weakness:
Methods are not fully described. In (8), what are p and \delta? How do we determine them? Could you please describe the method in more detail, for example a reference for the provided exploration method and the constants C_e log(2Kt^2/\delta)?
Other Comments or Suggestions
NA
Ethics Review Concerns
NA
We are grateful for your valuable feedback and acknowledgment of our contributions. We are happy to address each of your comments.
- On Descriptions of Notation and Methods
- Thank you for the opportunity to clarify this point. Definitions of p and \delta are given after Eq. (9) (lines 263-270, left column) and in Algorithm 1 (lines 276-277, right column), respectively. We will add explanations immediately after Eq. (8) in the revision. As described in Algorithm 1, both quantities are hyperparameters: (i) p is the coupling probability used to define the multinomial distribution for pseudo-action sampling; (ii) \delta is the algorithm's confidence parameter.
- Our exploration method couples an action from the epsilon-greedy policy with a counterfactual action drawn from the multinomial distribution in Eq. (9). While Xu & Zeevi (2020) also use counterfactual actions for exploration, our method differs in two points: (i) they sample counterfactuals from the previous policy, whereas ours conditions on the selected action; (ii) we explicitly resample for coupling. A minimum number of exploration rounds ensures that the imputation estimator is accurate enough for use in the main estimator; fewer rounds may cause error accumulation in the main estimator. A schematic sketch of this two-phase exploration is given below.
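For illustration only, here is a minimal Python sketch of such a two-phase exploration (forced exploration, then an epsilon-greedy action coupled with a resampled pseudo-action). The uniform pseudo-action weights and all names are placeholders of ours, not the multinomial distribution of Eq. (9) or the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_actions(t, K, greedy_arm, n_explore=50, eps=0.05):
    """Schematic round-t action selection.

    Phase 1 (t < n_explore): forced exploration, an arm is drawn uniformly.
    Phase 2: epsilon-greedy action coupled with a pseudo-action resampled
    from a multinomial over arms (uniform weights as a placeholder for Eq. (9)).
    """
    if t < n_explore:
        a_t = int(rng.integers(K))
        return a_t, a_t
    a_t = greedy_arm if rng.random() > eps else int(rng.integers(K))
    weights = np.full(K, 1.0 / K)            # placeholder multinomial weights
    pseudo_a = int(rng.choice(K, p=weights))  # resampled counterfactual action
    return a_t, pseudo_a

# Example call: round t = 100, K = 10 arms, current greedy arm is 3.
print(select_actions(100, 10, greedy_arm=3))
```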
- On Experimental Setups and Comparison with Baselines
- Our experimental setting is the "otherwise" case in Table 1, where neither R(X) nor R(U) fully contains the other.
- We also examine cases (i) R(U) ⊆ R(X) and (ii) R(X) ⊆ R(U). For (i), we sample the observed features and a coefficient matrix, and compute the latent features as the product of the two (see line 91, right column). In case (ii), we first sample the latent features, then construct the observed features analogously, as in case (i). The full feature matrix is formed by concatenating the observed and latent parts.
- Additional experiments are conducted for both cases, with the following setups: (i) fixed feature dimensions with the number of arms varying over 20, 30, and 40, and (ii)-(iii) two further configurations with different dimension settings. Results are available at: https://tinyurl.com/RoLFFigs1
- For the case with no unobserved features, results are already included in the manuscript (lines 416–426, left column; Fig. 4), where the latent portion of the rewards is set to zero. As shown in the figure, our algorithm still outperforms the baselines, with no overfitting observed.
- On Regression Equations
- As noted in the manuscript (line 123, left column), the true mean reward is a linear function of both the observed features and the latent features. Baseline methods (e.g., OFUL, LinTS, DRLasso) use only the observed features, modeling the regression on the observed part alone. In contrast, ours augments the observed features with the orthogonal basis vectors, so the augmented model recovers the true mean reward, as argued in Eq. (4) (lines 194–200, left column) and Eq. (7) (lines 184–187, right column). A toy numerical illustration is given below.
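Here is a toy NumPy illustration of this point, with hypothetical sizes and random data of our own (not the paper's experimental setup): regressing the mean rewards on the observed features alone leaves a residual, while the basis-augmented features can represent the true mean reward of every arm.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, d_lat = 6, 2, 4                      # arms, observed dim, latent dim (toy sizes)
X = rng.standard_normal((d, K))            # observed features, one column per arm
Z = rng.standard_normal((d_lat, K))        # latent features (unseen by the learner)
theta_obs = rng.standard_normal(d)
theta_lat = rng.standard_normal(d_lat)
mu = X.T @ theta_obs + Z.T @ theta_lat     # true mean reward of each arm

# Baseline model: regress the mean rewards on the observed features only.
resid_obs = mu - X.T @ np.linalg.lstsq(X.T, mu, rcond=None)[0]

# Augmented model: observed features plus an orthonormal basis of the
# orthogonal complement of the row space of X (cf. the SVD construction above).
_, s, Vt = np.linalg.svd(X, full_matrices=True)
B = Vt[np.concatenate([s, np.zeros(K - d)]) <= 1e-10]
X_aug = np.vstack([X, B])
resid_aug = mu - X_aug.T @ np.linalg.lstsq(X_aug.T, mu, rcond=None)[0]

print(np.linalg.norm(resid_obs))   # generally nonzero: observed-only model is misspecified
print(np.linalg.norm(resid_aug))   # ~0: augmented model represents the true mean rewards
```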
- On Unbiasedness of Estimators
- The pseudo-reward (Eq. (10), lines 237-240, right column) is unbiased, since it involves no regularization. In contrast, the main estimator is biased due to the inclusion of the lasso/ridge regularization. A small simulation illustrating this kind of unbiasedness is sketched below.
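To illustrate why an importance-weighted pseudo-reward can be unbiased even when a downstream lasso/ridge fit is biased, here is a small Monte Carlo sketch using the generic doubly robust form (our own illustrative construction with made-up numbers; the exact Eq. (10) may differ in its details):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
mu = np.array([0.2, 0.5, 0.1, 0.7])        # true mean rewards (hypothetical)
pi = np.array([0.4, 0.3, 0.2, 0.1])        # action-sampling probabilities
y_hat = np.array([0.0, 1.0, -0.5, 0.3])    # arbitrary (even poor) imputation values

# Generic doubly robust pseudo-reward built from a single round:
#   r_tilde[a] = y_hat[a] + 1{a_t = a} / pi[a] * (r_t - y_hat[a])
# Its expectation is mu[a] for every arm, regardless of y_hat, because the
# correction term has expectation mu[a] - y_hat[a].
n_rounds = 100_000
acc = np.zeros(K)
for _ in range(n_rounds):
    a_t = rng.choice(K, p=pi)
    r_t = mu[a_t] + rng.normal(scale=0.1)  # noisy observed reward
    r_tilde = y_hat.copy()
    r_tilde[a_t] += (r_t - y_hat[a_t]) / pi[a_t]
    acc += r_tilde
print(acc / n_rounds)                       # close to mu for every arm
```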
- Comprehensive Literature Review
- Comparisons with Linear Bandits: Due to space limitations, we deferred the comparison with linear bandits to the appendix, but we are more than happy to move it to the main text in the revision. Regarding linear bandits with fixed features, we have also discussed comparisons with linear bandits under model misspecification (lines 96-108, right column), as well as in Appendix A.
- Tennenholtz, Guy, et al. (2021): We appreciate your pointer to this related work. Upon inspection, while their setting appears similar to ours -- the true features and rewards contain both observed and unobserved components -- their work differs from ours in two respects. First, they assume access to an offline, partially observed dataset to reduce the dimensionality of the problem, whereas in our work partial observability arises naturally, with no offline access. Second, they leverage the correlation between observed and unobserved features, which requires estimation; our method is agnostic to such correlation. Moreover, we exploit the observed features via feature augmentation and doubly robust estimation, resulting in faster regret convergence than their UCB-based approach. We will include this comparison in the revision.
Reference: Xu, Y. and Zeevi, A. (2020). Upper counterfactual confidence bounds: a new optimism principle for contextual bandits.
This paper addresses the linear bandit problem under partial feature observability, proposing an algorithm based on feature augmentation and doubly robust estimation. The authors provide regret bounds under various relationships between observed and latent features, supported by theoretical analysis and experimental results.
During the rebuttal and the discussion, the reviewers and I generally agree that the theoretical contribution is strong. In particular, the paper proposes a novel modeling framework and achieves regret bounds that generalize existing results without relying on strong structural assumptions. This provides a meaningful step forward in the study of linear bandits with latent features.
However, there remain some concerns about the experimental section, especially regarding the behavior of the regret curves and the clarity of the problem motivation. While the authors offered plausible explanations and additional results during the discussion phase, some questions remain. I strongly encourage the authors to clarify the experimental setup and presentation in the camera-ready version, particularly to help readers better understand the flattening of regret curves in the experiments and the real-world relevance of the problem.
Overall, despite these caveats, I find that the core contribution is novel and well-executed, and that the concerns raised do not outweigh the value of the work.