PaperHub
6.3 / 10
Poster · 3 reviewers (ratings: 3, 4, 3; lowest 3, highest 4, std 0.5)
ICML 2025

Reinforcement Learning with Segment Feedback

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
Keywords

Reinforcement learning (RL), segment feedback, binary feedback, sum feedback

Reviews and Discussion

Review
Rating: 3

This paper studies RL with segment feedback, where each episode in an MDP is divided into $m$ equal-length segments, and the agent receives reward feedback only at the end of each segment instead of at every state-action pair. The authors investigate two feedback settings: binary feedback, where the agent receives a binary outcome per segment, and sum feedback, where the agent observes the cumulative reward within a segment. They design TS-style algorithms for binary feedback and UCB-style algorithms for sum feedback, establishing regret upper and lower bounds for both known and unknown transitions. Their theoretical and empirical results show that increasing $m$ decreases regret at an exponential rate under binary feedback, but has little effect under sum feedback.

Questions for the Authors

See comments.

Claims and Evidence

The paper's claims are generally well-supported by theoretical analysis and empirical results. However, two strong assumptions should be further discussed:

  1. The paper assumes that the type of segment feedback (sum or binary) is known in advance. Since the conclusions differ significantly depending on the feedback type, understanding the feedback structure seems crucial before applying these methods. However, in real-world scenarios, it is often difficult to determine how segment feedback is derived from step-level rewards.
  2. The assumption that each episode is divided into $m$ equal-length segments is somewhat strong and may not always hold in real-world scenarios where feedback aggregation is irregular. A discussion on the potential impact of non-uniform segment lengths on the proposed methods, along with possible extensions to handle variable-length segments, would improve the robustness of the claims.

Methods and Evaluation Criteria

The authors design TS-style algorithms for binary feedback and UCB-style algorithms for sum feedback, which are appropriate choices given the problem structure.

Theoretical Claims

The theoretical claims and proofs, particularly those on regret bounds, appear correct and well-structured, with no major issues observed.

Experimental Designs or Analyses

Since trajectory feedback algorithms can be naturally extended to the segment feedback setting, it is important to clarify the advantages of the proposed approach over these existing methods. An experimental comparison with trajectory feedback algorithms (e.g., [1, 2]) should also be included.

[1] On the Theory of Reinforcement Learning with Once-per-Episode Feedback.
[2] Reinforcement Learning with Trajectory Feedback.

Supplementary Material

I reviewed the supplementary material, especially the proofs. They are well-structured and correct.

Relation to Broader Literature

This paper discusses segment feedback, a topic previously studied from different perspectives in [1,2]. Building on [3], the authors design TS-style algorithms for binary feedback, and based on [4], they develop UCB-style algorithms for sum feedback.

[1] Reinforcement Learning from Bagged Reward.
[2] Harnessing Causality in Reinforcement Learning with Bagged Decision Times.
[3] On the Theory of Reinforcement Learning with Once-per-Episode Feedback.
[4] Reinforcement Learning with Trajectory Feedback.

Essential References Not Discussed

The paper does not discuss "Harnessing Causality in Reinforcement Learning with Bagged Decision Times" (AISTATS 2025), which presents a similar problem setting.

Other Strengths and Weaknesses

Strengths

  1. The paper is well written, and the theoretical results appear correct, with detailed analysis and proofs supporting the claims.
  2. The counterintuitive finding that increasing $m$ does not significantly reduce regret under sum feedback is well-supported by both theoretical analysis and experimental results.

Weaknesses

See comments.

Other Comments or Suggestions

  1. (Claims And Evidence) Discuss potential extensions to non-uniform segment lengths to enhance practical applicability.
  2. (Claims And Evidence) In practical applications, how do we determine whether segment feedback follows the sum or binary structure, or if it follows a different aggregation rule altogether? Since the proposed methods rely on knowing the feedback type, addressing this question would improve their real-world usability.
  3. Can the proposed methods be understood as making structural assumptions about the reward aggregation process? If so, does this imply that any form of segment feedback (including binary segment feedback) could, in principle, be analyzed based on sum segment feedback as a general case?
  4. (Experimental Designs or Analyses) The paper does not include a comparison with trajectory feedback methods. Since trajectory feedback algorithms can be extended to the segment feedback setting, an experimental comparison would strengthen the empirical evaluation and clarify the advantages of the proposed approach.

I look forward to your response.

Author Response

Thank you for your time in reviewing our paper! We will revise our paper according to your comments.

1. Determine the Feedback Type

We assume an MDP setting, so it is natural that the trajectory reward is the sum of the step-level rewards. Given this model, we can ask a human to give a score to a trajectory, which approximates the sum feedback and has been considered in prior works cited in our paper. Alternatively, we can ask for thumbs-up/down feedback, which is the binary feedback model. So we believe that in our context it is reasonable to assume the feedback type is known ahead of time, since the feedback type is designed by us.
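
As an illustration, both feedback channels can be simulated from step-level rewards as follows (a minimal sketch, not code from the paper; the Gaussian noise scale on sum feedback and the sigmoid link on binary feedback are modelling assumptions):

```python
import numpy as np

def segment_feedback(step_rewards, m, kind="sum", noise_std=0.1, rng=None):
    """Minimal sketch: aggregate the per-step rewards of one episode into
    m (near-)equal-length segments and emit either sum or binary feedback.
    The noise scale and the sigmoid link are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    segments = np.array_split(np.asarray(step_rewards, dtype=float), m)
    feedback = []
    for seg in segments:
        total = seg.sum()
        if kind == "sum":
            # noisy observation of the segment reward sum
            feedback.append(total + rng.normal(0.0, noise_std))
        else:
            # thumbs up/down generated through a sigmoid link
            p = 1.0 / (1.0 + np.exp(-total))
            feedback.append(int(rng.binomial(1, p)))
    return feedback

# Example: an episode of H = 6 steps split into m = 2 segments
rewards = [0.1, 0.0, 0.4, 0.2, 0.0, 0.3]
print(segment_feedback(rewards, m=2, kind="sum"))
print(segment_feedback(rewards, m=2, kind="binary"))
```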

2. Assumption on Equal-Length Segments

We agree with the reviewer that in practice, the segment length may be unequal. However, studying segments of equal length is an important starting point which has uncovered non-trivial challenges, and we reveal a new and surprising result: segments help under binary feedback, but do not help much under sum feedback.

We believe that our algorithms and analysis can be extended to segments of variable length. For example, a natural extension is to consider that the arrival of reward feedback follows a point process. In this case, the variable segment length will affect the noise variance of reward feedback and the analysis of sums of segment visitation indicators. One idea to handle this case is to obtain high-probability bounds on the segment length and the number of segments within a trajectory, and use these bounds in our analysis.
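
For example, a discrete analogue of such a point process could be sketched as follows (purely illustrative; the per-step arrival probability is an assumption of this sketch, not part of the paper):

```python
import numpy as np

def random_segment_ends(H, arrival_prob=0.3, rng=None):
    """Illustrative sketch of variable-length segments: after each step,
    reward feedback arrives independently with probability arrival_prob
    (a Bernoulli point process); the last step always closes a segment."""
    rng = rng or np.random.default_rng()
    ends = [h for h in range(1, H) if rng.random() < arrival_prob]
    ends.append(H)  # feedback is always observed at the end of the episode
    return ends     # segment end indices, e.g. [3, 7, 10] for H = 10

print(random_segment_ends(H=10))
```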

3. Experimental Comparison

We ran preliminary experiments to compare our algorithms with the adaptations of trajectory feedback algorithms [1, 2] (reviewer's reference numbers) to the segment setting.

In our experiments, we set $|\mathcal{A}|=5$, $r_{\max}=0.5$, $\delta=0.005$, and $m=2$. In the binary feedback setting, we set $|\mathcal{S}|=5$, $H=6$, and $K=1000$. In the sum feedback setting, we set $|\mathcal{S}|=3$, $H=10$, and $K=5000$. For each algorithm, we performed 15 independent runs and report the average regret (with standard deviation) and the average running time across runs.

| Binary Feedback | Regret (Std.) | Running Time (s) |
| --- | --- | --- |
| SegBiTS | 234.26 (35.12) | 36.98 |
| Adaptation of "UCBVI with trajectory labels" [1] | 199.94 (29.54) | 7796.90 |

| Sum Feedback | Regret (Std.) | Running Time (s) |
| --- | --- | --- |
| E-LinUCB | 3954.66 (7.68) | 940.15 |
| Adaptation of OFUL [2] | 4180.95 (12.74) | 925.60 |

As shown in the tables, under binary feedback, while our algorithm SegBiTS has a slightly higher regret than the adaptation of "UCBVI with trajectory labels" [1], SegBiTS achieves a significantly faster running time, since SegBiTS is computationally efficient while "UCBVI with trajectory labels" is not. Under sum feedback, our algorithm E-LinUCB attains a lower regret than the adaptation of OFUL [2], since E-LinUCB employs the experimental design technique to sharpen the regret.

The experimental results for $m \in \{1, H\}$ are similar. Here we only present the results for $m=2$ due to the space limit. We will include more experiments in our revision.

4. Reviewer's Reference [2] ("Harnessing ...")

The problem and focus in [2] are very different from ours. [2] considers RL with bagged decision times, where the state transitions are non-Markovian within a bag, and a reward is observed at the end of the bag. This problem can be regarded as a periodic MDP with non-stationary state transitions and reward functions in a period, and [2] adapts an RL algorithm for standard stationary MDPs to this periodic MDP to solve this problem. The focus of [2] is to handle the non-Markovian state transitions within a bag using a causal directed acyclic graph, instead of investigating bagged rewards. [2] does not consider how to infer the reward function of state-action pairs from bagged rewards, or study how the frequency of reward feedback impacts learning performance like us.

We will cite [2] and include this discussion in our revision.

5. Structural Assumptions

We assume an MDP model, which is widely used in reinforcement learning. In the MDP model, the reward of any segment of any trajectory is the sum of the rewards of each individual step in the segment. So the type of reward aggregation is fixed by our MDP model. And for binary feedback, we assume that it is generated according to a sigmoid function, which is also widely accepted and used. Beyond these two assumptions, we didn't make any other structural assumptions. To the best of our knowledge, the fact that one gets extremely coarse information about the sum reward in the binary feedback case makes it impossible to have a common analytical technique for both feedback models.
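
Concretely, with $\mathcal{H}_i$ denoting the set of steps in segment $i$ (notation here is ours, for illustration only), the two feedback models can be written as

$$
y_i^{\mathrm{sum}} = \sum_{h \in \mathcal{H}_i} r(s_h, a_h) + \epsilon_i,
\qquad
y_i^{\mathrm{bin}} \sim \mathrm{Bernoulli}\Big(\sigma\Big(\sum_{h \in \mathcal{H}_i} r(s_h, a_h)\Big)\Big),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}},
$$

where $\epsilon_i$ denotes the observation noise on sum feedback. Beyond the MDP reward structure and the sigmoid link, no further structure is imposed on the aggregation.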

Review
Rating: 4

This paper studies the problem of RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. The authors study two settings with binary and sum feedback, developing computationally efficient algorithms achieving nearly optimal regret upper bounds. They show that the number of segments $m$ matters significantly in the binary feedback setting, while it is not that important in the sum feedback setting. They also provide regret lower bounds and experimental results to validate their claims.

Questions for the Authors

Could you please explain more about why the setting for binary feedback considered in this work is better than that of (Chatterji et al., 2021)?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No.

Experimental Designs or Analyses

No.

Supplementary Material

No.

Relation to Broader Literature

There have been several works that consider RL with trajectory feedback (Efroni et al., 2021; Chatterji et al., 2021). A recent work, Tang et al. (2024), studies segment feedback, but only from an empirical perspective. This is the first work to theoretically study RL with segment feedback.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  1. This is the first work to theoretically study the important setting of RL with segment feedback. It is well motivated.
  2. The paper is well written and easy to follow.
  3. The authors conduct a complete study of this topic, providing algorithms with nearly optimal regret upper bounds and proving novel lower bounds. They also provide experimental results to support the theoretical findings.

Weaknesses:

The authors consider a setting for binary feedback that differs from the previous work (Chatterji et al., 2021). It would be better to explain why the setting the authors consider is better.

Other Comments or Suggestions

None.

Author Response

Thank you for your time and effort in reviewing our paper! We will certainly incorporate your comments in our revision.

1. Comparison with [Chatterji et al., 2021] in Formulation

The objectives in [Chatterji et al., 2021] and our paper are different. In [Chatterji et al., 2021], the goal is to choose a policy which maximizes the probability of a trajectory getting a rating of 1, i.e., getting positive feedback. In contrast, in our model, the goal is to maximize the expected sum of rewards over trajectories. In our opinion, both models have value and are applicable in different contexts. In particular, our model is better suited to situations where we want to solve an MDP but only get binary segment feedback.
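
In symbols (notation ours, for illustration only), the two objectives are

$$
\max_{\pi}\ \Pr^{\pi}\big(y_{\mathrm{traj}} = 1\big)
\quad \text{in [Chatterji et al., 2021]}
\qquad \text{versus} \qquad
\max_{\pi}\ \mathbb{E}^{\pi}\Big[\sum_{h=1}^{H} r(s_h, a_h)\Big]
\quad \text{in our work.}
$$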


Reference:

Chatterji et al. On the theory of reinforcement learning with once-per-episode feedback. NeurIPS, 2021.

Review
Rating: 3

This paper introduces Reinforcement Learning with Segment Feedback, a framework where episodic MDPs are divided into $m$ segments, and feedback is provided at the end of each segment. The authors study binary feedback (stochastic binary outcomes) and sum feedback (cumulative rewards with noise), analyzing how the number of segments $m$ impacts learning efficiency.

Questions for the Authors

I think this paper deals with tabular RL. Do you expect similar results to hold in non-tabular settings?

Claims and Evidence

Binary Feedback: Exponential Regret Reduction with More Segments

Theoretical Bounds: Theorems 4.1 and 4.3 establish regret upper bounds scaling as $\tilde{O}(\exp(H/m)/\sqrt{K})$, while Theorem 4.2 provides a matching lower bound. The exponential dependency arises from the sigmoid function's sensitivity: smaller segments reduce the effective reward scale, improving the distinguishability of actions.

Figure 2a shows regret decreases rapidly as $m$ increases.

Sum Feedback: Regret Insensitive to $m$

Theorems 5.1 and 5.3 show regret $\tilde{O}(\sqrt{HK})$ independent of $m$, with a matching lower bound (Theorem 5.2). The intuition is that segment-level noise and covariance matrix shrinkage counteract the benefit of more observations.

Figure 2b shows minimal regret variation across $m$, validating the theory.

Methods and Evaluation Criteria

SegBiTS (binary feedback): Uses Thompson sampling with Gaussian perturbations and MLE for reward inference.
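
As I read it, the MLE step amounts to a (regularized) logistic regression of the binary segment outcomes on segment visitation-count features; a rough sketch under that reading (not the authors' code) is:

```python
import numpy as np

def mle_rewards(X, y, lam=1.0, lr=0.1, iters=2000):
    """Rough sketch: estimate a tabular reward vector from binary segment
    feedback by ridge-regularized logistic MLE via gradient ascent.
    X[i] holds the visitation counts of each (s, a) pair in segment i and
    y[i] is the binary outcome of segment i; the sigmoid link and the
    regularizer are assumptions of this sketch."""
    n, d = X.shape
    r = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ r))   # predicted P(y_i = 1) per segment
        grad = X.T @ (y - p) - lam * r     # gradient of the penalized log-likelihood
        r += lr * grad / n
    return r

# Tiny example: 3 segments over d = 4 state-action pairs
X = np.array([[1, 0, 1, 0], [0, 2, 0, 0], [1, 1, 0, 1]], dtype=float)
y = np.array([1, 0, 1], dtype=float)
print(mle_rewards(X, y))
```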

E-LinUCB (sum feedback): Incorporates E-optimal design to refine the covariance matrix, improving on Efroni et al. (2021) by a $\sqrt{H}$ factor.

In the experiments, the authors plot the cumulative regret to verify the theory.

Theoretical Claims

This is a theory paper. See Claims And Evidence.

Experimental Designs or Analyses

The authors use a very simple experiment to verify the theoretical results. See Claims And Evidence.

I understand that this is mainly a theory paper, but is it possible to experiment on more complex environments than the ones you have used, such as Atari?

Supplementary Material

I checked Appendix A for the experimental details.

Relation to Broader Literature

Bridging Classic RL and Trajectory Feedback: The paper introduces a new RL model with segment feedback that acts as a middle ground between per-state-action feedback in classic RL (Sutton & Barto, 2018) and trajectory feedback (Efroni et al., 2021; Chatterji et al., 2021). This novel paradigm allows feedback at the segment level, which is more practical in real-world scenarios like autonomous driving and robotics. The authors highlight that previous works on trajectory feedback rely on the martingale property of entire trajectories (Efroni et al., 2021), which does not hold for segment feedback due to the dependency among segments within a trajectory.

Essential References Not Discussed

This paper properly discusses the references I know.

Other Strengths and Weaknesses

Please see previous reviews.

Other Comments or Suggestions

Please see previous reviews.

Author Response

Thank you for your time and effort in reviewing our paper! We will certainly incorporate your suggestions in our revision.

1. More Experiments

This is an excellent suggestion. However, it is difficult to perform experiments on the Atari environments before the rebuttal deadline, since it would take some time to adapt our discrete state-space setting to Atari's continuous, image-based state space. Following the reviewer's suggestion, we instead ran preliminary experiments on a more complex Grid World environment called Cliff Walking, which is used in the popular RL textbook [Sutton & Barto, 2018].

| (0) | (0) | (0) | (0) | (0) | (0) |
| --- | --- | --- | --- | --- | --- |
| $\rightarrow$ ($r_{\max}$) | $\rightarrow$ ($r_{\max}$) | $\rightarrow$ ($r_{\max}$) | $\rightarrow$ ($r_{\max}$) | $\rightarrow$ ($r_{\max}$) | $\downarrow$ ($r_{\max}$) |
| $\uparrow$ start (0) | $\otimes$ ($-r_{\max}$) | $\otimes$ ($-r_{\max}$) | $\otimes$ ($-r_{\max}$) | $\otimes$ ($-r_{\max}$) | end ($r_{\max}$) |

The above table depicts the Cliff Walking environment considered in our experiments. Each table cell denotes a state. An agent starts from the bottom-left state (marked as "start") and aims to arrive at the bottom-right state (marked as "end"). The reward of each state is marked in the bracket $(\cdot)$ in the state. The states with $\otimes$ denote a cliff and generate negative rewards $-r_{\max}$. The optimal path is the path close to the cliff, and the path close to the upper wall is suboptimal. In our experiments, $|\mathcal{S}|=18$, $|\mathcal{A}|=4$, $H=8$, $m\in\{1,4,8\}$, $K=50000$, $r_{\max}=2$, and $\delta=0.005$. We performed our algorithm SegBiTS for 3 independent runs under binary segment feedback, and report the average regret with standard deviation to see how the number of segments $m$ impacts learning performance.
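
For completeness, the reward grid above can be written in code as follows (an illustrative reconstruction, not our experiment code); the regret results are in the table below.

```python
import numpy as np

r_max = 2.0  # as in our experiments

# Illustrative reconstruction of the 3 x 6 Cliff Walking reward grid
# (rows from top to bottom; the agent starts at the bottom-left cell
# and aims for the bottom-right cell).
rewards = np.array([
    [0.0,    0.0,    0.0,    0.0,    0.0,    0.0  ],  # top row (suboptimal path)
    [r_max,  r_max,  r_max,  r_max,  r_max,  r_max],  # row next to the cliff (optimal path)
    [0.0,   -r_max, -r_max, -r_max, -r_max,  r_max],  # start, cliff cells, end
])
```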

| Setting | Regret of SegBiTS (Std.) |
| --- | --- |
| $H=8$, $m=1$ | $1.114 \times 10^6$ ($1.112 \times 10^3$) |
| $H=8$, $m=4$ | $6.365 \times 10^5$ ($1.167 \times 10^3$) |
| $H=8$, $m=8$ | $4.544 \times 10^5$ ($1.186 \times 10^2$) |

As shown in the table, under binary feedback, as the number of segments $m$ increases, the regret significantly decreases, which matches our theoretical finding.

Due to the time limit, here we only present the results for this instance and binary segment feedback. We will include more experimental results on larger instances in our revision.


Reference:

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. 2018.

2. Non-tabular Settings

Yes, we expect that similar results will hold in non-tabular settings, e.g., linear function approximation. In the linear function approximation setting, it is assumed that each state-action pair is associated with a feature vector $\phi(s,a) \in \mathbb{R}^d$, there exists a global reward parameter $\theta^* \in \mathbb{R}^d$, and segment reward feedback is generated according to $(\sum_{h \in \mathcal{H}} \phi(s_h,a_h))^\top \theta^*$, where $\mathcal{H}$ denotes the set of step indices in a segment. Since our analytical procedure is mainly based on visitation indicators $\phi(s,a)$ and the fact that segment reward feedback is generated linearly with respect to $\phi(s,a)$, we believe that generalizing $\phi(s,a)$ from a visitation indicator in the tabular setting to a feature vector in the linear function approximation setting should be possible. We will further investigate non-tabular settings in our future work.
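
As an illustration of this extension (a sketch with assumed notation, not part of the paper), sum segment feedback in the linear setting would be generated as follows:

```python
import numpy as np

def linear_segment_feedback(features, theta_star, noise_std=0.1, rng=None):
    """Sketch of sum feedback under linear function approximation:
    `features` lists the phi(s_h, a_h) vectors of the steps in one segment,
    and the feedback is their sum dotted with theta_star plus noise
    (the Gaussian noise is an assumption of this sketch)."""
    rng = rng or np.random.default_rng()
    segment_feature = np.sum(features, axis=0)  # sum_{h in segment} phi(s_h, a_h)
    return segment_feature @ theta_star + rng.normal(0.0, noise_std)

# The tabular setting is the special case where phi(s, a) is a one-hot indicator.
d = 4
theta_star = np.array([0.2, -0.1, 0.5, 0.0])
phis = [np.eye(d)[0], np.eye(d)[2]]  # two steps visiting pairs #0 and #2
print(linear_segment_feedback(phis, theta_star))
```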

Final Decision

This paper provides a theoretical analysis of reinforcement learning with segment feedback. The authors study two settings, with binary and sum feedback, developing computationally efficient algorithms achieving nearly optimal regret bounds for each. During the discussion phase with reviewers, it was agreed that this work provides a meaningful contribution to the study of RL with novel forms of feedback. That is why I am recommending acceptance.