Fixed-Confidence Multiple Change Point Identification under Bandit Feedback
Abstract
Reviews and Discussion
This paper investigates the problem of identifying multiple change points in a piecewise constant function under bandit feedback, ensuring a fixed level of confidence in the results. The authors consider two scenarios: 1) a known number of change points, and 2) an unknown number of change points. They establish instance-dependent lower bounds on the sample complexity of identifying the change points. Building on these, they design a computationally efficient algorithm inspired by Track-and-Stop, which achieves asymptotic optimality.
Update After Rebuttal
The authors have successfully addressed my concerns, so I will maintain my positive score.
Questions for Authors
Regarding the "other weakness," can you provide any practical method for determining K (the number of discretization points) when m is unknown?
Claims and Evidence
The overall claims in the paper are clear and well-supported by both theoretical analysis and empirical results.
Methods and Evaluation Criteria
Overall, the proposed methods make sense for the problem setting.
Theoretical Claims
The theoretical claims are strong and well-reasoned. While I didn’t go through every detail of the proofs, they seem correct.
Experimental Design and Analysis
The experimental designs are sound and sufficiently support the theoretical results.
Supplementary Material
I have primarily reviewed the proofs for the main theorems, along with parts of other sections.
Relation to Existing Literature
As discussed in Section 7, I agree that identifying all change points whose magnitudes exceed a threshold is a more practical and interesting problem setting. I hope that the methods and insights from this paper can also contribute to addressing this more practical setting.
Missing Important References
I think the paper covers the related work quite thoroughly.
Other Strengths and Weaknesses
Other Strengths:
- The proposed algorithms are computationally efficient.
- The empirical results clearly strengthen the paper.
Other Weaknesses:
- In the setting where the number of changes (m) is unknown and the original action space is continuous, the practical choice of K is unclear. Specifically, if the true number of changes is unknown, it is impossible to confidently select an appropriate number of actions, since K must be at least m. This ambiguity limits the practicality of the algorithm, as one cannot set K without prior knowledge of m.
Other Comments or Suggestions
No other comments.
Thank you for your review of our work. We appreciate that you found the claims in the paper clear, the theoretical claims strong and well-reasoned, and the experimental results supportive of our theoretical contributions. We are also hopeful that the proposed methods will contribute to addressing additional interesting problems, as you point out. We respond to some of your comments below.
Extension to the Continuous Setting:
If the action space were continuous, then we could not expect to identify a change point exactly. In this case, it would be reasonable to introduce some acceptable 'tolerance' ε for the error between our identified change points and their true locations. This would naturally lead to a uniform discretization with gaps of size ε, meaning that there are of order 1/ε arms (note that this is similar to the setting considered by Lazzaro & Pike-Burke, 2025). The tolerance level ε can either be set using domain-specific knowledge of an acceptable error, or by assuming that there is some minimum distance between the change points. The latter is reasonable as we cannot expect to detect changes that are arbitrarily close to each other.
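To make this concrete, here is a minimal sketch of the uniform discretisation (our illustration only, assuming a normalised action space [0, 1]; the function name is ours, not from the paper):

```python
import numpy as np

def discretise(eps: float) -> np.ndarray:
    """Uniform grid over [0, 1] with spacing at most eps, so that any
    identified change point lies within eps of its true location."""
    num_arms = int(np.ceil(1.0 / eps)) + 1  # of order 1/eps arms
    return np.linspace(0.0, 1.0, num_arms)

arms = discretise(eps=0.01)  # 101 arms for a tolerance of 0.01
```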
It would be interesting to consider approaches more focused on this continuous analogue of our problem, such as adaptive discretisation/zooming, which are popular in continuous action spaces [1]. However, such an approach would still need some prior knowledge or domain-specific guidance on when to stop in this fixed-confidence setting. This is because changes in mean could occur over an arbitrarily small region of the action space (i.e. changes could occur arbitrarily close to each other). Hence, there will in general be some change points that we inevitably miss if we want to stop in finite time. This is normally avoided in regret-minimisation settings by assuming that the mean reward function is sufficiently smooth across the action space [1,2]. However, in our abruptly-changing setting there are no such smoothness assumptions. Therefore, to stop in finite time in our setting, we would need an assumption on the distance between adjacent change points (mentioned above) or a restriction on the number of change points ('m', which you point out).
We will include more discussion of the extension to continuous action spaces in the camera ready version. Thank you for the suggestion.
References:
[1] Kleinberg, Robert, Aleksandrs Slivkins, and Eli Upfal. "Bandits and experts in metric spaces." Journal of the ACM, 2019.
[2] Bubeck, Sébastien, et al. "X-Armed Bandits." Journal of Machine Learning Research, 2011.
The goal is to identify, with high confidence, the change points in the mean reward of the actions across the action space, using as few samples as possible. A new Track-and-Stop algorithm achieving asymptotic optimality is proposed for this task. Matching lower bounds are also proven.
-- update after rebuttal --
The authors' clarification of the novelty of their work w.r.t. Garivier & Kaufmann 2016 is convincing and helpful. I decide to keep my score.
Questions for Authors
Addressing the following questions would enhance my understanding of the presented results, and I will update my evaluation accordingly:
- Garivier & Kaufmann 2016 is cited quite a few times in the paper; could you elaborate on the novelty of the present work as compared to GK2016? In particular, it seems quite a few results in this paper (e.g. Theorem 4.2, D-tracking) are specialised forms of results from GK2016.
- Lines 318-323: is the coupled effect helpful or unhelpful? It's unclear from the writing here. What I find counterintuitive is that you say it's harder to obtain closed-form expressions when N is known than when N is unknown (where you obtained (17)). Why is this the case?
- Can you compare the performance of Algorithm 2 with the alternative approach described on lines 415-417?
- Eq. (5): how do you initialise?
- Line 237 (right): can you elaborate on how Z(t) relates to the CUSUM statistic, by giving exact expressions? I'm not suggesting adding this to the paper; it would just help me appreciate your results better.
Claims and Evidence
The paper is well-written, with matching upper and lower bounds presented for both the single changepoint and multiple changepoints settings.
Methods and Evaluation Criteria
Simulation results show that the stopping time of the proposed algorithm follows the lower bound quite closely.
Theoretical Claims
I only checked the theoretical claims intuitively, but didn't check the proofs in detail.
Lines 318-323: is the coupled effect helpful or unhelpful? It's unclear from the writing here. What I find counterintuitive is that you say it's harder to obtain closed-form expressions when N is known than when N is unknown (where you obtained (17)). Why is this the case?
Experimental Design and Analysis
Can you compare the performance of Algorithm 2 with the alternative approach described in lines 415-417? This would be an informative result to have.
Supplementary Material
I skimmed through Appendix D.
Relation to Existing Literature
An interesting work for the bandits community and changepoint detection community.
Missing Important References
NA
Other Strengths and Weaknesses
Presentation is clear, and I particularly appreciate how the authors always devote a sentence or two after each theoretical claim to explain its significance and implications.
I think the distinction between the present setting and the non-stationary bandits setting, described in lines 58-62 (right), should probably be moved earlier. This is important.
Other Comments or Suggestions
line 272 (right): it might be clearer to refer to Theorem 4.1 instead of Corollary 4.3
nitpick: line 138 (right): it's clearer to first define Exact-(N, δ) and then substitute in N=1
nitpick: can combine (11) into Prop 4.4 to simplify (7) and stay consistent with line 3 of Algorithm 1
Thank you for the detailed review and for your constructive questions. We are glad that you found the paper well-written and that you think our work would be interesting for the bandits/change point community. We respond to your comments below.
Other Strengths and Weaknesses:
We will include further discussion earlier in the paper on the distinction between our setting and non-stationary bandits, emphasising the novelty of our work. Thank you for the suggestion.
Other Comments Or Suggestions:
Thank you for pointing these out, we will update the paper accordingly.
Questions For Authors:
- (Garivier & Kaufmann, 2016) is a seminal work and their tracking ideas have motivated the class of track-and-stop strategies and associated theory for a wide variety of works, including our own. This is because the lower bounds they construct for best arm identification are also applicable to a wide variety of other pure exploration bandit problems (Ch. 33.2, Lattimore & Szepesvári, 2020). However, the bounds they present are in implicit form (i.e. as the solution of an optimisation problem). Hence, their tracking methods are computationally expensive and somewhat unintuitive. In our work we are able to build on the bounds from (Garivier & Kaufmann, 2016) to provide intuitive instance-dependent lower bounds, which can then, in turn, be used to define computationally efficient and intuitive tracking procedures for our setting. We will include additional discussion in the paper to clarify the novelty of our results. To make the tracking idea concrete, we include a generic sketch of the D-tracking rule at the end of our responses below.
- Having the additional knowledge of the true number of change points in the environment makes the learning problem easier than when the learner does not know the exact number of change points. Intuitively, when the number of change points is known and we have some information about a subset of them, this can be informative regarding the location of the other change points. For example, suppose we knew there are exactly three change points in an environment, and suppose a learner has sampled around the first two sufficiently such that they are confident of their respective locations. These samples would then also be informative about the location of the third change point and will help identify it with fewer samples, since we know there is only one change point remaining. Incorporating this information means that the optimal proportion of actions we should play near one change point no longer depends only on that change point, but also on adjacent change points. Because of this, the optimization problem in equation (14) becomes more complicated and finding a general closed-form solution becomes more challenging. However, Theorem 5.2 tells us that the cost in sample complexity of not knowing the true number of change points is at most a factor of 2 asymptotically. Moreover, even when the number of change points is known, we can run the same computationally efficient algorithm, MCPI. This algorithm avoids solving the complex optimization problem and obtains sample complexity within a factor of 2 of the optimal (non-closed-form) rate asymptotically. We will improve the writing of this section of the paper to clarify this and include additional discussion; thank you for highlighting this.
- We believe that the theoretical (asymptotic) performance of MCPI (Algorithm 2) and the alternative approach we mentioned would be similar. However, we re-emphasise that MCPI will quickly identify very large changes when they are easy to identify, whereas the alternative approach will identify all change points at the same rate due to the tracking methods used. More importantly, we believe Algorithm 2 is much easier to extend to the case where we have no knowledge of the number of change points, as different/new change points can be approached sequentially. We will clarify this in the paper.
- We only update the values in (5) after playing each of the actions once, in both Algorithms 1 and 2. Hence, the change point estimate is initialised only after an initial set of observations has been made across the whole action space. We will update the paper to discuss initialisation; thank you for pointing this out.
- Define $\hat{\mu}_{i:j}$ and $N_{i:j}$ as the empirical mean and the number of samples made from arms $i$ to $j$ inclusive, respectively. Then the CUSUM statistic for exactly one change point after arm $k$ can be written (Eq. 3, Verzelen et al., 2020) as $\sqrt{N_{1:k} N_{k+1:K} / N_{1:K}}\,(\hat{\mu}_{1:k} - \hat{\mu}_{k+1:K})$. If we square this statistic, we see that its form is very similar to (11), except that in (11) we consider only the samples adjacent to the change point rather than all actions to the left and right of the change point. This exclusion is important for settings in which the true number of changes is unknown, as considered later in our paper. We will add further discussion of this to the appendix.
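For concreteness, here is a small sketch of this pooled CUSUM statistic (our illustration, using the segment notation above; `samples[i]` is a hypothetical array holding the observations from arm i):

```python
import numpy as np

def cusum(samples: list, k: int) -> float:
    """CUSUM statistic for a single change between arms k and k+1 (0-indexed)."""
    left = np.concatenate(samples[: k + 1])   # all samples from arms 1..k
    right = np.concatenate(samples[k + 1 :])  # all samples from arms k+1..K
    n_l, n_r = len(left), len(right)
    # sqrt(N_{1:k} * N_{k+1:K} / N_{1:K}) times the pooled mean difference
    return float(np.sqrt(n_l * n_r / (n_l + n_r)) * (left.mean() - right.mean()))
```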
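Finally, as promised above, a generic sketch of the D-tracking rule of Garivier & Kaufmann (2016) that our tracking procedures build on (a minimal illustration, not our MCPI algorithm itself; `weights` stands in for the plug-in optimal sampling proportions, which in our setting have the intuitive closed forms discussed above):

```python
import numpy as np

def d_tracking_step(counts: np.ndarray, weights: np.ndarray) -> int:
    """Return the next arm to sample under the D-tracking rule.

    counts[i] = number of pulls of arm i so far; weights = current
    plug-in estimate of the optimal sampling proportions (sums to 1).
    """
    t = counts.sum()
    K = len(counts)
    # Forced exploration: pull the least-sampled arm if it is badly
    # under-sampled, so that every arm is sampled infinitely often.
    under = np.where(counts < max(np.sqrt(t) - K / 2, 0.0))[0]
    if len(under) > 0:
        return int(under[np.argmin(counts[under])])
    # Otherwise pull the arm lagging furthest behind its target share.
    return int(np.argmax(t * weights - counts))
```

The tracking step only decides which arm to pull next; the stopping side of a track-and-stop strategy separately checks a generalised likelihood ratio statistic against a threshold.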
I thank the authors for their detailed response. Their clarification of the novelty of their work w.r.t. Garivier & Kaufmann 2016 is convincing and helpful. I decide to keep my score.
This work introduces a fixed-confidence piecewise constant bandit problem, and provides instance-dependent lower bounds on the complexity of change point identification in this problem. This work also devises a computationally efficient algorithm as a variant of track-and-stop and proves its asymptotic optimality.
Questions for Authors
see above
Claims and Evidence
yes
Methods and Evaluation Criteria
yes
Theoretical Claims
Partially; the theoretical claims seem to be correct.
Experimental Design and Analysis
Yes, this work uses a simple and straightforward experimental design, but I have some questions about the setup, especially the ordering of the K arms; see "Other Comments or Suggestions" below.
Supplementary Material
Only part of Appendix A.
Relation to Existing Literature
The key contributions of this paper relate to multi-armed bandit problems, but I think the scope might be a bit narrow --- no concrete application examples are provided to justify the setup, and the change-point defined in this work is a bit confusing since it requires a perfect ordering of the arm indexes. The work is also less related to the change-point detection literature, which instead focuses more on temporal changes.
Missing Important References
I am only familiar with a few key papers in this area, and I am not aware of other works studying the same problem, so to the best of my knowledge, no.
Other Strengths and Weaknesses
Strengths: a novel problem of identifying change-points within the arms, with theoretical lower bounds on sample complexity; moreover, practical algorithms that achieve asymptotic optimality are proposed.
Weaknesses: the motivation for the problem setting considered is a bit vague, and it is not clear whether such change-point identification can lead to downstream task performance improvements such as reward maximization. It also seems to be a strict assumption (or to require certain prior knowledge) that the ordering of the arm indexes satisfies the N change-point assumption/definition in this work.
Other Comments or Suggestions
For the scenario with one or multiple change-points, the formulation presented in the equations on lines 120 and 129 assumes a "correct" ordering of the K arms, such that the change-points are defined in terms of the arm index j∈{1,…,K}. Could the authors justify why such an ordering always exists or can be assumed in practice? For example, without pre-knowledge of the correct ordering {1,1,1,2,2,2}, I could perhaps order the arms as {1,2,1,2,1,2} in their mean rewards; would the proposed method still work under such a random ordering?
To elaborate further on this point, a change-point refers to a specific time index at which the distribution changes (as commonly seen in non-stationary, piecewise constant bandit problems). However, in this work, the change-point refers to a spatial index. A key open question is thus how one should define an appropriate spatial ordering to preserve a "piecewise constant" structure for the mean reward function. Providing a more concrete practical example would help clarify this setup. In general multi-arm bandit problems, each arm may have a different mean reward, and even after grouping similar rewards together, these arms may not necessarily be neighbors in the given indexing. For instance, the example provided in Appendix A assumes a very restrictive scenario: without prior knowledge of reward means, the 19 arms are already ordered in a way resulting in exactly N=2 change-points (yielding the sequence 2,...,2, 4,...,4, 0,...,0). A random ordering of these arms, however, would produce significantly more change-points according to the definition used in this paper.
Moreover, how does the problem considered differ from the best-arm identification problem in multi-arm bandits when there is only one change-point (or equivalently, when there are only two possible reward distributions: low and high)? A comparison of sample complexity between these two settings would be helpful. Additionally, since this work primarily focuses on the expected sample size (stopping time) required to accurately identify the “change-point,” it implicitly serves as an indirect approach to eventually maximizing the reward. This aspect naturally broadens the scope of the proposed algorithm. Thus, I am curious about how the eventual reward compares to traditional algorithms, such as UCB-type approaches. It would greatly clarify the practical utility of the proposed approach if the author(s) could briefly discuss this comparison.
In equations (7) and (11), some notation appears that is not defined.
We thank the reviewer for their detailed, constructive comments and we are glad that you found the problem novel. We provide some responses to your comments below.
Related Literature:
You are correct that most of the literature on bandits with change points focuses on non-stationary environments which change over time, and these works exploit online change point analysis methods. One of the novelties of our work is that we instead consider a discontinuity in the action space and assume that the mean rewards are stationary across time. This is motivated by several real-world problems (see below). This was discussed in Section 2; however, we will include further discussion earlier in the paper to avoid confusion and emphasise the novelty of our work. Thank you for the suggestion.
Order of the arms:
Our goal in this paper is to develop algorithms for settings in which there is structure in the mean rewards across the action space. This is motivated by experimental settings in which the order of the arms has a spatial or experimental meaning. For example, in material development we may want to identify under what experimental conditions new materials abruptly change between different physical behaviours (Park et al., 2021). Hence, we could test the behaviour of a new material at different pressure levels. There is a natural ordering of pressure levels. Moreover, we expect similar behaviour from the material as we increase the pressure, until we observe abrupt changes in its behaviour. To find the location of the changes it is necessary to consider the fixed natural ordering of pressure levels. In this setting, and many others, it would not make sense to randomize the order of the arms, as this would distort the meaning of the problem.
Section 1 discusses several other real-world settings where there is a change point in an action space with a fixed ordering. We also point out that there are many other well-studied bandit problems which consider fixed orderings of arms, for example monotone bandits [1] and continuum-armed bandit problems such as Lipschitz bandits [2].
We finally note that the problem where we do not care about arm ordering and aim to group arms into clusters with the same reward is known as Clustering Bandits (Yang et al., 2022). As discussed in Section 2, this is a different problem to the one we study, and we show that exploiting the structure present in our problem can lead to improved performance.
We hope this addresses your main concern, and that you may consider raising your score.
Complexity of our setting vs best arm identification:
As suggested, suppose there are K Gaussian arms with mean rewards such that there is a unique 'best arm' and one change point. From (Garivier & Kaufmann, 2016), the sample complexity of the best-arm identification problem asymptotically scales linearly in K, whereas from our Corollary 4.3 the complexity of the change point identification problem depends only on the magnitude of the change, with no linear dependence on K. This is because we can use the piecewise constant structure and information from across the action space to confidently locate the change in mean, whereas in best-arm identification we have to sample each of the K arms sufficiently often to confidently identify the arm with the highest mean reward. We will include further discussion of this in the paper; thank you for the suggestion.
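As a purely heuristic illustration of this contrast (our own sketch, up to constants, assuming unit-variance Gaussian rewards, gaps $\Delta_i$ between each suboptimal arm and the best arm, and a single change of magnitude $\Delta$):

$$\mathbb{E}[\tau_{\mathrm{BAI}}] \asymp \Big(\sum_{i \neq i^*} \frac{1}{\Delta_i^2}\Big)\log\frac{1}{\delta}, \qquad \mathbb{E}[\tau_{\mathrm{CP}}] \asymp \frac{1}{\Delta^2}\log\frac{1}{\delta},$$

so that when the gaps are bounded away from zero, the former grows linearly in $K$ while the latter does not.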
Regret Extension:
While the pure exploration set-up of this paper and the regret-minimisation set-up of other works are distinct, it would be interesting to consider extensions of our setting to the latter. As you allude to, an outcome of our learning methods could be the identification of regions of the action space with high mean reward. However, our current methods are designed specifically to locate the change points in piecewise constant structures, regardless of the adjacent rewards. Therefore, modifications to our methods and analysis would be necessary to focus on regret minimisation. Hence, for now we leave these ideas as interesting future work and will include a discussion of this in the paper; thank you for the suggestion.
Definition:
Indeed, the notation in question was only defined later, on line 1011. It should have been introduced earlier and this will be rectified; thank you for bringing it to our attention.
References:
[1] Aziz, Maryam, Emilie Kaufmann, and Marie-Karelle Riviere. "On multi-armed bandit designs for dose-finding trials." JMLR, 2021.
[2] Magureanu, Stefan, Richard Combes, and Alexandre Proutiere. "Lipschitz bandits: Regret lower bound and optimal algorithms." COLT, 2014.
This paper discusses a variant of fixed-confidence pure exploration problems where a finite sequence of arms is given and the objective is to detect the set of arms whose expected reward differs from that of the next arm. Problem-dependent lower bounds are derived, and algorithms asymptotically achieving these bounds are also proposed. Several questions and concerns were raised for this paper, such as the validity of the formulation, the knowledge it requires, and its relation to existing work on BAI. These concerns seem to be mostly resolved by the rebuttal, and I expect the authors to polish the paper by carefully revising the discussed points in the final version.
My small concern is that, while the algorithms and the theoretical guarantees are solid, they seem to remain mostly within the already established framework of fixed-confidence BAI. So I think the justification and practical usefulness of the problem are very important, and the significance would be strengthened by experiments on more realistic scenarios.