Average rating: 5.7 / 10 (Poster, 3 reviewers; min 4, max 7, std 1.2)
Individual ratings: 6, 4, 7
Confidence: 3.7 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 3.0
NeurIPS 2024

Linear Causal Bandits: Unknown Graph and Soft Interventions

Submitted: 2024-05-16 · Updated: 2024-11-06

Abstract

Designing causal bandit algorithms depends on two central categories of assumptions: (i) the extent of information about the underlying causal graphs and (ii) the extent of information about interventional statistical models. There have been extensive recent advances in dispensing with assumptions on either category. These include assuming known graphs but unknown interventional distributions, and the converse setting of assuming unknown graphs but access to restrictive hard/$\operatorname{do}$ interventions, which removes the stochasticity and ancestral dependencies. Nevertheless, the problem in its general form, i.e., _unknown_ graph and _unknown_ stochastic intervention models, remains open. This paper addresses this problem and establishes that in a graph with $N$ nodes, maximum in-degree $d$ and maximum causal path length $L$, after $T$ interaction rounds the regret upper bound scales as $\tilde{\mathcal{O}}((cd)^{L-\frac{1}{2}}\sqrt{T} + d + RN)$ where $c>1$ is a constant and $R$ is a measure of intervention power. A universal minimax lower bound is also established, which scales as $\Omega(d^{L-\frac{3}{2}}\sqrt{T})$. Importantly, the graph size $N$ has a diminishing effect on the regret as $T$ grows. These bounds have matching behavior in $T$, exponential dependence on $L$, and polynomial dependence on $d$ (with the gap $d$). On the algorithmic aspect, the paper presents a novel way of designing a computationally efficient CB algorithm, addressing a challenge that the existing CB algorithms using soft interventions face.
Keywords
causal bandit · unknown graph · soft interventions

Reviews and Discussion

Official Review (Rating: 6)

The paper considers a specific setting of the causal bandits problem where 1) the graph is unknown, 2) the causal model is linear, and 3) the action set consists of $2^N$ soft interventions. To tackle the problem, the authors propose an algorithm that first learns the causal structure and then uses UCB-based approaches to find the best action and minimize the regret. The regret of the algorithm is analyzed for both cases: with and without knowledge of the graph.
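For concreteness, the interaction model summarized above can be sketched as a small simulation. This is a minimal illustrative sketch (not the paper's code): it assumes a linear SEM $X = B_a^\top X + \epsilon$ in which a binary action vector selects, per node, either the observational weight column or the soft-interventional one; all names and numbers are hypothetical.

```python
import numpy as np

def sample_sem(B, B_star, a, rng):
    """Draw one round of node values X under intervention vector a.

    a[i] = 1 applies the soft intervention on node i, i.e., swaps the
    weight column B[:, i] for B_star[:, i]; X_i stays stochastic (soft
    intervention), unlike a do-intervention that clamps it to a value.
    """
    N = B.shape[0]
    # Column-wise choice between observational and interventional weights.
    B_a = np.where(a[np.newaxis, :] == 1, B_star, B)
    eps = rng.normal(size=N)  # exogenous noise
    X = np.zeros(N)
    # Nodes are topologically ordered: B is strictly upper-triangular,
    # so parents are always computed before their children.
    for i in range(N):
        X[i] = B_a[:, i] @ X + eps[i]
    return X  # the last node is taken as the reward node

rng = np.random.default_rng(0)
N = 4
B = np.triu(rng.uniform(0.5, 1.0, (N, N)), k=1)       # observational weights
B_star = np.triu(rng.uniform(1.0, 1.5, (N, N)), k=1)  # same support as B
a = np.array([1, 0, 1, 0])  # one of the 2^N possible soft interventions
X = sample_sem(B, B_star, a, rng)
```

The action space is the set of $2^N$ binary vectors $a$, which is why a structure-agnostic bandit algorithm faces an exponentially large arm set.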

Strengths

1- The paper considers a novel setting where the causal graph is unknown and the SEM is linear.

2- The flow of the paper is clear, and each part of the proposed algorithm is thoroughly discussed.

3- The upper bound of GA-LCB-ID, when the graph is known, demonstrates an improvement compared to related work. Also, it is close to the minimax lower bound.

4- GA-LCB-ID, in the main case when the causal graph is unknown, achieves a reasonable upper bound.

5- The related work is comprehensively discussed in the introduction.

Weaknesses

1- The algorithm assumes that the causal graph does not contain latent variables.

2- The algorithm requires access to identifiability parameters, which is an unrealistic assumption.

3- The set of interventions is limited: for each node, there is only one soft intervention.

Questions

Can you provide a discussion for the above points?

1- I expected the identifiability parameter to appear in the regret bound, similar to previous work in the unknown-graph setting. What is the intuition behind excluding it in this setting?

2- I didn’t understand why you took an expectation in Theorem 2. The theorem should show that with high probability, the regret is less than a certain bound. Why is it necessary to take an expectation? Note that you defined regret with implicit expectation over the randomness of rewards in Equation 5.

Limitations

Yes.

Author Response

Comment: The algorithm assumes that the causal graph does not contain latent variables. The algorithm requires access to identifiability parameters, which is an unrealistic assumption.

Response: We thank the reviewer for the thoughtful comment. In principle, we agree with the reviewer about the importance of relaxing these assumptions. However, we note that these are standard assumptions in the current causal bandit literature (e.g., [14,17,19,22]). The primary reason is that causal bandits are not fully understood even under such assumptions. Causal bandit literature has been evolving towards removing as many assumptions as possible about graph topology and interventional distribution. This is the first paper that has dropped both under general structures and interventional models (with the remaining piece being extending linear SEMs to non-linear SEMs). However, we certainly agree that next, it is also important to dispense with assumptions on latent variables and identifiable parameters.

Comment: The set of interventions is limited, for each node, we only have one soft intervention.

Response: Thanks for the note. For presentation clarity, we have focused on the binary intervention model. However, the algorithms can be readily extended beyond binary interventions, where we can show that the regret will scale as $\sqrt{KT}$ when $K$ is the number of interventions per node. In the final version we can use the additional page to provide a brief discussion on the extension to $K$ interventions per node. Importantly, we note that expanding the intervention space will not impose additional costs for identifying the causal structure, as one soft intervention is sufficient. We also note that under $\operatorname{do}$ interventions, the regret scales with $\sqrt{KT}$ where $K$ is the number of possible $\operatorname{do}$ interventions. This means that under a significantly more relaxed intervention model, we can achieve the same regret that $\operatorname{do}$ interventions achieve.

Questions:

Comment: I expected in the regret bound, the identifiability parameter would take part similar to previous work in an unknown setting. What is the intuition behind excluding it in this setting?

Response: We thank the reviewer for this question. If we define $R=(m/\eta)^2$ and incorporate it into the regret, the term $N$ in the regret order will appear as $RN$. We will include this in the updated version. The $RN$ term arises from the step of identifying the ancestor relationships. Notably, a similar term is observed in related literature [19,22], where the term $R$ is coupled with the degree, as they only need to identify $\operatorname{pa}(N)$. Besides, in previous work, $K$ represents the number of distinct interventions, and we will have similar behavior when generalizing to multiple soft interventions.

Comment: I didn’t understand why you took an expectation in Theorem 2. The theorem should show that with high probability, the regret is less than a certain bound. Why is it necessary to take an expectation? Note that you defined regret with implicit expectation over the randomness of rewards in Equation 5.

Response: We thank the reviewer for the sharp observation. Having the expectation is a typo, and we will remove it.

Comment

Thanks for your response. I also have another question. How do you compare your algorithm and Corollary 1 with naive approaches (e.g., the UCB-based algorithm in the classic bandit setting) on a specific graph like the following: X1 -> X2 -> ... -> Xn, where additionally, for all $i$, $X_i$ is a parent of $X_n$. It's a chain graph where all variables are parents of $X_n$.

Comment

We thank the reviewer for this comparison question. A chain graph is a special simple graph with a maximum in-degree of $d=1$ and a maximum causal depth of $L=N-1$.

Regret Comparison: For UCB-based algorithms in the classic bandit setting without knowing the graph, the regret scales as $\tilde{\mathcal{O}}(\sqrt{|\mathcal{A}| T}) = \tilde{\mathcal{O}}(\sqrt{2^N T})$. In contrast, when the graph is known, the regret reduces significantly and scales as $\tilde{\mathcal{O}}(\sqrt{T})$ (e.g., Corollary 1). The reason for the higher regret of the UCB-based algorithm without knowing the graph is that, due to the lack of graph structure, the algorithm treats each set of interventions as an independent arm. This means that there are $2^N$ possible intervention sets, leading to the regret mentioned above.

Computational Cost: Here we compare the computational cost of three algorithms: UCB-1 (a UCB-based algorithm for the classic bandit problem), LinSEM-UCB [14] (a UCB-based algorithm for causal bandits), and our proposed algorithm. The UCB-1 algorithm incurs a computational cost of $\mathcal{O}(2^N)$ because it needs to select the best intervention from $\mathcal{A}$ with $|\mathcal{A}|=2^N$. The LinSEM-UCB algorithm, designed for causal bandits, has a computational cost of $\mathcal{O}(N \cdot 2^{2N})$, as each UCB iteration requires maximization over the vertices of an $N$-dimensional hypercube, with each vertex having a computational cost of $\mathcal{O}(N)$. Additionally, it selects the best intervention from $\mathcal{A}$. In contrast, our algorithm has a computational cost of $\mathcal{O}(N |\mathcal{A}_s|)$, where $\mathcal{A}_s$ represents the set of possible optimal interventions at time $t$ after elimination, with $|\mathcal{A}_s| \leq 2^N$ and $|\mathcal{A}_s| \rightarrow 1$ as $t \rightarrow \infty$. This is because the computational cost of the confidence width calculation in equation (19) is $\mathcal{O}(1)$ (recall that $d=1$), and the UCB only requires the estimated mean value (of order $\mathcal{O}(N)$) and $N$ confidence widths. Consequently, our algorithm has a slightly higher computational cost at the beginning but becomes more efficient as $|\mathcal{A}_s|$ decreases to $\frac{2^N}{N}$, which can be achieved in a few elimination steps.

Overall, our approach achieves a substantial reduction in regret while maintaining a comparable computational cost to UCB-1 in the context of this simple graph.

Finally, we emphasize that as the graph topology becomes more complex, the UCB-based algorithms further degrade. For example, in hierarchical graphs, the intervention set size is $2^{dL+1}$, leading to a regret of $\tilde{\mathcal{O}}(\sqrt{2^{dL+1} T})$, compared to our algorithm's $\tilde{\mathcal{O}}(\sqrt{d^{L-\frac{1}{2}} T})$.
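To make the gap in this reply concrete, the scalings can be tallied numerically. This is illustrative arithmetic only; the constants and log factors hidden in the $\tilde{\mathcal{O}}(\cdot)$ notation are dropped, and the function names are hypothetical.

```python
import math

def ucb1_regret(N, T):
    """Structure-agnostic UCB over all 2^N interventions: sqrt(|A| * T)."""
    return math.sqrt(2**N * T)

def causal_regret(d, L, T, c=1.0):
    """Graph-aware rate (cd)^{L - 1/2} * sqrt(T), with c = 1 for simplicity."""
    return (c * d) ** (L - 0.5) * math.sqrt(T)

# Chain graph X1 -> X2 -> ... -> XN: d = 1 and L = N - 1.
N, T = 20, 10**6
print(ucb1_regret(N, T))           # ~ 2^{N/2} * sqrt(T): about 1.0e6
print(causal_regret(1, N - 1, T))  # ~ sqrt(T) = 1000.0
```

Even at a modest $N = 20$, the structure-agnostic rate is three orders of magnitude larger, and the gap widens exponentially in $N$.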

Comment

Sorry, it seems you misunderstood my example. My example was a chain graph where all variables are also parents of $X_n$ (the reward node). So, the graph has $2(n-1)$ edges (if I'm not wrong, $d = N-1$ and $L = N-1$).

Comment

Thank you for the clarification. To lay some context for discussion, we would like to address this in the dichotomy of instance-dependent versus class-level regret analysis. The former analyzes the performance of a policy on a specific bandit instance while the latter captures the performance over a class of instances. In this context, for any causal bandit algorithm (including ours), one can discuss

  • Instance-dependent regret analysis, which involves analyzing a policy for a specific causal bandit instance (e.g., with a specific graph topology).
  • Class-level analysis, which involves analyzing a policy designed for a class of causal bandits (e.g., causal bandits with maximum degree $d$ and causal depth $L$, which is standard in the causal bandit literature).

Given this context, we address the regret of the example proposed by the reviewer in both settings:

Class-level regret for the class of graphs with maximum in-degree $d$ and causal depth $L$: The UCB-1 algorithm can have a class-level regret that scales as $\tilde{\mathcal{O}}\big(2^{\frac{d^L}{2}} \sqrt{T}\big)$ (e.g., in the reverse tree graph). On the other hand, our class-level regret scales as $\tilde{\mathcal{O}}\big(d^{L - \frac{1}{2}} \sqrt{T}\big)$.

Instance-dependent regret for the chain graph: As the UCB-1 algorithm does not take the causal structure into account, its regret remains $\tilde{\mathcal{O}}(2^{N/2} \sqrt{T})$. On the other hand, we can readily analyze the instance-dependent regret of our algorithm, which scales as $(N-1)\sqrt{T}$. To get the instance-level regret of our algorithm, let's examine the source of the $d^{L-1/2}$ term in our regret: if we define the maximum in-degree at each causal depth as $d_{(\ell)} = \max_{i\in [N],\, L_i=\ell} d_i$ for $\ell \in [L]$, then the $d^{L-1/2}$ term can be refined to the instance-dependent term $\sqrt{d} \prod_{\ell=1}^{N-1} d_{(\ell)}$ (in the proofs we need to replace $d$ with $d_{(\ell)}$). In the term $\sqrt{d} \prod_{\ell=1}^{N-1} d_{(\ell)}$, $\sqrt{d}$ reflects the complexity of the linear function, and the product accounts for the compounding effect, where each causal depth contributes $d_{(\ell)}$.
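The refinement above can be checked mechanically. The sketch below computes $d_{(\ell)}$ and the instance-dependent factor $\sqrt{d}\prod_\ell d_{(\ell)}$ from an adjacency matrix; the helper names and the depth convention (roots at depth 0) are our own illustrative assumptions.

```python
import numpy as np

def causal_depths(adj):
    """Longest-path depth of each node; adj[i, j] = 1 means edge i -> j.
    Nodes are assumed topologically ordered (roots get depth 0)."""
    N = adj.shape[0]
    depth = np.zeros(N, dtype=int)
    for j in range(N):
        parents = np.flatnonzero(adj[:, j])
        if parents.size:
            depth[j] = 1 + depth[parents].max()
    return depth

def instance_factor(adj):
    """Instance-dependent factor sqrt(d) * prod_l d_(l), where d_(l) is
    the maximum in-degree among nodes at causal depth l >= 1."""
    in_deg = adj.sum(axis=0)
    depth = causal_depths(adj)
    factor = np.sqrt(in_deg.max())
    for l in range(1, depth.max() + 1):
        factor *= in_deg[depth == l].max()
    return float(factor)

N = 6
chain = np.eye(N, k=1, dtype=int)   # X1 -> X2 -> ... -> X6
print(instance_factor(chain))       # every d_(l) = 1, so the factor is 1.0

hub = chain.copy()
hub[:-1, -1] = 1                    # additionally X1..X5 -> X6 (d = N - 1)
print(instance_factor(hub))         # sqrt(5) * 5, far below the class-level 5^{4.5}
```

For the reviewer's chain-plus-hub graph, only the last depth level has a large in-degree, so the instance factor collapses to $(N-1)^{3/2}$ instead of the class-level $(N-1)^{N-3/2}$.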

Comment

Thanks for your response. Based on the current result, the upper bound is $\mathcal{O}\big((N-1)^{N-3/2}\sqrt{T}\big)$? If so, it seems that, compared to the classic UCB for this instance, the regret bound is not particularly advantageous. It may be beneficial to discuss the improved instance-dependent regret, as you mentioned, in the revised version.

I intend to maintain my current score; however, I have concerns regarding soft interventions. In the title, abstract, and introduction, the authors claim to propose an algorithm for soft intervention cases. However, in the problem formulation, they define only a single soft intervention per node, which was unexpected.

Comment

As the reviewer points out, we have consistently mentioned that we focus on soft interventions. This is the most general form of intervention and subsumes hard and $\operatorname{do}$ interventions.

If the reviewer is asking about one versus multiple interventions per node, we re-emphasize that using one is standard in the causal bandit literature for notational simplicity. Extensions from one to multiple interventions are straightforward: the constant 2 will be replaced by $K$ (the number of interventions) in the cardinality of the intervention space.

We kindly ask the reviewer to base their judgment on the theoretical contribution of this paper: this paper significantly extends the scope of the causal bandit literature by entirely removing the assumption of knowing the graph topology while allowing general soft interventions. All our other assumptions are either in line with or more relaxed compared to those in the existing literature. We are glad that the reviewers have not pointed to any technical flaws in the analysis.

Comment

Just for clarification regarding soft interventions. I expected that you define it in the following way: by a soft intervention on node $i$, we can change any entries of $B_i$ to any values (with some restriction). But in your setting, by soft intervention, we can only change $B_i$ to $B^*_i$. Could you discuss the complexity of the former setting? Is it not learnable? I understand the theoretical contribution of the paper, and my judgment is based on that.

Comment

Thanks for giving us a chance to clarify this. Please note that $B_i$ is not the interventional value of the random variable $X_i$ generated by node $i$. Rather, it specifies the mechanism for generating it.

Post-interventional random variables: Specifically, upon intervening on node $i$, the random variable $X_i$ can take any arbitrary real value. Its value is not limited to only one or a finite number of choices. Specifically, by applying a soft intervention on node $i$, we are changing its conditional distribution $\mathbb{P}(X_i \mid X_{\mathsf{pa}(i)})$ to a distinct conditional distribution. The post-intervention random variable $X_i$ will be generated according to the post-intervention conditional distribution. This is in contrast to $\operatorname{do}$ interventions that specify a choice for $X_i$ (which essentially means placing the entire post-interventional conditional probability on a singular value).

Post-interventional SEMs: Since we are working with linear SEMs, both pre- and post-interventional conditional distributions are fully specified by the matrices $B$ and $B^*$ as well as the exogenous noise model. We are assuming that all of these (i.e., $B$, $B^*$, and the noise distribution) are fully unknown. This means that the pre- and post-interventional distributions are fully unknown. Our algorithm is learning all of these, which includes learning $B$ and $B^*$ (their support and their real-valued entries). The complexity of learning all $2N$ such vectors $\{B_i, B^*_i : i\in[N]\}$ is already included in the regret. We have considered one post-intervention SEM $B^*$. This can be readily extended to any finite number of post-intervention SEMs. Extension to an infinite number of post-intervention models is a potential future direction; that necessitates learning an infinite number of interventional distributions, which to the best of our knowledge is an open question even for simpler settings (e.g., knowing the topology or even the interventional distributions).

So, if we are understanding the reviewer's concern correctly, having one single post-intervention $B_i^*$ does not mean that we are forcing $X_i$ to take only one value. Rather, we are changing the conditional distribution that generates $X_i$, and it can still take any real value generated by the interventional distribution (this is a generalization of $\operatorname{do}$ interventions, which specify $X_i$ to take a specific deterministic value, and which in the context of causal bandits have a finite number of possibilities).

Comment

Thanks for your reply. I understood the answer to my question from the following sentence: "Extensions to an infinite number of post-intervention models can be a potential extension. That necessitates learning an infinite number of interventional distributions, which to the best of our knowledge even for simpler settings (e.g., knowing the topology or even interventional distributions) is an open question." Thanks for your clarification.

Official Review (Rating: 4)

This paper studies the linear SCM setting for causal bandits. In particular, there are two vectors associated with the linear response of every node, and the learner may independently choose which of the two vectors to use. The values as well as the graph are unknown. Hence, the action space has size $2^N$. The authors provide an algorithm with a regret upper bound that nearly matches a lower bound, also presented.

Strengths

The problem formulation is natural and seems to be “the right size”; its added generality compared to the previous state of the art is enough to be interesting but still allows for a solvable problem. The presentation is generally clear and polished.

Weaknesses

There are some technical concerns; please see the questions below.

Questions

What does assuming that B_i and B_i^* have the same support buy you?

Is the Lasso calibrated for the max number of parents?

The bound you have on kappa in (27) is very big, O(m^2). Doesn’t this make Theorem 1, part 2 trivial?

Is assumption 5 necessary? Why can’t you argue that soft interventions that would break assumption 5 would have a minimal effect on the expected loss and therefore only contribute a small amount to the regret? One might even be able to have a bound that is adaptive to eta. Or do you need it for graph discovery?

What is c, or at least what does it depend on?

Why use ridge regression for estimating B, B*? They are assumed sparse, after all.

Why do you need to solve an optimization problem over 2^|A|? Won’t the solution be on a corner of the |A|-dimensional hypercube, so a continuous relaxation would suffice?

Can you provide some intuition for what the lower bound construction looks like?

Limitations

No limitations discussed in the main body.

Author Response

Support of weight matrices: We clarify that $B_i$ and $B_i^*$ always have the same support under soft interventions, as such interventions only change the conditional distribution of node $i$ without affecting the topology. The causal topology remains intact, leading to $B_i$ and $B_i^*$ having the same support. As a side note, we also remark that under more restrictive types of interventions ($\operatorname{do}$ and hard), an intervention on node $i$ removes its causal dependence on its parents, i.e., $\mathrm{pa}(i)$. In such cases, the support of $B_i^*$ becomes a subset of that of $B_i$. However, even in such cases, when the topology is unknown, assuming that the support of $B_i^*$ is a subset of that of $B_i$ does not provide any gain. Overall, since we are assuming soft interventions, $B_i$ and $B_i^*$ have the same support.

Lasso calibrated for the max number of parents: Yes, the algorithm is designed for the entire class of causal models with maximum in-degree bound $d$. If more node-level information is available, the Lasso regression can be modified accordingly. We remark that the interest of the community working on causal bandits has been in removing as many assumptions as possible about the graph structure and interventions. Our setting is in line with extending the state of the art by removing topology information.

Loose bounds on kappa: We thank the reviewer for the sharp observation. The reviewer is right that this is not a tight bound. Nevertheless, we note that for achieving a tight regret, such a loose bound is inevitable, and further improving it compromises the regret. Here’s the reason: the regret is only a function of the topology information, and the topology should be learned only to the extent that it facilitates identifying the best intervention. In other words, learning the topology up to some level improves regret and, after that, starts compromising it since learning the topology requires collecting more samples, which implies less sample efficiency. Therefore, any algorithm that learns the topology accurately is over-learning, and our conjecture is that it will compromise the regret. Please note that the purpose of Theorem 1 is to provide the guarantees needed only for characterizing the regret, and it is not intended to claim that the topology is learned accurately.

Necessity of Assumption 5: As the reviewer points out, this assumption is needed for graph recovery, and it is standard when the graph skeleton is unknown [19,22,R1]. When a node acts as a confounder, intervening on that node might not impact certain intermediary nodes but still influences the reward. Without Assumption 5, the confounder relationships cannot be identified, leading to an incorrect graph topology recovery. The reviewer's point is correct: the interventions on some nodes might have a minimal effect on the reward. However, when the topology is unknown, these nodes need to be identified and treated properly. The reason is that even a minimal effect on the regret can still accumulate to linear regret if such a node is chosen linearly in $T$. So the algorithm needs to properly identify these nodes to ensure they are selected sublinearly.

[R1]: Elahi et al. Partial Structure Discovery is Sufficient for No-regret Learning in Causal Bandits, ICML 2024 Workshop RL/Control Theory.

Definition of $c$: Thanks for the sharp observation. This constant is $c=\kappa$, where $\kappa$ is defined in (27). We recognize this notation is redundant and will change it to $\kappa$.

The weight matrices are assumed sparse: Thank you for this question. We are not assuming sparsity for $B$ and $B^*$. The maximum in-degree $d$ can be arbitrarily large, i.e., as large as $N-1$. This means the columns of $B$ and $B^*$ are $d$-sparse, where $d$ can be $N-1$. Nevertheless, the reviewer is correct that if we limit the scope of the problem and assume that $B$ and $B^*$ are sparse, then sparse-recovery methods can be more effective than ridge regression.
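For intuition on why ridge regression suffices here, the sketch below estimates one weight column by ridge regression without exploiting its sparsity. This is an illustrative sketch under our own assumptions: synthetic i.i.d. Gaussian covariates stand in for the algorithm's actual exploration data, and all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 10, 3, 2000

# True weight column for one node: d-sparse, but sparsity is not exploited.
b_true = np.zeros(N)
b_true[rng.choice(N, size=d, replace=False)] = rng.uniform(0.5, 1.0, size=d)

X = rng.normal(size=(T, N))                 # covariates (candidate parents)
y = X @ b_true + 0.1 * rng.normal(size=T)   # noisy linear response of the node

lam = 1.0  # ridge regularizer
b_hat = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)

err = np.max(np.abs(b_hat - b_true))
print(err)  # small: the estimate converges at the usual 1/sqrt(T) rate
```

The point of the reply is that the regret analysis does not require exploiting sparsity in this estimation step, even though the columns happen to be $d$-sparse.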

Continuous relaxation for optimization: We recognize that the reviewer might have an interesting point about solving a combinatorial bandit problem via continuous relaxation. We are unaware of such a relaxation in the bandit literature. One challenge in adopting such an approach is that the utility to be optimized (i.e., the UCB) is not defined at non-discrete points. Specifically, $\mathrm{UCB}_{a}$ is not defined for $a_i\notin\{0,1\}$. As such, while we appreciate the reviewer's thoughts, we do not see an immediate way of using continuous relaxation in combinatorial bandit decisions.

Lower bound construction: In this paper, we provide a minimax lower bound. This is done by constructing two bandit instances that are highly similar, and distinguishing them is difficult. Specifically, we construct two bandit instances that share the same hierarchical graph structure. These instances differ only in the parameters that switch the observational and interventional weights for the nodes with causal depth of 1, resulting in distinct optimal interventions. We identify the minimum samples that any algorithm will need to distinguish between the two bandit instances and the minimum regret it will incur on one of the instances. This serves as a lower bound on the minimax regret.

Final note about “soundness”: We hope we have addressed the reviewer's concerns about “soundness” (rated 2: fair), especially why $B_i$ and $B_i^{\star}$ have the same support, why the second part of Theorem 1 is sufficient for our regret minimization purposes, the lack of a sparsity assumption on $B$ and $B^{\star}$, and continuous relaxation. We remark that this paper significantly extends the scope of the causal bandit literature by entirely removing the assumption about knowing the graph topology (a common assumption in extensive recent publications in JMLR, NeurIPS, ICML, and AISTATS). This is a major leap in causal bandits, and we kindly request that the reviewer consider re-evaluating their rating of the manuscript.

Comment

Thanks for the detailed response; I think many of my concerns regarding Theorem 1 are answered. On a related note, how does the regret bound scale with m?

Comment

Scaling with $m$: We appreciate the reviewer for raising this important question. The regret bounds scale linearly in $m$. An intuitive reason is that if we multiply all the random variables of the system by a constant $M$, then the final reward will be scaled up by $M$, and the regret of any algorithm scales up by the constant $M$. As discussed at the end of Section 2, the dependence of the regret on $m$ is linear even in simpler linear bandit problems (e.g., [25]). The exact same linear dependence has also been reported in other causal bandit settings [14,17]. We note that in some literature such dependence might not appear explicitly, since they normalize the range of the random variables or rewards to fall in the range $[0,1]$, i.e., by setting $m=1$ (e.g., causal bandits with $\operatorname{do}$ interventions [1,6] and linear bandits [R2]).

[R2]: Abbasi-Yadkori et al. Improved algorithms for linear stochastic bandits, NeurIPS 2011.

Other concerns: We are glad that the reviewer’s concerns about Theorem 1 are addressed. We will be happy to also elaborate more on other concerns if there are still remaining ones.

Comment

Thanks, I agree that a linear scaling with m is not avoidable. I had a suspicion that it could have super-linear scaling, as the probability of error bounds degrade in m so I would expect some sample complexity terms to at least pick up a log(m) to compensate.

Comment

Again, we thank the reviewer for the thoughtful point. We'll add a comment to the revised manuscript to emphasize linearity in $m$. We'll also be happy to address any remaining concerns.

Based on the discussions, we'd be grateful if the reviewer considers re-evaluating their rating of the paper.

Official Review (Rating: 7)

The authors consider a stationary causal bandit problem for an unknown linear model with weight matrix $B \in \mathbb{R}^{N \times N}$ and noise vector $\epsilon \in \mathbb{R}^N$. In their setup, an intervention on the node $X_i$ replaces all of the weights into $X_i$ with those from another unknown matrix $B^* \in \mathbb{R}^{N \times N}$.

The authors propose a two-phase approach: learning the causal structure by (1a) learning a valid topological ordering and (1b) learning the causal DAG, and then (2) applying a phase elimination algorithm to learn the best pulls over a time horizon $T$.

The authors also include a minimax lower bound for their problem setup. For either bound, the authors argue that the maximum causal depth $L$ and a known upper bound (denoted here as $d^+$ for clarity) on the maximum in-degree $d$ are the relevant topological parameters.

It is assumed throughout that interventions do not affect the causal structure: $\mathrm{Supp}(B)=\mathrm{Supp}(B^*)$.

Strengths

The authors identify a novel problem setup in the understudied field of causal bandits with soft interventions and unknown causal structure. Their results are significant to the causal bandit literature and have clear applications for real-world modelling.

Weaknesses

The authors make an informal claim near the beginning of the paper, which I suspect affects the tightness of their bounds. They argue that all conditional independencies must be learned for regret minimisation with an unknown graph. However, it should be clear that only the nodes in $\mathrm{An}(Y)$ are relevant. In stage (1b), beginning with the reward node $N$ and iteratively learning the parent sets $\mathrm{pa}(N)$, then $\mathrm{pa}(\mathrm{pa}(N))$, should be a more efficient approach. This suggests to me that the relevant causal path depth in their bounds is actually $L_N$, while the relevant in-degrees in stages (1) and (2) of their algorithm are $d^+$ and $d^* := \max_{i \in \mathrm{an}(N) \cup \{N\}} d_i$ respectively.

The authors also seem to ignore the parameter $R := (m/\eta)^2$ in their regret bounds. While it is not a topological property of the population DAG, it is certainly an important parameter in the problem setting and could be thought of as a kind of guaranteed minimal signal-to-noise ratio (SNR). From the definition of $T_1$, there should be a linear scaling of the regret bounds with $R$, up to a poly-log factor. This SNR is important for real-world modelling and could also offer insights into relaxing Assumptions 3, 4 and/or 5.

Putting the above together, my estimation of the same upper regret bound is $\tilde{O}\big((cd^*)^{L_N - 1/2} \sqrt{T} + R + d^+ + N\big)$. Please correct me otherwise and/or update the manuscript. I will revise my score and evaluation of the paper's soundness following the authors' rebuttal.

The authors provide a very minimal experiment, relegated to the appendix. It would have been informative to see parameter sweeps and experiments with random DAGs to compare their bounds with empirical scalings.

The authors make no mention of the possibility of hidden confounders and whether their models can incorporate them. In particular, the claim that only $\mathrm{pa}(N)$ are possibly optimal with hard interventions is false; see [1] for details.

[1] Sanghack Lee and Elias Bareinboim, Structural Causal Bandits: Where to Intervene?, NeurIPS 2018.

Questions

Does your approach assume causal sufficiency? In particular, are the noise terms assumed to be mutually independent (i.e., $\mathrm{Corr}(\epsilon_i, \epsilon_j) = \delta_{ij}$)?

I do not understand the first case of the estimator for the ancestors of $i$ in equation (9): it appears to read "if $i$ is not the reward node but has descendants, then it can't have any ancestors". Is this a typo?

In Table 1, the meaning of $K$ in the benchmark approaches should be indicated. Is this the number of possible hard interventions on a node for categorical data, or something else?

Limitations

Yes, the authors provide some key underlying assumptions. Whether causal sufficiency is assumed should be indicated either way.

Author Response

We would like to thank the reviewer for the thoughtful questions, especially those pertinent to the instance-dependent regret and hidden confounders. We provide more discussion below, and we hope it clarifies the reviewer's technical concerns.

Dependence on $d$ and $L$: The reviewer raises a good point about also characterizing instance-dependent regret bounds. This falls into the general dichotomy of class-level versus instance-level regret analysis. We have provided regret bounds for a class of bandits with a maximum in-degree $d$ and maximum causal length $L$. As the reviewer correctly points out, this can be further fine-tuned to recover instance-dependent regret bounds that use the instance-level information $L_N$ and $d^*$. Characterizing these bounds is verbatim similar to our analysis, and we will add a remark to the paper stating these instance-dependent regret bounds.

The changes for the instance-dependent bounds are as follows:

  • The effect of $L$ and $L_N$. We note that neither Algorithm 1 nor Algorithm 2 requires knowledge of $L$. Algorithm 1 implicitly learns $L$ by learning the parent sets, which are then fed to Algorithm 2. In the regret bound proof, we use a recursion along the longest causal path: each layer in the causal path contributes a factor $cd$ to the regret, so the compound effect of all layers along the path becomes $(cd)^L$ in the class-level regret. For the instance-dependent regret at the reward node, it suffices to perform this process along the causal paths of the subgraph induced by $\mathrm{an}(N)$, in which case the regret depends on $L_N$. This subgraph is learned by Algorithm 1.

  • The effect of $d$ and $d^*$. Similar to the previous point, when we evaluate the regret compounding along the causal paths formed by $\mathrm{an}(N)$, the maximum in-degree is $d^*$. Thus, the contribution of each layer to the regret is $cd^*$ instead of $cd$.

  • Algorithm 1 requires knowing $d$, and the iterative approach suggested by the reviewer can save some computational cost by applying the Lasso estimators only on $\mathrm{an}(i)$. However, this approach does not tighten the regret, since the same amount of observational data $T_2$ is needed to learn any parent set.
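As a side note on the Lasso step mentioned above, parent-set recovery from observational data can be sketched as follows. This is an illustrative stand-in (a plain coordinate-descent Lasso with hypothetical weights and a hypothetical support threshold), not the paper's Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def lasso(X, y, lam, n_iter=200):
    """Plain coordinate-descent Lasso: min_b 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j, then soft-threshold.
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return beta

# Hypothetical node with true parents {0, 1} among 5 observed candidates.
n = 2000
X = rng.standard_normal((n, 5))
y = 0.9 * X[:, 0] + 0.7 * X[:, 1] + 0.1 * rng.standard_normal(n)

beta = lasso(X, y, lam=0.1 * n)
parents = set(map(int, np.flatnonzero(np.abs(beta) > 0.05)))
print(parents)  # recovers the true parent set {0, 1}
```

The $\ell_1$ penalty zeroes out the non-parent coefficients, so the estimated parent set is simply the support of the fitted vector.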

Parameter $R=(m/\eta)^2$: We certainly agree with explicitly incorporating the effect of $R$ into the regret bounds. We also agree with the reviewer's point that introducing this notion of SNR allows relaxing Assumptions 3-5: they could be restated equivalently as an assumption on the SNR term $R$. We further elaborate on the dependence on $R$ in the answer to the next comment.

We also note that our objective has been primarily focused on the fundamental question of how the regret depends on (i) the graph topology and (ii) the interventional distributions. The causal bandit literature has been evolving towards removing as many assumptions as possible about graph topology and interventional distributions. This is the first paper that drops both assumptions under general structures and interventional models (with the remaining piece being the extension from linear to non-linear SEMs). However, we certainly agree that it is also important to understand the dependence of the regret on other model parameters.

Upper regret bound: Based on the previous point, the correct instance-dependent regret bound is $\tilde{\mathcal{O}}((c d^*)^{L_N - 1/2} \sqrt{T} + d^{+} + RN)$. The $RN$ term arises from the step of identifying the ancestor relationships. We note that a similar term appears in the related literature [19,22], where the term $R$ is coupled with the degree, as those works only need to identify $\mathrm{pa}(N)$.

Additional experiment: Please refer to the global response where we provide more experiments.

Hidden confounders: Thank you for the comment. We note that all our statements are in the context of our specified setting (no hidden confounders). Under this setting (i.e., no hidden confounders and full intervention capability), only $\mathrm{pa}(N)$ are optimal. We will re-emphasize this as a remark in the paper to avoid confusion.

We highlight that having no hidden confounders is a standard assumption in the existing literature on causal bandits (even under the more restrictive $\operatorname{do}$ interventions). The primary reason is that causal bandits are not fully understood even when there are no hidden confounders. Nevertheless, we certainly agree that understanding the effect of unobservable confounders is also important. Including hidden confounders introduces non-trivial challenges. For instance, it brings in bidirected edges, as a result of which one cannot express the reward as a linear function of the observable variables.

Causal sufficiency: We thank the reviewer for this question. Yes, our approach does assume causal sufficiency, and we will include this in Assumption 4.

First case in equation (9): Thanks for the sharp observation. Yes, based on Assumption 5, the condition "$\text{if } i<N \text{ and } |\widehat{\operatorname{de}}(i)|>0$" is indeed a typo and should be removed.

$K$ in Table 1: Yes, $K$ represents the number of possible $\operatorname{do}$ interventions on one node. We will add a comment to specify this.

Final note about “soundness”: We hope we have addressed the reviewer's concerns about “soundness” (rated 2: fair), especially those about hidden confounders and class-level versus instance-level regret (which can be readily recovered from our results). We remark that this paper significantly extends the scope of the causal bandit literature by entirely removing the assumption of knowing the graph topology (a common assumption in extensive recent publications in JMLR, NeurIPS, ICML, and AISTATS). This is a major leap in causal bandits, and we kindly request that the reviewer consider re-evaluating their rating of the manuscript.

Comment

Many thanks to the authors for their thoughtful and detailed responses, which have addressed my major concerns and questions.

In light of this, I will upgrade my overall rating to a 7: Accept, and my soundness rating to a 3: Good. My high overall score is based on the significance and novelty of the authors' contribution.

I suspect correlated noise variables (i.e., latent confounders) could be incorporated into the authors' analysis, which presents one exciting avenue for future work.

Comment

We are grateful to the reviewer for this great suggestion. Indeed, we agree that modeling the latent confounders via a modified noise model is an effective way of integrating them into the causal bandit model, and that this is an important next line of research in the causal bandit literature.

Author Response

We thank the reviewers for their comments on our assumptions and empirical evaluations. We would like to clarify the following:

Theoretical contributions: We remark that this paper significantly extends the scope of the causal bandit literature by entirely removing the assumption of knowing the graph topology while allowing general soft interventions. All our other assumptions are either in line with or more relaxed than those in the existing literature, and we have provided more details on these. We are glad that the reviewers have not pointed to any technical flaws in the analysis. We believe this paper provides a major contribution in the trajectory of development in causal bandits, which has been moving away from stylized assumptions. We certainly agree that important issues (e.g., hidden confounders) remain for future work.

Additional empirical evaluations: With the additional results, overall we have provided evaluations that assess the key theoretical findings of the paper. These include: (i) scaling behavior with respect to the key parameters in the results, i.e., $d$ and $L$; (ii) sublinear regret and fast convergence; and (iii) the computational advantage, which is a bottleneck for almost all existing approaches to causal bandits, even under stronger assumptions.

We remark that, due to computational challenges, causal bandit algorithms are generally evaluated on small graphs. Additionally, the hierarchical graph is the one that can mimic the worst-case regret for algorithms with given parameters. The hierarchical graphical models that we evaluate have $L=9$ layers and $N=19$ nodes, which is the largest model considered in the literature (for instance, the JMLR paper [14], the AISTATS paper [17], and the JSAIT paper [24] consider hierarchical graphs with no more than 3 layers; JMLR [14] and JSAIT [24] also consider simpler graphs with at most $N=10$ nodes; and [22], which focuses on $\operatorname{do}$ interventions with an unknown graph, provides experiments on a simple binary tree with $N = 20$ nodes).

Scaling with $L$: The scaling of the regret with respect to the causal depth $L$ is depicted in Figure 1 of the attached document for a hierarchical graph with $d=2$ and $L$ varying from 1 to 9. The theoretical results (regret upper and lower bounds) predict that the regret grows at the rate $(2c)^L$ (i.e., exponentially in $L$). The empirical results in Figure 1 corroborate that the cumulative regret scales exponentially with the length $L$, and the actual regret closely tracks the upper bound's trend.
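Schematically, the exponential trend follows the layer recursion described in the rebuttal above; here $\Delta_\ell$ is an informal shorthand (not the paper's notation) for the worst-case value deviation at causal depth $\ell$:

```latex
% Each layer amplifies the deviation by at most a factor c*d:
\Delta_{\ell} \le c\,d\,\Delta_{\ell-1}
\quad\Longrightarrow\quad
\Delta_{L} \le (c\,d)^{L}\,\Delta_{0},
\qquad\text{so for } d=2:\ \Delta_{L} \le (2c)^{L}\,\Delta_{0}.
```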

Scaling with $d$: The scaling of the regret with respect to the maximum in-degree $d$ is depicted in Figure 2 of the attached document, based on a hierarchical graph with $L=2$. We note that we increase the sufficient-exploration parameters $T_1$ and $T_2$ to ensure reliable results for all degrees. The theoretical predictions suggest that our algorithm's regret scales as $d^{3/2}$ (i.e., polynomial growth in $d$). For this choice of $L$, the lower bound becomes linear in $L$. Figure 2 demonstrates that our regret is super-linear and tracks the polynomial trend of the regret upper bound (i.e., our achievable regret).

Sublinear regret: Figure 3 of the attached document shows the cumulative regret for a larger graph with $L=6$ and $d=2$. In this scenario, GA-LCB exhibits sublinear regret both with and without graph information. However, GA-LCB without graph information incurs higher cumulative regret due to the additional exploration required for estimating the topology.

Computational cost: As discussed in Section 3.3, a key contribution of our algorithm is its significant computational advantage over the existing causal bandit algorithms (which also rely on stronger assumptions). First, our algorithm circumvents the computational complexity of computing the upper confidence bounds (which generally requires solving an optimization problem) by iteratively computing the value through the causal depth. To highlight the advantage, we note that UCB-based causal bandit algorithms are computationally viable only for $L\leq 2$ or for special graphs such as chain graphs [24]. In contrast, the computational complexity of our algorithm scales linearly with $L$ (specifically, as $\mathcal{O}(Ld^3)$). This has allowed us to easily implement the algorithm for causal paths as long as $L=9$. Additionally, our algorithm includes a phase-elimination step that removes sub-optimal arms from the set of potentially best interventions. Since we need to calculate UCBs for each possible intervention, the phase-elimination step prevents unnecessary UCB computation, which results in significant computational savings as the time horizon increases. We have also provided more empirical analysis of the computational complexity in Appendix A. Figure 4 (Appendix A) illustrates the computational differences between the various algorithms. It is evident that our algorithm significantly reduces computation time, even for small graphs, whereas the other algorithms are not scalable with respect to either the degree $d$ or the length $L$.
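To make the iterative computation concrete, here is a small interval-arithmetic sketch that propagates per-node value ranges layer by layer through a linear SEM, rather than solving one joint optimization over all edge weights. The graph, the confidence intervals on the weights, and the noise bounds below are all hypothetical, and this is only a schematic of the idea, not the GA-LCB routine itself:

```python
def interval_mul(a, b):
    """Tight product of intervals [a0, a1] and [b0, b1]."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

# parents[i] = list of (parent index, confidence interval on that edge weight)
parents = {
    0: [],                                    # root node, directly intervened
    1: [(0, (0.7, 0.9))],                     # edge 0 -> 1
    2: [(0, (-0.2, 0.1)), (1, (0.4, 0.6))],   # edges 0 -> 2 and 1 -> 2
}
root_range = (0.0, 1.0)    # value range induced by the intervention at the root
noise = (-0.1, 0.1)        # bounded noise contribution at each non-root node

value = {}
for i in sorted(parents):  # one pass per node in topological order
    box = root_range if not parents[i] else noise
    for p_idx, w in parents[i]:
        box = interval_add(box, interval_mul(w, value[p_idx]))
    value[i] = box

ucb = value[2][1]          # optimistic (upper-confidence) value of the sink node
print(ucb)                 # ≈ 0.8 for these hypothetical intervals
```

Because each node only combines its parents' already-computed intervals, the work per node is proportional to its in-degree, which is what keeps the overall cost growing linearly with the causal depth.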

Final Decision

The paper studies causal bandits with an unknown causal graph structure and soft interventions. The reviewers acknowledge that the paper addresses an important extension of the causal bandit problem with significant results. The issues raised by the reviewers were mostly addressed by the authors during the rebuttal stage. Overall, the paper provides solid contributions to the study of causal bandits, and I recommend accepting the paper.