Causal Logistic Bandits with Counterfactual Fairness Constraints
We analyze logistic bandits regret with a long-term constraint for counterfactual fairness.
Abstract
Reviews and Discussion
This paper introduces a framework for causal logistic bandits with counterfactual fairness constraints, addressing sequential decision-making under fairness requirements and offering theoretical guarantees and practical algorithms for balancing performance and fairness.
update after rebuttal
The authors added experimental analysis of the proposed method. However, the baselines compared are a bit old. Even without experiments, I agree this is a good paper. As a result, I decided to keep my high score.
Questions for Authors
No.
Claims and Evidence
Lines 99-101: "We provide a unified analysis for the estimation method, algorithm design, and performance guarantee." This paper provides only a theoretical effectiveness analysis, with no experimental validation of performance.
Methods and Evaluation Criteria
The proposed methods in the paper are theoretically well-aligned with the problem of causal fair decision-making in sequential settings. However, this paper has only theoretical bounds and no empirical validation.
Theoretical Claims
I have checked the theoretical claims in this paper. I think the proofs are correct under the stated assumptions.
Experimental Design and Analysis
There is no experimental verification in this paper.
Supplementary Material
Yes, I reviewed specific parts of the supplementary material, especially Appendix D (Logistic Regret Upper Bounds), which includes the proof of the bounds.
Relation to Prior Literature
The key contributions of the paper intersect with multiple strands of the scientific literature, addressing gaps in causal inference, fairness in machine learning, and sequential decision-making.
Specifically, the framework of this paper is built on counterfactual fairness [1], with a logistic bandits formulation [2].
[1] Kusner, Matt J., et al. "Counterfactual fairness." Advances in neural information processing systems 30 (2017).
[2] Abeille, Marc, Louis Faury, and Clément Calauzènes. "Instance-wise minimax-optimal algorithms for logistic bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- This paper introduces a novel integration of causal inference and counterfactual fairness into a logistic bandit framework.
- The proposed algorithm learns fairness constraints online without assuming a known safe policy a priori.
- This paper establishes sublinear regret and constraint violation bounds for causal logistic bandits with counterfactual fairness.
Weaknesses:
- The paper’s theoretical contributions are not supported by empirical results (e.g., simulations or real-world experiments). This limits its practical credibility, as the theoretical bounds may not be sharp for complex causal structures.
- While the theoretical sections are rigorous, the mathematical density may challenge readers unfamiliar with advanced bandit theory or causal inference. Simplifying key ideas could improve accessibility.
Other Comments or Suggestions
Including a description of the meaning of each variable in Figure 1 would help readers understand the counterfactual fairness problem considered.
Thank you for the thoughtful feedback. Below, we address the main comments provided in the review.
Claims and Evidence
1. "performance guarantee" This paper has only theoretical effectiveness analysis, no experimental validation of performance.
We will add empirical results using a synthetic dataset, evaluating both cumulative regret and constraint violations over multiple horizons. See below for details.
Weaknesses
1. The paper’s theoretical contributions are not supported by empirical results (e.g., simulations or real-world experiments).
We will add experiments using a synthetic dataset. Plots of the empirical results (penalized cumulative regret, cumulative regret, and cumulative constraint violations) can be accessed through the following link:
https://anonymous.4open.science/r/constrained-causal-logistic-bandits-EA00
- Dataset description: We use a synthetic structural causal model (modified from an example in [1]), i.e.,

$A \leftarrow U_{A},$

$W \leftarrow \mathcal{N}(0, \max(|U_{W}|, |A|)),$

$M \leftarrow \mathcal{N}(0, |W| / 2 + |U_{M}| / 3)\ \text{if } A = 1; \qquad M \leftarrow \mathcal{N}(0, |W| / 3 + |U_{M}| / 2)\ \text{if } A = 0,$

$D_t \leftarrow \mathcal{N}(0, \max(|W|, |M|)),$

$Y \leftarrow \mathbb{1}\big(U_{Y} + \tfrac{1}{3} M D - \tfrac{1}{5} W > 0\big),$

$U_{A} \in \{0,1\},\ P(U_{A} = 1) = 0.5; \qquad U_{W}, U_{M}, U_{Y} \sim \mathcal{N}(0,1),$

where $A$ denotes the sensitive attribute (binary), $W$ is the confounded feature, $M$ represents the intermediate feature, $\mathcal{D}_t$ is the decision set, and $Y$ is the outcome. At every round, we generate a set of 20 random feature vectors that form the (time-varying, discrete) decision set, along with their corresponding counterfactual feature vectors (a minimal simulation sketch is given after this list).
- Baselines: We evaluate three different algorithms: GLM-UCB [1] (generalized linear bandits without constraint), OFULog-r [2] (logistic bandits without constraint), and CCLB (our method, causal logistic bandits with constraint, using a modified confidence set; see the note below). The code for "Achieving counterfactual fairness for causal bandit" is not publicly available; we reached out to the authors requesting a copy of their code and followed up, but have not received it.
- Metrics: We use cumulative regret, cumulative constraint violations, and a penalized form of cumulative regret, where if the learner chooses an unfair action, the learner still observes the outcome (i.e., the learner can improve the reward parameter estimate), but we penalize the learner by counting the reward as 0. In this way, constraint violations are allowed but are not profitable.
- Results: Beginning with penalized cumulative regret, where rewards are only received for fair actions, there is a large gap between our method (CCLB) and the baselines, with the gap growing larger for longer horizons. Both baselines appear to incur linear penalized cumulative regret across the horizons used. This is expected, since the baselines do not account for constraints. Their cumulative constraint violations also appear to be linear.
- Confidence set: The confidence set in Algorithm 1 is non-convex. We propose an additional variant of our algorithm using a convex confidence set (based on [2]), improving our algorithm's computational complexity at the cost of slightly worse regret and constraint violation bounds. Our proofs for the convex set are largely the same as our current proofs. We use this variant for the experiments.
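To make the data-generating process above concrete, the following is a minimal simulation sketch under the stated SCM (our own illustrative code, not the released implementation; names such as `sample_unit` and the use of NumPy are assumptions):

```python
import numpy as np

def sample_unit(rng):
    """Draw one candidate's factual and counterfactual features from the SCM above.
    The exogenous noises U_A, U_W, U_M are shared between the two worlds."""
    U_A = rng.integers(0, 2)                  # P(U_A = 1) = 0.5
    U_W, U_M = rng.normal(), rng.normal()

    def downstream(A):
        # W and M are descendants of A, so their distributions shift under do(A = a).
        W = rng.normal(0.0, max(abs(U_W), abs(A)))
        scale = abs(W) / 2 + abs(U_M) / 3 if A == 1 else abs(W) / 3 + abs(U_M) / 2
        M = rng.normal(0.0, scale)
        return W, M

    A = int(U_A)
    W, M = downstream(A)                      # factual features
    W_cf, M_cf = downstream(1 - A)            # counterfactual features under do(A = 1 - A)
    return (A, W, M), (1 - A, W_cf, M_cf)

rng = np.random.default_rng(0)
# At every round, 20 such candidates form the time-varying discrete decision set.
decision_set = [sample_unit(rng) for _ in range(20)]
```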
[1]. Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.
[2]. Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pages 3691–3699. PMLR, 2021.
2. Simplifying key ideas could improve accessibility.
That is a good suggestion. Where possible in the main text (respecting space limitations), alongside definitions we will explain how quantities and operations work with a running example of hiring (mentioned in the introduction) or online recommendation (mentioned in Appendix C.1). Alternatively, we will expand Appendix C.1 to discuss how definitions, operations, and properties would be instantiated for these applications, to give readers stronger intuition.
Comments or Suggestions
1. Including a description of the meaning of each variable in Figure 1 would help readers understand the counterfactual fairness problem considered.
We will revise Figure 1's caption as discussed with Reviewer Ye1T for Comments or Suggestions 1.
Thanks for the response, and I appreciate the added synthetic experiment. However, I think the latest baselines are not compared. I will keep my high score.
However, I think the latest baselines are not compared.
Thank you for your time and feedback. We would like to clarify a point about baselines. To the best of our knowledge, [1] is the latest baseline on the topic of causal bandits with counterfactual fairness. They achieve zero constraint violations in every round; however, their method requires a priori knowledge of a fair action, which ours does not (see Footnote 1 of our paper). Their code is not public. We reached out to the authors requesting a copy of their code and followed up, but still have not received it. There are some recent works on unconstrained logistic bandits that focus on computational efficiency [2] or a multinomial framework [3], but their settings are different from ours and are not directly comparable. We would appreciate it if you could direct us to other recent baselines for comparison. We will incorporate them into our revision.
[1]. Huang, W., Zhang, L., and Wu, X. Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6952–6959, 2022.
[2]. Faury, L., Abeille, M., Jun, K.-S., and Calauzènes, C. Jointly efficient and optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pp. 546–580. PMLR, 2022.
[3]. Lee, J., Yun, S.-Y., and Jun, K.-S. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. In International Conference on Artificial Intelligence and Statistics, pp. 4474–4482. PMLR, 2024.
This paper studies the problem of imposing counterfactual fairness constraints in causal logistic bandit problems. The authors propose an algorithm that leverages primal-dual optimization for constrained causal logistic bandits, where the non-linear constraints are a priori unknown and must be learned over time. The authors theoretically demonstrate that their algorithm achieves sublinear regret and constraint violations. Notably, they also provide a user-chosen parameter to control the trade-off between regret and the degree of constraint violation (fairness).
Questions for Authors
- Could you please explain the limitations of the proposed algorithm?
- Could you provide details on how the counterfactual fairness quantity can be estimated in practice?
Claims and Evidence
Claims made in the paper are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods make sense for solving the problem.
Theoretical Claims
I didn't check the correctness of any proofs for theoretical claims in detail.
Experimental Design and Analysis
There are no numerical experiments in this paper. The theoretical analyses are sound to me.
Supplementary Material
I briefly reviewed the supplementary material.
Relation to Prior Literature
This paper is closely related to literature on causal logistic bandit and counterfactual fairness.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
1. The proposed algorithm is well-analyzed, and the authors provide rigorous theoretical analyses and performance guarantees.
2. The paper addresses the gap of incorporating counterfactual fairness (CF) constraints into causal logistic bandits.
Weakness
1. The paper could benefit from extensive numerical studies.
Other Comments or Suggestions
The paper's contribution would be enhanced by including numerical experimentation that demonstrates the validity of the proposed algorithm and the sublinear regret and constraint violation guarantee.
We appreciate your time and feedback. Below, we address the main comments provided in the review.
Weaknesses
1. The paper could benefit from extensive numerical studies.
We have done some empirical experiments using a synthetic dataset, evaluating both cumulative regret and constraint violations over multiple horizons. See the discussion for Reviewer y2yQ's comments (Weakness 1) for the results and discussion.
Questions
1. Could you please explain the limitations of proposed algorithm?
There are a few limitations of our proposed algorithm:
- Unobserved confounding: We assume that there are no unobserved confounders for the outcome (i.e., we observe the (possibly multivariate) context features and intermediate features). See the discussion with Reviewer Ye1T for Weakness 2 regarding possible future extensions to settings with unobserved confounders.
- Stationarity: See the discussion with Reviewer Ye1T for Question 2 for discussion about extending to non-stationary environments.
- Budget constraints: Another direction for improvement is to also consider global resource constraints, where different actions may incur different costs and there is a total budget the learner must adhere to.
2. Could you provide details on how the counterfactual fairness quantity can be estimated in practice?
To estimate the counterfactual fairness quantity, at each round we estimate the confidence set of the reward parameter and choose the optimistic parameter $\tilde{\theta}_{t}$ within the confidence set (see steps 3 and 4 in Algorithm 1, line 279), under which the decision $d$ has the highest expected reward. Specifically,

$\hat{\Delta}(d, X_{t}) = g\big( f(Z_{t})^\top \tilde{\theta}_{t} \big) - g\big( f(Z_{a_{t}'})^\top \tilde{\theta}_{t} \big),$

where $g$ is the standard logistic function, and $Z_{t}$ and $Z_{a_{t}'}$ are the factual and counterfactual feature vectors; specifically, the intermediate features are sampled from their factual and counterfactual distributions, the confounded feature is sampled from its counterfactual distribution, and the sensitive attribute takes its counterfactual value. Both $Z_{t}$ and $Z_{a_{t}'}$ are known to the learner.
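As a hedged illustration of this plug-in estimate only (our own sketch, not code from the paper; the helper names `sigmoid` and `estimate_delta` and the fixed feature dimension are assumptions):

```python
import numpy as np

def sigmoid(x):
    # the standard logistic function g
    return 1.0 / (1.0 + np.exp(-x))

def estimate_delta(phi_factual, phi_counterfactual, theta_optimistic):
    """Plug-in estimate of the counterfactual fairness gap for one decision:
    the difference in predicted success probability between the factual and
    counterfactual feature maps, under the optimistic reward parameter."""
    return sigmoid(phi_factual @ theta_optimistic) - sigmoid(phi_counterfactual @ theta_optimistic)

# Toy example with 5-dimensional feature maps f(Z_t) and f(Z_{a_t'}).
rng = np.random.default_rng(1)
phi_f, phi_cf, theta = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
delta_hat = estimate_delta(phi_f, phi_cf, theta)  # a constraint check would use, e.g., |delta_hat|
```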
This paper puts forward an online decision-making framework with counterfactual fairness constraints. The paper considers a sequential decision-making setup where a set of applicants arrives at each time step. The goal is to make labeling decisions that maximize the cumulative outcome reward across all time steps while minimizing the cumulative fairness violations of the decisions. The fairness criterion the paper focuses on is counterfactual fairness, which quantifies the bias/unfairness of any set of decisions as the change in expected outcome reward when the protected attribute takes a counterfactual value.
For this constrained setup, the paper formulates an online learning algorithm that employs the primal-dual paradigm. They obtain sublinear guarantees in the horizon for the cumulative reward regret and for the cumulative constraint violations using this method.
Questions for Authors
I have noted the questions on the two points I discussed above.
- How does the regret bound in Theorem 2 change with the constant defined in Assumption 3?
- Does the sensitive attribute intervention in the counterfactual fairness metric only change the sensitive attribute value (keeping everything else unchanged)? If so, is that realistic in the applications relevant to this paper?
Claims and Evidence
Overall, the paper provides a useful framework for fair sequential decision-making that can maximize cumulative reward while minimizing counterfactual violations, with provable guarantees. Despite the rigidity of online decision-making setups, the algorithm proposed to solve the constrained problem does seem to be practical. The main claim of sublinear reward regret and sublinear cumulative constraint violation in Theorem 1 looks correct.
However, the claims made regarding the follow-up improved regret bounds in Section 3.4 are a bit unclear to me. In particular, the paper introduces a tightness parameter for the fairness constraint whose value can be set in a manner that allows one to upper bound the cumulative constraint violations by zero. However, this comes with tradeoffs in reward regret. The third term in the regret bound has an additional dependence on the constant defined in Assumption 3, which could use some more explanation.
Specifically, this constant is based on the assumption that there exists a policy that achieves "negative" expected constraint violations. Intuitively, if I understand correctly, the larger this value, the easier it likely is to find a "fair" policy, and the smaller the reward regret will be. So the inverse dependence on this constant in Theorem 1 makes sense. However, the opposite seems to be true in Theorem 2, where substituting the tightness parameter by its upper bound leads to a directly proportional dependence of regret on the constant, which doesn't seem right. It would be good to know whether I am missing something here or whether this relationship between regret and the constant is to be expected in the zero-bounded constraint violation setting.
Methods and Evaluation Criteria
The methodology used to solve the constrained bandit problem makes sense.
Regarding the chosen fairness criteria, the assumptions on the counterfactuals could be further expanded. The authors define a feature set at time $t$ and a corresponding counterfactual feature set, and claim that both lie in the same feature space, but the paper doesn't specify the impact of the sensitive attribute intervention on the other features.
In particular, does the intervention imply that only the sensitive attribute changes in the counterfactual without any changes to the other attributes in the feature set (or their underlying distribution)? If so, that doesn't seem realistic about how sensitive attributes work in real-world settings. For instance, gender influences other intermediate features used for making the final decision, e.g., "soft skills" or "prior experience" in interview settings. An appropriate counterfactual with respect to gender should ideally track (stochastically) the changes in the distributions of other related features as well.
Theoretical Claims
Theoretical claims seem mostly correct. I am curious about the dependence in Theorem 2 which I have noted above.
Experimental Design and Analysis
No experimental analyses.
Supplementary Material
No supplementary material. All proofs are provided in the paper appendix.
Relation to Prior Literature
The paper contributes to the broad literature on algorithmic fairness methods, especially targeting the setting of online decision-making. With regard to fairness, the paper chooses to focus on causal formulations of fairness to identify and tackle causal relationships between sensitive attributes and decisions that can potentially lead to discriminatory outcomes.
Essential References Not Discussed
All main references are discussed.
Other Strengths and Weaknesses
Overall, the paper is well-written. Theoretical claims are sufficiently explained and discussed without an overreliance on technical details. I also appreciated the inclusion of Remarks 3 and 4 as they help flesh out the implications of the performance guarantees and their relation to bounds established in prior related works. As a mostly theoretical paper, the authors do a good job of grounding the results in the context of the problem statement and providing intuitive explanations for their theorems and propositions.
Other Comments or Suggestions
No other comments.
We are grateful for your detailed and constructive suggestions on our submission. We address the main comments as follows.
Claims and Evidence
1. Intuitively, if I understand correctly, the larger this value, the easier it likely is to find a "fair" policy, and the smaller the reward regret will be.
That is correct: a strictly positive constant means there are strictly feasible actions, whose counterfactual fairness violation for that context is strictly smaller than the threshold. The larger this margin is for a problem, intuitively, the easier it will be to identify some fair actions (though they could have low rewards).
We note that a larger margin only reduces the regret bounds (i.e., only simplifies learning) to a certain extent. Even with a large margin, the best fair actions (in terms of expected reward) could still be on the boundary (i.e., for many contexts, the optimal action could satisfy the constraint with equality). Regarding how a larger margin manifests in improving the regret bounds: for binary rewards, the fairness thresholds are bounded (since counterfactual fairness violations are absolute differences of expected rewards of binary-valued outcomes), so the margin is bounded as well, and a larger margin therefore only shrinks the corresponding minimum in the regret bounds towards a constant factor (i.e., not towards zero).
2. Substituting the tightness parameter by its upper bound leads to a directly proportional dependence of regret on the constant, which doesn't seem right.
Sorry for the confusion regarding those terms and the discussion of the regret formula. A larger tightness variable means the learner is increasingly cautious, effectively working with a stricter fairness constraint. A larger Slater constant means more cautiousness is possible (a bigger admissible range for the tightness variable). The more cautious the learner is (the larger the tightness), the more the learner might miss out (in the worst case) on higher-reward actions among the set of fair actions, which is itself larger when the Slater constant is larger.
After Theorem 2, we will add an extra remark before the current Remark 4 to clarify this point:
Remark 4. The difference between the regret bounds of Theorem 2, with a user-chosen tightness variable, and of Theorem 1 depends on both the tightness variable and the Slater constant. For a problem-dependent (fixed) Slater's constraint qualification constant, increasing the tightness only worsens the regret bound, as the learner is increasingly cautious, selecting from a smaller set of actions than it would have without tightening. If the Slater constant is large, the corresponding factor in the bound shrinks towards 2, so for a fixed tightness the regret bound reduces. A larger Slater constant also allows for a bigger range of the tightness variable, and thus more room for caution (and regret).
Questions
1. How does the regret bound in Theorem 2 change with ?
See Claims and Evidence 1 (above).
2. Does the sensitive attribute intervention in the counterfactual fairness metric only change the sensitive attribute value (keeping everything else unchanged)? If so, is that realistic in the applications relevant to this paper?
The (hypothetical) intervention on the sensitive attribute does change the distribution of other variables as well, namely the descendants of the sensitive attribute in the structural causal model. The structural causal equations for the intermediate features, the outcome, and potentially the confounders depend on the sensitive attribute value, so when the sensitive attribute is intervened on, their (estimated and exact) conditional distributions change.
For example, in our synthetic dataset experiment (see the discussion of Reviewer y2yQ's comments for the results and discussion), we create a structural causal model (including exogenous variables, endogenous variables, and structural causal equations) whose causal relationships are described by the extended standard fairness model. More specifically, the causal mechanism between the sensitive attribute $A$ (binary), the intermediate variable $M$, and the confounded variable $W$ is represented as:
$M \leftarrow \mathcal{N}(0, |W| / 2 + |U_{M}| / 3)\ \text{if } A = 1; \qquad M \leftarrow \mathcal{N}(0, |W| / 3 + |U_{M}| / 2)\ \text{if } A = 0,$
where $U_{M}$ is the exogenous variable for $M$, and $\mathcal{N}$ denotes the normal distribution.
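Concretely, under this SCM (the counterfactual subscript $A \leftarrow 0$ below is our own notation, used only for illustration), the factual and counterfactual intermediate features share the exogenous noise $U_{M}$ but follow different conditional distributions:

$M \mid (A = 1, W, U_{M}) \sim \mathcal{N}(0, |W|/2 + |U_{M}|/3), \qquad M_{A \leftarrow 0} \mid (W_{A \leftarrow 0}, U_{M}) \sim \mathcal{N}(0, |W_{A \leftarrow 0}|/3 + |U_{M}|/2),$

where $W_{A \leftarrow 0}$ is itself resampled from its counterfactual distribution, since $W$ is also a descendant of $A$.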
The paper introduces a new problem formulation: Constrained Causal Logistic Bandits (CCLB), which addresses online decision-making under counterfactual fairness constraints. The paper proposes an algorithm that utilizes primal-dual optimization for constrained causal logistic bandits with non-linear constraints that are a priori unknown. Further, the authors establish a regret bound as well as a fairness constraint violation bound in Section 3.3. In Section 3.4, the paper shows that the fairness constraint violation bound can be further improved by introducing a tightness variable in the dual update.
Questions for Authors
Can you explain more about the tightness variable? To be specific, why, intuitively, could this variable help improve the constraint violation bound?
Is it possible to extend this model to handle distribution shift over time? Especially in online MAB-type algorithms, the distribution might change over time.
Claims and Evidence
The paper claims to propose an algorithm for solving CCLB and to provide theoretical bounds for it. The claims are supported by mathematical proofs.
Methods and Evaluation Criteria
N.A.
Theoretical Claims
The proofs are correct.
Experimental Design and Analysis
N.A.
Supplementary Material
N.A.
Relation to Prior Literature
The paper proposes the constrained causal logistic bandits (CCLB) problem. The formulation and online setting are both theoretically interesting and meaningful for real-world applications.
Essential References Not Discussed
I am not aware of any such references that should have been discussed.
Other Strengths and Weaknesses
Strengths
The paper is well written and easy to follow. The proposed methods and theoretical bounds are solid. The theoretical contribution is new and creative.
Weaknesses
My primary concern is the lack of empirical validation. Though this paper is purely theoretical, simulations on synthetic or semi-synthetic datasets are expected. For instance, a comparison against the unfair counterpart, as well as the convergence behaviour, would significantly strengthen the paper's claims.
Another concern is the assumption on the causal graph: without unobserved confounders, the problem might not generalize to real-world systems, where unobserved confounders are usually expected.
Other Comments or Suggestions
While the paper is generally well written, I have one suggestion. One of the variables introduced directly on page 4 does not appear as a causal variable in the graph, so readers need to guess how it is related to the other variables.
We appreciate your time and feedback. Below, we address the main comments provided in the review.
Weaknesses
1. My primary concern is the lack of empirical validation.
We will add empirical results using a synthetic dataset, evaluating both cumulative regret and constraint violations over multiple horizons. See the discussion for Reviewer y2yQ’s comments (Weakness 1) for the empirical results and discussion.
2. Another concern is the assumption on the causal graph: without unobserved confounders, the problem might not generalize to real-world systems, where unobserved confounders are usually expected.
Counterfactual fairness is an important fairness criterion. Since we cannot actually intervene on the sensitive attribute (race, gender, etc.), we must instead estimate hypothetical interventions from the data we collect. In general, in the presence of unobserved confounders (i.e., if W is not or only partly observed), counterfactuals might not be identifiable. To be clear, this is not simply a limitation of our method or analyses, but a more fundamental challenge regarding counterfactuals and causality more broadly.
For problems with unobserved confounders, there is active research on bounding counterfactual queries, such as [1]. An important future direction is to extend our method to work with those. Thus, even for the broader family of problems that have unobserved confounders, we view our current work as an important step.
[1]. Junzhe Zhang, Jin Tian, and Elias Bareinboim. Partial counterfactual identification from observational and experimental data. In International Conference on Machine Learning, pages 26548–26558. PMLR, 2022.
Comments or Suggestions
1. Readers need to guess how it is related to the other variables.
Thank you for pointing this out. We will revise Figure 1's caption to be:
Extended Standard Fairness Model (SFM). $A$ denotes the sensitive attribute, $W$ is a set of confounded features, $M$ is a set of intermediate features, $D$ is the decision, and $Y$ is the outcome. Let $X$ denote the contextual information the learner has available to make the decision, and let $Z$ denote the variables the reward distribution may depend on. In the main text we will also define the latter in terms of its components.
Questions
1. Can you explain more about the tightness variable? To be specific, why, intuitively, could this variable help improve the constraint violation bound?
We will revise the text in Section 3.4, line 404, to include more intuition for how the tightness variable improves the constraint bound:
"... same asymptotic order of regret as before. Intuitively, with a tightness variable in the constraint, the learner will be more cautious in selecting actions by effectively working with a stricter constraint (e.g., with a tightened fairness threshold instead of the original one). Then, under this new hypothetical ..."
2. Is it possible to extend this model to handling distribution shift over time?
That is a good question.
For MAB problems without constraints, there is work on black-box adaptations that convert stationary MAB methods to non-stationary ones. One well-known method is “Non-stationary Reinforcement Learning without Prior Knowledge: an Optimal Black-box Approach”; however, as recently pointed out in “Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?”, the former is not order-optimal in some settings. The authors of the latter paper recently proposed a black-box method to convert algorithms from stationary MAB to piecewise-stationary MAB, “Detection Is All You Need: A Feasible Optimal Prior-Free Black-Box Approach For Piecewise Stationary Bandits”. Since they do not consider long-term constraints, we cannot apply their method directly, but it is a promising direction to explore.
An alternative route is to adapt a specific method, for which there are several results without long-term constraints, e.g., ”Revisiting weighted strategy for non-stationary parametric bandits”, and some results with constraints, e.g., ”Non-stationary bandits with knapsacks”. Both of these works propose interesting methods that might help us extend our current model to non-stationary environments. The former analyzes non-stationarity in logistic bandits using a weighted strategy, and the latter studies knapsack (long-term) constraints in a non-stationary environment using a sliding-window strategy.
Thanks for the reply. I will keep my score.
This paper focuses on the following bandit setting: the learner seeks to make fair decisions online, where the feedback can be encoded by a logistic function and fairness requires a form of counterfactual reasoning. The objective is to maximize the cumulative reward while minimizing the cumulative fairness violations over time. The paper leverages a primal-dual formulation to solve the constrained optimization problem, where the constraints are nonlinear and have to be learned over time. The paper theoretically analyzes the regret in outcome rewards and in fairness violations. Improvements are discussed too.
This paper is a technically solid contribution to the community of causal fairness and online decision making.