Multi-agent Markov Entanglement
This paper establishes the mathematical foundation of value decomposition in MARL.
Abstract
Reviews and Discussion
This paper provides a theoretical analysis of the error introduced by approximating a joint value function (across multiple agents) as a linear combination of individual value functions, referred to as value decomposition. In particular, the authors introduce a measure of Markov entanglement, which bounds the error of such decompositions. To estimate this entanglement, they propose measuring the total variation distance between the transition kernel of the focal agent's local state (and its next action/policy) conditioned only on its own current local state and action, and the transition kernel and policy conditioned on the global state and actions.
Strengths and Weaknesses
Strengths
- The idea of entanglement makes intuitive sense — it seems natural that value functions can be linearly decomposed into their component parts if their transition distributions are independent.
- The motivation for the work is clearly laid out, and I was able to follow the argument clearly throughout despite not being familiar with some of the experiment settings and index policies.
Weaknesses
- The empirical evaluation is limited, consisting of only a single demonstration on index policies showing sublinear decomposition error scaling. However, I would argue that this is not a significant weakness given the primarily theoretical nature of the paper.
- I believe that the paper’s notion of Markov entanglement is missing some key relevant work (expanded further upon in Questions).
- Small nit: typo in Meta Algorithm 1 “Leaning”
Questions
- The proposed method to estimate Markov entanglement seems to be closely related to common counterfactual methods in MARL used to estimate influence. For example, [1] uses a similar measure of information-theoretic influence to encourage agents to explore highly entangled transitions, and [2] proposes a similar mutual information-based method for entanglement of policies via the causal influence of one agent’s actions on another agent’s action. While this work uses it to bound value decomposition error rather than as intrinsic rewards for exploration, I would be interested to hear the authors discuss how their notion of Markov entanglement relates to these previous works.
- In J.4, Proposition 12 seems to be circular in some sense — bounding the error of value decomposition in terms of reward entanglement, which itself measures the accuracy of a global reward decomposition over the stationary state distribution. Can the authors clarify the contribution of this point?
- Apart from index policies, I would like to see more experiments to demonstrate the practical ramifications of using value decomposition in highly entangled MDPs with function approximation, e.g., a toy environment where the transition matrix/rewards can be altered to tune the level of entanglement, showing worse asymptotic performance with increasing entanglement for common decomposition methods such as VDNs [3] or QMIX [4]. I acknowledge that this would require a fair bit of engineering work, but might be useful for a camera-ready copy to make this more relevant to a wider body of MARL practitioners — I find that experiments only on index policies provide a rather narrow scope to demonstrate the utility of this approach.
[1] Wang, T., Wang, J., Wu, Y. & Zhang, C. Influence-Based Multi-Agent Exploration. in Eighth International Conference on Learning Representations (2020).
[2] Jaques, N. et al. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning. in Proceedings of the 36th International Conference on Machine Learning 3040–3049 (PMLR, 2019).
[3] Sunehag, P. et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning.
[4] Rashid, T. et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 21, 178:7234-178:7284 (2020).
Limitations
Yes
Justification for Final Rating
The Inventory Simulator experiments clearly demonstrate the relationship between increased entanglement and the performance degradation of value decomposition-based methods, which addressed my concerns about the actual effects of Markov entanglement. They also addressed my questions about J.4.
As said in my response, while I believe that an alternative framing and experiments on more commonly used MARL environments would increase the impact of this work, I believe that it provides a valuable contribution to the community and a solid theoretical framework to investigate decomposition-based approximations. Therefore, I will raise my score to a 5 - with limited experimental evaluation being the primary flaw that precludes a 6 (however, I do not think this merits a 4 as this paper's contribution is primarily theoretical).
Formatting Issues
N/A
Thanks a lot for your review! Your suggestions are very insightful in helping us improve the paper and build stronger connections to the MARL literature.
New Empirical Evaluations
We have conducted additional empirical evaluations to examine how VDN's performance varies with different levels of system entanglement. These experiments utilize a synthetic inventory control environment, which we describe in detail below.
An Inventory Simulator
A single-product inventory system consists of $N$ warehouses and 1 central factory. We denote the inventory level at warehouse $i$ as $x_i$ and at the factory as $x_0$. At each timestep,
- Each warehouse $i$ observes a customer order of $d_i$ products, where $d_i$ is drawn from a Bernoulli distribution with parameter $p_i$.
- The central factory's action is to decide how many new products to produce (an integer $a_0 \geq 0$).
- Each warehouse's action is to decide how many products to order from the central factory (an integer $a_i \geq 0$).
- Backlogged orders are allowed: an order not fulfilled at this timestep can be fulfilled at future timesteps.
- The local state of each warehouse is $(x_i, d_i)$, its current inventory level and current order. The central factory's state is its inventory $x_0$.
- The transition of each warehouse follows $x_i \leftarrow x_i + a_i - d_i$, where negative inventory encodes the backlog. Similarly, the central factory transitions as $x_0 \leftarrow x_0 + a_0 - \sum_{i=1}^{N} a_i$.
- The reward of local warehouse $i$ is $r_i = -c_b \max(-x_i, 0) - c_h \max(x_i, 0)$, where $c_b$ is the penalty for backlogged orders and $c_h$ is the cost of storage. The reward of the central factory takes an analogous form with its own penalty coefficient.
Note that the central factory's state transition depends not only on its local action but also on the actions of all warehouses. This interdependence creates potential system entanglement. Intuitively, entanglement increases with the number of warehouses. We therefore control the system's entanglement by adjusting $N$, the number of warehouses.
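For concreteness, a minimal Python sketch of such a simulator is given below. The class and parameter names (`InventorySim`, `c_backlog`, `c_hold`, `c_factory`), the immediate shipment timing, and the exact cost form for the factory are our own illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class InventorySim:
    """Minimal sketch of the N-warehouse + 1-factory inventory system described above.

    Negative inventory encodes backlogged orders. Names, timing, and cost
    coefficients are illustrative assumptions, not the authors' exact setup.
    """

    def __init__(self, n_warehouses, demand_prob=0.5,
                 c_backlog=1.0, c_hold=0.1, c_factory=0.1, seed=0):
        self.N = n_warehouses
        self.p = demand_prob              # Bernoulli parameter of each customer order
        self.c_b, self.c_h, self.c_f = c_backlog, c_hold, c_factory
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.x = np.zeros(self.N, dtype=int)   # warehouse inventories (negative = backlog)
        self.x0 = 0                            # factory inventory (negative = backlog)
        self.d = self.rng.binomial(1, self.p, size=self.N)  # current customer orders
        return self._obs()

    def _obs(self):
        # local state of warehouse i is (inventory, current order); the factory observes its inventory
        return [(int(self.x[i]), int(self.d[i])) for i in range(self.N)], int(self.x0)

    def step(self, warehouse_orders, factory_production):
        orders = np.asarray(warehouse_orders, dtype=int)
        # factory: gains its production, ships what the warehouses ordered
        self.x0 += int(factory_production) - int(orders.sum())
        # warehouses: receive their orders, then serve the realized customer demand
        self.x += orders - self.d
        # per-agent costs (negative rewards): backlog penalty plus holding cost
        r_local = [-self.c_b * max(-xi, 0) - self.c_h * max(xi, 0) for xi in self.x]
        r_factory = -self.c_f * abs(self.x0)
        # next customer orders arrive
        self.d = self.rng.binomial(1, self.p, size=self.N)
        return self._obs(), r_local + [r_factory]

# usage: two warehouses, each ordering exactly its observed demand
env = InventorySim(n_warehouses=2)
local_states, x0 = env.reset()
obs, rewards = env.step(warehouse_orders=[d for (_, d) in local_states], factory_production=2)
```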
Benchmark
We implement VDN [1] as the benchmark algorithm and test its performance under different levels of entanglement; more implementation details will be added in the next version of the paper. The optimal policy has access to the global information and can therefore fully avoid the cost: the factory supplies exactly the known order amount, so the optimal value function achieves zero cost. We summarize the empirical results as follows.
| Estimated Entanglement | Global Optimal | VDN |
|---|---|---|
| 0.50 | 0 | -9.72 |
| 0.81 | 0 | -25.28 |
| 0.92 | 0 | -36.94 |
| 0.98 | 0 | -60.20 |
We have the following observation
- VDN's performance degrades as system entanglement increases. This aligns with VDN's reliance on local value functions for action selection: as entanglement grows, local state information becomes increasingly insufficient for optimal decision-making.
However, we also note that entanglement may not always determine VDN's performance. Through its centralized training framework, VDN can encode global reward/state information in its local value functions. As shown in our Section 3.1 and Appendix E, certain entangled MDPs still permit exact decomposition when local values effectively capture global reward information.
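For reference, the additive structure that VDN imposes is easy to state in code. Below is a minimal PyTorch-style sketch of that structure only (our own illustration, not the implementation we benchmarked):

```python
import torch.nn as nn

class LocalQ(nn.Module):
    """Per-agent utility: maps a local observation to Q-values over that agent's actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class VDNMixer(nn.Module):
    """VDN joint value: the sum of the per-agent Q-values of the chosen actions."""
    def forward(self, chosen_local_qs):      # shape: (batch, n_agents)
        return chosen_local_qs.sum(dim=1)    # shape: (batch,)

# Execution is fully decentralized: each agent argmaxes its own LocalQ on its local
# observation, which is why insufficient local information (high entanglement)
# translates directly into degraded decisions.
```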
Relations to Influence-based MARL
Great question! It turns out that mutual information can be viewed as a measure of Markov entanglement under the KL-divergence. Specifically, the mutual information used in [2] can be rewritten as the expected KL-divergence between the focal agent's local transition conditioned on the global state-action pair and its local transition conditioned only on its own local state and action. This is highly related to our measure of Markov entanglement under a distribution-weighted agent-wise KL-divergence, obtained by replacing the total variation distance in our original measure with the KL-divergence. Intuitively, the measure of Markov entanglement can be viewed as how closely one agent can be approximated as an independent subsystem, and this characterization aligns naturally with mutual information. Furthermore, since the KL-divergence controls the total variation distance (via Pinsker's inequality), the KL-based measure also bounds our Markov entanglement measure under the ATV distance introduced in our paper. This connection demonstrates that influence-based MARL methods naturally fit within our theoretical framework, corresponding to a specialized distance measure.
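To spell out the connection in symbols (generic notation of our own, not the paper's exact definitions: $s_i, a_i$ are agent $i$'s local state and action, $s_{-i}, a_{-i}$ those of the other agents, and all variables are distributed according to a common state-action distribution $d^\pi$), the conditional mutual information used as an influence signal is exactly an expected KL-divergence between the globally and locally conditioned local transitions:

$$I\big(s_i' \,;\, (s_{-i}, a_{-i}) \,\big|\, s_i, a_i\big) \;=\; \mathbb{E}_{(s,a)\sim d^\pi}\Big[ D_{\mathrm{KL}}\big( P(s_i' \mid s, a) \,\big\|\, P(s_i' \mid s_i, a_i) \big) \Big],$$

and Pinsker's inequality, $\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, Q)}$, is what lets the KL-based quantity control the total-variation-based one.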
Reward Entanglement in Appendix J.4
This section extends our analysis to scenarios where the global reward cannot be decomposed. In such cases, we propose learning an approximate local decomposition of the global reward, which is then used to derive local value functions. Our theoretical results demonstrate that when the global reward admits a good approximation by a sum of local rewards under the stationary distribution, the value decomposition error (measured under the same stationary distribution) remains bounded.
[1] Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
[2] Wang, Tonghan, et al. "Influence-based multi-agent exploration." arXiv preprint arXiv:1910.05512 (2019).
I appreciate the additional Inventory Simulator experiments, which clearly demonstrate the relationship between increased entanglement and the performance degradation of value decomposition-based methods, as well as the authors' clarifying comments on the proof in Appendix J.4.
I also thank the authors for their thoughtful responses relating their work to influence-based MARL. I am inclined to agree with Reviewer bQ7W in the sense that the notion of Markov entanglement can be made more intuitive to the RL community through the lens of mutual information (MI) between agent actions, rather than quantum entanglement, given that the formulations are very closely related. However, I also recognize that inspiration can come from many avenues and certainly would not fault this work for adopting a physics-based approach, as is common in many other areas of ML.
While I believe that an alternative framing and experiments on more commonly used MARL environments would increase the impact of this work, I believe that it provides a valuable contribution to the community and a solid theoretical framework to investigate decomposition-based approximations. Accordingly, I will raise my score to a 5 (accept), and ask that the authors consider including the above discussion of related MI approaches in a camera-ready copy.
We thank the reviewer for the insightful and sharp comments. The connection between mutual information in multi-agent RL and our study is indeed thought-provoking!
As suggested, we will add a more thorough discussion of mutual information approaches in the final version and revise the framing where appropriate. We also plan to incorporate additional commonly used multi-agent RL experiments. These comments are extremely helpful for improving our paper, and we sincerely appreciate them.
This paper shows that the concept of entanglement (analogous to the idea of quantum entanglement) can be used to mathematically justify why value decomposition works in reinforcement learning.
Strengths and Weaknesses
Strengths:
- The theoretical results are very rigorous and the insights are important.
- The link between entanglement and value decomposition is novel and this is an important early work in this space.
- The two agent insights are useful.
Weaknesses:
- The experimental results are very limited.
- Scalability with the number of agents is not discussed.
- The linkage to quantum entanglement is rather tenuous.
Questions
- Can you provide a clearer discussion on how your concept truly connects to quantum entanglement?
- The analysis will only be valuable if you can scale it to more than two agents.
- Can you expand the experimental results to more varied use cases?
- Do your results still hold under non-stationarity?
Limitations
Limitations are adequate
Justification for Final Rating
I maintain my review and I see this paper as a useful contribution. The authors should, however, revise the work to omit the analogy to quantum entanglement, as it does not really stand and could mislead the community.
Formatting Issues
None
Thank you for the helpful review! In response to your questions:
More Experimental Results
We have conducted additional experiments to investigate how the performance of a value decomposition based algorithm [1] varies with different levels of system entanglement. Please see our response to Reviewer H55K for complete experimental details and results.
Generalization to Multiple Agents
Our framework naturally extends to multi-agent scenarios, as briefly introduced in Section 5 and comprehensively discussed in Appendix H. All concepts and results established for two-agent MDPs maintain their validity in the multi-agent setting. Specifically, we attach Theorem 7 (line 688) in Appendix H here for reference
Theorem 7
Consider an $N$-agent MDP and a policy whose measure of Markov entanglement is defined w.r.t. the weighted agent-wise total variation distance. Then the decomposition error in the corresponding weighted norm is bounded by the measure of Markov entanglement.
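Schematically, and in notation of our own choosing rather than the paper's exact statement (with $\mathcal{E}_\mu(\pi)$ the entanglement measure under a weighting distribution $\mu$, $\widehat{V}_i^\pi$ the local value functions, and $C$ a generic constant absorbing the reward scale and the discount factor), the bound reads

$$\Big\| V^{\pi} - \sum_{i=1}^{N} \widehat{V}_i^{\pi} \Big\|_{1,\mu} \;\le\; C \cdot \mathcal{E}_{\mu}(\pi).$$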
Relations to Quantum Entanglement
So far, the connection is primarily conceptual: we observe a mathematical structure in multi-agent Markov systems that parallels quantum entanglement, and our definition of the measure is inspired by its quantum counterpart. We also believe that additional concepts from quantum physics, such as information scrambling (the process by which initially localized information becomes delocalized throughout a quantum system), may have meaningful analogues in the multi-agent Markov context. Furthermore, this shared mathematical structure suggests that quantum computers could potentially accelerate computations in multi-agent reinforcement learning, helping to overcome the curse of dimensionality (it would be exciting if this speedup were exponential over classical computers). Graphical frameworks such as 1D/2D chains developed for understanding entanglement (studied in quantum simulation through the lens of tensor networks) may also find novel applications in the Markov setting. Overall, we believe this work opens a door between two distinct fields, and we are excited to see how advances in quantum theory might benefit multi-agent RL, and perhaps vice versa.
Extension to Non-stationary MDPs
Good question! We believe our results can be extended to finite-time and non-stationary MDPs. We note that to achieve an entry-wise error bound, our proof does not require the transition data to be sampled from the stationary distribution; it only requires that the sampling distribution has positive measure for all state-action pairs. We leave further extensions to future work.
[1] Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
I thank the authors for the response. While I remain positive about the work, I think the connection to quantum entanglement is tenuous and somewhat misleading to the community. I would maintain my score only if the authors thoroughly revise the work to make the discussion on entanglement rather limited and down to earth. The current version of the paper somewhat makes that connection a central contribution, but it isn’t even a one-to-one analogy. I would also revise the title and omit entanglement from it.
We thank the reviewer for the valuable comments and suggestions! The mathematical form of quantum entanglement indeed initially inspired our study; however, we agree that this connection is limited and that an MDP system does not involve any underlying physics of quantum entanglement. We view this connection as a mathematical inspiration rather than a core contribution. Following the reviewer’s advice, we will revise the paper to clarify this point, make the discussion down-to-earth, and ensure readers are not misled into expecting any physical interpretation related to quantum entanglement.
Meanwhile, as suggested by Reviewer H55K, we will include a discussion on influence-based multi-agent RL to provide an alternative intuition supporting our value decomposition analysis.
Additionally, we will reintegrate our N-agent results from Appendix H into the main paper in the final version. The two-agent scenario was used solely for educational and illustrative purposes.
We hope these revisions address the reviewer’s comments and welcome any further feedback.
This paper establishes a rigorous theoretical foundation for value decomposition in multi-agent reinforcement learning. The authors introduce Markov entanglement—an analogue of quantum entanglement—to characterize when a joint transition probability is separable and thus admits exact value decomposition. They prove that exact decomposition holds if and only if the Markov entanglement is zero, and they derive upper bounds on the approximation error in terms of an agent-wise total variation entanglement measure.
As an application, they show that under standard index policies for Restless Multi-Armed Bandits, the Markov entanglement scales sublinearly with the number of agents, and they confirm this rate via numerical experiments on circulant RMAB benchmarks. Additionally, they propose a simple Monte Carlo–based convex optimization procedure to efficiently estimate the entanglement, allowing practitioners to verify the feasibility of value decomposition without ever computing the full global Q-function.
优缺点分析
Strengths
The paper establishes a rigorous theoretical foundation for value decomposition in MARL. It provides a clean characterization of when value decomposition is possible, via the novel and well-defined concept of Markov entanglement. The authors derive explicit upper bounds on the decomposition error in terms of an agent-wise TV entanglement measure, covering both worst-case and stationary-distribution-weighted scenarios. While the main text focuses on two agents for clarity, the framework is extended to the general $N$-agent setting in the appendix. The paper includes appropriate numerical results using the restless multi-armed bandit benchmark, supporting the theory with empirical evidence.
Overall, the paper is organized and well written.
Weaknesses
W1.
While the theory is well-developed, its practical relevance to more complex MARL domains remains unclear, as the evaluation is limited to a synthetic circulant RMAB benchmark.
W2.
I think assuming that the reward decomposes is restrictive when agents compete or cooperate to achieve a goal. In such cases, the interaction should be modeled differently. I believe most multi-agent scenarios fall into this category, where reward decomposition is unnatural.
Questions
Q1.
- Unlike quantum entanglement, Markov entanglement allows the coefficients to be negative. What is the fundamental reason behind this difference? Could you elaborate on why the non-negativity constraint is not imposed in the definition of Markov entanglement?
Q2.
- I'm curious about the motivation for choosing the restless multi-armed bandit setting for the experiments. Why was this particular environment selected?
Limitations
yes
Justification for Final Rating
The paper presents a novel theoretical foundation, well supported by a range of experiments. The authors have addressed my questions appropriately, and I have also benefited from discussions with other reviewers. Overall, I believe the paper has sufficient academic merit to warrant a final rating of 5.
Formatting Issues
I did not find any significant formatting issues in the paper.
Thank you for the helpful review! In response to your questions:
More Evaluations
We have conducted additional experiments to investigate how the performance of a value decomposition based algorithm [7] varies with different levels of system entanglement. Please see our response to Reviewer H55K for complete experimental details and results.
Decomposition of Reward
We argue that many practical MARL settings fall within our modeling framework, especially in the field of operations research (OR). Typical examples include:
- Ride-hailing systems (e.g., [1][2]): the platform's overall revenue is the sum of each driver's revenue.
- Warehouse robot routing (e.g., [3]): the local reward is an indicator of whether the agent completes a task, and the global objective is the total number of tasks completed.
- Inventory control (e.g., [4]): the overall operational cost is the sum of each warehouse's cost.
- The well-established classes of weakly coupled MDPs (e.g., [5]) and RMABs (e.g., [6]) in the OR literature: these models have been widely applied in practice, including in resource allocation, revenue management, healthcare, and many other fields.
Furthermore, in Appendix J.2, we also provide a solution for scenarios where the global reward cannot be decomposed. We propose a concept called reward entanglement to characterize how well the global reward function can be approximated by a sum of local rewards. Building on this concept, we extend our analysis, demonstrating that the decomposition error can be bounded by considering both transition and reward entanglement.
Q1: Negative Coefficient in Markov Entanglement
In Appendix C we thoroughly discuss this difference between Markov entanglement and quantum entanglement. In short, we found a separable two-agent MDP that can only be represented by linear combinations, but not convex combinations, of independent subsystems. This result justifies the necessity of negative coefficients and highlights a structural difference between Markov entanglement and quantum entanglement. Specifically, we construct an explicit joint transition matrix (given in Appendix C) and verify that it cannot be represented by any convex combination of tensor products of local transition matrices.
Q2: RMAB as Evaluation Environment
We choose RMABs for the following reasons:
- RMABs form the foundation for several award-winning industrial applications in operations research, including ride-hailing systems (DiDi, Lyft) [1,2] and Kenya's mobile healthcare platform [8]. We therefore adopt RMABs as our evaluation environment to empirically investigate why value decomposition based methods prove effective in large-scale systems and to verify our theoretical justifications.
- RMABs involve two major challenges in MARL:
  - Scalability: as demonstrated in our experiments, an RMAB instance usually contains thousands of agents; for example, the ride-hailing system in NYC has thousands of active drivers.
  - Action constraints: agents' actions are coupled with one another. For example, if one driver takes an order, other drivers cannot fulfill the same order.
These structures make RMAB an active research area in OR and MARL.
[1] Qin, Z., et al. "Ride-hailing order dispatching at didi via reinforcement learning." INFORMS Journal on Applied Analytics 50.5 (2020): 272-286.
[2] Azagirre, Xabi, et al. "A better match for drivers and riders: Reinforcement learning at lyft." INFORMS Journal on Applied Analytics 54.1 (2024): 71-83.
[3] Chan, Shao-Hung, et al. "The league of robot runners competition: Goals, designs, and implementation." ICAPS 2024 System's Demonstration track. 2024.
[4] Alvo, Matias, Daniel Russo, and Yash Kanoria. "Neural inventory control in networks via hindsight differentiable policy optimization." arXiv preprint arXiv:2306.11246 (2023).
[5] Adelman, Daniel, and Adam J. Mersereau. "Relaxations of weakly coupled stochastic dynamic programs." Operations Research 56.3 (2008): 712-727.
[6] Whittle, Peter. "Restless bandits: Activity allocation in a changing world." Journal of applied probability 25.A (1988): 287-298.
[7] Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
[8] Baek, Jackie, et al. "Policy optimization for personalized interventions in behavioral health." Manufacturing & Service Operations Management 27.3 (2025): 770-788.
I appreciate the authors' detailed feedback. Their clarifications and additional experiments have resolved my concerns, so I am raising my rating to 5.
This paper investigates an equivalent condition for the linear decomposability of the value function in multi-agent MDPs. The authors introduce the concept of Markov entanglement, inspired by entanglement in quantum mechanics, to characterize a property of the dynamics of the multi-agent MDP. They prove that linear decomposability of the value function is equivalent to Markov disentanglement of the dynamics (in non-trivial cases). Beyond this result, they define a quantitative measure of Markov entanglement and relate it to the value decomposition error. The paper further shows that index policies exhibit low entanglement and thus enjoy a sublinearly scaling decomposition error. Finally, they show that the entanglement can be efficiently estimated, making it a useful tool for analyzing the linear decomposition error of the value function.
Strengths and Weaknesses
Strengths
- The paper is well-written and the presentation is clear.
- The paper provides a new perspective to understand linear decomposition of the value function in multi-agent MDPs.
- The theoretical results are comprehensive, and the paper also provides a method for estimating entanglement, making it applicable in practical settings.
Weaknesses
- It would be better if the authors devoted more discussion to the intuitive understanding of entanglement, and clarified how the proposed necessary and sufficient condition relates to other commonly used notions such as independence and Definition 3.1 in [1]. Discussion of the relationship between entanglement and coupling would also be beneficial.
- The paper does not provide an efficiency analysis of the entanglement estimation method.
- (Minor) Linear decomposition of the value function is a fundamental problem in multi-agent RL, yet it is not so effective in practice; most existing empirical works focus on monotonic decomposition [2]. It would be interesting to see an analysis of monotonic decomposition.
[1] Dou, Zehao, Jakub Grudzien Kuba and Yaodong Yang. “Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning.” ArXiv abs/2202.04868 (2022): n. pag.
[2] Rashid, Tabish, Mikayel Samvelyan, C. S. D. Witt, Gregory Farquhar, Jakob N. Foerster and Shimon Whiteson. “QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning.” ArXiv abs/1803.11485 (2018): n. pag.
Questions
See Weaknesses.
Limitations
yes
Justification for Final Rating
A technically solid paper. It proposes an interesting theory, bringing new insights into the understanding of value decomposition in the MARL field.
Formatting Issues
no
Thank you for the helpful review! In response to your questions:
Intuition of Entanglement
In the introduction, we highlight that Markov entanglement intuitively captures whether an MDP can be viewed as a linear combination of independent subsystems. To further clarify this concept, we provide an illustrative example in Section 3 (Line 191) comparing entanglement with coupling. Specifically, we construct an MDP that is coupled but not entangled, highlighting the fundamental distinction between the two notions.
As a supplement, our example relies on a special MDP where two agents share the same randomness. This idea also has a direct parallel in quantum physics. Shared randomness plays a crucial role in characterizing the fundamental distinction between classical correlation and quantum entanglement in the quantum world, e.g. the famous Einstein-Podolsky-Rosen (EPR) paradox. When it comes to the Markov world, it delineates the difference between coupling and Markov entanglement. We plan to incorporate this perspective more explicitly in the next version of the paper.
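As a quick numerical companion to this point, here is a toy construction of our own (simpler than, and not identical to, the example in Section 3): two agents driven by a single shared coin. The joint kernel is a convex mixture of product kernels, hence separable, yet the agents' transitions are clearly not independent.

```python
import numpy as np

# Two agents, each with local states {0, 1}. A single shared coin (heads w.p. q)
# drives both: on heads both apply kernel U, on tails both apply kernel D.
q = 0.3
U = np.array([[0.0, 1.0],   # "move up": from either state, go to state 1
              [0.0, 1.0]])
D = np.array([[1.0, 0.0],   # "move down": from either state, go to state 0
              [1.0, 0.0]])

# Joint 4x4 kernel over (s1, s2): a convex combination of tensor products,
# so the joint system is separable (a mixture of independent subsystems) by construction.
P_joint = q * np.kron(U, U) + (1 - q) * np.kron(D, D)

# Yet the agents are coupled: the joint next-state law is NOT the product of its marginals.
row = P_joint[0].reshape(2, 2)             # row[s1', s2'] = P(s1', s2' | s1=0, s2=0)
m1, m2 = row.sum(axis=1), row.sum(axis=0)  # each agent's marginal next-state law
print("joint:\n", row)
print("product of marginals:\n", np.outer(m1, m2))  # differs from `row` for any 0 < q < 1
```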
Relations to Independence and Definition 3.1 in [1]
Our notion of Markov entanglement strictly generalizes independence: by allowing MDPs to be expressed as linear combinations of independent subsystems, it covers coupled (non-independent) MDPs that are nevertheless separable. Definition 3.1 in [1] also assumes that the global reward is the summation of local rewards, and it further requires the transition to decompose into a sum of specific functions. While the authors show their definition is equivalent not only to exact value decomposition but also to exact decomposition under the Bellman operator, they acknowledge that MDPs satisfying their definition are "rare in practice." In particular, we find that Definition 3.1 in [1] cannot accommodate certain independent cases.
Efficiency Analysis of the Entanglement Estimation
We briefly present the efficiency results here, with a detailed analysis to be included in the next version of the paper. First, note that the estimation can be abstracted as an empirical convex optimization problem: the empirical objective averages L1-norm (total variation) terms over the sampled transitions, and its optimal value and optimizer are compared with those of the corresponding population objective. We summarize the results in the table below.
| | Efficiency |
|---|---|
| Asymptotic | $\sqrt{n}$-consistency with asymptotic normality (via the standard CLT and Slutsky's theorem) |
| Non-asymptotic | a high-probability generalization bound that decays in the sample size $n$ and scales with the size of the local state-action space |
We then briefly go through the proof sketch here
- The asymptotic efficiency analysis is based on empirical process theory. The estimation error decomposes into two parts: the fluctuation of the empirical objective at the population optimum, and the uniform deviation between the empirical and population objectives over the feasible set. The first part follows from the standard CLT. For the second part, one can verify that the covering number of the relevant function class satisfies polynomial decay, so standard uniform convergence analysis applies (e.g., Theorem 37 in [2]) and this term is of lower order. Finally, we derive the asymptotic result via Slutsky's theorem. For simplicity, the proof above assumes that transitions are sampled i.i.d. from the stationary distribution; when the data are drawn from a non-episodic Markov chain, standard Markov chain CLT results (e.g., [4]) can be applied to obtain analogous conclusions.
- The non-asymptotic analysis is based on generalization theory. A standard symmetrization argument shows that, with high probability, the estimation error is bounded by the empirical Rademacher complexity of the relevant function class plus a confidence term. The Rademacher complexity can in turn be bounded via the covering number, yielding the order reported in the table.
In summary, the efficiency of our entanglement estimation does not scale with the number of agents $N$. In contrast, estimating the decomposition error without Markov entanglement requires direct evaluation of global value functions, whose sample requirements scale with the size of the global state-action space and therefore grow exponentially in $N$ (e.g., see [3]). Thus the estimation of decomposition error via Markov entanglement is exponentially faster.
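To make the estimated quantity concrete, the sketch below implements a simplified plug-in proxy of our own: from sampled transitions it computes the visitation-weighted average total variation distance between the focal agent's next-local-state distribution conditioned on the global state-action pair and the one conditioned only on its local pair. The paper's actual estimator additionally optimizes over linear combinations of product kernels via a convex program, which this sketch omits.

```python
from collections import defaultdict

def estimate_entanglement_proxy(transitions):
    """transitions: iterable of (s_global, a_global, s_local, a_local, s_local_next).

    Returns the visitation-weighted average TV distance between the empirical laws
    P(s_i' | s, a) and P(s_i' | s_i, a_i) -- a simple plug-in proxy, not the paper's
    exact convex-program estimator.
    """
    counts_global = defaultdict(lambda: defaultdict(int))  # (s, a)     -> s_i' counts
    counts_local = defaultdict(lambda: defaultdict(int))   # (s_i, a_i) -> s_i' counts
    local_key_of = {}                                       # (s, a) -> (s_i, a_i)

    transitions = list(transitions)
    for s, a, s_i, a_i, s_i_next in transitions:
        counts_global[(s, a)][s_i_next] += 1
        counts_local[(s_i, a_i)][s_i_next] += 1
        local_key_of[(s, a)] = (s_i, a_i)

    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    n = len(transitions)
    avg_tv = 0.0
    for g_key, g_counts in counts_global.items():
        p = normalize(g_counts)
        q = normalize(counts_local[local_key_of[g_key]])
        support = set(p) | set(q)
        tv = 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
        avg_tv += (sum(g_counts.values()) / n) * tv   # weight by empirical visitation of (s, a)
    return avg_tv
```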
Beyond Linear Decomposition (Minor)
We agree that linear value decomposition is a fundamental problem in MARL. We also acknowledge the rich body of work on non-linear decomposition methods (such as monotonic decomposition) that offer richer representations and enjoy great empirical success. Our work on linear decomposition and Markov entanglement can serve as a foundational framework for future analysis of such algorithms.
[1] Dou, Zehao, et al. "Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning." arXiv 2022.
[2] Pollard, David. "Convergence of Stochastic Processes." Springer, 1984.
[3] Chen, Shuze, et al. "Experimenting on Markov Decision Processes with Local Treatments." arXiv 2024.
[4] Jones, Galin L. "On the Markov chain central limit theorem." arXiv 2004.
Thank you for your response. I have no further questions.
Overall, the reviewers were quite favorable towards this paper. They had some questions which the authors easily addressed. I hope that the authors will review their manuscript in light of these questions and make edits aimed towards minimizing any ambiguities that might confuse future readers. In that vein, I hope that the authors will be careful to avoid leaning too heavily on the quantum entanglement motivation (per discussion with reviewer bQ7W) unless they can explain how the type of entanglement here is more like quantum entanglement than other cases of variable entanglement that are clearly understandable without resorting to quantum metaphors.