PaperHub
Average rating: 5.8/10 (withdrawn) · 4 reviewers (min 3, max 8, std 1.8)
Individual ratings: 6, 3, 8, 6
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5
ICLR 2025

Quantum-Inspired Reinforcement Learning in the Presence of Epistemic Ambivalence

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-01-23

Abstract

Keywords
Reinforcement learning · decision-making · uncertainty · epistemic ambivalence

Reviews and Discussion

Review
Rating: 6

This paper presents a theoretical framework that integrates quantum mechanics with reinforcement learning, exploring the interaction between epistemic ambivalence (EA) and quantum states. The authors introduce a novel reward operator that considers separated quantum outcomes, facilitating the calculation of expectation values under a bijective mapping between quantum states and environmental rewards. The analysis includes the optimal value functions for two-site and many-site systems, highlighting the effects of complex probability amplitudes on the reward structure. Experimental results demonstrate significant variations in the optimal policies as parameters change, particularly at phase-transition points. This work contributes to a deeper understanding of how quantum features can enhance decision-making processes in reinforcement learning environments.
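For intuition, the expectation step described above can be illustrated with a minimal sketch (illustrative only; the function and variable names are ours, not the paper's): given a normalized state over a finite outcome basis and a reward assigned to each outcome, the expected reward is the Born-rule-weighted average of the outcome rewards.

```python
import numpy as np

# Minimal illustrative sketch (names are ours, not the paper's): for a
# normalized state written in a finite outcome basis and a scalar reward
# assigned to each outcome, the expected reward is the Born-rule-weighted
# average  E[r] = sum_o |<o|psi>|^2 * r(o).

def expected_reward(amplitudes: np.ndarray, outcome_rewards: np.ndarray) -> float:
    """Expectation of a diagonal reward operator under the state |psi>."""
    probs = np.abs(amplitudes) ** 2              # Born rule: p(o) = |<o|psi>|^2
    assert np.isclose(probs.sum(), 1.0), "state must be normalized"
    return float(probs @ outcome_rewards)

# Two outcomes in equal superposition with rewards +1 and -1.
psi = np.array([1.0, 1.0j]) / np.sqrt(2)
print(expected_reward(psi, np.array([1.0, -1.0])))   # -> 0.0
```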

Strengths

The paper effectively bridges quantum mechanics and reinforcement learning, offering fresh perspectives on reward calculations and decision-making processes. The mathematical formulation is rigorous, with well-defined operators and expectation values that enhance the theoretical foundation of the study.

Weaknesses

The paper could benefit from more detailed discussions on the practical implementation of the proposed framework in real-world scenarios.

Questions

How do the findings of this framework extend to more complex or varied environments beyond those studied in the paper?

What computational challenges might arise when implementing this quantum reinforcement learning model in real-world scenarios?

How does the performance of this quantum-based approach compare to traditional reinforcement learning methods in terms of efficiency and accuracy?

Comment

We would like to thank you for the valuable feedback. Below, we respond to your questions in the same order. We sincerely hope that you will find the responses and revisions satisfactory.

Response to Q1: Our proposed EA-MDP framework can be extended in several ways. For instance, as mentioned in the paper, one could consider using a time-dependent quantum state (relaxing Assumption 1) or a time-dependent outcome set. Another possible direction for future research could involve combining EA-MDP with PO-MDP or using non-stationary rewards. Based on your comment, we will add these possible extensions as future research directions to the paper.

Response to Q2: The main challenge in this context is to convert classical data into quantum data, particularly in the form of superposition. For example, consider multiple pieces of evidence that need to be encoded together with the corresponding probability amplitudes. In recent years, the conversion of classical data into quantum data (quantum states) has gained significant attention (Biamonte et al., 2017). For example, mapping classical pixel values to quantum states has been explored in various applications (Yan and Venegas-Andraca, 2020). In EA-MDP, there are more challenges than in typical quantum machine learning. In addition to encoding classical data into quantum data with the corresponding probability amplitudes, a set of outcome representations is needed. This outcome set is essential for calculating the rewards, as it determines the measurement and the resulting reward; selecting a suitable outcome set is therefore crucial for accurate reward calculations. In addition, in real-world experiments with a non-separable outcome set, the number and dimension of outcomes may increase significantly. If n is the number of underlying states and m is the number of EA bases, the memory needed to store the outcomes scales as O((mn)^2), while the memory needed to store separated quantum states such as Eq. (3) scales as O(mn). Similarly, the computational time required to calculate the reward scales as O((mn)^2). Based on your comment, we will add these challenges to the paper.
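As an illustration of the encoding challenge discussed above, here is a minimal sketch of mapping classical evidence weights to probability amplitudes (the weighting scheme and all names are assumptions for illustration, not the paper's method):

```python
import numpy as np

# Illustrative sketch only (not the paper's method): encode m conflicting
# pieces of classical evidence, with nonnegative weights, as the probability
# amplitudes of an m-dimensional superposition state. The square roots of the
# normalized weights give amplitudes whose squared magnitudes recover the
# original proportions.

def encode_evidence(weights):
    w = np.asarray(weights, dtype=float)
    probs = w / w.sum()                 # normalize weights to probabilities
    return np.sqrt(probs)               # real, nonnegative amplitudes

psi_ea = encode_evidence([3.0, 1.0])    # e.g. evidence for "buy" vs "sell", 3:1
print(psi_ea)                           # [0.866, 0.5]
print(np.abs(psi_ea) ** 2)              # recovers [0.75, 0.25]
```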

Response to Q3: Traditional RL methods are designed to perform efficiently in simpler environments, i.e., MDPs with less complex dynamics than EA-MDPs. Therefore, RL agents trained with traditional RL algorithms struggle in EA-MDP-based environments when dealing with contradictory evidence. Compared to EA-MDP, traditional RL frameworks do not incorporate epistemic ambivalence into their model and solution. As a result, agents trained with traditional RL algorithms perform poorly in our setting and fail to converge. In contrast, the EA-MDP framework handles EA effectively by using quantum superposition and probability amplitudes. In turn, our proposed algorithm, which is designed specifically for EA-MDP-based environments, helps agents make informed decisions in the presence of epistemic ambivalence.

Comment

Dear reviewer eEvs,

We appreciate the time and effort you devoted to reviewing our paper.

We hope that our replies have adequately addressed your concerns and clarified the issues you raised.

If you have any remaining questions or require clarification, please let us know. We will be happy to provide further information and discuss them.

Thank you for your time and for contributing to the review process.

Best Regards,

Authors

Review
Rating: 3

The authors propose a novel sequential decision-making framework that incorporates the concept of epistemic ambivalence. In particular, they propose an MDP-based framework that leverages concepts from quantum mechanics to quantify the epistemic ambivalence, primarily via the state-space and reward function. They show that this MDP framework can be used to find an optimal policy, and present two empirical experiments in support of this.

Strengths

This paper tackles an interesting problem of decision-making under epistemic ambivalence. Most of the MDP-related literature only considers epistemic and/or aleatoric uncertainty, so considering this kind of uncertainty is refreshing. The paper incorporates several non-trivial aspects of quantum mechanics into an MDP framework which is impressive. The paper is generally well-written, and the technical concepts are explained in a relatively clear manner.

Weaknesses

My biggest concern with this paper is that the experiments, while interesting, fail to support the overall narrative of the paper. In particular, the narrative of the paper suggests that the EA-MDP leverages quantum mechanics to optimize decision-making under epistemic ambivalence. This implies that one should be able to take a given MDP-based system that has epistemic ambivalence, convert it to an EA-MDP, and then use the proposed quantum-inspired methods to find an optimal policy. However, the experiments presented in the paper start with an EA-MDP that is already formulated and then proceed to solve it, thereby missing the crucial step of showing how this EA-MDP can be derived from a regular MDP. This gives the impression that the primary use-case for the EA-MDP is to model quantum phenomena. Of course, modeling quantum phenomena using MDPs is in itself a worthwhile and an interesting topic, however if this is indeed the intended use-case for the EA-MDP, then the text should make this more clear. Conversely, if the EA-MDP can indeed be used more generally to optimize decision-making under epistemic ambivalence, then a more intuitive and complete set of experiments is required, where the experiments show how one can turn a regular MDP to an EA-MDP. The authors used a good analogy related to buying or selling stock in the presence of conflicting evidence at the start of the paper when explaining epistemic ambivalence. I suggest that the authors use more compelling and intuitive experiments that are similar to this analogy.

Moreover, the paper could (potentially) benefit from a comparison between epistemic ambivalence and entropy, given that these concepts appear to be similar in nature. Moreover, and especially given that there are lots of existing MDP-based methods that utilize entropy, it would be useful if the authors could compare their method to methods that use entropy (or some sort of other baseline) to highlight the usefulness and advantages of their methods.

Finally, it may be more beneficial to the reader, in the context of ICLR, if most of the equations in Section 6 of the paper were moved to the appendix, and the resulting extra space in the main body were used instead to better explain the experiments and/or to include more helpful figures that better convey the intuition behind the experiments.

Overall, this paper feels stuck somewhere in the middle between a physics paper and a ML paper. Both directions can be good, however the authors need to lean into one direction or the other. A good litmus test that the paper currently fails, is that after reading the paper I should be able to know under which conditions the EA-MDP is a good tool to use. However, in its current state, the paper is unable to successfully answer this question.

As such, I am open to increasing my score, if 1) the authors can convince me that the experiments in the current draft are adequate, or 2) the authors improve the existing experiments in such a way that they address my concerns, or 3) the authors can provide additional experiments that address my concerns.

Questions

  1. What are the implications of the \approxeq in Equation 12?

  2. From an MDP perspective, it appears odd that in the two-site experiment, there is only a single action in the action-space. If there is only a single action, then is there no decision to be made? Could the authors please explain how and why only having a single action makes sense? (see lines 368-369). Is it perhaps that there should be two actions: {move to the other site, remain in the current site}?

Comment

We would like to thank you for the valuable feedback. Below, we respond to your questions in the same order. We sincerely hope that you will find the responses and revisions satisfactory.

  • Response to weaknesses:

    • We are currently running additional experiments based on your comment and will add them to the paper as we obtain results.

    • We are uncertain which definition of entropy the reviewer has in mind, as entropy can be used to measure uncertainty in various components of our problem formulation. For example, in quantum mechanics, the von Neumann entropy is defined as S(ρ) = -Tr(ρ log ρ) (Nielsen and Chuang, 2010). Since our proposed EA-MDP formulation uses a pure quantum state, the density matrix is ρ = |s̃⟩⟨s̃|, and the von Neumann entropy is therefore always zero; pure states have no mixedness (see the sketch after this list). As another example, in policy optimization, a state entropy can be calculated from the current policy, i.e., the probability distribution over actions, to measure the uncertainty of decision-making in certain states. We hope that the reviewer can provide some references to make the discussion on entropy more explicit. Regardless of the specific entropy definition, we will calculate the policy entropy in our experiments and add the results to the paper based on your comment.

    • Based on your comment, we will move most of the equations in Section 6 to the Appendix to have more space to explain the experiments and the intuition behind them further.

  • Response to Q1: By comparing (8) and (11), we conjecture that (left side) = (right side) in (12). To make it more clear, we will replace \cong with equality sign (=).

  • Response to Q2: In the first experiment, we used a single action to make the state-transition probability and the policy deterministic. This allowed us to focus on analytically calculating the value function from the computed rewards, which in turn helps clarify the EA formulation and validate the theory. Alternatively, we could consider a more general setting with two actions at each site, one that moves to the other site and one that remains at the same site, and assume that the state-transition probability of remaining at the same site is zero. This alternative would be equivalent to our current setting; we defined a single action simply to avoid confusion and keep the setting simple.
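Regarding the entropy discussion above, a small numerical check (illustrative only; not code from the paper) that the von Neumann entropy of a pure state such as |s̃⟩⟨s̃| vanishes, while a mixed state of the same dimension does not:

```python
import numpy as np

# Illustrative check (not from the paper): the von Neumann entropy
# S(rho) = -Tr(rho log rho) of a pure state |psi><psi| is zero, while the
# maximally mixed state of the same dimension has entropy log(d).

def von_neumann_entropy(rho: np.ndarray) -> float:
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]          # drop zeros (0*log 0 = 0)
    return float(-np.sum(eigvals * np.log(eigvals)))

psi = np.array([1.0, 1.0j]) / np.sqrt(2)        # a pure EA superposition
rho_pure = np.outer(psi, psi.conj())            # |psi><psi|
rho_mixed = np.eye(2) / 2                       # maximally mixed state

print(von_neumann_entropy(rho_pure))            # ~0.0
print(von_neumann_entropy(rho_mixed))           # log(2) ~ 0.693
```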

Comment

I thank the authors for their response and clarifications. The entropy used in max entropy RL is that of the policy (formally, it is the conditional action entropy - see the equation at the bottom of page 2 of [1]). Note that it may be the case that it doesn't make sense to compare EA to entropy. This is fine as long as it is explained in the paper. If they can be compared, however, then there are a lot of baselines that use max entropy RL that the authors can compare their methods with (for example, soft Q-learning, soft actor-critic, etc.). Note that a simple toy experiment, with a comparison to a single baseline, is sufficient for me to reconsider my score. I hope this has clarified the confusion regarding the entropy.

I look forward to seeing the updated draft once it gets uploaded.

[1] Eysenbach, Benjamin, and Sergey Levine. "If MaxEnt RL is the answer, what is the question?." arXiv preprint arXiv:1910.01913 (2019)

Comment

I thank the authors for the updated draft of their paper. The new experiment that compares the EA-MDP to max entropy RL has increased my confidence that the proposed method has merit, and has significantly strengthened the paper's claims. However, I have chosen to keep my original score for the following reasons:

  • Above all else, I still feel that this paper can significantly benefit from another round of reviews. In particular, it is still not clear to me how I can take a given decision-making problem and convert it to an EA-MDP. For example, how would I select my bases and their amplitudes? Is it an arbitrary choice, or is there logic involved? Can any decision-making problem be converted to an EA-MDP? These are the kinds of questions that the current draft of the paper does not answer (note that this concern was mentioned in my original review). It appears that the authors have all the pieces in place for a compelling paper, but they need to further refine the paper's narrative to communicate the results in an adequate manner.
  • Similarly, while the comparison to entropy is encouraging, the authors do not provide any commentary on whether it makes sense to compare the EA-MDP to entropy to begin with, or how the comparison highlights the strengths of their approach. Ideally, the comparison should highlight a weakness in max entropy methods (or other baselines) and show (with an accompanying discussion) how and why the EA-MDP overcomes such weaknesses. To tie this back to my first point, one potential narrative device that the authors could explore is to first present an experiment that highlights a weakness of max entropy (or other baseline), then use it to motivate the EA-MDP, show how one can convert the max entropy MDP to an EA-MDP, then show the comparison. All in all, the authors need to better communicate how and where their methods fit in relative to similar RL methods.

I thank the authors for their commitment and effort, and I hope that these suggestions will help improve the paper moving forward.

Comment

Thank you for your feedback on our updated draft. Below, we respond to your comments in the same order.

Comment: It is still not clear to me how I can take a given decision-making problem and convert it to an EA-MDP. For example, how would I select my bases and their amplitudes? Is it an arbitrary choice, or is there logic involved?

Response: When converting classical data to quantum data, the choice of bases and probability amplitudes is not arbitrary; in general, the conversion process involves several challenges, including normalization, efficient mapping, resource overhead, and maintaining the fidelity and correlation of the data [1,2,3,4]. In recent years, owing to the growth of quantum computing capacity, this conversion has gained significant attention, e.g., mapping classical pixel values to quantum states [5,6]. We would like to point out that such conversion is an entirely different and active field of research, which is out of the scope of this paper, and investigating it here is not possible due to time limitations. However, we have already included explanations regarding this in the paper at the reviewers' request (we also mentioned it in the paper revision summary posted after uploading the revised draft). For example, in lines 1117-1124, we mention the challenges of implementing our framework in real-world problems, including the conversion of classical data to quantum data, such as choosing probability amplitudes and suitable outcomes.

We would like to bring to your attention that one of the main purposes of our research article is to introduce EA to the field of artificial intelligence, in particular reinforcement learning. In fact, our paper takes the first step in this regard, and to the best of our knowledge, there are no previous works that consider EA in machine learning problems. Hence, our work lays the foundation for a new research area related to EA-MDP by demonstrating its potential for handling epistemic ambivalence.

[1] Ranga, Deepak, et al. "Quantum Machine Learning: Exploring the Role of Data Encoding Techniques, Challenges, and Future Directions." Mathematics 12.21 (2024): 3318.

[2] Pande, Mandaar B. "A Comprehensive Review of Data Encoding Techniques for Quantum Machine Learning Problems." 2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE). IEEE, 2024.

[3] Lloyd, Seth, et al. "Quantum embeddings for machine learning." arXiv preprint arXiv:2001.03622 (2020).

[4] Vorwerk, Christian, et al. "Quantum embedding theories to simulate condensed systems on quantum computers." Nature Computational Science 2.7 (2022): 424-432.

[5] Yan, Fei, and Salvador E. Venegas-Andraca. Quantum image processing. Springer Nature, 2020.

[6] Ruan, Yue, Xiling Xue, and Yuanxia Shen. "Quantum image processing: opportunities and challenges." Mathematical Problems in Engineering 2021.1 (2021): 6671613.

Comment

Comment: Can any decision-making problem be converted to an EA-MDP?

Response: We would like to clarify that, in a conventional MDP, there is no EA element; hence, it lacks EA bases. Consequently, Equation (3), which is the core definition underlying the subsequent EA-MDP formulation, cannot be defined in an MDP. Therefore, unless there is conflicting evidence in the environment, it is not feasible to formulate an EA-MDP.

Comment: The authors do not provide any commentary on whether it makes sense to compare the EA-MDP to entropy to begin with, or how the comparison highlights the strengths of their approach. Ideally, the comparison should highlight a weakness in max entropy methods (or other baselines) and show (with an accompanying discussion) how and why the EA-MDP overcomes such weaknesses.

Response: In lines 331-337, we highlighted the main differences between EA-MDPs and conventional MDPs and explained why we expect our model and solution to allow the agent to tackle the EA uncertainty, which is due to the fact that the EA-QL algorithm is EA-aware and hence computes an average reward over outcomes. Further, in lines 459-464, we validated this and noted the weakness of the max entropy method S-QL (and, in general, of all traditional RL-based methods) compared to our proposed algorithm, which results from being EA-agnostic. In addition, we explained our remedy for comparing the two algorithms fairly: we make sure that the environment provides the same information to both the EA-QL and S-QL agents, meaning that the environment provides all possible outcome rewards, and the S-QL agent receives the average of the outcome rewards, with the same weights, at the locations where EA exists. Moreover, in lines 1106-1109, we mentioned that, to calculate the conditional action entropy of the EA-QL agent, we additionally apply the softmax function to the Q-values of the actions in each state.

The results presented in Fig. 3 and the accompanying explanations in lines 465-485 clearly highlight the strengths of our approach: EA-MDP achieves a higher average gain than S-QL while maintaining a lower conditional action entropy. The entropy is simply an indication of how deterministic our agent is in its decision-making, and the results agree with the design of our algorithm. A higher conditional action entropy means that the agent is more adaptable to future changes in the environment, as it explores the environment further. In the case of the EA-QL algorithm, we can control the exploration by choosing an appropriate ε parameter. We communicated all of the above points in the paper revision summary that we posted after uploading the revised draft.
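To make the entropy diagnostic above concrete, here is a minimal sketch (the variable names and uniform state weighting are our assumptions, not the paper's implementation) of computing the conditional action entropy from a Q-table via a softmax policy:

```python
import numpy as np

# Illustrative sketch (assumed names, not the paper's code): derive a softmax
# policy from per-state Q-values and compute the conditional action entropy
# H(a|s) = -sum_s d(s) sum_a pi(a|s) log pi(a|s), here with a uniform state
# distribution d(s) for simplicity.

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_action_entropy(q_table: np.ndarray) -> float:
    pi = softmax(q_table, axis=1)                    # pi(a|s) for each state
    per_state = -np.sum(pi * np.log(pi), axis=1)     # entropy of pi(.|s)
    return float(per_state.mean())                   # uniform average over states

q = np.array([[1.0, 0.0], [0.5, 0.5], [2.0, -1.0]])  # 3 states, 2 actions
print(conditional_action_entropy(q))
```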

Comment: To tie this back to my first point, one potential narrative device that the authors could explore is to first present an experiment that highlights a weakness of max entropy (or other baseline), then use it to motivate the EA-MDP, show how one can convert the max entropy MDP to an EA-MDP, then show the comparison.

Response: This concerns the flow of the paper. As EA is a new concept in artificial intelligence and requires precise terminology and rigorous formulations, we preferred to first introduce EA, provide the necessary background, and formulate the problem rigorously before moving to the experiments. This way, readers can understand the main idea first, see the actual results, and compare the experimental results with the theoretical findings. However, rewriting the paper just to change the flow is not possible due to time limitations, and we also avoid it in order to remain consistent with ICLR's guidelines on the discussion period. Nevertheless, we thank the reviewer and will take this point into account in our future work.

Comment

I thank the authors for their most recent comments. I recognize that this paper combines two separate fields that are complex in their own right, and that combining technical concepts from both fields makes this paper more challenging to write compared to other papers. However, this is no excuse to present a paper that does not clearly communicate its intent. As mentioned in my original review, the narrative of the paper suggests that one should be able to take a given MDP-based system that has epistemic ambivalence, convert it to an EA-MDP, and then use the proposed quantum-inspired methods to find an optimal policy. However, as per the authors' most recent comments, there are several conditions that need to be met in order to use the EA-MDP, such as having to convert the data to quantum data. This is something that is not clearly communicated in the most recent draft (along with the other points mentioned in the authors' most recent response).

For clarity: having to satisfy the aforementioned conditions to use the EA-MDP is not a weakness in my view. What is a weakness however, and the primary rationale for my score, is that the authors have not effectively communicated such conditions in the paper, thereby giving the impression that the EA-MDP can be applied in a more universal manner than it actually can be. As stated in my original review: A good litmus test that the paper currently fails, is that after reading the paper I should be able to know under which conditions the EA-MDP is a good tool to use. However, in its current state, the paper is unable to successfully answer this question.

Finally, regarding: re-writing the paper to just change the flow is not possible due to time limitations, and also, we avoid it to be consistent with ICLR's guidelines related to discussion period, I kindly remind the authors that this was a suggestion not a request, and in either case, the comment was posted before the extension to the discussion period (along with updated guidance for reviewers) was announced.

Comment

We thank the reviewer for the recent comments. In the following, we will clarify the intuition behind our proposed method and its general workflow based on the reviewer's concerns.

Comment: The narrative of the paper suggests that one should be able to take a given MDP-based system that has epistemic ambivalence, convert it to an EA-MDP, and then use the proposed quantum-inspired methods to find an optimal policy.

Response: We would like to clarify that such a conversion of an MDP-based system that has epistemic ambivalence into an EA-MDP is not the intent of our proposed method. As the reviewer correctly points out, this misunderstanding could arise because one might distinguish between an MDP-based system that has epistemic ambivalence and an EA-MDP. However, these two concepts are the same, and we do not distinguish between them in our paper. In fact, to motivate the idea of EA-MDP and its connection to the MDP, our paper starts by defining the MDP, then combines it with EA and calls the result an EA-MDP. Hence, to emphasize, an MDP with EA is the same as an EA-MDP. Nevertheless, based on your comment, we will make these points clearer in the final version, provided that we can obtain the editor's agreement.

Comment: As per the authors' most recent comments, there are several conditions that need to be met in order to use the EA-MDP, such as having to convert the data to quantum data. This is something that is not clearly communicated in the most recent draft.

Response: We would like to bring to your attention that we already included these points based on reviewers' comments in the revised paper and highlighted them as limitations and challenges of implementing EA-MDP in practice, including the necessity of converting the classical data to quantum data (please see lines 1117-1129).

If you have any remaining questions or require clarification, please let us know. We will be pleased to provide more information and discuss it further.

Review
Rating: 8

In this paper, the authors develop the EA-MDP, a variant of the classic MDP that can capture a different form of uncertainty called epistemic ambivalence (EA). In an EA state, the agent can coexist in multiple configurations at the same time. The authors demonstrate that the fundamental results of MDP planning, such as the contractiveness of the Bellman operator, also hold in this setting, with some modifications. The most notable difference is that rewards depend on the outcome of each quantum state (which can be interpreted as a measurement), and therefore rewards must be computed through an additional expectation. The authors also demonstrate the result of planning an optimal policy for EA-MDPs in two small discrete environments, over a range of environment parameters.
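As a rough illustration of the planning structure summarized above, here is a tabular value-iteration sketch in which the immediate reward is itself an expectation over measurement outcomes (a sketch under our own assumptions; it is not the paper's EA-QL algorithm, and all names and numbers are illustrative):

```python
import numpy as np

# Rough sketch (our assumptions, not the paper's algorithm): tabular value
# iteration where the per-(state, action) reward is first averaged over
# measurement outcomes, r(s, a) = sum_o p(o | s, a) * r_o(s, a), and the
# Bellman backup then proceeds as in a standard MDP.

def ea_value_iteration(P, outcome_probs, outcome_rewards, gamma=0.9, tol=1e-8):
    """P: (S, A, S) transitions; outcome_probs/outcome_rewards: (S, A, O)."""
    expected_r = np.sum(outcome_probs * outcome_rewards, axis=2)   # (S, A)
    V = np.zeros(P.shape[0])
    while True:
        Q = expected_r + gamma * (P @ V)                           # (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny 2-state, 1-action, 2-outcome example (deterministic swap between sites).
P = np.array([[[0.0, 1.0]], [[1.0, 0.0]]])
probs = np.array([[[0.75, 0.25]], [[0.5, 0.5]]])
rewards = np.array([[[1.0, -1.0]], [[0.0, 2.0]]])
V, policy = ea_value_iteration(P, probs, rewards)
print(V, policy)
```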

Strengths

The definition of the EA-MDP appears to be an appropriate model for epistemic ambivalence. This fulfills the main objective of the paper. Moreover, the model seems to be consistent and well designed. The theoretical results in Section 5 prove the minimal basic properties that allow the classical planning tools to be extended to EA-MDPs with small modifications. I could not find mistakes in the proofs of these results.

The experimental demonstrations and the figures help the reader understand the basic principles of how the model works.

The topic and the ideas seem to be novel. However, I am not able to confidently relate this paper to prior uses of quantum information in ML.

Weaknesses

  1. The paper tries to preserve the great generality allowed by quantum states, basis states, and outcomes. However, this complicates the exposition for an audience that is not familiar with the formalisms used in quantum mechanics. Some concepts are still not entirely clear to me:
    • What is the need for treating outcomes as distinct entities from the underlying states? In my understanding, outcomes may be restricted to the set of discrete locations of the system. Clearly, this would be a limitation with respect to more general bases such as Eq. (22); however, this seems to be sufficient for what is needed in RL.
    • What is the role of s_t in Equation (3)? I would expect ψ_{s_t} to be a complete description of the environment state by itself. The fact that the state space contains the "underlying state of the system" seems to contradict the intuition that the quantum state φ_{s_t} describes a mixture of basis states.

An example early on in the paper would greatly help in clarifying most concepts, even before the experimental section.

  2. It should be pointed out that the EA state is observed by the agent. This is analogous to what happens in standard, fully observable MDPs. However, the rich information provided by the EA state allows the agent to fully compute the superposition of states. In turn, this allows planning in EA-MDPs with techniques analogous to those for MDPs, by means of an additional expectation over outcomes. While convenience is surely an advantage, this also means that the proposed framework cannot be used by an agent that encounters conflicting observations, because the environment has to return the explicit quantum state, which itself encodes the EA uncertainty. This discrepancy slightly contradicts the main motivation of this work, which is the development of RL algorithms in the presence of conflicting evidence.

Questions

The authors may answer the doubts listed in Weaknesses, point 1.

Comment

We would like to thank you for the valuable feedback and positive evaluation of our work. Below, we respond to your questions in the same order. We sincerely hope that you will find the responses and revisions satisfactory.

Response to the first part of Q1: In our formulation, to preserve generality, we propose a general formulation for EA-MDP and consider the general form of an outcome. However, in Section D of the supplementary material, we propose a simplified version of our formulation which is based on separated outcomes and define the reward based on the EA part of the outcome. Consequently, as correctly pointed out by the reviewer, in our experiments, we work with this simplified version and consider only the EA part of outcome. This simplified version is sufficient for defining the outcomes based on the quantum states introduced in Eq. (3). Hence, we worked with the separated outcomes in our experiments in order to execute the proposed EA-MDP in practice while keeping the setting simple for the non-technical readers. However, in a more complex experiment, one may consider non-separable outcome sets.

Response to the second part of Q1: In Eq. (3), we consider two subsystems of a given quantum system: one for the underlying states and another for the EA quantum state. Each |ψ⟩ is a mixture of the EA basis states {|j⟩}. Since |ψ⟩ is a mixture of basis states, |s⟩ ⊗ |ψ⟩ is also a mixture of basis states (as in Eq. (21)). Generally speaking, the underlying state can serve as a classical description of certain aspects of the environment, such as restricting the agent to occupy only one location at each time step, which we considered in our experiments. This representation is crucial for integrating classical and quantum aspects within the framework of EA-MDPs. Based on your comment, we will revise the text to clarify this point and describe the formulation more clearly. We will also add an example to clarify most concepts further.
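A small numerical illustration of this composite-state construction (toy numbers of our own, not values from the paper):

```python
import numpy as np

# Toy illustration (our numbers, not the paper's): the underlying state |s>
# is a classical basis vector (the agent occupies one site), the EA state
# |psi> is a superposition over the EA bases, and the composite state is
# their tensor (Kronecker) product |s> (x) |psi>.

s = np.array([1.0, 0.0])                           # underlying state: site 0 of 2
psi = np.array([np.sqrt(0.75), np.sqrt(0.25)])     # EA superposition over 2 bases

s_tilde = np.kron(s, psi)                          # composite state in C^(n*m)
print(s_tilde)                                     # [0.866, 0.5, 0.0, 0.0]
print(np.isclose(np.linalg.norm(s_tilde), 1.0))    # normalization is preserved
```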

Response to Q2: We would like to highlight that our model and solution allow the agent to tackle the EA uncertainty, which is encoded via probability amplitudes and the superposition of quantum states, by estimating the rewards using an extra expectation over the possible outcomes. In addition, in terms of quantum-state observation, we agree with the reviewer that our setting is similar to fully observable MDPs. However, from the perspective of the underlying states, our model is not similar to conventional fully observable MDPs, as the agent does not observe the underlying states directly. Moreover, our model differs from conventional MDPs with respect to the reward and value-function definitions, as there is no outcome set in conventional MDPs; in the EA-MDP, there is an extra expectation over outcomes, which results from having probability amplitudes and an outcome set. This outcome set helps to assign rewards to the multiple configurations of conflicting evidence. Based on your comment, we will add this to the paper to clarify it further for the readers.

Comment

I thank the authors for the comments and apologize for my late reply.

Q2. It is clear to me that EA-MDPs are fundamentally different from classic MDPs. The similarity in the solution techniques is an advantage, not a weakness. The only weakness I can identify is that the EA must be observable and fully known to the agent (this is a weakness of the current paper and algorithm, not the proposed setting itself). This seems to be confirmed by the authors' response. I believe that the authors should explicitly list this as a limitation of the current work, if they did not already.

Q1. So, if I understand the role of "underlying states" correctly, these are state features in which we do not need to express any epistemic uncertainty. They may equivalently be expressed by suitably extending the basis states, but are kept separate just for convenience. Is this intuition correct? In this case, the adjective "underlying" may generate confusion for readers, as they may interpret it as "the true state", invalidating the superposition allowed by EA. I suggest changing it to something else.

Despite the weakness in Q2, I believe this paper proposes an interesting framework that is analyzed in sufficient detail. For this reason, I prefer to keep my current score.

Comment

Response to Q2: As correctly pointed out by the reviewer, we assumed that EA is fully observable in our experiments. Based on your comment, we will highlight that EA is observable and fully known to the agent in our experiments in the final version, provided that we can obtain the editor's agreement.

Response to Q1: Alternatively, it is possible to extend the basis states so that the underlying states are expressed as a mixture of these extended bases. However, additional conditions would be needed to restrict the quantum state to having only one underlying state at each time step and to assign suitable probability amplitudes to the main EA bases and the new bases corresponding to the underlying states. Using two interacting subsystems is more convenient than extending the bases, especially given the mentioned restriction on the underlying state, which provides the classical description of the model. Regarding the naming, to avoid confusion for readers as suggested by the reviewer, we will rename it in the final version, provided that we can obtain the editor's agreement.

Review
Rating: 6

The paper proposes a framework to include quantum states in MDPs, with the aim of representing “epistemic ambivalence (EA)”. EA is a concept borrowed from the philosophical literature and represents situations in which a person believes in two (or more) contradicting statements at the same time. The proposed EA-MDP is based on augmenting the traditional state space with a Hilbert space H_EA that reflects EA and is a quantum state. Theoretical results are developed, showing that in an EA-MDP, an optimal value function and an optimal policy exist.

Strengths

This is an innovative work that develops an entirely novel framework. The paper is well written and also includes some analysis.

Weaknesses

I was left puzzled by what EA might actually represent in the context of sequential decision processes in practice. I tried to understand by looking at some of the literature on EA, but most of it was on an abstract philosophical level that did not help too much. Maybe it is too much to ask of this paper, but given that the audience at ICLR does not have a background in philosophy, it would be very helpful to give concrete examples. The experimental setups remain at an abstract level and the real-world equivalent to the setup is not clear. This makes it unclear how the concept can actually be useful to solve real-world decision processes.

Questions

  1. I looked at the literature on EA cited by the authors [Amaya 2021, Williamsom 2021, Lam 2013]. These sources present ambivalence as a concept distinct from uncertainty, whereas the authors describe it as a form of uncertainty. This should be clarified.
  2. What are real-life examples of EA for which a sequential decision process could be formulated? Is it possible to motivate the experiments with such examples?
  3. In which way is EA-MDP expected to outperform partially observable MDPs in practice?
Comment

We would like to thank you for the valuable feedback. Below, we respond to your questions in the same order. We sincerely hope that you find the responses and revisions satisfactory.

Response to Q1: We agree with the reviewer's comment regarding the distinction between ambivalence and uncertainty. That is why, in lines 61-68 of the paper, we first try to distinguish ambivalence from uncertainty. However, EA can be considered a type of uncertainty, since it represents an agent's uncertainty that arises due to contradictory experiences (please see lines 67-69 of the paper). While traditional uncertainty represents imperfect knowledge, EA represents a cognitive state in which conflicting pieces of evidence coexist, resulting in an uncertain decision-making process. We will revise the text and make this clearer based on your comment.

Response to Q2: We included one real-world example in the paper related to stock trading. Another real-world example would be an autonomous robot navigation problem in which a robot receives conflicting data from multiple sensors: some sensors indicate a clear path ahead, while others detect an obstacle that may be either near or far away. Given this contradictory evidence, the robot is in a state of epistemic ambivalence in which it must decide whether to slow down, stop, or maintain its current speed. To simulate a similar scenario, in experiment 2 of Section 6 (Experimental Results and Analysis), we considered a multi-site system, which mimics the MAZE problem. The goal is to navigate the optimal route to reach the terminal state. In this experiment, EA is applied at multiple sites, and the agent must decide on its next movement at each site.

Response to Q3: Our model is fundamentally different from the PO-MDP, as in our model the reward is generated based on a probability distribution over outcomes. First, while the PO-MDP concentrates on situations with partial knowledge, the EA-MDP employs quantum-inspired approaches for expressing and managing the EA caused by contradicting evidence. Second, the final state s̃ is the Kronecker product of s and ψ, so we cannot separate these two components in the quantum state s̃ without extra measurements. Third, in our approach, as shown in the proof of Theorem 1, the environment needs to estimate the probability of the outcomes and, consequently, the expected reward (8), in order to compute the expectation value of the reward. An environment in a PO-MDP is not aware of such probabilities (i.e., is not aware of EA); hence, PO-MDP methods fail in our setting.

Comment

Dear reviewer KWqS,

We appreciate the time and effort you devoted to reviewing our paper.

We hope that our replies have adequately addressed your concerns and clarified the issues you raised.

If you have any remaining questions or require clarification, please let us know. We will be happy to provide further information and discuss them.

Thank you for your time and for contributing to the review process.

Best Regards,

Authors

Comment

We thank the reviewers for their constructive feedback. We revised the manuscript based on the comments. In the following, we summarize the revisions made to the manuscript.

  • We proofread the paper to eliminate any possible typos or grammatical mistakes.
  • (lines 67-69) We added further explanations to clarify the distinction between traditional uncertainty and epistemic ambivalence.
  • (lines 72-78) We added an extra example related to the presence of EA in real-world scenarios.
  • (lines 210-224) We added explanations regarding the EA quantum state in Eq. (3) to clearly distinguish classical aspects from quantum aspects of our formulation.
  • (lines 331-337) We highlighted the differences between conventional MDPs and EA-MDPs.
  • (lines 459-506) We ran additional experiments to compare the performance of our proposed algorithm with a soft Q-Learning algorithm in terms of average gain and conditional action entropy.
  • (lines 521-524) We elaborated on possible future research directions.
  • (lines 944-971, lines 1004-1005, and 1048-1071) We re-organized and moved some of the equations in Section 6 to the Appendix to provide more space in the main text for the new experimental results.
  • (lines 1078-1124) We added remarks regarding the challenges and limitations of implementing our proposed framework in real-world scenarios.
Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.