Rethinking the Buyer’s Inspection Paradox in Information Markets with Language Agents
This work explores the buyer's inspection paradox in a simulated digital marketplace, highlighting enhanced decision-making and answer quality when agents temporarily access information before purchase.
Abstract
Reviews and Discussion
This paper considers a model of information markets with LLM-based agents to resolve the buyer's inspection paradox. Specifically, the buyer wants to assess the information to determine its value, while the seller wants to limit access to prevent information leakage.
Strengths
- It appears to me that information markets will become an increasingly important problem in the future, but this problem is not well studied beyond abstract theoretical models. This paper proposes an interesting initiative to address the inspection paradox in information markets through LLM-based agents, as those agents can natively understand text in context without having to memorize the information.
- This paper designs a simulated information market that generates text content for sale and evaluates the potential performance of the proposed method. Several interesting experiments, reflecting a good amount of effort, are conducted to evaluate the economic rationality of LLM-based agents.
- The finding about debate prompting is interesting by itself, but should the paper cite the source of this technique if it is not an original invention? At a high level, this observation seems to suggest that decision-making, especially decisions that require strategic thinking, should involve counterfactual reasoning (obtained from debating or self-questioning); this insight might be useful for improving the strategic reasoning skills of LLMs.
Weaknesses
- The paper only compares the performance of agents powered by different LLMs. Though the "Evaluating the Evaluator" experiment makes some comparison between GPT-4 and humans as evaluators, what about baselines based on algorithmic approaches, e.g., a buyer agent that designs its quoting strategy based on keyword matching? The ability to forget the information from rejected quotes can also be artificially planted in these algorithms. Hence, without such algorithmic or heuristic baselines, it is unclear to me whether it is necessary to use LLM-based agents for this task.
- The paper focuses on the inspection paradox, but the inspection procedure, especially in the context of an information market, is oversimplified: it is basically whether to allow the agent to read all the text content. However, there is a rich line of econ/cs/ml literature (see, e.g., [1, 2, 3]) on information markets that concerns the power of information design for buyer inspection. For example, the seller could give out only a summary of the content, or answer only certain query questions from the buyer agent, because it is unclear whether it is reasonable or enforceable in reality to trust the buyer to forget all the information from rejected quotes.
[1] Bergemann, Dirk, Alessandro Bonatti, and Alex Smolin. "The design and price of information." American Economic Review 108.1 (2018): 1-48.
[2] Ghorbani, Amirata, and James Zou. "Data shapley: Equitable valuation of data for machine learning." International conference on machine learning. PMLR, 2019.
[3] Chen, Junjie, Minming Li, and Haifeng Xu. "Selling data to a machine learner: Pricing via costly signaling." International Conference on Machine Learning. PMLR, 2022.
- To my knowledge, there is no clear evidence so far that an LLM-based agent has any reasonable rationality or strategic reasoning skill: GPT-4 cannot play tic-tac-toe, or even reliably compare two numbers. This is also verified in the paper's experiments. Hence, I am not sure whether it is meaningful for the paper to give LLM-based agents the ability to make economic decisions; these decisions could be left to humans or some simple algorithmic procedures, while only asking the LLM to provide its reasoning or evaluation scores on the value of the text.
Questions
Please see my comments above.
Thank you for your detailed review and valuable feedback. We were excited to learn that you found our work an “interesting initiative” on an “increasingly important problem […] that is not well studied beyond abstract theoretical models”, and that you appreciated the effort we invested in our experiments. We address your concerns below.
Evaluation and Keyword-Matching Baseline
As you suggested, we added a new experiment with a keyword-matching baseline (BM25) and found that it was outperformed by Llama-2-70B 95% of the time.
- For 95% of the questions, the LLM-powered buyer agent’s answers (based on Llama-2-70b) are preferred to the BM25 agent’s answers by the evaluator. This verifies that leveraging LLMs can significantly boost the quality of the generated answers.
- Increasing the budget for the BM25 heuristic yields better results — the GPT-4 evaluator prefers answers from the high-budget simulations for 67% of all questions (vs. low-budget simulations). This result serves to verify that the simulated marketplace functions as expected.
We have updated the manuscript accordingly (Appendix C).
Trust and Security in the Information Bazaar
In our design, we operate under the assumption that sellers trust the marketplace but not necessarily the buyers. This is because the marketplace is controlled by software that implements checks to ensure legal actions by the buyer agents. For instance, the software guarantees that the buyer stays within budget and cannot leak information. This means that the buyer agent can try to buy information outside its budgetary constraints, but the software will not permit that behavior. Therefore, the need to trust buyers is eliminated: the marketplace software provides data security and faces the same set of vulnerabilities as any other online business. This risk can be reduced through standard cybersecurity practices such as regular security audits, threat modeling, implementation of security protocols, and the addition of compliance layers.
Crucially, the buyer agents can decide whether to buy information after they have received it. They will never commit to buying information before having seen all of it. This approach provides a mechanism for vendors to safely increase the amount of inspection they allow, trusting only the marketplace. While we did not investigate all possible inspection mechanisms, we demonstrated the benefit of allowing more inspection in our study. Specifically, we showed the advantage of allowing content inspection versus metadata-only inspection. This eliminates the need for sellers to strategically withhold or present information, which is one of the key benefits of our approach. We hope that future work builds on the information bazaar concept to evaluate the efficacy of these alternative vendor strategies.
LLM Rationality, Debate Prompting, and Economic Decision-making
We agree that all the LLMs we evaluated display a degree of irrational behavior, and one contribution of this paper is to highlight these behaviors. We developed the debate prompting methodology to ameliorate some of them. To our knowledge, this style of method was first published in a blog post titled The Socratic Method for Self-Discovery in Large Language Models on May 5, 2023 by Runzhe Yang and Karthik Narasimhan; we cite this in our work (Section 3.4, paragraph on "Debate Prompting"). In contrast to their contribution, our method underscores the importance of adaptable character shaping within the debate, providing the opportunity to balance the debate dynamics by offering tactical hints to the respective characters.
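To make this concrete, below is a minimal sketch of how such a debate prompt could be assembled. The character names, tactical hints, and wording are illustrative; they are not the exact prompts used in the paper.

```python
# Hypothetical sketch of debate prompting with adaptable character shaping.
# Hints and wording are illustrative, not the paper's actual prompts.

SKEPTIC_HINT = "Argue that the quoted passages add little beyond what is already known."
ADVOCATE_HINT = "Argue that the quoted passages are essential to answering the question."

def build_debate_prompt(question: str, quote: str, price: int) -> str:
    """Compose a two-character debate over whether a quote is worth its price."""
    return (
        f"Question to answer: {question}\n"
        f"Quoted passage (price: {price} credits):\n{quote}\n\n"
        "Two advisors debate the purchase decision.\n"
        f"SKEPTIC's instructions: {SKEPTIC_HINT}\n"
        f"ADVOCATE's instructions: {ADVOCATE_HINT}\n"
        "After three exchanges, conclude with a single verdict line: BUY or PASS."
    )
```

The tactical hints can be tuned to rebalance the debate, for example by making the skeptic stricter as the agent's remaining budget shrinks.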
The key conceptual idea in this work is that of a buyer agent that is guaranteed to forget (by the marketplace) and can therefore be safely granted access to all information. Such buyer agents can be implemented in different ways; we focus on an LLM-based proof of concept because LLMs are currently by far the most powerful agents for operating on textual data (besides humans, but erasing their memory would be ethically problematic). We completely agree that LLMs fail to act rationally in many cases, but they are rapidly becoming more powerful, and there is a growing body of literature on LLMs as economic agents (see, e.g., Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?). Any advancements in LLMs will directly translate to improved performance of our approach.
In closing, thank you again for your valuable feedback on our work. If you have further thoughts or questions on this subject, please feel invited to engage with us.
Thanks for the detailed explanations, especially the additional experiments using BM25. My judgement on the strengths and weaknesses of this paper remains, so I will keep my score as it is. Again, I appreciate the initiative of this paper, and here is my key takeaway: while LLMs may be reasonably capable of evaluating the value of information, they are not yet ready to fully take charge of the information acquisition task as a rational agent. I take this as an opportunity for future work: if we resort to certain algorithmic approaches for some part of this task, we might be able to design an AI agent of higher performance and reliability.
The authors propose a text-based digital information-market environment. In this environment, buyer agents try to obtain the necessary information through transactions with vendor agents without overspending their budgets, while vendor agents need to sell the information for market credits. As the evaluation and access process is implemented by LLMs, the information is not directly accessed by the buyers.
Strengths
- Abundant experiments on whether LLMs can be used to evaluate information and make economic decisions, examining the performance of different LLMs in different aspects.
- Gives a detailed introduction to how an LLM can become a useful agent in the Information Bazaar, including the prompts, interaction frameworks, and dataset analysis.
- An open-source simulator has been established, which is helpful for future work.
Weaknesses
- The idea of protecting the information rests on trust in the LLM. However, LLMs may face data-leakage risks. The authors need to clarify whether it is secure for LLMs to access the information, even if only metadata is accessed.
- This paper primarily focuses on explaining how to transform an LLM into an agent capable of performing rational actions in an information bazaar, but it lacks a detailed justification for using an LLM. In other words, replacing the LLM with any intelligent agent possessing information-evaluation capabilities and trading intelligence would still effectively address the problem of determining the value of information.
Questions
See the weaknesses above.
We appreciate your review and the constructive feedback. Your positive comments about our experimental results, presentation, and our open source contribution are highly valued. We will address your concerns in the order they were presented.
Security in the Information Bazaar
Thank you for asking this question. To clarify, we do not rely on the LLM to avoid data leakage. Data security is provided by classical software running the marketplace and faces the same set of vulnerabilities as other online businesses. The buyer agent only finds and buys information. The marketplace software guarantees that the buyer stays within budget and cannot leak information. Here’s how this works:
- The buyer agent is entirely controlled by the marketplace. The software that runs the marketplace implements checks that constrain the buyer agent to acting “legally”. It is this static code that ensures that only purchased information can leave the marketplace. This means, for example, that the buyer agent can try to buy information outside its budgetary constraints, but the software will not permit that behavior.
- Cybersecurity Measures: The cybersecurity risk can be reduced through standard practices, such as regular security audits, threat modeling, implementation of security protocols, and the addition of compliance layers.
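To illustrate how this division of responsibility might look in code, here is a minimal sketch in which static marketplace software, not the LLM, owns the budget and the agent's context window. The class and method names (Marketplace, evaluate, reset_context) are hypothetical, not the simulator's actual API.

```python
# Minimal sketch: the marketplace's static code enforces budget and
# forgetting; the LLM only evaluates. Names are hypothetical.
class Marketplace:
    def __init__(self, budget: int):
        self.budget = budget
        self.purchased: list[str] = []

    def inspect_and_decide(self, agent, quote: str, price: int) -> bool:
        """Let the agent read a quote in full, then enforce its decision."""
        wants_to_buy = agent.evaluate(quote, price)   # LLM inspects the full text
        if wants_to_buy and price <= self.budget:     # budget check is static code
            self.budget -= price
            self.purchased.append(quote)
            return True
        agent.reset_context()  # rejected quotes are erased from the context window
        return False

    def release(self) -> list[str]:
        """Only purchased information ever leaves the marketplace."""
        return self.purchased
```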
What makes a good buyer agent?
We appreciate your insightful question. Our framework requires buyer agents to meet two primary conditions:
- They must possess the capacity to accurately estimate the value of the information for the buyer.
- They must have the capability for their memory to be audited and manipulated in a hard-coded manner.
The key ideas we present indeed apply to any such agent. LLMs are the most promising implementation of such an agent due to their unprecedented ability to operate on textual data. We added an experiment with a simpler, keyword-based agent to highlight this point.
Does this address your concerns, or are there other points where more discussion would be helpful?
The paper introduces an interesting automated market run by LLMs that avoids the information asymmetry characteristic of information markets. The market is run entirely by LLM agents that evaluate text offerings and then forget what they have read. This avoids the need for the human agents (the principals) to view content offered in the market in order to evaluate it, which creates a problem for the vendor, who would prefer not to reveal the content prior to sale. By using automated agents for both buyer and vendor, a selection is made without human intervention, so no content used for evaluation is revealed.
The paper works out a full simulation of multiple agents, together with a pricing scheme, to demonstrate the feasibility of LLMs performing these tasks. Various LLM implementations are used.
Strengths
The intricacy and pure inventiveness of the proposed marketplace are impressive. Creating a system with LLMs playing multiple competing roles is novel. Similarly novel is the insight that such automation can address a longstanding question in information economics. One can imagine that this work could inspire a flock of similar LLM-driven markets to resolve similar inefficiencies in actual markets.
The simulation presented shows favorable qualities: better performance with more capable LLMs, rational price behavior in both micro- and macroeconomic scenarios, and improved performance as information content improves.
Weaknesses
There is a fundamental evaluation question, concerning the definition and creation of an evaluation baseline. The paper takes as a premise that an LLM-based method is by nature superior to a conventional automated method. Granted, evaluating LLM performance is an area open to many approaches, and no conventional method exists that serves the role cross-validation serves for supervised learning. Specifically in this paper, one could create surrogate non-LLM buyer and vendor agents (the vendor could be a trivial version that just presents metadata) to serve as a point of comparison for the LLM models presented. This would be the analog of the statistician's null hypothesis. For example, if just metadata were used, e.g., by an algorithm based on conventional similarity measures of relevance, how would it perform? Hence the research question posed, "Does this marketplace enable buyers to more reliably identify and value information?", begs the question, "more reliably than what?"
In the section "Evaluating the Evaluator", a comparison is made between GPT-4's results and human labels, showing reasonable human-level performance. This is interpreted to mean that the method has a subjective component. It does not answer the fundamental evaluation question.
Incidentally, Figure 6 is missing the caption "Figure 6". Figure 6a could be presented more succinctly, since each off-diagonal element is simply 1 minus its counterpart. The choice of visualization method also differs from that of 6b: one is a comparative percentage, the other a correlation, which is confusing.
Questions
One could argue that judging a paper on a task it does not propose to do forces a requirement outside the scope of the work. One could conceivably always come up with such arguments ("what about X or Y; why were they not included in the work?"), and hence such criticisms might appear unfair in general. The argument about the weakness described above could be considered as such. However, this is a stronger argument: claims about a contribution must be substantiated against a conventional norm. I am open to reconsidering this point should the current contribution, based purely on novelty, be considered adequate.
Thank you for your constructive review. We are glad to hear your feedback that the “intricacy and just pure inventiveness of the marketplace proposed is impressive”, and that “this work could inspire a flock of similar LLM-driven markets to resolve similar inefficiencies in actual markets.” In what follows, we address your concerns in the order in which they were presented.
Evaluation and Baselines
Thank you for raising this point. We share your view that evaluating LLMs is still fundamentally an open question, and that comparing with a standard non-LLM baseline will strengthen this work. In response to your query, we have conducted additional experiments.
New Baseline Experiment with non-LLM Buyer and Vendor Agents. TL;DR: we added a new experiment with a conventional baseline (BM25) and found that it was outperformed by Llama-2-70B 95% of the time. More specifically, we replace the LLM in the buyer agent with the widely used keyword-matching algorithm BM25, and we also replace the vendor agent's LLM embeddings with BM25. The buyer agent's policy is natural: it ranks informational goods by their relevance to the question (via BM25) and purchases goods until the budget is spent (a minimal sketch of this policy is given below).
We run this simulation with a lower budget (25 credits) and with a higher budget (100 credits). Our findings are twofold:
- For 95% of the questions, the Llama-2-70b buyer agent’s answers are preferred to the BM25 agent’s answers by the GPT-4 evaluator. This verifies that LLMs can significantly boost the quality of the generated answers.
- For 67% of the questions, the BM25 heuristic with a high budget is preferred to the BM25 heuristic with a lower budget by the GPT-4 evaluator. This result serves to verify that the simulated marketplace functions as expected.
We have updated the manuscript accordingly (Appendix C).
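For concreteness, here is a minimal sketch of the BM25 buyer policy described above, assuming the rank_bm25 package; the variable names and credit accounting are illustrative rather than the simulator's actual code.

```python
# Sketch of the BM25 buyer policy (pip install rank-bm25).
# Field names ("text", "price") and accounting are illustrative.
from rank_bm25 import BM25Okapi

def bm25_buyer_policy(question: str, goods: list[dict], budget: int) -> list[dict]:
    """Rank goods by BM25 relevance to the question; buy greedily until the budget runs out."""
    corpus = [g["text"].lower().split() for g in goods]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(scores, goods), key=lambda pair: pair[0], reverse=True)
    purchased, remaining = [], budget
    for _, good in ranked:
        if good["price"] <= remaining:
            purchased.append(good)
            remaining -= good["price"]
    return purchased
```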
Metadata-only Vendors. Regarding your point that "the vendor could be a trivial version that just presents metadata", we would direct your attention to the experiments presented in Table 1 and Figure 5 (right), which compare against a baseline ("Without inspection") where the buyer agents can only review the metadata instead of the full content. There, we see that the ability to review the full content ("With inspection") provides a clear advantage. Your broader point is well-taken that a large family of vendor strategies may be explored.
Figure 6
Thank you for pointing this out; we will fix this in our next revision.
In closing, we thank you for the effort invested in reviewing our work. If you have further questions or clarifications, we would be happy to discuss.
This paper studies how the use of agents based on large language models (LLMs) could potentially affect information markets. To this end, the paper introduces a suitable environment simulating an information market, and it experimentally evaluates various LLM-based agents within it.
Strengths
I believe that the challenges addressed in the paper are relevant for better understanding how large language models behave in application scenarios where strategic aspects are concerned. The paper does a very good job at presenting the studied setting and the derived results.
Weaknesses
The first concern that I have is about the exposition of the results in the paper. While the obtained results are very well explained, at least intuitively, I think that the paper misses some formalism that is needed to fully understand the presented results. In the end, the actual framework that has been developed to model the agents' strategic interactions is never introduced in the paper. Perhaps this is not so important from an experimental perspective, but it could be of great value for those who are more interested in the theoretical implications of the obtained results.
The second concern that I have is more from a deployment perspective. I found the idea of equipping agents with the ability to forget information that they do not acquire interesting. However, I have some doubts about how this behavior can be enforced in practice.
Questions
See the second concern in the Weaknesses section.
Thank you for reviewing our work. We are pleased to know that you found that our paper does a “very good job at presenting the studied setting and the derived results”. We will now address your concerns.
Theoretical Results
We chose not to fully formalize our model because the key idea of agents whose memory can be erased is applicable to a wide range of models and economic assumptions. Nevertheless, we'd be happy to explore whether introducing some high-level formalism early on would help with clarity of exposition. Was the model you had in mind something along the lines of the one in Appendix A, titled "Formal Result on the Impact of Inspection on Expected Utility"? In this section, we demonstrate that under the assumption of "Monotonicity in Information", the expected utility of the buyer agent increases if inspection is permitted.
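To give a flavor of that result here, below is a minimal version under the stronger simplifying assumption that inspection fully reveals a good's value; the notation is ours and may differ from Appendix A.

```latex
% Minimal formal sketch (our notation, not necessarily that of Appendix A).
% Assumes inspection fully reveals the value; the paper's "Monotonicity in
% Information" is a weaker condition.
Let $v(x) \ge 0$ denote the buyer's value for an informational good $x$
offered at price $p(x)$, and let $\hat{v}(x)$ be the buyer's estimate of
$v(x)$ from metadata alone. Without inspection, the buyer's expected utility is
\[
  U_{\mathrm{no\text{-}insp}}
  = \mathbb{E}\!\left[ \bigl(v(x) - p(x)\bigr)\,
    \mathbf{1}\!\left[\hat{v}(x) > p(x)\right] \right],
\]
whereas with inspection (and guaranteed forgetting) the buyer observes $v(x)$
before committing, so
\[
  U_{\mathrm{insp}} = \mathbb{E}\!\left[ \max\bigl(v(x) - p(x),\, 0\bigr) \right]
  \;\ge\; U_{\mathrm{no\text{-}insp}},
\]
since $\max(v - p,\, 0) \ge (v - p)\,\mathbf{1}[\,\cdot\,]$ pointwise: the
inspecting buyer can replicate any metadata-based rule and deviates only
when doing so increases utility.
```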
Practical Deployment
Thank you for raising this point, which we rephrase as: “How can we practically ensure that the agents discard the information that they are required to forget?” Our solution involves two strategies:
- The buyer agent is entirely controlled by the marketplace. The software that runs the marketplace implements checks that constrain the buyer agent to acting “legally”. It is this static code that ensures that only purchased information can leave the marketplace. This means, for example, that the buyer agent can try to buy information outside its budgetary constraints, but the software will not permit that behavior.
- Cybersecurity Measures: On top of this, a practical deployment needs to address cybersecurity risks. These can be reduced through standard practices, such as regular security audits, threat modeling, implementation of security protocols, and the addition of compliance layers.
We will include this clarification in our manuscript.
In conclusion, we thank you again for your review. If you have further questions about either aspect, please feel free to engage with us.
The paper introduces an innovative concept of a digital marketplace, the Information Bazaar, where multiple large language model (LLM) agents buy and sell information. The study focuses on enabling buyers to assess the value of information without fully accessing it, thus mitigating the risk of information theft for sellers. Key experiments uncover biases and irrational behaviors in language models, investigate price effects on demand, and demonstrate that inspection and higher budgets lead to better outcomes.
Thanks for the great work. It could also be beneficial to discuss prior work on multi-LLM agents for the study of cooperative behaviors [1].
[1] Li, Guohao, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. "CAMEL: Communicative Agents for 'Mind' Exploration of Large Language Model Society." NeurIPS 2023.
Dear reviewers: Please read the replies from the authors carefully, and submit your reactions. Please be open-minded in deciding whether to change your scores for the submission, taking into account the explanations and additional results provided by the authors.
Thank you!
Dear authors: Here are some comments in addition to those provided by the reviewers.
The 1962 paper by Arrow cited in the submission, namely "The Economic Implications of Learning by Doing," is not relevant. It does not discuss the buyer’s inspection paradox, and indeed does not contain the word "information." The authors should instead cite "Economic Welfare and the Allocation of Resources for Invention."
Please provide more references, and more discussion, about the true importance of the paradox. In the business world, there is a standard solution to the paradox, namely contracts and their enforcement. As one special case, every day thousands of so-called non-disclosure agreements (NDAs) are signed, and respected. Another simple and common mechanism is to provide old information as a sample, since in many cases, only up-to-date information is valuable.
Please cite and discuss the literature on alternative academic solutions to the paradox. Reviewer STDL provided these:
[1] Bergemann, Dirk, Alessandro Bonatti, and Alex Smolin. "The design and price of information." American Economic Review 108.1 (2018): 1-48.
[3] Chen, Junjie, Minming Li, and Haifeng Xu. "Selling data to a machine learner: Pricing via costly signaling." International Conference on Machine Learning. PMLR, 2022.
Dear Area Chair sQj519,
Thank you for your insightful comments and suggestions. We appreciate the opportunity to clarify our position and address your concerns.
Firstly, we acknowledge the misattribution of Arrow's work; it was a BibTeX error on our end. We appreciate your suggestion to cite "Economic Welfare and the Allocation of Resources for Invention", and we have made this correction in our revised manuscript. Further, Arrow's work refers to this problem as one of "demand determination"; we believe the term "Buyer's Inspection Paradox" was coined for the same phenomenon in "A Proposal for Valuing Information and Instrumental Goods", which we now also cite.
The standard solution to the paradox of using contracts does indeed work well in relatively slow-moving fields with established players. However, it has several limitations which our approach may address:
- Legal action requires substantial resources and is very slow, hence it is only a credible threat if the seller is sufficiently large and willing to engage in legal battles. This represents an important barrier to entry for small, less established sellers.
- The buyer's identity must be known, and they must be subject to the same jurisdiction as the seller. In addition, it is hard to prove breach of contract, as information revealed to the buyer may influence their decisions in subtle, possibly even unintentional, ways. Hence sellers will be very cautious about which buyers they engage with, which represents a barrier to entry for small, unestablished buyers.
Similarly, the approach of inspecting old information is only applicable to established sellers.
Our approach removes these market frictions. It allows for a highly dynamic marketplace where thousands of buyers and sellers can interact continuously; new, small buyers and sellers (without an established reputation) can join at any point in time and be immediately compensated for the actual value of their information. Further, buyers can easily engage in comparison shopping, a behavior poorly supported by the alternatives. We believe that this may lead to a much more diverse, democratic, and reactive information ecosystem. We have added a discussion of this to our introduction.
We appreciate the related academic references provided by Reviewer STDL. The works of Bergemann et al. (2018) and Chen et al. (2022) indeed offer valuable insights into the design and pricing of information and the use of costly signaling in data selling.
Our work diverges from these studies in several ways, but the most important is this: those papers study how the seller should strategically price, bundle, or advertise information when the buyer has to agree on the price before seeing all the information. In our case, the buyer only agrees on the price after seeing the information, hence the strategic considerations studied in these papers become unnecessary for the seller, and the buyer can make a more informed decision. We have added a discussion of these papers to the related works section of our paper.
We are grateful for your and the reviewers’ feedback, which has helped us improve the quality of our work. Please let us know if you have any more questions or observations.
Two reviewers are marginally positive about this paper, and two are negative, so this is a borderline case. As the area chair, I am ultimately slightly negative. My basic reason is that the paper is heuristic, so the results cannot be viewed as solid; at any time in the future, someone might come up with a formal argument, or a new clever heuristic, that makes the conclusions here invalid.
The work is based on the assumption that AI agents can be made to forget information, unlike humans. But this is a questionable assumption. It is an open question how to prevent an LLM from memorizing and/or leaking its training data. And a lot of the fears around general AI are that agents will lie about what they are actually doing. The authors, in their own words below, assume that "The buyer agent is entirely controlled by the marketplace." But a major fear, and research question, is that LLM agents may be impossible to "entirely control."
Why not a higher score
There are legitimate concerns about this work, revolving around its heuristic nature.
Why not a lower score
Reviewers find the work interesting and the authors have defended it in a sensible way.
It is an open question how to prevent an LLM from memorizing and/or leaking its training data.
This seems irrelevant; the point is about forgetting information from its context window, not from its training data.
This is a valid comment. The paper is based on the ability of LLMs to forget previous prompts.
The broader issue remains that the proposal is fundamentally heuristic and relies on trust that is not fully justified formally. Buyers and sellers would have to trust the marketplace and that the LLMs and their surrounding software really are forgetting previous prompts. Given this reliance, the method is not so different from existing human institutions that rely also on various types and levels of trust.
Reject