An engine not a camera: Measuring performative power of online search
We design and conduct an online experiment to get quantitative insights into the ability of search engines to steer web traffic
Abstract
Reviews and Discussion
The authors describe performative power, a pre-existing proposed measure of platform market power, and give an approach to measuring performative power using a browser extension. The browser extension performs random modifications to the search results pages of target search engines and measures click behavior. The modifications and user clicks are logged, providing sufficient statistics to compute a variant of the performative power metric. They deploy the extension to 85 users, resulting in about 57k clicks, and produce performative power calculations based on these interactions. The authors then provide some discussion of the results.
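Roughly, the kind of estimate such logs support looks like the following (a minimal sketch; the field names, arrangement labels, and the simple difference-in-click-rates estimator are illustrative assumptions rather than the paper's exact metric):

```python
# Minimal sketch (illustrative, not the paper's exact estimator): from logged
# (arrangement_id, clicked_position) records, compare how often the originally
# top-ranked result is clicked under the control arrangement vs. an intervention.

# Which original rank is shown at each displayed position, per arrangement
# (hypothetical encoding; "a0" = control, "a1" = swap of results 1 and 2).
DISPLAYED_TO_ORIGINAL = {
    "a0": {1: 1, 2: 2, 3: 3},
    "a1": {1: 2, 2: 1, 3: 3},
}

def click_rate_on_original(logs, arrangement, original_rank=1):
    """Fraction of queries under `arrangement` whose click lands on the result
    that was originally ranked at `original_rank`."""
    rows = [r for r in logs if r["arrangement_id"] == arrangement]
    if not rows:
        return float("nan")
    hits = sum(
        DISPLAYED_TO_ORIGINAL[arrangement].get(r["clicked_position"]) == original_rank
        for r in rows
    )
    return hits / len(rows)

def performativity_gap(logs, original_rank=1, control="a0", treatment="a1"):
    """Drop in clicks on the originally top-ranked result caused by the intervention."""
    return (click_rate_on_original(logs, control, original_rank)
            - click_rate_on_original(logs, treatment, original_rank))
```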
Strengths
I feel there are two main strengths to the paper
- The authors make the case for measuring performative power using a browser extension
- The authors created a browser extension to measure performative power. They appear to have been somewhat careful, discussing issues with hiding the DOM until rewriting is complete, discussing some latency numbers that are not bad (around 40ms, certainly enough to impact user behavior, but not enough to be wildly visible). They discuss privacy implications of their storage and logging of user search events, and have taken a reasonable approach with this, logging only intervention ids and click positions.
Weaknesses
W1: Supporting ongoing work. I didn't see a reference to the source code for the browser extension, or any discussion of it being available for further work -- this seems like a surprising omission, though perhaps I didn't read the right section?

W2: Novelty. I have a hard time characterizing exactly where the main contribution of the paper lies. The development of the performative power measure, and the argument for its relevance in regulation, comes from prior work. The impact of position on click likelihood is also quite heavily studied, so the headline numbers the authors show already have reasonable support in the literature. The authors make a small modification to the definition to incorporate changes to the page beyond the organic results, but this does not seem to be the key point. The observation that PP can be computed from a browser extension seems fairly straightforward, not a centerpiece for an argument of novelty -- such approaches have been taken in the past for modifying search results and measuring interactions. The paper could arguably make the case for taking a straightforward idea (a browser extension for performative power) and exploring the many thorny problems in designing and deploying this measurement, but my second primary concern below is that these issues remain largely outside the scope of the paper. Overall, I feel that there isn't a clear case to be made for the dimension of novelty.

W3: Platform power. The authors paint a picture of developing a causal understanding of the "power" the platform has to differentially route user attention across resources. My concern is that some interventions are sustainable (i.e., the platform could actually implement such a modification) while others are not. As an extreme and somewhat laughable example, consider an intervention that replaced every search result with a link to the CEO's gofundme account. This would change clicks dramatically, resulting in high performative power, but nobody would argue that the platform could sustainably deploy this intervention. Instead, the goal is to consider the power the platform has to alter the ongoing distribution of user attention to online resources, so making this measure robust requires significant attention to the issue of reasonable interventions. The related literature the authors cite (cascade models, eye tracking, MLIR) is quite careful in these areas. It is not clear that placing a purely navigational result at position 3 instead of position 1 is sustainable, and the issue is not so simple: users habituate to platform behavior, and will respond differently when the platform stops behaving as expected, both in determining which results to consider and hence where to click, and in bigger ways (i.e., changing providers). Correspondingly, platforms themselves rely on user feedback, which would be adulterated by these types of interventions, with unclear implications. For a proposal intended to be a "blue print for how performative power can be used to guide digital market investigations," I think it's a significant omission not to consider these issues.
Questions
I'd love to hear the authors' thoughts on the three issues I raised in the weaknesses section above. Here are a few smaller questions also (I'll start numbering at 4, assuming the three points above are questions 1-3):

4. Could you discuss the fact that you consider only a single click, given that the intervention is likely to produce a higher number of clicks than the control (as satisfaction with the top result will be lower)? Likewise, the browser extension has the capability to consider more details of the interaction, such as an immediate click-back from an unsatisfactory result page followed by a click on a following page. I know it's biting off a lot to consider these types of interactions, but perhaps you could discuss a little.

5. Nit: line 112 has a typo in the cite of Chuklin et al.

6. Could you clarify the triggering for arrangements a4-a6? Do they trigger only on results that have ads / shopping content? How is that accounted for in the analysis? Sorry, this may already be specified in the paper and I may have missed it.

7. Latency sounds pretty good, but it is probably worth referencing the literature on the impact of additional latency of this magnitude on search user behavior.

8. Study participants are trusted individuals solicited by personal outreach -- this suggests they will likely trust the researchers, tend to give the browser plugin the "benefit of the doubt," and so forth. Could you discuss the issues of conditioning that result from this choice of users?
Limitations
I feel that the limitations I discuss above should be covered in more detail by the authors. For ethics review, the authors indicate that this type of study does not require IRB approval at their institution, so I take this at face value.
Thank you for the thoughtful feedback and comments. We hope to provide additional clarification and address your questions in the following.
W1: Source code. The extension has been published in the Chrome store. The code can be inspected using the developer console in the Chrome browser. We did not include the link for anonymity reasons, but we will add it to the final version. If you wish to inspect the code, we uploaded it to an anonymized repository shared with the AC.
W2: Novelty. The core contribution of our work is to take a theoretical concept and provide a first demonstration of how it can be operationalized in a practical scenario. A priori it is unclear how to relate performative power to micro experiments that are feasible to perform on digital platforms. Our experimental design, together with the required theoretical assumptions, outlines a possible avenue for doing this. By relating tools from computer science to a policy-relevant question at the heart of a major antitrust investigation, our work provides a promising interface between regulatory questions and the expertise of the NeurIPS community.
W3: Choice of interventions. Note that the instantiation of the action set is part of the performative power definition. Different choices are valid, including larger changes, but they lead to different conclusions about power. So the action set needs to be instantiated carefully in any given context. In our case we care about the effect of actions the platform has implemented in the past (downranking search results and adding visually appealing elements). Our counterfactuals are designed carefully to give us insights into performative power related to these interventions.
As you pointed out, it is important not to interleave experiments with treatments that have a disruptive impact on user experience and behavior, as we want to measure the effects of our interventions under natural interactions of users with the platform. Being exposed to one intervention should not impact behavior for a subsequent query. We make this requirement explicit in Assumption 1. It is also why, after feedback from an initial testing phase, we refrained from including larger swaps as an additional treatment group. All our counterfactuals constitute minimal interventions. As you point out, the largest modification we perform is to move the first search result down by two positions. This is comparable to having two Ads on top of the page and it typically does not move the result out of sight. Based on our own experience and feedback from participants, there is no evidence that the extension was noticed by any of the participants.
It is also important to note that modifications are triggered at random. This offers no structure a user could implicitly or explicitly adapt to. The swap 1-3 only happens with probability ⅛ for any user query (including navigational queries), and for 45% of the queries the extension does not swap results. We checked the behavior in the control group by splitting the data collected before and after November 10. The average click rate for the first result differs by 0.8% across these two groups (the base rate is 43%), which is significantly smaller than the sampling uncertainty. We will think of a more rigorous experiment to add to the appendix to demonstrate that the behavior of participants is stable across the duration of the study.
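For transparency, the check we describe is essentially the following (a minimal sketch with assumed field names; this is a simple two-proportion comparison, not necessarily the more rigorous test we will add to the appendix):

```python
from math import sqrt

def first_result_click_rate(rows):
    """Click rate on the first result and the sample size of the group."""
    rate = sum(r["clicked_position"] == 1 for r in rows) / len(rows)
    return rate, len(rows)

def stability_check(control_rows, cutoff):
    """Compare the control-group click rate on the first result before and
    after `cutoff` (e.g., November 10) against its sampling uncertainty."""
    early = [r for r in control_rows if r["date"] < cutoff]
    late = [r for r in control_rows if r["date"] >= cutoff]
    p1, n1 = first_result_click_rate(early)
    p2, n2 = first_result_click_rate(late)
    diff = p1 - p2
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return diff, se  # |diff| well below se indicates stable behavior
```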
Triggering a4-a6. Arrangements a4-a6 are each triggered with a fixed probability (p = 1/6). If there is no content to hide for a given query, a5 solely performs a swap and the other interventions leave the website as is. This ensures that treatment assignment is independent of the potential click outcome, which is important to obtain unbiased estimates. On aggregate, the effect of hiding Ads will naturally be larger if Ads are present more often.
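In pseudo-code, the assignment logic is roughly the following (an illustrative sketch; the helper functions and the exact probability split for the remaining arrangements are placeholders):

```python
import random

def assign_arrangement(rng=random):
    """Draw the arrangement before inspecting the page, so that treatment
    assignment is independent of page content and of the potential outcome."""
    # a4, a5, a6 are each drawn with fixed probability 1/6; the remaining mass
    # goes to the other arrangements and the control (details omitted here).
    return rng.choices(["a4", "a5", "a6", "other"], weights=[1, 1, 1, 3])[0]

def apply_arrangement(page, arrangement):
    """Apply the drawn arrangement, degrading gracefully when there is no content to hide."""
    if arrangement in ("a4", "a5", "a6") and not has_content_to_hide(page):
        # Nothing to hide: a5 still performs its swap, a4 and a6 leave the page as is.
        return swap_results(page) if arrangement == "a5" else page
    return apply_full_arrangement(page, arrangement)  # placeholder for the full logic
```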
Latency. The latency refers to the time for loading the search page after entering the search query. A technical report from Google [1] could not find any effect of delays below 100ms on search behavior. Similarly, [2] find that “up to a point (500ms) added response time delays are not noticeable by the users.” Our delay of 40ms is below these thresholds and thus not expected to impact our results. We will include these references.
Considering a single click only. As you pointed out, we do not make a distinction between follow-up searches and new searches. What we measure is the ability of the platform to steer clicks, treating all clicks equally. To analyze a different outcome variable, such as a specific type of click, we would have to adjust the experiment accordingly. However, the scenario you mention will still meaningfully affect our measurements. If a user returns to click the second result after clicking the first under intervention a1, this will surface as a reduction in the performativity gap. This leads to the interesting observation that performative power is larger in situations where equally relevant results are being ranked.
Participants tend to give the browser plugin the "benefit of the doubt." There is a bias in the selection of participants, as people who trust us to perform a legitimate scientific experiment, and who take privacy seriously, are more likely to install the extension. But once they have the extension installed, the experiment is randomized at the query level and we can safely assume that they consume Google search in a natural way. There is no structured change they could respond to. We even deleted the data from the first 4 days after onboarding to be on the safe side and avoid potential biases due to participation awareness.
Please clarify if we misunderstood your last questions; we hope the additional context is helpful.
References:
[1] J. Brutlag. Speed matters for Google web search. Online report by Google. 2009.
[2] Arapakis et al. Impact of response latency on user behavior in web search. SIGIR. 2014.
Thanks for clarifying about the chrome extension -- this answers the question completely.
Regarding the other questions, I think the authors have made reasonable points in terms of extensibility of the framework for other actions and measures.
I think my remaining question centers around weakness #3. Let me lay out the concern here more fully, and the authors can perhaps respond with counter-arguments, point out things I'm missing, etc.
The authors' rebuttal discusses that users will not habituate to the intervention because the intervention is randomized and reasonably infrequent. I don't think this is true. If navigational queries result in the correct answer at organic position #1 97% of the time, that's good -- users develop an expectation of the engine's behavior for that query class. Now, let's say that 12% of the time, an intervention moves the #1 result to #3, so the correct answer is now at #1 85% of the time. This is a massive change in the experience: the error rate grew by a factor of 5 from 3% to 15%. A competitor might find the top result 96% of the time. The intervention could be made less often to reduce this gap, but consider what is happening here: we are considering what level of intervention would not change the user's perception of value from the system, but immediately then measure the PP of the system under the assumption that the intervention is applied 100% of the time -- this clearly represents a different user experience, and the platform would consequently occupy a different position in a new equilibrium.
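Spelling out the arithmetic behind these numbers (a quick back-of-the-envelope check using the rates above):

```python
base_correct_at_1 = 0.97   # correct answer at organic position #1 without intervention
intervention_rate = 0.12   # fraction of queries where the intervention moves #1 to #3

correct_at_1 = base_correct_at_1 * (1 - intervention_rate)  # ~0.85
error_before = 1 - base_correct_at_1                         # 0.03
error_after = 1 - correct_at_1                               # ~0.15
print(error_after / error_before)                            # ~4.9, i.e. roughly a 5x increase
```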
In the language of the original PP paper, performative power is known to go to zero as competition increases towards a state of perfect substitutability. The question of how much a search engine can adulterate its experience before it suffers competitive losses is therefore a question about the state of competition. My main concern here is that a careful handling of this question is required: just how much can the platform really modify its results before, over time, users begin to realize that a competitor provides better performance. This question seems quite tricky. So my concern is that the current paper is an empirical study of performative power in search ranking, but leaves out the attempt to tackle what appears to be perhaps the single most critical issue in understanding just how much power a platform has.
Note further that the original paper considers this question of the PP of a platform with respect to viewers in a carefully-chosen setting: that of content recommendations of content hosted uniquely at the platform under study. There is no perfect substitute competitor in this setting. Web search, on the other hand, is fundamentally about providing access to a public database of content, so the issue of what competitors do is much more important.
I wonder if the authors could engage a little on this and share thoughts.
Thank you for providing additional context; now we better understand your main concern.
Let us explain why measuring the effect of reranking on clicks, holding the current state of the market fixed, is actually what we care about.
The Google Shopping case is about the ability of the search algorithm to distort traffic in the downstream market of comparison shopping services (CSS), a market in which Google is also competing. It is about the effect the search algorithm can have on other online businesses that receive large portions of their incoming traffic through Google search. Consider a specific competitor offering a CSS service. Now if an update to the search algorithm consistently downranks their website by two positions, this has a significant impact on the clicks they get and hence their business, without necessarily impacting users’ search experience or Google’s position in the market for search. In fact, Google would most likely aim to avoid updates that negatively impact search users’ satisfaction or retention.
But it is a fact that Google has significantly down-ranked competitors in their ranking for certain Shopping queries (sometimes even moving them to the second page). Our goal is to establish a plausible lower bound on the effect of such a ranking change. Thus, being careful to measure the effect of reranking without impacting users' browsing behavior is not an omission but part of the design. An important distinction compared to the example of performative power you refer to in the original paper is that here the effect and the conduct are happening in different markets.
We hope this clarifies your concern. We are happy to discuss more if it is helpful.
Thank you for your clarification. Let me try to rephrase, and maybe you can confirm that I take your point correctly. If we consider an algorithm change that downranks the top website by two positions, there are two cases:
- Such a change might be roughly indistinguishable in quality to users, perhaps because there are several possible responses of roughly equal quality, or
- Such a change might result in a meaningfully worse user experience, which would cause the platform to lose market share
I believe your clarification is that certain changes might be deployed in practice, as in the example you give -- such changes belong presumably to bucket #1 as the platform does not wish to launch a degraded search experience. In this case, you would like to state that the platform has high performative power because the behavior of users with respect to visiting the top 3 alternatives has changed significantly.
In some sense, one might argue that using your measurement as a lower bound on the engine's ability to move user attention is valid because the impact of changes that do negatively impact user experience is likely to be even smaller -- users will in some cases note that the top-ranked result does not meet their information need, and will continue to scroll, resulting in a smaller behavior change in bucket #2 above than bucket #1. Hence, from this standpoint, your measurement is more likely to be an underestimate than an overestimate if the top alternatives are more similar in quality. I believe this argument is reasonable, and your discussion in section 6 is consistent.
I think my outstanding concern is that the concept of change in quality of the user experience is not mentioned or discussed. Consider an alternate universe in which three shopping providers compete. CSS #2 redesigns its website, resulting in improved user experience and better customer loyalty. CSSs #1 and #3 fail to invest, their catalogs decay, and their excessive use of <blink> tags drives customers away. The search platform ranking adjusts to prefer competitor #2, as do other competing search services -- failing to recognize the improved quality of #2 would in fact result in the search platform itself suffering market losses. The observed situation is identical to the example you give -- the platform changed ranking and drove more traffic to #2 -- but in this case, the change in user behavior is forced on the search platform in order to remain competitive in its market. Do you feel that this distinction is meaningful in discussing performative power (i.e., the distinction between ranking changes forced on the platform to remain competitive, versus those available to the platform at its discretion, without impacting user experience)? If not, any additional thoughts or clarification appreciated. If so, do you have suggestions for how to incorporate this distinction?
Let us clarify that the way we apply performative power is retrospectively. While you outline several valid scenarios for why a ranking change may or may not be implemented by a platform, our argument goes the other way around.
You start from the kind of action that was evidently taken and you ask: What effect does it have? It's a fact that the update Google performed changed ranking positions of CSS providers by several positions. And the EC argued the algorithmic update was implemented for Google’s own benefit. What isn't clear is how large an effect we expect this to have on the traffic that goes to CSS providers.
Performative power brings in a key quantitative piece of information: How large is the effect of such a change? Performative power does not, and cannot, answer the question whether harm occurred, or whether a change was financially sound for a company. Performative power will tell you: For the kind of change that the platform did, how large an effect can we expect? What's needed for an antitrust case, such as the one pursued by the EC, is a plausible lower bound on this effect size. This is the central piece that performative power formalizes and operationalizes.
With our browser extension we emulate different conservative versions of the update Google implemented, to obtain a lower bound on the performative effect relevant for the investigation. The arguments from our previous response regarding the stable behavior of participants throughout the experiment concern the validity of estimating the effect of these multiple different updates by interleaving them in an experiment the way we do. The argument you correctly rephrased, namely that measuring performative power for queries where the relevance gap is large yields a plausible lower bound for scenarios where it is smaller, concerns how we relate the effect we measure in our experiment to the types of queries relevant for the case. But it is important to keep the argument for the validity of the experimental design separate from the discussion of the reason to litigate an antitrust investigation. The latter determines the concrete instantiation of performative power, which is where we start from, and the former concerns the measurement step to obtain plausible quantitative evidence. We hope this clarifies your concerns.
In any case, we certainly agree with you that it would be worthwhile to extend the discussion of these important aspects. We enjoy your constructive engagement and take this as a sign that the paper should be discussed with a broader audience at NeurIPS. We hope that you consider voting favorably for its acceptance.
Thank you, this is very helpful. I think the first sentence of your response ("Let us clarify that the way we apply performative power is retrospectively.") is at the core of our discussion. I believe your writing is entirely consistent with your statement, but I think many of my concerns arose because the point that performative power is to be applied retrospectively is not made explicitly. This makes it easy for readers (like myself) to interpret the flow differently, as saying instead that you measure how much a search engine might directly optimize the shape of traffic it distributes, within a very broad design space outlined by your measurements. I think we are in agreement that this alternative interpretation is not your intention, and would raise a host of issues about whether a particular traffic shaping outcome would allow the platform to remain competitive. Is it possible to clarify in the final version this point about retrospective application of the measure?
We are glad this resolved your concern. Your comments and the discussion overall were very helpful in identifying this core argument, which is worth emphasizing more clearly to anticipate similar questions in the future. And yes, we are happy to incorporate your feedback and make the point about performative power being applied retrospectively explicit in the final version.
This paper describes an online experiment seeking to measure how much power online search providers have in terms of impacting what content people consume. In short, the study attempts to measure the causal effect of small ranking rearrangements on click rate for a population of web users. The authors present experimental data that can be used to estimate the expected impact that operators of search engines, recommender systems, and other ranking technologies might have on viewership of items when they make small changes to their rankings.
Strengths
In terms of originality, quality, and clarity:
- Originality is high. The authors review (in reasonably terse fashion) a number of studies that have sought to understand the impact of ranking items in search on attention (i.e., clicks and visual attention that items receive). The study is grounded in this past work and notes its major contribution is to begin studying this effect experimentally.
- Quality of the study is high overall. See some concerns / questions below, but overall I would consider this to be a very convincing set of results.
- Clarity is very high throughout. Experimental details are crisply described.
In terms of significance:
- Reasonable dataset (for this kind of study) with 57,000 "real" queries from 80 participants
- A known caveat is that getting this data without direct access to search operator datasets is prohibitively expensive, which is why this is novel.
Weaknesses
Two weaknesses (with respect to venue fit and potential of the current draft) stood out:
First, a minor note: this kind of experimental work might be somewhat unusual at NeurIPS as it doesn't engage with the modelling side of ranking. Personally, I do not think this should be a blocking reason -- I think many in the community would like to see future works that incorporate this kind of experimental approach -- but it felt like an important piece of context to note.
Second, some readers may be concerned with perceived issues of ecological validity. To some extent these are unavoidable in any experimental study like this -- there will always be some set of ecological validity concerns; they just trade off with each other.
- While this choice is reasonable, it may impact the kinds of queries used: "The study participants are trusted individuals of different age groups and backgrounds, recruited by reaching out personally or via email."
- It might be helpful (esp. given the work may impact policy discussions) to know more about the domain / type of queries, but very reasonable privacy choice to avoid sharing any information about that.
- Very minor: It's (by choice of technology companies) unclear if the kinds of perturbations studied here map to the kinds of a/b tests or experiments that are frequently rolled out by those companies. I don't think this is something the paper needs to address explicitly, but is also worth noting.
Questions
A few questions that might be worth addressing in the next version of the paper
- Are there major constraints you would expect to face if trying to apply these methods to arbitrary types of other ranking platforms (feeds, matching/marketplace platforms, etc.)?
- To what extent, if any, do you expect the specific algorithmic / modelling choices made by platforms to matter here?
- Are there domains that might be hard to study, because e.g. the number of items is too high?
Limitations
Limitations of the methods are reasonably discussed throughout.
Thank you for the positive feedback on our work. In the following we first discuss your questions and then provide some thoughts on the additional comments below.
Applying the method to other use cases. This is an interesting point we have not discussed in our work. An important feature of our design is that the complexity of the underlying algorithm does not impact the experimental design. The intervention is implemented at the level of the user interface, rather than the algorithmic system. This means the system does not have to be changed. From a technical viewpoint, changing how results are displayed to users can in principle be applied to any website using a browser extension, independent of how these results are selected by an underlying algorithm. However, one important constraint is that the updates that can be emulated at the display level are limited to the information available on a website. This means we can swap and hide elements, but not replace them with an alternative that is hidden. Similarly, we cannot use any proprietary data the firm might use to determine content allocation. This limits the updates we can emulate. However, for the purpose of lower-bounding performative power, it is sufficient to argue for one feasible update that can be implemented. Evaluating the effect of this update provides a lower bound on the corresponding definition of performative power. Here, the swap of adjacent results usually offers a good proxy for potentially larger updates a firm could implement.
Domains that are harder to study. In light of the constraint mentioned above, the design of counterfactuals is harder for settings where the firm offers the user fewer choices, as in the case of the Amazon Buy Box, where a single element is selected to be displayed behind a visually appealing button. In such cases natural updates are harder to argue for without additional insights into the algorithm or the alternative options. A large number of items offers the opposite situation, which does not seem to be a problem, but we are not entirely sure we understand the question. One interesting aspect of a large number of items is that more of them can usually be relevant, which leads to a stronger effect of ranking and a larger performativity gap. But this would just impact the effect size, not the implementation.
Ecological validity. As the reviewer mentioned, concerns related to ecological validity are unavoidable to some extent. What we can offer is an additional ablation and robustness checks where we remove individual participants from the evaluation to check the sensitivity of the statistics. Results are shown in the supplementary PDF, but error bars are still very small.
Recording queries. Thank you for supporting this design choice. Search queries are among the most sensitive personal data. For our proof of concept it was not necessary to collect this information, which is why we refrained from doing so. However, we agree that when inspecting more nuanced policy-related questions it might be necessary to record additional information, and the benefits may outweigh the privacy costs in a specific case. However, the exact question should come first and guide the corresponding decision. From a technical perspective there is nothing preventing us from recording user queries; we have the information readily available. For a future study we might consider storing an embedding of the search query, but for the current data we simply do not have this information at our disposal.
Why we chose NeurIPS. We agree that this type of experimental work is less typical from the perspective of developing better ranking models. However, from the perspective of causal inference and algorithmic auditing, directly tackling the measurement problem is quite natural. We also believe that decoupling modeling from empirical measurement is very important for developing investigative tools in the context of AI systems. Modeling behavioral aspects of the interaction of participants with platforms is very complex, and this complexity should not prevent us from developing effective monitoring and measurement tools. Additional arguments for why we think our work is interesting for the NeurIPS community and falls under the call for papers can be found in the response to Reviewer 9NWC, but it does not seem that we need to convince you of this.
We hope our rebuttal addresses your questions satisfactorily and we will incorporate additional discussion of the points you raised in a future version.
Thanks to the authors for this response and additional information. I stand by my positive review, and would echo the argument for why this particular paper and kind of paper does fit the current CFP.
The paper titled "An engine not a camera: Measuring performative power of online search" presents a study on the performative power of online search engines, specifically focusing on how they can influence web traffic through the arrangement of search results. The authors designed and executed an experiment to quantify the ability of search engines to steer user clicks by rearranging search results, a concept known as performative power.
The main contributions of the paper are as follows:
Conceptual Framework: The authors build upon the recent definition of performative power to develop a framework for understanding the influence of digital platforms, particularly search engines, on user behavior.
Experiment Design and Quantitative Findings:
The study involved tens of thousands of clicks collected over a five-month period from more than 80 participants, providing robust quantitative insights into the causal effects of search result arrangements. The paper reports significant quantitative findings, such as the observation that consistently downranking the first search result by one position causes an average reduction in clicks of 42% on Google search, and even more for Bing.
Strengths
Originality:
The paper introduces a novel concept, the performative power of online search engines, which is a significant contribution to the field of digital platform policy and regulation.
Quality:
The analysis is thorough, with robust quantitative findings supported by extensive data collection over a five-month period from a diverse participant base.
Clarity:
The paper is well-structured, with clear definitions and explanations of key concepts such as performative power and the methodology used for the study.
Significance:
The paper shows that search engines are active "engines" that can significantly shape and influence the information landscape, with important sociopolitical implications.
Weaknesses
- Limited Sample Size and Diversity. The paper could benefit from a larger and more diverse sample to ensure the results are representative of different user behaviors across various demographics.
- Lack of case studies. While the paper provides clear quantitative results, additional qualitative analysis or case studies could help interpret why certain patterns emerge, offering deeper insights into user decision-making processes.
- Lack of discussion on interventions. The researchers provide a compelling framework for measuring this performative power, but the paper does not extensively address potential countermeasures or interventions that could help mitigate the negative aspects of search engines' performative power.
Questions
See Weaknesses
Limitations
The authors have already discussed the limitations of the paper.
Thank you for the feedback. Let us explain why we see the focus on measurement (rather than qualitative modeling) as an important opportunity of our approach, rather than a weakness.
Qualitative insights. We agree that qualitative insights are highly valuable. However, it is possible to measure and monitor the influence of AI systems on user behavior without the need to understand the complex mechanism behind it. This is an important takeaway of our work in the context of market regulation specifically, as regulators have been struggling with the complexity of digital markets, see e.g., [1]. Our work shows that the complexity to model digital markets should not prevent us from developing effective measurement and monitoring tools.
Countermeasures. Related to the above comment, it is useful to treat measurement separately from the design of countermeasures. Remedies are often context specific and require the balancing of many competing objectives. This goes far beyond measurement. And there are many use cases where measurement is what you primarily care about. This separation is also common in antitrust investigations for example, where there is no need to suggest an effective remedy to run a case. But as a first step towards designing effective remedies, what we can offer is an investigative tool for researchers, regulators, and potentially also platform owners to measure the effectiveness of remedies, as well as setting transparent evaluation criteria.
Sample size. We agree that a larger pool of participants would always strengthen the results, and a larger sample size would be useful specifically to reduce uncertainty in some of the query-subset evaluations, where the bootstrap confidence sets are not as tight as for the aggregate results. We cannot add additional data at this point, but we have included additional robustness checks in the supplementary PDF, where we evaluate Jackknife error bars with respect to the impact of excluding a random agent from the study. These cannot speak to ecological validity, but they provide an indication of how stable the effect measures are with respect to the observed collection of agents.
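For reference, the Jackknife evaluation is computed roughly as follows (a minimal sketch with assumed field names; the statistic shown is the first-result click rate, and any of the reported statistics can be plugged in):

```python
import numpy as np

def jackknife_over_participants(logs, statistic):
    """Leave-one-participant-out estimates and the corresponding jackknife
    standard error for an arbitrary `statistic` of the click logs."""
    participants = sorted({r["participant_id"] for r in logs})
    estimates = np.array([
        statistic([r for r in logs if r["participant_id"] != p])
        for p in participants
    ])
    n = len(participants)
    mean_estimate = estimates.mean()
    se = np.sqrt((n - 1) / n * np.sum((estimates - mean_estimate) ** 2))
    return mean_estimate, se

def first_result_click_rate(rows):
    return sum(r["clicked_position"] == 1 for r in rows) / len(rows)

# Example: jackknife_over_participants(logs, first_result_click_rate)
```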
References:
[1] Final Report: Stigler Committee on Digital Platforms. 2019.
Thanks to the authors for the reply; I have no other questions. Given that the paper is quite interesting, I maintain a relatively positive attitude towards the acceptance of the paper.
The authors conduct a user study on the performative power of search engines, i.e., how much search engine providers can affect the information seen by the end user by tweaking the algorithmic ordering of results. In the specific context and assumptions formulated in this paper, this essentially amounts to measuring click-through rate differences across arrangements of search results. The authors ran a live RCT experiment by injecting different arrangements directly on the result page with the help of a browser extension of their design. Results show that there are notable differences in CTR and therefore commercial search engines have a large performative power.
Strengths
The paper is very well written, with all methodology and assumptions being laid out clearly. Moreover the experiment design and scale seem sufficient for the task at hand according to my understanding. The results are made more solid and generalizable thanks to the use of different providers. Also this work is more complete than some previous user studies of this kind and the performative power angle is original and relevant.
Weaknesses
I have two major concerns:
- I am not sure this paper should appear in NeurIPS. It does not contain any neural system, nor does it indicate how to create one from the results. While these results could in general be useful to an ML practitioner working with search engines, this could be said of many other research outcomes from different fields that typically don't appear in NeurIPS proceedings.
- The paper does not relate enough to previous work on position bias and other types of biases in search results. While I believe new, updated user studies are always valuable, I think the authors should compare their results with those obtained by past studies.
Questions
- Do your results contradict or enrich previous estimations of position bias? If so, what explains this?
- Do you think your method based on the extension is more reliable than, or complementary to, other methods used in past studies, e.g., eye tracking, result randomization, intervention harvesting, ...?
- How do your results relate to other types of biases that have been identified, especially trust bias?
Limitations
Limitations are correctly addressed and very clearly laid out.
Thank you for the feedback. We are glad you enjoyed reading the paper. We hope to clarify your questions by providing additional context related to the comparison points you mentioned. We will incorporate these discussions in the manuscript for a future version.
Other types of biases. As you pointed out, position bias [6], trust bias [4], presentation bias [3], and other behavioral aspects have been shown to impact the effect of ranking on user clicks. In our work we are not interested in pinpointing any mechanism specifically, rather we are interested in directly quantifying the effect of ranking updates on clicks within the context of a given platform (we call this the performativity gap). The different types of biases will naturally enter such a measurement. We argue that it is not necessary to understand all these complex behavioral mechanisms to measure performative power and monitor algorithmic systems.
In that regard a browser extension is very powerful. It serves as a tool to conduct a controlled experiment and measure the effect of specific algorithmic changes under natural interactions of users with the platform. As a result it implicitly captures trust, display, and other behavioral aspects specific to the platform under investigation.
Methodological differences. Information harvesting and randomization are methods to explicitly or implicitly exploit interventions to the ranking algorithm to measure their effect on clicks. These methods all get at the fundamental problem of causal inference; the effect of ranking cannot be measured from click data alone, due to unobserved feedback and biases. Our work is also a way to gather and exploit interventional data. In contrast to prior work, e.g., the work on data harvesting in [2] where the authors consider interventions to the Arxiv ranking algorithm, the Powermeter design performs interventions at the level of the platform-user interface. We emulate updates to the algorithm without actually touching the algorithm underneath. While the observed effect is equivalent to implementing the corresponding change to the algorithm, it allows us to study the Google search algorithm without controlling it, which is an important methodological novelty. We are not aware of any such experimental design in the literature.
Finally, you mentioned eye-tracking studies. They are designed to measure the allocation of visual attention, e.g., to support the design of click models, such as cascade models [5]. However, they cannot replace click statistics and are complementary to performing interventions.
Discussion of prior quantitative insights. While our study pursues a different goal than estimating parameters of a click model, some of our intermediate experimental insights can be compared to prior work. While we are not aware of comparable studies related to interventions a4-a6, interventions a1-a3 constitute pairwise swaps routinely used to infer propensities in click models. The most relevant comparison point is work by researchers at Google from 2010 [3, Figure 1], who published click statistics on FairPairs (adjacent swaps [7]) on Google search. They report an average gain in click counts of 75% when ranked first instead of second under swap 1-2; we observe a gain of 82%. For swap 2-3 they report a 48% gain, whereas we see a gain of 66%. Thus, while qualitatively similar, the effect size in our results is larger. This might be attributed to changes to the website design since 2009, for example the introduction of featured snippets, which leads to a larger spatial separation between results and may increase performativity. While this is only a hypothesis, it demonstrates how effects can change over time even for a fixed platform. With Powermeter we offer a method to reassess such effects within the context relevant for a specific investigation. Beyond swaps, we can provide insights into different counterfactuals that have not been studied before, such as the combination of downranking results and adding visually appealing elements, which is relevant in a regulatory context.
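For clarity, the "gain in clicks when ranked first instead of second" figures above can be read off the logged quantities roughly as follows (an illustrative sketch; field names and the exact estimator are assumptions):

```python
def gain_first_vs_second(logs):
    """Relative gain in clicks on the originally top-ranked result when it is
    shown at position 1 (control) versus position 2 (swap 1-2)."""
    control = [r for r in logs if r["arrangement_id"] == "control"]
    swapped = [r for r in logs if r["arrangement_id"] == "swap_1_2"]
    ctr_when_first = sum(r["clicked_position"] == 1 for r in control) / len(control)
    ctr_when_second = sum(r["clicked_position"] == 2 for r in swapped) / len(swapped)
    return ctr_when_first / ctr_when_second - 1   # e.g., ~0.82 for the 82% figure
```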
Why NeurIPS? We build on the concept of performative power that has its origin in the NeurIPS community [1], and we offer a first account of its practical applicability, which we see as an important contribution to the area of “Social and economic aspects of machine learning" referred to in the CFP. Focusing on algorithmic systems in a broader context than neural modeling is not uncommon for NeurIPS; see work related to auditing and regulation. Similarly, citing again from the CFP: "Machine learning is a rapidly evolving field, and so we welcome interdisciplinary submissions that do not fit neatly into existing categories." While we understand that this is subjective, there is an easy case to be made for why our work is in scope, accessible, and relevant for the broader NeurIPS audience. In any case, we believe that the AC/SAC can ultimately make an executive decision about this, and we would appreciate it if this could be treated independently of the assessment of the quality of our work.
References
[1] Hardt, Jagadeesan, Mendler-Dünner. "Performative power." NeurIPS. 2022.
[2] Fang, Agarwal, Joachims. “Intervention Harvesting for Context-Dependent Examination-Bias Estimation”. SIGIR. 2019.
[3] Yue, Patel, Roehrig. “Beyond Position Bias: Examining Result Attractiveness As a Source of Presentation Bias in Clickthrough Data.” WWW. 2010.
[4] O’Brien, Keane. “Modeling result–list searching in the World Wide Web: The role of relevance topologies and trust bias.” CogSci. 2008.
[5] Craswell, Zoeter, Taylor, Ramsey. “An Experimental Comparison of Click Position-bias Models”. WSDM. 2008.
[6] Joachims et al. “Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search”. TOIS. 2007.
[7] Radlinski Joachims. Minimally invasive randomization for collecting unbiased preferences from clickthrough logs. AAAI. 2006.
Thank you for your response.
Regarding the relevance to NeurIPS, I don't really have a strong opinion on that. With your response, I think the AC has enough input to make a decision.
Thank you for the discussion of quantitative measurements. This increase in PP/bias since 2010 is an interesting result and should appear somewhere in the paper for a final version.
I understand your argument that modeling the behavioral mechanisms causing bias (and PP) is not necessary to monitor search engines, in the sense that the performativity gap is enough to state how much a provider can influence the exposure by reranking its results. The usefulness of modeling the users lies in choosing what measures to take after observing this result. See many studies on fairness in ranking where the underlying user model matters a lot in the final solution (e.g., [1] vs [2], which use the same technique but a different model). In your example with the Google Shopping case, Google could argue that, indeed, the way they place results matters, but that without knowing the user behavior, there is technically no way to prove they are not already doing what's best for users/competition. I can understand how this is out of the scope of the paper, but this should be clearly stated as a limitation of the method then.
Finally, regarding the methodological differences with respect to prior work, I disagree with your answer. First, it is not always impossible to recover the causal effect of ranking from observational data alone. This is the goal of the entire field of causal discovery [3] (there are some arguably strong hypotheses required). Second, intervention (not "information") harvesting does precisely that: infer user biases from click data alone. The work you cite uses interventions to the Arxiv ranking algorithm for evaluation; it is not part of their method. Intervention harvesting certainly has advantages and drawbacks compared to your proposal, but control of the ranking algorithm is not one of them, as neither the extension-based protocol nor the intervention harvesting approach require that.
Overall, while the core study is solid, I think the paper still needs a bit more polishing to clarify its relation to prior work and which specific problem it allows to solve.
[1] https://arxiv.org/pdf/2202.03237 [2] https://arxiv.org/pdf/2205.07647 [3] https://www.sciencedirect.com/science/article/abs/pii/S0888613X22001402
We are happy to do more polishing to clarify the relation to prior work. But let us be more explicit that our goal is not to design fairer or more optimal rankings, nor to debias click data. Thus, the methods you mentioned have a different focus and they are not directly related to our work.
Instead, we are interested in measuring the performative effect of a very specific ranking update; an update that was documented in the context of the Google Shopping case, and found to be anti-competitive. And we tackle the question of how to design experiments and gather data that allow us to obtain a plausible lower bound on the effect of such a specific update. What makes this challenging is that we are not in the position of the platform, but someone outside aiming to monitor a system.
Intervention harvesting, as we understand it, is a method that builds on implicit interventions and extracts information from click logs of multiple historic rankers to debias click models, e.g., for better propensity estimation. But debiasing click data is not what we are looking for, as we are not interested in learning a ranking model from data. Rather, we are interested in measuring outcomes for a specific intervention for which no logs are available. Thus, our focus is to emulate such an intervention and gather click data to estimate its effect. This is unrelated to methods that build on top of available logs to design (fair) ranking algorithms, such as [1,2].
In general, an important distinction to most existing literature in ranking is that we take the perspective of a regulator, not the platform controlling and optimizing their own algorithm. This leads to different problem statements and challenges. For example, we aim to establish a plausible use case specific lower bound for the effect of a specific ranking update performed by a particular platform. This requires us to gather strong empirical evidence from data collected on that platform rather than relying on modeling assumptions of how users might respond to such ranking updates, because such assumptions are necessarily inexact and hard to justify as quantitative evidence, even though at the same time they can be useful for learning, or for guiding the design of potential remedies.
We hope this better answers your question. If you see a relevant connection we missed, after clarifying the specific problem we aim to solve, please let us know, and we are happy to discuss it.
Thank you.
I realize now I misunderstood your motivation. You focus on quantifying the effect of one specific intervention (i.e., one switch for one query or a set of queries) that is under scrutiny, not the general effect of performing interventions on the platform (which would require modeling the user behavior more explicitly to ensure that findings generalize to other interventions/other queries). This is probably the main difference from the literature on biases in IR, which targets the broader goal of quantifying "bias" and the effect of reranking, often with the final goal of learning models based on the study's results. Then, I agree your method is entirely sufficient for the specific problem you consider.
Please try to clarify this in the related work section. Reading the paper again, I probably over-interpreted what was said, but some statements are ambiguous in that regard.
Also, I think it would be nice to state more clearly that some methods use online interventions but require control of the platform (e.g., FairPairs), and some do not require control but rely on observational data with the obvious caveats it implies (e.g., browser extension to collect logs + intervention harvesting), and you propose online interventions without control of the platform.
Thank you for following up and providing a summary of the main points in your comment. We will make sure to remove ambiguity and polish the related work with your feedback in mind.
We thank all the reviewers for the feedback. The attached PDF contains an additional robustness evaluation to support the author-reviewer discussion. We address questions below, responding individually to all reviewers.
This submission presents a Chrome browser extension that modifies the arrangement of search result pages and records user click statistics, for the purpose of studying how the change of result ranking affects user clicks. At first glance, the setting and approach seem identical to what has been studied in prior information retrieval research, and the conclusions also follow previous observations. Hence, most reviewers were concerned about the novelty of this work.
Through lengthy and engaged discussions with the authors, it became clearer to the reviewers that this work focuses on quantifying the effect of one specific intervention in a retrospective manner, rather than on the general phenomenon itself or on a controlled test of its impact. This helped the reviewers reach an agreement about the evaluation of this work, though it also raised some concerns regarding the effective sample size needed for such an analysis to be practically valid.
During the discussion with the reviewers, all of them expressed their support for this work, but also urged the authors to revise the manuscript to reflect what was discussed and clarified in the rebuttal, which would help readers to precisely position and understand this work.
To summarize, we recommend acceptance of this work, but ask the authors to carefully revise the content to make the motivation clearer to its readers.