Monoculture or Multiplicity: Which Is It?
We systematically evaluate the concerns of multiplicity and monoculture in a suite of large language models and prediction tasks.
Abstract
Reviews and Discussion
This paper provides empirical evidence to test two theories about machine learning ecosystems: monoculture and model multiplicity. Monoculture hypothesizes that the ecosystem will tend toward a single model making all decisions, leading to homogeneous outcomes for individuals. Multiplicity suggests that instead there will be many different models, leading to large variation in individual outcomes. The paper tests 50 language models on social science prediction tasks to see whether outcomes are monolithic or varied, as measured by metrics like ambiguity, discrepancy, and the recourse of an individual. The results suggest that real ecosystems lie between the two extremes: model outcomes are quite similar, though not as similar as monoculture predicts, which indicates that it may be rare for individuals to be systemically excluded by all models.
Strengths and Weaknesses
Strengths: The paper brings a nice empirical lens to an important theoretical question in the literature. The findings are actionable, there are natural followups, and the paper is extremely clearly written.
Weaknesses: I worry that "strict monoculture" is a bit of a straw man; the literature talks about algorithmic homogeneity more as a continuum or a tendency. So the finding that there is not "strict monoculture" seems less exciting; I am more excited by the details of empirical findings themselves, which are fascinating. So I might suggest that the discussions, especially around recourse, might want to have more nuance.
Questions
- I wonder if "individuals overwhelming have recourse" is too strong a statement. If 8% of positive instances are rejected by all models, doesn't that seem to be a problem? And the numbers are much much higher for the two ACS tasks in figure 8, or for Table 6. Prior theories of monoculture like Bommasani et al. 2022 do seem to emphasize that recourse is a problem even if only a small fraction of individuals don't have recourse. Should your discussion and claims be more nuanced?
- Relatedly, I do worry that an unsympathetic reader might see the "pure monoculture" hypothesis as a straw man. Some of your results on monoculture and recourse do seem socially problematic, even if we consider monoculture as a tendency rather than an absolute. Can this discussion and claim also be more nuanced? This probably requires some changes throughout the paper.
Limitations
yes
Final Justification
This paper brings a nice empirical lens to an important theoretical question in the literature. The findings are actionable, there are natural followups, and the paper is extremely clearly written.
Formatting Issues
no (except two typos: line 130 "found as empirical" should be "found the empirical" and the 3rd line after equation 1 between line 132 and 133, "corresponds the agreement" should be "corresponds to the agreement")
Dear Reviewer t4zS,
Thank you for the time you spent reviewing our work and your valuable feedback and suggestions! We appreciate your positive comments on our empirical analysis and the value of our findings. We address your remaining questions or concerns in the following paragraphs:
"strict monoculture" is a bit of a straw man [...] suggest that the discussions, especially around recourse, might want to have more nuance
We thank the reviewer for raising this point. We find it helpful to contrast both extremes, strict monoculture and strict multiplicity, as a way to evaluate and reason about a given model ecosystem. To address your comments, we will more clearly frame strict monoculture as the far end of a spectrum encompassing model multiplicity and monoculture, and we will revise our discussion of recourse to better reflect the nuances of our findings.
"individuals overwhelming have recourse" might be too strong a statement.
We agree that the fraction of positive instances experiencing no recourse is indeed significant and hence this phrasing might come across as too strong. We will update the text to reflect this better.
numbers [of individuals experiencing no recourse] are much much higher for the two ACS tasks in figure 8
Thank you for bringing this up. We assume you're referring specifically to the ACSMobility and ACSPublicCoverage tasks, where we observe notably higher rates of individuals experiencing no recourse. We provide some discussion on both tasks in the Appendix, but will revise the text to address and clarify this earlier in the paper.
Both tasks are characterized by high class imbalance, with a low prevalence of the positive class. Since the Rashomon set is constructed based on overall accuracy, it tends to favour models that prioritize the majority class. Especially pronounced for ACSMobility, the Rashomon set includes mainly models behaving highly similarly to the constant majority-class predictor (see also the general model performance on ACSMobility), resulting in a high fraction of individuals receiving no recourse. This may be exacerbated by the low predictive signal of the task, which makes it challenging for models to improve over trivial baselines.
When models are instead selected using balanced accuracy (see Appendix B.2.2), which accounts for class imbalance, the picture changes: fewer individuals experience no recourse, and more receive substantial or full recourse. A similar trend is seen for ACSPublicCoverage.
This suggests that the elevated rate of instances without recourse observed for these tasks may be driven by a combination of class imbalance, low signal, and accuracy-based model selection for the Rashomon set.
If you have any further questions or concerns, we will be happy to address them.
The authors
Thanks, I appreciate your point that it's helpful to first study the extremes, even if they are not actual systems, but I'm glad you will be adding more discussion on this. And I'm glad you will be adding more nuanced discussion into the main paper about the lack of recourse in those domains! I will maintain my score.
Thank you again for your thoughtful engagement with our work. Your comments have been helpful in refining the paper.
The authors
This paper provides an evaluation of current LLMs in terms of the degree of monoculture and multiplicity. As an evaluation study, the experimental work covers 50 LLM models across 6 different prediction tasks with 4 different prompt variations. The paper concludes that LLMs as predictors fall between the extremes of monoculture and multiplicity.
Strengths and Weaknesses
Strengths:
- This work presents a novel evaluation direction regarding the homogenization of LLMs.
- The experimental designs are sound, and the analysis is well-reasoned.
- The paper concludes that racism faced by individuals in LLM predictions cannot be fully addressed from the perspectives of monoculture or multiplicity.
Weaknesses:
- The paper seems to lack deeper insights from the evaluation results. Specifically, the conclusion that LLMs as predictors fall between monoculture and multiplicity does not provide clear guidance for researchers on how to use LLMs effectively. While the authors mention that the harms faced by individuals resist capture by either monoculture or multiplicity, there is still a gap in the work regarding what actions should be taken moving forward.
- The tasks used for evaluation are highly limited. The ACS tasks are all derived from a single dataset, with only different labels. Apart from that, the only other dataset used is the BRFSS Blood Pressure task. In addition to the limited task variety, the task modalities are also restricted, as both tasks are based on structured tabular datasets.
Questions
- What are the possible factors that cause the significant differences in the sample proportions corresponding to rec(x) > 0.5 across different datasets, as shown in Figure 2?
- Could the authors discuss more insights from their evaluation conclusions for the design of future LLMs, or provide recommendations on how to organize LLM models to avoid harming individuals?
Limitations
The work can be further improved in terms of the variety and number of tasks. Additionally, the paper could convey deeper insights from the evaluations.
Formatting Issues
na
Dear Reviewer CF6P,
Thanks a lot for the time you spent on providing valuable feedback on our work and helpful suggestions on improving the paper. We appreciate your positive comments on the novelty of our evaluation and our sound experimental design.
To address remaining concerns regarding our manuscript, we have revised the text where suggested and, to improve the empirical evaluation, added additional results on the Survey of Income and Program Participation (SIPP). Please find below a detailed discussion of the points you have raised.
tasks used for evaluation are highly limited
To address your concerns regarding task diversity, we have added empirical results for one additional task defined on the longitudinal Survey of Income and Program Participation (SIPP). Here, the goal is to predict whether a person's income is significantly above the Official Poverty Measure (OPM). Summarizing the results, we find similar patterns as observed for the other tasks. There is substantial predictive similarity across models: we observe a mean agreement rate of 0.84, which clearly exceeds the baseline agreement rate (63%). Still, agreement remains below full agreement for all model pairs. In terms of recourse, 5% of individuals experience no recourse, 77% experience substantial recourse, and 48% experience full recourse.
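For concreteness, the following minimal sketch shows how agreement and recourse levels of this kind can be computed from a binary prediction matrix; the variable names and random data are purely illustrative and are not taken from our actual pipeline.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
# Illustrative data: preds[m, i] = 1 if model m accepts individual i, else 0.
n_models, n_individuals = 50, 1000
preds = rng.integers(0, 2, size=(n_models, n_individuals))
y = rng.integers(0, 2, size=n_individuals)  # ground-truth labels (1 = positive class)

# Pairwise agreement rate: fraction of individuals on which two models make the
# same prediction, averaged over all model pairs.
agreement = np.mean([np.mean(preds[a] == preds[b])
                     for a, b in combinations(range(n_models), 2)])

# Recourse of an individual: fraction of models that assign them a positive outcome.
# As in our analysis, recourse levels are reported over positive instances.
rec = preds[:, y == 1].mean(axis=0)
no_recourse = np.mean(rec == 0.0)    # rejected by every model
substantial = np.mean(rec > 0.5)     # accepted by a majority of models
full = np.mean(rec == 1.0)           # accepted by every model

print(f"mean agreement: {agreement:.2f}; "
      f"no/substantial/full recourse: {no_recourse:.1%}/{substantial:.1%}/{full:.1%}")
```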
While we acknowledge your point that all ACS tasks are derived from the same underlying data source (US-wide ACS PUMS data), we want to point out that ACS, being derived from US Census data, is one of the most comprehensive surveys worldwide. As such, it provides a rich, diverse, and high-quality representation of the US population, making it a strong foundation for a wide range of prediction tasks such as employment, education or housing.
To ensure comparability with prior work, we adopt five tasks predefined by Ding et al. (2022). These tasks span a broad range of prediction challenges—from high predictive signal to more difficult low-signal settings (such as ACS Mobility)—and collectively make up a diverse and valuable benchmark suite. While some features (such as age, race, and sex) appear across tasks due to their relevance for fairness analysis, their overall feature sets differ, and none is a strict subset of another. Further, each task is constructed on a distinct subpopulation (e.g., adults, employed individuals), with no individual appearing in all five tasks and a maximum pairwise overlap of only 10.1% of individuals relative to the task’s size. Importantly, we view the partial overlap in individuals as an opportunity to study model multiplicity across different contexts (see Bommasani et al., 2022).
task modalities are also restricted
We appreciate the reviewer’s suggestion to include additional task modalities. We agree that extending the analysis to a broader range of modalities is a valuable direction. However, this falls outside the scope of the current paper as it would require a new model suite and task set. We want to note that our focus on tabular data reflects both practical and normative considerations: tabular data is the primary modality for risk assessment in high-stakes domains such as hiring, credit, criminal justice, or other bureaucratic contexts. As a result, much of the normative discourse about the concerns of model multiplicity and algorithmic monoculture, as well as recourse, has centered on this modality. In this sense, our choice aligns with a broader body of research that has largely concentrated on tabular data. Nevertheless, we see expanding to other modalities as a promising direction, and we expect that our evaluation framework could be readily adapted for that purpose.
- What are the possible factors that cause the significant differences in the sample proportions corresponding to rec(x) > 0.5 across different datasets, as shown in Figure 2?
We assume you're referring in particular to the ACSMobility and ACSPublicCoverage tasks, where we observe notably higher rates of no recourse, which also affects the fraction of individuals that experience substantial recourse (rec(x) > 0.5). We provide some discussion on both tasks in the Appendix, but will revise the text to address and clarify this earlier in the paper.
Both tasks are characterized by high class imbalance, with a low prevalence of the positive class. Since the Rashomon set is constructed based on overall accuracy, it tends to favour models that prioritize the majority class. Especially pronounced for ACSMobility, the Rashomon set includes mainly models behaving highly similarly to the constant majority-class predictor (see also the general model performance on ACSMobility), resulting in a high fraction of individuals receiving no recourse. This may be exacerbated by the low predictive signal of the task, which makes it challenging for models to improve over trivial baselines.
When models are instead selected using balanced accuracy (see Appendix B.2.2, Figure 10), which accounts for class imbalance, the picture changes: fewer individuals experience no recourse, and more receive substantial or full recourse. A similar trend is seen for ACSPublicCoverage.
This suggests that factors such as class balance, the predictive signal of the prediction task, and the selection criterion for the Rashomon set can all influence the observed levels of recourse. We hope this adequately addresses your question. If anything remains unclear, we’d be happy to clarify or discuss further.
- Could the authors discuss more insights from their evaluation conclusions for the design of future LLMs, or provide recommendations on how to organize LLM models to avoid harming individuals?
Thank you for raising this point. We will incorporate it more clearly into the discussion. Our work illustrates what a system-level analysis might look like to assess the degree of monoculture or multiplicity within a model ecosystem. We believe that conducting such system-level evaluations and increasing transparency about the extent to which individuals have recourse is both valuable and necessary. This kind of analysis could serve as a starting point for developing something like a "multiplicity index," as suggested by Reviewer 6KfP. Accordingly, we advocate for auditing entire model ecosystems, not just individual models. If homogeneity is identified as a significant concern, then concrete (policy) interventions, such as randomization, may be warranted. For a detailed discussion of the policy implications surrounding monoculture and multiplicity, see also Gur-Arieh et al. (2025).
With regard to individual harm, we caution against treating model multiplicity as a straightforward remedy for the risks posed by monoculture. Both extremes carry distinct concerns, and neither should be viewed as a solution concept for addressing harm in language model ecosystems. Simply increasing diversity among models does not guarantee better outcomes for individuals or society. Instead, our findings suggest that the kinds of harm individuals may experience are not adequately captured by focusing solely on either monoculture or multiplicity. A more nuanced framework is needed to understand and mitigate these risks effectively.
Thank you for your thoughtful comments. We believe the revisions have strengthened our work, and we would be happy to address any further questions or concerns.
The authors
Thank you for your response.
Regarding the insights from this paper, I understand that the intention is to conduct a system-level individual-recourse evaluation. My concern, however, lies in the fact that the recourse evaluation results appear to vary across tasks, making it hard to provide guidance for the subsequent design of model ecosystems. Furthermore, the tasks themselves are limited to binary classification and relatively small-scale risk assessments and predictions. Although the authors acknowledge these as existing limitations, I hope there can be at least one feasible discussion on how the findings of this paper can be applied to design model ecosystems — for instance, selecting models that maximize required outcome diversity, accuracy, or even fairness.
The models evaluated in this paper are open-source, and there may be overlapping base models and training data. I wonder whether these factors place the findings of this work in a somewhat underappreciated middle ground between monoculture and multiplicity. What would happen if the model sets were trained on disjoint datasets? For instance, when evaluating sensitive topics (highly related to national culture) across models trained on different language corpora such as English, Chinese, Hindi, or Japanese, would we observe multiplicity? Alternatively, if the training datasets are highly identical, would this lead to monoculture? If such possibilities exist, it suggests that the conclusions drawn from the current experimental results may be limited and difficult to generalize broadly. I hope the authors have some thoughts on the observed “middle ground between monoculture and multiplicity” and the reasons behind it.
Given these points, I am currently inclined to retain the original score and look forward to hearing the authors’ thoughts on the insightful value of their work.
Dear Reviewer CF6P,
Thank you for your continued engagement and thoughtful feedback.
limited to binary classification and relatively small-scale risk assessments
Thank you for pointing this out. We’d like to clarify that there was a typo in the main paper regarding dataset sizes – this is corrected in the appendix, where the accurate data ranges (60,000 to 3,000,000 data points) are reported, making the dataset sizes quite large.
Regarding our focus on binary classification, this setting is widely used in the literature on model multiplicity and provides a conceptually clean foundation for introducing our proposed metric of recourse with respect to a model set. While recourse does not strictly require binary labels, it does assume the existence of a clearly favorable outcome, such as an accepted loan application or a job offer. We believe that extending this work to a multi-class setting is an excellent avenue for future work; we note that this requires a natural ordering or the existence of a preferred outcome in order for the notion of recourse to be well-defined. For the scope of our initial study, we found it important to be able to compare with prior work on monoculture and multiplicity, which is why our main results illustrate the binary classification setting.
discussion on how the findings of this paper can be applied to design model ecosystems — for instance, selecting models that maximize required outcome diversity, accuracy, or even fairness
Finding the ‘best’ model for a particular objective such as accuracy, fairness or interpretability falls under the umbrella of model selection and typically focuses on identifying a single model. In this context, prior work has argued that when multiple equally good models exist, decision makers have a duty to search for less discriminatory alternatives (Coston et al., 2021; Black et al., 2024). However, model selection alone does not capture aggregate dynamics that can arise in an ecosystem with multiple decision makers – such as monoculture and model multiplicity.
Our work illustrates what a system-level analysis might look like to assess the degree of monoculture or multiplicity present in a given language model ecosystem. Prior work has predominantly examined model multiplicity and monoculture in isolation. Our work provides empirical evidence revealing the interplay between these two phenomena. While primarily diagnostic, we believe that conducting such system-level evaluations is valuable to inform targeted policy interventions. For example, if homogeneity is identified as a significant concern in a particular deployment context, observing a strong tendency towards monoculture may itself justify interventions such as randomization.
For our analysis, we select models for the Rashomon set based on their accuracy. Exploring how alternate objectives or constraints (e.g. due to legal regulations) shape ecosystem-level behavior remains an interesting avenue.
Regarding the variations between tasks, we offer potential explanations in our initial response: reasons include differences in class balance and the predictive signal of the task. High class imbalance can lead models in the Rashomon set to perform similarly to the majority-class predictor, resulting in homogeneous outcomes. Additionally, some tasks are inherently more difficult due to low predictive signal (Ding et al., 2022). We assess this by reporting the performance gap between the majority-class predictor and XGBoost, a strong baseline for tabular data (Appendix C.1); larger gaps indicate that models can meaningfully improve over trivial baselines. More broadly, we agree that variability across tasks complicates the formulation of one-size-fits-all design recommendations. Developing domain-specific solutions may be necessary regardless, due to a range of contextual factors outlined by Gur-Arieh and Lee (2025). As mentioned, we will revise the text to clarify this earlier in the paper.
(continued)
(continuation, please see responses above)
Models evaluated in this paper are open-source. [...] What would happen if the model sets were trained on disjoint datasets? Alternatively, if the training datasets are highly identical, would this lead to monoculture?
To clarify, our evaluation includes both open-source and closed-source models, specifically GPT-3.5 and GPT-4.1. While most models analyzed are open-source and could, in principle, be retrained from scratch with controlled pre-training data, such retraining is beyond the scope of this work – particularly considering the scale and resources required to train large language models.
More importantly, our goal is not to study hypothetical training regimes, but rather to empirically assess the current landscape of widely deployed models as they exist today. We deliberately retain the training data and procedures as employed by different model developers, grounding our findings in the realities of today’s model ecosystem. Technical reports for various open-source models provide insights into their pre-training data mix, which often includes sources such as CommonCrawl, Wikipedia and Wikibooks, GitHub, ArXiv, and Semantic Scholar, suggesting considerable overlap across datasets. Regarding language coverage, most models are primarily trained on English text; however, newer variants – such as Llama 3, OLMo 2 and Qwen – incorporate multilingual data. Unfortunately, most technical reports lack detailed breakdowns by language.
We agree that overlap in training data likely influences the emergence of monoculture. However, even when restricting our analysis to models developed by the same organization – presumably sharing substantial overlap in both training data and development pipelines – we still observe recourse levels and agreement rates that fall between strict monoculture and strong multiplicity. This suggests that high overlap alone does not guarantee algorithmic monoculture and highlights the need for empirical, system-level diagnostics to understand ecosystem behavior in practice. Prior work, such as Bommasani et al. (2022), has explored the effects of different training data regimes in more controlled settings for classical models (e.g., SVMs, random forests, small neural networks). Extending these controlled studies to large foundation models represents an important direction for future research. Bommasani et al.’s model-sharing experiments offer a useful starting point; their results on RoBERTa-base suggest that models are most similar in their predictions when no task-specific adaptation is performed. In contrast, our contribution focuses on documenting and diagnosing the current landscape of the LLM ecosystem, rather than conducting controlled retraining experiments.
We will incorporate this discussion to better highlight the focus and limitations of our work in the revision.
We hope this response addresses your remaining concerns. Please feel free to reach out with any further questions or feedback.
The authors
Different LLMs can make different classification decisions when prompted in the same way. In this paper, this "multiplicity" is seen as an opportunity, an antidote to "monoculture": if an individual was denied a mortgage by one bank based on their income level or other traits, they have "recourse" in that they can go to another bank, which might use another LLM, and potentially be approved for the mortgage by that bank. The experiments in this paper quantify the degree of multiplicity across 50 LLMs and six tasks. They find that there's some consistency across models (more than expected by chance), but most individuals still have recourse; the consistency doesn't rise to the level of monoculture. Variation in prompts and few-shot examples increases multiplicity.
Strengths and Weaknesses
STRENGTHS
The paper has very clear motivation and definitions: the first few pages are a crash course on this topic for someone like myself who doesn’t know much about this area.
The experiment is comprehensive: the authors experiment with a range of LLMs from different companies, with multiple tasks, and with prompt variations.
WEAKNESSES
It is hard for me to gauge the quantitative significance of the results. Neither of the extreme outcomes - fully uncorrelated errors across models (full multiplicity) and fully correlated errors across models (monoculture) - seems plausible or even mathematically possible given non-negligible differences in accuracy across models (see questions below). So it's hard to know what to make of the fact that the truth is somewhere in the middle. I wonder if there's a way to come up with a "multiplicity index", to get a feel for how strongly the errors are correlated across LLMs compared to, say, other ML models (I suspect there will be more multiplicity for the LLMs, especially in the zero-shot setting, but I'm not sure). Perhaps the increased variability due to prompts and/or few-shot examples can help here.
Questions
Could you spend a little bit of space in the paper explaining the following issue: "Since language models are widely miscalibrated on non-realizable tasks, we adopt the approach from Cruz et al. [2024] to calibrate the model predictions. For each model and task, we fit a decision threshold t on n = 2000 samples from a validation set to maximize accuracy. The threshold is then applied to turn the risk scores into class predictions." What is a non-realizable task? Should we be concerned about the fact that this adjustment is necessary? Does it affect the conclusions of the paper?
The models being tested do not all have the same accuracy levels (there's an epsilon tolerance of 0.05 for inclusion in the "Rashomon set"). Doesn't this on its own mean that we can't possibly expect to have full monoculture? Is there a way to quantify the highest level of monoculture given the 0.05 tolerance?
Could you justify this particular value of epsilon? I was a little surprised to see such a high tolerance level as the difference between e.g. 90% accuracy and 95% accuracy seems very substantial to me (half the error rate), though it wasn't clear what the absolute accuracy was for all tasks.
Limitations
Yes, there is a reasonable discussion of the limitations in section 6.
Final Justification
I again thank the authors for their engagement with my comments, and for clarifying the more minor issues I brought up in my review. Our discussion didn't convince me that monoculture isn't a bit of a strawman (this was also raised by another reviewer) and/or a mathematical impossibility given variable accuracy levels across models, and I'm still unsure how encouraged we should be with, say, 8% of individuals not having recourse (as opposed to 1%, or 50%). This could still be a good first paper that would advocate for a more rigorous quantitative framework for evaluating variability in predictions across models.
Formatting Issues
This isn't a major issue and certainly not something that would justify desk rejection but the fact that the appendix was in a separate file made reviewing somewhat more of a hassle because the links to the appendix didn't work.
Dear Reviewer 6KfP,
Thanks a lot for the time you spent reviewing our work and for your valuable feedback! We appreciate your assessment of our work as accessible and of our experiments as comprehensive.
We address your questions in the following paragraphs:
- What is a non-realizable task? Should we be concerned about the fact that this adjustment is necessary? Does it affect the conclusions of the paper?
A prediction task is non-realizable when no model, regardless of its complexity, capacity, or training data, can achieve perfect performance. This typically stems from inherent stochasticity in the data-generating process, underdetermined problem formulations, or a lack of ground truth. This is a relevant distinction from the body of literature that focuses on the evaluation of LLMs on realizable tasks, where there is a unique correct label for each data point (such as factual question-answering).
Threshold adjustment is a standard component of classification tasks, not a cause for concern. Since models often output probabilities, adjusting the threshold allows alignment with trade-offs between sensitivity and specificity, particularly in imbalanced or cost-sensitive settings. In the main paper, we report results optimizing for balanced accuracy, which we consider appropriate given the class imbalance of some tasks.
To address your concerns, we also re-ran the analysis optimizing thresholds for overall accuracy. Results were largely consistent: observed recourse levels remained between strict monoculture and strong multiplicity. However, for highly imbalanced tasks (e.g., ACSMobility and ACSPublicCoverage), accuracy optimization favors models that resemble the majority-class predictor, resulting in high fractions of individuals with no recourse—models we consider unlikely to be deployed in practice. We will include those results in the Appendix of the paper for full transparency.
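To illustrate the calibration step concretely, here is a minimal sketch of the threshold fitting; the simple grid search and the synthetic data are illustrative only and do not reproduce our exact implementation (which follows Cruz et al., 2024).

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def fit_threshold(scores_val, y_val, metric=balanced_accuracy_score):
    """Pick the decision threshold on validation risk scores that maximizes `metric`.
    Illustrative grid search over observed scores, not our exact implementation."""
    candidates = np.unique(scores_val)
    return max(candidates, key=lambda t: metric(y_val, (scores_val >= t).astype(int)))

# Illustrative usage on synthetic validation scores (n = 2000, as in the paper).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=2000)
scores_val = np.clip(0.35 * y_val + rng.normal(0.4, 0.2, size=2000), 0.0, 1.0)

t_bal = fit_threshold(scores_val, y_val)                    # balanced accuracy (main paper)
t_acc = fit_threshold(scores_val, y_val, accuracy_score)    # overall accuracy (re-run above)

scores_test = np.clip(rng.normal(0.5, 0.25, size=100), 0.0, 1.0)
y_pred = (scores_test >= t_bal).astype(int)  # the threshold turns risk scores into class predictions
```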
- [Does the tolerance of 0.05 for inclusion in the "Rashomon set" imply] we can't possibly expect to have full monoculture? Is there a way to quantify the highest level of monoculture given the 0.05 tolerance?
We appreciate the reviewer’s observation that differing accuracy levels already act as a force against monoculture. Indeed, two models with accuracies a1 and a2 cannot agree on more than a 1 − |a1 − a2| fraction of the overall population. However, this bound applies at the population level.
Our analysis focuses on positive instances, where it does not directly apply without assumptions about the conditional agreement. Assuming the positive class constitutes at least an ϵ-fraction of the dataset, perfect agreement on this subset (i.e., strict monoculture) remains theoretically possible—even when overall accuracies differ slightly. We will clarify this in the revised paper.
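For reference, the population-level bound can be spelled out as follows; this is standard reasoning, and the notation below is ours.

```latex
% Two models f_1, f_2 with accuracies a_1 >= a_2 on the same test distribution.
\begin{align*}
a_1 - a_2 &= \Pr[f_1(X) = Y] - \Pr[f_2(X) = Y] \\
          &= \Pr[f_1(X) = Y,\ f_2(X) \neq Y] - \Pr[f_1(X) \neq Y,\ f_2(X) = Y] \\
          &\leq \Pr[f_1(X) \neq f_2(X)],
\end{align*}
% hence the pairwise agreement rate satisfies
\[
\Pr[f_1(X) = f_2(X)] \;\leq\; 1 - |a_1 - a_2| .
\]
% Example: a_1 = 0.95 and a_2 = 0.90 cap population-level agreement at 0.95. Perfect
% agreement on the positive class alone remains possible whenever all disagreements
% can be placed on negative instances.
```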
Even given this additional information, we maintain that strict monoculture provides a natural and interpretable baseline, capturing the limiting case of full predictive homogeneity.
- Could you justify this particular value of epsilon? [...] 90% accuracy and 95% accuracy seems very substantial to me [...] wasn't clear what the absolute accuracy was for all tasks.
Thank you for this question. We fully agree that a 5% accuracy difference (e.g., 90% vs. 95%) is meaningful, as it reflects a substantial relative change in error. In prior work, values between 0.01 and 0.5 are commonly explored (see, e.g., Marx et al., 2020; Hsu et al., 2023; Watson-Daniels et al., 2023).
Our particular choice of ϵ = 0.05 is not intended as a normative threshold, but reflects a practical trade-off: including plausible deployment candidates from multiple major LLM providers while avoiding overly narrow empirical Rashomon sets. In fact, for some tasks, lowering ϵ to 0.01 reduces the empirical Rashomon set to a single model.
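As a small illustration of this selection criterion, consider the sketch below; the model names and accuracy values are hypothetical, and our selection additionally considers a balanced-accuracy variant (Appendix B.2.2).

```python
def empirical_rashomon_set(accuracies: dict[str, float], epsilon: float = 0.05) -> list[str]:
    """Return the models whose accuracy lies within epsilon of the best observed accuracy."""
    best = max(accuracies.values())
    return [name for name, acc in accuracies.items() if acc >= best - epsilon]

# Hypothetical accuracies for a handful of models on one task.
accs = {"model_a": 0.81, "model_b": 0.79, "model_c": 0.77, "model_d": 0.71}
print(empirical_rashomon_set(accs, epsilon=0.05))  # ['model_a', 'model_b', 'model_c']
print(empirical_rashomon_set(accs, epsilon=0.01))  # ['model_a'] -- shrinks to a single model
```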
That said, our key findings remain consistent even with more restrictive values. Currently, Figure 3 provides some insight into this, showing how the fraction of individuals with no recourse is affected by different ϵ values. In the revised version of the paper, we will further expand on this by adding a plot of the distribution of recourse for different ϵ values for greater clarity.
Finally, we agree that the Rashomon set abstracts away from absolute performance, which is why we report overall performance metrics in the Appendix. We will make this connection clearer in the revision.
hard to know what to make of the fact that the truth is somewhere in the middle [...] I wonder if there's a way to come up with a "multiplicity index"
Thank you for raising this important point. Our goal is to illustrate what a system-level analysis of model ecosystems might look like, highlighting the degree of monoculture or multiplicity and its implications for individual recourse. In light of our findings occupying the middle ground between strict monoculture and strong multiplicity, we recommend conducting such system-level analyses and creating transparency by reporting the extent to which individuals have recourse.
We appreciate your suggestion to develop a “multiplicity index” to compare different model ecosystems. This could be a valuable next step, and we hope our work provides a good starting point for such efforts. Importantly, we caution against viewing multiplicity as a solution concept to monoculture and vice versa. A more nuanced framework is needed – one that considers ecosystem dynamics and captures individual harms arising from it.
If you have any further questions or concerns, we will be happy to address them.
The authors
Thanks, I appreciate the authors' detailed comments, which helped clarify some issues I was uncertain about. I encourage them to include these clarifications in the revised version of the paper. The fundamental issue I had remains, however. Monoculture does not seem like a plausible outcome in practice, so it's a little hard to know what to make of the fact that the empirical findings are somewhere between multiplicity and monoculture. I still think this paper is fine to accept as it makes a nice conceptual point, but it would be stronger with a quantitative framework that could help determine how impressed we should be with the proportion of individuals that have "recourse", given the number of models and prompting setups in an experiment, and given the differences in accuracy between models. I don't see a reason to change my score.
Dear Reviewer 6KfP,
Thank you for your continued engagement and thoughtful feedback.
Monoculture does not seem like a plausible outcome in practice
We agree that strict monoculture represents the far end of a spectrum encompassing model multiplicity and monoculture, and that in practice multiple factors – such as differing overall accuracy of models, or the use of custom datasets or model adaptations by different decision-makers – can act as a force against it. We discuss this point in the Discussion section of our work. At the same time, other factors – such as shared training data or components, and legal or regulatory constraints in a domain – can push models toward monoculture. For this reason, we believe it remains valuable to empirically assess the current state of the model ecosystem. As a conceptual tool for evaluating and reasoning about such ecosystems, we find it useful to contrast the two extremes of strict monoculture and strict multiplicity. To address your concern, we will more clearly frame strict monoculture as the far end of this spectrum.
We hope that this clearer framing addresses your concern and will incorporate the clarifications prompted by your feedback into the revised version of the paper. They have already helped us further refine our work.
The authors
The paper addresses the presence of two phenomena; monoculture and model multiplicity. Monoculture indicates that models trained on similar data are going to have similar decision making, indicating that certain individuals could potentially be discriminated against by all models used by decision-makers for high-consequential tasks. Model multiplicity indicates that models trained in the same (or similar) circumstances, so on similar data, while achieving similar performance (e.g., in terms of accuracy) might still have diverging predictions. This paper investigates the degree to which models lean toward monoculture or multiplicity.
The authors evaluate 50 pretrained base and instruction-tuned LLMs, open and closed-source, on six highly consequential tasks. This is done through zero-shot prompting and 10-shot prompting. Two metrics are used: a pairwise agreement rate, which measures the overlap in individuals that two models would both accept, and recourse, which is the fraction of models that give an individual a positive outcome (acceptance).
Results indicate that models do not lean toward either monoculture or multiplicity, but sit somewhere between. The recourse levels do not support monoculture, yet the models still show strong shared biases, as underscored by the high agreement rates.
Strengths and Weaknesses
Strengths
- The paper tackles an important and interesting dichotomy of model behavior, namely monoculture versus model multiplicity. The findings can be highly impactful for tasks with high consequences and give better insights into model robustness and informed deployment choices in the real world.
- The evaluation setup is well-designed. The tasks chosen make sense: they are highly consequential, individual-based prediction tasks. The authors conduct experiments at a large scale in terms of the number of models used (50). The evaluation metrics used correspond to the objective of the paper.
- The results are well-presented, and the additional analyses are well-motivated and give a holistic understanding of monoculture or multiplicity in models, such as the prompt variation analysis.
- I think in general, the paper really thinks about the practical usage of LLMs and tailors its setup accordingly.
Weaknesses
- I feel that the Introduction section could benefit from a bit more clarity and conciseness. Model multiplicity and monoculture could be explained a bit better in the introduction. A direct link to the task of risk assessment for individuals could make it clearer. What are the consequences of model multiplicity and monoculture for this task? Maybe a figure that represents the problem statement would be nice? Also, I am not sure whether the results figure has the desired effect in the Introduction, and in the contributions section certain details could be left out and explained in the respective sections directly.
- The Results section could be made stronger by discussing certain analyses. For instance, since this is such a highly consequential task, it would be highly valuable to include insights on whether there is a pattern regarding who the individuals in the 7% with no recourse are. Further, the results for recourse are currently aggregated, but it would be very interesting to know how this changes across model size and/or family and base pre-trained vs. instruction-tuned.
- The results clearly showcase that there is no clear-cut case of monoculture or multiplicity. It would really strengthen the paper if there is a discussion on whether there are specific instances (e.g. specific types of inputs) where the models perform rather monotonously versus this multiplicity.
Questions
- Why specifically 10-shot learning? Do you notice different results when reducing the amount of samples in few-shot learning?
- Does x always represent a different individual or can there be overlap between individuals across the ACS tasks?
- Why do you use base pre-trained versions of the model as well? I would imagine that most decision-makers would use the instruction tuned version.
- Line 217: Is it that bad that models yield highly similar outputs? For such high-consequential tasks, it might actually be a good thing that there is consistency in predictions across different models? Of course, this is with the assumption that there is no bias against specific individuals due to unrelated factors. Maybe this is also exactly what you mean and I might have missed this point, else it would be great to clarify this.
Limitations
yes
Final Justification
The authors have addressed many of my questions and concerns. I do agree with other reviewers that some takeaways for LLM practitioners and researchers could be discussed more clearly (regardless of the task) and hence did not raise my score. I think there is still merit in including something like this in this paper but I am not sure whether that could be added in this iteration.
Formatting Issues
n/a
Dear Reviewer tNUx,
Thanks a lot for the time you spent reviewing our work and your detailed and valuable feedback! We are pleased to read your positive comments on our empirical analysis and the value of our findings, contributing to a “holistic understanding of monoculture or multiplicity”. We address your remaining questions in the following paragraphs:
- Why specifically 10-shot learning? Do you notice different results when reducing the amount of samples in few-shot learning?
We had performed a sensitivity analysis using 4-shot and 8-shot learning on ACSIncome and noticed that there was no difference—we’ll include these checks in the Appendix of the paper. We ended up showcasing the results for more samples to allow for considerations related to in-context variations and class balancing. Other than that, we acknowledge that the exact number of shots is somewhat arbitrary.
- Does x always represent a different individual or can there be overlap between individuals across the ACS tasks?
Thank you for the question. While all five ACS tasks adopted in this work are derived from the same data source, US Census data from 2018, each task is constructed on a distinct subpopulation (e.g., adults, employed individuals). As a result, individuals do not consistently appear across tasks: the maximum overlap in terms of individuals is at most 10.1% for any pair of tasks, 0.8% for any triplet, and 0.08% for any combination of four tasks—measured relative to each task’s size. No individual appears in all five tasks.
- Why do you use base pre-trained versions of the model as well? I would imagine that most decision-makers would use the instruction tuned version.
We thank the reviewer for raising this point. As shown in Appendix B.1, we find that across many tasks, there is often no clear performance gap between base models and instruction-tuned variants. Including base models thus provides a more comprehensive picture. In line with your comment below, they also serve as a useful reference for analyzing the impact of instruction tuning on multiplicity and monoculture. We will update the paper to include results separated for base and instruction-tuned models.
Base models can naturally be more attractive to decision makers for several reasons. Unlike instruction-tuned models, they do not impose an instruction-response format, offering greater flexibility to be adapted in custom workflows or in deployment settings where a particular alignment may be undesirable. Moreover, while instruction-tuned models often show better alignment with human preferences on realizable tasks, they have also been observed to exhibit overconfident risk scores (Cruz et al., 2024) on the tasks we study.
It would be very interesting to know how [recourse] changes across model size and/or family and base pre-trained vs. instruction-tuned.
We appreciate the reviewer’s suggestions and have added additional results conditioned on either model developers or model type (base vs. instruction-tuned).
When grouping by model developer, we find no consistent trend across tasks for specific model families. Notably, all Rashomon sets contain models from at least four different developers, though for some tasks and families, the number of models is restrictively small (only one or two models), limiting interpretability. In terms of model agreement, the finer-grained results align with our aggregate-level findings. Among model families with at least three members in the Rashomon set, agreement rates consistently fall between the extremes of strict monoculture and strong multiplicity.
Extending the analysis to include agreement rates across all models—not just those in the empirical Rashomon sets—we observe that Qwen models exhibit slightly higher in-family agreement than others. In contrast, OLMo and Gemma models tend to show slightly lower agreement within their respective families.
When grouping by instruction tuning, we observe slightly higher agreement rates among instruction-tuned models across tasks. However, recourse levels remain comparable between base and instruction-tuned models.
We will update the revised paper to include these more fine-grained analyses and their implications.
it would really strengthen the paper if there is a discussion on whether there are specific instances [...] where the models perform rather monotonously versus this multiplicity.
Thanks for raising this point. We absolutely agree that this would strengthen the paper and will include a plot showing recourse levels by demographic attributes (e.g., age, sex, race) in the revised version to examine more fine-grained disparities.
A better understanding of which individual-level factors might contribute to whether an individual is more likely to have no recourse or many opportunities for recourse in a (language) model ecosystem would be an interesting endeavour for future work.
Introduction section could benefit from a bit more clarity and conciseness
Thank you for your detailed feedback and suggestions—we will revise the introduction accordingly. Regarding your comment on Figure 1, could you kindly elaborate on which aspects you find non-ideal? Any additional detail would be greatly appreciated and will help us improve this figure more effectively.
Line 217: Is it that bad that models yield highly similar outputs? For such high-consequential tasks, it might actually be a good thing that there is consistency in predictions across different models?
Thanks for bringing this up. Models agreeing on consequential tasks can be seen as an indicator of consistency; on the other hand, this is exactly what the literature on monoculture warns about. It may mean that individuals have no opportunity for recourse. In particular for non-realizable tasks, this might be undesirable. Our point here is not to provide a normative framework, but to evaluate whether this occurs in practice in language model ecosystems. We will adjust the opening of Section 4 to clarify this.
Finally, we thank the reviewer for several concrete suggestions for improvements of the flow and layout of the manuscript (e.g. shortened introduction, link to risk assessment earlier, improved opening of Section 4). We agree with all points made by the reviewer and will update the manuscript accordingly.
If you have any further questions or concerns, we will be happy to address them.
The authors
Thank you for addressing my concerns, answering my questions, and adding the additional analysis!
A discussion on how to deal with this model multiplicity and/or monoculture, based on what you see in these tasks and extrapolated to general (task-agnostic) scenarios, could be a nice addition as well.
With respect to Figure 1, by itself it is completely fine and clear. Personally, the effect of already showing a figure of the results in the introduction did not add much to the story there for me. It is actually discussed in the Results section and I feel that is a more fitting placement for it (but this might be a very personal opinion). A less detailed figure in the introduction would work better for me, e.g., where you highlight more the discourse surrounding model multiplicity and monoculture.
I will keep my score the same.
Dear Reviewer tNUx,
Thank you again for your thorough and insightful feedback. Your comments have helped us further refine our work. We especially appreciate your suggestions regarding Figure 1, and will carefully consider making adjustments for the revised version of the paper.
The authors
Summary
The paper seeks to investigate the degree to which language models tend towards one or the other of two "poles": monoculture, where all models essentially behave the same, and multiplicity, where all models, while solving tasks with similar accuracy, exhibit huge variance in outcomes. The authors test a number of different language models with different prompts and different tasks, and determine that the reality of model behavior lies somewhere between these two extremes.
Strengths
Reviewers appreciated the creative problem formulation, the clarity of presentation, and the detailed and thorough experimental evaluation.
Weaknesses
Reviewers had mixed feelings about the framing of the paper, especially regarding the idea of a 'strict monoculture'. Some reviewers felt that the argument was somewhat of a strawman, and that in reality we would never see a strict monoculture. There were also a number of detailed questions about the experiments.
Author Discussion
There was extensive and productive discussion between authors and reviewers, with the paper adding experiments, changing framing to soften some of the bolder language, and generally being improved. Most reviewers were quite satisfied with the outcome of the discussion.
Justification
This paper tees up an interesting discussion of importance in an environment where we are using multiple models and are concerned about algorithmic monoculture. While there is room for improvement in the delivery of the paper itself, the work is solid and is likely to spark further discussion in the community.