PaperHub
Overall: 6.0/10 · Poster · 4 reviewers (min 3, max 4, std. dev. 0.4)
Ratings: 3, 4, 4, 4 · Confidence: 3.5
Novelty 2.8 · Quality 2.8 · Clarity 2.8 · Significance 2.5
NeurIPS 2025

Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Preference reasoning · matching markets · stability · comprehension

Reviews and Discussion

Review
Rating: 3

The paper presents a study of LLM's ability on several matching related tasks—generating stable matchings, detecting and fixing instabilities, and answering preference queries. The authors curate a dataset comprising 2850 questions and find that across seven models, accuracy is reasonable on 10×10 markets (for reasoning models) but collapses on 50×50: outputs become unstable or invalid, blocking pairs are missed, and attempts to “repair” near-stable matchings often introduce new flaws. Parameter-efficient LoRA fine-tuning sharply improves small-scale performance yet leaves the large-instance failures unresolved, highlighting that current LLMs still lack robust, long-horizon algorithmic reasoning on these matching tasks.

Strengths and Weaknesses

Strengths: this is an underexplored area bridging matching and LLMs; it is an interesting topic, and the paper is relatively clearly written. There are sufficient experiments in the evaluation section.

Weaknesses:

  • Intuitive, low-impact main finding – The headline result is that state-of-the-art “reasoning” LLMs underperform on classical matching tasks. Given that the deferred-acceptance (DA) algorithm is trivial to code, runs in O(n^2) time, and guarantees stability, practitioners have no incentive to replace it with a brittle black-box whose accuracy collapses on 50 × 50 instances. Thus the negative outcome, while well documented, confirms an expectation rather than delivering a surprising finding or advance.

  • Limited novelty in empirical patterns – On top of the above, the authors do not end up deriving or proposing an approach that reliably closes the gap across the board, leaving little actionable guidance. Observations that larger models perform better and LoRA fine-tuning helps only on simple cases mirror the findings of prior reasoning benchmarks and papers, diluting the contribution’s impact.

  • Missing algorithmic baselines – The paper reports only language-only results. Omitting a direct DA baseline or a hybrid LLM-calls-DA agent prevents readers from quantifying the true performance gap and weakens the core motivation: if a ten-line Python DA script (easily produced by modern tool-calling LLMs; a minimal sketch is given at the end of this section) solves the problem perfectly, why test pure language execution at all? If there's little motivation to use pure language execution in the first place, then it undercuts the significance of the key negative result that the authors try to highlight.

  • Narrow benchmark scope – Another direction in light of the above is to consider this paper as a benchmark paper. However, the dataset comprises just 300 synthetic, one-to-one preference profiles from two stylized distributions, with no ties, quotas, or many-to-one constraints. Real markets are larger and messier, so the documented failure modes may not generalize, and the benchmark’s appeal may be limited to a small sub-community.

Taken together, these points suggest the work functions more as a cautionary note than as a substantive advance, raising doubts about its novelty and significance for a broad-audience venue such as NeurIPS.
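For concreteness, the kind of baseline referred to in the third bullet above could look as follows. This is a minimal sketch of proposer-optimal deferred acceptance (Gale-Shapley); the 0-indexed list encoding and the function name are illustrative assumptions, not the paper's input format.

```python
# Minimal proposer-optimal deferred acceptance (Gale-Shapley) sketch.
# Assumes complete, strict, 0-indexed preference lists:
# men_prefs[m] is man m's ranking of women, best first (and vice versa).
def deferred_acceptance(men_prefs, women_prefs):
    n = len(men_prefs)
    # rank[w][m] = position of man m in woman w's list (lower is better)
    rank = [{m: r for r, m in enumerate(plist)} for plist in women_prefs]
    next_choice = [0] * n      # index of the next woman each man will propose to
    engaged_to = [None] * n    # engaged_to[w] = man currently held by woman w
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = engaged_to[w]
        if current is None:
            engaged_to[w] = m                  # w accepts her first proposal
        elif rank[w][m] < rank[w][current]:
            engaged_to[w] = m                  # w trades up and rejects `current`
            free_men.append(current)
        else:
            free_men.append(m)                 # w rejects m
    return {m: w for w, m in enumerate(engaged_to)}   # man -> woman
```

Under the stated assumptions this terminates after at most n^2 proposals and returns the man-optimal stable matching.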

Questions

  • Can the authors better summarize and present the key novelty and impact of their paper, in light of the points I raised in the weaknesses section?
  • The case and result for finetuning perhaps warrant a deeper investigation. The authors claim in line 339 that "... fine-tuning clearly improves the reasoning capabilities". How is the training data generated? Are they all from the same settings? It will be helpful to know that this improvement is a true improvement in its reasoning capabilities, instead of overfitting / formatting improvement, and whether it generalizes to different matching types, e.g. is one finetuned on ML able to yield performance gain with IC distributions, and e.g. whether one finetuned on 50x50 data can improve performance on 10x10. It seems slightly puzzling why there's zero and even negative improvement in the hard setting if the conclusion is that there are improvements in reasoning capabilities.
  • Your benchmark encodes preferences in a highly compact, tabular form (e.g., “M1: [W5, W3, W4, W2, W1]”), whereas any real-world motivation for using an LLM—rather than calling the DA algorithm directly—would presumably involve free-form, natural-language preference statements. Have you experimented with preferences expressed as natural sentences or excerpts on real items rather than numbered dummies? If not, what challenges prevent such an extension, and how might you expect model performance to change under more realistic textual inputs?
  • As mentioned above, SOTA LLMs like GPT-O3 are equipped with tool-calling and evaluation functionalities. Do they possess the capability to compute the solution algorithmically by writing code, and if so, how does this fit into the overall agenda of the paper?

Limitations

yes

Justification for Final Rating

The authors’ rebuttal is clear and thorough, but it echoes the reservations I had earlier - this is a setting where there exist simple algorithms that solve the problem without computational burden. Given this, the applicability and necessity of LLMs are inherently limited, as practitioners or advanced LLMs with tool-calling abilities can simply adopt simple algorithms to solve them easily, downplaying the paper's significance.

On the merits of using them as a new benchmark on improving LLMs, I'm also not certain that diagnosing these failures in such a simplified domain constitutes a major contribution for the main track. As the authors stated in the rebuttals, moving beyond tabular form introduces new complexities, but the current framework of using those tabular form inputs instead of natural language does not line up with the nascent field that tries to use natural language for matching or preferences. As such, I maintain my original assessment and score.

Formatting Issues

none

Author Response

We appreciate the important points raised by the reviewer. We address their questions below:

Can the authors better summarize and present the key novelty and impact of their paper, in light of the points I raised in the weaknesses section

  1. "Intuitive finding": While it is “expected” or “intuitive” that LLMs would fail with arbitrary large input sizes, we find it noteworthy that models with impressive performance on diverse reasoning tasks fail with inputs much smaller than those relevant in real-world scenarios. The fact that the input sizes are well below the context-window limits of each of these models, and since they understand and successfully execute the appropriate algorithms with even smaller instances, makes this finding less intuitive, given that the size of the input should not affect the ability to mechanically execute a series of steps. Furthermore, this is the first work (to the best of our knowledge) that evaluates LLMs in the domain of algorithmic reasoning with preferences.

  2. "Limited novelty in empirical patterns": Through our analysis of LLMs’ ability to reason about the correctness of pre-computed solutions (section 5), about provided preferences (section 6), we identify how LLMs are susceptible to hallucinations that compound over a large sequence of steps, resulting in a failure to compute correct solutions. We show that this challenge is addressed for small markets by supervised fine-tuning on reasoning traces generated using a Python-based execution of the relevant algorithm.

  3. "Missing algorithmic baselines/benefits of natural language": As is articulated in the paper, there exist polynomial-time algorithms for each task considered. Our question, hence, is not whether LLMs (tool-assisted or not) can improve over such algorithms, but instead whether they can reason about preferences as well as understand and execute the algorithms required to compute desirable solutions. The reason for this is precisely to leverage the natural language capabilities of LLMs, which have the potential to serve as an end-to-end solution for preference aggregation scenarios, where they can elicit preferences from natural language inputs [1][2], apply appropriate algorithms to aggregate these preferences (ensuring desirable properties), and ultimately explain solutions to stakeholders involved. For fool-proof explainability, it is essential that LLMs fully understand how solutions are computed. As we show in section 5 (“Detecting Instability”), even current SOTA reasoning LLMs are susceptible to errors in judgment about solutions that are pre-computed, which raises questions about their ability to effectively function as explainable systems when using tools to arrive at solutions.

  4. "Narrow benchmark scope": The dataset we curate adequately allows us to demonstrate the shortcomings of LLMs in algorithmic reasoning with preferences. We specifically select IC and ML instances as representatives of two sides of the spectrum of correlation between preferences of different agents. While IC instances require the execution of the original DA algorithm, ML instances can be solved using a simplified version that runs in O(n) steps, allowing a comparison in terms of the complexity required to solve an instance. The reviewer correctly points out the existence of more complicated variations of the simplistic setting of one-to-one matching with complete and strict preferences. While some of these variant settings are NP-hard, those that can be solved in polynomial time require the execution of algorithms that are more complicated than the original (widely known) version of DA. Given that LLMs fail to consistently execute even the simplest version of the algorithm, we suspect similar, if not worse, performance in more complicated settings.

We shall include more detailed discussions on the above points in the final version of the paper.

[1] Huang, D., Marmolejo-Cossío, F., Lock, E., & Parkes, D. (2025). Accelerated Preference Elicitation with LLM-Based Proxies. arXiv preprint arXiv:2501.14625.

[2] Soumalias, E., Jiang, Y., Zhu, K., Curry, M., Seuken, S., & Parkes, D. C. (2025). LLM-Powered Preference Elicitation in Combinatorial Assignment. arXiv preprint arXiv:2502.10308.

The case and result for finetuning perhaps warrant a deeper investigation ... It seems slightly puzzling why there's zero and even negative improvement in the hard setting if the conclusion is that there are improvements in reasoning capabilities.

The training data used for fine-tuning consists of 10,000 instances with sizes ranging from 5 agents on each side to 50 (Hard), and includes both types of instances, i.e., IC and ML, in equal proportion. Each instance is generated randomly, and there is no overlap between instances in the training and evaluation sets. Given that any mistake in the execution steps of the DA algorithm would lead to an incorrect solution, the fact that fine-tuning clearly enhances LLMs’ ability to compute stable matchings indicates some improvement in their ability to reason about preferences and execute the necessary steps.

However, as we point out in the paper, this does not translate into a clear improvement in the case of larger markets (the models do improve in terms of the instability rate, but not in terms of the percentage of instances where stable matchings are computed). Additionally, based on a new set of experiments, we observe that including only a single type of instance (IC or ML) in the training data leads to an overall drop in performance, as compared to the case when both types are included. Similarly, models trained only on Hard instances (50x50) do not improve on Easy instances (10x10). While these observations indicate that LLMs perform better on the kind of instances they are trained on, they serve as a starting point for improving LLMs’ ability to solve problems involving the execution of algorithms and reasoning about preferences. For example, a meaningful next step is to combine fine-tuning with the use of step-level verification modules for preference reasoning, or modules that keep track of information about the algorithm's intermediate states (e.g. the proposals accepted/rejected).

We shall include more detailed results and discussions about the effect of different fine-tuning hyperparameters and ways to improve LLMs further, in the updated version of the paper.
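To make the data-generation setup above concrete, here is a minimal sketch of sampling an IC instance and rendering it in the tabular encoding quoted elsewhere in the reviews ("M1: [W5, W3, ...]"). The ML distribution, the train/eval split logic, and the exact reasoning-trace format are not reproduced here, and the helper names are illustrative.

```python
import random

# Sketch: sample an Impartial Culture (IC) instance of size n, i.e. every
# preference list is an independent uniform-random permutation.
def sample_ic_instance(n, seed=None):
    rng = random.Random(seed)
    men_prefs = [rng.sample(range(1, n + 1), n) for _ in range(n)]
    women_prefs = [rng.sample(range(1, n + 1), n) for _ in range(n)]
    return men_prefs, women_prefs

def render_tabular(men_prefs, women_prefs):
    # Render in the compact "M1: [W5, W3, ...]" style used in the benchmark.
    lines = [f"M{i + 1}: [{', '.join('W%d' % w for w in plist)}]"
             for i, plist in enumerate(men_prefs)]
    lines += [f"W{i + 1}: [{', '.join('M%d' % m for m in plist)}]"
              for i, plist in enumerate(women_prefs)]
    return "\n".join(lines)

men, women = sample_ic_instance(n=5, seed=0)
print(render_tabular(men, women))
```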

Your benchmark encodes preferences in a highly compact, tabular form ... If not, what challenges prevent such an extension, and how might you expect model performance to change under more realistic textual inputs?

While we show that LLMs are robust to the terminology used in describing the matching setting (e.g. “workers” and “tasks” instead of “men” and “women”), we do not evaluate LLMs using preferences expressed in natural language. The reason is simple: it adds another layer of complexity, namely a task that requires interpreting how various alternatives are compared with each other in natural-language inputs. Since such expressions of preference in natural language often do not contain an interpretable order of alternatives that is complete and strict (often having inconsistencies such as cycles), we suspect the performance would decrease (or at best, stay the same) with the addition of this extra step.

As mentioned above, SOTA LLMs like GPT-O3 are equipped with tool-calling and evaluation functionalities. Do they possess the capability to compute the solution algorithmically by writing code, and if so, how does this fit into the overall agenda of the paper?

This is an important point raised by the reviewer. Through a few preliminary tests, we have verified that LLMs can indeed write Python code to solve the simplistic two-sided matching problem accurately. However, there are two important reasons why evaluating LLMs’ capabilities without tool use is relevant in this setting.

The first, as discussed above, has to do with their potential to act as end-to-end solutions that parse preferences expressed in natural language, execute the necessary procedures to compute desirable outcomes, and explain important aspects of the solutions to the (potentially large number of) individuals involved. While tool-use may assist LLMs in computing correct solutions, it does not guarantee that they can properly reason about such pre-computed solutions (as demonstrated in section 5).

Secondly, most variants of matching markets in real-world applications have constraints that make the problems computationally intractable [1][2][3]. With improved capabilities, LLMs have the potential to act as optimizers that compute desirable solutions even in situations where polynomial-time algorithms do not exist.

[1] Delorme, M., García, S., Gondzio, J., Kalcsics, J., Manlove, D., & Pettersson, W. (2019). Mathematical models for stable matching problems with ties and incomplete lists. European Journal of Operational Research, 277(2), 426-441.

[2] Kwanashie, A., Manlove, D.F. (2014). An Integer Programming Approach to the Hospitals/Residents Problem with Ties. In: Huisman, D., Louwerse, I., Wagelmans, A. (eds) Operations Research Proceedings 2013. Operations Research Proceedings. Springer, Cham.

[3] Manlove, D.F., O’Malley, G., Prosser, P., Unsworth, C. (2007). A Constraint Programming Approach to the Hospitals / Residents Problem. In: Van Hentenryck, P., Wolsey, L. (eds) Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems. CPAIOR 2007. Lecture Notes in Computer Science, vol 4510. Springer, Berlin, Heidelberg.

Comment

I thank the authors for the detailed response. My assessment remains qualitatively unchanged - as the authors pointed out, there exist simple algorithms that perfectly solve the tasks. Given such, the applicability and necessity of LLMs are inherently limited, as practitioners or advanced LLMs with tool-calling abilities can simply adopt simple algorithms to solve them easily, downplaying the significance.

The authors referenced a nascent area of research on eliciting preferences, although after looking them up it seems that so far the community uptake is minimal; moreover, both papers appear to primarily motivate the interface between natural-language preferences and conventional tabular data. If the end goal is to serve natural-language systems, evaluating your method only on hand-crafted tabular instances and not on their natural-language instantiations may miss the most salient challenges, despite the sensible argument you provided in your rebuttal about the additional complication.

"their potential to act as end-to-end solutions that parse preferences expressed in natural language, execute the necessary procedures to compute desirable outcomes, and explain important aspects of the solutions to the (potentially large number of) individuals involved" - It still seems to me that a system could trivially (i) translate the user’s natural-language preferences into a table, (ii) run a simple DA algorithm, and (iii) translate the outcome back and explain, is this not what the tool-use LLM will do? In this loop, I see less clear an argument why the LLM itself needs to be a good matchmaker; perhaps they need to only be able to parse the result better?

"With improved capabilities, LLMs have the potential to act as optimizers that compute desirable solutions even in situations where polynomial-time algorithms do not exist." - That is an exciting long-term prospect, but it is not demonstrated—or empirically probed—by the present work.

Therefore I lean towards keeping my original assessment and score.

Comment

We appreciate the reviewer's additional points and would like to follow up on them:

Efficient Algorithms: We agree that efficient algorithms exist for standard matching problems. However, our goal is not to replace these algorithms, but to evaluate LLMs as reasoning agents in structured domains with well-defined properties. Even when tool-calling is available, LLMs must (i) recognize the decision context, (ii) extract structured preferences from ambiguous or incomplete input, and (iii) select and parameterize the correct algorithm -- steps that are non-trivial and error-prone.

Most importantly, evaluating the ability to reason about and execute straightforward algorithms is imperative to understand the frontiers of LLMs' reasoning capabilities. Recent work has sparked a debate on whether LLMs can actually "reason" [1][2][3], offering conflicting conclusions. Our work contributes to this increasingly relevant line of work, by demonstrating key shortcomings in how LLMs reason.

Matching markets offer an ideal testbed for such evaluation: they are simple enough to admit ground truth, yet rich enough to test preference reasoning, alignment, and multi-agent negotiation. They provide a framework that is axiomatically rich yet computationally tractable where solution quality (e.g., stability, envy-freeness, Pareto optimality) can be rigorously evaluated. As LLMs are increasingly deployed in real-world decision-making (often alongside humans), it is crucial to understand how well they approximate correct outcomes and respect normative constraints, both when optimal solvers exist and when they do not.

Tabular vs. Natural Language Preferences: While it is true that preference elicitation from natural language is an emerging area with limited uptake so far, we view this as precisely why foundational evaluation is needed. The cited works represent early but critical efforts in defining the interface between naturalistic language and structured preferences, a prerequisite for applying LLMs in real-world scenarios.

We agree that bridging to natural language preferences is a critical next step. Our decision to use tabular representations with controlled preference structures is intentional: it enables us to isolate and rigorously evaluate the LLM's reasoning, alignment, and solution quality relative to normative axioms (e.g., stability). Natural language formulations introduce significant noise in both input and output, making it difficult to attribute performance failures to reasoning vs. parsing. By grounding our analysis in tabular settings first, we can obtain clean and interpretable measurements, forming a benchmark for future extensions that incorporate naturalistic input.

[1] Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.

[2] Opus, C., & Lawsen, A. (2025). The Illusion of the Illusion of Thinking. arXiv preprint ArXiv:2506.09250.

[3] Song, Z., Yue, S., & Zhang, J. (2025). Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations. arXiv preprint arXiv:2507.17699.

Comment

Thanks for the additional clarifications!

Comment

We are glad to address your concerns; thank you for your comments.

Review
Rating: 4

The paper delves into how LLMs handle ranked preferences in matching markets. By evaluating several models on tasks like generating stable solutions and detecting instability, the authors create a benchmark with diverse difficulty levels and preference distributions. The results show that existing models struggle with large-scale instances, often failing to identify blocking pairs or execute algorithms iteratively. The authors try to improve the performance by applying LoRA, which only works in small markets. In summary, the paper reveals the limitations of existing LLMs in combinatorial reasoning.

Strengths and Weaknesses

  • Strengths:
  1. The paper reveals the limitations of LLMs from a special perspective, offering some directions for future research.
  2. The paper is well organized, and the experimental setup is rational, supporting its discussion and conclusion, making it easy to read and persuasive.
  • Weaknesses:
  1. The experimental results are based on synthetic data, and the paper does not thoroughly explore how well these findings generalize to real-world matching markets with more complex preference structures.
  2. The paper's exploration of methods to enhance performance is inadequate, making its contribution limited and lacking novelty.

Questions

  1. How do LLMs' performances on these tasks compare to traditional algorithmic solutions for matching markets? Are there certain situations where LLMs have advantages over traditional methods?

  2. The paper uses synthetic data. Could the authors talk about the possible difficulties and chances of using their discussion in real-world matching markets with more complex preference structures?

  3. The experiments only test strict-preference cases (IC/ML distributions), which limits the universality of the conclusions. Could the authors add some experiments on datasets with indifferent preferences?

  4. Please check formatting and detail errors of the paper, e.g., the garble error on line 301.

Limitations

The authors have pointed out some limitations of their work. However, to make the paper better, it would be helpful to:

  1. Look into more strategies for improving LLM performance on large-scale instances in addition to LoRA fine-tuning.

  2. Offer a more in-depth comparison with traditional algorithmic approaches in this field to better understand the LLMs' performance.

  3. Discuss the possible challenges of applying this work in real-world situations with more complex preference structures.

Justification for Final Rating

Thank you to the authors for the rebuttal. The responses have been helpful in clarifying the paper. However, I recommend that the authors incorporate the new discussions, experiments, and clarifications mentioned in the rebuttal into the final version of the paper.

Therefore, I maintain my borderline accept rating.

Formatting Issues

No major formatting issues, but several typos should be corrected.

Author Response

We thank the reviewer for their suggestions on improving the paper, which we will take into consideration. Below we shall answer the questions they raise.

How do LLMs' performances on these tasks compare to traditional algorithmic solutions for matching markets? Are there certain situations where LLMs have advantages over traditional methods?

Why LLMs? There are efficient algorithms for finding solutions in standard matching markets. However, studying the performance of LLMs in matching markets is crucial for several reasons:

  • Widespread adoption of LLMs: LLMs are becoming ubiquitous in everyday life; for example, doctors and nurses are reportedly using generative AI in the medical field. Consequently, laypeople are likely to rely on LLMs for tasks like matching, since they may not know the appropriate algorithms or may lack the time to apply them.

  • Matching markets as a testbed: Matching markets are nuanced systems with well-studied axiomatic properties, making them an ideal setting to evaluate the reasoning capabilities of LLMs.

  • Rise of LLM agents: As autonomous LLM-based agents emerge, they must understand user preferences and make decisions on their behalf in alignment with those preferences.

  • Reasoning about preferences: A key aspect of advanced reasoning is the ability to interpret and manipulate ranked preferences, a concept that spans domains from voting systems to market design.

  • Multi-agent interactions: Many real-world problems require collective decision-making and compromise among agents—domains where reasoning over ranked choices and outcomes is essential.

Advantages over traditional methods. While the stylized toy version of matching markets admits efficient algorithms, these problems often become computationally intractable (NP-hard) in the presence of weak preferences (ties) and incomplete/partial preferences.

LLMs are fantastic for generating preferences [1], making them ideal not only in computing solutions when no efficient algorithms are known to exist, but also to complete and complement preferences when data about ranked preferences is unknown, ill-defined, or incomplete.

Finally, LLMs have the potential to function as end-to-end solutions for preference aggregation settings. Having an LLM-based decision system that can process natural language inputs from humans to identify preferences, apply appropriate algorithms to aggregate these preferences (ensuring desirable properties), and explain solutions to stakeholders involved, will streamline and enhance decision-making processes in diverse applications. In fact, various works already demonstrated the effectiveness of preference elicitation from natural language inputs using LLMs, in settings such as auctions [2] and course assignment [3].

[1] Fish, S., Gölz, P., Parkes, D. C., Procaccia, A. D., Rusak, G., Shapira, I., & Wüthrich, M. (2023). Generative social choice. arXiv preprint arXiv:2309.01291.

[2] Huang, D., Marmolejo-Cossío, F., Lock, E., & Parkes, D. (2025). Accelerated Preference Elicitation with LLM-Based Proxies. arXiv preprint arXiv:2501.14625.

[3] Soumalias, E., Jiang, Y., Zhu, K., Curry, M., Seuken, S., & Parkes, D. C. (2025). LLM-Powered Preference Elicitation in Combinatorial Assignment. arXiv preprint arXiv:2502.10308.

The paper uses synthetic data. Could the authors talk about the possible difficulties and chances of using their discussion in real-world matching markets with more complex preference structures?

The challenges with using real-world data in our evaluation set-up are two-fold:

  1. Lack of availability: Real data from popular applications of matching markets, such as matching residents to hospitals, school choice programs in cities such as New York and Boston, and college admissions, is not publicly available due to privacy constraints.

  2. Increased complexity: We show that even advanced reasoning LLMs fail in rather small and stylized markets (50x50 one-to-one instances). Real‑world applications are much larger, with hundreds or thousands of agents, and additional layers of complexity (e.g., incomplete preference lists, ties, programme quotas, and many‑to‑one capacities). Most of these variants either require more complicated algorithms, or are NP‑hard to solve exactly [1][2][3]. Hence, we suspect worse (or at best, the same) performance from LLMs in real-world settings, which are far more complicated.

[1] Delorme, M., García, S., Gondzio, J., Kalcsics, J., Manlove, D., & Pettersson, W. (2019). Mathematical models for stable matching problems with ties and incomplete lists. European Journal of Operational Research, 277(2), 426-441.

[2] Kwanashie, A., Manlove, D.F. (2014). An Integer Programming Approach to the Hospitals/Residents Problem with Ties. In: Huisman, D., Louwerse, I., Wagelmans, A. (eds) Operations Research Proceedings 2013. Operations Research Proceedings. Springer, Cham.

[3] Manlove, D.F., O’Malley, G., Prosser, P., Unsworth, C. (2007). A Constraint Programming Approach to the Hospitals / Residents Problem. In: Van Hentenryck, P., Wolsey, L. (eds) Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems. CPAIOR 2007. Lecture Notes in Computer Science, vol 4510. Springer, Berlin, Heidelberg.

The experiments only test strict-preference cases (IC/ML distributions), which limits the universality of the conclusions. Could the authors add some experiments on datasets with indifferent preferences?

Our results extend to the case when ties (indifference between alternatives) are present. If each tie is broken arbitrarily, the ordinary DA algorithm still returns a “weakly stable” matching [4]. Because this refinement requires no more procedural effort than the strict‑preference case, an LLM’s success rate on weak stability with ties can be expected to match, at best, its performance on strict preferences. Stronger notions such as “strong” and “super” stability are more demanding: a solution may not exist, and specialised tie‑handling variants of DA are needed [4].

This is exactly what we observe through new experiments where we introduce ties in the preference lists. LLMs continue to apply the DA algorithm and ensure weak stability at the same rate at which they compute stable matchings when preferences are strict. However, less than 50% of the matchings computed in the former case satisfy notions such as strong or super stability.

[4] Robert W. Irving. 1994. Stable marriage and indifference. Discrete Appl. Math. 48, 3 (Feb. 15, 1994), 261–272.
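A small sketch of the argument above, under an assumed encoding where each preference list is given as ordered indifference groups (e.g. [[2], [0, 1]] means 2 is preferred to the tied pair {0, 1}): break ties arbitrarily, run ordinary DA on the resulting strict lists (e.g. the DA sketch earlier in this thread), and check weak stability, where a pair blocks only if both sides strictly prefer each other. The encoding and helper names are assumptions for illustration, not the paper's code.

```python
import random

def break_ties(prefs_with_ties, seed=None):
    # Flatten ordered indifference groups into strict lists, shuffling within
    # each group; ordinary DA on the result yields a weakly stable matching.
    rng = random.Random(seed)
    strict = []
    for groups in prefs_with_ties:
        order = []
        for group in groups:
            group = list(group)
            rng.shuffle(group)
            order.extend(group)
        strict.append(order)
    return strict

def is_weakly_stable(matching, men_groups, women_groups):
    # matching: dict man -> woman; rank = index of the indifference class.
    def class_ranks(groups_list):
        return [{a: r for r, grp in enumerate(groups) for a in grp}
                for groups in groups_list]
    mrank, wrank = class_ranks(men_groups), class_ranks(women_groups)
    husband_of = {w: m for m, w in matching.items()}
    for m, w_cur in matching.items():
        for w in mrank[m]:
            if mrank[m][w] < mrank[m][w_cur] and wrank[w][m] < wrank[w][husband_of[w]]:
                return False   # (m, w) strictly prefer each other: blocking pair
    return True
```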

Please check formatting and detail errors of the paper, e.g., the garble error on line 301.

We were not able to locate the formatting error. We would appreciate it if you could share the exact sentence so we can address the error.

Comment

Thank you to the authors for the rebuttal. The responses have helped clarify the paper. However, I recommend that the authors incorporate the new discussions, experiments, and clarifications mentioned in the rebuttal into the final version of the paper.

Therefore, I maintain my borderline accept rating.

Review
Rating: 4

This manuscript investigates the reasoning capabilities of LLMs in the domain of matching markets, a key area in economics, with a particular focus on their ability to handle preference rankings and execute structured algorithms. The authors construct a dedicated dataset and conduct comprehensive evaluations, and they further explore the use of LoRA-based SFT as a preliminary attempt to enhance LLM performance.

Strengths and Weaknesses

Strengths

  1. The experiments presented in this manuscript are highly comprehensive, encompassing hierarchical tasks of varying difficulty levels, multi-dimensional evaluation metrics, diverse types of models, and well-controlled experimental variables.
  2. The manuscript establishes a standardized benchmark for the domain of matching markets.
  3. The study incorporates several state-of-the-art models, including DeepSeek-R1 and Qwen-QwQ-32B.

Weaknesses

  1. Although this manuscript effectively demonstrates the limitations of existing LLMs in tasks of matching markets, it does not provide a concrete solution. In the supplementary materials, I noticed the authors attempt to address this issue using reinforcement learning algorithms. However, I have concerns regarding this approach. Prior work has shown that post-training techniques—including SFT and reinforcement learning—do not inject new knowledge into LLMs. Rather, they serve primarily to align model outputs with human preferences (both correct and incorrect) via behavioral control, often improving performance from pass@N to pass@1. In contrast, the experimental results in this manuscript suggest that the models lack fundamental knowledge relevant to matching markets, rather than simply failing to align with correctness-related preferences. Therefore, I believe the authors need to better articulate the contributions of this work. A strong empirical study should inspire future research directions, yet this manuscript currently lacks a compelling discussion on potential avenues for follow-up work.
  2. What is the practical value of applying LLMs to matching markets? If such value is limited or unclear, what is the justification for introducing LLMs to a domain that has not traditionally relied on them?
  3. I believe including a comparison between LLM performance and that of human experts would significantly improve the manuscript’s clarity. Without such context, readers unfamiliar with the matching markets domain may find it difficult to judge whether the LLMs' low performance stems from the intrinsic difficulty of the task or from the models’ inadequacy.
  4. I would encourage the authors to provide several case studies to help elucidate the underlying reasons for LLM failure in this domain.
  5. The manuscript would benefit from a teaser figure to guide readers through its core contributions and structure.
  6. What is the difference between open-source and closed-source models? As I understand it, the distinction lies in whether the model developers publicly release the model parameters. However, this distinction is not inherently related to the model's performance.

Minor comment

  1. This work may be more suitable for the NeurIPS Benchmark Track than for the NeurIPS Main Track.

Questions

See weaknesses.

Limitations

yes

Justification for Final Rating

I have completed the full review of the manuscript and its rebuttal.

I recommend that the authors improve the readability of the manuscript, including adding teaser figures and case studies, to assist readers who are not familiar with matching markets.

I maintain my borderline accept rating.

Formatting Issues

No formatting issues were identified.

Author Response

We thank the reviewer for their meaningful comments and questions, which we address below:

Therefore, I believe the authors need to better articulate the contributions of this work. A strong empirical study should inspire future research directions, yet this manuscript currently lacks a compelling discussion on potential avenues for follow-up work.

The contributions of this work are three-fold:

  1. In settings involving the aggregation of human preferences, we demonstrate that the ability of even the most advanced reasoning LLMs to reason about and execute straightforward algorithms does not scale with the input size (sections 3-5). This finding has implications across a wide-range of settings involving reasoning about human preferences such as voting, recommender systems, and resource allocation.

  2. We provide potential explanations for this phenomenon, by showing how LLMs make errors while reasoning about provided preferences, which can compound over a large number of steps that involve such reasoning (section 6).

  3. We illustrate the potential of supervised fine-tuning on synthetically generated reasoning traces in improving LLMs’ ability to compute correct solutions (section 7).

The jump in accuracy we see after trace-level SFT on small markets shows that models do grasp both the deferred-acceptance invariants and the step-wise procedure once those steps are demonstrated explicitly. The residual failures with larger inputs therefore stem from computational bandwidth and hallucinations, rather than missing knowledge of the algorithms or setting. Concretely, the model must carry out ≈ n^2 propose/reject rounds while maintaining a growing state vector; attention and position-encoding limits may cause the intermediate state to drift and errors to compound. Closing this gap calls for methods that extend the effective reasoning horizon. Possible follow-up directions are:

  • Step‑Level Reinforcement Learning: Reward models not just for final stability but for correct intermediate actions (valid proposal, correct understanding of preferences), extending the effective reasoning horizon without requiring longer context windows.

  • Scratch-Pad Execution with Verifier: Let the LLM run one proposal/reject round at a time on an external scratch-pad, then call a lightweight checker that rejects states containing blocking pairs, creating an iterative self-correction loop (a minimal sketch of such a checker follows this list).

  • Curriculum + Trace Distillation: Start with small reasoning traces for small inputs and gradually increase the input size, freezing lower layers as capacity grows.
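As a concrete illustration of the scratch-pad direction, here is a minimal sketch of a lightweight checker that validates an intermediate deferred-acceptance state. The state encoding (per-man proposal pointers plus current engagements) is an assumption made for this sketch, not an interface from the paper.

```python
# Sketch: verify the deferred-acceptance invariants of an intermediate state.
# next_choice[m] = number of proposals man m has made so far (down his list);
# engaged_to[w]  = man currently held by woman w, or None.
def verify_state(men_prefs, women_prefs, next_choice, engaged_to):
    n = len(men_prefs)
    wrank = [{m: r for r, m in enumerate(plist)} for plist in women_prefs]
    # Reconstruct who has proposed to each woman so far from the pointers.
    proposers = {w: [] for w in range(n)}
    for m in range(n):
        for w in men_prefs[m][:next_choice[m]]:
            proposers[w].append(m)
    held = set()
    for w in range(n):
        m = engaged_to[w]
        if m is None:
            if proposers[w]:          # a woman who received a proposal never stays free
                return False
            continue
        if m in held or m not in proposers[w]:
            return False              # man held twice, or held without having proposed
        held.add(m)
        best = min(proposers[w], key=lambda x: wrank[w][x])
        if m != best:                 # she must hold her best proposer so far
            return False
    return True
```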

What is the practical value of applying LLMs to matching markets? If such value is limited or unclear, what is the justification for introducing LLMs to a domain that has not traditionally relied on them?

Although efficient algorithms exist for solving problems in traditional matching markets, analyzing how LLMs perform in these contexts is important for several key reasons:

  • Growing Ubiquity of LLMs: Large Language Models are becoming increasingly integrated into daily life. For instance, healthcare professionals are reportedly using generative AI tools in clinical settings. As a result, everyday users may turn to LLMs for tasks like matching, especially when they are unfamiliar with the correct algorithms or lack the time to apply them manually.

  • Matching Markets as a Benchmark: Matching markets offer a rich and theoretically grounded environment with well-established axiomatic structures. This makes them a valuable domain for assessing the reasoning abilities of LLMs.

  • Emergence of LLM Agents: As autonomous agents powered by LLMs become more prevalent, these systems will need to infer and act on user preferences, making informed decisions that align with individual goals.

  • Reasoning with Ranked Preferences: Sophisticated reasoning involves understanding and manipulating ranked choices—an essential capability across domains such as voting, market design, and social choice theory.

  • Multi-agent Decision-Making: Many practical problems involve negotiation and coordination among multiple agents, requiring reasoning over ordered preferences and group outcomes.

  • Beyond Classic Algorithms: While simplified matching problems can be solved efficiently, real-world scenarios often involve complexities like indifferences (ties) and partial information, with some versions being computationally intractable (NP-hard) [1][2].

Additionally, LLMs excel at generating and completing ranked preferences [3], making them well-suited for solving problems where no efficient algorithm exists or where the input data is sparse or ambiguous.

Ultimately, LLMs have the potential to serve as comprehensive, end-to-end systems in preference aggregation tasks. Such a system could translate natural language inputs into formalized preferences, apply suitable aggregation algorithms while satisfying key normative criteria, and explain various aspects of the results clearly to stakeholders. In fact, several studies have already shown that LLMs are effective at extracting preferences from language in domains such as auctions [4] and course allocation [5].

[1] Delorme, M., García, S., Gondzio, J., Kalcsics, J., Manlove, D., & Pettersson, W. (2019). Mathematical models for stable matching problems with ties and incomplete lists. European Journal of Operational Research, 277(2), 426-441.

[2] Kwanashie, A., Manlove, D.F. (2014). An Integer Programming Approach to the Hospitals/Residents Problem with Ties. In: Huisman, D., Louwerse, I., Wagelmans, A. (eds) Operations Research Proceedings 2013. Operations Research Proceedings. Springer, Cham.

[3] Fish, S., Gölz, P., Parkes, D. C., Procaccia, A. D., Rusak, G., Shapira, I., & Wüthrich, M. (2023). Generative social choice. arXiv preprint arXiv:2309.01291.

[4] Huang, D., Marmolejo-Cossío, F., Lock, E., & Parkes, D. (2025). Accelerated Preference Elicitation with LLM-Based Proxies. arXiv preprint arXiv:2501.14625.

[5] Soumalias, E., Jiang, Y., Zhu, K., Curry, M., Seuken, S., & Parkes, D. C. (2025). LLM-Powered Preference Elicitation in Combinatorial Assignment. arXiv preprint arXiv:2502.10308.

I believe including a comparison between LLM performance and that of human experts would significantly improve the manuscript’s clarity. Without such context, readers unfamiliar with the matching markets domain may find it difficult to judge whether the LLMs' low performance stems from the intrinsic difficulty of the task or from the models’ inadequacy.

Given that there exist polynomial-time algorithms for computing and verifying solutions in the matching market setting, human experts use such algorithms for decision-making in applications such as the National Resident Matching Program (NRMP), school selection, and college admissions. The setting we consider is the most simplistic version of matching markets, involving ordinal preferences that are complete and strict (no indifference). Here, stable solutions are obtained using the Deferred Acceptance algorithm, which requires a number of steps that is quadratic in the number of agents involved. While LLMs are able to solve this problem with 10 to 20 agents (on each side), they fail when the number of agents increases to 50. This reflects shortcomings of LLMs in handling larger context inputs rather than the growing difficulty of the task.

I would encourage the authors to provide several case studies to help elucidate the underlying reasons for LLM failure in this domain.

Through our analysis in section 6, we shed light on potential reasons for LLMs' failure on large instances. The Deferred Acceptance (DA) algorithm requires keeping track of preference lists and accepted/rejected proposals at any given stage, and the accurate interpretation of preferences to move to the next step. We show that LLMs make errors on even such small isolated tasks. Since the size of the context and the number of such steps grow quadratically with the number of agents, the errors made by LLMs compound and lead to a larger chance of failure on larger instances. We also show that LLMs hallucinate blocking pairs (sources of instability) while evaluating the stability of a given matching (section 5). While isolating the reason behind LLMs’ hallucinations in general is beyond the scope of this study, we clearly identify that their failure is not due to a lack of understanding of the economic knowledge necessary to reason about these problems, but stems from challenges with handling large context inputs.
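For illustration, the kind of blocking-pair check underlying the stability evaluation could be written as follows. This is a minimal sketch assuming complete, strict, 0-indexed preference lists and a dict mapping each man to his assigned woman; it is not the paper's evaluation code.

```python
# Sketch: enumerate all blocking pairs of a one-to-one matching (O(n^2) checks).
def blocking_pairs(matching, men_prefs, women_prefs):
    mrank = [{w: r for r, w in enumerate(plist)} for plist in men_prefs]
    wrank = [{m: r for r, m in enumerate(plist)} for plist in women_prefs]
    husband_of = {w: m for m, w in matching.items()}
    pairs = []
    for m, w_cur in matching.items():
        for w, r in mrank[m].items():
            # (m, w) block if both strictly prefer each other to their partners.
            if r < mrank[m][w_cur] and wrank[w][m] < wrank[w][husband_of[w]]:
                pairs.append((m, w))
    return pairs   # an empty list certifies stability
```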

The manuscript would benefit from a teaser figure to guide readers through its core contributions and structure.

That’s a great suggestion. We will definitely add such a diagram in the paper.

What is the difference between open-source and closed-source models? As I understand it, the distinction lies in whether the model developers publicly release the model parameters. However, this distinction is not inherently related to the model's performance.

Beyond the (un)availability of model parameters, there is a lack of transparency regarding the training data and pipelines used to enhance the performance of closed-source models. Our comparison of both classes of models on the selected series of tasks aims to identify whether there is a clear gap, potentially due to the differences in training paradigms. Most importantly, it is not possible to modify closed-source models through post-training methods such as fine-tuning and reinforcement learning, limiting the amount of experimentation and control that researchers have for enhancing LLMs’ capabilities in this setting.

Comment

Thank you for the authors’ detailed response.

However, it appears that my original concern may have been misunderstood. When I asked “What is the practical value of applying LLMs to matching markets?”, I was referring to a relative comparison—specifically, what advantages LLMs offer over traditional algorithms in this context. Moreover, the limitations of the proposed approach should also be clearly acknowledged.

Regarding the case study, I still find the manuscript lacking in clarity. For readers unfamiliar with the concept of matching markets, the presentation may be particularly difficult to follow. Therefore, I continue to recommend the inclusion of a case study to improve the clarity and accessibility of the manuscript.

As for the comparison between open-source and closed-source models, the recent release of DeepSeek-R1 illustrates that the performance gap between these two categories is narrowing. Consequently, I find it inappropriate to characterize performance differences solely based on whether a model is open- or closed-source.

In summary, while the authors have addressed some of my concerns, the response does not significantly improve my overall impression of the manuscript. I thus stand by my original score.

Comment

We appreciate the clarification. To directly address the question: LLMs are not intended to outperform traditional algorithms on solution quality, but rather to act as interfaces between humans and algorithms, especially in settings with ambiguous, incomplete, or natural-language preferences. Their advantage lies in interpreting, aligning with, and reasoning about preferences.

Beyond this practical motivation, our paper highlights a core limitation in the algorithmic reasoning abilities of LLMs. Matching markets provide a uniquely suitable testbed: they are clean, structured, and come with axiomatic properties (e.g., stability) that are central across domains such as content matching, organ exchange, and resource allocation. By evaluating LLMs in this setting, we expose reasoning failures that are not apparent in unstructured tasks, and provide a foundation for improving reasoning in more complex domains.

We appreciate the suggestion to include a case study illustrating the challenges faced in practical applications of matching markets (such as resident matching, school choice, matching content to users, etc.), and how this impacts LLMs' performance.

We agree with the reviewer that the performance gap between open- and closed-source models is reducing in a variety of domains. In fact, this is something we observe in our paper as well. We shall clarify this point and include a brief discussion about this in the main paper upon acceptance.

We shall utilize the extra space provided as part of the final version, to emphasize all these points in more detail.

Comment

Thank you to the authors for the rebuttal. The responses have helped clarify the manuscript.

I still recommend that the authors improve the readability of the manuscript, including adding teaser figures and case studies, to assist readers who are not familiar with matching markets.

I maintain my borderline accept rating.

Review
Rating: 4

The paper investigates whether current large language models can reason over ranked preferences in combinatorial settings by focusing on matching-market problems. The authors build a benchmark suite that escalates from generating stable matchings to detecting and resolving instabilities and answering fine-grained preference queries. Seven state-of-the-art models are tested. Results reveal that even the strongest LLMs regularly miss blocking pairs and struggle with the iterative logic needed to repair unstable matchings, especially as market size grows.

Strengths and Weaknesses

Strengths

  1. The paper targets an underexplored yet practically important domain—ranked-preference reasoning in matching markets—providing clear motivation and relevance to real-world allocation systems.
  2. It introduces a well-structured benchmark spanning four task types and varying market sizes, and evaluates seven strong LLMs plus LoRA fine-tuning, yielding a comprehensive empirical picture.

Weaknesses

  1. All preference profiles are synthetically generated; the absence of real-world datasets limits external validity and may overstate model shortcomings.
  2. The study notes that LoRA helps on large markets but harms small ones, yet offers little diagnostic analysis or ablation to explain this scale-dependent trade-off.

Questions

Questions for the Authors

  1. Have you tested the benchmark—or a subset of it—on any real‐world matching-market data (e.g., residency matching or ride-sharing logs) to verify that the observed limitations persist outside synthetic instances?

  2. What factors drive the LoRA fine-tuning trade-off that improves large-market performance while degrading small-market accuracy? Could additional ablations (e.g., rank size, layer selection, or data volume) clarify this behavior?

Limitations

yes

Formatting Issues

no

Author Response

We thank the reviewer for their comments and insightful questions. We shall address them below:

Have you tested the benchmark—or a subset of it—on any real‐world matching-market data (e.g., residency matching or ride-sharing logs) to verify that the observed limitations persist outside synthetic instances?

Our work currently does not include tests on real-world data for two primary reasons:

  1. Lack of availability: Human preference data from popular applications of matching markets, such as the National Resident Matching Program (NRMP), school choice programs in cities such as New York and Boston, and college admissions in various countries, is not publicly available.

  2. Additional layers of complexity: Public deployments typically involve incomplete lists, ties, programme capacities, and many‑to‑one constraints—often with hundreds of agents per side. Such variants are often NP-hard [1][2][3], and go beyond the scope of our present study, whose goal is to probe whether LLMs can execute even the baseline polynomial‑time procedures such as Deferred‑Acceptance (DA). We show that current models already fail on 50 × 50 one‑to‑one instances with complete and strict preferences; introducing larger, many‑to‑one markets would obscure the diagnosis without changing the takeaway.

[1] Delorme, M., García, S., Gondzio, J., Kalcsics, J., Manlove, D., & Pettersson, W. (2019). Mathematical models for stable matching problems with ties and incomplete lists. European Journal of Operational Research, 277(2), 426-441.

[2] Kwanashie, A., Manlove, D.F. (2014). An Integer Programming Approach to the Hospitals/Residents Problem with Ties. In: Huisman, D., Louwerse, I., Wagelmans, A. (eds) Operations Research Proceedings 2013. Operations Research Proceedings. Springer, Cham.

[3] Manlove, D.F., O’Malley, G., Prosser, P., Unsworth, C. (2007). A Constraint Programming Approach to the Hospitals / Residents Problem. In: Van Hentenryck, P., Wolsey, L. (eds) Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems. CPAIOR 2007. Lecture Notes in Computer Science, vol 4510. Springer, Berlin, Heidelberg.

What factors drive the LoRA fine-tuning trade-off that improves large-market performance while degrading small-market accuracy? Could additional ablations (e.g., rank size, layer selection, or data volume) clarify this behavior?

As shown in our results (section 7), the performance drastically improves on smaller markets and fails to improve clearly on large markets. This is primarily due to the inability of models, even those with advanced reasoning capabilities, to handle large inputs - something we show throughout the paper. Since properly solving large instances requires a much larger number of steps, the likelihood of models making an error in the execution process is much higher (as discussed in section 6), leading to a much higher chance of failure.

Ablations over various hyperparameters in the fine-tuning pipeline provide additional insights. We observe that performance does not improve when only one type of instance (either IC or ML) is included in the training dataset, indicating that both types of instances are necessary for an overall improvement. Similarly, we observe that while a larger volume of data leads to better performance (e.g. 10000 samples vs. 5000), a larger LoRA rank (64 vs. 32) does not yield an improvement, indicating that a larger set of trainable parameters is not required for the model to learn the necessary reasoning patterns. We will update the paper with more detailed results.
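For reference, a parameter-efficient setup of the kind discussed above could be configured roughly as follows with Hugging Face PEFT; the base model, target modules, and hyperparameter values are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model; the specific fine-tuned models are not reproduced here.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                   # rank 32 (64 gave no further gain in the ablation above)
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```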

Any improvements, through fine-tuning or methods involving reinforcement learning, that help mitigate errors made by LLMs while reasoning about large preference profiles will help improve performance on larger markets. A meaningful step in this direction could be to fine-tune LLMs to use tools that accurately answer comparative questions based on provided preferences, or that keep track of intermediate states in the execution of the concerned algorithm.

Comment

Thanks for the explanations. It makes the paper more interesting. I will maintain the current positive score.

Final Decision

This paper provides a valuable and comprehensive empirical study on the capabilities of LLMs to reason with ranked preferences in the context of matching markets. The reviews are largely positive, with three reviewers leaning towards acceptance.

  • Strengths: The paper is commended for tackling an important and underexplored area (R-Jvsm, R-YmHR). It introduces a well-structured benchmark with a hierarchy of tasks and evaluates a strong suite of modern LLMs, providing clear evidence of their limitations on algorithmic reasoning, especially at scale (R-Jvsm, R-e1sx). The work is seen as a high-quality empirical study that establishes a standardized benchmark (R-e1sx).

  • Weaknesses & Rebuttal: Reviewers raised concerns about the use of synthetic data (R-Jvsm, R-YmHR) and the lack of a novel solution to the identified problem (R-e1sx, R-xbG7). The authors provided convincing rebuttals, justifying the use of synthetic data due to the unavailability and added complexity of real-world data, and framing the work as a crucial diagnostic study that exposes core reasoning failures. This was sufficient to maintain the positive scores of three reviewers (R-Jvsm, R-YmHR, R-e1sx).

The paper's contribution lies in its rigorous demonstration of a fundamental gap in LLM reasoning. While the practical application of using a pure-language LLM for a task with a known algorithmic solution was questioned (R-xbG7), the authors successfully argue that this setting serves as an essential testbed for algorithmic fidelity. The findings are significant for understanding the scaling limits of reasoning in LLMs, and I believe this work will inspire valuable follow-up research.