Bootstrapping Self-Improvement of Language Model Programs for Zero-Shot Schema Matching
self-improving compositional language model programs for schema matching across heterogeneous data sources
Abstract
Reviews and Discussion
The paper describes a technique for matching dataset schemas, using a compositional language model program. The authors benchmark their solution against multiple competing works and usually achieve superior performance.
Questions for Authors
I would quite like to try out your solution. Is there any way you can provide for me to test it for its intended use?
Claims and Evidence
I have not found unsupported claims. However, I think the impact of the solution lives or dies by how easy it is to use.
Methods and Evaluation Criteria
The authors benchmark against several existing solutions, of which one (ReMatch) is the most similar to their solution. The approach seems sound, but it is hard to determine if there are other solutions which should be included.
Theoretical Claims
There are little to no theoretical claims in this paper.
Experimental Design and Analysis
The experimental designs seem to be sound.
Supplementary Material
I read parts of the experimental setup and examples. They seem quite extensive.
Relation to Prior Literature
After looking for similar solutions, it seems there is an increasing number of works that claim superior performance. Providing easy-to-implement benchmark tasks and benchmark code is crucial to determine which solution works the best.
Essential References Not Discussed
In this rapidly developing area, it is hard to identify essential references. Moreover, this is an interdisciplinary field which overlaps with the database community. I could find https://arxiv.org/abs/2412.08194 as an example that was not discussed in the paper.
Other Strengths and Weaknesses
The paper is very well formatted and contains several examples. It is not clear if the technique will be available and usable for a broader audience, which is crucial for actual use and further benchmarking. Adding (anonymous) code that works well and is extensible and maintainable is important.
Other Comments or Suggestions
I think this solution could prove useful outside of health datasets and machine learning. Perhaps try benchmarks provided by the database community to see if your solution does indeed translate. A work that does this, for example: https://arxiv.org/abs/2412.08194 or https://arxiv.org/abs/2408.14507
Dear R-pQV8,
Thank you for your thoughtful and insightful comments! We provide answers to each point in turn.
(A) Related work - Magneto
Thank you for pointing out this work (Magneto); we will incorporate a discussion of it in the related work of the camera-ready. At a high level: Magneto shares a similar retrieve-then-rerank architecture with the ReMatch baseline, especially in its zero-shot configuration. Specifically, both Magneto and ReMatch retrieve candidate matches using embeddings and subsequently rerank candidates with an LLM, hence making their underlying approaches comparable.
On the other hand, we highlight that in Magneto, most of the benchmark tasks involve schema matching of a single source table to a single target table. This contrasts with our healthcare setups, where there are multiple source tables and multiple target tables. Hence, our healthcare tasks are more complex, requiring reasoning over the table match first and then the column match.
Additionally, beyond the datasets, Matchmaker fundamentally differs from Magneto (and ReMatch) in three important ways:
- Compositional LLM Program: While Magneto uses a two-stage pipeline (retrieval and reranking), Matchmaker introduces a multi-stage compositional LLM program with candidate generation, refinement, and confidence scoring (see the sketch after this list). This structured approach allows more nuanced reasoning about schema relationships.
- Diverse Candidate Generation: Matchmaker combines both semantic retrieval and reasoning-based candidate generation, whereas Magneto relies on semantic retrieval only.
- Self-Improvement Mechanism: Matchmaker introduces a novel zero-shot self-improvement mechanism using synthetic in-context examples, which does not exist in other methods.
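For concreteness, below is a minimal sketch of what such a multi-stage compositional program could look like. This is an illustrative sketch only: the `llm()` helper, the prompt wording, and the stage signatures are assumptions for exposition, not our exact implementation.

```python
# Illustrative sketch of a multi-stage compositional LLM program for schema matching.
# `llm()` is a hypothetical helper that sends a prompt to a frozen LLM and returns text.
from typing import Dict, List


def llm(prompt: str) -> str:
    """Placeholder for a call to a frozen LLM (e.g., via an API client)."""
    raise NotImplementedError


def generate_candidates(source_attr: str, retrieved: List[str]) -> List[str]:
    # Stage 1: diverse candidate generation -- combine semantically retrieved
    # candidates with reasoning-based candidates proposed by the LLM.
    reasoned = llm(
        f"Source attribute: {source_attr}\n"
        "Propose plausible target attributes, one per line, with brief reasoning."
    ).splitlines()
    return list(dict.fromkeys(retrieved + reasoned))  # de-duplicate, keep order


def refine(source_attr: str, candidates: List[str]) -> List[str]:
    # Stage 2: refinement -- prune implausible candidates.
    kept = llm(
        f"Source attribute: {source_attr}\nCandidates: {candidates}\n"
        "Return only the plausible matches, one per line."
    )
    return kept.splitlines()


def score(source_attr: str, candidates: List[str]) -> Dict[str, float]:
    # Stage 3: confidence scoring -- ask for a 0-100 confidence per candidate.
    out = llm(
        f"Source attribute: {source_attr}\nCandidates: {candidates}\n"
        "For each candidate, output 'candidate: score' with a confidence in [0, 100]."
    )
    scores = {}
    for line in out.splitlines():
        name, value = line.rsplit(":", 1)
        scores[name.strip()] = float(value)
    return scores


def matchmaker_query(source_attr: str, retrieved: List[str]) -> Dict[str, float]:
    # Compose the stages into a single program per source attribute (query).
    candidates = generate_candidates(source_attr, retrieved)
    candidates = refine(source_attr, candidates)
    return score(source_attr, candidates)
```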
Finally, as per the reviewer's suggestion, we have evaluated Matchmaker on datasets from the suggested paper to illustrate applicability beyond healthcare. See response (B).
ACTION TAKEN: We will include a discussion on Magneto in the camera-ready.
(B) Additional datasets beyond healthcare
We thank the reviewer for the suggestion to improve the paper. We first clarify that our primary focus was on healthcare schema matching due to its real-world importance (and value to advance ML in healthcare settings), coupled with the structural complexity (see Section 1). Moreover, the healthcare schema matching datasets are widely recognized and extensively used in the schema matching literature (Sheetrit et al., 2024; Zhang et al., 2023; Narayan et al., 2022), due to their complexity and realism.
That said, we agree that evaluating beyond this healthcare domain is valuable to evaluate generalizability. Consequently, we have conducted new experiments on datasets from the suggested work, not from the biomedical domain. Specifically, we evaluated our approach on (i) Magellan (e-commerce product data) and (ii) WikiData (general knowledge base data).
The results can be found below. They highlight Matchmaker's strong capability and generalizability compared to ReMatch (which performs similarly to Magneto on the same datasets).
Our experiments show that Matchmaker achieves superior performance, confirming its generalizability across domains. However, these datasets represent significantly less challenging matching scenarios compared to our healthcare schemas. This is evidenced by the relatively high performance across all methods.
| Dataset | Matchmaker (Ours) | ReMatch |
|---|---|---|
| Wikidata (General knowledge) | 0.95 ± 0.04 | 0.84 ± 0.03 |
| Magellan (e-commerce) | 1.00 ± 0.00 | 1.00 ± 0.00 |
Moreover, these datasets typically involve single-table schemas with a small number of columns. In contrast, the healthcare schema matching tasks (from our paper) are significantly more challenging. These involve dozens of source tables and hundreds of attributes and require the model to first reason over the entire schema to determine the relevant target table before attempting column-level matching. We believe these results also reinforce our decision to focus on healthcare schemas, which present more challenging real-world matching scenarios that better differentiate the capabilities of advanced matching techniques.
ACTION TAKEN: We will include these new results and discussion in the camera-ready version. Thank you for the suggestion!
(C) Framework availability
We appreciate the reviewer’s enthusiasm to use Matchmaker. To confirm, we will release the full implementation upon acceptance, along with detailed documentation and tutorials for usage/extension beyond our evaluated setups.
That said, we include a base version at the following anonymized repo: https://anonymous.4open.science/r/Matchmaker-base-2641
We thank the reviewer for helping us improve our work. We hope these answer your points; please let us know if there are any remaining concerns!
This paper introduces Matchmaker, a self-improving compositional language model (LLM) program designed for schema matching, a critical task in data integration and interoperability. Schema matching involves finding correspondences between attributes across disparate data sources with different schemas and hierarchies, which is particularly challenging due to structural, semantic, and database heterogeneity. The authors propose a multi-stage LLM program that includes candidate generation, refinement, and confidence scoring. Matchmaker also self-improves in a zero-shot manner by constructing synthetic in-context demonstrations to guide the LLM's reasoning process. The paper demonstrates that Matchmaker outperforms existing ML-based approaches on real-world medical schema matching benchmarks, highlighting its potential to accelerate data integration and interoperability for machine learning-ready data.
Questions for Authors
See above.
Claims and Evidence
- This paper claims that Matchmaker is more scalable than previous methods, but it does not provide a detailed analysis of computational complexity or runtime performance compared to other methods. This is particularly important given the large number of LLM calls required by some baselines.
- While the results on medical datasets are impressive, the paper does not provide evidence of Matchmaker's performance on non-medical datasets. Schema matching is a problem that spans multiple domains (e.g., finance, e-commerce), and it would be valuable to see how well Matchmaker generalizes to these domains.
- The paper discusses the potential for human-in-the-loop deferral based on confidence scores, but it does not provide a detailed analysis of how this would work in practice or how much human intervention would be required to achieve significant performance gains.
- The confidence scoring mechanism relies on prompting the LLM to provide a score between 0 and 100, which is problematic. Since the LLM is a black box, the validity and consistency of these scores are questionable. Moreover, the scores are generated independently for each candidate, making it difficult to compare them across different queries.
Methods and Evaluation Criteria
- Lack of Methodological Innovation: The paper primarily relies on prompt engineering and does not introduce significant methodological innovations. Each step of the process depends heavily on the reasoning capabilities of the underlying LLM (e.g., GPT-4, GPT-3.5), which raises questions about the originality of the approach. The framework is more of a clever combination of existing techniques rather than a novel contribution to the field.
- Theoretical Contribution: The paper lacks theoretical innovation. It does not provide new theoretical insights or frameworks that could inspire other researchers. The reliance on LLMs for reasoning and scoring means that the paper does not contribute to the broader theoretical understanding of schema matching or LLM-based reasoning.
- Dataset Size and Baselines: The experiments are conducted on relatively small datasets (e.g., MIMIC-OMOP and Synthea-OMOP), with only 20-30 tables. This limits the ability to validate the effectiveness of Matchmaker on larger, more complex schemas. Additionally, the paper compares Matchmaker to only a few baselines despite mentioning several related methods in the related work section. A more comprehensive comparison, including classical schema matching approaches and other LLM-based methods, would strengthen the evaluation.
Theoretical Claims
The paper does not make any theoretical claims.
Experimental Design and Analysis
There are several areas where the experimental design could be improved:
- The paper compares Matchmaker to several baselines, but it does not provide a detailed analysis of why Matchmaker outperforms these baselines. For example, it would be useful to know if the performance gains are due to the multi-stage approach, the self-improvement mechanism, or a combination of both.
- The paper does not discuss the sensitivity of Matchmaker's performance to different hyperparameters (e.g., the number of candidates generated and the threshold for confidence scoring). This information would be useful for practitioners who want to apply Matchmaker to their own datasets.
- The experiments are conducted on relatively small datasets, which limits the ability to validate the effectiveness of Matchmaker on larger, more complex schemas. The authors should consider testing Matchmaker on larger datasets or more diverse domains to demonstrate its scalability and generalizability.
- There is no efficiency or API cost analysis.
Supplementary Material
The supplementary material provides additional details on the Matchmaker algorithm, including the prompts used for each component of the LLM program. It also includes examples of the LLM evaluator and additional experiments, such as the impact of different candidate generation approaches and the number of LLM calls required by each method. The supplementary material is well-organized and provides valuable insights into the implementation and evaluation of Matchmaker.
Relation to Prior Literature
This paper's reliance on advanced LLMs (e.g., GPT-4, GPT-3.5) for reasoning and scoring raises questions about the generalizability and reproducibility of the results. The paper feels more like a technical report on the application of GPT-4 to schema matching rather than a research paper that contributes novel insights or methodologies to the field.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
Weaknesses:
- Lack of Methodological Innovation: The paper primarily relies on prompt engineering and does not introduce significant methodological innovations. Each step of the process depends heavily on the reasoning capabilities of the underlying LLM (e.g., GPT-4, GPT-3.5), which raises questions about the originality of the approach.
- Theoretical Contribution: The paper lacks theoretical innovation. It does not provide new theoretical insights or frameworks that could inspire other researchers.
- Dataset Size and Baselines: The experiments are conducted on relatively small datasets, and the paper compares Matchmaker to only a few baselines. A more comprehensive comparison, including classical schema matching approaches and other LLM-based methods, would strengthen the evaluation.
- Confidence Scoring: The confidence scoring mechanism relies on prompting the LLM to provide a score between 0 and 100, which is problematic. Since the LLM is a black box, the validity and consistency of these scores are questionable. Moreover, the scores are generated independently for each candidate, making it difficult to compare them across different queries.
- Reliance on Advanced LLMs: The paper's reliance on advanced LLMs (e.g., GPT-4, GPT-3.5) for reasoning and scoring raises questions about the generalizability and reproducibility of the results. The paper feels more like a technical report on the application of GPT-4 to schema matching rather than a research paper that contributes novel insights or methodologies to the field.
Other Comments or Suggestions
See above.
Dear R-mppD,
Thank you for your insightful comments.
In Part 1 we address points already addressed in our paper (responses A-E), then in Part 2 we respond to additional points (F-J).
PART 1 - Points already addressed in our paper (A-E)
(A) Scalability and Computational Analysis
We clarify that we provided a detailed analysis of LLM call complexity compared to baselines in Appendix D.2 (Table 6) and referenced/flagged on L120. Matchmaker significantly reduces LLM calls via our information retrieval formulation (Sec. 3.2), thus improving scalability, unlike exhaustive O(n²) evaluations used by LLM-DP and SMAT.
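To make the scaling difference concrete, here is a back-of-the-envelope comparison; the schema sizes and per-query call count below are assumed for illustration only (the actual analysis is in Appendix D.2, Table 6).

```python
# Back-of-the-envelope LLM-call counts (all numbers are illustrative assumptions,
# not the figures from Appendix D.2 / Table 6).
n_source = 300        # assumed number of source attributes
n_target = 400        # assumed number of target attributes
calls_per_query = 3   # assumed calls per source attribute in a retrieval-style pipeline

pairwise_calls = n_source * n_target          # exhaustive pairwise evaluation: O(n^2)
retrieval_calls = n_source * calls_per_query  # query-wise IR formulation: O(n)

print(pairwise_calls, retrieval_calls)  # 120000 900
```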
(B) Human-in-the-Loop Deferral Analysis
We clarify that Sec. 5.3 (Matchmaker in practice: Human-in-the-loop deferral and …) already evaluates human-in-the-loop deferral. We show that entropy-based deferral outperforms random deferral, with just 10–20% deferral significantly boosting acc@1 (Fig. 4(a)).
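For intuition, here is a minimal sketch of entropy-based deferral under assumed inputs; the score format and the example numbers are illustrative and not taken from the paper.

```python
import numpy as np


def entropy(confidences):
    """Shannon entropy of a candidate-confidence distribution (normalized to sum to 1)."""
    p = np.asarray(confidences, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def defer_queries(per_query_confidences, defer_fraction=0.2):
    """Return indices of the highest-entropy (most uncertain) queries to defer to a human."""
    ents = {q: entropy(c) for q, c in per_query_confidences.items()}
    ranked = sorted(ents, key=ents.get, reverse=True)
    n_defer = int(len(ranked) * defer_fraction)
    return ranked[:n_defer]


# Illustrative example: query 1 has the flattest (most uncertain) score distribution,
# so deferring 20% of five queries hands exactly that one to a human.
scores = {0: [90, 5, 5], 1: [40, 35, 25], 2: [80, 15, 5], 3: [95, 3, 2], 4: [60, 30, 10]}
print(defer_queries(scores, defer_fraction=0.2))  # -> [1]
```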
(C) Missing comparisons with baselines from related work
We clarify that we already compare Matchmaker to the schema matching methods from our related work and Fig. 3. Table 1's results include:
- Supervised: SMAT (Zhang et al., 2021)
- Pre-trained LLM: LLM-DP (Narayan et al., 2022; Zhang et al., 2023a)
- Fine-tuned LLM: Jellyfish (Zhang et al., 2023b)
- RAG: ReMatch (Sheetrit et al., 2024)
Traditional methods are omitted, as prior work shows they underperform on these benchmarks.
(D) LLM reliance
We clarify that we use GPT-4 to match LLM baselines. As per Sec. 5.1 (L339–343), all systems use GPT-4 to ensure fair comparison and isolate system-level gains not tied to the LLM itself. While backbone quality matters, Matchmaker is LLM-agnostic.
(E) Performance Gains Attribution
We agree that attribution is crucial and have analysed it in two ways:
(i) Sec. 5.2 & Table 2: The ablation shows our synthetic in-context examples outperform self-reflection, and that systematic example selection outperforms random or no examples, confirming this is the main driver of gains.
(ii) Appendix D.1: The ablation shows diverse candidate generation (semantic + reasoning-based) outperforms single-type generation.
Action taken: We realize our paper is dense, and hence, it is easy to overlook these points. To improve clarity and better help the reader navigate the paper, we will add a summary table in Appendix A showing where different issues are addressed.
PART 2 - Additional points (F-J)
(F) Novelty
The reviewer suggests our work lacks novelty and is prompt engineering. We respectfully disagree and clarify our four key novelties:
- Novel Compositional LLM program: Unlike prior single-call methods (Sec. 2, Table 3), our multi-stage structure enables complex reasoning. Appendix A.1 compares this with ReMatch
- Novel optimization for zero-shot self-improvement: We introduce a novel optimization method using synthetic in-context examples (Sec. 4.4), which outperforms other methods (Table 2). The process is applicable to other compositional LLM programs.
- Novel task formulation: Schema matching as information retrieval (Sec. 3.2).
- Human deferral support: Matchmaker enables deferral to humans (Sec. 5.3), vital for real-world use.
(G) Generalization to Non-medical Domains
While focused on healthcare due to its complexity and real-world importance (Sec. 1), we agree it's useful to test other domains. We conduct new experiments on Magellan (e-commerce) & WikiData (general knowledge) datasets, which include Amazon product datasets as suggested. These results confirm Matchmaker's cross-domain performance, but also highlight the complexity of our healthcare datasets, reinforcing their selection.
| Dataset | Matchmaker | ReMatch |
|---|---|---|
| Wikidata | 0.95 ± 0.04 | 0.84 ± 0.03 |
| Magellan | 1.00 ± 0.00 | 1.00 ± 0.00 |

We will add these new results to the camera-ready. Thanks for the suggestion!
(H) Confidence Scoring Validity and Consistency
Our MCQ-based confidence scores align with the token-level calibration literature (Kadavath et al.; Ren et al.; Tian et al.; see Sec. 4.3). Entropy-based deferral confirms the scores accurately reflect prediction uncertainty, significantly improving accuracy (Sec. 5.3).
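As an illustration of how such MCQ-style scoring could be implemented (the prompt wording, the `llm()` helper, and the letter/score output format are assumptions for exposition, not the exact prompts from the paper):

```python
# Illustrative sketch of MCQ-style confidence scoring; `llm()` is a hypothetical
# helper returning the model's text output for a given prompt.

def build_mcq_prompt(source_attr, candidates):
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (
        f"Source attribute: {source_attr}\n"
        f"Candidate target attributes:\n{options}\n"
        "For each option, give a confidence between 0 and 100 that it is the "
        "correct match, formatted as 'LETTER: SCORE'."
    )


def parse_and_normalize(llm_output, candidates):
    # e.g. llm_output = "A: 85\nB: 10\nC: 5"
    scores = {}
    for line in llm_output.strip().splitlines():
        letter, value = line.split(":")
        scores[candidates[ord(letter.strip()) - 65]] = float(value)
    total = sum(scores.values()) or 1.0
    # The normalized distribution is what entropy-based deferral operates on.
    return {c: s / total for c, s in scores.items()}
```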
(I) Theoretical Contributions
While our work is primarily empirical, our theoretical contribution is the reformulation of schema matching as information retrieval rather than binary classification, which significantly reduces computational complexity (Sec. 3.2).
(J) Dataset Size
The benchmark datasets are complex, real-world medical datasets (not small), e.g., MIMIC-OMOP (26 source, 14 target tables). Highlighting this complexity, MIMIC-OMOP required 500 hours of expert annotation to map. The datasets are also standard benchmarks (Sheetrit et al.; Zhang et al.; Narayan et al.).
We hope these answer your points; please let us know if there are any remaining concerns!
The authors introduce Matchmaker, a self-improving compositional LLM program, where multi-stage LLM calls are involved for candidate generation, refinement, and confidence scoring for the task of schema matching, which the authors formulate in the context of information retrieval. Its self-improving aspect comes from their optimization process that generates synthetic in-context examples used for the various LLM calls in the program. Their mechanism is tested against other frameworks, such as Jellyfish, LLM-DP, and SMAT, using the MIMIC-OMOP and Synthea-OMOP datasets, evaluated via the accuracy@k metric. In addition, they tested various versions of Matchmaker that incorporate randomized, zero, and self-reflected in-context examples. The majority of the experimental results show their optimized Matchmaker achieves better performance than the other methods and versions.
update after rebuttal
I have accepted this paper.
Questions for Authors
- Q1: From what I understand, it seems that this aspect is to do with generating synthetic examples. Just to confirm, is that the only aspect that is of dynamic nature?
- Q2: How exactly does Round 0 work when it is not optimized yet? In Algorithm 2 in Appendix A, I see that the self-improvement optimization happens in all stages, but by first calling the entire algorithm (referring to Matchmaker()). So what exactly are the instances used in that first call, i.e., Matchmaker()?
- Q3: How exactly is the LLM trained?
- Q4: Are the synthetic examples still “unlabeled” or “labeled” when they are optimized? How is that verified when there is no human-in-the-loop intervention? Or does this system need to have human-in-the-loop?
Claims and Evidence
Yes, they are supported via the experiments against the existing methods and ablations of their own method.
Methods and Evaluation Criteria
Yes, the proposed method and evaluation criteria make sense for the schema matching task.
Theoretical Claims
There aren’t any theoretical claims.
Experimental Design and Analysis
I did check, including the additional details and analyses provided in Appendix B and D respectively.
Supplementary Material
I did review the supplementary materials (all parts), but particularly focused on Appendix A, where a more detailed algorithm of the mechanism is provided.
Relation to Prior Literature
The main contribution is the multi-call dynamic prompting mechanism for the task of schema matching. This is useful in cases where the dataset cannot be fully accessed for privacy reasons. The most recent previous work, ReMatch, has an LLM-based solution too but it remains static in nature, lacking in-context instances.
Essential References Not Discussed
I believe the related works section is sufficient in understanding the paper.
Other Strengths and Weaknesses
Strengths:
- Application-driven ML Based: It is an important application of using LLMs for the task of schema matching.
- Non-trivial way of finding synthetic examples – significant for cases where private data cannot and should not be accessed.
- Generally comprehensive paper with satisfactory experiment and ablation setups.
Weakness:
- Need more clarity about the dynamic-nature of the algorithm (expressed these in the “Question for Authors” section).
Other Comments or Suggestions
I noticed the following typos:
- Line 254: space between attribute’s and description
- Line 387: improve instead of impfrove
- Line 967: instead of .
Ethics Review Issues
N/A
Dear R-15kv,
Thank you for your thoughtful and insightful comments! We provide answers to each point in turn.
(A) Clarifications and questions on the dynamic nature of the algorithm
Q1: Clarifying the dynamic nature of the algorithm - is it only generating synthetic examples?
Indeed, synthetic in-context example generation is the main optimization step. However, the dynamic nature of Matchmaker is broader than just synthetic in-context examples, in two ways:
(1) Self-improvement mechanism: As outlined in Section 4.4 and Algorithm 1 (Appendix A.3), Matchmaker evaluates its own execution traces to select high-performing intermediate outputs across the program’s stages. These selected traces are dynamically reused as few-shot demonstrations in subsequent executions (described next).
(2) Dynamic program behavior: The result of these bootstrapped traces is not just the generation of synthetic in-context examples; they also serve to update the multi-stage LLM program's behavior.
Overall, this results in Matchmaker's self-improvement without labeled examples, which other schema matching methods cannot do.
Q2: Clarifying round 0 of the algorithm — how does it work before optimization?
To clarify, in the initial round Matchmaker operates without in-context examples, as detailed in Section 4.4 and Algorithm 1:
- We first run the unoptimized Matchmaker (without any in-context examples) on evaluation examples from
- We capture execution traces (intermediate inputs/outputs)
- The LLM evaluator then scores these executions.
- The highest-scoring traces (and their input-outputs) are used to bootstrap synthetic in-context examples.
Hence, Matchmaker "starts cold" with a zero-shot bootstrapping process (using its own successful traces), allowing it to self-improve without requiring labeled data and addressing a key challenge in schema matching.
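For clarity, here is a minimal sketch of this bootstrapping loop; `matchmaker()` and `llm_evaluator()` are illustrative stand-ins for the compositional program and the LLM evaluator, and the trace format and selection threshold are assumptions rather than the exact details of Algorithm 1.

```python
# Illustrative sketch of zero-shot bootstrapping of synthetic in-context examples.

def bootstrap_demonstrations(queries, matchmaker, llm_evaluator,
                             min_score=4, max_demos=8):
    demos = []
    for query in queries:
        # Round 0: run the unoptimized program (no in-context examples)
        # and capture its execution trace (intermediate inputs/outputs).
        prediction, trace = matchmaker(query, demonstrations=[])
        # The LLM evaluator scores the execution (e.g., on a 0-5 scale).
        score = llm_evaluator(query, prediction, trace)
        # Keep only high-scoring traces as synthetic in-context examples.
        if score >= min_score:
            demos.append({"query": query, "trace": trace, "score": score})
    # The highest-scoring traces are reused as few-shot demonstrations
    # in subsequent executions of the program.
    demos.sort(key=lambda d: d["score"], reverse=True)
    return demos[:max_demos]
```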
ACTION TAKEN: We will update Sec. 4.4 to clarify this.
Q3: How exactly is the LLM trained?
We clarify that the LLM itself is not specifically trained (or fine-tuned) for schema matching. Rather, Matchmaker leverages a general-purpose frozen LLM (e.g., GPT-4) within its compositional program (candidate generation, refinement, and scoring). Matchmaker's key innovation is not fine-tuning the LLM weights but dynamically optimizing the end-to-end compositional system behavior via synthetic in-context examples. So the "training" is in the sense of this optimization of the compositional LLM program.
Q4: Clarifying synthetic examples: Are the synthetic examples still “unlabeled” or “labeled” when they are optimized? Is there human-in-the-loop intervention?
To clarify, the synthetic examples generated remain "unlabeled" in a traditional supervised sense, as we never explicitly verify or label them via human annotations. Instead, verification is implicitly done via an LLM evaluator, which assesses the quality of the matches through a scoring system (scale of 0-5). Thus, synthetic examples are optimized based on evaluator scoring rather than explicit human labeling. This approach deliberately removes the requirement for manual annotation and supports a fully autonomous zero-shot self-improvement system.
Hence, Matchmaker can operate without human intervention for the optimization step — rather, the system itself generates the quality labels. However, at deployment time, we show in Sec. 5.3 that a human-in-the-loop can further enhance performance, e.g., by deferring high-entropy predictions.
ACTION TAKEN: We will update Sec. 4.4 to clarify this.
(B) Typos
Thank you for flagging the typos; we will correct them in the camera-ready version.
We thank the reviewer for helping us improve our work. We hope these answer your points. Please let us know if there are any remaining concerns!
This paper presents Matchmaker for the schema matching problem, the task of finding matches between attributes across disparate data sources with different tables and hierarchies. Matchmaker has three main stages: candidate generation, refinement, and confidence scoring. The authors also propose a synthetic-data-based in-context demonstration selection strategy to further improve the approach. Empirical results on two medical schema matching benchmarks demonstrate the effectiveness of the proposed approach.
Questions for Authors
Again, I am confused about the primary area of this paper being Health / Medicine. The selection seems intended to justify why the two benchmarks in this paper are both health-related; however, it contradicts the seemingly general design and presentation of the main approach.
If there are really no other established benchmarks besides these two, I would say creating a benchmark for domains like finance and e-commerce (like the authors mentioned in the abstract) is a more significant contribution.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
No, benchmark datasets are limited.
Theoretical Claims
N/A, no theoretical claims are made in the paper.
Experimental Design and Analysis
Yes. The authors only verify the method on healthcare schema matching benchmarks, which is quite limited.
Supplementary Material
Yes, mainly the dataset part.
Relation to Prior Literature
The methodology seems to aim to solve the general Schema Matching problem (an established area of database research). However, the authors choose to verify the method only on the healthcare domain, and select the paper's primary area as "Applications->Health / Medicine".
That said, I would expect the methodology to have a health/medical-specific design or related insights to be considered appropriate for the selected primary area.
Essential References Not Discussed
N/A.
Other Strengths and Weaknesses
The paper is well-written and easy to follow, and the problem of Schema Matching is a practical problem.
The methodology itself is not that novel, more like an application-specific adaptation of existing techniques (CoT reasoning, in-context demo generation and selection, etc.)
Other Comments or Suggestions
See questions.
Dear R-9YfQ,
Thank you for your thoughtful and insightful comments. We provide answers to each point in turn.
(A) Clarifying and motivating paper area as healthcare
We agree with the reviewer that Matchmaker is generally applicable outside of healthcare. However, we selected the "Health/Medicine" track for three key reasons:
- Significant impact on healthcare/medicine: Our main motivation is real-world healthcare integration, where schema matching remains largely manual and time-consuming (e.g., mapping MIMIC to OMOP took 500 hours [Paris et al., 2021], Sec. 1). Schema matching is critical and has the potential for significant impact in healthcare due to fragmented datasets across institutions, with inconsistent schemas and terminologies. Effective matching enables data interoperability, allowing the creation of larger, integrated datasets, which is essential for clinical data integration for downstream models, as well as for external validation of models (see Sec. 1 and Impact Statement). Hence, advances in schema matching like Matchmaker have the potential for significant impact on healthcare.
- Well-understood problem in healthcare & complexity of healthcare: We use two real-world healthcare benchmarks—MIMIC-OMOP and Synthea-OMOP—commonly used in prior work (Sheetrit et al., 2024; Zhang et al., 2023; Narayan et al., 2022). Additionally, these healthcare benchmarks also reflect the complexity of healthcare data schema matching.
- Health-specific design: Matchmaker supports privacy-preserving schema matching, which is essential in healthcare where access to raw patient data is limited. It operates solely on schema-level metadata (Sec. 3.1), a realistic constraint in health contexts. In other words, the constraints of healthcare are relevant to the design.
We hope this clarifies. That said, we thank the reviewer for the suggestion and have added experimental validation on datasets from other domains as suggested (e-commerce and general knowledge bases) to demonstrate Matchmaker's generalizability; see response (B).
(B) Additional datasets beyond healthcare
We thank the reviewer for the suggestion to assess Matchmaker on other domains, such as e-commerce. In response, we conduct new experiments on non-healthcare datasets: Magellan (e-commerce as suggested) and WikiData (general knowledge base). The results are shown below and confirm Matchmaker’s strong performance in domains beyond healthcare.
| Dataset | Matchmaker | ReMatch |
|---|---|---|
| Wikidata | 0.95 ± 0.04 | 0.84 ± 0.03 |
| Magellan | 1.00 ± 0.00 | 1.00 ± 0.00 |
However, we note that these datasets pose simpler challenges compared to healthcare. They typically involve single-table schemas with fewer columns and focus on direct feature-to-feature matching. In contrast, our healthcare tasks require reasoning over dozens of source tables and hundreds of attributes—first identifying the relevant target table, then performing column-level matching. These findings reinforce our focus on healthcare schemas as they better showcase the advantages of advanced schema matching techniques.
Action taken: We will include these new results and discussion in the camera-ready version. Thank you again for the suggestion!
(C) Clarifying Novelty
We respectfully disagree that Matchmaker is merely an application of existing techniques. We clarify that our novel contributions are fourfold:
- Novel compositional LLM program: Unlike prior work using single LLM calls (Sec. 2, Table 3), Matchmaker employs a multi-stage compositional approach that enables more complex reasoning and superior performance (Appendix A.1).
- Novel optimization mechanism for zero-shot self-improvement: We introduce a novel optimization method via synthetic in-context examples to self-improve without labeled data (Sec. 4.4). This significantly outperforms other self-improvement methods (Table 2). Moreover, the optimization mechanism is applicable to other compositional LLM programs.
- Novel formulation: As detailed in Sec. 3.2, we reformulate schema matching as an information retrieval task rather than binary classification, resulting in better efficiency (Appendix D.2).
- Human-in-the-loop deferral: Unlike existing methods, Matchmaker permits deferral to humans (Sec. 5.3), an essential feature for real-world deployment, especially in healthcare.
We thank the reviewer for helping us improve our work. We hope these answer your points; please let us know if there are any remaining concerns!
This paper introduces Matchmaker, a compositional LLM program for schema matching featuring a novel zero-shot self-improvement mechanism. The reviews were mixed, with positive scores (4, 4) and negative scores (1, 2). The authors provided detailed rebuttals addressing concerns, including adding new experiments for generalizability. The positive reviewers were satisfied (though one of them reported low familiarity with the field). While the negative reviewers acknowledged the rebuttal, they did not update their scores or engage further, leaving some concerns formally unresolved from their perspective, though substantively addressed by the authors. Overall, the strengths and the successful rebuttal of key criticisms lean towards a weak accept.
Consolidated Strengths:
- Addresses Important Problem: Tackles the challenge of schema matching with real-world application, particularly demonstrated in healthcare domains (R-9YfQ, R-15kv, R-pQV8).
- Well-Presented: The paper is well-written, clear, and includes useful examples and extensive supplementary material (R-9YfQ, R-15kv, R-pQV8).
- Practical Features: Incorporates valuable aspects like human-in-the-loop deferral support (validated in Sec 5.3) and operates on schema-level metadata, respecting privacy constraints (Authors highlighted; R-15kv satisfied; R-mppD concern addressed by authors pointing to existing analysis).
Consolidated Concerns:
- Scalability/Cost Analysis: Request for more detailed runtime/cost analysis beyond LLM call complexity (R-mppD). Authors pointed to existing call complexity analysis (Appendix D.2/Table 6) showing improvement over baselines, but R-mppD did not acknowledge this specific point was resolved.
- Confidence Scoring Validity: Skepticism about LLM-generated confidence scores (R-mppD). Authors defended via literature and practical HITL results (Sec 5.3); R-mppD did not acknowledge this as resolved.
Mixed impressions:
- Novelty/Contribution Debate: Proposes a compositional LLM program with a zero-shot self-improvement mechanism using synthetic examples, differentiating it from prior work (Authors' claim, accepted by R-15kv, R-pQV8). However, there were concerns raised about methodological innovation (R-mppD, R-9YfQ). Authors rebutted with specific contributions. While R-15kv/R-pQV8 seemed convinced, R-mppD/R-9YfQ did not update their stance post-rebuttal, leaving this point contested by them.
- Evaluation Scope: Concern about initial focus only on healthcare datasets (R-mppD, R-9YfQ, R-pQV8). Addressed by authors during rebuttal with new experiments on Magellan/Wikidata showing generalizability; reviewers acknowledged rebuttal but did not update scores based on this. Additionally, the method outperforms existing methods on challenging real-world healthcare benchmarks as reported in the original submitted paper.