PaperHub
Overall rating: 6.1/10
Poster · 4 reviewers
Scores: 3, 4, 3, 3 (min 3, max 4, std 0.4)
ICML 2025

QEM-Bench: Benchmarking Learning-based Quantum Error Mitigation and QEMFormer as a Multi-ranged Context Learning Baseline

Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Quantum Error Mitigation, Dataset Benchmark, Graph Learning

Reviews and Discussion

Review
Rating: 3

This paper presents QEM-Bench, a benchmarking suite for machine learning-based quantum error mitigation (ML-QEM), addressing the lack of standardized datasets in the field. The benchmark includes twenty datasets spanning different circuit types and noise models to enable consistent evaluation of ML-QEM techniques. The authors also propose QEMFormer, a two-branch architecture incorporating MLPs for short-range dependencies and Graph Transformers for long-range dependencies, leveraging directed acyclic graph (DAG) representations of quantum circuits. Empirical evaluations show QEMFormer outperforms other ML-QEM baselines across diverse settings, reinforcing the claim that a structured representation enhances mitigation performance.

Questions for Authors

  1. How does QEMFormer perform on hardware other than IBM Kyiv?
  2. Can you provide ablation studies showing the contribution of each component of QEMFormer?
  3. Why was RMSE chosen as the primary evaluation metric, and how does it compare to alternative metrics?
  4. How does QEM-Bench compare to prior QEM benchmarks, if any exist?
  5. Can you clarify the dataset curation process and whether it reflects real-world noise characteristics?

Claims and Evidence

5/10

While the paper makes strong claims about the benefits of QEM-Bench and QEMFormer, the supporting evidence is limited in some areas. The benchmark provides a well-structured dataset, but its generalizability to real-world quantum hardware is not thoroughly demonstrated. The experimental results highlight improvements over prior ML-based approaches, but the paper does not compare against traditional QEM techniques like Zero-Noise Extrapolation (ZNE) in real-device settings. The justification for QEMFormer’s performance lacks ablation studies isolating the contributions of different architectural components.

Methods and Evaluation Criteria

6/10

The proposed evaluation framework is well-motivated, and QEM-Bench is a meaningful contribution. However, there are inconsistencies in evaluation settings—while the benchmark includes diverse noise models and circuit types, real hardware evaluations are limited to IBM Kyiv, raising concerns about applicability to other quantum architectures. The choice of Root Mean Squared Error (RMSE) as the primary evaluation metric is standard but insufficient, as the robustness of the mitigation method under varying noise intensities remains unexplored.

Theoretical Claims

4/10

The paper lacks formal theoretical analysis of QEMFormer’s effectiveness beyond empirical results. The use of DAG representations and multi-range feature extraction is intuitively justified but not rigorously analyzed. Claims regarding the preservation of circuit topology and feature locality should be accompanied by theoretical bounds or complexity analyses. The paper cites relevant works on graph-based representations, but it does not explicitly demonstrate why QEMFormer outperforms existing methods from a theoretical standpoint.

Experimental Design and Analysis

6/10

The experiments are comprehensive in terms of dataset coverage. The results confirm QEMFormer’s superiority over prior ML-QEM methods. While the authors benchmark across multiple noise models, hyperparameter tuning details are unclear, and there is no discussion on whether the performance gains hold for circuits larger than those evaluated. Furthermore, real-device evaluations are minimal, limiting the reliability of the reported findings.

Supplementary Material

4/10

The supplementary material is not explicitly discussed in the main paper, making it difficult to assess its relevance. The utility of the supplementary material is unclear.

Relation to Existing Literature

7/10

The paper is well-positioned within the ML-QEM literature, citing relevant works on machine learning for quantum error mitigation, benchmarking efforts, and graph-based circuit representations. However, there is no engagement with broader ML-based circuit optimization techniques, which could provide useful insights.

Essential References Not Discussed

6/10

Most of the relevant references are cited. However, works on hardware-specific noise mitigation techniques and hybrid classical-quantum optimization strategies are not discussed comprehensively.

Other Strengths and Weaknesses

Strengths:

  1. QEM-Bench provides a standardized evaluation suite, which is a valuable asset for the field.
  2. QEMFormer’s hybrid approach to short- and long-range dependency modeling is innovative.
  3. The inclusion of different noise models and circuit types enhances the credibility of QEM-Bench.
  4. The experiments cover multiple baselines, demonstrating QEMFormer’s competitive performance.
  5. The paper is well-organized and presents technical details in a clear manner.

Weaknesses:

  1. Lack of real-hardware validation beyond IBM Kyiv: Generalizability to other quantum platforms is not demonstrated.
  2. Limited ablation studies: The contribution of each architectural component in QEMFormer is not isolated.
  3. Minimal theoretical justification: Claims about circuit topology preservation and information retention are not rigorously analyzed.
  4. Unclear evaluation criteria: The choice of RMSE as the sole metric does not provide a full picture of mitigation effectiveness.
  5. Limited engagement with non-ML QEM methods: The paper does not sufficiently compare QEMFormer with traditional mitigation techniques.

Other Comments or Suggestions

  1. Include additional real-device evaluations beyond IBM Kyiv for broader applicability.
  2. Provide detailed ablation studies on QEMFormer’s feature encoding and architecture.
  3. Compare against non-ML-based QEM techniques more rigorously.
  4. Offer hyperparameter tuning details to improve reproducibility.
Author Response

We would like to thank the reviewer for the insightful comments and inquiries, and the positive evaluation of our work. We have summarized the newly added Figs and Tabs at this link.

Below are our responses.

1. Real devices other than IBM Kyiv and traditional QEM techniques like ZNE on real devices.

We apologize for the earlier oversight. We have now incorporated the following enhancements:

  • Two datasets derived from 63-qubit Trotter circuits executed on IBM Brisbane have been included—one with extreme outlier filtering (Brisbane Pre) and one without (Brisbane Raw).
    • Dataset statistics are detailed in Tab. 1.
    • The performance of various QEM techniques is provided in Tab. 2 and Fig. 1.
  • We have evaluated ZNE across the datasets from Kyiv (Pre and Raw) and Brisbane (Pre and Raw), as detailed in Fig. 1-2 and Tab. 2-3.

We would like to note that the CDR approach entails significant time costs due to the need to construct training sets for each circuit individually. Consequently, we plan to include its results in future revisions.

2. Ablation studies of QEMFormer

We now conduct two sets of analyses.

  • Tab. 4 compares the performance impact of the MLP and Graph Transformer modules.
  • Tab. 5 evaluates our multi-ranged feature extractor.

3. RMSE as the sole evaluation metric?

We apologize for any potential confusion. In the original manuscript, we report RMSE (Tabs. 1 & 2), MAE (Tabs. 5 & 6), and standard deviation (Tabs. 1 & 2) to provide a comprehensive evaluation. RMSE emphasizes larger deviations and is sensitive to outliers, while MAE directly measures the average error magnitude. Additionally, violin plots (Figs. 3 & 6 of the original manuscript) depict the full error distribution along with the STD.
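As a small illustration of why both metrics are informative, consider the following NumPy sketch with illustrative error values (not our actual data):

    import numpy as np

    # Hypothetical per-circuit errors between mitigated and ideal EVs; one large outlier
    errors = np.array([0.01, 0.02, 0.015, 0.30])

    mae = np.mean(np.abs(errors))         # average error magnitude
    rmse = np.sqrt(np.mean(errors ** 2))  # penalizes the outlier more heavily

    print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")  # RMSE > MAE when outliers dominate

Reporting both therefore exposes whether a mitigator fails occasionally but badly (high RMSE) or is uniformly mediocre (high MAE).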

4. About prior QEM benchmarks.

To the best of our knowledge, there are currently no standardized benchmarks for the ML-QEM task. This gap, as also noted by reviewers Qjn1 and 6Drn, has been a primary motivation for this work.

5. Dataset curation process and whether it reflects real-world noise

We detail the dataset curation process in Tab. 6. Our design intentionally reflects real-world noise characteristics in two key ways:

  • Diverse Circuits: A single type of noise impacts different circuits differently. By including multiple types of structured circuits as well as random unstructured circuits, QEM-Bench aims to provide a comprehensive depiction of real-world conditions.
  • Broad Noise Modeling: We incorporate multiple realistic noise sources, including data from real devices (Kyiv), fake providers (published by IBM) resembling the real devices, and manually set configurations based on statistics from representative real devices (e.g., Sycamore [1] for incoherent settings).

This design aims to let QEM-Bench effectively mirror the diverse noise characteristics encountered in practice and bridge the current gap for further research.

6. Hyperparameter tuning details.

We include an analysis of how the number of layers and hidden dimensions in the MLP and Graph Transformer modules affect QEMFormer's performance (see Fig. 3). Our findings indicate that increasing the model size initially enhances model capability; however, excessively large models can lead to overfitting and degraded performance. We also note that the hyperparameter settings are provided in Tab. 7 of the original manuscript.

7. Formal theoretical analysis.

We appreciate the reviewer's suggestion. Our primary focus in this work is to introduce a comprehensive benchmark dataset that spans a variety of circuits and noise configurations, addressing a significant gap in ML-QEM studies.

On the method side, we specifically design a two-branch neural network architecture for QEM tasks. The idea is that the multi-range context and dependencies that frequently occur in quantum systems can be better captured; this effectiveness is also verified by our extensive empirical results. We plan to incorporate further qualitative analysis in the revised manuscript.
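For illustration only, a minimal sketch of a generic two-branch design in this spirit is given below. This is not the authors' QEMFormer implementation; the layer sizes, the pooling, and the use of torch_geometric's TransformerConv are assumptions made for the sketch.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import TransformerConv, global_mean_pool

    class TwoBranchMitigator(nn.Module):
        """Toy two-branch regressor: a node-wise MLP for short-range features and a
        graph transformer over the circuit DAG for long-range context, fused to
        predict the ideal expectation value."""
        def __init__(self, node_dim=16, hidden=64):
            super().__init__()
            # Short-range branch: MLP applied node-wise, then pooled per graph
            self.mlp = nn.Sequential(nn.Linear(node_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
            # Long-range branch: graph transformer over DAG edges
            self.gt = TransformerConv(node_dim, hidden, heads=1)
            self.head = nn.Linear(2 * hidden, 1)  # fuse both branches and regress the EV

        def forward(self, x, edge_index, batch):
            local = global_mean_pool(self.mlp(x), batch)
            glob = global_mean_pool(torch.relu(self.gt(x, edge_index)), batch)
            return self.head(torch.cat([local, glob], dim=-1)).squeeze(-1)

The two pooled representations are concatenated so that the regression head can weigh local gate-level statistics against global circuit context.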

Due to the inherent complexity of quantum systems and noise profiles, we leave a rigorous theoretical analysis of QEMFormer’s effectiveness for future work.

8. Discussion about hardware-specific noise mitigation and circuit optimization techniques.

We will expand the related works section to include discussions on hardware-specific noise mitigation techniques, hybrid classical-quantum optimization strategies, and ML-based circuit optimization methods. We will clarify the differences and relationships between these approaches and ML-QEM techniques, especially QEMFormer.

We hope the reply eases your concern. If you have any further questions, we would be pleased to respond.

References:

[1] Quantum supremacy using a programmable superconducting processor, Nature 574, 2019.

Reviewer Comment

In light of all the reviews and authors' rebuttal, my score is confirmed.

Author Comment

We appreciate your valuable time and insightful feedback. We will revise the manuscript in accordance with your suggestions and our discussion. Thank you again for your positive evaluation of our work!

Review
Rating: 4

This paper introduces a dataset for benchmarking quantum error mitigation techniques, as well as a graph-transformer model to serve as a baseline. The dataset consists of three evaluation settings, each with different levels of added noise: Standard (general-purpose testing with Trotterized TFIM circuits, random circuits, and MaxCut QAOA circuits), Advanced (testing aspects of generalization capabilities), and two large-circuit datasets executed on real quantum hardware. The circuit data samples are encoded as directed acyclic graph (DAG) representations and are provided with statistical information about the circuit itself (e.g., number of gates, number of parameters), as well as the noisy and ideal measurement expectation values. To leverage the graph representations and the informational feature vectors, a graph transformer (QEMFormer) is introduced and compared on the proposed benchmark against a set of QEM algorithms from the literature.
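(For reference, such DAG representations can be obtained with Qiskit's circuit-to-DAG converter; the following minimal sketch uses an illustrative node featurization, not necessarily the paper's exact encoding.)

    from qiskit import QuantumCircuit
    from qiskit.converters import circuit_to_dag

    # Small example circuit; the benchmark's circuits are Trotterized TFIM, random, and QAOA
    qc = QuantumCircuit(3)
    qc.h(0)
    qc.cx(0, 1)
    qc.rx(0.5, 2)

    dag = circuit_to_dag(qc)
    # One plausible per-node feature set: gate name, number of qubits acted on, parameter count
    features = [(node.op.name, len(node.qargs), len(node.op.params))
                for node in dag.topological_op_nodes()]
    print(features)  # e.g., [('h', 1, 0), ('cx', 2, 0), ('rx', 1, 1)]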

Questions for Authors

No questions.

Claims and Evidence

Since the paper is mostly focused on introducing and specifying the benchmark dataset, I’d say the claims are rather on the neutral side. The QEMFormer is compared against other algorithms from the literature, where it shows consistently strong performance across most of the dataset settings. The evaluation is thorough, with a good selection of comparable algorithms covering both ML-based and non-ML-based error mitigators.

Methods and Evaluation Criteria

Both the proposed dataset and the QEMFormer are well motivated. The evaluation criterion of (root mean squared or mean) absolute error mitigated is ultimately the logical metric, although a more critical discussion of other factors, such as the run-time or complexity of the compared algorithms, would have been beneficial.

Theoretical Claims

No theoretical claims are made in this paper.

Experimental Design and Analysis

The experimental comparison in Ch. 5 is in itself quite simple: a direct comparison of the final absolute errors mitigated across different circuit-type and noise settings. All results are reported with mean and std. dev. and appear to be sound.

Supplementary Material

I have skimmed the appendix, which mostly contains additional explanations of the metrics and some additional experiments. All relevant information is in the main paper.

Relation to Existing Literature

The benchmark should be helpful for a more standardized quantum evaluation, which is currently indeed rather lacking. A comprehensive, maintained dataset as proposed here would certainly help the field. The QEMFormer seems to be a performant, but on this data specialized baseline (i.e., the graph structure).

Ultimately, quantum error correction is a technical problem that needs to be solved to make QC in itself a technically sound computing device. As an intermediary solution ML-based mitigators may have their merits, but on a practical level QEM is a problem that will (need to) be solved on the hardware side, which is why I would not expect learning based QEM methods to stay around for too long.

Essential References Not Discussed

None that I am aware of. The discussion on related work covers the field decently well.

Other Strengths and Weaknesses

I generally have very little to critique on this paper, it is well written, formalized and visualized. While the contribution is generally rather “short-term”, as mentioned above, until QC hardware handles QEM natively, I find this benchmark a good current contribution to a current problem.

Other Comments or Suggestions

The colors in Fig.1 could be stronger, and Fig. 4 is too small to read.

Author Response

We appreciate your constructive feedback and positive evaluation of our work. We acknowledge that quantum error correction is ultimately intended to be solved on the hardware side. However, the significant qubit overhead associated with QEC renders it less feasible in the near term, especially for large-scale circuits, as illustrated in [1] and [2]. Therefore, during the NISQ era, ML-QEM methods offer an effective and efficient interim approach, and we hope our work lays a foundation for future studies.

We will revise the manuscript to clearly reflect this nuanced perspective and incorporate your valuable suggestions.

References:

[1] Quantum Error Mitigation, Reviews of Modern Physics, American Physical Society, 2022

[2] Near-term quantum computing techniques: Variational quantum algorithms, error mitigation, circuit compilation, benchmarking and classical simulation, Science China Physics, Mechanics & Astronomy, 2023.

Review
Rating: 3

The paper introduces QEM-Bench, a benchmarking suite designed to evaluate machine learning-based Quantum Error Mitigation (QEM) techniques. The benchmark includes 20 datasets covering various circuit types and noise models to standardize QEM evaluation. Furthermore, the paper proposes QEMFormer, a novel learning-based QEM method that improves quantum error mitigation by leveraging both short-range and long-range dependencies within quantum circuits. The paper evaluates QEMFormer against various baselines, showing its superior performance across different circuit families, noise configurations, and real quantum devices (IBM Kyiv 50-qubit experiments).

Questions for Authors

I think the paper is solid in terms of the benchmark and proposed model. What I am concerned about is the evaluation on the real quantum devices:

  • Why does the paper evaluate on IBM Kyiv only? There are other quantum computers in the IBM Quantum system.

  • Is the evaluation on a 50-qubit system good enough? Why? If the circuits are deployed on a 127-qubit system, will the performance stay the same?

Claims and Evidence

  • QEM-Bench provides a comprehensive and standardized benchmarking suite for machine learning-based QEM techniques.

  • Based on experimental results, QEMFormer outperforms other machine learning-based and traditional QEM methods.

  • The two datasets with 50-qubit circuits executed on IBM Kyiv might not be sufficient. This is because, in practice, real quantum systems can have more than 50 qubits. The benchmark on real quantum systems should demonstrate scalability.

Methods and Evaluation Criteria

  • The proposed benchmarking is well motivated, since prior QEM methods might have different evaluation protocols, which could make evaluations unfair in some respects or impossible to compare because of a lack of reproducibility.

  • The proposed QEMFormer, which utilizes the long-range and short-range dependencies in a Graph Transformer, is well aligned with quantum circuit structure, which can be represented as directed acyclic graphs (DAGs).

Theoretical Claims

The paper does not have any explicit theoretical claim.

Experimental Design and Analysis

The proposed QEM-Bench shows the diversity of circuit designs and scenarios used to evaluate the QEM methods. Besides, as mentioned in "Claims And Evidence", the evaluation on only the IBM Kyiv system is a concern, since it does not convincingly show that the QEM methods can work on multiple systems.

Supplementary Material

I have reviewed the supplementary material, including experiment configuration, backgrounds, evaluation metrics, and additional experimental results.

Relation to Existing Literature

The paper can potentially standardize the evaluation of the QEM methods, making the comparisons fair and transparent.

Essential References Not Discussed

There are no additional related works that are essential to understanding the key contributions of the paper.

Other Strengths and Weaknesses

I don't have any comments on other strengths and weaknesses of the paper.

Other Comments or Suggestions

No other comments or suggestions.

Author Response

We would like to thank the reviewer for the insightful questions and positive evaluation of our work. We summarize the Tabs. and Figs. for newly added experiments at this link.

Below are our responses to each question.

Q1: Why does the paper evaluate IBM Kyiv only? There are other quantum computers in the IBM Quantum system.

Initially, we evaluated only IBM Kyiv due to the significant financial and time costs associated with executing large-scale quantum circuits on real quantum devices. Although multiple IBM devices are available, these constraints still limited our testing.

To improve the diversity of QEM-Bench and strengthen your confidence in our work, we have expanded our evaluation to include two datasets from the IBM Brisbane device: one with extreme outlier filtering (Brisbane Pre) and another with raw, unfiltered data (Brisbane Raw).

  • Detailed statistics for the two datasets are provided in Tab. 1.
  • The performance of various QEM techniques is shown in Tab. 2 and Fig. 1.

Notably, even with the more severe noise effects on the Brisbane device (compared to Kyiv), QEMFormer consistently outperforms the baseline methods.

We shall include these datasets and results in the revised manuscript.

Q2: Is the evaluation on a 50-qubit system good enough? Why? If the circuits are deployed on a 127-qubit system, will the performance stay the same?

We would like to clarify that the construction of our datasets on 50-qubit systems is limited by the prohibitive computational cost of obtaining ideal EVs for circuits with over 100 qubits. Although devices like Kyiv and Brisbane support up to 127 qubits, their outputs are inherently noisy. Hence, the ideal EVs, which serve as the dataset labels, have to be obtained through classical simulation. Yet, to the best of our knowledge, the IBM simulators that provide ideal simulation without restricting circuit structure are limited to 63 qubits (namely, the Aer matrix_product_state simulator), making 100-qubit simulations difficult under the current implementation of QEM-Bench.

Furthermore, to the best of our knowledge, the only ML-QEM method exploring circuits beyond 100 qubits, [1], also reports this difficulty in obtaining ideal EVs and thus uses ZNE-mitigated results as training labels. However, our experiments applying IBM's built-in ZNE to both 50-qubit and 63-qubit circuits (on the Kyiv and Brisbane devices) show only marginal improvements over noisy outcomes, with significant residual errors (see Fig. 1–2 and Tab. 2–3). This suggests that using ZNE outcomes as labels may not provide a fair or reliable benchmark for larger systems.
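For readers less familiar with ZNE, the idea is to measure EVs at artificially amplified noise levels and extrapolate back to the zero-noise limit. A minimal linear-extrapolation sketch with illustrative numbers (not our measured data) follows:

    import numpy as np

    # Noise scale factors (1 = native noise) and illustrative noisy EVs at each scale
    scale_factors = np.array([1.0, 2.0, 3.0])
    noisy_evs = np.array([0.62, 0.48, 0.36])  # hypothetical measurements

    # Linear fit and extrapolation to the zero-noise limit (scale factor 0)
    slope, intercept = np.polyfit(scale_factors, noisy_evs, deg=1)
    ev_zne = intercept
    print(f"ZNE-extrapolated EV ~ {ev_zne:.3f}")

When the residual error of such extrapolated values is large, treating them as ground-truth labels propagates that error into any model trained on them.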

Accordingly, due to the current time constraint, we plan to extend the inclusion of systems exceeding 100 qubits to future work. Nevertheless, to enhance your confidence and further evaluate the scalability of various ML-QEM methods, we have developed the Brisbane Pre and Brisbane Raw datasets on 63-qubit systems (results summarized in Tab. 2 and Fig. 1).

We hope the reply eases your concern. If you have any additional questions, we would be pleased to provide further responses.

References:

[1] Machine learning for practical quantum error mitigation, Nature Machine Intelligence, 2024

Reviewer Comment

The authors have addressed my questions. I will maintain the score.

Author Comment

We sincerely appreciate your time and thoughtful feedback. We will revise the manuscript in accordance with your suggestions and integrate the newly added datasets. We are glad that these modifications address your concerns and thank you once again for your positive evaluation of our work!

Review
Rating: 3

The authors make two primary contributions in their manuscript. First, they compile QEM-Bench, a set of twenty datasets that the community can use to benchmark ML-based approaches to quantum error mitigation (QEM). Second, they introduce a new ML-based approach to QEM called QEMFormer, which combines multi-layered perceptrons (MLPs) and graph transformers to predict the true expectation value of a quantum circuit based on the noisy measurement statistics. Comparisons between QEMFormer and other ML-based QEM methods are made on QEM-Bench. The authors claim that these results show that QEMFormer is generally superior to existing methods across QEM-Bench.

Update after rebuttal

Most of my concerns were addressed with post-review edits. I’ve raised my score.

Questions for Authors

  1. How does the random forest model in [1] perform on QEM-Bench?

  2. How does the graph transformer model in [2] perform on QEM-Bench?

  3. How did you get the true expectation values for the one-dimensional TFIM circuits run on ibm_kyiv? Are exact solutions known?

  4. Given the large error bars, how do you know if the benchmarks have enough resolving power to meaningfully distinguish between different approaches?

Claims and Evidence

The authors provide modest evidence for their claim that QEMFormer outperforms the other ML-based QEM methods that they tested. QEMFormer routinely achieves the lowest or second lowest root mean squared error on the various datasets in QEM-Bench. However, the large error bars make it difficult to ascertain if this performance is statistically significant. I would like to see more rigorous analysis to support their claim.

The authors also overlook several competing ML-based QEM methods that have shown good performance. They chose not to include the random forest model from [1] in their paper, despite the random forest model routinely beating the MLP and GNN in [1] (which are both included). The authors also do not compare their method to the graph transformer approach in [2].

Methods and Evaluation Criteria

The overall method of comparing several competing models on multiple datasets is sound. The use of RMSE, absolute error (AE), and mean absolute error (MAE) is also typical, as is reporting error bars. However, the large error bars mean that you need to perform additional statistical analyses to show that QEMFormer truly outperforms the other models.

Theoretical Claims

N/A

Experimental Design and Analysis

I greatly appreciated that the authors included data simulated using both incoherent (i.e., stochastic) and coherent error models. A lot of papers overlook coherent errors, despite evidence that they are much harder for ML approaches to model. Nonetheless, I question the value of benchmark sets simulated under fixed noise parameters. By not varying the noise strengths across error models it is hard to get a good sense for how models perform in a variety of noise regimes, and the community risks training to the standard rather than truly probing their models.

It is also hard to judge how hard these benchmarks are from the data presented. For instance, no analysis is presented showing how far the noisy expectation values differ from the true expectation values in each dataset.

Supplementary Material

I read the appendices.

Relation to Existing Literature

See my comments in the “Claims and evidence” and “Other comments or suggestions” section about missing models and less-than-ideal descriptions of other works.

Also, it isn’t really explained how they came up with these benchmarks. For instance, both one-dimensional transverse field Ising model circuits and mitigating unseen observables were used in [1]. You should credit that work!

Essential References Not Discussed

[1] H. Liao, D. Wang, I. Sitdikov, C. Salcedo, A. Seif, and Z. Minev. Machine learning for practical quantum error mitigation. Nature Machine Intelligence, 6(12), December 2024.

[2] T. Bao, X. Ye, H. Ruan, C. Liu, W. Wu, J. Yan. Beyond circuit connections: a non-message passing graph transformer approach for quantum error mitigation. ICLR 2025.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

The description of the GNN modelled on [1] fails to convey how similar it is to the GNN branch of QEMFormer. Yes, they use an undirected acyclic graph instead of a directed acyclic graph and a slightly different architecture, but when phrased this way QEMFormer doesn’t seem like that big of an advancement.

Author Response

Thank you for your thoughtful comments and inquiries. We summarized all Tabs. and Figs. of the newly added experiments at this link. Below are our responses.

1: RF [1] and GTraQEM [2] on QEM-Bench.

We apologize for the oversight and now implement the RF in [1] and GTraQEM in [2] and evaluate them on 20 datasets from QEM-Bench (Figs. 1&2, Tabs. 1-3).

Although [1] reported that the RF outperformed the MLP and GNN on a simple 4-qubit random circuits dataset, we consider it not convincing to state that "the RF routinely beats the MLP and GNN in [1]" as no further comparisons are conducted in other settings in [1]. Our comprehensive evaluation shows that RF's performance degrades in advanced settings such as trotter zero-shot.

Also, though GTraQEM shows competitive performance in certain settings, its non-message passing aggregation incurs high computational costs with increasing circuit depth, as constructing its structural matrix has $O(n^3)$ complexity.

2. How to obtain the ideal EV on the Kyiv device.

The ideal EVs were computed solely based on the circuit, independent of any quantum device. For 50-qubit circuits, we use the IBM Aer simulator for ideal simulation, specifically with:

from qiskit_aer import AerSimulator
from qiskit_ibm_runtime import EstimatorV2 as Estimator  # assumed import for the Estimator used here

simulator_ideal = AerSimulator(method='matrix_product_state')
estimator = Estimator(mode=simulator_ideal)
job = estimator.run([(circs, observables)])

3. Large Error Bars?

There may be confusion in our previous writing. To clarify, the reported error bars (in Tab. 2 & 3 of the original manuscript) refer to

$$\sigma_{\text{dataset}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y^{\text{miti}}_i - y^{\text{ideal}}_i\right)^2},$$

where $N$ is the number of test data points. This metric quantifies the deviations of noisy (or mitigated) results from the ideal values across the dataset. It is not a measure of model reproducibility across different random-seed runs, which would instead be computed by

$$\sigma_{\text{stability}} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(\text{MAE}_k\right)^2} \quad \text{(or similarly for RMSE)},$$

with $K$ representing the number of runs.

To show this distinction, we executed QEMFormer under five different random seeds, summarized in Tab. 4.

We will clarify it in our main paper:

  1. The models show a small $\sigma_{\text{stability}}$, indicating that the mean performance is reliable for comparison.
  2. The relatively large $\sigma_{\text{dataset}}$ reflects the variable impact of noise on different circuits, leading to diverse deviations from the ideal outcomes.
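To make this distinction concrete, here is a minimal NumPy sketch with illustrative numbers (not our actual data):

    import numpy as np

    # Per-circuit mitigated and ideal EVs for one test set (illustrative values)
    y_miti = np.array([0.52, 0.31, -0.18, 0.74])
    y_ideal = np.array([0.50, 0.35, -0.20, 0.70])

    # sigma_dataset: spread of per-circuit deviations from the ideal values (RMSE form)
    sigma_dataset = np.sqrt(np.mean((y_miti - y_ideal) ** 2))

    # sigma_stability: variability of the aggregate metric across K random-seed runs
    mae_per_seed = np.array([0.041, 0.043, 0.040, 0.044, 0.042])  # K = 5 runs
    sigma_stability = np.sqrt(np.mean(mae_per_seed ** 2))

    print(sigma_dataset, sigma_stability)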

4. Difference between GNN in [1] and QEMFormer

We would like to note that QEMFormer is an integrated architecture rather than a mere refinement of the GNN in [1]. A comparison is shown in Tab. 5. Importantly, QEMFormer experimentally outperforms the GNN in [1] in most settings, demonstrating that its two-branch design is inherently well suited to quantum systems.

5. How was QEM-Bench designed?

We apologize for the unclear expression. Regarding the design of QEM-Bench:

  1. QEM-Bench is built on insights from existing QEM and quantum computing research to ensure it includes the key concerns of the community.

  2. Key Enhancements:

    • Structural Diversity: Incorporates representative QAOA circuits.
    • Circuit Complexity: Enriches gate types and parameter selection in random circuits.
    • Noise Characterization: Includes coherent noise.
    • Evaluation Scope: Expands zero-shot settings to test generalization ability.
    • Real-World Data: Constructs large-scale circuit datasets executed on quantum devices with ideal EVs as labels.

By integrating these enhancements with established research, QEM-Bench aims to address the need for a standardized benchmark evaluation dataset for ML-QEM techniques.

6. Are noise parameters fixed across error models?

We would like to clarify that QEM-Bench does incorporate different noise strengths across error models. Incoherent noise parameters are derived from the real device Sycamore [3], and the parameters of the real devices and simulators provided by IBM are not manually set, thereby capturing a realistic and diverse range of noise regimes.

7. How far do the noisy EVs differ from the ideal EVs?

We respectfully note that the error distributions, MAE, and RMSE of raw data are detailed in Tabs 2-4 and Figs 3, 4 & 6 of the original manuscript.

We hope the reply eases your concern. Should you have any further inquiries, we would be pleased to offer responses.

References:

  • [1] Machine learning for practical quantum error mitigation. Nature Machine Intelligence, 2024
  • [2] Beyond circuit connections: a non-message passing graph transformer approach for quantum error mitigation. ICLR 2025.
  • [3] Quantum supremacy using a programmable superconducting processor, Nature 574, 2019
Reviewer Comment

Thank you for taking the time to respond to my report. I especially appreciate the inclusion of the random forest and GTraQEM. Here are a few additional comments.

  1. Large error bars: Thank you for including results on $\sigma_{\text{stability}}$ along with the original values of $\sigma_{\text{dataset}}$. Reporting both is a good idea. However, doing so does not address my original concern, which is that it is very difficult to assess if any of your results are statistically significant. You are comparing many different models across many different datasets. You can’t just report mean performance and error bars on each dataset and then say that “our model performed better than most of the other models more often than not, so it is better.” With so many possible pairwise comparisons, it is very hard to determine the significance of the results just by looking at the error bars. I would really appreciate the addition of appropriate significance tests.

  2. Error models: Thanks for clarifying that the error parameters are fixed. By “fixed” I mean that only a single instance of error model was generated for the five devices that you considered. My follow-up question is “why should an error mitigation benchmark use simulations from static error models?” Quantum computers are improving every day. Shouldn’t benchmark datasets reflect that improvement? Otherwise, we risk testing QEM approaches on outdated data. For instance, are you using Sycamore calibration data from 2019?

Author Comment

Thank you for the follow-up questions. The experimental results are summarized in this link. Please find our response below.

Q1: About the Significance Test

Our initial evaluation metrics align with prior work [1, 2]. To further address your concerns and substantiate our findings, we now provide a statistical analysis using paired t-tests.

Namely, we use

    from scipy import stats
    t_stat, p_value = stats.ttest_rel(baseline_err_array, ours_err_array, alternative='greater')

to demonstrate that EVs mitigated by a baseline show larger errors than those mitigated by QEMFormer for most data points in a test set. We consider a positive $t$ and $p < 0.05$ as demonstrating this claim.

The results in Tabs. 1–3 generally align with our previous findings: QEMFormer attains the best or second-best performance on most datasets.
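For completeness, a self-contained version of this test is sketched below, with synthetic per-circuit absolute-error arrays standing in for the actual measurements (illustrative only):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic per-circuit absolute errors for a baseline and for QEMFormer (illustrative)
    baseline_err = np.abs(rng.normal(0.05, 0.02, size=200))
    ours_err = np.abs(rng.normal(0.03, 0.02, size=200))

    # One-sided paired t-test: H1 = baseline errors exceed ours on the same circuits
    t_stat, p_value = stats.ttest_rel(baseline_err, ours_err, alternative='greater')
    significant = (t_stat > 0) and (p_value < 0.05)
    print(t_stat, p_value, significant)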

Q2: About the error model setups.

We appreciate the opportunity to clarify this point, which was not fully addressed in round 1 due to space limitations.

"Only a single instance of error model was generated for the five devices that you considered."

We respectfully disagree with this claim. Our work does not rely on a single error model instance for the five devices. Instead, for each of the four distinct error settings, multiple instances are incorporated. Specifically,

  • Non-Manually Constructed Noise Settings:
    • Real Devices: Noisy outcomes are obtained directly from circuit executions on IBM's quantum devices.
    • Fake Providers: Noisy outcomes are generated by directly executing circuits over IBM's fake provider backends, each emulating a specific quantum device. For example, the noise profile of FakeWashington differs from that of FakeHanoiV2.

For these settings, randomness is naturally built-in, and no modifications are performed. The usage of fake providers is in line with [1, 2].

  • Manually Constructed Noise Settings:
    • Coherent: Over-rotation rates are set at $0.02\pi$ with an additional random fluctuation (approximately $0.001$).
    • Incoherent: Two gate subsets are randomly constructed: one for gates with depolarizing errors and another for gates with Pauli errors. The error rates for each gate, as well as the readout errors, are sampled from a normal distribution whose mean is derived from the corresponding Sycamore error rates (a minimal construction sketch is given below).

For each circuit set and for each random seed, the noise models are different due to the randomness. Consequently, QEM-Bench constructs multiple instances in the incoherent and coherent noise settings.
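For illustration, a minimal sketch of such a randomized incoherent model in Qiskit Aer follows; the gate list and rate statistics are placeholders, not the exact QEM-Bench configuration:

    import numpy as np
    from qiskit_aer.noise import NoiseModel, depolarizing_error, pauli_error

    rng = np.random.default_rng()
    noise_model = NoiseModel()

    # Randomly split single-qubit gate types into a depolarizing subset and a Pauli-error subset
    gates = ['sx', 'x', 'rz', 'h']
    rng.shuffle(gates)
    depol_gates, pauli_gates = gates[:2], gates[2:]

    for g in depol_gates:
        p = abs(rng.normal(0.002, 0.0005))  # rate sampled around an illustrative mean
        noise_model.add_all_qubit_quantum_error(depolarizing_error(p, 1), [g])

    for g in pauli_gates:
        p = abs(rng.normal(0.002, 0.0005))
        noise_model.add_all_qubit_quantum_error(pauli_error([('X', p), ('I', 1 - p)]), [g])

    # Two-qubit gates and readout errors would be handled analogously with their own sampled rates.

Because the gate split and the rates are resampled, each invocation yields a different noise-model instance, which is the sense in which the models are not static.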

We are unsure which five devices are referred to; this may be due to typos in the column names of Tab. 2 & 3 in our round-1 rebuttal. We have corrected them and apologize for any confusion.

"Why should a QEM benchmark use simulations from static error models?"

Based on the diverse noise types, multiple instances per type, and the inherent randomness in error rate sampling and gate type assignments, we respectfully disagree that our error models are static. Yet, to further ease this concern, we include two additional types of datasets:

  • Varying Incoherent: To assess mitigators under largely varying noise, an individual incoherent noise model is constructed for each circuit using random gate selection and error rate sampling in this setting. The results are detailed in Tab. 4.
  • Brisbane Pre & Raw: To evaluate mitigators using data from more actual devices, we include two datasets of 63-qubit Trotter circuits executed on the IBM Brisbane device. The results are detailed in Tab. 5.

Overall, QEMFormer exhibits a strong performance compared to the baselines.

Usage of Sycamore Error Rates

As QEM-Bench comprises multiple datasets derived from real IBM quantum computers or simulated providers emulating specific IBM devices, incorporating rates from Google devices could further enrich noise diversity. This approach aims to capture a broader range of quantum device profiles. Importantly, the Sycamore statistics are used only as references to set the mean of the error rate in the incoherent setting, with no calibration performed. We do not fix any error rates.

Benchmarks in QEM

With error model setups aligned with recent studies and data from real devices on IBM platforms, we respectfully argue that our benchmark does not rely on outdated data. We would also note that a single benchmark cannot capture every daily update in the field; our objective is to address the need for standardized benchmarks under the current circumstances.

We genuinely appreciate your time and feedback and hope this response addresses your concerns.

References

  • [1] Machine learning for practical quantum error mitigation. Nature Machine Intelligence, 2024.
  • [2] Beyond circuit connections: a non-message passing graph transformer approach for quantum error mitigation. ICLR 2025.
Final Decision

The rebuttal addressed the concerns from the reviewers and provided comprehensive feedback. It is useful for assessing the paper's contributions. All reviewers recommend acceptance after discussion, and the ACs concur. The final version should include all reviewer comments, suggestions, and additional discussion from the rebuttal.